Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function.

The loss function describes how well the model performs given the current set of parameters (weights and biases), and gradient descent is used to find the *best* set of parameters. This is achieved by computing the partial derivatives of the loss with respect to each parameter (the gradient) at the current point and then iteratively moving through the search space in the direction of the negative gradient.

At each iteration the parameters of the model (weights and biases) are updated in proportion to the derivative of the error, and the loss decreases until it reaches a **minimum** of the loss function, ideally the optimal point. The two key aspects of gradient descent are a) the direction to move and b) the size of the step (learning rate, discussed below).
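The update described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the one-parameter loss `(w - 3)^2`, the starting point, and the names `loss`, `grad`, and `gradient_descent` are all assumptions chosen for the example.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        # Direction: negative gradient. Step size: learning_rate.
        w = w - learning_rate * grad(w)
    return w


w_final = gradient_descent(w0=0.0)
print(w_final, loss(w_final))
```

Each step moves `w` against the sign of the derivative, and the step shrinks as the derivative approaches zero near the minimum.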

Gradient descent is used when the model parameters cannot be calculated directly in closed form (e.g. with linear algebra) and must instead be searched for using an optimization algorithm.

There are several variants of gradient descent including **batch**, **stochastic**, and **mini-batch**.
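The three variants differ only in how many training examples contribute to each gradient estimate. The sketch below makes that concrete with a single `batch_size` argument; the toy dataset (noise-free `y = 2x`), the one-weight model, and the function names are assumptions made for illustration.

```python
import random


def gradient(w, batch):
    # Mean gradient of the squared error (w * x - y)^2 over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # One parameter update per batch of examples.
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
    return w


# Toy data: y = 2x for x in 0.1, 0.2, ..., 1.0 (assumed, noise-free).
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]

w_batch = train(data, batch_size=len(data))  # batch: all examples per update
w_sgd = train(data, batch_size=1)            # stochastic: one example per update
w_mini = train(data, batch_size=4)           # mini-batch: a few examples per update
print(w_batch, w_sgd, w_mini)
```

On this toy problem all three settings approach `w = 2`. On real, noisy data the smaller batches give cheaper but noisier updates, which is the usual trade-off between the variants.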

There are also several optimization algorithms that build on gradient descent, including momentum, Adagrad, Nesterov accelerated gradient, RMSprop, Adam, etc. Here is a blog post that covers the differences between these algorithms.

Gradient descent has a parameter called the **learning rate**, which represents the size of the steps taken as the network navigates the loss curve in search of the valley. If the learning rate is too high, the network may overshoot the minimum and can even diverge. If it is too low, training takes a long time and may not reach the minimum in the available number of iterations, or may get stuck in a local minimum.
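The effect of the learning rate is easy to see on a simple convex loss. The sketch below runs the same update on the assumed loss `w^2` (derivative `2w`) with three illustrative learning rates; the specific values and the `run` helper are hypothetical choices for the example. Because this loss has a single minimum, it shows the slow and overshooting cases but not the local-minimum case.

```python
def run(learning_rate, w0=1.0, steps=50):
    # Gradient descent on loss(w) = w^2, whose derivative is 2w.
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


results = {lr: run(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w in results.items():
    print(f"learning_rate={lr}: w after 50 steps = {w:.6g}")
```

With the smallest rate, `w` has barely moved from its starting value after 50 steps. With the middle rate it is very close to the minimum at 0. With the largest rate every step overshoots by more than it corrects, so `w` grows in magnitude instead of shrinking.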

Check out the in-depth explanation of Gradient Descent in this blog post.