
Gradient Descent is an iterative approach to Empirical Risk Minimization. There are other ways of performing Empirical Risk Minimization as well, but they are outside the scope of this blog.

## What is Empirical Risk Minimization?

It's the process of building a supervised machine learning model by examining many examples and attempting to find a model that minimizes loss. Attempting to find a model simply means training the model, that is, finding good values for all the weights and biases in the model.

## What is loss?

Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. The loss function is used to compute the loss. It is also known as the COST function.

### Example:

Mean Squared Error (MSE) is an example of a loss function. It is used in many places in Machine Learning and even in Deep Learning, for instance in Linear Regression. It is simply the average squared error across all the examples.

For each example/instance of the dataset, the loss is the square of the difference between the label and the prediction:

$(observation - prediction(x))^2 = (y - y')^2$


For entire dataset,

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2$

Here, n is the total number of instances (examples) in the dataset.
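As a small illustration, the MSE formula above can be computed by hand. The labels and predictions below are made-up numbers purely for the example.

```python
# A minimal sketch of computing Mean Squared Error by hand.
def mean_squared_error(labels, predictions):
    n = len(labels)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)) / n

labels = [3.0, 5.0, 2.0]        # observations (y)
predictions = [2.5, 5.0, 3.0]   # model outputs (y')
print(mean_squared_error(labels, predictions))  # (0.25 + 0 + 1) / 3 ≈ 0.4167
```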

## Gradient Descent: an Optimization Technique

In statistics, Machine Learning, and Data Science, we optimize a lot of things. Gradient Descent can help us optimize many of them. Some examples are:

1. When we fit a line with Linear Regression, we optimize the intercept and slope.
2. When we use Logistic Regression, we optimize the squiggle (the S-shaped curve).
3. When we use t-SNE, we optimize the clusters that form.

Gradient Descent identifies the optimal values for the parameters by taking big steps when far away from the minimum and baby steps when close to it.

### Computing Gradient for a function

If there is a function

$f(x,y) = x^{2}\sin(y)$

We want to compute the gradient of the above function, so we calculate the partial derivative of the function with respect to each independent variable; in our case there are two independent variables, x and y. After calculating these partial derivatives, we put them in a vector. This vector is the gradient. The gradient points in the direction of steepest ascent (rise).
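For this particular function, the partial derivatives work out to $\partial f/\partial x = 2x\sin(y)$ and $\partial f/\partial y = x^{2}\cos(y)$, so the gradient vector can be sketched like this:

```python
import math

# Gradient of f(x, y) = x^2 * sin(y), using the analytic
# partial derivatives df/dx = 2x*sin(y) and df/dy = x^2*cos(y).
def gradient(x, y):
    return (2 * x * math.sin(y), x ** 2 * math.cos(y))

gx, gy = gradient(1.0, 0.0)
print(gx, gy)  # 0.0 1.0 — at (1, 0) the steepest ascent is along the y-axis
```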

To get a complete, in-depth understanding of the gradient, I compiled a playlist of videos on the gradient from Khan Academy; have a look at it.

We looked into the loss function and saw how it tells us how well our model parameters (weights and biases) are doing at prediction.

But just knowing how badly our machine learning model is predicting isn't really helpful. We want a method for changing these parameters (weights and biases) in a way that reduces our loss/cost function so that our machine learning model does even better. This is where Gradient Descent comes in.

Just imagine a simple function that takes one number as input and produces one number as output. You can check out the entire video on this topic here.

We want to know what value of input to this function gives us the minimum output. One possible solution would be to find the value of W where the slope of the function C(W) is zero; that value of input will be the input value with the lowest output. Another way of saying the same thing is finding the value of W where the derivative of C(W) with respect to W is ZERO. This works for CONVEX problems, but not for problems that are not convex, like neural networks, where more than one MINIMUM exists. Also, doing it for thousands of weights makes it even worse.

A convex problem is a problem/function that has only one global minimum. For instance, the function below is convex, with only one minimum.

Now we know the problem with the above solution. A more feasible solution is to start at any random input value and calculate the value of the LOSS function there. Next, find the derivative at this point, which is simply the slope of the tangent at that point. If the slope is negative, the cost function decreases as the input increases, so we add to the current input value, shifting to the right to continue decreasing; if the slope is positive, the cost function increases as the input increases, so we subtract from the current input value, shifting to the left. If we keep doing this repeatedly, checking the new slope and taking the next appropriate step, eventually we approach some local minimum of the function.
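The one-input story above can be sketched in a few lines. Here the convex cost $C(w) = (w-3)^2$ is an illustrative choice (its minimum is at $w = 3$), as are the starting point and learning rate:

```python
# 1-D gradient descent on the convex cost C(w) = (w - 3)^2.
def dC(w):
    return 2 * (w - 3)          # derivative (slope) of C at w

w = 0.0                         # arbitrary starting value
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * dC(w)   # step against the slope

print(round(w, 4))  # 3.0 — we have descended to the minimum
```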

In the above story, the function is usually the COST function, and there is more than one parameter we need to optimize. This cannot be done by taking the slope at a single point, because we don't want to find the minimum with respect to a single parameter but the minimum with respect to all the parameters. The gradient helps us here.

The gradient is calculated just as we discussed above, except that here the function against which we calculate the gradient is the COST FUNCTION. The gradient is simply the collection of derivatives of the same function with respect to the different parameters. Just as we changed the parameter based on the slope for a single-variable input function above, for a multi-variable input function, that is, a function with more than one independent variable, we use the gradient.

Suppose our cost function graph looks something like this. The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn't matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The following figure shows that we've picked a starting point slightly greater than 0. The gradient descent algorithm then calculates the gradient (slope) of the loss curve at the starting point. The gradient of the loss is equal to the derivative (slope) of the curve and tells you which direction to go in parameter space to find the minimum. When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.

Note that a gradient is a vector, so it has both of the following characteristics:

• Magnitude
• Direction

The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point, as shown in the following figure. Gradient descent then repeats this process, edging ever closer to the minimum.

When performing gradient descent, we generalize the above process to tune all the model parameters simultaneously. For example, to find the optimal values of both w1 and the bias b, we calculate the gradients with respect to both w1 and b. Next, we modify the values of w1 and b based on their respective gradients. Then we repeat these steps until we reach minimum loss.

We use the gradient to descend to the lowest point of the loss function; that is why this algorithm is called Gradient Descent.
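Tuning w1 and b simultaneously can be sketched with a tiny linear model $y' = w_1 x + b$ and MSE loss. The toy data below is made up, generated from $y = 2x + 1$, so we expect w1 to approach 2 and b to approach 1:

```python
# Gradient descent on both the weight w1 and the bias b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # y = 2x + 1

w1, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(2000):
    n = len(xs)
    # partial derivatives of MSE with respect to w1 and b
    dw1 = sum(2 * (w1 * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w1 * x + b - y) for x, y in zip(xs, ys)) / n
    # update both parameters with their respective gradients
    w1 -= learning_rate * dw1
    b -= learning_rate * db

print(round(w1, 3), round(b, 3))  # 2.0 1.0
```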

## Learning Rate

We mentioned above

Gradient Descent identifies the optimal values for the parameters by taking big steps when far away and baby steps when it is close to the minimum.

But how does that happen? The size of the step should be related to the slope, since the slope tells us whether we need to take a baby step or a big step; but we need to make sure that a big step is not too big, bypassing the MINIMUM and even increasing the loss. So, Gradient Descent determines the Step Size by multiplying the slope by a small number called the Learning Rate.

For example, if the gradient magnitude (slope) is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.

### Choosing value for learning rate

Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long: Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well and you may bypass the MINIMA and increase the loss value: In practice, finding a "perfect" (or near-perfect) learning rate is not essential for successful model training. The goal is to find a learning rate large enough that gradient descent converges efficiently, but not so large that it never converges.
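The effect of the learning rate can be seen on a deliberately simple cost $C(w) = w^2$ (minimum at 0); the three rates below are illustrative values, not recommendations:

```python
# Contrast learning rates on C(w) = w^2, whose slope is 2w.
def run(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w      # w <- w - learning_rate * slope
    return w

print(run(0.001))  # too small: barely moves in 20 steps
print(run(0.1))    # reasonable: close to the minimum at 0
print(run(1.1))    # too large: overshoots and diverges
```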

### Goldilocks learning rate

It's the learning rate with which gradient descent reaches the minimum loss/point in the fewest number of steps.

## What is meant by our model is converged?

If our loss stops changing, or at least changes extremely slowly, even after continued iterations, we say that the model has converged.

## Steps of the Gradient Descent algorithm

1. Initialize the learningRate and steps (number of iterations), and decide which COST function you want to use.
2. We will initialize all the parameters to random values we want to tune.
3. We will do the prediction using the current value of all the parameters.
4. We then take the partial derivative of the loss function with respect to each parameter. In fancy machine learning terms, we compute the gradient of the loss function.
5. Calculate the Step Size
$Step\ Size = Slope \times Learning\ Rate$
Here, the slope is simply the value of the partial derivative calculated in step 4. The step size will be different for each parameter.
6. We then update our parameters using the Step Size i.e.
$New\ Value = Old\ Value - Step\ Size$
7. Now, go back to step 3 and repeat until the Step Size is very small or you reach the maximum number of steps.
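The seven steps above can be sketched as code for a single parameter w with MSE as the cost function. The data and hyperparameters are illustrative; the loop stops when the step size becomes tiny or the maximum number of steps is reached:

```python
import random

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # toy data generated from y = 2x

learning_rate = 0.01          # step 1: hyperparameters and cost (MSE)
max_steps = 1000
w = random.uniform(-1, 1)     # step 2: random initialization

for step in range(max_steps):
    n = len(xs)
    # steps 3-4: predictions w*x, then the derivative of MSE w.r.t. w
    slope = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    step_size = slope * learning_rate        # step 5
    w = w - step_size                        # step 6: update
    if abs(step_size) < 0.001:               # step 7: stop when tiny
        break

print(round(w, 2))  # close to 2.0
```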

## Why are we always subtracting the Step Size when updating the parameter?

We know that the gradient points in the direction of steepest ascent (rise), and the Step Size uses the slope from the gradient vector in its computation. By default, a step of Step Size (following the gradient) would move us towards the maximum, but we want to go towards the MINIMUM; that's why we subtract it.

What if we want to reach the maximum instead?
You guessed it right. We simply add it instead of subtracting it.

This is a good video explaining the Gradient Descent algorithm step by step.

## Stopping criteria for Gradient Descent

1. When the Step Size is very small, close to ZERO. It will be zero when the slope is zero. In practice, a very small step size means anything less than 0.001.
2. Gradient descent also includes a limit on the number of steps it will take before giving up. We can configure this number depending upon the problem and our experience. Usually, the maximum number of steps are 1000 or even greater. So, even if Step Size is large, if there have been more than the Maximum number of steps, Gradient Descent will stop.

We can also say that the stopping criterion for Gradient Descent is simply: when our model has converged.

## Does weight (parameter) initialization matter?

• For a convex problem, the initialization value doesn't really matter, as you will eventually reach the minimum if you choose a small enough learning rate.
• For neural networks, where there is more than one minimum and some are better than others, the initialization value does matter and we need to assign it carefully.

## Other variations of the Gradient Descent

### Why are there other variants of Gradient Descent?

The default Gradient Descent doesn't work well for large datasets; its efficiency really suffers. When computing the gradient of the loss function, the math suggests that we should compute the gradient over all examples in our dataset. This is the only way to guarantee that our gradient steps are in exactly the right direction. But for a large dataset with millions or billions of examples, that would be a lot of computation for each step, as the loss function will be really large and calculating its derivative will take a long time.

Empirically, rather than using the entire dataset, we can compute the gradient of the loss function over a single randomly selected example from the dataset, and that mostly works too. Even though it takes more steps, the amount of computation required to reach a solution is often much smaller. This is known as Stochastic Gradient Descent. It significantly reduces the time spent calculating the derivative of the loss function.
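Stochastic Gradient Descent can be sketched with the same toy setup as before (a model $y' = wx$ on data generated from $y = 2x$); the only change from ordinary gradient descent is that each update uses the gradient from ONE randomly chosen example instead of the whole dataset:

```python
import random

random.seed(0)                # fixed seed so the run is repeatable
xs = [float(i) for i in range(1, 11)]
ys = [2 * x for x in xs]      # toy data: y = 2x

w = 0.0
learning_rate = 0.001
for _ in range(5000):
    i = random.randrange(len(xs))            # pick ONE random example
    slope = 2 * (w * xs[i] - ys[i]) * xs[i]  # gradient on that example only
    w -= learning_rate * slope

print(round(w, 2))  # close to 2.0
```

Each step is cheap (one example instead of ten), at the cost of noisier, more numerous steps, exactly the trade-off described above.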