Neha Gupta

Day 8 of Machine Learning || Linear Regression Part 2

Hey reader 👋 Hope you are doing well 😊
In the last post we read about linear regression and some of its basics.
In this post we are going to discuss how we can minimize our cost function using the gradient descent algorithm.
So let's get started🔥

Gradient Descent

Gradient descent is an iterative algorithm used to find the values of the parameters Θ that minimize the cost function.
Cost Function:

J(Θ) = (1/2m) · Σᵢ₌₁ᵐ (h_Θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

where h_Θ(x) = Θ₀ + Θ₁x₁ + … + Θₙxₙ is the hypothesis, m is the number of training samples, and (x⁽ⁱ⁾, y⁽ⁱ⁾) is the i-th sample.
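To make this concrete, here is a minimal NumPy sketch of the cost function (the function name `cost` and the assumption that X already contains a leading column of 1s for Θ₀ are mine, not from the formula above):

```python
import numpy as np

def cost(theta, X, y):
    """Squared-error cost J(theta) = (1/2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)                 # number of training samples
    errors = X @ theta - y     # h_theta(x^(i)) - y^(i) for every sample
    return (errors @ errors) / (2 * m)
```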
According to this algorithm, we start with some initial value of Θ, say Θ = 0 (the zero vector, i.e. every parameter Θⱼ is 0), and then keep changing Θ to reduce the cost function:
Θⱼ := Θⱼ − α · ∂J(Θ)/∂Θⱼ
where j = 0, 1, 2, …, n.
α is the learning rate (in practice a small value like α = 0.01 is a common starting point); it means we take small steps, i.e. we make only a small change to Θ on each update.
If α is too large, the steps are too big and gradient descent can overshoot the minimum (it may even fail to converge); if α is too small, far more iterations are needed and the algorithm becomes slow.
To understand it better, imagine you are on a mountain and you want to get to the lowest point in the valley. Gradient descent is like taking steps downhill in the direction that decreases the altitude, where each step is based on the slope of the mountain at your current point.

Now let's find the value of Θ. For a single training sample (x, y) the cost is (1/2) · (h_Θ(x) − y)², and differentiating with respect to Θⱼ gives:

∂J(Θ)/∂Θⱼ = (h_Θ(x) − y) · xⱼ

so the update rule for one sample becomes:

Θⱼ := Θⱼ − α · (h_Θ(x) − y) · xⱼ
For m training samples:

Repeat until convergence:
  Θⱼ := Θⱼ − (α/m) · Σᵢ₌₁ᵐ (h_Θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾   (update all j = 0, 1, …, n simultaneously)
This is how we can compute the parameter values that minimize the cost function.
So here you can see that we start from Θ = 0, calculate the predicted output for all training samples, and then update Θ in order to minimize the cost function.
This algorithm is also known as Batch Gradient Descent.
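As a rough illustration (not the exact code from the Kaggle notebook linked below), one way to write batch gradient descent in NumPy is shown here; the function name and the assumption that X carries a leading column of 1s are mine:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression.
    X: (m, n+1) matrix with a leading column of 1s, y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)                # start from the zero vector, as above
    for _ in range(n_iters):
        errors = X @ theta - y         # predictions minus targets for ALL m samples
        gradient = (X.T @ errors) / m  # one update needs the sum over every sample
        theta -= alpha * gradient      # take a small step downhill
    return theta

# Toy usage: data generated from y = 2x + 1
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0]]
y = np.array([3.0, 5.0, 7.0, 9.0])
print(batch_gradient_descent(X, y, alpha=0.05, n_iters=5000))  # ≈ [1., 2.]
```

Notice that every iteration multiplies the full matrix X by theta, which is exactly the cost that makes this approach heavy on large datasets.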
The main disadvantage of this algorithm is that it becomes very slow on large datasets, because making even one update requires summing over all training examples.
An alternative to this algorithm is Stochastic Gradient Descent.

Stochastic Gradient Descent

In this algorithm, instead of using the whole dataset, we use only one training point at a time to update the model's parameters.
[Note -> Stochastic means random]
At each step, Stochastic Gradient Descent picks a single data point, computes the gradient from it, and updates the model:
Repeat:
  pick a random training sample (x⁽ⁱ⁾, y⁽ⁱ⁾)
  Θⱼ := Θⱼ − α · (h_Θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾   (for every j)
So the algorithm picks a random data point, computes the gradient for it, updates Θ, then picks another random point and does the same thing.
The main disadvantage of this algorithm is that it is noisier and less stable: because it uses only one data point at a time, the gradients fluctuate and the cost may bounce around instead of decreasing smoothly.
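Here is a matching sketch of stochastic gradient descent under the same assumptions as before (a leading column of 1s in X; the function name is mine):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100):
    """SGD for linear regression: update theta from one random sample at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(m):       # visit samples in random order
            error = X[i] @ theta - y[i]    # gradient from a single data point
            theta -= alpha * error * X[i]  # cheap but noisy update
    return theta
```

Each update touches only one row of X, which is why SGD stays cheap on large datasets even though its path to the minimum is noisier.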

You can see the implementation of Gradient Descent here: [https://www.kaggle.com/code/nehagupta09/linear-regression-implementation]

I hope you have understood this. If you have any doubts, please comment and I'll try to resolve your queries.
Don't forget to follow me for more.
Thank you 💙
