From Linear Regression to Gradient Descent

Hello, I'm Ganesh. I'm building git-lrc, an AI code reviewer that runs on every commit. It is free, unlimited, and source-available on GitHub. Star git-lrc on GitHub to help more developers discover the project. Do give it a try and share your feedback for improving the product.

In the previous part, we learned that linear regression finds the best-fitting line by determining the optimal slope and intercept.

In this article, we will discuss how to calculate the optimal slope and intercept using Gradient Descent.

How to calculate the optimal slope and intercept using Gradient Descent

The quality of a regression line is measured using the Sum of Squared Residuals (SSR), which represents the total prediction error.

SSR = sum( (y_observed - y_predicted)^2 )

The best regression line is simply the line that produces the smallest SSR.

When studying linear regression, it's easy to think that the slope and intercept magically appear from a formula. In reality, they are the values that minimize the prediction error. This is where Gradient Descent comes in.

Instead of calculating the optimal slope and intercept directly using a closed-form equation, Gradient Descent starts with arbitrary values and gradually improves them. At each step, it computes how the SSR changes with respect to each parameter (the gradient) and adjusts the parameters in the direction that reduces the error.

Step-by-Step Gradient Descent Example

Let's illustrate how Gradient Descent works using the exact same dataset of 4 points from Part 10:

1. The Dataset

Point 1: (1, 2)
Point 2: (2, 3)
Point 3: (3, 5)
Point 4: (4, 4)

2. Simplifying the Problem

To make the math easy to trace, we will hold the Slope (m) constant at its optimal value of 0.8 and focus purely on finding the optimal Intercept (b).

Our prediction equation is:

y_predicted = 0.8 * x + b

We start with an initial guess for the intercept: b = 0.
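
As a quick sanity check on the 0.8 figure, here is a minimal sketch that computes the closed-form least-squares slope and intercept for the four points. The variable names (`xs`, `ys`, `s_xy`, `s_xx`) are illustrative choices, not taken from the earlier parts of the series.

```python
# Closed-form least-squares fit for the four points (illustrative check).
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

print(round(slope, 6), round(intercept, 6))
```

The slope comes out at 0.8, and the intercept this formula gives is the target that Gradient Descent should approach from b = 0.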

3. Calculating the Initial SSR (at b = 0)

Let's find the predicted values and calculate the residuals (observed - predicted):

For Point 1 (1, 2):
- y_predicted = 0.8 * 1 + 0 = 0.8
- Residual_1 = 2 - 0.8 = 1.2
For Point 2 (2, 3):
- y_predicted = 0.8 * 2 + 0 = 1.6
- Residual_2 = 3 - 1.6 = 1.4
For Point 3 (3, 5):
- y_predicted = 0.8 * 3 + 0 = 2.4
- Residual_3 = 5 - 2.4 = 2.6
For Point 4 (4, 4):
- y_predicted = 0.8 * 4 + 0 = 3.2
- Residual_4 = 4 - 3.2 = 0.8

Now, sum the squared residuals:

SSR = 1.2^2 + 1.4^2 + 2.6^2 + 0.8^2
    = 1.44 + 1.96 + 6.76 + 0.64
    = 10.8
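
The arithmetic above can be reproduced with a short sketch. The helper names `residuals` and `ssr` are hypothetical, chosen only for this illustration, and the slope is hard-coded to 0.8 as in the walkthrough.

```python
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
SLOPE = 0.8  # held fixed, as in the walkthrough


def residuals(b):
    """Observed minus predicted for each point, given intercept b."""
    return [y - (SLOPE * x + b) for x, y in zip(xs, ys)]


def ssr(b):
    """Sum of squared residuals for intercept b."""
    return sum(r ** 2 for r in residuals(b))


# Rounded only to hide floating-point noise in the printout.
print([round(r, 2) for r in residuals(0.0)])
print(round(ssr(0.0), 2))
```

Calling `ssr` with other intercepts is an easy way to see how the error changes as b moves.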

4. Derivation of the Gradient (d(SSR)/db)

To know which direction to move the intercept b and by how much, we take the derivative of SSR with respect to b:

SSR = sum( (y_observed - (0.8 * x_observed + b))^2 )

Applying the chain rule:

d(SSR)/db = sum( 2 * (y_observed - (0.8 * x_observed + b)) * (-1) )
          = -2 * sum( y_observed - y_predicted )
          = -2 * sum( Residuals )

The gradient is simply -2 times the sum of the residuals.
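
One way to gain confidence in this derivative is to compare it with a numerical estimate. The sketch below assumes the same four points and the fixed slope of 0.8. The function names and the step `h` are illustrative choices.

```python
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
SLOPE = 0.8  # held fixed, as in the walkthrough


def ssr(b):
    """Sum of squared residuals for intercept b."""
    return sum((y - (SLOPE * x + b)) ** 2 for x, y in zip(xs, ys))


def gradient(b):
    """Analytic derivative: -2 times the sum of the residuals."""
    return -2 * sum(y - (SLOPE * x + b) for x, y in zip(xs, ys))


def numerical_gradient(b, h=1e-6):
    """Central-difference estimate of d(SSR)/db."""
    return (ssr(b + h) - ssr(b - h)) / (2 * h)


for b in (0.0, 1.2, 1.5):
    print(b, round(gradient(b), 4), round(numerical_gradient(b), 4))
```

The two columns should agree up to floating-point noise, and both should be close to 0 at b = 1.5.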

5. Updating the Intercept

The update rule is:

b_new = b_old - (Learning Rate * Gradient)

Let's choose a Learning Rate (LR) of 0.1.

Step 1:
- Gradient: d(SSR)/db = -2 * (1.2 + 1.4 + 2.6 + 0.8) = -2 * 6.0 = -12.0
- Step Size: Gradient * LR = -12.0 * 0.1 = -1.2
- New Intercept: b_new = 0 - (-1.2) = 1.2
Step 2:
- With b = 1.2, the predictions are closer to the actual values.
- The new residuals are: 0.0, 0.2, 1.4, and -0.4.
- SSR: 0.0^2 + 0.2^2 + 1.4^2 + (-0.4)^2 = 2.16
- Gradient: -2 * (0.0 + 0.2 + 1.4 - 0.4) = -2.4
- Step Size: -2.4 * 0.1 = -0.24
- New Intercept: b_new = 1.2 - (-0.24) = 1.44
Step 3 (Convergence):
- With b = 1.44, the new residuals are: -0.24, -0.04, 1.16, and -0.64.
- Gradient: -2 * (-0.24 - 0.04 + 1.16 - 0.64) = -0.48
- Step Size: -0.48 * 0.1 = -0.048
- New Intercept: b_new = 1.44 - (-0.048) = 1.488
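- We repeat this loop. As b approaches the optimal intercept, the sum of the residuals approaches 0, which shrinks both the gradient and the step size.
- After several more iterations, the gradient is practically 0 and the intercept has converged to the optimal value of 1.5 (where SSR reaches its minimum of 1.8). With this learning rate, each step closes 80% of the remaining gap to 1.5, so b gets arbitrarily close without ever overshooting.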
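
The three steps above, and the iterations that follow, can be sketched as a short loop. This is a minimal illustration under the same assumptions (slope fixed at 0.8, learning rate 0.1, starting at b = 0). The function name `gradient_descent` and the choice of 10 steps are hypothetical.

```python
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
SLOPE = 0.8  # held fixed, as in the walkthrough


def ssr(b):
    """Sum of squared residuals for intercept b."""
    return sum((y - (SLOPE * x + b)) ** 2 for x, y in zip(xs, ys))


def gradient(b):
    """d(SSR)/db = -2 * sum of residuals."""
    return -2 * sum(y - (SLOPE * x + b) for x, y in zip(xs, ys))


def gradient_descent(b=0.0, learning_rate=0.1, steps=10):
    """Apply b_new = b_old - learning_rate * gradient and record each b."""
    history = [b]
    for _ in range(steps):
        b = b - learning_rate * gradient(b)
        history.append(b)
    return history


history = gradient_descent()
for step, b in enumerate(history):
    print(step, round(b, 6), round(ssr(b), 6))
```

The first few rows should match the hand calculation (1.2, 1.44, 1.488), after which b keeps creeping toward 1.5 and SSR toward 1.8. In practice a loop like this usually stops when the step size falls below a small threshold rather than after a fixed number of steps.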

Conclusion

We started with an arbitrary intercept of 0 and adjusted it step-by-step. Each step was guided by the gradient, which, scaled by the learning rate, told us in which direction and how far to move the intercept to reduce the prediction error (SSR). We repeated this process until the steps became negligible and the error settled at its minimum.

While this example focused on a simple linear regression with a single variable, the same principle applies to deep neural networks with millions of parameters. Gradient descent and its variants are the engine that drives learning in much of machine learning.