# Linear Regression Using Gradient Descent for Beginners - Intuition, Math and Code

###
Dev Khadka
*Updated on *
・11 min read

# Linear Regression Using Gradient Descent for Beginners- Intuition, Math and Code

In this blog we'll first try to understand Linear Regression and Gradient Descent intuitively then we'll see math behind the algorithm and then do a basic implementation of it in python.

Structure of this blog is as follow

- We'll setup a hypothetical problem for using linear regression.
- We'll gradually develop intuition of each step of the algorithm. We'll answer why we use that step and what it does.
- Then we'll look at math behind each step and represent it in and mathematical expression
- We'll implement the algorithm using the mathematical expressions we derived using python.

## Prerequisite

Even if you don't have any pre-knowledge you can get some intuition about how linear regression using gradient descent works but it is better if you know

- Some concept of derivative from calculus
- How matrix multiplication works
- Basic python programming knowledge

## Problem Setup

Let's assume a hypothetical experiment in which students of agriculture science collected some data from different farms/green houses through out many years. Structure of the collected data is as follow

What we want to do is predict harvests of different farms/green houses given Average Temperature and Average Nitrite In Soil. For this we have made some assumptions based on initial analysis of the data

Relation of Temperature and Average Nitrite In Soil is somewhat linear with yield

For simplicity we also assume that other factors like water in soil, light etc. either don't affect the yield or are kept fairly similar across all the farms

Lets represent average temperature (°C) as x_{1} average Nitrite in Soil(ppm) as x_{2} and harvest yield (kg/sq. m) as y

Then based on the above assumptions we can say that there exists a linear equation

Which will fairly describe relationship between our independent and dependent variables x_{1}, x_{2} and y. This relationship is also called hypothesis equation. We'll also refer to this as a model.

where w_{0} is a bias term and w_{1} and w_{2} are weights for x_{1} and x_{2} respectively

Now our aim is to find value of w_{0}, w_{1} and w_{2} which will best describes the relationship

## Cost Function

Now let's find the way to measure how good the model is. One straight forward way is to measure how much error the model makes in predictions. Lets say

**Where**

**Y** is target values of known data in our example "yield"

**P** is corresponding predicted values and

**err** is errors made in each predictions

Here err is array of values, we need a single numerical value to compare which model is better.

We need to combine the err values to give a single value, straight sum or average will not work because positive and negative error will cancel each other.

So we take average of square of individual error. We are taking average because we don't want the metric to depend on number of items on training data. This value is called Mean Square Error(MSE), this is our cost function which we need to minimize.

Replacing p_{i} in above equation with our hypothesis function we get

We added subscript i to represent i^{th} row of data. So, x_{(i,1)} means 1st feature of i^{th} row of training data

Obviously another valid cost function will be average of absolute value of individual error also known as Mean Absolute Error(MAE) but we will use MSE for now due to its simplicity.

## Finding Optimum Model (Bias and Weights)

Now we got our cost function as equation(2.2) above, lets look at it in detail

- MSE=MSE(w
_{0}, w_{1}, w_{2}) is equation with three variables w_{0}, w_{1}, w_{2}as they are the unknown values. Whereas the x’s and y’s are constants from training data - We want to find w
_{0}, w_{1}and w_{2}which will give minimum value of MSE

**How would we do it?**

Side Note There is a straight forward way to do it using calculus which need finding partial derivatives of the function and equating them to ? zero. As at lowest point of function with continuous derivative, tangent will be equal to zero (i.e at this point tangent is changing from positive to negative or vice versa). Then find the value of variables by solving the equations.

But this method is not always practical due to following reasons

- Process will be different for different type of equation
- If cost function is very complex then it is very hard to find symbolic partial derivative or solve the partial derivative equations

## Gradient Descent

There is a general algorithm to minimize any function (**function must be convex else it will only find local minima) known as gradient descent. We can summarize gradient descent algorithm as

- Choose an arbitrary point lets say W
_{a}, ie arbitrary values of w_{0}, w_{1}and w_{2} - Find small change in w
_{0}, w_{1}and w_{2}which will result in fastest decrease/descent of the MSE. - For finding the small change we’ll find rate of change of MSE in direction of w
_{0}, w_{1}and w_{2}separately, these are called partial derivatives. The combined vector of rate of change in each direction is called gradient. Then we multiply the gradient by a small number.

- where η is a small step size and ∇MSE is gradient of MSE at point W
_{a}. ∇MSE represents the direction in which MSE changes fastest at point W_{a}and its magnitude gives rate of change of cost in that direction. - Now find new point using following equation

which will have lower MSE value. Then replace W_{a} with W_{b} and repeat from step 2 for a specified number of times or until change on MSE in each step become very small

### Thought Experiment

To understand the algorithm visually/intuitively, let’s do a thought experiment.

Lets say you are operating a nano robot in a temperature field (temperature varies in 3D space and is function of location) and your objective it to take the robot to lowest temperature as fast as possible else you risk damaging it.

You can ask the robot for temperature at its current position and move robot in any direction in 3D space, how will you guide the robot to lowest temperature?

**Some points to note**

- Here temperature is analogous to cost and location of robot is analogous to point w
_{0}, w_{1}, w_{2}your objective is to find the point where cost is lowest. - Lets say w
_{0},w_1 and w_{2}represents left-right, top-bottom and front-back axis - It is not possible to plot MSE against w
_{0}, w_{1}, w_{2}at once because it will be in 4D plane. So we need to visualize it by plotting MSE against w_{0}, w_{1}, w_{2}one at a time. - If we plot temperature against only w
_{0}keeping w_{1}and w_{2}at some fixed value, we’ll see curve like below. Which is a quadratic curve which has convex shape.

**You can follow the following steps**

- Lets say the robot starts at it current location W_a
- To find rate of change of temperature in left-right direction

- move robot to right little bit get temperature there and comeback to previous position.
- Rate of change of temperature is change in temperature divided by the distance moved.
- When the distance moved tends to zero it is called the partial derivative with respect to w
_{0}because we kept w_{1}and w_{2}constant

- Similarly you can find rate of change in temperature in top-bottom and front-back direction
- Combining the rate of change in all 3 directions as vector gives the gradient of temperature in the field. Which is a vector and has direction in which temperature is changing fastest and its magnitude gives the rate of change. Lets represent it by ∇T
- Now we need to move the robot in direction opposite of ∇T one step so that in every step we move to lower temperature. The step size should be small enough so that it doesn’t jump to other side of curve where temperature could be higher (see curve above). So new location of the robot will be

- Now repeat the steps from 2 until either you reach a satisfactory temperature point or you run out of battery (i.e. steps)

### Partial Derivative

If we take partial derivative of equation(1.1) w.r.t w_{j}, where j is coefficient index

from equation(1.1) we know

then if we take partial derivative of p_{i} w.r.t w_{j} all other terms except w_{j} is constant. so the result will be

Rearranging

## Matrix Representation

Matrix operations like matrix multiplication are optimized and are more efficient than doing simple array calculations. Also matrix representation make the equations more readable. So we’ll convert above equations to matrix operations

### Prediction using matrix multiplication

We can represent equation 1.1 which predicts value for one instance using matrix operation to predict for all instances at once as

**Here:**

**P** is prediction which is a column vector of length m

**X** is matrix with m rows and n columns, m is number of data points and n number of features plus one bias term which is always 1

**W** is column vector of length n

### Gradient using matrix operations

In equation (4.1) we found partial derivative of MSE w.r.t w_j which is j th coefficient of regression model, which is j th component of gradient vector. Based on equation(4.1) Whole gradient can be represented using matrix operation like

Here:

X^{T} is transpose of matrix X. It will have n rows and m columns

### Training step

## Coding Time

In this part we’ll implement linear regression using basic python and numpy library. numpy library is the fundamental package for scientific computing in python. Its make computation on large array of data easy and efficient.

Numpy Introduction

Please don’t get overwhelmed by numpy, we’ll only use some basic functions of it. It will be explained on comment what each function does.

Installing Numpy If you have installed python using conda or miniconda it should be already installed. If its not installed in you system you can use pip to install it

`pip install numpy`

Then you can import numpy library in you jupyter notebook or python file as follow

`import numpy as np`

Numpy Array Operations Numpy provides shorthand notations to performs arithmetic operation on array, some examples are as follow

```
arr1+arr2 # will returns new array which is element wise sum of two arrays
2*arr1 # will multiply each element of arr1 by 2
arr1*arr2 # will return new array which is element wise multiplication of two array
```

### Let's Generate Data

```
import numpy as np
# np.random.randn(100) will give array of 100 normally distributed random
# numbers with mean 0 and std-dev 1
# 2*np.random.randn(100)+12 changes the normal distrubution to have mean 12 and std-dev 2
nitrate = 2*np.random.randn(100)+12
temperature = 4*np.random.rand(100) + 26
# np.c_ concatinates (joins) two array column wise
x_farm = np.c_[nitrate,temperature]
# This is imaginary equation describing relation between yeild, nitrate and temperature.
# Obiously this is not know in real world problem. We are using it to generate dummy data.
yeild_ideal = .1*nitrate + .08*temperature +.6
# adding some noise on the ideal equation.
# The noise is normally distributed with 0 mean and std-dev 0.4
yeild = yeild_ideal + .4*np.random.randn(100)
print("few instances of generated data\n", x_farm[:5])
print("\nfew instances of generated targets\n", yeild[:5] )
```

**Output**

```
few instances of generated data
[[12.15862374 26.41754744]
[ 7.7100876 26.58160233]
[12.39040209 27.5565463 ]
[14.89781833 26.14161419]
[12.89437241 29.96356333]]
few instances of generated targets
[4.37946646 3.41541999 3.41400057 4.3290903 3.70089309]
```

### Lets Define Some Functions

```
# Our equations above need bais term in x which should always be 1
def add_bais_term(x):
# np.ones(n) will give new array of length n whose all elements are 1
# np.c_ concatinates two array column wise
return np.c_[np.ones(len(x)),x]
# Root mean square cost function
def rmse_cost_func(P,Y):
## model is array with bais and coffecients values
return np.sqrt(np.mean((P-Y)**2))
# Finds gradient of cost function using eq(5.2) above
def gradient_of_cost(x,y,model):
preds = predict(x,model)
error_term = preds-y
# np.matmul performs matrix multiplication
# x.T is transpose of matrix x
xt_dot_error_term = np.matmul(x.T,error_term)/len(x)
return xt_dot_error_term
# Do prediction using eq(5.1) above
def predict(x,model):
#np.matmul performs matrix multiplication
return np.matmul(x,model)
def find_linear_regression_model(x,y):
n_epochs = 10000
neta = 0.001
# Initialize all parameters(wj's) to zero
model = np.zeros(len(x[0]))
# do n_epochs iteration
for _ in range(n_epochs):
grad = gradient_of_cost(x,y,model)
# move parameters closer to optimum solution in every step
next_model = model - neta*grad
model = next_model
return model
```

### Train Model

```
x_farm_with_bais = add_bais_term(x_farm)
print("First 3 data with bias\n",x_farm_with_bais[:3])
model = find_linear_regression_model(x_farm_with_bais, yeild)
print("\nmodel(w0,w1,w2)\n", model)
```

**Output**

```
First 3 data with bias
[[ 1. 12.15862374 26.41754744]
[ 1. 7.7100876 26.58160233]
[ 1. 12.39040209 27.5565463 ]]
model(w0,w1,w2)
[0.03759733 0.09930154 0.10164468]
```

### Predict

```
# predict yeild for
yeild_predicted = predict(x_farm_with_bais, model)
yeild_predicted[:15]
```

**Output**

```
array([3.93017065, 3.50509946, 4.06895977, 4.17412975, 4.36366529,
3.79386452, 3.74566146, 3.81563257, 4.19976404, 3.90262785,
4.21451357, 4.05253678, 4.34120954, 4.06686309, 4.24635708])
```

### Evaluate

```
cost = rmse_cost_func(yeild_predicted,yeild)
print(cost)
```

**Output**

```
0.3679644024562277
```

**Hurray!!** We have now implemented linear regression and used it to make predictions. We also measured RMSE, which is equivalent to mean of absolute error made when doing a prediction. The mean error is close to std-dev of random error added which mean the model is working quite good.

## Disclamer

Above examples are only for illustration purpose to understand the working of linear regression from inside. Following are few things to be noted

- In real world we will not know the relationship of dependent and independent variables in advance as we do now. Or even we don't know if relationship is linear, we can infer something about the relation using scatter plot or other data analysis.
- We usually don't evaluate performance of a model using training data because a model can memorize the training data to get high performance in training data but may not work well with new unseen data. To avoid this - - we usually separate data into three sets, training, validation and testing. Validation set is used to validate model performance and used to improve model. Testing set is used only after final model is found to estimate performance of the model on unseen data.
- This is a very basic implementation of linear regression to understand the core concepts. We hardly ever need to do our own implementation. We'll always use linear regression implementation from popular libraries which will be much more robust, efficient and provides customization parameters

## Summary

Here is short summary of linear regression algorithm

Please find the jupyter notebook file for whole blog and code here