In this blog we'll first try to understand Linear Regression and Gradient Descent intuitively then we'll see math behind the algorithm and then do a basic implementation of it in python.
Structure of this blog is as follow
- We'll setup a hypothetical problem for using linear regression.
- We'll gradually develop intuition of each step of the algorithm. We'll answer why we use that step and what it does.
- Then we'll look at math behind each step and represent it in and mathematical expression
- We'll implement the algorithm using the mathematical expressions we derived using python.
Even if you don't have any pre-knowledge you can get some intuition about how linear regression using gradient descent works but it is better if you know
- Some concept of derivative from calculus
- How matrix multiplication works
- Basic python programming knowledge
Let's assume a hypothetical experiment in which students of agriculture science collected some data from different farms/green houses through out many years. Structure of the collected data is as follow
What we want to do is predict harvests of different farms/green houses given Average Temperature and Average Nitrite In Soil. For this we have made some assumptions based on initial analysis of the data
Relation of Temperature and Average Nitrite In Soil is somewhat linear with yield
For simplicity we also assume that other factors like water in soil, light etc. either don't affect the yield or are kept fairly similar across all the farms
Lets represent average temperature (°C) as x1 average Nitrite in Soil(ppm) as x2 and harvest yield (kg/sq. m) as y
Then based on the above assumptions we can say that there exists a linear equation
Which will fairly describe relationship between our independent and dependent variables x1, x2 and y. This relationship is also called hypothesis equation. We'll also refer to this as a model.
where w0 is a bias term and w1 and w2 are weights for x1 and x2 respectively
Now our aim is to find value of w0, w1 and w2 which will best describes the relationship
Now let's find the way to measure how good the model is. One straight forward way is to measure how much error the model makes in predictions. Lets say
Y is target values of known data in our example "yield"
P is corresponding predicted values and
err is errors made in each predictions
Here err is array of values, we need a single numerical value to compare which model is better.
We need to combine the err values to give a single value, straight sum or average will not work because positive and negative error will cancel each other.
So we take average of square of individual error. We are taking average because we don't want the metric to depend on number of items on training data. This value is called Mean Square Error(MSE), this is our cost function which we need to minimize.
Replacing pi in above equation with our hypothesis function we get
We added subscript i to represent ith row of data. So, x(i,1) means 1st feature of ith row of training data
Obviously another valid cost function will be average of absolute value of individual error also known as Mean Absolute Error(MAE) but we will use MSE for now due to its simplicity.
Now we got our cost function as equation(2.2) above, lets look at it in detail
- MSE=MSE(w0, w1, w2) is equation with three variables w0, w1, w2 as they are the unknown values. Whereas the x’s and y’s are constants from training data
- We want to find w0, w1 and w2 which will give minimum value of MSE
How would we do it?
Side Note There is a straight forward way to do it using calculus which need finding partial derivatives of the function and equating them to ? zero. As at lowest point of function with continuous derivative, tangent will be equal to zero (i.e at this point tangent is changing from positive to negative or vice versa). Then find the value of variables by solving the equations.
But this method is not always practical due to following reasons
- Process will be different for different type of equation
- If cost function is very complex then it is very hard to find symbolic partial derivative or solve the partial derivative equations
There is a general algorithm to minimize any function (**function must be convex else it will only find local minima) known as gradient descent. We can summarize gradient descent algorithm as
- Choose an arbitrary point lets say Wa, ie arbitrary values of w0, w1 and w2
- Find small change in w0, w1 and w2 which will result in fastest decrease/descent of the MSE.
- For finding the small change we’ll find rate of change of MSE in direction of w0, w1 and w2 separately, these are called partial derivatives. The combined vector of rate of change in each direction is called gradient. Then we multiply the gradient by a small number.
- where η is a small step size and ∇MSE is gradient of MSE at point Wa. ∇MSE represents the direction in which MSE changes fastest at point Wa and its magnitude gives rate of change of cost in that direction.
- Now find new point using following equation
which will have lower MSE value. Then replace Wa with Wb and repeat from step 2 for a specified number of times or until change on MSE in each step become very small
To understand the algorithm visually/intuitively, let’s do a thought experiment.
Lets say you are operating a nano robot in a temperature field (temperature varies in 3D space and is function of location) and your objective it to take the robot to lowest temperature as fast as possible else you risk damaging it.
You can ask the robot for temperature at its current position and move robot in any direction in 3D space, how will you guide the robot to lowest temperature?
Some points to note
- Here temperature is analogous to cost and location of robot is analogous to point w0, w1, w2 your objective is to find the point where cost is lowest.
- Lets say w0,w_1 and w2 represents left-right, top-bottom and front-back axis
- It is not possible to plot MSE against w0, w1, w2 at once because it will be in 4D plane. So we need to visualize it by plotting MSE against w0, w1, w2 one at a time.
- If we plot temperature against only w0 keeping w1 and w2 at some fixed value, we’ll see curve like below. Which is a quadratic curve which has convex shape.
You can follow the following steps
- Lets say the robot starts at it current location W_a
- To find rate of change of temperature in left-right direction
- move robot to right little bit get temperature there and comeback to previous position.
- Rate of change of temperature is change in temperature divided by the distance moved.
- When the distance moved tends to zero it is called the partial derivative with respect to w0 because we kept w1 and w2 constant
- Similarly you can find rate of change in temperature in top-bottom and front-back direction
- Combining the rate of change in all 3 directions as vector gives the gradient of temperature in the field. Which is a vector and has direction in which temperature is changing fastest and its magnitude gives the rate of change. Lets represent it by ∇T
- Now we need to move the robot in direction opposite of ∇T one step so that in every step we move to lower temperature. The step size should be small enough so that it doesn’t jump to other side of curve where temperature could be higher (see curve above). So new location of the robot will be
- Now repeat the steps from 2 until either you reach a satisfactory temperature point or you run out of battery (i.e. steps)
If we take partial derivative of equation(1.1) w.r.t wj, where j is coefficient index
from equation(1.1) we know
then if we take partial derivative of pi w.r.t wj all other terms except wj is constant. so the result will be
Matrix operations like matrix multiplication are optimized and are more efficient than doing simple array calculations. Also matrix representation make the equations more readable. So we’ll convert above equations to matrix operations
We can represent equation 1.1 which predicts value for one instance using matrix operation to predict for all instances at once as
P is prediction which is a column vector of length m
X is matrix with m rows and n columns, m is number of data points and n number of features plus one bias term which is always 1
W is column vector of length n
In equation (4.1) we found partial derivative of MSE w.r.t w_j which is j th coefficient of regression model, which is j th component of gradient vector. Based on equation(4.1) Whole gradient can be represented using matrix operation like
XT is transpose of matrix X. It will have n rows and m columns
In this part we’ll implement linear regression using basic python and numpy library. numpy library is the fundamental package for scientific computing in python. Its make computation on large array of data easy and efficient.
Please don’t get overwhelmed by numpy, we’ll only use some basic functions of it. It will be explained on comment what each function does.
Installing Numpy If you have installed python using conda or miniconda it should be already installed. If its not installed in you system you can use pip to install it
pip install numpy
Then you can import numpy library in you jupyter notebook or python file as follow
import numpy as np
Numpy Array Operations Numpy provides shorthand notations to performs arithmetic operation on array, some examples are as follow
arr1+arr2 # will returns new array which is element wise sum of two arrays 2*arr1 # will multiply each element of arr1 by 2 arr1*arr2 # will return new array which is element wise multiplication of two array
import numpy as np # np.random.randn(100) will give array of 100 normally distributed random # numbers with mean 0 and std-dev 1 # 2*np.random.randn(100)+12 changes the normal distrubution to have mean 12 and std-dev 2 nitrate = 2*np.random.randn(100)+12 temperature = 4*np.random.rand(100) + 26 # np.c_ concatinates (joins) two array column wise x_farm = np.c_[nitrate,temperature] # This is imaginary equation describing relation between yeild, nitrate and temperature. # Obiously this is not know in real world problem. We are using it to generate dummy data. yeild_ideal = .1*nitrate + .08*temperature +.6 # adding some noise on the ideal equation. # The noise is normally distributed with 0 mean and std-dev 0.4 yeild = yeild_ideal + .4*np.random.randn(100) print("few instances of generated data\n", x_farm[:5]) print("\nfew instances of generated targets\n", yeild[:5] )
few instances of generated data [[12.15862374 26.41754744] [ 7.7100876 26.58160233] [12.39040209 27.5565463 ] [14.89781833 26.14161419] [12.89437241 29.96356333]] few instances of generated targets [4.37946646 3.41541999 3.41400057 4.3290903 3.70089309]
# Our equations above need bais term in x which should always be 1 def add_bais_term(x): # np.ones(n) will give new array of length n whose all elements are 1 # np.c_ concatinates two array column wise return np.c_[np.ones(len(x)),x] # Root mean square cost function def rmse_cost_func(P,Y): ## model is array with bais and coffecients values return np.sqrt(np.mean((P-Y)**2)) # Finds gradient of cost function using eq(5.2) above def gradient_of_cost(x,y,model): preds = predict(x,model) error_term = preds-y # np.matmul performs matrix multiplication # x.T is transpose of matrix x xt_dot_error_term = np.matmul(x.T,error_term)/len(x) return xt_dot_error_term # Do prediction using eq(5.1) above def predict(x,model): #np.matmul performs matrix multiplication return np.matmul(x,model) def find_linear_regression_model(x,y): n_epochs = 10000 neta = 0.001 # Initialize all parameters(wj's) to zero model = np.zeros(len(x)) # do n_epochs iteration for _ in range(n_epochs): grad = gradient_of_cost(x,y,model) # move parameters closer to optimum solution in every step next_model = model - neta*grad model = next_model return model
x_farm_with_bais = add_bais_term(x_farm) print("First 3 data with bias\n",x_farm_with_bais[:3]) model = find_linear_regression_model(x_farm_with_bais, yeild) print("\nmodel(w0,w1,w2)\n", model)
First 3 data with bias [[ 1. 12.15862374 26.41754744] [ 1. 7.7100876 26.58160233] [ 1. 12.39040209 27.5565463 ]] model(w0,w1,w2) [0.03759733 0.09930154 0.10164468]
# predict yeild for yeild_predicted = predict(x_farm_with_bais, model) yeild_predicted[:15]
array([3.93017065, 3.50509946, 4.06895977, 4.17412975, 4.36366529, 3.79386452, 3.74566146, 3.81563257, 4.19976404, 3.90262785, 4.21451357, 4.05253678, 4.34120954, 4.06686309, 4.24635708])
cost = rmse_cost_func(yeild_predicted,yeild) print(cost)
Hurray!! We have now implemented linear regression and used it to make predictions. We also measured RMSE, which is equivalent to mean of absolute error made when doing a prediction. The mean error is close to std-dev of random error added which mean the model is working quite good.
Above examples are only for illustration purpose to understand the working of linear regression from inside. Following are few things to be noted
- In real world we will not know the relationship of dependent and independent variables in advance as we do now. Or even we don't know if relationship is linear, we can infer something about the relation using scatter plot or other data analysis.
- We usually don't evaluate performance of a model using training data because a model can memorize the training data to get high performance in training data but may not work well with new unseen data. To avoid this - - we usually separate data into three sets, training, validation and testing. Validation set is used to validate model performance and used to improve model. Testing set is used only after final model is found to estimate performance of the model on unseen data.
- This is a very basic implementation of linear regression to understand the core concepts. We hardly ever need to do our own implementation. We'll always use linear regression implementation from popular libraries which will be much more robust, efficient and provides customization parameters
Here is short summary of linear regression algorithm
Please find the jupyter notebook file for whole blog and code here
One of the most consolidated misconceptions about programming, since the early days, is the idea that such activity is purely technical, completely exact in nature, like Math and Physics. Computation is exact, but programming is not.