Linear regression is an important algorithm in machine learning. It models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation (for example, a line) to observed data. This blog will walk through the mathematics behind linear regression models.
What is Linear Regression?
Linear regression finds a linear function that best predicts the target variable $y$ based on the predictor variables $x$. The model equation is

$$ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n $$

where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_n$ are the coefficients (weights) for the features $x_1, \ldots, x_n$. In linear regression we aim to find the values of $\beta_0, \ldots, \beta_n$ that minimize the cost function.
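To make the equation concrete, here is a minimal sketch in Python of evaluating it for a single data point. The weights and feature values are made up purely for illustration, not taken from any real dataset.

```python
# A minimal sketch of evaluating y_hat = b0 + b1*x1 + ... + bn*xn.
# Weights and feature values are made up for illustration only.

beta_0 = 1.5                      # intercept
betas = [2.0, -0.5, 0.75]         # coefficients for features x1, x2, x3
features = [3.0, 4.0, 2.0]        # one data point with three features

# Prediction: intercept plus the weighted sum of the features
y_hat = beta_0 + sum(b * x for b, x in zip(betas, features))
print(y_hat)                      # 1.5 + 6.0 - 2.0 + 1.5 = 7.0
```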
But What is a Cost Function?
The cost function in linear regression is a mathematical measure of how well the model's predicted values match the actual target values in the training data. It quantifies the difference (error) between the predicted values $\hat{y}$ and the true values $y$, representing this error as a single number that the learning algorithm tries to minimize. Now, how does the learning algorithm minimize it? The answer is gradient descent.
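To make this concrete, here is a minimal sketch of such a cost function in Python, in the mean-squared-error style defined formally later in this post; the sample values are made up for illustration.

```python
# A minimal sketch of a mean squared error cost: the average of the squared
# differences between predicted and true values. Values are illustrative.

def mse_cost(y_true, y_pred):
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 6.0]
print(mse_cost(y_true, y_pred))   # (0.25 + 0.25 + 1.0) / 3 = 0.5
```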
Gradient Descent
Gradient descent is used to find the global minimum of the cost function; the lower the cost, the better the model fits the data set. But how does it find the global minimum? Remember functions, differentiation, partial derivatives, etc.? We use these mathematical concepts to reach the global minimum. Let's understand the mathematics behind it.
Let's consider a simple cost function, a parabolic function:

$$ f(x) = x^2 $$
From the graph of this parabola it is clearly visible that at $x = 0$ we have the global minimum. But it is not possible to trace the graph and read off the global minimum for every cost function, because cost functions can be as complex as the mean squared error, where $y$ depends not just on a single $x$ but on multiple independent variables $x_1, x_2, \ldots, x_n$. For example,

$$ J = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - \hat{y}_i\big)^2 $$

where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, $J$ is the cost function, $N$ is the number of data points, and $n$ is the number of features in each data point. You see, it is very difficult to plot this cost function and observe the minimum by eye; here mathematics comes to the rescue.
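For instance, with just $N = 2$ data points, true values $y = (3, 5)$ and predictions $\hat{y} = (2.5, 5.5)$, the cost works out to $J = \frac{(3 - 2.5)^2 + (5 - 5.5)^2}{2} = 0.25$.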
In gradient descent, the core idea is to move in the direction of steepest descent, that is, opposite to the gradient (the slope of the cost function at the current point). By walking down the hill step by step, we will reach the global minimum. Now, how do we find this direction mathematically? Let's take the Mean Squared Error (MSE) function, our cost function for linear regression,

$$ J(m, b) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - \hat{y}_i\big)^2 $$
For the regression line $\hat{y} = mx + b$, our goal is to find the optimal values of $m$ (slope) and $b$ (intercept) that minimize the cost function, so let's take partial derivatives of $J$. Substituting $\hat{y}_i = m x_i + b$ and taking the partial derivative with respect to $m$, we get

$$ \frac{\partial J}{\partial m} = -\frac{2}{N} \sum_{i=1}^{N} x_i \big(y_i - (m x_i + b)\big) $$
Again, taking the partial derivative of $J$ with respect to $b$, we get

$$ \frac{\partial J}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - (m x_i + b)\big) $$
Let $m$ and $b$ be the current values of the slope and intercept, and let $\alpha$ be the learning rate; then our new slope and intercept will be

$$ m_{\text{new}} = m - \alpha \frac{\partial J}{\partial m}, \qquad b_{\text{new}} = b - \alpha \frac{\partial J}{\partial b} $$
We keep iterating over this process until the cost function converges, and once it converges we have our optimal values for the slope ($m$) and intercept ($b$).
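Here is a minimal sketch of this update loop in plain Python, using the two partial derivatives above; the data, learning rate, and iteration count are illustrative assumptions rather than tuned values.

```python
# A minimal sketch of gradient descent for y_hat = m*x + b using the MSE
# partial derivatives derived above. Data, learning rate, and iteration
# count are made up for illustration.

x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]           # roughly y = 2x + 1 with some noise

m, b = 0.0, 0.0                    # initial slope and intercept
alpha = 0.01                       # learning rate
N = len(x)

for _ in range(5000):
    # dJ/dm = -(2/N) * sum(x_i * (y_i - (m*x_i + b)))
    dm = -(2 / N) * sum(xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    # dJ/db = -(2/N) * sum(y_i - (m*x_i + b))
    db = -(2 / N) * sum(yi - (m * xi + b) for xi, yi in zip(x, y))
    # Step opposite the gradient
    m -= alpha * dm
    b -= alpha * db

print(m, b)                        # approaches the least-squares fit, m ≈ 1.94, b ≈ 1.15
```

If the learning rate is too large the updates can overshoot and diverge, so in practice $\alpha$ is kept small and adjusted by checking whether the cost keeps decreasing.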
Now, this was just to fit a simple straight line with a single variable $x$. What if we have $n$ variables? In that case we make use of linear algebra together with multivariate calculus.
The General Case
In real life you will rarely get data whose outcome depends on a single parameter; instead, the outcome will depend on $n$ independent parameters. So how do we express this as a mathematical equation? Here come linear algebra and vectors. For our regression line $\hat{y} = mx + b$, we can express each term in matrix form, for example

$$ M = \begin{bmatrix} b & m \end{bmatrix}, \qquad X = \begin{bmatrix} 1 \\ x \end{bmatrix} $$

and with $Y$ being our outcome matrix, we can now express our line as a dot product of two matrices:

$$ Y = M \cdot X \qquad \text{(Eq. 1)} $$
Expanding our idea further, we can now express a multivariate regression line as

$$ \hat{y} = m_0 + m_1 x_1 + m_2 x_2 + \cdots + m_n x_n, \qquad M = \begin{bmatrix} m_0 & m_1 & \cdots & m_n \end{bmatrix}, \qquad X = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} $$

and to express our line we can take the dot product of $M$ and $X$, the same as Eq. 1 above. Now let's try to express all our calculations in terms of matrices:
$$ J(M) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - M \cdot X_i\big)^2, \qquad \nabla J = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - M \cdot X_i\big)\, X_i $$

where

$$ \nabla J = \begin{bmatrix} \dfrac{\partial J}{\partial m_0} & \dfrac{\partial J}{\partial m_1} & \cdots & \dfrac{\partial J}{\partial m_n} \end{bmatrix} $$

is the Jacobian (gradient) expression for $J$ and $X_i = \begin{bmatrix} 1 & x_{i1} & \cdots & x_{in} \end{bmatrix}$ is the feature vector of the $i$-th data point. Similarly, our new $M$ would be

$$ M_{\text{new}} = M - \alpha\, \nabla J $$
We keep iterating over this process until the cost function converges, and once it converges we have our optimal values for $M$.
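Here is a minimal vectorized sketch of the same idea with NumPy; the synthetic data, learning rate, and iteration count are assumptions for illustration, and $X$ carries a leading column of ones so that $m_0$ acts as the intercept.

```python
import numpy as np

# A minimal sketch of vectorized gradient descent for multivariate linear
# regression. Data, learning rate, and iteration count are made up.

rng = np.random.default_rng(0)
N, n = 100, 3                          # N data points, n features

X_raw = rng.normal(size=(N, n))
true_M = np.array([1.0, 2.0, -1.0, 0.5])   # intercept followed by 3 weights
X = np.hstack([np.ones((N, 1)), X_raw])    # leading column of ones for the intercept
Y = X @ true_M + rng.normal(scale=0.1, size=N)

M = np.zeros(n + 1)                    # initial weights
alpha = 0.05                           # learning rate

for _ in range(2000):
    residual = Y - X @ M               # y_i - M . X_i for every data point
    grad = -(2 / N) * X.T @ residual   # Jacobian of J with respect to M
    M -= alpha * grad                  # step opposite the gradient

print(M)                               # should be close to [1.0, 2.0, -1.0, 0.5]
```

Here the whole gradient is computed with a single matrix product, which is exactly what the vectorized update above describes.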
So this is it: we have covered the mathematics behind gradient descent and how to apply it to optimize the cost function of a linear regression model. We will discuss its implementation in Python in a follow-up post; till then, take care!