Raphael Gutierrez

Posted on
Formulas Behind Linear Regression

Linear regression is an approach for predicting a quantitative response Y on the basis of a single predictor variable X (or in multiple linear regression, on the basis of multiple predictors). It's a simple yet powerful model to estimate continuous variables.

Here, I'll be discussing important formulas behind linear regression.
 

Linear model

Since linear regression is a linear model, it assumes that the dependence of Y on X1, X2, …, Xp is linear. Because of that, a simple linear regression can be written in the form:

Y = \beta_0 + \beta_1 X + \epsilon

Where:

- β0 = the intercept term (the expected value of Y when X = 0)
- β1 = the slope (the average increase in Y associated with a one-unit increase in X)
- ϵ = a catch-all for what we miss with this simple model. We assume that the error term is independent of X

Here, X is the independent variable and Y is the dependent variable (the value being estimated), but what we're most interested in are the coefficients: β0 (the intercept) and β1 (the slope).

Recall the slope-intercept form of a line from your high school math classes. A simple linear regression equation behaves much the same way: the larger the slope, the steeper the line and the greater the rate of change, while the intercept controls where the line crosses the y-axis.
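To make this concrete, here is a minimal sketch (assuming NumPy, with made-up coefficient values and data of my own) that simulates observations from the model above:

```python
import numpy as np

# Hypothetical "true" coefficients, chosen for illustration only
beta_0, beta_1 = 2.0, 0.5

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)        # predictor X
epsilon = rng.normal(0, 1, size=100)    # error term, assumed independent of X
y = beta_0 + beta_1 * x + epsilon       # Y = β0 + β1·X + ε
```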

To build a linear model, we must first find the values of these coefficients. This is a job for the least squares approach.
 

Least squares and residuals

The least squares approach chooses the estimates of β0 and β1 (written with a hat symbol to denote an estimate) that minimize the sum of squared residuals. A residual is the difference between the i-th observed response value and the i-th response value predicted by the linear model:

e_i = y_i - \hat{y}_i

This is a single residual, and it can come out negative, so we take the sum of the squares of all the residuals, hence the residual sum of squares (RSS):

RSS = e_1^2 + e_2^2 + \dots + e_n^2

or

RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
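As a quick illustration (the arrays and variable names below are made up for this example), the residuals and the RSS can be computed directly:

```python
import numpy as np

# Made-up observed values and predictions from some linear model
y = np.array([3.1, 4.0, 5.2, 6.1])
y_hat = np.array([3.0, 4.2, 5.0, 6.3])

residuals = y - y_hat            # e_i = y_i - ŷ_i
rss = np.sum(residuals ** 2)     # residual sum of squares
print(residuals, rss)
```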

With the residuals defined, we can now compute the estimates of β0 and β1 using the formulas:

\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

These are the values that minimize the RSS. Once the coefficients are calculated, we can compute predictions and assess the accuracy of the model.
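Here is a minimal sketch of those closed-form estimates, again assuming NumPy and a small invented dataset (the variable names are my own):

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

y_hat = beta_0_hat + beta_1_hat * x    # fitted values
print(beta_0_hat, beta_1_hat)
```

For comparison, NumPy's np.polyfit(x, y, 1) should return the same slope and intercept.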
 

Assessing accuracy

There are various metrics we can use to assess the accuracy of a linear model. Some of them are the residual standard error (RSE), the R² statistic, and the mean squared error (MSE).

The residual standard error or RSE is considered a measure of the lack of fit of the model to the data. If the predictions obtained using the model are very close to the true outcome values, then RSE will be small and we can conclude that the model fits the data very well.

RSE is also an estimate of the standard deviation of ϵ. Roughly speaking, it is the average amount that the response will deviate from the true regression line.

It is computed from the RSS:

RSE = \sqrt{\frac{1}{n-2} RSS}
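A quick sketch of the calculation (the fitted values below are hypothetical, just to show the arithmetic):

```python
import numpy as np

y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])             # observed values
y_hat = np.array([2.30, 3.00, 3.66, 4.36, 5.02])    # hypothetical fitted values

n = len(y)
rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (n - 2))     # RSE = sqrt(RSS / (n - 2))
print(rse)
```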

The coefficient of determination, or R² statistic, is the proportion of the variation in Y that is explained by the X variables. In simple linear regression, it is also the square of the correlation R between X and Y.

R² has an interpretational advantage over the RSE, since unlike the RSE it always lies between 0 and 1.

The formula for R² is:

R^2 = 1 - \frac{RSS}{\sum_{i=1}^n (y_i - \bar{y})^2}

You may recall from other references that the denominator is the total sum of squares (TSS). With that, we can rewrite the formula as:

R^2 = 1 - \frac{RSS}{TSS}
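And a sketch of the R² calculation using the same hypothetical values as before:

```python
import numpy as np

y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])             # observed values
y_hat = np.array([2.30, 3.00, 3.66, 4.36, 5.02])    # hypothetical fitted values

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - rss / tss
print(r_squared)
```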

Finally, the mean squared error or MSE tells us the average squared difference between the predicted values and the actual values. The lower the MSE, the better a model fits the data.

The formula to calculate the MSE is:

MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

Looking at the formula, we can see that the MSE can also be computed by dividing the RSS by the total number of data points n.
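A sketch of that equivalence (hypothetical values again), showing that MSE and RSS/n agree:

```python
import numpy as np

y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])             # observed values
y_hat = np.array([2.30, 3.00, 3.66, 4.36, 5.02])    # hypothetical fitted values

mse = np.mean((y - y_hat) ** 2)      # (1/n) * Σ (y_i - ŷ_i)²
rss = np.sum((y - y_hat) ** 2)
print(mse, rss / len(y))             # same value: MSE = RSS / n
```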
 

Reference:
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.).

Top comments (4)

Dev Mehta

Great post buddy. If you want to also understand Gradient Descent and how linear regression works behind the scenes with visual learning of the topic, I would recommend checking out this blog post. Also, learning about basic linear algebra and calculus would help new developers getting into this field :)

Dendi Handian

it would be cool if those formula are written in markdown github.blog/2022-05-19-math-suppor...

Raphael Gutierrez

Thanks! But it looks like dev.to supports KaTeX for mathematical expression dev.to/p/editor_guide#katex-embed

Dendi Handian

seems like you did it, great!