Raphael Gutierrez

Posted on
Formulas Behind Linear Regression

Linear regression is an approach for predicting a quantitative response Y on the basis of a single predictor variable X (or in multiple linear regression, on the basis of multiple predictors). It's a simple yet powerful model to estimate continuous variables.

Here, I'll be discussing important formulas behind linear regression.
 

Linear model

Since linear regression is a linear model, it assumes that the dependence of Y on X1, X2, …, Xp is linear. Because of that, a simple linear regression can be written in the form:

Y = \beta_0 + \beta_1 X + \epsilon

Where:

- β0 = the intercept term (the expected value of Y when X = 0)
- β1 = the slope (the average increase in Y associated with a one-unit increase in X)
- ϵ = a catch-all for what we miss with this simple model. We assume that the error term is independent of X

Here, X is the independent variable and Y is the dependent variable (the value being estimated), but what we're most interested in are the coefficients: β0 (the intercept) and β1 (the slope).

Recall the slope-intercept form of a line from your high school math classes. A simple linear regression equation behaves much the same way: the larger the slope, the steeper the line and the greater the rate of change, while the intercept controls where the line crosses the y-axis.
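To make this concrete, here is a minimal sketch (assuming NumPy, with made-up coefficient values and data of my own) that simulates observations from the model above:

```python
import numpy as np

# Hypothetical "true" coefficients, chosen for illustration only
beta_0, beta_1 = 2.0, 0.5

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)        # predictor X
epsilon = rng.normal(0, 1, size=100)    # error term, assumed independent of X
y = beta_0 + beta_1 * x + epsilon       # Y = β0 + β1·X + ε
```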

To build a linear model, we must first find the values of these coefficients. This is a job for the least squares approach.
 

Least squares and residuals

The least squares approach chooses the estimates of β0 and β1 (written with a hat symbol to denote an estimate) that minimize the sum of squared residuals. A residual is the difference between the i-th observed response value and the i-th response value predicted by the linear model:

e_i = y_i - \hat{y}_i

This is a single residual, and it can come out negative, so we take the sum of the squares of all the residuals, hence the residual sum of squares (RSS):

RSS = e_1^2 + e_2^2 + \dots + e_n^2

or

RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
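As a quick illustration (the arrays and variable names below are made up for this example), the residuals and the RSS can be computed directly:

```python
import numpy as np

# Made-up observed values and predictions from some linear model
y = np.array([3.1, 4.0, 5.2, 6.1])
y_hat = np.array([3.0, 4.2, 5.0, 6.3])

residuals = y - y_hat            # e_i = y_i - ŷ_i
rss = np.sum(residuals ** 2)     # residual sum of squares
print(residuals, rss)
```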

With the residuals defined, we can now compute the estimates of β0 and β1 using the formulas:

\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

These are the values that minimize the RSS. Once the coefficients are calculated, we can compute predictions and assess the accuracy of the model.
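Here is a minimal sketch of those closed-form estimates, again assuming NumPy and a small invented dataset (the variable names are my own):

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

y_hat = beta_0_hat + beta_1_hat * x    # fitted values
print(beta_0_hat, beta_1_hat)
```

For comparison, NumPy's np.polyfit(x, y, 1) should return the same slope and intercept.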
 

Assessing accuracy

There are various metrics we can use to assess the accuracy of a linear model. Some of them are the residual standard error (RSE), the R² statistic, and the mean squared error (MSE).

The residual standard error or RSE is considered a measure of the lack of fit of the model to the data. If the predictions obtained using the model are very close to the true outcome values, then RSE will be small and we can conclude that the model fits the data very well.

RSE is also an estimate of the standard deviation of ϵ. Roughly speaking, it is the average amount that the response will deviate from the true regression line.

It is computed from the RSS:

RSE = \sqrt{\frac{1}{n-2} RSS}
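A quick sketch of the calculation (the fitted values below are hypothetical, just to show the arithmetic):

```python
import numpy as np

y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])             # observed values
y_hat = np.array([2.30, 3.00, 3.66, 4.36, 5.02])    # hypothetical fitted values

n = len(y)
rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (n - 2))     # RSE = sqrt(RSS / (n - 2))
print(rse)
```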

The coefficient of determination, or R² statistic, is the proportion of the variation in Y that is explained by the X variables. In simple linear regression, it is also the square of the correlation R between X and Y.

R² has an interpretational advantage over the RSE, since unlike the RSE it always lies between 0 and 1.

The formula for R² is:

R^2 = 1 - \frac{RSS}{\sum_{i=1}^n (y_i - \bar{y})^2}

You may recall from other references that the denominator is the total sum of squares (TSS). With that, we can rewrite the formula as:

R^2 = 1 - \frac{RSS}{TSS}
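And a sketch of the R² calculation using the same hypothetical values as before:

```python
import numpy as np

y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])             # observed values
y_hat = np.array([2.30, 3.00, 3.66, 4.36, 5.02])    # hypothetical fitted values

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - rss / tss
print(r_squared)
```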

Finally, the mean squared error or MSE tells us the average squared difference between the predicted values and the actual values. The lower the MSE, the better a model fits the data.

The formula to calculate the MSE is:

MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

Looking at the formula, we can see that the MSE can also be computed by dividing the RSS by the total number of data points n.
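A sketch of that equivalence (hypothetical values again), showing that MSE and RSS/n agree:

```python
import numpy as np

y = np.array([2.3, 2.9, 3.6, 4.4, 5.1])             # observed values
y_hat = np.array([2.30, 3.00, 3.66, 4.36, 5.02])    # hypothetical fitted values

mse = np.mean((y - y_hat) ** 2)      # (1/n) * Σ (y_i - ŷ_i)²
rss = np.sum((y - y_hat) ** 2)
print(mse, rss / len(y))             # same value: MSE = RSS / n
```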
 

Reference:
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.).

Top comments (4)

Dev Mehta

Great post buddy. If you want to also understand Gradient Descent and how linear regression works behind the scenes with visual learning of the topic, I would recommend checking out this blog post. Also, learning about basic linear algebra and calculus would help new developers getting into this field :)

Dendi Handian

it would be cool if those formula are written in markdown github.blog/2022-05-19-math-suppor...

Raphael Gutierrez

Thanks! But it looks like dev.to supports KaTeX for mathematical expression dev.to/p/editor_guide#katex-embed

Dendi Handian

seems like you did it, great!