
A Basic Guide to OLS

If you have ever run an OLS model from statsmodels to fit a linear regression line, you have probably looked at the summary and wondered: "What the heck does all this mean?"

The OLS summary can be intimidating, as it presents not just the R-squared score but many other test statistics associated with the linear regression model. This post is intended to demystify OLS and provide guidance on interpreting its summary.

Background
Let's start with a background of Linear Regression and OLS.

Linear regression is a statistical method for estimating the relationship between predictors (also referred to as features or independent variables, X) and the target (dependent/response variable, y) we are trying to predict. The relationship should be linear in order to use this model.

                       Linear equation
             ŷ = β0 + β1x1 + β2x2 + β3x3 + ... + βnxn

ŷ (y-hat) - estimated value of the target (response)
β0 - value of ŷ when all predictors equal 0 (intercept/constant)
x1, x2, x3, ..., xn - predictors
β1, β2, β3, ..., βn - slopes or coefficients of the corresponding predictors.
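
For example, with hypothetical values β0 = 50,000 (intercept), β1 = 120 (price per square foot), and a single predictor x1 = 1,500 sqft, the estimate would be:

             ŷ = 50,000 + 120 · 1,500 = 230,000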

OLS stands for Ordinary Least Squares. "Least Squares" refers to the criterion used to fit the model: the coefficients are chosen to minimize the sum of squared errors. Because squared errors give large deviations heavy weight, OLS is non-robust and sensitive to outliers.
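
Concretely, the fitted coefficients are the β values that minimize the sum of squared residuals (SSR):

             SSR = Σ (yi − ŷi)²

where yi is the observed target value and ŷi is the model's estimate for observation i.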

Importing libraries:

import statsmodels.api as sm               # main API: sm.OLS, sm.add_constant
from statsmodels.formula.api import ols    # R-style formula interface

Code to run OLS model:

y = df['price']                                     # target (dependent variable)
X = df.drop('price', axis=1)                        # predictors (independent variables)
base_model = sm.OLS(y, sm.add_constant(X)).fit()    # add_constant adds the intercept
base_model.summary()

(Please note that the capitalized sm.OLS model doesn't automatically add an intercept (constant), so you have to add it to X (the independent variables) with sm.add_constant. The lower-case ols formula interface adds the intercept for you, as in the sketch below.)
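
As a minimal sketch of that alternative, assuming the DataFrame has columns named price, sqft, and bedrooms (hypothetical column names), the lower-case ols formula interface fits the same kind of model and includes the intercept automatically:

formula_model = ols('price ~ sqft + bedrooms', data=df).fit()
formula_model.summary()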

Let's look at an example of an OLS summary:

[Screenshot: output of base_model.summary()]

OLS Summary Interpretation

The left side of the top table lists model and data specs (the sketch after this list shows how to read several of them from the fitted results):

  • Dep. Variable
    Your target variable.

  • Model
    An abbreviated version of Method.

  • Method
    Name of the technique/formula behind the model.

  • Date and Time
    Date/time the model was created.

  • No. Observations
    Number of data entries (rows) in the dataset used for the model.

  • Df Residuals (Degrees of Freedom)
    The number of observations minus the number of estimated parameters (including the intercept/constant). For example, 100 observations, 3 predictors, and an intercept give 100 − 4 = 96.

  • Df Model
    The number of parameters (not including intercept).

  • Covariance Type
    How the coefficient standard errors are computed. The default, 'nonrobust', does not adjust for heteroskedasticity, in keeping with OLS being a non-robust model that is sensitive to outliers.
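
Several of these specs are also available as attributes on the fitted results object, which is handy when you want them programmatically rather than reading them off the summary (a minimal sketch using the base_model fitted above):

base_model.nobs        # No. Observations
base_model.df_model    # Df Model
base_model.df_resid    # Df Residuals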

The right side of the top table reports model performance (goodness of fit); the sketch after this list shows how to access these scores in code:

  • R-squared
    Also known as the coefficient of determination, it measures how much of the variation in the target the model explains. For instance, if our model's R-squared = 0.75, the model explains 75% of the variation in the target (price).

  • adj. R-squared
    A modified version of R-squared adjusted for the number of independent features used in the model. Since R-squared can only stay the same or increase with every additional parameter, adj. R-squared penalizes the model for extra features. If the model has more than one independent feature, disregard R-squared and use adj. R-squared (adjusted).

  • F-statistic and Prob (F-statistic)
    The F-statistic indicates whether the model as a whole is statistically significant. The null hypothesis is that there is no linear relationship between the predictors (X, the independent variables or features) and the target (price); in other words, H0: all coefficients are zero, and H1: at least one coefficient is non-zero. Using the conventional significance level (alpha) of 0.05: since Prob (F-statistic) is 0.00 for our model, we reject H0 and accept H1.

  • Log-Likelihood
    Measures model fit. The higher the value, the better the model fits the data. The value can range from negative infinity to positive infinity.

  • AIC
    The Akaike Information Criterion (AIC) is a metric for comparing the fit of regression models.
    An AIC value is neither "good" nor "bad" on its own, because AIC is used to compare regression models: the model with the lowest AIC suggests the best fit. The AIC value in isolation (when not compared to other models' AIC) is not important.

  • BIC
    The Bayesian Information Criterion is a metric that is also used to compare the fit of regression models. As with AIC, the lowest score indicates the better-fitting model.
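
The goodness-of-fit scores are likewise exposed as attributes on the fitted results, which makes it easy to compare candidate models (a minimal sketch using base_model from above):

base_model.rsquared       # R-squared
base_model.rsquared_adj   # adj. R-squared
base_model.fvalue         # F-statistic
base_model.f_pvalue       # Prob (F-statistic)
base_model.llf            # Log-Likelihood
base_model.aic            # AIC
base_model.bic            # BIC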

The middle table presents the coefficients report; the sketch after this list shows how to access these values in code:
The left column lists the constant and the predictors used in the model.

  • coef
    Estimated value of a parameter.

  • std err
    Standard error of the coefficient estimate (how precisely the coefficient is measured).

  • t
    t-statistic used for testing the significance of a specific parameter.

  • P>|t|
    p-value of the t-statistic for a parameter. A p-value < 0.05 indicates that the parameter is significant and worth keeping in the model.

  • [0.025  0.975]
    The lower and upper bounds of the 95% confidence interval for a coefficient.
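
The coefficients report can also be pulled directly from the fitted results, for example to filter predictors by p-value (a minimal sketch using base_model from above):

base_model.params       # coef
base_model.bse          # std err
base_model.tvalues      # t
base_model.pvalues      # P>|t|
base_model.conf_int()   # lower and upper bounds of the 95% confidence intervals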

The bottom table covers residual diagnostics: normality, autocorrelation, and multicollinearity (the sketch after this list recomputes several of these checks directly):

  • Skew
    Indicates whether the residuals are symmetrically distributed. Skewness = 0 corresponds to perfect symmetry. Negative skew means the residual distribution is left-skewed; positive skew means it is right-skewed.

  • Kurtosis
    Measures the peakedness of the residual distribution. Kurtosis = 3 corresponds to a normal distribution; kurtosis > 3 indicates a more peaked, heavier-tailed distribution.

  • Omnibus (D'Agostino's test)
    A test of skewness and kurtosis, i.e. of the normality of the residuals.

  • Prob(Omnibus)
    p-value of the Omnibus test statistic. H0: the residuals are normally distributed. With alpha = 0.05, if Prob(Omnibus) < 0.05, reject H0: the residuals are not normally distributed.

  • Jarque-Bera
    Another test of skewness and kurtosis (normality of residuals). H0: the residuals are normally distributed. With alpha = 0.05, reject H0 if JB is roughly greater than 6; if JB is close to 0, fail to reject H0.

  • Prob(JB)
    p-value of the Jarque-Bera statistic. If Prob(JB) < 0.05 (alpha), reject H0: the residuals are not normally distributed.

  • Durbin-Watson
    A test for autocorrelation in the residuals (a value of 2 means no autocorrelation). Values between ~1.5 and ~2.5 are generally considered acceptable.

  • Cond. No.
    The condition number, an indicator of multicollinearity. If predictors are strongly related, you will get a note at the bottom of the OLS report like this: "The condition number is large, 7.53e+05. This might indicate that there are strong multicollinearity or other numerical problems."
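
A minimal sketch of recomputing some of these diagnostics directly from the residuals with statsmodels helper functions (the condition number is approximated here from the design matrix, which is the conventional way to measure it):

import numpy as np
from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

resid = base_model.resid
omni_stat, omni_p = omni_normtest(resid)              # Omnibus, Prob(Omnibus)
jb_stat, jb_p, skew, kurtosis = jarque_bera(resid)    # Jarque-Bera, Prob(JB), Skew, Kurtosis
dw = durbin_watson(resid)                             # Durbin-Watson
cond_no = np.linalg.cond(sm.add_constant(X))          # Cond. No. of the design matrix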

Conclusion
To sum this topic up, the main statistics to look for are:

  • R-squared (or adj. R-squared if multiple features are used in model)
  • Prob(F-statistic)
  • Predictor coefficients for interpretation.
  • P>|t|
  • Prob(Omnibus), Prob(JB), Skew, Kurtosis for normality of residuals assumption.
  • Durbin-Watson for the autocorrelation (independent residuals) assumption.
  • Cond. No. for multicollinearity.

Even though the summary provides an extensive report, it is still important to check the linear regression assumptions with additional plots and tests.

Resources:
https://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.OLS.html
https://www.statology.org/interpret-log-likelihood/
