Sairam manne

Posted on May 12 • Edited on May 15

Statsmodels Library: An Overview

#machinelearning #datascience #beginners #statsmodels

Introduction
History
When to use
Installation
Features
Ordinary least squares
Statistical tests
Learning Resources
Conclusion

Introduction

The Python ecosystem has many tools and libraries that focus primarily on prediction or machine learning.
For example:

scikit-learn focuses on predictive modeling and machine learning and does not provide statistical summaries (like p-values, confidence intervals, or adjusted R²).
scipy.stats focuses on individual statistical tests and distributions but has no modeling framework (like OLS or GLM).
Other libraries such as linearmodels, PyMC / Bambi, and Pingouin have their own limitations.

Statsmodels was developed to fill the gap left by these tools.

History

The development of statsmodels began as an effort to bring robust statistical modeling capabilities to the Python ecosystem, an area that, at the time, was largely dominated by R and MATLAB. While Python had gained popularity in data science thanks to its general-purpose programming strengths and libraries like NumPy, SciPy, and pandas, it lacked a comprehensive library for classical statistical inference and econometric modeling.

When to use?

Use statsmodels when you need detailed statistical output and tests.
Use it when the focus is on inference and interpretability rather than on prediction.
Use it when working with time series or econometrics.

Installation

You can install statsmodels with pip, Python's package manager. In your terminal, type the following:

pip install statsmodels

Installing with conda:

conda install -c conda-forge statsmodels

To import it (for example, in Google Colab):

# Load the main statsmodels API module, which exposes datasets, statistical tests, time series models, and statistical models (such as OLS and GLM)
import statsmodels.api as sm

Features

Statistical models:

Linear Regression (OLS)
Generalized Linear Models (logistic regression, poisson regression)
Time series analysis
Mixed Linear Models

Statistical tests:

T-tests
ANOVA
Chi-square tests
Cointegration tests

Data Exploration:

Summary statistics
Correlation analysis
Multicollinearity analysis

Model diagnostics and inference:

Confidence intervals
p-values
R-squared

Ordinary least squares

In this section, we fit an ordinary least squares (OLS) model using statsmodels.
OLS is a method used in linear regression to estimate the relationship between a dependent variable and one or more independent variables.

OLS minimizes the sum of the squared residuals, where a residual is the difference between an observed value and the value predicted by the linear model.

In other words, OLS finds the best-fitting line that explains the relationship between the input variables (X) and the output variable (y).

To perform OLS, we need a dataset. Here, we use the iris dataset.

The code below imports the necessary libraries:

pandas - used for data manipulation and analysis.
numpy - used for numerical computations.

scikit-learn ships with a few built-in datasets, and we import the iris dataset from there.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import datasets

To work with a dataset, we first load it:

iris = datasets.load_iris()

Here, iris is a dictionary-like object (a scikit-learn Bunch) that stores the data as NumPy arrays, not as a dataframe:

print(iris.keys())

# output
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

To convert the data into pandas dataframes:

# independent variables
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# dependent variable
target = pd.DataFrame(iris.target, columns=['target'])

Now the array-like data has been converted into pandas dataframes. We can inspect a few random rows of each:

features.sample(5)

target.sample(5)
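
As an alternative to building the dataframes by hand, `load_iris` accepts an `as_frame=True` argument in recent scikit-learn versions. The sketch below assumes such a version is installed; it produces the same `features` and `target` dataframes as above.

```python
from sklearn import datasets

# Ask scikit-learn to return pandas objects directly
iris = datasets.load_iris(as_frame=True)

features = iris.data               # DataFrame with the four measurements
target = iris.target.to_frame()    # Series converted to a one-column DataFrame

print(features.shape, target.shape)
```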

Now, assign the independent variables to X and the dependent variable to y to fit the model.

X = features
y = target

Next, we fit the model using statsmodels.

model = sm.OLS(y,X).fit()

prediction = model.predict(X)

print(model.summary())

The printed summary gives a detailed report of the fitted model, including coefficients, standard errors, p-values, confidence intervals, and R-squared. This is highly useful for evaluating the model's performance and assumptions.

Statistical tests

Two-sample t-test (independent):

Import the necessary libraries:

import statsmodels.stats.weightstats as smw
import numpy as np

Create two arrays that represent two independent groups using NumPy:

group1 = np.array([2.9, 3.0, 2.5, 2.6, 3.2])
group2 = np.array([3.8, 2.7, 4.0, 2.4, 2.9])

The ttest_ind function returns three values:

t_stat: The calculated t-statistic. It measures how far apart the group means are relative to the variability in the data.

p_value: The probability, between 0 and 1, of seeing a difference at least this large if the null hypothesis were true. It is used to judge the significance of the test.

A small p-value (<= 0.05) indicates strong evidence against the null hypothesis.
A large p-value (> 0.05) indicates weak evidence against the null hypothesis.

df: The degrees of freedom used in the test. It influences the shape of the t-distribution.

t_stat, p_value, df = smw.ttest_ind(group1, group2, usevar='pooled')

print("t-statistic:", t_stat)
print("p-value:", p_value)
print("degrees of freedom:", df)
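
The call above assumes that both groups share the same variance (`usevar='pooled'`). When that assumption is doubtful, `ttest_ind` also accepts `usevar='unequal'`, which gives Welch's t-test. The sketch below runs both on the same two groups; the variable names with `_pooled` and `_welch` suffixes are made up for this comparison.

```python
import numpy as np
import statsmodels.stats.weightstats as smw

group1 = np.array([2.9, 3.0, 2.5, 2.6, 3.2])
group2 = np.array([3.8, 2.7, 4.0, 2.4, 2.9])

# Pooled-variance (Student) t-test, as in the example above
t_pooled, p_pooled, df_pooled = smw.ttest_ind(group1, group2, usevar='pooled')

# Welch's t-test, which does not assume equal variances
t_welch, p_welch, df_welch = smw.ttest_ind(group1, group2, usevar='unequal')

print("pooled: t=%.3f, p=%.3f, df=%.1f" % (t_pooled, p_pooled, df_pooled))
print("Welch:  t=%.3f, p=%.3f, df=%.1f" % (t_welch, p_welch, df_welch))
```

Because both groups have the same size, the two t-statistics coincide; only the degrees of freedom, and therefore the p-value, differ.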

One-way ANOVA:

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from statsmodels.stats.anova import anova_lm

data = {
    'score': [88, 90, 92, 85, 87, 89, 75, 78, 72],
    'method': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
}

df = pd.DataFrame(data)

# Fit OLS model with categorical predictor
model = smf.ols('score ~ C(method)', data=df).fit()

# Perform ANOVA
anova_results = anova_lm(model, typ=2)  # Type II SS is standard
print(anova_results)
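
ANOVA only tells us whether at least one group mean differs. A common follow-up, sketched below as an optional extra step, is a Tukey HSD test to see which pairs of methods differ. It uses `pairwise_tukeyhsd` from `statsmodels.stats.multicomp` on the same data as above.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    'score': [88, 90, 92, 85, 87, 89, 75, 78, 72],
    'method': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
})

# Compare every pair of methods while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=df['score'], groups=df['method'], alpha=0.05)
print(tukey.summary())
```

The `reject` column of the printed table shows, for each pair, whether the difference in means is significant at the chosen `alpha`.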

Chi-square test (using SciPy's chi2_contingency):

import pandas as pd
from scipy.stats import chi2_contingency

data = {
    'Method_A': [30, 10],  # [Pass, Fail]
    'Method_B': [25, 15],
    'Method_C': [20, 20]
}
df = pd.DataFrame(data, index=['Pass', 'Fail'])

chi2, p, dof, expected = chi2_contingency(df)

print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)
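
The example above relies on SciPy. A similar test can be run within statsmodels through the `Table` class in `statsmodels.stats.contingency_tables`. The sketch below applies it to the same contingency table; it is an alternative route, not a replacement for the SciPy call.

```python
import pandas as pd
from statsmodels.stats.contingency_tables import Table

df = pd.DataFrame(
    {'Method_A': [30, 10], 'Method_B': [25, 15], 'Method_C': [20, 20]},
    index=['Pass', 'Fail'],
)

table = Table(df)

# Pearson chi-square test of independence between rows and columns
result = table.test_nominal_association()

print("Chi-square statistic:", result.statistic)
print("p-value:", result.pvalue)
print("Degrees of freedom:", result.df)
print("Expected frequencies:\n", table.fittedvalues)
```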

Learning Resources

Conclusion

Hope this article helped you get started with statsmodels and understand its key features. If you found it useful, feel free to share it with others who might benefit too. Keep exploring, experimenting, and happy learning!

DEV Community