Sairam manne

Statsmodels Library: An Overview

Table of Contents

  1. Introduction
  2. History
  3. When to use
  4. Installation
  5. Features
  6. Ordinary least squares
  7. Statistical tests
  8. Learning Resources
  9. Conclusion

Introduction

The Python ecosystem is equipped with many tools and libraries that primarily focus on prediction and machine learning.
For example,

  • scikit-learn focuses on predictive modeling and machine learning and does not provide statistical summaries (like p-values, confidence intervals, or adjusted R²).

  • scipy.stats focuses on individual statistical tests and distributions but has no modeling framework (like OLS or GLM).

  • Other libraries, such as linearmodels, PyMC/Bambi, and Pingouin, cover narrower niches and have their own limitations.

Statsmodels was developed to fill the gap left by these existing tools.

History

The development of statsmodels began as an effort to bring robust statistical modeling capabilities to Python, a space that, at the time, was largely dominated by R and MATLAB. While Python had gained popularity in data science due to its general-purpose programming strengths and libraries like NumPy, SciPy, and pandas, it lacked a comprehensive library for classical statistical inference and econometric modeling.

When to use?

  • Use statsmodels when you need detailed statistical output and tests.

  • Use it when the focus is on inference and interpretability rather than prediction.

  • Use it when working with time series or econometrics.

Installation

You can install statsmodels easily with Python's package manager, pip. In your terminal, type the following:

pip install statsmodels


Installing with conda:

conda install -c conda-forge statsmodels

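Either way, you can verify the installation by printing the installed version:

python -c "import statsmodels; print(statsmodels.__version__)"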

To import it in Google Colab (or any Python environment):

# loads the main statsmodels API, which exposes datasets, statistical tests, time series models, and statistical models (like OLS and GLM)
import statsmodels.api as sm


Features

Statistical models:

  • Linear Regression (OLS)
  • Generalized Linear Models (logistic regression, Poisson regression; a short sketch follows this list)
  • Time series analysis
  • Mixed Linear Models
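
As a quick illustration of the GLM interface, here is a minimal logistic regression sketch on synthetic data (the data, seed, and coefficients below are made up purely for demonstration):

import numpy as np
import statsmodels.api as sm

# synthetic binary outcome, for illustration only
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(200, 2)))  # intercept + two predictors
true_beta = np.array([0.5, 1.0, -1.0])          # made-up coefficients
y = (X @ true_beta + rng.normal(size=200) > 0).astype(int)

logit_model = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(logit_model.summary())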

Statistical tests:

  • T-tests

  • ANOVA

  • Chi-square tests

  • Cointegration tests

Data Exploration:

  • Summary statistics
  • Correlation analysis
  • Multicollinearity analysis (a VIF sketch follows this list)
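
For example, multicollinearity is commonly checked with variance inflation factors (VIFs). Here is a minimal sketch on synthetic data (the two variables below are fabricated to be collinear):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)  # deliberately collinear with x1
X = pd.DataFrame({'x1': x1, 'x2': x2})

# VIF for each column; values far above ~5-10 signal multicollinearity
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)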

Model diagnostics and inference:

  • Confidence intervals
  • p-values
  • R-squared

Ordinary least squares

We will perform ordinary least squares (OLS) regression using this library.
OLS is a statistical method used in linear regression to estimate the relationship between a dependent variable and one or more independent variables.

Ordinary least squares minimizes the sum of the squared differences, called residuals, between the observed values and the values predicted by a linear model.

In other words, OLS finds the best-fitting line that explains the relationship between the input variables and the output variable, i.e., X and Y.
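
In symbols, for observations $(x_i, y_i)$, OLS picks the coefficient vector $\beta$ that minimizes the residual sum of squares:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2$$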

To perform OLS, we need a dataset; in our case, we will use the Iris dataset.

The code below imports all the necessary libraries:

  • pandas - used for data manipulation and data analysis

  • numpy - used for numerical computations.

The scikit-learn library ships with a few built-in datasets, from which we will load the Iris dataset.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import datasets

To perform operations on the dataset, we first load it:

iris = datasets.load_iris()

Here, iris holds the data as NumPy arrays, not as a DataFrame.

print(iris.keys())
# output
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

To convert the data into pandas DataFrames:

# independent variables
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# dependent variable
target = pd.DataFrame(iris.target, columns=['target'])

The array-like data has now been converted into pandas DataFrames. Preview five random rows of the features:

features.sample(5)

Output: (screenshot) a five-row preview of the feature DataFrame.

target.sample(5)

Output: (screenshot) a five-row preview of the target DataFrame.

Now, load the independent variables into X and the dependent variable into y to fit the model.

X = features
y = target

Then, using statsmodels, we fit the model, generate predictions, and print a detailed summary:

model = sm.OLS(y, X).fit()
prediction = model.predict(X)
print(model.summary())

Output: (screenshot) the detailed regression results table produced by model.summary().

This summary is highly useful for evaluating the model's performance and checking its assumptions.
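
One caveat worth knowing: unlike many libraries, sm.OLS does not add an intercept automatically, so the model above is fit without a constant term. To include one, add it explicitly:

X_const = sm.add_constant(X)    # prepends a 'const' column of 1.0
model_const = sm.OLS(y, X_const).fit()
print(model_const.summary())    # the summary now includes an intercept estimate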

Statistical tests

  1. Two-sample t-test (independent):

Import the necessary libraries:

import statsmodels.stats.weightstats as smw
import numpy as np


Create two NumPy arrays representing two independent groups:

group1 = np.array([2.9, 3.0, 2.5, 2.6, 3.2])
group2 = np.array([3.8, 2.7, 4.0, 2.4, 2.9])

t_stat: the calculated t-statistic. It measures how far apart the group means are relative to the variability in the data.

p_value: used to judge the significance of the hypothesis test; it is a number between 0 and 1.

  • A small p-value (≤ 0.05) indicates strong evidence against the null hypothesis.

  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis.

df: the degrees of freedom used in the test; it influences the shape of the t-distribution.

t_stat, p_value, df = smw.ttest_ind(group1, group2, usevar='pooled')

print("t-statistic:", t_stat)
print("p-value:", p_value)
print("degrees of freedom:", df)

Output: (screenshot) the computed t-statistic, p-value, and degrees of freedom.
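
To turn the p-value into a decision, compare it against a chosen significance level (the 0.05 used here is a conventional choice, not something the test dictates):

alpha = 0.05  # conventional significance level
if p_value <= alpha:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")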

  2. One-way ANOVA:

Import the required libraries and build a small dataset of scores under three teaching methods:

import statsmodels.formula.api as smf
import pandas as pd
from statsmodels.stats.anova import anova_lm

data = {
    'score': [88, 90, 92, 85, 87, 89, 75, 78, 72],
    'method': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
}
df = pd.DataFrame(data)

# Fit an OLS model with a categorical predictor
model = smf.ols('score ~ C(method)', data=df).fit()

# Perform the ANOVA
anova_results = anova_lm(model, typ=2)  # Type II sums of squares are standard here
print(anova_results)
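
If the ANOVA p-value is significant, a natural follow-up is to ask which specific methods differ. A minimal sketch using statsmodels' Tukey HSD post-hoc test (reusing the df defined above):

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# pairwise comparisons of all method groups at the 5% level
tukey = pairwise_tukeyhsd(endog=df['score'], groups=df['method'], alpha=0.05)
print(tukey.summary())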

  3. Chi-square test:

import pandas as pd
from scipy.stats import chi2_contingency

data = {
    'Method_A': [30, 10],  # [Pass, Fail]
    'Method_B': [25, 15],
    'Method_C': [20, 20]
}
df = pd.DataFrame(data, index=['Pass', 'Fail'])
chi2, p, dof, expected = chi2_contingency(df)

print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)

Output: (screenshot) the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
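
The snippet above reaches for SciPy, but statsmodels can run the same chi-square test of independence through its contingency-table tools. A minimal sketch (reusing df; attribute names follow the statsmodels contingency_tables API):

from statsmodels.stats.contingency_tables import Table

table = Table(df)
result = table.test_nominal_association()  # chi-square test of independence
print("Chi-square statistic:", result.statistic)
print("p-value:", result.pvalue)
print("Degrees of freedom:", result.df)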

Learning Resources

  1. Official Documentation: https://www.statsmodels.org/

  2. GeeksforGeeks

Conclusion

Hope this article helped you get started with statsmodels and understand its key features. If you found it useful, feel free to share it with others who might benefit too. Keep exploring, experimenting, and happy learning!
