Table of Contents
- Introduction
- History
- When to use
- Installation
- Features
- Ordinary least squares
- Statistical tests
- Learning Resources
- Conclusion
Introduction
The Python ecosystem offers many tools and libraries, but most of them focus primarily on prediction and machine learning rather than statistical inference.
For example:
- scikit-learn focuses on predictive modeling and machine learning and does not provide statistical summaries (such as p-values, confidence intervals, or adjusted R²).
- scipy.stats focuses on individual statistical tests and distributions but has no modeling framework (such as OLS or GLM).
- Other libraries such as linearmodels, PyMC / Bambi, and Pingouin have their own limitations.
statsmodels was developed to fill the gap left by these existing tools.
History
The development of statsmodels began as an effort to bring robust statistical modeling capabilities to the Python ecosystem, in a field that, at the time, was largely dominated by R and MATLAB. While Python had gained popularity in data science thanks to its general-purpose programming strengths and libraries like NumPy, SciPy, and pandas, it lacked a comprehensive library for classical statistical inference and econometric modeling.
When to use?
Use statsmodels when:
- you need detailed statistical output and tests;
- the focus is on inference and interpretability rather than prediction;
- you are working with time series or econometrics.
Installation
You can install statsmodels with pip, Python's package manager. In your terminal, type the following:
pip install statsmodels
To install with conda:
conda install -c conda-forge statsmodels
To import it (for example, in Google Colab):
# Loads the main statsmodels API, which exposes datasets, statistical tests, time series models, and statistical models such as OLS and GLM
import statsmodels.api as sm
Features
Statistical models:
- Linear regression (OLS)
- Generalized linear models (logistic regression, Poisson regression)
- Time series analysis
- Mixed linear models
Statistical tests:
- T-tests
- ANOVA
- Chi-square tests
- Cointegration tests
Data exploration:
- Summary statistics
- Correlation analysis
- Multicollinearity analysis
Model diagnostics and inference:
- Confidence intervals
- p-values
- R-squared
Ordinary least squares
In this section, we will fit an ordinary least squares (OLS) model with statsmodels.
OLS is a statistical method used in linear regression to estimate the relationship between a dependent variable and one or more independent variables.
It minimizes the sum of the squared residuals, that is, the squared differences between the observed values and the values predicted by a linear model.
In other words, OLS finds the best-fitting line that explains the relationship between the input variables (X) and the output variable (y).
To perform OLS, we first need a dataset. Here we will use the iris dataset.
scikit-learn ships with a few built-in datasets, from which we will import iris.
The code below imports all the necessary libraries:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn import datasets
Before we can work with the dataset, we load it:
iris = datasets.load_iris()
Here, iris is a dictionary-like Bunch object whose data and target fields are NumPy arrays, not DataFrames:
print(iris.keys())
# output
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
To convert the data into pandas DataFrames:
# independent variables
features = pd.DataFrame(iris.data, columns=iris.feature_names)
# dependent variable
target = pd.DataFrame(iris.target, columns=['target'])
The array data is now stored in pandas DataFrames. We can preview a few random rows of each:
features.sample(5)
target.sample(5)
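As an alternative to building the DataFrames by hand, recent versions of scikit-learn accept an as_frame=True argument in load_iris. The 'frame' key in the output above suggests the installed version supports it, but treat that as an assumption. A minimal sketch:

```python
from sklearn import datasets

# Ask scikit-learn to return pandas objects directly
iris = datasets.load_iris(as_frame=True)

# iris.data is already a DataFrame with the four measurement columns
features = iris.data

# iris.target is a Series named "target"; convert it to a one-column DataFrame
target = iris.target.to_frame()

print(features.head())
print(target.head())
```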
Next, assign the independent variables to X and the dependent variable to y.
X = features
y = target
Then fit the model with statsmodels. Note that sm.OLS does not add an intercept automatically; wrap X in sm.add_constant(X) if you want one.
model = sm.OLS(y,X).fit()
prediction = model.predict(X)
print(model.summary())
The output above is a detailed summary of the fitted model, which is useful for evaluating the model's fit and checking its assumptions.
Statistical tests
- Two-sample t-test (independent):
Import the libraries required for the task:
import statsmodels.stats.weightstats as smw
import numpy as np
Create two NumPy arrays that represent two independent groups:
group1 = np.array([2.9, 3.0, 2.5, 2.6, 3.2])
group2 = np.array([3.8, 2.7, 4.0, 2.4, 2.9])
The test returns three values:
t_stat: The calculated t-statistic. It measures how far apart the group means are relative to the variability in the data.
p_value: A number between 0 and 1 used to judge the significance of the test. A small p-value (<= 0.05) indicates strong evidence against the null hypothesis; a large p-value (> 0.05) indicates weak evidence against it.
df: The degrees of freedom used in the test, which determine the shape of the t-distribution.
t_stat, p_value, df = smw.ttest_ind(group1, group2, usevar='pooled')
print("t-statistic:", t_stat)
print("p-value:", p_value)
print("degrees of freedom:", df)
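To see what the pooled test computes, the sketch below recalculates the t-statistic by hand from the same two groups and compares it with the statsmodels result. The names pooled_var, t_manual, and dof are introduced here for illustration only. The last lines show Welch's variant (usevar='unequal'), which does not assume equal variances.

```python
import numpy as np
import statsmodels.stats.weightstats as smw

group1 = np.array([2.9, 3.0, 2.5, 2.6, 3.2])
group2 = np.array([3.8, 2.7, 4.0, 2.4, 2.9])

n1, n2 = len(group1), len(group2)

# Pooled variance: weighted average of the two sample variances
pooled_var = (
    (n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)
) / (n1 + n2 - 2)

# t-statistic: difference in means divided by its standard error
t_manual = (group1.mean() - group2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat, p_value, dof = smw.ttest_ind(group1, group2, usevar="pooled")

print("manual t-statistic:     ", t_manual)
print("statsmodels t-statistic:", t_stat)
print("degrees of freedom:     ", dof)

# Welch's t-test: drops the equal-variance assumption
t_welch, p_welch, dof_welch = smw.ttest_ind(group1, group2, usevar="unequal")
print("Welch t-statistic:", t_welch, "p-value:", p_welch)
```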
- One-way ANOVA:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from statsmodels.stats.anova import anova_lm
data = {
'score': [88, 90, 92, 85, 87, 89, 75, 78, 72],
'method': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
}
df = pd.DataFrame(data)
# Fit OLS model with categorical predictor
model = smf.ols('score ~ C(method)', data=df).fit()
# Perform ANOVA
anova_results = anova_lm(model, typ=2) # Type II sums of squares
print(anova_results)
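The F-statistic and p-value can be read from the ANOVA table by row and column label. As a cross-check, the sketch below compares them with scipy.stats.f_oneway on the same data. Using SciPy here is an addition for illustration, and the names table, f_sm, and p_sm are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import f_oneway
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "score": [88, 90, 92, 85, 87, 89, 75, 78, 72],
    "method": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})

model = smf.ols("score ~ C(method)", data=df).fit()
table = anova_lm(model, typ=2)

# Pull the F-statistic and p-value for the categorical factor
f_sm = table.loc["C(method)", "F"]
p_sm = table.loc["C(method)", "PR(>F)"]

# Same test with SciPy: pass one array of scores per group
groups = [g["score"].to_numpy() for _, g in df.groupby("method")]
f_sp, p_sp = f_oneway(*groups)

print("statsmodels: F =", f_sm, "p =", p_sm)
print("SciPy:       F =", f_sp, "p =", p_sp)
```

With a single categorical factor the two approaches test the same hypothesis, so the values should agree up to floating-point error.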
- Chi-square test:
import pandas as pd
from scipy.stats import chi2_contingency
data = {
'Method_A': [30, 10], # [Pass, Fail]
'Method_B': [25, 15],
'Method_C': [20, 20]
}
df = pd.DataFrame(data, index=['Pass', 'Fail'])
chi2, p, dof, expected = chi2_contingency(df)
print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)
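The example above uses SciPy. The same test of independence can be sketched with statsmodels' own contingency-table class, Table, from statsmodels.stats.contingency_tables. The comparison with chi2_contingency is included only as a cross-check.

```python
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import Table

data = {
    "Method_A": [30, 10],  # [Pass, Fail]
    "Method_B": [25, 15],
    "Method_C": [20, 20],
}
df = pd.DataFrame(data, index=["Pass", "Fail"])

# Pearson chi-square test of independence between rows and columns
result = Table(df).test_nominal_association()

print("Chi-square statistic:", result.statistic)
print("p-value:", result.pvalue)
print("Degrees of freedom:", result.df)

# Cross-check against SciPy on the same table
chi2, p, dof, expected = chi2_contingency(df)
print("SciPy chi-square statistic:", chi2)
```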
Learning Resources
Conclusion
I hope this article helped you get started with statsmodels and understand its key features. If you found it useful, feel free to share it with others who might benefit too. Keep exploring, experimenting, and happy learning!