
Python Logistic Regression: A Practical Guide with scikit-learn and statsmodels (p-values, Odds Ratios, and ROC)

Whether you're predicting the probability of an event occurring or analyzing how specific factors influence an outcome, Logistic Regression remains one of the most fundamental and powerful tools in data science.

In business settings, it's frequently used for binary classification problems, such as "Will a customer buy this product?" or "Is this email spam?"

When implementing this in Python, your choice of library depends on your goal:

  • Use scikit-learn if you prioritize predictive accuracy and machine learning workflows.
  • Use statsmodels if you need detailed statistical summaries, such as p-values and confidence intervals.

In this guide, we’ll walk through implementation using both libraries, and cover essential interpretation techniques like odds ratios and ROC curves.


What is Logistic Regression?

Despite its name, Logistic Regression is primarily used for classification, not numerical regression. It predicts the probability that an observation belongs to one of two classes (0 or 1).

Difference from Linear Regression

While linear regression predicts a continuous numerical value, logistic regression uses the Sigmoid function to squash the output between 0 and 1.

If the output exceeds a threshold (typically 0.5), it is classified as "1" (Event occurred); otherwise, it is "0" (Event did not occur).
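
To make this concrete, here is a minimal sketch of the sigmoid function and the threshold rule. The function name and the sample scores are illustrative choices of ours, not part of the original workflow:

import numpy as np

def sigmoid(z):
    # Sigmoid: maps any real number z to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Linear scores from a hypothetical model
scores = np.array([-2.0, 0.3, 1.5])
probs = sigmoid(scores)              # approx. [0.119, 0.574, 0.818]
labels = (probs > 0.5).astype(int)   # threshold at 0.5 -> [0, 1, 1]
print(probs, labels)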


Preparation: Loading and Scaling Data

We’ll use the "Breast Cancer Wisconsin" dataset, a classic binary classification problem included in scikit-learn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardization: Crucial for Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Why Standardize?
Gradient-based solvers converge faster when features share a common scale, and scikit-learn's LogisticRegression applies L2 regularization by default, which penalizes features unevenly if their scales differ. Using StandardScaler to give every feature a mean of 0 and variance of 1 addresses both issues and makes the resulting coefficients directly comparable.
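
As a quick sanity check (a minimal sketch reusing the variables from the code above), you can verify that each transformed column now has roughly zero mean and unit variance:

# Every column of the scaled training set should be ~0 mean, ~1 std
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))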


Implementation 1: scikit-learn (Machine Learning Focus)

scikit-learn is the go-to library for building predictive models. It’s concise and follows a standardized workflow.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize model
# 'C' is the inverse regularization strength (smaller means stronger regularization)
model = LogisticRegression(C=1.0, random_state=42)

# Training
model.fit(X_train_scaled, y_train)

# Prediction
y_pred = model.predict(X_test_scaled)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

The output report provides not just Accuracy, but also Precision, Recall, and the F1-score, giving you a complete picture of the model's performance.
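
If you also want the raw counts behind those metrics, a confusion matrix is a natural companion. This is a minimal sketch reusing y_test and y_pred from above:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm)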


Implementation 2: statsmodels (Statistical Focus)

If you need to know why a model made a prediction—specifically, which variables are statistically significant—statsmodels is the better choice.

import statsmodels.api as sm

# statsmodels requires adding a constant term (intercept) manually.
# Wrap the scaled array in a DataFrame so feature names show up in the
# summary (a bare numpy array would be labeled x1, x2, ...).
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_train_const = sm.add_constant(X_train_scaled_df)

# Build and train the Logit model (with 30 correlated features, the
# optimizer may warn about convergence or quasi-separation)
logit_model = sm.Logit(y_train, X_train_const)
result = logit_model.fit()

# Display the summary report
print(result.summary())

In the resulting summary, look at the P>|z| column: a p-value below 0.05 conventionally indicates that the feature is a statistically significant predictor of the outcome.
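
Rather than scanning the table by eye, you can also pull the p-values out programmatically. A minimal sketch, assuming result is the fitted model from above:

# Keep only features below the conventional 0.05 threshold
significant = result.pvalues[result.pvalues < 0.05]
print(significant.sort_values())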


Interpreting Results: Odds Ratios

In business, explaining raw coefficients can be difficult; Odds Ratios are far more intuitive. The odds of an event are the ratio of the probability that it happens to the probability that it doesn't, p / (1 - p), and an odds ratio tells you how much those odds are multiplied when a feature increases by one unit.

Since Logistic Regression coefficients are in "log-odds," we convert them using the exponential function.

# Extract coefficients and calculate Odds Ratios
coefficients = model.coef_[0]
coef_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': coefficients,
    'Odds_Ratio': np.exp(coefficients) 
})

print(coef_df.sort_values(by='Odds_Ratio', ascending=False).head())

If an Odds Ratio is greater than 1, an increase in that feature raises the probability of the target being "1." For example, an odds ratio of 2.5 means the odds of the event are multiplied by 2.5 for every one-unit increase in the feature. Because we standardized the inputs, "one unit" here means one standard deviation of the original feature.
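
If you also ran the statsmodels version, you can attach 95% confidence intervals to the odds ratios by exponentiating the coefficient intervals. A minimal sketch, assuming result is the fitted Logit model from earlier:

# Exponentiate the coefficient confidence intervals to get OR intervals
or_ci = np.exp(result.conf_int())
or_ci.columns = ['OR 2.5%', 'OR 97.5%']
print(or_ci.head())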


Visualizing Accuracy: ROC Curve and AUC

To evaluate how well your model distinguishes between the two classes, we plot the ROC Curve and calculate the AUC (Area Under the Curve).

from sklearn.metrics import roc_curve, roc_auc_score

# Get predicted probabilities for class 1
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate FPR, TPR, and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

# Plotting
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve for Logistic Regression')
plt.legend()
plt.grid()
plt.show()

An AUC of 0.5 corresponds to random guessing, while 1.0 indicates a perfect classifier. The closer the curve hugs the top-left corner, the better the model separates the two classes.
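
The roc_curve output is also handy if the default 0.5 cutoff doesn't suit your use case: one common heuristic is to pick the threshold that maximizes Youden's J statistic (TPR minus FPR). A minimal sketch reusing fpr, tpr, thresholds, and y_pred_proba from above:

# Youden's J: find the threshold where TPR - FPR is largest
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]
print(f"Optimal threshold by Youden's J: {best_threshold:.3f}")

# Re-classify using the tuned threshold instead of 0.5
y_pred_tuned = (y_pred_proba >= best_threshold).astype(int)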


Conclusion

Building a logistic regression model is straightforward, but interpreting it correctly is where the real value lies. Use scikit-learn for quick, high-accuracy predictions, and turn to statsmodels when you need to justify your findings with statistical rigor.


Originally published at: https://code-izumi.com/python/logistic-regression/
