The One-Line Summary: Stratified sampling ensures that when you split your data, each split has the same proportion of each class as the original. Without it, your rare class might accidentally end up mostly in training or mostly in testing — making your evaluation meaningless.
The Pollster's Catastrophic Prediction
November 2024. Smithville is holding its mayoral election.
Pollster Pete needs to predict the winner. The city has 100,000 voters across five neighborhoods:
SMITHVILLE VOTER DISTRIBUTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Neighborhood | Voters | Leans | %
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Downtown | 10,000 | 80% Blue | 10%
Suburbs North | 35,000 | 60% Red | 35%
Suburbs South | 30,000 | 55% Red | 30%
University | 5,000 | 90% Blue | 5%
Industrial | 20,000 | 70% Red | 20%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TOTAL | 100,000 | ~58% Red | 100%
True outcome: Red wins 58-42.
Pete's "Random" Sample
Pete surveys 1,000 random voters. But "random" has a problem...
import numpy as np

# Pete's simple random sample (illustrative pseudocode)
np.random.seed(unlucky_seed)                     # whatever seed Pete was cursed with
sample = random_sample(smithville, n=1000)
# What Pete got:
Downtown: 312 people (31.2%) ← Should be 10%!
Suburbs North: 245 people (24.5%) ← Should be 35%
Suburbs South: 198 people (19.8%) ← Should be 30%
University: 187 people (18.7%) ← Should be 5%!
Industrial: 58 people ( 5.8%) ← Should be 20%
# Pete's prediction based on this sample:
# Blue: 67% Red: 33%
# "BLUE LANDSLIDE INCOMING!"
Pete's prediction: Blue wins 67-33.
Actual result: Red wins 58-42.
Pete was off by 25 POINTS.
What Went Wrong?
Random sampling doesn't guarantee proportional representation.
By pure chance, Pete's sample over-represented Blue-leaning areas (Downtown, University) and under-represented Red-leaning areas (Industrial, Suburbs).
THE SAMPLING DISASTER:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Neighborhood | Actual % | Pete's % | Error
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Downtown | 10% | 31% | +21% (Blue area!)
University | 5% | 19% | +14% (Blue area!)
Suburbs North | 35% | 25% | -10% (Red area)
Suburbs South | 30% | 20% | -10% (Red area)
Industrial | 20% | 6% | -14% (Red area!)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Pete's sample was UNREPRESENTATIVE.
His prediction was GARBAGE.
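How often does plain random sampling wander off like this? Below is a quick simulation sketch (using the fictional neighborhood shares from the table above, with made-up trial counts) that measures how far a random sample of 1,000 voters can drift from the true proportions purely by chance:

import numpy as np

# Sketch: with the fictional shares from the table, how much can a plain
# random sample of 1,000 voters misstate a neighborhood's share by chance?
rng = np.random.default_rng(0)
true_shares = np.array([0.10, 0.35, 0.30, 0.05, 0.20])   # Downtown ... Industrial

n_trials, n_sample = 10_000, 1_000
counts = rng.multinomial(n_sample, true_shares, size=n_trials)   # one sample composition per trial
worst_error = np.abs(counts / n_sample - true_shares).max(axis=1)

print(f"Trials where some neighborhood is off by 2+ points: {(worst_error >= 0.02).mean():.1%}")
print(f"Largest deviation seen across {n_trials:,} trials: {worst_error.max():.1%}")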
Stratified Sampling Saves The Day
Smart Susan uses stratified sampling:
"I'll ensure my sample has the SAME proportions as the population."
# Susan's stratified sample
sample = stratified_sample(smithville, n=1000, stratify_by='neighborhood')
# What Susan got:
Downtown: 100 people (10.0%) ← Exactly right!
Suburbs North: 350 people (35.0%) ← Exactly right!
Suburbs South: 300 people (30.0%) ← Exactly right!
University: 50 people ( 5.0%) ← Exactly right!
Industrial: 200 people (20.0%) ← Exactly right!
# Susan's prediction:
# Blue: 41% Red: 59%
# "Red wins, close race."
Susan's prediction: Red wins 59-41.
Actual result: Red wins 58-42.
Susan was off by only 1 POINT.
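The stratified_sample call above is pseudocode. For the curious, here is one way such a helper might look with pandas — a sketch assuming the voter list is a DataFrame with a 'neighborhood' column:

import pandas as pd

# A possible stratified_sample() helper (sketch; rounding can make the total
# differ from n by a voter or two)
def stratified_sample(population: pd.DataFrame, n: int, stratify_by: str,
                      random_state: int = 0) -> pd.DataFrame:
    shares = population[stratify_by].value_counts(normalize=True)
    pieces = []
    for stratum, share in shares.items():
        k = round(n * share)                     # seats allocated to this stratum
        rows = population[population[stratify_by] == stratum]
        pieces.append(rows.sample(n=k, random_state=random_state))
    return pd.concat(pieces).sample(frac=1, random_state=random_state)   # shuffle

# sample = stratified_sample(smithville, n=1000, stratify_by='neighborhood')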
Stratified Sampling in Machine Learning
The same problem happens when splitting data for ML:
# Your fraud detection dataset
total_samples = 10_000
fraud_cases = 200        # 2%
normal_cases = 9_800     # 98%
# RANDOM split (80/20)
# What you HOPE to get:
# Train: 160 fraud (2%), 7,840 normal (98%)
# Test: 40 fraud (2%), 1,960 normal (98%)
# What you MIGHT get (bad luck):
# Train: 185 fraud (2.3%), 7,815 normal
# Test: 15 fraud (0.75%), 1,985 normal ← Almost no fraud to test on!
With only 15 fraud cases in your test set, your evaluation is statistically meaningless. One or two lucky/unlucky predictions swing your metrics wildly.
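To see why, run the quick arithmetic behind "one or two predictions swing your metrics":

# With only 15 fraud cases in the test set, each one is worth ~6.7 points of recall
n_test_fraud = 15
print(f"Recall if 10 of 15 are caught: {10 / n_test_fraud:.1%}")   # 66.7%
print(f"Recall if 11 of 15 are caught: {11 / n_test_fraud:.1%}")   # 73.3%
print(f"One extra prediction moves recall by {1 / n_test_fraud:.1%}")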
The Math: Why Random Fails
When you have rare classes, random sampling has HIGH VARIANCE:
import numpy as np

# Simulation: 10,000 samples, 2% positive class
# Random 80/20 split, repeated 1000 times
n_samples = 10000
positive_rate = 0.02
test_size = 0.2
n_simulations = 1000

test_positive_rates = []
for _ in range(n_simulations):
    # Create dataset
    y = np.random.binomial(1, positive_rate, n_samples)
    # Random split
    test_indices = np.random.choice(n_samples, int(n_samples * test_size), replace=False)
    y_test = y[test_indices]
    # Record positive rate in test set
    test_positive_rates.append(y_test.mean())
print(f"Expected positive rate: {positive_rate:.2%}")
print(f"Actual test positive rates:")
print(f" Mean: {np.mean(test_positive_rates):.2%}")
print(f" Std: {np.std(test_positive_rates):.2%}")
print(f" Min: {np.min(test_positive_rates):.2%}")
print(f" Max: {np.max(test_positive_rates):.2%}")
print(f" Range: {np.min(test_positive_rates):.2%} to {np.max(test_positive_rates):.2%}")
Output:
Expected positive rate: 2.00%
Actual test positive rates:
Mean: 2.00%
Std: 0.31%
Min: 1.10%
Max: 3.05%
Range: 1.10% to 3.05%
Your test-set positive rate varies from 1.10% to 3.05%. That's nearly a 3x difference in how many positive cases you're evaluating on, purely due to random chance.
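For contrast, here is the same simulation rerun with scikit-learn's StratifiedShuffleSplit, reusing the variables from the code above; the test-set positive rate should barely move between draws:

from sklearn.model_selection import StratifiedShuffleSplit

# Same experiment, but with a stratified splitter (reuses n_samples, positive_rate,
# test_size, n_simulations, and np from the simulation above)
y = np.random.binomial(1, positive_rate, n_samples)
X_dummy = np.zeros((n_samples, 1))   # the splitter needs an X; the values are irrelevant here

sss = StratifiedShuffleSplit(n_splits=n_simulations, test_size=test_size, random_state=0)
stratified_rates = [y[test_idx].mean() for _, test_idx in sss.split(X_dummy, y)]

print(f"Stratified test rates: {np.min(stratified_rates):.2%} to {np.max(stratified_rates):.2%}")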
Stratified Sampling: The Solution
from sklearn.model_selection import train_test_split
# WITHOUT stratification (dangerous!)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Check the distribution
print(f"Original: {y.mean():.2%} positive")
print(f"Train set: {y_train.mean():.2%} positive")
print(f"Test set: {y_test.mean():.2%} positive")
Original: 2.00% positive
Train set: 2.13% positive ← Drifted!
Test set: 1.50% positive ← Drifted more!
# WITH stratification (safe!)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # ← Magic parameter!
)
# Check the distribution
print(f"Original: {y.mean():.2%} positive")
print(f"Train set: {y_train.mean():.2%} positive")
print(f"Test set: {y_test.mean():.2%} positive")
Original: 2.00% positive
Train set: 2.00% positive ← Exactly right!
Test set: 2.00% positive ← Exactly right!
One parameter. Problem solved.
Visual: Random vs Stratified
ORIGINAL DATA (2% positive):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○●○○○○○○○○
○○○○○○○○○○○○○○○○○○○○○●○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○○
(● = positive, ○ = negative)
RANDOM SPLIT (unlucky):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Train (80%): ○○○○○○○○●○○○○○○○○○○○○○○○○○○○○○○●○○○○○○○○○○○○
(2.5% positive - too many!)
Test (20%): ○○○○○○○○○○○○○○○○○○○○○
(0% positive - NONE! 😱)
STRATIFIED SPLIT (guaranteed):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Train (80%): ○○○○○○○○○○○○○○○○○○●○○○○○○○○○○○○○○○○○○○●○○○○
(2.0% positive - exact!)
Test (20%): ○○○○○○○○○○●○○○○○○○○○○
(2.0% positive - exact! ✓)
When Stratified Sampling Is Critical
1. Imbalanced Classification
# Fraud detection: 0.1% fraud
# Disease diagnosis: 2% positive
# Churn prediction: 5% churners
# Rare event prediction: <10% positive
# ALWAYS use stratify=y for these!
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y
)
2. Multi-Class with Rare Classes
# Image classification with 100 classes
# Some classes have 10,000 images, others have 50
# Random split might put ALL 50 rare images in training!
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y # Preserves ALL class proportions
)
# Verify
import pandas as pd
print("Class distribution:")
print(pd.DataFrame({
'Original': pd.Series(y).value_counts(normalize=True),
'Train': pd.Series(y_train).value_counts(normalize=True),
'Test': pd.Series(y_test).value_counts(normalize=True)
}))
3. Cross-Validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Regular KFold - DANGEROUS for imbalanced data
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Stratified KFold - SAFE for imbalanced data
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Use stratified for classification!
scores = cross_val_score(model, X, y, cv=skfold, scoring='f1')
print(f"F1 scores: {scores}")
print(f"Mean: {scores.mean():.3f} ± {scores.std():.3f}")
4. Time Series with Categories
# Sales prediction by product category
# Some categories are rare (luxury items)
# Goal: stratified sampling WITHIN time-based splits
# Note: StratifiedShuffleSplit on its own preserves category proportions,
# but it shuffles freely and ignores time order
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Each split has the correct category proportions
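When the time ordering itself matters (as it usually does in forecasting), one option is to run the stratified draw inside each time block. A rough sketch, with hypothetical 'month' and 'category' column names:

import pandas as pd
from sklearn.model_selection import train_test_split

# Rough sketch: stratified sampling WITHIN time-based blocks.
# Assumes a DataFrame with (hypothetical) 'month' and 'category' columns.
def split_within_blocks(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    train_parts, test_parts = [], []
    for _, block in df.groupby('month'):
        # Each category needs at least 2 rows per month, or train_test_split raises an error
        tr, te = train_test_split(
            block, test_size=test_size, random_state=seed,
            stratify=block['category']
        )
        train_parts.append(tr)
        test_parts.append(te)
    return pd.concat(train_parts), pd.concat(test_parts)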
The Consequences of NOT Stratifying
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification

# Create imbalanced dataset (5% positive)
X, y = make_classification(
    n_samples=2000, n_features=20,
    weights=[0.95, 0.05], random_state=42
)

# Run experiment: compare random vs stratified splits
n_experiments = 100
random_f1s = []
stratified_f1s = []

for seed in range(n_experiments):
    # Random split
    X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Stratified split
    X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Train same model
    model_r = LogisticRegression(max_iter=1000).fit(X_tr_r, y_tr_r)
    model_s = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr_s)
    # Evaluate
    random_f1s.append(f1_score(y_te_r, model_r.predict(X_te_r)))
    stratified_f1s.append(f1_score(y_te_s, model_s.predict(X_te_s)))
print("F1 Score Comparison (100 experiments):")
print(f"\nRandom splits:")
print(f" Mean: {np.mean(random_f1s):.3f}")
print(f" Std: {np.std(random_f1s):.3f}")
print(f" Range: {np.min(random_f1s):.3f} - {np.max(random_f1s):.3f}")
print(f"\nStratified splits:")
print(f" Mean: {np.mean(stratified_f1s):.3f}")
print(f" Std: {np.std(stratified_f1s):.3f}")
print(f" Range: {np.min(stratified_f1s):.3f} - {np.max(stratified_f1s):.3f}")
Output:
F1 Score Comparison (100 experiments):
Random splits:
Mean: 0.542
Std: 0.089
Range: 0.286 - 0.727
Stratified splits:
Mean: 0.548
Std: 0.047
Range: 0.444 - 0.654
Key findings:
| Metric | Random | Stratified |
|---|---|---|
| Mean F1 | 0.542 | 0.548 |
| Std Dev | 0.089 | 0.047 (47% lower!) |
| Range | 0.286-0.727 | 0.444-0.654 |
Stratified sampling cut the standard deviation nearly in HALF!
With random splits, your F1 could be anywhere from 0.29 to 0.73 — a huge range that makes model comparison nearly impossible.
Stratified Sampling for Regression?
For regression, you don't have classes. But you can still stratify!
Option 1: Stratify by Binned Target
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# House prices (continuous target)
y_prices = df['price'].values
# Create bins for stratification
y_binned = pd.qcut(y_prices, q=10, labels=False) # 10 equal-frequency bins
# Stratify by bins
X_train, X_test, y_train, y_test = train_test_split(
X, y_prices, test_size=0.2, stratify=y_binned
)
# Verify distribution
print("Price distribution preserved:")
print(f"Train mean: ${y_train.mean():,.0f}")
print(f"Test mean: ${y_test.mean():,.0f}")
print(f"Full mean: ${y_prices.mean():,.0f}")
Option 2: Stratify by Important Category
# If you have a categorical feature that matters
# (e.g., property_type for house prices)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=df['property_type']
)
# Now each property type is proportionally represented
Multi-Label Stratification
When each sample can have MULTIPLE labels:
# pip install iterative-stratification
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

# Multi-label: each sample can have multiple tags
# y shape: (n_samples, n_labels)
y_multilabel = np.array([
    [1, 0, 1],  # Sample 1: labels 0 and 2
    [0, 1, 0],  # Sample 2: label 1
    [1, 1, 1],  # Sample 3: all labels
    # ...
])

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in mskf.split(X, y_multilabel):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y_multilabel[train_idx], y_multilabel[test_idx]
    # Each label's proportion is (approximately) preserved
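To confirm it worked, a quick check of each label's positive rate on one fold (reusing X and y_multilabel from above):

# Compare each label's positive rate across one train/test split
train_idx, test_idx = next(mskf.split(X, y_multilabel))
print("Train per-label rates:", y_multilabel[train_idx].mean(axis=0).round(3))
print("Test per-label rates: ", y_multilabel[test_idx].mean(axis=0).round(3))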
Group-Aware Stratified Splitting
Sometimes you need to keep groups together AND stratify:
# Medical data: multiple samples per patient
# - Can't have same patient in train AND test (data leakage!)
# - But still want stratified disease distribution
from sklearn.model_selection import StratifiedGroupKFold
# patient_ids: which patient each sample belongs to
# y: disease labels
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in sgkf.split(X, y, groups=patient_ids):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Patients are kept together (no leakage)
    # AND the class distribution stays as close to stratified as the groups allow
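A cheap sanity check you can run inside that loop (or on any one fold's indices) to confirm the no-leakage guarantee:

# Confirm no patient appears on both sides of the split
train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])
assert train_patients.isdisjoint(test_patients), "Patient leaked across the split!"
print(f"{len(train_patients)} train patients, {len(test_patients)} test patients, 0 shared")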
Common Mistakes
Mistake 1: Forgetting to Stratify with Imbalanced Data
# ❌ WRONG
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# With 1% positive class, test set might have 0 positives!
# ✅ RIGHT
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y
)
Mistake 2: Stratifying on Wrong Variable
# ❌ WRONG: Stratifying on a feature instead of target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=X['age_group']
)
# ✅ RIGHT: Stratify on the TARGET variable
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y
)
# Or stratify on both if needed (assuming y and the feature are pandas Series):
stratify_col = y.astype(str) + '_' + X['age_group'].astype(str)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=stratify_col
)
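One caveat with the combined key: every (target, age_group) combination becomes its own stratum, and each needs at least two samples or train_test_split will raise an error. A quick pre-check:

import pandas as pd

# Make sure every combined stratum has at least 2 samples before splitting
stratum_counts = pd.Series(stratify_col).value_counts()
too_small = stratum_counts[stratum_counts < 2]
if not too_small.empty:
    print("Strata too small to stratify on:")
    print(too_small)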
Mistake 3: Not Stratifying in Cross-Validation
# ❌ WRONG: Regular KFold for classification
from sklearn.model_selection import KFold, cross_val_score
scores = cross_val_score(model, X, y, cv=KFold(5))
# ✅ RIGHT: StratifiedKFold for classification
from sklearn.model_selection import StratifiedKFold
scores = cross_val_score(model, X, y, cv=StratifiedKFold(5))
# Or simply:
scores = cross_val_score(model, X, y, cv=5) # Default uses StratifiedKFold for classifiers!
Mistake 4: Impossible Stratification
# ❌ ERROR: Class has fewer samples than n_splits
# If you have 3 samples of class A and want 5-fold CV...
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1] # Only 3 positives
skf = StratifiedKFold(n_splits=5) # ERROR! Can't put 3 samples into 5 folds
# ✅ FIX: Reduce n_splits or use StratifiedShuffleSplit
skf = StratifiedKFold(n_splits=3) # Works with 3 positives
# Or use repeated holdout:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2) # Works!
Complete Example: The Right Way
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from collections import Counter

# Build an imbalanced dataset: ~95% negative, ~5% positive
# (make_classification gives the features real signal for the model to learn,
#  so the metrics below are meaningful rather than noise)
n_samples = 5000
X, y = make_classification(
    n_samples=n_samples, n_features=20,
    weights=[0.95, 0.05], random_state=42
)
print("="*60)
print("ORIGINAL DATA")
print("="*60)
print(f"Class distribution: {Counter(y)}")
print(f"Positive rate: {y.mean():.2%}")
# STEP 1: Stratified train/test split
print("\n" + "="*60)
print("STEP 1: STRATIFIED TRAIN/TEST SPLIT")
print("="*60)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train - Positive rate: {y_train.mean():.2%} (n={len(y_train)})")
print(f"Test - Positive rate: {y_test.mean():.2%} (n={len(y_test)})")
print("✓ Proportions preserved!")
# STEP 2: Stratified cross-validation on training set
print("\n" + "="*60)
print("STEP 2: STRATIFIED CROSS-VALIDATION")
print("="*60)
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Verify each fold has correct proportions
print("\nFold class distributions:")
for i, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    y_fold_train = y_train[train_idx]
    y_fold_val = y_train[val_idx]
    print(f"  Fold {i+1}: Train={y_fold_train.mean():.2%}, Val={y_fold_val.mean():.2%}")
# Cross-validation scores
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
print(f"\nCV F1 Scores: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")
# STEP 3: Final evaluation on test set
print("\n" + "="*60)
print("STEP 3: FINAL TEST EVALUATION")
print("="*60)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
Output:
============================================================
ORIGINAL DATA
============================================================
Class distribution: Counter({0: 4741, 1: 259})
Positive rate: 5.18%
============================================================
STEP 1: STRATIFIED TRAIN/TEST SPLIT
============================================================
Train - Positive rate: 5.18% (n=4000)
Test - Positive rate: 5.20% (n=1000)
✓ Proportions preserved!
============================================================
STEP 2: STRATIFIED CROSS-VALIDATION
============================================================
Fold class distributions:
Fold 1: Train=5.19%, Val=5.12%
Fold 2: Train=5.16%, Val=5.25%
Fold 3: Train=5.16%, Val=5.25%
Fold 4: Train=5.19%, Val=5.12%
Fold 5: Train=5.19%, Val=5.12%
CV F1 Scores: [0.462 0.488 0.421 0.505 0.471]
Mean F1: 0.469 ± 0.029
============================================================
STEP 3: FINAL TEST EVALUATION
============================================================
precision recall f1-score support
Negative 0.97 0.99 0.98 948
Positive 0.60 0.37 0.46 52
accuracy 0.96 1000
macro avg 0.79 0.68 0.72 1000
weighted avg 0.95 0.96 0.95 1000
Every split has ~5.2% positive rate. Evaluation is reliable!
Quick Reference
When to Stratify
| Scenario | Stratify? | How |
|---|---|---|
| Binary classification | Yes | stratify=y |
| Multi-class, balanced | Optional | stratify=y |
| Multi-class, imbalanced | Yes | stratify=y |
| Regression | Optional | stratify=binned_y |
| Multi-label | Yes | MultilabelStratifiedKFold |
| Groups + classes | Yes | StratifiedGroupKFold |
The One-Line Fix
# Add this parameter to ALL your train_test_split calls:
stratify=y
# Add this to ALL your cross-validation:
cv=StratifiedKFold(n_splits=5)
Key Takeaways
Random sampling doesn't guarantee proportions — Rare classes can accidentally cluster
Stratified sampling preserves class ratios — Each split mirrors the original
Critical for imbalanced data — Without it, test sets may have few/no minority samples
Reduces evaluation variance — More reliable performance estimates
One parameter: stratify=y — Easy to implement, huge impact
Use StratifiedKFold for CV — Not regular KFold for classification
Works for regression too — Bin the target, then stratify
Check your distributions — Always verify after splitting (see the helper sketch below)
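For that last point, here is a small helper you can drop in after any split (hypothetical, written for binary labels):

import numpy as np

# Hypothetical helper: verify class balance after a split (binary 0/1 labels)
def check_split_balance(y_full, y_train, y_test, tolerance=0.005):
    rates = {'full': np.mean(y_full), 'train': np.mean(y_train), 'test': np.mean(y_test)}
    for name, rate in rates.items():
        print(f"{name:>5}: {rate:.2%} positive")
    drift = max(abs(rates['train'] - rates['full']), abs(rates['test'] - rates['full']))
    if drift > tolerance:
        print(f"WARNING: positive-rate drift of {drift:.2%}; did you forget stratify=y?")

# check_split_balance(y, y_train, y_test)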
The One-Sentence Summary
Pollster Pete surveyed 1,000 "random" voters but accidentally oversampled Blue neighborhoods and predicted a 67-33 Blue landslide when Red actually won 58-42 — stratified sampling ensures your train/test splits have the same class proportions as your full dataset, so your model isn't trained on a different reality than it's tested on.
What's Next?
Now that you understand stratified sampling, you're ready for:
- Cross-Validation Deep Dive — K-fold, Leave-One-Out, Time Series CV
- Handling Extreme Imbalance — When stratification isn't enough
- Sampling Strategies — Over-sampling, under-sampling, SMOTE
- Bootstrap Methods — Sampling with replacement
Follow me for the next article in this series!
Let's Connect!
If stratified sampling finally clicked, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Have you been burned by non-stratified splits? I once had a model with 0 positive cases in the validation set. Took hours to figure out why metrics were NaN! 😅
The difference between a reliable F1 score and one that swings wildly depending on random seed? Stratified sampling. Your test set should look like your training set should look like your real data. When that chain breaks, your evaluation is fiction.
Share this with someone who keeps getting different metrics every time they run their code. It might not be randomness in the model — it might be randomness in the split.
Happy stratifying! 🗳️