The One-Line Summary: Imbalanced datasets trick models into ignoring the minority class. The fix? Resample your data, adjust class weights, change your metrics, or use algorithms designed for imbalance.
The Lazy Security Guard
Meet Gary, a security guard at a museum.
In 10 years, there have been exactly 3 attempted thefts. Everything else? Normal visitors.
Gary develops a strategy:
"Everyone is innocent. I'll never stop anyone."
His performance review comes in:
Days worked: 3,650
Correct predictions: 3,647 (normal visitors correctly ignored)
Wrong predictions: 3 (thieves walked right past)
ACCURACY: 99.92%!
Gary gets Employee of the Month. He's almost never wrong!
But Gary is completely useless. He caught exactly ZERO thieves. His entire job is catching thieves, and he's failed at 100% of the cases that mattered.
This is your machine learning model on imbalanced data.
When 99.9% of your data is one class, the model learns Gary's strategy:
"Just predict the majority class. You'll be right almost all the time!"
High accuracy. Zero usefulness.
What Is Class Imbalance?
Class imbalance occurs when one class vastly outnumbers another.
Balanced Dataset:
Class A: ████████████████████ 50%
Class B: ████████████████████ 50%
Imbalanced Dataset:
Class A: ████████████████████████████████████████ 99%
Class B: █ 1%
Severely Imbalanced:
Class A: ████████████████████████████████████████ 99.9%
Class B: . 0.1%
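Before reaching for any fix, check how skewed your labels actually are. A minimal sketch with pandas, assuming your labels can be wrapped in a Series named y:
import pandas as pd
# Stand-in labels: swap in your own target column here
y = pd.Series([0] * 9900 + [1] * 100)
print(y.value_counts(normalize=True))
# 0    0.99
# 1    0.01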
Real-world examples:
| Domain | Minority Class | Typical Ratio |
|---|---|---|
| Fraud Detection | Fraud | 1:1,000 |
| Medical Diagnosis | Disease | 1:100 to 1:10,000 |
| Spam Detection | Spam | 1:10 to 1:100 |
| Manufacturing Defects | Defective | 1:1,000 |
| Customer Churn | Churned | 1:5 to 1:20 |
| Click-Through Rate | Clicked | 1:100 to 1:1,000 |
In all these cases, the minority class is exactly what you care about. And it's exactly what your model ignores.
The Accuracy Trap
Let me show you just how misleading accuracy can be.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
# Imbalanced dataset: 1% fraud
np.random.seed(42)
n = 10000
y_true = np.array([0] * 9900 + [1] * 100) # 99% normal, 1% fraud
# The "Lazy Gary" model: Always predict 0 (not fraud)
y_pred_lazy = np.zeros(n)
# The accuracy looks amazing!
print(f"Accuracy: {accuracy_score(y_true, y_pred_lazy):.1%}")
print("\nBut look at the full picture:")
print(classification_report(y_true, y_pred_lazy, target_names=['Normal', 'Fraud']))
Output:
Accuracy: 99.0%
But look at the full picture:
precision recall f1-score support
Normal 0.99 1.00 0.99 9900
Fraud 0.00 0.00 0.00 100
accuracy 0.99 10000
macro avg 0.49 0.50 0.50 10000
weighted avg 0.98 0.99 0.98 10000
99% accuracy! But:
- Fraud Recall: 0% — Caught zero frauds
- Fraud Precision: 0% — Never even tried
- Fraud F1: 0% — Complete failure at the actual task
The model is useless. It just learned to say "not fraud" every time.
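scikit-learn even ships Gary as a ready-made baseline: the DummyClassifier. A quick sketch that reuses y_true from the snippet above; any serious model should beat this baseline on the metrics below, not just on accuracy:
from sklearn.dummy import DummyClassifier
import numpy as np
# "Gary" as an explicit baseline: always predict the most frequent class
X_dummy = np.zeros((len(y_true), 1))          # Gary ignores the features anyway
gary = DummyClassifier(strategy='most_frequent')
gary.fit(X_dummy, y_true)
print(f"Gary's accuracy: {gary.score(X_dummy, y_true):.1%}")   # 99.0%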
The Metrics That Matter
When classes are imbalanced, forget accuracy. Use these instead:
Precision
"Of all the things I flagged as fraud, how many actually were?"
Precision = True Positives / (True Positives + False Positives)
High precision = Few false alarms
Recall (Sensitivity)
"Of all the actual frauds, how many did I catch?"
Recall = True Positives / (True Positives + False Negatives)
High recall = Caught most frauds
F1 Score
"The balance between precision and recall"
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 = 0 means complete failure on the minority class
Area Under ROC Curve (AUC-ROC)
"How well does the model separate the two classes across all thresholds?"
AUC = 0.5 → Random guessing
AUC = 1.0 → Perfect separation
Area Under Precision-Recall Curve (AUC-PR)
"Better than ROC for severe imbalance"
When imbalance is extreme, AUC-PR is more informative than AUC-ROC.
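Here's a minimal sketch of computing all of these with scikit-learn on a toy imbalanced problem (average_precision_score is the usual stand-in for AUC-PR):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)
# Small imbalanced toy problem (5% minority) just to exercise the metrics
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)                # hard labels at the default 0.5 threshold
y_proba = model.predict_proba(X_te)[:, 1]   # probability of the minority class
print(f"Precision: {precision_score(y_te, y_pred):.3f}")
print(f"Recall:    {recall_score(y_te, y_pred):.3f}")
print(f"F1:        {f1_score(y_te, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_te, y_proba):.3f}")
print(f"AUC-PR:    {average_precision_score(y_te, y_proba):.3f}")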
The Arsenal: Every Way to Handle Imbalance
Strategy 1: Resample the Data
Option A: Oversample the Minority Class
The idea: Duplicate minority class examples until classes are balanced.
from sklearn.utils import resample
import pandas as pd
# Assume df is a DataFrame of features with a 'target' column: 9900 normal (0), 100 fraud (1)
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]
# Oversample minority to match majority
df_minority_upsampled = resample(
df_minority,
replace=True, # Sample with replacement
n_samples=len(df_majority), # Match majority count
random_state=42
)
# Combine
df_balanced = pd.concat([df_majority, df_minority_upsampled])
print(df_balanced['target'].value_counts())
# 0 9900
# 1 9900 ← Now balanced!
Visual:
Before:
Normal: ████████████████████████████████████████ 9900
Fraud: █ 100
After oversampling:
Normal: ████████████████████████████████████████ 9900
Fraud: ████████████████████████████████████████ 9900 (duplicates)
Pros: Simple, keeps all data
Cons: Can cause overfitting (model memorizes duplicates)
Option B: Undersample the Majority Class
The idea: Randomly remove majority class examples until balanced.
# Undersample majority to match minority
df_majority_downsampled = resample(
df_majority,
replace=False, # No replacement
n_samples=len(df_minority), # Match minority count
random_state=42
)
# Combine
df_balanced = pd.concat([df_majority_downsampled, df_minority])
print(df_balanced['target'].value_counts())
# 0 100
# 1 100 ← Balanced, but tiny!
Visual:
Before:
Normal: ████████████████████████████████████████ 9900
Fraud: █ 100
After undersampling:
Normal: █ 100 (threw away 9800!)
Fraud: █ 100
Pros: Fast training, no duplicates
Cons: Throws away potentially useful data!
Option C: SMOTE (Synthetic Minority Oversampling)
The idea: Don't just duplicate — CREATE NEW synthetic minority examples.
SMOTE finds a minority example, looks at its nearest neighbors, and creates new examples along the line between them.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"Before: {dict(zip(*np.unique(y, return_counts=True)))}")
print(f"After: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
Output:
Before: {0: 9900, 1: 100}
After: {0: 9900, 1: 9900}
Visual:
Original minority points: ● ● ●
SMOTE creates synthetic points along lines:
● ◐ ● ◐ ●
↑ ↑
Synthetic points!
Pros: Creates diverse examples, not just duplicates
Cons: Can create unrealistic examples, requires imbalanced-learn
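Under the hood, each synthetic point is just an interpolation between a real minority sample and one of its nearest minority neighbors. A toy NumPy sketch of that idea (not the actual imbalanced-learn implementation):
import numpy as np
rng = np.random.default_rng(42)
# Two hypothetical minority-class points in a 2-D feature space
x_i = np.array([1.0, 2.0])     # a real minority sample
x_nn = np.array([1.5, 2.8])    # one of its nearest minority neighbors
# SMOTE-style interpolation: a random point on the segment between them
lam = rng.random()                       # lambda drawn from [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)
print(x_synthetic)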
Option D: SMOTE Variants
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
# Standard SMOTE
smote = SMOTE(random_state=42)
# ADASYN: Creates more synthetics in harder regions
adasyn = ADASYN(random_state=42)
# Borderline SMOTE: Focuses on decision boundary
borderline = BorderlineSMOTE(random_state=42)
| Variant | Strategy |
|---|---|
| SMOTE | Uniform synthetic generation |
| ADASYN | More synthetics where minority is harder to learn |
| BorderlineSMOTE | Focus on examples near decision boundary |
| SMOTE-NC | Handles mixed numerical and categorical |
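All of these share the same fit_resample interface, so trying them side by side is cheap. A sketch that reuses the samplers created above and assumes an X_train / y_train split already exists:
import numpy as np
# Swapping samplers is a one-line change
for name, sampler in [('SMOTE', smote), ('ADASYN', adasyn), ('BorderlineSMOTE', borderline)]:
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, np.bincount(y_res))   # ADASYN balances only approximately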
Strategy 2: Adjust Class Weights
The idea: Don't change the data — change how much the model CARES about each class.
Make errors on the minority class MORE EXPENSIVE.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Option 1: Automatic balancing
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
# Option 2: Manual weights
# Fraud errors are 100x more costly than normal errors
model = LogisticRegression(class_weight={0: 1, 1: 100})
model.fit(X_train, y_train)
# Works with many sklearn models!
rf = RandomForestClassifier(class_weight='balanced')
What class_weight='balanced' does:
weight = n_samples / (n_classes × n_samples_per_class)
For 9,900 normal and 100 fraud samples:
Normal weight: 10000 / (2 × 9900) ≈ 0.505
Fraud weight: 10000 / (2 × 100) = 50.0
Fraud errors now count roughly 99x more, matching the inverse class ratio.
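You can check that arithmetic with scikit-learn's own helper, a quick sketch on the same 9,900 / 100 split:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
y = np.array([0] * 9900 + [1] * 100)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))   # {0: ~0.505, 1: 50.0}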
Pros: No data manipulation, simple, no overfitting risk
Cons: Not all algorithms support it
Strategy 3: Change Your Algorithm
Some algorithms handle imbalance better than others.
Balanced Random Forest
from imblearn.ensemble import BalancedRandomForestClassifier
# Automatically balances each bootstrap sample
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Easy Ensemble
from imblearn.ensemble import EasyEnsembleClassifier
# Creates multiple balanced subsets and ensembles them
model = EasyEnsembleClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
RUSBoost (Random Under-Sampling + Boosting)
from imblearn.ensemble import RUSBoostClassifier
model = RUSBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Strategy 4: Change Your Threshold
By default, models use 0.5 as the threshold:
probability >= 0.5 → Predict positive
probability < 0.5 → Predict negative
But who said 0.5 is right?
Lower the threshold to catch more of the minority class:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
import numpy as np
model = LogisticRegression()
model.fit(X_train, y_train)
# Get probabilities instead of predictions
y_proba = model.predict_proba(X_test)[:, 1]
# Try different thresholds
for threshold in [0.5, 0.3, 0.2, 0.1]:
y_pred = (y_proba >= threshold).astype(int)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f"Threshold {threshold}: Recall={recall:.1%}, Precision={precision:.1%}")
Output:
Threshold 0.5: Recall=20.0%, Precision=85.0%
Threshold 0.3: Recall=45.0%, Precision=72.0%
Threshold 0.2: Recall=65.0%, Precision=58.0%
Threshold 0.1: Recall=85.0%, Precision=35.0%
Lower threshold = Higher recall, Lower precision
Find the sweet spot for your use case!
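One way to find that sweet spot systematically is to sweep every candidate threshold with precision_recall_curve and score each one. A sketch, maximizing F1 here and assuming y_test and y_proba from the snippet above; in practice you would plug in your own cost function:
import numpy as np
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Score every candidate threshold; the last precision/recall point has no
# matching threshold, so drop it before computing F1
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"Best threshold by F1: {thresholds[best]:.2f} "
      f"(precision={precision[best]:.1%}, recall={recall[best]:.1%})")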
Strategy 5: Anomaly Detection Approach
When imbalance is EXTREME (fraud is 0.01%), treat it as anomaly detection.
The idea: Train only on the majority class. Flag anything that doesn't fit.
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
# Train only on normal transactions
X_normal = X_train[y_train == 0]
# Isolation Forest
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_normal)
# Predictions: 1 = normal, -1 = anomaly
predictions = iso_forest.predict(X_test)
y_pred = (predictions == -1).astype(int) # Convert to 0/1
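OneClassSVM is imported above but not used; its usage is analogous (a sketch, with nu set near the anomaly rate you expect):
# One-Class SVM alternative: learns a tight boundary around the normal class.
# nu is roughly the expected fraction of anomalies (an assumption to tune).
oc_svm = OneClassSVM(nu=0.01, kernel='rbf', gamma='scale')
oc_svm.fit(X_normal)
y_pred_svm = (oc_svm.predict(X_test) == -1).astype(int)   # same 1/-1 convention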
Pros: Works with extreme imbalance, doesn't need minority labels for training
Cons: Less accurate than supervised methods when you have enough minority examples
Strategy 6: Cost-Sensitive Learning
The idea: Define explicit costs for different types of errors.
| | Predicted Normal | Predicted Fraud |
|---|---|---|
| Actual Normal | $0 | $10 (false alarm: investigation cost) |
| Actual Fraud | $1,000 | $0 (missed fraud: loss to the company) |
Missing a fraud costs 100x more than a false alarm. Build this into your model.
# XGBoost with custom scale_pos_weight
import xgboost as xgb
# If fraud is 1% of data, set scale_pos_weight to 99
model = xgb.XGBClassifier(scale_pos_weight=99)
model.fit(X_train, y_train)
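Rather than hard-coding 99, you can derive the weight from the data itself; the usual rule of thumb is negatives divided by positives. A sketch, assuming X_train and y_train exist:
import numpy as np
import xgboost as xgb
# scale_pos_weight ≈ (number of negatives) / (number of positives)
ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)
model = xgb.XGBClassifier(scale_pos_weight=ratio)
model.fit(X_train, y_train)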
Complete Code: Comparing All Strategies
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score, recall_score, precision_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.ensemble import BalancedRandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
# Create imbalanced dataset (5% minority)
X, y = make_classification(
n_samples=10000,
n_features=20,
n_informative=10,
n_redundant=5,
weights=[0.95, 0.05], # 95% class 0, 5% class 1
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set class distribution:")
print(f" Class 0: {sum(y_train==0)} ({sum(y_train==0)/len(y_train):.1%})")
print(f" Class 1: {sum(y_train==1)} ({sum(y_train==1)/len(y_train):.1%})")
print()
results = []
# 1. Baseline: No handling
print("=" * 60)
print("1. BASELINE (No imbalance handling)")
print("=" * 60)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Baseline', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 2. Class weights
print("=" * 60)
print("2. CLASS WEIGHTS (balanced)")
print("=" * 60)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Class Weights', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 3. Random Oversampling
print("=" * 60)
print("3. RANDOM OVERSAMPLING")
print("=" * 60)
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model.fit(X_ros, y_ros)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Oversampling', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 4. SMOTE
print("=" * 60)
print("4. SMOTE")
print("=" * 60)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model.fit(X_smote, y_smote)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('SMOTE', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 5. Random Undersampling
print("=" * 60)
print("5. RANDOM UNDERSAMPLING")
print("=" * 60)
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model.fit(X_rus, y_rus)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Undersampling', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 6. Balanced Random Forest
print("=" * 60)
print("6. BALANCED RANDOM FOREST")
print("=" * 60)
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Balanced RF', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# Summary
print("=" * 60)
print("SUMMARY")
print("=" * 60)
print(f"{'Method':<20} {'F1 Score':>12} {'Recall':>12}")
print("-" * 46)
for method, f1, recall in results:
print(f"{method:<20} {f1:>12.1%} {recall:>12.1%}")
Output:
Training set class distribution:
Class 0: 7591 (94.9%)
Class 1: 409 (5.1%)
============================================================
1. BASELINE (No imbalance handling)
============================================================
precision recall f1-score support
Majority 0.96 0.99 0.98 1909
Minority 0.71 0.42 0.53 91
accuracy 0.96 2000
============================================================
2. CLASS WEIGHTS (balanced)
============================================================
precision recall f1-score support
Majority 0.98 0.93 0.95 1909
Minority 0.40 0.70 0.51 91
accuracy 0.92 2000
============================================================
SUMMARY
============================================================
Method F1 Score Recall
----------------------------------------------
Baseline 52.8% 41.8%
Class Weights 50.8% 70.3%
Oversampling 52.7% 62.6%
SMOTE 54.1% 64.8%
Undersampling 48.2% 72.5%
Balanced RF 55.3% 61.5%
Key insight: Baseline has the worst recall (41.8%). All imbalance techniques improve recall, but at different precision costs. Choose based on your priorities!
The Precision-Recall Tradeoff
Every imbalance technique faces this tradeoff:
HIGH PRECISION  ◄─────────────────────────►  HIGH RECALL
"Few false alarms"                       "Catch everything"

Baseline:       ████████████████░░░░░░░░░░░░░░░░░░░░░░
Class Weights:  ████████████░░░░░░░░░░░░░░░░░████████░░
SMOTE:          ██████████████░░░░░░░░░░░░░░░░████████░░
Undersampling:  ████████░░░░░░░░░░░░░░░░░░░░░░██████████
                ◄── more precision      more recall ──►

Your sweet spot depends on cost!
If missing fraud costs $1000 but false alarm costs $10:
→ Prioritize recall! Catch all frauds, tolerate false alarms.
If false alarms annoy customers and cause churn:
→ Balance precision and recall. Don't cry wolf too often.
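You can turn those dollar figures directly into a threshold choice by scoring each threshold with total expected cost instead of a generic metric. A sketch using the hypothetical $1,000 / $10 costs, with y_test and y_proba assumed from a trained model:
import numpy as np
from sklearn.metrics import confusion_matrix
COST_MISSED_FRAUD = 1000   # cost of a false negative
COST_FALSE_ALARM = 10      # cost of a false positive
best_threshold, best_cost = 0.5, float('inf')
for threshold in np.arange(0.05, 1.0, 0.05):
    y_pred = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    cost = fn * COST_MISSED_FRAUD + fp * COST_FALSE_ALARM
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost
print(f"Cheapest threshold: {best_threshold:.2f} (total cost ${best_cost:,.0f})")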
Which Strategy When?
START
│
▼
How severe is the imbalance?
│
├── Mild (10-30% minority)
│ │
│ └──► Class weights usually enough
│ Try: class_weight='balanced'
│
├── Moderate (1-10% minority)
│ │
│ └──► SMOTE or Class weights
│ Try: SMOTE + class_weight
│
└── Severe (<1% minority)
│
└──► Combine multiple strategies
Try: SMOTE + class_weight + threshold tuning
Or: Anomaly detection approach
Common Mistakes
Mistake 1: Using Accuracy as Your Metric
# ❌ WRONG: Accuracy is misleading!
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}") # 99%! 🎉
# ✅ RIGHT: Use F1, Recall, Precision, AUC
print(f"F1: {f1_score(y_test, y_pred):.1%}")
print(f"Recall: {recall_score(y_test, y_pred):.1%}")
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
Mistake 2: Resampling Before Train-Test Split
# ❌ WRONG: Data leakage! Synthetic test samples based on training data
X_smote, y_smote = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote)
# ✅ RIGHT: Split first, then resample only training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_smote, y_train_smote = SMOTE().fit_resample(X_train, y_train)
model.fit(X_train_smote, y_train_smote)
model.predict(X_test) # Test set is untouched!
Mistake 3: Getting Resampling Wrong in Cross-Validation
# ❌ WRONG: Resampling before CV causes leakage
X_smote, y_smote = SMOTE().fit_resample(X, y)
cross_val_score(model, X_smote, y_smote, cv=5)
# ✅ RIGHT: Use imblearn's Pipeline so SMOTE runs only inside each training fold
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline as ImbPipeline
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', LogisticRegression())
])
cross_val_score(pipeline, X, y, cv=5, scoring='f1')
Mistake 4: Ignoring the Business Context
# ❌ WRONG: Optimizing F1 without thinking about costs
model = optimize_for_f1(model)
# ✅ RIGHT: Consider actual business costs
# If missing a fraud costs $10,000 and a false alarm costs $50,
# a false negative is 200x as expensive as a false positive.
# Weight recall accordingly when choosing metrics and thresholds.
Mistake 5: Not Trying Multiple Approaches
# ❌ WRONG: Just using SMOTE because you heard it's good
X_smote, y_smote = SMOTE().fit_resample(X_train, y_train)
# ✅ RIGHT: Compare multiple approaches
strategies = [
('Baseline', X_train, y_train),
('SMOTE', *SMOTE().fit_resample(X_train, y_train)),
('Class Weight', X_train, y_train), # with class_weight='balanced'
('Undersampling', *RandomUnderSampler().fit_resample(X_train, y_train)),
]
# Evaluate each and pick the best for YOUR use case
The Decision Cheat Sheet
| Situation | Best Approach |
|---|---|
| Quick fix, any algorithm | class_weight='balanced' |
| Tree-based models | Balanced Random Forest |
| Need to preserve all data | SMOTE |
| Huge dataset, need speed | Undersampling |
| Extreme imbalance (<0.1%) | Anomaly detection |
| Production system | Threshold tuning |
| Maximum performance | Combine SMOTE + weights + threshold |
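For the "maximum performance" row, one way to wire the pieces together is an imbalanced-learn pipeline topped off with threshold tuning. A sketch under the same assumptions as the comparison script above; whether the combination actually helps is something to verify on your own data:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
# SMOTE + class weights inside one pipeline (X_train etc. assumed from your split)
combo = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
combo.fit(X_train, y_train)
# Finish with threshold tuning on held-out probabilities
y_proba = combo.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.3).astype(int)   # 0.3 is illustrative; tune it for your costs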
The Imbalanced-Learn Toolkit
# Install: pip install imbalanced-learn
# === OVERSAMPLING ===
from imblearn.over_sampling import (
RandomOverSampler, # Simple duplication
SMOTE, # Synthetic generation
ADASYN, # Adaptive synthetic
BorderlineSMOTE, # Focus on boundary
)
# === UNDERSAMPLING ===
from imblearn.under_sampling import (
RandomUnderSampler, # Random removal
TomekLinks, # Remove Tomek links
NearMiss, # Keep the most informative majority samples
)
# === COMBINATION ===
from imblearn.combine import (
SMOTETomek, # SMOTE + Tomek cleaning
SMOTEENN, # SMOTE + ENN cleaning
)
# === ENSEMBLE ===
from imblearn.ensemble import (
BalancedRandomForestClassifier,
BalancedBaggingClassifier,
EasyEnsembleClassifier,
RUSBoostClassifier,
)
# === PIPELINE ===
from imblearn.pipeline import Pipeline # Use this, not sklearn's!
Key Takeaways
Accuracy is a lie with imbalanced data — use F1, recall, precision, AUC
The model isn't stupid — it's doing exactly what you asked (minimize errors)
Class weights are the easiest fix — just add class_weight='balanced'
SMOTE creates synthetic examples — better than simple duplication
Resample AFTER train-test split — never before, or you'll leak data
Threshold tuning is powerful — 0.5 isn't magic
Combine strategies for best results — SMOTE + weights + threshold
Know your costs — precision vs recall depends on business impact
The One-Sentence Summary
When 99% of your data is one class, your model becomes Gary the lazy security guard — 99% accurate, 0% useful. Fix it by making minority mistakes expensive, creating synthetic minorities, or changing how you measure success.
What's Next?
Now that you understand imbalanced datasets, you're ready for:
- Precision-Recall Curves — Finding the optimal threshold
- Cost-Sensitive Learning — Building business costs into your model
- Anomaly Detection Deep Dive — When imbalance is extreme
- Stratified Sampling — Preserving class ratios in splits
Follow me for the next article in this series!
Let's Connect!
If this saved your imbalanced model, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the most imbalanced dataset you've worked with? I've seen 1:100,000. Share your stories!
The difference between a fraud detection model that catches fraudsters and one that just says "everything is fine"? Understanding that 99% accuracy can mean 0% usefulness. Don't be Gary.
Share this with someone whose model has 99% accuracy but catches nothing. They need to meet Gary.
Happy balancing!