The One-Line Summary: Imbalanced datasets trick models into ignoring the minority class. The fix? Resample your data, adjust class weights, change your metrics, or use algorithms designed for imbalance.
The Lazy Security Guard
Meet Gary, a security guard at a museum.
In 10 years, there have been exactly 3 attempted thefts. Everything else? Normal visitors.
Gary develops a strategy:
"Everyone is innocent. I'll never stop anyone."
His performance review comes in:
Days worked: 3,650
Correct predictions: 3,647 (normal visitors correctly ignored)
Wrong predictions: 3 (thieves walked right past)
ACCURACY: 99.92%!
Gary gets Employee of the Month. He's almost never wrong!
But Gary is completely useless. He caught exactly ZERO thieves. His entire job is catching thieves, and he's failed at 100% of the cases that mattered.
This is your machine learning model on imbalanced data.
When 99.9% of your data is one class, the model learns Gary's strategy:
"Just predict the majority class. You'll be right almost all the time!"
High accuracy. Zero usefulness.
What Is Class Imbalance?
Class imbalance occurs when one class vastly outnumbers another.
Balanced Dataset:
Class A: ████████████████████ 50%
Class B: ████████████████████ 50%
Imbalanced Dataset:
Class A: ████████████████████████████████████████ 99%
Class B: █ 1%
Severely Imbalanced:
Class A: ████████████████████████████████████████ 99.9%
Class B: . 0.1%
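Before reaching for any fix, check how skewed your labels actually are. A minimal sketch with pandas, assuming your labels can be wrapped in a Series named y:
import pandas as pd
# Stand-in labels: swap in your own target column here
y = pd.Series([0] * 9900 + [1] * 100)
print(y.value_counts(normalize=True))
# 0    0.99
# 1    0.01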
Real-world examples:
| Domain | Minority Class | Typical Ratio |
|---|---|---|
| Fraud Detection | Fraud | 1:1,000 |
| Medical Diagnosis | Disease | 1:100 to 1:10,000 |
| Spam Detection | Spam | 1:10 to 1:100 |
| Manufacturing Defects | Defective | 1:1,000 |
| Customer Churn | Churned | 1:5 to 1:20 |
| Click-Through Rate | Clicked | 1:100 to 1:1,000 |
In all these cases, the minority class is exactly what you care about. And it's exactly what your model ignores.
The Accuracy Trap
Let me show you just how misleading accuracy can be.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
# Imbalanced dataset: 1% fraud
np.random.seed(42)
n = 10000
y_true = np.array([0] * 9900 + [1] * 100) # 99% normal, 1% fraud
# The "Lazy Gary" model: Always predict 0 (not fraud)
y_pred_lazy = np.zeros(n)
# The accuracy looks amazing!
print(f"Accuracy: {accuracy_score(y_true, y_pred_lazy):.1%}")
print("\nBut look at the full picture:")
print(classification_report(y_true, y_pred_lazy, target_names=['Normal', 'Fraud']))
Output:
Accuracy: 99.0%
But look at the full picture:
precision recall f1-score support
Normal 0.99 1.00 0.99 9900
Fraud 0.00 0.00 0.00 100
accuracy 0.99 10000
macro avg 0.49 0.50 0.50 10000
weighted avg 0.98 0.99 0.98 10000
99% accuracy! But:
- Fraud Recall: 0% — Caught zero frauds
- Fraud Precision: 0% — Never even tried
- Fraud F1: 0% — Complete failure at the actual task
The model is useless. It just learned to say "not fraud" every time.
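scikit-learn even ships Gary as a ready-made baseline: the DummyClassifier. A quick sketch that reuses y_true from the snippet above; any serious model should beat this baseline on the metrics below, not just on accuracy:
from sklearn.dummy import DummyClassifier
import numpy as np
# "Gary" as an explicit baseline: always predict the most frequent class
X_dummy = np.zeros((len(y_true), 1))          # Gary ignores the features anyway
gary = DummyClassifier(strategy='most_frequent')
gary.fit(X_dummy, y_true)
print(f"Gary's accuracy: {gary.score(X_dummy, y_true):.1%}")   # 99.0%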
The Metrics That Matter
When classes are imbalanced, forget accuracy. Use these instead:
Precision
"Of all the things I flagged as fraud, how many actually were?"
Precision = True Positives / (True Positives + False Positives)
High precision = Few false alarms
Recall (Sensitivity)
"Of all the actual frauds, how many did I catch?"
Recall = True Positives / (True Positives + False Negatives)
High recall = Caught most frauds
F1 Score
"The balance between precision and recall"
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 = 0 means complete failure on the minority class
Area Under ROC Curve (AUC-ROC)
"How well does the model separate the two classes across all thresholds?"
AUC = 0.5 → Random guessing
AUC = 1.0 → Perfect separation
Area Under Precision-Recall Curve (AUC-PR)
"Better than ROC for severe imbalance"
When imbalance is extreme, AUC-PR is more informative than AUC-ROC.
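Here's a minimal sketch of computing all of these with scikit-learn on a toy imbalanced problem (average_precision_score is the usual stand-in for AUC-PR):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)
# Small imbalanced toy problem (5% minority) just to exercise the metrics
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)                # hard labels at the default 0.5 threshold
y_proba = model.predict_proba(X_te)[:, 1]   # probability of the minority class
print(f"Precision: {precision_score(y_te, y_pred):.3f}")
print(f"Recall:    {recall_score(y_te, y_pred):.3f}")
print(f"F1:        {f1_score(y_te, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_te, y_proba):.3f}")
print(f"AUC-PR:    {average_precision_score(y_te, y_proba):.3f}")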
The Arsenal: Every Way to Handle Imbalance
Strategy 1: Resample the Data
Option A: Oversample the Minority Class
The idea: Duplicate minority class examples until classes are balanced.
from sklearn.utils import resample
import pandas as pd
# Assume df is a DataFrame of features with a 'target' column: 9900 normal (0), 100 fraud (1)
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]
# Oversample minority to match majority
df_minority_upsampled = resample(
df_minority,
replace=True, # Sample with replacement
n_samples=len(df_majority), # Match majority count
random_state=42
)
# Combine
df_balanced = pd.concat([df_majority, df_minority_upsampled])
print(df_balanced['target'].value_counts())
# 0 9900
# 1 9900 ← Now balanced!
Visual:
Before:
Normal: ████████████████████████████████████████ 9900
Fraud: █ 100
After oversampling:
Normal: ████████████████████████████████████████ 9900
Fraud: ████████████████████████████████████████ 9900 (duplicates)
Pros: Simple, keeps all data
Cons: Can cause overfitting (model memorizes duplicates)
Option B: Undersample the Majority Class
The idea: Randomly remove majority class examples until balanced.
# Undersample majority to match minority
df_majority_downsampled = resample(
df_majority,
replace=False, # No replacement
n_samples=len(df_minority), # Match minority count
random_state=42
)
# Combine
df_balanced = pd.concat([df_majority_downsampled, df_minority])
print(df_balanced['target'].value_counts())
# 0 100
# 1 100 ← Balanced, but tiny!
Visual:
Before:
Normal: ████████████████████████████████████████ 9900
Fraud: █ 100
After undersampling:
Normal: █ 100 (threw away 9800!)
Fraud: █ 100
Pros: Fast training, no duplicates
Cons: Throws away potentially useful data!
Option C: SMOTE (Synthetic Minority Oversampling)
The idea: Don't just duplicate — CREATE NEW synthetic minority examples.
SMOTE finds a minority example, looks at its nearest neighbors, and creates new examples along the line between them.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"Before: {dict(zip(*np.unique(y, return_counts=True)))}")
print(f"After: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
Output:
Before: {0: 9900, 1: 100}
After: {0: 9900, 1: 9900}
Visual:
Original minority points: ● ● ●
SMOTE creates synthetic points along lines:
● ◐ ● ◐ ●
↑ ↑
Synthetic points!
Pros: Creates diverse examples, not just duplicates
Cons: Can create unrealistic examples, requires imbalanced-learn
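Under the hood, each synthetic point is just an interpolation between a real minority sample and one of its nearest minority neighbors. A toy NumPy sketch of that idea (not the actual imbalanced-learn implementation):
import numpy as np
rng = np.random.default_rng(42)
# Two hypothetical minority-class points in a 2-D feature space
x_i = np.array([1.0, 2.0])     # a real minority sample
x_nn = np.array([1.5, 2.8])    # one of its nearest minority neighbors
# SMOTE-style interpolation: a random point on the segment between them
lam = rng.random()                       # lambda drawn from [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)
print(x_synthetic)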
Option D: SMOTE Variants
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
# Standard SMOTE
smote = SMOTE(random_state=42)
# ADASYN: Creates more synthetics in harder regions
adasyn = ADASYN(random_state=42)
# Borderline SMOTE: Focuses on decision boundary
borderline = BorderlineSMOTE(random_state=42)
| Variant | Strategy |
|---|---|
| SMOTE | Uniform synthetic generation |
| ADASYN | More synthetics where minority is harder to learn |
| BorderlineSMOTE | Focus on examples near decision boundary |
| SMOTE-NC | Handles mixed numerical and categorical |
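All of these share the same fit_resample interface, so trying them side by side is cheap. A sketch that reuses the samplers created above and assumes an X_train / y_train split already exists:
import numpy as np
# Swapping samplers is a one-line change
for name, sampler in [('SMOTE', smote), ('ADASYN', adasyn), ('BorderlineSMOTE', borderline)]:
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, np.bincount(y_res))   # ADASYN balances only approximately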
Strategy 2: Adjust Class Weights
The idea: Don't change the data — change how much the model CARES about each class.
Make errors on the minority class MORE EXPENSIVE.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Option 1: Automatic balancing
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
# Option 2: Manual weights
# Fraud errors are 100x more costly than normal errors
model = LogisticRegression(class_weight={0: 1, 1: 100})
model.fit(X_train, y_train)
# Works with many sklearn models!
rf = RandomForestClassifier(class_weight='balanced')
What class_weight='balanced' does:
weight = n_samples / (n_classes × n_samples_per_class)
For 9,900 normal and 100 fraud samples:
Normal weight: 10000 / (2 × 9900) ≈ 0.505
Fraud weight: 10000 / (2 × 100) = 50.0
Fraud errors now count roughly 99x more, matching the inverse class ratio.
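You can check that arithmetic with scikit-learn's own helper, a quick sketch on the same 9,900 / 100 split:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
y = np.array([0] * 9900 + [1] * 100)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))   # {0: ~0.505, 1: 50.0}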
Pros: No data manipulation, simple, no overfitting risk
Cons: Not all algorithms support it
Strategy 3: Change Your Algorithm
Some algorithms handle imbalance better than others.
Balanced Random Forest
from imblearn.ensemble import BalancedRandomForestClassifier
# Automatically balances each bootstrap sample
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Easy Ensemble
from imblearn.ensemble import EasyEnsembleClassifier
# Creates multiple balanced subsets and ensembles them
model = EasyEnsembleClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
RUSBoost (Random Under-Sampling + Boosting)
from imblearn.ensemble import RUSBoostClassifier
model = RUSBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Strategy 4: Change Your Threshold
By default, models use 0.5 as the threshold:
probability >= 0.5 → Predict positive
probability < 0.5 → Predict negative
But who said 0.5 is right?
Lower the threshold to catch more of the minority class:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
import numpy as np
model = LogisticRegression()
model.fit(X_train, y_train)
# Get probabilities instead of predictions
y_proba = model.predict_proba(X_test)[:, 1]
# Try different thresholds
for threshold in [0.5, 0.3, 0.2, 0.1]:
y_pred = (y_proba >= threshold).astype(int)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f"Threshold {threshold}: Recall={recall:.1%}, Precision={precision:.1%}")
Output:
Threshold 0.5: Recall=20.0%, Precision=85.0%
Threshold 0.3: Recall=45.0%, Precision=72.0%
Threshold 0.2: Recall=65.0%, Precision=58.0%
Threshold 0.1: Recall=85.0%, Precision=35.0%
Lower threshold = Higher recall, Lower precision
Find the sweet spot for your use case!
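One way to find that sweet spot systematically is to sweep every candidate threshold with precision_recall_curve and score each one. A sketch, maximizing F1 here and assuming y_test and y_proba from the snippet above; in practice you would plug in your own cost function:
import numpy as np
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Score every candidate threshold; the last precision/recall point has no
# matching threshold, so drop it before computing F1
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"Best threshold by F1: {thresholds[best]:.2f} "
      f"(precision={precision[best]:.1%}, recall={recall[best]:.1%})")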
Strategy 5: Anomaly Detection Approach
When imbalance is EXTREME (fraud is 0.01%), treat it as anomaly detection.
The idea: Train only on the majority class. Flag anything that doesn't fit.
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
# Train only on normal transactions
X_normal = X_train[y_train == 0]
# Isolation Forest
iso_forest = IsolationForest(contamination=0.01, random_state=42)
iso_forest.fit(X_normal)
# Predictions: 1 = normal, -1 = anomaly
predictions = iso_forest.predict(X_test)
y_pred = (predictions == -1).astype(int) # Convert to 0/1
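OneClassSVM is imported above but not used; its usage is analogous (a sketch, with nu set near the anomaly rate you expect):
# One-Class SVM alternative: learns a tight boundary around the normal class.
# nu is roughly the expected fraction of anomalies (an assumption to tune).
oc_svm = OneClassSVM(nu=0.01, kernel='rbf', gamma='scale')
oc_svm.fit(X_normal)
y_pred_svm = (oc_svm.predict(X_test) == -1).astype(int)   # same 1/-1 convention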
Pros: Works with extreme imbalance, doesn't need minority labels for training
Cons: Less accurate than supervised methods when you have enough minority examples
Strategy 6: Cost-Sensitive Learning
The idea: Define explicit costs for different types of errors.
| | Predicted Normal | Predicted Fraud |
|---|---|---|
| Actual Normal | $0 | $10 (false alarm: investigation cost) |
| Actual Fraud | $1,000 | $0 (missed fraud: loss to the company) |
Missing a fraud costs 100x more than a false alarm. Build this into your model.
# XGBoost with custom scale_pos_weight
import xgboost as xgb
# If fraud is 1% of data, set scale_pos_weight to 99
model = xgb.XGBClassifier(scale_pos_weight=99)
model.fit(X_train, y_train)
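Rather than hard-coding 99, you can derive the weight from the data itself; the usual rule of thumb is negatives divided by positives. A sketch, assuming X_train and y_train exist:
import numpy as np
import xgboost as xgb
# scale_pos_weight ≈ (number of negatives) / (number of positives)
ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)
model = xgb.XGBClassifier(scale_pos_weight=ratio)
model.fit(X_train, y_train)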
Complete Code: Comparing All Strategies
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score, recall_score, precision_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.ensemble import BalancedRandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
# Create imbalanced dataset (5% minority)
X, y = make_classification(
n_samples=10000,
n_features=20,
n_informative=10,
n_redundant=5,
weights=[0.95, 0.05], # 95% class 0, 5% class 1
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set class distribution:")
print(f" Class 0: {sum(y_train==0)} ({sum(y_train==0)/len(y_train):.1%})")
print(f" Class 1: {sum(y_train==1)} ({sum(y_train==1)/len(y_train):.1%})")
print()
results = []
# 1. Baseline: No handling
print("=" * 60)
print("1. BASELINE (No imbalance handling)")
print("=" * 60)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Baseline', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 2. Class weights
print("=" * 60)
print("2. CLASS WEIGHTS (balanced)")
print("=" * 60)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Class Weights', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 3. Random Oversampling
print("=" * 60)
print("3. RANDOM OVERSAMPLING")
print("=" * 60)
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model.fit(X_ros, y_ros)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Oversampling', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 4. SMOTE
print("=" * 60)
print("4. SMOTE")
print("=" * 60)
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model.fit(X_smote, y_smote)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('SMOTE', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 5. Random Undersampling
print("=" * 60)
print("5. RANDOM UNDERSAMPLING")
print("=" * 60)
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
model = LogisticRegression(max_iter=1000)
model.fit(X_rus, y_rus)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Undersampling', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# 6. Balanced Random Forest
print("=" * 60)
print("6. BALANCED RANDOM FOREST")
print("=" * 60)
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Majority', 'Minority']))
results.append(('Balanced RF', f1_score(y_test, y_pred), recall_score(y_test, y_pred)))
# Summary
print("=" * 60)
print("SUMMARY")
print("=" * 60)
print(f"{'Method':<20} {'F1 Score':>12} {'Recall':>12}")
print("-" * 46)
for method, f1, recall in results:
print(f"{method:<20} {f1:>12.1%} {recall:>12.1%}")
Output:
Training set class distribution:
Class 0: 7591 (94.9%)
Class 1: 409 (5.1%)
============================================================
1. BASELINE (No imbalance handling)
============================================================
precision recall f1-score support
Majority 0.96 0.99 0.98 1909
Minority 0.71 0.42 0.53 91
accuracy 0.96 2000
============================================================
2. CLASS WEIGHTS (balanced)
============================================================
precision recall f1-score support
Majority 0.98 0.93 0.95 1909
Minority 0.40 0.70 0.51 91
accuracy 0.92 2000
============================================================
SUMMARY
============================================================
Method F1 Score Recall
----------------------------------------------
Baseline 52.8% 41.8%
Class Weights 50.8% 70.3%
Oversampling 52.7% 62.6%
SMOTE 54.1% 64.8%
Undersampling 48.2% 72.5%
Balanced RF 55.3% 61.5%
Key insight: Baseline has the worst recall (41.8%). All imbalance techniques improve recall, but at different precision costs. Choose based on your priorities!
The Precision-Recall Tradeoff
Every imbalance technique faces this tradeoff:
HIGH PRECISION  ◄─────────────────────────►  HIGH RECALL
"Few false alarms"                       "Catch everything"

Baseline:       ████████████████░░░░░░░░░░░░░░░░░░░░░░
Class Weights:  ████████████░░░░░░░░░░░░░░░░░████████░░
SMOTE:          ██████████████░░░░░░░░░░░░░░░░████████░░
Undersampling:  ████████░░░░░░░░░░░░░░░░░░░░░░██████████
                ◄── more precision      more recall ──►

Your sweet spot depends on cost!
If missing fraud costs $1000 but false alarm costs $10:
→ Prioritize recall! Catch all frauds, tolerate false alarms.
If false alarms annoy customers and cause churn:
→ Balance precision and recall. Don't cry wolf too often.
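You can turn those dollar figures directly into a threshold choice by scoring each threshold with total expected cost instead of a generic metric. A sketch using the hypothetical $1,000 / $10 costs, with y_test and y_proba assumed from a trained model:
import numpy as np
from sklearn.metrics import confusion_matrix
COST_MISSED_FRAUD = 1000   # cost of a false negative
COST_FALSE_ALARM = 10      # cost of a false positive
best_threshold, best_cost = 0.5, float('inf')
for threshold in np.arange(0.05, 1.0, 0.05):
    y_pred = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    cost = fn * COST_MISSED_FRAUD + fp * COST_FALSE_ALARM
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost
print(f"Cheapest threshold: {best_threshold:.2f} (total cost ${best_cost:,.0f})")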
Which Strategy When?
START
│
▼
How severe is the imbalance?
│
├── Mild (10-30% minority)
│ │
│ └──► Class weights usually enough
│ Try: class_weight='balanced'
│
├── Moderate (1-10% minority)
│ │
│ └──► SMOTE or Class weights
│ Try: SMOTE + class_weight
│
└── Severe (<1% minority)
│
└──► Combine multiple strategies
Try: SMOTE + class_weight + threshold tuning
Or: Anomaly detection approach
Common Mistakes
Mistake 1: Using Accuracy as Your Metric
# ❌ WRONG: Accuracy is misleading!
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}") # 99%! 🎉
# ✅ RIGHT: Use F1, Recall, Precision, AUC
print(f"F1: {f1_score(y_test, y_pred):.1%}")
print(f"Recall: {recall_score(y_test, y_pred):.1%}")
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
Mistake 2: Resampling Before Train-Test Split
# ❌ WRONG: Data leakage! Synthetic test samples based on training data
X_smote, y_smote = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote)
# ✅ RIGHT: Split first, then resample only training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_smote, y_train_smote = SMOTE().fit_resample(X_train, y_train)
model.fit(X_train_smote, y_train_smote)
model.predict(X_test) # Test set is untouched!
Mistake 3: Getting Resampling Wrong in Cross-Validation
# ❌ WRONG: Resampling before CV causes leakage
X_smote, y_smote = SMOTE().fit_resample(X, y)
cross_val_score(model, X_smote, y_smote, cv=5)
# ✅ RIGHT: Use imblearn's Pipeline so SMOTE runs only inside each training fold
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline as ImbPipeline
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', LogisticRegression())
])
cross_val_score(pipeline, X, y, cv=5, scoring='f1')
Mistake 4: Ignoring the Business Context
# ❌ WRONG: Optimizing F1 without thinking about costs
model = optimize_for_f1(model)
# ✅ RIGHT: Consider actual business costs
# If missing a fraud costs $10,000 and a false alarm costs $50,
# a false negative is 200x as expensive as a false positive.
# Weight recall accordingly when choosing metrics and thresholds.
Mistake 5: Not Trying Multiple Approaches
# ❌ WRONG: Just using SMOTE because you heard it's good
X_smote, y_smote = SMOTE().fit_resample(X_train, y_train)
# ✅ RIGHT: Compare multiple approaches
strategies = [
('Baseline', X_train, y_train),
('SMOTE', *SMOTE().fit_resample(X_train, y_train)),
('Class Weight', X_train, y_train), # with class_weight='balanced'
('Undersampling', *RandomUnderSampler().fit_resample(X_train, y_train)),
]
# Evaluate each and pick the best for YOUR use case
The Decision Cheat Sheet
| Situation | Best Approach |
|---|---|
| Quick fix, any algorithm | class_weight='balanced' |
| Tree-based models | Balanced Random Forest |
| Need to preserve all data | SMOTE |
| Huge dataset, need speed | Undersampling |
| Extreme imbalance (<0.1%) | Anomaly detection |
| Production system | Threshold tuning |
| Maximum performance | Combine SMOTE + weights + threshold |
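For the "maximum performance" row, one way to wire the pieces together is an imbalanced-learn pipeline topped off with threshold tuning. A sketch under the same assumptions as the comparison script above; whether the combination actually helps is something to verify on your own data:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
# SMOTE + class weights inside one pipeline (X_train etc. assumed from your split)
combo = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
combo.fit(X_train, y_train)
# Finish with threshold tuning on held-out probabilities
y_proba = combo.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.3).astype(int)   # 0.3 is illustrative; tune it for your costs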
The Imbalanced-Learn Toolkit
# Install: pip install imbalanced-learn
# === OVERSAMPLING ===
from imblearn.over_sampling import (
RandomOverSampler, # Simple duplication
SMOTE, # Synthetic generation
ADASYN, # Adaptive synthetic
BorderlineSMOTE, # Focus on boundary
)
# === UNDERSAMPLING ===
from imblearn.under_sampling import (
RandomUnderSampler, # Random removal
TomekLinks, # Remove Tomek links
NearMiss, # Keep the most informative majority samples
)
# === COMBINATION ===
from imblearn.combine import (
SMOTETomek, # SMOTE + Tomek cleaning
SMOTEENN, # SMOTE + ENN cleaning
)
# === ENSEMBLE ===
from imblearn.ensemble import (
BalancedRandomForestClassifier,
BalancedBaggingClassifier,
EasyEnsembleClassifier,
RUSBoostClassifier,
)
# === PIPELINE ===
from imblearn.pipeline import Pipeline # Use this, not sklearn's!
Key Takeaways
Accuracy is a lie with imbalanced data — use F1, recall, precision, AUC
The model isn't stupid — it's doing exactly what you asked (minimize errors)
Class weights are the easiest fix — just add class_weight='balanced'
SMOTE creates synthetic examples — better than simple duplication
Resample AFTER train-test split — never before, or you'll leak data
Threshold tuning is powerful — 0.5 isn't magic
Combine strategies for best results — SMOTE + weights + threshold
Know your costs — precision vs recall depends on business impact
The One-Sentence Summary
When 99% of your data is one class, your model becomes Gary the lazy security guard — 99% accurate, 0% useful. Fix it by making minority mistakes expensive, creating synthetic minorities, or changing how you measure success.
What's Next?
Now that you understand imbalanced datasets, you're ready for:
- Precision-Recall Curves — Finding the optimal threshold
- Cost-Sensitive Learning — Building business costs into your model
- Anomaly Detection Deep Dive — When imbalance is extreme
- Stratified Sampling — Preserving class ratios in splits
Follow me for the next article in this series!
Let's Connect!
If this saved your imbalanced model, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the most imbalanced dataset you've worked with? I've seen 1:100,000. Share your stories!
The difference between a fraud detection model that catches fraudsters and one that just says "everything is fine"? Understanding that 99% accuracy can mean 0% usefulness. Don't be Gary.
Share this with someone whose model has 99% accuracy but catches nothing. They need to meet Gary.
Happy balancing!