The One-Line Summary: SMOTE creates synthetic minority samples by drawing lines between existing minority neighbors and placing new points along those lines. It's not cloning — it's controlled breeding of believable new examples.
The Endangered Species Problem
Dr. Elena runs a wildlife classification AI for a nature reserve.
Her dataset:
Species Count
─────────────────────────
Sheep 10,000
Cows 8,500
Horses 7,200
Dragons 47
Yes, dragons. This is a magical nature reserve.
Her model trains. Results:
Sheep accuracy: 99.2%
Cow accuracy: 98.8%
Horse accuracy: 97.5%
Dragon accuracy: 12.0% 💀
The model barely recognizes dragons. Why? Because with only 47 examples out of 25,747, it learned that it can basically ignore them. Predicting "not a dragon" is right about 99.8% of the time!
The Failed Solutions
Attempt 1: Photocopy the Dragons
"I'll just duplicate each dragon 200 times!"
Dragons: 47 → 9,400 (each copied ~200x)
The model trains again. Now it recognizes dragons! But...
It's memorized the exact 47 dragons. Show it a NEW dragon? Fails completely. The model overfit to the duplicates.
Cloning doesn't create diversity.
Attempt 2: Download More Dragons
"I'll find more dragon images online!"
But dragons are rare. There aren't more examples. That's the whole problem.
Attempt 3: The Creature Creation Lab
Dr. Elena has a wild idea.
"What if I don't COPY dragons... but CREATE new ones?"
She builds a genetics lab. The process:
- Take two similar dragons (neighbors in "dragon feature space")
- Blend their DNA at a random ratio
- Birth a NEW dragon that inherits traits from both parents
- This new dragon is DIFFERENT from both parents, but still believably a dragon
Parent Dragon A: Red scales, 40ft wingspan, fire breath
Parent Dragon B: Red scales, 35ft wingspan, fire breath
Offspring (70% A, 30% B):
- Red scales (both had it)
- 38.5ft wingspan (0.7 × 40 + 0.3 × 35)
- Fire breath (both had it)
A NEW dragon that never existed, but is totally realistic!
She creates 9,353 synthetic dragons. Now:
Dragons: 47 real + 9,353 synthetic = 9,400 total
The model trains. Dragon accuracy: 94.7%!
And it generalizes to NEW dragons because the synthetic ones added DIVERSITY, not just copies.
This is SMOTE.
Synthetic Minority Over-sampling TEchnique.
It doesn't clone. It breeds.
How SMOTE Actually Works
Let me show you the exact algorithm, step by step.
Step 1: Pick a Minority Sample
Start with any minority class example. Let's call it Point A.
Feature Space (2D for visualization):
↑
│ ● B
│
│ ● A ● C
│
│ ● D
└────────────────────→
A, B, C, D are all minority class (dragons)
Step 2: Find Its K Nearest Neighbors
Find the K closest minority samples to Point A. Default K=5.
A's nearest neighbors: B, C, D (let's say K=3 for simplicity)
↑
│ ● B ←── neighbor
│
│ ● A ● C ←── neighbor
│
│ ● D ←── neighbor
└────────────────────→
Step 3: Pick One Neighbor Randomly
Randomly select one of the neighbors. Let's pick Point B.
Selected pair: A and B
↑
│ ● B
│ ╱
│ ╱ ← This line!
│ ╱
│ ● A
│
└────────────────────→
Step 4: Draw a Line Between Them
Imagine a line connecting A to B.
Step 5: Place a New Point on That Line
Pick a random position along the line (random number between 0 and 1).
If random = 0.4:
New point = A + 0.4 × (B - A)
↑
│ ● B
│ ◐ ← NEW synthetic point!
│ ╱ (40% of the way from A to B)
│ ╱
│ ● A
│
└────────────────────→
Step 6: Repeat Until Balanced
Keep creating synthetic points until you have enough minority samples.
After many iterations:
↑
│ ◐ ● B ◐
│ ◐ ◐ ◐
│ ● A ◐ ● C
│ ◐ ◐ ◐
│ ◐ ● D
└────────────────────→
● = Original minority samples
◐ = Synthetic samples created by SMOTE
The Math (It's Simpler Than You Think)
For each synthetic sample:
X_synthetic = X_i + λ × (X_nn - X_i)
Where:
X_i = Original minority sample
X_nn = Randomly chosen nearest neighbor
λ = Random number between 0 and 1
That's it. Linear interpolation between neighbors.
Example with real numbers:
import numpy as np
# Original dragon (features: wingspan, weight, fire_temp)
dragon_A = np.array([40, 2000, 1500])  # 40ft, 2000kg, 1500°C fire
# Neighbor dragon
dragon_B = np.array([35, 1800, 1650])
# Random λ = 0.3
lambda_val = 0.3
# Synthetic dragon: linear interpolation between the two parents
synthetic = dragon_A + lambda_val * (dragon_B - dragon_A)
# = [40, 2000, 1500] + 0.3 * ([-5, -200, 150])
# = [40, 2000, 1500] + [-1.5, -60, 45]
# = [38.5, 1940, 1545]
print(synthetic)  # [  38.5 1940.  1545. ]
# New dragon: 38.5ft wingspan, 1940kg, 1545°C fire breath
# Believable! Falls between the two parents.
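Putting the six steps and that formula together, here's a minimal from-scratch sketch of the algorithm. The helper name smote_sketch and the toy dragon data are my own, purely for illustration; in practice you'd reach for imblearn, which is exactly what the next section does.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_synthetic, k=5, seed=None):
    """Minimal SMOTE sketch: interpolate between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # Step 2: fit k-NN on the minority samples only (k + 1 because each point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))      # Step 1: pick a random minority sample
        j = rng.choice(neighbor_idx[i][1:])    # Step 3: pick one of its k neighbors (skip itself)
        lam = rng.random()                     # Step 5: random position along the line
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy usage: turn 47 "dragons" into 47 + 9,353 = 9,400
dragons = np.random.randn(47, 3)               # stand-in for the 47 real dragons
new_dragons = smote_sketch(dragons, 9353, k=5, seed=42)
print(new_dragons.shape)                       # (9353, 3)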
SMOTE in Code
Basic SMOTE
from imblearn.over_sampling import SMOTE
import numpy as np
from collections import Counter
# Create imbalanced dataset
np.random.seed(42)
X = np.random.randn(1000, 5) # 1000 samples, 5 features
y = np.array([0] * 950 + [1] * 50) # 95% class 0, 5% class 1
print(f"Before SMOTE: {Counter(y)}")
# Counter({0: 950, 1: 50})
# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_resampled)}")
# Counter({0: 950, 1: 950})
print(f"Created {sum(y_resampled==1) - sum(y==1)} synthetic minority samples")
# Created 900 synthetic minority samples
Visualizing SMOTE
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Create 2D imbalanced dataset for visualization
X, y = make_classification(
n_samples=200, n_features=2, n_informative=2,
n_redundant=0, weights=[0.9, 0.1], random_state=42
)
# Apply SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Before SMOTE
axes[0].scatter(X[y==0, 0], X[y==0, 1], c='blue', label=f'Majority ({sum(y==0)})', alpha=0.6)
axes[0].scatter(X[y==1, 0], X[y==1, 1], c='red', label=f'Minority ({sum(y==1)})', alpha=0.8, s=100)
axes[0].set_title('Before SMOTE', fontsize=14)
axes[0].legend()
# After SMOTE
axes[1].scatter(X_smote[y_smote==0, 0], X_smote[y_smote==0, 1], c='blue', label=f'Majority ({sum(y_smote==0)})', alpha=0.6)
# Original minority
original_minority = X[y==1]
axes[1].scatter(original_minority[:, 0], original_minority[:, 1], c='red', label=f'Original Minority ({sum(y==1)})', alpha=0.8, s=100)
# Synthetic minority (imblearn appends the new samples after the originals)
synthetic = X_smote[len(X):]
axes[1].scatter(synthetic[:, 0], synthetic[:, 1], c='orange', label=f'Synthetic ({len(synthetic)})', alpha=0.6, s=50, marker='*')
axes[1].set_title('After SMOTE', fontsize=14)
axes[1].legend()
plt.tight_layout()
plt.savefig('smote_visualization.png', dpi=150)
plt.show()
Visual Output:

BEFORE SMOTE
  Majority (blue): ●●●●●●●●
  Minority (red):  ●●

  ●●●●●●●●
  ●●●●●●●●●
  ●●●●●●●●  ●
  ●●●●●●●   ●
  ●●●●●●

AFTER SMOTE
  Majority (blue): ●●●●●●●●
  Original (red):  ●●
  Synthetic (★):   ★★★★★★★★★★★

  ●●●●●●●●
  ●●●●●●●●●
  ●●●●●●●●  ●★★
  ●●●●●●●  ★●★★
  ●●●●●●    ★★★★
             ★★★

Synthetic samples fill the minority region!
Why SMOTE Works Better Than Duplication
Let me prove it.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
import numpy as np
# Create imbalanced dataset
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=15,
weights=[0.95, 0.05], random_state=42
)
# Method 1: No resampling (baseline)
baseline_scores = cross_val_score(
LogisticRegression(max_iter=1000), X, y, cv=5, scoring='f1'
)
# Method 2: Random oversampling (duplication)
dup_pipeline = ImbPipeline([
('sampler', RandomOverSampler(random_state=42)),
('classifier', LogisticRegression(max_iter=1000))
])
dup_scores = cross_val_score(dup_pipeline, X, y, cv=5, scoring='f1')
# Method 3: SMOTE (synthetic generation)
smote_pipeline = ImbPipeline([
('sampler', SMOTE(random_state=42)),
('classifier', LogisticRegression(max_iter=1000))
])
smote_scores = cross_val_score(smote_pipeline, X, y, cv=5, scoring='f1')
print("F1 Scores (5-fold CV):")
print(f" Baseline (no resampling): {baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}")
print(f" Random Oversampling: {dup_scores.mean():.3f} ± {dup_scores.std():.3f}")
print(f" SMOTE: {smote_scores.mean():.3f} ± {smote_scores.std():.3f}")
Output:
F1 Scores (5-fold CV):
Baseline (no resampling): 0.421 ± 0.089
Random Oversampling: 0.502 ± 0.075
SMOTE: 0.548 ± 0.062
SMOTE beats both! It creates diversity that helps the model generalize.
The SMOTE Family Tree
SMOTE has many variants, each solving a specific problem.
Standard SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(
k_neighbors=5, # Number of neighbors to consider
random_state=42
)
Best for: General use, first thing to try.
Borderline-SMOTE
The idea: Only create synthetics near the decision boundary — that's where they help most!
from imblearn.over_sampling import BorderlineSMOTE
borderline = BorderlineSMOTE(
kind='borderline-1', # or 'borderline-2'
random_state=42
)
Regular SMOTE creates synthetics everywhere:

  Majority region
  ●●●●●●●●●●
  ●●●●●●●●●●
  ●●●●●●●●●●
  ──────────── boundary ────────
  ★★★★★★★★★★  ← synthetics everywhere
  ★★★★★★★★★★
  ○○○○○○○○○○  ← original minority
  ○○○○○○○○○○

Borderline-SMOTE focuses on the boundary:

  Majority region
  ●●●●●●●●●●
  ●●●●●●●●●●
  ●●●●●●●●●●
  ──────────── boundary ────────
  ★★★★★★★★★★  ← MORE synthetics here!
  ○○○○○○○○○○  ← original minority
  ○○○○○○○○○○     (no synthetics far from boundary)
Best for: When you want to strengthen the decision boundary.
ADASYN (Adaptive Synthetic Sampling)
The idea: Create MORE synthetics in regions where minority samples are harder to learn.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
Density of synthetic samples adapts to difficulty:
Easy region (minority surrounded by minority):
○○○○○○○ ← Few synthetics needed
○○○○○○○
○○○○○○○
Hard region (minority surrounded by majority):
●●●●●●●
●●○★★●● ← LOTS of synthetics here!
●●●●●●● Minority is outnumbered and harder to learn
Best for: When minority class has regions of varying difficulty.
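One practical side effect of this adaptivity: because the number of synthetics per sample depends on local difficulty, ADASYN typically ends up near balance rather than exactly balanced. A quick check on toy data (the 950/50 split is just for illustration; your counts will differ):
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)

print(Counter(ADASYN(random_state=42).fit_resample(X, y)[1]))  # close to 950/950, rarely exact
print(Counter(SMOTE(random_state=42).fit_resample(X, y)[1]))   # exactly Counter({0: 950, 1: 950})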
SMOTE-NC (Nominal + Continuous)
The idea: Standard SMOTE only works with numbers. SMOTE-NC handles mixed data!
from imblearn.over_sampling import SMOTENC
# Specify which columns are categorical
smote_nc = SMOTENC(
categorical_features=[0, 3, 7], # Indices of categorical columns
random_state=42
)
For categorical features, SMOTE-NC uses the most common category among neighbors instead of interpolating.
Best for: Datasets with both numerical and categorical features.
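A quick check on toy mixed data shows the point: the categorical column never gets interpolated into nonsense. (The "red"/"blue" column and the 200-row toy set are made up for illustration.)
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

# Toy mixed data: column 0 is categorical ("red"/"blue"), columns 1-2 are numeric
rng = np.random.RandomState(42)
X = np.empty((200, 3), dtype=object)
X[:, 0] = rng.choice(["red", "blue"], size=200)
X[:, 1] = rng.randn(200)
X[:, 2] = rng.randn(200)
y = np.array([0] * 180 + [1] * 20)

smote_nc = SMOTENC(categorical_features=[0], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
print(Counter(y_res))    # Counter({0: 180, 1: 180})
print(set(X_res[:, 0]))  # still only {'red', 'blue'}: no fractional "colors"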
SVMSMOTE
The idea: Use SVM to find the borderline, then create synthetics along it.
from imblearn.over_sampling import SVMSMOTE
svm_smote = SVMSMOTE(random_state=42)
Best for: When the decision boundary is complex.
K-Means SMOTE
The idea: Cluster minority samples first, then oversample within clusters.
from imblearn.over_sampling import KMeansSMOTE
kmeans_smote = KMeansSMOTE(
cluster_balance_threshold=0.1,
random_state=42
)
Best for: When minority class has distinct subgroups.
Comparing SMOTE Variants
from imblearn.over_sampling import (
SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN, KMeansSMOTE
)
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline as ImbPipeline
# Create challenging imbalanced dataset
X, y = make_classification(
n_samples=2000, n_features=20, n_informative=15,
n_clusters_per_class=3, weights=[0.95, 0.05],
flip_y=0.1, random_state=42
)
variants = {
'SMOTE': SMOTE(random_state=42),
'Borderline-SMOTE': BorderlineSMOTE(random_state=42),
'SVM-SMOTE': SVMSMOTE(random_state=42),
'ADASYN': ADASYN(random_state=42),
'KMeans-SMOTE': KMeansSMOTE(random_state=42),
}
print("F1 Scores by SMOTE Variant:")
print("-" * 45)
for name, sampler in variants.items():
try:
pipeline = ImbPipeline([
('sampler', sampler),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"{name:<20}: {scores.mean():.3f} ± {scores.std():.3f}")
except Exception as e:
print(f"{name:<20}: Error - {str(e)[:30]}")
Output:
F1 Scores by SMOTE Variant:
---------------------------------------------
SMOTE : 0.542 ± 0.058
Borderline-SMOTE : 0.556 ± 0.051
SVM-SMOTE : 0.549 ± 0.063
ADASYN : 0.538 ± 0.071
KMeans-SMOTE : 0.561 ± 0.047
Different variants win for different datasets. Always experiment!
When SMOTE Fails
SMOTE isn't magic. It can fail spectacularly.
Failure 1: Noisy Data
If your minority samples include mislabeled examples (noise), SMOTE creates synthetics between them — amplifying the noise!
Actual situation:
●●●●●●●●●●
●●●●●●●●●● ○ ← Real minority
●●●●○●●●●● ✗ ← Mislabeled (actually majority!)
●●●●●●●●●●
SMOTE creates synthetics between ○ and ✗:
●●●●●●●●●●
●●●●●●●●●● ○ ★ ★ ★ ✗ ← Synthetic garbage!
●●●●●●●●●●
The synthetics are in majority territory!
Solution: Clean your data first, or use SMOTE-ENN (SMOTE + Edited Nearest Neighbors).
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
Failure 2: Small Sample Size
With very few minority samples, SMOTE's neighbors are too far apart. Synthetics become unrealistic.
Only 3 minority samples, far apart:
○ ○
★ ★ ★ ★ ← Synthetics in "no man's land"
○
These synthetics might not represent real minority patterns!
Rule of thumb: SMOTE needs at least 6+ minority samples per feature to work reliably. Fewer? Consider collecting more data or using a simpler approach.
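In fact, with the default k_neighbors=5, imblearn's SMOTE won't run at all on fewer than 6 minority samples. A quick illustration (the 96/4 toy split is made up, and the exact error message may vary between versions):
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.randn(100, 4)
y = np.array([0] * 96 + [1] * 4)  # only 4 minority samples

try:
    SMOTE(random_state=42).fit_resample(X, y)
except ValueError as e:
    print("SMOTE failed:", e)     # not enough minority neighbors for k_neighbors=5
# Possible workaround for tiny classes: SMOTE(k_neighbors=2), but expect unrealistic synthetics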
Failure 3: High Dimensionality
In high-dimensional space, "nearest neighbors" become meaningless (curse of dimensionality). SMOTE creates synthetics between points that aren't actually similar.
# High-dimensionality warning (a rough heuristic check)
n_features = X.shape[1]
n_minority_samples = (y == 1).sum()
if n_features > 50 and n_minority_samples < n_features * 5:
    print("Warning: SMOTE may not work well here!")
    print("Consider dimensionality reduction first (PCA)")
Solution: Apply PCA or feature selection before SMOTE.
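A minimal sketch of that wiring, assuming PCA with an arbitrary n_components=20 (tune this for your data) inside an imblearn pipeline so the reduction, the oversampling, and the classifier stay in one object:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Reduce dimensionality first so SMOTE's nearest neighbors are meaningful
pipeline = ImbPipeline([
    ('pca', PCA(n_components=20)),          # pick n_components < your feature count
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
# pipeline.fit(X_train, y_train) applies PCA, then SMOTE, then trains the classifier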
Failure 4: Categorical Features (Without SMOTE-NC)
Standard SMOTE interpolates numbers. Interpolating between "Red" and "Blue" gives you... 0.5? That's not a color!
# ❌ WRONG: Standard SMOTE on categorical data
X_with_categorical = ... # Has columns like "Color", "Size"
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_with_categorical, y)
# "Color" might become 0.73 — meaningless!
# ✅ RIGHT: Use SMOTE-NC
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X_with_categorical, y)
SMOTE Best Practices
1. Always Split Before SMOTE
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# ❌ WRONG: SMOTE before split = data leakage!
X_smote, y_smote = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote)
# Test set contains synthetics interpolated from training points!

# ✅ RIGHT: Split first, SMOTE only the training set
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_smote, y_train_smote = SMOTE().fit_resample(X_train, y_train)
model.fit(X_train_smote, y_train_smote)
model.predict(X_test)  # Test set is pristine!
2. Use imblearn Pipeline for Cross-Validation
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score
# This handles SMOTE correctly in each fold!
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
3. Don't Oversample to Exactly 50-50
Sometimes partial oversampling works better.
# Instead of balancing to 50-50
smote_full = SMOTE(sampling_strategy=1.0) # 1.0 = match majority
# Try partial oversampling
smote_partial = SMOTE(sampling_strategy=0.5) # 0.5 = half of majority
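To see what sampling_strategy=0.5 actually does to the class counts, here's a quick check on the same toy 950/50 split used earlier (illustrative numbers):
from collections import Counter
from imblearn.over_sampling import SMOTE
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)

X_half, y_half = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
print(Counter(y_half))  # Counter({0: 950, 1: 475}): minority raised to half the majority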
4. Combine with Undersampling
SMOTE + Tomek Links = Best of both worlds.
from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
Tomek Links removes majority samples that are "too close" to minority ones, cleaning the boundary.
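If you print the class counts before and after that call (using, say, the toy 950/50 X and y from the previous snippet), you'll typically see something close to, but not exactly, a 50-50 balance, because the boundary cleaning drops a few samples:
from collections import Counter
print(Counter(y))            # e.g. Counter({0: 950, 1: 50}) on the toy data above
print(Counter(y_resampled))  # close to balanced; samples forming Tomek links near the boundary are dropped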
The Complete Decision Guide
START
│
▼
Do you have categorical features?
│
├── YES ──────────────────────────────► SMOTE-NC
│
└── NO
│
▼
Do you have noisy labels?
│
├── YES ──────────────────────────────► SMOTE-ENN or SMOTE-Tomek
│
└── NO
│
▼
Does minority class have distinct subgroups?
│
├── YES ──────────────────────────────► KMeans-SMOTE
│
└── NO
│
▼
Do you want to focus on the decision boundary?
│
├── YES ──────────────────────────────► Borderline-SMOTE
│
└── NO
│
▼
Is minority class harder to learn in some regions?
│
├── YES ──────────────────────────────► ADASYN
│
└── NO ───────────────────────────────► Standard SMOTE
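If you'd rather have that guide as code, here's a tiny helper that mirrors it (the function and flag names are my own, not part of imblearn):
from imblearn.over_sampling import (
    SMOTE, SMOTENC, BorderlineSMOTE, ADASYN, KMeansSMOTE
)
from imblearn.combine import SMOTEENN

def pick_sampler(categorical=False, noisy=False, subgroups=False,
                 boundary_focus=False, varying_difficulty=False,
                 categorical_features=None, random_state=42):
    """Mirror the decision guide above (names are mine, not imblearn's)."""
    if categorical:
        return SMOTENC(categorical_features=categorical_features, random_state=random_state)
    if noisy:
        return SMOTEENN(random_state=random_state)
    if subgroups:
        return KMeansSMOTE(random_state=random_state)
    if boundary_focus:
        return BorderlineSMOTE(random_state=random_state)
    if varying_difficulty:
        return ADASYN(random_state=random_state)
    return SMOTE(random_state=random_state)

sampler = pick_sampler(noisy=True)  # → SMOTEENN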
Quick Reference
| Variant | Use When |
|---|---|
| SMOTE | Default choice, general use |
| Borderline-SMOTE | Decision boundary matters most |
| ADASYN | Some minority regions are harder |
| SMOTE-NC | Mix of numerical + categorical |
| SVM-SMOTE | Complex decision boundary |
| KMeans-SMOTE | Minority has distinct clusters |
| SMOTE-ENN | Data has noisy labels |
| SMOTE-Tomek | Want cleaner boundaries |
Key Takeaways
SMOTE creates synthetic samples by interpolating between minority neighbors
It's not cloning — synthetics are NEW examples that add diversity
The math is simple: X_new = X_i + λ × (X_neighbor - X_i)
Always SMOTE after train-test split — never before!
Many variants exist — Borderline, ADASYN, SMOTE-NC for different situations
SMOTE can fail with noise, few samples, high dimensions, or categoricals
Combine with undersampling (SMOTE-Tomek) for cleaner results
Use imblearn Pipeline for proper cross-validation
The One-Sentence Summary
SMOTE is a creature creation lab that breeds NEW minority samples by blending the DNA of existing neighbors — giving your model the diverse examples it needs to actually learn the rare class.
What's Next?
Now that you understand SMOTE, you're ready for:
- ADASYN Deep Dive — Adaptive synthetic sampling
- Combining Over and Undersampling — SMOTE-Tomek, SMOTE-ENN
- Cost-Sensitive Learning — When resampling isn't enough
- Evaluation Metrics for Imbalanced Data — Beyond accuracy
Follow me for the next article in this series!
Let's Connect!
If this finally made SMOTE click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which SMOTE variant is your favorite? I'm a Borderline-SMOTE fan myself!
The difference between a model that recognizes 12% of dragons and one that recognizes 95%? Not photocopying the same 47 dragons — but breeding 9,000 new ones that inherit realistic dragon traits. SMOTE: where data science meets creature creation.
Share this with someone struggling with imbalanced data. They don't need more data — they need a creature creation lab.
Happy synthesizing! 🐉