The One-Line Summary: SMOTE creates synthetic minority samples by drawing lines between existing minority neighbors and placing new points along those lines. It's not cloning — it's controlled breeding of believable new examples.
The Endangered Species Problem
Dr. Elena runs a wildlife classification AI for a nature reserve.
Her dataset:
Species Count
─────────────────────────
Sheep 10,000
Cows 8,500
Horses 7,200
Dragons 47
Yes, dragons. This is a magical nature reserve.
Her model trains. Results:
Sheep accuracy: 99.2%
Cow accuracy: 98.8%
Horse accuracy: 97.5%
Dragon accuracy: 12.0% 💀
The model barely recognizes dragons. Why? Because with only 47 examples out of 25,747, it learned that it can basically ignore them. Predicting "not a dragon" is right about 99.8% of the time!
The Failed Solutions
Attempt 1: Photocopy the Dragons
"I'll just duplicate each dragon 200 times!"
Dragons: 47 → 9,400 (each copied ~200x)
The model trains again. Now it recognizes dragons! But...
It's memorized the exact 47 dragons. Show it a NEW dragon? Fails completely. The model overfit to the duplicates.
Cloning doesn't create diversity.
Attempt 2: Download More Dragons
"I'll find more dragon images online!"
But dragons are rare. There aren't more examples. That's the whole problem.
Attempt 3: The Creature Creation Lab
Dr. Elena has a wild idea.
"What if I don't COPY dragons... but CREATE new ones?"
She builds a genetics lab. The process:
- Take two similar dragons (neighbors in "dragon feature space")
- Blend their DNA at a random ratio
- Birth a NEW dragon that inherits traits from both parents
- This new dragon is DIFFERENT from both parents, but still believably a dragon
Parent Dragon A: Red scales, 40ft wingspan, fire breath
Parent Dragon B: Red scales, 35ft wingspan, fire breath
Offspring (70% A, 30% B):
- Red scales (both had it)
- 38.5ft wingspan (0.7 × 40 + 0.3 × 35)
- Fire breath (both had it)
A NEW dragon that never existed, but is totally realistic!
She creates 9,353 synthetic dragons. Now:
Dragons: 47 real + 9,353 synthetic = 9,400 total
The model trains. Dragon accuracy: 94.7%!
And it generalizes to NEW dragons because the synthetic ones added DIVERSITY, not just copies.
This is SMOTE.
Synthetic Minority Over-sampling TEchnique.
It doesn't clone. It breeds.
How SMOTE Actually Works
Let me show you the exact algorithm, step by step.
Step 1: Pick a Minority Sample
Start with any minority class example. Let's call it Point A.
Feature Space (2D for visualization):
↑
│ ● B
│
│ ● A ● C
│
│ ● D
└────────────────────→
A, B, C, D are all minority class (dragons)
Step 2: Find Its K Nearest Neighbors
Find the K closest minority samples to Point A. Default K=5.
A's nearest neighbors: B, C, D (let's say K=3 for simplicity)
↑
│ ● B ←── neighbor
│
│ ● A ● C ←── neighbor
│
│ ● D ←── neighbor
└────────────────────→
Step 3: Pick One Neighbor Randomly
Randomly select one of the neighbors. Let's pick Point B.
Selected pair: A and B
↑
│ ● B
│ ╱
│ ╱ ← This line!
│ ╱
│ ● A
│
└────────────────────→
Step 4: Draw a Line Between Them
Imagine a line connecting A to B.
Step 5: Place a New Point on That Line
Pick a random position along the line (random number between 0 and 1).
If random = 0.4:
New point = A + 0.4 × (B - A)
↑
│ ● B
│ ◐ ← NEW synthetic point!
│ ╱ (40% of the way from A to B)
│ ╱
│ ● A
│
└────────────────────→
Step 6: Repeat Until Balanced
Keep creating synthetic points until you have enough minority samples.
After many iterations:
↑
│ ◐ ● B ◐
│ ◐ ◐ ◐
│ ● A ◐ ● C
│ ◐ ◐ ◐
│ ◐ ● D
└────────────────────→
● = Original minority samples
◐ = Synthetic samples created by SMOTE
The Math (It's Simpler Than You Think)
For each synthetic sample:
X_synthetic = X_i + λ × (X_nn - X_i)
Where:
X_i = Original minority sample
X_nn = Randomly chosen nearest neighbor
λ = Random number between 0 and 1
That's it. Linear interpolation between neighbors.
Example with real numbers:
import numpy as np
# Original dragon (features: wingspan, weight, fire_temp)
dragon_A = np.array([40, 2000, 1500])  # 40ft, 2000kg, 1500°C fire
# Neighbor dragon
dragon_B = np.array([35, 1800, 1650])
# Random λ = 0.3
lambda_val = 0.3
# Synthetic dragon: linear interpolation between the two parents
synthetic = dragon_A + lambda_val * (dragon_B - dragon_A)
# = [40, 2000, 1500] + 0.3 * ([-5, -200, 150])
# = [40, 2000, 1500] + [-1.5, -60, 45]
# = [38.5, 1940, 1545]
print(synthetic)  # [  38.5 1940.  1545. ]
# New dragon: 38.5ft wingspan, 1940kg, 1545°C fire breath
# Believable! Falls between the two parents.
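Putting the six steps and that formula together, here's a minimal from-scratch sketch of the algorithm. The helper name smote_sketch and the toy dragon data are my own, purely for illustration; in practice you'd reach for imblearn, which is exactly what the next section does.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_synthetic, k=5, seed=None):
    """Minimal SMOTE sketch: interpolate between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # Step 2: fit k-NN on the minority samples only (k + 1 because each point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))      # Step 1: pick a random minority sample
        j = rng.choice(neighbor_idx[i][1:])    # Step 3: pick one of its k neighbors (skip itself)
        lam = rng.random()                     # Step 5: random position along the line
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Toy usage: turn 47 "dragons" into 47 + 9,353 = 9,400
dragons = np.random.randn(47, 3)               # stand-in for the 47 real dragons
new_dragons = smote_sketch(dragons, 9353, k=5, seed=42)
print(new_dragons.shape)                       # (9353, 3)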
SMOTE in Code
Basic SMOTE
from imblearn.over_sampling import SMOTE
import numpy as np
from collections import Counter
# Create imbalanced dataset
np.random.seed(42)
X = np.random.randn(1000, 5) # 1000 samples, 5 features
y = np.array([0] * 950 + [1] * 50) # 95% class 0, 5% class 1
print(f"Before SMOTE: {Counter(y)}")
# Counter({0: 950, 1: 50})
# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"After SMOTE: {Counter(y_resampled)}")
# Counter({0: 950, 1: 950})
print(f"Created {sum(y_resampled==1) - sum(y==1)} synthetic minority samples")
# Created 900 synthetic minority samples
Visualizing SMOTE
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# Create 2D imbalanced dataset for visualization
X, y = make_classification(
n_samples=200, n_features=2, n_informative=2,
n_redundant=0, weights=[0.9, 0.1], random_state=42
)
# Apply SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Before SMOTE
axes[0].scatter(X[y==0, 0], X[y==0, 1], c='blue', label=f'Majority ({sum(y==0)})', alpha=0.6)
axes[0].scatter(X[y==1, 0], X[y==1, 1], c='red', label=f'Minority ({sum(y==1)})', alpha=0.8, s=100)
axes[0].set_title('Before SMOTE', fontsize=14)
axes[0].legend()
# After SMOTE
axes[1].scatter(X_smote[y_smote==0, 0], X_smote[y_smote==0, 1], c='blue', label=f'Majority ({sum(y_smote==0)})', alpha=0.6)
# Original minority
original_minority = X[y==1]
axes[1].scatter(original_minority[:, 0], original_minority[:, 1], c='red', label=f'Original Minority ({sum(y==1)})', alpha=0.8, s=100)
# Synthetic minority (imblearn appends the new samples after the originals)
synthetic = X_smote[len(X):]
axes[1].scatter(synthetic[:, 0], synthetic[:, 1], c='orange', label=f'Synthetic ({len(synthetic)})', alpha=0.6, s=50, marker='*')
axes[1].set_title('After SMOTE', fontsize=14)
axes[1].legend()
plt.tight_layout()
plt.savefig('smote_visualization.png', dpi=150)
plt.show()
Visual Output:

BEFORE SMOTE
  Majority (blue): ●●●●●●●●
  Minority (red):  ●●

  ●●●●●●●●
  ●●●●●●●●●
  ●●●●●●●●  ●
  ●●●●●●●   ●
  ●●●●●●

AFTER SMOTE
  Majority (blue): ●●●●●●●●
  Original (red):  ●●
  Synthetic (★):   ★★★★★★★★★★★

  ●●●●●●●●
  ●●●●●●●●●
  ●●●●●●●●  ●★★
  ●●●●●●●  ★●★★
  ●●●●●●    ★★★★
             ★★★

Synthetic samples fill the minority region!
Why SMOTE Works Better Than Duplication
Let me prove it.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
import numpy as np
# Create imbalanced dataset
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=15,
weights=[0.95, 0.05], random_state=42
)
# Method 1: No resampling (baseline)
baseline_scores = cross_val_score(
LogisticRegression(max_iter=1000), X, y, cv=5, scoring='f1'
)
# Method 2: Random oversampling (duplication)
dup_pipeline = ImbPipeline([
('sampler', RandomOverSampler(random_state=42)),
('classifier', LogisticRegression(max_iter=1000))
])
dup_scores = cross_val_score(dup_pipeline, X, y, cv=5, scoring='f1')
# Method 3: SMOTE (synthetic generation)
smote_pipeline = ImbPipeline([
('sampler', SMOTE(random_state=42)),
('classifier', LogisticRegression(max_iter=1000))
])
smote_scores = cross_val_score(smote_pipeline, X, y, cv=5, scoring='f1')
print("F1 Scores (5-fold CV):")
print(f" Baseline (no resampling): {baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}")
print(f" Random Oversampling: {dup_scores.mean():.3f} ± {dup_scores.std():.3f}")
print(f" SMOTE: {smote_scores.mean():.3f} ± {smote_scores.std():.3f}")
Output:
F1 Scores (5-fold CV):
Baseline (no resampling): 0.421 ± 0.089
Random Oversampling: 0.502 ± 0.075
SMOTE: 0.548 ± 0.062
SMOTE beats both! It creates diversity that helps the model generalize.
The SMOTE Family Tree
SMOTE has many variants, each solving a specific problem.
Standard SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(
k_neighbors=5, # Number of neighbors to consider
random_state=42
)
Best for: General use, first thing to try.
Borderline-SMOTE
The idea: Only create synthetics near the decision boundary — that's where they help most!
from imblearn.over_sampling import BorderlineSMOTE
borderline = BorderlineSMOTE(
kind='borderline-1', # or 'borderline-2'
random_state=42
)
Regular SMOTE creates synthetics everywhere:

  Majority region
  ●●●●●●●●●●
  ●●●●●●●●●●
  ●●●●●●●●●●
  ──────────── boundary ────────
  ★★★★★★★★★★  ← synthetics everywhere
  ★★★★★★★★★★
  ○○○○○○○○○○  ← original minority
  ○○○○○○○○○○

Borderline-SMOTE focuses on the boundary:

  Majority region
  ●●●●●●●●●●
  ●●●●●●●●●●
  ●●●●●●●●●●
  ──────────── boundary ────────
  ★★★★★★★★★★  ← MORE synthetics here!
  ○○○○○○○○○○  ← original minority
  ○○○○○○○○○○     (no synthetics far from boundary)
Best for: When you want to strengthen the decision boundary.
ADASYN (Adaptive Synthetic Sampling)
The idea: Create MORE synthetics in regions where minority samples are harder to learn.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
Density of synthetic samples adapts to difficulty:
Easy region (minority surrounded by minority):
○○○○○○○ ← Few synthetics needed
○○○○○○○
○○○○○○○
Hard region (minority surrounded by majority):
●●●●●●●
●●○★★●● ← LOTS of synthetics here!
●●●●●●● Minority is outnumbered and harder to learn
Best for: When minority class has regions of varying difficulty.
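One practical side effect of this adaptivity: because the number of synthetics per sample depends on local difficulty, ADASYN typically ends up near balance rather than exactly balanced. A quick check on toy data (the 950/50 split is just for illustration; your counts will differ):
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)

print(Counter(ADASYN(random_state=42).fit_resample(X, y)[1]))  # close to 950/950, rarely exact
print(Counter(SMOTE(random_state=42).fit_resample(X, y)[1]))   # exactly Counter({0: 950, 1: 950})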
SMOTE-NC (Nominal + Continuous)
The idea: Standard SMOTE only works with numbers. SMOTE-NC handles mixed data!
from imblearn.over_sampling import SMOTENC
# Specify which columns are categorical
smote_nc = SMOTENC(
categorical_features=[0, 3, 7], # Indices of categorical columns
random_state=42
)
For categorical features, SMOTE-NC uses the most common category among neighbors instead of interpolating.
Best for: Datasets with both numerical and categorical features.
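A quick check on toy mixed data shows the point: the categorical column never gets interpolated into nonsense. (The "red"/"blue" column and the 200-row toy set are made up for illustration.)
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

# Toy mixed data: column 0 is categorical ("red"/"blue"), columns 1-2 are numeric
rng = np.random.RandomState(42)
X = np.empty((200, 3), dtype=object)
X[:, 0] = rng.choice(["red", "blue"], size=200)
X[:, 1] = rng.randn(200)
X[:, 2] = rng.randn(200)
y = np.array([0] * 180 + [1] * 20)

smote_nc = SMOTENC(categorical_features=[0], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
print(Counter(y_res))    # Counter({0: 180, 1: 180})
print(set(X_res[:, 0]))  # still only {'red', 'blue'}: no fractional "colors"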
SVMSMOTE
The idea: Use SVM to find the borderline, then create synthetics along it.
from imblearn.over_sampling import SVMSMOTE
svm_smote = SVMSMOTE(random_state=42)
Best for: When the decision boundary is complex.
K-Means SMOTE
The idea: Cluster minority samples first, then oversample within clusters.
from imblearn.over_sampling import KMeansSMOTE
kmeans_smote = KMeansSMOTE(
cluster_balance_threshold=0.1,
random_state=42
)
Best for: When minority class has distinct subgroups.
Comparing SMOTE Variants
from imblearn.over_sampling import (
SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN, KMeansSMOTE
)
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from imblearn.pipeline import Pipeline as ImbPipeline
# Create challenging imbalanced dataset
X, y = make_classification(
n_samples=2000, n_features=20, n_informative=15,
n_clusters_per_class=3, weights=[0.95, 0.05],
flip_y=0.1, random_state=42
)
variants = {
'SMOTE': SMOTE(random_state=42),
'Borderline-SMOTE': BorderlineSMOTE(random_state=42),
'SVM-SMOTE': SVMSMOTE(random_state=42),
'ADASYN': ADASYN(random_state=42),
'KMeans-SMOTE': KMeansSMOTE(random_state=42),
}
print("F1 Scores by SMOTE Variant:")
print("-" * 45)
for name, sampler in variants.items():
try:
pipeline = ImbPipeline([
('sampler', sampler),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
print(f"{name:<20}: {scores.mean():.3f} ± {scores.std():.3f}")
except Exception as e:
print(f"{name:<20}: Error - {str(e)[:30]}")
Output:
F1 Scores by SMOTE Variant:
---------------------------------------------
SMOTE : 0.542 ± 0.058
Borderline-SMOTE : 0.556 ± 0.051
SVM-SMOTE : 0.549 ± 0.063
ADASYN : 0.538 ± 0.071
KMeans-SMOTE : 0.561 ± 0.047
Different variants win for different datasets. Always experiment!
When SMOTE Fails
SMOTE isn't magic. It can fail spectacularly.
Failure 1: Noisy Data
If your minority samples include mislabeled examples (noise), SMOTE creates synthetics between them — amplifying the noise!
Actual situation:
●●●●●●●●●●
●●●●●●●●●● ○ ← Real minority
●●●●○●●●●● ✗ ← Mislabeled (actually majority!)
●●●●●●●●●●
SMOTE creates synthetics between ○ and ✗:
●●●●●●●●●●
●●●●●●●●●● ○ ★ ★ ★ ✗ ← Synthetic garbage!
●●●●●●●●●●
The synthetics are in majority territory!
Solution: Clean your data first, or use SMOTE-ENN (SMOTE + Edited Nearest Neighbors).
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
Failure 2: Small Sample Size
With very few minority samples, SMOTE's neighbors are too far apart. Synthetics become unrealistic.
Only 3 minority samples, far apart:
○ ○
★ ★ ★ ★ ← Synthetics in "no man's land"
○
These synthetics might not represent real minority patterns!
Rule of thumb: SMOTE needs at least 6+ minority samples per feature to work reliably. Fewer? Consider collecting more data or using a simpler approach.
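In fact, with the default k_neighbors=5, imblearn's SMOTE won't run at all on fewer than 6 minority samples. A quick illustration (the 96/4 toy split is made up, and the exact error message may vary between versions):
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.randn(100, 4)
y = np.array([0] * 96 + [1] * 4)  # only 4 minority samples

try:
    SMOTE(random_state=42).fit_resample(X, y)
except ValueError as e:
    print("SMOTE failed:", e)     # not enough minority neighbors for k_neighbors=5
# Possible workaround for tiny classes: SMOTE(k_neighbors=2), but expect unrealistic synthetics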
Failure 3: High Dimensionality
In high-dimensional space, "nearest neighbors" become meaningless (curse of dimensionality). SMOTE creates synthetics between points that aren't actually similar.
# High-dimensionality warning (a rough heuristic check)
n_features = X.shape[1]
n_minority_samples = (y == 1).sum()
if n_features > 50 and n_minority_samples < n_features * 5:
    print("Warning: SMOTE may not work well here!")
    print("Consider dimensionality reduction first (PCA)")
Solution: Apply PCA or feature selection before SMOTE.
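A minimal sketch of that wiring, assuming PCA with an arbitrary n_components=20 (tune this for your data) inside an imblearn pipeline so the reduction, the oversampling, and the classifier stay in one object:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Reduce dimensionality first so SMOTE's nearest neighbors are meaningful
pipeline = ImbPipeline([
    ('pca', PCA(n_components=20)),          # pick n_components < your feature count
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
# pipeline.fit(X_train, y_train) applies PCA, then SMOTE, then trains the classifier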
Failure 4: Categorical Features (Without SMOTE-NC)
Standard SMOTE interpolates numbers. Interpolating between "Red" and "Blue" gives you... 0.5? That's not a color!
# ❌ WRONG: Standard SMOTE on categorical data
X_with_categorical = ... # Has columns like "Color", "Size"
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_with_categorical, y)
# "Color" might become 0.73 — meaningless!
# ✅ RIGHT: Use SMOTE-NC
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X_with_categorical, y)
SMOTE Best Practices
1. Always Split Before SMOTE
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# ❌ WRONG: SMOTE before split = data leakage!
X_smote, y_smote = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote)
# Test set contains synthetics interpolated from training points!

# ✅ RIGHT: Split first, SMOTE only the training set
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_smote, y_train_smote = SMOTE().fit_resample(X_train, y_train)
model.fit(X_train_smote, y_train_smote)
model.predict(X_test)  # Test set is pristine!
2. Use imblearn Pipeline for Cross-Validation
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score
# This handles SMOTE correctly in each fold!
pipeline = ImbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
3. Don't Oversample to Exactly 50-50
Sometimes partial oversampling works better.
# Instead of balancing to 50-50
smote_full = SMOTE(sampling_strategy=1.0) # 1.0 = match majority
# Try partial oversampling
smote_partial = SMOTE(sampling_strategy=0.5) # 0.5 = half of majority
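To see what sampling_strategy=0.5 actually does to the class counts, here's a quick check on the same toy 950/50 split used earlier (illustrative numbers):
from collections import Counter
from imblearn.over_sampling import SMOTE
import numpy as np

np.random.seed(42)
X = np.random.randn(1000, 5)
y = np.array([0] * 950 + [1] * 50)

X_half, y_half = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
print(Counter(y_half))  # Counter({0: 950, 1: 475}): minority raised to half the majority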
4. Combine with Undersampling
SMOTE + Tomek Links = Best of both worlds.
from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
Tomek Links removes majority samples that are "too close" to minority ones, cleaning the boundary.
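If you print the class counts before and after that call (using, say, the toy 950/50 X and y from the previous snippet), you'll typically see something close to, but not exactly, a 50-50 balance, because the boundary cleaning drops a few samples:
from collections import Counter
print(Counter(y))            # e.g. Counter({0: 950, 1: 50}) on the toy data above
print(Counter(y_resampled))  # close to balanced; samples forming Tomek links near the boundary are dropped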
The Complete Decision Guide
START
│
▼
Do you have categorical features?
│
├── YES ──────────────────────────────► SMOTE-NC
│
└── NO
│
▼
Do you have noisy labels?
│
├── YES ──────────────────────────────► SMOTE-ENN or SMOTE-Tomek
│
└── NO
│
▼
Does minority class have distinct subgroups?
│
├── YES ──────────────────────────────► KMeans-SMOTE
│
└── NO
│
▼
Do you want to focus on the decision boundary?
│
├── YES ──────────────────────────────► Borderline-SMOTE
│
└── NO
│
▼
Is minority class harder to learn in some regions?
│
├── YES ──────────────────────────────► ADASYN
│
└── NO ───────────────────────────────► Standard SMOTE
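If you'd rather have that guide as code, here's a tiny helper that mirrors it (the function and flag names are my own, not part of imblearn):
from imblearn.over_sampling import (
    SMOTE, SMOTENC, BorderlineSMOTE, ADASYN, KMeansSMOTE
)
from imblearn.combine import SMOTEENN

def pick_sampler(categorical=False, noisy=False, subgroups=False,
                 boundary_focus=False, varying_difficulty=False,
                 categorical_features=None, random_state=42):
    """Mirror the decision guide above (names are mine, not imblearn's)."""
    if categorical:
        return SMOTENC(categorical_features=categorical_features, random_state=random_state)
    if noisy:
        return SMOTEENN(random_state=random_state)
    if subgroups:
        return KMeansSMOTE(random_state=random_state)
    if boundary_focus:
        return BorderlineSMOTE(random_state=random_state)
    if varying_difficulty:
        return ADASYN(random_state=random_state)
    return SMOTE(random_state=random_state)

sampler = pick_sampler(noisy=True)  # → SMOTEENN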
Quick Reference
| Variant | Use When |
|---|---|
| SMOTE | Default choice, general use |
| Borderline-SMOTE | Decision boundary matters most |
| ADASYN | Some minority regions are harder |
| SMOTE-NC | Mix of numerical + categorical |
| SVM-SMOTE | Complex decision boundary |
| KMeans-SMOTE | Minority has distinct clusters |
| SMOTE-ENN | Data has noisy labels |
| SMOTE-Tomek | Want cleaner boundaries |
Key Takeaways
SMOTE creates synthetic samples by interpolating between minority neighbors
It's not cloning — synthetics are NEW examples that add diversity
The math is simple: X_new = X_i + λ × (X_neighbor - X_i)
Always SMOTE after train-test split — never before!
Many variants exist — Borderline, ADASYN, SMOTE-NC for different situations
SMOTE can fail with noise, few samples, high dimensions, or categoricals
Combine with undersampling (SMOTE-Tomek) for cleaner results
Use imblearn Pipeline for proper cross-validation
The One-Sentence Summary
SMOTE is a creature creation lab that breeds NEW minority samples by blending the DNA of existing neighbors — giving your model the diverse examples it needs to actually learn the rare class.
What's Next?
Now that you understand SMOTE, you're ready for:
- ADASYN Deep Dive — Adaptive synthetic sampling
- Combining Over and Undersampling — SMOTE-Tomek, SMOTE-ENN
- Cost-Sensitive Learning — When resampling isn't enough
- Evaluation Metrics for Imbalanced Data — Beyond accuracy
Follow me for the next article in this series!
Let's Connect!
If this finally made SMOTE click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which SMOTE variant is your favorite? I'm a Borderline-SMOTE fan myself!
The difference between a model that recognizes 12% of dragons and one that recognizes 95%? Not photocopying the same 47 dragons — but breeding 9,000 new ones that inherit realistic dragon traits. SMOTE: where data science meets creature creation.
Share this with someone struggling with imbalanced data. They don't need more data — they need a creature creation lab.
Happy synthesizing! 🐉