Sai Manohar

Everyone Says SMOTE. I Ran 240 Experiments to Find Out if That's True.

Every ML tutorial handles class imbalance the same way. Dataset imbalanced?
Apply SMOTE. Done. Next topic.

Nobody tests it. Nobody asks whether SMOTE actually helps or whether it just
feels like the responsible thing to do. It's become one of those default moves
people make without thinking — like adding dropout to every neural network or
scaling features before every model.

I got annoyed enough to actually test it.

What I built

A benchmark. 4 classifiers, 4 sampling strategies, 3 real datasets, 5-fold
cross-validation on every combination. 240 runs total. Every result stored in
PostgreSQL. Every claim tested with Wilcoxon signed-rank and Friedman tests
before I wrote it down.

Classifiers: Logistic Regression, Random Forest, XGBoost, KNN

Sampling strategies: Nothing (baseline), SMOTE, ADASYN, Random Undersampling

Datasets:

  • Credit Card Fraud — 284,807 transactions, 0.17% fraud
  • Mammography — 11,183 samples, 2.3% malignant
  • Phoneme — 5,404 samples, 9.1% minority class

Three different imbalance severities. Three different domains. If a pattern
shows up across all three, it's unlikely to be an artifact of one dataset.
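The benchmark loop itself is simple. Here's a minimal sketch using scikit-learn only; the synthetic dataset, the two-classifier list, and the hand-rolled undersampler are stand-ins (the real runs use imbalanced-learn's SMOTE and ADASYN on the three datasets above). The one rule that matters: resample the training fold only, never the test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def no_sampling(X, y):
    return X, y

def random_undersample(X, y, seed=0):
    # Downsample the majority class to match the minority count.
    minority = np.bincount(y).argmin()
    X_min, y_min = X[y == minority], y[y == minority]
    X_maj, y_maj = resample(X[y != minority], y[y != minority],
                            n_samples=len(y_min), random_state=seed)
    return np.vstack([X_min, X_maj]), np.concatenate([y_min, y_maj])

# Stand-in dataset: ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

classifiers = {"logreg": LogisticRegression(max_iter=1000),
               "rf": RandomForestClassifier(random_state=0)}
samplers = {"none": no_sampling, "undersample": random_undersample}

results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for clf_name, clf in classifiers.items():
    for samp_name, sampler in samplers.items():
        scores = []
        for train_idx, test_idx in cv.split(X, y):
            # Resample the training fold only -- never the test fold.
            X_tr, y_tr = sampler(X[train_idx], y[train_idx])
            clf.fit(X_tr, y_tr)
            scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]),
                                   zero_division=0))
        results[(clf_name, samp_name)] = np.mean(scores)

print(results)
```

Each (classifier, sampler) pair gets a mean cross-validated F1; the per-fold scores are what feed the statistical tests later.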

What I found

SMOTE didn't consistently help

On Credit Card Fraud, Logistic Regression with no sampling got F1: 0.7263.
Add SMOTE and it drops to 0.1499. That's not a rounding error — SMOTE made
it significantly worse on the hardest dataset.

Random Forest with no sampling: F1 0.8588. With SMOTE: 0.8565. Essentially
identical. The sampling strategy did almost nothing.

Random undersampling is lying to you

This is the one I keep coming back to.

Random Forest + undersampling on Credit Card Fraud:

  • AUC-ROC: 0.9777 ✓ looks great
  • F1: 0.1157 ✗ completely wrong
  • MCC: 0.2325 ✗ useless

Random Forest + no sampling:

  • AUC-ROC: 0.9497
  • F1: 0.8588
  • MCC: 0.8625

Same classifier. The AUC-ROC number went up. Everything else fell off a cliff.

If you only report AUC-ROC, which a lot of people do, you'd conclude
undersampling works well. It doesn't. The undersampled model flags far too
many legitimate transactions as fraud: the ranking stays good, so AUC-ROC
stays high, but precision collapses and takes F1 and MCC down with it.

This pattern held on Mammography and Phoneme too. Every time.

Classifier choice mattered more than anything else

Switching from Logistic Regression to Random Forest improved F1 more than
any sampling strategy — on every dataset. If you're spending time tuning
SMOTE parameters on a weak classifier, you're solving the wrong problem.

The differences are statistically real

Friedman test p < 0.0001 on all three datasets. The differences aren't noise.
Wilcoxon confirmed most pairwise comparisons too.

One interesting exception: Random Forest vs XGBoost on Mammography came out
at p=0.9563 on F1, meaning the test found no detectable difference between
them there. Sometimes the honest answer is "it doesn't matter which one you pick."
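Both tests are one-liners in SciPy. A toy sketch with made-up per-fold F1 scores (not the benchmark's actual numbers), matched across the same 5 folds:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-fold F1 scores for three strategies on the same 5 folds.
baseline = np.array([0.85, 0.86, 0.84, 0.87, 0.85])
smote    = np.array([0.84, 0.85, 0.85, 0.86, 0.84])
under    = np.array([0.12, 0.11, 0.13, 0.10, 0.12])

# Friedman: do the strategies differ at all across the matched folds?
stat_f, p_friedman = friedmanchisquare(baseline, smote, under)

# Wilcoxon signed-rank: pairwise follow-up on matched fold scores.
stat_w, p_wilcoxon = wilcoxon(baseline, under)

print(f"Friedman p={p_friedman:.4f}, Wilcoxon p={p_wilcoxon:.4f}")
```

Both are paired tests, which is exactly what matched per-fold scores call for; an unpaired t-test would throw away the fold structure.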

The metric problem

AUC-ROC measures whether your model ranks positives above negatives.
It doesn't care about threshold. It doesn't care whether your minority
class predictions are actually useful.

F1 and MCC both penalize you for missing the minority class. They're
harder to game. And they told a completely different story than AUC-ROC
in this experiment.
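The gap is easy to reproduce by hand. In this toy case (hypothetical scores, not the benchmark data), the model ranks every positive above every negative, so AUC-ROC is a perfect 1.0, yet every score sits below the 0.5 threshold, so nothing gets predicted positive and F1 and MCC are both 0:

```python
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

# Six negatives, two positives. The scores rank the positives above
# every negative, but all of them fall below the 0.5 threshold.
y_true  = [0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.30, 0.40]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print(roc_auc_score(y_true, y_score))                   # 1.0 -- perfect ranking
print(f1_score(y_true, y_pred, zero_division=0))        # 0.0 -- no positives predicted
print(matthews_corrcoef(y_true, y_pred))                # 0.0 -- no better than a constant
```

Same predictions, opposite verdicts, because AUC-ROC only looks at the ranking while F1 and MCC look at what you actually predicted.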

If I had only reported AUC-ROC, the conclusion would be:

"Sampling strategy doesn't matter much, undersampling is fine."

The actual conclusion:

"Undersampling destroys your ability to detect the minority class
while making your AUC-ROC look better."

Those are opposite findings from the same data. The metric chose the story.

The stack

  • Python, Scikit-learn, XGBoost, imbalanced-learn
  • PostgreSQL on Neon (free tier) — all results stored here
  • SciPy — Wilcoxon + Friedman tests
  • Streamlit — interactive dashboard, 4 tabs
  • Docker — one command to run everything

What I'd do differently

Borderline-SMOTE and SVM-SMOTE work differently from standard SMOTE and
might tell a different story. I want to test those next.

I also want to push the imbalance ratio below 0.1% to see where things
break down completely.

And I should have set up experiment logging from day one. I retrofitted
it halfway through and it cost me time.

The takeaway

SMOTE isn't wrong. It's just not automatically right. The real answer
depends on your classifier, your dataset, and which metric actually
matters for your problem.

Test it. Don't assume it.


GitHub: https://github.com/Sai-manohar695/ml-imbalance-benchmark

Portfolio: https://sai-manohar695.github.io
