<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sai Manohar</title>
    <description>The latest articles on DEV Community by Sai Manohar (@saimanohar695).</description>
    <link>https://dev.to/saimanohar695</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3863537%2F1813872e-2913-4ce1-81e2-2a8329142f43.png</url>
      <title>DEV Community: Sai Manohar</title>
      <link>https://dev.to/saimanohar695</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saimanohar695"/>
    <language>en</language>
    <item>
      <title>Everyone Says SMOTE. I Ran 240 Experiments to Find Out if That's True.</title>
      <dc:creator>Sai Manohar</dc:creator>
      <pubDate>Mon, 06 Apr 2026 09:09:06 +0000</pubDate>
      <link>https://dev.to/saimanohar695/everyone-says-smote-i-ran-240-experiments-to-find-out-if-thats-true-51o4</link>
      <guid>https://dev.to/saimanohar695/everyone-says-smote-i-ran-240-experiments-to-find-out-if-thats-true-51o4</guid>
      <description>&lt;p&gt;Every ML tutorial handles class imbalance the same way. Dataset imbalanced? &lt;br&gt;
Apply SMOTE. Done. Next topic.&lt;/p&gt;

&lt;p&gt;Nobody tests it. Nobody asks whether SMOTE actually helps or whether it just &lt;br&gt;
feels like the responsible thing to do. It's become one of those default moves &lt;br&gt;
people make without thinking — like adding dropout to every neural network or &lt;br&gt;
scaling features before every model.&lt;/p&gt;

&lt;p&gt;I got annoyed enough to actually test it.&lt;/p&gt;

&lt;h2&gt;What I built&lt;/h2&gt;

&lt;p&gt;A benchmark. 4 classifiers, 4 sampling strategies, 3 real datasets, 5-fold &lt;br&gt;
cross-validation on every combination. 240 runs total. Every result stored in &lt;br&gt;
PostgreSQL. Every claim tested with Wilcoxon signed-rank and Friedman tests &lt;br&gt;
before I wrote it down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classifiers:&lt;/strong&gt; Logistic Regression, Random Forest, XGBoost, KNN&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling strategies:&lt;/strong&gt; Nothing (baseline), SMOTE, ADASYN, Random Undersampling&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datasets:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credit Card Fraud — 284,807 transactions, 0.17% fraud&lt;/li&gt;
&lt;li&gt;Mammography — 11,183 samples, 2.3% malignant&lt;/li&gt;
&lt;li&gt;Phoneme — 5,404 samples, 9.1% minority class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three different imbalance severities. Three different domains. If a pattern &lt;br&gt;
shows up across all three, it's real.&lt;/p&gt;
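
&lt;p&gt;For concreteness, here is a minimal sketch of what that grid looks like in code. It assumes &lt;code&gt;X, y&lt;/code&gt; hold one of the datasets above; the names and hyperparameters are illustrative, not the repo's exact code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from xgboost import XGBClassifier

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgb": XGBClassifier(eval_metric="logloss", random_state=42),
    "knn": KNeighborsClassifier(),
}
samplers = {
    "none": None,
    "smote": SMOTE(random_state=42),
    "adasyn": ADASYN(random_state=42),
    "undersample": RandomUnderSampler(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["roc_auc", "f1", "matthews_corrcoef"]

results = []
for clf_name, clf in classifiers.items():
    for samp_name, sampler in samplers.items():
        # imblearn's Pipeline resamples only the training folds,
        # so every test fold keeps its real class distribution
        steps = [("clf", clf)] if sampler is None else [("sampler", sampler), ("clf", clf)]
        scores = cross_validate(Pipeline(steps), X, y, cv=cv, scoring=scoring)
        results.append((clf_name, samp_name, scores))
&lt;/code&gt;&lt;/pre&gt;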

&lt;h2&gt;What I found&lt;/h2&gt;

&lt;h3&gt;SMOTE didn't consistently help&lt;/h3&gt;

&lt;p&gt;On Credit Card Fraud, Logistic Regression with no sampling got F1: 0.7263. &lt;br&gt;
Add SMOTE and it drops to 0.1499. That's not a rounding error — SMOTE made &lt;br&gt;
it significantly worse on the hardest dataset.&lt;/p&gt;

&lt;p&gt;Random Forest with no sampling: F1 0.8588. With SMOTE: 0.8565. Essentially &lt;br&gt;
identical. The sampling strategy did almost nothing.&lt;/p&gt;

&lt;h3&gt;Random undersampling is lying to you&lt;/h3&gt;

&lt;p&gt;This is the one I keep coming back to.&lt;/p&gt;

&lt;p&gt;Random Forest + undersampling on Credit Card Fraud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AUC-ROC: &lt;strong&gt;0.9777&lt;/strong&gt; ✓ looks great&lt;/li&gt;
&lt;li&gt;F1: &lt;strong&gt;0.1157&lt;/strong&gt; ✗ completely wrong&lt;/li&gt;
&lt;li&gt;MCC: &lt;strong&gt;0.2325&lt;/strong&gt; ✗ useless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Random Forest + no sampling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AUC-ROC: 0.9497&lt;/li&gt;
&lt;li&gt;F1: &lt;strong&gt;0.8588&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;MCC: &lt;strong&gt;0.8625&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same classifier. The AUC-ROC number went up. Everything else fell off a cliff.&lt;/p&gt;

&lt;p&gt;If you only report AUC-ROC — which a lot of people do — you'd conclude &lt;br&gt;
undersampling works well. It doesn't. The undersampled model still ranks fraud &lt;br&gt;
above non-fraud, but at the default threshold it flags far too many legitimate &lt;br&gt;
transactions, so precision and F1 collapse. AUC-ROC never sees that, because it &lt;br&gt;
only scores the ranking.&lt;/p&gt;

&lt;p&gt;This pattern held on Mammography and Phoneme too. Every time.&lt;/p&gt;

&lt;h3&gt;Classifier choice mattered more than anything else&lt;/h3&gt;

&lt;p&gt;Switching from Logistic Regression to Random Forest improved F1 more than &lt;br&gt;
any sampling strategy — on every dataset. If you're spending time tuning &lt;br&gt;
SMOTE parameters on a weak classifier, you're solving the wrong problem.&lt;/p&gt;

&lt;h3&gt;The differences are statistically real&lt;/h3&gt;

&lt;p&gt;Friedman test p=0.0000 (below reporting precision) on all three datasets. The &lt;br&gt;
differences aren't noise. Wilcoxon confirmed most pairwise comparisons too.&lt;/p&gt;
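
&lt;p&gt;A sketch of those checks, assuming &lt;code&gt;f1_none&lt;/code&gt;, &lt;code&gt;f1_smote&lt;/code&gt;, &lt;code&gt;f1_adasyn&lt;/code&gt; and &lt;code&gt;f1_under&lt;/code&gt; hold the per-fold F1 scores for one classifier under each strategy, aligned by fold (the variable names are mine, not the repo's):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from scipy.stats import friedmanchisquare, wilcoxon

# Friedman: do the four sampling strategies differ at all?
stat, p_friedman = friedmanchisquare(f1_none, f1_smote, f1_adasyn, f1_under)
print(f"Friedman chi2={stat:.3f}, p={p_friedman:.4f}")

# Wilcoxon signed-rank: paired follow-up on one specific comparison
stat, p_wilcoxon = wilcoxon(f1_none, f1_smote)
print(f"Wilcoxon (none vs SMOTE): p={p_wilcoxon:.4f}")
&lt;/code&gt;&lt;/pre&gt;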

&lt;p&gt;One interesting exception: Random Forest vs XGBoost on Mammography was &lt;br&gt;
p=0.9563 on F1 — meaning there's no detectable difference between them there. &lt;br&gt;
Sometimes the honest answer is "it doesn't matter which one you pick."&lt;/p&gt;

&lt;h2&gt;The metric problem&lt;/h2&gt;

&lt;p&gt;AUC-ROC measures whether your model ranks positives above negatives. &lt;br&gt;
It doesn't care about threshold. It doesn't care whether your minority &lt;br&gt;
class predictions are actually useful.&lt;/p&gt;

&lt;p&gt;F1 and MCC both penalize you for missing the minority class. They're &lt;br&gt;
harder to game. And they told a completely different story than AUC-ROC &lt;br&gt;
in this experiment.&lt;/p&gt;
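
&lt;p&gt;The gap is easy to see on a held-out split. A minimal sketch, assuming a fitted pipeline &lt;code&gt;pipe&lt;/code&gt; and a test split &lt;code&gt;X_test, y_test&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from sklearn.metrics import roc_auc_score, f1_score, matthews_corrcoef

proba = pipe.predict_proba(X_test)[:, 1]   # scores used for ranking
preds = pipe.predict(X_test)               # hard labels at the default threshold

print("AUC-ROC:", roc_auc_score(y_test, proba))   # threshold-free ranking quality
print("F1:", f1_score(y_test, preds))             # punished for bad minority predictions
print("MCC:", matthews_corrcoef(y_test, preds))   # uses all four confusion-matrix cells
&lt;/code&gt;&lt;/pre&gt;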

&lt;p&gt;If I had only reported AUC-ROC, the conclusion would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Sampling strategy doesn't matter much, undersampling is fine."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual conclusion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Undersampling destroys your ability to detect the minority class &lt;br&gt;
while making your AUC-ROC look better."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are opposite findings from the same data. The metric chose the story.&lt;/p&gt;

&lt;h2&gt;The stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python, Scikit-learn, XGBoost, imbalanced-learn&lt;/li&gt;
&lt;li&gt;PostgreSQL on Neon (free tier) — all results stored here (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;SciPy — Wilcoxon + Friedman tests&lt;/li&gt;
&lt;li&gt;Streamlit — interactive dashboard, 4 tabs&lt;/li&gt;
&lt;li&gt;Docker — one command to run everything&lt;/li&gt;
&lt;/ul&gt;
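
&lt;p&gt;The results logging is nothing fancy. A hypothetical sketch (not the repo's actual schema), assuming the Neon connection string lives in a &lt;code&gt;DATABASE_URL&lt;/code&gt; environment variable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            id SERIAL PRIMARY KEY,
            dataset TEXT, classifier TEXT, sampler TEXT, fold INT,
            roc_auc DOUBLE PRECISION, f1 DOUBLE PRECISION, mcc DOUBLE PRECISION
        )
    """)
    # one row per (dataset, classifier, sampler, fold); values illustrative
    cur.execute(
        "INSERT INTO runs (dataset, classifier, sampler, fold, roc_auc, f1, mcc) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)",
        ("creditcard", "rf", "none", 1, 0.9497, 0.8588, 0.8625),
    )
&lt;/code&gt;&lt;/pre&gt;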

&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;p&gt;Borderline-SMOTE and SVM-SMOTE work differently from standard SMOTE and &lt;br&gt;
might tell a different story. I want to test those next.&lt;/p&gt;
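
&lt;p&gt;Both variants drop straight into the same pipeline, so extending the benchmark should be cheap. A sketch, reusing the &lt;code&gt;classifiers&lt;/code&gt;, &lt;code&gt;cv&lt;/code&gt; and &lt;code&gt;scoring&lt;/code&gt; objects from the grid sketch above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

extra_samplers = {
    "borderline_smote": BorderlineSMOTE(random_state=42),
    "svm_smote": SVMSMOTE(random_state=42),
}
for name, sampler in extra_samplers.items():
    pipe = Pipeline([("sampler", sampler), ("clf", classifiers["rf"])])
    scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
    print(name, scores["test_f1"].mean())
&lt;/code&gt;&lt;/pre&gt;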

&lt;p&gt;I also want to push the imbalance ratio below 0.1% to see where things &lt;br&gt;
break down completely.&lt;/p&gt;

&lt;p&gt;And I should have set up experiment logging from day one. I retrofitted &lt;br&gt;
it halfway through and it cost me time.&lt;/p&gt;

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;SMOTE isn't wrong. It's just not automatically right. The real answer &lt;br&gt;
depends on your classifier, your dataset, and which metric actually &lt;br&gt;
matters for your problem.&lt;/p&gt;

&lt;p&gt;Test it. Don't assume it.&lt;/p&gt;




&lt;p&gt;GitHub: &lt;a href="https://github.com/Sai-manohar695/ml-imbalance-benchmark" rel="noopener noreferrer"&gt;https://github.com/Sai-manohar695/ml-imbalance-benchmark&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Portfolio: &lt;a href="https://sai-manohar695.github.io" rel="noopener noreferrer"&gt;https://sai-manohar695.github.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>statistics</category>
    </item>
  </channel>
</rss>
