<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mickaël Andrieu</title>
    <description>The latest articles on DEV Community by Mickaël Andrieu (@mickael__andrieu).</description>
    <link>https://dev.to/mickael__andrieu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3176944%2F397c029c-d8c8-4e9e-939a-c02f3b5d0404.png</url>
      <title>DEV Community: Mickaël Andrieu</title>
      <link>https://dev.to/mickael__andrieu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mickael__andrieu"/>
    <language>en</language>
    <item>
      <title>"LLMs Can Do Everything": Autopsy of a Myth</title>
      <dc:creator>Mickaël Andrieu</dc:creator>
      <pubDate>Tue, 06 Jan 2026 02:01:32 +0000</pubDate>
      <link>https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika</link>
      <guid>https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika</guid>
      <description>&lt;p&gt;&lt;em&gt;We tested this belief on 19 use cases, using 3 OpenAI GPT models and basic Machine Learning algorithms, for a total of 570 experiments. Here's the truth no one wants to hear.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Prologue: The Promise
&lt;/h2&gt;

&lt;p&gt;January 2026. In the hallways of tech companies, a conviction has settled in as self-evident: &lt;strong&gt;LLMs make classical Machine Learning obsolete&lt;/strong&gt;. Why bother with sklearn pipelines when GPT can do everything in one line of code?&lt;/p&gt;

&lt;p&gt;We wanted to test this hypothesis. Not with opinions. With data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The protocol&lt;/strong&gt;: 19 real-world use cases, from spam detection to housing price prediction. 3 OpenAI models (GPT-4o-mini, GPT-5-nano, GPT-4.1-nano). A deliberately simple sklearn baseline (TF-IDF + Logistic Regression, RandomForest). 30 iterations per experiment, Monte Carlo cross-validation, rigorous statistical tests.&lt;/p&gt;
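&lt;p&gt;To picture the baseline side of that protocol, here is a minimal sketch of a TF-IDF + Logistic Regression pipeline of the kind described; the toy texts and labels are ours, not the benchmark's data.&lt;/p&gt;

```python
# A deliberately simple sklearn baseline: TF-IDF features into Logistic
# Regression. The four toy SMS messages below are illustrative only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "free prize, click now",
    "see you at the meeting",
    "win cash today",
    "lunch tomorrow?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy data)

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(texts, labels)

# Likely flagged as spam, given the shared vocabulary with the spam examples.
print(baseline.predict(["claim your free cash prize"]))
```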

&lt;p&gt;The verdict? It will surprise you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymqz2g6ivwq4n3iufm37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymqz2g6ivwq4n3iufm37.png" alt="Effect Sizes Overview" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ACT I: THE ILLUSION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Expected Triumph
&lt;/h3&gt;

&lt;p&gt;Let's start with what works. And it works spectacularly well.&lt;/p&gt;

&lt;h4&gt;
  
  
  When the LLM Crushes Everything
&lt;/h4&gt;

&lt;p&gt;On sentiment analysis, GenAI doesn't just win. It &lt;strong&gt;dominates&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GPT-4o-mini&lt;/th&gt;
&lt;th&gt;Gain&lt;/th&gt;
&lt;th&gt;Effect Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/cardiffnlp/tweet_eval" rel="noopener noreferrer"&gt;Twitter Sentiment&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.367&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.690&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+88%&lt;/td&gt;
&lt;td&gt;d=-13.6 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/imdb" rel="noopener noreferrer"&gt;IMDB Reviews&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.784&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.938&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+20%&lt;/td&gt;
&lt;td&gt;d=-7.5 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/yassiracharki/Amazon_Reviews_for_Sentiment_Analysis_fine_grained_5_classes" rel="noopener noreferrer"&gt;Amazon Reviews&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.377&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.608&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+61%&lt;/td&gt;
&lt;td&gt;d=-10.5 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/Yelp/yelp_review_full" rel="noopener noreferrer"&gt;Yelp Reviews&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.428&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.665&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+55%&lt;/td&gt;
&lt;td&gt;d=-11.0 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A Cohen's d of -13.6 on Twitter Sentiment. For perspective: in social sciences, an effect is considered "large" when |d| &amp;gt; 0.8. Here, we're &lt;strong&gt;17 times beyond the threshold&lt;/strong&gt;.&lt;/p&gt;
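&lt;p&gt;As a reminder, Cohen's d is just the difference in mean scores divided by the pooled standard deviation. A sketch with invented per-iteration scores (the benchmark's raw scores are not reproduced here); because the iteration scores are tightly clustered, even a moderate mean gap yields a strongly negative d:&lt;/p&gt;

```python
# Cohen's d: difference of means over the pooled sample standard deviation.
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.stdev(a) ** 2
                  + (nb - 1) * statistics.stdev(b) ** 2) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Illustrative F1 samples across iterations (not the benchmark's data).
sklearn_f1 = [0.360, 0.370, 0.368, 0.365, 0.372]
llm_f1 = [0.690, 0.688, 0.692, 0.691, 0.689]

print(cohens_d(sklearn_f1, llm_f1))  # strongly negative
```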

&lt;p&gt;The LLM understands irony, sarcasm, cultural nuances. It knows that "This movie was... something else" isn't a compliment. TF-IDF only sees words.&lt;/p&gt;

&lt;h4&gt;
  
  
  Spam Detection: A Massacre
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SMS Spam Detection
------------------
sklearn (TF-IDF + LogReg) : 0.581 F1
GPT-4o-mini               : 0.965 F1

Difference : +66%
p-value    : 3.03e-26 (25 zeros after the decimal point)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With just one few-shot example, the LLM already reaches 0.90 F1. sklearn needs thousands of examples to approach this score.&lt;/p&gt;
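&lt;p&gt;The one-shot setup can be pictured like this; the prompt wording and labels below are illustrative, not the benchmark's actual prompts:&lt;/p&gt;

```python
# Assembling a one-shot classification prompt: task description, a single
# labeled example, then the text to classify. Wording here is our sketch.
def build_prompt(task, examples, text):
    lines = [task]
    for ex_text, ex_label in examples:
        lines.append(f"Text: {ex_text}")
        lines.append(f"Label: {ex_label}")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_prompt(
    "Classify each SMS as spam or ham.",
    [("WIN a free holiday, reply now!", "spam")],  # the single few-shot example
    "Are we still on for dinner tonight?",
)
print(prompt)
```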

&lt;h4&gt;
  
  
  Semantic Extraction: Where sklearn Surrenders
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GPT-4o-mini&lt;/th&gt;
&lt;th&gt;Multiplier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/kpcrowd" rel="noopener noreferrer"&gt;KPCrowd&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.030&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.419&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/duc2001" rel="noopener noreferrer"&gt;DUC2001&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.078&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.303&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/krapivin" rel="noopener noreferrer"&gt;Krapivin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.046&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.223&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On keyword extraction, GPT-4o-mini is &lt;strong&gt;14 times better&lt;/strong&gt; than the baseline. Cohen's d = -38.7. We're leaving the realm of statistics to enter the absurd.&lt;/p&gt;

&lt;p&gt;At this point in our analysis, the myth seemed to confirm itself. GenAI dominated 12 use cases out of 19. 63% win rate. Classical ML appeared doomed.&lt;/p&gt;

&lt;p&gt;And then we looked at the regression data.&lt;/p&gt;




&lt;h2&gt;
  
  
  ACT II: THE CONFRONTATION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The First Shock: &lt;a href="https://huggingface.co/datasets/GonzaloA/fake_news" rel="noopener noreferrer"&gt;Fake News&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Before even reaching regression, a first warning signal.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fake News Detection
-------------------
sklearn     : 0.884 F1
GPT-4o-mini : 0.822 F1

sklearn wins.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strange. Why does the LLM, which excels everywhere else in classification, fail here?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis&lt;/strong&gt;: Fake news detection relies less on semantic understanding than on &lt;strong&gt;subtle statistical patterns&lt;/strong&gt;. Word frequencies, unusual syntactic structures, stylistic markers. TF-IDF captures these signals better than an LLM looking for "meaning".&lt;/p&gt;

&lt;p&gt;First crack in the myth. But this was just an appetizer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cataclysm: Negative R² Scores
&lt;/h3&gt;

&lt;p&gt;Brace yourself. What follows is the heart of our discovery.&lt;/p&gt;

&lt;h4&gt;
  
  
  Housing Price Prediction
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn R²     : 0.710 (explains 71% of variance)
GPT-4o-mini R² : -6,536,549,708
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You read that right. &lt;strong&gt;Negative six and a half billion&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An R² of 0 means "as good as predicting the mean". A negative R² means "worse than the mean". An R² of -6.5 billion means the model produces predictions &lt;strong&gt;so aberrant&lt;/strong&gt; that squared residuals explode toward infinity.&lt;/p&gt;

&lt;p&gt;When we asked GPT-4o-mini to predict a housing price, it responded with things like "$45" or "$999,999,999,999". Not because it's stupid. Because it's not designed for this type of task.&lt;/p&gt;
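&lt;p&gt;The mechanics are easy to reproduce: R² compares squared residuals against the variance of the target, so one absurd prediction dominates everything. A sketch with invented housing prices:&lt;/p&gt;

```python
# R2 = 1 - SSE/SST: squared residuals grow quadratically with the error,
# so a single wild prediction sends the score hugely negative.
# All prices below are illustrative.
from sklearn.metrics import r2_score

y_true = [250_000, 310_000, 480_000, 520_000, 395_000]
sane = [260_000, 300_000, 470_000, 500_000, 400_000]
wild = [260_000, 300_000, 470_000, 45, 999_999_999_999]  # two absurd outputs

print(r2_score(y_true, sane))  # close to 1
print(r2_score(y_true, wild))  # hugely negative
```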

&lt;h4&gt;
  
  
  Wine Quality Prediction
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn R²     : 0.405
GPT-4o-mini R² : -379,022
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Negative three hundred seventy-nine thousand. To predict a wine score between 1 and 10.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Horror Table
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;sklearn R²&lt;/th&gt;
&lt;th&gt;GenAI R²&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.openml.org/d/42225" rel="noopener noreferrer"&gt;Diamond Price&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.926&lt;/td&gt;
&lt;td&gt;0.810&lt;/td&gt;
&lt;td&gt;GenAI correct but inferior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/VarunKumarGupta2003/Car-Price-Dataset" rel="noopener noreferrer"&gt;Car Price&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;-1.79&lt;/td&gt;
&lt;td&gt;GenAI failing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/hugginglearners/data-science-job-salaries" rel="noopener noreferrer"&gt;Salary&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.346&lt;/td&gt;
&lt;td&gt;0.208&lt;/td&gt;
&lt;td&gt;GenAI weak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/codesignal/wine-quality" rel="noopener noreferrer"&gt;Wine Quality&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.405&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-379,022&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Catastrophic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/gvlassis/california_housing" rel="noopener noreferrer"&gt;Housing Price&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.710&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-6,536,549,708&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apocalyptic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;sklearn wins in regression: 5/5. That's 100%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4r55tgixhxxu0bk4xlom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4r55tgixhxxu0bk4xlom.png" alt="The Regression Wall" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why This Disaster?
&lt;/h4&gt;

&lt;p&gt;LLMs are language models. They predict the &lt;strong&gt;most probable next token&lt;/strong&gt;, not a continuous numerical value.&lt;/p&gt;

&lt;p&gt;When you ask "What's the price of this house?", the LLM generates a &lt;strong&gt;linguistically plausible&lt;/strong&gt; response, not a &lt;strong&gt;mathematically correct&lt;/strong&gt; one. It produces numbers that "sound right": round amounts, typical prices it saw in its training corpus.&lt;/p&gt;

&lt;p&gt;But regression demands &lt;strong&gt;numerical precision&lt;/strong&gt;. A 10% error on a housing price is $50,000. A 1% error across 10,000 predictions is millions in cumulative error.&lt;/p&gt;

&lt;p&gt;The LLM can't do this. It's not a flaw. It's an &lt;strong&gt;architectural characteristic&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GPT-5-nano Paradox
&lt;/h3&gt;

&lt;p&gt;Here's an even more troubling discovery.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;th&gt;Win Rate vs sklearn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5-nano (reasoning)&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5-nano, with its additional "reasoning" capabilities, performs worse than GPT-4o-mini on our ML benchmarks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On &lt;a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14" rel="noopener noreferrer"&gt;DBpedia&lt;/a&gt; (14-class classification):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o-mini: &lt;strong&gt;0.964 F1&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-5-nano: 0.792 F1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "smarter" model loses by 17 points. How do we explain this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis&lt;/strong&gt;: GPT-5-nano's reasoning capabilities are optimized for complex tasks (mathematics, logic, planning). For standard text classification, this "reasoning" adds noise rather than value. The model "thinks too much" when it should just classify.&lt;/p&gt;

&lt;p&gt;More expensive doesn't mean better. More sophisticated doesn't mean more suitable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypkdyoo4x9tc2j2twfix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypkdyoo4x9tc2j2twfix.png" alt="The Reasoning Paradox" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ACT III: THE WISDOM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Law of the First Example
&lt;/h3&gt;

&lt;p&gt;Amid this confrontation, a ray of hope for GenAI practitioners.&lt;/p&gt;

&lt;p&gt;Our few-shot analysis reveals a striking law:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transition&lt;/th&gt;
&lt;th&gt;Average Gain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 -&amp;gt; 1 example&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+55.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 -&amp;gt; 3 examples&lt;/td&gt;
&lt;td&gt;+3.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 -&amp;gt; 5 examples&lt;/td&gt;
&lt;td&gt;+2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 -&amp;gt; 10 examples&lt;/td&gt;
&lt;td&gt;+1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 -&amp;gt; 50 examples&lt;/td&gt;
&lt;td&gt;+4.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The first example alone delivers a +55.9% average gain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Going from 0-shot to 50-shot multiplies your cost by 37x but only adds 28% additional performance beyond the first example.&lt;/p&gt;

&lt;p&gt;On DBpedia (14 classes), however, multiple examples remain crucial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0-shot: 0.321 F1&lt;/li&gt;
&lt;li&gt;1-shot: 0.650 F1 (+102%)&lt;/li&gt;
&lt;li&gt;50-shot: 0.937 F1 (+192% vs zero-shot)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pragmatic rule&lt;/strong&gt;: For simple tasks (&amp;lt;6 classes), 1-3 examples suffice. For complex taxonomies (&amp;gt;10 classes), invest in 20-50 examples.&lt;/p&gt;
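&lt;p&gt;Encoded as a hypothetical helper, the rule might look like this; the thresholds restate the text, while the middle tier and the function itself are our own sketch:&lt;/p&gt;

```python
# Restates the pragmatic rule above: under 6 classes, a few shots suffice;
# over 10 classes, invest in many. The in-between tier is our own choice.
def recommended_shots(n_classes):
    tiers = [(6, 3), (11, 10), (10**9, 50)]
    for limit, shots in tiers:
        if n_classes in range(limit):
            return shots

print(recommended_shots(3), recommended_shots(14))  # prints: 3 50
```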

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt8v2il0ku9sdsnwsdyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt8v2il0ku9sdsnwsdyp.png" alt="The First Example Curve" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Matrix
&lt;/h3&gt;

&lt;p&gt;After 570 experiments, here's what we know:&lt;/p&gt;

&lt;h4&gt;
  
  
  Use GenAI when...
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Justification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic task&lt;/td&gt;
&lt;td&gt;Sentiment, emotion&lt;/td&gt;
&lt;td&gt;The LLM understands meaning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few labeled data&lt;/td&gt;
&lt;td&gt;Prototype, MVP&lt;/td&gt;
&lt;td&gt;1 example = 90% of the way&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concept extraction&lt;/td&gt;
&lt;td&gt;Keywords, entities&lt;/td&gt;
&lt;td&gt;The LLM sees beyond n-grams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency not critical&lt;/td&gt;
&lt;td&gt;Batch processing&lt;/td&gt;
&lt;td&gt;807x slower but more accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Use sklearn when...
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Justification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Numerical prediction&lt;/td&gt;
&lt;td&gt;Prices, scores&lt;/td&gt;
&lt;td&gt;R² &amp;gt; 0 guaranteed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statistical pattern detection&lt;/td&gt;
&lt;td&gt;Fraud, fake news&lt;/td&gt;
&lt;td&gt;TF-IDF captures subtle signals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Massive volume&lt;/td&gt;
&lt;td&gt;1M+ predictions/day&lt;/td&gt;
&lt;td&gt;Cost = $0 vs $500/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Critical latency&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;807x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvggmsmnobwo4r9up7g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvggmsmnobwo4r9up7g9.png" alt="The Decision Matrix" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cost of the Illusion
&lt;/h3&gt;

&lt;p&gt;Let's talk money.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total benchmark cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$9.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total time&lt;/td&gt;
&lt;td&gt;8.1 minutes&lt;/td&gt;
&lt;td&gt;6,533 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed ratio&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;td&gt;807x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In production:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Daily Volume&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GenAI (GPT-4o-mini)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000 requests&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;~$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000 requests&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;~$50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000 requests&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One million predictions per day: $0 vs $182,500/year.&lt;/p&gt;
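&lt;p&gt;The yearly figure follows directly from the table, taking the assumed GPT-4o-mini rate of $0.50 per 1,000 requests:&lt;/p&gt;

```python
# Back-of-the-envelope check of the scaling above.
cost_per_1000 = 0.50          # assumed rate from the table
daily = 1_000_000 / 1000 * cost_per_1000
yearly = daily * 365

print(daily, yearly)  # 500.0 182500.0
```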

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gd01vp463xtpojycf2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gd01vp463xtpojycf2p.png" alt="Cost at Scale" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that's just API cost. Add latency (user response time), external service dependency (outage = downtime), data confidentiality (your texts go to OpenAI).&lt;/p&gt;




&lt;h3&gt;
  
  
  Epilogue: The New Map
&lt;/h3&gt;

&lt;p&gt;The myth "LLMs can do everything" is false. But reality is more interesting than a simple classical ML victory.&lt;/p&gt;

&lt;h4&gt;
  
  
  What We Learned
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. LLMs are champions of semantics, not numerics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They understand meaning, nuances, context. They fail as soon as you ask for mathematical precision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Sophistication doesn't mean suitability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPT-5-nano with its "reasoning" loses to GPT-4o-mini on simple tasks. The most advanced tool isn't always the right tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The first few-shot example is magical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;56% gain for a single example. It's the best ROI in all of Machine Learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Regression remains sklearn's undisputed domain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;R² = -6.5 billion isn't a bug. It's a fundamental characteristic of transformer architectures applied where they don't belong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. GenAI's hidden costs are real.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;807x slower. $182k/year for 1M daily predictions. External dependency. These factors often disappear from POC evaluations.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Question to Ask
&lt;/h4&gt;

&lt;p&gt;Before every project, ask yourself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does my task require understanding meaning, or predicting numbers?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it's meaning: GenAI will probably excel.&lt;br&gt;
If it's numbers: sklearn will probably win.&lt;br&gt;
If it's both: build a hybrid architecture.&lt;/p&gt;
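&lt;p&gt;A deliberately naive sketch of that decision; the branch labels are our own shorthand:&lt;/p&gt;

```python
# Restates the meaning-vs-numbers rule above as a routing function.
def choose_engine(needs_meaning, needs_numbers):
    if needs_meaning and needs_numbers:
        return "hybrid (LLM features feeding a sklearn model)"
    if needs_numbers:
        return "sklearn"
    return "GenAI"

print(choose_engine(True, False))  # GenAI
print(choose_engine(False, True))  # sklearn
```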




&lt;h3&gt;
  
  
  Methodology
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Monte Carlo Cross-Validation, 30 iterations per configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical tests&lt;/strong&gt;: Paired t-test, Wilcoxon signed-rank, Bootstrap CI (10,000 resamples)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: F1-score (classification/extraction), R² (regression)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn baseline&lt;/strong&gt;: TF-IDF + LogisticRegression (classification), RandomForest (regression)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GenAI models&lt;/strong&gt;: GPT-4o-mini, GPT-5-nano (reasoning medium), GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot&lt;/strong&gt;: 20 examples by default, 0-50 examples analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Acknowledged limitation&lt;/strong&gt;: Our sklearn baseline is deliberately simple. More sophisticated approaches (sentence-transformers, XGBoost, fine-tuning) would likely narrow the gap on some tasks. This benchmark measures "GenAI vs accessible ML", not "GenAI vs SOTA".&lt;/p&gt;




&lt;h3&gt;
  
  
  This Series
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="//./01-one-shot-revolution.md"&gt;LLMs for Classification: One Example is All You Need&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//./02-classical-ml-crushes-regression.md"&gt;Why Classical ML Still Crushes GenAI at Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Myth Confronted with Reality&lt;/strong&gt; (this article)&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Classical ML isn't dead. It's never been more relevant. The real innovation isn't replacing sklearn with GPT — it's knowing when to use each.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>LLMs for Classification: One Example is All You Need</title>
      <dc:creator>Mickaël Andrieu</dc:creator>
      <pubDate>Tue, 06 Jan 2026 00:06:21 +0000</pubDate>
      <link>https://dev.to/mickael__andrieu/llms-for-classification-one-example-is-all-you-need-19al</link>
      <guid>https://dev.to/mickael__andrieu/llms-for-classification-one-example-is-all-you-need-19al</guid>
      <description>&lt;p&gt;&lt;em&gt;The first few-shot example delivers +55.9% gains. The next 49? Just +25% more.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx5d85lhritgbqkqqexc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx5d85lhritgbqkqqexc.png" alt="One-Shot Impact" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counter-Intuitive Discovery
&lt;/h2&gt;

&lt;p&gt;After running &lt;strong&gt;35 benchmark configurations&lt;/strong&gt; across 5 datasets with rigorous Monte Carlo cross-validation (30 iterations each), we discovered something that challenges conventional prompt engineering wisdom:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The first example you give a Large Language Model improves performance by +55.9% on average. The second? Just +3.1%.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This finding has massive implications for how we design GenAI pipelines — and how much we spend on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data: What We Tested
&lt;/h2&gt;

&lt;p&gt;We benchmarked GPT-4.1-nano across multiple tasks using publicly available datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Classes&lt;/th&gt;
&lt;th&gt;Zero-Shot F1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/ucirvine/sms_spam" rel="noopener noreferrer"&gt;SMS Spam&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Binary Classification&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.820&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/dair-ai/emotion" rel="noopener noreferrer"&gt;Emotion&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Multi-class&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0.344&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/mteb/tweet_sentiment_extraction" rel="noopener noreferrer"&gt;Twitter Sentiment&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sentiment&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.488&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14" rel="noopener noreferrer"&gt;DBpedia&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Multi-class&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/inspec" rel="noopener noreferrer"&gt;Keyword Extraction (Inspec)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;0.155&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each configuration was tested with 0, 1, 3, 5, 10, 20, and 50 few-shot examples.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All datasets are open-source and available on &lt;a href="https://huggingface.co/datasets" rel="noopener noreferrer"&gt;Hugging Face Datasets&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
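&lt;p&gt;Sweeping the shot counts boils down to drawing k labeled examples per configuration. A sketch of the sampling step; the seeding scheme is ours, not necessarily the benchmark's:&lt;/p&gt;

```python
# Deterministic k-shot sampling so each configuration is reproducible.
import random

def sample_shots(pool, k, seed):
    rng = random.Random(seed)
    return rng.sample(pool, k)

# Illustrative labeled pool standing in for a real dataset.
pool = [("text %d" % i, "label") for i in range(100)]
for k in [0, 1, 3, 5, 10, 20, 50]:
    print(k, len(sample_shots(pool, k, seed=42)))
```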

&lt;h2&gt;
  
  
  The Diminishing Returns Curve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj3n1j0vv6yc25zf2yss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj3n1j0vv6yc25zf2yss.png" alt="Learning Curves" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern is striking across all datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transition&lt;/th&gt;
&lt;th&gt;Average Gain&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 → 1 shot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+55.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive improvement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 → 3 shots&lt;/td&gt;
&lt;td&gt;+3.1%&lt;/td&gt;
&lt;td&gt;Diminishing returns begin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 → 5 shots&lt;/td&gt;
&lt;td&gt;+2.8%&lt;/td&gt;
&lt;td&gt;Marginal gains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 → 10 shots&lt;/td&gt;
&lt;td&gt;+1.3%&lt;/td&gt;
&lt;td&gt;Near plateau&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 → 50 shots&lt;/td&gt;
&lt;td&gt;+8.8%*&lt;/td&gt;
&lt;td&gt;Only significant for multi-class&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Inflated by DBpedia (14 classes), which continues improving.&lt;/p&gt;
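&lt;p&gt;Treating the table's gains as compounding multipliers gives a rough sense of how much of the full 0-to-50-shot improvement the first example carries:&lt;/p&gt;

```python
# Average gains per transition, taken from the table above, read as
# multiplicative factors. The "share" interpretation is our own.
gains = [55.9, 3.1, 2.8, 1.3, 8.8]

total = 1.0
for g in gains:
    total = total * (1 + g / 100)

first_share = 0.559 / (total - 1)
print(round(first_share, 2))  # roughly 0.68 under this reading
```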

&lt;h3&gt;
  
  
  Why Does This Happen?
&lt;/h3&gt;

&lt;p&gt;The first example teaches the model three critical things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Output format&lt;/strong&gt; — How to structure the response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task semantics&lt;/strong&gt; — What "classify" or "extract" means in this context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Label space&lt;/strong&gt; — What categories or outputs are expected&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once these are established, additional examples provide marginal refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistical Proof: This Is Not Random
&lt;/h2&gt;

&lt;p&gt;We used Welch's ANOVA to verify these findings weren't due to chance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;F-statistic&lt;/th&gt;
&lt;th&gt;p-value&lt;/th&gt;
&lt;th&gt;Significant?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SMS Spam&lt;/td&gt;
&lt;td&gt;76.75&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotion&lt;/td&gt;
&lt;td&gt;73.89&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twitter Sentiment&lt;/td&gt;
&lt;td&gt;92.57&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DBpedia&lt;/td&gt;
&lt;td&gt;439.88&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyword Extraction&lt;/td&gt;
&lt;td&gt;160.30&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The effect is &lt;strong&gt;statistically significant&lt;/strong&gt; for ALL datasets. DBpedia shows the strongest effect (F=439.88), reflecting its large improvement range from zero-shot to 50-shot.&lt;/p&gt;
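&lt;p&gt;For readers who want to reproduce this kind of test: Welch's ANOVA is a one-way ANOVA that does not assume equal variances across groups. Here is a small self-contained implementation run on synthetic scores (the data below is made up for illustration, not the benchmark's):&lt;/p&gt;

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA (unequal variances); returns (F, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # precision weights n_i / s_i^2
    grand = np.sum(w * m) / np.sum(w)           # weighted grand mean
    num = np.sum(w * (m - grand) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    df2 = (k ** 2 - 1) / (3 * tmp)              # Welch-Satterthwaite denominator df
    F = num / den
    return F, stats.f.sf(F, k - 1, df2)

# Synthetic stand-in for 30 Monte Carlo F1 scores per shot level.
rng = np.random.default_rng(0)
zero_shot = rng.normal(0.50, 0.05, 30)
one_shot = rng.normal(0.78, 0.04, 30)
F, p = welch_anova(zero_shot, one_shot)
print(F, p)
```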

&lt;h3&gt;
  
  
  Pairwise Comparisons (Bonferroni-corrected)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison&lt;/th&gt;
&lt;th&gt;Cohen's d&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 vs 1&lt;/td&gt;
&lt;td&gt;d = -4.3&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Large&lt;/strong&gt; (always significant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 vs 3&lt;/td&gt;
&lt;td&gt;d = -0.5&lt;/td&gt;
&lt;td&gt;Medium (rarely significant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 vs 5&lt;/td&gt;
&lt;td&gt;d = -0.2&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 vs 10&lt;/td&gt;
&lt;td&gt;d = -0.1&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: While the overall effect is significant, pairwise comparisons confirm that differences between adjacent levels (3 vs 5, 5 vs 10) are &lt;strong&gt;not statistically significant&lt;/strong&gt; after correction.&lt;/p&gt;
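&lt;p&gt;The pairwise procedure can be sketched as follows: Welch t-tests between adjacent shot levels, a Bonferroni-corrected threshold, and Cohen's d (pooled-SD convention; others exist) as effect size. The scores below are synthetic placeholders:&lt;/p&gt;

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (one common convention)."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

rng = np.random.default_rng(1)
scores = {k: rng.normal(m, 0.04, 30) for k, m in
          [(0, 0.50), (1, 0.78), (3, 0.80), (5, 0.82)]}
pairs = [(0, 1), (1, 3), (3, 5)]
alpha = 0.05 / len(pairs)                 # Bonferroni-corrected threshold
for a, b in pairs:
    t, p = stats.ttest_ind(scores[a], scores[b], equal_var=False)
    significant = alpha > p
    print(a, "vs", b, "d =", round(cohens_d(scores[a], scores[b]), 2), significant)
```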

&lt;h2&gt;
  
  
  The Exception: Multi-Class Tasks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxqjuzka922vb5uat3g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxqjuzka922vb5uat3g8.png" alt="DBpedia Exception" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DBpedia (14 classes) tells a different story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Few-shots&lt;/th&gt;
&lt;th&gt;F1-Score&lt;/th&gt;
&lt;th&gt;Gain vs Zero-Shot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.650&lt;/td&gt;
&lt;td&gt;+102.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.760&lt;/td&gt;
&lt;td&gt;+136.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.876&lt;/td&gt;
&lt;td&gt;+173.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;0.937&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+192.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The rule of thumb&lt;/strong&gt;: For tasks with &amp;gt;10 classes, plan for ~3-4 examples per class.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ROI Reality Check
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaiwsnx1fdhyknadayyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaiwsnx1fdhyknadayyj.png" alt="Cost vs Performance" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what nobody talks about — the economics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Performance Gain&lt;/th&gt;
&lt;th&gt;Cost Multiplier&lt;/th&gt;
&lt;th&gt;ROI Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-shot&lt;/td&gt;
&lt;td&gt;+53.2%&lt;/td&gt;
&lt;td&gt;1.75x&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-shot&lt;/td&gt;
&lt;td&gt;+59.0%&lt;/td&gt;
&lt;td&gt;3.25x&lt;/td&gt;
&lt;td&gt;18.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-shot&lt;/td&gt;
&lt;td&gt;+63.5%&lt;/td&gt;
&lt;td&gt;4.75x&lt;/td&gt;
&lt;td&gt;13.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-shot&lt;/td&gt;
&lt;td&gt;+66.9%&lt;/td&gt;
&lt;td&gt;8.5x&lt;/td&gt;
&lt;td&gt;7.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-shot&lt;/td&gt;
&lt;td&gt;+81.1%&lt;/td&gt;
&lt;td&gt;37.5x&lt;/td&gt;
&lt;td&gt;2.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The verdict&lt;/strong&gt;: 1-shot delivers the best ROI by far. Going from 1 to 50 examples costs about 21x more for only ~28 additional percentage points of performance gain.&lt;/p&gt;
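&lt;p&gt;The ROI column is not defined explicitly, but it appears to be the performance gain divided by the cost multiplier. A quick sanity check reproduces the reported scores:&lt;/p&gt;

```python
# Reproducing the ROI column: gain (%) divided by cost multiplier.
strategies = {          # shots: (performance gain in %, cost multiplier)
    1: (53.2, 1.75),
    3: (59.0, 3.25),
    5: (63.5, 4.75),
    10: (66.9, 8.5),
    50: (81.1, 37.5),
}
roi = {shots: round(gain / cost, 1) for shots, (gain, cost) in strategies.items()}
print(roi)
```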

&lt;h2&gt;
  
  
  Practical Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Decision Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Task&lt;/th&gt;
&lt;th&gt;Recommended Examples&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary classification&lt;/td&gt;
&lt;td&gt;3-5&lt;/td&gt;
&lt;td&gt;Plateau at 5, marginal gains after&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentiment (3 classes)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Very early plateau, oscillation after&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotion (6 classes)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Early saturation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-class (&amp;gt;10)&lt;/td&gt;
&lt;td&gt;20-50&lt;/td&gt;
&lt;td&gt;Continues improving significantly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Benefits from format examples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Invest in your first example&lt;/strong&gt; — It's worth 55.9% of your total potential improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop at 3-5 for simple tasks&lt;/strong&gt; — Binary and low-cardinality multi-class plateau early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale up only for complex taxonomies&lt;/strong&gt; — &amp;gt;10 classes benefit from 20-50 examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure, don't assume&lt;/strong&gt; — Run your own benchmarks with Monte Carlo cross-validation&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Protocol&lt;/strong&gt;: Monte Carlo Cross-Validation (30 random train/test splits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: F1-score (macro-averaged for multi-class)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical tests&lt;/strong&gt;: Welch's ANOVA, Bonferroni-corrected post-hoc tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effect size&lt;/strong&gt;: Cohen's d&lt;/li&gt;
&lt;/ul&gt;
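&lt;p&gt;The Monte Carlo cross-validation protocol above maps directly onto sklearn's &lt;code&gt;ShuffleSplit&lt;/code&gt;. Here is a minimal sketch using a toy dataset and a simple baseline model as stand-ins for the benchmark's actual data and pipeline:&lt;/p&gt;

```python
# Monte Carlo cross-validation sketch: 30 random train/test splits,
# macro-averaged F1 per split. Dataset and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=500, n_classes=3,
                           n_informative=5, random_state=0)
splitter = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)

scores = []
for train_idx, test_idx in splitter.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx]),
                           average="macro"))
print(np.mean(scores), np.std(scores))
```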




&lt;p&gt;&lt;em&gt;Next in this series: "&lt;a href="https://dev.to/mickael__andrieu/why-classical-ml-still-crushes-genai-at-regression-5l5"&gt;Why Classical ML Still Crushes GenAI at Regression&lt;/a&gt;" — The surprising tasks where sklearn wins 100% of the time.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Classical ML Still Crushes GenAI at Regression</title>
      <dc:creator>Mickaël Andrieu</dc:creator>
      <pubDate>Mon, 05 Jan 2026 23:45:36 +0000</pubDate>
      <link>https://dev.to/mickael__andrieu/why-classical-ml-still-crushes-genai-at-regression-5l5</link>
      <guid>https://dev.to/mickael__andrieu/why-classical-ml-still-crushes-genai-at-regression-5l5</guid>
      <description>&lt;p&gt;&lt;em&gt;The surprising benchmark results that every ML engineer should know before using LLMs for numerical predictions&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0rjtrp0jcesfiz9le27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0rjtrp0jcesfiz9le27.png" alt="Regression Benchmark" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;In the age of GPT-5 and advanced reasoning models, I ran a comprehensive benchmark comparing &lt;strong&gt;Classical ML (sklearn)&lt;/strong&gt; against &lt;strong&gt;Generative AI&lt;/strong&gt; across 19 use cases. The results for regression tasks were... humbling for GenAI.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Classical ML wins 100% of regression tasks.&lt;/strong&gt; Not 90%. Not 95%. One hundred percent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And it's not even close.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data: What I Found
&lt;/h2&gt;

&lt;p&gt;I tested three state-of-the-art LLMs against sklearn's RandomForestRegressor across 5 regression datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;sklearn R²&lt;/th&gt;
&lt;th&gt;GPT-4o-mini R²&lt;/th&gt;
&lt;th&gt;GPT-5-nano R²&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Car Price&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;-3.498&lt;/td&gt;
&lt;td&gt;-5.172&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diamond Price&lt;/td&gt;
&lt;td&gt;0.926&lt;/td&gt;
&lt;td&gt;0.806&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing Price&lt;/td&gt;
&lt;td&gt;0.710&lt;/td&gt;
&lt;td&gt;-6.78B&lt;/td&gt;
&lt;td&gt;-29.27B&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wine Quality&lt;/td&gt;
&lt;td&gt;0.405&lt;/td&gt;
&lt;td&gt;-1.33M&lt;/td&gt;
&lt;td&gt;-1.631&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Salary Prediction&lt;/td&gt;
&lt;td&gt;0.346&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Wait, negative R²?&lt;/strong&gt; Yes. A negative R² means the model performs &lt;em&gt;worse than simply predicting the mean&lt;/em&gt;. The LLMs aren't just losing — they're producing predictions that are mathematically worse than doing nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Catastrophe
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What R² Actually Means
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;R² = 1.0&lt;/strong&gt;: Perfect predictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R² = 0.0&lt;/strong&gt;: Predictions as good as the mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R² &amp;lt; 0&lt;/strong&gt;: Predictions &lt;em&gt;worse&lt;/em&gt; than the mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When GPT-4o-mini produces an R² of &lt;strong&gt;-6.78 billion&lt;/strong&gt; on housing prices, it means the model's predictions are so far off that they add massive error compared to simply guessing the average price every time.&lt;/p&gt;
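&lt;p&gt;You can see the definition in action with sklearn's &lt;code&gt;r2_score&lt;/code&gt; on toy numbers (these are made-up prices, not the benchmark data):&lt;/p&gt;

```python
# Demonstrating R² = 0 for mean-prediction and R² &amp;lt; 0 for wild predictions.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([250_000, 310_000, 480_000, 520_000])
mean_guess = np.full_like(y_true, y_true.mean(), dtype=float)
wild_guess = np.array([5_000_000, 90_000, 2_000_000, 10_000])

print(r2_score(y_true, mean_guess))  # 0.0: exactly as good as the mean
print(r2_score(y_true, wild_guess))  # large negative: worse than the mean
```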

&lt;h3&gt;
  
  
  The Scale of the Problem
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foasbtzuzpvnqeo2m0sii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foasbtzuzpvnqeo2m0sii.png" alt="Error Distribution" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's visualize what "negative R² in the billions" actually means:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GPT-4o-mini&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Housing RMSE&lt;/td&gt;
&lt;td&gt;~$45,000&lt;/td&gt;
&lt;td&gt;~$8.2 billion&lt;/td&gt;
&lt;td&gt;LLM errors are 180,000x larger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Car Price RMSE&lt;/td&gt;
&lt;td&gt;~$2,100&lt;/td&gt;
&lt;td&gt;~$18,500&lt;/td&gt;
&lt;td&gt;LLM errors are 9x larger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wine Quality RMSE&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;1,152&lt;/td&gt;
&lt;td&gt;LLM errors are 1,772x larger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why LLMs Fail at Regression
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Token Generation ≠ Numerical Reasoning
&lt;/h3&gt;

&lt;p&gt;LLMs generate text token by token. When asked to predict "$347,500", they're essentially:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deciding "3" is the first digit&lt;/li&gt;
&lt;li&gt;Then "4" seems reasonable&lt;/li&gt;
&lt;li&gt;Then "7", etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fundamentally different from computing a weighted sum of features, which is what regression actually requires.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. No Gradient Optimization
&lt;/h3&gt;

&lt;p&gt;Classical ML models like RandomForest are optimized to minimize prediction error through mathematical optimization. LLMs are optimized to predict the next token in a sequence — a completely different objective.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scale Sensitivity
&lt;/h3&gt;

&lt;p&gt;LLMs have no inherent understanding of numerical scale. The difference between $100,000 and $1,000,000 is just different tokens to an LLM, but it's a massive error in regression terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;To ensure a fair comparison, I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monte Carlo Cross-Validation&lt;/strong&gt;: 30 random train/test splits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Tests&lt;/strong&gt;: Paired t-test, Wilcoxon signed-rank, Bootstrap CI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effect Size&lt;/strong&gt;: Cohen's d for interpretation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn baseline&lt;/strong&gt;: TF-IDF + RandomForestRegressor (simple, reproducible)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All differences were &lt;strong&gt;statistically significant&lt;/strong&gt; (p &amp;lt; 0.05) with &lt;strong&gt;large effect sizes&lt;/strong&gt; (Cohen's d &amp;gt; 0.8).&lt;/p&gt;

&lt;h2&gt;
  
  
  But What About Few-Shot Learning?
&lt;/h2&gt;

&lt;p&gt;I tested whether more examples could help LLMs with regression. The answer: &lt;strong&gt;marginally, but the results remain catastrophic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqklixfxs4ibve3nms20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqklixfxs4ibve3nms20.png" alt="Few-Shot Regression" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;5-shot R²&lt;/th&gt;
&lt;th&gt;20-shot R²&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;th&gt;Still Negative?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Car Price&lt;/td&gt;
&lt;td&gt;-3.498&lt;/td&gt;
&lt;td&gt;-1.794&lt;/td&gt;
&lt;td&gt;+48.7%&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wine Quality&lt;/td&gt;
&lt;td&gt;-1.33M&lt;/td&gt;
&lt;td&gt;-379K&lt;/td&gt;
&lt;td&gt;+71.5%&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing Price&lt;/td&gt;
&lt;td&gt;-6.78B&lt;/td&gt;
&lt;td&gt;-6.54B&lt;/td&gt;
&lt;td&gt;+3.5%&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More examples make LLM predictions "less catastrophically wrong" — but they remain worse than useless compared to classical ML.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Dimension
&lt;/h2&gt;

&lt;p&gt;Beyond accuracy, there's a massive cost and speed difference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GenAI&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;8.1 min&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference time&lt;/td&gt;
&lt;td&gt;8.1 min total&lt;/td&gt;
&lt;td&gt;6,533 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;807x slower&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0 (local)&lt;/td&gt;
&lt;td&gt;$9.39&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Infinite&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You're paying 807x more in time (and real money) for predictions that are mathematically worse than guessing the average.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What: The Decision Framework
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajywv8wd9z9x4zs13stc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajywv8wd9z9x4zs13stc.png" alt="Decision Framework" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Classical ML (sklearn) When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your target variable is &lt;strong&gt;continuous/numerical&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;regression&lt;/strong&gt; predictions (price, quantity, score)&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;structured tabular data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters&lt;/strong&gt; (&amp;lt;100ms response time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost matters&lt;/strong&gt; (high volume predictions)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consider GenAI When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your task involves &lt;strong&gt;natural language understanding&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You're doing &lt;strong&gt;sentiment analysis&lt;/strong&gt; or &lt;strong&gt;emotion detection&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;zero-shot generalization&lt;/strong&gt; on text&lt;/li&gt;
&lt;li&gt;You're doing &lt;strong&gt;information extraction&lt;/strong&gt; from unstructured text&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;If you must use LLMs in a pipeline that involves numerical predictions, consider these patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: LLM for Features, sklearn for Prediction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text Input → LLM extracts features → sklearn predicts value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the LLM to convert unstructured text into structured features, then let sklearn do the numerical prediction.&lt;/p&gt;
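&lt;p&gt;A hedged sketch of this pattern, with the LLM call replaced by a hypothetical stand-in function (a real implementation would prompt the model to return a JSON object of features):&lt;/p&gt;

```python
# Pattern 1 sketch: text features via a (mocked) LLM step, numeric prediction
# via sklearn. extract_features and its feature names are hypothetical.
from sklearn.ensemble import RandomForestRegressor

def extract_features(description: str) -> list:
    """Stand-in for an LLM call returning [bedrooms, sqft, has_garage]."""
    words = description.lower()
    return [
        3.0 if "3 bed" in words else 2.0,
        1500.0 if "spacious" in words else 900.0,
        1.0 if "garage" in words else 0.0,
    ]

# sklearn trains on structured features (toy data), never on raw text.
X = [[2, 900, 0], [3, 1500, 1], [4, 2100, 1], [2, 800, 0]]
y = [210_000, 340_000, 455_000, 195_000]
model = RandomForestRegressor(random_state=0).fit(X, y)

features = extract_features("Spacious 3 bed home with garage")
print(model.predict([features])[0])
```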

&lt;h3&gt;
  
  
  Pattern 2: LLM for Binning, then Aggregation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → LLM classifies into bins → Map bins to numerical ranges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of asking "What is the price?", ask "Is this low, medium, or high priced?" — a classification task where LLMs perform better.&lt;/p&gt;
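&lt;p&gt;Sketched in code, with the LLM call again mocked out (the bin labels and price ranges are illustrative):&lt;/p&gt;

```python
# Pattern 2 sketch: the LLM only picks a bin label (a classification task),
# and deterministic code maps bins to numeric ranges. llm_pick_bin is a
# hypothetical stand-in for the actual model call.
BIN_RANGES = {"low": (0, 15_000), "medium": (15_000, 40_000), "high": (40_000, 120_000)}

def llm_pick_bin(description: str) -> str:
    """Stand-in: a real call would ask 'Is this low, medium, or high priced?'"""
    return "medium"

def estimate_price(description: str) -> float:
    lo, hi = BIN_RANGES[llm_pick_bin(description)]
    return (lo + hi) / 2            # midpoint of the bin as the point estimate

print(estimate_price("2018 sedan, 60k miles, one owner"))
```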

&lt;h3&gt;
  
  
  Pattern 3: LLM as Sanity Checker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn prediction → LLM validates → Final output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use sklearn for the prediction, then optionally use an LLM to flag predictions that seem unreasonable given the context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never use LLMs for regression&lt;/strong&gt; — You'll get R² scores worse than always predicting the mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn dominates numerical prediction&lt;/strong&gt; — With 100% win rate in our benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More examples don't fix the problem&lt;/strong&gt; — Even 20-shot learning produces negative R²&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-performance ratio is catastrophic&lt;/strong&gt; — 807x slower for worse results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid approaches exist&lt;/strong&gt; — Use LLMs for what they're good at (text), sklearn for numbers&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLMs are poets, not accountants.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They excel at understanding language, sentiment, and nuance. They fail spectacularly at the mathematical precision required for regression. Know your tools, and use them where they shine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Methodology Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datasets&lt;/strong&gt;: Car Price, Diamond Price, Housing Price, Wine Quality, Salary Prediction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn Model&lt;/strong&gt;: RandomForestRegressor with default parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Models&lt;/strong&gt;: GPT-4o-mini, GPT-5-nano (reasoning), GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: 30-iteration Monte Carlo Cross-Validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: R² (coefficient of determination), RMSE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Tests&lt;/strong&gt;: Paired t-test, Wilcoxon, Bootstrap CI (10,000 resamples)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Next in this series: &lt;a href="https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika"&gt;"LLMs Can Do Everything": Autopsy of a Myth&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
