Mickaël Andrieu

"LLMs Can Do Everything": Autopsy of a Myth

We tested this belief on 19 use cases, using 3 OpenAI GPT models and basic Machine Learning algorithms, for a total of 570 experiments. Here's the truth no one wants to hear.


Prologue: The Promise

January 2026. In the hallways of tech companies, a conviction has settled in as self-evident: LLMs make classical Machine Learning obsolete. Why bother with sklearn pipelines when GPT can do everything in one line of code?

We wanted to test this hypothesis. Not with opinions. With data.

The protocol: 19 real-world use cases, from spam detection to housing price prediction. Three OpenAI models (GPT-4o-mini, GPT-5-nano, GPT-4.1-nano). A deliberately simple sklearn baseline: TF-IDF + Logistic Regression for classification, RandomForest for regression. 30 iterations per experiment, Monte Carlo cross-validation, rigorous statistical tests.
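For concreteness, here is a minimal sketch of what such a baseline looks like (illustrative code, not our exact experiment harness):

```python
# Minimal sklearn baselines of the kind benchmarked here (sketch only).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

# Classification: TF-IDF features + Logistic Regression
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# text_clf.fit(train_texts, train_labels)

# Regression: Random Forest on tabular features
reg = RandomForestRegressor(n_estimators=100, random_state=42)
# reg.fit(X_train, y_train)
```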

The verdict? It will surprise you.

[Chart: Effect Sizes Overview]


ACT I: THE ILLUSION

The Expected Triumph

Let's start with what works. And it works spectacularly well.

When the LLM Crushes Everything

On sentiment analysis, GenAI doesn't just win. It dominates.

| Dataset | sklearn | GPT-4o-mini | Gain | Effect Size |
|---|---|---|---|---|
| Twitter Sentiment | 0.367 | 0.690 | +88% | d=-13.6 (massive) |
| IMDB Reviews | 0.784 | 0.938 | +20% | d=-7.5 (massive) |
| Amazon Reviews | 0.377 | 0.608 | +61% | d=-10.5 (massive) |
| Yelp Reviews | 0.428 | 0.665 | +55% | d=-11.0 (massive) |

A Cohen's d of -13.6 on Twitter Sentiment. For perspective: in social sciences, an effect is considered "large" when |d| > 0.8. Here, we're 17 times beyond the threshold.

The LLM understands irony, sarcasm, cultural nuances. It knows that "This movie was... something else" isn't a compliment. TF-IDF only sees words.

Spam Detection: A Massacre

```
SMS Spam Detection
------------------
sklearn (TF-IDF + LogReg) : 0.581 F1
GPT-4o-mini               : 0.965 F1

Difference : +66%
p-value    : 3.03e-26 (25 zeros after the decimal)
```

With just one few-shot example, the LLM already reaches 0.90 F1. sklearn needs thousands of examples to approach this score.
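What does that one-shot setup look like in practice? A minimal sketch using the OpenAI Python SDK (the prompt wording is ours, not the benchmark's exact template):

```python
# One-shot spam classification via the OpenAI chat API (illustrative
# sketch; the benchmark's actual prompts differ).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sms(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the SMS as 'spam' or 'ham'. Answer with one word."},
            # The single few-shot example that already buys ~0.90 F1:
            {"role": "user", "content": "WINNER!! Claim your free prize now, txt YES to 80082"},
            {"role": "assistant", "content": "spam"},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sms("Are we still on for lunch tomorrow?"))  # expected: ham
```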

Semantic Extraction: Where sklearn Surrenders

| Benchmark | sklearn | GPT-4o-mini | Multiplier |
|---|---|---|---|
| KPCrowd | 0.030 | 0.419 | x14 |
| DUC2001 | 0.078 | 0.303 | x4 |
| Krapivin | 0.046 | 0.223 | x5 |

On keyword extraction, GPT-4o-mini is 14 times better than the baseline. Cohen's d = -38.7. We're leaving the realm of statistics to enter the absurd.

At this point in our analysis, the myth seemed to confirm itself. GenAI dominated 12 use cases out of 19. 63% win rate. Classical ML appeared doomed.

And then we looked at the regression data.


ACT II: THE CONFRONTATION

The First Shock: Fake News

Before even reaching regression, a first warning signal.

```
Fake News Detection
-------------------
sklearn     : 0.884 F1
GPT-4o-mini : 0.822 F1

sklearn wins.
```

Strange. Why does the LLM, which excels everywhere else in classification, fail here?

Hypothesis: Fake news detection relies less on semantic understanding than on subtle statistical patterns. Word frequencies, unusual syntactic structures, stylistic markers. TF-IDF captures these signals better than an LLM looking for "meaning".

First crack in the myth. But this was just an appetizer.

The Cataclysm: Negative R² Scores

Brace yourself. What follows is the heart of our discovery.

Housing Price Prediction

```
sklearn R²     : 0.710 (explains 71% of variance)
GPT-4o-mini R² : -6,536,549,708
```

You read that right. Negative six and a half billion.

An R² of 0 means "as good as predicting the mean". A negative R² means "worse than the mean". An R² of -6.5 billion means the model produces predictions so aberrant that squared residuals explode toward infinity.

When we asked GPT-4o-mini to predict a housing price, it responded with things like "$45" or "$999,999,999,999". Not because it's stupid. Because it's not designed for this type of task.
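A toy illustration of how a couple of aberrant predictions produce numbers like these (made-up values, not our benchmark data):

```python
# How a single absurd prediction wrecks R² (toy numbers, not benchmark data).
from sklearn.metrics import r2_score

y_true = [250_000, 310_000, 275_000, 420_000]      # actual house prices
y_sane = [240_000, 300_000, 290_000, 400_000]      # a reasonable model
y_llm  = [245_000, 305_000, 45, 999_999_999_999]   # "$45", "$999,999,999,999"

print(r2_score(y_true, y_sane))  # ~0.95: close to the truth
print(r2_score(y_true, y_llm))   # ~ -6e13: squared residuals explode
```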

Wine Quality Prediction

```
sklearn R²     : 0.405
GPT-4o-mini R² : -379,022
```

Negative three hundred seventy-nine thousand. To predict a wine score between 1 and 10.

The Horror Table

| Dataset | sklearn R² | GenAI R² | Interpretation |
|---|---|---|---|
| Diamond Price | 0.926 | 0.810 | GenAI correct but inferior |
| Car Price | 0.802 | -1.79 | GenAI failing |
| Salary | 0.346 | 0.208 | GenAI weak |
| Wine Quality | 0.405 | -379,022 | Catastrophic |
| Housing Price | 0.710 | -6,536,549,709 | Apocalyptic |

sklearn wins in regression: 5/5. That's 100%.

[Chart: The Regression Wall]

Why This Disaster?

LLMs are language models. They predict the most probable next token, not a continuous numerical value.

When you ask "What's the price of this house?", the LLM generates a linguistically plausible response, not a mathematically correct one. It produces numbers that "sound right": round amounts, typical prices it saw in its training corpus.
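Schematically, in our own notation: a language model optimizes token plausibility, while a regressor optimizes residuals.

```latex
% Next-token objective (what an LLM optimizes):
\hat{t}_{k+1} = \arg\max_{t \in V} \; P_\theta\!\left(t \mid t_1, \dots, t_k\right)

% Regression objective (what sklearn optimizes):
\hat{f} = \arg\min_{f} \; \frac{1}{n} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2
```

The first objective rewards text that sounds right; the second penalizes every unit of numerical error.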

But regression demands numerical precision. A 10% error on a housing price is $50,000. A 1% error across 10,000 predictions is millions in cumulative error.

The LLM can't do this. It's not a flaw. It's an architectural characteristic.

The GPT-5-nano Paradox

Here's an even more troubling discovery.

| Model | Relative Cost | Win Rate vs sklearn |
|---|---|---|
| GPT-4o-mini | $ | 63% |
| GPT-5-nano (reasoning) | $$$ | 61% |
| GPT-4.1-nano | $ | 47% |

GPT-5-nano, with its additional "reasoning" capabilities, performs worse than GPT-4o-mini on our ML benchmarks.

On DBpedia (14-class classification):

  • GPT-4o-mini: 0.964 F1
  • GPT-5-nano: 0.792 F1

The "smarter" model loses by 17 points. How do we explain this?

Hypothesis: GPT-5-nano's reasoning capabilities are optimized for complex tasks (mathematics, logic, planning). For standard text classification, this "reasoning" adds noise rather than value. The model "thinks too much" when it should just classify.

More expensive doesn't mean better. More sophisticated doesn't mean more suitable.

[Chart: The Reasoning Paradox]


ACT III: THE WISDOM

The Law of the First Example

Amid this confrontation, a ray of hope for GenAI practitioners.

Our few-shot analysis reveals a striking law:

| Transition | Average Gain |
|---|---|
| 0 -> 1 example | +55.9% |
| 1 -> 3 examples | +3.1% |
| 3 -> 5 examples | +2.8% |
| 5 -> 10 examples | +1.3% |
| 10 -> 50 examples | +4.4% |

The first example delivers 56% of the total improvement.

Going from 0-shot to 50-shot multiplies your prompt cost by 37 yet adds only 28% more performance beyond what the first example already delivered.

On DBpedia (14 classes), however, multiple examples remain crucial:

  • 0-shot: 0.321 F1
  • 1-shot: 0.650 F1 (+102%)
  • 50-shot: 0.937 F1 (+192% vs zero-shot)

Pragmatic rule: For simple tasks (<6 classes), 1-3 examples suffice. For complex taxonomies (>10 classes), invest in 20-50 examples.
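Encoded as a rule of thumb (the helper and its exact cutoffs are our own reading of the results above, not an official recommendation):

```python
# The pragmatic few-shot rule as a helper (cutoffs are our own reading
# of the benchmark's results).
def suggested_shot_count(n_classes: int) -> int:
    if n_classes < 6:
        return 3    # simple tasks: 1-3 examples capture most of the gain
    if n_classes <= 10:
        return 10   # middle ground: roughly one example per class
    return 50       # complex taxonomies (>10 classes): invest in 20-50
```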

[Chart: The First Example Curve]

The Decision Matrix

After 570 experiments, here's what we know:

Use GenAI when...

| Criterion | Example | Justification |
|---|---|---|
| Semantic task | Sentiment, emotion | The LLM understands meaning |
| Few labeled data | Prototype, MVP | 1 example = 90% of the way |
| Concept extraction | Keywords, entities | The LLM sees beyond n-grams |
| Latency not critical | Batch processing | 807x slower but more accurate |

Use sklearn when...

| Criterion | Example | Justification |
|---|---|---|
| Numerical prediction | Prices, scores | R² > 0 guaranteed |
| Statistical pattern detection | Fraud, fake news | TF-IDF captures subtle signals |
| Massive volume | 1M+ predictions/day | Cost = $0 vs $500/day |
| Critical latency | Real-time | 807x faster |

[Chart: The Decision Matrix]

The Cost of the Illusion

Let's talk money.

| Metric | sklearn | GenAI |
|---|---|---|
| Total benchmark cost | $0 | $9.39 |
| Total time | 8.1 minutes | 6,533 minutes |
| Speed ratio | 1x | 807x slower |

In production:

| Daily Volume | sklearn | GenAI (GPT-4o-mini) |
|---|---|---|
| 1,000 requests | ~$0 | ~$0.50 |
| 100,000 requests | ~$0 | ~$50 |
| 1,000,000 requests | ~$0 | ~$500 |

One million predictions per day: $0 vs $182,500/year.
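The arithmetic behind that figure, using the flat per-request rate implied by the table (a simplification; real OpenAI billing is per token):

```python
# Yearly cost projection from the daily-volume table above (sketch;
# actual OpenAI costs depend on token counts, not flat per-request rates).
COST_PER_REQUEST_USD = 0.0005  # ~$0.50 per 1,000 GPT-4o-mini requests

def yearly_api_cost(daily_requests: int) -> float:
    return daily_requests * COST_PER_REQUEST_USD * 365

print(yearly_api_cost(1_000_000))  # 182500.0 -> the $182,500/year above
```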

[Chart: Cost at Scale]

And that's just the API cost. Add latency (user-facing response time), dependency on an external service (an outage means downtime), and data confidentiality (your texts leave for OpenAI's servers).


Epilogue: The New Map

The myth "LLMs can do everything" is false. But reality is more interesting than a simple classical ML victory.

What We Learned

1. LLMs are champions of semantics, not numerics.

They understand meaning, nuances, context. They fail as soon as you ask for mathematical precision.

2. Sophistication doesn't mean suitability.

GPT-5-nano with its "reasoning" loses to GPT-4o-mini on simple tasks. The most advanced tool isn't always the right tool.

3. The first few-shot example is magical.

56% gain for a single example. It's the best ROI in all of Machine Learning.

4. Regression remains sklearn's undisputed domain.

R² = -6.5 billion isn't a bug. It's a fundamental characteristic of transformer architectures applied where they don't belong.

5. GenAI's hidden costs are real.

807x slower. $182k/year for 1M daily predictions. External dependency. These factors often disappear from POC evaluations.

The Question to Ask

Before every project, ask yourself:

Does my task require understanding meaning, or predicting numbers?

If it's meaning: GenAI will probably excel.
If it's numbers: sklearn will probably win.
If it's both: build a hybrid architecture (a minimal sketch follows).
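A minimal routing sketch in the spirit of the decision matrix (the function and its criteria are our own simplification, not our benchmark code):

```python
# Hybrid routing by task type (illustrative; criteria condensed from
# the decision matrix above).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def pick_model(task_type: str):
    """Numbers -> sklearn; statistical text -> TF-IDF; meaning -> LLM."""
    if task_type == "regression":            # prices, scores
        return RandomForestRegressor()       # R² > 0, ~$0 per call
    if task_type == "statistical_text":      # fraud, fake news
        return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return "llm"  # semantic tasks (sentiment, keywords): few-shot GPT call
```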


Methodology

  • Validation: Monte Carlo Cross-Validation, 30 iterations per configuration
  • Statistical tests: Paired t-test, Wilcoxon signed-rank, Bootstrap CI (10,000 resamples)
  • Metrics: F1-score (classification/extraction), R² (regression)
  • sklearn baseline: TF-IDF + LogisticRegression (classification), RandomForest (regression)
  • GenAI models: GPT-4o-mini, GPT-5-nano (reasoning medium), GPT-4.1-nano
  • Few-shot: 20 examples by default, with a dedicated 0-50 example sweep
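
A sketch of that validation loop (illustrative; our actual harness differs, for instance in how GenAI predictions are batched):

```python
# Monte Carlo cross-validation with 30 random splits, followed by paired
# statistical tests on the per-split scores (sketch of the protocol above).
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def monte_carlo_scores(model_fn, X, y, n_iter=30):
    scores = []
    for seed in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        preds = model_fn(X_tr, y_tr, X_te)   # fit on the split, predict held-out
        scores.append(f1_score(y_te, preds, average="macro"))
    return np.array(scores)

# sklearn_scores = monte_carlo_scores(sklearn_fn, X, y)
# genai_scores   = monte_carlo_scores(genai_fn, X, y)
# print(ttest_rel(sklearn_scores, genai_scores))
# print(wilcoxon(sklearn_scores, genai_scores))
```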

Acknowledged limitation: Our sklearn baseline is deliberately simple. More sophisticated approaches (sentence-transformers, XGBoost, fine-tuning) would likely narrow the gap on some tasks. This benchmark measures "GenAI vs accessible ML", not "GenAI vs SOTA".


This Series

  1. LLMs for Classification: One Example is All You Need
  2. Why Classical ML Still Crushes GenAI at Regression
  3. The Myth Confronted with Reality (this article)

Classical ML isn't dead. It's never been more relevant. The real innovation isn't replacing sklearn with GPT — it's knowing when to use each.

