We tested this belief on 19 use cases, using 3 OpenAI GPT models and basic Machine Learning algorithms, for a total of 570 experiments. Here's the truth no one wants to hear.
Prologue: The Promise
January 2026. In the hallways of tech companies, a conviction has settled in as self-evident: LLMs make classical Machine Learning obsolete. Why bother with sklearn pipelines when GPT can do everything in one line of code?
We wanted to test this hypothesis. Not with opinions. With data.
The protocol: 19 real-world use cases, from spam detection to housing price prediction. 3 OpenAI models (GPT-4o-mini, GPT-5-nano, GPT-4.1-nano). A deliberately simple sklearn baseline (TF-IDF + Logistic Regression for classification, RandomForest for regression). 30 iterations per experiment, Monte Carlo cross-validation, rigorous statistical tests.
The verdict? It will surprise you.
ACT I: THE ILLUSION
The Expected Triumph
Let's start with what works. And it works spectacularly well.
When the LLM Crushes Everything
On sentiment analysis, GenAI doesn't just win. It dominates.
| Dataset | sklearn | GPT-4o-mini | Gain | Effect Size |
|---|---|---|---|---|
| Twitter Sentiment | 0.367 | 0.690 | +88% | d=-13.6 (massive) |
| IMDB Reviews | 0.784 | 0.938 | +20% | d=-7.5 (massive) |
| Amazon Reviews | 0.377 | 0.608 | +61% | d=-10.5 (massive) |
| Yelp Reviews | 0.428 | 0.665 | +55% | d=-11.0 (massive) |
A Cohen's d of -13.6 on Twitter Sentiment. For perspective: in social sciences, an effect is considered "large" when |d| > 0.8. Here, we're 17 times beyond the threshold.
The LLM understands irony, sarcasm, cultural nuances. It knows that "This movie was... something else" isn't a compliment. TF-IDF only sees words.
Spam Detection: A Massacre
SMS Spam Detection
------------------
sklearn (TF-IDF + LogReg) : 0.581 F1
GPT-4o-mini : 0.965 F1
Difference : +66%
p-value                   : 3.03e-26 (25 zeros after the decimal point)
With just one few-shot example, the LLM already reaches 0.90 F1. sklearn needs thousands of examples to approach this score.
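For concreteness, here is a minimal sketch of what such a one-shot call can look like with the OpenAI Python SDK. The system prompt, the in-context example, and the test SMS are illustrative assumptions, not our exact benchmark prompts:

```python
# Sketch of a one-shot spam classification call with the OpenAI Python SDK (openai >= 1.x).
# The prompt wording, the in-context example, and the test SMS are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

one_shot_example = 'SMS: "WINNER!! You have been selected for a free prize, call now" -> spam'

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": "Classify each SMS as 'spam' or 'ham'. Answer with one word."},
        {"role": "user", "content": one_shot_example
                                    + '\n\nSMS: "Congratulations, claim your reward at this link" ->'},
    ],
)
print(response.choices[0].message.content)  # expected: spam
```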
Semantic Extraction: Where sklearn Surrenders
| Benchmark | sklearn | GPT-4o-mini | Multiplier |
|---|---|---|---|
| KPCrowd | 0.030 | 0.419 | x14 |
| DUC2001 | 0.078 | 0.303 | x4 |
| Krapivin | 0.046 | 0.223 | x5 |
On keyword extraction, GPT-4o-mini is 14 times better than the baseline. Cohen's d = -38.7. We're leaving the realm of statistics to enter the absurd.
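To make the comparison concrete, here is a rough sketch of both approaches side by side. The document, the prompt wording, and the "top TF-IDF terms" shortcut are illustrative assumptions, not our benchmark implementation:

```python
# Sketch: keyword extraction with an LLM vs. a crude "top TF-IDF terms" baseline.
# The document, the prompt, and the 5-term cutoff are illustrative assumptions.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer

document = ("Transformer-based language models now dominate most natural language "
            "processing benchmarks, from classification to keyphrase extraction.")

# Baseline: the highest-weighted TF-IDF terms of the document.
vectorizer = TfidfVectorizer(stop_words="english")
weights = vectorizer.fit_transform([document]).toarray()[0]
terms = vectorizer.get_feature_names_out()
print("TF-IDF keywords:", [terms[i] for i in weights.argsort()[::-1][:5]])

# LLM: ask for keyphrases directly.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user",
               "content": f"Extract the 5 most important keyphrases, comma-separated:\n\n{document}"}],
)
print("LLM keywords:", response.choices[0].message.content)
```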
At this point in our analysis, the myth seemed to confirm itself. GenAI dominated 12 use cases out of 19. 63% win rate. Classical ML appeared doomed.
And then we looked at the regression data.
ACT II: THE CONFRONTATION
The First Shock: Fake News
Before even reaching regression, a first warning signal.
Fake News Detection
-------------------
sklearn : 0.884 F1
GPT-4o-mini : 0.822 F1
sklearn wins.
Strange. Why does the LLM, which excels everywhere else in classification, fail here?
Hypothesis: Fake news detection relies less on semantic understanding than on subtle statistical patterns. Word frequencies, unusual syntactic structures, stylistic markers. TF-IDF captures these signals better than an LLM looking for "meaning".
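One way to see what this kind of shallow signal looks like: fit a TF-IDF + Logistic Regression model and inspect which terms it weights toward the "fake" class. The tiny corpus below is made up for illustration, not the fake-news benchmark data:

```python
# Sketch: the lexical/stylistic signal that TF-IDF + Logistic Regression picks up.
# The four headlines and their labels are made up, not the benchmark dataset.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "SHOCKING!!! You won't believe what they found",
    "BREAKING: miracle cure doctors don't want you to know",
    "The city council approved the budget on Tuesday",
    "Quarterly earnings rose 3 percent, the company reported",
]
labels = [1, 1, 0, 0]  # 1 = fake, 0 = real

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

terms = vectorizer.get_feature_names_out()
most_fake = np.argsort(clf.coef_[0])[-5:]  # terms weighted most toward the "fake" class
print([terms[i] for i in most_fake])
```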
First crack in the myth. But this was just an appetizer.
The Cataclysm: Negative R² Scores
Brace yourself. What follows is the heart of our discovery.
Housing Price Prediction
sklearn R² : 0.710 (explains 71% of variance)
GPT-4o-mini R² : -6,536,549,708
You read that right. Negative six and a half billion.
An R² of 0 means "no better than always predicting the mean". A negative R² means "worse than the mean". An R² of -6.5 billion means the predictions are so aberrant that the sum of squared residuals is billions of times larger than the natural variance of the prices.
When we asked GPT-4o-mini to predict a housing price, it responded with things like "$45" or "$999,999,999,999". Not because it's stupid. Because it's not designed for this type of task.
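Recall that R² = 1 - SS_res / SS_tot: a single absurd prediction inflates SS_res and drags the score arbitrarily far below zero. A tiny illustration with made-up prices (not our benchmark data):

```python
# Illustration: one wild prediction is enough to push R² into astronomically negative territory.
# The prices below are made up for the example.
from sklearn.metrics import r2_score

y_true = [250_000, 310_000, 480_000, 520_000, 610_000]     # actual prices
y_llm  = [260_000, 300_000, 470_000, 45, 999_999_999_999]  # three plausible guesses, then "$45" and "$999,999,999,999"

print(r2_score(y_true, y_llm))  # on the order of -1e13: the squared residual of one wild guess dwarfs everything
```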
Wine Quality Prediction
sklearn R² : 0.405
GPT-4o-mini R² : -379,022
Negative three hundred seventy-nine thousand. To predict a wine score between 1 and 10.
The Horror Table
| Dataset | sklearn R² | GenAI R² | Interpretation |
|---|---|---|---|
| Diamond Price | 0.926 | 0.810 | GenAI correct but inferior |
| Car Price | 0.802 | -1.79 | GenAI failing |
| Salary | 0.346 | 0.208 | GenAI weak |
| Wine Quality | 0.405 | -379,022 | Catastrophic |
| Housing Price | 0.710 | -6,536,549,708 | Apocalyptic |
sklearn wins in regression: 5/5. That's 100%.
Why This Disaster?
LLMs are language models. They predict the most probable next token, not a continuous numerical value.
When you ask "What's the price of this house?", the LLM generates a linguistically plausible response, not a mathematically correct one. It produces numbers that "sound right": round amounts, typical prices it saw in its training corpus.
But regression demands numerical precision. A 10% error on a $500,000 house is $50,000. A 1% average error across 10,000 such predictions adds up to millions of dollars.
The LLM can't do this. It's not a flaw. It's an architectural characteristic.
The GPT-5-nano Paradox
Here's an even more troubling discovery.
| Model | Relative Cost | Win Rate vs sklearn |
|---|---|---|
| GPT-4o-mini | $ | 63% |
| GPT-5-nano (reasoning) | $$$ | 61% |
| GPT-4.1-nano | $ | 47% |
GPT-5-nano, with its additional "reasoning" capabilities, performs worse than GPT-4o-mini on our ML benchmarks.
On DBpedia (14-class classification):
- GPT-4o-mini: 0.964 F1
- GPT-5-nano: 0.792 F1
The "smarter" model loses by 17 points. How do we explain this?
Hypothesis: GPT-5-nano's reasoning capabilities are optimized for complex tasks (mathematics, logic, planning). For standard text classification, this "reasoning" adds noise rather than value. The model "thinks too much" when it should just classify.
More expensive doesn't mean better. More sophisticated doesn't mean more suitable.
ACT III: THE WISDOM
The Law of the First Example
Amid this confrontation, a ray of hope for GenAI practitioners.
Our few-shot analysis reveals a striking law:
| Transition | Average Gain |
|---|---|
| 0 -> 1 example | +55.9% |
| 1 -> 3 examples | +3.1% |
| 3 -> 5 examples | +2.8% |
| 5 -> 10 examples | +1.3% |
| 10 -> 50 examples | +4.4% |
The first example alone delivers a +56% average gain, by far the largest share of the total improvement.
Going from 0-shot to 50-shot multiplies your cost by 37 but adds only 28% more performance beyond what that first example already delivers.
On DBpedia (14 classes), however, multiple examples remain crucial:
- 0-shot: 0.321 F1
- 1-shot: 0.650 F1 (+102%)
- 50-shot: 0.937 F1 (+192% vs zero-shot)
Pragmatic rule: For simple tasks (<6 classes), 1-3 examples suffice. For complex taxonomies (>10 classes), invest in 20-50 examples.
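In practice, varying the number of examples is just a matter of how many labeled pairs you paste into the prompt. A minimal sketch, where the helper and its formatting are assumptions rather than our benchmark code:

```python
# Sketch: assembling a classification prompt from the first k labeled examples.
# The build_prompt helper and its formatting are illustrative assumptions.
def build_prompt(task_description, examples, query, k):
    """Return a few-shot prompt containing k in-context examples followed by the query."""
    blocks = [task_description]
    for text, label in examples[:k]:
        blocks.append(f'Text: "{text}"\nLabel: {label}')
    blocks.append(f'Text: "{query}"\nLabel:')
    return "\n\n".join(blocks)

examples = [
    ("The battery died after two days, very disappointed", "negative"),
    ("Arrived on time and works perfectly", "positive"),
    ("This movie was... something else", "negative"),
]

# k=1 for simple tasks, k=20-50 for large taxonomies like DBpedia.
print(build_prompt("Classify the sentiment as positive or negative.",
                   examples, "Great value for the price", k=1))
```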
The Decision Matrix
After 570 experiments, here's what we know:
Use GenAI when...
| Criterion | Example | Justification |
|---|---|---|
| Semantic task | Sentiment, emotion | The LLM understands meaning |
| Few labeled data | Prototype, MVP | 1 example = 90% of the way |
| Concept extraction | Keywords, entities | The LLM sees beyond n-grams |
| Latency not critical | Batch processing | 807x slower but more accurate |
Use sklearn when...
| Criterion | Example | Justification |
|---|---|---|
| Numerical prediction | Prices, scores | R² stayed positive in every test |
| Statistical pattern detection | Fraud, fake news | TF-IDF captures subtle signals |
| Massive volume | 1M+ predictions/day | Cost = $0 vs $500/day |
| Critical latency | Real-time | 807x faster |
The Cost of the Illusion
Let's talk money.
| Metric | sklearn | GenAI |
|---|---|---|
| Total benchmark cost | $0 | $9.39 |
| Total time | 8.1 minutes | 6,533 minutes |
| Speed ratio | 1x | 807x slower |
In production:
| Daily Volume | sklearn | GenAI (GPT-4o-mini) |
|---|---|---|
| 1,000 requests | ~$0 | ~$0.50 |
| 100,000 requests | ~$0 | ~$50 |
| 1,000,000 requests | ~$0 | ~$500 |
One million predictions per day: $0 vs $182,500/year.
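The arithmetic behind that figure, using the roughly $0.50 per 1,000 GPT-4o-mini requests observed above (actual pricing varies with prompt length and per-token rates):

```python
# Back-of-the-envelope annual cost, from ~$0.50 per 1,000 GPT-4o-mini requests observed in this benchmark.
cost_per_request = 0.50 / 1_000      # dollars per prediction
daily_volume = 1_000_000             # predictions per day
annual_cost = cost_per_request * daily_volume * 365
print(f"${annual_cost:,.0f} per year")  # $182,500 per year
```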
And that's just API cost. Add latency (user response time), external service dependency (outage = downtime), data confidentiality (your texts go to OpenAI).
Epilogue: The New Map
The myth "LLMs can do everything" is false. But reality is more interesting than a simple classical ML victory.
What We Learned
1. LLMs are champions of semantics, not numerics.
They understand meaning, nuances, context. They fail as soon as you ask for mathematical precision.
2. Sophistication doesn't mean suitability.
GPT-5-nano with its "reasoning" loses to GPT-4o-mini on simple tasks. The most advanced tool isn't always the right tool.
3. The first few-shot example is magical.
56% gain for a single example. It's the best ROI in all of Machine Learning.
4. Regression remains sklearn's undisputed domain.
R² = -6.5 billion isn't a bug. It's a fundamental characteristic of next-token language models applied where they don't belong.
5. GenAI's hidden costs are real.
807x slower. $182k/year for 1M daily predictions. External dependency. These factors often disappear from POC evaluations.
The Question to Ask
Before every project, ask yourself:
Does my task require understanding meaning, or predicting numbers?
If it's meaning: GenAI will probably excel.
If it's numbers: sklearn will probably win.
If it's both: build a hybrid architecture.
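What a hybrid architecture can look like in practice, as a minimal sketch; the TaskType split and the two predict callables are assumptions, not a prescribed design:

```python
# Sketch of a hybrid router: semantic tasks go to the LLM, numeric targets to the sklearn model.
# The TaskType enum and the predict callables are illustrative assumptions.
from enum import Enum

class TaskType(Enum):
    SEMANTIC = "semantic"  # sentiment, keyphrases, entities
    NUMERIC = "numeric"    # prices, scores, any regression target

def route(task_type: TaskType, payload, llm_predict, sklearn_predict):
    """Dispatch the payload to whichever system the benchmark says is better suited."""
    if task_type is TaskType.SEMANTIC:
        return llm_predict(payload)
    return sklearn_predict(payload)

# Usage (hypothetical callables):
#   route(TaskType.SEMANTIC, review_text, classify_with_gpt4o_mini, tfidf_logreg.predict)
#   route(TaskType.NUMERIC, house_features, classify_with_gpt4o_mini, random_forest.predict)
```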
Methodology
- Validation: Monte Carlo Cross-Validation, 30 iterations per configuration
- Statistical tests: Paired t-test, Wilcoxon signed-rank, Bootstrap CI (10,000 resamples); a minimal sketch of this comparison follows the list
- Metrics: F1-score (classification/extraction), R² (regression)
- sklearn baseline: TF-IDF + LogisticRegression (classification), RandomForest (regression)
- GenAI models: GPT-4o-mini, GPT-5-nano (reasoning medium), GPT-4.1-nano
- Few-shot: 20 examples by default, 0-50 examples analysis
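A minimal sketch of the per-configuration comparison (placeholder scores, bootstrap confidence interval omitted for brevity):

```python
# Sketch: 30 Monte Carlo CV iterations per system, then paired tests on the per-iteration scores.
# The score arrays are placeholders, not real benchmark results.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(42)
sklearn_scores = rng.normal(0.58, 0.02, size=30)  # placeholder F1 per iteration
genai_scores = rng.normal(0.96, 0.01, size=30)    # placeholder F1 per iteration

diff = sklearn_scores - genai_scores
t_stat, p_t = ttest_rel(sklearn_scores, genai_scores)
w_stat, p_w = wilcoxon(diff)
cohens_d = diff.mean() / diff.std(ddof=1)  # negative d = GenAI ahead, matching the convention used above

print(f"paired t-test p={p_t:.2e}, Wilcoxon p={p_w:.2e}, Cohen's d={cohens_d:.1f}")
```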
Acknowledged limitation: Our sklearn baseline is deliberately simple. More sophisticated approaches (sentence-transformers, XGBoost, fine-tuning) would likely narrow the gap on some tasks. This benchmark measures "GenAI vs accessible ML", not "GenAI vs SOTA".
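Concretely, the "accessible ML" baseline is nothing more sophisticated than this kind of pipeline (a minimal sketch with a placeholder corpus, not the benchmark code):

```python
# Minimal sketch of the classification baseline: TF-IDF features + Logistic Regression.
# The tiny corpus is a placeholder; the benchmark used the real datasets listed above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "WINNER!! Claim your free prize now",
    "Are we still on for lunch tomorrow?",
    "URGENT: verify your account immediately",
    "Can you send me the report by Friday?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["Free entry to win cash, reply now"]))  # likely flags this as spam (1)
```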
This Series
- LLMs for Classification: One Example is All You Need
- Why Classical ML Still Crushes GenAI at Regression
- The Myth Confronted with Reality (this article)
Classical ML isn't dead. It's never been more relevant. The real innovation isn't replacing sklearn with GPT — it's knowing when to use each.