<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mickaël Andrieu</title>
    <description>The latest articles on DEV Community by Mickaël Andrieu (@mickael__andrieu).</description>
    <link>https://dev.to/mickael__andrieu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3176944%2F397c029c-d8c8-4e9e-939a-c02f3b5d0404.png</url>
      <title>DEV Community: Mickaël Andrieu</title>
      <link>https://dev.to/mickael__andrieu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mickael__andrieu"/>
    <language>en</language>
    <item>
      <title>"LLMs Can Do Everything": Autopsy of a Myth</title>
      <dc:creator>Mickaël Andrieu</dc:creator>
      <pubDate>Tue, 06 Jan 2026 02:01:32 +0000</pubDate>
      <link>https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika</link>
      <guid>https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika</guid>
      <description>&lt;p&gt;&lt;em&gt;We tested this belief on 19 use cases, using 3 OpenAI GPT models and basic Machine Learning algorithms, for a total of 570 experiments. Here's the truth no one wants to hear.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Prologue: The Promise
&lt;/h2&gt;

&lt;p&gt;January 2026. In the hallways of tech companies, a conviction has settled in as self-evident: &lt;strong&gt;LLMs make classical Machine Learning obsolete&lt;/strong&gt;. Why bother with sklearn pipelines when GPT can do everything in one line of code?&lt;/p&gt;

&lt;p&gt;We wanted to test this hypothesis. Not with opinions. With data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The protocol&lt;/strong&gt;: 19 real-world use cases, from spam detection to housing price prediction. 3 OpenAI models (GPT-4o-mini, GPT-5-nano, GPT-4.1-nano). A deliberately simple sklearn baseline (TF-IDF + Logistic Regression, RandomForest). 30 iterations per experiment, Monte Carlo cross-validation, rigorous statistical tests.&lt;/p&gt;
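&lt;p&gt;To picture the baseline side of that protocol, here is a minimal sketch of a TF-IDF + Logistic Regression pipeline of the kind described; the toy texts and labels are ours, not the benchmark's data.&lt;/p&gt;

```python
# A deliberately simple sklearn baseline: TF-IDF features into Logistic
# Regression. The four toy SMS messages below are illustrative only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "free prize, click now",
    "see you at the meeting",
    "win cash today",
    "lunch tomorrow?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy data)

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(texts, labels)

# Likely flagged as spam, given the shared vocabulary with the spam examples.
print(baseline.predict(["claim your free cash prize"]))
```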

&lt;p&gt;The verdict? It will surprise you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymqz2g6ivwq4n3iufm37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymqz2g6ivwq4n3iufm37.png" alt="Effect Sizes Overview" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ACT I: THE ILLUSION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Expected Triumph
&lt;/h3&gt;

&lt;p&gt;Let's start with what works. And it works spectacularly well.&lt;/p&gt;

&lt;h4&gt;
  
  
  When the LLM Crushes Everything
&lt;/h4&gt;

&lt;p&gt;On sentiment analysis, GenAI doesn't just win. It &lt;strong&gt;dominates&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GPT-4o-mini&lt;/th&gt;
&lt;th&gt;Gain&lt;/th&gt;
&lt;th&gt;Effect Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/cardiffnlp/tweet_eval" rel="noopener noreferrer"&gt;Twitter Sentiment&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.367&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.690&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+88%&lt;/td&gt;
&lt;td&gt;d=-13.6 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/imdb" rel="noopener noreferrer"&gt;IMDB Reviews&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.784&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.938&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+20%&lt;/td&gt;
&lt;td&gt;d=-7.5 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/yassiracharki/Amazon_Reviews_for_Sentiment_Analysis_fine_grained_5_classes" rel="noopener noreferrer"&gt;Amazon Reviews&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.377&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.608&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+61%&lt;/td&gt;
&lt;td&gt;d=-10.5 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/Yelp/yelp_review_full" rel="noopener noreferrer"&gt;Yelp Reviews&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.428&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.665&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+55%&lt;/td&gt;
&lt;td&gt;d=-11.0 (massive)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A Cohen's d of -13.6 on Twitter Sentiment. For perspective: in social sciences, an effect is considered "large" when |d| &amp;gt; 0.8. Here, we're &lt;strong&gt;17 times beyond the threshold&lt;/strong&gt;.&lt;/p&gt;
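&lt;p&gt;As a reminder, Cohen's d is just the difference in mean scores divided by the pooled standard deviation. A sketch with invented per-iteration scores (the benchmark's raw scores are not reproduced here); because the iteration scores are tightly clustered, even a moderate mean gap yields a strongly negative d:&lt;/p&gt;

```python
# Cohen's d: difference of means over the pooled sample standard deviation.
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.stdev(a) ** 2
                  + (nb - 1) * statistics.stdev(b) ** 2) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Illustrative F1 samples across iterations (not the benchmark's data).
sklearn_f1 = [0.360, 0.370, 0.368, 0.365, 0.372]
llm_f1 = [0.690, 0.688, 0.692, 0.691, 0.689]

print(cohens_d(sklearn_f1, llm_f1))  # strongly negative
```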

&lt;p&gt;The LLM understands irony, sarcasm, cultural nuances. It knows that "This movie was... something else" isn't a compliment. TF-IDF only sees words.&lt;/p&gt;

&lt;h4&gt;
  
  
  Spam Detection: A Massacre
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SMS Spam Detection
------------------
sklearn (TF-IDF + LogReg) : 0.581 F1
GPT-4o-mini               : 0.965 F1

Difference : +66%
p-value    : 3.03e-26 (25 zeros after the decimal point)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With just one few-shot example, the LLM already reaches 0.90 F1. sklearn needs thousands of examples to approach this score.&lt;/p&gt;
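&lt;p&gt;The one-shot setup can be pictured like this; the prompt wording and labels below are illustrative, not the benchmark's actual prompts:&lt;/p&gt;

```python
# Assembling a one-shot classification prompt: task description, a single
# labeled example, then the text to classify. Wording here is our sketch.
def build_prompt(task, examples, text):
    lines = [task]
    for ex_text, ex_label in examples:
        lines.append(f"Text: {ex_text}")
        lines.append(f"Label: {ex_label}")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_prompt(
    "Classify each SMS as spam or ham.",
    [("WIN a free holiday, reply now!", "spam")],  # the single few-shot example
    "Are we still on for dinner tonight?",
)
print(prompt)
```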

&lt;h4&gt;
  
  
  Semantic Extraction: Where sklearn Surrenders
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GPT-4o-mini&lt;/th&gt;
&lt;th&gt;Multiplier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/kpcrowd" rel="noopener noreferrer"&gt;KPCrowd&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.030&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.419&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/duc2001" rel="noopener noreferrer"&gt;DUC2001&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.078&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.303&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/krapivin" rel="noopener noreferrer"&gt;Krapivin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.046&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.223&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On keyword extraction, GPT-4o-mini is &lt;strong&gt;14 times better&lt;/strong&gt; than the baseline. Cohen's d = -38.7. We're leaving the realm of statistics to enter the absurd.&lt;/p&gt;

&lt;p&gt;At this point in our analysis, the myth seemed to confirm itself. GenAI dominated 12 use cases out of 19. 63% win rate. Classical ML appeared doomed.&lt;/p&gt;

&lt;p&gt;And then we looked at the regression data.&lt;/p&gt;




&lt;h2&gt;
  
  
  ACT II: THE CONFRONTATION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The First Shock: &lt;a href="https://huggingface.co/datasets/GonzaloA/fake_news" rel="noopener noreferrer"&gt;Fake News&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Before even reaching regression, a first warning signal.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fake News Detection
-------------------
sklearn     : 0.884 F1
GPT-4o-mini : 0.822 F1

sklearn wins.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strange. Why does the LLM, which excels everywhere else in classification, fail here?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis&lt;/strong&gt;: Fake news detection relies less on semantic understanding than on &lt;strong&gt;subtle statistical patterns&lt;/strong&gt;. Word frequencies, unusual syntactic structures, stylistic markers. TF-IDF captures these signals better than an LLM looking for "meaning".&lt;/p&gt;

&lt;p&gt;First crack in the myth. But this was just an appetizer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cataclysm: Negative R² Scores
&lt;/h3&gt;

&lt;p&gt;Brace yourself. What follows is the heart of our discovery.&lt;/p&gt;

&lt;h4&gt;
  
  
  Housing Price Prediction
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn R²     : 0.710 (explains 71% of variance)
GPT-4o-mini R² : -6,536,549,708
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You read that right. &lt;strong&gt;Negative six and a half billion&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An R² of 0 means "as good as predicting the mean". A negative R² means "worse than the mean". An R² of -6.5 billion means the model produces predictions &lt;strong&gt;so aberrant&lt;/strong&gt; that squared residuals explode toward infinity.&lt;/p&gt;

&lt;p&gt;When we asked GPT-4o-mini to predict a housing price, it responded with things like "$45" or "$999,999,999,999". Not because it's stupid. Because it's not designed for this type of task.&lt;/p&gt;
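&lt;p&gt;The mechanics are easy to reproduce: R² compares squared residuals against the variance of the target, so one absurd prediction dominates everything. A sketch with invented housing prices:&lt;/p&gt;

```python
# R2 = 1 - SSE/SST: squared residuals grow quadratically with the error,
# so a single wild prediction sends the score hugely negative.
# All prices below are illustrative.
from sklearn.metrics import r2_score

y_true = [250_000, 310_000, 480_000, 520_000, 395_000]
sane = [260_000, 300_000, 470_000, 500_000, 400_000]
wild = [260_000, 300_000, 470_000, 45, 999_999_999_999]  # two absurd outputs

print(r2_score(y_true, sane))  # close to 1
print(r2_score(y_true, wild))  # hugely negative
```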

&lt;h4&gt;
  
  
  Wine Quality Prediction
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn R²     : 0.405
GPT-4o-mini R² : -379,022
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Negative three hundred seventy-nine thousand. To predict a wine score between 1 and 10.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Horror Table
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;sklearn R²&lt;/th&gt;
&lt;th&gt;GenAI R²&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.openml.org/d/42225" rel="noopener noreferrer"&gt;Diamond Price&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.926&lt;/td&gt;
&lt;td&gt;0.810&lt;/td&gt;
&lt;td&gt;GenAI correct but inferior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/VarunKumarGupta2003/Car-Price-Dataset" rel="noopener noreferrer"&gt;Car Price&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;-1.79&lt;/td&gt;
&lt;td&gt;GenAI failing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/hugginglearners/data-science-job-salaries" rel="noopener noreferrer"&gt;Salary&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.346&lt;/td&gt;
&lt;td&gt;0.208&lt;/td&gt;
&lt;td&gt;GenAI weak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/codesignal/wine-quality" rel="noopener noreferrer"&gt;Wine Quality&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.405&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-379,022&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Catastrophic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/gvlassis/california_housing" rel="noopener noreferrer"&gt;Housing Price&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.710&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-6,536,549,708&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Apocalyptic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;sklearn wins in regression: 5/5. That's 100%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4r55tgixhxxu0bk4xlom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4r55tgixhxxu0bk4xlom.png" alt="The Regression Wall" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Why This Disaster?
&lt;/h4&gt;

&lt;p&gt;LLMs are language models. They predict the &lt;strong&gt;most probable next token&lt;/strong&gt;, not a continuous numerical value.&lt;/p&gt;

&lt;p&gt;When you ask "What's the price of this house?", the LLM generates a &lt;strong&gt;linguistically plausible&lt;/strong&gt; response, not a &lt;strong&gt;mathematically correct&lt;/strong&gt; one. It produces numbers that "sound right": round amounts, typical prices it saw in its training corpus.&lt;/p&gt;

&lt;p&gt;But regression demands &lt;strong&gt;numerical precision&lt;/strong&gt;. A 10% error on a housing price is $50,000. A 1% error across 10,000 predictions is millions in cumulative error.&lt;/p&gt;

&lt;p&gt;The LLM can't do this. It's not a flaw. It's an &lt;strong&gt;architectural characteristic&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GPT-5-nano Paradox
&lt;/h3&gt;

&lt;p&gt;Here's an even more troubling discovery.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;th&gt;Win Rate vs sklearn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;63%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5-nano (reasoning)&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5-nano, with its additional "reasoning" capabilities, performs worse than GPT-4o-mini on our ML benchmarks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On &lt;a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14" rel="noopener noreferrer"&gt;DBpedia&lt;/a&gt; (14-class classification):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o-mini: &lt;strong&gt;0.964 F1&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;GPT-5-nano: 0.792 F1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "smarter" model loses by 17 points. How do we explain this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis&lt;/strong&gt;: GPT-5-nano's reasoning capabilities are optimized for complex tasks (mathematics, logic, planning). For standard text classification, this "reasoning" adds noise rather than value. The model "thinks too much" when it should just classify.&lt;/p&gt;

&lt;p&gt;More expensive doesn't mean better. More sophisticated doesn't mean more suitable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypkdyoo4x9tc2j2twfix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypkdyoo4x9tc2j2twfix.png" alt="The Reasoning Paradox" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ACT III: THE WISDOM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Law of the First Example
&lt;/h3&gt;

&lt;p&gt;Amid this confrontation, a ray of hope for GenAI practitioners.&lt;/p&gt;

&lt;p&gt;Our few-shot analysis reveals a striking law:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transition&lt;/th&gt;
&lt;th&gt;Average Gain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 -&amp;gt; 1 example&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+55.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 -&amp;gt; 3 examples&lt;/td&gt;
&lt;td&gt;+3.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 -&amp;gt; 5 examples&lt;/td&gt;
&lt;td&gt;+2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 -&amp;gt; 10 examples&lt;/td&gt;
&lt;td&gt;+1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 -&amp;gt; 50 examples&lt;/td&gt;
&lt;td&gt;+4.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The first example alone delivers a +55.9% average gain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Going from 0-shot to 50-shot multiplies your cost by 37x but only adds 28% additional performance beyond the first example.&lt;/p&gt;

&lt;p&gt;On DBpedia (14 classes), however, multiple examples remain crucial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0-shot: 0.321 F1&lt;/li&gt;
&lt;li&gt;1-shot: 0.650 F1 (+102%)&lt;/li&gt;
&lt;li&gt;50-shot: 0.937 F1 (+192% vs zero-shot)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pragmatic rule&lt;/strong&gt;: For simple tasks (&amp;lt;6 classes), 1-3 examples suffice. For complex taxonomies (&amp;gt;10 classes), invest in 20-50 examples.&lt;/p&gt;
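&lt;p&gt;Encoded as a hypothetical helper, the rule might look like this; the thresholds restate the text, while the middle tier and the function itself are our own sketch:&lt;/p&gt;

```python
# Restates the pragmatic rule above: under 6 classes, a few shots suffice;
# over 10 classes, invest in many. The in-between tier is our own choice.
def recommended_shots(n_classes):
    tiers = [(6, 3), (11, 10), (10**9, 50)]
    for limit, shots in tiers:
        if n_classes in range(limit):
            return shots

print(recommended_shots(3), recommended_shots(14))  # prints: 3 50
```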

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt8v2il0ku9sdsnwsdyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt8v2il0ku9sdsnwsdyp.png" alt="The First Example Curve" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Matrix
&lt;/h3&gt;

&lt;p&gt;After 570 experiments, here's what we know:&lt;/p&gt;

&lt;h4&gt;
  
  
  Use GenAI when...
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Justification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic task&lt;/td&gt;
&lt;td&gt;Sentiment, emotion&lt;/td&gt;
&lt;td&gt;The LLM understands meaning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few labeled data&lt;/td&gt;
&lt;td&gt;Prototype, MVP&lt;/td&gt;
&lt;td&gt;1 example = 90% of the way&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concept extraction&lt;/td&gt;
&lt;td&gt;Keywords, entities&lt;/td&gt;
&lt;td&gt;The LLM sees beyond n-grams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency not critical&lt;/td&gt;
&lt;td&gt;Batch processing&lt;/td&gt;
&lt;td&gt;807x slower but more accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Use sklearn when...
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Justification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Numerical prediction&lt;/td&gt;
&lt;td&gt;Prices, scores&lt;/td&gt;
&lt;td&gt;R² &amp;gt; 0 guaranteed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Statistical pattern detection&lt;/td&gt;
&lt;td&gt;Fraud, fake news&lt;/td&gt;
&lt;td&gt;TF-IDF captures subtle signals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Massive volume&lt;/td&gt;
&lt;td&gt;1M+ predictions/day&lt;/td&gt;
&lt;td&gt;Cost = $0 vs $500/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Critical latency&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;807x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvggmsmnobwo4r9up7g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvggmsmnobwo4r9up7g9.png" alt="The Decision Matrix" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cost of the Illusion
&lt;/h3&gt;

&lt;p&gt;Let's talk money.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total benchmark cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$9.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total time&lt;/td&gt;
&lt;td&gt;8.1 minutes&lt;/td&gt;
&lt;td&gt;6,533 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed ratio&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;td&gt;807x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In production:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Daily Volume&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GenAI (GPT-4o-mini)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1,000 requests&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;~$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000 requests&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;~$50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000,000 requests&lt;/td&gt;
&lt;td&gt;~$0&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One million predictions per day: $0 vs $182,500/year.&lt;/p&gt;
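&lt;p&gt;The yearly figure follows directly from the table, taking the assumed GPT-4o-mini rate of $0.50 per 1,000 requests:&lt;/p&gt;

```python
# Back-of-the-envelope check of the scaling above.
cost_per_1000 = 0.50          # assumed rate from the table
daily = 1_000_000 / 1000 * cost_per_1000
yearly = daily * 365

print(daily, yearly)  # 500.0 182500.0
```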

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gd01vp463xtpojycf2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gd01vp463xtpojycf2p.png" alt="Cost at Scale" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that's just API cost. Add latency (user response time), external service dependency (outage = downtime), data confidentiality (your texts go to OpenAI).&lt;/p&gt;




&lt;h3&gt;
  
  
  Epilogue: The New Map
&lt;/h3&gt;

&lt;p&gt;The myth "LLMs can do everything" is false. But reality is more interesting than a simple classical ML victory.&lt;/p&gt;

&lt;h4&gt;
  
  
  What We Learned
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. LLMs are champions of semantics, not numerics.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They understand meaning, nuances, context. They fail as soon as you ask for mathematical precision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Sophistication doesn't mean suitability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPT-5-nano with its "reasoning" loses to GPT-4o-mini on simple tasks. The most advanced tool isn't always the right tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The first few-shot example is magical.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;56% gain for a single example. It's the best ROI in all of Machine Learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Regression remains sklearn's undisputed domain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;R² = -6.5 billion isn't a bug. It's a fundamental characteristic of transformer architectures applied where they don't belong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. GenAI's hidden costs are real.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;807x slower. $182k/year for 1M daily predictions. External dependency. These factors often disappear from POC evaluations.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Question to Ask
&lt;/h4&gt;

&lt;p&gt;Before every project, ask yourself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does my task require understanding meaning, or predicting numbers?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it's meaning: GenAI will probably excel.&lt;br&gt;
If it's numbers: sklearn will probably win.&lt;br&gt;
If it's both: build a hybrid architecture.&lt;/p&gt;
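&lt;p&gt;A deliberately naive sketch of that decision; the branch labels are our own shorthand:&lt;/p&gt;

```python
# Restates the meaning-vs-numbers rule above as a routing function.
def choose_engine(needs_meaning, needs_numbers):
    if needs_meaning and needs_numbers:
        return "hybrid (LLM features feeding a sklearn model)"
    if needs_numbers:
        return "sklearn"
    return "GenAI"

print(choose_engine(True, False))  # GenAI
print(choose_engine(False, True))  # sklearn
```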




&lt;h3&gt;
  
  
  Methodology
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: Monte Carlo Cross-Validation, 30 iterations per configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical tests&lt;/strong&gt;: Paired t-test, Wilcoxon signed-rank, Bootstrap CI (10,000 resamples)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: F1-score (classification/extraction), R² (regression)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn baseline&lt;/strong&gt;: TF-IDF + LogisticRegression (classification), RandomForest (regression)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GenAI models&lt;/strong&gt;: GPT-4o-mini, GPT-5-nano (reasoning medium), GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot&lt;/strong&gt;: 20 examples by default, 0-50 examples analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Acknowledged limitation&lt;/strong&gt;: Our sklearn baseline is deliberately simple. More sophisticated approaches (sentence-transformers, XGBoost, fine-tuning) would likely narrow the gap on some tasks. This benchmark measures "GenAI vs accessible ML", not "GenAI vs SOTA".&lt;/p&gt;




&lt;h3&gt;
  
  
  This Series
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="//./01-one-shot-revolution.md"&gt;LLMs for Classification: One Example is All You Need&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//./02-classical-ml-crushes-regression.md"&gt;Why Classical ML Still Crushes GenAI at Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Myth Confronted with Reality&lt;/strong&gt; (this article)&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Classical ML isn't dead. It's never been more relevant. The real innovation isn't replacing sklearn with GPT — it's knowing when to use each.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>LLMs for Classification: One Example is All You Need</title>
      <dc:creator>Mickaël Andrieu</dc:creator>
      <pubDate>Tue, 06 Jan 2026 00:06:21 +0000</pubDate>
      <link>https://dev.to/mickael__andrieu/llms-for-classification-one-example-is-all-you-need-19al</link>
      <guid>https://dev.to/mickael__andrieu/llms-for-classification-one-example-is-all-you-need-19al</guid>
      <description>&lt;p&gt;&lt;em&gt;The first few-shot example delivers +55.9% gains. The next 49? Just +25% more.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx5d85lhritgbqkqqexc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx5d85lhritgbqkqqexc.png" alt="One-Shot Impact" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counter-Intuitive Discovery
&lt;/h2&gt;

&lt;p&gt;After running &lt;strong&gt;35 benchmark configurations&lt;/strong&gt; across 5 datasets with rigorous Monte Carlo cross-validation (30 iterations each), we discovered something that challenges conventional prompt engineering wisdom:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The first example you give a Large Language Model improves performance by +55.9% on average. The second? Just +3.1%.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This finding has massive implications for how we design GenAI pipelines — and how much we spend on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data: What We Tested
&lt;/h2&gt;

&lt;p&gt;We benchmarked GPT-4.1-nano across multiple tasks using publicly available datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Classes&lt;/th&gt;
&lt;th&gt;Zero-Shot F1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/ucirvine/sms_spam" rel="noopener noreferrer"&gt;SMS Spam&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Binary Classification&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.820&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/dair-ai/emotion" rel="noopener noreferrer"&gt;Emotion&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Multi-class&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0.344&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/mteb/tweet_sentiment_extraction" rel="noopener noreferrer"&gt;Twitter Sentiment&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sentiment&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.488&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/fancyzhx/dbpedia_14" rel="noopener noreferrer"&gt;DBpedia&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Multi-class&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/datasets/midas/inspec" rel="noopener noreferrer"&gt;Keyword Extraction (Inspec)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;0.155&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each configuration was tested with 0, 1, 3, 5, 10, 20, and 50 few-shot examples.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All datasets are open-source and available on &lt;a href="https://huggingface.co/datasets" rel="noopener noreferrer"&gt;Hugging Face Datasets&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
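&lt;p&gt;Sweeping the shot counts boils down to drawing k labeled examples per configuration. A sketch of the sampling step; the seeding scheme is ours, not necessarily the benchmark's:&lt;/p&gt;

```python
# Deterministic k-shot sampling so each configuration is reproducible.
import random

def sample_shots(pool, k, seed):
    rng = random.Random(seed)
    return rng.sample(pool, k)

# Illustrative labeled pool standing in for a real dataset.
pool = [("text %d" % i, "label") for i in range(100)]
for k in [0, 1, 3, 5, 10, 20, 50]:
    print(k, len(sample_shots(pool, k, seed=42)))
```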

&lt;h2&gt;
  
  
  The Diminishing Returns Curve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj3n1j0vv6yc25zf2yss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj3n1j0vv6yc25zf2yss.png" alt="Learning Curves" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pattern is striking across all datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transition&lt;/th&gt;
&lt;th&gt;Average Gain&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 → 1 shot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+55.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive improvement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 → 3 shots&lt;/td&gt;
&lt;td&gt;+3.1%&lt;/td&gt;
&lt;td&gt;Diminishing returns begin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 → 5 shots&lt;/td&gt;
&lt;td&gt;+2.8%&lt;/td&gt;
&lt;td&gt;Marginal gains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 → 10 shots&lt;/td&gt;
&lt;td&gt;+1.3%&lt;/td&gt;
&lt;td&gt;Near plateau&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 → 50 shots&lt;/td&gt;
&lt;td&gt;+8.8%*&lt;/td&gt;
&lt;td&gt;Only significant for multi-class&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Inflated by DBpedia (14 classes), which continues improving.&lt;/p&gt;
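&lt;p&gt;Treating the table's gains as compounding multipliers gives a rough sense of how much of the full 0-to-50-shot improvement the first example carries:&lt;/p&gt;

```python
# Average gains per transition, taken from the table above, read as
# multiplicative factors. The "share" interpretation is our own.
gains = [55.9, 3.1, 2.8, 1.3, 8.8]

total = 1.0
for g in gains:
    total = total * (1 + g / 100)

first_share = 0.559 / (total - 1)
print(round(first_share, 2))  # roughly 0.68 under this reading
```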

&lt;h3&gt;
  
  
  Why Does This Happen?
&lt;/h3&gt;

&lt;p&gt;The first example teaches the model three critical things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Output format&lt;/strong&gt; — How to structure the response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task semantics&lt;/strong&gt; — What "classify" or "extract" means in this context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Label space&lt;/strong&gt; — What categories or outputs are expected&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once these are established, additional examples provide marginal refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistical Proof: This Is Not Random
&lt;/h2&gt;

&lt;p&gt;We used Welch's ANOVA to verify these findings weren't due to chance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;F-statistic&lt;/th&gt;
&lt;th&gt;p-value&lt;/th&gt;
&lt;th&gt;Significant?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SMS Spam&lt;/td&gt;
&lt;td&gt;76.75&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotion&lt;/td&gt;
&lt;td&gt;73.89&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twitter Sentiment&lt;/td&gt;
&lt;td&gt;92.57&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DBpedia&lt;/td&gt;
&lt;td&gt;439.88&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keyword Extraction&lt;/td&gt;
&lt;td&gt;160.30&lt;/td&gt;
&lt;td&gt;&amp;lt;0.0001&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The effect is &lt;strong&gt;statistically significant&lt;/strong&gt; for ALL datasets. DBpedia shows the strongest effect (F=439.88), reflecting its large improvement range from zero-shot to 50-shot.&lt;/p&gt;
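&lt;p&gt;For readers who want to reproduce this kind of test: Welch's ANOVA is a one-way ANOVA that does not assume equal variances across groups. Here is a small self-contained implementation run on synthetic scores (the data below is made up for illustration, not the benchmark's):&lt;/p&gt;

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's one-way ANOVA (unequal variances); returns (F, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                                   # precision weights n_i / s_i^2
    grand = np.sum(w * m) / np.sum(w)           # weighted grand mean
    num = np.sum(w * (m - grand) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    df2 = (k ** 2 - 1) / (3 * tmp)              # Welch-Satterthwaite denominator df
    F = num / den
    return F, stats.f.sf(F, k - 1, df2)

# Synthetic stand-in for 30 Monte Carlo F1 scores per shot level.
rng = np.random.default_rng(0)
zero_shot = rng.normal(0.50, 0.05, 30)
one_shot = rng.normal(0.78, 0.04, 30)
F, p = welch_anova(zero_shot, one_shot)
print(F, p)
```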

&lt;h3&gt;
  
  
  Pairwise Comparisons (Bonferroni-corrected)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison&lt;/th&gt;
&lt;th&gt;Cohen's d&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0 vs 1&lt;/td&gt;
&lt;td&gt;d = -4.3&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Large&lt;/strong&gt; (always significant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 vs 3&lt;/td&gt;
&lt;td&gt;d = -0.5&lt;/td&gt;
&lt;td&gt;Medium (rarely significant)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 vs 5&lt;/td&gt;
&lt;td&gt;d = -0.2&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 vs 10&lt;/td&gt;
&lt;td&gt;d = -0.1&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: While the overall effect is significant, pairwise comparisons confirm that differences between adjacent levels (3 vs 5, 5 vs 10) are &lt;strong&gt;not statistically significant&lt;/strong&gt; after correction.&lt;/p&gt;
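&lt;p&gt;The pairwise procedure can be sketched as follows: Welch t-tests between adjacent shot levels, a Bonferroni-corrected threshold, and Cohen's d (pooled-SD convention; others exist) as effect size. The scores below are synthetic placeholders:&lt;/p&gt;

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation (one common convention)."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

rng = np.random.default_rng(1)
scores = {k: rng.normal(m, 0.04, 30) for k, m in
          [(0, 0.50), (1, 0.78), (3, 0.80), (5, 0.82)]}
pairs = [(0, 1), (1, 3), (3, 5)]
alpha = 0.05 / len(pairs)                 # Bonferroni-corrected threshold
for a, b in pairs:
    t, p = stats.ttest_ind(scores[a], scores[b], equal_var=False)
    significant = alpha > p
    print(a, "vs", b, "d =", round(cohens_d(scores[a], scores[b]), 2), significant)
```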

&lt;h2&gt;
  
  
  The Exception: Multi-Class Tasks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxqjuzka922vb5uat3g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxqjuzka922vb5uat3g8.png" alt="DBpedia Exception" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DBpedia (14 classes) tells a different story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Few-shots&lt;/th&gt;
&lt;th&gt;F1-Score&lt;/th&gt;
&lt;th&gt;Gain vs Zero-Shot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.321&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.650&lt;/td&gt;
&lt;td&gt;+102.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.760&lt;/td&gt;
&lt;td&gt;+136.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.876&lt;/td&gt;
&lt;td&gt;+173.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;0.937&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+192.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The rule of thumb&lt;/strong&gt;: For tasks with &amp;gt;10 classes, plan for ~3-4 examples per class.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ROI Reality Check
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaiwsnx1fdhyknadayyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaiwsnx1fdhyknadayyj.png" alt="Cost vs Performance" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what nobody talks about — the economics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Performance Gain&lt;/th&gt;
&lt;th&gt;Cost Multiplier&lt;/th&gt;
&lt;th&gt;ROI Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1-shot&lt;/td&gt;
&lt;td&gt;+53.2%&lt;/td&gt;
&lt;td&gt;1.75x&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.4&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-shot&lt;/td&gt;
&lt;td&gt;+59.0%&lt;/td&gt;
&lt;td&gt;3.25x&lt;/td&gt;
&lt;td&gt;18.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5-shot&lt;/td&gt;
&lt;td&gt;+63.5%&lt;/td&gt;
&lt;td&gt;4.75x&lt;/td&gt;
&lt;td&gt;13.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10-shot&lt;/td&gt;
&lt;td&gt;+66.9%&lt;/td&gt;
&lt;td&gt;8.5x&lt;/td&gt;
&lt;td&gt;7.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50-shot&lt;/td&gt;
&lt;td&gt;+81.1%&lt;/td&gt;
&lt;td&gt;37.5x&lt;/td&gt;
&lt;td&gt;2.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The verdict&lt;/strong&gt;: 1-shot delivers the best ROI by far. Going from 1 to 50 examples costs about 21x more for only ~28 additional percentage points of performance gain.&lt;/p&gt;
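&lt;p&gt;The ROI column is not defined explicitly, but it appears to be the performance gain divided by the cost multiplier. A quick sanity check reproduces the reported scores:&lt;/p&gt;

```python
# Reproducing the ROI column: gain (%) divided by cost multiplier.
strategies = {          # shots: (performance gain in %, cost multiplier)
    1: (53.2, 1.75),
    3: (59.0, 3.25),
    5: (63.5, 4.75),
    10: (66.9, 8.5),
    50: (81.1, 37.5),
}
roi = {shots: round(gain / cost, 1) for shots, (gain, cost) in strategies.items()}
print(roi)
```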

&lt;h2&gt;
  
  
  Practical Recommendations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Decision Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Task&lt;/th&gt;
&lt;th&gt;Recommended Examples&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Binary classification&lt;/td&gt;
&lt;td&gt;3-5&lt;/td&gt;
&lt;td&gt;Plateau at 5, marginal gains after&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentiment (3 classes)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Very early plateau, oscillation after&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emotion (6 classes)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Early saturation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-class (&amp;gt;10)&lt;/td&gt;
&lt;td&gt;20-50&lt;/td&gt;
&lt;td&gt;Continues improving significantly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Benefits from format examples&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Invest in your first example&lt;/strong&gt; — It's worth 55.9% of your total potential improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop at 3-5 for simple tasks&lt;/strong&gt; — Binary and low-cardinality multi-class plateau early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale up only for complex taxonomies&lt;/strong&gt; — &amp;gt;10 classes benefit from 20-50 examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure, don't assume&lt;/strong&gt; — Run your own benchmarks with Monte Carlo cross-validation&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Protocol&lt;/strong&gt;: Monte Carlo Cross-Validation (30 random train/test splits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model&lt;/strong&gt;: GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: F1-score (macro-averaged for multi-class)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical tests&lt;/strong&gt;: Welch's ANOVA, Bonferroni-corrected post-hoc tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effect size&lt;/strong&gt;: Cohen's d&lt;/li&gt;
&lt;/ul&gt;
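&lt;p&gt;The Monte Carlo cross-validation protocol above maps directly onto sklearn's &lt;code&gt;ShuffleSplit&lt;/code&gt;. Here is a minimal sketch using a toy dataset and a simple baseline model as stand-ins for the benchmark's actual data and pipeline:&lt;/p&gt;

```python
# Monte Carlo cross-validation sketch: 30 random train/test splits,
# macro-averaged F1 per split. Dataset and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=500, n_classes=3,
                           n_informative=5, random_state=0)
splitter = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)

scores = []
for train_idx, test_idx in splitter.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx]),
                           average="macro"))
print(np.mean(scores), np.std(scores))
```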




&lt;p&gt;&lt;em&gt;Next in this series: "&lt;a href="https://dev.to/mickael__andrieu/why-classical-ml-still-crushes-genai-at-regression-5l5"&gt;Why Classical ML Still Crushes GenAI at Regression&lt;/a&gt;" — The surprising tasks where sklearn wins 100% of the time.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Classical ML Still Crushes GenAI at Regression</title>
      <dc:creator>Mickaël Andrieu</dc:creator>
      <pubDate>Mon, 05 Jan 2026 23:45:36 +0000</pubDate>
      <link>https://dev.to/mickael__andrieu/why-classical-ml-still-crushes-genai-at-regression-5l5</link>
      <guid>https://dev.to/mickael__andrieu/why-classical-ml-still-crushes-genai-at-regression-5l5</guid>
      <description>&lt;p&gt;&lt;em&gt;The surprising benchmark results that every ML engineer should know before using LLMs for numerical predictions&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0rjtrp0jcesfiz9le27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0rjtrp0jcesfiz9le27.png" alt="Regression Benchmark" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;In the age of GPT-5 and advanced reasoning models, I ran a comprehensive benchmark comparing &lt;strong&gt;Classical ML (sklearn)&lt;/strong&gt; against &lt;strong&gt;Generative AI&lt;/strong&gt; across 19 use cases. The results for regression tasks were... humbling for GenAI.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Classical ML wins 100% of regression tasks.&lt;/strong&gt; Not 90%. Not 95%. One hundred percent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And it's not even close.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data: What I Found
&lt;/h2&gt;

&lt;p&gt;I tested three state-of-the-art LLMs against sklearn's RandomForestRegressor across 5 regression datasets:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;sklearn R²&lt;/th&gt;
&lt;th&gt;GPT-4o-mini R²&lt;/th&gt;
&lt;th&gt;GPT-5-nano R²&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Car Price&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;-3.498&lt;/td&gt;
&lt;td&gt;-5.172&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diamond Price&lt;/td&gt;
&lt;td&gt;0.926&lt;/td&gt;
&lt;td&gt;0.806&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing Price&lt;/td&gt;
&lt;td&gt;0.710&lt;/td&gt;
&lt;td&gt;-6.78B&lt;/td&gt;
&lt;td&gt;-29.27B&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wine Quality&lt;/td&gt;
&lt;td&gt;0.405&lt;/td&gt;
&lt;td&gt;-1.33M&lt;/td&gt;
&lt;td&gt;-1.631&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Salary Prediction&lt;/td&gt;
&lt;td&gt;0.346&lt;/td&gt;
&lt;td&gt;0.184&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;sklearn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Wait, negative R²?&lt;/strong&gt; Yes. A negative R² means the model performs &lt;em&gt;worse than simply predicting the mean&lt;/em&gt;. The LLMs aren't just losing — they're producing predictions that are mathematically worse than doing nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Catastrophe
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What R² Actually Means
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;R² = 1.0&lt;/strong&gt;: Perfect predictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R² = 0.0&lt;/strong&gt;: Predictions as good as the mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R² &amp;lt; 0&lt;/strong&gt;: Predictions &lt;em&gt;worse&lt;/em&gt; than the mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When GPT-4o-mini produces an R² of &lt;strong&gt;-6.78 billion&lt;/strong&gt; on housing prices, it means the model's predictions are so far off that they add massive error compared to simply guessing the average price every time.&lt;/p&gt;
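&lt;p&gt;You can see the definition in action with sklearn's &lt;code&gt;r2_score&lt;/code&gt; on toy numbers (these are made-up prices, not the benchmark data):&lt;/p&gt;

```python
# Demonstrating R² = 0 for mean-prediction and R² &amp;lt; 0 for wild predictions.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([250_000, 310_000, 480_000, 520_000])
mean_guess = np.full_like(y_true, y_true.mean(), dtype=float)
wild_guess = np.array([5_000_000, 90_000, 2_000_000, 10_000])

print(r2_score(y_true, mean_guess))  # 0.0: exactly as good as the mean
print(r2_score(y_true, wild_guess))  # large negative: worse than the mean
```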

&lt;h3&gt;
  
  
  The Scale of the Problem
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foasbtzuzpvnqeo2m0sii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foasbtzuzpvnqeo2m0sii.png" alt="Error Distribution" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's visualize what "negative R² in the billions" actually means:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GPT-4o-mini&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Housing RMSE&lt;/td&gt;
&lt;td&gt;~$45,000&lt;/td&gt;
&lt;td&gt;~$8.2 billion&lt;/td&gt;
&lt;td&gt;LLM errors are 180,000x larger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Car Price RMSE&lt;/td&gt;
&lt;td&gt;~$2,100&lt;/td&gt;
&lt;td&gt;~$18,500&lt;/td&gt;
&lt;td&gt;LLM errors are 9x larger&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wine Quality RMSE&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;1,152&lt;/td&gt;
&lt;td&gt;LLM errors are 1,772x larger&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why LLMs Fail at Regression
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Token Generation ≠ Numerical Reasoning
&lt;/h3&gt;

&lt;p&gt;LLMs generate text token by token. When asked to predict "$347,500", they're essentially:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deciding "3" is the first digit&lt;/li&gt;
&lt;li&gt;Then "4" seems reasonable&lt;/li&gt;
&lt;li&gt;Then "7", etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is fundamentally different from computing a weighted sum of features, which is what regression actually requires.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. No Gradient Optimization
&lt;/h3&gt;

&lt;p&gt;Classical ML models like RandomForest are optimized to minimize prediction error through mathematical optimization. LLMs are optimized to predict the next token in a sequence — a completely different objective.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scale Sensitivity
&lt;/h3&gt;

&lt;p&gt;LLMs have no inherent understanding of numerical scale. The difference between $100,000 and $1,000,000 is just different tokens to an LLM, but it's a massive error in regression terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;To ensure a fair comparison, I used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monte Carlo Cross-Validation&lt;/strong&gt;: 30 random train/test splits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Tests&lt;/strong&gt;: Paired t-test, Wilcoxon signed-rank, Bootstrap CI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Effect Size&lt;/strong&gt;: Cohen's d for interpretation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn baseline&lt;/strong&gt;: TF-IDF + RandomForestRegressor (simple, reproducible)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All differences were &lt;strong&gt;statistically significant&lt;/strong&gt; (p &amp;lt; 0.05) with &lt;strong&gt;large effect sizes&lt;/strong&gt; (Cohen's d &amp;gt; 0.8).&lt;/p&gt;

&lt;h2&gt;
  
  
  But What About Few-Shot Learning?
&lt;/h2&gt;

&lt;p&gt;I tested whether more examples could help LLMs with regression. The answer: &lt;strong&gt;marginally, but the results remain catastrophic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqklixfxs4ibve3nms20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqklixfxs4ibve3nms20.png" alt="Few-Shot Regression" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;5-shot R²&lt;/th&gt;
&lt;th&gt;20-shot R²&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;th&gt;Still Negative?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Car Price&lt;/td&gt;
&lt;td&gt;-3.498&lt;/td&gt;
&lt;td&gt;-1.794&lt;/td&gt;
&lt;td&gt;+48.7%&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wine Quality&lt;/td&gt;
&lt;td&gt;-1.33M&lt;/td&gt;
&lt;td&gt;-379K&lt;/td&gt;
&lt;td&gt;+71.5%&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Housing Price&lt;/td&gt;
&lt;td&gt;-6.78B&lt;/td&gt;
&lt;td&gt;-6.54B&lt;/td&gt;
&lt;td&gt;+3.5%&lt;/td&gt;
&lt;td&gt;YES&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More examples make LLM predictions "less catastrophically wrong" — but they remain worse than useless compared to classical ML.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost Dimension
&lt;/h2&gt;

&lt;p&gt;Beyond accuracy, there's a massive cost and speed difference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;sklearn&lt;/th&gt;
&lt;th&gt;GenAI&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;8.1 min&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference time&lt;/td&gt;
&lt;td&gt;8.1 min total&lt;/td&gt;
&lt;td&gt;6,533 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;807x slower&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0 (local)&lt;/td&gt;
&lt;td&gt;$9.39&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Infinite&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You're paying 807x more in time (and real money) for predictions that are mathematically worse than guessing the average.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What: The Decision Framework
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajywv8wd9z9x4zs13stc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajywv8wd9z9x4zs13stc.png" alt="Decision Framework" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Classical ML (sklearn) When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your target variable is &lt;strong&gt;continuous/numerical&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;regression&lt;/strong&gt; predictions (price, quantity, score)&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;structured tabular data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters&lt;/strong&gt; (&amp;lt;100ms response time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost matters&lt;/strong&gt; (high volume predictions)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Consider GenAI When:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Your task involves &lt;strong&gt;natural language understanding&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You're doing &lt;strong&gt;sentiment analysis&lt;/strong&gt; or &lt;strong&gt;emotion detection&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;zero-shot generalization&lt;/strong&gt; on text&lt;/li&gt;
&lt;li&gt;You're doing &lt;strong&gt;information extraction&lt;/strong&gt; from unstructured text&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;If you must use LLMs in a pipeline that involves numerical predictions, consider these patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: LLM for Features, sklearn for Prediction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text Input → LLM extracts features → sklearn predicts value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the LLM to convert unstructured text into structured features, then let sklearn do the numerical prediction.&lt;/p&gt;
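&lt;p&gt;A hedged sketch of this pattern, with the LLM call replaced by a hypothetical stand-in function (a real implementation would prompt the model to return a JSON object of features):&lt;/p&gt;

```python
# Pattern 1 sketch: text features via a (mocked) LLM step, numeric prediction
# via sklearn. extract_features and its feature names are hypothetical.
from sklearn.ensemble import RandomForestRegressor

def extract_features(description: str) -> list:
    """Stand-in for an LLM call returning [bedrooms, sqft, has_garage]."""
    words = description.lower()
    return [
        3.0 if "3 bed" in words else 2.0,
        1500.0 if "spacious" in words else 900.0,
        1.0 if "garage" in words else 0.0,
    ]

# sklearn trains on structured features (toy data), never on raw text.
X = [[2, 900, 0], [3, 1500, 1], [4, 2100, 1], [2, 800, 0]]
y = [210_000, 340_000, 455_000, 195_000]
model = RandomForestRegressor(random_state=0).fit(X, y)

features = extract_features("Spacious 3 bed home with garage")
print(model.predict([features])[0])
```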

&lt;h3&gt;
  
  
  Pattern 2: LLM for Binning, then Aggregation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → LLM classifies into bins → Map bins to numerical ranges
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of asking "What is the price?", ask "Is this low, medium, or high priced?" — a classification task where LLMs perform better.&lt;/p&gt;
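&lt;p&gt;Sketched in code, with the LLM call again mocked out (the bin labels and price ranges are illustrative):&lt;/p&gt;

```python
# Pattern 2 sketch: the LLM only picks a bin label (a classification task),
# and deterministic code maps bins to numeric ranges. llm_pick_bin is a
# hypothetical stand-in for the actual model call.
BIN_RANGES = {"low": (0, 15_000), "medium": (15_000, 40_000), "high": (40_000, 120_000)}

def llm_pick_bin(description: str) -> str:
    """Stand-in: a real call would ask 'Is this low, medium, or high priced?'"""
    return "medium"

def estimate_price(description: str) -> float:
    lo, hi = BIN_RANGES[llm_pick_bin(description)]
    return (lo + hi) / 2            # midpoint of the bin as the point estimate

print(estimate_price("2018 sedan, 60k miles, one owner"))
```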

&lt;h3&gt;
  
  
  Pattern 3: LLM as Sanity Checker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn prediction → LLM validates → Final output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use sklearn for the prediction, then optionally use an LLM to flag predictions that seem unreasonable given the context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Never use LLMs for regression&lt;/strong&gt; — You'll get R² scores worse than always predicting the mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn dominates numerical prediction&lt;/strong&gt; — With 100% win rate in our benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More examples don't fix the problem&lt;/strong&gt; — Even 20-shot learning produces negative R²&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-performance ratio is catastrophic&lt;/strong&gt; — 807x slower for worse results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid approaches exist&lt;/strong&gt; — Use LLMs for what they're good at (text), sklearn for numbers&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLMs are poets, not accountants.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They excel at understanding language, sentiment, and nuance. They fail spectacularly at the mathematical precision required for regression. Know your tools, and use them where they shine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Methodology Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Datasets&lt;/strong&gt;: Car Price, Diamond Price, Housing Price, Wine Quality, Salary Prediction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sklearn Model&lt;/strong&gt;: RandomForestRegressor with default parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM Models&lt;/strong&gt;: GPT-4o-mini, GPT-5-nano (reasoning), GPT-4.1-nano&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: 30-iteration Monte Carlo Cross-Validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: R² (coefficient of determination), RMSE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistical Tests&lt;/strong&gt;: Paired t-test, Wilcoxon, Bootstrap CI (10,000 resamples)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Next in this series: &lt;a href="https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika"&gt;"LLMs Can Do Everything": Autopsy of a Myth&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
