We tested this belief on 19 use cases, using 3 OpenAI GPT models and basic Machine Learning algorithms, for a total of 570 experiments. Here's the truth no one wants to hear.
Prologue: The Promise
January 2026. In the hallways of tech companies, a conviction has settled in as self-evident: LLMs make classical Machine Learning obsolete. Why bother with sklearn pipelines when GPT can do everything in one line of code?
We wanted to test this hypothesis. Not with opinions. With data.
The protocol: 19 real-world use cases, from spam detection to housing price prediction. 3 OpenAI models (GPT-4o-mini, GPT-5-nano, GPT-4.1-nano). A deliberately simple sklearn baseline (TF-IDF + Logistic Regression for classification, RandomForest for regression). 30 iterations per experiment, Monte Carlo cross-validation, rigorous statistical tests.
The verdict? It will surprise you.
ACT I: THE ILLUSION
The Expected Triumph
Let's start with what works. And it works spectacularly well.
When the LLM Crushes Everything
On sentiment analysis, GenAI doesn't just win. It dominates.
| Dataset | sklearn | GPT-4o-mini | Gain | Effect Size |
|---|---|---|---|---|
| Twitter Sentiment | 0.367 | 0.690 | +88% | d=-13.6 (massive) |
| IMDB Reviews | 0.784 | 0.938 | +20% | d=-7.5 (massive) |
| Amazon Reviews | 0.377 | 0.608 | +61% | d=-10.5 (massive) |
| Yelp Reviews | 0.428 | 0.665 | +55% | d=-11.0 (massive) |
A Cohen's d of -13.6 on Twitter Sentiment. For perspective: in social sciences, an effect is considered "large" when |d| > 0.8. Here, we're 17 times beyond the threshold.
The LLM understands irony, sarcasm, cultural nuances. It knows that "This movie was... something else" isn't a compliment. TF-IDF only sees words.
Spam Detection: A Massacre
SMS Spam Detection
------------------
sklearn (TF-IDF + LogReg) : 0.581 F1
GPT-4o-mini : 0.965 F1
Difference : +66%
p-value                   : 3.03e-26 (25 zeros after the decimal point)
With just one few-shot example, the LLM already reaches 0.90 F1. sklearn needs thousands of examples to approach this score.
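For concreteness, here is a minimal sketch of what such a one-shot call can look like with the OpenAI Python SDK. The system prompt, the in-context example, and the test SMS are illustrative assumptions, not our exact benchmark prompts:

```python
# Sketch of a one-shot spam classification call with the OpenAI Python SDK (openai >= 1.x).
# The prompt wording, the in-context example, and the test SMS are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

one_shot_example = 'SMS: "WINNER!! You have been selected for a free prize, call now" -> spam'

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system", "content": "Classify each SMS as 'spam' or 'ham'. Answer with one word."},
        {"role": "user", "content": one_shot_example
                                    + '\n\nSMS: "Congratulations, claim your reward at this link" ->'},
    ],
)
print(response.choices[0].message.content)  # expected: spam
```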
Semantic Extraction: Where sklearn Surrenders
| Benchmark | sklearn | GPT-4o-mini | Multiplier |
|---|---|---|---|
| KPCrowd | 0.030 | 0.419 | x14 |
| DUC2001 | 0.078 | 0.303 | x4 |
| Krapivin | 0.046 | 0.223 | x5 |
On keyword extraction, GPT-4o-mini is 14 times better than the baseline. Cohen's d = -38.7. We're leaving the realm of statistics to enter the absurd.
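To make the comparison concrete, here is a rough sketch of both approaches side by side. The document, the prompt wording, and the "top TF-IDF terms" shortcut are illustrative assumptions, not our benchmark implementation:

```python
# Sketch: keyword extraction with an LLM vs. a crude "top TF-IDF terms" baseline.
# The document, the prompt, and the 5-term cutoff are illustrative assumptions.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer

document = ("Transformer-based language models now dominate most natural language "
            "processing benchmarks, from classification to keyphrase extraction.")

# Baseline: the highest-weighted TF-IDF terms of the document.
vectorizer = TfidfVectorizer(stop_words="english")
weights = vectorizer.fit_transform([document]).toarray()[0]
terms = vectorizer.get_feature_names_out()
print("TF-IDF keywords:", [terms[i] for i in weights.argsort()[::-1][:5]])

# LLM: ask for keyphrases directly.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[{"role": "user",
               "content": f"Extract the 5 most important keyphrases, comma-separated:\n\n{document}"}],
)
print("LLM keywords:", response.choices[0].message.content)
```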
At this point in our analysis, the myth seemed to confirm itself. GenAI dominated 12 use cases out of 19. 63% win rate. Classical ML appeared doomed.
And then we looked at the regression data.
ACT II: THE CONFRONTATION
The First Shock: Fake News
Before even reaching regression, a first warning signal.
Fake News Detection
-------------------
sklearn : 0.884 F1
GPT-4o-mini : 0.822 F1
sklearn wins.
Strange. Why does the LLM, which excels everywhere else in classification, fail here?
Hypothesis: Fake news detection relies less on semantic understanding than on subtle statistical patterns. Word frequencies, unusual syntactic structures, stylistic markers. TF-IDF captures these signals better than an LLM looking for "meaning".
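One way to see what this kind of shallow signal looks like: fit a TF-IDF + Logistic Regression model and inspect which terms it weights toward the "fake" class. The tiny corpus below is made up for illustration, not the fake-news benchmark data:

```python
# Sketch: the lexical/stylistic signal that TF-IDF + Logistic Regression picks up.
# The four headlines and their labels are made up, not the benchmark dataset.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "SHOCKING!!! You won't believe what they found",
    "BREAKING: miracle cure doctors don't want you to know",
    "The city council approved the budget on Tuesday",
    "Quarterly earnings rose 3 percent, the company reported",
]
labels = [1, 1, 0, 0]  # 1 = fake, 0 = real

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

terms = vectorizer.get_feature_names_out()
most_fake = np.argsort(clf.coef_[0])[-5:]  # terms weighted most toward the "fake" class
print([terms[i] for i in most_fake])
```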
First crack in the myth. But this was just an appetizer.
The Cataclysm: Negative R² Scores
Brace yourself. What follows is the heart of our discovery.
Housing Price Prediction
sklearn R² : 0.710 (explains 71% of variance)
GPT-4o-mini R² : -6,536,549,708
You read that right. Negative six and a half billion.
An R² of 0 means "no better than always predicting the mean". A negative R² means "worse than the mean". An R² of -6.5 billion means the predictions are so aberrant that the sum of squared residuals is billions of times larger than the natural variance of the prices.
When we asked GPT-4o-mini to predict a housing price, it responded with things like "$45" or "$999,999,999,999". Not because it's stupid. Because it's not designed for this type of task.
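Recall that R² = 1 - SS_res / SS_tot: a single absurd prediction inflates SS_res and drags the score arbitrarily far below zero. A tiny illustration with made-up prices (not our benchmark data):

```python
# Illustration: one wild prediction is enough to push R² into astronomically negative territory.
# The prices below are made up for the example.
from sklearn.metrics import r2_score

y_true = [250_000, 310_000, 480_000, 520_000, 610_000]     # actual prices
y_llm  = [260_000, 300_000, 470_000, 45, 999_999_999_999]  # three plausible guesses, then "$45" and "$999,999,999,999"

print(r2_score(y_true, y_llm))  # on the order of -1e13: the squared residual of one wild guess dwarfs everything
```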
Wine Quality Prediction
sklearn R² : 0.405
GPT-4o-mini R² : -379,022
Negative three hundred seventy-nine thousand. To predict a wine score between 1 and 10.
The Horror Table
| Dataset | sklearn R² | GenAI R² | Interpretation |
|---|---|---|---|
| Diamond Price | 0.926 | 0.810 | GenAI correct but inferior |
| Car Price | 0.802 | -1.79 | GenAI failing |
| Salary | 0.346 | 0.208 | GenAI weak |
| Wine Quality | 0.405 | -379,022 | Catastrophic |
| Housing Price | 0.710 | -6,536,549,708 | Apocalyptic |
sklearn wins in regression: 5/5. That's 100%.
Why This Disaster?
LLMs are language models. They predict the most probable next token, not a continuous numerical value.
When you ask "What's the price of this house?", the LLM generates a linguistically plausible response, not a mathematically correct one. It produces numbers that "sound right": round amounts, typical prices it saw in its training corpus.
But regression demands numerical precision. A 10% error on a $500,000 house is $50,000. A 1% average error across 10,000 such predictions adds up to millions of dollars.
The LLM can't do this. It's not a flaw. It's an architectural characteristic.
The GPT-5-nano Paradox
Here's an even more troubling discovery.
| Model | Relative Cost | Win Rate vs sklearn |
|---|---|---|
| GPT-4o-mini | $ | 63% |
| GPT-5-nano (reasoning) | $$$ | 61% |
| GPT-4.1-nano | $ | 47% |
GPT-5-nano, with its additional "reasoning" capabilities, performs worse than GPT-4o-mini on our ML benchmarks.
On DBpedia (14-class classification):
- GPT-4o-mini: 0.964 F1
- GPT-5-nano: 0.792 F1
The "smarter" model loses by 17 points. How do we explain this?
Hypothesis: GPT-5-nano's reasoning capabilities are optimized for complex tasks (mathematics, logic, planning). For standard text classification, this "reasoning" adds noise rather than value. The model "thinks too much" when it should just classify.
More expensive doesn't mean better. More sophisticated doesn't mean more suitable.
ACT III: THE WISDOM
The Law of the First Example
Amid this confrontation, a ray of hope for GenAI practitioners.
Our few-shot analysis reveals a striking law:
| Transition | Average Gain |
|---|---|
| 0 -> 1 example | +55.9% |
| 1 -> 3 examples | +3.1% |
| 3 -> 5 examples | +2.8% |
| 5 -> 10 examples | +1.3% |
| 10 -> 50 examples | +4.4% |
The first example alone delivers a +56% average gain, by far the largest share of the total improvement.
Going from 0-shot to 50-shot multiplies your cost by 37 but adds only 28% more performance beyond what that first example already delivers.
On DBpedia (14 classes), however, multiple examples remain crucial:
- 0-shot: 0.321 F1
- 1-shot: 0.650 F1 (+102%)
- 50-shot: 0.937 F1 (+192% vs zero-shot)
Pragmatic rule: For simple tasks (<6 classes), 1-3 examples suffice. For complex taxonomies (>10 classes), invest in 20-50 examples.
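In practice, varying the number of examples is just a matter of how many labeled pairs you paste into the prompt. A minimal sketch, where the helper and its formatting are assumptions rather than our benchmark code:

```python
# Sketch: assembling a classification prompt from the first k labeled examples.
# The build_prompt helper and its formatting are illustrative assumptions.
def build_prompt(task_description, examples, query, k):
    """Return a few-shot prompt containing k in-context examples followed by the query."""
    blocks = [task_description]
    for text, label in examples[:k]:
        blocks.append(f'Text: "{text}"\nLabel: {label}')
    blocks.append(f'Text: "{query}"\nLabel:')
    return "\n\n".join(blocks)

examples = [
    ("The battery died after two days, very disappointed", "negative"),
    ("Arrived on time and works perfectly", "positive"),
    ("This movie was... something else", "negative"),
]

# k=1 for simple tasks, k=20-50 for large taxonomies like DBpedia.
print(build_prompt("Classify the sentiment as positive or negative.",
                   examples, "Great value for the price", k=1))
```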
The Decision Matrix
After 570 experiments, here's what we know:
Use GenAI when...
| Criterion | Example | Justification |
|---|---|---|
| Semantic task | Sentiment, emotion | The LLM understands meaning |
| Few labeled data | Prototype, MVP | 1 example = 90% of the way |
| Concept extraction | Keywords, entities | The LLM sees beyond n-grams |
| Latency not critical | Batch processing | 807x slower but more accurate |
Use sklearn when...
| Criterion | Example | Justification |
|---|---|---|
| Numerical prediction | Prices, scores | R² stayed positive in every test |
| Statistical pattern detection | Fraud, fake news | TF-IDF captures subtle signals |
| Massive volume | 1M+ predictions/day | Cost = $0 vs $500/day |
| Critical latency | Real-time | 807x faster |
The Cost of the Illusion
Let's talk money.
| Metric | sklearn | GenAI |
|---|---|---|
| Total benchmark cost | $0 | $9.39 |
| Total time | 8.1 minutes | 6,533 minutes |
| Speed ratio | 1x | 807x slower |
In production:
| Daily Volume | sklearn | GenAI (GPT-4o-mini) |
|---|---|---|
| 1,000 requests | ~$0 | ~$0.50 |
| 100,000 requests | ~$0 | ~$50 |
| 1,000,000 requests | ~$0 | ~$500 |
One million predictions per day: $0 vs $182,500/year.
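The arithmetic behind that figure, using the roughly $0.50 per 1,000 GPT-4o-mini requests observed above (actual pricing varies with prompt length and per-token rates):

```python
# Back-of-the-envelope annual cost, from ~$0.50 per 1,000 GPT-4o-mini requests observed in this benchmark.
cost_per_request = 0.50 / 1_000      # dollars per prediction
daily_volume = 1_000_000             # predictions per day
annual_cost = cost_per_request * daily_volume * 365
print(f"${annual_cost:,.0f} per year")  # $182,500 per year
```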
And that's just API cost. Add latency (user response time), external service dependency (outage = downtime), data confidentiality (your texts go to OpenAI).
Epilogue: The New Map
The myth "LLMs can do everything" is false. But reality is more interesting than a simple classical ML victory.
What We Learned
1. LLMs are champions of semantics, not numerics.
They understand meaning, nuances, context. They fail as soon as you ask for mathematical precision.
2. Sophistication doesn't mean suitability.
GPT-5-nano with its "reasoning" loses to GPT-4o-mini on simple tasks. The most advanced tool isn't always the right tool.
3. The first few-shot example is magical.
56% gain for a single example. It's the best ROI in all of Machine Learning.
4. Regression remains sklearn's undisputed domain.
R² = -6.5 billion isn't a bug. It's a fundamental characteristic of next-token language models applied where they don't belong.
5. GenAI's hidden costs are real.
807x slower. $182k/year for 1M daily predictions. External dependency. These factors often disappear from POC evaluations.
The Question to Ask
Before every project, ask yourself:
Does my task require understanding meaning, or predicting numbers?
If it's meaning: GenAI will probably excel.
If it's numbers: sklearn will probably win.
If it's both: build a hybrid architecture.
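What a hybrid architecture can look like in practice, as a minimal sketch; the TaskType split and the two predict callables are assumptions, not a prescribed design:

```python
# Sketch of a hybrid router: semantic tasks go to the LLM, numeric targets to the sklearn model.
# The TaskType enum and the predict callables are illustrative assumptions.
from enum import Enum

class TaskType(Enum):
    SEMANTIC = "semantic"  # sentiment, keyphrases, entities
    NUMERIC = "numeric"    # prices, scores, any regression target

def route(task_type: TaskType, payload, llm_predict, sklearn_predict):
    """Dispatch the payload to whichever system the benchmark says is better suited."""
    if task_type is TaskType.SEMANTIC:
        return llm_predict(payload)
    return sklearn_predict(payload)

# Usage (hypothetical callables):
#   route(TaskType.SEMANTIC, review_text, classify_with_gpt4o_mini, tfidf_logreg.predict)
#   route(TaskType.NUMERIC, house_features, classify_with_gpt4o_mini, random_forest.predict)
```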
Methodology
- Validation: Monte Carlo Cross-Validation, 30 iterations per configuration
- Statistical tests: Paired t-test, Wilcoxon signed-rank, Bootstrap CI (10,000 resamples); a minimal sketch of this comparison follows the list
- Metrics: F1-score (classification/extraction), R² (regression)
- sklearn baseline: TF-IDF + LogisticRegression (classification), RandomForest (regression)
- GenAI models: GPT-4o-mini, GPT-5-nano (reasoning medium), GPT-4.1-nano
- Few-shot: 20 examples by default, 0-50 examples analysis
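A minimal sketch of the per-configuration comparison (placeholder scores, bootstrap confidence interval omitted for brevity):

```python
# Sketch: 30 Monte Carlo CV iterations per system, then paired tests on the per-iteration scores.
# The score arrays are placeholders, not real benchmark results.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(42)
sklearn_scores = rng.normal(0.58, 0.02, size=30)  # placeholder F1 per iteration
genai_scores = rng.normal(0.96, 0.01, size=30)    # placeholder F1 per iteration

diff = sklearn_scores - genai_scores
t_stat, p_t = ttest_rel(sklearn_scores, genai_scores)
w_stat, p_w = wilcoxon(diff)
cohens_d = diff.mean() / diff.std(ddof=1)  # negative d = GenAI ahead, matching the convention used above

print(f"paired t-test p={p_t:.2e}, Wilcoxon p={p_w:.2e}, Cohen's d={cohens_d:.1f}")
```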
Acknowledged limitation: Our sklearn baseline is deliberately simple. More sophisticated approaches (sentence-transformers, XGBoost, fine-tuning) would likely narrow the gap on some tasks. This benchmark measures "GenAI vs accessible ML", not "GenAI vs SOTA".
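Concretely, the "accessible ML" baseline is nothing more sophisticated than this kind of pipeline (a minimal sketch with a placeholder corpus, not the benchmark code):

```python
# Minimal sketch of the classification baseline: TF-IDF features + Logistic Regression.
# The tiny corpus is a placeholder; the benchmark used the real datasets listed above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "WINNER!! Claim your free prize now",
    "Are we still on for lunch tomorrow?",
    "URGENT: verify your account immediately",
    "Can you send me the report by Friday?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["Free entry to win cash, reply now"]))  # likely flags this as spam (1)
```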
This Series
- LLMs for Classification: One Example is All You Need
- Why Classical ML Still Crushes GenAI at Regression
- The Myth Confronted with Reality (this article)
Classical ML isn't dead. It's never been more relevant. The real innovation isn't replacing sklearn with GPT — it's knowing when to use each.