The surprising benchmark results that every ML engineer should know before using LLMs for numerical predictions
The Uncomfortable Truth
In the age of GPT-5 and advanced reasoning models, I ran a comprehensive benchmark comparing Classical ML (sklearn) against Generative AI across 19 use cases. The results for regression tasks were... humbling for GenAI.
Classical ML wins 100% of regression tasks. Not 90%. Not 95%. One hundred percent.
And it's not even close.
The Data: What I Found
I tested three state-of-the-art LLMs against sklearn's RandomForestRegressor across five regression datasets:
| Dataset | sklearn R² | GPT-4o-mini R² | GPT-5-nano R² | Winner |
|---|---|---|---|---|
| Car Price | 0.802 | -3.498 | -5.172 | sklearn |
| Diamond Price | 0.926 | 0.806 | - | sklearn |
| Housing Price | 0.710 | -6.78B | -29.27B | sklearn |
| Wine Quality | 0.405 | -1.33M | -1.631 | sklearn |
| Salary Prediction | 0.346 | 0.184 | - | sklearn |
Wait, negative R²? Yes. A negative R² means the model performs worse than simply predicting the mean. The LLMs aren't just losing — they're producing predictions that are mathematically worse than doing nothing.
Understanding the Catastrophe
What R² Actually Means
- R² = 1.0: Perfect predictions
- R² = 0.0: Predictions as good as the mean
- R² < 0: Predictions worse than the mean
When GPT-4o-mini produces an R² of -6.78 billion on housing prices, it means the model's predictions are so far off that they add massive error compared to simply guessing the average price every time.
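To make this concrete, here's a toy worked example (numbers invented for illustration). With sklearn's own metric, a handful of wildly off predictions already drives R² four orders of magnitude below zero:

```python
# Toy example of how R² goes deeply negative (made-up numbers).
from sklearn.metrics import r2_score

y_true = [100, 200, 300]
y_pred = [5_000, 9_000, 12_000]  # wildly off, in the spirit of an LLM guess
print(r2_score(y_true, y_pred))  # ≈ -11916: far worse than predicting the mean
```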
The Scale of the Problem
Let's visualize what "negative R² in the billions" actually means:
| Metric | sklearn | GPT-4o-mini | Interpretation |
|---|---|---|---|
| Housing RMSE | ~$45,000 | ~$8.2 billion | LLM errors are 180,000x larger |
| Car Price RMSE | ~$2,100 | ~$18,500 | LLM errors are 9x larger |
| Wine Quality RMSE | 0.65 | 1,152 | LLM errors are 1,772x larger |
Why LLMs Fail at Regression
1. Token Generation ≠ Numerical Reasoning
LLMs generate text token by token. When asked to predict "$347,500", they're essentially:
- Deciding "3" is the first digit
- Then "4" seems reasonable
- Then "7", etc.
This is fundamentally different from computing a weighted sum of features, which is what regression actually requires.
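Here's a stylized sketch of that difference (weights and features invented, not the benchmark models). The regression answer is one arithmetic operation; the LLM-style answer is a character sequence with no arithmetic binding the digits together:

```python
# Illustrative contrast (all numbers invented): regression computes a
# value in one step; an LLM emits the answer as separate tokens.
weights, bias = [120.0, 55.0, -300.0], 15_000.0
features = [1800, 3, 25]  # e.g. sqft, bedrooms, age

# Regression: a single weighted sum over all features.
y_hat = sum(w * x for w, x in zip(weights, features)) + bias

# LLM-style output: the same number, but generated digit by digit.
tokens = list(f"{y_hat:,.0f}")
print(y_hat, tokens)
```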
2. No Gradient Optimization
Classical ML models like RandomForest are optimized to minimize prediction error through mathematical optimization. LLMs are optimized to predict the next token in a sequence — a completely different objective.
3. Scale Sensitivity
LLMs have no inherent understanding of numerical scale. The difference between $100,000 and $1,000,000 is just different tokens to an LLM, but it's a massive error in regression terms.
The Benchmark Methodology
To ensure a fair comparison, I used:
- Monte Carlo Cross-Validation: 30 random train/test splits
- Statistical Tests: Paired t-test, Wilcoxon signed-rank, Bootstrap CI
- Effect Size: Cohen's d for interpretation
- sklearn baseline: TF-IDF + RandomForestRegressor (simple, reproducible)
All differences were statistically significant (p < 0.05) with large effect sizes (Cohen's d > 0.8).
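For reference, here's a minimal sketch of the Monte Carlo protocol on the sklearn side (feature preparation such as the TF-IDF step is omitted; `X` and `y` are NumPy arrays standing in for any of the datasets):

```python
# Minimal sketch of 30-split Monte Carlo Cross-Validation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

def monte_carlo_r2(X, y, n_splits=30, test_size=0.2, seed=42):
    """Collect one R² score per random train/test split."""
    scores = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        model = RandomForestRegressor(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    return np.array(scores)
```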
But What About Few-Shot Learning?
I tested whether more examples could help LLMs with regression. The answer: marginally, but still catastrophic.
| Dataset | 5-shot R² | 20-shot R² | Improvement | Still Negative? |
|---|---|---|---|---|
| Car Price | -3.498 | -1.794 | +48.7% | YES |
| Wine Quality | -1.33M | -379K | +71.5% | YES |
| Housing Price | -6.78B | -6.54B | +3.5% | YES |
More examples make LLM predictions "less catastrophically wrong" — but they remain worse than useless compared to classical ML.
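For context, a few-shot setup looks roughly like this (the prompt wording below is illustrative, not the exact benchmark prompt):

```python
# Illustrative few-shot prompt builder (wording is assumed, not the
# exact benchmark prompt). Each example pairs features with a price.
def build_prompt(examples, query):
    lines = ["Predict the car price from its features.\n"]
    for feats, price in examples:
        lines.append(f"Features: {feats}\nPrice: {price}\n")
    lines.append(f"Features: {query}\nPrice:")
    return "\n".join(lines)

examples = [
    ("2018 sedan, 45k miles, 4 cyl", "$14,200"),
    ("2021 SUV, 12k miles, 6 cyl", "$31,800"),
]
print(build_prompt(examples, "2019 hatchback, 30k miles, 4 cyl"))
```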
The Cost Dimension
Beyond accuracy, there's a massive cost and speed difference:
| Metric | sklearn | GenAI | Ratio |
|---|---|---|---|
| Training time | 8.1 min | N/A | - |
| Inference time | 8.1 min total | 6,533 min | 807x slower |
| Cost | $0 (local) | $9.39 | Infinite |
You're paying 807x more in time (and real money) for predictions that are mathematically worse than guessing the average.
When to Use What: The Decision Framework
Use Classical ML (sklearn) When:
- Your target variable is continuous/numerical
- You need regression predictions (price, quantity, score)
- You have structured tabular data
- Latency matters (<100ms response time)
- Cost matters (high volume predictions)
Consider GenAI When:
- Your task involves natural language understanding
- You're doing sentiment analysis or emotion detection
- You need zero-shot generalization on text
- You're doing information extraction from unstructured text
The Hybrid Approach
If you must use LLMs in a pipeline that involves numerical predictions, consider these patterns:
Pattern 1: LLM for Features, sklearn for Prediction
Text Input → LLM extracts features → sklearn predicts value
Use the LLM to convert unstructured text into structured features, then let sklearn do the numerical prediction.
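A minimal sketch of this pattern, with the LLM step stubbed out (`fake_llm_extract` is a placeholder; in practice it would be a real API call returning structured JSON):

```python
# Pattern 1 sketch: LLM extracts features, sklearn predicts (toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fake_llm_extract(text: str) -> list[float]:
    """Stand-in for an LLM that turns free text into numeric features."""
    # A real call would prompt for JSON like {"sqft": ..., "bedrooms": ...}
    return [1800.0, 3.0, 25.0]  # sqft, bedrooms, age

# Train sklearn on structured features (invented numbers).
X_train = np.array([[1500, 2, 30], [2200, 4, 10], [1800, 3, 25]])
y_train = np.array([220_000, 410_000, 305_000])
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# At inference: the LLM handles the text, sklearn handles the number.
features = fake_llm_extract("Charming 3-bed, 1800 sqft, built 1999")
print(model.predict([features]))
```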
Pattern 2: LLM for Binning, then Aggregation
Input → LLM classifies into bins → Map bins to numerical ranges
Instead of asking "What is the price?", ask "Is this low, medium, or high priced?" — a classification task where LLMs perform better.
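A sketch of the mapping step (bin boundaries invented for illustration):

```python
# Pattern 2 sketch: the LLM returns a bin label, we map it to a number.
BIN_TO_RANGE = {
    "low": (0, 15_000),
    "medium": (15_000, 40_000),
    "high": (40_000, 120_000),
}

def bin_to_estimate(llm_label: str) -> float:
    """Map the LLM's classification answer to a midpoint estimate."""
    lo, hi = BIN_TO_RANGE[llm_label.strip().lower()]
    return (lo + hi) / 2

print(bin_to_estimate("medium"))  # 27500.0
```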
Pattern 3: LLM as Sanity Checker
sklearn prediction → LLM validates → Final output
Use sklearn for the prediction, then optionally use an LLM to flag predictions that seem unreasonable given the context.
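A sketch of the wiring (the `ask_llm_is_reasonable` check is a trivial stub standing in for a yes/no LLM prompt):

```python
# Pattern 3 sketch: sklearn predicts, the LLM only flags suspect outputs.
def ask_llm_is_reasonable(context: str, prediction: float) -> bool:
    """Stand-in for prompting an LLM: 'Does this price look plausible?'"""
    return 1_000 <= prediction <= 5_000_000  # trivial stub

def predict_with_check(model, features, context: str):
    pred = float(model.predict([features])[0])
    if not ask_llm_is_reasonable(context, pred):
        # Flag for human review instead of shipping a bad number.
        return pred, "needs_review"
    return pred, "ok"
```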
Key Takeaways
- Never use LLMs for regression — You'll get R² scores worse than always predicting the mean
- sklearn dominates numerical prediction — With a 100% win rate in these benchmarks
- More examples don't fix the problem — Even 20-shot learning produces negative R²
- Cost-performance ratio is catastrophic — 807x slower for worse results
- Hybrid approaches exist — Use LLMs for what they're good at (text), sklearn for numbers
The Bottom Line
LLMs are poets, not accountants.
They excel at understanding language, sentiment, and nuance. They fail spectacularly at the mathematical precision required for regression. Know your tools, and use them where they shine.
Methodology Details
- Datasets: Car Price, Diamond Price, Housing Price, Wine Quality, Salary Prediction
- sklearn Model: RandomForestRegressor with default parameters
- LLM Models: GPT-4o-mini, GPT-5-nano (reasoning), GPT-4.1-nano
- Validation: 30-iteration Monte Carlo Cross-Validation
- Metrics: R² (coefficient of determination), RMSE
- Statistical Tests: Paired t-test, Wilcoxon, Bootstrap CI (10,000 resamples)
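For readers who want to reproduce the statistics step, here's a sketch on synthetic per-split scores (the numbers are invented; the real inputs are the 30 paired R² values per dataset):

```python
# Sketch of the paired significance tests and Cohen's d (synthetic data).
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(0)
sklearn_r2 = rng.normal(0.80, 0.02, size=30)  # 30 Monte Carlo splits
llm_r2 = rng.normal(-3.50, 0.50, size=30)

t_stat, p_t = ttest_rel(sklearn_r2, llm_r2)
w_stat, p_w = wilcoxon(sklearn_r2 - llm_r2)

# Cohen's d for paired samples: mean difference / SD of differences.
diff = sklearn_r2 - llm_r2
d = diff.mean() / diff.std(ddof=1)
print(p_t, p_w, d)
```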
Next in this series: ["LLMs Can Do Everything": Autopsy of a Myth](https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika)