The surprising benchmark results that every ML engineer should know before using LLMs for numerical predictions
The Uncomfortable Truth
In the age of GPT-5 and advanced reasoning models, I ran a comprehensive benchmark comparing Classical ML (sklearn) against Generative AI across 19 use cases. The results for regression tasks were... humbling for GenAI.
Classical ML wins 100% of regression tasks. Not 90%. Not 95%. One hundred percent.
And it's not even close.
The Data: What I Found
I tested three state-of-the-art LLMs against sklearn's RandomForestRegressor across five regression datasets:
| Dataset | sklearn R² | GPT-4o-mini R² | GPT-5-nano R² | Winner |
|---|---|---|---|---|
| Car Price | 0.802 | -3.498 | -5.172 | sklearn |
| Diamond Price | 0.926 | 0.806 | - | sklearn |
| Housing Price | 0.710 | -6.78B | -29.27B | sklearn |
| Wine Quality | 0.405 | -1.33M | -1.631 | sklearn |
| Salary Prediction | 0.346 | 0.184 | - | sklearn |
Wait, negative R²? Yes. A negative R² means the model performs worse than simply predicting the mean. The LLMs aren't just losing — they're producing predictions that are mathematically worse than doing nothing.
Understanding the Catastrophe
What R² Actually Means
- R² = 1.0: Perfect predictions
- R² = 0.0: Predictions as good as the mean
- R² < 0: Predictions worse than the mean
When GPT-4o-mini produces an R² of -6.78 billion on housing prices, it means the model's predictions are so far off that they add massive error compared to simply guessing the average price every time.
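To make this concrete, here's a toy worked example (numbers invented for illustration). With sklearn's own metric, a handful of wildly off predictions already drives R² four orders of magnitude below zero:

```python
# Toy example of how R² goes deeply negative (made-up numbers).
from sklearn.metrics import r2_score

y_true = [100, 200, 300]
y_pred = [5_000, 9_000, 12_000]  # wildly off, in the spirit of an LLM guess
print(r2_score(y_true, y_pred))  # ≈ -11916: far worse than predicting the mean
```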
The Scale of the Problem
Let's visualize what "negative R² in the billions" actually means:
| Metric | sklearn | GPT-4o-mini | Interpretation |
|---|---|---|---|
| Housing RMSE | ~$45,000 | ~$8.2 billion | LLM errors are 180,000x larger |
| Car Price RMSE | ~$2,100 | ~$18,500 | LLM errors are 9x larger |
| Wine Quality RMSE | 0.65 | 1,152 | LLM errors are 1,772x larger |
Why LLMs Fail at Regression
1. Token Generation ≠ Numerical Reasoning
LLMs generate text token by token. When asked to predict "$347,500", they're essentially:
- Deciding "3" is the first digit
- Then "4" seems reasonable
- Then "7", etc.
This is fundamentally different from computing a weighted sum of features, which is what regression actually requires.
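Here's a stylized sketch of that difference (weights and features invented, not the benchmark models). The regression answer is one arithmetic operation; the LLM-style answer is a character sequence with no arithmetic binding the digits together:

```python
# Illustrative contrast (all numbers invented): regression computes a
# value in one step; an LLM emits the answer as separate tokens.
weights, bias = [120.0, 55.0, -300.0], 15_000.0
features = [1800, 3, 25]  # e.g. sqft, bedrooms, age

# Regression: a single weighted sum over all features.
y_hat = sum(w * x for w, x in zip(weights, features)) + bias

# LLM-style output: the same number, but generated digit by digit.
tokens = list(f"{y_hat:,.0f}")
print(y_hat, tokens)
```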
2. No Gradient Optimization
Classical ML models like RandomForest are optimized to minimize prediction error through mathematical optimization. LLMs are optimized to predict the next token in a sequence — a completely different objective.
3. Scale Sensitivity
LLMs have no inherent understanding of numerical scale. The difference between $100,000 and $1,000,000 is just different tokens to an LLM, but it's a massive error in regression terms.
The Benchmark Methodology
To ensure a fair comparison, I used:
- Monte Carlo Cross-Validation: 30 random train/test splits
- Statistical Tests: Paired t-test, Wilcoxon signed-rank, Bootstrap CI
- Effect Size: Cohen's d for interpretation
- sklearn baseline: TF-IDF + RandomForestRegressor (simple, reproducible)
All differences were statistically significant (p < 0.05) with large effect sizes (Cohen's d > 0.8).
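For reference, here's a minimal sketch of the Monte Carlo protocol on the sklearn side (feature preparation such as the TF-IDF step is omitted; `X` and `y` are NumPy arrays standing in for any of the datasets):

```python
# Minimal sketch of 30-split Monte Carlo Cross-Validation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

def monte_carlo_r2(X, y, n_splits=30, test_size=0.2, seed=42):
    """Collect one R² score per random train/test split."""
    scores = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        model = RandomForestRegressor(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    return np.array(scores)
```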
But What About Few-Shot Learning?
I tested whether more examples could help LLMs with regression. The answer: marginally, but still catastrophic.
| Dataset | 5-shot R² | 20-shot R² | Improvement | Still Negative? |
|---|---|---|---|---|
| Car Price | -3.498 | -1.794 | +48.7% | YES |
| Wine Quality | -1.33M | -379K | +71.5% | YES |
| Housing Price | -6.78B | -6.54B | +3.5% | YES |
More examples make LLM predictions "less catastrophically wrong" — but they remain worse than useless compared to classical ML.
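For context, a few-shot setup looks roughly like this (the prompt wording below is illustrative, not the exact benchmark prompt):

```python
# Illustrative few-shot prompt builder (wording is assumed, not the
# exact benchmark prompt). Each example pairs features with a price.
def build_prompt(examples, query):
    lines = ["Predict the car price from its features.\n"]
    for feats, price in examples:
        lines.append(f"Features: {feats}\nPrice: {price}\n")
    lines.append(f"Features: {query}\nPrice:")
    return "\n".join(lines)

examples = [
    ("2018 sedan, 45k miles, 4 cyl", "$14,200"),
    ("2021 SUV, 12k miles, 6 cyl", "$31,800"),
]
print(build_prompt(examples, "2019 hatchback, 30k miles, 4 cyl"))
```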
The Cost Dimension
Beyond accuracy, there's a massive cost and speed difference:
| Metric | sklearn | GenAI | Ratio |
|---|---|---|---|
| Training time | 8.1 min | N/A | - |
| Inference time | 8.1 min total | 6,533 min | 807x slower |
| Cost | $0 (local) | $9.39 | Infinite |
You're paying 807x more in time (and real money) for predictions that are mathematically worse than guessing the average.
When to Use What: The Decision Framework
Use Classical ML (sklearn) When:
- Your target variable is continuous/numerical
- You need regression predictions (price, quantity, score)
- You have structured tabular data
- Latency matters (<100ms response time)
- Cost matters (high volume predictions)
Consider GenAI When:
- Your task involves natural language understanding
- You're doing sentiment analysis or emotion detection
- You need zero-shot generalization on text
- You're doing information extraction from unstructured text
The Hybrid Approach
If you must use LLMs in a pipeline that involves numerical predictions, consider these patterns:
Pattern 1: LLM for Features, sklearn for Prediction
Text Input → LLM extracts features → sklearn predicts value
Use the LLM to convert unstructured text into structured features, then let sklearn do the numerical prediction.
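A minimal sketch of this pattern, with the LLM step stubbed out (`fake_llm_extract` is a placeholder; in practice it would be a real API call returning structured JSON):

```python
# Pattern 1 sketch: LLM extracts features, sklearn predicts (toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fake_llm_extract(text: str) -> list[float]:
    """Stand-in for an LLM that turns free text into numeric features."""
    # A real call would prompt for JSON like {"sqft": ..., "bedrooms": ...}
    return [1800.0, 3.0, 25.0]  # sqft, bedrooms, age

# Train sklearn on structured features (invented numbers).
X_train = np.array([[1500, 2, 30], [2200, 4, 10], [1800, 3, 25]])
y_train = np.array([220_000, 410_000, 305_000])
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# At inference: the LLM handles the text, sklearn handles the number.
features = fake_llm_extract("Charming 3-bed, 1800 sqft, built 1999")
print(model.predict([features]))
```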
Pattern 2: LLM for Binning, then Aggregation
Input → LLM classifies into bins → Map bins to numerical ranges
Instead of asking "What is the price?", ask "Is this low, medium, or high priced?" — a classification task where LLMs perform better.
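A sketch of the mapping step (bin boundaries invented for illustration):

```python
# Pattern 2 sketch: the LLM returns a bin label, we map it to a number.
BIN_TO_RANGE = {
    "low": (0, 15_000),
    "medium": (15_000, 40_000),
    "high": (40_000, 120_000),
}

def bin_to_estimate(llm_label: str) -> float:
    """Map the LLM's classification answer to a midpoint estimate."""
    lo, hi = BIN_TO_RANGE[llm_label.strip().lower()]
    return (lo + hi) / 2

print(bin_to_estimate("medium"))  # 27500.0
```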
Pattern 3: LLM as Sanity Checker
sklearn prediction → LLM validates → Final output
Use sklearn for the prediction, then optionally use an LLM to flag predictions that seem unreasonable given the context.
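A sketch of the wiring (the `ask_llm_is_reasonable` check is a trivial stub standing in for a yes/no LLM prompt):

```python
# Pattern 3 sketch: sklearn predicts, the LLM only flags suspect outputs.
def ask_llm_is_reasonable(context: str, prediction: float) -> bool:
    """Stand-in for prompting an LLM: 'Does this price look plausible?'"""
    return 1_000 <= prediction <= 5_000_000  # trivial stub

def predict_with_check(model, features, context: str):
    pred = float(model.predict([features])[0])
    if not ask_llm_is_reasonable(context, pred):
        # Flag for human review instead of shipping a bad number.
        return pred, "needs_review"
    return pred, "ok"
```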
Key Takeaways
- Never use LLMs for regression — You'll get R² scores worse than always predicting the mean
- sklearn dominates numerical prediction — With a 100% win rate in these benchmarks
- More examples don't fix the problem — Even 20-shot learning produces negative R²
- Cost-performance ratio is catastrophic — 807x slower for worse results
- Hybrid approaches exist — Use LLMs for what they're good at (text), sklearn for numbers
The Bottom Line
LLMs are poets, not accountants.
They excel at understanding language, sentiment, and nuance. They fail spectacularly at the mathematical precision required for regression. Know your tools, and use them where they shine.
Methodology Details
- Datasets: Car Price, Diamond Price, Housing Price, Wine Quality, Salary Prediction
- sklearn Model: RandomForestRegressor with default parameters
- LLM Models: GPT-4o-mini, GPT-5-nano (reasoning), GPT-4.1-nano
- Validation: 30-iteration Monte Carlo Cross-Validation
- Metrics: R² (coefficient of determination), RMSE
- Statistical Tests: Paired t-test, Wilcoxon, Bootstrap CI (10,000 resamples)
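For readers who want to reproduce the statistics step, here's a sketch on synthetic per-split scores (the numbers are invented; the real inputs are the 30 paired R² values per dataset):

```python
# Sketch of the paired significance tests and Cohen's d (synthetic data).
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(0)
sklearn_r2 = rng.normal(0.80, 0.02, size=30)  # 30 Monte Carlo splits
llm_r2 = rng.normal(-3.50, 0.50, size=30)

t_stat, p_t = ttest_rel(sklearn_r2, llm_r2)
w_stat, p_w = wilcoxon(sklearn_r2 - llm_r2)

# Cohen's d for paired samples: mean difference / SD of differences.
diff = sklearn_r2 - llm_r2
d = diff.mean() / diff.std(ddof=1)
print(p_t, p_w, d)
```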
Next in this series: ["LLMs Can Do Everything": Autopsy of a Myth](https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika)