Mickaël Andrieu
Why Classical ML Still Crushes GenAI at Regression

The surprising benchmark results that every ML engineer should know before using LLMs for numerical predictions


Regression Benchmark

The Uncomfortable Truth

In the age of GPT-5 and advanced reasoning models, I ran a comprehensive benchmark comparing Classical ML (sklearn) against Generative AI across 19 use cases. The results for regression tasks were... humbling for GenAI.

Classical ML wins 100% of regression tasks. Not 90%. Not 95%. One hundred percent.

And it's not even close.

The Data: What I Found

I tested three state-of-the-art LLMs against sklearn's RandomForestRegressor across five regression datasets:

R-Squared Comparison

| Dataset | sklearn R² | GPT-4o-mini R² | GPT-5-nano R² | Winner |
|---|---|---|---|---|
| Car Price | 0.802 | -3.498 | -5.172 | sklearn |
| Diamond Price | 0.926 | 0.806 | - | sklearn |
| Housing Price | 0.710 | -6.78B | -29.27B | sklearn |
| Wine Quality | 0.405 | -1.33M | -1.631 | sklearn |
| Salary Prediction | 0.346 | 0.184 | - | sklearn |

Wait, negative R²? Yes. A negative R² means the model performs worse than simply predicting the mean. The LLMs aren't just losing — they're producing predictions that are mathematically worse than doing nothing.

Understanding the Catastrophe

What R² Actually Means

  • R² = 1.0: Perfect predictions
  • R² = 0.0: Predictions as good as the mean
  • R² < 0: Predictions worse than the mean

When GPT-4o-mini produces an R² of -6.78 billion on housing prices, it means the model's predictions are so far off that they add massive error compared to simply guessing the average price every time.
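You can see how a negative R² arises with a minimal sketch using sklearn's `r2_score`. The numbers below are illustrative, not the benchmark data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative housing prices (not the benchmark data)
y_true = np.array([250_000, 310_000, 420_000, 380_000, 290_000])

# Baseline: always predict the mean -> R² is exactly 0 by definition
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))  # 0.0

# Wildly off predictions (e.g. a model that drops or adds digits)
wild = np.array([2_500_000, 31_000, 4_200_000, 38_000, 2_900_000])
print(r2_score(y_true, wild))  # large negative value
```

The further predictions drift from the scale of the true values, the more negative R² becomes, with no lower bound.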

The Scale of the Problem

Error Distribution

Let's visualize what "negative R² in the billions" actually means:

| Metric | sklearn | GPT-4o-mini | Interpretation |
|---|---|---|---|
| Housing RMSE | ~$45,000 | ~$8.2 billion | LLM errors are 180,000x larger |
| Car Price RMSE | ~$2,100 | ~$18,500 | LLM errors are 9x larger |
| Wine Quality RMSE | 0.65 | 1,152 | LLM errors are 1,772x larger |

Why LLMs Fail at Regression

1. Token Generation ≠ Numerical Reasoning

LLMs generate text token by token. When asked to predict "$347,500", they're essentially:

  1. Deciding "3" is the first digit
  2. Then "4" seems reasonable
  3. Then "7", etc.

This is fundamentally different from computing a weighted sum of features, which is what regression actually requires.
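To make "weighted sum of features" concrete, here is a minimal sketch with `LinearRegression` on made-up data (the features and prices are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy tabular data: [square_meters, n_rooms] -> price (made-up numbers)
X = np.array([[50, 2], [80, 3], [120, 4], [65, 2], [100, 3]])
y = np.array([150_000, 240_000, 360_000, 195_000, 300_000])

model = LinearRegression().fit(X, y)

# A prediction is literally intercept + sum(coef_i * feature_i),
# computed in one arithmetic step -- not generated digit by digit
x_new = np.array([90, 3])
manual = model.intercept_ + model.coef_ @ x_new
assert np.isclose(manual, model.predict([x_new])[0])
```

The model outputs the full number at once as a continuous quantity; there is no step where it could commit to a wrong leading digit and then keep generating.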

2. No Gradient Optimization

Classical ML models like RandomForest are optimized to minimize prediction error through mathematical optimization. LLMs are optimized to predict the next token in a sequence — a completely different objective.

3. Scale Sensitivity

LLMs have no inherent understanding of numerical scale. The difference between $100,000 and $1,000,000 is just different tokens to an LLM, but it's a massive error in regression terms.

The Benchmark Methodology

To ensure a fair comparison, I used:

  • Monte Carlo Cross-Validation: 30 random train/test splits
  • Statistical Tests: Paired t-test, Wilcoxon signed-rank, Bootstrap CI
  • Effect Size: Cohen's d for interpretation
  • sklearn baseline: TF-IDF + RandomForestRegressor (simple, reproducible)

All differences were statistically significant (p < 0.05) with large effect sizes (Cohen's d > 0.8).
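A minimal sketch of this setup, assuming you already have per-split R² scores for both models. The dataset is synthetic and the LLM scores are faked here purely to demonstrate the statistical machinery:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in for one benchmark dataset (assumption, not the real data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Monte Carlo CV: 30 random train/test splits
splitter = ShuffleSplit(n_splits=30, test_size=0.2, random_state=42)
sklearn_scores = []
for train_idx, test_idx in splitter.split(X):
    model = RandomForestRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    sklearn_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
sklearn_scores = np.array(sklearn_scores)

# llm_scores would come from LLM predictions on the same splits;
# faked here (assumption) to show the paired tests
llm_scores = -3.5 + np.random.default_rng(0).normal(0, 0.5, 30)

t_stat, p_value = stats.ttest_rel(sklearn_scores, llm_scores)
w_stat, p_wilcoxon = stats.wilcoxon(sklearn_scores - llm_scores)
diff = sklearn_scores - llm_scores
cohens_d = diff.mean() / diff.std(ddof=1)
```

The paired tests work because both models are scored on the exact same 30 splits, so each difference compares like with like.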

But What About Few-Shot Learning?

I tested whether more examples could help LLMs with regression. The answer: marginally, but still catastrophic.

Few-Shot Regression

| Dataset | 5-shot R² | 20-shot R² | Improvement | Still Negative? |
|---|---|---|---|---|
| Car Price | -3.498 | -1.794 | +48.7% | YES |
| Wine Quality | -1.33M | -379K | +71.5% | YES |
| Housing Price | -6.78B | -6.54B | +3.5% | YES |

More examples make LLM predictions "less catastrophically wrong" — but they remain worse than useless compared to classical ML.

The Cost Dimension

Beyond accuracy, there's a massive cost and speed difference:

| Metric | sklearn | GenAI | Ratio |
|---|---|---|---|
| Training time | 8.1 min | N/A | - |
| Inference time | 8.1 min total | 6,533 min | 807x slower |
| Cost | $0 (local) | $9.39 | Infinite |

You're paying 807x more in time (and real money) for predictions that are mathematically worse than guessing the average.

When to Use What: The Decision Framework

Decision Framework

Use Classical ML (sklearn) When:

  • Your target variable is continuous/numerical
  • You need regression predictions (price, quantity, score)
  • You have structured tabular data
  • Latency matters (<100ms response time)
  • Cost matters (high volume predictions)

Consider GenAI When:

  • Your task involves natural language understanding
  • You're doing sentiment analysis or emotion detection
  • You need zero-shot generalization on text
  • You're doing information extraction from unstructured text

The Hybrid Approach

If you must use LLMs in a pipeline that involves numerical predictions, consider these patterns:

Pattern 1: LLM for Features, sklearn for Prediction

```
Text Input → LLM extracts features → sklearn predicts value
```

Use the LLM to convert unstructured text into structured features, then let sklearn do the numerical prediction.
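A minimal sketch of this pattern. The `extract_features` function is a hypothetical stand-in for an LLM call (e.g. a JSON-mode prompt); everything else, including the listings and prices, is made up for the demo:

```python
from sklearn.ensemble import RandomForestRegressor

def extract_features(listing_text: str) -> list[float]:
    """Stand-in for an LLM call that turns free text into structured
    features [square_meters, n_rooms, has_garden]. Hypothetical:
    replace with your actual LLM extraction step."""
    fake = {
        "Sunny 80m2 flat, 3 rooms, garden": [80, 3, 1],
        "Small 45m2 studio, 1 room": [45, 1, 0],
        "Spacious 120m2 house, 5 rooms, garden": [120, 5, 1],
    }
    return fake[listing_text]

texts = [
    "Sunny 80m2 flat, 3 rooms, garden",
    "Small 45m2 studio, 1 room",
    "Spacious 120m2 house, 5 rooms, garden",
]
prices = [240_000, 130_000, 390_000]  # made-up training targets

X = [extract_features(t) for t in texts]
model = RandomForestRegressor(random_state=0).fit(X, prices)

# sklearn -- not the LLM -- produces the number
pred = model.predict([extract_features("Sunny 80m2 flat, 3 rooms, garden")])[0]
```

The key property: the LLM never emits a price, so its numerical weaknesses stay out of the final prediction.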

Pattern 2: LLM for Binning, then Aggregation

```
Input → LLM classifies into bins → Map bins to numerical ranges
```

Instead of asking "What is the price?", ask "Is this low, medium, or high priced?" — a classification task where LLMs perform better.
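A sketch of the binning pattern, with a stubbed classifier standing in for the LLM call. The bin labels and ranges are assumptions for illustration:

```python
# Hypothetical bin labels the LLM chooses between, mapped to ranges
BIN_TO_RANGE = {
    "low": (0, 150_000),
    "medium": (150_000, 400_000),
    "high": (400_000, 1_000_000),
}

def classify_price_bin(description: str) -> str:
    """Stand-in for an LLM classification call:
    'Is this low, medium, or high priced? Answer with one word.'"""
    return "medium"  # stubbed answer for the demo

def estimate_price(description: str) -> float:
    low, high = BIN_TO_RANGE[classify_price_bin(description)]
    return (low + high) / 2  # midpoint as a crude point estimate

print(estimate_price("Sunny 80m2 flat with garden"))  # 275000.0
```

The error is now bounded by the bin width instead of being unbounded, which is exactly what the negative-R² failures lacked.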

Pattern 3: LLM as Sanity Checker

```
sklearn prediction → LLM validates → Final output
```

Use sklearn for the prediction, then optionally use an LLM to flag predictions that seem unreasonable given the context.
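A sketch of the guardrail pattern. `llm_sanity_check` is a hypothetical stand-in for a yes/no LLM call; here it is a crude hard-coded rule so the demo runs offline:

```python
def llm_sanity_check(prediction: float, context: str) -> bool:
    """Stand-in for an LLM yes/no call: 'Given this listing, is a
    price of $X plausible?' Hypothetical -- replace with a real call."""
    return 10_000 <= prediction <= 5_000_000  # crude rule for the demo

def predict_with_guardrail(sklearn_prediction: float, context: str) -> dict:
    flagged = not llm_sanity_check(sklearn_prediction, context)
    return {"prediction": sklearn_prediction, "needs_review": flagged}

print(predict_with_guardrail(347_500, "3-bed house, suburbs"))
# {'prediction': 347500, 'needs_review': False}
```

Note the asymmetry: the LLM can only flag the number, never replace it, so a bad validation at worst costs a human review.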

Key Takeaways

  1. Never use LLMs for regression — You'll get R² scores worse than random guessing
  2. sklearn dominates numerical prediction — With 100% win rate in our benchmarks
  3. More examples don't fix the problem — Even 20-shot learning produces negative R²
  4. Cost-performance ratio is catastrophic — 807x slower for worse results
  5. Hybrid approaches exist — Use LLMs for what they're good at (text), sklearn for numbers

The Bottom Line

LLMs are poets, not accountants.

They excel at understanding language, sentiment, and nuance. They fail spectacularly at the mathematical precision required for regression. Know your tools, and use them where they shine.


Methodology Details

  • Datasets: Car Price, Diamond Price, Housing Price, Wine Quality, Salary Prediction
  • sklearn Model: RandomForestRegressor with default parameters
  • LLM Models: GPT-4o-mini, GPT-5-nano (reasoning), GPT-4.1-nano
  • Validation: 30-iteration Monte Carlo Cross-Validation
  • Metrics: R² (coefficient of determination), RMSE
  • Statistical Tests: Paired t-test, Wilcoxon, Bootstrap CI (10,000 resamples)

Next in this series: ["LLMs Can Do Everything": Autopsy of a Myth](https://dev.to/mickael__andrieu/llms-can-do-everything-autopsy-of-a-myth-2ika)
