Before I wrote a single line of RealDataAgentBench, I spent time doing something most benchmark builders skip: I mapped out what each major model was actually known to be good at and where each one quietly fell apart.
The observation that started everything was simple: no single model dominates across all dimensions. Every model has a superpower. Every model has a blind spot. And no existing benchmark was measuring all of them in one place, on the same task, at the same time.
That observation became the entire design philosophy behind RDAB's four dimensions.
What I found when I mapped the models honestly
Here's what the research and early runs showed, model by model:
GPT models — strong on correctness, strong on following structured instructions, competitive on code. But they optimize for producing the right-looking answer, not necessarily the right reasoning. Ask GPT-4o to analyze a dataset and it will give you numbers. Ask it whether those numbers are statistically reliable, whether the confidence intervals are appropriate, whether the design supports a causal claim — and it often skips that entirely. The answer looks complete. It isn't.
Claude models — best statistical vocabulary of any family in the benchmark. Claude Sonnet leads on stat validity (0.714) and Claude Haiku isn't far behind (0.701). Claude knows when to sound like a statistician. The problem is efficiency: on feat_005, Claude Haiku consumed 608,861 tokens. GPT-4.1 completed the same task in under 30,000. Claude explores — sometimes more than the task warrants. It enters loops, re-runs the same blocks with minor variations, calls get_column_stats on every column one by one. Correct, but expensive. Exploration without a stopping criterion.
Llama 3.3-70B via Groq — methodical, step-by-step code structure that outperforms GPT-5 on modeling tasks specifically. Free via Groq's free tier. But it sometimes skips statistical rigor on tasks where rigor isn't explicitly cued — strong on structured problems, inconsistent on open-ended analytical ones.
Gemini 2.5 Flash — cheapest model in the benchmark at $0.000075/1K tokens. But it has an output completeness problem: it reaches the right place and then truncates before reporting key metrics. The reasoning is sound. The answer is incomplete. Average correctness of 0.58 despite reasonable reasoning steps — the model arrives at the destination and doesn't finish the sentence.
Grok-3-mini — near-perfect on EDA tasks, zero on anything requiring sklearn. Not a gradual degradation — a hard binary failure. It attempts to import sklearn inside the sandboxed execution environment, hits the restriction, and either retries the failing import repeatedly or gives up. It never adapts. A bimodal distribution hiding behind an aggregate score of 0.639.
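The adaptation Grok-3-mini never attempts is the standard try/except import fallback. Here's a minimal sketch of what that looks like — the function name and the NumPy fallback are my own illustration, not RDAB or Grok code:

```python
import numpy as np

def fit_logistic(X, y):
    """Fit a binary classifier, adapting to the execution environment.

    Tries sklearn first; if the sandbox blocks the import, falls back
    to a plain NumPy gradient-descent logistic regression instead of
    retrying the failing import.
    """
    try:
        from sklearn.linear_model import LogisticRegression
        return LogisticRegression(max_iter=1000).fit(X, y)
    except ImportError:
        # Minimal fallback: prepend a bias column, run gradient descent.
        Xb = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
        w = np.zeros(Xb.shape[1])
        for _ in range(2000):
            p = 1.0 / (1.0 + np.exp(-Xb @ w))
            w -= 0.1 * Xb.T @ (p - y) / len(y)

        class Model:
            def predict(self, Xnew):
                Xn = np.hstack([np.ones((len(Xnew), 1)),
                                np.asarray(Xnew, dtype=float)])
                return (1.0 / (1.0 + np.exp(-Xn @ w)) >= 0.5).astype(int)

        return Model()
```

Either branch produces a working classifier; the difference between a 0.9 and a 0.0 on these tasks is whether the model writes the except branch at all.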
The pattern these failures revealed
Look across those five failure modes and you see four distinct dimensions of capability, each failing independently:
- GPT skips statistical reasoning → Stat Validity
- Claude burns excessive tokens → Efficiency
- Gemini truncates outputs → Correctness
- Grok fails on specific tooling → Code Quality (namespace adaptation)
- Llama is inconsistent on rigor → Stat Validity again
Every model fails somewhere. The failures aren't correlated — a model that's correct isn't necessarily efficient, and a model that's efficient isn't necessarily statistically rigorous. My scorer-to-scorer correlation analysis confirmed this: correctness × stat validity sits at r = 0.48, and all other dimension pairs are below r = 0.25. They are measuring genuinely independent capabilities.
That's the entire justification for four dimensions instead of one. If the dimensions were correlated, you could collapse them. They aren't. You can't.
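The collapse test above is easy to run yourself. Here's a sketch of the scorer-to-scorer correlation check with synthetic per-run scores standing in for real scorer outputs (column names are illustrative, not RDAB's schema):

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in for 326 per-run scores; in RDAB these would come
# from the four scorers' outputs.
rng = np.random.default_rng(0)
runs = pd.DataFrame({
    "correctness":   rng.uniform(0, 1, 326),
    "stat_validity": rng.uniform(0, 1, 326),
    "efficiency":    rng.uniform(0, 1, 326),
    "code_quality":  rng.uniform(0, 1, 326),
})

# Pairwise Pearson r between dimensions: a high r would mean one
# dimension is redundant and could be collapsed into another.
for a, b in combinations(runs.columns, 2):
    print(f"{a} x {b}: r = {runs[a].corr(runs[b]):+.2f}")
```

With independent columns every pair comes out near zero; the real data's r = 0.48 for correctness × stat validity is the one pair with meaningful shared signal, and even that leaves most of the variance unexplained.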
Why Stat Validity was the hardest and most important dimension to add
Every benchmark already measures correctness. HumanEval, SWE-bench, MMLU, GPQA — they all ask some version of "did the model get the right answer?" That's necessary. It's not sufficient.
The specific failure I kept seeing was this: a model would compute the right feature importances, report the right AUC, fit the right coefficients — and then stop. No confidence intervals. No note that ranks 2 and 3 were within noise. No acknowledgment that a 150-sample test set gives you an AUC with ±0.08 uncertainty. Correct output. Unreliable analysis.
In a data science workflow where decisions get made on model outputs, this is not an academic problem. If a team ranks features incorrectly because rank 2 and rank 3 are statistically indistinguishable, they ship a model that drops a relevant predictor. The analysis was correct at the time. The outcome was wrong.
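The ±0.08 figure isn't mysterious — a bootstrap over a 150-sample test set shows it directly. This sketch uses a rank-based AUC and synthetic scores (all names and the data-generating choices are mine, for illustration):

```python
import numpy as np

def auc(y, s):
    """Rank-based AUC: probability a positive scores above a negative."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

rng = np.random.default_rng(1)
n = 150  # small test set, as in the example above
y = rng.integers(0, 2, n)
scores = y * 0.8 + rng.normal(0, 0.7, n)  # synthetic model scores

# Bootstrap: resample the test set with replacement, recompute AUC.
boots = []
for _ in range(2000):
    idx = rng.integers(0, n, n)
    yb, sb = y[idx], scores[idx]
    if yb.min() == yb.max():  # skip degenerate resamples (one class only)
        continue
    boots.append(auc(yb, sb))

lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"AUC = {auc(y, scores):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

At n = 150 the interval is wide — on the order of a tenth of the AUC scale — which is exactly the uncertainty a model should be reporting and usually doesn't.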
Stat validity is the dimension that catches this. It scores four things independently on every task: does the output report uncertainty (p-values, confidence intervals, standard errors)? Does it use an appropriate statistical method for the task category? Does it interpret results correctly ("statistically significant," "controlling for")? Does it avoid p-hacking language?
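As a rough sketch, the four checks can be approximated with pattern matching — this is a deliberately simplified stand-in, not RDAB's actual scorer, and the method-appropriateness check is reduced to a boolean that would really come from task metadata:

```python
import re

# Simplified stat-validity rubric: each check contributes equally.
UNCERTAINTY = re.compile(r"p[- ]?value|confidence interval|standard error|±", re.I)
INTERPRETATION = re.compile(r"statistically significant|controlling for", re.I)
P_HACKING = re.compile(r"marginally significant|trending toward", re.I)

def stat_validity(answer: str, used_appropriate_method: bool) -> float:
    checks = [
        bool(UNCERTAINTY.search(answer)),     # reports uncertainty
        used_appropriate_method,              # method fits the task category
        bool(INTERPRETATION.search(answer)),  # correct interpretive language
        not P_HACKING.search(answer),         # avoids p-hacking phrasing
    ]
    return sum(checks) / len(checks)

good = ("The effect is statistically significant "
        "(p-value 0.003, 95% confidence interval [0.1, 0.4]).")
print(stat_validity(good, used_appropriate_method=True))  # 1.0
```

Even this crude version separates "correct numbers, no uncertainty" from "correct numbers with error bars" — the distinction a correctness-only scorer can't see.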
The result across 326 runs: stat validity ranges from 0.45 on feature engineering tasks to 0.87 on EDA and statistical inference tasks. Models know when statistical language is expected — when the task name signals it. They don't know when it's warranted but not cued. That category-dependent gap is the finding. It didn't show up until I built a scorer that could detect it independently from correctness.
Why Efficiency became a first-class dimension
The Claude Haiku finding forced this.
Before I added efficiency as a scored dimension, I was looking at per-token pricing and feeling comfortable about cost estimates. Haiku at $0.00025/1K looked cheap. Then I looked at actual token consumption per task.
608,861 tokens on feat_005. The same task completed in under 30,000 by GPT-4.1 with higher correctness.
Token consumption is not just a cost metric. It's a capability signal. A model that understands a task completes it efficiently. A model that doesn't understand it loops — calling the same tools repeatedly, re-running the same code with minor variations, exploring without converging. The token trace tells you which one you're dealing with.
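Spotting the looping pattern in a trace is mechanical: count exact-duplicate tool calls. The trace format below is a simplified stand-in for a real agent transcript (tool names are illustrative):

```python
from collections import Counter

def flag_loops(trace, threshold=3):
    """Flag tool calls repeated with identical arguments.

    A converging agent rarely re-issues the exact same call;
    a looping one does, over and over.
    """
    counts = Counter(trace)
    return {call: n for call, n in counts.items() if n >= threshold}

trace = [
    ("get_column_stats", "age"),
    ("get_column_stats", "age"),
    ("get_column_stats", "age"),
    ("run_code", "df.describe()"),
]
print(flag_loops(trace))  # {('get_column_stats', 'age'): 3}
```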
Efficiency as a scored dimension makes this visible. Without it, you'd look at the correctness score, see Haiku performing adequately, and choose it for cost reasons — then pay more per task than a model that costs more per token but uses 20x fewer of them.
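The arithmetic is worth spelling out. Haiku's price and token count are the real figures from feat_005; the comparison model's per-token price is an assumption (8x Haiku's rate), since per-token pricing for the comparison isn't the point — consumption is:

```python
# Per-task cost = tokens_used / 1000 * price_per_1K_tokens.
haiku_cost = 608_861 / 1000 * 0.00025   # real feat_005 figures: ~$0.152
assumed_cost = 30_000 / 1000 * 0.002    # assumed 8x per-token price: ~$0.060

print(f"Haiku (cheap per token): ${haiku_cost:.3f} per task")
print(f"8x-priced model:         ${assumed_cost:.3f} per task")
```

The "cheap" model costs more than twice as much per task. Per-token pricing without per-task consumption is not a cost estimate.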
Why Code Quality couldn't be left to correctness
Correctness scoring checks the output answer. It doesn't check the code that produced it.
A model can get the right skewness value by writing a loop that iterates over every row of a DataFrame individually — slow, non-vectorized, wrong approach, correct answer. In a production data science pipeline, that code would be a performance problem. The correctness scorer would give it full marks.
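Here's the pattern concretely — both versions below compute the same population skewness, and a correctness scorer can't tell them apart (the data and variable names are mine, for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 2.0, 3.0, 10.0]})

# Row-by-row loop: correct answer, poor code quality.
n = len(df)
mean = sum(df.loc[i, "x"] for i in range(n)) / n
m2 = sum((df.loc[i, "x"] - mean) ** 2 for i in range(n)) / n
m3 = sum((df.loc[i, "x"] - mean) ** 3 for i in range(n)) / n
loop_skew = m3 / m2 ** 1.5  # population (biased) skewness

# Vectorized equivalent: same number, one line of intent.
x = df["x"].to_numpy()
vec_skew = ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(loop_skew, vec_skew)  # identical values
```

On five rows the loop is harmless; on five million it's a production incident. Only a code quality scorer sees the difference.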
Code quality scoring checks the work, not just the result: vectorized operations instead of raw loops, descriptive variable names, no magic numbers, readable structure. These aren't aesthetic preferences; they're the difference between code a data team can maintain and code that becomes technical debt the day it ships.
The Grok-3-mini sklearn failure lives here too. The model wasn't wrong about what it was trying to compute. It couldn't adapt its code to the execution environment. Namespace adaptation is a real code quality gap, not a correctness gap — and a correctness-only benchmark would never surface it.
What the four dimensions together give you
A model can score 1.0 on correctness and 0.25 on statistical validity on the same task. A model can score 0.87 on correctness and 0.12 on efficiency. These aren't edge cases — they're the norm in the RDAB results.
The four dimensions together give you a profile, not a ranking. GPT-4.1-mini leads overall (0.854 RDAB) but Claude Sonnet leads on stat validity (0.714). Llama 3.3-70B is free and beats GPT-5 on modeling. Gemini 2.5 Flash is the cheapest but truncates outputs. No single model wins everything.
That's the point. The benchmark was designed around the observation that every model has a superpower and a blind spot and a single correctness score flattens both into one number that tells you almost nothing about which model to use for your specific task.
The four dimensions keep the profile visible. That's what makes the benchmark actionable rather than just academic.
Try it on your own data
The full benchmark is open source: 39 tasks across EDA, Feature Engineering, Modeling, Statistical Inference, and ML Engineering. Run any model in under 5 minutes:
```bash
git clone https://github.com/patibandlavenkatamanideep/RealDataAgentBench
cd RealDataAgentBench
pip install -e ".[dev]"
cp .env.example .env

# Free run via Groq (no credit card needed)
dab run eda_001 --model groq --budget 0.05
dab score outputs/eda_001_*.json
```
Live leaderboard with CI bounds and category filters: patibandlavenkatamanideep.github.io/RealDataAgentBench
If you want to benchmark against your own dataset without running the full harness, CostGuard runs the RDAB evaluation on any CSV you upload and returns the best model recommendation with exact cost estimates in under 15 seconds.
What dimension would you add to a benchmark like this? There's a pre-registered experiment running right now on whether explicit uncertainty prompting closes the stat validity gap — I'll publish results as a follow-up.