albert nahas

Be Careful What You Wish For: Why AI Benchmarks Are Lying to You

How Goodhart's Law explains why GPT-5 scoring 93% doesn't mean what you think it means


You've probably seen the headlines: "GPT-5 scores 93% on HumanEval!" These numbers drive billion-dollar valuations. But they might be telling us less and less about actual AI capability.

Why? Because of something economists figured out 50 years ago.

The Laws That Explain Everything

Goodhart's Law:

"When a measure becomes a target, it ceases to be a good measure." — Strathern, 1997

Social scientist Donald Campbell noticed the same pattern:

"The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures." — Campbell, 1979

When metrics carry consequences, people optimize for the metric — not the thing it measures.

The Evidence

The Contamination Problem: Researchers (Deng et al., 2024) tested whether GPT-4 had "seen" MMLU questions during training by hiding one answer option and asking the model to reconstruct it. A model that never saw the test should recover the missing option about 25% of the time. GPT-4 got it right 57% of the time — more than double chance. The model had memorized the test.
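To make that probe concrete, here's a minimal sketch of the masked-option idea. Everything in it is illustrative: `query_model` is a hypothetical stand-in for whatever model API you call, and the prompt wording is my own, not the exact template from Deng et al.

```python
# Minimal sketch of a contamination probe in the spirit of Deng et al. (2024).
# query_model() is a hypothetical callable: prompt string in, completion string out.

def probe_contamination(items, query_model):
    """items: list of dicts with 'question' and 'options' (a list of 4 strings).
    Hides one option, asks the model to reconstruct it, and returns the exact-match rate."""
    hits = 0
    for item in items:
        masked_idx = 1  # always hide the second option, for example
        shown = [opt for i, opt in enumerate(item["options"]) if i != masked_idx]
        prompt = (
            f"Question: {item['question']}\n"
            "Three of the four answer options are:\n- " + "\n- ".join(shown) + "\n"
            "What is the missing fourth option? Reply with the option text only."
        )
        guess = query_model(prompt).strip().lower()
        if guess == item["options"][masked_idx].strip().lower():
            hits += 1
    return hits / len(items)

# A model that never saw the benchmark should land near the chance baseline
# (the article's comparison: ~25%). A rate like 57% points to memorization.
```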

The Reality Gap: GPT-5 scores 93% on HumanEval. But on NaturalCodeBench (real user coding questions), the gaps are enormous:

| Model | Benchmark (HumanEval) | Real-world (NaturalCodeBench) | Gap |
| --- | --- | --- | --- |
| GPT-4 | 90% | 53% | 37 points |
| WizardCoder | 73% | 24% | 49 points |
| Llama-3-70B | 82% | 39% | 43 points |

A 40+ point gap. That's not a rounding error — that's a canyon.

We've Seen This Before

Campbell's Law plays out everywhere:

  • Education (NCLB): Teaching to the test, score manipulation (Nichols & Berliner, 2007)
  • Policing (CompStat): 78% of surveyed NYPD captains saw unethical report changes (Eterno & Silverman, 2012)
  • Healthcare (NHS): Ambulances queuing outside ERs to game wait times (Bevan & Hood, 2006)
  • Corporate (Wells Fargo): Millions of fake accounts to hit sales targets

Same pattern: high-stakes metric → gaming → metric ≠ reality.

What This Means for You

Don't trust leaderboards blindly. MMLU scores compressed from a 17.5-point spread (2023) to a 0.3-point spread (2024). The benchmark stopped discriminating.

Instead: test models on your use cases. Be skeptical of any single number claiming to capture "intelligence."
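Testing on your own use cases doesn't need heavy tooling. Here's a sketch of a tiny personal eval: the `query_model` helper is a placeholder for your model call, and the example cases are invented — swap in prompts and checks from your actual work.

```python
# A tiny personal eval harness: your prompts, your pass/fail checks.
# query_model() is a hypothetical stand-in for whatever API or local model you use.

cases = [
    {"prompt": "Write a SQL query returning the 5 newest orders per customer.",
     "check": lambda out: "row_number()" in out.lower() or "limit" in out.lower()},
    {"prompt": "Summarize this ticket in one sentence: 'Login fails after password reset.'",
     "check": lambda out: len(out.split()) <= 30},
]

def run_eval(cases, query_model):
    """Run each case through the model and return the fraction that pass its check."""
    passed = 0
    for case in cases:
        output = query_model(case["prompt"])
        passed += bool(case["check"](output))
    return passed / len(cases)
```

A handful of cases you wrote yourself will tell you more about fit for your work than any leaderboard number.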

The Bigger Picture

If benchmarks are gamed, how do we know we're making progress? This matters especially for safety — safety benchmarks will face the same corruption pressures.

I've written a research paper exploring this in depth: Read the full paper on Zenodo

The core insight is 50 years old:

When you optimize for a metric, you might get the metric — but lose what you actually wanted.


What's your experience with AI benchmark claims vs. real-world performance? I'd love to hear your thoughts in the comments.

Also published on Medium.
