How Goodhart's Law explains why GPT-5 scoring 93% doesn't mean what you think it means
You've probably seen the headlines: "GPT-5 scores 93% on HumanEval!" These numbers drive billion-dollar valuations. But they might be telling us less and less about actual AI capability.
Why? Because of something economists figured out 50 years ago.
The Laws That Explain Everything
Goodhart's Law, in Marilyn Strathern's widely quoted phrasing of economist Charles Goodhart's observation:
"When a measure becomes a target, it ceases to be a good measure." — Strathern, 1997
Social scientist Donald Campbell noticed the same pattern:
"The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures." — Campbell, 1979
When metrics carry consequences, people optimize for the metric — not the thing it measures.
The Evidence
The Contamination Problem: Researchers (Deng et al., 2024) tested whether GPT-4 had "seen" MMLU questions during training by hiding one answer option from a question and asking the model to fill it in. A model that hadn't memorized the question should recover the missing option only about 25% of the time. GPT-4 got it right 57% of the time, more than double chance. The model had memorized parts of the test.
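The probe is simple to reproduce in spirit. Here's a minimal sketch, assuming a placeholder `ask_model` function for whatever API you use; the prompt format and scoring below are simplified stand-ins, not Deng et al.'s actual protocol:

```python
import random

def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to the model under test and return its reply."""
    raise NotImplementedError("wire this up to your model API")

def contamination_probe(questions: list[dict], n_trials: int = 100) -> float:
    """Hide one answer option per multiple-choice question and ask the model to
    reconstruct it. Exact-match accuracy far above chance suggests the item
    (or its source) appeared in the model's training data."""
    sample = random.sample(questions, min(n_trials, len(questions)))
    hits = 0
    for q in sample:
        hidden_idx = random.randrange(len(q["options"]))
        hidden = q["options"][hidden_idx]
        shown = [opt for i, opt in enumerate(q["options"]) if i != hidden_idx]
        prompt = (
            f"Question: {q['question']}\n"
            f"Three of the four answer options are: {shown}\n"
            "What is the missing option? Reply with the option text only."
        )
        guess = ask_model(prompt)
        hits += int(guess.strip().lower() == hidden.strip().lower())
    return hits / len(sample)
```

A rate far above what blind guessing would allow, like the 57% reported for GPT-4, is hard to explain without memorization.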
The Reality Gap: GPT-5 scores 93% on HumanEval. But HumanEval-style scores have a history of overstating real-world performance. On NaturalCodeBench (built from real user coding questions), the gaps for earlier models are enormous:
| Model | HumanEval (%) | NaturalCodeBench (%) | Gap (points) |
|---|---|---|---|
| GPT-4 | 90 | 53 | 37 |
| WizardCoder | 73 | 24 | 49 |
| Llama-3-70B | 82 | 39 | 43 |
Gaps of 37 to 49 points. That's not a rounding error; that's a canyon.
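If you want the same comparison as numbers rather than a table, it's a two-line calculation (figures copied from the rows above):

```python
# Pass rates from the table above (percent).
scores = {
    "GPT-4":       (90, 53),   # (HumanEval, NaturalCodeBench)
    "WizardCoder": (73, 24),
    "Llama-3-70B": (82, 39),
}

for model, (bench, real) in scores.items():
    gap = bench - real
    drop = 100 * gap / bench  # share of the benchmark score lost on real tasks
    print(f"{model}: {gap}-point gap ({drop:.0f}% relative drop)")
```

Framed as a relative drop, between 40% and two-thirds of the headline performance disappears once the questions come from real users.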
We've Seen This Before
Campbell's Law plays out everywhere:
- Education (NCLB): Teaching to the test, score manipulation (Nichols & Berliner, 2007)
- Policing (CompStat): 78% of surveyed NYPD captains saw unethical report changes (Eterno & Silverman, 2012)
- Healthcare (NHS): Ambulances queuing outside ERs to game wait times (Bevan & Hood, 2006)
- Corporate (Wells Fargo): Millions of fake accounts to hit sales targets
Same pattern: high-stakes metric → gaming → metric ≠ reality.
What This Means for You
Don't trust leaderboards blindly. MMLU scores compressed from a 17.5-point spread (2023) to a 0.3-point spread (2024). The benchmark stopped discriminating.
Instead: test models on your use cases. Be skeptical of any single number claiming to capture "intelligence."
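What does "test on your use cases" look like in practice? A minimal sketch: keep a small private set of tasks drawn from your real work, each with a programmatic check, and run every candidate model through it. The `generate` function and the example tasks below are placeholders, not any specific vendor's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # True if the model's output is acceptable

def generate(model: str, prompt: str) -> str:
    """Placeholder: call whatever API or local model you actually use."""
    raise NotImplementedError

# A private, never-published test set built from real work, not benchmark items.
TASKS = [
    Task(
        prompt="Write a SQL query returning the top 5 customers by 2024 revenue.",
        check=lambda out: "group by" in out.lower() and "limit 5" in out.lower(),
    ),
    Task(
        prompt="Summarize the following incident report in exactly three bullet points: ...",
        check=lambda out: sum(line.lstrip().startswith(("-", "*"))
                              for line in out.splitlines()) == 3,
    ),
]

def evaluate(model: str) -> float:
    """Fraction of your own tasks the model passes."""
    return sum(t.check(generate(model, t.prompt)) for t in TASKS) / len(TASKS)
```

Because the set stays private, it can't leak into training data, which is exactly the property public leaderboards have lost.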
The Bigger Picture
If benchmarks are gamed, how do we know we're making progress? This matters especially for safety — safety benchmarks will face the same corruption pressures.
I've written a research paper exploring this in depth: Read the full paper on Zenodo
The core insight is 50 years old:
When you optimize for a metric, you might get the metric — but lose what you actually wanted.
What's your experience with AI benchmark claims vs. real-world performance? I'd love to hear your thoughts in the comments.
Also published on Medium.