An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

The general shape of the problem is that every public LLM benchmark is on a saturation clock that runs from the moment of its publication to the moment a model's training corpus has eaten it. On the visible benchmarks of the last five years, that clock has run for somewhere between twelve and thirty-six months before each one stopped being useful for differentiating frontier models. The benchmarks are not failing. They are doing exactly what they were designed to do, in the order they were designed to do it, and the field has been running through them faster than the people designing them anticipated.

I want to put numbers on the saturation pattern, walk through what the contamination evidence actually says, and then sit with the question of what an honest benchmark would have to look like in 2026 — because the "private held-out eval" answer that the labs are converging on has economics that are worth examining carefully before any of us salute it as the solution.

The saturation timeline, with numbers

HumanEval (Chen et al., OpenAI, July 2021). 164 hand-written Python problems. The benchmark was published with Codex at 28.8% pass@1; the underlying GPT-3 base model scored 0%. GPT-4 (March 2023) hit 67% in the original Technical Report. By late 2024, OpenAI's o1-preview and o1-mini both reached 96.3% pass@1; Claude 3.5 Sonnet sat at 93.7%. The benchmark is saturated in the operational sense: the spread across the top ten models is around 10 percentage points, which is too small a gap to differentiate them on, and most new models arrive within a percentage point or two of the ceiling. The successor variants (HumanEval+ from EvalPlus, with augmented test cases) are the field's response. Lifespan from publication to operational saturation: about 36 months.
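For readers who haven't met the metric: pass@k, as defined in the HumanEval paper, estimates the probability that at least one of k sampled completions passes a problem's unit tests. What follows is a minimal sketch of that paper's unbiased estimator using only the standard library. The function names are mine, and the sample counts in the usage lines are illustrative, not real results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: completions sampled, c: completions that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 - P(all k completions drawn without replacement are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative counts only: three problems, ten samples each.
    print(benchmark_pass_at_k([(10, 10), (10, 3), (10, 0)], k=1))
```

With k=1 the estimator reduces to the plain fraction of passing samples, which is why a pass@1 score can be read as a per-attempt success rate.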

MMLU (Hendrycks et al., September 2020). 57 subjects, ~14,000 multiple-choice questions, taken from publicly available test prep and academic sources. The problem with MMLU is not that it's saturated in the same way HumanEval is (top scores are in the high 80s rather than against the ceiling) but that the benchmark was built from public sources that ended up in training corpora. The contamination evidence is concrete: a 2023 paper by Deng, Zhao, Tang, Gerstein, and Cohan used a "test-set slot guessing" technique, masking one of the incorrect answer options and asking the model to fill in the missing text, and reported that ChatGPT could reproduce the missing option exactly 52% of the time on MMLU, GPT-4 57%. Those numbers are well above what chance plus knowledge would predict, since knowing the subject does not tell a model which wrong answer the test author happened to write. The community response, Microsoft's MMLU-CF accepted at ACL 2025, was a contamination-free reconstruction; on it, model rankings shift considerably. Lifespan from publication to demonstrated contamination: about 36 months.
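To make the slot-guessing idea concrete, here is a rough sketch of how such a probe could be scored. It is an assumption-laden illustration, not the paper's code: the prompt wording, the `[MASK]` token, the normalisation rule, and the `model` callable (any function from prompt text to reply text) are all hypothetical. Which option gets hidden is left as a parameter.

```python
from typing import Callable, Sequence

MASK = "[MASK]"        # hypothetical placeholder token
LETTERS = "ABCDEFGH"


def build_slot_prompt(question: str, options: Sequence[str], masked_idx: int) -> str:
    """Show the question with one option hidden and ask for the hidden text."""
    lines = [f"Question: {question}"]
    for i, option in enumerate(options):
        lines.append(f"{LETTERS[i]}. {MASK if i == masked_idx else option}")
    lines.append(f"Reply with the exact text that belongs in place of {MASK}.")
    return "\n".join(lines)


def normalise(text: str) -> str:
    """Lowercase, collapse whitespace, drop surrounding spaces and full stops."""
    return " ".join(text.lower().split()).strip(" .")


def slot_guess_rate(items, model: Callable[[str], str]) -> float:
    """Fraction of items whose hidden option the model reproduces exactly.

    items: iterable of (question, options, masked_idx) tuples.
    """
    items = list(items)
    hits = 0
    for question, options, masked_idx in items:
        guess = model(build_slot_prompt(question, options, masked_idx))
        hits += normalise(guess) == normalise(options[masked_idx])
    return hits / len(items)
```

The point of the exact-match criterion is that a high rate is hard to explain by subject knowledge alone; a model that merely understands the topic has many plausible ways to fill the slot.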

SWE-bench (Jimenez et al., Princeton/University of Chicago, October 2023; SWE-bench Verified, OpenAI, August 2024). The Verified subset is 500 Python-only tasks: real GitHub issues from popular repositories, hand-vetted to remove ambiguous specifications. May 2026 leaderboard: Claude Mythos Preview at 93.9%, Claude Opus 4.7 at 87.6%, with GPT-5.2 trailing at 80.0%. The contamination story here is the most blunt of the three. OpenAI ran an audit on Verified in early 2026 and found that every frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce verbatim gold patches or problem-statement specifics for some Verified tasks. OpenAI stopped reporting Verified scores and now recommends SWE-bench Pro (1,865 multi-language tasks, not in the same training-corpus blast radius). Lifespan from Verified's August 2024 publication to OpenAI walking away from it in February 2026: about 18 months.

GPQA Diamond (Rein et al., November 2023). 198 graduate-level science questions, the hardest curated subset of GPQA's 448. The benchmark was designed as Google-proof: domain-expert PhDs scored 65% (74% discounting clear self-identified mistakes); skilled non-experts with unrestricted web access scored 34% over an average 30-minute attempt per question. The benchmark is, by construction, hard. It is also being saturated. GPT-4 at the November 2023 release scored 39%. Frontier scores in 2025–2026: Gemini 3.1 Pro Preview at 94.1%, with several other top frontier reasoning models clustered in the high 80s and low 90s. Lifespan from publication to operational saturation: about 30 months. Faster than the older benchmarks. Notice the pattern.

FrontierMath (Epoch AI, November 2024). The benchmark designed explicitly to resist saturation: tiers 1–3 cover undergraduate through early-postdoc mathematics, tier 4 is research-level. Hundreds of original problems, vetted by working mathematicians, never published in answerable form. At launch in late 2024, no tested model exceeded 2% on the full benchmark. By the end of 2025, frontier reasoning models were solving substantial fractions of tiers 1–3, and Epoch's own framing changed from "a benchmark current AI cannot do" to "a benchmark current AI is starting to crack." Lifespan from publication to first significant scores: about 12 months.

ARC-AGI-2 (Chollet et al., May 2025). The contemporary version of the benchmark Chollet has been running since 2019, designed specifically to resist the kind of scaling that crushes the others. Each task is a small grid-puzzle requiring fluid reasoning rather than knowledge. Humans solve roughly 75% of tasks on average. As of late 2025, the best result on the public leaderboard was about 5% for top frontier LLMs; by mid-2026, Gemini 3 Deep Think reached 84.6% on the public leaderboard, while the strongest entry under the Kaggle resource constraints (NVARC) hit 24%. The gap between the public-leaderboard number (no compute limit) and the private-competition number (resource-constrained) is by far the most interesting datum in the table.

The pattern is not that benchmarks are getting easier. The benchmarks are getting harder, by construction; each one is more carefully engineered to be hard than the one before it. The pattern is that the time between the benchmark publication and the benchmark stopping being a useful frontier-model differentiator is shrinking. HumanEval gave the field 36 months. GPQA Diamond got 30. SWE-bench Verified got 18. FrontierMath got 12. ARC-AGI-2 got, depending on which axis you measure, somewhere between 12 months and "still going."

What the contamination evidence actually says

The naive critique of public benchmarks is "the model has seen the test answers." The reality is more textured.

Direct contamination: the test set is in the training corpus verbatim. Deng et al.'s 2023 paper documented this on MMLU through the slot-guessing technique. OpenAI's 2026 audit on SWE-bench Verified documented it through verbatim gold-patch reproduction. The evidence in both cases is unambiguous: the models can reproduce specific test-set artefacts that they would have no other reason to have learned. This is the strong form of contamination, and the field's response (ACL 2025's MMLU-CF, OpenAI's recommendation away from SWE-bench Verified) is appropriate.

Indirect contamination: the test set is not in the training corpus, but related material is. MMLU-CF's reconstructed contamination-free version produced different model rankings from the original MMLU even when the form of the questions was preserved. The implication is that the original MMLU's signal partly reflected familiarity with the surrounding domain text, not just the test items themselves. This is the form of contamination most resistant to detection, because it doesn't show up as verbatim reproduction.

Indirect contamination through downstream artefacts — this is the SWE-bench Verified case in its more interesting form. Verified is built from real GitHub issues from repos like astropy/astropy, django/django, sympy/sympy. The test set isn't the only thing those repos contain; the repositories themselves, including the actual fixes for the chosen issues, are part of the training corpus for any model with a public-code crawl. The model doesn't need to see SWE-bench Verified to score well on a SWE-bench Verified task; it just needs to have read the repository the task was taken from. Filtering the training corpus against the test set is straightforward; filtering against every repository the test set was constructed from is much harder.
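The "straightforward" half of that claim can be sketched as an n-gram overlap filter: drop any training document that shares a long enough word sequence with a test item. This is a simplified illustration under my own assumptions (whitespace tokenisation, exact matching, a default window of 13 words, which is a length some published decontamination procedures have used), not any lab's actual pipeline.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercase whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_test_index(test_items: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Union of every n-gram that appears in any test item."""
    index: set[tuple[str, ...]] = set()
    for item in test_items:
        index |= ngrams(item, n)
    return index


def is_contaminated(document: str, test_index: set, n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the test set."""
    return not ngrams(document, n).isdisjoint(test_index)


def filter_corpus(corpus: list[str], test_items: list[str], n: int = 13) -> list[str]:
    """Keep only the documents with no n-gram overlap against the test set."""
    index = build_test_index(test_items, n)
    return [doc for doc in corpus if not is_contaminated(doc, index, n)]
```

A filter like this catches copies of the benchmark file itself. It does nothing about the upstream commit that fixed the issue, the pull-request discussion, or later code built on the fix, none of which need share a long token run with the test item. That is the harder half of the problem.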

The third form is what makes the saturation timeline accelerate. The field can construct each new benchmark to avoid direct contamination by keeping the test set itself private. It cannot construct each new benchmark to avoid indirect contamination through downstream artefacts unless the source domain is also closed. Mathematics research papers, GitHub repositories, scientific abstracts, the entire OpenStax textbook corpus — these are training data, and any benchmark constructed from them inherits the contamination risk of the source.

What "private held-out eval" actually means

The labs' converging answer to public-benchmark saturation is the private held-out eval. Anthropic, OpenAI, Google, Scale AI, METR, Apollo Research, and Epoch AI all run private evaluation suites that they don't publish in answerable form. The economics of these evaluations are worth examining honestly.

The case for private evals. A test set kept private cannot be in any training corpus, by construction. The test items can be refreshed faster than models can be retrained. The evaluators can adversarially design new items to target observed model weaknesses. The published numbers are not contaminated.

The case against, which the labs do not lead with. A private eval is a published number with no independent verification path. A lab's claim that "the model scored 67% on our private eval" is a materially different sentence from "the model scored 67% on a public benchmark anyone can re-run." The first is a marketing artefact. The second is a piece of evidence. The history of corporate self-reported benchmarks across every prior tech category (automotive fuel economy, hard-disk read speeds, network-equipment throughput, smartphone battery life) is one in which the published numbers and the independently measured numbers differ in predictable directions. The same incentive structure applies to the labs.

The intermediate solution that's emerged is the externally managed private eval. Scale AI runs evaluations with the test items held in escrow; only the labs' submissions and the resulting scores are published. Epoch AI's FrontierMath keeps its problems and answers private apart from a small set of published sample problems, so researchers can see the kind of question being asked but cannot game the answer key directly. METR's autonomy evaluations are run by an external team that the lab grants access to the agent under test, while the test setup remains private. These are partial solutions. They depend on the evaluator's neutrality, the evaluator's funding model, and the evaluator's willingness to publish embarrassing numbers. None of these properties is guaranteed.

The contamination-free public benchmark is a contradiction in terms. Once a public benchmark is published, it's in the next training corpus. The half-life is bounded above by the model release cycle. This is not a fixable property; it's the consequence of training on the open web. The choice the field is making, slowly and without explicit framing, is between public benchmarks with short useful lives and private benchmarks with no independent verification. Neither is what the public-benchmark era pretended to offer.

What an actually-falsifiable benchmark would have to look like

Listing the properties:

1. The test items must not be in any training corpus. Strict definition: the items must have been generated after the training cutoff of every model under evaluation, on a refresh schedule faster than the model release cadence.
2. The source domain must not contain answers either. A benchmark drawn from public Stack Overflow inherits Stack Overflow contamination on indirect grounds.
3. The evaluator must not be the developer. Self-reported scores have a known bias direction.
4. The evaluator's funding model must not be controlled by the developer. Scale AI's evaluations are paid for by the labs. This is structurally not a clean separation.
5. The score must be reproducible by a third party. A private eval that publishes only the score gives a single number with nothing behind it; it doesn't enable independent verification.
6. The benchmark must be refreshed. A benchmark that doesn't refresh is on a saturation clock; the half-life of a frozen public benchmark is roughly the gap between two model generations.
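The six properties above can be written down as a checklist, which makes it easier to see which ones a given benchmark misses. This is a sketch under my own assumptions: the field names are hypothetical, and reducing each property to a boolean or a date comparison is a simplification of what are really judgment calls.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BenchmarkProfile:
    oldest_item_created: date             # creation date of the oldest live test item
    latest_model_cutoff: date             # newest training cutoff among models tested
    source_domain_has_answers: bool
    evaluator_is_developer: bool
    developer_funds_evaluator: bool
    third_party_can_reproduce: bool
    refresh_interval_days: Optional[int]  # None means the benchmark is frozen
    model_release_cadence_days: int


def failed_properties(b: BenchmarkProfile) -> list[str]:
    """Return the properties, in the order listed above, that b does not meet."""
    checks = [
        ("items postdate every training cutoff",
         b.oldest_item_created > b.latest_model_cutoff),
        ("source domain holds no answers", not b.source_domain_has_answers),
        ("evaluator is not the developer", not b.evaluator_is_developer),
        ("funding is independent of the developer",
         not b.developer_funds_evaluator),
        ("score is reproducible by a third party", b.third_party_can_reproduce),
        ("refresh outpaces the release cadence",
         b.refresh_interval_days is not None
         and b.refresh_interval_days < b.model_release_cadence_days),
    ]
    return [name for name, ok in checks if not ok]
```

A frozen public benchmark drawn from an open source domain fails the first, second, and last checks regardless of who runs it, which is the structural problem the rest of this section is about.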

LiveBench and LiveCodeBench attempt the refresh property — they generate or curate new test items monthly and publish results against the rolling window. Chatbot Arena (LMSYS) attempts the user-generated test items property — every prompt comes from a real user interaction, so the test distribution is open-ended and not authorable in advance. Each gets one or two of the falsifiability properties above. None of them gets all six.

The summary that matches the data

If I tabulate what's actually visible in the saturation evidence, the picture is unambiguous.

Benchmark	Published	Top score 2025–2026	Saturation lifespan	Primary contamination concern
HumanEval	2021-07	96.3% (o1-preview)	~36 months	Direct: 164 problems publicly indexed since release
MMLU	2020-09	high 80s	~36 months	Direct: documented test-slot reproduction in 2023
GPQA Diamond	2023-11	94.1% (Gemini 3.1 Pro Preview)	~30 months	Indirect: scientific literature is in training corpora
SWE-bench Verified	2024-08	93.9% (Claude Mythos Preview)	~18 months (OpenAI walked away Feb 2026)	Indirect: training corpus includes the source repos
FrontierMath	2024-11	non-trivial fraction by end-2025	~12 months to first signal	Designed against direct contamination; indirect risk via mathematics literature
ARC-AGI-2	2025-05	84.6% public / 24% Kaggle-constrained	12 months and counting	Designed to resist scaling; the public-vs-constrained gap is the data point

A few things stand out reading this table. The lifespan column is shrinking. The "top score" column has hit the high 80s or above on every benchmark in the table that isn't FrontierMath or ARC-AGI-2 — and even those are starting to move. The contamination-concern column has no row that's clean; even benchmarks designed against direct contamination inherit indirect contamination from the source domain.
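The claim that the lifespan column is shrinking can be checked with a one-line regression over the table's own figures. The sketch below fits a least-squares line to lifespan against publication date. ARC-AGI-2 is left out because its clock is still running, and the caveats are obvious: five points, and the lifespans are not defined identically across rows, so treat the slope as a direction rather than a measurement.

```python
# (benchmark, published_year, published_month, lifespan_months), from the table.
ROWS = [
    ("MMLU", 2020, 9, 36),
    ("HumanEval", 2021, 7, 36),
    ("GPQA Diamond", 2023, 11, 30),
    ("SWE-bench Verified", 2024, 8, 18),
    ("FrontierMath", 2024, 11, 12),
]


def lifespan_trend(rows) -> float:
    """Least-squares slope: change in lifespan (months) per publication year."""
    xs = [year + (month - 1) / 12 for _, year, month, _ in rows]
    ys = [float(lifespan) for *_, lifespan in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    print(f"{lifespan_trend(ROWS):.1f} months of lifespan per publication year")
```

A negative slope means each later benchmark bought the field less time than the one before it.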

What this means for reading benchmark numbers

The published number on a public benchmark is informative for a specific window after publication and roughly noise after that window closes. HumanEval and GPQA Diamond are at the ceiling, and MMLU's scores are compromised by documented contamination. FrontierMath and ARC-AGI-2 are still informative, but won't stay informative for as long as their predecessors did.

The honest reading of any 2026 frontier-model release is to look at which benchmarks the lab is reporting and which it is conspicuously not. OpenAI's silence on SWE-bench Verified is more informative than any number OpenAI is still publishing. Labs that report across the full saturated slate are doing well on benchmarks they know to be saturated; labs emphasising FrontierMath, ARC-AGI-2, or in-house held-out evals are differentiating on harder ground. The signal is in the choice.

A benchmark is only useful for as long as it's hard, and hard is the gap between the benchmark's source distribution and the model's training distribution — a gap that shrinks with every new corpus. The appropriate posture toward any score is to ask three things: the benchmark's age, the contamination evidence, the spread among the top ten models. Read together, they tell you whether the headline is information or wallpaper.
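For anyone who wants the three questions as a routine rather than a habit, here is a small sketch. The thresholds are deliberately left as required arguments because I don't have a defensible universal value for either; the function names and the verdict labels are mine, and the scores in the usage comment are hypothetical.

```python
from datetime import date


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def top_k_spread(scores: list[float], k: int = 10) -> float:
    """Gap in percentage points between the best and the k-th best score."""
    top = sorted(scores, reverse=True)[:k]
    return top[0] - top[-1]


def triage(published: date, as_of: date, scores: list[float],
           contamination_documented: bool,
           max_age_months: int, min_spread: float) -> dict:
    """Answer the three questions; both thresholds are the caller's assumption."""
    age = months_between(published, as_of)
    spread = top_k_spread(scores)
    informative = (age <= max_age_months
                   and not contamination_documented
                   and spread >= min_spread)
    return {
        "age_months": age,
        "top_spread": spread,
        "contamination_documented": contamination_documented,
        "verdict": "information" if informative else "wallpaper",
    }


# Hypothetical usage:
# triage(date(2024, 8, 1), date(2026, 2, 1), [93.9, 87.6, 80.0],
#        contamination_documented=True, max_age_months=24, min_spread=5.0)
```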