- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ran the eval. The new model scored 92% on MMLU, 88% on GSM8K, 81% on HumanEval. Three numbers, three benchmarks, the dashboard is green. Ship the upgrade.
Then a colleague writes the same MMLU question with different wording. Same answer choices, just rephrased. Score drops nine points. Try another rewording. It drops eleven. The model did not get worse at the underlying skill. It got worse at recognising the exact string it had memorised.
That is train-test contamination. The benchmarks that built the leaderboards sat in scraped Common Crawl, Stack Exchange dumps, GitHub repos, and PDF papers for years before they were used to grade frontier models. By 2026 a substantial fraction are partially or fully memorised. The score on the dashboard mixes skill and recall. You can't tell which is which.
How bad is it
The empirical work has been piling up since 2023. A few signals worth pinning to:
- Multiple independent contamination surveys published since 2023 consistently find statistically significant training-data overlap between frontier models and the most-cited public benchmarks. The citations below are the load-bearing evidence; treat any single headline figure as a directional finding rather than a measurement.
- Several published surveys (Ravaut et al., 2024; Cheng et al., 2025; Xu et al., 2024) catalog the same finding with different methods: HumanEval, GSM8K, and MMLU show measurable contamination across most major models, and at least one survey reports the gap between original-prompt and paraphrased-prompt accuracy widening with model size.
- A 2024 method called PaCoST (Zhang et al., 2024) detects contamination by comparing the model's confidence on the original test item against the same item paraphrased into a semantically equivalent rewrite. Big confidence drop on paraphrase = the original was memorised.
- Min-K% Prob (Shi et al., 2024) flags contamination by tracking the K% lowest-probability tokens in a sequence. Memorised text has fewer surprising tokens than novel text drawn from the same distribution.
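The Min-K% scoring step is small enough to sketch inline. A minimal version, assuming you already have per-token logprobs from a model that exposes them; the function name and the 20% default are illustrative, not taken from the paper:

```python
def min_k_prob(logprobs: list[float], k: float = 0.20) -> float:
    """Mean log-probability of the k least-likely tokens in a sequence.
    Memorised text scores closer to zero: fewer surprising tokens."""
    n = max(1, int(len(logprobs) * k))
    return sum(sorted(logprobs)[:n]) / n
```

Scores are compared against a threshold calibrated on text the model has not seen, such as documents published after its training cutoff.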
The methods all converge on the same shape: a memorised test item behaves differently from a fresh one. Lower perplexity, higher confidence, and much bigger score drops when you rephrase the question.
That last property is what makes contamination testable inside your own eval set. You only need a paraphraser and a delta — no training-data access required.
The paraphrase-gap test
The shape:
- For each test item, compute the model's score on the original prompt.
- Generate three to five paraphrases of the same item that preserve the answer.
- Compute the model's score on each paraphrase.
- Flag any item where the original passed and the mean paraphrase score drops by at least 0.30 (with `n_para=3`, that's the original passing and any one paraphrase failing).
The 0.30 threshold is a working per-item heuristic, not a proven number. Pick it as your starting point and tune from there once you have data on your own set. The exact threshold is not what matters — uncontaminated items should be roughly invariant to paraphrase, and the contaminated ones drop hard because the model recognises them by surface form. If you prefer the aggregate framing from the paraphrase-gap papers, run `audit` over the whole set and compare base accuracy to mean paraphrase accuracy at that level instead.
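That aggregate number falls straight out of the report the harness below produces. A small helper, assuming the report dict returned by the `audit` function further down:

```python
def aggregate_gap(report: dict) -> float:
    """Set-level signal: base accuracy minus mean paraphrase accuracy."""
    items = report["items"]
    if not items:
        return 0.0
    base = sum(r["base_score"] for r in items) / len(items)
    para = sum(r["para_mean"] for r in items) / len(items)
    return base - para
```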
A 100-line contamination harness
Real, runnable. Drop it next to your eval and point it at any test set with `prompt` and `expected` fields.
```python
import json
import math

from anthropic import Anthropic
from openai import OpenAI

anth = Anthropic()
oai = OpenAI()

TARGET = "claude-sonnet-4-5"       # model under audit
PARA_MODEL = "claude-haiku-4-5"    # cheap paraphraser
PERPLEXITY_MODEL = "gpt-4o-mini"   # logprob-returning proxy


def score(prompt: str, expected: str) -> float:
    """Run the target model and grade by expected-substring match."""
    r = anth.messages.create(
        model=TARGET, max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    text = r.content[0].text.strip().lower()
    return 1.0 if expected.strip().lower() in text else 0.0


def paraphrase(prompt: str, n: int = 3) -> list[str]:
    """Ask a cheap model for n answer-preserving rewrites of the prompt."""
    instr = (
        f"Rewrite the question below in {n} different ways. "
        "Keep the meaning and the correct answer identical. "
        "Change wording, sentence order, and synonyms. "
        "Return one rewrite per line, no numbering."
    )
    r = anth.messages.create(
        model=PARA_MODEL, max_tokens=600,
        messages=[{"role": "user", "content": f"{instr}\n\n{prompt}"}],
    )
    lines = [l.strip() for l in r.content[0].text.splitlines() if l.strip()]
    return lines[:n]


def perplexity_proxy(text: str) -> float:
    """Perplexity of a verbatim regeneration of the text, from a
    logprob-returning model. Lower = less surprising = more familiar."""
    r = oai.chat.completions.create(
        model=PERPLEXITY_MODEL,
        messages=[{"role": "user", "content": f"Repeat verbatim:\n{text}"}],
        logprobs=True,
        max_tokens=max(64, len(text)),  # char count as a generous token budget
    )
    lps = [tok.logprob for tok in r.choices[0].logprobs.content]
    if not lps:
        return float("nan")
    return math.exp(-sum(lps) / len(lps))


def check_item(prompt: str, expected: str,
               n_para: int = 3,
               drop_threshold: float = 0.30) -> dict:
    """Flag an item when the original passes but paraphrases drop hard."""
    base = score(prompt, expected)
    paras = paraphrase(prompt, n=n_para)
    para_scores = [score(p, expected) for p in paras]
    para_mean = (sum(para_scores) / len(para_scores)
                 if para_scores else 0.0)
    delta = base - para_mean
    ppl = perplexity_proxy(prompt)
    flagged = base > 0 and delta >= drop_threshold
    return {
        "prompt": prompt,
        "base_score": base,
        "para_scores": para_scores,
        "para_mean": para_mean,
        "score_drop": delta,
        "perplexity_proxy": ppl,
        "flagged": flagged,
        "paraphrases": paras,
    }


def audit(test_set: list[dict]) -> dict:
    """Run check_item over every item and summarise the flag rate."""
    results = [check_item(it["prompt"], it["expected"])
               for it in test_set]
    flagged = [r for r in results if r["flagged"]]
    valid_ppl = [r["perplexity_proxy"] for r in results
                 if not math.isnan(r["perplexity_proxy"])]
    median_ppl = (sorted(valid_ppl)[len(valid_ppl) // 2]
                  if valid_ppl else float("nan"))
    return {
        "total": len(results),
        "flagged_count": len(flagged),
        "flagged_pct": len(flagged) / max(1, len(results)),
        "median_perplexity_proxy": median_ppl,
        "items": results,
    }


if __name__ == "__main__":
    with open("test_set.jsonl") as f:
        items = [json.loads(l) for l in f]
    report = audit(items)
    print(f"Flagged: {report['flagged_count']} / {report['total']} "
          f"({report['flagged_pct']:.1%})")
    print(f"Median perplexity proxy: "
          f"{report['median_perplexity_proxy']:.2f}")
    with open("contamination_report.json", "w") as f:
        json.dump(report, f, indent=2)
```
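The harness expects `test_set.jsonl` with one JSON object per line, carrying the two fields the code reads. Illustrative rows:

```jsonl
{"prompt": "What is the capital of Australia?", "expected": "Canberra"}
{"prompt": "Which HTTP status code means Too Many Requests?", "expected": "429"}
```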
What it does, plain:
- `score` runs the target model on a prompt and checks for the expected substring. Replace it with your real grader if you have one.
- `paraphrase` asks a cheap model for `n` semantically equivalent rewrites. The instruction matters: tell it to keep the answer fixed.
- `perplexity_proxy` asks an OpenAI logprob-returning model to repeat the string verbatim and averages the token logprobs from that regeneration. It approximates "how surprising is this text to a large model trained on similar web data" — a reasonable proxy when target-model logprobs are unavailable, not a true perplexity over the prompt. Lower number = less surprise. A test item with a much-lower-than-median proxy is a candidate for "this string was in pretraining."
- `check_item` ties them together: base score minus mean paraphrase score. If the drop crosses the threshold and the original passed, the item is flagged.
- `audit` runs the whole set and prints the rate.
A note on the perplexity proxy: it uses a different model than the one being graded. The clean version would use the target model's own logprobs, but most hosted chat APIs (Anthropic among them) do not expose them. So the proxy answers "is this string surprising to some large model trained on the same web." It's a corroborator. The paraphrase delta is the load-bearing signal.
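If the model under audit is open-weights, you can run the clean version and score the prompt with the target's own logprobs. A sketch using Hugging Face transformers, assuming the checkpoint you are auditing is loadable locally; `model_name` is whatever that checkpoint is:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def target_perplexity(text: str, model_name: str) -> float:
    """True perplexity of the text under the target model itself."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids yields the mean next-token cross-entropy
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))
```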
What the output tells you
Run it on a 100-item set. Three things to look at:
- Flagged percentage. A clean private set should sit near zero. 5% is a yellow flag, 10% or more is a red one. If your "private" set is 10% flagged, parts of it leaked, or the items are templated enough that the model recognises the pattern.
- Perplexity-proxy distribution. Plot the histogram. A sharp left tail (a small cluster with much lower perplexity than the rest) usually corresponds to memorised items. Cross-reference with the flagged list; a quick way to pull that tail is sketched after this list.
- Per-item detail. The flagged items are the actionable output. Read them. If the item is a famous coding interview question or a well-known ethics dilemma, you have your answer. Replace it.
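Pulling that left tail out of the report file takes a few lines. The half-median cutoff below is an arbitrary starting point, not a calibrated one:

```python
import json
import math
import statistics

with open("contamination_report.json") as f:
    report = json.load(f)
ppls = [r["perplexity_proxy"] for r in report["items"]
        if not math.isnan(r["perplexity_proxy"])]
cutoff = 0.5 * statistics.median(ppls)
for r in report["items"]:
    if r["perplexity_proxy"] < cutoff:   # NaN compares False, so it's skipped
        print(f"{r['perplexity_proxy']:6.2f}  flagged={r['flagged']}  "
              f"{r['prompt'][:60]}")
```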
The actual fix: build a private set from your traffic
Detection is half the job. The other half is having something to fall back to once contamination kills the public benchmark for you.
The pattern teams settle on:
- Pull from production. The last 200 to 500 user conversations, sampled across intent buckets. Real traffic, real ambiguity, your actual distribution.
- Label by hand, once. A domain expert assigns the expected answer or rubric. This is the gate that makes the set yours.
- Paraphrase-regenerate every six months. Run the same paraphrase loop on your own set. If your private items show up in scraped data later (vendors crawl public dev.to posts, GitHub READMEs, and any wiki you accidentally exposed), the rewrites buy you another cycle of clean signal before you have to re-mine.
- Lock the set in Git. Versioned. Diffable. Add a row, bump the manifest. Track Cohen's kappa with the human grader on every run; a minimal kappa helper follows this list.
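For the kappa tracking in the last step, a minimal helper for binary pass/fail labels, assuming your judge and your human grader scored the same items in the same order:

```python
def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Cohen's kappa for two binary raters.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    n = len(judge)
    po = sum(j == h for j, h in zip(judge, human)) / n  # observed agreement
    pj, ph = sum(judge) / n, sum(human) / n             # positive-label rates
    pe = pj * ph + (1 - pj) * (1 - ph)                  # agreement by chance
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)
```

Run it against the stored labels on every model bump; a falling kappa means the judge has drifted off your rubric.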
The contamination harness above runs on this set too. The flagged rate on a fresh, hand-labelled, internal set should sit at or near zero. When it starts climbing, you know the set has aged.
What this means for the leaderboards
The honest version: a single number on MMLU is a noisy estimate of one capability mixed with an unknown amount of recall. GSM8K and HumanEval are no different. The frontier-model rankings on those benchmarks have been compressed near the ceiling for years, and most of the differentiation between top models on those datasets is inside the noise band of the benchmarks themselves.
Treat the public benchmarks as smoke tests. The decision that affects users should ride on a private, contamination-checked, paraphrase-rotated set. The harness above is the cheap insurance that tells you when the smoke test stops being honest.
Carlini et al.'s extraction work on training data is the reason any of this is testable from the outside. The same property that makes phone numbers leak from GPT-2 makes benchmark items detectable in any large model: the more times a string appeared in training, the more the model behaves like it has seen it. Paraphrase the surface and the recall component falls out as the gap between original and rewrite.
If this was useful
The LLM Observability Pocket Guide covers the eval side of the observability story end to end: how to build a private golden set, how to wire contamination checks into CI, how to track judge-vs-human agreement on every model bump, and which tracing-and-eval tools on the market do this work for you versus which ones leave it to your team. Short book, written for engineers picking the stack and for the ones inheriting one.
