Why Your LLM Evals Are Lying to You

This article originally appeared in The Forward Pass, a weekly newsletter for ML engineers who ship.

Three failure modes that make most LLM benchmarks decoration, not science.

by Maxim Enis · 4 min read

You ran your model on MMLU. The score went up. You shipped. A week later, the support tickets are different in shape but identical in volume. What happened?

LLM evaluation is the most quietly broken part of the stack right now. The benchmarks have not kept up with what production models can do, and the evals that teams build internally rarely correlate with anything users actually care about. Here are three failure modes to watch.

Contamination is everywhere. The base model has seen MMLU. It has seen GSM8K. It has seen HumanEval. The 2024-era leak audits showed material training-set contamination in every public eval more than 18 months old. If your eval was informative before, it now measures memorization. The fix is not to write a new MMLU; it is to write evals that are dynamic. Generate questions at eval time from a templated grammar, or rewrite your private eval set into fresh synthetic variants the model cannot have seen. Static evals have a half-life now.
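
A minimal sketch of the templated-grammar idea, assuming a toy arithmetic task. The templates, the `generate_eval` and `score` helpers, and the `model_fn` callable are all hypothetical placeholders; a real suite would use templates that mirror your own traffic.

```python
import random

# Hypothetical templates: each pairs a prompt pattern with a function
# that computes the ground-truth answer from the sampled values.
TEMPLATES = [
    ("What is {a} + {b}?", lambda a, b: a + b),
    ("What is {a} - {b}?", lambda a, b: a - b),
    ("What is {a} * {b}?", lambda a, b: a * b),
]


def generate_eval(n, seed):
    """Return n (prompt, expected_answer) pairs sampled at eval time."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        template, solve = rng.choice(TEMPLATES)
        a, b = rng.randint(100, 9999), rng.randint(100, 9999)
        items.append((template.format(a=a, b=b), str(solve(a, b))))
    return items


def score(model_fn, items):
    """Exact-match accuracy; model_fn maps a prompt string to an answer string."""
    correct = sum(model_fn(prompt).strip() == expected for prompt, expected in items)
    return correct / len(items)
```

Passing a new seed on every run produces a fresh question set, and logging that seed keeps any individual run reproducible.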

Single-number scores are lossy. Aggregate accuracy hides every interesting failure mode. A model that improves from 68% to 72% on a benchmark might be regressing on the 5% of prompts your users actually send. You need stratified evals: break the score down by prompt difficulty, prompt length, language, domain, and whatever else varies in your traffic. Most teams discover their "improved" model is worse on the long tail only when they finally instrument this. Instrument it before the launch, not after the rollback.
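
Here is one way the breakdown could look, as a sketch. It assumes each graded result is a dict with a boolean `correct` field plus whatever metadata you bucket on; the field names and both helper functions are illustrative, not a standard API.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Accuracy per bucket, where each record is a dict with `key` and 'correct'."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        bucket = record[key]
        totals[bucket] += 1
        hits[bucket] += int(record["correct"])
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def regressions(old, new, tolerance=0.0):
    """Buckets where the new score is lower than the old one by more than tolerance."""
    return {
        bucket: (old[bucket], new[bucket])
        for bucket in old
        if bucket in new and new[bucket] < old[bucket] - tolerance
    }
```

With made-up numbers: on 90 short prompts and 10 long ones, a model that goes from 60 to 67 correct on short and from 8 to 5 correct on long moves the aggregate from 68% to 72% while the long bucket falls from 80% to 50%. `regressions` flags the long bucket; the headline number never would.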

LLM-as-a-judge is correlated noise. Using GPT-4 to grade your model is convenient and seductive and wrong about 15% of the time, with biases that are stable per grader. A judge rates outputs from its own model family more favorably. Verbose responses win on style. Order of presentation alone shifts scores by 5-8 points. If you are going to use a judge, randomize the grader across at least three model families, randomize position, and calibrate against human-graded held-out samples. Otherwise you are measuring grader preference and calling it model quality.
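
The randomization and calibration steps can be sketched roughly as follows. This assumes pairwise comparison and treats each judge as a hypothetical callable `judge(prompt, first, second)` that returns `"first"` or `"second"`; you would wrap your actual grader APIs to fit that shape. `judge_pair` and `agreement_rate` are illustrative names.

```python
import random


def judge_pair(judges, prompt, answer_a, answer_b, rng):
    """Return 'a' or 'b' using a randomly chosen judge and a random presentation order.

    Each judge is a hypothetical callable judge(prompt, first, second)
    that returns "first" or "second".
    """
    judge = rng.choice(judges)
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    picked_first = verdict == "first"
    return "b" if picked_first == swapped else "a"


def agreement_rate(judge_labels, human_labels):
    """Fraction of held-out samples where the judge label matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A judge that always picks whichever answer comes first lands near 50/50 under this scheme, so pure position bias shows up as noise instead of a preference. `agreement_rate` on a human-graded held-out set then tells you how much of the remaining signal to trust.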

The eval setup we run internally: a private, weekly-rotated test set of 500 prompts drawn from anonymized user traffic, hand-graded against a rubric, plus an automated regression suite of 5,000 templated prompts where the correct answer is checkable programmatically. The first catches qualitative drift; the second catches catastrophic regressions. We stopped trusting any benchmark we did not write within the last three months.
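
One way a weekly rotation could be made reproducible, offered as a sketch and not as a description of the setup above: seed the sampler with the ISO week so everyone grading in the same week sees the same 500 prompts. `weekly_sample` and the idea of integer or string prompt IDs are assumptions for illustration.

```python
import datetime
import random


def weekly_sample(prompt_ids, day, k=500):
    """Deterministically sample up to k prompt IDs for the ISO week containing `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    # A string seed is hashed deterministically, so the sample is stable across runs.
    rng = random.Random(f"{iso_year}-W{iso_week:02d}")
    pool = sorted(prompt_ids)  # fix the ordering so the input order does not matter
    return rng.sample(pool, min(k, len(pool)))
```

Any two dates in the same ISO week return the same sample, and the following week rotates to a new one without anyone having to store state.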

The deeper point: if you cannot articulate the specific user behavior your eval is supposed to predict, your eval is decoration. Evals are not science fairs. They are production decisions in disguise.