Maya Andersson

Posted on Jul 3

We fixed the worst prompt variant. It got better. That doesn't mean the fix worked.

#llm #ai #analytics #programming

A pattern I've seen on more than one team: the weekly eval run finishes, someone sorts the leaderboard, and the worst-performing prompt variant or model checkpoint gets flagged for attention. Someone makes a change: a tweak to the system prompt, a different few-shot example, sometimes just a rewording of one instruction. Next week's eval run shows that variant improved. The team credits the fix.

Sometimes the fix is real. Often, some or all of that improvement would have happened anyway, with no change at all, just because of how "worst performer" gets selected.

The mechanism

If your eval set is small (a few hundred examples, sometimes fewer) or your judge scoring is noisy (LLM-judge variance, human rater disagreement, both), then any variant's measured score on a given week is its true underlying quality plus some random noise. Averaged over many weeks, that noise roughly cancels out. But the variant you pick out as "worst this week" wasn't selected at random. It was selected because it had a low score, and a low score is disproportionately likely to belong to a variant whose true quality is below average, but not as bad as it looks, and that also got unlucky that week.

Next time you measure it, the "got unlucky" part isn't guaranteed to repeat. On average, it won't. So the score tends to drift back up toward that variant's actual mean, regardless of whether anyone touched the prompt. This is regression to the mean, one of the oldest documented statistical phenomena (Galton described it in the 1880s while studying the heights of parents and children, well before anyone was eval-ing a prompt). It shows up anywhere you select on an extreme observation and then re-measure.
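
To put a rough number on that drift, here is a sketch under the simplest assumptions (mine, for illustration): true quality t and noise e are independent and Gaussian, and the population mean and both variances are known. The symbols \mu, \sigma_t, \sigma_e and \rho are notation introduced here, not something the argument above depends on.

```latex
\begin{aligned}
x_1 &= t + e_1, \qquad x_2 = t + e_2, \qquad
t \sim \mathcal{N}(\mu, \sigma_t^2), \quad e_1, e_2 \sim \mathcal{N}(0, \sigma_e^2) \\
\rho &= \frac{\sigma_t^2}{\sigma_t^2 + \sigma_e^2} \\
\mathbb{E}[x_2 \mid x_1] &= \mu + \rho\,(x_1 - \mu) \\
\mathbb{E}[x_2 - x_1 \mid x_1] &= (1 - \rho)\,(\mu - x_1)
\end{aligned}
```

Any score below \mu has a positive expected change, and the noisier the measurement (smaller \rho), the larger that change. With the parameters used in the simulation further down (\sigma_t = 0.05, \sigma_e = 0.06), \rho is about 0.41, so a variant observed at 0.60 against a mean of 0.75 is expected to come back roughly 0.59 × 0.15 ≈ 0.09 higher with no change at all.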

The eval version of the trap: you didn't just measure the worst variant, you measured, edited, and then re-measured it, and attributed all of the change to the edit. If a genuinely untouched variant would have shown a similar-looking bounce on its own, your attribution is wrong, or at least partly wrong, and you have no way to tell how much of the "improvement" is signal versus reversion unless you have a comparison point.

What that comparison point looks like

The clean version of this check is simple in principle: alongside the variant you changed, keep at least one variant completely untouched and re-run the same eval next week. If the untouched variant "improves" by a similar margin to the one you edited, your fix probably isn't doing what you think it's doing. If the untouched variant stays flat while the edited one jumps, that's a much stronger case for the edit actually working.
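
One way to turn that comparison into a number is a difference-in-differences estimate: the edited variant's change minus the untouched control's change, with a bootstrap interval around it. The sketch below is a minimal version under stated assumptions: each argument is an array of per-example scores from one eval run, examples are resampled independently within each run, and the function name `diff_in_diff` and its parameters are hypothetical, not part of any library.

```python
import numpy as np

def diff_in_diff(edited_before, edited_after, control_before, control_after,
                 n_boot=2000, seed=0):
    """Change in the edited variant minus change in the untouched control.

    Returns (point_estimate, ci_low, ci_high) with a 95% percentile bootstrap.
    """
    arrays = [np.asarray(a, dtype=float) for a in
              (edited_before, edited_after, control_before, control_after)]
    eb, ea, cb, ca = arrays
    point = (ea.mean() - eb.mean()) - (ca.mean() - cb.mean())

    rng = np.random.default_rng(seed)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        # Resample examples with replacement, independently for each run.
        m = [rng.choice(a, size=a.size, replace=True).mean() for a in arrays]
        boots[b] = (m[1] - m[0]) - (m[3] - m[2])
    low, high = np.percentile(boots, [2.5, 97.5])
    return float(point), float(low), float(high)
```

If the interval comfortably excludes zero, the edit moved the score by more than the control's bounce. If it straddles zero, the data can't distinguish the edit from reversion. When all four runs score the same examples, a paired resample (shared indices) would be the tighter choice; the independent version here is the conservative default.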

Most teams don't do this, because "the variant we already know is our worst" doesn't feel worth spending eval budget on twice. That instinct is exactly backwards. The untouched variant is your control group, and the entire point of the exercise is that you don't know in advance how much of next week's number is reversion.

Simulating it

Here's a small simulation that makes the effect concrete. Twenty prompt variants, each with a stable "true" quality score, plus judge-scoring noise added on top each time you measure. Nothing about any variant changes between week 1 and week 2, no edits at all. We just re-measure the same 20 variants with fresh noise, exactly the way a noisy eval set would behave on a second run.

import numpy as np

rng = np.random.default_rng(3)

n_variants = 20
true_scores = rng.normal(loc=0.75, scale=0.05, size=n_variants)
noise_sd = 0.06

week1 = true_scores + rng.normal(0, noise_sd, n_variants)
week2 = true_scores + rng.normal(0, noise_sd, n_variants)  # nothing changed

worst_idx = np.argmin(week1)
print(f"worst variant in week 1: index {worst_idx}, score {week1[worst_idx]:.3f}, true mean {true_scores[worst_idx]:.3f}")
print(f"same variant, week 2, still unchanged: {week2[worst_idx]:.3f}")
print(f"apparent 'improvement' with zero real change: {week2[worst_idx] - week1[worst_idx]:.3f}")

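# repeat the whole experiment many times to see how reliable the effect is;
# each simulated week draws a fresh set of variants (sim_true) so the
# true_scores used in the single example above are left untouched
n_sims = 2000
deltas = []
for _ in range(n_sims):
    sim_true = rng.normal(loc=0.75, scale=0.05, size=n_variants)
    w1 = sim_true + rng.normal(0, noise_sd, n_variants)
    w2 = sim_true + rng.normal(0, noise_sd, n_variants)  # same variants, fresh noise
    idx = np.argmin(w1)  # select the worst performer using week 1 only
    deltas.append(w2[idx] - w1[idx])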

print(f"mean apparent improvement for the worst variant, {n_sims} simulated weeks: {np.mean(deltas):.3f}")
print(f"fraction of simulated weeks the untouched worst variant looked better: {np.mean(np.array(deltas) > 0):.2%}")

On my run, the single example showed the worst variant jumping from 0.604 to 0.788, an apparent 0.183 improvement with literally nothing changed. Across 2,000 simulated weeks, the untouched "worst variant" improved on its second measurement about 88% of the time, by an average margin of roughly 0.09. In this simulation, with these noise and true-score parameters, "we fixed the worst variant and it got better" would have been true, in the sense of the number going up, on the vast majority of weeks, with zero actual change to anything.

The specific numbers here (88%, 0.09) are artifacts of the noise_sd and true_scores parameters I chose, not universal constants. Increase noise_sd relative to the spread of true_scores and the effect gets stronger (noisier judging makes this worse, not better). Shrink it and the effect weakens. If you want a feel for your own eval setup, the parameters worth plugging in are your own judge's test-retest variance (score the same output twice, see how much it moves) and the actual spread of scores across your variants, not the placeholder values above.
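
As a rough sketch of how those two parameters could be estimated, suppose you have two eval runs over the same set of variants with nothing edited in between. The function below uses a simple method-of-moments approach; it assumes score = true quality + independent noise with the same variance in both runs, and the name `estimate_noise_and_spread` is hypothetical.

```python
import numpy as np

def estimate_noise_and_spread(run_a, run_b):
    """Estimate noise_sd, true-score spread, and reliability from two runs.

    run_a, run_b: one score per variant, same order, no edits between runs.
    """
    a = np.asarray(run_a, dtype=float)
    b = np.asarray(run_b, dtype=float)
    # Differencing two runs cancels true quality, leaving 2 * noise variance.
    noise_var = np.var(a - b, ddof=1) / 2.0
    # Covariance between runs keeps only the shared part: true-score variance.
    true_var = max(float(np.cov(a, b)[0, 1]), 0.0)
    total = true_var + noise_var
    reliability = true_var / total if total > 0 else float("nan")
    return float(np.sqrt(noise_var)), float(np.sqrt(true_var)), reliability
```

With only a handful of variants these estimates are themselves noisy, so pooling several weeks of unchanged variants gives a steadier picture than a single pair of runs.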

Why this is easy to miss in practice

A few things conspire to hide this in normal team workflow. First, the "worst variant" framing is compelling on its own: it feels obviously right to go fix whatever scored lowest, so nobody questions the selection process itself. Second, most teams don't keep an explicit untouched control variant running alongside their edits, so there's no comparison point showing what "no change" would have looked like. Third, a single week's bounce is a satisfying, legible story ("we fixed the verbosity issue and the score went up") in a way that "some fraction of this was probably regression to the mean" is not, and legible stories win in a fast-moving team even when they're partly wrong.

None of this means fixes never work. Plenty of prompt edits genuinely help, and this isn't an argument for ignoring your worst-performing variant. It's an argument for having a way to tell how much of the observed change is due to the edit versus due to how the variant got selected in the first place.

Caveats

The simulation assumes independent, symmetric, roughly Gaussian noise around a stable true score, which is a simplification. Real eval noise can be skewed (a judge might systematically overscore short outputs, for instance, which isn't symmetric noise). A variant's "true" quality isn't actually fixed either: model behavior can drift, your eval set's composition can shift week to week, and a genuinely worse variant might be worse for a structural reason that a prompt edit can legitimately fix. This simulation isolates the pure noise-driven case to make the mechanism visible. Real situations are a mix of genuine effect and reversion, and the honest answer is usually "some of both," not "all reversion" or "all real."

It's also worth flagging that the size of this effect depends heavily on how extreme the selection is. If you pick the single worst variant out of 20, you're selecting a more extreme outlier than if you pick the worst out of 3, and the regression effect is correspondingly stronger with more variants and more noise relative to their true spread. A team running this same check with 5 variants and a much larger, less noisy eval set would see a smaller effect than the one in this simulation, not the same one.
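
A quick way to see that dependence is to rerun the no-edit simulation across a few leaderboard sizes and noise levels. This is a sketch with placeholder values (true-score spread 0.05, mean 0.75, as in the simulation above); `mean_bounce` is a hypothetical helper name.

```python
import numpy as np

def mean_bounce(n_variants, noise_sd, true_sd=0.05, n_sims=2000, seed=0):
    """Average week-2 minus week-1 score for the week-1 worst variant, no edits."""
    rng = np.random.default_rng(seed)
    shape = (n_sims, n_variants)
    true_scores = rng.normal(0.75, true_sd, shape)
    week1 = true_scores + rng.normal(0, noise_sd, shape)
    week2 = true_scores + rng.normal(0, noise_sd, shape)  # nothing changed
    worst = week1.argmin(axis=1)   # worst variant in each simulated week
    rows = np.arange(n_sims)
    return float(np.mean(week2[rows, worst] - week1[rows, worst]))

for n in (3, 5, 20):
    for sd in (0.02, 0.06):
        print(f"{n:>2} variants, noise_sd={sd:.2f}: "
              f"mean bounce {mean_bounce(n, sd):.3f}")
```

The bounce should grow along both axes: more variants to pick the worst from, and more noise relative to the true spread.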

Discussion

The broader point isn't specific to prompt engineering. Any process where you select an extreme observation, intervene, and then re-measure without a control is vulnerable to this: performance reviews for the "worst" employee of the quarter, marketing campaigns for the "worst" performing ad variant, sports commentary about a player who "turned it around" after a bad game. LLM eval just happens to be a place where the sample sizes are often small and the measurement noise is often high (a 200-example eval set scored by an LLM judge with real day-to-day variance is a genuinely noisy instrument), which makes the effect larger and easier to demonstrate than it would be with a cleaner measurement setup.

The fix is procedural, not statistical wizardry: keep a control, be honest about your noise floor, and treat a single week's bounce as weak evidence until you've seen it hold up.

FAQ

Is regression to the mean the same thing as the eval set just being too small?
Related but not identical. A small eval set increases the noise in each measurement, which makes regression to the mean more visible and more severe, but the phenomenon exists even with a reasonably sized eval set as long as there's any measurement noise and you're selecting on an extreme score. Small samples amplify it; they don't cause it.

How do I know if my "fix" actually worked, then?
Run the edited variant and at least one untouched control variant through the same fresh eval sample. If the edited variant improves meaningfully more than the untouched control, that's real evidence of an effect. If they improve by similar amounts, the edit likely isn't doing the work you're crediting it for, at least not yet, at least not in a way this comparison can detect.

Does this apply to picking the "best" variant too, not just the worst?
Yes, symmetrically. A variant that looks unusually good one week is disproportionately likely to have gotten lucky that week, and its score will tend to drift down on remeasurement even with no changes. Teams are generally less suspicious of good news than bad news, which makes this direction of the effect even easier to miss.

Can I just increase my eval set size to make this go away?
It reduces the effect (larger samples mean less noise per measurement, so less room for a lucky or unlucky draw), but it doesn't eliminate it. As long as there's any measurement noise and you're selecting based on an extreme score, some regression to the mean will occur. A bigger eval set shrinks the bounce you should expect from noise alone, so smaller improvements become believable; it doesn't remove the need to check.
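
For a sense of scale, assume pass/fail scoring on independently sampled examples (an assumption; graded or judge-averaged scores behave differently). The sampling noise in a variant's score is then the binomial standard error, where p is the true pass rate and n the number of examples:

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p\,(1 - p)}{n}}, \qquad
p = 0.75:\quad n = 200 \Rightarrow \mathrm{SE} \approx 0.031, \qquad
n = 800 \Rightarrow \mathrm{SE} \approx 0.015
```

Quadrupling the eval set only halves this term, and it counts sampling noise alone; judge noise adds on top.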

Is this the same as p-hacking?
No, it's a related but distinct problem. P-hacking involves selectively choosing or reporting analyses to get a significant result. Regression to the mean happens even with a completely honest, single, pre-planned analysis. It's a property of selecting on an extreme observation and re-measuring, not a property of how many analyses you ran or how you reported them.

What's a reasonable noise floor to expect from LLM-judge scoring?
I don't have a single number that generalizes; it depends heavily on the judge model, the rubric's specificity, and the task. The useful exercise is measuring your own: score the same set of outputs twice with your judge (same prompts, same outputs, re-run the judge call) and look at how much individual scores move between runs. That gives you an empirical noise_sd to reason about, rather than assuming a textbook value.

Does averaging over more judge calls per example fix this?
It reduces judge-level noise (averaging 3-5 judge calls per output instead of 1 will tighten your per-example score), which shrinks the noise_sd term in this simulation and therefore shrinks the regression effect. It doesn't address noise coming from the eval set itself being small, or from genuine week-to-week variability in the thing you're measuring, so it's a partial fix, not a complete one.
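
A back-of-the-envelope sketch of why the fix is only partial. It assumes each week scores a fresh, independent sample of examples and that judge noise is independent across calls; `item_sd` (spread of true per-example scores) and `judge_sd` (noise of a single judge call) are hypothetical inputs, and the numbers are placeholders.

```python
import math

def variant_noise_sd(item_sd, judge_sd, n_examples, k_calls):
    """SD of a variant's mean score: n_examples outputs, each judged k_calls times."""
    # Averaging k judge calls shrinks only the judge term, not the item term.
    per_example_var = item_sd**2 + judge_sd**2 / k_calls
    return math.sqrt(per_example_var / n_examples)

for k in (1, 3, 5, 50):
    print(k, round(variant_noise_sd(0.43, 0.20, 200, k), 4))

# Floor that no amount of judge averaging can get below:
print("floor", round(0.43 / math.sqrt(200), 4))
```

When the spread across examples dominates, as in this placeholder setup, extra judge calls move the total only slightly, and the remaining lever is more examples.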

Open question

I'd like a cleaner way to separate "how much of this week's bounce was regression to the mean" from "how much was a real effect of the edit" using a single week's data, without needing a held-out untouched control every time (which is the correct answer methodologically, but isn't always practical when eval budget or compute is tight). In principle you could estimate the expected regression-to-the-mean magnitude from the historical variance and test-retest reliability of your own eval pipeline, then subtract that expectation from the observed change before crediting the rest to the edit. I haven't seen this done rigorously in an LLM eval context, and I'm not sure how sensitive that estimate would be to getting your own noise model wrong. If someone has built this out properly for a real eval pipeline, I'd want to see the math.
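
For concreteness, the naive version of that subtraction might look like the sketch below. It assumes a linear shrinkage model (Gaussian true scores and noise, with a known leaderboard mean and reliability), it is not a validated method, and it gives no uncertainty on the result. The function names and the example numbers are hypothetical.

```python
def expected_regression(observed, grand_mean, reliability):
    """Expected change on re-measurement with no edit, given the first score."""
    return (1.0 - reliability) * (grand_mean - observed)

def adjusted_effect(before, after, grand_mean, reliability):
    """Observed change minus the part expected from regression to the mean alone."""
    return (after - before) - expected_regression(before, grand_mean, reliability)

# Hypothetical case: worst variant went 0.60 -> 0.72, leaderboard mean 0.75.
# Sweep reliability to see how much the answer depends on the noise model.
for rho in (0.3, 0.4, 0.5, 0.6, 0.7):
    effect = adjusted_effect(0.60, 0.72, 0.75, rho)
    print(f"reliability={rho:.1f}: adjusted effect {effect:+.3f}")
```

In this made-up case the credited effect ranges from about 0.015 to 0.075 as the assumed reliability moves from 0.3 to 0.7, which illustrates the sensitivity concern: a misjudged noise model changes the conclusion by a factor of several.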

Top comments (2)

Raju Dandigam • Jul 3

This is exactly the kind of eval mistake that feels sophisticated while still producing fake confidence. Calling out the untouched variant as the control group is the important move, because without that comparison teams are basically rewarding noise and calling it optimization. I also like that you framed small eval sets and judge variance as first-class causes rather than side notes, since those are usually the hidden source of a lot of “wins.” In agent systems, the same problem gets worse once tool-choice variance and retry behavior enter the loop, which is why I like pairing eval dashboards with run-level traces from tools like agent-inspect. Have you found one untouched control enough in practice, or do you think teams need a fuller experimental-design mindset once they have multiple prompts and models moving at once?

Maya Andersson • Jul 22

One untouched control is the minimum honest design, not the sufficient one, and the boundary is exactly where you put it: the moment multiple prompts and models move at once, a single control confounds "the system drifted" with "my change worked". The full experimental-design mindset pays for itself in that regime, but the marginal-value curve flattens fast. In practice the 80 percent version is: one frozen control per moving axis, comparisons pre-registered before the runs, and run-to-run variance measured once so you know your noise floor. Your point about tool-choice variance in agent systems is well taken. It fattens the run-level noise, which shrinks your effective sample size again, and trace-level replication is the only honest way to see it.