We shipped a model on a 2-point eval win. It was noise.

#machinelearning #mlops #llm

TL;DR: We promoted a fine-tuned 7B because it beat the incumbent by 2.1 points on our internal eval. Two weeks later we added bootstrap confidence intervals to the harness and found the gain sat well inside the noise band. The model was not better. We just had no way to tell.

The win that wasn't

Our eval suite at Nexus Labs is 840 prompts. Enterprise agent tasks. Each one is scored pass/fail by an exact-match check against a known-good structured output, so every result is a 1 or a 0.

The fine-tuned candidate scored 73.4%. The incumbent scored 71.3%. A 2.1-point lift on a suite that size felt real, so we shipped it to staging and started the rollout paperwork.

It was not real. Or rather, we had zero evidence either way, which is worse, because we acted like we did.

Why a single number lies

An eval run is a sample, not a measurement. Run the same 840 prompts against the same model at any temperature above 0 and you get a different number. Even at temperature 0, batching order and kernel nondeterminism in vLLM can move it.

The math is not subtle. For a pass rate around 0.73 over n=840, the binomial standard error is sqrt(p(1-p)/n), which is about 1.53 points. The standard error of the difference between two such rates, treated as independent runs, is sqrt(2) times that, roughly 2.2 points.

So our 2.1-point gap was about one standard error wide. A coin flip dressed up as a result.
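The arithmetic is easy to check by hand. Here is a minimal sketch, assuming the two runs are treated as independent binomial samples (so it ignores the pairing used later in the post). The function and variable names are illustrative, not from our harness.

```python
import math

n = 840                      # prompts in the suite
p_old, p_new = 0.713, 0.734  # observed pass rates of the two runs

def binomial_se(p, n):
    """Standard error of a pass rate p measured on n pass/fail prompts."""
    return math.sqrt(p * (1 - p) / n)

se_old = binomial_se(p_old, n)
se_new = binomial_se(p_new, n)

# Assumption: the runs are independent, so their variances add.
se_diff = math.sqrt(se_old**2 + se_new**2)

gap = p_new - p_old
z = gap / se_diff  # width of the gap, measured in standard errors

print(f"SE(old)={se_old:.4f}  SE(new)={se_new:.4f}")
print(f"SE(diff)={se_diff:.4f}  gap={gap:.3f}  z={z:.2f}")
```

The z value lands just under 1, far short of the 1.96 needed for a two-sided test at the 5% level.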

Bootstrap instead of hand-waving

The fix is cheap. We resample the per-prompt results and look at the distribution of the difference. Because both models ran the same prompts, we pair them, which cuts the variance compared to treating the two runs as independent.

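```python
import numpy as np

# Per-prompt correctness (1 = pass, 0 = fail), aligned by prompt id.
old = np.load("old_correct.npy")   # shape (840,)
new = np.load("new_correct.npy")   # shape (840,)
assert old.shape == new.shape, "results must be aligned prompt-for-prompt"

def paired_bootstrap(a, b, iters=10000, seed=0):
    """Bootstrap the difference in pass rate (b minus a) over prompts."""
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(iters)
    for i in range(iters):
        # Resample prompt indices with replacement. Using the same indices
        # for both models is what keeps the comparison paired.
        idx = rng.integers(0, n, n)
        diffs[i] = b[idx].mean() - a[idx].mean()
    # Percentile interval: the middle 95% of the resampled deltas.
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), lo, hi

mean, lo, hi = paired_bootstrap(old, new)
print(f"delta={mean:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
# delta=0.021  95% CI=[-0.004, 0.046]
```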

The 95% interval runs from -0.4 points to +4.6 points. It crosses zero. We could not rule out that the new model was slightly worse.

What the numbers actually said

Metric	Incumbent 7B	Fine-tuned 7B
Pass rate	71.3%	73.4%
Paired delta	baseline	+2.1 pts
95% CI on delta	baseline	[-0.4, +4.6] pts
Significant at p<0.05?	baseline	no

Reading the table is the whole point. The headline delta is positive. The interval that contains it includes outcomes where we regressed. You do not ship on that.

What changed in our process

Three rules now gate any model promotion on my team.

First, no promotion without a paired bootstrap CI that excludes zero, or a McNemar test with p < 0.05. The raw delta is no longer allowed on its own in the PR description.
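For readers who have not used McNemar's test: it looks only at the prompts where the two models disagree and asks whether the split between "only old passed" and "only new passed" is lopsided enough to rule out a coin flip. The sketch below is the exact (binomial) form, written with the standard library only. The function name and toy data are illustrative assumptions, not our harness code.

```python
from math import comb

def mcnemar_exact(old, new):
    """Two-sided exact McNemar test on paired 0/1 results.

    Returns (b, c, p_value): b counts prompts only the old model passed,
    c counts prompts only the new model passed.
    """
    if len(old) != len(new):
        raise ValueError("results must be aligned by prompt")
    b = sum(1 for o, n_ in zip(old, new) if o == 1 and n_ == 0)
    c = sum(1 for o, n_ in zip(old, new) if o == 0 and n_ == 1)
    m = b + c
    if m == 0:
        return b, c, 1.0  # the models agree on every prompt
    # Under the null, each discordant prompt is a fair coin: Binomial(m, 0.5).
    k = min(b, c)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2**m
    return b, c, min(1.0, 2 * tail)

# Toy data, eight prompts: one regression, three fixes.
old_toy = [1, 1, 0, 0, 1, 0, 1, 1]
new_toy = [1, 0, 1, 1, 1, 1, 1, 1]
b, c, p = mcnemar_exact(old_toy, new_toy)
print(f"only-old={b}  only-new={c}  p={p:.3f}")
```

On the toy data the new model fixes three prompts and breaks one, and the p-value is 0.625: nowhere near enough disagreement to call a winner.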

Second, every candidate runs the eval three times. If the three pass rates spread by more than a point at temperature 0, the harness is nondeterministic and we fix that before trusting any comparison. We caught a vLLM max_tokens truncation bug this way that was silently failing 11 long-output prompts on some runs.
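A check like this can be sketched in a few lines. The version below assumes each run is stored as a dict from prompt id to 1/0. That layout and the function name are hypothetical. The one-point threshold comes from the rule above.

```python
MAX_SPREAD_PTS = 1.0  # threshold from the rule above, in percentage points

def pass_rate_spread(runs):
    """Return (spread, flaky_ids) for repeated runs of one model.

    runs: list of dicts mapping prompt id -> 1/0, one dict per run.
    spread: max minus min pass rate across runs, in percentage points.
    flaky_ids: prompts whose result is not identical in every run.
    """
    ids = sorted(runs[0])
    if any(sorted(r) != ids for r in runs):
        raise ValueError("runs cover different prompt ids")
    rates = [100.0 * sum(r.values()) / len(ids) for r in runs]
    flaky = [pid for pid in ids if len({r[pid] for r in runs}) > 1]
    return max(rates) - min(rates), flaky

# Toy example: three runs over four prompts, one prompt flips.
toy_runs = [
    {"p1": 1, "p2": 0, "p3": 1, "p4": 1},
    {"p1": 1, "p2": 1, "p3": 1, "p4": 1},
    {"p1": 1, "p2": 0, "p3": 1, "p4": 1},
]
spread, flaky = pass_rate_spread(toy_runs)
if spread > MAX_SPREAD_PTS:
    print(f"nondeterministic harness: spread={spread:.1f} pts, flaky={flaky}")
```

Listing the flaky prompt ids, not just the spread, is what makes a bug like the truncation one findable: the same prompts keep showing up.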

Third, when we compare a self-hosted candidate against a hosted reference like gpt-4o-mini, we route both through one gateway so the request shape, retries, and timeouts are identical. We use Bifrost (https://github.com/maximhq/bifrost) for that, since it exposes every provider behind one OpenAI-compatible endpoint and the eval code stops caring who serves the tokens. Same harness, different backend. That removes a confound I used to ignore.

The cost of all this is one extra function and two extra eval runs per candidate, roughly 3x the eval compute. Against the cost of shipping a regression to an enterprise customer, that is nothing.

The deeper problem

840 prompts sounds like a lot. For detecting a 5-point difference, it is fine. For detecting a 2-point difference at 95% confidence, you need closer to 3,000 prompts. The required sample grows with the inverse square of the effect, so resolving 1 point takes about four times that, on the order of 12,000. Most internal evals are too small to resolve the differences people argue about in standups.

So we also report the minimum detectable effect for our suite. Right now ours is about 4.5 points. Anything smaller, we say out loud that we cannot measure it, and we either grow the suite or stop pretending the comparison means something.
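One way to put a number on this is the textbook two-proportion normal approximation. The sketch below assumes unpaired runs, a two-sided test, and a common pass rate p. All three are simplifications, and the function names are illustrative. Its outputs depend heavily on the power you ask for, so they need not match the figures quoted above.

```python
from math import ceil, sqrt
from statistics import NormalDist

def _z_total(alpha, power):
    """Sum of the critical value for the test and the quantile for power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

def min_detectable_effect(n, p=0.73, alpha=0.05, power=0.8):
    """Smallest true pass-rate difference detectable with n prompts per model."""
    return _z_total(alpha, power) * sqrt(2 * p * (1 - p) / n)

def prompts_needed(delta, p=0.73, alpha=0.05, power=0.8):
    """Prompts per model needed to detect a true difference of delta."""
    return ceil(2 * p * (1 - p) * (_z_total(alpha, power) / delta) ** 2)

for power in (0.5, 0.8):
    mde = min_detectable_effect(840, power=power)
    print(f"power={power}: MDE at n=840 is {100 * mde:.1f} pts, "
          f"2 pts needs {prompts_needed(0.02, power=power)} prompts, "
          f"1 pt needs {prompts_needed(0.01, power=power)} prompts")
```

Setting power to 0.5 asks only "would the observed gap clear the significance bar half the time"; the conventional 0.8 is stricter and pushes every number up. A paired analysis can get by with fewer prompts when the two models agree on most of them, so treat this as a conservative bound.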

Trade-offs and Limitations

Bootstrap CIs assume your prompts are a representative sample of production. They are usually not. A tight interval on a biased suite is confidently wrong, and no amount of resampling fixes the sample.

The paired approach needs aligned per-prompt results, so you have to log at the prompt level, not the aggregate. That is more storage and more plumbing.

And significance is not importance. A real 0.3-point gain can be statistically solid and operationally meaningless. The test tells you the difference exists, not that you should care.

Top comments (1)

Max Quimby • Jun 6

Adding the paired bootstrap is exactly right, and the pairing detail (same prompts → lower variance) is the part most people skip. I'd push it one step further, though: your bootstrap resamples prompts, which captures prompt-selection variance, but at temperature > 0 each prompt's 1/0 is itself a random draw. Run the same prompt five times and you might get three passes. So a single generation per prompt still understates the true interval — we ended up sampling each prompt k times and bootstrapping over (prompt, sample) pairs to get an honest band. With exact-match scoring that mattered more than I expected.

The other silent inflator is selection. If that 7B was the best of, say, six fine-tuning runs, you're implicitly running six comparisons and keeping the max — so a 2-point "win" is even more likely to be noise than the one-vs-one math suggests. Was this candidate picked out of a batch, or a clean head-to-head? And after adding CIs, did you change the promotion rule to "CI excludes zero," or set a minimum effect size you actually care about? "Beats baseline" and "beats baseline by enough to matter" are different gates.