Marcus Chen

We shipped a model on a 2-point eval win. It was noise.

TL;DR: We promoted a fine-tuned 7B because it beat the incumbent by 2.1 points on our internal eval. Two weeks later we added bootstrap confidence intervals to the harness and found the gain sat well inside the noise band. The model was not better. We just had no way to tell.

The win that wasn't

Our eval suite at Nexus Labs is 840 prompts. Enterprise agent tasks. Each one is scored pass/fail by an exact-match check against a known-good structured output, so every result is a 1 or a 0.

The fine-tuned candidate scored 73.4%. The incumbent scored 71.3%. A 2.1-point lift on a suite that size felt real, so we shipped it to staging and started the rollout paperwork.

It was not real. Or rather, we had zero evidence either way, which is worse, because we acted like we did.

Why a single number lies

An eval run is a sample, not a measurement. Run the same 840 prompts against the same model at any sampling temperature above 0 and you get a different number each time. Even at temperature 0, batching order and kernel nondeterminism in vLLM can shift it.

The math is not subtle. For a pass rate around 0.73 over n=840, the binomial standard error is sqrt(p(1-p)/n), which is about 1.53 points. If you treat the two runs as independent, the standard error of the difference between two such rates is sqrt(2) times that, roughly 2.2 points.

So our 2.1-point gap was about one standard error wide. A coin flip dressed up as a result.
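The arithmetic above can be checked in a few lines. This is a minimal sketch using only the pass rates and suite size quoted in this post; the helper name `binomial_se` is made up for illustration, and the difference is computed as if the two runs were independent, which ignores the pairing discussed later.

```python
import math

def binomial_se(p, n):
    """Standard error of a pass rate p measured over n pass/fail prompts."""
    return math.sqrt(p * (1 - p) / n)

n = 840
p_old, p_new = 0.713, 0.734

se_old = binomial_se(p_old, n)
se_new = binomial_se(p_new, n)

# Unpaired approximation: variances of independent estimates add.
se_diff = math.sqrt(se_old ** 2 + se_new ** 2)

# How many standard errors wide is the observed gap?
z = (p_new - p_old) / se_diff

print(f"SE old={se_old * 100:.2f} pts  SE new={se_new * 100:.2f} pts")
print(f"SE diff={se_diff * 100:.2f} pts  gap={z:.2f} standard errors")
# SE old=1.56 pts  SE new=1.52 pts
# SE diff=2.18 pts  gap=0.96 standard errors
```

A gap of less than one standard error is nowhere near the roughly two standard errors a conventional 95% test asks for.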

Bootstrap instead of hand-waving

The fix is cheap. We resample the per-prompt results and look at the distribution of the difference. Because both models ran the same prompts, we pair them, which cuts the variance compared to treating the two runs as independent.

import numpy as np

# per-prompt correctness, 1/0, aligned by prompt id
old = np.load("old_correct.npy")   # shape (840,)
new = np.load("new_correct.npy")

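def paired_bootstrap(a, b, iters=10000, seed=0):
    """Bootstrap the mean difference b - a, resampling prompts with replacement."""
    assert a.shape == b.shape, "results must be aligned by prompt id"
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(iters)
    for i in range(iters):
        # one index draw shared by both models keeps each prompt's pair together
        idx = rng.integers(0, n, n)
        diffs[i] = b[idx].mean() - a[idx].mean()
    # percentile interval: the middle 95% of the resampled deltas
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), lo, hi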

mean, lo, hi = paired_bootstrap(old, new)
print(f"delta={mean:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
# delta=0.021  95% CI=[-0.004, 0.046]

The 95% interval runs from -0.4 points to +4.6 points. It crosses zero. We could not rule out that the new model was slightly worse.

What the numbers actually said

| Metric | Incumbent 7B | Fine-tuned 7B |
| --- | --- | --- |
| Pass rate | 71.3% | 73.4% |
| Paired delta | baseline | +2.1 pts |
| 95% CI on delta | baseline | [-0.4, +4.6] pts |
| Significant at p<0.05? | baseline | no |

Reading the table is the whole point. The headline delta is positive. The interval that contains it includes outcomes where we regressed. You do not ship on that.

What changed in our process

Three rules now gate any model promotion on my team.

First, no promotion without a paired bootstrap CI that excludes zero, or a McNemar test under p<0.05. The raw delta is not allowed in the PR description on its own anymore.

Second, every candidate runs the eval three times. If the three pass rates spread by more than a point at temperature 0, the harness is nondeterministic and we fix that before trusting any comparison. We caught a vLLM max_tokens truncation bug this way that was silently failing 11 long-output prompts on some runs.

Third, when we compare a self-hosted candidate against a hosted reference like gpt-4o-mini, we route both through one gateway so the request shape, retries, and timeouts are identical. We use Bifrost (https://github.com/maximhq/bifrost) for that, since it exposes every provider behind one OpenAI-compatible endpoint and the eval code stops caring who serves the tokens. Same harness, different backend. That removes a confound I used to ignore.
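For the first rule, the McNemar alternative can be sketched with the standard library alone. This is an illustrative exact (binomial) version, not the code from our harness; the function name `mcnemar_exact` and the toy result lists are hypothetical. The test looks only at discordant prompts, the ones where exactly one model passed, and asks whether the split between them is plausible under a fair coin.

```python
from math import comb

def mcnemar_exact(old, new):
    """Exact two-sided McNemar test on paired 0/1 results.

    Returns (only_new_passed, only_old_passed, p_value).
    """
    only_new = sum(1 for o, n in zip(old, new) if not o and n)
    only_old = sum(1 for o, n in zip(old, new) if o and not n)
    discordant = only_new + only_old
    if discordant == 0:
        # The models agree on every prompt: no evidence of a difference.
        return only_new, only_old, 1.0
    k = min(only_new, only_old)
    # P(X <= k) for X ~ Binomial(discordant, 0.5), doubled for two sides.
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return only_new, only_old, min(1.0, 2 * tail)

# Toy data, aligned by prompt id (hypothetical, not our eval results).
old = [1, 1, 0, 0, 1, 0, 1, 1]
new = [1, 0, 1, 1, 1, 1, 1, 0]
print(mcnemar_exact(old, new))
# (3, 2, 1.0)
```

Prompts that both models pass or both fail carry no information about which model is better, which is why the test drops them.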

The cost of all this is one extra function and two extra eval runs per candidate, so roughly 3x the eval compute. Against the cost of shipping a regression to an enterprise customer, that is nothing.
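The three-run check in the second rule is easy to automate. The sketch below is an assumption about how such a check could look, not our harness code; `run_spread`, `unstable_prompts`, `harness_is_stable`, and the toy runs are hypothetical names and data. Listing the prompts whose result flips between runs is what points you at bugs like the truncation one.

```python
def pass_rate(results):
    """Fraction of prompts passed in one run of 0/1 results."""
    return sum(results) / len(results)

def run_spread(runs):
    """Largest gap between pass rates across repeated runs."""
    rates = [pass_rate(r) for r in runs]
    return max(rates) - min(rates)

def unstable_prompts(runs):
    """Indices of prompts whose pass/fail result differs between runs."""
    return [i for i, outcomes in enumerate(zip(*runs)) if len(set(outcomes)) > 1]

def harness_is_stable(runs, max_spread=0.01):
    """True if repeated runs agree to within max_spread (one point by default)."""
    return run_spread(runs) <= max_spread

# Three toy runs of the same model over ten prompts (hypothetical data).
runs = [
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
]
print(harness_is_stable(runs), unstable_prompts(runs))
# False [1]
```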

The deeper problem

840 prompts sounds like a lot. For detecting a 5-point difference, it is fine. For detecting a 2-point difference at 95% confidence, you need closer to 3,000 prompts, and for 1 point you need roughly four times that, because the required sample size scales with the inverse square of the difference. Most internal evals are too small to resolve the differences people argue about in standups.

So we also report the minimum detectable effect for our suite. Right now ours is about 4.5 points. Anything smaller, we say out loud that we cannot measure it, and we either grow the suite or stop pretending the comparison means something.
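One common way to get a minimum detectable effect is the normal-approximation power formula. The sketch below assumes a two-sided test at alpha=0.05 with 80% power, and it takes the per-prompt standard deviation of the paired difference (for example `np.std(new - old)`) as an input. The 80% power target, the function names, and the `sd_diff` value are all assumptions for illustration, so the output will not necessarily match the figures quoted in this post.

```python
import math
from statistics import NormalDist

def _z_total(alpha, power):
    """Sum of the critical value and the power quantile of a standard normal."""
    norm = NormalDist()
    return norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)

def min_detectable_effect(sd_diff, n, alpha=0.05, power=0.80):
    """Smallest true delta a suite of n paired prompts would usually detect."""
    return _z_total(alpha, power) * sd_diff / math.sqrt(n)

def required_prompts(sd_diff, delta, alpha=0.05, power=0.80):
    """Number of paired prompts needed to detect a true delta of this size."""
    return math.ceil((_z_total(alpha, power) * sd_diff / delta) ** 2)

# Hypothetical per-prompt standard deviation of (new - old); measure your own.
sd_diff = 0.40

mde = min_detectable_effect(sd_diff, n=840)
n_for_2pts = required_prompts(sd_diff, delta=0.02)
n_for_1pt = required_prompts(sd_diff, delta=0.01)
print(f"MDE at n=840: {mde * 100:.1f} pts")
print(f"prompts for 2 pts: {n_for_2pts}, for 1 pt: {n_for_1pt}")
```

Whatever inputs you use, halving the target delta roughly quadruples the number of prompts, which is why small differences get expensive so quickly.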

Trade-offs and Limitations

Bootstrap CIs assume your prompts are a representative sample of production. They are usually not. A tight interval on a biased suite is confidently wrong, and no amount of resampling fixes the sample.

The paired approach needs aligned per-prompt results, so you have to log at the prompt level, not the aggregate. That is more storage and more plumbing.

And significance is not importance. A real 0.3-point gain can be statistically solid and operationally meaningless. The test tells you the difference exists, not that you should care.
