temperature=0 didn't make our LLM evals reproducible

#machinelearning #llm #mlops #infrastructure

TL;DR: We set temperature=0 and seed=42 and still got different eval scores on the same 800-prompt suite across runs. The cause wasn't the sampler. It was batch-dependent floating point in the inference engine plus silent provider routing. We chased it for a week. Here's what we found and the three things that actually fixed it.

I lead the eval team at Nexus Labs. We fine-tune small models for enterprise agent automation, and our whole release process hangs on one number: pass rate on an 800-prompt domain suite. Green means ship.

Two weeks ago the same model, same suite, same code gave us 81.4% on Monday and 79.6% on Wednesday. Nobody touched the weights. That's a 1.8-point gap, roughly 14 prompts, on a frozen artifact. If your eval moves more than your model improvements, you can't ship on it.

temperature=0 is not determinism

First assumption everyone makes: greedy decoding is deterministic. Set temperature to 0, you always pick the argmax token, done.

It isn't. temperature=0 removes sampling randomness. It does nothing about the fact that the logits themselves change depending on what else is in the batch.

vLLM (we run 0.6.x) uses continuous batching. Your prompt gets grouped with whatever other requests are in flight. Matrix-multiply reductions over a batch of 4 versus a batch of 32 add up their partial sums in a different order, and floating-point addition isn't associative, so the same inputs produce slightly different logits. The difference shows up around the 5th decimal place. Usually harmless. But when two candidate tokens are within ~1e-4 of each other, the argmax flips. One flipped token early in a tool-call response cascades into a different JSON structure, which fails our parser, which drops a point.

So our "deterministic" eval was deterministic only for a fixed batch composition, not across batch compositions. Run the suite when the cluster is busy, you get a different batch shape, you get a different score.
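A minimal sketch of the mechanism, not vLLM's actual kernels. It emulates float32 accumulation by rounding after every add (the `f32` helper is an assumption for illustration), sums the same five terms in two orders, and lets greedy decoding choose between that logit and a rival sitting one float32 step above 1.0. The token names and values are made up.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(terms):
    """Sum left to right, rounding to float32 after every add."""
    acc = 0.0
    for t in terms:
        acc = f32(acc + t)
    return acc

def greedy_pick(logits):
    """Greedy decoding: return the token with the largest logit."""
    return max(logits, key=logits.get)

EPS = 2.0 ** -24                    # half a float32 step at 1.0
terms = [1.0, EPS, EPS, EPS, EPS]   # hypothetical partial sums for one logit
rival = 1.0 + 2.0 ** -23            # hypothetical competing logit

large_first = accumulate(terms)        # each EPS is rounded away against 1.0
small_first = accumulate(terms[::-1])  # the EPS terms add up before meeting 1.0

pick_a = greedy_pick({"{": large_first, "[": rival})
pick_b = greedy_pick({"{": small_first, "[": rival})
print(large_first, small_first, pick_a, pick_b)
```

Same terms, same arithmetic, different order: the first run picks `[` and the second picks `{`. Real kernels change the order by changing how the reduction is split as the batch shape changes, not by reversing a list, but the effect on a near-tie is the same.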

The second source: we didn't know which model answered

The bigger embarrassment. Our eval harness pointed at an internal endpoint that load-balanced across two provider deployments during a migration. About 6% of eval requests were silently hitting a different build of the serving stack with a different quantization. We had no per-request record of which backend served which prompt.

You can't debug a number you can't attribute. The fix for this half was operational, not numerical: route eval traffic through a gateway that logs the exact provider and model per request. We already run Bifrost (https://github.com/maximhq/bifrost) in front of our providers for failover, and its per-request logging gave us the backend attribution we'd been missing. LiteLLM does the equivalent; the point is you need the provenance, not a specific logo.

Once every eval response carried a backend tag, the 6% lit up immediately.
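Roughly what that check looks like once the tag exists. This is a sketch with made-up records; the field names (`backend`, `passed`) and deployment names are assumptions for illustration, not the schema of Bifrost or LiteLLM logs.

```python
from collections import defaultdict

def pass_rate_by_backend(records):
    """Group eval results by backend tag and return {backend: (count, pass_rate)}."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["backend"]].append(r["passed"])
    return {b: (len(v), sum(v) / len(v)) for b, v in buckets.items()}

# Hypothetical eval records: 94 requests on one deployment, 6 on another.
records = (
    [{"prompt_id": i, "backend": "deploy-a", "passed": i % 5 != 0} for i in range(94)]
    + [{"prompt_id": 94 + i, "backend": "deploy-b", "passed": i % 2 == 0} for i in range(6)]
)

summary = pass_rate_by_backend(records)
for backend, (n, rate) in sorted(summary.items()):
    print(f"{backend}: {n} requests, pass rate {rate:.1%}")
```

A second backend showing up at all in an eval run is the signal. The pass-rate split just tells you how much it cost.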

What moved the number

We measured each suspected cause by running the 800-prompt suite 20 times and looking at how far the score moved.

Source	Score spread over 20 runs	Fixed by
Batch-dependent FP (continuous batching)	±1.8 pts	Pin eval batch size to 1
Silent provider routing	±2.1 pts	Per-request backend logging
Parser tolerance on whitespace	±0.9 pts	Normalize before compare
Unseeded prompt shuffle in harness	0 pts (red herring)	n/a

The prompt-shuffle thing was where we wasted two days. Order doesn't change per-prompt correctness. We knew that. We checked it anyway because it was easy to check, which is its own lesson about how panic allocates engineering time.
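For the "normalize before compare" row in the table, a sketch of the idea, assuming outputs are either JSON tool calls or plain text. The function names are hypothetical and this is not our harness code.

```python
import json

def normalize(output: str) -> str:
    """Canonicalize a model output so formatting-only differences compare equal."""
    text = output.strip()
    try:
        # Valid JSON: re-serialize with fixed separators, dropping layout whitespace.
        return json.dumps(json.loads(text), separators=(",", ":"))
    except json.JSONDecodeError:
        # Not JSON: collapse runs of whitespace to single spaces.
        return " ".join(text.split())

def matches(expected: str, actual: str) -> bool:
    """Compare two outputs after normalization."""
    return normalize(expected) == normalize(actual)

same = matches('{"tool": "search", "args": {"q": "x"}}',
               '{ "tool":"search",\n  "args": {"q":"x"} }\n')
different = matches('{"tool": "search"}', '{"tool": "lookup"}')
print(same, different)
```

Whitespace inside JSON string values is left alone, so `"a  b"` and `"a b"` still count as different answers.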

The fix

Three changes. None of them clever.

First, eval runs go through a dedicated config with batch size pinned. Slower, but reductions happen over a fixed shape every time:

# eval-serving.yaml
engine:
  max_num_seqs: 1        # no co-batching during eval
  enforce_eager: true    # disable CUDA graph capture variance
sampling:
  temperature: 0.0
  seed: 42
  top_p: 1.0
logging:
  log_backend_id: true   # which deployment served this request

enforce_eager: true matters more than it looks. CUDA graph capture in vLLM can introduce its own kernel-selection differences across runs. Eager mode is slower, but it removed another ±0.4 pts that we hadn't isolated separately.

Second, every eval response is stored with the backend identifier and the raw logprobs of the top 2 tokens at each position. When a score moves now, we diff the logprob traces and find the exact prompt and position where decoding diverged. Takes minutes, not days.
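The diff itself is simple. A sketch, assuming each trace is a list with one entry per position, and each entry holds the top two `(token, logprob)` pairs sorted best first. That layout is an assumption for illustration, not a format any particular server emits.

```python
def first_divergence(trace_a, trace_b):
    """Return the first position where the chosen (top-1) token differs, or None."""
    for pos, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a[0][0] != b[0][0]:
            return {
                "position": pos,
                "token_a": a[0][0],
                "token_b": b[0][0],
                # Gap between run A's top two candidates. A tiny margin points
                # at a near-tie that numeric noise could flip.
                "margin_a": a[0][1] - a[1][1],
            }
    if len(trace_a) != len(trace_b):
        # Identical prefix, but one run kept generating.
        return {"position": min(len(trace_a), len(trace_b)),
                "token_a": None, "token_b": None, "margin_a": None}
    return None

# Hypothetical traces for one prompt from two runs.
run_a = [[("{", -0.01), ("[", -4.7)], [('"name"', -0.6931), ('"tool"', -0.6932)]]
run_b = [[("{", -0.01), ("[", -4.7)], [('"tool"', -0.6931), ('"name"', -0.6932)]]

div = first_divergence(run_a, run_b)
print(div)
```

A margin around 1e-4 at the divergence point says numeric noise. A large margin says something else changed, such as the backend or the weights.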

Third, we report eval scores as a range over 5 runs, not a single number. If the range is wider than 1 point, the result is "inconclusive, rerun," not "regression." We stopped pretending a single float is ground truth.
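The reporting rule as a sketch. The 1-point threshold comes from the rule above; the function name, the "usable" label, and the example scores are hypothetical.

```python
def summarize_runs(scores, max_range_pts=1.0):
    """Report an eval as a range and flag it when the runs disagree too much."""
    low, high = min(scores), max(scores)
    spread = high - low
    verdict = "inconclusive, rerun" if spread > max_range_pts else "usable"
    return {"low": low, "high": high, "range": round(spread, 2), "verdict": verdict}

# Hypothetical pass rates (in points) from two sets of 5 runs.
noisy = summarize_runs([81.4, 80.9, 79.6, 81.0, 80.3])
tight = summarize_runs([81.1, 81.4, 81.0, 81.3, 81.1])
print(noisy)
print(tight)
```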

Trade-offs and Limitations

Batch size 1 for eval is expensive. Our 800-prompt suite went from 4 minutes to 19. We accept that because eval correctness is worth more than eval speed, but if you run evals on every commit, 19 minutes is a real tax. We gate it: fast batched eval on PRs for a rough signal, pinned eval on release candidates only.

Pinning enforce_eager and max_num_seqs: 1 means your eval environment no longer matches production serving conditions. You're measuring the model, not the production system. That's the right call for catching regressions in weights, the wrong call if you're trying to reproduce a user-reported production bug, where batch effects are part of the story.

And storing top-2 logprobs per position roughly tripled our eval artifact storage. Cheap at our 800-prompt scale. Reconsider it at 100k.

None of this makes the eval "correct." It makes it reproducible. Those are different problems. A reproducible eval that measures the wrong thing is still wrong, just consistently. The contents of the suite are still the hard part.