The stale eval fixture that passed a broken model

#testing #cicd #python #llmops

The stale eval fixture that passed a broken model

A regression shipped green last month. The eval suite ran in CI, scored 0.94, the gate passed, we merged. Two days later support flagged that the summariser had started dropping the final line of multi-part answers. The eval should have caught it. The eval had not actually run on the new behaviour. It scored a cached result from three commits earlier, and the cache key was wrong.

This is the eval-infra bug nobody warns you about, because it only shows up after you optimise for speed. The eval itself was fine. The caching around it lied.

Why the cache existed

Our eval suite makes model calls, and model calls are slow and cost money. On a 600-case suite with an LLM-judge pass, a full run was about nine minutes and a few dollars. Running that on every push, including doc-only commits, was wasteful, so we cached: if nothing that affects a case's result changed, reuse the previous score.

That is the right instinct. The bug was in the definition of "nothing that affects the result changed."

The cache key that was missing an input

Our key was a hash of two things: the test input (the prompt variables for that case) and the prompt template. If both matched a prior run, we served the cached score.

Here is what the key did not include: the model snapshot. We pinned the model by an alias in config, and when we bumped that alias to a new dated snapshot, the prompt template and the test inputs were byte-for-byte identical. Same key. The cache served scores generated by the old model for a suite running against the new one. The new model had the regression. The cache had the old model's clean scores. Green.

The rule a cache key has to obey is simple to say and easy to get wrong: the key must include every input that can change the output. For an eval case that is at least the test input, the prompt template, the model identity (the dated snapshot, not the alias), the judge model identity if you grade with one, and the eval config that controls scoring. Miss any one and a change to that input silently reuses a stale result.

The fix, as a key function

This is the part you can lift. The cache key is a hash over the full tuple of result-affecting inputs, and the model identity is resolved to its concrete snapshot before hashing, not left as the floating alias.

import hashlib, json

def eval_cache_key(case, prompt_template, model_snapshot, judge_snapshot, eval_config):
    # model_snapshot / judge_snapshot are the resolved dated ids
    # (e.g. "gpt-4o-2024-08-06"), NEVER the moving alias ("gpt-4o").
    payload = {
        "input": case["vars"],
        "prompt": prompt_template,
        "model": model_snapshot,
        "judge": judge_snapshot,
        "eval_config": eval_config,   # thresholds, rubric, metric set
        "schema": 2,                  # bump to invalidate everything on purpose
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

Two things that matter more than they look:

sort_keys=True so the hash is stable regardless of dict ordering. Without it the "same" inputs produce different keys and you cache nothing, which is the opposite failure but still a failure.
The schema integer. When you change the cache logic itself, or you just want to force a clean rerun, bump it. It is a manual kill switch for the whole cache that does not require deleting files.

And resolve the alias to the snapshot at the top of the run, once:

# Wrong: model id is the alias, so a provider-side snapshot bump is invisible.
model = "gpt-4o"

# Right: resolve to the concrete dated snapshot and key on THAT.
model_snapshot = resolve_snapshot("gpt-4o")  # -> "gpt-4o-2024-08-06"

Fail the cache closed, not open

The second half of the fix is what happens on a cache miss or an ambiguous state. Ours failed open: if anything about the cache lookup threw, we treated it as "no entry, but also do not block," and in one code path that quietly meant "pass." A cache is a performance optimisation. It must never be able to produce a green that a real run would not. On any miss, any error, any version mismatch, the correct behaviour is run the eval for real. Slower is the acceptable failure. Green-by-accident is not.

We also added a cheap guard: the cache stores which model snapshot produced each score, and the runner asserts that the stored snapshot matches the current one before trusting any cached entry. If they differ, the entry is ignored and the case re-runs. That single assertion would have caught the original bug on its own.

What it cost to find

The embarrassing number: the regression was live for nine days. Not because it was subtle in production, support caught it fast, but because when we went to the eval to confirm, the eval still said 0.94, so we spent two of those days looking everywhere except the cache. A gate that lies costs you more than a gate you do not have, because you trust it while it points you the wrong way.

What I'd check first

When an eval passes something production then breaks, before you touch the model or the rubric:

Confirm the eval actually executed on this commit's model. Look for a fresh model call in the run logs, not a cache hit. If every case is a cache hit, your suite did not test anything.
Diff the cache key inputs against what can change the output. If the model snapshot, judge, or eval config is not in the key, that is your stale-green source. Add it and bump the schema.
Check the miss path. Force a cache miss and confirm it runs the eval for real, not that it shrugs and passes. A cache that can fail open is a gate that can ship anything.