DEV Community: Ethan Walker

Retry-until-green turns a 70 percent eval gate into a 34 percent one

Ethan Walker — Fri, 24 Jul 2026 06:19:13 +0000

TL;DR. We let developers re-run a failed eval gate, up to three total runs, and merge on any green. Felt harmless. It is not. If a judge-scored gate catches a real regression with probability 0.7 per run, merge-on-any-green across three runs drops the catch rate to 0.7^3 = 0.343, because the regression ships the moment any single run passes. The same policy pushed our false-block rate near zero, which is why everyone loved it and nobody measured the other side. A retry is a second sample from the same noisy scorer, so aggregate the samples instead of letting the luckiest one decide.

The button

Our eval gate runs a judge-scored suite on every PR that touches a prompt or a retrieval config. I have written before about judge scores drifting between runs on unchanged output; the short version is same diff, 0.83 Friday, 0.78 Monday.

So we added the button. "Re-run eval gate." Up to three runs total, merge on any green. The team's false-block complaints stopped within a week. I counted that as a win and moved on.

The part I did not count: what the button does to a real regression.

The arithmetic

Two numbers describe any noisy gate. The chance it fails a clean PR (false block, call it f). The chance it fails a PR that genuinely regressed (catch rate, call it d).

Both react to retries the same way, and that is the problem. Under merge-on-any-green with k allowed runs:

P(clean PR gets blocked) = f^k
P(regressed PR gets blocked) = d^k

With f = 0.10 and d = 0.70 and k = 3, the false-block rate falls from 10 percent to 0.1 percent. Great. The catch rate falls from 70 percent to 34 percent. It used to catch seven regressions in ten. Now it catches about three, and the dashboard shows nothing, because a merged PR looks identical whether it merged on run one or run three.

Run it yourself:

def blocked(fail_prob: float, k: int) -> float:
    """P(all k runs fail) under merge-on-any-green."""
    return fail_prob ** k

for k in (1, 2, 3, 5):
    print(k, round(blocked(0.70, k), 3), round(blocked(0.10, k), 6))

Output for k = 1, 2, 3, 5: catch 0.7, 0.49, 0.343, 0.168. False blocks 0.1, 0.01, 0.001, 1e-05. Every extra allowed run trades a chunk of your remaining detection power for false-block reduction you mostly already had.

Why the policy works for tests and fails for judge scores

Google's testing blog documented the strongest version of retry-as-policy in their 2016 flaky-test writeup (testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html): a test can be marked flaky so that it reports "a failure only if it fails 3 times in a row." The same post is honest about the cost, noting that the mechanism trains developers to ignore flakiness in their own tests until the triple-fail threshold trips. They also report that roughly 1.5 percent of all test runs across their corpus come back flaky. A decade old, still the clearest thing written on the subject.

Here is why the policy is defensible for tests and not for judge scores. A conventional test is deterministic in intent; flakiness comes from the environment around it. Ports, clocks, race conditions. When a flaky test passes, the pass carries real information, because the assertion itself is exact and the code path demonstrably works when the environment cooperates.

A judge-scored eval inverts this. The scorer is the random variable, so a passing run is just another draw from the distribution that produced the failing run, and it deserves exactly the same trust. Taking the best of three draws and calling it the verdict is not noise handling. It is selecting the sample you liked.

One question worth asking your own CI today: how many of your merged PRs had at least one red eval run before the green one? If your system cannot answer that, you have no idea what your effective catch rate is.
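A minimal sketch of that question as code, assuming you can export each merged PR's eval runs as an ordered list of "red"/"green" strings. The export shape and the names `red_then_green` and `merged` are hypothetical, not something any particular CI system provides.

```python
from typing import Iterable, Sequence


def red_then_green(histories: Iterable[Sequence[str]]) -> int:
    """Count merged PRs whose final run was green after at least one red run."""
    count = 0
    for runs in histories:
        if runs and runs[-1] == "green" and "red" in runs[:-1]:
            count += 1
    return count


# Hypothetical histories for four merged PRs, oldest run first.
merged = [
    ["green"],
    ["red", "green"],
    ["red", "red", "green"],
    ["green"],
]
print(red_then_green(merged), "of", len(merged), "merged after at least one red run")
```

If the CI system only keeps the final status per PR, this count cannot be computed at all, which is the situation the paragraph above warns about.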

What we run now

Three changes, in the order we made them.

Deterministic checks stay blocking and get no retry button. Schema validation, regex assertions, golden-string diffs, exit codes. A fail is a fail. This is most of the gate and it is the part that was never flaky to begin with.
The judge suite runs a fixed panel of n = 5 samples and gates on the aggregate, computed once. I argued nine days ago that a single judge score cannot hold a merge gate, and that has not changed. An aggregate over a fixed panel is a different object: the run-to-run variance that makes one score untrustworthy is exactly what the averaging shrinks. No re-run path exists for it. If someone wants another run, the new samples join the panel and the aggregate recomputes over all of them. Averaging n independent samples shrinks the standard deviation of the score by a factor of sqrt(n), so about 2.2x at n = 5. Best-of-n converts the noise into bias in the direction you wanted.
Retries exist only for infrastructure failures, and the runner distinguishes them by exit code. A timeout or a 429 re-runs automatically and silently. A low score never does.
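To put numbers on aggregate versus best-of-n, here is a sketch under loud assumptions: judge noise is independent and Gaussian, and the threshold, noise level, and the two mean scores are values I picked so that a single run reproduces f = 0.10 and d = 0.70. None of these constants come from a real gate.

```python
from math import sqrt
from statistics import NormalDist

# All four constants are assumptions chosen to reproduce f = 0.10 and d = 0.70.
THRESHOLD = 0.80         # pass mark for the judge score
SIGMA = 0.03             # run-to-run standard deviation of one judge score
CLEAN_MEAN = 0.8384      # true mean score of a clean PR
REGRESSED_MEAN = 0.7843  # true mean score of a regressed PR


def p_fail_single(mean: float) -> float:
    """P(one run scores below the threshold)."""
    return NormalDist(mean, SIGMA).cdf(THRESHOLD)


def p_block_any_green(mean: float, k: int) -> float:
    """P(blocked) when any green among k independent runs merges the PR."""
    return p_fail_single(mean) ** k


def p_block_mean_of_n(mean: float, n: int) -> float:
    """P(blocked) when the gate decides once on the mean of n independent samples."""
    return NormalDist(mean, SIGMA / sqrt(n)).cdf(THRESHOLD)


for label, mean in (("clean", CLEAN_MEAN), ("regressed", REGRESSED_MEAN)):
    print(label,
          round(p_fail_single(mean), 3),         # single run
          round(p_block_any_green(mean, 3), 3),  # merge on any green, k = 3
          round(p_block_mean_of_n(mean, 5), 3))  # mean of a fixed panel, n = 5
```

Under these assumed numbers the five-sample mean blocks the regressed PR more often than a single run does, while any-green-of-three blocks it less often. Real judge noise may be neither Gaussian nor independent, so treat the output as an illustration of direction, not a measurement.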

The retry-rate metric came last and taught us the most. Retries per PR per week is now on the same dashboard as the pass rate. It is our flakiness number. The week it spikes, something in the judge path drifted, and we know before anyone starts arguing with the gate.

What I'd check first

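Count merged PRs in the last 30 days with a red-then-green eval history. That count is the ceiling on what the retry button leaked: each of those merges was either a false block or a real regression that drew a lucky run, and the gate cannot tell you which.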
Find every place a human can re-trigger a scored check. Each one lets the luckiest sample overrule the rest. Replace it with an aggregate or delete it.
Split your runner's exit codes: infra failure, score failure, harness bug. Only the first class earns an automatic retry.
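One way the exit-code split could look, as a sketch. The enum name, the specific code values, and `should_auto_retry` are all hypothetical; the only property that matters is that a score failure and an infrastructure failure never share a code.

```python
from enum import IntEnum


class GateExit(IntEnum):
    """Hypothetical exit codes for an eval runner; the values are arbitrary."""
    PASS = 0
    SCORE_FAIL = 1    # the suite ran and the score was below the bar
    HARNESS_BUG = 70  # the eval code itself crashed
    INFRA_FAIL = 75   # timeout, 429, network error: the suite never really ran


def should_auto_retry(code: int, attempt: int, max_attempts: int = 3) -> bool:
    """Only infrastructure failures earn another attempt; a low score never does."""
    return code == GateExit.INFRA_FAIL and attempt < max_attempts


print(should_auto_retry(GateExit.INFRA_FAIL, attempt=1))
print(should_auto_retry(GateExit.SCORE_FAIL, attempt=1))
```

A harness bug gets no retry here either: re-running broken eval code tends to hide the bug rather than fix it.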

Our CI eval gate sent us a token bill. The deterministic one sent nothing.

Ethan Walker — Tue, 21 Jul 2026 15:02:56 +0000

TL;DR. A quality gate that grades every pull request with an LLM-as-judge metric is buying a judge call per PR, per metric, forever. At forty PRs a day and a dozen judge-graded metrics, that is real money and, worse, a merge queue now coupled to a paid API's price, latency, and rate limits. I wired six open-source eval frameworks into the same GitHub Actions gate and watched the invoice, not the feature list. The split that matters for cost is simple: does the check run in-process and return an exit code, or does it call a model. Promptfoo assertions, MLflow's heuristic metrics, and Future AGI's deterministic metrics can gate without a single judge call. RAGAS ships deterministic metrics too (BLEU, ROUGE, string checks), but the RAG metrics you actually adopt it for are judge calls. DeepEval and Phoenix sit in the middle, judge-first by default but drivable in a cheaper mode. Rankings and the arithmetic below.

The invoice

I was not auditing cost. Finance was. Someone forwarded me a line item, a few hundred dollars against an OpenAI project key I did not recognize, tagged ci. It was the eval gate. We had added an LLM-as-judge relevance metric to the merge queue four months earlier, felt good about the coverage, and moved on. Nobody connected "we grade every PR with a model" to "we pay for every PR we grade."

A few hundred a month is noise against an engineering payroll. What bothered me was the shape of it: the bill grew with our merge volume, which means the better the quarter, the more the gate cost, and the gate itself did nothing you could not have paid for once. And the second I looked, I realized the merge queue now had a dependency nobody had reviewed. If that API rate-limited us at 9am on a busy day, PRs would stall on a gate that had nothing to do with the code in them.

So I did the boring thing and measured it.

The cost formula

The number that drives cost is not metric quality or accuracy. It is how many model calls one gate run makes.

A deterministic check makes zero. contains, equals, regex, is-json, a schema validator, a required-field assert, a golden-file diff. These run in-process, return in milliseconds, and cost nothing per PR. Each LLM-as-judge metric makes at least one model call on every run and bills for the tokens.

So the monthly cost of a gate is close to:

PRs/day  x  judge calls per run  x  price per judge call  x  workdays

Everything else is a rounding error. The chart is that formula, nothing more.

[DIAGRAM: https://lh3.googleusercontent.com/d/1bVKW6DCpbvxco0nlIh6O06aJZLe_ttMh]

At forty PRs a day, a gate running a dozen judge-graded metrics adds roughly twenty dollars a month by this arithmetic (at an assumed ~$0.002 per call and about 21 workdays; check current token prices before you quote it). A single-judge gate is under two dollars. A deterministic gate is flat at zero, no matter how many PRs you merge. The absolute numbers are small. The point is the slope, and the fact that only one of these lines is flat.
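The same arithmetic as a few lines of Python, so you can plug in your own volume. The function name is mine, and the default of 21 workdays per month is an assumption; the per-call price is the ~$0.002 figure quoted above and will be wrong for your model.

```python
def monthly_gate_cost(prs_per_day: float,
                      judge_calls_per_run: int,
                      price_per_call: float,
                      workdays: int = 21) -> float:
    """PRs/day x judge calls per run x price per judge call x workdays."""
    return prs_per_day * judge_calls_per_run * price_per_call * workdays


# Deterministic gate, single-judge gate, dozen-metric gate at forty PRs a day.
for calls in (0, 1, 12):
    print(calls, round(monthly_gate_cost(40, calls, 0.002), 2))
```

Doubling the merge volume doubles every line except the deterministic one, which is the slope the chart is about.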

The ranking

All licenses, metric counts, and defaults are as of mid-2026. Check each repo before you rely on any of this, because all six move.

Promptfoo (MIT). The cheapest to gate on, because its assertion library is deterministic by design: contains, equals, regex, is-json, starts-with, plus cost and latency asserts. An LLM-judge assertion is available but opt-in. You can build a real blocking gate that never makes a model call. CLI exit code, JSON output, maintained Action.
MLflow evaluate (Apache-2.0). mlflow.evaluate() ships heuristic metrics (exact match, token overlap, and similar) alongside optional genai judge metrics. Gate on the heuristic set and you pay nothing per run. The eval is logged next to the experiment, which is the reason to reach for it.
Future AGI (Apache-2.0). Its ai-evaluation SDK runs through one evaluate() call from Python or TypeScript and mixes deterministic checks with judge-based metrics. Gate on the deterministic ones and, like Promptfoo and MLflow, it makes no model call, so it lands on the cheap end. Gate on the judge metrics and it bills like the rest.
DeepEval (Apache-2.0). Its determinism comes from the runner. You inherit pytest exit codes and JUnit XML for free, which is why it is pleasant in CI, but most of its metric catalog is judge calls. Run it on plain pytest asserts and it is cheap. Run its default metrics on every PR and you are paying per metric, per PR.
Arize Phoenix (Elastic License 2.0). run_evals() in a script, a handful of evaluators, judge-based. The draw is tracing plus eval in one tool, not a deterministic gate. As a pure cost line it bills like any judge-first tool.
RAGAS (Apache-2.0). It does ship deterministic metrics (BLEU, ROUGE, chrF, string and tool-call checks), so a gate built on those makes no model call. But nobody adds RAGAS for BLEU. The reason it is on your stack is its RAG metrics (faithfulness, answer relevancy, context precision), and those are judge calls almost by definition. Gate on what you came to RAGAS for, and every check is a model call.

Beyond the dollars

The money was the small problem. Two bigger ones showed up once the gate depended on an API.

Rate limits became merge-queue outages. A judge-graded gate that 429s does not fail gracefully, it fails your PR, and the fix is retry-with-backoff logic you now own inside CI. A deterministic assert never 429s.

Price is not yours to control. We had modeled the gate at one token price. Model prices move, sometimes down, sometimes up, and a gate you pay per-run for is a line item that reprices without asking you. A gate that runs in-process is priced once, in compute you already have.

None of this means judge metrics are useless. I still run them. I just stopped putting them in the blocking path. Everything deterministic gates the merge. Everything judge-graded runs on the same PR, posts as a non-blocking comment, and a human reads it. The gate got cheaper and the signal did not get worse, because the judge was never a good binary gate anyway.
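A sketch of that split, assuming each check reports a name, a pass/fail, and whether it is deterministic. `CheckResult` and `gate` are hypothetical names, and posting the comment to the PR is left out because it depends on your CI.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    deterministic: bool  # True: in-process assert. False: judge-graded.


def gate(results: List[CheckResult]) -> Tuple[int, List[str]]:
    """Return (exit_code, comment_lines). Only deterministic failures block."""
    blocking = [r for r in results if r.deterministic and not r.passed]
    comment = [
        f"{'PASS' if r.passed else 'WARN'} {r.name} (judge, non-blocking)"
        for r in results
        if not r.deterministic
    ]
    return (1 if blocking else 0), comment


results = [
    CheckResult("schema", passed=True, deterministic=True),
    CheckResult("golden-diff", passed=True, deterministic=True),
    CheckResult("relevance-judge", passed=False, deterministic=False),
]
code, comment = gate(results)
print(code)
print("\n".join(comment))
```

The failing judge metric shows up in the comment and the exit code stays zero, so a human sees the signal without the merge queue depending on it.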

What I'd check first

If your CI bill has a mystery line item, or you are about to add an eval gate:

Count the model calls in one gate run, then multiply by PRs/day. If that number is not zero and you expected a fixed cost, that is your line item.
Move every deterministic check you have (schema, required fields, forbidden strings, one golden diff) into the blocking path, and demote every judge metric to a non-blocking PR comment.
Before you gate on any judge metric, ask whether it survives a 429 at 9am. If a rate limit would block a clean PR, it is a coupling you did not mean to buy.

Our few-shot examples came from the eval set. The 0.94 was fiction.

Ethan Walker — Mon, 20 Jul 2026 07:29:57 +0000

TL;DR. Our ticket-routing eval scored 0.94 for five weeks. The number was manufactured. We had built a dynamic few-shot selector that retrieved the eight nearest labeled examples for each input, and we built its index out of the same labeled_tickets.jsonl the eval set was sampled from. So for every eval case, the nearest neighbor in that index was the eval case itself, gold label attached, pasted into the prompt directly above the question we were about to ask. The model was not answering. It was copying. Measured against tickets the index had never seen, real accuracy was about 0.79. The reframing that stuck for me: contamination is not just a training-time problem you inherit from a model vendor. Any pool you draw prompt content from is part of your eval's input. If your eval set and your few-shot pool share a parent file, the leak is yours, in your repo, shipped by someone on your team last Tuesday. Split the pools by content hash at the source, then check the prompt and not only the training data.

The trace that ruined a good number

I was not looking for this. I was chasing p99 latency on the routing endpoint, which had crept past two seconds and was making the queue back up. So I pulled a slow trace and started reading the prompt we actually send, top to bottom, the way you do when you suspect somebody stuffed too much into the context window.

The prompt had a system block, then eight few-shot examples, then the ticket to classify. We rendered them nearest-last, so the closest match sat directly above the question.

Example eight was the ticket to classify. Same text. Same customer. And sitting under it, formatted as the demonstration answer, was the label we were about to grade the model on.

I read it three times. Then I checked four more traces. Same shape every time: the last example before the question was the question.

Our eval had been reporting 0.94 since the selector shipped. Nobody questioned it because 0.94 is a believable number. Five weeks. Nobody asked once. If the suite had printed 1.00, someone would have opened it within the hour, because a perfect score reads as a bug. A 0.94 reads as a good quarter. A total leak still did not mean a perfect score: the other seven demonstrations pulled against the exact match often enough to cost a few points, which is precisely what made 0.94 look earned. It went in a deck. Somebody put it on a slide with an arrow pointing up. The score was high enough to celebrate and low enough to trust, which is the worst place a wrong number can sit.

select_examples()

The mechanism is boring, which is why it survived review.

We started with static few-shot: eight hand-picked examples, hardcoded, same eight for every request. It worked fine and it was obviously fine, because you could read the eight in the diff.

Then someone (me, partly, in a design review I do not get to distance myself from) pointed out that a fixed eight cannot cover billing and abuse reports and integration bugs at once. Retrieve the examples instead. For each incoming ticket, embed it, pull the eight nearest labeled tickets out of the index, put those in the prompt. Better coverage per token. This is standard practice and I still think it is right.

The index got built from labeled_tickets.jsonl, which was where every labeled ticket lived. Roughly 1,900 rows at the time.

The eval set was 600 rows sampled from labeled_tickets.jsonl.

Those two sentences are the whole bug. Both artifacts were correct in isolation. Both were reviewed. Nobody put them side by side, because they lived in different files, owned by different people, merged six weeks apart.

That gap is the part worth generalizing. Neither diff was wrong. A reviewer on the selector PR sees a sensible retrieval change and asks about latency and index freshness. A reviewer on the eval PR sees a defensible sample size and asks whether 600 cases is enough. Both are good questions. Neither reviewer gets asked the only one that mattered, which is whether these two things read the same rows, because that question is not visible in either diff. It lives in the space between them. We had no review process for the space between two files, and I am not sure most teams do.

Here is what the selector did, reduced to the part that matters:

def select_examples(query, index, k=8):
    # index was built from every row in labeled_tickets.jsonl
    return index.search(embed(query), k=k)

Then the eval harness:

cases = random.sample(load("labeled_tickets.jsonl"), 600)

Read them together and the failure is arithmetic, not machine learning. The eval case is in the index. The eval case's own embedding is its nearest neighbor, at distance zero. Every single one of the 600 eval cases retrieved itself as its own nearest neighbor and carried its gold label into the prompt. Not some of them. All of them. A retriever asked to find the most similar labeled example to a text that is sitting in its own index will return that text, because nothing is more similar to a string than the string.
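The self-retrieval is easy to reproduce without a vector database. A toy sketch, assuming a bag-of-words vector in place of a real embedding model; `embed`, `cosine`, `nearest`, and the three tickets are stand-ins I made up, not our selector.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy embedding: token counts. A real selector would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest(query, pool, k=1):
    """Return the k pool rows most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(pool, key=lambda row: cosine(q, embed(row["text"])), reverse=True)
    return ranked[:k]


pool = [
    {"text": "card was charged twice for one order", "label": "billing"},
    {"text": "webhook returns 500 after the latest deploy", "label": "integration"},
    {"text": "user is sending threats in the chat", "label": "abuse"},
]

eval_case = pool[0]  # the eval case was sampled from the same rows as the index
top = nearest(eval_case["text"], pool)[0]
print(top is eval_case, top["label"])
```

The top hit is the eval case itself, with its gold label attached. Swapping in a better embedding does not change this, because any sensible similarity measure ranks a text closest to itself.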

We also had about 40 exact duplicate tickets in the pool, because customers paste the same complaint twice and support macros generate identical bodies. Those would have leaked across a naive random split even without a retriever. The retriever just made the leak total instead of partial.
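How likely is a duplicated ticket to straddle a naive random split? A sketch, assuming duplicates come in pairs and the eval set is a uniform sample drawn without replacement, using the rough sizes above (1,900 rows, 600 eval). The function name is mine.

```python
def p_pair_split(pool_size: int, eval_size: int) -> float:
    """P(exactly one of two specific rows lands in a random eval sample).

    One copy is drawn into eval with probability eval_size / pool_size, the
    other then stays out with probability (pool_size - eval_size) / (pool_size - 1).
    The factor of 2 covers either copy being the one in eval.
    """
    in_eval = eval_size / pool_size
    other_out = (pool_size - eval_size) / (pool_size - 1)
    return 2 * in_eval * other_out


print(round(p_pair_split(1900, 600), 3))
```

That comes out at about 0.43 per duplicated pair, so under these assumptions a random split would be expected to separate a large share of the duplicates rather than the odd one.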

Why the retriever hands over the answer

Worth being precise about what broke, because "contamination" gets used loosely enough to stop meaning anything.

The model did not memorize our tickets during pre-training. The vendor did not train on our data. Our fine-tune was clean. Every version of contamination I had read about was about the training corpus, and every one of those was genuinely not our problem.

The leak happened at prompt construction time, in our code, at inference, on every request. The model saw the answer in the line directly above the question, in a block explicitly labeled as examples of correct behavior. It did what any competent few-shot learner does with a demonstration that exactly matches the query. It copied the label.

So the eval was measuring copy fidelity. That is a real capability. It is not the capability we were shipping, and it is not the one the number claimed.

The published work here is almost entirely about training data, and it is worth knowing even though it does not describe this exact bug. Zhou et al., in Don't Make Your LLM an Evaluation Benchmark Cheater, work through how benchmark leakage into training "can dramatically boost the evaluation results," producing an unreliable read on what a model can do. Their setting is pre-training and fine-tuning corpora. Mine was a JSONL file and a vector index.

The mechanism ports anyway. The model cannot tell where in its input a leaked answer came from, and neither can your score. Weights or context window, the arithmetic is the same: if the answer reached the model before the question, the number measures retrieval, not reasoning. The difference is that training contamination is mostly somebody else's to fix, and prompt contamination is entirely yours. I find that clarifying rather than comforting.

contamination_check.py

The check we should have had. Standard library only, so it runs anywhere Python does and there is nothing to install in CI.

Two passes. A normalized hash catches verbatim and cosmetically edited duplicates. An n-gram containment score catches the case where an eval item sits inside a longer example. Containment rather than Jaccard on purpose: a short eval case buried in a long few-shot example still scores 1.0, and that is exactly the leak worth failing on.

"""contamination_check.py: does the few-shot pool already contain the eval answer?"""
import hashlib
import re
import unicodedata

_PUNCT = re.compile(r"[^\w\s]", flags=re.UNICODE)
_WS = re.compile(r"\s+")


def normalize(text):
    """Casefold, strip punctuation, collapse whitespace. Defeats cosmetic edits."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return _WS.sub(" ", _PUNCT.sub(" ", text)).strip()


def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def ngrams(text, n):
    toks = normalize(text).split()
    if len(toks) < n:
        return {" ".join(toks)} if toks else set()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def containment(case, example, n):
    """Fraction of the eval case's n-grams that also appear in the example."""
    gc = ngrams(case, n)
    if not gc:
        return 0.0
    return len(gc & ngrams(example, n)) / len(gc)


def audit(eval_set, fewshot_pool, n=5, threshold=0.6):
    """Return (eval_idx, pool_idx, score, kind) for every eval case the pool leaks."""
    by_hash = {}
    for i, ex in enumerate(fewshot_pool):
        by_hash.setdefault(fingerprint(ex), i)

    hits = []
    for j, case in enumerate(eval_set):
        i = by_hash.get(fingerprint(case))
        if i is not None:
            hits.append((j, i, 1.0, "exact"))
            continue
        scored = ((containment(case, ex, n), i) for i, ex in enumerate(fewshot_pool))
        score, i = max(scored, default=(0.0, -1))
        if score >= threshold:
            hits.append((j, i, round(score, 3), f"{n}-gram"))
    return hits


if __name__ == "__main__":
    fewshot_pool = [
        "Escalate to a human when the customer mentions legal action.",
        "Refund the order automatically if it shipped more than 30 days ago.",
        "When a customer asks to escalate to a human because they mention legal "
        "action or a lawsuit, hand the ticket to the on-call agent immediately.",
    ]

    eval_set = [
        "Escalate to a human when the customer mentions legal action.",         # verbatim
        "escalate to a human when the customer mentions LEGAL ACTION!!",        # cosmetic edit
        "Escalate to a human because they mention legal action or a lawsuit.",  # fragment
        "Hand off to a person if the buyer threatens to sue.",                  # paraphrase
        "Ask for the order number before checking shipment status.",            # clean
    ]

    hits = audit(eval_set, fewshot_pool)
    leaked = {j for j, _, _, _ in hits}
    for j, i, score, kind in hits:
        print(f"LEAK eval[{j}] <- pool[{i}]  score={score:<5} via {kind}")
    print(f"\n{len(leaked)}/{len(eval_set)} eval cases leaked by the few-shot pool")
    for j, case in enumerate(eval_set):
        if j not in leaked:
            print(f"  clean: eval[{j}] {case!r}")

Output:

LEAK eval[0] <- pool[0]  score=1.0   via exact
LEAK eval[1] <- pool[0]  score=1.0   via exact
LEAK eval[2] <- pool[2]  score=1.0   via 5-gram

3/5 eval cases leaked by the few-shot pool
  clean: eval[3] 'Hand off to a person if the buyer threatens to sue.'
  clean: eval[4] 'Ask for the order number before checking shipment status.'

Three of five caught. The cosmetic edit gets normalized into an exact hit, and the fragment gets caught by containment against the longer pool entry. Both of those would have slipped past a naive set(eval) & set(pool).

Now look at what the script calls clean.

The paraphrase the check calls clean

eval[3] is "Hand off to a person if the buyer threatens to sue." The pool contains "Escalate to a human when the customer mentions legal action." Same rule. Same decision. Zero shared 5-grams, so my check reports it as clean and moves on.

It is not clean. It is the same eval case wearing a different coat, and a model given the pool entry will get eval[3] right for reasons that have nothing to do with understanding your routing policy.

This is a known and documented hole, not something I discovered. Yang, Chiang, Zheng, Gonzalez and Stoica went at it directly in Rethinking Benchmark and Contamination for Language Models with Rephrased Samples. Their finding is that "simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures," where the measures in question are exactly the string and n-gram matching my script does. They report that a 13B model can overfit a benchmark and reach performance on par with GPT-4 when rephrased test data is left in, and they found 8% to 18% overlap with HumanEval sitting in pre-training sets like RedPajama-Data-1T and StarCoder-Data. Their decontamination tool is public at lm-sys/llm-decontaminator, Apache-2.0, and it uses an LLM to catch what string matching cannot.

Again their setting is training corpora and mine is a prompt. The blind spot transfers cleanly, because n-grams do not know what a sentence means in either setting.

I have not run their tool against our pool yet, so I am not going to tell you what it would find. What I will say is that the cheap check is worth shipping anyway. It caught three of five in a toy example and it caught our real leak on the first run, because our leak was verbatim. Verbatim is the common case. Paraphrase contamination is real, and my honest position is that I do not currently know how much of it we have, which is a different sentence from "we don't have any."

Where else the same pool leaks

Once I had the shape of it I went looking elsewhere and found three more instances in our own repo within a day. Listing them because the retriever version is the flashiest and probably the least common.

Static few-shot, curated out of the labeled pool. Before the retriever we had eight hardcoded examples, and I had filed them as safe precisely because they were hardcoded. They came from the same file. Two of the eight were in the eval set. A fixed leak is smaller than a total leak. It is still a leak, and it sat in that prompt for months while I told myself static few-shot was the conservative option.

RAG evals where the corpus contains the eval documents. Same arithmetic, different index. If your eval questions were written from documents that live in the retrieval corpus, the retriever will hand the model the exact paragraph each question was written from. This one is arguably fine, because production does the same thing. It stops being fine the second you report the score as evidence about the model rather than about your retriever.

Synthetic eval cases generated from the seed examples. The easiest to walk into and the one I would bet is most widespread right now. You ask a model for 500 eval cases. You seed it with your best labeled examples so the output looks like your domain. Those same examples are in your few-shot block. Your eval set is now a paraphrase of your prompt. No file is shared, no hash collides, and contamination_check.py finds nothing, because there is nothing verbatim left to find. It is the paraphrase case from the Yang et al. paper, arriving through a door we built ourselves and held open.

The common factor is not retrieval. It is one pool of text doing two jobs: teaching the model and grading it. Whenever those roles drink from the same well, the score is compromised, and how badly is not knowable from the score.

Disjoint by construction, not by discipline

Detection is a smoke alarm. It tells you the kitchen is already on fire. The actual fix is to make the overlap impossible to express.

We stopped sampling the eval set and the few-shot index separately from one parent file. Instead, every record gets assigned to exactly one side by hashing its normalized content:

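# Reuses hashlib and normalize() from contamination_check.py above.
def assign(text, eval_frac=0.35, salt="fewshot-v3"):
    """Map a record to 'eval' or 'fewshot' from a salted hash of its normalized text."""
    h = hashlib.sha256((salt + normalize(text)).encode("utf-8")).hexdigest()
    # First 32 bits of the digest, scaled to [0, 1], compared against the eval share.
    return "eval" if int(h[:8], 16) / 0xFFFFFFFF < eval_frac else "fewshot"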


def split_pool(pool, eval_frac=0.35, salt="fewshot-v3"):
    buckets = {"eval": [], "fewshot": []}
    for rec in pool:
        buckets[assign(rec, eval_frac, salt)].append(rec)
    return buckets["eval"], buckets["fewshot"]

Three properties earn this over random.sample, and the third is the one that actually mattered to us.

It is deterministic. No seed to forget, no ordering dependency. The same record lands in the same bucket on my laptop, in CI, and in the batch job that rebuilds the index at 4am.

It is stable as the pool grows. Adding 500 new tickets does not reshuffle the existing split, so last month's eval numbers stay comparable to this month's. A reshuffling split quietly moves your baseline and you get to spend a day proving the model did not regress.

It puts duplicates on the same side. This is the part random.sample cannot do. Because the hash runs over the normalized text, our 40-odd exact duplicates and their cosmetic variants all resolve to one bucket. Under random sampling, a duplicated ticket had a real chance of landing one copy in eval and one in few-shot, which is the leak reappearing through the back door after you thought you had closed it.

Run it over 2,000 synthetic records and the realized eval fraction lands within about a point of the 0.35 target, which is the usual hash-bucket wobble and does not matter at this size. The exact counts depend on your strings, so do not pattern-match mine. The part that does not wobble: feed it the escalation rule plus its shouty and double-spaced variants and all three land in the same bucket, every run, on every machine.
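A self-contained version of that experiment, for anyone who wants to rerun it. `normalize` and `assign` are copied from the snippets above so the block stands alone; the synthetic record strings are placeholders I made up, and your realized fraction will differ with your strings.

```python
import hashlib
import re
import unicodedata

_PUNCT = re.compile(r"[^\w\s]", flags=re.UNICODE)
_WS = re.compile(r"\s+")


def normalize(text):
    """Casefold, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return _WS.sub(" ", _PUNCT.sub(" ", text)).strip()


def assign(text, eval_frac=0.35, salt="fewshot-v3"):
    h = hashlib.sha256((salt + normalize(text)).encode("utf-8")).hexdigest()
    return "eval" if int(h[:8], 16) / 0xFFFFFFFF < eval_frac else "fewshot"


# 1. Cosmetic variants of one rule must share a bucket.
variants = [
    "Escalate to a human when the customer mentions legal action.",
    "ESCALATE TO A HUMAN WHEN THE CUSTOMER MENTIONS LEGAL ACTION!!",
    "Escalate  to a human  when the customer mentions legal action.",
]
print({assign(v) for v in variants})

# 2. Realized eval fraction over 2,000 placeholder records.
records = [f"synthetic ticket number {i}" for i in range(2000)]
frac = sum(assign(r) == "eval" for r in records) / len(records)
print(round(frac, 3))

# 3. Growing the pool must not move any existing record.
before = {r: assign(r) for r in records}
grown = records + [f"new ticket {i}" for i in range(500)]
after = {r: assign(r) for r in grown}
unchanged = all(after[r] == before[r] for r in records)
print(unchanged)
```

The first print is a one-element set and the third is True by construction; only the fraction in the middle depends on the strings you feed it.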

The salt is there so you can rotate the split deliberately. Bump fewshot-v3 to fewshot-v4 and you get a fresh partition, on purpose, in a diff, with a name someone has to review.

Then it goes in CI as a gate, not a dashboard. The audit runs on every PR that touches either artifact, and a nonzero hit count fails the build with the offending indices printed. It takes about a second on our pool. I have opinions about slow eval gates, but a hash join over a few thousand strings is not where your merge queue goes to die.
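The gate wrapper itself can be tiny. A sketch, assuming `hits` is the list of (eval_idx, pool_idx, score, kind) tuples that audit() from contamination_check.py returns; `gate_exit_code` is a hypothetical name, and the example hits are hardcoded so the block runs on its own.

```python
import sys
from typing import Iterable, Tuple

Hit = Tuple[int, int, float, str]  # (eval_idx, pool_idx, score, kind)


def gate_exit_code(hits: Iterable[Hit]) -> int:
    """Print every leak to stderr and return 1 if there was at least one, else 0."""
    hits = list(hits)
    for eval_idx, pool_idx, score, kind in hits:
        print(f"LEAK eval[{eval_idx}] <- pool[{pool_idx}] score={score} via {kind}",
              file=sys.stderr)
    return 1 if hits else 0


# In CI this would be roughly: sys.exit(gate_exit_code(audit(eval_set, fewshot_pool)))
example_hits = [(0, 0, 1.0, "exact"), (2, 2, 1.0, "5-gram")]
print(gate_exit_code(example_hits))
print(gate_exit_code([]))
```

Printing the indices matters as much as the exit code: a red build that does not say which rows collided just gets re-run.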

The number after the fix

Rebuilt the index from the few-shot side only. Re-ran the 600-case eval.

0.79.

That is roughly a 15 point drop, and it is the first number that suite ever produced that meant anything. Nobody enjoyed the meeting. The version I would defend now is that we did not lose 15 points, we found out we never had them, and we found out from a trace instead of from a customer.

The follow-on was more interesting than the drop. With a real number, the error slices were legible for the first time. Abuse reports were dragging well below the mean while billing sat above it, and that gap had been invisible at 0.94 because copying the label works equally well for every category. Contamination does not just inflate your score. It flattens it, and a flat score hides the shape of the problem you were trying to see. We spent the next sprint on abuse-report examples and got some of the 15 points back honestly, which took actual work.
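Slicing the score is a few lines once predictions sit next to gold labels. A sketch with made-up rows; the function name and the (gold, predicted) tuple shape are assumptions, and the toy numbers are not our real slices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(rows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """rows are (gold_label, predicted_label); returns accuracy per gold label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [correct, seen]
    for gold, predicted in rows:
        totals[gold][1] += 1
        totals[gold][0] += int(gold == predicted)
    return {label: correct / seen for label, (correct, seen) in totals.items()}


# Made-up predictions, only to show the shape of the output.
rows = [
    ("billing", "billing"), ("billing", "billing"),
    ("billing", "billing"), ("billing", "abuse"),
    ("abuse", "abuse"), ("abuse", "billing"),
    ("abuse", "integration"), ("abuse", "abuse"),
]
print(accuracy_by_category(rows))
```

On a contaminated eval every slice sits near the same inflated value, so a table like this looks healthy and tells you nothing.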

What I'd check first

Print one real prompt from a failing eval run and read it end to end. Not the template. The rendered string, with the retrieved examples in it. If the eval case appears in its own prompt, stop and go fix that before you interpret another score.
Diff the provenance of the eval set against the few-shot pool. Not the contents, the source. If both trace back to the same file, table, or index, assume overlap until a hash join says otherwise, and treat any shared parent as a leak that has not been found yet.
Suspect the believable number, not just the perfect one. 1.00 gets investigated. 0.94 gets a slide. Ask what score you would have accepted without checking, and go check that one first.
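The first check can be automated once you have read one prompt by hand. A sketch, assuming your harness can hand you the rendered few-shot examples separately from the final question (the question will always contain the case, so it has to be excluded). The function names and the sample strings are hypothetical.

```python
from typing import List


def _squash(text: str) -> str:
    """Casefold and collapse whitespace so cosmetic differences do not hide a match."""
    return " ".join(text.casefold().split())


def examples_containing_case(case_text: str, rendered_examples: List[str]) -> List[int]:
    """Indices of few-shot examples whose rendered text contains the eval case."""
    target = _squash(case_text)
    return [
        i for i, example in enumerate(rendered_examples)
        if target and target in _squash(example)
    ]


rendered_examples = [
    "Ticket: webhook returns 500 after deploy\nLabel: integration",
    "Ticket: My card was charged twice\nLabel: billing",
]
print(examples_containing_case("my card was  charged twice", rendered_examples))
print(examples_containing_case("cannot reset my password", rendered_examples))
```

This only catches verbatim and near-verbatim reuse. A paraphrased example passes straight through it, for the same reason the n-gram check misses paraphrases.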

Our few-shot examples came from the eval set. The 0.94 was fiction.

Ethan Walker — Thu, 16 Jul 2026 17:42:16 +0000

The trace that ruined a good number

The prompt had a system block, then eight few-shot examples, then the ticket to classify. We rendered them nearest-last, so the closest match sat directly above the question.

Example eight was the ticket to classify. Same text. Same customer. And sitting under it, formatted as the demonstration answer, was the label we were about to grade the model on.

I read it three times. Then I checked four more traces. Same shape every time: the last example before the question was the question.

select_examples()

The mechanism is boring, which is why it survived review.

We started with static few-shot: eight hand-picked examples, hardcoded, same eight for every request. It worked fine and it was obviously fine, because you could read the eight in the diff.

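Then we moved to retrieved few-shot: embed the incoming ticket and pull the eight nearest labeled examples from an index. The index got built from labeled_tickets.jsonl, which was where every labeled ticket lived. Roughly 1,900 rows at the time.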

The eval set was 600 rows sampled from labeled_tickets.jsonl.

Here is what the selector did, reduced to the part that matters:

def select_examples(query, index, k=8):
    # index was built from every row in labeled_tickets.jsonl
    return index.search(embed(query), k=k)

Then the eval harness:

cases = random.sample(load("labeled_tickets.jsonl"), 600)

Why the retriever hands over the answer

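Worth being precise about what broke, because "contamination" gets used loosely enough to stop meaning anything. Every eval case was also a row in the index, so its nearest neighbor was itself, and nearest-last rendering put that copy, label attached, directly above the question.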

So the eval was measuring copy fidelity. That is a real capability. It is not the capability we were shipping, and it is not the one the number claimed.
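
The failure is easy to reproduce on toy data. The sketch below swaps the real embedding search for token-set Jaccard similarity (an assumption, made so it runs without a model or a vector store) and uses four made-up tickets. Because the index holds every row, each eval query's best match is its own row.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def select_examples(query, index, k=2):
    """Toy stand-in for index.search(embed(query), k): rank rows by similarity."""
    ranked = sorted(index, key=lambda row: jaccard(query, row["text"]), reverse=True)
    return ranked[:k]


# Hypothetical rows standing in for labeled_tickets.jsonl.
labeled = [
    {"text": "I was charged twice for my subscription", "label": "billing"},
    {"text": "Another user is sending me threats", "label": "abuse"},
    {"text": "My invoice shows the wrong amount", "label": "billing"},
    {"text": "Someone posted my address in a comment", "label": "abuse"},
]

index = labeled            # index built from every labeled row
eval_cases = labeled[:2]   # eval set drawn from the same rows

for case in eval_cases:
    top = select_examples(case["text"], index)[0]
    # The best match is the eval case itself, carrying the label under test.
    print(top["text"] == case["text"], top["label"] == case["label"])
```

Both lines print True True. Excluding the query's own row at search time would hide the symptom, but the shared parent file would still be there.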

contamination_check.py

The check we should have had. Standard library only, so it runs anywhere Python does and there is nothing to install in CI.

"""contamination_check.py: does the few-shot pool already contain the eval answer?"""
import hashlib
import re
import unicodedata

_PUNCT = re.compile(r"[^\w\s]", flags=re.UNICODE)
_WS = re.compile(r"\s+")


def normalize(text):
    """Casefold, strip punctuation, collapse whitespace. Defeats cosmetic edits."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return _WS.sub(" ", _PUNCT.sub(" ", text)).strip()


def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def ngrams(text, n):
    toks = normalize(text).split()
    if len(toks) < n:
        return {" ".join(toks)} if toks else set()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def containment(case, example, n):
    """Fraction of the eval case's n-grams that also appear in the example."""
    gc = ngrams(case, n)
    if not gc:
        return 0.0
    return len(gc & ngrams(example, n)) / len(gc)


def audit(eval_set, fewshot_pool, n=5, threshold=0.6):
    """Return (eval_idx, pool_idx, score, kind) for every eval case the pool leaks."""
    by_hash = {}
    for i, ex in enumerate(fewshot_pool):
        by_hash.setdefault(fingerprint(ex), i)

    hits = []
    for j, case in enumerate(eval_set):
        i = by_hash.get(fingerprint(case))
        if i is not None:
            hits.append((j, i, 1.0, "exact"))
            continue
        scored = ((containment(case, ex, n), i) for i, ex in enumerate(fewshot_pool))
        score, i = max(scored, default=(0.0, -1))
        if score >= threshold:
            hits.append((j, i, round(score, 3), f"{n}-gram"))
    return hits


if __name__ == "__main__":
    fewshot_pool = [
        "Escalate to a human when the customer mentions legal action.",
        "Refund the order automatically if it shipped more than 30 days ago.",
        "When a customer asks to escalate to a human because they mention legal "
        "action or a lawsuit, hand the ticket to the on-call agent immediately.",
    ]

    eval_set = [
        "Escalate to a human when the customer mentions legal action.",         # verbatim
        "escalate to a human when the customer mentions LEGAL ACTION!!",        # cosmetic edit
        "Escalate to a human because they mention legal action or a lawsuit.",  # fragment
        "Hand off to a person if the buyer threatens to sue.",                  # paraphrase
        "Ask for the order number before checking shipment status.",            # clean
    ]

    hits = audit(eval_set, fewshot_pool)
    leaked = {j for j, _, _, _ in hits}
    for j, i, score, kind in hits:
        print(f"LEAK eval[{j}] <- pool[{i}]  score={score:<5} via {kind}")
    print(f"\n{len(leaked)}/{len(eval_set)} eval cases leaked by the few-shot pool")
    for j, case in enumerate(eval_set):
        if j not in leaked:
            print(f"  clean: eval[{j}] {case!r}")

Output:

LEAK eval[0] <- pool[0]  score=1.0   via exact
LEAK eval[1] <- pool[0]  score=1.0   via exact
LEAK eval[2] <- pool[2]  score=1.0   via 5-gram

3/5 eval cases leaked by the few-shot pool
  clean: eval[3] 'Hand off to a person if the buyer threatens to sue.'
  clean: eval[4] 'Ask for the order number before checking shipment status.'

Now look at what the script calls clean.

The paraphrase the check calls clean

eval[3] is not clean. It is the same case as eval[0] wearing a different coat, and a model given pool[0] will get eval[3] right for reasons that have nothing to do with understanding your routing policy.

Again their setting is training corpora and mine is a prompt. The blind spot transfers cleanly, because n-grams do not know what a sentence means in either setting.
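
To see how little the check has to work with, the sketch below scores eval[3] against pool[0] at several n-gram sizes. It re-declares compact versions of normalize, ngrams, and containment so it runs standalone; they are meant to behave like the ones in contamination_check.py on these two strings.

```python
import re
import unicodedata


def normalize(text):
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip()


def ngrams(text, n):
    toks = normalize(text).split()
    if len(toks) < n:
        return {" ".join(toks)} if toks else set()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def containment(case, example, n):
    gc = ngrams(case, n)
    return len(gc & ngrams(example, n)) / len(gc) if gc else 0.0


pool_entry = "Escalate to a human when the customer mentions legal action."
paraphrase = "Hand off to a person if the buyer threatens to sue."

for n in (1, 2, 3, 5):
    print(n, round(containment(paraphrase, pool_entry, n), 2))
# 1 0.3   the only shared tokens are "to", "a", "the"
# 2 0.1   the single shared bigram is "to a"
# 3 0.0
# 5 0.0
```

Every shared fragment is a function word. Dropping the threshold low enough to flag 0.1 would likely flag unrelated sentences too, so a lexical check cannot be tuned into catching this case.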

Where else the same pool leaks

Disjoint by construction, not by discipline

Detection is a smoke alarm. It tells you the kitchen is already on fire. The actual fix is to make the overlap impossible to express.

We stopped sampling the eval set and the few-shot index separately from one parent file. Instead, every record gets assigned to exactly one side by hashing its normalized content:

def assign(text, eval_frac=0.35, salt="fewshot-v3"):
    h = hashlib.sha256((salt + normalize(text)).encode("utf-8")).hexdigest()
    return "eval" if int(h[:8], 16) / 0xFFFFFFFF < eval_frac else "fewshot"


def split_pool(pool, eval_frac=0.35, salt="fewshot-v3"):
    buckets = {"eval": [], "fewshot": []}
    for rec in pool:
        buckets[assign(rec, eval_frac, salt)].append(rec)
    return buckets["eval"], buckets["fewshot"]

Three properties earn this over random.sample, and the third is the one that actually mattered to us.

It is deterministic. No seed to forget, no ordering dependency. The same record lands in the same bucket on my laptop, in CI, and in the batch job that rebuilds the index at 4am.

The salt is there so you can rotate the split deliberately. Bump fewshot-v3 to fewshot-v4 and you get a fresh partition, on purpose, in a diff, with a name someone has to review.
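
What the hash split guarantees can be checked directly. A sketch, with a compact normalize re-declared so the block runs on its own and three made-up records. It asserts nothing about which bucket a particular record lands in, only that the assignment is stable, that cosmetic variants of one text share a bucket (which follows from hashing the normalized content), and that the two sides cannot overlap.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip()


def assign(text, eval_frac=0.35, salt="fewshot-v3"):
    h = hashlib.sha256((salt + normalize(text)).encode("utf-8")).hexdigest()
    return "eval" if int(h[:8], 16) / 0xFFFFFFFF < eval_frac else "fewshot"


records = [
    "Escalate to a human when the customer mentions legal action.",
    "escalate to a human when the customer mentions LEGAL ACTION!!",
    "Ask for the order number before checking shipment status.",
]

# Deterministic: a second pass gives the same answer, no seed involved.
assert [assign(r) for r in records] == [assign(r) for r in records]

# Cosmetic variants normalize to the same string, so they share a bucket.
assert assign(records[0]) == assign(records[1])

# Each record gets exactly one side, so the sides are disjoint by construction.
eval_side = {normalize(r) for r in records if assign(r) == "eval"}
fewshot_side = {normalize(r) for r in records if assign(r) == "fewshot"}
assert not (eval_side & fewshot_side)

# Rotating the salt is a one-argument change; the partition may differ.
print([assign(r, salt="fewshot-v4") for r in records])
```

The disjointness assert is cheap enough to keep in CI next to the audit, as a guard against someone reintroducing a second sampling path.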

The number after the fix

Rebuilt the index from the few-shot side only. Re-ran the 600-case eval.

0.79.

What I'd check first

Print one real prompt from a failing eval run and read it end to end. Not the template. The rendered string, with the retrieved examples in it. If the eval case appears in its own prompt, stop and go fix that before you interpret another score.
Diff the provenance of the eval set against the few-shot pool. Not the contents, the source. If both trace back to the same file, table, or index, assume overlap until a hash join says otherwise, and treat any shared parent as a leak that has not been found yet.
Suspect the believable number, not just the perfect one. 1.00 gets investigated. 0.94 gets a slide. Ask what score you would have accepted without checking, and go check that one first.

We gated CI on six open-source LLM eval frameworks. Only two survived the merge queue.

Ethan Walker — Tue, 14 Jul 2026 16:20:26 +0000

TL;DR. Most "top open-source LLM eval framework" roundups rank features. None of them ask the one question a merge queue cares about: does this gate pass or fail the same way twice. I wired six of these frameworks into a real GitHub Actions merge queue and ran them against production PRs for about eight months. The ones that gate cleanly share a single property: deterministic checks that return an exit code in seconds, with LLM-as-judge scores kept as non-blocking signals. The ones that flake share the opposite: nearly every metric is a judge call, so the queue blocks on a number that drifts. Ranked by "survived our merge queue," Promptfoo and DeepEval came out ahead. The short list first, then per-tool notes, then when you should not gate on any of them.

The outage that set the ranking

Two years ago I put an LLM-as-judge metric on our merge queue with a 0.8 threshold. It looked clean in the demo. Three weeks later it blocked fourteen PRs and a release over a weekend, because the judge scored the same unchanged output 0.83 on Friday and 0.78 on Monday. Same prompt. Same model. No seed. I killed the gate at 1am from my phone and we shipped fine. The regression it was "protecting" us from never existed.

That is the lens for this whole piece. A CI gate has one job: fail when something broke, pass when it did not, and do it the same way every time. An eval framework can carry the best metrics in the world and still be a bad gate if those metrics wobble. The feature-ranked listicles miss this because a notebook never punishes you for nondeterminism. A merge queue does, at 1am, in front of the whole team.

I am not arguing that quality evals are useless. I run plenty of them. I am arguing that a merge queue is a specific, unforgiving place, and the tool that belongs there is not always the tool with the longest metric list. A gate that blocks a clean PR trains your team to force-merge past it, and once the team routinely force-merges past a gate, it has stopped protecting anything. It only adds a click everyone has learned to ignore. That is the failure mode I now rank against first.

What "survived the merge queue" means

Five things, in the order they bit me:

Determinism. Same input, same verdict. A judge call with no fixed seed is not deterministic, and no threshold tuning will fully save you.
Speed. If one eval run adds four minutes and you merge forty PRs a day, you have bought a queue backup.
Cost. Judge-graded metrics burn tokens per run. Multiply by PR count. Some months that is a real invoice nobody budgeted for.
Wiring effort. Does the tool return an exit code, or do I hand-build the pass/fail logic around it.
Signal quality. When it fails, does it point at the actual regression, or just report a lower number and shrug.

Everything below is graded on that, not on how good the metrics look in a demo. Two of those five are about determinism and its side effects, because that is what cost me the most sleep. The other three (speed, wiring, signal) are what decide whether the gate is worth keeping once it works.
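
The speed and cost lines are worth turning into numbers before arguing about thresholds. A back-of-envelope sketch: the four minutes and forty PRs come from the list above, while the per-run judge cost and the working days per month are placeholder assumptions to replace with your own.

```python
def queue_budget(minutes_per_run, prs_per_day, judge_cost_per_run, workdays=21):
    """Return (added eval minutes per day, judge spend per month)."""
    minutes_per_day = minutes_per_run * prs_per_day
    monthly_spend = judge_cost_per_run * prs_per_day * workdays
    return minutes_per_day, monthly_spend


# 4 min/run and 40 PRs/day are from the text; 0.25 per run is a made-up figure.
minutes, spend = queue_budget(4, 40, judge_cost_per_run=0.25)
print(f"{minutes} eval-minutes/day, {spend:.2f} judge spend/month")
# 160 eval-minutes/day, 210.00 judge spend/month
```

This counts one run per PR and ignores re-runs, so read both figures as a floor.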

How I ran it: each tool guarded the same small golden set (about 60 input/expected pairs for a support-answer feature) inside the same GitHub Actions workflow, blocking on merge. I rotated them one at a time and watched three numbers: the flake rate on unchanged inputs, the added minutes per run, and the token bill at the end of the month. I kept a metric in the blocking path only if it never flipped a verdict on an input that had not changed. That rule alone reshuffled the list.

All versions, licenses, and metric counts are as of mid-2026. Check each repo before you rely on any of it, because all of this moves.

At a glance

1. Promptfoo (MIT). CI hook: a CLI exit code, JSON output, and a maintained GitHub Action. Metrics: dozens of assertions, deterministic and graded. LLM-judge: optional. Best for CLI gating in any stack.
2. DeepEval (Apache-2.0). CI hook: a pytest wrapper (deepeval test run). Metrics: 20-plus. LLM-judge: default for most. Best for pytest-based Python gates.
3. Future AGI (Apache-2.0). CI hook: call evaluate() in your own harness. Metrics: 50-plus (local plus hybrid judge). LLM-judge: optional. Best for an eval SDK you drive from Python or TypeScript.
4. RAGAS (Apache-2.0). CI hook: call evaluate() in a script. Metrics: around a dozen RAG-specific. LLM-judge: yes, most. Best for RAG quality measurement.
5. Arize Phoenix (Elastic License 2.0). CI hook: run_evals() in a script. Metrics: a handful of evaluators. LLM-judge: yes. Best for tracing plus eval in one tool.
6. MLflow evaluate (Apache-2.0). CI hook: mlflow.evaluate() in a script. Metrics: a dozen-plus (heuristic plus genai). LLM-judge: optional. Best for eval logged alongside experiments.

Repo links and install lines are in each section. Now the details.

The pattern that decided it

Once all six were in the same workflow, the ranking stopped being about metric quality. It came down to one split: does the blocking check call a model, or not. Deterministic checks (string match, JSON-schema, regex, exact match) passed and failed the same way every run, finished in under a second, and cost nothing. Judge checks did not. Every judge-based gate I ran drifted near its threshold at least once over the eight months, and two of them blocked a clean PR at least once. So the tools that let me put deterministic assertions in the blocking path, and push judge scores into an advisory lane, came out ahead. The tools built around a judge-first metric set fell behind, not because the metrics are weak, but because a score that moves between identical runs cannot hold a merge gate. Keep that split in mind as you read the six. It explains the whole order, including why a well-known RAG library sits below a younger tool, and why the tracking and observability tools sit at the bottom even though their metrics are fine.
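
That split maps onto a small amount of harness code. A sketch, not the API of any of the six tools: run_gate, the two check lists, and the stubbed judge_score are all hypothetical names. Deterministic checks decide the exit code; the judge score is printed and never blocks.

```python
import re


def mentions_refund_window(output):
    return "30 days" in output


def no_unrendered_template(output):
    # A leftover {{var}} means the prompt template was not filled in.
    return re.search(r"\{\{.*?\}\}", output) is None


def judge_score(output):
    """Stub for an LLM-as-judge call (assumption); a real one varies run to run."""
    return 0.78


BLOCKING = [mentions_refund_window, no_unrendered_template]   # decide the verdict
ADVISORY = [judge_score]                                      # reported only


def run_gate(output):
    failed = [check.__name__ for check in BLOCKING if not check(output)]
    for check in ADVISORY:
        print(f"advisory {check.__name__}={check(output):.2f}")
    for name in failed:
        print(f"FAIL {name}")
    return 1 if failed else 0   # exit code for the merge queue


exit_code = run_gate("You can request a refund within 30 days.")
print("exit code:", exit_code)
```

The stubbed 0.78 sits under the 0.8 threshold from the outage story, and the gate still exits 0, which is the behavior the advisory lane is there to provide.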

1. Promptfoo

Repo: github.com/promptfoo/promptfoo. License: MIT. Install: npm install -g promptfoo (or run it with npx promptfoo).

What it is. A command-line eval and red-teaming tool. You describe prompts, providers, and test cases in a YAML file, run one command, and get a pass/fail table. As of mid-2026 it ships dozens of built-in assertions split into two camps: deterministic ones (contains, equals, regex, is-json, starts-with, cost, latency) and model-graded ones (llm-rubric, factuality, answer-relevance, similarity by embedding).

How it gates CI. promptfoo eval returns a nonzero exit code the moment an assertion fails. That is the whole ballgame for a merge queue: a nonzero exit is a red check, no glue code required. It writes JSON, CSV, or HTML you can archive as a build artifact, and there is a maintained GitHub Action that comments the eval diff on the PR so a reviewer sees exactly what changed. In my run the deterministic assertions never flaked across the eight months. The only week the queue wobbled was one where I let an llm-rubric assertion sit in the blocking path, and the fix was to move it back out.

# promptfooconfig.yaml  ->  run in CI with: npx promptfoo eval -c promptfooconfig.yaml
prompts:
  - "Answer the support question: {{question}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      question: "What is the refund window?"
    assert:
      - type: contains
        value: "30 days"        # deterministic: no judge, no flake, sub-second
      - type: llm-rubric
        value: "states a clear refund window"   # graded: costs a call, can flake

Strengths. The deterministic assertions are the reason it sits at the top. contains and is-json do not flake, cost nothing, and run in milliseconds. It is language-agnostic: a Python shop, a Go shop, and a TypeScript shop all wire it the same way, because it is a CLI that speaks exit codes. Config lives in version control next to the code it guards, so a bad gate change shows up in the same diff.

Limits. The YAML sprawls once you pass a few dozen cases, and there is no type safety net until you run it. The model-graded assertions carry the same judge nondeterminism as everything else here, so if you lean on llm-rubric to block merges you have reintroduced the flake you came to avoid. A Python-first team also takes on a Node dependency it may not otherwise want in the CI image.

Best for. Gating prompt and output regressions in any stack, as long as you keep the blocking assertions deterministic and treat the graded ones as advisory.

2. DeepEval

Repo: github.com/confident-ai/deepeval. License: Apache-2.0. Install: pip install deepeval.

What it is. A Python eval framework built to feel like pytest. As of mid-2026 it carries 20-plus metrics, including G-Eval (a rubric metric you define in plain language), answer relevancy, faithfulness, hallucination, contextual precision and recall, plus safety metrics like bias and toxicity. Most of them are LLM-judged.

How it gates CI. You write a normal-looking test file and run deepeval test run test_file.py. Under the hood it wraps pytest, so you inherit pytest's exit codes and its JUnit XML reporter for free. If your CI already understands pytest, it already understands DeepEval. That is the shortest path to a green check for a Python team in this whole list. In my run the pytest wiring took under an hour to stand up. The flake, when it came, came entirely from the judge-based metrics drifting near their thresholds, not from the harness.

# test_support_bot.py  ->  run with: deepeval test run test_support_bot.py
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The judge is an LLM, so the score is not deterministic.
# A 0.7 gate that scores 0.71 today can score 0.68 tomorrow on the same input.
CASES = [
    ("What is the refund window?", "You can request a refund within 30 days."),
    ("What is the refund window?", ""),   # empty output: the edge case that bites
]

@pytest.mark.parametrize("query,answer", CASES)
def test_answer_relevancy(query, answer):
    metric = AnswerRelevancyMetric(threshold=0.7)
    case = LLMTestCase(input=query, actual_output=answer)
    # Empty actual_output does not raise; it scores near 0 and fails the gate,
    # which is correct. The flaky judge near the threshold is the real risk.
    assert_test(case, [metric])

Strengths. The pytest wrapping is the cleanest on-ramp here for a Python team. Per-metric thresholds are explicit and live in code. The catalog is the widest in this list, so you rarely hand-roll a metric, and G-Eval lets you encode a house rule ("must cite a ticket id") without writing a scorer from scratch.

Limits. Most headline metrics are judge calls, so the flakiness story from my outage applies directly: keep thresholds loose or your queue pays for it. It nudges you toward Confident AI, the hosted product from the same team, once you want dashboards and shared datasets. And a wide catalog is a wide surface to keep pinned, because a judge-model upgrade can shift scores under a fixed threshold with no code change on your side.

Best for. Python teams that already gate on pytest and want deterministic wiring, as long as you pick a few metrics and treat the judge-based ones with suspicion near the threshold.

3. Future AGI

Repo: github.com/future-agi/future-agi. License: Apache-2.0. Install: pip install ai-evaluation.

What it is. An open-source eval SDK (fi.evals). As of mid-2026 its README lists 50-plus evaluation metrics plus guardrail scanners, behind one evaluate() call, with Python and TypeScript clients. It is one piece of a larger open-source platform, but for CI gating only the eval SDK matters, so that is all I put in front of the merge queue.

How it gates CI. There is no dedicated test runner. You construct an Evaluator, call evaluate() over your inputs, and write your own assert on the returned scores, the same shape as RAGAS below. One wrinkle worth knowing for a gate: the documented quickstart authenticates with an API key and runs against the hosted service, so the naive setup puts a network call in your blocking path. The SDK also supports local metric execution, which is the mode you want for a merge gate, because a deterministic local metric does not flake the way a judge does and does not add a per-run token bill. Turn the hybrid judge on and you inherit the same nondeterminism as everyone else in this list. The actual time sink was the harness code I had to write around evaluate() myself, because nothing here hands you a runner.

# pip install ai-evaluation   (module: fi.evals)
from fi.evals import Evaluator
# documented quickstart authenticates with API keys (hosted execution)
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
result = evaluator.evaluate(eval_templates="...", inputs={...})   # signature abbreviated
# no test runner: read the score off `result` and assert it in your own gate

Strengths. When you run its metrics in local mode, a deterministic score is what a gate wants: same input, same score, no per-PR token bill. One evaluate() surface for both local and judge scoring keeps the harness small. The TypeScript client means a Node CI job can gate without shelling out to Python, which is rare in this list.

Limits. It is younger than DeepEval and Promptfoo and it shows. There is no first-class pytest plugin or JUnit reporter, so you write more of the harness yourself. The documented quickstart is hosted (API key), so you have to configure local execution yourself to keep a network call out of the blocking path. The community, examples, and CI recipes are thinner, which matters at 2am when the gate breaks and you are hunting for the one forum answer that does not exist yet. On raw metric breadth and CI-native ergonomics it does not beat DeepEval, and it is not the most mature option on this list. If you want a tool that gates straight out of the box, this is not the shortest path today.

Best for. Teams that already have a CI harness and want fast local metrics to call from it, in Python or TypeScript, without paying a judge-token bill on every PR.

4. RAGAS

Repo: github.com/explodinggradients/ragas. License: Apache-2.0. Install: pip install ragas.

What it is. A library focused on retrieval-augmented generation. As of mid-2026 it offers around a dozen RAG-specific metrics: faithfulness, answer relevancy, context precision, context recall, answer correctness, and a few newer ones. The metrics are well-researched and map cleanly onto the stages of a RAG pipeline.

How it gates CI. There is no runner. You build a dataset, call evaluate(), and get back a scores object. Turning that into a gate is on you: read the metric, compare to a threshold, raise.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# dataset is a HF Dataset with question / answer / contexts columns
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
assert scores["faithfulness"] >= 0.9   # you write the gate yourself

In my run the scores moved enough between two identical runs that I could not keep it in the blocking path, so it lived in a nightly job instead. Why it sits below a younger tool here: nearly every RAGAS metric is a judge call, and judge calls are the exact thing the merge queue punishes. This is not a knock on RAGAS as software. It is a statement about the narrow job of a hard merge gate.

Strengths. If your problem is specifically RAG quality, these are among the most thought-out metrics available, and the decomposition (retrieval versus generation) tells you where the regression lives, not just that one happened. For diagnosing a bad retrieval step, that split is worth a lot.

Limits. RAG-only by design. Judge-based, so nondeterministic and token-costly per run. Scores drift when the judge model or its version changes, which turns a green history red with no code change. You own all of the pass/fail plumbing.

Best for. Measuring RAG retrieval and generation quality, ideally in a nightly or pre-merge advisory job rather than a hard blocking gate.

5. Arize Phoenix

Repo: github.com/Arize-ai/phoenix. License: Elastic License 2.0 as of mid-2026 (source-available, not OSI-approved; confirm in the repo, because this kind of license has changed before). Install: pip install arize-phoenix.

What it is. Primarily an observability tool. It ingests OpenTelemetry traces of your LLM app and gives you a local UI to inspect them. It also ships phoenix.evals, a library with a handful of prebuilt LLM evaluators (hallucination, QA correctness, relevance, toxicity) and a run_evals harness.

How it gates CI. You can call run_evals in a script, get a dataframe of labels or scores back, and assert on the aggregate.

from phoenix.evals import run_evals, HallucinationEvaluator, OpenAIModel
# run_evals returns one dataframe per evaluator; unpack it, then assert on the aggregate  (API abbreviated)
(evals_df,) = run_evals(dataframe=df, evaluators=[HallucinationEvaluator(OpenAIModel())])
assert (evals_df["label"] == "factual").mean() >= 0.95

It works, but gating is not the design center. The product is built around the trace UI, where you sit and inspect runs. A blocking merge gate is not what it optimizes for. In my run I got more out of pointing it at captured traces than at a merge gate. As a blocker it was awkward, and I kept reaching for the UI to understand a failure instead of reading an exit code.

Strengths. If you also want tracing, this is the one tool here that does eval and observability under a single install, so your CI check and your production debugging speak the same vocabulary. The classification-style evaluators (a label, not a free-form score) are more gate-shaped than raw judge numbers.

Limits. The license is the first thing to run past legal, because source-available is not the same as open source and some orgs treat that line as a hard stop. Gating is a bolt-on, so you build the pass/fail yourself. The evaluators still lean on a judge, with the usual nondeterminism and token cost.

Best for. Teams that want tracing and eval together and will run evals mostly over captured traces, with CI gating as a secondary use.

6. MLflow LLM evaluate

Repo: github.com/mlflow/mlflow. License: Apache-2.0. Install: pip install mlflow.

What it is. mlflow.evaluate() is the LLM-eval entry point inside MLflow, the experiment-tracking platform. As of mid-2026 it offers a dozen-plus built-in metrics, split between heuristic ones (toxicity, reading-grade, exact match, ROUGE, token count, latency) and genai ones you build with make_genai_metric that call a judge.

How it gates CI. You call mlflow.evaluate() on a model or a static dataset, it returns a results object, and you read a metric off it and assert. The heuristic metrics are deterministic, which is the good news for gating. In my run the exact-match and ROUGE metrics held steady as a gate. The genai metrics behaved like every other judge here, and the run-and-experiment ceremony was more setup than a single CI check wanted.

import mlflow
# heuristic metrics are deterministic; genai metrics call a judge
results = mlflow.evaluate(data=eval_df, model_type="question-answering")
assert results.metrics["exact_match"] >= 0.9   # key names vary by version

Strengths. If you already live in MLflow, your eval numbers land next to your training runs and artifacts, with lineage, which is genuinely useful for audits and post-incident review. The heuristic metrics do not flake, so a gate built on exact match or ROUGE holds steady.

Limits. It is built for experiment tracking, not gating, so reducing a run to one clean pass/fail feels like fighting the grain. It expects a run and an experiment context, which is a lot of ceremony for a CI check. The genai metrics reintroduce judge nondeterminism, and MLflow is a heavier dependency to pull into a lean CI image than a single-purpose eval library.

Best for. Teams already standardized on MLflow that want eval logged alongside experiments, using the heuristic metrics for any actual blocking gate.

When not to gate CI on any of these

Picking the right tool does not mean you should gate at all. I have watched teams (mine included) reach for a merge gate when the real problem was upstream. Gating is not free, and sometimes it is the wrong move no matter which tool you pick:

You have no golden dataset. If you cannot say what the right answer looks like, an eval gate just encodes a guess and fails at random. Build the dataset first, gate second.
Your output is open-ended. Marketing copy, brainstorms, and open chat have no single correct answer for a judge to hit. Gate the structure (valid JSON, required fields present), not the quality.
The judge bill beats the value. Forty PRs a day times a multi-metric judge run is a line item. If nobody will defend that spend, move the eval to nightly.
No one owns the drift. Judge scores move when models update. If no human owns re-baselining, the gate rots into a check everyone force-merges past, which is worse than no gate at all.
Latency breaks your SLA. If the eval adds minutes and your team merges constantly, you have traded correctness theater for a queue backup.

In those cases the answer is usually an async nightly eval, an online eval on a canary, or a deterministic structural check in CI with the quality eval running out of band where it cannot page anyone.
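
For the open-ended case, the structural check can be this small. A sketch assuming the model is asked to return a JSON object; the field names and types in REQUIRED are hypothetical.

```python
import json

REQUIRED = {"subject": str, "body": str, "tags": list}   # hypothetical schema


def structural_errors(raw):
    """Return a list of problems; an empty list means the structure is valid."""
    try:
        data = json.loads(raw)
    except ValueError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    errors = []
    for field, expected in REQUIRED.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    return errors


good = '{"subject": "Spring sale", "body": "Anything goes here.", "tags": ["promo"]}'
bad = '{"subject": "Spring sale", "tags": "promo"}'
print(structural_errors(good))   # []
print(structural_errors(bad))    # ['missing field: body', 'tags should be list']
```

Nothing here judges whether the copy is any good. It only fails when the output would break whatever consumes it, which is the part a merge gate can decide the same way twice.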

What I'd check first

Run the gate 20 times on one unchanged input. If the pass/fail flips even once, treat it as an advisory signal and make it non-blocking.
Time a single run and multiply by your daily PR count. If the minutes or the judge-token bill blow your queue budget, move it to nightly before you tune a single threshold.
Delete your flakiest metric and see whether one deterministic assertion (regex, JSON-schema, contains) catches the same regression. It usually does, and it never pages you at 1am.
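
The first check is a loop. A sketch with a simulated judge, since a real one needs a model call: make_noisy_gate and its 0.80 mean and 0.03 spread are invented for illustration, chosen to echo a score drifting around a 0.8 threshold.

```python
import random


def flip_count(gate, runs=20):
    """Run a pass/fail gate repeatedly on one input; count verdict changes."""
    verdicts = [gate() for _ in range(runs)]
    flips = sum(a != b for a, b in zip(verdicts, verdicts[1:]))
    return verdicts, flips


def make_noisy_gate(mean=0.80, spread=0.03, threshold=0.8, seed=0):
    """Simulated judge (assumption): a score drawn around `mean` on every call."""
    rng = random.Random(seed)
    return lambda: rng.gauss(mean, spread) >= threshold


def deterministic_gate():
    return "30 days" in "You can request a refund within 30 days."


for name, gate in [("judge", make_noisy_gate()), ("contains", deterministic_gate)]:
    verdicts, flips = flip_count(gate)
    lane = "advisory" if flips else "blocking"
    print(f"{name}: {sum(verdicts)}/{len(verdicts)} passes, {flips} flips -> {lane}")
```

Against a real judge, replace make_noisy_gate with a function that scores the same stored output each time. The deterministic gate reports 20/20 passes and zero flips by construction.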

The golden set stopped catching regressions the day traffic changed

Ethan Walker — Mon, 13 Jul 2026 05:02:20 +0000

TL;DR. Our overall eval pass rate read 0.88 through a model change and looked stable. Sliced by request language, German had fallen to 0.60 while English held near 0.90. The aggregate hid that because German was a rounding error inside the golden set even though it had grown into almost a quarter of real traffic. A bigger golden set does not fix this. Slicing every run by the production distribution, and refreshing the set from real traffic, does.

The dashboard stayed green while users complained

We shipped a prompt change and a model bump on the same afternoon. The eval ran against the golden set. Overall pass rate moved from 0.90 to 0.88. A two-point move sits at our run-to-run noise floor, so we shipped and moved on.

Four days later support forwarded a cluster of complaints, all of them German. Truncated sentences. Wrong register, formal where it should have been plain. English words leaking into German answers. The eval had flagged none of it. The number was still 0.88, green as ever.

Here is the shape of the set that produced that green. The golden set held 400 cases. We built it eighteen months earlier when the product was English-only, so roughly 370 of those cases were English and about 30 were German, added later as an afterthought. Meanwhile German requests in production had climbed from a rounding error to nearly a quarter of traffic after a market launch that quarter. So the eval was 7% German while production was closer to 22% German. A hard regression on German could move the aggregate by about two points and still be a live fire for a large and growing group of real users.

The golden set encoded last year's traffic, not this quarter's, and the single number it produced averaged over a distribution we no longer had.

Why does an average hide a slice?

An aggregate pass rate is a weighted average, and the weights are whatever mix of cases you happened to freeze into the set. If the set is 90% English and English holds at 0.90, the aggregate sits near 0.90 no matter what the other languages do. German at 0.60 carrying a 7% weight pulls the overall down by roughly 0.02. You will read 0.02 as noise. On most runs you would be right, and that is exactly what makes it dangerous.
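
Plugging in the numbers from this incident makes the masking concrete. The golden-set counts (370 English, 30 German) and the pass rates are the ones above; the 78/22 production mix assumes English and German are the only two languages, which is a simplification.

```python
def aggregate(pass_rate, weights):
    """Weighted-average pass rate; weights are case counts or traffic shares."""
    total = sum(weights.values())
    return sum(pass_rate[k] * w for k, w in weights.items()) / total


pass_rate = {"en": 0.90, "de": 0.60}
golden_mix = {"en": 370, "de": 30}          # cases frozen into the golden set
production_mix = {"en": 0.78, "de": 0.22}   # assumed two-language traffic split

print(f"{aggregate(pass_rate, golden_mix):.2f}")       # 0.88
print(f"{aggregate(pass_rate, production_mix):.2f}")   # 0.83
```

Same per-language pass rates, different weights: the golden mix reproduces the 0.88 on the dashboard, while the production mix puts the experienced pass rate about five points lower.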

Two things had to be true at the same time for this to bite, and both were true for us.

The slice regressed. The change helped English and hurt German at once. That is more common than people expect. A prompt edited and spot-checked against English examples can shift tokenization, instruction-following, and register in another language that nobody re-read before the merge. One model swap can lift your largest slice and quietly drop a smaller one on the same commit.

The slice grew. German had gone from a sliver of real traffic to almost a quarter of it, but the eval set never tracked the change. The group that mattered most in production was represented by the fewest cases in the test. The faster a slice grows in the wild, the more badly a stale set under-weights it.

Put those two facts together and the aggregate becomes an average over the wrong distribution. It gives a confident answer to a question we had stopped asking a year earlier. The same trap sits behind any slice key, not just language: a new input-length bucket, a big tenant you just onboarded, an intent that spiked after a UI change. Whichever slice grew fastest since you froze the set is the one the aggregate is now lying to you about.

The check: passrate_by_slice

The measurement fix is small and boring, which is the point. Tag every eval case with the slice keys you care about (language, input-length bucket, tenant, intent), compute pass rate per slice, take the delta against the overall on every run, and sort by that delta so the worst slice lands at the top of the output where you cannot scroll past it.

Here is the whole thing. Standard library, no dependencies, runs on plain Python 3.

import collections

def passrate_by_slice(rows, slice_key, pass_key="passed"):
    total, passed = collections.Counter(), collections.Counter()
    for r in rows:
        s = r.get(slice_key, "unknown")
        total[s] += 1
        passed[s] += 1 if r.get(pass_key) else 0
    overall = sum(passed.values()) / max(1, sum(total.values()))
    rows_out = []
    for s in total:
        pr = passed[s] / total[s]
        rows_out.append((s, total[s], round(pr, 3), round(pr - overall, 3)))
    return round(overall, 3), sorted(rows_out, key=lambda x: x[3])

# one eval run, sliced by request language. the model regressed on 'de' only.
run = ([{"lang": "en", "passed": True}] * 90 + [{"lang": "en", "passed": False}] * 10 +
       [{"lang": "de", "passed": True}] * 60 + [{"lang": "de", "passed": False}] * 40)

overall, table = passrate_by_slice(run, "lang")
print("overall pass rate:", overall)
for s, n, pr, delta in table:
    print(f"  {s:6} n={n:4}  pass={pr:.3f}  delta_vs_overall={delta:+.3f}")

Run it and you get:

overall pass rate: 0.75
  de     n= 100  pass=0.600  delta_vs_overall=-0.150
  en     n= 100  pass=0.900  delta_vs_overall=+0.150

The overall 0.75 is the only number a flat gate would have looked at. The de row is the one that matters: pass 0.600, a delta of minus 0.150 against the aggregate, sitting first because the sort pushes the worst slice to the top. English reads fine at 0.900. Same eval, same run, two very different stories, and only one of them was visible before the slice existed.

This toy is exaggerated on purpose. A 0.15 slice gap is loud, and real ones rarely are. In production the per-slice deltas are small and they jitter a little between runs from sampling alone, so a single run in isolation will not tell you much. The signal is a slice delta that drifts in one direction across several runs while the aggregate holds flat. Track each slice run over run and alert on the movement, not on the absolute gap in any one snapshot. A slice that slid from minus 0.01 to minus 0.06 over three runs is worth a look even though 0.06 by itself looks like nothing.
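One way to alert on movement rather than on a single snapshot, sketched under assumptions: `history` holds each slice's delta_vs_overall per run, oldest first, and the defaults (three runs, 0.04 total drop) are placeholders I picked for the example, not tuned values.

```python
def drifting_slices(history: dict[str, list[float]],
                    min_runs: int = 3,
                    min_total_drop: float = 0.04) -> list[str]:
    """Return slices whose delta-vs-overall fell on every step of the last
    min_runs runs and lost at least min_total_drop in total."""
    flagged = []
    for name, deltas in history.items():
        recent = deltas[-min_runs:]
        if len(recent) < min_runs:
            continue  # not enough runs to call it a trend
        falling = all(b < a for a, b in zip(recent, recent[1:]))
        if falling and (recent[0] - recent[-1]) >= min_total_drop:
            flagged.append(name)
    return flagged

# delta_vs_overall per slice, oldest run first (made-up numbers)
history = {
    "de": [-0.01, -0.03, -0.06],   # slides one way
    "en": [0.01, -0.01, 0.01],     # jitters around zero
}
print(drifting_slices(history))
```

Here only "de" is flagged: it fell on every run, while "en" bounced around zero.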

Refresh the set from traffic, not from memory

Slicing tells you where a regression is hiding. It does not close the second gap, which is that a frozen set drifts away from production every week it sits still. The set is fixed at the moment you built it. The traffic it is supposed to represent keeps shifting. The distance between the two only grows, and every point of drift is another slice the aggregate is quietly mis-weighting.

So refresh on a cadence. Every sprint, sample recent production traffic stratified by the same slice keys, label it, and fold a fresh batch into the set. Keep a frozen core of regression cases you never want to break again, the specific failures you have already paid for once. Add current cases that reflect the mix you actually serve today. Retire cases for intents you have dropped. The set should track the live distribution, not embalm an old one.

Reweighting buys you most of the protection before you label a single new case. Score the aggregate against the current production mix instead of the historical set mix. If German is 22% of traffic this month, weight German at 22% of the number. A static set that is 7% German is quietly asserting that German is 7% of your risk, and it is wrong by a factor of three. The reweight is one dictionary of production shares, refreshed whenever the mix moves.
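The reweight can be sketched in a few lines. The per-slice pass rates are the ones from this post. The 0.78 English share is my assumption: it treats English and German as the only two languages in production.

```python
def reweighted_pass_rate(slice_pass: dict[str, float],
                         prod_share: dict[str, float]) -> float:
    """Score the run against the production mix instead of the set mix."""
    total_share = sum(prod_share.values())   # normalize in case shares do not sum to 1
    return sum(slice_pass[s] * share for s, share in prod_share.items()) / total_share

slice_pass = {"en": 0.90, "de": 0.60}   # per-slice pass rates from the eval run
prod_share = {"en": 0.78, "de": 0.22}   # assumed production mix this month

print(round(reweighted_pass_rate(slice_pass, prod_share), 3))
```

With these numbers the reweighted score lands near 0.83 rather than 0.88. A slice that shows up in traffic but has no cases in the set raises a KeyError here, which is a useful failure: it means the set has nothing to say about that slice at all.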

None of this is heavy. The slice function is twenty lines. The refresh is a weekly job that pulls a stratified sample plus a short human pass to accept or reject cases. The reweight is a lookup table. Against that you are weighing a week of a broken language behind a green dashboard, which is what the old setup actually cost us. The corpus you evaluate against has to move at the speed your traffic moves, or the number it gives you ages out from under you.

What I'd check first

The slice mix versus reality. Compare your set's language, length, tenant, and intent breakdown to last week's real traffic, and treat a wide divergence as the tell.
The per-slice deltas across runs. Diff the slice output over your last two or three runs; a slice falling while the overall holds flat is the regression.
The age of the cases. If most of the set predates your last launch, you are grading last year's product and the green number has already expired.

Your LLM-as-judge disagrees with itself between runs

Ethan Walker — Wed, 08 Jul 2026 19:51:30 +0000

Same outputs, same judge, two runs, two scores. The gate flickered red then green on a branch with zero code changes, and that flapping cost me more trust than any real regression.

The flap

I had a faithfulness gate on merge: judge scores every case, the mean has to clear 0.80. One Tuesday it failed at 0.79. I re-ran the identical job, no code change, no prompt change, and it passed at 0.82. Ran it a third time: 0.80 exactly. Nothing in the repo had moved. The judge was disagreeing with itself.

A gate that returns a different verdict on the same inputs is worse than no gate. People stop believing the red, they re-run until it goes green, and now the check is a slot machine you pull until it pays out. The regression it was supposed to catch could sail through on the lucky pull. So before I trusted that gate again I had to make the judge reproducible enough to stand on.

Where the jitter comes from

Four sources, in the order they bit me.

Sampling temperature. A judge call is a generation. If temperature is above zero the model samples, and a borderline case lands on 4/5 one time and 3/5 the next. This is the biggest lever and the easiest to miss because most SDK defaults are not zero.

Model version drift. "gpt-4o" or "claude-latest" is a moving alias. The provider ships a new snapshot, your scores shift a few points overnight, and you blame your prompt. Pin the dated snapshot, not the floating name.

Prompt ambiguity. If your rubric says "rate helpfulness 1 to 5" without anchoring what a 3 versus a 4 means, the model resolves the ambiguity differently each call. Vague rubrics convert directly into variance.

Tie-breaking. When the judge is genuinely on the fence between two scores, tiny sampling noise decides, and that decision is exactly where your threshold tends to sit.

Make it reproducible enough to gate

You will not get bit-identical determinism from a hosted model. That is fine. The goal is not zero noise, it is noise small enough that a threshold crossing means a real change and not a coin flip. Five things got me there.

Temperature 0, and a seed where the provider supports one. This alone collapsed most of my flap. Seeds help further on providers that honor them, but do not assume a seed gives you exact reproducibility across a model update.

Pin the exact judge model and prompt version in the cache key. Same discipline as any eval cache: the score is only reusable if the input, the judge snapshot, and the rubric version all match. Bump the version string whenever you touch the rubric.

Average over k judged samples, or take majority vote. One call is a sample from a distribution. k calls and a mean (or a vote for pass/fail rubrics) shrink the standard error of your estimate by roughly sqrt(k), assuming the calls are independent. I run k=5.

Quantize the score. If you gate on a continuous 0 to 1, every hundredth flaps. Round to a coarse grid (0.0, 0.25, 0.5, 0.75, 1.0) per case so sub-grid noise stops moving the aggregate.

Version the judge prompt as code. The rubric lives in the repo, gets a version string, and changes go through review. A judge prompt edited in a UI and not tracked is a silent score change you cannot bisect.
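For the cache-key item above, a minimal sketch of what "pin it in the key" can look like. The function name, the model string, and the rubric version strings are all placeholders I made up; substitute your own dated snapshot and version scheme.

```python
import hashlib
import json

def judge_cache_key(output: str, reference: str,
                    judge_model: str, rubric_version: str) -> str:
    """Hash everything the score depends on; change any part, get a new key."""
    payload = json.dumps(
        {"output": output, "reference": reference,
         "judge_model": judge_model, "rubric_version": rubric_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# hypothetical snapshot name and rubric versions
k1 = judge_cache_key("answer", "ref", "judge-2026-01-15", "faithfulness-v3")
k2 = judge_cache_key("answer", "ref", "judge-2026-01-15", "faithfulness-v4")
print(k1 == k2)  # False: bumping the rubric version invalidates the cached score
```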

The gate that respects the noise band

The real fix is conceptual: stop treating one judged score as ground truth. Judge k times, keep the mean and the spread, and only fail when the mean is below the threshold by more than the noise you actually measured. If the mean sits inside the noise band around the threshold, that is not a regression, it is jitter, and failing on it is how you get a flapping gate.

import statistics
from typing import Callable

def stable_judge_score(
    judge: Callable[[str, str], float],
    output: str,
    reference: str,
    k: int = 5,
    quantize_to: float = 0.25,
) -> tuple[float, float]:
    """Run the judge k times at temperature 0. Return (mean, stdev),
    each raw score snapped to a coarse grid to kill sub-grid jitter."""
    scores = []
    for _ in range(k):
        raw = judge(output, reference)          # judge must be called at temperature 0
        snapped = round(raw / quantize_to) * quantize_to
        scores.append(snapped)
    mean = statistics.fmean(scores)
    stdev = statistics.pstdev(scores) if k > 1 else 0.0
    return mean, stdev


def gate(mean: float, stdev: float, threshold: float = 0.80) -> bool:
    """Fail only when the mean is below threshold by more than the
    observed noise. Inside the noise band counts as pass, not a flap."""
    return mean >= (threshold - stdev)


if __name__ == "__main__":
    # toy judge: deterministic here, real one hits an LLM at temperature 0
    def fake_judge(out: str, ref: str) -> float:
        return 0.79 if "borderline" in out else 0.9

    mean, stdev = stable_judge_score(fake_judge, "a borderline answer", "ref", k=5)
    passed = gate(mean, stdev, threshold=0.80)
    print(f"mean={mean:.3f} stdev={stdev:.3f} pass={passed}")
    raise SystemExit(0 if passed else 1)   # this exit code is what CI reads

The raise SystemExit is the load-bearing line. That is what makes the branch protection rule refuse a real regression. Everything above it exists so that exit code means something. On my suite, moving from one raw judge call to k=5 with quantization took the run-to-run swing on that faithfulness metric from about 0.03 down to under 0.01, which was finally tight enough that a red meant a real drop and people stopped re-running to dodge it.

One caution on k: more samples cost more judge calls and more wall-clock, so I only spend the k on the cases near the threshold, and run the obviously-passing and obviously-failing cases once. The noise only matters where the decision is close.
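A sketch of spending k only where the decision is close. The `margin` of 0.10 is an assumption for the example; it has to be wider than your single-call noise, or a noisy first call can wave a borderline case through as "obvious".

```python
import statistics
from typing import Callable

def adaptive_score(judge: Callable[[str, str], float], output: str, reference: str,
                   threshold: float = 0.80, margin: float = 0.10,
                   k: int = 5) -> tuple[float, int]:
    """Judge once; spend the remaining k-1 calls only if the first score
    lands within `margin` of the threshold. Returns (mean, calls_used)."""
    first = judge(output, reference)
    if abs(first - threshold) > margin:
        return first, 1                      # clearly passing or clearly failing
    scores = [first] + [judge(output, reference) for _ in range(k - 1)]
    return statistics.fmean(scores), k

def fake_judge(out: str, ref: str) -> float:   # stand-in for a real judge call
    return 0.79 if "borderline" in out else 0.95

print(adaptive_score(fake_judge, "a borderline answer", "ref"))
print(adaptive_score(fake_judge, "a clear answer", "ref"))
```

The borderline case uses all five calls; the clear one stops after the first.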

What I'd check first

Log temperature on the judge call. If it is not zero, nothing else you do about jitter matters until it is.
Diff the judge model string between the run that passed and the run that failed. A floating alias silently swaps a snapshot on you more often than you would think.
Measure the run-to-run stdev of your gated metric before you trust the gate. If the swing is wider than the margin your threshold sits on, you are gating on noise and the red is meaningless.
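The last check can be sketched as a one-function comparison. It is a rough heuristic, not a statistical test: it asks whether the mean of repeat runs sits further from the threshold than one sample standard deviation. The three scores are the ones from the flap earlier in this post.

```python
import statistics

def gate_is_meaningful(run_scores: list[float], threshold: float) -> bool:
    """True when the typical score sits further from the threshold than
    the run-to-run noise, so a crossing is more than jitter."""
    noise = statistics.stdev(run_scores)     # sample stdev across repeat runs
    margin = abs(statistics.fmean(run_scores) - threshold)
    return margin > noise

# three repeat runs on an unchanged branch
print(gate_is_meaningful([0.79, 0.82, 0.80], threshold=0.80))
```

For those three runs the answer is False: the swing is several times wider than the distance to the threshold.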

When an LLM answer is wrong, the trace is where you look. Some tools make that easy.

Ethan Walker — Tue, 07 Jul 2026 17:26:04 +0000

A user reports a hallucinated answer in prod. To fix it you need the full trace of that one request, and how fast you can pull it depends entirely on the tracing you set up months earlier.

The ticket

A support user pasted a screenshot: our agent told them a refund window was 90 days. The real policy is 30. Wrong answer, confidently stated, already sent. The ticket had a request id in the response headers and nothing else.

The only useful question at that point is: what actually happened inside that one request. Which chunks did retrieval pull? What was the exact prompt the model saw after templating? What did each tool call return? A wrong answer is almost never the model being creative. It is usually a bad chunk, a stale document, a tool that returned the wrong row, or a prompt that got assembled wrong. You cannot see any of that from the output. You have to open the trace for that specific request id and read the spans.

The axis that matters

For debugging a single bad output, I care about two things. First, given a request id, how fast can I pull that one request's complete trace: retrieved chunks, templated prompt, tool args and returns, token counts, per-span latency. Second, is the tracing OpenTelemetry-native, so the spans drop into the collector and backend I already run, instead of locking me into a proprietary SDK and a second dashboard.

That second point is not aesthetic. When tracing is OTel-native, an LLM span sits in the same trace as the HTTP handler, the vector DB call, and the downstream service. One trace id, request to response. When it is proprietary, the LLM half lives in a separate tool and you are stitching timelines by hand at 2am.

Six tools, by how they capture

Ordered by how they get spans out of your app, not by any ranking.

Helicone (github.com/Helicone/helicone) is the fastest to turn on. You point your model's base URL at their proxy and every call gets logged, no per-span instrumentation. That one-line integration is the appeal. The tradeoff is granularity: a gateway sees the request and response it proxies, so your retrieval step and internal tool calls (which never hit the proxy) do not show up as spans unless you instrument them separately. Great for "what did the model get sent", thinner for the chunk that poisoned it.

LangSmith (smith.langchain.com) gives you the richest single-request view if you live in LangChain or LangGraph. Chains, tool nodes, and retriever steps show up already structured, and the waterfall for one run is genuinely good for reading a bad output. The catch is that the tracing is fairly proprietary. You are sending to their backend through their SDK, and pulling those spans into your own OTel collector is not the native path.

Langfuse (github.com/langfuse/langfuse) is open source and OTel-aware: it exposes an OpenTelemetry endpoint, so OTel spans can land there, and it also has its own SDK and decorators. Self-hostable if you want the data in your own infra. Reading one request means opening the trace by id and walking the observation tree, which shows the prompt, the retrieved context you logged, and per-step latency and tokens.

Future AGI (github.com/future-agi/future-agi) approaches tracing as one surface of a broader platform that also covers evaluation, prompt work, and guardrails, and its tracing library is OpenTelemetry-native, so spans flow through the standard OTel path into a backend you can point at your own stack. For the debugging job the useful part is that a wrong answer's trace carries the retrieved context and tool IO as spans on the same trace id, which is what you open when a request id is all you have from the ticket.

Braintrust (braintrust.dev) centers its logging around its Eval object. Traces are first-class, and the strong version of the workflow is: you catch a bad output, and turn that exact request into a test case in the same tool. If your loop is debug-then-lock-with-an-eval, that tight coupling helps. If you just want raw request-level tracing decoupled from their eval abstraction, it is more opinionated than a plain OTel backend.

Arize Phoenix (github.com/Arize-ai/phoenix) is open source and OpenTelemetry-native through OpenInference, so LLM, retriever, and tool spans use standard OTel semantics and flow into your collector. You can run it locally for a single debugging session or against a persistent backend. Opening one request means filtering to its trace id and reading the span tree, with the retrieved documents and tool calls attached as span attributes.

Three of these (Langfuse, Phoenix, Future AGI) are open source and multi-surface. Two of the six (Phoenix, Future AGI) are OTel-native by design; Langfuse supports OTel alongside its own SDK. All of the above is as of mid-2026, and this space ships fast, so check the current docs before you commit.

Reading one request

Whatever the backend, the move is the same: get the trace id (from the request id you logged), pull the trace, walk the spans, look at retrieval and tool IO first. Here is the instrumentation side, plain OpenTelemetry, so the LLM span carries the attributes you will actually want at 2am.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-agent")


def answer(request_id: str, question: str) -> str:
    # retrieve, build_prompt, and call_model are your application's own functions
    # request_id ties this trace back to the support ticket
    with tracer.start_as_current_span("answer") as span:
        span.set_attribute("request.id", request_id)
        span.set_attribute("input.question", question)

        chunks = retrieve(question)                    # own child span inside
        span.set_attribute("retrieval.chunk_ids", [c.id for c in chunks])
        span.set_attribute("retrieval.chunk_count", len(chunks))

        prompt = build_prompt(question, chunks)
        span.set_attribute("llm.prompt", prompt)       # the EXACT text the model saw

        out = call_model(prompt)                       # sets llm.tokens on its span
        span.set_attribute("output.answer", out)
        return out

The one attribute that saves you every time is llm.prompt holding the fully templated string, not the template. On the refund bug, that is where it fell out: the retrieved chunk was from an old policy doc, retrieval.chunk_ids pointed straight at it, and the llm.prompt attribute showed the model was handed "90 days" as context. The model was not hallucinating. It was faithfully repeating a stale chunk. Total time from request id to root cause was about seven minutes, and six of those were me finding the request id in our own logs.

If your spans do not carry the retrieved chunk ids and the templated prompt, no backend will save you. You will be staring at an input and an output with the interesting part missing.
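The reading side, sketched against an assumed shape: spans already exported as plain dicts with trace_id, name, and attributes keys. Real backends expose this as a query or a UI filter, so treat the dict layout, the span names, and the chunk id below as hypothetical.

```python
def spans_for_request(spans: list[dict], request_id: str) -> list[dict]:
    """Find the trace carrying request.id, then return all of its spans,
    retrieval and tool spans first since that is where to look first."""
    trace_ids = {s["trace_id"] for s in spans
                 if s.get("attributes", {}).get("request.id") == request_id}
    matched = [s for s in spans if s["trace_id"] in trace_ids]
    look_first = ("retriev", "tool")
    # sorted() is stable: False sorts before True, original order otherwise kept
    return sorted(matched, key=lambda s: not s["name"].startswith(look_first))

# made-up exported spans for two requests
spans = [
    {"trace_id": "t1", "name": "answer", "attributes": {"request.id": "req-123"}},
    {"trace_id": "t1", "name": "llm.call", "attributes": {}},
    {"trace_id": "t1", "name": "retrieve",
     "attributes": {"retrieval.chunk_ids": ["policy-old-7"]}},
    {"trace_id": "t2", "name": "answer", "attributes": {"request.id": "req-999"}},
]
for s in spans_for_request(spans, "req-123"):
    print(s["name"], s["attributes"])
```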

What I'd check first

Pull the trace by request id and open the retrieval span first. Wrong chunk in equals wrong answer out, and this is the most common cause.
Read llm.prompt as the fully templated string, not the template. A prompt assembled wrong looks fine in code and obvious in the span.
Confirm the LLM span shares one trace id with the HTTP and vector-DB spans. If the LLM half lives in a separate proprietary tool, you are stitching timelines by hand, and that is the setup problem to fix before the next incident.

A 94% pass rate hid a PII leak in 6 test cases

Ethan Walker — Sun, 05 Jul 2026 17:02:23 +0000

Our eval dashboard said 94%. Green checkmark, merge button unlocked, everyone moved on. Three days later a customer forwarded us a transcript where our support agent had pasted another user's account ID and partial billing address into a response. Not a jailbreak, not adversarial input, just a normal support query where the agent's tool-calling step grabbed the wrong record and included it verbatim in a "helpful" summary.

We went back to the eval run that had passed. Out of 512 test cases, 31 failed for one reason or another (phrasing too verbose, wrong tone, minor factual softening). Six of those 31 failures were the PII leak pattern. Six out of 512 is a rounding error against a flat pass-rate metric. It's also, in my opinion, the only failure category in that entire run that should have blocked the deploy on its own.

That's the problem with a single threshold on a flat pass rate: it assumes all failures cost the same. They don't. A verbose answer costs you a slightly annoyed user. A PII leak costs you a disclosure obligation and possibly a very bad week. Averaging them into one number is a category error, and I'd guess most teams running LLM-as-judge pipelines are making it right now without realizing it, because a flat pass rate is the default output of every eval framework I've used (DeepEval, Promptfoo, and LangSmith all give you this by default; none of them force severity weighting on you).

[IMAGE: https://lh3.googleusercontent.com/d/1Xf1IIcHzCOOSS4EsN5wW5PPS_fVxTZ9U]

Why flat pass rate hides small severe clusters

Think about the arithmetic. If you have 500 test cases and your threshold is "pass rate must be at least 90%," you can absorb 50 failures before the gate trips. If your failure distribution is mostly benign (wrong tone, slightly long response, minor formatting), the gate is well calibrated for that. But the moment even a handful of those 50 allowed failures belong to a catastrophic category (irreversible action taken, PII disclosed, a factual claim that could cause real financial harm), a flat threshold has no way to tell the difference. It just counts.

The other failure mode we hit: our test suite was itself imbalanced. We had roughly 40 test cases probing tone and style for every 1 test case probing PII handling or destructive tool calls, because tone issues are easy to write test cases for and someone had clearly optimized for coverage breadth over risk coverage. So even a perfect recall on the severe category could get statistically drowned by the benign category in an aggregate score.

The fix: weight by blast radius, gate on the weighted score

We now tag every eval test case with a severity level at write time, not after an incident forces us to retrofit it:

Severity 1 (nitpick): phrasing, tone, formatting. Annoying, not harmful.
Severity 2 (moderate): factually wrong but correctable, no action taken, no data exposed.
Severity 3 (severe): PII/PHI disclosure, irreversible action (refund issued, account deleted, email sent), or a claim that could cause the user financial or safety harm.

Then we compute both the flat pass rate (for visibility, it's still a useful trend line) and a severity-weighted score, and we gate CI on the weighted score, not the flat one.

"""
severity_gate.py
Severity-weighted eval scoring. Computes both the flat pass rate
and a blast-radius-weighted score, and fails CI on the weighted score
even when the flat rate looks healthy.
"""
from dataclasses import dataclass

SEVERITY_WEIGHTS = {1: 1, 2: 4, 3: 20}  # tune per your own risk tolerance


@dataclass
class TestCaseResult:
    name: str
    correct: bool
    severity: int  # 1, 2, or 3


def flat_pass_rate(results: list[TestCaseResult]) -> float:
    return sum(r.correct for r in results) / len(results)


def severity_weighted_score(results: list[TestCaseResult]) -> float:
    """
    Returns a score in [0, 1]. Each failure subtracts its severity weight
    from the achievable total, so one severity-3 failure costs as much
    as 20 severity-1 failures.
    """
    total_weight = sum(SEVERITY_WEIGHTS[r.severity] for r in results)
    earned_weight = sum(SEVERITY_WEIGHTS[r.severity] for r in results if r.correct)
    return earned_weight / total_weight


def run_severity_gate(results: list[TestCaseResult], weighted_threshold: float = 0.98) -> None:
    flat = flat_pass_rate(results)
    weighted = severity_weighted_score(results)
    severe_failures = [r for r in results if not r.correct and r.severity == 3]

    print(f"flat pass rate: {flat:.3f}")
    print(f"severity-weighted score: {weighted:.4f} (threshold {weighted_threshold})")
    if severe_failures:
        print(f"severity-3 failures ({len(severe_failures)}):")
        for r in severe_failures:
            print(f"  - {r.name}")

    if weighted < weighted_threshold:
        raise SystemExit(
            f"SEVERITY GATE FAILED: weighted score {weighted:.4f} "
            f"below {weighted_threshold}, despite flat pass rate {flat:.3f}"
        )


if __name__ == "__main__":
    # Reconstruction of the run that shipped the PII leak, anecdotally,
    # from our postmortem numbers (512 cases, 31 failures, 6 severity-3)
    results = (
        [TestCaseResult(f"tone_{i}", correct=True, severity=1) for i in range(420)]
        + [TestCaseResult(f"tone_fail_{i}", correct=False, severity=1) for i in range(20)]
        + [TestCaseResult(f"fact_{i}", correct=True, severity=2) for i in range(56)]
        + [TestCaseResult(f"fact_fail_{i}", correct=False, severity=2) for i in range(5)]
        + [TestCaseResult(f"pii_{i}", correct=True, severity=3) for i in range(5)]
        + [TestCaseResult(f"pii_fail_{i}", correct=False, severity=3) for i in range(6)]
    )
    run_severity_gate(results, weighted_threshold=0.98)

Run against our reconstructed postmortem numbers, flat pass rate comes out to about 0.94 (481 of 512), which is exactly what shipped. The severity-weighted score comes out 0.823, well under a 0.98 threshold, because those six severity-3 failures each cost 20x what a tone nitpick costs. That gate would have blocked the merge.

Picking the weights honestly

I'll flag the obvious weak point: the weights in SEVERITY_WEIGHTS are a judgment call, not a derived constant. We set severity-3 at 20x severity-1 after arguing about it for most of an afternoon, using rough numbers from what a support escalation and a compliance review actually cost us in engineering hours the last time something like this happened. Another team might reasonably land on 10x or 50x. What matters isn't the exact ratio, it's that the ratio is explicit and versioned in the repo instead of implicit in whoever eyeballs the dashboard that week.
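To see how much the exact ratio matters, here is a small sketch that recomputes the weighted score for the reconstructed run under the three ratios mentioned above. The per-severity counts come from the reconstruction in severity_gate.py; the helper name `weighted_score` is mine.

```python
# (total cases, passing cases) per severity, from the reconstructed run
COUNTS = {1: (440, 420), 2: (61, 56), 3: (11, 5)}

def weighted_score(weights: dict[int, int]) -> float:
    """Earned weight over achievable weight, same formula as the gate."""
    total = sum(weights[sev] * n for sev, (n, _) in COUNTS.items())
    earned = sum(weights[sev] * ok for sev, (_, ok) in COUNTS.items())
    return earned / total

for sev3 in (10, 20, 50):
    score = weighted_score({1: 1, 2: 4, 3: sev3})
    print(f"severity-3 weight {sev3:>2}x -> weighted score {score:.3f}")
```

All three ratios land well under 0.98 (roughly 0.87, 0.82, and 0.72), so for this run the gate's verdict does not hinge on which one you argue your way to.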

We also had to fix the test suite imbalance separately. Weighting doesn't help if you only have 6 severity-3 test cases total and one of them is flaky. We're up to 40 severity-3 cases now, covering PII handling, destructive tool calls, and financial claims, added deliberately rather than as an afterthought to coverage metrics that were optimized for breadth.

What I'd check first

Pull your last 90 days of eval runs and bucket every individual failure by consequence, not by test category name. If you've never done this, you likely don't know your severity distribution.
Check whether your CI threshold is a flat number. If it is, ask what ratio of severe-to-total failures your current threshold can silently absorb before it trips.
Look at how many of your test cases actually probe severe/irreversible outcomes versus tone and phrasing. If it's lopsided toward the easy-to-write category, your aggregate score is measuring the wrong thing more than it's measuring the right one.

Your CI passed. Your agent isn't operator-ready.

Ethan Walker — Wed, 01 Jul 2026 17:07:40 +0000

Your CI passed. Your agent isn't operator-ready.

We shipped a document-extraction agent to an enterprise customer last quarter. Twelve-week eval. 94% pass rate on our test suite. Three weeks into the pilot, it started generating refunds for invoices it couldn't parse. Silently. No error. No trace. Just wrong output that looked like right output.

Our CI was green the entire time.

The issue was not the model. It was not the prompt. It was the six percent of inputs we hadn't tested, arriving as the first thing an actual operator's data sent our way.

That's not an edge case. That's what operator-ready means in practice.

What "production-ready" means vs. what "operator-ready" means

Production-ready is an infrastructure concept. Your service is up. It handles load. It restarts on crash. Logs go somewhere. Alerts exist.

Operator-ready is different. It means your agent can be handed to someone who did not build it, running on data you did not design it for, making decisions that have real consequences if they're wrong.

The distinction matters because most eval pipelines are designed for the first. They measure pass rate on a test set. They don't measure what happens when an operator's input distribution is 30% different from your test set, which it always is.

The three gaps that bite in operator handoffs

Gap 1. Validation theater

A Pydantic model with 97% validation success sounds good. Here's what it hides.

The 3% that fail: what does your agent do? If your retry loop fills missing fields with model-inferred defaults, you've built a silent wrong-answer machine. The schema passed. The output is wrong. And you have no log entry flagging it.

Fix: separate the "schema valid" signal from the "content confidence" signal. Log field-level confidence alongside the output. An output is not trusted until both are above threshold.

We added a field_confidence dict to every extraction response. Low-confidence fields trigger a human-review flag, not a retry. That alone caught 14 of the 18 incidents in our first operator month.
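A minimal sketch of that split, assuming the extractor already reports a per-field confidence between 0 and 1. The `ExtractionResult` shape, the `triage` helper and the 0.85 threshold are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; tune it against manually reviewed samples.
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class ExtractionResult:
    fields: dict
    field_confidence: dict
    schema_valid: bool
    needs_human_review: bool = False
    low_confidence_fields: list = field(default_factory=list)


def triage(fields, field_confidence, schema_valid, threshold=CONFIDENCE_THRESHOLD):
    """Trust an output only if the schema passed AND every field is confident."""
    # A field with no reported confidence counts as 0.0, so it gets flagged.
    low = sorted(
        name for name in fields if field_confidence.get(name, 0.0) < threshold
    )
    return ExtractionResult(
        fields=fields,
        field_confidence=field_confidence,
        schema_valid=schema_valid,
        # Flag for review instead of retrying: a retry can paper over the gap.
        needs_human_review=(not schema_valid) or bool(low),
        low_confidence_fields=low,
    )


result = triage(
    fields={"invoice_total": "1240.00", "currency": "EUR"},
    field_confidence={"invoice_total": 0.41, "currency": 0.98},
    schema_valid=True,
)
print(result.needs_human_review, result.low_confidence_fields)
```

The schema passed here, but the output is still routed to review because one field fell under the threshold. That is the case a validation-success percentage cannot see.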

Gap 2. Adversarial input handling

Your test set was built by you or your team. It covers the cases you thought of. An operator's data covers the cases they didn't tell you about.

In our case: multi-page invoices with embedded scanned PDFs. Our test suite had only single-page invoices. The agent handled the multi-page ones differently, and "differently" meant "wrong" in ways our eval never measured.

This is not a parsing bug. It's a distribution shift. The correct response is not to fix the parser. It's to test against a sample of the actual operator's data before going live.

Before any operator handoff, we now require 50 documents from the operator's own corpus run through the agent, with manual review of outputs. Not synthetic data. Not our test set. Theirs.

That one change caught the scanned-PDF issue before the pilot started.
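One way to turn the manual review into a number is a field-level error rate computed the same way on both corpora. This is a sketch under assumptions: `field_error_rate` is a hypothetical helper, the reviewer's corrected values serve as the reference, and the toy documents stand in for the 50 real ones.

```python
def field_error_rate(predictions, references):
    """Share of reviewed fields where the agent's value differs from the reviewer's.

    predictions and references are equal-length lists of dicts, one per document.
    """
    total = wrong = 0
    for predicted, reference in zip(predictions, references):
        for name, expected in reference.items():
            total += 1
            if predicted.get(name) != expected:
                wrong += 1
    return wrong / total if total else 0.0


# Toy data only; real input would be the manually reviewed operator documents.
test_set_rate = field_error_rate(
    [{"total": "10.00", "vendor": "Acme"}],
    [{"total": "10.00", "vendor": "Acme"}],
)
operator_rate = field_error_rate(
    [{"total": "10.00", "vendor": "Acme"}, {"total": None, "vendor": "Initech"}],
    [{"total": "10.00", "vendor": "Acme"}, {"total": "99.50", "vendor": "Initech"}],
)
gap = operator_rate - test_set_rate
print(f"test set {test_set_rate:.2f}, operator corpus {operator_rate:.2f}, gap {gap:.2f}")
```

The gap between the two rates is the figure worth reporting before go-live. The absolute test-set rate says nothing about the operator's documents.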

Gap 3. The audit log that doesn't log what matters

Every engineer's first logging setup captures: what the model returned. Almost nobody logs: what the model decided not to do.

For an operator deploying an extraction agent inside a compliance workflow, the question isn't just "what did the agent output." It's also: "did the agent flag this document as low confidence," "did it skip any fields," "did it trigger any fallback paths."

If you can't answer those questions from the trace, you can't support the operator when something goes wrong. And something will go wrong.

Minimum viable operator audit trail:

Output with field-level confidence scores
Fallback path indicator (did it retry? did it degrade?)
Input hash (so you can replay the exact document)
Model version and prompt version at inference time (not just "gpt-4o", the specific deployment)

We built this into a standard trace schema and started injecting it into every response. The overhead is negligible. The debuggability improvement is significant.
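A sketch of what such a trace record could look like, covering the four items above plus the skipped fields. The field names, the fallback labels and the prompt version string are assumptions for illustration, not our actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TraceRecord:
    input_sha256: str      # identifies the exact document so it can be looked up and replayed
    output: dict
    field_confidence: dict
    skipped_fields: list   # what the agent decided NOT to extract
    fallback_path: str     # hypothetical labels: "none", "retry", "degraded"
    model_version: str     # the concrete deployment id, not a moving alias
    prompt_version: str
    timestamp_utc: str


def build_trace(document_bytes, output, field_confidence, skipped_fields,
                fallback_path, model_version, prompt_version):
    return TraceRecord(
        input_sha256=hashlib.sha256(document_bytes).hexdigest(),
        output=output,
        field_confidence=field_confidence,
        skipped_fields=skipped_fields,
        fallback_path=fallback_path,
        model_version=model_version,
        prompt_version=prompt_version,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
    )


trace = build_trace(
    document_bytes=b"%PDF-1.7 example bytes",
    output={"invoice_total": "1240.00"},
    field_confidence={"invoice_total": 0.41},
    skipped_fields=["due_date"],
    fallback_path="degraded",
    model_version="gpt-4o-2024-08-06",
    prompt_version="extract-v12",  # hypothetical version label
)
print(json.dumps(asdict(trace), sort_keys=True))
```

The hash only identifies the input. Replaying it still requires that the original document is stored somewhere keyed by that hash.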

The pre-operator checklist I actually use

Before handing an agent to any operator, I run through this:

Run 50+ samples from the operator's actual data, not our test set. Measure field-level error rate on their corpus specifically. If there's a gap between their corpus accuracy and your test-set accuracy, that gap is your risk.

Search logs for the last 30 days for any output that passed schema validation but triggered downstream errors. These are your silent failures. Fix them before the operator sees them.

Intentionally feed malformed inputs. Verify the agent degrades to a safe fallback, not a wrong output. "I cannot parse this document" is better than a wrong invoice total.

Confirm you can answer "what did the agent do on document X at timestamp Y" in under 5 minutes. If you can't, your audit trail is incomplete and you're not operator-ready regardless of your eval score.

Check the agent's permission scope. Does it have access to resources it doesn't need for this operator's use case? The principle of least privilege applies to agents too.
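The malformed-input check in the list above can be run as a small loop. In this sketch `extract_total` is a stand-in for the real agent call and the input shapes are invented for illustration. The point is only the assertion: every malformed input must come back as an explicit refusal, never as a value.

```python
import math

# Hypothetical malformed payloads; build yours from real operator failures.
MALFORMED_INPUTS = [
    {},
    {"total": None},
    {"total": "twelve"},
    {"total": "nan"},
    {"total": ["10.00"]},
]


def extract_total(raw):
    """Stand-in for the agent: return a total or an explicit refusal, never a guess."""
    refusal = {"status": "unparseable", "invoice_total": None, "needs_human_review": True}
    try:
        amount = float(raw["total"])
    except (KeyError, TypeError, ValueError):
        return refusal
    if not math.isfinite(amount):
        return refusal
    return {"status": "ok", "invoice_total": amount, "needs_human_review": False}


for raw in MALFORMED_INPUTS:
    result = extract_total(raw)
    # The safe fallback: no total at all beats a wrong total.
    assert result["status"] == "unparseable" and result["invoice_total"] is None, raw

print(extract_total({"total": "1240.00"}))
```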

The number that actually matters

Our eval pass rate was 94%. Our operator-handoff error rate in month one was 8%.

Those two numbers can coexist because they're measuring different things against different data.

After we added the three changes above (field confidence, operator corpus testing, full audit trail), the month-two operator error rate dropped to 1.4%. The eval pass rate barely moved (95%).

The eval score was not the problem. The eval scope was.

What I'd check first

If you've shipped an agent and you're about to hand it to an operator, here's the three-line diagnostic:

Can you answer "what did the agent decide NOT to do on this input" from your trace? If no, your audit trail is incomplete.
Have you run the agent on at least 50 documents from the operator's actual corpus? If no, your pass rate is a test-set metric, not an operator reliability estimate.
What happens when your agent receives input outside its schema? If the answer is "it retries and fills defaults," you have a silent wrong-answer path. Change it to "it flags for human review."

Operator-ready is not a CI check. It's a claim about how the agent behaves on someone else's data, making decisions with real consequences. The eval suite gets you close. These three checks get you there.

The stale eval fixture that passed a broken model

Ethan Walker — Mon, 29 Jun 2026 17:13:40 +0000

A regression shipped green last month. The eval suite ran in CI, scored 0.94, the gate passed, we merged. Two days later support flagged that the summariser had started dropping the final line of multi-part answers. The eval should have caught it. The eval had not actually run on the new behaviour. It scored a cached result from three commits earlier, and the cache key was wrong.

This is the eval-infra bug nobody warns you about, because it only shows up after you optimise for speed. The eval itself was fine. The caching around it lied.

Why the cache existed

Our eval suite makes model calls, and model calls are slow and cost money. On a 600-case suite with an LLM-judge pass, a full run was about nine minutes and a few dollars. Running that on every push, including doc-only commits, was wasteful, so we cached: if nothing that affects a case's result changed, reuse the previous score.

That is the right instinct. The bug was in the definition of "nothing that affects the result changed."

The cache key that was missing an input

Our key was a hash of two things: the test input (the prompt variables for that case) and the prompt template. If both matched a prior run, we served the cached score.

Here is what the key did not include: the model snapshot. We pinned the model by an alias in config, and when we bumped that alias to a new dated snapshot, the prompt template and the test inputs were byte-for-byte identical. Same key. The cache served scores generated by the old model for a suite running against the new one. The new model had the regression. The cache had the old model's clean scores. Green.

The rule a cache key has to obey is simple to say and easy to get wrong: the key must include every input that can change the output. For an eval case that is at least the test input, the prompt template, the model identity (the dated snapshot, not the alias), the judge model identity if you grade with one, and the eval config that controls scoring. Miss any one and a change to that input silently reuses a stale result.
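The collision is easy to reproduce. The sketch below is a reconstruction, not our original code: `stale_key` accepts the model snapshot but never hashes it, and the two snapshot ids are only illustrative.

```python
import hashlib
import json


def stale_key(case_vars, prompt_template, model_snapshot):
    # Reconstruction of the bug: model_snapshot is accepted but never hashed.
    payload = {"input": case_vars, "prompt": prompt_template}
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


def complete_key(case_vars, prompt_template, model_snapshot):
    # Same hash, with the model identity included in the payload.
    payload = {"input": case_vars, "prompt": prompt_template, "model": model_snapshot}
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


case_vars = {"question": "Summarise this multi-part answer."}
template = "Summarise: {question}"
old, new = "gpt-4o-2024-05-13", "gpt-4o-2024-08-06"  # illustrative snapshot ids

# Same key across the bump, so the old model's score is served for the new model.
print(stale_key(case_vars, template, old) == stale_key(case_vars, template, new))        # True
# Different keys once the snapshot is part of the hash.
print(complete_key(case_vars, template, old) == complete_key(case_vars, template, new))  # False
```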

The fix, as a key function

This is the part you can lift. The cache key is a hash over the full tuple of result-affecting inputs, and the model identity is resolved to its concrete snapshot before hashing, not left as the floating alias.

import hashlib
import json

def eval_cache_key(case, prompt_template, model_snapshot, judge_snapshot, eval_config):
    # model_snapshot / judge_snapshot are the resolved dated ids
    # (e.g. "gpt-4o-2024-08-06"), NEVER the moving alias ("gpt-4o").
    payload = {
        "input": case["vars"],
        "prompt": prompt_template,
        "model": model_snapshot,
        "judge": judge_snapshot,
        "eval_config": eval_config,   # thresholds, rubric, metric set
        "schema": 2,                  # bump to invalidate everything on purpose
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

Two things that matter more than they look:

sort_keys=True so the hash is stable regardless of dict ordering. Without it the "same" inputs produce different keys and you cache nothing, which is the opposite failure but still a failure.
The schema integer. When you change the cache logic itself, or you just want to force a clean rerun, bump it. It is a manual kill switch for the whole cache that does not require deleting files.

And resolve the alias to the snapshot at the top of the run, once:

# Wrong: model id is the alias, so a provider-side snapshot bump is invisible.
model = "gpt-4o"

# Right: resolve to the concrete dated snapshot and key on THAT.
# resolve_snapshot is your own helper; how it looks up the id depends on your provider.
model_snapshot = resolve_snapshot("gpt-4o")  # -> "gpt-4o-2024-08-06"

Fail the cache closed, not open

The second half of the fix is what happens on a cache miss or an ambiguous state. Ours failed open: if anything about the cache lookup threw, we treated it as "no entry, but also do not block," and in one code path that quietly meant "pass." A cache is a performance optimisation. It must never be able to produce a green that a real run would not. On any miss, any error, any version mismatch, the correct behaviour is to run the eval for real. Slower is the acceptable failure. Green-by-accident is not.

We also added a cheap guard: the cache stores which model snapshot produced each score, and the runner asserts that the stored snapshot matches the current one before trusting any cached entry. If they differ, the entry is ignored and the case re-runs. That single assertion would have caught the original bug on its own.
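A minimal sketch of a fail-closed lookup with that snapshot guard. It assumes a dict-like cache and a `run_eval` callable that executes the case for real; both names and the entry layout are illustrative, not our runner's actual interface.

```python
def score_case(key, current_snapshot, cache, run_eval):
    """Return (score, source), where source is "cache" or "fresh"."""
    try:
        entry = cache.get(key)
        # Trust an entry only if it records the snapshot being tested now.
        if entry is not None and entry["model_snapshot"] == current_snapshot:
            return entry["score"], "cache"
    except Exception:
        # Fail closed: a broken lookup or malformed entry counts as a miss.
        pass

    score = run_eval()  # the slow, real path
    try:
        cache[key] = {"score": score, "model_snapshot": current_snapshot}
    except Exception:
        pass  # a failed write costs speed next time, never correctness
    return score, "fresh"


cache = {"k1": {"score": 0.94, "model_snapshot": "old-snapshot"}}
first = score_case("k1", "new-snapshot", cache, run_eval=lambda: 0.71)
second = score_case("k1", "new-snapshot", cache, run_eval=lambda: 0.71)
print(first, second)
```

The first call ignores the stale 0.94 because it was produced by a different snapshot and runs the case for real. The second call is served from the cache, since the entry now matches. No branch returns a score without either a verified entry or a real run.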

What it cost to find

The embarrassing number: the regression was live for nine days. Not because it was subtle in production (support caught it fast), but because when we went to the eval to confirm, the eval still said 0.94, so we spent two of those days looking everywhere except the cache. A gate that lies costs you more than a gate you do not have, because you trust it while it points you the wrong way.

What I'd check first

When an eval passes something that then breaks in production, before you touch the model or the rubric:

Confirm the eval actually executed on this commit's model. Look for a fresh model call in the run logs, not a cache hit. If every case is a cache hit, your suite did not test anything.
Diff the cache key inputs against what can change the output. If the model snapshot, judge, or eval config is not in the key, that is your stale-green source. Add it and bump the schema.
Check the miss path. Force a cache miss and confirm it runs the eval for real, not that it shrugs and passes. A cache that can fail open is a gate that can ship anything.
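The first check can be automated if the runner records where each score came from. This sketch assumes one "cache" or "fresh" label per case; the label names and `execution_summary` are hypothetical.

```python
from collections import Counter


def execution_summary(sources):
    """sources: one "cache" or "fresh" label per eval case in the run."""
    counts = Counter(sources)
    total = sum(counts.values())
    return {
        "total": total,
        "fresh": counts["fresh"],
        "cached": counts["cache"],
        # True when the run produced a score without a single real model call.
        "nothing_executed": total > 0 and counts["fresh"] == 0,
    }


summary = execution_summary(["cache"] * 600)
print(summary)
```

An all-cached run is legitimate for a doc-only commit, so treat `nothing_executed` as something to surface next to the score, not as an automatic failure.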