91% pass rate. Gate green. Shipped. Worst regression we had all quarter.

#testing #ci #ai #llm

The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.

The number that lied: 91%

The eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.

The users hitting that slice did not experience a 91%. They experienced a 74%.

What an absolute threshold actually measures

A static threshold answers one question: did the whole thing fall off a cliff. It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, a high floor hides the crater. You find out from a support ticket, not from CI.

The fix: gate on the delta, per slice

We stopped gating on an absolute number and started gating against the last passing run. Two rules, both have to hold:

No single slice drops more than 3 points versus baseline.
The aggregate drops no more than 1.5 points versus baseline.

def gate(current, baseline):
    failures = []
    for slice_name, score in current.slices.items():
        prev = baseline.slices.get(slice_name)
        if prev is not None and prev - score > 3.0:
            failures.append((slice_name, prev, score))
    if baseline.aggregate - current.aggregate > 1.5:
        failures.append(("AGGREGATE", baseline.aggregate, current.aggregate))
    return failures  # empty == pass

The refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.

The part that bites you: baseline management

Delta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time and you ratchet straight into a regression over two weeks. Slow drift is invisible to a gate that keeps moving its own goalposts.

So the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.

What I'd check first

Pull the variance across your last 5 green runs per slice. If one slice swings more than your delta threshold run-to-run, your threshold is noise, not signal.
Take your smallest slice and ask: how far can it drop before the aggregate notices. If the answer is "a lot," the aggregate is hiding it.
Confirm your baseline only advances on green main with a human in the loop. If it updates every run, you are not gating on drift, you are following it down.

Top comments (1)

Alex Shev • Jun 23

Aggregate pass rate is a dangerous comfort metric. The question is which 9 percent failed, what user path they touched, and whether the failure was recoverable. AI gates need severity weighting, not only a green percentage.