The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.
The number that lied: 91%
The eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.
The users hitting that slice did not experience a 91%. They experienced a 74%.
What an absolute threshold actually measures
A static threshold answers one question: did the whole thing fall off a cliff? It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, the aggregate stays above the floor and hides the crater. You find out from a support ticket, not from CI.
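A minimal sketch of that masking effect, assuming the aggregate is a weighted mean of per-slice pass rates. The slice names, weights, and scores here are hypothetical and chosen only to make the arithmetic easy to follow; they are not the numbers from the incident.

```python
def weighted_aggregate(scores, weights):
    """Weighted mean of per-slice pass rates (both dicts keyed by slice name)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical eval: weights are each slice's share of the eval set.
weights = {"billing": 0.50, "shipping": 0.40, "edge_cases": 0.10}
before = {"billing": 97.0, "shipping": 96.0, "edge_cases": 95.0}
after = dict(before, edge_cases=70.0)  # one small slice loses 25 points

agg_before = weighted_aggregate(before, weights)
agg_after = weighted_aggregate(after, weights)

# The fixed floor only looks at the aggregate, so the 25-point drop
# is scaled down by the slice's 10% weight before the gate sees it.
passes_fixed_gate = agg_after >= 90.0

print(f"aggregate before: {agg_before:.1f}")
print(f"aggregate after:  {agg_after:.1f}")
print(f"fixed 90% gate passes: {passes_fixed_gate}")
```

The slice fell 25 points, the aggregate moved about 2.5, and the fixed gate still passes. The smaller the slice, the more it can lose before the floor notices.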
The fix: gate on the delta, per slice
We stopped gating on an absolute number and started gating against the last passing run. There are two rules, and both have to hold:
- No single slice drops more than 3 points versus baseline.
- The aggregate drops no more than 1.5 points versus baseline.
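def gate(current, baseline):
    """Compare a run against the last passing baseline; return a list of failures."""
    failures = []
    # Rule 1: no slice may drop more than 3 points versus baseline.
    for slice_name, score in current.slices.items():
        prev = baseline.slices.get(slice_name)  # slices with no baseline entry are skipped
        if prev is not None and prev - score > 3.0:
            failures.append((slice_name, prev, score))
    # Rule 2: the aggregate may drop no more than 1.5 points versus baseline.
    if baseline.aggregate - current.aggregate > 1.5:
        failures.append(("AGGREGATE", baseline.aggregate, current.aggregate))
    return failures  # empty == pass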
The refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.
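To make that concrete, here is a self-contained sketch that runs the same two rules against the incident's numbers. The `EvalRun` container, the slice names, the `order_status` scores, and the 96.5 baseline aggregate (a point inside the 96-97% range) are assumptions for illustration. The second run, `masked`, is a hypothetical case where the aggregate barely moves.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRun:
    """Hypothetical container for one eval run: an aggregate plus per-slice scores."""
    aggregate: float
    slices: dict = field(default_factory=dict)


def gate(current, baseline, slice_delta=3.0, aggregate_delta=1.5):
    """Same two rules as above, with the thresholds exposed as parameters."""
    failures = []
    for slice_name, score in current.slices.items():
        prev = baseline.slices.get(slice_name)
        if prev is not None and prev - score > slice_delta:
            failures.append((slice_name, prev, score))
    if baseline.aggregate - current.aggregate > aggregate_delta:
        failures.append(("AGGREGATE", baseline.aggregate, current.aggregate))
    return failures


baseline = EvalRun(96.5, {"ambiguous_refund": 98.0, "order_status": 97.0})

# The incident: refund slice at 74, aggregate at 91.
incident = EvalRun(91.0, {"ambiguous_refund": 74.0, "order_status": 97.0})

# Hypothetical variant: same slice drop, but the aggregate only slips 1 point.
masked = EvalRun(95.5, {"ambiguous_refund": 74.0, "order_status": 97.0})

incident_failures = gate(incident, baseline)
masked_failures = gate(masked, baseline)

print(incident_failures)
print(masked_failures)
```

In the `masked` case the aggregate rule stays quiet and the per-slice rule still fires, which is the reason to keep both.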
The part that bites you: baseline management
Delta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time, and over two weeks you ratchet straight into a 7-point regression. Slow drift is invisible to a gate that keeps moving its own goalposts.
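A small simulation of that failure mode, as a sketch. It assumes a starting aggregate of 96.0 (hypothetical), a steady 0.5-point slide per day for 14 days, and the 1.5-point aggregate rule. The helper `first_failing_day` is illustrative, not part of any real gate.

```python
def first_failing_day(daily_scores, start, max_drop, rolling):
    """Return the 1-based day the delta gate first fails, or None if it never does.

    rolling=True  -> baseline follows every passing run (the broken setup)
    rolling=False -> baseline stays pinned at the last verified-good score
    """
    baseline = start
    for day, score in enumerate(daily_scores, start=1):
        if baseline - score > max_drop:
            return day
        if rolling:
            baseline = score  # the goalposts move with the regression
    return None


start = 96.0
scores = [start - 0.5 * day for day in range(1, 15)]  # 14 days of slow drift

rolling_result = first_failing_day(scores, start, max_drop=1.5, rolling=True)
pinned_result = first_failing_day(scores, start, max_drop=1.5, rolling=False)
total_drop = start - scores[-1]

print(f"rolling baseline fails on day: {rolling_result}")
print(f"pinned baseline fails on day:  {pinned_result}")
print(f"total drop after 14 days:      {total_drop}")
```

With a rolling baseline the gate never fires across the full 7-point slide. With a pinned baseline it fires on day 4, the first day the cumulative drop exceeds 1.5 points.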
So the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.
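One way to encode that policy, as a sketch under assumptions: the `Baseline` type, the `next_baseline` function, and the `approved_by` parameter are hypothetical names, and "drop" is interpreted here as any slice or aggregate score below the current baseline, including a slice that disappears.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Hypothetical record of a verified-good run."""
    aggregate: float
    slices: dict


def next_baseline(current, candidate, main_is_green, approved_by=None):
    """Return the baseline to gate against going forward.

    The candidate replaces the current baseline only when main is green,
    and any lowered score additionally needs a named human approver.
    """
    if not main_is_green:
        return current

    lowered = candidate.aggregate < current.aggregate or any(
        name not in candidate.slices or candidate.slices[name] < prev
        for name, prev in current.slices.items()
    )
    if lowered and approved_by is None:
        return current  # keep the old floor until someone signs off
    return candidate


good = Baseline(96.5, {"ambiguous_refund": 98.0})
better = Baseline(97.0, {"ambiguous_refund": 98.5})
worse = Baseline(96.0, {"ambiguous_refund": 95.0})

assert next_baseline(good, better, main_is_green=False) is good
assert next_baseline(good, better, main_is_green=True) is better
assert next_baseline(good, worse, main_is_green=True) is good
assert next_baseline(good, worse, main_is_green=True, approved_by="reviewer") is worse
```

A side effect of this choice is that ordinary run-to-run dips never lower the floor on their own, so the baseline behaves like a high-water mark unless a person deliberately resets it.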
What I'd check first
- Pull the variance across your last 5 green runs per slice. If one slice swings more than your delta threshold run-to-run, the gate is reacting to noise, not signal.
- Take your smallest slice and ask: how far can it drop before the aggregate notices? If the answer is "a lot," the aggregate is hiding it.
- Confirm your baseline only advances on green main with a human in the loop. If it updates every run, you are not gating on drift, you are following it down.
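The first two checks can be sketched in a few lines. This assumes each green run is available as a dict of slice name to score, and that the aggregate is a weighted mean so a slice's pull on it is proportional to its weight. The function names, slice names, and data are hypothetical.

```python
def noisy_slices(green_runs, delta_threshold=3.0):
    """Return {slice: spread} for slices whose max-min spread across
    known-good runs already exceeds the per-slice delta threshold."""
    names = set().union(*green_runs)
    noisy = {}
    for name in names:
        scores = [run[name] for run in green_runs if name in run]
        spread = max(scores) - min(scores)
        if spread > delta_threshold:
            noisy[name] = spread
    return noisy


def aggregate_headroom(slice_weight, aggregate_threshold=1.5):
    """Points one slice can drop, alone, before the aggregate moves by
    more than the threshold (assumes a weighted-mean aggregate)."""
    return aggregate_threshold / slice_weight


# Hypothetical last five green runs.
green_runs = [
    {"order_status": 97.0, "ambiguous_refund": 90.0},
    {"order_status": 96.5, "ambiguous_refund": 94.0},
    {"order_status": 97.5, "ambiguous_refund": 89.0},
    {"order_status": 97.0, "ambiguous_refund": 93.0},
    {"order_status": 96.0, "ambiguous_refund": 91.0},
]

noisy = noisy_slices(green_runs)
headroom = aggregate_headroom(slice_weight=0.05)

print(noisy)
print(f"a 5% slice can drop about {headroom:.0f} points before the aggregate rule trips")
```

If `noisy_slices` returns anything, either widen that slice's threshold or add eval examples to it until the spread settles. A large headroom value is the number that tells you the per-slice rule is doing the real work.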
Top comments (1)
Aggregate pass rate is a dangerous comfort metric. The question is which 9 percent failed, what user path they touched, and whether the failure was recoverable. AI gates need severity weighting, not only a green percentage.