DEV Community

Cover image for 91% pass rate. Gate green. Shipped. Worst regression we had all quarter.
Ethan Walker
Ethan Walker

Posted on

91% pass rate. Gate green. Shipped. Worst regression we had all quarter.

The gate was a fixed 90% threshold on an intent-classification eval. The change came in at 91%, cleared the bar, went out. A fixed pass-rate gate catches collapses, not drift. This was drift, and it walked right through.

The number that lied: 91%

The eval had sat at 96-97% for weeks. A retrieval change knocked one slice (ambiguous refund requests) from 98% to 74%. That slice is 4% of traffic, so the aggregate only fell to 91%. Above 90, so the gate stayed green. The aggregate did exactly what aggregates do: it averaged a real failure into noise.

The users hitting that slice did not experience a 91%. They experienced a 74%.

What an absolute threshold actually measures

A static threshold answers one question: did the whole thing fall off a cliff. It says nothing about whether a specific slice quietly got worse while everything else held it up. If 96 of your slices are fine and one craters, a high floor hides the crater. You find out from a support ticket, not from CI.

The fix: gate on the delta, per slice

We stopped gating on an absolute number and started gating against the last passing run. Two rules, both have to hold:

  1. No single slice drops more than 3 points versus baseline.
  2. The aggregate drops no more than 1.5 points versus baseline.
def gate(current, baseline):
    failures = []
    for slice_name, score in current.slices.items():
        prev = baseline.slices.get(slice_name)
        if prev is not None and prev - score > 3.0:
            failures.append((slice_name, prev, score))
    if baseline.aggregate - current.aggregate > 1.5:
        failures.append(("AGGREGATE", baseline.aggregate, current.aggregate))
    return failures  # empty == pass
Enter fullscreen mode Exit fullscreen mode

The refund slice dropping 24 points would have failed rule 1 on the first run, regardless of where the aggregate landed.

The part that bites you: baseline management

Delta gating breaks the moment your baseline drifts down with you. If the baseline updates on every run, a 0.5-point slide each day passes every single time and you ratchet straight into a regression over two weeks. Slow drift is invisible to a gate that keeps moving its own goalposts.

So the baseline updates only when main is green, and any intentional drop needs a human to approve it before it becomes the new floor. The baseline is a record of verified-good, not a record of most-recent.

What I'd check first

  • Pull the variance across your last 5 green runs per slice. If one slice swings more than your delta threshold run-to-run, your threshold is noise, not signal.
  • Take your smallest slice and ask: how far can it drop before the aggregate notices. If the answer is "a lot," the aggregate is hiding it.
  • Confirm your baseline only advances on green main with a human in the loop. If it updates every run, you are not gating on drift, you are following it down.

Top comments (1)

Collapse
 
alexshev profile image
Alex Shev

Aggregate pass rate is a dangerous comfort metric. The question is which 9 percent failed, what user path they touched, and whether the failure was recoverable. AI gates need severity weighting, not only a green percentage.