DEV Community

Maya Andersson

We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"

We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team started hunting for what "broke" the judge.

Then we bootstrapped confidence intervals on each weekly number. At our sample size (50 traces a week), the 95% intervals were roughly plus or minus 0.15. All three weekly estimates sat inside one another's intervals. The decline we had spent two days investigating was indistinguishable from noise.
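To make the overlap check concrete, here is a minimal sketch using the three point estimates above. It assumes one symmetric half-width of 0.15 for every week, which is a simplification: bootstrap intervals are generally not symmetric and the width varies a little from week to week. The function name `mutually_inside` is made up for this illustration.

```python
def mutually_inside(estimates, half_width):
    """True if every estimate lies inside every other estimate's interval."""
    # Assumption: a single symmetric half-width for all weeks (a simplification).
    intervals = [(k - half_width, k + half_width) for k in estimates]
    return all(lo <= k <= hi for k in estimates for lo, hi in intervals)


weekly_kappa = [0.55, 0.49, 0.44]  # the three weekly point estimates
print(mutually_inside(weekly_kappa, 0.15))  # prints True
```

With a band that wide, even the highest and lowest weeks (0.55 and 0.44) fall inside each other's intervals, so the data give no basis for calling the sequence a decline.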

What we changed

  1. Stratify the weekly sample by score band and intent instead of sampling uniformly. Rare-but-important slices stopped vanishing from some weeks, which had been a major source of week-to-week wobble.

  2. Report the interval, not the point. The dashboard shows the band. Nobody reacts to a movement smaller than the band. This alone has prevented at least two more pointless investigations.

  3. Escalate on sustained shifts only: consecutive weeks outside the prior band, not a single bad reading.
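A rough sketch of changes 1 and 3, under some assumptions that are not spelled out above: strata get a proportional share of the weekly budget with a floor of one trace each, "consecutive" means two weeks by default, and the prior band is treated as fixed. The field names `score_band` and `intent` and both function names are hypothetical; adapt them to however your traces are stored.

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, n_total, seed=0):
    """Draw roughly n_total traces, allocated proportionally across strata,
    taking at least one trace from every non-empty stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        strata[key(t)].append(t)
    sample = []
    for items in strata.values():
        share = round(n_total * len(items) / len(traces))
        k = min(len(items), max(1, share))  # floor of one keeps rare slices in
        sample.extend(rng.sample(items, k))
    return sample


def should_escalate(point_estimates, prior_band, min_weeks=2):
    """True if the last min_weeks estimates all fall outside prior_band."""
    lo, hi = prior_band
    recent = point_estimates[-min_weeks:]
    return len(recent) == min_weeks and all(k < lo or k > hi for k in recent)


# Toy data: 90 common traces and 10 rare ones (hypothetical fields).
traces = [{"score_band": "high", "intent": "billing"} for _ in range(90)]
traces += [{"score_band": "low", "intent": "refund"} for _ in range(10)]

week = stratified_sample(traces, lambda t: (t["score_band"], t["intent"]), 50)
print(len(week))  # 50: 45 from the common stratum, 5 from the rare one

print(should_escalate([0.55, 0.49, 0.44], (0.40, 0.70)))  # False
print(should_escalate([0.55, 0.35, 0.33], (0.40, 0.70)))  # True
```

Because of the rounding and the floor of one, the total can drift slightly from `n_total` when there are many small strata. If you oversample rare strata on purpose, reweight before computing an overall kappa.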

The part that surprised me

How rare this practice is. Most eval dashboards I have seen show single kappa or accuracy numbers with no uncertainty at all, and teams retune judges off moves of 0.05. We would never accept that for an A/B test; somehow it became normal for eval metrics.

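import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci(judge, human, n_boot=2000, alpha=0.05, seed=None):
    """Percentile-bootstrap (1 - alpha) confidence interval for Cohen's kappa."""
    judge, human = np.asarray(judge), np.asarray(human)  # accept plain lists too
    rng = np.random.default_rng(seed)  # pass a seed for reproducible intervals
    n = len(judge)
    stats = []
    for _ in range(n_boot):
        s = rng.integers(0, n, size=n)  # resample traces with replacement
        stats.append(cohen_kappa_score(judge[s], human[s]))
    # nanpercentile skips any resample where kappa comes back as NaN
    # (this can happen when a resample contains only one label).
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi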

Open question I am still chewing on: consecutive-weeks-outside-band is a crude escalation rule. If you use something sharper for eval metrics (CUSUM, control charts), I would like to hear how it behaves in practice on noisy judge data.
