More eval traces will not stabilize your kappa. Stratify the ones you have

#ai #programming #devops #agents

TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces to 200. Variance barely moved. Then we stratified the 50 we already had, by score class and a couple of known failure dimensions, and the swing shrank far more than it had when we quadrupled the sample. Composition was the lever, not volume.

The symptom: kappa that will not sit still

The judge scored production traces against a 5-point rubric. Each week we hand-labeled a calibration set and computed kappa. It bounced: 0.55, then 0.42, then 0.61. Nothing in the rubric or the judge prompt had changed. A kappa that moves 0.2 on noise is useless as an early-warning signal, because you cannot tell a real judge regression from the wobble.
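For reference, here is a minimal sketch of the statistic being tracked: unweighted Cohen's kappa, observed agreement corrected for the agreement two raters would reach by chance. The labels are made up for illustration, and the function name is mine. scikit-learn's cohen_kappa_score computes the same unweighted statistic by default if you would rather not hand-roll it.

```python
from collections import Counter

def cohen_kappa(human, judge):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of traces where both raters gave the same score.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of the two raters' marginal rates, summed over labels.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels on a 5-point rubric, for illustration only.
human = [5, 5, 5, 5, 3, 2, 5, 4, 5, 3]
judge = [5, 5, 5, 5, 2, 2, 5, 4, 5, 5]
kappa = cohen_kappa(human, judge)
print(round(kappa, 3))  # 0.636
```

Note how the raw agreement here is 8 out of 10, but kappa is much lower: most of the agreement is on 5s, which both raters would hit often by chance alone.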

Why adding traces did almost nothing

Random sampling pulls mostly from the majority class. For us that was clean passes, the easy 5s. Kappa is driven by agreement on the rare, ambiguous classes (the 2s and 3s), and a random sample contains those only in proportion to their small share of traffic, with a count that shifts from week to week. So 200 random traces was mostly more easy passes: more data, but little new signal where it counts.
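A quick simulation makes the composition problem visible. This is a sketch under a stated assumption: the 6% rare-class share below is a placeholder I picked for illustration, not our measured rate, and rare_counts is a hypothetical helper.

```python
import random

def rare_counts(n, rare_share, weeks, seed):
    """Count rare-class traces in `weeks` independent random samples of size n."""
    rng = random.Random(seed)
    return [sum(rng.random() < rare_share for _ in range(n)) for _ in range(weeks)]

RARE_SHARE = 0.06  # assumed share of 2s and 3s in production; not a measured value

for n in (50, 200):
    counts = rare_counts(n, RARE_SHARE, weeks=8, seed=0)
    print(f"n={n}: expected {n * RARE_SHARE:.0f} rare traces, simulated weeks {counts}")
```

The thing to look at is how much the rare-class count moves between simulated weeks relative to its expected value. That week-to-week change in composition is what stratification removes.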

Sampling	n	kappa range over 4 weeks
Random	50	0.41 to 0.63
Random	200	0.43 to 0.61
Stratified	50	0.52 to 0.58
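Reducing each row of the table to a single swing (max minus min) makes the comparison easier to read. The snippet only restates the table's numbers; nothing new is measured.

```python
# kappa ranges copied from the table above: (sampling, n) -> (min, max)
ranges = {
    ("Random", 50): (0.41, 0.63),
    ("Random", 200): (0.43, 0.61),
    ("Stratified", 50): (0.52, 0.58),
}

# Swing = widest week-to-week spread seen over the 4 weeks.
swings = {key: round(hi - lo, 2) for key, (lo, hi) in ranges.items()}

for (sampling, n), swing in swings.items():
    print(f"{sampling:<10} n={n:<3} swing={swing:.2f}")
```

On these numbers the swing goes from 0.22 to 0.18 when the random sample is quadrupled, and to 0.06 when the same 50 traces are stratified.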

Fix step one: stratify by score class

Force every score class into the weekly set so the rare classes are actually estimable.

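from sklearn.model_selection import train_test_split

# Draw the weekly calibration set with each score class held at its share of the pool.
# Note: stratify= allocates proportionally; it does not oversample rare classes.
# Assumes `traces` is a pandas DataFrame with a "score_class" column.
cal, _ = train_test_split(
    traces, train_size=50, stratify=traces["score_class"], random_state=0
)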
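One caveat on the snippet above: train_test_split with stratify= keeps each class at its share of the pool, so a class that is 1% of traffic still gets about one slot in a set of 50. If the goal is a guaranteed floor for every class, a fixed per-class quota is one option. The sketch below is stdlib-only and illustrative; sample_per_class, the per_class value, and the toy pool are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def sample_per_class(traces, per_class, key="score_class", seed=0):
    """Draw up to `per_class` traces from every class so no class is skipped."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for trace in traces:
        by_class[trace[key]].append(trace)
    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        # A class with fewer than `per_class` traces contributes everything it has.
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample

# Hypothetical pool: mostly 5s, a few low scores.
pool = (
    [{"id": i, "score_class": 5} for i in range(900)]
    + [{"id": 900 + i, "score_class": 4} for i in range(60)]
    + [{"id": 960 + i, "score_class": 3} for i in range(25)]
    + [{"id": 985 + i, "score_class": 2} for i in range(10)]
    + [{"id": 995 + i, "score_class": 1} for i in range(5)]
)
cal = sample_per_class(pool, per_class=10)
print(Counter(t["score_class"] for t in cal))
```

A quota like this deliberately over-represents rare classes, so kappa on the resulting set is not an estimate of population-level agreement.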

Fix step two: stratify by the failure dimensions you already know

Score class alone was not enough. We added two dimensions we had been burned by before (input length bucket and whether the trace was multi-turn) and stratified on the combination. The rare-and-hard cases now show up every week instead of randomly, so the kappa we compute is measuring the part of the distribution that actually drifts.
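Stratifying on the combination means treating each (score class, length bucket, multi-turn) tuple as its own stratum. A minimal sketch of one way to do that is below, spending a fixed budget round-robin across strata. This is not necessarily the allocation we run in production; stratified_sample and the field names length_bucket and multi_turn are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(traces, n, keys, seed=0):
    """Pick up to n traces by cycling over strata defined by the combination of `keys`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[tuple(trace[k] for k in keys)].append(trace)
    # Shuffle inside each stratum so successive pops are random draws without replacement.
    for group in strata.values():
        rng.shuffle(group)
    sample = []
    # Round-robin: one trace per non-empty stratum per pass, until the budget is spent.
    while len(sample) < n and any(strata.values()):
        for stratum in list(strata):
            if strata[stratum] and len(sample) < n:
                sample.append(strata[stratum].pop())
    return sample

KEYS = ("score_class", "length_bucket", "multi_turn")

# Hypothetical pool; field names and values are illustrative only.
pool = [
    {
        "id": i,
        "score_class": 5 if i % 10 else 2,
        "length_bucket": "long" if i % 3 == 0 else "short",
        "multi_turn": i % 2 == 0,
    }
    for i in range(600)
]
cal = stratified_sample(pool, n=50, keys=KEYS)
per_stratum = Counter(tuple(t[k] for k in KEYS) for t in cal)
print(per_stratum)
```

Round-robin keeps the set size fixed at the budget while giving every observed combination roughly equal weight. Combinations multiply quickly, so with more than a few dimensions some strata will be empty or hold a single trace.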

What I am still unsure about

Stratification needs you to know which dimensions matter. For a brand-new judge you do not know them yet, so you are stuck random-sampling until enough failures teach you the strata. I do not have a clean answer for that cold-start case. If you stratify a calibration set before you have failure data, what do you stratify on besides score class?

FAQ

How few traces can you get away with?
For us 50 stratified was stable. Below about 30 the rare classes had too few examples to estimate kappa at all.
Doesn't stratifying bias the estimate?
Yes, on purpose, toward the hard cases. We report both the stratified kappa (the early-warning number) and the raw kappa (the honest population number).
Which judge model?
A frontier model from a different family than the system under test. The cross-family part matters more than the exact model.