If your LLM-as-judge calibration kappa moves around week to week and you cannot explain it from labeller behavior, the usual cause is the marginal distribution of your calibration set, not the labellers.
Quick refresher. Cohen's kappa is:
kappa = (Po - Pe) / (1 - Pe)
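Where Po is observed agreement and Pe is the agreement expected by chance: the sum, over classes, of the product of the two raters' marginal proportions for that class. Pe therefore depends on the marginal distribution of the labels in your set.
Say labeller A marked 70% of last week's traces "acceptable", 25% "good", and 5% "bad". If the judge's marginals match (an assumption, to keep the arithmetic simple), Pe is the sum of squared proportions: 0.49 + 0.0625 + 0.0025 = 0.555. If this week's mix is 50/40/10, Pe drops to 0.25 + 0.16 + 0.01 = 0.42. The labellers can be doing exactly the same thing and your kappa value still moves.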
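To see the effect in isolation, here is a minimal sketch in Python. It assumes a hypothetical judge that matches the human label 85% of the time in every class and spreads its errors evenly over the other classes. Those rates and the function name kappa_for_mix are made up for illustration; only the label mix changes between the two runs.

```python
# Illustrative only: the 85% agreement rate and the even error split are assumptions.
# Class order in each mix: acceptable, good, bad.

def kappa_for_mix(human_mix, agree_rate=0.85):
    """Return (Po, Pe, kappa) for a judge with fixed per-class behavior."""
    k = len(human_mix)
    miss = (1.0 - agree_rate) / (k - 1)  # errors spread evenly over other classes
    # joint[i][j] = P(human says class i, judge says class j)
    joint = [
        [human_mix[i] * (agree_rate if i == j else miss) for j in range(k)]
        for i in range(k)
    ]
    po = sum(joint[i][i] for i in range(k))            # observed agreement
    human_marg = [sum(row) for row in joint]           # row sums
    judge_marg = [sum(joint[i][j] for i in range(k)) for j in range(k)]  # column sums
    pe = sum(h * m for h, m in zip(human_marg, judge_marg))  # chance agreement
    return po, pe, (po - pe) / (1.0 - pe)

last_week = kappa_for_mix([0.70, 0.25, 0.05])
this_week = kappa_for_mix([0.50, 0.40, 0.10])
for name, (po, pe, kappa) in [("last week", last_week), ("this week", this_week)]:
    print(f"{name}: Po={po:.3f} Pe={pe:.3f} kappa={kappa:.3f}")
```

Under these assumptions Po stays at 0.85 in both weeks, yet kappa moves from roughly 0.70 to roughly 0.75, purely because the mix changed Pe.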
Three things that help:
Sample your calibration set across multiple time windows (rolling 4-week window, stratified by time bucket). Reduces the chance that one week's traffic pattern dominates Pe.
Report per-class precision and recall alongside kappa. Kappa is one summary number; the per-class metrics tell you where the labeller-LLM disagreement actually sits.
For very small calibration sets (under 100 traces), report Wilson confidence intervals around the per-class precision rather than relying on kappa as a bare point estimate. The Wilson interval keeps close to its nominal coverage at small n and for proportions near 0 or 1; the normal-approximation (Wald) interval does not.
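A minimal sketch of the last two points together, in Python with only the standard library. It treats the human label as the reference and computes, for each class the judge predicted, the precision and a 95% Wilson interval. The function names and the toy label lists are hypothetical, not from any particular library.

```python
import math
from collections import Counter

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def per_class_precision(human_labels, judge_labels, z=1.96):
    """Map each judge-predicted class to (precision, (low, high), n_predicted)."""
    predicted = Counter(judge_labels)
    correct = Counter(j for h, j in zip(human_labels, judge_labels) if h == j)
    report = {}
    for cls, n in predicted.items():
        hits = correct[cls]  # Counter returns 0 for classes with no matches
        report[cls] = (hits / n, wilson_interval(hits, n, z), n)
    return report

# Toy data (made up): the judge says "good" 10 times and is right 8 times.
human = ["good"] * 8 + ["bad"] * 2 + ["bad"] * 5
judge = ["good"] * 10 + ["bad"] * 5
for cls, (prec, (low, high), n) in per_class_precision(human, judge).items():
    print(f"{cls}: precision={prec:.2f} 95% CI=({low:.2f}, {high:.2f}) n={n}")
```

With 8 of 10 correct, the interval runs from roughly 0.49 to 0.94, which is a useful reminder of how little a 10-trace class can tell you. Recall follows the same pattern with the human label counts as the denominator.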
The kappa definition and the small-sample interval math are in Cohen (1960), "A coefficient of agreement for nominal scales", and Wilson (1927), "Probable inference, the law of succession, and statistical inference", respectively. Both are short reads.