<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maya Andersson</title>
    <description>The latest articles on DEV Community by Maya Andersson (@maya_andersson_dev).</description>
    <link>https://dev.to/maya_andersson_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940866%2F5582fb73-6689-457f-92ac-b4e833ce5f1d.png</url>
      <title>DEV Community: Maya Andersson</title>
      <link>https://dev.to/maya_andersson_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maya_andersson_dev"/>
    <language>en</language>
    <item>
      <title>More eval traces will not stabilize your kappa. Stratify the ones you have</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Tue, 09 Jun 2026 18:40:44 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/more-eval-traces-will-not-stabilize-your-kappa-stratify-the-ones-you-have-fpl</link>
      <guid>https://dev.to/maya_andersson_dev/more-eval-traces-will-not-stabilize-your-kappa-stratify-the-ones-you-have-fpl</guid>
      <description>&lt;p&gt;TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces to 200. Variance barely moved. Then we stratified the 50 we already had, by score class and a couple of known failure dimensions, and the swing dropped more than quadrupling the sample did. Composition was the lever, not volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptom: kappa that will not sit still
&lt;/h2&gt;

&lt;p&gt;The judge scored production traces against a 5-point rubric. Each week we hand-labeled a calibration set and computed kappa. It bounced: 0.55, then 0.42, then 0.61. Nothing in the rubric or the judge prompt had changed. A kappa that moves 0.2 on noise is useless as an early-warning signal, because you cannot tell a real judge regression from the wobble.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why adding traces did almost nothing
&lt;/h2&gt;

&lt;p&gt;Random sampling pulls mostly from the majority class. For us that was clean passes, the easy 5s. Kappa is driven by agreement on the rare, ambiguous classes (the 2s and 3s), and random sampling delivers those only at their base rate, so how many you get, and in what mix, changes from week to week. So 200 random traces was mostly more easy passes: more data, almost no new signal where it counts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sampling&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;kappa range over 4 weeks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;0.41 to 0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;0.43 to 0.61&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stratified&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;0.52 to 0.58&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Fix step one: stratify by score class
&lt;/h2&gt;

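&lt;p&gt;Force every score class into the weekly set so the rare classes are actually estimable. One caveat on the snippet: &lt;code&gt;stratify&lt;/code&gt; keeps each class at roughly its share of the pool, so a very rare class can still land only one or two traces in a 50-trace draw unless you give it a quota.&lt;br&gt;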
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="c1"&gt;# represent every judge score class in the weekly calibration set
&lt;/span&gt;&lt;span class="n"&gt;cal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score_class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
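&lt;p&gt;If the goal is a guaranteed count per class rather than a proportional one, a per-class quota is one way to get it. This is a minimal standard-library sketch, not the routine we ran: the &lt;code&gt;sample_per_class&lt;/code&gt; helper, the &lt;code&gt;per_class&lt;/code&gt; quota, and the dict-shaped traces are all assumptions made for illustration.&lt;/p&gt;

```python
import random
from collections import defaultdict

def sample_per_class(traces, per_class, key="score_class", seed=0):
    """Draw up to `per_class` traces from every class present in `traces`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for trace in traces:
        by_class[trace[key]].append(trace)
    picked = []
    for label in sorted(by_class):
        group = by_class[label]
        # A class with fewer traces than the quota contributes all of them.
        picked.extend(rng.sample(group, min(per_class, len(group))))
    return picked

# Hypothetical pool: 90 percent clean passes and a thin tail of low scores.
pool = (
    [{"id": i, "score_class": 5} for i in range(900)]
    + [{"id": 900 + i, "score_class": 4} for i in range(60)]
    + [{"id": 960 + i, "score_class": 3} for i in range(25)]
    + [{"id": 985 + i, "score_class": 2} for i in range(10)]
    + [{"id": 995 + i, "score_class": 1} for i in range(5)]
)
cal = sample_per_class(pool, per_class=10)
counts = {c: sum(1 for t in cal if t["score_class"] == c) for c in range(1, 6)}
print(len(cal), counts)
```

&lt;p&gt;A quota sample no longer mirrors the population mix, so a kappa computed on it is a hard-case number, not a population estimate.&lt;/p&gt;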



&lt;h2&gt;
  
  
  Fix step two: stratify by the failure dimensions you already know
&lt;/h2&gt;

&lt;p&gt;Score class alone was not enough. We added two dimensions we had been burned by before (input length bucket and whether the trace was multi-turn) and stratified on the combination. The rare-and-hard cases now show up every week instead of randomly, so the kappa we compute is measuring the part of the distribution that actually drifts.&lt;/p&gt;
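&lt;p&gt;Stratifying on the combination can be sketched as a composite key plus a round-robin fill, so every populated cell gets a trace before any cell gets a second one. The field names (&lt;code&gt;length_bucket&lt;/code&gt;, &lt;code&gt;multi_turn&lt;/code&gt;), the helper names, and the synthetic pool below are hypothetical; treat this as one possible allocation rule under those assumptions.&lt;/p&gt;

```python
import random
from collections import defaultdict

def stratum_key(trace):
    """Composite stratum: score class x length bucket x multi-turn flag."""
    return (trace["score_class"], trace["length_bucket"], trace["multi_turn"])

def sample_composite(traces, budget, seed=0):
    """Fill `budget` slots by cycling over strata, one trace per stratum per pass."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[stratum_key(trace)].append(trace)
    for group in strata.values():
        rng.shuffle(group)
    picked = []
    # Keep cycling until the budget is full or every stratum is exhausted.
    while len(picked) != budget and any(strata.values()):
        for key in sorted(strata):
            if strata[key] and len(picked) != budget:
                picked.append(strata[key].pop())
    return picked

# Hypothetical pool with a skewed score distribution.
gen = random.Random(1)
pool = [
    {
        "id": i,
        "score_class": gen.choice([5, 5, 5, 5, 4, 3, 2]),
        "length_bucket": gen.choice(["short", "long"]),
        "multi_turn": gen.choice([True, False]),
    }
    for i in range(500)
]
cal = sample_composite(pool, budget=50)
print(len(cal), len({stratum_key(t) for t in cal}))
```

&lt;p&gt;The number of cells grows multiplicatively with each added dimension, so a 50-trace budget only supports a few of them before most cells hold a single trace.&lt;/p&gt;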

&lt;h2&gt;
  
  
  What I am still unsure about
&lt;/h2&gt;

&lt;p&gt;Stratification needs you to know which dimensions matter. For a brand-new judge you do not know them yet, so you are stuck random-sampling until enough failures teach you the strata. I do not have a clean answer for that cold-start case. If you stratify a calibration set before you have failure data, what do you stratify on besides score class?&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How few traces can you get away with?&lt;/strong&gt; &lt;br&gt;
For us 50 stratified was stable. Below about 30 the rare classes had too few examples to estimate kappa at all.&lt;br&gt;
&lt;strong&gt;Doesn't stratifying bias the estimate?&lt;/strong&gt; &lt;br&gt;
Yes, on purpose, toward the hard cases. We report both the stratified kappa (the early-warning number) and the raw kappa (the honest population number).&lt;br&gt;
&lt;strong&gt;Which judge model?&lt;/strong&gt; &lt;br&gt;
A frontier model from a different family than the system under test. The cross-family part matters more than the exact model.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>agents</category>
    </item>
    <item>
      <title>Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Thu, 04 Jun 2026 16:57:46 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/calibration-set-size-for-llm-as-judge-when-50-traces-is-enough-and-when-200-is-mandatory-d1d</link>
      <guid>https://dev.to/maya_andersson_dev/calibration-set-size-for-llm-as-judge-when-50-traces-is-enough-and-when-200-is-mandatory-d1d</guid>
      <description>&lt;p&gt;TL;DR. The human-labeled calibration set you use to validate an LLM-as-judge does not need a fixed size. It needs a size that depends on how balanced your labels are. For roughly balanced binary criteria with no heavy tail, 50 stratified traces will usually pin Cohen's kappa to within a tolerable band (in my runs, a 95 percent bootstrap interval on the order of plus or minus 0.10 to 0.15). The moment you have a rare-but-expensive category, say a safety violation that shows up in 6 percent of traces, 50 is not enough and you should plan for 200 or more, because the variance of kappa is dominated by the count of minority-class examples, not the total. Below I give the kappa formula and why it is sensitive to the marginal distribution, the sample-size intuition, Wilson confidence intervals for small-n per-class precision, and the stratified-sampling routine that keeps marginals stable week to week. Pasteable Python at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Kappa, and why the marginal distribution is doing more work than you think
&lt;/h2&gt;

&lt;p&gt;Cohen's kappa (Cohen, 1960) measures agreement between two raters corrected for the agreement you would expect by chance. Here the two raters are your human labeler and your LLM-as-judge. The formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement computed from the marginals. For a binary label, if the human marks "pass" with probability a and the judge with probability b, then p_e = a*b + (1-a)*(1-b).&lt;/p&gt;

&lt;p&gt;The part people skim past is that p_e is a function of the label marginals, and that function is not linear. When the classes are balanced (a and b near 0.5), p_e sits near 0.5, the denominator is near 0.5, and kappa is well-behaved. When one class is rare (a and b both near 0.95), p_e is pushed toward 1 (about 0.905 at a = b = 0.95), the denominator shrinks toward zero, and kappa becomes a ratio of two small numbers. Small numbers in a denominator are how you get instability. This is the origin of the kappa paradoxes: you can have 95 percent observed agreement and a kappa near zero, purely because the marginals are lopsided. It is not a bug in your judge. It is the chance-correction working as designed on a distribution where chance agreement is already very high. The practical consequence for sizing: a class-imbalanced set carries less information per trace, so you need more traces for the same precision on kappa. I want to be careful not to overclaim. Kappa is not broken on imbalanced data. It is higher-variance, and you pay for that variance with sample size.&lt;/p&gt;
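&lt;p&gt;To make the paradox concrete, here is a small sketch that applies the formula above to two 2x2 tables. The cell counts are invented for illustration and &lt;code&gt;binary_kappa&lt;/code&gt; is a hypothetical helper, not a library function.&lt;/p&gt;

```python
def binary_kappa(both_pass, human_only, judge_only, both_fail):
    """Cohen's kappa from the four cells of a 2x2 human-vs-judge table."""
    n = both_pass + human_only + judge_only + both_fail
    p_o = (both_pass + both_fail) / n   # observed agreement
    a = (both_pass + human_only) / n    # human "pass" rate
    b = (both_pass + judge_only) / n    # judge "pass" rate
    p_e = a * b + (1 - a) * (1 - b)     # chance agreement from the marginals
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Invented counts, 100 traces each.
balanced = binary_kappa(both_pass=45, human_only=5, judge_only=5, both_fail=45)
lopsided = binary_kappa(both_pass=95, human_only=2, judge_only=3, both_fail=0)

for name, (p_o, p_e, kappa) in (("balanced", balanced), ("lopsided", lopsided)):
    print(f"{name}: p_o={p_o:.3f}  p_e={p_e:.3f}  kappa={kappa:.3f}")
```

&lt;p&gt;The balanced table has 90 percent agreement and kappa 0.80. The lopsided table has 95 percent agreement and a kappa just below zero, because p_e is already about 0.95.&lt;/p&gt;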

&lt;p&gt;A note on what I report. I treat kappa as descriptive, not as a hypothesis test. The interesting question is never "is kappa significantly greater than zero" (it almost always is, and that bar is meaningless). The question is "is kappa high enough, with a tight enough interval, that I trust this judge to stand in for a human." That is a question about the width of the interval, which is a question about n.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The case for 50 traces: balanced binary criteria, no heavy tail
&lt;/h2&gt;

&lt;p&gt;If your label is a roughly balanced binary criterion, 50 stratified human labels are often enough to make a deployment decision. Two things have to hold: the criterion is genuinely binary and roughly balanced (each class in the 30 to 70 percent range), and there is no rare-but-costly tail you care about separately. Under those conditions the variance of kappa is at its most forgiving.&lt;/p&gt;

&lt;p&gt;A concrete example from my own work. I had a 5-class quality scale (a 1-to-5 Likert a previous team had wired into the judge). Kappa with humans was 0.47, and the bootstrap interval on 80 examples was wide enough that I could not tell 0.47 from 0.35. The 5-class scale was the problem: it spread the marginals thin across five buckets, so most cells had tiny counts. I split it into three binary criteria (is it factually supported, is it relevant, is it complete) and re-labeled the same traces. On the "factually supported" criterion, which was close to balanced, kappa came out at 0.78 on 50 examples with an interval I was comfortable shipping on. Same traces, same judge prompt structure, very different statistical footing. The honest caveat: 50 works for "does this judge agree with us well enough on the common case." It does not work for "does this judge catch the rare bad thing."&lt;/p&gt;

&lt;h2&gt;
  
  
  3. When 200 becomes mandatory: heavy tails and the rare expensive class
&lt;/h2&gt;

&lt;p&gt;The variance of kappa scales inversely with n, which everyone knows, but it also scales with the rarity of the minority class, which fewer people budget for. If the category you care about appears in 6 percent of traces, a 50-trace sample contains, in expectation, three examples of it. Three. Your estimate of the judge's recall on that category is being driven by three data points. This is where I insist on 200 or more. Not because 200 is magic, but because of what it does to the minority-class count: at a 6 percent base rate, 200 traces gives about 12 minority examples in expectation, 400 gives about 24. Pick the rarest class you care about, decide how many examples you need to estimate its precision and recall to a tolerable width, then back out the total n from the base rate. If you need 20 examples of a 6 percent class, you need roughly 20 / 0.06, or about 333 traces of raw sampling, or you oversample the rare class deliberately and weight afterward.&lt;/p&gt;
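&lt;p&gt;The back-out arithmetic fits in a few lines. The sketch below also adds a binomial tail probability, which is my own addition under the assumption that traces are independent draws at a fixed base rate; the function names are hypothetical.&lt;/p&gt;

```python
import math

def expected_minority(n, base_rate):
    """Expected number of minority-class traces in a random sample of n."""
    return n * base_rate

def traces_needed(min_examples, base_rate):
    """Smallest n whose expected minority count reaches min_examples."""
    return math.ceil(min_examples / base_rate)

def prob_at_least(n, base_rate, k):
    """P(at least k minority traces in n random draws), binomial model."""
    below = sum(
        math.comb(n, i) * base_rate**i * (1 - base_rate) ** (n - i)
        for i in range(k)
    )
    return 1 - below

for n in (50, 200, 400):
    print(n, round(expected_minority(n, 0.06), 1))

n_req = traces_needed(20, 0.06)
print(n_req, round(prob_at_least(n_req, 0.06, 20), 2))
```

&lt;p&gt;Sizing on the expectation alone leaves you close to a coin flip on actually drawing 20 rare examples, which is one more argument for deliberate oversampling.&lt;/p&gt;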

&lt;p&gt;This is the moment to mention the thing I keep repeating, phrased plainly. Quality detection without an uncertainty estimate around the metric does not actually catch the failure. If you report "the judge has 0.81 recall on hallucinations" from a sample with seven hallucinations in it, you have not measured recall. You have measured noise and rounded it to two decimal places. When the question is a paired comparison instead (you changed the judge prompt and want to know whether agreement improved on the same traces), McNemar's test is the right tool. It looks only at the discordant pairs and tests whether the split between the two kinds of disagreement is significant.&lt;/p&gt;
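&lt;p&gt;For the paired case, the exact form of McNemar's test needs nothing beyond the standard library: under the null, the discordant pairs split like fair coin flips. This is a minimal sketch with invented data; &lt;code&gt;mcnemar_exact&lt;/code&gt; and the boolean "agrees with the human" encoding are assumptions for illustration.&lt;/p&gt;

```python
import math

def mcnemar_exact(old_correct, new_correct):
    """Two-sided exact McNemar p-value from paired per-trace booleans."""
    # b: old prompt agreed with the human, new prompt did not. c: the reverse.
    b = sum(1 for o, n in zip(old_correct, new_correct) if o and not n)
    c = sum(1 for o, n in zip(old_correct, new_correct) if n and not o)
    discordant = b + c
    if discordant == 0:
        return b, c, 1.0
    # Binomial(discordant, 0.5) tail at the smaller count, doubled and capped at 1.
    k = min(b, c)
    tail = sum(math.comb(discordant, i) for i in range(k + 1)) / 2**discordant
    return b, c, min(1.0, 2 * tail)

# Invented example: 50 traces, 10 of which changed between prompts.
old = [True] * 30 + [True] * 1 + [False] * 9 + [False] * 10
new = [True] * 30 + [False] * 1 + [True] * 9 + [False] * 10
b, c, p = mcnemar_exact(old, new)
print(f"b={b} c={c} p={p:.4f}")
```

&lt;p&gt;The 40 concordant traces never enter the calculation. Only the 1-versus-9 split does.&lt;/p&gt;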

&lt;h2&gt;
  
  
  4. Wilson intervals: the per-class precision number you can trust on small n
&lt;/h2&gt;

&lt;p&gt;Once you report per-class precision and recall (and on an imbalanced set you must, because aggregate kappa hides the minority class), you need a confidence interval on a proportion estimated from a small count. The normal approximation (p plus or minus 1.96 times the square root of p(1-p)/n) fails exactly when you need it most: when the count is small or the proportion is near 0 or 1, it produces intervals that run below zero or above one. Wilson (1927) gives the interval I default to. It centers on a value pulled slightly toward 0.5, derives its width from the score test, stays inside [0, 1], and has far better coverage near the boundaries. For "the judge flagged 9 traces as violations and 7 were real," the Wilson interval on that 7-of-9 precision is honest. The normal-approximation interval on the same 7-of-9 is not.&lt;/p&gt;
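&lt;p&gt;The failure of the normal approximation on the 7-of-9 case is easy to reproduce. A minimal sketch; &lt;code&gt;normal_interval&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
import math

def normal_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, left unclipped."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

low, high = normal_interval(7, 9)
print(f"7 of 9: [{low:.3f}, {high:.3f}]")

# At 9 of 9 the estimated variance is zero, so the interval has no width at all.
print("9 of 9:", normal_interval(9, 9))
```

&lt;p&gt;The 7-of-9 upper bound lands above 1, and the 9-of-9 interval collapses to a single point, which claims a certainty nine traces cannot support.&lt;/p&gt;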

&lt;h2&gt;
  
  
  5. Stratified sampling across time windows: keeping the marginals stable
&lt;/h2&gt;

&lt;p&gt;Everything above assumes the calibration set's label distribution resembles production. It will not, if you sample naively, because production drifts. I learned this once: I trained a judge for factual accuracy, got kappa 0.61 on the dev set, deployed it, and three weeks later kappa on a fresh sample was 0.39. The input distribution had shifted (more domain jargon than my calibration set contained). The kappa drop was not the judge getting worse. It was the calibration set no longer describing the job. The fix is to stratify across time windows and across whatever covariate moves your marginals: pull traces in weekly strata, sample within each week proportionally, oversample any rare class so its count is high enough within each window. This keeps the marginal distribution stable week to week and lets you watch for drift, because if this week's stratified sample shows a kappa outside last week's interval, that is a drift signal rather than sampling noise.&lt;/p&gt;
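&lt;p&gt;One way to sketch the weekly routine is a fixed rare-class quota per week with the remaining slots filled at random from the other traces. The field names (&lt;code&gt;week&lt;/code&gt;, &lt;code&gt;label&lt;/code&gt;), the quota of 10, and the synthetic pool are assumptions for illustration, and the sketch assumes the quota does not exceed the weekly budget.&lt;/p&gt;

```python
import random
from collections import defaultdict

def weekly_sample(traces, per_week, rare_label, rare_quota, seed=0):
    """Per week: take up to rare_quota rare traces, then fill from the rest."""
    rng = random.Random(seed)
    by_week = defaultdict(list)
    for trace in traces:
        by_week[trace["week"]].append(trace)
    sample = {}
    for week in sorted(by_week):
        rare = [t for t in by_week[week] if t["label"] == rare_label]
        rest = [t for t in by_week[week] if t["label"] != rare_label]
        take_rare = rng.sample(rare, min(rare_quota, len(rare)))
        # Remaining slots come only from the non-rare traces of the same week.
        slots = max(0, per_week - len(take_rare))
        sample[week] = take_rare + rng.sample(rest, min(slots, len(rest)))
    return sample

# Hypothetical pool: 4 weeks, 300 traces each, 6 percent violations.
pool = [
    {"id": (week, i), "week": week, "label": "violation" if i in range(18) else "ok"}
    for week in range(4)
    for i in range(300)
]
sample = weekly_sample(pool, per_week=50, rare_label="violation", rare_quota=10)
for week, rows in sample.items():
    print(week, len(rows), sum(1 for t in rows if t["label"] == "violation"))
```

&lt;p&gt;Because the rare class is oversampled, population-level numbers need the strata weighted back to their true rates before you quote them.&lt;/p&gt;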

&lt;h2&gt;
  
  
  Pasteable Python
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import cohen_kappa_score

# 1. Cohen's kappa between human labels and judge labels.
human = np.array(["pass", "fail", "pass", "pass", "fail", "pass"])
judge = np.array(["pass", "fail", "pass", "fail", "fail", "pass"])
print(f"kappa = {cohen_kappa_score(human, judge):.3f}")

# 2. Wilson score interval for a proportion (per-class
# precision/recall on small n).
def wilson_interval(successes, n, z=1.96):
    if n == 0:
        return (0.0, 1.0)
    phat = successes / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

low, high = wilson_interval(7, 9)
print(f"precision 7/9 = 0.778, Wilson 95% CI = [{low:.3f}, {high:.3f}]")

# 3. Bootstrap the variance (and CI) of kappa at a given n.
def bootstrap_kappa(human, judge, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(human)
    human, judge = np.asarray(human), np.asarray(judge)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample the pairs, not the labels
        estimates[b] = cohen_kappa_score(human[idx], judge[idx])
    return {"kappa": cohen_kappa_score(human, judge),
            "std": float(np.nanstd(estimates)),
            "ci95": (float(np.nanpercentile(estimates, 2.5)), float(np.nanpercentile(estimates, 97.5)))}

rng = np.random.default_rng(1)
h = rng.integers(0, 2, size=50)
flip = rng.random(50) &amp;lt; 0.15
j = np.where(flip, 1 - h, h)
res = bootstrap_kappa(h, j)
print(f"n=50  kappa={res['kappa']:.3f}  std={res['std']:.3f}  95% CI=[{res['ci95'][0]:.3f}, {res['ci95'][1]:.3f}]")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two things to read off the bootstrap: the std is your standard error on kappa at this n, and the percentile CI is the band you should be quoting. Run it once at n=50 and once at n=200 on your own labels and you will see the interval shrink. That shrinkage, scaled by your minority-class rate, is the entire sizing decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37 to 46.&lt;br&gt;
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209 to 212.&lt;br&gt;
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153 to 157.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Is kappa the right metric at all? For nominal labels with two raters, kappa is a reasonable default. If your label is ordinal (a real 1-to-5 scale where the distance between buckets matters), weighted kappa or an intraclass correlation is more appropriate. My preference is to avoid ordinal judge scales and decompose into binary criteria.&lt;/p&gt;

&lt;p&gt;Can I just label more data instead of doing any of this math? Yes, and if labeling is cheap for you, more data is the cleanest fix. The math matters when labels are expensive, which is the usual case for the rare-and-costly category.&lt;/p&gt;

&lt;p&gt;Why bootstrap kappa instead of a closed-form variance? There is a closed-form asymptotic variance, but it is an asymptotic result and I do not trust it at the small n where I am actually operating. The bootstrap makes no large-sample assumption and surfaces the rare-class degeneracy directly.&lt;/p&gt;

&lt;p&gt;What kappa value is good enough to ship a judge? There is no universal threshold and I am suspicious of the ones in circulation. It depends on the cost of the judge being wrong. For a low-stakes triage filter I have shipped in the high 0.6s. For anything where a missed positive is expensive, I want a higher kappa and a tight interval on the minority class specifically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions I want to test next
&lt;/h2&gt;

&lt;p&gt;The relationship between bootstrap interval width and minority-class count: I have a working intuition you can predict the n you need from the rare-class rate alone, but I have not pinned down the constant across label types. Stratification under genuine drift: holding the marginals fixed makes the comparison legitimate but may hide the very change I should detect, and I do not have a principled way to do both at once. And whether any of this transfers to multi-judge ensembles, where pairwise kappa stops being the natural object and the right tool is probably closer to an intraclass correlation. If you have labeled data sitting around, the most useful thing you can do is run the bootstrap at two sizes on your own task and see where your interval lands. The number that matters is not the kappa. It is how much the kappa moves when you resample.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>development</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why Cohen's kappa drifts week to week (and what to do about it)</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:25:19 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/why-cohens-kappa-drifts-week-to-week-and-what-to-do-about-it-2alh</link>
      <guid>https://dev.to/maya_andersson_dev/why-cohens-kappa-drifts-week-to-week-and-what-to-do-about-it-2alh</guid>
      <description>&lt;p&gt;If your LLM-as-judge calibration kappa moves around week to week and you cannot explain it from labeller behavior, the usual cause is the marginal distribution of your calibration set, not the labellers.&lt;/p&gt;

&lt;p&gt;Quick refresher. Cohen's kappa is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;kappa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Po&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Pe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Pe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where Po is observed agreement and Pe is expected agreement by chance. Pe depends on the marginal distribution of the labels in your set.&lt;/p&gt;

&lt;p&gt;If labeller A marked 70% of last week's traces "acceptable", 25% "good" and 5% "bad", Pe takes one value. If this week's mix is 50/40/10, Pe takes another. The labellers can be doing exactly the same thing and your kappa value still moves.&lt;/p&gt;
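&lt;p&gt;Here is that shift worked through as a small sketch. Two things are assumed for illustration and are not in the numbers above: both raters share the same marginals, and observed agreement Po stays fixed at 0.80 in both weeks.&lt;/p&gt;

```python
def chance_agreement(rater_a, rater_b):
    """Pe: sum over classes of the product of the two raters' marginal rates."""
    return sum(a * b for a, b in zip(rater_a, rater_b))

def kappa(p_o, p_e):
    return (p_o - p_e) / (1 - p_e)

last_week = [0.70, 0.25, 0.05]   # acceptable / good / bad
this_week = [0.50, 0.40, 0.10]
p_o = 0.80                       # assumed identical in both weeks

pe_last = chance_agreement(last_week, last_week)
pe_this = chance_agreement(this_week, this_week)
print(f"last week: Pe={pe_last:.3f}  kappa={kappa(p_o, pe_last):.3f}")
print(f"this week: Pe={pe_this:.3f}  kappa={kappa(p_o, pe_this):.3f}")
```

&lt;p&gt;Under those assumptions Pe falls from 0.555 to 0.420 and kappa rises from about 0.55 to about 0.66, with no change in how often the raters agree.&lt;/p&gt;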

&lt;p&gt;Three things that help:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Sample your calibration set across multiple time windows (rolling 4-week window, stratified by time bucket). Reduces the chance that one week's traffic pattern dominates Pe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Report per-class precision and recall alongside kappa. Kappa is one summary number; the per-class metrics tell you where the labeller-LLM disagreement actually sits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For very small calibration sets (under 100 traces), use Wilson confidence intervals around the per-class precision instead of treating kappa as a point estimate. The Wilson interval is robust to small samples; the normal-approximation interval is not.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
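&lt;p&gt;Point 2 can be sketched without any dependencies. The labels and counts below are invented, &lt;code&gt;per_class_metrics&lt;/code&gt; is a hypothetical helper, and the human labels are treated as ground truth.&lt;/p&gt;

```python
from collections import Counter

def per_class_metrics(human, judge):
    """Judge precision and recall per label, with human labels as the reference."""
    hits = Counter(h for h, j in zip(human, judge) if h == j)
    judged = Counter(judge)
    actual = Counter(human)
    report = {}
    for label in sorted(set(human) | set(judge)):
        # None marks a metric that is undefined because its denominator is zero.
        precision = hits[label] / judged[label] if judged[label] else None
        recall = hits[label] / actual[label] if actual[label] else None
        report[label] = (precision, recall, actual[label])
    return report

# Invented calibration set of 20 traces.
human = ["good"] * 6 + ["acceptable"] * 10 + ["bad"] * 4
judge = (
    ["good"] * 5 + ["acceptable"] * 1
    + ["acceptable"] * 8 + ["good"] * 1 + ["bad"] * 1
    + ["bad"] * 2 + ["acceptable"] * 2
)
report = per_class_metrics(human, judge)
for label, (precision, recall, support) in report.items():
    print(f"{label:10s} precision={precision:.2f} recall={recall:.2f} n={support}")
```

&lt;p&gt;In this made-up set the judge recovers only half of the "bad" traces, a gap that a single kappa figure would not point to.&lt;/p&gt;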

&lt;p&gt;The kappa definition comes from Cohen (1960), "A coefficient of agreement for nominal scales", and the small-sample interval from Wilson (1927), "Probable inference, the law of succession, and statistical inference." Both are short reads.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evaluation</category>
      <category>machinelearning</category>
      <category>statistics</category>
    </item>
    <item>
      <title>Your LLM-as-judge eval set is too small. Here is the math</title>
      <dc:creator>Maya Andersson</dc:creator>
      <pubDate>Tue, 26 May 2026 17:49:50 +0000</pubDate>
      <link>https://dev.to/maya_andersson_dev/your-llm-as-judge-eval-set-is-too-small-here-is-the-math-2iac</link>
      <guid>https://dev.to/maya_andersson_dev/your-llm-as-judge-eval-set-is-too-small-here-is-the-math-2iac</guid>
      <description>&lt;p&gt;How many human-labeled examples do you need to calibrate an LLM-as-judge against humans on your task? The default answer most teams use is "enough," which usually means whatever they had time to label. That answer is wrong in a specific, mathematically tractable way.&lt;/p&gt;

&lt;p&gt;The short version: if your judge has Cohen's kappa around 0.6 against humans and you want a 95% confidence interval no wider than 0.10, you need approximately 200 paired labels. If your judge has kappa around 0.4, you need approximately 400. Most production teams I have read about are using 50, which gives a CI width of 0.20 or wider at the same kappa range.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method
&lt;/h2&gt;

&lt;p&gt;Cohen's kappa (Cohen 1960) measures inter-rater agreement adjusted for chance. The classical interpretation thresholds (Landis &amp;amp; Koch 1977) treat 0.41 to 0.60 as "moderate" and 0.61 to 0.80 as "substantial."&lt;/p&gt;

&lt;p&gt;The confidence interval of an estimated kappa narrows with sample size, but slower than linearly: the variance falls as 1/N, so the standard error and the CI width fall as 1/sqrt(N). For a fixed true kappa, doubling N narrows the CI by a factor of roughly sqrt(2). To halve the CI width, you need 4x the data.&lt;/p&gt;
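&lt;p&gt;The square-root rule gives a quick way to extrapolate from a width you have already measured. A minimal sketch, assuming the true kappa and the label mix stay the same as N changes; the helper names are hypothetical and the 0.20 starting width is the 50-label figure quoted above.&lt;/p&gt;

```python
import math

def scaled_width(width_at_n1, n1, n2):
    """CI width expected at n2, given the width observed at n1 (1/sqrt(N) rule)."""
    return width_at_n1 * math.sqrt(n1 / n2)

def n_for_width(width_at_n1, n1, target_width):
    """Labels needed to reach target_width, given the width observed at n1."""
    # Round first so floating-point noise does not push ceil up by one.
    return math.ceil(round(n1 * (width_at_n1 / target_width) ** 2, 9))

print(round(scaled_width(0.20, 50, 200), 3))   # width at 200 labels
print(n_for_width(0.20, 50, 0.10))             # labels for a 0.10 width
print(n_for_width(0.20, 50, 0.05))             # labels for a 0.05 width
```

&lt;p&gt;Going from 0.20 to 0.10 costs 4x the labels, and going from 0.10 to 0.05 costs another 4x.&lt;/p&gt;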

&lt;p&gt;Here is a bootstrap-CI calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohen_kappa_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kappa_with_bootstrap_ci&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;n_resamples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns (point_estimate, (low, high)) bootstrap CI.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;paired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paired&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;point_estimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;resampled_kappas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_resamples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;integers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;bs_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;paired&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;bs_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bs_pairs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;bs_human&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bs_pairs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;resampled_kappas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;cohen_kappa_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bs_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bs_human&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;
    &lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resampled_kappas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resampled_kappas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;point_estimate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For paired comparison between two judges on the same examples, McNemar's test is the right statistic (not a re-application of kappa). The implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;statsmodels.stats.contingency_tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mcnemar&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_judges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_a_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;judge_b_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns McNemar exact test p-value for whether judge A
    and judge B differ in their agreement-with-human rate.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;a_correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_a_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;b_correct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge_b_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;human_scores&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="c1"&gt;# 2x2 contingency: both right, A only, B only, both wrong
&lt;/span&gt;    &lt;span class="n"&gt;both_right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;a_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;b_only&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;both_wrong&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_correct&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;both_right&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_only&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b_only&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;both_wrong&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;mcnemar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;pvalue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
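&lt;p&gt;To make the exact test less of a black box, here is a minimal standard-library sketch of what it computes: a two-sided binomial test on the discordant pairs only (the &lt;code&gt;a_only&lt;/code&gt; and &lt;code&gt;b_only&lt;/code&gt; cells). The helper name &lt;code&gt;mcnemar_exact_pvalue&lt;/code&gt; and the counts in the example are illustrative assumptions, not part of the pipeline above.&lt;/p&gt;

```python
from math import comb


def mcnemar_exact_pvalue(a_only, b_only):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Hypothetical helper for illustration. Under the null hypothesis each
    discordant pair is equally likely to favor judge A or judge B, so the
    smaller count follows Binomial(n, 0.5) with n = a_only + b_only.
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(a_only, b_only)
    # P(X at most k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up counts: judge A alone matched the human label on 15 traces,
# judge B alone on 5. Concordant pairs do not enter the test.
p_value = mcnemar_exact_pvalue(15, 5)
print(round(p_value, 4))  # prints 0.0414
```

&lt;p&gt;Because only the discordant cells matter, two judges that agree with humans at nearly the same rate can still differ significantly if their disagreements are lopsided.&lt;/p&gt;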



&lt;h2&gt;
  
  
  The bounded sample size problem
&lt;/h2&gt;

&lt;p&gt;The CI width is the quantity that determines whether a kappa estimate is operationally useful. A point estimate of 0.65 with CI [0.45, 0.85] gives almost no information. A point estimate of 0.65 with CI [0.60, 0.70] tells you the judge is reliably "good."&lt;/p&gt;

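&lt;p&gt;For production drift detection, you need CIs tight enough that drift is distinguishable from sampling noise. With a CI width below 0.10, a 0.10-point drop produces a new interval that no longer overlaps the old one; with a CI width of 0.20, the two intervals still overlap by half their width.&lt;/p&gt;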

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;True kappa&lt;/th&gt;
&lt;th&gt;N for CI width 0.10&lt;/th&gt;
&lt;th&gt;N for CI width 0.20&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;approximately 450&lt;/td&gt;
&lt;td&gt;approximately 115&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;approximately 250&lt;/td&gt;
&lt;td&gt;approximately 65&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;approximately 150&lt;/td&gt;
&lt;td&gt;approximately 40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;approximately 50&lt;/td&gt;
&lt;td&gt;approximately 15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are Monte Carlo estimates, not closed-form derivations. The closed-form large-sample variance (Fleiss 1981) involves prevalence and bias terms.&lt;/p&gt;
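&lt;p&gt;For readers who want to run this kind of estimate themselves, the following is a rough sketch of one way to set up such a simulation. Everything about the setup is an assumption on my part for illustration: five balanced classes, uniformly distributed errors, 400 replications, and the central 95% range of simulated kappas as a stand-in for CI width. The function names are hypothetical, and the output should not be expected to reproduce the table, which depends on simulation settings not spelled out here.&lt;/p&gt;

```python
import random


def cohen_kappa(judge, human):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(judge)
    labels = set(judge) | set(human)
    p_observed = sum(j == h for j, h in zip(judge, human)) / n
    p_expected = sum(
        (judge.count(c) / n) * (human.count(c) / n) for c in labels
    )
    if p_expected == 1.0:
        return 1.0  # convention chosen here: both raters used one label
    return (p_observed - p_expected) / (1 - p_expected)


def simulate_ci_width(true_kappa, n, num_classes=5, reps=400, seed=0):
    """Width of the central 95% range of kappa over simulated samples."""
    rng = random.Random(seed)
    classes = list(range(num_classes))
    # With balanced classes and uniform errors, agreeing with probability
    # p_agree gives a population kappa equal to true_kappa.
    p_agree = true_kappa + (1 - true_kappa) / num_classes
    kappas = []
    for _ in range(reps):
        human = [rng.choice(classes) for _ in range(n)]
        judge = [
            h if p_agree > rng.random()
            else rng.choice([c for c in classes if c != h])
            for h in human
        ]
        kappas.append(cohen_kappa(judge, human))
    kappas.sort()
    low = kappas[int(0.025 * (reps - 1))]
    high = kappas[int(0.975 * (reps - 1))]
    return high - low


# Compare the spread at two sample sizes for the same underlying kappa.
for n in (100, 400):
    print(n, round(simulate_ci_width(0.5, n), 3))
```

&lt;p&gt;The useful thing to look at is how the width changes with N rather than any single number: under these assumptions, quadrupling N should roughly halve the width.&lt;/p&gt;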

&lt;h2&gt;
  
  
  What N to actually use
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;recommend_n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_kappa&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lookup from Monte Carlo simulation; not a closed form.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target_kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_kappa&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;target_ci_width&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;4.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not know your judge's kappa yet, start with N=200 for initial calibration. Then re-estimate the required N from the observed kappa, and label more if the observed kappa is low enough to need a larger sample.&lt;/p&gt;
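&lt;p&gt;The top-up step can be sketched as a small lookup. This is an illustration under stated assumptions: it reuses the table's values for a CI width of 0.10 and the same kappa bands as &lt;code&gt;recommend_n&lt;/code&gt;, and the names &lt;code&gt;required_n&lt;/code&gt; and &lt;code&gt;extra_labels_needed&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
# Table values for a CI width of 0.10, keyed by the lower edge of each
# kappa band (same band edges as recommend_n).
N_FOR_WIDTH_010 = [(0.85, 50), (0.65, 150), (0.45, 250)]


def required_n(observed_kappa):
    """Calibration-set size suggested by the table for a 0.10-wide CI."""
    for lower_edge, n in N_FOR_WIDTH_010:
        if observed_kappa >= lower_edge:
            return n
    return 450  # kappa below every band edge, including negative values


def extra_labels_needed(observed_kappa, labeled_so_far=200):
    """How many more traces to label after the initial calibration pass."""
    return max(0, required_n(observed_kappa) - labeled_so_far)


# Example observed kappas after an initial N=200 pass.
for kappa in (0.72, 0.61, 0.31):
    print(kappa, extra_labels_needed(kappa))
```

&lt;p&gt;With these inputs the sketch asks for 0, 50, and 250 additional labels respectively, so only the lower-agreement judges need a second labeling round.&lt;/p&gt;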

&lt;h2&gt;
  
  
  Three production judges, three decisions
&lt;/h2&gt;

&lt;p&gt;Judge A (refund agent factual accuracy). Initial N=200. Observed kappa 0.61 [CI 0.54, 0.68]. After 3 weeks in production, kappa on a fresh 200-example sample dropped to 0.39 [CI 0.30, 0.48]. The cause was distribution shift on the input. The drop was detectable because both CIs were tight enough not to overlap: the new upper bound (0.48) sits below the original lower bound (0.54).&lt;/p&gt;

&lt;p&gt;Judge B (customer-support tone scoring). Initial N=200, observed kappa 0.72 [CI 0.67, 0.78]. Stable across two months.&lt;/p&gt;

&lt;p&gt;Judge C (code-review quality scoring). Initial N=200, observed kappa 0.31 [CI 0.22, 0.40]. Too low to use. Reverted to human-only review.&lt;/p&gt;

&lt;p&gt;If I had used N=50, the intervals would have been roughly twice as wide, and two of three decisions would have been ambiguous.&lt;/p&gt;
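&lt;p&gt;A rough way to see the N=50 problem for Judge A, assuming CI width scales as 1/sqrt(N) and using non-overlap of the two intervals as a deliberately conservative detection rule. Both helper names are hypothetical, and the rescaled intervals are an approximation, not measured values.&lt;/p&gt;

```python
import math


def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) intervals share any point."""
    return ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]


def rescale_ci(ci, n_from, n_to):
    """Widen or narrow an interval assuming width scales as 1/sqrt(N)."""
    center = (ci[0] + ci[1]) / 2
    half_width = (ci[1] - ci[0]) / 2 * math.sqrt(n_from / n_to)
    return (center - half_width, center + half_width)


baseline = (0.54, 0.68)  # Judge A at launch, N=200
drifted = (0.30, 0.48)   # Judge A three weeks later, N=200

# At N=200 the intervals are disjoint, so the drop is visible.
print(intervals_overlap(baseline, drifted))  # False

# Approximate the same two measurements at N=50: now they overlap.
print(intervals_overlap(rescale_ci(baseline, 200, 50),
                        rescale_ci(drifted, 200, 50)))  # True
```

&lt;p&gt;Under that scaling assumption, the same 0.22-point drop that is unambiguous at N=200 cannot be separated from sampling noise at N=50.&lt;/p&gt;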

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;Kappa is a single-criterion metric. Production judges often score multiple criteria; per-criterion kappa with separate CIs is the right approach.&lt;/p&gt;

&lt;p&gt;Prevalence affects kappa variance. Stratified sampling helps. My Monte Carlo assumes balanced classes.&lt;/p&gt;
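&lt;p&gt;A minimal sketch of what stratifying the calibration set could look like, assuming each trace is a dict with a &lt;code&gt;judge_score&lt;/code&gt; field and that the judge's score class is the stratum. The record shape, the &lt;code&gt;stratified_sample&lt;/code&gt; helper, and the skewed pool are made up for illustration.&lt;/p&gt;

```python
import random
from collections import Counter, defaultdict


def stratified_sample(traces, stratum_of, per_stratum, seed=0):
    """Draw up to per_stratum traces from each stratum, without replacement."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for trace in traces:
        by_stratum[stratum_of(trace)].append(trace)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical pool of 1,000 traces, heavily skewed toward high scores.
scores = [1] * 20 + [2] * 30 + [3] * 100 + [4] * 350 + [5] * 500
traces = [{"id": i, "judge_score": s} for i, s in enumerate(scores)]

calibration = stratified_sample(
    traces, lambda t: t["judge_score"], per_stratum=10
)
# Every score class contributes 10 traces, regardless of its prevalence.
print(Counter(t["judge_score"] for t in calibration))
```

&lt;p&gt;A uniform random draw of 50 from this pool would contain about one trace with score 1 on average, which is the kind of week-to-week composition change that moves kappa without any change in the judge.&lt;/p&gt;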

&lt;p&gt;The bootstrap CI is approximate. For N less than 50, use Fleiss's closed form, or accept that you do not have enough data.&lt;/p&gt;

&lt;p&gt;This is about agreement, not validity. A judge can have high kappa with humans who are themselves wrong. Sara Hooker's writing on benchmark validity is the relevant prior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;The relationship between calibration set size and drift-detection sensitivity for production traces. My working hypothesis is that the smallest detectable drift scales as 1/sqrt(N), but I have not derived this formally.&lt;/p&gt;

&lt;p&gt;The right cadence for re-labeling. Weekly works in practice, but I have not seen a closed-form relationship between re-labeling cadence and model-update cadence written down.&lt;/p&gt;

&lt;p&gt;Cross-judge agreement as a partial substitute for human labels. The published literature is thin. Farquhar et al. 2024 is close but is about hallucination detection, not judge calibration. Zheng et al. (LMSYS) hints at this direction but does not run the experiment systematically. If anyone has a citation, I would appreciate it.&lt;/p&gt;

&lt;p&gt;The implication for benchmark validity. Most published LLM-as-judge benchmarks report kappa point estimates with sample sizes below what is required to detect 0.05 to 0.10-point differences between judges. The published rankings may be within sampling noise. The literature on this is not yet settled.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
