Maya Andersson

Posted on May 26

Your LLM-as-judge eval set is too small. Here is the math

#llm #machinelearning #ai #datascience

How many human-labeled examples do you need to calibrate an LLM-as-judge against humans on your task? The default answer most teams use is "enough," which usually means whatever they had time to label. That answer is wrong in a specific, mathematically tractable way.

The short version: if your judge has Cohen's kappa around 0.6 against humans and you want a 95% confidence interval no wider than 0.10, you need approximately 200 paired labels. If your judge has kappa around 0.4, you need approximately 400. Most production teams I have read about are using 50, which gives a CI width of 0.20 or wider in the same kappa range.

Method

Cohen's kappa (Cohen 1960) measures inter-rater agreement adjusted for chance. The classical interpretation thresholds (Landis & Koch 1977) treat 0.41 to 0.60 as "moderate" and 0.61 to 0.80 as "substantial."

The standard error of an estimated kappa, and with it the CI width, shrinks with sample size, but only as 1/sqrt(N). (The variance itself shrinks as 1/N.) For a fixed true kappa, doubling N narrows the CI by a factor of roughly sqrt(2). To halve the CI width, you need 4x the data.
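
That scaling gives a quick way to turn a pilot estimate into a labeling budget. A minimal sketch, assuming CI width is proportional to 1/sqrt(N) and the underlying kappa does not change between samples; `rescale_n` is a hypothetical helper name, not part of any library:

```python
import math

def rescale_n(n_current: int, width_current: float, width_target: float) -> int:
    """Approximate N needed to move a CI from width_current to width_target.

    Assumes width is proportional to 1 / sqrt(N) and kappa stays the same.
    """
    if min(n_current, width_current, width_target) <= 0:
        raise ValueError("all arguments must be positive")
    factor = (width_current / width_target) ** 2
    # round() guards against float error before rounding up
    return math.ceil(round(n_current * factor, 6))

print(rescale_n(50, 0.20, 0.10))  # 200: halving the width needs 4x the data
```

This only rescales an interval you have already measured; it says nothing about the absolute width at a given kappa.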

Here is a bootstrap-CI calculation:

import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_bootstrap_ci(judge_scores, human_scores,
                            n_resamples=2000, ci=0.95, seed=42):
    """Returns (point_estimate, (low, high)) percentile-bootstrap CI."""
    judge = np.asarray(judge_scores)
    human = np.asarray(human_scores)
    if judge.shape != human.shape:
        raise ValueError("judge_scores and human_scores must be paired")
    n = len(judge)

    point_estimate = cohen_kappa_score(judge, human)

    # Resample (judge, human) pairs together so the pairing is preserved.
    rng = np.random.default_rng(seed)
    resampled_kappas = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        resampled_kappas.append(cohen_kappa_score(judge[idx], human[idx]))

    # A resample that contains a single class can yield an undefined (NaN)
    # kappa; nanpercentile skips those instead of returning NaN bounds.
    alpha = 1 - ci
    low = np.nanpercentile(resampled_kappas, 100 * alpha / 2)
    high = np.nanpercentile(resampled_kappas, 100 * (1 - alpha / 2))
    return point_estimate, (low, high)

For paired comparison between two judges on the same examples, McNemar's test is the right statistic (not a re-application of kappa). The implementation:

from statsmodels.stats.contingency_tables import mcnemar

def compare_judges(judge_a_scores, judge_b_scores, human_scores):
    """Returns McNemar exact test p-value for whether judge A
    and judge B differ in their agreement-with-human rate."""
    a_correct = [a == h for a, h in zip(judge_a_scores, human_scores)]
    b_correct = [b == h for b, h in zip(judge_b_scores, human_scores)]
    # 2x2 contingency: both right, A only, B only, both wrong
    both_right = sum(a and b for a, b in zip(a_correct, b_correct))
    a_only = sum(a and not b for a, b in zip(a_correct, b_correct))
    b_only = sum(not a and b for a, b in zip(a_correct, b_correct))
    both_wrong = sum(not a and not b for a, b in zip(a_correct, b_correct))
    table = [[both_right, a_only], [b_only, both_wrong]]
    return mcnemar(table, exact=True).pvalue
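
To make explicit what the exact test looks at: only the discordant pairs (examples where exactly one judge matched the human) enter the p-value, and under the null hypothesis they should split 50/50. A minimal standard-library sketch of that two-sided binomial calculation follows. `mcnemar_exact_p` is a hypothetical name, and this is meant as an illustration of the idea rather than a replacement for the statsmodels call:

```python
from math import comb

def mcnemar_exact_p(a_only: int, b_only: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts only.

    a_only: examples where only judge A matched the human label.
    b_only: examples where only judge B matched the human label.
    """
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(mcnemar_exact_p(5, 5))   # 1.0
print(mcnemar_exact_p(0, 10))  # 0.001953125
```

Note that the counts for "both right" and "both wrong" never appear: two judges that agree with humans on 180 of 200 examples but disagree with each other on only a handful give the test very little to work with.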

The bounded sample size problem

The CI width is the quantity that determines whether a kappa estimate is operationally useful. A point estimate of 0.65 with CI [0.45, 0.85] gives almost no information: the interval spans everything from barely moderate to near-perfect agreement. A point estimate of 0.65 with CI [0.60, 0.70] tells you the judge sits reliably in the 0.60 to 0.80 band.

For production drift detection, you need CIs tight enough that drift is distinguishable from sampling noise. CI width below 0.10 detects 0.10-point drops reliably; CI width 0.20 does not.

True kappa	N for CI width 0.10	N for CI width 0.20
0.3	approximately 450	approximately 115
0.5	approximately 250	approximately 65
0.7	approximately 150	approximately 40
0.9	approximately 50	approximately 15

These are Monte Carlo estimates, not closed-form derivations. The closed-form large-sample variance (Fleiss 1981) also depends on prevalence and rater bias, that is, on each rater's marginal label distribution.
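
The settings behind the table are not given here, so the following is only a sketch of how such a simulation could be set up, not the one that produced the numbers above. It assumes binary labels, balanced classes, and a judge that flips the human label at a fixed rate. It measures the central 95% range of the kappa estimates across simulated samples. The function names are hypothetical, and the results may not match the table, since they depend on these assumptions and on how "CI width" is defined.

```python
import random

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)

def simulate_kappa_spread(true_kappa, n, n_sims=1000, seed=0):
    """Width of the central 95% range of kappa estimates at sample size n."""
    rng = random.Random(seed)
    # Assumption: balanced classes and symmetric errors, so chance agreement
    # is 0.5 and kappa = 2 * p_agree - 1.
    p_agree = (1 + true_kappa) / 2
    estimates = []
    for _ in range(n_sims):
        human = [rng.random() < 0.5 for _ in range(n)]
        judge = [h if rng.random() < p_agree else (not h) for h in human]
        k = cohen_kappa(human, judge)
        if k == k:  # drop NaN estimates
            estimates.append(k)
    estimates.sort()
    low = estimates[int(0.025 * len(estimates))]
    high = estimates[int(0.975 * len(estimates)) - 1]
    return high - low

for n in (50, 200, 800):
    print(n, round(simulate_kappa_spread(0.6, n, n_sims=300), 3))
```

Swapping in your own class prevalence and error pattern is the main reason to run something like this yourself rather than rely on a generic lookup.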

What N to actually use

import math

def recommend_n(target_kappa: float,
                target_ci_width: float = 0.1) -> int:
    """Lookup from Monte Carlo simulation; not a closed form.

    N scales as 1 / target_ci_width**2; the multipliers reproduce
    the table above (e.g. kappa 0.7, width 0.10 -> 150)."""
    if target_kappa >= 0.85:
        multiplier = 0.5
    elif target_kappa >= 0.65:
        multiplier = 1.5
    elif target_kappa >= 0.45:
        multiplier = 2.5
    else:
        multiplier = 4.5
    # round() guards against float error before rounding up
    return math.ceil(round(multiplier / target_ci_width**2, 6))

If you do not know your judge's kappa yet, start with N=200 for initial calibration. Re-estimate the required N based on observed kappa and label more if you came in low.

Three production judges, three decisions

Judge A (refund agent factual accuracy). Initial N=200. Observed kappa 0.61 [CI 0.54, 0.68]. After 3 weeks in production, kappa on a fresh 200-example sample dropped to 0.39 [CI 0.30, 0.48]. Distribution shift on the input. The drop was detectable because both CIs were tight.

Judge B (customer-support tone scoring). Initial N=200, observed kappa 0.72 [CI 0.67, 0.78]. Stable across two months.

Judge C (code-review quality scoring). Initial N=200, observed kappa 0.31 [CI 0.22, 0.40]. Too low to use. Reverted to human-only review.

If I had used N=50, two of three decisions would have been ambiguous.
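
The Judge A call can be written as a simple interval check. A minimal sketch, assuming "drift" means the new 95% CI lies entirely below the baseline CI. This is a conservative rule of thumb rather than a formal test, and `drift_detected` is a hypothetical name:

```python
def drift_detected(baseline_ci, current_ci):
    """True if the current CI lies entirely below the baseline CI."""
    base_low, _ = baseline_ci
    _, cur_high = current_ci
    return cur_high < base_low

# Judge A: baseline [0.54, 0.68], three weeks later [0.30, 0.48]
print(drift_detected((0.54, 0.68), (0.30, 0.48)))  # True

# Same follow-up against a hypothetical wide baseline such as [0.45, 0.85]
print(drift_detected((0.45, 0.85), (0.30, 0.48)))  # False
```

With the wide baseline the same drop cannot be separated from sampling noise, which is the practical cost of a small calibration set.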

Limitations

Kappa is a single-criterion metric. Production judges often score multiple criteria; per-criterion kappa with separate CIs is the right approach.

Prevalence affects kappa variance. Stratified sampling helps. My Monte Carlo assumes balanced classes.

The bootstrap CI is approximate. For N less than 50, use Fleiss's closed form, or accept that you do not have enough data.

This is about agreement, not validity. A judge can have high kappa with humans who are themselves wrong. Sara Hooker's writing on benchmark validity is the relevant prior.

Open questions

The relationship between calibration set size and drift-detection sensitivity for production traces. My working hypothesis is sensitivity tracks 1 over sqrt(N), but I have not derived this formally.

The right cadence for re-labeling. Weekly works in practice, but I have not seen a closed-form relationship between re-labeling cadence and model-update cadence written down.

Cross-judge agreement as a partial substitute for human labels. The published literature is thin. Farquhar et al. 2024 is close but is about hallucination detection, not judge calibration. Zheng et al. (LMSYS) hints at this direction but does not run the experiment systematically. If anyone has a citation, I would appreciate it.

The implication for benchmark validity. Most published LLM-as-judge benchmarks report kappa point estimates with sample sizes below what is required to detect 0.05 to 0.10-point differences between judges. The published rankings may be within sampling noise. The literature on this is not yet settled.

Top comments (1)

Harjot Singh • May 31

This is a needed dose of statistics in a field that mostly eyeballs results. The trap is real: you run 20 examples, see 18 pass, declare "90%!" and ship - but the confidence interval on n=20 is so wide that your true rate could be 70% or 99%. Small eval sets produce numbers precise enough to feel trustworthy and noisy enough to be meaningless. People make ship/no-ship calls on differences that are pure sampling noise.

The practical upshot most teams miss: a prompt change that moves your score from 88% to 91% on a 50-item set hasn't necessarily improved anything - you need the sample size to detect the effect size you care about, or you're tuning on noise. Knowing the math (and that LLM-judge adds its OWN variance on top) is what separates real eval from theater. That statistical honesty is something I care about in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - verification only helps if the signal is real, not a small-n illusion. Excellent, rigorous post - rare to see the actual math. What sample size do you tell people is the realistic floor for a trustworthy LLM-judge eval? Curious what the math works out to in practice.