- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A team ships a prompt change. The eval set has 100 questions. The old prompt scored 78, the new prompt scores 82. Slack lights up; the numbers land in the deploy note; the change ships.
Two weeks later, customer support tickets are flat. The "win" was four examples flipping out of a hundred. With a sample that small, four is well inside the noise floor. The same prompt re-run on the same model on a different day moves by more than that. The team did not measure an improvement; they measured a coin flip and treated the side it landed on as a result.
This is the cheapest, most common eval mistake in production LLM work. The fix is not better judges or more sophisticated metrics. The fix is making the eval set large enough that a 4-point delta means something. That is a math problem with a known answer.
Why 100 examples is a coin flip
Pass-rate evals follow a binomial distribution. Each example is a Bernoulli trial: the answer is correct or it is not. The score is the sample proportion. The standard error of a proportion at n=100 and p=0.80 is about sqrt(0.80 * 0.20 / 100) = 0.04. A 95% confidence interval is roughly ±2 * SE = ±8 percentage points.
Read that again. With 100 examples at an 80% pass rate, the true pass rate sits between 72 and 88 with 95% confidence. A 4-point delta between two prompts is less than half the noise.
That is for a single run. To claim a delta between prompt A and prompt B, the confidence interval on the difference has to exclude zero, and the standard error of that difference combines the variance of both arms, so resolving a small delta takes more data than pinning down either score on its own.
```python
import math

def se_proportion(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

def ci_95(p: float, n: int) -> tuple[float, float]:
    se = se_proportion(p, n)
    return (p - 1.96 * se, p + 1.96 * se)

print(ci_95(0.80, 100))   # (~0.72, ~0.88)
print(ci_95(0.80, 1000))  # (~0.78, ~0.82)
print(ci_95(0.80, 5000))  # (~0.789, ~0.811)
```
At n=1000, the 95% interval shrinks to roughly ±2.5 points. At n=5000, ±1.1 points. The cost of an evaluation run scales linearly with n, but the precision of your verdict scales with sqrt(n). Pay for the samples or skip the claim.
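The same arithmetic applied to the delta itself, rather than to each arm separately, makes the two-arm point concrete. A minimal sketch; `ci_95_diff` is an illustrative helper, not part of any harness or library:

```python
import math

def ci_95_diff(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """95% CI on the difference p2 - p1 between two independent arms."""
    # The variances of both arms add, so the delta is noisier than either score alone.
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    delta = p2 - p1
    return (delta - 1.96 * se_diff, delta + 1.96 * se_diff)

print(ci_95_diff(0.78, 100, 0.82, 100))    # ~(-0.07, +0.15): sign of the delta unknown
print(ci_95_diff(0.78, 1000, 0.82, 1000))  # ~(+0.005, +0.075): barely resolved
```

At n=100 per arm, the interval on the 4-point delta spans everything from a 7-point regression to a 15-point win.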
The sample-size formula nobody runs
The number you actually want is: how many examples do I need to detect a delta of size Δ with significance α=0.05 and power 1−β=0.80. The closed-form for two-proportion comparison is:
```
n_per_arm ≈ ( z(α/2) * sqrt(2 * p̄ * (1 − p̄)) + z(β) * sqrt(p1*(1−p1) + p2*(1−p2)) )² / Δ²
```
where p̄ = (p1 + p2) / 2, Δ = |p2 − p1|, z(0.025) = 1.96, z(0.20) = 0.84. In code:
```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float,
              alpha: float = 0.05,
              power: float = 0.80) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    pooled = math.sqrt(2 * p_bar * (1 - p_bar))
    split = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    delta = abs(p2 - p1)
    n = ((z_alpha * pooled + z_beta * split) / delta) ** 2
    return math.ceil(n)

print(n_per_arm(0.78, 0.80))  # ~6,500
print(n_per_arm(0.78, 0.82))  # ~1,570
print(n_per_arm(0.78, 0.85))  # ~480
print(n_per_arm(0.78, 0.90))  # ~150
```
- A 2-point delta at the 80% range needs about 6,500 examples per arm to detect with 95% confidence and 80% power.
- A 4-point delta needs about 1,570 per arm.
- A 7-point delta needs about 480 per arm. That is the first number where a 500-example eval is honest.
- A 12-point delta needs about 150 per arm.
The 100-example eval from the opening can detect roughly a 14-point delta. Below that, every "win" is statistically a flip.
How many examples you need depends on the size of the change you want to catch. If you only care about catching regressions of 10 points or more, 250 examples is fine. If you want to catch a 2-point lift, you are in the thousands per arm or you are guessing.
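The formula also runs in reverse: given the eval set you already have, what is the smallest delta it can credibly detect? A coarse sketch using the `n_per_arm` function above; the 0.001 grid step is an arbitrary choice:

```python
def min_detectable_delta(p1: float, n: int,
                         alpha: float = 0.05,
                         power: float = 0.80) -> float:
    """Smallest lift over baseline p1 that n examples per arm can detect
    at the given alpha and power (coarse 0.1-point grid search)."""
    delta = 0.001
    while p1 + delta < 1.0:
        if n_per_arm(p1, p1 + delta, alpha, power) <= n:
            return delta
        delta += 0.001
    return float("nan")  # n is too small to resolve any delta below a 100% pass rate

print(round(min_detectable_delta(0.78, 100), 3))   # ~0.14
print(round(min_detectable_delta(0.78, 500), 3))   # ~0.07
print(round(min_detectable_delta(0.78, 2000), 3))  # ~0.036
```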
You do not actually need both arms full
Most prompt-change A/B tests are paired: the same eval question runs through both prompts. That structure gives you a sharper test for free. The relevant statistic is McNemar's test on the discordant pairs (cases where one prompt passed and the other failed), not two independent proportions.
```python
import math
from statistics import NormalDist

def mcnemar_n(p_disc: float, delta_disc: float,
              alpha: float = 0.05,
              power: float = 0.80) -> int:
    """Sample size for a paired binary outcome.
    p_disc: total fraction of discordant pairs (one passes, one fails)
    delta_disc: difference between the two discordant cells
                (p10 - p01 in McNemar terms)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = ((z_alpha * math.sqrt(p_disc)
          + z_beta * math.sqrt(p_disc - delta_disc**2))
         / delta_disc) ** 2
    return math.ceil(n)

# 8% of pairs disagree, of which the new prompt wins 5% net
print(mcnemar_n(0.08, 0.05))  # ~250
```
For the same effect size, the paired design needs an order of magnitude fewer examples than two independent arms. If both prompts pass on the easy questions and both fail on the hard ones, those rows tell you nothing. The signal lives entirely in the discordant pairs. Counting the rest is wasted compute.
If the eval harness already runs both prompts on the same questions (and it should), the paired test is free. Switch to it.
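The test itself, not just its sample-size plan, is also only a few lines. A sketch of the exact two-sided McNemar test on paired pass/fail results; the `rows` example is made up to match the opening scenario's 78-vs-82 score:

```python
import math

def mcnemar_exact_p(results: list[tuple[bool, bool]]) -> float:
    """Exact two-sided McNemar test on paired (old_pass, new_pass) results.
    Only the discordant pairs carry signal."""
    b = sum(1 for old, new in results if new and not old)  # new passes, old fails
    c = sum(1 for old, new in results if old and not new)  # old passes, new fails
    n_disc = b + c
    if n_disc == 0:
        return 1.0  # no discordant pairs: no evidence either way
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(math.comb(n_disc, i) for i in range(min(b, c) + 1)) / 2 ** n_disc
    return min(1.0, 2 * tail)

# 100 questions, old scores 78, new scores 82:
# say 6 flipped to pass, 2 flipped to fail, 92 unchanged.
rows = ([(False, True)] * 6 + [(True, False)] * 2
        + [(True, True)] * 76 + [(False, False)] * 16)
print(mcnemar_exact_p(rows))  # ~0.29: the 4-point "win" is indistinguishable from noise
```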
Sequential testing: stop early when the answer is obvious
Running 6,000 examples per arm is expensive. If the first 800 already show a 10-point gap, the remaining 5,200 are wasted compute. Sequential testing lets you peek at the data and stop early without inflating the false-positive rate, provided you correct for the peeks.
The naive version is wrong: "I'll check after every 100 examples and stop when p < 0.05." Each peek is another shot at a false positive. Five peeks at α=0.05 each inflates the true α to roughly 0.14, nearly three times what you advertised, and it keeps climbing with every additional peek.
The correct version uses an alpha-spending function. Pocock and O'Brien-Fleming boundaries are the standard; O'Brien-Fleming has the shape used below: at the first interim look, require a much stricter p-value than 0.05; loosen it as more data comes in; the cumulative type-I error stays at 0.05 across all peeks.
```python
import math
from statistics import NormalDist

def obrien_fleming_threshold(k: int, K: int,
                             alpha: float = 0.05) -> float:
    """O'Brien-Fleming z-threshold at peek k of K total peeks.
    Stricter early, loose late. Returns z-critical.
    (Textbook approximation: the boundary scales as sqrt(K / k).)"""
    z_full = NormalDist().inv_cdf(1 - alpha / 2)
    return z_full * math.sqrt(K / k)

for k in range(1, 6):
    z = obrien_fleming_threshold(k, K=5)
    print(f"peek {k}/5: |z| > {z:.3f}")

# peek 1/5: |z| > 4.383
# peek 2/5: |z| > 3.099
# peek 3/5: |z| > 2.530
# peek 4/5: |z| > 2.191
# peek 5/5: |z| > 1.960
```
The first peek requires |z| > 4.4. That is a delta so large you would not need a test. The final peek is the usual 1.96. When the effect is real and large, the early peeks catch it. Small effects mean you run to the end. The overall false-positive rate stays at the α you advertised.
For LLM evals, this matters most when each example costs API tokens. A regression eval at every PR running 6,000 examples is a real cost line. Stopping at 1,500 when the answer is clearly negative saves the rest. The Optimizely sequential-testing glossary covers the practical theory; the statsmodels interim analysis tooling implements the standard boundaries if you want a library instead of rolling it.
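Wiring the boundary into a harness is a short loop. A sketch under two assumptions: the comparison is paired, so the z statistic comes from the discordant counts as in the McNemar section, and `run_batch` is a placeholder for whatever the harness uses to run one batch of paired examples and return those counts:

```python
import math

def sequential_eval(run_batch, batch_size: int = 300,
                    max_peeks: int = 5, alpha: float = 0.05):
    """Run the eval in batches; stop early only when the paired z statistic
    clears the O'Brien-Fleming threshold for that peek.
    run_batch(batch_size) -> (new_wins, old_wins) discordant counts."""
    b_total, c_total, z = 0, 0, 0.0
    for k in range(1, max_peeks + 1):
        b, c = run_batch(batch_size)
        b_total += b
        c_total += c
        if b_total + c_total > 0:
            z = (b_total - c_total) / math.sqrt(b_total + c_total)
        if abs(z) > obrien_fleming_threshold(k, max_peeks, alpha):
            return "stopped early", k, z  # clear result, skip the remaining batches
    return "ran to the end, no clear difference", max_peeks, z
```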
Stratify by query type or eat Simpson's paradox
The scariest result in eval analysis is the one where the new prompt wins the aggregate by 3 points, loses every individual subgroup, and ships anyway. This is Simpson's paradox, and it shows up in LLM evals because eval sets are usually mixtures: factual queries, math queries, refusal probes, multi-turn dialogues, code questions, summarization, and so on. Each subgroup has a different baseline accuracy and a different difficulty curve.
When the mix of query types behind the aggregate shifts between runs, even subtly (different sampling, dropped examples, judge failures concentrated in one type), the aggregate average can move in the opposite direction of every subgroup average. The classic epidemiology example (kidney stones, Charig et al., 1986) is identical in structure: treatment A wins overall, treatment B wins on small stones and on large stones separately. The aggregate flipped the conclusion.
The defense is not exotic. Stratify the eval set into well-defined query types, hold the per-type sample counts fixed across runs, and report per-stratum deltas alongside the aggregate. If three out of four strata regressed and one improved, the aggregate "win" is not a win.
```python
from collections import defaultdict

def stratified_summary(rows, p1_col="old", p2_col="new"):
    """rows: iterable of dicts with keys 'stratum', 'old', 'new' (0/1)."""
    by_stratum = defaultdict(list)
    for r in rows:
        by_stratum[r["stratum"]].append(r)
    out = []
    for s, rs in sorted(by_stratum.items()):
        n = len(rs)
        old = sum(r[p1_col] for r in rs) / n
        new = sum(r[p2_col] for r in rs) / n
        out.append({
            "stratum": s, "n": n,
            "old": old, "new": new, "delta": new - old,
        })
    return out
```
The harness should print per-stratum results before the aggregate. If the stratum table makes a reviewer flinch, do not call the aggregate a result.
The corollary for sample-size planning: budget examples per stratum, not just overall. A 2,000-example eval evenly split across 8 strata gives 250 per stratum. That is fine for catching 8-point per-stratum effects and useless for 2-point ones. If a specific query type is the one you are trying to move, oversample it.
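Budgeting per stratum is the same power calculation run once per query type. A sketch reusing `n_per_arm` from above; the strata, baselines, and minimum deltas are made-up numbers for illustration:

```python
# Hypothetical per-stratum baselines and the smallest delta worth catching in each.
strata = {
    "factual":  {"baseline": 0.85, "min_delta": 0.05},
    "math":     {"baseline": 0.60, "min_delta": 0.08},
    "refusals": {"baseline": 0.95, "min_delta": 0.03},
    "code":     {"baseline": 0.70, "min_delta": 0.05},
}

for name, cfg in strata.items():
    p1 = cfg["baseline"]
    p2 = p1 + cfg["min_delta"]
    print(f"{name:<10} needs ~{n_per_arm(p1, p2)} examples per arm")
```

The per-stratum budgets sum to the real size of the eval set, which is usually several times larger than an aggregate-only calculation suggests.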
A short checklist before you ship the deploy note
- Compute `n_per_arm` for the smallest delta you actually care to detect, before running the eval. If your eval set is smaller than that, the result cannot back the claim.
- Use the paired McNemar test when both prompts run on the same questions. It is one function call away and frees up an order of magnitude of compute.
- Sequential testing with O'Brien-Fleming boundaries lets you stop early on obvious wins or obvious losses without inflating false positives.
- Stratify by query type, report per-stratum deltas, and treat any aggregate win that hides a subgroup loss as a regression.
- Numbers in the deploy note get a confidence interval, not just a point estimate. `82% (95% CI: 80.0 – 84.0)` is honest; `82%` is a coin flip with branding. A formatting helper is sketched after this list.
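That last item is one small helper in the harness. A sketch; it reuses `ci_95` from the top of the post, and the normal approximation is fine at deploy-note sample sizes:

```python
def deploy_note_line(passes: int, n: int) -> str:
    """Format a pass rate with its 95% CI for the deploy note."""
    p = passes / n
    lo, hi = ci_95(p, n)  # ci_95 defined in the first snippet above
    return f"{p:.1%} (95% CI: {lo:.1%} – {hi:.1%})"

print(deploy_note_line(820, 1000))  # 82.0% (95% CI: 79.6% – 84.4%)
```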
The upfront cost is the power calculator, the strata, and CIs in the harness. After that, the team stops shipping noise as features.
If this was useful
The math above is a small slice of what the LLM Observability Pocket Guide covers: picking the right evals tooling, building a harness that scales past the prototype phase, and reading traces well enough to catch the bugs the dashboard hides. If the sample-size walk-through was useful, the book covers the same discipline at length: every claim you ship to a stakeholder gets a confidence interval behind it.
