DEV Community: Maya Andersson

Your eval dashboard has 30 metrics. When one "moves," that is usually arithmetic, not a regression.

Maya Andersson — Tue, 21 Jul 2026 14:39:53 +0000

Here is the ritual. You ship a prompt change, rerun the eval suite, and open the dashboard. Thirty numbers sit there: faithfulness, answer relevance, context precision, toxicity, latency-adjusted quality, and two dozen more. Twenty-nine are flat. One dropped from 0.86 to 0.81 and the cell is red. Someone says "we regressed on groundedness," and the next hour goes to explaining a number that never needed explaining.

I want to make the boring case that most of these red cells are not findings. They are what you get when you run many comparisons and let each one fire on its own. This has a name, the multiple comparisons problem, and it has arithmetic you can do on a napkin.

Claim 1: the false alarm rate is per-metric, and you have a lot of metrics

Set a threshold for "this metric moved" that would fire 5% of the time by chance when nothing actually changed. That is what a two-sided test at alpha = 0.05 means: a 5% false positive rate per test, under the null.

One metric, one test, 5% chance of a false alarm. Fine. But you are not looking at one metric. Assume for a moment the metrics are independent (they are not, and that matters below). The probability that at least one of n metrics throws a false alarm is:

P(at least one false positive) = 1 - (1 - alpha)^n

n = 1:   1 - 0.95^1  = 0.05    (5%)
n = 20:  1 - 0.95^20 = 0.64    (64%)
n = 30:  1 - 0.95^30 = 0.79    (79%)

At 30 metrics, a run where nothing changed has a 79% chance of showing you at least one red cell. The expected count is just n times alpha: 30 x 0.05 = 1.5 false alarms per clean run. So the "one metric moved" you are staring at is, on average, exactly what a no-op change produces. You did not find a regression. You found the metric that lost this round of a lottery you run every deploy.
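The napkin arithmetic above is easy to reproduce. This is a minimal sketch under the same independence assumption; the function names are mine, not from any library.

```python
def p_any_false_alarm(n_metrics: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across n independent tests under the null."""
    return 1.0 - (1.0 - alpha) ** n_metrics


def expected_false_alarms(n_metrics: int, alpha: float = 0.05) -> float:
    """Expected number of red cells on a run where nothing changed."""
    return n_metrics * alpha


for n in (1, 20, 30):
    print(f"n={n:>2}  P(any red cell)={p_any_false_alarm(n):.2f}  "
          f"expected red cells={expected_false_alarms(n):.2f}")
```

Swap in your own metric count and threshold to see what a clean run should look like on your board.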

Claim 2: your metrics are correlated, which does not save you

The obvious objection: eval metrics are not independent, so the formula above is wrong. Correct, it is wrong. It is not wrong in your favor.

Correlation changes the shape of the false-alarm distribution but not its center. The expected number of false positives is n times alpha regardless of dependence, because expectation is linear and does not care whether the tests move together. Correlation mostly changes the variance. When metrics are highly correlated, false alarms clump: a clean run tends to show either zero red cells or several at once, because the correlated metrics fail together. That is worse for interpretation, not better, because a cluster of three red cells looks like a "real pattern" and is the single most convincing way to fool yourself. That cluster is three correlated metrics failing on the same noise, not three independent findings.
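A small simulation makes the clumping visible. This is a sketch, not a model of any real eval suite: it assumes every metric's z-score shares one common noise factor, so any two metrics have the same pairwise correlation rho. The function name and the choice of rho = 0.8 are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, pvariance


def simulate_red_cells(n_metrics=30, rho=0.0, alpha=0.05, runs=10000, seed=0):
    """Count false alarms per clean run under an equicorrelated null (assumption)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided threshold
    a, b = rho ** 0.5, (1 - rho) ** 0.5            # weights: shared vs private noise
    counts = []
    for _ in range(runs):
        shared = rng.gauss(0, 1)                   # noise common to every metric
        counts.append(sum(
            abs(a * shared + b * rng.gauss(0, 1)) > z_crit
            for _ in range(n_metrics)
        ))
    return counts


results = {}
for rho in (0.0, 0.8):
    counts = simulate_red_cells(rho=rho)
    results[rho] = (mean(counts), pvariance(counts), counts.count(0) / len(counts))
    print(f"rho={rho}: mean red cells={results[rho][0]:.2f}  "
          f"variance={results[rho][1]:.2f}  share of runs with zero={results[rho][2]:.2f}")
```

The mean should sit near 1.5 in both cases. What changes is the spread: under correlation, most clean runs show nothing and the rest show several red cells at once.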

So you cannot correlation your way out of this. You have to correct for it.

Claim 3: the fix is a decision made before you look, not a harder stare after

There are two standard corrections, and they answer two different questions.

Bonferroni controls the family-wise error rate, the probability of even one false positive across the whole family. To hold that at 0.05 across 30 metrics, you test each at 0.05 / 30 = 0.00167. It is valid under any dependence between the metrics (it over-corrects when they are correlated) and it is strict. Strict is the point when a single false regression would block a release.

Bonferroni is often too strict when you genuinely track many metrics and can tolerate a few false alarms among your flagged ones. That is what the false discovery rate is for. Benjamini and Hochberg's 1995 procedure (Controlling the false discovery rate, Journal of the Royal Statistical Society Series B) controls the expected proportion of your flagged metrics that are false, rather than the chance of any false flag at all. The procedure is short: sort your n p-values ascending, find the largest k where p(k) <= (k / n) x q for your target rate q, and flag everything up to k. It gives you far more power than Bonferroni while still bounding how much of your red is noise. For a dashboard where you expect a handful of real movements among thirty metrics, FDR is usually the right tool. For a release gate where one false block is expensive, family-wise control is.
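Both corrections fit in a few lines. The sketch below follows the two procedures as described above; the thirty p-values are made up for illustration, and the function names are mine.

```python
def bonferroni(pvals, alpha=0.05):
    """Flag p-values at or below alpha / n (family-wise control)."""
    cutoff = alpha / len(pvals)
    return [p <= cutoff for p in pvals]


def benjamini_hochberg(pvals, q=0.05):
    """Flag the k smallest p-values, where k is the largest rank with p(k) <= (k / n) * q."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])   # indices, ascending by p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / n * q:
            k_max = rank
    flagged = [False] * n
    for i in order[:k_max]:
        flagged[i] = True
    return flagged


# Hypothetical p-values for 30 metrics: six small ones mixed into 24 unremarkable ones.
small = [0.0004, 0.0011, 0.0030, 0.0048, 0.021, 0.040]
rest = [round(0.10 + 0.035 * i, 3) for i in range(24)]
pvals = rest[:12] + small + rest[12:]

print("uncorrected at 0.05:", sum(p <= 0.05 for p in pvals))
print("Bonferroni         :", sum(bonferroni(pvals)))
print("Benjamini-Hochberg :", sum(benjamini_hochberg(pvals)))
```

On this made-up input the uncorrected threshold flags six metrics, Benjamini-Hochberg flags four, and Bonferroni flags two, which is the power ordering the text describes.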

Either way, the correction is a rule you commit to before opening the dashboard. The failure mode is picking the correction after you have already seen which cell is red, because by then you are choosing the test that lets you believe what you already decided.

Claim 4: separate the metric you are testing from the metrics you are watching

The cleaner fix is upstream of any correction. Most eval suites conflate two things:

Confirmatory metrics are the one or two you changed the prompt to improve. You have a directional hypothesis. You test those, at full alpha, and you are allowed to act on them.

Exploratory metrics are the other twenty-eight you keep on the board for monitoring. These do not get to trigger a "we regressed" conversation on a single run. They get watched over multiple runs, and a real regression in one shows up as a trend across several deploys, not a one-time red cell that is gone next Tuesday. A drop that reverts on the next run without any code change was regression to the mean, which is the second-most-common way a dashboard lies to you.

Decide which bucket each metric is in before the run. Two primary metrics tested at 0.05, twenty-eight secondary metrics under an FDR rule or simply held to a "two consecutive runs" bar. Now your false-alarm budget is one or two tests wide, not thirty.

The rule I actually use

Before the run, write down the one or two metrics the change is supposed to move, and the direction. Everything else is monitoring. For the monitoring set, do not react to a single red cell: apply an FDR correction across the set, or require the movement to survive a second run. When you must gate a release on a broad panel, use family-wise control and accept that you will miss small real effects, because at the gate a false block costs more than a missed 0.01.

None of this makes your eval more sensitive. It makes the eval account for how many comparisons you ran before a cell turned red.

FAQ

Does this mean my dashboard is useless?
No. A dashboard is a fine monitoring instrument. It is a bad hypothesis test. Use it to watch trends across runs, not to adjudicate a single deploy. The moment a single-run red cell triggers a decision, you have turned a monitor into a test without paying for the test.

If I only ever look at one metric, do I need any of this?
No. One pre-specified metric, one test, no correction needed. The problem is entirely created by the number of comparisons you let fire. Two or three metrics tied to a hypothesis tell you more than thirty scanned for whatever happened to move.

Bonferroni or Benjamini-Hochberg?
Bonferroni when a single false alarm is expensive, such as a release gate: it controls the chance of any false positive. Benjamini-Hochberg when you track many metrics and can tolerate a known fraction of false flags in exchange for catching more real ones. They optimize different things on purpose.

My eval scores are not p-values, they are just averages. Does this still apply?
Yes, and it is easier to abuse. A raw score delta with no notion of sampling variability is a comparison with an implicit, undocumented threshold. The multiplicity is still there; you have just hidden the alpha. Put a confidence interval on each delta and the correction problem becomes visible again.

Will correcting make me miss real regressions?
It will reduce power, yes. That is the trade. The answer is not to skip the correction, it is to keep the confirmatory set small so each test keeps full power, and let the large monitoring set be governed by trend, not by a single run.

Open question

Family-wise and FDR corrections both assume the family is well defined: you know how many tests you are running. On a live eval dashboard the family is open-ended. You add metrics over months, you rerun after every deploy, and you look at the board on days when nothing shipped. The true number of comparisons is not 30, it is 30 times every run times every glance. I do not have a clean way to define "the family" for a metric board that is queried continuously by a whole team. Sequential testing methods (alpha-spending, always-valid p-values) are the honest direction, but I have not seen them adapted well to eval suites where the metric set itself keeps growing. If you have a working formulation of the family for a continuously-watched dashboard, I would like to see it.

Your eval pass rate is 98 percent. Your confidence interval is probably wrong.

Maya Andersson — Thu, 16 Jul 2026 15:23:29 +0000

TL;DR. Almost every eval harness reports a pass rate with an error bar, and almost every one of those error bars comes from the normal approximation: p̂ plus or minus 1.96 times the square root of p̂(1 - p̂)/n. That formula is taught first, implemented everywhere, and reasonable near a pass rate of 50 percent. It falls apart at the extremes, which is precisely where any model worth shipping lives. At 49 of 50 passing it produces an upper bound of 1.0188, a probability above one. At 50 of 50 it produces the interval [1.0, 1.0], a claim of perfect certainty from fifty observations. Worse than either artifact: when the true pass rate is 98 percent and n is 50, its actual coverage is 63.5 percent. The reframing is that this is a solved problem, and has been since 1927. Invert the score test instead of approximating around the estimate, and you get the Wilson interval, which is one argument change in the library you already have installed.

The regression I shipped because I misread an interval

Three years ago I owned the eval suite for a document extraction pipeline. Fifty held-out documents, hand-labeled, each either extracted correctly or not. We were shipping a prompt change and the numbers looked clean: 49 of 50 passing, and the harness printed a 95 percent confidence interval of [0.941, 1.000].

I read the lower bound and did what most people do with a lower bound. I treated it as the pessimistic case. I told the stakeholder that the worst plausible outcome was about 94 percent, that we would be fine at 94 percent, and we shipped on a Thursday.

Production accuracy landed near 91 percent. That was outside the interval I had quoted, and I spent a week looking for the distribution shift that explained it. There was a small one. It was not the story. The story was that my interval had no business excluding 91 percent in the first place. Running the same 49 of 50 through the Wilson interval returns [0.895, 0.996]. The lower bound is 89.5 percent, not 94.1 percent. Wilson's interval contained the truth. Mine did not, and the gap between the two was not a rounding difference. It was 4.6 percentage points of false confidence, manufactured by a formula that should not have been applied to that data.

The distribution shift got the postmortem. The interval got nothing, because nobody in the room, including me, thought of the error bar as a thing that could itself be wrong. That is the failure mode I want to argue against here, criterion by criterion.

What the Wald interval is, and where it comes from

The estimator is uncontroversial. With k passes out of n cases, p̂ = k/n.

The interval most harnesses report is the Wald interval:

p̂ ± z · sqrt( p̂(1 - p̂) / n )

The derivation is a normal approximation to the binomial, with one extra move that does the damage: the standard error is evaluated at p̂, the number you happened to observe, rather than at p, the parameter you are trying to bracket. That substitution is harmless when p̂ is near 0.5 and n is large. Near the boundaries it is not harmless at all, because the standard error sqrt(p(1-p)/n) is itself a function of p that collapses to zero as p approaches 1. Plug in p̂ = 1.0 and the formula obediently reports that it has no uncertainty.

So the question is not whether the Wald interval is defensible in general. It is whether it is defensible in the regime where eval results actually land. Below are five criteria any interval estimator should satisfy, and how Wald does against each.

Criterion 1: the interval must stay inside the parameter space

A pass rate is a proportion. It lives in [0, 1]. An interval that includes values above 1 is not reporting uncertainty about a proportion, it is reporting that the model has been arithmetically mangled.

At 49 of 50, the Wald interval computes to [0.9412, 1.0188]. The upper bound is a probability of 1.0188.

Here is the part that keeps this from being obvious, and the reason I did not catch it for years. Your library hides it. In statsmodels, the proportion_confint function ends with an explicit clip for exactly two methods:

if method in ["normal", "agresti_coull"]:
    ci_low = np.clip(ci_low, 0, 1)
    ci_upp = np.clip(ci_upp, 0, 1)

So proportion_confint(49, 50, method="normal") returns [0.9412, 1.0000], and the 1.0188 never reaches your logs. The docstring says so plainly, and nobody reads it, because who reads the docstring for a confidence interval.

The clip is cosmetic, and I can show that precisely: clipping the upper bound from 1.0188 down to 1.0 only removes values in the range (1.0, 1.0188] from the interval, and no true proportion can live there. The set of true values the interval covers is identical before and after the clip. Coverage does not move by a single decimal place. What the clip accomplishes is that a number which would have announced the method's failure now looks like an ordinary, tidy upper bound of 1.0.

Criterion 2: the interval must not vanish when the data is unanimous

Run a perfect eval. 50 of 50. The Wald interval is [1.0, 1.0], width zero.

Read that as an epistemic claim and it says: having observed fifty successes, I am now certain, to the exclusion of all alternatives, that this system never fails. Not "very likely above 95 percent". Certain. A zero-width interval rules out a true rate of 0.999 just as firmly as it rules out 0.5.

Fifty of fifty is a perfectly ordinary eval outcome. It is also the outcome where the Wald interval fails hardest, and the failure has a sign: it always errs toward overconfidence, never toward caution.

Wilson at 50 of 50 returns [0.9287, 1.0]. Clopper-Pearson returns [0.9289, 1.0]. Both say the sensible thing: fifty consecutive passes is real evidence, it is consistent with a true rate around 93 percent, and you cannot rule out roughly a 1-in-14 failure rate on this evidence.

This connects to a heuristic some readers will know, the "rule of three": with zero failures in n trials, the upper bound on the failure rate is about 3/n. At n = 50 that gives 6 percent, implying a lower bound near 94 percent, which is close to Wilson's 92.9 percent but not equal to it. The reason is that the rule of three is the one-sided 95 percent bound. I checked: 1 - 0.05^(1/50) = 0.0582, and 3/50 = 0.06. The two-sided version needs 0.025 in each tail, which gives 1 - 0.025^(1/50) = 0.0711, closer to 3.7/n. Useful heuristic, commonly misquoted by half a tail.
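The one-sided versus two-sided arithmetic can be checked directly. This is a minimal sketch with a helper name of my own choosing: with zero failures in n trials, the exact upper bound on the failure rate f solves (1 - f)^n = tail.

```python
from math import log


def zero_failure_upper_bound(n: int, tail: float) -> float:
    """Upper bound on the failure rate after n passes and no failures."""
    return 1.0 - tail ** (1.0 / n)


n = 50
one_sided = zero_failure_upper_bound(n, 0.05)    # what the rule of three approximates
two_sided = zero_failure_upper_bound(n, 0.025)   # upper end of a two-sided 95% interval

print(f"one-sided bound: {one_sided:.4f}  (rule of three: {3 / n:.4f})")
print(f"two-sided bound: {two_sided:.4f}")
# Large-n numerators: -ln(0.05) is about 3.0, -ln(0.025) is about 3.7
print(f"asymptotic numerators: {-log(0.05):.2f} and {-log(0.025):.2f}")
```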

Criterion 3: nominal coverage must resemble actual coverage

This is the criterion that matters, and the one nobody checks.

A 95 percent confidence interval makes a frequentist promise: across repeated experiments, the interval contains the true parameter 95 percent of the time. That promise is testable. The binomial has finitely many outcomes, so you do not even need simulation. Enumerate every k from 0 to n, ask whether the interval built from that k contains the true p, and weight by the binomial probability of observing that k. The answer is exact.

I ran that enumeration. At n = 50, nominal 95 percent:

true p	Wald	Wilson	Clopper-Pearson
0.50	0.935	0.935	0.967
0.80	0.938	0.951	0.967
0.90	0.879	0.970	0.970
0.95	0.920	0.962	0.988
0.98	0.635	0.922	0.982
0.99	0.395	0.911	0.986

At a true pass rate of 98 percent, an interval labeled "95 percent confidence" contains the truth 63.5 percent of the time. At 99 percent, it manages 39.5 percent. Those are not slightly optimistic numbers. An interval that misses the parameter more than a third of the time is not delivering what its label promises, and at 99 percent it misses more often than it hits.

Note the shape of the failure. Near p = 0.5 the Wald interval is fine (0.935, close enough to nominal). The degradation is not strictly monotone in how good your model is (in this table 0.90 covers worse than 0.95, the same lattice effect discussed under Criterion 4), but the trend is unmistakable: the collapse happens at the top of the range. The better the system under test, the more the interval lies, and it lies in the direction of telling you the system is more reliable than it is. Any team whose models improve over time is walking into this, and the error bar gets quieter about it every quarter.

None of this is news to statisticians. Brown, Cai and DasGupta laid it out in 2001 in Statistical Science, in a paper whose abstract describes the Wald interval's coverage as "erratic" and "chaotic," and states that "common textbook prescriptions regarding its safety are misleading and defective" (Brown, Cai and DasGupta, 2001, Statistical Science 16(2), 101 to 133). They recommend Wilson or the equal-tailed Jeffreys interval for small n, and Agresti-Coull for larger n. The paper is 25 years old. The formula it warns about is still the default in most eval code I read.

Criterion 4: more data must not make the interval worse

The standard defense of Wald is that it is asymptotic, so just use enough data. There are rules of thumb attached: n ≥ 30, or np ≥ 5, or np ≥ 10.

Test the rule. At p = 0.98, the binding constraint is the failure count, n(1 - p), because that is the small one. The np ≥ 5 rule, applied to failures, demands n ≥ 250.

At n = 250 and p = 0.98, Wald coverage is 0.873.

The rule is satisfied, and the interval is still nowhere near 95 percent. So the rule of thumb does not work, which is what "misleading and defective" meant.

It gets less comfortable. Coverage is not even monotone in n. Holding p = 0.98 fixed and computing exact coverage as n grows:

n	Wald coverage
125	0.916
142	0.941
150	0.800
225	0.935
250	0.873

Adding eight test cases, from 142 to 150, drops coverage from 0.941 to 0.800. More data, worse interval. I verified each of these two ways, by exact enumeration over the binomial and by a one-million-replication Monte Carlo, and they agree to three decimals (the Monte Carlo returns 0.9408, 0.8003 and 0.8732 for n = 142, 150 and 250).
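The sawtooth can be reproduced without statsmodels or scipy. This is a standard-library sketch of the exact enumeration for the unclipped Wald interval; the function name is mine.

```python
from math import comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value


def wald_exact_coverage(n: int, p: float) -> float:
    """Exact coverage of the Wald interval: sum the binomial probability
    of every k whose interval contains the true p."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        half = Z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half <= p <= p_hat + half:
            total += comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total


for n in (125, 142, 150, 225, 250):
    print(f"n={n:>3}  Wald coverage at p=0.98: {wald_exact_coverage(n, 0.98):.3f}")
```

Sweeping a full range, for example range(50, 601), and plotting the result shows the oscillation more clearly than any five rows can.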

The mechanism is not mysterious once you see it. k is discrete. As n changes, the lattice of achievable k values slides relative to the true p, and whether p sits inside the interval built from the most probable k flips on and off. That produces a sawtooth, which is why Brown, Cai and DasGupta call the behavior chaotic rather than merely biased. There is no threshold n above which you are safe, because the function you would be thresholding oscillates.

Which brings me to the honest caveat, and I would rather state it than have it found. Wilson oscillates too. Every interval built from discrete data does. Sweeping n from 50 to 600 at p = 0.98, Wilson's coverage ranges from 0.918 to 0.980, while Wald's ranges from 0.635 to 0.951. The difference is that Wilson's oscillation is centered near the nominal level, so its errors go in both directions and stay small. Wald's ceiling across that entire sweep is 0.951. It essentially never over-delivers, and its floor is 0.635. Wilson is not exact. It is well-behaved, which is a different and more achievable property.

Criterion 5: the interval should be asymmetric when the estimate is near a boundary

Wald intervals are symmetric by construction, because the formula is "estimate plus or minus a fixed quantity." Near a boundary, symmetry is the wrong shape.

At 49 of 50, the truth cannot be far above 0.98, since only 1.0 is available up there. It can quite easily be below, because 0.95 and 0.93 and 0.91 all produce 49 of 50 with unremarkable probability. The interval should be lopsided: short on the top, long on the bottom.

Wilson does this automatically. At 49 of 50 it returns [0.8950, 0.9965], which extends 0.085 below the point estimate and 0.0165 above it, roughly five times more room downward than upward. That asymmetry is not a patch bolted onto the formula. It falls out of the derivation, because Wilson inverts the score test: rather than approximating the standard error at p̂, it asks which values of p would fail to be rejected, evaluating the standard error at each candidate p. Solving that quadratic in p gives

center = (p̂ + z²/(2n)) / (1 + z²/n)

half-width = (z / (1 + z²/n)) · sqrt( p̂(1 - p̂)/n + z²/(4n²) )

Notice the center is not p̂. It is p̂ pulled toward 0.5, which is why the interval never escapes [0, 1] and never collapses at p̂ = 1: the z²/(4n²) term inside the square root keeps the width positive even when p̂(1 - p̂) is exactly zero. Wilson published this in 1927 (Wilson, 1927, Journal of the American Statistical Association 22(158), 209 to 212). It predates the eval harness by about ninety years.
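The two formulas translate directly into code. This is a hand-rolled sketch using only the standard library, with a function name of my own; in practice the library call does the same job.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(k: int, n: int, confidence: float = 0.95):
    """Wilson score interval for k successes in n trials, from the center
    and half-width formulas."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return center - half_width, center + half_width


for k in (49, 50):
    lo, hi = wilson_interval(k, 50)
    print(f"{k}/50: [{lo:.4f}, {hi:.4f}]")
```

At 50 of 50 the p̂(1 - p̂) term is zero and the whole width comes from the z² term. That is the mechanism that keeps the interval from collapsing.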

The code

Pasteable, and the output below is what it actually prints:

# pip install statsmodels scipy
import numpy as np
from scipy.stats import binom, norm
from statsmodels.stats.proportion import proportion_confint

Z = norm.ppf(0.975)  # 1.959963...


def wald_by_hand(k, n):
    """The textbook normal-approximation interval, unclipped."""
    p = k / n
    se = np.sqrt(p * (1 - p) / n)
    return p - Z * se, p + Z * se


def show(k, n):
    print(f"\n{k}/{n} passing  (point estimate = {k/n:.3f})")
    lo, hi = wald_by_hand(k, n)
    print(f"  Wald, by hand      : [{lo:.4f}, {hi:.4f}]   width={hi-lo:.4f}")
    lo, hi = proportion_confint(k, n, alpha=0.05, method="normal")
    print(f"  Wald, statsmodels  : [{lo:.4f}, {hi:.4f}]   width={hi-lo:.4f}")
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"  Wilson             : [{lo:.4f}, {hi:.4f}]   width={hi-lo:.4f}")
    lo, hi = proportion_confint(k, n, alpha=0.05, method="beta")
    print(f"  Clopper-Pearson    : [{lo:.4f}, {hi:.4f}]   width={hi-lo:.4f}")


show(49, 50)
show(50, 50)


def exact_coverage(n, true_p, method):
    """Actual coverage, computed by enumerating every possible k.
    No simulation: the binomial has finitely many outcomes."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = (wald_by_hand(k, n) if method == "wald"
                  else proportion_confint(k, n, alpha=0.05, method=method))
        if lo <= true_p <= hi:
            total += binom.pmf(k, n, true_p)
    return total


print("\n\nActual coverage of a nominal 95% interval, n=50")
print(f"{'true p':>8} {'Wald':>8} {'Wilson':>8} {'Clopper':>9}")
for p in [0.50, 0.80, 0.90, 0.95, 0.98, 0.99]:
    print(f"{p:>8.2f} {exact_coverage(50, p, 'wald'):>8.3f} "
          f"{exact_coverage(50, p, 'wilson'):>8.3f} "
          f"{exact_coverage(50, p, 'beta'):>9.3f}")

Output:

49/50 passing  (point estimate = 0.980)
  Wald, by hand      : [0.9412, 1.0188]   width=0.0776
  Wald, statsmodels  : [0.9412, 1.0000]   width=0.0588
  Wilson             : [0.8950, 0.9965]   width=0.1014
  Clopper-Pearson    : [0.8935, 0.9995]   width=0.1060

50/50 passing  (point estimate = 1.000)
  Wald, by hand      : [1.0000, 1.0000]   width=0.0000
  Wald, statsmodels  : [1.0000, 1.0000]   width=0.0000
  Wilson             : [0.9287, 1.0000]   width=0.0713
  Clopper-Pearson    : [0.9289, 1.0000]   width=0.0711


Actual coverage of a nominal 95% interval, n=50
  true p     Wald   Wilson   Clopper
    0.50    0.935    0.935     0.967
    0.80    0.938    0.951     0.967
    0.90    0.879    0.970     0.970
    0.95    0.920    0.962     0.988
    0.98    0.635    0.922     0.982
    0.99    0.395    0.911     0.986

Two library notes. In statsmodels, method="beta" is Clopper-Pearson (it is named for the Beta distribution used to compute it, which is a naming choice that has cost me time). In scipy, the same intervals are available as binomtest(k, n).proportion_ci(method="wilson") and method="exact" for Clopper-Pearson. I checked both libraries against each other and against the hand-rolled Wilson formula above; all three agree to machine precision.

What I do now

Wilson is the default. One keyword argument, no new dependency, correct behavior at the boundary, coverage that stays near nominal across the range where eval results actually land.

Clopper-Pearson when someone external is going to rely on the number, or when the cost of overstating reliability is asymmetric (safety filters, anything with a compliance story attached). It is called "exact" because it inverts the binomial test directly rather than approximating, but note from the table that its coverage runs to 0.982 and 0.986 where nominal is 0.95. It is conservative, and its intervals are wider than they strictly need to be. That is a deliberate trade, not a free upgrade. Agresti and Coull made this argument in 1998, in a paper titled "Approximate Is Better than 'Exact' for Interval Estimation of Binomial Proportions" (Agresti and Coull, 1998, The American Statistician 52(2), 119 to 126), and their point stands: guaranteed-conservative is not the same as accurate, and if you want coverage near 95 percent rather than above it, the approximate methods do better.

Report the interval, not the point estimate. "98 percent" and "98 percent, 95 percent CI [89.5, 99.6]" are the same measurement, and only one of them communicates that fifty test cases is fifty test cases. If the interval you get is too wide to support the decision you are making, the answer is more test cases, and the width tells you roughly how many. I worked that arithmetic in a separate piece on eval-set size, so I will not repeat it here.

And stop reading the lower bound as the pessimistic case. It is a bound on a plausible range, not a floor. I have made that mistake in production and it cost me a week of looking for a distribution shift that explained three of the seven points I was missing.

FAQ

Does this matter if my pass rate is around 70 percent?

Much less. The table above has no row for 0.70, but the rows on either side bracket it: at n = 50 and p = 0.80, exact Wald coverage is 0.938 against a nominal 0.95, which is a real but tolerable error, and at p = 0.50 it is 0.935. The Wald interval was designed for this regime and behaves acceptably in it. The problem is that a 70 percent pass rate is usually a system you are still fixing, not one you are reporting on, and by the time you are writing the number in a document it has moved to 95 percent or higher. The method degrades precisely as your project succeeds, which is an unfortunate property for a measurement tool.

Is Wilson always better than Wald?

Not always, and I would rather not oversell it. At p = 0.50 and n = 50 they return identical coverage of 0.935, because near the center the two derivations nearly coincide. Wilson's advantage appears at the extremes and at small n, and it grows as you move toward the boundary. There is no regime I am aware of where Wilson is meaningfully worse, so "always use Wilson" is a defensible default even though "Wilson is always better" is too strong a claim. The honest version: Wilson is never worse in any way that matters, and is dramatically better exactly where you need it.

Why not just clip the Wald interval to [0, 1] and move on?

Because clipping changes nothing that matters. I showed this above: truncating an upper bound of 1.0188 to 1.0 only excludes values above 1.0 from the interval, and no true proportion lives there. The coverage is bit-for-bit identical before and after. What clipping does is remove the visible symptom, the absurd number that would have prompted you to check the method. statsmodels clips by default for normal and agresti_coull, which means the most common way to compute a Wald interval in Python is also the way that hides its most obvious failure.

What about the 50 of 50 case specifically? Is any interval sensible there?

Wilson and Clopper-Pearson both handle it, returning [0.9287, 1.0] and [0.9289, 1.0] respectively at n = 50. Both are saying the same reasonable thing: fifty consecutive passes is genuine evidence of a high rate, and it is also consistent with a true rate near 93 percent. Wald returns [1.0, 1.0], claiming certainty. If your eval dashboard has ever shown a perfect score with no error bar, or an error bar of zero width, this is why, and it is worth grepping your harness for.

Is Clopper-Pearson the safe choice since it is "exact"?

"Exact" describes the construction, not the coverage. Clopper-Pearson inverts the binomial test directly instead of approximating, which guarantees coverage of at least 95 percent. It overshoots: 0.982 at p = 0.98 and 0.986 at p = 0.99 in the table above, against a nominal 0.95. You are paying for that guarantee with width, and wide intervals have their own cost, since an interval too wide to distinguish two candidate models is not helping you decide anything. Use it when understating reliability is cheaper than overstating it. Otherwise Wilson.

Does the same problem hit the standard error I report on a per-metric basis?

Yes, for anything that is fundamentally a count of successes over trials. Pass rate, exact-match accuracy, tool-call validity, refusal rate, any binary judge verdict aggregated across a test set. If the reported number is k/n and the error bar is p̂ ± z·sqrt(p̂(1-p̂)/n), the analysis here applies unchanged. It does not apply to metrics that are means of continuous scores, like a 1-to-5 rating averaged across cases, which have a different and generally better-behaved sampling distribution.

What should I actually change in my code tomorrow?

Search for sqrt(p * (1 - p) / n) and for proportion_confint( without an explicit method argument, since statsmodels defaults to method="normal". Both are the Wald interval. Pass method="wilson" explicitly at every call site. That is the whole migration. If you have historical eval numbers with published intervals, the point estimates are unaffected and only the intervals move, so a backfill is cheap and mostly pushes old lower bounds down.

Open question

The thing I have not resolved is what to do when the eval set is not a random sample, which is most of the time.

Everything above assumes n independent Bernoulli draws from a fixed distribution. Real eval sets are curated. We add cases because they broke something, we keep cases because they are hard, we group cases by document or by conversation so that the units are correlated rather than independent. Under curation the binomial model is the wrong likelihood, and an interval derived from it, Wilson included, is answering a question about a population that does not exist. Wilson gives you a correct interval for the wrong model. That is an improvement over Wald, which gives you an incorrect interval for the wrong model, but I notice it is a smaller improvement than this whole essay implies.

I have seen three responses. Treat the eval set as the entire population and report no interval at all, which is honest but gives up on generalization. Cluster-bootstrap at the document or conversation level, which handles the correlation but not the curation. Maintain a separate uncurated random sample purely for estimation, which is correct and which I have never seen anyone actually staff.
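For the second response, a cluster bootstrap is short enough to sketch. This is a percentile bootstrap that resamples whole documents rather than individual cases. The data layout (a dict of document id to pass/fail lists), the function name, and the toy data are all hypothetical. As noted above, it addresses within-document correlation only, not curation.

```python
import random


def cluster_bootstrap_ci(outcomes_by_cluster, n_boot=5000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for the pass rate, resampling clusters with replacement."""
    rng = random.Random(seed)
    clusters = list(outcomes_by_cluster.values())
    rates = []
    for _ in range(n_boot):
        sample = rng.choices(clusters, k=len(clusters))   # draw whole documents
        passes = sum(sum(c) for c in sample)
        total = sum(len(c) for c in sample)
        rates.append(passes / total)
    rates.sort()
    tail = (1 - confidence) / 2
    return rates[int(tail * n_boot)], rates[int((1 - tail) * n_boot) - 1]


# Hypothetical eval set: 10 documents, 5 cases each, failures concentrated in two documents.
docs = {f"doc_{i}": [1, 1, 1, 1, 1] for i in range(8)}
docs["doc_8"] = [1, 1, 0, 0, 0]
docs["doc_9"] = [1, 1, 1, 1, 0]

point = sum(sum(c) for c in docs.values()) / sum(len(c) for c in docs.values())
lo, hi = cluster_bootstrap_ci(docs)
print(f"pass rate = {point:.2f}, cluster-bootstrap 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Because failures here cluster inside documents, the effective sample size is closer to the ten documents than to the fifty cases, and the interval reflects that.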

I do not have a defensible fourth answer. If you have shipped one, I would like to read it.

Comparing Two Eval Runs by Their Average Pass Rate Is the Wrong Test

Maya Andersson — Tue, 14 Jul 2026 16:54:09 +0000

TL;DR. You run version A and version B against the same 500-item eval set. A passes 71.4 percent, B passes 74.0 percent, and you conclude B is better. That reasoning throws away the one fact that matters most: both systems answered the same questions, so their per-item outcomes are correlated, not independent. Reading two separate averages (or eyeballing whether their confidence intervals overlap) is the wrong test for a same-items design. The fix is to pair the outcomes per item and test the difference directly. For pass/fail, that is McNemar's test on the items where the two runs disagree. For graded scores, it is a bootstrap over the per-item deltas. Report an effect size and a confidence interval on the delta, not two averages sitting next to each other.

I have shipped a regression because two dashboard numbers looked close enough. The new prompt scored 68.9 percent, the old one 69.7 percent, and I called it noise and moved on. It was not noise. The new prompt was quietly worse on a specific slice, and pairing the runs would have shown me a tight interval that sat entirely below zero. This has practical stakes, not just cosmetic ones: the unpaired reading can hide a regression that the paired reading would surface.

The rest of this post is one claim, argued criterion by criterion: when two eval runs share the same items, you owe them a paired analysis, and the paired analysis is not hard.

The setup, and why it looks reasonable

Here is the pattern I see in almost every eval report. There is a table with two rows. Row one is the baseline, row two is the candidate, and each row has a single number: mean pass rate, or mean rubric score, or mean judge score. Sometimes there is a 95 percent confidence interval on each row, drawn as a little error bar. The reader compares the two numbers, glances at whether the error bars overlap, and makes a call.

The instinct is not stupid. A mean is a legitimate summary, and an error bar is more honest than a bare point. The problem is narrower and more specific: the two error bars in that picture are each computed as if that run stood alone, and the comparison you actually care about (is B better than A) is a comparison the picture does not draw. The uncertainty you need is the uncertainty of the difference, and the difference has its own variance that depends on how the two runs move together across items.

That last phrase is the whole argument, so here it is in concrete terms.

Criterion 1: Same items means paired data, not two independent samples

Some prompts in your eval set are simply harder than others. A gnarly multi-hop question, an ambiguous instruction, a long context that buries the answer. Both A and B face that same hard prompt. When A fails it, B often fails it too, because the difficulty is a property of the item, shared by both systems.

Statistically, that shared difficulty induces a positive correlation between A's per-item outcome and B's per-item outcome. And the variance of a difference is not the sum of the variances when the two things are correlated:

Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B).

When Cov(A, B) is positive (and with a shared test set it usually is), the variance of the delta is smaller than what you would get by treating the runs as independent. Two consequences follow, and they point in opposite directions depending on which mistake you make.

If you run an unpaired two-proportion test (the kind that assumes independence), you plug in a standard error that ignores that covariance term. You typically overestimate the uncertainty of the difference, which makes you underpowered: you fail to detect improvements that are actually there. If instead you eyeball two marginal confidence intervals and check for overlap, you hit a separate and well-documented trap. Two intervals can overlap while the paired difference is comfortably significant, because the overlap of marginal intervals is not the same question as whether the interval on the difference excludes zero.
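The variance identity above can be checked on a concrete table. The following is a minimal sketch, not a prescribed procedure: the four counts are hypothetical (chosen so the marginals come out to 71.4 and 74.0 percent on 500 items), and it uses population variances so the identity holds exactly.

```python
from statistics import mean, pvariance

# Hypothetical paired outcomes on 500 shared items (illustrative counts only).
both_pass, a_only, b_only, both_fail = 317, 40, 53, 90

# Expand the table into per-item 0/1 outcomes, one (A, B) pair per item.
pairs = ([(1, 1)] * both_pass + [(1, 0)] * a_only
         + [(0, 1)] * b_only + [(0, 0)] * both_fail)
a = [x for x, _ in pairs]
b = [y for _, y in pairs]

var_a, var_b = pvariance(a), pvariance(b)
cov_ab = mean([x * y for x, y in pairs]) - mean(a) * mean(b)  # population covariance
var_diff = pvariance([x - y for x, y in pairs])               # Var(A - B), computed directly

print(f"Var(A) + Var(B)               = {var_a + var_b:.4f}")
print(f"Var(A) + Var(B) - 2 Cov(A, B) = {var_a + var_b - 2 * cov_ab:.4f}")
print(f"Var(A - B), direct            = {var_diff:.4f}")
```

The last two lines agree, and both are well below the first, which is the figure an independence assumption would use.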

This is not a niche observation. Dror, Baumer, Shlomov, and Reichart lay it out for language work in "The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing" (ACL 2018): the structure of your data determines which test is valid, and a shared test set produces dependent measurements that a naive test mishandles. Dietterich made the same point a generation earlier for classifiers in "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" (Neural Computation, 1998), where he recommends tests that respect the paired structure of predictions on a single held-out set and warns against procedures that quietly assume independence.

So the first criterion is a modeling decision rather than a choice of statistic: same items, therefore paired. Everything below follows from taking that seriously.

Criterion 2: For pass/fail, McNemar's test looks only at the disagreements

When each item is a binary pass or fail, pairing the two runs sorts every item into one of four buckets:

both pass
both fail
A passes, B fails (call it a_only)
B passes, A fails (call it b_only)

The items where both systems agree carry no information about which system is better. They cancel. All the signal lives in the discordant pairs, the items where exactly one of the two got it right. McNemar's test (McNemar, "Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages," Psychometrika, 1947) asks a single question: among the discordant items, is the split between a_only and b_only further from 50/50 than chance would produce?

The exact version is a binomial test on the discordant count. The large-sample version is a chi-square statistic with a continuity correction, (|b - c| - 1)^2 / (b + c), on one degree of freedom, where b and c are the two discordant counts (b_only and a_only above). For the sample sizes typical of an eval set, I reach for the exact binomial and stop thinking about it.

import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n = 500

# Some prompts are just harder for every system. That shared difficulty is
# exactly why the two runs are correlated, and exactly why you must pair.
difficulty = rng.beta(2, 2, size=n)
p_a = np.clip(0.45 + 0.40 * (1 - difficulty), 0, 1)   # system A
p_b = np.clip(0.50 + 0.40 * (1 - difficulty), 0, 1)   # system B, a bit stronger
pass_a = rng.random(n) < p_a
pass_b = rng.random(n) < p_b

a_only = int(np.sum(pass_a & ~pass_b))   # A right, B wrong
b_only = int(np.sum(pass_b & ~pass_a))   # B right, A wrong
n_disc = a_only + b_only

print(f"A={pass_a.mean():.3f}  B={pass_b.mean():.3f}  discordant: {a_only} vs {b_only}")
if n_disc == 0:
    print("no disagreements: McNemar is undefined, there is nothing to weigh")
else:
    p = binomtest(b_only, n_disc, 0.5, alternative="two-sided").pvalue
    print(f"exact McNemar p = {p:.4f}")

Notice the guard on n_disc == 0. It is not decoration. If your two runs are very similar (a small prompt tweak, a temperature change), you can land with almost no discordant items, and both the chi-square and the exact test degrade or become undefined. The honest reading in that case is not "p is large, no difference" but "I do not have enough disagreements to say anything," which is a different sentence and a cue to collect more items.

To make the reframe concrete, take an illustrative outcome from a run like the one above. Suppose across 500 items you see 40 items where only A passed and 53 where only B passed, with 407 concordant. The marginal gap is (53 - 40) / 500 = 2.6 points, which matches a 71.4 versus 74.0 headline. But McNemar is weighing 53 against 40 out of 93 discordant items, and the exact binomial p for that split is about 0.21 (illustrative). That 2.6-point gap is inside the range you would see from shuffling which system got the coin flip on the contested items. The averages made it look decided. The paired test says wait.
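The 40-versus-53 split can be checked without SciPy. This is a small standard-library sketch; the helper names `exact_mcnemar_p` and `mcnemar_chi2_p` are mine, not from any library, and the chi-square p-value uses the identity that a one-degree-of-freedom chi-square tail equals erfc(sqrt(x / 2)).

```python
import math

def exact_mcnemar_p(b, c):
    """Two-sided exact binomial p-value on the discordant pairs (null: 50/50 split)."""
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs, nothing to test")
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n   # P(X <= k)
    return min(1.0, 2 * tail)

def mcnemar_chi2_p(b, c):
    """Continuity-corrected chi-square version, one degree of freedom."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))   # chi-square(1) survival function

a_only, b_only = 40, 53
print(f"exact p      = {exact_mcnemar_p(b_only, a_only):.3f}")
print(f"chi-square p = {mcnemar_chi2_p(b_only, a_only):.3f}")
```

With 93 discordant pairs the two versions should land close to each other, both near the 0.21 quoted above.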

Criterion 3: For graded scores, bootstrap the per-item deltas

Plenty of evals do not produce a clean pass or fail. You get a rubric score in [0, 1], a 1-to-5 judge rating, a similarity score, a latency. The same logic holds, and the mechanical move is the same: form the per-item difference first, then reason about the distribution of those differences.

The paired bootstrap does this without assuming the deltas are normal. You compute diff_i = score_B(i) - score_A(i) for each item, then resample those paired differences with replacement many times, recomputing the mean each time, and read a confidence interval off the resampled means. The word "paired" is doing real work: you resample the differences, not the two score columns independently. Resampling the columns separately would break the pairing and reintroduce the very independence assumption you are trying to avoid.

import numpy as np

def paired_bootstrap_delta(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=1):
    """Mean per-item delta (B - A) and a percentile bootstrap CI on it."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    if scores_a.shape != scores_b.shape:
        raise ValueError("scores must be aligned item by item")
    diffs = scores_b - scores_a                           # pair FIRST
    n = len(diffs)
    if n == 0:
        raise ValueError("no items to compare")
    idx = rng.integers(0, n, size=(n_boot, n))            # resample the PAIRS
    boot = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), (float(lo), float(hi))

# Illustrative graded scores in [0, 1] on the same 400 items.
rng = np.random.default_rng(7)
difficulty = rng.beta(2, 2, 400)
scores_a = np.clip(rng.normal(0.70 - 0.20 * difficulty, 0.10), 0, 1)
scores_b = np.clip(scores_a + rng.normal(0.02, 0.08, 400), 0, 1)   # correlated with A

delta, (lo, hi) = paired_bootstrap_delta(scores_a, scores_b)
print(f"mean delta (B - A) = {delta:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

Bootstrap resampling of a single test set is not something I am inventing here. Koehn introduced it to machine translation evaluation in "Statistical Significance Tests for Machine Translation Evaluation" (EMNLP 2004), precisely to attach a confidence interval to a metric computed on one fixed test set, and the general method traces to Efron and Tibshirani's "An Introduction to the Bootstrap" (1993). The one discipline to keep is the pairing. If your items come in natural groups (several questions drawn from the same source document, or several turns from the same conversation), even the paired item bootstrap can understate uncertainty, and you want to resample at the level of the group. More on that in the open question.

Criterion 4: Report an effect size and a CI on the delta, not a bare p-value

Suppose the paired test comes back at p = 0.03. You still do not know whether B is better by half a point or by twelve. A p-value answers "could this be zero," not "how big is it and does the size matter to anyone." On a 5,000-item eval, a 0.3-point improvement can clear significance while being operationally meaningless. On a 120-item eval, a genuinely useful 4-point gain can miss the 0.05 line entirely.

So report the delta and its interval as the headline, and treat the p-value as secondary. For the binary case, the difference in paired proportions is (b - c) / n, where b is the number of items only B passed, c is the number only A passed, and n is the total item count. A standard error that respects the pairing is

SE = (1 / n) * sqrt(b + c - (b - c)^2 / n),

which for the illustrative 40-versus-53 example gives a 2.6-point delta with a roughly [-1.2, +6.4]-point 95 percent interval. That interval straddles zero, which is the same verdict McNemar gave, but now stated in units a product owner can act on. For the continuous case, the bootstrap interval on the mean delta is already in the right form.
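The interval for the 40-versus-53 example can be reproduced in a few lines. This is a minimal sketch using a plain Wald interval on the paired difference; `paired_delta_ci` is a hypothetical helper name, and z = 1.96 assumes a 95 percent two-sided interval.

```python
import math

def paired_delta_ci(b_only, a_only, n, z=1.96):
    """Wald interval for the paired difference in pass rates (B minus A)."""
    delta = (b_only - a_only) / n
    se = math.sqrt(b_only + a_only - (b_only - a_only) ** 2 / n) / n
    return delta, (delta - z * se, delta + z * se)

delta, (lo, hi) = paired_delta_ci(b_only=53, a_only=40, n=500)
print(f"delta = {100 * delta:+.1f} points, 95% CI = [{100 * lo:+.1f}, {100 * hi:+.1f}]")
# prints: delta = +2.6 points, 95% CI = [-1.2, +6.4]
```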

This emphasis is exactly what the American Statistical Association argued for in its 2016 statement on p-values (Wasserstein and Lazar, "The ASA Statement on p-Values: Context, Process, and Purpose," The American Statistician): a p-value does not measure the size of an effect or the importance of a result, and good practice reports estimates with their uncertainty. Dror and colleagues make the same recommendation for language experiments. An effect size with an interval tells you direction, magnitude, and precision in one line. A lone p-value tells you one bit and hides the rest.

Criterion 5: If you compare many metrics at once, correct for it

Modern eval runs are wide. You are not testing one number, you are testing pass rate and faithfulness and a toxicity check and format-adherence and latency and a half dozen rubric dimensions, all at once, all with their own paired test. Each test at the 0.05 level has a 1-in-20 chance of a false alarm under the null. Run twelve of them and the probability of at least one false positive, if nothing truly changed, is 1 - 0.95^12, which is about 0.46. Almost a coin flip that you will "discover" a difference that is not there and chase it for a day.

There are two standard responses. Bonferroni divides your alpha by the number of tests (0.05 / 12 = 0.0042 each), which controls the chance of any false positive but is conservative and will hide real effects. Benjamini and Hochberg's procedure ("Controlling the False Discovery Rate," Journal of the Royal Statistical Society Series B, 1995) instead controls the expected fraction of your flagged results that are false, which keeps more power when several metrics really did move. For eval dashboards, where you would rather not miss a real regression, I default to Benjamini-Hochberg and reserve Bonferroni for the small number of metrics I would gate a release on. Either way, the point stands: the more comparisons you draw from one run, the more of them will look significant by luck, and a wide dashboard without any correction will regularly flag differences that are not real.
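Both corrections fit in a few lines of standard-library Python. The sketch below is illustrative only: the twelve p-values are made up, and the function names are mine. In practice you would feed in the p-values from your own paired tests.

```python
def bonferroni(pvalues, alpha=0.05):
    """Flag a test only if its p-value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, q=0.05):
    """Flag the k smallest p-values, where k is the largest rank with p_(k) <= k * q / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank
    flagged = [False] * m
    for i in order[:k]:
        flagged[i] = True
    return flagged

# Hypothetical p-values from twelve paired tests on one run.
pvals = [0.001, 0.004, 0.009, 0.020, 0.041, 0.08, 0.15, 0.22, 0.37, 0.51, 0.68, 0.90]
print("chance of at least one false alarm, uncorrected:", round(1 - 0.95 ** len(pvals), 2))
print("flagged at raw 0.05:", sum(p <= 0.05 for p in pvals))
print("Bonferroni flags:   ", sum(bonferroni(pvals)))
print("BH flags:           ", sum(benjamini_hochberg(pvals)))
```

On these made-up inputs, five metrics clear a raw 0.05 threshold, Bonferroni keeps two, and Benjamini-Hochberg keeps three, which is the power difference described above.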

When the simple average is fine

I promised a measured claim, so here is the boundary. Comparing two averages is not always wrong, and there are cases where I do exactly that and sleep fine.

If the two runs used different, independently sampled item sets, they are not paired, and an unpaired analysis is the correct one. If the gap is so large that no reasonable interval could touch zero (A at 40 percent, B at 88 percent, on a few hundred items), the paired test will agree with your eyes and you can skip the ceremony for a quick read, then do it properly before you write it down. If you are not making a decision, just watching a number drift over time on a monitoring chart, a smoothed average is a fine early-warning signal and nobody needs a p-value to notice a cliff. And if your eval set is tiny (say under 30 items), no test rescues you; the honest move is to report the raw counts, resist a verdict, and go collect more data.

The through line is the same. The average is a fine description. It becomes the wrong test the moment you put two of them side by side, on the same items, to decide which system won. That specific move is the one that needs a paired test.

I will also admit what pairing does not fix. It does not repair a biased judge, a leaky eval set, or items that do not represent production. If the measurements themselves are wrong, no statistic saves you. Pairing only makes sure that, given honest measurements, you read the comparison the measurements actually support.

FAQ

Is a two-proportion z-test fine if my eval set is large enough?
No, and size does not rescue it. The two-proportion z-test assumes the two samples are independent. With a shared test set they are not, because item difficulty is common to both runs. A larger n makes the wrong standard error more precisely wrong, not correct. What large n does buy you is stability for the paired test: McNemar and the paired bootstrap both behave well with more items, and their intervals tighten. So keep growing the eval set, but feed the counts into a paired procedure. The independence assumption is a property of the design, not of the sample size, and you cannot buy your way out of it with more rows.

My two confidence intervals overlap. Doesn't that mean no significant difference?
This is the most common trap in the whole topic. Overlapping marginal intervals do not imply the paired difference is non-significant. The interval you drew on A and the interval you drew on B each describe one run in isolation. The question you care about lives in a third interval, the one on the delta, which depends on how A and B covary across items. Because a shared test set makes them positively correlated, the interval on the difference is often much tighter than the overlap picture suggests. Draw the interval on the delta and check whether it excludes zero. Ignore the overlap of the two marginal bars.
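A small numeric case makes the trap visible. The counts below are hypothetical, picked so the two marginal intervals overlap while the interval on the paired delta excludes zero. Wald intervals are used throughout for simplicity.

```python
import math

# Hypothetical 2x2 table on 500 shared items (illustrative counts only).
both_pass, a_only, b_only, both_fail = 350, 5, 25, 120
n = both_pass + a_only + b_only + both_fail
z = 1.96

def wald(p, n):
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

p_a = (both_pass + a_only) / n
p_b = (both_pass + b_only) / n
ci_a, ci_b = wald(p_a, n), wald(p_b, n)

# Interval on the paired difference, using only the discordant counts.
delta = (b_only - a_only) / n
se = math.sqrt(b_only + a_only - (b_only - a_only) ** 2 / n) / n
ci_delta = (delta - z * se, delta + z * se)

print(f"A: {p_a:.3f}  CI [{ci_a[0]:.3f}, {ci_a[1]:.3f}]")
print(f"B: {p_b:.3f}  CI [{ci_b[0]:.3f}, {ci_b[1]:.3f}]")
print("marginal intervals overlap:  ", ci_a[1] > ci_b[0])
print(f"delta: {delta:.3f}  CI [{ci_delta[0]:.3f}, {ci_delta[1]:.3f}]")
print("delta interval excludes zero:", ci_delta[0] > 0)
```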

What if I only have the two average pass rates, not the per-item results?
Then you cannot run the correct test, and you should go get the per-item results. This is the practical reason to log outcomes at the item level for every run: the paired analysis needs to know, item by item, whether each system passed. Two summary numbers have already discarded the pairing, and no post-hoc formula reconstructs it. If retrieving the raw outcomes is genuinely impossible, be honest that you can only make an unpaired, underpowered comparison, and treat any close call as undecided rather than shipping on it. Store the per-item pass/fail and scores by default so this never becomes the blocker.

When should I use McNemar's chi-square versus the exact binomial version?
Use the exact binomial test on the discordant count when that count is small, which for eval sets is most of the time. The chi-square form, with the continuity correction (|b - c| - 1)^2 / (b + c), is a large-sample approximation and is fine when the number of discordant pairs is comfortably into the dozens (a common rule of thumb is b + c of at least 25). Below that, the approximation drifts and the exact test is both safer and, in Python, no harder to call. When discordant pairs are near zero, neither version is meaningful; you simply do not have enough disagreements, and the answer is more data, not a smaller p.

How many items do I need for the paired test to detect a real difference?
It depends on the number of discordant pairs more than on the total item count, which is the counterintuitive part. Power comes from the items where the two systems disagree, so two nearly identical runs need a large set to accumulate enough discordant pairs, while two clearly different runs reach significance on far fewer items. As a rough planning move, estimate the fraction of items where you expect the runs to differ, multiply by your set size to get expected discordant pairs, and aim for that to be at least in the dozens. If you cannot get there, you are testing a difference too small to resolve at your current scale, which is itself a useful finding.
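One way to turn that planning move into a number is a quick simulation. The sketch below is a rough illustration under stated assumptions: every item is discordant with the same probability `p_disc`, a fixed share `share_b` of discordant items favor B, and items are independent. All three parameters and both function names are hypothetical.

```python
import math
import random

def exact_mcnemar_p(b, c):
    n = b + c
    if n == 0:
        return 1.0   # no disagreements, so no evidence either way
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def simulated_power(n_items, p_disc, share_b, alpha=0.05, trials=400, seed=0):
    """Fraction of simulated evals where exact McNemar rejects at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        b = c = 0
        for _ in range(n_items):
            u = rng.random()
            if u < p_disc * share_b:
                b += 1          # only B passed
            elif u < p_disc:
                c += 1          # only A passed
        hits += exact_mcnemar_p(b, c) < alpha
    return hits / trials

for n_items in (100, 500):
    power = simulated_power(n_items, p_disc=0.20, share_b=0.65)
    print(f"{n_items} items, about {round(0.20 * n_items)} discordant pairs: power ~ {power:.2f}")
```

Swap in your own guesses for the discordance rate and the split. The point of the exercise is seeing how quickly power falls when the expected discordant count drops below a few dozen.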

I track 15 metrics per run. Do I really need a multiple-comparisons correction?
Yes, if you are making decisions on whichever metric lights up. Fifteen independent tests at 0.05 give roughly a 1 - 0.95^15, about 0.54, chance of at least one false positive under the null. That is worse than a coin flip. You do not need to correct metrics you are only monitoring, but any metric that can trigger a decision (block a release, revert a prompt) belongs in a corrected family. I use Benjamini-Hochberg to control the false discovery rate across the dashboard, because it keeps more power than Bonferroni when several metrics genuinely moved, and I reserve the stricter Bonferroni cut for the two or three release-gating metrics where a single false alarm is expensive.

Open question

Pairing solves the correlation between two systems on the same item. It does not, by itself, solve the correlation between items that are not independent of each other.

Real eval sets are full of hidden clusters. Ten questions generated from the same source document. Eight turns sampled from the same conversation. A batch of items authored by the same annotator with the same blind spots. When items cluster like this, the effective number of independent observations is smaller than your row count, and both McNemar and the item-level paired bootstrap will hand you an interval that is too tight, because they treat correlated items as if they were independent draws. You end up overconfident in the opposite direction from where we started.

The candidate fixes are known in name: a cluster bootstrap that resamples whole groups instead of individual items, or a mixed-effects model with a random intercept per cluster. What I do not have is a clean, agreed-on recipe for the messy case where the clustering is partial and unlabeled, where some items share a document and others do not, and where nobody logged the grouping at eval-authoring time. How much does ignoring soft clustering actually inflate false positives on a typical agent eval, and is the cluster bootstrap worth the complexity for sets in the low thousands?
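For the easy case where the grouping is fully labeled, the cluster bootstrap is short. The sketch below is a minimal standard-library version, not a vetted recipe. The data are synthetic (40 hypothetical documents with 10 items each and a shared per-document shift), and passing a unique id per item turns the same function into the ordinary item-level bootstrap for comparison.

```python
import random
from collections import defaultdict

def cluster_bootstrap_ci(diffs, cluster_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-item delta, resampling whole clusters."""
    groups = defaultdict(list)
    for d, g in zip(diffs, cluster_ids):
        groups[g].append(d)
    # Store (sum of deltas, item count) per cluster so each resample is cheap.
    clusters = [(sum(v), len(v)) for v in groups.values()]
    k = len(clusters)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        total, count = 0.0, 0
        for _ in range(k):                      # draw k clusters with replacement
            s, m = clusters[rng.randrange(k)]
            total += s
            count += m
        means.append(total / count)
    means.sort()
    lo = means[round(alpha / 2 * n_boot)]
    hi = means[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic B-minus-A deltas: each document shifts all of its items together.
rng = random.Random(1)
diffs, doc_ids = [], []
for doc in range(40):
    shift = rng.gauss(0.02, 0.10)
    for _ in range(10):
        diffs.append(shift + rng.gauss(0.0, 0.05))
        doc_ids.append(doc)

item_ci = cluster_bootstrap_ci(diffs, range(len(diffs)))   # every item its own cluster
doc_ci = cluster_bootstrap_ci(diffs, doc_ids)              # resample whole documents
print(f"item-level CI width:    {item_ci[1] - item_ci[0]:.3f}")
print(f"cluster-level CI width: {doc_ci[1] - doc_ci[0]:.3f}")
```

On data built this way the cluster-level interval comes out noticeably wider, which is the overconfidence the item-level bootstrap hides. It says nothing about the partial, unlabeled case, which is the open part.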

I have opinions and no proof. If you have run this comparison on your own evals, with the grouping tracked and the intervals computed both ways, I would genuinely like to see the numbers. That is the next thing I want to measure, and I would rather learn it from your data than guess.

The regression your eval set is too small to catch

Maya Andersson — Fri, 10 Jul 2026 16:53:27 +0000

TL;DR. To catch a drop from a 0.90 pass rate to 0.85 at 80% power (one-sided, alpha 0.05), you need about 253 examples. A 50-example set has roughly 35% power, so it misses that regression about two times in three. The move that matters is not "collect more data" as a slogan. Size the set to the effect you actually care about, report a confidence interval on each run instead of a bare point delta, and prefer per-criterion binary labels over vague graded scores.

The two-point win that wasn't

A few months ago I sat in a review where a team was pleased with itself. A new prompt had moved their eval pass rate from 87.5 percent to 90 percent, and someone had already written "prompt v3: +2pp" in the changelog. The eval set had forty examples.

Here is what those points were made of. On forty examples every result is a multiple of 2.5 points, so the pass rate can only land on 35 out of 40, or 36, or 37, with nothing in between. The move from 87.5 to 90 was 35 correct becoming 36 correct. One example. A single test case that used to fail now passed, and it could flip back next week when the decoding lands differently.

I asked the obvious question. If we reran the old prompt three more times, would it always score 35 out of 40? Nobody knew, because nobody had rerun it. The honest summary of that meeting is that we had watched one example change its mind, and we had written it into the changelog as progress.

This is not a story about one careless team. It is the default failure mode of eval-driven development. We compare two numbers, see a gap, and our brains supply a cause. The possibility we skip is that the gap sits entirely inside the noise.

Power is the number nobody computes

The quantity that decides whether your eval can see a regression is statistical power: the probability that your test declares a difference when a real difference of a given size exists. Power depends on three things. The effect size you want to catch, the baseline rate, and the number of examples. Most teams pick the effect size implicitly ("a five-point drop would matter"), never write down the baseline variance, and let the sample size be whatever happened to be sitting in the folder.

Jacob Cohen spent a career arguing that this is backwards, most durably in "Statistical Power Analysis for the Behavioral Sciences" (2nd ed., 1988). The discipline he asked for is boring and effective: decide the smallest effect that would change a decision, then size the study so you can actually see it. An eval set is a study. The same arithmetic applies, whether the outcome is a clinical result or a pass or fail from a grader.

For a pass rate the math is the normal approximation for a single proportion measured against a fixed baseline. You have a baseline rate p0, a rate you would not want to miss p1, and you solve for the n that gives you, say, 80 percent power. For the case that comes up constantly, catching a slide from 0.90 to 0.85, the answer is about 253 examples. Not forty. Not fifty.

Run the numbers yourself

Here is the whole calculation in standard-library Python. No dependencies, so you can paste it into a scratch file and swap in your own baseline and the drop you care about.

import math

def _phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _z(p):  # inverse normal CDF (Acklam)
    a = [-3.969683028665376e+01, 2.209460984245205e+02, -2.759285104469687e+02,
         1.383577518672690e+02, -3.066479806614716e+01, 2.506628277459239e+00]
    b = [-5.447609879822406e+01, 1.615858368580409e+02, -1.556989798598866e+02,
         6.680131188771972e+01, -1.328068155288572e+01]
    c = [-7.784894002430293e-03, -3.223964580411365e-01, -2.400758277161838e+00,
         -2.549732539343734e+00, 4.374664141464968e+00, 2.938163982698783e+00]
    d = [7.784695709041462e-03, 3.224671290700398e-01, 2.445134137142996e+00,
         3.754408661907416e+00]
    plow, phigh = 0.02425, 1 - 0.02425
    if p < plow:
        q = math.sqrt(-2 * math.log(p))
        return (((((c[0]*q+c[1])*q+c[2])*q+c[3])*q+c[4])*q+c[5]) / ((((d[0]*q+d[1])*q+d[2])*q+d[3])*q+1)
    if p <= phigh:
        q = p - 0.5; r = q*q
        return (((((a[0]*r+a[1])*r+a[2])*r+a[3])*r+a[4])*r+a[5])*q / (((((b[0]*r+b[1])*r+b[2])*r+b[3])*r+b[4])*r+1)
    q = math.sqrt(-2 * math.log(1 - p))
    return -(((((c[0]*q+c[1])*q+c[2])*q+c[3])*q+c[4])*q+c[5]) / ((((d[0]*q+d[1])*q+d[2])*q+d[3])*q+1)

def power_one_proportion(n, p0, p1, alpha=0.05):  # one-sided
    z_a = _z(1 - alpha)
    return _phi((abs(p0 - p1) * math.sqrt(n) - z_a * math.sqrt(p0*(1-p0))) / math.sqrt(p1*(1-p1)))

def n_for_power(p0, p1, power=0.80, alpha=0.05):
    z_a, z_b = _z(1 - alpha), _z(power)
    return math.ceil(((z_a*math.sqrt(p0*(1-p0)) + z_b*math.sqrt(p1*(1-p1))) / abs(p0-p1))**2)

p0, p1 = 0.90, 0.85  # detect a 5-point drop in pass rate
for n in (50, 100, 250):
    print(f"n={n:4}  power to catch {p0}->{p1} = {power_one_proportion(n, p0, p1):.2f}")
print("n needed for 80% power:", n_for_power(p0, p1))

n=  50  power to catch 0.9->0.85 = 0.35
n= 100  power to catch 0.9->0.85 = 0.51
n= 250  power to catch 0.9->0.85 = 0.80
n needed for 80% power: 253

Read it slowly. A fifty-example set has about 35 percent power against a five-point drop, which means it misses that regression roughly two times in three. At a hundred examples you reach 51 percent, a coin flip. You do not clear 80 percent until about 250, and the closed-form requirement is 253.

Two caveats carry weight here. This is a normal approximation and a one-sided test, appropriate when you only care about catching a drop rather than an improvement. And it describes a single run measured against a fixed baseline. If you evaluate the same items before and after (a paired design), the right test is McNemar on the items that flipped, and it needs fewer examples because it cancels the item-to-item difficulty that otherwise inflates your variance. If instead you compare two independent runs, that is a two-sample problem, and you need roughly twice as many per run. The forty-example set was never in the conversation.
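The "roughly twice as many per run" figure can be checked with the textbook two-sample formula. This sketch assumes equal n per run, a one-sided test, and unpooled variances; it uses `statistics.NormalDist` (Python 3.8+) for the normal quantiles, and the function names are mine.

```python
import math
from statistics import NormalDist

def n_one_sample(p0, p1, power=0.80, alpha=0.05):
    """One run against a fixed baseline p0 (one-sided)."""
    z_a, z_b = NormalDist().inv_cdf(1 - alpha), NormalDist().inv_cdf(power)
    root = (z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))) / abs(p0 - p1)
    return math.ceil(root ** 2)

def n_two_sample(p0, p1, power=0.80, alpha=0.05):
    """Two independent runs with n items each (one-sided, unpooled variance)."""
    z_a, z_b = NormalDist().inv_cdf(1 - alpha), NormalDist().inv_cdf(power)
    return math.ceil((z_a + z_b) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / (p0 - p1) ** 2)

one = n_one_sample(0.90, 0.85)
two = n_two_sample(0.90, 0.85)
print(f"fixed baseline:       n = {one}")
print(f"two independent runs: n = {two} per run ({two / one:.1f}x)")
```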

When more labels are not an option

"Collect more data" is easy to say and expensive to do, because the labels are the cost, not the inputs. Three things help when you are genuinely stuck at a small n.

Report an interval, not a point. Every pass rate is an estimate with a standard error, and near the boundaries where eval scores live (roughly 0.85 to 0.97) the ordinary Wald interval behaves badly. Use the Wilson score interval instead, from Wilson, E. B. (1927), "Probable inference, the law of succession, and statistical inference," JASA. For 36 out of 40, the Wilson 95 percent interval runs from about 0.77 to 0.96. Print that next to the number. An interval that wide makes the two-point win argue against itself, and nobody has to say a thing.
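The Wilson interval is a few lines of arithmetic. A minimal sketch, with z = 1.96 assumed for a 95 percent interval and a Wald interval alongside for comparison:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def wald_interval(successes, n, z=1.96):
    """Ordinary normal-approximation interval, shown only for contrast."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

for k in (35, 36):
    w_lo, w_hi = wilson_interval(k, 40)
    print(f"{k}/40: Wilson [{w_lo:.2f}, {w_hi:.2f}]")
lo, hi = wald_interval(36, 40)
print(f"36/40: Wald   [{lo:.2f}, {hi:.2f}]")
```

The 35-of-40 and 36-of-40 intervals overlap almost entirely, which is the forty-example story told in one line of output.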

Prefer per-criterion binary labels over graded scores. A vague one-to-five rubric hides two problems: raters disagree about whether an answer is a three or a four, and that disagreement is pure measurement noise that widens every interval you compute. Splitting the judgment into specific binary criteria (did it cite a source, did it follow the format, did it answer the question) is easier to label reliably, and each example then yields several outcomes instead of one. The honest caveat is that criteria inside one example are correlated, so k criteria are not k independent examples. You still come out ahead, because the reliability gain is real and the correlated-labels problem is smaller than the rater-noise problem it replaces.

Fix the effect size before the run, not after. Decide in advance the smallest regression you care about, size the set to it, and write the decision rule down. Five points at 80 percent power means 253 examples. If you can only get 80, then say plainly that you are powered to catch about a ten-point drop and nothing smaller, and stop reading two-point moves as signal. A known blind spot is a manageable risk. An unknown one ships regressions to users.
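To find what a fixed set size can actually see, invert the power calculation. The sketch below is one way to do it under the same one-sided, fixed-baseline normal approximation used above: it bisects on the size of the drop. `min_detectable_drop` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def power_one_sided(n, p0, p1, alpha=0.05):
    """Power to detect a move from baseline p0 to p1 with n examples."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    num = abs(p0 - p1) * math.sqrt(n) - z_a * math.sqrt(p0 * (1 - p0))
    return NormalDist().cdf(num / math.sqrt(p1 * (1 - p1)))

def min_detectable_drop(n, p0, power=0.80, alpha=0.05):
    """Smallest drop from p0 that reaches the target power, found by bisection."""
    lo, hi = 0.0, p0 - 1e-9      # a zero drop has power alpha; a near-total drop has power ~1
    for _ in range(60):
        mid = (lo + hi) / 2
        if power_one_sided(n, p0, p0 - mid, alpha) >= power:
            hi = mid
        else:
            lo = mid
    return hi

for n in (40, 80, 253):
    print(f"n={n:4}: smallest drop from 0.90 caught at 80% power ~ {100 * min_detectable_drop(n, 0.90):.1f} points")
```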

The lesson I keep relearning is that eval sets fail quietly. A set that is too small does not error out. It returns a number with two decimal places and a confident sign, and that number is noise that reads exactly like a real measurement.

FAQ

Does a bigger or better model change the sample size?
Only through its pass rate. The arithmetic depends on the effect size and the baseline rate, not on which model produced the answers. A stronger model with a higher baseline actually needs fewer examples to catch the same absolute drop, because the variance p(1-p) shrinks as the rate approaches 1. Catching a five-point slide from 0.97 costs far fewer labels than the same slide from 0.90.

What if my eval is paired, the same items scored under both versions?
Then use McNemar's test on the discordant pairs, the items that passed under one version and failed under the other. It ignores the items that agree, and it needs fewer examples than the independent-samples formula because it removes item difficulty from the variance. The catch is that if almost nothing flips, you have almost no information no matter how many items you scored. Paired designs concentrate all the signal in the disagreements.

One-sided or two-sided?
One-sided is right when you will only act on a drop, which is the regression-detection case. If you will act on a move in either direction, use two-sided and expect to need somewhat more, since the critical value rises from 1.645 to 1.96. Do not reach for one-sided just to shrink the sample size on paper.
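The cost of going two-sided is easy to quantify. A minimal sketch with the same fixed-baseline normal approximation; the `two_sided` flag is my own parameter, and all it does is split alpha across both tails.

```python
import math
from statistics import NormalDist

def n_for_power(p0, p1, power=0.80, alpha=0.05, two_sided=False):
    tail = alpha / 2 if two_sided else alpha
    z_a = NormalDist().inv_cdf(1 - tail)      # about 1.645 one-sided, 1.960 two-sided
    z_b = NormalDist().inv_cdf(power)
    root = (z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))) / abs(p0 - p1)
    return math.ceil(root ** 2)

print("one-sided:", n_for_power(0.90, 0.85))
print("two-sided:", n_for_power(0.90, 0.85, two_sided=True))
```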

Open question

The clean part of this is the single-metric, binary-outcome case. The messy part is the multi-criterion eval, where each example carries five or ten correlated binary judgments and I want one honest power calculation for the whole thing. Treating the criteria as independent overcounts the evidence, and treating each example as one outcome throws information away. The right answer sits somewhere in between, governed by the intra-example correlation, and the design-effect corrections from cluster sampling are the closest tool I know of. The trouble is that they need a correlation estimate you usually do not have until you have already run the eval several times. I do not have a tidy recipe for sizing a correlated multi-criterion eval up front, and non-binary rubric scores make it harder still. If you have solved this in a way that survives contact with real data, I would like to read it.

One average eval score was hiding two different failure modes

Maya Andersson — Wed, 08 Jul 2026 17:05:02 +0000

A mean faithfulness of 0.75 sounds like a model that is usually right and occasionally slips. Mine was near-perfect on roughly three quarters of the data and near-zero on the rest, and 0.75 described neither slice.

I spent a week chasing the wrong problem because I trusted an average. Our RAG system reported faithfulness 0.75 on the eval set, and 0.75 reads like "mostly grounded, some drift." So I tuned the prompt to nudge the drift down. The mean did not move. It did not move because there was no drift to fix. The distribution was not centered anywhere near 0.75. It was two spikes, one clump of scores at 0.95-plus and another clump at 0.05-ish, and the mean was sitting in the empty valley between them where almost no example actually lived.

This is the oldest trap in descriptive statistics, and Anscombe made it unforgettable in "Graphs in Statistical Analysis" (The American Statistician, 1973, 27(1), 17-21). His four datasets share the same mean, variance, correlation, and regression line, and look completely different the moment you plot them. Eval scores are Anscombe's point with a modern coat of paint: the summary statistic is identical across wildly different realities, and only the shape tells you which reality you are in.

Why the mean is the wrong summary for eval scores specifically

There is a reason this bites eval work harder than most data. A lot of eval scores are not smooth quantities. Faithfulness, correctness, pass/fail graded on a rubric: these pile up at the extremes. An answer is grounded or it is fabricated. The judge says yes or no. When your scores clump at 0 and 1, the mean stops describing a typical case and starts describing a mixing ratio. A faithfulness of 0.75 does not mean "the average answer is 75% faithful." It means "roughly 75% of answers are faithful and the rest are not," and those two readings call for completely different actions. The first says polish. The second says find the 25% and figure out what they have in common.

The mean is a fine summary when the distribution is unimodal and roughly symmetric. It is an actively misleading summary when the distribution is bimodal, because the center of mass lands where no data is. And clumped-at-0-and-1 eval scores are bimodal by construction.
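
A quick way to see the two readings side by side is to build both populations and compare them. This is a sketch with made-up data: one set where every answer scores around 0.75, and one where 75% of answers score 1 and the rest score 0. The 0.15 window is an arbitrary choice of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population A: every answer is "about 75% faithful".
unimodal = np.clip(rng.normal(0.75, 0.08, size=1000), 0, 1)

# Hypothetical population B: 75% of answers score 1, 25% score 0.
bimodal = np.concatenate([np.ones(750), np.zeros(250)])

for name, s in [("unimodal", unimodal), ("bimodal", bimodal)]:
    # Share of examples that actually sit near the reported mean.
    near_mean = (np.abs(s - s.mean()) < 0.15).mean()
    print(f"{name:8s} mean={s.mean():.2f}  share within 0.15 of mean={near_mean:.2f}")
```

Both means come out at about 0.75. Most of the unimodal scores sit near that mean, and none of the bimodal ones do.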

Criterion one: look at the distribution before you report the mean

The cheapest possible defense is to plot the scores before you quote a single number. Not summary stats. The histogram. Thirty seconds tells you whether you are looking at one mode or two.

import numpy as np

rng = np.random.default_rng(0)

# A model that is great on 'in-domain' queries and terrible on 'out-of-domain',
# 60/40 split. Beta draws keep scores in [0,1] and clumped near the extremes.
in_domain  = rng.beta(12, 1.2, size=600)   # clustered high, near 0.9-1.0
out_domain = rng.beta(1.2, 12, size=400)   # clustered low, near 0.0-0.1
scores = np.concatenate([in_domain, out_domain])

print(f"mean faithfulness: {scores.mean():.3f}")   # ~0.58, down in the near-empty valley

# a text histogram so this runs with no plotting deps
edges = np.linspace(0, 1, 11)
counts, _ = np.histogram(scores, bins=edges)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f} | {'#' * (c // 8)} {c}")

Run that and the bars tell the story the mean hid: two towers, one low, one high, and a hollow middle where 0.58 is pointing. The number says "average student." The shape says "two students, one acing, one failing, sitting at the same desk."

Criterion two: report percentiles, and specifically look for the hollow middle

If you must reduce the distribution to numbers, percentiles carry the shape that the mean throws away. A unimodal distribution has its mass near the median. A bimodal one shows a jump: the 25th and 75th percentiles are pulled toward the two separate clumps, and the median either sits stranded in the gap (on an even split) or lands inside whichever clump is larger.

qs = [10, 25, 50, 75, 90]
pct = np.percentile(scores, qs)
for q, v in zip(qs, pct):
    print(f"p{q}: {v:.3f}")

# a crude bimodality flag: how much mass sits in the middle third [0.33, 0.66]
# vs the outer thirds. A hollow middle is the tell.
middle = ((scores >= 0.33) & (scores <= 0.66)).mean()
outer  = 1 - middle
print(f"mass in middle third: {middle:.2f}   outer thirds: {outer:.2f}")

When the middle third holds far less mass than the outer thirds, your average is describing a place your data avoids. That is the signal to stop reporting one number and start splitting the data.

Criterion three: break it down by slice, because the mean is a fossil of a real boundary

Bimodality is rarely random. It almost always means there is a variable you have not conditioned on: query type, source document, language, prompt template, tenant. The two spikes are two populations wearing one average. The fix is to find the axis that separates them and report per-slice means.

# reconstruct the slice we secretly built above and report per-slice
labels = np.array(["in_domain"] * 600 + ["out_domain"] * 400)

for slice_name in np.unique(labels):
    s = scores[labels == slice_name]
    print(f"{slice_name:11s} n={len(s):4d}  "
          f"mean={s.mean():.3f}  p50={np.median(s):.3f}")
# in_domain   n= 600  mean~0.91
# out_domain  n= 400  mean~0.09

Now the numbers are actionable. The aggregate 0.75 told me to tune. The per-slice split told me the model is fine and my retrieval falls off a cliff on out-of-domain queries, which is a different bug in a different subsystem. I had been editing prompts to fix a retrieval gap. The average sent me to the wrong file.

FAQ

Is the median a safe replacement for the mean here?

No, and that is the trap. On a 60/40 bimodal split the median just lands inside whichever clump holds more than half the mass. In the code above it comes back around 0.82, sitting up in the high clump and hiding the low population entirely. The median is not more honest than the mean on bimodal data. It is dishonest in a different direction. The distribution or a per-slice breakdown is the fix, not a different single number.

How do I detect bimodality automatically instead of eyeballing histograms?

For a quick automated flag, the dip test (Hartigan and Hartigan, 1985) tests unimodality directly and is a reasonable gate in CI. A cruder version is the middle-mass check above: if the central third holds much less mass than the outer thirds, flag it for a human to look at the plot. Automated flags are for triage. The plot is still where you understand it.
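
If you want something between the middle-mass check and a full dip test, one dependency-free option is Sarle's bimodality coefficient. To be clear, this is a swapped-in heuristic, not the dip test: it is computed from skewness and kurtosis alone, and it can mis-flag heavily skewed unimodal data. The sketch below uses the simple moment-based form and the conventional 5/9 cutoff (the value a uniform distribution gives); treat both as assumptions to tune.

```python
import numpy as np

def bimodality_coefficient(x) -> float:
    """Moment-based Sarle coefficient: (skewness^2 + 1) / kurtosis,
    with kurtosis in its non-excess form (normal = 3).
    A uniform distribution gives 5/9; values above that hint at bimodality."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)
    return (skew ** 2 + 1.0) / kurt

rng = np.random.default_rng(0)
# Same 60/40 mixture as the histogram example, plus a single-hump comparison.
clumped = np.concatenate([rng.beta(12, 1.2, 600), rng.beta(1.2, 12, 400)])
single_hump = rng.normal(0.75, 0.08, 1000)

THRESHOLD = 5 / 9  # conventional cutoff; a triage heuristic, not a test
for name, s in [("clumped", clumped), ("single hump", single_hump)]:
    bc = bimodality_coefficient(s)
    print(f"{name:11s} BC={bc:.2f}  flag={bc > THRESHOLD}")
```

The clumped mixture lands well above the cutoff and the single hump well below it. Anything near the cutoff still deserves a look at the histogram.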

My scores are continuous and unimodal. Do I still need this?

Less urgently. If you have plotted the distribution and it is a single hump, the mean is a defensible summary and you can report it with a standard deviation. This whole problem is specific to distributions that clump, which is most rubric-graded and binary eval scores and fewer of your latency-style continuous metrics.

Open question

Per-slice reporting works when you already know the slicing variable. The hard case is when you do not: the distribution is clearly bimodal, but none of your logged metadata separates the two clumps cleanly. You know two populations exist, you just cannot name the boundary. Clustering the failures on their embeddings or their input features is the obvious move, but I have found the clusters are often not the ones a human would draw, and naming them post hoc invites the same cherry-picking these methods are supposed to prevent. What is the honest workflow for discovering the hidden slice when the data insists there is one but will not tell you what it is?

Your LLM-as-judge has a position bias you are not measuring

Maya Andersson — Tue, 07 Jul 2026 16:42:42 +0000

If your pairwise judge sees answer A before answer B, it tends to prefer A. If you never swap the order, every win-rate you report is contaminated by which slot you happened to put each answer in.

The first time I actually measured this, I did not believe the number. I had a judge scoring 400 pairwise comparisons, model against baseline, and it reported a 61% win-rate for our new model. Then I ran the exact same 400 pairs with the two answers swapped in the prompt. The win-rate came back at 46%. Same answers, same judge, same rubric. The only thing I changed was which one I showed first, and the verdict moved by 15 points.

That is position bias, and it is not a quirk of my setup. Zheng et al. document it directly in "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023). They call it position bias and show that strong judges, GPT-4 included, systematically favor one slot regardless of content, sometimes flipping their own verdict when you swap the responses. If you are running a pairwise eval and you present answers in a fixed order, you are baking that preference into your headline metric and calling it quality.

The thing you are actually measuring

A pairwise judge call takes a question, answer A, answer B, and returns a preference. The implicit claim is that the preference is a function of the two answers. It is not. It is a function of the two answers AND the order you presented them AND whatever slot the judge happens to lean toward. When those last two matter, your win-rate is measuring a mix of quality and seating arrangement, and you cannot tell how much of your 61% is real.

The failure is invisible if you only run each pair once. You get a clean-looking number, you ship, and the number was partly an artifact of prompt layout. The fix is not exotic. You run each pair in both orders and look at what the judge does with the swap.

Criterion one: run both orders and measure the disagreement rate

For every pair, call the judge twice. Once with the baseline as A and the candidate as B, once swapped. A judge with no position bias gives you consistent verdicts: if it preferred the candidate in one order, it prefers the candidate in the other. When the two orders disagree, that pair is telling you the order decided the call, not the content.

Two numbers fall out of this. The flip rate is the fraction of pairs where swapping the order changed the verdict. The position-preference rate is how often the judge picked the first slot regardless of what sat there. Here is the measurement, with a deterministic stub in place of a real judge so it runs as-is:

import random

random.seed(7)

def judge(question, answer_a, answer_b):
    """Return 'A' or 'B'. Replace this body with a real judge call.
    This stub fakes a first-slot bias so the metrics below are non-trivial."""
    # 30% of the time the judge just grabs the first slot (position bias);
    # otherwise it 'reads' and prefers the answer tagged as stronger.
    if random.random() < 0.30:
        return "A"
    return "A" if answer_a["quality"] >= answer_b["quality"] else "B"

def pairwise_both_orders(question, baseline, candidate):
    """Run one comparison in both orders. Return the two verdicts,
    normalized to whether the CANDIDATE won."""
    v1 = judge(question, baseline, candidate)      # baseline in slot A
    cand_won_order1 = (v1 == "B")
    v2 = judge(question, candidate, baseline)      # candidate in slot A
    cand_won_order2 = (v2 == "A")
    return cand_won_order1, cand_won_order2

pairs = [
    {"q": f"q{i}",
     "baseline": {"quality": random.random()},
     "candidate": {"quality": random.random()}}
    for i in range(400)
]

flips = 0
first_slot_picks = 0
total_calls = 0
candidate_wins_consistent = 0
consistent_pairs = 0

for p in pairs:
    w1, w2 = pairwise_both_orders(p["q"], p["baseline"], p["candidate"])
    # flip = the two orders disagree on who won
    if w1 != w2:
        flips += 1
    else:
        consistent_pairs += 1
        if w1:
            candidate_wins_consistent += 1
    # count how often slot A won, across both calls, as a position signal
    # order1: A holds baseline, so baseline winning == A won
    first_slot_picks += (1 if not w1 else 0) + (1 if w2 else 0)
    total_calls += 2

n = len(pairs)
print(f"flip rate:            {flips / n:.3f}")
print(f"first-slot pick rate: {first_slot_picks / total_calls:.3f}  (0.50 = unbiased)")
print(f"win-rate on consistent pairs only: "
      f"{candidate_wins_consistent / consistent_pairs:.3f} "
      f"(n={consistent_pairs})")

The first-slot pick rate is your diagnostic. If it sits near 0.50, the judge is not favoring a position and you can mostly relax. If it drifts to 0.60 or above, the judge is grabbing a slot, and any single-order win-rate you computed is off by roughly the amount that bias leaks into your particular matchup.

Criterion two: correct, do not just report

Once you have both orders, you have three honest options, in order of how much data they cost you.

Average over both orders. Score each pair by the candidate's average result across the two runs: a pair it wins in both orders counts as a full win, a pair it loses in both counts as zero, and a pair that splits (wins one order, loses the other) contributes half a win. This is the cheapest correction and it neutralizes a symmetric position preference, because the bias helps the candidate in one order and hurts it in the other, and the two roughly cancel.

Drop the pairs where the order flips the call. These are pairs the judge could not decide on the merits. Reporting a win-rate on the consistent pairs only, with the count shown, is more honest than folding coin-flips into the average. You lose sample size, so show the n.

Use the flip rate as a judge-quality gate. If more than, say, 15% of your pairs flip on swap, the problem is not your model, it is your judge. A judge that reverses itself on one in six comparisons is not a measuring instrument yet. Fix the rubric or the judge before you trust any win-rate it produces.

def scored_winrate(pairs, judge_pair):
    """Average-over-both-orders win-rate, plus the flip rate as a health check."""
    credit = 0.0
    flips = 0
    for p in pairs:
        w1, w2 = judge_pair(p)
        credit += (w1 + w2) / 2.0     # half-credit on splits
        flips += (w1 != w2)
    return credit / len(pairs), flips / len(pairs)
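
A small usage sketch for the function above, with pre-recorded verdicts in place of live judge calls. The `recorded` data and the `judge_pair` stand-in are hypothetical; in a real run `judge_pair` would make the two judge calls for one pair and return whether the candidate won in each order.

```python
def scored_winrate(pairs, judge_pair):
    """Same averaging as above: half-credit on splits, plus the flip rate."""
    credit, flips = 0.0, 0
    for p in pairs:
        w1, w2 = judge_pair(p)
        credit += (w1 + w2) / 2.0
        flips += (w1 != w2)
    return credit / len(pairs), flips / len(pairs)

# Hypothetical pre-recorded verdicts: (candidate won in order 1, in order 2).
recorded = (
    [{"verdicts": (True, True)}] * 5       # candidate wins both orders
    + [{"verdicts": (False, False)}] * 3   # baseline wins both orders
    + [{"verdicts": (False, True)}] * 2    # split: the order decided the call
)

def judge_pair(pair):
    # Stand-in for two real judge calls; just replays the recorded verdicts.
    return pair["verdicts"]

win_rate, flip_rate = scored_winrate(recorded, judge_pair)
print(f"averaged win-rate: {win_rate:.2f}  flip rate: {flip_rate:.2f}")
# averaged win-rate: 0.60  flip rate: 0.20
```

Five clean wins plus two half-credits give 6 out of 10. A win-rate on consistent pairs only would report 5 of 8 instead, which is why the count should always travel with the number.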

Criterion three: know when a swap is not enough

Two orders catch left-versus-right bias. They do not catch a judge that always prefers the longer answer, or the one with more markdown, or the more confident tone. Those biases survive the swap because they travel with the content, not the slot. Position correction buys you one specific fix. It does not certify the judge overall. Zheng et al. are explicit that position bias is one of several, alongside verbosity and self-enhancement bias, so treat the swap as necessary, not sufficient.

FAQ

Do I have to double every judge call? That doubles my eval cost.

For the run you make a ship decision on, yes, run both orders. It is the only way to separate quality from seating. For continuous monitoring you can subsample: run both orders on a random 10-20% of pairs to track the flip rate over time, and only go full-swap when you are about to make a call. The cost is real but bounded, and it is cheaper than shipping on a contaminated number.
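
The subsampling approach can be sketched like this. The 15% fraction, the helper names, and the flip count are placeholders of mine. The Wilson interval is there because a flip rate estimated from a small subsample is noisier than it looks.

```python
import math
import random

def wilson_interval(k: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def sample_for_swap(pair_ids, fraction=0.15, seed=0):
    """Pick a random subset of pairs to run in both orders."""
    rng = random.Random(seed)
    k = max(1, round(len(pair_ids) * fraction))
    return rng.sample(pair_ids, k)

pair_ids = list(range(400))
subset = sample_for_swap(pair_ids)     # 60 of the 400 pairs
observed_flips = 7                     # hypothetical count from the swapped runs
lo, hi = wilson_interval(observed_flips, len(subset))
print(f"swapped {len(subset)} pairs, flip rate {observed_flips / len(subset):.3f} "
      f"(95% CI {lo:.3f}-{hi:.3f})")
```

With 60 pairs the interval is fairly wide, so a subsample like this is better at tracking a trend over time than at enforcing a tight threshold on any single run.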

My judge barely flips (flip rate under 3%). Can I stop swapping?

If you have measured a low flip rate on a representative sample and the first-slot pick rate is near 0.50, single-order evals are defensible for that judge and that task. Re-check when you change the judge model, the rubric, or the answer format, because position bias is not a fixed property of the judge, it interacts with the prompt.

Does randomizing the order instead of swapping fix it?

Randomizing removes the systematic tilt from your aggregate win-rate, which is better than a fixed order. It does not tell you the flip rate, so you lose the diagnostic. Swapping every pair gives you both the correction and the measurement, which is why I prefer it for decision runs.

Open question

Averaging over both orders neutralizes a symmetric position bias. But nothing guarantees the bias is symmetric. A judge might favor slot A by 12 points and slot B by 4, in which case the two orders do not cancel and my averaged win-rate is still tilted. I can detect the asymmetry (the first-slot pick rate is not 0.50 even after swapping), but I do not have a clean estimator that corrects for an asymmetric position bias without assuming a model of how the bias combines with quality. Modeling it as an additive slot effect in a Bradley-Terry-style fit is the obvious next move, and I have not seen it done convincingly on real judge data. What does a principled asymmetric-position correction look like in practice?
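
For reference, this is roughly what the additive slot-effect idea looks like in its smallest form: two aggregate win-rates, two parameters, solved in closed form on the logit scale. It is a sketch of the modeling assumption, not an answer to the question. The input rates and the function name are hypothetical, and additivity on the logit scale is exactly the part that is assumed rather than tested.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def additive_slot_fit(cand_win_when_first: float, cand_win_when_second: float):
    """Two-parameter logit model (an assumption, not an established estimator):
         P(candidate wins | candidate in slot A) = sigmoid(beta + delta)
         P(candidate wins | candidate in slot B) = sigmoid(beta - delta)
    beta  = candidate strength relative to baseline
    delta = first-slot advantage, assumed additive on the logit scale."""
    a = logit(cand_win_when_first)
    b = logit(cand_win_when_second)
    return (a + b) / 2, (a - b) / 2

# Hypothetical aggregate win-rates from the two orders.
beta, delta = additive_slot_fit(0.61, 0.46)
print(f"slot effect (logit): {delta:.3f}")
print(f"position-adjusted candidate win prob: {sigmoid(beta):.3f}")
```

With two rates and two parameters the fit is exact, so it cannot tell you whether additivity holds. Checking that needs per-pair data and more structure (for example, a slot effect that varies with the quality gap), which is where the open question starts.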

We fixed the worst prompt variant. It got better. That doesn't mean the fix worked.

Maya Andersson — Fri, 03 Jul 2026 16:47:59 +0000

A pattern I've seen on more than one team: weekly eval run finishes, someone sorts the leaderboard, and the worst-performing prompt variant or model checkpoint gets flagged for attention. Someone makes a change, a tweak to the system prompt, a different few-shot example, sometimes just a rewording of one instruction. Next week's eval run shows that variant improved. The team credits the fix.

Sometimes the fix is real. Often, some or all of that improvement would have happened anyway, with no change at all, just because of how "worst performer" gets selected.

The mechanism

If your eval set is small (a few hundred examples, sometimes less) or your judge scoring is noisy (LLM-judge variance, human rater disagreement, both), then any variant's measured score on a given week is its true underlying quality plus some random noise. Most weeks, that noise roughly cancels out. But the variant you pick out as "worst this week" wasn't selected at random. It was selected because it had a low score, and a low score is disproportionately likely to belong to a variant that both has middling true quality and got unlucky that week.

Next time you measure it, the "got unlucky" part isn't guaranteed to repeat. On average, it won't. So the score tends to drift back up toward that variant's actual mean, regardless of whether anyone touched the prompt. This is regression to the mean, and it's one of the oldest documented statistical phenomena (Galton described it in the 1880s studying the heights of parents and children, well before anyone was eval-ing a prompt), and it shows up anywhere you select on an extreme observation and then re-measure.

The eval version of the trap: you didn't just measure the worst variant, you measured, edited, and then re-measured it, and attributed all of the change to the edit. If a genuinely untouched variant would have shown a similar-looking bounce on its own, your attribution is wrong, or at least partly wrong, and you have no way to tell how much of the "improvement" is signal versus reversion unless you have a comparison point.

What that comparison point looks like

The clean version of this check is simple in principle: alongside the variant you changed, keep at least one variant completely untouched and re-run the same eval next week. If the untouched variant "improves" by a similar margin to the one you edited, your fix probably isn't doing what you think it's doing. If the untouched variant stays flat while the edited one jumps, that's a much stronger case for the edit actually working.

Most teams don't do this, because "the variant we already know is our worst" doesn't feel worth spending eval budget on twice. That instinct is exactly backwards. The untouched variant is your control group, and the entire point of the exercise is that you don't know in advance how much of next week's number is reversion.
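
The arithmetic of the control comparison is a plain difference-in-differences. A minimal sketch, with made-up scores and a function name of my own:

```python
def edit_effect_vs_control(edited_before, edited_after,
                           control_before, control_after):
    """Difference-in-differences: the edited variant's change minus the
    untouched control's change over the same two eval runs."""
    edited_delta = edited_after - edited_before
    control_delta = control_after - control_before
    return edited_delta, control_delta, edited_delta - control_delta

# Hypothetical scores for two variants that both ranked near the bottom in week 1.
edited_delta, control_delta, net = edit_effect_vs_control(
    edited_before=0.62, edited_after=0.74,
    control_before=0.63, control_after=0.71,
)
print(f"edited: {edited_delta:+.2f}  control: {control_delta:+.2f}  net: {net:+.2f}")
# edited: +0.12  control: +0.08  net: +0.04
```

One design note, hedged: the comparison is cleanest when the control was also a low scorer in week 1, because a control picked from the middle of the leaderboard has less reversion to show and will make the edit look better than it is.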

Simulating it

Here's a small simulation that makes the effect concrete. Twenty prompt variants, each with a stable "true" quality score, plus judge-scoring noise added on top each time you measure. Nothing about any variant changes between week 1 and week 2, no edits at all. We just re-measure the same 20 variants with fresh noise, exactly the way a noisy eval set would behave on a second run.

import numpy as np

rng = np.random.default_rng(3)

n_variants = 20
true_scores = rng.normal(loc=0.75, scale=0.05, size=n_variants)
noise_sd = 0.06

week1 = true_scores + rng.normal(0, noise_sd, n_variants)
week2 = true_scores + rng.normal(0, noise_sd, n_variants)  # nothing changed

worst_idx = np.argmin(week1)
print(f"worst variant in week 1: index {worst_idx}, score {week1[worst_idx]:.3f}, true mean {true_scores[worst_idx]:.3f}")
print(f"same variant, week 2, still unchanged: {week2[worst_idx]:.3f}")
print(f"apparent 'improvement' with zero real change: {week2[worst_idx] - week1[worst_idx]:.3f}")

# average this effect over many simulated weeks to see how reliable it is
n_sims = 2000
deltas = []
for _ in range(n_sims):
    true_scores = rng.normal(loc=0.75, scale=0.05, size=n_variants)
    w1 = true_scores + rng.normal(0, noise_sd, n_variants)
    w2 = true_scores + rng.normal(0, noise_sd, n_variants)
    idx = np.argmin(w1)
    deltas.append(w2[idx] - w1[idx])

print(f"mean apparent improvement for the worst variant, {n_sims} simulated weeks: {np.mean(deltas):.3f}")
print(f"fraction of simulated weeks the untouched worst variant looked better: {np.mean(np.array(deltas) > 0):.2%}")

On my run, the single example showed the worst variant jumping from 0.604 to 0.788, an apparent 0.183 improvement with literally nothing changed. Across 2,000 simulated weeks, the untouched "worst variant" improved on its second measurement about 88% of the time, by an average margin of roughly 0.09. In this simulation, with these noise and true-score parameters, "we fixed the worst variant and it got better" would have been true, in the sense of the number going up, on the vast majority of weeks, with zero actual change to anything.

The specific numbers here (88%, 0.09) are artifacts of the noise_sd and true_scores parameters I chose, not universal constants. Increase noise_sd relative to the spread of true_scores and the effect gets stronger (noisier judging makes this worse, not better). Shrink it and the effect weakens. If you want a feel for your own eval setup, the parameters worth plugging in are your own judge's test-retest variance (score the same output twice, see how much it moves) and the actual spread of scores across your variants, not the placeholder values above.

Why this is easy to miss in practice

A few things conspire to hide this in normal team workflow. First, the "worst variant" framing is compelling on its own, it feels obviously right to go fix whatever scored lowest, so nobody questions the selection process itself. Second, most teams don't keep an explicit untouched control variant running alongside their edits, so there's no comparison point showing what "no change" would have looked like. Third, a single week's bounce is a satisfying, legible story ("we fixed the verbosity issue and the score went up") in a way that "some fraction of this was probably regression to the mean" is not, and legible stories win in a fast-moving team even when they're partly wrong.

None of this means fixes never work. Plenty of prompt edits genuinely help, and this isn't an argument for ignoring your worst-performing variant. It's an argument for having a way to tell how much of the observed change is due to the edit versus due to how the variant got selected in the first place.

Caveats

The simulation assumes independent, symmetric, roughly Gaussian noise around a stable true score, which is a simplification. Real eval noise can be skewed (a judge might systematically overscore short outputs, for instance, which isn't symmetric noise), and a variant's "true" quality isn't actually fixed either, model behavior can drift, your eval set's composition can shift week to week, and a genuinely-worse variant might be worse for a structural reason that a prompt edit can legitimately fix. This simulation isolates the pure noise-driven case to make the mechanism visible. Real situations are a mix of genuine effect and reversion, and the honest answer is usually "some of both," not "all reversion" or "all real."

It's also worth flagging that the size of this effect depends heavily on how extreme the selection is. If you pick the single worst variant out of 20, you're selecting a more extreme outlier than if you pick the worst out of 3, and the regression effect is correspondingly stronger with more variants and more noise relative to their true spread. A team running this same check with 5 variants and a much larger, less noisy eval set would see a smaller effect than the one in this simulation, not the same one.
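
Both dependencies (number of variants, and noise relative to true spread) are easy to check by sweeping the same simulation. This is a vectorized sketch of the setup used earlier; the grid values are arbitrary, and `mean_bounce` is a helper name of mine.

```python
import numpy as np

def mean_bounce(n_variants, noise_sd, true_sd=0.05, n_sims=4000, seed=0):
    """Average week-2 minus week-1 score of the week-1 worst variant,
    with no change to any variant between the two measurements."""
    rng = np.random.default_rng(seed)
    true = rng.normal(0.75, true_sd, size=(n_sims, n_variants))
    w1 = true + rng.normal(0, noise_sd, size=true.shape)
    w2 = true + rng.normal(0, noise_sd, size=true.shape)
    worst = w1.argmin(axis=1)          # index of the worst variant per simulated week
    rows = np.arange(n_sims)
    return float(np.mean(w2[rows, worst] - w1[rows, worst]))

for n_variants in (3, 5, 20):
    for noise_sd in (0.02, 0.06):
        b = mean_bounce(n_variants, noise_sd)
        print(f"variants={n_variants:2d}  noise_sd={noise_sd:.2f}  mean bounce={b:.3f}")
```

The bounce grows along both axes: more variants to pick the worst from, and more noise per measurement.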

Discussion

The broader point isn't specific to prompt engineering. Any process where you select an extreme observation, intervene, and then re-measure without a control is vulnerable to this, performance reviews for the "worst" employee of the quarter, marketing campaigns for the "worst" performing ad variant, sports commentary about a player who "turned it around" after a bad game. LLM eval just happens to be a place where the sample sizes are often small and the measurement noise is often high (a 200-example eval set scored by an LLM judge with real day-to-day variance is a genuinely noisy instrument), which makes the effect larger and easier to demonstrate than it would be with a cleaner measurement setup.

The fix is procedural, not statistical wizardry: keep a control, be honest about your noise floor, and treat a single week's bounce as weak evidence until you've seen it hold up.

FAQ

Is regression to the mean the same thing as the eval set just being too small?
Related but not identical. A small eval set increases the noise in each measurement, which makes regression to the mean more visible and more severe, but the phenomenon exists even with a reasonably sized eval set as long as there's any measurement noise and you're selecting on an extreme score. Small samples amplify it, they don't cause it.

How do I know if my "fix" actually worked, then?
Run the edited variant and at least one untouched control variant through the same fresh eval sample. If the edited variant improves meaningfully more than the untouched control, that's real evidence of an effect. If they improve by similar amounts, the edit likely isn't doing the work you're crediting it for, at least not yet, at least not in a way this comparison can detect.

Does this apply to picking the "best" variant too, not just the worst?
Yes, symmetrically. A variant that looks unusually good one week is disproportionately likely to have gotten lucky that week, and its score will tend to drift down on remeasurement even with no changes. Teams are generally less suspicious of good news than bad news, which makes this direction of the effect even easier to miss.

Can I just increase my eval set size to make this go away?
It reduces the effect (larger samples mean less noise per measurement, so less room for a lucky or unlucky draw), but it doesn't eliminate it. As long as there's any measurement noise and you're selecting based on an extreme score, some regression to the mean will occur. A bigger eval set shrinks the bounce that reversion alone can produce, it doesn't remove the need to check.

Is this the same as p-hacking?
No, it's a related but distinct problem. P-hacking involves selectively choosing or reporting analyses to get a significant result. Regression to the mean happens even with completely honest, single, pre-planned analysis, it's a property of selecting on an extreme observation and re-measuring, not a property of how many analyses you ran or how you reported them.

What's a reasonable noise floor to expect from LLM-judge scoring?
I don't have a single number that generalizes, it depends heavily on the judge model, the rubric's specificity, and the task. The useful exercise is measuring your own: score the same set of outputs twice with your judge (same prompts, same outputs, re-run the judge call) and look at how much individual scores move between runs. That gives you an empirical noise_sd to reason about, rather than assuming a textbook value.
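
A sketch of that measurement, assuming the two judge passes add independent noise of equal variance (so the variance of the difference is twice the noise variance). The simulated scores stand in for real judge output, and `estimate_noise_sd` is my own helper name.

```python
import numpy as np

def estimate_noise_sd(run1, run2) -> float:
    """Per-score judge noise from two scoring passes over the same outputs.
    Assumes each pass adds independent noise with the same variance, so
    Var(run1 - run2) = 2 * noise_sd**2."""
    diff = np.asarray(run1, dtype=float) - np.asarray(run2, dtype=float)
    return float(diff.std(ddof=1) / np.sqrt(2))

# Simulated stand-in for real judge output: known noise of 0.06 per pass.
rng = np.random.default_rng(0)
quality = rng.uniform(0.3, 0.9, size=500)
run1 = quality + rng.normal(0, 0.06, size=500)
run2 = quality + rng.normal(0, 0.06, size=500)
print(f"estimated noise_sd: {estimate_noise_sd(run1, run2):.3f}")
```

This gives per-example judge noise. The noise on a variant's mean score over n examples is smaller (by about the square root of n if errors are independent across examples), and it does not include the sampling noise from which examples ended up in the eval set.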

Does averaging over more judge calls per example fix this?
It reduces judge-level noise (averaging 3-5 judge calls per output instead of 1 will tighten your per-example score), which shrinks the noise_sd term in this simulation and therefore shrinks the regression effect. It doesn't address noise coming from the eval set itself being small, or from genuine week-to-week variability in the thing you're measuring, so it's a partial fix, not a complete one.

Open question

I'd like a cleaner way to separate "how much of this week's bounce was regression to the mean" from "how much was a real effect of the edit" using a single week's data, without needing a held-out untouched control every time (which is the correct answer methodologically, but isn't always practical when eval budget or compute is tight). In principle you could estimate the expected regression-to-the-mean magnitude from the historical variance and test-retest reliability of your own eval pipeline, then subtract that expectation from the observed change before crediting the rest to the edit. I haven't seen this done rigorously in an LLM eval context, and I'm not sure how sensitive that estimate would be to getting your own noise model wrong. If someone has built this out properly for a real eval pipeline, I'd want to see the math.
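
For what it is worth, the "in principle" version is short to write down. Under the same Gaussian assumptions as the simulation, the expected score on re-measurement shrinks toward the grand mean by the reliability, so the expected bounce with no real change is (1 - reliability) times the gap to the mean. A sketch, with the week-2 score invented for illustration:

```python
def expected_reversion(observed, grand_mean, true_sd, noise_sd):
    """Expected bounce on re-measurement with no real change, assuming
    Gaussian true scores and independent Gaussian noise.
    reliability = true variance / (true variance + noise variance)."""
    reliability = true_sd ** 2 / (true_sd ** 2 + noise_sd ** 2)
    return (1 - reliability) * (grand_mean - observed)

# Parameters from the simulation above; the post-edit week-2 score is hypothetical.
observed_week1 = 0.604
observed_week2_after_edit = 0.74
reversion = expected_reversion(observed_week1, grand_mean=0.75,
                               true_sd=0.05, noise_sd=0.06)
raw_change = observed_week2_after_edit - observed_week1
print(f"raw change: {raw_change:.3f}")
print(f"expected reversion alone: {reversion:.3f}")
print(f"change left to credit to the edit: {raw_change - reversion:.3f}")
```

The sensitivity concern is visible in the formula: the correction scales directly with the estimated noise variance, so a wrong noise model moves the "credited" remainder one for one.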

I reviewed six "operator-ready" checklists for AI agents. None of them define the problem correctly.

Maya Andersson — Wed, 01 Jul 2026 16:00:59 +0000

The industry has converged on a definition of "operator-ready" that is measurable, deployable, and wrong.

The most cited frameworks, Anthropic's "Building Effective Agents" (December 2024), Hamel Husain's "Your AI product needs evals" (2024), the LangChain eval documentation, NIST AI RMF (2023), Google's responsible AI practices, OpenAI's model specification (May 2024), share a common structure. They define reliability as pass-rate on a test set. They define readiness as a threshold on that pass-rate.

This is a reasonable definition for production-readiness. It is not a correct definition for operator-readiness.

The distinction is not semantic. It has direct consequences for how you test, what you ship, and what breaks after handoff.

What the existing frameworks get right

Hamel Husain's framework is the most practically useful of the six. His argument that "you cannot improve what you cannot measure" is correct, and his guidance on building eval sets that are representative, diverse, and graded with real human judgment is solid. The Anthropic guide's emphasis on minimal footprint and clear failure modes is well-reasoned for the agent design phase.

The NIST AI RMF is the most complete risk taxonomy. Its four functions (Govern, Map, Measure, Manage) are a useful organizational structure for compliance-conscious deployments.

These are good frameworks. The problem is not that they're wrong. The problem is that they're answering a different question than the one that breaks things in production.

Where they all fall short

Every framework I reviewed defines reliability as a static property: "the agent achieves X% on the eval set." That's a snapshot metric, not a deployment metric.

Operators change things. They add new document types. They expand use cases. They bring their own data. Their users find inputs that your test set never anticipated.

The real question is not "does the agent achieve X% on the eval set." It's "does the agent maintain X% after six weeks of operator usage, on the operator's actual input distribution."

Those are different questions. The first is testable before deployment. The second requires a different kind of eval infrastructure.

The three things the existing frameworks miss

1. Distribution shift is the default condition

Every source I checked treats distribution shift as an edge case to handle. It is not an edge case. It is the default condition of operator deployment.

The operator's data is never exactly your eval data. The operator's users will find inputs you did not anticipate. The operator's business context will evolve. Distribution shift is not a risk you mitigate and move past. It's the ongoing condition of every production deployment.

A framework that doesn't include ongoing distribution shift monitoring as a first-class readiness requirement is describing a static artifact, not a live system.
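
As a rough illustration of what first-class monitoring can look like, here is a minimal rolling pass-rate tracker. The class name, window size, and drop threshold are all placeholders, and it assumes you have some way to grade a sample of recent operator inputs as pass or fail.

```python
from collections import deque

class RollingPassRate:
    """Tracks pass/fail over the most recent `window` graded operator inputs
    and flags when the rolling pass-rate falls more than `max_drop` below
    the pre-deployment baseline. All thresholds here are placeholders."""

    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.results = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        self.results.append(bool(passed))

    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

    def drifted(self) -> bool:
        # Only judge a full window, to avoid alerting on a handful of samples.
        full = len(self.results) == self.results.maxlen
        return full and (self.baseline - self.pass_rate()) > self.max_drop

monitor = RollingPassRate(baseline=0.94, window=50)
for i in range(50):
    monitor.record(i % 5 != 0)   # hypothetical stream: every fifth input fails
print(f"rolling pass-rate: {monitor.pass_rate():.2f}  drifted: {monitor.drifted()}")
# rolling pass-rate: 0.80  drifted: True
```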

2. "Pass-rate" conflates several different failure modes

An eval pass-rate in the low nineties can coexist with an operator error rate several times higher on real data. I've seen this in practice across multiple deployments. The reason is that pass-rate is an aggregate measure that hides the variance of failure types.

There are at least four different failure modes that look identical in a pass-rate number:

Formatting failures (schema doesn't match, easy to catch and retry)
Content errors on in-distribution inputs (model got the right format, wrong substance)
Content errors on out-of-distribution inputs (distribution shift failures, the most dangerous)
Silent failures (output is wrong but passes all automated checks)

A team with 94% pass-rate and mostly formatting failures is in a very different position from a team with 94% pass-rate and mostly silent content errors. The number looks the same.

Operator-readiness requires disaggregating the failure modes, not aggregating them into a single score.
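
A sketch of what disaggregating looks like in practice, using the four failure modes listed above. The two teams and their failure labels are invented, and the label strings are my own shorthand.

```python
from collections import Counter

# Hypothetical failure labels for two teams, each with 6 failures in 100 runs
# (a 94% pass-rate). The four labels follow the failure modes listed above.
team_a = ["formatting"] * 5 + ["content_in_dist"] * 1
team_b = ["silent"] * 4 + ["content_out_of_dist"] * 2

def failure_breakdown(failures, total_runs):
    """Return the pass-rate and each failure mode's share of all failures."""
    pass_rate = 1 - len(failures) / total_runs
    counts = Counter(failures)
    shares = {mode: n / len(failures) for mode, n in counts.items()}
    return pass_rate, shares

for name, failures in [("team A", team_a), ("team B", team_b)]:
    pass_rate, shares = failure_breakdown(failures, total_runs=100)
    detail = ", ".join(f"{m}={s:.0%}" for m, s in sorted(shares.items()))
    print(f"{name}: pass-rate {pass_rate:.0%} | {detail}")
```

Both teams print 94%. Only the breakdown shows that one of them is mostly retrying schema errors and the other is mostly shipping wrong answers. Note that silent failures pass automated checks by definition, so those labels have to come from manual review.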

3. The eval-to-deployment gap is structural

The Anthropic guide correctly notes that agents should be evaluated with "realistic and diverse inputs." It does not address what happens when the operator's inputs are more diverse than your test set in ways you couldn't have anticipated.

This is the eval-to-deployment gap, and it is structural. You build the eval set with the data you have. The operator deploys with the data they have. Those two sets overlap imperfectly.

The only way to close this gap is to treat pre-deployment testing on the operator's own corpus as a mandatory step. Not as a quality assurance nicety. As a readiness gate.

Fifty documents from the operator's actual corpus, reviewed manually, compared to the eval pass-rate on the same task. If the accuracy on those fifty documents is materially lower than the eval accuracy, the deployment is not ready regardless of what the aggregate pass-rate says.
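The gate can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: `operator_gate` and `wilson_interval` are hypothetical helpers, the 5-point tolerance is one reading of "materially lower", and the numbers are made up. The Wilson interval is included because a 50-document sample is small, and it helps to see how wide the uncertainty is.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes, n, confidence=0.95):
    """Wilson score interval for a pass rate observed as `passes` out of `n`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def operator_gate(eval_pass_rate, operator_passes, operator_n, max_gap=0.05):
    """Ready only if operator-sample accuracy is within `max_gap` of eval accuracy."""
    operator_rate = operator_passes / operator_n
    gap = eval_pass_rate - operator_rate
    return {
        "operator_rate": operator_rate,
        "gap": gap,
        "interval": wilson_interval(operator_passes, operator_n),
        "ready": gap <= max_gap,
    }


# Made-up numbers: 94% on the eval set, 42 of 50 operator documents correct.
report = operator_gate(0.94, 42, 50)
print(report)
```

With 42 of 50 correct, the gap is ten points and the gate fails. The interval on the operator rate is about twenty points wide, so a 50-document sample is best at flagging large gaps. Small ones need more documents.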

What a correct operator-readiness definition looks like

An agent is operator-ready when:

Pass-rate on the operator's own corpus sample (minimum 50 documents, reviewed manually) is within 5 percentage points of the pass-rate on your pre-deployment eval set.
Failure mode distribution is documented: what percentage of failures are formatting errors vs. content errors vs. silent failures.
Distribution shift monitoring is in place: a scheduled re-evaluation on a rolling sample of recent operator inputs, with alerting when the pass-rate drift exceeds a defined threshold.
Failure recovery behavior is tested explicitly: what does the agent do with inputs outside its distribution? Does it fail loudly (flag for review) or fail silently (produce wrong output)?

This is more work than "run the eval suite, check the score." It is also the actual test for whether the agent will maintain its quality guarantees six weeks after handoff.
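The monitoring requirement above can be sketched as a rolling window over reviewed operator inputs. This is a minimal illustration, assuming each recent input ends up with a pass/fail verdict. The `RollingPassRateMonitor` class, the window size, and the 5-point threshold are all hypothetical choices.

```python
from collections import deque


class RollingPassRateMonitor:
    """Tracks pass rate over the most recent `window` reviewed inputs."""

    def __init__(self, baseline, window=200, max_drop=0.05):
        self.baseline = baseline          # pass rate measured at handoff
        self.max_drop = max_drop          # alert when the drop exceeds this
        self.outcomes = deque(maxlen=window)

    def record(self, passed):
        self.outcomes.append(bool(passed))

    def pass_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drifted(self):
        # Only alert once the window is full, to avoid firing on a handful of samples.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.baseline - self.pass_rate() > self.max_drop


monitor = RollingPassRateMonitor(baseline=0.92, window=100, max_drop=0.05)
for i in range(100):
    monitor.record(i % 10 != 0)   # 90% passing: inside the threshold
print(monitor.pass_rate(), monitor.drifted())
for i in range(100):
    monitor.record(i % 5 != 0)    # 80% passing: the window now shows a 12-point drop
print(monitor.pass_rate(), monitor.drifted())
```

The window size sets how small a drift the alert can see, so it deserves the same care as the size of the eval set itself.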

FAQ

What's the fastest way to close the eval-to-deployment gap?

Before handoff, run the agent on 50 documents from the operator's actual corpus. Not synthetic data, not your training set. Documents the operator will actually send. Review those outputs manually. Compare the accuracy to your eval accuracy on the same task.

If the gap is more than 5 percentage points, you have a distribution shift problem to characterize before deploying.

Is there a practical threshold for operator-readiness?

There isn't a universal threshold. The right threshold depends on the stakes of failure. For a low-stakes use case (content categorization), 85% might be acceptable. For a high-stakes use case (contract extraction with legal consequences), 85% probably isn't.

What matters more than the threshold is that you're measuring operator accuracy (on the operator's data) rather than eval accuracy (on your data). Those are measured on different populations, so one number does not stand in for the other.

Does ongoing monitoring replace pre-deployment testing?

No. Monitoring catches degradation after deployment. Pre-deployment testing on the operator's corpus catches distribution shift before deployment. Both are necessary. Neither substitutes for the other.

What about fine-tuning on the operator's data?

Fine-tuning is the right long-term answer for persistent distribution shift. It's not the short-term answer for a deployment that needs to go live in two weeks. The short-term answer is: characterize the gap, document the failure modes, set up monitoring, and be transparent with the operator about the limitations on their data distribution.

Open question

The hardest problem I haven't seen solved well: how do you define "operator-ready" for an agent that will serve multiple operators with fundamentally different data distributions?

A financial services operator and a healthcare operator running the same document-extraction agent have different input distributions, different failure modes, and different acceptable error rates. Per-operator eval sets are the correct answer in theory. They're expensive in practice.

Is there a reasonable way to stratify a single eval set across operator types without running a full per-operator measurement pass? I've seen teams try domain-stratified sampling (one slice per major input type), but the slice sizes are never large enough to give statistically stable estimates for the rare input categories that are most likely to cause problems.

If you've solved this problem, I'd be interested in how.

We added synthetic data to our eval set. The pass rate rose, and so did our production incidents.

Maya Andersson — Mon, 29 Jun 2026 16:56:20 +0000

We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked like our traffic, we scored against them, the pass rate went up, and we felt good. Then production incidents went up too, on exactly the inputs the synthetic set said we handled. The test set had grown and its predictive value had dropped, at the same time.

That is the trap with synthetic eval data, and it is not a tooling problem. Generating cases is easy now. Every framework will hand you a thousand. The hard part, the part none of the generators do for you, is proving the synthetic set behaves like the traffic you actually get. A test set that does not match your distribution is not a smaller version of production. It is a different test, and it can pass while production fails.

So when I compare the tools that generate eval data, I do not grade them on how many cases they spit out, or how clean the prompts are. I grade them on one question: how much do they help me check that the generated set looks like reality before I trust a number it produces?

The criterion, stated precisely

A synthetic eval set is trustworthy when two things hold. First, coverage: the cases span the same kinds of inputs your real traffic contains, in roughly the same proportions, including the messy and rare ones. Second, difficulty calibration: the synthetic cases are about as hard as real cases, so the pass rate on synthetic data tracks the pass rate on real data.

Both are measurable, and neither is measured by default. Coverage you check by embedding real and synthetic inputs and comparing the distributions, or by labeling both with the same taxonomy and comparing the histograms. Calibration you check by holding out a labeled slice of real data and confirming the model's pass rate on it lands near its pass rate on the synthetic set. If those two numbers diverge, the synthetic set is lying to you, and no amount of volume fixes it.
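The taxonomy version of the coverage check is easy to sketch. The example below assumes both sets have already been labeled with the same categories. The category names are made up, and total variation distance is one reasonable way to compare the two histograms, not the only one.

```python
from collections import Counter


def category_shares(labels):
    """Fraction of the sample that falls in each category."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: n / total for cat, n in counts.items()}


def total_variation(real_labels, synthetic_labels):
    """Half the L1 distance between the two histograms (0 = identical, 1 = disjoint)."""
    real = category_shares(real_labels)
    synth = category_shares(synthetic_labels)
    cats = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in cats)


def missing_categories(real_labels, synthetic_labels):
    """Categories that appear in real traffic but not in the synthetic set."""
    return sorted(set(real_labels) - set(synthetic_labels))


# Made-up taxonomy labels for a real sample and a generated set.
real = ["invoice"] * 60 + ["receipt"] * 30 + ["handwritten"] * 10
synthetic = ["invoice"] * 80 + ["receipt"] * 20

print(total_variation(real, synthetic))
print(missing_categories(real, synthetic))
```

The distance says how far off the proportions are overall. The missing-category list catches the worse case, where a kind of input that exists in production is absent from the synthetic set entirely.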

That is the lens for everything below.

The generators, by how much they help you validate

DeepEval (Synthesizer). Strong, controllable generation: it builds test cases from documents or from scratch, with knobs for evolution and complexity. The generation is good. What it does not hand you is the distribution-match check against your real traffic. You generate, then you validate the realism yourself. Worth reading alongside the synthetic-data-for-evaluation literature, for example the Self-Instruct work (Wang et al., arXiv:2212.10560), which is honest that generated instructions drift in diversity and difficulty unless you correct for it.

Promptfoo. Dataset and test-case generation wired into a CI-first tool, so the generated cases drop straight into a gate. Convenient for getting volume into a pipeline fast. The realism question is still yours: it will generate and run, but it does not compare the generated set's distribution to production for you.

Giskard. Comes at it from the risk angle, generating adversarial and edge cases to surface failures rather than to mirror average traffic. That is a different and useful goal, finding what breaks, but do not confuse a stress set with a representative set. An eval set built only from Giskard-style probes will over-represent the hard tail, which is great for hardening and misleading for estimating real-world pass rate.

Ragas. For RAG specifically, it generates question-answer test sets from your documents, including multi-hop questions. Good fit if your system is retrieval-shaped. The generated questions still need the same coverage check: documents you own are not the same distribution as questions users actually ask.

Future AGI. The thing it does differently is integration, not the generator itself. It is an end-to-end open-source platform, and synthetic data generation lives inside the same Datasets and evaluation surface that runs your evals and holds your traces, so the generated set, the eval that scores it, and the production traces you would validate it against are in one place rather than three. The repo is github.com/future-agi/future-agi. Be clear on what that does and does not buy you: it does not auto-prove your synthetic set matches production any more than the others do, that check is still methodology you run. What it removes is the stitching, because comparing synthetic-set behavior to real-trace behavior is a lot easier when both already live in the same system than when you are exporting CSVs between a generator, an eval library, and a tracing tool. On raw generation controllability, DeepEval's Synthesizer is at least as configurable.

The honest summary across all five: every one of them generates, and not one of them validates realism as the default first step. The validation is the work, and it is on you regardless of which generator you pick.

The procedure I actually run

Tool aside, this is the sequence, and steps 1 and 4 are the ones teams skip.

1. Pull a real sample. A few hundred genuine production inputs, with their outcomes if you have them.
2. Generate the synthetic set with whichever tool fits your shape.
3. Embed both real and synthetic inputs, compare the distributions. If the synthetic set clusters somewhere your real traffic does not, or misses a cluster real traffic has, fix the generation prompts and regenerate.
4. Hold out a labeled real slice. Score the model on it and on the synthetic set. If the two pass rates differ by more than a few points, the synthetic set is miscalibrated and its pass rate is not a proxy for anything. Do not trust it until they converge.
5. Only then use the synthetic set for volume, and keep the real slice as the anchor you re-check against.
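The calibration check, comparing the pass rate on the held-out real slice with the pass rate on the synthetic set, can be sketched as a difference of two proportions with an interval around it. The counts below are made up, `pass_rate_gap` is a hypothetical helper, and the 3-point tolerance is just one reading of "a few points".

```python
from math import sqrt
from statistics import NormalDist


def pass_rate_gap(real_passes, real_n, synth_passes, synth_n, confidence=0.95):
    """Synthetic minus real pass rate, with a normal-approximation interval."""
    p_real = real_passes / real_n
    p_synth = synth_passes / synth_n
    gap = p_synth - p_real
    # Standard error of a difference between two independent proportions.
    se = sqrt(p_real * (1 - p_real) / real_n + p_synth * (1 - p_synth) / synth_n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return gap, (gap - z * se, gap + z * se)


# Made-up counts: 300 labeled real inputs at 82% passing, 3000 synthetic cases at 93%.
gap, (low, high) = pass_rate_gap(246, 300, 2790, 3000)

# Flag miscalibration only when even the low end of the interval exceeds the tolerance.
miscalibrated = low > 0.03
print(round(gap, 3), round(low, 3), round(high, 3), miscalibrated)
```

Using the lower bound rather than the point estimate keeps a small real slice from triggering a regeneration on noise alone.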

The generator changes how pleasant steps 2 and 3 are. It does not change whether you have to do 1, 4, and 5.

FAQ

Why not just use real data and skip synthetic entirely?
Because real data is often scarce, imbalanced, or sensitive, and you cannot get enough of the rare cases that matter. Synthetic data is a reasonable way to fill those gaps. The point is not to avoid it, it is to validate it before you trust a number it produces.

How much real data do I need to validate the synthetic set?
Enough to estimate a distribution and a pass rate with a usable confidence interval, which is usually a few hundred examples, not tens of thousands. The validation slice is smaller than the synthetic set it is checking.

What is the single most common failure?
Difficulty miscalibration. Generated cases skew easy, because models write clean, unambiguous inputs and real users do not. The pass rate looks great and means nothing. The held-out real slice is what catches this.

Does generating adversarial cases count as a synthetic eval set?
It is a stress set, not a representative one. Use it to harden the system, not to estimate real-world pass rate. Keep the two sets and the two questions separate.

Open question

Distribution-match has a chicken-and-egg problem on genuinely new features, where you have little or no real traffic yet, so there is nothing to validate the synthetic set against. You are forced to trust generated data precisely when you can least check it. I do not have a clean answer here. The best I have is to treat the synthetic pass rate on a brand-new feature as a smoke test rather than a measurement, and to re-validate aggressively the moment real traffic arrives. If you have a principled way to bound how wrong a synthetic set can be before you have any real data to compare against, I would genuinely like to see it.

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

Maya Andersson — Thu, 25 Jun 2026 17:51:07 +0000

Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement, it is a vibe with a decimal point. So I ran six tools the way a methodologist would: not "which one scores," but "which one helps me prove the score is trustworthy."

Trust here has a specific meaning. An LLM judge inherits known failure modes: position bias (it favors the first answer it sees), verbosity bias (it rewards longer outputs), and self-preference (it scores outputs from its own model family higher). None of these show up in the score itself. They show up only when you compare the judge against a human-labeled set and compute agreement. The standard instrument for that is Cohen's kappa, not raw accuracy, because raw accuracy lies whenever your classes are imbalanced.

So the criterion I graded each tool on was simple: how much friction does it put between me and a confusion matrix against human labels?

DeepEval (G-Eval). The broadest eval breadth of the group, honestly. Chain-of-thought scoring via G-Eval, a pytest-style harness, a large catalog of metrics. It is the tool I reach for when I want coverage. What it does not do for you is the human-agreement step. You write the judge, you collect the labels, you compute kappa yourself. Reference: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (arXiv:2303.16634), which is worth reading precisely because it measures Spearman correlation with human judgment rather than asserting it. (G-Eval is the paper's method; DeepEval is the tool that implements it.)

Confident AI. The hosted layer on top of DeepEval. Adds storage, sharing, a dashboard. The validation gap is identical, because it is the same engine underneath. You get a nicer place to keep results, not a built-in human-agreement workflow.

Evidently. Strong on report dashboards and drift detection. If your problem is "the judge looked fine in March and I want to know when it drifts," this fits. It is monitoring-shaped, not validation-shaped. It will not hand you a kappa against a held-out human set as a first-class step.

Braintrust. The side-by-side run-comparison UI is genuinely useful for spotting where two judge configurations disagree. That is disagreement-spotting, which is upstream of validation but not the same as it. Seeing two columns diverge tells you something is off, not whether either column agrees with a human.

Promptfoo. Treats judges as test assertions. Lightweight, CI-friendly, easy to wire into a pipeline. Thin on judge-versus-human statistics by design, it is a testing tool, not a measurement-theory tool.

Future AGI. Sits in the middle of this list, not at the top of it. It is an end-to-end open-source platform rather than an eval-only tool, and its evaluation surface is hybrid: deterministic functions, grounded checks, and LLM-as-judge under one interface. The hybrid framing is the interesting part for this question, because the deterministic and grounded paths give you cheaper anchors to sanity-check the judge path against. It still does not crown itself the answer to the human-agreement problem. You bring the labels. DeepEval has broader raw eval breadth; Future AGI trades some of that breadth for the hybrid local-plus-judge structure. (Source: github.com/future-agi/future-agi.)

The finding across all six: not one of them treats "compute judge agreement with human labels and show me the confusion matrix" as the default first action. Every tool optimizes for producing a score. The validation is left as an exercise for the user, which is exactly the part most teams skip.

Here is the procedure I actually run, regardless of tool:

1. Hand-label 200 examples on the dimension I care about. Two annotators where I can afford it, so I can also measure human-human agreement.
2. Run the candidate judge on the identical 200.
3. Compute Cohen's kappa, not accuracy.
4. Deploy the judge only when kappa clears roughly 0.6, and even then I read the confusion matrix to see which class it gets wrong.
5. Rewrite the rubric against those errors and re-measure.
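Reading the confusion matrix is the part of this procedure that is easiest to skip, so here is a small sketch of building one from paired labels. The ten labels are made up for illustration, and `confusion_matrix` and `print_matrix` are hypothetical helpers written with the standard library only.

```python
from collections import Counter


def confusion_matrix(human, judge, classes):
    """Rows are human labels, columns are judge labels."""
    pairs = Counter(zip(human, judge))
    return [[pairs[(h, j)] for j in classes] for h in classes]


def print_matrix(matrix, classes):
    print("human \\ judge", *classes, sep="\t")
    for cls, row in zip(classes, matrix):
        print(cls, *row, sep="\t")


# Made-up labels for ten examples; a real run would use the full labeled set.
human = ["pass", "pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "pass", "pass", "pass", "fail", "pass", "pass", "pass", "fail", "fail"]

classes = ["pass", "fail"]
matrix = confusion_matrix(human, judge, classes)
print_matrix(matrix, classes)
```

In this toy data the judge passes two of the four cases the human failed. That pattern says the rubric is too lenient, which a single agreement number would not show.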

The tool choice changes how pleasant steps 2 through 5 are. It does not change whether you have to do them.

FAQ

Why Cohen's kappa instead of accuracy? Accuracy is inflated by class imbalance. If 90 percent of your examples are "pass," a judge that says "pass" every time scores 90 percent accuracy and zero usefulness. Kappa corrects for agreement that would happen by chance, so it does not reward that degenerate strategy.
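The 90 percent example can be worked through directly. The sketch below implements Cohen's kappa from its definition, (observed agreement minus chance agreement) divided by (1 minus chance agreement), using only the standard library. The `cohens_kappa` function is written for illustration and is not taken from any of the tools discussed.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined: both raters used a single identical class
    return (observed - expected) / (1 - expected)


human = ["pass"] * 90 + ["fail"] * 10
always_pass = ["pass"] * 100   # the degenerate judge that never fails anything

accuracy = sum(h == j for h, j in zip(human, always_pass)) / len(human)
print(accuracy)                          # 0.9
print(cohens_kappa(human, always_pass))  # 0.0
```

Observed agreement is 0.9, but chance agreement is also 0.9, because a judge that always says "pass" matches the 90 percent base rate by construction. The numerator is zero, so kappa is zero.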

What kappa is good enough? There is no universal threshold, but I treat roughly 0.6 as the floor for deploying a judge on a non-trivial dimension, and I want to see where the disagreements land before trusting it. Lower can be acceptable on genuinely subjective dimensions, see the open question below.

Do I need 200 labels specifically? No. 200 is a practical balance between annotation cost and a confusion matrix you can actually read. The point is a held-out human set, not the exact count.

Can one tool just do the validation for me? None of the six I tested ship human-agreement-with-confusion-matrix as the default workflow. They produce scores; you supply and compare the labels.

Open question

Cohen's kappa assumes a meaningful ground truth to agree with. On highly subjective dimensions (helpfulness, tone, "did this answer feel complete"), human annotators themselves often only reach kappa of 0.4 to 0.5 with each other. A judge cannot beat the ceiling set by human-human disagreement. So how should we report a judge's kappa relative to the human-human kappa on the same set, and is there a clean way to estimate the subjectivity ceiling of a dimension before we spend the labeling budget? If you have a method you trust here, I would like to see it.

LLM-as-judge tools compared: the question is not which one scores, it is which one you can trust

Maya Andersson — Wed, 17 Jun 2026 16:52:05 +0000

TL;DR: I compared the main LLM-as-judge tools (DeepEval's G-Eval, Confident AI, Evidently, Braintrust, Promptfoo, and MLflow) on the axis that actually decides whether the scores mean anything: how well each helps you VALIDATE the judge against human labels. A judge that has not been checked against humans is just a second opinion with the same blind spots, and most tooling makes it easy to run a judge and hard to prove it agrees with you.

A judge you have not validated is not a measurement

An LLM-as-judge has known failure modes: position bias (prefers the first answer), verbosity bias (prefers the longer one), and self-preference (prefers its own family). Run it un-validated and you inherit all three silently. The only thing that turns a judge into a measurement is checking its agreement with human labels on a held-out set, with an actual statistic (Cohen's kappa, not "looks about right"). So I judge the judge-tools by how much they help with that.

The six, by how much they help you validate

DeepEval (G-Eval): the popular pick. G-Eval gives you chain-of-thought judge metrics out of the box and a pytest-style harness. Strong on running judges; you bring your own human-label comparison.
Confident AI: the hosted layer on DeepEval, useful for storing runs and sharing, same validation gap to close yourself.
Evidently: strong on report-style dashboards and drift, including LLM-judge descriptors; good if you want monitoring framing.
Braintrust: a clean UI for comparing judge outputs side by side across runs, which helps you eyeball disagreement even if it does not compute kappa for you.
Promptfoo: treats the judge as an assertion in a test matrix; lightweight and CI-friendly, thin on judge-vs-human stats.
MLflow: fits if MLflow is already your tracking backbone; judge metrics plug into the same runs and registry.

None of them, as of June 2026, makes "compute the judge's agreement with my human labels and show me the confusion matrix" a one-click default, which is the step that actually decides whether the judge is trustworthy. You still wire it.

How I actually validate a judge

Label 200 examples by hand. Run the judge on the same 200. Compute Cohen's kappa (chance-corrected agreement), not raw accuracy. If kappa comes in below about 0.6, the judge is not ready: read the confusion matrix to see which classes it confuses, fix the rubric, and re-measure. Only once it clears that bar do I trust the judge on the unlabeled rest.

Open question

Kappa against my labels assumes my labels are right. On genuinely subjective dimensions (helpfulness, tone) two careful humans disagree, so the ceiling on judge-human agreement is the human-human agreement, which I rarely measure. I do not have a clean way to know whether a kappa of 0.55 means a bad judge or an irreducibly subjective task. If you have one, I want to read it.

Power analysis for LLM evals: how big does your eval set need to be to catch a 5% regression?

Maya Andersson — Mon, 15 Jun 2026 17:08:30 +0000

TL;DR: Most eval sets are sized by "what we had lying around", not by what they can actually detect. If your eval set is 50 traces and you are trying to catch a 5-point drop in pass rate, you are underpowered: the regression hides inside sampling noise more often than not, and you ship it green. A two-line power calculation tells you the size you actually need, and ours said roughly 4x what we were running.

The number nobody computes

We argue about which metric to use and skip the prior question: how big a change can this eval set even see. An eval set has a detection floor, like any experiment. Below it, a real regression and an unlucky sample look identical, so a green run means nothing.

A two-line power check

For a pass/fail eval, detecting a drop from p1 to p2 at 80% power is a standard two-proportion calculation:

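from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# detect a drop from 0.90 to 0.85 (5 points), 80% power, one-sided alpha 0.05
es = proportion_effectsize(0.90, 0.85)   # Cohen's h; positive because 0.90 > 0.85
# The effect size is positive, so the one-sided alternative must be "larger".
# With "smaller" the test looks at the wrong tail and power can never reach 0.8.
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8, alternative="larger")
print(round(n))   # traces per run (baseline and candidate each): roughly 540, not 50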

At 50 traces we could only reliably catch a swing of ~15 points, which is a disaster you would notice anyway, not the slow drift you actually care about.
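The detection floor can be computed by running the same calculation in reverse: fix the number of traces and solve for the smallest drop that is detected at 80% power. The sketch below uses only the standard library and the arcsine effect size (Cohen's h). `min_detectable_drop` is a hypothetical helper, and the answer depends on an assumption the text does not pin down, namely whether the baseline is a fixed number or a second run of the same size with its own noise.

```python
from math import asin, sin, sqrt
from statistics import NormalDist


def min_detectable_drop(baseline, n, alpha=0.05, power=0.8, two_sample=True):
    """Smallest drop from `baseline` that a one-sided test detects at the given power.

    two_sample=True compares two runs of n traces each;
    two_sample=False compares one run of n traces against a fixed baseline.
    """
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    h = z * sqrt((2 if two_sample else 1) / n)   # required effect size (Cohen's h)
    phi = 2 * asin(sqrt(baseline)) - h           # arcsine-transformed regressed rate
    if phi <= 0:
        return baseline                          # even a drop to zero is not detectable
    return baseline - sin(phi / 2) ** 2


for n in (50, 200, 500):
    print(
        n,
        round(min_detectable_drop(0.90, n), 3),
        round(min_detectable_drop(0.90, n, two_sample=False), 3),
    )
```

Under these assumptions, 50 traces put the floor somewhere in the teens of points from a 0.90 baseline. It sits lower when the baseline is treated as fixed and higher when both runs are noisy, which is the same territory as the ~15 quoted above.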

What we changed

Sized the eval set to the smallest regression we cared about (a 5-point drop), which set the floor. Stratified so rare-but-important slices were not drowned out. Reported the eval result with its uncertainty, so a 1-point move stopped triggering investigations.

The honest caveat

Bigger eval sets cost more (every trace is judge tokens), so there is a real tension between detection power and eval cost. The answer is not "make it huge", it is "size it to the smallest regression that would actually hurt, and no smaller." For us that was a few hundred; for a safety-critical check it might be thousands.

Open question

The power calc assumes i.i.d. traces, and production traffic is bursty, correlated, and drifting. I do not have a clean way to compute effective sample size for a correlated eval set, so I treat the "few hundred" as a floor and pad it. If you have done power analysis on correlated eval traffic properly, I would like to read how.