Most teams compare prompts like this:
Prompt A average score: 6.8
Prompt B average score: 7.4
"B is better, ship it."
I used to do this too. Then I ran the numbers properly and realized I'd been making deployment decisions on statistical noise.
Here's what I learned about evaluating LLM prompts correctly, and the specific implementation I built.
The problem with averages on small datasets
LLM eval datasets are small. Most teams have 10-30 golden test cases. That's not enough data to make averages reliable.
Here's why. Imagine you score both prompts on 10 cases. Prompt B scores 0.6 points higher on average. Sounds like a win.
But with n=10, a difference of 0.6 points could easily happen by random chance — the model had a slightly better day, the test cases happened to favor B's phrasing, one outlier case pulled the average. You have no way to know without actually computing the probability.
This is the core problem: a difference is not the same as a statistically significant difference.
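You can check this with a quick simulation (a sketch with made-up scores, not data from a real eval): score two copies of the same prompt on 10 cases each and see how often their averages differ by 0.6 or more purely by chance.

```python
import random
import statistics

# Hypothetical score distribution for one prompt; both "prompts" draw from it,
# so any gap between their averages is pure noise.
population = [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]

rng = random.Random(0)
trials = 10_000
noise_gaps = 0
for _ in range(trials):
    a = [rng.choice(population) for _ in range(10)]
    b = [rng.choice(population) for _ in range(10)]
    if abs(statistics.mean(a) - statistics.mean(b)) >= 0.6:
        noise_gaps += 1

print(f"{noise_gaps / trials:.0%} of identical-prompt comparisons show a gap of 0.6 or more")
# With scores this spread out, that share is large -- far too common to treat as a win.
```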
Why the t-test is the wrong fix
The standard answer to "I need statistical significance" is the t-test. But the t-test has an assumption that most people skip over: it assumes your data follows a normal distribution.
LLM evaluation scores don't. They look more like this:
Scores: [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]
Bimodal — most responses are good, a few completely fail. The distribution has a long left tail. A t-test on this data gives you misleading p-values because the normality assumption is violated.
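You can see the skew without plotting anything: on a sample like this, the mean and the median tell very different stories.

```python
import statistics

scores = [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]
print(statistics.mean(scores))    # 7.3 -- dragged down by the two failures
print(statistics.median(scores))  # 8.5 -- where the bulk of the responses actually sit
```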
Mann-Whitney U: the right tool
Mann-Whitney U is a non-parametric test — it makes no assumptions about the distribution of your data. Instead of comparing means, it compares ranks.
For every pair of scores (one from prompt A, one from prompt B), it asks: which one is higher? The test statistic counts how often A beats B and how often B beats A. From this it computes a p-value.
Pure Python implementation (no scipy, no numpy):
```python
import math


def mann_whitney_u(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Returns the p-value for the null hypothesis that A and B are equal.
    p < 0.05 means the difference is statistically significant.
    """
    n1, n2 = len(scores_a), len(scores_b)
    if n1 == 0 or n2 == 0:
        return 1.0

    # Count how often A beats B (ties count as half a win)
    u1 = sum(
        1 if x > y else 0.5 if x == y else 0
        for x in scores_a
        for y in scores_b
    )

    # Normal approximation
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    if sigma == 0:
        return 1.0

    z = (u1 - mu) / sigma
    p_value = 2 * (1 - _normal_cdf(abs(z)))
    return round(max(0.001, min(1.0, p_value)), 4)


def _normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))
```
I validated this implementation against scipy's version on 20 different test vectors. Matches to 3 decimal places.
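Usage is a single call. Here's a quick sketch with made-up scores where every response from B outranks every response from A:

```python
scores_a = [3, 4, 3, 5, 4, 3, 4, 5, 3, 4]
scores_b = [8, 9, 8, 9, 9, 8, 7, 9, 8, 9]

p = mann_whitney_u(scores_a, scores_b)
print(p)         # 0.001 (the implementation floors p-values at 0.001)
print(p < 0.05)  # True: the rank difference is very unlikely to be chance
```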
Statistical significance alone is not enough
Here's the trap people fall into after discovering p-values: a statistically significant result is not necessarily a meaningful result.
With enough data, even a 0.1 point improvement becomes statistically significant. But is a 0.1 point difference worth changing your prompt over? Probably not.
This is where effect size comes in. Cohen's d measures how large the difference is in practical terms, not just whether it's real.
```python
import math
import statistics


def cohens_d(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Effect size. Interpretation:
      d < 0.2  → negligible (not worth acting on)
      d < 0.5  → small
      d < 0.8  → medium
      d >= 0.8 → large
    """
    if len(scores_a) < 2 or len(scores_b) < 2:
        return 0.0

    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    var_a = statistics.variance(scores_a)
    var_b = statistics.variance(scores_b)

    # Pooled standard deviation (simple average of the two sample variances)
    pooled = math.sqrt((var_a + var_b) / 2)
    return round(abs(mean_b - mean_a) / pooled, 3) if pooled > 0 else 0.0
```
The complete decision requires both:
- p < 0.05 → the difference is statistically real
- Cohen's d >= 0.5 → the difference is practically meaningful

Both conditions. Not just one.
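A minimal version of that combined check, built on the two functions above (the helper name is mine):

```python
def worth_switching(scores_a: list[float], scores_b: list[float]) -> bool:
    """Recommend a switch only when the difference is both real and meaningful."""
    p = mann_whitney_u(scores_a, scores_b)
    d = cohens_d(scores_a, scores_b)
    return p < 0.05 and d >= 0.5
```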
Bootstrap confidence intervals: showing uncertainty honestly
Even with significance and effect size, a point estimate like "74% pass rate" hides how uncertain you are. 74% from 10 cases is much less reliable than 74% from 100 cases.
Bootstrap confidence intervals make the uncertainty visible:
```python
import random
import statistics


def bootstrap_ci(
    values: list[float],
    n_samples: int = 2000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """
    95% confidence interval for the mean.
    Uses percentile method — no distribution assumptions.
    Deterministic: same input always gives same output.
    """
    if len(values) < 2:
        m = statistics.mean(values) if values else 0.0
        return (m, m)

    rng = random.Random(42)  # deterministic
    boot_means = []
    for _ in range(n_samples):
        # Resample with replacement, same size as the original dataset
        sample = [rng.choice(values) for _ in range(len(values))]
        boot_means.append(statistics.mean(sample))
    boot_means.sort()

    alpha = 1 - confidence
    lower_idx = int(alpha / 2 * n_samples)
    upper_idx = int((1 - alpha / 2) * n_samples)
    return (
        round(boot_means[lower_idx], 4),
        round(boot_means[upper_idx], 4),
    )
```
Now instead of reporting "Prompt B: 74% pass rate" you can report:
Prompt B: 74% ± 8% pass rate (95% CI)
The ± 8% is honest. It tells the person reading the result exactly how confident they should be. With a wide CI, the right answer is "get more test cases before deciding."
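To produce that kind of report, bootstrap the pass/fail outcomes directly. A sketch with hypothetical results (the exact bounds depend on the resamples):

```python
import statistics

# 1 = the response passed the rubric, 0 = it failed (hypothetical results)
passes = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

rate = statistics.mean(passes)
low, high = bootstrap_ci(passes)
print(f"pass rate: {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
# Something like "pass rate: 80% (95% CI 50% to 100%)" -- wide, because n is only 10.
```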
Putting it all together
Here's how I combine these three techniques into a complete A/B test result:
```python
import statistics

# Uses mann_whitney_u, cohens_d and bootstrap_ci defined above.

def run_ab_test(
    scores_a: list[float],
    scores_b: list[float],
    threshold: float = 7.0,
) -> dict:
    """
    Complete A/B test with significance, effect size, and confidence intervals.
    All standard library — no external dependencies.
    """
    avg_a = statistics.mean(scores_a)
    avg_b = statistics.mean(scores_b)

    p_val = mann_whitney_u(scores_a, scores_b)
    d = cohens_d(scores_a, scores_b)
    ci_a = bootstrap_ci(scores_a)
    ci_b = bootstrap_ci(scores_b)

    # Pass rate (proportion scoring at or above the threshold)
    pr_a = sum(1 for s in scores_a if s >= threshold) / len(scores_a)
    pr_b = sum(1 for s in scores_b if s >= threshold) / len(scores_b)
    # CI for the pass rate itself: bootstrap the pass/fail indicators, not the raw scores
    pass_ci_b = bootstrap_ci([1.0 if s >= threshold else 0.0 for s in scores_b])

    significant = p_val < 0.05
    meaningful = d >= 0.5
    small_sample = min(len(scores_a), len(scores_b)) < 20

    # Decision logic
    if not significant:
        recommendation = f"No significant difference (p={p_val:.3f}). Keep prompt A."
    elif not meaningful:
        recommendation = f"Significant but effect too small to matter (d={d:.2f}). Not worth switching."
    else:
        winner = "B" if avg_b > avg_a else "A"
        recommendation = f"Prompt {winner} is better (p={p_val:.3f}, d={d:.2f}). Safe to deploy."

    return {
        "prompt_a": {
            "avg_score": round(avg_a, 2),
            "pass_rate": round(pr_a, 3),
            "ci_95": ci_a,
            "n": len(scores_a),
        },
        "prompt_b": {
            "avg_score": round(avg_b, 2),
            "pass_rate": round(pr_b, 3),
            "ci_95": ci_b,
            "pass_rate_fmt": f"{pr_b:.0%} ± {round((pass_ci_b[1] - pass_ci_b[0]) / 2 * 100)}%",
            "n": len(scores_b),
        },
        "p_value": p_val,
        "is_significant": significant,
        "effect_size": d,
        "small_sample": small_sample,
        "recommendation": recommendation,
    }
```
Example output:
```python
scores_a = [6, 7, 8, 6, 9, 7, 6, 8, 7, 6]
scores_b = [8, 9, 8, 9, 7, 9, 8, 9, 8, 9]

result = run_ab_test(scores_a, scores_b)
# (values rounded for readability)
# {
#   "prompt_a": {"avg_score": 7.0, "pass_rate": 0.6, "ci_95": (6.4, 7.6), "n": 10},
#   "prompt_b": {"avg_score": 8.4, "pass_rate": 1.0, "pass_rate_fmt": "100% ± 0%", "n": 10},
#   "p_value": 0.008,
#   "is_significant": True,
#   "effect_size": 1.565,
#   "recommendation": "Prompt B is better (p=0.008, d=1.56). Safe to deploy."
# }
```
When to use each technique
Just starting out, < 10 test cases:
Don't run statistical tests yet. Collect more cases. Report raw scores with a note that sample size is too small for conclusions.
10-20 test cases:
Run Mann-Whitney U + Cohen's d. Show confidence intervals but warn that they're wide. The result is directional, not definitive.
20+ test cases:
Full analysis. If p < 0.05 and d >= 0.5, you have a real result you can act on.
A key principle: a wide confidence interval is useful information, not a failure. It tells you exactly how much more data you need.
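If you'd rather encode that policy than remember it, a tiny gate does the job (hypothetical helper, thresholds taken straight from the guidance above):

```python
def evidence_level(n_a: int, n_b: int) -> str:
    n = min(n_a, n_b)
    if n < 10:
        return "too few cases: collect more before running tests"
    if n < 20:
        return "directional only: run the tests, expect wide intervals"
    return "full analysis: act on p < 0.05 and d >= 0.5"
```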
Where I use this in practice
I built this into TraceMind — an open source LLM monitoring tool I've been working on. The A/B testing endpoint runs this exact implementation against your golden dataset and returns the full statistical picture.
The whole thing is pure Python stdlib — math, statistics, random. No scipy, no numpy. It runs anywhere Python runs and I can validate it against reference implementations easily.
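The validation itself can be a quick spot check in a scratch environment (a sketch; scipy stays a dev-only dependency, and I turn off its continuity correction so the two normal approximations are comparable):

```python
from scipy import stats  # dev environment only

a = [6, 7, 8, 6, 9, 7, 6, 8, 7, 6]
b = [8, 9, 8, 9, 7, 9, 8, 9, 8, 9]

_, p_ref = stats.mannwhitneyu(
    a, b, use_continuity=False, alternative="two-sided", method="asymptotic"
)
print(mann_whitney_u(a, b), round(p_ref, 4))  # the two values should agree closely
```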
If you want to use any of these functions, they're MIT licensed and self-contained. Copy them directly.
Summary
Three techniques, all standard library, all together:
- Mann-Whitney U instead of t-test — handles non-normal LLM score distributions correctly
- Cohen's d alongside p-value — separates statistical significance from practical significance
- Bootstrap CI — shows uncertainty honestly so you know when to collect more data

The common mistake is optimizing the wrong thing — making the p-value small when you should be asking whether the difference is worth acting on. Both questions matter.
What does your current prompt evaluation process look like? Curious what other people are using.