DEV Community

Aayush kumarsingh

Why comparing average scores is the wrong way to evaluate LLM prompts (and what to do instead)

Most teams compare prompts like this:

Prompt A average score: 6.8
Prompt B average score: 7.4

"B is better, ship it."

I used to do this too. Then I ran the numbers properly and realized I'd been making deployment decisions on statistical noise.

Here's what I learned about evaluating LLM prompts correctly, and the specific implementation I built.


The problem with averages on small datasets

LLM eval datasets are small. Most teams have 10-30 golden test cases. That's not enough data to make averages reliable.

Here's why. Imagine you score both prompts on 10 cases. Prompt B scores 0.6 points higher on average. Sounds like a win.

But with n=10, a difference of 0.6 points could easily happen by random chance — the model had a slightly better day, the test cases happened to favor B's phrasing, one outlier case pulled the average. You have no way to know without actually computing the probability.

This is the core problem: a difference is not the same as a statistically significant difference.
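You can see this directly with a small shuffle experiment (the scores below are made up, shaped like typical judge scores). Pool 20 scores that all came from the same prompt, split them randomly into two groups of 10, and count how often the group means differ by 0.6 or more:

```python
import random

# Made-up pool: 20 scores that all came from the SAME prompt
pool = [9, 9, 8, 9, 3, 8, 9, 7, 9, 2] * 2
rng = random.Random(0)

trials = 2000
big_gaps = 0
for _ in range(trials):
    rng.shuffle(pool)
    a, b = pool[:10], pool[10:]
    # mean gap >= 0.6 points  <=>  total gap >= 6 points (exact integer check)
    if abs(sum(a) - sum(b)) >= 6:
        big_gaps += 1

print(f"Gaps of 0.6+ points from pure noise: {big_gaps / trials:.0%} of splits")
```

On this pool, a 0.6-point gap shows up in well over half of the random splits, so observing one tells you almost nothing at n=10.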


Why the t-test is the wrong fix

The standard answer to "I need statistical significance" is the t-test. But the t-test carries an assumption most people skip over: it assumes your data is normally distributed.

LLM evaluation scores don't. They look more like this:

Scores: [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]

Bimodal — most responses are good, a few completely fail. The distribution has a long left tail. A t-test on this data gives you misleading p-values because the normality assumption is violated.


Mann-Whitney U: the right tool

Mann-Whitney U is a non-parametric test — it makes no assumptions about the distribution of your data. Instead of comparing means, it compares ranks.

For every pair of scores (one from prompt A, one from prompt B), it asks: which one is higher? The test statistic counts how often A beats B and how often B beats A. From this it computes a p-value.
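To make the pair-counting concrete, here is the statistic by brute force on two tiny made-up lists, with ties counting as half a win:

```python
a = [6, 7, 8]
b = [7, 9, 9]

# 3 x 3 = 9 pairs: A wins (8,7), ties (7,7), loses the other seven
u_a = sum(1 if x > y else 0.5 if x == y else 0 for x in a for y in b)
print(u_a)  # 1.5 out of a possible 9.0
```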

Pure Python implementation (no scipy, no numpy):

import math

def mann_whitney_u(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Returns p-value for the null hypothesis that A and B are equal.
    p < 0.05 means the difference is statistically significant.
    """
    n1, n2 = len(scores_a), len(scores_b)
    if n1 == 0 or n2 == 0:
        return 1.0

    # Count how often A beats B
    u1 = sum(
        1 if x > y else 0.5 if x == y else 0
        for x in scores_a
        for y in scores_b
    )

    # Normal approximation (no tie correction, so with heavy ties the
    # p-value is slightly conservative)
    mu    = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

    if sigma == 0:
        return 1.0

    z       = (u1 - mu) / sigma
    p_value = 2 * (1 - _normal_cdf(abs(z)))
    return round(max(0.001, min(1.0, p_value)), 4)

def _normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

I validated this implementation against scipy's version on 20 different test vectors. Matches to 3 decimal places.


Statistical significance alone is not enough

Here's the trap people fall into after discovering p-values: a statistically significant result is not necessarily a meaningful result.

With enough data, even a 0.1 point improvement becomes statistically significant. But is a 0.1 point difference worth changing your prompt over? Probably not.

This is where effect size comes in. Cohen's d measures how large the difference is in practical terms, not just whether it's real.

import math
import statistics

def cohens_d(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Effect size. Interpretation:
    d < 0.2  → negligible (not worth acting on)
    d < 0.5  → small
    d < 0.8  → medium
    d >= 0.8 → large
    """
    if len(scores_a) < 2 or len(scores_b) < 2:
        return 0.0

    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    var_a  = statistics.variance(scores_a)
    var_b  = statistics.variance(scores_b)
    pooled = math.sqrt((var_a + var_b) / 2)

    return round(abs(mean_b - mean_a) / pooled, 3) if pooled > 0 else 0.0

The complete decision requires both:

  • p < 0.05 → the difference is statistically real
  • Cohen's d >= 0.5 → the difference is practically meaningful

Both conditions. Not just one.
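To see the trap in numbers, consider a hypothetical case: prompt B scores a constant 0.1 points above prompt A across 120 test cases. At that sample size a rank test would call the shift significant, but computing Cohen's d inline (same pooled-variance formula as above) flags it as negligible:

```python
import math
import statistics

# Hypothetical: B beats A by a constant 0.1 points on 120 cases each
scores_a = [6.0, 7.0, 8.0] * 40
scores_b = [6.1, 7.1, 8.1] * 40

mean_gap  = abs(statistics.mean(scores_b) - statistics.mean(scores_a))
pooled_sd = math.sqrt(
    (statistics.variance(scores_a) + statistics.variance(scores_b)) / 2
)
d = mean_gap / pooled_sd

print(round(d, 3))  # ~0.122: negligible, whatever the p-value says
```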

Bootstrap confidence intervals: showing uncertainty honestly

Even with significance and effect size, a point estimate like "74% pass rate" hides how uncertain you are. 74% from 10 cases is much less reliable than 74% from 100 cases.

Bootstrap confidence intervals make the uncertainty visible:

import random
import statistics

def bootstrap_ci(
    values:     list[float],
    n_samples:  int   = 2000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """
    95% confidence interval for the mean.
    Uses percentile method — no distribution assumptions.
    Deterministic: same input always gives same output.
    """
    if len(values) < 2:
        m = statistics.mean(values) if values else 0.0
        return (m, m)

    rng        = random.Random(42)  # deterministic
    boot_means = []

    for _ in range(n_samples):
        sample = [rng.choice(values) for _ in range(len(values))]
        boot_means.append(statistics.mean(sample))

    boot_means.sort()
    alpha     = 1 - confidence
    lower_idx = int(alpha / 2 * n_samples)
    upper_idx = min(int((1 - alpha / 2) * n_samples), n_samples - 1)  # clamp to a valid index

    return (
        round(boot_means[lower_idx], 4),
        round(boot_means[upper_idx], 4),
    )

Now instead of reporting "Prompt B: 74% pass rate" you can report:

Prompt B: 74% ± 8% pass rate (95% CI)

The ± 8% is honest. It tells the person reading the result exactly how confident they should be. With a wide CI, the right answer is "get more test cases before deciding."
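A minimal sketch of that effect, with made-up pass/fail data and an inline percentile bootstrap (ci_width is a hypothetical helper, not one of the functions above): the same 70% pass rate measured on 10 cases versus 100 cases produces very different interval widths.

```python
import random
import statistics

def ci_width(values: list[float], n_samples: int = 2000) -> float:
    """Width (upper - lower) of a 95% percentile-bootstrap CI for the mean."""
    rng = random.Random(42)
    means = sorted(
        statistics.mean([rng.choice(values) for _ in range(len(values))])
        for _ in range(n_samples)
    )
    return means[int(0.975 * n_samples)] - means[int(0.025 * n_samples)]

small = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 10 pass/fail results, 70% pass
large = small * 10                       # the same 70% rate from 100 cases

print(round(ci_width(small), 2))  # wide interval
print(round(ci_width(large), 2))  # roughly 3x narrower
```

Same point estimate, very different certainty; only the second one justifies a deploy decision.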


Putting it all together

Here's how I combine these three techniques into a complete A/B test result:

def run_ab_test(
    scores_a:  list[float],
    scores_b:  list[float],
    threshold: float = 7.0,
) -> dict:
    """
    Complete A/B test with significance, effect size, and confidence intervals.
    All standard library — no external dependencies.
    """
    avg_a  = statistics.mean(scores_a)
    avg_b  = statistics.mean(scores_b)
    p_val  = mann_whitney_u(scores_a, scores_b)
    d      = cohens_d(scores_a, scores_b)
    ci_a   = bootstrap_ci(scores_a)
    ci_b   = bootstrap_ci(scores_b)

    # Pass rate (proportion scoring above threshold)
    pr_a = sum(1 for s in scores_a if s >= threshold) / len(scores_a)
    pr_b = sum(1 for s in scores_b if s >= threshold) / len(scores_b)

    significant  = p_val < 0.05
    meaningful   = d >= 0.5
    small_sample = min(len(scores_a), len(scores_b)) < 20

    # Decision logic
    if not significant:
        recommendation = f"No significant difference (p={p_val:.3f}). Keep prompt A."
    elif not meaningful:
        recommendation = f"Significant but negligible effect (d={d:.2f}). Not worth switching."
    else:
        winner = "B" if avg_b > avg_a else "A"
        recommendation = f"Prompt {winner} is better (p={p_val:.3f}, d={d:.2f}). Safe to deploy."

    return {
        "prompt_a": {
            "avg_score": round(avg_a, 2),
            "pass_rate": round(pr_a, 3),
            "ci_95":     ci_a,
            "n":         len(scores_a),
        },
        "prompt_b": {
            "avg_score":    round(avg_b, 2),
            "pass_rate":    round(pr_b, 3),
            "ci_95":        ci_b,
            "pass_rate_fmt": f"{pr_b:.0%} ± {round((ci_b[1]-ci_b[0])/2*100)}%",
            "n":            len(scores_b),
        },
        "p_value":        p_val,
        "is_significant": significant,
        "effect_size":    d,
        "small_sample":   small_sample,
        "recommendation": recommendation,
    }

Example output:

scores_a = [6, 7, 8, 6, 9, 7, 6, 8, 7, 6]
scores_b = [8, 9, 8, 9, 7, 9, 8, 9, 8, 9]

result = run_ab_test(scores_a, scores_b)

# Abridged output (CI fields omitted, p-value shown to 3 decimal places):
# {
#   "prompt_a": {"avg_score": 7.0, "pass_rate": 0.6, "n": 10, ...},
#   "prompt_b": {"avg_score": 8.4, "pass_rate": 1.0, "n": 10, ...},
#   "p_value": 0.008,
#   "is_significant": True,
#   "effect_size": 1.565,
#   "small_sample": True,
#   "recommendation": "Prompt B is better (p=0.008, d=1.56). Safe to deploy."
# }

When to use each technique

Just starting out, < 10 test cases:
Don't run statistical tests yet. Collect more cases. Report raw scores with a note that sample size is too small for conclusions.

10-20 test cases:
Run Mann-Whitney U + Cohen's d. Show confidence intervals but warn that they're wide. The result is directional, not definitive.

20+ test cases:
Full analysis. If p < 0.05 and d >= 0.5, you have a real result you can act on.

A key principle: a wide confidence interval is useful information, not a failure. It tells you exactly how much more data you need.


Where I use this in practice

I built this into TraceMind — an open source LLM monitoring tool I've been working on. The A/B testing endpoint runs this exact implementation against your golden dataset and returns the full statistical picture.

The whole thing is pure Python stdlib — math, statistics, random. No scipy, no numpy. It runs anywhere Python runs and I can validate it against reference implementations easily.

If you want to use any of these functions, they're MIT licensed and self-contained. Copy them directly.


Summary

Three techniques, all standard library, all together:

  1. Mann-Whitney U instead of t-test — handles non-normal LLM score distributions correctly
  2. Cohen's d alongside p-value — separates statistical significance from practical significance
  3. Bootstrap CI — shows uncertainty honestly so you know when to collect more data

The common mistake is optimizing the wrong thing — making the p-value small when you should be asking whether the difference is worth acting on. Both questions matter.

What does your current prompt evaluation process look like? Curious what other people are using.
