DEV Community

Aayush kumarsingh

Why comparing average scores is the wrong way to evaluate LLM prompts (and what to do instead)

Most teams compare prompts like this:

Prompt A average score: 6.8
Prompt B average score: 7.4

"B is better, ship it."

I used to do this too. Then I ran the numbers properly and realized I'd been making deployment decisions on statistical noise.

Here's what I learned about evaluating LLM prompts correctly, and the specific implementation I built.


The problem with averages on small datasets

LLM eval datasets are small. Most teams have 10-30 golden test cases. That's not enough data to make averages reliable.

Here's why. Imagine you score both prompts on 10 cases. Prompt B scores 0.6 points higher on average. Sounds like a win.

But with n=10, a difference of 0.6 points could easily happen by random chance — the model had a slightly better day, the test cases happened to favor B's phrasing, one outlier case pulled the average. You have no way to know without actually computing the probability.

This is the core problem: a difference is not the same as a statistically significant difference.
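You can see this directly with a small shuffle experiment (the scores below are made up, shaped like typical judge scores). Pool 20 scores that all came from the same prompt, split them randomly into two groups of 10, and count how often the group means differ by 0.6 or more:

```python
import random

# Made-up pool: 20 scores that all came from the SAME prompt
pool = [9, 9, 8, 9, 3, 8, 9, 7, 9, 2] * 2
rng = random.Random(0)

trials = 2000
big_gaps = 0
for _ in range(trials):
    rng.shuffle(pool)
    a, b = pool[:10], pool[10:]
    # mean gap >= 0.6 points  <=>  total gap >= 6 points (exact integer check)
    if abs(sum(a) - sum(b)) >= 6:
        big_gaps += 1

print(f"Gaps of 0.6+ points from pure noise: {big_gaps / trials:.0%} of splits")
```

On this pool, a 0.6-point gap shows up in well over half of the random splits, so observing one tells you almost nothing at n=10.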


Why the t-test is the wrong fix

The standard answer to "I need statistical significance" is the t-test. But the t-test carries an assumption most people skip over: it assumes your data is normally distributed.

LLM evaluation scores don't. They look more like this:

Scores: [9, 9, 8, 9, 3, 8, 9, 7, 9, 2]

Bimodal — most responses are good, a few completely fail. The distribution has a long left tail. A t-test on this data gives you misleading p-values because the normality assumption is violated.


Mann-Whitney U: the right tool

Mann-Whitney U is a non-parametric test — it makes no assumptions about the distribution of your data. Instead of comparing means, it compares ranks.

For every pair of scores (one from prompt A, one from prompt B), it asks: which one is higher? The test statistic counts how often A beats B and how often B beats A. From this it computes a p-value.
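To make the pair-counting concrete, here is the statistic by brute force on two tiny made-up lists, with ties counting as half a win:

```python
a = [6, 7, 8]
b = [7, 9, 9]

# 3 x 3 = 9 pairs: A wins (8,7), ties (7,7), loses the other seven
u_a = sum(1 if x > y else 0.5 if x == y else 0 for x in a for y in b)
print(u_a)  # 1.5 out of a possible 9.0
```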

Pure Python implementation (no scipy, no numpy):

import math

def mann_whitney_u(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Returns p-value for the null hypothesis that A and B are equal.
    p < 0.05 means the difference is statistically significant.
    """
    n1, n2 = len(scores_a), len(scores_b)
    if n1 == 0 or n2 == 0:
        return 1.0

    # Count how often A beats B
    u1 = sum(
        1 if x > y else 0.5 if x == y else 0
        for x in scores_a
        for y in scores_b
    )

    # Normal approximation (no tie correction, so with heavy ties the
    # p-value is slightly conservative)
    mu    = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

    if sigma == 0:
        return 1.0

    z       = (u1 - mu) / sigma
    p_value = 2 * (1 - _normal_cdf(abs(z)))
    return round(max(0.001, min(1.0, p_value)), 4)

def _normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

I validated this implementation against scipy's version on 20 different test vectors. Matches to 3 decimal places.


Statistical significance alone is not enough

Here's the trap people fall into after discovering p-values: a statistically significant result is not necessarily a meaningful result.

With enough data, even a 0.1 point improvement becomes statistically significant. But is a 0.1 point difference worth changing your prompt over? Probably not.

This is where effect size comes in. Cohen's d measures how large the difference is in practical terms, not just whether it's real.

import math
import statistics

def cohens_d(scores_a: list[float], scores_b: list[float]) -> float:
    """
    Effect size. Interpretation:
    d < 0.2  → negligible (not worth acting on)
    d < 0.5  → small
    d < 0.8  → medium
    d >= 0.8 → large
    """
    if len(scores_a) < 2 or len(scores_b) < 2:
        return 0.0

    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    var_a  = statistics.variance(scores_a)
    var_b  = statistics.variance(scores_b)
    pooled = math.sqrt((var_a + var_b) / 2)

    return round(abs(mean_b - mean_a) / pooled, 3) if pooled > 0 else 0.0

The complete decision requires both:

  • p < 0.05 → the difference is statistically real
  • Cohen's d >= 0.5 → the difference is practically meaningful

Both conditions. Not just one.
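To see the trap in numbers, consider a hypothetical case: prompt B scores a constant 0.1 points above prompt A across 120 test cases. At that sample size a rank test would call the shift significant, but computing Cohen's d inline (same pooled-variance formula as above) flags it as negligible:

```python
import math
import statistics

# Hypothetical: B beats A by a constant 0.1 points on 120 cases each
scores_a = [6.0, 7.0, 8.0] * 40
scores_b = [6.1, 7.1, 8.1] * 40

mean_gap  = abs(statistics.mean(scores_b) - statistics.mean(scores_a))
pooled_sd = math.sqrt(
    (statistics.variance(scores_a) + statistics.variance(scores_b)) / 2
)
d = mean_gap / pooled_sd

print(round(d, 3))  # ~0.122: negligible, whatever the p-value says
```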

Bootstrap confidence intervals: showing uncertainty honestly

Even with significance and effect size, a point estimate like "74% pass rate" hides how uncertain you are. 74% from 10 cases is much less reliable than 74% from 100 cases.

Bootstrap confidence intervals make the uncertainty visible:

import random
import statistics

def bootstrap_ci(
    values:     list[float],
    n_samples:  int   = 2000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    """
    95% confidence interval for the mean.
    Uses percentile method — no distribution assumptions.
    Deterministic: same input always gives same output.
    """
    if len(values) < 2:
        m = statistics.mean(values) if values else 0.0
        return (m, m)

    rng        = random.Random(42)  # deterministic
    boot_means = []

    for _ in range(n_samples):
        sample = [rng.choice(values) for _ in range(len(values))]
        boot_means.append(statistics.mean(sample))

    boot_means.sort()
    alpha     = 1 - confidence
    lower_idx = int(alpha / 2 * n_samples)
    upper_idx = min(int((1 - alpha / 2) * n_samples), n_samples - 1)  # clamp to a valid index

    return (
        round(boot_means[lower_idx], 4),
        round(boot_means[upper_idx], 4),
    )

Now instead of reporting "Prompt B: 74% pass rate" you can report:

Prompt B: 74% ± 8% pass rate (95% CI)

The ± 8% is honest. It tells the person reading the result exactly how confident they should be. With a wide CI, the right answer is "get more test cases before deciding."
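A minimal sketch of that effect, with made-up pass/fail data and an inline percentile bootstrap (ci_width is a hypothetical helper, not one of the functions above): the same 70% pass rate measured on 10 cases versus 100 cases produces very different interval widths.

```python
import random
import statistics

def ci_width(values: list[float], n_samples: int = 2000) -> float:
    """Width (upper - lower) of a 95% percentile-bootstrap CI for the mean."""
    rng = random.Random(42)
    means = sorted(
        statistics.mean([rng.choice(values) for _ in range(len(values))])
        for _ in range(n_samples)
    )
    return means[int(0.975 * n_samples)] - means[int(0.025 * n_samples)]

small = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 10 pass/fail results, 70% pass
large = small * 10                       # the same 70% rate from 100 cases

print(round(ci_width(small), 2))  # wide interval
print(round(ci_width(large), 2))  # roughly 3x narrower
```

Same point estimate, very different certainty; only the second one justifies a deploy decision.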


Putting it all together

Here's how I combine these three techniques into a complete A/B test result:

def run_ab_test(
    scores_a:  list[float],
    scores_b:  list[float],
    threshold: float = 7.0,
) -> dict:
    """
    Complete A/B test with significance, effect size, and confidence intervals.
    All standard library — no external dependencies.
    """
    avg_a  = statistics.mean(scores_a)
    avg_b  = statistics.mean(scores_b)
    p_val  = mann_whitney_u(scores_a, scores_b)
    d      = cohens_d(scores_a, scores_b)
    ci_a   = bootstrap_ci(scores_a)
    ci_b   = bootstrap_ci(scores_b)

    # Pass rate (proportion scoring above threshold)
    pr_a = sum(1 for s in scores_a if s >= threshold) / len(scores_a)
    pr_b = sum(1 for s in scores_b if s >= threshold) / len(scores_b)

    significant  = p_val < 0.05
    meaningful   = d >= 0.5
    small_sample = min(len(scores_a), len(scores_b)) < 20

    # Decision logic
    if not significant:
        recommendation = f"No significant difference (p={p_val:.3f}). Keep prompt A."
    elif not meaningful:
        recommendation = f"Significant but negligible effect (d={d:.2f}). Not worth switching."
    else:
        winner = "B" if avg_b > avg_a else "A"
        recommendation = f"Prompt {winner} is better (p={p_val:.3f}, d={d:.2f}). Safe to deploy."

    return {
        "prompt_a": {
            "avg_score": round(avg_a, 2),
            "pass_rate": round(pr_a, 3),
            "ci_95":     ci_a,
            "n":         len(scores_a),
        },
        "prompt_b": {
            "avg_score":    round(avg_b, 2),
            "pass_rate":    round(pr_b, 3),
            "ci_95":        ci_b,
            "pass_rate_fmt": f"{pr_b:.0%} ± {round((ci_b[1]-ci_b[0])/2*100)}%",
            "n":            len(scores_b),
        },
        "p_value":        p_val,
        "is_significant": significant,
        "effect_size":    d,
        "small_sample":   small_sample,
        "recommendation": recommendation,
    }

Example output:

scores_a = [6, 7, 8, 6, 9, 7, 6, 8, 7, 6]
scores_b = [8, 9, 8, 9, 7, 9, 8, 9, 8, 9]

result = run_ab_test(scores_a, scores_b)

# Abridged output (CI fields omitted, p-value shown to 3 decimal places):
# {
#   "prompt_a": {"avg_score": 7.0, "pass_rate": 0.6, "n": 10, ...},
#   "prompt_b": {"avg_score": 8.4, "pass_rate": 1.0, "n": 10, ...},
#   "p_value": 0.008,
#   "is_significant": True,
#   "effect_size": 1.565,
#   "small_sample": True,
#   "recommendation": "Prompt B is better (p=0.008, d=1.56). Safe to deploy."
# }

When to use each technique

Just starting out, < 10 test cases:
Don't run statistical tests yet. Collect more cases. Report raw scores with a note that sample size is too small for conclusions.

10-20 test cases:
Run Mann-Whitney U + Cohen's d. Show confidence intervals but warn that they're wide. The result is directional, not definitive.

20+ test cases:
Full analysis. If p < 0.05 and d >= 0.5, you have a real result you can act on.

A key principle: a wide confidence interval is useful information, not a failure. It tells you exactly how much more data you need.


Where I use this in practice

I built this into TraceMind — an open source LLM monitoring tool I've been working on. The A/B testing endpoint runs this exact implementation against your golden dataset and returns the full statistical picture.

The whole thing is pure Python stdlib — math, statistics, random. No scipy, no numpy. It runs anywhere Python runs and I can validate it against reference implementations easily.

If you want to use any of these functions, they're MIT licensed and self-contained. Copy them directly.


Summary

Three techniques, all standard library, all together:

  1. Mann-Whitney U instead of t-test — handles non-normal LLM score distributions correctly
  2. Cohen's d alongside p-value — separates statistical significance from practical significance
  3. Bootstrap CI — shows uncertainty honestly so you know when to collect more data

The common mistake is optimizing the wrong thing — making the p-value small when you should be asking whether the difference is worth acting on. Both questions matter.

What does your current prompt evaluation process look like? Curious what other people are using.
