Diven Rastdus

Originally published at astraedus.dev

Your AI Agent Evaluation Is Lying to You: Why 10 Test Runs Prove Nothing

I ran 10 games between two AI agents. Agent v3 went 5-5 against Agent v1. I reported "v3 ties v1, no measurable improvement, don't merge."

That conclusion was wrong. Not because v3 was secretly better or worse, but because 10 games told me almost nothing at all.

Here's the math I should have done first.

The win-rate trap

The obvious metric for comparing two agents is win rate. Agent A beats Agent B 50% of the time? They're even. 70%? A is better. Simple.

Except win rate has a confidence interval, and at small N that interval is enormous.

The Wilson score interval gives a reasonable bound for binary outcomes:

import math

def wilson_interval(wins, total, z=1.96):
    """95% confidence interval for true win probability."""
    if total == 0:
        return (0.0, 1.0)
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    spread = z * math.sqrt((p * (1 - p) + z**2 / (4 * total)) / total) / denom
    return (center - spread, center + spread)

At 5 wins out of 10 games:

>>> wilson_interval(5, 10)
(0.236, 0.764)

The 95% confidence interval for the true win probability is [0.24, 0.76]. That range comfortably fits "Agent A is dominant" (76% win rate), "they're even" (50%), and "Agent B is dominant" (24%). You literally cannot tell them apart.

How many games do you need? For two agents where the true skill gap gives one a 60% win rate, you need roughly 100 games to shrink the CI enough to exclude 50%. For a 55% edge, you're looking at 400+.

# Minimum games to distinguish p_true from 0.5 at 95% confidence
def min_games(p_true, z=1.96):
    """Approximate sample size for Wilson CI to exclude 0.5."""
    delta = abs(p_true - 0.5)
    return int(math.ceil(z**2 * p_true * (1 - p_true) / delta**2))

>>> min_games(0.60)  # 60% true win rate
93
>>> min_games(0.55)  # 55% true win rate
381
>>> min_games(0.52)  # 52% true win rate
2398

Most agent improvements are in the 52-58% range against the prior version. You need hundreds of games, not ten.
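To make that concrete, plug a 55% observed rate over 400 games back into wilson_interval (output rounded):

>>> wilson_interval(220, 400)  # 55% observed win rate, 400 games
(0.501, 0.598)

The lower bound finally clears 0.5 -- right in line with what min_games(0.55) predicted.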

TrueSkill makes the same mistake look different

If you're running a multi-agent ladder (like I am for a Kaggle competition), you're probably using TrueSkill or Elo instead of raw win rate. These feel more sophisticated. They give you a single number -- the mu rating -- and you compare it across agents.

But TrueSkill also tracks sigma, the uncertainty in that rating. And at low game counts, sigma is so large that the ratings are meaningless.

Here's my actual ladder setup, mirroring Kaggle's scoring:

import trueskill

env = trueskill.TrueSkill(
    mu=600.0,           # Kaggle's initial rating
    sigma=200.0,        # starts extremely uncertain
    draw_probability=0.05
)

After 10 games, a typical agent might show mu=640, sigma=36. That looks precise. It's not. The 95% confidence interval on the true skill is [mu - 2*sigma, mu + 2*sigma] = [568, 712].

When I compared v1 (mu=640, sigma=36) against v3 (mu=560, sigma=36), the intervals were [568, 712] and [488, 632]. They overlap by 64 points. I could not distinguish these agents. But the mu gap (80 points) looked meaningful on a leaderboard.

The fix is to check sigma before drawing conclusions:

def ratings_are_distinguishable(rating_a, rating_b, z_critical=1.96):
    """Check if two TrueSkill ratings are statistically distinguishable."""
    mu_diff = abs(rating_a.mu - rating_b.mu)
    combined_uncertainty = math.sqrt(rating_a.sigma**2 + rating_b.sigma**2)
    # z-score for the rating difference
    z = mu_diff / combined_uncertainty
    # z_critical=1.96 corresponds to 95% confidence
    return z > z_critical

# After 10 games: NOT distinguishable
>>> ratings_are_distinguishable(
...     env.create_rating(mu=640, sigma=36),
...     env.create_rating(mu=560, sigma=36)
... )
False

# After 200 games (sigma ~8): distinguishable if gap is real
>>> ratings_are_distinguishable(
...     env.create_rating(mu=640, sigma=8),
...     env.create_rating(mu=560, sigma=8)
... )
True

The fix: three rules

After burning a day on a wrong conclusion, I now follow three rules for agent evaluation.

Rule 1: Persist ratings across runs. Every ladder session starting from sigma=200 wastes all prior information. Save ratings to disk and load them on the next run:

import json
from pathlib import Path

RATINGS_PATH = Path("runs/ratings.json")

def load_ratings(env):
    """Load persisted TrueSkill ratings, or return empty dict."""
    if RATINGS_PATH.exists():
        data = json.loads(RATINGS_PATH.read_text())
        return {
            name: env.create_rating(mu=r["mu"], sigma=r["sigma"])
            for name, r in data.items()
        }
    return {}

def save_ratings(ratings):
    """Persist current ratings to disk."""
    RATINGS_PATH.parent.mkdir(parents=True, exist_ok=True)
    data = {
        name: {"mu": r.mu, "sigma": r.sigma}
        for name, r in ratings.items()
    }
    RATINGS_PATH.write_text(json.dumps(data, indent=2))

Now each run adds information instead of starting from scratch. Sigma actually converges.
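For context, here's a minimal sketch of how one ladder session might wire these helpers together. pick_pairing and play_game are placeholders for whatever matchmaking and game runner you actually have:

def run_session(env, agents, n_games=50):
    """One ladder session: load persisted ratings, play, update, save."""
    ratings = load_ratings(env)
    for name in agents:
        # new agents start at the env defaults (mu=600, sigma=200)
        ratings.setdefault(name, env.create_rating())

    for _ in range(n_games):
        a, b = pick_pairing(agents)   # placeholder: your matchmaking
        winner = play_game(a, b)      # placeholder: your game runner
        ranks = [0, 1] if winner == a else [1, 0]
        (ratings[a],), (ratings[b],) = env.rate(
            [(ratings[a],), (ratings[b],)], ranks=ranks
        )

    save_ratings(ratings)
    return ratings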

Rule 2: Gate decisions on a sigma threshold. Don't compare agents until both have sigma well below the rating gap you care about. For my competition, that's sigma < 15:

def is_converged(rating, sigma_threshold=15.0):
    return rating.sigma < sigma_threshold

# Before comparing v1 and v3:
if not (is_converged(ratings["v1"]) and is_converged(ratings["v3"])):
    # estimate_games_to_converge is a ladder-specific helper (rough sketch below)
    games_needed = estimate_games_to_converge(ratings)
    print(f"Need ~{games_needed} more games before comparison is valid")
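That estimate_games_to_converge helper is mine, not part of the trueskill library. A very crude sketch: treat each game as one noisy observation of skill so that precisions (1/sigma^2) roughly add, and assume beta=100 (the usual sigma/2 convention at this rating scale). Real convergence depends on opponents and outcomes, so read the result as a ballpark, not a promise:

def estimate_games_to_converge(ratings, sigma_target=15.0, beta=100.0):
    """Rough ballpark of additional games needed by the least-converged agent."""
    worst = max(r.sigma for r in ratings.values())
    if worst <= sigma_target:
        return 0
    # assume 1/sigma^2 grows by roughly 1/(2*beta^2) per game played
    return math.ceil(2 * beta**2 * (1 / sigma_target**2 - 1 / worst**2))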

Rule 3: Report intervals, not point estimates. Never say "v3 has mu=560." Say "v3 has mu=560 +/- 72 (95% CI)." The interval is the answer. The point estimate is decoration.

def format_rating(name, rating):
    ci = 2 * rating.sigma
    return f"{name}: {rating.mu:.0f} +/- {ci:.0f} (sigma={rating.sigma:.1f})"

# "v3: 560 +/- 72 (sigma=36.0)"   -- don't trust this
# "v3: 560 +/- 16 (sigma=8.0)"    -- now we're talking

What this actually looks like in practice

I'm building game AI agents for a Kaggle competition. My ladder now persists ratings across sessions and prints a convergence status alongside every ranking:

Agent           |  mu   | sigma | 95% CI        | Games | Converged
v22_timeline    |  907  |  11.2 | [885, 930]    |   142 | Yes
v21_capture     |  842  |  14.8 | [812, 871]    |    89 | Yes
romantamrazov   |  823  |  16.1 | [791, 855]    |    72 | BORDERLINE
v19_lp          |  798  |  18.3 | [761, 835]    |    51 | No

The "Converged" column is the gate. I don't merge a new agent variant until its sigma is below 15 and the CI doesn't overlap with the agent it's trying to beat. This costs more compute upfront (running 100+ games instead of 10) but saves me from merging regressions and spending days debugging phantom improvements.

The deeper problem

This isn't just a statistics issue. It's a workflow issue. When you run 10 tests, get a number, and make a decision, you feel like you evaluated something. The ritual of "run tests, look at results, decide" creates false confidence even when the test itself had zero statistical power.

The fix is mechanical: compute the confidence interval, display it, and refuse to decide when it's too wide. Make the uncertainty impossible to ignore. If your evaluation pipeline doesn't show you how uncertain it is, it's not an evaluation pipeline. It's a random number generator with a nice UI.
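If you're on raw win rates rather than TrueSkill, the same discipline is a few lines on top of wilson_interval. A sketch -- the verdict labels are my own:

def verdict(wins, total):
    """Return a decision only when the 95% interval actually supports one."""
    low, high = wilson_interval(wins, total)
    if low > 0.5:
        return f"BETTER ({low:.3f}-{high:.3f})"
    if high < 0.5:
        return f"WORSE ({low:.3f}-{high:.3f})"
    return f"INCONCLUSIVE ({low:.3f}-{high:.3f}) -- keep playing"

# verdict(5, 10)    -> 'INCONCLUSIVE (0.237-0.763) -- keep playing'
# verdict(220, 400) -> 'BETTER (0.501-0.598)'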


I build AI systems and compete in Kaggle's Orbit Wars competition. I write about the real problems I hit -- the kind that don't show up in tutorials. More at astraedus.dev.
