- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ran the eval. Pass rate is 87%. You ship.
Your colleague reruns the same suite, same model, same prompts, same dataset, ten minutes later. Pass rate is 81%. Then 84%. Then 89%.
Nothing changed. Your eval is non-deterministic and you didn't notice because the library you used printed one number with two decimal places and called it a metric. That number is whatever the LLM judge said on the first call. Ask it again and a fraction of verdicts flip.
The bug isn't in any single library. The bug is the shape. Most LLM-as-judge eval workflows, including the homemade for-loop in your repo, were designed around the deterministic eval contract from classical ML: one inference per case, one label, one number. LLMs broke that contract as soon as anyone wired one up to grade other models' outputs. Most teams haven't updated the contract.
The contract that no longer holds
A unit test asserts that add(2, 3) == 5. Same input, same output, every time. The assertion is a binary oracle. You can call it once.
An LLM judge is a sampling process over a probability distribution. Temperature 0 gets you closer to deterministic but not all the way there. Practitioners running the same prompt against the same hosted model at temperature 0 still see different completions across calls. The usual suspects are kernel-level non-determinism, model routing across replicas, and batch-size effects on floating-point math (see, for example, the Thinking Machines piece on defeating non-determinism in LLM inference). With temperature above 0 the variance is louder. The judge is a noisy classifier with an unknown agreement rate against ground truth and a hidden flip rate against itself.
When your eval library does this:
verdict = judge_llm(answer, rubric)
results.append(verdict)
…it is taking one sample from a distribution and treating it as the truth. That is the bug. Run the same suite twice and you get two different pass rates. Print the first one as a number on a dashboard and you have a metric that lies about its own precision.
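If you want to see the wobble without burning tokens, a toy simulation is enough. The flip rate and suite size here are made-up numbers for illustration, not measurements of any particular judge:

# flaky_suite_demo.py
# Toy simulation: a 200-case suite where each verdict is one noisy sample.
# The judge agrees with the true label most of the time but flips with
# probability FLIP_RATE on any given call. Both numbers are invented.
import random

N_CASES = 200
TRUE_PASS_RATE = 0.85  # assumed ground-truth quality of the answers
FLIP_RATE = 0.10       # assumed per-call flip probability of the judge

true_labels = [int(random.random() < TRUE_PASS_RATE) for _ in range(N_CASES)]

def run_suite_once() -> float:
    # One single-sample-per-case run, same shape as the for-loop above.
    verdicts = [
        label if random.random() > FLIP_RATE else 1 - label
        for label in true_labels
    ]
    return sum(verdicts) / N_CASES

for i in range(3):
    print(f"run {i + 1}: pass rate = {run_suite_once():.0%}")
# Same cases, same "judge", three different pass rates.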
How loud is the noise?
Loud enough to matter. Published work and practitioner reports converge on the same rough range:
- Shi et al., Judging the Judges (arXiv:2406.07791) finds judge verdicts on pairwise tasks flip when the order of the candidates is swapped. Position bias varies by model and task and grows where the gap between answers is small.
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge (arXiv:2410.02736) catalogs verbosity, authority, and bandwagon biases that shift verdicts on retest.
- Field reports from teams running production evals describe intra-judge flip rates roughly in the 5–15% range on the same case across reruns at temperature 0, depending on rubric specificity and how close the answer sits to the decision boundary. Treat that as anecdotal — your mileage will vary by model, rubric, and dataset. The arXiv work above is the load-bearing evidence; the percentage is a rough field estimate.
Take the middle of that range, say 10%, and apply it to a 200-case eval suite. Twenty verdicts per run are coin flips. Your pass rate has a built-in error bar of roughly ±5 points (a binomial 95% CI on n=200, p≈0.85, assuming independence). Two prompt variants whose true pass rates differ by 3 points are statistically indistinguishable in a single run. Most eval libraries will happily print one as 84% and the other as 81% with no uncertainty interval, and you ship the wrong prompt.
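The error bar itself is a one-liner if you want to check it; the 200 and 0.85 are the numbers from the paragraph above:

# ci_check.py
# Normal-approximation 95% confidence interval for the pass rate above.
from math import sqrt

n, p = 200, 0.85
half_width = 1.96 * sqrt(p * (1 - p) / n)
print(f"95% CI: {p:.0%} +/- {half_width:.1%}")  # roughly +/- 5 points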
The fix is older than LLMs: take more than one sample
Statistics has been solving this problem since the 1700s. If a single observation is noisy, take many and aggregate. For a binary judge, the aggregator that does not require strong assumptions is majority vote across k independent calls. For a continuous score, take the mean and report a confidence interval. Either way, k > 1.
This isn't an exotic technique. It's the same idea as Self-Consistency for chain-of-thought (Wang et al., 2022) — sample multiple reasoning paths, vote. Apply it to the judge call instead of the generator call.
A 50-line wrapper that turns a 1-shot judge into a k-vote evidence machine:
# judges/k_vote.py
from collections import Counter
from dataclasses import dataclass


@dataclass
class JudgeResult:
    verdict: int           # 0 or 1
    agreement: float       # share of votes for winner
    votes: list[int]       # raw votes, for audit
    rationales: list[str]  # raw reasons, for audit


def k_vote_judge(
    judge_fn,  # callable: (answer, rubric) -> {"verdict": 0|1, "rationale": str}
    answer: str,
    rubric: str,
    k: int = 5,
) -> JudgeResult:
    if k % 2 == 0:
        raise ValueError("k must be odd to avoid ties")
    votes, rationales = [], []
    for _ in range(k):
        raw = judge_fn(answer, rubric)
        votes.append(int(raw["verdict"]))
        rationales.append(raw["rationale"])
    counts = Counter(votes)
    winner, n_winner = counts.most_common(1)[0]
    return JudgeResult(
        verdict=winner,
        agreement=n_winner / k,
        votes=votes,
        rationales=rationales,
    )
- Calls the judge k times with the same prompt. With k=5 and a 10% per-call flip rate, and assuming roughly independent flips, the binomial probability the majority is wrong drops from 10% to under 1%; the quick check after this list shows the arithmetic. Real-world judge errors are often correlated, so the actual drop is smaller, but it is still a large move.
- Reports an agreement rate alongside the verdict. A 5/5 verdict and a 3/2 verdict are not the same evidence. Treat them differently downstream.
- Keeps the raw votes and rationales for audit. When a reviewer asks "why did the judge flag case 47," you have five rationales, not one. Patterns of disagreement teach you where your rubric is ambiguous.
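Here is the quick check promised in the first bullet. It only holds under the independence assumption, which is exactly the caveat noted there:

# majority_error_check.py
# Probability that 3 or more of 5 independent votes flip, i.e. the
# majority verdict is wrong, given a 10% per-call flip rate.
from math import comb

p, k = 0.10, 5
majority_wrong = sum(
    comb(k, m) * p**m * (1 - p) ** (k - m)
    for m in range(k // 2 + 1, k + 1)  # 3, 4, or 5 flipped votes
)
print(f"{majority_wrong:.2%}")  # ~0.86%, versus 10% for a single call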
Pin the model snapshot, use temperature 0, anchor the system prompt on one binary question. Each of those reduces the noise. Only sampling measures what is left and averages it down.
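For completeness, here is one shape a judge_fn compatible with k_vote_judge could take. The OpenAI client, the snapshot name, and the prompt wording are illustrative assumptions; substitute whatever provider and model you actually run:

# judges/openai_judge.py
# Illustrative only: pinned snapshot, temperature 0, one binary question.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a grader. Answer one binary question: does the answer satisfy "
    'the rubric? Reply as JSON: {"verdict": 0 or 1, "rationale": "..."}'
)

def judge_fn(answer: str, rubric: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-2024-11-20",  # pin the snapshot, not the rolling alias
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Rubric:\n{rubric}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)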
The other half: a calibration set
k-vote handles the judge's disagreement with itself. It does not handle the judge's disagreement with reality. The judge could vote 5/5 in favor of an answer that a human grader would reject. The fix for that is older still: a calibration set.
Collect 100 examples covering the range of production traffic. Have a domain expert label each one binary: pass or fail. Lock the dataset in Git.
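One way to store that locked dataset is a JSONL file whose field names line up with what the calibrate function below expects. The file name and the example row are invented:

# load_calibration.py
# Each line of calibration_set.jsonl is one hand-labeled example, e.g.:
# {"answer": "Refunds post within 5 business days.",
#  "rubric": "States the refund window accurately.",
#  "human_label": 1}
import json

with open("calibration_set.jsonl") as f:
    calibration_set = [json.loads(line) for line in f]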
Now run the judge against the calibration set. Compute:
- TPR (true positive rate): of the cases the human marked pass, what share did the judge mark pass?
- TNR (true negative rate): of the cases the human marked fail, what share did the judge mark fail?
- Cohen's kappa: agreement above chance.
# meta_eval.py
from sklearn.metrics import cohen_kappa_score

from judges.k_vote import k_vote_judge


def calibrate(judge_fn, calibration_set: list[dict], k: int = 5):
    human, judge = [], []
    for ex in calibration_set:
        human.append(ex["human_label"])
        judge.append(
            k_vote_judge(judge_fn, ex["answer"], ex["rubric"], k).verdict
        )
    pos = max(1, sum(human))               # guard against a pass-only set
    neg = max(1, len(human) - sum(human))  # guard against a fail-only set
    return {
        "tpr": sum(1 for h, j in zip(human, judge) if h == 1 and j == 1) / pos,
        "tnr": sum(1 for h, j in zip(human, judge) if h == 0 and j == 0) / neg,
        "kappa": cohen_kappa_score(human, judge),
    }
Three rules that flow from this:
- A judge with TPR or TNR under 0.8, or kappa under 0.6, is not fit to deploy. Revise the rubric and re-run; a minimal deploy gate for this rule follows the list.
- Report the agreement rate against humans on the dashboard next to the pass rate from the eval suite. Pass rate without agreement rate is theatre.
- Re-run calibration whenever you bump the judge model snapshot. The same prompt against gpt-4o-2024-11-20 and gpt-4o-2024-08-06 is two different judges.
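The deploy gate mentioned in the first rule can be a few lines in the eval pipeline. The thresholds are the ones from this post; tune them for your domain:

# judge_gate.py
def assert_judge_is_deployable(report: dict) -> None:
    # report is the dict returned by calibrate() above
    if report["tpr"] < 0.8 or report["tnr"] < 0.8 or report["kappa"] < 0.6:
        raise AssertionError(f"Judge failed calibration: {report}")

# In the pipeline, before any suite run:
# assert_judge_is_deployable(calibrate(judge_fn, calibration_set))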
What this means for the libraries
Whichever library you use, check its default. Most ship single-sample judging in their quickstart code, and most also let you plug in a custom judge. The upgrade path is short:
- Wrap the judge call your library uses with k_vote_judge. Default k=5, odd.
- Persist the agreement rate alongside the verdict. Most libraries already store metadata; piggyback on that.
- Add a calibration step to your eval workflow. Run it once on dataset creation and re-run it on every model bump.
- On the dashboard, show three numbers per metric: pass rate, mean agreement rate (judge-internal), and judge-vs-human agreement on the last calibration. If any of the three changes meaningfully, investigate before you ship.
The library authors need to change the default. Your job today is simpler: stop treating single-shot judge verdicts as ground truth.
If this was useful
The LLM Observability Pocket Guide covers how to evaluate the tracing-and-evals tools on the market when picking one for your team: what to look for in their judge primitives, how to test whether they treat the judge as an oracle, and how to bolt sampling and calibration on top when they don't. It also covers the meta-eval protocol in more detail than fits in a blog post: kappa thresholds, calibration cadence, what to alert on, and when to retire a judge entirely.
