Why Your LLM-as-Judge Disagrees With Itself (And How to Fix It)

Swap the order of two candidate answers and ask the same judge model to pick the better one. A depressing fraction of the time, it picks the other answer. Same model, same prompt, same temperature 0, and a different winner, just because A and B traded seats. If you are using an LLM-as-judge to gate releases or rank a leaderboard, that single fact should make you nervous, because it means a chunk of your "quality" signal is measuring slot position, not quality.

LLM-as-judge is the standard way to score open-ended model output at scale — you give a strong model (Claude Opus 4.x, GPT-5.x) a rubric and ask it to grade. It works, it's cheap relative to humans, and it correlates with human preference well enough to be useful. But it has structured, reproducible biases. If you don't engineer around them, your evals are noisier than you think and quietly wrong in a consistent direction.

Key takeaways

Position bias is the biggest single defect: judges systematically prefer the first (sometimes the last) option. Fix it by running every pairwise comparison twice with the order swapped and only counting consistent verdicts.
Self-preference bias is real: a model tends to rate its own generations higher. Don't judge a model's output with the same model family when you can avoid it.
Verbosity bias means longer, more confident-sounding answers win even when they're wrong. Control for length in your rubric and your stats.
Score clustering: on a 1–10 scale, judges crowd 7–8. Use pairwise comparison or a forced-decomposition rubric instead of asking for a bare number.
Structured output + chain-of-thought-before-verdict cuts variance more than any prompt-wording tweak.

Why does an LLM-as-judge prefer whichever answer comes first?

Because the comparison isn't symmetric to the model the way it is to you. The two candidates occupy different positions in the context window, and the autoregressive forward pass conditions on everything to the left. The token "A" and the token "B" carry priors. The first option establishes a frame that the second is implicitly judged against. The result is a measurable, consistent skew toward one slot — usually the first.

This is not noise you can average away by running more samples in the same order. It's a bias with a sign. If you always present your new model as option B, you will systematically under-credit it.

The fix is a swap test. Run each comparison in both orders and define agreement:

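from typing import Literal

import anthropic
from pydantic import BaseModel

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

class Verdict(BaseModel):
    reasoning: str               # declared first so the judge fills it in before the verdict
    winner: Literal["A", "B"]    # constrained so any other value fails validation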

JUDGE_PROMPT = """You are grading two answers to the same question.
Question: {question}

Answer A:
{a}

Answer B:
{b}

First reason step by step about correctness, completeness, and whether
either answer is longer without being better. THEN output your verdict.
Ignore length unless it affects correctness."""

def judge_once(question, a, b):
    msg = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        temperature=0,
        tools=[{
            "name": "verdict",
            "description": "Record the grading verdict.",
            "input_schema": Verdict.model_json_schema(),
        }],
        tool_choice={"type": "tool", "name": "verdict"},
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
    )
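    # Pick out the forced tool call rather than assuming it is the first block.
    call = next(block for block in msg.content if block.type == "tool_use")
    return Verdict(**call.input)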

def judge(question, answer_x, answer_y):
    # First pass: X is A, Y is B
    v1 = judge_once(question, answer_x, answer_y)
    # Second pass: positions swapped
    v2 = judge_once(question, answer_y, answer_x)
    win1 = "X" if v1.winner == "A" else "Y"
    win2 = "X" if v2.winner == "B" else "Y"   # note the flip
    if win1 == win2:
        return win1            # consistent → trust it
    return "tie"               # judge flipped on order → not a real signal

The tie branch is the important part. A flip on swap is the model telling you it doesn't actually have a preference here; counting either verdict as a win is fabricating signal. In my experience the consistent-verdict rate is a better health metric for your eval than the win rate itself — if more than a quarter of comparisons flip on swap, your rubric is underspecified or the two candidates are genuinely too close to rank.
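To track the consistent-verdict rate as a health metric, you can log the raw slot label from each pass and summarize afterwards. The following is a minimal sketch, not part of the judge code above: the function name `swap_health` and the shape of `pairs` are assumptions made for illustration. It also reports which slot the judge favors when it flips, which gives the bias its sign.

```python
from collections import Counter

def swap_health(pairs):
    """Summarize swap-test results.

    `pairs` is a list of (first_pass_label, second_pass_label) tuples, where
    each label is the raw slot ("A" or "B") the judge chose. The second pass
    is the one with the two answers swapped.
    """
    if not pairs:
        raise ValueError("need at least one comparison")
    counts = Counter()
    for first, second in pairs:
        if first != second:
            # Different slot labels across the swap means the same underlying
            # answer won both times: a consistent verdict.
            counts["consistent"] += 1
        elif first == "A":
            counts["first_slot"] += 1    # picked slot A in both orders
        else:
            counts["second_slot"] += 1   # picked slot B in both orders
    total = len(pairs)
    return {
        "consistent_rate": counts["consistent"] / total,
        "first_slot_rate": counts["first_slot"] / total,
        "second_slot_rate": counts["second_slot"] / total,
    }
```

A first_slot_rate much larger than second_slot_rate would indicate that the flips are not symmetric and that the judge leans toward the first position.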

What is self-preference bias and why does it corrupt model comparisons?

Self-preference bias is the tendency of a judge to rate text generated by itself (or its own model family) higher than text from other models, holding quality constant. The leading explanation is familiarity: a model assigns higher likelihood to text that matches its own distribution — phrasing, structure, hedging style — and that perplexity-style comfort leaks into the quality judgment.

The practical consequence is a conflict of interest. If you fine-tune a Claude model and then evaluate it with Claude as the judge, the judge is mildly rooting for the home team. The bias is usually small relative to a large real quality gap, but it's exactly the wrong size to trust when you're measuring a small improvement — which is most of the time in production.

Mitigations, in order of strength:

Cross-family judging. Judge candidates from model family X with family Y. If you're comparing two Claude variants, this doesn't help — so:
Blind the judge to provenance. Never put model names in the judge prompt. "GPT-5.1's answer" vs "our model's answer" is a contaminated comparison.
Panel of judges. Average over two or three different judge models. A self-preference from one is diluted by the others. This roughly doubles or triples cost; reserve it for high-stakes decisions like release gates.
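Blinding and panel judging can be sketched in a few lines. Everything here is illustrative: `blind` and `panel_verdict` are hypothetical helpers, each entry in `judges` is assumed to be a callable wrapping one judge model that returns "X", "Y", or "tie", and a strict majority vote is used as one simple way to combine the panel. Plain string replacement is a crude form of blinding and will miss paraphrased or misspelled model names.

```python
from collections import Counter

def blind(text, model_names):
    """Redact known model names so the judge cannot see provenance."""
    for name in model_names:
        text = text.replace(name, "[model]")
    return text

def panel_verdict(judges, question, answer_x, answer_y):
    """Combine several judges by strict majority vote.

    Each judge is a callable taking (question, answer_x, answer_y) and
    returning "X", "Y", or "tie". Without a strict majority the panel
    verdict is "tie".
    """
    votes = Counter(j(question, answer_x, answer_y) for j in judges)
    top, top_count = votes.most_common(1)[0]
    if top_count * 2 > len(judges):
        return top
    return "tie"
```

With only two judges, a strict majority means both must agree, so a two-judge panel behaves like a second consistency check rather than a tiebreaker.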

Why do judges give long answers higher scores?

Verbosity bias: longer answers read as more thorough, more confident, more "complete," and judges reward that surface signal even when the extra length adds nothing or is wrong. A padded answer with three caveats and a summary table often beats a correct one-liner.

This interacts viciously with RLHF'd models, which already lean verbose. If your candidate model was tuned to be chatty and your judge rewards chattiness, you get a feedback loop that optimizes for length, not correctness.

Two defenses. First, instrument it: log answer length alongside every verdict and check whether win rate correlates with length. If your winning answers are consistently 1.5x longer, length is a confound and you should be suspicious. Second, neutralize it in the rubric explicitly — the "ignore length unless it affects correctness" line in the prompt above is doing real work. Even better, decompose the score so length can't hide:

class Decomposed(BaseModel):
    factual_errors: int          # count, not vibe
    missing_requirements: list[str]
    unsupported_claims: int
    answers_the_question: bool
    overall: int                 # derived last, after the facts

Forcing the judge to count errors and list missing requirements before emitting an overall score anchors the judgment to checkable facts instead of prose fluency. A 200-word answer with two factual errors loses to a 40-word correct one because the schema makes the errors explicit.
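The first defense, logging length next to every verdict, is cheap to implement. Here is a minimal sketch under stated assumptions: `length_confound` is a hypothetical helper, `records` holds the winning and losing texts from decisive comparisons only, and length is measured in whitespace-separated words rather than tokens.

```python
from statistics import median

def length_confound(records):
    """Check whether the longer answer tends to win.

    `records` is a list of (winner_text, loser_text) pairs from decisive,
    swap-consistent comparisons; ties are excluded before calling this.
    """
    if not records:
        raise ValueError("need at least one decisive comparison")
    # Equal-length pairs count as "longer did not win".
    longer_wins = sum(len(w.split()) > len(l.split()) for w, l in records)
    # Winner length divided by loser length, guarding against empty losers.
    ratios = [len(w.split()) / max(len(l.split()), 1) for w, l in records]
    return {
        "longer_win_rate": longer_wins / len(records),
        "median_length_ratio": median(ratios),
    }
```

A longer_win_rate well above 0.5, or a median ratio near the 1.5x mentioned above, is a hint that length is a confound, not proof of it. Longer answers can also be legitimately better, so treat the number as a prompt to inspect samples by hand.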

Why is a 1–10 score worse than a head-to-head comparison?

Because absolute scores cluster and drift. Ask an LLM-as-judge for a 1–10 rating and it will pile most answers into 7–8, almost never use 1–4, and shift the whole distribution if you tweak the prompt wording. The numbers feel precise and aren't. You can't compare a 7 from Tuesday's prompt to a 7 from Thursday's.

Pairwise comparison sidesteps the calibration problem entirely. "Is A better than B?" is a far easier and more stable judgment than "rate A on an absolute scale of 1 to 10," because the judge only has to discriminate, not calibrate. You lose a global score, but for ranking candidates or running A/B evals you never needed one — you needed the ordering, and pairwise gives you that with much lower variance.
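Recovering an ordering from pairwise verdicts does not need anything elaborate. The sketch below ranks candidates by win rate. The function name and the input shape are assumptions, and counting a tie as half a win for each side is a design choice for illustration; dropping ties entirely is equally defensible.

```python
from collections import defaultdict

def rank_from_pairwise(results):
    """Order candidates by win rate from pairwise verdicts.

    `results` is a list of (name_x, name_y, verdict) tuples, where verdict
    is "X", "Y", or "tie" as returned by a swap-tested judge.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for name_x, name_y, verdict in results:
        games[name_x] += 1
        games[name_y] += 1
        if verdict == "X":
            wins[name_x] += 1.0
        elif verdict == "Y":
            wins[name_y] += 1.0
        else:
            # Tie: split the credit so neither side gains an edge.
            wins[name_x] += 0.5
            wins[name_y] += 0.5
    rates = {name: wins[name] / games[name] for name in games}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

For more than a handful of candidates, a fitted rating model is the usual next step, but a plain win rate is often enough to decide between two or three variants.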

If you genuinely need absolute scores (say, a quality threshold for filtering), anchor the scale. Put one or two reference answers of known quality in the prompt — "a 9 looks like this, a 4 looks like this" — and the clustering tightens dramatically because you've grounded the abstract numbers in concrete examples.
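An anchored prompt can be assembled from a short list of reference answers. This is a sketch only: `anchored_prompt` is a hypothetical helper, the prompt wording is illustrative and not a tested template, and the scores attached to the references are ones you assign yourself.

```python
def anchored_prompt(question, answer, anchors):
    """Build a scoring prompt with reference answers of known quality.

    `anchors` is a list of (score, reference_answer) pairs that you choose;
    they are shown from the highest score to the lowest.
    """
    lines = [
        "Score the answer to the question on a 1-10 scale.",
        f"Question: {question}",
        "",
        "Reference answers for calibration:",
    ]
    for score, reference in sorted(anchors, key=lambda a: a[0], reverse=True):
        lines.append(f"A {score} looks like this:")
        lines.append(reference)
        lines.append("")
    lines.append("Answer to score:")
    lines.append(answer)
    return "\n".join(lines)
```

Keep the references fixed across runs. Changing them changes the scale, which brings back the Tuesday-versus-Thursday comparison problem.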

How do you cut judge variance without changing the rubric?

Three structural changes beat any amount of prompt-wording fiddling:

Reason before verdict. Make the judge produce its analysis first and the decision last, in that token order. Verdict-first throws away the chain-of-thought — the model commits, then rationalizes. This is the single highest-leverage change and it's why the schema puts reasoning before winner.
Temperature 0 plus structured output. Free-text verdicts force you to parse "I think Answer B is slightly better, although..." with a regex. Tool-call/structured output removes the parsing failure mode and the formatting variance. The constraint costs a little, but for a judge you want determinism, not creativity.
Self-consistency for the hard cases. For comparisons that flip on swap, run the judge a few times at low-but-nonzero temperature and take the majority. Don't pay that cost on the easy 80% — spend it only where the swap test already told you there's genuine ambiguity.
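The third change, spending extra samples only on comparisons that flipped, can be sketched as a two-stage wrapper. Both callables are assumptions: `cheap_judge` stands for a temperature-0 swap-tested judge like the one shown earlier, and `sample_judge` stands for the same comparison run at a low, nonzero temperature. The sample count of 5 is arbitrary.

```python
from collections import Counter

def majority_verdict(sample_judge, question, answer_x, answer_y, n=5):
    """Resample a judge n times and return the strict-majority verdict.

    `sample_judge` takes (question, answer_x, answer_y) and returns
    "X", "Y", or "tie".
    """
    votes = Counter(sample_judge(question, answer_x, answer_y) for _ in range(n))
    top, top_count = votes.most_common(1)[0]
    # Without a strict majority, keep treating the pair as a tie.
    return top if top_count * 2 > n else "tie"

def judge_with_escalation(cheap_judge, sample_judge, question, answer_x, answer_y):
    """Use the deterministic judge first; resample only when it returns a tie."""
    verdict = cheap_judge(question, answer_x, answer_y)
    if verdict != "tie":
        return verdict
    return majority_verdict(sample_judge, question, answer_x, answer_y)
```

If the resampled majority is itself "tie", that is a reasonable place to stop and record the pair as unranked.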

Failure mode to watch: the judge that agrees with everything

One last trap. A poorly prompted judge defaults to being agreeable: it'll call almost anything "correct" or "helpful" if you ask vaguely. Your win rates hover near 50% and your scores near 8, and you conclude both models are great. They might be; more likely your rubric isn't forcing a discrimination. The tell is a low consistent-verdict rate combined with high scores: the judge is being nice, not discerning. Tighten the rubric until it produces decisive, swap-stable verdicts on cases where you already know the right answer. Validate your judge against a small human-labeled set before you trust it on anything that ships.
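Validating against a human-labeled set comes down to comparing two lists of verdicts. The sketch below reports raw agreement together with Cohen's kappa, which corrects for agreement expected by chance. The function name is an assumption, and the article itself does not prescribe a specific agreement statistic, so treat kappa as one reasonable choice.

```python
def agreement_and_kappa(judge_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for judge vs. human verdicts."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    categories = set(judge_labels) | set(human_labels)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Raw agreement alone can look healthy when one label dominates the set, which is exactly the agreeable-judge failure described above. Kappa drops toward zero in that situation.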

So why does your LLM-as-judge disagree with itself?

An LLM-as-judge disagrees with itself mainly because of position bias (it favors one slot, usually whichever answer it reads first), compounded by self-preference bias (it favors its own model family's style), verbosity bias (it rewards length over correctness), and score clustering on absolute scales. None of these are random noise you can average away; each has a consistent direction that skews your results. The fixes are structural, not cosmetic: run every comparison in both orders and only count swap-consistent verdicts, blind the judge to model provenance and prefer cross-family or panel judging, force the rubric to count concrete errors before emitting a score, use pairwise comparison instead of 1–10 ratings, and make the judge reason before it decides via temperature-0 structured output. Do those five things and your judge stops measuring slot position and starts measuring quality.