TL;DR: Smarter models are better judges — unless they're judging their own output. Then they defend wrong answers 86% of the time. Capability makes the bias worse, not better. The only structural fix: generator and judge from different model families.
Part 1: Your judge is biased. 17 out of 20 models. True negative rate: 42.5%. You read that and did the rational thing.
Of course you upgraded.
Old model biased. New model smarter. Smarter means better. Better means fixed.
# The "fix" everyone tries first
# Before: gpt-4o-mini judging gpt-4o-mini
evaluator = OpenAI(model="gpt-4o-mini")
# After: gpt-4o judging gpt-4o-mini
evaluator = OpenAI(model="gpt-4o") # bigger, smarter, surely less biased
Here is what upgrading actually buys you.
Smarter models ARE better judges. Genuinely. Capability correlates with evaluation accuracy at r=0.801.
Unless the thing being judged was written by the same model family.
When a capable model produces a wrong answer and then evaluates it, it defends that wrong answer 86% of the time.
Not because it's confused. Because it's smart enough to construct a convincing argument for why it was right.
# What the research actually found
# Capability vs accuracy (general evaluation):
# r = 0.801 — more capable = better judge ✓
# Capability vs self-preference (evaluating own output):
# r = 0.86 — more capable = MORE biased toward own output ✗
# You upgraded the judge.
# It got better at judging everything EXCEPT itself.
# And it's judging itself.
You gave your biased judge a law degree. It passed the bar.
The defence attorney problem.
When a capable model writes a wrong answer, it writes a well-structured wrong answer. Confident tone. Logical flow. Correct-sounding reasoning. The kind of answer that's hard to argue with.
Now the smarter judge reads it. Same model family. It recognises the reasoning patterns. The confidence markers. The structural cues.
It doesn't just fail to catch the error. It builds a case for why the answer was correct.
# Example: customer asks about data export under GDPR
# Agent 1 (generator) response:
# "Under our terms of service, data export requests require
# a 30-day processing period and a $25 administrative fee."
#
# This is WRONG. GDPR allows one month to respond, but a routine request must be handled free of charge.
# But the response is confident, structured, cites policy.
# Agent 2 (upgraded judge) evaluation:
# "The response correctly references the 30-day timeframe
# consistent with regulatory requirements. The administrative
# fee is clearly stated. Score: 8/10."
#
# The judge didn't miss the error.
# It DEFENDED the error. With citations.
True negative rate is still 42.5%.
The upgrade didn't fix that number. The model didn't get worse at detecting bad outputs. It got better at defending them.
That's not the same thing. That's worse than the same thing.
# Before upgrade (smaller judge)
bad_response → "This looks off, score: 4/10"
# Caught it! (42.5% of the time)
bad_response → "Looks good, score: 8/10"
# Missed it. (57.5% of the time)
# After upgrade (bigger judge, same family)
bad_response → "While this could be improved in minor ways,
                the core reasoning is sound and the response
                demonstrates strong domain knowledge. Score: 8/10"
# Missed it. WITH A PARAGRAPH EXPLAINING WHY IT'S ACTUALLY GOOD.
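For anyone who wants to track this number on their own pipeline, here is a minimal sketch of how a true negative rate could be computed. It assumes the "catch rate" reading used in this series: of the responses a human marked as bad, what share did the judge flag? The function name and the toy data are hypothetical, not taken from the research.

```python
from typing import Sequence


def true_negative_rate(is_bad: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of genuinely bad responses that the judge flagged as bad."""
    if len(is_bad) != len(flagged):
        raise ValueError("is_bad and flagged must have the same length")
    # Keep only the judge's verdicts on responses a human marked as bad.
    flags_on_bad = [f for bad, f in zip(is_bad, flagged) if bad]
    if not flags_on_bad:
        raise ValueError("no bad responses in the sample")
    return sum(flags_on_bad) / len(flags_on_bad)


# Hypothetical toy sample: 8 responses, 5 of them wrong per human review.
human_says_bad = [True, True, True, True, True, False, False, False]
judge_flagged = [True, False, True, False, False, False, True, False]

tnr = true_negative_rate(human_says_bad, judge_flagged)
print(f"True negative rate: {tnr:.1%}")
```

The point of measuring it this way: a judge can post high overall agreement with humans and still have a poor rate on the bad responses alone, because most responses in a healthy pipeline are fine.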
The real-world version of this.
That SaaS support pipeline from Part 1. The team read the accuracy numbers. They upgraded the evaluator to the latest model. Same provider. Bigger, smarter, more expensive.
Before the upgrade: Agent 2 occasionally flagged wrong answers. Not often enough — 42.5% catch rate — but sometimes.
After the upgrade: Agent 2 started writing justifications for why wrong answers were actually right. The catch rate didn't improve. The confidence of the wrong approvals went up.
The dashboard looked better. Resolution scores climbed. Time-to-close dropped.
Six months later someone audited a random sample.
22% of "resolved" tickets contained incorrect information. The old judge had been catching some of these. The new judge was explaining them away.
# The audit that caught it
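import random

# get_tickets and human_reviewer stand in for the team's own tooling
resolved_tickets = get_tickets(status="resolved", last_6_months=True)
sample = random.sample(resolved_tickets, 200)

human_reviews = []
for ticket in sample:
    human_score = human_reviewer.evaluate(ticket.response)
    judge_score = ticket.automated_score
    human_reviews.append({
        "ticket": ticket.id,
        "human": human_score,
        "judge": judge_score,
        "delta": judge_score - human_score,
    })

# A ticket counts as overscored when the judge beat the human by more than 2 points
overscored = [r for r in human_reviews if r["delta"] > 2.0]
avg_delta = sum(r["delta"] for r in overscored) / len(overscored)
print(f"Overscored by judge: {len(overscored)}/{len(sample)} ({len(overscored) / len(sample):.0%})")
print(f"Average delta on overscored tickets: {avg_delta:+.1f} points")
# Overscored by judge: 44/200 (22%)
# Average delta on overscored tickets: +2.8 points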
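The audit above leans on helpers that aren't shown. If you want to try the same comparison without a ticketing system, here is a self-contained sketch that takes plain (human, judge) score pairs. The function name, the 2-point threshold default, and the scores are all made up for illustration.

```python
from statistics import mean


def summarize_overscoring(pairs, threshold=2.0):
    """Summarize how often the judge scored above the human by more than `threshold`.

    pairs: iterable of (human_score, judge_score) on the same scale.
    """
    deltas = [judge - human for human, judge in pairs]
    over = [d for d in deltas if d > threshold]
    return {
        "n": len(deltas),
        "overscored": len(over),
        "rate": len(over) / len(deltas) if deltas else 0.0,
        "mean_delta": mean(over) if over else 0.0,
    }


# Made-up scores for illustration only.
pairs = [(4, 8), (7, 8), (3, 6), (9, 9), (5, 8)]
summary = summarize_overscoring(pairs)
print(summary)
```

Run it on a fresh human-reviewed sample each time. Reusing the same sample lets the numbers drift toward whatever the team has already fixed.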
They didn't publish this. They reintroduced human review for high-stakes tickets and quietly stopped mentioning the AI accuracy numbers in quarterly reviews.
The fix is not a better model. The fix is a different model.
# ✗ This is what everyone does
generator = OpenAI(model="gpt-4o")
judge = OpenAI(model="gpt-4o") # same family = biased
# ✗ This is the "upgrade" that makes it worse
generator = OpenAI(model="gpt-4o")
judge = OpenAI(model="o3") # bigger, same family = more biased
# ✓ This is the structural fix
generator = OpenAI(model="gpt-4o")
judge = Anthropic(model="claude-sonnet-4-6") # different family = independent
Cross-family evaluation. Generator is Model A. Judge cannot be from Model A's family. That's the only fix that addresses the root cause. Everything else is mitigation on a leaky pipe. This is the pipe.
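One way to keep that rule from eroding is to enforce it in config rather than in people's heads. The sketch below is a hypothetical guard, not part of any SDK: the prefix table is an assumption you would replace with the models you actually run, and prefix matching is a deliberately crude way to guess a family.

```python
# Hypothetical prefix-to-family table; extend it for the models you actually use.
FAMILY_PREFIXES = {
    "gpt-": "openai",
    "o3": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
}


def family_of(model: str) -> str:
    """Return the family for a model name, or raise if it is not in the table."""
    for prefix, family in FAMILY_PREFIXES.items():
        if model.startswith(prefix):
            return family
    raise ValueError(f"unknown model family for {model!r}")


def assert_cross_family(generator: str, judge: str) -> None:
    """Raise if generator and judge resolve to the same family."""
    gen_family, judge_family = family_of(generator), family_of(judge)
    if gen_family == judge_family:
        raise ValueError(
            f"judge {judge!r} shares family {judge_family!r} with generator {generator!r}"
        )


assert_cross_family("gpt-4o", "claude-sonnet-4-6")  # passes silently
```

Unknown model names fail loudly on purpose. A new model slipping in unclassified is exactly how a same-family pairing sneaks back.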
Next up, Part 3 of 6: You fixed the judge. You tested every agent individually. They all passed. You deployed them together. By round 15 the entire population drifted into biased conventions. Nobody made a bad decision. The system just... decided. Peer-reviewed, published in Science Advances. Alarming.
Research: Chen et al. (2025). The support pipeline scenario is a composite. The 22% number is illustrative of documented industry patterns. The fix is real.

