TL;DR: Researchers tested 20 AI models as judges. 17 out of 20 were statistically biased. True negative rate: 42.5% — your judge misses bad output more than half the time. If you have an LLM checking another LLM's work, this is your problem.
You probably have this in production right now.
response = await generator.chat(user_query)
review = await evaluator.chat(f"Rate this response 1-10: {response}")
if review.score >= 7:
    send_to_customer(response)
Two models. One generates. One judges. The judge says it's fine. You ship it.
This pattern runs inside every multi-agent framework, every "self-check" wrapper, every pipeline where one LLM validates another. LangChain, CrewAI, AutoGen — doesn't matter. If Agent B evaluates Agent A, you're here.
Researchers tested 20 models as judges. Three independent statistical tests.
17 out of 20 were biased.
Not maybe biased. Statistically significant. Published. Peer-reviewed.
What does "biased" actually mean here?
Two flavours:
8 models inflated scores for outputs that sounded like themselves. "This is how I would say it, so it must be good."
9 models deflated scores. Overcorrected. Penalised their own patterns, probably because RLHF trained them to be "critical."
3 were neutral.
Neither flavour is safe. Inflated means bad output ships. Deflated means good output gets rejected and you over-escalate to humans, burning the cost savings you built the pipeline for.
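To see which flavour a given judge falls into, one rough approach is to compare its scores against reference scores for the same outputs. The sketch below is an illustration, not the method used in the cited research: `bias_direction`, the `tolerance` value, and the idea of using human-labelled reference scores are all assumptions made for the example.

```python
from statistics import mean


def bias_direction(judge_scores, reference_scores, tolerance=0.5):
    """Label a judge by the mean signed gap between its scores and reference scores.

    judge_scores: scores the judge gave to outputs from its own model family.
    reference_scores: scores for the same outputs from a trusted source
        (assumed here to be human labels).
    tolerance: hypothetical dead zone, in score points, treated as "neutral".
    """
    if len(judge_scores) != len(reference_scores) or not judge_scores:
        raise ValueError("need two equal-length, non-empty score lists")

    # Positive gap: the judge scores higher than the reference on average.
    gap = mean(j - r for j, r in zip(judge_scores, reference_scores))

    if gap > tolerance:
        return "inflated", gap
    if gap < -tolerance:
        return "deflated", gap
    return "neutral", gap


label, gap = bias_direction([8, 9, 8], [6, 7, 7])
print(label, round(gap, 2))
```

A mean gap hides per-item variation, so treat the label as a first look rather than a verdict.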
"But we use rubrics."
HealthBench. Medical benchmark. 48,562 evaluation criteria. 262 physicians wrote the rubric.
The judge model still scored its own outputs 10 points higher than they deserved.
On a medical benchmark. With doctor-verified rubrics.
Ten points is not noise. Ten points is a different drug recommendation.
# What you think is happening
score = judge.evaluate(response, rubric=detailed_medical_rubric)
# score: 82 — "good enough to send"
# What is actually happening
# Response quality: 72 (mediocre)
# Self-preference bonus: +10
# Reported score: 82 — "good enough to send"
# Real verdict: not good enough to send
The numbers that should scare you.
True positive rate: 94.5%. Good output? Judge catches it. Great.
True negative rate: 42.5%. Bad output? Judge catches it less than half the time.
Your pipeline is a filter that lets most things through. Nearly all the good stuff AND more than half the bad stuff.
# Your judge's actual performance
good_response → "approved" (94.5% of the time) ✓
bad_response → "approved" (57.5% of the time) ✗
bad_response → "rejected" (42.5% of the time) ✓
# That's not a quality gate.
# That's a coin flip with a marketing budget.
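What those two rates mean for the output you actually ship depends on how often the generator is wrong in the first place. The sketch below works through the arithmetic. The rates are the ones quoted above; the base rates in the loop are hypothetical, and `approved_breakdown` is a name made up for this example.

```python
def approved_breakdown(bad_rate, tpr=0.945, tnr=0.425):
    """Return (share of all output approved, share of approved output that is bad).

    bad_rate: assumed fraction of generator output that is bad (hypothetical).
    tpr: fraction of good output the judge approves.
    tnr: fraction of bad output the judge rejects.
    """
    if not 0.0 <= bad_rate <= 1.0:
        raise ValueError("bad_rate must be between 0 and 1")

    good_approved = (1.0 - bad_rate) * tpr  # good output correctly approved
    bad_approved = bad_rate * (1.0 - tnr)   # bad output wrongly approved
    approved = good_approved + bad_approved
    if approved == 0:
        return 0.0, 0.0
    return approved, bad_approved / approved


for bad_rate in (0.05, 0.20, 0.40):  # hypothetical base rates
    approved, bad_share = approved_breakdown(bad_rate)
    print(f"bad_rate={bad_rate:.0%}  approved={approved:.1%}  "
          f"bad among approved={bad_share:.1%}")
```

With an assumed 20% bad rate, about 87% of all output is approved and roughly 13% of what ships is bad. The gate helps a little, but far less than a "reviewed by a second model" label suggests.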
Where this shows up in the real world.
A SaaS support pipeline. The setup most teams build first: Agent 1 drafts the customer response, Agent 2 checks it before sending.
Agent 1 writes a confident, well-structured, completely wrong answer about the refund policy. It sounds authoritative. Covers all the right talking points. Cites the terms of service. Gets the conclusion backwards.
Agent 2 reads it. Same model family. Trained on the same data. Thinks in the same patterns.
Yes. This is how I would say it.
Score: 8/10. Ticket closed. Customer not refunded.
The dashboard says Resolved. The customer was not resolved.
# The invisible failure mode
response = agent_1.draft_reply(ticket)
# "Per our terms of service (Section 4.2), refunds are
# not applicable after 30 days..."
# (Wrong. Policy changed 3 months ago. Agent 1 doesn't know.)
score = agent_2.evaluate(response)
# score: 8/10
# Agent 2 doesn't know either. But it SOUNDS right.
# Same training data. Same confident tone. Same blind spot.
if score >= 7:
    send_to_customer(response)  # ← wrong answer, shipped with confidence
Nobody catches this until a human reads the ticket. If a human reads the ticket. The whole point of the pipeline was fewer humans reading tickets.
What to do right now — the minimum.
You don't need to rearchitect your pipeline today. But you need to know if you have this problem.
# Quick bias test: same prompt, swap evaluator identity
from statistics import mean

test_prompts = load_test_set()  # 50+ representative queries
results_same_family = []
results_cross_family = []

for prompt in test_prompts:
    response = generator.chat(prompt)

    # Same-family evaluation (what you probably have)
    score_same = same_family_judge.evaluate(response)
    results_same_family.append(score_same)

    # Cross-family evaluation (the control)
    score_cross = different_family_judge.evaluate(response)
    results_cross_family.append(score_cross)

delta = mean(results_same_family) - mean(results_cross_family)
print(f"Self-preference delta: {delta:.2f}")
# If delta > 0.5 on a 10-point scale, you have the problem.
# Most teams find delta between 0.8 and 2.1.
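A difference of two means on 50 prompts can be noise. One hedged way to check is a percentile bootstrap over the paired per-prompt differences, sketched below with the standard library only. `paired_delta_ci` and its defaults (`n_resamples`, `alpha`, `seed`) are assumptions for illustration, not part of the cited research.

```python
import random
from statistics import mean


def paired_delta_ci(same_scores, cross_scores, n_resamples=5000, alpha=0.05, seed=0):
    """Mean paired difference (same - cross) with a percentile bootstrap interval."""
    if len(same_scores) != len(cross_scores) or len(same_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")

    # Pair scores by prompt so prompt difficulty cancels out.
    diffs = [s - c for s, c in zip(same_scores, cross_scores)]
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(diffs)

    # Resample the differences with replacement and record each resample's mean.
    boot_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Usage with the two lists collected in the loop above:
# delta, lo, hi = paired_delta_ci(results_same_family, results_cross_family)
# print(f"delta={delta:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```

If the interval excludes zero, the gap is unlikely to be resampling noise. A gap on its own cannot tell self-preference apart from a cross-family judge that is simply stricter, so scoring a small human-labelled subset with both judges is a useful tiebreaker.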
If the delta is significant, your judge is biased. Part 2 covers what happens when you try the obvious fix: upgrading to a smarter model.
Next up, Part 2 of 6: You read this. You upgraded to a smarter judge. You made it worse. The smarter the model, the better it argues it was right — especially when it wasn't.
Research: Yang et al. (2026), Wataoka et al. (2024), Pombal et al. (2026). The SaaS support scenario is a composite. The pattern is in your codebase.