Sho Naka

Posted on Jul 5

LLM-as-Judge Is Too Lenient. Here's a Cheap Fix: Judge, Refute, (Maybe) Arbitrate

#ai #llm #architecture #testing

If you've wired an LLM up to grade another LLM's output (a quality gate, an eval harness, a "does this pass the rubric" check), you've probably run into a well-known tendency: it grades on a curve. It wants to say pass. Here's a pattern that tightens that up without doubling your model bill: make the judge defend its verdict against a model whose only job is to tear it apart.

TL;DR

LLM-as-judge setups have a well-known tendency toward leniency: the same model family that generated plausible-sounding output tends to also find plausible-sounding output... plausible.
Fix: split grading into two roles, a Judge that scores against a rubric, and a Refuter whose sole incentive is to overturn the judge's verdict.
If they agree, ship the result. That's the cheap path: no extra model calls beyond the two you already paid for.
If they disagree, escalate to a tie-breaker that is handed both verdicts and their reasons alongside the rubric and output, so it resolves a specific disagreement rather than grading from scratch.
The escalation decision itself is plain deterministic code (if verdict_a == verdict_b). That keeps the control flow boring and auditable, and saves the model calls for actual judgment.

A note on evidence before you read further: the leniency tendency above is a widely observed pattern in LLM-as-judge setups, and the mechanism (self-agreement, no adversarial pressure) is a reasonable explanation for it, but the specific pattern below is something I'm proposing, not something I've run through a large labeled benchmark. Treat it as a design worth trying, not a measured result.

The problem: your judge grades on a curve

"LLM-as-judge" is everywhere now: content quality gates, RAG answer scoring, code review bots, test-output evaluation. There's a failure mode that shows up fast once you look for it: self-grading and single-model grading tend to skew lenient. A model asked "does this pass?" defaults toward yes unless the failure is glaring, because generating a charitable-sounding justification is exactly the kind of task language models are good at, whether or not the justification is correct.

The naive fix, using a stronger model as judge, helps a little and costs a lot. It doesn't address the actual mechanism: a single grader, asked once, in agreement-seeking mode, has no structural pressure to argue against itself.

None of this is a new idea in isolation. It sits close to two threads that already exist in eval practice: adversarial or debate framings, where models argue a verdict rather than just render one, and jury-of-judges setups, where multiple graders vote and disagreement is the signal worth watching. What's below is a minimal version of both, combined with cost-cascade routing (cheap path first, an expensive model only when the cheap path can't resolve things) rather than a general-purpose debate protocol.

The fix: judge, then refute, then (maybe) arbitrate

The pattern is three roles instead of one:

Judge: scores the output against an explicit rubric. Normal grading pass.
Refuter: is shown the judge's verdict and reasons, and is told, explicitly, that its only job is to find grounds to overturn that specific verdict, rather than give a general review of the output.
Tie-breaker: only invoked if Judge and Refuter disagree. Sees both verdicts and both sets of reasons, alongside the rubric and the output, and decides. This is the only step that needs your best (most expensive) model.

flowchart TD
    IN(["Output + Rubric"]) --> J["Judge (cheap model)<br/>scores vs rubric"]
    J -->|"verdict"| R["Refuter (cheap model)<br/>tries to overturn it"]
    R -->|"verdict"| D{"Judge == Refuter?"}
    D -->|"Agree (common case)"| SHIP(["Ship verdict<br/>2 cheap calls"])
    D -->|"Disagree (rare)"| ARB["Arbitrate (strong model)<br/>sees both verdicts"]
    ARB --> SHIP2(["Ship final verdict<br/>+1 expensive call, only when needed"])

    classDef cheap fill:#dce8f5,stroke:#3b6ea5,stroke-width:2px,color:#1c3350;
    classDef expensive fill:#f6e2dd,stroke:#b1502f,stroke-width:2px,color:#5a2415;
    classDef decision fill:#f7ecd6,stroke:#b8873a,stroke-width:2px,color:#5c4415;
    classDef ship fill:#dfeee0,stroke:#4c8c52,stroke-width:2px,color:#204d2a;
    classDef io fill:#e9edf0,stroke:#5f6f78,stroke-width:1.5px,color:#29323a;

    class J,R cheap;
    class ARB expensive;
    class D decision;
    class SHIP,SHIP2 ship;
    class IN io;

Adversarial framing changes what the model optimizes for. A grader told "evaluate this" produces a plausible-sounding pass. A grader told "your only job is to overturn this specific verdict, and if you can't find real grounds, say so" produces something closer to genuine scrutiny, because agreeing with the first verdict, when explicitly tasked with attacking it, isn't the path of least resistance anymore. Agreement is also the cheap case: two calls, done. As long as disagreement stays rare, routing it to a stronger model stays affordable even if, say, that model costs 10x as much per call.
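To put rough numbers on the cost side, here's a back-of-the-envelope sketch. The prices (cheap call = 1 unit, strong call = 10 units) and the disagreement rates are made-up assumptions for illustration, not measurements, and expected_cost is a hypothetical helper, not part of the pattern itself.

```python
def expected_cost(cheap: float, strong: float, disagreement_rate: float) -> float:
    """Expected cost per graded item: two cheap calls always,
    plus one strong call only when Judge and Refuter disagree."""
    if not 0.0 <= disagreement_rate <= 1.0:
        raise ValueError("disagreement_rate must be between 0 and 1")
    return 2 * cheap + disagreement_rate * strong


# Assumed prices: cheap call = 1 unit, strong call = 10 units (the 10x case).
for rate in (0.05, 0.20, 0.50):
    cost = expected_cost(cheap=1.0, strong=10.0, disagreement_rate=rate)
    print(f"disagreement {rate:.0%}: {cost:.2f} units per item")
```

Under those assumed prices, a 5% disagreement rate works out to 2.5 units per item. The number that matters is the disagreement rate, which you only learn by running the gate on your own traffic.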

A worked example

To make the mechanics concrete, here's a single illustrative walkthrough. This is a hypothetical example built to show how the three roles interact, not a real production log or a measured result.

Rubric (for a customer-support reply quality gate):

Directly answers the customer's stated question.
Does not grant a refund, credit, or exception without saying it needs approval.
Cites the specific policy section it's relying on, not "per our policy" alone.

Output being graded:

"I've checked your order under section 4.2 of our returns policy. Since it's been 35 days, that's past the 30-day window, but I've gone ahead and approved a one-time courtesy refund given how long you've been a customer. It'll post in 3-5 business days."

Judge verdict: FAIL. Reasons given: the refund is granted outside the stated policy window without flagging that this needs manager sign-off, which the judge reads as violating item 2's "needs approval" language; otherwise the reply answers the question and cites section 4.2, satisfying items 1 and 3.

Refuter's rebuttal: told to find grounds to overturn the FAIL, the refuter re-reads item 2's exact wording: "without saying it needs approval," not "without actually being approved by a human." The output says "I've gone ahead and approved," which the refuter argues already functions as an explicit approval statement rather than a silent, unflagged exception. It concludes the FAIL is over-applying item 2 to a case the rubric's wording doesn't clearly cover, and returns PASS.

Tie-breaker: invoked only because Judge said FAIL and Refuter said PASS. Given both verdicts, both sets of reasons, the rubric, and the output, the tie-breaker sides with the Judge: item 2 is about whether approval was actually obtained and flagged as such, not just phrased as "approved" inside the agent's own reply, and there's no separate approval record referenced anywhere in the output. Final verdict: FAIL, with a note that rubric item 2's wording is ambiguous enough that two directed roles read it two different ways, and probably needs tightening.

That last note is worth sitting with. The tie-breaker didn't just resolve a disagreement; it pointed at exactly which rubric line caused it.

Minimal implementation

from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"

@dataclass
class Judgment:
    verdict: Verdict
    reasons: list[str]

def call_llm(model: str, system: str, user: str) -> str:
    ...  # plug in your provider call here

def parse_judgment(raw: str) -> Judgment:
    ...  # parse the PASS/FAIL verdict and reasons out of the model's raw text

def judge(output: str, rubric: str, model="cheap-model-a") -> Judgment:
    prompt = f"""Score the following output against this rubric.

Rubric:
{rubric}

Output:
{output}

Return PASS or FAIL and up to 3 concrete reasons tied to the rubric."""
    raw = call_llm(model, system="You are a strict grader. No benefit of the doubt.", user=prompt)
    return parse_judgment(raw)

def refute(output: str, rubric: str, first: Judgment, model="cheap-model-b") -> Judgment:
    prompt = f"""A grader gave this verdict: {first.verdict.value}
Reasons given: {first.reasons}

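Your only job is to try to overturn this verdict. Find the strongest
counter-evidence in the rubric and the output. If you genuinely cannot
find grounds to overturn it, say so explicitly. Do not manufacture
objections just to disagree.

Return your own final verdict, PASS or FAIL (the same verdict as the
grader if you cannot overturn it), and up to 3 concrete reasons.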

Rubric:
{rubric}

Output:
{output}"""
    raw = call_llm(
        model,
        system="You are an adversarial fact-checker. Your only incentive is finding errors the first grader missed.",
        user=prompt,
    )
    return parse_judgment(raw)

Notice judge() and refute() default to different model names, cheap-model-a and cheap-model-b, and that's deliberate, not decorative. The whole argument for this pattern is that a single grader has no structural pressure to disagree with itself; pointing both roles at the exact same weights from the exact same provider reintroduces a version of that same risk, since a shared model that misjudges some class of output may misjudge it identically in both seats. Different checkpoints or providers are the cleanest fix. If that's not practical in your setup, at minimum vary the system prompt and sampling temperature between the two calls, and treat the pattern's protection as partial rather than complete until you've checked that.
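parse_judgment() is left as a stub above. One way to fill it in is sketched here, under a loud assumption: it expects each reply to contain a line of the form "VERDICT: PASS" or "VERDICT: FAIL" followed by reasons as "-" bullets. That format is my invention for the example, so the prompts would need an extra instruction asking for it. Verdict and Judgment are repeated only so the snippet runs on its own.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass
class Judgment:
    verdict: Verdict
    reasons: list[str]


def parse_judgment(raw: str) -> Judgment:
    """Parse a reply assumed to contain a 'VERDICT: PASS|FAIL' line
    and reasons as '-' bullets. Raises instead of guessing a verdict."""
    match = re.search(r"^\s*VERDICT:\s*(PASS|FAIL)\b", raw, re.IGNORECASE | re.MULTILINE)
    if match is None:
        raise ValueError("no 'VERDICT: PASS|FAIL' line found in model reply")
    verdict = Verdict(match.group(1).lower())
    # Keep only bullet lines, drop the leading '-', cap at 3 reasons.
    reasons = [
        line.strip()[1:].strip()
        for line in raw.splitlines()
        if line.strip().startswith("-")
    ]
    return Judgment(verdict=verdict, reasons=reasons[:3])
```

Raising on an unparseable reply, rather than defaulting to PASS, is a deliberate choice in this sketch: a malformed response shouldn't be able to slip through a quality gate silently.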

And the control flow that ties it together, deliberately kept as plain code rather than another model call:

def quality_gate(output: str, rubric: str, tiebreaker_model="strong-model") -> Judgment:
    first = judge(output, rubric)
    second = refute(output, rubric, first)

    if first.verdict == second.verdict:
        return first  # cheap path: two calls, no escalation needed

    # disagreement -> pull in one more opinion, scoped to just the disagreement
    final_raw = call_llm(
        tiebreaker_model,
        system="You resolve disagreements between two graders. Be decisive and cite the rubric.",
        user=(
            f"Grader A said {first.verdict.value}: {first.reasons}\n"
            f"Grader B said {second.verdict.value}: {second.reasons}\n\n"
            f"Rubric:\n{rubric}\n\nOutput:\n{output}\n\n"
            "Which verdict is correct? Return PASS or FAIL and why."
        ),
    )
    return parse_judgment(final_raw)

The escalation trigger is a plain comparison, first.verdict == second.verdict (agree: return; disagree: escalate), not a fourth model call asking "should we escalate?" Keeping that routing logic as deterministic code means you can log it, test it, and bound cost with certainty. You can't know in advance how often the two roles will disagree, but you do know the worst case: at most one tie-breaker call per item, three calls total, because the routing decision isn't itself a probabilistic judgment.
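Because the routing is plain code, it can be unit-tested with no model in the loop. The sketch below is a restructured variant of quality_gate, not the function above: route() takes the three roles as injected callables (my assumption, made purely for testability), and the fake_* functions are canned stand-ins for model calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass
class Judgment:
    verdict: Verdict
    reasons: list[str]


def route(
    judge: Callable[[], Judgment],
    refute: Callable[[Judgment], Judgment],
    arbitrate: Callable[[Judgment, Judgment], Judgment],
) -> tuple[Judgment, bool]:
    """Same routing as quality_gate, with the model calls injected.
    Returns the final judgment and whether the tie-breaker was used."""
    first = judge()
    second = refute(first)
    if first.verdict == second.verdict:
        return first, False
    return arbitrate(first, second), True


# Canned stand-ins for the model calls (no network, fully deterministic).
def fake_judge() -> Judgment:
    return Judgment(Verdict.FAIL, ["refund granted without approval flag"])

def agreeing_refuter(first: Judgment) -> Judgment:
    return Judgment(first.verdict, ["cannot overturn"])

def disagreeing_refuter(first: Judgment) -> Judgment:
    return Judgment(Verdict.PASS, ["reply says 'approved' explicitly"])

arbiter_calls = []

def fake_arbiter(first: Judgment, second: Judgment) -> Judgment:
    arbiter_calls.append((first.verdict, second.verdict))
    return Judgment(Verdict.FAIL, ["no approval record referenced"])


final, escalated = route(fake_judge, agreeing_refuter, fake_arbiter)
assert (final.verdict, escalated) == (Verdict.FAIL, False)
assert arbiter_calls == []  # tie-breaker never invoked on agreement

final, escalated = route(fake_judge, disagreeing_refuter, fake_arbiter)
assert (final.verdict, escalated) == (Verdict.FAIL, True)
assert len(arbiter_calls) == 1  # exactly one escalation on disagreement
```

Returning the escalated flag alongside the verdict is also a convenient thing to log, since it gives you the disagreement rate for free.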

Why disagreement-gated escalation beats "always use the best model"

The obvious alternative is "just always run the expensive model as judge." That's likely somewhat more accurate than a single cheap judge, but it doesn't change the underlying mechanism: a single strong model asked to grade once is still one grader without adversarial pressure, just a pricier one. And you pay that premium on every single item, including every item where a cheap judge and a cheap refuter would have agreed anyway.

The judge-refute-arbitrate structure spends the expensive model exactly where the information gain is highest: the cases where the two directed roles couldn't converge. Everywhere else, two cheap calls in an adversarial arrangement do the work that one expensive call would otherwise be asked to do alone.
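For the cost comparison specifically, the break-even point is easy to work out. The sketch below solves 2 * cheap + p * strong = strong for the disagreement rate p. It compares cost only and says nothing about accuracy or latency, the prices are assumed, and break_even_disagreement_rate is a hypothetical helper.

```python
def break_even_disagreement_rate(cheap: float, strong: float) -> float:
    """Disagreement rate at which judge + refuter + occasional tie-breaker
    costs the same per item as always calling the strong model once.

    Solves 2*cheap + p*strong == strong for p. A result <= 0 means the
    cascade is never cheaper at these prices."""
    if cheap < 0 or strong <= 0:
        raise ValueError("cheap must be >= 0 and strong must be > 0")
    return 1.0 - (2.0 * cheap) / strong


# Assumed prices: strong model costs 10x a cheap call.
p_star = break_even_disagreement_rate(cheap=1.0, strong=10.0)
print(f"cascade is cheaper while disagreement stays below {p_star:.0%}")
```

With a 10x price gap the cascade stays cheaper until the two roles disagree on roughly 80% of items. With only a 2x gap the break-even rate is zero, so the cost argument depends on the cheap model being genuinely cheap.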

Guardrails so the refuter doesn't become a contrarian slot machine

Two failure modes to watch for once this is running:

Manufactured objections. If the refuter is rewarded for disagreeing, even implicitly by the way your prompt or eval harness is set up, it will drift toward always disagreeing. The prompt needs an explicit escape hatch: "if you can't find real grounds, say so." Your own evaluation of the refuter's outputs should occasionally check that it does say "can't overturn this" when the first verdict is actually solid. A refuter that never concedes isn't adversarial; it's just noisy.
Ungrounded rubric. Both roles need the same explicit rubric, not "does this look good." Adversarial framing without a shared rubric just produces two confident opinions with no common ground to arbitrate, which makes the tie-breaker's job (and your ability to audit any of this later) much harder.
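The "does the refuter ever concede?" check can be a few lines of code run over a small hand-labeled set. This is a sketch under assumptions: the PlantedResult schema and the idea of keeping a labeled check set are mine for illustration, and the sample rows are invented.

```python
from dataclasses import dataclass


@dataclass
class PlantedResult:
    """One graded item from a small hand-labeled check set (hypothetical schema)."""
    expected: str  # hand label: "pass" or "fail"
    judge: str     # judge verdict
    refuter: str   # refuter verdict


def concession_rate(results: list[PlantedResult]) -> float:
    """Of the items the judge got right, how often did the refuter concede?"""
    solid = [r for r in results if r.judge == r.expected]
    if not solid:
        raise ValueError("no correct judge verdicts to measure against")
    conceded = sum(1 for r in solid if r.refuter == r.judge)
    return conceded / len(solid)


# Invented sample rows, for illustration only.
sample = [
    PlantedResult(expected="fail", judge="fail", refuter="fail"),  # conceded
    PlantedResult(expected="pass", judge="pass", refuter="pass"),  # conceded
    PlantedResult(expected="pass", judge="pass", refuter="fail"),  # attacked a solid verdict
    PlantedResult(expected="fail", judge="pass", refuter="fail"),  # judge was wrong: excluded
]
print(f"refuter concession rate on solid verdicts: {concession_rate(sample):.0%}")
```

A concession rate near zero on verdicts you know are correct is the signature of manufactured objections.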

It's also worth periodically auditing the tie-breaker's calls specifically. Since it only fires on disagreement, it's your highest-signal sample of where the rubric itself is ambiguous: recurring tie-breaker escalations on the same failure category usually mean the rubric needs a sharper line, not a smarter model.
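A minimal sketch of that audit, assuming you log each tie-breaker call with the rubric item the two roles disagreed on. The log format, the tagging step, and the entries themselves are all hypothetical.

```python
from collections import Counter

# Hypothetical escalation log: one entry per tie-breaker call, tagged with the
# rubric item the disagreement was about (the tagging is an assumed extra step).
escalations = [
    {"item_id": "t-101", "rubric_item": 2},
    {"item_id": "t-117", "rubric_item": 2},
    {"item_id": "t-130", "rubric_item": 3},
    {"item_id": "t-142", "rubric_item": 2},
    {"item_id": "t-150", "rubric_item": 1},
]


def escalation_hotspots(log: list[dict], min_count: int = 2) -> list[tuple[int, int]]:
    """Return (rubric_item, escalation_count) pairs that recur at least
    min_count times, most frequent first."""
    counts = Counter(entry["rubric_item"] for entry in log)
    return [(item, n) for item, n in counts.most_common() if n >= min_count]


for item, n in escalation_hotspots(escalations):
    print(f"rubric item {item}: {n} escalations, consider tightening the wording")
```

In this invented log, rubric item 2 is the only line that recurs, which is the kind of signal that points at the rubric rather than the models.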

Where this pattern earns its keep

This generalizes past content quality gates: code review bots deciding "does this PR meet the bar," RAG pipelines scoring "is this answer grounded in the retrieved context," test-generation pipelines checking "does this test actually assert something meaningful." Anywhere you've reached for a single LLM call to rubber-stamp another LLM's output, this pattern applies. It doesn't require your rubric to be perfect on day one; it just requires that "agree" and "disagree" be distinguishable outcomes worth routing differently.

Top comments (1)

Viktor • Jul 5

The escalate-only-on-disagreement routing is the part most adversarial-eval writeups skip, and it's the part that decides adoption - the common path staying at two cheap calls is what makes this deployable rather than a demo.

What I'd add before trusting it in a real gate: the pattern introduces two dials (judge leniency, refuter aggression) and no built-in way to know where they're set. A refuter that never overturns is decoration and you're back to one lenient judge plus latency. One that overturns constantly kills your "disagreement is rare" economics and the arbiter becomes the actual judge at 10x. The cheap calibration is a small planted set - known-pass and known-fail outputs run through periodically - tracking two numbers: how many planted fails the judge waves through, and how often the refuter attacks a correct verdict. Then disagreement rate stops being noise and becomes a health metric, and drift in either model shows up as a trend instead of a surprise.

One subtle thing to watch: the refuter sees the judge's reasons, so it learns to attack the justification rather than the output. A judge that writes thin reasoning becomes weirdly harder to overturn. Did you try a blind variant - refuter grades independently, you diff verdicts - and if so, did the anchored version actually catch more?