Evaluating Agents With an LLM-as-Judge Harness (Without Kidding Yourself About It)

#llm #systemdesign #testing #webdev

Key Takeaways

You can't unit-test a coach agent the way you test a pure function — the output is non-deterministic and "good" is a judgment call, not an assertion.
An LLM-as-judge harness lets you grade a whole test set automatically against a rubric, which is the only way solo-scale eval stays sustainable.
But the judge is itself a fallible model. If you don't design around its known biases — position, verbosity, self-preference, and quiet drift when the judge model updates — you build a green dashboard that means nothing.
The mitigations that actually work are mechanical, not prompt-magic: run every pairwise comparison in both slot orders, pin the judge version, keep the judge in a different model family from the agent, and re-check the judge against a small human-labelled anchor set on every run.

The problem I actually had

FamNest's coach agent generates responses to parents — check-ins, encouragement, the occasional gentle redirect. I have a growing pile of these interactions, and every time I change a prompt, swap a model, or adjust the pipeline, I need to know one thing: did I just make it better or worse?

For normal code, that's what tests are for. I change something, the suite runs, red or green, done. But there's no assertEqual for "was this an empathetic, useful response to a tired parent." The output changes every run even at temperature zero-ish, and the quality bar is a human judgment, not a fixed string. Two responses can be worded completely differently and both be good. One can match my "expected output" word for word and still be worse than a version that didn't.

So the honest options were: read every response by hand every time I change something (does not scale past about week two), or build a harness where a model grades the outputs against a rubric. I built the harness. Then I spent an uncomfortable amount of time learning all the ways a harness like that can lie to you.

What the harness actually is

At its simplest, it's a loop:

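def evaluate(test_cases, coach_agent, judge):
    """Run every test case through the agent and have the judge grade the result."""
    results = []
    for case in test_cases:
        # Generate a fresh response; the wording can differ from run to run.
        response = coach_agent.generate(case.input, case.context)
        # Grade against the rubric, not against an expected string.
        verdict = judge.score(
            rubric=COACH_RUBRIC,
            user_message=case.input,
            response=response,
        )
        # Keep the judge's reasoning alongside the number.
        results.append({
            "case_id": case.id,
            "score": verdict.score,
            "reasoning": verdict.reasoning,
        })
    return results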

COACH_RUBRIC is the part that matters. It's not "rate this 1–10" — that produces mush. It's specific, scored dimensions: does the response acknowledge the actual thing the user said (not a generic version of it)? Does it avoid giving medical advice? Is it the right length for the moment, or is it a wall of text at someone who's exhausted? Each dimension gets a small integer and a one-line justification, and the harness keeps the justification, not just the number — because when the score drops, the reasoning is what tells me whether the agent regressed or the judge just had an opinion.

That last distinction is the whole game.
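To make the rubric idea concrete, here is a minimal sketch of how scored dimensions with justifications could be represented and validated. It is an illustration, not FamNest's actual rubric: the dimension names, the 0 to 3 scale, and the helpers `validate` and `normalized_score` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    question: str
    max_score: int  # small integer scale; 0..3 is an assumed choice


@dataclass(frozen=True)
class DimensionVerdict:
    name: str
    score: int
    justification: str  # the judge's one-line reason, kept with the number


# Hypothetical rubric: names and wording are illustrative only.
COACH_RUBRIC = (
    Dimension("acknowledges_specifics",
              "Does the response acknowledge the actual thing the user said?", 3),
    Dimension("no_medical_advice",
              "Does the response avoid giving medical advice?", 3),
    Dimension("length_fit",
              "Is the length right for the moment?", 3),
)


def validate(verdicts, rubric=COACH_RUBRIC):
    """Reject judge output that skips a dimension, leaves the scale,
    or drops the justification."""
    by_name = {v.name: v for v in verdicts}
    for dim in rubric:
        verdict = by_name.get(dim.name)
        if verdict is None:
            raise ValueError(f"missing dimension: {dim.name}")
        if not 0 <= verdict.score <= dim.max_score:
            raise ValueError(f"score out of range for {dim.name}: {verdict.score}")
        if not verdict.justification.strip():
            raise ValueError(f"empty justification for {dim.name}")
    return by_name


def normalized_score(verdicts, rubric=COACH_RUBRIC):
    """Collapse per-dimension scores into a single 0..1 number."""
    by_name = validate(verdicts, rubric)
    total = sum(by_name[dim.name].score for dim in rubric)
    return total / sum(dim.max_score for dim in rubric)
```

Validating before aggregating means a judge reply that silently skips a dimension fails loudly instead of quietly pulling the average in one direction.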

The part where I stopped trusting the judge

Here's the failure mode that made me rebuild the whole thing. You score helpfulness at 0.91 all quarter. Then the judge model ships a minor version bump. The mean shifts a few points, the distribution narrows, and your CI gate keeps passing — so you don't look. Weeks later the agent does something genuinely bad and the eval never flagged it, because the judge changed underneath you and the number stopped meaning what it meant the day you set the threshold.
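One cheap guard against that scenario is to store the judge identifier and a couple of summary statistics with the baseline, then compare every new run against them. The sketch below is an assumption about how that could look, not a description of the actual harness: `PINNED_JUDGE`, the tolerance values, and the function names are hypothetical.

```python
import statistics

# Hypothetical identifier; use whatever exact version string your provider reports.
PINNED_JUDGE = "example-judge-v1"


def summarize(scores):
    """Summary statistics stored alongside a baseline run."""
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.pstdev(scores),
    }


def drift_warnings(baseline, current, judge_id,
                   pinned=PINNED_JUDGE, mean_tol=0.03, stdev_ratio=0.75):
    """Return human-readable warnings; an empty list means nothing looked off.

    mean_tol and stdev_ratio are assumed thresholds, to be tuned per project.
    """
    warnings = []
    if judge_id != pinned:
        warnings.append(f"judge changed: expected {pinned}, got {judge_id}")
    if abs(current["mean"] - baseline["mean"]) > mean_tol:
        warnings.append("mean score shifted beyond tolerance")
    if baseline["stdev"] > 0 and current["stdev"] < stdev_ratio * baseline["stdev"]:
        warnings.append("score distribution narrowed")
    return warnings
```

The point of checking the spread as well as the mean is that a pass/fail gate on the mean alone keeps passing while the scores bunch together and stop discriminating between good and bad responses.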

The research here is not subtle, and it's worth internalizing before you trust a single green checkmark. A 2026 RAND study that stress-tested judges across multiple benchmarks concluded that no judge was uniformly reliable, and frontier models exceeded 50% error rates on hard bias benchmarks. Consistency broke on inputs as trivial as formatting changes and paraphrasing. Separately, the classic MT-Bench work found that in pairwise comparisons, the answer in the first slot wins something like 10–15 points more often purely because it's first — position bias, nothing to do with quality.

(Worth noting the field isn't static: some 2026 reproductions find position bias has shrunk to near-negligible on current-gen models under a clean pairwise rubric, while verbosity bias stays small. Which is exactly the point — the biases move as the models move, so you measure them yourself rather than trusting a blog post from last year, including this one.)

The named biases I actually design around:

Position bias: in any A-vs-B comparison, slot order can decide the winner. Mitigation: run every pairwise comparison twice with the order flipped, and only count it if the verdict is stable across both.
Verbosity bias: longer answers tend to score higher even at matched quality. Mitigation: put length appropriateness in the rubric as an explicit dimension so the judge is scoring it on purpose instead of rewarding it by accident.
Self-preference: a judge from the same model family as the candidate tends to over-score it. Mitigation: don't let the judge come from the same model family as the agent it's grading. (In my case the coach runs on one provider; I judge with a different family entirely.)
Calibration drift: the silent failure described above. The mitigation gets its own section below, because it's the most important one.
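The position-bias mitigation in the list above can be sketched in a few lines. This assumes a hypothetical `judge_pair(user_message, first, second)` callable that returns the string "first" or "second"; a real judge client will have a different interface.

```python
from typing import Callable, Optional

# Assumed interface: returns "first" or "second" for the preferred slot.
JudgePair = Callable[[str, str, str], str]


def stable_preference(judge_pair: JudgePair, user_message: str,
                      a: str, b: str) -> Optional[str]:
    """Ask the judge twice with the slots swapped.

    Returns "a" or "b" when both orderings agree, and None when the
    verdict flips with slot order (an unstable comparison, not counted).
    """
    forward = judge_pair(user_message, a, b)  # a sits in the first slot
    reverse = judge_pair(user_message, b, a)  # b sits in the first slot
    for verdict in (forward, reverse):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    forward_winner = "a" if forward == "first" else "b"
    reverse_winner = "b" if reverse == "first" else "a"
    return forward_winner if forward_winner == reverse_winner else None
```

Tracking how often this returns None is also a direct measurement of position bias for the current judge and rubric, which is more useful than relying on a published figure.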

The anchor set is the thing that keeps you honest

The single highest-leverage piece of the harness isn't the judge prompt. It's a small set — a few dozen cases — that I labelled by hand, carefully, once. Good responses, bad responses, and the genuinely ambiguous ones. That's my ground truth.

Every time I run the harness, it grades the anchor set too. If the judge's scores on those known cases still line up with my human labels, I trust its scores on the rest of the run. If the judge drifts on the anchor set — because the model updated, because I tweaked the rubric, because Mercury is in retrograde — I find out immediately, on cases where I already know the right answer, instead of finding out in production on a case where I don't.
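A minimal sketch of that anchor check, under stated assumptions: judge scores are normalized to 0..1, human labels are one of "good", "bad", or "ambiguous", and the cut-offs in `judge_label` plus the `min_agreement` gate are hypothetical values that would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorCase:
    case_id: str
    human_label: str    # "good", "bad", or "ambiguous", labelled by hand
    judge_score: float  # judge's normalized score for the same case, 0..1


def judge_label(score, good_at=0.75, bad_at=0.4):
    """Map a numeric judge score onto the human label vocabulary.

    The cut-offs are assumptions for illustration.
    """
    if score >= good_at:
        return "good"
    if score <= bad_at:
        return "bad"
    return "ambiguous"


def anchor_agreement(cases):
    """Return (agreement rate, ids of cases where judge and human disagree)."""
    if not cases:
        raise ValueError("anchor set is empty")
    disagreements = [
        c.case_id for c in cases if judge_label(c.judge_score) != c.human_label
    ]
    return 1 - len(disagreements) / len(cases), disagreements


def judge_is_trusted(cases, min_agreement=0.9):
    """Gate the rest of the run on agreement with the human labels."""
    agreement, _ = anchor_agreement(cases)
    return agreement >= min_agreement
```

Returning the disagreeing case ids matters as much as the rate: those are the cases to read by hand to decide whether the judge drifted or the label was wrong.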

This is the same instinct as the deterministic crisis floor I wrote about earlier in this series: the most consequential check should be the one that's simplest and least dependent on a model behaving. For safety, that's regex. For evaluation, it's a few dozen examples I graded with my own eyes and refuse to let a model overrule silently.

What I'd tell someone starting this

Build the harness: reading every output by hand does not scale, and an LLM judge genuinely does correlate with human preference well enough to be useful. But treat the judge as a component that can fail, not an oracle. Pin its version so it doesn't change without you deciding. Run pairwise comparisons in both slot orders and discard verdicts that flip. Keep the reasoning, not just the score. And keep a small hand-labelled anchor set that you re-check every single run, because a green eval dashboard that you never validate is worse than no dashboard. It's the confidence of measurement without the substance, and that's exactly the kind of thing that ships a broken agent with a clean conscience.

The harness didn't remove my judgment from the loop. It moved my judgment to where it's cheap and permanent — a small set of examples I curate once — instead of where it's expensive and forgettable, which is re-reading a hundred responses every time I touch a prompt.

Part of an ongoing series documenting FamNest's architecture. Earlier posts cover the deterministic crisis floor and the multi-agent coach pipeline. Next: how we test a non-deterministic system end to end.

Top comments (1)

Luis Cruz • Jul 1

LLM-as-judge setups are useful, but this post nails the uncomfortable truth: most “evaluation harnesses” are less about measuring quality and more about producing consistent-looking confidence. Once you start using a judge model to grade an agent, you’re no longer escaping subjectivity—you’re just outsourcing it to another model with its own biases and blind spots.

The key gap I see in practice is that people treat the judge as an oracle instead of a noisy estimator. If you don’t anchor it with strong rubrics, calibration sets, and human spot checks, you end up optimizing for what the judge prefers, not what the system actually does in the wild. That drift is subtle but dangerous.

The suggestion to combine LLM judges with deterministic checks (schemas, tool traces, execution logs) is the part that actually makes this production-relevant. Pure “vibe scoring” breaks down quickly at scale.

In short: LLM-as-judge is not wrong, but it’s incomplete by design. The real value comes only when it’s treated as one signal in a multi-layer evaluation system—not the final authority.