
Ken W Alger

Posted on • Originally published at kenwalger.com

Who Audits the Auditors? Building an LLM-as-a-Judge for Agentic Reliability

We’ve built a powerful Forensic Team. They can find books, analyze metadata, and spot discrepancies using MCP.

But in the enterprise, 'it seems to work' isn't a metric. If an agent misidentifies a $50,000 first edition, the liability is real.

Today, we move from Subjective Trust to Quantitative Reliability. We are building The Judge—a high-reasoning evaluator that audits our Forensic Team against a 'Golden Dataset' of ground-truth facts.

Before You Begin

Prerequisites: You should have an existing agentic workflow (see my MCP Forensic Series) and a high-reasoning model (e.g., Claude Opus or GPT-4o) to act as the Judge.

1. The "Golden Dataset"

Before we can grade the agents, we need an Answer Key. We’re creating tests/golden_dataset.json. This file contains the "Ground Truth"—scenarios where we know there are errors.

Example Entry:

{
  "test_id": "TC-001",
  "input": "The Great Gatsby, 1925",
  "expected_finding": "Page count mismatch: Observed 218, Standard 210",
  "severity": "high"
}
Director's Note: In an enterprise setting, "Reliability" is the precursor to "Permission". You will not get the budget to scale agents until you can prove they won't hallucinate $50k errors. This framework provides the data you need for that internal sell.

2. The Judge's Rubric

A good Judge needs a rubric. We aren't just looking for "Yes/No." We want to grade on:

  • Precision: Did it find only the real errors?
  • Recall: Did it find all the real errors?
  • Reasoning: Did it explain why it flagged the record?
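
To make the first two axes concrete, here is a minimal scoring sketch. It assumes the Judge has already normalized and matched findings, so equal strings mean the same discrepancy; the Reasoning axis stays a qualitative grade from the Judge itself.

def score_findings(agent_findings: set, golden_findings: set) -> dict:
    # Precision: of everything the agent flagged, how much was real?
    # Recall: of everything that was real, how much did the agent flag?
    true_positives = agent_findings & golden_findings
    precision = len(true_positives) / len(agent_findings) if agent_findings else 0.0
    recall = len(true_positives) / len(golden_findings) if golden_findings else 0.0
    return {"precision": round(precision, 2), "recall": round(recall, 2)}

# e.g. the agent found the page-count mismatch but also flagged a phantom error:
# score_findings({"page_count_mismatch", "phantom_error"}, {"page_count_mismatch"})
# -> {"precision": 0.5, "recall": 1.0}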

3. Refactoring for Resilience

Before building the Judge, we had to address a common "Senior-level" trap: hardcoding agent logic. Based on architectural reviews, we moved our system prompts from the Python client into a dedicated config/prompts.yaml.

This isn't just about clean code; it’s about Observability. By decoupling the "Instructions" from the "Execution," we can now A/B test different prompt versions against the Judge to see which one yields the highest accuracy for specific models.
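
As a rough illustration of that decoupling (the YAML layout and key names below are hypothetical, not the repo's exact structure), the client loads a versioned prompt at runtime instead of hardcoding it:

import yaml  # pip install pyyaml

# Hypothetical config/prompts.yaml:
#
# forensic_auditor:
#   v1:
#     system: "You are a forensic book auditor. Flag any metadata discrepancies."
#   v2:
#     system: "You are a forensic book auditor. Flag discrepancies and cite evidence."

def load_prompt(agent: str, version: str, path: str = "config/prompts.yaml") -> str:
    with open(path) as f:
        return yaml.safe_load(f)[agent][version]["system"]

# A/B testing against the Judge is now just a matter of swapping the version key.
system_prompt = load_prompt("forensic_auditor", "v2")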

4. The Implementation: The Evaluation Loop

We’ve added evaluator.py to the repo. It doesn't just run the agents; it monitors their "vital signs."

  • Error Transparency: We replaced "swallowed" exceptions with structured logging. If a provider fails, the system logs the incident for diagnosis instead of failing silently.
  • The Handshake: The loop runs the Forensic Team, collects their logs, and submits the whole package to a high-reasoning Judge Agent.
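
In condensed form, the loop looks something like the sketch below. run_forensic_team and ask_judge are placeholders for whatever your agent stack and Judge call actually expose; this shows the shape of the loop, not the repo's evaluator.py line for line.

import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("evaluator")

def evaluate(run_forensic_team, ask_judge, dataset_path="tests/golden_dataset.json"):
    with open(dataset_path) as f:
        golden = json.load(f)

    results = []
    for case in golden:
        try:
            # Run the Forensic Team and capture its finding plus its trace logs.
            finding, trace = run_forensic_team(case["input"])
        except Exception:
            # Error transparency: log the incident for diagnosis instead of failing silently.
            logger.exception("Provider failure on %s", case["test_id"])
            results.append({"test_id": case["test_id"], "status": "error"})
            continue

        # The handshake: the whole package goes to the high-reasoning Judge for scoring.
        verdict = ask_judge(
            expected=case["expected_finding"],
            actual=finding,
            trace=trace,
            severity=case["severity"],
        )
        results.append({"test_id": case["test_id"], "status": "scored", **verdict})
    return results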

The Evaluator-Optimizer Blueprint

This diagram represents our move from "Does the code run?" to "Does the intelligence meet the quality bar?" This closed-loop system is required before we can start the fiscal optimization of choosing smaller models to handle simpler tasks.

Architectural diagram of an AI Evaluator-Optimizer loop. It shows a Golden Dataset feeding into an Agent Execution layer, which then passes outputs and logs to a Judge Agent for scoring against a rubric. The final Reliability Report provides a feedback loop for prompt tuning and iterative improvement.

The Evaluator-Optimizer Loop: moving from manual vibe-checks to automated, quantitative reliability scoring.

Director-Level Insight: The "Accuracy vs. Cost" Curve

As a Director, I don't just care about "cost per token." I care about Defensibility. If a forensic audit is challenged, I need to show a historical accuracy rating. By implementing this Evaluator, we move from "Vibe-checking" to a Quantitative Reliability Score. This allows us to set a "Minimum Quality Bar" for deployment. If a model update or a prompt change drops our accuracy by 2%, the Judge blocks the deployment.
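
A minimal sketch of that deployment gate, assuming the Evaluator writes its scores to a JSON report (the file names and the 2% threshold are illustrative):

import json
import sys

MAX_ACCURACY_DROP = 0.02  # the "Minimum Quality Bar" described above

def gate_deployment(report="reliability_report.json", baseline="reliability_baseline.json"):
    with open(report) as f:
        current = json.load(f)["accuracy"]
    with open(baseline) as f:
        previous = json.load(f)["accuracy"]
    if previous - current > MAX_ACCURACY_DROP:
        # Block the deployment and surface the regression in CI.
        sys.exit(f"Blocked: accuracy dropped {previous - current:.1%} below baseline")
    print(f"Cleared: accuracy {current:.1%} (baseline {previous:.1%})")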

The Production-Grade AI Series

  • Post 1: The Judge Agent — You are here
  • Post 2: The Accountant (Cognitive Budgeting & Model Routing) — Coming Soon
  • Post 3: The Guardian (Human-in-the-Loop Handshakes) — Coming Soon

Looking for the foundation? Check out my previous series: The Zero-Glue AI Mesh with MCP.

Top comments (6)

Muhammad Zubair Bin Akbar

This is a great move from “it works” to “we can prove it works,” which is exactly what enterprise AI needs.

I like the focus on a Golden Dataset with known failures; most teams skip that and only test happy paths. The precision/recall + reasoning rubric is also spot on, especially since weak explanations can become a real issue during audits.

Decoupling prompts into config is a nice touch too; it makes proper A/B testing and iteration actually possible.

One thing I’m curious about: how do you handle partial matches or differently worded outputs when scoring? That part can get tricky if the Judge isn’t tightly constrained.

Overall, this feels like a solid foundation for building measurable and defensible agent reliability.

Ken W Alger

That is a great question, and you’ve highlighted the exact reason why traditional string-matching fails in enterprise AI. To handle the 'semantic overlap' without the Judge hallucinating its own criteria, I generally use a four-layered approach:

  1. Semantic Embedding Checks (The Pre-Filter)
    Before the Judge even sees the text, I often run a quick cosine similarity check between the agent’s output and the Golden Dataset. If the score is high (e.g., >0.92) but the words don't match, it tells the system we have a 'differently worded but likely correct' candidate. This helps prioritize which outputs need the most 'reasoning' from the Judge. (There's a minimal sketch of this pre-filter after the list.)

  2. The 'Facts vs. Style' Rubric
    In the Judge's prompt, I explicitly decouple Factuality from Fluency. I instruct the Judge to extract the 'claims' from the agent's response and verify them individually against the Golden Dataset.
    • Example: If the agent says 'The book was printed in 1884' and the Golden Data says 'Date of Publication: 1884,' the Judge scores that as a 1/1 match for precision, regardless of the surrounding prose.

  3. Few-Shot 'Edge Case' Examples
    The best way to tighten a Judge is to provide it with 3–5 examples of 'Partial Matches' in the prompt itself. I show it an example of a 'technically correct but incomplete' answer and an example of a 'differently worded but perfect' answer. This 'sets the bar' for what constitutes a win.

  4. Forced Reasoning (CoT)
    As you noted, weak explanations are a liability. I force the Judge to write the 'Evidence' for its score before it outputs the numerical grade. If the Judge can't cite a specific discrepancy in its reasoning, the score is flagged for human review.
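
For what it's worth, the pre-filter from point 1 is only a few lines. embed() below stands in for whatever embedding callable you already use (OpenAI, SBERT, etc.), and 0.92 is a starting point to tune per model:

import numpy as np

SIMILARITY_FLOOR = 0.92  # tune per embedding model

def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def prefilter(agent_output: str, golden_finding: str, embed) -> str:
    # Exact string match needs no Judge reasoning at all.
    if agent_output.strip().lower() == golden_finding.strip().lower():
        return "exact_match"
    # High similarity with different wording: likely correct, prioritize for the Judge.
    if cosine(embed(agent_output), embed(golden_finding)) >= SIMILARITY_FLOOR:
        return "likely_correct_rewording"
    return "full_judge_review"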

It’s definitely a balancing act—too loose and you get false positives; too tight and you’re back to hardcoded regex. I’ve found that focusing the Judge on claim extraction is the most defensible path for an audit trail!

Muhammad Zubair Bin Akbar

That makes a lot of sense, especially the claim extraction piece. Framing it as “facts vs. style” feels like the right abstraction; it keeps the Judge focused on what actually matters instead of getting distracted by wording.

I also like the idea of using embeddings as a pre-filter rather than the final decision-maker. It’s a nice way to reduce unnecessary load on the Judge without over-trusting similarity scores.

The few-shot edge cases point is interesting too; it’s easy to underestimate how much those examples actually shape the Judge’s behavior, especially around partial correctness.

Out of curiosity, have you run into situations where the claim extraction itself becomes inconsistent? I could see that being another layer where things drift a bit depending on how the model interprets “claims.”

Ken W Alger

You’ve touched on the 'meta-problem' of LLM-as-a-Judge: who audits the claim extractor? You’re absolutely right—if the model interprets 'claims' differently every time, your precision/recall metrics become noise.

In my experience, 'claim drift' usually happens when the prompt is too open-ended (e.g., 'List all claims'). To stop that drift, I’ve moved toward a Constrained Extraction Pipeline:

  1. The 'Atomic Claim' Protocol
    I instruct the extractor to only output Atomic Claims—single, verifiable statements that cannot be broken down further.

• Bad Claim: 'The book was an 1884 first edition in good condition.' (Too many variables).
• Atomic Claims: 1. 'Pub Date is 1884.' 2. 'Edition is First.' 3. 'Condition is Good.'

This makes the 'Fact vs. Style' comparison a 1:1 binary match, which is much easier for a downstream Judge to score consistently.

  2. Schema Enforcement (Pydantic/Zod)
    I never let the extractor return raw text. I force it into a strict JSON schema where each claim must be categorized (e.g., Date, Identifier, Physical_Trait). By forcing the model to 'pigeonhole' its thoughts, you drastically reduce the chance of it getting 'creative' with how it phrases a claim. (A rough schema sketch follows this list.)

  3. The 'Decomposition' Sanity Check
    For high-stakes audits, I actually run the extraction twice (sometimes with a smaller, faster model like Haiku or GPT-4o-mini) and have a simple logic gate check if the number of claims matches. If Model A finds 4 claims and Model B finds 7, the system flags it for 'Extraction Variance' before the Judge even gets involved.

  4. Reference-Anchored Extraction
    Instead of saying 'Extract claims,' I say 'Extract claims that relate to these specific keys in the Golden Dataset.' This anchors the model's 'attention' to only what matters for the audit, preventing it from wasting reasoning tokens on stylistic filler.
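
As a rough sketch of the schema enforcement from point 2 (assuming Pydantic v2; the categories are the ones mentioned above):

from enum import Enum
from pydantic import BaseModel

class ClaimCategory(str, Enum):
    DATE = "Date"
    IDENTIFIER = "Identifier"
    PHYSICAL_TRAIT = "Physical_Trait"

class AtomicClaim(BaseModel):
    category: ClaimCategory
    statement: str  # e.g. "Pub Date is 1884"

class ExtractionResult(BaseModel):
    claims: list[AtomicClaim]

# Anything the extractor returns that doesn't validate against this schema
# is rejected before the Judge ever sees it.
raw = '{"claims": [{"category": "Date", "statement": "Pub Date is 1884"}]}'
result = ExtractionResult.model_validate_json(raw)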

Ultimately, the goal is to move the extraction from a 'creative summary' to a 'data parsing' task. It’s definitely an extra layer of engineering, but for defensible agent reliability, that extra layer is the only thing that keeps the metrics honest!

Jill Mercer

ken, you're calling me out with the vibe-check comment — i usually just ship and pray. i’m still figuring it out in cursor, but keeping the context straight is getting harder as my apps grow. a judge agent feels like the right way to stop the guessing game when the vibes aren't enough. thanks for showing how to build the bridge between shipping fast and actually knowing it’s right.

Ken W Alger

That 'ship and pray' era was fun, but as the stakes get higher, the 'praying' part gets a lot more stressful. I think we're seeing a shift where the best developers won't be the ones who can code the fastest, but the ones who can build the best governance loops around what they ship.

Once the app grows, the 'context' isn't just a technical limit, it’s the integrity of the whole system. Glad the 'Judge' concept resonated with you. It’s definitely saved me from a few 'vibe-only' hallucinations.