Building a production LLM Judge: lessons from the enterprise audit engine

#ai #mcp #agents #langgraph

When I was building the enterprise audit engine, the LLM Judge was the last thing I
planned to add. It felt like over-engineering. The main agent already had MCP tool
access to live device state, a policy file to reason against, and a LangGraph state
machine keeping it on track. That felt like enough.
Then during testing, the agent correctly flagged a non-compliant firmware version —
and recommended a remediation action that belonged to a completely different device
category. The reasoning was internally consistent. It just used the wrong rule.
Nothing in the pipeline caught it. The output looked clean. Without a second check, that
would have gone to the user.
That's when I added the Judge. Here's how it works and what I learned building it.

What the Judge actually does
The Judge is a separate LLM call that runs after every agent response, before anything
reaches the user. It gets two inputs:
• The agent's output — the proposed compliance verdict and any suggested
remediation
• A fresh read of the policy file — not pulled from the agent's context, read
independently
Its job is to compare those two things and decide whether the agent's reasoning holds
up against the actual policy. It's not re-running the audit. It's checking whether the
answer the agent produced is consistent with the rules the agent was supposed to
apply.
If the reasoning checks out, the output moves forward. If something doesn't line up, the
Judge blocks it, logs the mismatch, and returns an error instead of a verdict.
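A minimal sketch of that flow, under stated assumptions: `run_judge`, `JudgeResult`, the prompt wording, and the JSON reply format are all hypothetical, and `call_llm` stands in for whatever model client the project actually uses. The one deliberate choice shown here is failing closed, so an unparseable Judge reply blocks the output instead of passing it.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class JudgeResult:
    passed: bool
    mismatch: Optional[str] = None  # populated only when the Judge blocks


def run_judge(agent_output: dict, policy_path: Path,
              call_llm: Callable[[str], str]) -> JudgeResult:
    """Compare the agent's proposed verdict against a fresh read of the policy."""
    # Input 2: read the policy from disk, never from the agent's context.
    policy_text = policy_path.read_text(encoding="utf-8")

    # Hypothetical prompt; the real prompt design is not shown in the post.
    prompt = (
        "You are verifying a compliance verdict against policy.\n"
        "Do not re-run the audit. Only check that the verdict and remediation\n"
        "are consistent with the policy text.\n\n"
        f"POLICY:\n{policy_text}\n\n"
        f"AGENT OUTPUT:\n{json.dumps(agent_output, indent=2)}\n\n"
        'Reply with JSON: {"passed": true|false, "mismatch": "<reason or null>"}'
    )
    raw = call_llm(prompt)

    try:
        parsed = json.loads(raw)
        passed = parsed["passed"] is True
        mismatch = parsed.get("mismatch")
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fail closed: an unparseable Judge reply blocks the output.
        return JudgeResult(passed=False, mismatch="judge reply was not valid JSON")
    return JudgeResult(passed=passed, mismatch=None if passed else mismatch)
```

Passing `call_llm` in as a plain callable keeps the sketch independent of any particular model SDK and makes the Judge easy to exercise with a stubbed reply.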

Why independence matters more than intelligence
The design decision that makes the Judge actually useful is a simple one: it doesn't
share context with the main agent.
Most multi-step agent pipelines accumulate context as they run — retrieved chunks,
intermediate reasoning, tool outputs. By the time the agent produces its final answer, it's
been reasoning inside a specific context window for several steps. That context shapes
what the agent sees as plausible.
If the Judge reads from that same context window, it's not really checking anything
independently. It's just re-reading the same information through a slightly different
prompt. Whatever bias or error the agent accumulated, the Judge inherits it too. That's a
rubber stamp, not a check.
The fix is straightforward. The Judge reads the policy file directly. Not from cache, not
from whatever the agent retrieved — from the file. Every time. That's what gives it the
ability to catch the category mismatch the main agent missed. The agent had
accumulated enough context during its reasoning loop that the wrong rule seemed
plausible. The Judge, reading the policy fresh, saw the inconsistency immediately.
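The independence rule can be sketched as a small function that decides what the Judge is allowed to see. The names here (`load_policy_fresh`, `build_judge_input`, and the state keys `verdict`, `remediation`, `messages`) are assumptions for illustration, not the project's actual schema.

```python
from pathlib import Path


def load_policy_fresh(policy_path: Path) -> str:
    """Read the policy from disk on every call; deliberately no caching."""
    return policy_path.read_text(encoding="utf-8")


def build_judge_input(agent_state: dict, policy_path: Path) -> dict:
    """Give the Judge only the final output plus a fresh policy read."""
    return {
        "verdict": agent_state["verdict"],
        "remediation": agent_state.get("remediation"),
        "policy": load_policy_fresh(policy_path),
        # Intentionally omitted: agent_state["messages"], retrieved chunks,
        # tool outputs, or any policy text the agent cached along the way.
    }
```

Whitelisting the fields the Judge receives, rather than passing the whole state and trusting the prompt to ignore the rest, is what keeps the agent's accumulated context from leaking in.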

How it sits in the LangGraph state machine
The Judge is a node in the LangGraph graph, not a wrapper around the graph. That
distinction matters for how state flows through the system.
After check_compliance runs and produces a verdict, the graph routes to llm_judge
before doing anything else. The Judge node reads the policy file, compares it against
the proposed verdict, and sets a pass/fail flag in the graph state.
From there the graph branches:
• FAIL — the output is blocked, the error is logged to AgentOps, and the response
returned to the user explains that the verdict couldn't be verified
• PASS — the graph moves to suggest_remediation, where the agent proposes
the corrective action
After suggest_remediation, there's a hard gate. The graph does not automatically
proceed to execute_remediation. The only path forward is a human clicking Approve in
the Streamlit UI. That approval event triggers the final node, which is the only place in
the entire graph where a write operation touches the database.
The Judge and the HITL gate are separate concerns. The Judge is about correctness
— did the agent apply the right rule? The gate is about authorization
— did a human sign off on the action? Both are necessary and neither replaces the other.
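The two branch points described above can be sketched as plain routing functions over the graph state. This is an illustration under assumptions: the state keys (`judge_passed`, `human_approved`) and the `blocked` and `await_approval` labels are hypothetical, and the trailing comments show one way such functions could be wired into a LangGraph `StateGraph`, not necessarily how the project does it.

```python
from typing import Optional, TypedDict


class AuditState(TypedDict, total=False):
    verdict: str
    judge_passed: bool
    judge_mismatch: Optional[str]
    human_approved: bool


def route_after_judge(state: AuditState) -> str:
    """Conditional edge out of llm_judge: a missing flag counts as FAIL."""
    return "suggest_remediation" if state.get("judge_passed") is True else "blocked"


def route_after_suggestion(state: AuditState) -> str:
    """Hard gate: only an explicit human approval reaches the write node."""
    if state.get("human_approved") is True:
        return "execute_remediation"
    return "await_approval"


# Possible wiring with LangGraph's StateGraph builder (assumed, not from the post):
#   builder.add_edge("check_compliance", "llm_judge")
#   builder.add_conditional_edges(
#       "llm_judge", route_after_judge,
#       {"suggest_remediation": "suggest_remediation", "blocked": END})
#   graph = builder.compile(checkpointer=checkpointer,
#                           interrupt_before=["execute_remediation"])
```

Keeping the two routers separate mirrors the split between correctness and authorization: a Judge PASS never implies approval, and an approval is never consulted unless the Judge has already passed.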

What it logs
Every Judge decision gets written to AgentOps as a structured event: the agent's
proposed verdict, the policy sections the Judge checked against, the final pass/fail, and
if it failed, the specific mismatch it found.
That logging turned out to be more useful than I expected. During development it made
debugging much faster — you could see exactly what the Judge was checking and why
it made the call it did. For a production enterprise system, it also creates a full audit trail
for every compliance decision the system touches.
If someone asks later why a device was flagged or cleared, there's a complete record:
what the agent found, what the Judge verified, and who approved the remediation. That
kind of traceability is hard to retrofit once a system is in production.
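As a rough illustration, the structured event could be modeled as a small dataclass that serializes to JSON. The class and field names are assumptions, and the call that actually ships the event to AgentOps is left out on purpose because the post does not show it.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class JudgeEvent:
    proposed_verdict: str
    policy_sections: List[str]
    passed: bool
    mismatch: Optional[str] = None      # required when the Judge fails
    approved_by: Optional[str] = None   # filled in later by the approval gate
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # A failure without a recorded reason is useless for an audit trail.
        if not self.passed and not self.mismatch:
            raise ValueError("a failed Judge decision must record the mismatch")

    def to_json(self) -> str:
        """Serialize to a JSON string suitable for a structured log sink."""
        return json.dumps(asdict(self), sort_keys=True)


event = JudgeEvent(
    proposed_verdict="non_compliant",
    policy_sections=["firmware.routers"],  # hypothetical section id
    passed=False,
    mismatch="remediation belongs to a different device category",
)
print(event.to_json())
```

Rejecting a failed event that has no mismatch at construction time means gaps in the audit trail show up during development, not when someone asks for the record later.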

What I'd do differently
The main thing I'd change is having the Judge return a confidence score that's compared
against a threshold, rather than a binary pass/fail.
Right now the Judge either passes or blocks. That works for clear-cut cases, but there's
a middle category (outputs where the reasoning is mostly right but one detail is
uncertain) where a hard block isn't the right response. A confidence score would let
the system surface those cases to the human reviewer with a flag rather than
rejecting them outright.
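One way the three-way outcome could look, as a sketch: `decide`, `JudgeAction`, and the threshold values 0.9 and 0.5 are illustrative placeholders, not tuned numbers or anything taken from the existing system.

```python
from enum import Enum


class JudgeAction(Enum):
    PASS = "pass"
    FLAG_FOR_REVIEW = "flag_for_review"
    BLOCK = "block"


def decide(confidence: float, pass_at: float = 0.9,
           block_below: float = 0.5) -> JudgeAction:
    """Map a Judge confidence score in [0, 1] to one of three outcomes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= pass_at:
        return JudgeAction.PASS
    if confidence < block_below:
        return JudgeAction.BLOCK
    # Mostly right but uncertain: surface to the human reviewer with a flag.
    return JudgeAction.FLAG_FOR_REVIEW
```

The middle band is the new behavior: anything between the two thresholds goes to the reviewer with a flag attached, while the clear-cut cases behave exactly as the binary version does today.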
The other thing is latency. Adding a second LLM call to every response adds time, and
for simple queries that don't need a full audit, that overhead isn't worth it. The intent
classifier at the FastAPI gateway already routes simple queries to the NIM worker and
skips the full agent. Extending that logic to also skip the Judge for low-stakes queries
would help.
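A sketch of that skip logic, assuming the intent classifier emits a string label. The labels in `LOW_STAKES_INTENTS` and the `needs_judge` helper are hypothetical, since the post does not list the classifier's actual categories.

```python
# Hypothetical intent labels; the real classifier's labels are not shown in the post.
LOW_STAKES_INTENTS = frozenset({"status_lookup", "faq"})


def needs_judge(intent: str, proposes_remediation: bool) -> bool:
    """Skip the Judge only for known low-stakes intents with no remediation."""
    if proposes_remediation:
        return True
    # Unknown or unlisted intents are judged by default.
    return intent not in LOW_STAKES_INTENTS
```

The allowlist direction matters: a new or misclassified intent falls through to the Judge, so the latency saving never comes at the cost of an unchecked compliance verdict.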

The short version
The Judge is useful not because it's smarter than the main agent, but because it's
independent of it. Reading the policy fresh, without inheriting whatever context the
agent accumulated during its reasoning loop, is what gives it the ability to catch errors
the agent can't see from inside its own context window.
It adds latency and cost. For a system making compliance decisions on production
infrastructure, that's a reasonable trade.
Code is on GitHub. The Judge implementation is in the governance layer — happy to
walk through the prompt design or the AgentOps integration in the comments.

DEV Community
