Your Model-as-Judge Doesn't Belong in the Hot Path

#ai #agents #observability #evaluation

There is a diagram I have drawn on too many whiteboards. An agent runs, produces an output, and then — right there in the request path, before the result goes anywhere — a model-as-judge scores it 8.4 out of 10 and decides whether to ship it. Everyone nods. It looks like a quality gate. It is, in fact, the single most expensive architectural mistake I see teams make with agent evals.

Here is the opinion I will defend: your real-time gate and your model-as-judge are two different systems that must live in two different places. One is a deterministic check that runs on every single execution, costs effectively nothing, returns in milliseconds, and is allowed to block the run. The other is a slow, metered, non-deterministic opinion that can only ever run offline, after the fact, on a sample. Collapsing them into one "the LLM grades the output before we return it" step gives you the worst of both: you pay judge latency and judge dollars on the hot path, and you still don't have a gate you can trust.

The fix is not a better judge. It's putting the judge where it belongs — and putting something else entirely in the path it was squatting in.

Evidence has an independence axis, not a cost axis

Most people rank eval methods by cost: cheap regex checks at the bottom, expensive LLM judges at the top, as if you're buying more quality by spending more. That framing is exactly backwards and it's why judges end up in the hot path — "it's the most expensive, so it must be the most authoritative."

Rank evidence by independence instead — how hard it is for the agent to forge:

Tier 1 — externally observable proof the agent can't fake. Did the output parse as valid JSON? Does the file it claims to have written actually exist on disk? Did the code compile? Did the tests pass? Did the run finish before the timeout? Is the result non-empty? These are facts about the world, not opinions about the work. The agent cannot talk its way past JSON.parse throwing.
Tier 2 — statistical signal against a baseline the agent didn't author. Embedding similarity between the output and the task it was given. Length and repetition checks. Did the diff actually change anything, or did the agent claim a fix and touch nothing? The agent didn't write the baseline, so it can't trivially game the comparison.
Tier 3 — model-as-judge. A shared-substrate opinion. Useful, but it is a signal, never a verdict — and understanding why is the whole point of this post.
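
To make Tier 2 less abstract, here is a minimal sketch of two of the checks named above: a repetition check and a "did the diff change anything" check. The helper names (`repetitionRatio`, `isNoOpChange`) and the whitespace tokenization are my assumptions for illustration, and any threshold you apply on top of them is yours to tune.

```typescript
// Hypothetical Tier 2 helpers: names and tokenization are illustrative assumptions.

// Fraction of word tokens that repeat an earlier token (0 = every token is unique).
function repetitionRatio(text: string): number {
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length === 0) return 0;
  const unique = new Set(tokens);
  return 1 - unique.size / tokens.length;
}

// True when the agent claimed a change but the content is effectively identical
// (trailing whitespace on each line and at the end of the text is ignored).
function isNoOpChange(before: string, after: string): boolean {
  const normalize = (s: string): string =>
    s
      .split("\n")
      .map((line) => line.trimEnd())
      .join("\n")
      .trimEnd();
  return normalize(before) === normalize(after);
}
```

Both functions are pure and compare the output against something the agent did not write (the text's own statistics, or the pre-change content), which is what keeps them on the independent side of the axis.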

The reason this axis matters architecturally: Tier 1 and Tier 2 are deterministic, cost ~nothing, and run in milliseconds — so they can sit in the hot path and block a run. Tier 3 is metered, slow, and non-deterministic — so it cannot. This isn't a preference. It's a property of what each tier is.

Why the judge physically cannot be the gate

Three things disqualify a model-as-judge from the hot path, and each is independently fatal.

It's slow and metered. A judge call is another full inference, often on a frontier model with a long rubric prompt. You've now roughly doubled your latency and added a per-run dollar cost to the thing you most want to run on every request. At low volume you don't notice. At production volume you've built a second, slower agent whose only job is to grade the first one, and you're paying for both on the critical path.
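
The arithmetic is worth doing once. The sketch below compares an inline judge with a sampled offline judge; the function name and every number you feed it are hypothetical placeholders, not measurements from any real system.

```typescript
// All inputs are hypothetical placeholders; plug in your own numbers.
interface JudgeCostInputs {
  runsPerDay: number;
  judgeCostUsd: number; // cost of one judge call
  judgeLatencyMs: number; // latency of one judge call
  sampleRate: number; // fraction of runs judged offline, 0..1
}

interface JudgeCostComparison {
  inlineUsdPerDay: number;
  inlineAddedLatencyMs: number; // paid by every request
  offlineUsdPerDay: number;
  offlineAddedLatencyMs: number; // always 0: the judge is off the request path
}

function compareJudgePlacement(i: JudgeCostInputs): JudgeCostComparison {
  return {
    inlineUsdPerDay: i.runsPerDay * i.judgeCostUsd,
    inlineAddedLatencyMs: i.judgeLatencyMs,
    offlineUsdPerDay: i.runsPerDay * i.sampleRate * i.judgeCostUsd,
    offlineAddedLatencyMs: 0,
  };
}
```

With made-up inputs of 100,000 runs a day, one cent per judge call, and a 5% sample, the inline judge costs about $1,000 a day and adds its full latency to every request, while the offline lane costs about $50 a day and adds none.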

It's non-deterministic, so it can't be a gate. A gate's job is to make a stable accept/reject decision. Run the same output through the same judge three times and you can get 7, 8, and 6. A gate that flips its verdict on identical input isn't a gate — it's a coin weighted by temperature. You cannot build a reliable block/allow decision on a number that won't reproduce.
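
A few lines show the problem. This sketch applies a fixed pass mark to the three scores from the example above; the pass mark of 7 and the function names are assumptions for illustration.

```typescript
// Applies a fixed pass threshold to repeated judge scores for the SAME output.
function gateVerdicts(scores: number[], threshold: number): boolean[] {
  return scores.map((s) => s >= threshold);
}

// A gate is only stable if every repeat lands on the same side of the threshold.
function isStableGate(scores: number[], threshold: number): boolean {
  const verdicts = gateVerdicts(scores, threshold);
  return verdicts.every((v) => v === verdicts[0]);
}

const repeatedScores = [7, 8, 6]; // the three scores from the example above
const passMark = 7; // hypothetical pass mark

console.log(gateVerdicts(repeatedScores, passMark)); // [ true, true, false ]
console.log(isStableGate(repeatedScores, passMark)); // false
```

Identical input, two passes and one block. Averaging more judge calls narrows the spread, but it multiplies the latency and cost you were already trying to keep off the hot path.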

This is the deep one: the judge is circular. When a model judges another model's reasoning, the judge and the judged share a substrate. There is no independent ground truth in that loop — you're asking a language model whether a language model's output is good, and both are drawing from the same well of training and the same failure modes. A judge that confidently rubber-stamps a confidently wrong answer is not a bug; it's the predictable result of putting the grader and the gradee on the same axis. Tier 1 and Tier 2 can legitimately run over an agent's full trajectory — its reasoning steps, its tool calls — because they're checking against the external world or an independent baseline. Tier 3 cannot judge a trajectory, because judging a model's reasoning with a model is the circular case. So the judge may only ever inspect artifacts the judged agent didn't get to write — the final file, the committed diff, the rendered output — never the agent's own narration of how great its work was.

Put those together and the conclusion is forced: the judge belongs offline, on a sample, looking at artifacts. The hot-path gate has to be Tier 1 + Tier 2.

What the two lanes actually look like

Here's the split made concrete. The real-time gate runs inline and can throw to block the run. The judge runs in a separate offline lane that can never block anything.

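// ---------- LANE 1: the real-time gate (Tier 1 + Tier 2) ----------
// Deterministic, ~$0, milliseconds. Runs on EVERY execution.
// Allowed to block the run by throwing.

import { access } from "node:fs/promises";

// Helpers the gate relies on. runAgent, embed, and cosineSimilarity are
// assumed to exist elsewhere; embed is assumed to be a local or cached
// embedding model so the check stays near-free and reproducible.
interface Task {
  prompt: string;
  expectFile?: string; // path the agent claims to have written
}
declare function runAgent(task: Task): Promise<string>;
declare function embed(text: string): Promise<number[]>;
declare function cosineSimilarity(a: number[], b: number[]): Promise<number>;

class GateError extends Error {}

// Tier 1 existence check: resolves true only if the path is accessible on disk.
async function fileExists(path: string): Promise<boolean> {
  try {
    await access(path);
    return true;
  } catch {
    return false;
  }
}

interface GateResult {
  passed: boolean;
  failures: string[];
}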

async function realtimeGate(
  output: string,
  task: { prompt: string; expectFile?: string },
): Promise<GateResult> {
  const failures: string[] = [];

  // Tier 1 — externally observable proof the agent can't forge.
  if (output.trim().length === 0) failures.push("empty output");

  try {
    JSON.parse(output);
  } catch {
    failures.push("output is not valid JSON");
  }

  if (task.expectFile && !(await fileExists(task.expectFile))) {
    // It CLAIMED to write the file. Does the file exist? Fact, not opinion.
    failures.push("claimed file does not exist on disk: " + task.expectFile);
  }

  // Tier 2 — statistical signal vs a baseline the agent didn't author.
  const relevance = await cosineSimilarity(
    await embed(output),
    await embed(task.prompt), // baseline = the task itself
  );
  if (relevance < 0.35) failures.push("output unrelated to task (sim=" + relevance.toFixed(2) + ")");

  return { passed: failures.length === 0, failures };
}

// In the hot path: a failure here BLOCKS the run. This is a real gate.
async function runAgentGated(task: Task): Promise<string> {
  const output = await runAgent(task);
  const gate = await realtimeGate(output, task);
  if (!gate.passed) {
    throw new GateError("blocked: " + gate.failures.join("; ")); // stops the run, ~$0
  }
  return output;
}

That gate is boring, and boring is the point. Every check in it is reproducible, runs in milliseconds, and catches a forgeable claim by comparing it to something the agent couldn't fake. Now for the judge, which follows the same eval philosophy but lives somewhere completely different:

// ---------- LANE 2: the offline judge (Tier 3) ----------
// Metered, slow, non-deterministic. Runs OFF the hot path, on a SAMPLE.
// Can NEVER block a run. Inspects only artifacts the agent didn't author.

interface JudgeSignal {
  runId: string;
  score: number;        // normalized 0..1 (a 6/10 is 0.6); a SIGNAL, not a verdict
  rationale: string;
  label: "opinion-not-evidence";
}

async function offlineJudge(runId: string): Promise<JudgeSignal> {
  const trace = await agentlens.getTrace(runId);

  // Critical: judge the ARTIFACT, not the agent's reasoning about it.
  // Feeding the agent's own trajectory to a model judge is the circular case.
  const artifact = trace.finalArtifact; // the committed file/diff/output
  // (We deliberately do NOT pass trace.reasoning to the judge.)

  const verdict = await llmJudge({
    rubric: "Is this artifact clear, correct, and complete for the task?",
    artifact,
    task: trace.task,
  });

  return {
    runId,
    score: verdict.score,
    rationale: verdict.rationale,
    label: "opinion-not-evidence", // never gets to block anything
  };
}

The asymmetry is the entire design. Lane 1 throws and stops the run. Lane 2 returns a labeled signal and goes in a dashboard. They are never the same function call, and the judge never touches the agent's reasoning — only the artifact it produced.
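
"On a sample" deserves a concrete shape too. One way to pick the sample, sketched here under my own assumptions, is to hash the run ID so the same run is always in or out of the sample regardless of which worker asks. The function names are hypothetical, and the hash is a 32-bit FNV-1a applied to UTF-16 code units, which matches the byte-wise definition for plain ASCII run IDs.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units, returned as an unsigned integer.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0;
}

// Deterministically selects roughly `rate` of run IDs for the offline judge.
// The same runId always gets the same answer, so re-processing is idempotent.
function isSampledForJudge(runId: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  return fnv1a32(runId) / 0x100000000 < rate;
}
```

A background worker can call `isSampledForJudge(runId, 0.05)` before enqueueing a judge job. Because the decision never touches the request path, raising or lowering the rate changes your judge bill and nothing else.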

Where the trace comes in (and why both halves need it)

You'll notice the offline judge reads from agentlens.getTrace(runId). That is not incidental — it's the load-bearing piece that makes this whole architecture debuggable, and it's why I run agent-eval and AgentLens as a single unit rather than two tools.

agent-eval is the scoring-and-gating half: it implements both lanes — the deterministic Tier 1 + Tier 2 checks that block a run in real time, and the Tier 3 judge that runs offline as a labeled signal. It's the thing that decides whether the output is good and, in the hot path, whether the run is allowed to proceed.

AgentLens is the trace half: it captures how the agent got there — every model call and tool step, the resolved inputs the agent actually saw after templating, and the raw outputs that came back. Two reasons that pairing is mandatory, not nice-to-have:

The judge needs the trace to find the artifact-without-the-reasoning. To judge only what the agent didn't author, you need a record that cleanly separates the final committed artifact from the agent's narration of it. That separation lives in the trace.
Tier 1 + Tier 2 need unforgeable, agent-didn't-author data to score against. The whole premise of the independence axis is that you're checking the output against something the agent couldn't fake — the real file on disk, the actual diff, the resolved task input. AgentLens is what preserves that ground-truth record, so when the gate blocks a run, you can open the trace and see exactly which resolved input and which raw tool output produced the violation.

agent-eval tells you the run got blocked, or that the judge gave the artifact a 6. AgentLens tells you why — which step, which input, which output. A gate decision with no trace behind it is a verdict you can't appeal; a judge score with no trace is an opinion you can't audit.
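
To show what "separates the artifact from the narration" means in practice, here is a hypothetical trace shape. It is an illustration of the idea only, not AgentLens's actual schema; every type and field name beyond `task`, `reasoning`, and `finalArtifact` is my assumption.

```typescript
// Hypothetical trace shape for illustration only; not AgentLens's actual schema.
interface TraceStep {
  kind: "model_call" | "tool_call";
  resolvedInput: string; // what the agent actually saw after templating
  rawOutput: string; // what came back, unedited
}

interface RunTrace {
  runId: string;
  task: string;
  steps: TraceStep[];
  reasoning: string; // the agent's own narration
  finalArtifact: string; // the committed file, diff, or output
}

// The only view the offline judge is allowed to see.
type JudgeInput = Pick<RunTrace, "runId" | "task" | "finalArtifact">;

// Copies just the judge-visible fields so reasoning and steps cannot leak through.
function toJudgeInput(trace: RunTrace): JudgeInput {
  const { runId, task, finalArtifact } = trace;
  return { runId, task, finalArtifact };
}
```

Making the judge's input its own type means a later refactor cannot quietly start passing the trajectory to the judge without someone changing that type on purpose.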

Ship the 80% at Tier 1+2, reserve the judge for the tail

Here's the part that makes this practical rather than theoretical. In my experience, when you actually categorize production agent failures, the overwhelming majority can be caught at Tier 1 + Tier 2 alone:

The output is stale, or the run crashed, or it timed out → Tier 1.
The JSON won't parse, the format is wrong → Tier 1.
It hallucinated a file path or a record ID that isn't in the evidence → Tier 1 (does it exist?).
It returned empty, or returned something unrelated to the task → Tier 1 + Tier 2.

None of those need a model's opinion. They're facts, and your cheap, fast, deterministic, blocking gate catches all of them before they ever reach a user, on every run, at ~$0. That is the 80% (honestly more), and it's exactly the set of failures a "the LLM grades it 7/10" tool is worst at catching reliably, because it buries those facts inside a fuzzy score.
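
The hallucinated-ID case is a good example of how small these checks are. The sketch below assumes record IDs look like `REC-1042`; that pattern, the function name, and the idea that you already hold the set of IDs present in the evidence are all assumptions for illustration.

```typescript
// Returns every ID-like token cited in the output that is absent from the evidence.
// The default pattern is a hypothetical example; pass your own, keeping the `g` flag.
function findUnsupportedIds(
  output: string,
  evidenceIds: ReadonlySet<string>,
  idPattern: RegExp = /\b[A-Z]+-\d+\b/g,
): string[] {
  const cited = output.match(idPattern) ?? [];
  const unique = Array.from(new Set(cited)); // report each unsupported ID once
  return unique.filter((id) => !evidenceIds.has(id));
}
```

A non-empty result is a Tier 1 failure: the agent cited something that does not exist in what it was given. No model has to weigh in on that.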

Then there's the genuinely subjective tail, maybe 20%. Is the summary actually clear? Is the tone right? Did it pick the better of two reasonable approaches? That is where a model-as-judge earns its keep. So you run it offline, on a sample, clearly labeled "opinion, not evidence," and you use it to trend quality and surface candidates for human review, never to silently block or rubber-stamp a run in the path.

This is the line that separates a real agent-eval architecture from the "LLM-as-judge gives you a number" tools: the judge is the last 20%, offline, and it's a signal — not the gate, not on the hot path, and not allowed to grade the agent's own reasoning.

What to do Monday

If your architecture currently has an LLM scoring outputs inline before you return them, pull it out of the path. Concretely:

Build the real-time gate from Tier 1 + Tier 2 only. Format validity, existence checks, timeout, non-empty, relevance-to-task. Make it throw. This is the thing allowed to block a run, and it should cost nothing and reproduce every time.
Move the judge to an offline lane on a sample. It scores artifacts, never trajectories; it's labeled a signal; it can't block anything. Wire it to your AgentLens traces so each score is one click from the run that produced it.
Capture the trace on every run with AgentLens so both lanes have unforgeable, agent-didn't-author data to work against — and so a blocked run or a low judge score is debuggable instead of just alarming.
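
The timeout check in the first step is the one item on that list that needs a little plumbing, so here is a minimal sketch using `Promise.race`. The names `withTimeout` and `TimeoutError` are hypothetical, and note the limitation: this stops waiting for the work, it does not cancel it.

```typescript
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
// The underlying work keeps running; cancel it separately if your runner allows.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer keeping the process alive
  }
}
```

Wrapping the agent call in something like `withTimeout(runAgent(task), 30_000)` turns "did it finish in time" into a thrown error, so it blocks the run the same way as every other Tier 1 failure. The 30-second budget is only an example.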

The judge feels authoritative because it's expensive and it talks like a senior reviewer. But authority on this problem comes from independence, not from cost — and the moment you put the grader and the gradee on the same substrate in the hot path, you've spent your latency budget to buy an opinion that can't reproduce and can't be trusted to block. Put the proof in the path. Put the opinion in the dashboard. They are not the same system, and the agents that survive production are built by teams that stopped pretending they were.