Every agent you put in production is a function with no type signature. You prompt it, it returns prose, and you hope the next step can parse it. That hope is where production agents die — not in some exotic reasoning failure, but in a missing closing brace.
The fix isn't a smarter model. It's an old idea from boring software: the output is a contract, and you reject anything that violates it before you let it touch the next step.
The agent is an untyped function call
Treat one agent step honestly. It takes resolved inputs, calls a model and some tools, and produces an artifact. In a normal codebase you'd never accept that artifact without a type. With agents, people accept a free-text blob and pray. So you get pipelines that hold together for the demo and shatter the first time the model returns "Sure! Here's the JSON:" before the JSON.
A contract-first agent flips it. You define the shape of a good output up front — a schema, an invariant, a checkable claim — and the artifact has to clear that gate or the run stops. The model is free to be creative inside the contract. It is not free to break it.
This maps cleanly onto how you should rank evidence: not cheap-to-expensive, but independent-to-corruptible.
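One way to picture the contract-first idea is as a parser that sits between the model and the next step. The sketch below is a minimal, dependency-free illustration, not a prescribed implementation: `Checked`, `Contract`, `Verdict`, `verdictContract`, and `runTypedStep` are all hypothetical names, and the `produce` callback stands in for whatever actually calls the model.

```typescript
// Hypothetical names throughout; a sketch of "output as contract".
type Checked<T> = { ok: true; value: T } | { ok: false; reason: string };

// A contract is a parser: raw model text in, typed value or named violation out.
type Contract<T> = (raw: string) => Checked<T>;

type Status = "ok" | "needs_review";

interface Verdict {
  status: Status;
  summary: string;
}

function isStatus(x: unknown): x is Status {
  return x === "ok" || x === "needs_review";
}

const verdictContract: Contract<Verdict> = (raw) => {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw); // chatty preambles fail here
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reason: "not_an_object" };
  }
  const fields = parsed as Record<string, unknown>;
  const status = fields["status"];
  const summary = fields["summary"];
  if (!isStatus(status)) return { ok: false, reason: "bad_status" };
  if (typeof summary !== "string" || summary.length === 0) {
    return { ok: false, reason: "empty_summary" };
  }
  return { ok: true, value: { status, summary } };
};

// The step returns a typed value or stops the run; nothing untyped leaks out.
function runTypedStep<T>(produce: () => string, contract: Contract<T>): T {
  const checked = contract(produce());
  if (!checked.ok) throw new Error(`contract_violation: ${checked.reason}`);
  return checked.value;
}
```

The caller of `runTypedStep` only ever sees a `Verdict`, so the "Sure! Here's the JSON:" case becomes a thrown `contract_violation` instead of a parse error three steps downstream.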
Tier 1: proof the agent can't forge
The first gate is stuff that's externally true. Did it emit valid JSON in the expected schema? Does the file path it returned exist? Does the diff actually change something? Did it finish before the timeout? Is the field non-empty? None of these ask the model's opinion. The agent can't talk its way past them — they're observable proof.
import { z } from "zod";

// The contract: what a good summary artifact must look like.
const SummarySchema = z.object({
  status: z.enum(["ok", "needs_review"]),
  citations: z.array(z.string().url()).min(1),
  summary: z.string().min(40),
});

type Gate = { ok: true } | { ok: false; reason: string };

// Tier 1: deterministic checks that never ask a model for its opinion.
function tier1(raw: string): Gate {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
  const r = SummarySchema.safeParse(parsed);
  if (!r.success) return { ok: false, reason: "schema_violation" };
  // .url() accepts any scheme, so require https explicitly.
  for (const u of r.data.citations) {
    if (!u.startsWith("https://")) return { ok: false, reason: "bad_citation" };
  }
  return { ok: true };
}
If this fails, you don't ask a judge what it thinks. You stop. Most real failures (call it 80%) are stale output, crashes, malformed JSON, a hallucinated path, or an empty field, and they are caught right here: deterministically, for about zero dollars, in milliseconds.
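The schema gate covers the JSON shape; the other Tier 1 checks named above (the path exists, the diff changes something, the step finished in time) could be sketched along the same lines. Everything here is an assumption for illustration: `StepArtifact`, `Tier1Env`, and `tier1External` are hypothetical names, and the filesystem lookup is injected as `pathExists` so the gate stays a pure function (in Node you might pass `fs.existsSync`).

```typescript
type Gate = { ok: true } | { ok: false; reason: string };

// Hypothetical shape of what a step hands back, plus its wall-clock timing.
interface StepArtifact {
  filePath: string;
  diff: string; // unified diff text
  startedAtMs: number;
  finishedAtMs: number;
}

// Outside facts the gate checks against; none of them come from the model.
interface Tier1Env {
  pathExists: (path: string) => boolean;
  timeoutMs: number;
}

function tier1External(a: StepArtifact, env: Tier1Env): Gate {
  if (a.finishedAtMs - a.startedAtMs > env.timeoutMs) {
    return { ok: false, reason: "timeout" };
  }
  if (!env.pathExists(a.filePath)) {
    return { ok: false, reason: "missing_path" };
  }
  // A diff "changes something" if it has an added or removed line that is
  // not a file header. Simplification: a removed line whose own content
  // starts with "--" would be mistaken for a header.
  const changed = a.diff
    .split("\n")
    .some((line) => /^[+-]/.test(line) && !/^(\+\+\+|---)/.test(line));
  if (!changed) return { ok: false, reason: "empty_diff" };
  return { ok: true };
}
```

Each failure has its own reason string, so a blocked run says which external fact it failed against.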
Tier 2: signal against a baseline it didn't author
Some failures pass the schema and are still garbage. The JSON is valid but the summary repeats one line forty times, or it's wildly off-topic. Tier 2 is statistical: embedding similarity between the output and the actual task, length and repetition checks, "did the diff touch the files it claimed." The agent didn't write the baseline, so it can't game it.
Tier 1 and Tier 2 together are your real-time gate: deterministic, near-free, fast enough to block a bad run before it propagates. They can even run over the agent's whole trajectory, because nothing in them depends on a model's mood.
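A Tier 2 check might look something like the sketch below. It is a rough illustration under stated assumptions: the embedding vectors are computed elsewhere and passed in, the thresholds in `Tier2Limits` are placeholders you would tune on your own data, and every name (`repetitionRatio`, `cosine`, `tier2`) is hypothetical.

```typescript
type Signal = { ok: true } | { ok: false; reason: string };

// Share of sentence-like units that are duplicates: 0 = all unique.
function repetitionRatio(text: string): number {
  const units = text
    .split(/[.!?\n]+/)
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);
  if (units.length === 0) return 0;
  return 1 - new Set(units).size / units.length;
}

// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface Tier2Input {
  output: string;
  outputVec: number[]; // embedding of the output (assumed precomputed)
  taskVec: number[]; // embedding of the task (assumed precomputed)
}

// Placeholder thresholds; tune against your own baseline.
interface Tier2Limits {
  minChars: number;
  maxChars: number;
  maxRepetition: number;
  minSimilarity: number;
}

function tier2(input: Tier2Input, limits: Tier2Limits): Signal {
  const len = input.output.length;
  if (len < limits.minChars || len > limits.maxChars) {
    return { ok: false, reason: "length_out_of_range" };
  }
  if (repetitionRatio(input.output) > limits.maxRepetition) {
    return { ok: false, reason: "repetitive" };
  }
  if (cosine(input.outputVec, input.taskVec) < limits.minSimilarity) {
    return { ok: false, reason: "off_topic" };
  }
  return { ok: true };
}
```

The task embedding and the limits come from outside the agent, which is the point: the output is scored against a baseline it had no hand in writing.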
Tier 3: the judge is a signal, never a verdict
Then there's the subjective tail: was the tone right, did the argument hang together. That's model-as-judge, and it's offline-only. It's metered, slow, and non-deterministic, so it cannot sit in the hot path. More importantly, a model grading another model's reasoning is circular: judge and judged share a substrate, so there's no independent ground truth. So the judge only gets to inspect artifacts the judged agent didn't author, and its output is labeled opinion, not a gate. Reserve it for the ~20% no schema can express.
The distinction that matters: this is not "LLM-as-judge gives you a 7/10." Tier 1+2 already shipped the 80% deterministically. The judge is the small, clearly-marked subjective remainder.
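One way to keep the judge honest in code is to give its output a type that the shipping decision never reads. This is only a sketch of that separation; `JudgeOpinion`, `RunRecord`, `attachOpinion`, and `shouldShip` are hypothetical names, and the judge call itself is left out.

```typescript
type Gate = { ok: true } | { ok: false; reason: string };

// Judge output is tagged as opinion so it cannot be confused with a gate.
interface JudgeOpinion {
  kind: "opinion";
  score: number; // whatever scale the judge uses
  rationale: string;
}

interface RunRecord {
  runId: string;
  gate: Gate; // decided in real time by Tier 1+2
  opinions: JudgeOpinion[]; // attached later, offline
}

// Offline: annotate the record without touching the gate.
function attachOpinion(record: RunRecord, opinion: JudgeOpinion): RunRecord {
  return { ...record, opinions: [...record.opinions, opinion] };
}

// The shipping decision reads the gate only; opinions are never consulted.
function shouldShip(record: RunRecord): boolean {
  return record.gate.ok;
}
```

A harsh opinion shows up in review dashboards, but it cannot retroactively block a run that cleared the deterministic gate.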
Why a gate needs a trace
Here's the catch — a Tier 1 failure tells you what broke, not why. "schema_violation" doesn't tell you the tool returned null three steps back. To gate, you need something to gate against: every model and tool step, the resolved inputs, the raw outputs, captured as they happened. That's exactly what Tier 1+2 score over — trace data the agent didn't author and can't retroactively edit.
This is the split worth building around. agent-eval scores and gates the output along the tier ladder. AgentLens captures the trace of how the agent got there — every step, every resolved input, every raw output — so the eval signal is debuggable and Tier 1+2 have unforgeable, agent-didn't-author data to check. One scores the destination, the other records the road.
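// Assumes Task, agent, and AgentLens are defined elsewhere in the pipeline.
async function runStep(input: Task) {
  const trace = AgentLens.start(input); // record everything from the first step
  const raw = await agent.run(input);
  trace.capture("output", raw);
  const gate = tier1(raw); // deterministic gate: the model has no say in it
  if (!gate.ok) {
    trace.fail(gate.reason); // keep the failure in the trace for debugging
    throw new Error(`blocked: ${gate.reason}`);
  }
  return JSON.parse(raw); // already validated; the judge runs later, offline
}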
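To make the "what broke versus why" point concrete, here is a toy in-memory recorder. It is a stand-in for illustration only and is not AgentLens's actual API: `InMemoryTrace`, `TraceEvent`, and `firstNullPayload` are hypothetical, and the snapshotting assumes payloads are JSON-serializable.

```typescript
interface TraceEvent {
  readonly seq: number;
  readonly kind: string;
  readonly payload: unknown;
}

// Append-only recorder: events are snapshotted when captured.
class InMemoryTrace {
  private readonly events: TraceEvent[] = [];

  capture(kind: string, payload: unknown): void {
    // Copy the payload so later mutation of the caller's object cannot
    // change the record. Assumes JSON-serializable payloads.
    const snapshot: unknown =
      payload === undefined ? null : JSON.parse(JSON.stringify(payload));
    this.events.push(
      Object.freeze({ seq: this.events.length, kind, payload: snapshot })
    );
  }

  fail(reason: string): void {
    this.capture("failure", reason);
  }

  read(): readonly TraceEvent[] {
    return this.events.slice();
  }
}

// Walk the trace to find the earliest step that recorded a null payload.
function firstNullPayload(trace: InMemoryTrace): TraceEvent | undefined {
  return trace.read().find((e) => e.payload === null);
}
```

With a trace like this, a `schema_violation` at the gate can be walked back to the first step that recorded `null`, which is the "why" the reason string alone does not carry.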
Stop trusting prose
If your agent's contract is "returns helpful text," you don't have a contract, you have a vibe. Give every step a schema. Gate on Tier 1+2 in real time, send the subjective tail to an offline judge, and keep the trace so failures are debuggable. The demo will look identical. Production won't fall over.