Short-Circuit Your Agent Evals: Tier Order Is a Latency Budget, Not a Preference

#ai #agents #evaluation #typescript

There's a tempting way to build an eval layer that feels thorough and is quietly broken: you run every check on every run, collect all the scores, and then decide pass/fail at the end. It looks rigorous. It's also slow and expensive, and worst of all, once everything is averaged, a generous model-as-judge score can outvote a hard, deterministic failure.

The fix isn't more checks. It's ordering. The order you run your evals in is not a stylistic choice. It's a latency and cost budget, and it encodes what you actually trust.

The mistake: eval-as-report instead of eval-as-gate

Most teams' first eval harness is a fan-out. Kick off the format check, the similarity check, and the LLM judge in parallel, await Promise.all, aggregate into a dashboard. Green if the average clears some threshold.

The problem shows up the moment you want that harness to block a run in real time — a retry gate, a CI check, a pre-publish guard. Now the slowest, most expensive, least reliable component (the judge) is on the critical path for a decision that a 2ms check already made for you. If the agent emitted invalid JSON, there is nothing for the judge to have an opinion about. You're paying a model call and 4 seconds of p95 to grade output that was already dead on arrival.

Eval-as-report and eval-as-gate are different jobs. The gate must be fast, deterministic, and cheap enough to sit in the hot path. The report can be slow and thoughtful because nothing waits on it.
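
To make the difference concrete, here's a minimal sketch of the averaging failure mode. The scores, the 0.6 threshold, and the function names are all illustrative assumptions, not taken from any real harness.

```typescript
// Hypothetical scores for one run: the format check failed hard,
// but the softer checks liked the output.
type Scores = { format: number; similarity: number; judge: number };

const PASS_THRESHOLD = 0.6; // assumed threshold, for illustration only

// Eval-as-report: average everything, decide at the end.
function reportStyle(s: Scores): boolean {
  const avg = (s.format + s.similarity + s.judge) / 3;
  return avg >= PASS_THRESHOLD;
}

// Eval-as-gate: a hard failure ends the decision before softer scores are read.
function gateStyle(s: Scores): boolean {
  if (s.format === 0) return false;
  return s.similarity >= PASS_THRESHOLD;
}

const run: Scores = { format: 0, similarity: 0.9, judge: 0.95 };
console.log(reportStyle(run)); // true: (0 + 0.9 + 0.95) / 3 is about 0.617
console.log(gateStyle(run));   // false: the invalid format is dispositive
```

Same run, opposite verdicts. The report-style average lets two friendly scores paper over output that was never usable.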

Evidence has an independence axis, and it determines order

The reason order matters isn't just performance. It's that not all evidence is the same kind of evidence. This is the core idea behind agent-eval: rank your checks on an independence axis — how forgeable the signal is by the agent under test — not a cost axis.

Tier 1 — externally observable proof the agent can't forge. Valid JSON. The file it claimed to write exists. The code compiled. Tests passed. It finished within the timeout. Output is non-empty. These are facts about the world, not claims about quality.
Tier 2 — statistical signal against a baseline the agent didn't author. Embedding similarity between the output and the task spec. Length and repetition sanity. Did the diff actually change anything. The agent didn't write the baseline, so it can't trivially game it.
Tier 3 — model-as-judge. A shared-substrate opinion. Useful for the subjective tail — tone, helpfulness, "is this argument coherent." It is a signal, never a verdict.

Notice this is an independence ranking, not a "cheap to expensive" ranking. It happens to correlate with cost, which is why the ordering also saves you money — but the reason Tier 1 comes first is that it's unforgeable, so a Tier 1 failure is dispositive. No opinion can rescue output that didn't compile.
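
One way to keep that ranking from drifting is to make the tier part of each check and derive the run order from it. A rough sketch follows; the check names and per-check costs are made up to show that a pricier Tier 1 check still runs before a cheaper Tier 2 one.

```typescript
type Tier = 1 | 2 | 3;
type TieredCheck = { name: string; tier: Tier; costUsd: number };

// Order by independence (tier), not by cost. Array.prototype.sort is stable,
// so checks within the same tier keep their declared order.
function orderByIndependence(checks: TieredCheck[]): TieredCheck[] {
  return [...checks].sort((a, b) => a.tier - b.tier);
}

// Illustrative values only; costs are assumptions.
const declared: TieredCheck[] = [
  { name: "llm-judge",  tier: 3, costUsd: 0.01 },
  { name: "on-task",    tier: 2, costUsd: 0.0001 },
  { name: "tests-pass", tier: 1, costUsd: 0.002 },
  { name: "valid-json", tier: 1, costUsd: 0 },
];

console.log(orderByIndependence(declared).map((c) => c.name));
// order: tests-pass, valid-json, on-task, llm-judge
```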

Two hard rules that fall out of the axis

Tier 1+2 are your real-time gate. Tier 3 is offline-only. Tier 1 and Tier 2 are deterministic, cost next to nothing, and typically run in milliseconds (a compile, a test run, or an embedding lookup sits at the slow end), so they can block a run. Tier 3 is metered, slow, and non-deterministic, so it cannot sit in the hot path. Put the judge on a real-time decision path and it becomes a flaky, expensive dependency for something a regex already settled.

Tier 1+2 may score trajectories; Tier 3 may not. You can run deterministic and statistical checks over the agent's reasoning trace all day: "did it call the tool it said it called," "did the retrieved chunk actually contain the cited fact." But a model judging another model's reasoning is circular: judge and judged share a substrate and there's no independent ground truth. So Tier 3 may only inspect the finished artifact, the final output, and never the chain of thought that produced it.
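
The second rule can be sketched in a few lines. The record shape and field names below are assumptions for illustration: one function is a deterministic trajectory check, the other builds the only payload a judge would be allowed to see.

```typescript
// Hypothetical run record; the field names are assumptions for this sketch.
type ToolCall = { tool: string };
type RunRecord = {
  finalOutput: string;
  taskSpec: string;
  claimedTools: string[]; // tools the agent says it used
  toolCalls: ToolCall[];  // calls recorded by the harness, not by the agent
  reasoning: string[];    // the agent's own chain of thought
};

// Deterministic trajectory check: every tool the agent claims to have called
// must appear in the independently recorded call log.
function calledClaimedTools(r: RunRecord): boolean {
  const recorded = new Set(r.toolCalls.map((c) => c.tool));
  return r.claimedTools.every((t) => recorded.has(t));
}

// Tier 3 input: the judge sees the task and the final output only.
// The reasoning trace is deliberately left out of the payload.
function toJudgePayload(r: RunRecord): { taskSpec: string; finalOutput: string } {
  return { taskSpec: r.taskSpec, finalOutput: r.finalOutput };
}
```

Dropping the trace at the payload boundary, rather than asking the judge prompt to ignore it, means the restriction holds even if someone later rewrites the prompt.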

Wiring it: short-circuit, don't fan out

Here's the gate as a fail-fast pipeline. Tier 1 runs first and short-circuits. Tier 2 runs only if Tier 1 passes. Tier 3 never runs inline at all — it's queued for offline scoring.

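// Fields inferred from the checks below; adapt to your own run record.
type AgentOutput = {
  text: string;
  taskSpec: string;
  elapsedMs: number;
  claimedPath: string;
  diffLines: number;
};

type Verdict = { pass: boolean; tier: 1 | 2 | 3; reason: string };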

type Check = {
  tier: 1 | 2;
  name: string;
  run: (out: AgentOutput) => Promise<boolean> | boolean;
};

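// Ordered by independence: unforgeable proof first.
// tryParse, fileExists, embed, cosine, and enqueueOfflineJudge are helpers you supply.
// tryParse and fileExists must return booleans; embed is assumed to be async
// and cosine synchronous.
const gate: Check[] = [
  { tier: 1, name: "non-empty",    run: (o) => o.text.trim().length > 0 },
  { tier: 1, name: "valid-json",   run: (o) => tryParse(o.text) },
  { tier: 1, name: "within-slo",   run: (o) => o.elapsedMs < 30_000 },
  { tier: 1, name: "file-exists",  run: (o) => fileExists(o.claimedPath) },
  { tier: 2, name: "on-task",      run: async (o) => {
      // Resolve both embeddings before comparing them.
      const [a, b] = await Promise.all([embed(o.text), embed(o.taskSpec)]);
      return cosine(a, b) > 0.75;
  } },
  { tier: 2, name: "diff-nonzero", run: (o) => o.diffLines > 0 },
];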

async function runGate(out: AgentOutput): Promise<Verdict> {
  for (const c of gate) {
    const ok = await c.run(out);
    if (!ok) {
      // Short-circuit: stop at the first unforgeable failure.
      return { pass: false, tier: c.tier, reason: `failed ${c.name}` };
    }
  }
  // Passed the gate. The judge is NOT consulted here.
  enqueueOfflineJudge(out); // Tier 3, metered, non-blocking
  return { pass: true, tier: 2, reason: "gate clear" };
}

The judge result lands later, on a dashboard, clearly labeled "opinion, not evidence." It never blocks a user, a retry, or a deploy. It informs the ~20% subjective tail that Tier 1+2 can't reason about.

And that 80/20 split is the whole payoff: the failures that actually bite in production (stale output, a crash, malformed format, a hallucinated file path, an empty response, a blown SLO) are all caught at Tier 1+2, for close to $0, before any judge call. You reserve the expensive, fuzzy judge for the genuine minority of cases where the only question left is a matter of taste.
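
To watch the short-circuit actually skip work, here's a self-contained, synchronous variant of the gate above. It's a sketch under assumptions: the three checks are simplified stand-ins, and the judge queue is an in-memory array where a real system would use something durable.

```typescript
type Out = { text: string };
type Step = { tier: 1 | 2; name: string; run: (o: Out) => boolean };
type Result = { pass: boolean; reason: string; ran: string[] };

// In-memory stand-in for an offline judge queue (assumption for this sketch).
const judgeQueue: Out[] = [];

function isJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

const steps: Step[] = [
  { tier: 1, name: "non-empty",  run: (o) => o.text.trim().length > 0 },
  { tier: 1, name: "valid-json", run: (o) => isJson(o.text) },
  { tier: 2, name: "min-length", run: (o) => o.text.length >= 10 },
];

// Fail-fast loop: records which checks ran so the short-circuit is visible.
function runSteps(out: Out): Result {
  const ran: string[] = [];
  for (const s of steps) {
    ran.push(s.name);
    if (!s.run(out)) return { pass: false, reason: `failed ${s.name}`, ran };
  }
  judgeQueue.push(out); // Tier 3 is queued here, never awaited
  return { pass: true, reason: "gate clear", ran };
}

console.log(runSteps({ text: "not json" }));     // fails at valid-json; min-length never runs
console.log(runSteps({ text: '{"ok": true}' })); // clears all three checks and is queued for the judge
```

The `ran` array is only there for demonstration. In a real gate you'd likely log it alongside the verdict so a failed run shows exactly how far it got.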

The trace is what makes the gate honest

There's a load-bearing assumption hiding in fileExists(o.claimedPath) and o.diffLines: those inputs have to be real, not the agent's self-report. If your Tier 1 check reads "did the agent say it wrote the file," you've handed the agent the pen and asked it to grade itself. That's not Tier 1 anymore; it's Tier 3 wearing a boolean's clothes.

This is why the gate needs a trace it can trust, and why AgentLens is the other half of this workflow. agent-eval scores and gates the output; AgentLens captures the trace of how the agent got there — every model call and tool step, the resolved inputs (not the templated ones), the raw outputs. That trace is exactly the unforgeable, agent-didn't-author substrate that Tier 1+2 need to score against. Without it, "the file exists" degrades into "the agent claims the file exists," and your independent gate quietly collapses into a self-assessment.

Put differently: agent-eval tells you whether the run is good; AgentLens tells you why, and — critically — gives the gate ground truth to check instead of the agent's own narration. They ship as a unit because a gate without an honest trace isn't a gate.
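
Here's what that difference looks like in a check. The trace event shape below is hypothetical and is not the schema of AgentLens or any other tool; the point is only where the check gets its input.

```typescript
// Hypothetical trace events recorded by the harness (assumed shape).
type TraceEvent =
  | { kind: "tool_call"; tool: "write_file"; path: string; bytesWritten: number }
  | { kind: "model_call"; output: string };

type AgentClaim = { wrotePath: string };

// Anti-pattern: trusts the agent's narration. Any non-empty claim passes.
function selfReportedWrite(claim: AgentClaim): boolean {
  return claim.wrotePath.length > 0;
}

// Trace-backed: passes only if the harness recorded a non-empty write
// to the exact path the agent claims.
function observedWrite(claim: AgentClaim, trace: TraceEvent[]): boolean {
  return trace.some(
    (e) => e.kind === "tool_call" && e.path === claim.wrotePath && e.bytesWritten > 0,
  );
}

const claim: AgentClaim = { wrotePath: "out/report.json" };
const trace: TraceEvent[] = [
  { kind: "model_call", output: "I have written out/report.json" },
];

console.log(selfReportedWrite(claim));    // true: the agent said so
console.log(observedWrite(claim, trace)); // false: no write was ever recorded
```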

The takeaway

Stop treating your eval layer as a scoreboard you tally at the end. Order it by independence, short-circuit on the first unforgeable failure, keep the judge offline where its latency and non-determinism can't hurt anyone, and feed the whole thing a trace the agent didn't get to write. The order isn't a preference. It's the architecture.