Every guide to evaluating AI agents quietly assumes there is one agent. One model, one loop, one output you can score. So you build a clean eval harness, you trace the loop, you gate on a pass rate, and you feel good.
Then your system grows up. A router agent decides which specialist to call. A researcher agent hands a draft to a writer agent. A planner spawns three workers and merges their results. Now you do not have an agent. You have an org chart of agents, and the thing that breaks is almost never inside one of them. It is the handoff — the seam where one agent's output becomes another agent's input.
This is the failure class nobody puts in their eval suite, because it does not live in any single agent. I want to argue that multi-agent systems need a different shape of evaluation and a different shape of observability, and that if you bolt your single-agent tooling onto them you will ship blind.
The seam is where the bodies are buried
Here is a concrete incident. A support pipeline: a triage agent classifies an inbound ticket, then routes to either a billing agent or a technical agent. Each agent, in isolation, was excellent. Triage scored 0.94 on its classification eval. Billing scored 0.91 on resolution quality. Technical scored 0.89.
The pipeline as a whole was a disaster. Refund requests were landing in the technical agent, which would cheerfully invent a troubleshooting plan for a billing problem. Every component passed its own eval. The system failed anyway.
Why? Because triage emitted {"category": "refund_issue"} and the router was matching on "billing". The category vocabulary had drifted between two prompts owned by two people. No single-agent eval can catch this, because no single agent is wrong. The contract between them is wrong.
If you only evaluate agents in isolation, you are unit-testing a distributed system and calling it integration coverage. It is not.
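To make the incident concrete, here is a minimal reconstruction. The function names, the regex, and the silent fallback to the technical agent are my assumptions for illustration, not the actual pipeline code; the point is only that each function is defensible on its own.

```typescript
// Hypothetical reconstruction of the drift; names and fallback behavior are assumed.
type TriageResult = { category: string };

// Prompt owner A renamed the category to something more specific.
function triage(ticket: string): TriageResult {
  return { category: /refund/i.test(ticket) ? "refund_issue" : "tech_fault" };
}

// Prompt owner B still matches on the old vocabulary.
function route(result: TriageResult): "billing" | "technical" {
  // Anything that is not literally "billing" falls through to technical.
  return result.category === "billing" ? "billing" : "technical";
}

const target = route(triage("I want a refund for a double charge"));
console.log(target); // "technical"
```

Triage classified the ticket correctly and the router did what its code says. The wrong destination only appears when the two are composed.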
Evaluate the contract, not just the agent
The fix is to treat every handoff as a first-class thing to assert on. Two layers:
- Structural contract: deterministic. The producing agent's output must match the consuming agent's expected schema and its expected value domain. This is cheap, fast, and catches the vocabulary-drift class of bug as long as the schema pins the allowed values (an enum, not a free-form string).
- Semantic handoff quality: model-judged. Given what the upstream agent produced, did the downstream agent receive enough context to do its job? Did the writer agent get the facts the researcher actually found, or a lossy summary?
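For the semantic layer, a crude deterministic pre-check can flag the worst lossy summaries before you spend judge tokens. The sketch below assumes the researcher emits discrete fact strings and the writer receives a text brief; `factRetention` is a hypothetical helper, and substring matching will miss paraphrases, so treat it as a filter in front of the model judge rather than a replacement for it.

```typescript
// Assumed shapes: upstream emits discrete facts, downstream receives a text brief.
type Retention = { kept: string[]; dropped: string[]; ratio: number };

// Reports which upstream facts literally survive in the downstream input.
function factRetention(facts: string[], brief: string): Retention {
  const haystack = brief.toLowerCase();
  const kept: string[] = [];
  const dropped: string[] = [];
  for (const fact of facts) {
    (haystack.includes(fact.toLowerCase()) ? kept : dropped).push(fact);
  }
  // An empty fact list has nothing to lose, so it counts as fully retained.
  const ratio = facts.length === 0 ? 1 : kept.length / facts.length;
  return { kept, dropped, ratio };
}

const retention = factRetention(
  ["Order 1142", "charged twice"],
  "Customer was charged twice last week.",
);
console.log(retention.dropped); // [ 'Order 1142' ]
```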
The structural layer is where most of your protection comes from, and it is the cheapest thing in the entire stack. Here is the kind of contract check I put between every pair of agents:
import { z } from "zod";

// The contract is owned jointly by producer + consumer.
const TriageOutput = z.object({
  category: z.enum(["refund_issue", "charge_dispute", "tech_fault"]),
  confidence: z.number().min(0).max(1),
  customerId: z.string().uuid(),
});

type Handoff = {
  from: string;
  to: string;
  payload: unknown;
};

// Raised when a payload crossing a seam does not satisfy its contract.
class HandoffViolation extends Error {
  constructor(from: string, to: string, issues: unknown) {
    super(`Contract broken: ${from} -> ${to}`);
    this.name = "HandoffViolation";
    // Error.cause needs ES2022 lib typings; it carries the validation issues.
    this.cause = issues;
  }
}

// Validates the payload against the schema and returns it typed; throws on violation.
function assertHandoff<S extends z.ZodTypeAny>(h: Handoff, schema: S): z.infer<S> {
  const result = schema.safeParse(h.payload);
  if (!result.success) {
    throw new HandoffViolation(h.from, h.to, result.error.issues);
  }
  return result.data;
}
Run this as an eval over recorded production handoffs, not just against live traffic. If triage starts emitting a category the router has never heard of, that is a failing test before it is a 2am page. This is exactly the deterministic-first, judge-second tiering that works for single agents; you are just applying it to the edges of the graph instead of the nodes.
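Replaying recorded handoffs might look something like the sketch below. The `RecordedHandoff` shape, the `replay` helper, and the hand-rolled `checkTriageOutput` are all assumptions of mine; the check stands in for the schema so the example has no dependencies, and in practice you would plug the real contract in.

```typescript
// Hypothetical record shape; adapt it to however your handoffs are logged.
type RecordedHandoff = { from: string; to: string; payload: unknown };
// A check returns a list of issues; an empty list means the contract held.
type Check = (payload: unknown) => string[];

const KNOWN_CATEGORIES = new Set(["refund_issue", "charge_dispute", "tech_fault"]);

// Dependency-free stand-in for the schema check.
const checkTriageOutput: Check = (payload) => {
  if (typeof payload !== "object" || payload === null) {
    return ["payload is not an object"];
  }
  const { category, confidence } = payload as Record<string, unknown>;
  const issues: string[] = [];
  if (typeof category !== "string" || !KNOWN_CATEGORIES.has(category)) {
    issues.push(`unknown category: ${String(category)}`);
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    issues.push("confidence out of range");
  }
  return issues;
};

// Runs the check over every recorded handoff and summarizes the result.
function replay(handoffs: RecordedHandoff[], check: Check) {
  const failures = handoffs
    .map((handoff) => ({ handoff, issues: check(handoff.payload) }))
    .filter((r) => r.issues.length > 0);
  const passRate = handoffs.length === 0 ? 1 : 1 - failures.length / handoffs.length;
  return { passRate, failures };
}

const report = replay(
  [
    { from: "triage", to: "router", payload: { category: "refund_issue", confidence: 0.9 } },
    { from: "triage", to: "router", payload: { category: "billing", confidence: 0.8 } },
  ],
  checkTriageOutput,
);
console.log(report.passRate); // 0.5
```

Keeping the failing payloads next to the pass rate matters: the number gates the release, and the payloads tell you which vocabulary drifted.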
But here is the part teams get wrong: a green contract eval tells you the seam is typed correctly. It does not tell you the seam is good. For that you need to see what actually flowed.
You cannot debug a seam you cannot see
When a handoff eval goes red, the score is useless on its own. "Handoff quality 0.6" tells you nothing actionable. You need to answer: what did agent A actually emit, what did agent B actually receive after the router mangled it, and which tool call in between dropped a field?
This is the split that matters, and it is why I run agent-eval and AgentLens as one workflow rather than two tools. agent-eval owns the judgment: it scores the agent's output, runs the structural contract checks, flags drift when a category vocabulary shifts, and catches the ungrounded claim when the technical agent invents a refund policy. It is the layer that decides pass or fail and gates the release.
AgentLens owns the trace: it captures every model call and every tool step across all the agents in the pipeline as one connected run — the resolved inputs each agent actually saw, the raw outputs each one actually produced, and the exact payload that crossed each seam. So when agent-eval says "handoff triage->billing scored 0.6," AgentLens lets you click into that specific run and watch refund_issue get silently coerced to null at the router boundary. The eval gives you the signal; the trace makes the signal debuggable. One without the other is either a number you cannot act on or a firehose you cannot grade.
In a single-agent world you can sometimes get away with eyeballing logs. In a multi-agent world the trace is a graph, and you will not reconstruct it by hand. The eval tells you a seam is bad; the trace is the only thing that tells you which seam and why.
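To show what "which seam and why" means mechanically, here is a small sketch. The `SeamSpan` shape and both helpers are hypothetical and are not the AgentLens data model or API; they only assume that a trace records, per seam, what the producer emitted and what the consumer received.

```typescript
// Hypothetical span shape for illustration; NOT the AgentLens data model.
type SeamSpan = {
  runId: string;
  from: string;
  to: string;
  emitted: Record<string, unknown>;  // raw output of the producing agent
  received: Record<string, unknown>; // resolved input the consumer actually saw
};

// Lists the fields that were dropped or changed while crossing one seam.
function diffSeam(span: SeamSpan): string[] {
  const changes: string[] = [];
  for (const key of Object.keys(span.emitted)) {
    const before = JSON.stringify(span.emitted[key]);
    if (!(key in span.received)) {
      changes.push(`${key}: dropped`);
    } else if (before !== JSON.stringify(span.received[key])) {
      changes.push(`${key}: ${before} -> ${JSON.stringify(span.received[key])}`);
    }
  }
  return changes;
}

// Walks one run's spans in order and returns the first seam that altered the payload.
function firstBrokenSeam(spans: SeamSpan[], runId: string) {
  for (const span of spans.filter((s) => s.runId === runId)) {
    const changes = diffSeam(span);
    if (changes.length > 0) {
      return { seam: `${span.from}->${span.to}`, changes };
    }
  }
  return null;
}
```

With spans like these, a red edge score stops being a number and becomes a specific field on a specific seam in a specific run.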
A scoring model for graphs, not loops
Concretely, stop reporting one pass rate for "the system." Report a matrix:
- Node scores: each agent in isolation, as you do today.
- Edge scores: each handoff, reported as structural contract pass rate plus semantic quality.
- Path scores: end-to-end on real routes (triage->billing, triage->technical), because an agent can be locally correct and globally useless.
The edge and path scores are the new information. They are also where regressions hide, because a prompt change to one agent can pass that agent's node eval while quietly breaking the contract its downstream neighbor depends on. Catch it at the edge, then jump to the AgentLens trace to see the field that changed.
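One way to hold that matrix in code is sketched below. The `EvalResult` shape, the `scoreMatrix` and `belowThreshold` helpers, and the idea of gating every cell on a single threshold are assumptions for illustration; per-cell thresholds would be a natural refinement.

```typescript
type Scope = "node" | "edge" | "path";
// One graded sample: which layer it belongs to, which node/edge/path, and the verdict.
type EvalResult = { scope: Scope; id: string; passed: boolean };
type Matrix = Record<Scope, Record<string, number>>;

const SCOPES = ["node", "edge", "path"] as const;

// Aggregates individual results into a pass rate per node, edge, and path.
function scoreMatrix(results: EvalResult[]): Matrix {
  const counts: Record<Scope, Record<string, { pass: number; total: number }>> = {
    node: {}, edge: {}, path: {},
  };
  for (const r of results) {
    const cell = counts[r.scope][r.id] ?? { pass: 0, total: 0 };
    cell.total += 1;
    if (r.passed) cell.pass += 1;
    counts[r.scope][r.id] = cell;
  }
  const matrix: Matrix = { node: {}, edge: {}, path: {} };
  for (const scope of SCOPES) {
    for (const [id, c] of Object.entries(counts[scope])) {
      matrix[scope][id] = c.pass / c.total;
    }
  }
  return matrix;
}

// Returns every cell under the threshold, labeled "scope:id", for release gating.
function belowThreshold(matrix: Matrix, threshold: number): string[] {
  const failing: string[] = [];
  for (const scope of SCOPES) {
    for (const [id, rate] of Object.entries(matrix[scope])) {
      if (rate < threshold) failing.push(`${scope}:${id}`);
    }
  }
  return failing;
}

const matrix = scoreMatrix([
  { scope: "node", id: "triage", passed: true },
  { scope: "node", id: "triage", passed: true },
  { scope: "edge", id: "triage->router", passed: true },
  { scope: "edge", id: "triage->router", passed: false },
  { scope: "path", id: "triage->billing", passed: true },
]);
console.log(belowThreshold(matrix, 0.9)); // [ 'edge:triage->router' ]
```

In this toy run the triage node is perfect and the only failing cell is an edge, which is the situation a single system-wide pass rate would average away.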
The takeaway
Single-agent evals are a solved-enough problem. Multi-agent systems are not, because the unit of failure moves from the agent to the seam between agents, and almost no one is evaluating the seam. Assert the contract deterministically at every handoff, score your system as a graph with node/edge/path layers, and keep the eval signal welded to the trace that produced it — agent-eval to grade the seam, AgentLens to show you the byte that broke it. Your agents were never the problem. The handshake was.