Here's a failure mode that no single-agent eval will catch.
A crash tracker agent starts returning slightly different classifications after a model update. Not wrong — different. Where it used to call things crash_regression, it now splits some of them into performance_degradation. Subtle. Defensible, even.
The telemetry analyzer downstream doesn't break. It correlates dutifully against the new categories. But its correlations shift, because it's now grouping incidents differently. The PR creator still opens PRs — correct PRs, for the new classifications. But a human reviewing them notices: "why are we treating this latency spike as a performance issue instead of a crash regression?"
No component throws errors. But behavior changed — quietly, across the pipeline. Agent-level evals didn't catch it because they test each agent in isolation: "given this input, is the output good?" The crash tracker's output was good. It just drew a boundary differently than before, and everything downstream shifted with it.
You can't reliably catch this with only ad-hoc spot checks or single-agent evals.
The fix is straightforward: ten canonical crash logs as a weekly regression suite — fixed inputs with expected classification labels. When the model draws a boundary differently, the test fails before production sees it. Simple, boring, effective.
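A minimal sketch of that weekly check, assuming the current labels have already been collected from the agent. The case IDs, labels, and helper name (findBoundaryShifts) are illustrative, not from a real codebase:

```typescript
// Pinned expectations for the canonical crash logs.
interface GoldenCase {
  id: string;
  expected: string; // e.g. "crash_regression"
}

// Any mismatch against the pinned label is a boundary shift —
// flag it even when the new classification is defensible.
function findBoundaryShifts(
  cases: GoldenCase[],
  currentLabels: Record<string, string>,
): string[] {
  return cases
    .filter((c) => currentLabels[c.id] !== c.expected)
    .map((c) => `${c.id}: expected ${c.expected}, got ${currentLabels[c.id]}`);
}

const golden: GoldenCase[] = [
  { id: "log-001", expected: "crash_regression" },
  { id: "log-002", expected: "crash_regression" },
];

// Simulated labels after a model update: one boundary drifted.
const shifts = findBoundaryShifts(golden, {
  "log-001": "crash_regression",
  "log-002": "performance_degradation",
});
console.log(shifts.length); // 1 → fail the suite before production sees it
```

In a real suite each case would run through the live agent on a schedule; the comparison logic stays this simple.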
But the point is that this testing should exist from the start — not after an incident reveals the gap.
Why Evals Are Necessary But Not Sufficient
Let me be clear: evals are good. You should have them. "Given this input, does the output meet quality criteria?" is a real question that deserves a real answer.
But most teams run unit evals — testing one agent's output quality without running the downstream workflow. What's missing are integration and system evals that test what happens when agents are wired together. That's where the interesting failures live.
What happens when one agent's output is subtly degraded and the next agent builds on it? Cascade failure. What happens when two agents run concurrently and both try to create a PR for the same issue? Race condition. What happens when the telemetry service is slow and the agent times out mid-analysis? Partial failure. What happens when a model update shifts an agent's classification boundaries? Drift.
These are distributed systems failure modes. Every backend engineer has war stories about them. We have decades of tooling for testing them in traditional systems: contract tests, chaos engineering, snapshot testing, SLOs.
Multi-agent systems are distributed systems. We should test them like it.
Contract Testing for Agent Handoffs
The structured summary packets from the agents lie article aren't just good architecture — they're testable contracts. Quick recap: instead of passing raw agent output between agents, the orchestrator translates each agent's response into a handoff packet: a versioned, typed JSON object containing only the facts the next step needs, with the LLM's reasoning and prose stripped away. Only typed fields cross the boundary.
Every handoff has a schema. That schema IS the contract. And contracts can be validated:
// validate() here is Zod's safeParse or JSON Schema (Ajv) —
// any schema validator that works on plain objects
import { test, expect } from "vitest";

test("crash tracker output conforms to schema", async () => {
const output = await crashTracker.analyze(KNOWN_CRASH_LOG);
validate(output, "crash_regression_v1");
expect(output.pattern_type).toBeDefined();
expect(output.affected_component).toBeDefined();
expect(output.confidence).toBeGreaterThanOrEqual(0);
expect(output.confidence).toBeLessThanOrEqual(1);
});
test("orchestrator strips reasoning", () => {
const raw = {
pattern_type: "crash_regression",
confidence: 0.72,
reasoning: "Looks like a race condition...",
};
const handoff = orchestrator.translate(raw, "telemetry_analyzer");
expect(handoff.reasoning).toBeUndefined(); // reduce downstream variance
expect(handoff.signal_strength).toBeDefined(); // confidence bucketed to low/med/high
});
test("handoff has all required fields", () => {
const handoff = orchestrator.translate(
SAMPLE_CRASH_OUTPUT,
"telemetry_analyzer",
);
const required = [
"schema",
"pattern_type",
"affected_component",
"timestamp_range",
"signal_strength",
"request",
];
for (const field of required) {
expect(handoff).toHaveProperty(field);
}
});
test("schema backward compatibility", () => {
const oldOutput = loadFixture("crash_tracker_v1_output.json");
const handoff = orchestrator.translate(oldOutput, "telemetry_analyzer");
validate(handoff, "cross_agent_v1");
});
The key insight: because handoffs are structured JSON with versioned schemas, you can test them exactly like API contracts. When you validate stored fixtures or stub the LLM call, no model invocation is needed — no flaky assertions about "output quality." Does the JSON conform? Do the required fields exist? Does the orchestrator strip what it's supposed to strip? Schema validation prevents integration breakage; it doesn't guarantee the content is true — that's what the layers above are for.
This is the same discipline as consumer-driven contract tests in microservices. The downstream agent is the consumer. The upstream agent is the provider. The schema is the contract. Break the contract, break the build.
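To make the provider/consumer framing concrete, here's a minimal hand-rolled check for the cross_agent_v1 contract. The field list follows the required-fields test earlier; the validator and sample values are illustrative — in practice you'd reach for Zod or Ajv as noted above:

```typescript
// The contract: field name -> expected typeof. Illustrative types —
// your real schema would be richer (enums, formats, nesting).
const CROSS_AGENT_V1: Record<string, string> = {
  schema: "string",
  pattern_type: "string",
  affected_component: "string",
  timestamp_range: "string",
  signal_strength: "string",
  request: "string",
};

function conformsTo(
  obj: Record<string, unknown>,
  contract: Record<string, string>,
): string[] {
  const errors: string[] = [];
  for (const [field, type] of Object.entries(contract)) {
    if (typeof obj[field] !== type) {
      errors.push(`${field}: expected ${type}, got ${typeof obj[field]}`);
    }
  }
  // Also a violation if the provider leaks extra fields like `reasoning`.
  for (const field of Object.keys(obj)) {
    if (!(field in contract)) errors.push(`${field}: not in contract`);
  }
  return errors;
}

// Hypothetical handoff that satisfies the contract.
const handoff: Record<string, unknown> = {
  schema: "cross_agent_v1",
  pattern_type: "crash_regression",
  affected_component: "media_pipeline",
  timestamp_range: "2024-06-01T00:00Z/2024-06-02T00:00Z",
  signal_strength: "high",
  request: "correlate with telemetry",
};
console.log(conformsTo(handoff, CROSS_AGENT_V1).length); // 0 → contract holds
```

Break the contract — a missing field, a leaked `reasoning` key — and the error list is non-empty, which is exactly what "break the build" hangs on.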
Fault Injection: What Happens When One Agent Returns Garbage?
Chaos engineering for agents. The question isn't "does the agent work?" The question is "what happens when it doesn't?"
class FaultInjector {
constructor(private agent: Agent) {}
/** Valid schema, semantically nonsensical */
async garbageResponse(input: InputData) {
const output = await this.agent.analyze(input);
return {
...output,
pattern_type: "crash_regression",
affected_component: "definitely_not_real",
confidence: 0.99,
};
}
/** Agent never responds — simulates a hung downstream service */
timeout(_input: InputData) {
return new Promise<never>(() => {}); // never resolves
}
/** Valid JSON, missing non-critical fields */
async partialResponse(input: InputData) {
const { platform, trigger, context, ...rest } =
await this.agent.analyze(input);
return rest;
}
/** Everything comes back with suspiciously low confidence */
async confidenceAnomaly(input: InputData) {
const output = await this.agent.analyze(input);
return { ...output, confidence: 0.01 };
}
}
Four failure modes, four tests:
Garbage response. Inject a semantically wrong but schema-valid output. Does the orchestrator catch it? Does the downstream agent produce garbage, or does it gracefully degrade? An orchestrator that checks for known component names would reject definitely_not_real. Without that check, a hallucinated component name passes schema validation and sends the telemetry analyzer on a wild goose chase.
Timeout. Make an agent exceed the orchestrator's configured deadline. Does the orchestrator wait forever? It shouldn't. Every agent dispatch has a deadline. If the crash tracker hasn't responded in 15 seconds, the orchestrator marks it as timed out, logs the incident, and continues the workflow without that input. The downstream agent gets a handoff packet with a source_status: "unavailable" field.
Partial response. Valid JSON, but missing optional fields like platform or trigger. Does the telemetry analyzer crash, or does it correlate with what it has? This test caught a bug where the telemetry analyzer assumed platform was always present and threw a KeyError when it wasn't.
Confidence anomaly. Everything comes back at 0.01 confidence. The orchestrator should flag this as anomalous — a well-functioning crash tracker doesn't return near-zero confidence on everything. This is a canary for model degradation or prompt corruption.
You don't need a chaos engineering framework. You need a wrapper that corrupts outputs in predictable ways and four tests that assert the system handles it.
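The handling side of the timeout case can be sketched with Promise.race. The deadline behavior and the source_status: "unavailable" marker follow the description above; dispatchWithDeadline and the Handoff shape are illustrative names:

```typescript
interface Handoff {
  source_status: "ok" | "unavailable";
  payload?: Record<string, unknown>;
}

async function dispatchWithDeadline(
  call: () => Promise<Record<string, unknown>>,
  deadlineMs: number,
): Promise<Handoff> {
  const timer = new Promise<"timeout">((resolve) =>
    setTimeout(() => resolve("timeout"), deadlineMs),
  );
  const result = await Promise.race([call().then((p) => ({ p })), timer]);
  if (result === "timeout") {
    // Log the incident and let the workflow continue without this input.
    return { source_status: "unavailable" };
  }
  return { source_status: "ok", payload: result.p };
}

// A hung agent (the timeout() fault above) degrades instead of blocking.
const hung = () => new Promise<Record<string, unknown>>(() => {});
dispatchWithDeadline(hung, 50).then((h) => console.log(h.source_status)); // "unavailable"
```

The fault-injection test then asserts that the downstream agent receives this degraded handoff rather than waiting on one that never arrives.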
Snapshot Testing for Orchestration Flows
Record a full workflow. Dispatch → agent calls → handoffs → result. Serialize it as a trace. Snapshot it.
{
"trace_id": "golden-crash-workflow-001",
"trigger": "crash_spike_detected",
"steps": [
{
"agent": "crash_tracker",
"input_schema": "crash_spike_trigger_v1",
"output_schema": "crash_regression_v1",
"duration_ms": 3200,
"cost_usd": 0.003
},
{
"agent": "orchestrator",
"action": "translate",
"input_schema": "crash_regression_v1",
"output_schema": "cross_agent_v1"
},
{
"agent": "telemetry_analyzer",
"input_schema": "cross_agent_v1",
"output_schema": "telemetry_correlation_v1",
"duration_ms": 4100,
"cost_usd": 0.003
},
{
"agent": "pr_creator",
"input_schema": "fix_request_v1",
"output_schema": "pr_draft_v1",
"duration_ms": 8700,
"cost_usd": 0.15
}
],
"total_cost_usd": 0.156,
"total_duration_ms": 16000
}
This is not testing exact output — that's brittle with LLMs and will break every time a model gets updated. (Pin model versions, set temperature to 0, and use a seed where the API supports it to reduce variance — but still assert structure, not prose. Temperature 0 doesn't guarantee determinism across hosted model updates.) This is testing three things:
Flow structure. Same agents called in the same order? If a prompt change causes the orchestrator to skip the telemetry analyzer, the trace diff shows it instantly. You didn't mean to change routing. The snapshot caught it.
Schema conformance. Every handoff in the trace validated against its schema? If an agent starts producing outputs that don't match the expected schema version, the snapshot test fails before anything downstream sees it.
Budget. Did this workflow cost more than the baseline? If the PR creator suddenly starts using 3x more tokens because a prompt change made it chattier, the cost assertion catches it. Use a percentage threshold (e.g., >150% of baseline) rather than exact amounts — costs fluctuate with model routing, tokenization changes, and tool call verbosity. The golden trace says this workflow costs ~$0.16. If it starts costing $0.50, something changed.
Keep a set of golden traces — five to ten known-good workflows that cover your critical paths. Run them on every change. Diff the traces. Review the diffs like you'd review a code diff.
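A trace diff covering those three checks can be quite small. The step shape follows the trace JSON above; diffTrace and the 150% budget threshold are illustrative:

```typescript
interface TraceStep {
  agent: string;
  output_schema?: string;
}
interface Trace {
  steps: TraceStep[];
  total_cost_usd: number;
}

function diffTrace(golden: Trace, current: Trace): string[] {
  const problems: string[] = [];
  // 1. Flow structure: same agents, same order.
  const goldenFlow = golden.steps.map((s) => s.agent).join(" -> ");
  const currentFlow = current.steps.map((s) => s.agent).join(" -> ");
  if (goldenFlow !== currentFlow)
    problems.push(`routing changed: ${currentFlow}`);
  // 2. Schema conformance: same schema version at each step.
  golden.steps.forEach((s, i) => {
    if (current.steps[i]?.output_schema !== s.output_schema)
      problems.push(`step ${i} schema drifted`);
  });
  // 3. Budget: percentage threshold, never an exact amount.
  if (current.total_cost_usd > golden.total_cost_usd * 1.5)
    problems.push(`cost ${current.total_cost_usd} exceeds 150% of baseline`);
  return problems;
}

const golden: Trace = {
  steps: [
    { agent: "crash_tracker", output_schema: "crash_regression_v1" },
    { agent: "telemetry_analyzer", output_schema: "telemetry_correlation_v1" },
  ],
  total_cost_usd: 0.16,
};
const current: Trace = { ...golden, total_cost_usd: 0.5 };
console.log(diffTrace(golden, current)); // flags the cost blowup
```

An empty diff means the workflow still has the shape you blessed; anything else is a review-worthy change.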
The SLO Question: What Does "Reliable" Mean for an Agent?
Traditional SLOs — 99.9% uptime, p95 latency under 200ms — don't map directly to agent systems. Your agent can be "up" and still be useless if it's classifying everything wrong. You still need classic SLOs (tool latency, API errors, queue depth), but they're not sufficient.
Agent-specific SLOs worth tracking (targets here are examples — baseline your own system first):
Task success rate. Percentage of completed workflow runs that produce a result a human actually uses. Denominator: all workflow runs that reached a terminal state. A PR that gets merged counts. A PR that gets immediately closed doesn't. Target: 85%+ over a rolling 7-day window.
Cost per successful outcome. Not cost per API call — cost per result that a human actually used. If 30% of your PRs get closed, your real cost per useful PR is ~1.4x what your token bill says. This is the number that matters for ROI conversations.
Classification stability. Does the crash tracker classify the same input the same way over time? Run ten canonical crash logs through it weekly. Track per-label consistency: if a log classified as crash_regression last week is now performance_degradation, that's a boundary shift — flag it regardless of whether the new classification is "better." Target: 95%+ label consistency week over week. (Real changes in underlying data are expected; the test catches unintended drift from model updates or prompt changes.)
Cascade failure rate. When an upstream agent degrades, how often does it cause downstream failures? Measured as: (workflow runs where a downstream agent failed and the upstream agent's output was flagged as degraded) / (total workflow runs with upstream degradation). If the crash tracker has a bad day, do downstream agents gracefully degrade or fall over? Target: under 10%.
Time to detection. When an agent starts drifting, how long until you notice? Measured from first anomalous output to first alert. If the crash tracker's classifications shifted three weeks ago and you just now noticed — that's a three-week detection gap. The canary queries and golden traces above shrink this to hours. Target: under 4 hours for critical agents.
These SLOs are measurable because you have structured handoffs and correlation IDs linking every step in a workflow. The boring infrastructure work — schemas, trace IDs, structured logging — pays for itself here.
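As a sketch of what that measurement looks like, here are the first two SLOs computed from workflow records keyed by correlation ID. The record shape and field names are illustrative:

```typescript
interface WorkflowRun {
  trace_id: string;
  terminal: boolean; // reached a terminal state
  outcome_used: boolean; // e.g. the PR was merged, not closed
  cost_usd: number;
}

// Denominator: runs that reached a terminal state, per the definition above.
function taskSuccessRate(runs: WorkflowRun[]): number {
  const terminal = runs.filter((r) => r.terminal);
  return terminal.filter((r) => r.outcome_used).length / terminal.length;
}

// Total spend divided by results a human actually used —
// this is what inflates when PRs get closed instead of merged.
function costPerSuccessfulOutcome(runs: WorkflowRun[]): number {
  const spent = runs.reduce((sum, r) => sum + r.cost_usd, 0);
  const useful = runs.filter((r) => r.terminal && r.outcome_used).length;
  return spent / useful;
}

const runs: WorkflowRun[] = [
  { trace_id: "a", terminal: true, outcome_used: true, cost_usd: 0.16 },
  { trace_id: "b", terminal: true, outcome_used: true, cost_usd: 0.16 },
  { trace_id: "c", terminal: true, outcome_used: false, cost_usd: 0.16 },
];
console.log(taskSuccessRate(runs).toFixed(2)); // "0.67"
console.log(costPerSuccessfulOutcome(runs).toFixed(2)); // "0.24"
```

One closed PR out of three pushes the cost per useful PR from $0.16 to $0.24 — the 1.4x inflation described above, made visible in a dashboard number.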
A Minimal Test Harness You Can Build This Week
You don't need a specialized AI testing framework. Your existing Vitest or Jest setup is enough. You need five tests that catch the failures evals miss.
Here's the checklist:
┌──────────────────────────────────────────────────────┐
│ Agent Test Pyramid │
│ │
│ ╱╲ │
│ ╱ ╲ Golden Traces │
│ ╱ GT ╲ (5-10 per critical │
│ ╱──────╲ workflow) │
│ ╱ ╲ │
│ ╱ Fault ╲ Fault Injection │
│ ╱ Injection ╲ (1 per agent) │
│ ╱──────────────╲ │
│ ╱ ╲ │
│ ╱ Schema / Contract╲ Contract Tests │
│ ╱ Validation ╲ (every handoff) │
│ ╱──────────────────────╲ │
│ ╱ ╲ │
│ ╱ Canaries + Cost ╲ Canary Queries │
│ ╱ Assertions ╲ (1 per agent) │
│ ╱──────────────────────────────╲ │
│ │
│ Run in CI ──────────────────────── Run on schedule │
└──────────────────────────────────────────────────────┘
Bottom layer: Canary queries + cost assertions. One known input per agent, assert the output shape is correct. One cost assertion per critical workflow: "this should cost under $0.20." Run on every deploy.
Schema/contract validation. JSON Schema tests for every handoff point. Does the crash tracker's output conform? Does the orchestrator's translation conform? Does the downstream agent accept it? Run against fixtures in CI — no LLM calls needed.
Fault injection. One test per agent: inject garbage, assert graceful degradation. Does the orchestrator catch bad output? Does the downstream agent handle missing fields? Run in CI.
Golden traces. One snapshot per critical workflow. Replay on every change to prompts, schemas, or routing rules. Diff the traces. Review the diffs. Run on schedule and on prompt changes.
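The bottom layer needs no LLM calls at all. A minimal sketch of a canary shape check plus a cost assertion; assertShape, the fixture, and the $0.20 budget are illustrative:

```typescript
// Shape check: the canary fails loudly on any missing field.
function assertShape(
  output: Record<string, unknown>,
  fields: string[],
): void {
  for (const f of fields) {
    if (!(f in output)) throw new Error(`canary failed: missing ${f}`);
  }
}

const MAX_WORKFLOW_COST_USD = 0.2; // "this should cost under $0.20"

// In CI this would be the agent's response to one known input;
// here a stored fixture stands in, so no model invocation is needed.
const canaryOutput: Record<string, unknown> = {
  pattern_type: "crash_regression",
  affected_component: "renderer",
  confidence: 0.8,
};
assertShape(canaryOutput, ["pattern_type", "affected_component", "confidence"]);

const workflowCost = 0.156; // from the golden trace
if (workflowCost >= MAX_WORKFLOW_COST_USD) {
  throw new Error(`workflow cost ${workflowCost} exceeds budget`);
}
console.log("canary + cost assertions passed");
```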
Total setup time: a day, maybe two — if you already have structured outputs and tracing. If you're starting from raw text outputs, budget a week to add schemas first (which you should do regardless). Total ongoing maintenance: update golden traces when you intentionally change behavior. That's it.
The Punchline
Evals tell you if your agent is smart. These tests tell you if your system is reliable. You need both.
The eval catches "this agent's output quality dropped." The contract test catches "this agent's output doesn't match what the next agent expects." The fault injection catches "this agent's failure takes down the pipeline." The golden trace catches "this workflow quietly changed shape and nobody noticed." The SLO catches "this system is slowly getting worse and we haven't noticed yet."
Different failure modes. Different tests. Same system.
Multi-agent systems are distributed systems. Test them like it.
For the architecture being tested here, see Part 1: Fleet Architecture (container isolation, tiered LLMs, deterministic routing) and Part 2: Security (JIT tokens, zero-trust, self-healing workflows). For the structured handoff contracts these tests validate, see Agents Lie to Each Other.
The multi-LLM patterns used in the orchestrator's validation layer — council discussions, structured voting, adversarial debate — are open-source in mcp-rubber-duck.
