DEV Community

I Found 54 Reliability Issues in My 14-Agent AI System — Here's What Broke

suraj kumar on May 31, 2026

Every testing tool for AI agents tests individual agents. But production failures don't happen inside agents — they happen between them. I learned...
 
Andrii Krugliak

The 54-versus-individual split is the part most testing misses. We saw the same thing: single agents passed every check, then broke at the handoff when one agent's output format drifted and the next assumed the old shape. Are your 54 mostly contract failures between agents, or state leaks?

suraj kumar

Great question — and the handoff format drift you described is one of the hardest failures to catch because both agents "work" individually.

Of the 54 findings, the breakdown is:

  • Structural cascade failures: 15 (CRITICAL) — agents whose failure propagates to 92-100% of the system. This is the "topology kills you" category.
  • Timeout/fragility: 22 — agents with single upstream dependencies and zero timeout handling. When the orchestrator slows down, 9 agents freeze with no fallback.
  • Coordination failures: 17 — 13 intent drift (low-privilege agents delegating to high-privilege ones without access control) + 4 collusion cliques (agents communicating directly, bypassing orchestrator oversight).
  • State/context leaks: 0 in this run — the system didn't have sensitive data in the test interactions. But the context leakage scanner now detects 23 pattern types (AWS keys, JWT tokens, credit cards, database connection strings, PII) crossing agent boundaries.

The contract failure you described — output format drift between agents — is actually a gap I'm actively working on for a future release. Right now swarm-test tests the structural graph (who connects to whom, what breaks if they fail). Output schema validation between agents is the next layer: "Agent A promised JSON with field X, Agent B received JSON without it." That's on the roadmap.
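
To make "what breaks if they fail" concrete, here is a minimal sketch of a blast-radius check on an agent graph. It is an illustration only, not swarm-test's implementation: the five-agent graph, the edge direction (an agent points to the agents that consume its output), and the function name `blast_radius` are all assumptions.

```python
from collections import deque


def blast_radius(graph, failed):
    """Return every agent reachable downstream of `failed`, excluding `failed` itself."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(failed)
    return seen


# Hypothetical hub-and-spoke graph: each key lists the agents that depend on it.
graph = {
    "orchestrator": ["planner", "analyst", "writer"],
    "analyst": ["reviewer"],
    "planner": [],
    "writer": [],
    "reviewer": [],
}

for agent in graph:
    affected = blast_radius(graph, agent)
    share = len(affected) / (len(graph) - 1)
    print(f"{agent}: {share:.0%} of the other agents affected")
```

In this toy graph the hub reaches every other agent, while a leaf such as `writer` reaches none. That asymmetry is what a graph-level view exposes.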

What framework were you using when you hit the format drift issue? Curious how the agents were passing context.

Andrii Krugliak

It wasn't one framework, it was a set of specialist agents picking up tasks, so the drift showed at the seam between two that had never run together. What helped was making each agent's output a checked contract the next one validates, instead of a blob it just trusts. That schema check between agents is the thing that catches the silent breaks.

suraj kumar

"Drift showed at the seam between two that had never run together" — that's the exact failure mode swarm-test is built to catch. Individual agents pass every test, but the interaction between them is where production breaks.

The checked contract approach is smart — schema validation at every agent boundary turns silent breaks into immediate errors. Right now swarm-test catches structural risks at the graph level (cascade paths, SPOFs, collusion). Output contract validation — "Agent A promised this schema, Agent B received something different" — is the natural next layer.

That's on the roadmap: defining expected output contracts per edge in the graph, then flagging mismatches during testing. Would be useful to know what your contract checks look like — are you validating JSON schema, or something more semantic?

Andrii Krugliak

Glad the seam framing landed. Catching the interaction failures unit tests miss is the hard part, and the checked-contract bit only worked once we treated each agent's output as data the next one validates instead of prose it trusts. Did your 15 cascades cluster around shared state or around format assumptions?

suraj kumar

Great question — the 15 cascades clustered around both, but format assumptions were the silent killer. Shared state failures at least threw errors. Format drift just propagated bad data quietly.

That's exactly why we just shipped output contract validation in swarm-test v0.2.6. You define expected schemas per agent in a contracts.yaml:

analyst:
  required: [analysis, confidence]
  properties:
    confidence: {type: number}

If analyst returns confidence as a string instead of a number — caught. If it drops a required field — caught. Edge-specific contracts too, so analyst→writer can enforce different schemas than analyst→reviewer.

Your "checked contract" pattern is literally what this implements. Missing required fields surface as CRITICAL, type mismatches as HIGH.
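
For readers who want to see the shape of such a check, here is a minimal sketch in plain Python. It is not swarm-test's code: the `check_contract` function, the `TYPE_MAP` table, and the dict form of the contract are assumptions made for illustration. Only the severity mapping (missing required field is CRITICAL, type mismatch is HIGH) comes from the description above.

```python
# Hypothetical contract, mirroring the analyst example as a plain dict.
CONTRACT = {
    "required": ["analysis", "confidence"],
    "properties": {"confidence": {"type": "number"}},
}

# JSON-style type names mapped to Python types (assumed subset).
TYPE_MAP = {
    "number": (int, float),
    "string": str,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_contract(output, contract):
    """Return a list of (severity, message) findings for one agent output."""
    findings = []
    for field in contract.get("required", []):
        if field not in output:
            findings.append(("CRITICAL", f"missing required field '{field}'"))
    for field, spec in contract.get("properties", {}).items():
        if field not in output:
            continue
        value = output[field]
        type_ok = isinstance(value, TYPE_MAP[spec["type"]])
        # bool is a subclass of int in Python, so reject it for "number".
        if spec["type"] == "number" and isinstance(value, bool):
            type_ok = False
        if not type_ok:
            findings.append(
                ("HIGH", f"field '{field}' should be {spec['type']}, got {type(value).__name__}")
            )
    return findings


print(check_contract({"analysis": "ok", "confidence": "0.9"}, CONTRACT))
print(check_contract({"confidence": 0.9}, CONTRACT))
```

The first call reports the string-typed `confidence` as HIGH and the second reports the dropped `analysis` field as CRITICAL.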

pip install swarm-test --upgrade

Would love to know if the schema format maps to what your contract checks look like — JSON Schema or something more semantic?

Harjot Singh

54 issues in a 14-agent system is the math nobody warns you about: reliability compounds multiplicatively, not additively. If each agent is 95% reliable, 14 of them in a chain is 0.95^14 ≈ 49% end-to-end - so a system of individually-"good" agents is a coin flip overall. That's why naive multi-agent setups feel magical in the demo and fall apart in practice: the failure surface is the product of every handoff.
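
The arithmetic is easy to check. This sketch assumes a strictly linear chain with independent failures, which is a simplification, and the helper names are hypothetical.

```python
def chain_reliability(per_agent, agents):
    """End-to-end success probability of a linear chain of independent agents."""
    return per_agent ** agents


def required_per_agent(target, agents):
    """Per-agent reliability needed to reach `target` end-to-end."""
    return target ** (1 / agents)


print(f"{chain_reliability(0.95, 14):.1%}")   # prints 48.8%
print(f"{required_per_agent(0.90, 14):.2%}")  # per-agent reliability needed for 90% end-to-end
```

Under these assumptions, reaching 90% end-to-end across 14 agents takes a little over 99% per agent.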

The implication is that adding agents without adding gates between them makes things worse, not better - each new agent is another multiplication against your reliability. The fixes that actually move the needle: verification at each handoff (so errors fail fast instead of cascading), idempotent retries, and ruthlessly minimizing the number of agents to the fewest that do the job. It's exactly what I obsess over in Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) - fewer agents, hard gates between them, so the chain doesn't multiply its way to failure. ~$3 flat, first run free. Excellent writeup - of the 54, what was the most common class: handoff/context loss, cascading errors, or coordination races?

suraj kumar

The reliability multiplication math is exactly right — and it's worse than the chain model suggests, because real multi-agent systems aren't linear chains. They're graphs with cycles, shared dependencies, and fan-out patterns that compound failure in non-obvious ways.

To answer your question — the breakdown of the 54 findings across the 14-agent system:

Cascading errors: 15 findings (all CRITICAL), the dominant class. Every agent had 92%+ blast radius because the hub-and-spoke architecture around OrchestratorAgent means any single failure propagates to 12 of 14 agents.

Timeout/fragility: 22 findings — 9 HIGH fragile dependencies (single upstream, no fallback) + 13 MEDIUM (zero timeout handling across every agent). This is your "handoff loss" category — agents with no graceful degradation when upstream is slow.

Coordination failures: 17 findings — 13 intent drift (peripheral agents delegating directly to the orchestrator without access control) + 4 collusion cliques (agents communicating outside orchestrator oversight).

The most dangerous finding wasn't any single class — it was the combination. OrchestratorAgent scored 4/100 health: it's a SPOF with 92% blast radius, belongs to 3 collusion cliques, sits on the critical path, and has zero timeout handling. No individual test catches that. The graph-level view is what surfaces it.

That's what swarm-test does — maps the interaction graph and finds where reliability compounds against you.

Harjot Singh

54 reliability issues in a 14-agent system is the most honest thing anyone's posted about multi-agent systems. The failure surface scales with agent count and the bugs are mostly at the seams (handoffs, shared state, partial failures), not inside any single agent. I run a 14-agent pipeline myself in Moonshift and hit the same truth: the agents are the easy part, the orchestration reliability (retries, state passing, one bad agent not poisoning the whole run, verifying each step before the next) is where the real engineering and most of the bugs live. Categorizing 54 of them is genuinely useful work. What was the most common failure class, state handoff or silent wrong-output that passed downstream?

suraj kumar

Appreciate that — and you nailed the core insight: "the agents are the easy part, the orchestration reliability is where the bugs live."

To answer your question — the 54 broke down like this:

The dominant class was cascade topology (15 CRITICAL): not a single agent misbehaving, but the graph structure itself being fragile. My OrchestratorAgent connects to 12 of 14 agents with zero redundancy — any failure cascades to 92% of the system. That's not a bug in any agent's code. It's an architecture problem that's invisible until you map the interaction graph.

Second was timeout fragility (22 findings): 9 agents have a single upstream dependency with no fallback. If the orchestrator slows down, they don't time out and retry, they just freeze. Silent hang, no error, no log.
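
The usual fix for that kind of hang is an explicit timeout with a fallback. A minimal sketch using `asyncio.wait_for` from the standard library follows. The `call_with_timeout` wrapper and the `slow_orchestrator` stand-in are hypothetical, and a real system would likely retry before falling back.

```python
import asyncio


async def call_with_timeout(upstream, timeout_s, fallback):
    """Await `upstream()`; on timeout, log and return `fallback` instead of hanging."""
    try:
        return await asyncio.wait_for(upstream(), timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"upstream exceeded {timeout_s}s, using fallback")
        return fallback


async def slow_orchestrator():
    # Hypothetical stand-in for an orchestrator that has slowed down.
    await asyncio.sleep(1.0)
    return {"task": "summarize"}


result = asyncio.run(call_with_timeout(slow_orchestrator, 0.05, {"task": None}))
print(result)
```

The caller gets the fallback value and a log line after 50 ms instead of waiting on the slow upstream with no error.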

Third was coordination drift (17 findings): agents delegating to the wrong privilege level, and four groups forming communication cliques that bypass the orchestrator entirely.
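
As a rough illustration of the privilege check, here is a sketch that flags delegations from a lower-privilege agent to a higher-privilege one unless a policy explicitly allows them. The privilege levels, the edge list, the allowlist, and the function name are all hypothetical, and this is not how swarm-test necessarily models it.

```python
# Hypothetical privilege levels (higher number means more privilege).
PRIVILEGE = {"orchestrator": 3, "analyst": 2, "writer": 1, "reviewer": 1}

# Hypothetical delegation edges: (delegating agent, receiving agent).
DELEGATIONS = [
    ("writer", "orchestrator"),
    ("analyst", "writer"),
    ("reviewer", "analyst"),
]

# Upward delegations explicitly approved by an access-control policy.
ALLOWED_ESCALATIONS = {("reviewer", "analyst")}


def find_intent_drift(delegations, privilege, allowed):
    """Return upward delegations that no access-control rule covers."""
    return [
        (src, dst)
        for src, dst in delegations
        if privilege[src] < privilege[dst] and (src, dst) not in allowed
    ]


drift = find_intent_drift(DELEGATIONS, PRIVILEGE, ALLOWED_ESCALATIONS)
print(drift)
```

Here only `writer` delegating to `orchestrator` is flagged, because the `reviewer` to `analyst` escalation is on the allowlist and `analyst` to `writer` goes downward.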

The "silent wrong-output that passed downstream" pattern you're describing is actually the next frontier I'm building toward: output schema validation between agents. Right now swarm-test catches the structural and topology failures. Catching "Agent A sent malformed JSON that Agent B silently accepted and propagated" requires runtime trace analysis, which is on the paid platform roadmap.

With 14 agents in your Moonshift pipeline — would be curious what swarm-test surfaces on your topology. Happy to run it if you want to share the agent graph structure.

suraj kumar

If anyone wants to see what swarm-test finds on their own agent system, I'm happy to run it and share the results. Just describe your agent setup (framework, number of agents, how they connect) and I'll generate a report.