suraj kumar

I Found 54 Reliability Issues in My 14-Agent AI System — Here's What Broke

Every testing tool for AI agents tests individual agents. But production failures don't happen inside agents — they happen between them.

I learned this the hard way.

The Problem Nobody Is Solving

I built a 14-agent document processing system using CrewAI. Each agent worked perfectly in isolation. In production, the system failed constantly — and I couldn't figure out why.

The problem wasn't any single agent. It was the interactions:

  • One agent failing silently took down 12 others
  • Agents were sharing sensitive data across boundaries they shouldn't cross
  • Three agents formed a communication clique that bypassed the orchestrator
  • Every agent depended on one central orchestrator with zero fallback

No existing tool could find these issues. Arize, Langfuse, Braintrust — they all monitor individual agents. None of them test the graph of agent interactions.

So I built one.

What I Built: swarm-test

swarm-test builds a NetworkX interaction graph of your multi-agent system and runs 6 chaos engineering tests against it:

  1. Cascade Failure — which agents bring down the whole system if they fail
  2. Context Leakage — sensitive data (API keys, PII, credentials) crossing agent boundaries
  3. Intent Drift — agents acting outside their role or being manipulated
  4. Collusion Detection — agents communicating outside the orchestrator's oversight
  5. Blast Radius — single points of failure and critical dependency paths
  6. Timeout Resilience — agents with no fallback if upstream is slow
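
To make the cascade and blast-radius tests concrete, here is a minimal sketch of the underlying graph idea. It is not swarm-test's implementation: the tool is described as using NetworkX, while this version uses a plain dict and a breadth-first search so it runs on the standard library alone. The agent names, the edges, and the `blast_radius` helper are all hypothetical.

```python
from collections import deque

def blast_radius(graph: dict[str, list[str]], failed: str) -> float:
    """Fraction of the *other* agents reachable downstream of `failed`."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    others = len(graph) - 1
    return (len(seen) - 1) / others if others else 0.0

# Hypothetical topology: an edge A -> B means "B depends on A".
graph = {
    "Orchestrator": ["Parser", "Summarizer", "Reviewer"],
    "Parser": ["Summarizer"],
    "Summarizer": [],
    "Reviewer": [],
    "Logger": [],
}
radii = {agent: blast_radius(graph, agent) for agent in graph}
```

In this toy graph the hub takes down 3 of the 4 other agents, while a leaf like `Logger` takes down none. Ranking agents by that fraction is one simple way to decide which failure to test first.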

3-line API:

from swarm_test import SwarmProbe

probe = SwarmProbe(crew)
report = probe.run_all()
report.print_summary()

What It Found On My Real System

I ran swarm-test on my 14-agent system. The results were brutal:

54 total findings:

  • 15 CRITICAL (14 cascade failures + 1 SPOF)
  • 13 HIGH (9 timeout vulnerabilities + 4 collusion cliques)
  • 26 MEDIUM (13 intent drift + 13 missing timeout handling)

The worst agent: OrchestratorAgent scored 4 out of 100. It's a single point of failure with 92% blast radius — if it fails, 12 of 14 agents go down. And it had zero timeout handling.

The scariest finding: EvolutionAgent has 100% blast radius. If it fails, every other agent in the system is affected.

Three agents (OrchestratorAgent, FileOptimizerAgent, PrintOptimizerAgent) formed a collusion clique — communicating directly with each other and bypassing orchestrator oversight.

None of this was visible from testing individual agents. It only appeared when I tested the interaction graph.
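
A clique like the one above shows up in the graph as a group of agents that all talk to each other directly. The sketch below is a rough illustration, not swarm-test's detector: it treats the interaction graph as undirected and only looks for groups of exactly three. The first three agent names come from the post; the edge list and `ParserAgent` are made up for the example.

```python
from itertools import combinations

def triangles(edges):
    """Return every trio of agents in which all three pairs are connected."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = []
    for a, b, c in combinations(sorted(neighbors), 3):
        if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]:
            found.append((a, b, c))
    return found

edges = [
    ("OrchestratorAgent", "FileOptimizerAgent"),
    ("OrchestratorAgent", "PrintOptimizerAgent"),
    ("FileOptimizerAgent", "PrintOptimizerAgent"),  # assumed direct side channel
    ("OrchestratorAgent", "ParserAgent"),           # hypothetical extra agent
]
cliques = triangles(edges)
```

Checking every trio is cubic in the number of agents, which is fine for a 14-agent system but would need a proper clique algorithm on larger graphs.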

I Shipped 7 Features in 7 Days

After launching, I shipped one feature every day:

Day | Feature | Impact
----|---------|-------
0 | Launch: 5 chaos tests, GitHub + PyPI | First multi-agent testing tool on PyPI
1 | Timeout resilience test | Found 22 new issues in my system
2 | JSON export | Another developer integrated it into his runtime gate within hours
3 | LangGraph adapter | Now supports CrewAI + LangGraph
4 | Sensitive data detection (23 patterns) | Catches AWS keys, JWT tokens, credit cards crossing agent boundaries
5 | Per-agent health scores (0-100) | Know exactly which agent to fix first
6 | Before/after comparison | Measure if your refactor actually improved reliability
7 | ASCII agent graph | See your agent topology right in the terminal

94 tests passing. Two frameworks supported. And growing.

The First Integration

Within 48 hours of launch, another developer built an integration. He has a runtime action-gate that blocks dangerous agent actions before execution. He connected swarm-test's findings as "priors" — so when swarm-test flags an edge as high-risk, his gate becomes more cautious on that edge.

The result: the same run_sql action went from "CONFIRM" (risk 62) to "HUMAN_REQUIRED" (risk 78) when swarm-test's cascade finding was attached.

Structural testing (swarm-test) + runtime enforcement (his gate) = the full reliability stack for multi-agent systems.

Why This Matters Now

According to recent industry research:

  • 88% of organizations report AI agent security incidents
  • Only 14.4% of agents go live with full security approval
  • OWASP classified cascade failures as ASI08 — a top AI security risk

Multi-agent systems are going to production faster than anyone can secure them. The tools exist for single-agent monitoring. Nothing existed for multi-agent interaction testing — until now.

Try It

pip install swarm-test

from swarm_test import SwarmProbe

# Works with CrewAI
probe = SwarmProbe(your_crew)
report = probe.run_all()
report.print_summary()
report.to_html("report.html")  # Interactive D3 graph
report.to_json("report.json")  # Machine-readable for CI/CD
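
One way a machine-readable report could gate a CI job is sketched below. The post does not show the layout of report.json, so the `findings` list and its `severity` field are invented for illustration, as is the `should_fail` helper. Only the severity names (MEDIUM, HIGH, CRITICAL) come from the post.

```python
import json

# Severity names as used in the post, lowest to highest.
SEVERITY_ORDER = ["MEDIUM", "HIGH", "CRITICAL"]

def should_fail(report: dict, fail_on: str = "HIGH") -> bool:
    """True if any finding is at or above the `fail_on` severity."""
    threshold = SEVERITY_ORDER.index(fail_on)
    return any(
        SEVERITY_ORDER.index(finding["severity"]) >= threshold
        for finding in report.get("findings", [])
    )

# Invented report shape; the real report.json may look different.
sample = json.loads(
    '{"findings": ['
    '{"agent": "OrchestratorAgent", "severity": "CRITICAL"},'
    '{"agent": "ParserAgent", "severity": "MEDIUM"}]}'
)
gate_blocks = should_fail(sample, fail_on="HIGH")
```

In a pipeline, the same function would be fed the parsed contents of the exported file and the job would exit non-zero when it returns True.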

GitHub: github.com/surajkumar811/swarm-test

Open source. MIT licensed. Solo founder building in public.

What reliability tests would YOU want for your multi-agent systems? Drop a comment — I'm shipping features based on real feedback.

Top comments (20)

Andrii Krugliak

The 54-versus-individual split is the part most testing misses. We saw the same thing: single agents passed every check, then broke at the handoff when one agent's output format drifted and the next assumed the old shape. Are your 54 mostly contract failures between agents, or state leaks?

suraj kumar

Great question — and the handoff format drift you described is one of the hardest failures to catch because both agents "work" individually.

Of the 54 findings, the breakdown is:

  • Structural cascade failures: 15 (CRITICAL) — agents whose failure propagates to 92-100% of the system. This is the "topology kills you" category.
  • Timeout/fragility: 22 — agents with single upstream dependencies and zero timeout handling. When the orchestrator slows down, 9 agents freeze with no fallback.
  • Coordination failures: 17 — 13 intent drift (low-privilege agents delegating to high-privilege ones without access control) + 4 collusion cliques (agents communicating directly, bypassing orchestrator oversight).
  • State/context leaks: 0 in this run — the system didn't have sensitive data in the test interactions. But the context leakage scanner now detects 23 pattern types (AWS keys, JWT tokens, credit cards, database connection strings, PII) crossing agent boundaries.
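
For a sense of what a context-leakage scan involves, here is a small sketch covering three of the pattern types named above. These regexes are common approximations written for this example, not the 23 patterns swarm-test ships, and the `scan` and `luhn_ok` helpers are hypothetical. Card numbers get a Luhn checksum so that arbitrary long digit strings are not flagged.

```python
import re

PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
    "credit_card": re.compile(r"\b\d{13,16}\b"),
}

def luhn_ok(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan(message: str) -> list[str]:
    """Return the names of the sensitive patterns found in one message."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(message):
            if name == "credit_card" and not luhn_ok(match.group()):
                continue  # digit run that is not a valid card number
            hits.append(name)
    return hits

# Documentation-style dummy values, not real credentials.
message = "use key AKIAIOSFODNN7EXAMPLE and card 4111111111111111 for order 1234567890123"
leaks = scan(message)
```

Running the scan on every message that crosses an edge of the agent graph, rather than on each agent's logs in isolation, is what turns this from a secret scanner into a boundary check.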

The contract failure you described — output format drift between agents — is actually a gap I'm actively working on for a future release. Right now swarm-test tests the structural graph (who connects to whom, what breaks if they fail). Output schema validation between agents is the next layer: "Agent A promised JSON with field X, Agent B received JSON without it." That's on the roadmap.

What framework were you using when you hit the format drift issue? Curious how the agents were passing context.

Andrii Krugliak

It wasn't one framework, it was a set of specialist agents picking up tasks, so the drift showed at the seam between two that had never run together. What helped was making each agent's output a checked contract the next one validates, instead of a blob it just trusts. That schema check between agents is the thing that catches the silent breaks.

suraj kumar

"Drift showed at the seam between two that had never run together" — that's the exact failure mode swarm-test is built to catch. Individual agents pass every test, but the interaction between them is where production breaks.

The checked contract approach is smart — schema validation at every agent boundary turns silent breaks into immediate errors. Right now swarm-test catches structural risks at the graph level (cascade paths, SPOFs, collusion). Output contract validation — "Agent A promised this schema, Agent B received something different" — is the natural next layer.

That's on the roadmap: defining expected output contracts per edge in the graph, then flagging mismatches during testing. Would be useful to know what your contract checks look like — are you validating JSON schema, or something more semantic?

Andrii Krugliak

Glad the seam framing landed. Catching the interaction failures unit tests miss is the hard part, and the checked-contract bit only worked once we treated each agent's output as data the next one validates instead of prose it trusts. Did your 15 cascades cluster around shared state or around format assumptions?

suraj kumar

Great question — the 15 cascades clustered around both, but format assumptions were the silent killer. Shared state failures at least threw errors. Format drift just propagated bad data quietly.

That's exactly why we just shipped output contract validation in swarm-test v0.2.6. You define expected schemas per agent in a contracts.yaml:

analyst:
  required: [analysis, confidence]
  properties:
    confidence: {type: number}

If analyst returns confidence as a string instead of a number — caught. If it drops a required field — caught. Edge-specific contracts too, so analyst→writer can enforce different schemas than analyst→reviewer.

Your "checked contract" pattern is literally what this implements. Missing required fields surface as CRITICAL, type mismatches as HIGH.
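
As a rough illustration of that severity mapping, the sketch below checks one agent output against the same contract, written as a Python dict instead of YAML. It is a simplified stand-in and not swarm-test's validator: `check_contract` is a hypothetical helper that only understands `required` and a flat `type` per property.

```python
def check_contract(output: dict, contract: dict) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for every contract violation."""
    type_map = {"number": (int, float), "string": str, "boolean": bool}
    findings = []
    for field in contract.get("required", []):
        if field not in output:
            findings.append(("CRITICAL", f"missing required field: {field}"))
    for field, spec in contract.get("properties", {}).items():
        if field not in output:
            continue  # absence is already reported by the required check
        value = output[field]
        wrong = not isinstance(value, type_map[spec["type"]])
        # bool is a subclass of int in Python, so reject it for "number".
        if spec["type"] == "number" and isinstance(value, bool):
            wrong = True
        if wrong:
            findings.append(("HIGH", f"{field} should be {spec['type']}"))
    return findings

analyst_contract = {
    "required": ["analysis", "confidence"],
    "properties": {"confidence": {"type": "number"}},
}
bad_type = check_contract({"analysis": "ok", "confidence": "0.9"}, analyst_contract)
missing = check_contract({"confidence": 0.9}, analyst_contract)
```

Here `bad_type` holds one HIGH finding for the string-valued confidence, and `missing` holds one CRITICAL finding for the dropped `analysis` field.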

pip install swarm-test --upgrade

Would love to know if the schema format maps to what your contract checks look like — JSON Schema or something more semantic?

suraj kumar

Thanks for the back-and-forth on this — your contract validation insight directly shaped v0.2.6. If swarm-test is useful to you, a GitHub star helps with visibility: github.com/surajkumar811/swarm-test

Andrii Krugliak

Mostly JSON Schema for the structural stuff: required fields, types, enums, because it fails loud and is easy to diff. The semantic checks live a layer up, where we score whether the output is actually usable, not just well-formed. v0.2.6 sounds like it nails the first half cleanly.

suraj kumar

That's exactly the split — structural (is it well-formed) vs semantic (is it useful). v0.2.6 handles the structural layer with JSON Schema validation at every agent boundary.

The semantic scoring layer is the next challenge — validating that an agent's output is actually relevant and complete, not just schema-compliant. Thinking about a lightweight relevance scorer that checks output against the agent's stated goal. Early days on that.

Curious what you use for the semantic scoring — embeddings similarity, LLM-as-judge, or something simpler?

Andrii Krugliak

We use an LLM-as-judge against the task's stated goal rather than embeddings, since relevance is usually about intent and not surface similarity. Cheap structural checks run first, and the judge only sees what passes. Embeddings let too many schema-valid but useless outputs through for us.

suraj kumar

Makes sense — structural checks as the fast filter, LLM judge for intent validation on what passes. That's the exact layering we're building toward. Contract validation in swarm-test handles your Layer 1 (schema enforcement at every agent boundary). Layer 2 — semantic validation against task goals — is the natural next step. The challenge is keeping judge costs low when you're validating outputs at every edge in a multi-agent graph. Are you running the judge on every output or sampling?

Andrii Krugliak

Every final deliverable gets judged, no sampling there, since that's what the user pays for and accepts. The intermediate hops I sample: a cheap structural check on every edge, and the LLM judge only when that check comes back ambiguous or it's a step I've watched drift before. Judging every edge with an LLM got expensive fast, so the real work was deciding which edges earn it.

suraj kumar

Smart prioritization. The "which edges earn the judge" decision is exactly where graph-level risk data helps. swarm-test v0.2.8 scores every agent 0-100 on replaceability, using factors like path redundancy, centrality, and blast radius. Agents under 20 get flagged IRREPLACEABLE, and their edges should always get your LLM judge. Fully redundant agents scoring 80+ probably don't need it. So the graph topology becomes your budget allocator: route judge spend to high-risk edges and let cheap structural checks handle the rest.

In v0.3.0 we also added a GitHub Action that runs this as a CI gate on every PR. It catches new SPOFs before merge at zero LLM cost. Your drift-history approach for promoting edges to "always judge" is interesting. We don't track that yet, but it's a natural fit on top of the redundancy scoring.
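
The budget-allocation idea can be sketched as a small routing function. The 20 and 80 thresholds come from the comment above; the middle band, the `drifted_before` flag (borrowed from the drift-history idea earlier in the thread), the policy names, and the agents are assumptions made for this example.

```python
def judge_policy(score: int, drifted_before: bool = False) -> str:
    """Pick a validation policy for the edges of one agent.

    score: 0-100 replaceability score (lower means harder to replace).
    """
    if score < 20 or drifted_before:
        return "always_judge"       # irreplaceable, or known to drift
    if score >= 80:
        return "structural_only"    # redundant: cheap schema checks only
    return "judge_if_ambiguous"     # assumed middle band

# Hypothetical agents and scores.
scores = {"HubAgent": 12, "ParserAgent": 55, "LoggerAgent": 90}
policies = {agent: judge_policy(score) for agent, score in scores.items()}
```

An agent with a good score but a history of drift is still promoted to the judge, which keeps the topology-based default from overriding what has already been observed in production.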

Andrii Krugliak

Routing judge spend by graph topology instead of guesswork is the part I've been hand-rolling badly. The CI gate catching new single points of failure before merge is the bit I want most, since my drift history only learns after something already broke once. Going to look at swarm-test.

suraj kumar

Glad it landed. Everything you're describing is already live:

CI gate: add this to your GitHub workflow and it catches new SPOFs on every PR before merge:

- uses: surajkumar811/swarm-test@v0.3.0
  with:
    script: your_crew.py
    fail-on-severity: high

SPOF detection runs automatically — any agent scoring under 20/100 on redundancy gets flagged as IRREPLACEABLE and blocks the merge if you set the threshold.

Quickest way to try it locally:

pip install swarm-test
swarm-test run -a "Agent1,Agent2,Agent3" -e "Agent1>Agent2,Agent2>Agent3"

Or point it at your actual crew script:

swarm-test run your_crew.py

It auto-detects CrewAI, LangGraph, and AutoGen. The HTML report (--output-format html --open) gives you the full topology graph with redundancy scores and interaction heatmap.

Would be curious what it surfaces on your system — especially whether the contract validation catches the drift patterns you've been seeing.

Harjot Singh

54 issues in a 14-agent system is the math nobody warns you about: reliability compounds multiplicatively, not additively. If each agent is 95% reliable, 14 of them in a chain is 0.95^14 ≈ 49% end-to-end - so a system of individually-"good" agents is a coin flip overall. That's why naive multi-agent setups feel magical in the demo and fall apart in practice: the failure surface is the product of every handoff.

The implication is that adding agents without adding gates between them makes things worse, not better - each new agent is another multiplication against your reliability. The fixes that actually move the needle: verification at each handoff (so errors fail fast instead of cascading), idempotent retries, and ruthlessly minimizing the number of agents to the fewest that do the job. It's exactly what I obsess over in Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) - fewer agents, hard gates between them, so the chain doesn't multiply its way to failure. ~$3 flat, first run free. Excellent writeup - of the 54, what was the most common class: handoff/context loss, cascading errors, or coordination races?
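
The chain arithmetic in this comment is easy to check. The sketch assumes a strictly linear chain in which each agent fails independently with the same probability, which is the simplest possible model; `chain_reliability` is just a name chosen for the example.

```python
import math

def chain_reliability(per_agent: float, agents: int) -> float:
    """End-to-end success probability of `agents` independent steps in series."""
    return per_agent ** agents

end_to_end = chain_reliability(0.95, 14)

# Smallest chain length at which a 95%-reliable agent chain drops below 50%.
agents_to_coin_flip = math.ceil(math.log(0.5) / math.log(0.95))
```

With 95% per-agent reliability the product comes to roughly 0.488, and 14 is the first chain length at which it falls under one half.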

suraj kumar

The reliability multiplication math is exactly right — and it's worse than the chain model suggests, because real multi-agent systems aren't linear chains. They're graphs with cycles, shared dependencies, and fan-out patterns that compound failure in non-obvious ways.

To answer your question — the breakdown of the 54 findings across the 14-agent system:

Cascading errors: 14 findings (all CRITICAL), plus the 1 CRITICAL SPOF finding. This is the dominant class. Every agent had 92%+ blast radius because the hub-and-spoke architecture around OrchestratorAgent means any single failure propagates to 12 of 14 agents.

Timeout/fragility: 22 findings — 9 HIGH fragile dependencies (single upstream, no fallback) + 13 MEDIUM (zero timeout handling across every agent). This is your "handoff loss" category — agents with no graceful degradation when upstream is slow.

Coordination failures: 17 findings — 13 intent drift (peripheral agents delegating directly to the orchestrator without access control) + 4 collusion cliques (agents communicating outside orchestrator oversight).

The most dangerous finding wasn't any single class — it was the combination. OrchestratorAgent scored 4/100 health: it's a SPOF with 92% blast radius, belongs to 3 collusion cliques, sits on the critical path, and has zero timeout handling. No individual test catches that. The graph-level view is what surfaces it.

That's what swarm-test does — maps the interaction graph and finds where reliability compounds against you.

Harjot Singh

54 reliability issues in a 14-agent system is the most honest thing anyone's posted about multi-agent. The failure surface scales with agent count and the bugs are mostly at the seams (handoffs, shared state, partial failures), not inside any single agent. I run a 14-agent pipeline myself in Moonshift and hit the same truth: the agents are the easy part, the orchestration reliability (retries, state passing, one bad agent not poisoning the whole run, verifying each step before the next) is where the real engineering and most of the bugs live. Categorizing 54 of them is genuinely useful work. What was the most common failure class, state handoff or silent wrong-output that passed downstream?

suraj kumar

Appreciate that — and you nailed the core insight: "the agents are the easy part, the orchestration reliability is where the bugs live."

To answer your question — the 54 broke down like this:

The dominant class was cascade topology (15 CRITICAL): not a single agent misbehaving, but the graph structure itself being fragile. My OrchestratorAgent connects to 12 of 14 agents with zero redundancy — any failure cascades to 92% of the system. That's not a bug in any agent's code. It's an architecture problem that's invisible until you map the interaction graph.

Second was timeout fragility (22 findings): 9 agents have a single upstream dependency with no fallback. If the orchestrator slows down, they don't time out and retry; they just freeze. Silent hang, no error, no log.

Third was coordination drift (17 findings): agents delegating to the wrong privilege level, and three groups forming communication cliques that bypass the orchestrator entirely.
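
For the timeout fragility described above, one common mitigation is to bound every upstream call and fall back to something usable instead of hanging. This is a generic asyncio sketch, not part of swarm-test: `call_with_fallback`, the simulated slow orchestrator, and the cached-plan fallback are all hypothetical.

```python
import asyncio

async def call_with_fallback(upstream, timeout_s: float, fallback):
    """Await `upstream()`, but return `fallback` if it exceeds `timeout_s`."""
    try:
        return await asyncio.wait_for(upstream(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback

async def slow_orchestrator():
    await asyncio.sleep(1.0)  # stands in for a stalled upstream agent
    return "fresh plan"

result = asyncio.run(call_with_fallback(slow_orchestrator, 0.05, "cached plan"))
```

The downstream agent gets the cached plan after 50 ms instead of waiting on the stalled call, and the timeout itself becomes an event that can be logged.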

The "silent wrong-output that passed downstream" pattern you're describing is actually the next frontier I'm building toward: output schema validation between agents. Right now swarm-test catches the structural and topology failures. Catching "Agent A sent malformed JSON that Agent B silently accepted and propagated" requires runtime trace analysis, which is on the paid platform roadmap.

With 14 agents in your Moonshift pipeline — would be curious what swarm-test surfaces on your topology. Happy to run it if you want to share the agent graph structure.

suraj kumar

If anyone wants to see what swarm-test finds on their own agent system, I'm happy to run it and share the results. Just describe your agent setup (framework, number of agents, how they connect) and I'll generate a report.