Every testing tool for AI agents tests individual agents. But production failures don't happen inside agents — they happen between them.
I learned this the hard way.
The Problem Nobody Is Solving
I built a 14-agent document processing system using CrewAI. Each agent worked perfectly in isolation. In production, the system failed constantly — and I couldn't figure out why.
The problem wasn't any single agent. It was the interactions:
- One agent failing silently took down 12 others
- Agents were sharing sensitive data across boundaries they shouldn't cross
- Three agents formed a communication clique that bypassed the orchestrator
- Every agent depended on one central orchestrator with zero fallback
No existing tool could find these issues. Arize, Langfuse, Braintrust — they all monitor individual agents. None of them test the graph of agent interactions.
So I built one.
What I Built: swarm-test
swarm-test builds a NetworkX interaction graph of your multi-agent system and runs 6 chaos engineering tests against it:
- Cascade Failure — which agents bring down the whole system if they fail
- Context Leakage — sensitive data (API keys, PII, credentials) crossing agent boundaries
- Intent Drift — agents acting outside their role or being manipulated
- Collusion Detection — agents communicating outside the orchestrator's oversight
- Blast Radius — single points of failure and critical dependency paths
- Timeout Resilience — agents with no fallback if upstream is slow
3-line API (after the import):
from swarm_test import SwarmProbe
probe = SwarmProbe(crew)
report = probe.run_all()
report.print_summary()
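To make the graph idea concrete, here is a minimal sketch of how a blast-radius score could be computed. It assumes blast radius means "the share of other agents reachable downstream of a failed agent", which may differ from swarm-test's actual definition. It uses plain dictionaries instead of NetworkX to stay dependency-free, and the topology and agent names are hypothetical.

```python
from collections import deque

def downstream(graph, start):
    """Return every agent reachable from `start` by following dependency edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def blast_radius(graph, agent):
    """Share of the other agents affected if `agent` fails (0.0 to 1.0)."""
    agents = set(graph) | {n for targets in graph.values() for n in targets}
    others = len(agents) - 1
    return len(downstream(graph, agent)) / others if others else 0.0

# Hypothetical hub-and-spoke topology: an edge A -> B means "B depends on A".
workers = [f"worker_{i}" for i in range(1, 13)]
graph = {
    "planner": ["orchestrator"],
    "orchestrator": workers,
}

for name in ("planner", "orchestrator", "worker_1"):
    print(f"{name}: {blast_radius(graph, name):.0%}")
```

In this toy graph the hub reaches 12 of the 13 other agents, and the one agent upstream of the hub reaches all 13. Neither would show up in a test of any single agent.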
What It Found On My Real System
I ran swarm-test on my 14-agent system. The results were brutal:
54 total findings:
- 15 CRITICAL (14 cascade failures + 1 SPOF)
- 13 HIGH (9 timeout vulnerabilities + 4 collusion cliques)
- 26 MEDIUM (13 intent drift + 13 missing timeout handling)
The worst agent: OrchestratorAgent scored 4 out of 100. It's a single point of failure with 92% blast radius — if it fails, 12 of 14 agents go down. And it had zero timeout handling.
The scariest finding: EvolutionAgent has 100% blast radius. If it fails, every other agent in the system is affected.
Three agents (OrchestratorAgent, FileOptimizerAgent, PrintOptimizerAgent) formed a collusion clique — communicating directly with each other and bypassing orchestrator oversight.
None of this was visible from testing individual agents. It only appeared when I tested the interaction graph.
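A clique like that can be found from nothing more than a list of who talked to whom. The sketch below is an illustration under assumptions, not swarm-test's implementation: it treats communication as undirected, only looks for groups of three, and uses hypothetical agent names and edges.

```python
from itertools import combinations

def find_triangles(edges):
    """Return every set of three agents that all talk to each other directly."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = []
    for trio in combinations(sorted(neighbours), 3):
        # A trio is a clique only if every pair inside it is connected.
        if all(y in neighbours[x] for x, y in combinations(trio, 2)):
            triangles.append(trio)
    return triangles

# Hypothetical message log reduced to (sender, receiver) pairs.
edges = [
    ("orchestrator", "file_optimizer"),
    ("orchestrator", "print_optimizer"),
    ("file_optimizer", "print_optimizer"),  # direct side channel
    ("orchestrator", "summarizer"),
]
print(find_triangles(edges))
```

Checking every trio is fine for a dozen agents. A larger graph would want a proper clique algorithm, such as the one NetworkX provides.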
I Shipped 7 Features in 7 Days
After launching, I shipped one feature every day:
| Day | Feature | Impact |
|---|---|---|
| 0 | Launch — 5 chaos tests, GitHub + PyPI | First multi-agent testing tool on PyPI |
| 1 | Timeout resilience test | Found 22 new issues in my system |
| 2 | JSON export | Another developer integrated it into his runtime gate within hours |
| 3 | LangGraph adapter | Now supports CrewAI + LangGraph |
| 4 | Sensitive data detection (23 patterns) | Catches AWS keys, JWT tokens, credit cards crossing agent boundaries |
| 5 | Per-agent health scores (0-100) | Know exactly which agent to fix first |
| 6 | Before/after comparison | Measure if your refactor actually improved reliability |
| 7 | ASCII agent graph | See your agent topology right in the terminal |
94 tests passing. Two frameworks supported. And growing.
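For a sense of what boundary-level sensitive data detection involves, here is a rough sketch with three illustrative patterns. These are not the 23 patterns shipped in swarm-test; the pattern names, the `scan` helper, and the sample payload are all assumptions made for the example.

```python
import re

# Illustrative patterns only, not the tool's actual pattern set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}

def luhn_ok(number):
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan(payload):
    """Return the names of the patterns found in a message passed between agents."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(payload):
            if name == "card_number" and not luhn_ok(match.group()):
                continue  # long digit run that fails the checksum
            hits.append(name)
            break
    return hits

# Hypothetical payload crossing an agent boundary.
message = "key=AKIAIOSFODNN7EXAMPLE card=4111 1111 1111 1111"
print(scan(message))
```

The checksum step is there because a bare digit regex would flag order IDs and timestamps as card numbers.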
The First Integration
Within 48 hours of launch, another developer built an integration. He has a runtime action-gate that blocks dangerous agent actions before execution. He connected swarm-test's findings as "priors" — so when swarm-test flags an edge as high-risk, his gate becomes more cautious on that edge.
The result: the same run_sql action went from "CONFIRM" (risk 62) to "HUMAN_REQUIRED" (risk 78) when swarm-test's cascade finding was attached.
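One way to picture findings acting as priors is a gate that adds a severity-based bump to its base risk score before choosing a decision. The sketch below is purely illustrative: the thresholds, the bump values, and the function names are hypothetical, picked only so the example reproduces the 62 to 78 shift described above. The real gate's logic is not given here.

```python
# Hypothetical values; chosen only to mirror the numbers reported above.
PRIOR_BUMP = {"CRITICAL": 16, "HIGH": 8, "MEDIUM": 3}

def decide(risk):
    """Map a 0-100 risk score to a gate decision (thresholds are assumptions)."""
    if risk >= 75:
        return "HUMAN_REQUIRED"
    if risk >= 50:
        return "CONFIRM"
    return "ALLOW"

def risk_with_priors(base_risk, finding_severities):
    """Raise the base risk by the largest bump among the attached findings."""
    bump = max((PRIOR_BUMP.get(s, 0) for s in finding_severities), default=0)
    return min(100, base_risk + bump)

base = 62  # risk of the action with no structural findings attached
print(decide(risk_with_priors(base, [])))
print(decide(risk_with_priors(base, ["CRITICAL"])))
```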
Structural testing (swarm-test) + runtime enforcement (his gate) = the full reliability stack for multi-agent systems.
Why This Matters Now
According to recent industry research:
- 88% of organizations report AI agent security incidents
- Only 14.4% of agents go live with full security approval
- OWASP classified cascade failures as ASI08 — a top AI security risk
Multi-agent systems are going to production faster than anyone can secure them. The tools exist for single-agent monitoring. Nothing existed for multi-agent interaction testing — until now.
Try It
pip install swarm-test
from swarm_test import SwarmProbe
# Works with CrewAI
probe = SwarmProbe(your_crew)
report = probe.run_all()
report.print_summary()
report.to_html("report.html") # Interactive D3 graph
report.to_json("report.json") # Machine-readable for CI/CD
GitHub: github.com/surajkumar811/swarm-test
Open source. MIT licensed. Solo founder building in public.
What reliability tests would YOU want for your multi-agent systems? Drop a comment — I'm shipping features based on real feedback.
Top comments (20)
The 54-versus-individual split is the part most testing misses. We saw the same thing: single agents passed every check, then broke at the handoff when one agent's output format drifted and the next assumed the old shape. Are your 54 mostly contract failures between agents, or state leaks?
Great question — and the handoff format drift you described is one of the hardest failures to catch because both agents "work" individually.
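Of the 54 findings, the breakdown is 15 CRITICAL (14 cascade failures + 1 SPOF), 13 HIGH (9 timeout vulnerabilities + 4 collusion cliques), and 26 MEDIUM (13 intent drift + 13 missing timeout handling).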
The contract failure you described — output format drift between agents — is actually a gap I'm actively working on for a future release. Right now swarm-test tests the structural graph (who connects to whom, what breaks if they fail). Output schema validation between agents is the next layer: "Agent A promised JSON with field X, Agent B received JSON without it." That's on the roadmap.
What framework were you using when you hit the format drift issue? Curious how the agents were passing context.
It wasn't one framework, it was a set of specialist agents picking up tasks, so the drift showed at the seam between two that had never run together. What helped was making each agent's output a checked contract the next one validates, instead of a blob it just trusts. That schema check between agents is the thing that catches the silent breaks.
"Drift showed at the seam between two that had never run together" — that's the exact failure mode swarm-test is built to catch. Individual agents pass every test, but the interaction between them is where production breaks.
The checked contract approach is smart — schema validation at every agent boundary turns silent breaks into immediate errors. Right now swarm-test catches structural risks at the graph level (cascade paths, SPOFs, collusion). Output contract validation — "Agent A promised this schema, Agent B received something different" — is the natural next layer.
That's on the roadmap: defining expected output contracts per edge in the graph, then flagging mismatches during testing. Would be useful to know what your contract checks look like — are you validating JSON schema, or something more semantic?
Glad the seam framing landed. Catching the interaction failures unit tests miss is the hard part, and the checked-contract bit only worked once we treated each agent's output as data the next one validates instead of prose it trusts. Did your 15 cascades cluster around shared state or around format assumptions?
Great question — the 15 cascades clustered around both, but format assumptions were the silent killer. Shared state failures at least threw errors. Format drift just propagated bad data quietly.
That's exactly why we just shipped output contract validation in swarm-test v0.2.6. You define expected schemas per agent in a contracts.yaml:
analyst:
  required: [analysis, confidence]
  properties:
    confidence: {type: number}
If analyst returns confidence as a string instead of a number — caught. If it drops a required field — caught. Edge-specific contracts too, so analyst→writer can enforce different schemas than analyst→reviewer.
Your "checked contract" pattern is literally what this implements. Missing required fields surface as CRITICAL, type mismatches as HIGH.
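As a rough illustration of that severity mapping, here is a sketch of a checker for a contract shaped like the one above. It is an assumption-laden stand-in, not swarm-test's validator: `check_contract` and `TYPE_MAP` are hypothetical names, and only a handful of JSON-Schema-style types are handled.

```python
# JSON-Schema-style type names mapped to Python types (subset, for illustration).
TYPE_MAP = {
    "number": (int, float),
    "string": (str,),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}

def check_contract(output, contract):
    """Return (severity, message) pairs for every contract violation in `output`."""
    findings = []
    for field in contract.get("required", []):
        if field not in output:
            findings.append(("CRITICAL", f"missing required field '{field}'"))
    for field, spec in contract.get("properties", {}).items():
        if field not in output:
            continue  # absence is already reported above if the field is required
        value = output[field]
        expected = TYPE_MAP[spec["type"]]
        # bool is a subclass of int in Python, so exclude it from "number".
        is_bool_as_number = spec["type"] == "number" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_number:
            findings.append(
                ("HIGH", f"'{field}' should be {spec['type']}, got {type(value).__name__}")
            )
    return findings

contract = {
    "required": ["analysis", "confidence"],
    "properties": {"confidence": {"type": "number"}},
}
print(check_contract({"analysis": "ok", "confidence": "0.9"}, contract))
print(check_contract({"confidence": 0.9}, contract))
```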
pip install swarm-test --upgrade
Would love to know if the schema format maps to what your contract checks look like — JSON Schema or something more semantic?

Thanks for the back-and-forth on this — your contract validation insight directly shaped v0.2.6. If swarm-test is useful to you, a GitHub star helps with visibility: github.com/surajkumar811/swarm-test
Mostly JSON Schema for the structural stuff: required fields, types, enums, because it fails loud and is easy to diff. The semantic checks live a layer up, where we score whether the output is actually usable, not just well-formed. v0.2.6 sounds like it nails the first half cleanly.
That's exactly the split — structural (is it well-formed) vs semantic (is it useful). v0.2.6 handles the structural layer with JSON Schema validation at every agent boundary.
The semantic scoring layer is the next challenge — validating that an agent's output is actually relevant and complete, not just schema-compliant. Thinking about a lightweight relevance scorer that checks output against the agent's stated goal. Early days on that.
Curious what you use for the semantic scoring — embeddings similarity, LLM-as-judge, or something simpler?
We use an LLM-as-judge against the task's stated goal rather than embeddings, since relevance is usually about intent and not surface similarity. Cheap structural checks run first, and the judge only sees what passes. Embeddings let too many schema-valid but useless outputs through for us.
Makes sense — structural checks as the fast filter, LLM judge for intent validation on what passes. That's the exact layering we're building toward. Contract validation in swarm-test handles your Layer 1 (schema enforcement at every agent boundary). Layer 2 — semantic validation against task goals — is the natural next step. The challenge is keeping judge costs low when you're validating outputs at every edge in a multi-agent graph. Are you running the judge on every output or sampling?
Every final deliverable gets judged, no sampling there, since that's what the user pays for and accepts. The intermediate hops I sample: a cheap structural check on every edge, and the LLM judge only when that check comes back ambiguous or it's a step I've watched drift before. Judging every edge with an LLM got expensive fast, so the real work was deciding which edges earn it.
Smart prioritization. The "which edges earn the judge" decision is exactly where graph-level risk data helps. swarm-test v0.2.8 scores every agent 0-100 on replaceability, using factors like path redundancy, centrality, and blast radius. Agents under 20 get flagged IRREPLACEABLE, and those edges should always get your LLM judge. Fully redundant agents scoring 80+ probably don't need it. So the graph topology becomes your budget allocator: route judge spend to high-risk edges and let cheap structural checks handle the rest. In v0.3.0 we also added a GitHub Action that runs this as a CI gate on every PR, catching new SPOFs before merge at zero LLM cost. Your drift-history approach for promoting edges to "always judge" is interesting. We don't track that yet, but it's a natural fit on top of the redundancy scoring.
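The routing rule discussed in this exchange can be sketched as a single function. This is a hypothetical combination of the two approaches (redundancy score plus drift history), not code from swarm-test; the function name, parameters, and the `"ambiguous"` label are assumptions.

```python
def should_judge(redundancy_score, structural_result, drifted_before, is_final=False):
    """Decide whether an edge's output is worth an LLM-judge call.

    redundancy_score:  0-100, lower means harder to replace
    structural_result: "pass" or "ambiguous" from the cheap schema check
    drifted_before:    True if this edge has a history of format drift
    is_final:          True for the deliverable the user actually receives
    """
    if is_final:
        return True  # final deliverables are always judged
    if redundancy_score < 20:
        return True  # irreplaceable agent, always worth the spend
    if structural_result == "ambiguous" or drifted_before:
        return True  # the cheap check could not settle it, or the edge has form
    return False     # everything else relies on structural checks only

print(should_judge(12, "pass", False))
print(should_judge(85, "pass", False))
print(should_judge(85, "ambiguous", False))
```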
Routing judge spend by graph topology instead of guesswork is the part I've been hand-rolling badly. The CI gate catching new single points of failure before merge is the bit I want most, since my drift history only learns after something already broke once. Going to look at swarm-test.
Glad it landed. Everything you're describing is already live:
CI gate: add the swarm-test GitHub Action to your workflow and it catches new SPOFs on every PR before merge.
SPOF detection runs automatically — any agent scoring under 20/100 on redundancy gets flagged as IRREPLACEABLE and blocks the merge if you set the threshold.
Quickest way to try it locally:
pip install swarm-test
swarm-test run -a "Agent1,Agent2,Agent3" -e "Agent1>Agent2,Agent2>Agent3"
Or point it at your actual crew script:
swarm-test run your_crew.py
It auto-detects CrewAI, LangGraph, and AutoGen. The HTML report (--output-format html --open) gives you the full topology graph with redundancy scores and interaction heatmap.
Would be curious what it surfaces on your system — especially whether the contract validation catches the drift patterns you've been seeing.
54 issues in a 14-agent system is the math nobody warns you about: reliability compounds multiplicatively, not additively. If each agent is 95% reliable, 14 of them in a chain is 0.95^14 ≈ 49% end-to-end - so a system of individually-"good" agents is a coin flip overall. That's why naive multi-agent setups feel magical in the demo and fall apart in practice: the failure surface is the product of every handoff.
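The arithmetic is easy to check, and to invert. The sketch below assumes a pure linear chain with independent failures, which is the simplest possible model, and the helper names are made up for the example.

```python
def chain_reliability(per_agent, n_agents):
    """End-to-end success rate of a linear chain of independent agents."""
    return per_agent ** n_agents

def per_agent_needed(target, n_agents):
    """Per-agent reliability required to hit an end-to-end target."""
    return target ** (1 / n_agents)

print(f"14 agents at 95% each: {chain_reliability(0.95, 14):.1%}")
print(f"needed per agent for 95% end-to-end: {per_agent_needed(0.95, 14):.2%}")
```

Read in the other direction, a 14-step chain needs roughly 99.6% per-agent reliability to reach 95% overall.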
The implication is that adding agents without adding gates between them makes things worse, not better - each new agent is another multiplication against your reliability. The fixes that actually move the needle: verification at each handoff (so errors fail fast instead of cascading), idempotent retries, and ruthlessly minimizing the number of agents to the fewest that do the job. It's exactly what I obsess over in Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) - fewer agents, hard gates between them, so the chain doesn't multiply its way to failure. ~$3 flat, first run free. Excellent writeup - of the 54, what was the most common class: handoff/context loss, cascading errors, or coordination races?
The reliability multiplication math is exactly right — and it's worse than the chain model suggests, because real multi-agent systems aren't linear chains. They're graphs with cycles, shared dependencies, and fan-out patterns that compound failure in non-obvious ways.
To answer your question — the breakdown of the 54 findings across the 14-agent system:
Cascading errors: 14 findings (all CRITICAL) — the dominant class. Every agent had 92%+ blast radius because the hub-and-spoke architecture around OrchestratorAgent means any single failure propagates to 12 of 14 agents.
Timeout/fragility: 22 findings — 9 HIGH fragile dependencies (single upstream, no fallback) + 13 MEDIUM (zero timeout handling across every agent). This is your "handoff loss" category — agents with no graceful degradation when upstream is slow.
Coordination failures: 17 findings — 13 intent drift (peripheral agents delegating directly to the orchestrator without access control) + 4 collusion cliques (agents communicating outside orchestrator oversight).
The most dangerous finding wasn't any single class — it was the combination. OrchestratorAgent scored 4/100 health: it's a SPOF with 92% blast radius, belongs to 3 collusion cliques, sits on the critical path, and has zero timeout handling. No individual test catches that. The graph-level view is what surfaces it.
That's what swarm-test does — maps the interaction graph and finds where reliability compounds against you.
54 reliability issues in a 14-agent system is the most honest thing anyone's posted about multi-agent. The failure surface scales with agent count and the bugs are mostly at the seams (handoffs, shared state, partial failures), not inside any single agent. I run a 14-agent pipeline myself in Moonshift and hit the same truth: the agents are the easy part, the orchestration reliability (retries, state passing, one bad agent not poisoning the whole run, verifying each step before the next) is where the real engineering and most of the bugs live. Categorizing 54 of them is genuinely useful work. What was the most common failure class, state handoff or silent wrong-output that passed downstream?
Appreciate that — and you nailed the core insight: "the agents are the easy part, the orchestration reliability is where the bugs live."
To answer your question — the 54 broke down like this:
The dominant class was cascade topology (15 CRITICAL): not a single agent misbehaving, but the graph structure itself being fragile. My OrchestratorAgent connects to 12 of 14 agents with zero redundancy — any failure cascades to 92% of the system. That's not a bug in any agent's code. It's an architecture problem that's invisible until you map the interaction graph.
Second was timeout fragility (22 findings): 9 agents have a single upstream dependency with no fallback. If the orchestrator slows down, they don't time out and retry; they just freeze. Silent hang, no error, no log.
Third was coordination drift (17 findings): agents delegating to the wrong privilege level, and three groups forming communication cliques that bypass the orchestrator entirely.
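The timeout fragility described above has a fairly standard remedy: bound every upstream call, retry a limited number of times, then degrade to a fallback. Here is a minimal asyncio sketch of that pattern. The agent functions and the `call_with_fallback` helper are hypothetical and not part of swarm-test or any framework.

```python
import asyncio

async def call_with_fallback(primary, fallback, timeout_s=1.0, retries=2):
    """Try `primary` up to `retries` times with a timeout, then use `fallback`."""
    for _ in range(retries):
        try:
            return await asyncio.wait_for(primary(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # upstream was too slow, try again
    return await fallback()

async def slow_orchestrator():
    # Stand-in for an upstream agent that has stopped responding promptly.
    await asyncio.sleep(0.5)
    return "plan from orchestrator"

async def cached_plan():
    # Stand-in for a degraded but safe default.
    return "cached plan"

result = asyncio.run(call_with_fallback(slow_orchestrator, cached_plan, timeout_s=0.05))
print(result)
```

Falling back to the cached plan turns a silent hang into a visible, bounded degradation.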
The "silent wrong-output that passed downstream" pattern you're describing is actually the next frontier I'm building toward — output schema validation between agents. Right now swarm-test catches the structural and topology failures. Catching "Agent A sent malformed JSON that Agent B silently accepted and propagated" requires runtime trace analysis, which is the paid platform roadmap.
With 14 agents in your Moonshift pipeline — would be curious what swarm-test surfaces on your topology. Happy to run it if you want to share the agent graph structure.
If anyone wants to see what swarm-test finds on their own agent system, I'm happy to run it and share the results. Just describe your agent setup (framework, number of agents, how they connect) and I'll generate a report.