A multi-agent system where each agent is 95% reliable, chained 14 deep, is only about 49% reliable end-to-end. 0.95^14 ≈ 0.49. Every additional agent multiplies the failure surface, and standard testing doesn't catch the failures that emerge from agent interaction — only the ones inside a single agent.
I built swarm-test to test the interactions. It's open source, free, and works across CrewAI, LangGraph, AutoGen, and custom orchestrators. Here's what it does.
Reliability scoring (0-100)
Every system gets a Swarm Score from 8 structural chaos tests:
- cascade_failure — does one agent's failure take down others
- blast_radius — which agents are single points of failure
- context_leakage — does sensitive data flow where it shouldn't
- intent_drift — do agents stray from their assigned role
- collusion_detection — are agents forming tight cliques
- timeout_resilience — fragile single-upstream dependencies
- contract_violation — output schema mismatches between agents
- sensitive_data — secrets and PII in agent payloads
A clean system scores high. A fragile one scores low. The score is the headline.
Agent role classification
swarm-test classifies each agent by its position in the graph — orchestrator, worker, validator, gateway, aggregator, monitor — and adjusts how it reads risk. An orchestrator with 90% blast radius is expected by design; it needs a fallback, not a redesign. A worker with 90% blast radius is a design smell. Security-sensitive roles like validators get their severity automatically upgraded.
Historical tracking
It saves every run and compares against the last:
Swarm Score: 31/100 — AT RISK
Trend: ↑ +19 (was 12) — improving
Recent: 12 → 12 → 31
✓ 6 findings resolved since last run
Findings are diffed using stable IDs, so identical runs show zero change and a real fix shows exactly what resolved. This turns a snapshot into a feedback loop.
Built for real workflows
- A GitHub Action that gates PRs and annotates them with findings
- Output contract validation (per-agent JSON schemas)
- An interactive HTML report with a D3 agent graph, heatmap, and trend chart
- Graph export to Mermaid, DOT, or PNG — paste topology straight into a README
- A plugin system for custom tests
- YAML config for thresholds and CI behavior
Try it
pip install swarm-test
swarm-test run my_crew.py
GitHub: github.com/surajkumar811/swarm-test
I test it on my own 14-agent passport-photo pipeline — first run surfaced 15 critical cascade failures I didn't know were there. If you run agents in production, I'd genuinely like to hear what failure modes you've hit that this doesn't cover yet.


Top comments (0)