DEV Community

suraj kumar

I built an open-source reliability tester for multi-agent AI systems

A multi-agent system where each agent is 95% reliable, chained 14 deep, is only about 49% reliable end-to-end: 0.95^14 ≈ 0.49, assuming failures are independent and every agent in the chain has to succeed. Every agent you add multiplies the failure surface, and standard testing misses it, because the failures don't live inside any single agent. They live in how the agents connect.
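To make the arithmetic concrete, here is a minimal sketch of how the end-to-end number falls as the chain gets deeper. The function name `chain_reliability` is mine, not part of swarm-test, and it assumes independent failures in a strictly serial chain.

```python
def chain_reliability(per_agent: float, depth: int) -> float:
    """End-to-end success probability for `depth` agents in series.

    Assumes failures are independent and every agent must succeed.
    """
    return per_agent ** depth


# Print the compound reliability for a few chain depths.
for depth in (1, 5, 10, 14, 20):
    print(f"{depth:>2} agents: {chain_reliability(0.95, depth):.1%}")
```

Real systems with retries, fallbacks, or parallel branches won't follow this curve exactly, but the direction is the same: depth compounds.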

I built swarm-test to test the connections. It's open source, free, and works across CrewAI, LangGraph, AutoGen, and custom orchestrators. It makes no live LLM calls: it's static analysis of your agent topology, so it runs in milliseconds and costs nothing per run.

It models your system as a directed graph and runs 8 structural tests: cascade failure, single points of failure (blast radius), context leakage, intent drift, collusion, timeout resilience, contract violations, and sensitive-data scanning. You get a 0–100 Swarm Score, and it classifies each agent's role (orchestrator, validator, gateway, worker, etc.) to interpret severity in context — an orchestrator with high blast radius is expected; a worker with high blast radius is a design smell.
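For a feel of what "blast radius" means on a directed graph, here is a rough sketch. It is not swarm-test's implementation or API: the `TOPOLOGY` dict, the function names, and the definition used here (the share of other agents reachable downstream of a failing agent) are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical topology: each agent maps to the agents it hands work to.
TOPOLOGY = {
    "orchestrator": ["researcher", "writer"],
    "researcher": ["validator"],
    "writer": ["validator"],
    "validator": ["publisher"],
    "publisher": [],
}


def downstream(graph, start):
    """Return every agent reachable from `start`, excluding `start` itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def blast_radius(graph, agent):
    """Fraction of the *other* agents affected if `agent` fails (assumed definition)."""
    others = len(graph) - 1
    return len(downstream(graph, agent)) / others if others else 0.0


for agent in TOPOLOGY:
    print(f"{agent:<13} blast radius: {blast_radius(TOPOLOGY, agent):.0%}")
```

In this toy graph the orchestrator reaches every other agent, which fits its role. A worker with a similar number would be the design smell described above.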

Here's a real 14-agent pipeline of mine. The Orchestrator lights up red — it's a single point of failure that everything routes through:

That's the whole value in one picture: you can't see that bottleneck in a console table, but it's obvious in the graph, and swarm-test flags it as critical automatically.

It also tracks reliability across runs (a trend line, plus a diff of what got fixed versus what regressed), ships a GitHub Action to gate PRs in CI, and exports the topology to Mermaid, DOT, or PNG.

pip install swarm-test
swarm-test run my_crew.py --open

GitHub: github.com/surajkumar811/swarm-test

The main limitation, honestly: it reasons about structure, not runtime semantics. It won't tell you an agent gave a wrong answer — only that the topology has a fragility. If you run agents in production, I'd genuinely like to hear what interaction-level failures you've hit that static analysis would miss.
