suraj kumar

Posted on Jun 28

I ran my own reliability tool on my production system. 90% of the findings were wrong — and that was the most valuable bug I've fixed.

#ai #python #testing #opensource

I build swarm-test — an open-source tool that statically analyzes multi-agent AI systems for structural reliability problems: loops with no exit, single points of failure, handoffs that break silently. No live LLM calls, no API cost, runs in milliseconds on your agent topology.

Last week I did the obvious thing every tool author should do and most avoid: I pointed it at my own production system.

It returned 99 findings. Risk score: 0 out of 100. Fifteen criticals.
And almost all of it was wrong.

Here's what happened, why it happened, and the fix — because the bug turned out to teach me more about building reliability tooling than any feature I've shipped.

The system under test

The target was real: a 14-agent FastAPI orchestrator running passport-photo, signature, and document pipelines in production. One OrchestratorAgent coordinates the other 13 — face detection, background removal, enhancement, validation, compliance, layout, and so on — plus a couple of background loops for evolution and health monitoring.

It's a textbook hub-and-spoke design. The orchestrator is central on purpose — that's its entire job. Every request flows through it; it delegates to workers and aggregates their results.

This is a normal, correct architecture. It is not a bug. Keep that in mind, because my own tool didn't.

The 99-finding report

Here's the summary swarm-test handed me:
Swarm Score: 0/100 — CRITICAL (15 critical, 41 high findings)
Cost Risk:   100/100 — SEVERE

cascade_failure      FAILED   14 findings   14 critical
blast_radius         FAILED    1 finding     1 critical
collusion_detection  FAILED    4 findings
cost_risk            FAILED   26 findings
...
1/8 tests passed — Overall: FAILED

Fourteen critical cascade findings, one for every single agent, each saying the same thing: "92.3% blast radius, failure cascades to 12 agents." The orchestrator was flagged as a catastrophic single point of failure. Twenty-six cost-risk findings, nearly all of them "feedback loop between SomeAgent and OrchestratorAgent."

A healthy, correctly-architected system, scored as a five-alarm fire.

My first instinct was the dangerous one: maybe I should refactor my production system. Add fallback agents. Break up the orchestrator. Then I read the findings properly.

Why a healthy system scored 0/100

Look at what's actually being said in those 14 cascade findings. Every agent has "92.3% blast radius." When every node in a graph is flagged critical with the same number, that's not 14 problems — that's one structural fact being counted 14 times.

The fact: it's a hub-and-spoke graph. Everything routes through the orchestrator. So:

The orchestrator's "blast radius" is huge because everything depends on it — which is what an orchestrator is.
Every worker's blast radius looked huge too, because the reachability math counted descendants through the hub. A leaf worker that only returns a result to the orchestrator was being scored as if it could take down the whole system.
Every "feedback loop between Worker and Orchestrator" was just a worker returning its result — normal request/response, not a token-burning cycle.

The tool was penalizing my system for having an orchestrator. It mistook intentional centrality for accidental fragility.
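
To see why every agent came out with the same number, here is a minimal sketch of a naive, reachability-based blast radius on a toy hub-and-spoke graph. The agent names, the five-agent size, and the function names are illustrative assumptions, not swarm-test's actual code. The only point is that once workers have a return edge to the hub, plain graph reachability hands every node the hub's reach.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from `start` by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # an agent is not part of its own blast radius
    return seen


def naive_blast_radius(graph, agent):
    """Fraction of the *other* agents reachable from `agent`."""
    return len(reachable(graph, agent)) / (len(graph) - 1)


# Hypothetical topology: one hub, four workers.
workers = ["FaceDetect", "BackgroundRemoval", "Enhance", "Validate"]
graph = {"Orchestrator": list(workers)}   # hub delegates to every worker
for w in workers:
    graph[w] = ["Orchestrator"]           # each worker returns its result

scores = {agent: naive_blast_radius(graph, agent) for agent in graph}
```

On this toy graph every agent, hub or leaf, gets the same score, because each worker reaches the hub and, through it, every other worker. That is the same shape as the report: one structural fact repeated once per node. (For reference, 92.3% is what 12 out of 13 other agents works out to.)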

And here's the part that stung: in the same report, swarm-test had correctly inferred that OrchestratorAgent was an ORCHESTRATOR, at 99% confidence, with a profile that literally said "expected high blast." It knew. And then every scoring function ignored what it knew.

The root cause

I went into the code. The role classifier (classify_agent) ran, inferred roles with confidence scores, stored them, and rendered them in the report. There was even a function called role_adjusted_severity — defined, unit-tested, and wired to absolutely nothing in the runtime path.

So the architecture was: infer the role, display the role, then compute every severity and score as if the role didn't exist. The cascade test, the blast-radius test, the cost-risk test — none of them consulted the role. A grep confirmed it: role_adjusted_severity appeared only in its own definition and its tests.
That's the whole bug in one sentence: roles were inferred and then ignored by every decision that mattered.

Why this is the bug that matters

It would be easy to file this as a minor scoring tweak. It isn't. It's the failure mode that kills reliability tooling entirely.

A tool that cries wolf is worse than no tool at all. The first time a developer runs swarm-test on their healthy orchestrator system, sees 0/100 CRITICAL and 14 identical alarms, they conclude the tool is noise and uninstall it. And then the one time it's genuinely right — a real unbounded loop, a real unguarded write — nobody's listening anymore.

In reliability work, false positives aren't an annoyance. They decide whether the product is worth anything at all. Trust is the entire value proposition: a finding is only worth something if the user believes it.

The fix: make scoring role-aware

I wired the inference into the scoring, the way it always should have been:

Recognize intentional hubs. When an agent is an orchestrator (declared, or inferred with high confidence), its centrality is expected. Its high blast radius gets reported as an informational "hub-and-spoke topology" note, not a CRITICAL cascade. A central agent is only escalated to a real single point of failure (SPOF) when it has no fallback path and is not a recognized hub.
Compute effective blast radius. A spoke that only returns to the orchestrator shouldn't inherit the hub's reach. Blast radius now reflects the agents that actually depend on this agent's output, not everything reachable through the hub.
Stop counting return paths as loops. A Worker → Orchestrator → Worker cycle is normal request/response, not a feedback loop. Only genuine orchestrator-bypass cycles (agents talking directly, skipping oversight) and cycles with no exit edge get flagged.
Deduplicate. Fourteen identical findings collapse into one, carrying the list of affected agents.
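
The last three items above can be sketched in a few lines. This is one possible reading of them, not swarm-test's implementation: the function names, the finding dictionary shape, and the toy graph are all assumptions made for illustration. In particular, the cycle check only looks at direct A <-> B pairs and does not cover longer cycles or the "no exit edge" case.

```python
from collections import defaultdict, deque


def effective_blast_radius(graph, agent, hubs):
    """Fraction of other agents affected if `agent` fails.

    When the walk starts at a spoke and reaches a recognized hub, the hub
    counts as affected but the walk does not fan out from it.
    """
    seen = {agent}
    queue = deque([agent])
    while queue:
        node = queue.popleft()
        if node in hubs and node != agent:
            continue  # do not inherit the hub's reach
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return (len(seen) - 1) / (len(graph) - 1)


def classify_mutual_edges(graph, hubs):
    """Label each A <-> B pair as a return path or an orchestrator-bypass cycle."""
    labels = {}
    for a, targets in graph.items():
        for b in targets:
            if a < b and a in graph.get(b, ()):  # visit each pair once
                through_hub = a in hubs or b in hubs
                labels[(a, b)] = "return_path" if through_hub else "bypass_cycle"
    return labels


def deduplicate(findings):
    """Collapse findings sharing a test and message into one, listing the agents."""
    grouped = defaultdict(list)
    for f in findings:
        grouped[(f["test"], f["message"])].append(f["agent"])
    return [
        {"test": test, "message": message, "agents": sorted(agents)}
        for (test, message), agents in grouped.items()
    ]


# Hypothetical topology: Enhance and Validate also talk to each other directly.
hubs = {"Orchestrator"}
graph = {
    "Orchestrator": ["Compliance", "Enhance", "Layout", "Validate"],
    "Compliance": ["Orchestrator"],
    "Enhance": ["Orchestrator", "Validate"],
    "Layout": ["Orchestrator"],
    "Validate": ["Orchestrator", "Enhance"],
}
```

With this graph, `effective_blast_radius(graph, "Layout", hubs)` is 0.25 (only the orchestrator is waiting on Layout), the orchestrator itself still scores 1.0, and `classify_mutual_edges` labels the Enhance/Validate pair as a bypass cycle while every worker/orchestrator pair is a return path.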

The critical constraint while fixing: it had to keep catching the real problems. Over-correcting into a tool that calls everything healthy is just the opposite failure. So I added a safety test asserting that a misclassified non-hub can never suppress a real finding — because false negatives in a reliability tool are even worse than false positives.
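
A safety test of that kind might look roughly like the sketch below. Everything here is hypothetical: the `Role` shape, the 0.9 confidence cutoff, the finding-kind names, and `adjusted_severity` itself are stand-ins and not the real `role_adjusted_severity` signature. The fallback-path check is left out to keep the example short.

```python
from dataclasses import dataclass

# Hypothetical cutoff for trusting an inferred (not declared) role.
CONFIDENCE_FLOOR = 0.9
# Only findings about centrality are eligible for a downgrade.
CENTRALITY_KINDS = {"cascade_failure", "blast_radius"}


@dataclass(frozen=True)
class Role:
    name: str
    confidence: float
    declared: bool = False


def is_recognized_hub(role):
    """A hub is a declared orchestrator or one inferred with high confidence."""
    trusted = role.declared or role.confidence >= CONFIDENCE_FLOOR
    return role.name == "ORCHESTRATOR" and trusted


def adjusted_severity(kind, raw_severity, role):
    """Downgrade centrality findings for recognized hubs; leave the rest alone."""
    if kind in CENTRALITY_KINDS and is_recognized_hub(role):
        return "INFO"
    return raw_severity


def test_non_hub_never_suppresses_a_finding():
    shaky = Role("ORCHESTRATOR", confidence=0.55)  # low-confidence guess
    worker = Role("WORKER", confidence=0.99)
    hub = Role("ORCHESTRATOR", confidence=0.99)

    # Anything short of a recognized hub keeps its original severity.
    assert adjusted_severity("cascade_failure", "CRITICAL", shaky) == "CRITICAL"
    assert adjusted_severity("cascade_failure", "CRITICAL", worker) == "CRITICAL"
    # Even a real hub only gets relief on centrality findings.
    assert adjusted_severity("unvalidated_write", "HIGH", hub) == "HIGH"
    assert adjusted_severity("cascade_failure", "CRITICAL", hub) == "INFO"


test_non_hub_never_suppresses_a_finding()
```

The design choice worth noting is that the downgrade is opt-in on two conditions at once (finding kind and trusted role), so a classifier mistake defaults to the louder severity rather than the quieter one.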

Before and after

Same system, same topology, after the fix:
                 Before            After
Swarm Score      0/100 CRITICAL    70/100 NEEDS IMPROVEMENT
Total findings   99                10
CRITICAL         15                0
HIGH             41                2

And — this is the part that proves it didn't just go quiet — the two HIGH findings that survived are the two genuinely real issues:

Unvalidated write-back: EvolutionAgent mutates the orchestrator's config every 10 minutes with no schema validation on the write path. A bad value silently propagates to all 14 agents. That's a real risk, and now it's the headline finding instead of being buried in a 99-finding report.
Orchestrator-bypass cycle: two agents call each other directly, skipping the orchestrator's oversight, so failures there are invisible to it.

Those are worth fixing. The 89 findings that disappeared never were.

What I took away

A few things I'll carry into everything I build after this:

Dogfood your own tool on a real system, not a toy. The toy examples all passed. It took my actual 14-agent production system to expose that the role inference was decorative.
"Defined but disconnected" is a real and dangerous state. role_adjusted_severity existed, was tested, and did nothing, because nothing called it. Green tests on an unwired function prove nothing. I added a regression test asserting the function is actually invoked, so it can never silently disconnect again.
In reliability tooling, calibration is the product. The detection engine was fine the whole time. What made the tool trustworthy or worthless was entirely in how it scored and prioritized what it found.

swarm-test is open source and MIT licensed. It does static reliability analysis for CrewAI, LangGraph, AutoGen, and custom agent systems — no live LLM calls, no API cost.

pip install swarm-test

If you're running multi-agent systems in production, I'm happy to run it against your topology and send you the findings — the real structural risks and where your handoffs might break. No pitch, just the report.
And if you've shipped a silent multi-agent failure of your own — the kind that didn't throw an error — I'd genuinely like to hear it. Those stories are how the rest of us learn where the gaps are.

Top comments (1)

suraj kumar • Jul 2

Update: the CI gate is live

Since writing this, I shipped the natural next step — swarm-test now runs in CI. Add the GitHub Action to your repo and it fails a build when reliability findings cross a threshold: a broken agent graph (unbounded loop, unguarded SPOF) gets blocked before it merges; a healthy topology passes clean. Deterministic, no LLM calls, JSON output for your pipeline.

The lesson from this post is exactly why the gate defaults matter: it had to pass healthy systems, not just catch broken ones. A CI gate that fails clean topologies gets disabled on day one — same false-positive trap, higher stakes. So the gate is calibrated to block real structural failures and let good systems through.
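
To illustrate what "crossing a threshold" can mean in a gate like this, here is a small sketch. The JSON shape, the severity names, and the `gate` function are invented for the example and are not swarm-test's actual output format or CLI; check the project's docs for the real interface.

```python
import json

# Hypothetical severity ordering, lowest to highest.
SEVERITY_RANK = {"INFO": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def gate(report_json, fail_at="CRITICAL"):
    """Return a process exit code: 1 if any finding is at or above `fail_at`."""
    findings = json.loads(report_json)["findings"]
    threshold = SEVERITY_RANK[fail_at]
    blocking = [f for f in findings if SEVERITY_RANK[f["severity"]] >= threshold]
    return 1 if blocking else 0


# Hypothetical reports: a healthy hub-and-spoke system and a broken graph.
healthy = json.dumps({"findings": [
    {"severity": "INFO", "message": "hub-and-spoke topology"},
    {"severity": "HIGH", "message": "unvalidated write-back"},
]})
broken = json.dumps({"findings": [
    {"severity": "CRITICAL", "message": "cycle with no exit edge"},
]})
```

With the default threshold, `gate(healthy)` returns 0 and `gate(broken)` returns 1. Tightening the gate to `fail_at="HIGH"` would also block the healthy report, which is exactly the calibration trade-off described above.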

pip install swarm-test