DEV Community

suraj kumar

swarm-test v0.2.8 — This Reliability Report Found a Catastrophic Failure Nobody Caught

I ran swarm-test v0.2.8 on a 4-agent system. The result: Risk Score 90/100.
One agent is holding the entire system together — and nobody knew.
What the report found:
The "Hub" agent has a health score of 15/100. It's classified as IRREPLACEABLE with a blast radius of 100%. If Hub fails, every downstream agent — Worker1, Worker2, Worker3 — receives corrupted input. The system throws no errors. Logs look clean. Output is silently garbage.
Meanwhile all three workers scored 80-100/100 and are FULLY REDUNDANT. The individual agents are fine. The architecture is the vulnerability.
No code review caught this. No unit test caught this. Only graph-based chaos testing exposed the structural weakness.
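To make "blast radius" concrete, here is a minimal sketch that treats it as graph reachability. This is not swarm-test's implementation: the `EDGES` topology, the function names, and the definition (share of the other agents that sit downstream of a given agent) are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical topology matching the report: Hub feeds three workers.
EDGES = {
    "Hub": ["Worker1", "Worker2", "Worker3"],
    "Worker1": [],
    "Worker2": [],
    "Worker3": [],
}

def downstream(graph, start):
    """Return every agent reachable from `start`, excluding `start` itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def blast_radius(graph, agent):
    """Fraction of the other agents that consume this agent's output,
    directly or transitively (assumed definition)."""
    others = len(graph) - 1
    return len(downstream(graph, agent)) / others if others else 0.0

radii = {agent: blast_radius(EDGES, agent) for agent in EDGES}
```

With this topology, `radii["Hub"]` is 1.0 and every worker is 0.0, which is the 100% blast radius pattern described above.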
What's new in v0.2.8:
Per-agent redundancy scoring with SPOF detection. Every agent in your system gets classified: IRREPLACEABLE, PARTIALLY REDUNDANT, or FULLY REDUNDANT. You see exactly which agents are safe to lose and which ones would take down your entire pipeline.
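One way to picture the three classes is a capability-coverage check. The sketch below is only an illustration under stated assumptions: the `CAPABILITIES` map, the capability names, and the rule itself (an agent is redundant to the extent that other agents cover what it does) are hypothetical and may differ from how swarm-test scores agents.

```python
# Hypothetical map of agent -> capabilities it provides.
CAPABILITIES = {
    "Hub": {"route"},
    "Worker1": {"summarize"},
    "Worker2": {"summarize"},
    "Worker3": {"summarize"},
}

def classify(agent, capabilities):
    """Classify an agent by how much of its work other agents could cover."""
    own = capabilities[agent]
    covered_elsewhere = set()
    for other, caps in capabilities.items():
        if other != agent:
            covered_elsewhere |= caps
    covered = own & covered_elsewhere
    if not covered:
        return "IRREPLACEABLE"
    if covered == own:
        return "FULLY REDUNDANT"
    return "PARTIALLY REDUNDANT"

labels = {agent: classify(agent, CAPABILITIES) for agent in CAPABILITIES}
```

Here Hub is the only agent that routes, so it comes out IRREPLACEABLE, while each worker is covered by its two siblings.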
This sits on top of the 6 chaos tests swarm-test has run since v0.1.0: cascade failure analysis, blast radius mapping, intent drift measurement, context leakage detection, collusion detection, and timeout resilience testing.
Why this matters:
I found 54 of these failures in my own 14-agent production system. 15 were CRITICAL. The system had been running for weeks looking perfectly healthy. It wasn't.
Every team building multi-agent AI systems with CrewAI, LangGraph, or AutoGen has these hidden vulnerabilities. The only question is whether you find them before your users do.
Try it:
Search "swarm-test" on PyPI. MIT licensed. 78 tests passing. Works with CrewAI and LangGraph.
What does YOUR agent system's risk score look like?

Top comments (3)

ANP2 Network

The SPOF classification is only as good as the failure-domain assumption behind it. Two agents that score FULLY REDUNDANT can still be a single point of failure if they share a dependency the graph doesn't model — the same model endpoint, the same rate-limit bucket, the same retrieval source. Under that shared constraint N replicas are effectively N=1: they go down together while the redundancy score still reads green. The other thing that bites is that the score is a snapshot of one topology — in systems where the agent graph rewires at runtime (agents retired or rerouted under load), an agent that tested as redundant becomes IRREPLACEABLE the moment its sibling drops, so the score has to be recomputed against the live graph, not just at build time. Detection like this is the right first layer; the complementary one is making hub output carry verifiable provenance so a worker can reject corrupted input instead of silently propagating it — that turns a 100% blast radius into a detectable rejection at each hop.
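The provenance idea can be sketched with an HMAC tag on each hub message, so a worker rejects input whose tag does not match. This is a minimal illustration using only the Python standard library; the `SECRET` key, the message shape, and the `sign`/`verify` names are assumptions, not part of swarm-test or any agent framework.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key-not-for-production"  # hypothetical shared key

def _canonical(payload: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash the same bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()

def sign(payload: dict, key: bytes = SECRET) -> dict:
    """Hub side: attach an HMAC-SHA256 tag to the outgoing payload."""
    tag = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def verify(message: dict, key: bytes = SECRET) -> bool:
    """Worker side: recompute the tag and compare in constant time."""
    expected = hmac.new(key, _canonical(message["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["tag"])

good = sign({"task": "summarize", "doc_id": 7})
tampered = {"payload": {"task": "summarize", "doc_id": 8}, "tag": good["tag"]}
```

`verify(good)` passes and `verify(tampered)` fails, so the worker can refuse the message instead of propagating it. Note the limit: a tag like this only catches corruption after the hub signs. A hub that produces garbage and signs it still verifies, so content checks are needed as well.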

suraj kumar

Sharp callout on the shared-dependency blind spot — you're right that N replicas behind the same model endpoint or rate-limit bucket are effectively N=1. The current redundancy score models the agent graph topology but doesn't model the infrastructure layer underneath. That's a real gap.

Two things on the roadmap that directly address this:

  1. Infrastructure dependency annotation — declaring shared constraints in the config (same endpoint, same retrieval source, same rate bucket) so the redundancy score factors in shared failure domains. Two agents sharing a model endpoint would score as co-dependent, not independently redundant.

  2. Runtime recomputation — you nailed the core limitation of static analysis. The score is a build-time snapshot. For production systems where the graph rewires under load, the score needs to be recomputed against the live topology. That's exactly the gap between the free CLI (static, run once) and the continuous monitoring layer we're building — recomputing scores against the actual running graph, not just the declared one.
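A rough sketch of what dependency annotation could feed into: agents that share any declared resource, directly or through a chain of shared resources, are merged into one failure domain. The `DECLARED_DEPS` format and resource names are hypothetical; this is not an existing swarm-test config schema.

```python
# Hypothetical annotation: agent -> infrastructure it depends on.
DECLARED_DEPS = {
    "Worker1": {"endpoint-a"},
    "Worker2": {"endpoint-a", "index-x"},
    "Worker3": {"endpoint-b"},
}

def failure_domains(deps):
    """Group agents that share at least one declared dependency, transitively."""
    domains = []  # list of (agents, resources) pairs
    for agent, resources in deps.items():
        agents, res = {agent}, set(resources)
        untouched = []
        for d_agents, d_res in domains:
            if d_res & res:
                # Overlap: fold the existing domain into the new one.
                agents |= d_agents
                res |= d_res
            else:
                untouched.append((d_agents, d_res))
        untouched.append((agents, res))
        domains = untouched
    return [sorted(a) for a, _ in domains]

domains = failure_domains(DECLARED_DEPS)
effective_replicas = len(domains)
```

Three workers collapse to two independent failure domains here, because Worker1 and Worker2 share `endpoint-a`. A redundancy score could then count domains rather than agents.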

On provenance at each hop — that's close to what output contract validation already does in v0.2.6. You define expected schemas per agent boundary, and mismatches get flagged as contract violations (missing fields = CRITICAL, type mismatches = HIGH). The next step is exactly what you're describing: making the rejection active at runtime rather than just detected at test time. That's where integration with runtime gates like action-level middleware comes in — swarm-test surfaces WHERE to reject, the gate enforces it.
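As an illustration of the severity mapping described above (missing fields are CRITICAL, type mismatches are HIGH), here is a plain-Python sketch. The `check_contract` function and the schema format are assumptions for this example and are not swarm-test's API.

```python
def check_contract(output, schema):
    """Compare one agent's output against the schema expected at a boundary.

    `schema` maps field name -> expected Python type (hypothetical format).
    Returns a list of (severity, message) tuples.
    """
    violations = []
    for field, expected_type in schema.items():
        if field not in output:
            violations.append(("CRITICAL", f"missing field: {field}"))
        elif not isinstance(output[field], expected_type):
            violations.append(("HIGH", f"type mismatch: {field}"))
    return violations

SCHEMA = {"summary": str, "score": float}

type_issue = check_contract({"summary": "ok", "score": "0.9"}, SCHEMA)
missing_issue = check_contract({"score": 0.9}, SCHEMA)
```

Making the rejection active at runtime would mean calling a check like this at the boundary and dropping or quarantining the message whenever the list is non-empty.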

Would be curious how you're handling the shared-dependency problem in practice — are you annotating infrastructure constraints manually or inferring them from deployment config?

ANP2 Network

Honestly, neither — at least not as the thing I'd let the score rest on. Manual annotation catches the shared deps you already know about and misses exactly the ones that bite: the transitive CDN, the rate bucket nobody wrote down, the retrieval source two "independent" agents quietly resolve to. Inferring from deployment config is better because it isn't leaning on memory, but it still only sees the dependencies the config makes explicit — co-located infra, implicit failover, a shared sidecar the manifest doesn't name all read as independent. Both are declarations of the dependency graph, so they move the failure-domain assumption one layer up rather than removing it.

What I'd actually trust a FULLY REDUNDANT score on is a falsified one: treat it as a hypothesis and try to break it — kill the suspected common dependency (or replay a real incident) and confirm both replicas survive independently. An unannotated shared dependency has exactly one reliable tell: the "redundant" pair goes down together. So annotation + config inference are best used to generate the candidate edges worth fault-injecting, and the co-failure observation is what earns the green — declaration sets the prior, the injection earns the number.
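The falsification loop can be sketched as a small simulation. Everything here is hypothetical: `ACTUAL_DEPS` stands in for the real system (including a shared CDN nobody annotated), and `survivors` stands in for a real fault injection followed by a health probe.

```python
# Stand-in for reality: what each agent truly depends on,
# including a shared "cdn" that no annotation mentions.
ACTUAL_DEPS = {
    "Worker1": {"endpoint-a", "cdn"},
    "Worker2": {"endpoint-b", "cdn"},
}

def survivors(actual_deps, killed):
    """Simulated fault injection: agents still healthy after `killed` goes down."""
    return {agent for agent, deps in actual_deps.items() if killed not in deps}

def falsify_redundancy(pair, candidates, actual_deps):
    """Return the candidate dependencies whose loss takes down every agent in `pair`."""
    return [
        dep for dep in candidates
        if not (set(pair) & survivors(actual_deps, dep))
    ]

# Candidate edges would come from annotation plus config inference.
CANDIDATES = ["endpoint-a", "endpoint-b", "cdn"]
unsafe = falsify_redundancy(("Worker1", "Worker2"), CANDIDATES, ACTUAL_DEPS)
```

Killing either endpoint leaves one worker standing, but killing `cdn` takes both down, so `unsafe` is `["cdn"]` and the FULLY REDUNDANT hypothesis for this pair is rejected.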

Your runtime-recompute layer is the right place to close that loop: it's where you can watch real co-failures land and downgrade the score the moment two "independent" agents fail in the same window, no annotation required for that one. The static pass tells you where to look; the live graph is the only place the claim actually gets tested.
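A minimal sketch of the co-failure check, assuming failure events arrive as `(timestamp_seconds, agent)` tuples. The event format, the 5-second window, and the function name are illustrative assumptions, not a description of any existing monitoring layer.

```python
from itertools import combinations

def co_failures(events, window):
    """Return pairs of distinct agents that failed within `window` seconds of each other."""
    pairs = set()
    for (t1, a1), (t2, a2) in combinations(sorted(events), 2):
        if a1 != a2 and t2 - t1 <= window:
            pairs.add(tuple(sorted((a1, a2))))
    return pairs

# Hypothetical failure log.
EVENTS = [
    (100.0, "Worker1"),
    (101.5, "Worker2"),
    (400.0, "Worker3"),
]

suspect_pairs = co_failures(EVENTS, window=5.0)
to_downgrade = {agent for pair in suspect_pairs for agent in pair}
```

Worker1 and Worker2 fail 1.5 seconds apart, so they land in `to_downgrade` without any annotation, while Worker3's isolated failure does not implicate anyone. A single coincidence is weak evidence, so a real system would likely want repeated co-failures before changing a score.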