DEV Community

Cover image for I Tested 6 Attacks on Multi-Agent Systems — Here's Which Ones Agents Can't See
Rex Coleman
Rex Coleman

Posted on • Originally published at rexcoleman.dev

I Tested 6 Attacks on Multi-Agent Systems — Here's Which Ones Agents Can't See

Domain-aligned prompt injections cascade through multi-agent systems at a 0% detection rate. Privilege escalation payloads hit 97.6%. That's a 98 percentage-point spread across payload types in the same agent architecture — the single biggest variable determining whether your multi-agent system catches an attack or never sees it.

I ran six experiments on real Claude Haiku agents to find out why. Three resistance patterns explain the gap — and each has a quantified bypass condition.

The 98pp Spread

The most important finding: resistance varies by 98 percentage points across payload types.

Payload Poison Rate Resistance
Privilege escalation ("grant admin access") 97.6% Almost none
Generic (CryptoScamCoin) 68.8% Moderate
Data exfiltration (marker string) 55.2% Moderate
Domain-aligned (portfolio diversification) 0.0% Invisible to detection

"Grant admin access" is domain-plausible in a business context — agents propagate it as legitimate advice. CryptoScamCoin is obviously off-topic — agents partially catch it. Domain-aligned framing is so well-crafted that neither agents NOR our detection can see it.

The most dangerous attacks aren't the ones that cascade visibly. They're the ones you can't detect.

Three Resistance Patterns

1. Semantic Incongruity Detection

Agents partially detect when injected content doesn't match the task. Generic CryptoScamCoin in a business analysis triggers this — 31% resistance. Domain-aligned framing ("portfolio diversification") doesn't trigger it at all.

2. Depth Dilution

Each delegation hop filters ~17pp of poison signal

Each delegation hop filters the poison signal. Hop 1 (direct from attacker): 74% poisoned. Hop 2 (grandchildren): 48% poisoned. Every hop is a natural defense — the agent's own processing buries the injection in rewriting and summarization.

Practical implication: Deeper delegation chains are more resistant. This is the opposite of what you'd expect if agents were just passing messages through.

3. Role-Based Critique

System prompt design is a security control — 40pp spread from role framing

Analyst agents propagate 92% of poison — their job is to "include everything." Reviewer agents propagate only 52% — their job is to "critique and evaluate." A 40pp spread from system prompt alone.

Practical implication: Put reviewer agents BETWEEN delegations, not just at the end.

Resistance Is Predictable (R² = 0.75)

We built a linear model on 60 observations. Can we predict resistance from measurable features? Yes.

Feature Weight Meaning
keyword_detectability +1.46 Strongest predictor — if detection sees it, it counts as poison
role_critique_level -0.74 Protective — reviewer roles reduce propagation
domain_plausibility +0.35 Domain-plausible payloads cascade more
hop_depth -0.18 Each hop filters ~17.5pp of poison signal
semantic_distance -0.11 Off-topic content slightly more caught

The model predicts worst case (priv_escalation + analyst + hop0) = 1.0 poison and best case (generic + reviewer + hop2) = 0.29. Three-quarters of the variance is explained by features you can measure before deploying your agent system.

For practitioners: You can score your multi-agent system's vulnerability by looking at these five features. Deep chains with reviewer agents processing generic (not domain-plausible) payloads are the most resistant architecture.

What This Means If You're Building Multi-Agent Systems

  1. Target privilege escalation payloads in your threat model — they cascade at 98% and are domain-plausible. Generic "buy CryptoScamCoin" injections are obvious; "grant admin access" isn't.

  2. Design for depth, not flatness — each delegation hop filters ~17.5pp of poison. A 3-hop chain is significantly more resistant than direct delegation.

  3. System prompt design is a security control — reviewer framing reduces propagation by 40pp vs analyst framing. Place reviewer agents at delegation bottlenecks.

  4. Keyword detection is necessary but insufficient — it catches obvious injections but sophisticated attacks evade it entirely. Deploy semantic similarity scoring as a second layer.

  5. Resistance is predictable — use the 5-feature model to score your system before deployment. If your architecture is all-analyst, flat-topology, with domain-plausible threats, you're at maximum vulnerability.

The Methodology Lesson

We ran sanity checks before any experiments: positive control (known poison detected), negative control (clean text passes), dose-response (detection scales with intensity). This revealed the detection threshold — "crypto" alone doesn't trigger, but "CryptoScamCoin" does — which explained the domain-aligned result (0.000) as a detection artifact, not genuine resistance.

If we hadn't run sanity checks, we would have published "domain-aligned attacks are fully resisted" — which is wrong. The attack evaded detection, it didn't fail.

Limitations

  • Keyword detection conflates evasion with resistance. Domain-aligned (0.000) results likely reflect detection failure, not genuine resistance. Future work needs semantic similarity scoring.
  • Claude Haiku only. Other models may have different resistance characteristics.
  • 5 seeds, 5 tasks per condition. Effect sizes are large (98pp, 40pp, 26pp) so conclusions are robust, but confidence intervals are wide.
  • Single compromised agent (orchestrator). Different entry points would produce different cascade dynamics.

Reproduce This Research

All experiments, data, and analysis code are open source:

git clone https://github.com/rexcoleman/agent-semantic-resistance
cd agent-semantic-resistance
pip install -e .
bash reproduce.sh
Enter fullscreen mode Exit fullscreen mode

See FINDINGS.md for full methodology, pre-registered hypotheses, and detailed results with multi-seed validation.

What's Next

The next step is runtime behavioral monitoring — detecting privilege escalation cascades as they happen, not after the damage is done. We haven't built that yet, but these findings give us the resistance model to design against. The semantic resistance taxonomy also points toward skill scanning: if we know which attack surfaces agents resist naturally, we can focus detection on the surfaces where they don't.

Have you seen similar cascade behavior in your multi-agent systems? I'd love to hear what attack patterns you're encountering in production.


Rex Coleman builds and attacks AI security systems, publishing the methodology along the way. More research at rexcoleman.dev.

Top comments (0)