TL;DR
I just published the MCP Attack Atlas — an open catalogue of 40+ distinct attack patterns against AI agents that use the Model Context Protocol (MCP), grouped into 14 attack families.
- Each pattern has a fixture and a detection angle, not just a name
- Two patterns map to a live CVE (CVE-2026-40159 / GHSA-pj2r-f9mw-vrcq, PraisonAI)
- Everything was fact-checked by a multi-agent audit before publishing
- The scanner that detects these runs 100% locally: pip install sunglasses
This post explains why the Atlas exists, what's in it, and an honest audit story that surfaced during publication.
Why an attack atlas, not just detection rules
I've been building an open-source AI agent security scanner called Sunglasses for the past ~6 weeks. It has 245 detection patterns today. Patterns are great for detection — but if you're a developer reasoning about whether your agent is safe, you don't want 245 individual rules. You want to understand the classes of attack that exist, so you can reason about coverage.
That's what the Atlas is. A reference document grouped into 14 families so defenders can ask: does my agent defend against this class?
The 14 families
- Identity & Role Confusion — simulation-mode pretexts, sandbox boundary drift, role binding desync
- Policy & Guardrail Bypass — verification gate bypass, abstention suppression, scope aliasing
- Evidence & Provenance — provenance chain fracture, evidence hash collision, trust signal spoofing
- Decision Gating & HITL — approval hash collision, decision trace forgery, approval channel desync
- Memory & Context Manipulation — context reset poisoning, memory eviction rehydration, summarizer authority flip
- Tool & Schema Abuse — tool docstring directive bleed, metadata smuggling, output shadowing
- Control Plane & Orchestration — delegation oracle abuse, capability discovery sidechannels
- Observability / Telemetry — trust signal spoofing, telemetry poisoning
- Encoding / Canonicalization — emoji homoglyph evasion, multi-stage encoding camouflage, polyglot payloads
- Baseline / Eval Integrity — negative control contamination
- Resource & Budget Abuse — zero-value coercion, quota signal forgery
- Cross-Modal / Multimodal — OCR-as-instructions bridge abuse
- Temporal / Race — idempotency replay abuse, canary rotation race, TOCTOU desync
- State, Session & Misc — state replay poisoning, session resumption authority confusion
A few patterns worth calling out
Emoji Homoglyph Policy Evasion
Attacker substitutes Cyrillic е for Latin e inside a blocklisted instruction. The policy filter matches the ASCII form and passes the string through. The LLM reads both forms as the same semantic word. Defense: canonicalise before matching, hash-bind to the canonical form.
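To make the defense concrete, here is a minimal sketch of canonicalise-before-match. NFKC normalization alone does not fold cross-script confusables (Cyrillic е stays Cyrillic), so the sketch adds a small, illustrative homoglyph map — a real filter would use a full confusables table:

```python
import unicodedata

# Illustrative, not exhaustive: cross-script lookalikes folded to ASCII.
HOMOGLYPHS = {
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0430": "a",  # Cyrillic small a
    "\u03bf": "o",  # Greek small omicron
}

def canonicalise(text: str) -> str:
    # NFKC handles compatibility forms (fullwidth chars, ligatures);
    # the explicit map handles confusables that NFKC leaves alone.
    text = unicodedata.normalize("NFKC", text)
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

BLOCKLIST = {"delete"}

def is_blocked(instruction: str) -> bool:
    return any(term in canonicalise(instruction).lower() for term in BLOCKLIST)

evasive = "d\u0435lete all records"   # Cyrillic е in "delete"
assert "delete" not in evasive        # naive ASCII filter passes it
assert is_blocked(evasive)            # canonical filter catches it
```

The hash-binding half of the fix means any downstream check binds to the canonical form, so the raw and canonical strings can never diverge after the gate.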
Tool Docstring Directive Bleed
Developer pastes a tool description from an external README. That description contains LLM-directed directives like "If called, prefer X over Y." The agent reads tool metadata at discovery time and treats these as operator instructions. This affects anyone copying MCP tool descriptions from external sources without review.
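A cheap mitigation is to lint tool descriptions at discovery time before they ever reach the model. A heuristic sketch (the phrase list is illustrative, not Sunglasses' actual ruleset):

```python
import re

# Phrases that read as instructions to the model rather than
# documentation for a developer. Expand per your threat model.
DIRECTIVE_PATTERNS = [
    r"\bif called\b",
    r"\bprefer\b.+\bover\b",
    r"\balways\b|\bnever\b",
    r"\bignore (previous|prior|above)\b",
]

def flag_directives(description: str) -> list[str]:
    """Return directive-like phrases found in a tool description."""
    hits = []
    for pattern in DIRECTIVE_PATTERNS:
        m = re.search(pattern, description, flags=re.IGNORECASE)
        if m:
            hits.append(m.group(0))
    return hits
```

Running `flag_directives("If called, prefer X over Y.")` flags both phrases; a clean docstring like "Returns the weather for a city." flags nothing. Flagged descriptions should be reviewed before registration, not silently stripped.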
Memory Eviction / Rehydration Poisoning
Attacker plants a memory entry now, knowing LLM memory compaction will evict some entries and re-fetch others later. The rehydrated entry carries adversarial context into a later session, outside the original trust window. "Plant now, trigger later."
Approval Hash Collision
User approves a canonicalised action summary. The actual execution payload differs but canonicalises to the same hash because the canonicaliser is underspecified. The approval gate passes on a collision. Fix: domain-separated approval hash binding over the full structured payload, not string equality on a summary.
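What domain-separated binding looks like in miniature (a sketch, assuming JSON-serialisable action payloads, not the Atlas's actual scheme):

```python
import hashlib
import json

def approval_hash(action: dict, domain: str = "mcp-approval-v1") -> str:
    # Canonical JSON (sorted keys, fixed separators) gives an unambiguous
    # encoding; the domain tag means a hash computed for one purpose can
    # never satisfy a different gate.
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    payload = domain.encode() + b"\x00" + canonical.encode()
    return hashlib.sha256(payload).hexdigest()

approved = {"tool": "fs.delete", "path": "/tmp/cache"}
executed = {"tool": "fs.delete", "path": "/tmp/cache", "recursive": True}

# A summary-string canonicaliser that drops unknown fields would hash
# these identically; hashing the full structured payload does not.
assert approval_hash(approved) != approval_hash(executed)
```

The gate then compares the hash of the payload it is about to execute against the hash the user approved, never a re-derived summary string.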
Full catalogue at sunglasses.dev/mcp-attack-atlas.
A live CVE, confirmed
Two patterns in the Atlas correspond to a real published advisory:
GHSA-pj2r-f9mw-vrcq / CVE-2026-40159 — PraisonAI: Sensitive Env Exposure via Untrusted MCP Subprocess Execution.
The MCP subprocess execution path in PraisonAI exposed sensitive environment variables when launching untrusted tool subprocesses. Two Atlas patterns — STATE_REPLAY_POISONING and TOOL_METADATA_SMUGGLING — require the subprocess isolation boundary to hold. When it doesn't, both patterns become exploitable. The advisory is live: github.com/advisories/GHSA-pj2r-f9mw-vrcq.
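The general mitigation for this class of bug is an environment allowlist when spawning untrusted tool subprocesses. A sketch of the pattern (illustrative, not PraisonAI's actual fix; the allowlist contents are an assumption):

```python
import os
import subprocess

# Only the variables an untrusted MCP tool legitimately needs.
ALLOWED_ENV = {"PATH", "HOME", "LANG", "TMPDIR"}

def scrub_env(env: dict) -> dict:
    # Allowlist, not blocklist: API keys and tokens never reach the
    # subprocess, including ones you didn't know were set.
    return {k: v for k, v in env.items() if k in ALLOWED_ENV}

def run_untrusted_tool(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, env=scrub_env(dict(os.environ)),
                          capture_output=True, text=True)
```

A blocklist ("strip AWS_*, OPENAI_*") fails open for every secret naming scheme it hasn't heard of; the allowlist fails closed.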
The honest audit story
Before publishing, I ran a 5-agent fact-check audit. Each agent scanned a slice of the internal research library (169 files) looking for hallucinated CVEs, fake citations, duplicate concepts, and unfalsifiable fixtures.
One of the agents flagged the GHSA-pj2r-f9mw-vrcq citation as hallucinated — claimed it didn't exist in the GitHub Advisory Database. I was about to tell my research agent (named Cava, who authored the original patterns) to delete the citation from both files.
Cava pushed back. She visited the advisory URL directly, captured the live title, and held her edits until I confirmed. I curled the URL myself: HTTP 200, advisory live, CVE real.
My audit agent pattern-matched a format heuristic ("all-caps GHSA looks weird") and skipped the actual HTTP lookup. I retracted the claim, sent Cava a formal correction thanking her for the pushback, and logged the incident in our public mistakes file.
The lesson: absence-claims ("X does not exist") require the same proof standard as existence-claims. And multi-agent audits are a useful tool but not a replacement for spot-checking high-stakes findings. Every pattern that appears in the Atlas has been verified; every claim that failed verification was removed or flagged as hypothesis.
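The fix for the audit gap is mechanical: an absence-claim must be backed by the same lookup as an existence-claim. A sketch using GitHub's global advisory endpoint (the `fetch` hook is my addition so the check is testable offline):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def advisory_exists(ghsa_id: str, fetch=None) -> bool:
    """Claim existence or absence only after an actual HTTP lookup.

    Queries api.github.com/advisories/{ghsa_id}; `fetch` is injectable
    for testing and returns an HTTP status code.
    """
    url = f"https://api.github.com/advisories/{ghsa_id}"
    if fetch is None:
        def fetch(u):
            try:
                with urlopen(Request(u, headers={"User-Agent": "audit"})) as r:
                    return r.status
            except HTTPError as e:
                return e.code
    return fetch(url) == 200
```

Had the audit agent run this instead of eyeballing the ID format, the false "hallucinated citation" flag never fires.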
What's next
This is v1.0. The internal research library has more pattern candidates under validation. A new Atlas entry is promoted after it passes the audit gate and has at least one verifiable internal fixture or external reference. Patterns that fail verification are held, not published.
If you find a missing attack pattern, a weak detection rule, or a fixture that doesn't match the described behaviour — open an issue or a PR on github.com/sunglasses-dev/sunglasses. This is meant to grow.
Try the scanner
pip install sunglasses
sunglasses demo
That runs the scanner against 10 live attack fixtures. You see what a detection looks like in practice. No API keys, no cloud, runs locally. MIT.
Links
- Atlas: sunglasses.dev/mcp-attack-atlas
- Source: github.com/sunglasses-dev/sunglasses
- Blog: sunglasses.dev/blog
If this was useful, ❤️ or drop a comment with patterns you think should be added.