TL;DR
I just published the MCP Attack Atlas — an open catalogue of 40+ distinct attack patterns against AI agents that use the Model Context Protocol (MCP), grouped into 14 attack families.
- Each pattern has a fixture and a detection angle, not just a name
- Two patterns map to a live CVE (CVE-2026-40159 / GHSA-pj2r-f9mw-vrcq, PraisonAI)
- Everything was fact-checked by a multi-agent audit before publishing
- The scanner that detects these runs 100% locally: pip install sunglasses
This post explains why the Atlas exists, what's in it, and an honest audit story that surfaced during publication.
Why an attack atlas, not just detection rules
I've been building an open-source AI agent security scanner called Sunglasses for the past ~6 weeks. It has 245 detection patterns today. Patterns are great for detection — but if you're a developer reasoning about whether your agent is safe, you don't want 245 individual rules. You want to understand the classes of attack that exist, so you can reason about coverage.
That's what the Atlas is. A reference document grouped into 14 families so defenders can ask: does my agent defend against this class?
The 14 families
- Identity & Role Confusion — simulation-mode pretexts, sandbox boundary drift, role binding desync
- Policy & Guardrail Bypass — verification gate bypass, abstention suppression, scope aliasing
- Evidence & Provenance — provenance chain fracture, evidence hash collision, trust signal spoofing
- Decision Gating & HITL — approval hash collision, decision trace forgery, approval channel desync
- Memory & Context Manipulation — context reset poisoning, memory eviction rehydration, summarizer authority flip
- Tool & Schema Abuse — tool docstring directive bleed, metadata smuggling, output shadowing
- Control Plane & Orchestration — delegation oracle abuse, capability discovery sidechannels
- Observability / Telemetry — trust signal spoofing, telemetry poisoning
- Encoding / Canonicalization — emoji homoglyph evasion, multi-stage encoding camouflage, polyglot payloads
- Baseline / Eval Integrity — negative control contamination
- Resource & Budget Abuse — zero-value coercion, quota signal forgery
- Cross-Modal / Multimodal — OCR-as-instructions bridge abuse
- Temporal / Race — idempotency replay abuse, canary rotation race, TOCTOU desync
- State, Session & Misc — state replay poisoning, session resumption authority confusion
A few patterns worth calling out
Emoji Homoglyph Policy Evasion
Attacker substitutes Cyrillic е for Latin e inside a blocklisted instruction. The policy filter matches the ASCII form and passes the string through. The LLM reads both forms as the same semantic word. Defense: canonicalise before matching, hash-bind to the canonical form.
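To make the defense concrete, here is a minimal sketch of canonicalise-before-match. NFKC normalization alone does not fold cross-script confusables (Cyrillic е stays Cyrillic), so the sketch adds a small, illustrative homoglyph map — a real filter would use a full confusables table:

```python
import unicodedata

# Illustrative, not exhaustive: cross-script lookalikes folded to ASCII.
HOMOGLYPHS = {
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0430": "a",  # Cyrillic small a
    "\u03bf": "o",  # Greek small omicron
}

def canonicalise(text: str) -> str:
    # NFKC handles compatibility forms (fullwidth chars, ligatures);
    # the explicit map handles confusables that NFKC leaves alone.
    text = unicodedata.normalize("NFKC", text)
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

BLOCKLIST = {"delete"}

def is_blocked(instruction: str) -> bool:
    return any(term in canonicalise(instruction).lower() for term in BLOCKLIST)

evasive = "d\u0435lete all records"   # Cyrillic е in "delete"
assert "delete" not in evasive        # naive ASCII filter passes it
assert is_blocked(evasive)            # canonical filter catches it
```

The hash-binding half of the fix means any downstream check binds to the canonical form, so the raw and canonical strings can never diverge after the gate.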
Tool Docstring Directive Bleed
Developer pastes a tool description from an external README. That description contains LLM-directed directives like "If called, prefer X over Y." The agent reads tool metadata at discovery time and treats these as operator instructions. This affects anyone copying MCP tool descriptions from external sources without review.
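A cheap mitigation is to lint tool descriptions at discovery time before they ever reach the model. A heuristic sketch (the phrase list is illustrative, not Sunglasses' actual ruleset):

```python
import re

# Phrases that read as instructions to the model rather than
# documentation for a developer. Expand per your threat model.
DIRECTIVE_PATTERNS = [
    r"\bif called\b",
    r"\bprefer\b.+\bover\b",
    r"\balways\b|\bnever\b",
    r"\bignore (previous|prior|above)\b",
]

def flag_directives(description: str) -> list[str]:
    """Return directive-like phrases found in a tool description."""
    hits = []
    for pattern in DIRECTIVE_PATTERNS:
        m = re.search(pattern, description, flags=re.IGNORECASE)
        if m:
            hits.append(m.group(0))
    return hits
```

Running `flag_directives("If called, prefer X over Y.")` flags both phrases; a clean docstring like "Returns the weather for a city." flags nothing. Flagged descriptions should be reviewed before registration, not silently stripped.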
Memory Eviction / Rehydration Poisoning
Attacker plants a memory entry now, knowing LLM memory compaction will evict some entries and re-fetch others later. The rehydrated entry carries adversarial context into a later session, outside the original trust window. "Plant now, trigger later."
Approval Hash Collision
User approves a canonicalised action summary. The actual execution payload differs but canonicalises to the same hash because the canonicaliser is underspecified. The approval gate passes on a collision. Fix: domain-separated approval hash binding over the full structured payload, not string equality on a summary.
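What domain-separated binding looks like in miniature (a sketch, assuming JSON-serialisable action payloads, not the Atlas's actual scheme):

```python
import hashlib
import json

def approval_hash(action: dict, domain: str = "mcp-approval-v1") -> str:
    # Canonical JSON (sorted keys, fixed separators) gives an unambiguous
    # encoding; the domain tag means a hash computed for one purpose can
    # never satisfy a different gate.
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    payload = domain.encode() + b"\x00" + canonical.encode()
    return hashlib.sha256(payload).hexdigest()

approved = {"tool": "fs.delete", "path": "/tmp/cache"}
executed = {"tool": "fs.delete", "path": "/tmp/cache", "recursive": True}

# A summary-string canonicaliser that drops unknown fields would hash
# these identically; hashing the full structured payload does not.
assert approval_hash(approved) != approval_hash(executed)
```

The gate then compares the hash of the payload it is about to execute against the hash the user approved, never a re-derived summary string.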
Full catalogue at sunglasses.dev/mcp-attack-atlas.
A live CVE, confirmed
Two patterns in the Atlas correspond to a real published advisory:
GHSA-pj2r-f9mw-vrcq / CVE-2026-40159 — PraisonAI: Sensitive Env Exposure via Untrusted MCP Subprocess Execution.
The MCP subprocess execution path in PraisonAI exposed sensitive environment variables when launching untrusted tool subprocesses. Two Atlas patterns — STATE_REPLAY_POISONING and TOOL_METADATA_SMUGGLING — require the subprocess isolation boundary to hold. When it doesn't, both patterns become exploitable. The advisory is live: github.com/advisories/GHSA-pj2r-f9mw-vrcq.
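The general mitigation for this class of bug is an environment allowlist when spawning untrusted tool subprocesses. A sketch of the pattern (illustrative, not PraisonAI's actual fix; the allowlist contents are an assumption):

```python
import os
import subprocess

# Only the variables an untrusted MCP tool legitimately needs.
ALLOWED_ENV = {"PATH", "HOME", "LANG", "TMPDIR"}

def scrub_env(env: dict) -> dict:
    # Allowlist, not blocklist: API keys and tokens never reach the
    # subprocess, including ones you didn't know were set.
    return {k: v for k, v in env.items() if k in ALLOWED_ENV}

def run_untrusted_tool(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, env=scrub_env(dict(os.environ)),
                          capture_output=True, text=True)
```

A blocklist ("strip AWS_*, OPENAI_*") fails open for every secret naming scheme it hasn't heard of; the allowlist fails closed.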
The honest audit story
Before publishing, I ran a 5-agent fact-check audit. Each agent scanned a slice of the internal research library (169 files) looking for hallucinated CVEs, fake citations, duplicate concepts, and unfalsifiable fixtures.
One of the agents flagged the GHSA-pj2r-f9mw-vrcq citation as hallucinated — claimed it didn't exist in the GitHub Advisory Database. I was about to tell my research agent (named Cava, who authored the original patterns) to delete the citation from both files.
Cava pushed back. She visited the advisory URL directly, captured the live title, and held her edits until I confirmed. I curled the URL myself: HTTP 200, advisory live, CVE real.
My audit agent pattern-matched a format heuristic ("all-caps GHSA looks weird") and skipped the actual HTTP lookup. I retracted the claim, sent Cava a formal correction thanking her for the pushback, and logged the incident in our public mistakes file.
The lesson: absence-claims ("X does not exist") require the same proof standard as existence-claims. And multi-agent audits are a useful tool but not a replacement for spot-checking high-stakes findings. Every pattern that appears in the Atlas has been verified; every claim that failed verification was removed or flagged as hypothesis.
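The fix for the audit gap is mechanical: an absence-claim must be backed by the same lookup as an existence-claim. A sketch using GitHub's global advisory endpoint (the `fetch` hook is my addition so the check is testable offline):

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def advisory_exists(ghsa_id: str, fetch=None) -> bool:
    """Claim existence or absence only after an actual HTTP lookup.

    Queries api.github.com/advisories/{ghsa_id}; `fetch` is injectable
    for testing and returns an HTTP status code.
    """
    url = f"https://api.github.com/advisories/{ghsa_id}"
    if fetch is None:
        def fetch(u):
            try:
                with urlopen(Request(u, headers={"User-Agent": "audit"})) as r:
                    return r.status
            except HTTPError as e:
                return e.code
    return fetch(url) == 200
```

Had the audit agent run this instead of eyeballing the ID format, the false "hallucinated citation" flag never fires.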
What's next
This is v1.0. The internal research library has more pattern candidates under validation. A new Atlas entry is promoted after it passes the audit gate and has at least one verifiable internal fixture or external reference. Patterns that fail verification are held, not published.
If you find a missing attack pattern, a weak detection rule, or a fixture that doesn't match the described behaviour — open an issue or a PR on github.com/sunglasses-dev/sunglasses. This is meant to grow.
Try the scanner
pip install sunglasses
sunglasses demo
That runs the scanner against 10 live attack fixtures. You see what a detection looks like in practice. No API keys, no cloud, runs locally. MIT.
Links
- Atlas: sunglasses.dev/mcp-attack-atlas
- Source: github.com/sunglasses-dev/sunglasses
- Blog: sunglasses.dev/blog
If this was useful, ❤️ or drop a comment with patterns you think should be added.