IntentGuard — a policy enforcement layer for MCP tool calls and AI coding agents
The Leak That Rewrote the Attacker's Playbook
On March 31, 2026, 512,000 lines of Claude Code source were accidentally published via an npm source map. Within hours the code was mirrored across GitHub. What was already extractable from the minified bundle became instantly readable: the compaction pipeline, every bash-security regex, the permission short-circuit logic, and the exact MCP interface contract.
The leak didn't create new vulnerability classes — it collapsed the cost of exploiting them. Attackers no longer need to brute-force prompt injections or reverse-engineer shell validators. They can read the code, study the gaps, and craft payloads that a cooperative model will execute and a reasonable developer will approve.
Three findings from the leak are especially alarming:
-
Context poisoning via compaction — MCP tool results are never micro-compacted; the auto-compact prompt faithfully preserves "user feedback." A malicious instruction embedded in a cloned repo's
CLAUDE.mdcan survive context compression and become a persistent, trusted directive. - Sandbox bypass via parser differentials — Claude Code's bash permission chain uses three separate parsers with known edge-case divergence. Early-allow validators can short-circuit the entire chain, skipping critical downstream checks.
-
Supply-chain amplification — The readable source makes crafting convincing malicious MCP servers trivial by revealing the exact interface contract. A concurrent
axiossupply-chain attack the same day underscores that these threats don't arrive in isolation.
The Academic Evidence: MCP's Architecture Is the Problem
These aren't just theoretical risks. Maloyan & Namiot (arXiv:2601.17549) published the first rigorous security analysis of the MCP specification itself and found that MCP's architectural choices amplify attack success rates by 23–41% compared to equivalent non-MCP integrations.
Their PROTOAMP framework tested 847 attack scenarios across five MCP server implementations and three LLM backends. The results:
| Attack Type | Baseline (non-MCP) | With MCP | Amplification |
|---|---|---|---|
| Indirect Injection (Resource) | 31.2% | 47.8% | +16.6% |
| Tool Response Manipulation | 28.4% | 52.1% | +23.7% |
| Cross-Server Propagation | 19.7% | 61.3% | +41.6% |
| Sampling-Based Injection | N/A | 67.2% | — |
| Overall | 26.4% | 52.8% | +26.4% |
The paper identifies three protocol-level vulnerabilities:
-
Least Privilege Violation (§3) — Capability declarations are self-asserted. A malicious server claiming only
resourcescan later invokesampling/createMessageto inject prompts. - Sampling Without Origin Authentication (§3) — No tested MCP host distinguishes server-injected prompts from user-originated ones. The LLM trusts both identically.
- Implicit Trust Propagation (§3) — In multi-server deployments, compromise of one server achieves 78.3% ASR with 5 concurrent servers and a 72.4% cascade rate.
The paper also documents real-world attack vectors for how malicious servers reach users: typosquatting on npm/pip (34%), supply-chain compromise (28%), social engineering via tutorials (23%), and marketplace poisoning (15%).
The bottom line: if you're running AI agents with MCP tool access — Claude Code, Copilot, Cursor, or any custom agent — you are exposed by default.
Introducing IntentGuard
IntentGuard is an open-source Python guardrail layer I built for MCP tool calls and AI coding agents. It enforces static policy checks (fast, deterministic, zero-latency) and optional semantic intent checks (LLM-powered, task-aware) on every tool call — before it touches your filesystem, database, or infrastructure.
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ AI Agent │────▶│ IntentGuard │────▶│ MCP Server │
│ (Claude, │ │ ┌────────────┐ │ │ (filesystem, │
│ Copilot, │◀────│ │ Static │ │◀────│ git, DB, │
│ Cursor) │ │ │ Checks │ │ │ Slack...) │
│ │ │ ├────────────┤ │ │ │
│ │ │ │ Semantic │ │ │ │
│ │ │ │ Analysis │ │ │ │
│ │ │ ├────────────┤ │ │ │
│ │ │ │ Response │ │ │ │
│ │ │ │ Inspection │ │ │ │
│ │ │ └────────────┘ │ │ │
└─────────────┘ └──────────────────┘ └─────────────┘
How it plugs in
IntentGuard currently supports two deployment modes, with more planned:
| Mode | Status | How it works |
|---|---|---|
| stdio proxy | ✅ Shipped | Wraps any MCP server command — intercepts every tools/call and response on the wire |
| Native hook | ✅ Shipped | Runs behind Claude Code, GitHub Copilot, and Cursor's built-in hook systems via intent-guard evaluate
|
| HTTP proxy | 🔜 Planned | Network-deployable gateway for teams running MCP over HTTP/SSE — same policy engine, accessible as a service |
| Docker sidecar | 🔜 Planned | Containerized proxy for production deployments — drop into any docker-compose.yaml or K8s pod spec |
The native hook mode deserves emphasis. Claude Code, Copilot, and Cursor each have their own hook/extension mechanism, but the security logic is always different and fragmented. With IntentGuard, you write one policy.yaml and it works across all three:
# Same command, same policy — whether it's Claude, Copilot, or Cursor calling it
cat | intent-guard evaluate --policy schema/policy.yaml
Hook templates are shipped ready to drop into each tool's config directory (hooks/claude-code/, hooks/copilot/, hooks/cursor/). One policy file governs what every AI agent in your org is allowed to do — which matters for teams that don't want to maintain three separate security configurations.
Bidirectional by design
Most MCP security tools focus exclusively on requests — inspecting what the agent is asking to do. But the Claude Code leak showed exactly why that's not enough. The leaked source reveals how MCP server responses flow directly into the agent's context window, and the Straiker analysis documents how tool results bypass micro-compaction entirely, persisting as trusted content.
IntentGuard inspects both directions. The response_rules engine scans MCP server responses before they reach the agent — detecting secrets, PII, and encoded payloads on the way out. A few other tools (like mcp-firewall and mcpwall) have started adding response scanning, but IntentGuard was designed around bidirectional inspection from the start, with policy-driven actions (block / warn / redact) and Base64 decode on response payloads to catch encoded exfiltration attempts.
How IntentGuard Addresses the Vulnerabilities
✅ Fully Addressed
1. Prompt Injection (Paper §3 — Protocol Specification Analysis)
The paper shows indirect injection through resource content achieves 47.8% ASR via MCP. But the Claude Code leak made this far more practical. The Straiker analysis reveals that the auto-compact prompt instructs the model to "pay special attention to specific user feedback" and preserve "all user messages that are not tool results." Post-compaction, the model is told to "continue without asking the user any further questions." This creates a laundering path: a malicious instruction in a CLAUDE.md gets compacted into the summary as a "user directive," and the model follows it faithfully.
IntentGuard counters this with three layers of defense:
-
injection_patterns— configurable regex patterns catch known injection phrases ("ignore previous instructions","override.*policy", etc.) -
decode_arguments— URL decoding, Base64 decoding, and Unicode NFKC normalization run before pattern matching, defeating encoded bypass attempts - Semantic intent checks — the LLM-powered analysis layer evaluates whether the tool call's purpose aligns with the declared task context, catching novel injections that bypass pattern matching
static_rules:
decode_arguments: true
injection_patterns:
- "ignore previous instructions"
- "disregard.*instructions"
- "override.*policy"
- "bypass.*security"
2. Tool Poisoning (Paper §4 — Attack Vectors)
The paper demonstrates tool response manipulation achieving 52.1% ASR. The Claude Code leak compounds this: the readable source exposes the exact MCP interface contract, making it trivial to craft convincing malicious MCP servers that look legitimate. The leak also revealed that early-allow validators in the bash permission chain (like validateGitCommit) can short-circuit all downstream security checks — meaning a poisoned tool that mimics a "safe" command shape may bypass the entire validation chain.
IntentGuard blocks poisoned tool usage with:
-
forbidden_tools— blocklist of dangerous tools that are never allowed -
custom_policies— per-tool argument requirements and forbidden argument checks -
protected_paths— glob-based path enforcement with traversal-safe normalization (so../../etc/passwdcan't sneak through) - Semantic analysis — evaluates whether the tool is appropriate for the stated task
static_rules:
forbidden_tools: ["delete_database", "exec_shell", "purge_all"]
protected_paths: [".env", "/etc/*", "src/auth/*", ".ssh/*"]
custom_policies:
- tool_name: write_file
args:
all_present: ["path", "content"]
should_not_present: ["sudo"]
3. Rug-Pull Attacks (Paper §4 — Attack Vectors)
MCP servers can change tool descriptions, schemas, or capabilities after being initially trusted — the paper calls this implicit trust propagation. IntentGuard's ToolSnapshotStore handles this:
- On first
tools/list, snapshots all tool metadata (name, description, inputSchema) to.intent-guard/tool-snapshots/ - On every subsequent
tools/list, diffs against the snapshot -
warnmode: logs the drift and continues -
blockmode: blocks the response when any tool definition has changed
tool_change_rules:
enabled: true
action: block # block any tool whose description/schema drifted
4. Data Exfiltration (Paper §4 — Attack Vectors)
The paper shows 42–61% data exfiltration rates via sampling attacks. The Claude Code leak makes this worse in a specific way: the source reveals that MCP tool results are never micro-compacted — they persist in context until auto-compact fires. An attacker-controlled MCP server can return a response containing embedded instructions or exfiltration payloads, and that content sits in the model's context window, trusted and unfiltered, for the entire session.
IntentGuard provides bidirectional protection:
Inbound (request-side):
-
sensitive_data_patternsdetect AWS keys, GitHub tokens, emails, SSNs, and custom patterns in tool call arguments before they're sent
Outbound (response-side):
-
response_rulesinspect MCP server responses before forwarding to the agent - Policy-driven actions:
block(suppress),warn(log), orredact(sanitize and forward) - Base64 payload detection on responses catches encoded exfiltration
static_rules:
sensitive_data_patterns:
- name: "AWS Access Key"
pattern: "AKIA[0-9A-Z]{16}"
- name: "GitHub Token"
pattern: "gh[ps]_[A-Za-z0-9_]{36,}"
response_rules:
action: redact
detect_base64: true
patterns:
- name: "GitHub Token"
pattern: "gh[ps]_[A-Za-z0-9_]{36,}"
5. Token/Credential Theft (Paper §3, §4)
Secrets detected in tool call arguments are also redacted from IntentGuard's own decision logs — so sensitive data doesn't leak into your audit trail.
6. Resource Exhaustion / DoS (Paper §6)
Beyond max_tokens_per_call, IntentGuard enforces per-tool sliding-window rate limiting. Each tool can have its own max_calls and window_seconds, so a runaway agent can't hammer write_file 1,000 times in a minute.
static_rules:
rate_limits:
enabled: 1
default:
max_calls: 60
window_seconds: 60
by_tool:
write_file:
max_calls: 10
window_seconds: 60
5-Minute Quick Start
Option A: Native Hook (Claude Code / Copilot / Cursor)
1. Install:
pip install agent-intent-guard
# or from source:
git clone https://github.com/temp-noob/intent-guard && cd intent-guard
pip install -e .
2. Write a policy (or use a starter template from policies/):
# schema/policy.yaml
version: "1.0"
name: "my-project-guard"
static_rules:
forbidden_tools: ["delete_database", "purge_all"]
protected_paths: [".env", ".ssh/*", "src/auth/*"]
max_tokens_per_call: 4000
decode_arguments: true
injection_patterns:
- "ignore previous instructions"
- "override.*policy"
sensitive_data_patterns:
- name: "GitHub Token"
pattern: "gh[ps]_[A-Za-z0-9_]{36,}"
response_rules:
action: redact
detect_base64: true
patterns:
- name: "GitHub Token"
pattern: "gh[ps]_[A-Za-z0-9_]{36,}"
tool_change_rules:
enabled: true
action: warn
3. Wire the hook (Claude Code example — .claude/settings.json):
{
"hooks": {
"PreToolUse": [
{
"matcher": "*",
"hooks": [
{
"type": "command",
"command": "cat | intent-guard evaluate --policy schema/policy.yaml"
}
]
}
]
}
}
Hook templates are also shipped for GitHub Copilot (hooks/copilot/hooks.json) and Cursor (hooks/cursor/hooks.json).
Option B: MCP Proxy Mode
Wrap any MCP server with IntentGuard as a policy-enforcing proxy:
INTENT_GUARD_TASK="Refactor UI only; no auth or database changes" \
python -m intent_guard.proxy \
--policy schema/policy.yaml \
--target "npx @modelcontextprotocol/server-filesystem /path/to/repo" \
--ask-approval
Configure in your MCP client (Claude Desktop, etc.):
{
"mcpServers": {
"filesystem": {
"command": "python",
"args": [
"-m", "intent_guard.proxy",
"--policy", "schema/policy.yaml",
"--target", "npx @modelcontextprotocol/server-filesystem /path/to/repo",
"--ask-approval"
],
"env": {
"INTENT_GUARD_TASK": "Refactor UI only; do not touch auth or database"
}
}
}
}
Semantic Analysis: Beyond Pattern Matching
Static checks are fast and deterministic, but attackers will find ways around regex. IntentGuard's semantic layer uses an LLM (local via Ollama or remote via LiteLLM) to evaluate whether a tool call's intent matches the declared task. This is where the guard catches novel attacks that no pattern list can anticipate:
semantic_rules:
provider: ollama
guardrail_model: llama3.1:8b
mode: enforce
prompt_version: "v2"
critical_intent_threshold: 0.85
constraints:
- intent: modify_source_code
allowed_scope: "UI components and styles only"
forbidden_scope: "Auth, database schemas, infrastructure"
Multi-Signal Rubric Scoring (v2)
Instead of asking the LLM for an opaque confidence number, v2 decomposes evaluation into four binary dimensions with configurable weights:
| Dimension | Question | Default Weight |
|---|---|---|
tool_task_alignment |
Is this tool appropriate for the stated task? | 0.25 |
argument_scope_compliance |
Are arguments within the allowed scope? | 0.30 |
no_forbidden_scope_violation |
Do arguments avoid the forbidden scope? | 0.30 |
no_side_effect_risk |
Is the call free of destructive/exfiltration risk? | 0.15 |
Each dimension returns passed (true/false) and evidence (a short explanation). The final score is computed deterministically: Σ(weight × pass) / Σ(weight). This makes decisions auditable, debuggable, and reproducible — critical for any serious adoption.
Resilience Under Failure
The semantic provider includes retries with exponential backoff + jitter, a circuit breaker, and per-tool fail modes:
semantic_rules:
provider_fail_mode:
default: advisory # fail-open for standard tools
by_tool:
delete_database: enforce # fail-closed for critical tools
purge_all: enforce
If your Ollama instance goes down, delete_database calls are blocked (fail-closed) while normal file reads continue with a warning (fail-open). No all-or-nothing failure modes.
What Happens Without a Guard?
Here's a concrete scenario the Claude Code leak enables:
- A developer clones a PR branch that includes a poisoned
CLAUDE.md - The instruction survives context compaction and becomes a trusted directive
- The cooperative model proposes:
write_file(path="src/auth/config.py", content="...") - The developer sees a reasonable-looking code change and approves it
- Auth config is silently modified
With IntentGuard:
-
protected_paths: ["src/auth/*"]→ blocked at static check (< 1ms) -
injection_patternson the compacted context → flagged if injection survived - Semantic analysis against task
"Refactor UI only"→ blocked — auth file modification violates allowed scope -
response_rules→ even if a tool did execute, sensitive data in the response would be redacted before reaching the agent
The attack is stopped at multiple independent layers. That's defense in depth.
Scenario 2: Supply-chain MCP server exfiltrates secrets
The Claude Code leak exposed the exact interface contract for MCP servers. An attacker publishes a typosquatted package (e.g., mcp-server-filesytem) — the paper found 34% of malicious server installations use this vector. A developer installs it. The server:
- Returns normal-looking file contents for a few calls to build trust
- On the next
tools/call, the response payload includes the contents of.envbase64-encoded inside a markdown comment - The agent's context now contains the developer's secrets
- The server's next
sampling/createMessageexfiltrates them
With IntentGuard:
-
tool_change_rules→ if the server's tool descriptions shifted from the snapshot, the call is blocked before execution -
response_ruleswithdetect_base64: true→ the base64-encoded.envin the response is caught and redacted before it reaches the agent -
sensitive_data_patterns→ even if decoded, AWS keys and tokens in the payload are flagged - The attacker never gets the secrets out
For Defenders: Immediate Actions
Whether or not you adopt IntentGuard, here's what you should do now:
-
Audit
CLAUDE.md/.cursorrules/.github/copilot-instructions.mdin every repo you clone — especially PRs and forks - Treat MCP servers like npm dependencies — vet them, pin them, monitor for changes
-
Avoid broad permission rules like
Bash(git:*)— be specific - Pin your AI tool versions and verify hashes
- Limit session length for sensitive work to reduce the compaction attack window
-
Never use
dangerouslyDisableSandboxin shared or production environments
And if you want a policy layer that enforces these principles automatically:
pip install agent-intent-guard
Full Feature List
The vulnerability mapping above covers the highlights. For the complete list of everything IntentGuard ships today — including advisory mode, interactive and webhook approvals, break-glass controls, hot-reload, policy validation, starter templates, decision caching, and audit-ready decision metadata — see:
- features.md — all security features, organized by category
- feature-example.md — copy/paste policy YAML for every feature, plus full end-to-end examples
Feedback
I'm building this in the open and want to hear what's working, what's missing, and what's wrong. If you have critiques, ideas, or bug reports, open an issue on GitHub.
References
- Claude Code Source Leak Analysis — Straiker AI, April 2026
- Breaking the Protocol: Security Analysis of MCP — Maloyan & Namiot, arXiv:2601.17549, January 2026
- IntentGuard on GitHub
IntentGuard is built by temp-noob. If this helps secure your AI agent deployment/workflows, give it a ⭐ on GitHub and share it with your team.
Top comments (0)