DEV Community

temp-noob
temp-noob

Posted on

The Claude Code Leak Changed the Threat Model. Here's How to Defend Your AI Agents.

IntentGuard — a policy enforcement layer for MCP tool calls and AI coding agents


The Leak That Rewrote the Attacker's Playbook

On March 31, 2026, 512,000 lines of Claude Code source were accidentally published via an npm source map. Within hours the code was mirrored across GitHub. What was already extractable from the minified bundle became instantly readable: the compaction pipeline, every bash-security regex, the permission short-circuit logic, and the exact MCP interface contract.

The leak didn't create new vulnerability classes — it collapsed the cost of exploiting them. Attackers no longer need to brute-force prompt injections or reverse-engineer shell validators. They can read the code, study the gaps, and craft payloads that a cooperative model will execute and a reasonable developer will approve.

Three findings from the leak are especially alarming:

  1. Context poisoning via compaction — MCP tool results are never micro-compacted; the auto-compact prompt faithfully preserves "user feedback." A malicious instruction embedded in a cloned repo's CLAUDE.md can survive context compression and become a persistent, trusted directive.
  2. Sandbox bypass via parser differentials — Claude Code's bash permission chain uses three separate parsers with known edge-case divergence. Early-allow validators can short-circuit the entire chain, skipping critical downstream checks.
  3. Supply-chain amplification — The readable source makes crafting convincing malicious MCP servers trivial by revealing the exact interface contract. A concurrent axios supply-chain attack the same day underscores that these threats don't arrive in isolation.

The Academic Evidence: MCP's Architecture Is the Problem

These aren't just theoretical risks. Maloyan & Namiot (arXiv:2601.17549) published the first rigorous security analysis of the MCP specification itself and found that MCP's architectural choices amplify attack success rates by 23–41% compared to equivalent non-MCP integrations.

Their PROTOAMP framework tested 847 attack scenarios across five MCP server implementations and three LLM backends. The results:

Attack Type Baseline (non-MCP) With MCP Amplification
Indirect Injection (Resource) 31.2% 47.8% +16.6%
Tool Response Manipulation 28.4% 52.1% +23.7%
Cross-Server Propagation 19.7% 61.3% +41.6%
Sampling-Based Injection N/A 67.2%
Overall 26.4% 52.8% +26.4%

The paper identifies three protocol-level vulnerabilities:

  1. Least Privilege Violation (§3) — Capability declarations are self-asserted. A malicious server claiming only resources can later invoke sampling/createMessage to inject prompts.
  2. Sampling Without Origin Authentication (§3) — No tested MCP host distinguishes server-injected prompts from user-originated ones. The LLM trusts both identically.
  3. Implicit Trust Propagation (§3) — In multi-server deployments, compromise of one server achieves 78.3% ASR with 5 concurrent servers and a 72.4% cascade rate.

The paper also documents real-world attack vectors for how malicious servers reach users: typosquatting on npm/pip (34%), supply-chain compromise (28%), social engineering via tutorials (23%), and marketplace poisoning (15%).

The bottom line: if you're running AI agents with MCP tool access — Claude Code, Copilot, Cursor, or any custom agent — you are exposed by default.


Introducing IntentGuard

IntentGuard is an open-source Python guardrail layer I built for MCP tool calls and AI coding agents. It enforces static policy checks (fast, deterministic, zero-latency) and optional semantic intent checks (LLM-powered, task-aware) on every tool call — before it touches your filesystem, database, or infrastructure.

┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│  AI Agent    │────▶│   IntentGuard    │────▶│  MCP Server │
│ (Claude,     │     │  ┌────────────┐  │     │ (filesystem, │
│  Copilot,    │◀────│  │ Static     │  │◀────│  git, DB,   │
│  Cursor)     │     │  │ Checks     │  │     │  Slack...)  │
│              │     │  ├────────────┤  │     │             │
│              │     │  │ Semantic   │  │     │             │
│              │     │  │ Analysis   │  │     │             │
│              │     │  ├────────────┤  │     │             │
│              │     │  │ Response   │  │     │             │
│              │     │  │ Inspection │  │     │             │
│              │     │  └────────────┘  │     │             │
└─────────────┘     └──────────────────┘     └─────────────┘
Enter fullscreen mode Exit fullscreen mode

How it plugs in

IntentGuard currently supports two deployment modes, with more planned:

Mode Status How it works
stdio proxy ✅ Shipped Wraps any MCP server command — intercepts every tools/call and response on the wire
Native hook ✅ Shipped Runs behind Claude Code, GitHub Copilot, and Cursor's built-in hook systems via intent-guard evaluate
HTTP proxy 🔜 Planned Network-deployable gateway for teams running MCP over HTTP/SSE — same policy engine, accessible as a service
Docker sidecar 🔜 Planned Containerized proxy for production deployments — drop into any docker-compose.yaml or K8s pod spec

The native hook mode deserves emphasis. Claude Code, Copilot, and Cursor each have their own hook/extension mechanism, but the security logic is always different and fragmented. With IntentGuard, you write one policy.yaml and it works across all three:

# Same command, same policy — whether it's Claude, Copilot, or Cursor calling it
cat | intent-guard evaluate --policy schema/policy.yaml
Enter fullscreen mode Exit fullscreen mode

Hook templates are shipped ready to drop into each tool's config directory (hooks/claude-code/, hooks/copilot/, hooks/cursor/). One policy file governs what every AI agent in your org is allowed to do — which matters for teams that don't want to maintain three separate security configurations.

Bidirectional by design

Most MCP security tools focus exclusively on requests — inspecting what the agent is asking to do. But the Claude Code leak showed exactly why that's not enough. The leaked source reveals how MCP server responses flow directly into the agent's context window, and the Straiker analysis documents how tool results bypass micro-compaction entirely, persisting as trusted content.

IntentGuard inspects both directions. The response_rules engine scans MCP server responses before they reach the agent — detecting secrets, PII, and encoded payloads on the way out. A few other tools (like mcp-firewall and mcpwall) have started adding response scanning, but IntentGuard was designed around bidirectional inspection from the start, with policy-driven actions (block / warn / redact) and Base64 decode on response payloads to catch encoded exfiltration attempts.


How IntentGuard Addresses the Vulnerabilities

✅ Fully Addressed

1. Prompt Injection (Paper §3 — Protocol Specification Analysis)

The paper shows indirect injection through resource content achieves 47.8% ASR via MCP. But the Claude Code leak made this far more practical. The Straiker analysis reveals that the auto-compact prompt instructs the model to "pay special attention to specific user feedback" and preserve "all user messages that are not tool results." Post-compaction, the model is told to "continue without asking the user any further questions." This creates a laundering path: a malicious instruction in a CLAUDE.md gets compacted into the summary as a "user directive," and the model follows it faithfully.

IntentGuard counters this with three layers of defense:

  • injection_patterns — configurable regex patterns catch known injection phrases ("ignore previous instructions", "override.*policy", etc.)
  • decode_arguments — URL decoding, Base64 decoding, and Unicode NFKC normalization run before pattern matching, defeating encoded bypass attempts
  • Semantic intent checks — the LLM-powered analysis layer evaluates whether the tool call's purpose aligns with the declared task context, catching novel injections that bypass pattern matching
static_rules:
  decode_arguments: true
  injection_patterns:
    - "ignore previous instructions"
    - "disregard.*instructions"
    - "override.*policy"
    - "bypass.*security"
Enter fullscreen mode Exit fullscreen mode

2. Tool Poisoning (Paper §4 — Attack Vectors)

The paper demonstrates tool response manipulation achieving 52.1% ASR. The Claude Code leak compounds this: the readable source exposes the exact MCP interface contract, making it trivial to craft convincing malicious MCP servers that look legitimate. The leak also revealed that early-allow validators in the bash permission chain (like validateGitCommit) can short-circuit all downstream security checks — meaning a poisoned tool that mimics a "safe" command shape may bypass the entire validation chain.

IntentGuard blocks poisoned tool usage with:

  • forbidden_tools — blocklist of dangerous tools that are never allowed
  • custom_policies — per-tool argument requirements and forbidden argument checks
  • protected_paths — glob-based path enforcement with traversal-safe normalization (so ../../etc/passwd can't sneak through)
  • Semantic analysis — evaluates whether the tool is appropriate for the stated task
static_rules:
  forbidden_tools: ["delete_database", "exec_shell", "purge_all"]
  protected_paths: [".env", "/etc/*", "src/auth/*", ".ssh/*"]

custom_policies:
  - tool_name: write_file
    args:
      all_present: ["path", "content"]
      should_not_present: ["sudo"]
Enter fullscreen mode Exit fullscreen mode

3. Rug-Pull Attacks (Paper §4 — Attack Vectors)

MCP servers can change tool descriptions, schemas, or capabilities after being initially trusted — the paper calls this implicit trust propagation. IntentGuard's ToolSnapshotStore handles this:

  • On first tools/list, snapshots all tool metadata (name, description, inputSchema) to .intent-guard/tool-snapshots/
  • On every subsequent tools/list, diffs against the snapshot
  • warn mode: logs the drift and continues
  • block mode: blocks the response when any tool definition has changed
tool_change_rules:
  enabled: true
  action: block  # block any tool whose description/schema drifted
Enter fullscreen mode Exit fullscreen mode

4. Data Exfiltration (Paper §4 — Attack Vectors)

The paper shows 42–61% data exfiltration rates via sampling attacks. The Claude Code leak makes this worse in a specific way: the source reveals that MCP tool results are never micro-compacted — they persist in context until auto-compact fires. An attacker-controlled MCP server can return a response containing embedded instructions or exfiltration payloads, and that content sits in the model's context window, trusted and unfiltered, for the entire session.

IntentGuard provides bidirectional protection:

Inbound (request-side):

  • sensitive_data_patterns detect AWS keys, GitHub tokens, emails, SSNs, and custom patterns in tool call arguments before they're sent

Outbound (response-side):

  • response_rules inspect MCP server responses before forwarding to the agent
  • Policy-driven actions: block (suppress), warn (log), or redact (sanitize and forward)
  • Base64 payload detection on responses catches encoded exfiltration
static_rules:
  sensitive_data_patterns:
    - name: "AWS Access Key"
      pattern: "AKIA[0-9A-Z]{16}"
    - name: "GitHub Token"
      pattern: "gh[ps]_[A-Za-z0-9_]{36,}"

response_rules:
  action: redact
  detect_base64: true
  patterns:
    - name: "GitHub Token"
      pattern: "gh[ps]_[A-Za-z0-9_]{36,}"
Enter fullscreen mode Exit fullscreen mode

5. Token/Credential Theft (Paper §3, §4)

Secrets detected in tool call arguments are also redacted from IntentGuard's own decision logs — so sensitive data doesn't leak into your audit trail.

6. Resource Exhaustion / DoS (Paper §6)

Beyond max_tokens_per_call, IntentGuard enforces per-tool sliding-window rate limiting. Each tool can have its own max_calls and window_seconds, so a runaway agent can't hammer write_file 1,000 times in a minute.

static_rules:
  rate_limits:
    enabled: 1
    default:
      max_calls: 60
      window_seconds: 60
    by_tool:
      write_file:
        max_calls: 10
        window_seconds: 60
Enter fullscreen mode Exit fullscreen mode

5-Minute Quick Start

Option A: Native Hook (Claude Code / Copilot / Cursor)

1. Install:

pip install agent-intent-guard
# or from source:
git clone https://github.com/temp-noob/intent-guard && cd intent-guard
pip install -e .
Enter fullscreen mode Exit fullscreen mode

2. Write a policy (or use a starter template from policies/):

# schema/policy.yaml
version: "1.0"
name: "my-project-guard"

static_rules:
  forbidden_tools: ["delete_database", "purge_all"]
  protected_paths: [".env", ".ssh/*", "src/auth/*"]
  max_tokens_per_call: 4000
  decode_arguments: true
  injection_patterns:
    - "ignore previous instructions"
    - "override.*policy"
  sensitive_data_patterns:
    - name: "GitHub Token"
      pattern: "gh[ps]_[A-Za-z0-9_]{36,}"

response_rules:
  action: redact
  detect_base64: true
  patterns:
    - name: "GitHub Token"
      pattern: "gh[ps]_[A-Za-z0-9_]{36,}"

tool_change_rules:
  enabled: true
  action: warn
Enter fullscreen mode Exit fullscreen mode

3. Wire the hook (Claude Code example — .claude/settings.json):

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "*",
        "hooks": [
          {
            "type": "command",
            "command": "cat | intent-guard evaluate --policy schema/policy.yaml"
          }
        ]
      }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Hook templates are also shipped for GitHub Copilot (hooks/copilot/hooks.json) and Cursor (hooks/cursor/hooks.json).

Option B: MCP Proxy Mode

Wrap any MCP server with IntentGuard as a policy-enforcing proxy:

INTENT_GUARD_TASK="Refactor UI only; no auth or database changes" \
python -m intent_guard.proxy \
  --policy schema/policy.yaml \
  --target "npx @modelcontextprotocol/server-filesystem /path/to/repo" \
  --ask-approval
Enter fullscreen mode Exit fullscreen mode

Configure in your MCP client (Claude Desktop, etc.):

{
  "mcpServers": {
    "filesystem": {
      "command": "python",
      "args": [
        "-m", "intent_guard.proxy",
        "--policy", "schema/policy.yaml",
        "--target", "npx @modelcontextprotocol/server-filesystem /path/to/repo",
        "--ask-approval"
      ],
      "env": {
        "INTENT_GUARD_TASK": "Refactor UI only; do not touch auth or database"
      }
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Semantic Analysis: Beyond Pattern Matching

Static checks are fast and deterministic, but attackers will find ways around regex. IntentGuard's semantic layer uses an LLM (local via Ollama or remote via LiteLLM) to evaluate whether a tool call's intent matches the declared task. This is where the guard catches novel attacks that no pattern list can anticipate:

semantic_rules:
  provider: ollama
  guardrail_model: llama3.1:8b
  mode: enforce
  prompt_version: "v2"
  critical_intent_threshold: 0.85

  constraints:
    - intent: modify_source_code
      allowed_scope: "UI components and styles only"
      forbidden_scope: "Auth, database schemas, infrastructure"
Enter fullscreen mode Exit fullscreen mode

Multi-Signal Rubric Scoring (v2)

Instead of asking the LLM for an opaque confidence number, v2 decomposes evaluation into four binary dimensions with configurable weights:

Dimension Question Default Weight
tool_task_alignment Is this tool appropriate for the stated task? 0.25
argument_scope_compliance Are arguments within the allowed scope? 0.30
no_forbidden_scope_violation Do arguments avoid the forbidden scope? 0.30
no_side_effect_risk Is the call free of destructive/exfiltration risk? 0.15

Each dimension returns passed (true/false) and evidence (a short explanation). The final score is computed deterministically: Σ(weight × pass) / Σ(weight). This makes decisions auditable, debuggable, and reproducible — critical for any serious adoption.

Resilience Under Failure

The semantic provider includes retries with exponential backoff + jitter, a circuit breaker, and per-tool fail modes:

semantic_rules:
  provider_fail_mode:
    default: advisory        # fail-open for standard tools
    by_tool:
      delete_database: enforce  # fail-closed for critical tools
      purge_all: enforce
Enter fullscreen mode Exit fullscreen mode

If your Ollama instance goes down, delete_database calls are blocked (fail-closed) while normal file reads continue with a warning (fail-open). No all-or-nothing failure modes.


What Happens Without a Guard?

Here's a concrete scenario the Claude Code leak enables:

  1. A developer clones a PR branch that includes a poisoned CLAUDE.md
  2. The instruction survives context compaction and becomes a trusted directive
  3. The cooperative model proposes: write_file(path="src/auth/config.py", content="...")
  4. The developer sees a reasonable-looking code change and approves it
  5. Auth config is silently modified

With IntentGuard:

  • protected_paths: ["src/auth/*"]blocked at static check (< 1ms)
  • injection_patterns on the compacted context → flagged if injection survived
  • Semantic analysis against task "Refactor UI only"blocked — auth file modification violates allowed scope
  • response_rules → even if a tool did execute, sensitive data in the response would be redacted before reaching the agent

The attack is stopped at multiple independent layers. That's defense in depth.

Scenario 2: Supply-chain MCP server exfiltrates secrets

The Claude Code leak exposed the exact interface contract for MCP servers. An attacker publishes a typosquatted package (e.g., mcp-server-filesytem) — the paper found 34% of malicious server installations use this vector. A developer installs it. The server:

  1. Returns normal-looking file contents for a few calls to build trust
  2. On the next tools/call, the response payload includes the contents of .env base64-encoded inside a markdown comment
  3. The agent's context now contains the developer's secrets
  4. The server's next sampling/createMessage exfiltrates them

With IntentGuard:

  • tool_change_rules → if the server's tool descriptions shifted from the snapshot, the call is blocked before execution
  • response_rules with detect_base64: true → the base64-encoded .env in the response is caught and redacted before it reaches the agent
  • sensitive_data_patterns → even if decoded, AWS keys and tokens in the payload are flagged
  • The attacker never gets the secrets out

For Defenders: Immediate Actions

Whether or not you adopt IntentGuard, here's what you should do now:

  1. Audit CLAUDE.md / .cursorrules / .github/copilot-instructions.md in every repo you clone — especially PRs and forks
  2. Treat MCP servers like npm dependencies — vet them, pin them, monitor for changes
  3. Avoid broad permission rules like Bash(git:*) — be specific
  4. Pin your AI tool versions and verify hashes
  5. Limit session length for sensitive work to reduce the compaction attack window
  6. Never use dangerouslyDisableSandbox in shared or production environments

And if you want a policy layer that enforces these principles automatically:

pip install agent-intent-guard
Enter fullscreen mode Exit fullscreen mode

Full Feature List

The vulnerability mapping above covers the highlights. For the complete list of everything IntentGuard ships today — including advisory mode, interactive and webhook approvals, break-glass controls, hot-reload, policy validation, starter templates, decision caching, and audit-ready decision metadata — see:

  • features.md — all security features, organized by category
  • feature-example.md — copy/paste policy YAML for every feature, plus full end-to-end examples

Feedback

I'm building this in the open and want to hear what's working, what's missing, and what's wrong. If you have critiques, ideas, or bug reports, open an issue on GitHub.


References


IntentGuard is built by temp-noob. If this helps secure your AI agent deployment/workflows, give it a ⭐ on GitHub and share it with your team.

Top comments (0)