DEV Community

Kang

What Anthropic's Claude Code Leak Teaches Us About AI Agent Security

On March 31, 2026, Anthropic shipped a source map file inside the @anthropic-ai/claude-code npm package (v2.1.88). That .map file contained the full original TypeScript source, more than 512,000 lines of it. Security researcher Chaofan Shou spotted it, and the code was quickly reconstructed and published.

The leak itself isn't the interesting part. Source maps in npm packages happen all the time. What's interesting is what the code reveals about how AI agents are built — and where the real security gaps are.

I spent a few days reading through the reconstructed source. Here are three things that stood out.


1. "Undercover Mode" — Guarding the Front Door, Shipping the Back

Anthropic built an entire subsystem called "undercover mode" into Claude Code. Its job: prevent the LLM from revealing internal system prompts, tool definitions, and operational details during conversations. If you asked Claude Code how it worked internally, undercover mode would kick in and deflect.

They were worried about prompt extraction attacks. Fair enough — that's a real threat. But while they were building walls around what the AI could say, their build pipeline was packaging the entire source into a .map file and shipping it to npm.

The source map format is straightforward. Here's what a .map file looks like:

```json
{
  "version": 3,
  "sources": ["../src/tools/file-reader.ts", "../src/tools/shell.ts", "..."],
  "sourcesContent": ["// full original source code here", "..."],
  "mappings": "AAAA,SAAS..."
}
```

The sourcesContent array holds the complete, unminified source. Every file. Every comment. Every internal string.
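To see just how much a shipped .map file exposes, here's a minimal Python sketch (my own illustration, not code from the leak) that rebuilds the original files from a source map's sources and sourcesContent arrays:

```python
import json
from pathlib import Path

def extract_sources(map_path: str, out_dir: str) -> list[str]:
    """Reconstruct original source files from a .map file's sourcesContent."""
    source_map = json.loads(Path(map_path).read_text())
    sources = source_map.get("sources", [])
    contents = source_map.get("sourcesContent") or []
    written = []
    for rel_path, content in zip(sources, contents):
        if content is None:
            continue  # some bundlers omit content for individual files
        # Strip "../" segments so reconstructed paths stay inside out_dir
        safe_rel = rel_path.replace("../", "")
        target = Path(out_dir) / safe_rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        written.append(str(target))
    return written
```

Point this at any published package's .map files and you get the unminified tree back, directory structure included. That's all the "reconstruction" of the leak really required.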

The irony is hard to miss. They invested engineering time into making sure their AI wouldn't leak secrets in conversation. Meanwhile, npm publish did it for them.

The lesson: supply chain security matters as much as prompt security. You can build the most sophisticated prompt injection defense in the world, but if your CI/CD pipeline ships source maps, .env files, or internal configs, none of that matters. Check your .npmignore. Check your build artifacts. Run npm pack --dry-run before every publish.
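That dry-run check is easy to automate in CI. Here's a sketch that parses the output of npm pack --dry-run --json and fails the build if anything sensitive would be packed (the suffix blocklist and exit behavior are my choices, adjust to taste):

```python
import json
import subprocess
import sys

# File suffixes that should never reach the registry
DISALLOWED_SUFFIXES = (".map", ".env", ".pem", ".key")

def find_leaky_files(pack_report: list[dict]) -> list[str]:
    """Return packed file paths that match the blocklist.

    `pack_report` is the parsed output of `npm pack --dry-run --json`:
    a list with one entry per package, each holding a "files" array.
    """
    offenders = []
    for package in pack_report:
        for entry in package.get("files", []):
            path = entry.get("path", "")
            if path.endswith(DISALLOWED_SUFFIXES):
                offenders.append(path)
    return offenders

if __name__ == "__main__":
    result = subprocess.run(
        ["npm", "pack", "--dry-run", "--json"],
        capture_output=True, text=True, check=True,
    )
    leaks = find_leaky_files(json.loads(result.stdout))
    if leaks:
        print("Refusing to publish, found:", *leaks, sep="\n  ")
        sys.exit(1)
```

Wire it into a prepublishOnly script and the Claude Code mistake becomes a failed build instead of a headline.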


2. 43+ Tools With OS-Level Access

The leaked code defines 43+ tool functions. These aren't sandboxed API calls. They include:

  • File system access — read, write, list, search across the entire file system
  • Shell execution — run arbitrary commands with the user's permissions
  • Network access — make HTTP requests, interact with APIs
  • Git operations — commit, push, manage repositories
  • Browser control — navigate, click, extract page content

Here's a simplified version of what a tool definition looks like in the source:

```javascript
{
  name: "shell",
  description: "Execute a shell command on the user's machine",
  parameters: {
    command: { type: "string", description: "The command to run" },
    workdir: { type: "string", description: "Working directory" }
  }
}
```

This is the exact attack surface that MCP tool poisoning targets. In an MCP setup, tool descriptions are passed to the LLM as part of the context. If an attacker can inject instructions into a tool description — via a compromised MCP server, a malicious package, or a poisoned tool registry — the LLM might follow those injected instructions using any of the 43+ tools available to it.

Think about that. An injected instruction in one tool description could tell the model to use the shell tool to exfiltrate data. Or the file_write tool to drop a payload. The model doesn't distinguish between legitimate tool descriptions and injected ones — they're all just text in the context window.

This isn't theoretical. Research from Invariant Labs has demonstrated working MCP tool poisoning attacks. The more tools an agent has, the larger the blast radius.

This is why scanning MCP tool descriptions before they reach your LLM matters. Tools like ClawGuard can intercept and audit tool definitions at the MCP layer, catching poisoned descriptions before they enter the context window.
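A toy version of such a scan might look like the following. To be clear, this is a crude keyword heuristic of my own, not how ClawGuard or any production scanner works, but it illustrates the idea: flag descriptions that try to direct the model rather than describe the tool.

```python
import re

# Phrases that address the model's behavior rather than describe the tool.
# A deliberately simple heuristic; real scanners use much richer signals.
SUSPICIOUS_PATTERNS = [
    r"\bignore (all |any )?(previous|prior) instructions\b",
    r"\bdo not (tell|inform|mention to) the user\b",
    r"\byou must (first |always )?(call|use|run)\b",
    r"<\s*(system|important|instructions?)\s*>",
    r"\bbefore (using|calling) (this|any other) tool\b",
]

def scan_tool_description(description: str) -> list[str]:
    """Return the suspicious patterns matched by a tool description."""
    lowered = description.lower()
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, lowered)]
```

Run this over every tool description an MCP server hands you before it reaches the context window; any hit is worth a human look.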


3. KAIROS — What Happens When a Proactive Agent Gets Compromised?

The most interesting find in the leaked source is a system called "KAIROS" — an always-on, proactive agent mode. Instead of waiting for user input, KAIROS watches file changes, terminal output, and system events, then acts on them autonomously.

Traditional AI coding assistants follow a request-response pattern. You ask, it does. If it gets hit with a prompt injection, the damage is limited to that single interaction. You see the output, you catch the problem, you stop.

A proactive agent changes the threat model completely. If KAIROS gets compromised via prompt injection — say, from a malicious file it reads during monitoring — it doesn't wait for you to type something. It acts. It might modify files, run commands, or make network requests before you even know something went wrong.

The attack window for a reactive agent is one turn. The attack window for a proactive agent is continuous.

This doesn't mean proactive agents are a bad idea. They're probably the future of developer tools. But they need a different security model:

  • Continuous monitoring of agent actions, not just input validation
  • Anomaly detection — flag when agent behavior deviates from expected patterns
  • Kill switches — immediate shutdown when suspicious activity is detected
  • Audit logs — complete records of every action taken without user initiation

We don't have great tooling for this yet. It's an open problem.
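Still, the building blocks in the list above can be sketched today. Here's a toy action guard that wraps every autonomous tool call with an audit log, a rate-based anomaly check, and a kill switch. The threshold and the trip logic are my assumptions for illustration, not anything from KAIROS:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ActionGuard:
    """Wraps autonomous agent actions with auditing and a kill switch."""
    max_actions_per_minute: int = 10
    killed: bool = False
    audit_log: list = field(default_factory=list)
    _recent: list = field(default_factory=list)

    def execute(self, tool_name: str, args: dict, handler):
        if self.killed:
            raise RuntimeError("agent is shut down")
        now = time.monotonic()
        self._recent = [t for t in self._recent if now - t < 60]
        if len(self._recent) >= self.max_actions_per_minute:
            # Crude anomaly signal: a burst of autonomous actions trips the switch
            self.killed = True
            raise RuntimeError("rate limit exceeded, kill switch tripped")
        self._recent.append(now)
        self.audit_log.append({"time": now, "tool": tool_name, "args": args})
        return handler(**args)
```

Every action that runs leaves an audit record, and a burst of unexpected activity halts the agent instead of letting it keep going unattended.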


What You Can Do Today

Check your npm packages for source maps. Large .map files in production packages are both a security risk and a waste of bandwidth:

```shell
find node_modules -name "*.map" -size +1M
```

On Windows:

```powershell
Get-ChildItem -Path node_modules -Recurse -Filter "*.map" | Where-Object { $_.Length -gt 1MB }
```

If you're a package author, add *.map to your .npmignore unless you specifically need them in production.
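A stricter option than .npmignore is the files field in package.json, which acts as a whitelist: only matching paths get packed, so new build artifacts are excluded by default. A minimal example (the package name is a placeholder):

```json
{
  "name": "my-package",
  "files": [
    "dist/**/*.js",
    "dist/**/*.d.ts"
  ]
}
```

With a whitelist, forgetting to list .map files means they're left out rather than shipped, which is the failure mode you want.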

Scan MCP tool descriptions. If you're using MCP servers — especially third-party ones — inspect the tool descriptions they serve. Look for hidden instructions, unusual formatting, or text that looks like it's trying to direct the model's behavior rather than describe the tool.

Audit your agent's tool access. Know exactly what tools your AI agent can use and what permissions they have. If your agent has shell access, it can do anything your user account can do. Treat it accordingly.


The Gap

This leak is embarrassing for Anthropic, but it's educational for everyone building with AI agents.

There's a gap between AI-level security and software-level security. The AI security community spends a lot of time on prompt injection, jailbreaks, and alignment. Important work. But the Claude Code leak happened because of a missing line in .npmignore — a problem we solved in the Node.js ecosystem a decade ago.

AI agents inherit all the security problems of traditional software (dependency management, build pipelines, supply chain attacks) and add new ones on top (prompt injection, tool poisoning, autonomous action). You need both layers.

The 512,000 lines of leaked TypeScript will be picked apart for months. But the biggest takeaway is simple: if you're building AI agents, don't forget that they're also just software. And software security basics still apply.


Sources: Chaofan Shou's discovery, Kuberwastaken/claurst reconstruction, Invariant Labs MCP research
