Every AI agent team eventually hits the same wall: you add more MCP servers to give your agent more capabilities, and suddenly the context window is half-full before the first user message even arrives.
This is not a hypothetical. A typical five-server MCP setup with around 58 tools consumes approximately 55,000 tokens of overhead on every single turn. Add one more server — say, Jira — and you are looking at 100,000+ tokens before the conversation starts. For models with a 200K context limit, that leaves precious little room for actual reasoning.
Anthropic's engineering team published a solution to this problem: Code Execution MCP (CE-MCP). The results are striking — a Google Drive to Salesforce workflow that consumes ~150,000 tokens under the default pattern drops to roughly 2,000 tokens with CE-MCP. That is a 98.7% reduction. Cloudflare went even further with their Code Mode implementation, compressing a 2,500-endpoint API from 1.17 million tokens down to approximately 1,000.
This guide covers what CE-MCP is, how to implement it, where it matters most, and the security trade-offs you need to understand before deploying it.
Why Standard MCP Tool Calling Struggles at Scale
The default MCP architecture is straightforward: an agent connects to one or more MCP servers, and each server exposes a list of tools. When the agent runs, every tool definition from every connected server gets loaded into the context window upfront. The agent then calls tools one at a time, and each tool's full response flows back through the context.
This works fine for small setups. The trouble starts when you connect multiple servers, or when tool responses contain large payloads.
Two independent research groups documented the scale of this problem. A February 2026 arXiv paper ("Model Context Protocol Tool Descriptions Are Smelly!") examined 856 tools across 103 MCP servers and found that tool metadata quality is a primary driver of token inflation. Verbose, redundant, or poorly structured tool descriptions consistently push agents toward worse performance while consuming more context.
A second paper ("From Tool Orchestration to Code Execution: A Study of MCP Design Choices", arXiv 2602.15945) identified the structural root cause: in the standard pattern, every intermediate result from every tool call flows through the model's context window, even when the agent just needs to extract one field from a 50,000-token document.
The Context Economics Problem
The real constraint on AI agents is not raw capability — it is context economics. The question teams are actually asking is: "How many tools can we afford to expose on every turn?" Standard tool calling forces a binary choice: give the agent every tool (expensive) or carefully curate a narrow toolset (limiting).
CE-MCP removes that choice. Agents can access hundreds of tools without paying the upfront token cost for all of them.
What Is CE-MCP?
CE-MCP (Code Execution MCP) is a design pattern where the agent interacts with MCP servers by writing and executing code rather than by calling tools directly. Instead of the agent receiving descriptions of what tools do, it examines a lightweight representation of the available API and then writes a short program to accomplish its goal.
The key architectural difference: intermediate data stays inside the execution environment's memory. It never re-enters the model's context window unless you explicitly log or return it.
Here is the conceptual difference in pseudocode:
Standard tool calling (high token cost):

```
Agent context → load all 58 tool definitions (55K tokens)
Agent calls: read_google_drive_file(id="...")
Context receives: full 50,000-token document
Agent calls: create_salesforce_contact(name=...)
Context receives: API response (2,000 tokens)
Total context overhead: ~107,000 tokens
```
CE-MCP (low token cost):

```
Agent context → load lightweight API reference (~500 tokens)
Agent writes:
    doc = drive.read_file("meeting_id")        # stays in execution env
    sfdc.create_contact(name=doc.attendees[0]) # only result returned
Agent executes the program
Context receives: execution result (~200 tokens)
Total context overhead: ~700 tokens
```
The document itself — potentially 50,000 tokens — never enters the model's context window at all. The agent's code reads it in the execution environment, extracts what it needs, and discards the rest.
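To make this concrete, here is a minimal runnable sketch of the kind of program a CE-MCP agent might generate. The `DriveStub` and `SalesforceStub` classes are invented stand-ins for real MCP-backed modules; a real deployment would proxy these calls to the actual servers:

```python
# Hypothetical stand-ins for MCP-backed modules; real implementations
# would forward these calls to Google Drive and Salesforce MCP servers.
class DriveStub:
    def read_file(self, file_id):
        # A large document lives only in execution-environment memory.
        return {"body": "x" * 50_000, "attendees": ["Ada Lovelace"]}

class SalesforceStub:
    def create_contact(self, name):
        return {"id": "003XYZ", "name": name}

drive, sfdc = DriveStub(), SalesforceStub()

# The agent's generated program: the 50K-character document never leaves
# this process; only the small final result is surfaced to the model.
doc = drive.read_file("meeting_notes")
result = sfdc.create_contact(name=doc["attendees"][0])
print(result)
```

Only `result` re-enters the model's context; `doc` is garbage-collected with the rest of the execution environment.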
How Anthropic's CE-MCP Pattern Works
Anthropic's engineering blog describes a practical implementation of CE-MCP with three core components:
1. MCP servers as code modules
Instead of presenting MCP servers as lists of tools, you expose them as importable modules in the agent's execution environment. The agent sees a compact module reference rather than a full tool schema for every tool.
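One way to sketch this bridging layer, under the assumption that you already have some transport function that can invoke a tool on an MCP server (`fake_transport` below is a placeholder for it):

```python
# Sketch: wrap an MCP server as an importable module so agent code can
# write drive.read_file(id="...") instead of emitting a JSON tool call.
class MCPModule:
    def __init__(self, call_tool):
        # call_tool(tool_name, **kwargs) -> result; the real MCP transport.
        self._call_tool = call_tool

    def __getattr__(self, tool_name):
        # Each attribute access becomes a proxied tool call.
        def proxy(**kwargs):
            return self._call_tool(tool_name, **kwargs)
        return proxy

# Placeholder transport for illustration only:
def fake_transport(tool, **kwargs):
    return {"tool": tool, "args": kwargs}

drive = MCPModule(fake_transport)
print(drive.read_file(id="abc123"))
```

The agent only needs to see a compact reference ("module `drive` exposes `read_file`, `list_files`, ..."), not a full JSON schema per tool.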
2. A code execution environment
The agent has access to a sandboxed runtime where it can write and run Python or JavaScript. This is where intermediate data lives — file contents, API responses, computed values — without ever re-entering the LLM context.
3. Selective context surfacing
Only the final result of the agent's code (or explicit log statements) flows back into the conversation context. The agent's author controls exactly what enters the model's attention.
This pattern handles three classes of efficiency gains that standard tool calling cannot:
- Large intermediate documents: read a file in code, extract two fields, return only those fields
- Multi-step workflows: chain five API calls in one program, return the final state
- Sensitive data isolation: process PII (email addresses, phone numbers) inside the execution environment without it ever appearing in the model's visible context
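The third class can be sketched as follows. The regex and field names are illustrative only, not a production PII scrubber:

```python
import re

# Minimal sketch: scrub PII inside the execution environment so only a
# redacted summary re-enters the model context.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def summarize_without_pii(raw_text):
    count = len(EMAIL.findall(raw_text))
    return {"emails_found": count, "preview": EMAIL.sub("[email]", raw_text)[:60]}

doc = "Contact ada@example.com or grace@example.org for details."
print(summarize_without_pii(doc))
```

The raw addresses are processed in code, but the model only ever sees the redacted summary.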
Cloudflare's Code Mode: CE-MCP at Industrial Scale
The clearest demonstration of CE-MCP's scalability is Cloudflare's Code Mode, launched in April 2026. Cloudflare's own API has over 2,500 endpoints. Under the standard MCP pattern, loading the full API spec would consume roughly 1.17 million tokens — more than most models' entire context windows.
Code Mode collapses this to approximately 1,000 tokens by exposing just two tools: search() and execute().
- `search()` queries the API spec by product area, path, or metadata. The full spec never enters the model's context — only the search results do.
- `execute()` runs JavaScript against the API inside a secure V8 isolate. The model's code handles pagination, conditional logic, and chained API calls in a single execution cycle.
The result: a fixed ~1,000-token footprint regardless of the size of the underlying API surface. Per Cloudflare's published figures, this represents a 99.9% reduction compared to loading the full spec, and approximately an 81% reduction compared to typical direct tool-call patterns for real agent workflows.
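The two-tool facade idea is easy to sketch. The spec entries and matching logic below are simplified stand-ins (Cloudflare's real implementation searches a full OpenAPI spec and runs code in V8 isolates):

```python
# Simplified sketch of a search() facade over a large API spec.
# Only matching entries ever enter the model's context.
API_SPEC = {
    "/zones": "List zones for an account",
    "/zones/{id}/dns_records": "List DNS records for a zone",
    "/workers/scripts": "List Workers scripts",
}

def search(query):
    q = query.lower()
    return {path: desc for path, desc in API_SPEC.items() if q in desc.lower()}

# execute() would run agent-written code against these endpoints inside
# an isolate; it is omitted here because safe execution is the hard part.
print(search("dns"))
```

However large `API_SPEC` grows, the model's per-turn footprint stays fixed at the size of the two tool definitions plus whatever `search()` returns.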
Cloudflare open-sourced the underlying Cloudflare Agents SDK so developers can apply the same pattern to their own MCP servers. For teams building agents that interact with large, feature-rich platforms, Code Mode establishes what efficient MCP architecture looks like in practice.
| Pattern | Token overhead | Intermediate data | Security surface |
|---|---|---|---|
| Direct tool calling (5 servers) | ~55K tokens/turn | Flows through LLM context | Limited to tool schemas |
| Direct tool calling (large API) | Up to 1.17M tokens | Flows through LLM context | Limited to tool schemas |
| CE-MCP / Code Mode | ~1K–2K tokens | Stays in execution env | Full code execution |
When CE-MCP Makes the Biggest Difference
Not every agent workflow benefits equally from CE-MCP. The pattern is most impactful when:
Large intermediate payloads are the bottleneck. If your agent reads a 200KB document to extract a single field, CE-MCP eliminates the document from the context entirely. Standard tool calling forces the full document through the LLM.
Multi-step workflows involve data chaining. When step 3 depends on output from steps 1 and 2, standard tool calling keeps all intermediate outputs in context. CE-MCP keeps them in the execution environment's memory.
You need PII or sensitive data isolation. CE-MCP lets sensitive data flow through your code without entering the model's visible context — relevant for compliance use cases where you do not want PII in model prompts.
You connect to large APIs or many servers. The gains from a fixed footprint like Cloudflare's ~1K-token Code Mode only materialize when the underlying API surface is large. For a server with three tools, the gains are modest.
CE-MCP is not the right choice when:
- Your agent calls only a handful of tools with small responses
- You need the model to reason about the raw content of intermediate results (not just the extracted fields)
- Your execution environment cannot be sufficiently isolated (see security section below)
The Security Trade-Off You Cannot Ignore
CE-MCP's power comes with a genuine security cost. Giving an agent a code execution environment means the agent can now run arbitrary code. This fundamentally expands the attack surface.
In April 2026, OX Security published a critical advisory covering exactly this class of vulnerability. CVE-2026-30623 involves command injection via MCP's stdio transport: StdioServerParameters passes unsanitized command strings to the execution environment. The vulnerability affects 7,000+ publicly accessible MCP servers and more than 150 million downloads across Python, TypeScript, Java, and Rust SDKs. Downstream CVEs were found in LiteLLM, LangChain, LangFlow, Flowise, LettaAI, and LangBot.
Anthropic declined to modify the protocol architecture, describing the behavior as "expected."
This is not a reason to avoid CE-MCP — but it is a reason to be deliberate about isolation. The practical recommendations from security researchers:
Sandbox the execution environment. Cloudflare's implementation runs code in V8 isolates with no access to host file system, network interfaces, or environment variables. If your sandbox is weaker than this, your attack surface is proportionally larger.
Validate all inputs before they reach the execution environment. Prompt injection attacks targeting CE-MCP can instruct the agent to execute adversarial code. Input validation and output filtering reduce the risk.
Treat the execution environment as untrusted. Credentials and secrets should not be available inside the execution environment. Pass API tokens via secure side channels, not as variables the agent can inspect.
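One way to honor this rule is a trusted proxy layer that attaches credentials outside the sandbox. `fake_upstream` and the function names below are hypothetical; the point is where the token lives:

```python
# Sketch: inject credentials in trusted host code, not in the
# agent-visible execution environment.
def make_proxy(call_upstream, secret_token):
    def proxied_call(endpoint, payload):
        # The token is attached here, in host code. Agent code only ever
        # receives `proxied_call`; it cannot read the token itself.
        headers = {"Authorization": f"Bearer {secret_token}"}
        return call_upstream(endpoint, payload, headers)
    return proxied_call

# Placeholder transport for illustration:
def fake_upstream(endpoint, payload, headers):
    return {"endpoint": endpoint, "auth": "Authorization" in headers}

api = make_proxy(fake_upstream, secret_token="s3cr3t")
print(api("/contacts", {"name": "Ada"}))
```

The sandbox gets a capability (the function), never the secret behind it.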
Monitor for unexpected command patterns. CE-MCP agents that suddenly start executing shell commands, reading environment files, or making outbound network requests to unknown hosts are showing signs of compromise.
The arXiv paper on CE-MCP design choices (2602.15945) notes that the security trade-off is the primary reason CE-MCP adoption has been slower than its token efficiency gains would predict. Teams that invest in proper sandboxing see the efficiency benefits. Teams that deploy code execution in a loosely sandboxed environment take on real risk.
Common Mistakes When Implementing CE-MCP
Returning intermediate results to the model context anyway. The whole point of CE-MCP is that intermediate data stays in the execution environment. If your code logs every intermediate value back to the model, you will not see the token savings.
Using CE-MCP for simple single-tool calls. The overhead of code generation and execution is not worth it when the agent just needs to call one tool with a small response. CE-MCP is a tool for complex, multi-step, high-payload workflows.
Assuming the execution environment is the same as the model's context. The agent writes code that runs in an external sandbox. That sandbox has its own state, its own memory, and its own lifecycle. Variables the agent sets in code do not persist between execution calls unless you explicitly persist them.
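A sketch of explicit persistence, with `store` standing in for whatever durable key-value layer your runtime provides (the two functions model two separate execution calls):

```python
# Execution calls are stateless: locals from one run are gone by the
# next. Anything needed later must be explicitly persisted.
store = {}  # stand-in for a durable KV store shared across executions

def run_step_one():
    doc_summary = {"attendees": ["Ada Lovelace"]}
    store["summary"] = doc_summary  # explicit persistence

def run_step_two():
    # A fresh execution: only what was explicitly stored survives.
    return store["summary"]["attendees"][0]

run_step_one()
print(run_step_two())
```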
Not auditing what code gets generated. In production CE-MCP deployments, it is worth logging what code the agent actually generates and executes. Unexpected patterns — especially file system access or network calls — are early indicators of prompt injection or misaligned agent behavior.
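A minimal audit gate might log generated code and flag suspicious patterns before execution. The deny-list below is illustrative and is a monitoring aid, not a substitute for real sandboxing:

```python
import re

# Flag agent-generated code that touches things CE-MCP code normally
# should not: environment variables, shells, raw sockets, file I/O.
SUSPICIOUS = [r"\bos\.environ\b", r"\bsubprocess\b", r"\bsocket\b", r"\bopen\("]

def audit(code):
    hits = [p for p in SUSPICIOUS if re.search(p, code)]
    return {"allowed": not hits, "flags": hits}

print(audit("result = sfdc.create_contact(name='Ada')"))
print(audit("import subprocess; subprocess.run(['cat', '/etc/passwd'])"))
```

In practice you would log every generated program verbatim and alert on flagged ones rather than silently blocking them.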
Treating CE-MCP as a replacement for good tool design. CE-MCP does not fix poorly designed MCP servers. If your tools have verbose, redundant descriptions, you should clean those up regardless. The arXiv paper on tool description quality (2602.14878) found that description quality affects agent performance independent of whether CE-MCP is used.
FAQ
Q: How much does CE-MCP actually reduce token costs in practice?
The published figures range widely. Anthropic's engineering blog cites 98.7% for a specific Google Drive to Salesforce workflow. Cloudflare reports 99.9% for their full API spec (a somewhat extreme case) and 81% for typical agent workflows. A reasonable expectation for a well-implemented CE-MCP setup handling multi-step, large-payload workflows is 70–95% reduction. Simple workflows with small payloads may see 20–40% reduction.
Q: Does CE-MCP work with all LLMs, or only Claude?
The pattern is not model-specific. Any model capable of writing and debugging code can use CE-MCP. In practice, models with stronger coding capabilities (Claude 3.5+, GPT-4o, Gemini 2.5 Pro) produce more reliable CE-MCP code with fewer errors. Weaker models may generate code with bugs or require more fallback handling.
Q: What is the minimum sandboxing I need for CE-MCP to be safe?
At minimum: no access to the host file system, no outbound network access from within the execution environment (except to explicitly allow-listed API endpoints), and no access to host environment variables or secrets. Cloudflare's V8 isolate approach is a reference implementation. For Python-based execution environments, look at RestrictedPython or Docker-based sandboxing with network policies.
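For intuition only, here is a restricted `exec` with a tiny builtins allow-list. This is NOT real isolation (pure in-process Python sandboxing is famously escapable); production systems need process- or VM-level isolation such as V8 isolates or locked-down containers:

```python
# Illustrative only: restricted exec with an allow-listed builtins dict.
# Do not rely on this as a security boundary in production.
ALLOWED_BUILTINS = {"len": len, "range": range, "min": min, "max": max, "sum": sum}

def run_restricted(code, inputs):
    env = {"__builtins__": ALLOWED_BUILTINS, **inputs}
    exec(code, env)  # open(), __import__(), etc. raise NameError here
    return env.get("result")

print(run_restricted("result = sum(values) / len(values)", {"values": [2, 4, 6]}))
```

The same shape, with the execution moved into a separate jailed process or isolate, is the baseline the article describes.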
Q: Should I retrofit existing agents with CE-MCP?
Start with your most token-heavy workflows — those with large intermediate payloads or many sequential tool calls. These are where CE-MCP's return on implementation effort is highest. For simple agents that call a few tools with compact responses, the refactor may not be worth it.
Q: Is CE-MCP the same as giving the agent a REPL?
Conceptually similar, but CE-MCP typically uses a more structured execution model. The agent writes a complete program (not interactive REPL commands), which executes in a sandboxed environment. Some implementations (like Cloudflare's Code Mode) add API-aware execution that handles authentication and rate limiting transparently — things a bare REPL would not do.
Key Takeaways
CE-MCP is the pattern that turns context economics from an agent capability ceiling into a solved problem. The core insight is simple: intermediate data does not need to enter the model's context window just because the agent needs to process it.
The measurable gains are real. Anthropic's benchmark shows 98.7% token reduction on a real enterprise workflow. Cloudflare's production implementation handles a 2,500-endpoint API in 1,000 tokens. The arXiv paper (2602.15945) provides the formal design analysis behind these numbers.
What you need to be clear-eyed about is the security trade-off. Adding code execution to your agent's toolkit meaningfully expands the attack surface. CVE-2026-30623 and the OX Security advisory show what happens when that surface is not properly isolated. Cloudflare's V8 isolate approach and the input validation recommendations from security researchers are not optional hardening — they are the baseline for safe deployment.
The practical path forward: identify your token-heaviest workflows, implement CE-MCP with proper sandboxing for those specific cases, audit what code gets generated, and expand coverage incrementally as your security posture firms up.
Bottom Line
CE-MCP is the most impactful structural change available for agents that hit context limits — but it demands real sandboxing work before deployment. Start with your highest-payload workflows, implement Cloudflare-grade isolation, and treat the execution environment as adversarial input territory by default.