DEV Community

Kuldeep Paul

Cutting MCP Tool-Call Token Costs by 50%+ with Code Mode

MCP tool-call token costs grow fast as agents add servers. Code Mode trims token usage 50%+ by letting the model write code, not call tools directly.

For teams running production AI agents, MCP tool-call token costs are often the biggest invisible item on the monthly bill. The Model Context Protocol (MCP) works by loading every tool definition from every connected server into the model's context window before the user's prompt is even read. Connect five servers with thirty tools apiece, and the model is now parsing 150 schemas on every turn regardless of whether those tools get called. Anthropic and Cloudflare both documented a workaround for this, and Bifrost's MCP gateway now ships it natively as Code Mode: the model writes a brief orchestration script, runs it in a sandbox, and stops sending individual tool definitions through its context.

Where MCP Tool-Call Token Costs Actually Come From

The default behavior of most MCP clients is the same across the ecosystem. They pull every tool definition from every connected server into the prompt, and then hand the model a list to pick from. Anthropic's engineering team identified two distinct failure modes in this pattern in their post on code execution with MCP, and both show up fast at production scale.

Problem one is context-window bloat. Every tool ships with a description, a parameter schema, and return-type metadata, all of which sit in the prompt on every single request. An agent wired to thousands of tools will burn hundreds of thousands of tokens before the user's message has even been read.

Problem two is harder to catch until the bill arrives: intermediate results travel through the model between each hop. The canonical example from Anthropic is a Google Drive to Salesforce workflow. The model pulls a meeting transcript out of Drive, then has to write the same transcript back into a Salesforce field. The transcript therefore flows through context twice. For a two-hour sales meeting, that is roughly 50,000 extra tokens consumed on a single run.
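The arithmetic behind that figure can be sketched in a few lines. The numbers below are assumptions for illustration (roughly 4 characters per token, a 100,000-character transcript for a two-hour meeting), not measurements from Anthropic's post:

```python
# Sketch of the double transit: the same payload enters the model's
# context once as a tool result and once as a tool argument.

def tokens(text: str) -> int:
    return len(text) // 4  # crude ~4-chars-per-token estimate

transcript = "x" * 100_000  # ~25,000 tokens for a two-hour meeting

extra = 0
extra += tokens(transcript)  # hop 1: Drive result returns through context
extra += tokens(transcript)  # hop 2: model re-emits it as a Salesforce argument

print(extra)  # 50000 extra tokens on a single run
```

The cost is paid twice because the model is the only channel between the two tools in the classic pattern.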

The standard response, "just trim the tool list," is not actually a solution. It trades capability for cost, which is the wrong tradeoff. Code Mode fixes the root cause rather than patching the symptom.

How Code Mode Works Inside an MCP Gateway

Code Mode is an execution pattern for MCP in which the agent writes a short orchestration script (usually in TypeScript, Python, or a locked-down variant like Starlark) and the gateway runs that script inside a sandbox to coordinate every tool call. Nothing about the full tool catalog ever needs to enter the model's context, and intermediate results stay inside the sandbox unless the script deliberately surfaces them.
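Applied to the Drive-to-Salesforce example, an orchestration script might look like the sketch below. The stub names (`tools.gdrive.get_document`, `tools.salesforce.update_record`) and the fake bindings are hypothetical; real stubs are generated by the gateway from the MCP tool catalog:

```python
from types import SimpleNamespace

# A Code Mode orchestration script the model might emit. The transcript
# is fetched and written entirely inside the sandbox; only the small
# summary dict returns to the model's context.
def run(tools):
    transcript = tools.gdrive.get_document(document_id="meeting-123")
    tools.salesforce.update_record(
        record_id="00Q_example",
        fields={"transcript": transcript},
    )
    return {"status": "ok", "transcript_chars": len(transcript)}

# Minimal fake bindings so the sketch runs without a real gateway.
store = {}
tools = SimpleNamespace(
    gdrive=SimpleNamespace(get_document=lambda document_id: "x" * 100_000),
    salesforce=SimpleNamespace(
        update_record=lambda record_id, fields: store.update(fields)
    ),
)

print(run(tools))  # {'status': 'ok', 'transcript_chars': 100000}
```

The 100,000-character transcript crosses between the two tools without ever becoming prompt tokens.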

The core idea is counterintuitive at first glance but has held up under scrutiny: models are stronger at writing code that calls tools than at picking tools through JSON function-calling syntax. Anthropic and Cloudflare arrived at this conclusion independently, publishing their findings within weeks of each other. Cloudflare's version compresses 2,500+ API endpoints down to just two tools and around 1,000 tokens of surface area. Anthropic demonstrated a 98.7% reduction on the Drive-to-Salesforce scenario, cutting token consumption from 150,000 to 2,000.

Bifrost's implementation follows the same principle with two production-oriented tweaks. Python stubs replace the TypeScript approach because models have seen significantly more Python in training, and a dedicated documentation meta-tool lets the agent pull deeper interface details only when it actually needs them.

Code Mode's MCP Token Cost Reductions, Measured

Once Code Mode is switched on, Bifrost stops pushing tool definitions into the model's context entirely. In their place, the gateway exposes four meta-tools that let the model discover, inspect, and invoke tools on its own schedule:

  • listToolFiles: enumerates the MCP servers and tools that are reachable
  • readToolFile: returns the Python function signatures for a chosen server or tool
  • getToolDocs: fetches the detailed documentation for a specific tool before the model uses it
  • executeToolCode: takes an orchestration script and runs it against live tool bindings inside a sandboxed Starlark interpreter

From the model's perspective, the tool catalog looks like a virtual filesystem of small stub files. It pulls in only the interfaces required for the current task, writes a script that wires those calls together, and the gateway handles sequential execution without ever shuttling intermediate results back through the LLM.
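As an illustration of that virtual filesystem, a readToolFile call on a hypothetical "gdrive" server might return a stub file shaped like this (the actual stub format is produced by the gateway from the server's MCP schema, so treat the signatures as invented):

```python
# Illustrative stub file: Python signatures only, no implementations.
# The model reads these to learn the interface, then writes a script
# against them; the body is filled in by the sandbox's live bindings.

def get_document(document_id: str) -> str:
    """Fetch a document's text content from Google Drive."""
    ...

def list_files(folder_id: str = "root", limit: int = 100) -> list[dict]:
    """List files in a Drive folder."""
    ...
```

A stub like this costs a few dozen tokens, versus the full JSON schema for every tool on every server costing thousands.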

The benchmarks Bifrost ran against this setup show exactly how the advantage compounds as the MCP footprint expands. Running 64 identical queries, once with Code Mode off and once with it on:

  • 96 tools, 6 servers: input tokens drop 58%, estimated cost drops 56%
  • 251 tools, 11 servers: tokens drop 84%, cost drops 83%
  • 508 tools, 16 servers: tokens drop 93%, cost drops 92%

All three rounds kept a 100% pass rate in both configurations, which means none of the savings came from sacrificed accuracy. The complete methodology and raw numbers sit in the Bifrost MCP Gateway benchmark writeup.

Independent evaluations line up with these numbers. A third-party test from AIMultiple using GPT-4.1 against the Bright Data MCP server found a 78.5% input token reduction under the code execution pattern, again with a 100% success rate across 50 task runs.

What Code Mode Delivers Beyond Lower Token Spend

Token reduction is the headline metric, but Code Mode's operational impact extends well past the prompt bill.

  • Faster end-to-end execution: Cloudflare and Anthropic both observed 30 to 40% latency improvements, driven by skipping the back-and-forth through the agent loop. Bifrost's own benchmarks show the same trend: fewer turns, less waiting between them.
  • Round-trip compression: a classic MCP task that takes eight turns collapses into a single executeToolCode call, cutting sequential invocations of the model by 75% or more.
  • Deterministic, auditable runs: the Starlark sandbox blocks imports, file I/O, and any network access outside the whitelisted tool bindings. That makes every execution path repeatable, reviewable, and safe enough to run automatically.
  • Intermediate data stays local: unless the script explicitly logs or returns something, the sandbox keeps it contained. PII, database extracts, and sensitive payloads can now travel between tools without ever crossing the model's context.
  • Savings that compound: classic MCP costs scale directly with the number of connected servers. Code Mode costs are bounded by the interfaces a given task actually touches, so the gap between the two widens as the MCP fleet grows.
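The round-trip compression in the list above can be sketched concretely. The scenario and stub names below (a batch invoice export) are invented for illustration; the point is that the loop runs inside one executeToolCode invocation, where classic MCP would route every intermediate result through the model:

```python
from types import SimpleNamespace

# One executeToolCode script replacing a multi-turn classic MCP chain.
def run(tools):
    rows = tools.db.query(sql="SELECT id FROM invoices WHERE status = 'open'")
    uploaded = []
    for row in rows:  # each iteration would be ~2 classic tool turns
        pdf = tools.billing.render_invoice(invoice_id=row["id"])
        uploaded.append(tools.storage.upload(name=f"{row['id']}.pdf", data=pdf))
    # Only the count crosses back into the model's context; the PDFs never do.
    return {"uploaded": len(uploaded)}

# Minimal fake bindings so the sketch runs without a gateway.
tools = SimpleNamespace(
    db=SimpleNamespace(query=lambda sql: [{"id": i} for i in (1, 2, 3, 4)]),
    billing=SimpleNamespace(render_invoice=lambda invoice_id: b"%PDF"),
    storage=SimpleNamespace(upload=lambda name, data: f"s3://bucket/{name}"),
)

print(run(tools))  # {'uploaded': 4}
```

Four invoices means eight tool hops in the classic pattern; here it is one model turn regardless of batch size.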

Taken together, these gains are why both Cloudflare and Anthropic treat the context-window problem as the most consequential efficiency issue facing production agent infrastructure today.

Choosing Between Code Mode and Classic MCP Tool Calls

For any agent workflow that spans multiple tools and multiple steps, Code Mode should be the starting assumption. The scenarios where the payoff is largest:

  • Agents attached to three or more MCP servers
  • Tasks that chain four or more tool invocations end-to-end
  • Workflows that carry large payloads (transcripts, spreadsheets, database rows) between tools
  • Long-running agent loops where the complete tool list would otherwise reload on each turn
  • CLI agents and editors like Claude Code and Cursor connected through the gateway

Direct tool calls still earn their keep in narrow, single-purpose agents that only touch a handful of tools, or in interactive flows where a human approves each step. Both execution modes run on the same Bifrost gateway and can be configured per MCP client, which means teams can shift workloads over incrementally rather than committing to a wholesale rewrite.

Setting Up Code Mode on Bifrost's MCP Gateway

Turning on Code Mode inside Bifrost is a four-step flow in the dashboard:

  1. Register an MCP client: open the MCP section, add the upstream server (over HTTP, SSE, STDIO, or in-process), and allow Bifrost to catalog its tools.
  2. Flip the Code Mode switch: inside the client settings, enable Code Mode. No schema edits, no redeployment. From that moment on, the model sees only the four meta-tools rather than every tool definition.
  3. Allowlist tools for auto-execution: the three read-only meta-tools run without intervention. executeToolCode becomes auto-executable only when every tool in its generated script is already allowlisted, which keeps write operations behind an approval gate by default.
  4. Scope consumer access with virtual keys: hand out scoped credentials per team, customer, or agent, with each key restricted to the tools it is cleared to call. Scoping happens at the individual tool level, not just at the server level, so the same upstream server can expose filesystem_read to one key while holding filesystem_write back from another.
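The allowlist gate in step 3 amounts to a subset check: executeToolCode runs unattended only if every tool the generated script references is already approved. The sketch below illustrates that logic only; it is not Bifrost's implementation, and the `tools.<server>.<name>(...)` calling convention is an assumption:

```python
import re

# Illustrative allowlist gate: auto-execute only when the script's
# referenced tools are a subset of the approved set.
def can_auto_execute(script: str, allowlisted: set[str]) -> bool:
    # Assumed convention: tools are invoked as tools.<server>.<name>(...)
    referenced = set(re.findall(r"tools\.(\w+\.\w+)\(", script))
    return referenced <= allowlisted

script = (
    "t = tools.gdrive.get_document(document_id='d1')\n"
    "tools.salesforce.update_record(record_id='r1', fields={'t': t})"
)

# The write operation is not allowlisted, so the script needs approval.
print(can_auto_execute(script, {"gdrive.get_document"}))  # False
# With both tools allowlisted, the script can run unattended.
print(can_auto_execute(script, {"gdrive.get_document",
                                "salesforce.update_record"}))  # True
```

This is what keeps write operations behind an approval gate by default: one unapproved call anywhere in the script blocks auto-execution for the whole script.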

Wiring Claude Code, Cursor, or any other MCP-aware client into Bifrost is a one-line change: point the client at the gateway's /mcp endpoint. Every connected MCP server is now reachable through that single URL, governed by whichever virtual key is in use.

Where Efficient Agent Infrastructure Is Heading Next

The trajectory is now obvious. Agents are not calling one or two tools anymore; they are coordinating dozens of systems across dozens of MCP servers, and the naive approach (load every tool definition on every request) stops scaling beyond a handful of integrations. Anthropic, Cloudflare, and the wider MCP community have all landed on the same answer: treat tools as code, let the model author a program, run that program in a sandbox, and let only the final result return to the context window. Public performance benchmarks consistently report 50 to 90%+ input token reductions once this pattern ships in production, and none of them show task accuracy slipping.

In Bifrost, Code Mode is a native primitive of the MCP gateway, sitting alongside tool filtering, per-tool cost attribution, OAuth 2.0 authentication, and immutable audit trails for every tool execution. Token cost, latency, governance, and observability all flow through the same layer, reachable through the same /mcp endpoint, under a single control plane.

Start Cutting MCP Token Costs with Bifrost

MCP tool-call token costs creep up quietly, then start running the agent bill. Code Mode changes the underlying math: fewer schemas hitting context, fewer intermediate round trips to pay for, and a sandbox that holds its cost flat as the tool count climbs into the hundreds. To see how Code Mode performs against your own MCP footprint and what the savings look like at your scale, book a demo with the Bifrost team.
