DEV Community

Kuldeep Paul

MCP Token Costs at Scale: How Code Mode Drives a 92% Reduction

Scaling MCP tools inflates context windows fast. See how Bifrost Code Mode cuts MCP token costs by up to 92.8% at 500+ tools in verified benchmarks.

For teams running production agents, MCP token costs are the infrastructure line item that nobody sees coming. An agent that behaved fine with three connected Model Context Protocol servers turns into a four-figure daily bill once the connected server count passes twenty and the tool catalog crosses a few hundred entries. Neither the model nor the prompt is responsible. The root cause is the default MCP execution path, which ships every tool definition from every connected server into the model's context on every request. Bifrost, the open-source AI gateway from Maxim AI, tackles the problem at its source with a Code Mode execution path that, in controlled benchmarks, dropped input tokens by 92.8% at 508 tools without sacrificing any accuracy.

What follows is a breakdown of why MCP token costs scale so sharply, why teams across the industry are coalescing around code execution as the answer, and what the Bifrost benchmark tells us about the true cost of running agents in production.

What Makes MCP Token Costs Blow Up at Scale

Model Context Protocol, originally introduced by Anthropic, is the standard that defines how AI applications wire up to external tools. By default, MCP clients serialize every tool definition from every connected server into the model's context on every single turn. Anthropic's own engineering team laid out the consequence in their writeup on code execution with MCP: as connected tool counts grow, upfront tool loading combined with intermediate-result round-trips drives up latency and pushes cost in the wrong direction.

The math is brutal. With five MCP servers at thirty tools each, the model takes in 150 tool schemas before it ever sees the actual user request. Multi-turn workflows make the situation worse because large intermediate payloads (documents, datasets, API responses) loop back through the context more than once. A single cross-system workflow that pulls a transcript from Google Drive and writes it to Salesforce, for instance, can drop from around 150,000 tokens under the default flow to roughly 2,000 tokens when the same job is expressed as code execution, roughly a 98.7% savings.
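The scaling behavior above can be sketched with back-of-envelope numbers. The per-schema and fixed meta-tool token counts below are illustrative assumptions, not measured Bifrost values; the point is the shape of the two curves, not the exact figures.

```python
# Back-of-envelope model of per-request context overhead.
# ASSUMPTIONS: ~500 tokens per serialized tool schema and a ~1,000-token
# fixed footprint for code-mode meta-tools -- illustrative, not measured.

TOKENS_PER_SCHEMA = 500
CODE_MODE_FIXED = 1_000

def default_overhead(servers: int, tools_per_server: int) -> int:
    """Default MCP path: every tool schema is injected on every turn."""
    return servers * tools_per_server * TOKENS_PER_SCHEMA

def code_mode_overhead(stubs_read: int) -> int:
    """Code mode: fixed meta-tool cost plus only the stubs actually read."""
    return CODE_MODE_FIXED + stubs_read * TOKENS_PER_SCHEMA

# Five servers with thirty tools each: 150 schemas before the model
# ever sees the user request.
print(default_overhead(5, 30))   # 75000 tokens per turn
print(code_mode_overhead(3))     # 2500 tokens for a three-tool task
```

The default curve grows with the catalog; the code-mode curve grows only with the handful of stubs a given task touches, which is why the gap widens as servers are added.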

Four distinct forces turn MCP adoption into a climbing cost curve:

  • Entire tool catalog sent on every turn: all tool schemas are serialized into the prompt regardless of whether the model will actually invoke them
  • Round-tripped intermediate outputs: tool results travel back through context before being fed into the next tool call
  • Repeated catalog loads across agent loops: each new turn re-injects the full tool list into context
  • Cost scaling with server fanout: each new MCP server adds roughly linear integration work but exacts a proportional token tax on every request

The obvious workaround, pruning the tool list, is not actually a fix. It swaps capability for cost and calls the tradeoff an optimization.

Code Execution Is Becoming the Default Pattern for MCP

Over the past several months, a cleaner pattern has emerged across AI infrastructure teams. Rather than presenting tools to the model as a flat set of function-call schemas, this pattern exposes them as a typed API and asks the model to write a compact program that orchestrates the calls. Documentation is pulled on demand, logic runs locally, and only the final result is handed back.

Cloudflare took this public first with Code Mode. In the Code Mode announcement, they reported that rendering an MCP server as a TypeScript API and asking the model to write code against it produced roughly an 81% drop in token usage versus direct tool calling. A follow-up implementation went even further: the Cloudflare MCP server now fronts the full Cloudflare API, more than 2,500 endpoints spanning DNS, Workers, R2, and Zero Trust, behind two meta-tools (search() and execute()) that together consume about 1,000 tokens regardless of catalog size. Wrapping the same surface as a flat tool list would blow past a million tokens, which is larger than the context window of most foundation models.
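The two-meta-tool shape can be sketched in a few lines. This is an illustration of the pattern, not Cloudflare's implementation: the catalog entries and endpoint names here are hypothetical, and the real system backs `execute()` with live API calls rather than lambdas.

```python
# Illustrative sketch of the two-meta-tool pattern: however large the
# underlying catalog grows, the model only ever sees search() and
# execute(). Endpoint names below are hypothetical.

CATALOG = {
    "dns.records.list": lambda zone: f"records for {zone}",
    "workers.scripts.list": lambda account: f"scripts for {account}",
    # ...imagine 2,500+ entries; none are serialized into the prompt.
}

def search(query: str) -> list[str]:
    """Return matching endpoint names instead of full schemas."""
    return [name for name in CATALOG if query in name]

def execute(name: str, *args):
    """Invoke a single endpoint by name."""
    return CATALOG[name](*args)

print(search("dns"))                               # ['dns.records.list']
print(execute("dns.records.list", "example.com"))  # records for example.com
```

Because the prompt only ever carries the two meta-tool definitions, the context footprint stays constant no matter how many endpoints sit behind them.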

Anthropic's engineering team arrived at the same design independently, describing it as a way to give agents more tools while spending fewer tokens. The pattern now has two common names in the wild: code execution with MCP and Code Mode. It has three defining properties:

  • MCP tools are presented to the model as a filesystem of typed API stubs rather than as a flat tool list
  • The model only loads the stubs it needs for the current task
  • The model writes a short script that executes in a sandbox, invoking tools directly and handing back only the final output

Bifrost Code Mode is this pattern, implemented at the gateway layer, inside the same control plane that already runs routing, access control, and observability. Teams evaluating how it fits into a broader architecture can skim the Bifrost MCP gateway resource page for the complete feature surface.

How Bifrost Code Mode Works: Python Stubs, Meta-Tools, and a Starlark Sandbox

Inside Bifrost, connected MCP servers are rendered as a virtual filesystem of lightweight Python stub files. Python was chosen over JavaScript deliberately. Large language models have encountered far more real-world Python than any other language during training, which translates into higher first-pass success rates on generated orchestration scripts. A dedicated documentation tool trims the footprint further, letting the model pull docstrings for a specific tool only at the moment it is about to call it.

Four meta-tools give the model everything it needs:

  • listToolFiles: surface the available servers and tools
  • readToolFile: fetch the Python function signatures for a given server or tool
  • getToolDocs: pull detailed documentation for a specific tool before invoking it
  • executeToolCode: run the orchestration script against the live tool bindings
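A toy in-memory version makes the flow concrete. Everything here is an illustrative assumption: the stub contents, server names, and the use of Python's `exec()` standing in for the real Starlark sandbox. It only shows how the four meta-tools fit together.

```python
# Toy sketch of the four meta-tools. Stub contents, server names, and
# exec() (in place of the real Starlark sandbox) are all illustrative.

STUBS = {"crm/get_contact.py": "def get_contact(email: str) -> dict: ..."}
DOCS = {"crm/get_contact.py": "Look up a CRM contact by email address."}
BINDINGS = {"get_contact": lambda email: {"email": email, "name": "Ada"}}

def listToolFiles() -> list[str]:
    return sorted(STUBS)            # discover available servers and tools

def readToolFile(path: str) -> str:
    return STUBS[path]              # signatures only, not full schemas

def getToolDocs(path: str) -> str:
    return DOCS[path]               # pulled just before invocation

def executeToolCode(script: str) -> dict:
    scope = dict(BINDINGS)          # real Bifrost runs this in Starlark
    exec(script, scope)
    return scope["result"]

# A model-written orchestration script hands back only the final output:
out = executeToolCode('result = get_contact("ada@example.com")')
print(out)   # {'email': 'ada@example.com', 'name': 'Ada'}
```

The model pays tokens for the stub it reads and the one-line script it writes; the intermediate machinery never enters the context window.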

Generated code is executed inside a Starlark interpreter sandbox that blocks imports, file I/O, and network access. That restriction keeps runs deterministic, fast, and safe to trigger automatically inside an agent loop. Platform teams can choose server-level stubs for compact discovery or tool-level stubs for finer permission control. Because tool scoping is enforced through virtual keys, a model that lacks permission to call a tool never sees that tool's definition in the first place. The broader governance picture, including MCP Tool Groups and per-tool cost accounting, is documented in the Bifrost engineering deep-dive on MCP access control, cost governance, and 92% lower token costs at scale.

The Numbers: Three Benchmark Rounds Across 96, 251, and 508 Tools

The savings were measured with three controlled rounds, flipping Code Mode on and off while scaling the tool count between rounds. The same query set was used against the same models in every configuration, and pass rate was tracked throughout to confirm the reduction did not come at the expense of accuracy. The full methodology sits alongside additional performance data on the Bifrost benchmarks resource page.

Headline outcomes:

  • Round 1, 96 tools across 6 servers: input tokens moved from 19.9M to 8.3M (−58.2%); estimated cost dropped from $104.04 to $46.06 (−55.7%); pass rate stayed at 100% in both configurations
  • Round 2, 251 tools across 11 servers: input tokens moved from 35.7M to 5.5M (−84.5%); estimated cost dropped from $180.07 to $29.80 (−83.4%); pass rate reached 100% with Code Mode enabled
  • Round 3, 508 tools across 16 servers: input tokens moved from 75.1M to 5.4M (−92.8%); estimated cost dropped from $377.00 to $29.00 (−92.2%); pass rate stayed at 100% in both configurations
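The reduction percentages above can be recomputed from the quoted figures. Small deviations from the article's numbers are expected, since the published token and dollar values were themselves rounded before the ratios were taken.

```python
# Recompute reduction percentages from the rounded figures quoted above.

rounds = [
    # (tools, baseline tokens (M), code-mode tokens (M), baseline $, code-mode $)
    (96, 19.9, 8.3, 104.04, 46.06),
    (251, 35.7, 5.5, 180.07, 29.80),
    (508, 75.1, 5.4, 377.00, 29.00),
]

for tools, t0, t1, c0, c1 in rounds:
    tok_drop = 100 * (t0 - t1) / t0
    cost_drop = 100 * (c0 - c1) / c0
    print(f"{tools} tools: tokens -{tok_drop:.1f}%, cost -{cost_drop:.1f}%")
```

The recomputed values land within a tenth of a percentage point of the published ones, and the 508-tool round reproduces the headline 92.8% token reduction.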

Two patterns emerge from the data. First, the savings are nonlinear: they compound as the MCP footprint grows, because the default pattern's cost tracks with tool count while Code Mode's cost tracks with what the model actually reads. Second, the gains did not cost accuracy. Pass rate held at 100% in both configurations in Rounds 1 and 3, and Code Mode hit 100% in Round 2. The complete raw data, query set, and methodology are published in the Bifrost MCP Code Mode benchmark report.

Reading the Cost Curve: What It Says About MCP Economics

The most instructive part of the benchmark is the shape of the savings curve itself. Around 100 tools, Code Mode produces a solid but unspectacular advantage. At 250 tools, the gap widens noticeably. By 500 tools, the two approaches operate in entirely different cost regimes, with roughly 14× fewer input tokens per query and a total cost ratio near 13 to 1.

Three takeaways follow for teams architecting AI infrastructure in 2026:

  • Context economics, not tool count, sets the ceiling on agent capability. The right question has shifted from "how many tools can we connect?" to "how many tools can we afford to expose on every turn?" Code execution removes that ceiling.
  • MCP governance and MCP cost are the same problem wearing two hats. The cleanest way to stop paying for tool definitions that go unused is to stop injecting them into context by default. Scoped access through virtual keys, tool groups, and per-tool bindings shrinks both the blast radius and the token bill simultaneously.
  • The gateway layer is the correct place to solve this. Implementing code execution once per agent or per application is fragile and duplicative. Solving it inside a gateway that already handles routing, authentication, and observability gives every MCP consumer the same economics with zero client changes.

Rolling Out Code Mode in Production

Switching Code Mode on inside Bifrost is a configuration change, not a migration project. In practice, the rollout that has worked best follows four steps:

  • Register the MCP clients: add each MCP server along with its connection type (HTTP, SSE, STDIO, or in-process). Bifrost discovers the available tools and starts syncing them on a configurable interval.
  • Flip Code Mode on per client: toggle Code Mode in the client settings and the four meta-tools replace the flat tool catalog automatically. No schema changes and no redeployment are involved.
  • Mark safe tools as auto-executable: add read-only tools to the auto-execute allowlist. executeToolCode only becomes auto-executable once every tool the generated script calls is itself on the allowlist, which keeps write operations behind an explicit approval gate by default.
  • Scope consumers with virtual keys and MCP Tool Groups: issue a scoped credential per consumer and bundle tools into named groups that attach to keys, teams, or customers. Access and enterprise AI governance policies are applied at request time.
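The allowlist rule in step three is worth pinning down: a generated script auto-executes only when every tool it invokes is itself on the allowlist. The sketch below illustrates that gate; the naive AST-based call extraction and the tool names are assumptions for illustration, not Bifrost's actual parser.

```python
# Sketch of the auto-execute allowlist rule described above: a script is
# auto-executable only if every tool it calls is on the read-only
# allowlist. The AST walk is a deliberately naive illustration.
import ast

AUTO_EXECUTE_ALLOWLIST = {"get_contact", "list_invoices"}  # read-only tools

def called_tools(script: str) -> set[str]:
    """Collect the names of functions the script calls."""
    tree = ast.parse(script)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

def auto_executable(script: str) -> bool:
    return called_tools(script) <= AUTO_EXECUTE_ALLOWLIST

print(auto_executable('result = get_contact("a@b.co")'))     # True
print(auto_executable('result = delete_contact("a@b.co")'))  # False
```

One non-allowlisted call anywhere in the script is enough to route the whole execution through the explicit approval gate, which is what keeps write operations from riding along with read-only work.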

Every tool invocation is written to an audit log as a first-class entry with the tool name, the source server, the arguments passed in, the result returned, the latency, the virtual key that triggered the call, and the parent LLM request that started the agent loop. That level of telemetry turns the cost curve from an anecdote into something teams can actually audit.
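For a sense of what one such entry carries, here is an illustrative shape built from the fields listed above. The field names and values are hypothetical, not Bifrost's actual log schema.

```python
# Illustrative shape of a single audit-log entry; field names and
# values are hypothetical, not Bifrost's actual schema.
import json

entry = {
    "tool": "get_contact",
    "server": "crm",
    "arguments": {"email": "ada@example.com"},
    "result": {"email": "ada@example.com", "name": "Ada"},
    "latency_ms": 142,
    "virtual_key": "vk_support_agent",
    "parent_request_id": "req_demo_123",  # the LLM call that started the loop
}
print(json.dumps(entry, indent=2))
```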

Start Cutting MCP Token Costs Today

When tool exposure is decoupled from context loading, MCP token costs stop being the ceiling on how far an agent can scale. The Bifrost benchmark at 508 tools across 16 servers delivered a 92.8% drop in input tokens and a 92.2% drop in estimated cost with no loss of accuracy, and the gap keeps widening as the tool catalog grows. To see how Bifrost handles MCP token cost optimization, governance, and observability in a live environment, book a Bifrost demo with the team.
