Kamya Shah

Stop Burning Tokens: How an MCP Gateway Fixes Claude Code and Codex CLI Cost Leaks

Bifrost MCP Gateway cuts coding agent token costs by up to 92% using Code Mode, virtual keys, and on-demand tool loading. Here's how it works.

If you run Claude Code or Codex CLI against more than a couple of MCP servers, your token bill is quietly inflating. Every turn of the agent loop resends the complete tool catalog into the model's context, whether the agent needs those tools or not. Bifrost MCP Gateway solves this at the infrastructure layer by exposing tools on demand through Code Mode, scoping access with virtual keys, and consolidating every MCP server behind one endpoint. In controlled benchmarks across 16 servers and 508 tools, input tokens dropped 92.8% while pass rate stayed at 100%.

The tool catalog problem nobody talks about

Here is what the classic MCP execution model does under the hood. Every tool exposed by every connected MCP server is serialized into the model's context on every request. Connect five servers with thirty tools each, and you are pushing 150 tool schemas before the prompt even gets read. Connect sixteen servers with 500 tools, and the model is spending more of its token budget reading a catalog than actually reasoning about your code.

Anthropic's engineering team called this out in their writeup on code execution with MCP. They documented a Google Drive to Salesforce workflow where context usage fell from 150,000 tokens to 2,000 when tools were loaded on demand instead of dumped upfront. The same economics hit every Claude Code or Codex CLI user who wires up a serious fleet of MCP servers.

The side effects compound:

  • Inference cost scales with your MCP footprint, not with the work the agent actually does.
  • Agent latency grows as the tool catalog grows, because more tokens need to be read before reasoning begins.
  • Tool selection accuracy degrades when the model has to disambiguate the right tool from dozens of irrelevant ones.

Claude Code's docs acknowledge this pressure directly, noting that tool search is on by default to reduce the problem. But client-side heuristics do not fix the underlying architecture, especially when multiple teams and agents share the same tool fleet.

What this costs in practice

A typical coding agent setup looks something like this:

  • Filesystem MCP server for code access.
  • GitHub MCP server for PR and issue management.
  • A handful of internal tool servers for databases, CI, and ops.
  • Each server exposing anywhere from ten to fifty tools.

A moderately complex task runs six to ten turns in the agent loop. With 150 tool definitions averaging a few hundred tokens each, a single task can burn 300K input tokens on schemas alone before producing a useful line of output. Multiply by hundreds of daily runs per engineer and the spend gets uncomfortable fast.
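The arithmetic behind that estimate is easy to check. The per-schema token count and turn count below are illustrative assumptions from the scenario above, not measured values:

```python
# Back-of-envelope estimate of schema overhead in the classic MCP model.
# All three inputs are illustrative assumptions from the scenario above.
TOOLS = 150              # tool definitions across all connected servers
TOKENS_PER_SCHEMA = 300  # "a few hundred tokens" per definition
TURNS = 7                # a moderately complex task: six to ten loop turns

schema_tokens_per_turn = TOOLS * TOKENS_PER_SCHEMA
schema_tokens_per_task = schema_tokens_per_turn * TURNS

print(f"{schema_tokens_per_turn:,} schema tokens per turn")   # 45,000
print(f"{schema_tokens_per_task:,} schema tokens per task")   # 315,000
```

At seven turns the schemas alone land just above the 300K figure, before the agent has produced a single useful token of output.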

How Bifrost MCP Gateway fixes the token leak

Bifrost is the open-source AI gateway by Maxim AI, written in Go, with 11 microseconds of overhead at 5,000 RPS. It runs as both an MCP client (connecting upstream to your tool servers) and an MCP server (exposing a unified /mcp endpoint to Claude Code, Codex CLI, Cursor, and anything else that speaks MCP). The cost reduction comes from three layers, not one.

Layer 1: Code Mode replaces schema dumps with stub files

Code Mode is the main mechanism. Instead of injecting every tool definition into the context, Bifrost presents connected servers as a virtual filesystem of compact Python stub files. The model works with just four meta-tools and navigates the catalog on demand:

  • listToolFiles: list which servers and tools are available
  • readToolFile: load Python function signatures for a specific server or tool
  • getToolDocs: pull detailed documentation for a single tool when needed
  • executeToolCode: run an orchestration script against live tool bindings inside a sandboxed Starlark interpreter

Workflow: the model reads only the stubs it needs, writes a short script that chains several tool calls, and submits that script through executeToolCode. Bifrost runs it in the sandbox, executes the chain, and returns only the final result. Intermediate results never touch the model's context.
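To make that workflow concrete, here is a sketch of the kind of script an agent might submit through executeToolCode. The `github` binding, its method signature, and the returned fields are all invented for illustration (the fake class here stands in for a live tool binding inside the sandbox); real stub signatures come from readToolFile:

```python
# Hypothetical orchestration script (illustrative only): chain tool calls
# inside the sandbox and return just a summary. The FakeGitHub class fakes
# the live binding so this sketch runs standalone.

class FakeGitHub:
    def list_pull_requests(self, repo, state):
        # Stand-in for a real MCP tool call; fields are invented.
        return [
            {"title": "Fix auth bug", "days_since_update": 30},
            {"title": "Bump deps", "days_since_update": 2},
        ]

github = FakeGitHub()

prs = github.list_pull_requests(repo="acme/api", state="open")

# Filter inside the sandbox instead of round-tripping raw data
# through the model's context.
stale = [p for p in prs if p["days_since_update"] > 14]

# Only this summary returns to the model; the raw PR list never does.
result = {"stale_pr_count": len(stale), "titles": [p["title"] for p in stale]}
print(result)
```

The point of the pattern is the last line: however large the intermediate payloads get, the model only ever pays tokens for the final result.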

Code Mode supports two binding levels. Server-level binding bundles every tool from one server into a single stub file (efficient for servers with modest tool counts). Tool-level binding gives each tool its own stub (useful when a server exposes thirty-plus tools with rich schemas). Both use the same four-meta-tool interface, so the switch is a configuration flag, not a rewrite.

Layer 2: Tool filtering scopes what each agent sees

Not every Claude Code session or Codex CLI instance needs the same tool surface. Bifrost's tool filtering lets you define, per virtual key, exactly which tools are exposed. A CI agent running unattended can get a read-only subset. An interactive Claude Code session for a senior engineer can get the full surface. The model literally never sees definitions for tools outside its scope, so there is no prompt-level workaround and no wasted context.
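The scoping logic can be sketched in a few lines. This is a minimal illustration of the idea, not Bifrost's actual API; the key names and tool names are invented:

```python
# Minimal sketch of per-key tool scoping (illustrative, not Bifrost's API):
# each virtual key maps to the only tool names it is allowed to see.

CATALOG = {"fs.read_file", "fs.write_file", "github.merge_pr", "db.drop_table"}

VIRTUAL_KEYS = {
    "vk-ci-bot":   {"fs.read_file"},                      # read-only CI agent
    "vk-dev-full": {"fs.read_file", "fs.write_file",
                    "github.merge_pr"},                   # interactive session
}

def visible_tools(virtual_key):
    # Tools outside the key's scope are never serialized into context,
    # so an unknown key sees nothing at all.
    return CATALOG & VIRTUAL_KEYS.get(virtual_key, set())

print(sorted(visible_tools("vk-ci-bot")))    # ['fs.read_file']
```

Because the filter runs at the gateway, out-of-scope tools never reach the model's context in the first place, which is the difference from prompt-level restrictions that still pay for every schema.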

Layer 3: One endpoint for every connected server

Teams stop maintaining MCP configs inside each coding agent. You point Claude Code or Codex CLI at Bifrost's /mcp endpoint and it discovers every upstream server through one connection, governed by the virtual key attached to the request. Add a new server to Bifrost and every connected coding agent picks it up automatically, no config changes required on the client side.

The benchmark numbers

Bifrost ran three controlled benchmark rounds with Code Mode on and off, scaling tool count between rounds:

| Round | Tools × Servers | Input Tokens (OFF) | Input Tokens (ON) | Token Reduction | Cost Reduction | Pass Rate |
|-------|------------------------|-------|------|--------|--------|------|
| 1 | 96 tools · 6 servers | 19.9M | 8.3M | −58.2% | −55.7% | 100% |
| 2 | 251 tools · 11 servers | 35.7M | 5.5M | −84.5% | −83.4% | 100% |
| 3 | 508 tools · 16 servers | 75.1M | 5.4M | −92.8% | −92.2% | 100% |
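The reported reductions can be rechecked from the rounded totals above. Because the displayed token counts are rounded to one decimal place, the recomputed percentages can drift from the reported figures by about a tenth of a point:

```python
# Recompute token reductions from the (rounded) benchmark totals.
rounds = {1: (19.9, 8.3), 2: (35.7, 5.5), 3: (75.1, 5.4)}  # millions, OFF/ON
reported = {1: 58.2, 2: 84.5, 3: 92.8}                      # percent

for r, (off, on) in rounds.items():
    reduction = (1 - on / off) * 100
    # Inputs are rounded to one decimal, so allow ~0.2pp of drift.
    assert abs(reduction - reported[r]) < 0.2
    print(f"round {r}: {reduction:.1f}% reduction")
```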

Two takeaways matter here. First, the savings compound rather than growing linearly, because classic MCP's cost scales with the total number of connected tools while Code Mode's cost scales with what the model actually reads. The bigger your MCP footprint, the bigger the delta. Second, accuracy held at 100% across all three rounds, so this is not a capability-for-cost trade. The full methodology and raw results are in the Bifrost MCP Code Mode benchmarks repo.

For context on how Code Mode combines with access control and per-tool cost tracking, the Bifrost MCP Gateway deep-dive goes further.

Wiring it up

The full configuration walkthroughs live in the Claude Code integration guide and the Codex CLI integration guide. The short version:

  1. Run Bifrost locally or inside your VPC and add your MCP servers through the dashboard. HTTP, SSE, and STDIO transports are all supported.
  2. Toggle Code Mode on at the client level. No redeployment, no schema rewrites.
  3. Create a virtual key per consumer (a developer, a CI bot, a customer integration) and attach the tool set it is allowed to call.
  4. Point Claude Code or Codex CLI at the Bifrost /mcp endpoint using that virtual key.
  5. For multi-team setups, use MCP Tool Groups to manage access at team or customer scope instead of per individual key.

Once traffic starts flowing, every tool call becomes a first-class log entry: tool name, source server, arguments, result, latency, originating virtual key, and parent LLM request. LLM token costs and per-tool execution costs sit next to each other, so spend attribution stops being guesswork.

What you pick up along the way

Lower token costs are the headline, but coding agents running through Bifrost MCP Gateway also get infrastructure most teams eventually build themselves:

  • Scoped access: every agent sees only the tools it should see.
  • Audit trails: every tool execution is logged with arguments and results, useful for security review and debugging.
  • Health monitoring: automatic reconnects on upstream failure, with periodic refresh to pick up new tools.
  • OAuth 2.0 with PKCE: including dynamic client registration and auto token refresh.
  • Unified model routing: the same gateway handles provider routing, failover, and load balancing across 20+ LLM providers.

More deployment-specific guidance is on the Bifrost MCP gateway resource page and the Claude Code integration resource.

Getting off the token treadmill

If your Claude Code or Codex CLI setup is quietly burning tokens on tool catalogs every turn, the leak is architectural, not configurable. Bifrost MCP Gateway closes it by loading tools on demand, scoping access per consumer, and consolidating every connected server behind one endpoint, without sacrificing accuracy or capability.

To see how Bifrost can cut token costs across your coding agent fleet, book a demo with the Bifrost team.
