If you run Claude Code with multiple MCP servers, you have probably noticed that token costs grow faster than expected. The reason is architectural, not accidental: every MCP server you connect loads its full tool catalog into the context window on every single request. Before Claude Code processes your actual task, it has already consumed thousands of tokens in tool definitions.
Bifrost, the open-source AI gateway by Maxim AI, solves this with Code Mode, an execution model that reduces MCP token costs by 50% to 92% without trimming tools or losing capability.
## The Token Cost Problem Is Structural, Not Incidental
MCP has crossed 97 million monthly downloads and is now standard infrastructure for AI agents. The protocol itself is well-designed. The cost problem is a consequence of how tool discovery works by default.
When Claude Code connects directly to MCP servers:
- Each server exposes tool definitions containing names, descriptions, input schemas, and parameter types.
- All definitions from all connected servers are injected into the context window before every request.
- A single tool definition runs 150 to 300 tokens. Fifty tools across five servers translate to 7,500 to 15,000 tokens of overhead per call.
- In multi-step workflows, intermediate tool results also pass back through the model on each turn, stacking token costs further.
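The per-request overhead in the points above can be sanity-checked with quick arithmetic, using the 150-to-300-token estimate for a single definition:

```python
# Rough context overhead from injecting every tool definition on every
# request. Token-per-definition figures are the 150-300 estimate above.
servers = 5
tools_per_server = 10
tokens_low, tokens_high = 150, 300

total_tools = servers * tools_per_server
overhead_low = total_tools * tokens_low    # 7,500 tokens
overhead_high = total_tools * tokens_high  # 15,000 tokens

print(f"{total_tools} tools -> {overhead_low:,} to {overhead_high:,} tokens per request")
```

That cost is paid before the model has read a single word of your actual task.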
Trimming your tool list is the standard workaround. It trades capability for cost control. An MCP gateway eliminates the need for that trade-off entirely.
## How an MCP Gateway Addresses This
An MCP gateway sits between Claude Code and all your tool servers as a single aggregation and control layer. Claude Code connects once to the gateway. The gateway manages all server connections, tool discovery, routing, and execution behind that single endpoint.
For Claude Code specifically, this means:
- One endpoint, all tools: Add or remove MCP servers in the gateway and they appear or disappear in Claude Code automatically, no client config changes needed.
- Scoped tool visibility: Control exactly which tools each developer or workflow can see using virtual keys, reducing context overhead.
- Token-efficient execution: Replace full tool injection with an on-demand model that loads only what the current task requires.
- Semantic caching: Serve repeated or similar queries from cache instead of the provider.
Bifrost's MCP gateway functions as both an MCP client (connecting to external tool servers) and an MCP server (exposing a governed endpoint to Claude Code). That dual role is what enables centralized control without changing how Claude Code operates.
## Code Mode: How Bifrost Achieves 50 to 92% Token Reduction
Standard MCP has no concept of lazy loading. Every tool from every server goes into context, every time. As you add servers, costs scale linearly and then worse.
Bifrost's Code Mode replaces that model entirely. The approach draws on research published by Anthropic's engineering team, which found that switching from direct tool calls to code-based orchestration reduced context from 150,000 tokens to 2,000 for a complex multi-tool workflow.
Instead of injecting raw tool definitions, Code Mode represents connected MCP servers as lightweight Python stub files in a virtual filesystem. The model uses four meta-tools to work with them:
| Meta-tool | What it does |
|---|---|
| listToolFiles | Lists available servers and tools by name |
| readToolFile | Retrieves Python function signatures for a specific server or tool |
| getToolDocs | Loads full documentation for a tool before execution |
| executeToolCode | Runs the orchestration script in a sandboxed interpreter |
The flow: Claude reads the stub for the relevant server, writes a short Python orchestration script, and calls executeToolCode. Bifrost executes it in a Starlark sandbox and returns the final result. Intermediate tool outputs never touch the model context. The complete tool catalog never enters the context window.
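As an illustration of that flow, an orchestration script might look like the following. The github_list_issues and slack_post_message names are hypothetical tool stubs (mocked inline here so the sketch runs standalone); in Code Mode they would be the generated Python stubs that proxy to the connected MCP servers:

```python
# Hypothetical Code Mode orchestration script. In Bifrost these functions
# would be stubs generated from connected MCP servers; here they are
# mocked inline so the sketch is self-contained.
def github_list_issues(repo, state, label):
    # Mock of an MCP tool stub; a real stub proxies to the MCP server.
    return [
        {"id": 1, "labels": ["bug", "critical"]},
        {"id": 2, "labels": ["bug"]},
    ]

def slack_post_message(channel, text):
    # Mock of a second server's tool stub.
    return {"ok": True}

# The logic the model writes: the intermediate issue list never leaves
# the sandbox; only the final summary string reaches the model's context.
issues = github_list_issues(repo="acme/api", state="open", label="bug")
critical = [i for i in issues if "critical" in i["labels"]]
summary = f"{len(critical)} critical of {len(issues)} open bug issues"
slack_post_message(channel="#eng-alerts", text=summary)
print(summary)
```

The point of the pattern is the last comment: filtering and aggregation happen inside the sandbox, so the model pays tokens only for the one-line result, not for every intermediate payload.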
Benchmark results from three controlled test rounds:
| Setup | Without Code Mode | With Code Mode | Cost Reduction |
|---|---|---|---|
| 6 servers, 96 tools | $104.04 | $46.06 | 55.7% |
| 11 servers, 251 tools | $180.07 | $29.80 | 83.4% |
| 16 servers, 508 tools | $377.00 | $29.00 | 92.2% |
The savings compound as MCP footprint grows because Code Mode's cost is bounded by what the model reads, not by how many tools are registered. Full benchmark data and methodology are in Bifrost's published performance benchmarks.
Code Mode also cuts latency by 40% on multi-tool tasks. Rather than five separate tool calls each requiring a provider round trip, the model writes one script that executes all five sequentially. The Starlark sandbox is intentionally constrained: no file I/O, no network access, no imports. Tool calls and basic Python-like logic only. This makes it safe to enable inside Agent Mode for fully automated execution.
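The constrained-execution idea can be sketched in a few lines. This is a toy Python illustration only (Bifrost's actual sandbox is Starlark, which enforces these restrictions by language design rather than by a whitelist, and Python eval-style sandboxes are not safe for production):

```python
# Toy illustration of a constrained interpreter: only whitelisted tool
# functions and a handful of builtins are visible to the script; there
# is no open(), no __import__, no network access in scope.
def run_sandboxed(script: str, tools: dict):
    allowed_builtins = {"len": len, "sum": sum, "min": min,
                        "max": max, "sorted": sorted, "range": range}
    env = {"__builtins__": allowed_builtins, **tools}
    exec(script, env)
    return env.get("result")

tools = {"add": lambda a, b: a + b}
out = run_sandboxed("result = add(2, 3) + len([1, 2])", tools)
print(out)  # 7
```

The same shape explains the latency win: one sandboxed script replaces five provider round trips.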
## Connecting Claude Code to Bifrost
The Claude Code integration is one command:
```shell
claude mcp add --transport http bifrost http://localhost:8080/mcp
```
With Virtual Key authentication:
```shell
claude mcp add-json bifrost '{"type":"http","url":"http://localhost:8080/mcp","headers":{"Authorization":"Bearer your-virtual-key"}}'
```
After that, Claude Code routes all MCP traffic through Bifrost. New servers added to the gateway surface in Claude Code automatically. The full setup guide covers Code Mode activation, virtual key scoping, and environment-specific configuration.
## Tool Filtering: The Second Cost Lever
Unscoped tool access is a separate token cost vector that compounds with the tool injection problem. When every Claude Code session can see every tool from every server, the context includes tools with no relevance to the current task.
Bifrost's virtual key system scopes tool access at the individual tool level. A key for a developer's day-to-day workflow can allow filesystem_read while blocking filesystem_write from the same MCP server. Admin tooling sits behind a separate key that standard developer keys cannot reach.
Tool Groups let you manage this at scale: define named collections of tools from one or more servers, then attach them to any combination of virtual keys, teams, or users. Bifrost resolves the permitted set at request time from memory, with no database queries. The result is that Claude Code sees a scoped, relevant tool list on every request, and that smaller list compounds the savings from Code Mode.
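A minimal sketch of that request-time resolution, with hypothetical key and group names (Bifrost's actual data model may differ; the point is that the permitted set comes from in-memory maps, not a database query):

```python
# Hypothetical virtual-key -> tool-group resolution, done entirely from
# in-memory maps at request time. All names are illustrative.
TOOL_GROUPS = {
    "dev-readonly": {"filesystem_read", "github_search"},
    "admin":        {"filesystem_write", "deploy_service"},
}
VIRTUAL_KEYS = {
    "vk-dev-123":   ["dev-readonly"],
    "vk-admin-456": ["dev-readonly", "admin"],
}

def permitted_tools(virtual_key: str) -> set:
    # Union of every group attached to the key; unknown keys see nothing.
    groups = VIRTUAL_KEYS.get(virtual_key, [])
    return set().union(*(TOOL_GROUPS[g] for g in groups)) if groups else set()

print(sorted(permitted_tools("vk-dev-123")))  # ['filesystem_read', 'github_search']
```

A developer key resolves to the read-only set; the admin key additionally reaches filesystem_write and deploy tooling, mirroring the filesystem_read/filesystem_write split described above.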
## Semantic Caching
Development sessions generate a lot of repetition: the same file structure queries, the same dependency lookups, the same documentation requests throughout a session. Bifrost's semantic caching matches incoming requests against previous ones by meaning rather than exact string. "How do I sort an array in Python?" and "Python array sorting?" hit the same cache entry and return without touching the provider.
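A toy sketch of meaning-based matching, using bag-of-words cosine similarity as a stand-in for a real embedding model (production semantic caches, including Bifrost's, compare embedding vectors, but the lookup shape is analogous):

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    # Bag-of-words stand-in for an embedding model.
    return Counter(text.lower().replace("?", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cache = {"how do i sort an array in python": "Use sorted() or list.sort()."}

def lookup(query: str, threshold: float = 0.35):
    qv = vectorize(query)
    for cached_q, answer in cache.items():
        if cosine(qv, vectorize(cached_q)) >= threshold:
            return answer  # cache hit: no provider call
    return None           # cache miss: forward to the provider

print(lookup("Python array sorting?"))  # Use sorted() or list.sort().
```

The two phrasings from the paragraph above land on the same cache entry even though they share no exact string, which is exactly the repetition pattern a long Claude Code session produces.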
For Claude Code workflows that return to the same codebase context repeatedly, cache hit rates are high and the savings stack on top of Code Mode and tool filtering.
## Observability at the Tool Level
Every tool execution is logged as a first-class entry in Bifrost: tool name, source server, arguments, response, latency, the virtual key that triggered it, and the upstream LLM request. Any Claude Code session is fully traceable: which tools were called, in what order, what each returned.
The built-in dashboard displays real-time breakdowns of token consumption, tool call frequency, and per-session costs. For production setups, Bifrost exposes Prometheus metrics and OpenTelemetry traces, compatible with Grafana, Datadog, and New Relic. Per-tool pricing configuration captures external API costs from tools that call paid third-party services, giving a complete view of what each agent run actually costs.
## Capability Comparison
| Feature | Bifrost | Direct MCP | Generic gateways |
|---|---|---|---|
| Code Mode (50-92% token savings) | Yes | No | No |
| Virtual key tool scoping | Yes | No | Limited |
| Semantic caching | Yes | No | Varies |
| Single-command Claude Code setup | Yes | N/A | Partial |
| Self-hosted / in-VPC | Yes | N/A | Varies |
| Per-tool audit logging | Yes | No | Varies |
| Agent Mode (autonomous execution) | Yes | No | No |
| Multi-provider LLM routing | Yes | No | Limited |
Code Mode is the differentiator no other production MCP gateway offers. The orchestration-first execution model keeps token cost flat regardless of how many servers are connected.
## More Than an MCP Gateway
Beyond MCP, Bifrost routes Claude Code traffic across 20+ LLM providers through a single OpenAI-compatible API. Teams can run Claude Code against different model providers per task type, or cap per-developer spend, entirely at the gateway layer with no changes to Claude Code configuration.
Enterprise deployments extend this with in-VPC hosting, RBAC, SSO via Okta or Microsoft Entra, audit logs for SOC 2 and HIPAA compliance, and MCP with federated authentication for turning existing internal APIs into MCP tools without writing a custom server.
## Getting Started
Start Bifrost with a single command:
```shell
npx @maximai/bifrost
```
Full MCP gateway setup, including Code Mode and Claude Code integration, is in the Bifrost MCP docs. The Bifrost MCP Gateway blog post covers access control architecture and Code Mode benchmarks in full detail.
For enterprise deployments, book a demo with the Bifrost team.