DEV Community

Kamya Shah


Best MCP Gateway for Claude Code to Cut Token Costs

If you run Claude Code with multiple MCP servers, you have probably noticed that token costs grow faster than expected. The reason is architectural, not accidental: every MCP server you connect loads its full tool catalog into the context window on every single request. Before Claude Code processes your actual task, it has already consumed thousands of tokens in tool definitions.

Bifrost, the open-source AI gateway by Maxim AI, solves this with Code Mode, an execution model that reduces MCP token costs by 50% to 92% without trimming tools or losing capability.


The Token Cost Problem is Structural, Not Incidental

MCP has crossed 97 million monthly downloads and is now standard infrastructure for AI agents. The protocol itself is well-designed. The cost problem is a consequence of how tool discovery works by default.

When Claude Code connects directly to MCP servers:

  • Each server exposes tool definitions containing names, descriptions, input schemas, and parameter types.
  • All definitions from all connected servers are injected into the context window before every request.
  • A single tool definition runs 150 to 300 tokens. Fifty tools across five servers translate to 7,500 to 15,000 tokens of overhead per call.
  • In multi-step workflows, intermediate tool results also pass back through the model on each turn, stacking token costs further.
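The overhead arithmetic in the list above can be checked directly (the 150-to-300-token range per definition is the article's own estimate, not a measured constant):

```python
# Estimate the context overhead from MCP tool definitions injected per request.
TOKENS_PER_TOOL = (150, 300)  # article's estimate for one tool definition

def overhead_range(num_tools: int) -> tuple[int, int]:
    """Tokens consumed by tool definitions before any real work happens."""
    lo, hi = TOKENS_PER_TOOL
    return num_tools * lo, num_tools * hi

# 50 tools across 5 servers: 7,500 to 15,000 tokens on every single call.
low, high = overhead_range(50)
print(low, high)  # 7500 15000
```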

Trimming your tool list is the standard workaround. It trades capability for cost control. An MCP gateway eliminates the need for that trade-off entirely.


How an MCP Gateway Addresses This

An MCP gateway sits between Claude Code and all your tool servers as a single aggregation and control layer. Claude Code connects once to the gateway. The gateway manages all server connections, tool discovery, routing, and execution behind that single endpoint.

For Claude Code specifically, this means:

  • One endpoint, all tools: Add or remove MCP servers in the gateway and they appear or disappear in Claude Code automatically, no client config changes needed.
  • Scoped tool visibility: Control exactly which tools each developer or workflow can see using virtual keys, reducing context overhead.
  • Token-efficient execution: Replace full tool injection with an on-demand model that loads only what the current task requires.
  • Semantic caching: Serve repeated or similar queries from cache instead of the provider.

Bifrost's MCP gateway functions as both an MCP client (connecting to external tool servers) and an MCP server (exposing a governed endpoint to Claude Code). That dual role is what enables centralized control without changing how Claude Code operates.


Code Mode: How Bifrost Achieves 50 to 92% Token Reduction

Standard MCP has no concept of lazy loading. Every tool from every server goes into context, every time. Costs scale linearly with each server you add, and multi-step workflows push them higher still.

Bifrost's Code Mode replaces that model entirely. The approach draws on research published by Anthropic's engineering team, which found that switching from direct tool calls to code-based orchestration reduced context from 150,000 tokens to 2,000 for a complex multi-tool workflow.

Instead of injecting raw tool definitions, Code Mode represents connected MCP servers as lightweight Python stub files in a virtual filesystem. The model uses four meta-tools to work with them:

| Meta-tool | What it does |
| --- | --- |
| listToolFiles | Lists available servers and tools by name |
| readToolFile | Retrieves Python function signatures for a specific server or tool |
| getToolDocs | Loads full documentation for a tool before execution |
| executeToolCode | Runs the orchestration script in a sandboxed interpreter |

The flow: Claude reads the stub for the relevant server, writes a short Python orchestration script, and calls executeToolCode. Bifrost executes it in a Starlark sandbox and returns the final result. Intermediate tool outputs never touch the model context. The complete tool catalog never enters the context window.
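An orchestration script in that flow might look like the following. The server and function names are illustrative, and the stubs are simulated inline so the example is self-contained; in Code Mode they would be the generated stub files:

```python
# Simulated tool stubs standing in for the generated Python stub files.
def github_list_issues(repo: str) -> list:
    return [{"id": 1, "title": "bug: crash on start", "labels": ["bug"]},
            {"id": 2, "title": "docs: typo", "labels": ["docs"]}]

def slack_post_message(channel: str, text: str) -> str:
    return f"posted to {channel}"

# The kind of short script the model writes and hands to executeToolCode.
# The intermediate result (the full issue list) stays inside the sandbox;
# only the final summary string returns to the model's context.
issues = github_list_issues("acme/app")
bugs = [i["title"] for i in issues if "bug" in i["labels"]]
result = slack_post_message("#triage", f"{len(bugs)} open bug(s): " + "; ".join(bugs))
print(result)  # posted to #triage
```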

Benchmark results from three controlled test rounds:

| Setup | Without Code Mode | With Code Mode | Cost Reduction |
| --- | --- | --- | --- |
| 6 servers, 96 tools | $104.04 | $46.06 | 55.7% |
| 11 servers, 251 tools | $180.07 | $29.80 | 83.4% |
| 16 servers, 508 tools | $377.00 | $29.00 | 92.2% |

The savings compound as MCP footprint grows because Code Mode's cost is bounded by what the model reads, not by how many tools are registered. Full benchmark data and methodology are in Bifrost's published performance benchmarks.

Code Mode also cuts latency by 40% on multi-tool tasks. Rather than five separate tool calls each requiring a provider round trip, the model writes one script that executes all five sequentially. The Starlark sandbox is intentionally constrained: no file I/O, no network access, no imports. Tool calls and basic Python-like logic only. This makes it safe to enable inside Agent Mode for fully automated execution.
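The round-trip arithmetic behind that latency win can be sketched with assumed timings (the 800 ms and 100 ms figures below are illustrative placeholders, not benchmark numbers):

```python
# Rough latency model: N direct tool calls each pay a provider round trip,
# while Code Mode pays one round trip for the whole orchestration script.
PROVIDER_RTT_MS = 800   # assumed per-request model latency
TOOL_EXEC_MS = 100      # assumed per-tool execution time

def direct_calls_ms(n_tools: int) -> int:
    return n_tools * (PROVIDER_RTT_MS + TOOL_EXEC_MS)

def code_mode_ms(n_tools: int) -> int:
    return PROVIDER_RTT_MS + n_tools * TOOL_EXEC_MS

print(direct_calls_ms(5), code_mode_ms(5))  # 4500 1300
```

Under these assumptions, the gap widens with every extra tool call, since only the tool execution time, never the provider round trip, repeats.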


Connecting Claude Code to Bifrost

The Claude Code integration is one command:

```shell
claude mcp add --transport http bifrost http://localhost:8080/mcp
```

With Virtual Key authentication:

```shell
claude mcp add-json bifrost '{"type":"http","url":"http://localhost:8080/mcp","headers":{"Authorization":"Bearer your-virtual-key"}}'
```

After that, Claude Code routes all MCP traffic through Bifrost. New servers added to the gateway surface in Claude Code automatically. The full setup guide covers Code Mode activation, virtual key scoping, and environment-specific configuration.


Tool Filtering: The Second Cost Lever

Unscoped tool access is a separate token cost vector that compounds with the tool injection problem. When every Claude Code session can see every tool from every server, the context includes tools with no relevance to the current task.

Bifrost's virtual key system scopes tool access at the individual tool level. A key for a developer's day-to-day workflow can allow filesystem_read while blocking filesystem_write from the same MCP server. Admin tooling sits behind a separate key that standard developer keys cannot reach.

Tool Groups let you manage this at scale: define named collections of tools from one or more servers, then attach them to any combination of virtual keys, teams, or users. Bifrost resolves the permitted set at request time from memory, with no database queries. The result is that Claude Code sees a scoped, relevant tool list on every request, and that smaller list compounds the savings from Code Mode.
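Request-time scope resolution can be pictured like this (a conceptual sketch with invented group and key names, not Bifrost's internals):

```python
# Tool groups are named sets; virtual keys reference groups; the permitted
# set is resolved from in-memory maps on each request, with no database hit.
TOOL_GROUPS = {
    "dev-daily": {"filesystem_read", "git_log"},
    "admin":     {"filesystem_write", "deploy_release"},
}
VIRTUAL_KEYS = {
    "vk-dev-alice": ["dev-daily"],
    "vk-admin-ops": ["dev-daily", "admin"],
}

def permitted_tools(virtual_key: str) -> set:
    groups = VIRTUAL_KEYS.get(virtual_key, [])
    if not groups:
        return set()
    return set().union(*(TOOL_GROUPS[g] for g in groups))

# A developer key sees filesystem_read but never filesystem_write.
print(sorted(permitted_tools("vk-dev-alice")))  # ['filesystem_read', 'git_log']
```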


Semantic Caching

Development sessions generate a lot of repetition: the same file structure queries, the same dependency lookups, the same documentation requests throughout a session. Bifrost's semantic caching matches incoming requests against previous ones by meaning rather than exact string. "How do I sort an array in Python?" and "Python array sorting?" hit the same cache entry and return without touching the provider.

For Claude Code workflows that return to the same codebase context repeatedly, cache hit rates are high and the savings stack on top of Code Mode and tool filtering.
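The matching-by-meaning idea can be sketched minimally. Production semantic caches use embedding models; the word-overlap similarity below is only a self-contained stand-in:

```python
# Minimal semantic-cache sketch. Jaccard word overlap substitutes for a real
# embedding-based similarity; the cache structure is the point, not the metric.
def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.3):
        self.entries: list = []          # (query, response) pairs
        self.threshold = threshold

    def get(self, query: str):
        for cached_query, response in self.entries:
            if similarity(query, cached_query) >= self.threshold:
                return response          # served from cache, no provider call
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((query, response))

cache = SemanticCache()
cache.put("how do I sort an array in python", "use sorted() or list.sort()")
print(cache.get("python array sorting how"))  # hit despite different phrasing
```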


Observability at the Tool Level

Every tool execution is logged as a first-class entry in Bifrost: tool name, source server, arguments, response, latency, the virtual key that triggered it, and the upstream LLM request. Any Claude Code session is fully traceable: which tools were called, in what order, what each returned.
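A record carrying those fields might look like the following (the field names here are illustrative, matching the article's list rather than Bifrost's actual log schema):

```python
# Hypothetical shape of a per-tool-execution log entry.
from dataclasses import dataclass, field, asdict
import time

@dataclass
class ToolExecutionLog:
    tool_name: str        # which tool ran
    source_server: str    # which MCP server it came from
    arguments: dict       # what it was called with
    response: str         # what it returned
    latency_ms: float     # how long it took
    virtual_key: str      # who triggered it
    llm_request_id: str   # the upstream LLM request it belongs to
    timestamp: float = field(default_factory=time.time)

log = ToolExecutionLog("filesystem_read", "fs", {"path": "README.md"},
                       "contents...", 12.4, "vk-dev-alice", "req-123")
print(asdict(log)["tool_name"])  # filesystem_read
```

Grouping such records by `llm_request_id` is what makes a whole Claude Code session traceable: which tools ran, in what order, and what each returned.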

The built-in dashboard displays real-time breakdowns of token consumption, tool call frequency, and per-session costs. For production setups, Bifrost exposes Prometheus metrics and OpenTelemetry traces, compatible with Grafana, Datadog, and New Relic. Per-tool pricing configuration captures external API costs from tools that call paid third-party services, giving a complete view of what each agent run actually costs.


Capability Comparison

| Feature | Bifrost | Direct MCP | Generic gateways |
| --- | --- | --- | --- |
| Code Mode (50-92% token savings) | Yes | No | No |
| Virtual key tool scoping | Yes | No | Limited |
| Semantic caching | Yes | No | Varies |
| Single-command Claude Code setup | Yes | N/A | Partial |
| Self-hosted / in-VPC | Yes | N/A | Varies |
| Per-tool audit logging | Yes | No | Varies |
| Agent Mode (autonomous execution) | Yes | No | No |
| Multi-provider LLM routing | Yes | No | Limited |

Code Mode is the differentiator no other production MCP gateway offers. The orchestration-first execution model keeps token cost flat regardless of how many servers are connected.


More Than an MCP Gateway

Beyond MCP, Bifrost routes Claude Code traffic across 20+ LLM providers through a single OpenAI-compatible API. Teams can run Claude Code against different model providers per task type, or cap per-developer spend, entirely at the gateway layer with no changes to Claude Code configuration.

Enterprise deployments extend this with in-VPC hosting, RBAC, SSO via Okta or Microsoft Entra, audit logs for SOC 2 and HIPAA compliance, and MCP with federated authentication for turning existing internal APIs into MCP tools without writing a custom server.


Getting Started

Start Bifrost with a single command:

```shell
npx @maximai/bifrost
```

Full MCP gateway setup, including Code Mode and Claude Code integration, is in the Bifrost MCP docs. The Bifrost MCP Gateway blog post covers access control architecture and Code Mode benchmarks in full detail.

For enterprise deployments, book a demo with the Bifrost team.
