Kamya Shah


The Real Cost of MCP in Claude Code, and How to Bring It Down

Bifrost's MCP gateway and Code Mode reduce MCP token costs for Claude Code by up to 92%, with centralized governance and per-tool cost visibility.

The pattern is familiar to any team that has rolled Claude Code out beyond a single developer. Integrations multiply, MCP servers get wired in one by one, workflows genuinely improve, and then the API bill lands and nobody can quite explain the shape of it. The easy assumption is usage growth. The actual story is almost always tool overhead, and it has a structural cause: the Model Context Protocol loads tool schemas into context on every single request. Reducing MCP token costs for Claude Code at team scale isn't a matter of using the tool less. It's a matter of putting a governance and execution layer in the right place. Bifrost, the open-source AI gateway from Maxim AI, is designed for exactly that. This piece lays out where the costs actually come from, what Claude Code's native controls already handle, and how Bifrost's MCP gateway with Code Mode reduces token consumption by up to 92% in production-scale workloads.

The Structural Source of MCP Token Costs

Unlike most context costs, MCP overhead isn't paid once per session. It's paid on every turn. Each MCP server Claude Code connects to injects its full tool schemas (every name, description, and parameter definition) into the model's context on every message. Five servers with thirty tools each means 150 tool definitions shipped before the model has even seen the user's prompt.

Independent measurement has made the scale of this concrete. An analysis of real-world Claude Code sessions found that a four-server configuration typically carries around 7,000 tokens of MCP overhead per message, with heavier setups crossing 50,000 tokens before a single prompt is typed. A separate breakdown reported multi-server Claude Code setups commonly adding 15,000 to 20,000 tokens of overhead per turn under usage-based billing.

Three dynamics amplify the problem as teams grow:

  • Repetition on every message: a 50-turn session pays the overhead 50 times.
  • Tools you don't use still charge you: a Playwright server's 22 browser tools load even during a Python edit.
  • Verbose descriptions by default: most open-source MCP servers ship with long, readable descriptions that inflate every tool's per-token cost.

The downstream effect isn't limited to the bill. Overhead crowds out the working context the model needs, pushes compaction earlier in the session, and degrades output quality on long tasks.
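To make the compounding concrete, here is a back-of-the-envelope calculation using the roughly 7,000-token four-server figure cited above. The per-token price is an assumption for illustration, not a quoted rate:

```python
# Rough cost model for per-turn MCP schema overhead.
# 7,000 tokens/turn comes from the four-server measurement above;
# the USD rate below is an illustrative assumption.
OVERHEAD_TOKENS_PER_TURN = 7_000
TURNS = 50
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # assumed rate, for illustration only

total_overhead = OVERHEAD_TOKENS_PER_TURN * TURNS
cost = total_overhead / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(total_overhead)   # 350000 tokens of schema overhead in one session
print(f"${cost:.2f}")   # $1.05 per session before any actual work
```

Multiply that per-session figure by every developer and every session per day, and the bill's shape starts to make sense.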

What Claude Code's Native Controls Cover

Anthropic has been responsive to this problem. Claude Code's cost management documentation covers tool search deferral, prompt caching, auto-compaction, model tiering, and custom preprocessing hooks. Tool search is the most relevant for MCP: once total tool definitions cross a threshold, Claude Code defers them, and only tool names remain in context until the model actually invokes one. In heavy sessions this alone can save 13,000+ tokens.

For an individual developer running a few MCP servers locally, the native controls are sufficient. At team scale, three gaps remain:

  • Client-side optimization, not organizational control: tool search deferral optimizes one session. It doesn't let a platform team define which tools a given developer, team, or customer integration is permitted to call.
  • No orchestration layer: even with deferral, every multi-step workflow still incurs schema loads, intermediate tool results, and model round-trips on every step.
  • No cross-team visibility: per-session introspection is available to each developer, but there's no organizational view of which tools are consuming tokens across the team.

Once the problem shifts from "one developer's cost" to "fifty developers' governed MCP usage," the solution has to move into the infrastructure layer.

How Bifrost Reduces MCP Token Costs for Claude Code

Bifrost sits between Claude Code and the MCP servers a team depends on. Rather than Claude Code connecting to each server directly, it connects to Bifrost's single /mcp endpoint. Bifrost handles discovery, governance, execution, and the orchestration model that produces the largest cost reduction: Code Mode.

The impact is documented in Bifrost's MCP gateway cost benchmark. Across three controlled rounds, input tokens dropped by 58% with 96 tools connected, 84% with 251 tools, and 92% with 508 tools. Pass rate held at 100% throughout.

Code Mode: moving orchestration out of the prompt

Code Mode is the single most consequential shift. Instead of injecting every tool definition into context, Bifrost exposes connected MCP servers as a virtual filesystem of lightweight Python stub files. The model reads only the stubs it needs, writes a short Python script to orchestrate them, and Bifrost executes the script in a sandboxed Starlark interpreter.
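To illustrate the kind of script the model produces, here is a sketch in which hypothetical stub functions stand in for two connected servers. The names `github_list_issues` and `slack_post_message` are invented for the example; in a real session the stubs come from the virtual filesystem and the script runs inside Bifrost's sandbox, not locally:

```python
# Hypothetical local stand-ins for the Python stubs Code Mode exposes.
# Real stubs are read from the virtual filesystem and executed by the
# gateway's sandboxed interpreter; these mocks exist only so the sketch runs.
def github_list_issues(repo: str, state: str = "open") -> list[dict]:
    return [{"number": 42, "title": "Fix login timeout"}]

def slack_post_message(channel: str, text: str) -> dict:
    return {"ok": True, "channel": channel}

# The orchestration script itself. Intermediate results stay inside the
# sandbox, so only the short final summary re-enters the model's context.
issues = github_list_issues("acme/api", state="open")
summary = f"{len(issues)} open issue(s): " + ", ".join(i["title"] for i in issues)
result = slack_post_message("#eng", summary)
print(summary)
```

The key property is that the full issue payloads never pass through the model's context at all; the model sees only what the script chooses to surface.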

The model interacts with four meta-tools, regardless of how many MCP servers are connected:

  • listToolFiles: discover the available servers and tools.
  • readToolFile: load Python function signatures for a given server or tool.
  • getToolDocs: pull detailed documentation for a specific tool on demand.
  • executeToolCode: run the orchestration script against live bindings.
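The discovery flow those four meta-tools enable can be mocked in a few lines. The server and tool names below are invented, and the functions are local stand-ins for calls that would really go to the Bifrost gateway:

```python
# Minimal mock of the four Code Mode meta-tools, to show the discovery
# flow. Server names, tool names, and stub contents are invented.
CATALOG = {
    "github": {"list_issues": "def list_issues(repo: str) -> list: ..."},
    "slack": {"post_message": "def post_message(channel: str, text: str) -> dict: ..."},
}

def listToolFiles() -> list[str]:
    return sorted(CATALOG)                      # which servers/tools exist

def readToolFile(server: str) -> dict:
    return CATALOG[server]                      # signatures for one server

def getToolDocs(server: str, tool: str) -> str:
    return f"Docs for {server}.{tool}"          # detailed docs, on demand

def executeToolCode(script: str) -> str:
    return "sandboxed execution result"         # runs against live bindings

# A session reads only what it needs: two tiny stub files instead of
# hundreds of full schemas injected up front.
servers = listToolFiles()
signatures = {s: readToolFile(s) for s in servers}
print(servers)  # ['github', 'slack']
```

Whether two servers are connected or two hundred, the model's fixed interface stays these four calls; only the stubs it chooses to read cost context.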

The approach has broad industry validation. Anthropic's engineering team documented the pattern with code execution and MCP, showing a Google Drive to Salesforce workflow dropping from 150,000 tokens to 2,000. Cloudflare reported similarly dramatic savings in its own implementation. Bifrost builds the pattern natively into the gateway, uses Python instead of JavaScript for better LLM fluency, and adds the dedicated docs tool to compress context further.

The savings compound as tool count grows. Classic MCP's context cost scales linearly with the number of tools connected. Code Mode's cost is bounded by what the model actually reads, so its curve stays nearly flat while classic MCP's keeps climbing.
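A toy cost model makes the scaling difference visible. The tokens-per-schema and tokens-per-stub figures are assumptions chosen for illustration, not measured values:

```python
def classic_mcp_context(tool_count: int, tokens_per_schema: int = 140) -> int:
    # Classic MCP: every connected tool's schema is injected on every turn,
    # so context cost grows linearly with catalog size.
    return tool_count * tokens_per_schema

def code_mode_context(tools_read: int, tokens_per_stub: int = 60,
                      meta_tool_overhead: int = 400) -> int:
    # Code Mode: a fixed cost for the four meta-tools, plus only the
    # stub files the model actually reads for this task.
    return meta_tool_overhead + tools_read * tokens_per_stub

# At the benchmark's catalog sizes, the classic cost keeps climbing while
# a task that reads five stubs costs the same regardless of catalog size.
for n in (96, 251, 508):
    print(n, classic_mcp_context(n), code_mode_context(tools_read=5))
```

Under these assumed constants, classic context grows from ~13k to ~71k tokens across the benchmark's catalog sizes while the Code Mode figure stays constant, which matches the shape of the 58%/84%/92% reductions reported above.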

Governance that directly reduces token exposure

Every request through Bifrost carries a virtual key, and each key is scoped to a specific set of tools. The scoping works at the tool level, not just the server level, so filesystem_read can be granted without filesystem_write from the same MCP server. The model only ever receives definitions for tools the key is allowed to call. Unauthorized tools don't load into context and don't cost tokens.
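A small sketch shows why tool-level scoping is also a token optimization. Key names and tool names here are invented for the example; the point is that unauthorized definitions never enter the filtered set:

```python
# Sketch of tool-level scoping: a virtual key allows specific tools, and
# only those definitions ever reach the model's context. Names are invented.
TOOL_DEFINITIONS = {
    "filesystem_read":    {"server": "filesystem", "schema": "..."},
    "filesystem_write":   {"server": "filesystem", "schema": "..."},
    "github_list_issues": {"server": "github",     "schema": "..."},
}

VIRTUAL_KEYS = {
    "vk-frontend-team": {"filesystem_read", "github_list_issues"},
}

def tools_for_key(key: str) -> dict:
    allowed = VIRTUAL_KEYS.get(key, set())
    return {name: d for name, d in TOOL_DEFINITIONS.items() if name in allowed}

scoped = tools_for_key("vk-frontend-team")
print(sorted(scoped))  # filesystem_write is never loaded, never billed
```

Filtering per-tool rather than per-server is what lets a read-only key exist at all when one MCP server exposes both read and write operations.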

At organizational scale, MCP Tool Groups make this manageable: a named collection of tools can be attached to any combination of keys, teams, customers, or providers. Bifrost resolves the right set at request time, indexed in memory and synced across cluster nodes, with no database query on the hot path.

A single endpoint with complete audit coverage

All connected MCP servers sit behind one /mcp endpoint. Claude Code connects once and sees every tool the virtual key allows. Adding new MCP servers to Bifrost surfaces them in Claude Code automatically, with no client-side change.

That single endpoint is also where cost attribution becomes possible. Every tool execution logs as a first-class entry with tool name, server, arguments, result, latency, virtual key, and the parent LLM request, alongside token costs and per-tool costs for tools that call paid external APIs.
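The fields listed above can be pictured as one structured log entry. The shape below is an illustrative sketch mirroring the prose, not Bifrost's actual log schema, and the values are invented:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolExecutionLog:
    # Illustrative shape only; field names follow the prose above,
    # not Bifrost's real log schema.
    tool_name: str
    server: str
    arguments: dict[str, Any]
    result: Any
    latency_ms: float
    virtual_key: str
    parent_llm_request_id: str
    token_cost_usd: float
    tool_cost_usd: float  # for tools that call paid external APIs

entry = ToolExecutionLog(
    tool_name="github_list_issues",
    server="github",
    arguments={"repo": "acme/api"},
    result={"count": 3},
    latency_ms=128.0,
    virtual_key="vk-frontend-team",
    parent_llm_request_id="req_abc123",
    token_cost_usd=0.0021,
    tool_cost_usd=0.0,
)
print(entry.tool_name, entry.virtual_key)
```

Because each entry carries both the virtual key and the parent LLM request, spend can be rolled up by team, by tool, or by workflow from the same records.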

Implementation: Claude Code on Bifrost

The integration is short because Bifrost runs as a drop-in replacement for existing SDKs and requires no application code changes.

  1. Register MCP clients: in the Bifrost dashboard, add each MCP server with its connection type (HTTP, SSE, or STDIO), endpoint, and any required headers.
  2. Enable Code Mode: toggle it on in the client settings. No schema changes, no redeployment. Token usage drops on the next request.
  3. Configure virtual keys and auto-execute: create scoped virtual keys for each consumer. For autonomous agent loops, allowlist read-only tools while keeping writes behind approval gates.
  4. Point Claude Code at Bifrost: add Bifrost as an MCP server in Claude Code's MCP settings using the gateway URL. Claude Code discovers the full tool set through that single connection.
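Step 4 can be sketched as a project-level `.mcp.json` entry. The gateway host, port, and authorization header below are assumptions for illustration; use your own deployment's gateway URL and whatever key header your Bifrost configuration expects:

```json
{
  "mcpServers": {
    "bifrost": {
      "type": "http",
      "url": "http://localhost:8080/mcp",
      "headers": {
        "Authorization": "Bearer <virtual-key>"
      }
    }
  }
}
```

From Claude Code's point of view this is one ordinary MCP server; every tool the virtual key permits appears behind it.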

Measuring Impact at Team Scale

Cost reductions only land with finance and platform leadership if they can be measured. Bifrost's observability provides the data required for that conversation:

  • Token cost by virtual key, tool, and MCP server, tracked over time.
  • Complete trace of every agent run: tools called, order, arguments, latency.
  • Combined spend view showing LLM token costs and tool costs side by side.
  • Native Prometheus metrics and OpenTelemetry (OTLP) integration for Grafana, Datadog, New Relic, and Honeycomb.

Teams evaluating Bifrost can also reference published performance benchmarks showing 11µs of overhead at 5,000 RPS, and the LLM Gateway Buyer's Guide for a full capability comparison.

The Wider Infrastructure Picture

MCP without governance and cost control becomes unsustainable as soon as a team moves past a single developer's local setup. Bifrost's MCP gateway addresses the full set of production concerns in one layer:

  • Scoped access through virtual keys and per-tool filtering.
  • Organizational governance with MCP Tool Groups.
  • Complete audit trails suitable for SOC 2, GDPR, HIPAA, and ISO 27001.
  • Per-tool cost visibility alongside LLM token usage.
  • Code Mode to reduce context cost without reducing capability.
  • The same gateway also handles LLM provider routing, automatic failover, load balancing, semantic caching, and unified key management across 20+ AI providers.

When model calls and tool calls flow through the same gateway, model tokens and tool costs sit in one audit log, under one access control model. Teams already running Claude Code on Bifrost can explore the Claude Code integration guide for workflow-specific implementation detail.

Bringing MCP Token Costs for Claude Code Under Control

Reducing MCP token costs for Claude Code isn't about trimming tools or accepting smaller capability surface. It's about moving governance and orchestration into the layer where they can actually scale. Bifrost's MCP gateway and Code Mode deliver up to 92% token reduction on large tool catalogs while giving platform teams the access control and cost attribution they need to run Claude Code across an engineering organization.

To see how Bifrost fits against your own Claude Code deployment, book a demo with the Bifrost team.
