
Kamya Shah

Cutting MCP Token Costs in Claude Code: A Practical Guide with Bifrost

Cut MCP token costs for Claude Code by up to 92% using Bifrost's MCP gateway and Code Mode. Here's how, and what Claude Code's built-ins miss.

If you've wired more than a couple of MCP servers into Claude Code, you've probably seen the pattern: token counts climb faster than expected, /context fills up before you've typed a prompt, and the API bill at month-end doesn't match how much "real work" the model did. The culprit isn't your tools. It's how the Model Context Protocol ships tool schemas into context on every single turn. To actually cut MCP token costs in Claude Code without throwing away capability, the fix has to live one layer deeper, at the gateway. This is where Bifrost, the open-source AI gateway by Maxim AI, comes in. This post walks through where the tokens really go, what Claude Code already does for you, and how Bifrost's MCP gateway with Code Mode drops token use by up to 92% on large tool catalogs.

Where MCP Tokens Actually Disappear

The thing most people miss about MCP is that tool definitions aren't loaded once per session; they're loaded once per message. Every MCP server you connect pushes its full schema (every tool name, description, and parameter) into the model's context on every turn. Wire up five servers with thirty tools each and you're shipping 150 tool definitions before Claude Code even reads your prompt.

The numbers are public and they're not small. A recent teardown found that a typical four-server Claude Code setup carries around 7,000 tokens of MCP overhead per message, with heavier configurations crossing 50,000 tokens before the first prompt. A separate analysis pegged multi-server setups at 15,000 to 20,000 tokens of overhead per turn on usage-based billing.

Three things make it worse the bigger your setup gets:

  • Overhead is per-message, not per-session. A 50-turn session pays the tax 50 times.
  • Unused tools still cost. A Playwright server's 22 browser tools ride along even when you're editing Python.
  • Descriptions are verbose by default. Most OSS MCP servers ship human-readable descriptions that inflate every tool's token cost.
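The per-message tax is easy to underestimate, so here's a back-of-envelope sketch. The 7,000-token figure comes from the teardown cited above; the session length and price are illustrative assumptions, not measured values:

```python
# Back-of-envelope: MCP schema overhead compounds per message, not per session.
# The 7,000-token figure is the cited estimate; turns and pricing are assumptions.

OVERHEAD_TOKENS_PER_MESSAGE = 7_000    # typical four-server setup (cited above)
TURNS_PER_SESSION = 50
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical $/1M input tokens

wasted_tokens = OVERHEAD_TOKENS_PER_MESSAGE * TURNS_PER_SESSION
wasted_dollars = wasted_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"{wasted_tokens:,} overhead tokens per session")       # 350,000
print(f"${wasted_dollars:.2f} per session on schemas alone")  # $1.05
```

Multiply that by every developer on the team, every working day, and the "schemas alone" line item stops looking trivial.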

And the spill-over hurts quality: overhead eats into the working context Claude actually needs, which pushes compaction earlier and makes long sessions flakier.

What Claude Code Already Does (And Where It Stops)

Credit where it's due: Anthropic has shipped real optimizations for this. Claude Code's cost docs cover tool search deferral, prompt caching, auto-compaction, model tiering, and preprocessing hooks. Tool search is the big one: once your tool definitions exceed a threshold, Claude Code defers the full definitions and keeps only tool names in context until Claude actually selects one. Reported savings land in the 13,000-token range for heavy sessions.

If you're a solo developer with two or three MCP servers running locally, this is enough. Where it runs out of road is at team scale:

  • Client-side, not org-side. Tool search deferral optimizes your session. It doesn't give a platform team control over which tools a given developer, team, or customer integration is actually allowed to call.
  • No orchestration savings. Even with deferral, every multi-step workflow still pays for intermediate tool results, model round-trips, and context reloads on each turn.
  • No shared visibility. /context and /mcp are per-developer introspection tools. There's no view at the org level showing which tools across which teams are burning tokens.

Past a certain scale, the question stops being "how do I trim my own session?" and starts being "how do I govern MCP for a team of fifty?" That needs an infrastructure layer.

How Bifrost Cuts MCP Token Costs in Claude Code

Bifrost sits between Claude Code and your MCP servers. Claude Code stops connecting to each server directly and instead talks to Bifrost's single /mcp endpoint. Bifrost handles discovery, governance, and execution, and, most importantly, it implements the pattern that actually crushes token cost: Code Mode.

The benchmark numbers from Bifrost's MCP gateway cost study are worth reading in full, but the short version: input tokens fell 58% at 96 tools, 84% at 251 tools, and 92% at 508 tools, with pass rate holding at 100% across all rounds.

Code Mode is the part that moves the needle

Code Mode is the single biggest lever. Rather than injecting tool definitions into context, Bifrost exposes your connected MCP servers as a virtual filesystem of lightweight Python stub files. The model reads only the stubs it actually needs, writes a short Python script to chain the tools together, and Bifrost runs that script in a sandboxed Starlark interpreter.

The model sees four meta-tools, period, regardless of whether you have 6 MCP servers or 60:

  • listToolFiles: list the servers and tools available.
  • readToolFile: load Python signatures for a server or tool.
  • getToolDocs: fetch documentation for a specific tool on demand.
  • executeToolCode: run the orchestration script against live bindings.
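To make the flow concrete, here's the shape of a script the model might hand to executeToolCode. The server names, tool names, and signatures are hypothetical, and the bindings are mocked out locally so the sketch runs on its own — in Bifrost, the sandbox injects the live bindings:

```python
# Sketch of a Code Mode orchestration script. In Bifrost's sandbox the tool
# functions below would be live bindings; here they're mocked so the pattern
# is runnable stand-alone. All names are hypothetical.

def github_list_issues(repo, state="open"):   # mock of a GitHub MCP tool
    return [{"number": 1, "title": "Flaky test"}]

def slack_post_message(channel, text):        # mock of a Slack MCP tool
    return {"ok": True, "channel": channel}

# The model chains tools in code, so intermediate results never re-enter
# the LLM context -- only the final summary does.
issues = github_list_issues("acme/widgets")
summary = f"{len(issues)} open issue(s): " + ", ".join(i["title"] for i in issues)
result = slack_post_message("#eng", summary)

print(summary)
print(result["ok"])
```

The key property: the issue list, however large, stays inside the script. Classic MCP would round-trip that entire payload through the model's context between the two tool calls.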

The pattern has independent validation. Anthropic's engineering team wrote about this approach, showing a Google Drive to Salesforce workflow dropping from 150,000 tokens to 2,000. Cloudflare reported the same exponential savings curve with their own implementation. Bifrost builds it natively into the gateway, picks Python over JavaScript (better LLM fluency), and adds the dedicated docs tool to compress context even further.

The payoff compounds the more MCP servers you add. Classic MCP scales linearly with tool count; every server you add is more overhead. Code Mode is bounded by what the model actually reads, so the curve flattens instead of climbing.
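The two scaling curves can be sketched numerically. The per-item token counts below are illustrative assumptions, not Bifrost's measurements — they just show why one curve climbs linearly and the other stays flat:

```python
# Classic MCP: context cost grows linearly with total tool count.
# Code Mode: cost is bounded by four meta-tools plus the stubs actually read.
# All per-item token counts are illustrative assumptions.

TOKENS_PER_TOOL_SCHEMA = 140   # assumed average full MCP tool definition
TOKENS_PER_META_TOOL = 150     # assumed cost of one Code Mode meta-tool
TOKENS_PER_STUB_READ = 80      # assumed cost of reading one Python stub

def classic_mcp_cost(total_tools):
    return total_tools * TOKENS_PER_TOOL_SCHEMA

def code_mode_cost(stubs_read):
    return 4 * TOKENS_PER_META_TOOL + stubs_read * TOKENS_PER_STUB_READ

for tools in (96, 251, 508):
    classic = classic_mcp_cost(tools)
    code = code_mode_cost(stubs_read=5)  # a task rarely touches more than a few tools
    print(f"{tools:>3} tools: classic={classic:,}  code_mode={code:,}")
```

Classic cost tracks the catalog size; Code Mode cost tracks the task, which is why the benchmark gap widens from 58% at 96 tools to 92% at 508.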

Virtual keys: if a tool shouldn't be callable, don't load it

Every request through Bifrost carries a virtual key, and each key is scoped to a specific set of tools. The scoping is per-tool, not per-server, so you can grant filesystem_read without granting filesystem_write from the same MCP server. The model only ever sees definitions for tools the key allows. Tools outside the scope don't show up, don't load, don't cost tokens.

At team scale, MCP Tool Groups take this further: define a named collection of tools once, then attach it to any combination of keys, teams, customers, or providers. Resolution happens in-memory at request time, synced across cluster nodes, no database query on the hot path.
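In spirit, per-tool scoping is just a filter applied before anything reaches the model. A minimal sketch of that logic — the key names, tool names, and data shapes are hypothetical, not Bifrost's actual config format:

```python
# Minimal sketch of per-tool virtual-key scoping. Key names, tool names, and
# data shapes are hypothetical -- Bifrost's real configuration differs.

VIRTUAL_KEYS = {
    "vk-data-team": {"filesystem_read", "postgres_query"},     # read-only scope
    "vk-release-bot": {"filesystem_read", "filesystem_write"},
}

ALL_TOOLS = ["filesystem_read", "filesystem_write", "postgres_query", "browser_click"]

def visible_tools(virtual_key):
    """Tools outside the key's scope are never loaded, so they cost no tokens."""
    allowed = VIRTUAL_KEYS.get(virtual_key, set())
    return [t for t in ALL_TOOLS if t in allowed]

print(visible_tools("vk-data-team"))    # ['filesystem_read', 'postgres_query']
print(visible_tools("vk-release-bot"))  # ['filesystem_read', 'filesystem_write']
```

Note that `vk-data-team` gets `filesystem_read` without `filesystem_write`, even though both come from the same MCP server — that's the per-tool (not per-server) granularity described above.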

One endpoint, one audit log

All connected MCP servers sit behind a single /mcp endpoint. Claude Code connects once and sees every tool the key permits. Add a new MCP server in Bifrost later and it shows up in Claude Code automatically, no client-side config change required.

That single endpoint is also where cost observability actually becomes possible. Every tool execution logs as a first-class entry: tool name, server, arguments, result, latency, virtual key, and the parent LLM request, with token and per-tool costs sitting side by side.
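Concretely, each execution can be thought of as one structured record. The field names and values below are an illustrative shape, not Bifrost's exact log schema:

```python
# Illustrative shape of a per-tool audit record. Field names and values are
# assumptions for illustration, not Bifrost's exact log schema.
import json

entry = {
    "tool": "filesystem_read",
    "server": "filesystem",
    "arguments": {"path": "/app/README.md"},
    "latency_ms": 12,
    "virtual_key": "vk-data-team",
    "parent_llm_request": "req_abc123",
    "tokens": {"input": 340, "output": 55},
}

print(json.dumps(entry, indent=2))
```

Because the virtual key and the parent LLM request ride along on every record, spend can be sliced by team, by tool, or by workflow without joining across systems.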

Getting Claude Code Running on Bifrost

The setup takes a few minutes, and your app code doesn't change because Bifrost is a drop-in SDK replacement.

  1. Register your MCP clients. In the Bifrost dashboard, add each MCP server with its connection type (HTTP, SSE, or STDIO), endpoint, and any required headers.
  2. Turn on Code Mode. One toggle in the client settings. No redeployment, no schema changes. Token usage drops on the next request.
  3. Configure virtual keys and auto-execute. Create scoped virtual keys for each consumer. For autonomous loops, allowlist read-only tools while keeping writes behind approval gates.
  4. Point Claude Code at Bifrost. Open Claude Code's MCP settings, add Bifrost as an MCP server using the gateway URL. Claude Code now sees a governed, token-efficient view of every MCP tool the key permits.
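Step 4 boils down to a single MCP server entry in Claude Code's settings. Here's a sketch of what that entry might look like, written as Python that emits the JSON — the URL and header name are placeholders, so check Bifrost's integration guide for the exact format:

```python
# Hypothetical Claude Code MCP entry pointing at a Bifrost gateway.
# The URL and the auth header name are placeholders, not documented values.
import json

mcp_settings = {
    "mcpServers": {
        "bifrost": {
            "type": "http",
            "url": "https://bifrost.internal.example.com/mcp",
            "headers": {"x-bf-vk": "vk-data-team"},  # scoped virtual key
        }
    }
}

print(json.dumps(mcp_settings, indent=2))
```

One entry replaces however many per-server entries you had before, and the virtual key in the header determines which tools this client ever sees.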

That's the full path from vanilla Claude Code to governed MCP with Code Mode.

Measuring What You Actually Saved

Cutting MCP token costs only matters if you can prove it to whoever pays the bill. Bifrost's observability gives you the numbers that decision-makers ask for:

  • Token cost by virtual key, by tool, and by MCP server, over time.
  • Full trace of every agent run: tools called, order, arguments, latency.
  • Combined spend view with LLM tokens and tool costs side by side.
  • Prometheus metrics and OpenTelemetry (OTLP) for Grafana, Datadog, Honeycomb, or New Relic.

For broader context on gateway performance and evaluation criteria, Bifrost's benchmarks document 11µs overhead at 5,000 RPS, and the LLM Gateway Buyer's Guide covers the full capability matrix.

The Bigger Picture: Production MCP Needs a Gateway

MCP without a governance layer doesn't survive the transition from "one developer's local setup" to "fifty engineers shipping to production." Bifrost's MCP gateway is the layer that makes that transition possible:

  • Scoped access via virtual keys and per-tool filtering.
  • Org-scale governance with MCP Tool Groups.
  • Complete audit trails for SOC 2, GDPR, HIPAA, and ISO 27001.
  • Per-tool cost visibility alongside LLM token usage.
  • Code Mode to slash context cost without cutting capability.
  • LLM provider routing, automatic failover, load balancing, semantic caching, and unified key management across 20+ AI providers, all from the same gateway.

Model tokens and tool costs end up in one audit log, under one access control model. Teams already running Claude Code on Bifrost can check the Claude Code integration guide for workflow-specific details.

Start Cutting MCP Token Costs for Claude Code

The way to cut MCP token costs in Claude Code isn't to trim tools and accept less capability. It's to move governance and orchestration into the gateway, where they belong. Bifrost's MCP gateway plus Code Mode delivers up to 92% token reduction on large catalogs while giving platform teams the access control and visibility they need to run Claude Code at scale.

To see what Bifrost looks like against your own Claude Code setup, book a demo with the Bifrost team.
