TL;DR: Claude Code token costs grow fast on agent-heavy workflows because every tool definition gets injected into the context. An AI gateway in front of Claude Code lets you cache responses, swap to cheaper models, and use Code Mode to cut tool definition overhead. After testing the setup with Bifrost, the largest single optimisation is Code Mode for MCP, which reduces tool definition tokens by 58% to 92.8% depending on tool count.
This post assumes familiarity with Claude Code, MCP server basics, and what ANTHROPIC_BASE_URL does in CLI agents.
## Where Claude Code Token Cost Comes From
Three places drive cost on a typical Claude Code workload.
First, the model itself. By default, Claude Code uses Sonnet for most tasks and escalates to Opus for harder ones. Opus is several times more expensive per token than Sonnet, and Sonnet several times more than Haiku.
Second, repeat work. Claude Code re-reads files, re-runs grep, and re-thinks the same problem inside long sessions. If two adjacent prompts hit the same code path, the second one is paid for in full unless something is caching it.
Third, the MCP tool catalog. Every tool definition from every connected MCP server gets injected into the context for every request. Anthropic's Model Context Protocol overview spells out how tool discovery works at the protocol level. With a few servers connected, you might be paying tens of thousands of tokens per request to describe tools that the model may never call.
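For a rough sense of scale, tool-definition overhead can be estimated with a characters-per-token heuristic. The ~4 chars/token figure and the toy schemas below are assumptions for illustration, not real MCP server output:

```python
import json

# Rough heuristic: ~4 characters per token for English-plus-JSON payloads.
# The tool schemas below are invented examples, not real MCP definitions.
def estimate_definition_tokens(tool_definitions: list[dict]) -> int:
    payload = json.dumps(tool_definitions)
    return len(payload) // 4

tools = [
    {
        "name": f"server_a.tool_{i}",
        "description": "Fetches a resource and returns structured JSON. " * 3,
        "inputSchema": {
            "type": "object",
            "properties": {"id": {"type": "string"}, "limit": {"type": "integer"}},
        },
    }
    for i in range(100)
]

# Even these thin 100-tool definitions land in the thousands of tokens;
# real schemas with examples and enums are far heavier.
print(estimate_definition_tokens(tools))
```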
A gateway addresses all three.
## Step 1: Point Claude Code at a Gateway
For Bifrost, set the environment variables Claude Code reads to discover its API host and key.
```shell
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
export ANTHROPIC_API_KEY=vk-prod-abc123
claude
```
Bifrost exposes a fully Anthropic-compatible endpoint, so Claude Code does not know it is talking to a gateway. The full integration steps are in the Bifrost Claude Code docs.
## Step 2: Pin Claude Code to a Cheaper Model When Appropriate
Claude Code respects model overrides at startup or mid-session. Through the gateway, you can pin Sonnet for default work and only switch to Opus when needed.
```shell
claude --model claude-sonnet-4-6
```
If you are routing through Bedrock or Vertex for compliance reasons, the gateway accepts pinning syntax that maps to those backends:
```shell
export ANTHROPIC_DEFAULT_SONNET_MODEL="bedrock/global.anthropic.claude-sonnet-4-6"
export ANTHROPIC_DEFAULT_OPUS_MODEL="vertex/claude-opus-4-7"
```
The Bifrost CLI agents overview covers Claude Code, Codex CLI, and Gemini CLI in the same setup.
## Step 3: Enable Semantic Caching
Claude Code re-asks similar questions inside the same session. Dual-layer caching catches both exact repeats and semantically similar prompts.
```yaml
semantic_cache:
  enabled: true
  vector_store: weaviate
  weaviate_url: ${WEAVIATE_URL}
  similarity_threshold: 0.92
  conversation_history_threshold: 3
  ttl_seconds: 86400
```
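The `similarity_threshold: 0.92` line is a cosine-similarity cutoff on prompt embeddings. A minimal sketch of the decision, with toy three-dimensional embeddings standing in for real vector-store output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

SIMILARITY_THRESHOLD = 0.92  # matches the config above

cached = [0.12, 0.85, 0.51]    # embedding of a cached prompt (toy values)
incoming = [0.10, 0.88, 0.47]  # near-identical rephrasing -> cache hit
unrelated = [0.90, 0.10, -0.30]  # different request -> cache miss

hit = cosine_similarity(cached, incoming) >= SIMILARITY_THRESHOLD
miss = cosine_similarity(cached, unrelated) >= SIMILARITY_THRESHOLD
```

A higher threshold trades hit rate for safety: 0.92 catches rephrasings while rejecting prompts that merely share vocabulary.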
Every request needs the x-bf-cache-key header to participate in caching. Claude Code does not set this header by default, so you have to either configure Bifrost to inject it per virtual key or wrap the CLI invocation with a header-setting proxy. The semantic caching docs cover the mechanics.
The cache is isolated by model and provider, so a Sonnet response is never returned for an Opus request.
## Step 4: Use Code Mode for MCP Servers
This is the biggest single optimisation in my testing. Code Mode replaces full tool definition injection with a Python stub generation approach. Instead of seeing every tool from every server, the model sees four meta-tools: listToolFiles, readToolFile, getToolDocs, and executeToolCode. The model loads only the tool definitions it actually needs.
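The demand-loading idea can be sketched as a tiny in-memory catalog behind two of the meta-tools (shown in snake_case; the stub contents and file paths are invented, not real Bifrost output):

```python
# Toy sketch of Code Mode's demand loading: the model sees only the
# meta-tools, and full stubs enter the context only when requested.
CATALOG = {
    "github-mcp/create_issue.py": "def create_issue(repo: str, title: str): ...",
    "github-mcp/list_prs.py": "def list_prs(repo: str): ...",
    "linear-mcp/create_ticket.py": "def create_ticket(team: str, title: str): ...",
}

def list_tool_files() -> list[str]:
    """listToolFiles: cheap index of available stubs (paths only)."""
    return sorted(CATALOG)

def read_tool_file(path: str) -> str:
    """readToolFile: load one stub's full definition on demand."""
    return CATALOG[path]

# The model pays for the short path list up front, then only for the
# one stub it actually needs:
files = list_tool_files()
stub = read_tool_file("github-mcp/create_issue.py")
```

The saving comes from the asymmetry: the path index costs a few tokens per tool, while a full definition with schema and docs costs hundreds.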
Bifrost publishes Code Mode benchmarks on its MCP gateway resource page:
| MCP tools connected | Token reduction | Pass rate |
|---|---|---|
| 96 tools | 58% | 100% |
| 251 tools | 84.5% | 100% |
| 508 tools | 92.8% | 100% |
At ~500 tools, the per-query payload drops from roughly 1.15M tokens to 83K, close to a 14x reduction. Tool calls execute inside a sandboxed Starlark interpreter, so behaviour is bounded and auditable. The same numbers and detail are in the Bifrost benchmarks resource.
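The 508-tool row is easy to sanity-check against the raw payload figures from the text above:

```python
before = 1_150_000  # tokens/query with all ~500 tool definitions injected
after = 83_000      # tokens/query with Code Mode (figures from the text)

reduction = 1 - after / before   # fraction of tokens eliminated
factor = before / after          # how many times smaller the payload is

print(f"{reduction:.1%} smaller, {factor:.1f}x reduction")
# -> 92.8% smaller, 13.9x reduction
```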
Configuring Code Mode for Claude Code's MCP setup:
```yaml
mcp:
  code_mode:
    enabled: true
    sandbox: starlark
  servers:
    - name: github-mcp
      type: stdio
      command: ["npx", "-y", "@modelcontextprotocol/server-github"]
    - name: linear-mcp
      type: http
      url: ${LINEAR_MCP_URL}
```
The MCP tool execution docs cover the sandbox model in depth.
## Step 5: Set Per-Session Budget Caps
Long Claude Code sessions can run for hours. Per-virtual-key budget caps stop runaway sessions before they hit your finance team.
```yaml
virtual_keys:
  - key_name: claude-code-dev
    key: vk-cc-dev
    rate_limit:
      token_limit: 5000000
      token_limit_duration: "1d"
    budget_limit: 50.00
    budget_duration: "1d"
    allowed_models: ["claude-sonnet-4-6", "claude-opus-4-7"]
```
Reset durations are calendar-aligned (1d, 1w, 1M, 1Y in UTC) so caps line up with billing cycles. The four-tier budget hierarchy (Customer, Team, Virtual Key, Provider Config) is documented on the Bifrost governance resource.
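Calendar alignment means every request in the same UTC day shares one budget window, regardless of when the session started. A minimal illustration of the documented `1d` behaviour (the helper name is mine):

```python
from datetime import datetime, timezone

def day_window_start(now: datetime) -> datetime:
    """Start of the UTC calendar day containing `now` (a "1d" window)."""
    now = now.astimezone(timezone.utc)
    return now.replace(hour=0, minute=0, second=0, microsecond=0)

# Two requests two minutes apart can fall in different budget windows
# if they straddle UTC midnight:
a = datetime(2025, 6, 1, 23, 59, tzinfo=timezone.utc)
b = datetime(2025, 6, 2, 0, 1, tzinfo=timezone.utc)
```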
## Comparison: Optimisation Levers
| Lever | Mechanism | Best for |
|---|---|---|
| Model pinning | Default to Sonnet, opt into Opus | All Claude Code workloads |
| Semantic caching | Vector similarity match | Sessions with repeated patterns |
| Code Mode (MCP) | Stub generation, demand-loaded tools | Large MCP catalogs |
| Per-VK budgets | Hard caps with calendar resets | Long-running sessions |
## Trade-offs and Limitations
Bifrost is self-hosted only; there is no managed cloud offering. If you do not have the ops capacity to run a gateway, that is real overhead.
Code Mode requires the model to call meta-tools to load definitions, which adds a small number of extra round trips compared to the upfront-definition approach. On large catalogs that is a clear net win. On a 5-tool setup it is not.
Semantic caching introduces freshness questions for code workflows. If your repo state changed but the cached prompt looks identical, you get a stale response. Tight TTLs and per-key cache scoping reduce the risk.
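One mitigation is to fold repo state into the cache key so a new commit invalidates prior entries. This is an illustrative pattern layered on the `x-bf-cache-key` header, not a Bifrost feature; note that uncommitted edits would still need a dirty-tree hash on top:

```python
import hashlib
import subprocess

def current_head() -> str:
    """Current git commit hash, or a sentinel outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=False,
        )
        return out.stdout.strip() or "no-repo"
    except OSError:
        return "no-repo"

def repo_scoped_cache_key(base_key: str) -> str:
    """Cache key that changes whenever the repo's HEAD moves."""
    return hashlib.sha256(f"{base_key}:{current_head()}".encode()).hexdigest()
```

With this, a cached answer about `src/foo.py` stops matching the moment a commit lands, at the cost of a cold cache after every commit.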
OpenRouter is not compatible because of a tool call streaming issue, so if you currently route Claude Code through OpenRouter you cannot keep that path through Bifrost.
Bifrost is newer than LiteLLM, so the community and ecosystem of integrations are still maturing.
## Quick Recap
- Three cost drivers: model choice, repeat work, and MCP tool definition payload
- Pin Claude Code to Sonnet by default, opt into Opus only when needed
- Dual-layer semantic caching captures repeat patterns inside long sessions
- Code Mode for MCP cuts tool definition tokens by 58% to 92.8% depending on catalog size
- Per-virtual-key budgets put hard caps on runaway sessions
GitHub: https://git.new/bifrost | Docs: https://getmax.im/bifrostdocs | Website: https://getmax.im/bifrost-home