TL;DR: Claude Code token costs grow fast on agent-heavy workflows because every tool definition gets injected into the context. An AI gateway in front of Claude Code lets you cache responses, swap to cheaper models, and use Code Mode to cut tool definition overhead. After testing the setup with Bifrost, the largest single optimisation is Code Mode for MCP, which reduces tool definition tokens by 58% to 92.8% depending on tool count.
This post assumes familiarity with Claude Code, MCP server basics, and what ANTHROPIC_BASE_URL does in CLI agents.
## Where Claude Code Token Cost Comes From
Three places drive cost on a typical Claude Code workload.
First, the model itself. By default, Claude Code uses Sonnet for most tasks and escalates to Opus for harder ones. Opus is several times more expensive per token than Sonnet, and Sonnet several times more than Haiku.
Second, repeat work. Claude Code re-reads files, re-runs grep, and re-thinks the same problem inside long sessions. If two adjacent prompts hit the same code path, the second one is paid for in full unless something is caching it.
Third, the MCP tool catalog. Every tool definition from every connected MCP server gets injected into the context for every request. Anthropic's Model Context Protocol overview spells out how tool discovery works at the protocol level. With a few servers connected, you might be paying tens of thousands of tokens per request to describe tools that the model may never call.
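For a rough sense of scale, tool-definition overhead can be estimated with a characters-per-token heuristic. The ~4 chars/token figure and the toy schemas below are assumptions for illustration, not real MCP server output:

```python
import json

# Rough heuristic: ~4 characters per token for English-plus-JSON payloads.
# The tool schemas below are invented examples, not real MCP definitions.
def estimate_definition_tokens(tool_definitions: list[dict]) -> int:
    payload = json.dumps(tool_definitions)
    return len(payload) // 4

tools = [
    {
        "name": f"server_a.tool_{i}",
        "description": "Fetches a resource and returns structured JSON. " * 3,
        "inputSchema": {
            "type": "object",
            "properties": {"id": {"type": "string"}, "limit": {"type": "integer"}},
        },
    }
    for i in range(100)
]

# Even these thin 100-tool definitions land in the thousands of tokens;
# real schemas with examples and enums are far heavier.
print(estimate_definition_tokens(tools))
```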
A gateway addresses all three.
## Step 1: Point Claude Code at a Gateway
For Bifrost, set the environment variables Claude Code reads to discover its API host and key.
```shell
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
export ANTHROPIC_API_KEY=vk-prod-abc123
claude
```
Bifrost exposes a fully Anthropic-compatible endpoint, so Claude Code does not know it is talking to a gateway. The full integration steps are in the Bifrost Claude Code docs.
## Step 2: Pin Claude Code to a Cheaper Model When Appropriate
Claude Code respects model overrides at startup or mid-session. Through the gateway, you can pin Sonnet for default work and only switch to Opus when needed.
```shell
claude --model claude-sonnet-4-6
```
If you are routing through Bedrock or Vertex for compliance reasons, the gateway accepts pinning syntax that maps to those backends:
```shell
export ANTHROPIC_DEFAULT_SONNET_MODEL="bedrock/global.anthropic.claude-sonnet-4-6"
export ANTHROPIC_DEFAULT_OPUS_MODEL="vertex/claude-opus-4-7"
```
The Bifrost CLI agents overview covers Claude Code, Codex CLI, and Gemini CLI in the same setup.
## Step 3: Enable Semantic Caching
Claude Code re-asks similar questions inside the same session. Dual-layer caching catches both exact repeats and semantically similar prompts.
```yaml
semantic_cache:
  enabled: true
  vector_store: weaviate
  weaviate_url: ${WEAVIATE_URL}
  similarity_threshold: 0.92
  conversation_history_threshold: 3
  ttl_seconds: 86400
```
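The `similarity_threshold: 0.92` line is a cosine-similarity cutoff on prompt embeddings. A minimal sketch of the decision, with toy three-dimensional embeddings standing in for real vector-store output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

SIMILARITY_THRESHOLD = 0.92  # matches the config above

cached = [0.12, 0.85, 0.51]    # embedding of a cached prompt (toy values)
incoming = [0.10, 0.88, 0.47]  # near-identical rephrasing -> cache hit
unrelated = [0.90, 0.10, -0.30]  # different request -> cache miss

hit = cosine_similarity(cached, incoming) >= SIMILARITY_THRESHOLD
miss = cosine_similarity(cached, unrelated) >= SIMILARITY_THRESHOLD
```

A higher threshold trades hit rate for safety: 0.92 catches rephrasings while rejecting prompts that merely share vocabulary.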
Every request needs the x-bf-cache-key header to participate in caching. Claude Code does not set this header by default, so you have to either configure Bifrost to inject it per virtual key or wrap the CLI invocation with a header-setting proxy. The semantic caching docs cover the mechanics.
The cache is isolated by model and provider, so a Sonnet response is never returned for an Opus request.
## Step 4: Use Code Mode for MCP Servers
This is the biggest single optimisation in my testing. Code Mode replaces full tool definition injection with a Python stub generation approach. Instead of seeing every tool from every server, the model sees four meta-tools: listToolFiles, readToolFile, getToolDocs, and executeToolCode. The model loads only the tool definitions it actually needs.
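The demand-loading idea can be sketched as a tiny in-memory catalog behind two of the meta-tools (shown in snake_case; the stub contents and file paths are invented, not real Bifrost output):

```python
# Toy sketch of Code Mode's demand loading: the model sees only the
# meta-tools, and full stubs enter the context only when requested.
CATALOG = {
    "github-mcp/create_issue.py": "def create_issue(repo: str, title: str): ...",
    "github-mcp/list_prs.py": "def list_prs(repo: str): ...",
    "linear-mcp/create_ticket.py": "def create_ticket(team: str, title: str): ...",
}

def list_tool_files() -> list[str]:
    """listToolFiles: cheap index of available stubs (paths only)."""
    return sorted(CATALOG)

def read_tool_file(path: str) -> str:
    """readToolFile: load one stub's full definition on demand."""
    return CATALOG[path]

# The model pays for the short path list up front, then only for the
# one stub it actually needs:
files = list_tool_files()
stub = read_tool_file("github-mcp/create_issue.py")
```

The saving comes from the asymmetry: the path index costs a few tokens per tool, while a full definition with schema and docs costs hundreds.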
Bifrost publishes Code Mode benchmarks on its MCP gateway resource page:
| MCP tools connected | Token reduction | Pass rate |
|---|---|---|
| 96 tools | 58% | 100% |
| 251 tools | 84.5% | 100% |
| 508 tools | 92.8% | 100% |
At ~500 tools, the per-query payload drops from roughly 1.15M tokens to 83K, close to a 14x reduction. Tool calls execute inside a sandboxed Starlark interpreter, so behaviour is bounded and auditable. The same numbers and detail are in the Bifrost benchmarks resource.
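The 508-tool row is easy to sanity-check against the raw payload figures from the text above:

```python
before = 1_150_000  # tokens/query with all ~500 tool definitions injected
after = 83_000      # tokens/query with Code Mode (figures from the text)

reduction = 1 - after / before   # fraction of tokens eliminated
factor = before / after          # how many times smaller the payload is

print(f"{reduction:.1%} smaller, {factor:.1f}x reduction")
# -> 92.8% smaller, 13.9x reduction
```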
Configuring Code Mode for Claude Code's MCP setup:
```yaml
mcp:
  code_mode:
    enabled: true
    sandbox: starlark
  servers:
    - name: github-mcp
      type: stdio
      command: ["npx", "-y", "@modelcontextprotocol/server-github"]
    - name: linear-mcp
      type: http
      url: ${LINEAR_MCP_URL}
```
The MCP tool execution docs cover the sandbox model in depth.
## Step 5: Set Per-Session Budget Caps
Long Claude Code sessions can run for hours. Per-virtual-key budget caps stop runaway sessions before they hit your finance team.
```yaml
virtual_keys:
  - key_name: claude-code-dev
    key: vk-cc-dev
    rate_limit:
      token_limit: 5000000
      token_limit_duration: "1d"
    budget_limit: 50.00
    budget_duration: "1d"
    allowed_models: ["claude-sonnet-4-6", "claude-opus-4-7"]
```
Reset durations are calendar-aligned (1d, 1w, 1M, 1Y in UTC) so caps line up with billing cycles. The four-tier budget hierarchy (Customer, Team, Virtual Key, Provider Config) is documented on the Bifrost governance resource.
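Calendar alignment means every request in the same UTC day shares one budget window, regardless of when the session started. A minimal illustration of the documented `1d` behaviour (the helper name is mine):

```python
from datetime import datetime, timezone

def day_window_start(now: datetime) -> datetime:
    """Start of the UTC calendar day containing `now` (a "1d" window)."""
    now = now.astimezone(timezone.utc)
    return now.replace(hour=0, minute=0, second=0, microsecond=0)

# Two requests two minutes apart can fall in different budget windows
# if they straddle UTC midnight:
a = datetime(2025, 6, 1, 23, 59, tzinfo=timezone.utc)
b = datetime(2025, 6, 2, 0, 1, tzinfo=timezone.utc)
```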
## Comparison: Optimisation Levers
| Lever | Mechanism | Best for |
|---|---|---|
| Model pinning | Default to Sonnet, opt into Opus | All Claude Code workloads |
| Semantic caching | Vector similarity match | Sessions with repeated patterns |
| Code Mode (MCP) | Stub generation, demand-loaded tools | Large MCP catalogs |
| Per-VK budgets | Hard caps with calendar resets | Long-running sessions |
## Trade-offs and Limitations
Bifrost is self-hosted only; there is no managed cloud offering. If you do not have the ops capacity to run a gateway, that is real overhead.
Code Mode requires the model to call meta-tools to load definitions, which adds a small number of extra round trips compared to the upfront-definition approach. On large catalogs that is a clear net win. On a 5-tool setup it is not.
Semantic caching introduces freshness questions for code workflows. If your repo state changed but the cached prompt looks identical, you get a stale response. Tight TTLs and per-key cache scoping reduce the risk.
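One mitigation is to fold repo state into the cache key so a new commit invalidates prior entries. This is an illustrative pattern layered on the `x-bf-cache-key` header, not a Bifrost feature; note that uncommitted edits would still need a dirty-tree hash on top:

```python
import hashlib
import subprocess

def current_head() -> str:
    """Current git commit hash, or a sentinel outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=False,
        )
        return out.stdout.strip() or "no-repo"
    except OSError:
        return "no-repo"

def repo_scoped_cache_key(base_key: str) -> str:
    """Cache key that changes whenever the repo's HEAD moves."""
    return hashlib.sha256(f"{base_key}:{current_head()}".encode()).hexdigest()
```

With this, a cached answer about `src/foo.py` stops matching the moment a commit lands, at the cost of a cold cache after every commit.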
OpenRouter is not compatible because of a tool call streaming issue, so if you currently route Claude Code through OpenRouter you cannot keep that path through Bifrost.
Bifrost is newer than LiteLLM, so the community and ecosystem of integrations are still maturing.
## Quick Recap
- Three cost drivers: model choice, repeat work, and MCP tool definition payload
- Pin Claude Code to Sonnet by default, opt into Opus only when needed
- Dual-layer semantic caching captures repeat patterns inside long sessions
- Code Mode for MCP cuts tool definition tokens by 58% to 92.8% depending on catalog size
- Per-virtual-key budgets put hard caps on runaway sessions
GitHub: https://git.new/bifrost | Docs: https://getmax.im/bifrostdocs | Website: https://getmax.im/bifrost-home