<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kamya Shah</title>
    <description>The latest articles on DEV Community by Kamya Shah (@kamya_shah_e69d5dd78f831c).</description>
    <link>https://dev.to/kamya_shah_e69d5dd78f831c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3522106%2F50d11e9f-8be6-4fbb-b034-1c4168bf3a12.jpeg</url>
      <title>DEV Community: Kamya Shah</title>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kamya_shah_e69d5dd78f831c"/>
    <language>en</language>
    <item>
      <title>Best MCP Gateway for Claude Code to Cut Token Costs</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:52:52 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/best-mcp-gateway-for-claude-code-to-cut-token-costs-2joo</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/best-mcp-gateway-for-claude-code-to-cut-token-costs-2joo</guid>
      <description>&lt;p&gt;If you run Claude Code with multiple MCP servers, you have probably noticed that token costs grow faster than expected. The reason is architectural, not accidental: every MCP server you connect loads its full tool catalog into the context window on every single request. Before Claude Code processes your actual task, it has already consumed thousands of tokens in tool definitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost&lt;/strong&gt;, the open-source AI gateway by Maxim AI, solves this with &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;, an execution model that reduces MCP token costs by 50% to 92% without trimming tools or losing capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Token Cost Problem Is Structural, Not Incidental
&lt;/h2&gt;

&lt;p&gt;MCP has crossed &lt;a href="https://www.getmaxim.ai/articles/best-mcp-gateway-in-2026-how-bifrost-cuts-token-usage-by-50/" rel="noopener noreferrer"&gt;97 million monthly downloads&lt;/a&gt; and is now standard infrastructure for AI agents. The protocol itself is well-designed. The cost problem is a consequence of how tool discovery works by default.&lt;/p&gt;

&lt;p&gt;When Claude Code connects directly to MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each server exposes tool definitions containing names, descriptions, input schemas, and parameter types.&lt;/li&gt;
&lt;li&gt;All definitions from all connected servers are injected into the context window before every request.&lt;/li&gt;
&lt;li&gt;A single tool definition runs 150 to 300 tokens. Fifty tools across five servers translates to 7,500 to 15,000 tokens of overhead per call.&lt;/li&gt;
&lt;li&gt;In multi-step workflows, intermediate tool results also pass back through the model on each turn, stacking token costs further.&lt;/li&gt;
&lt;/ul&gt;
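&lt;p&gt;A back-of-envelope sketch of that arithmetic (the 150 to 300 token range is the estimate above, not a measured constant):&lt;/p&gt;

```python
# Back-of-envelope cost of direct MCP tool injection: every connected
# tool's definition is re-sent on every single request.
TOKENS_PER_TOOL = (150, 300)  # typical size range of one tool definition

def injection_overhead(num_tools):
    """Token overhead added to one request by num_tools definitions."""
    lo, hi = TOKENS_PER_TOOL
    return num_tools * lo, num_tools * hi

# 50 tools across five servers, as in the example above:
print(injection_overhead(50))   # (7500, 15000) tokens per request

# Over a 200-request coding session, the overhead alone reaches:
lo, hi = injection_overhead(50)
print(lo * 200, hi * 200)       # 1500000 3000000 tokens
```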

&lt;p&gt;Trimming your tool list is the standard workaround. It trades capability for cost control. An MCP gateway eliminates the need for that trade-off entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  How an MCP Gateway Addresses This
&lt;/h2&gt;

&lt;p&gt;An MCP gateway sits between Claude Code and all your tool servers as a single aggregation and control layer. Claude Code connects once to the gateway. The gateway manages all server connections, tool discovery, routing, and execution behind that single endpoint.&lt;/p&gt;

&lt;p&gt;For Claude Code specifically, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One endpoint, all tools&lt;/strong&gt;: Add or remove MCP servers in the gateway and they appear or disappear in Claude Code automatically, no client config changes needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoped tool visibility&lt;/strong&gt;: Control exactly which tools each developer or workflow can see using virtual keys, reducing context overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token-efficient execution&lt;/strong&gt;: Replace full tool injection with an on-demand model that loads only what the current task requires.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching&lt;/strong&gt;: Serve repeated or similar queries from cache instead of the provider.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; functions as both an MCP client (connecting to external tool servers) and an MCP server (exposing a governed endpoint to Claude Code). That dual role is what enables centralized control without changing how Claude Code operates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code Mode: How Bifrost Achieves 50-92% Token Reduction
&lt;/h2&gt;

&lt;p&gt;Standard MCP has no concept of lazy loading. Every tool from every server goes into context, every time. As you add servers, costs scale linearly with the tool count, and multi-step workflows compound that overhead further.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; replaces that model entirely. The approach draws on research published by &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team&lt;/a&gt;, which found that switching from direct tool calls to code-based orchestration reduced context from 150,000 tokens to 2,000 for a complex multi-tool workflow.&lt;/p&gt;

&lt;p&gt;Instead of injecting raw tool definitions, Code Mode represents connected MCP servers as lightweight Python stub files in a virtual filesystem. The model uses four meta-tools to work with them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Meta-tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;listToolFiles&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lists available servers and tools by name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;readToolFile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retrieves Python function signatures for a specific server or tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;getToolDocs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Loads full documentation for a tool before execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;executeToolCode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs the orchestration script in a sandboxed interpreter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The flow: Claude reads the stub for the relevant server, writes a short Python orchestration script, and calls &lt;code&gt;executeToolCode&lt;/code&gt;. Bifrost executes it in a Starlark sandbox and returns the final result. Intermediate tool outputs never touch the model context. The complete tool catalog never enters the context window.&lt;/p&gt;
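&lt;p&gt;A hypothetical sketch of what such an orchestration script could look like. The stub names (&lt;code&gt;list_issues&lt;/code&gt;, &lt;code&gt;post_message&lt;/code&gt;) and mock implementations below are invented for illustration and are not Bifrost's generated API; in Code Mode the real stubs would call out to MCP servers from inside the sandbox:&lt;/p&gt;

```python
# Mock tool implementations standing in for two real MCP servers
# (an issue tracker and a chat tool); names are hypothetical.
def list_issues(repo, label):
    return [{"id": 1, "title": "Fix login bug", "label": label},
            {"id": 2, "title": "Update docs", "label": label}]

def post_message(channel, text):
    return {"ok": True, "channel": channel}

def orchestrate():
    # All intermediate results stay inside the sandbox; only the
    # final one-line summary is returned to the model's context.
    issues = list_issues("acme/webapp", label="urgent")
    titles = [i["title"] for i in issues]
    post_message("#triage", "Urgent issues: " + ", ".join(titles))
    return f"Posted {len(issues)} urgent issues to #triage"

print(orchestrate())  # only this short string re-enters model context
```

&lt;p&gt;Note that the intermediate &lt;code&gt;issues&lt;/code&gt; list never leaves the function; only the summary string would pass back through the model.&lt;/p&gt;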

&lt;p&gt;&lt;strong&gt;Benchmark results from three controlled test rounds:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Without Code Mode&lt;/th&gt;
&lt;th&gt;With Code Mode&lt;/th&gt;
&lt;th&gt;Cost Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;6 servers, 96 tools&lt;/td&gt;
&lt;td&gt;$104.04&lt;/td&gt;
&lt;td&gt;$46.06&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11 servers, 251 tools&lt;/td&gt;
&lt;td&gt;$180.07&lt;/td&gt;
&lt;td&gt;$29.80&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;83.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 servers, 508 tools&lt;/td&gt;
&lt;td&gt;$377.00&lt;/td&gt;
&lt;td&gt;$29.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The savings compound as MCP footprint grows because Code Mode's cost is bounded by what the model reads, not by how many tools are registered. Full benchmark data and methodology are in Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;published performance benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Code Mode also cuts latency by 40% on multi-tool tasks. Rather than five separate tool calls each requiring a provider round trip, the model writes one script that executes all five sequentially. The Starlark sandbox is intentionally constrained: no file I/O, no network access, no imports. Tool calls and basic Python-like logic only. This makes it safe to enable inside &lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; for fully automated execution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Claude Code to Bifrost
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code integration&lt;/a&gt; is one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add &lt;span class="nt"&gt;--transport&lt;/span&gt; http bifrost http://localhost:8080/mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Virtual Key authentication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add-json bifrost &lt;span class="s1"&gt;'{"type":"http","url":"http://localhost:8080/mcp","headers":{"Authorization":"Bearer your-virtual-key"}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, Claude Code routes all MCP traffic through Bifrost. New servers added to the gateway surface in Claude Code automatically. The &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;full setup guide&lt;/a&gt; covers Code Mode activation, virtual key scoping, and environment-specific configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tool Filtering: The Second Cost Lever
&lt;/h2&gt;

&lt;p&gt;Unscoped tool access is a separate token cost vector that compounds with the tool injection problem. When every Claude Code session can see every tool from every server, the context includes tools with no relevance to the current task.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key system&lt;/a&gt; scopes tool access at the individual tool level. A key for a developer's day-to-day workflow can allow &lt;code&gt;filesystem_read&lt;/code&gt; while blocking &lt;code&gt;filesystem_write&lt;/code&gt; from the same MCP server. Admin tooling sits behind a separate key that standard developer keys cannot reach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;Tool Groups&lt;/a&gt; let you manage this at scale: define named collections of tools from one or more servers, then attach them to any combination of virtual keys, teams, or users. Bifrost resolves the permitted set at request time from memory, with no database queries. The result is that Claude Code sees a scoped, relevant tool list on every request, and that smaller list compounds the savings from Code Mode.&lt;/p&gt;




&lt;h2&gt;
  
  
  Semantic Caching
&lt;/h2&gt;

&lt;p&gt;Development sessions generate a lot of repetition: the same file structure queries, the same dependency lookups, the same documentation requests. Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; matches incoming requests against previous ones by meaning rather than exact string match. "How do I sort an array in Python?" and "Python array sorting?" hit the same cache entry and return without touching the provider.&lt;/p&gt;

&lt;p&gt;For Claude Code workflows that return to the same codebase context repeatedly, cache hit rates are high and the savings stack on top of Code Mode and tool filtering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability at the Tool Level
&lt;/h2&gt;

&lt;p&gt;Every tool execution is logged as a first-class entry in Bifrost: tool name, source server, arguments, response, latency, the virtual key that triggered it, and the upstream LLM request. Any Claude Code session is fully traceable: which tools were called, in what order, what each returned.&lt;/p&gt;

&lt;p&gt;The built-in dashboard displays real-time breakdowns of token consumption, tool call frequency, and per-session costs. For production setups, Bifrost exposes Prometheus metrics and OpenTelemetry traces, compatible with Grafana, Datadog, and New Relic. Per-tool pricing configuration captures external API costs from tools that call paid third-party services, giving a complete view of what each agent run actually costs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Capability Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;Direct MCP&lt;/th&gt;
&lt;th&gt;Generic gateways&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code Mode (50-92% token savings)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Virtual key tool scoping&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-command Claude Code setup&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted / in-VPC&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-tool audit logging&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent Mode (autonomous execution)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-provider LLM routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Code Mode is the differentiator no other production MCP gateway offers. The orchestration-first execution model keeps token cost flat regardless of how many servers are connected.&lt;/p&gt;




&lt;h2&gt;
  
  
  More Than an MCP Gateway
&lt;/h2&gt;

&lt;p&gt;Beyond MCP, Bifrost routes Claude Code traffic across 20+ LLM providers through a single OpenAI-compatible API. Teams can run Claude Code against different model providers per task type, or cap per-developer spend, entirely at the gateway layer with no changes to Claude Code configuration.&lt;/p&gt;

&lt;p&gt;Enterprise deployments extend this with &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC hosting&lt;/a&gt;, RBAC, SSO via Okta or Microsoft Entra, &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt; for SOC 2 and HIPAA compliance, and &lt;a href="https://docs.getbifrost.ai/enterprise/mcp-with-fa" rel="noopener noreferrer"&gt;MCP with federated authentication&lt;/a&gt; for turning existing internal APIs into MCP tools without writing a custom server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Start Bifrost with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @maximai/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full MCP gateway setup, including Code Mode and Claude Code integration, is in the &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;Bifrost MCP docs&lt;/a&gt;. The &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost MCP Gateway blog post&lt;/a&gt; covers access control architecture and Code Mode benchmarks in full detail.&lt;/p&gt;

&lt;p&gt;For enterprise deployments, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Use Bifrost CLI with Coding Agents like Claude Code</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:51:52 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/how-to-use-bifrost-cli-with-coding-agents-like-claude-code-1ieo</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/how-to-use-bifrost-cli-with-coding-agents-like-claude-code-1ieo</guid>
      <description>&lt;p&gt;The Bifrost CLI wires up coding agents like Claude Code to your &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; in a single command. Rather than hand-configuring base URLs, shuffling API keys between providers, and tweaking config files for each agent, you just run &lt;code&gt;bifrost&lt;/code&gt; in your terminal, pick the agent, pick the model, and get to work. This walkthrough covers how to use the Bifrost CLI with Claude Code and the other supported coding agents, starting from gateway setup and moving into workflows like tabbed sessions, git worktrees, and automatic MCP attach.&lt;/p&gt;

&lt;p&gt;Coding agents are now deeply embedded in how engineering teams ship. Anthropic notes that &lt;a href="https://www.anthropic.com/product/claude-code" rel="noopener noreferrer"&gt;the majority of code at Anthropic is now written by Claude Code&lt;/a&gt;, with engineers spending more of their time on architecture, review, and agent orchestration. But as teams layer in multiple agents (Claude Code for heavy refactors, Codex CLI for quick edits, Gemini CLI for model-specific work), the per-agent configuration burden starts to stack up fast. The Bifrost CLI folds all of that into one launcher.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Bifrost CLI Actually Does
&lt;/h2&gt;

&lt;p&gt;Think of the Bifrost CLI as an interactive terminal launcher for any supported coding agent, routed through your Bifrost gateway. It takes care of provider config, model selection, API key injection, and MCP auto-attach so you do not have to. Bifrost itself is the &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;open-source AI gateway built by Maxim AI&lt;/a&gt;, exposing 20+ LLM providers behind a single OpenAI-compatible API with roughly 11 microseconds of overhead at 5,000 RPS.&lt;/p&gt;

&lt;p&gt;Out of the box, the CLI supports four coding agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; (binary: &lt;code&gt;claude&lt;/code&gt;, provider path: &lt;code&gt;/anthropic&lt;/code&gt;), including MCP auto-attach and git worktree support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex CLI&lt;/strong&gt; (binary: &lt;code&gt;codex&lt;/code&gt;, provider path: &lt;code&gt;/openai&lt;/code&gt;), with &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; pointed at &lt;code&gt;{base}/openai/v1&lt;/code&gt; and model overrides via &lt;code&gt;--model&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini CLI&lt;/strong&gt; (binary: &lt;code&gt;gemini&lt;/code&gt;, provider path: &lt;code&gt;/genai&lt;/code&gt;), with model overrides via &lt;code&gt;--model&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opencode&lt;/strong&gt; (binary: &lt;code&gt;opencode&lt;/code&gt;, provider path: &lt;code&gt;/openai&lt;/code&gt;), with custom models wired through a generated Opencode runtime config&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per-agent specifics live in the &lt;a href="https://docs.getbifrost.ai/cli-agents/overview" rel="noopener noreferrer"&gt;CLI agents reference&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Route Coding Agents Through Bifrost at All
&lt;/h2&gt;

&lt;p&gt;Pushing Claude Code and other coding agents through Bifrost unlocks three concrete wins for engineering teams: one unified model catalog, centralized governance across coding agent spend, and shared MCP tool configuration. Instead of every engineer wiring up API keys inside their personal agent setup, and every agent having its own tool list, Bifrost acts as a single control plane for all of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  One catalog, every model
&lt;/h3&gt;

&lt;p&gt;Claude Code ships configured for Claude Opus and Sonnet out of the box, but teams often want room to choose. Some tasks map better to GPT-4o, some to Gemini, some to a local model for speed or cost. When you launch Claude Code via the Bifrost CLI, it hits Bifrost's OpenAI-compatible API instead of Anthropic directly, which means any of Bifrost's 20+ supported providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex AI, Azure OpenAI, Gemini, Groq, Mistral, Cohere, Cerebras, Ollama, and more) can sit behind your coding agent. Bifrost's &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement design&lt;/a&gt; is what makes this seamless: the agent believes it is talking to OpenAI or Anthropic, and Bifrost silently handles routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budgets, rate limits, and spend attribution
&lt;/h3&gt;

&lt;p&gt;Coding agents eat tokens. A single multi-file refactor in Claude Code can chew through hundreds of thousands of tokens, and the cost scales linearly as your team grows. &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost governance&lt;/a&gt; treats virtual keys as the core governance primitive, so you can attach per-engineer or per-team budgets, rate limits, and model-access rules to them. Senior engineers might get the expensive reasoning models; juniors default to cost-efficient ones. Every request is attributed, every dollar is visible on the dashboard, and budgets are enforced at the virtual-key level. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;enterprise governance resource page&lt;/a&gt; goes deeper on the full model for larger engineering orgs.&lt;/p&gt;
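&lt;p&gt;The enforcement model can be sketched as a per-key check at request time. Key names, budgets, and model identifiers below are invented; this illustrates the idea rather than Bifrost's actual logic:&lt;/p&gt;

```python
# Sketch of virtual-key governance: each key carries a budget and a
# model allowlist, checked before a request reaches any provider.
class VirtualKey:
    def __init__(self, name, monthly_budget_usd, allowed_models):
        self.name = name
        self.monthly_budget_usd = monthly_budget_usd
        self.allowed_models = allowed_models
        self.spent_usd = 0.0

    def authorize(self, model, estimated_cost_usd):
        if model not in self.allowed_models:
            return False, "model not permitted for this key"
        if self.spent_usd + estimated_cost_usd > self.monthly_budget_usd:
            return False, "monthly budget exceeded"
        self.spent_usd += estimated_cost_usd  # attribute spend to this key
        return True, "ok"

# Seniors get expensive reasoning models; juniors get cost-efficient ones.
senior = VirtualKey("vk-senior", 500.0, {"claude-opus-4", "gpt-4o"})
junior = VirtualKey("vk-junior", 100.0, {"gpt-4o-mini"})

print(senior.authorize("claude-opus-4", 2.75))  # (True, 'ok')
print(junior.authorize("claude-opus-4", 2.75))  # blocked by allowlist
```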

&lt;h3&gt;
  
  
  One MCP config, every agent
&lt;/h3&gt;

&lt;p&gt;Coding agents get much more useful once they can hit MCP tools (filesystem, databases, GitHub, docs lookup, internal APIs). But configuring MCP servers one by one for each agent, across every engineer's machine, is genuinely miserable. Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; centralizes the whole thing. When the Bifrost CLI fires up Claude Code, it auto-attaches Bifrost's MCP endpoint so every tool configured in Bifrost shows up inside the agent immediately, without any &lt;code&gt;claude mcp add-json&lt;/code&gt; calls or hand-edited JSON. This matters a lot if you are standardizing on MCP for internal tooling. We dug into the token-cost side of this in our post on &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost MCP Gateway access control, cost governance, and 92% lower token costs at scale&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prereq: a Running Bifrost Gateway
&lt;/h2&gt;

&lt;p&gt;The Bifrost CLI needs a Bifrost gateway to talk to. If you do not already have one running, the gateway starts with zero config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default gateway address is &lt;code&gt;http://localhost:8080&lt;/code&gt;. Open that URL in your browser to add providers through the web UI, set up virtual keys, and flip on features like semantic caching or observability. Prefer Docker?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull maximhq/bifrost
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/data:/app/data maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-v $(pwd)/data:/app/data&lt;/code&gt; mount keeps your configuration alive across container restarts. If you need more control (custom ports, log levels, file-based configuration, PostgreSQL-backed persistence), the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;gateway setup guide&lt;/a&gt; documents every flag and mode.&lt;/p&gt;

&lt;p&gt;Once the gateway is up and at least one provider is configured, you can launch the CLI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Running the Bifrost CLI
&lt;/h2&gt;

&lt;p&gt;Requirements: Node.js 18+. Install via npx:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the first run, the &lt;code&gt;bifrost&lt;/code&gt; binary is available on your PATH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To pin a specific CLI version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost-cli &lt;span class="nt"&gt;--cli-version&lt;/span&gt; v1.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Launching Claude Code Through the Bifrost CLI
&lt;/h2&gt;

&lt;p&gt;Running &lt;code&gt;bifrost&lt;/code&gt; opens an interactive TUI that walks you through five steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Base URL&lt;/strong&gt;: Enter your Bifrost gateway URL (usually &lt;code&gt;http://localhost:8080&lt;/code&gt; for local dev)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Key (optional)&lt;/strong&gt;: If &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key authentication&lt;/a&gt; is on, drop in your key here. Virtual keys land in your OS keyring (macOS Keychain, Windows Credential Manager, Linux Secret Service), never in plaintext on disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose a Harness&lt;/strong&gt;: Pick Claude Code from the list. The CLI shows install state and version, and if Claude Code is not installed, it offers to install it via npm for you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select a Model&lt;/strong&gt;: The CLI pulls available models from your gateway's &lt;code&gt;/v1/models&lt;/code&gt; endpoint and shows a searchable list. Type to filter, arrow keys to navigate, or just type any model identifier manually (for example, &lt;code&gt;anthropic/claude-sonnet-4-5-20250929&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launch&lt;/strong&gt;: Look over the configuration summary, hit Enter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CLI sets every required environment variable, applies provider-specific configuration, and launches Claude Code in the same terminal. From there you are using Claude Code normally, with every request routed through Bifrost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic MCP attach for Claude Code
&lt;/h3&gt;

&lt;p&gt;Launching Claude Code through the Bifrost CLI auto-registers Bifrost's MCP endpoint at &lt;code&gt;/mcp&lt;/code&gt;, so every MCP tool you have configured in Bifrost is instantly available inside Claude Code. If a virtual key is set, the CLI also wires up authenticated MCP access with the right &lt;code&gt;Authorization&lt;/code&gt; header. No manual &lt;code&gt;claude mcp add-json&lt;/code&gt; commands. For the other harnesses (Codex CLI, Gemini CLI, Opencode), the CLI prints the MCP server URL so you can plug it into the agent's own settings.&lt;/p&gt;
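&lt;p&gt;Concretely, the registration the CLI performs is equivalent to supplying a JSON server config of the shape &lt;code&gt;claude mcp add-json&lt;/code&gt; accepts. A sketch of building that payload (the virtual key value is a placeholder):&lt;/p&gt;

```python
import json

# Builds an MCP server config equivalent to what the CLI registers;
# the Authorization header is only added when a virtual key is set.
def mcp_server_config(base_url, virtual_key=None):
    config = {"type": "http", "url": f"{base_url}/mcp"}
    if virtual_key:
        config["headers"] = {"Authorization": f"Bearer {virtual_key}"}
    return json.dumps(config)

print(mcp_server_config("http://localhost:8080", "vk-123"))
```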

&lt;h2&gt;
  
  
  Tabbed Session UI
&lt;/h2&gt;

&lt;p&gt;Once you launch, the Bifrost CLI drops you into a tabbed terminal UI rather than exiting after your session ends. A tab bar at the bottom shows the CLI version, one tab per active or recent agent session, and a status badge per tab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🧠 means the session's output is still changing (the agent is working)&lt;/li&gt;
&lt;li&gt;✅ means the session looks idle and ready&lt;/li&gt;
&lt;li&gt;🔔 means the session emitted a real terminal alert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hit &lt;code&gt;Ctrl+B&lt;/code&gt; any time to focus the tab bar. From tab mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;n&lt;/code&gt; opens a new tab and launches another agent session&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;x&lt;/code&gt; closes the current tab&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt; / &lt;code&gt;l&lt;/code&gt; jump left and right across tabs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt;-&lt;code&gt;9&lt;/code&gt; jump directly to a tab by number&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Esc&lt;/code&gt; / &lt;code&gt;Enter&lt;/code&gt; / &lt;code&gt;Ctrl+B&lt;/code&gt; drop you back into the active session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Super handy when you want Claude Code on one task and Gemini CLI on another, or multiple parallel Claude Code sessions against different branches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Git Worktree Support for Claude Code
&lt;/h2&gt;

&lt;p&gt;Worktree support is currently Claude Code only. It lets you run sessions in isolated git worktrees for parallel development:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost-cli &lt;span class="nt"&gt;-worktree&lt;/span&gt; feature-branch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also choose worktree mode from inside the TUI during setup. The CLI forwards the &lt;code&gt;--worktree&lt;/code&gt; flag to Claude Code, which creates a fresh working directory on that branch. This is exactly what you want when you need two Claude Code agents running side by side, one on &lt;code&gt;main&lt;/code&gt; and one on a feature branch, without stepping on each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration and CLI Flags
&lt;/h2&gt;

&lt;p&gt;The Bifrost CLI persists its configuration at &lt;code&gt;~/.bifrost/config.json&lt;/code&gt;, created on first run and updated through the TUI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"base_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"default_harness"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"default_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-5-20250929"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Virtual keys are never written to this file; they stay in your OS keyring.&lt;/p&gt;
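&lt;p&gt;Because the file is plain JSON, scripting around it is easy. A minimal sketch of loading it with fallbacks in Python (the default values here are assumptions for illustration, not the CLI's documented fallbacks):&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical defaults; the actual CLI's fallback behavior may differ.
DEFAULTS = {
    "base_url": "http://localhost:8080",
    "default_harness": "claude",
    "default_model": None,
}

def load_config(path: Path) -> dict:
    """Merge a JSON config file over defaults; a missing file yields pure defaults."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config
```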

&lt;p&gt;CLI flags worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-config &amp;lt;path&amp;gt;&lt;/code&gt;: Point at a custom &lt;code&gt;config.json&lt;/code&gt; file (useful for per-project gateway configs)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-no-resume&lt;/code&gt;: Skip the resume flow and open a fresh setup&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-worktree &amp;lt;name&amp;gt;&lt;/code&gt;: Create a git worktree for the session (Claude Code only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the summary screen, shortcut keys let you tweak things without restarting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;u&lt;/code&gt; changes the base URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;v&lt;/code&gt; updates the virtual key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt; swaps to a different harness&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;m&lt;/code&gt; picks a different model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;w&lt;/code&gt; sets a worktree name (Claude Code only)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;d&lt;/code&gt; opens the Bifrost dashboard in your browser&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;l&lt;/code&gt; toggles harness exit logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Swapping Between Coding Agents
&lt;/h2&gt;

&lt;p&gt;This is where the Bifrost CLI earns its keep. When a Claude Code session ends, you land back on the summary screen with your previous config intact. Press &lt;code&gt;h&lt;/code&gt; to swap Claude Code for Codex CLI, press &lt;code&gt;m&lt;/code&gt; to try GPT-4o instead of Claude Sonnet, then hit Enter to re-launch. The CLI redoes everything (base URLs, API keys, model flags, agent-specific config) for you.&lt;/p&gt;

&lt;p&gt;Opencode gets two extra behaviors: the CLI generates a provider-qualified model reference plus a runtime config so Opencode boots with the right model, and it preserves your existing theme from &lt;code&gt;tui.json&lt;/code&gt; or falls back to the adaptive system theme if you have not set one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflows You Actually See in the Wild
&lt;/h2&gt;

&lt;p&gt;A few patterns that tend to show up on teams running coding agents via Bifrost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Head-to-head agent comparison&lt;/strong&gt;: Open a tab, launch Claude Code on a task. Open another, launch Codex CLI on the same task. Compare outputs. Every request runs through Bifrost, so everything gets logged against the same virtual key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Worktree-based parallel work&lt;/strong&gt;: One engineer runs Claude Code on a bug fix in one worktree and Claude Code on a feature in another, with both sessions in view via the tabbed UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model switching per task&lt;/strong&gt;: Claude Opus for big architectural refactors, Gemini for documentation-heavy work, a local Ollama model for quick edits. No leaving the CLI, no reconfiguring anything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared MCP tools across a team&lt;/strong&gt;: Platform engineers configure MCP servers once in the Bifrost dashboard (filesystem, internal APIs, databases), and every engineer's Claude Code session picks those tools up automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting Cheat Sheet
&lt;/h2&gt;

&lt;p&gt;A few common gotchas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"npm not found in path"&lt;/strong&gt;: The CLI uses npm to install missing harnesses. Confirm Node.js 18+ is installed and &lt;code&gt;npm --version&lt;/code&gt; works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent not found after install&lt;/strong&gt;: Restart your terminal or add npm's global bin directory to your &lt;code&gt;PATH&lt;/code&gt; with &lt;code&gt;export PATH="$(npm config get prefix)/bin:$PATH"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models not loading&lt;/strong&gt;: Check that your Bifrost gateway is reachable at the configured base URL, at least one provider is set up, and (if virtual keys are on) your key has permission to list models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual key not persisting&lt;/strong&gt;: The CLI writes virtual keys to your OS keyring. On Linux, make sure &lt;code&gt;gnome-keyring&lt;/code&gt; or &lt;code&gt;kwallet&lt;/code&gt; is running. If keyring access fails, the CLI logs a warning and keeps going, but you will need to re-enter the key each session&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The Bifrost CLI makes every coding agent a first-class citizen of your AI gateway. Engineers stop wrestling with env vars and per-agent config files; platform teams get centralized governance, observability, and MCP tool management across every agent in play. Claude Code, Codex CLI, Gemini CLI, and Opencode all launch through one CLI, behind one set of credentials, with one dashboard watching them.&lt;/p&gt;

&lt;p&gt;Ready to try it? Spin up a gateway with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;, grab the CLI with &lt;code&gt;npx -y @maximhq/bifrost-cli&lt;/code&gt;, and walk through the setup. For teams thinking about production coding agent workflows at scale, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team to see how the MCP gateway, governance layer, and CLI all come together.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Running Codex CLI at Scale? Here's Why You Need an AI Gateway</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:47:41 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/running-codex-cli-at-scale-heres-why-you-need-an-ai-gateway-2dmh</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/running-codex-cli-at-scale-heres-why-you-need-an-ai-gateway-2dmh</guid>
      <description>&lt;p&gt;&lt;em&gt;Routing Codex CLI through an AI gateway like Bifrost gives platform teams per-consumer spend controls, multi-provider access, automatic failover, and compliance logging without changing how developers work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Codex CLI now has more than 2 million weekly active users. Engineering organizations at Cisco, Nvidia, and Ramp are running it across their developer teams. The appeal is straightforward: a terminal-native coding agent that opens files, proposes diffs, runs test suites, and iterates entirely inside the shell. The problem that appears at team scale is just as straightforward: every session is a raw API call to OpenAI. There is no built-in mechanism for spend attribution, model access restrictions, or cross-team usage monitoring.&lt;/p&gt;

&lt;p&gt;A single developer's usage shows up cleanly on one invoice. Fifty engineers running concurrent sessions in Suggest, Auto Edit, or Full Auto mode create a spend problem that is invisible until it lands as a surprise at month-end. Inserting an &lt;strong&gt;AI gateway for Codex CLI&lt;/strong&gt; between the agent and the provider resolves this at the infrastructure level, with no changes pushed to individual machines. Bifrost, the open-source AI gateway built by Maxim AI, handles exactly this use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Governance Gap Looks Like Without a Gateway
&lt;/h2&gt;

&lt;p&gt;Codex CLI is configured through two environment variables: &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; and &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;. This is a deliberate design choice that makes individual developer setup fast. At organizational scale, it is also the root cause of every governance problem.&lt;/p&gt;

&lt;p&gt;Without a gateway layer, teams are left with two options, both of them inadequate. The first is a shared API key, which collapses all usage into a single account with no per-developer attribution. The second is distributing individual keys manually, which creates key rotation overhead and still gives platform teams no real-time window into aggregate spend or model selection.&lt;/p&gt;

&lt;p&gt;Both approaches fail as team size grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No per-developer or per-team spend tracking&lt;/strong&gt;: Everything rolls up to one OpenAI account with limited granularity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No model access restrictions&lt;/strong&gt;: Anyone holding the key can call any available model, including the highest-cost options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-consumer rate limiting&lt;/strong&gt;: A long-running Full Auto session on a large monorepo can drain the budget long before anyone notices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No automatic failover&lt;/strong&gt;: API degradation or a rate limit from OpenAI halts the session completely, with no recovery path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No compliance audit trail&lt;/strong&gt;: For regulated industries, there is no tamper-resistant record of which model received which prompt, from which user, and at what time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A gateway solves all of these centrally without requiring developers to change anything about how they run Codex CLI.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Bifrost Connects to Codex CLI
&lt;/h2&gt;

&lt;p&gt;Bifrost sits at the network layer and intercepts Codex CLI's outbound OpenAI-format requests. Since Codex CLI already uses a standard OpenAI-compatible API structure, connecting it to Bifrost is a single environment variable change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://your-bifrost-gateway/openai/v1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-bifrost-virtual-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that Codex CLI specifically requires the base URL to end with &lt;code&gt;/v1&lt;/code&gt;, which distinguishes it from some other OpenAI SDK integrations that append the path automatically. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/bifrost-cli" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt; handles this automatically. Running &lt;code&gt;npx -y @maximhq/bifrost-cli&lt;/code&gt; starts an interactive terminal session that walks through gateway URL, virtual key, and model selection, then launches Codex CLI with every variable pre-configured. If Codex CLI is not installed on the machine, the Bifrost CLI installs it via npm before launch.&lt;/p&gt;
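&lt;p&gt;If you are wiring the environment variables by hand rather than through the CLI, it is easy to guard against the missing-suffix mistake. A small sketch (the helper is illustrative, not part of any Bifrost tooling):&lt;/p&gt;

```python
def normalize_base_url(url: str) -> str:
    """Ensure an OpenAI-compatible base URL ends with /v1, as Codex CLI expects."""
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url
```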

&lt;p&gt;From that point, all Codex CLI traffic passes through Bifrost's governance and routing layers before it reaches any LLM provider.&lt;/p&gt;




&lt;h2&gt;
  
  
  Virtual Keys: Scoped Access for Every Developer and Team
&lt;/h2&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key system&lt;/a&gt; is the core governance primitive. Each developer, team, or project gets a dedicated virtual key that defines their specific access policy. Provider credentials stay locked inside the gateway and are never distributed to end users.&lt;/p&gt;

&lt;p&gt;Virtual keys support granular policy enforcement at the individual request level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model access rules&lt;/strong&gt;: Define exactly which models a given key can reach. A staff engineer's key might cover GPT-5.4 and Claude Sonnet, while a vendor or contractor key is constrained to open-source models running on Groq.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spend limits&lt;/strong&gt;: Dollar-denominated hard caps by day, week, or month. Once a key reaches its ceiling, requests return a policy error rather than silently accumulating more spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits&lt;/strong&gt;: Maximum requests per minute or per hour, so a single automated workflow cannot saturate throughput and block the rest of the team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider restrictions&lt;/strong&gt;: Pin a key to one provider or grant access to the full catalog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Policy changes take effect immediately at the gateway with no developer action required. Revoking access, tightening a budget cap, or changing model permissions propagates on the next request. There is no need to rotate keys across machines or push configuration updates to individual developers.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;Bifrost governance layer&lt;/a&gt; applies budget controls hierarchically. An engineering team might operate under a shared $500/month ceiling, with each individual virtual key carrying its own $75/month cap. Both limits are enforced independently, so a single engineer cannot exhaust the team allocation and a team cannot exhaust the organizational budget undetected.&lt;/p&gt;
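&lt;p&gt;The independent-enforcement semantics can be sketched in a few lines. This is an illustration of the rule, not Bifrost's implementation:&lt;/p&gt;

```python
def request_allowed(key_spend: float, team_spend: float,
                    key_cap: float, team_cap: float,
                    request_cost: float) -> bool:
    """Both caps are checked independently; either one can block the request."""
    return (key_spend + request_cost <= key_cap and
            team_spend + request_cost <= team_cap)
```

&lt;p&gt;With a $75 key cap under a $500 team ceiling, a request is rejected the moment either limit would be crossed, whichever comes first.&lt;/p&gt;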




&lt;h2&gt;
  
  
  Breaking the OpenAI Dependency with Multi-Provider Routing
&lt;/h2&gt;

&lt;p&gt;By default, Codex CLI only routes to OpenAI's GPT model family. For teams that want to benchmark models, reduce costs on specific task types, or hedge against dependence on a single provider, this is a hard constraint.&lt;/p&gt;

&lt;p&gt;Bifrost connects to &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ LLM providers&lt;/a&gt; behind an OpenAI-compatible interface. API translation happens at the gateway layer, so Codex CLI can send requests to Claude models on Anthropic, Gemini on Google, Mistral, Groq, AWS Bedrock, Azure OpenAI, or any other configured provider without any modification to the agent itself. Developers switch models mid-session using Codex CLI's &lt;code&gt;/model&lt;/code&gt; command; the gateway handles the protocol conversion and routes to the correct backend.&lt;/p&gt;

&lt;p&gt;This makes meaningful task-based model selection practical within a single workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex multi-file refactors routed to GPT-5.4 for deeper reasoning&lt;/li&gt;
&lt;li&gt;High-volume unit test generation routed to a Groq-hosted Llama model for lower latency and cost&lt;/li&gt;
&lt;li&gt;Documentation and code explanation tasks sent to Claude Sonnet&lt;/li&gt;
&lt;li&gt;Automatic fallback to Gemini Flash if a primary provider hits rate limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For organizations in regulated industries, Bifrost's &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC deployment&lt;/a&gt; keeps all Codex CLI request traffic inside private cloud infrastructure, satisfying data residency and sovereignty requirements without removing any agent capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failover and Load Balancing for Long-Running Sessions
&lt;/h2&gt;

&lt;p&gt;A Codex CLI session on a complex task can run for several minutes, spanning multiple file reads, test executions, and iterative edits. An API error or rate limit from OpenAI mid-session forces a full restart, with context state gone.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt; removes this risk. Platform teams define ordered fallback chains specifying which providers Bifrost tries in sequence when a request fails. A 429 or 5xx from the primary provider triggers an automatic retry against the next entry in the chain, and Codex CLI receives a successful response with no visible interruption to the session.&lt;/p&gt;
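&lt;p&gt;The ordered-fallback idea looks roughly like this. A hedged sketch: the provider names and the &lt;code&gt;RateLimited&lt;/code&gt; stand-in are ours, not Bifrost internals:&lt;/p&gt;

```python
class RateLimited(Exception):
    """Stands in for a 429 or 5xx response from a provider."""

def call_with_fallback(chain, prompt):
    """Try each (name, callable) pair in order; return the first success.

    An illustration of ordered fallback chains, not Bifrost's actual code.
    """
    last_error = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            last_error = exc  # move on to the next provider in the chain
    raise last_error
```

&lt;p&gt;From the agent's point of view, a mid-session 429 on the primary simply becomes a successful response from the next provider in line.&lt;/p&gt;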

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/keys-management" rel="noopener noreferrer"&gt;Load balancing&lt;/a&gt; distributes concurrent requests across multiple API keys or provider accounts using weighted routing. When a full engineering team runs Codex CLI sessions simultaneously, no single key exhausts its rate limit and blocks others. This matters most for teams running Full Auto mode or agent subworkflows that produce high request volumes in short bursts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability: Token Spend Visibility Across the Whole Team
&lt;/h2&gt;

&lt;p&gt;Every Codex CLI request that passes through Bifrost generates structured telemetry: model name, provider routed to, input and output token counts, end-to-end latency, virtual key ID, and response outcome. This data surfaces through native integrations without any custom instrumentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus metrics&lt;/strong&gt;: Available at the Bifrost metrics scrape endpoint or pushed via Push Gateway, feeding Grafana dashboards with per-key usage breakdowns in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry traces&lt;/strong&gt;: OTLP-compatible traces on every request, compatible with Datadog, New Relic, Honeycomb, and any other OTLP backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog connector&lt;/strong&gt;: Native integration for APM traces, LLM Observability dashboards, and infrastructure metrics without a custom exporter layer.&lt;/li&gt;
&lt;/ul&gt;
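&lt;p&gt;Because every record carries a virtual key ID, per-team spend attribution reduces to a simple aggregation. A sketch over hypothetical record fields (Bifrost's actual telemetry schema may differ):&lt;/p&gt;

```python
from collections import defaultdict

def tokens_by_virtual_key(records):
    """Sum input + output tokens per virtual key from telemetry-like records."""
    totals = defaultdict(int)
    for r in records:
        totals[r["virtual_key"]] += r["input_tokens"] + r["output_tokens"]
    return dict(totals)
```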

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; makes visible what a direct-to-OpenAI setup cannot: which teams are generating the most tokens, which models are being selected for which task types, and where latency outliers are occurring. When a virtual key repeatedly hits its monthly cap early, the telemetry identifies exactly which sessions were responsible, turning a budget policy conversation from abstract to specific.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enterprise Compliance for Regulated Codex CLI Deployments
&lt;/h2&gt;

&lt;p&gt;In regulated environments, Codex CLI sessions carry compliance obligations that extend beyond cost governance. Source code submitted to an LLM may contain proprietary logic, personal data, or content subject to regional residency laws. A direct OpenAI integration cannot enforce these constraints at the infrastructure level.&lt;/p&gt;

&lt;p&gt;Bifrost Enterprise adds the compliance controls that regulated teams require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immutable audit logs&lt;/strong&gt;: Every request and response is written to an append-only log with full metadata, covering user identity, model, timestamps, and token counts. The &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;audit log&lt;/a&gt; satisfies SOC 2, GDPR, HIPAA, and ISO 27001 reporting requirements with tamper-resistant storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets management integration&lt;/strong&gt;: Provider API keys are stored in HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, or Azure Key Vault and retrieved at runtime through Bifrost's &lt;a href="https://docs.getbifrost.ai/enterprise/vault-support" rel="noopener noreferrer"&gt;vault integration&lt;/a&gt;. Keys never appear in plaintext environment variables or configuration files on developer machines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails&lt;/strong&gt;: Content safety checks using AWS Bedrock Guardrails, Azure Content Safety, or Patronus AI run against every Codex CLI request before the prompt reaches a provider, enabling PII redaction and organizational policy enforcement at the gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSO and RBAC&lt;/strong&gt;: Federated authentication via Okta and Entra (Azure AD) with role-based gateway administration ensures only authorized team members can modify virtual key policies, adjust budgets, or access telemetry data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams comparing gateway options across governance, compliance, and performance capabilities can review the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; for a structured comparison.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup: Codex CLI Through Bifrost in Under a Minute
&lt;/h2&gt;

&lt;p&gt;Bifrost is open source and starts without a configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interactive setup covers provider configuration, virtual key creation, and Codex CLI launch in a guided flow. For Codex CLI specifically, the &lt;a href="https://docs.getbifrost.ai/cli-agents/codex-cli" rel="noopener noreferrer"&gt;integration guide&lt;/a&gt; in the Bifrost docs covers the &lt;code&gt;/openai/v1&lt;/code&gt; endpoint path requirement and common setup patterns.&lt;/p&gt;

&lt;p&gt;Gateway overhead is &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;11 microseconds per request at 5,000 RPS&lt;/a&gt;. Developers experience no perceptible change in session responsiveness. The governance, routing, and observability layers are entirely transparent to the agent.&lt;/p&gt;

&lt;p&gt;For engineering teams scaling Codex CLI across an organization and needing centralized access control, compliance logging, and multi-provider routing, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Code Mode in Bifrost MCP Gateway: Python-Driven Tool Orchestration for Cheaper AI Agents</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:38:00 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/code-mode-in-bifrost-mcp-gateway-python-driven-tool-orchestration-for-cheaper-ai-agents-20h7</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/code-mode-in-bifrost-mcp-gateway-python-driven-tool-orchestration-for-cheaper-ai-agents-20h7</guid>
      <description>&lt;p&gt;&lt;em&gt;Code Mode in Bifrost MCP Gateway has AI agents write Python scripts to orchestrate tools, trimming token usage up to 92% with pass rate fully preserved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Rather than injecting every tool definition into the model's prompt on each request, Code Mode in Bifrost MCP Gateway takes a different route to agent execution. It keeps the exposed surface area small: four lightweight meta-tools, plus a short Python (Starlark) script that the model writes to orchestrate the work. Controlled benchmarks covering 500+ tools have shown input token reductions reaching 92.8%, with pass rate holding steady at 100%. For any team operating production AI agents across several &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; servers, that gap decides whether the monthly AI bill stays manageable or spirals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Mode in Bifrost MCP Gateway Explained
&lt;/h2&gt;

&lt;p&gt;Code Mode in Bifrost MCP Gateway shifts orchestration from one-shot function calls to model-written Python. Rather than invoking each MCP tool separately through the usual function-calling interface, the model produces a single script that strings the calls together. Bifrost presents the connected MCP servers as a virtual filesystem of Python stub files, using &lt;code&gt;.pyi&lt;/code&gt; signatures, which the model browses on demand. After locating only the relevant tools, the model drafts its script, and Bifrost runs it inside a sandboxed &lt;a href="https://github.com/bazelbuild/starlark" rel="noopener noreferrer"&gt;Starlark&lt;/a&gt; interpreter. The model's context receives only the final output, not the intermediate steps.&lt;/p&gt;

&lt;p&gt;Context bloat shows up almost immediately once a team wires more than a few MCP servers into an agent. The conventional MCP flow pushes every tool definition from every connected server into the prompt on every single turn. Do the math for 5 servers with 30 tools apiece and the agent is already carrying 150 schemas before the user's message has even been parsed. Code Mode severs that link: prompt cost scales with what the model actually opens, not with the total size of the tool registry.&lt;/p&gt;
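&lt;p&gt;A back-of-the-envelope model makes the scaling difference concrete. All token figures below are assumptions chosen for illustration, not measured values:&lt;/p&gt;

```python
def classic_prompt_tokens(servers, tools_per_server, tokens_per_schema, turns):
    """Classic MCP: every tool schema rides along on every turn."""
    return servers * tools_per_server * tokens_per_schema * turns

def code_mode_prompt_tokens(opened_stub_tokens, meta_tool_tokens, turns):
    """Code Mode: four meta-tools per turn, plus only the stubs actually opened."""
    return meta_tool_tokens * turns + opened_stub_tokens

# Assumed: ~125 tokens per schema, 4 agent turns, ~400 tokens of meta-tools,
# ~1,500 tokens of stubs the model chose to open.
classic = classic_prompt_tokens(5, 30, 125, 4)      # 150 schemas x 4 turns
code_mode = code_mode_prompt_tokens(1_500, 400, 4)
```

&lt;p&gt;Under these assumptions the classic path spends 75,000 prompt tokens on tool definitions alone, while Code Mode stays near 3,100; adding a sixth server grows only the first number.&lt;/p&gt;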

&lt;h2&gt;
  
  
  The Cost Problem Baked Into Default MCP Execution
&lt;/h2&gt;

&lt;p&gt;The conventional MCP setup asks the gateway to push every available tool schema into every LLM request. That model works fine for demos and proof-of-concepts. Once it hits production, three failure modes surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-server token costs stack.&lt;/strong&gt; The classic MCP path ships the full tool catalog on each request and on every intermediate turn of the agent loop. Connecting more servers compounds the charge rather than amortizing it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bigger prompts mean slower responses.&lt;/strong&gt; Extensive tool lists inflate prompt length, which pushes up time-to-first-token and stretches end-to-end request latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pruning the tool list isn't a real fix.&lt;/strong&gt; Trimming capability to save tokens just redistributes the problem. Teams wind up managing multiple narrow tool sets across different agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Public work has already put numbers on these failures. &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team&lt;/a&gt; documented a workflow that went from 150,000 tokens to 2,000 when tool calls were swapped for code execution on a Google Drive to Salesforce pipeline, and &lt;a href="https://blog.cloudflare.com/code-mode" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; explored a comparable approach using a TypeScript runtime. Code Mode applies the same core idea, baking it directly into the &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;Bifrost MCP gateway&lt;/a&gt; with two deliberate design calls: Python rather than JavaScript (LLMs see substantially more Python during training) and a dedicated documentation meta-tool that trims prompt size further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside Code Mode: The Four Meta-Tools That Power It
&lt;/h2&gt;

&lt;p&gt;Turning Code Mode on at the client level triggers Bifrost to attach four generic meta-tools to every request, taking the place of the direct tool schemas that would otherwise show up in context.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Meta-tool&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;listToolFiles&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Discover which servers and tools are available as virtual &lt;code&gt;.pyi&lt;/code&gt; stub files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;readToolFile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load compact Python function signatures for a specific server or tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;getToolDocs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch detailed documentation for a specific tool before using it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;executeToolCode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run an orchestration script against the live tool bindings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Navigation happens on demand: the model lists the stub files, pulls in only the signatures it actually plans to use, optionally reaches for detailed docs on a specific tool, then composes a short Python script that Bifrost runs in the sandbox. Two binding granularities are available, server-level and tool-level; one stub per server keeps discovery compact, while one stub per tool supports more targeted lookups. Both share the same four-tool interface. Configuration details across both modes live in the &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode configuration reference&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inside the Sandbox: Boundaries of Generated Code
&lt;/h3&gt;

&lt;p&gt;Execution runs inside a Starlark interpreter, a deterministic Python-like language first built at Google for build system configuration. The sandbox is intentionally narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No imports&lt;/li&gt;
&lt;li&gt;No file I/O&lt;/li&gt;
&lt;li&gt;No network access&lt;/li&gt;
&lt;li&gt;Only tool calls against the allowed bindings and basic Python-like logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is fast, deterministic execution that is safe to run under &lt;a href="https://docs.getbifrost.ai/mcp/agent-mode" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; with auto-execution on. Because they are read-only, the three meta-tools &lt;code&gt;listToolFiles&lt;/code&gt;, &lt;code&gt;readToolFile&lt;/code&gt;, and &lt;code&gt;getToolDocs&lt;/code&gt; can always be auto-executed. &lt;code&gt;executeToolCode&lt;/code&gt; clears the auto-execution bar only when every tool referenced in the generated script appears on the configured allow-list.&lt;/p&gt;
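&lt;p&gt;The allow-list rule amounts to a subset check over the tools a script references. A toy sketch (the real gateway's script inspection is more rigorous than this regex):&lt;/p&gt;

```python
import re

def can_auto_execute(script, allow_list):
    """Auto-execute only if every tool the script references is allowed.

    Matches `server.tool(...)` call patterns with a toy regex; purely
    illustrative of the allow-list semantics.
    """
    referenced = set(re.findall(r"\b([a-zA-Z_]\w*\.[a-zA-Z_]\w*)\s*\(", script))
    return referenced <= allow_list
```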

&lt;h2&gt;
  
  
  Code Mode's Token Savings in Real Workloads
&lt;/h2&gt;

&lt;p&gt;Picture a multi-step e-commerce task: pull up a customer, review their order history, apply a discount, and fire off a confirmation. What separates classic MCP from Code Mode isn't only the final output; it's the entire shape of the context the model sees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classic MCP flow:&lt;/strong&gt; Every turn drags along the full tool list. Each intermediate tool result loops back through the model. Once a workload is running 10 MCP servers with 100+ tools, the bulk of every prompt is spent on tool definitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Mode flow:&lt;/strong&gt; The model pulls one stub file, writes a single script that chains the calls together, and the script runs in the Bifrost sandbox. Intermediate results never leave the sandbox. Only the compact final output returns to the model's context.&lt;/p&gt;
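&lt;p&gt;Concretely, the script the model writes might look something like the following. The tool bindings (&lt;code&gt;crm&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;) and the loyalty rule are hypothetical stand-ins, not real Bifrost stub names:&lt;/p&gt;

```python
# A sketch of the kind of orchestration script a model might write in Code Mode.
def run_workflow(crm, orders, email, customer_email, discount_pct):
    customer = crm.find_customer(email=customer_email)
    history = orders.list_orders(customer_id=customer["id"])
    if len(history) >= 3:  # illustrative loyalty rule
        orders.apply_discount(customer_id=customer["id"], percent=discount_pct)
        email.send(to=customer_email, template="discount_confirmation")
    # Only this compact summary returns to the model's context.
    return {"customer_id": customer["id"], "orders": len(history)}
```

&lt;p&gt;Everything except the small returned dict stays inside the sandbox, which is exactly where the token savings come from.&lt;/p&gt;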

&lt;p&gt;Three controlled benchmark rounds were published, toggling Code Mode on and off while scaling tool count between rounds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Input tokens (off)&lt;/th&gt;
&lt;th&gt;Input tokens (on)&lt;/th&gt;
&lt;th&gt;Token reduction&lt;/th&gt;
&lt;th&gt;Cost reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;96 tools / 6 servers&lt;/td&gt;
&lt;td&gt;19.9M&lt;/td&gt;
&lt;td&gt;8.3M&lt;/td&gt;
&lt;td&gt;-58.2%&lt;/td&gt;
&lt;td&gt;-55.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;251 tools / 11 servers&lt;/td&gt;
&lt;td&gt;35.7M&lt;/td&gt;
&lt;td&gt;5.5M&lt;/td&gt;
&lt;td&gt;-84.5%&lt;/td&gt;
&lt;td&gt;-83.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;508 tools / 16 servers&lt;/td&gt;
&lt;td&gt;75.1M&lt;/td&gt;
&lt;td&gt;5.4M&lt;/td&gt;
&lt;td&gt;-92.8%&lt;/td&gt;
&lt;td&gt;-92.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gains compound with scale: the classic path reloads every definition on every call, while Code Mode's cost stays bounded by what the model actually reads. Pass rate held firm at 100% across all three rounds, confirming that efficiency came without an accuracy tradeoff. Full methodology and raw numbers sit in the &lt;a href="https://github.com/maximhq/bifrost-benchmarking/blob/main/mcp-code-mode-benchmark/benchmark_report.md" rel="noopener noreferrer"&gt;Bifrost MCP Code Mode benchmark report&lt;/a&gt;.&lt;/p&gt;
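&lt;p&gt;The published percentages can be sanity-checked from the table's rounded token counts:&lt;/p&gt;

```python
# (input tokens with Code Mode off, with it on, published reduction %)
rounds = [
    (19.9e6, 8.3e6, 58.2),
    (35.7e6, 5.5e6, 84.5),
    (75.1e6, 5.4e6, 92.8),
]

# Reduction = 1 - on/off; tiny deviations reflect rounding in the table inputs.
reductions = [round(100 * (1 - on / off), 1) for off, on, _ in rounds]
```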

&lt;p&gt;How all of this plays out in a live production setting, including cost governance, access control, and per-tool pricing, is covered in the &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost MCP Gateway launch post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;What Code Mode Delivers for Enterprise AI Teams&lt;/h2&gt;

&lt;p&gt;Token cost sits at the top of the list, but it is not the only reason Code Mode earns its place in production. Platform and infrastructure teams running AI agents at scale get a set of operational properties through Code Mode that classic MCP execution simply does not deliver:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability without the cost penalty.&lt;/strong&gt; Every MCP server a team needs (internal APIs, search, databases, filesystem, CRM) can be connected without paying a per-request token tax for each tool definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictable scaling.&lt;/strong&gt; Bringing a new MCP server online does not balloon the context window of every downstream agent. Per-request cost stays flat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower end-to-end latency.&lt;/strong&gt; Fewer, larger model turns with sandboxed orchestration between them cut total response time compared to tool-by-tool multi-turn execution, a pattern consistent with Bifrost's broader &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;performance benchmarks&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic workflows.&lt;/strong&gt; Orchestration logic lives in a deterministic Starlark script instead of being reassembled across several stochastic model turns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditable execution.&lt;/strong&gt; Each tool call made from within a Code Mode script is still logged as a first-class event in Bifrost, recording tool name, server, arguments, result, latency, virtual key, and parent LLM request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Paired with Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys and governance&lt;/a&gt;, Code Mode slots into the pattern enterprise AI teams have been converging toward for a while: capability, cost control, and &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;centralized AI governance&lt;/a&gt; enforced at the infrastructure layer, not stitched onto each individual agent.&lt;/p&gt;

&lt;h2&gt;Turning On Code Mode for a Bifrost MCP Client&lt;/h2&gt;

&lt;p&gt;Code Mode operates as a per-client toggle. Any MCP client attached to Bifrost, whether over STDIO, HTTP, SSE, or in-process through the Go SDK, can flip between classic mode and Code Mode on demand, with no redeployment and no schema changes required.&lt;/p&gt;

&lt;h3&gt;Step 1: Register an MCP Server&lt;/h3&gt;

&lt;p&gt;Head into the MCP section of the Bifrost dashboard and add a new client. Enter a name, choose the connection type, and provide the endpoint or command. Tool discovery runs automatically, with Bifrost syncing the server's tools on a configurable interval and surfacing each client in the list with a live health indicator. Step-by-step setup is walked through in the &lt;a href="https://docs.getbifrost.ai/mcp/connecting-to-servers" rel="noopener noreferrer"&gt;connecting to MCP servers guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Step 2: Flip the Code Mode Switch&lt;/h3&gt;

&lt;p&gt;Inside the client's settings, switch Code Mode on. At that moment, Bifrost stops injecting the full tool catalog into context for that specific client. Starting with the next request, the model gets the four meta-tools and browses the tool filesystem on its own. Token usage on agent loops drops from the first call.&lt;/p&gt;

&lt;h3&gt;Step 3: Set Up Auto-Execution&lt;/h3&gt;

&lt;p&gt;Out of the box, tool calls need manual approval. To let the agent loop run on its own, allowlist individual tools in the auto-execute settings. Because allowlisting is granular per tool, &lt;code&gt;filesystem_read&lt;/code&gt; can run without a prompt while &lt;code&gt;filesystem_write&lt;/code&gt; remains behind an approval gate. Under Code Mode, the three read-only meta-tools always run without approval, and &lt;code&gt;executeToolCode&lt;/code&gt; qualifies for auto-execution only when every tool that its script touches is on the allowlist.&lt;/p&gt;
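&lt;p&gt;The approval rule can be pictured as a small predicate. This is an illustrative sketch of the behavior described above, not Bifrost's internal API; the function and set names are invented.&lt;/p&gt;

```python
# Illustrative sketch of the auto-execution rule; names are hypothetical.

READ_ONLY_META_TOOLS = {"listToolFiles", "readToolFile", "getToolDocs"}

def can_auto_execute(tool_name, script_tools, allowlist):
    """Return True if a call may run without manual approval.

    tool_name    -- the tool being invoked
    script_tools -- tools referenced by the script when tool_name is
                    executeToolCode (empty otherwise)
    allowlist    -- the client's per-tool auto-execute allowlist
    """
    if tool_name in READ_ONLY_META_TOOLS:
        return True  # read-only meta-tools never need approval
    if tool_name == "executeToolCode":
        # Eligible only when every tool the script touches is allowlisted.
        return set(script_tools).issubset(allowlist)
    return tool_name in allowlist

allow = {"filesystem_read"}
print(can_auto_execute("filesystem_read", [], allow))                   # True
print(can_auto_execute("filesystem_write", [], allow))                  # False
print(can_auto_execute("executeToolCode", ["filesystem_read"], allow))  # True
print(can_auto_execute("executeToolCode",
                       ["filesystem_read", "filesystem_write"], allow)) # False
```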

&lt;h3&gt;Step 4: Scope Tool Access Through Virtual Keys&lt;/h3&gt;

&lt;p&gt;Combine Code Mode with &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; to scope tool access by consumer. A virtual key issued to a customer-facing agent can be locked to a specific tool subset, while an internal admin key can be granted broader reach. Tools that fall outside the key's scope never appear to the model, which rules out prompt-level attempts to bypass the restriction.&lt;/p&gt;

&lt;h2&gt;Putting Code Mode in Bifrost MCP Gateway to Work&lt;/h2&gt;

&lt;p&gt;Every team running MCP in production eventually runs into the same question: how do you keep adding capability without watching the token bill climb with every server you connect? Code Mode in Bifrost MCP Gateway is the pragmatic answer. By relocating orchestration from prompts into sandboxed Python, it brings token cost reductions of up to 92%, faster agent runs, and full auditability together under a single per-client toggle. Any MCP server works; virtual keys and tool groups handle access control; and the whole thing drops into Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway architecture&lt;/a&gt; next to its LLM routing, fallback, and observability layers.&lt;/p&gt;

&lt;p&gt;To see Code Mode in Bifrost MCP Gateway run against your own agent workloads, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a Bifrost demo&lt;/a&gt; with the team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Cut Claude Code token costs with MCP Gateway</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:35:48 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/cut-claude-code-mcp-token-costs-52h5</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/cut-claude-code-mcp-token-costs-52h5</guid>
      <description>&lt;p&gt;&lt;em&gt;Cut Claude Code MCP token costs by as much as 92% with Bifrost's MCP gateway, Code Mode orchestration, and scoped tool governance at production scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Any engineering team that wires Claude Code into more than a few MCP servers runs into the same outcome. Context windows fill up fast, request latency drifts higher, and monthly API spend ends up well above the original estimate. The source of the pain is not the tools being connected. It is the way the Model Context Protocol (MCP) pushes every tool definition into context on each individual request. Trimming Claude Code's tool set is not a real fix, because it trades capability for cost. What teams actually need is an infrastructure tier that controls which tools are exposed, caches what can safely be cached, and lifts orchestration out of the prompt itself. That is the design goal behind Bifrost, the open-source AI gateway by Maxim AI. This guide explains exactly where MCP token costs originate, which problems Claude Code's native optimizations can and cannot address, and how Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; with Code Mode delivers up to 92% token reduction in real production traffic.&lt;/p&gt;

&lt;h2&gt;Where Claude Code's MCP Token Overhead Actually Comes From&lt;/h2&gt;

&lt;p&gt;The core driver of MCP token cost is repetition. Tool schemas reload into context on every single message rather than once at session start, so the bill scales with conversation length. Each MCP server attached to Claude Code injects its complete set of tool definitions, including names, descriptions, parameter schemas, and expected outputs, into the model's context for every turn. Wire up five servers that each expose thirty tools, and the model is already parsing 150 definitions before it reads a single word of the user's actual request.&lt;/p&gt;

&lt;p&gt;Outside reporting has put numbers on the problem. One recent analysis documented that &lt;a href="https://www.jdhodges.com/blog/claude-code-mcp-server-token-costs/" rel="noopener noreferrer"&gt;a typical four-server Claude Code setup adds roughly 7,000 tokens of overhead per message, with heavier configurations crossing 50,000 tokens before the user types anything&lt;/a&gt;. A separate breakdown reported &lt;a href="https://www.mindstudio.ai/blog/claude-code-mcp-server-token-overhead" rel="noopener noreferrer"&gt;multi-server setups routinely adding 15,000 to 20,000 tokens of overhead per turn under usage-based billing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Three compounding dynamics make this worse as usage grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool definitions reload on each turn&lt;/strong&gt;: a 50-message session pays the same overhead 50 times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Even unused tools bill full cost&lt;/strong&gt;: a Playwright server's 22 browser actions travel in the request whether the task involves a browser or a Python file edit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptions skew verbose&lt;/strong&gt;: many open-source MCP servers ship with long, prose-heavy tool descriptions that inflate the token count per definition.&lt;/li&gt;
&lt;/ul&gt;
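&lt;p&gt;The arithmetic is easy to sketch. The per-message figure below comes from the four-server analysis cited above; the input price is an assumed rate for illustration only, not any provider's actual pricing.&lt;/p&gt;

```python
# Back-of-the-envelope estimate of how per-turn MCP overhead compounds
# across a session. The price per million tokens is an assumption.

overhead_per_message = 7_000   # tokens, typical four-server setup (cited)
messages_per_session = 50
price_per_mtok = 3.00          # assumed $/1M input tokens, illustrative

session_overhead = overhead_per_message * messages_per_session
session_cost = session_overhead / 1_000_000 * price_per_mtok

print(session_overhead)        # 350000 tokens before any real work
print(round(session_cost, 2))  # 1.05 dollars per session, pure overhead
```

&lt;p&gt;Fifty turns of a single four-server session burn roughly a third of a million tokens on definitions alone, before the heavier configurations the cited analyses describe are even considered.&lt;/p&gt;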

&lt;p&gt;This overhead is more than a cost concern. It eats into the working context that the model needs for the task itself, which hurts output quality in long sessions and forces compaction earlier than necessary.&lt;/p&gt;

&lt;h2&gt;Where Claude Code's Native Optimizations Help (and Where They Stop)&lt;/h2&gt;

&lt;p&gt;Anthropic has already shipped a handful of optimizations aimed at the obvious cases. Understanding exactly what they handle clarifies where an external layer still has to step in.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;official Claude Code cost guidance&lt;/a&gt; points to a mix of tool search deferral, prompt caching, auto-compaction, tiered model selection, and custom hooks. For MCP specifically, tool search deferral matters most. Once total tool definitions cross a threshold, Claude Code defers them so only tool names reach the context until Claude actually calls one, which can reclaim 13,000 or more tokens in heavier sessions.&lt;/p&gt;

&lt;p&gt;These controls move the needle, but they leave three gaps for teams running MCP at production scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No central governance layer&lt;/strong&gt;: tool deferral is a client-side behavior. It does not let a platform team decide which tools a given developer, squad, or customer integration is allowed to touch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No orchestration primitive&lt;/strong&gt;: even with deferral in place, every multi-step tool workflow still pays for schema loads, intermediate tool results, and model round trips at each step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No view across sessions&lt;/strong&gt;: individual developers can run &lt;code&gt;/context&lt;/code&gt; and &lt;code&gt;/mcp&lt;/code&gt; to audit their own sessions, but the organization has no way to see which MCP tools are burning tokens across the team.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For one developer running Claude Code locally against two or three servers, the native optimizations are sufficient. For a platform team deploying Claude Code to dozens or hundreds of engineers against shared MCP infrastructure, they are not.&lt;/p&gt;

&lt;h2&gt;How Bifrost Drives Claude Code MCP Token Costs Down&lt;/h2&gt;

&lt;p&gt;Bifrost runs as a gateway between Claude Code and the fleet of MCP servers your team relies on. Rather than pointing Claude Code at every server individually, you point it at Bifrost's single &lt;code&gt;/mcp&lt;/code&gt; endpoint. From there, Bifrost manages discovery, tool governance, execution, and the orchestration pattern that actually changes the shape of the token curve: Code Mode.&lt;/p&gt;

&lt;p&gt;Benchmarks back this up. &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost's published MCP gateway cost study&lt;/a&gt; measured input token reductions of 58% with 96 tools connected, 84% with 251 tools, and 92% with 508 tools, while task pass rate remained at 100% across the matrix.&lt;/p&gt;

&lt;h3&gt;Code Mode: orchestration that sidesteps per-turn schema loading&lt;/h3&gt;

&lt;p&gt;Code Mode is where the largest slice of the token savings comes from. Instead of pouring every MCP tool definition into context, Bifrost surfaces the connected MCP servers as a virtual filesystem of lightweight Python stub files. The model reads only the stubs it actually needs, writes a short Python script to wire the calls together, and Bifrost runs that script inside a sandboxed Starlark interpreter.&lt;/p&gt;

&lt;p&gt;Regardless of how many MCP servers sit behind Bifrost, the model interacts with just four meta-tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt;: scan which servers and tools are available.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt;: pull the Python function signatures for a specific server or tool.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt;: fetch the detailed documentation for a particular tool before calling it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt;: run the orchestration script against live tool bindings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern mirrors the approach &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team documented for code execution with MCP&lt;/a&gt;, where a Google Drive to Salesforce workflow fell from 150,000 tokens to 2,000. Bifrost bakes the same idea directly into the gateway, picks Python over JavaScript for stronger LLM fluency, and adds the dedicated docs tool to compress context even further. &lt;a href="https://blog.cloudflare.com/code-mode/" rel="noopener noreferrer"&gt;Cloudflare reported the same exponential savings curve&lt;/a&gt; in their own evaluation.&lt;/p&gt;

&lt;p&gt;Those savings grow as more servers connect. Classic MCP pays per tool definition on every request, so each new server widens the tax base. Code Mode's context cost is bounded by what the model actually reads, not by the size of the tool catalog behind the gateway.&lt;/p&gt;
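&lt;p&gt;A toy model makes the scaling difference visible. All token figures below are assumptions chosen for illustration, not measured values from the benchmark:&lt;/p&gt;

```python
# Toy model of the scaling claim: classic MCP context cost grows with
# the tool catalog, while Code Mode stays bounded by what the model
# actually reads. Token figures are assumed, not measured.

def classic_cost(num_tools, tokens_per_definition=600):
    # Every definition ships on every request.
    return num_tools * tokens_per_definition

def code_mode_cost(tools_read, meta_tool_overhead=800, tokens_per_stub=300):
    # Four meta-tools plus only the stubs the model chose to read.
    return meta_tool_overhead + tools_read * tokens_per_stub

for catalog in (96, 251, 508):
    # Assume the task touches about 5 tools regardless of catalog size.
    print(catalog, classic_cost(catalog), code_mode_cost(5))
```

&lt;p&gt;Under these assumed figures, the classic path grows from roughly 58K to 305K context tokens as the catalog scales from 96 to 508 tools, while the Code Mode path stays flat, which is the shape the published benchmark rounds reflect.&lt;/p&gt;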

&lt;h3&gt;Virtual keys and tool groups: scoped exposure, scoped cost&lt;/h3&gt;

&lt;p&gt;Each request reaching Bifrost arrives with a virtual key attached. Every key carries a scoped tool allowlist, and scoping operates at the individual tool level rather than the server level. One key can be granted &lt;code&gt;filesystem_read&lt;/code&gt; while being denied &lt;code&gt;filesystem_write&lt;/code&gt; from the exact same MCP server. Because the model only ever sees definitions for tools its key is cleared for, anything out of scope contributes zero tokens to the context.&lt;/p&gt;

&lt;p&gt;At organizational scale, &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;MCP Tool Groups&lt;/a&gt; push this one step further. A named group of tools can be bound to any combination of virtual keys, teams, customer integrations, or providers, and Bifrost resolves the active set at request time with no database round trip, keeping the index in memory and syncing it across cluster nodes. For teams formalizing &lt;a href="https://www.getmaxim.ai/bifrost/resources/governance" rel="noopener noreferrer"&gt;AI gateway governance&lt;/a&gt;, this replaces ad-hoc tool filtering with auditable policy.&lt;/p&gt;

&lt;h3&gt;A single gateway endpoint, a single audit trail&lt;/h3&gt;

&lt;p&gt;All connected MCP servers sit behind one &lt;code&gt;/mcp&lt;/code&gt; endpoint on Bifrost. Claude Code makes a single connection and discovers every tool from every server its virtual key is allowed to reach. Registering a new MCP server in Bifrost makes it visible to Claude Code immediately, without any client-side configuration change.&lt;/p&gt;

&lt;p&gt;The cost angle here is visibility. Platform teams get a view that Claude Code's per-session tooling cannot provide. Each tool execution becomes a first-class log record with the tool name, the server, the arguments, the result, the latency, the virtual key, and the parent LLM request, sitting alongside token costs and, where the underlying tools invoke paid external APIs, per-tool costs.&lt;/p&gt;

&lt;h2&gt;Configuring Bifrost as Claude Code's MCP Gateway&lt;/h2&gt;

&lt;p&gt;Going from a clean Bifrost install to Claude Code running with Code Mode enabled takes only a few minutes. Bifrost ships as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement for existing SDKs&lt;/a&gt;, so application code does not need to change.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Register MCP clients in Bifrost&lt;/strong&gt;: Open the MCP section of the Bifrost dashboard and add every MCP server you want to expose, specifying connection type (HTTP, SSE, or STDIO), endpoint, and any required headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turn on Code Mode&lt;/strong&gt;: In the client settings, flip the Code Mode toggle to on. No schema changes and no redeploy are needed. Token usage drops on the next request as the four meta-tools replace full schema injection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up auto-execute and virtual keys&lt;/strong&gt;: Under &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt;, create scoped credentials for each consumer and pick which tools each key may call. For autonomous agent loops, keep read-only tools on the auto-execute allowlist while routing write operations through approval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Bifrost to Claude Code's MCP config&lt;/strong&gt;: In Claude Code's MCP settings, register Bifrost as an MCP server using the gateway URL. Claude Code then discovers every tool its virtual key is allowed to see through that single connection.&lt;/li&gt;
&lt;/ol&gt;
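&lt;p&gt;As a sketch of step 4, a project-level &lt;code&gt;.mcp.json&lt;/code&gt; entry might look like the following. This assumes a local Bifrost instance listening on port 8080; the server name and URL are placeholders to adjust for your deployment.&lt;/p&gt;

```json
{
  "mcpServers": {
    "bifrost": {
      "type": "http",
      "url": "http://localhost:8080/mcp"
    }
  }
}
```

&lt;p&gt;With this single entry in place, Claude Code discovers every tool its virtual key is scoped to through the one gateway connection.&lt;/p&gt;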

&lt;p&gt;Once this is wired up, Claude Code operates against a governed, token-efficient slice of your MCP ecosystem, and every tool invocation is logged with full cost attribution.&lt;/p&gt;

&lt;h2&gt;Quantifying the Cost Impact for Your Team&lt;/h2&gt;

&lt;p&gt;Reducing MCP token costs for Claude Code only matters if you can actually measure the savings. Bifrost's &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; exposes the data that cost decisions depend on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token cost sliced by virtual key, by tool, and by MCP server across time.&lt;/li&gt;
&lt;li&gt;A full trace for every agent run showing which tools ran, in what sequence, with what arguments, and at what latency.&lt;/li&gt;
&lt;li&gt;A side-by-side spend breakdown that places LLM token costs next to tool costs, so the complete cost of an agent workflow is visible in one place.&lt;/li&gt;
&lt;li&gt;Native Prometheus metrics and &lt;a href="https://docs.getbifrost.ai/features/telemetry" rel="noopener noreferrer"&gt;OpenTelemetry (OTLP)&lt;/a&gt; pipes into Grafana, New Relic, Honeycomb, and Datadog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams sizing the savings against their own traffic can reference &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;Bifrost's performance benchmarks&lt;/a&gt;, which record 11 microseconds of overhead at 5,000 requests per second, and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM gateway buyer's guide&lt;/a&gt; for a full feature-by-feature comparison.&lt;/p&gt;

&lt;h2&gt;Beyond Token Costs: What a Production MCP Stack Requires&lt;/h2&gt;

&lt;p&gt;MCP without governance and cost control stops scaling the moment it moves past one developer's local machine. Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; consolidates the full production surface in a single layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scoped access through virtual keys with per-tool filtering.&lt;/li&gt;
&lt;li&gt;Organizational governance backed by MCP Tool Groups.&lt;/li&gt;
&lt;li&gt;End-to-end audit trails for every tool invocation, aligned with SOC 2, GDPR, HIPAA, and ISO 27001.&lt;/li&gt;
&lt;li&gt;Per-tool cost visibility sitting beside LLM token spend.&lt;/li&gt;
&lt;li&gt;Code Mode to compress context cost without compressing capability.&lt;/li&gt;
&lt;li&gt;One gateway that covers MCP traffic and also handles LLM provider routing, &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt;, load balancing, &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt;, and unified key management across 20+ AI providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Routing LLM calls and tool calls through the same gateway puts model tokens and tool costs into one audit log under one access control model. That is the infrastructure shape production AI systems actually need. Teams already pairing Claude Code with Bifrost can consult the &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt; for workflow-specific implementation details, and teams evaluating broader terminal agent fit can review Bifrost's coverage of &lt;a href="https://www.getmaxim.ai/bifrost/resources/cli-agents" rel="noopener noreferrer"&gt;CLI coding agents&lt;/a&gt; beyond Claude Code.&lt;/p&gt;

&lt;h2&gt;Start Cutting Claude Code MCP Token Costs Today&lt;/h2&gt;

&lt;p&gt;Reducing MCP token costs for Claude Code is not a matter of stripping tools or shrinking capability. It is a matter of pushing tool governance and orchestration into the infrastructure tier where they belong. Bifrost's MCP gateway and Code Mode combine to deliver up to 92% token reduction on large tool catalogs while tightening access control and giving platform teams the cost visibility they need to run Claude Code at scale.&lt;/p&gt;

&lt;p&gt;Ready to cut your team's Claude Code token bill and put production-grade MCP governance in place? &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Real Cost of MCP in Claude Code, and How to Bring It Down</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:32:27 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/the-real-cost-of-mcp-in-claude-code-and-how-to-bring-it-down-2749</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/the-real-cost-of-mcp-in-claude-code-and-how-to-bring-it-down-2749</guid>
      <description>&lt;p&gt;&lt;em&gt;Bifrost's MCP gateway and Code Mode reduce MCP token costs for Claude Code by up to 92%, with centralized governance and per-tool cost visibility.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pattern is familiar to any team that has rolled Claude Code out beyond a single developer. Integrations multiply, MCP servers get wired in one by one, workflows genuinely improve, and then the API bill lands and nobody can quite explain the shape of it. The easy assumption is usage growth. The actual story is almost always tool overhead, and it has a structural cause: the Model Context Protocol loads tool schemas into context on every single request. Reducing MCP token costs for Claude Code at team scale isn't a matter of using the tool less. It's a matter of putting a governance and execution layer in the right place. Bifrost, the open-source AI gateway from Maxim AI, is designed for exactly that. This piece lays out where the costs actually come from, what Claude Code's native controls already handle, and how Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; with Code Mode reduces token consumption by up to 92% in production-scale workloads.&lt;/p&gt;

&lt;h2&gt;The Structural Source of MCP Token Costs&lt;/h2&gt;

&lt;p&gt;Unlike most context costs, MCP overhead isn't paid once per session. It's paid on every turn. Each MCP server Claude Code connects to injects its full tool schemas, every name, description, and parameter definition, into the model's context on every single message. Five servers with thirty tools each means 150 tool definitions shipped before the model has seen the user's prompt.&lt;/p&gt;

&lt;p&gt;Independent measurement has made the scale of this concrete. An analysis of real-world Claude Code sessions found that &lt;a href="https://www.jdhodges.com/blog/claude-code-mcp-server-token-costs/" rel="noopener noreferrer"&gt;a four-server configuration typically carries around 7,000 tokens of MCP overhead per message, with heavier setups crossing 50,000 tokens before a single prompt is typed&lt;/a&gt;. A separate breakdown reported &lt;a href="https://www.mindstudio.ai/blog/claude-code-mcp-server-token-overhead" rel="noopener noreferrer"&gt;multi-server Claude Code setups commonly adding 15,000 to 20,000 tokens of overhead per turn under usage-based billing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Three dynamics amplify the problem as teams grow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repetition on every message&lt;/strong&gt;: a 50-turn session pays the overhead 50 times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools you don't use still charge you&lt;/strong&gt;: a Playwright server's 22 browser tools load even during a Python edit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verbose descriptions by default&lt;/strong&gt;: most open-source MCP servers ship with long, readable descriptions that inflate the token cost of every definition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downstream effect isn't limited to the bill. Overhead crowds out the working context the model needs, pushes compaction earlier in the session, and degrades output quality on long tasks.&lt;/p&gt;

&lt;h2&gt;What Claude Code's Native Controls Cover&lt;/h2&gt;

&lt;p&gt;Anthropic has been responsive to this problem. &lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;Claude Code's cost management documentation&lt;/a&gt; covers tool search deferral, prompt caching, auto-compaction, model tiering, and custom preprocessing hooks. Tool search is the most relevant for MCP: once total tool definitions cross a threshold, Claude Code defers them, and only tool names remain in context until the model actually invokes one. In heavy sessions this alone can save 13,000+ tokens.&lt;/p&gt;

&lt;p&gt;For an individual developer running a few MCP servers locally, the native controls are sufficient. At team scale, three gaps remain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client-side optimization, not organizational control&lt;/strong&gt;: tool search deferral optimizes one session. It doesn't let a platform team define which tools a given developer, team, or customer integration is permitted to call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No orchestration layer&lt;/strong&gt;: even with deferral, every multi-step workflow still incurs schema loads, intermediate tool results, and model round-trips on every step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cross-team visibility&lt;/strong&gt;: per-session introspection is available to each developer, but there's no organizational view of which tools are consuming tokens across the team.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the problem shifts from "one developer's cost" to "fifty developers' governed MCP usage," the solution has to move into the infrastructure layer.&lt;/p&gt;

&lt;h2&gt;How Bifrost Reduces MCP Token Costs for Claude Code&lt;/h2&gt;

&lt;p&gt;Bifrost sits between Claude Code and the MCP servers a team depends on. Rather than Claude Code connecting to each server directly, it connects to Bifrost's single &lt;code&gt;/mcp&lt;/code&gt; endpoint. Bifrost handles discovery, governance, execution, and the orchestration model that produces the largest cost reduction: Code Mode.&lt;/p&gt;

&lt;p&gt;The impact is documented in &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost's MCP gateway cost benchmark&lt;/a&gt;. Across three controlled rounds, input tokens dropped by 58% with 96 tools connected, 84% with 251 tools, and 92% with 508 tools. Pass rate held at 100% throughout.&lt;/p&gt;

&lt;h3&gt;Code Mode: moving orchestration out of the prompt&lt;/h3&gt;

&lt;p&gt;Code Mode is the single most consequential shift. Instead of injecting every tool definition into context, Bifrost exposes connected MCP servers as a virtual filesystem of lightweight Python stub files. The model reads only the stubs it needs, writes a short Python script to orchestrate them, and Bifrost executes the script in a sandboxed Starlark interpreter.&lt;/p&gt;

&lt;p&gt;The model interacts with four meta-tools, regardless of how many MCP servers are connected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt;: discover the available servers and tools.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt;: load Python function signatures for a given server or tool.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt;: pull detailed documentation for a specific tool on demand.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt;: run the orchestration script against live bindings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The approach has broad industry validation. &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team documented the pattern with code execution and MCP&lt;/a&gt;, showing a Google Drive to Salesforce workflow dropping from 150,000 tokens to 2,000. &lt;a href="https://blog.cloudflare.com/code-mode/" rel="noopener noreferrer"&gt;Cloudflare observed the same exponential savings&lt;/a&gt; in their own implementation. Bifrost builds it natively into the gateway, uses Python instead of JavaScript for better LLM fluency, and adds the dedicated docs tool to compress context further.&lt;/p&gt;

&lt;p&gt;The savings compound as tool count grows. Classic MCP scales linearly with the number of tools connected. Code Mode's cost is bounded by what the model actually reads, so the curve flattens instead of accelerating.&lt;/p&gt;

&lt;h3&gt;Governance that directly reduces token exposure&lt;/h3&gt;

&lt;p&gt;Every request through Bifrost carries a virtual key, and each key is scoped to a specific set of tools. The scoping works at the tool level, not just the server level, so &lt;code&gt;filesystem_read&lt;/code&gt; can be granted without &lt;code&gt;filesystem_write&lt;/code&gt; from the same MCP server. The model only ever receives definitions for tools the key is allowed to call. Unauthorized tools don't load into context and don't cost tokens.&lt;/p&gt;
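&lt;p&gt;A minimal sketch of that scoping rule follows, with invented tool names and token counts. This illustrates the behavior described above, not Bifrost's internal data model.&lt;/p&gt;

```python
# Sketch: definitions outside a virtual key's scope are never rendered,
# so they contribute zero context tokens. All data here is illustrative.

TOOL_DEFINITIONS = {
    "filesystem_read":  {"tokens": 540},
    "filesystem_write": {"tokens": 610},
    "crm_search":       {"tokens": 720},
}

def visible_definitions(key_allowlist):
    # Only in-scope tools are serialized into the model's context.
    return {name: d for name, d in TOOL_DEFINITIONS.items()
            if name in key_allowlist}

def context_tokens(key_allowlist):
    return sum(d["tokens"] for d in visible_definitions(key_allowlist).values())

support_key = {"filesystem_read"}  # read granted, write denied
print(sorted(visible_definitions(support_key)))  # ['filesystem_read']
print(context_tokens(support_key))               # 540
```

&lt;p&gt;Because &lt;code&gt;filesystem_write&lt;/code&gt; never appears in the rendered set, it cannot be invoked by prompt injection and it never costs a token.&lt;/p&gt;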

&lt;p&gt;At organizational scale, &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;MCP Tool Groups&lt;/a&gt; make this manageable: a named collection of tools can be attached to any combination of keys, teams, customers, or providers. Bifrost resolves the right set at request time, indexed in memory and synced across cluster nodes, with no database query on the hot path.&lt;/p&gt;

&lt;h3&gt;
  
  
  A single endpoint with complete audit coverage
&lt;/h3&gt;

&lt;p&gt;All connected MCP servers sit behind one &lt;code&gt;/mcp&lt;/code&gt; endpoint. Claude Code connects once and sees every tool the virtual key allows. Adding new MCP servers to Bifrost surfaces them in Claude Code automatically, with no client-side change.&lt;/p&gt;

&lt;p&gt;That single endpoint is also where cost attribution becomes possible. Every tool execution logs as a first-class entry with tool name, server, arguments, result, latency, virtual key, and the parent LLM request, alongside token costs and per-tool costs for tools that call paid external APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Claude Code on Bifrost
&lt;/h2&gt;

&lt;p&gt;The integration is short because Bifrost runs as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement for existing SDKs&lt;/a&gt; and requires no application code changes.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Register MCP clients&lt;/strong&gt;: in the Bifrost dashboard, add each MCP server with its connection type (HTTP, SSE, or STDIO), endpoint, and any required headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable Code Mode&lt;/strong&gt;: toggle it on in the client settings. No schema changes, no redeployment. Token usage drops on the next request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure virtual keys and auto-execute&lt;/strong&gt;: create scoped &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; for each consumer. For autonomous agent loops, allowlist read-only tools while keeping writes behind approval gates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Point Claude Code at Bifrost&lt;/strong&gt;: add Bifrost as an MCP server in Claude Code's MCP settings using the gateway URL. Claude Code discovers the full tool set through that single connection.&lt;/li&gt;
&lt;/ol&gt;
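&lt;p&gt;As a rough illustration of step 4, a project-level MCP configuration entry might look like the following. The field names, URL, port, and header are placeholders; confirm the exact format against your Claude Code version's MCP settings documentation:&lt;/p&gt;

```python
import json

# Hypothetical config pointing Claude Code at Bifrost's single gateway
# endpoint. "bifrost", the URL, and the key value are placeholders.
config = {
    "mcpServers": {
        "bifrost": {
            "type": "http",
            "url": "http://localhost:8080/mcp",
            "headers": {"Authorization": "Bearer YOUR-VIRTUAL-KEY"},
        }
    }
}

print(json.dumps(config, indent=2))  # candidate contents for the settings file
```

&lt;p&gt;One entry replaces a per-server list: new MCP servers registered in Bifrost appear through this connection without touching the client config again.&lt;/p&gt;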

&lt;h2&gt;
  
  
  Measuring Impact at Team Scale
&lt;/h2&gt;

&lt;p&gt;Cost reductions only land with finance and platform leadership if they can be measured. Bifrost's &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; provides the data required for that conversation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token cost by virtual key, tool, and MCP server, tracked over time.&lt;/li&gt;
&lt;li&gt;Complete trace of every agent run: tools called, order, arguments, latency.&lt;/li&gt;
&lt;li&gt;Combined spend view showing LLM token costs and tool costs side by side.&lt;/li&gt;
&lt;li&gt;Native Prometheus metrics and &lt;a href="https://docs.getbifrost.ai/features/telemetry" rel="noopener noreferrer"&gt;OpenTelemetry (OTLP)&lt;/a&gt; integration for Grafana, Datadog, New Relic, and Honeycomb.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Teams evaluating Bifrost can also reference &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;published performance benchmarks&lt;/a&gt; showing 11µs of overhead at 5,000 RPS, and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; for a full capability comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wider Infrastructure Picture
&lt;/h2&gt;

&lt;p&gt;MCP without governance and cost control becomes unsustainable as soon as a team moves past a single developer's local setup. Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; addresses the full set of production concerns in one layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scoped access through virtual keys and per-tool filtering.&lt;/li&gt;
&lt;li&gt;Organizational governance with MCP Tool Groups.&lt;/li&gt;
&lt;li&gt;Complete audit trails suitable for SOC 2, GDPR, HIPAA, and ISO 27001.&lt;/li&gt;
&lt;li&gt;Per-tool cost visibility alongside LLM token usage.&lt;/li&gt;
&lt;li&gt;Code Mode to reduce context cost without reducing capability.&lt;/li&gt;
&lt;li&gt;The same gateway also handles LLM provider routing, &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt;, load balancing, &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt;, and unified key management across 20+ AI providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When model calls and tool calls flow through the same gateway, model tokens and tool costs sit in one audit log, under one access control model. Teams already running Claude Code on Bifrost can explore the &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt; for workflow-specific implementation detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing MCP Token Costs for Claude Code Under Control
&lt;/h2&gt;

&lt;p&gt;Reducing MCP token costs for Claude Code isn't about trimming tools or accepting a smaller capability surface. It's about moving governance and orchestration into the layer where they can actually scale. Bifrost's MCP gateway and Code Mode deliver up to 92% token reduction on large tool catalogs while giving platform teams the access control and cost attribution they need to run Claude Code across an engineering organization.&lt;/p&gt;

&lt;p&gt;To see how Bifrost fits against your own Claude Code deployment, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Cutting MCP Token Costs in Claude Code: A Practical Guide with Bifrost</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:31:10 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/cutting-mcp-token-costs-in-claude-code-a-practical-guide-with-bifrost-1716</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/cutting-mcp-token-costs-in-claude-code-a-practical-guide-with-bifrost-1716</guid>
      <description>&lt;p&gt;&lt;em&gt;Cut MCP token costs for Claude Code by up to 92% using Bifrost's MCP gateway and Code Mode. Here's how, and what Claude Code's built-ins miss.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you've wired more than a couple of MCP servers into Claude Code, you've probably seen the pattern: token counts climb faster than expected, &lt;code&gt;/context&lt;/code&gt; fills up before you've typed a prompt, and the API bill at month-end doesn't match how much "real work" the model did. The culprit isn't your tools. It's how the Model Context Protocol ships tool schemas into context on every single turn. To actually cut MCP token costs in Claude Code without throwing away capability, the fix has to live one layer deeper, at the gateway. This is where Bifrost, the open-source AI gateway by Maxim AI, comes in. This post walks through where the tokens really go, what Claude Code already does for you, and how Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; with Code Mode drops token use by up to 92% on large tool catalogs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where MCP Tokens Actually Disappear
&lt;/h2&gt;

&lt;p&gt;The thing most people miss about MCP is that tool definitions aren't loaded once per session. They're loaded once per message. Every MCP server you connect pushes its full schema (every tool name, every description, every parameter) into the model's context on every turn. Wire up five servers with thirty tools each and you're shipping 150 tool definitions before Claude Code even reads your prompt.&lt;/p&gt;

&lt;p&gt;The numbers are public and they're not small. A recent teardown found that &lt;a href="https://www.jdhodges.com/blog/claude-code-mcp-server-token-costs/" rel="noopener noreferrer"&gt;a typical four-server Claude Code setup carries around 7,000 tokens of MCP overhead per message, with heavier configurations crossing 50,000 tokens before the first prompt&lt;/a&gt;. A separate analysis pegged &lt;a href="https://www.mindstudio.ai/blog/claude-code-mcp-server-token-overhead" rel="noopener noreferrer"&gt;multi-server setups at 15,000 to 20,000 tokens of overhead per turn on usage-based billing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Three things make it worse the bigger your setup gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overhead is per-message, not per-session.&lt;/strong&gt; A 50-turn session pays the tax 50 times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unused tools still cost.&lt;/strong&gt; A Playwright server's 22 browser tools ride along even when you're editing Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptions are verbose by default.&lt;/strong&gt; Most OSS MCP servers ship human-readable descriptions that inflate every tool's token cost.&lt;/li&gt;
&lt;/ul&gt;
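&lt;p&gt;The first bullet is easy to quantify with the figures cited above:&lt;/p&gt;

```python
# Per-session cost of the per-message tax, using the "typical four-server
# setup" overhead figure cited above.
PER_MESSAGE_OVERHEAD = 7_000   # tokens of MCP overhead per message
TURNS = 50                     # a long working session

session_overhead = PER_MESSAGE_OVERHEAD * TURNS
print(session_overhead)  # 350000 tokens before any real work
```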

&lt;p&gt;And the spill-over hurts quality: overhead eats into the working context Claude actually needs, which pushes compaction earlier and makes long sessions flakier.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Code Already Does (And Where It Stops)
&lt;/h2&gt;

&lt;p&gt;Credit where it's due: Anthropic has shipped real optimizations for this. &lt;a href="https://code.claude.com/docs/en/costs" rel="noopener noreferrer"&gt;Claude Code's cost docs&lt;/a&gt; cover tool search deferral, prompt caching, auto-compaction, model tiering, and preprocessing hooks. Tool search is the big one: once your tool definitions exceed a threshold, Claude Code defers them and only tool names stay in context until Claude actually picks one up. Reported savings land in the 13,000-token range for heavy sessions.&lt;/p&gt;

&lt;p&gt;If you're a solo developer with two or three MCP servers running locally, this is enough. Where it runs out of road is at team scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client-side, not org-side.&lt;/strong&gt; Tool search deferral optimizes your session. It doesn't give a platform team control over which tools a given developer, team, or customer integration is actually allowed to call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No orchestration savings.&lt;/strong&gt; Even with deferral, every multi-step workflow still pays for intermediate tool results, model round-trips, and context reloads on each turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No shared visibility.&lt;/strong&gt; &lt;code&gt;/context&lt;/code&gt; and &lt;code&gt;/mcp&lt;/code&gt; are per-developer introspection tools. There's no view at the org level showing which tools across which teams are burning tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Past a certain scale, the question stops being "how do I trim my own session?" and starts being "how do I govern MCP for a team of fifty?" That needs an infrastructure layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Cuts MCP Token Costs in Claude Code
&lt;/h2&gt;

&lt;p&gt;Bifrost drops between Claude Code and your MCP servers. Claude Code stops connecting to each server directly and instead talks to Bifrost's single &lt;code&gt;/mcp&lt;/code&gt; endpoint. Bifrost handles discovery, governance, execution, and, most importantly, the execution pattern that actually crushes token cost: Code Mode.&lt;/p&gt;

&lt;p&gt;The benchmark numbers from &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost's MCP gateway cost study&lt;/a&gt; are worth reading in full, but the short version: input tokens fell 58% at 96 tools, 84% at 251 tools, and 92% at 508 tools, with pass rate holding at 100% across all rounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Mode is the part that moves the needle
&lt;/h3&gt;

&lt;p&gt;Code Mode is the single biggest lever. Rather than injecting tool definitions into context, Bifrost exposes your connected MCP servers as a virtual filesystem of lightweight Python stub files. The model reads only the stubs it actually needs, writes a short Python script to chain the tools together, and Bifrost runs that script in a sandboxed Starlark interpreter.&lt;/p&gt;

&lt;p&gt;The model sees four meta-tools, period, regardless of whether you have 6 MCP servers or 60:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt;: list the servers and tools available.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt;: load Python signatures for a server or tool.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt;: fetch documentation for a specific tool on demand.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt;: run the orchestration script against live bindings.&lt;/li&gt;
&lt;/ul&gt;
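&lt;p&gt;For a sense of what the model reads instead of full JSON schemas, a stub file returned by &lt;code&gt;readToolFile&lt;/code&gt; might look roughly like this. The function names, signatures, and docstrings are invented for illustration:&lt;/p&gt;

```python
# Hypothetical stub file for one MCP server: compact Python signatures
# with one-line docstrings in place of verbose JSON schemas.

def filesystem_read(path: str) -> str:
    """Read a file and return its contents."""
    raise NotImplementedError  # stub only; calls run via live bindings

def filesystem_list(path: str, recursive: bool = False) -> list:
    """List directory entries under path."""
    raise NotImplementedError

def github_create_issue(repo: str, title: str, body: str = "") -> dict:
    """Open a GitHub issue and return its metadata."""
    raise NotImplementedError
```

&lt;p&gt;A signature plus a one-liner is a fraction of the tokens of a full schema, and the model only loads the stubs relevant to the task at hand.&lt;/p&gt;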

&lt;p&gt;The pattern has independent validation. &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;Anthropic's engineering team wrote about this approach&lt;/a&gt;, showing a Google Drive to Salesforce workflow dropping from 150,000 tokens to 2,000. &lt;a href="https://blog.cloudflare.com/code-mode/" rel="noopener noreferrer"&gt;Cloudflare reported a similarly steep savings curve&lt;/a&gt; with their own implementation. Bifrost builds it natively into the gateway, picks Python over JavaScript (better LLM fluency), and adds the dedicated docs tool to compress context even further.&lt;/p&gt;

&lt;p&gt;The payoff compounds the more MCP servers you add. Classic MCP scales linearly with tool count; every server you add is more overhead. Code Mode is bounded by what the model actually reads, so the curve flattens instead of climbing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Virtual keys: if a tool shouldn't be callable, don't load it
&lt;/h3&gt;

&lt;p&gt;Every request through Bifrost carries a virtual key, and each key is scoped to a specific set of tools. The scoping is per-tool, not per-server, so you can grant &lt;code&gt;filesystem_read&lt;/code&gt; without granting &lt;code&gt;filesystem_write&lt;/code&gt; from the same MCP server. The model only ever sees definitions for tools the key allows. Tools outside the scope don't show up, don't load, don't cost tokens.&lt;/p&gt;

&lt;p&gt;At team scale, &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;MCP Tool Groups&lt;/a&gt; take this further: define a named collection of tools once, then attach it to any combination of keys, teams, customers, or providers. Resolution happens in-memory at request time, synced across cluster nodes, no database query on the hot path.&lt;/p&gt;

&lt;h3&gt;
  
  
  One endpoint, one audit log
&lt;/h3&gt;

&lt;p&gt;All connected MCP servers sit behind a single &lt;code&gt;/mcp&lt;/code&gt; endpoint. Claude Code connects once and sees every tool the key permits. Add a new MCP server in Bifrost later and it shows up in Claude Code automatically, no client-side config change required.&lt;/p&gt;

&lt;p&gt;That single endpoint is also where cost observability actually becomes possible. Every tool execution logs as a first-class entry: tool name, server, arguments, result, latency, virtual key, and the parent LLM request, with token and per-tool costs sitting side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Claude Code Running on Bifrost
&lt;/h2&gt;

&lt;p&gt;The setup takes a few minutes, and your app code doesn't change because Bifrost is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in SDK replacement&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Register your MCP clients.&lt;/strong&gt; In the Bifrost dashboard, add each MCP server with its connection type (HTTP, SSE, or STDIO), endpoint, and any required headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turn on Code Mode.&lt;/strong&gt; One toggle in the client settings. No redeployment, no schema changes. Token usage drops on the next request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure virtual keys and auto-execute.&lt;/strong&gt; Create scoped &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt; for each consumer. For autonomous loops, allowlist read-only tools while keeping writes behind approval gates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Point Claude Code at Bifrost.&lt;/strong&gt; Open Claude Code's MCP settings and add Bifrost as an MCP server using the gateway URL. Claude Code now sees a governed, token-efficient view of every MCP tool the key permits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the full path from vanilla Claude Code to governed MCP with Code Mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring What You Actually Saved
&lt;/h2&gt;

&lt;p&gt;Cutting MCP token costs only matters if you can prove it to whoever pays the bill. Bifrost's &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; gives you the numbers that decision-makers ask for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token cost by virtual key, by tool, and by MCP server, over time.&lt;/li&gt;
&lt;li&gt;Full trace of every agent run: tools called, order, arguments, latency.&lt;/li&gt;
&lt;li&gt;Combined spend view with LLM tokens and tool costs side by side.&lt;/li&gt;
&lt;li&gt;Prometheus metrics and &lt;a href="https://docs.getbifrost.ai/features/telemetry" rel="noopener noreferrer"&gt;OpenTelemetry (OTLP)&lt;/a&gt; for Grafana, Datadog, Honeycomb, or New Relic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For broader context on gateway performance and evaluation criteria, &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;Bifrost's benchmarks&lt;/a&gt; document 11µs overhead at 5,000 RPS, and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; covers the full capability matrix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture: Production MCP Needs a Gateway
&lt;/h2&gt;

&lt;p&gt;MCP without a governance layer doesn't survive the transition from "one developer's local setup" to "fifty engineers shipping to production." Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt; is the layer that makes that transition possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scoped access via virtual keys and per-tool filtering.&lt;/li&gt;
&lt;li&gt;Org-scale governance with MCP Tool Groups.&lt;/li&gt;
&lt;li&gt;Complete audit trails for SOC 2, GDPR, HIPAA, and ISO 27001.&lt;/li&gt;
&lt;li&gt;Per-tool cost visibility alongside LLM token usage.&lt;/li&gt;
&lt;li&gt;Code Mode to slash context cost without cutting capability.&lt;/li&gt;
&lt;li&gt;The same gateway also handles LLM provider routing, &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt;, load balancing, &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt;, and unified key management across 20+ AI providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model tokens and tool costs end up in one audit log, under one access control model. Teams already running Claude Code on Bifrost can check the &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt; for workflow-specific details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start Cutting MCP Token Costs for Claude Code
&lt;/h2&gt;

&lt;p&gt;The way to cut MCP token costs in Claude Code isn't to trim tools and accept less capability. It's to move governance and orchestration into the gateway, where they belong. Bifrost's MCP gateway plus Code Mode delivers up to 92% token reduction on large catalogs while giving platform teams the access control and visibility they need to run Claude Code at scale.&lt;/p&gt;

&lt;p&gt;To see what Bifrost looks like against your own Claude Code setup, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stop Burning Tokens: How an MCP Gateway Fixes Claude Code and Codex CLI Cost Leaks</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 04:28:27 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/stop-burning-tokens-how-an-mcp-gateway-fixes-claude-code-and-codex-cli-cost-leaks-ik0</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/stop-burning-tokens-how-an-mcp-gateway-fixes-claude-code-and-codex-cli-cost-leaks-ik0</guid>
      <description>&lt;p&gt;&lt;em&gt;Bifrost MCP Gateway cuts coding agent token costs by up to 92% using Code Mode, virtual keys, and on-demand tool loading. Here's how it works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you run Claude Code or Codex CLI against more than a couple of MCP servers, your token bill is quietly inflating. Every turn of the agent loop resends the complete tool catalog into the model's context, whether the agent needs those tools or not. Bifrost MCP Gateway solves this at the infrastructure layer by exposing tools on demand through &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;, scoping access with &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys&lt;/a&gt;, and consolidating every MCP server behind one endpoint. In controlled benchmarks across 16 servers and 508 tools, input tokens dropped 92.8% while pass rate stayed at 100%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tool catalog problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;Here is what the classic MCP execution model does under the hood. Every tool exposed by every connected MCP server is serialized into the model's context on every request. Connect five servers with thirty tools each, and you are pushing 150 tool schemas before the prompt even gets read. Connect sixteen servers with 500 tools, and the model is spending more of its token budget reading a catalog than actually reasoning about your code.&lt;/p&gt;

&lt;p&gt;Anthropic's engineering team called this out in their writeup on &lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;code execution with MCP&lt;/a&gt;. They documented a Google Drive to Salesforce workflow where context usage fell from 150,000 tokens to 2,000 when tools were loaded on demand instead of dumped upfront. The same economics hit every Claude Code or Codex CLI user who wires up a serious fleet of MCP servers.&lt;/p&gt;

&lt;p&gt;The side effects compound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference cost scales with your MCP footprint, not with the work the agent actually does.&lt;/li&gt;
&lt;li&gt;Agent latency grows as the tool catalog grows, because more tokens need to be read before reasoning begins.&lt;/li&gt;
&lt;li&gt;Tool selection accuracy degrades when the model has to disambiguate the right tool from dozens of irrelevant ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude Code's docs acknowledge this pressure directly, noting that &lt;a href="https://code.claude.com/docs/en/mcp" rel="noopener noreferrer"&gt;tool search is on by default&lt;/a&gt; to reduce the problem. But client-side heuristics do not fix the underlying architecture, especially when multiple teams and agents share the same tool fleet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this costs in practice
&lt;/h2&gt;

&lt;p&gt;A typical coding agent setup looks something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filesystem MCP server for code access.&lt;/li&gt;
&lt;li&gt;GitHub MCP server for PR and issue management.&lt;/li&gt;
&lt;li&gt;A handful of internal tool servers for databases, CI, and ops.&lt;/li&gt;
&lt;li&gt;Each server exposing anywhere from ten to fifty tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A moderately complex task runs six to ten turns in the agent loop. With 150 tool definitions averaging a few hundred tokens each, a single task can burn 300K input tokens on schemas alone before producing a useful line of output. Multiply by hundreds of daily runs per engineer and the spend gets uncomfortable fast.&lt;/p&gt;
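&lt;p&gt;The arithmetic behind that estimate, with the assumed averages spelled out:&lt;/p&gt;

```python
# Rough check on the schema burn described above, using mid-range
# assumptions from the text.
TOOL_DEFS = 150        # five servers x thirty tools
TOKENS_PER_DEF = 250   # "a few hundred tokens each"
TURNS = 8              # mid-range of the six-to-ten-turn loop

schema_tokens = TOOL_DEFS * TOKENS_PER_DEF * TURNS
print(schema_tokens)  # 300000 input tokens on schemas alone
```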

&lt;h2&gt;
  
  
  How Bifrost MCP Gateway fixes the token leak
&lt;/h2&gt;

&lt;p&gt;Bifrost is the open-source AI gateway by Maxim AI, written in Go, with 11 microseconds of overhead at 5,000 RPS. It runs as both an MCP client (connecting upstream to your tool servers) and an MCP server (exposing a unified &lt;a href="https://docs.getbifrost.ai/mcp/gateway-url" rel="noopener noreferrer"&gt;&lt;code&gt;/mcp&lt;/code&gt; endpoint&lt;/a&gt; to Claude Code, Codex CLI, Cursor, and anything else that speaks MCP). The cost reduction comes from three layers, not one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Code Mode replaces schema dumps with stub files
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; is the main mechanism. Instead of injecting every tool definition into the context, Bifrost presents connected servers as a virtual filesystem of compact Python stub files. The model works with just four meta-tools and navigates the catalog on demand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt;: list which servers and tools are available&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt;: load Python function signatures for a specific server or tool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt;: pull detailed documentation for a single tool when needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt;: run an orchestration script against live tool bindings inside a sandboxed Starlark interpreter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Workflow: the model reads only the stubs it needs, writes a short script that chains several tool calls, and submits that script through &lt;code&gt;executeToolCode&lt;/code&gt;. Bifrost runs it in the sandbox, executes the chain, and returns only the final result. Intermediate results never touch the model's context.&lt;/p&gt;

&lt;p&gt;Code Mode supports two binding levels. Server-level binding bundles every tool from one server into a single stub file (efficient for servers with modest tool counts). Tool-level binding gives each tool its own stub (useful when a server exposes thirty-plus tools with rich schemas). Both use the same four-meta-tool interface, so the switch is a configuration flag, not a rewrite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Tool filtering scopes what each agent sees
&lt;/h3&gt;

&lt;p&gt;Not every Claude Code session or Codex CLI instance needs the same tool surface. Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;tool filtering&lt;/a&gt; lets you define, per virtual key, exactly which tools are exposed. A CI agent running unattended can get a read-only subset. An interactive Claude Code session for a senior engineer can get the full surface. The model literally never sees definitions for tools outside its scope, so there is no prompt-level workaround and no wasted context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: One endpoint for every connected server
&lt;/h3&gt;

&lt;p&gt;Teams stop maintaining MCP configs inside each coding agent. You point Claude Code or Codex CLI at Bifrost's &lt;code&gt;/mcp&lt;/code&gt; endpoint and it discovers every upstream server through one connection, governed by the virtual key attached to the request. Add a new server to Bifrost and every connected coding agent picks it up automatically, no config changes required on the client side.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark numbers
&lt;/h2&gt;

&lt;p&gt;Bifrost ran three controlled benchmark rounds with Code Mode on and off, scaling tool count between rounds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Tools × Servers&lt;/th&gt;
&lt;th&gt;Input Tokens (OFF)&lt;/th&gt;
&lt;th&gt;Input Tokens (ON)&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Cost Reduction&lt;/th&gt;
&lt;th&gt;Pass Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;96 tools · 6 servers&lt;/td&gt;
&lt;td&gt;19.9M&lt;/td&gt;
&lt;td&gt;8.3M&lt;/td&gt;
&lt;td&gt;−58.2%&lt;/td&gt;
&lt;td&gt;−55.7%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;251 tools · 11 servers&lt;/td&gt;
&lt;td&gt;35.7M&lt;/td&gt;
&lt;td&gt;5.5M&lt;/td&gt;
&lt;td&gt;−84.5%&lt;/td&gt;
&lt;td&gt;−83.4%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;508 tools · 16 servers&lt;/td&gt;
&lt;td&gt;75.1M&lt;/td&gt;
&lt;td&gt;5.4M&lt;/td&gt;
&lt;td&gt;−92.8%&lt;/td&gt;
&lt;td&gt;−92.2%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two takeaways matter here. First, the savings compound rather than growing linearly, because classic MCP's cost scales with the total number of connected tools while Code Mode's cost scales with what the model actually reads. The bigger your MCP footprint, the bigger the delta. Second, accuracy held at 100% across all three rounds, so this is not a capability-for-cost trade. The full methodology and raw results are in the &lt;a href="https://github.com/maximhq/bifrost-benchmarking/blob/main/mcp-code-mode-benchmark/benchmark_report.md" rel="noopener noreferrer"&gt;Bifrost MCP Code Mode benchmarks repo&lt;/a&gt;.&lt;/p&gt;
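&lt;p&gt;The reduction column can be sanity-checked from the token totals in the table; tiny differences come from those totals being rounded to one decimal place of a million:&lt;/p&gt;

```python
# Recompute the token-reduction column from the (rounded) totals above,
# in millions of input tokens: round -> (Code Mode OFF, Code Mode ON).
rounds = {1: (19.9, 8.3), 2: (35.7, 5.5), 3: (75.1, 5.4)}

for r, (off, on) in rounds.items():
    reduction = 100 * (1 - on / off)
    print(r, f"{reduction:.1f}%")
```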

&lt;p&gt;For context on how Code Mode combines with access control and per-tool cost tracking, the &lt;a href="https://www.getmaxim.ai/bifrost/blog/bifrost-mcp-gateway-access-control-cost-governance-and-92-lower-token-costs-at-scale" rel="noopener noreferrer"&gt;Bifrost MCP Gateway deep-dive&lt;/a&gt; goes further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wiring it up
&lt;/h2&gt;

&lt;p&gt;The full configuration walkthroughs live in the &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt; and the &lt;a href="https://docs.getbifrost.ai/cli-agents/codex-cli" rel="noopener noreferrer"&gt;Codex CLI integration guide&lt;/a&gt;. The short version:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run Bifrost locally or inside your VPC and add your MCP servers through the dashboard. HTTP, SSE, and STDIO transports are all supported.&lt;/li&gt;
&lt;li&gt;Toggle Code Mode on at the client level. No redeployment, no schema rewrites.&lt;/li&gt;
&lt;li&gt;Create a virtual key per consumer (a developer, a CI bot, a customer integration) and attach the tool set it is allowed to call.&lt;/li&gt;
&lt;li&gt;Point Claude Code or Codex CLI at the Bifrost &lt;code&gt;/mcp&lt;/code&gt; endpoint using that virtual key.&lt;/li&gt;
&lt;li&gt;For multi-team setups, use &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;MCP Tool Groups&lt;/a&gt; to manage access at team or customer scope instead of per individual key.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once traffic starts flowing, every tool call becomes a first-class log entry: tool name, source server, arguments, result, latency, originating virtual key, and parent LLM request. LLM token costs and per-tool execution costs sit next to each other, so spend attribution stops being guesswork.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you pick up along the way
&lt;/h2&gt;

&lt;p&gt;Lower token costs are the headline, but coding agents running through Bifrost MCP Gateway also get infrastructure most teams eventually build themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scoped access&lt;/strong&gt;: every agent sees only the tools it should see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trails&lt;/strong&gt;: every tool execution is logged with arguments and results, useful for security review and debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health monitoring&lt;/strong&gt;: automatic reconnects on upstream failure, with periodic refresh to pick up new tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth 2.0 with PKCE&lt;/strong&gt;: including dynamic client registration and auto token refresh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified model routing&lt;/strong&gt;: the same gateway handles provider routing, failover, and load balancing across 20+ LLM providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More deployment-specific guidance is on the &lt;a href="https://www.getmaxim.ai/bifrost/resources/mcp-gateway" rel="noopener noreferrer"&gt;Bifrost MCP gateway resource page&lt;/a&gt; and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/claude-code" rel="noopener noreferrer"&gt;Claude Code integration resource&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting off the token treadmill
&lt;/h2&gt;

&lt;p&gt;If your Claude Code or Codex CLI setup is quietly burning tokens on tool catalogs every turn, the leak is architectural, not a configuration problem. Bifrost MCP Gateway closes it by loading tools on demand, scoping access per consumer, and consolidating every connected server behind one endpoint, without sacrificing accuracy or capability.&lt;/p&gt;

&lt;p&gt;To see how Bifrost can cut token costs across your coding agent fleet, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI Cost Observability Tools in 2026: A Practical Comparison</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 13 Apr 2026 06:19:53 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/ai-cost-observability-tools-in-2026-a-practical-comparison-21bn</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/ai-cost-observability-tools-in-2026-a-practical-comparison-21bn</guid>
      <description>&lt;p&gt;&lt;em&gt;Compare the top AI cost observability tools in 2026. From gateway-level LLM spend tracking to trace-level token attribution, find the right platform for your team.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most AI teams discover their LLM cost problem the same way: a billing alert, a surprised finance team, or a month-end review where the numbers are meaningfully larger than expected. By that point, the relevant requests have already been served, the tokens have been consumed, and the conversation about ownership and attribution starts from a deficit.&lt;/p&gt;

&lt;p&gt;In 2026, managing AI cost has become a first-order operational problem. Multi-provider stacks, multi-team access to shared model capacity, and increasingly complex agentic workflows have made LLM spend both harder to predict and harder to contain. The tools that address this problem fall into two distinct approaches: gateway platforms that govern spend at the infrastructure layer, and observability platforms that reconstruct cost attribution from trace data after the fact. Understanding both approaches, and knowing which your team actually needs, is the starting point for any serious AI cost observability strategy.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is AI Cost Observability?
&lt;/h2&gt;

&lt;p&gt;AI cost observability refers to the discipline of instrumenting LLM systems so that token usage, inference spend, model selection decisions, and cost attribution are continuously visible across every dimension that matters: team, application, environment, customer, and provider.&lt;/p&gt;

&lt;p&gt;Traditional cloud FinOps operates at the billing aggregate. AI cost observability operates at the request. The difference matters because aggregate visibility tells you that costs are high; request-level visibility tells you why, and which part of your system to address.&lt;/p&gt;

&lt;p&gt;A production-grade AI cost observability stack typically provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token tracking per request, broken down by model and provider&lt;/li&gt;
&lt;li&gt;Cost attribution by team, feature, environment, or end customer&lt;/li&gt;
&lt;li&gt;Budget enforcement with hard limits that block requests before thresholds are exceeded&lt;/li&gt;
&lt;li&gt;Cost-aware routing that shifts traffic to cheaper models or providers under budget pressure&lt;/li&gt;
&lt;li&gt;Historical spend analysis through searchable trace logs and cost dashboards&lt;/li&gt;
&lt;/ul&gt;
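&lt;p&gt;To make request-level visibility concrete: each call ends up as a record carrying roughly the fields below. The shape is illustrative, not any particular platform's schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "request_id": "req_01",
  "team": "platform",
  "provider": "anthropic",
  "model": "claude-sonnet-4-5",
  "input_tokens": 1840,
  "output_tokens": 312,
  "cost_usd": 0.0121,
  "latency_ms": 940
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Aggregate billing can tell you the monthly total; records like this are what let you group spend by team, feature, or customer after the fact.&lt;/p&gt;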

&lt;p&gt;The tools reviewed below serve different portions of this stack, and most teams operating at scale will use more than one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bifrost: Gateway-Level LLM Cost Control
&lt;/h2&gt;

&lt;p&gt;Bifrost is an open-source AI gateway that routes requests across &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ LLM providers&lt;/a&gt; through a single OpenAI-compatible interface. Among all the tools reviewed here, it is the only one that handles cost governance at the infrastructure layer: every request passes through Bifrost's governance system before reaching a provider, and budget enforcement happens in the request path, not as a downstream alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hierarchical Budget Management
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance system&lt;/a&gt; in Bifrost structures budgets across a four-level hierarchy: Customer, Team, Virtual Key, and Provider Config. Every applicable budget is checked independently before a request is forwarded. An engineering team capped at $500 per month will be blocked when that ceiling is reached, even if individual virtual keys within that team still carry unused balance.&lt;/p&gt;

&lt;p&gt;This is the critical distinction between gateway-level and observability-layer cost management. Observability platforms record what was spent; Bifrost enforces what can be spent before it happens.&lt;/p&gt;

&lt;p&gt;Rate limits complement budgets at the &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key level&lt;/a&gt;, where teams configure both request-frequency limits and token-volume limits. A virtual key capped at 50,000 tokens per hour enforces that limit across any model or provider it routes to, whether that is GPT-4o, Claude, Gemini, or a Bedrock deployment.&lt;/p&gt;
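&lt;p&gt;As a sketch of how a budget and a rate limit compose on a single key, using the figures above (field names are illustrative rather than Bifrost's exact schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "virtual_key": "bf-platform-team",
  "budget": { "max_limit_usd": 500, "reset_duration": "1M" },
  "rate_limit": { "token_max_limit": 50000, "token_reset_duration": "1h" }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;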

&lt;h3&gt;
  
  
  Cost-Aware Model Routing
&lt;/h3&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing rules&lt;/a&gt; allow budget state to influence model selection automatically. A virtual key can be configured to send requests to a higher-capability model under normal conditions and route to a more economical alternative as budget utilization rises. Regional data residency requirements and pricing differentials across providers can be encoded as routing policy, with no application code changes required.&lt;/p&gt;
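&lt;p&gt;A rule of this kind is expressed as a CEL condition over budget state. A minimal sketch, with the fallback provider and model chosen purely for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "name": "Budget Fallback to Cheaper Model",
  "cel_expression": "budget_used &amp;gt; 85",
  "targets": [
    { "provider": "groq", "model": "llama-3.3-70b-versatile", "weight": 1 }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;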

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/adaptive-load-balancing" rel="noopener noreferrer"&gt;Adaptive load balancing&lt;/a&gt;, available in Bifrost Enterprise, extends this by routing in real time based on provider latency and error rates, reducing the cost associated with retries and degraded provider performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Caching for Spend Reduction
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt; eliminates provider calls for requests that are semantically equivalent to a prior cached query. When a match is found, Bifrost returns the cached response without a provider round-trip. For workloads with repeated or structurally similar queries, this reduces token spend directly, without any changes to prompt design or application architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Integration
&lt;/h3&gt;

&lt;p&gt;Bifrost emits &lt;a href="https://docs.getbifrost.ai/features/telemetry" rel="noopener noreferrer"&gt;real-time telemetry&lt;/a&gt; with native Datadog integration for APM traces, LLM observability metrics, and spend data. Prometheus metrics are available via scraping or Push Gateway for Grafana-based monitoring. &lt;a href="https://docs.getbifrost.ai/enterprise/log-exports" rel="noopener noreferrer"&gt;Log exports&lt;/a&gt; push request logs and cost telemetry to external storage and data lake destinations.&lt;/p&gt;
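&lt;p&gt;A quick sanity check on the Prometheus surface is to curl the metrics endpoint directly; the port and path below assume a default local deployment and may differ in yours:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Assumes Bifrost's default local port; adjust for your deployment
curl -s http://localhost:8080/metrics | head
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;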

&lt;p&gt;At 5,000 requests per second, Bifrost adds only 11 µs of overhead per request. The governance and observability layer operates without becoming a throughput constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Platform and infrastructure teams managing LLM access across multiple teams or customer tenants, who need budget enforcement, cost-aware routing, and spend attribution operating at the infrastructure layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Langfuse: Trace-Level Cost Attribution
&lt;/h2&gt;

&lt;p&gt;Langfuse is an open-source LLM observability platform that records each provider call as a trace, attaching token counts, model, latency, and estimated cost to every span. Because cost, quality, and performance data share the same data model, teams can run joint queries across all three dimensions without assembling data from separate systems.&lt;/p&gt;

&lt;p&gt;Langfuse's primary value for cost management is attribution depth. Spend can be viewed at the level of a single request, a user session, a specific application feature, or any custom metadata dimension attached to the trace at instrumentation time. Engineering teams can identify which product areas are generating disproportionate token spend without building custom logging pipelines.&lt;/p&gt;

&lt;p&gt;What Langfuse does not provide is enforcement. It has no mechanism to block requests or halt a workflow when a budget ceiling is reached. Teams that want that control will need a gateway running upstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that need request-level cost attribution combined with quality and latency data in a single platform, and who will manage budget enforcement through a separate gateway layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Arize Phoenix: ML Observability with Cost Tracking
&lt;/h2&gt;

&lt;p&gt;Arize Phoenix is an open-source observability framework designed for production monitoring of LLM and ML systems. Its core capabilities cover prompt and completion tracing, token usage dashboards, and cost attribution across models and providers.&lt;/p&gt;

&lt;p&gt;Phoenix is particularly strong in analysis workflows. Its embedding monitoring, anomaly detection, and clustering tools are well-suited to teams running retrieval-augmented generation pipelines, where retrieval quality and inference cost are related variables. Identifying expensive low-quality outputs, where high token spend produced poor results, is a natural Phoenix use case.&lt;/p&gt;

&lt;p&gt;Phoenix surfaces cost data as part of its analysis workflow but does not act on it. Budget enforcement and cost-aware routing are outside the platform's scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams running RAG pipelines or ML-intensive systems who want cost as a signal within a broader quality and performance analysis workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  LangSmith: Cost Visibility in the LangChain Ecosystem
&lt;/h2&gt;

&lt;p&gt;LangSmith is the native observability and debugging layer for LangChain. It captures traces at the chain, agent, and LLM call level, attaching token counts and cost estimates to every span in the execution tree.&lt;/p&gt;

&lt;p&gt;For teams building with LangChain or LangGraph, LangSmith provides the lowest-friction instrumentation path. The trace explorer handles multi-step agent workflows well, which matters for teams debugging cost compounding across sequential tool calls and reasoning steps.&lt;/p&gt;

&lt;p&gt;Teams working outside the LangChain ecosystem will find the integration overhead higher and the cost attribution less automatic. LangSmith is framework-native by design, and that is both its strength and its boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams building LangChain or LangGraph agents who need framework-native cost tracing and debugging without additional tooling overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Datadog LLM Observability: Cost Inside Your Existing APM Stack
&lt;/h2&gt;

&lt;p&gt;Datadog's LLM Observability module records LLM calls as traces within the Datadog APM platform, tagging each span with token counts, cost, latency, and error data. For teams already operating Datadog for infrastructure and application monitoring, this path avoids introducing a new platform. AI cost data arrives in the same environment as the rest of the system's telemetry.&lt;/p&gt;

&lt;p&gt;The consolidation advantage is real: a cost spike in an LLM call can be linked directly to the application behavior and infrastructure state that produced it, using existing Datadog tooling. The limitation is that Datadog is an infrastructure observability platform first. AI output quality evaluation and cross-functional evaluation workflows are add-on considerations rather than native capabilities. Teams that need cost monitoring alongside quality measurement will typically need a purpose-built AI observability tool alongside Datadog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams already running Datadog who want AI cost tracking integrated into their existing stack without operating a separate platform.&lt;/p&gt;




&lt;h2&gt;
  
  
  Weights &amp;amp; Biases Weave: Cost in the ML Experiment Context
&lt;/h2&gt;

&lt;p&gt;Weights &amp;amp; Biases offers LLM cost tracking through Weave, embedding token usage and spend data alongside model experiments, prompt comparison runs, and evaluation workflows. The platform is most useful for teams treating cost as one variable in a multi-objective optimization that also covers output quality and latency.&lt;/p&gt;

&lt;p&gt;The user experience is oriented toward researchers and ML practitioners. Traces are explored in the context of an experiment or evaluation run, and production monitoring is secondary to the experiment-tracking workflow. Real-time enforcement is not part of the platform's design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; ML research teams and teams running systematic prompt and model evaluation who want cost as an optimization dimension in their experimentation workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing the Right AI Cost Observability Tool
&lt;/h2&gt;

&lt;p&gt;The right tool for a given team depends on where the cost visibility problem actually sits in the stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If teams are exceeding LLM budgets with no enforcement in place:&lt;/strong&gt; begin at the gateway. Trace observability has limited value when spend is uncontrolled at the infrastructure layer. Bifrost provides the enforcement foundation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If costs are bounded but attribution is unclear&lt;/strong&gt; (which features, users, or workflows are expensive): layer in a trace-level platform such as Langfuse or Arize Phoenix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the team is already on Datadog&lt;/strong&gt; and needs AI spend data correlated with system performance: the LLM Observability module is the path of least friction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If the stack is LangChain-native:&lt;/strong&gt; LangSmith is the natural starting point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most production teams operating across multiple providers and multiple internal consumers, gateway-level governance is the prerequisite that makes downstream observability useful. Trace observability explains the distribution of past costs. Gateway enforcement shapes future ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Bifrost Fits Into an AI Cost Observability Stack
&lt;/h2&gt;

&lt;p&gt;Every spend decision in an LLM-powered system begins with a request. Bifrost intercepts each one and runs governance checks (budget validation, rate limit enforcement, routing logic) at under 11 µs of added latency. Control happens before cost is incurred, not after.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual keys system&lt;/a&gt; provides the attribution scaffold. Each key maps to a position in the governance hierarchy (team, customer, or standalone) and carries its own budget, model restrictions, and spend tracking. Allocations reset at calendar-aligned boundaries. Teams that exhaust their allocation stop sending requests until the next period.&lt;/p&gt;

&lt;p&gt;Downstream observability infrastructure connects through native integrations and &lt;a href="https://docs.getbifrost.ai/enterprise/log-exports" rel="noopener noreferrer"&gt;log exports&lt;/a&gt;. Cost data flows into Datadog dashboards, Prometheus alert rules, and data lake pipelines through Bifrost's telemetry layer, with no need to rebuild the analytics infrastructure that teams already operate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt; and cost-aware routing extend Bifrost's role from governance to active optimization: eliminating redundant provider calls and shifting traffic to lower-cost options when budget conditions warrant it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started with Bifrost
&lt;/h2&gt;

&lt;p&gt;For teams managing LLM spend across multiple providers, teams, or products, Bifrost provides the infrastructure-layer foundation for AI cost observability. Budget policies, team allocations, and routing logic are configurable through the Bifrost web UI. Existing observability stacks connect through native Datadog and Prometheus integrations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; to see how Bifrost fits your AI cost observability requirements, or review the &lt;a href="https://docs.getbifrost.ai/overview" rel="noopener noreferrer"&gt;Bifrost documentation&lt;/a&gt; to explore governance configuration for your LLM infrastructure.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Enterprise LLM Gateway for Cost Tracking in Coding Agents</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 13 Apr 2026 06:18:26 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/enterprise-llm-gateway-for-cost-tracking-in-coding-agents-1m22</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/enterprise-llm-gateway-for-cost-tracking-in-coding-agents-1m22</guid>
      <description>&lt;p&gt;&lt;em&gt;Coding agents generate dozens of LLM calls per session. Here is how enterprise teams use a gateway to track, attribute, and control that spend before it becomes a problem.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you run Claude Code or Codex CLI across an engineering team, you already know the pattern: one developer instruction spirals into a sequence of autonomous API calls covering file reads, terminal commands, code edits, and context syncs, each one hitting a high-cost model like Claude Opus or GPT-4o. At individual scale that is manageable. Across a team running agents all day, it compounds into one of the steepest-climbing line items in your infrastructure spend.&lt;/p&gt;

&lt;p&gt;The deeper issue is not the amount spent but that no one knows where the money is going. When coding agents call provider APIs directly, there is no shared view of per-team consumption, no mechanism to enforce a spending ceiling, and no way to connect token usage to a specific team, project, or tool configuration. The bill arrives at the end of the month as a surprise.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;enterprise LLM gateway&lt;/strong&gt; sits between your agents and your providers, capturing every request as it passes through. It attributes spend to the right team or project, enforces configurable budget limits, and can reroute requests to lower-cost providers automatically when a threshold approaches. This article covers what that looks like in practice, and how &lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; addresses each part of the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Cost Tracking in Coding Agents Is Uniquely Hard
&lt;/h2&gt;

&lt;p&gt;Most LLM cost monitoring is built around a simple interaction model: a user sends a query, the model returns a response. Coding agents do not fit that model, and that mismatch creates three specific tracking problems.&lt;/p&gt;

&lt;p&gt;The first is call volume. Coding agents operate autonomously across multiple steps, with each tool call potentially triggering another. A single high-level instruction from a developer can expand into ten or more sequential API calls before a result is returned. Token consumption per session runs far higher than in an equivalent chat interaction.&lt;/p&gt;

&lt;p&gt;The second is model fragmentation. Agents like Claude Code divide work across model tiers: Sonnet handles routine tasks, Opus takes over for complex reasoning, and Haiku processes lightweight completions. Without a gateway aggregating this data, there is no way to see what each tier is costing or whether the tier assignments are working efficiently.&lt;/p&gt;

&lt;p&gt;The third is provider fragmentation. Enterprise teams rarely run on a single LLM provider. Cost data distributed across separate provider dashboards with different schemas cannot be reconciled without significant manual effort.&lt;/p&gt;

&lt;p&gt;A well-built LLM gateway addresses all three at the infrastructure level, before the data ever reaches a dashboard.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Look for in an Enterprise LLM Gateway for Cost Tracking
&lt;/h2&gt;

&lt;p&gt;Not every gateway is suited for coding agent environments. The capabilities that matter most for this use case are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical budget enforcement&lt;/strong&gt;: Independent spend limits across teams, projects, and individual keys, each with its own reset cadence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-request cost attribution&lt;/strong&gt;: Full logging of provider, model, input tokens, output tokens, and cost on every call, visible in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget-aware routing&lt;/strong&gt;: Automatic redirection to cheaper providers or models when a budget threshold is crossed, requiring no changes to agent configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native coding agent support&lt;/strong&gt;: Direct compatibility with Claude Code, Codex CLI, Cursor, and similar tools without custom middleware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching&lt;/strong&gt;: Deduplication of provider calls for semantically similar queries, eliminating redundant spend on repeated patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider routing&lt;/strong&gt;: A single endpoint covering OpenAI, Anthropic, AWS Bedrock, Google Vertex, and other providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bifrost satisfies all of these and operates with only 11 microseconds of added latency per request at 5,000 RPS, making it viable for production coding agent workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Bifrost Handles LLM Cost Tracking for Coding Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hierarchical Budget Control
&lt;/h3&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance system&lt;/a&gt; organizes cost control across four independent scopes: customer, team, virtual key, and per-provider configuration. Every scope carries its own budget with a configurable spend ceiling and reset interval.&lt;/p&gt;

&lt;p&gt;For a typical enterprise coding agent deployment, that hierarchy maps like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Organization level&lt;/strong&gt;: Aggregate monthly LLM budget for the whole company&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team level&lt;/strong&gt;: Separate allocation per engineering team (platform, product, infrastructure, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual key level&lt;/strong&gt;: Per-tool or per-environment budgets (Claude Code production vs. Codex CLI staging)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider config level&lt;/strong&gt;: Provider-specific caps within a key (Anthropic at $200/month, OpenAI at $300/month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every incoming request is checked against all applicable scopes in the hierarchy. If any scope has exhausted its budget, the request is blocked before reaching the provider. Overruns are prevented at every level of the hierarchy, not just at the top-level account ceiling.&lt;/p&gt;

&lt;p&gt;Reset intervals support daily, weekly, monthly, and annual cadences. Calendar alignment is optional, allowing budgets to reset on the first of the month rather than on a rolling 30-day window.&lt;/p&gt;
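&lt;p&gt;Put together, a team-level allocation with a calendar-aligned monthly reset might be sketched like this (field names are illustrative, not the exact API schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "team": "platform-engineering",
  "budget": {
    "max_limit_usd": 500,
    "reset_duration": "1M",
    "calendar_aligned": true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;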

&lt;h3&gt;
  
  
  Virtual Keys as the Attribution Unit
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are Bifrost's primary governance primitive. Each key is a scoped credential that bundles a budget, rate limits, and an allowlist of providers and models. Coding agents authenticate using a virtual key in place of a raw provider credential.&lt;/p&gt;

&lt;p&gt;Connecting Claude Code takes just two environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://your-bifrost-instance.com/anthropic"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bf-your-virtual-key"&lt;/span&gt;
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request Claude Code makes is now routed through Bifrost and counted against that key's budget. The same pattern works for &lt;a href="https://docs.getbifrost.ai/cli-agents/codex-cli" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/cli-agents/cursor" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/cli-agents/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/cli-agents/zed-editor" rel="noopener noreferrer"&gt;Zed Editor&lt;/a&gt;, and every other tool in Bifrost's &lt;a href="https://docs.getbifrost.ai/cli-agents/overview" rel="noopener noreferrer"&gt;CLI agent ecosystem&lt;/a&gt;. No modifications to the agents are needed. Attribution happens at the gateway.&lt;/p&gt;
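&lt;p&gt;Because the gateway speaks an OpenAI-compatible protocol, a virtual key can also be sanity-checked with a plain HTTP request before any agent is wired up; the endpoint path below assumes a standard deployment, and the model name follows the &lt;code&gt;provider/model&lt;/code&gt; convention shown later in this article:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://your-bifrost-instance.com/v1/chat/completions \
  -H "Authorization: Bearer bf-your-virtual-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "anthropic/claude-sonnet-4-5", "messages": [{"role": "user", "content": "ping"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;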

&lt;h3&gt;
  
  
  Budget-Aware Routing Rules
&lt;/h3&gt;

&lt;p&gt;Bifrost supports dynamic routing using CEL (Common Expression Language) expressions evaluated per request. When budget consumption on a virtual key crosses a defined threshold, Bifrost reroutes to a lower-cost target automatically.&lt;/p&gt;

&lt;p&gt;A rule for this looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Budget Fallback to Cheaper Model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cel_expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"budget_used &amp;gt; 85"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"groq"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-3.3-70b-versatile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once budget usage exceeds 85%, incoming requests are quietly redirected to the cheaper alternative. Developer workflows continue uninterrupted. Budget exhaustion no longer means session termination.&lt;/p&gt;

&lt;p&gt;Rules can be scoped to a virtual key, team, customer, or the whole gateway, and evaluated in configurable priority order. The &lt;a href="https://docs.getbifrost.ai/providers/routing-rules" rel="noopener noreferrer"&gt;routing rules documentation&lt;/a&gt; covers the full CEL expression syntax and target configuration options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Caching to Reduce Redundant Spend
&lt;/h3&gt;

&lt;p&gt;Coding agents repeat themselves. Across sessions and developers, similar queries appear frequently: summarize this function, write a unit test for this method, explain this block of code. Without caching, each instance of a repeated query becomes a billable provider call.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; matches incoming queries against previous responses using embedding-based similarity search. When a sufficiently similar match is found, the cached response is returned without a provider call. Exact cache hits cost nothing. Near-matches cost only the embedding lookup, a small fraction of a full inference request.&lt;/p&gt;

&lt;p&gt;Teams running many parallel agent sessions on shared codebases typically see meaningful cost reduction from caching alone, with no changes required to how agents operate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-Time Observability and Cost Attribution
&lt;/h3&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; records every request: provider, model, input token count, output token count, and computed cost. The dashboard provides real-time filtering by virtual key, provider, model, and time window, so teams can answer operational questions directly: which team is the highest consumer, which model tier contributes the most to spend, and what per-session cost looks like for a given agent configuration.&lt;/p&gt;

&lt;p&gt;Datadog users get native integration with LLM cost metrics surfaced alongside standard APM data. Teams on OpenTelemetry can export through the &lt;a href="https://docs.getbifrost.ai/features/telemetry" rel="noopener noreferrer"&gt;telemetry integration&lt;/a&gt; to Grafana, New Relic, Honeycomb, or any OTLP-compatible collector.&lt;/p&gt;

&lt;p&gt;Bifrost also connects natively to &lt;a href="https://www.getmaxim.ai/products/agent-observability" rel="noopener noreferrer"&gt;Maxim AI's observability platform&lt;/a&gt;, which layers production quality monitoring on top of cost data. Cost trends and output quality metrics appear together, making it possible to catch both budget overruns and quality regressions from a single view.&lt;/p&gt;
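&lt;p&gt;Per-key cost attribution from request records like these is a simple aggregation. The record fields below mirror what the dashboard exposes (provider, model, token counts, computed cost); the values themselves are invented for the sketch:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical request log records with dashboard-style fields.
requests = [
    {"virtual_key": "team-backend", "model": "claude-sonnet-4-5",
     "input_tokens": 1200, "output_tokens": 400, "cost": 0.0096},
    {"virtual_key": "team-backend", "model": "claude-sonnet-4-5",
     "input_tokens": 800, "output_tokens": 300, "cost": 0.0069},
    {"virtual_key": "team-frontend", "model": "llama-3.1-8b-instant",
     "input_tokens": 500, "output_tokens": 200, "cost": 0.0000415},
]

def spend_by_key(log):
    """Roll per-request costs up to each virtual key."""
    totals = defaultdict(float)
    for r in log:
        totals[r["virtual_key"]] += r["cost"]
    return dict(totals)
```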

&lt;h3&gt;
  
  
  Model Tier Overrides for Cost Optimization
&lt;/h3&gt;

&lt;p&gt;Claude Code's default behavior assigns tasks to Sonnet and escalates to Opus for complex work. Bifrost lets engineering managers remap those defaults at the environment level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Send Opus-tier requests to a less expensive model&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_OPUS_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"anthropic/claude-sonnet-4-5-20250929"&lt;/span&gt;

&lt;span class="c"&gt;# Send Haiku-tier requests to a hosted open-source model&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_HAIKU_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"groq/llama-3.1-8b-instant"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Developers keep using their tools as normal. Bifrost handles the provider translation based on the model name, and costs shift without any workflow disruption.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deploying Bifrost for Coding Agent Cost Control
&lt;/h2&gt;

&lt;p&gt;Bifrost starts in under a minute with NPX or Docker and requires no configuration files to launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @maximhq/bifrost@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Providers and virtual keys can be configured through the web UI or REST API after startup. For regulated environments, Bifrost supports &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC deployment&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/enterprise/vault-support" rel="noopener noreferrer"&gt;Vault and cloud secret manager integration&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/enterprise/mcp-with-fa" rel="noopener noreferrer"&gt;RBAC with Okta and Entra ID&lt;/a&gt;, and &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;immutable audit logging&lt;/a&gt; for SOC 2, GDPR, and HIPAA compliance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/adaptive-load-balancing" rel="noopener noreferrer"&gt;Adaptive load balancing&lt;/a&gt; is available as an enterprise feature, routing requests to the best-performing provider based on real-time latency and health data without manual rule maintenance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started with Bifrost for Coding Agent Cost Tracking
&lt;/h2&gt;

&lt;p&gt;The path to full cost visibility involves three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Bifrost and configure your LLM provider API keys.&lt;/li&gt;
&lt;li&gt;Create virtual keys for each team or tool, with spend limits and reset cadences appropriate to your budget cycle.&lt;/li&gt;
&lt;li&gt;Point Claude Code, Codex CLI, Cursor, or any other coding agent at your Bifrost endpoint using the virtual key as the API credential.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From that point, every session is tracked and attributed automatically. Routing rules, caching, and observability integrations can be layered in as requirements grow.&lt;/p&gt;
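&lt;p&gt;Step 3 amounts to aiming an OpenAI-compatible request at the gateway with the virtual key as the bearer credential. The URL, port, and key below are placeholders for the sketch:&lt;/p&gt;

```python
# Placeholder endpoint and credential; Bifrost exposes an
# OpenAI-compatible API, so the request shape is the standard one.
BIFROST_URL = "http://localhost:8080/v1/chat/completions"
VIRTUAL_KEY = "vk-team-platform"  # hypothetical virtual key

def gateway_request(model, prompt):
    """Build the URL, headers, and payload for a gateway call."""
    headers = {"Authorization": f"Bearer {VIRTUAL_KEY}"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return BIFROST_URL, headers, payload
```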

&lt;p&gt;To see how Bifrost handles cost visibility and governance for coding agent infrastructure, &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with the Bifrost team.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Cut LLM Costs and Latency in Production: A 2026 Playbook</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 13 Apr 2026 06:10:00 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/how-to-cut-llm-costs-and-latency-in-production-a-2026-playbook-53fp</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/how-to-cut-llm-costs-and-latency-in-production-a-2026-playbook-53fp</guid>
      <description>&lt;p&gt;&lt;em&gt;Six practical strategies for reducing LLM cost and latency at enterprise scale, from semantic caching to agentic optimization with Bifrost.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enterprise AI budgets are growing fast. LLM API spending more than doubled from $3.5 billion to $8.4 billion in the span of a year, and three-quarters of organizations expect to spend even more through 2026. What most teams lack is a structured approach to controlling what they spend and how fast their systems respond. The savings potential is real: teams that apply the right techniques consistently see 40-70% reductions in API spend without touching output quality.&lt;/p&gt;

&lt;p&gt;This playbook breaks down six strategies that work at production scale, from caching and routing to agentic execution optimization. Each technique is independent, but they compound when applied together.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LLM Costs and Latency Spiral in Production
&lt;/h2&gt;

&lt;p&gt;The gap between prototype economics and production economics is wider than most teams expect. A deployment that runs for pennies per day during development can easily reach five figures per month once real users arrive. Three factors drive most of the escalation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token usage&lt;/strong&gt;: Output tokens cost 3-5x more than input tokens at most major providers. Verbose responses and bloated context windows are among the most common sources of avoidable spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model selection&lt;/strong&gt;: There is a 20-30x price difference between frontier models like GPT-4 or Claude Opus and smaller alternatives for equivalent token counts. Sending every request to a top-tier model regardless of task complexity is one of the fastest ways to burn through an AI budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request volume&lt;/strong&gt;: Per-call costs appear small until you multiply them. A customer support agent running 10,000 conversations daily at $0.05 per call produces $15,000 in monthly API costs before you account for other teams and applications on the same infrastructure.&lt;/li&gt;
&lt;/ul&gt;
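&lt;p&gt;The volume math in the last bullet is worth making explicit, since per-call costs compound linearly with traffic. A minimal sketch, assuming a 30-day month:&lt;/p&gt;

```python
def monthly_api_cost(calls_per_day, cost_per_call, days=30):
    """Projected monthly spend from steady per-call costs."""
    return calls_per_day * cost_per_call * days

# 10,000 support conversations a day at $0.05 per call
print(monthly_api_cost(10_000, 0.05))  # prints 15000.0
```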

&lt;p&gt;Latency amplifies these problems. Slow responses degrade user experience and create bottlenecks in any system where LLM outputs feed downstream processes. Both issues are addressable at the gateway layer, between your application and the LLM providers, without restructuring application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Semantic Caching: The Highest-ROI Starting Point
&lt;/h2&gt;

&lt;p&gt;The single most impactful optimization available to most production teams is also one of the most underused. &lt;a href="https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering" rel="noopener noreferrer"&gt;Research shows&lt;/a&gt; that approximately 31% of enterprise LLM queries are semantically equivalent to requests that have already been answered, just worded differently. Two users asking "How do I reset my password?" and "What are the steps to update my login credentials?" are asking the same question. Without semantic caching, both generate full API calls at full cost.&lt;/p&gt;

&lt;p&gt;Traditional exact-match caching cannot catch this overlap. Semantic caching uses vector embeddings to measure meaning rather than string similarity, serving cached responses whenever a new query falls within a configurable similarity threshold of a previous one.&lt;/p&gt;

&lt;p&gt;The measured outcomes across production deployments are consistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;40-70% cost reduction&lt;/strong&gt; on workloads with clustered or repetitive queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7x latency improvement&lt;/strong&gt; on cache hits, dropping response times from ~850ms to ~120ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No quality degradation&lt;/strong&gt;: cache hits return the same response the model would have produced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; is embedded directly into the gateway request pipeline. Matching queries return cached responses before traffic ever reaches an LLM provider, so there is no additional network round-trip.&lt;/p&gt;
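&lt;p&gt;The arithmetic behind those outcomes is straightforward. Using the figures above as illustrative inputs (a 31% hit rate, ~850 ms and an assumed $0.01 per uncached call, ~120 ms per cache hit), the blended per-request cost and latency work out as:&lt;/p&gt;

```python
def blended_metrics(hit_rate, miss_cost, miss_latency_ms,
                    hit_latency_ms, hit_cost=0.0):
    """Expected per-request cost and latency at a given cache hit rate."""
    cost = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
    latency = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
    return cost, latency

# 31% of queries repeat semantically; the $0.01 miss cost is an
# assumed figure for illustration, not a quoted provider price.
cost, latency = blended_metrics(0.31, 0.01, 850, 120)
```

&lt;p&gt;A 31% hit rate alone cuts blended cost by 31% and average latency by roughly a quarter; workloads with more clustered queries see proportionally larger gains.&lt;/p&gt;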




&lt;h2&gt;
  
  
  2. Complexity-Based Model Routing
&lt;/h2&gt;

&lt;p&gt;The assumption that all requests need the same model is expensive and usually wrong. Simple classification tasks, short extractions, and repetitive FAQ responses perform at equivalent quality on smaller, faster, cheaper models. &lt;a href="https://www.tribe.ai/applied-in-reducing-latency-and-cost-at-scale-llm-performance" rel="noopener noreferrer"&gt;SciForce's hybrid routing research&lt;/a&gt; found that routing simpler queries to lighter models achieves a 37-46% reduction in overall LLM consumption, with simple queries returning 32-38% faster.&lt;/p&gt;

&lt;p&gt;The challenge with routing is implementation complexity: different providers have different APIs, and maintaining routing logic at the application layer means every code change affects multiple services.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/providers/routing-rules" rel="noopener noreferrer"&gt;routing rules&lt;/a&gt; centralize this at the gateway level. Define the routing logic once, and Bifrost handles provider-specific API differences automatically. Routing strategy changes happen in configuration, not code. Combined with &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt;, the routing layer also handles provider outages and rate limit events without application-level error handling.&lt;/p&gt;
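&lt;p&gt;The core idea behind complexity-based routing can be sketched in a few lines. The tier names, heuristics, and the Opus model identifier below are assumptions for illustration, not Bifrost's actual rule syntax:&lt;/p&gt;

```python
# Illustrative tier-to-model map; only the Sonnet and Groq names
# appear in this article, the Opus identifier is a stand-in.
ROUTES = {
    "simple": "groq/llama-3.1-8b-instant",          # classification, FAQ
    "standard": "anthropic/claude-sonnet-4-5-20250929",  # typical tasks
    "complex": "anthropic/claude-opus-4",           # multi-step reasoning
}

def classify(prompt):
    """Crude complexity heuristic based on keywords and length."""
    if any(k in prompt.lower() for k in ("design", "architect", "prove")):
        return "complex"
    if len(prompt.split()) > 50:
        return "standard"
    return "simple"

def route(prompt):
    return ROUTES[classify(prompt)]
```

&lt;p&gt;Real routers use trained classifiers or score-based policies rather than keyword checks, but the shape is the same: a cheap decision up front selects the cheapest model that can handle the request.&lt;/p&gt;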




&lt;h2&gt;
  
  
  3. Adaptive Load Balancing
&lt;/h2&gt;

&lt;p&gt;At production request volumes, how traffic is distributed across API keys and providers directly determines both cost and latency. Rate limit collisions create retry loops that add latency and, in some billing models, result in charges for failed requests. Uneven key utilization leaves capacity unused while other keys get throttled.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/enterprise/adaptive-load-balancing" rel="noopener noreferrer"&gt;adaptive load balancing&lt;/a&gt; scores each route continuously based on live signals: error rate, observed latency, and throughput. Error rate carries the most weight, which means degraded routes get deprioritized the moment problems appear rather than after a fixed polling window. Each route moves through four states (Healthy, Degraded, Failed, Recovering) with automatic recovery once metrics stabilize.&lt;/p&gt;

&lt;p&gt;In clustered Bifrost deployments, routing intelligence is shared across all nodes via a gossip synchronization mechanism. Every node makes consistent decisions without relying on a central coordinator, removing a common point of failure in distributed gateway setups.&lt;/p&gt;

&lt;p&gt;The result is higher throughput and lower average latency at the same cost envelope, with no manual intervention required.&lt;/p&gt;
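&lt;p&gt;A weighted health score of this kind can be sketched as follows. The weights, normalization constants, and state thresholds are illustrative, not Bifrost's published internals:&lt;/p&gt;

```python
# Error rate carries the most weight, as described above.
WEIGHTS = {"error_rate": 0.5, "latency": 0.3, "throughput": 0.2}

def score(route):
    """Higher is healthier; each signal is normalized to [0, 1]."""
    return (WEIGHTS["error_rate"] * (1 - route["error_rate"])
            + WEIGHTS["latency"] * (1 - min(route["latency_ms"] / 2000, 1.0))
            + WEIGHTS["throughput"] * min(route["rps"] / 100, 1.0))

def state(s):
    # "Recovering" (the fourth state) is entered when a Failed
    # route's score climbs back toward the Degraded band.
    if s >= 0.8:
        return "Healthy"
    if s >= 0.4:
        return "Degraded"
    return "Failed"
```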




&lt;h2&gt;
  
  
  4. Prompt Engineering for Token Efficiency
&lt;/h2&gt;

&lt;p&gt;Gateway-level controls address the infrastructure problem. Prompt engineering attacks the token budget at the source. Because output tokens cost more than inputs, reducing response length has an outsized effect on API spend per request.&lt;/p&gt;

&lt;p&gt;The changes with the greatest practical impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set explicit output constraints&lt;/strong&gt;: Tell the model how long its answer should be ("Answer in 50 words or fewer") and enforce it with &lt;code&gt;max_tokens&lt;/code&gt; in the API call. Unconstrained models default to more verbose outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit and trim system prompts&lt;/strong&gt;: A system prompt that runs 200 tokens longer than needed becomes a significant cost multiplier at millions of daily requests. Remove anything that does not measurably change model behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compress conversation history&lt;/strong&gt;: Passing full chat histories for multi-turn interactions consumes input tokens that could be replaced by a short summary of prior context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request structured output&lt;/strong&gt;: JSON or structured formats produce shorter, more parseable responses than natural-language explanations and eliminate unnecessary preamble.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt optimization typically delivers 20-30% reductions in token consumption per request, and it stacks directly on top of caching and routing gains.&lt;/p&gt;
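&lt;p&gt;Three of those tactics (a hard output cap, compressed history, structured output) show up directly in how the request payload is built. A sketch using the common OpenAI-style chat schema; the summary field on history entries is an assumption of this example:&lt;/p&gt;

```python
def build_request(system_prompt, history, user_msg, max_tokens=150):
    """Token-lean chat payload: capped output, summarized history,
    structured response. History entries carry a precomputed summary."""
    # Replace the full multi-turn transcript with a one-line summary.
    summary = "Prior context: " + "; ".join(h["summary"] for h in history)
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": summary + "\n" + user_msg},
        ],
        "max_tokens": max_tokens,                    # hard output cap
        "response_format": {"type": "json_object"},  # structured output
    }
```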




&lt;h2&gt;
  
  
  5. Budget Controls and Cost Visibility
&lt;/h2&gt;

&lt;p&gt;Optimization without visibility is guesswork. Most teams first notice cost problems when the monthly invoice arrives, not when the spend is happening. The only reliable approach is real-time attribution: knowing which team, application, or use case is generating costs as it happens.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget and rate limit controls&lt;/a&gt; operate at the virtual key level. Every team, application, or customer account gets a dedicated virtual key with a configurable budget cap, rate limit, and model allowlist. When a threshold is crossed, the configured response fires automatically: an alert, a throttle, or a hard block. No single use case can silently exhaust shared infrastructure budget.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; provides a real-time view across every provider, model, and key: token consumption, cost attribution, error rates, and latency, all flowing into existing monitoring tools via Prometheus and OpenTelemetry. Before changing provider or model configurations, the &lt;a href="https://www.getmaxim.ai/bifrost/llm-cost-calculator" rel="noopener noreferrer"&gt;LLM Cost Calculator&lt;/a&gt; lets you model the expected impact in advance.&lt;/p&gt;
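&lt;p&gt;The enforcement logic behind a budget cap is simple to state: check projected spend before the call, record actual spend after it. A toy version, not Bifrost's implementation:&lt;/p&gt;

```python
class VirtualKey:
    """Toy budget enforcement: track spend against a cap and
    block requests once the cap would be exceeded."""
    def __init__(self, name, budget_usd):
        self.name = name
        self.budget = budget_usd
        self.spent = 0.0

    def authorize(self, estimated_cost):
        # Hard block before the overage happens, not after the invoice.
        if self.spent + estimated_cost > self.budget:
            return False
        return True

    def record(self, actual_cost):
        self.spent += actual_cost
```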




&lt;h2&gt;
  
  
  6. Code Mode for Agents and Bifrost CLI for Coding Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Mode: Lower Token Overhead for Any Agent
&lt;/h3&gt;

&lt;p&gt;Standard agentic execution is expensive at the token level. On each iteration, the agent receives full tool schemas and result payloads, makes one tool call at a time through a full LLM round-trip, and accumulates cost across every step. This overhead applies regardless of the agent's domain: research agents, internal system query agents, and multi-step workflow orchestrators all follow the same pattern.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/code-mode" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt; changes the execution model. Rather than sequential one-at-a-time tool calls, the model generates Python that orchestrates multiple tool invocations in a single step. Bifrost runs the code and returns the combined results, collapsing several round-trips into one. The gains hold across agent types: approximately 50% fewer tokens per completed task and approximately 40% lower end-to-end latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bifrost CLI: One Command for Coding Agent Control
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.getmaxim.ai/bifrost/resources/bifrost-cli" rel="noopener noreferrer"&gt;Bifrost CLI&lt;/a&gt; is the fastest way to apply gateway-level cost and latency controls to terminal-based coding agents. It launches Claude Code, Codex CLI, Gemini CLI, and other &lt;a href="https://www.getmaxim.ai/bifrost/resources/cli-agents" rel="noopener noreferrer"&gt;CLI coding agents&lt;/a&gt; through Bifrost automatically, handling gateway and MCP configuration without any manual setup. Developers continue using their existing tools. The CLI routes all traffic through semantic caching, model routing, budget enforcement, and observability from a single command.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Gateway Layer Is the Right Place to Solve This
&lt;/h2&gt;

&lt;p&gt;Teams that implement cost and latency optimization at the application layer eventually encounter the same problem: each service reimplements the same logic independently, routing strategy changes require code deployments, and observability is fragmented across different implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmaxim.ai/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; centralizes all of these controls at the infrastructure layer. Configure semantic caching, routing rules, adaptive load balancing, budget caps, and observability once, and they apply uniformly to every LLM request across every team and application. The overhead Bifrost adds to accomplish this is 11 microseconds per request at 5,000 RPS, which is negligible against the hundreds of milliseconds consumed by provider API calls.&lt;/p&gt;

&lt;p&gt;Bifrost connects to 20+ providers including OpenAI, Anthropic, AWS Bedrock, Google Vertex AI, Azure OpenAI, Groq, Mistral, and Cohere through a single OpenAI-compatible API. Provider and model changes happen in gateway configuration, not application code. The &lt;a href="https://www.getmaxim.ai/bifrost/resources/benchmarks" rel="noopener noreferrer"&gt;performance benchmarks&lt;/a&gt; cover throughput and latency comparisons in detail. Teams evaluating gateways can use the &lt;a href="https://www.getmaxim.ai/bifrost/resources/buyers-guide" rel="noopener noreferrer"&gt;LLM Gateway Buyer's Guide&lt;/a&gt; as a structured reference, and the &lt;a href="https://www.getmaxim.ai/bifrost/resources/enterprise-scalability" rel="noopener noreferrer"&gt;enterprise scalability resource&lt;/a&gt; covers high-throughput, multi-team deployment patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Strategies Stack
&lt;/h2&gt;

&lt;p&gt;No single technique delivers everything. The teams achieving 50-70% reductions in production API spend apply several layers simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantic caching eliminates full API calls for the roughly one-third of queries that overlap semantically with prior requests&lt;/li&gt;
&lt;li&gt;Complexity-based routing shifts cheaper tasks to lower-cost models without affecting output quality&lt;/li&gt;
&lt;li&gt;Adaptive load balancing removes rate limit friction and reduces retry-driven latency&lt;/li&gt;
&lt;li&gt;Prompt engineering reduces token consumption at the source, across every request whether cached or not&lt;/li&gt;
&lt;li&gt;Budget controls surface spend in real time rather than at invoice time&lt;/li&gt;
&lt;li&gt;Code Mode halves per-task token usage and cuts latency by approximately 40% for any agent workload&lt;/li&gt;
&lt;li&gt;The Bifrost CLI extends these controls to coding agent workflows with a single terminal command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer compounds on the others. Caching reduces the effective volume of requests hitting routing and load balancing. Tighter prompts reduce costs on every live request. The combination produces outcomes that no single technique achieves on its own.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Bifrost applies every strategy in this guide at the gateway level with 11 microseconds of added overhead and no changes to application code. Start with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; or Docker to get running in under a minute, or &lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; to see how the full optimization stack maps to your specific workloads.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Best Enterprise AI Gateway for Retail AI</title>
      <dc:creator>Kamya Shah</dc:creator>
      <pubDate>Mon, 06 Apr 2026 04:38:11 +0000</pubDate>
      <link>https://dev.to/kamya_shah_e69d5dd78f831c/best-enterprise-ai-gateway-for-retail-ai-3jd1</link>
      <guid>https://dev.to/kamya_shah_e69d5dd78f831c/best-enterprise-ai-gateway-for-retail-ai-3jd1</guid>
      <description>&lt;p&gt;&lt;em&gt;Retail AI workloads demand an enterprise AI gateway that delivers budget enforcement, privacy compliance, and intelligent provider routing. Here is how Bifrost solves it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Retail has moved past the AI experimentation phase. &lt;a href="https://www.nvidia.com/en-us/lp/industries/state-of-ai-in-retail-and-cpg/" rel="noopener noreferrer"&gt;NVIDIA's 2026 State of AI in Retail and CPG report&lt;/a&gt; shows that 97% of retailers intend to grow their AI budgets this year, with 69% already seeing higher revenue and 72% reporting lower operating costs from AI adoption. The global AI in retail market is on track to expand from $18.64 billion in 2026 to &lt;a href="https://www.mordorintelligence.com/industry-reports/artificial-intelligence-in-retail-market" rel="noopener noreferrer"&gt;$82.72 billion by 2031&lt;/a&gt;, a 34.7% compound annual growth rate. From personalized product suggestions and real-time pricing adjustments to inventory forecasting, conversational support, fraud prevention, and agentic shopping flows, AI now touches every part of the retail value chain.&lt;/p&gt;

&lt;p&gt;Yet as these workloads proliferate across teams and regions, the gaps in infrastructure become obvious: fragmented cost tracking, missing audit trails, no per-application access controls, and inconsistent compliance posture across privacy jurisdictions. An enterprise AI gateway for retail closes these gaps by sitting between applications and LLM providers, centralizing governance, routing, and compliance in one layer. Bifrost, the open-source AI gateway from Maxim AI, delivers the budget management, security controls, and multi-provider orchestration that retail AI needs to operate reliably at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case for a Dedicated AI Gateway in Retail
&lt;/h2&gt;

&lt;p&gt;Retailers interact with AI at more operational touchpoints than nearly any other industry. Product recommendation engines, visual merchandising systems, customer support bots, supply chain planners, content generation tools, dynamic pricing modules, and loss prevention models each operate with different performance requirements, different provider dependencies, and different cost profiles.&lt;/p&gt;

&lt;p&gt;Without a centralized gateway, these problems compound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Invisible API spend&lt;/strong&gt;: AI investment is spreading well beyond IT departments. Retail executives expect non-IT AI spending to jump 52% year over year. When marketing, merchandising, logistics, and CX teams each run their own LLM integrations, nobody has a consolidated view of total spend. A product copy pipeline generating descriptions for a 100,000-SKU catalog can rack up thousands of dollars weekly with no budget guardrails in place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared credentials and ungoverned access&lt;/strong&gt;: A shopper-facing chatbot, a back-office pricing optimizer, and a seasonal campaign writer should operate under separate API credentials with distinct model permissions and safety policies. Without a gateway to enforce this separation, teams share keys, and every application has unrestricted access to every model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing compliance evidence&lt;/strong&gt;: The EU AI Act's high-risk requirements take full effect in August 2026. Retailers deploying AI for personalized pricing, customer segmentation, or automated decisioning need to prove auditability. Without centralized request logging, there is no way to reconstruct which model handled a given interaction or what data it processed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single-provider fragility&lt;/strong&gt;: Retail AI runs on tight timelines. When a recommendation engine drops during a flash sale or a support bot stalls during the holiday rush, the revenue impact is immediate. Direct provider connections offer no fallback path if a single API goes down or starts throttling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unfiltered model output&lt;/strong&gt;: AI-generated product descriptions, marketing emails, and chat responses all carry brand risk. Without output filtering, a model can produce misleading claims, incorrect policy information, or content that conflicts with advertising regulations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Privacy and Regulatory Obligations for Retail AI
&lt;/h2&gt;

&lt;p&gt;Retail AI sits at the intersection of multiple privacy frameworks, each with different scope and enforcement mechanisms. The cost of compliance is climbing: businesses are spending 30-40% more on privacy programs than they did in 2023, and cumulative GDPR penalties have surpassed &lt;a href="https://secureprivacy.ai/blog/data-privacy-trends-2026" rel="noopener noreferrer"&gt;€6.7 billion&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  GDPR and the EU AI Act
&lt;/h3&gt;

&lt;p&gt;European retailers face a converging set of requirements. GDPR controls how shopper data is collected, stored, and moved across borders. The EU AI Act, fully enforceable for high-risk systems starting August 2026, designates retail use cases like personalized pricing, automated profiling, and algorithmic decision-making as high-risk. These classifications trigger mandatory risk assessments, human oversight provisions, and full auditability of model behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  US state privacy legislation
&lt;/h3&gt;

&lt;p&gt;Nineteen US states now enforce comprehensive privacy laws. California's CPRA sets intentional violation penalties at $7,988 with no automatic cure window. Retailers that process customer data across state lines must comply with divergent consent rules, data minimization standards, and transparency mandates for automated decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  PCI DSS
&lt;/h3&gt;

&lt;p&gt;AI applications that touch payment card information, including customer service tools handling order lookups, refund processing, or payment troubleshooting, must satisfy PCI DSS requirements for data encryption and access control.&lt;/p&gt;

&lt;p&gt;A retail-ready enterprise AI gateway must provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Team-level budget caps&lt;/strong&gt; with live spend dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tamper-proof audit logs&lt;/strong&gt; attributing every model call to a specific user or system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular RBAC&lt;/strong&gt; restricting model and tool access per team and application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private deployment options&lt;/strong&gt; keeping customer data inside approved network boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable content filters&lt;/strong&gt; enforcing brand standards and regulatory rules per application&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Bifrost Solves Retail AI Infrastructure Challenges
&lt;/h2&gt;

&lt;p&gt;Bifrost is a high-performance, open-source AI gateway written in Go. It unifies access to &lt;a href="https://docs.getbifrost.ai/providers/supported-providers/overview" rel="noopener noreferrer"&gt;20+ LLM providers&lt;/a&gt; behind a single OpenAI-compatible API. As an enterprise AI gateway, its governance, cost management, and routing features address the specific operational and compliance pressures retail organizations face when scaling AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team-level cost management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual keys&lt;/a&gt; are Bifrost's core governance primitive. Each key is a scoped credential controlling which models, providers, and MCP tools a consumer can reach, paired with enforced &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;spending limits&lt;/a&gt; and &lt;a href="https://docs.getbifrost.ai/features/governance/rate-limits" rel="noopener noreferrer"&gt;request rate caps&lt;/a&gt;. Practical retail configurations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A marketing key capped at $5,000 per month, restricted to content models, with guardrails blocking off-brand language&lt;/li&gt;
&lt;li&gt;A customer support key pinned to a fast-response model, with adaptive rate limits that scale up during peak traffic windows&lt;/li&gt;
&lt;li&gt;A forecasting key directed at cost-optimized models with large context windows for historical data analysis&lt;/li&gt;
&lt;li&gt;A merchandising key for catalog copy generation, limited to approved models with a per-call cost ceiling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bifrost enforces budget caps in real time. When a key nears its limit, the gateway blocks further requests before the overage hits the invoice.&lt;/p&gt;
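&lt;p&gt;The retail configurations above map naturally onto per-key definitions. The field names in this sketch are assumptions for illustration, not Bifrost's actual configuration schema:&lt;/p&gt;

```python
# Hypothetical virtual-key definitions for two of the teams above.
virtual_keys = [
    {
        "name": "marketing",
        "monthly_budget_usd": 5000,
        "allowed_models": ["anthropic/claude-sonnet-4-5"],
        "guardrails": ["brand-language"],
    },
    {
        "name": "customer-support",
        "allowed_models": ["groq/llama-3.1-8b-instant"],
        "rate_limit_rpm": 600,
    },
]

def models_for(key_name):
    """Look up the model allowlist enforced for a given key."""
    for k in virtual_keys:
        if k["name"] == key_name:
            return k["allowed_models"]
    return []
```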

&lt;h3&gt;
  
  
  Compliance-grade audit trails
&lt;/h3&gt;

&lt;p&gt;The enterprise tier generates &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;tamper-proof audit logs&lt;/a&gt; for every request that passes through the gateway. Bifrost's compliance framework covers SOC 2 Type II, GDPR, ISO 27001, and HIPAA. For retailers preparing for EU AI Act audit obligations, every log entry captures model identity, input payload, output payload, and the initiating user or service account. &lt;a href="https://docs.getbifrost.ai/enterprise/log-exports" rel="noopener noreferrer"&gt;Log exports&lt;/a&gt; feed directly into Splunk, Datadog, or any SIEM platform your compliance team already uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Brand-safe and compliant output filtering
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/enterprise/guardrails" rel="noopener noreferrer"&gt;Guardrails&lt;/a&gt; apply real-time content controls on both model inputs and outputs, integrating with AWS Bedrock Guardrails, Azure Content Safety, and Patronus AI. Retail-specific guardrail configurations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocking product copy that includes unsupported health or safety claims&lt;/li&gt;
&lt;li&gt;Filtering chatbot replies that misstate return windows, warranty terms, or payment policies&lt;/li&gt;
&lt;li&gt;Rejecting marketing output that references competitor brands or violates advertising standards&lt;/li&gt;
&lt;li&gt;Stripping PII from prompts and responses to satisfy GDPR and CCPA data minimization rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each guardrail policy is scoped per virtual key, so different applications enforce different safety profiles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Private cloud deployment and data residency
&lt;/h3&gt;

&lt;p&gt;Retailers handling customer data under GDPR residency mandates or internal data governance policies can run Bifrost inside their own VPC with &lt;a href="https://docs.getbifrost.ai/enterprise/invpc-deployments" rel="noopener noreferrer"&gt;in-VPC deployments&lt;/a&gt;. No LLM request containing shopper data leaves the private network. &lt;a href="https://docs.getbifrost.ai/enterprise/vault-support" rel="noopener noreferrer"&gt;Vault integration&lt;/a&gt; with HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, and Azure Key Vault removes provider API keys from application code and configuration files entirely.&lt;/p&gt;

&lt;h2&gt;Intelligent Provider Routing for Retail Workloads&lt;/h2&gt;

&lt;p&gt;Different retail AI use cases place different demands on the underlying model infrastructure. A live recommendation widget needs responses in under a second. A nightly batch run generating thousands of product descriptions optimizes for cost per token. A customer chatbot balances speed and accuracy for order-specific queries.&lt;/p&gt;

&lt;p&gt;Bifrost sends requests to the right provider through a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;unified API&lt;/a&gt; and activates &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatic failover&lt;/a&gt; the moment a provider goes down. During high-stakes retail events like Black Friday, seasonal promotions, or limited-time drops, failover keeps every customer-facing AI application online regardless of which provider is experiencing issues.&lt;/p&gt;
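&lt;p&gt;Mechanically, failover is a priority-ordered retry loop: try the primary provider, and on error fall through to the next one. The sketch below simulates that logic outside the gateway; the provider names and the &lt;code&gt;call_provider&lt;/code&gt; stub are hypothetical, and Bifrost performs the equivalent internally so applications never see the outage.&lt;/p&gt;

```python
# Minimal failover sketch: try providers in priority order, fall back on
# error. Provider names and the call_provider stub are illustrative only.
class ProviderDown(Exception):
    pass

def call_provider(name: str, prompt: str) -> str:
    if name == "primary":  # simulate an outage at the first provider
        raise ProviderDown(name)
    return f"{name}: response to {prompt!r}"

def complete_with_fallback(prompt: str, providers: list[str]) -> str:
    last_err = None
    for name in providers:
        try:
            return call_provider(name, prompt)
        except ProviderDown as err:
            last_err = err  # provider unavailable, try the next one
    raise RuntimeError(f"all providers failed: {last_err}")

print(complete_with_fallback("hello", ["primary", "secondary"]))
```

&lt;p&gt;Doing this at the gateway rather than in each application means the fallback chain is configured once and every client benefits during a Black Friday-scale incident.&lt;/p&gt;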

&lt;p&gt;Key routing features for retail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Weighted distribution&lt;/strong&gt;: Assign traffic shares across providers based on cost, latency, or compliance targets, and shift weights dynamically for peak versus off-peak periods&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application-aware routing rules&lt;/strong&gt;: Push customer-facing workloads to premium low-latency endpoints while directing internal batch jobs to budget-friendly alternatives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt;&lt;/strong&gt;: Serve cached answers for semantically equivalent queries. Shipping policy questions, sizing inquiries, and product FAQ requests hit the cache instead of the provider, cutting both cost and response time for the most common customer interactions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP gateway&lt;/a&gt;&lt;/strong&gt;: Bifrost's native Model Context Protocol layer connects AI agents to inventory databases, CRM platforms, order management systems, and product catalogs through one governed endpoint, with &lt;a href="https://docs.getbifrost.ai/mcp/filtering" rel="noopener noreferrer"&gt;per-key tool filtering&lt;/a&gt; controlling which tools each application can invoke&lt;/li&gt;
&lt;/ul&gt;
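&lt;p&gt;Of the features above, semantic caching is the most algorithmically interesting: a cached answer is served when a new query's embedding is close enough to a stored one. The toy sketch below uses hand-written 3-dimensional vectors and a cosine-similarity threshold as stand-ins; a real deployment embeds queries with a proper model, and the threshold value here is arbitrary.&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Toy semantic cache: linear scan over stored (embedding, answer) pairs."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []

    def get(self, embedding):
        for cached_emb, answer in self.entries:
            if cosine(cached_emb, embedding) >= self.threshold:
                return answer  # semantically equivalent query: cache hit
        return None  # miss: forward the request to the provider

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "Returns are accepted within 30 days.")
# A nearly identical embedding (a rephrased return-window question) hits:
print(cache.get([0.88, 0.12, 0.01]))
```

&lt;p&gt;This is why FAQ-style traffic benefits most: shipping, sizing, and returns questions cluster tightly in embedding space, so one provider call can serve thousands of paraphrased requests.&lt;/p&gt;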

&lt;h2&gt;Running Bifrost in Production for Retail&lt;/h2&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/enterprise/clustering" rel="noopener noreferrer"&gt;cluster mode&lt;/a&gt; delivers high availability through automatic peer discovery and zero-downtime rolling deployments. Retail systems that must absorb traffic spikes during seasonal peaks without service degradation need a gateway layer that scales horizontally and never becomes a bottleneck.&lt;/p&gt;

&lt;p&gt;At 5,000 requests per second, Bifrost introduces just 11 microseconds of overhead per call. For shopper-facing AI where every millisecond of latency affects conversion, the governance layer is effectively invisible.&lt;/p&gt;

&lt;p&gt;Visit Bifrost's &lt;a href="https://www.getmaxim.ai/bifrost/industry-pages/retail" rel="noopener noreferrer"&gt;retail industry page&lt;/a&gt; for reference architectures and deployment blueprints built for retail environments. Performance benchmarks, deployment walkthroughs, and the LLM Gateway Buyer's Guide are all available in the &lt;a href="https://www.getmaxim.ai/bifrost/resources" rel="noopener noreferrer"&gt;Bifrost resource library&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Ship Governed Retail AI with Bifrost&lt;/h2&gt;

&lt;p&gt;Retail AI has moved beyond individual pilots into coordinated, enterprise-wide rollouts touching marketing, merchandising, support, supply chain, and commerce. The gateway connecting these applications to LLM providers must enforce the same spending controls, access policies, and compliance standards that retailers already demand from every other production system.&lt;/p&gt;

&lt;p&gt;Bifrost delivers the enterprise AI gateway for retail: team-scoped cost governance, tamper-proof audit trails, private cloud deployment, automatic multi-provider failover, brand-safe content guardrails, and MCP-native tool orchestration, all in one open-source platform running at sub-20-microsecond overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getmaxim.ai/bifrost/book-a-demo" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with the Bifrost team to explore how the gateway fits your retail AI stack.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
