Andrew Baisden
How Bifrost's MCP Gateway Cuts AI Agent Token Costs by 92% Without Sacrificing Capability

If you have been building AI agents with Model Context Protocol (MCP), you have probably noticed something uncomfortable. The more tools you connect, the more expensive every single request becomes, even when most of those tools are not being used.

This isn't a configuration issue. It's a fundamental problem with how classic MCP works. And it gets quietly ignored until the billing notification arrives.

Bifrost's MCP Gateway was built to solve it. Here is what's actually going on, how they fixed it, and why the benchmark numbers are worth paying attention to.

What's Actually Causing the Cost Problem?

Classic MCP has a straightforward but costly default behaviour: every tool definition from every connected server gets injected into the model's context on every single request.

Let's do the maths. Connect 5 MCP servers with 30 tools each, and you are sending 150 tool definitions before your prompt even starts. Connect 16 servers with around 500 tools total, and every one of those definitions is consumed on every turn of every agent loop, regardless of which tools the model actually needs.
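As a rough sketch of that overhead (the 500-tokens-per-definition figure is an illustrative assumption; real tool schemas vary widely in size):

```python
# Back-of-envelope estimate of context overhead under classic MCP.
# TOKENS_PER_DEFINITION is an illustrative assumption; real tool
# schemas vary widely in size.
TOKENS_PER_DEFINITION = 500

def context_overhead(servers: int, tools_per_server: int, turns: int) -> int:
    """Tokens spent on tool definitions alone across an agent loop."""
    definitions = servers * tools_per_server
    return definitions * TOKENS_PER_DEFINITION * turns

# 5 servers x 30 tools, across a 10-turn agent loop:
print(context_overhead(servers=5, tools_per_server=30, turns=10))  # 750000
```

Three-quarters of a million tokens on a modest setup, before a single word of your prompt, and the number scales linearly with both tool count and turn count.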

Classic MCP vs Bifrost Code Mode

The token cost isn't a rounding error at that scale; it becomes the majority of your spending. The usual advice is to trim your tool list, but that's not a solution, it's a tradeoff: you are giving up capability to control cost, which defeats the point of building with MCP in the first place.

What Bifrost MCP Gateway Is And Why It's Different

Bifrost started as an open-source LLM gateway, a unified layer for managing AI providers, routing, keys, and costs across your stack. As teams moved from single-model calls to full agent workflows, they naturally started running MCP servers through it.

The result is Bifrost MCP Gateway, which is a single deployment that handles both LLM routing and MCP tool execution, with access control, cost governance, and full audit trails built in. It's written in Go, open-source under Apache 2.0, and designed to behave like infrastructure instead of a developer convenience wrapper.

If you are familiar with alternatives like LiteLLM or Portkey, Bifrost occupies a different position, purpose-built for production scale as opposed to prototyping ease. You can see a detailed breakdown on the Bifrost alternatives page, but the main difference is performance. Bifrost adds only 11 microseconds of overhead at 5,000 requests per second, compared to hundreds of milliseconds with Python-based alternatives.

Bifrost performance vs LiteLLM performance

It's also a drop-in replacement for existing SDKs.

Switching looks like this:

import os
from anthropic import Anthropic

anthropic = Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
    base_url="https://<bifrost_url>/anthropic",
)

One line changed, everything else stays the same: the client still reads your API key from the environment; only the base URL now points at Bifrost.

Code Mode: The Fix Bifrost Built

To solve the token bloat problem without limiting tool access, Bifrost built something called Code Mode.

The idea isn't completely new; Cloudflare explored a similar approach with a TypeScript runtime, and Anthropic's own engineering team wrote about context dropping from 150,000 tokens to 2,000 for a Google Drive to Salesforce workflow. Bifrost built it natively into the gateway, with two key choices. Python over JavaScript (because LLMs are trained on significantly more Python data), and a dedicated documentation tool to compress context even more.

Instead of injecting every tool definition into context, Code Mode exposes your MCP servers as a virtual filesystem of lightweight Python stub files. The model reads only what it needs, writes a short orchestration script, and Bifrost executes it in a sandboxed Starlark interpreter. The full tool list never touches the context.

The model gets four meta-tools to work with:

| Meta-tool | What it does |
| --- | --- |
| listToolFiles | Discover which servers and tools are available |
| readToolFile | Load Python function signatures for a specific server or tool |
| getToolDocs | Fetch detailed documentation for a specific tool before using it |
| executeToolCode | Run the orchestration script against live tool bindings |
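Under Code Mode, the discovery side of that loop looks roughly like this. The stub contents and server names below are hypothetical; the point is that the model pulls in a few small files on demand instead of receiving every tool schema up front:

```python
# Sketch of the Code Mode discovery flow. Stub contents and server names
# are hypothetical, not Bifrost's actual wire format.

# Virtual filesystem: one lightweight Python stub file per MCP server.
TOOL_FILES = {
    "crm.py": "def lookup_customer(email: str) -> dict: ...\n"
              "def get_order_history(customer_id: str, limit: int) -> list: ...",
    "billing.py": "def calculate_discount(customer_tier: str, order_count: int) -> dict: ...",
    "email.py": "def send_confirmation(to: str, discount_pct: float) -> None: ...",
}

def list_tool_files() -> list[str]:
    """listToolFiles: discover which servers are available."""
    return sorted(TOOL_FILES)

def read_tool_file(name: str) -> str:
    """readToolFile: load signatures for one server only."""
    return TOOL_FILES[name]

# The model lists servers (a few bytes each), then reads only what it needs.
available = list_tool_files()
crm_stubs = read_tool_file("crm.py")
print(available)
print(crm_stubs)
```

Only the stub the model actually reads enters the context; the other servers cost nothing until they are needed.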

Here's what a multi-step e-commerce workflow looks like when Code Mode generates the orchestration script, instead of calling tools turn by turn:

customer = crm.lookup_customer(email="john@example.com")
orders = crm.get_order_history(customer_id=customer["id"], limit=5)
discount = billing.calculate_discount(customer_tier=customer["tier"], order_count=len(orders))
billing.apply_discount(customer_id=customer["id"], discount_pct=discount["pct"])
email.send_confirmation(to=customer["email"], discount_pct=discount["pct"])

Bifrost executes this in a sandboxed Starlark environment: no imports, no file I/O, no network access, just tool calls and basic Python-like logic. The model never sees the intermediate results; it only gets the final output.
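Starlark enforces those restrictions at the language level, simply by not having imports or I/O primitives. To make the policy concrete, here is a minimal Python sketch of the same idea, statically rejecting scripts that import modules or call dangerous builtins; this is an illustration, not Bifrost's sandbox:

```python
import ast

# Builtins a sandboxed orchestration script must not call. This blocklist
# is illustrative; Starlark achieves the same effect by omitting these
# capabilities from the language entirely.
FORBIDDEN_CALLS = {"open", "exec", "eval", "__import__"}

def check_script(source: str) -> list[str]:
    """Return a list of policy violations found in the script."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import statement")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                violations.append(f"call to {node.func.id}")
    return violations

safe = 'customer = crm.lookup_customer(email="john@example.com")'
unsafe = 'import os\nopen("/etc/passwd")'

print(check_script(safe))    # []
print(check_script(unsafe))  # ['import statement', 'call to open']
```

Plain tool calls pass; anything reaching for the host environment is rejected before execution.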

The Benchmark Numbers

Bifrost ran three controlled benchmark rounds with Code Mode on and off, scaling tool count between rounds to see how savings change as MCP footprint grows.

| | Round 1 (96 tools, 6 servers) | Round 2 (251 tools, 11 servers) | Round 3 (508 tools, 16 servers) |
| --- | --- | --- | --- |
| Input Tokens (OFF) | 19.9M | 35.7M | 75.1M |
| Input Tokens (ON) | 8.3M | 5.5M | 5.4M |
| Token Reduction | −58.2% | −84.5% | −92.8% |
| Cost (OFF) | $104.04 | $180.07 | $377.00 |
| Cost (ON) | $46.06 | $29.80 | $29.00 |
| Cost Reduction | −55.7% | −83.4% | −92.2% |
| Pass Rate | 100% | 100% | 100% |

Two things stand out here. First, the savings are not linear; they grow as you add servers. Classic MCP loads every tool definition on every request, so connecting more servers makes the problem worse at a faster rate. Code Mode's cost is controlled by what the model actually reads, not by how many tools exist.

Second, and most importantly, accuracy does not drop. The pass rate held at 100% across all three rounds. You are not trading capability for cost savings; you are getting both.
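The reduction percentages follow directly from the token counts in the table; a quick check (small differences from the published figures come from the million-scale rounding of the inputs):

```python
# Recompute the token-reduction percentages from the benchmark table.
# Inputs are the rounded (M-scale) figures, so results can drift ~0.1%
# from the published reductions, which were computed on raw counts.
rounds = {
    "Round 1": (19.9e6, 8.3e6),
    "Round 2": (35.7e6, 5.5e6),
    "Round 3": (75.1e6, 5.4e6),
}

for name, (off, on) in rounds.items():
    reduction = (off - on) / off * 100
    print(f"{name}: -{reduction:.1f}%")
# Round 1: -58.3%, Round 2: -84.6%, Round 3: -92.8%
```

The per-round savings climb steeply with server count, while the ON-side token totals stay nearly flat, which is exactly what you would expect when cost is driven by what the model reads rather than by what exists.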

Access Control and Governance

Cost is only half of the production problem with MCP at scale. The other half is control. When a new engineer joins your team, you don't hand them unrestricted access to every system the company runs. But the moment you connect an AI agent to a fleet of MCP servers with no governance layer, that's effectively what you have done.

Access Control and Governance diagram

Bifrost MCP Gateway handles this through three mechanisms:

1. Virtual Keys: Issue scoped credentials to every consumer of your MCP gateway. Each key carries a specific set of tools it's allowed to call, scoped at the tool level, not just the server level. A key can be granted filesystem_read without being granted filesystem_write from the same MCP server. A customer-facing agent simply can't reach your internal admin tooling.

2. MCP Tool Groups: For managing access across teams, customers, and providers at scale. A tool group is a named collection of tools from one or more MCP servers. Attach it to any combination of virtual keys, teams, or users. Bifrost resolves the allowed set at request time with no database queries; everything is indexed in memory and synced across cluster nodes automatically.

3. Audit Logging: Every tool execution is a first-class log entry. For each call you get the tool name, the server it came from, arguments passed in, the result returned, latency, the virtual key that triggered it, and the parent LLM request that initiated the agent loop. Pull up any agent run and trace exactly which tools it called, in what order, and what came back.
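The resolution logic described above reduces to a set union over a key's attached groups, held in memory. As a sketch (the group, key, and tool names are hypothetical; Bifrost's actual data model may differ):

```python
# Sketch of request-time tool resolution: a virtual key's allowed set is
# the union of the tool groups attached to it, precomputed in memory so
# no database query is needed per request. All names are hypothetical.
TOOL_GROUPS = {
    "support-readonly": {"crm.lookup_customer", "crm.get_order_history"},
    "billing-ops": {"billing.calculate_discount", "billing.apply_discount"},
}

VIRTUAL_KEYS = {
    "vk-support-agent": ["support-readonly"],
    "vk-internal-ops": ["support-readonly", "billing-ops"],
}

def allowed_tools(virtual_key: str) -> set[str]:
    """Resolve the tools a key may call as the union of its groups."""
    tools: set[str] = set()
    for group in VIRTUAL_KEYS.get(virtual_key, []):
        tools |= TOOL_GROUPS[group]
    return tools

def is_allowed(virtual_key: str, tool: str) -> bool:
    return tool in allowed_tools(virtual_key)

print(is_allowed("vk-support-agent", "crm.lookup_customer"))    # True
print(is_allowed("vk-support-agent", "billing.apply_discount")) # False
```

A key outside a group's scope never even sees the definitions for those tools, which is what keeps a customer-facing agent away from internal admin tooling.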

Bifrost also tracks cost at the tool level. MCP costs are not just token costs: if your tools call paid external APIs, each call has a price. Those appear in the logs alongside LLM token costs, giving you a complete picture of what each agent run actually cost.
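Combining the two cost streams per agent run might look like this. The log-entry fields below are modeled on the audit-log description above, but the exact schema is an assumption:

```python
# Aggregate LLM token cost and per-tool API cost for one agent run.
# The entry fields mirror the audit-log description in the text; the
# exact field names are hypothetical.
audit_log = [
    {"run_id": "run-42", "type": "llm", "cost_usd": 0.031},
    {"run_id": "run-42", "type": "tool", "tool": "crm.lookup_customer", "cost_usd": 0.002},
    {"run_id": "run-42", "type": "tool", "tool": "billing.apply_discount", "cost_usd": 0.010},
]

def run_cost(entries: list[dict], run_id: str) -> float:
    """Total cost of an agent run: model tokens plus tool API calls."""
    return sum(e["cost_usd"] for e in entries if e["run_id"] == run_id)

print(f"${run_cost(audit_log, 'run-42'):.3f}")  # $0.043
```

With both streams in one log, "what did this run cost" becomes a single query instead of a reconciliation exercise across two systems.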

For teams with compliance requirements, the logging pipeline is designed to support SOC 2 Type II, GDPR, HIPAA, and ISO 27001 compliance. Content logging can be disabled per environment while still capturing tool name, server, latency, and status.

How to Get Started in Minutes

Bifrost is designed for fast adoption. If you just want to spin up the gateway quickly, npx -y @maximhq/bifrost has you running in under 30 seconds, which you can follow in the Setting Up guide.

The steps below walk through the full governed setup via the dashboard; here is the path from zero to a fully governed MCP gateway with Code Mode enabled:

Step 1: Add an MCP client
Navigate to the MCP section in the Bifrost dashboard. Add your first MCP server, give it a name, choose the connection type (HTTP, SSE, or STDIO), and enter the endpoint or command. Bifrost connects, discovers its tools, and starts syncing them on a configured interval.

Step 2: Enable Code Mode
Open client settings and toggle Code Mode on. No schema changes, no redeployment. Token usage drops immediately.

Step 3: Set tools to auto-execute
By default, tool calls require manual approval. Open the auto-execute settings and allowlist the tools you want to run autonomously. You can allowlist at the tool level: filesystem_read can auto-execute while filesystem_write stays behind an approval gate.

Step 4: Restrict access with virtual keys
Create a key for each consumer (user, team, customer integration). Under MCP settings, select which tools the key can call. Any request made with that key only sees the tools it's been granted; the model never receives definitions for tools outside its scope.

Step 5: Connect Claude Code to Bifrost
Bifrost exposes all connected MCP servers through a single /mcp endpoint. Add Bifrost as an MCP server in your Claude Code MCP settings using that URL. Claude Code discovers every tool from every connected server through a single connection. Add new MCP servers to Bifrost, and they appear in Claude Code automatically, no client-side config changes required.
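In Claude Code, that registration is a single entry in your MCP settings, for example in a project-level .mcp.json. The exact shape can vary by Claude Code version, so treat this as a sketch; the placeholder URL matches the base URL used earlier:

```json
{
  "mcpServers": {
    "bifrost": {
      "type": "http",
      "url": "https://<bifrost_url>/mcp"
    }
  }
}
```

From Claude Code's perspective, Bifrost is just one MCP server; every tool behind it arrives through that single connection.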

The full setup documentation, performance benchmarks, and deployment guides are available at the Bifrost resources hub.

Conclusion

Most teams start with one MCP server and one agent, and it works fine. Then they add more servers, more teams, customer-facing workflows, and the questions that were ignorable early on become unavoidable: who can call what, what this is actually costing, and what happened when something broke.

Code Mode is a fantastic answer to the cost problem: 92% token reduction at scale, with no accuracy tradeoff. But the governance layer (virtual keys, tool groups, audit logs, per-tool cost tracking) is equally important for teams moving beyond prototyping.

Bifrost's LLM gateway handles the model side of the same equation, which is provider routing, fallbacks, load balancing, rate limiting, and unified key management across every major AI provider. When both flow through the same gateway, you get a complete picture of every agent run, model tokens and tool costs together, under a single access control model, in one audit log.

If you are building production AI agent workflows, it's worth looking at what that infrastructure layer should actually be doing.

Get started: github.com/maximhq/bifrost | getmaxim.ai/bifrost


Stay up to date with AI, tech, productivity, and personal growth

If you enjoyed these articles, connect and follow me across social media, where I share content related to all of these topics 🔥

I also have a newsletter where I share my thoughts and knowledge on AI, tech, productivity, and personal growth.

