Bifrost, the open-source Go-based AI gateway, can cut input token consumption by up to 92.8% in multi-server MCP configurations without degrading task completion. This guide explores five distinct optimization strategies—Code Mode, tool filtering, semantic caching, provider routing, and spending governance—that address token overhead at different architectural layers.
In production environments, agentic AI applications face a predictable cost challenge: tool calling becomes the primary token consumer. Each MCP-enabled request must inject the complete schema for every connected tool into the model's context window before any meaningful work begins. For a setup with 5 MCP servers and 20 tools per server, that's 100 tool definitions loaded on every turn, regardless of whether they're needed. Across hundreds of agent runs in production, this overhead balloons rapidly.
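The arithmetic behind that overhead can be sketched in a few lines. The figure of 150 tokens per tool schema and the 10 turns per run are assumptions chosen for illustration, not measured values; real schemas vary widely in size.

```python
# Rough estimate of schema overhead in a standard MCP setup.
# tokens_per_schema and turns are assumed values, not measurements.
def schema_overhead(servers, tools_per_server, tokens_per_schema, turns):
    tool_count = servers * tools_per_server
    per_turn = tool_count * tokens_per_schema
    return {
        "tool_definitions": tool_count,
        "schema_tokens_per_turn": per_turn,
        "schema_tokens_per_run": per_turn * turns,
    }

estimate = schema_overhead(servers=5, tools_per_server=20,
                           tokens_per_schema=150, turns=10)
print(estimate)
```

Under these assumptions, a single 10-turn agent run spends 150,000 input tokens on tool definitions alone, before any task-specific content is counted.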
The five strategies below tackle this problem at distinct layers: how tools are exposed to the model, which tools each consumer can access, how cached responses bypass redundant calls, which LLM provider handles the request, and how to enforce spending boundaries before costs spiral.
Technique 1: Code Mode Dramatically Reduces Schema Overhead
The single largest source of wasted tokens in typical MCP setups is the repeated injection of full tool schemas on each request. In standard MCP configurations, all connected tools are presented directly to the model, which means every request carries the entire schema catalog regardless of actual tool needs.
Bifrost's Code Mode reimagines this workflow entirely.
Rather than exposing the full tool catalog, Code Mode presents four minimal meta-tools:
- listToolFiles: identify available MCP servers
- readToolFile: fetch compact Python stubs for tool signatures on demand
- getToolDocs: load full documentation for specific tools when required
- executeToolCode: run Python (Starlark) in a secure sandbox with complete tool bindings
In place of one-by-one tool calls, the model composes a brief orchestration script. Intermediate computations stay within the sandbox. Only the concise final result is returned to the model's context.
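To make the pattern concrete, here is a minimal sketch of the kind of script a model might submit in one executeToolCode call. The tool functions `search_tickets` and `get_ticket_owner` are hypothetical stand-ins defined locally so the example runs on its own; in a real sandbox the bindings would come from the connected MCP servers. The code is plain Python and has not been checked against the Starlark dialect the sandbox accepts.

```python
# Hypothetical stand-ins for tool bindings that a sandbox would provide.
def search_tickets(team):
    return [
        {"id": 101, "team": team, "status": "open"},
        {"id": 102, "team": team, "status": "closed"},
        {"id": 103, "team": team, "status": "open"},
    ]

def get_ticket_owner(ticket_id):
    owners = {101: "dana", 102: "lee", 103: "sam"}
    return owners[ticket_id]

def orchestrate(team):
    # Intermediate results stay inside the script, not in the model context.
    tickets = search_tickets(team)
    open_tickets = [t for t in tickets if t["status"] == "open"]
    owners = sorted([get_ticket_owner(t["id"]) for t in open_tickets])
    # Only this compact summary would be returned to the model.
    return {"team": team, "open_count": len(open_tickets), "owners": owners}

result = orchestrate("platform")
print(result)
```

Three tool calls happen inside the script, but the model only sees the final dictionary rather than the raw ticket list and each owner lookup.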
The performance gains grow with tool count:
| MCP footprint | Standard MCP tokens | Code Mode tokens | Token reduction |
|---|---|---|---|
| 96 tools / 6 servers | 19.9M | 8.3M | -58.2% |
| 251 tools / 11 servers | 35.7M | 5.5M | -84.5% |
| 508 tools / 16 servers | 75.1M | 5.4M | -92.8% |
At 508 tools distributed across 16 servers, Code Mode decreased average input tokens per query from 1.15M to 83K while maintaining a 65/65 (100%) task success rate. As the tool registry grows, standard MCP costs scale linearly, whereas Code Mode costs remain constrained by what the model actually consumes.
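The per-query averages and the headline reduction can be checked from the rounded totals. This sketch assumes the 65 tasks in the success rate are the same 65 queries behind the token totals, which the text implies but does not state outright.

```python
# Recompute per-query averages for the 508-tool configuration.
# QUERIES = 65 is inferred from the 65/65 success rate (an assumption).
QUERIES = 65
standard_total = 75.1e6   # standard MCP input tokens
code_mode_total = 5.4e6   # Code Mode input tokens

standard_avg = standard_total / QUERIES
code_mode_avg = code_mode_total / QUERIES
reduction_pct = (1 - code_mode_total / standard_total) * 100

print(f"standard avg:  {standard_avg:,.0f} tokens/query")
print(f"code mode avg: {code_mode_avg:,.0f} tokens/query")
print(f"reduction:     {reduction_pct:.1f}%")
```

The results land at roughly 1.15M and 83K tokens per query, and the reduction works out to the 92.8% headline figure.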
Enable Code Mode in config.json per client:
```json
{
  "mcp": {
    "client_configs": [
      {
        "name": "filesystem",
        "connection_type": "stdio",
        "stdio_config": {
          "command": "npx",
          "args": ["-y", "@anthropic/mcp-filesystem"]
        },
        "tools_to_execute": ["*"],
        "is_code_mode_client": true
      }
    ]
  }
}
```
Best practice: activate Code Mode for deployments with 3 or more MCP servers, or for any "heavy" server that manages web search, document retrieval, or databases. Lightweight, single-purpose servers can remain in standard MCP mode within the same deployment.
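One way to apply that rule of thumb across a config is sketched below. It reuses the `mcp`, `client_configs`, and `is_code_mode_client` keys from the config above; the `apply_code_mode` helper, the `heavy_clients` set, and the client names are hypothetical. It also encodes one reading of the guidance (enable everywhere at 3+ servers, otherwise only for heavy servers), which you may want to relax for lightweight servers.

```python
import copy

def apply_code_mode(config, heavy_clients, min_servers=3):
    """Return a copy of config with is_code_mode_client set per client."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    clients = updated["mcp"]["client_configs"]
    many_servers = len(clients) >= min_servers
    for client in clients:
        client["is_code_mode_client"] = (
            many_servers or client["name"] in heavy_clients
        )
    return updated

# Hypothetical two-server deployment: one heavy server, one lightweight.
config = {"mcp": {"client_configs": [{"name": "web_search"},
                                     {"name": "calculator"}]}}
updated = apply_code_mode(config, heavy_clients={"web_search"})
flags = {c["name"]: c["is_code_mode_client"]
         for c in updated["mcp"]["client_configs"]}
print(flags)
```

With only two servers, the heavy `web_search` client is switched to Code Mode while `calculator` stays in standard MCP mode.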
The complete architecture and full benchmark results are available on the Bifrost MCP gateway resource page.
Technique 2: Tool Filtering Restricts Context to What's Actually Needed
Beyond Code Mode, a significant source of token waste comes from loading tools that won't be accessed. If a user-facing agent has read access to internal administrative tools, every request still pays the schema token cost, even though those tools are irrelevant.
Bifrost uses virtual key-based tool filtering to control tool visibility at a granular level. Each virtual key carries an explicit allow-list of MCP clients and individual tools. A request authenticated with that key only receives tools from its allow-list; schemas for other servers remain outside the context entirely.
This filtering delivers two cascading benefits. Fewer schemas per request means lower token costs. Beyond that, model accuracy improves: models choose more reliably from a curated set of 10 relevant tools than from a broader set of 100 tools, most of which don't apply.
Implementation approach:
- Assign a dedicated virtual key to each agent type or customer tier
- Set a tool allow-list for each key, restricted to the tools that agent type or tier requires
- Keep the complete tool set available only for internal admin or testing virtual keys
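The allow-list logic can be sketched as a simple filter over the tool catalog. The key names, client names, and tool names below are hypothetical, and the data structures are illustrative rather than Bifrost's actual configuration format; the `"*"` wildcard mirrors the one used in `tools_to_execute`.

```python
# Hypothetical catalog of MCP clients and their tools.
CATALOG = {
    "tickets": ["search_tickets", "get_ticket", "delete_ticket"],
    "admin": ["rotate_keys", "purge_cache"],
}

# Hypothetical allow-lists: virtual key -> client -> permitted tools.
ALLOW_LISTS = {
    "vk-support-agent": {"tickets": {"search_tickets", "get_ticket"}},
    "vk-internal-admin": {"tickets": {"*"}, "admin": {"*"}},
}

def visible_tools(virtual_key):
    """Return the tools whose schemas would be exposed for this key."""
    allowed = ALLOW_LISTS.get(virtual_key, {})
    visible = []
    for client, tools in CATALOG.items():
        permitted = allowed.get(client, set())
        for tool in tools:
            if "*" in permitted or tool in permitted:
                visible.append(f"{client}.{tool}")
    return visible

print(visible_tools("vk-support-agent"))
print(len(visible_tools("vk-internal-admin")))
```

The support key sees two of the five tools, so a request made with it carries two schemas instead of five. An unknown key sees nothing, which is the safer default.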
Virtual keys serve as the primary governance mechanism in Bifrost. The same key that restricts tool access also enforces token budgets and request rate limits, so governance configuration lives in one place rather than across multiple systems.
Technique 3: Semantic Caching Eliminates Duplicate Computation
Token optimization extends beyond reducing per-turn context size. It also means skipping redundant requests altogether. In most production systems, a portion of incoming requests is semantically equivalent to earlier requests.
Bifrost implements semantic caching by storing responses based on meaning rather than exact string matching. A question like "what are the open tickets assigned to the platform team" and a rephrased variation of the same query both trigger the same cache entry when their semantic similarity crosses the configured threshold.
Cache hits complete in roughly 5ms rather than 2,000ms+ for a full provider round-trip. For MCP-enabled systems, a cache hit bypasses the entire tool-calling pipeline: no schema loading, no tool execution, no inference tokens burned.
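The lookup logic behind a semantic cache can be sketched with a similarity threshold. The `embed` function here is a toy bag-of-words vector used only so the example runs without dependencies; a real deployment would use an embedding model, and the 0.8 threshold is an assumed value. This is an illustration of the idea, not Bifrost's implementation.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # hit: skip the provider call entirely
        return None         # miss: caller falls through to the provider

cache = SemanticCache(threshold=0.8)
cache.put("what are the open tickets assigned to the platform team",
          "3 open tickets")
hit = cache.get("which open tickets are assigned to the platform team")
miss = cache.get("how do I reset my password")
print(hit, miss)
```

The rephrased question scores above the threshold and returns the stored response, while the unrelated question misses and would go to the provider as usual.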
Semantic caching delivers the strongest returns in these scenarios:
- Support automation: many queries cluster around common topics (billing, shipping, account access), and while exact wording differs, the underlying question is often identical
- Internal knowledge tools: staff repeatedly rephrase the same policy or doc questions
- Structured Q&A applications: fixed question domains with limited topic ranges produce high hit rates
- Repetitive analytics workflows: reporting systems that answer variations of the same metric questions across multiple users or intervals
Semantic caching integrates through the Bifrost web UI or via config.json without requiring code changes.
Technique 4: Intelligent Model Routing Directs Work to Appropriate Providers
Not every step in an MCP workflow requires the same model. Agentic systems typically involve varied operations: orchestration, tool selection, summarization, and response composition. Running all steps against an advanced frontier model when smaller, less expensive models can handle lighter tasks is a common efficiency gap.
Bifrost's routing rules direct traffic to specific models, providers, and API keys based on configurable logic. Combined with virtual key configuration, routing can ensure:
- Complex reasoning steps target a capable frontier model
- Summarization and classification route to a faster, cheaper option
- Environments other than production (dev, staging, testing) route to the most economical choice
Routing can also incorporate live provider health metrics. When a primary provider times out or returns errors, automatic fallback reroutes traffic, avoiding retry chains that multiply token costs and harm latency.
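A minimal sketch of routing with health-based fallback follows. The model names, the task labels, and the `route` helper are hypothetical; the point is only to show an ordered candidate list per task, a cheap default outside production, and a skip over unhealthy providers.

```python
# Hypothetical ordered candidates per task type (primary first).
PRODUCTION_ROUTES = {
    "reasoning": ["frontier-large", "frontier-large-backup"],
    "summarization": ["small-fast", "frontier-large"],
    "classification": ["small-fast", "frontier-large"],
}
NON_PRODUCTION_ROUTE = ["small-fast"]  # cheapest option for dev/staging

def candidates(environment, task):
    if environment != "production":
        return list(NON_PRODUCTION_ROUTE)
    return list(PRODUCTION_ROUTES[task])

def route(environment, task, healthy):
    """Pick the first healthy model in the candidate list."""
    for model in candidates(environment, task):
        if model in healthy:
            return model
    raise RuntimeError("no healthy provider for this request")

all_healthy = {"frontier-large", "frontier-large-backup", "small-fast"}
print(route("production", "reasoning", all_healthy))
print(route("production", "summarization", all_healthy))
print(route("staging", "reasoning", all_healthy))
# Primary is down: fall back instead of retrying the failing provider.
print(route("production", "reasoning", {"frontier-large-backup", "small-fast"}))
```

Choosing the fallback up front avoids resending the same prompt to a failing provider, which is where retry chains multiply token costs.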
For teams operating MCP-heavy agentic systems across multiple providers, cost-aware routing and failover are configured together at the virtual key level. The same configuration that controls tool access also determines which provider handles each request and which backup routes apply if the first fails.
Technique 5: Spending Limits Prevent Runaway Cost Escalation
The techniques above minimize token use per request. Spending controls keep total consumption across all requests bounded, even as individual requests get cheaper.
Bifrost enforces spending limits hierarchically:
- Virtual key level: hard caps per consumer, per project, or per agent
- Team level: total spending ceilings across a team or department
- Customer level: per-customer limits in multi-tenant setups
These limits apply at request time, not retroactively. Once a virtual key hits its budget cap, further requests are blocked rather than continuing to rack up charges. Rate limiting adds a second guard that caps request volume separately from token count.
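The request-time check across the hierarchy can be sketched as follows. The `Budget` class, the `authorize` and `record` helpers, and the dollar amounts are hypothetical; the sketch assumes a request is blocked as soon as any level in its chain has reached its cap, and that cost is recorded after the request completes.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    name: str
    limit_usd: float
    spent_usd: float = 0.0

    def exhausted(self):
        return self.spent_usd >= self.limit_usd

def authorize(chain):
    """Allow a request only if no level in the chain has hit its cap."""
    return not any(b.exhausted() for b in chain)

def record(chain, cost_usd):
    """Charge a completed request against every level in the chain."""
    for b in chain:
        b.spent_usd += cost_usd

# Hypothetical chain: virtual key -> team -> customer.
chain = [Budget("virtual-key", 10.0), Budget("team", 50.0),
         Budget("customer", 100.0)]

outcomes = []
for cost in [6.0, 6.0, 6.0]:
    allowed = authorize(chain)
    outcomes.append(allowed)
    if allowed:
        record(chain, cost)

print(outcomes)
print(chain[0].spent_usd)
```

The third request is blocked because the virtual key has already passed its cap after the second, even though the team and customer budgets still have room.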
For MCP-specific workloads, the governance layer also ties to tool filtering. One virtual key configuration controls tool access, the token budget, and request rate limits. Governance works consistently across classic MCP, Code Mode, and any other execution model.
This matters most for agentic deployments where a single runaway agent iteration can trigger a cascade of tool calls consuming orders of magnitude more tokens than expected. Spending caps function as a circuit breaker that limits damage from any individual out-of-control workflow.
Layering These Techniques
These five strategies work at different levels and combine effectively in production MCP deployments:
- Code Mode cuts per-request schema token overhead (most impactful for 3+ servers)
- Tool filtering restricts tool exposure to only what each consumer needs
- Semantic caching skips token spend on repeated or near-identical queries
- Model routing assigns each workflow step to the cheapest model capable of handling it
- Spending limits enforce aggregate ceilings and prevent agent runaway scenarios
The Bifrost MCP gateway resource page covers full architecture for deploying these together. Large-scale MCP setups with 10+ connected servers and hundreds of daily agent runs typically see the biggest absolute savings from Code Mode plus tool filtering. Systems with smaller tool sets but high query repetition capture strong returns from semantic caching plus virtual key governance.
All five capabilities are included in the open-source Bifrost release. Code Mode requires v1.4.0-prerelease1 or later and is also available in Bifrost Enterprise for production setups that require clustering, RBAC, and compliance controls.
Getting Started: A Recommended Implementation Path
Teams beginning token optimization should follow this sequence for maximum impact:
- Enable semantic caching as the first step. It requires no architecture changes and immediately reduces costs on systems with query repetition.
- Activate Code Mode for MCP-heavy agentic workflows. It is the highest-leverage change for workloads with 3 or more servers.
- Set up tool filtering through virtual keys to limit tool access to what each consumer truly needs.
- Implement cost-aware routing to assign lighter operations to lower-cost models.
- Add spending caps and rate limits to set upper bounds on aggregate spend and guard against runaway workflows that could negate the prior optimizations.
For assistance adapting these techniques to your MCP infrastructure, book a demo with the Bifrost team.