MCP servers have exploded since Anthropic's MCP launch. Everyone is connecting Claude to filesystems, databases, and web APIs. But production deployments quickly hit the same problems: token bloat, security gaps, and orchestration overhead.
We built an MCP gateway to solve these issues at the infrastructure level. Here's what we learned about making MCP work at scale.
The Core Problem
Direct MCP connections work for demos. Connect Claude to 1-2 servers with 10-15 tools, and everything runs fine. But scale to 5+ servers with 100+ tools, and the model spends more tokens reading tool catalogs than doing actual work.
A six-turn conversation with 100 tools in context burns well over 10,000 tokens on tool definitions alone. The LLM re-reads the entire catalog on every turn, even for simple queries that only need one or two tools.
We tested this on production traffic. An e-commerce assistant with 10 servers and 150 tools was spending $3.20-4.00 per complex query, with 18-25 second latencies. Most of that cost came from redundant tool definitions in every request.
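To put numbers on that, here is the back-of-the-envelope math, using the per-call figures we measured later in this post:

// Token overhead of resending the full tool catalog on every turn
const toolDefinitionTokensPerCall = 2400; // ~100 tools kept in context
const llmCallsPerTask = 6;                // a typical multi-step task
console.log(toolDefinitionTokensPerCall * llmCallsPerTask); // 14400 tokens spent on definitions alone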
Gateway Architecture
The solution is a control plane between LLMs and MCP servers. This gives you three things:
Connection management: Handle STDIO, HTTP, and SSE protocols. Discover tools, maintain connections, monitor health. When a server goes down, the gateway detects it (health checks every 10 seconds) and handles reconnection.
Security layer: Tools don't execute automatically. LLM responses contain suggested tool calls—your code decides whether to execute them. Full audit trails, user approval flows, permission checks.
Performance optimization: Three execution modes (Manual, Agent, Code) that trade off control for throughput based on your use case.
Three Execution Modes
Manual Mode: Explicit Control
Default behavior. LLM suggests tools, you execute them through separate API calls:
POST /v1/chat/completions → tool suggestions
POST /v1/mcp/tool/execute → execute approved tools
POST /v1/chat/completions → continue with results
This gave us the audit trail and approval workflow we needed for production. Every tool execution is a logged, traceable event.
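Here is what that three-step flow looks like from an application, as a hedged sketch: the two endpoints are the gateway's, but the request and response field names (choices, tool_calls, the execute payload) are assumptions based on the OpenAI-compatible chat schema, not confirmed API details.

// Sketch of the Manual Mode round trip. Field names are assumptions; check the gateway docs.
const gateway = "http://localhost:8080";

// 1. Ask for a completion. The model only *suggests* tool calls; nothing executes yet.
const chat = await fetch(`${gateway}/v1/chat/completions`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "openai/gpt-4o-mini",
    messages: [{ role: "user", content: "List the files in /tmp" }],
  }),
}).then(r => r.json());

// 2. Your code decides which suggestions to execute, one audited call at a time.
for (const call of chat.choices[0].message.tool_calls ?? []) {
  const result = await fetch(`${gateway}/v1/mcp/tool/execute`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(call), // payload shape is an assumption
  }).then(r => r.json());

  // 3. Append the result as a tool message and call /v1/chat/completions again to continue.
}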
Agent Mode: Selective Autonomy
We added auto-execution for specific tools. Configure which operations can run without approval:
{
  "tools_to_execute": ["*"],
  "tools_to_auto_execute": ["read_file", "list_directory", "search"]
}
Safe operations (read, search) run automatically. Dangerous operations (write, delete) still require approval.
Agent Mode runs up to 10 iterations by default (configurable), executes auto-approved tools in parallel, and stops when work is complete or max depth is reached.
This cut response times for read-heavy workloads by 40-50% while maintaining control over destructive operations.
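The loop itself is simple. The sketch below is not the gateway's implementation, just an illustration of the behavior described above; every type and helper in it is hypothetical.

// Illustrative Agent Mode loop: iterate up to a max depth, auto-run approved tools
// in parallel, and pause for approval when anything falls outside the allow-list.
type ToolCall = { name: string; args: Record<string, unknown> };
type StepResult = { done: boolean; toolCalls: ToolCall[] };

async function agentLoop(
  step: (results: unknown[]) => Promise<StepResult>, // one LLM turn (hypothetical helper)
  execute: (call: ToolCall) => Promise<unknown>,     // one tool execution (hypothetical helper)
  autoExecute: string[],                             // e.g. ["read_file", "list_directory", "search"]
  maxIterations = 10,                                // the default depth limit
) {
  let results: unknown[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const { done, toolCalls } = await step(results);
    if (done) return { results }; // work is complete

    // Anything outside the auto-execute list pauses the loop for human approval.
    const blocked = toolCalls.filter(
      c => !autoExecute.includes("*") && !autoExecute.includes(c.name)
    );
    if (blocked.length > 0) return { pendingApproval: blocked };

    // Auto-approved tools run in parallel; their results feed the next turn.
    results = await Promise.all(toolCalls.map(execute));
  }
  return { results, stopped: "max depth reached" };
}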
Code Mode: Solving Token Bloat
For 3+ servers, we tested a different approach. Instead of exposing 100+ tools directly, expose three meta-tools that let the AI write TypeScript code to orchestrate everything:
- listToolFiles: discover available servers
- readToolFile: load tool definitions on-demand
- executeToolCode: execute TypeScript in a sandboxed VM
The AI writes one script that calls multiple tools. All coordination happens in the sandbox. Only the final result returns to the LLM.
Measured impact:
- Token usage: 50%+ reduction
- Latency: 40-50% faster execution
- LLM turns: 6-10 turns → 3-4 turns
We tested this on that e-commerce assistant. Cost dropped from $3.20-4.00 to $1.20-1.80 per query. Latency went from 18-25 seconds to 8-12 seconds.
The difference comes from keeping tool definitions out of context until needed, and executing all coordination in one VM call instead of multiple LLM round-trips.
Code Mode Implementation
We generate TypeScript declarations for all connected servers:
servers/
youtube.d.ts ← all YouTube tools
filesystem.d.ts ← all filesystem tools
database.d.ts ← all database tools
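For a sense of what those files contain, here is a rough sketch of two generated declarations. The real signatures are derived from each server's tool schemas, so these exact shapes are assumptions that simply mirror the example script below.

// servers/youtube.d.ts (illustrative shapes, matching the example usage below)
declare namespace youtube {
  interface SearchParams { query: string; maxResults?: number }
  interface SearchResult { items: { snippet: { title: string } }[] }
  function search(params: SearchParams): Promise<SearchResult>;
}

// servers/filesystem.d.ts
declare namespace filesystem {
  function write_file(params: { path: string; content: string }): Promise<void>;
}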
The AI reads what it needs, writes code:
const results = await youtube.search({ query: "AI news", maxResults: 3 });
const titles = results.items.map(item => item.snippet.title);

// Only the return value goes back to the LLM; the intermediate data stays in the sandbox.
await filesystem.write_file({
  path: "results.json",
  content: JSON.stringify(titles)
});

return { saved: titles.length };
This executes in a Goja VM with TypeScript transpilation, async/await support, and 30-second timeout protection. Sandbox restrictions prevent network access, file system access, and other dangerous operations.
Security Model
Three layers of filtering:
Connection-level: Choose which tools from each server are available. Server connects with 50 tools? You can expose 5.
Execution-level: Available tools don't auto-execute unless configured. Default requires explicit API calls.
Code-level: When Agent Mode + Code Mode are both enabled, we parse the TypeScript and validate every tool call against auto-execute lists before running.
We tested this model with internal teams. Read operations auto-execute. Write operations require approval. Delete operations need double confirmation.
This gave us the speed we wanted for 90% of operations while maintaining control over the 10% that could break things.
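As a simplified stand-in for that code-level check, the idea is roughly this. The gateway parses the generated TypeScript properly; the regex version below is only for illustration, and every name in it is hypothetical.

// Find every server.tool(...) call site in a generated script and reject the script if
// any of them is missing from the auto-execute list. A real implementation walks the
// parsed AST so it can ignore builtins like JSON.stringify; this regex is just a sketch.
function findToolCalls(script: string): string[] {
  const callPattern = /\b([a-zA-Z_]\w*)\.([a-zA-Z_]\w*)\s*\(/g;
  return [...script.matchAll(callPattern)].map(m => `${m[1]}.${m[2]}`);
}

function validateScript(script: string, autoExecute: Set<string>) {
  const blocked = findToolCalls(script).filter(call => !autoExecute.has(call));
  return { ok: blocked.length === 0, blocked };
}

// write_file is not auto-approved, so this script would pause for approval.
console.log(validateScript(
  'await filesystem.write_file({ path: "out.json", content: "[]" })',
  new Set(["youtube.search", "filesystem.read_file", "filesystem.list_directory"])
)); // → { ok: false, blocked: ["filesystem.write_file"] }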
Connection Types
STDIO: Spawn subprocesses for local tools. Works for filesystem operations, CLI utilities, Python/Node.js servers:
{
  "connection_type": "stdio",
  "stdio_config": {
    "command": "npx",
    "args": ["-y", "@anthropic/mcp-filesystem"]
  }
}
HTTP: Standard REST APIs. Remote services, microservices, cloud functions:
{
  "connection_type": "http",
  "connection_string": "https://mcp-server.example.com/mcp"
}
SSE: Server-Sent Events for real-time data. Market feeds, system monitoring, event streams:
{
  "connection_type": "sse",
  "connection_string": "https://stream.example.com/sse"
}
Health monitoring runs every 10 seconds. Disconnected clients automatically attempt reconnection. Connection states (connected, connecting, disconnected, error) are exposed via API.
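If you want to surface those states in your own dashboards, polling the gateway is enough. The endpoint path and response shape in this sketch are hypothetical placeholders, not Bifrost's actual API; check the project docs for the real one.

// Hypothetical polling sketch: the path and field names below are placeholders.
// The point is that connection state is queryable from your own tooling.
type ClientState = "connected" | "connecting" | "disconnected" | "error";

async function watchClients(gatewayUrl: string, intervalMs = 10_000) {
  setInterval(async () => {
    const clients: { name: string; state: ClientState }[] =
      await fetch(`${gatewayUrl}/v1/mcp/clients`).then(r => r.json()); // placeholder path
    for (const c of clients.filter(c => c.state !== "connected")) {
      console.warn(`MCP client ${c.name} is ${c.state}`); // alert, page, etc.
    }
  }, intervalMs);
}

watchClients("http://localhost:8080");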
Performance Data
We ran this against production traffic. The numbers we saw:
Classic MCP (5 servers, 100 tools):
- 6 LLM calls per complex task
- 2,400 tokens in tool definitions per call
- 14,400 total tokens just for tool catalogs
- Most context budget wasted on redundant definitions
Code Mode (same 5 servers):
- 3-4 LLM calls per task
- 100-300 tokens in tool definitions per call
- 400-1,200 total tokens for tool catalogs
- Context available for actual work
For simple queries, Manual Mode works fine. For multi-step workflows, Code Mode pays off immediately. For real-time agents, Agent Mode + Code Mode gives you the best of both worlds.
Where This Matters
Three scenarios where the gateway model makes sense:
Production agents: Need audit trails, approval workflows, and security validation. Can't let Claude delete production databases without confirmation.
Multi-server setups: 3+ MCP servers where token costs matter. The more servers you connect, the bigger the Code Mode advantage.
Complex workflows: Tasks that require 5-10 tool calls. Running coordination in a sandbox is faster than 5-10 LLM round-trips.
For simple use cases (1-2 servers, basic queries), direct MCP connections work fine. But production deployments with multiple servers and complex workflows need infrastructure to handle security, performance, and reliability.
Implementation
We built this as open-source infrastructure (Bifrost). You can run it as a gateway, embed it via the Go SDK, or deploy it in Kubernetes. The architecture is simple: the gateway sits between your app and your LLM providers, routing chat requests and MCP tool execution through one interface.
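Getting a local instance running takes a single command; both options below come from the project README:

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost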
Go architecture delivered what we needed:
- Sub-3ms latency overhead at 5,000 RPS
- 11µs mean overhead in sustained benchmarks
- Minimal memory footprint compared to Python alternatives (372MB)
Connection pooling, concurrent request handling, and efficient JSON parsing keep overhead low even under heavy load.
What We Learned
Token bloat is real: Past 2-3 servers, tool definitions dominate your context budget. Code Mode solves this by loading definitions on-demand instead of sending everything on every turn.
Security needs infrastructure: Expecting developers to validate tool calls in application code leads to gaps. Centralizing security at the gateway level makes it impossible to bypass.
Performance scales with mode: Manual Mode for 1-2 servers. Agent Mode for mixed workloads. Code Mode for 3+ servers or complex orchestration.
Monitoring matters: Health checks, connection state tracking, and audit logs aren't optional in production. The gateway handles this as infrastructure.
The MCP ecosystem is growing fast. Anthropic, Microsoft, and others are building servers for every possible tool. But connecting AI directly to these servers creates scaling problems. The gateway pattern solves these at the infrastructure level so you can focus on building agents that work.
Implementation: https://github.com/maximhq/bifrost
Open-source MCP gateway with connection management, security validation, and Code Mode for production deployments.
