Pranay Batta

Building an MCP Gateway: Lessons from Production

MCP servers have exploded since Anthropic's MCP launch. Everyone's connecting Claude to filesystems, databases, and web APIs. But production deployments quickly hit the same problems: token bloat, security gaps, and orchestration overhead.

We built an MCP gateway to solve these issues at the infrastructure level. Here's what we learned about making MCP work at scale.

maximhq / bifrost

Fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start


Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration, real-time monitoring…

The Core Problem

Direct MCP connections work for demos. Connect Claude to 1-2 servers with 10-15 tools, and everything runs fine. But scale to 5+ servers with 100+ tools, and the model spends more tokens reading tool catalogs than doing actual work.

A six-turn conversation with 100 tools in context burns more than 14,000 tokens on tool definitions alone. The LLM re-reads the entire catalog on every turn, even for simple queries that only need one or two tools.

We tested this on production traffic. An e-commerce assistant with 10 servers and 150 tools was spending $3.20-4.00 per complex query, with 18-25 second latencies. Most of that cost came from redundant tool definitions in every request.

Gateway Architecture

The solution is a control plane between LLMs and MCP servers. This gives you three things:

Connection management: Handle STDIO, HTTP, and SSE protocols. Discover tools, maintain connections, monitor health. When a server goes down, the gateway detects it (health checks every 10 seconds) and handles reconnection.

Security layer: Tools don't execute automatically. LLM responses contain suggested tool calls—your code decides whether to execute them. Full audit trails, user approval flows, permission checks.

Performance optimization: Three execution modes (Manual, Agent, Code) that trade off control for throughput based on your use case.

Three Execution Modes

Manual Mode: Explicit Control

Default behavior. LLM suggests tools, you execute them through separate API calls:

POST /v1/chat/completions → tool suggestions
POST /v1/mcp/tool/execute → execute approved tools
POST /v1/chat/completions → continue with results

This gave us the audit trail and approval workflow we needed for production. Every tool execution is a logged, traceable event.
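
To make that loop concrete, here's a minimal Go sketch of the three calls against a local gateway. The chat endpoint is the OpenAI-compatible one from the quick start; the body sent to /v1/mcp/tool/execute is a hypothetical shape for illustration, not the documented schema.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

const gateway = "http://localhost:8080"

// post sends a JSON body to the gateway and decodes the JSON response.
func post(path string, body any) (map[string]any, error) {
    payload, err := json.Marshal(body)
    if err != nil {
        return nil, err
    }
    resp, err := http.Post(gateway+path, "application/json", bytes.NewReader(payload))
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var out map[string]any
    return out, json.NewDecoder(resp.Body).Decode(&out)
}

func main() {
    // 1. Chat call: the response may contain suggested tool calls.
    chat, err := post("/v1/chat/completions", map[string]any{
        "model":    "openai/gpt-4o-mini",
        "messages": []map[string]string{{"role": "user", "content": "List the files in /tmp"}},
    })
    if err != nil {
        panic(err)
    }

    // 2. Your code inspects chat["choices"], applies whatever approval logic
    //    you need, then executes only the calls you allow (hypothetical body).
    result, err := post("/v1/mcp/tool/execute", map[string]any{
        "tool_call": "<the approved call extracted from the chat response>",
    })
    if err != nil {
        panic(err)
    }

    // 3. Append the tool result to the conversation and call
    //    /v1/chat/completions again to continue.
    fmt.Println(result)
}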

Agent Mode: Selective Autonomy

We added auto-execution for specific tools. Configure which operations can run without approval:

{
  "tools_to_execute": ["*"],
  "tools_to_auto_execute": ["read_file", "list_directory", "search"]
}

Safe operations (read, search) run automatically. Dangerous operations (write, delete) still require approval.

Agent Mode runs up to 10 iterations by default (configurable), executes auto-approved tools in parallel, and stops when work is complete or max depth is reached.

This cut response times for read-heavy workloads by 40-50% while maintaining control over destructive operations.
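
The loop behind Agent Mode is straightforward. Here's an illustrative Go sketch of the behavior described above: an iteration cap, parallel execution of auto-approved tools, and an approval gate for everything else. The types and helpers are hypothetical stand-ins, not the gateway's internals.

package main

import (
    "fmt"
    "sync"
)

type toolCall struct{ name string }

// Stubs standing in for the LLM round-trip and for real tool execution.
func nextToolCalls(iteration int) []toolCall {
    if iteration > 0 {
        return nil // pretend the model finished after one round
    }
    return []toolCall{{"read_file"}, {"search"}, {"write_file"}}
}

func executeTool(c toolCall)     { fmt.Println("executed:", c.name) }
func requestApproval(c toolCall) { fmt.Println("held for approval:", c.name) }

func main() {
    autoExecute := map[string]bool{"read_file": true, "list_directory": true, "search": true}
    const maxIterations = 10 // default cap, configurable

    for i := 0; i < maxIterations; i++ {
        calls := nextToolCalls(i)
        if len(calls) == 0 {
            break // work is complete
        }
        var wg sync.WaitGroup
        for _, c := range calls {
            if !autoExecute[c.name] {
                requestApproval(c) // writes and deletes still wait for a human
                continue
            }
            wg.Add(1)
            go func(c toolCall) { // safe reads and searches run in parallel
                defer wg.Done()
                executeTool(c)
            }(c)
        }
        wg.Wait()
    }
}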

Code Mode: Solving Token Bloat

For 3+ servers, we tested a different approach. Instead of exposing 100+ tools directly, expose three meta-tools that let the AI write TypeScript code to orchestrate everything:

  1. listToolFiles - discover available servers
  2. readToolFile - load tool definitions on-demand
  3. executeToolCode - execute TypeScript in a sandboxed VM

The AI writes one script that calls multiple tools. All coordination happens in the sandbox. Only the final result returns to the LLM.
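
Put differently, the model's tool surface shrinks to three entries no matter how many servers sit behind the gateway. A tiny Go sketch of that surface; the struct is illustrative and only the three tool names come from the gateway:

package main

import "fmt"

// toolDef is an illustrative stand-in for however the gateway represents a
// tool definition; only the three names below are real.
type toolDef struct {
    Name        string
    Description string
}

func main() {
    // In Code Mode the model sees only these three tools, regardless of how
    // many servers (and hundreds of underlying tools) are connected.
    codeModeTools := []toolDef{
        {"listToolFiles", "list the declaration files of connected servers"},
        {"readToolFile", "load one server's tool definitions on demand"},
        {"executeToolCode", "run a TypeScript snippet in the sandboxed VM"},
    }
    for _, t := range codeModeTools {
        fmt.Printf("%s: %s\n", t.Name, t.Description)
    }
}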

Measured impact:

  • Token usage: 50%+ reduction
  • Latency: 40-50% faster execution
  • LLM turns: 6-10 turns → 3-4 turns

We tested this on that e-commerce assistant. Cost dropped from $3.20-4.00 to $1.20-1.80 per query. Latency went from 18-25 seconds to 8-12 seconds.

The difference comes from keeping tool definitions out of context until needed, and executing all coordination in one VM call instead of multiple LLM round-trips.

Code Mode Implementation

We generate TypeScript declarations for all connected servers:

servers/
  youtube.d.ts       ← all YouTube tools
  filesystem.d.ts    ← all filesystem tools
  database.d.ts      ← all database tools

The AI reads what it needs, writes code:

const results = await youtube.search({ query: "AI news", maxResults: 3 });
const titles = results.items.map(item => item.snippet.title);
await filesystem.write_file({ 
  path: "results.json", 
  content: JSON.stringify(titles) 
});
return { saved: titles.length };

This executes in a Goja VM with TypeScript transpilation, async/await support, and 30-second timeout protection. Sandbox restrictions prevent network access, file system access, and other dangerous operations.
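
For a feel of what that sandbox looks like, here's a stripped-down Go sketch using goja (github.com/dop251/goja). It leaves out TypeScript transpilation and the async event loop and simply runs a JavaScript snippet with a timeout; the important part is that the VM only sees what you explicitly bind into it.

package main

import (
    "fmt"
    "time"

    "github.com/dop251/goja"
)

// runSandboxed executes a script in a fresh VM. Nothing is injected except
// the bindings you choose to expose, so the script has no ambient network
// or filesystem access.
func runSandboxed(src string, timeout time.Duration) (goja.Value, error) {
    vm := goja.New()

    // Illustrative binding; the real gateway would expose the per-server
    // tool namespaces (youtube, filesystem, ...) here instead.
    _ = vm.Set("add", func(a, b int) int { return a + b })

    // Interrupt the VM if the script runs past its deadline.
    timer := time.AfterFunc(timeout, func() { vm.Interrupt("execution timed out") })
    defer timer.Stop()

    return vm.RunString(src)
}

func main() {
    val, err := runSandboxed(`add(2, 3) * 10`, 30*time.Second)
    if err != nil {
        fmt.Println("sandbox error:", err)
        return
    }
    fmt.Println("result:", val.Export()) // 50
}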

Security Model

Three layers of filtering:

Connection-level: Choose which tools from each server are available. Server connects with 50 tools? You can expose 5.

Execution-level: Available tools don't auto-execute unless configured. Default requires explicit API calls.

Code-level: When Agent Mode + Code Mode are both enabled, we parse the TypeScript and validate every tool call against auto-execute lists before running.
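
As a rough illustration of that code-level check: the real gateway parses the TypeScript, but even a simplified scan over known server namespaces shows the shape of the rule. Every tool call must be on the auto-execute list or the script is held for approval. Everything below is a sketch, not the actual validator.

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// validate rejects a generated script if it calls any server tool that is
// not on the auto-execute list. Only calls on known server namespaces are
// checked, so helpers like JSON.stringify pass through untouched.
func validate(script string, servers []string, autoExecute map[string]bool) error {
    pattern := regexp.MustCompile(`\b(` + strings.Join(servers, "|") + `)\.(\w+)\s*\(`)
    for _, m := range pattern.FindAllStringSubmatch(script, -1) {
        if !autoExecute[m[2]] {
            return fmt.Errorf("tool %s.%s requires approval", m[1], m[2])
        }
    }
    return nil
}

func main() {
    script := `const r = await youtube.search({ query: "AI news" });
await filesystem.write_file({ path: "out.json", content: JSON.stringify(r) });`

    err := validate(script,
        []string{"youtube", "filesystem"},
        map[string]bool{"search": true, "read_file": true, "list_directory": true})
    fmt.Println(err) // tool filesystem.write_file requires approval
}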

We tested this model with internal teams. Read operations auto-execute. Write operations require approval. Delete operations need double confirmation.

This gave us the speed we wanted for 90% of operations while maintaining control over the 10% that could break things.

Connection Types

STDIO: Spawn subprocesses for local tools. Works for filesystem operations, CLI utilities, Python/Node.js servers:

{
  "connection_type": "stdio",
  "stdio_config": {
    "command": "npx",
    "args": ["-y", "@anthropic/mcp-filesystem"]
  }
}

HTTP: Standard REST APIs. Remote services, microservices, cloud functions:

{
  "connection_type": "http",
  "connection_string": "https://mcp-server.example.com/mcp"
}

SSE: Server-Sent Events for real-time data. Market feeds, system monitoring, event streams:

{
  "connection_type": "sse",
  "connection_string": "https://stream.example.com/sse"
}

Health monitoring runs every 10 seconds. Disconnected clients automatically attempt reconnection. Connection states (connected, connecting, disconnected, error) are exposed via API.
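
A rough Go sketch of that loop, using the connection states listed above; the connection type and its methods are illustrative stubs, not the gateway's real implementation:

package main

import (
    "fmt"
    "time"
)

type state string

const (
    connected    state = "connected"
    connecting   state = "connecting"
    disconnected state = "disconnected"
    errored      state = "error"
)

type connection struct {
    name  string
    state state
}

// Stubs standing in for a real protocol-level ping and reconnect.
func (c *connection) ping() bool { return c.state == connected }
func (c *connection) reconnect() { c.state = connecting }

// monitor checks every connection on a fixed interval and kicks off a
// reconnect for anything that stops responding.
func monitor(conns []*connection, stop <-chan struct{}) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            for _, c := range conns {
                if !c.ping() {
                    c.state = disconnected
                    fmt.Println(c.name, "is down, reconnecting")
                    go c.reconnect()
                }
            }
        case <-stop:
            return
        }
    }
}

func main() {
    stop := make(chan struct{})
    go monitor([]*connection{{name: "filesystem", state: connected}}, stop)
    time.Sleep(time.Second) // let the example run briefly
    close(stop)
}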

Performance Data

We ran this against production traffic. The numbers we saw:

Classic MCP (5 servers, 100 tools):

  • 6 LLM calls per complex task
  • 2,400 tokens in tool definitions per call
  • 14,400 total tokens just for tool catalogs
  • Most context budget wasted on redundant definitions

Code Mode (same 5 servers):

  • 3-4 LLM calls per task
  • 100-300 tokens in tool definitions per call
  • 400-1,200 total tokens for tool catalogs
  • Context available for actual work

For simple queries, Manual Mode works fine. For multi-step workflows, Code Mode pays off immediately. For real-time agents, Agent Mode + Code Mode gives you the best of both worlds.

Where This Matters

Three scenarios where the gateway model makes sense:

Production agents: Need audit trails, approval workflows, and security validation. Can't let Claude delete production databases without confirmation.

Multi-server setups: 3+ MCP servers where token costs matter. The more servers you connect, the bigger the Code Mode advantage.

Complex workflows: Tasks that require 5-10 tool calls. Running coordination in a sandbox is faster than 5-10 LLM round-trips.

For simple use cases (1-2 servers, basic queries), direct MCP connections work fine. But production deployments with multiple servers and complex workflows need infrastructure to handle security, performance, and reliability.

Implementation

We built this as open-source infrastructure (Bifrost). You can run it as a gateway, embed it via Go SDK, or deploy in Kubernetes. The architecture is simple: gateway sits between your app and LLM providers, routing chat requests and MCP tool execution through one interface.

Go architecture delivered what we needed:

  • Sub-3ms latency overhead at 5,000 RPS
  • 11µs mean overhead in sustained benchmarks
  • Minimal memory footprint compared to Python alternatives (372MB)

Connection pooling, concurrent request handling, and efficient JSON parsing keep overhead low even under heavy load.
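
Not Bifrost's actual internals, but as an illustration of the connection-pooling piece: in Go, a single tuned http.Transport shared by all request handlers is what avoids per-request connection setup to upstream providers.

package main

import (
    "net/http"
    "time"
)

// newUpstreamClient returns a client whose transport keeps a pool of warm
// connections to upstream providers instead of dialing per request.
// The pool sizes here are illustrative, not Bifrost's settings.
func newUpstreamClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        1000,             // total idle connections kept in the pool
        MaxIdleConnsPerHost: 100,              // warm connections per provider host
        IdleConnTimeout:     90 * time.Second, // recycle connections idle this long
    }
    return &http.Client{Transport: transport, Timeout: 60 * time.Second}
}

func main() {
    client := newUpstreamClient()
    _ = client // shared by every concurrent request handler
}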

What We Learned

Token bloat is real: Past 2-3 servers, tool definitions dominate your context budget. Code Mode solves this by loading definitions on-demand instead of sending everything on every turn.

Security needs infrastructure: Expecting developers to validate tool calls in application code leads to gaps. Centralizing security at the gateway level makes it impossible to bypass.

Performance scales with mode: Manual Mode for 1-2 servers. Agent Mode for mixed workloads. Code Mode for 3+ servers or complex orchestration.

Monitoring matters: Health checks, connection state tracking, and audit logs aren't optional in production. The gateway handles this as infrastructure.

The MCP ecosystem is growing fast. Anthropic, Microsoft, and others are building servers for every possible tool. But connecting AI directly to these servers creates scaling problems. The gateway pattern solves these at the infrastructure level so you can focus on building agents that work.


Implementation: https://github.com/maximhq/bifrost

Open-source MCP gateway with connection management, security validation, and Code Mode for production deployments.
