When we started building MCP support for Bifrost, I thought it would be straightforward. Connect to MCP servers, proxy tool calls, done. Turns out, making this production-ready meant solving problems that the MCP spec doesn't even address.
(GitHub: maximhq/bifrost - fastest LLM gateway (50x faster than LiteLLM) with adaptive load balancing, cluster mode, guardrails, 1000+ model support, and <100 µs overhead at 5k RPS.)
AI agents hit a hard limit when you try to connect them to real infrastructure. You want your agent to query a database, read files, or call an API? You're writing custom integration code for every single use case.
The Model Context Protocol (MCP) changes this. It's a standard way for AI models to discover and use external tools at runtime. Instead of hardcoding integrations, you spin up MCP servers that expose tools - and the AI model just figures out what's available and uses it.
But when we looked at how teams were actually deploying MCP in production, we saw three big problems:
- Security disasters waiting to happen. Most implementations just let the AI model execute any tool it wants. No oversight. No logging. One bad prompt and you're updating production databases or deleting files.
- Zero observability. When something breaks, you have no idea which tool was called, what parameters were used, or what failed. Debugging becomes archaeology.
- Operational complexity. Managing connections to dozens of MCP servers, handling failures, controlling access - none of this is solved by the protocol itself.
That's why we built MCP support directly into Bifrost. It's not just a client - it's a complete control plane for production agent infrastructure.
Four Connection Types
We support four ways to connect to MCP servers, and each one solves different problems:
InProcess Connections
These run tools directly in Bifrost's memory. If you're writing tools in Go, you register them with typed handlers:
type CalculatorArgs struct {
	Operation string  `json:"operation"`
	A         float64 `json:"a"`
	B         float64 `json:"b"`
}

func calculatorHandler(args CalculatorArgs) (string, error) {
	switch args.Operation {
	case "add":
		return fmt.Sprintf("%.2f", args.A+args.B), nil
	case "multiply":
		return fmt.Sprintf("%.2f", args.A*args.B), nil
	default:
		return "", fmt.Errorf("unsupported operation %q", args.Operation)
	}
}

// schema describes the tool's parameters so the model knows how to call it.
client.RegisterMCPTool("calculator", "Perform arithmetic", calculatorHandler, schema)
Latency is around 0.1ms. No network overhead. Compile-time type checking catches bugs before runtime. This is perfect for internal business logic that doesn't need to run in a separate process.
STDIO Connections
These launch external processes and talk via stdin/stdout. Great for local tools, scripts, or when you're using Python/Node.js MCP servers. Latency is typically 1-10ms, which is still way faster than network calls.
We use this for filesystem operations. The agent can read files, list directories, or search code - all through a Node.js MCP server running locally.
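To make that concrete, here's a rough sketch of what wiring up that filesystem server over STDIO could look like. The type and field names below are illustrative placeholders, not Bifrost's exact configuration schema - the real shape is in the MCP docs:

// Illustrative sketch - placeholder types, not Bifrost's actual config schema.
filesystemClient := MCPClientConfig{
	Name:           "filesystem",
	ConnectionType: "stdio",
	Stdio: StdioConfig{
		// The reference filesystem MCP server, scoped to one directory.
		Command: "npx",
		Args:    []string{"-y", "@modelcontextprotocol/server-filesystem", "/workspace"},
	},
}

Bifrost launches the process, does the MCP handshake over stdin/stdout, and discovers whatever tools the server advertises.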
HTTP Connections
These talk to remote MCP servers over HTTP. Latency depends on your network, usually 10-500ms. But you get scalability - run multiple instances behind a load balancer, handle failures gracefully, isolate tool execution from the gateway.
Our database tools run this way: a microservice with proper authentication, read-only access by default, and connection pooling for performance.
SSE Connections
Server-Sent Events for real-time data. Maintains a persistent connection and streams updates. Perfect for monitoring tools, live dashboards, or anything that needs event-driven updates without constant polling.
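The remote flavors are mostly a matter of pointing Bifrost at a URL. Same caveat as above - these field names are placeholders for illustration, not the exact schema:

// Illustrative sketch - placeholder types, not Bifrost's actual config schema.
databaseClient := MCPClientConfig{
	Name:           "secure-database",
	ConnectionType: "http",
	Endpoint:       "https://mcp-db.internal.example.com/mcp", // remote MCP server behind a load balancer
}

metricsClient := MCPClientConfig{
	Name:           "metrics-stream",
	ConnectionType: "sse",
	Endpoint:       "https://mcp-metrics.internal.example.com/sse", // persistent event stream for live updates
}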
The Security Model That Actually Works
Here's the thing about MCP that nobody talks about: by default, letting an AI model execute tools automatically is insane.
Bifrost's security model is simple - tool calls are suggestions, not commands. When an AI model wants to use a tool, it returns the request to your application. You decide whether to execute it. This gives you human oversight for anything dangerous.
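Concretely, this looks like ordinary OpenAI-style tool calling: the completion response carries tool_calls, and your application decides what happens next. Here's a minimal sketch, assuming the standard chat-completions response shape - approveToolCall and executeToolCall are hypothetical hooks on your side, not Bifrost APIs:

// resp is a decoded /v1/chat/completions response (OpenAI-compatible shape).
// approveToolCall and executeToolCall are placeholders for your own
// policy check and MCP execution path.
for _, call := range resp.Choices[0].Message.ToolCalls {
	if !approveToolCall(call.Function.Name, call.Function.Arguments) {
		log.Printf("blocked tool call: %s", call.Function.Name)
		continue // nothing runs without an explicit yes
	}
	result, err := executeToolCall(call)
	if err != nil {
		log.Printf("tool %s failed: %v", call.Function.Name, err)
		continue
	}
	log.Printf("tool %s returned %d bytes", call.Function.Name, len(result))
}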
But we also support "agent mode" where you can whitelist specific tools for automatic execution. The configuration looks like this:
ToolManagerConfig: &schemas.MCPToolManagerConfig{
	ToolsToAutoExecute: []string{
		"filesystem/read_file",
		"database/query_readonly",
	},
	ToolExecutionTimeout: 30 * time.Second,
}
Only the tools you explicitly approve run automatically. Everything else requires manual approval. This prevents disasters while keeping agents fast for safe operations.
Request-Level Filtering
This is where things get interesting for production deployments. You can filter which tools are available per request:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "x-bf-mcp-include-clients: secure-database,audit-logger" \
-H "x-bf-mcp-include-tools: secure-database/*,audit-logger/log_access" \
-d '{"model": "gpt-4o-mini", "messages": [...]}'
The agent only sees the tools you specify. Customer-facing chatbots don't get access to admin tools. Financial applications can restrict agents to read-only operations. You can even use wildcards - database/* includes all database tools.
This solves the privilege escalation problem that makes a lot of security teams nervous about AI agents.
Code Mode vs Agent Mode
We built two execution patterns because different use cases need different approaches:
Agent mode has the AI call one tool at a time. It sees the result, thinks, calls the next tool. Great for interactive agents where you want visibility at each step.
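The shape of that conversation is a simple loop: call the model, run whatever tool calls come back (or queue them for approval), append the results, call again. A sketch with hypothetical callChat, runToolCall, and toolResultMessage helpers:

// callChat wraps POST /v1/chat/completions; runToolCall executes an
// approved tool call and returns its output; toolResultMessage builds
// a role:"tool" message tied to the tool call ID. All three are
// hypothetical helpers for illustration.
for {
	resp, err := callChat(messages)
	if err != nil {
		return err
	}
	msg := resp.Choices[0].Message
	messages = append(messages, msg)

	// No tool calls means the model produced its final answer.
	if len(msg.ToolCalls) == 0 {
		break
	}

	// One LLM round-trip per batch of tool calls: execute them, then
	// let the model see the results before deciding what to do next.
	for _, call := range msg.ToolCalls {
		messages = append(messages, toolResultMessage(call.ID, runToolCall(call)))
	}
}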
Code mode lets the AI write TypeScript that orchestrates multiple tools:
const files = await listFiles("/project");
const results = await Promise.all(
  files.map(file => analyzeFile(file))
);
return summarize(results);
The code runs atomically. Way faster for batch operations. Lower latency because you're not doing multiple LLM round-trips. But you lose per-tool approval - the code either runs or it doesn't.
We use code mode for data analysis workflows where the agent needs to read dozens of files, run queries, and generate reports. Agent mode for customer support where we want to review database queries before they execute.
What Surprised Us
Tool discovery is faster than we expected. Getting the list of 50+ tools from an MCP server takes under 100ms. The AI model sees available tools instantly.
STDIO latency is surprisingly low. We thought process communication would be slow, but 1-2ms overhead is barely noticeable. HTTP adds way more latency from network round-trips.
Type safety prevents so many bugs. InProcess tools with Go structs caught issues at compile time that would have been runtime failures with JSON validation.
Request filtering is more powerful than global config. Being able to change which tools are available per request gives you way more control than static configuration.
Performance In Production
We're running this ourselves for internal tooling. Some numbers from real usage:
- STDIO filesystem tools: 1.2ms average latency
- HTTP database tools: 15ms average (local network)
- InProcess calculation tools: 0.08ms average
- Memory per STDIO connection: ~2MB
- Tool discovery for 50 tools: 85ms
These are with Go's runtime. Python and Node.js MCP servers will be slower, but the architecture scales well.
Try It Yourself
MCP support is live in Bifrost. The setup is straightforward - configure your MCP servers, set which tools can auto-execute, and start making requests.
We documented all four connection types, agent vs code mode, and request-level filtering in the MCP docs. There are examples for common patterns - filesystem, database, web APIs.
If you're building agents that need to interact with real infrastructure, this is way cleaner than writing custom integration code for every tool. And you get the observability and security controls you need for production.
The code is open source - check out the implementation if you're curious how the internals work. We're shipping updates based on real-world usage, so feedback is helpful.

