MCP in Production: Routing LLM Tool Calls Through an API Gateway

Rodrigo Serra Coelho

Your LLM can now call tools. But who controls which tools it can call, how often, and with what credentials?

Model Context Protocol (MCP) gives LLMs a standard way to discover and invoke external tools — databases, APIs, file systems, anything you expose as a tool server. The protocol is clean and simple: JSON-RPC 2.0 over HTTP.

But "clean and simple" doesn't mean "production-ready." The moment you have multiple MCP servers, multiple LLM clients, and real users, you need the same infrastructure you'd need for any API: authentication, authorization, rate limiting, load balancing, failover, and observability.

That's what an API gateway does. So we built one for MCP.

The Problem

Here's a typical MCP setup:

LLM Client → MCP Server A (database queries)
LLM Client → MCP Server B (email sending)
LLM Client → MCP Server C (code execution)

This works fine on a developer's laptop. In production, you need answers to:

  • Who is allowed to call which tools? An intern's AI assistant shouldn't have access to the production database tool.
  • How do you enforce rate limits? An LLM in a retry loop can hammer a tool server with thousands of requests per minute.
  • What happens when a tool server goes down? Your LLM client gets a connection error and the user sees a failure.
  • How do you add a new tool server? Reconfigure every client? Redeploy?
  • Where are the logs? When the CEO asks why the AI sent 400 emails, you need an answer.

These aren't hypothetical problems. They're the same problems every API faces at scale, and they have the same solution: put a gateway in front.

Architecture

CAPI's MCP Gateway runs as a dedicated Undertow server (default port 8383) alongside CAPI's existing REST, WebSocket, and gRPC gateways. All MCP traffic flows through a single endpoint:

POST /mcp

Every request is a JSON-RPC 2.0 message. The gateway understands four methods:

| Method | Purpose |
| ------ | ------- |
| initialize | Create a session, get capabilities |
| tools/list | Discover all available tools |
| tools/call | Invoke a specific tool |
| ping | Health check |

The gateway isn't a proxy that blindly forwards bytes. It understands the MCP protocol: it parses JSON-RPC, validates sessions, resolves tool names to backend services, enforces policies, and routes the call to the right server.
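To make "parses JSON-RPC and routes by method" concrete, here is a minimal dispatcher for the four methods. This is an illustrative sketch, not CAPI's code; the handler names and stub bodies are hypothetical:

```python
import json

# Hypothetical stub handlers for the four MCP methods (illustration only).
HANDLERS = {
    "initialize": lambda params: {"capabilities": {"tools": {"listChanged": True}}},
    "tools/list": lambda params: {"tools": []},
    "tools/call": lambda params: {"content": [{"type": "text", "text": "stub"}]},
    "ping": lambda params: {},
}

def dispatch(raw: str) -> dict:
    """Parse a JSON-RPC 2.0 request and route it to the matching handler."""
    req = json.loads(raw)
    if req.get("jsonrpc") != "2.0" or "method" not in req:
        # -32600 is the JSON-RPC 2.0 "Invalid Request" code
        return {"jsonrpc": "2.0",
                "error": {"code": -32600, "message": "Invalid Request"},
                "id": req.get("id")}
    handler = HANDLERS.get(req["method"])
    if handler is None:
        # -32601 is the JSON-RPC 2.0 "Method not found" code
        return {"jsonrpc": "2.0",
                "error": {"code": -32601, "message": "Method not found"},
                "id": req.get("id")}
    return {"jsonrpc": "2.0", "result": handler(req.get("params", {})),
            "id": req.get("id")}
```

A real gateway does much more at each step (session validation, policy checks, backend routing), but the envelope handling follows this shape.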

                         ┌─────────────────────────────────┐
                         │         CAPI MCP Gateway        │
                         │           POST /mcp             │
  LLM Clients ─────────► │                                 │
                         │  ┌──────────┐  ┌──────────────┐ │
                         │  │ Sessions │  │ Tool Registry│ │
                         │  └──────────┘  └──────────────┘ │
                         │  ┌──────────┐  ┌──────────────┐ │
                         │  │  OAuth2  │  │  OPA Policy  │ │
                         │  └──────────┘  └──────────────┘ │
                         │  ┌──────────┐  ┌──────────────┐ │
                         │  │Throttling│  │Load Balancer │ │
                         │  └──────────┘  └──────────────┘ │
                         └──────┬──────────┬──────────┬────┘
                                │          │          │
                                ▼          ▼          ▼
                           MCP Server  MCP Server  REST API
                           (native)    (native)    (wrapped)

Zero-Config Tool Discovery

Here's where it gets interesting. CAPI uses HashiCorp Consul for service discovery. You register your MCP tools as Consul service metadata — no gateway reconfiguration needed.

A service with MCP tools registers like this:

{
  "Name": "order-service",
  "Tags": ["capi"],
  "Meta": {
    "mcp-enabled": "true",
    "mcp-toolPrefix": "orders",
    "mcp-tools": "search,create,cancel",
    "mcp-tools-search-description": "Search orders by customer ID, date range, or status",
    "mcp-tools-search-inputSchema": "{\"type\":\"object\",\"properties\":{\"customerId\":{\"type\":\"string\"},\"status\":{\"type\":\"string\",\"enum\":[\"pending\",\"shipped\",\"delivered\"]}},\"required\":[\"customerId\"]}",
    "mcp-tools-create-description": "Create a new order",
    "mcp-tools-create-inputSchema": "{\"type\":\"object\",\"properties\":{\"customerId\":{\"type\":\"string\"},\"items\":{\"type\":\"array\"}},\"required\":[\"customerId\",\"items\"]}",
    "mcp-tools-cancel-description": "Cancel an existing order by ID",
    "mcp-tools-cancel-inputSchema": "{\"type\":\"object\",\"properties\":{\"orderId\":{\"type\":\"string\"}},\"required\":[\"orderId\"]}"
  }
}

The gateway's McpToolRegistry polls the Consul service cache and builds a unified tool catalog. When an LLM client calls tools/list, it gets every tool from every registered service:

{
  "jsonrpc": "2.0",
  "result": {
    "tools": [
      {
        "name": "orders_search",
        "description": "Search orders by customer ID, date range, or status",
        "inputSchema": { ... }
      },
      {
        "name": "orders_create",
        "description": "Create a new order",
        "inputSchema": { ... }
      },
      {
        "name": "email_send",
        "description": "Send an email to a recipient",
        "inputSchema": { ... }
      }
    ]
  },
  "id": 1
}

Deploy a new service with mcp-enabled: true in Consul, and its tools appear in the catalog automatically. Remove the service, and they disappear. No gateway restart required.
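The metadata convention above can be turned into catalog entries with a few lines of parsing. A sketch of that translation, assuming the key names shown in the registration example (this is not CAPI's actual McpToolRegistry):

```python
import json

def tools_from_meta(meta: dict) -> list:
    """Build MCP tool entries from Consul service metadata.

    The client-facing tool name is "<prefix>_<tool>", e.g. "orders_search",
    matching the convention in the registration example above.
    """
    if meta.get("mcp-enabled") != "true":
        return []
    prefix = meta.get("mcp-toolPrefix", "")
    tools = []
    for name in meta.get("mcp-tools", "").split(","):
        name = name.strip()
        if not name:
            continue
        tools.append({
            "name": f"{prefix}_{name}" if prefix else name,
            "description": meta.get(f"mcp-tools-{name}-description", ""),
            "inputSchema": json.loads(
                meta.get(f"mcp-tools-{name}-inputSchema", "{}")),
        })
    return tools
```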

Two Flavors of Backend

The gateway supports two types of tool backends:

REST APIs (default): Your existing REST services. The gateway translates tools/call into an HTTP request to the service. You expose tools via Consul metadata without touching your service code.

Native MCP servers: Backends that speak the MCP protocol natively (mcp-type: server). The gateway initializes a session with the backend, discovers its tools via tools/list, and forwards tools/call requests directly. This is useful when you already have MCP servers and want to aggregate them behind a single gateway.

# REST backend — gateway translates tool calls to HTTP
mcp-type: rest        # (default)

# Native MCP backend — gateway proxies JSON-RPC
mcp-type: server

For native MCP backends, the gateway handles session lifecycle with each backend independently — your LLM client maintains a single session with the gateway, while the gateway maintains separate sessions with each backend server.

Authentication: OAuth2 at the Gate

When OAuth2 is enabled, the initialize call requires a valid Bearer token:

POST /mcp
Authorization: Bearer eyJhbGciOiJSUzI1NiIs...

{
  "jsonrpc": "2.0",
  "method": "initialize",
  "id": 1
}

The gateway validates the token against your OIDC provider (Keycloak, Auth0, Okta — anything with a JWKS endpoint). If valid, it creates a session bound to the client's identity:

{
  "jsonrpc": "2.0",
  "result": {
    "protocolVersion": "2025-03-26",
    "capabilities": { "tools": { "listChanged": true } },
    "serverInfo": { "name": "CAPI MCP Gateway", "version": "4.3.0" }
  },
  "id": 1
}

The response includes an Mcp-Session-Id header. All subsequent requests must include this header. Sessions expire after a configurable TTL (default: 30 minutes), with a sliding window — active sessions stay alive.
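The sliding-window behavior is worth pinning down: every validated request pushes the expiry forward by a full TTL. A minimal sketch of that semantics (not CAPI's actual session store), with an injectable clock so it can be reasoned about deterministically:

```python
import time
import uuid

class SlidingTtlSessionStore:
    """In-memory session store with sliding TTL expiration (illustration only)."""

    def __init__(self, ttl_seconds: float = 1800, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._sessions = {}  # session id -> absolute expiry time

    def create(self) -> str:
        sid = str(uuid.uuid4())
        self._sessions[sid] = self.clock() + self.ttl
        return sid

    def touch(self, sid: str) -> bool:
        """Validate a session; slide the expiry window if still alive."""
        expiry = self._sessions.get(sid)
        if expiry is None or expiry < self.clock():
            self._sessions.pop(sid, None)  # expired: clean up
            return False
        self._sessions[sid] = self.clock() + self.ttl  # slide the window
        return True
```

With a 30-minute TTL, a session used every few minutes never expires; one left idle past the TTL is rejected on the next request.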

Authorization: OPA for Fine-Grained Policy

Authentication tells you who the caller is. Authorization tells you what they're allowed to do. CAPI delegates authorization to Open Policy Agent (OPA).

Each service can define an OPA policy via Consul metadata. When a tools/call request arrives, the gateway sends the caller's token to OPA along with the service's policy. OPA returns allow or deny.

This means you can write policies like:

# Only users with "admin" role can call order cancellation tools
default allow = false

allow {
    input.token.realm_access.roles[_] == "admin"
    input.service.category == "orders"
}

# Data team can query but not mutate
allow {
    input.token.realm_access.roles[_] == "data-analyst"
    input.service.category == "orders"
    not endswith(input.tool.name, "_create")
    not endswith(input.tool.name, "_cancel")
}
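For readers unfamiliar with Rego, here is a rough Python paraphrase of what those two rules check. This is purely for illustration; the gateway evaluates the real policy in OPA, not in Python:

```python
def allow(token: dict, service: dict, tool: dict) -> bool:
    """Paraphrase of the Rego policy above: admins get full access to the
    orders category; data analysts get everything except mutating tools."""
    roles = token.get("realm_access", {}).get("roles", [])
    if service.get("category") != "orders":
        return False
    if "admin" in roles:
        return True
    if "data-analyst" in roles:
        name = tool.get("name", "")
        return not name.endswith("_create") and not name.endswith("_cancel")
    return False  # default deny, like `default allow = false`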

If the policy denies the request, the LLM gets a JSON-RPC error:

{
  "jsonrpc": "2.0",
  "error": {
    "code": -32000,
    "message": "Access denied by policy"
  },
  "id": 3
}

This is critical. Without gateway-level authorization, every tool server needs its own auth logic, and you're one misconfiguration away from an LLM accessing something it shouldn't.

Load Balancing and Failover

Tool backends can run on multiple instances. The gateway's McpBackendLoadBalancer distributes calls with round-robin rotation and a circuit breaker:

  1. Get all healthy instances of the target service from Consul
  2. Rotate through them round-robin
  3. If an instance fails, mark it as circuit-broken for 30 seconds (configurable)
  4. Circuit-broken instances are deprioritized but not removed — they're tried last as a fallback
  5. For synchronous tool calls, the gateway tries each instance in order until one succeeds

tools/call "orders_search"
  ├─ Try instance-1 (healthy) → 503 → mark circuit-broken
  ├─ Try instance-2 (healthy) → 200 ✓ return result
  └─ instance-3 (circuit-broken, not tried)

Your LLM client sees a single reliable endpoint. Backend failures are invisible.
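The five steps above can be sketched in a few dozen lines. This is an illustrative model of the behavior described, not CAPI's McpBackendLoadBalancer; names and structure are assumptions:

```python
import time

class RoundRobinWithCircuitBreaker:
    """Round-robin over instances; tripped instances are deprioritized
    for a cooldown period but kept as a last-resort fallback."""

    def __init__(self, instances, cooldown=30.0, clock=time.monotonic):
        self.instances = list(instances)
        self.cooldown = cooldown
        self.clock = clock
        self.broken = {}  # instance -> time the breaker tripped
        self._next = 0

    def ordered(self):
        """Healthy instances first (rotating), circuit-broken ones last."""
        rotated = self.instances[self._next:] + self.instances[:self._next]
        self._next = (self._next + 1) % len(self.instances)
        now = self.clock()
        healthy = [i for i in rotated
                   if now - self.broken.get(i, -1e9) >= self.cooldown]
        tripped = [i for i in rotated if i not in healthy]
        return healthy + tripped

    def call(self, send):
        """Try each instance in order until one succeeds."""
        last_err = None
        for inst in self.ordered():
            try:
                return send(inst)
            except Exception as e:
                self.broken[inst] = self.clock()  # trip the breaker
                last_err = e
        raise last_err
```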

Streaming with SSE

Some tools produce output incrementally — think code generation, log tailing, or long-running queries. The gateway supports streaming via Server-Sent Events.

Mark a tool as streaming in Consul metadata:

{
  "mcp-streaming": "generate,tail_logs"
}

When the client sends Accept: text/event-stream, the gateway streams the backend response line by line:

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"jsonrpc":"2.0","result":{"content":[{"type":"text","text":"Processing..."}]},"id":3}

data: {"jsonrpc":"2.0","result":{"content":[{"type":"text","text":"Found 42 results"}]},"id":3}

data: {"jsonrpc":"2.0","result":{"content":[{"type":"text","text":"Done."}]},"id":3}
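On the client side, consuming a stream like this means pulling the JSON-RPC payload out of each `data:` line. A minimal sketch (real SSE also allows `event:`, `id:`, `retry:` fields and multi-line data, which this ignores):

```python
import json

def parse_sse(stream_lines):
    """Yield the JSON-RPC message carried by each `data:` line of an
    SSE stream; blank keep-alive lines are skipped."""
    for line in stream_lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])
```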

Distributed Sessions with Hazelcast

In a single-instance deployment, sessions live in an in-memory cache (cache2k). But if you're running multiple gateway instances — say, behind a Kubernetes service — sessions need to be shared.

When Hazelcast is enabled, HazelcastMcpSessionStore distributes sessions across all gateway instances with automatic TTL-based expiration. A client can initialize on one gateway pod and tools/call on another.

# Single instance — sessions in memory
capi.throttle.enabled: false  → LocalMcpSessionStore (cache2k)

# Multi-instance — sessions distributed
capi.throttle.enabled: true   → HazelcastMcpSessionStore (IMap)

Configuration

The MCP Gateway is opt-in. Enable it with a few properties:

capi:
  mcp:
    enabled: true
    port: 8383
    sessionTtl: 1800000              # 30 min
    toolCallTimeout: 30000           # 30 sec per tool call
    circuitBreakerCooldownMs: 30000  # 30 sec circuit breaker

That's it. Tool discovery, authentication, authorization, load balancing, and session management are all inherited from CAPI's existing infrastructure.

What It Looks Like End to End

Here's a complete flow — an LLM assistant looking up a customer's orders:

# 1. Initialize session (with OAuth2 token)
curl -X POST https://gateway.example.com:8383/mcp \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"initialize","id":1}'

# Response includes Mcp-Session-Id header
# Mcp-Session-Id: a1b2c3d4-e5f6-7890-abcd-ef1234567890

# 2. Discover available tools
curl -X POST https://gateway.example.com:8383/mcp \
  -H "Mcp-Session-Id: a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"tools/list","id":2}'

# 3. Call a tool
curl -X POST https://gateway.example.com:8383/mcp \
  -H "Mcp-Session-Id: a1b2c3d4-e5f6-7890-abcd-ef1234567890" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "orders_search",
      "arguments": { "customerId": "C-1234", "status": "pending" }
    },
    "id": 3
  }'
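The same flow in Python, as a sketch of the client-side bookkeeping: the token goes on every request, and the Mcp-Session-Id returned by initialize must be echoed on everything after it. The transport is injected here to keep the sketch self-contained; in practice it would be an HTTP POST to the gateway's /mcp endpoint:

```python
class McpClient:
    """Minimal MCP gateway client mirroring the curl flow above.

    `transport` is a function (headers, request_dict) ->
    (response_headers, response_dict); hypothetical interface, not a
    real library API.
    """

    def __init__(self, transport, token=None):
        self.transport = transport
        self.token = token
        self.session_id = None
        self._id = 0

    def _rpc(self, method, params=None):
        self._id += 1
        headers = {"Content-Type": "application/json"}
        if self.token:
            headers["Authorization"] = f"Bearer {self.token}"
        if self.session_id:
            headers["Mcp-Session-Id"] = self.session_id
        req = {"jsonrpc": "2.0", "method": method, "id": self._id}
        if params is not None:
            req["params"] = params
        resp_headers, resp = self.transport(headers, req)
        # Capture the session id handed back by initialize
        self.session_id = resp_headers.get("Mcp-Session-Id", self.session_id)
        return resp

    def initialize(self):
        return self._rpc("initialize")

    def list_tools(self):
        return self._rpc("tools/list")

    def call_tool(self, name, arguments):
        return self._rpc("tools/call", {"name": name, "arguments": arguments})
```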

Why Not Just Use a Regular API Gateway?

You could put Kong or Envoy in front of your MCP servers. But a generic HTTP proxy doesn't understand MCP:

  • Tool-level routing — A regular gateway routes by URL path. MCP routes by tool name inside the JSON-RPC body. You'd need custom Lua/Wasm plugins to parse every request.
  • Tool-level authorization — OPA policies that understand which tool is being called, not just which endpoint.
  • Unified tool catalog — tools/list aggregates tools from all backends. A reverse proxy can't do this.
  • Session management — MCP sessions with TTL, sliding expiration, and distributed storage. Not just HTTP cookies.
  • Protocol translation — Exposing REST APIs as MCP tools without modifying the backend service.

A generic gateway can proxy MCP traffic. An MCP gateway understands it.

Try It

CAPI is open source: github.com/surisoft-io/capi-core

The MCP Gateway ships alongside CAPI's REST, WebSocket, gRPC, and Admin gateways in a single 39MB jar. If you're already running Consul, you can have MCP tool routing in production in under an hour.

# docker-compose.yml — minimal setup
services:
  consul:
    image: hashicorp/consul:latest
    ports: ["8500:8500"]

  capi:
    image: surisoft/capi:latest
    environment:
      CAPI_MCP_ENABLED: "true"
      CAPI_CONSUL_HOST: consul
    ports:
      - "8380:8380"   # REST gateway
      - "8383:8383"   # MCP gateway
      - "8381:8381"   # Admin API
    depends_on: [consul]

CAPI is built by Rodrigo and runs in production at government scale on EKS and VM clusters. If you're building LLM infrastructure and want to talk about MCP in production, open an issue or reach out.
