The Problem
If you're building AI agents with MCP (Model Context Protocol), you've hit this wall: MCP servers have zero resilience built in.
A slow GitHub API? Your agent waits 600 seconds. A flaky database connection? The entire chain crashes. A dead server? It keeps getting hammered request after request.
MCP is great for connecting agents to tools. But it assumes every server is always fast, always available, and always correct. In production, that's never true.
The Solution
I built mcp-shield — a transparent stdio proxy that wraps any MCP server with production-grade middleware.
Agent ←→ mcp-shield ←→ MCP Server
One command, zero code changes:
npx @daino/mcp-shield wrap --timeout 30s --retries 3 -- npx @modelcontextprotocol/server-github
How It Works
MCP communicates via JSON-RPC 2.0 over stdio with Content-Length framing (like LSP). mcp-shield sits in the middle, intercepting tools/call messages and applying a middleware chain:
Incoming tools/call request
  → Logger (start)
  → Timeout (AbortController)
  → Retry (exponential backoff + jitter)
  → Circuit Breaker (fail fast if server is down)
  → Forward to real MCP server
Everything else — initialize, tools/list, notifications — passes through untouched.
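As a concrete sketch of the framing described above, here is how a JSON-RPC message gets a Content-Length header before being written to stdout. The helper name `frame` is illustrative, not mcp-shield's actual internals; it assumes Node's `Buffer`:

```typescript
// Frame a JSON-RPC message with a Content-Length header (LSP-style).
// The length must be the UTF-8 byte count of the body, not the
// character count -- they differ for non-ASCII payloads.
function frame(message: object): string {
  const body = JSON.stringify(message);
  const length = Buffer.byteLength(body, "utf8");
  return `Content-Length: ${length}\r\n\r\n${body}`;
}
```

A proxy like this reads frames off one stream, inspects the `method` field, and writes frames to the other side.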
The Middleware
Timeout — Wraps each tool call with an AbortController. If the server doesn't respond within the configured time, the agent gets a clear error instead of hanging forever.
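A minimal sketch of that idea, assuming the downstream call accepts an `AbortSignal` (`withTimeout` is an illustrative name, not mcp-shield's actual API):

```typescript
// Wrap a tool call with an AbortController: abort after `ms` and
// always clear the timer, whether the call succeeds or fails.
async function withTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await call(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

The agent then gets a rejected promise it can surface as a clear error, rather than a request that hangs forever.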
Retry — Exponential backoff with jitter. Smart enough to skip deterministic errors (invalid params, method not found) — those will never succeed on retry.
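The two pieces of that policy can be sketched like this. The base delay, cap, and the "full jitter" variant are illustrative choices; the JSON-RPC error codes for invalid params (-32602) and method not found (-32601) come from the spec:

```typescript
// Exponential backoff with full jitter: the delay ceiling doubles each
// attempt (capped), and a random fraction is taken so concurrent
// retries don't synchronize.
function backoffDelay(attempt: number, baseMs = 250, capMs = 10_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Deterministic JSON-RPC errors will never succeed on retry.
const NON_RETRYABLE = new Set([-32602, -32601]); // invalid params, method not found
function isRetryable(code: number): boolean {
  return !NON_RETRYABLE.has(code);
}
```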
Circuit Breaker — Classic state machine: closed → open → half-open. After N consecutive failures, stop calling the server entirely. Try again after a cooldown period. This prevents hammering a dead server and wasting tokens.
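A minimal version of that state machine, with illustrative thresholds (the class shape is a sketch, not mcp-shield's actual API):

```typescript
type State = "closed" | "open" | "half-open";

// Per-server breaker: open after `threshold` consecutive failures,
// allow a probe request after `cooldownMs`, close again on success.
class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  canCall(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let one probe through
    }
    return this.state !== "open";
  }
  onSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }
  onFailure(now = Date.now()): void {
    this.failures++;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}
```

Keyed per server (see "What I Learned" below), this means a dead server stops costing the agent time and tokens after a handful of failures.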
Rate Limiting — Sliding window per tool. Prevents runaway agent loops where the AI keeps calling the same failing tool.
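A sliding window can be as simple as keeping recent call timestamps per tool and dropping the ones that have aged out (illustrative sketch, not the actual implementation):

```typescript
// Per-tool sliding window: allow at most `limit` calls in any
// `windowMs` span; older timestamps are discarded on each check.
class SlidingWindowLimiter {
  private calls = new Map<string, number[]>();
  constructor(private limit: number, private windowMs: number) {}

  allow(tool: string, now = Date.now()): boolean {
    const recent = (this.calls.get(tool) ?? []).filter(t => now - t < this.windowMs);
    if (recent.length >= this.limit) {
      this.calls.set(tool, recent);
      return false; // window full: reject the call
    }
    recent.push(now);
    this.calls.set(tool, recent);
    return true;
  }
}
```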
Tool Filtering — Allow/deny lists so you can restrict which tools the agent can actually use. The proxy filters both tools/list responses and tools/call requests.
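The policy check itself is small. This sketch assumes a shape where a deny entry wins over an allow entry and an absent allow list means "allow everything" (mcp-shield's actual semantics may differ):

```typescript
interface ToolPolicy {
  allow?: string[];
  deny?: string[];
}

// Deny takes precedence; a non-empty allow list is a whitelist.
function isToolAllowed(name: string, policy: ToolPolicy): boolean {
  if (policy.deny?.includes(name)) return false;
  if (policy.allow && policy.allow.length > 0) return policy.allow.includes(name);
  return true;
}
```

The same predicate is applied twice: to prune the array inside a tools/list response, and to reject a tools/call request before it reaches the server.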
Response Validation — Checks that MCP responses conform to the expected schema. Two modes: "warn" (log and pass through) or "enforce" (reject invalid responses).
Metrics — Prometheus-compatible /metrics endpoint: counters for calls, errors, and retries, plus latency histograms, labeled per server and per tool.
Per-Tool Configuration
Not all tools are equal. A file read should time out in 10 seconds, but a repository search might need 60. mcp-shield supports per-tool config via YAML:
defaults:
  timeout: 30s
  retries:
    max: 3
    backoff: exponential
    jitter: true

servers:
  github:
    command: "npx @modelcontextprotocol/server-github"
    tools:
      get_file_contents:
        timeout: 60s
      search_repositories:
        retries:
          max: 5
Claude Desktop Integration
Drop it into your claude_desktop_config.json:
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": [
        "@daino/mcp-shield", "wrap",
        "--timeout", "30s",
        "--retries", "3",
        "--",
        "npx", "@modelcontextprotocol/server-github"
      ]
    }
  }
}
Works with Claude Desktop, Cursor, or any MCP client.
What I Learned
stdout is sacred. MCP uses stdout for protocol messages. Every console.log breaks the protocol. All logging goes to stderr via pino.
Circuit breaker should be per-server, not per-tool. If one tool on a server keeps failing, the whole server is probably down. No point trying other tools.
Don't retry deterministic errors. Invalid params or method-not-found will never succeed on retry. Only retry transient failures.
Test with real stdio framing. Content-Length headers matter. Unicode strings have different byte length than character length. A mock server that speaks the real protocol catches bugs that unit tests miss.
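The byte-vs-character mismatch mentioned above is easy to see in Node:

```typescript
// .length counts UTF-16 code units; Content-Length needs UTF-8 bytes.
const payload = JSON.stringify({ result: "héllo 👋" });
const chars = payload.length;
const bytes = Buffer.byteLength(payload, "utf8");
// "é" is 1 code unit but 2 bytes; the emoji is 2 code units but 4 bytes.
// Using `chars` as Content-Length would truncate this frame.
```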
Try It
npx @daino/mcp-shield wrap -- npx @modelcontextprotocol/server-github
- GitHub: https://github.com/DainoJung/mcp-shield
- npm: @daino/mcp-shield (MIT license, TypeScript, 90 tests)
If you're running MCP in production, I'd love to hear what resilience features you need. Open an issue or PR!