DEV Community

Young Gao

Building Production MCP Servers: Architecture Patterns That Scale in 2026


The Model Context Protocol (MCP) is rapidly becoming the standard way AI agents interact with external tools and data sources. But most MCP server examples are toy implementations — they work in demos but fall apart under real traffic.

This guide covers the architecture patterns you need to build MCP servers that survive production workloads.

What MCP Actually Is (30-Second Version)

AI Agent (Claude, GPT, etc.)
    ↓ MCP Protocol (JSON-RPC over stdio/SSE/HTTP)
MCP Server
    ↓ Your business logic
External Systems (DBs, APIs, file systems)

MCP standardizes how AI agents discover and invoke tools. Instead of each agent having custom integrations, they speak one protocol. Your server exposes tools (functions the agent can call), resources (data the agent can read), and prompts (templates the agent can use).
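Under the hood, every tool invocation is a plain JSON-RPC 2.0 exchange. A sketch of what a `tools/call` request and response look like on the wire (the `search_docs` tool name and arguments are hypothetical; the field names follow the MCP spec):

```typescript
// A tools/call request the agent sends to the server (JSON-RPC 2.0).
// The tool name and arguments are illustrative.
const request = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "search_docs",
    arguments: { query: "connection pooling" },
  },
};

// The server replies with a content array; isError flags tool failures.
const response = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    content: [{ type: "text", text: "...search results..." }],
    isError: false,
  },
};
```

This is why the tool handlers throughout this post return `{ content: [...] }` objects: that shape is what ends up in `result`.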

The Production Architecture

// server.ts — Production MCP server skeleton
import { McpServer, ResourceTemplate } from "@modelcontextprotocol/sdk/server/mcp.js";
import { SSEServerTransport } from "@modelcontextprotocol/sdk/server/sse.js";
import express from "express";

const app = express();
const server = new McpServer({
  name: "production-tools",
  version: "1.0.0",
});

// Health check — load balancers need this
app.get("/health", (_, res) => {
  res.json({ status: "ok", uptime: process.uptime() });
});

// Track transports by session so POSTed messages reach the right client
const transports = new Map<string, SSEServerTransport>();

// SSE transport for web clients
app.get("/sse", async (req, res) => {
  const transport = new SSEServerTransport("/messages", res);
  transports.set(transport.sessionId, transport);
  res.on("close", () => transports.delete(transport.sessionId));
  await server.connect(transport);
});

app.post("/messages", async (req, res) => {
  // Route the incoming message to the transport for this session
  const transport = transports.get(req.query.sessionId as string);
  if (!transport) {
    res.status(404).send("Unknown session");
    return;
  }
  await transport.handlePostMessage(req, res);
});

app.listen(3000, () => console.log("MCP server on :3000"));

Pattern 1: Connection Pool Management

MCP servers often connect to databases or external APIs. Without connection pooling, each tool invocation creates a new connection — a guaranteed way to exhaust resources.

import { Pool } from "pg";
import { createClient } from "redis";
import { z } from "zod";

// Singleton pools initialized once
let pgPool: Pool;
let redisClient: ReturnType<typeof createClient>;

async function initPools() {
  pgPool = new Pool({
    connectionString: process.env.DATABASE_URL,
    max: 20,                    // Max connections
    idleTimeoutMillis: 30000,   // Close idle connections after 30s
    connectionTimeoutMillis: 5000,
  });

  redisClient = createClient({ url: process.env.REDIS_URL });
  await redisClient.connect();

  // Verify connections on startup
  await pgPool.query("SELECT 1");
  await redisClient.ping();
}

// Tool that uses pooled connections
server.tool(
  "query_users",
  "Search users by criteria",
  { query: z.string(), limit: z.number().max(100).default(10) },
  async ({ query, limit }) => {
    const result = await pgPool.query(
      "SELECT id, name, email FROM users WHERE name ILIKE $1 LIMIT $2",
      [`%${query}%`, limit]
    );
    return {
      content: [{ type: "text", text: JSON.stringify(result.rows, null, 2) }],
    };
  }
);
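The flip side of pooling is closing pools cleanly on shutdown, so in-flight queries finish before the process exits. A minimal sketch of a shutdown helper; the `Closer` callbacks are illustrative, and with the pools above you would register `() => pgPool.end()` and `() => redisClient.quit()`:

```typescript
type Closer = () => Promise<unknown>;

// Run registered cleanup callbacks in order, each bounded by a timeout,
// so one hung connection can't block shutdown forever.
async function shutdown(closers: Closer[], timeoutMs = 5000): Promise<string[]> {
  const results: string[] = [];
  for (const close of closers) {
    const timeout = new Promise<string>((resolve) =>
      setTimeout(() => resolve("timeout"), timeoutMs)
    );
    results.push(await Promise.race([close().then(() => "closed"), timeout]));
  }
  return results;
}

// Typically wired to the signal your orchestrator sends:
// process.on("SIGTERM", () => shutdown([() => pgPool.end(), () => redisClient.quit()]));
```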

Pattern 2: Authentication and Authorization

Never deploy an MCP server without auth. AI agents will send whatever the user tells them to — including attempts to access other users' data.

import jwt from "jsonwebtoken";

interface AuthContext {
  userId: string;
  orgId: string;
  scopes: string[];
}

// Middleware that extracts auth from the transport
function extractAuth(req: express.Request): AuthContext {
  const token = req.headers.authorization?.replace("Bearer ", "");
  if (!token) throw new Error("Missing authorization header");

  const decoded = jwt.verify(token, process.env.JWT_SECRET!) as AuthContext;
  return decoded;
}

// Scope-checked tool registration
function securedTool(
  name: string,
  description: string,
  requiredScope: string,
  schema: z.ZodObject<any>,
  handler: (args: any, auth: AuthContext) => Promise<any>
) {
  server.tool(name, description, schema, async (args, extra) => {
    const auth = extra.authContext as AuthContext;
    if (!auth.scopes.includes(requiredScope)) {
      return {
        content: [{
          type: "text",
          text: `Permission denied: requires scope '${requiredScope}'`,
        }],
        isError: true,
      };
    }
    return handler(args, auth);
  });
}

// Usage: only users with "billing:read" can query invoices
securedTool(
  "get_invoices",
  "Retrieve invoices for the authenticated organization",
  "billing:read",
  { status: z.enum(["paid", "pending", "overdue"]).optional() },
  async ({ status }, auth) => {
    const invoices = await pgPool.query(
      "SELECT * FROM invoices WHERE org_id = $1 AND ($2::text IS NULL OR status = $2)",
      [auth.orgId, status ?? null]
    );
    return {
      content: [{ type: "text", text: JSON.stringify(invoices.rows) }],
    };
  }
);
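The scope check above is a flat membership test. If you want hierarchical scopes (say, `billing:*` granting every `billing:` permission), keep that logic in a small pure function so it can be tested in isolation. Note the wildcard convention here is an assumption for illustration, not part of MCP or JWT:

```typescript
// Returns true if any granted scope satisfies the required one.
// "billing:*" is treated as a wildcard over the "billing" namespace
// (an illustrative convention, not a standard).
function hasScope(granted: string[], required: string): boolean {
  return granted.some((scope) => {
    if (scope === required) return true;
    if (scope.endsWith(":*")) {
      const namespace = scope.slice(0, -2);
      return required.startsWith(namespace + ":");
    }
    return false;
  });
}
```

Inside `securedTool`, the membership check would then become `if (!hasScope(auth.scopes, requiredScope))`.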

Pattern 3: Rate Limiting Per Tool

Different tools have different costs. A database query is cheap; calling an external API costs money.

import { RateLimiterMemory } from "rate-limiter-flexible";

// Tiered rate limiters
const limiters = {
  fast: new RateLimiterMemory({ points: 100, duration: 60 }),   // 100/min
  standard: new RateLimiterMemory({ points: 20, duration: 60 }),  // 20/min
  expensive: new RateLimiterMemory({ points: 5, duration: 60 }),  // 5/min
};

type RateTier = keyof typeof limiters;

function rateLimitedTool(
  name: string,
  description: string,
  tier: RateTier,
  schema: z.ZodObject<any>,
  handler: (args: any) => Promise<any>
) {
  server.tool(name, description, schema, async (args, extra) => {
    const key = extra.authContext?.userId ?? "anonymous";
    try {
      await limiters[tier].consume(key);
    } catch {
      return {
        content: [{
          type: "text",
          text: `Rate limit exceeded. This tool allows ${limiters[tier].points} calls per minute.`,
        }],
        isError: true,
      };
    }
    return handler(args);
  });
}

// Cheap tool: 100/min
rateLimitedTool("search_docs", "Search documentation", "fast",
  { query: z.string() },
  async ({ query }) => { /* ... */ }
);

// Expensive tool: 5/min (calls external API)
rateLimitedTool("generate_report", "Generate analytics report", "expensive",
  { dateRange: z.string() },
  async ({ dateRange }) => { /* ... */ }
);
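One caveat: RateLimiterMemory counts per process, so with the three replicas shown in the deployment section each user effectively gets three times the stated limit; rate-limiter-flexible also ships Redis-backed limiters for that case. If you want to see the core mechanism, the consume-or-reject semantics of a fixed-window limiter fit in a few lines. This is a simplified sketch, not a replacement for the library, and like RateLimiterMemory it only counts within one process:

```typescript
// Fixed-window limiter: allow `points` calls per key per `durationMs` window.
class FixedWindowLimiter {
  private windows = new Map<string, { count: number; resetAt: number }>();

  constructor(private points: number, private durationMs: number) {}

  consume(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now >= w.resetAt) {
      // Start a fresh window for this key
      this.windows.set(key, { count: 1, resetAt: now + this.durationMs });
      return true;
    }
    if (w.count >= this.points) return false; // over the limit for this window
    w.count += 1;
    return true;
  }
}
```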

Pattern 4: Structured Error Handling

AI agents need clear error messages to recover gracefully. Don't let raw stack traces leak.

class ToolError extends Error {
  constructor(
    message: string,
    public readonly code: string,
    public readonly retryable: boolean = false
  ) {
    super(message);
  }
}

function withErrorHandling(
  handler: (args: any) => Promise<any>
): (args: any) => Promise<any> {
  return async (args) => {
    try {
      return await handler(args);
    } catch (error) {
      if (error instanceof ToolError) {
        return {
          content: [{
            type: "text",
            text: JSON.stringify({
              error: error.message,
              code: error.code,
              retryable: error.retryable,
            }),
          }],
          isError: true,
        };
      }

      // Log unexpected errors, return sanitized message
      console.error("Unexpected tool error:", error);
      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            error: "An internal error occurred",
            code: "INTERNAL_ERROR",
            retryable: true,
          }),
        }],
        isError: true,
      };
    }
  };
}

// Usage
server.tool("deploy_service", "Deploy a service to production",
  { service: z.string(), version: z.string() },
  withErrorHandling(async ({ service, version }) => {
    const exists = await checkServiceExists(service);
    if (!exists) {
      throw new ToolError(
        `Service '${service}' not found`,
        "NOT_FOUND",
        false
      );
    }
    // ...
  })
);
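The `retryable` flag only pays off if callers honor it. A hypothetical retry wrapper that an MCP client or an outer orchestration layer might use, retrying only when the structured error payload says it is safe to do so (the `ToolResult` shape mirrors what the handlers above return):

```typescript
interface ToolResult {
  content: { type: string; text: string }[];
  isError?: boolean;
}

// Retry a tool call while its structured error payload marks it retryable.
async function callWithRetry(
  call: () => Promise<ToolResult>,
  maxAttempts = 3
): Promise<ToolResult> {
  let result = await call();
  for (let attempt = 1; attempt < maxAttempts && result.isError; attempt++) {
    let retryable = false;
    try {
      retryable = JSON.parse(result.content[0].text).retryable === true;
    } catch {
      // Unstructured error text: don't retry blindly
    }
    if (!retryable) break;
    result = await call();
  }
  return result;
}
```

A production version would also back off between attempts; this sketch keeps only the decision logic.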

Pattern 5: Resource Caching with TTL

Resources are data the agent reads. Cache them to avoid hammering your data sources.

interface CacheEntry<T> {
  data: T;
  expires: number;
}

class TTLCache<T> {
  private cache = new Map<string, CacheEntry<T>>();

  get(key: string): T | undefined {
    const entry = this.cache.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expires) {
      this.cache.delete(key);
      return undefined;
    }
    return entry.data;
  }

  set(key: string, data: T, ttlMs: number) {
    this.cache.set(key, { data, expires: Date.now() + ttlMs });
  }
}

const resourceCache = new TTLCache<string>();

server.resource(
  "config/{env}",
  new ResourceTemplate("config/{env}", { list: undefined }),
  async (uri, { env }) => {
    const cacheKey = `config:${env}`;
    let content = resourceCache.get(cacheKey);

    if (!content) {
      const config = await fetchConfig(env as string);
      content = JSON.stringify(config, null, 2);
      resourceCache.set(cacheKey, content, 60_000); // 1 min TTL
    }

    return {
      contents: [{ uri: uri.href, mimeType: "application/json", text: content }],
    };
  }
);
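The read-check-fetch-set sequence in that resource handler can be folded into a `getOrSet` method on the cache so each handler doesn't repeat it. A sketch extending the TTLCache above (the class is reproduced so the snippet stands alone):

```typescript
interface CacheEntry<T> {
  data: T;
  expires: number;
}

class TTLCache<T> {
  private cache = new Map<string, CacheEntry<T>>();

  get(key: string): T | undefined {
    const entry = this.cache.get(key);
    if (!entry || Date.now() > entry.expires) return undefined;
    return entry.data;
  }

  set(key: string, data: T, ttlMs: number) {
    this.cache.set(key, { data, expires: Date.now() + ttlMs });
  }

  // Fetch-through: return the cached value, or load, store, and return it.
  async getOrSet(key: string, ttlMs: number, load: () => Promise<T>): Promise<T> {
    const hit = this.get(key);
    if (hit !== undefined) return hit;
    const data = await load();
    this.set(key, data, ttlMs);
    return data;
  }
}
```

Note this doesn't deduplicate concurrent misses; if two requests miss at once, both call `load`. To avoid that stampede, cache the in-flight promise rather than the resolved value.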

Pattern 6: Deployment with Docker

FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --include=dev
COPY . .
RUN npm run build

FROM node:22-alpine
WORKDIR /app
RUN addgroup -g 1001 -S mcp && adduser -S mcp -u 1001
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER mcp
EXPOSE 3000
HEALTHCHECK --interval=30s CMD wget -qO- http://localhost:3000/health || exit 1

CMD ["node", "dist/server.js"]

# docker-compose.yml
services:
  mcp-server:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/app
      - REDIS_URL=redis://cache:6379
      - JWT_SECRET=${JWT_SECRET}
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 512M
          cpus: "0.5"
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy

Pattern 7: Observability

import { trace, metrics, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("mcp-server");
const meter = metrics.getMeter("mcp-server");

const toolCallCounter = meter.createCounter("mcp.tool.calls", {
  description: "Number of tool invocations",
});
const toolLatency = meter.createHistogram("mcp.tool.duration", {
  description: "Tool execution duration in ms",
  unit: "ms",
});

function instrumentedTool(
  name: string,
  description: string,
  schema: z.ZodObject<any>,
  handler: (args: any) => Promise<any>
) {
  server.tool(name, description, schema, async (args) => {
    const start = performance.now();
    return tracer.startActiveSpan(`tool.${name}`, async (span) => {
      try {
        const result = await handler(args);
        span.setStatus({ code: SpanStatusCode.OK });
        toolCallCounter.add(1, { tool: name, status: "success" });
        return result;
      } catch (error) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
        toolCallCounter.add(1, { tool: name, status: "error" });
        throw error;
      } finally {
        toolCallCounter.add(1, { tool: name, status: "error" });
        throw error;
      } finally {
        toolLatency.record(performance.now() - start, { tool: name });
        span.end();
      }
    });
  });
}

Architecture Summary

Concern        | Pattern                | Why
---------------|------------------------|------------------------------
Connections    | Singleton pools        | Prevent connection exhaustion
Auth           | JWT + scope checking   | Multi-tenant safety
Rate limiting  | Per-tool tiers         | Protect expensive operations
Errors         | Typed error classes    | Agent-friendly recovery
Caching        | TTL cache on resources | Reduce backend load
Deployment     | Docker + health checks | Horizontal scaling
Observability  | OTel tracing + metrics | Debug production issues

Key Takeaways

  1. MCP servers are just API servers — apply the same production patterns you'd use for any HTTP service.
  2. Auth is not optional — AI agents will do whatever users ask, including accessing data they shouldn't.
  3. Rate limit by tool, not globally — a search query and a report generation have very different costs.
  4. Return structured errors — AI agents recover better from {"code": "NOT_FOUND", "retryable": false} than from stack traces.
  5. Cache resources aggressively — agents re-read the same resources frequently during a conversation.

The MCP ecosystem is moving fast, but production patterns are timeless. Build your MCP servers like you'd build any critical API, and they'll survive real-world traffic.


Based on MCP server deployments handling 1000+ concurrent agent sessions.
