Every team I talk to is building AI agents. They've got LangChain running, a vector database humming, and a demo that impresses the CEO.
Then someone asks: "Cool. How do we run this for 10,000 users?"
That's where things get quiet.
I've spent the last year deploying agentic AI systems on Kubernetes — the kind that actually serve real traffic, not just notebook demos. And I've learned that the gap between "it works on my laptop" and "it runs in production" is enormous.
This post is about the real stack behind production AI agents in 2026. Not the LLM part — everyone covers that. I'm talking about the infrastructure layer that nobody writes about but everyone desperately needs.
The Problem Nobody Talks About
Here's a typical conversation I have at least once a week:
Engineer: "We built an AI agent that can query our database, search documents, and draft reports."
Me: "How does it connect to those systems?"
Engineer: "Custom Python scripts. Different ones for each tool."
Me: "How do you deploy it?"
Engineer: "Docker container on a single VM."
Me: "What happens when you need to add a new tool?"
Engineer: silence
The fundamental issue is that most AI agent architectures look like this:
LLM ──→ Hardcoded Tool A (custom REST client)
──→ Hardcoded Tool B (custom gRPC wrapper)
──→ Hardcoded Tool C (custom DB connector)
──→ Hardcoded Tool D (custom file reader)
Every tool integration is bespoke. Every deployment is fragile. Every new capability means rewriting half the agent. This doesn't scale.
Enter MCP: The USB-C Port for AI Agents
The Model Context Protocol (MCP) changes this equation entirely. Think of it as USB-C for AI — one standardized interface that connects any model to any tool.
Instead of custom integrations for each tool, MCP provides:
LLM ──→ MCP Client ──→ MCP Server: Database
──→ MCP Server: Search
──→ MCP Server: Kubernetes
──→ MCP Server: Slack
──→ MCP Server: (anything)
Here's a minimal MCP server definition that exposes a Kubernetes health check as a tool:
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import * as k8s from "@kubernetes/client-node";
import { z } from "zod";

const server = new McpServer({
  name: "k8s-health",
  version: "1.0.0",
});

server.tool(
  "check_pod_status",
  "Check the health status of pods in a namespace",
  // The TypeScript SDK takes a Zod shape for tool parameters
  { namespace: z.string().describe("Kubernetes namespace") },
  async ({ namespace }) => {
    const kc = new k8s.KubeConfig();
    kc.loadFromDefault();
    const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
    const res = await k8sApi.listNamespacedPod({ namespace });
    const unhealthy = res.items.filter(
      (p) => p.status?.phase !== "Running" && p.status?.phase !== "Succeeded"
    );
    return {
      content: [
        {
          type: "text",
          text:
            unhealthy.length === 0
              ? `All pods in ${namespace} are healthy.`
              : `Found ${unhealthy.length} unhealthy pods:\n` +
                unhealthy
                  .map((p) => `- ${p.metadata?.name}: ${p.status?.phase}`)
                  .join("\n"),
        },
      ],
    };
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);
The agent doesn't need to know how Kubernetes works. It just calls check_pod_status through MCP, and the server handles the rest.
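Under the hood there's no magic: a tool invocation is a JSON-RPC 2.0 message with method tools/call. Here's a minimal sketch of the request an MCP client sends for the tool above (field values are illustrative):

```python
import json

def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build the JSON-RPC 2.0 `tools/call` message an MCP client sends."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Ask the k8s-health server to check the ai-agents namespace
msg = build_tool_call(1, "check_pod_status", {"namespace": "ai-agents"})
```

The SDKs build and parse these messages for you; the point is that every tool, from Postgres to Slack, is driven through this one message shape.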
Now here's where it gets interesting: the MCP ecosystem has exploded. As of March 2026, there are over 10,000 public MCP servers, 97M+ monthly SDK downloads, and adoption by Claude, ChatGPT, Gemini, VS Code, and Cursor. OpenAI, Google, and Microsoft all support it. It's not a bet anymore — it's infrastructure.
The Part Everyone Skips: Running This on Kubernetes
So you've got MCP wiring up your agent to tools. Great. But where do these MCP servers actually run? How do you manage GPU resources for your LLM? What happens when traffic spikes?
This is where Kubernetes becomes the AI infrastructure OS.
Architecture: Agent Stack on Kubernetes
Here's the production architecture I deploy:
┌──────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Ingress / Gateway API │ │
│ │ Inference Gateway · model-aware routing │ │
│ └───────────────────┬──────────────────────────┘ │
│ │ │
│ ┌─────────────┐ ┌───┴──────────┐ ┌──────────────┐ │
│ │ Agent API │─│MCP Server │ │ Key Metrics │ │
│ │ FastAPI + │ │Pool │ │ TTFT <500ms │ │
│ │ LangGraph │ │ mcp-postgres │ │ GPU >70% │ │
│ └──────┬──────┘ │ mcp-search │ │ Tool >99.5% │ │
│ │ │ mcp-k8s │ │ Cost -39% │ │
│ │ └──────┬───────┘ └──────────────┘ │
│ ┌──────┴──────┐ ┌──────┴───────┐ │
│ │ LLM Server │ │ Vector DB │ │
│ │ vLLM + DRA │ │ Weaviate │ │
│ │ Llama-70B │ │ Hybrid Srch │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Observability: OTel → Prometheus → Grafana │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
Let me walk through each layer.
Layer 1: GPU Scheduling with Dynamic Resource Allocation
The biggest pain point in AI infrastructure is GPU management. Before DRA (Dynamic Resource Allocation), you were stuck with device plugins — crude, all-or-nothing GPU assignments.
DRA reached GA in Kubernetes 1.34 and changes the game. Here's a ResourceClaimTemplate for an inference workload — each pod stamped out by the Deployment below gets its own claim:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: llm-inference-gpu
  namespace: ai-agents
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
          - cel:
              expression: >-
                device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("24Gi")) >= 0 &&
                (device.attributes["gpu.nvidia.com"].productName.contains("A100") ||
                device.attributes["gpu.nvidia.com"].productName.contains("A10G"))
      constraints:
      - requests: ["gpu"]
        matchAttribute: "gpu.nvidia.com/numa-node"
What this does:
- CEL-based filtering: Selects GPUs with 24GB+ VRAM and A-series (A100/A10G)
- Topology-aware: NUMA node matching ensures the GPU is on the same physical socket as the CPU — this alone cuts inference latency by 15-20%
- Declarative: No more shell scripts checking nvidia-smi
And here's the Deployment that uses it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai-agents
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.3
        args:
        - "--model=meta-llama/Llama-3.3-70B-Instruct"
        - "--tensor-parallel-size=1"
        - "--max-model-len=8192"
        - "--enforce-eager"
        ports:
        - containerPort: 8000
        resources:
          claims:
          - name: llm-gpu
      resourceClaims:
      - name: llm-gpu
        resourceClaimTemplateName: llm-inference-gpu
Early adopters report 25% better GPU utilization with DRA compared to static device plugins. That's not a marginal gain — at cloud GPU prices, that's potentially tens of thousands of dollars saved per month.
Layer 2: Inference Gateway for Traffic Routing
You don't point users directly at vLLM instances. The Inference Gateway (now GA) routes inference traffic based on model names, LoRA adapters, and endpoint health:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llm-pool
  namespace: ai-agents
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      app: vllm-server
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: agent-model
  namespace: ai-agents
spec:
  modelName: "agent-llama-70b"
  targetModels:
  - name: "meta-llama/Llama-3.3-70B-Instruct"
    weight: 100
  poolRef:
    name: llm-pool
This gives you:
- Model-aware routing: A request for agent-llama-70b automatically goes to the right pool
- Health-based failover: Unhealthy endpoints are removed from rotation
- Multi-tenant serving: Different teams share the same GPU pool without stepping on each other
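To make the targetModels semantics concrete, here's a Python sketch of weighted routing — the routing table and helper are illustrative, not the gateway's actual code:

```python
import random

# Illustrative routing table mirroring the InferenceModel resource:
# a public model name maps to weighted backend models.
ROUTES = {
    "agent-llama-70b": [
        ("meta-llama/Llama-3.3-70B-Instruct", 100),
    ],
}

def pick_target(model_name: str) -> str:
    """Pick a backend model for a request, proportional to weight."""
    targets = ROUTES[model_name]
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return random.choices(names, weights=weights, k=1)[0]
```

Add a second target at weight 10 and you get a ~9% canary split, which is the typical use for those weights.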
Layer 3: MCP Servers as Kubernetes Deployments
Each MCP server runs as a standard Kubernetes deployment. Here's the pattern I use:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-postgres
  namespace: ai-agents
  labels:
    mcp-server: "true"
    mcp-tool: "database"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-postgres
  template:
    metadata:
      labels:
        app: mcp-postgres
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: mcp-postgres-sa
      containers:
      - name: mcp-server
        image: registry.internal/mcp-postgres:1.2.0
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: host
        - name: MCP_TRANSPORT
          value: "streamable-http"
        ports:
        - containerPort: 3000  # MCP
        - containerPort: 9090  # Metrics
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          periodSeconds: 5
        volumeMounts:
        - name: tls-certs
          mountPath: /etc/mcp/tls
          readOnly: true
      volumes:
      - name: tls-certs
        secret:
          secretName: mcp-tls
Key decisions:
- Separate deployments per MCP server: Isolates failures. A broken Slack MCP server doesn't take down your database MCP server.
- mTLS between agent and MCP servers: Non-negotiable for production. Agent ↔ MCP traffic carries credentials and sensitive data.
- Horizontal scaling: MCP servers that hit external APIs (Slack, GitHub) might need more replicas during peak hours.
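For those bursty external-API servers, a standard HorizontalPodAutoscaler covers the scaling. A sketch, assuming an mcp-slack Deployment following the pattern above — the CPU target is a starting point, not a tuned value:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-slack
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-slack
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```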
Layer 4: Observability — The AI-Specific Metrics
Standard CPU/memory metrics aren't enough. For AI agent workloads, you need:
# OpenTelemetry Collector config for AI metrics
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 5s
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: ai_agent
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
The metrics that matter:
| Metric | Target | Why It Matters |
|---|---|---|
| Time to First Token (TTFT) | < 500ms | User-perceived responsiveness |
| Tokens/second | > 30 tok/s | Throughput capacity |
| MCP tool call latency | < 200ms p99 | Agent step execution speed |
| GPU utilization | > 70% | Cost efficiency |
| Queue depth | < 10 | Capacity signal |
| Tool call success rate | > 99.5% | Reliability indicator |
| Agent task completion rate | > 95% | End-to-end quality |
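I also encode those targets into a smoke-test helper that runs against a metrics snapshot before and after deployments. A sketch — the metric names and snapshot values are illustrative:

```python
# Target thresholds from the table above. "below" means the measured
# value must stay under the threshold, "above" means it must exceed it.
TARGETS = {
    "ttft_ms":             (500,   "below"),
    "tokens_per_s":        (30,    "above"),
    "tool_latency_p99_ms": (200,   "below"),
    "gpu_utilization":     (0.70,  "above"),
    "queue_depth":         (10,    "below"),
    "tool_success_rate":   (0.995, "above"),
    "task_completion":     (0.95,  "above"),
}

def violations(snapshot: dict) -> list[str]:
    """Return the metrics in `snapshot` that miss their target."""
    bad = []
    for name, value in snapshot.items():
        threshold, direction = TARGETS[name]
        ok = value < threshold if direction == "below" else value > threshold
        if not ok:
            bad.append(name)
    return bad
```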
Here's a Prometheus alert expression, wired into a Grafana dashboard, that's saved me multiple times:
# Alert: Agent tool calls failing above threshold
(
  sum(rate(mcp_tool_call_errors_total[5m])) by (tool_name)
  /
  sum(rate(mcp_tool_call_total[5m])) by (tool_name)
) > 0.05
If any MCP tool starts failing more than 5% of calls, you want to know immediately — before users notice the agent giving wrong answers.
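The same check is easy to replicate in a unit test or canary script. A sketch with made-up counter values:

```python
def failing_tools(calls: dict, errors: dict, threshold: float = 0.05) -> list[str]:
    """Return tools whose error rate over the window exceeds `threshold`."""
    return [
        tool for tool, total in calls.items()
        if total > 0 and errors.get(tool, 0) / total > threshold
    ]

# Counters over a 5-minute window (illustrative values)
calls  = {"mcp-postgres": 400, "mcp-search": 250, "mcp-k8s": 50}
errors = {"mcp-postgres": 2,   "mcp-search": 30}
```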
What I Got Wrong (So You Don't Have To)
A few hard lessons from production deployments:
1. Don't run MCP servers as sidecars
My first instinct was to run MCP servers as sidecar containers alongside the agent. Bad idea. When the MCP server for Postgres needs patching, you don't want to restart your entire agent pod. Separate deployments. Always.
2. GPU over-provisioning is worse than under-provisioning
With DRA, it's tempting to request the biggest GPU available. Don't. A Llama-3.3-8B model running inference doesn't need an A100 80GB. Start with the smallest GPU that meets your latency target, then scale up only with data.
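A rough rule of thumb for right-sizing: weights-only VRAM is parameter count times bytes per parameter, plus roughly 20% headroom for KV cache and activations at modest context lengths. This is a back-of-envelope sketch, not a capacity planner:

```python
def est_vram_gb(params: float, bytes_per_param: int = 2, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights (params x bytes) plus ~20% headroom."""
    return params * bytes_per_param / 2**30 * overhead

# An 8B model in fp16 fits comfortably on a 24GB card...
small = est_vram_gb(8e9)   # roughly 18 GB
# ...while a 70B model in fp16 needs multiple GPUs or heavy quantization.
big = est_vram_gb(70e9)    # roughly 156 GB
```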
3. Agents need circuit breakers on tool calls
An agent stuck in a retry loop on a failing MCP tool will burn through tokens (and money) shockingly fast. Implement circuit breakers:
from circuitbreaker import circuit

@circuit(failure_threshold=3, recovery_timeout=30)
async def call_mcp_tool(server: str, tool: str, params: dict):
    async with mcp_client.connect(server) as client:
        result = await client.call_tool(tool, params)
        return result
After 3 consecutive failures, the circuit opens and the agent gets an immediate "tool unavailable" response instead of waiting for timeouts.
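If you'd rather not pull in a dependency, the state machine is small enough to sketch yourself. A simplified synchronous version — a production breaker would also rate-limit the half-open trial calls:

```python
import time

class ToolUnavailable(Exception):
    pass

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    then rejects calls until the recovery timeout elapses."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise ToolUnavailable("circuit open")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```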
4. Observability isn't optional — it's day one
I once spent 6 hours debugging why an agent was giving nonsensical answers. The root cause? The vector database MCP server was returning stale embeddings because a background indexing job had silently failed 2 days earlier. With proper embedding freshness metrics, I would have caught it in minutes.
The Numbers
Here's what this stack delivers in practice:
| Metric | Before (Ad-hoc) | After (This Stack) |
|---|---|---|
| Agent response time (p95) | 12s | 3.2s |
| GPU utilization | ~45% | ~72% |
| Tool integration time | 2-3 weeks | 2-3 days (MCP) |
| Incident detection | Hours | Minutes (OTel) |
| Monthly GPU cost (same workload) | $18K | $11K |
The GPU cost savings alone paid for the engineering effort in the first month.
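The cost line follows almost directly from the utilization line: for a fixed workload, the GPU bill scales inversely with utilization. A back-of-envelope check, ignoring reserved-instance pricing effects:

```python
def cost_at_utilization(current_cost: float, current_util: float, new_util: float) -> float:
    """Same useful work on fewer idle GPUs: cost scales by the util ratio."""
    return current_cost * current_util / new_util

# $18K/month at 45% utilization, improved to 72%:
new_cost = cost_at_utilization(18_000, 0.45, 0.72)  # roughly $11.25K/month
```

Which lands almost exactly on the ~$11K/month in the table.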
Getting Started
If you're running AI agents (or about to), here's my recommended path:
1. Start with MCP — Wrap your existing tool integrations as MCP servers. Even before Kubernetes, this standardization pays off immediately.
2. Deploy on Kubernetes with DRA — If you're on K8s 1.34+, enable DRA for your GPU nodes. The topology-aware scheduling alone is worth it.
3. Add the Inference Gateway — Don't build custom load balancing for model serving. The Gateway API Inference Extension handles model-aware routing natively.
4. Instrument from day one — OpenTelemetry + Prometheus + Grafana. Track TTFT, tool call latency, and GPU utilization as your golden signals.
5. Run MCP servers as independent deployments — Not sidecars. Each MCP server gets its own lifecycle, scaling, and health checks.
What's Next
This is Part 1 of a series on AI Infrastructure in Production. Coming next:
- Part 2: Deep dive into GPU scheduling with DRA — benchmarks, gotchas, and advanced patterns
- Part 3: Observability for LLM workloads — building dashboards that actually tell you what's wrong
If you found this useful, give it a heart and follow me — I write about the infrastructure side of AI that doesn't get enough attention.
This article was researched and written with AI assistance. Read more cloud-native engineering insights on the ManoIT Tech Blog.