Your AI agent runs fine on your laptop. Then you deploy it and discover you need a model server, a vector database, a message queue, and monitoring -- all wired together correctly. You spend two days writing shell scripts.
Docker Compose defines your entire AI agent stack in a single YAML file. One command brings it all up. Here are 4 patterns that handle the common deployment scenarios.
Pattern 1: Model Runner as a Compose Service
Docker Compose now supports a top-level models element that declares AI models as first-class infrastructure. Instead of manually running a model server and wiring environment variables, you declare the model and bind it to your agent service.
Here is the compose.yaml:
```yaml
services:
  agent:
    build:
      context: ./agent
    ports:
      - "8080:8080"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    models:
      llm:
        endpoint_var: MODEL_RUNNER_URL
        model_var: MODEL_RUNNER_MODEL
    depends_on:
      - vectordb

  vectordb:
    image: qdrant/qdrant:v1.17.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192

volumes:
  qdrant_data:
```
The models block at the top level declares the model. The models block inside the agent service binds it. Docker Compose automatically injects MODEL_RUNNER_URL and MODEL_RUNNER_MODEL as environment variables into the agent container.
What this gives you:
- The model is pulled and started automatically on docker compose up
- Your agent code reads MODEL_RUNNER_URL to connect -- no hardcoded endpoints
- The vector database starts alongside the model with persistent storage
- One docker compose down tears everything down cleanly
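Inside the agent container, using the injected variables takes only a few lines. Here is a minimal Python sketch; the helper names are illustrative, and since Docker Model Runner exposes an OpenAI-compatible API, any OpenAI-style client pointed at MODEL_RUNNER_URL works the same way:

```python
import os

def resolve_model_config(env=None):
    """Read the endpoint and model name Compose injects through the
    endpoint_var / model_var bindings declared in compose.yaml."""
    env = os.environ if env is None else env
    return env["MODEL_RUNNER_URL"].rstrip("/"), env["MODEL_RUNNER_MODEL"]

def chat_request(prompt, env=None):
    """Build an OpenAI-compatible chat-completions payload for the bound model."""
    base_url, model = resolve_model_config(env)
    return {
        "url": f"{base_url}/chat/completions",
        "json": {"model": model,
                 "messages": [{"role": "user", "content": prompt}]},
    }
```

Because nothing is hardcoded, the same agent image runs unchanged whether Compose binds it to a local Gemma or a larger hosted model.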
The context_size field controls the maximum token window for the model. You can also pass runtime_flags for inference engine tuning:
```yaml
models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192
    runtime_flags:
      - "--temp=0.1"
      - "--no-prefill-assistant"
```
This pattern replaces the common setup of running Ollama in a separate terminal, manually setting OLLAMA_HOST, and hoping your agent finds it. The model is version-controlled alongside your code -- anyone who clones the repo gets the exact same inference setup.
Pattern 2: GPU Reservations for Local Inference
Running models locally requires GPU access. Docker Compose handles this through the deploy.resources.reservations.devices block. No custom Docker run flags needed.
```yaml
services:
  inference:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - model_cache:/root/.cache/huggingface

  agent:
    build:
      context: ./agent
    environment:
      - LLM_ENDPOINT=http://inference:8000/v1
    depends_on:
      inference:
        condition: service_healthy

volumes:
  model_cache:
```
Key details that save you debugging time:
The capabilities field is required. Omitting it causes deployment to fail silently on some Docker versions. Always include [gpu] explicitly.
count and device_ids are mutually exclusive. Use count: 1 to grab any available GPU. Use device_ids: ['0'] to pin a specific GPU (check nvidia-smi for IDs). Never use both in the same service definition.
The depends_on with condition: service_healthy prevents your agent from starting before the model server is ready. Add a healthcheck to the inference service:
```yaml
services:
  inference:
    # ... (same as above)
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
```
The start_period of 120 seconds matters. Large models take 60-90 seconds to load into GPU memory. Without this grace period, Docker marks the service as unhealthy before it finishes loading.
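The same grace-period-then-poll logic is worth mirroring in any client that cannot rely on Compose ordering. A small sketch, where `probe` is a stand-in for an HTTP GET against the /health endpoint:

```python
import time

def wait_for_ready(probe, retries=5, interval=1.0, start_period=0.0):
    """Client-side analogue of the Compose healthcheck: wait out the model's
    load time, then poll until the probe succeeds or retries are exhausted."""
    time.sleep(start_period)   # grace period while weights load into GPU memory
    for _ in range(retries):
        if probe():            # e.g. requests.get(health_url).ok
            return True
        time.sleep(interval)
    return False
```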
Memory limits prevent OOM kills. Add memory reservations alongside GPU reservations to prevent the inference service from consuming all host RAM during model loading:
```yaml
services:
  inference:
    # ... (same as above)
    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          memory: 8G
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
This is especially important on shared machines where other services compete for memory.
Pattern 3: MCP Gateway for Tool Access
AI agents need tools -- web search, database access, file operations. Docker's MCP Gateway image brokers tool access through the Model Context Protocol, giving your agent a single endpoint for all external capabilities.
```yaml
services:
  agent:
    build:
      context: ./agent
    ports:
      - "8080:8080"
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway
    models:
      gemma3:
        endpoint_var: MODEL_RUNNER_URL
        model_var: MODEL_RUNNER_MODEL

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo,filesystem
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"

models:
  gemma3:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000
```
The mcp-gateway service exposes an SSE endpoint at port 8811. Your agent connects to http://mcp-gateway:8811/sse and discovers available tools automatically through the MCP protocol.
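Under the hood, MCP is JSON-RPC 2.0 carried over the transport. A sketch of the messages an agent sends during discovery (the protocol version shown is one published MCP revision, and the tool name in the last call is illustrative -- real names come back from tools/list):

```python
from itertools import count

_ids = count(1)

def jsonrpc(method, params=None):
    """Build a JSON-RPC 2.0 message with an auto-incrementing request id."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

# Discovery handshake: initialize, then list the tools the gateway brokers.
init = jsonrpc("initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "agent", "version": "0.1"},
})
list_tools = jsonrpc("tools/list")

# Invoking a discovered tool (name and arguments are hypothetical):
search = jsonrpc("tools/call",
                 {"name": "search", "arguments": {"query": "docker compose"}})
```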
Why this matters for DevOps:
Without the gateway, every tool integration requires custom code in your agent. Adding a new tool means changing agent code, rebuilding the image, and redeploying. With the MCP gateway, you add a tool by appending to the --servers flag and running docker compose up -d. The agent discovers the new tool without any code changes.
The Docker socket volume mount (/var/run/docker.sock) gives the gateway access to the host's Docker daemon. This is how the gateway can spawn, manage, and stop MCP server containers on demand.
Adding custom MCP servers:
If the built-in servers are not enough, mount your own configuration:
```yaml
services:
  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo
      - --config=/config/custom-servers.json
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./mcp-config:/config
    ports:
      - "8811:8811"
```
Your custom-servers.json defines additional MCP servers the gateway should manage.
Pattern 4: Multi-Agent Orchestration
Production systems rarely run a single agent. You need specialists -- a researcher, a coder, a reviewer -- each in its own container with its own dependencies.
```yaml
services:
  orchestrator:
    build:
      context: ./orchestrator
    ports:
      - "8080:8080"
    environment:
      - RESEARCHER_URL=http://researcher:8081
      - CODER_URL=http://coder:8082
      - REVIEWER_URL=http://reviewer:8083
      - REDIS_URL=redis://queue:6379
    depends_on:
      - researcher
      - coder
      - reviewer
      - queue

  researcher:
    build:
      context: ./agents/researcher
    ports:
      - "8081:8081"
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    models:
      research_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  coder:
    build:
      context: ./agents/coder
    ports:
      - "8082:8082"
    models:
      code_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  reviewer:
    build:
      context: ./agents/reviewer
    ports:
      - "8083:8083"
    models:
      review_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  queue:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo,filesystem,github
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"

models:
  research_model:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000
  code_model:
    model: ai/qwen2.5-coder:7B-Q4_0
    context_size: 16384
  review_model:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192

volumes:
  redis_data:
```
Each agent gets its own model binding. The researcher uses a general-purpose model. The coder gets a code-specialized model with a larger context window. The reviewer uses the same general model as the researcher but with a smaller context for faster responses.
Three things that break multi-agent setups:
1. Port conflicts. Each agent needs its own port. Map them explicitly. If two services bind to the same host port, docker compose up fails without a clear error message.
2. Startup ordering. Use depends_on with health checks. The orchestrator should not start sending requests until all agent services report healthy.
3. Shared state without a broker. Never use shared volumes for inter-agent communication. Use Redis, RabbitMQ, or another message broker. Shared files cause race conditions that are impossible to debug in containers.
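The broker rule from point 3 can be sketched without a live Redis. The functions below show the message shape agents exchange; in the stack above, the deque would be replaced by redis LPUSH/BRPOP against REDIS_URL (helper names are illustrative):

```python
import json
from collections import deque

# Stand-in for the Redis `queue` service so the pattern runs anywhere;
# in production the orchestrator LPUSHes and specialist agents BRPOP.
_queue = deque()

def publish(task_type, payload):
    """Serialize a task for a specialist agent and enqueue it."""
    _queue.appendleft(json.dumps({"type": task_type, "payload": payload}))

def consume():
    """Pop the oldest task; each specialist filters on the types it handles."""
    return json.loads(_queue.pop()) if _queue else None
```

Serializing every message through the broker gives you ordering, durability (with Redis persistence), and a single place to observe inter-agent traffic -- none of which shared volumes provide.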
Putting It Together: The Production Stack
A production AI agent deployment combines all four patterns:
```
compose.yaml
|
+-- models:           (Pattern 1: model declarations)
+-- services:
|     +-- agent:        (your application)
|     +-- inference:    (Pattern 2: GPU-accelerated model server)
|     +-- mcp-gateway:  (Pattern 3: tool access)
|     +-- vectordb:     (RAG storage)
|     +-- queue:        (Pattern 4: inter-agent messaging)
|     +-- monitoring:   (Prometheus + Grafana)
+-- volumes:          (persistent storage)
```
The entire stack starts with docker compose up -d and stops with docker compose down. Model weights persist in named volumes. Logs aggregate to a single stream with docker compose logs -f.
Environment-specific overrides let you swap configurations without changing the base file:
```yaml
# compose.override.yaml (development)
services:
  agent:
    environment:
      - LOG_LEVEL=DEBUG
      - OPENAI_API_KEY=${OPENAI_API_KEY}

models:
  llm:
    model: ai/smollm2
    context_size: 2048
```
```yaml
# compose.prod.yaml
services:
  agent:
    deploy:
      replicas: 3
    environment:
      - LOG_LEVEL=WARNING

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 16384
```
Run development with docker compose up. Run production with docker compose -f compose.yaml -f compose.prod.yaml up -d. Same stack, different configurations.
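The override mechanics are worth internalizing: Compose merges later files over earlier ones, with nested mappings merged key-by-key and scalar values replaced by the later file. A rough sketch of that rule (real Compose also has special handling for list-valued keys like ports and the list form of environment, which this ignores):

```python
def merge(base, override):
    """Approximate Compose's multi-file merge: maps merge recursively,
    scalars from the later file win."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], val)   # recurse into nested maps
        else:
            out[key] = val                    # later file replaces scalars
    return out
```

This is why compose.prod.yaml only needs to state what differs -- replicas, log level, model -- and everything else flows through from the base file.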
What Comes Next
These patterns give you a reproducible, portable AI agent stack. Every dependency is declared. Every service is isolated. Every configuration is version-controlled.
The next step is adding observability. Your compose.yaml should include Prometheus for metrics collection and Grafana for visualization. The inference service should export token throughput, latency percentiles, and error rates. Your agent should export task completion rates and tool call success rates.
But the foundation is the compose file. Get the infrastructure right first, and debugging becomes tractable.
Quick reference for the four patterns:
| Pattern | Problem It Solves | Key Compose Feature |
|---|---|---|
| Model Runner | Manual model server management | Top-level models element |
| GPU Reservations | GPU access without custom flags | deploy.resources.reservations.devices |
| MCP Gateway | Hardcoded tool integrations | docker/mcp-gateway with Docker socket |
| Multi-Agent | Service isolation and orchestration | Per-service model bindings + message broker |
Every pattern works standalone. Combine them when your stack demands it. Start with Pattern 1 to replace your manual model setup, then add patterns as your agent grows.
Follow @klement_gunndu for more DevOps and AI content. We're building in public.
Top comments (2)
Brilliant patterns. The MCP Gateway (Pattern 3) is definitely the future for modular AI tools. I'm currently architecting a CLI ecosystem (DotSuite), and I'm curious: how does the gateway handle tool execution security when running inside a container? Does it spin up ephemeral child containers for each tool call, or is everything sandboxed within the gateway service itself?
Great question. The gateway runs each MCP server in its own isolated Docker container -- not ephemeral per-call, but per-server. When a tool request comes in and the target server isn't already running, the gateway starts the container on demand, injects credentials, applies security restrictions (resource limits, restricted privileges, limited network access), and forwards the request.
So it's container-level sandboxing per MCP server rather than per tool invocation. The Docker socket mount gives the gateway the ability to manage these server containers. If a server is already running from a previous call, requests are routed to the existing container.
For DotSuite, if you're considering a similar pattern, the key tradeoff is startup latency on first tool call vs. isolation. Persistent server containers give you sub-second subsequent calls but share state across invocations. Ephemeral per-call containers would give stronger isolation but add 1-3 seconds of cold-start overhead each time.