klement Gunndu
Containerize Your AI Agent Stack With Docker Compose: 4 Patterns That Work

Your AI agent runs fine on your laptop. Then you deploy it and discover you need a model server, a vector database, a message queue, and monitoring -- all wired together correctly. You spend two days writing shell scripts.

Docker Compose defines your entire AI agent stack in a single YAML file. One command brings it all up. Here are 4 patterns that handle the common deployment scenarios.

Pattern 1: Model Runner as a Compose Service

Docker Compose now supports a top-level models element that declares AI models as first-class infrastructure. Instead of manually running a model server and wiring environment variables, you declare the model and bind it to your agent service.

Here is the compose.yaml:

services:
  agent:
    build:
      context: ./agent
    ports:
      - "8080:8080"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    models:
      llm:
        endpoint_var: MODEL_RUNNER_URL
        model_var: MODEL_RUNNER_MODEL
    depends_on:
      - vectordb

  vectordb:
    image: qdrant/qdrant:v1.17.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192

volumes:
  qdrant_data:

The models block at the top level declares the model. The models block inside the agent service binds it. Docker Compose automatically injects MODEL_RUNNER_URL and MODEL_RUNNER_MODEL as environment variables into the agent container.

What this gives you:

  • The model is pulled and started automatically on docker compose up
  • Your agent code reads MODEL_RUNNER_URL to connect -- no hardcoded endpoints
  • The vector database starts alongside the model with persistent storage
  • One docker compose down tears everything down cleanly
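Inside the agent, the injected variables are all you need to reach the model. Here is a minimal stdlib sketch, assuming the runner exposes an OpenAI-compatible chat endpoint; the fallback URL and model name are illustrative placeholders for local runs, not Compose defaults:

```python
import json
import os
import urllib.request

def model_config():
    # Injected by Docker Compose when the service declares a models: binding.
    # The fallbacks below are illustrative placeholders, not guaranteed defaults.
    url = os.environ.get("MODEL_RUNNER_URL", "http://localhost:12434/engines/v1")
    model = os.environ.get("MODEL_RUNNER_MODEL", "ai/gemma3:4B-Q4_0")
    return url, model

def chat(prompt):
    """Send one chat-completion request to the bound model runner."""
    url, model = model_config()
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{url.rstrip('/')}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint comes from the environment, the same image runs unchanged whether the model is local, in Compose, or remote.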

The context_size field controls the maximum token window for the model. You can also pass runtime_flags for inference engine tuning:

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192
    runtime_flags:
      - "--temp=0.1"
      - "--no-prefill-assistant"

This pattern replaces the common setup of running Ollama in a separate terminal, manually setting OLLAMA_HOST, and hoping your agent finds it. The model is version-controlled alongside your code -- anyone who clones the repo gets the exact same inference setup.

Pattern 2: GPU Reservations for Local Inference

Running models locally requires GPU access. Docker Compose handles this through the deploy.resources.reservations.devices block. No custom Docker run flags needed.

services:
  inference:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - model_cache:/root/.cache/huggingface

  agent:
    build:
      context: ./agent
    environment:
      - LLM_ENDPOINT=http://inference:8000/v1
    depends_on:
      inference:
        condition: service_healthy

volumes:
  model_cache:

Key details that save you debugging time:

The capabilities field is required. Omitting it causes deployment to fail silently on some Docker versions. Always include [gpu] explicitly.

count and device_ids are mutually exclusive. Use count: 1 to grab any available GPU. Use device_ids: ['0'] to pin a specific GPU (check nvidia-smi for IDs). Never use both in the same service definition.
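For example, pinning two specific GPUs to a single service looks like this (device IDs as reported by nvidia-smi):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0", "1"]   # pin GPUs 0 and 1; never combine with count
          capabilities: [gpu]
```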

The depends_on with condition: service_healthy prevents your agent from starting before the model server is ready. Add a healthcheck to the inference service:

services:
  inference:
    # ... (same as above)
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

The start_period of 120 seconds matters. Large models take 60-90 seconds to load into GPU memory. Without this grace period, Docker marks the service as unhealthy before it finishes loading.
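Even with the healthcheck in place, it is cheap for the agent to guard its own first connection with a retry loop. A minimal stdlib sketch, where the URL and timing values are illustrative:

```python
import time
import urllib.error
import urllib.request

def wait_for_endpoint(url, attempts=10, delay=3.0):
    """Poll a health URL until it answers 200, mirroring the Compose healthcheck."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; wait and retry
        time.sleep(delay)
    return False
```

Call it once at startup, e.g. `wait_for_endpoint("http://inference:8000/health")`, and fail fast with a clear log message if it returns False.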

Memory limits prevent OOM kills. Add memory reservations alongside GPU reservations to prevent the inference service from consuming all host RAM during model loading:

services:
  inference:
    # ... (same as above)
    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          memory: 8G
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

This is especially important on shared machines where other services compete for memory.

Pattern 3: MCP Gateway for Tool Access

AI agents need tools -- web search, database access, file operations. Docker's MCP Gateway image brokers tool access through the Model Context Protocol, giving your agent a single endpoint for all external capabilities.

services:
  agent:
    build:
      context: ./agent
    ports:
      - "8080:8080"
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway
    models:
      gemma3:
        endpoint_var: MODEL_RUNNER_URL
        model_var: MODEL_RUNNER_MODEL

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo,filesystem
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"

models:
  gemma3:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000

The mcp-gateway service exposes an SSE endpoint at port 8811. Your agent connects to http://mcp-gateway:8811/sse and discovers available tools automatically through the MCP protocol.

Why this matters for DevOps:

Without the gateway, every tool integration requires custom code in your agent. Adding a new tool means changing agent code, rebuilding the image, and redeploying. With the MCP gateway, you add a tool by appending to the --servers flag and running docker compose up -d. The agent discovers the new tool without any code changes.

The Docker socket volume mount (/var/run/docker.sock) gives the gateway access to the host's Docker daemon. This is how the gateway can spawn, manage, and stop MCP server containers on demand. Keep in mind that socket access is root-equivalent on the host, so only grant it to images you trust.

Adding custom MCP servers:

If the built-in servers are not enough, mount your own configuration:

services:
  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo
      - --config=/config/custom-servers.json
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./mcp-config:/config
    ports:
      - "8811:8811"

Your custom-servers.json defines additional MCP servers the gateway should manage.
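For illustration only, a custom-servers.json could take a shape like the following. The field names here are hypothetical, not the gateway's actual catalog schema, so check the docker/mcp-gateway documentation for the real format:

```json
{
  "servers": {
    "internal-search": {
      "image": "registry.example.com/mcp/internal-search:latest",
      "env": ["SEARCH_API_TOKEN"]
    }
  }
}
```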

Pattern 4: Multi-Agent Orchestration

Production systems rarely run a single agent. You need specialists -- a researcher, a coder, a reviewer -- each in its own container with its own dependencies.

services:
  orchestrator:
    build:
      context: ./orchestrator
    ports:
      - "8080:8080"
    environment:
      - RESEARCHER_URL=http://researcher:8081
      - CODER_URL=http://coder:8082
      - REVIEWER_URL=http://reviewer:8083
      - REDIS_URL=redis://queue:6379
    depends_on:
      - researcher
      - coder
      - reviewer
      - queue

  researcher:
    build:
      context: ./agents/researcher
    ports:
      - "8081:8081"
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    models:
      research_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  coder:
    build:
      context: ./agents/coder
    ports:
      - "8082:8082"
    models:
      code_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  reviewer:
    build:
      context: ./agents/reviewer
    ports:
      - "8083:8083"
    models:
      review_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  queue:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo,filesystem,github
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"

models:
  research_model:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000
  code_model:
    model: ai/qwen2.5-coder:7B-Q4_0
    context_size: 16384
  review_model:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192

volumes:
  redis_data:

Each agent gets its own model binding. The researcher uses a general-purpose model. The coder gets a code-specialized model with a larger context window. The reviewer uses the same general model as the researcher but with a smaller context for faster responses.

Three things that break multi-agent setups:

1. Port conflicts. Each agent needs its own host port. Map them explicitly. If two services bind to the same host port, docker compose up fails with a "port is already allocated" error that names the port but not both conflicting services, so you end up grepping the file anyway.

2. Startup ordering. Use depends_on with health checks. The orchestrator should not start sending requests until all agent services report healthy.

3. Shared state without a broker. Never use shared volumes for inter-agent communication. Use Redis, RabbitMQ, or another message broker. Shared files cause race conditions that are notoriously hard to debug in containers.
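On the orchestrator side, routing stays trivial when each specialist's URL comes from the environment block in compose.yaml. A minimal sketch, where the task-type names are illustrative and the fallbacks match the service names declared above:

```python
import os

# Specialist endpoints, injected via the orchestrator's environment block.
AGENT_URLS = {
    "research": os.environ.get("RESEARCHER_URL", "http://researcher:8081"),
    "code": os.environ.get("CODER_URL", "http://coder:8082"),
    "review": os.environ.get("REVIEWER_URL", "http://reviewer:8083"),
}

def route(task_type):
    """Return the specialist endpoint for a task type, failing loudly on unknowns."""
    try:
        return AGENT_URLS[task_type]
    except KeyError:
        raise ValueError(f"no agent registered for task type {task_type!r}")
```

Failing loudly on unknown task types beats silently defaulting to one agent; a mis-routed task in a multi-agent system is much harder to spot downstream.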

Putting It Together: The Production Stack

A production AI agent deployment combines all four patterns:

compose.yaml
  |
  +-- models:         (Pattern 1: model declarations)
  +-- services:
  |     +-- agent:       (your application)
  |     +-- inference:   (Pattern 2: GPU-accelerated model server)
  |     +-- mcp-gateway: (Pattern 3: tool access)
  |     +-- vectordb:    (RAG storage)
  |     +-- queue:       (Pattern 4: inter-agent messaging)
  |     +-- monitoring:  (Prometheus + Grafana)
  +-- volumes:         (persistent storage)

The entire stack starts with docker compose up -d and stops with docker compose down. Model weights persist in named volumes. Logs aggregate to a single stream with docker compose logs -f.

Environment-specific overrides let you swap configurations without changing the base file:

# compose.override.yaml (development)
services:
  agent:
    environment:
      - LOG_LEVEL=DEBUG
      - OPENAI_API_KEY=${OPENAI_API_KEY}

models:
  llm:
    model: ai/smollm2
    context_size: 2048

# compose.prod.yaml
services:
  agent:
    deploy:
      replicas: 3
    environment:
      - LOG_LEVEL=WARNING

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 16384

Run development with docker compose up. Run production with docker compose -f compose.yaml -f compose.prod.yaml up -d. Same stack, different configurations.

What Comes Next

These patterns give you a reproducible, portable AI agent stack. Every dependency is declared. Every service is isolated. Every configuration is version-controlled.

The next step is adding observability. Your compose.yaml should include Prometheus for metrics collection and Grafana for visualization. The inference service should export token throughput, latency percentiles, and error rates. Your agent should export task completion rates and tool call success rates.
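A minimal sketch of those monitoring services, assuming a prometheus.yml scrape config sits next to the compose file:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  grafana_data:
```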

But the foundation is the compose file. Get the infrastructure right first, and debugging becomes tractable.

Quick reference for the four patterns:

| Pattern | Problem It Solves | Key Compose Feature |
| --- | --- | --- |
| Model Runner | Manual model server management | Top-level models element |
| GPU Reservations | GPU access without custom flags | deploy.resources.reservations.devices |
| MCP Gateway | Hardcoded tool integrations | docker/mcp-gateway with Docker socket |
| Multi-Agent | Service isolation and orchestration | Per-service model bindings + message broker |

Every pattern works standalone. Combine them when your stack demands it. Start with Pattern 1 to replace your manual model setup, then add patterns as your agent grows.


Follow @klement_gunndu for more DevOps and AI content. We're building in public.

Top comments (2)

freerave

Brilliant patterns. The MCP Gateway (Pattern 3) is definitely the future for modular AI tools. I'm currently architecting a CLI ecosystem (DotSuite), and I’m curious: how does the gateway handle tool execution security when running inside a container? Does it spin up ephemeral child containers for each tool call, or is everything sandboxed within the gateway service itself?

klement Gunndu

Great question. The gateway runs each MCP server in its own isolated Docker container — not ephemeral per-call, but per-server. When a tool request comes in and the target server isn't already running, the gateway starts the container on demand, injects credentials, applies security restrictions (resource limits, restricted privileges, limited network access), and forwards the request.

So it's container-level sandboxing per MCP server rather than per tool invocation. The Docker socket mount gives the gateway the ability to manage these server containers. If a server is already running from a previous call, requests are routed to the existing container.

For DotSuite, if you're considering a similar pattern, the key tradeoff is startup latency on first tool call vs. isolation. Persistent server containers give you sub-second subsequent calls but share state across invocations. Ephemeral per-call containers would give stronger isolation but add 1-3 seconds of cold-start overhead each time.