Your AI agent runs fine on your laptop. Then you deploy it and discover you need a model server, a vector database, a message queue, and monitoring -- all wired together correctly. You spend two days writing shell scripts.
Docker Compose defines your entire AI agent stack in a single YAML file. One command brings it all up. Here are 4 patterns that handle the common deployment scenarios.
Pattern 1: Model Runner as a Compose Service
Docker Compose now supports a top-level models element that declares AI models as first-class infrastructure. Instead of manually running a model server and wiring environment variables, you declare the model and bind it to your agent service.
Here is the compose.yaml:
```yaml
services:
  agent:
    build:
      context: ./agent
    ports:
      - "8080:8080"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    models:
      llm:
        endpoint_var: MODEL_RUNNER_URL
        model_var: MODEL_RUNNER_MODEL
    depends_on:
      - vectordb

  vectordb:
    image: qdrant/qdrant:v1.17.0
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192

volumes:
  qdrant_data:
```
The models block at the top level declares the model. The models block inside the agent service binds it. Docker Compose automatically injects MODEL_RUNNER_URL and MODEL_RUNNER_MODEL as environment variables into the agent container.
What this gives you:
- The model is pulled and started automatically on docker compose up
- Your agent code reads MODEL_RUNNER_URL to connect -- no hardcoded endpoints
- The vector database starts alongside the model with persistent storage
- One docker compose down tears everything down cleanly
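Inside the agent container, using the injected variables takes only a few lines. Here is a minimal Python sketch; the helper names are illustrative, and since Docker Model Runner exposes an OpenAI-compatible API, any OpenAI-style client pointed at MODEL_RUNNER_URL works the same way:

```python
import os

def resolve_model_config(env=None):
    """Read the endpoint and model name Compose injects through the
    endpoint_var / model_var bindings declared in compose.yaml."""
    env = os.environ if env is None else env
    return env["MODEL_RUNNER_URL"].rstrip("/"), env["MODEL_RUNNER_MODEL"]

def chat_request(prompt, env=None):
    """Build an OpenAI-compatible chat-completions payload for the bound model."""
    base_url, model = resolve_model_config(env)
    return {
        "url": f"{base_url}/chat/completions",
        "json": {"model": model,
                 "messages": [{"role": "user", "content": prompt}]},
    }
```

Because nothing is hardcoded, the same agent image runs unchanged whether Compose binds it to a local Gemma or a larger hosted model.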
The context_size field controls the maximum token window for the model. You can also pass runtime_flags for inference engine tuning:
```yaml
models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192
    runtime_flags:
      - "--temp=0.1"
      - "--no-prefill-assistant"
```
This pattern replaces the common setup of running Ollama in a separate terminal, manually setting OLLAMA_HOST, and hoping your agent finds it. The model is version-controlled alongside your code -- anyone who clones the repo gets the exact same inference setup.
Pattern 2: GPU Reservations for Local Inference
Running models locally requires GPU access. Docker Compose handles this through the deploy.resources.reservations.devices block. No custom Docker run flags needed.
```yaml
services:
  inference:
    image: vllm/vllm-openai:latest
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - model_cache:/root/.cache/huggingface

  agent:
    build:
      context: ./agent
    environment:
      - LLM_ENDPOINT=http://inference:8000/v1
    depends_on:
      inference:
        condition: service_healthy

volumes:
  model_cache:
```
Key details that save you debugging time:
The capabilities field is required. Omitting it causes deployment to fail silently on some Docker versions. Always include [gpu] explicitly.
count and device_ids are mutually exclusive. Use count: 1 to grab any available GPU. Use device_ids: ['0'] to pin a specific GPU (check nvidia-smi for IDs). Never use both in the same service definition.
The depends_on with condition: service_healthy prevents your agent from starting before the model server is ready. Add a healthcheck to the inference service:
```yaml
services:
  inference:
    # ... (same as above)
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
```
The start_period of 120 seconds matters. Large models take 60-90 seconds to load into GPU memory. Without this grace period, Docker marks the service as unhealthy before it finishes loading.
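The same grace-period-then-poll logic is worth mirroring in any client that cannot rely on Compose ordering. A small sketch, where `probe` is a stand-in for an HTTP GET against the /health endpoint:

```python
import time

def wait_for_ready(probe, retries=5, interval=1.0, start_period=0.0):
    """Client-side analogue of the Compose healthcheck: wait out the model's
    load time, then poll until the probe succeeds or retries are exhausted."""
    time.sleep(start_period)   # grace period while weights load into GPU memory
    for _ in range(retries):
        if probe():            # e.g. requests.get(health_url).ok
            return True
        time.sleep(interval)
    return False
```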
Memory limits prevent OOM kills. Add memory reservations alongside GPU reservations to prevent the inference service from consuming all host RAM during model loading:
```yaml
services:
  inference:
    # ... (same as above)
    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          memory: 8G
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
This is especially important on shared machines where other services compete for memory.
Pattern 3: MCP Gateway for Tool Access
AI agents need tools -- web search, database access, file operations. Docker's MCP Gateway image brokers tool access through the Model Context Protocol, giving your agent a single endpoint for all external capabilities.
```yaml
services:
  agent:
    build:
      context: ./agent
    ports:
      - "8080:8080"
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    depends_on:
      - mcp-gateway
    models:
      gemma3:
        endpoint_var: MODEL_RUNNER_URL
        model_var: MODEL_RUNNER_MODEL

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo,filesystem
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"

models:
  gemma3:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000
```
The mcp-gateway service exposes an SSE endpoint at port 8811. Your agent connects to http://mcp-gateway:8811/sse and discovers available tools automatically through the MCP protocol.
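Under the hood, MCP is JSON-RPC 2.0 carried over the transport. A sketch of the messages an agent sends during discovery (the protocol version shown is one published MCP revision, and the tool name in the last call is illustrative -- real names come back from tools/list):

```python
from itertools import count

_ids = count(1)

def jsonrpc(method, params=None):
    """Build a JSON-RPC 2.0 message with an auto-incrementing request id."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

# Discovery handshake: initialize, then list the tools the gateway brokers.
init = jsonrpc("initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "agent", "version": "0.1"},
})
list_tools = jsonrpc("tools/list")

# Invoking a discovered tool (name and arguments are hypothetical):
search = jsonrpc("tools/call",
                 {"name": "search", "arguments": {"query": "docker compose"}})
```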
Why this matters for DevOps:
Without the gateway, every tool integration requires custom code in your agent. Adding a new tool means changing agent code, rebuilding the image, and redeploying. With the MCP gateway, you add a tool by appending to the --servers flag and running docker compose up -d. The agent discovers the new tool without any code changes.
The Docker socket volume mount (/var/run/docker.sock) gives the gateway access to the host's Docker daemon. This is how the gateway can spawn, manage, and stop MCP server containers on demand.
Adding custom MCP servers:
If the built-in servers are not enough, mount your own configuration:
```yaml
services:
  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo
      - --config=/config/custom-servers.json
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./mcp-config:/config
    ports:
      - "8811:8811"
```
Your custom-servers.json defines additional MCP servers the gateway should manage.
Pattern 4: Multi-Agent Orchestration
Production systems rarely run a single agent. You need specialists -- a researcher, a coder, a reviewer -- each in its own container with its own dependencies.
```yaml
services:
  orchestrator:
    build:
      context: ./orchestrator
    ports:
      - "8080:8080"
    environment:
      - RESEARCHER_URL=http://researcher:8081
      - CODER_URL=http://coder:8082
      - REVIEWER_URL=http://reviewer:8083
      - REDIS_URL=redis://queue:6379
    depends_on:
      - researcher
      - coder
      - reviewer
      - queue

  researcher:
    build:
      context: ./agents/researcher
    ports:
      - "8081:8081"
    environment:
      - MCPGATEWAY_ENDPOINT=http://mcp-gateway:8811/sse
    models:
      research_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  coder:
    build:
      context: ./agents/coder
    ports:
      - "8082:8082"
    models:
      code_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  reviewer:
    build:
      context: ./agents/reviewer
    ports:
      - "8083:8083"
    models:
      review_model:
        endpoint_var: MODEL_URL
        model_var: MODEL_NAME

  queue:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  mcp-gateway:
    image: docker/mcp-gateway:latest
    command:
      - --transport=sse
      - --servers=duckduckgo,filesystem,github
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "8811:8811"

models:
  research_model:
    model: ai/gemma3:4B-Q4_0
    context_size: 10000
  code_model:
    model: ai/qwen2.5-coder:7B-Q4_0
    context_size: 16384
  review_model:
    model: ai/gemma3:4B-Q4_0
    context_size: 8192

volumes:
  redis_data:
```
Each agent gets its own model binding. The researcher uses a general-purpose model. The coder gets a code-specialized model with a larger context window. The reviewer uses the same general model as the researcher but with a smaller context for faster responses.
Three things that break multi-agent setups:
1. Port conflicts. Each agent needs its own port. Map them explicitly. If two services bind to the same host port, docker compose up fails without a clear error message.
2. Startup ordering. Use depends_on with health checks. The orchestrator should not start sending requests until all agent services report healthy.
3. Shared state without a broker. Never use shared volumes for inter-agent communication. Use Redis, RabbitMQ, or another message broker. Shared files cause race conditions that are impossible to debug in containers.
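The broker rule from point 3 can be sketched without a live Redis. The functions below show the message shape agents exchange; in the stack above, the deque would be replaced by redis LPUSH/BRPOP against REDIS_URL (helper names are illustrative):

```python
import json
from collections import deque

# Stand-in for the Redis `queue` service so the pattern runs anywhere;
# in production the orchestrator LPUSHes and specialist agents BRPOP.
_queue = deque()

def publish(task_type, payload):
    """Serialize a task for a specialist agent and enqueue it."""
    _queue.appendleft(json.dumps({"type": task_type, "payload": payload}))

def consume():
    """Pop the oldest task; each specialist filters on the types it handles."""
    return json.loads(_queue.pop()) if _queue else None
```

Serializing every message through the broker gives you ordering, durability (with Redis persistence), and a single place to observe inter-agent traffic -- none of which shared volumes provide.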
Putting It Together: The Production Stack
A production AI agent deployment combines all four patterns:
```
compose.yaml
|
+-- models:           (Pattern 1: model declarations)
+-- services:
|     +-- agent:        (your application)
|     +-- inference:    (Pattern 2: GPU-accelerated model server)
|     +-- mcp-gateway:  (Pattern 3: tool access)
|     +-- vectordb:     (RAG storage)
|     +-- queue:        (Pattern 4: inter-agent messaging)
|     +-- monitoring:   (Prometheus + Grafana)
+-- volumes:          (persistent storage)
```
The entire stack starts with docker compose up -d and stops with docker compose down. Model weights persist in named volumes. Logs aggregate to a single stream with docker compose logs -f.
Environment-specific overrides let you swap configurations without changing the base file:
```yaml
# compose.override.yaml (development)
services:
  agent:
    environment:
      - LOG_LEVEL=DEBUG
      - OPENAI_API_KEY=${OPENAI_API_KEY}

models:
  llm:
    model: ai/smollm2
    context_size: 2048
```
```yaml
# compose.prod.yaml
services:
  agent:
    deploy:
      replicas: 3
    environment:
      - LOG_LEVEL=WARNING

models:
  llm:
    model: ai/gemma3:4B-Q4_0
    context_size: 16384
```
Run development with docker compose up. Run production with docker compose -f compose.yaml -f compose.prod.yaml up -d. Same stack, different configurations.
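The override mechanics are worth internalizing: Compose merges later files over earlier ones, with nested mappings merged key-by-key and scalar values replaced by the later file. A rough sketch of that rule (real Compose also has special handling for list-valued keys like ports and the list form of environment, which this ignores):

```python
def merge(base, override):
    """Approximate Compose's multi-file merge: maps merge recursively,
    scalars from the later file win."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], val)   # recurse into nested maps
        else:
            out[key] = val                    # later file replaces scalars
    return out
```

This is why compose.prod.yaml only needs to state what differs -- replicas, log level, model -- and everything else flows through from the base file.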
What Comes Next
These patterns give you a reproducible, portable AI agent stack. Every dependency is declared. Every service is isolated. Every configuration is version-controlled.
The next step is adding observability. Your compose.yaml should include Prometheus for metrics collection and Grafana for visualization. The inference service should export token throughput, latency percentiles, and error rates. Your agent should export task completion rates and tool call success rates.
But the foundation is the compose file. Get the infrastructure right first, and debugging becomes tractable.
Quick reference for the four patterns:
| Pattern | Problem It Solves | Key Compose Feature |
|---|---|---|
| Model Runner | Manual model server management | Top-level models element |
| GPU Reservations | GPU access without custom flags | deploy.resources.reservations.devices |
| MCP Gateway | Hardcoded tool integrations | docker/mcp-gateway with Docker socket |
| Multi-Agent | Service isolation and orchestration | Per-service model bindings + message broker |
Every pattern works standalone. Combine them when your stack demands it. Start with Pattern 1 to replace your manual model setup, then add patterns as your agent grows.
Follow @klement_gunndu for more DevOps and AI content. We're building in public.
Top comments (2)
Brilliant patterns. The MCP Gateway (Pattern 3) is definitely the future for modular AI tools. I'm currently architecting a CLI ecosystem (DotSuite), and I'm curious: how does the gateway handle tool execution security when running inside a container? Does it spin up ephemeral child containers for each tool call, or is everything sandboxed within the gateway service itself?
Great question. The gateway runs each MCP server in its own isolated Docker container -- not ephemeral per-call, but per-server. When a tool request comes in and the target server isn't already running, the gateway starts the container on demand, injects credentials, applies security restrictions (resource limits, restricted privileges, limited network access), and forwards the request.
So it's container-level sandboxing per MCP server rather than per tool invocation. The Docker socket mount gives the gateway the ability to manage these server containers. If a server is already running from a previous call, requests are routed to the existing container.
For DotSuite, if you're considering a similar pattern, the key tradeoff is startup latency on first tool call vs. isolation. Persistent server containers give you sub-second subsequent calls but share state across invocations. Ephemeral per-call containers would give stronger isolation but add 1-3 seconds of cold-start overhead each time.