Running large language models (LLMs) locally gives you full privacy, zero recurring API costs, and full control. In 2026, Docker Swarm remains one of the simplest and most reliable ways for homelabs, small teams, and even light production to run these models with high availability and easy scaling.
This guide shows you how to deploy Ollama, OpenWebUI, and vLLM on a Docker Swarm cluster in 2026. Whether you are running on high-end NVIDIA servers or a cluster of Apple Silicon Macs, you'll find the right configuration here.
Hardware Compatibility & GPU Support
Before we dive in, it’s important to understand how Docker Swarm interacts with your hardware.
| Platform | GPU Acceleration | Best Use Case |
|---|---|---|
| Linux + NVIDIA | ✅ Native (NVIDIA Container Toolkit) | Production, heavy inference, large models. |
| macOS (Intel/M-Series) | ❌ CPU Only (in Docker VM) | Testing, development, small models. |
| Windows + NVIDIA | ✅ WSL2 (Experimental in Swarm) | Power-user workstations. |
[!IMPORTANT]
A Note for Mac Users: Docker Desktop on macOS runs inside a Linux virtual machine. Currently, this VM cannot access the Mac's GPU (Metal) for Ollama or vLLM. If you are following this on a Mac, you must use the CPU-mode configurations provided in this guide.
Why Docker Swarm for Local AI in 2026?
Kubernetes is powerful but often overkill for local or small-scale AI workloads. Docker Swarm offers:
- Simplicity: Use familiar docker stack deploy commands.
- Low overhead: Ideal for homelabs or small clusters (3–8 nodes).
- Native Docker Compose compatibility: Easy migration from single-node setups.
- Built-in load balancing and service discovery: Great for serving multiple AI models.
- Resilience: Automatic failover for your chat interface and inference services.
Many users now prefer Swarm for local AI because it strikes the perfect balance between simplicity and reliability. Check out our comparison between SwarmCLI, Portainer, and native Docker to see why it's the preferred lightweight choice for modern clusters. With tools like SwarmCLI, managing these workloads becomes even more intuitive, providing real-time visibility into your GPU nodes and service health.
Prerequisites
- A Docker Swarm cluster (3+ nodes recommended for HA; single node also works).
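- For GPU acceleration: NVIDIA GPUs with drivers installed and the NVIDIA Container Toolkit on every GPU node (CPU-only and macOS setups can skip this and use the CPU-mode configurations).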
- At least 16GB RAM (32GB+ recommended for larger models).
- Adequate storage for models (models can be 5–150+ GB each).
- Basic familiarity with Docker Compose/stack files.
Enable Swarm (if not already done):
docker swarm init
# On worker nodes:
docker swarm join --token <token> <manager-ip>:2377
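Before deploying anything, it can help to confirm that every node actually joined and reports as ready. The snippet below is a minimal sketch rather than part of the original setup: unready_nodes is a hypothetical helper name, and it relies only on the standard format placeholders of docker node ls.

```shell
# Hypothetical helper: print the hostname of every node whose status is not "Ready".
# Run it on a manager node, since workers cannot list the cluster.
unready_nodes() {
  docker node ls --format '{{.Hostname}} {{.Status}}' |
    awk '$2 != "Ready" { print $1 }'
}
```

Calling unready_nodes on a manager prints nothing when the whole cluster is healthy, which makes it easy to use as a pre-deploy check.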
Part 1: Deploying Ollama on Docker Swarm
Ollama is the easiest way to run popular models like Llama 3.2, Mistral, Gemma, and Phi-3.
Basic Ollama Stack File (ollama-stack.yml)
version: '3.9'
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      mode: global # One instance per node
      restart_policy:
        condition: any # Replaces the Compose-level restart key, which docker stack deploy ignores
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - '11434:11434'
    networks:
      - ai-network
volumes:
  ollama_data:
networks:
  ai-network:
    driver: overlay
    attachable: true
Deploy it:
For Linux with NVIDIA GPUs, use the file above as-is. For macOS or CPU-only Linux, you must remove the resources: reservations: block (including its devices entry) from the deploy section of the YAML file, otherwise the deployment will fail with an "invalid device" error.
# Using standard docker
docker stack deploy -c ollama-stack.yml ollama
# Or verify node health first with SwarmCLI
swarmcli status
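If you prefer to check convergence with plain Docker, the following sketch lists services whose running task count does not match the desired count. The function name unconverged_services is hypothetical; the only assumption about Docker is that the Replicas column of docker service ls has the form running/desired.

```shell
# Hypothetical helper: list services whose running task count differs from the desired count.
# The Replicas column looks like "2/3" (running/desired), also for global services.
unconverged_services() {
  docker service ls --format '{{.Name}} {{.Replicas}}' |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 " " $2 }'
}
```

An empty result a minute or two after docker stack deploy suggests that every service has reached its target state.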
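Pull models (run on each node that hosts an Ollama container, because docker exec only reaches containers on the local node and each node keeps its own ollama_data volume):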
docker exec -it <ollama-container> ollama pull llama3.2
docker exec -it <ollama-container> ollama pull mistral
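To avoid looking up the container ID by hand, you can resolve it from the service name. This is a rough sketch under two assumptions: the stack was deployed under the name ollama as above (so the service is called ollama_ollama), and pull_model is a hypothetical helper name. It uses the com.docker.swarm.service.name label that Swarm attaches to task containers.

```shell
# Hypothetical helper: pull a model into the Ollama container running on THIS node.
# Usage: pull_model <service-name> <model>
pull_model() {
  local service="$1" model="$2" cid
  # Swarm labels every task container with the name of its service.
  cid="$(docker ps --quiet --filter "label=com.docker.swarm.service.name=${service}" | head -n 1)"
  if [ -z "$cid" ]; then
    echo "no local container found for service ${service}" >&2
    return 1
  fi
  docker exec "$cid" ollama pull "$model"
}
```

For example, pull_model ollama_ollama llama3.2 pulls Llama 3.2 on the current node. Repeat it on every node in the cluster.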
Tip: For reproducible deployments, pin a specific version (e.g., ollama/ollama:0.5.x) instead of latest.
Part 2: OpenWebUI – The Beautiful Chat Interface
OpenWebUI gives you a ChatGPT-like frontend with RAG, voice input, image understanding, and multi-user support.
Recommended Stack – OpenWebUI + Ollama
version: '3.9'
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      mode: global
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - ai-network
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    deploy:
      replicas: 2 # Scale as needed
      resources:
        limits:
          memory: 4G
    ports:
      - '3000:8080'
    environment:
      - OLLAMA_BASE_URLS=http://ollama:11434
      - WEBUI_AUTH=true
    volumes:
      - openwebui_data:/app/backend/data
    networks:
      - ai-network
    depends_on: # Note: ignored by docker stack deploy
      - ollama
volumes:
  ollama_data:
  openwebui_data:
networks:
  ai-network:
    driver: overlay
    attachable: true
Deploy with:
docker stack deploy -c openwebui-stack.yml ai
The services then show up in SwarmCLI:
swarmcli status
[!TIP]
Mac Tip: Just like with standalone Ollama, if you are running on a Mac, remove the resources: reservations: block from the ollama service in your openwebui-stack.yml.
Access the UI at http://your-swarm-ip:3000.
[!TIP]
SwarmCLI Pro Tip: Use swarmcli service inspect ai_openwebui to quickly verify the replica status and ensure your environment variables are correctly propagated across the cluster. For a full list of service commands, see our CLI Documentation.
Part 3: High-Performance Inference with vLLM
Challenges on Swarm: vLLM needs --ipc=host or large shared memory for multi-GPU. Use constraints to pin services to GPU-rich nodes.
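Placement constraints only work once the nodes carry the matching label. As a small sketch, the helper below applies the gpu=high label used in the vLLM stack of this guide to a list of nodes. The function name label_gpu_nodes and the hostnames in the usage note are hypothetical.

```shell
# Hypothetical helper: label the given nodes so that the constraint
# "node.labels.gpu == high" matches them. Run on a manager node.
label_gpu_nodes() {
  local node
  for node in "$@"; do
    docker node update --label-add gpu=high "$node"
  done
}
```

For instance, label_gpu_nodes gpu-node-1 gpu-node-2 labels two nodes. You can confirm the result with docker node inspect --format '{{ .Spec.Labels }}' gpu-node-1.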
[!WARNING]
vLLM on Mac: vLLM is highly optimized for NVIDIA GPUs. While it can run on CPU, it is not recommended for Mac Docker setups. Use Ollama instead for a much better experience on macOS.
vLLM Stack Example
version: '3.9'
services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      replicas: 1 # Start with 1, scale later
      placement:
        constraints:
          - node.labels.gpu == high # Label your powerful nodes
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      # Use --flag=value so each list item is passed as one well-formed argument
      - --model=meta-llama/Llama-3.1-8B-Instruct
      - --gpu-memory-utilization=0.85
      - --max-model-len=8192
    ports:
      - '8000:8000'
    volumes:
      - hf_cache:/root/.cache/huggingface
    networks:
      - ai-network
    ipc: host # Best effort; alternatives exist
volumes:
  hf_cache:
networks:
  ai-network:
    driver: overlay
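Once the service is running, a quick smoke test against vLLM's OpenAI-compatible endpoint shows whether the model loaded. The sketch below assumes the service is published on port 8000 as in the stack above. vllm_chat is a hypothetical helper name, and the prompt is inserted into the JSON body verbatim, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: send one chat message to a vLLM OpenAI-compatible server.
# Usage: vllm_chat <base-url> <model> <prompt>
# Caveat: the prompt is not JSON-escaped, so keep it free of quotes and backslashes.
vllm_chat() {
  local base_url="$1" model="$2" prompt="$3"
  curl --silent --show-error --fail \
    --header 'Content-Type: application/json' \
    --data "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}" \
    "${base_url}/v1/chat/completions"
}
```

A call such as vllm_chat http://your-swarm-ip:8000 meta-llama/Llama-3.1-8B-Instruct "Say hello" should return a JSON completion. The model argument has to match the name passed to --model.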
Note on IPC: Swarm has limited support for --ipc=host. You can use a large tmpfs mount to /dev/shm as a workaround for tensor parallelism.
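One way to try the tmpfs workaround without editing the stack file is to add the mount to the running service. Treat this as a sketch under assumptions: add_shm_tmpfs is a hypothetical helper, the size you need depends on your model and GPU count, and the service name in the usage note assumes the stack was deployed as vllm.

```shell
# Hypothetical helper: mount a tmpfs of <size-in-GiB> at /dev/shm for a Swarm service.
# Usage: add_shm_tmpfs <service-name> <size-in-GiB>
add_shm_tmpfs() {
  local service="$1" gib="$2" bytes
  # tmpfs-size is given in bytes.
  bytes=$(( gib * 1024 * 1024 * 1024 ))
  docker service update \
    --mount-add "type=tmpfs,destination=/dev/shm,tmpfs-size=${bytes}" \
    "$service"
}
```

For example, add_shm_tmpfs vllm_vllm 8 requests an 8 GiB shared-memory mount. The update restarts the service's tasks, so expect the model to reload.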
Part 4: Production Best Practices in 2026
Special: Apple Silicon (M1/M2/M3) Performance Tips
If you are running this stack on a Mac Swarm for development:
- Allocate More RAM: LLMs are memory-hungry. Go to Docker Desktop Settings > Resources and allocate at least 8GB-12GB of RAM to the VM.
- Use Quantized Models: Stick to 4-bit or 5-bit quantization (e.g., llama3.2:3b-instruct-q4_K_M). They run significantly faster on CPUs.
- Keep it Small: Models like Phi-3 Mini or Gemma 2B will feel much snappier in CPU mode than a full Llama 70B.
1. Load Balancing & High Availability
Add Traefik or Nginx as a reverse proxy in front of your services.
2. Model Management & Caching
- Use shared volumes or GlusterFS/Ceph for model storage across nodes.
- Pre-pull popular models on all nodes.
3. Monitoring & Observability
Deploy Prometheus + Grafana + Node Exporter + NVIDIA DCGM for GPU metrics. SwarmCLI provides a built-in TUI that gives you an instant overview of your cluster's resource consumption, making it easier to spot nodes that are under-utilized or over-burdened by heavy LLM inference.
4. Security
- Enable authentication in OpenWebUI.
- Use Swarm secrets for API keys.
- Expose services only via a secure proxy. For enterprise environments, we recommend using SwarmCLI Business Edition's RBAC Proxy for automated mTLS and centralized access control (see the Proxy Setup Guide).
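As an illustration of the Swarm secrets point above, the sketch below stores a value as a secret without writing it to disk. create_secret and the secret name in the usage note are hypothetical. The trailing - tells docker secret create to read the value from standard input.

```shell
# Hypothetical helper: create a Swarm secret from a value passed as an argument.
# Usage: create_secret <secret-name> <value>
create_secret() {
  local name="$1" value="$2"
  # printf avoids the trailing newline that echo would add to the secret.
  printf '%s' "$value" | docker secret create "$name" -
}
```

For example, create_secret hf_token "$HF_TOKEN" stores a token taken from an environment variable. Services that list the secret in their stack file can then read it from /run/secrets/hf_token.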
5. Scaling Strategy
- Ollama: Global mode (one per node).
- OpenWebUI: Scale replicas based on users.
- vLLM: Add more replicas on GPU nodes with placement constraints.
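To make "scale replicas based on users" concrete, here is a sketch that derives a replica count from an expected number of concurrent users. The ratio of 25 users per replica is purely an illustrative assumption, not a measured figure, and both function names are hypothetical. The service name ai_openwebui matches the stack deployed earlier in this guide.

```shell
# Hypothetical helper: replicas needed for <users>, assuming <users-per-replica>
# (default 25, an illustrative guess). Always returns at least 1.
replicas_for_users() {
  local users="$1" per_replica="${2:-25}" n
  n=$(( (users + per_replica - 1) / per_replica ))  # ceiling division
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

# Hypothetical helper: scale the OpenWebUI service to fit the given user count.
scale_openwebui() {
  docker service scale "ai_openwebui=$(replicas_for_users "$@")"
}
```

Running scale_openwebui 60 scales to three replicas under the default ratio. Measure real memory and CPU usage before settling on a number for your own workload.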
Common Issues & Troubleshooting (2026)
- "Invalid device reservation" error: This happens on Mac or Linux systems without NVIDIA drivers. Remove the reservations block from your stack file.
- GPU not detected: Verify NVIDIA Container Toolkit and resource reservations.
- Model download slow: Use a shared cache volume.
- High latency: Try quantization (Q4_K_M, Q5_K_M) and adjust context length.
- Swarm networking issues: Ensure services are on the same overlay network.
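When a service refuses to start, the task error message usually points at the cause, including the device reservation error mentioned above. The following sketch prints only the tasks that carry an error. failed_tasks is a hypothetical helper name, and it uses the standard format placeholders of docker service ps.

```shell
# Hypothetical helper: show tasks of a service that report an error, with the node they ran on.
# Usage: failed_tasks <service-name>
failed_tasks() {
  docker service ps --no-trunc --format '{{.Name}}|{{.Node}}|{{.Error}}' "$1" |
    awk -F'|' '$3 != "" { print $1 " on " $2 ": " $3 }'
}
```

For example, failed_tasks ollama_ollama lists every failed Ollama task together with the full, untruncated error text.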
Advanced: Multi-Model Setup & RAG
You can run multiple specialized models (coding, general chat, vision) and route requests intelligently using a lightweight gateway or OpenWebUI’s built-in features.
Conclusion: Your Private AI Platform in 2026
With Docker Swarm and SwarmCLI, you can build a robust, private AI infrastructure that’s easy to maintain and scales with your needs. Whether you’re a homelabber running Llama 3.2 or a small team serving internal tools with vLLM, this stack delivers excellent performance without Kubernetes complexity.
Next Steps:
- Start with the Ollama + OpenWebUI stack.
- Add monitoring.
- Experiment with vLLM for higher throughput.
Why SwarmCLI?
By 2026, we noticed a gap. Docker Swarm was rock solid, but the management tooling felt stuck in 2017. SwarmCLI bridges that gap with:
Real-time Health: Stop guessing which node is throttled.
Atomic Secret Sync: One-command .env to Raft encryption.
Edge-Optimized: Built in Go for zero-overhead on ARM/RPi5 devices.


