DEV Community

eldara

How to Run Local AI Models on Docker Swarm in 2026 (Ollama, OpenWebUI, vLLM)

Running large language models (LLMs) locally keeps your data private, removes recurring API costs, and gives you full control over the stack. In 2026, Docker Swarm remains one of the simplest and most reliable ways for homelabs, small teams, and even light production setups to run these models with high availability and easy scaling.

This guide shows you how to deploy Ollama, OpenWebUI, and vLLM on a Docker Swarm cluster in 2026. Whether you are running on high-end NVIDIA servers or a cluster of Apple Silicon Macs, you'll find the right configuration here.

Hardware Compatibility & GPU Support

Before we dive in, it’s important to understand how Docker Swarm interacts with your hardware.

| Platform | GPU Acceleration | Best Use Case |
|---|---|---|
| Linux + NVIDIA | ✅ Native (NVIDIA Container Toolkit) | Production, heavy inference, large models |
| macOS (Intel/M-Series) | ❌ CPU only (in Docker VM) | Testing, development, small models |
| Windows + NVIDIA | ✅ WSL2 (experimental in Swarm) | Power-user workstations |

[!IMPORTANT]
A Note for Mac Users: Docker Desktop on macOS runs inside a Linux virtual machine. Currently, this VM cannot access the Mac's GPU (Metal) for Ollama or vLLM. If you are following this on a Mac, you must use the CPU-mode configurations provided in this guide.

Why Docker Swarm for Local AI in 2026?

Kubernetes is powerful but often overkill for local or small-scale AI workloads. Docker Swarm offers:

  • Simplicity — Use familiar docker stack deploy commands.
  • Low overhead — Ideal for homelabs or small clusters (3–8 nodes).
  • Native Docker Compose compatibility — Easy migration from single-node setups.
  • Built-in load balancing and service discovery — Great for serving multiple AI models.
  • Resilience — Automatic failover for your chat interface and inference services.

Many users now prefer Swarm for local AI because it strikes the perfect balance between simplicity and reliability. Check out our comparison between SwarmCLI, Portainer, and native Docker to see why it's the preferred lightweight choice for modern clusters. With tools like SwarmCLI, managing these workloads becomes even more intuitive, providing real-time visibility into your GPU nodes and service health.

Prerequisites

  • A Docker Swarm cluster (3+ manager nodes recommended for HA, so the cluster keeps quorum if one manager fails; a single node also works).
  • For GPU acceleration: NVIDIA GPUs with drivers and the NVIDIA Container Toolkit installed on every GPU node.
  • At least 16GB RAM (32GB+ recommended for larger models).
  • Adequate storage for models (models can be 5–150+ GB each).
  • Basic familiarity with Docker Compose/stack files.

Enable Swarm (if not already done):

docker swarm init
# On worker nodes:
docker swarm join --token <token> <manager-ip>:2377

Part 1: Deploying Ollama on Docker Swarm

Ollama is the easiest way to run popular models like Llama 3.2, Mistral, Gemma, and Phi-3.

Basic Ollama Stack File (ollama-stack.yml)

version: '3.9'

services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      mode: global # One instance per node
      restart_policy:
        condition: any # Swarm ignores the service-level `restart:` key
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - '11434:11434'
    networks:
      - ai-network

volumes:
  ollama_data:

networks:
  ai-network:
    driver: overlay
    attachable: true

Deploy it:

For Linux with NVIDIA GPUs, use the file above as-is. For macOS or CPU-only Linux, remove the entire resources: reservations: block under deploy from the YAML file; otherwise the deployment will fail with an "invalid device" error.

# Using standard docker
docker stack deploy -c ollama-stack.yml ollama

# Or verify node health first with SwarmCLI
swarmcli status
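
Depending on your Docker version, docker stack deploy may reject the devices key even on NVIDIA hosts, because the Compose v3 schema it validates against lists only cpus, memory, and generic_resources under reservations. If that happens, a Swarm-native reservation is one alternative. The sketch below assumes each GPU node advertises a generic resource named NVIDIA-GPU (for example through the node-generic-resources setting in /etc/docker/daemon.json) and runs containers with the NVIDIA runtime. The resource name is an assumption for illustration, not something defined elsewhere in this guide.

```yaml
# Sketch: Swarm-native GPU reservation for the ollama service.
# Assumes GPU nodes advertise a generic resource called NVIDIA-GPU.
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      mode: global
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'NVIDIA-GPU' # must match the name advertised by the node
                value: 1 # number of GPUs reserved per task
```

The rest of the service definition (volumes, ports, networks) stays the same as in the stack file above.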

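Pull models. Run these on the node where the Ollama container is running, since docker exec only reaches local containers. With the default local volume, each node keeps its own model store, so repeat the pull on every node that should serve the model: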

docker exec -it <ollama-container> ollama pull llama3.2
docker exec -it <ollama-container> ollama pull mistral

Tip: For reproducible deployments, pin a specific version (e.g., ollama/ollama:0.5.x) instead of latest; otherwise nodes that pull at different times can end up running different releases.

Part 2: OpenWebUI – The Beautiful Chat Interface

OpenWebUI gives you a ChatGPT-like frontend with RAG, voice input, image understanding, and multi-user support.

Recommended Stack – OpenWebUI + Ollama

version: '3.9'

services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      mode: global
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - ai-network

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    deploy:
      replicas: 2 # More than one replica needs shared storage or an external database for /app/backend/data
      resources:
        limits:
          memory: 4G
    ports:
      - '3000:8080'
    environment:
      - OLLAMA_BASE_URLS=http://ollama:11434
      - WEBUI_AUTH=true
    volumes:
      - openwebui_data:/app/backend/data
    networks:
      - ai-network
    # depends_on is omitted: docker stack deploy ignores it

volumes:
  ollama_data:
  openwebui_data:

networks:
  ai-network:
    driver: overlay
    attachable: true

Deploy with:

docker stack deploy -c openwebui-stack.yml ai

You will then see the services in SwarmCLI:

swarmcli status

[Screenshot: SwarmCLI AI Stacks Status]

[Screenshot: SwarmCLI Ollama Service Details]

[!TIP]
Mac Tip: Just like with standalone Ollama, if you are running on a Mac, remove the resources: reservations: block from the ollama service in your openwebui-stack.yml.

Access the UI at http://your-swarm-ip:3000.

[Screenshot: OpenWebUI Llama 3.2 Chat Interface]

[!TIP]
SwarmCLI Pro Tip: Use swarmcli service inspect ai_openwebui to quickly verify the replica status and ensure your environment variables are correctly propagated across the cluster. For a full list of service commands, see our CLI Documentation.

Part 3: High-Performance Inference with vLLM

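Challenges on Swarm: vLLM needs --ipc=host or a large shared-memory segment (/dev/shm) for multi-GPU tensor parallelism. Use placement constraints to pin the service to GPU-rich nodes. Label those nodes first with docker node update --label-add gpu=high <node-name>, which is what the node.labels.gpu == high constraint in the stack file matches.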

[!WARNING]
vLLM on Mac: vLLM is highly optimized for NVIDIA GPUs. While it can run on CPU, it is not recommended for Mac Docker setups. Use Ollama instead for a much better experience on macOS.

vLLM Stack Example

version: '3.9'

services:
  vllm:
    image: vllm/vllm-openai:latest
    deploy:
      replicas: 1 # Start with 1, scale later
      placement:
        constraints:
          - node.labels.gpu == high # Label your powerful nodes
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      # Each list item is passed as a single argument, so use --flag=value
      - --model=meta-llama/Llama-3.1-8B-Instruct
      - --gpu-memory-utilization=0.85
      - --max-model-len=8192
    ports:
      - '8000:8000'
    volumes:
      - hf_cache:/root/.cache/huggingface
    networks:
      - ai-network
    ipc: host # Ignored by docker stack deploy; use a tmpfs mount on /dev/shm instead

volumes:
  hf_cache:

networks:
  ai-network:
    driver: overlay

Note on IPC: Swarm services have no equivalent of --ipc=host, and docker stack deploy ignores the ipc key, so that line has no effect in a stack. As a workaround for tensor parallelism, mount a large tmpfs at /dev/shm.
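
A minimal sketch of the tmpfs workaround, using the long volume syntax (Compose file format 3.6 or later). The 8 GiB size is an illustrative assumption; size it to your model and tensor-parallel settings.

```yaml
services:
  vllm:
    volumes:
      - hf_cache:/root/.cache/huggingface
      # Shared memory for vLLM worker processes, in place of --ipc=host
      - type: tmpfs
        target: /dev/shm
        tmpfs:
          size: 8589934592 # 8 GiB, in bytes
```

The tmpfs lives in the node's RAM, so count it against the memory you leave free for the model itself.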

Part 4: Production Best Practices in 2026

Special: Apple Silicon (M1/M2/M3) Performance Tips

If you are running this stack on a Mac Swarm for development:

  1. Allocate More RAM: LLMs are memory-hungry. Go to Docker Desktop Settings > Resources and allocate at least 8GB-12GB of RAM to the VM.
  2. Use Quantized Models: Stick to 4-bit or 5-bit quantization (e.g., llama3.2:3b-instruct-q4_K_M). They run significantly faster on CPUs.
  3. Keep it Small: Models like Phi-3 Mini or Gemma 2B will feel much snappier in CPU mode than a full Llama 70B.

1. Load Balancing & High Availability

Add Traefik or Nginx as a reverse proxy in front of your services.
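
As a rough sketch of the Traefik option: in Swarm mode, Traefik reads routing labels from the service (under deploy), not from the container. The fragment below assumes a Traefik instance is already deployed with its Swarm provider enabled and attached to ai-network. The hostname chat.example.internal and the router/service name openwebui are placeholders.

```yaml
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    networks:
      - ai-network
    deploy:
      labels:
        - traefik.enable=true
        # Route requests for this (placeholder) hostname to OpenWebUI
        - "traefik.http.routers.openwebui.rule=Host(`chat.example.internal`)"
        # Swarm services must declare the container port explicitly
        - traefik.http.services.openwebui.loadbalancer.server.port=8080
```

With the proxy in place, you can drop the published 3000:8080 port so that OpenWebUI is reachable only through Traefik.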

2. Model Management & Caching

  • Use shared volumes or GlusterFS/Ceph for model storage across nodes.
  • Pre-pull popular models on all nodes.
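
One simple way to share model storage is an NFS export mounted through Docker's built-in local volume driver. NFS is used here only as an illustration; it is not the GlusterFS/Ceph setup mentioned above. The server address nfs.example.internal and the export path /exports/ollama are placeholders.

```yaml
volumes:
  ollama_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=nfs.example.internal,rw,nfsvers=4.1
      device: ':/exports/ollama'
```

Every node then mounts the same export when its Ollama task starts, so a model pulled once is visible cluster-wide. Expect slower model load times than with local disks.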

3. Monitoring & Observability

Deploy Prometheus + Grafana + Node Exporter + NVIDIA DCGM for GPU metrics. SwarmCLI provides a built-in TUI that gives you an instant overview of your cluster's resource consumption, making it easier to spot nodes that are under-utilized or over-burdened by heavy LLM inference.
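
A minimal Prometheus scrape configuration sketch for those exporters. The service names node-exporter and dcgm-exporter are assumptions (use whatever you name them in your stack); ports 9100 and 9400 are the exporters' defaults. Swarm's DNS resolves tasks.<service> to the IP of every running task, so Prometheus discovers one target per task.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: node-exporter
    dns_sd_configs:
      - names: ['tasks.node-exporter']
        type: A
        port: 9100
  - job_name: dcgm-exporter
    dns_sd_configs:
      - names: ['tasks.dcgm-exporter']
        type: A
        port: 9400
```

Prometheus must be attached to the same overlay network as the exporters for these names to resolve.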

4. Security

  • Enable authentication in OpenWebUI.
  • Use Swarm secrets for API keys.
  • Expose services only via a secure proxy. For enterprise environments, we recommend using SwarmCLI Business Edition's RBAC Proxy for automated mTLS and centralized access control (see the Proxy Setup Guide).
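
A minimal sketch of attaching a Swarm secret to a service. The secret name hf_token and its use for a Hugging Face access token are illustrative assumptions. Swarm mounts the secret as a file under /run/secrets/, not as an environment variable, so the container's entrypoint has to read that file.

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    secrets:
      - hf_token # available in the container at /run/secrets/hf_token

secrets:
  hf_token:
    external: true # created beforehand, e.g. docker secret create hf_token token.txt
```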

5. Scaling Strategy

  • Ollama: Global mode (one per node).
  • OpenWebUI: Scale replicas based on users.
  • vLLM: Add more replicas on GPU nodes with placement constraints.

Common Issues & Troubleshooting (2026)

  • "Invalid device reservation" error — This happens on Mac or Linux systems without NVIDIA drivers. Remove the reservations block from your stack file.
  • GPU not detected — Verify NVIDIA Container Toolkit and resource reservations.
  • Model download slow — Use a shared cache volume.
  • High latency — Try quantization (Q4_K_M, Q5_K_M) and adjust context length.
  • Swarm networking issues — Ensure services are on the same overlay network.

Advanced: Multi-Model Setup & RAG

You can run multiple specialized models (coding, general chat, vision) and route requests intelligently using a lightweight gateway or OpenWebUI’s built-in features.
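
As one possible setup, OpenWebUI can list models from Ollama and vLLM side by side, because vLLM exposes an OpenAI-compatible API under /v1. The sketch assumes both backends run in the same stack (or share an external overlay network) so that the names ollama and vllm resolve. The API key value is a placeholder, which vLLM only checks when started with --api-key.

```yaml
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URLS=http://ollama:11434
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=placeholder
    networks:
      - ai-network
```

Users then pick a model per conversation in the OpenWebUI model selector.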

Conclusion: Your Private AI Platform in 2026

With Docker Swarm and SwarmCLI, you can build a robust, private AI infrastructure that’s easy to maintain and scales with your needs. Whether you’re a homelabber running Llama 3.2 or a small team serving internal tools with vLLM, this stack delivers excellent performance without Kubernetes complexity.

Next Steps:

  1. Start with the Ollama + OpenWebUI stack.
  2. Add monitoring.
  3. Experiment with vLLM for higher throughput.

Why SwarmCLI?

By 2026, we noticed a gap. Docker Swarm was rock solid, but the management tooling felt stuck in 2017. SwarmCLI bridges that gap with:

  • Real-time Health: Stop guessing which node is throttled.
  • Atomic Secret Sync: One-command .env to Raft encryption.
  • Edge-Optimized: Built in Go for zero-overhead on ARM/RPi5 devices.
