Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or it is not.
This post focuses on one goal: a reproducible local or single-node Ollama "server" using Docker Compose, with GPU acceleration and persistent model storage.
It intentionally skips generic Docker and Compose basics. When you need a compact list of the commands you reach for most often (images, containers, volumes, docker compose), the Docker Cheatsheet is a good companion.
When you want HTTPS in front of Ollama, correct streaming and WebSocket proxying, and edge controls (auth, timeouts, rate limits), see Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming.
For how Ollama fits alongside vLLM, Docker Model Runner, LocalAI, and cloud hosting trade-offs, see LLM Hosting in 2026: Local, Self-Hosted & Cloud Infrastructure Compared.
When Compose beats a bare metal install
A native install is frictionless for one developer on one machine. As soon as more people, more services, or a longer-lived server enter the picture, Compose starts to win on ergonomics.
A team setup benefits because the service definition is a file you can review, version, and share. A single-node server benefits because upgrades turn into an image tag bump and a restart, while your model storage stays put (as long as it is on a volume). Ollama also tends to live next to sidecars: a Web UI, a reverse proxy, an auth gateway, a vector DB, or an agent runtime. Compose is good at "one command to start the whole stack", without turning your host into a snowflake.
This approach aligns well with how the official Ollama container is designed: the image runs ollama serve by default, exposes port 11434, and is meant to keep state under a mountable directory.
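For comparison, the documented docker run invocation covers the same ground in a single command; the Compose file later in this post simply makes the same flags reviewable and versionable:

```shell
# CPU-only start per the official Ollama Docker instructions:
# persistent state on a named volume, API published on 11434.
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```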
A Compose skeleton that is actually useful for Ollama
Start with two decisions:
First, how you will pin versions. The Docker Hub image is ollama/ollama, so you can pin a specific tag in .env instead of relying on latest.
Second, where the model data will live. The official docs mount a volume to /root/.ollama so models are not re-downloaded each time the container is replaced.
Here is a Compose file that bakes those decisions in, and keeps the "knobs" close to the service:
```yaml
services:
  ollama:
    image: ollama/ollama:${OLLAMA_IMAGE_TAG:-latest}
    container_name: ollama
    restart: unless-stopped
    # Keep it local by default, expose it later if you need to.
    ports:
      - "${OLLAMA_BIND_IP:-127.0.0.1}:11434:11434"
    # Persistent models and server state.
    volumes:
      - ollama:/root/.ollama
    environment:
      # The official image already defaults to 0.0.0.0:11434 inside the container,
      # but keeping it explicit helps when you override things later.
      - OLLAMA_HOST=0.0.0.0:11434
      # Service tuning.
      - OLLAMA_KEEP_ALIVE=${OLLAMA_KEEP_ALIVE:-5m}
      - OLLAMA_NUM_PARALLEL=${OLLAMA_NUM_PARALLEL:-1}
      - OLLAMA_MAX_LOADED_MODELS=${OLLAMA_MAX_LOADED_MODELS:-1}
      # Optional, but relevant when a browser-based UI talks to Ollama directly.
      # See the Networking section for why this exists.
      - OLLAMA_ORIGINS=${OLLAMA_ORIGINS:-}
    # GPU reservation is a separate section below.
    # Add it only on hosts that actually have NVIDIA GPUs.

volumes:
  ollama: {}
```
A matching .env keeps upgrades boring:
```
# Pin the image version you have tested.
OLLAMA_IMAGE_TAG=latest
# Local by default. Change to 0.0.0.0 when you intentionally expose it.
OLLAMA_BIND_IP=127.0.0.1
# Keep-alive tweaks cold-start latency vs memory footprint.
OLLAMA_KEEP_ALIVE=5m
# Concurrency knobs.
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
# Leave empty unless you are serving browser clients that hit Ollama directly.
OLLAMA_ORIGINS=
```
A small but important nuance: on a native install, Ollama defaults to binding 127.0.0.1:11434, but the official container image sets OLLAMA_HOST=0.0.0.0:11434 so the service is reachable through published ports.
If you want a quick sanity check without involving any client SDKs, the Ollama API includes a "list local models" endpoint at GET /api/tags.
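A minimal sketch of that check from the host, assuming the port mapping from the Compose file above (what the model list contains depends on what you have pulled; a fresh install returns an empty array):

```shell
# List the models Ollama has stored locally.
curl -s http://localhost:11434/api/tags
```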
Persistent model storage and the least painful way to move it
If you only remember one thing, make it this: the container must have persistent storage, otherwise every rebuild is a re-download.
Ollama lets you choose the models directory using OLLAMA_MODELS. In the reference implementation, the default is $HOME/.ollama/models, and setting OLLAMA_MODELS overrides that.
Inside the official Docker image, $HOME is /root, so the default store lands under /root/.ollama, which is exactly why the official docker run examples mount a volume at that path.
There are two storage patterns that tend to work well in practice:
A named Docker volume is simplest and portable. It is also easy to accidentally orphan, so it is worth naming it intentionally (for example ollama) and keeping it stable across Compose refactors.
A bind mount to a dedicated disk is better when model sizes start to dominate your root filesystem. In that case, you either mount the whole /root/.ollama to that disk, or you mount a custom directory and point OLLAMA_MODELS at it.
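As a sketch of the second pattern, assuming a hypothetical data disk mounted at /mnt/models on the host (both the host path and the in-container path are illustrative):

```yaml
services:
  ollama:
    volumes:
      # Bind-mount a dedicated disk instead of a named volume.
      - /mnt/models/ollama:/data/ollama
    environment:
      # Point Ollama's model store at the bind-mounted directory.
      - OLLAMA_MODELS=/data/ollama
```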
If you are actively reorganising storage, this is where an explicit "move models" playbook helps. See: move-ollama-models.
NVIDIA GPU support with Compose and the NVIDIA Container Toolkit
Ollama can use NVIDIA GPUs in Docker, but the image cannot magic a GPU into existence. The host needs working NVIDIA drivers and the NVIDIA Container Toolkit, and Docker must be configured to use it. The Ollama Docker docs explicitly call out installing nvidia-container-toolkit, configuring the runtime via nvidia-ctk runtime configure --runtime=docker, and restarting Docker.
On the Compose side, the clean, modern way is device reservations. Docker documents GPU access in Compose using deploy.resources.reservations.devices, with capabilities: [gpu], driver: nvidia, and either count (including all) or device_ids.
Add this to the ollama service when you are on an NVIDIA host:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```
If you have multiple GPUs and want to keep Ollama on specific devices, switch from count to device_ids as documented by Docker (they are mutually exclusive).
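A sketch of that variant, with hypothetical GPU indices 0 and 1 (check yours with nvidia-smi -L):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          # Pin Ollama to specific GPUs; device_ids and count are mutually exclusive.
          device_ids: ["0", "1"]
          capabilities: [gpu]
```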
You will sometimes see legacy Compose examples that use runtime: nvidia. That can fail on newer setups with errors like "unknown or invalid runtime name: nvidia", which is a strong hint that you should move to the supported device reservation pattern and make sure the toolkit is configured on the host.
A useful detail hiding in plain sight: the official ollama/ollama image sets NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=compute,utility. These are standard knobs recognised by the NVIDIA container runtime, and they are already present unless you overwrite them.
To confirm whether you are actually getting GPU inference (not just a container that starts), Ollama recommends using ollama ps and checking the "Processor" column, which shows whether the model is on GPU memory.
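One way to run that check through the container, assuming the container name ollama from the Compose file above and an example model you have already pulled (llama3.2 is illustrative):

```shell
# Trigger a load, then check where the model is running.
docker exec -it ollama ollama run llama3.2 "hello" > /dev/null
# The Processor column should report GPU (not CPU) for GPU inference.
docker exec -it ollama ollama ps
```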
Platform reality check: Ollama notes that GPU acceleration in Docker is available on Linux (and Windows with WSL2), and not available on Docker Desktop for macOS due to the lack of GPU passthrough.
Networking choices: host vs bridge, ports, and CORS
Networking is where most "it runs but my app cannot connect" bugs come from.
Bridge networking with published ports
The default Compose network is a bridge network. In this setup, publishing 11434:11434 makes Ollama reachable from the host on port 11434, while other containers should talk to it using the service name ollama (not localhost). A lot of people trip on this because localhost inside a container means "this container", not "the Ollama container".
Ollama itself runs an HTTP server on port 11434 (the image exposes it), and the common convention is that clients use http://localhost:11434 on the host when ports are published.
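As a sketch of a sidecar client in the same Compose file: my-app and OLLAMA_BASE_URL are illustrative stand-ins, so substitute whatever image and variable your client actually reads. The point is the hostname.

```yaml
services:
  app:
    image: my-app:latest   # hypothetical client service
    environment:
      # Inside the Compose network, use the service name, not localhost.
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
```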
Host networking
network_mode: host can be tempting on a single-node server because it removes port publishing and makes localhost semantics simpler. The trade-off is you lose the isolation and namespacing benefits of a bridge network, and you are more likely to hit port conflicts.
Exposing Ollama intentionally
Ollama on a normal install binds to 127.0.0.1 by default, and the documented way to change the bind address is OLLAMA_HOST.
In Docker, you have two layers:
Ollama bind address, controlled by OLLAMA_HOST (the container image defaults to binding on all interfaces inside the container).
Reachability from outside the container, controlled by Compose ports and the host firewall.
A pattern I like is "bind locally by default" via 127.0.0.1:11434:11434, then switch to 0.0.0.0:11434:11434 only when I have a reason to expose it.
Browser clients and OLLAMA_ORIGINS
If a browser-based UI or extension calls Ollama directly, you are in CORS territory. Ollama allows cross-origin requests from 127.0.0.1 and 0.0.0.0 by default, and you can configure additional origins using OLLAMA_ORIGINS.
This matters even on a single node, because "it works with curl" does not mean "it works from a browser app".
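As a sketch, a .env entry that allows a hypothetical web app origin to call Ollama directly (app.example.com is a placeholder; OLLAMA_ORIGINS takes a comma-separated list):

```
# Allow a browser app served from another origin to hit Ollama's API.
OLLAMA_ORIGINS=https://app.example.com
```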
Upgrade and rollback patterns that fit a single-node server
Ollama evolves quickly. Your Compose file can make that a calm process instead of a late-night surprise.
Upgrade by bumping a tag, not by hoping "latest" behaves
The most practical upgrade strategy is to pin the image to a known-good tag in .env, and bump it intentionally. The image is published as ollama/ollama on Docker Hub.
Because model data and server state are stored under a mounted directory (/root/.ollama in the official docs), replacing the container does not imply re-downloading models.
Rollback is just switching the tag back
Rollback is the same mechanism in reverse: set the previous tag, recreate the container, keep the same volume. This is where pinning pays for itself.
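A sketch of the rollback using the .env pattern from earlier; 0.5.7 is a placeholder for whatever tag you were previously running:

```shell
# Point .env back at the previous known-good tag, then recreate the container.
sed -i 's/^OLLAMA_IMAGE_TAG=.*/OLLAMA_IMAGE_TAG=0.5.7/' .env
docker compose up -d ollama
# The named volume is untouched, so models do not re-download.
```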
Data migration is mostly about storage paths
Most "migrations" in a single-node setup are not about database schemas. They are about disk layout. If you change the models directory (via OLLAMA_MODELS) or move the mounted volume to a new disk, you are doing a data migration whether you call it that or not.
If you want a practical guide for reorganising the model directory on real machines, see: move-ollama-models.
A final note that is easy to miss: Ollama's API documentation explicitly says the API is expected to be stable and backwards compatible, with rare deprecations announced in release notes. That makes "upgrade the server, keep clients working" a reasonable default expectation for a single-node service endpoint.
Common failures: GPU permissions, driver mismatch, and OOM
This section is deliberately symptom-driven. The goal is not "every possible Docker error", only the failures that show up specifically in Ollama + GPU + persistent storage setups.
GPU visible on the host, missing in the container
If the host has a working NVIDIA driver but the container does not see a GPU, the common causes are:
The NVIDIA Container Toolkit is not installed or the Docker runtime is not configured via nvidia-ctk. Ollama's Docker docs call this out directly.
Compose is not reserving a GPU device. The supported way is deploy.resources.reservations.devices with the gpu capability as documented by Docker.
A legacy runtime: nvidia configuration is being used on a daemon that does not recognise it, producing "unknown or invalid runtime name: nvidia".
For validation, ollama ps gives you a pragmatic check: it shows whether a model is loaded in GPU memory.
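A pragmatic top-to-bottom check, working from host to container to Ollama (the plain ubuntu image is enough for step 2 because the NVIDIA runtime injects nvidia-smi into the container when the toolkit is configured):

```shell
# 1. Does the host driver work at all?
nvidia-smi
# 2. Is the toolkit wired into Docker?
docker run --rm --gpus all ubuntu nvidia-smi
# 3. Is Ollama actually using the GPU for loaded models?
docker exec -it ollama ollama ps
```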
Permission denied on GPU devices
The "permission denied" flavour of GPU failures typically points to environment constraints rather than Ollama itself. Examples include running rootless Docker, security policies, or device nodes not being exposed as expected. The Docker Compose GPU support docs are explicit that the host must have GPU devices and that the Docker daemon must be set accordingly.
When in doubt, reduce the variables: confirm the toolkit configuration (host), then confirm GPU reservation (Compose), then confirm GPU usage (ollama ps).
Wrong driver, wrong expectation
Ollama in Docker relies on the host driver stack. If the host driver is missing, too old, or misconfigured, you will see failures that look like "Ollama is broken" but are really "CUDA stack is not usable". The official docs place the container toolkit and Docker daemon configuration as prerequisites for NVIDIA GPU usage.
Out of memory: VRAM or RAM disappears fast
OOM is the most predictable failure mode for local inference, and it is usually self-inflicted by configuration.
Ollama supports concurrent processing through multiple loaded models and parallel request handling, but it is constrained by available memory (system RAM on CPU inference, VRAM on GPU inference). When GPU inference is used, new models must fit in VRAM to allow concurrent model loads.
Two configuration details are worth treating as first-class "server settings":
OLLAMA_NUM_PARALLEL increases parallel request processing per model, but required memory scales with OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH.
OLLAMA_KEEP_ALIVE controls how long models remain loaded (default is 5 minutes). Keeping models loaded reduces cold-start latency, but it also pins memory.
If you are stabilising a single-node service under load, the non-dramatic fixes usually look like:
Lower parallelism and context defaults before you change anything else.
Limit how many models are allowed to remain loaded concurrently.
Consider memory-reduction features like Flash Attention (OLLAMA_FLASH_ATTENTION=1) and lower precision K/V cache types (OLLAMA_KV_CACHE_TYPE) when your bottleneck is memory, not raw compute.
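Pulled together, a memory-lean .env sketch for a constrained node; the values are starting points under the assumptions above, not recommendations:

```
# Serve one request stream per model, one model at a time.
OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1
# Free memory sooner at the cost of more cold starts.
OLLAMA_KEEP_ALIVE=2m
# Memory-reduction features: Flash Attention plus a quantised K/V cache.
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
```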
When it is not Ollama: choosing Docker Model Runner instead
Sometimes the "failure" is really a tooling mismatch. If your organisation already standardises on Docker-native artifacts and workflows, Docker Model Runner (DMR) can be a better fit than running Ollama as a long-lived service container.
Docker positions DMR as a way to manage, run, and serve models directly via Docker, pulling from Docker Hub or other OCI registries, and serving OpenAI-compatible and Ollama-compatible APIs.
It also supports multiple inference engines (including llama.cpp, and vLLM on Linux with NVIDIA GPUs), which can matter if you care about throughput characteristics, not just "run one model locally".
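For a feel of the workflow, a hedged sketch of the DMR CLI (the model name is illustrative; DMR pulls OCI model artifacts from Docker Hub's ai/ namespace):

```shell
# Pull a model artifact, run a one-shot prompt, and list what is cached.
docker model pull ai/smollm2
docker model run ai/smollm2 "Say hello"
docker model list
```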
If you want a practical command reference and a deeper comparison angle, see: Docker Model Runner Cheatsheet: Commands & Examples.