Mike

Posted on Jun 28

Self-Hosted LLMs with Docker, Ollama, and Open WebUI

#ai #selfhosted #docker #tutorial

Cloud AI is convenient. Open a browser, type a prompt, get an answer. But it comes with three things that bug me: API bills that surprise you at the end of the month, rate limits that kick in when you're in the middle of something, and the knowledge that every conversation you have is sitting on someone else's server.
I've been running local LLMs on my homelab for months — on everything from a 10-year-old Xeon server to a $150 mini PC. The setup that's stuck with me? Docker plus Ollama plus Open WebUI. Three pieces. Five minutes. Your own private ChatGPT.
Here's how to set it up — properly, with security hardening, API access, GPU passthrough, and integrations for the tools you probably already run.

What You're Building

By the end of this post, you'll have:

Ollama running in Docker — pulling and serving LLMs (Gemma, Llama, Phi, Mistral, anything in the Ollama library)
Open WebUI running in Docker — a clean ChatGPT-style interface that talks to Ollama
A single docker-compose.yml file that starts everything with one command
Models that run entirely on your hardware, with local model inference and no prompts sent to a cloud LLM provider
API access that any OpenAI-compatible tool can use

You need Docker installed and at least 8 GB of RAM free. That's it.
If you're new to Docker, I wrote a deep dive on how containers actually work under the hood. You don't need to read it for this guide, but knowing what docker run actually does makes troubleshooting a lot less mysterious.

Step 1: Ollama in Docker

Ollama is the engine that downloads and runs models. It exposes a REST API on port 11434 that any client (including Open WebUI) can talk to. Under the hood, it wraps llama.cpp, the C++ inference engine that makes CPU inference practical.
The official Ollama Docker image lives at ollama/ollama on Docker Hub. Here's the simplest command to get it running:

docker run -d \
  --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama_data:/root/.ollama \
  --restart unless-stopped \
  ollama/ollama

Let me break that down:

-d runs it in the background (detached mode)
--name ollama gives the container a fixed name so you can reference it later
-p 127.0.0.1:11434:11434 binds Ollama's API to localhost only — this is how Open WebUI and API clients on the same machine reach Ollama
-v ollama_data:/root/.ollama creates a named Docker volume for your downloaded models. Without this, every time you recreate the container, you'd have to re-download every model. With the volume, models persist across container restarts.
--restart unless-stopped ensures Ollama comes back after a reboot

Wait a few seconds for the container to start, then verify:

docker ps | grep ollama

You should see the ollama/ollama container with status Up.
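
If you'd rather ask the API itself than the container list, here's a minimal Python sketch using only the standard library. It assumes the port mapping from the command above and Ollama's /api/version endpoint; parse_version and ollama_version are names I made up for illustration.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # matches the -p mapping used above


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def ollama_version(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return Ollama's version string, or None if the API is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        return None


# Usage (needs the container from Step 1 running):
# print(ollama_version() or "Ollama is not reachable yet")
```

A None result right after docker run usually just means the server is still starting, so retrying after a second or two is reasonable.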

Step 2: Pull Your First Model

Ollama's container is running, but it's empty — no models yet. You pull models by running ollama pull inside the container:

docker exec -it ollama ollama pull llama3.2:3b

This downloads Meta's Llama 3.2 3B model — about 2 GB when quantized. It's small enough to run on almost anything (I've run similar models on a $150 BMAX Pro 8 with 24 GB of RAM) but smart enough for real work.
While it downloads, here's what's happening: Ollama is fetching the GGUF quantized weights from its model registry and storing them in /root/.ollama/models inside the container (which maps to the ollama_data volume on your host). Once downloaded, the model is ready to serve; the first prompt takes a few extra seconds while the weights load into RAM.
After the pull finishes, test it:

docker exec -it ollama ollama run llama3.2:3b "Explain what a container is in one sentence"

You should see a streaming response appear in your terminal. Ollama is working.
Some models worth pulling depending on your hardware:
| Model | RAM Needed | Good For |
|-------|-----------|----------|
| llama3.2:3b | 4–6 GB | General chat, summaries |
| gemma3:4b | 6–8 GB | Quick answers, teaching material |
| mistral:7b | 8–12 GB | Longer reasoning |
| qwen2.5-coder:7b | 8–12 GB | Coding help |
| phi4:14b | 16–24 GB | Technical writing, coding |
| qwen3:30b | 24–32 GB | Complex reasoning, vision |
On machines with limited RAM, stick to the 3B–4B range. They're fast and surprisingly capable for everyday tasks. I've benchmarked models on everything from a 10-year-old Xeon to a $150 mini PC, and the tokens-per-second numbers might surprise you.
To see which models are loaded and where they're running (GPU vs CPU):

docker exec -it ollama ollama ps
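
The same information is available over HTTP from the /api/ps endpoint, which is handy for monitoring scripts. This is a rough sketch: it assumes each entry in the response carries name, size, and size_vram fields, and the helper names are hypothetical.

```python
import json
import urllib.request


def gpu_share(model: dict) -> float:
    """Fraction of a loaded model's memory that sits in VRAM (0.0 = all CPU)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def summarize_ps(payload: dict) -> list[str]:
    """Turn an /api/ps payload into lines like 'llama3.2:3b: 100% GPU'."""
    return [
        f"{m['name']}: {round(gpu_share(m) * 100)}% GPU"
        for m in payload.get("models", [])
    ]


def loaded_models(base_url: str = "http://127.0.0.1:11434") -> list[str]:
    """Fetch /api/ps from a running Ollama and summarize it."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return summarize_ps(json.load(resp))


# Usage (needs a running Ollama with at least one model loaded):
# print("\n".join(loaded_models()))
```

An empty list just means nothing is loaded right now; send a prompt first and check again.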

Step 3: Open WebUI — The ChatGPT-Style Interface

Ollama's API is great for scripts. For humans? You want a chat interface. Open WebUI is the gold standard — it's a self-hosted, feature-rich web app that looks and feels like ChatGPT but talks to your local Ollama instance.
Open WebUI ships as a Docker image on GitHub Container Registry: ghcr.io/open-webui/open-webui:main.
Here's the command:

docker run -d \
  --name open-webui \
  -p 127.0.0.1:3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui_data:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

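The key flag here is --add-host=host.docker.internal:host-gateway. It maps the name host.docker.internal to the host machine, so the Open WebUI container can reach Ollama through the port published on the host. Without it, the container can't see Ollama running in a separate container. One caveat on Linux: host-gateway resolves to the Docker bridge address, not 127.0.0.1, so a port published only on 127.0.0.1 (as in Step 1) may refuse the connection. If that happens, skip ahead to the Compose setup in Step 4, where both containers share a network and talk by service name.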
Once it's up, open your browser and go to http://localhost:3000. Create an admin account (the first account you create becomes admin) and you'll land on the chat screen. Open WebUI should auto-detect your local Ollama instance — if it doesn't, go to Settings → Connections and check the Ollama Base URL (http://host.docker.internal:11434).
You should see llama3.2:3b in the model dropdown. Select it and start chatting.

Step 4: Docker Compose — One File, One Command

Running two separate docker run commands works, but it's messy. Docker Compose lets you define both services in a single file and manage them together. More importantly, it lets containers talk to each other by service name — no --add-host hacks needed.
Create a docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # Uncomment for Nvidia GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "127.0.0.1:3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui_data:/app/backend/data
    restart: unless-stopped
    depends_on:
      - ollama
volumes:
  ollama_data:
  open-webui_data:

Notice OLLAMA_BASE_URL=http://ollama:11434 — inside the Compose network, containers can reach each other by service name. No host-gateway trickery needed.
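If the standalone containers from Steps 1 and 3 still exist, remove them first with docker rm -f ollama open-webui, since the Compose file reuses their names and ports. Then save the file and run: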

docker compose up -d

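That's it. Both containers start. Your models are persisted in Docker volumes. Reboot your machine and everything comes back. Compose creates its own project-prefixed volumes, so a model pulled into the standalone container in Step 2 won't show up until you pull it again. Pull a model at any time: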

docker exec -it ollama ollama pull llama3.2:3b

Step 5: GPU Acceleration

CPU inference works — I do it daily on my Xeon. But if you have a GPU, you want it working for you. Token generation jumps from ~10 tok/s (CPU) to ~50+ tok/s (GPU) on modest hardware.

Nvidia GPU

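Install the Nvidia Container Toolkit (this assumes NVIDIA's apt repository is already configured on the host; the package isn't in the default Ubuntu or Debian repos):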

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then uncomment the deploy: block in your compose file and restart:

docker compose down && docker compose up -d

If you're using the standalone docker run approach instead of Compose, add --gpus=all:

docker run -d \
  --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama_data:/root/.ollama \
  --gpus=all \
  --restart unless-stopped \
  ollama/ollama

Ollama detects CUDA automatically. Run docker exec -it ollama ollama ps after loading a model — it'll show 100% GPU if passthrough is working.

AMD GPU (ROCm)

For AMD cards, use the ROCm-tagged Ollama image:

ollama:
  image: ollama/ollama:rocm
  devices:
    - /dev/kfd
    - /dev/dri

Check GPU support with rocminfo on the host first. ROCm requires v7 drivers and specific GFX targets — not all AMD GPUs are supported.

Intel / Vulkan

The standard ollama/ollama image bundles Vulkan support for Intel GPUs. Install Mesa Vulkan drivers on the host (sudo apt install mesa-vulkan-drivers) and Ollama picks them up. Disable with OLLAMA_VULKAN=0 if needed.
Key insight for LLM inference: It's memory-bandwidth-bound, not compute-bound. CPU inference works well with enough RAM bandwidth (DDR4/DDR5 multi-channel). A used server with 128 GB of DDR3 can run 26B models that rival cloud offerings — all on CPU.
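
To put a number on "memory-bandwidth-bound", here's a back-of-the-envelope sketch. It assumes each generated token streams roughly the whole set of weights from RAM once, uses theoretical peak bandwidth (64-bit channels, so 8 bytes per transfer), and ignores the KV cache and compute time, so treat the result as a ceiling rather than a benchmark. The function names are mine.

```python
def peak_bandwidth_gbs(mega_transfers: float, channels: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (a 64-bit channel moves 8 bytes)."""
    return mega_transfers * 1e6 * bytes_per_transfer * channels / 1e9


def max_tokens_per_second(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper bound: every token reads roughly the whole model from RAM once."""
    return bandwidth_gbs / model_size_gb


ddr4_dual = peak_bandwidth_gbs(3200, channels=2)  # dual-channel DDR4-3200
print(f"{ddr4_dual:.1f} GB/s")  # 51.2 GB/s
print(f"{max_tokens_per_second(ddr4_dual, 2.0):.1f} tok/s ceiling for a 2 GB model")  # 25.6
```

Real throughput lands below that ceiling, but the ratio explains why more memory channels or a smaller quantized model helps more than extra CPU cores.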

API Usage — Native and OpenAI-Compatible

This is where the stack gets powerful. Ollama exposes two API surfaces — a native REST API for full control, and OpenAI-compatible endpoints so any tool that speaks OpenAI just works.

Native REST API

The native API lives at http://localhost:11434/api. Here's a chat completion with system prompt and temperature control:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "system", "content": "You are a helpful teaching assistant for IGCSE Computer Science."},
    {"role": "user", "content": "Explain the fetch-execute cycle in three bullet points."}
  ],
  "stream": false,
  "options": {"temperature": 0.3}
}'
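
For scripts, the same call is easy to make from Python with nothing but the standard library. This sketch mirrors the curl request above; build_chat_request and chat are hypothetical helper names, and the 120-second timeout is an arbitrary choice for slow CPU inference.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.3) -> urllib.request.Request:
    """Build a non-streaming POST to /api/chat, mirroring the curl call."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, system: str, user: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, system, user)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["message"]["content"]


# Usage (needs a running Ollama with the model pulled):
# print(chat("llama3.2:3b", "You are a helpful teaching assistant.",
#            "Explain the fetch-execute cycle in three bullet points."))
```

Splitting request building from sending keeps the payload easy to inspect or log before anything hits the model.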

Structured output (JSON schema mode) — force the model to return data in a predictable shape:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [{"role": "user", "content": "List 3 IGCSE CS topics with difficulty ratings."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "topics": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "difficulty": {"type": "string"}
          },
          "required": ["name", "difficulty"]
        }
      }
    },
    "required": ["topics"]
  }
}'

This is invaluable for automation — you get guaranteed JSON back, not a rambling paragraph.
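
On the receiving end, the structured reply arrives as a JSON string inside message.content, so it still needs decoding. Here's a small sketch of that step for the schema above; parse_topics is a made-up helper, and the extra field check is cheap insurance against a truncated or otherwise malformed reply.

```python
import json


def parse_topics(response: dict) -> list[dict]:
    """Decode message.content (a JSON string) and check the expected shape."""
    data = json.loads(response["message"]["content"])
    topics = data["topics"]
    for topic in topics:
        missing = {"name", "difficulty"} - topic.keys()
        if missing:
            raise ValueError(f"topic is missing fields: {sorted(missing)}")
    return topics


# A canned response in the shape /api/chat returns with "stream": false
canned = {
    "message": {
        "role": "assistant",
        "content": '{"topics": [{"name": "Binary", "difficulty": "easy"}]}',
    }
}
for topic in parse_topics(canned):
    print(topic["name"], "-", topic["difficulty"])  # Binary - easy
```

The schema constrains the shape, not the facts, so the values themselves still deserve a human glance before they go into anything student-facing.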
Other useful native endpoints:
| Endpoint | Purpose |
|----------|---------|
| POST /api/generate | Single-prompt text generation |
| POST /api/embed | Generate embeddings for RAG |
| GET /api/tags | List downloaded models |
| POST /api/pull | Download a model programmatically |
| DELETE /api/delete | Remove a model |
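
As a taste of the embeddings endpoint, here's a sketch that embeds two strings and compares them with cosine similarity, which is the core operation behind RAG retrieval. The model name nomic-embed-text is my assumption (pull whichever embedding model you prefer), and the helper names are hypothetical.

```python
import json
import math
import urllib.request


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(texts: list[str], model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> list[list[float]]:
    """POST to /api/embed and return one vector per input text."""
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


# Usage (needs a running Ollama with the embedding model pulled):
# query, doc = embed(["What is a container?", "Docker isolates processes."])
# print(cosine_similarity(query, doc))
```

Scores closer to 1.0 mean the two texts point in a similar direction in embedding space; a retrieval step simply ranks documents by that score.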

OpenAI-Compatible Endpoints

Point any tool that supports OpenAI's API at http://localhost:11434/v1 with no API key:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What is Docker?"}]
  }'

Supported OpenAI-compatible endpoints include Chat Completions, Completions, Models list, Embeddings, and (as of Ollama v0.13.3+) the Responses API. This means VS Code extensions, LangChain, n8n, Hermes Agent, and CLI tools all work by changing one base_url.
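
To show what "changing one base_url" looks like without pulling in an SDK, here's a standard-library sketch against the chat completions endpoint. The placeholder API key is an assumption on my part: Ollama doesn't check it, but many OpenAI clients refuse to run without one, so sending a dummy value keeps the request shape realistic. Helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # the only thing that differs from OpenAI


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return completion["choices"][0]["message"]["content"]


def openai_chat(prompt: str, model: str = "llama3.2:3b",
                base_url: str = BASE_URL, api_key: str = "ollama") -> str:
    """POST an OpenAI-style chat completion request and return the reply text."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder, not validated by Ollama
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))


# Usage (needs a running Ollama with the model pulled):
# print(openai_chat("What is Docker?"))
```

Because the response follows the OpenAI shape, the same extract_reply logic works whether base_url points at Ollama or at a hosted provider.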

Performance Tuning

A few environment variables make a real difference on constrained hardware:
| Variable | Default | What to Set |
|----------|---------|------------|
| OLLAMA_KEEP_ALIVE | 5m | How long models stay loaded in RAM after the last request. Set to 0 to unload immediately (save RAM), or 24h to keep models hot. |
| OLLAMA_CONTEXT_LENGTH | 4096 | Default context window. Reduce to 2048 on low-RAM machines for smaller models. |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model. Keep at 1 on CPU — parallel requests multiply RAM usage. |
| OLLAMA_MAX_LOADED_MODELS | 3 × GPU count | Max models kept in memory simultaneously. On CPU-only setups, set this to 1 or 2. |
| OLLAMA_FLASH_ATTENTION | (off) | Set to 1 to enable Flash Attention — cuts memory usage on supported models. |
In Docker Compose, set these under the Ollama service's environment: block:

ollama:
  environment:
    - OLLAMA_KEEP_ALIVE=24h
    - OLLAMA_CONTEXT_LENGTH=4096
    - OLLAMA_NUM_PARALLEL=1
    - OLLAMA_FLASH_ATTENTION=1

The first prompt after pulling a model loads the weights into RAM — expect a few seconds of delay. Subsequent prompts are fast because the model stays resident. If you switch models, the old one gets evicted and the new one loads — another few seconds. OLLAMA_KEEP_ALIVE=24h keeps your most-used model hot all day.
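
Two of these knobs can also be set per request instead of server-wide: the native API accepts a keep_alive field and an options.num_ctx value in the request body. Here's a sketch of a helper that layers them onto an existing /api/chat payload; with_overrides is a hypothetical name, and the values shown are just examples.

```python
import copy
from typing import Optional


def with_overrides(payload: dict, keep_alive: Optional[str] = None,
                   num_ctx: Optional[int] = None) -> dict:
    """Return a copy of an /api/chat payload with per-request overrides applied."""
    result = copy.deepcopy(payload)  # leave the caller's payload untouched
    if keep_alive is not None:
        result["keep_alive"] = keep_alive  # e.g. "24h", or 0 to unload right away
    if num_ctx is not None:
        result.setdefault("options", {})["num_ctx"] = num_ctx  # context window
    return result


base = {"model": "llama3.2:3b", "messages": [], "stream": False}
print(with_overrides(base, keep_alive="24h", num_ctx=2048))
```

Per-request overrides are useful when one workflow needs a long context and another should release RAM as soon as it finishes.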

Security: Don't Expose Ollama Directly

A native Ollama install listens on 127.0.0.1:11434, localhost only. The Docker image listens on all interfaces inside the container, so in this guide the 127.0.0.1 prefix on every -p mapping and Compose ports entry is what keeps the API local. Do not publish the port on all interfaces (or set OLLAMA_HOST=0.0.0.0:11434 on a native install) unless you understand the risk: Ollama's local API has no built-in authentication. Anyone who can reach the port can pull models, run prompts, and delete your models.
In Docker, bind ports to 127.0.0.1 if you only want local access. Do not publish Ollama on all interfaces unless protected by firewall, VPN, or reverse proxy.
If you need remote access, put it behind a reverse proxy:
Nginx with basic auth:

server {
    listen 443 ssl;
    server_name ollama.yourdomain.com;
    auth_basic "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;
    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
    }
}
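
Once the proxy is in place, clients have to send credentials with every request. Here's a sketch of building the Basic auth header by hand in Python; it reuses the example hostname from the config above, and the helper names and credentials are placeholders.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def proxied_request(path: str, username: str, password: str,
                    host: str = "https://ollama.yourdomain.com") -> urllib.request.Request:
    """Build a request to Ollama behind the Nginx proxy, with credentials attached."""
    return urllib.request.Request(
        f"{host}{path}",
        headers={"Authorization": basic_auth_header(username, password)},
    )


# Usage (placeholder credentials; send with urllib.request.urlopen):
# req = proxied_request("/api/tags", "user", "pass")
```

Basic auth only encodes the credentials, it doesn't encrypt them, which is why the server block above listens on 443 with TLS.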

Cloudflare Tunnel (no open ports):

cloudflared tunnel --url http://localhost:11434 --http-host-header="localhost:11434"

Then layer Cloudflare Access policies on top for zero-trust auth. I covered this approach in detail in my post on Cloudflare Self-Managed OAuth.
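Tailscale is another clean option: put your server and your laptop on the same tailnet and access Ollama on port 11434 at the server's tailnet address, without opening any ports to the internet. With the localhost-only port mapping used in this guide, you'd also need to publish the port on the server's Tailscale address for that to work.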
If you're running this stack on Proxmox, this Docker Compose setup drops right into an Ubuntu VM — I've covered the full Proxmox-on-mini-PC workflow in a separate guide.
On newer Ollama versions, set OLLAMA_NO_CLOUD=1 if you want to avoid Ollama cloud features:

ollama:
  environment:
    - OLLAMA_NO_CLOUD=1

CORS for Browser Extensions

If you use browser extensions that talk to Ollama (like the Page Assist extension), set allowed origins:

ollama:
  environment:
    - OLLAMA_ORIGINS=chrome-extension://*,moz-extension://*,safari-web-extension://*

Troubleshooting Common Issues

"Connection refused" when Open WebUI tries to reach Ollama. In the standalone docker run setup, you need --add-host=host.docker.internal:host-gateway. In Docker Compose, use OLLAMA_BASE_URL=http://ollama:11434 — the service name resolves inside the Compose network. Double-check which mode you're using.
Models are slow on first prompt. The first prompt loads the model weights into RAM. Subsequent prompts are fast. OLLAMA_KEEP_ALIVE=24h avoids frequent reloads.
Out of memory errors. LLMs are memory-hungry. A 7B parameter model at 4-bit quantization needs about 4–5 GB of RAM. On a machine with 8 GB total, stick to 2B–3B models. On 16 GB, 7B models are comfortable. If you push too far, Docker kills the Ollama container with an OOM (out of memory) error — check docker logs ollama.
Port conflicts. If port 3000 or 11434 is already in use, change the host-side port in your compose file (the number just before the final colon). For example, "127.0.0.1:3001:8080" maps host port 3001 to the container's 8080 while keeping the localhost-only bind.
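
The out-of-memory rule of thumb above is simple arithmetic, sketched here so you can plug in other model sizes. It assumes exactly 4 bits per weight and a flat 1 GB allowance for the KV cache and runtime, both of which are rough guesses on my part; real quantized files usually average a little more than 4 bits per weight.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone: parameters times bits, converted to gigabytes."""
    return params_billion * bits_per_weight / 8


def estimate_ram_gb(params_billion: float, bits_per_weight: float = 4,
                    overhead_gb: float = 1.0) -> float:
    """Weights plus an assumed flat allowance for KV cache and runtime."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb


for name, size in [("3B", 3), ("7B", 7), ("14B", 14)]:
    print(f"{name}: ~{estimate_ram_gb(size):.1f} GB")
# 3B: ~2.5 GB
# 7B: ~4.5 GB
# 14B: ~8.0 GB
```

Longer context windows grow the KV cache, so leave more headroom than this estimate if you raise OLLAMA_CONTEXT_LENGTH.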

Integrating with Your Tools

Hermes Agent

Ollama documents a Hermes Agent integration. If you're running Hermes Agent for task orchestration (I do — here's my mini PC setup guide), pointing it at your local Ollama instance means zero-cost inference for all your automated workflows:

ollama launch hermes

This auto-installs Hermes, configures the Ollama provider at http://127.0.0.1:11434/v1, and sets up messaging gateways. For manual setup, point Hermes's provider config to http://127.0.0.1:11434/v1 with no API key.
Models that work well for Hermes agent tasks:

gemma3:12b - reasoning and code generation (~12 GB VRAM)
qwen2.5-coder:7b - coding tasks (~8 GB)
llama3.2:3b - lightweight tasks, quick lookups (~4 GB)

n8n

If you run n8n for automation (I do for teaching workflows), Ollama plugs in through two paths:
HTTP Request node, which POSTs to Ollama's native API:

{
  "method": "POST",
  "url": "http://ollama:11434/api/chat",
  "body": {
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Grade this student answer..."}],
    "stream": false
  }
}

OpenAI node — point base_url at http://ollama:11434/v1 and it works as a drop-in replacement. Same node, same workflow shape, zero API cost.

Teaching Use Cases

As a CS teacher, here's what I use this stack for in the classroom:

Mark scheme generation: Feed a past paper question through Ollama with structured output — get back a formatted mark scheme with point allocations.
Worksheet creation: "Generate 10 differentiated questions on binary addition for IGCSE CS, with answers." Structured JSON output makes it trivial to import into a worksheet template.
Code review feedback: Student submits a Python program → n8n passes it to Ollama → model returns line-level feedback. Not perfect, but catches obvious issues before I review.
Vocabulary extraction: Feed a Cambridge 0478/9618 syllabus section → Ollama returns a structured list of key terms with definitions → pipe into Anki or a CSV for Quizlet.
Lesson plan drafting: "Outline a 45-minute lesson on logic gates for A-Level CS with starter activity, main task, and plenary." Saves 30 minutes of staring at a blank doc.

The structured output mode is the killer feature here. You're not asking the model to write freeform text and hoping it's useful. You're getting guaranteed JSON that feeds directly into your existing templates and workflows.

What's Next?

Once you've got the basic stack running, three things worth exploring:
RAG with your own documents. Open WebUI's built-in RAG lets you upload PDFs, markdown files, and text documents. Ask questions against your own knowledge base — syllabi, notes, documentation — and the model responds with answers grounded in your files, not whatever it memorized during training.
Expose it safely. Put Open WebUI behind Nginx with HTTPS, or use a Cloudflare Tunnel with Access policies so you can chat from your phone without exposing anything to the open internet.
Multi-user setup. Open WebUI supports user accounts with role-based access. Create accounts for students, colleagues, or family, and each gets their own chat history and model access.

Why This Matters

A year ago, running a useful LLM at home meant owning a GPU. That's no longer true. A $150 mini PC with integrated graphics can serve a 3B model at reading speed. A used server with 128 GB of DDR3 RAM can run 26B models that rival cloud offerings.
The Docker + Ollama + Open WebUI stack is the simplest path I've found. Three pieces, one compose file. No API keys. No rate limits. Local model inference, with no prompts sent to a cloud LLM provider. You pull a model, you start a chat, and you own the whole stack, from the inference engine down to the web interface.
If you've been waiting for the "right time" to try self-hosted AI, this is it. The software is ready. The hardware is cheap. Go spin up a container.

DEV Community
