This is a submission for the Gemma 4 Challenge: Write About Gemma 4
Here's a situation every ML team eventually hits.
Your intern gets Gemma 4 running locally. Beautiful. Fast. Impresses the standup. Then they leave for the semester. Three weeks later, nobody can reproduce the setup. The model was in a path that no longer exists. The Python version was wrong. The CUDA libraries conflicted with PyTorch. Someone pip installed something globally and broke everything.
This is why Docker exists. And yet, every Gemma 4 setup guide I've found uses either the Ollama GUI, LM Studio's drag-and-drop interface, or bare ollama run commands that vanish when the terminal closes. None of them answer the actual question a dev team has:
How do we version-control this, ship it to staging, and have it work the same way every time?
This guide answers that. By the end, you'll have:
- A Dockerfile your team can pin, review, and modify in a PR
- A docker-compose.yml that brings up the full stack with one command
- A running Gemma 4 endpoint that speaks OpenAI's API format, so code that already calls GPT-4o needs only a new base URL, API key, and model name
Let's build it.
Why Docker + GPU Is Annoying (and How It Actually Works)
Before the files, a quick mental model — because this is where most people get confused.
CUDA is not inside your container. The NVIDIA driver lives on the host machine. The container runtime (Docker) needs a way to reach through the container wall and talk to those drivers. That bridge is the NVIDIA Container Toolkit.
Here's the flow:
Your Container
└── CUDA libraries (inside the image)
    └── NVIDIA Container Runtime   ← the bridge
        └── Host NVIDIA Driver     ← never inside the container
            └── Physical GPU
This means:
- You do NOT need to install CUDA on your host. Only the NVIDIA driver.
- You do NOT bundle the driver inside your Docker image. Only CUDA libraries.
- The container toolkit version on your host must be compatible with your driver version.
A common gotcha: if your host driver is old (e.g., 470.x), it cannot run CUDA 12.x containers. The CUDA version inside your image must be less than or equal to what your driver supports. Check first:
nvidia-smi
# Look for: "CUDA Version: XX.X" in the top-right corner
# That's the maximum CUDA your driver supports
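If you want to automate that check, for example in a setup script, the comparison can be sketched in Python. This is a minimal sketch that applies the "image CUDA must be less than or equal to driver CUDA" rule stated above; the function names and the sample header line are made up for illustration, and it assumes the nvidia-smi banner contains a "CUDA Version: XX.X" field.

```python
import re

# Hypothetical sample of the nvidia-smi banner line, for illustration only
SAMPLE_HEADER = "| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2 |"


def max_cuda_from_smi(smi_output: str) -> tuple[int, int]:
    """Return the (major, minor) CUDA version reported by nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def image_is_compatible(image_cuda: str, smi_output: str) -> bool:
    """True if the image's CUDA version is <= the driver's maximum."""
    major, minor = (int(part) for part in image_cuda.split(".")[:2])
    return (major, minor) <= max_cuda_from_smi(smi_output)


if __name__ == "__main__":
    # In real use, pass the output of subprocess.run(["nvidia-smi"], ...) instead
    for cuda in ("11.8", "12.3.0"):
        print(cuda, image_is_compatible(cuda, SAMPLE_HEADER))
```

Tuple comparison keeps the logic honest for versions like 12.10 versus 12.2, which a plain string or float comparison would get wrong.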
Prerequisites Checklist
Before touching any Docker file, verify these four things on your host machine:
# 1. Confirm NVIDIA GPU is visible
lspci | grep -i nvidia
# 2. Check driver is installed and working
nvidia-smi
# 3. Check Docker version (need 19.03+ for native --gpus support)
docker --version
# 4. Check NVIDIA Container Toolkit is installed
nvidia-ctk --version
If step 4 fails, install the toolkit:
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Wire it into Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test — you should see nvidia-smi output from inside a container
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
If that last command shows your GPU, you're ready. If it errors, stop here and fix the driver — no amount of clever Dockerfiles will help.
The Setup We're Building
We're going to run three containers in a single Compose stack:
| Container | What it does | Port |
|---|---|---|
| gemma4 | Ollama serving Gemma 4 (with GPU) | 11434 (internal) |
| litellm | OpenAI-compatible proxy in front of Ollama | 8000 (external) |
| model-puller | One-shot container that pulls the model on first boot | none |
The LiteLLM proxy is the key piece most guides skip. It translates OpenAI API calls (/v1/chat/completions) into Ollama's native format, adds API key auth, rate limiting, and request logging — all without touching your application code.
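To make the proxy's role concrete, here is a rough sketch of the request a client ends up sending to it, built with only the Python standard library. The helper name is hypothetical, the key is a placeholder, and the "gemma-4" alias is the one this guide configures later in the LiteLLM config; the request is constructed but deliberately not sent.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # The proxy checks this bearer token against its master key
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request("http://localhost:8000/v1", "sk-test", "gemma-4", "ping")
print(req.get_method(), req.full_url)
```

Nothing in the request mentions Ollama: the client only knows the proxy URL, a key, and a model alias.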
Project Structure
Create this directory layout before writing any files:
gemma4-stack/
├── .env                  # Secrets and config; never commit this
├── .env.example          # Template for new team members
├── docker-compose.yml    # The full stack
├── Dockerfile.ollama     # Custom Ollama image with health checks
├── litellm/
│   └── config.yaml       # LiteLLM model routing config
├── scripts/
│   └── pull-model.sh     # Idempotent model pull script
└── tests/
    └── smoke_test.py     # Verify the stack works end-to-end
Step 1: The .env File
# .env — copy from .env.example and fill in your values
# !! Never commit this file — add .env to .gitignore
# Which Gemma 4 variant to run
# Options: gemma4:e2b | gemma4:e4b | gemma4:26b | gemma4:31b
GEMMA_MODEL=gemma4:26b
# API key for LiteLLM (clients use this to authenticate)
# Generate a random one: openssl rand -hex 32
LITELLM_MASTER_KEY=sk-your-secret-key-here
# How long Ollama keeps the model in VRAM after the last request
# Increase this if you have spare VRAM and want fewer cold starts
OLLAMA_KEEP_ALIVE=10m
# Max parallel inference requests (tune to your VRAM)
OLLAMA_NUM_PARALLEL=2
# HuggingFace token — needed only if you pull gated models
HF_TOKEN=
# .env.example — commit this
GEMMA_MODEL=gemma4:26b
LITELLM_MASTER_KEY=sk-CHANGE-ME
OLLAMA_KEEP_ALIVE=10m
OLLAMA_NUM_PARALLEL=2
HF_TOKEN=
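A forgotten placeholder key is an easy mistake when copying .env.example, so a small pre-flight check can help. The sketch below is one possible approach, not part of the stack itself: the function names are hypothetical, the parser handles only simple KEY=value lines (no quoting or export syntax), and the list of required keys is an assumption based on the two files above.

```python
REQUIRED_KEYS = ("GEMMA_MODEL", "LITELLM_MASTER_KEY")
# Placeholder values taken from the .env and .env.example shown above
PLACEHOLDERS = {"sk-CHANGE-ME", "sk-your-secret-key-here"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def env_problems(text: str) -> list[str]:
    """Return human-readable problems; an empty list means the file looks usable."""
    values = parse_env(text)
    problems = [f"{key} is missing or empty" for key in REQUIRED_KEYS if not values.get(key)]
    if values.get("LITELLM_MASTER_KEY") in PLACEHOLDERS:
        problems.append("LITELLM_MASTER_KEY is still a placeholder")
    return problems


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        for problem in env_problems(handle.read()):
            print("✗", problem)
```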
Step 2: The Dockerfile
This is where we diverge from every "just use the ollama image" guide. Our custom image adds:
- A health check (so Compose knows when Ollama is actually ready)
- GPU and Ollama settings baked in as environment defaults (overridable from Compose)
- Pinned Ollama version (reproducibility)
Note that the image still runs as root, as the upstream image does; add a non-root user separately if your security baseline requires one.
# Dockerfile.ollama
# Pin to a specific Ollama release — update deliberately, not accidentally
FROM ollama/ollama:0.9.3
# Metadata
LABEL maintainer="your-team@yourcompany.com"
LABEL description="Ollama serving Gemma 4 with GPU passthrough"
LABEL gemma.version="gemma4"
# GPU access configuration
# NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES tell the
# container runtime which GPU capabilities to expose.
# "compute" = CUDA, "utility" = nvidia-smi
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
# Ollama configuration
ENV OLLAMA_HOST=0.0.0.0
ENV OLLAMA_ORIGINS=*
# Keep the model warm in VRAM (overridable via docker-compose env)
ENV OLLAMA_KEEP_ALIVE=10m
# Number of parallel inference slots
# Each extra slot needs its own KV cache, so VRAM use grows with context length
ENV OLLAMA_NUM_PARALLEL=2
# Flash attention — reduces memory for long contexts
ENV OLLAMA_FLASH_ATTENTION=1
# Model storage location inside the container
# We'll mount a volume here to persist across container restarts
ENV OLLAMA_MODELS=/models
# Create the models directory and set its permissions
RUN mkdir -p /models && chmod 755 /models
# Expose Ollama's API port
EXPOSE 11434
# Health check: ask Ollama if it's ready
# Uses the bundled CLI, since the ollama/ollama image does not ship curl
# --start-period gives it 60s to load before health checks count
HEALTHCHECK --interval=15s --timeout=10s --retries=5 --start-period=60s \
  CMD ollama list || exit 1
# Default command: the base image's ENTRYPOINT is already /bin/ollama,
# so CMD supplies only the subcommand
CMD ["serve"]
Why pin ollama:0.9.3 and not ollama:latest?
Because latest on a production service means "surprise breaking changes on every docker pull." Pin it. Update it intentionally via PR. When a new Ollama release drops and the team wants to update, the diff is one line and the change is reviewable.
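If you want CI to enforce the pinning rule rather than rely on reviewers, a small lint along these lines is one option. It is a rough sketch with a hypothetical function name: it only inspects FROM lines, treats a missing tag or "latest" as unpinned, and does not understand multi-stage references to earlier build stages.

```python
def unpinned_images(dockerfile_text: str) -> list[str]:
    """Return base images whose FROM line has no tag or uses ':latest'."""
    offenders = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64
        image = next((p for p in parts[1:] if not p.startswith("--")), "")
        if "@" in image:
            continue  # pinned by digest, which is stricter than a tag
        # Only look for ':' after the last '/', so registry ports are ignored
        name_tail = image.rsplit("/", 1)[-1]
        tag = name_tail.split(":", 1)[1] if ":" in name_tail else ""
        if tag in ("", "latest"):
            offenders.append(image)
    return offenders


if __name__ == "__main__":
    with open("Dockerfile.ollama", encoding="utf-8") as handle:
        bad = unpinned_images(handle.read())
    raise SystemExit(f"unpinned base images: {bad}" if bad else 0)
```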
Step 3: The Model Pull Script
This script runs once on startup and is idempotent — running it twice does nothing harmful:
#!/usr/bin/env bash
# scripts/pull-model.sh
# Pulls the Gemma 4 model if not already present
# Safe to run multiple times — checks before pulling
set -euo pipefail
OLLAMA_HOST="${OLLAMA_HOST:-http://gemma4:11434}"
MODEL="${GEMMA_MODEL:-gemma4:26b}"
echo "==> Waiting for Ollama to be ready at ${OLLAMA_HOST}..."
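# Wait up to 5 minutes for Ollama to respond
ready=0
for i in $(seq 1 60); do
  if curl -sf "${OLLAMA_HOST}/api/tags" > /dev/null 2>&1; then
    echo "==> Ollama is up."
    ready=1
    break
  fi
  echo "    Attempt ${i}/60: retrying in 5s..."
  sleep 5
done
# Fail loudly instead of attempting a pull against a server that never came up
if [ "${ready}" -ne 1 ]; then
  echo "==> Ollama did not become ready in time." >&2
  exit 1
fi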
# Check if model already exists
if curl -sf "${OLLAMA_HOST}/api/tags" | grep -q "\"${MODEL}\""; then
echo "==> Model '${MODEL}' already present. Skipping pull."
exit 0
fi
echo "==> Pulling model: ${MODEL}"
echo " This can take 5–20 minutes depending on your connection."
echo " The 26B model is ~15GB. Get a coffee."
curl -sf -X POST "${OLLAMA_HOST}/api/pull" \
-H "Content-Type: application/json" \
-d "{\"name\": \"${MODEL}\", \"stream\": false}"
echo "==> Pull complete. Model '${MODEL}' is ready."
chmod +x scripts/pull-model.sh
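The "already present?" check is what makes the script idempotent. If you later move this logic into Python tooling, the same test can be done by parsing the JSON instead of grepping it. This sketch assumes the /api/tags response has the shape {"models": [{"name": ...}, ...]}; the function name and the sample payload are illustrative only.

```python
import json

# Hypothetical, trimmed-down /api/tags payload for illustration
SAMPLE_TAGS = '{"models": [{"name": "gemma4:26b", "size": 123}]}'


def model_present(tags_json: str, model: str) -> bool:
    """True if `model` appears by exact name in an /api/tags response body."""
    payload = json.loads(tags_json)
    names = {entry.get("name") for entry in payload.get("models", [])}
    return model in names


print(model_present(SAMPLE_TAGS, "gemma4:26b"))
print(model_present(SAMPLE_TAGS, "gemma4:31b"))
```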
Step 4: LiteLLM Config
LiteLLM is the OpenAI-compatible gateway. This config tells it that "gemma-4" (what your app calls) maps to "ollama_chat/gemma4:26b" (what Ollama actually serves):
# litellm/config.yaml
model_list:
  # Gemma 4 26B MoE: best balance of speed and quality
  - model_name: gemma-4
    litellm_params:
      model: ollama_chat/gemma4:26b
      api_base: http://gemma4:11434

  # Gemma 4 E4B: fast, low VRAM, good for quick tasks
  - model_name: gemma-4-fast
    litellm_params:
      model: ollama_chat/gemma4:e4b
      api_base: http://gemma4:11434

  # Gemma 4 31B Dense: maximum quality, needs 24GB+ VRAM
  - model_name: gemma-4-max
    litellm_params:
      model: ollama_chat/gemma4:31b
      api_base: http://gemma4:11434

litellm_settings:
  # Verbose debug logging (keep false in prod, especially if you log requests elsewhere)
  set_verbose: false
  # Drop unsupported params instead of erroring
  # Gemma doesn't support every OpenAI parameter (e.g., logprobs)
  drop_params: true
  # Request timeout in seconds
  request_timeout: 120

general_settings:
  # Master key for authentication
  # Clients send: Authorization: Bearer sk-your-secret-key-here
  master_key: "os.environ/LITELLM_MASTER_KEY"
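Every alias in that config must point at the Compose service name, not at localhost, or the proxy will fail to connect from inside its own container. A check like the following could catch that in review; it is a sketch with a hypothetical function name, it takes the model list as an already-parsed Python structure (load the YAML with whatever parser you use), and the default service name and port are the ones used in this guide.

```python
from urllib.parse import urlparse

# The model_list from the config above, as a parsed Python structure
MODEL_LIST = [
    {"model_name": "gemma-4",
     "litellm_params": {"model": "ollama_chat/gemma4:26b", "api_base": "http://gemma4:11434"}},
    {"model_name": "gemma-4-fast",
     "litellm_params": {"model": "ollama_chat/gemma4:e4b", "api_base": "http://gemma4:11434"}},
    {"model_name": "gemma-4-max",
     "litellm_params": {"model": "ollama_chat/gemma4:31b", "api_base": "http://gemma4:11434"}},
]


def misrouted_aliases(model_list: list[dict], service: str = "gemma4", port: int = 11434) -> list[str]:
    """Return aliases whose api_base does not target the expected service and port."""
    bad = []
    for entry in model_list:
        params = entry.get("litellm_params", {})
        parsed = urlparse(params.get("api_base", ""))
        if parsed.hostname != service or parsed.port != port:
            bad.append(entry.get("model_name", "<unnamed>"))
    return bad


print(misrouted_aliases(MODEL_LIST))
```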
Step 5: The docker-compose.yml
This is the centrepiece. Read it carefully — every decision has a comment explaining why:
# docker-compose.yml
# Gemma 4 production-grade local inference stack
# Usage: docker compose up -d
# First run: docker compose up -d && docker compose logs -f model-puller
name: gemma4-stack

services:
  # ─────────────────────────────────────────────
  # OLLAMA: The inference engine with GPU access
  # ─────────────────────────────────────────────
  gemma4:
    build:
      context: .
      dockerfile: Dockerfile.ollama
    container_name: gemma4-ollama
    # GPU passthrough: this is the magic block
    # "count: all" exposes every GPU on the host
    # Change to "count: 1" or "device_ids: ['0']" to pin a specific GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    # Model storage: named volume persists across container recreations
    # Without this, you re-download 15GB every time you update the image
    volumes:
      - gemma4-models:/models
    environment:
      OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE:-10m}
      OLLAMA_NUM_PARALLEL: ${OLLAMA_NUM_PARALLEL:-2}
      OLLAMA_FLASH_ATTENTION: "1"
      OLLAMA_MODELS: /models
    # Do NOT expose port 11434 to the host
    # All external traffic goes through LiteLLM on port 8000
    # This prevents clients from bypassing auth
    expose:
      - "11434"
    restart: unless-stopped
    healthcheck:
      # Uses the bundled CLI, since the ollama/ollama image does not ship curl
      test: ["CMD", "ollama", "list"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 60s

  # ─────────────────────────────────────────────
  # MODEL PULLER: Downloads Gemma 4 on first boot
  # Exits cleanly after pulling; not a long-running service
  # ─────────────────────────────────────────────
  model-puller:
    image: curlimages/curl:latest
    container_name: gemma4-model-puller
    volumes:
      - ./scripts/pull-model.sh:/pull-model.sh:ro
    environment:
      OLLAMA_HOST: http://gemma4:11434
      GEMMA_MODEL: ${GEMMA_MODEL:-gemma4:26b}
    entrypoint: ["/bin/sh", "/pull-model.sh"]
    depends_on:
      gemma4:
        condition: service_healthy
    # This container should exit 0 after pulling the model
    restart: "no"

  # ─────────────────────────────────────────────
  # LITELLM: OpenAI-compatible API gateway
  # Your apps connect to THIS, not to Ollama directly
  # ─────────────────────────────────────────────
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    container_name: gemma4-litellm
    volumes:
      - ./litellm/config.yaml:/app/config.yaml:ro
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "8000"]
    ports:
      # THIS is the only port exposed to your host machine
      # http://localhost:8000/v1 is the drop-in replacement for OpenAI
      - "8000:8000"
    depends_on:
      gemma4:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      # /health/liveliness needs no API key; plain /health requires the master key
      # Python is used because curl may not be present in the image
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health/liveliness')"]
      interval: 20s
      timeout: 10s
      retries: 3

# ─────────────────────────────────────────────
# VOLUMES
# ─────────────────────────────────────────────
volumes:
  gemma4-models:
    driver: local
    # Uncomment to bind to a specific host path instead:
    # driver_opts:
    #   type: none
    #   o: bind
    #   device: /data/gemma4-models

networks:
  default:
    name: gemma4-net
Step 6: Boot It Up
# Clone or create the project directory
cd gemma4-stack
# Copy the env template and fill in your master key
cp .env.example .env
# Edit .env — at minimum, set LITELLM_MASTER_KEY
# Build the Ollama image and start the stack
docker compose up -d
# Watch the model download (first run only — this takes a while)
docker compose logs -f model-puller
# Once model-puller exits, check everything is healthy
docker compose ps
Expected output from docker compose ps once everything is running:
NAME IMAGE STATUS
gemma4-ollama gemma4-stack-gemma4 Up (healthy)
gemma4-litellm ghcr.io/berriai/litellm:... Up (healthy)
gemma4-model-puller curlimages/curl:latest Exited (0)
The model-puller showing Exited (0) is correct — it finished its job and stopped cleanly.
Step 7: Test the OpenAI-Compatible Endpoint
Your stack is now running a drop-in OpenAI replacement at http://localhost:8000/v1. Test it:
# Check available models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer ${LITELLM_MASTER_KEY}"
# Send a chat completion — same format as OpenAI
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
-d '{
"model": "gemma-4",
"messages": [
{
"role": "system",
"content": "You are a senior backend engineer doing code review."
},
{
"role": "user",
"content": "Review this Python function and identify any issues:\n\ndef get_user(id):\n return db.query(f\"SELECT * FROM users WHERE id={id}\")"
}
],
"temperature": 0.3,
"max_tokens": 512
}'
Expected response: Gemma 4 should flag the SQL injection, though the exact wording will vary between runs:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gemma-4",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Critical issue: SQL injection vulnerability on line 2. The `id` parameter is interpolated directly into the query string with an f-string. An attacker could pass `id = '1 OR 1=1 --'` and retrieve all users. Fix: use parameterized queries:\n\n```python\ndef get_user(user_id: int):\n    return db.query(\n        \"SELECT * FROM users WHERE id = %s\",\n        (user_id,)\n    )\n```\n\nAlso: rename the parameter from `id` to `user_id`, since `id` shadows a Python built-in."
    },
    "finish_reason": "stop"
  }]
}
Step 8: Drop It Into Your Existing Code
Here's the part that makes this setup valuable to a team. If you already have code calling OpenAI, you change the two client constructor arguments and the model name:
Python (openai SDK)
# Before — calling OpenAI
from openai import OpenAI
client = OpenAI(api_key="sk-your-openai-key")
# After — calling your local Gemma 4 stack
from openai import OpenAI
import os
client = OpenAI(
base_url="http://localhost:8000/v1", # ← changed
api_key=os.environ["LITELLM_MASTER_KEY"], # ← changed
)
# Everything below is identical — no other changes needed
response = client.chat.completions.create(
model="gemma-4", # or "gemma-4-fast" for E4B, "gemma-4-max" for 31B
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what a Docker volume is in one sentence."},
],
temperature=0.7,
max_tokens=200,
)
print(response.choices[0].message.content)
Node.js / TypeScript
import OpenAI from "openai";
// Same two-line swap
const client = new OpenAI({
baseURL: "http://localhost:8000/v1",
apiKey: process.env.LITELLM_MASTER_KEY,
});
const response = await client.chat.completions.create({
model: "gemma-4",
messages: [
{ role: "user", content: "Write a TypeScript type for a paginated API response." }
],
});
console.log(response.choices[0].message.content);
LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:8000/v1",
api_key="your-litellm-master-key",
model="gemma-4",
temperature=0,
)
response = llm.invoke("What are the SOLID principles? Give me one sentence per principle.")
print(response.content)
Step 9: The Smoke Test
Add this to your CI pipeline or run it after deployment to verify the stack is working:
# tests/smoke_test.py
"""
Smoke test for the Gemma 4 Docker stack.
Run with: python tests/smoke_test.py
Exit 0 = healthy. Exit 1 = something's broken.
"""
import os
import sys
import time
from openai import OpenAI
BASE_URL = os.environ.get("GEMMA_BASE_URL", "http://localhost:8000/v1")
API_KEY = os.environ.get("LITELLM_MASTER_KEY", "sk-test")
MODEL = os.environ.get("GEMMA_MODEL_ALIAS", "gemma-4")
def check(name: str, condition: bool, detail: str = ""):
    if condition:
        print(f"  ✓ {name}")
    else:
        print(f"  ✗ {name}{': ' + detail if detail else ''}")
        sys.exit(1)
print(f"\nSmoke test — {BASE_URL}\n")
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
# Test 1: Models endpoint
models = client.models.list()
model_ids = [m.id for m in models.data]
check("Models endpoint", len(model_ids) > 0, f"got: {model_ids}")
check(f"Model '{MODEL}' available", MODEL in model_ids)
# Test 2: Basic chat completion
start = time.time()
resp = client.chat.completions.create(
model=MODEL,
messages=[{"role": "user", "content": "Reply with exactly the word PONG and nothing else."}],
max_tokens=10,
temperature=0,
)
elapsed = time.time() - start
reply = resp.choices[0].message.content.strip().upper()
check("Chat completion returns", "PONG" in reply, f"got: {reply!r}")
check("Response time < 30s", elapsed < 30, f"took {elapsed:.1f}s")
# Test 3: Streaming
chunks = []
for chunk in client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Count to 3, one number per line."}],
    max_tokens=30,
    stream=True,
):
    # Some chunks can arrive with an empty choices list; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        chunks.append(chunk.choices[0].delta.content)
check("Streaming works", len(chunks) > 0, f"got {len(chunks)} chunks")
print("\nAll checks passed. Stack is healthy.\n")
Useful Day-to-Day Commands
# Start the stack
docker compose up -d
# Tail logs from all containers
docker compose logs -f
# Tail logs from one container
docker compose logs -f gemma4
# Check GPU utilization inside the container
docker exec gemma4-ollama nvidia-smi
# List loaded models and their VRAM usage
docker exec gemma4-ollama ollama ps
# Pull a different model (e.g., try 31B if you have the VRAM)
docker exec gemma4-ollama ollama pull gemma4:31b
# Hard restart (keeps model volume intact)
docker compose restart gemma4
# Full teardown — preserves the model volume
docker compose down
# NUCLEAR: full teardown including the 15GB model volume
# Only do this if you want to re-download everything
docker compose down -v
# Update to a newer Ollama version (edit Dockerfile.ollama first)
docker compose build --no-cache gemma4
docker compose up -d gemma4
# Check container resource usage
docker stats gemma4-ollama
Hardware Decision Guide
Not sure which Gemma 4 variant to put in your .env? Here's the honest breakdown:
| Your hardware | Set GEMMA_MODEL to | Why |
|---|---|---|
| RTX 3090/4090 (24 GB VRAM) | gemma4:31b | Full quality, fits comfortably |
| RTX 3080/4080 (12–16 GB VRAM) | gemma4:26b | MoE activates only 3.8B parameters per token, so it runs fast; the full ~15 GB of weights must still load, so 12 GB cards will offload some layers to CPU |
| RTX 3060/4060 (8 GB VRAM) | gemma4:e4b | Edge model, good quality, 3.5 GB VRAM |
| Any GPU with 6 GB VRAM | gemma4:e2b | Lightweight, still genuinely useful |
| CPU only (no GPU) | gemma4:e2b | Slow but functional; ~3–5 tok/s on modern CPU |
| 2× GPUs (any combo) | gemma4:31b, split across both GPUs | See multi-GPU note below |
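The table above can also be expressed as a small lookup, which is handy if a setup script should pick a default tag from detected VRAM. This is only a sketch that mirrors the table's thresholds; the function name is hypothetical, and how you obtain the VRAM figure (for example from nvidia-smi) is left out.

```python
def pick_gemma_tag(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the Gemma 4 tag suggested in the table."""
    if vram_gb >= 24:
        return "gemma4:31b"
    if vram_gb >= 12:
        return "gemma4:26b"
    if vram_gb >= 8:
        return "gemma4:e4b"
    # 6 GB cards and CPU-only hosts both land on the smallest variant
    return "gemma4:e2b"


for vram in (24, 16, 8, 6, 0):
    print(vram, pick_gemma_tag(vram))
```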
For multi-GPU setups, add to your Ollama environment (Ollama splits the model's layers across GPUs and already offloads as many layers as fit by default):
environment:
  OLLAMA_SCHED_SPREAD: "1" # Spread the model across all visible GPUs
Common Errors and Fixes
nvidia-smi: command not found inside container
Your NVIDIA Container Toolkit isn't wired into Docker. Re-run sudo nvidia-ctk runtime configure --runtime=docker and restart Docker.
CUDA error: no kernel image is available for execution on the device
Your GPU is too old for the CUDA version in the Ollama image. Check nvidia-smi for your driver version and use an older Ollama base image with a compatible CUDA tag.
LiteLLM returns APIConnectionError
The litellm container can't reach gemma4. Check they're on the same network: docker network inspect gemma4-net. The service name gemma4 in the LiteLLM config must match the Compose service name exactly.
docker compose up says GPU not found on Linux
The Docker daemon didn't get restarted after toolkit installation. Run sudo systemctl restart docker.
Model-puller keeps restarting instead of exiting
Ollama isn't healthy yet. Increase start_period in the Ollama healthcheck from 60s to 120s — large models take longer to initialize on slow disks.
What This Unlocks for Your Team
Once this stack is running, here's what your team gets for free:
Reproducibility. git clone + cp .env.example .env + docker compose up -d = working Gemma 4 in under 20 minutes for any new team member. No "it works on my machine."
Version control. Every change to the model, Ollama version, or config goes through a PR. Your ML setup is as auditable as your application code.
OpenAI drop-in. Any code that calls OpenAI works immediately. Switch between gemma-4 (local) and gpt-4o (cloud) by changing one environment variable. Useful for cost comparisons and offline development.
Privacy by default. No token leaves the machine. Customer data, proprietary code, internal documents — all processed locally with no third-party API calls.
Zero cloud dependency. The stack works on a laptop, a workstation, an on-prem server, or an air-gapped environment. The only network requirement is the initial model download.
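The one-environment-variable switch between local and cloud mentioned above could look roughly like this. It is a sketch under stated assumptions: LLM_BACKEND is a hypothetical variable name invented for this example, and the returned values are meant to be passed to the OpenAI client constructor and the model argument shown earlier.

```python
import os
from typing import Mapping, Optional


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Pick base URL, API key, and model from a single backend switch."""
    env = os.environ if env is None else env
    # LLM_BACKEND is a hypothetical variable name; "local" is the default
    if env.get("LLM_BACKEND", "local") == "local":
        return {
            "base_url": "http://localhost:8000/v1",
            "api_key": env.get("LITELLM_MASTER_KEY", ""),
            "model": "gemma-4",
        }
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o",
    }


settings = client_settings({"LLM_BACKEND": "local", "LITELLM_MASTER_KEY": "sk-test"})
print(settings["base_url"], settings["model"])
```

Application code then builds the client from settings["base_url"] and settings["api_key"] and passes settings["model"] to each call, so nothing else needs to know which backend is active.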
References
- NVIDIA Container Toolkit documentation: installation and Docker configuration guide. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- Ollama Docker Hub: official Ollama images and tags. https://hub.docker.com/r/ollama/ollama
- LiteLLM documentation: OpenAI proxy setup and config reference. https://docs.litellm.ai
- Gemma 4 on Ollama: model tags and size reference. https://ollama.com/library/gemma4
- NVIDIA Specialized Docker Configurations: NVIDIA_VISIBLE_DEVICES and capability flags. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html
- vLLM production deployment with Gemma 4: for teams needing multi-user concurrency beyond what Ollama provides. https://gemma4-ai.com/blog/gemma4-vllm-deploy