Shreya Nalawade

Posted on May 18

Running Gemma 4 Inside a Docker Container with GPU Passthrough

#devchallenge #gemmachallenge #gemma


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Here's a situation every ML team eventually hits.

Your intern gets Gemma 4 running locally. Beautiful. Fast. Impresses the standup. Then they leave for the semester. Three weeks later, nobody can reproduce the setup. The model was in a path that no longer exists. The Python version was wrong. The CUDA libraries conflicted with PyTorch. Someone pip installed something globally and broke everything.

This is why Docker exists. And yet, every Gemma 4 setup guide I've found uses either the Ollama GUI, LM Studio's drag-and-drop interface, or bare ollama run commands that vanish when the terminal closes. None of them answer the actual question a dev team has:

How do we version-control this, ship it to staging, and have it work the same way every time?

This guide answers that. By the end, you'll have:

A Dockerfile your team can pin, review, and modify in a PR
A docker-compose.yml that brings up the full stack with one command
A running Gemma 4 endpoint that speaks OpenAI's API format, so code that already calls GPT-4o needs only a new base URL and API key

Let's build it.

Why Docker + GPU Is Annoying (and How It Actually Works)

Before the files, a quick mental model — because this is where most people get confused.

The NVIDIA driver is not inside your container. It lives on the host machine, and only the CUDA user-space libraries ship in the image. The container runtime (Docker) needs a way to reach through the container wall and talk to that driver. That bridge is the NVIDIA Container Toolkit.

Here's the flow:

Your Container
    └── CUDA libraries (inside the image)
          └── NVIDIA Container Runtime  ← the bridge
                └── Host NVIDIA Driver  ← never inside the container
                      └── Physical GPU

This means:

You do NOT need to install CUDA on your host. Only the NVIDIA driver.
You do NOT bundle the driver inside your Docker image. Only CUDA libraries.
The container toolkit version on your host must be compatible with your driver version.

A common gotcha: if your host driver is old (e.g., 470.x), it cannot run CUDA 12.x containers. The CUDA version inside your image must be less than or equal to what your driver supports. Check first:

nvidia-smi
# Look for: "CUDA Version: XX.X" in the top-right corner
# That's the maximum CUDA your driver supports
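The rule above can be turned into a small pre-build check. This is a minimal sketch, assuming the `CUDA Version: X.Y` field appears in the `nvidia-smi` header as described; the function names and the sample header line are made up for illustration, and the comparison is the conservative "image version must not exceed driver maximum" rule from the text.

```python
import re


def max_cuda_from_smi(smi_output: str) -> tuple:
    """Extract the 'CUDA Version: X.Y' value from nvidia-smi's header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def image_is_compatible(image_cuda: str, smi_output: str) -> bool:
    """True when the image's CUDA version is <= the driver's maximum."""
    major, minor = (int(part) for part in image_cuda.split(".")[:2])
    return (major, minor) <= max_cuda_from_smi(smi_output)


# Hypothetical header line, shaped like real nvidia-smi output.
sample = "| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |"
print(image_is_compatible("12.3", sample))  # False
print(image_is_compatible("11.4", sample))  # True
```

In practice you would feed it the output of `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the sample string.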

Prerequisites Checklist

Before touching any Docker file, verify these four things on your host machine:

# 1. Confirm NVIDIA GPU is visible
lspci | grep -i nvidia

# 2. Check driver is installed and working
nvidia-smi

# 3. Check Docker version (need 19.03+ for native --gpus support)
docker --version

# 4. Check NVIDIA Container Toolkit is installed
nvidia-ctk --version

If step 4 fails, install the toolkit:

# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Wire it into Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test — you should see nvidia-smi output from inside a container
docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

If that last command shows your GPU, you're ready. If it errors, stop here and fix the driver or toolkit setup. No amount of clever Dockerfiles will help.
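If you want check 3 to fail automatically rather than by eye, the version comparison is easy to script. A minimal sketch, assuming `docker --version` prints a line of the form `Docker version X.Y.Z, build ...`; the function name and the sample strings are illustrative.

```python
import re

MIN_DOCKER = (19, 3)  # 19.03 is the first release with native --gpus support


def docker_supports_gpus(version_output: str) -> bool:
    """Parse `docker --version` output and compare it against 19.03."""
    match = re.search(r"Docker version (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognised output: {version_output!r}")
    return (int(match.group(1)), int(match.group(2))) >= MIN_DOCKER


print(docker_supports_gpus("Docker version 24.0.7, build afdd53b"))   # True
print(docker_supports_gpus("Docker version 18.09.9, build 039a7df"))  # False
```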

The Setup We're Building

We're going to run three containers in a single Compose stack:

Container	What it does	Port
`gemma4`	Ollama serving Gemma 4 (with GPU)	`11434` (internal)
`litellm`	OpenAI-compatible proxy in front of Ollama	`8000` (external)
`model-puller`	One-shot container that pulls the model on first boot	—

The LiteLLM proxy is the key piece most guides skip. It translates OpenAI API calls (/v1/chat/completions) into Ollama's native format, adds API key auth, rate limiting, and request logging — all without touching your application code.
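To make the translation step concrete, here is a conceptual sketch of the mapping between the two request shapes. It is not LiteLLM's actual code; `to_ollama_chat` and `alias_map` are hypothetical names, and only two parameters are mapped. The target shape follows Ollama's `/api/chat` body (`model`, `messages`, `stream`, `options`).

```python
def to_ollama_chat(openai_request: dict, alias_map: dict) -> dict:
    """Map an OpenAI-style chat request onto an Ollama /api/chat body."""
    options = {}
    if "temperature" in openai_request:
        options["temperature"] = openai_request["temperature"]
    if "max_tokens" in openai_request:
        # Ollama calls the output-token limit num_predict.
        options["num_predict"] = openai_request["max_tokens"]
    return {
        # The alias the app uses is swapped for the tag Ollama serves.
        "model": alias_map[openai_request["model"]],
        "messages": openai_request["messages"],
        "stream": openai_request.get("stream", False),
        "options": options,
    }


request = {
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.3,
    "max_tokens": 512,
}
print(to_ollama_chat(request, {"gemma-4": "gemma4:26b"}))
```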

Project Structure

Create this directory layout before writing any files:

gemma4-stack/
├── .env                    # Secrets and config — never commit this
├── .env.example            # Template for new team members
├── docker-compose.yml      # The full stack
├── Dockerfile.ollama       # Custom Ollama image with health checks
├── litellm/
│   └── config.yaml         # LiteLLM model routing config
├── scripts/
│   └── pull-model.sh       # Idempotent model pull script
└── tests/
    └── smoke_test.py       # Verify the stack works end-to-end

Step 1: The `.env` File

# .env — copy from .env.example and fill in your values
# !! Never commit this file — add .env to .gitignore

# Which Gemma 4 variant to run
# Options: gemma4:e2b | gemma4:e4b | gemma4:26b | gemma4:31b
GEMMA_MODEL=gemma4:26b

# API key for LiteLLM (clients use this to authenticate)
# Generate a random one: openssl rand -hex 32
LITELLM_MASTER_KEY=sk-your-secret-key-here

# How long Ollama keeps the model in VRAM after the last request
# Increase this if you have spare VRAM and want to avoid cold starts (model reloads)
OLLAMA_KEEP_ALIVE=10m

# Max parallel inference requests (tune to your VRAM)
OLLAMA_NUM_PARALLEL=2

# HuggingFace token — needed only if you pull gated models
HF_TOKEN=

# .env.example — commit this
GEMMA_MODEL=gemma4:26b
LITELLM_MASTER_KEY=sk-CHANGE-ME
OLLAMA_KEEP_ALIVE=10m
OLLAMA_NUM_PARALLEL=2
HF_TOKEN=
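A common first-run mistake is starting the stack with the placeholder key still in place. The following is a small sketch of a pre-flight check for the `.env` file; the required-key list and the placeholder set are assumptions taken from the two templates above, and the parser handles only plain `KEY=VALUE` lines.

```python
REQUIRED = ("GEMMA_MODEL", "LITELLM_MASTER_KEY")
PLACEHOLDERS = {"sk-CHANGE-ME", "sk-your-secret-key-here"}


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def env_problems(text: str) -> list:
    """Return human-readable problems; an empty list means the file looks usable."""
    values = parse_env(text)
    problems = [f"{key} is missing or empty" for key in REQUIRED if not values.get(key)]
    if values.get("LITELLM_MASTER_KEY") in PLACEHOLDERS:
        problems.append("LITELLM_MASTER_KEY is still a placeholder")
    return problems


print(env_problems("GEMMA_MODEL=gemma4:26b\nLITELLM_MASTER_KEY=sk-CHANGE-ME\n"))
```

Running it against `open(".env").read()` before `docker compose up` catches the problem before LiteLLM starts with a guessable key.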

Step 2: The Dockerfile

This is where we diverge from every "just use the ollama image" guide. Our custom image adds:

A health check (so Compose knows when Ollama is actually ready)
GPU and Ollama runtime defaults baked in as environment variables (so the image behaves the same outside Compose)
Pinned Ollama version (reproducibility)

# Dockerfile.ollama
# Pin to a specific Ollama release — update deliberately, not accidentally
FROM ollama/ollama:0.9.3

# Metadata
LABEL maintainer="your-team@yourcompany.com"
LABEL description="Ollama serving Gemma 4 with GPU passthrough"
LABEL gemma.version="gemma4"

# GPU access configuration
# NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES tell the
# container runtime which GPU capabilities to expose.
# "compute" = CUDA, "utility" = nvidia-smi
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Ollama configuration
ENV OLLAMA_HOST=0.0.0.0
ENV OLLAMA_ORIGINS=*

# Keep the model warm in VRAM (overridable via docker-compose env)
ENV OLLAMA_KEEP_ALIVE=10m

# Number of parallel inference slots
# Each extra slot adds KV-cache memory that grows with context length
ENV OLLAMA_NUM_PARALLEL=2

# Flash attention — reduces memory for long contexts
ENV OLLAMA_FLASH_ATTENTION=1

# Model storage location inside the container
# We'll mount a volume here to persist across container restarts
ENV OLLAMA_MODELS=/models

# Create the models directory and set its permissions
RUN mkdir -p /models && chmod 755 /models

# Expose Ollama's API port
EXPOSE 11434

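# Health check: ask Ollama if it's ready
# The stock image may not ship curl, so use the ollama CLI, which exits
# non-zero when the server is not answering
# --start-period gives it 60s to start before failed checks count
HEALTHCHECK --interval=15s --timeout=10s --retries=5 --start-period=60s \
  CMD ollama list || exit 1

# Default command: starts the Ollama server
# The base image's ENTRYPOINT is already /bin/ollama, so CMD only supplies the subcommand
CMD ["serve"]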

Why pin ollama/ollama:0.9.3 and not ollama/ollama:latest?

Because latest on a production service means "surprise breaking changes on every docker pull." Pin it. Update it intentionally via PR. When a new Ollama release drops and the team wants to update, the diff is one line and the change is reviewable.
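The same argument applies to every image in the stack, not only Ollama. Below is a rough sketch of a check you could run in CI to list image references with no tag or a moving tag. The `FLOATING_TAGS` set is an assumption about which tags move, and the regex only understands `FROM ...` and `image: ...` lines.

```python
import re

# Assumed list of tags that move over time; extend it for your registries.
FLOATING_TAGS = {"latest", "main", "main-stable", "stable"}


def floating_images(text: str) -> list:
    """Return image references that have no tag or a floating tag."""
    found = []
    for match in re.finditer(r"^\s*(?:FROM|image:)\s+(\S+)", text, flags=re.MULTILINE):
        ref = match.group(1)
        # Only look for ':' after the last '/', so registry ports are ignored.
        name = ref.rsplit("/", 1)[-1]
        tag = name.split(":", 1)[1] if ":" in name else ""
        if tag == "" or tag in FLOATING_TAGS:
            found.append(ref)
    return found


sample = """FROM ollama/ollama:0.9.3
    image: curlimages/curl:latest
    image: ghcr.io/berriai/litellm:main-stable
"""
print(floating_images(sample))
```

Against the files in this guide it reports the curl and LiteLLM images, which are worth pinning once the stack is past the experiment stage.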

Step 3: The Model Pull Script

This script runs once on startup and is idempotent — running it twice does nothing harmful:

#!/usr/bin/env bash
# scripts/pull-model.sh
# Pulls the Gemma 4 model if not already present
# Safe to run multiple times — checks before pulling

set -euo pipefail

OLLAMA_HOST="${OLLAMA_HOST:-http://gemma4:11434}"
MODEL="${GEMMA_MODEL:-gemma4:26b}"

echo "==> Waiting for Ollama to be ready at ${OLLAMA_HOST}..."

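# Wait up to 5 minutes for Ollama to respond
ready=0
for i in $(seq 1 60); do
  if curl -sf "${OLLAMA_HOST}/api/tags" > /dev/null 2>&1; then
    echo "==> Ollama is up."
    ready=1
    break
  fi
  echo "    Attempt ${i}/60: retrying in 5s..."
  sleep 5
done

# Fail loudly instead of falling through to the pull if Ollama never answered
if [ "${ready}" -ne 1 ]; then
  echo "==> Ollama did not become ready in time." >&2
  exit 1
fi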

# Check if model already exists
if curl -sf "${OLLAMA_HOST}/api/tags" | grep -q "\"${MODEL}\""; then
  echo "==> Model '${MODEL}' already present. Skipping pull."
  exit 0
fi

echo "==> Pulling model: ${MODEL}"
echo "    This can take 5–20 minutes depending on your connection."
echo "    The 26B model is ~15GB. Get a coffee."

curl -sf -X POST "${OLLAMA_HOST}/api/pull" \
  -H "Content-Type: application/json" \
  -d "{\"name\": \"${MODEL}\", \"stream\": false}"

echo "==> Pull complete. Model '${MODEL}' is ready."

chmod +x scripts/pull-model.sh
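The idempotency check in the script is a substring `grep` on the JSON. If you ever port the puller to Python, an exact match on the parsed response is a little stricter. A minimal sketch, assuming the `/api/tags` response has a top-level `models` list whose entries carry a `name` field; the sample payload is invented.

```python
import json


def model_present(tags_json: str, model: str) -> bool:
    """True if `model` appears by exact name in an /api/tags response."""
    payload = json.loads(tags_json)
    entries = payload.get("models") or []
    return any(entry.get("name") == model for entry in entries)


# Invented response body with a single installed model.
sample = '{"models": [{"name": "gemma4:26b", "size": 1}]}'
print(model_present(sample, "gemma4:26b"))  # True
print(model_present(sample, "gemma4:31b"))  # False
```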

Step 4: LiteLLM Config

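LiteLLM is the OpenAI-compatible gateway. This config tells it that "gemma-4" (what your app calls) maps to "ollama_chat/gemma4:26b" (what Ollama actually serves). Only the model named in GEMMA_MODEL is pulled automatically, so the gemma-4-fast and gemma-4-max aliases return errors until you pull their models yourself: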

# litellm/config.yaml
model_list:
  # Gemma 4 26B MoE — best balance of speed and quality
  - model_name: gemma-4
    litellm_params:
      model: ollama_chat/gemma4:26b
      api_base: http://gemma4:11434

  # Gemma 4 E4B — fast, low VRAM, good for quick tasks
  - model_name: gemma-4-fast
    litellm_params:
      model: ollama_chat/gemma4:e4b
      api_base: http://gemma4:11434

  # Gemma 4 31B Dense — maximum quality, needs 24GB+ VRAM
  - model_name: gemma-4-max
    litellm_params:
      model: ollama_chat/gemma4:31b
      api_base: http://gemma4:11434

litellm_settings:
  # Verbose debug logging of every request (keep false in prod)
  set_verbose: false

  # Drop unsupported params instead of erroring
  # Gemma doesn't support every OpenAI parameter (e.g., logprobs)
  drop_params: true

  # Request timeout in seconds
  request_timeout: 120

general_settings:
  # Master key for authentication
  # Clients send: Authorization: Bearer sk-your-secret-key-here
  master_key: "os.environ/LITELLM_MASTER_KEY"
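The `os.environ/LITELLM_MASTER_KEY` value is a reference, not a literal key: the proxy looks the name up in its environment at startup. The sketch below only mimics that convention to show what happens when the variable is missing; it is not LiteLLM's implementation, and `resolve` is a hypothetical helper.

```python
PREFIX = "os.environ/"


def resolve(value: str, env: dict) -> str:
    """Resolve 'os.environ/NAME' to env['NAME']; return other values unchanged."""
    if not value.startswith(PREFIX):
        return value
    name = value[len(PREFIX):]
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]


print(resolve("os.environ/LITELLM_MASTER_KEY", {"LITELLM_MASTER_KEY": "sk-abc"}))
print(resolve("http://gemma4:11434", {}))
```

The practical consequence is that the key never has to appear in `config.yaml`, so the file is safe to commit.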

Step 5: The docker-compose.yml

This is the centrepiece. Read it carefully — every decision has a comment explaining why:

# docker-compose.yml
# Gemma 4 production-grade local inference stack
# Usage: docker compose up -d
# First run: docker compose up -d && docker compose logs -f model-puller

name: gemma4-stack

services:

  # ─────────────────────────────────────────────
  # OLLAMA: The inference engine with GPU access
  # ─────────────────────────────────────────────
  gemma4:
    build:
      context: .
      dockerfile: Dockerfile.ollama
    container_name: gemma4-ollama

    # GPU passthrough — this is the magic line
    # "count: all" exposes every GPU on the host
    # Change to "count: 1" or "device_ids: ['0']" to pin a specific GPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    # Model storage: named volume persists across container recreations
    # Without this, you re-download 15GB every time you update the image
    volumes:
      - gemma4-models:/models

    environment:
      OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE:-10m}
      OLLAMA_NUM_PARALLEL: ${OLLAMA_NUM_PARALLEL:-2}
      OLLAMA_FLASH_ATTENTION: "1"
      OLLAMA_MODELS: /models

    # Do NOT expose port 11434 to the host
    # All external traffic goes through LiteLLM on port 8000
    # This prevents clients from bypassing auth
    expose:
      - "11434"

    restart: unless-stopped

    # Uses the ollama CLI because the stock image may not ship curl
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 60s

  # ─────────────────────────────────────────────
  # MODEL PULLER: Downloads Gemma 4 on first boot
  # Exits cleanly after pulling — not a long-running service
  # ─────────────────────────────────────────────
  model-puller:
    image: curlimages/curl:latest
    container_name: gemma4-model-puller

    volumes:
      - ./scripts/pull-model.sh:/pull-model.sh:ro

    environment:
      OLLAMA_HOST: http://gemma4:11434
      GEMMA_MODEL: ${GEMMA_MODEL:-gemma4:26b}

    entrypoint: ["/bin/sh", "/pull-model.sh"]

    depends_on:
      gemma4:
        condition: service_healthy

    # This container should exit 0 after pulling the model
    restart: "no"

  # ─────────────────────────────────────────────
  # LITELLM: OpenAI-compatible API gateway
  # Your apps connect to THIS, not to Ollama directly
  # ─────────────────────────────────────────────
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    container_name: gemma4-litellm

    volumes:
      - ./litellm/config.yaml:/app/config.yaml:ro

    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}

    command: ["--config", "/app/config.yaml", "--port", "8000"]

    ports:
      # THIS is the only port exposed to your host machine
      # http://localhost:8000/v1 — drop-in replacement for OpenAI
      - "8000:8000"

    depends_on:
      gemma4:
        condition: service_healthy

    restart: unless-stopped

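    # /health/liveliness needs no API key; plain /health requires the master key
    # Assumes python3 is on PATH in the LiteLLM image (curl may not be)
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health/liveliness')"]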
      interval: 20s
      timeout: 10s
      retries: 3

# ─────────────────────────────────────────────
# VOLUMES
# ─────────────────────────────────────────────
volumes:
  gemma4-models:
    driver: local
    # Uncomment to bind to a specific host path instead:
    # driver_opts:
    #   type: none
    #   o: bind
    #   device: /data/gemma4-models

networks:
  default:
    name: gemma4-net

Step 6: Boot It Up

# Clone or create the project directory
cd gemma4-stack

# Copy the env template and fill in your master key
cp .env.example .env
# Edit .env — at minimum, set LITELLM_MASTER_KEY

# Build the Ollama image and start the stack
docker compose up -d

# Watch the model download (first run only — this takes a while)
docker compose logs -f model-puller

# Once model-puller exits, check everything is healthy
docker compose ps

Expected output from docker compose ps once everything is running:

NAME                   IMAGE                          STATUS
gemma4-ollama          gemma4-stack-gemma4            Up (healthy)
gemma4-litellm         ghcr.io/berriai/litellm:...   Up (healthy)
gemma4-model-puller    curlimages/curl:latest         Exited (0)

The model-puller showing Exited (0) is correct — it finished its job and stopped cleanly.
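In CI you usually want to block until the stack is ready instead of watching `docker compose ps`. Here is a minimal polling sketch using only the standard library. It assumes the proxy answers on `/health/liveliness` without authentication; `http_ok` and `wait_until` are illustrative names, and the 60 attempts at 5 seconds mirror the pull script's wait loop.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if a GET to `url` returns a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response.
        return False


def wait_until(probe, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call `probe()` until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)
    return False


# Usage (not executed here, because it needs the stack to be running):
#   ready = wait_until(lambda: http_ok("http://localhost:8000/health/liveliness"))
```

Passing `sleep` as a parameter keeps the function testable without real waiting.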

Step 7: Test the OpenAI-Compatible Endpoint

Your stack is now running a drop-in OpenAI replacement at http://localhost:8000/v1. Test it:

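# Load the key into your shell first (Compose reads .env, your shell does not)
set -a; . ./.env; set +a

# Check available models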
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}"

# Send a chat completion — same format as OpenAI
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${LITELLM_MASTER_KEY}" \
  -d '{
    "model": "gemma-4",
    "messages": [
      {
        "role": "system",
        "content": "You are a senior backend engineer doing code review."
      },
      {
        "role": "user",
        "content": "Review this Python function and identify any issues:\n\ndef get_user(id):\n    return db.query(f\"SELECT * FROM users WHERE id={id}\")"
      }
    ],
    "temperature": 0.3,
    "max_tokens": 512
  }'

A response along these lines is what you're looking for (exact wording will vary). Gemma 4 should flag the SQL injection:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gemma-4",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Critical issue: SQL injection vulnerability on line 2. The `id` parameter is interpolated directly into the query string with an f-string. An attacker could pass `id = '1 OR 1=1 --'` and retrieve all users. Fix: use parameterized queries:\n\n```python\ndef get_user(user_id: int):\n    return db.query(\n        \"SELECT * FROM users WHERE id = %s\",\n        (user_id,)\n    )\n```\n\nAlso: rename the parameter from `id` to `user_id`, since `id` shadows a Python built-in."
    },
    "finish_reason": "stop"
  }]
}
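For hosts where installing an SDK is not an option, the same call can be made with the Python standard library. This is a sketch rather than a full client: `build_chat_request` and `extract_reply` are hypothetical helpers, and the response shape is the `choices[0].message.content` structure shown above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a chat.completion JSON body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Usage (needs the stack running):
#   req = build_chat_request("http://localhost:8000/v1", key, "gemma-4", "Say hi")
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(extract_reply(resp.read().decode("utf-8")))
```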

Step 8: Drop It Into Your Existing Code

Here's the part that makes this setup valuable to a team. If you already have code calling OpenAI, you change two lines:

Python (openai SDK)

# Before — calling OpenAI
from openai import OpenAI

client = OpenAI(api_key="sk-your-openai-key")

# After — calling your local Gemma 4 stack
from openai import OpenAI
import os

client = OpenAI(
    base_url="http://localhost:8000/v1",  # ← changed
    api_key=os.environ["LITELLM_MASTER_KEY"],  # ← changed
)

# Everything below is identical — no other changes needed
response = client.chat.completions.create(
    model="gemma-4",  # or "gemma-4-fast" for E4B, "gemma-4-max" for 31B
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a Docker volume is in one sentence."},
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)

Node.js / TypeScript

import OpenAI from "openai";

// Same two-line swap
const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: process.env.LITELLM_MASTER_KEY,
});

const response = await client.chat.completions.create({
  model: "gemma-4",
  messages: [
    { role: "user", content: "Write a TypeScript type for a paginated API response." }
  ],
});

console.log(response.choices[0].message.content);

LangChain

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-litellm-master-key",
    model="gemma-4",
    temperature=0,
)

response = llm.invoke("What are the SOLID principles? Give me one sentence per principle.")
print(response.content)

Step 9: The Smoke Test

Add this to your CI pipeline or run it after deployment to verify the stack is working:

# tests/smoke_test.py
"""
Smoke test for the Gemma 4 Docker stack.
Run with: python tests/smoke_test.py
Exit 0 = healthy. Exit 1 = something's broken.
"""

import os
import sys
import time
from openai import OpenAI

BASE_URL = os.environ.get("GEMMA_BASE_URL", "http://localhost:8000/v1")
API_KEY = os.environ.get("LITELLM_MASTER_KEY", "sk-test")
MODEL = os.environ.get("GEMMA_MODEL_ALIAS", "gemma-4")

def check(name: str, condition: bool, detail: str = ""):
    if condition:
        print(f"  ✓ {name}")
    else:
        print(f"  ✗ {name}{': ' + detail if detail else ''}")
        sys.exit(1)

print(f"\nSmoke test — {BASE_URL}\n")

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

# Test 1: Models endpoint
models = client.models.list()
model_ids = [m.id for m in models.data]
check("Models endpoint", len(model_ids) > 0, f"got: {model_ids}")
check(f"Model '{MODEL}' available", MODEL in model_ids)

# Test 2: Basic chat completion
start = time.time()
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Reply with exactly the word PONG and nothing else."}],
    max_tokens=10,
    temperature=0,
)
elapsed = time.time() - start

reply = (resp.choices[0].message.content or "").strip().upper()
check("Chat completion returns", "PONG" in reply, f"got: {reply!r}")
check("Response time < 30s", elapsed < 30, f"took {elapsed:.1f}s")

# Test 3: Streaming
chunks = []
for chunk in client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Count to 3, one number per line."}],
    max_tokens=30,
    stream=True,
):
    if chunk.choices and chunk.choices[0].delta.content:
        chunks.append(chunk.choices[0].delta.content)

check("Streaming works", len(chunks) > 0, f"got {len(chunks)} chunks")

print(f"\nAll checks passed. Stack is healthy.\n")

Useful Day-to-Day Commands

# Start the stack
docker compose up -d

# Tail logs from all containers
docker compose logs -f

# Tail logs from one container
docker compose logs -f gemma4

# Check GPU utilization inside the container
docker exec gemma4-ollama nvidia-smi

# List loaded models and their VRAM usage
docker exec gemma4-ollama ollama ps

# Pull a different model (e.g., try 31B if you have the VRAM)
docker exec gemma4-ollama ollama pull gemma4:31b

# Hard restart (keeps model volume intact)
docker compose restart gemma4

# Full teardown — preserves the model volume
docker compose down

# NUCLEAR: full teardown including the 15GB model volume
# Only do this if you want to re-download everything
docker compose down -v

# Update to a newer Ollama version (edit Dockerfile.ollama first)
docker compose build --no-cache gemma4
docker compose up -d gemma4

# Check container resource usage
docker stats gemma4-ollama

Hardware Decision Guide

Not sure which Gemma 4 variant to put in your .env? Here's the honest breakdown:

Your hardware	Set `GEMMA_MODEL` to	Why
RTX 3090/4090 (24 GB VRAM)	`gemma4:31b`	Full quality, fits comfortably
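RTX 3080/4080 (12–16 GB VRAM)	`gemma4:26b`	MoE activates only 3.8B parameters per token, so it runs fast; all ~15 GB of weights still load, so it is tight on 16 GB and partly offloads to CPU on 12 GB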
RTX 3060/4060 (8 GB VRAM)	`gemma4:e4b`	Edge model, good quality, 3.5 GB VRAM
Any GPU with 6 GB VRAM	`gemma4:e2b`	Lightweight, still genuinely useful
CPU only (no GPU)	`gemma4:e2b`	Slow but functional; ~3–5 tok/s on modern CPU
2× GPUs (any combo)	`gemma4:31b` split across both GPUs	See multi-GPU note below
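A back-of-the-envelope estimate helps when your card is not in the table. The sketch below assumes roughly 4.5 bits per weight, which is a guess at a typical 4-bit quantization and not a published figure for these tags; it reproduces the ~15 GB quoted earlier for the 26B model. It counts weights only, so leave headroom for KV cache and runtime overhead. For an MoE model, use the total parameter count, not the active count, because every expert has to be resident.

```python
def weights_gb(total_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def headroom_gb(total_params_billion: float, vram_gb: float, bits_per_weight: float = 4.5) -> float:
    """VRAM left after loading weights; negative means partial CPU offload."""
    return vram_gb - weights_gb(total_params_billion, bits_per_weight)


for params, vram in [(26, 12), (26, 16), (31, 24)]:
    print(f"{params}B on {vram} GB: weights {weights_gb(params):.1f} GB, "
          f"headroom {headroom_gb(params, vram):+.1f} GB")
```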

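For multi-GPU setups, add the setting below to your Ollama environment. Ollama already offloads as many layers as fit by default; to force a layer count, set the num_gpu model option rather than an environment variable.

environment:
  OLLAMA_SCHED_SPREAD: "1"    # Spread the model across all visible GPUs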

Common Errors and Fixes

nvidia-smi: command not found inside container
Your NVIDIA Container Toolkit isn't wired into Docker. Re-run sudo nvidia-ctk runtime configure --runtime=docker and restart Docker.

CUDA error: no kernel image is available for execution on the device
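The Ollama build in the image has no kernels compiled for your GPU's compute capability, which usually means the GPU is too old for that CUDA build. Check the card with nvidia-smi --query-gpu=name,compute_cap --format=csv (the compute_cap field needs a reasonably recent driver) and switch to an older Ollama image tag whose bundled CUDA still supports it.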

LiteLLM returns APIConnectionError
The litellm container can't reach gemma4. Check they're on the same network: docker network inspect gemma4-net. The service name gemma4 in the LiteLLM config must match the Compose service name exactly.

docker compose up says GPU not found on Linux
The Docker daemon didn't get restarted after toolkit installation. Run sudo systemctl restart docker.

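Model-puller never starts, or exits non-zero
Either the gemma4 service has not reported healthy, so Compose will not start the puller, or the pull itself failed. Check docker compose logs gemma4 and docker compose logs model-puller. If Ollama is just slow to start, increase start_period in the Ollama healthcheck from 60s to 120s.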

What This Unlocks for Your Team

Once this stack is running, here's what your team gets for free:

Reproducibility. git clone + cp .env.example .env + docker compose up -d = working Gemma 4 in under 20 minutes for any new team member. No "it works on my machine."

Version control. Every change to the model, Ollama version, or config goes through a PR. Your ML setup is as auditable as your application code.

OpenAI drop-in. Any code that calls OpenAI works immediately. Switch between gemma-4 (local) and gpt-4o (cloud) by changing one environment variable. Useful for cost comparisons and offline development.

Privacy by default. No token leaves the machine. Customer data, proprietary code, internal documents — all processed locally with no third-party API calls.

Zero cloud dependency. The stack works on a laptop, a workstation, an on-prem server, or an air-gapped environment. The only network requirement is the initial model download.
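The "one environment variable" switch between local and cloud can look something like this. It is a sketch under stated assumptions: `LLM_BACKEND` is a hypothetical variable name, `OPENAI_API_KEY` is the conventional name for the cloud key, and the returned dict is meant to be unpacked into whichever client you already use.

```python
import os

BACKENDS = {
    "local": {
        "base_url": "http://localhost:8000/v1",
        "key_var": "LITELLM_MASTER_KEY",
        "model": "gemma-4",
    },
    "cloud": {
        "base_url": "https://api.openai.com/v1",
        "key_var": "OPENAI_API_KEY",
        "model": "gpt-4o",
    },
}


def client_settings(env=os.environ) -> dict:
    """Pick base URL, API key and model from LLM_BACKEND (default: local)."""
    backend = env.get("LLM_BACKEND", "local")
    if backend not in BACKENDS:
        raise ValueError(f"unknown LLM_BACKEND: {backend!r}")
    cfg = BACKENDS[backend]
    return {
        "base_url": cfg["base_url"],
        "api_key": env.get(cfg["key_var"], ""),
        "model": cfg["model"],
    }


print(client_settings({"LLM_BACKEND": "local", "LITELLM_MASTER_KEY": "sk-local"}))
```

With the openai SDK, the first two values go to `OpenAI(base_url=..., api_key=...)` and the third to the `model` argument of each request.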

References

NVIDIA Container Toolkit documentation — Installation and Docker configuration guide. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Ollama Docker Hub — Official Ollama images and tags. https://hub.docker.com/r/ollama/ollama
LiteLLM documentation — OpenAI proxy setup and config reference. https://docs.litellm.ai
Gemma 4 on Ollama — Model tags and size reference. https://ollama.com/library/gemma4
NVIDIA Specialized Docker Configurations — NVIDIA_VISIBLE_DEVICES and capability flags. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html
vLLM production deployment with Gemma 4 — For teams needing multi-user concurrency beyond what Ollama provides. https://gemma4-ai.com/blog/gemma4-vllm-deploy
