Pavan Madduri

I Replaced a $3/hr GPU Dev Workflow with Docker Model Runner. Here's How

Last month I was debugging a prompt template for a vLLM inference service. The change was two lines — swap the system prompt and adjust the temperature. To test it, I had to:

  1. Rebuild a 15GB Docker image (the CUDA base alone is 3.5GB)
  2. Push it to our registry (8 minutes on a good day)
  3. Wait for Kubernetes to pull it on a GPU node
  4. Realize the prompt still wasn't right
  5. Repeat

Total cycle time: 22 minutes per iteration. For a two-line text change.

Then I tried Docker Model Runner. Pull the model once. Run inference locally. Iterate on the prompt in seconds. Push only when it's right. The same change took 14 seconds.

Docker shipped two features this year that I think every GPU/AI engineer needs to know about: Model Runner and Sandboxes. This post is the walkthrough I wish I had when I started using them.

My background: I build GPU infrastructure tools — keda-gpu-scaler for GPU autoscaling on Kubernetes, otel-gpu-receiver for GPU observability, and I contributed GPU NUMA topology scheduling to CNCF Volcano. Everything I build runs in Docker containers.


Part 1: Docker Model Runner — Run LLMs Like You Run Containers

If you've used docker pull and docker run, you already know how Model Runner works. Same mental model, same CLI patterns, but for AI models instead of containers.

Setup (one time)

Update Docker Desktop to 4.40+ and enable Model Runner:

Settings → Features in Development → Enable Docker Model Runner

Pull your first model

$ docker model pull ai/llama3.2:1B-Q8_0

This downloads quantized model weights from Docker Hub — same registry, same content-addressable storage, same layer deduplication. If two models share base weights, you only download the diff.

Run inference from the CLI

$ docker model run ai/llama3.2:1B-Q8_0 "What is NUMA topology in GPU scheduling?"
NUMA (Non-Uniform Memory Access) topology in GPU scheduling refers to the 
arrangement of CPUs and GPUs on a server where memory access speed depends 
on physical proximity. GPUs on the same NUMA node as the requesting CPU 
have faster memory access. NUMA-aware schedulers like Volcano place GPU 
workloads on nodes where the GPUs share a NUMA domain with the allocated 
CPUs, reducing cross-node memory latency by 10-20% for multi-GPU training...

That ran locally on my MacBook Pro. No cloud GPU. No 15GB container image. No Kubernetes cluster. Apple Silicon handles the inference via Metal/MLX. On Linux with NVIDIA GPUs, it uses CUDA automatically.

The killer feature: OpenAI-compatible API

Model Runner exposes a local endpoint that speaks the exact same protocol as OpenAI's API:

from openai import OpenAI

# Local — hits Docker Model Runner
client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed"  # no key required locally
)

# This SAME code works in production with one env var change:
# client = OpenAI(base_url=os.environ["VLLM_ENDPOINT"])

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[
        {"role": "system", "content": "You are a Kubernetes GPU infrastructure expert."},
        {"role": "user", "content": "Explain GPU memory fragmentation in 3 sentences."}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)

Why this matters: Your application code is identical between local dev and production. Locally you hit Model Runner. In production you hit vLLM, Triton, or OpenAI. Change one environment variable. That's it.
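
Here's roughly how I wire that switch up. A minimal sketch, assuming nothing beyond the openai SDK; the LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL variable names are my own, not a Docker convention:

import os

from openai import OpenAI

# Defaults point at Docker Model Runner on the laptop. In production the
# same code runs unchanged; the deployment just overrides these env vars
# with the vLLM/Triton endpoint, a real API key, and the served model name.
client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "http://localhost:12434/engines/llama3.2/v1/"),
    api_key=os.getenv("LLM_API_KEY", "not-needed"),
)
model = os.getenv("LLM_MODEL", "ai/llama3.2:1B-Q8_0")

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Sanity check: reply with OK."}],
)
print(response.choices[0].message.content)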

List and manage models

$ docker model ls
NAME                       SIZE      CREATED
ai/llama3.2:1B-Q8_0       1.3 GB    10 minutes ago
ai/mistral:7B-Q4_K_M      4.1 GB    2 hours ago

$ docker model rm ai/mistral:7B-Q4_K_M

Same workflow as docker image ls and docker image rm. If you know Docker, you know Model Runner. Zero learning curve.
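
And because the local endpoint speaks the OpenAI protocol, you can also ask it what's loaded from code. A minimal sketch, assuming the endpoint exposes the standard /v1/models route (most OpenAI-compatible servers do; verify on your version):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed",
)

# Rough API-side equivalent of `docker model ls`: prints whatever model IDs
# the local runner reports on its /models route.
for model in client.models.list():
    print(model.id)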


Part 2: The Real Problem Model Runner Solves

I run GPU inference in production on Kubernetes. Here's what the inner development loop looked like before Model Runner:

Edit prompt → docker build (8 min) → docker push (8 min) → kubectl rollout → test → repeat
                    ↑                                                              │
                    └──────────── 22 minutes per iteration ────────────────────────┘

And after:

Edit prompt → docker model run (14 sec) → test → iterate → ship when ready
                                                              │
                                            ← seconds per iteration →

The three pain points Model Runner eliminates:

Problem          | Before                                            | After
Dependency hell  | CUDA 12 vs 11, cuDNN mismatches, PyTorch pinning  | Model Runner handles the inference runtime
Image bloat      | 15GB vLLM image with full CUDA toolkit            | 1.3GB quantized model, no container build needed
Dev-prod gap     | Can't run A100 inference on a MacBook             | Model Runner uses Apple Silicon or local NVIDIA GPU

The architecture end-to-end:

┌─── Your Laptop ──────────────────┐     ┌─── Production K8s Cluster ────────────┐
│                                   │     │                                        │
│  Docker Model Runner              │     │  vLLM containers (A100/H100)           │
│  ├─ Llama 3.2 (local inference)  │     │  ├─ keda-gpu-scaler (auto-scaling)    │
│  └─ OpenAI-compatible API        │     │  ├─ otel-gpu-receiver (GPU metrics)   │
│          ↕                        │     │  ├─ Volcano (NUMA-aware scheduling)   │
│  Your Application Code            │ ──→ │  └─ OpenAI-compatible API             │
│  (same code, same SDK)           │     │          ↕                              │
│                                   │     │  Same Application Code                 │
└───────────────────────────────────┘     └────────────────────────────────────────┘

Same application code. Same API. Same Docker. Different scale.


Part 3: Docker Sandboxes — Because AI Agents Will Try to Delete Your Files

Here's a scenario that keeps me up at night: an AI coding agent decides the best way to fix a test failure is to rm -rf the test directory. Or it installs a malicious pip package. Or it curls your AWS credentials to an external server.

If you're building agentic workflows — LLMs that execute code, call APIs, or modify files — running that code on your host is reckless. Docker Sandboxes fix this.

What Sandboxes give you

  • Filesystem isolation — the agent gets its own filesystem. Your SSH keys, browser cookies, and credentials are invisible.
  • Network whitelisting — you specify exactly which hosts the agent can reach. Everything else is blocked.
  • Resource caps — CPU, memory, GPU limits per sandbox. No runaway processes.
  • Ephemeral by default — sandbox is destroyed when the task completes. No persistent state leakage.

A real example: sandboxed coding agent

# docker-compose.sandbox.yml
services:
  coding-agent:
    image: my-coding-agent:latest
    sandbox:
      enabled: true
      network:
        egress:
          - "api.openai.com:443"      # LLM API calls
          - "pypi.org:443"            # pip install
          - "github.com:443"          # git clone
      resources:
        memory: 4g
        cpus: 2
    volumes:
      - ./workspace:/workspace        # Only this directory is accessible
    environment:
      - MODEL_ENDPOINT=http://host.docker.internal:12434  # Model Runner

What the agent can do:

  • Read/write files in /workspace
  • Call OpenAI, install Python packages, clone repos
  • Use up to 4GB RAM, 2 CPUs

What the agent cannot do:

  • Touch ~/.ssh, ~/.aws, or any file outside /workspace
  • Reach arbitrary servers (no data exfiltration)
  • Consume unlimited resources (no fork bombs)
  • Persist state after completion (clean slate every run)
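
Inside the sandbox, the agent reaches Model Runner through the MODEL_ENDPOINT variable from the compose file above. Here's a minimal sketch of the agent side, reusing the same path suffix as the earlier local example (adjust it to whatever your endpoint actually exposes):

import os

from openai import OpenAI

# MODEL_ENDPOINT is set in docker-compose.sandbox.yml and points back at
# Model Runner on the host (http://host.docker.internal:12434).
base = os.environ["MODEL_ENDPOINT"].rstrip("/")

client = OpenAI(
    base_url=f"{base}/engines/llama3.2/v1/",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[
        {"role": "system", "content": "You are a coding agent working only inside /workspace."},
        {"role": "user", "content": "Propose a fix for the failing test in /workspace."},
    ],
)
print(response.choices[0].message.content)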

The production mirror

This is the same isolation model I enforce in production Kubernetes with Pod Security Standards:

# Production: Kubernetes pod security
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
---
# Local: Docker Sandbox
sandbox:
  enabled: true
  network:
    egress: ["api.openai.com:443"]

Same security boundary. Same isolation model. Docker Sandboxes for local dev, Kubernetes PodSecurity for production. The gap between "works on my machine" and "works in production" shrinks to almost nothing.


Part 4: Putting It All Together — The Full GPU/AI Stack on Docker

Here's the picture I've been building toward. Docker isn't just a container runtime anymore — it's the full development platform for AI:

On Your Laptop (Docker Desktop)

Layer          | Tool                 | What It Does
Inference      | Docker Model Runner  | Pull and run LLMs locally, OpenAI-compatible API
Security       | Docker Sandboxes     | Isolate AI agents, whitelist network access
Supply Chain   | Docker Scout         | Scan GPU images for CVEs in CUDA/Python dependency trees
Observability  | Docker Extensions    | I built a GPU Dashboard showing real-time NVML metrics in Docker Desktop
Multi-service  | Docker Compose       | Run inference + app + monitoring together locally

In Production (Kubernetes)

Layer          | Tool                            | What It Does
Runtime        | Docker containers (containerd)  | vLLM, Triton inference servers
Autoscaling    | keda-gpu-scaler                 | Scale on real GPU utilization, not CPU proxy metrics. Scale to zero when idle.
Observability  | otel-gpu-receiver               | GPU metrics → OpenTelemetry → Prometheus/Grafana
Scheduling     | Volcano GPU NUMA                | Place multi-GPU training on NUMA-aligned GPUs (10-20% throughput improvement)
Security       | NetworkPolicy + PodSecurity     | Same isolation as Docker Sandboxes, enforced by the cluster

The container is the unit of deployment from your laptop to the GPU cluster. Docker owns the inner loop. Kubernetes owns the outer loop. Both speak the same language.


Part 5: Try It Right Now (5-Minute Walkthrough)

Everything below runs on Docker Desktop. No GPU required (it'll use CPU). Takes 5 minutes.

Step 1: Pull a model

docker model pull ai/llama3.2:1B-Q8_0

Step 2: Chat with it

docker model run ai/llama3.2:1B-Q8_0 "Write a Dockerfile for a Python FastAPI app"

Step 3: Hit the API

curl http://localhost:12434/engines/llama3.2/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2:1B-Q8_0",
    "messages": [
      {"role": "system", "content": "You are a Docker and Kubernetes expert."},
      {"role": "user", "content": "How do I expose GPUs to a Docker container?"}
    ],
    "temperature": 0.3
  }'

Step 4: Use it from Python

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Explain Docker Scout in one paragraph"}]
)
print(response.choices[0].message.content)

That's it. You're running local LLM inference with the same API your production code uses. No CUDA installation. No PyTorch dependency conflicts. No 15GB images.
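
One more thing worth trying while you iterate on prompts: streaming. The same endpoint should accept the standard stream flag (a sketch; OpenAI-compatible servers generally support it, but verify on your version):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed",
)

# Print tokens as they arrive instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Explain GPU memory fragmentation in 3 sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()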


What I Want to See Next from Docker

Model Runner is already good. Here's what would make it great for production GPU engineers:

  1. docker model stats — VRAM usage per model, like docker stats for containers. Right now I have to run nvidia-smi separately.
  2. Multi-model serving — run Llama + Mistral + CodeLlama concurrently with per-model resource limits. Production inference servers (Triton) do this; local dev should too.
  3. OpenTelemetry integration — emit inference latency, tokens/second, and queue depth to an OTel collector. My otel-gpu-receiver handles hardware GPU metrics — application-level model metrics would complete the picture (a rough client-side stopgap is sketched after this list).
  4. Sandbox GPU passthrough — let sandboxed AI agents access the GPU for local inference. Currently sandboxes are CPU-only.
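
Until then, here's the rough client-side stopgap I mentioned for latency and tokens/second: time a request and divide by the completion tokens the API reports. A sketch, assuming the local endpoint fills in the standard usage block:

import time

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama3.2/v1/",
    api_key="not-needed",
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Summarize NUMA-aware GPU scheduling."}],
    max_tokens=200,
)
elapsed = time.perf_counter() - start

# usage is part of the standard OpenAI response shape; this assumes the
# local runner populates completion_tokens.
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tok/s")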

I've shared this feedback in the Docker community forums. If you're building GPU/AI infrastructure on Docker, I'd love to hear what you're missing — drop a comment or reach out.


The Bottom Line

Docker used to be where you built containers and pushed them somewhere else. In 2026, Docker is where you develop AI applications end-to-end:

  • Model Runner replaces your 22-minute build-push-deploy cycle with a 14-second local inference call
  • Sandboxes give your AI agents the same security boundary they'll have in production
  • Scout catches CVEs in your CUDA dependency tree before they reach a GPU cluster
  • Compose runs your entire AI stack locally — inference, app, monitoring, all together

And when it's time to ship to production, the same containers deploy to Kubernetes with GPU autoscaling, GPU observability, and NUMA-aware scheduling.

The gap between "works on my machine" and "works on 8 A100s" just got a lot smaller.


Pavan Madduri is a Senior Cloud Platform Engineer at W.W. Grainger, Inc., CNCF Golden Kubestronaut, and Oracle ACE Associate. He maintains keda-gpu-scaler and otel-gpu-receiver, contributed GPU NUMA topology scheduling to Volcano, and is a Dragonfly Community Member. Published: PlatformEngineering.com. Follow on Facebook: Docker AI & Cloud-Native DevOps.
