Every GPU inference container has the same problem: Kubernetes HPA can't see the GPU. You scale on CPU and memory while your GPU sits at 95% utilization, completely invisible to the autoscaler. Or worse — your GPU is idle and you're paying $3/hour for an instance doing nothing.
I built keda-gpu-scaler to fix this. It's a KEDA external scaler that reads real GPU metrics via NVIDIA NVML and drives Kubernetes autoscaling decisions — including scale-to-zero. This post covers the Docker-specific parts: how GPU metrics flow from the NVIDIA Container Toolkit through Docker to KEDA, and how to build GPU-aware containers that actually scale.
## How Docker Exposes GPUs to Containers
When you run a GPU container with Docker, three layers work together:
```bash
docker run --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
1. Docker Engine detects the `--gpus` flag and calls the NVIDIA Container Toolkit
2. The toolkit configures `nvidia-container-runtime` as the OCI runtime for this container
3. The runtime injects GPU device files (`/dev/nvidia0`, `/dev/nvidiactl`) and the NVIDIA driver libraries into the container's filesystem
The container now has full access to NVML (NVIDIA Management Library), which exposes GPU utilization, memory usage, temperature, power draw, and more. This is the same mechanism my GPU scaler uses — each scaler pod runs on a GPU node and reads NVML metrics from the GPUs Docker has exposed to it.
```
┌──────────────┐     gRPC      ┌───────────────────────────┐
│ KEDA Operator│──────────────→│ keda-gpu-scaler (Docker)  │
│ (central pod)│               │ DaemonSet on each GPU node│
└──────────────┘               │                           │
                               │  NVML ──→ /dev/nvidia0    │
                               │  NVML ──→ /dev/nvidia1    │
                               │  (Docker-exposed)         │
                               └───────────────────────────┘
```
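To make this concrete, here is a minimal sketch of reading those metrics with NVIDIA's official go-nvml bindings, run inside a container started with `--gpus all`. It shows the style of call the scaler makes, not its actual code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// nvml.Init dlopens libnvidia-ml.so, the library the NVIDIA Container
	// Toolkit injected into the container's filesystem.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, _ := nvml.DeviceGetHandleByIndex(i)
		util, _ := dev.GetUtilizationRates() // SM utilization, %
		mem, _ := dev.GetMemoryInfo()        // bytes used / total
		fmt.Printf("GPU %d: %d%% SM, %d/%d MiB\n",
			i, util.Gpu, mem.Used>>20, mem.Total>>20)
	}
}
```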
## Building GPU Containers: The Dockerfile
GPU containers need CGO for NVML access. Here's the multi-stage Dockerfile I use for keda-gpu-scaler:
```dockerfile
# Stage 1: Build
FROM golang:1.22-bookworm AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# CGO_ENABLED=1 is required — NVML needs CGO
# This is why GPU scaling can't be a native KEDA scaler
# (KEDA builds with CGO_ENABLED=0)
RUN CGO_ENABLED=1 go build -ldflags="-s -w" -o keda-gpu-scaler ./cmd/scaler

# Stage 2: Minimal runtime
FROM nvidia/cuda:12.4.1-base-ubuntu22.04
# Security: run as the existing unprivileged "nobody" user (UID 65534)
USER 65534:65534
COPY --from=builder /app/keda-gpu-scaler /usr/local/bin/
EXPOSE 6000
ENTRYPOINT ["keda-gpu-scaler"]
```
Key decisions:

- `CGO_ENABLED=1` — NVML requires C bindings. This is the fundamental architectural reason keda-gpu-scaler exists as an external scaler instead of being built into KEDA core.
- `nvidia/cuda` base image — its `NVIDIA_DRIVER_CAPABILITIES` defaults tell the NVIDIA Container Toolkit to inject the NVML shared library (`libnvidia-ml.so`) at runtime.
- Non-root execution — NVML reads GPU telemetry through the driver's device nodes and doesn't require root. Standard Docker security practice.
- Multi-stage build — the final image is ~150MB instead of 1.5GB (no Go toolchain, no build deps).
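Before any orchestration, a quick build-and-run smoke test on a GPU host verifies the image actually reaches the GPU (the `keda-gpu-scaler:dev` tag is just an illustrative local name):

```bash
# Build the image locally
docker build -t keda-gpu-scaler:dev .

# Run it with GPU access; the scaler should start serving gRPC on :6000
docker run --rm --gpus all -p 6000:6000 keda-gpu-scaler:dev
```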
## Docker Compose for Local GPU Development
Before deploying to Kubernetes, test the full stack locally with Docker Compose:
version: "3.8"
services:
# The GPU scaler — reads NVML metrics, serves gRPC
gpu-scaler:
build: .
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- "6000:6000" # gRPC for KEDA
- "9090:9090" # Prometheus metrics
environment:
- LOG_LEVEL=debug
# A real GPU workload to scale
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
ports:
- "8000:8000"
command: >
--model meta-llama/Llama-3.2-1B
--port 8000
--max-model-len 2048
# Prometheus to scrape GPU metrics
prometheus:
image: prom/prometheus:latest
ports:
- "9091:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
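The compose file mounts a `prometheus.yml` that isn't shown above. A minimal scrape config for this stack could look like the following; the job name and 5-second interval are my assumptions, while the target matches the scaler's `9090` metrics port:

```yaml
# prometheus.yml: minimal config for the local stack
global:
  scrape_interval: 5s            # tight interval for local experimentation
scrape_configs:
  - job_name: gpu-scaler
    static_configs:
      - targets: ["gpu-scaler:9090"]   # the scaler's Prometheus metrics endpoint
```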
```bash
# Start the stack
docker compose up -d

# Check GPU metrics via gRPC
grpcurl -plaintext localhost:6000 externalscaler.ExternalScaler/GetMetrics

# Check GPU metrics via Prometheus
curl localhost:9090/metrics | grep gpu

# Send requests to vLLM and watch GPU utilization climb
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B", "messages": [{"role": "user", "content": "Hello"}]}'
```
This gives you the full autoscaling feedback loop locally: vLLM serving → GPU utilization rises → scaler reports metrics → you can verify KEDA would trigger a scale-up.
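It's also worth exercising the scale-to-zero signal locally. `IsActive` is the RPC KEDA polls to decide whether a workload may go to zero; the payload fields follow KEDA's `externalscaler.proto`, and the name and namespace values here are illustrative:

```bash
# Expect {"result": true} under load, and false once the GPU goes idle
grpcurl -plaintext \
  -d '{"name": "vllm-deployment", "namespace": "default", "scalerMetadata": {"profile": "vllm-inference"}}' \
  localhost:6000 externalscaler.ExternalScaler/IsActive
```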
## Docker Scout: Scanning GPU Container Images
GPU images are large and have deep dependency trees. A typical vLLM image pulls in CUDA, cuDNN, NCCL, Python, PyTorch, and dozens of transitive dependencies. More packages = more CVE surface.
```bash
# Scan the GPU scaler image
docker scout cves pmady/keda-gpu-scaler:latest

# Get base image recommendations
docker scout recommendations pmady/keda-gpu-scaler:latest
```
Common findings I've hit with GPU images:
| Issue | Cause | Fix |
|---|---|---|
| High-severity OpenSSL CVE | CUDA base image uses older Ubuntu | Multi-stage build with patched base |
| Python package CVEs | Transitive deps in ML frameworks | Pin versions, use `pip-audit` |
| Outdated CUDA libs | NVIDIA base image release lag | Use Docker Hardened Images as base |
For production GPU containers, I run Scout in CI and block merges on critical/high CVEs:
```yaml
# GitHub Actions
- name: Docker Scout CVE scan
  uses: docker/scout-action@v1
  with:
    command: cves
    image: pmady/keda-gpu-scaler:${{ github.sha }}
    only-severities: critical,high
    exit-code: true   # Fail the build on findings
```
## Pre-Built Scaling Profiles
Different GPU workloads need different scaling strategies. keda-gpu-scaler ships profiles so you don't have to figure this out yourself:
```yaml
# ScaledObject for vLLM inference
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-gpu-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0   # Scale to zero when idle
  maxReplicaCount: 8
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.gpu-scaler.svc.cluster.local:6000"
        profile: "vllm-inference"
```
| Profile | Metric | Target | Scale-to-Zero | Why |
|---|---|---|---|---|
| `vllm-inference` | GPU memory (%) | 80% | Yes (5% activation) | vLLM fills KV cache proportional to request load |
| `triton-inference` | GPU utilization (%) | 75% | Yes (10% activation) | Triton batches requests; SM utilization is the bottleneck |
| `training` | GPU utilization (%) | 90% | No | Training jobs should saturate GPUs |
| `batch` | GPU memory (%) | 70% | Yes (1% activation) | Batch inference, aggressive scale-down |
Scale-to-zero is the killer feature for inference. A single A100 instance costs ~$3/hour, so an inference service that sits idle for 12 hours overnight wastes ~$36 per night, per GPU. keda-gpu-scaler detects the GPU idle state and scales the deployment to zero pods, and KEDA spins it back up on the first incoming request.
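Under the hood, scale-to-zero maps onto that same `IsActive` RPC from KEDA's external scaler protocol. Here is a hedged sketch of how a profile's activation threshold could drive the decision with go-nvml; the `pb` import path and the hard-coded 5% threshold are illustrative, not the project's actual code:

```go
package scaler

import (
	"context"
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
	pb "example.com/gen/externalscaler" // generated from KEDA's externalscaler.proto (path illustrative)
)

type gpuScaler struct {
	pb.UnimplementedExternalScalerServer
}

// IsActive returns false when every GPU sits below the profile's activation
// threshold; that false result is what lets KEDA scale the Deployment to zero.
// Assumes nvml.Init() succeeded at process startup.
func (s *gpuScaler) IsActive(ctx context.Context, ref *pb.ScaledObjectRef) (*pb.IsActiveResponse, error) {
	const activationPercent = 5.0 // e.g. the vllm-inference profile's 5% GPU-memory activation

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("nvml device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			continue
		}
		mem, ret := dev.GetMemoryInfo()
		if ret != nvml.SUCCESS || mem.Total == 0 {
			continue
		}
		usedPercent := float64(mem.Used) / float64(mem.Total) * 100
		if usedPercent > activationPercent {
			return &pb.IsActiveResponse{Result: true}, nil
		}
	}
	return &pb.IsActiveResponse{Result: false}, nil
}
```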
## From Docker Desktop to GPU Cluster

The workflow end-to-end:

1. Build your GPU container with `docker build` — multi-stage, non-root, minimal runtime
2. Test locally with `docker compose` — verify GPU metrics, NVML access, gRPC endpoint
3. Scan with `docker scout` — catch CVEs before pushing
4. Push to your registry
5. Deploy to Kubernetes with keda-gpu-scaler for autoscaling
Docker is the consistent runtime from development to production. Same Dockerfile, same NVML metrics, same security model — just different scale.
The project is open source and being discussed for adoption under the KEDA organization. If you're running GPU workloads on Kubernetes and want autoscaling that actually looks at the GPU, give it a try: github.com/pmady/keda-gpu-scaler.
Pavan Madduri is a Senior Cloud Platform Engineer at W.W. Grainger, Inc., CNCF Golden Kubestronaut, and Oracle ACE Associate. He maintains keda-gpu-scaler and otel-gpu-receiver, and contributed GPU NUMA topology scheduling to Volcano.