Pavan Madduri
GPU-Aware Autoscaling for Docker Containers: From NVML to Production

Every GPU inference container has the same problem: Kubernetes HPA can't see the GPU. You scale on CPU and memory while your GPU sits at 95% utilization, completely invisible to the autoscaler. Or worse — your GPU is idle and you're paying $3/hour for an instance doing nothing.

I built keda-gpu-scaler to fix this. It's a KEDA external scaler that reads real GPU metrics via NVIDIA NVML and drives Kubernetes autoscaling decisions — including scale-to-zero. This post covers the Docker-specific parts: how GPU metrics flow from the NVIDIA Container Toolkit through Docker to KEDA, and how to build GPU-aware containers that actually scale.

How Docker Exposes GPUs to Containers

When you run a GPU container with Docker, three layers work together:

docker run --gpus all nvidia/cuda:12.4-base nvidia-smi
  1. Docker Engine detects the --gpus flag and hands the GPU request to the NVIDIA Container Toolkit
  2. The toolkit hooks into container creation: via the nvidia-container-runtime-hook prestart hook, or by wrapping runc with nvidia-container-runtime when that runtime is selected
  3. The hook injects GPU device files (/dev/nvidia0, /dev/nvidiactl) and the NVIDIA driver libraries (including libnvidia-ml.so) into the container's filesystem

The container now has full access to NVML (NVIDIA Management Library), which exposes GPU utilization, memory usage, temperature, power draw, and more. This is the same mechanism my GPU scaler uses — each scaler pod runs on a GPU node and reads NVML metrics from the GPUs Docker has exposed to it.
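For a concrete sense of what the scaler reads, here's a minimal sketch using the github.com/NVIDIA/go-nvml bindings (illustrative only, not the project's actual code; it depends on the driver libraries the toolkit injects, so it only runs inside a GPU-enabled container or on a GPU host):

```go
// Minimal NVML read via the github.com/NVIDIA/go-nvml bindings —
// the same kind of CGO-backed call keda-gpu-scaler makes.
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, _ := nvml.DeviceGetHandleByIndex(i)
		util, _ := dev.GetUtilizationRates() // SM / memory-controller activity, percent
		mem, _ := dev.GetMemoryInfo()        // bytes
		fmt.Printf("gpu%d: util=%d%% mem=%d/%d MiB\n",
			i, util.Gpu, mem.Used>>20, mem.Total>>20)
	}
}
```

GetUtilizationRates reports activity as a percentage over the driver's last sample window, while GetMemoryInfo reports bytes; the scaler's profiles (below) key off one or the other depending on the workload.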

┌──────────────┐    gRPC     ┌───────────────────────────┐
│ KEDA Operator│────────────→│ keda-gpu-scaler (Docker)  │
│ (central pod)│             │ DaemonSet on each GPU node│
└──────────────┘             │                           │
                             │  NVML ──→ /dev/nvidia0    │
                             │  NVML ──→ /dev/nvidia1    │
                             │      (Docker-exposed)     │
                             └───────────────────────────┘
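The gRPC contract on that arrow is KEDA's external scaler protocol. Any external scaler, this one included, implements the same four RPCs from KEDA's externalscaler.proto:

```proto
service ExternalScaler {
  rpc IsActive(ScaledObjectRef) returns (IsActiveResponse) {}
  rpc StreamIsActive(ScaledObjectRef) returns (stream IsActiveResponse) {}
  rpc GetMetricSpec(ScaledObjectRef) returns (GetMetricSpecResponse) {}
  rpc GetMetrics(GetMetricsRequest) returns (GetMetricsResponse) {}
}
```

GetMetricSpec tells KEDA which metric and target to expect, GetMetrics returns the live NVML-derived value, and IsActive drives the zero-to-one decision.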

Building GPU Containers: The Dockerfile

GPU containers need CGO for NVML access. Here's the multi-stage Dockerfile I use for keda-gpu-scaler:

# Stage 1: Build
FROM golang:1.22-bookworm AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
# CGO_ENABLED=1 is required — NVML needs CGO
# This is why GPU scaling can't be a native KEDA scaler
# (KEDA builds with CGO_ENABLED=0)
RUN CGO_ENABLED=1 go build -ldflags="-s -w" -o keda-gpu-scaler ./cmd/scaler

# Stage 2: Minimal runtime
FROM nvidia/cuda:12.4-base-ubuntu22.04

# Security: non-root user. UID 65534 is already "nobody" in the Ubuntu base,
# so use it directly (useradd with a duplicate UID would fail the build)
USER 65534:65534

COPY --from=builder /app/keda-gpu-scaler /usr/local/bin/

EXPOSE 6000
ENTRYPOINT ["keda-gpu-scaler"]

Key decisions:

  • CGO_ENABLED=1 — NVML requires C bindings. This is the fundamental architectural reason keda-gpu-scaler exists as an external scaler instead of being built into KEDA core.
  • nvidia/cuda base image — provides the NVML shared libraries (libnvidia-ml.so) at runtime.
  • Non-root execution — NVML reads GPU state through the driver's device files and doesn't need root. Standard Docker security practice.
  • Multi-stage build — final image is ~150MB instead of 1.5GB (no Go toolchain, no build deps).

Docker Compose for Local GPU Development

Before deploying to Kubernetes, test the full stack locally with Docker Compose:

version: "3.8"
services:
  # The GPU scaler — reads NVML metrics, serves gRPC
  gpu-scaler:
    build: .
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "6000:6000"    # gRPC for KEDA
      - "9090:9090"    # Prometheus metrics
    environment:
      - LOG_LEVEL=debug

  # A real GPU workload to scale
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.2-1B
      --port 8000
      --max-model-len 2048

  # Prometheus to scrape GPU metrics
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
# Start the stack
docker compose up -d

# Check GPU metrics via gRPC
grpcurl -plaintext localhost:6000 externalscaler.ExternalScaler/GetMetrics

# Check the scaler's Prometheus-format metrics endpoint
curl localhost:9090/metrics | grep gpu

# Send requests to vLLM and watch GPU utilization climb
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B", "messages": [{"role": "user", "content": "Hello"}]}'

This gives you the full autoscaling feedback loop locally: vLLM serving → GPU utilization rises → scaler reports metrics → you can verify KEDA would trigger a scale-up.
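That "would trigger a scale-up" step is just the standard HPA formula KEDA hands the metric to: desired = ceil(currentReplicas × currentMetric / targetMetric). A small, self-contained sketch of the arithmetic (my illustration, not code from the scaler):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the standard HPA replica calculation that
// KEDA relies on: desired = ceil(current * currentMetric / targetMetric).
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 2 pods averaging 95% GPU memory against an 80% target → scale to 3.
	fmt.Println(desiredReplicas(2, 95, 80))
}
```

Two pods at 95% GPU memory against an 80% target round up to three; the same formula scales back down once the average drops.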

Docker Scout: Scanning GPU Container Images

GPU images are large and have deep dependency trees. A typical vLLM image pulls in CUDA, cuDNN, NCCL, Python, PyTorch, and dozens of transitive dependencies. More packages = more CVE surface.

# Scan the GPU scaler image
docker scout cves pmady/keda-gpu-scaler:latest

# Get base image recommendations
docker scout recommendations pmady/keda-gpu-scaler:latest

Common findings I've hit with GPU images:

| Issue | Cause | Fix |
| --- | --- | --- |
| High-severity OpenSSL CVE | CUDA base image uses older Ubuntu | Multi-stage build with patched base |
| Python package CVEs | Transitive deps in ML frameworks | Pin versions, use pip audit |
| Outdated CUDA libs | NVIDIA base image release lag | Use Docker Hardened Images as base |

For production GPU containers, I run Scout in CI and block merges on critical/high CVEs:

# GitHub Actions
- name: Docker Scout CVE scan
  uses: docker/scout-action@v1
  with:
    command: cves
    image: pmady/keda-gpu-scaler:${{ github.sha }}
    only-severities: critical,high
    exit-code: true  # Fail the build on findings

Pre-Built Scaling Profiles

Different GPU workloads need different scaling strategies. keda-gpu-scaler ships profiles so you don't have to figure this out yourself:

# ScaledObject for vLLM inference
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-gpu-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0    # Scale to zero when idle
  maxReplicaCount: 8
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.gpu-scaler.svc.cluster.local:6000"
        profile: "vllm-inference"
| Profile | Metric | Target | Scale-to-Zero | Why |
| --- | --- | --- | --- | --- |
| vllm-inference | GPU memory (%) | 80% | Yes (5% activation) | vLLM fills KV cache proportional to request load |
| triton-inference | GPU utilization (%) | 75% | Yes (10% activation) | Triton batches requests; SM utilization is the bottleneck |
| training | GPU utilization (%) | 90% | No | Training jobs should saturate GPUs |
| batch | GPU memory (%) | 70% | Yes (1% activation) | Batch inference; aggressive scale-down |

Scale-to-zero is the killer feature for inference. A single A100 instance costs ~$3/hour. If your inference service is idle overnight, that's $36/night wasted. keda-gpu-scaler detects GPU idle state and scales the deployment to zero pods. KEDA spins it back up on the first incoming request.
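The zero boundary uses a different check than the replica math: while the deployment sits at zero, KEDA polls the scaler's IsActive RPC, and the profile's activation threshold decides the answer. A sketch of that decision (illustrative names, not the project's code):

```go
package main

import "fmt"

// isActive mirrors the activation check behind scale-to-zero: with zero
// replicas, KEDA asks the external scaler whether the metric has crossed
// the profile's activation threshold; below it, the workload stays at zero.
func isActive(metricPercent, activationPercent float64) bool {
	return metricPercent > activationPercent
}

func main() {
	// vllm-inference profile: 5% activation threshold.
	fmt.Println(isActive(1.2, 5)) // idle GPU → stay at zero
	fmt.Println(isActive(12, 5))  // traffic arriving → wake the deployment
}
```

The gap between the activation threshold (5%) and the scaling target (80%) is deliberate: it keeps the deployment from flapping at the zero boundary.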

From Docker Desktop to GPU Cluster

The workflow end-to-end:

  1. Build your GPU container with docker build — multi-stage, non-root, minimal runtime
  2. Test locally with docker compose — verify GPU metrics, NVML access, gRPC endpoint
  3. Scan with docker scout — catch CVEs before pushing
  4. Push to your registry
  5. Deploy to Kubernetes with keda-gpu-scaler for autoscaling

Docker is the consistent runtime from development to production. Same Dockerfile, same NVML metrics, same security model — just different scale.

The project is open source and being discussed for adoption under the KEDA organization. If you're running GPU workloads on Kubernetes and want autoscaling that actually looks at the GPU, give it a try: github.com/pmady/keda-gpu-scaler.


Pavan Madduri is a Senior Cloud Platform Engineer at W.W. Grainger, Inc., CNCF Golden Kubestronaut, and Oracle ACE Associate. He maintains keda-gpu-scaler and otel-gpu-receiver, and contributed GPU NUMA topology scheduling to Volcano.
