Pavan Madduri
GPU-Aware Autoscaling for Docker Containers: From NVML to Production

Every GPU inference container has the same problem: Kubernetes HPA can't see the GPU. You scale on CPU and memory while your GPU sits at 95% utilization, completely invisible to the autoscaler. Or worse — your GPU is idle and you're paying $3/hour for an instance doing nothing.

I built keda-gpu-scaler to fix this. It's a KEDA external scaler that reads real GPU metrics via NVIDIA NVML and drives Kubernetes autoscaling decisions — including scale-to-zero. This post covers the Docker-specific parts: how GPU metrics flow from the NVIDIA Container Toolkit through Docker to KEDA, and how to build GPU-aware containers that actually scale.

How Docker Exposes GPUs to Containers

When you run a GPU container with Docker, three layers work together:

docker run --gpus all nvidia/cuda:12.4-base nvidia-smi
  1. Docker Engine detects the --gpus flag and hands the GPU request to the NVIDIA Container Toolkit
  2. The toolkit hooks into container creation: via the nvidia-container-runtime-hook prestart hook, or by wrapping runc with nvidia-container-runtime when that runtime is selected
  3. The hook injects GPU device files (/dev/nvidia0, /dev/nvidiactl) and the NVIDIA driver libraries (including libnvidia-ml.so) into the container's filesystem

The container now has full access to NVML (NVIDIA Management Library), which exposes GPU utilization, memory usage, temperature, power draw, and more. This is the same mechanism my GPU scaler uses — each scaler pod runs on a GPU node and reads NVML metrics from the GPUs Docker has exposed to it.
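For a concrete sense of what the scaler reads, here's a minimal sketch using the github.com/NVIDIA/go-nvml bindings (illustrative only, not the project's actual code; it depends on the driver libraries the toolkit injects, so it only runs inside a GPU-enabled container or on a GPU host):

```go
// Minimal NVML read via the github.com/NVIDIA/go-nvml bindings —
// the same kind of CGO-backed call keda-gpu-scaler makes.
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		dev, _ := nvml.DeviceGetHandleByIndex(i)
		util, _ := dev.GetUtilizationRates() // SM / memory-controller activity, percent
		mem, _ := dev.GetMemoryInfo()        // bytes
		fmt.Printf("gpu%d: util=%d%% mem=%d/%d MiB\n",
			i, util.Gpu, mem.Used>>20, mem.Total>>20)
	}
}
```

GetUtilizationRates reports activity as a percentage over the driver's last sample window, while GetMemoryInfo reports bytes; the scaler's profiles (below) key off one or the other depending on the workload.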

┌──────────────┐    gRPC     ┌───────────────────────────┐
│ KEDA Operator│────────────→│ keda-gpu-scaler (Docker)  │
│ (central pod)│             │ DaemonSet on each GPU node│
└──────────────┘             │                           │
                             │  NVML ──→ /dev/nvidia0    │
                             │  NVML ──→ /dev/nvidia1    │
                             │      (Docker-exposed)     │
                             └───────────────────────────┘
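The gRPC contract on that arrow is KEDA's external scaler protocol. Any external scaler, this one included, implements the same four RPCs from KEDA's externalscaler.proto:

```proto
service ExternalScaler {
  rpc IsActive(ScaledObjectRef) returns (IsActiveResponse) {}
  rpc StreamIsActive(ScaledObjectRef) returns (stream IsActiveResponse) {}
  rpc GetMetricSpec(ScaledObjectRef) returns (GetMetricSpecResponse) {}
  rpc GetMetrics(GetMetricsRequest) returns (GetMetricsResponse) {}
}
```

GetMetricSpec tells KEDA which metric and target to expect, GetMetrics returns the live NVML-derived value, and IsActive drives the zero-to-one decision.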

Building GPU Containers: The Dockerfile

GPU containers need CGO for NVML access. Here's the multi-stage Dockerfile I use for keda-gpu-scaler:

# Stage 1: Build
FROM golang:1.22-bookworm AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
# CGO_ENABLED=1 is required — NVML needs CGO
# This is why GPU scaling can't be a native KEDA scaler
# (KEDA builds with CGO_ENABLED=0)
RUN CGO_ENABLED=1 go build -ldflags="-s -w" -o keda-gpu-scaler ./cmd/scaler

# Stage 2: Minimal runtime
FROM nvidia/cuda:12.4-base-ubuntu22.04

# Security: non-root user. UID 65534 is already "nobody" in the Ubuntu base,
# so use it directly (useradd with a duplicate UID would fail the build)
USER 65534:65534

COPY --from=builder /app/keda-gpu-scaler /usr/local/bin/

EXPOSE 6000
ENTRYPOINT ["keda-gpu-scaler"]

Key decisions:

  • CGO_ENABLED=1 — NVML requires C bindings. This is the fundamental architectural reason keda-gpu-scaler exists as an external scaler instead of being built into KEDA core.
  • nvidia/cuda base image — provides the NVML shared libraries (libnvidia-ml.so) at runtime.
  • Non-root execution — NVML reads GPU state through the driver's device files and doesn't need root. Standard Docker security practice.
  • Multi-stage build — final image is ~150MB instead of 1.5GB (no Go toolchain, no build deps).

Docker Compose for Local GPU Development

Before deploying to Kubernetes, test the full stack locally with Docker Compose:

version: "3.8"
services:
  # The GPU scaler — reads NVML metrics, serves gRPC
  gpu-scaler:
    build: .
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "6000:6000"    # gRPC for KEDA
      - "9090:9090"    # Prometheus metrics
    environment:
      - LOG_LEVEL=debug

  # A real GPU workload to scale
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.2-1B
      --port 8000
      --max-model-len 2048

  # Prometheus to scrape GPU metrics
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
# Start the stack
docker compose up -d

# Check GPU metrics via gRPC
grpcurl -plaintext localhost:6000 externalscaler.ExternalScaler/GetMetrics

# Check the scaler's Prometheus-format metrics endpoint
curl localhost:9090/metrics | grep gpu

# Send requests to vLLM and watch GPU utilization climb
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B", "messages": [{"role": "user", "content": "Hello"}]}'

This gives you the full autoscaling feedback loop locally: vLLM serving → GPU utilization rises → scaler reports metrics → you can verify KEDA would trigger a scale-up.
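That "would trigger a scale-up" step is just the standard HPA formula KEDA hands the metric to: desired = ceil(currentReplicas × currentMetric / targetMetric). A small, self-contained sketch of the arithmetic (my illustration, not code from the scaler):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas implements the standard HPA replica calculation that
// KEDA relies on: desired = ceil(current * currentMetric / targetMetric).
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 2 pods averaging 95% GPU memory against an 80% target → scale to 3.
	fmt.Println(desiredReplicas(2, 95, 80))
}
```

Two pods at 95% GPU memory against an 80% target round up to three; the same formula scales back down once the average drops.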

Docker Scout: Scanning GPU Container Images

GPU images are large and have deep dependency trees. A typical vLLM image pulls in CUDA, cuDNN, NCCL, Python, PyTorch, and dozens of transitive dependencies. More packages = more CVE surface.

# Scan the GPU scaler image
docker scout cves pmady/keda-gpu-scaler:latest

# Get base image recommendations
docker scout recommendations pmady/keda-gpu-scaler:latest

Common findings I've hit with GPU images:

| Issue | Cause | Fix |
| --- | --- | --- |
| High-severity OpenSSL CVE | CUDA base image uses older Ubuntu | Multi-stage build with patched base |
| Python package CVEs | Transitive deps in ML frameworks | Pin versions, use pip audit |
| Outdated CUDA libs | NVIDIA base image release lag | Use Docker Hardened Images as base |

For production GPU containers, I run Scout in CI and block merges on critical/high CVEs:

# GitHub Actions
- name: Docker Scout CVE scan
  uses: docker/scout-action@v1
  with:
    command: cves
    image: pmady/keda-gpu-scaler:${{ github.sha }}
    only-severities: critical,high
    exit-code: true  # Fail the build on findings

Pre-Built Scaling Profiles

Different GPU workloads need different scaling strategies. keda-gpu-scaler ships profiles so you don't have to figure this out yourself:

# ScaledObject for vLLM inference
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-gpu-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 0    # Scale to zero when idle
  maxReplicaCount: 8
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.gpu-scaler.svc.cluster.local:6000"
        profile: "vllm-inference"
| Profile | Metric | Target | Scale-to-Zero | Why |
| --- | --- | --- | --- | --- |
| vllm-inference | GPU memory (%) | 80% | Yes (5% activation) | vLLM fills KV cache proportional to request load |
| triton-inference | GPU utilization (%) | 75% | Yes (10% activation) | Triton batches requests; SM utilization is the bottleneck |
| training | GPU utilization (%) | 90% | No | Training jobs should saturate GPUs |
| batch | GPU memory (%) | 70% | Yes (1% activation) | Batch inference; aggressive scale-down |

Scale-to-zero is the killer feature for inference. A single A100 instance costs ~$3/hour. If your inference service is idle overnight, that's $36/night wasted. keda-gpu-scaler detects GPU idle state and scales the deployment to zero pods. KEDA spins it back up on the first incoming request.
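The zero boundary uses a different check than the replica math: while the deployment sits at zero, KEDA polls the scaler's IsActive RPC, and the profile's activation threshold decides the answer. A sketch of that decision (illustrative names, not the project's code):

```go
package main

import "fmt"

// isActive mirrors the activation check behind scale-to-zero: with zero
// replicas, KEDA asks the external scaler whether the metric has crossed
// the profile's activation threshold; below it, the workload stays at zero.
func isActive(metricPercent, activationPercent float64) bool {
	return metricPercent > activationPercent
}

func main() {
	// vllm-inference profile: 5% activation threshold.
	fmt.Println(isActive(1.2, 5)) // idle GPU → stay at zero
	fmt.Println(isActive(12, 5))  // traffic arriving → wake the deployment
}
```

The gap between the activation threshold (5%) and the scaling target (80%) is deliberate: it keeps the deployment from flapping at the zero boundary.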

From Docker Desktop to GPU Cluster

The workflow end-to-end:

  1. Build your GPU container with docker build — multi-stage, non-root, minimal runtime
  2. Test locally with docker compose — verify GPU metrics, NVML access, gRPC endpoint
  3. Scan with docker scout — catch CVEs before pushing
  4. Push to your registry
  5. Deploy to Kubernetes with keda-gpu-scaler for autoscaling

Docker is the consistent runtime from development to production. Same Dockerfile, same NVML metrics, same security model — just different scale.

The project is open source and being discussed for adoption under the KEDA organization. If you're running GPU workloads on Kubernetes and want autoscaling that actually looks at the GPU, give it a try: github.com/pmady/keda-gpu-scaler.


Pavan Madduri is a Senior Cloud Platform Engineer at W.W. Grainger, Inc., CNCF Golden Kubestronaut, and Oracle ACE Associate. He maintains keda-gpu-scaler and otel-gpu-receiver, and contributed GPU NUMA topology scheduling to Volcano.
