GPU autoscaling on Kubernetes with KEDA: building an external scaler with NVML

#kubernetes #llm #devops #development

If you run vLLM, Triton, or any other inference server on Kubernetes, you have probably noticed that the HPA cannot see the GPU. Autoscaling decisions are driven by CPU and memory, while the resource that actually determines inference capacity remains invisible. A CNCF blog post published in May 2026 describes how to fix this by building a KEDA external scaler.

The problem with default autoscaling

Out of the box, the Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory metrics. For traditional web workloads, that is enough. For LLM inference, it is not: a GPU can be running at 95% utilization while the HPA sees low CPU usage and decides not to scale.
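To make the blind spot concrete, here is a small sketch of the replica formula documented for the HPA, ceil(currentReplicas * currentMetricValue / desiredMetricValue), applied once to a CPU reading and once to a GPU reading. The deployment size, the readings, and the targets are made-up numbers for illustration, and the 0.1 tolerance is the documented default rather than something taken from the CNCF post.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_value: float,
                         target_value: float, tolerance: float = 0.1) -> int:
    """Replica count from the documented HPA formula.

    Inside the tolerance band around a ratio of 1.0 the count is left unchanged.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical inference deployment: 2 replicas, GPUs saturated, CPUs nearly idle.
cpu_view = hpa_desired_replicas(2, current_value=20, target_value=60)  # CPU at 20%, target 60%
gpu_view = hpa_desired_replicas(2, current_value=95, target_value=70)  # GPU at 95%, target 70%
print(cpu_view, gpu_view)
```

Fed only the CPU reading, the formula asks for fewer replicas; fed the GPU reading, it asks for more. The same workload produces opposite decisions depending on which metric the autoscaler can see.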

KEDA (Kubernetes Event-driven Autoscaling) addresses part of this by enabling scaling on external events and custom metrics. But someone still has to read the GPU hardware metrics and expose them in a form KEDA can consume. That is the role of the external scaler.

How the external scaler works

NVML (NVIDIA Management Library) is NVIDIA's C library for reading metrics such as SM (Streaming Multiprocessor) utilization and frame buffer memory (VRAM) usage. It requires local device access via libnvidia-ml.so, which creates an important architectural constraint: the central KEDA operator runs as a pod without that local access. The scaler must therefore run on the GPU nodes themselves.
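As a rough illustration of what reading these metrics looks like, the sketch below uses pynvml, the Python bindings over NVML, as a stand-in for the Go bindings the article uses. The GpuSample type and the node_summary helper are hypothetical names introduced here, and averaging across GPUs is just one possible way to summarize a node. read_gpu_samples only works on a machine that has the NVIDIA driver and pynvml installed, so the demo at the end feeds in hard-coded samples instead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GpuSample:
    index: int
    sm_util_pct: float       # SM utilization, 0-100
    vram_used_bytes: int
    vram_total_bytes: int


def read_gpu_samples() -> List[GpuSample]:
    """Read one sample per local GPU through NVML (needs libnvidia-ml.so)."""
    import pynvml  # imported lazily so the rest of the sketch runs without a GPU

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            samples.append(GpuSample(i, float(util.gpu), mem.used, mem.total))
        return samples
    finally:
        pynvml.nvmlShutdown()


def node_summary(samples: List[GpuSample]) -> Tuple[float, float]:
    """Return (mean SM utilization %, VRAM used % across all GPUs) for one node."""
    if not samples:
        return 0.0, 0.0
    mean_util = sum(s.sm_util_pct for s in samples) / len(samples)
    used = sum(s.vram_used_bytes for s in samples)
    total = sum(s.vram_total_bytes for s in samples)
    return mean_util, 100.0 * used / total


# Hard-coded samples standing in for read_gpu_samples() on a two-GPU node.
demo = [GpuSample(0, 90.0, 30, 40), GpuSample(1, 70.0, 10, 40)]
print(node_summary(demo))
```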

The approach described in the article deploys a DaemonSet on nodes labeled nvidia.com/gpu.present: "true". Each pod in that DaemonSet uses go-nvml (Go bindings over NVML) to read metrics directly from the hardware, then serves them to the KEDA operator over gRPC by implementing the ExternalScalerServer interface with four methods:

IsActive: determines whether the workload should be active
StreamIsActive: streams active-state changes to the operator (used by external-push triggers)
GetMetricSpec: describes the metric and its target value
GetMetrics: returns the current metric value to the operator
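The division of labor between the four methods can be sketched in plain Python. This is an in-process model only, not the gRPC service: the real interface is defined by KEDA's externalscaler protobuf, and the article implements it in Go. The class name, the metric name, the 70% target, and the 5% activity threshold are all assumptions for illustration, and the stream is bounded by a polls argument so the example terminates.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List

METRIC_NAME = "gpu_sm_utilization"  # hypothetical metric name


@dataclass
class MetricSpec:
    metric_name: str
    target_size: int


@dataclass
class MetricValue:
    metric_name: str
    metric_value: int


class GpuScalerSketch:
    """In-process model of the four external-scaler methods (no gRPC)."""

    def __init__(self, read_util: Callable[[], float],
                 target_pct: int = 70, active_above_pct: float = 5.0):
        self._read_util = read_util          # e.g. a function backed by NVML
        self._target = target_pct            # assumed per-replica target
        self._active_above = active_above_pct

    def is_active(self) -> bool:
        # Active means "keep at least one replica"; idle allows scale-to-zero.
        return self._read_util() > self._active_above

    def stream_is_active(self, polls: int) -> Iterator[bool]:
        # The real method is an open-ended server stream; bounded here.
        for _ in range(polls):
            yield self.is_active()

    def get_metric_spec(self) -> List[MetricSpec]:
        # Tells the operator which metric is reported and what value to aim for.
        return [MetricSpec(METRIC_NAME, self._target)]

    def get_metrics(self, metric_name: str) -> List[MetricValue]:
        # Reports the current reading, rounded to a whole number.
        return [MetricValue(metric_name, round(self._read_util()))]


scaler = GpuScalerSketch(read_util=lambda: 82.4)  # fixed reading for the demo
print(scaler.is_active())
print(scaler.get_metric_spec())
print(scaler.get_metrics(METRIC_NAME))
```

Passing the metric reader in as a callable keeps the scaling logic testable without a GPU; on a real node it would be backed by an NVML read.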

With this in place, a ScaledObject can target a vLLM or Triton deployment and scale based on real GPU utilization or VRAM usage. Setting minReplicaCount: 0 enables scale-to-zero when the workload is idle.
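A ScaledObject for this setup might look roughly like the manifest built below, emitted as JSON (which kubectl accepts alongside YAML). The overall shape follows KEDA's ScaledObject resource and its external trigger type, where scalerAddress points at the gRPC scaler. The resource names, the address, and the metric and targetValue metadata keys are hypothetical: KEDA passes extra metadata through to the scaler, and the keys a given scaler expects are defined by that scaler.

```python
import json


def gpu_scaled_object(name: str, deployment: str, scaler_address: str,
                      min_replicas: int = 0, max_replicas: int = 4) -> dict:
    """Build a ScaledObject manifest that uses an external (gRPC) scaler."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,   # 0 enables scale-to-zero
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "external",
                "metadata": {
                    "scalerAddress": scaler_address,
                    # Hypothetical keys, passed through to the scaler as-is.
                    "metric": "gpu_sm_utilization",
                    "targetValue": "70",
                },
            }],
        },
    }


# All names and the address below are placeholders.
manifest = gpu_scaled_object(
    name="vllm-gpu-scaler",
    deployment="vllm",
    scaler_address="gpu-scaler.keda.svc.cluster.local:6000",
)
print(json.dumps(manifest, indent=2))
```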

Comparison with the full metrics stack

The alternative for surfacing GPU metrics in Kubernetes involves five components: a DCGM Exporter DaemonSet, Prometheus collecting the data, a Metrics Adapter exposing it through the Kubernetes API, KEDA consuming it via a Prometheus scaler, and the inference pods. The external scaler reduces this to two components: the NVML DaemonSet and the KEDA operator. Reading a metric takes milliseconds because it comes straight from the hardware rather than waiting on Prometheus scrape intervals; KEDA's own polling interval still applies.
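A back-of-the-envelope model shows where the time goes. If each periodic stage can add up to one full interval of delay, the worst-case age of a metric when KEDA acts on it is the sum of the intervals along the path. The interval values below are assumptions chosen for illustration (they are not figures from the CNCF post), and the model ignores network and processing time.

```python
def worst_case_staleness_s(stage_intervals_s: dict) -> float:
    """Upper bound on metric age: each periodic stage adds up to one interval."""
    return sum(stage_intervals_s.values())


# Assumed intervals, in seconds, for the Prometheus-based path.
prometheus_path = {
    "dcgm_exporter_collection": 30.0,
    "prometheus_scrape": 15.0,
    "keda_polling": 30.0,
}

# Assumed intervals for the external scaler path: NVML is read on demand.
external_scaler_path = {
    "nvml_read": 0.01,
    "keda_polling": 30.0,
}

print(worst_case_staleness_s(prometheus_path))
print(worst_case_staleness_s(external_scaler_path))
```

Under these assumptions the external scaler removes the collection and scrape delays, and the remaining delay is dominated by how often KEDA polls, which is configurable per ScaledObject through pollingInterval.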

The CNCF post points to the keda-gpu-scaler repository (github.com/pmady/keda-gpu-scaler) as a reference implementation.

What to evaluate before adopting

This solution is practical for teams already running KEDA who are expanding into inference workloads. A few concrete points to consider:

Scale-to-zero has a direct latency implication. Bringing a vLLM pod up from zero, including model loading, can take tens of seconds. If the use case requires low first-request latency, minReplicaCount should reflect that.

The DaemonSet needs GPU device access. On clusters that run the NVIDIA device plugin or enforce strict RBAC or pod security controls around device access, the DaemonSet's permissions need to be reviewed before deployment.

Finally, consider heterogeneous clusters: if GPU nodes differ in model and capability, the node selection logic and metric interpretation need to account for that.

As described in the post, this approach removes the dependency on a separate metrics stack and makes autoscaling directly responsive to real hardware state. For AI platforms growing in scale, that simplification is worth evaluating.

Source: GPU autoscaling on Kubernetes with KEDA: Building an external scaler