Why vLLM autoscaling on Kubernetes breaks (and what to use instead)

#devops #kubernetes #ai #machinelearning

If you deploy vLLM on Kubernetes and reach for the standard HPA-on-CPU autoscaling, you will ship something that looks fine in testing and falls apart under real traffic.
Here is why, and what to do instead.

The problem: Kubernetes can't see your inference load

HPA scales on CPU and memory by default. Both are useless signals for LLM inference.
CPU stays low because the GPU does the work. A vLLM pod serving zero requests and one serving 100 show nearly identical CPU.
Memory is no better. HPA's memory metric is host RAM, not GPU memory, and GPU memory would not help anyway: vLLM pre-allocates it for the KV cache at startup, so it never moves and never triggers a scaling decision.
So under heavy load, your vLLM deployment looks idle to Kubernetes while requests pile up inside the engine and latency climbs. The autoscaler has no idea anything is wrong.

The fix: autoscale on the metrics that reflect real load

The signals that matter live inside vLLM and are exported on its Prometheus endpoint:

vllm:num_requests_waiting : how many requests are queued.
vllm:gpu_cache_usage_perc : how full the KV cache is.
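
Neither metric helps unless Prometheus is actually scraping the vLLM pods. A minimal scrape config might look like the sketch below. It assumes the pods carry an app=vllm-inference label (hypothetical) and that vLLM serves /metrics on its API port, 8000 by default; adjust both to your setup.

```yaml
# prometheus.yml fragment (sketch, not a drop-in config).
scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [llm-inference]
    relabel_configs:
      # Keep only pods labelled app=vllm-inference (assumed label).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: vllm-inference
        action: keep
      # Point the scrape at the vLLM API port, which also serves /metrics.
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:8000
```

If you run the Prometheus Operator, a ServiceMonitor or PodMonitor selecting the same pods does the same job.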

Wire these into KEDA rather than a CPU-based HPA:

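```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
  # Queue depth: target an average of 10 waiting requests per replica.
  # With the default metricType (AverageValue) the HPA divides the query
  # result by the replica count, so the query returns the total across pods.
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_depth
      threshold: "10"
      query: sum(vllm:num_requests_waiting)
  # KV cache: the query is already a per-pod average (0 to 1), so compare it
  # to the threshold directly instead of dividing by the replica count again.
  - type: prometheus
    metricType: Value
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_kv_cache
      threshold: "0.8"
      query: avg(vllm:gpu_cache_usage_perc)
```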

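Slow scale-down is not optional. vLLM cold start takes minutes, so an autoscaler that scales down quickly thrashes: it removes a replica, then immediately adds it back, paying the cold-start cost on repeat. Note that KEDA's cooldownPeriod: 300 only governs the final step down to zero replicas. With minReplicaCount: 1, scale-down between 8 and 1 is paced by the underlying HPA's stabilization window, which also defaults to 300 seconds.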
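
To pace scale-down between replicas explicitly, KEDA passes HPA behavior settings through under advanced.horizontalPodAutoscalerConfig. A sketch, with illustrative values that are my assumptions rather than tuned numbers:

```yaml
# ScaledObject fragment (sketch): sits under spec, alongside triggers.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Act only on the highest recommendation seen in the last 5 minutes.
          stabilizationWindowSeconds: 300
          policies:
          # Remove at most one pod every 2 minutes (illustrative values).
          - type: Pods
            value: 1
            periodSeconds: 120
```

The slow one-pod-at-a-time policy trades some idle GPU time for fewer cold starts; whether that trade is worth it depends on how long your model takes to load.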

Three more things that bite in production

Cold start. Pre-cache model weights on a ReadOnlyMany PVC. Downloading a 7B model from HuggingFace on every scale-up adds 2-5 minutes. Mounting from a pre-populated PVC cuts that to 30-60 seconds. Set a startupProbe with a high failureThreshold so Kubernetes doesn't kill the pod mid-load.
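KV cache OOM. The most common vLLM crash. Size GPU memory as params_B × 2 GB for FP16 weights, plus about 25% on top for the KV cache. A 7B model, for example, needs roughly 14 GB for weights and about 17.5 GB in total. Set --gpu-memory-utilization 0.85 for headroom. 0.95 will OOM under concurrent load.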
Preemption. When the KV cache fills, vLLM silently preempts older requests to make room. P99 latency can spike 8x. Alert on rate(vllm:num_preemptions_total[1m]) > 0.05 for 30s. It's the earliest warning you'll get.
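
The cold-start and memory settings above can be sketched as a pod template fragment. The PVC name, mount path, and image tag are hypothetical, and the model argument is left out on purpose: point vLLM at a directory under the mount so it loads from disk instead of downloading.

```yaml
# Deployment pod template fragment (sketch, not a complete manifest).
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest     # pin a real tag in production
    args:
    - --gpu-memory-utilization=0.85    # headroom, per the sizing note above
    # plus your model argument, pointing at a directory under /models
    ports:
    - containerPort: 8000
    volumeMounts:
    - name: model-cache
      mountPath: /models
      readOnly: true
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
      failureThreshold: 60             # 60 x 10s = up to 10 minutes to load
  volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: vllm-model-cache      # pre-populated ReadOnlyMany PVC (assumed name)
      readOnly: true
```

The startupProbe budget should comfortably exceed your worst observed load time; liveness and readiness probes only start once it succeeds.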

The metric that matters most

Watch preemption rate. Standard Kubernetes monitoring (CPU, memory, restarts) tells you nothing about inference health. Preemption rate precedes the latency spikes your users actually feel. If you alert on one vLLM metric, alert on that one.
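
As a Prometheus alerting rule, the preemption alert described above might look like this. The group name, alert name, and severity label are placeholders of mine; the expression and duration are the ones from this post.

```yaml
# Prometheus rule file fragment (sketch).
groups:
- name: vllm
  rules:
  - alert: VLLMPreemptionsRising
    # rate() needs at least two samples in the 1m window,
    # so this assumes a scrape interval of 30s or less.
    expr: rate(vllm:num_preemptions_total[1m]) > 0.05
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "vLLM is preempting requests: the KV cache is full"
```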

I wrote the full version with the GPU memory sizing rules, Spot vs On-Demand node strategy, cold-start mitigation including NVIDIA Dynamo Snapshot, the four metrics to monitor, and a production checklist over on our blog: [https://thegoodshell.com/vllm-kubernetes/]
What signals are you autoscaling LLM inference on? Curious if anyone has found something better than queue depth + KV cache.

Top comments (1)

Max Quimby • Jun 21

This matches what we learned the expensive way: CPU/memory HPA on a vLLM pod is basically autoscaling on a constant. num_requests_waiting is the right primary trigger, and the cooldownPeriod warning deserves the bold you gave it — cold-start thrash is the single most common KEDA misconfiguration I've seen on GPU workloads. One thing I'd add from production: queue depth alone can lag the user-visible problem, because the KV cache can read "healthy" while p95 latency is already falling off a cliff on long-context requests. Pairing num_requests_waiting with a TTFT (time-to-first-token) p95 trigger catches that earlier, since TTFT degrades before the queue visibly backs up. The PVC pre-caching tip is gold — the other lever is keeping a low-priority "pause" pod parked so scale-up can preempt it and skip part of the cold start. Curious whether you've tried scaling on a blended signal vs two independent triggers, and whether KEDA's OR-semantics across triggers caused any flapping for you.