DEV Community

Sonia

Posted on • Originally published at thegoodshell.com

Why vLLM autoscaling on Kubernetes breaks (and what to use instead)

If you deploy vLLM on Kubernetes and reach for the standard HPA-on-CPU autoscaling, you will ship something that looks fine in testing and falls apart under real traffic.
Here is why, and what to do instead.

The problem: Kubernetes can't see your inference load

HPA scales on CPU and memory by default. Both are useless signals for LLM inference.
CPU stays low because the GPU does the work. A vLLM pod serving zero requests and one serving 100 show nearly identical CPU.
Memory is no better. HPA's memory metric is container RAM, not GPU memory, and even if you export GPU memory it stays flat because vLLM pre-allocates it for the KV cache at startup. It never moves, so it never triggers a scaling decision.
So under heavy load, your vLLM deployment looks idle to Kubernetes while requests pile up inside the engine and latency climbs. The autoscaler has no idea anything is wrong.

The fix: autoscale on the metrics that reflect real load

The signals that matter live inside vLLM and are exported on its Prometheus endpoint:

vllm:num_requests_waiting: how many requests are queued.
vllm:gpu_cache_usage_perc: how full the KV cache is, as a fraction from 0 to 1.
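
To sanity-check what a pod is reporting before you build autoscaling on it, you can read these two gauges straight from the Prometheus text format. The snippet below is a minimal sketch: the sample text, the model_name label value, and the read_metric helper are all made up for illustration, and in practice you would fetch the text from the server's /metrics endpoint.

```python
import re

# Illustrative sample of the Prometheus text format. The label value and
# the numbers are invented for this sketch, not captured from a real server.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo-model"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo-model"} 0.83
"""

# metric name, optional {labels}, then the sample value
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(\S+)')


def read_metric(text: str, name: str) -> list[float]:
    """Return every sample value for `name`, one per label set (hypothetical helper)."""
    values = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE.match(line)
        if m and m.group(1) == name:
            values.append(float(m.group(2)))
    return values


if __name__ == "__main__":
    print("waiting:", sum(read_metric(SAMPLE, "vllm:num_requests_waiting")))
    print("kv cache:", read_metric(SAMPLE, "vllm:gpu_cache_usage_perc"))
```

If the waiting count climbs while CPU stays flat, you are looking at exactly the blind spot described above.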

Wire these into KEDA rather than a plain CPU-based HPA. KEDA runs the Prometheus queries and drives an HPA for you:

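```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    name: vllm-inference          # the vLLM Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
  # Trigger 1: queued requests per ready replica
  - type: prometheus
    metricType: Value             # query is already per-replica; the default (AverageValue) would divide by replicas again
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_queue_depth
      threshold: "10"
      # sum() reads the value of the ready-replica gauge; count() would count series, which is always 1
      query: |
        sum(vllm:num_requests_waiting) /
        sum(kube_deployment_status_replicas_ready{deployment="vllm-inference"})
  # Trigger 2: average KV cache fill across pods (0 to 1)
  - type: prometheus
    metricType: Value
    metadata:
      serverAddress: http://prometheus:9090
      metricName: vllm_kv_cache
      threshold: "0.8"
      query: avg(vllm:gpu_cache_usage_perc)
```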

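A long scale-down delay is not optional. vLLM cold start takes minutes, so a deployment that scales down and then immediately back up pays the cold-start cost on repeat. Note that KEDA's cooldownPeriod: 300 only governs the final step down to zero replicas. With minReplicaCount: 1, scale-down between 1 and 8 replicas follows the HPA's scale-down stabilization window, which also defaults to 300 seconds and can be tuned under advanced.horizontalPodAutoscalerConfig.behavior in the ScaledObject.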
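
To see how a threshold turns into a replica count, here is a rough sketch of the HPA scaling rule, desired = ceil(current × metric / target), with the default 10% tolerance. It assumes each trigger's query result is treated as a per-replica value, and it ignores details the real controller handles (unready pods, stabilization windows, scaling policies). The function name and defaults are mine, chosen to match the limits above.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 8,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale in proportion to metric / target."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do nothing
    proposed = math.ceil(current * ratio)
    return max(min_r, min(max_r, proposed))  # clamp to the configured bounds


if __name__ == "__main__":
    # 2 replicas, 25 queued requests per replica, target 10
    by_queue = desired_replicas(2, 25, 10)
    # 2 replicas, KV cache 90% full on average, target 0.8
    by_cache = desired_replicas(2, 0.9, 0.8)
    # With several triggers, the largest proposal wins.
    print(max(by_queue, by_cache))
```

The practical reading: a threshold of 10 means "add replicas until each one has about 10 requests waiting", so pick it based on how much queueing your latency budget tolerates.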

Three more things that bite in production

Cold start. Pre-cache model weights on a ReadOnlyMany PVC. Downloading a 7B model from HuggingFace on every scale-up adds 2-5 minutes. Mounting from a pre-populated PVC cuts that to 30-60 seconds. Set a startupProbe with a high failureThreshold so Kubernetes doesn't kill the pod mid-load.
KV cache OOM. The most common vLLM crash. Size GPU memory as params_B × 2 GB for FP16 weights (params_B is the parameter count in billions), plus 25% for the cache, and set --gpu-memory-utilization 0.85 for headroom. 0.95 will OOM under concurrent load.
Preemption. When the KV cache fills, vLLM silently preempts older requests to make room. P99 latency can spike 8x. Alert on rate(vllm:num_preemptions_total[1m]) > 0.05 for 30s. It's the earliest warning you'll get.
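
The sizing rule from the KV cache item is easy to turn into a quick check. This is a back-of-the-envelope sketch, not a capacity planner: it assumes the 25% is taken on top of the weight size, and the function names and GPU sizes are just examples.

```python
def estimate_gb(params_b: float, cache_overhead: float = 0.25) -> float:
    """FP16 weights at 2 bytes per parameter, plus a KV cache allowance."""
    weights_gb = params_b * 2          # e.g. 7B parameters -> 14 GB
    return weights_gb * (1 + cache_overhead)


def fits(params_b: float, gpu_gb: float, utilization: float = 0.85) -> bool:
    """Compare the estimate with the share of GPU memory vLLM may use."""
    budget_gb = gpu_gb * utilization   # mirrors --gpu-memory-utilization
    return estimate_gb(params_b) <= budget_gb


if __name__ == "__main__":
    print(estimate_gb(7))      # 17.5
    print(fits(7, 24))         # 7B on a 24 GB card
    print(fits(13, 24))        # 13B on the same card
```

Treat the result as a floor. Long contexts and high concurrency push real KV cache demand well past a flat 25%.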

The metric that matters most

Watch preemption rate. Standard Kubernetes monitoring (CPU, memory, restarts) tells you nothing about inference health. Preemption rate precedes the latency spikes your users actually feel. If you alert on one vLLM metric, alert on that one.
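
If you want to reason about what that alert condition does before writing the rule, here is a small sketch of the same logic: a per-second rate over a 1-minute window that must stay above 0.05 for 30 seconds. It is a simplification of what Prometheus does (no counter-reset handling, no extrapolation), and the function names, the 15-second step, and the sample data are assumptions for illustration.

```python
def per_second_rate(samples, now, window=60.0):
    """Rough rate(counter[window]) at `now` from sorted (timestamp, value) pairs."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0) if t1 > t0 else 0.0


def should_alert(samples, now, threshold=0.05, hold=30.0, step=15.0):
    """True only if the rate exceeded `threshold` at every check in the last `hold` seconds."""
    checks = [now - i * step for i in range(int(hold // step) + 1)]
    return all(per_second_rate(samples, t) > threshold for t in checks)


if __name__ == "__main__":
    times = range(0, 121, 15)
    # Sustained preemptions: 3 every 15 seconds (0.2 per second)
    sustained = [(t, t / 5) for t in times]
    # Preemptions that only started in the last 30 seconds
    recent = [(t, max(0, (t - 90) * 0.4)) for t in times]
    print(should_alert(sustained, now=120))
    print(should_alert(recent, now=120))
```

The hold period is what keeps a single brief burst from paging you, while a sustained climb still fires within about half a minute.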

I wrote the full version with the GPU memory sizing rules, Spot vs On-Demand node strategy, cold-start mitigation including NVIDIA Dynamo Snapshot, the four metrics to monitor, and a production checklist over on our blog: https://thegoodshell.com/vllm-kubernetes/
What signals are you autoscaling LLM inference on? Curious if anyone has found something better than queue depth + KV cache.
