The Default CPU Metric Doesn't Scale Inference Pods Right
The Kubernetes Horizontal Pod Autoscaler (HPA) ships with CPU and memory metrics out of the box. Sounds great until you realize inference workloads don't behave like web servers. I've seen Triton pods sit at 30% CPU utilization while requests queue for 2+ seconds because the GPU is maxed out. The cluster thinks everything's fine. It's not.
Triton Inference Server batches requests and pipelines model stages across CPU and GPU, which makes CPU usage a terrible proxy for "is this pod overloaded?" You need to scale on what actually matters: GPU utilization, queue depth, or batch occupancy. This post walks through wiring the HPA to Triton's Prometheus metrics so your cluster scales on a signal that reflects reality.
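Triton exposes exactly those signals in Prometheus format on port 8002 (`/metrics`) by default, including counters like `nv_inference_request_success` and `nv_inference_queue_duration_us`. Here's a minimal sketch of a Prometheus scrape job that discovers Triton pods; the `app: triton` pod label is an assumption, so match it to whatever labels your deployment actually uses:

```yaml
scrape_configs:
  - job_name: triton
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled app=triton (hypothetical label -- adjust to yours).
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: triton
      # Rewrite the scrape address to Triton's default metrics port (8002).
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8002
        target_label: __address__
```

With that in place, average queue time per request is `rate(nv_inference_queue_duration_us[2m]) / rate(nv_inference_request_success[2m])`, which is the ratio we'll turn into a custom metric next.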
I'll show the full stack: Prometheus → Prometheus Adapter → HPA custom metrics → autoscaling Triton deployments. The key insight is that the HPA only sees metrics exposed through the Kubernetes metrics APIs, so you're building a pipeline from Triton's /metrics endpoint all the way to the custom.metrics.k8s.io API.
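The first link in that pipeline is a Prometheus Adapter rule that turns the raw Triton counters into a per-pod custom metric. This is a sketch against the kubernetes-sigs/prometheus-adapter rule format; the derived metric name `triton_queue_time_us_per_request` is my own, not something Triton or the adapter ships with:

```yaml
rules:
  # Find Triton queue-duration series and attach them to namespace/pod resources.
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Expose the result under a readable name in custom.metrics.k8s.io.
    name:
      matches: "nv_inference_queue_duration_us"
      as: "triton_queue_time_us_per_request"
    # Average queue microseconds per request over the last 2 minutes.
    # Idle pods (zero request rate) produce no sample, which the HPA
    # treats as a missing metric rather than a zero.
    metricsQuery: |
      sum(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
        / sum(rate(nv_inference_request_success{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
```

You can sanity-check the wiring with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/triton_queue_time_us_per_request"` before pointing an HPA at it.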
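The last piece is the HPA itself, reading that custom metric through a `Pods` metric source. Another sketch, assuming your Triton Deployment is named `triton` and that ~50 ms of average queue time is your pain threshold; both are placeholders to tune:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-queue-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton                       # assumption: your Triton Deployment's name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_queue_time_us_per_request
        target:
          type: AverageValue
          averageValue: "50000"        # 50,000 us = ~50 ms queue time per request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # don't tear pods down on a brief lull
```

Now the HPA adds replicas when queue time climbs, instead of waiting for a CPU signal that never comes.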
