The Default CPU Metric Doesn't Scale Inference Pods Right
The Kubernetes Horizontal Pod Autoscaler (HPA) ships with CPU and memory metrics out of the box. Sounds great until you realize inference workloads don't behave like web servers. I've seen Triton pods sit at 30% CPU utilization while requests queue for 2+ seconds because the GPU is maxed out. The cluster thinks everything's fine. It's not.
Triton Inference Server batches requests and pipelines model stages across CPU and GPU, which makes CPU usage a terrible proxy for "is this pod overloaded?" You need to scale on what actually matters: GPU utilization, queue depth, or batch occupancy. This post walks through wiring the HPA to Triton's Prometheus metrics so your cluster scales on a signal that reflects reality.
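Triton exposes exactly those signals in Prometheus format on port 8002 (`/metrics`) by default, including counters like `nv_inference_request_success` and `nv_inference_queue_duration_us`. Here's a minimal sketch of a Prometheus scrape job that discovers Triton pods; the `app: triton` pod label is an assumption, so match it to whatever labels your deployment actually uses:

```yaml
scrape_configs:
  - job_name: triton
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled app=triton (hypothetical label -- adjust to yours).
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: triton
      # Rewrite the scrape address to Triton's default metrics port (8002).
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8002
        target_label: __address__
```

With that in place, average queue time per request is `rate(nv_inference_queue_duration_us[2m]) / rate(nv_inference_request_success[2m])`, which is the ratio we'll turn into a custom metric next.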
I'll show the full stack: Prometheus → Prometheus Adapter → HPA custom metrics → autoscaling Triton deployments. The key insight is that the HPA only sees metrics exposed through the Kubernetes metrics APIs, so you're building a pipeline from Triton's /metrics endpoint all the way to the custom.metrics.k8s.io API.
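The first link in that pipeline is a Prometheus Adapter rule that turns the raw Triton counters into a per-pod custom metric. This is a sketch against the kubernetes-sigs/prometheus-adapter rule format; the derived metric name `triton_queue_time_us_per_request` is my own, not something Triton or the adapter ships with:

```yaml
rules:
  # Find Triton queue-duration series and attach them to namespace/pod resources.
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Expose the result under a readable name in custom.metrics.k8s.io.
    name:
      matches: "nv_inference_queue_duration_us"
      as: "triton_queue_time_us_per_request"
    # Average queue microseconds per request over the last 2 minutes.
    # Idle pods (zero request rate) produce no sample, which the HPA
    # treats as a missing metric rather than a zero.
    metricsQuery: |
      sum(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
        / sum(rate(nv_inference_request_success{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
```

You can sanity-check the wiring with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/triton_queue_time_us_per_request"` before pointing an HPA at it.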
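The last piece is the HPA itself, reading that custom metric through a `Pods` metric source. Another sketch, assuming your Triton Deployment is named `triton` and that ~50 ms of average queue time is your pain threshold; both are placeholders to tune:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-queue-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton                       # assumption: your Triton Deployment's name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton_queue_time_us_per_request
        target:
          type: AverageValue
          averageValue: "50000"        # 50,000 us = ~50 ms queue time per request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # don't tear pods down on a brief lull
```

Now the HPA adds replicas when queue time climbs, instead of waiting for a CPU signal that never comes.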
