TildAlice

Posted on • Originally published at tildalice.io

Kubernetes Autoscaler for ML: Build HPA from Scratch

The Default HPA Doesn't Know When Your Model Is Struggling

The Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory. That's fine for web servers. But your ML inference service might be drowning at 40% CPU because the batch queue is backing up, or sitting idle at 80% memory simply because it's holding a huge model — the resource signal looks high without reflecting actual load.

I needed an autoscaler that understood model-specific metrics: request queue depth, P95 inference latency, GPU utilization. The built-in metrics-server doesn't expose those. You could use Prometheus Adapter, but configuring it felt like writing a second thesis after my master's degree.
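To make those model-specific metrics concrete, here's roughly what a plain-text /metrics payload could look like in Prometheus exposition format. This is my own illustrative sketch, not the article's code — the metric names are hypothetical, and it hand-formats the text rather than using a client library:

```python
# Illustrative sketch: rendering model-specific metrics in Prometheus text
# exposition format. Metric names are hypothetical, not from the article.

def render_metrics(queue_depth: int, p95_latency_s: float, gpu_util_pct: float) -> str:
    """Format three model-specific gauges as a /metrics response body."""
    lines = [
        "# HELP inference_queue_depth Requests waiting in the batch queue",
        "# TYPE inference_queue_depth gauge",
        f"inference_queue_depth {queue_depth}",
        "# HELP inference_latency_p95_seconds P95 inference latency",
        "# TYPE inference_latency_p95_seconds gauge",
        f"inference_latency_p95_seconds {p95_latency_s}",
        "# HELP gpu_utilization_percent GPU utilization",
        "# TYPE gpu_utilization_percent gauge",
        f"gpu_utilization_percent {gpu_util_pct}",
    ]
    return "\n".join(lines) + "\n"
```

In a FastAPI service this would back a plain-text GET /metrics route; in practice you'd more likely use prometheus_client than format the text by hand.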

So I built a custom autoscaler in 200 lines of Python. It watches a FastAPI inference service, reads custom metrics from /metrics, and scales the deployment up or down via the Kubernetes API. "Docker vs Kubernetes for First ML Model: When to Use Each" covered when you'd even need this — if you're still on a single Docker container, you don't.
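A rough sketch of that control loop — my own reconstruction under stated assumptions, not the article's 200 lines: poll /metrics, parse a queue-depth gauge, compute a replica count with the same proportional rule the built-in HPA uses (desired = ceil(metric / target), clamped), and patch the Deployment's scale. The metric name, target of 10 queued requests per replica, and replica bounds are all illustrative choices:

```python
import math
import time
import urllib.request

def parse_gauge(metrics_text: str, name: str) -> float:
    """Pull one gauge value out of a Prometheus text exposition payload."""
    for line in metrics_text.splitlines():
        if line.startswith(name + " "):
            return float(line.split()[-1])
    raise KeyError(f"metric {name!r} not found")

def desired_replicas(queue_depth: float, target_per_replica: float = 10.0,
                     min_r: int = 1, max_r: int = 10) -> int:
    """HPA-style proportional rule on queue depth, clamped to [min_r, max_r]."""
    return max(min_r, min(max_r, math.ceil(queue_depth / target_per_replica)))

def run(metrics_url: str, deployment: str, namespace: str, interval_s: int = 15):
    """Control loop; needs the `kubernetes` package and in-cluster credentials."""
    from kubernetes import client, config  # third-party, assumed installed
    config.load_incluster_config()
    apps = client.AppsV1Api()
    while True:
        body = urllib.request.urlopen(metrics_url).read().decode()
        replicas = desired_replicas(parse_gauge(body, "inference_queue_depth"))
        # Patch only the scale subresource, leaving the rest of the spec alone.
        apps.patch_namespaced_deployment_scale(
            deployment, namespace, {"spec": {"replicas": replicas}})
        time.sleep(interval_s)
```

Splitting the scaling decision into a pure function (`desired_replicas`) keeps the math unit-testable without a cluster; the loop itself is where you'd add the stabilization window and error handling a production controller needs.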

This isn't production-grade (no leader election, no failover), but it's a solid portfolio project that shows you understand Kubernetes controllers and ML service bottlenecks. If you're interviewing for MLOps roles, being able to walk through "I wrote a custom autoscaler because the default one couldn't see model queue depth" is a strong signal.


Continue reading the full article on TildAlice
