This is a multi-part blog series. In the first part, I covered an operator’s view of the Kubernetes resource paradox: why most clusters waste 40–60% of their capacity, how resource requests really work, and why overprovisioning is a rational response to fear, not incompetence. In the second part, I explained why Kubernetes resource overprovisioning happens, how it quietly inflates cloud costs, and what real-world strategies DevOps teams use to regain control over CPU, memory, and GPU usage.
The Horizontal Pod Autoscaler is often treated as Kubernetes’ automatic scaling solution, but in reality it only works when requests, metrics, and workload behavior are well understood. This deep dive explains why autoscaling frequently fails in production and how to design scaling strategies that actually work at scale.
By the time most teams adopt autoscaling in Kubernetes, they’ve already run into the limitations of static resource allocation. Traffic fluctuates, workloads behave unpredictably, and the idea of manually adjusting replica counts quickly becomes unrealistic. Autoscaling promises a cleaner solution: let the platform react dynamically to demand.
The Horizontal Pod Autoscaler (HPA) is often introduced as the answer to this problem. Configure a target CPU utilization, set minimum and maximum replicas, and Kubernetes will automatically adjust the number of pods as load changes.
On paper, it sounds like the perfect system.
In reality, autoscaling is one of the most misunderstood parts of Kubernetes. Many teams assume that once HPA is enabled, resource efficiency and scaling problems will take care of themselves. Instead, what often happens is the opposite: autoscaling amplifies bad assumptions about requests, workload behavior, and metrics. Clusters become harder to reason about, scaling events become unpredictable, and the root problems that caused overprovisioning in the first place remain untouched.
Autoscaling is powerful, but only when the underlying signals are trustworthy.
How Horizontal Pod Autoscaling Actually Works
The Horizontal Pod Autoscaler doesn’t measure “load” in the abstract. It calculates scaling decisions based on utilization relative to the container’s requested resources.
For CPU-based scaling, the formula is essentially:
Current Utilization = Actual CPU Usage / CPU Request
If the current utilization exceeds the target threshold, Kubernetes increases the number of replicas. If it falls below the threshold, replicas are reduced.
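The replica calculation itself is simple enough to sketch in a few lines. This is a simplified version of what the HPA controller computes each sync interval (it ignores tolerance bands and stabilization windows, which the real controller also applies):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    # Core HPA calculation:
    # desired = ceil(current * currentMetric / targetMetric)
    ratio = current_utilization / target_utilization
    return math.ceil(current_replicas * ratio)

# 4 pods averaging 80% utilization against a 50% target -> scale up to 7
print(desired_replicas(4, 0.80, 0.50))

# 6 pods averaging 25% utilization against a 50% target -> scale down to 3
print(desired_replicas(6, 0.25, 0.50))
```

Note that both inputs are percentages of the *request*, which is exactly where the hidden dependency lives.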
At first glance, this seems logical. But notice the dependency hidden in that equation: CPU requests are part of the calculation. If requests are inaccurate, the utilization signal becomes distorted.
Imagine a container that consistently uses around 500 millicores of CPU but has a request of 2000 millicores. The autoscaler will see utilization of only 25 percent, even if the application is under significant real-world load. Because the utilization appears low, scaling will not occur when it should.
In effect, the autoscaler becomes blind to demand.
This is why autoscaling often fails quietly in clusters where requests have been inflated as a safety buffer. The autoscaler is working correctly; it’s simply responding to incorrect inputs.
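The distortion is easy to demonstrate numerically. Below, the same container under the same real load (500 millicores consumed) produces opposite scaling decisions depending only on how the request was set; the 600m and 2000m request values are illustrative:

```python
import math

def utilization(actual_usage_m: int, request_m: int) -> float:
    # HPA computes utilization relative to the container's *request*,
    # not against node capacity or any absolute measure of load.
    return actual_usage_m / request_m

def desired(current_replicas: int, util: float, target: float = 0.70) -> int:
    return math.ceil(current_replicas * util / target)

usage = 500  # millicores actually consumed under heavy load

honest = utilization(usage, 600)     # request close to real usage: ~83%
inflated = utilization(usage, 2000)  # request padded as a safety buffer: 25%

print(desired(3, honest))    # scales up to 4 replicas
print(desired(3, inflated))  # scales *down* to 2 replicas
```

Same workload, same pressure: the honest request triggers a scale-up, while the inflated request tells the autoscaler to remove capacity.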
Why Autoscaling Often Makes Overprovisioning Worse
Once teams realize that autoscaling is not reacting quickly enough, they tend to compensate in ways that make the situation worse.
A common response is to increase baseline replica counts. Instead of running two or three pods and letting the autoscaler expand as needed, teams start with ten or fifteen replicas just to avoid scaling delays. While this improves perceived reliability, it eliminates much of the cost benefit autoscaling was meant to provide.
Another reaction is to inflate resource requests further. If scaling triggers depend on utilization percentages, increasing requests might seem like a way to create more headroom. In practice, this makes scaling signals even less accurate and pushes the cluster toward earlier node scale-outs.
Over time, the autoscaler becomes more of a safety mechanism than an efficiency tool. It prevents catastrophic overload but does little to improve resource usage.
Scaling Latency Is the Hidden Constraint
Even when requests are accurate and autoscaling signals are correct, scaling is not instantaneous.
Adding replicas involves several steps: the autoscaler must observe the metric change, compute a new replica count, update the deployment, schedule new pods, and wait for those pods to become ready. In clusters where nodes must also be provisioned by the cluster autoscaler, the delay can be even longer.
These delays are not bugs. They are fundamental properties of distributed systems.
The implication is that autoscaling works best when it responds to gradual changes in demand, not sudden traffic spikes. Workloads that experience abrupt surges often require a different strategy, such as maintaining a slightly higher baseline replica count or scaling based on predictive signals rather than purely reactive metrics.
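A rough latency budget makes the constraint concrete. The stage durations below are illustrative assumptions, not measurements from any specific cluster, but the serial structure is what matters:

```python
def scale_up_latency(stages: dict, needs_new_node: bool) -> int:
    """Sum the serial delays between a traffic spike and new pods
    actually serving traffic. All numbers are assumed examples."""
    total = sum(s for name, s in stages.items()
                if name != "node provisioning")
    if needs_new_node:
        total += stages["node provisioning"]
    return total

STAGES = {  # seconds, illustrative
    "metric scrape + HPA sync": 30,
    "schedule new pods": 5,
    "image pull + container start": 40,
    "readiness probe passes": 15,
    "node provisioning": 120,
}

print(scale_up_latency(STAGES, needs_new_node=False))  # ~90s with spare capacity
print(scale_up_latency(STAGES, needs_new_node=True))   # ~210s if nodes must be added
```

If a traffic spike peaks and subsides faster than that total, reactive autoscaling cannot help, no matter how well it is tuned.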
Teams that assume autoscaling can instantly absorb any spike often discover the limits of that assumption during incidents.
Vertical Scaling: The Quiet Companion to Horizontal Autoscaling
While horizontal scaling adjusts replica counts, vertical scaling focuses on correcting resource requests themselves. This is where the Vertical Pod Autoscaler (VPA) enters the picture.
VPA analyzes historical resource usage and suggests more appropriate requests for CPU and memory. Instead of adding more pods, it attempts to right-size the pods that already exist.
In practice, VPA is most effective when used cautiously. Fully automated vertical scaling can lead to disruptive restarts, which is why many organizations run VPA in “recommendation mode.” In this configuration, the system provides insights about resource usage without automatically applying changes.
This mode turns VPA into something more valuable than automation: it becomes a feedback mechanism. Platform teams can see which workloads are dramatically over-requested and begin the process of gradual correction.
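The idea behind such a recommendation can be sketched simply: take a high percentile of observed usage and add a safety margin. The real VPA recommender uses decaying histograms and separate bounds, so this is only an illustration of the concept, with assumed percentile and margin values:

```python
def recommend_request(usage_samples_m: list[int],
                      percentile: float = 0.90,
                      safety_margin: float = 1.15) -> int:
    """VPA-style right-sizing sketch: a high percentile of observed
    usage (millicores) plus a margin, instead of a guessed buffer."""
    ordered = sorted(usage_samples_m)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return round(ordered[idx] * safety_margin)

# A container requested at 2000m, but observed usage tells another story:
samples = [180, 210, 250, 300, 320, 350, 400, 420, 480, 520]
print(recommend_request(samples))  # recommends ~598m, far below 2000m
```

Even this crude calculation exposes the gap between what was requested and what the workload actually needs.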
Horizontal scaling handles demand variability, while vertical scaling corrects historical misallocation. The two approaches are complementary, not interchangeable.
Autoscaling Works Only When Metrics Tell the Truth
The quality of autoscaling decisions ultimately depends on the metrics that feed the system.
CPU utilization is easy to measure, but it doesn’t always correlate with user-facing performance. Some applications are bottlenecked by I/O, external APIs, or internal queue depth rather than raw CPU consumption. In those cases, scaling based solely on CPU metrics may miss the signals that actually matter.
Advanced platforms often introduce application-level metrics into scaling decisions. Queue length, request latency, and throughput are frequently better indicators of load than CPU utilization alone. These signals allow scaling behavior to align more closely with real-world demand rather than infrastructure metrics.
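As an example of a backlog-driven signal, the replica count can be derived from queue depth rather than CPU. This is the kind of calculation external scalers (such as KEDA-style event scalers) perform; the throughput and drain-time numbers here are assumed, workload-specific values:

```python
import math

def replicas_for_queue(queue_length: int,
                       per_pod_throughput: float,
                       target_drain_seconds: float,
                       min_replicas: int = 2,
                       max_replicas: int = 20) -> int:
    """Enough pods to drain the current backlog within a target time,
    clamped to configured min/max replica bounds."""
    needed = math.ceil(queue_length /
                       (per_pod_throughput * target_drain_seconds))
    return max(min_replicas, min(max_replicas, needed))

# 9,000 queued messages, each pod drains ~50 msg/s, target: clear in 60s
print(replicas_for_queue(9000, per_pod_throughput=50,
                         target_drain_seconds=60))  # -> 3 pods
```

Notice that CPU never appears in the formula: the scaling decision tracks the thing users actually feel, which is how long their work waits in the queue.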
However, this approach introduces complexity. Application metrics must be reliable, well-defined, and resistant to noise. Otherwise, autoscaling becomes unstable and oscillates between states.
The challenge is not gathering more metrics, but identifying the ones that genuinely reflect pressure on the system.
The Interaction Between Pod Autoscaling and Cluster Autoscaling
Another dimension of scaling complexity emerges when the Horizontal Pod Autoscaler interacts with the Cluster Autoscaler.
The cluster autoscaler is responsible for adding or removing nodes when pods cannot be scheduled due to insufficient capacity. This interaction creates a chain reaction. When HPA increases replica counts, the scheduler attempts to place those pods on existing nodes. If capacity is unavailable, the cluster autoscaler provisions new nodes.
This sequence introduces additional delay and sometimes surprising behavior. If resource requests are inflated, pods may be unschedulable on paper even when nodes still have plenty of unused CPU and memory in practice. The cluster autoscaler then adds nodes unnecessarily, increasing infrastructure costs.
In this sense, inaccurate requests don’t just affect pod scheduling; they propagate all the way up to cluster-level infrastructure decisions.
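The propagation is mechanical: the scheduler bin-packs on requests, not on actual usage, so the request value directly determines how many pods fit per node. A small sketch with an illustrative 8-core node:

```python
def schedulable_pods(node_allocatable_m: int, request_m: int) -> int:
    # The scheduler admits pods based on *requested* millicores,
    # regardless of what the pods actually consume.
    return node_allocatable_m // request_m

NODE_CPU = 8000  # millicores allocatable on an 8-core node (illustrative)

# Same workload, two request sizes:
print(schedulable_pods(NODE_CPU, 2000))  # 4 pods fit on paper
print(schedulable_pods(NODE_CPU, 600))   # 13 pods fit with honest requests
```

With the inflated request, the fifth replica is "unschedulable" and triggers a new node, even though the node's real consumption may be a fraction of its capacity.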
Autoscaling Is a Feedback System, Not a Magic Switch
Autoscaling systems behave more like control loops than simple triggers. They observe signals, make adjustments, and then observe the effects of those adjustments over time.
Like any feedback system, stability depends on signal quality, response timing, and predictable behavior from the workloads involved. When any of those elements are unreliable, scaling becomes erratic.
Understanding autoscaling in this way helps explain why tuning parameters such as scaling thresholds, cooldown periods, and replica limits can have dramatic effects. These settings control how aggressively the system reacts to perceived changes in demand.
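One of those stabilizing mechanisms is easy to illustrate: the HPA's scale-down stabilization window acts on the highest replica recommendation seen over a recent window, so brief dips in load don't cause flapping. A simplified sketch of that smoothing:

```python
from collections import deque

def stabilized_scale_down(recommendations: list[int],
                          window: int) -> list[int]:
    """At each step, apply the *maximum* recommendation from the
    last `window` steps (simplified model of the HPA scale-down
    stabilization window)."""
    recent: deque[int] = deque(maxlen=window)
    applied = []
    for rec in recommendations:
        recent.append(rec)
        applied.append(max(recent))
    return applied

noisy = [10, 4, 9, 3, 10, 2, 9]        # raw, oscillating recommendations
print(stabilized_scale_down(noisy, 3))  # smoothed replica counts
```

The raw signal would bounce replicas between 2 and 10; the window holds capacity steady while demand is genuinely volatile, trading a little efficiency for a lot of stability.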
Organizations that operate large Kubernetes environments eventually learn that autoscaling is not something you “enable and forget.” It is an ongoing operational discipline that requires observation, adjustment, and occasionally restraint.
When Autoscaling Actually Works Well
Autoscaling tends to perform best when a few key conditions are met. Resource requests closely match typical usage, ensuring utilization metrics reflect real pressure. Workloads scale horizontally without complex state dependencies. Traffic patterns change gradually enough for scaling decisions to keep up.
When those conditions hold, the system begins to behave predictably. Scaling events become routine rather than surprising, infrastructure usage becomes more efficient, and operational stress decreases.
Ironically, autoscaling becomes almost invisible at that point. It simply does its job in the background.
Closing Thoughts
Autoscaling is often portrayed as Kubernetes’ built-in solution for dynamic workloads. In practice, it is only as effective as the signals and assumptions that feed into it. Inflated resource requests, poorly chosen metrics, and unrealistic expectations about scaling speed can all undermine the system.
The Horizontal Pod Autoscaler is not a replacement for thoughtful resource configuration. Instead, it builds on top of it. When requests reflect reality and metrics reflect meaningful pressure on the system, autoscaling becomes an incredibly powerful tool.
But without those foundations, it simply amplifies existing problems.
In the next part of this series, we’ll explore a domain where these problems become dramatically more expensive: GPU workloads in Kubernetes, where idle capacity can burn thousands of dollars per day.
Key Takeaways
Horizontal Pod Autoscaling depends on resource requests, so inflated requests distort scaling signals and prevent correct scaling behavior.
Vertical scaling complements horizontal scaling by correcting long-term resource misallocation and improving autoscaling accuracy.
Autoscaling is a feedback system, not a one-click feature, and its effectiveness depends on accurate metrics, realistic expectations, and careful tuning.
So, what’s coming next?
GPU workloads magnify every resource management mistake. This deep dive shows how idle accelerators quietly burn budgets and why traditional Kubernetes patterns don’t work for AI workloads.