When Autoscaling Makes Your Bill Worse, Not Better
Autoscaling is sold as the solution to cloud waste. Scale down when traffic drops, scale up when it rises, pay only for what you use. That logic holds when the configuration is correct. When it is not, autoscaling becomes the most expensive mistake in your cluster.
We have seen production clusters where HPA and VPA were both active, Cluster Autoscaler was provisioning nodes on every spike, and the monthly bill was 40% higher than the equivalent fixed-size deployment would have been. The scaling was working as configured. The configuration was wrong.
This is not a rare edge case. The four failure modes below appear consistently across teams that have just enabled autoscaling and assumed the defaults are safe.
How HPA Actually Computes Scale Decisions
Most engineers understand HPA conceptually: when CPU goes up, add replicas. The detail that causes the most misconfiguration is what "CPU goes up" actually means to the controller.
HPA computes utilization as a percentage of the pod's CPU request, not its limit, and not the raw CPU usage in millicores.
If a pod has a CPU request of 100m and a CPU limit of 2000m, and HPA's targetCPUUtilizationPercentage is set to 50, HPA will try to keep actual CPU usage at 50m per pod. A pod that idles at 60m looks perpetually overloaded. HPA adds a replica. The new replica also idles at 60m. HPA adds another. This continues until maxReplicas is hit.
The pod is not overloaded. The cluster is.
The formula HPA uses is: desiredReplicas = ceil(currentReplicas * (currentUtilization / targetUtilization)).
If currentUtilization is already above targetUtilization before any real load arrives, the multiplier is greater than 1 on every reconciliation loop. The cluster scales continuously.
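Played out with the numbers above (request 100m, idle usage 60m, target 50%), the loop looks like this, in a minimal trace assuming three replicas to start:

```
reconcile 1: utilization = 60m/100m = 60% → desired = ceil(3 × 60/50) = 4
reconcile 2: each replica still idles at 60m → desired = ceil(4 × 1.2) = 5
reconcile 3: desired = ceil(5 × 1.2) = 6
...and so on, every loop, until maxReplicas
```

Because idle consumption is per-pod rather than shared load, adding replicas never lowers the utilization ratio, so the multiplier stays at 1.2 on every pass.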
The fix: Size CPU requests from actual usage measured over 7 days, so that idle consumption sits well below the HPA target rather than above it. A pod that idles at 60m and peaks at 300m might get a request of 200m and an HPA target around 70%: idle is then 30% utilization, and scale-out triggers at 140m of actual usage, which represents genuine load rather than idle noise.
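A minimal sketch of the resulting manifests, using a hypothetical service name (web-api) and the numbers from the example; the fields are standard apps/v1 and autoscaling/v2:

```yaml
# Sketch only: name, image, and limits are illustrative. The request is
# sized from 7 days of measured usage (idle ~60m, peak ~300m).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.0   # placeholder
          resources:
            requests:
              cpu: 200m     # idle 60m = 30% utilization, safely below target
            limits:
              cpu: 500m
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale-out begins at 140m actual usage per pod
```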
The Four Failure Modes That Inflate Your Bill
1. Target Set Below Idle CPU
Already described above. The signal is an HPA that shows TARGETS at or above targetCPUUtilizationPercentage even when no traffic is hitting the service. Check with kubectl get hpa -n <namespace> and look for utilization values that match your target even during off-hours.
2. HPA and VPA Running Simultaneously in Auto Mode
VPA in Auto mode evicts pods to apply new resource recommendations. Each eviction causes a brief CPU spike as the replacement pod starts up. HPA reads that spike and adds a replica. VPA then recalculates recommendations against a fleet that now has more replicas than before. The new recommendation is different. VPA evicts again. HPA scales again.
The result is replica count drift upward over time, with no corresponding increase in actual workload.
The fix: Run VPA in recommendation-only mode (updateMode: "Off"). Read VPA's output and apply it manually to your deployment's resources.requests. Let HPA handle replica scaling from a stable request baseline. Never run VPA in Auto mode alongside HPA on the same deployment's CPU or memory metrics.
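A minimal sketch of that configuration, targeting the hypothetical web-api deployment; the apiVersion and field names are the standard VPA CRD:

```yaml
# Sketch: VPA computes and publishes recommendations in its status
# but never evicts pods. Apply the numbers to requests on your own cadence.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"   # recommend only; no automatic eviction
```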
3. Node Scale-Down Lag
Cluster Autoscaler's default scale-down-delay-after-add is 10 minutes. Its default scale-down-unneeded-time is also 10 minutes. A spike that lasts 3 minutes provisions new nodes that remain billable for roughly 20 minutes from the moment they were added, long after the spike itself has ended.
For workloads with frequent short spikes, this means your cluster is almost always running with the node count from the last peak, not the current load. At $0.10/node-hour on t3.large, 5 extra nodes during 8 hours of evening quiet costs $4 per night, $120 per month, per service.
The fix: Tune --scale-down-delay-after-add to 3-5 minutes for development and staging clusters. For production, balance cost against the re-provisioning latency your workload can tolerate.
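A sketch of that tuning as container args on the Cluster Autoscaler deployment; the flags are real Cluster Autoscaler flags, while the image tag and cloud provider are illustrative:

```yaml
# Sketch: fragment of the cluster-autoscaler pod spec for a staging cluster.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0  # example tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws                # illustrative provider
      - --scale-down-delay-after-add=3m     # default: 10m
      - --scale-down-unneeded-time=3m       # default: 10m
```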
4. Metric Staleness in Custom Autoscalers
KEDA and any custom metrics-based HPA configuration depend on Prometheus (or another metrics source) for their scaling signals. A typical Prometheus scrape interval is 15 seconds. Add metric collection lag, rule evaluation delay, and the HPA reconciliation period, and the data HPA acts on can be 30-60 seconds old.
For a bursty workload that spikes and drops in under 60 seconds, the autoscaler always reacts after the fact. It adds replicas as the spike is already ending. Those replicas sit idle through the stabilization window (default 5 minutes), then Cluster Autoscaler keeps the nodes alive for another 10 minutes. The spike cost 3 minutes of real load and 15 minutes of real spend.
The fix: For bursty workloads, reduce Prometheus scrape interval to 5 seconds for the relevant metrics. Or use predictive scaling (pre-scale before known peak times) instead of reactive scaling for workloads with predictable traffic patterns.
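One way to tighten the scrape without touching the global default is a per-job override in prometheus.yml; the job name and target here are hypothetical:

```yaml
# Sketch: a dedicated scrape job for the metrics that feed the autoscaler.
scrape_configs:
  - job_name: web-api-scaling-metrics   # hypothetical job
    scrape_interval: 5s                 # overrides the global default for this job only
    static_configs:
      - targets: ["web-api.default.svc:8080"]
```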
Detecting Runaway Scaling Before It Hits Your Invoice
The failure modes above are detectable before they become expensive. These are the signals worth watching.
| Signal | Warning Condition | How to Check |
|---|---|---|
| HPA utilization at target during off-hours | HPA shows utilization at or above target with no active traffic | `kubectl get hpa -A` during off-peak window |
| Replica count trend | Replicas increasing over days without traffic growth | Prometheus `kube_deployment_spec_replicas` 7-day graph |
| VPA eviction rate | More than 2 evictions/hour per deployment | `kubectl get events --field-selector reason=Evicted` |
| Node count vs request count ratio | Node count stable while request rate drops | Prometheus `kube_node_info` vs ingress RPS |
| HPA scale-up frequency | More than 4 scale-up events per hour during normal load | `kubectl describe hpa <name>` events section |
| Cluster Autoscaler churn | Nodes provisioned and deleted more than twice per day | Cluster Autoscaler logs: `grep "Scale-up" cluster-autoscaler.log` |
Set alerts on the first three. The rest are diagnostic when you suspect a problem.
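For the replica-drift signal, here is a sketch of a Prometheus alert rule; the metric comes from kube-state-metrics, and the threshold and duration are starting points to tune, not recommendations:

```yaml
# Sketch: fires when a deployment carries noticeably more replicas than a
# week ago. Pair it with a traffic comparison so real growth does not alert.
groups:
  - name: autoscaling-drift
    rules:
      - alert: ReplicaCountDrift
        expr: |
          kube_deployment_spec_replicas - kube_deployment_spec_replicas offset 7d > 2
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.deployment }} grew by more than 2 replicas week-over-week"
```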
Configuration Patterns That Eliminate the Failure Modes
VPA as Input to HPA, Not a Parallel Controller
This is the pattern that produces stable scaling. VPA's recommendations are reviewed and applied on a cadence. HPA operates against a request value that reflects actual idle consumption. Cluster Autoscaler provisions nodes against predictable replica counts.
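In practice the cadence can be two kubectl commands, run weekly or from CI; the deployment name and the patched value are placeholders:

```bash
# Read VPA's current CPU recommendation for the first container
kubectl get vpa web-api -o jsonpath='{.status.recommendation.containerRecommendations[0].target.cpu}'

# Apply it to the deployment's request on your own schedule (value is a placeholder)
kubectl patch deployment web-api --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"210m"}]'
```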
Stabilization Windows Tuned to Your Traffic Shape
HPA's scaleDown.stabilizationWindowSeconds defaults to 300 seconds (5 minutes). For a service with 10-minute traffic cycles, that is reasonable. For a service whose load moves on 2-hour cycles, every transient dip inside the cycle outlasts the 5-minute window, so replicas scale down and climb right back: constant churn. Set the window to match the natural period of your load pattern.
Similarly, scaleUp.stabilizationWindowSeconds defaults to 0 (immediate scale-up). For services where a 30-second spike does not justify a new replica, set this to 60-120 seconds to absorb transient spikes without triggering scale-out.
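Both windows live in the behavior block of an autoscaling/v2 HPA; a fragment, with values as starting points rather than recommendations:

```yaml
# Sketch: goes under spec in an autoscaling/v2 HorizontalPodAutoscaler.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 90    # absorb ~30-second transient spikes
  scaleDown:
    stabilizationWindowSeconds: 600   # match the service's natural load period
```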
Cluster Autoscaler Tuning by Environment
For production clusters, the default 10-minute delays are appropriate. For non-production clusters (dev, staging, preview), reduce scale-down-delay-after-add and scale-down-unneeded-time to 2-3 minutes each. Non-prod clusters are typically not latency-sensitive. Aggressive scale-down on non-prod is pure cost reduction with no operational downside.
Going further: for non-prod environments, scheduled scaling (down to zero at end of day, back up at start of day) eliminates the problem entirely. A cluster that is off for 14 hours per day costs 58% less than one running 24/7. Autoscaling on a non-prod cluster is often the wrong tool. Scheduling is simpler and cheaper.
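One plain-Kubernetes way to do the scheduling is a pair of CronJobs (only the scale-down half is shown); the namespace, schedule, ServiceAccount, and image are all illustrative, and the ServiceAccount needs RBAC permission to scale deployments:

```yaml
# Sketch: shuts down every deployment in the staging namespace at 19:00
# on weekdays. A mirror-image CronJob scales back up in the morning.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-staging
  namespace: staging
spec:
  schedule: "0 19 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: env-scheduler    # hypothetical SA with scale rights
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.29      # example image
              command:
                - kubectl
                - scale
                - deployment
                - --all
                - --replicas=0
                - -n
                - staging
```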
KEDA Scrape Interval Alignment
If you use KEDA with a Prometheus trigger, set pollingInterval in your ScaledObject to match your Prometheus scrape interval. The default pollingInterval is 30 seconds. If Prometheus scrapes every 15 seconds, KEDA sees data that is up to 45 seconds old (scrape age plus polling delay). Reducing both to 5-10 seconds closes the detection gap for bursty workloads.
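A sketch of that alignment in a ScaledObject; the query, threshold, and server address are illustrative, while pollingInterval and the prometheus trigger type are standard KEDA fields:

```yaml
# Sketch: pollingInterval matched to a 5s Prometheus scrape so KEDA's
# view of the metric is at most ~10 seconds stale.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-api
spec:
  scaleTargetRef:
    name: web-api
  pollingInterval: 5      # default is 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(rate(http_requests_total{service="web-api"}[1m]))
        threshold: "100"
```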
What Good Autoscaling Actually Looks Like
A well-tuned autoscaling setup has two visible characteristics. First, replica counts are stable during steady-state traffic, moving only when load genuinely changes over a meaningful time window. Second, node count follows replica count with a predictable lag, and drops back to baseline within 15-20 minutes after load normalizes.
If your replica count graph looks like a heartbeat at rest, your autoscaling is calibrated. If it looks like a seismograph, the configuration is fighting your workload rather than tracking it.
The deeper issue is that autoscaling is a tool for production variability. Non-production environments do not have the same variability profile. Dev and staging clusters run at low load most of the day, spike briefly during CI runs or manual testing, then sit idle for hours. Autoscaling on these environments responds to those spikes by provisioning nodes that stay alive long after the spike ends. For non-prod, scheduled environment management eliminates this entirely. zopnight handles this automatically: environments shut down after inactivity and wake on access, without relying on autoscaler heuristics that were designed for production traffic patterns.
The goal is not to autoscale everything. The goal is to pay for what you actually use. Sometimes that means better-tuned HPA. Sometimes it means no autoscaling at all.


