Scaling Cooldown Tuning: Stop Your Autoscaler From Thrashing

#devops #infrastructure #kubernetes #sre

Your HPA is flapping.

Pods spin up. Traffic dips. Pods spin down. Traffic returns. Pods spin up again. All within 90 seconds.

This costs money and stability. Every scale event creates pod churn. New pods need to warm up. Connections restart. Metrics refresh.

The fix isn't complicated. It's tuning cooldown periods.

What Flapping Looks Like

Before tuning:

9:15 AM: CPU hits 75%. HPA scales 3→5 pods.
9:16 AM: Traffic normalizes. CPU drops to 60%.
9:17 AM: HPA scales 5→3 pods (the default scaleDown stabilization window is 300s, but our setup wasn't honoring it).
9:18 AM: Batch request comes in. CPU jumps to 80%.
9:19 AM: HPA scales 3→5 pods again.

Every 60-90 seconds. Constantly. Pod logs show connection resets every minute.

Billing spike? $200/day in unnecessary compute because pods kept restarting.

The Tuning

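We set the scaling behavior explicitly in the HPA spec (autoscaling/v2) instead of relying on whatever the cluster was applying. The two stabilization windows match the upstream defaults (300s down, 0s up). The rate policies are the part that differs from the default of 100% per 15 seconds: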

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 min before scaling down
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0      # scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60

Why these values?

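scaleDown stabilizationWindow: 300s (5 min). When CPU drops below the threshold, the HPA does not scale down right away. It looks back over the last 5 minutes of replica recommendations and acts on the highest one, so a scale-down only happens once the lower number has held for the full window. Most traffic spikes last longer than 90 seconds, so a dip in the middle of one is usually temporary. The window keeps you from reacting to it. One team tried 60s and was still flapping. 300s worked.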
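To see why the window stops flapping, here is a minimal Python sketch of the idea. It is a simplified model, not the HPA controller's code. The class name, the 60-second sampling interval, and the sample recommendations are all made up for illustration.

```python
from collections import deque


class ScaleDownStabilizer:
    """Simplified model of a scale-down stabilization window.

    Remembers recent replica recommendations and returns the highest
    one still inside the window.
    """

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommended):
        self._history.append((now, recommended))
        # Drop recommendations that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._history[0][0] < cutoff:
            self._history.popleft()
        # Only go as low as the highest recommendation still in the window.
        return max(replicas for _, replicas in self._history)


def simulate(window_seconds, samples):
    """Feed (timestamp, recommendation) samples through a stabilizer."""
    stabilizer = ScaleDownStabilizer(window_seconds)
    return [stabilizer.stabilize(ts, rec) for ts, rec in samples]


# Hypothetical raw recommendations, one per minute: a dip, a rebound,
# then a sustained drop.
SAMPLES = [(0, 5), (60, 3), (120, 5), (180, 3), (240, 3),
           (300, 3), (360, 3), (420, 3), (480, 3)]

print(simulate(0, SAMPLES))    # no window: follows every dip
print(simulate(300, SAMPLES))  # 300s window: holds until the drop is sustained
```

With no window the output follows the raw input: 5, 3, 5, 3. With a 300s window it stays at 5 and only drops to 3 at the last sample, once no recommendation of 5 remains in the window.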

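scaleDown percent: 50. Remove at most half the pods per 60-second period, not all of them at once. Any scale-down is a bet that you no longer need the pods you remove. This cap limits the size of the bet. Going from 5 pods to 3 removes 2 pods (40%), which fits under the 50% cap. A 100% policy would let the HPA drop straight to whatever the metrics recommend, however low.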
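The cap arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and rounding the remaining pods down is an assumption of this sketch. Check the exact rounding against your Kubernetes version. The sketch also treats each step as one full period, while the real controller evaluates a rolling window.

```python
def scale_down_floor(current_replicas, percent):
    """Lowest replica count a Percent scaleDown policy permits in one period.

    Assumption: the remainder is rounded down, so 50% of 5 pods may leave 2.
    """
    return current_replicas * (100 - percent) // 100


def capped_scale_down(current_replicas, desired_replicas, percent):
    """Never drop below the policy floor, and never below the desired count."""
    return max(desired_replicas, scale_down_floor(current_replicas, percent))


def scale_down_path(start, target, percent):
    """Replica counts after each period while shrinking from start to target."""
    path = [start]
    current = start
    while current > target:
        nxt = capped_scale_down(current, target, percent)
        if nxt == current:  # cap too tight to make progress (e.g. percent=0)
            break
        path.append(nxt)
        current = nxt
    return path


print(capped_scale_down(5, 3, 50))   # 5→3 fits under the 50% cap
print(scale_down_path(20, 3, 50))    # a larger drop is spread over periods
print(scale_down_path(20, 3, 100))   # a 100% cap allows it in one step
```

At 5 pods the cap rarely bites. It matters after a big spike: shrinking from 20 pods to 3 takes three periods under a 50% cap instead of one.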

scaleUp stabilizationWindow: 0. When CPU hits 75%, scale immediately. You have customers waiting. Slow scale-up means slow response time.

scaleUp percent: 100. Allow the pod count to double within each 60-second period if needed. If you're at 3 pods and hitting limits, jump to 6. Better to overprovision briefly than make customers wait.
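A quick way to sanity-check a scaleUp percent is to ask how many periods it takes to reach the capacity you need. This is a rough Python sketch with hypothetical function names. It assumes the limit rounds up and treats every period as a clean step, which is simpler than the real controller's rolling window.

```python
def scale_up_ceiling(current_replicas, percent):
    """Highest replica count a Percent scaleUp policy permits in one period.

    Assumption: the result is rounded up (integer ceiling division).
    """
    return -(-current_replicas * (100 + percent) // 100)


def periods_to_reach(start, target, percent):
    """Number of policy periods needed to grow from start to target replicas."""
    if start < 1 or percent <= 0:
        raise ValueError("start must be >= 1 and percent must be > 0")
    current, periods = start, 0
    while current < target:
        current = min(target, scale_up_ceiling(current, percent))
        periods += 1
    return periods


print(scale_up_ceiling(3, 100))       # 3 pods may double to 6
print(periods_to_reach(3, 20, 100))   # periods to reach 20 pods at 100%
print(periods_to_reach(3, 20, 50))    # the same target at 50%
```

With periodSeconds: 60, going from 3 to 20 pods takes about 3 minutes at 100% and about 5 minutes at 50%. If that is too slow for your traffic, raise the percent or shorten the period.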

After Tuning

Same 9:15 AM scenario:

9:15 AM: CPU hits 75%. HPA scales 3→5 pods immediately.
9:20 AM: Traffic stabilizes. System waits (stabilization window).
9:25 AM: CPU still below 60%. HPA scales 5→3 pods.
9:26 AM: No more thrashing.

Pod restart rate dropped 95%. Load balancer connection resets went from 60/min to 2/min.

Monthly compute cost dropped $1,400 (was $8,500/month due to churn, now $7,100).

The Principle

Scale up fast, scale down slow.

Customers need capacity now. They don't care if you have extra pods for 5 minutes. They do care if you're constantly churning them.

Stabilization windows let temporary spikes and dips pass without action. Percent-based policies let you adjust gradually instead of making all-or-nothing jumps.

One team still uses defaults. They have pod churn every 90 seconds. Another adjusted to these values and saw pod churn once per day, only when actual traffic patterns genuinely changed.

Your HPA is probably thrashing. Check your stabilization windows. If you see pod restart spikes that correlate with CPU threshold crossings, you've found it.
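One rough way to check is to count how often scaling changes direction. The Python sketch below assumes you can export replica counts over time as (timestamp_seconds, replicas) pairs from your monitoring system. The function names and the sample data are hypothetical. The data mirrors the "before" timeline above.

```python
def scale_events(samples):
    """Turn (timestamp, replicas) samples into (timestamp, delta) scale events."""
    events = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur != prev:
            events.append((ts, cur - prev))
    return events


def count_reversals(samples):
    """Count direction flips: a scale-up followed by a scale-down, or vice versa."""
    events = scale_events(samples)
    reversals = 0
    for (_, prev_delta), (_, delta) in zip(events, events[1:]):
        if (prev_delta > 0) != (delta > 0):
            reversals += 1
    return reversals


def reversals_per_hour(samples):
    """Reversals normalized by the time span covered by the samples."""
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return 0.0
    return count_reversals(samples) * 3600 / duration


# Replica counts sampled once a minute, following the "before" timeline.
before = [(0, 3), (60, 5), (120, 5), (180, 3), (240, 3), (300, 5)]

print(count_reversals(before))
print(reversals_per_hour(before))
```

A handful of reversals per day is normal traffic. Dozens per hour means the autoscaler is chasing noise.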

Set the scaleDown stabilization window to 300s. Set the scaleUp window to 0. Adjust the percents based on your app. Test. Most teams see a 70-80% reduction in unnecessary scaling events.