# Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does

## How HPA Actually Decides to Scale
The HPA controller uses a formula to determine the desired replica count: `desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))`. A critical detail is that "the metric value is expressed relative to the resource request, not the resource limit." This distinction explains many HPA failures.
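The arithmetic is easy to check by hand; here is a minimal sketch of the formula in Python (the numbers are illustrative, not from a real cluster):

```python
from math import ceil

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """The HPA core formula: scale in proportion to metric pressure."""
    return ceil(current_replicas * (current_metric / target_metric))

# 4 replicas averaging 80% CPU (of the *request*) against a 50% target:
print(desired_replicas(4, 80, 50))  # -> 7
```

The `ceil` means HPA always rounds up, so even slight pressure above the target adds a replica.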
HPA polls metrics every 15 seconds by default, scaling up within one to three polling cycles when thresholds are exceeded. Scale-down is deliberately slow, waiting 5 minutes by default to prevent oscillation.
## CPU-Based HPA: When It Works and When It Doesn't

### Where CPU HPA Works Well
CPU-based HPA succeeds with stateless request-processing workloads where CPU consumption correlates with request volume. Prerequisites include:
- Accurate CPU requests set to actual sustained consumption, not placeholders
- Reasonable request-to-limit ratios (1:4 or less)
- CPU consumption that tracks user load linearly
### Where CPU HPA Fails
CPU HPA struggles with:
- Latency-sensitive services with sharp spikes — by the time HPA detects and reacts to peaks, the burst may be over
- I/O-bound workloads — showing low CPU even under heavy load
- Workloads with cold-start costs — requiring earlier scaling decisions than CPU metrics can trigger
## Memory-Based HPA: Why It Almost Always Breaks

### The Core Problem
Memory is incompressible; exhausting it causes OOM termination. Unlike CPU, "memory consumption is relatively stable" for well-architected services. A Go service or JVM application maintains a consistent memory footprint regardless of traffic volume from 10 to 10,000 requests per second.
This creates two outcomes: memory HPA either never triggers (useless) or always triggers (permanently scaled out).
### The Request Misconfiguration Trap
A Java service needing 512Mi heap but configured with a 256Mi request will immediately consume 200% of its request. An HPA with 70% memory threshold will scale such workloads to maximum replicas permanently. The solution is right-sizing requests, not adjusting thresholds.
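To see why adjusting thresholds can't fix this, a quick illustration of the utilization math (hypothetical numbers, following the rule that HPA measures against the request):

```python
def memory_utilization_pct(usage_mi: float, request_mi: float) -> float:
    # HPA computes utilization against the pod's *request*, not its limit.
    return usage_mi / request_mi * 100

# 512Mi of heap against a 256Mi request: permanently above any sane threshold.
print(memory_utilization_pct(512, 256))         # -> 200.0
# Right-sizing the request (e.g. ~683Mi) puts steady state near 75%.
print(round(memory_utilization_pct(512, 683)))  # -> 75
```

At 200% utilization, any threshold below 200% keeps the HPA pinned at `maxReplicas`, which is why the fix is the request, not the threshold.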
### JVM and Go Runtime Memory Behavior
The JVM allocates heap up to its maximum and doesn't release it aggressively, even after garbage collection. Go's garbage collector prioritizes low latency over minimal memory use, potentially holding memory above strict necessity.
### When Memory HPA Is Actually Appropriate
Memory-based HPA is defensible only in narrow cases:
- Workloads where memory consumption tracks load linearly
- As a secondary safety valve (not primary) at 85-90% threshold for protecting against memory leaks
- Caching services where avoiding eviction before scaling out is critical
## Right-Sizing Requests Before Adding HPA
No HPA strategy works without accurate resource requests. Run workloads under representative load and measure actual consumption. VPA in recommendation mode provides data-driven baselines:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off" # Recommendation only
```
Critical note: VPA and HPA cannot both auto-manage the same resource metric simultaneously.
## Better Signals: What to Scale On Instead
Shift from resource consumption metrics (describing the past) to demand metrics (describing current needs).
### Requests Per Second (RPS)
For HTTP services, "requests per second per replica is usually the most accurate proxy for load." RPS measures demand directly, working for CPU-bound, memory-bound, or I/O-bound services.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"
```
### Queue Depth and Lag
For consumer workloads reading from message queues, "consumer lag: how many messages are waiting to be processed" is the right scaling signal. KEDA was built for this use case, reading consumer group lag directly.
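A minimal sketch of a KEDA ScaledObject for a Kafka consumer, assuming KEDA is installed (the Deployment name, broker address, topic, and consumer group below are placeholders; `lagThreshold` is the KEDA Kafka scaler's per-replica lag target):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-consumer-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: my-consumer               # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092      # placeholder broker address
      consumerGroup: my-consumer-group  # placeholder consumer group
      topic: orders                     # placeholder topic
      lagThreshold: "100"   # scale out when lag per replica exceeds 100
```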
### Latency
P99 latency per replica is an excellent signal for latency-sensitive services, requiring custom metrics from service meshes or APM tools.
### Scheduled and Predictive Scaling
For predictable traffic patterns, proactive scaling outperforms reactive scaling. KEDA's Cron scaler enables time-based scaling rules.
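A minimal sketch of a Cron trigger, again assuming KEDA is installed (names, times, and replica counts are placeholders; `start` and `end` take standard cron expressions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-cron     # hypothetical name
spec:
  scaleTargetRef:
    name: my-service        # hypothetical Deployment
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Madrid   # IANA timezone name
      start: 0 8 * * 1-5        # 08:00 on weekdays
      end: 0 20 * * 1-5         # 20:00 on weekdays
      desiredReplicas: "10"     # floor held during the window
```

Outside the window, other triggers (or `minReplicaCount`) determine the replica count.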
## HPA Configuration Best Practices

### Always Set `minReplicas` ≥ 2 for Production
A single-replica HPA creates a single point of failure during scale-in events.
### Tune Stabilization Windows

The default 5-minute scale-down stabilization window is too short for workloads with cyclical traffic: replicas are removed during a lull only to be recreated minutes later. Increase it to match your workload's natural cycle:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
```
The `behavior` block (available in `autoscaling/v2`) enables independent control over scale-up and scale-down.
### Use a Lower CPU Threshold Than You Think
If scale-up takes 45 seconds, a 70% threshold leaves existing pods throttled during that window. Set CPU targets at 50-60% for services where scaling latency matters.
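The trade-off behind that advice reduces to simple headroom arithmetic (illustrative, assuming CPU tracks traffic linearly): the lower the target, the more traffic growth the existing replicas can absorb while new pods start.

```python
def absorbable_growth_pct(target_utilization_pct: float) -> float:
    """Extra traffic (as a % of current load) the existing fleet can absorb
    before hitting 100% CPU, while waiting for new replicas to become ready."""
    return (100 - target_utilization_pct) / target_utilization_pct * 100

print(round(absorbable_growth_pct(70)))  # -> 43
print(round(absorbable_growth_pct(50)))  # -> 100
```

A 50% target gives more than twice the burst headroom of a 70% target, at the cost of running more replicas in steady state.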
### Combine HPA with PodDisruptionBudgets
HPA scale-down terminates pods. Without a PodDisruptionBudget, multiple replicas can be terminated simultaneously during maintenance:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: my-service
```
### Don't Mix VPA Auto-Update with HPA on the Same Metric
VPA auto-updating requests while HPA scales on those metrics creates conflicting control loops.
## Decision Framework: Which Autoscaler for Which Workload
| Workload type | Recommended signal | Tool |
|---|---|---|
| Stateless HTTP API, CPU-bound | CPU utilization at 50-60% | HPA |
| Stateless HTTP API, I/O-bound | RPS per replica or P99 latency | HPA + custom metrics |
| Message queue consumer | Consumer lag / queue depth | KEDA |
| Event-driven / Kafka / SQS | Event rate or lag | KEDA |
| Predictable traffic pattern | Schedule (time-based) | KEDA Cron scaler |
| Workload with memory leak risk | CPU primary + memory at 85% secondary | HPA (v2 multi-metric) |
| Right-sizing before HPA | Historical CPU/memory recommendations | VPA recommendation mode |
## Going Beyond HPA: KEDA and Custom Metrics
KEDA provides a Kubernetes-native autoscaling framework supporting over 60 built-in scalers. The key architectural point: "KEDA does not replace HPA — it feeds it." KEDA creates and manages HPA resources while consuming signals HPA cannot access natively.
## FAQ

### Can I use both CPU and memory in the same HPA?
Yes. HPA v2 supports multiple metrics simultaneously, scaling to satisfy the most demanding metric. Use CPU at 60% threshold and memory at 85% threshold so memory only triggers in genuine overconsumption.
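A sketch of that combination as an `autoscaling/v2` manifest (the names are placeholders):

```yaml
# CPU drives normal scaling; memory acts only as a high-watermark safety valve.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 85
```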
### Why does my workload scale up immediately after deployment?

The usual cause is resource request misconfiguration. Compare actual consumption against requests with `kubectl top pods`. If a pod consumes 200% of its request just by idling, adjust the request to match actual usage before enabling HPA.
### Why does HPA scale down too aggressively and cause latency spikes?

Increase `scaleDown.stabilizationWindowSeconds` in the HPA `behavior` block. Also add a `Percent` policy limiting scale-down to, say, 25% of replicas per minute.
### Should I set HPA on every deployment?
No. HPA fits stateless services, consumers, and request handlers. It's inappropriate for stateful workloads requiring more than replica addition, singleton controllers, or batch jobs that should run to completion.
### What is the minimum CPU request for reliable HPA?

There is no absolute minimum, but requests below `100m` make percentage thresholds coarse-grained. With a `50m` request and a 70% threshold, scaling triggers at just `35m` of consumption. For services with lower needs, scale on RPS or custom metrics instead.
### How do I debug HPA scaling decisions?

Use `kubectl describe hpa` to see current metric values and recent scaling events. Check HPA events with `kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler`. For custom metrics, verify that the metrics API returns the expected values.
Originally published at alexandre-vazquez.com/kubernetes-hpa-best-practices