Alexandre Vazquez

Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does

How HPA Actually Decides to Scale

The HPA controller uses a formula to determine desired replicas: desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue)). A critical detail is that "the metric value is expressed relative to the resource request, not the resource limit." This distinction explains many HPA failures.
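Worked through with concrete numbers: if 4 replicas average 90% CPU utilization against a 60% target, the controller computes ceil(4 × 90 / 60) = ceil(6.0) = 6 replicas.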

HPA polls metrics every 15 seconds by default and typically scales up within one to three polling cycles once a threshold is exceeded. Scale-down is deliberately slow: the default stabilization window holds off for 5 minutes to prevent oscillation.

CPU-Based HPA: When It Works and When It Doesn't

Where CPU HPA Works Well

CPU-based HPA succeeds with stateless request-processing workloads where CPU consumption correlates with request volume. Prerequisites include:

  • Accurate CPU requests set to actual sustained consumption, not placeholders
  • Reasonable request-to-limit ratios (1:4 or less)
  • CPU consumption that tracks user load linearly

Where CPU HPA Fails

CPU HPA struggles with:

  • Latency-sensitive services with sharp spikes — by the time HPA detects and reacts to peaks, the burst may be over
  • I/O-bound workloads — showing low CPU even under heavy load
  • Workloads with cold-start costs — requiring earlier scaling decisions than CPU metrics can trigger

Memory-Based HPA: Why It Almost Always Breaks

The Core Problem

Memory is incompressible; exhausting it causes OOM termination. And unlike CPU, memory consumption is relatively stable in well-architected services: a Go service or JVM application maintains a consistent memory footprint whether it handles 10 or 10,000 requests per second.

This creates two outcomes: memory HPA either never triggers (useless) or always triggers (permanently scaled out).

The Request Misconfiguration Trap

A Java service needing 512Mi heap but configured with a 256Mi request will immediately consume 200% of its request. An HPA with 70% memory threshold will scale such workloads to maximum replicas permanently. The solution is right-sizing requests, not adjusting thresholds.
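A hypothetical illustration of the trap, with figures mirroring the Java example above:

resources:
  requests:
    memory: "256Mi"   # undersized: the JVM heap alone is 512Mi
  limits:
    memory: "768Mi"

With a steady 512Mi in use, memory utilization is 512Mi / 256Mi = 200% of the request, so a 70% target is exceeded the moment the pod starts and HPA drives the Deployment to maxReplicas.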

JVM and Go Runtime Memory Behavior

The JVM allocates heap up to its maximum and doesn't release it aggressively, even after garbage collection. Go's garbage collector prioritizes low latency over minimal memory use, potentially holding memory above strict necessity.

When Memory HPA Is Actually Appropriate

Memory-based HPA is defensible only in narrow cases:

  • Workloads where memory consumption tracks load linearly
  • As a secondary safety valve (not primary) at 85-90% threshold for protecting against memory leaks
  • Caching services where avoiding eviction before scaling out is critical

Right-Sizing Requests Before Adding HPA

No HPA strategy works without accurate resource requests. Run workloads under representative load and measure actual consumption. VPA in recommendation mode provides data-driven baselines:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only

Critical note: VPA and HPA cannot both auto-manage the same resource metric simultaneously.
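Once the VPA above has observed the workload in recommendation mode, its suggested requests appear in the object's status (the resource name here matches the example manifest):

kubectl describe vpa my-service-vpa

Look for the Recommendation section in the status, which lists per-container lowerBound, target, and upperBound values you can use to right-size requests before enabling HPA.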

Better Signals: What to Scale On Instead

Shift from resource consumption metrics (describing the past) to demand metrics (describing current needs).

Requests Per Second (RPS)

For HTTP services, "requests per second per replica is usually the most accurate proxy for load." RPS measures demand directly, working for CPU-bound, memory-bound, or I/O-bound services.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"

Queue Depth and Lag

For consumer workloads reading from message queues, "consumer lag: how many messages are waiting to be processed" is the right scaling signal. KEDA was built for this use case, reading consumer group lag directly.
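A minimal KEDA sketch for a Kafka consumer (the bootstrap server, consumer group, and topic names are placeholders):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-consumer-scaler
spec:
  scaleTargetRef:
    name: my-consumer        # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.svc:9092
      consumerGroup: my-consumer-group
      topic: orders
      lagThreshold: "100"    # target lag per replica

KEDA exposes the lag as an external metric and manages the underlying HPA for you.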

Latency

P99 latency per replica is an excellent signal for latency-sensitive services, requiring custom metrics from service meshes or APM tools.

Scheduled and Predictive Scaling

For predictable traffic patterns, proactive scaling outperforms reactive scaling. KEDA's Cron scaler enables time-based scaling rules.
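A sketch of a Cron trigger that pre-scales before a known daily peak (times, timezone, and replica counts are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-service-cron-scaler
spec:
  scaleTargetRef:
    name: my-service
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Madrid
      start: 0 8 * * *       # scale up at 08:00
      end: 0 20 * * *        # scale back down at 20:00
      desiredReplicas: "10"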

HPA Configuration Best Practices

Always Set minReplicas ≥ 2 for Production

An HPA with minReplicas: 1 leaves a single point of failure whenever traffic drops low enough to scale in; any pod restart or node drain then causes an outage.

Tune Stabilization Windows

The default 5-minute scale-down stabilization is too aggressive for workloads with cyclical patterns. Increase it to match your workload's natural cycle:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

The behavior block (available in HPA v2) enables independent control over scale-up and scale-down.

Use a Lower CPU Threshold Than You Think

If scale-up takes 45 seconds, a 70% threshold leaves existing pods throttled during that window. Set CPU targets at 50-60% for services where scaling latency matters.

Combine HPA with PodDisruptionBudgets

HPA scale-down terminates pods. Without a PodDisruptionBudget, multiple replicas can be terminated simultaneously during maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: my-service

Don't Mix VPA Auto-Update with HPA on the Same Metric

VPA auto-updating requests while HPA scales on those metrics creates conflicting control loops.

Decision Framework: Which Autoscaler for Which Workload

| Workload type | Recommended signal | Tool |
| --- | --- | --- |
| Stateless HTTP API, CPU-bound | CPU utilization at 50-60% | HPA |
| Stateless HTTP API, I/O-bound | RPS per replica or P99 latency | HPA + custom metrics |
| Message queue consumer | Consumer lag / queue depth | KEDA |
| Event-driven / Kafka / SQS | Event rate or lag | KEDA |
| Predictable traffic pattern | Schedule (time-based) | KEDA Cron scaler |
| Workload with memory leak risk | CPU primary + memory at 85% secondary | HPA (v2 multi-metric) |
| Right-sizing before HPA | Historical CPU/memory recommendations | VPA recommendation mode |

Going Beyond HPA: KEDA and Custom Metrics

KEDA provides a Kubernetes-native autoscaling framework supporting over 60 built-in scalers. The key architectural point: "KEDA does not replace HPA — it feeds it." KEDA creates and manages HPA resources while consuming signals HPA cannot access natively.

FAQ

Can I use both CPU and memory in the same HPA?

Yes. HPA v2 supports multiple metrics simultaneously, scaling to satisfy the most demanding metric. Use CPU at 60% threshold and memory at 85% threshold so memory only triggers in genuine overconsumption.
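A sketch of such a multi-metric spec (only the metrics block is shown; scaleTargetRef and replica bounds follow the earlier examples):

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85

The controller computes desired replicas for each metric independently and uses the highest, so memory only dominates when utilization genuinely exceeds 85%.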

Why does my workload scale up immediately after deployment?

Resource request misconfiguration. Compare actual consumption against requests using kubectl top pods. If a pod consumes 200% of its request just by idling, adjust the requests to match actual usage before enabling HPA.

Why does HPA scale down too aggressively and cause latency spikes?

Increase scaleDown.stabilizationWindowSeconds in the HPA behavior block. Also add a Percent policy limiting scale-down to 25% of replicas per minute.

Should I set HPA on every deployment?

No. HPA fits stateless services, consumers, and request handlers. It's inappropriate for stateful workloads requiring more than replica addition, singleton controllers, or batch jobs that should run to completion.

What is the minimum CPU request for reliable HPA?

No absolute minimum, but requests below 100m make percentage thresholds coarse-grained. At 50m and 70% threshold, scaling triggers at 35m consumption. For lower needs, use RPS or custom metrics instead.

How do I debug HPA scaling decisions?

Use kubectl describe hpa to see current metrics and last scaling events. Check HPA events with kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler. For custom metrics, verify the metrics server returns expected values.


Originally published at alexandre-vazquez.com/kubernetes-hpa-best-practices
