Stop treating your autoscaling as a black box! This deep dive into Kubernetes HPA, originally published on devopsstart.com, explores the algorithms and stabilization windows that drive scaling decisions.
Introduction to Kubernetes Horizontal Pod Autoscaling (HPA)
Kubernetes Horizontal Pod Autoscaling (HPA) is a critical component for managing dynamic workloads, ensuring your applications remain responsive and cost-efficient. You've likely configured basic HPA rules before, tying them to CPU or memory utilization. But how well do you really understand what's happening under the hood when your HPA decides to scale up or down? This isn't just about applying a YAML manifest; it's about understanding the intricate dance between the HPA controller, the Metrics Server, and your application's resource demands.
This article goes beyond the typical "how-to" guide. We'll peel back the layers, exploring the HPA's core mechanics, its precise scaling algorithm, and how it handles metrics beyond just CPU. We'll dive into stabilization windows, custom and external metrics, and the policies that prevent erratic scaling. By the end, you'll have a much clearer picture of how HPA makes its decisions, empowering you to troubleshoot common issues, fine-tune your autoscaling strategies, and build more resilient and performant Kubernetes applications. Ditching the black box approach will save you headaches and infrastructure costs.
How HPA Works: Components and Data Flow
At its heart, HPA is about reacting to observed resource utilization to adjust the number of pod replicas. The HPA controller, running as part of the Kubernetes control plane, is the brain. It continuously monitors defined metrics and compares them against target values you've specified in the HPA configuration.
Here’s a breakdown of the interaction:
- HPA Controller: Periodically (by default, every 15 seconds) queries the Kubernetes API for metrics and the current state of the scaled target.
- Metrics Server: This cluster add-on aggregates CPU and memory usage data from the Kubelets on each node. It exposes these metrics via the `metrics.k8s.io` API, which is the standard resource metrics API.
- Kubernetes API Server: The central control plane component. The HPA controller interfaces with the API server to retrieve object definitions, submit scaling requests (by updating the `replicas` field of the target workload), and fetch metrics data (which the API server obtains from the Metrics Server or the Custom/External Metrics APIs).
- Target Workload (Deployment, ReplicaSet, StatefulSet): This is the Kubernetes object that HPA scales. When HPA decides to scale, it updates the `replicas` field of this object. The corresponding controller for that object (e.g., the Deployment controller) then creates or deletes pods to match the desired replica count.
The crucial distinction here is between horizontal and vertical scaling. HPA performs horizontal scaling: it changes the number of pods. The Vertical Pod Autoscaler (VPA), on the other hand, adjusts the resource requests and limits (CPU/memory) of individual pods. They solve different problems and can even be used together, with caution (avoid letting both act on the same resource metric, such as CPU).
Let's look at a basic HPA definition for a Deployment named nginx-app:
```yaml
# hpa-example.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50 # Target 50% average CPU utilization
```
To apply this configuration, save it as hpa-example.yaml and run:
```
kubectl apply -f hpa-example.yaml
```
You can then check the HPA status to see its current state and targets:
```
kubectl get hpa nginx-hpa -o wide
```
Expected output:
```
NAME        REFERENCE              TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx-app   0%/50%    1         10        1          45s
```
In this output, 0%/50% indicates that the current average CPU utilization across the pods managed by nginx-hpa is 0%, and the target utilization is 50%. The HPA is currently managing 1 replica, which is its minReplicas setting.
HPA Scaling Algorithm and Stabilization Windows
Understanding how HPA calculates the desired number of replicas is key to predicting its behavior. The core of the HPA algorithm revolves around comparing observed metrics against target values.
For Resource metrics (CPU, memory) with `target.type: Utilization`, the HPA controller fetches the raw usage data for all pods managed by the `scaleTargetRef`. It excludes pods that are being deleted or have failed, and treats not-yet-ready pods conservatively (for CPU, pods that haven't been ready long enough after startup are set aside). It then divides each remaining pod's raw usage by the `resource.requests` value defined in the pod's container specification to get a per-pod utilization percentage, and averages these percentages across all healthy, ready pods.
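To make the utilization math concrete, here is a toy sketch of the averaging step. The millicore numbers are hypothetical, and `average_utilization` is an illustrative helper, not controller code:

```python
def average_utilization(pod_usages_millicores, pod_requests_millicores):
    """Average CPU utilization (%) across ready pods: per-pod usage / request."""
    utilizations = [
        100.0 * usage / request
        for usage, request in zip(pod_usages_millicores, pod_requests_millicores)
    ]
    return sum(utilizations) / len(utilizations)

# Three pods, each requesting 200m CPU, currently using 150m, 100m, and 200m.
print(average_utilization([150, 100, 200], [200, 200, 200]))  # 75.0
```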
The desired number of replicas is calculated using this formula:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
Let's illustrate with an example:
- `currentReplicas` = 3
- `currentMetricValue` (average CPU utilization) = 75%
- `desiredMetricValue` (target CPU utilization) = 50%
The calculation proceeds as follows: desiredReplicas = ceil[3 * (75 / 50)] = ceil[3 * 1.5] = ceil[4.5] = 5
Based on this, HPA would propose scaling up to 5 replicas.
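The formula is simple enough to sketch directly in Python (a toy illustration; `desired_replicas` is an illustrative name, not the controller's code):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA formula: ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# The worked example above: 3 replicas at 75% average utilization, 50% target.
print(desired_replicas(3, 75, 50))  # 5
# Load below target drives the proposal down instead.
print(desired_replicas(5, 30, 50))  # 3
```

Note that the ceiling means HPA rounds up: even a slight excess over the target proposes an extra replica.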
Crucially, this desiredReplicas value is then subjected to stabilization windows to prevent rapid "resource flapping" (thrashing between scaling up and down). Kubernetes autoscaling/v2 introduced configurable behavior policies, which offer fine-grained control over scaling actions.
```yaml
# hpa-behavior.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      # Consider scale-up recommendations from the last 5 minutes (300 seconds)
      # and act on the lowest of them. This smooths out transient spikes.
      stabilizationWindowSeconds: 300
      policies:
      # Scale up by at most 100% of current replicas in any 15-second period
      - type: Percent
        value: 100
        periodSeconds: 15
      # Or scale up by at most 4 pods in any 15-second period
      # (with the default selectPolicy: Max, the more permissive policy wins)
      - type: Pods
        value: 4
        periodSeconds: 15
    scaleDown:
      # Act on the highest recommendation from the last 5 minutes.
      # This prevents rapid "flapping" down after a brief dip in load.
      stabilizationWindowSeconds: 300
      policies:
      # Scale down by at most 100% of current replicas in any 15-second period
      - type: Percent
        value: 100
        periodSeconds: 15
```
The stabilizationWindowSeconds defines a period during which the HPA controller considers past scaling proposals.
For scaleUp, when a stabilization window is configured, HPA uses the lowest replica count proposed within the last stabilizationWindowSeconds. A scale-up happens only if even that lowest recommendation exceeds the current replica count. This mechanism prevents the HPA from overreacting to very short-lived spikes that disappear quickly.
For scaleDown, HPA uses the highest proposed replica count within the stabilizationWindowSeconds. This is critical: HPA will only shrink the workload to the largest recommendation seen over the whole window, so the load must stay consistently low for the entire window before a scale-down executes. This prevents "flapping" down too quickly after a transient dip in load, which could force a quick scale-up if the load returns.
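In both directions the window picks the most conservative of the recent per-cycle recommendations: the highest for scale-down, the lowest for scale-up. A simplified sketch (illustrative function, not the controller's implementation):

```python
def stabilized_recommendation(window_recommendations, direction):
    """Pick the most conservative recommendation from the stabilization window:
    the highest for scale-down (delay shrinking), the lowest for scale-up
    (ignore short-lived spikes)."""
    if direction == "down":
        return max(window_recommendations)
    return min(window_recommendations)

# Load briefly dipped: per-cycle recommendations over the window were [5, 3, 5, 4].
print(stabilized_recommendation([5, 3, 5, 4], "down"))  # 5  (no scale-down yet)
# Load briefly spiked to a 5-replica proposal: scale-up still holds at 3.
print(stabilized_recommendation([5, 3, 5, 4], "up"))    # 3
```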
The policies within scaleUp and scaleDown further limit the scaling rate. When multiple policies are listed, the default selectPolicy is Max: the scaleUp policies above allow, within any 15-second period, an increase of up to 100% of current replicas or up to 4 pods, whichever permits the larger change (set selectPolicy: Min to choose the more conservative policy instead). Either way, the per-period cap prevents explosive scaling that could overwhelm the cluster or external services.
Conversely, scaleDown policies limit how quickly HPA removes pods. The default stabilizationWindowSeconds is 300 seconds (5 minutes) for scale-down and 0 seconds for scaleUp. It's often beneficial to increase the scaleUp stabilization window for more stable behavior, especially if application startup times are significant.
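A simplified sketch of how the per-period policy cap is chosen (the real controller also tracks the replica changes already made within each periodSeconds; `scale_up_limit` and its tuple-based policy format are illustrative assumptions):

```python
import math

def scale_up_limit(current_replicas, policies, select_policy="Max"):
    """Max replicas allowed after one period, given scaleUp policies.
    Each policy is a (type, value) pair; selectPolicy "Max" (the default)
    picks the most permissive limit, "Min" the most restrictive."""
    limits = []
    for ptype, value in policies:
        if ptype == "Percent":
            limits.append(current_replicas + math.ceil(current_replicas * value / 100))
        elif ptype == "Pods":
            limits.append(current_replicas + value)
    return max(limits) if select_policy == "Max" else min(limits)

policies = [("Percent", 100), ("Pods", 4)]
# At 3 replicas: Percent allows up to 6, Pods allows up to 7 -> Max picks 7.
print(scale_up_limit(3, policies))         # 7
print(scale_up_limit(3, policies, "Min"))  # 6
```

Note how the dominant policy flips with scale: at 3 replicas the Pods policy is more permissive, but at 10 replicas the Percent policy would allow 20 while Pods allows only 14.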
Custom and External Metrics for Advanced Autoscaling
While CPU and memory are useful for general resource management, many applications have more nuanced scaling requirements. Consider web services that need to scale based on Requests Per Second (RPS), or message processing queues that should scale based on queue depth. This is where Custom and External Metrics become invaluable.
- Custom Metrics: These are application-specific metrics collected and exposed by your services. They are usually derived from an application's internal state (e.g., number of active users, database connections, unique business metric). HPA can query these custom metrics via the Kubernetes Custom Metrics API.
- External Metrics: These are metrics originating from outside the Kubernetes cluster, often from cloud provider services (e.g., AWS SQS queue length, Google Cloud Pub/Sub publish rate, Azure Service Bus message count, Datadog metric). HPA accesses these through the Kubernetes External Metrics API.
Both Custom and External Metrics APIs are backed by adapters. These adapters, like k8s-prometheus-adapter for custom metrics or cloud-specific adapters (e.g., GCP's Stackdriver adapter) for external metrics, translate metric requests from the HPA into queries against your monitoring system (Prometheus, CloudWatch, Stackdriver, etc.). The adapter then returns the aggregated metric data to the Kubernetes API server, which HPA consumes.
Let's illustrate with a Custom Metric example for requests_per_second, assuming you have configured k8s-prometheus-adapter in your cluster.
1. **Define your application metric:** Your application needs to expose a metric, for example, `http_requests_total`, using a Prometheus client library. Prometheus then scrapes this metric. The `k8s-prometheus-adapter` is configured to query Prometheus and expose this as a custom metric like `http_requests_per_second` via the Custom Metrics API.

2. **HPA configuration using Custom Metrics:**

```yaml
# hpa-custom-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods # "Pods" type targets an average value across all pods
    pods:
      metric:
        name: http_requests_per_second # Custom metric name exposed by the adapter
      target:
        type: AverageValue
        averageValue: "100" # Target 100 requests/second per pod
```
In this example, the HPA targets an average of 100 HTTP requests per second *per pod*. The `k8s-prometheus-adapter` would be configured to aggregate the `http_requests_total` metric, convert it to a rate, and expose it as `http_requests_per_second` to the Custom Metrics API.
3. **HPA configuration using External Metrics (conceptual example for an AWS SQS queue):**
```yaml
# hpa-external-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sqs-worker-processor
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_messages_visible
        selector: # Selects which instance of the external metric to target
          matchLabels:
            queue_name: my-app-queue # Label identifying this queue in the monitoring system
      target:
        type: AverageValue
        averageValue: "20" # Target at most 20 visible messages per worker pod
```

Here, an external metrics adapter (e.g., one integrated with AWS CloudWatch or a dedicated SQS metrics adapter) would query the `ApproximateNumberOfMessagesVisible` metric for `my-app-queue` in SQS. The HPA then scales the `sqs-worker-processor` pods to maintain an average of 20 messages per available worker: with 100 messages in the queue, 5 workers are desired (100 messages / 20 messages per worker = 5 workers).
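For an AverageValue target on an external metric, the arithmetic reduces to dividing the total metric value by the per-pod target (a toy sketch; `replicas_for_average_value` is an illustrative name):

```python
import math

def replicas_for_average_value(total_metric: float, target_per_pod: float) -> int:
    """AverageValue target: desiredReplicas = ceil(totalMetric / targetAverageValue)."""
    return math.ceil(total_metric / target_per_pod)

# 100 visible SQS messages, target of 20 per worker -> 5 workers.
print(replicas_for_average_value(100, 20))  # 5
# One extra message tips the ceiling to a 6th worker.
print(replicas_for_average_value(101, 20))  # 6
```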
The key advantage of custom and external metrics is their ability to scale based on business-relevant load indicators, leading to more accurate and efficient autoscaling than relying solely on generic resource metrics. However, they introduce additional complexity in setup and require robust monitoring of the metric sources themselves. Always ensure your custom/external metric sources are highly available, accurate, and have low latency.
Best Practices for HPA in Production Environments
Implementing HPA effectively in production goes beyond just writing a YAML file. Here are some best practices to ensure stability, efficiency, and predictability.
1. Define Accurate Resource Requests and Limits
HPA's CPU and memory Utilization targets rely directly on appropriately set resource.requests. If requests are too low, pods will appear highly utilized prematurely, leading to over-scaling. If requests are too high, HPA will underscale or never scale, as actual utilization will consistently be below the target. Limits, on the other hand, prevent a runaway pod from consuming all node resources, helping to maintain node stability. Use tools like Vertical Pod Autoscaler (VPA) in recommendation mode to get data-driven suggestions for requests and limits, but always validate them with testing.
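The sensitivity to requests is easy to see numerically: the same absolute usage produces wildly different utilization percentages depending on the request setting. A quick sketch with hypothetical millicore values:

```python
def utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """HPA utilization is relative to the pod's CPU request, not node capacity."""
    return 100.0 * usage_millicores / request_millicores

# The same 150m of actual CPU usage, against three different request settings:
print(utilization_percent(150, 100))   # 150.0 -> request too low: HPA over-scales
print(utilization_percent(150, 300))   # 50.0  -> right at a 50% target
print(utilization_percent(150, 1000))  # 15.0  -> request too high: HPA never scales
```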
2. Choose the Right Metrics for Autoscaling
While CPU and memory are defaults, they aren't always the best indicators of application load or user experience. For stateless web applications, Requests Per Second (RPS) or error rates might be more relevant. For backend workers, queue length or processing latency could be better. Use custom or external metrics when generic resource metrics don't align with your application's actual load. This leads to more intelligent and workload-aware scaling.
3. Effectively Configure HPA Stabilization Windows
The default scaleDown.stabilizationWindowSeconds of 300 seconds (5 minutes) is often a good starting point, preventing scale-down flapping. However, the scaleUp stabilization window is 0 seconds by default, meaning HPA will react instantly to a sustained spike. Sudden traffic spikes can cause rapid, jerky scale-ups, potentially leading to resource contention or increased costs. Consider adding a scaleUp.stabilizationWindowSeconds (e.g., 60-120 seconds) to prevent overreacting to short-lived spikes. Adjust these values based on your application's startup time, shutdown time, and how quickly its load typically changes.
4. Set minReplicas and maxReplicas Realistically
minReplicas should be high enough to handle your baseline load, provide redundancy for high availability, and absorb initial traffic bursts. This avoids cold starts when scaling from zero. maxReplicas acts as a crucial safety valve, preventing uncontrolled scaling due to misconfigurations or unforeseen load spikes, which could exhaust cluster resources, hit API rate limits, or lead to excessive cloud costs.
5. Validate HPA Configurations with Load Testing
Never deploy HPA to production without thorough load testing. Simulate various traffic patterns (gradual ramp-up, sudden spikes, sustained high load, rapid fall-off) in a staging environment. Observe not just pod counts but also application performance metrics (latency, throughput), error rates, and resource utilization on nodes. Tools like K6, Locust, or JMeter are excellent for this. This proactive testing helps identify and fix issues before they impact production.
6. Monitor HPA Events and Metrics Thoroughly
Kubernetes emits events for HPA scaling actions (e.g., ScaledUp, ScaledDown). Monitor these events (kubectl describe hpa <name> or kubectl get events --watch) to understand why and when scaling occurs. Additionally, track the HPA's own metrics if exposed (e.g., via Prometheus), such as the calculated desired replica count, the last scale time, and the target metrics. Comprehensive monitoring helps debug misbehavior, validate efficiency, and understand your application's dynamic resource needs over time.
FAQ about Kubernetes HPA
Q1: My HPA isn't scaling, even though metrics (like CPU) are consistently high. What could be wrong?
A1: First, check the HPA's current state and conditions with kubectl get hpa <name> -o yaml and kubectl describe hpa <name>. Look for status.conditions for any warnings or reasons for inaction. Common reasons for HPA not scaling include:
- Missing `resource.requests`: HPA requires CPU and memory `requests` to be defined in your pod specification to calculate utilization percentages accurately. If they're missing, HPA cannot scale based on resource utilization.
- Metrics Server issues: Ensure the Kubernetes Metrics Server is running and healthy (`kubectl get pods -n kube-system | grep metrics-server`). HPA relies on it for CPU and memory metrics.
- Incorrect `scaleTargetRef`: Verify that the `apiVersion`, `kind`, and `name` in `scaleTargetRef` accurately point to the target `Deployment`, `StatefulSet`, or `ReplicaSet`.
- `minReplicas`/`maxReplicas` reached: Check if the HPA is already at its `maxReplicas` limit. If so, it cannot scale up further.
- Stabilization windows: If a `scaleUp.stabilizationWindowSeconds` is configured, the HPA might be waiting for the load to remain consistently high for the specified duration before acting.
- Custom/External metrics misconfiguration: For non-resource metrics, check the logs of your custom/external metrics adapter for any errors in fetching or processing metrics from your monitoring system.
Q2: Why does my application scale down so slowly?
A2: This behavior is typically due to the scaleDown.stabilizationWindowSeconds configuration in your HPA. HPA, by default for autoscaling/v2, uses a stabilizationWindowSeconds of 300 seconds (5 minutes) for scale-down if no explicit behavior is defined. This means it will wait for 5 minutes of consistently low load before allowing a scale-down action. This delay is intentionally designed to prevent "flapping" down too quickly during temporary lulls in traffic that might immediately be followed by another spike. If your application can tolerate more aggressive scale-downs without impacting performance or availability, you can reduce this value in the behavior section of your HPA specification.
Q3: Can HPA scale based on multiple metrics simultaneously?
A3: Yes, Kubernetes HPA version autoscaling/v2 and higher fully support scaling based on multiple metrics. When multiple metrics are specified in the HPA configuration (e.g., CPU utilization, custom requests per second, and external queue depth), HPA calculates a desiredReplicas value for each metric independently. It then selects the highest desiredReplicas value among all calculations, ensuring that the application scales up to satisfy the most demanding metric. This combined desired replica count is then subject to minReplicas and maxReplicas, as well as stabilization windows, before the scaling action is performed. This allows for very flexible and intelligent autoscaling.
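The combination rule is a simple maximum followed by clamping to the configured bounds. A toy sketch (illustrative function names):

```python
def combined_desired_replicas(per_metric_desired, min_replicas, max_replicas):
    """With multiple metrics, HPA takes the highest per-metric proposal,
    then clamps it to the [minReplicas, maxReplicas] bounds."""
    return max(min_replicas, min(max(per_metric_desired), max_replicas))

# CPU proposes 4, RPS proposes 7, queue depth proposes 2; bounds are 1..10.
print(combined_desired_replicas([4, 7, 2], 1, 10))   # 7
# A runaway metric is still capped by maxReplicas.
print(combined_desired_replicas([4, 12, 2], 1, 10))  # 10
```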
Q4: What is the difference between Pods and Object metric types for custom metrics in HPA?
A4: The Pods and Object metric types are used for different aggregation strategies when dealing with custom metrics:
- `Pods` metric type: This targets an average value across all pods that the HPA manages. HPA expects the custom metric to be reported on a per-pod basis (i.e., each pod exposes its own value for that metric). The HPA controller fetches the metric value for each individual pod, sums them, and divides by the number of pods to get an average value. It then uses this average to calculate the desired number of replicas (e.g., "target an average of 100 requests per second per pod").
- `Object` metric type: This targets a single metric value associated with one specific Kubernetes object, referenced via `describedObject` (e.g., a Service or Ingress). This is typically used when the metric doesn't naturally aggregate on a per-pod basis but rather represents a global value for the workload (e.g., "total number of active user sessions across all pods of this Deployment," or "latency of the service endpoint"). HPA queries this single object-level metric value directly and uses it to calculate the desired replicas.
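The two aggregation strategies can be sketched side by side. Assumptions: hypothetical per-pod RPS values, and an `Object` metric with a plain `Value` target (for `AverageValue` targets the object value would instead be divided by the pod count); function names are illustrative:

```python
import math

def desired_from_pods_metric(per_pod_values, target_average):
    """Pods type: average the per-pod metric values, then apply the core formula."""
    current_avg = sum(per_pod_values) / len(per_pod_values)
    return math.ceil(len(per_pod_values) * current_avg / target_average)

def desired_from_object_metric(current_replicas, object_value, target_value):
    """Object type (Value target): one workload-level metric drives scaling."""
    return math.ceil(current_replicas * object_value / target_value)

# Pods: three pods reporting 150, 120, 180 rps against a 100 rps/pod target.
print(desired_from_pods_metric([150, 120, 180], 100))  # 5
# Object: 3 replicas, 900 total sessions against a 600-session target.
print(desired_from_object_metric(3, 900, 600))         # 5
```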
Conclusion
Understanding Kubernetes HPA beyond basic configuration is essential for building robust, scalable applications. We've explored the interplay of the HPA controller, Metrics Server, and API server, dissected the scaling algorithm with its critical stabilization windows, and demystified the integration of custom and external metrics. This deeper knowledge isn't just academic; it directly empowers you to troubleshoot elusive scaling issues, optimize resource utilization, and ultimately control your infrastructure costs.
Your next steps should involve:
- Review your existing HPA configurations: Verify that your `minReplicas`, `maxReplicas`, and `stabilizationWindowSeconds` settings are appropriate for your application's behavior and performance requirements.
- Evaluate your metrics: Determine if CPU/memory are truly the best indicators of load for all your critical workloads, or if custom/external metrics could provide more precise and business-relevant scaling.
- Implement careful monitoring: Set up alerts and dashboards to track HPA events, desired replica counts, and application performance metrics during scaling events. This feedback loop is crucial for iterative improvement.
- Test under load: Always validate your HPA behavior in a dedicated staging or testing environment. Use load testing tools to simulate various traffic patterns and observe how HPA responds, ensuring it scales efficiently without over- or under-provisioning.
By applying these insights, you'll move from merely setting and forgetting HPA to actively mastering your application's autoscaling behavior.