As a Site Reliability Engineer who has managed countless production Kubernetes clusters, I've learned that one of the biggest challenges teams face is properly sizing their applications. Too few resources and your app crashes under load; too many and you're burning money unnecessarily.
That's where Kubernetes autoscaling comes to the rescue.
What is Autoscaling and Why Should You Care?
Imagine you're running an online store. During normal hours, you might need 3 servers to handle traffic. But during Black Friday sales, you suddenly need 20 servers. Then afterward, you scale back down to save costs.
Kubernetes autoscaling does exactly this - but automatically. No more 3 AM wake-up calls to manually add servers during traffic spikes.
There are two main types of autoscaling in Kubernetes:
- Horizontal Pod Autoscaler (HPA): Adds more copies of your application
- Vertical Pod Autoscaler (VPA): Gives more power (CPU/memory) to existing copies
Think of HPA as hiring more cashiers during busy hours, while VPA is like giving your existing cashiers faster computers.
Horizontal Pod Autoscaler (HPA): Adding More Workers
How HPA Works
HPA continuously monitors your application's resource usage. When CPU or memory usage gets too high, it automatically creates more pod copies to handle the load. When traffic decreases, it scales back down.
Here's the process:
- Monitor: HPA checks your app's metrics every 15 seconds (the default sync period)
- Calculate: It determines if more or fewer pods are needed
- Scale: It adds or removes pods accordingly
- Wait: It waits for the system to stabilize before making another decision
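The calculation in step 2 follows the ratio-based rule from the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). Here's a minimal sketch in Python (the traffic numbers are illustrative, not from a real cluster):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """HPA's core scaling rule: multiply the current replica count by the
    ratio of observed metric to target metric, then round up."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 3 pods averaging 180% CPU utilization against a 70% target:
print(desired_replicas(3, 180, 70))  # -> 8
```

Note that the same rule scales down: if those 8 pods later average only 20% CPU, the formula yields ceil(8 × 20 / 70) = 3 pods.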
Setting Up HPA
Let's say you have a web application that should scale up when CPU usage exceeds 70%. Here's how you set it up:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2    # Never go below 2 pods
  maxReplicas: 10   # Never exceed 10 pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # Scale when CPU > 70%
Real-World Example
I once worked with an e-commerce company that saw traffic spikes during lunch hours (12-2 PM). Without HPA, their app would slow down terribly during these periods, causing customers to abandon their shopping carts.
After implementing HPA:
- Before lunch rush: 3 pods running (normal load)
- During lunch rush: Automatically scaled to 8 pods
- After lunch rush: Gradually scaled back to 3 pods
Result? Page load times stayed consistent, and they processed 40% more orders during peak hours.
HPA Best Practices
1. Set Resource Requests
Your pods MUST have CPU and memory requests defined. HPA can't work without them, because utilization is calculated as a percentage of the request.
resources:
  requests:
    cpu: 100m       # Required for HPA
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
2. Don't Set Min Replicas Too Low
Always keep at least 2 replicas running. If your single pod crashes, your entire service goes down.
3. Monitor Scaling Events
Use these commands to see what HPA is doing:
kubectl get hpa
kubectl describe hpa web-app-hpa
4. Avoid Flapping
If your app scales up and down too frequently, increase the stabilization window:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down
Vertical Pod Autoscaler (VPA): Giving More Power
How VPA Works
While HPA adds more workers, VPA makes existing workers more powerful. It monitors your application's actual resource usage over time and automatically adjusts CPU and memory allocations.
VPA has three modes:
- Off: Only provides recommendations
- Initial: Sets resources only when pods are created
- Auto: Automatically updates running pods (requires pod restart)
Setting Up VPA
Here's a basic VPA configuration:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
When to Use VPA
VPA is perfect for:
- Database workloads: Databases often need more memory rather than more instances
- Data processing applications: These might need varying amounts of CPU and memory
- Legacy applications: Apps that can't easily scale horizontally
VPA Limitations to Know
1. Pod Restarts Required
VPA needs to restart pods to apply new resource settings. Plan for this.
2. Not Suitable for Stateful Apps
Avoid VPA for databases or other stateful applications where pod restarts cause issues.
3. Not Part of Core Kubernetes
VPA ships as a separate add-on maintained outside the Kubernetes core and is less mature than HPA. Test thoroughly before production use.
Can You Use HPA and VPA Together?
Short answer: Generally no, not on the same metrics.
If both HPA and VPA try to scale based on CPU usage, they'll fight each other:
- HPA adds more pods because CPU is high
- VPA increases CPU allocation because usage is high
- This creates confusion and unpredictable behavior
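To see why this feedback loop misbehaves, here's a deliberately simplified toy model (all numbers are invented for illustration) in which both controllers react to the same CPU signal on every step: HPA scales the replica count on utilization while a VPA-like step resizes the per-pod request toward observed use. The replica count oscillates instead of settling:

```python
import math

def simulate(steps: int = 4) -> list:
    """Toy model: total CPU demand is fixed at 600m. Each step, an
    HPA-like rule scales replicas on utilization (70% target) while a
    VPA-like rule resizes the per-pod CPU request with 20% headroom.
    Returns the replica count after each step."""
    replicas, request = 3, 100      # start: 3 pods, 100m CPU request each
    history = []
    for _ in range(steps):
        per_pod = 600 / replicas                 # observed CPU per pod (m)
        utilization = 100 * per_pod / request    # percent of request
        replicas = max(1, math.ceil(replicas * utilization / 70))  # HPA step
        request = max(1, round(per_pod * 1.2))                     # VPA step
        history.append(replicas)
    return history

print(simulate())  # replica count jumps up and down instead of converging
```

Real clusters add stabilization windows and pod-restart delays that blur this picture, but the underlying problem is the same: two controllers consuming one signal and undoing each other's corrections.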
Safe combination: Use HPA for CPU scaling and VPA only for memory optimization:
# HPA handles CPU scaling
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
---
# VPA handles only memory
resourcePolicy:
  containerPolicies:
  - containerName: web-app
    controlledResources: ["memory"]   # Only memory, not CPU
Choosing the Right Autoscaling Strategy
Here's my decision framework after years of production experience:
Use HPA When:
- Your app is stateless (web servers, APIs)
- You have variable traffic patterns (daily/weekly spikes)
- Your app can handle multiple instances
- You need fast scaling response (seconds to minutes)
Use VPA When:
- Your app has unpredictable resource needs
- You're running batch jobs or data processing
- You have stateful applications that can't scale horizontally
- You want to optimize resource costs over time
Use Neither When:
- Your app has steady, predictable load
- Resource requirements are well-known and stable
- You prefer manual control over scaling decisions
Getting Started: Your First Autoscaling Setup
Step 1: Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Step 2: Create a Simple HPA
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10
Step 3: Generate Some Load
kubectl run load-test --image=busybox --rm -it --restart=Never -- /bin/sh
# Inside the pod, run:
while true; do wget -q -O- http://web-app-service; done
Step 4: Watch It Scale
kubectl get hpa --watch
kubectl get pods --watch
Conclusion
Kubernetes autoscaling isn't just a nice-to-have feature - it's essential for running resilient, cost-effective applications at scale. HPA helps you handle traffic spikes automatically, while VPA ensures you're not wasting resources.
Start simple:
- Implement HPA for your stateless web applications
- Set conservative scaling thresholds initially
- Monitor and adjust based on real usage patterns
- Consider VPA for resource optimization once you're comfortable
Remember, autoscaling is as much about saving money as it is about handling load. Done right, it keeps your applications responsive while optimizing costs - letting you sleep better at night knowing your systems can handle whatever comes their way.