Yoshik Karnawat

Originally published at Medium

Kubernetes Autoscaling: Never Manually Scale Again

As a Site Reliability Engineer who has managed countless production Kubernetes clusters, I've learned that one of the biggest challenges teams face is properly sizing their applications. Too few resources? Your app crashes under load. Too many? You're burning money unnecessarily.

That's where Kubernetes autoscaling comes to the rescue.

What is Autoscaling and Why Should You Care?

Imagine you're running an online store. During normal hours, you might need 3 servers to handle traffic. But during Black Friday sales, you suddenly need 20 servers. Then afterward, you scale back down to save costs.

Kubernetes autoscaling does exactly this - but automatically. No more 3 AM wake-up calls to manually add servers during traffic spikes.

There are two main types of autoscaling in Kubernetes:

  • Horizontal Pod Autoscaler (HPA): Adds more copies of your application
  • Vertical Pod Autoscaler (VPA): Gives more power (CPU/memory) to existing copies

Think of HPA as hiring more cashiers during busy hours, while VPA is like giving your existing cashiers faster computers.

Horizontal Pod Autoscaler (HPA): Adding More Workers

How HPA Works

HPA continuously monitors your application's resource usage. When CPU or memory usage gets too high, it automatically creates more pod copies to handle the load. When traffic decreases, it scales back down.

Here's the process:

  1. Monitor: HPA checks your app's metrics every 15 seconds
  2. Calculate: It determines if more or fewer pods are needed
  3. Scale: It adds or removes pods accordingly
  4. Wait: It waits for the system to stabilize before making another decision
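
Under the hood, the scaling math is simple. The formula from the Kubernetes docs is:

desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)

For example, if 4 pods are averaging 90% CPU against a 70% target, HPA scales to ceil(4 * 90 / 70) = 6 pods.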

Setting Up HPA

Let's say you have a web application that should scale up when CPU usage exceeds 70%. Here's how you set it up:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2      # Never go below 2 pods
  maxReplicas: 10     # Never exceed 10 pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale to keep average CPU around 70%
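Save the manifest as something like web-app-hpa.yaml (the file name is just an example), apply it, and confirm the HPA can read metrics:

kubectl apply -f web-app-hpa.yaml
kubectl get hpa web-app-hpa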

Real-World Example

I once worked with an e-commerce company that saw traffic spikes during lunch hours (12-2 PM). Without HPA, their app would slow down terribly during these periods, causing customers to abandon their shopping carts.

After implementing HPA:

  • Before lunch rush: 3 pods running (normal load)
  • During lunch rush: Automatically scaled to 8 pods
  • After lunch rush: Gradually scaled back to 3 pods

Result? Page load times stayed consistent, and they processed 40% more orders during peak hours.

HPA Best Practices

1. Set Resource Requests
Your pods MUST have resource requests defined for any metric HPA scales on (CPU in this example). Without requests, HPA can't calculate utilization.

resources:
  requests:
    cpu: 100m        # Required for HPA
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

2. Don't Set Min Replicas Too Low
Always keep at least 2 replicas running. With minReplicas: 1, a single pod crash (or node failure) takes your entire service down.

3. Monitor Scaling Events
Use these commands to see what HPA is doing:

kubectl get hpa
kubectl describe hpa web-app-hpa

4. Avoid Flapping
If your app scales up and down too frequently, increase the stabilization window:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
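The behavior block goes under spec in the same HPA manifest. If you also want to rate-limit how quickly HPA adds pods, you can combine it with a scaleUp policy; the values below are just illustrative:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
  scaleUp:
    policies:
    - type: Pods
      value: 4           # add at most 4 pods...
      periodSeconds: 60  # ...per 60-second window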

Vertical Pod Autoscaler (VPA): Giving More Power

How VPA Works

While HPA adds more workers, VPA makes existing workers more powerful. It monitors your application's actual resource usage over time and automatically adjusts CPU and memory allocations.

VPA has three modes:

  • Off: Only provides recommendations
  • Initial: Sets resources only when pods are created
  • Auto: Automatically updates running pods (requires pod restart)

Setting Up VPA

Here's a basic VPA configuration (note that VPA isn't part of core Kubernetes; you install it separately from the kubernetes/autoscaler project):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
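Apply it and, once the VPA has had some time to observe the workload, check its recommendations (web-app-vpa.yaml is just an example file name; the resource name matches the manifest above):

kubectl apply -f web-app-vpa.yaml
kubectl describe vpa web-app-vpa

The Status section shows lower bound, target, and upper bound recommendations for each container.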

When to Use VPA

VPA is perfect for:

  • Database workloads: Databases often need more memory rather than more instances
  • Data processing applications: These might need varying amounts of CPU and memory
  • Legacy applications: Apps that can't easily scale horizontally

VPA Limitations to Know

1. Pod Restarts Required
VPA needs to restart pods to apply new resource settings. Plan for this.

2. Be Careful with Stateful Apps
In Auto mode, VPA restarts pods to apply new resources, which can be disruptive for databases and other stateful applications. For those, recommendation-only (Off) mode is the safer fit.

3. Maintained as a Separate Add-on
VPA lives outside core Kubernetes (in the kubernetes/autoscaler project) and is less mature than HPA. Test thoroughly before production use.

Can You Use HPA and VPA Together?

Short answer: Generally no, not on the same metrics.

If both HPA and VPA try to scale based on CPU usage, they'll fight each other:

  • HPA adds more pods because CPU is high
  • VPA increases CPU allocation because usage is high
  • This creates confusion and unpredictable behavior

Safe combination: Use HPA for CPU scaling and VPA only for memory optimization:

# HPA handles CPU scaling
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70

---
# VPA handles only memory
resourcePolicy:
  containerPolicies:
  - containerName: web-app
    controlledResources: ["memory"]  # Only memory, not CPU

Choosing the Right Autoscaling Strategy

Here's my decision framework after years of production experience:

Use HPA When:

  • Your app is stateless (web servers, APIs)
  • You have variable traffic patterns (daily/weekly spikes)
  • Your app can handle multiple instances
  • You need fast scaling response (seconds to minutes)

Use VPA When:

  • Your app has unpredictable resource needs
  • You're running batch jobs or data processing
  • You have stateful applications that can't scale horizontally
  • You want to optimize resource costs over time

Use Neither When:

  • Your app has steady, predictable load
  • Resource requirements are well-known and stable
  • You prefer manual control over scaling decisions

Getting Started: Your First Autoscaling Setup

Step 1: Install Metrics Server

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Step 2: Create a Simple HPA

kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10

Step 3: Generate Some Load

kubectl run load-test --image=busybox --rm -it --restart=Never -- /bin/sh
# Inside the pod, run:
while true; do wget -q -O- http://web-app-service; done

Step 4: Watch It Scale

kubectl get hpa --watch
kubectl get pods --watch

Conclusion

Kubernetes autoscaling isn't just a nice-to-have feature - it's essential for running resilient, cost-effective applications at scale. HPA helps you handle traffic spikes automatically, while VPA ensures you're not wasting resources.

Start simple:

  1. Implement HPA for your stateless web applications
  2. Set conservative scaling thresholds initially
  3. Monitor and adjust based on real usage patterns
  4. Consider VPA for resource optimization once you're comfortable

Remember, autoscaling is as much about saving money as it is about handling load. Done right, it keeps your applications responsive while optimizing costs - letting you sleep better at night knowing your systems can handle whatever comes their way.

Top comments (2)

Yoshik Karnawat

Most of you will face issues when installing metrics-server because it can't verify the kubelet's self-signed TLS certificates in many clusters (especially local and dev setups).

Instead of the usual installation, do this:

wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Then open the components.yaml and update the args in the deployment by appending:

--kubelet-insecure-tls
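In a standard components.yaml, that flag goes into the metrics-server container's args list, roughly like this (the other default args may differ between releases):

containers:
- name: metrics-server
  args:
  - --cert-dir=/tmp
  - --secure-port=10250
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls   # skip kubelet certificate verification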

Finally, apply the modified file with kubectl apply -f components.yaml and your metrics-server will be up and running 🚀
