As a Site Reliability Engineer who has managed countless production Kubernetes clusters, I've learned that one of the biggest challenges teams face is properly sizing their applications. Too few resources and your app crashes under load; too many and you're burning money unnecessarily.
That's where Kubernetes autoscaling comes to the rescue.
What is Autoscaling and Why Should You Care?
Imagine you're running an online store. During normal hours, you might need 3 servers to handle traffic. But during Black Friday sales, you suddenly need 20 servers. Then afterward, you scale back down to save costs.
Kubernetes autoscaling does exactly this - but automatically. No more 3 AM wake-up calls to manually add servers during traffic spikes.
There are two main types of autoscaling in Kubernetes:
- Horizontal Pod Autoscaler (HPA): Adds more copies of your application
- Vertical Pod Autoscaler (VPA): Gives more power (CPU/memory) to existing copies
Think of HPA as hiring more cashiers during busy hours, while VPA is like giving your existing cashiers faster computers.
Horizontal Pod Autoscaler (HPA): Adding More Workers
How HPA Works
HPA continuously monitors your application's resource usage. When CPU or memory usage gets too high, it automatically creates more pod copies to handle the load. When traffic decreases, it scales back down.
Here's the process:
- Monitor: HPA checks your app's metrics every 15 seconds (the default sync period)
- Calculate: It determines if more or fewer pods are needed
- Scale: It adds or removes pods accordingly
- Wait: It waits for the system to stabilize before making another decision
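The calculation in step 2 follows the ratio-based rule from the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). Here's a minimal sketch in Python (the traffic numbers are illustrative, not from a real cluster):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """HPA's core scaling rule: multiply the current replica count by the
    ratio of observed metric to target metric, then round up."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 3 pods averaging 180% CPU utilization against a 70% target:
print(desired_replicas(3, 180, 70))  # -> 8
```

Note that the same rule scales down: if those 8 pods later average only 20% CPU, the formula yields ceil(8 × 20 / 70) = 3 pods.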
Setting Up HPA
Let's say you have a web application that should scale up when CPU usage exceeds 70%. Here's how you set it up:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2    # Never go below 2 pods
  maxReplicas: 10   # Never exceed 10 pods
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # Scale when CPU > 70%
Real-World Example
I once worked with an e-commerce company that saw traffic spikes during lunch hours (12-2 PM). Without HPA, their app would slow down terribly during these periods, causing customers to abandon their shopping carts.
After implementing HPA:
- Before lunch rush: 3 pods running (normal load)
- During lunch rush: Automatically scaled to 8 pods
- After lunch rush: Gradually scaled back to 3 pods
Result? Page load times stayed consistent, and they processed 40% more orders during peak hours.
HPA Best Practices
1. Set Resource Requests
Your pods MUST have CPU and memory requests defined. HPA can't work without them, because utilization is calculated as a percentage of the request.
resources:
  requests:
    cpu: 100m       # Required for HPA
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
2. Don't Set Min Replicas Too Low
Always keep at least 2 replicas running. If your single pod crashes, your entire service goes down.
3. Monitor Scaling Events
Use these commands to see what HPA is doing:
kubectl get hpa
kubectl describe hpa web-app-hpa
4. Avoid Flapping
If your app scales up and down too frequently, increase the stabilization window:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down
Vertical Pod Autoscaler (VPA): Giving More Power
How VPA Works
While HPA adds more workers, VPA makes existing workers more powerful. It monitors your application's actual resource usage over time and automatically adjusts CPU and memory allocations.
VPA has three modes:
- Off: Only provides recommendations
- Initial: Sets resources only when pods are created
- Auto: Automatically updates running pods (requires pod restart)
Setting Up VPA
Here's a basic VPA configuration:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
When to Use VPA
VPA is perfect for:
- Database workloads: Databases often need more memory rather than more instances
- Data processing applications: These might need varying amounts of CPU and memory
- Legacy applications: Apps that can't easily scale horizontally
VPA Limitations to Know
1. Pod Restarts Required
VPA needs to restart pods to apply new resource settings. Plan for this.
2. Not Suitable for Stateful Apps
Avoid VPA for databases or other stateful applications where pod restarts cause issues.
3. Not Part of Core Kubernetes
VPA ships as a separate add-on maintained outside the Kubernetes core and is less mature than HPA. Test thoroughly before production use.
Can You Use HPA and VPA Together?
Short answer: Generally no, not on the same metrics.
If both HPA and VPA try to scale based on CPU usage, they'll fight each other:
- HPA adds more pods because CPU is high
- VPA increases CPU allocation because usage is high
- This creates confusion and unpredictable behavior
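To see why this feedback loop misbehaves, here's a deliberately simplified toy model (all numbers are invented for illustration) in which both controllers react to the same CPU signal on every step: HPA scales the replica count on utilization while a VPA-like step resizes the per-pod request toward observed use. The replica count oscillates instead of settling:

```python
import math

def simulate(steps: int = 4) -> list:
    """Toy model: total CPU demand is fixed at 600m. Each step, an
    HPA-like rule scales replicas on utilization (70% target) while a
    VPA-like rule resizes the per-pod CPU request with 20% headroom.
    Returns the replica count after each step."""
    replicas, request = 3, 100      # start: 3 pods, 100m CPU request each
    history = []
    for _ in range(steps):
        per_pod = 600 / replicas                 # observed CPU per pod (m)
        utilization = 100 * per_pod / request    # percent of request
        replicas = max(1, math.ceil(replicas * utilization / 70))  # HPA step
        request = max(1, round(per_pod * 1.2))                     # VPA step
        history.append(replicas)
    return history

print(simulate())  # replica count jumps up and down instead of converging
```

Real clusters add stabilization windows and pod-restart delays that blur this picture, but the underlying problem is the same: two controllers consuming one signal and undoing each other's corrections.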
Safe combination: Use HPA for CPU scaling and VPA only for memory optimization:
# HPA handles CPU scaling
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
---
# VPA handles only memory
resourcePolicy:
  containerPolicies:
  - containerName: web-app
    controlledResources: ["memory"]   # Only memory, not CPU
Choosing the Right Autoscaling Strategy
Here's my decision framework after years of production experience:
Use HPA When:
- Your app is stateless (web servers, APIs)
- You have variable traffic patterns (daily/weekly spikes)
- Your app can handle multiple instances
- You need fast scaling response (seconds to minutes)
Use VPA When:
- Your app has unpredictable resource needs
- You're running batch jobs or data processing
- You have stateful applications that can't scale horizontally
- You want to optimize resource costs over time
Use Neither When:
- Your app has steady, predictable load
- Resource requirements are well-known and stable
- You prefer manual control over scaling decisions
Getting Started: Your First Autoscaling Setup
Step 1: Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Step 2: Create a Simple HPA
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10
Step 3: Generate Some Load
kubectl run load-test --image=busybox --rm -it --restart=Never -- /bin/sh
# Inside the pod, run:
while true; do wget -q -O- http://web-app-service; done
Step 4: Watch It Scale
kubectl get hpa --watch
kubectl get pods --watch
Conclusion
Kubernetes autoscaling isn't just a nice-to-have feature - it's essential for running resilient, cost-effective applications at scale. HPA helps you handle traffic spikes automatically, while VPA ensures you're not wasting resources.
Start simple:
- Implement HPA for your stateless web applications
- Set conservative scaling thresholds initially
- Monitor and adjust based on real usage patterns
- Consider VPA for resource optimization once you're comfortable
Remember, autoscaling is as much about saving money as it is about handling load. Done right, it keeps your applications responsive while optimizing costs - letting you sleep better at night knowing your systems can handle whatever comes their way.