🌟Conquering the Traffic Tsunami with Kubernetes Superpowers 🌟
Picture this: Your app just went viral. Elon tweeted about it, and now your user base is exploding – 50,000 requests per second flood your servers. Your database starts to lag, APIs stumble, and the frontend feels like it's on the verge of collapse. Stress levels rise: “Will the system hold up? Will users start dropping off? Am I about to be flooded with app-crash alerts?”
Welcome to the unpredictable world of hyperscaling.
Now, what if you could stay ahead of the chaos? What if your app could auto-magically scale to handle the pressure? That's where Kubernetes' scaling capabilities step in.
Why Kubernetes Scaling?
Kubernetes can act as a foundation for apps that handle anything you throw at them. It is your force multiplier: your apps can seamlessly expand to absorb surging traffic, and your cluster can dynamically adjust resources, spinning up new servers exactly when needed.
But scaling Kubernetes apps isn't just about throwing resources around - it's about strategy. It's about using the right tools at the right time, avoiding missteps, and achieving optimal performance without breaking the bank 💰💰.
Who is This Guide For?
Are you…
- A developer tired of manual scaling guesswork? 🎲
- A DevOps engineer tasked with making sure the next big launch goes off without a hitch? 🚀
- A team lead who wants to sleep soundly, knowing your apps can survive Black Friday or Big Billion Days - without midnight wake-up calls?
This guide is a practical playbook for mastering Kubernetes scaling, no mastery of distributed systems required - don't worry, we'll start from the basics and then dive deep into pro-tier patterns.
What You’ll Learn 📈
By the end of this guide, you’ll have the tools to scale Kubernetes with confidence:
✅Horizontal Pod Autoscaling (HPA): Automatically spawn pods when traffic surges – no all-nighters needed.
✅Vertical Pod Autoscaling (VPA): Turn your pods into pods on steroids 💪 (more CPU & memory, safely).
✅Cluster Autoscaling: Scale your cluster effortlessly by adding or removing nodes as needed.
✅Scaling Stateful Workloads: Learn how to scale databases and other stateful systems without compromising data integrity.
✅Pro Tips & Pitfalls: Avoid the deadly sins of scaling (yes, over-scaling is a sin 😇).
Why This Guide?
Most tutorials treat scaling like a checkbox: Set HPA, Done!💤
Not here. We'll unpack:
- How to predict scaling needs (some math needed, intuition encouraged).
- When to choose horizontal vs vertical scaling (spoiler: it’s not either/or).
- Why monitoring is your Spidey sense for avoiding disasters 🕷️🚨
In this guide, you'll find actionable wisdom only - no robotic jargon.
Kubernetes Scaling Basics: Core Tools & Terminologies
Let's start with the fundamentals. Kubernetes can feel like an endless list of jargon, but once you grasp a few core concepts, scaling becomes far less intimidating. Think of this section as your map to the Kubernetes universe.
Key Terminology Explained
- Pods: Your App's Smallest Unit
  - What it is: A pod is the smallest deployable unit in Kubernetes. It's like a single shipping container that holds one or more application processes (containers).
  - Why it matters: Scaling starts here. When traffic spikes, you don't scale individual containers; you scale pods.
  - Example: A web app running in a pod. Need more capacity? Add a pod.
- Nodes: The Machines Behind the Curtain
  - What it is: A node is a physical or virtual machine that runs pods. It's the “worker bee” of your cluster.
  - Why it matters: Nodes provide the CPU, memory, and storage your pods need. If pods can't fit on existing nodes, you need more nodes (or bigger ones).
- Deployments: The Puppeteers of Pods
  - What it is: A Deployment manages pods. It ensures the desired number of pods is running, replaces failed pods, and rolls out updates.
  - Why it matters: Deployments are your go-to tool for scaling stateless applications (e.g., frontends and backends).
- ReplicaSets
  - What it is: A ReplicaSet maintains a group of identical pods for a specific workload.
  - Why it matters: ReplicaSets ensure the desired number of pods is always running, even if some pods fail. They are critical for availability and redundancy.
Autoscaling Tools
- Horizontal Pod Autoscaler (HPA)
  - What it does: Automatically increases or decreases the number of pods in a deployment based on metrics like CPU, memory, or custom values (e.g., requests per second).
  - Why it matters: Handles traffic spikes by distributing load across pods.
  - How it works:
    - HPA checks metrics every 15 seconds by default.
    - If the observed metric exceeds your target (e.g., 70% CPU utilization), it adds pods.
    - If usage drops, it removes pods.
  - Example: If CPU usage exceeds a 70% target, HPA adds pods until utilization stabilizes.
  - Caveats: Only works well for stateless apps whose replicas can run in parallel.
- Vertical Pod Autoscaler (VPA)
  - What it does: Adjusts CPU/memory resource requests for pods, e.g., increasing a pod's CPU request from 0.5 core to 1 core.
  - Why it matters: Optimizes resource usage for apps that cannot scale horizontally (e.g., legacy monoliths). The Kubernetes scheduler never re-evaluates a pod's resource needs after the pod is scheduled with a given set of requests.
  - How it works:
    - The VPA controller observes the resource usage of an application.
    - Using this data as a baseline, VPA recommends lower-bound, upper-bound, and target values for those pods' resource requests.
    - Depending on how you configure VPA, it can either:
      - Apply the recommendations directly.
      - Store the recommended values for reference.
      - Apply the recommended values to newly created pods only.
  - Caveats: Applying recommendations requires restarting pods, which can cause brief downtime. Stateful scaling is not yet a supported VPA use case.
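To make this concrete, here's a minimal VPA manifest sketch. It assumes the VPA operator is installed in your cluster (it doesn't ship with Kubernetes), and the shop-api names are illustrative. The updateMode field maps to the three modes above:
# vpa.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: shop-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shop-api
  updatePolicy:
    updateMode: "Auto"   # "Off" = store recommendations only, "Initial" = apply to new pods only
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed: { cpu: "100m", memory: "128Mi" }   # floor for recommendations
        maxAllowed: { cpu: "2", memory: "2Gi" }        # ceiling for recommendations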
- Cluster Autoscaler
- What it does: Adds or removes nodes (worker machines) in the cluster when pods can’t be scheduled due to resource shortages.
- Why it matters: Ensures your cluster has enough capacity to run pods.
- Example: If HPA creates 10 new pods but there’s no node space, Cluster Autoscaler adds a new node.
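Cluster Autoscaler setup is cloud-specific, but as a rough sketch, here is the kind of flags its Deployment carries on AWS (the node-group name shop-workers-asg is hypothetical; check your provider's docs for the exact wiring):
# cluster-autoscaler-args.yaml (fragment of the autoscaler's own Deployment)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=2:10:shop-workers-asg    # min:max:node-group-name
      - --scale-down-unneeded-time=10m   # wait before removing idle nodes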
Resource Management
- Resource Requests and Limits
- Requests: The minimum CPU/memory a pod needs to run. Kubernetes uses this to decide where to schedule the pod.
- Limits: The maximum CPU/memory a pod can use. Prevents a single pod from hogging resources.
Reliability Features
- Pod Disruption Budget (PDB)
  - What it does: Ensures a minimum number of pods remain available during voluntary disruptions (e.g., node maintenance or upgrades).
  - Why it matters: Prevents downtime during cluster operations.
  - Example: A PDB can block node drains until replacement pods are ready.
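A minimal PDB sketch (the name and labels mirror this guide's shop-api examples and are illustrative):
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: shop-api-pdb
spec:
  minAvailable: 2          # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: shop-api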
- Liveness and Readiness Probes
  - Liveness probe: Checks if a pod is healthy. If it fails, Kubernetes restarts the pod.
  - Readiness probe: Checks if the pod is ready to serve traffic. If it fails, the pod is removed from the load balancer.
  - Why it matters: Ensures traffic only goes to healthy pods and faulty ones are replaced automatically.
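Here's a sketch of both probes on a container. The /healthz and /ready paths and port 8080 are assumptions; point them at whatever health endpoints your app actually exposes:
# probes.yaml (container fragment)
containers:
  - name: shop
    image: shop-api:v3
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
      initialDelaySeconds: 10   # give the app time to boot before the first check
      periodSeconds: 15
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5          # checked frequently so traffic shifts quickly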
Specialized Workloads
- StatefulSets
  - What it does: Manages stateful applications (e.g., databases, message queues) with stable network identities and persistent storage.
  - Why it matters: Unlike Deployments, StatefulSets scale pods in a predictable order (e.g., MySQL replicas).
  - Example: Scaling a Redis cluster from 3 to 5 pods while preserving data order.
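A bare-bones StatefulSet sketch for the Redis example (the names, image, and storage size are illustrative, the redis-headless Service is assumed to exist, and a real Redis cluster needs more configuration than this):
# statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-headless    # gives pods stable DNS names: redis-0, redis-1, ...
  replicas: 3
  selector:
    matchLabels: { app: redis }
  template:
    metadata:
      labels: { app: redis }
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
  volumeClaimTemplates:          # each replica gets its own PersistentVolumeClaim
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 1Gi } }
Scaling to 5 replicas (kubectl scale statefulset redis --replicas=5) creates redis-3 and redis-4 in order, each with its own volume.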
- CronJobs
  - What it does: Runs batch jobs on a schedule (e.g., nightly reports, data cleanup).
  - Why it matters: Ensures resource-heavy jobs don't collide with peak traffic.
  - Key settings (tied together in the sketch below):
    - parallelism: the number of pods running concurrently.
    - completions: the total number of successful pods needed to finish the Job.
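Here's a sketch of a CronJob using those settings (the schedule, image name, and counts are illustrative):
# cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"          # 02:00 daily, safely off-peak
  jobTemplate:
    spec:
      parallelism: 2             # run 2 pods at a time
      completions: 6             # the Job finishes after 6 successful pods
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: report-runner:v1   # hypothetical batch-processing image
              resources:
                requests: { cpu: "0.5", memory: "256Mi" }
                limits: { cpu: "1", memory: "512Mi" }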
- Service Meshes (e.g., Istio)
  - What it does: Manages traffic between services with features like load balancing, retries, and canary deployments.
  - Why it matters: Enables safe scaling in complex environments (e.g., gradually rolling out new versions to 10% of users).
  - Use cases:
    - Canary rollouts: test new versions with a subset of traffic (see the sketch below).
    - A/B testing: route users to different app versions.
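Here's a hedged sketch of the canary pattern using Istio's VirtualService. It assumes a DestinationRule already defines the v1/v2 subsets; the host and subset names are illustrative:
# canary.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: shop-api
spec:
  hosts:
    - shop-api
  http:
    - route:
        - destination:
            host: shop-api
            subset: v1     # stable version
          weight: 90       # 90% of traffic
        - destination:
            host: shop-api
            subset: v2     # canary version
          weight: 10       # 10% of traffic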
Why Autoscaling Isn’t Magic (But Close Enough)
Autoscaling works only if:
- You set accurate resource requests/limits: Guess wrong, and pods might starve or hog resources.
- Your app is designed to scale horizontally: Not all apps can handle multiple instances (e.g., stateful apps).
- You monitor metrics: Autoscaling without data is like driving blindfolded. 🚗👀
Key Takeaways
- HPA and Cluster Autoscaler handle most stateless scaling needs.
- VPA is niche – use it cautiously for stateful apps or legacy systems.
- PDBs and Probes keep your app resilient during scaling and updates.
- StatefulSets & CronJobs require special scaling strategies (orderly scaling, resource limits).
(Pro Tip: Experiment with these tools in a sandbox cluster. Breaking things is the fastest way to learn!)
Scaling Stateless Apps
Now that we understand the core Kubernetes terminology and scaling tools, let's cut through the theory and dive into real-world tactics, complete with YAML snippets and load-testing strategies.
Level 1 of Scaling your App
Imagine you're running an e-commerce app whose API gets slammed on weekends. Here's how you can use HPA:
- Define Resource Boundaries. Every pod needs resource requests (the minimum it needs to run) and limits (the maximum it can consume). Setting these boundaries matters: too little and your app underperforms; too much and it hogs resources.
# deployment.yaml
containers:
  - name: shop
    resources:            # note: the key is "resources", plural
      requests:
        cpu: "0.5"        # Half a CPU core
        memory: "512Mi"   # 512 MB RAM
      limits:
        cpu: "1"          # Don't let it use more than 1 core
        memory: "1Gi"     # Max 1 GB RAM
- Configure Horizontal Pod Autoscaling (HPA). HPA adds pods when demand spikes and removes them when things calm down.
# HPA.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shop-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shop-api
  minReplicas: 3    # Always keep 3 pods warm for sudden traffic
  maxReplicas: 30   # Don't go bankrupt on cloud costs! 💸💸
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Scale up if CPU > 60%
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # React immediately when CPU > 60%
      policies:
        - type: Percent
          value: 100        # Can double the pods in a single step
          periodSeconds: 15 # Evaluate every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 50         # Reduce up to 50% of pods at a time
          periodSeconds: 60 # Evaluate every 60 seconds
      selectPolicy: Min     # Min is the slowest possible scale-down (safest approach)
What happens behind the scenes:
- When average CPU exceeds 60%, HPA adds pods.
- When it drops well below target (e.g., to 30%), HPA removes pods slowly, to avoid thrashing.
- Scale-down only happens after a cooldown period (5 minutes by default).
But WAIT! Scaling on CPU alone is not enough -- it doesn't tell the whole story.
Level 100 of Scaling your App with Precision (Custom Metrics)
Enterprises don't guess - they measure. Let's use requests per second (RPS) and latency to autoscale like a pro.
Step 1: Set Up Prometheus for Custom Metrics
# Install Prometheus & the Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus
helm install prometheus-adapter prometheus-community/prometheus-adapter
NOTE: Ensure your apps actually expose these metrics; the Prometheus Adapter then picks them up and serves them through the Kubernetes custom metrics API.
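As a hedged sketch, an adapter rule like the following turns an app-exported counter (assumed here to be http_requests_total) into the http_requests_per_second metric used below; it goes in the adapter's Helm values:
# prometheus-adapter values (rules section)
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: "namespace" }
          pod: { resource: "pod" }
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"    # http_requests_total -> http_requests_per_second
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'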
Step 2: Define an HPA Based on RPS
# hpa-custom.yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # Track requests/sec
      target:
        type: AverageValue
        averageValue: "1000"             # Scale up if pods average >1000 RPS
Why is this better?
- If 1 pod handles 1,000 RPS, then 10 pods handle roughly 10,000 RPS.
- No more mystery scaling based on CPU alone.
Step 3: Add Latency Metrics
metrics:
  - type: Object
    object:
      metric:
        name: http_request_duration_seconds
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: shop-api-ingress
      target:
        type: Value
        value: "0.5"   # Scale up if latency exceeds 500ms
Advanced Patterns for Scaling your App
1️⃣ Multi-Metric Autoscaling ⚖️
Why? Relying on a single metric like CPU utilization can lead to inefficient scaling. For example:
- High traffic might not always mean high CPU usage.
- Memory-hungry applications could be under-provisioned.
- API latency could degrade before CPU spikes.
So we can combine CPU, memory, RPS, and latency to cover all bases:
metrics:
  - type: Resource
    resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
  - type: Pods
    pods: { metric: { name: http_requests_per_second }, target: { type: AverageValue, averageValue: "200" } }
  - type: Object
    object:
      metric: { name: http_request_duration_seconds }
      describedObject: { apiVersion: networking.k8s.io/v1, kind: Ingress, name: shop-api-ingress }
      target: { type: Value, value: "0.5" }   # 500ms latency
Caveats: Delayed scaling if metrics update too slowly (e.g. request-based metrics might lag behind real traffic surges)
2️⃣ Spread Pods with Anti-Affinity
Why? By default, Kubernetes may place all your pods on the same node, which creates a single point of failure and resource contention.
How it works:
Pod anti-affinity ensures your pods are distributed across multiple nodes to avoid congestion and improve reliability.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values: [shop-api]
        topologyKey: kubernetes.io/hostname   # Spread across nodes
Caveats
- Scheduler may fail to place pods if there aren't enough nodes.
- Can increase cross-node communication latency if pods rely heavily on each other.
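If the required rule is too strict for your node count, a softer variant is preferred anti-affinity, sketched below: the scheduler tries to spread pods but can still co-locate them under pressure.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                    # strongest preference for spreading
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: [shop-api]
          topologyKey: kubernetes.io/hostname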
The Art of Calculating Pods
We've established that intuition-based guesswork isn't the best fit; enterprise teams take a data-driven approach. Use the strategies below to determine the right number of pods without surprises when traffic surges.
1️⃣ Load Test Relentlessly 🔥
Before setting up auto-scaling, you must understand how much load your application can handle. This ensures your scaling strategy is data-driven, rather than based on guesswork.
- Simulate Real-World Traffic
  Use load-testing tools like:
  - k6 – ideal for API load testing (a runnable Job sketch follows the example below)
  - Locust – Python-based and scalable
  - Gatling – great for simulating high traffic
- What to Observe
- Latency: Measure response times under different loads
- Error rates: Track failed requests due to overload
- CPU & Memory usage: Identify resource bottlenecks
- Find Your App's Breaking Point
  You need to determine when your app starts struggling under load.
- Gradually increase RPS during load testing.
- Monitor latency metrics (e.g., P95 response time).
- Identify the "breaking point"—the RPS where latency exceeds your SLA.
🚨 Example SLA-based Breaking Point:
- SLA: 95th percentile latency must be <50ms
Observations:
- ✅ At 200 RPS → P95 latency = 30ms (Good)
- ✅ At 400 RPS → P95 latency = 45ms (Still OK)
- ❌ At 600 RPS → P95 latency = 120ms (Too slow!)
- 🔥 Breaking Point = 500 RPS (just before latency spikes past 50ms)
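If you prefer to run load tests from inside the cluster, here's a hedged sketch of a Kubernetes Job wrapping k6. The grafana/k6 image is real, but the ConfigMap name and script path are illustrative; the ramp-up logic lives in the load-test.js script itself:
# load-test-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: shop-api-load-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: k6
          image: grafana/k6:latest
          args: ["run", "/scripts/load-test.js"]   # gradually increase RPS inside the script
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: k6-scripts      # assumed ConfigMap containing load-test.js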
2️⃣ Calculate your Safety Net 🛡️
Once you know how much traffic a single pod can handle, you can calculate the right number of replicas for peak demand.
📌 Example Calculation:
- Max RPS per pod: Suppose 1 pod handles 250 RPS before latency becomes unacceptable.
- Expected peak traffic: Say your app needs to handle 10,000 RPS
- Pods Required (10,000 RPS / 250 RPS per Pod) = 40 Pods
- Add a buffer for unpredictable traffic:
  - A 30% safety margin -> 40 × 1.3 = 52 pods.
  - Set maxReplicas: 52 in your HPA configuration.
✅ Pro Tip: Use Cluster over-provisioning with buffer nodes to ensure Kubernetes can scale quickly during sudden demand spikes.
3️⃣ Monitor, Analyze and Tweak 📊
Your scaling strategy should evolve with real-world data. Use Grafana dashboards to continuously monitor:
- RPS per pod 📈
- CPU & memory utilization 📊
- 95th percentile latency ⏳
- Error rates under load ❌
Should You Run Multiple Containers in a Pod?
Pods can run multiple containers (e.g., app + logging sidecar), but keep it simple.
| Scenario | Containers per Pod | Example |
| --- | --- | --- |
| Primary App + Sidecar | 2 | App + Proxy |
| Legacy Monolith | 1 | No sidecar; reduces complexity |
| Data Processing | 1 | One task per pod (easy to scale) |
Example
# Sidecar for logging (Fluentd)
containers:
  - name: shop
    image: shop-api:v3
    resources: { requests: { cpu: "0.5" }, limits: { cpu: "1" } }
  - name: fluentd
    image: fluent/fluentd
    resources: { requests: { cpu: "0.1" }, limits: { cpu: "0.25" } }
Golden Rule: More containers = more complexity. Only use sidecars for critical needs (logging, security).
Key Takeaways
🔃 Adjust HPA thresholds regularly based on actual traffic patterns.
🛑 Avoid over-scaling - use a scale-up stabilization window (behavior.scaleUp.stabilizationWindowSeconds; the legacy --horizontal-pod-autoscaler-upscale-delay flag is deprecated) to prevent knee-jerk reactions that add too many pods too quickly.
⚡ Fine-tune autoscaling policies for better responsiveness during peak hours. Keep minReplicas at roughly 10% of your peak replica count to avoid latency spikes when scaling up.
💰 Use Kubecost to detect idle pods draining your budget.
What's Next?
Ready for the ultimate challenge? In the next part of this article, we’ll crack the code on scaling stateful apps—databases, caches, and queues. Spoiler: It’s like defusing a bomb 💣… but with more YAML.