Stop scaling by CPU. Start scaling by what actually matters: your users.
Let's be honest: scaling because your pod is at 80% CPU is like refueling your car after the gas light has been flashing for 20 miles. It works, but it's reactive, clumsy, and your users already felt the lag.
But what if your Kubernetes cluster could scale before the traffic spike hits? What if your backend could read the room (or rather, the frontend) and prepare itself?
Welcome to traffic-based HPA. Let's build it.
🤔 Why frontend traffic?
Your users don't care about CPU throttling or memory limits. They care about:
- Page load speed
- API response times
- Smooth checkout flows
CPU-based scaling reacts to resource pressure. Traffic-based scaling reacts to user demand.

| Metric | When it scales | Outcome |
| --- | --- | --- |
| CPU | After load increases | Users already suffer |
| Memory | After allocation spikes | Too late for batch jobs |
| RPS (requests/sec) | As traffic rises | Proactive, user-first |
🧠 The mental model
Think of your frontend (or API gateway) as the canary. It sees every incoming request before your backend pods do.
By exporting request rate from your frontend layer (whether that's an ingress controller, a Node.js middleware, or a service mesh sidecar), you can feed that signal into the Kubernetes HPA.
The result?
Pods scale up as soon as the traffic graph tilts upward, not seconds later when CPU catches up.
🛠️ How to actually do this (step-by-step)
- Expose frontend traffic metrics
The easiest path? Use your ingress controller. Most popular ones expose Prometheus metrics out of the box:
yaml
# Prometheus scrape config for Nginx Ingress
scrape_configs:
  - job_name: 'nginx-ingress'
    static_configs:
      - targets: ['nginx-ingress-controller.monitoring:10254']
Look for metrics like:
- nginx_ingress_controller_requests
- nginx_ingress_controller_request_duration_seconds_count
Pro tip: Filter by host or path to get per-service traffic.
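Before wiring anything into the HPA, it can help to check the per-host rate directly in Prometheus. The sketch below is a minimal example, not a definitive setup: the Prometheus address `http://prometheus.monitoring:9090` and both helper names are assumptions for illustration, and the host is the example one used in this article.

```bash
#!/usr/bin/env bash
# Build a PromQL expression for the per-host request rate over a 2-minute window.
# (rps_query is a hypothetical helper name.)
rps_query() {
  local host="$1"
  printf 'sum(rate(nginx_ingress_controller_requests{host="%s"}[2m]))' "$host"
}

# Send the expression to the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your own Prometheus service.
query_rps() {
  local host="$1"
  curl -sG "${PROM_URL:-http://prometheus.monitoring:9090}/api/v1/query" \
    --data-urlencode "query=$(rps_query "$host")"
}

# Print the expression so it can be pasted into the Prometheus UI.
rps_query "myapp.example.com"; echo
```

Calling `query_rps "myapp.example.com"` from a machine that can reach Prometheus returns the current value as JSON; if the result is empty, the host label or the scrape target is probably off.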
- Set up Prometheus Adapter
The Kubernetes HPA can't talk to Prometheus directly. Enter the Prometheus Adapter:
bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter
Configure it to expose a custom metric called frontend_rps:
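yaml
# adapter config
rules:
  - seriesQuery: 'nginx_ingress_controller_requests{host="myapp.example.com"}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        service: {resource: "service"}
    # Rename the series so the HPA sees it as frontend_rps
    name:
      matches: "^nginx_ingress_controller_requests$"
      as: "frontend_rps"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'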
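Once the adapter is running, you can confirm the metric is visible to the API server before writing the HPA. This is a rough sketch under a few assumptions: the adapter serves `custom.metrics.k8s.io/v1beta1`, the metric is attached to a Service as in the rule above, and the namespace `default` and service name `my-backend-api` are placeholders. The helper names are made up for illustration.

```bash
#!/usr/bin/env bash
# Build the custom metrics API path for a metric attached to a Service.
# (metric_path is a hypothetical helper name.)
metric_path() {
  local namespace="$1" service="$2" metric="$3"
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/services/%s/%s' \
    "$namespace" "$service" "$metric"
}

# Fetch the metric through the API server (needs kubectl and cluster access).
check_metric() {
  kubectl get --raw "$(metric_path "$@")"
}

# Print the path for the placeholder namespace and service.
metric_path default my-backend-api frontend_rps; echo
```

Running `check_metric default my-backend-api frontend_rps` against a live cluster should return a JSON list with the current value; a 404 usually means the adapter rule did not match any series.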
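- Define your HPA
Now the magic: an HPA that scales on requests per second:
yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-traffic-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-backend-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: frontend_rps
        target:
          type: AverageValue
          averageValue: "500" # 500 requests/sec per pod
When traffic exceeds 500 RPS per pod, Kubernetes scales up. When it drops below that, it scales back down.
Keep in mind that a Pods metric has to be reported per pod. If your adapter rule attaches frontend_rps to a Service, as the overrides above do, use an Object metric with describedObject pointing at that Service instead; an AverageValue target still divides the total by the pod count.
Real-world example: Black Friday ready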
Let's say your e-commerce frontend normally serves 1,500 RPS with 3 pods (500 RPS each).
Suddenly, a flash sale starts. Frontend RPS jumps to 4,500.
- CPU-based HPA: Waits 60 to 120s for CPU to max out, so users see timeouts.
- Traffic-based HPA: Can scale to 9 pods within about 30s (Prometheus scrape + HPA sync), so most users never notice.
We've seen this cut P99 latency by 40% during ramp-up spikes in production.
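The jump from 3 to 9 pods follows from the formula in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Here is a small sketch of that arithmetic with integer math. The function name is made up for illustration, and it ignores the HPA's tolerance band, stabilization windows, and the min/max replica bounds.

```bash
#!/usr/bin/env bash
# desired = ceil(current_replicas * current_avg / target_avg), using integers only.
# (desired_replicas is a hypothetical helper name.)
desired_replicas() {
  local current_replicas="$1" current_avg="$2" target_avg="$3"
  echo $(( (current_replicas * current_avg + target_avg - 1) / target_avg ))
}

# Flash sale: 4,500 RPS over 3 pods is 1,500 RPS per pod, against a target of 500.
desired_replicas 3 1500 500   # prints 9
```

At the normal 1,500 RPS the same call with a per-pod average of 500 returns 3, so the HPA leaves the Deployment alone.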
⚠️ But watch out for...
Noisy neighbors
If your frontend sees bot traffic or web scrapers, you'll scale unnecessarily. Solution: filter metrics by HTTP status (e.g., exclude 4xx/5xx) or use a sliding window.
Cold starts
Traffic-based scaling works after the first request of a spike lands. For truly bursty workloads, combine with:
- Minimum replicas (always keep a baseline)
- Predictive scaling (e.g., KEDA with cron)
Single source of truth
If you have multiple ingresses or CDNs, aggregate metrics. Prometheus' sum() across all sources is your friend.
🔮 Beyond simple RPS
Once you've got traffic-based HPA working, you can get creative:

| Metric | What it detects |
| --- | --- |
| RPS per endpoint | /search spikes vs /status traffic |
| Active WebSocket connections | Real-time apps |
| Queue length (frontend → backend) | Request backlog |
| P99 latency of frontend | "Users are waiting too long" |
🧩 Putting it all together
Here's the architecture you just built:
text
User Request → Ingress Controller → Prometheus → Prometheus Adapter → Kubernetes HPA → Scale Backend Pods
                        ↓                ↓                ↓
                  (export RPS)  (scrape every 15s)  (expose custom metric)
No new tools. No black magic. Just metrics you already have, used intelligently.
TL;DR: Do this today
- Check if your ingress controller exposes request rate metrics.
- Deploy Prometheus + Prometheus Adapter.
- Write an HPA using a Pods metric with averageValue in RPS.
- Test with kubectl run load-generator while watching kubectl get hpa -w.
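For the test step, one option is the busybox loop from the Kubernetes HPA walkthrough, pointed at your ingress host so the requests show up in the ingress metrics. The sketch below only prints the command rather than running it. The URL is a placeholder and the helper name is made up for illustration.

```bash
#!/usr/bin/env bash
# Build (but do not run) a busybox load-generator command for a target URL.
# (load_generator_cmd is a hypothetical helper name.)
load_generator_cmd() {
  local url="$1"
  printf 'kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- %s; done"' "$url"
}

# Placeholder URL; replace it with the host your ingress serves.
load_generator_cmd "http://myapp.example.com"; echo
```

Run the printed command in one terminal and `kubectl get hpa -w` in another to watch the replica count follow the request rate.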
Your users will never know you scaled. And that's exactly the point.
Have you tried scaling on business metrics instead of infrastructure ones? Drop your war stories below. I'd love to hear how others are moving beyond CPU.
Follow for more Kubernetes scaling deep dives. Next up: "Scaling on RabbitMQ queue depth (the right way)."