The incident
Last quarter we hit a production incident that looked “healthy” at first — until it wasn’t.
Traffic spiked from 100 to 1000 req/sec.
Kubernetes HPA did exactly what it was designed to do.
Our database did not.
What actually happened
- EKS HPA scaled API pods from 3 → 15
- Each pod had DB_POOL_SIZE=50
- PostgreSQL max_connections=200
The math no one notices under pressure:
15 pods × 50 connections = 750 connections required
Database capacity = 200
Result:
“FATAL: too many clients already”
CrashLoopBackOff across new pods.
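The arithmetic is trivial in hindsight. A minimal sketch with the incident's numbers (the helper name is mine, not from any tool):

```python
# Back-of-the-envelope check: can the fleet exhaust max_connections?
def required_connections(replicas: int, pool_size_per_pod: int) -> int:
    """Worst-case connection count if every pod fills its pool."""
    return replicas * pool_size_per_pod

max_connections = 200   # PostgreSQL server limit
pool_size = 50          # DB_POOL_SIZE per pod

before = required_connections(3, pool_size)    # 150 -- fits
after = required_connections(15, pool_size)    # 750 -- 3.75x over the limit

print(f"before scale-out: {before}/{max_connections}")
print(f"after  scale-out: {after}/{max_connections}")
```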
Why this failure is so common
HPA scales compute.
It does not understand downstream limits.
Databases don’t autoscale like pods.
Connection pools multiply silently.
By the time alerts fire, you’re already down.
The debugging flow that actually matters
When this happens, logs alone are not enough.
You need to:
- Confirm exit reasons
- Validate connection pressure
- Calculate real concurrency
- Avoid dangerous “just increase max_connections” fixes
This is where most time is lost during incidents.
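As a rough illustration of the "validate connection pressure" and "calculate real concurrency" steps, here is a minimal Python sketch. It assumes psycopg2 is installed and a DSN is available in `DATABASE_URL`; the replica and pool values would come from your deployment, and exit reasons still come from `kubectl describe pod` / `kubectl logs`, which this doesn't cover:

```python
import os
import psycopg2  # assumption: psycopg2-binary is available

# Compare actual connection usage against the server limit
# (the same data you'd get in psql via:
#   SHOW max_connections;
#   SELECT count(*) FROM pg_stat_activity;)
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    max_conn = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
conn.close()

replicas = int(os.environ.get("API_REPLICAS", "15"))   # from kubectl get deploy
pool_size = int(os.environ.get("DB_POOL_SIZE", "50"))
worst_case = replicas * pool_size

print(f"in use now : {in_use}/{max_conn}")
print(f"worst case : {worst_case}/{max_conn}")
if worst_case > max_conn:
    print("UNSAFE: fleet can exceed max_connections; cap replicas or add a pooler")
```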
How I now approach this class of incident
After seeing this failure pattern repeatedly, I stopped relying on memory and ad-hoc runbooks.
I ended up building CodeWeave, a DevOps copilot that forces a structured incident-response flow for production infrastructure.
For this class of incident, it explicitly:
- Calculates real connection math (pods × pool size)
- Flags unsafe HPA scaling relative to database limits
- Evaluates PgBouncer or AWS RDS Proxy where appropriate
- Applies HPA caps aligned to downstream capacity
- Focuses on zero-downtime remediation paths
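I'm not claiming this is how the tool implements it, but the cap calculation from the list above is simple enough to sketch. The reserved-connection headroom is an assumption, not a value from the incident:

```python
def safe_max_replicas(max_connections: int, pool_size_per_pod: int,
                      reserved: int = 20) -> int:
    """Largest replica count whose combined pools still fit the database.

    `reserved` holds back slots for migrations, superuser access,
    cron jobs, etc. -- the right number is workload-specific.
    """
    return max((max_connections - reserved) // pool_size_per_pod, 1)

# With the incident's numbers: (200 - 20) // 50 = 3 replicas.
# That is the real HPA ceiling without a pooler; PgBouncer or RDS Proxy
# in transaction mode is what buys room to scale pods past it.
print(safe_max_replicas(200, 50))  # -> 3
```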
The important part isn’t automation — it’s making unsafe decisions harder under pressure.
The goal isn’t speed.
It’s reducing risk when systems are already unstable.
Key lessons
- Autoscaling without dependency limits is unsafe
- Databases are the first choke point
- Incident response should be structured, not tribal
- Tools should reduce blast radius, not just generate YAML
I’m curious how others handle this failure mode.
Do you cap HPA replicas based on database limits — or rely entirely on pooling layers?
If you’re curious, I shared a short demo of this exact flow using **[CodeWeave](https://www.linkedin.com/posts/activity-7415388599883292672-Q05O?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAqyZpkBTUNyc9y0g8Qnow5IZiIzJ9MbUGc)**.
I’d mainly love feedback from other DevOps and SREs who’ve dealt with similar scaling failures.