We had an API that was consistently fast.
~80ms average response time. No complaints.
Then one day, everything slowed down.
Not crashed. Not broken. Just… slow.
And the weird part?
It was caused by one user.
The Symptom: Latency Gradually Creeping Up
At first:
- p95 latency increased slightly
- CPU was fine
- Memory was stable
- No obvious errors
Then:
- Requests started taking 500ms+
- Some hit 2–3 seconds
- But only during certain times
It wasn’t global traffic.
It was something more subtle.
The Clue: One Endpoint, One Pattern
After digging into logs, we noticed:
- Almost all slow requests hit the same endpoint
- Same query pattern
- Same user ID appearing frequently
That user had… a lot of data.
Way more than anyone else.
The Root Cause: “Works Fine” Query That Didn’t Scale
Here’s the query (simplified):
SQL:
SELECT *
FROM orders
WHERE user_id = $1
ORDER BY created_at DESC;
Looks harmless, right?
Except:
- No pagination
- No limit
- That user had 120,000+ rows
Every request:
- Pulled all rows
- Sorted them
- Serialized them into JSON
- Sent them over the network
For one user.
Now imagine multiple requests hitting that at once.
Why It Slowed Down Everyone
Node.js is non-blocking… but not magic.
What actually happened:
- Query time grew with the huge result set
- Serializing the large JSON response blocked the event loop
- The oversized payload inflated network transfer time
- Other requests queued up behind it
One “heavy” request created backpressure for everything else.
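The event-loop blocking is easy to reproduce in isolation. This is a minimal standalone sketch (the row shape is hypothetical, sized to mimic the 120,000-row user): `JSON.stringify` is synchronous, so while it runs, Node.js cannot serve any other request.

```javascript
// Simulate serializing one "heavy" user's full order history.
// JSON.stringify is synchronous: nothing else runs until it returns.
const rows = Array.from({ length: 120000 }, (_, i) => ({
  id: i,
  total: 19.99,
  status: "shipped",
  created_at: "2024-01-01T00:00:00Z",
}));

const start = process.hrtime.bigint();
const body = JSON.stringify(rows); // blocks the event loop for the duration
const ms = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`serialized ${body.length} bytes in ${ms.toFixed(1)}ms`);
```

Run it and you'll see a multi-megabyte payload built in a single synchronous burst. Every concurrent request pays that tax.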
The Fix (That Took 10 Minutes)
1️⃣ Add Pagination (Always)
SQL:
SELECT *
FROM orders
WHERE user_id = $1
ORDER BY created_at DESC
LIMIT 50 OFFSET $2;
Or even better: cursor-based pagination.
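Here's a rough sketch of what cursor-based (keyset) pagination could look like for this query. The helper and cursor shape are hypothetical; the column names mirror the query above, with `id` as a tiebreaker for rows sharing a `created_at`.

```javascript
// Build a parameterized keyset-pagination query for a user's orders.
// `cursor` is the { createdAt, id } of the last row the client received.
function buildOrdersPage(userId, cursor, limit = 50) {
  if (cursor) {
    return {
      text: `SELECT id, total, status, created_at
             FROM orders
             WHERE user_id = $1
               AND (created_at, id) < ($2, $3)
             ORDER BY created_at DESC, id DESC
             LIMIT $4`,
      values: [userId, cursor.createdAt, cursor.id, limit],
    };
  }
  // First page: no cursor yet.
  return {
    text: `SELECT id, total, status, created_at
           FROM orders
           WHERE user_id = $1
           ORDER BY created_at DESC, id DESC
           LIMIT $2`,
    values: [userId, limit],
  };
}
```

Unlike OFFSET, the database never has to skip over rows it already returned, so page 1,000 costs the same as page 1.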
2️⃣ Add Proper Index
CREATE INDEX idx_orders_user_created
ON orders(user_id, created_at DESC);
This alone drastically reduced query time.
3️⃣ Reduce Payload Size
Instead of SELECT *:
SELECT id, total, status, created_at
FROM orders
WHERE user_id = $1
ORDER BY created_at DESC
LIMIT 50;
Less data → faster everything.
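To see why column trimming matters, compare serialized sizes. The row shapes below are hypothetical stand-ins for a wide `orders` row versus the four-column version:

```javascript
// A "SELECT *" row carries every column; the slim row only what the UI needs.
const fullRow = {
  id: 1, user_id: 42, total: 19.99, status: "shipped",
  created_at: "2024-01-01T00:00:00Z",
  shipping_address: "221B Baker Street, London",
  notes: "leave at the door",
  internal_flags: { audited: true, synced: true },
};
const slimRow = {
  id: 1, total: 19.99, status: "shipped",
  created_at: "2024-01-01T00:00:00Z",
};

// Size one page (50 rows) of each.
const fullBytes = JSON.stringify(Array(50).fill(fullRow)).length;
const slimBytes = JSON.stringify(Array(50).fill(slimRow)).length;
console.log({ fullBytes, slimBytes });
```

Fewer bytes to fetch, serialize, and ship. The savings compound at every layer.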
4️⃣ Optional: Protect the API
We added a soft guard:
if (limit > 100) {
  throw new Error("Limit too high");
}
No more accidental “give me everything.”
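In practice the guard also has to survive missing or garbage input. A hypothetical helper (names are mine, not from our codebase) could look like:

```javascript
// Parse a client-supplied ?limit= value: default when absent or invalid,
// reject values above the cap so no request can ask for "everything".
function parseLimit(raw, { def = 50, max = 100 } = {}) {
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n) || n < 1) return def; // missing/garbage → sane default
  if (n > max) throw new Error("Limit too high"); // the soft guard above
  return n;
}
```

Returning a default for bad input while throwing on oversized requests keeps honest clients working and noisy ones visible in your error logs.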
The Lesson That Stuck With Me
The API wasn’t fast.
It was fast for the average case.
The moment one user had “edge-case data,” the system showed its real behavior.
The Dangerous Assumption
“It works fine with my test data.”
Test data is small. Clean. Predictable.
Production data is:
- messy
- uneven
- sometimes extreme
Systems fail at the edges — not the average.
Conclusion / Key takeaway
Your backend performance isn’t defined by your average user. It’s defined by your heaviest one.
If one user can slow everyone else down, it’s not a user problem — it’s a system design problem.
What’s the most unexpected “edge case” in your data that caused real performance issues?