We had a rollout where requests in one client environment started timing out under load.
Not locally. Not in staging. Only in their infra.
At first, everything looked normal. No crashes. No clear errors. Just slow requests piling up until the system started struggling.
The obvious move would have been to add more logs and redeploy.
We didn’t do that.
When your system runs continuously, every redeploy is a risk. You don’t push changes blindly. You need to understand the problem before touching anything.
So instead of changing code, we changed how we looked at the system.
Start With One Request
Instead of scanning logs randomly, we focused on a single request and followed it through the system.
That changed everything.
What we saw was simple:
- The API received the request instantly
- An internal service call was taking several seconds
- The downstream AI layer was fast
So the problem wasn’t external. It was inside our own system.
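The mechanics of following one request are simple. Here is a rough sketch in Python, with stand-in functions in place of the real services (the names and timings are illustrative, not from our codebase): attach one correlation ID to the request and log it with elapsed time at every hop.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("trace")

def call_internal_service(payload):
    time.sleep(0.3)              # stand-in for the internal hop that looked slow
    return payload

def call_ai_layer(data):
    time.sleep(0.05)             # stand-in for the downstream AI layer
    return data

def handle_request(payload):
    # One correlation ID, logged at every hop, so a single request
    # can be followed end to end instead of scanning logs at random.
    request_id = uuid.uuid4().hex
    start = time.perf_counter()
    log.info("request_id=%s step=api_received t=0.000s", request_id)

    result = call_internal_service(payload)
    log.info("request_id=%s step=internal_service_done t=%.3fs",
             request_id, time.perf_counter() - start)

    answer = call_ai_layer(result)
    log.info("request_id=%s step=ai_layer_done t=%.3fs",
             request_id, time.perf_counter() - start)
    return answer

handle_request({"query": "example"})
```

Grep the logs for that one ID and you get the full timeline of a single request. That is what made the slow internal hop stand out.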
Break Down the Time
Looking at request-level logs wasn’t enough.
We needed to know where time was actually being spent.
Once we broke things down step by step, the issue became clear.
A database operation that normally takes milliseconds was taking multiple seconds under load.
No errors. Just delay.
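Here is roughly what that breakdown looked like. This is a minimal sketch, not our production code: a small timing helper wrapped around each step, with `time.sleep` standing in for the real work and the database step inflated to mimic what we saw under load.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step, timings):
    # Record how long one step of the request takes.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start

def handle_request():
    timings = {}
    with timed("parse_input", timings):
        time.sleep(0.01)         # stand-in for request parsing
    with timed("db_query", timings):
        time.sleep(2.0)          # stand-in for the DB call that stalled under load
    with timed("ai_call", timings):
        time.sleep(0.05)         # stand-in for the downstream AI layer
    # One line per request showing where the time went.
    print(" ".join(f"{step}={seconds:.3f}s" for step, seconds in timings.items()))

handle_request()
```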
The Real Issue
This wasn’t a slow query problem.
It was a resource problem.
The database connection pool was getting exhausted.
In this client environment, the limits were lower than we had assumed. Under load, requests weren’t failing. They were waiting.
That’s why nothing looked broken. Everything was just slow.
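You can reproduce the symptom in isolation. The sketch below uses a plain semaphore as a stand-in for a connection pool (the pool size of 5 and the request count are made up): when more requests arrive than there are connections, the extra ones simply wait for a slot. Every query still succeeds, just later.

```python
import threading
import time

POOL_SIZE = 5                    # assumed: the environment's effective connection limit
pool = threading.Semaphore(POOL_SIZE)

def run_query(i, durations):
    start = time.perf_counter()
    with pool:                   # waits, never errors, when all connections are busy
        time.sleep(0.5)          # the query itself is fast once it gets a connection
    durations[i] = time.perf_counter() - start

durations = {}
threads = [threading.Thread(target=run_query, args=(i, durations)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With 20 concurrent requests and 5 connections, the slowest request takes
# around 2 seconds even though each query only needs 0.5s. No errors, just delay.
print(f"slowest request: {max(durations.values()):.2f}s")
```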
Fixing It the Right Way
We could have increased the connection pool size and moved on.
That would have created new problems later.
Instead, we changed how the system handled load.
We limited how many requests could run at the same time.
We added backpressure so waiting requests couldn’t pile up endlessly.
We made the system aware of environment limits instead of assuming defaults.
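A simplified sketch of that shape, using asyncio and made-up knob names (`MAX_CONCURRENT_REQUESTS`, `MAX_WAITING_REQUESTS`) read from the environment rather than hard-coded: a bounded number of requests run at once, a bounded number may wait, and anything beyond that is rejected instead of queuing forever.

```python
import asyncio
import os

# Assumed knobs, read from the environment instead of assumed defaults.
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT_REQUESTS", "10"))
MAX_WAITING = int(os.getenv("MAX_WAITING_REQUESTS", "50"))

_slots = asyncio.Semaphore(MAX_CONCURRENT)   # how many requests may run at once
_in_flight = 0                               # running + waiting

class Overloaded(Exception):
    """Raised when too many requests are already running or waiting."""

async def handle(payload):
    global _in_flight
    if _in_flight >= MAX_CONCURRENT + MAX_WAITING:
        # Shed load instead of letting the queue grow without bound.
        raise Overloaded("too many requests in flight")
    _in_flight += 1
    try:
        async with _slots:                   # excess requests wait here, bounded above
            return await do_work(payload)
    finally:
        _in_flight -= 1

async def do_work(payload):
    await asyncio.sleep(0.1)                 # stand-in for the DB and AI calls
    return payload

async def main():
    # 20 requests at once: 10 run immediately, 10 wait briefly, none are rejected.
    results = await asyncio.gather(*(handle(i) for i in range(20)),
                                   return_exceptions=True)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```

The exact limits matter less than the fact that they exist and come from the environment the system is actually running in.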
What Changed
After that:
- Requests stopped piling up
- Latency stayed stable under load
- No infrastructure changes were needed from the client
What Actually Worked
The fix wasn’t adding more logs.
It wasn’t redeploying faster.
It was understanding the system properly:
- Track one request completely
- Measure time at each step
- Find the real bottleneck before changing anything
When your system runs 24/7, debugging is different.
You don’t guess.
You prove where the problem is, and then you fix only that.