It started, as these things always do, with a mysterious error message popping up in our production logs during peak hours:
```
KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
```
This is the kind of error that makes your stomach drop. It’s sporadic—sometimes during a traffic burst, sometimes when a background job kicks off. It's the worst kind of bug.
Our first instinct? "The database is overloaded. We're hitting our limits."
But as we dug in, we realized this wasn’t a database saturation problem at all. It was a subtle, tricky concurrency problem we had created in our own code.
🔍 The Setup (and Our False Sense of Security)
Here's a quick look at our stack for this service:
- Knex.js as our query builder for PostgreSQL.
- Mongoose for our MongoDB data.
- A set of APIs that, for a single request, often need to fetch data from both databases.
To handle this, we had a pretty beefy Knex connection pool: 100 connections. On paper, this should have been more than enough for our workload.
And yet, here we were, getting timeouts as if the pool was completely exhausted.
🧩 Step 1: Finding the Pattern (The "Huh, That's Weird" Moment)
The timeouts weren't happening everywhere. They were clustered around a few specific endpoints. And the one thing they all had in common was this pattern:
```js
// A simplified version of the "fast" code
await Promise.all([
  fetchFromPostgres(), // Knex query
  fetchFromMongo()     // Mongoose query
]);
```
Looks smart, right? We're running our database calls in parallel to be efficient and reduce response time. That’s just good async programming!
...or so we thought. This assumption, it turns out, was the root of all our problems.
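For reference, the two helpers were thin wrappers over Knex and Mongoose. Here's a rough sketch of what such helpers might look like; the table, model, require paths, and arguments are invented for illustration (the snippet above drops the arguments for brevity):

```js
// Illustrative only -- table, model, and require paths are placeholders.
const db = require('./db');              // the configured Knex instance (Postgres)
const Order = require('./models/order'); // a Mongoose model (MongoDB)

// Checks a connection out of the Knex pool, runs one query, releases it.
async function fetchFromPostgres(userId) {
  return db('user_profiles').where({ user_id: userId }).first();
}

// Goes through Mongoose's own, completely separate connection pool.
async function fetchFromMongo(userId) {
  return Order.find({ userId }).limit(50).lean();
}
```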
🧠 Step 2: How Knex Pools Actually Work
To understand why this broke, we had to go back to basics. How does Knex (which uses tarn.js under the hood) manage its pool?
Let's say our config is:
```js
pool: {
  min: 2,
  max: 100,
  acquireTimeoutMillis: 60000 // 60 seconds
}
```
This means:
- Knex will keep at most 100 connections open to Postgres.
- When a query needs to run, it "checks out" a connection.
- When the query is done, it "releases" the connection back to the pool.
- If all 100 connections are checked out, the next query has to wait.
- If it waits for more than 60 seconds... boom. KnexTimeoutError.
This is straightforward. But the chaos starts when you mix this simple checkout/release system with Node.js's event loop and other async tasks.
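If you want to see that checkout/release mechanic (and the exact error) in isolation, it's easy to reproduce with a deliberately tiny pool. A throwaway sketch, assuming any local Postgres you can point DATABASE_URL at:

```js
const knex = require('knex');

const db = knex({
  client: 'pg',
  connection: process.env.DATABASE_URL, // any reachable Postgres
  pool: { min: 0, max: 1, acquireTimeoutMillis: 1000 } // one connection, 1s wait
});

async function demo() {
  const trx = await db.transaction(); // checks out the ONLY connection and holds it
  try {
    await db.raw('select 1'); // needs a second connection -> waits -> times out
  } catch (err) {
    console.error(err.name, err.message); // expect the KnexTimeoutError from the top of this post
  } finally {
    await trx.rollback(); // give the connection back
    await db.destroy();
  }
}

demo();
```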
⚠️ Step 3: The "Aha!" Moment — The Promise.all() Trap
Here’s where it all fell apart.
When you await Promise.all(), you're telling Node.js: "Start both of these things right now."
So, for a single incoming API request, this happens:
- fetchFromPostgres() kicks off. It goes to the Knex pool and acquires Connection #1.
- fetchFromMongo() kicks off at the same time. It's a totally separate I/O task, hitting a different database.
Now, multiply this by 100 concurrent API requests all hitting this endpoint at once.
- Request 1: Acquires Postgres Connection #1.
- Request 2: Acquires Postgres Connection #2.
- ...
- Request 100: Acquires Postgres Connection #100.
- Request 101: Tries to get a connection. The pool is full. It starts waiting.
But here’s the real kicker: The fetchFromMongo() calls.
Those Mongoose queries are also taking up time and resources on the Node.js event loop. While Node is busy handling Mongo I/O, parsing large JSON responses, or doing anything else, it's delaying the very moment that the fetchFromPostgres() query finishes and releases its connection.
We weren't just running out of connections. We were holding onto them for way too long because the event loop was busy with other "parallel" work. Our "fast" code was causing a traffic jam.
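You can actually watch the jam form. Knex's pool is a tarn.js instance reachable at db.client.pool (an internal, not a formal public API), and tarn exposes counters for used, free, and pending acquires. A rough simulation, reusing the db and helpers sketched earlier:

```js
// Stand-in for the endpoint: the same Promise.all() pattern as above.
async function handleRequest(userId) {
  return Promise.all([fetchFromPostgres(userId), fetchFromMongo(userId)]);
}

async function simulate() {
  // Fire more concurrent "requests" than the pool has connections (max: 100).
  const inFlight = Array.from({ length: 150 }, (_, i) => handleRequest(i));

  // Sample the tarn.js pool while the requests run.
  const pool = db.client.pool;
  const timer = setInterval(() => {
    console.log(
      `used=${pool.numUsed()} free=${pool.numFree()} pending=${pool.numPendingAcquires()}`
    );
  }, 250);

  await Promise.allSettled(inFlight);
  clearInterval(timer);
}

simulate();
```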
🧮 Step 4: Why "Just Add More Connections" Would Have Failed
You might be thinking, "Why not just increase the pool to 500?"
We thought about it. But that would have just masked the symptom and likely caused a new, worse problem: saturating the PostgreSQL server itself.
Every connection in the Knex pool is a real backend process on your Postgres server, consuming memory. Postgres does not handle hundreds of concurrent, active connections efficiently. We would have just moved the bottleneck from our Node app to the database, and probably taken down the whole thing.
This was not a pool-size problem. It was a connection lifecycle problem.
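If you want to see that pressure from the database's side, Postgres will tell you how many backend processes your app is holding. An illustrative check against the standard pg_stat_activity view, using the same db instance as before:

```js
// Counts this database's connections grouped by state (active, idle, ...).
async function countBackendConnections() {
  const { rows } = await db.raw(`
    SELECT state, count(*) AS connections
      FROM pg_stat_activity
     WHERE datname = current_database()
     GROUP BY state
  `);
  console.table(rows);
}
```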
⚙️ Step 5: The Fix (That Felt Too Simple)
Once we truly understood the problem, the fix was almost embarrassingly simple.
We stopped trying to be clever. We made the code sequential.
```js
// ✅ The "boring" but stable version
const postgresData = await fetchFromPostgres(); // 1. Acquire, 2. Use, 3. RELEASE
const mongoData = await fetchFromMongo();       // 4. Now do other stuff
```
And just like that, the timeouts vanished.
This code looks slower on paper, but it holds up far better under load. Why?
- The Postgres connection is acquired and released in a tiny, predictable window.
- The connection pool never gets overwhelmed. Most of the time, it's sitting at 20-30 active connections, not slammed at 100.
- The Mongo query (the "other async work") now runs after the scarce resource (the Postgres connection) has been returned, so it can't cause a delay.
We traded theoretical, single-request parallelism for actual, system-wide throughput.
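In the actual handlers, the change really is just the order of the awaits. A minimal sketch with a hypothetical Express route (the path, params, and response shape are made up):

```js
const express = require('express');
const router = express.Router();

router.get('/users/:id/overview', async (req, res, next) => {
  try {
    // 1. Borrow a Postgres connection, query, and give it straight back.
    const profile = await fetchFromPostgres(req.params.id);

    // 2. Only then do the Mongo work -- no Postgres connection is held here.
    const orders = await fetchFromMongo(req.params.id);

    res.json({ profile, orders });
  } catch (err) {
    next(err);
  }
});

module.exports = router;
```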
🧾 Step 6: The Proof
After deploying the fix, we watched the dashboards.
- Result: Zero KnexTimeoutErrors in the two weeks since.
- Connection Pool Metrics: Stable. Average active connections dropped from spiking at 100 to a healthy 20-30.
- Bonus: Average response time during high load actually dropped by ~15%. Why? Because requests weren't sitting around waiting 60 seconds for a connection to free up.
💡 My Hard-Won Lessons
This bug was a humbling reminder that in systems programming, "fast" is almost always more complicated than it looks.
- "Parallel" Isn't Always Faster. Especially when all of that parallel work is fighting over the same limited resource. It's like opening all the security lines at the airport but only having one TSA agent. The coordination is the bottleneck.
- Node.js Has a Busy To-Do List. Node.js runs your JavaScript on a single thread. Heavy work on that thread (like parsing a huge JSON payload from Mongo) can and will delay everything else, including the callbacks that release your database connections back to the pool.
- You Can't Fix What You Can't See. Monitor your connection pool! We now have alerts on "active connections" and "pending acquires." If we didn't have those metrics, we'd still be guessing. (There's a small sketch of how to pull those numbers out of Knex after this list.)
- Treat Your Connection Pool Like a Scarce Resource. Keep your transactions and queries as short as humanly possible. Get in, get your data, and get out. Don't hold a connection open while you go off and do something else.
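One way to pull the lesson-3 numbers out of Knex: reach into db.client.pool, the tarn.js pool Knex manages internally (treat this as a sketch, not a stable public API). This version just logs the counters; feeding them into whatever metrics system you already use is the part that actually saves you:

```js
// Periodically sample the tarn.js pool behind Knex. Interval and threshold are examples.
function monitorPool(db, intervalMs = 10000) {
  return setInterval(() => {
    const pool = db.client.pool;
    const used = pool.numUsed();               // connections currently checked out
    const free = pool.numFree();               // idle connections in the pool
    const pending = pool.numPendingAcquires(); // queries waiting for a connection

    console.log(`[knex-pool] used=${used} free=${free} pending=${pending}`);

    // Queries queuing up for a connection is the early warning sign of a timeout.
    if (pending > 0) {
      console.warn(`[knex-pool] ${pending} queries are waiting for a connection`);
    }
  }, intervalMs);
}

monitorPool(db);
```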
This whole ordeal started as a scary production fire, but it ended up being a fantastic lesson in how our async code really behaves under pressure. The way we coordinate our work is often far more important than the speed of any single part.