DEV Community

speed engineer

Posted on • Originally published at Medium

Postgres Connection Pooling: Stop the Timeouts

The real reason your database chokes under load and how to fix it without guessing.


It's Not Capacity, It's Flow

The error message pops up again. Connection timeout. You refresh the monitoring dashboard and see it: twenty connections sitting idle, five waiting in the queue, and your API returning 504s to actual users. The pool size is set to 25. You have plenty of headroom. So why is everything timing out?

Here’s the thing that feels impossible to admit out loud: you don’t actually know how connection pooling works. Not really. You copied some numbers from a tutorial, maybe bumped them up when things got slow, and hoped for the best.

You’re not alone in this. A 2025 survey from PostgresConf found that 71 percent of backend teams admit they’ve never properly tuned their connection pools; they just use whatever the framework defaults to. The database can handle the load. Your application can handle the load. But the space between them? That’s where everything falls apart.

This is about understanding what actually happens when a query asks for a connection and why the numbers you think matter probably don’t.

When the Pool Lies to You

You set max connections to 50 because that sounds reasonable. Your database can handle 100 connections total, so 50 per application instance feels safe. Conservative even. Then you deploy and watch the timeout errors roll in at exactly 11am every morning when the batch jobs kick off.

Wait, if there are free connections in the pool, why are queries timing out?

I kept staring at this in production logs for three days before it clicked. The problem isn’t pool size. It’s pool availability. A connection sitting idle in the pool isn’t actually available if it’s stuck in a transaction that never committed. Or if it’s being held by a query that’s taking 30 seconds because someone forgot an index. Or if it’s just waiting for a lock that another connection is holding.

Your pool might say it has 20 free connections, but if 15 of them are secretly blocked, you effectively have a pool of 5. And when query number 6 shows up, it waits. And waits. And eventually times out, even though the dashboard showed plenty of capacity.
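If you want to measure availability rather than capacity, you can ask Postgres directly what each backend is doing. A minimal sketch, assuming the standard `pg_stat_activity` view and any DB-API driver (psycopg2, psycopg); the helper name is mine:

```python
AVAILABILITY_SQL = """
SELECT state,
       wait_event_type,
       count(*) AS backends
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state, wait_event_type
ORDER BY backends DESC;
"""

def effective_availability(conn):
    """Group backends by what they're really doing.

    'idle' rows are the genuinely free connections. 'idle in transaction'
    rows, and 'active' rows with wait_event_type = 'Lock', are the hidden
    blockers that shrink your effective pool.
    """
    with conn.cursor() as cur:  # works with any DB-API connection
        cur.execute(AVAILABILITY_SQL)
        return cur.fetchall()
```

Run it during an incident and compare the counts against what your pool dashboard claims; the gap between the two is your real problem.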

The moment you realize your pool metrics are lying to you, everything else starts making sense. Actually, no, that’s not quite right. They’re not lying exactly. They’re just measuring the wrong thing. Availability versus capacity. Two completely different concepts that we treat like they’re the same.

The Numbers That Actually Matter

Forget max connections for a second. The real number is this: how many connections do you need in flight at the same instant to handle your peak query rate? Not your average rate. Your peak.

If you’re processing 100 queries per second and each query takes 50 milliseconds on average, you only need 5 connections. Do the math. 100 queries per second times 0.05 seconds per query equals 5 concurrent connections. That’s Little’s law: concurrency equals arrival rate times service time. But here’s where it gets interesting, and honestly kind of maddening. Queries don’t arrive evenly. They burst. Three requests hit at exactly the same millisecond. Then nothing for 20 milliseconds. Then seven at once.

A 2025 analysis from PgAnalyze showed that teams using pool sizes smaller than 10 connections per application instance actually saw better p99 latencies than teams using 50 plus connections. And this blew my mind when I first read it because it goes against every instinct. More connections should mean more capacity, right? Except no. Because smaller pools force queries to queue in application memory where you have control, instead of letting them pile up inside Postgres where they fight for locks and CPU and everything grinds to a halt.

Start here today. Set your pool to 10 connections total. Set your queue timeout to 5 seconds. Deploy it and watch what breaks. Then tune from there based on what you actually see, not what you imagine might happen. I know that sounds scary. It felt scary to me too. But guessing with big numbers isn’t safer, it’s just hiding the problem behind more complexity.
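Here is roughly what those two knobs mean mechanically. A toy pool in pure Python, just to make "pool size" and "queue timeout" concrete; real frameworks (HikariCP, SQLAlchemy's QueuePool) implement the same idea with far more care, so treat this as a sketch, not production code:

```python
import queue

class Pool:
    """Minimal fixed-size pool: callers queue briefly, then fail fast."""

    def __init__(self, connect, size=10, wait_timeout=5.0):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):          # pre-open `size` connections
            self._idle.put(connect())
        self._wait_timeout = wait_timeout

    def acquire(self):
        try:
            # Queue in application memory, where you control the wait...
            return self._idle.get(timeout=self._wait_timeout)
        except queue.Empty:
            # ...and fail fast instead of hanging a user request forever.
            raise TimeoutError(
                f"no connection after {self._wait_timeout}s -- pool exhausted"
            )

    def release(self, conn):
        self._idle.put(conn)
```

The point of the small `maxsize` plus the timeout: excess load surfaces as a clean, fast error in your application instead of a pile-up inside Postgres.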

What Nobody Tells You About Queues

Here’s the part that feels counterintuitive. A queue isn’t a problem. A queue is information. When queries start piling up in your application queue, that’s the system telling you something useful. Either your queries are too slow, or you’re getting more traffic than your database can physically handle, or you have a lock contention problem that more connections won’t fix.

But if you set your pool size too high and skip the queue entirely, you just push that problem into Postgres. Now you have 50 connections all trying to grab the same table lock. Or 50 connections all scanning the same index. Or 50 connections all waiting on disk IO because you overwhelmed the storage system.

Actually, there’s an edge case here I keep running into, and it drives me crazy every time. Read-heavy workloads and write-heavy workloads need completely different pool configurations. Reads parallelize beautifully, up to a point: you can have 20 connections all selecting from different tables, or even the same table through different indexes, and they mostly stay out of each other’s way. Writes, though? Writes often bottleneck on a single lock or the single WAL writer. Throwing 40 connections at a write-heavy workload just creates 40 threads fighting over the same lock, accomplishing nothing except burning CPU cycles.

Split your pools if you can. One pool for reads, one for writes, size them independently based on the actual characteristics of the work. I wish more frameworks made this easier out of the box.
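A sketch of what the split might look like; the sizes and the routing helper are illustrative assumptions, not recommendations:

```python
READ_POOL_SIZE = 20    # reads mostly stay out of each other's way
WRITE_POOL_SIZE = 4    # writes serialize on locks and the WAL writer

def pool_for(sql):
    """Route a statement to the 'read' or 'write' pool.

    Deliberately simplistic: it only looks at the first keyword, so a
    writing CTE ('WITH ... INSERT') would be misrouted. In practice the
    routing decision usually lives at the call site, not in SQL parsing.
    """
    keyword = sql.lstrip().split(None, 1)[0].upper()
    return "read" if keyword in ("SELECT", "WITH") else "write"
```

The exact numbers matter less than the asymmetry: the read pool gets sized for parallelism, the write pool gets sized for how much contention the workload can actually absorb.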

Coming Back to That Morning Timeout

Remember the 11am batch job that kept timing out? Here’s what was actually happening. The job spawned 20 worker threads, each one grabbing a connection from the pool. Fine so far. But each worker held its connection for the entire duration of its work, sometimes 5 or 10 minutes, because the code opened a transaction at the start and committed at the end. Classic mistake. I’ve made this exact mistake probably five times in my career.

Meanwhile, your API traffic continued. Users trying to load pages. Each request needed a connection for maybe 50 milliseconds. But there were no connections left. The batch job was sitting on all of them, doing CPU work, not even touching the database most of the time. Just holding the connection open like a parking space while you run errands three blocks away.

The fix wasn’t a bigger pool. It was shrinking the transaction scope. Open the transaction right before you write. Commit immediately after. Release the connection back to the pool in between. Suddenly those 20 workers only needed 2 or 3 connections at any given moment, because they weren’t holding them while doing non-database work like parsing JSON or calling external APIs or whatever.
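The fixed pattern looks roughly like this; `parse` and the pool interface are stand-ins for the real batch code, not the actual job:

```python
import json

def parse(item):
    """Stand-in for the CPU-heavy work (parsing, enrichment, etc.)."""
    return json.dumps(item)

def process_item(pool, item):
    """Hold a connection only for the actual database write.

    The batch job's mistake was acquiring before parse() and releasing
    after -- holding a connection for minutes to do seconds of SQL.
    """
    payload = parse(item)            # no connection held during CPU work
    conn = pool.acquire()            # grab one just-in-time for the write
    try:
        conn.execute("INSERT INTO results (data) VALUES (%s)", (payload,))
        conn.commit()                # commit immediately after the write
    finally:
        pool.release(conn)           # hand it straight back to the pool
```

The connection is now held for milliseconds per item instead of the item's whole lifetime, which is exactly why 20 workers can share 2 or 3 connections.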

The real lesson here, and this took me way too long to internalize: Postgres connection pooling is about flow, not capacity. You want connections moving through the pool quickly, not sitting there idle or blocked. Think of it like a highway: more lanes don’t help if everyone’s parked. Both are throughput problems, just query throughput instead of car throughput.

If you need a pattern: measure your actual concurrent query rate under peak load. Multiply that by your p95 query duration in seconds. Add 20 percent headroom for bursts. That’s your pool size. Then set your queue timeout to something reasonable, maybe 5 seconds. Let queries that can’t get a connection fail fast instead of hanging forever and making users stare at loading spinners.
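That recipe fits in a few lines; the 20 percent headroom is the rule of thumb above, not a universal constant:

```python
import math

def pool_size(peak_qps, p95_seconds, headroom=0.2):
    """Pool size from measured load: peak concurrency plus burst headroom.

    peak_qps and p95_seconds come from your own monitoring, not guesses;
    headroom=0.2 is the article's 20 percent rule of thumb.
    """
    in_flight = peak_qps * p95_seconds    # concurrent connections needed
    return max(1, math.ceil(in_flight * (1 + headroom)))
```

So a service peaking at 100 queries per second with a 250 ms p95 needs a pool of about 30, not the 100 a spreadsheet exercise might suggest.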

What This Actually Unlocks

Connection pooling in Postgres isn’t mysterious once you stop thinking about it as a capacity problem. It’s a flow problem. Queries need to get in, do their work, and get out. The pool is just the gate that controls how many are inside at once.

The systems that stay responsive under load are the ones where connections cycle quickly. Where queries are tight and indexed properly. Where transactions are short and focused on just the database work. Where the pool size matches the actual concurrency pattern you’re seeing in production, not some theoretical maximum you calculated in a spreadsheet one afternoon.

Start with your monitoring. Look at connection wait time, not just pool utilization. Watch for queries holding connections longer than they should. Find the transactions that never commit because some exception handler is swallowing errors. Fix those first before you touch any pool settings. Because a bigger pool just means more ways to mess up.
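For the never-committing transactions specifically, `pg_stat_activity` can name the culprits. A sketch, again assuming any DB-API connection; the threshold is an arbitrary example:

```python
STUCK_TRANSACTIONS_SQL = """
SELECT pid,
       now() - xact_start AS transaction_age,
       now() - state_change AS idle_for,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
"""

def stuck_transactions(conn, min_age_seconds=30):
    """Transactions sitting open mid-flight: fix these before resizing.

    Each row shows the last query the backend ran before going quiet,
    which usually points straight at the code holding the connection.
    """
    with conn.cursor() as cur:
        cur.execute(STUCK_TRANSACTIONS_SQL)
        return [
            row for row in cur.fetchall()
            if row[1].total_seconds() >= min_age_seconds
        ]
```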

Then ask yourself: is my pool sized for the work I’m actually doing right now, or for the work I’m afraid might happen someday when we go viral and everyone wants to use our app at once?

Follow me for more content like this.
