When the Default Postgres Pool Died at 3 AM

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our treasure-hunt engine at Veltrix was simple on paper: read JSON blobs from S3, parse them, and return the top 50 results by relevance score. By month three we had 2.3 million daily active users, but every Tuesday at 02:47 API latency spiked to 1.4 seconds and the Postgres pool collapsed with too-many-connections errors. The error itself wasn't a surprise, since we were still running the default max_connections = 100, but what stumped us was that the spike happened even though 45 % of the connections were idle. Querying pg_stat_activity showed 89 blocked queries each time the relevance worker tried to UPDATE a cache table. The JSONB column had grown from 2 MB to 180 MB, and every UPDATE rewrote the whole row. Vacuum couldn't keep up because the autovacuum workers were also blocked. The constraint wasn't CPU or memory; it was the concurrency model baked into the default Postgres config.

What We Tried First (And Why It Failed)

We bolted on a Redis cache in front of Postgres. In the first week it cut median latency from 45 ms to 8 ms. Then the cache stampede hit: at 02:47 the Redis TTLs expired for 30 k keys, and 18 k simultaneous GETs raced to recompute the relevance scores. We tried a SET key value NX PX lock to guard the recomputation, but the Lua script we pushed to Redis kept timing out after 5 ms because it called JSON.GET on 500-line blobs. The Redis node saturated its network interface at 110 Mbps while the Postgres pool still ran into the same wall of row-level locks. The JSONB scans were now off the hot path, but every spike left the Postgres shared_buffers full of dirty blocks that had to be fsynced under memory pressure. We measured 110 k block reads per second during the spike, well above the 30 k our SSD could sustain without latency ballooning.
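
The stampede above boils down to many callers recomputing the same key at the same moment. Here is a minimal sketch of request coalescing (often called single-flight) in plain Rust. It is not the Redis lock we tried: it assumes an in-process guard, and SingleFlight, get_or_compute, and the u64 score are hypothetical names chosen for illustration. It also ignores TTLs and does not recover if the computing closure panics.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};

/// State of one cache slot: being computed by some thread, or ready.
#[derive(Clone, Copy)]
enum Slot {
    InFlight,
    Ready(u64),
}

/// Hypothetical in-process cache that lets only one caller recompute a key.
pub struct SingleFlight {
    slots: Mutex<HashMap<String, Slot>>,
    ready: Condvar,
}

impl SingleFlight {
    pub fn new() -> Self {
        Self {
            slots: Mutex::new(HashMap::new()),
            ready: Condvar::new(),
        }
    }

    /// Returns the value for `key`, running `compute` at most once per key
    /// even when many threads ask at the same time.
    /// Limitation: if `compute` panics, waiters for that key block forever.
    pub fn get_or_compute<F>(&self, key: &str, compute: F) -> u64
    where
        F: FnOnce() -> u64,
    {
        let mut slots = self.slots.lock().unwrap();
        loop {
            match slots.get(key).copied() {
                Some(Slot::Ready(value)) => return value,
                // Another thread is already computing: sleep until notified,
                // then re-check (this also covers spurious wakeups).
                Some(Slot::InFlight) => slots = self.ready.wait(slots).unwrap(),
                None => break,
            }
        }
        // This thread becomes the leader for the key.
        slots.insert(key.to_string(), Slot::InFlight);
        drop(slots); // do the slow work without holding the lock

        let value = compute();

        let mut slots = self.slots.lock().unwrap();
        slots.insert(key.to_string(), Slot::Ready(value));
        self.ready.notify_all();
        value
    }
}
```

With a guard like this, 18 k simultaneous misses on one key would trigger one recomputation instead of 18 k. Across several API processes you would still need a shared lock or jittered TTLs, which this sketch does not cover.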

The Architecture Decision

We rewrote the relevance scorer in Rust and moved the scoring entirely off Postgres. The critical decision wasn't the language; it was the data layout. Instead of one huge JSONB column, we wrote columnar Parquet files sharded by day and hunt ID. We replaced the UPDATE-based cache with an immutable log: each hunt writes a Parquet file, and the API reads the latest N files with arrow2::io::parquet::read. We chose arrow2 because it zero-copies from Parquet to IPC buffers, so the scoring worker never allocates during the hot loop. We ran the Rust worker on the same Kubernetes node as the Postgres primary to avoid cross-AZ network cost, but we put it in its own deployment with a 400 Mi memory request and a 100 ms soft limit. The worker's main loop is a single tokio::select! with three branches: new Parquet file, SIGTERM, or a 250 ms timer. We turned Postgres max_connections down to 50 and set idle_in_transaction_session_timeout = 30s to kill sessions left idle inside a transaction. The change didn't feel controversial until we hit the first production incident with the Rust worker: it ran out of memory during a 4 GB Parquet merge. That forced us to implement a streaming merge in chunks of 250 MB, using rayon with 4 threads and a custom memory budget. The tunable chunk size became our safety valve.
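
To make the streaming merge concrete, here is a minimal single-threaded sketch, assuming each input is already sorted and that a fixed-size row stands in for the Arrow data. It is not our production code: Row, merge_in_chunks, and the sink callback are hypothetical names, Parquet I/O and rayon are left out, and the budget only covers the output buffer, not the inputs.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical row type; a real worker would read Arrow/Parquet columns.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Row {
    pub score: u32,
    pub hunt_id: u32,
}

/// Merges already-sorted inputs in ascending order and hands rows to `sink`
/// in chunks whose buffer size stays within `budget_bytes`.
/// Returns the number of chunks flushed.
pub fn merge_in_chunks<I, F>(inputs: Vec<I>, budget_bytes: usize, mut sink: F) -> usize
where
    I: Iterator<Item = Row>,
    F: FnMut(&[Row]),
{
    let row_bytes = std::mem::size_of::<Row>();
    // At least one row per chunk, so a tiny budget cannot stall the merge.
    let rows_per_chunk = (budget_bytes / row_bytes).max(1);

    let mut inputs = inputs;
    // Min-heap with the next pending row of each input, tagged by input index.
    let mut heap = BinaryHeap::new();
    for (idx, input) in inputs.iter_mut().enumerate() {
        if let Some(row) = input.next() {
            heap.push(Reverse((row, idx)));
        }
    }

    let mut chunk: Vec<Row> = Vec::with_capacity(rows_per_chunk);
    let mut flushed = 0;
    while let Some(Reverse((row, idx))) = heap.pop() {
        chunk.push(row);
        // Refill from the input that just yielded the smallest row.
        if let Some(next) = inputs[idx].next() {
            heap.push(Reverse((next, idx)));
        }
        if chunk.len() == rows_per_chunk {
            sink(&chunk);
            chunk.clear(); // keeps the allocation, so the buffer is reused
            flushed += 1;
        }
    }
    if !chunk.is_empty() {
        sink(&chunk);
        flushed += 1;
    }
    flushed
}
```

The point of the shape is that peak memory depends on the budget and the number of inputs, not on the total size of the merge, which is why the chunk size works as a single tunable.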

What The Numbers Said After

After two weeks, APM showed p99 latency dropping from 1.4 s to 42 ms. The Postgres pool stabilized at 28 active connections during peak, with idle_in_transaction_session_timeout killing 16 rogue sessions every spike. We measured memory: the Rust process RSS stayed at 220 MiB during steady state and spiked to 380 MiB only during the 4 GB merge, which lasted 2.3 s. Allocations per hunt averaged 1.8 k objects, 58 % of them in the scoring hot loop. We ran heaptrack on a staging copy of the largest hunt and saw that 92 % of allocations came from serde_json::Value creation; switching to simd-json cut allocations by 44 %. Postgres buffer hits rose from 67 % to 94 %, and autovacuum finished in under 4 minutes instead of timing out after 15. The network interface on the Rust worker never exceeded 22 Mbps, freeing Redis for the actual cache workload. Cost per 100 k hunts fell from 0.14 USD to 0.07 USD because we dropped two extra cache layers we'd added in desperation.

What I Would Do Differently

I would not have rewritten the scorer in Rust if the data layout had stayed the same. The bottleneck was the row rewrite, not the language. Rust gave us safety and control, but the Parquet sharding and streaming merge were the real wins. Today I'd start with the storage layer change, measure, and only then decide whether the hot loop needs Rust. I'd also set aside more time for the memory budget in the streaming merge; the 4 GB spike caught us because our staging workloads capped at 1 GB. Finally, I'd resist the urge to expose every knob to Prometheus. We shipped 14 custom metrics for the Rust worker, but only latency, memory_use, and files_processed are truly actionable. The rest were noise during the first production incident, when we had to triage a memory leak introduced by a bad rayon scheduler hint.