How Avoiding Locking in Postgres Helped Us Get More Throughput

#performance #postgres #devchallenge #bugsmash

Summer Bug Smash: Smash Stories 🐛🛹

This is a submission for DEV's Summer Bug Smash: Smash Stories powered by Sentry.

The observation

One of our services is a distributed worker fleet that builds a large tree
in Postgres (Aurora, db.r6g.xlarge, 4 vCPU, single writer). Each work item a
worker processes yields child records, and every child is upserted as a node
under its parent in the tree.

Under real load — about 90 jobs running concurrently across 20 pods — the
Aurora writer pegged at ~96% CPU and stayed there. Throughput collapsed: work
backed up, the queue grew, and pods sat idle waiting on the database instead of
doing work. The dashboard showed the usual trio of alarms:

CPU ~96%, sustained
Elevated WriteIOPS
~565 database connections
Rising lock-wait time

Every one of those is a symptom. The interesting question was which one was the
cause — and, crucially, which were just downstream of something else.

What we deliberately did not do

When a database is on fire, the reflexes are well-worn:

"565 connections — add a connection pooler, or a bigger instance, or cap the pool."
"High WriteIOPS and table bloat — crank up autovacuum, maybe a VACUUM FULL."
"CPU's at 96% — scale the instance up a tier."

Each of these is a plausible-sounding fix, and each one treats a symptom:

Bigger hardware buys headroom that linear growth eats straight back. We're targeting 3× the load; a one-tier bump is a few weeks of runway, not a fix.
Aggressive vacuum fights dead tuples we hadn't yet proven were the problem — and vacuum itself burns IO and CPU, so "just vacuum harder" can make a CPU-bound box worse.
Connection caps / poolers move the queue from inside Postgres to in front of it. If the work per query is the problem, throttling connections just makes clients wait in a different place.

None of these answers the only question that matters: why is the CPU pinned?
So we didn't touch any of them until we knew.

Step 1 — Read the waits, not the vitals (Performance Insights)

Instead of guessing from CPU/IOPS/connections, we opened Performance Insights
and looked at Top Waits. The dominant wait was not query execution, not disk
I/O, not client wait. It was:

LWLock:MultiXactOffsetBuffer   (with MultiXactGen close behind)

That is an extremely specific fingerprint, and it changed everything.

MultiXacts are how Postgres represents more than one transaction holding a
lock on the same row at the same time. The most common way to create them is a
flood of FOR KEY SHARE locks on a shared row. And who takes FOR KEY SHARE
locks, implicitly, on every write? Foreign keys. When you insert a child row,
Postgres locks the referenced parent row FOR KEY SHARE for the duration of the
transaction, to guarantee the parent can't vanish underneath you.

Our hot table tree_nodes had three foreign keys:

FK column	references	who it locks
`parent_id`	`tree_nodes`	the parent node — contended by all its siblings
`source_node_id`	`tree_nodes`	the node a child was derived from
`job_id`	`jobs`	*one row per job — locked by every* insert for that job**

Under concurrency this is a worst case. Many inserts share a parent (and all
inserts for a job share the single jobs row), so multiple transactions took a
FOR KEY SHARE lock on the same tuple simultaneously. Postgres had to allocate
a MultiXact to track the set of lockers, rewrite the locked tuple's xmax on
every new locker, and push that through the MultiXact SLRU buffers — allocate,
look up, rewrite, WAL, repeat. That SLRU churn was the 96% CPU.

The connections and the WriteIOPS were downstream of this: transactions piled
up because they were serializing on the same rows, and the tuple rewrites drove
the writes. Fix the lock and both recede on their own.

The fix: drop the three FK constraints. Integrity is guaranteed by the
workers' own ordering — a parent node is always written before its child, and
job_id always points at a live row we just created. The indexes stay
(it's the constraint that takes the lock, not the index), so reads are
unaffected.

Result: MultiXact waits went to zero, and CPU came off the ceiling — even at
the same 90-job, full-fan-out concurrency that had pinned it before.

Step 2 — A suspicious calls-to-rows ratio (a tiny cache)

With the biggest fire out, we went back to Performance Insights and looked at the
top statements. The busiest INSERT told on itself:

INSERT INTO tree_nodes (...) ON CONFLICT DO NOTHING
    calls ≈ 1439   rows ≈ 2.78   (per sample)

We were issuing ~1439 insert attempts to actually write ~3 rows. The workers
re-touch the same shared ancestor nodes constantly — the interior nodes near the
root are referenced by every item beneath them — so the overwhelming majority of
inserts were conflict-and-discard: take the unique-index leaf lock, find the
existing row, do nothing. Correct, but not free — every attempt still touches the
index and costs a round-trip.

The fix: a per-run in-memory seen-set — a sync.Map keyed by the node's
full key, populated after each commit. Before attempting an insert we check the
set; if we've already persisted that node this run, we skip the database entirely.
It costs a couple of megabytes for a 20k-node run and evaporates when the process
exits — no schema, no eviction policy, no new failure mode.

Result: the INSERT ratio went from ~1439 calls : 2.78 rows to
~131 calls : 129 rows — we now almost only touch the database for rows that
are genuinely new.

Step 3 — Batch the transactions (the classic one)

Next wait to surface once the churn was gone:

IO:XactSync   ≈ 85% of active samples

IO:XactSync is commit / WAL-flush wait — the fsync at the end of a transaction.
We had been committing once per child record. Thousands of tiny transactions
meant thousands of fsyncs, and the commit itself became the bottleneck.

The fix: the oldest trick in the book — batch. Collect all of a work
item's child records and upsert them in a single transaction: one
BEGIN/COMMIT per item instead of one per child.

Result: IO:XactSync dropped from ~85% to ~32% of active samples.

The payoff

Same workload (cold re-run of 100 jobs, ramping to an 88-job peak):

Metric	Before	After
Writer CPU (peak)	96%	~68% during cold-start ramp, ~20–30% steady at peak
Top wait	`MultiXactOffsetBuffer`	`IO:XactSync` (~32%)
MultiXact waiters	dominant	0
INSERT calls : rows	1439 : 2.78	131 : 129
`IO:XactSync` share	~85%	~32%
DB connections	~565	~246

We got more throughput without a bigger instance, a connection pooler, or
touching autovacuum — the three things we'd have reached for first if we'd trusted
the vitals instead of the waits.

What's next (being honest about the tail)

Optimizations unmask each other. With MultiXact and most commit-sync gone, the new
top wait is a bursty Lock:transactionid on hot parent-counter rows
(children_total, children_processed, …) that every child bump serializes on
during the initial fan-out of high-fan-out parents. That's the next thing to
attack — likely by deferring and coalescing those counter bumps to the end of each
item's batch — and we only know it's next because, again, we read the waits.

Takeaways

Symptoms lie about causes. Connections, IOPS, and CPU were all genuinely high — and all three were downstream of one lock. Fixing any of them directly would have burned time and money without moving the real number.
Read the wait events. MultiXactOffsetBuffer pointed us straight at FK-induced FOR KEY SHARE locking — a diagnosis we would never have guessed from a CPU graph.
Fix in dependency order, and re-measure between each. Kill the dominant wait, look again, kill the next. The seen-set and the batching were cheap and almost obvious — but only measurement told us they mattered, and in what order.
The cheapest fixes are invisible until the expensive one is gone. You can't see the commit-sync problem while a lock is burning 96% of the CPU. Peel the layers.