A detailed account of diagnosing and recovering from a near-fatal transaction ID wraparound on a high-traffic production PostgreSQL database — and the unexpected twist that turned a 60-second operation into 50 minutes of downtime.
TL;DR: Our production PostgreSQL hit 1.6B transaction IDs (limit is ~2.1B) on a 2TB table. Normal vacuum failed for weeks due to excessive bloat. Vacuum freeze ran for 6 days and got stuck. We dropped 800GB of unused indexes and ran pg_repack as a last resort. pg_repack worked perfectly until the final step, where PostgreSQL's own anti-wraparound autovacuum blocked the ACCESS EXCLUSIVE lock needed to complete the swap — causing 50 minutes of downtime instead of 60 seconds. We killed autovacuum, pg_repack finished, XIDs dropped from 1.6B to 440M, and vacuum that used to take days now runs in 2.5 hours. Full timeline and lessons below.
The Alert Nobody Worries About (Until They Should)
On a quiet Tuesday morning, a monitoring alert fired: MaximumUsedTransactionIDs had crossed the 500 million threshold on our production PostgreSQL database.
The database in question was the backbone of a high-traffic SaaS platform — an 800GB table called events with several large indexes, bringing the total footprint to over 2TB. It handled hundreds of queries per second, 24/7.
If you work with PostgreSQL, you know the rule: the XID limit is approximately 2.147 billion. Cross it, and PostgreSQL forces the database into read-only mode to prevent data corruption. Recovery requires an offline, single-user vacuum that can take hours to days depending on database size. For a 2TB table serving live traffic, "days of downtime" wasn't a hypothetical — it was an existential threat.
At 500 million, we weren't worried. Weekly autovacuum maintenance had always kept things in check. We expected the next scheduled vacuum run to clean this up automatically.
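The metric behind that alert can be checked directly inside PostgreSQL. A minimal sketch (generic query, not our actual monitoring setup):

```sql
-- Age of the oldest unfrozen transaction ID, per database.
-- Wraparound protection engages as this approaches ~2.1 billion.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```

RDS surfaces roughly the same number as the MaximumUsedTransactionIDs CloudWatch metric, but running it by hand is useful when you want per-database detail.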
That assumption would prove dangerously wrong.
Day 4: The First Sign Something Was Off
Day 4 — 700 million XIDs
Autovacuum had kicked off but hadn't completed the events table. This wasn't unusual — on a table this size with live traffic, vacuum could take more than a day, since it competes with production queries for I/O and CPU. We decided to wait.
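If you suspect vacuum is stalled rather than merely slow, `pg_stat_progress_vacuum` (PostgreSQL 9.6+) shows which phase each worker is in. A sketch:

```sql
-- One row per backend currently running VACUUM (including autovacuum).
-- 'phase' distinguishes the heap scan from index vacuuming, which is
-- where large bloated indexes tend to get stuck.
SELECT p.pid, c.relname, p.phase,
       p.heap_blks_scanned, p.heap_blks_total,
       p.index_vacuum_count
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```

In hindsight, watching `heap_blks_scanned` over time would have told us days earlier that vacuum was not making meaningful progress.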
The XIDs kept climbing.
Day 9: Crossing the Billion Mark
Day 9 — 1.1 billion XIDs
Vacuum still hadn't completed. We'd blown past our internal acceptable threshold of 1 billion. At this point, we initiated VACUUM FREEZE — an aggressive form of vacuum that forcefully marks all tuples as frozen, resetting their transaction ID age. This was the recommended escalation from our database consulting partner.
We expected vacuum freeze to resolve the situation within a few days. It had never failed us before.
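For reference, the escalation itself is a one-liner (the table name here stands in for ours):

```sql
-- Aggressively freeze all tuples, resetting their XID age.
-- VERBOSE reports per-table progress back to the client.
VACUUM (FREEZE, VERBOSE) events;
```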
Day 14: AWS Sends the Warning
Day 14, early morning — 1.5 billion XIDs
AWS RDS sent an automated alert:
The instance is approaching transaction-wraparound. This requires urgent attention to avoid a future outage. When the age reaches 2,146,483,647, the database will stop accepting new transactions.
We immediately got on calls with AWS Support and our database consulting partner. The situation:
- Vacuum freeze had been running for 5 days
- Of the 10 indexes on the events table, 7 had completed, but 3 large indexes were still in progress with no clear completion estimate
- Based on the XID growth rate, we estimated 5 days until we hit 2 billion
Five days. That was our runway.
We decided to give vacuum freeze one more day. If the remaining 3 indexes showed progress, we'd let it finish. If not, we'd escalate.
Day 15: The Decision Point
Day 15, late morning — vacuum freeze showed no progress.
The three remaining indexes were stuck. After 6 days of running, vacuum freeze was not going to save us. We had to move to our contingency plan: pg_repack.
Why pg_repack?
pg_repack is a PostgreSQL extension that rebuilds a table and its indexes completely, removing all bloat in the process. Unlike VACUUM FULL (which locks the table for the entire duration), pg_repack works online — it creates a shadow copy of the table, replays changes via triggers, and then performs a brief swap at the end. The expected downtime is two short lock windows, typically under 60 seconds each.
For a 2TB table approaching catastrophic XID limits, pg_repack was our best option: minimal downtime, complete bloat removal, and a full XID reset.
But first, we needed to give pg_repack the best chance of success.
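pg_repack has two halves: a server-side extension and a client CLI that drives it. The extension must exist in the target database before the client can connect (on RDS it also has to be on the instance's supported-extensions list):

```sql
-- Server-side half of pg_repack; the pg_repack CLI connects to the
-- database and drives the rebuild through this extension.
CREATE EXTENSION IF NOT EXISTS pg_repack;
```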
Preparing the Ground
Step 1: Provision storage. pg_repack needs to create a full copy of the table before swapping. We increased storage from 8.5TB to 11TB to ensure we wouldn't run out of disk mid-operation. Running out of storage during pg_repack would be catastrophic.
Step 2: Remove dead weight. We analyzed the indexes on the events table and identified 4 unused indexes totaling approximately 800GB:
- idx_events_on_source_id
- idx_events_agent_id
- idx_events_status
- idx_events_closed_at
After confirming with the application on-call team that none of these were actively used, we dropped all four. This served two purposes: it reduced the work pg_repack needed to do and freed up storage for the table rebuild.
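The candidates came from index usage statistics. A sketch of the kind of query involved (the index names above are real; this query is generic):

```sql
-- Indexes on events that no query plan has used since the statistics
-- were last reset, largest first.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'events'
  AND idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Dropping concurrently avoids taking an exclusive lock on the table.
DROP INDEX CONCURRENTLY idx_events_on_source_id;
```

One caveat: `idx_scan` only counts since the last statistics reset, so a zero can mean "unused" or "stats recently reset" — hence the explicit sign-off step below.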
Dropping 800GB of unused indexes on a production database under time pressure is not a decision you make lightly. We verified each index against active query patterns and got explicit sign-off before proceeding.
The Repack
Day 15, 12:17 PM — pg_repack started.
We notified all stakeholders via Slack and coordinated with the application on-call team. pg_repack began its work: reading the entire events table, writing a clean copy, and rebuilding indexes.
1:57 PM — The first lock acquisition completed in under 60 seconds, as expected. Table copy was done, and index creation was in progress. Everything was proceeding normally.
7:05 PM — pg_repack reached its final step: acquire an ACCESS EXCLUSIVE lock on the events table, swap the old table with the new one, and drop the old copy. This was supposed to take under 60 seconds.
It didn't.
The Twist: When Your Safety Net Becomes Your Obstacle
The ACCESS EXCLUSIVE lock couldn't be acquired. Something was blocking it.
We investigated and found the culprit: autovacuum had automatically started on the events table.
Here's the irony. PostgreSQL's autovacuum daemon monitors MaximumUsedTransactionIDs. When it sees a table approaching dangerous XID levels, it aggressively kicks off anti-wraparound vacuum to protect the database. At 1.6 billion XIDs, PostgreSQL decided it was time to intervene.
The problem: autovacuum held a lock on the events table that conflicted with pg_repack's ACCESS EXCLUSIVE lock requirement. pg_repack was waiting. Autovacuum was running. And every SELECT, INSERT, UPDATE, and DELETE against the events table was queued behind pg_repack's pending lock request.
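Diagnosing this kind of pile-up is quick once you know where to look. A sketch using `pg_blocking_pids` (PostgreSQL 9.6+):

```sql
-- For each waiting backend, which PIDs are blocking it and what those
-- sessions are doing. The anti-wraparound worker shows up with query
-- text like 'autovacuum: VACUUM public.events (to prevent wraparound)'.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```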
The application was down. Users couldn't access the platform, and every request touching the events table returned a 500 error because the database couldn't process queries against its largest table.
The clock was ticking. We were at 1.6 billion XIDs, and we were in an unplanned outage caused by the very mechanism PostgreSQL uses to prevent outages.
The Fix
We made the call: kill autovacuum manually.
```sql
-- Terminate the autovacuum worker holding the lock.
-- Filtering on backend_type (PostgreSQL 10+) is safer than matching
-- query text with LIKE '%autovacuum%', which can match unrelated
-- sessions — including this very query's own text.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
  AND datname = 'your_database';
```
The moment autovacuum terminated, pg_repack acquired the ACCESS EXCLUSIVE lock, performed the table swap, and completed.
Total unplanned downtime: 50 minutes.
The Recovery
Day 15, ~8 PM — pg_repack completed.
The database was back. The application recovered automatically once queries could reach the events table again.
With the table fully rebuilt and free of bloat, autovacuum restarted automatically and completed its normal maintenance cycle.
~10:30 PM — MaximumUsedTransactionIDs dropped to 440 million. The crisis was over.
But the numbers told a remarkable story about how bad the bloat had been:
| Metric | Before pg_repack | After pg_repack |
|---|---|---|
| MaximumUsedTransactionIDs | 1.6 billion | 440 million |
| Table + index size | >2TB | ~1TB (50% reduction) |
| Autovacuum completion time | >1 day (often failing) | ~2.5 hours |
| Unused index overhead | ~800GB | 0 |
The reason vacuum had been failing for weeks was now obvious. The table had accumulated so much bloat that vacuum couldn't process it within its resource constraints. The bloated unused indexes were consuming vacuum time without providing any query benefit. It was a vicious cycle: bloat prevented vacuum from completing, which caused more bloat, which pushed XIDs higher.
pg_repack broke the cycle by rebuilding everything clean.
Lessons Learned
1. Monitor XID age trends, not just thresholds
A 500 million XID alert is meaningless if you don't also look at the rate of increase and whether vacuum is actually making progress. We had the alert but initially treated it as routine. A dashboard showing "days until 2B at current rate" would have escalated this earlier.
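A sketch of such a check — an illustration, not our exact dashboard query:

```sql
-- Percent of the way to the ~2.1B wraparound limit, per database.
-- Feed this to your metrics system and alert on the trend (days of
-- runway at the current growth rate), not just the absolute value.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 1) AS pct_of_limit
FROM pg_database
ORDER BY xid_age DESC;
```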
2. Bloat is the silent killer of vacuum
Vacuum on a heavily bloated table with large indexes can appear to be running but make no meaningful progress. If your table has grown significantly beyond its expected size, vacuum may not be enough — you may need pg_repack to break the cycle before vacuum can work effectively again.
3. Unused indexes are a liability, not just wasted space
Those 4 unused indexes (800GB) weren't just wasting storage. They were actively preventing vacuum from completing in a reasonable time. Every vacuum operation has to process every index on the table. Regularly auditing and removing unused indexes isn't just a performance optimization — it's an operational safety measure.
4. pg_repack's final lock can be blocked by autovacuum
This is the detail I've seen least discussed publicly. pg_repack needs an ACCESS EXCLUSIVE lock for its final swap operation. If PostgreSQL's autovacuum daemon is running on the same table (which it will be if XID levels are high enough to trigger anti-wraparound vacuum), the lock will be blocked. You need to be prepared to manually kill autovacuum during pg_repack's final step.
This creates a paradox: pg_repack is most needed precisely when autovacuum is most aggressively trying to run on the same table. Plan for this.
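Before pg_repack's final swap, it's worth checking whether an anti-wraparound worker is already on the table. Note that this flavor of autovacuum runs even if you set `autovacuum_enabled = off` on the table, so detection and termination are the only levers:

```sql
-- Anti-wraparound workers are labelled in their query text.
SELECT pid, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
  AND query LIKE '%to prevent wraparound%';
```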
5. Pre-flight before pg_repack
Before starting pg_repack on a large table:
- Provision extra storage — pg_repack creates a full copy before swapping. You need at least 1x the table size in free space, plus overhead for indexes.
- Drop unused indexes first — every index pg_repack has to rebuild extends the operation time and the risk window.
- Coordinate with application teams — the two lock windows are brief but real. Ensure on-call teams are aware and monitoring.
- Have a plan for autovacuum interference — especially if your XID levels are already high enough to trigger anti-wraparound vacuum.
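The storage math in the first bullet can be checked up front; a minimal sketch:

```sql
-- Total on-disk footprint of events including indexes and TOAST data;
-- pg_repack needs at least this much free space for the shadow copy.
SELECT pg_size_pretty(pg_total_relation_size('events')) AS total_size;
```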
6. Estimate your runway before it becomes urgent
At 1.5 billion XIDs, we had an estimated 5 days. That sounds like enough time, but each preparatory step (storage provisioning, index analysis, team coordination, pg_repack execution) took hours. If we'd reached 1.8 or 1.9 billion before acting, we might not have had time for pg_repack to finish before catastrophic read-only mode kicked in.
The Timeline at a Glance
| Day | XIDs | Action |
|---|---|---|
| Day 1 | 500M | Alert fires. Expected autovacuum to resolve. |
| Day 4 | 700M | Vacuum still running, decided to wait. |
| Day 9 | 1.1B | Initiated vacuum freeze. |
| Day 14 | 1.5B | AWS warning. Vacuum freeze running 5 days, 3 indexes stuck. 5-day runway. |
| Day 15 AM | 1.5B+ | Vacuum freeze killed. Dropped 800GB of unused indexes. Provisioned storage. |
| Day 15 12:17 PM | ~1.6B | pg_repack started. |
| Day 15 1:57 PM | ~1.6B | First lock <60 seconds. Table copy complete. |
| Day 15 7:05 PM | ~1.6B | Final lock blocked by autovacuum. Application down. |
| Day 15 ~8 PM | ~1.6B | Killed autovacuum. pg_repack completed. Application recovered. |
| Day 15 ~10:30 PM | 440M | Autovacuum completed in 2.5 hours. Crisis resolved. |
Final Thought
PostgreSQL's transaction ID system is one of its few genuinely dangerous operational characteristics. The safety mechanisms (autovacuum, anti-wraparound vacuum) work well in normal conditions. But when a table is bloated enough to prevent vacuum from completing, those safety mechanisms become insufficient — and in our case, one of them actively interfered with the recovery operation.
The combination of excessive bloat + large unused indexes + high write throughput created a situation where normal vacuum couldn't keep up, vacuum freeze couldn't finish, and the only viable path (pg_repack) was momentarily blocked by the very system trying to prevent the disaster we were trying to fix.
Know your tables. Audit your indexes. Monitor your XIDs. And have a pg_repack runbook ready before you need it.