One day, I noticed something odd with our Postgres instance. It was experiencing intermittent performance spikes, load averages were hitting unusually high levels, and I/O utilization occasionally maxed out at 100%. Most workloads were running smoothly, but these spikes felt off. So, I put on my debugging hat and got to work.
Investigating the Spikes
While analyzing the running queries, I discovered that Autovacuum was actively running on the tasks database. This database stores scheduled tasks from one of our services, which means it sees a high volume of write activity. Seeing Autovacuum run there wasn’t surprising in itself, and I suspected it was the main contributor to the performance issue.
To mitigate the problem, we decided to move the tasks database to a new VM so that the other systems would not be impacted.
Once the migration was complete, the results were immediate and significant. Below are the before-and-after graphs.
The graphs showed a significant improvement: I/O utilization no longer spiked beyond 40%, and the overall average dropped below 15%. A big win, right? Well, yes and no. While the system was in much better shape, I still wasn’t fully satisfied.
I also noticed occasional Autovacuum processes running during business hours. While they weren’t as aggressive as they had been for the tasks database, they still contributed to some system load. This got me thinking: could I optimize things even further?
With the immediate issue resolved, I swapped my debugging hat for my analyzing glasses and dug deeper.
Optimizing Storage with Manual Vacuum
Since the tasks database had previously triggered frequent Autovacuum runs, I decided to run a manual VACUUM FULL on its tables. With the actual database now running on a different machine and the current one no longer in use, this seemed like a safe and logical step.
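Here’s roughly what that looks like; a minimal sketch assuming the partition names shown in the tables below:

```sql
-- Rewrite each partition into a fresh file, discarding dead tuples.
-- VERBOSE prints progress; ANALYZE refreshes planner statistics afterwards.
VACUUM (FULL, VERBOSE, ANALYZE) tasks_p1;
VACUUM (FULL, VERBOSE, ANALYZE) tasks_p2;
VACUUM (FULL, VERBOSE, ANALYZE) tasks_p3;
```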
Before running the vacuum, I took note of the storage usage across all databases on the VM:
| database_name | size |
| --- | --- |
| tasks | 42 GB |
| wickets | 12 GB |
| leash | 6833 MB |
| toolbox | 5857 MB |
| bases | 1126 MB |
That adds up to roughly 67.8 GB in total.
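For reference, the per-database sizes above can be pulled with a query along these lines, using the built-in size functions:

```sql
-- Report each non-template database and its size, largest first.
SELECT datname AS database_name,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE NOT datistemplate
ORDER BY pg_database_size(datname) DESC;
```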
Vacuuming tasks
Here’s a before-and-after comparison of the vacuum operation on the tasks database:
Before
| relname | n_live_tup | n_dead_tup | total_size | table_size | indexes_size |
| --- | --- | --- | --- | --- | --- |
| tasks_p1 | 1,026,187 | 176,574 | 13 GB | 4956 MB | 8211 MB |
| tasks_p2 | 1,023,375 | 57,710 | 13 GB | 4666 MB | 7748 MB |
| tasks_p3 | 1,020,129 | 12,369 | 12 GB | 4562 MB | 7782 MB |
After
| relname | n_live_tup | n_dead_tup | total_size | table_size | indexes_size |
| --- | --- | --- | --- | --- | --- |
| tasks_p1 | 1,026,187 | 0 | 1298 MB | 1138 MB | 140 MB |
| tasks_p2 | 1,023,375 | 0 | 1296 MB | 1137 MB | 140 MB |
| tasks_p3 | 1,020,129 | 0 | 1297 MB | 1136 MB | 140 MB |
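The numbers in both tables come from the statistics views and the size functions; a query roughly like this one reproduces the columns (a sketch, not necessarily the exact query I ran):

```sql
-- Live/dead tuple counts plus total, table, and index sizes, biggest first.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_table_size(relid))          AS table_size,
       pg_size_pretty(pg_indexes_size(relid))        AS indexes_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```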
The Shocking Discovery
Manual vacuuming cut our storage usage by 32 GB, reducing tasks from 36 GB to just 4 GB, an 89% reduction. Seeing such a drastic change made me wonder: What if I applied the same process to the other databases on this Postgres instance?
So, I did exactly that, and storage usage dropped from ~86% to 76%.
That’s when I stumbled upon something strange. All the databases combined used around 70 GB, but the metrics showed that the cluster was consuming nearly 1.06 TB of storage.
Something wasn’t adding up. Where was all that extra space being used?
So, I started digging into PostgreSQL’s data directory and quickly discovered a directory filled with WAL (Write-Ahead Log) files consuming the majority of the disk space. This didn’t make sense at all: our cluster wasn’t configured to retain WAL files indefinitely.
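If you want to check the WAL footprint without shelling into the box, a minimal sketch using pg_ls_waldir() works too; it has been available since PostgreSQL 10 and is restricted to superusers and members of pg_monitor:

```sql
-- Count WAL segment files and report their combined size.
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```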
Write-Ahead Logging (WAL) is a critical feature in PostgreSQL that ensures data integrity and durability. It records changes to the database before they are applied, allowing PostgreSQL to recover data and restore the database to its most recent state in case of a crash or failure.
Curious about why so many WAL files were piling up, I turned to Google for possible explanations. One potential culprit stood out: inactive replication slots.
A replication slot is a feature that ensures WAL (Write-Ahead Log) files are retained until a replica (standby server) or a logical consumer has processed them. It prevents WAL files from being removed before they are received by the subscriber, ensuring smooth replication.
To confirm this, I ran the following query to check for active replication slots:
```sql
SELECT * FROM pg_replication_slots;
```
Ideally, this query should have returned no rows. Instead, I found two logical replication slots, remnants of an old experiment that had never been removed. Since these slots were inactive, the cluster was holding onto WAL files indefinitely, waiting for the corresponding consumers to process them. But those consumers were never coming back online.
This was the root cause of the massive storage bloat.
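To see how much WAL each slot was pinning, a query along these lines helps; a sketch using the PostgreSQL 10+ function names pg_current_wal_lsn and pg_wal_lsn_diff:

```sql
-- For each slot, show whether it is active and roughly how much WAL
-- the server must keep around on its behalf.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```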
Next, I took the logical step of removing the inactive replication slots. As soon as I did, the disk usage dropped to just 40 GB, freeing up a massive amount of space instantly.
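Dropping an unused slot is a single function call per slot; the slot names below are placeholders, not the real names from our cluster:

```sql
-- Remove the stale logical slots; the WAL they were pinning becomes
-- eligible for recycling at the next checkpoint.
SELECT pg_drop_replication_slot('old_experiment_slot_1');  -- placeholder name
SELECT pg_drop_replication_slot('old_experiment_slot_2');  -- placeholder name
```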
I had achieved something great: a massive reduction in storage. But that wasn’t my primary goal. My real objective was to reduce system load.
Strangely, after removing the replication slots, Autovacuum didn’t run as frequently as before. While the outcome was exactly what I wanted, the "why" behind it kept bothering me. I needed to understand how this change had affected Autovacuum’s behavior.
The following graphs show the CPU usage and load average for the week. After removing the replication slots, CPU usage dropped from ~80% to <10% during business hours, and from ~35% to <1.5% during off-hours.
How and Why?
It was only later that I realized why Autovacuum had been running so aggressively before: it wasn’t actually cleaning up dead tuples.
To break it down, let’s go through the 5 Whys:
1. Why wasn’t Autovacuum removing dead tuples, even though it was running frequently?
The problem was caused by inactive replication slots. Replication had originally been set up on the `toolbox` database, and the leftover slots prevented Autovacuum from cleaning up dead tuples across the entire cluster.
2. How do replication slots impact Autovacuum?
Replication slots track a value called `catalog_xmin`, which represents the oldest transaction ID affecting system catalogs that must be retained.
When a replication slot is inactive, PostgreSQL holds onto WAL (Write-Ahead Log) entries for all transactions after the `catalog_xmin`. Because of this, Autovacuum cannot remove dead tuples, and they keep accumulating until the replication slot catches up or is removed.
3. How does a replication slot in one database block Autovacuum on another?
`catalog_xmin` exists at both the cluster level and the individual database level. Since it applies to the entire cluster, an inactive replication slot in one database can prevent Autovacuum from removing dead tuples in all databases (see the query sketch after this list).
This meant that Autovacuum was running but skipping cleanup, as the replication slot still "needed" those old tuples.
4. If Autovacuum couldn't remove dead tuples, how did a manual vacuum reclaim space?
A VACUUM FULL (manual vacuum) works differently:
- It creates a new physical table file and copies only live tuples into it.
- The old table file (containing dead tuples) is completely discarded, freeing up space.
- Unlike Autovacuum, VACUUM FULL ignores `catalog_xmin`, so it isn’t blocked by replication slots.
This explains why manual vacuuming instantly freed up disk space, while Autovacuum had been ineffective.
5. If VACUUM FULL is so effective, why isn't it recommended?
VACUUM FULL is a blocking operation: it takes an exclusive lock on the table and can cause performance issues on production systems. This is why it's not a standard solution for routine maintenance.
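As mentioned above, here’s the kind of query I’d use to spot this situation; a sketch that reads the transaction-ID horizons straight from pg_replication_slots:

```sql
-- Inspect the xmin horizons each slot is holding back.
-- age() reports how many transactions behind the current counter a horizon is;
-- a large, ever-growing age on an inactive slot is the warning sign.
SELECT slot_name,
       active,
       xmin,
       catalog_xmin,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```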
The Key Takeaway
Inactive replication slots had been blocking Autovacuum from doing its job, causing a buildup of dead tuples and high system load. Once the replication slots were removed, Autovacuum could work properly again, leading to drastically lower CPU usage and a much healthier database.
About Me
I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.
Feel free to connect with me on LinkedIn to learn more about my professional journey and projects.