One day, I noticed something odd with our Postgres instance. It was experiencing intermittent performance spikes, load averages were hitting unusually high levels, and I/O utilization occasionally maxed out at 100%. Most workloads were running smoothly, but these spikes felt off. So, I put on my debugging hat and got to work.
Investigating the Spikes
While analyzing the running queries, I discovered that Autovacuum was actively running on the tasks database. This database stores scheduled tasks from one of our services, which means it sees a high volume of write activity. Seeing Autovacuum run there wasn’t surprising in itself, and I suspected it was the main contributor to the performance issue.
To mitigate the problem, we decided to move the tasks database to a new VM so that the other systems would not be impacted.
Once the migration was complete, the results were immediate and significant. Below are the before-and-after graphs.
The graphs showed a significant improvement: I/O utilization no longer spiked beyond 40%, and the overall average dropped below 15%. A big win, right? Well, yes and no. While the system was in much better shape, I still wasn’t fully satisfied.
I also noticed occasional Autovacuum processes running during business hours. While they weren’t as aggressive as they had been for the tasks database, they still contributed to some system load. This got me thinking: could I optimize things even further?
With the immediate issue resolved, I swapped my debugging hat for my analyzing glasses and dug deeper.
Optimizing Storage with Manual Vacuum
Since the tasks database had previously triggered frequent Autovacuum runs, I decided to run a manual VACUUM FULL on its tables. With the actual database now running on a different machine and the current one no longer in use, this seemed like a safe and logical step.
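Here’s roughly what that looks like; a minimal sketch assuming the partition names shown in the tables below:

```sql
-- Rewrite each partition into a fresh file, discarding dead tuples.
-- VERBOSE prints progress; ANALYZE refreshes planner statistics afterwards.
VACUUM (FULL, VERBOSE, ANALYZE) tasks_p1;
VACUUM (FULL, VERBOSE, ANALYZE) tasks_p2;
VACUUM (FULL, VERBOSE, ANALYZE) tasks_p3;
```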
Before running the vacuum, I took note of the storage usage across all databases on the VM:
| database_name | size |
| --- | --- |
| tasks | 42 GB |
| wickets | 12 GB |
| leash | 6833 MB |
| toolbox | 5857 MB |
| bases | 1126 MB |
That adds up to roughly 67.8 GB in total.
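For reference, the per-database sizes above can be pulled with a query along these lines, using the built-in size functions:

```sql
-- Report each non-template database and its size, largest first.
SELECT datname AS database_name,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE NOT datistemplate
ORDER BY pg_database_size(datname) DESC;
```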
Vacuuming tasks
Here’s a before-and-after comparison of the vacuum operation on the tasks database:
Before
| relname | n_live_tup | n_dead_tup | total_size | table_size | indexes_size |
| --- | --- | --- | --- | --- | --- |
| tasks_p1 | 1,026,187 | 176,574 | 13 GB | 4956 MB | 8211 MB |
| tasks_p2 | 1,023,375 | 57,710 | 13 GB | 4666 MB | 7748 MB |
| tasks_p3 | 1,020,129 | 12,369 | 12 GB | 4562 MB | 7782 MB |
After
| relname | n_live_tup | n_dead_tup | total_size | table_size | indexes_size |
| --- | --- | --- | --- | --- | --- |
| tasks_p1 | 1,026,187 | 0 | 1298 MB | 1138 MB | 140 MB |
| tasks_p2 | 1,023,375 | 0 | 1296 MB | 1137 MB | 140 MB |
| tasks_p3 | 1,020,129 | 0 | 1297 MB | 1136 MB | 140 MB |
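The numbers in both tables come from the statistics views and the size functions; a query roughly like this one reproduces the columns (a sketch, not necessarily the exact query I ran):

```sql
-- Live/dead tuple counts plus total, table, and index sizes, biggest first.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_table_size(relid))          AS table_size,
       pg_size_pretty(pg_indexes_size(relid))        AS indexes_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```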
The Shocking Discovery
Manual vacuuming cut our storage usage by 32 GB, reducing tasks from 36 GB to just 4 GB, an 89% reduction. Seeing such a drastic change made me wonder: What if I applied the same process to the other databases on this Postgres instance?
So, I did exactly that, and storage usage dropped from ~86% to 76%.
That’s when I stumbled upon something strange. All the databases combined used around 70 GB, but the metrics showed that the cluster was consuming nearly 1.06 TB of storage.
Something wasn’t adding up. Where was all that extra space being used?
So, I started digging into PostgreSQL’s data directory and quickly discovered a directory filled with WAL (Write-Ahead Log) files consuming the majority of the disk space. This didn’t make sense at all: our cluster wasn’t configured to retain WAL files indefinitely.
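If you want to check the WAL footprint without shelling into the box, a minimal sketch using pg_ls_waldir() works too; it has been available since PostgreSQL 10 and is restricted to superusers and members of pg_monitor:

```sql
-- Count WAL segment files and report their combined size.
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```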
Write-Ahead Logging (WAL) is a critical feature in PostgreSQL that ensures data integrity and durability. It records changes to the database before they are applied, allowing PostgreSQL to recover data and restore the database to its most recent state in case of a crash or failure.
Curious about why so many WAL files were piling up, I turned to Google for possible explanations. One potential culprit stood out: inactive replication slots.
A replication slot is a feature that ensures WAL (Write-Ahead Log) files are retained until a replica (standby server) or a logical consumer has processed them. It prevents WAL files from being removed before they are received by the subscriber, ensuring smooth replication.
To confirm this, I ran the following query to check for active replication slots:
```sql
SELECT * FROM pg_replication_slots;
```
Ideally, this query should have returned no rows. Instead, I found two logical replication slots, remnants of an old experiment that had never been removed. Since these slots were inactive, the cluster was holding onto WAL files indefinitely, waiting for the corresponding consumers to process them. But those consumers were never coming back online.
This was the root cause of the massive storage bloat.
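To see how much WAL each slot was pinning, a query along these lines helps; a sketch using the PostgreSQL 10+ function names pg_current_wal_lsn and pg_wal_lsn_diff:

```sql
-- For each slot, show whether it is active and roughly how much WAL
-- the server must keep around on its behalf.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```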
Next, I took the logical step of removing the inactive replication slots. As soon as I did, the disk usage dropped to just 40 GB, freeing up a massive amount of space instantly.
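Dropping an unused slot is a single function call per slot; the slot names below are placeholders, not the real names from our cluster:

```sql
-- Remove the stale logical slots; the WAL they were pinning becomes
-- eligible for recycling at the next checkpoint.
SELECT pg_drop_replication_slot('old_experiment_slot_1');  -- placeholder name
SELECT pg_drop_replication_slot('old_experiment_slot_2');  -- placeholder name
```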
I had achieved something great: a massive reduction in storage. But that wasn’t my primary goal. My real objective was to reduce system load.
Strangely, after removing the replication slots, Autovacuum didn’t run as frequently as before. While the outcome was exactly what I wanted, the "why" behind it kept bothering me. I needed to understand how this change had affected Autovacuum’s behavior.
The following graphs show the CPU usage and load average for the week. After removing the replication slots, CPU usage dropped from ~80% to <10% during business hours, and from ~35% to <1.5% during off-hours.
How and Why?
It was only later that I realized why Autovacuum had been running so aggressively before: it wasn’t actually cleaning up dead tuples.
To break it down, let’s go through the 5 Whys:
1. Why wasn’t Autovacuum removing dead tuples, even though it was running frequently?
The problem was caused by inactive replication slots. Replication had originally been set up on the `toolbox` database, and the leftover slots prevented Autovacuum from cleaning up dead tuples across the entire cluster.
2. How do replication slots impact Autovacuum?
Replication slots track a value called `catalog_xmin`, which represents the oldest transaction ID affecting system catalogs that must be retained.
When a replication slot is inactive, PostgreSQL holds onto WAL (Write-Ahead Log) entries for all transactions after the `catalog_xmin`. Because of this, Autovacuum cannot remove dead tuples, and they keep accumulating until the replication slot catches up or is removed.
3. How does a replication slot in one database block Autovacuum on another?
`catalog_xmin` exists at both the cluster level and the individual database level. Since it applies to the entire cluster, an inactive replication slot in one database can prevent Autovacuum from removing dead tuples in all databases (see the query sketch after this list).
This meant that Autovacuum was running but skipping cleanup, as the replication slot still "needed" those old tuples.
4. If Autovacuum couldn't remove dead tuples, how did a manual vacuum reclaim space?
A VACUUM FULL (manual vacuum) works differently:
- It creates a new physical table file and copies only live tuples into it.
- The old table file (containing dead tuples) is completely discarded, freeing up space.
- Unlike Autovacuum, VACUUM FULL ignores `catalog_xmin`, so it isn’t blocked by replication slots.
This explains why manual vacuuming instantly freed up disk space, while Autovacuum had been ineffective.
5. If VACUUM FULL is so effective, why isn't it recommended?
VACUUM FULL is a blocking operation: it takes an exclusive lock on the table and can cause performance issues on production systems. This is why it's not a standard solution for routine maintenance.
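As mentioned above, here’s the kind of query I’d use to spot this situation; a sketch that reads the transaction-ID horizons straight from pg_replication_slots:

```sql
-- Inspect the xmin horizons each slot is holding back.
-- age() reports how many transactions behind the current counter a horizon is;
-- a large, ever-growing age on an inactive slot is the warning sign.
SELECT slot_name,
       active,
       xmin,
       catalog_xmin,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```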
The Key Takeaway
Inactive replication slots had been blocking Autovacuum from doing its job, causing a buildup of dead tuples and high system load. Once the replication slots were removed, Autovacuum could work properly again, leading to drastically lower CPU usage and a much healthier database.
About Me
I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.
Feel free to connect with me on LinkedIn to learn more about my professional journey and projects.