Mustafa ERBAY

Posted on Jun 2 • Originally published at mustafaerbay.com.tr

Database WAL Bloat: The Hidden Culprit Behind Storage Nightmares

#career #postgres #database #sistemyonetimi

Database WAL Bloat: Behind the Scenes of Unexpected Storage Problems

While optimizing a production ERP, I encountered an sudden exhaustion of disk space on the database server. My initial thoughts were an excessive accumulation of logs or forgotten backup files. However, upon deeper investigation, I discovered the root cause was an unexpected "bloat" in PostgreSQL's Write-Ahead Logging (WAL) mechanism. This situation not only creates disk space issues but can also lead to performance degradation and even make the system inaccessible.

WAL is the cornerstone of every database system. It ensures data integrity and recoverability by writing data changes to disk before they are made permanent. However, if WAL files are not cleaned up in a timely manner or are mismanaged, it leads to what we call "WAL bloat." In this post, I will explain why WAL bloat causes unexpected storage problems, how to detect this issue, and what strategies can be used to manage it, drawing on concrete examples from my own field experience.

What is WAL and Why is it Important?

In PostgreSQL, WAL is a critical component for ensuring ACID (Atomicity, Consistency, Isolation, Durability) properties. When a data change (INSERT, UPDATE, DELETE) occurs, this change is not written directly to the data files. Instead, it is first recorded in the WAL buffer and then "write-ahead" to WAL files. Subsequently, these changes are applied to the data files by background processes (e.g., bgwriter and checkpointer).

The main advantages of this approach are:

Durability: In the event of a system crash, WAL files contain the most recently committed transactions. When the system restarts, these WAL records can be replayed to restore the data files. Thus, all transactions made up to the point of the crash are preserved.
Performance: Sequential WAL writes are generally faster than random writes to data files. This improves the overall performance of database operations.
Replication: WAL records form the basis for physical replication. The WAL stream from the primary server is sent to secondary servers to keep the database synchronized.

WAL files are typically located in the pg_wal directory. These files are divided into "segments" when they reach a certain size or after a certain period. Normally, these segments are cleaned up or archived after being written to data files and used for replication. However, if this cleanup process is disrupted, the WAL bloat problem arises.

ℹ️ WAL Archiving and Cleanup

The management of WAL files is provided by two main mechanisms: WAL archiving and WAL cleanup. Archiving involves saving copies of WAL segments to a secure location (e.g., backup server or cloud storage). This is necessary for Point-in-Time Recovery (PITR). Cleanup, on the other hand, is the deletion of WAL segments from disk that are no longer needed. The correct functioning of these two processes is key to preventing WAL bloat.

Root Causes of WAL Bloat

WAL bloat usually arises not from a single cause, but from multiple interacting factors. Understanding these reasons is the first step to solving the problem at its source. Based on my field experience, the most common root causes I've encountered are:

1. Insufficient WAL Archiving or Cleanup Configuration

This is the most common cause of WAL bloat. If archive_mode and archive_command settings are not configured correctly, or if writing to the archive destination fails, PostgreSQL continues to create new WAL segments but does not archive or clean up old ones.

For example, in one project, we had set archive_command to a simple cp %p /mnt/wal_archive/%f. However, when the /mnt/wal_archive directory ran out of disk space, the cp command failed, and WAL segments began to accumulate. PostgreSQL did not delete the WAL segments because it could not archive them, and the pg_wal directory quickly filled up.

2. High Write Transaction Volume and Rapid WAL Segment Generation

When there are very intensive database write operations (INSERT, UPDATE, DELETE), WAL segments are generated very quickly. If replication or archiving cannot keep up with this speed, WAL files can start to accumulate. This situation can be observed especially during large data loads, bulk updates, or applications with high transaction rates.

During the holiday season for an e-commerce site, the average transaction volume of 100 transactions/second surged to 500 transactions/second. This intensity caused WAL segments to be generated much faster. Replication lag increased somewhat, and the WAL cleanup mechanism also struggled to keep up with this pace. As a result, the pg_wal directory grew from 50 GB to 200 GB within a few hours.

3. Replication Lag or Interruptions

When physical replication is used, secondary servers receive and apply WAL records from the primary server. If the replication connection is broken or the secondary server falls too far behind the primary, the primary server is forced to wait before deleting old WAL segments. This is because these segments are necessary for the secondary server to resynchronize.

In one project, due to a sudden network outage, the connection to the secondary PostgreSQL server was lost. The primary server continued to accumulate WAL segments instead of deleting them. After a 3-hour outage, the pg_wal directory on the primary server reached 150 GB instead of its usual 10 GB. When the connection was re-established, the secondary server had to process this large amount of WAL data, which imposed an additional performance burden.

4. Long-Running Transactions and `max_slot_wal_keep_size`

Logical replication or tools like pg_basebackup use "replication slots" to consume WAL segments. If a replication slot is active, PostgreSQL will not delete WAL files that need to be read by that slot. The max_slot_wal_keep_size parameter determines the maximum size that WAL files not consumed by a slot can occupy. If this size is exceeded, the primary server, which is not using the slot, may stop generating new WAL segments, which can lead to data loss. However, if this parameter is not set correctly, it can cause WAL files to accumulate even when the slot is active.

In a customer project, a logical replication slot was accidentally deleted. However, PostgreSQL did not immediately notice this and continued to accumulate WAL files belonging to the deleted slot. After a few hours, the pg_wal directory grew much larger than expected. This situation was an example of how critical slot management is.

⚠️ Considerations for max_slot_wal_keep_size

Setting max_slot_wal_keep_size too high increases the risk of WAL bloat. Setting it too low can prevent slots from "catching up" even during normal replication operations, leading to data loss. The setting of this parameter should be done carefully according to the system's overall replication strategy and the rate of WAL segment generation. Generally, it is recommended to set it to several times the amount of WAL produced during the system's busiest write hour.

Methods for Detecting WAL Bloat

Detecting WAL bloat is the first step to finding the root cause of the problem. Fortunately, PostgreSQL provides tools to help us with this. Here are the main methods I use:

1. Checking the `pg_wal` Directory Size

The simplest and most direct method is to check the size of the pg_wal directory. This directory is usually located within the PostgreSQL data directory.

# To find the PostgreSQL data directory (usually something like /var/lib/postgresql/<version>/main)
SHOW data_directory;

# Then check the size of the pg_wal directory
sudo du -sh /var/lib/postgresql/14/main/pg_wal

The output of this command will show you how much space the pg_wal directory is occupying. If this value is much higher than it should normally be, you might be facing a WAL bloat issue. Normally, the pg_wal directory should not exceed a few GBs (depending on system load).

2. Using the `pg_walfilecount` Function

In PostgreSQL 13 and later versions, the pg_walfilecount function returns the number of WAL files in the pg_wal directory. This is useful not only for disk space but also for understanding if the number of files in the file system has excessively increased.

SELECT count(*) FROM pg_walfilecount();

If this number is higher than expected, it could also be a sign of a problem.

3. Examining Replication Slot Status

Since replication slots can be a significant cause of WAL bloat, it's important to regularly check their status.

SELECT slot_name, plugin, slot_type, database, active, restart_lsn FROM pg_replication_slots;

This query provides information about active and inactive slots, the plugins they use, and the last read LSN (Log Sequence Number). If a slot does not appear active but is still consuming WAL files, this could be an issue.

4. Checking WAL Archiving Status

Is WAL archiving working correctly? Is the archiving command archive_command throwing errors? To find answers to these questions, you need to examine the PostgreSQL log files.

# The location of PostgreSQL log files is usually:
# /var/log/postgresql/postgresql-<version>-main.log

# Look for "archive command failed" or similar errors in the log files
sudo grep "archive command failed" /var/log/postgresql/postgresql-14-main.log

If the archiving command is constantly failing, this will cause WAL segments to accumulate.

💡 Lifespan of WAL Segments

In PostgreSQL, how long WAL segments are retained depends on various settings. wal_keep_size (or wal_keep_segments in older versions) limits the size of WAL files kept for replication. If physical replication is used, the primary server retains all WAL files that the secondary server might need to read. If logical replication and pg_basebackup are used, replication slots manage this process. WAL archiving ensures that WAL files are permanently stored for PITR.

Strategies for Resolving and Preventing WAL Bloat

When WAL bloat is detected, several strategies can be followed to resolve the problem and prevent its recurrence in the future. These strategies may vary depending on the root cause of the problem.

1. Correcting WAL Archiving and Cleanup Processes

If the problem stems from WAL archiving or cleanup processes not functioning correctly, the first step is to fix these mechanisms.

Check Archiving Destination: Ensure that the target directory of the archiving command has sufficient space and that write permissions are correct.
Test Archiving Command: Manually run the archive_command to check for errors.
Check archive_mode Setting: Ensure that your database's archive_mode setting is correct (it should be on or always, not off).

In one instance, we noticed that our WAL archive directory on the NAS device was full. The archive_command was failing. After adding sufficient space to the NAS device and updating the settings, the problem was resolved.

2. Addressing Replication Lags

If replication lag is causing WAL bloat, the focus should be on improving replication performance.

Optimize Network Connection: Check network latency and bandwidth between the primary and secondary servers.
Improve Secondary Server Performance: Enhance the secondary server's I/O performance, ensuring it has sufficient RAM.
Adjust wal_receiver_timeout Setting: This parameter determines how long the WAL receiver will wait for data before disconnecting. A very low value can cause unnecessary disconnections during temporary network issues.

In a production environment, we identified that the secondary server's disk I/O was insufficient. By switching to faster SSDs, we significantly reduced replication lag and stopped WAL accumulation.

3. Replication Slot Management

Proper management of replication slots is critical, especially when using logical replication.

Delete Unused Slots: Always delete replication slots that are no longer in use.
Carefully Configure max_slot_wal_keep_size: Adjust this parameter according to the system's WAL generation rate and replication requirements.

In an enterprise project, a development team created a logical slot for testing purposes and then forgot about it. This slot held WAL segments for 1 month, causing a severe disk space issue. The problem was resolved after finding and deleting the slot.

-- List active replication slots
SELECT slot_name, active FROM pg_replication_slots;

-- Delete an unused slot
SELECT pg_drop_replication_slot('slot_to_be_deleted');

4. Adjusting `wal_keep_size` (or `wal_keep_segments`)

When using physical replication, the wal_keep_size (or wal_keep_segments in older versions) parameter determines how much WAL data the primary server will retain. This value should be set based on how long the secondary server might be disconnected from the primary. However, remember that this setting alone will not solve WAL bloat; it primarily defines the replication's "recovery window."

5. Manually Cleaning WAL Files (Use with Caution!)

If WAL bloat has created an emergency and the above methods cannot be immediately applied, it may be necessary to manually clean WAL files. However, this is an extremely risky operation and can lead to data loss. This operation should only be performed when you are certain that the WAL files are no longer needed for replication or recovery and you fully understand the current state of the system.

Generally, to manually delete WAL segments, we find old files in the pg_wal directory and delete them with the rm command. However, if PostgreSQL still thinks these files are in use and we delete them, inconsistencies or data loss may occur in the system. Therefore, the safest way is to stop the database and ensure that the WAL segments are truly unnecessary before taking this step.

🔥 Risk of Manual WAL Cleanup

Manually deleting WAL files can severely jeopardize the stability of the database system. If a replication slot is active or a checkpoint has not yet completed, deleted WAL files can lead to data loss or corruption of replication. Perform this operation only when absolutely necessary and with in-depth knowledge. Generally, safer and more sustainable solutions exist for WAL bloat issues.

Impact of WAL Bloat on Performance

WAL bloat not only creates disk space problems but also negatively affects database performance. Here are its main impacts on performance:

Disk I/O Intensity: As the pg_wal directory grows, data read and write operations on the disk slow down. While the disk is busy managing WAL files, resources allocated for normal data operations decrease.
Extended Checkpoint Durations: The checkpointer process writes data pages to disk and cleans up the WAL log. When WAL files are very large, checkpoint operations take longer. This can cause brief system freezes.
Archiving Command Performance: As WAL segments grow, the time it takes for the archiving command to copy or move these segments also increases. This further slows down the archiving process, completing the cycle.
Increased Recovery Times: In the event of a system crash or restart, PostgreSQL needs to apply WAL files to bring the database to a consistent state. If there are too many WAL files, this recovery process can take hours.

In a customer project, due to WAL bloat, the primary server's response time increased by over 300%. Disk I/O was constantly above 95%, and users were complaining about slowness. After resolving WAL bloat, response times returned to normal, and disk I/O dropped to around 20%.

Conclusion: WAL Management Requires a Proactive Approach

WAL bloat in PostgreSQL is often overlooked but can have serious consequences. It can lead to unexpected storage problems, performance degradation, and even the risk of data loss. Therefore, understanding the WAL mechanism, detecting potential problems early, and taking proactive measures are vital.

My field experience shows that the root causes of WAL bloat often stem from simple configuration errors, insufficient monitoring, or improper management of heavy system loads. By implementing the detection and resolution strategies discussed in this post, you can ensure the stability and performance of your database systems. Remember, database management is not just about writing queries; understanding the underlying infrastructure and mechanisms is a sign of true professionalism.

As I mentioned in my previous post [related: PostgreSQL index strategies], database optimization is a continuous process. WAL management is an integral part of this process.

DEV Community

Database WAL Bloat: The Hidden Culprit Behind Storage Nightmares

Database WAL Bloat: Behind the Scenes of Unexpected Storage Problems

What is WAL and Why is it Important?

Root Causes of WAL Bloat

1. Insufficient WAL Archiving or Cleanup Configuration

2. High Write Transaction Volume and Rapid WAL Segment Generation

3. Replication Lag or Interruptions

4. Long-Running Transactions and `max_slot_wal_keep_size`

Methods for Detecting WAL Bloat

1. Checking the `pg_wal` Directory Size

2. Using the `pg_walfilecount` Function

3. Examining Replication Slot Status

4. Checking WAL Archiving Status

Strategies for Resolving and Preventing WAL Bloat

1. Correcting WAL Archiving and Cleanup Processes

2. Addressing Replication Lags

3. Replication Slot Management

4. Adjusting `wal_keep_size` (or `wal_keep_segments`)

5. Manually Cleaning WAL Files (Use with Caution!)

Impact of WAL Bloat on Performance

Conclusion: WAL Management Requires a Proactive Approach

Top comments (0)

Database WAL Bloat: Behind the Scenes of Unexpected Storage Problems

What is WAL and Why is it Important?

Root Causes of WAL Bloat

1. Insufficient WAL Archiving or Cleanup Configuration

2. High Write Transaction Volume and Rapid WAL Segment Generation

3. Replication Lag or Interruptions

4. Long-Running Transactions and max_slot_wal_keep_size

Methods for Detecting WAL Bloat

1. Checking the pg_wal Directory Size

2. Using the pg_walfilecount Function

3. Examining Replication Slot Status

4. Checking WAL Archiving Status

Strategies for Resolving and Preventing WAL Bloat

1. Correcting WAL Archiving and Cleanup Processes

2. Addressing Replication Lags

3. Replication Slot Management

4. Adjusting wal_keep_size (or wal_keep_segments)

5. Manually Cleaning WAL Files (Use with Caution!)

Impact of WAL Bloat on Performance

Conclusion: WAL Management Requires a Proactive Approach

4. Long-Running Transactions and `max_slot_wal_keep_size`

1. Checking the `pg_wal` Directory Size

2. Using the `pg_walfilecount` Function

4. Adjusting `wal_keep_size` (or `wal_keep_segments`)