francotel

Posted on Jul 24

🚀 Tuning PostgreSQL Replication Parameters in Amazon RDS and Aurora

#devops #rds #aws #postgressql

📈 Importance and Growth of PostgreSQL

PostgreSQL is renowned for its robustness, extensibility, and standards compliance. As an open-source relational database system, it has seen significant growth and adoption across various industries. Major organizations leverage PostgreSQL for its advanced features such as support for JSON, powerful indexing, and support for complex queries.

In PostgreSQL, you can replicate data changes from one PostgreSQL database to another by using logical replication instead of physical, file-based replication. Logical replication uses the write-ahead log (WAL) to capture changes and supports the replication of selected tables or entire databases.

Amazon RDS for PostgreSQL and Aurora PostgreSQL-Compatible both support logical replication, so you can set up a highly available and scalable database architecture that can handle read and write traffic from multiple sources. These services use pglogical, an open-source PostgreSQL extension, to implement logical replication.

Tuning logical replication in Aurora and Amazon RDS is important for achieving optimal performance, scalability, and availability. You can tune the parameters in the pglogical extension to manage the performance of logical replication. For example, you can:

🚀 Improve Performance: Increase the number of worker processes or adjust their memory allocation.
⏳ Reduce Replication Lag: Adjust the frequency of synchronization between the source and replica databases.
📈 Optimize Resource Use: Adjust the memory and CPU allocation of the worker processes.
💡 Minimize Impact: Ensure the replication process does not unduly impact the performance of the source database.

🔧 Key Parameters for Tuning

You can use the following parameters in Aurora and Amazon RDS to control and configure logical replication:

max_replication_slots: Sets the maximum number of replication slots that can be created on the server. A replication slot is a named, persistent reservation for a replication connection to send WAL data to a replica.
max_wal_senders: Sets the maximum number of simultaneously connected WAL sender processes. WAL sender processes are used to stream the WAL from the primary server to the replica.
wal_sender_timeout: Sets the maximum time, in milliseconds, that a WAL sender waits for a response from the replica before giving up and reconnecting.
wal_receiver_timeout: Sets the maximum time, in milliseconds, that a replica waits for WAL data from the primary database before timing out.
log_replication_commands: When set to on, runs the replication-related SQL statements.

When you enable the rds.logical_replication parameter (by setting it to 1), the wal_level parameter is set to logical, which means that all changes made to the database are written to the WAL in a format that can be read and applied to a replica. This setting is required to enable logical replication. This setting also allows for the replication of SELECT statements.

Setting wal_level to logical can increase the amount of data written to the WAL, and therefore to disk, which can affect system performance. We recommend that you consider available disk space and system performance when enabling logical replication.

🔍 Example Configuration

You want to replicate data from your primary database to a secondary database for backup and disaster recovery purposes. However, the secondary database has a high volume of read operations, so you want to make sure that the replication process is as fast and efficient as possible without compromising data integrity.

The default values for logical replication in Amazon RDS and Aurora prioritize consistency over performance, so they might not be optimal for this use case. To optimize your logical replication settings for speed and efficiency, you can customize the parameters as follows:

Increase max_replication_slots from 10 (default for Amazon RDS) or 20 (default for Aurora) to 30 to accommodate potential future growth and replication needs.
Increase max_wal_senders from 10 (default) to 20 to ensure that there are enough WAL sender processes to keep up with replication demand.
Decrease wal_sender_timeout from 30 seconds (default) to 15 seconds to ensure that idle WAL sender processes are terminated more quickly, which frees up resources for active replication.
Decrease wal_receiver_timeout from 30 seconds (default) to 15 seconds to ensure that idle WAL receiver processes are terminated more quickly, which frees up resources for active replication.
Increase max_logical_replication_workers from 4 (default) to 8 to ensure that there are enough logical replication worker processes to keep up with replication demand.

These optimizations provide faster and more efficient data replication while maintaining data integrity and security.

For example, if a disaster were to occur and the primary database became unavailable, the secondary database would already have the latest data available due to the optimized replication process. This would enable your business operations to continue providing critical services without interruption.

1. Check Current Replication

SELECT name, setting FROM pg_settings WHERE name in 
('rds.logical_replication', 'max_replication_slots', 
'max_wal_senders', 'max_logical_replication_workers', 
'max_worker_processes', 'wal_level');

              name               | setting
---------------------------------+---------
 max_logical_replication_workers | 4
 max_replication_slots           | 20
 max_wal_senders                 | 20
 max_worker_processes            | 8
 rds.logical_replication         | off
 wal_level                       | replica

2. To enable logical replication in Amazon Aurora PostgreSQL:

Set the Parameter: Change the parameter rds.logical_replication to 1.

Rebooting an Aurora Cluster (1 Writer, 1 Reader)

Reboot the Reader Instance:
- Reboot the reader instance first to minimize impact on read operations.
Reboot the Writer Instance:
- Once the reader instance is back online, proceed to reboot the writer instance.

Check Post Configuration

SELECT name, setting FROM pg_settings WHERE name in 
('rds.logical_replication', 'max_replication_slots', 
'max_wal_senders', 'max_logical_replication_workers', 
'max_worker_processes', 'wal_level');

              name               | setting
---------------------------------+---------
 max_logical_replication_workers | 4
 max_replication_slots           | 20
 max_wal_senders                 | 20
 max_worker_processes            | 8
 rds.logical_replication         | on
 wal_level                       | logical

Key Points

Failover Handling: Aurora automatically handles failover, redirecting read traffic to the remaining instance during the writer reboot.
Order of Reboot: Always reboot the reader first, followed by the writer.
Minimize Downtime: Schedule the reboot during a maintenance window or low-traffic period and monitor the process through the RDS console and CloudWatch metrics.

🛠️ Best Practices

Tuning logical replication with huge workloads can be a complex task that depends on a variety of factors, including the size of the dataset, the number of tables being replicated, the number of replicas, and the available resources. Here are a few general tips for tuning logical replication with huge workloads:

Monitor Replication Lag: The replication lag is the time difference between the primary server and standby servers. Monitoring replication lag can help you identify potential bottlenecks and take action to improve replication performance. You can use the pg_current_wal_lsn() function to check the current replication lag.
Tune WAL Settings: The pg_logical extension uses WAL to transmit changes from the primary server to the standby server. If the WAL settings aren't tuned properly, replication can become slow and unreliable. Make sure to set the max_wal_senders and max_replication_slots parameters to adequate values, depending on your workloads.
Have an Indexing Strategy: Having proper indexes on the primary server can help improve the performance of the logical replication, reduce the I/O on the primary server, and reduce the load on the system.
Use Parallel Replication: Using parallel replication can help increase replication speed by allowing multiple parallel worker processes to replicate data. This feature is available on PostgreSQL 12 and later.

🔍 Key Metrics to Monitor in CloudWatch

Replication Metrics

📉 AuroraReplicaLag

Description: Measures the replication lag time in milliseconds between the primary instance and the replicas.
Importance: Helps identify synchronization issues between the replicas and the primary instance.
Suggested Threshold: Set an alarm if the lag exceeds 1000 ms.

📈 AuroraReplicaLagMaximum

Description: Measures the maximum replication lag time in milliseconds among Aurora replicas.
Importance: Similar to AuroraReplicaLag, but focused on the highest observed lag.
Suggested Threshold: Set an alarm if the maximum lag exceeds 1000 ms.

Feel free to reach out if you have any questions or need further assistance with Aurora PostgreSQL or logical replication! 😊🔧 You can also follow me for more insights and tips on AWS and PostgreSQL. Let's connect and contribute to the community! 🚀💬

DEV Community