DEV Community

leo

GUC parameter settings affecting disaster recovery performance indicators (openGauss)

GUC parameter settings affecting disaster recovery performance indicators
The impact of checkpoint related parameter settings
The disaster recovery performance indicators described in the feature specifications are all measured when the checkpoint-related parameters are set to default values.
For descriptions of checkpoint-related parameters, see "GUC Parameter Description > Write-Ahead Log > Checkpoint" in the Developer Guide. When "enable_incremental_checkpoint" is on, the maximum interval between automatic WAL checkpoints is determined by "incremental_checkpoint_timeout". If you raise it above the default value, a large number of logs may need to be replayed when the instance restarts. The RTO disaster recovery indicator then grows and may no longer meet the feature specifications.
The impact of Extreme RTO related parameter settings
For descriptions of the parameters related to Extreme RTO, see recovery_parse_workers and recovery_redo_workers in the "GUC Parameter Description > Write-Ahead Log > Log Replay" section of the Developer Guide. Before enabling Extreme RTO, make sure that the number of logical CPUs on each machine is greater than the number of additional threads started once Extreme RTO is enabled. The calculation formula is (recovery_parse_workers * (recovery_redo_workers + 2) + 5) * number of DN instances on each machine. Otherwise, threads may compete for CPU resources, so that some operations in the disaster recovery process take longer and the disaster recovery feature specifications are not met.
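
The thread formula above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the worker counts in the usage note are illustrative values, not recommended settings.

```python
def extreme_rto_extra_threads(recovery_parse_workers: int,
                              recovery_redo_workers: int,
                              dn_instances_per_machine: int) -> int:
    """Additional threads started on one machine when Extreme RTO is enabled.

    Implements the formula quoted in the text:
    (recovery_parse_workers * (recovery_redo_workers + 2) + 5) * DN instances per machine
    """
    per_dn = recovery_parse_workers * (recovery_redo_workers + 2) + 5
    return per_dn * dn_instances_per_machine


def has_enough_cpus(logical_cpus: int, extra_threads: int) -> bool:
    """The text requires strictly more logical CPUs than additional threads."""
    return logical_cpus > extra_threads
```

For example, with the illustrative values recovery_parse_workers = 2, recovery_redo_workers = 4 and one DN instance per machine, the formula gives 2 * (4 + 2) + 5 = 17 additional threads, so the machine would need at least 18 logical CPUs. On a real host, `os.cpu_count()` can supply the logical CPU count.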

Basic operations
Disaster Recovery
Business load assessment of the primary database instance before disaster recovery is built
Data volume:

The amount of data stored in the primary database instance determines how much data must be transmitted to build disaster recovery. Together with the bandwidth of the remote network, it determines how long the build takes. You can set the timeout in "time-out" of the disaster recovery interface; the current default value is 20 minutes. Choose the timeout based on the amount of data in the primary database instance before disaster recovery is built and on the bandwidth available to the remote site. The calculation formula is "data amount / transmission rate = time consumed".

For example, if the primary database instance holds 100 TB of data and the available bandwidth between the remote database instances is 512 Mbps (a transmission rate of 64 MB/s), building disaster recovery and transferring the data takes 1638400 s (100*1024*1024/64, about 19 days).
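
The same arithmetic can be written as a small helper for estimating a suitable timeout. This is a minimal sketch with hypothetical function names; it assumes 1 TB = 1024 * 1024 MB and 8 bits per byte, as the worked example does, and ignores protocol overhead and bandwidth fluctuation.

```python
def mbps_to_mb_per_s(bandwidth_mbps: float) -> float:
    """Convert a link speed in megabits per second to megabytes per second."""
    return bandwidth_mbps / 8


def build_transfer_seconds(data_tb: float, bandwidth_mbps: float) -> float:
    """Estimated transfer time: data amount / transmission rate."""
    data_mb = data_tb * 1024 * 1024
    return data_mb / mbps_to_mb_per_s(bandwidth_mbps)


seconds = build_transfer_seconds(100, 512)  # 1638400.0
days = seconds / 86400                      # a little under 19 days
```

Because the default timeout is 20 minutes, any build whose estimate exceeds 1200 seconds needs a larger timeout value.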

Log generation rate:

This value determines how many logs the primary database instance must keep locally while disaster recovery is being built. The disaster recovery database instance establishes a streaming replication relationship with the primary database instance after it completes full data recovery. If these logs are not retained, establishing the streaming replication relationship may fail.

For example: if the calculation shows that the build will take 2 days, the logs generated during those 2 days must be kept on the local disk of the primary database instance until the build is complete.
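
To size the local disk for this, multiply the log generation rate by the build duration. The sketch below uses a hypothetical function name, and the 10 MB/s log rate is an assumed figure for illustration only; measure the real rate on your own primary database instance.

```python
def retained_log_gb(log_rate_mb_per_s: float, build_days: float) -> float:
    """Log volume (in GB) generated on the primary while the build is running."""
    build_seconds = build_days * 86400
    return log_rate_mb_per_s * build_seconds / 1024


# Assumed log rate of 10 MB/s over the 2-day build from the example above.
needed_gb = retained_log_gb(10, 2)  # 1687.5
```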

If the log generation rate of the primary database instance is greater than the remote transmission bandwidth, or if the bandwidth is sufficient but the log generation rate is greater than the log replay rate of the disaster recovery database instance, the scenario is over-spec. In that case, RPO/RTO cannot be kept at the feature specification level after disaster recovery is built.
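
The two over-spec conditions can be expressed as a simple check. This is a sketch under the assumption that all three rates have been measured in the same unit (MB/s here); the function name and the returned messages are hypothetical.

```python
from typing import Optional


def over_spec_reason(log_rate: float,
                     bandwidth: float,
                     replay_rate: float) -> Optional[str]:
    """Return why the workload is over-spec, or None if it is within spec.

    All arguments must use the same unit, for example MB/s.
    """
    if log_rate > bandwidth:
        return "log generation rate exceeds remote transmission bandwidth"
    # Bandwidth is sufficient, so replay on the disaster recovery side is the limit.
    if log_rate > replay_rate:
        return "log generation rate exceeds log replay rate"
    return None
```

A return value of None means neither condition applies. It does not by itself guarantee that RPO/RTO meet the specification, since other factors such as the GUC settings discussed earlier also matter.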

Disaster recovery building call interface
When building disaster recovery, you need to send a build request to both the primary and standby database instances. Refer to the gs_sdr tool in the "Tool Reference".

Notice:

When building disaster recovery, the primary database instance and the disaster recovery database instance must use the same disaster recovery user name and password to authenticate with each other. The user must have the Replication permission (the Replication attribute is a specific role and is only used for replication).
Before setting up disaster recovery, you need to create the disaster recovery user in the primary cluster.
After disaster recovery is set up, the user's password cannot be modified; it stays in use for the entire disaster recovery life cycle. To change the disaster recovery user name or password, cancel the disaster recovery and build it again with a new disaster recovery user.
The timeout can be set in "time_out" during the disaster recovery build, and the current default value is 20 minutes. Choose the timeout based on the amount of data in the primary database instance before disaster recovery is built and on the bandwidth available to the remote site. The calculation formula is "data amount / transmission rate = time consumed". For example, if the primary database instance holds 100 TB of data and the available bandwidth between the remote database instances is 512 Mbps (a transmission rate of 64 MB/s), building disaster recovery and transferring the data takes 1638400 s (100*1024*1024/64, about 19 days).
Promoting a disaster recovery database instance to primary (failover)
To send a request to promote the disaster recovery database instance to primary, refer to the gs_sdr tool in the "Tool Reference".

Notice:

The disaster recovery information is cleared after the disaster recovery database instance is promoted to primary.
The disaster recovery database instance can execute this command even if the primary database instance is in a normal state and is processing business, because the command actively cancels the disaster recovery relationship. After the command is issued, the disaster recovery database instance no longer receives logs from the primary database instance. The RPO value of the disaster recovery indicator therefore keeps increasing until the primary and standby database instances are disconnected, after which the RPO value is empty. For querying the RPO value, see Querying the Disaster Recovery Status of the Primary and Standby Database Instances.
Clearing the disaster recovery information of the primary database instance
To send a request for clearing disaster recovery information to the primary database instance, refer to the gs_sdr tool in the Tool Reference.

Notice:

This operation will clear the disaster recovery information of the primary database instance.
This operation can only be performed on the primary database instance after the disaster recovery database instance has been promoted to primary. Performing it before the disaster recovery database instance has been promoted destroys the disaster recovery relationship.
