Node Self-Healing
Function Introduction
Once the cluster is running, users do not need to watch node status themselves: when a node goes down abnormally or hits a similar fault, the self-healing module automatically tries to repair it.
Start Self-Healing
After cluster initialization, the self-healing module is off by default. Start the doctor manually in adbmgr:
postgres=# start doctor;
mgr_doctor_start
------------------
t
(1 row)
Use the list doctor command to view the doctor parameters and the components it monitors:
postgres=# list doctor;
type | subtype | key | value | comment
-----------+-----------------+----------------+-------+--------------------------------------------------------------------------------------------------------------
PARAMETER | -- | enable | 1 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | forceswitch | 0 | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss.
PARAMETER | -- | switchinterval | 30 | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
PARAMETER | -- | nodedeadline | 30 | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
PARAMETER | -- | agentdeadline | 5 | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
NODE | gtmcoord master | gcn1 | t | enable doctor
NODE | gtmcoord slave | gcn2 | t | enable doctor
NODE | coordinator | cn1 | t | enable doctor
NODE | coordinator | cn2 | t | enable doctor
NODE | coordinator | cn3 | t | enable doctor
NODE | datanode master | dn1_1 | t | enable doctor
NODE | datanode master | dn2_1 | t | enable doctor
NODE | datanode master | dn3_1 | t | enable doctor
NODE | datanode slave | dn1_2 | t | enable doctor
NODE | datanode slave | dn1_3 | t | enable doctor
NODE | datanode slave | dn2_2 | t | enable doctor
NODE | datanode master | dn4_1 | t | enable doctor
NODE | coordinator | cn4 | t | enable doctor
NODE | datanode slave | dn3_2 | t | enable doctor
NODE | datanode slave | dn4_2 | t | enable doctor
HOST | -- | adb01 | t | enable doctor
HOST | -- | adb02 | t | enable doctor
(22 rows)
From the results, you can see that the types of components monitored by self-healing include:
node: each node in the cluster.
host: the agent process on each host in the cluster.
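The grouping described above can be sketched in Python. This is only an illustration that parses the text of a list doctor result, assuming the pipe-separated layout shown earlier; the function name parse_list_doctor and the shortened SAMPLE rows are hypothetical and not part of adbmgr.

```python
from collections import defaultdict

# A few rows copied from the `list doctor` output above (shortened sample).
SAMPLE = """\
type | subtype | key | value | comment
PARAMETER | -- | enable | 1 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | nodedeadline | 30 | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
NODE | gtmcoord master | gcn1 | t | enable doctor
NODE | datanode slave | dn1_2 | t | enable doctor
HOST | -- | adb01 | t | enable doctor
(22 rows)
"""

def parse_list_doctor(text):
    """Group `list doctor` rows by their type column (hypothetical helper)."""
    groups = defaultdict(dict)
    for line in text.splitlines():
        # Split into at most 5 columns so a '|' inside the comment is kept intact.
        cols = [c.strip() for c in line.split("|", 4)]
        if len(cols) != 5 or cols[0] not in ("PARAMETER", "NODE", "HOST"):
            continue  # skip the header, separator, and row-count lines
        row_type, subtype, key, value, _comment = cols
        groups[row_type][key] = {"subtype": subtype, "value": value}
    return dict(groups)

result = parse_list_doctor(SAMPLE)
print(sorted(result))                            # row types found in the sample
print(result["PARAMETER"]["enable"]["value"])    # current value of `enable`
print(list(result["NODE"]))                      # monitored node names
```

Grouping by the first column keeps the doctor parameters separate from the monitored NODE and HOST entries, which makes it easy to check a single setting such as enable.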
Once started, adbmgr runs several doctor processes: a launcher, one node monitor per node, and a host monitor:
[antdb@intel175 ~]$ ps xuf|grep doctor
antdb 193328 0.0 0.0 112716 984 pts/46 S+ 16:02 0:00 | \_ grep --color=auto doctor
antdb 134782 0.0 0.0 359748 7808 ? Ss 14:48 0:02 \_ adbmgr: antdb doctor launcher
antdb 137154 0.0 0.0 358944 6836 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor gcn1
antdb 137155 0.0 0.0 358948 6828 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor gcn2
antdb 137157 0.0 0.0 358948 6852 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn1
antdb 137159 0.0 0.0 358948 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn2
antdb 137163 0.0 0.0 358948 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn3
antdb 137165 0.0 0.0 358948 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn1_1
antdb 137167 0.0 0.0 358948 6872 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn2_1
antdb 137169 0.0 0.0 358952 6856 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn3_1
antdb 137172 0.0 0.0 358952 6860 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn1_2
antdb 137175 0.0 0.0 358952 6860 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn1_3
antdb 137177 0.0 0.0 358952 6860 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn2_2
antdb 137180 0.0 0.0 358952 6848 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn4_1
antdb 137183 0.0 0.0 358952 6856 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor cn4
antdb 137186 0.0 0.0 358956 6852 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn3_2
antdb 137189 0.0 0.0 358956 6852 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor dn4_2
antdb 137191 0.0 0.0 358948 5888 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor host monitor
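To check that every monitored node has its own monitor process, the ps listing can be sorted by role. The following Python sketch assumes the process titles follow the "adbmgr: <name> doctor ..." pattern seen in the listing above; the regular expression and the helper name classify_doctor_processes are illustrative assumptions, not an AntDB interface.

```python
import re

# A few lines copied from the `ps xuf | grep doctor` output above.
PS_SAMPLE = r"""
antdb 193328 0.0 0.0 112716 984 pts/46 S+ 16:02 0:00 | \_ grep --color=auto doctor
antdb 134782 0.0 0.0 359748 7808 ? Ss 14:48 0:02 \_ adbmgr: antdb doctor launcher
antdb 137154 0.0 0.0 358944 6836 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor node monitor gcn1
antdb 137191 0.0 0.0 358948 5888 ? Ss 14:49 0:00 \_ adbmgr: antdb doctor host monitor
"""

# Assumed title layout: "adbmgr: <name> doctor <role> [<node>]".
DOCTOR_TITLE = re.compile(
    r"adbmgr: \S+ doctor (launcher|node monitor|host monitor)(?: (\S+))?\s*$"
)

def classify_doctor_processes(ps_text):
    """Return {role: [(pid, node_name_or_None), ...]} for doctor processes."""
    found = {"launcher": [], "node monitor": [], "host monitor": []}
    for line in ps_text.splitlines():
        match = DOCTOR_TITLE.search(line)
        if match is None:
            continue  # not a doctor process, e.g. the grep command itself
        pid = int(line.split()[1])  # second column of `ps u` output is the PID
        role, node = match.groups()
        found[role].append((pid, node))
    return found

procs = classify_doctor_processes(PS_SAMPLE)
for role, entries in procs.items():
    print(role, entries)
```

Matching on the adbmgr process title rather than on the word doctor alone keeps the grep command's own line out of the result.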
Turn Off Self-Healing
Execute stop doctor; in adbmgr to shut down self-healing.
postgres=# stop doctor;
NOTICE: Update pgxc_node successfully in 'gcn1'.
NOTICE: Update pgxc_node successfully in 'cn1'.
NOTICE: Update pgxc_node successfully in 'cn2'.
NOTICE: Update pgxc_node successfully in 'cn3'.
NOTICE: Update pgxc_node successfully in 'cn4'.
NOTICE: Updating pgxc_node successfully at all datanode master.
mgr_doctor_stop
-----------------
t
(1 row)
postgres=# list doctor;
type | subtype | key | value | comment
-----------+-----------------+----------------+-------+--------------------------------------------------------------------------------------------------------------
PARAMETER | -- | enable | 0 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | forceswitch | 1 | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss.
PARAMETER | -- | switchinterval | 10 | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching.
PARAMETER | -- | nodedeadline | 10 | In seconds. The maximum time for doctor tolerate a NODE running abnormally.
PARAMETER | -- | agentdeadline | 5 | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
NODE | gtmcoord master | gcn1 | t | enable doctor
NODE | gtmcoord slave | gcn2 | t | enable doctor
NODE | coordinator | cn1 | t | enable doctor
NODE | coordinator | cn2 | t | enable doctor
NODE | coordinator | cn3 | t | enable doctor
NODE | datanode master | dn1_1 | t | enable doctor
NODE | datanode master | dn2_1 | t | enable doctor
NODE | datanode master | dn3_1 | t | enable doctor
NODE | datanode slave | dn1_2 | t | enable doctor
NODE | datanode slave | dn1_3 | t | enable doctor
NODE | datanode slave | dn2_2 | t | enable doctor
NODE | datanode master | dn4_1 | t | enable doctor
NODE | coordinator | cn4 | t | enable doctor
NODE | datanode slave | dn3_2 | t | enable doctor
NODE | datanode slave | dn4_2 | t | enable doctor
HOST | -- | adb01 | t | enable doctor
HOST | -- | adb02 | t | enable doctor
(22 rows)
[antdb@intel175 ~]$ ps xuf|grep doctor
antdb 2435 0.0 0.0 112716 984 pts/46 S+ 16:09 0:00 | \_ grep --color=auto doctor
After stop doctor is executed, the enable parameter of the doctor is 0 and no doctor processes remain.
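Both conditions can be checked together. The Python sketch below works on captured text from list doctor and ps, under the assumption that the output layouts match the samples above; doctor_is_stopped is a hypothetical helper name.

```python
# Captured after `stop doctor;` (shortened copies of the output above).
LIST_AFTER_STOP = """\
PARAMETER | -- | enable | 0 | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit.
PARAMETER | -- | agentdeadline | 5 | In seconds. The maximum time for doctor tolerate a AGENT running abnormally.
"""
PS_AFTER_STOP = r"""
antdb 2435 0.0 0.0 112716 984 pts/46 S+ 16:09 0:00 | \_ grep --color=auto doctor
"""

def doctor_is_stopped(list_doctor_text, ps_text):
    """True when `enable` is 0 and no adbmgr doctor process is listed."""
    enable = None
    for line in list_doctor_text.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) >= 4 and cols[0] == "PARAMETER" and cols[2] == "enable":
            enable = cols[3]
    # Only adbmgr process titles count; the grep line also contains "doctor".
    running = [
        line for line in ps_text.splitlines()
        if "adbmgr:" in line and " doctor " in line
    ]
    return enable == "0" and not running

print(doctor_is_stopped(LIST_AFTER_STOP, PS_AFTER_STOP))
```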