AntDB

Manager Handbook for Distributed AntDB-T - P16

Node Self-Healing

Function Introduction
Once the cluster is running, users do not need to track node status during operation: the self-healing module automatically attempts to repair a node that goes down abnormally or otherwise enters an abnormal state.

Start Self-Healing
After cluster initialization, the self-healing module is off by default. Start the doctor manually in adbmgr:

postgres=# start doctor; 
 mgr_doctor_start  
------------------ 
 t 
(1 row) 
The list doctor command shows the doctor parameters and the components being monitored:
postgres=# list doctor; 
   type    |     subtype     |      key       | value |                                                   comment                                                     
-----------+-----------------+----------------+-------+-------------------------------------------------------------------------------------------------------------- 
 PARAMETER | --              | enable         | 1     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit. 
 PARAMETER | --              | forceswitch    | 0     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss. 
 PARAMETER | --              | switchinterval | 30    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching. 
 PARAMETER | --              | nodedeadline   | 30    | In seconds. The maximum time for doctor tolerate a NODE running abnormally. 
 PARAMETER | --              | agentdeadline  | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally. 
 NODE      | gtmcoord master | gcn1           | t     | enable doctor 
 NODE      | gtmcoord slave  | gcn2           | t     | enable doctor 
 NODE      | coordinator     | cn1            | t     | enable doctor 
 NODE      | coordinator     | cn2            | t     | enable doctor 
 NODE      | coordinator     | cn3            | t     | enable doctor 
 NODE      | datanode master | dn1_1          | t     | enable doctor 
 NODE      | datanode master | dn2_1          | t     | enable doctor 
 NODE      | datanode master | dn3_1          | t     | enable doctor 
 NODE      | datanode slave  | dn1_2          | t     | enable doctor 
 NODE      | datanode slave  | dn1_3          | t     | enable doctor 
 NODE      | datanode slave  | dn2_2          | t     | enable doctor 
 NODE      | datanode master | dn4_1          | t     | enable doctor 
 NODE      | coordinator     | cn4            | t     | enable doctor 
 NODE      | datanode slave  | dn3_2          | t     | enable doctor 
 NODE      | datanode slave  | dn4_2          | t     | enable doctor 
 HOST      | --              | adb01          | t     | enable doctor 
 HOST      | --              | adb02          | t     | enable doctor 
(22 rows) 
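The table printed by list doctor is plain aligned text, so it can be post-processed outside adbmgr. The following is a minimal sketch, assuming the output has been captured into a string; the function name parse_list_doctor and the shortened sample (four rows, comment column abbreviated) are illustrative and not part of AntDB.

```python
from collections import Counter

# Abbreviated sample in the same layout as the `list doctor` output above.
# The comment column is shortened here for readability.
SAMPLE = """
   type    |     subtype     |      key       | value | comment
-----------+-----------------+----------------+-------+---------
 PARAMETER | --              | enable         | 1     | 0:false, 1:true.
 NODE      | gtmcoord master | gcn1           | t     | enable doctor
 NODE      | datanode slave  | dn1_2          | t     | enable doctor
 HOST      | --              | adb01          | t     | enable doctor
(4 rows)
"""


def parse_list_doctor(text):
    """Turn psql-style aligned table text into a list of dicts (hypothetical helper)."""
    # Only the header and data rows contain '|'; the separator and footer do not.
    lines = [ln for ln in text.splitlines() if "|" in ln]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        # Split at most len(header) - 1 times so a '|' inside the comment survives.
        cells = [cell.strip() for cell in ln.split("|", len(header) - 1)]
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_list_doctor(SAMPLE)
by_type = Counter(row["type"] for row in rows)
print(dict(by_type))  # {'PARAMETER': 1, 'NODE': 2, 'HOST': 1}
```

Grouping by the type column makes it easy to see at a glance which rows are settings (PARAMETER) and which are monitored components (NODE, HOST).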

From the results, you can see that the types of components monitored by self-healing include:

  • node: each node in the cluster.

  • host: the agent process on each host in the cluster.

After the doctor is started, multiple doctor processes run under adbmgr:

[antdb@intel175 ~]$ ps xuf|grep doctor 
antdb   193328  0.0  0.0 112716   984 pts/46   S+   16:02   0:00  |   \_ grep --color=auto doctor 
antdb   134782  0.0  0.0 359748  7808 ?        Ss   14:48   0:02  \_ adbmgr: antdb doctor launcher    
antdb   137154  0.0  0.0 358944  6836 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn1    
antdb   137155  0.0  0.0 358948  6828 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn2    
antdb   137157  0.0  0.0 358948  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn1    
antdb   137159  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn2    
antdb   137163  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn3    
antdb   137165  0.0  0.0 358948  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_1    
antdb   137167  0.0  0.0 358948  6872 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn2_1    
antdb   137169  0.0  0.0 358952  6856 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn3_1    
antdb   137172  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_2    
antdb   137175  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn1_3    
antdb   137177  0.0  0.0 358952  6860 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn2_2    
antdb   137180  0.0  0.0 358952  6848 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn4_1    
antdb   137183  0.0  0.0 358952  6856 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn4    
antdb   137186  0.0  0.0 358956  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn3_2    
antdb   137189  0.0  0.0 358956  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor dn4_2    
antdb   137191  0.0  0.0 358948  5888 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor host monitor  
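In the listing above there is one launcher, one node monitor per monitored node, and a single host monitor. A rough way to confirm that from captured ps output is sketched below. It assumes the process titles follow the "adbmgr: <user> doctor <role> [node]" shape seen in the listing; the regular expression and the classify helper are illustrative, not an AntDB interface.

```python
import re
from collections import Counter

# A few lines in the same shape as the `ps xuf | grep doctor` output above.
PS_SAMPLE = r"""
antdb   193328  0.0  0.0 112716   984 pts/46   S+   16:02   0:00  |   \_ grep --color=auto doctor
antdb   134782  0.0  0.0 359748  7808 ?        Ss   14:48   0:02  \_ adbmgr: antdb doctor launcher
antdb   137154  0.0  0.0 358944  6836 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor gcn1
antdb   137157  0.0  0.0 358948  6852 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor node monitor cn1
antdb   137191  0.0  0.0 358948  5888 ?        Ss   14:49   0:00  \_ adbmgr: antdb doctor host monitor
"""

# Assumed title layout: "adbmgr: <user> doctor <role> [<node name>]".
PATTERN = re.compile(
    r"adbmgr: \S+ doctor (launcher|node monitor|host monitor)(?:\s+(\S+))?"
)


def classify(ps_text):
    """Count doctor processes by role and collect the monitored node names."""
    roles = Counter()
    nodes = []
    for line in ps_text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # skips blank lines and the grep process itself
        roles[match.group(1)] += 1
        if match.group(1) == "node monitor" and match.group(2):
            nodes.append(match.group(2))
    return roles, nodes


roles, nodes = classify(PS_SAMPLE)
print(dict(roles))
print(nodes)
```

Comparing the collected node names with the NODE rows of list doctor would show whether every enabled node has a monitor process.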

Turn Off Self-Healing
Execute stop doctor; in adbmgr to shut down self-healing.

postgres=# stop doctor; 
NOTICE:  Update pgxc_node successfully in 'gcn1'. 
NOTICE:  Update pgxc_node successfully in 'cn1'. 
NOTICE:  Update pgxc_node successfully in 'cn2'. 
NOTICE:  Update pgxc_node successfully in 'cn3'. 
NOTICE:  Update pgxc_node successfully in 'cn4'. 
NOTICE:  Updating pgxc_node successfully at all datanode master. 
 mgr_doctor_stop  
----------------- 
 t 
(1 row) 

postgres=# list doctor; 
   type    |     subtype     |      key       | value |                                                   comment                                                     
-----------+-----------------+----------------+-------+-------------------------------------------------------------------------------------------------------------- 
 PARAMETER | --              | enable         | 0     | 0:false, 1:true. If true, doctor processes will be launched, or else, doctor processes exit. 
 PARAMETER | --              | forceswitch    | 1     | 0:false, 1:true. Whether force to switch the master/slave, note that force switch may cause data loss. 
 PARAMETER | --              | switchinterval | 10    | In seconds, The time interval for doctor retry the switching if an error occurred in the previous switching. 
 PARAMETER | --              | nodedeadline   | 10    | In seconds. The maximum time for doctor tolerate a NODE running abnormally. 
 PARAMETER | --              | agentdeadline  | 5     | In seconds. The maximum time for doctor tolerate a AGENT running abnormally. 
 NODE      | gtmcoord master | gcn1           | t     | enable doctor 
 NODE      | gtmcoord slave  | gcn2           | t     | enable doctor 
 NODE      | coordinator     | cn1            | t     | enable doctor 
 NODE      | coordinator     | cn2            | t     | enable doctor 
 NODE      | coordinator     | cn3            | t     | enable doctor 
 NODE      | datanode master | dn1_1          | t     | enable doctor 
 NODE      | datanode master | dn2_1          | t     | enable doctor 
 NODE      | datanode master | dn3_1          | t     | enable doctor 
 NODE      | datanode slave  | dn1_2          | t     | enable doctor 
 NODE      | datanode slave  | dn1_3          | t     | enable doctor 
 NODE      | datanode slave  | dn2_2          | t     | enable doctor 
 NODE      | datanode master | dn4_1          | t     | enable doctor 
 NODE      | coordinator     | cn4            | t     | enable doctor 
 NODE      | datanode slave  | dn3_2          | t     | enable doctor 
 NODE      | datanode slave  | dn4_2          | t     | enable doctor 
 HOST      | --              | adb01          | t     | enable doctor 
 HOST      | --              | adb02          | t     | enable doctor 
(22 rows) 

[antdb@intel175 ~]$ ps xuf|grep doctor 
antdb     2435  0.0  0.0 112716   984 pts/46   S+   16:09   0:00  |   \_ grep --color=auto doctor 

After stop doctor is executed, the enable parameter of the doctor is 0 and no doctor processes remain.
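The two conditions above (enable is 0, no doctor process left) can be combined into one check. This is a sketch under the assumption that both command outputs have been captured as text; doctor_is_stopped is a hypothetical helper, and the sample row abbreviates the comment column.

```python
import re

# The `enable` row after `stop doctor;` (comment column shortened).
LIST_AFTER_STOP = " PARAMETER | --              | enable         | 0     | 0:false, 1:true."

# The only line left in `ps xuf | grep doctor` is the grep process itself.
PS_AFTER_STOP = r"antdb     2435  0.0  0.0 112716   984 pts/46   S+   16:09   0:00  |   \_ grep --color=auto doctor"


def doctor_is_stopped(list_doctor_text, ps_text):
    """Return True when `enable` is 0 and no adbmgr doctor process is listed."""
    match = re.search(
        r"^\s*PARAMETER\s*\|[^|]*\|\s*enable\s*\|\s*(\d+)\s*\|",
        list_doctor_text,
        re.MULTILINE,
    )
    enable_off = match is not None and match.group(1) == "0"
    # Doctor processes carry an "adbmgr:" title; the grep line does not.
    doctor_procs = [
        ln for ln in ps_text.splitlines() if "adbmgr:" in ln and " doctor " in ln
    ]
    return enable_off and not doctor_procs


print(doctor_is_stopped(LIST_AFTER_STOP, PS_AFTER_STOP))  # True
```

Filtering on the "adbmgr:" title rather than on the word "doctor" alone keeps the grep command from being counted as a doctor process.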
