Manager Handbook for Distributed AntDB-T - P18

#database

Self-healing work example

Once the self-healing function is enabled, a failed node is automatically restarted (pulled up) or switched over, so that the business keeps running.
Let's kill a datanode master and see whether it recovers automatically.
•Kill a node
Select dn2_1 as the node to kill:

postgres=# monitor datanode master dn2_1; 
 nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time            
----------+-----------------+--------+-------------+--------------+-------+----------+------------------------------- 
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:20:06.225503+08 
(1 row) 

[antdb@intel175 ~]$ ps xuf|grep dn2_1 
antdb    35846  0.0  0.0 112712   980 pts/56   S+   16:54   0:00      \_ grep --color=auto dn2_1 
antdb    11456  0.0  0.0 442624 92208 ?        S    16:20   0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i 
antdb    12788  0.0  0.0 358948  6908 ?        Ss   16:22   0:00  \_ adbmgr: antdb doctor node monitor dn2_1     
[antdb@intel175 ~]$ kill -9 11456 

postgres=# monitor datanode master dn2_1; 
WARNING:  datanode master dn2_1 recovery status is unknown 
 nodename |    nodetype     | status | description |     host     | port  | recovery | boot time  
----------+-----------------+--------+-------------+--------------+-------+----------+----------- 
 dn2_1    | datanode master | f      | not running | 10.21.20.175 | 52541 | unknown  | unknow 
(1 row)

•Observe the node status
After waiting for a few seconds, observe the node status again:

postgres=# monitor datanode master dn2_1; 
 nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time            
----------+-----------------+--------+-------------+--------------+-------+----------+------------------------------- 
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:55:10.935821+08 
(1 row)

The node has recovered, and its postgres process is running again under a new PID (36441 instead of 11456):

[antdb@intel175 ~]$ ps xuf|grep dn2_1 
antdb    36484  0.0  0.0 112712   980 pts/56   S+   16:55   0:00      \_ grep --color=auto dn2_1 
antdb    36441  1.8  0.0 442624 92212 ?        S    16:55   0:00 /data/danghb/app/adb50/bin/postgres --datanode -D /data/danghb/data/adb50/d1/dn2_1 -i 
antdb    12788  0.0  0.0 359084  7664 ?        Ss   16:22   0:00  \_ adbmgr: antdb doctor node monitor dn2_1
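
The "wait and observe" step can also be scripted instead of re-running the monitor command by hand. The following is a minimal sketch, assuming the monitor output has already been captured as text; how that text is obtained (for example through psql connected to adbmgr) is left out. The function names parse_monitor_output and node_is_running are hypothetical helpers, not part of AntDB.

```python
def parse_monitor_output(text):
    """Parse psql-style aligned table output into a list of row dicts."""
    lines = [ln.rstrip() for ln in text.splitlines()]
    rows = []
    for i, ln in enumerate(lines):
        stripped = ln.strip()
        # The separator line (only '-' and '+') sits directly under the header.
        if i > 0 and stripped and set(stripped) <= set("-+"):
            header = [c.strip() for c in lines[i - 1].split("|")]
            for data in lines[i + 1:]:
                # Stop at a blank line or at the "(1 row)" footer.
                if not data.strip() or data.strip().startswith("("):
                    break
                cells = [c.strip() for c in data.split("|")]
                rows.append(dict(zip(header, cells)))
            break
    return rows


def node_is_running(text, nodename):
    """Return True if the named node reports status 't' in the output."""
    for row in parse_monitor_output(text):
        if row.get("nodename") == nodename:
            return row.get("status") == "t"
    return False


# Sample outputs copied from the transcripts in this article.
SAMPLE_DOWN = """WARNING:  datanode master dn2_1 recovery status is unknown
 nodename |    nodetype     | status | description |     host     | port  | recovery | boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-----------
 dn2_1    | datanode master | f      | not running | 10.21.20.175 | 52541 | unknown  | unknow
(1 row)
"""

SAMPLE_UP = """ nodename |    nodetype     | status | description |     host     | port  | recovery |           boot time
----------+-----------------+--------+-------------+--------------+-------+----------+-------------------------------
 dn2_1    | datanode master | t      | running     | 10.21.20.175 | 52541 | false    | 2019-10-16 16:55:10.935821+08
(1 row)
"""

if __name__ == "__main__":
    print(node_is_running(SAMPLE_DOWN, "dn2_1"))
    print(node_is_running(SAMPLE_UP, "dn2_1"))
```

A polling loop could call node_is_running every second or so until it returns True; the WARNING line that precedes the table while the node is down is skipped because the header is located relative to the dashed separator.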

Corresponding adbmgr log information:

2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly
        This probably means the server terminated abnormally 
        before or while processing the request. 
",,,,,,,,,"" 
2019-10-16 16:55:05.818 CST,,,12788,,5da6d32e.31f4,7,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
        Is the server running on host ""10.21.20.175"" and accepting 
        TCP/IP connections on port 52541? 
",,,,,,,,,"" 
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,8,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:could not connect to server: Connection refused
        Is the server running on host ""10.21.20.175"" and accepting 
        TCP/IP connections on port 52541? 
",,,,,,,,,"" 
2019-10-16 16:55:10.824 CST,,,12788,,5da6d32e.31f4,9,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node crashed",,,,,,,,,"" 
2019-10-16 16:55:10.826 CST,,,12788,,5da6d32e.31f4,10,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"antdb doctor node monitor dn2_1, try to startup node",,,,,,,,,"" 
2019-10-16 16:55:11.044 CST,,,12788,,5da6d32e.31f4,11,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"start dn2_1 /data/antdb/data/adb50/d1/dn2_1 successfully",,,,,,,,,"" 
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,12,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, startup node successfully",,,,,,,,,"" 
2019-10-16 16:55:11.086 CST,,,12788,,5da6d32e.31f4,13,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, reset node monitor",,,,,,,,,"" 
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""

You can see that node dn2_1 is back to normal roughly 8 seconds after the first failed connection check (16:55:03.315 to 16:55:11.092 in the log), and the recovery requires no manual intervention.
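
The recovery time can be read off the log programmatically. The sketch below assumes the adbmgr log follows the PostgreSQL csvlog column layout, which the lines above appear to match (timestamp in the first field, message in the fourteenth), and that the messages "CONNECT_FAIL" and "node running normally" mark the start and end of an outage. The function name recovery_seconds is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

MESSAGE_FIELD = 13  # assumed csvlog layout: message is the 14th column


def parse_ts(raw):
    """Parse '2019-10-16 16:55:03.315 CST', ignoring the zone suffix."""
    stamp, _zone = raw.rsplit(" ", 1)
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def recovery_seconds(log_text, node):
    """Seconds from the first CONNECT_FAIL to 'node running normally'."""
    first_fail = None
    # csv.reader handles quoted messages that span several lines.
    for fields in csv.reader(io.StringIO(log_text)):
        if len(fields) <= MESSAGE_FIELD:
            continue
        message = fields[MESSAGE_FIELD]
        if node not in message:
            continue
        when = parse_ts(fields[0])
        if first_fail is None and "CONNECT_FAIL" in message:
            first_fail = when
        elif first_fail is not None and "node running normally" in message:
            return (when - first_fail).total_seconds()
    return None  # no complete fail/recover pair found


# Three records taken from the log excerpt in this article.
SAMPLE_LOG = '''2019-10-16 16:55:03.315 CST,,,12788,,5da6d32e.31f4,6,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, CONNECT_FAIL, PQerrorMessage:server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
",,,,,,,,,""
2019-10-16 16:55:10.826 CST,,,12788,,5da6d32e.31f4,10,,2019-10-16 16:22:06 CST,12/6,651,LOG,00000,"antdb doctor node monitor dn2_1, try to startup node",,,,,,,,,""
2019-10-16 16:55:11.092 CST,,,12788,,5da6d32e.31f4,14,,2019-10-16 16:22:06 CST,12/0,0,LOG,00000,"antdb doctor node monitor dn2_1, node running normally",,,,,,,,,""
'''

if __name__ == "__main__":
    print(recovery_seconds(SAMPLE_LOG, "dn2_1"))
```

Note that this measures the time from when the monitor first noticed the failure, not from the moment the process was killed, which the log does not record.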
