Introduction
In distributed systems, nodes need to know whether their peers are still healthy and update their peer lists accordingly. The simplest way to do this is with heartbeats. In a network with nodes A, B, and C, A periodically sends heartbeat messages to its peers B and C saying "I'm still here". If B and C don't hear from A within a set period, they assume that A is dead and can no longer handle requests, and they stop sending traffic to it.
The question is how long B and C should wait for A before marking it as dead. Twenty seconds? Thirty seconds? If the threshold is too short, there will be many false positives on the "dead nodes" list: nodes will be marked as dead when they are actually alive and merely had a slow or flaky network connection. If the threshold is too long, there will be many failed requests, because dead nodes keep receiving traffic as though they were still active and drop it. So where exactly is the sweet spot? Enter the Phi Accrual failure detector.
The Phi Accrual Failure Detector
The Phi Accrual failure detector, which was originally introduced in this paper, uses a suspicion model. For example, when nodes B and C start up, they set their confidence level in node A to 0.5, on a scale from 0 to 1. After a particular period, if node A hasn't responded, the confidence level is dropped by 0.1, which makes the new confidence 0.4. If it has responded, the confidence is increased by 0.1 to 0.6. A threshold can then be set: if node A's confidence drops to 0.2, we mark the node as dead. This is a more nuanced approach than a single fixed timeout because it takes the node's past responsiveness into account. The history can also be stored and analyzed to decide at what threshold a node should be marked as dead.
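The confidence arithmetic in this example can be written down as a small Go sketch. The names adjust and isDead are hypothetical, and rounding to one decimal place and clamping to the 0 to 1 scale are assumptions added so that repeated floating-point steps of 0.1 compare cleanly against 0.2.

```go
package main

import (
	"fmt"
	"math"
)

const (
	initialConfidence = 0.5 // starting confidence from the example above
	step              = 0.1 // amount added or removed per check
	deadAt            = 0.2 // confidence at or below which the node is dead
)

// adjust returns the confidence after one check interval: up by one step if
// the node responded, down by one step if it did not.
func adjust(confidence float64, responded bool) float64 {
	if responded {
		confidence += step
	} else {
		confidence -= step
	}
	// Round to one decimal to avoid floating-point drift (assumption),
	// then clamp to the [0, 1] scale.
	confidence = math.Round(confidence*10) / 10
	return math.Max(0, math.Min(1, confidence))
}

// isDead reports whether the confidence has reached the dead threshold.
func isDead(confidence float64) bool {
	return confidence <= deadAt
}

func main() {
	c := initialConfidence
	// One response followed by four missed intervals.
	for i, responded := range []bool{true, false, false, false, false} {
		c = adjust(c, responded)
		fmt.Printf("interval %d: responded=%v confidence=%.1f dead=%v\n",
			i+1, responded, c, isDead(c))
	}
}
```

A node that answers occasionally recovers confidence, so a single missed interval does not immediately get it marked as dead.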
Simulation
I did a small simulation with Docker and Golang, and this is what it looked like.
Leader
The first component is the Leader. I set up this entity in leader.go. The Leader receives the heartbeats, updates the table, and manages the threshold and the failure point.
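var (
	threshold    = 15  // seconds of silence before confidence is decremented
	failurePoint = 0.2 // confidence at or below which a node is marked as dead
)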
The threshold is set to 15 and is expressed in seconds. This means that if we don't hear from a node for more than fifteen seconds, its confidence level is decremented by 0.1.
The failurePoint is set to 0.2. Once a node's confidence level reaches the failurePoint or lower, it is marked as dead.
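To make the interplay of threshold and failurePoint concrete, here is a minimal sketch of what a periodic check on the leader might look like. The peer type, the sweep function, the map-based table, and the rounding step are assumptions made for illustration; the actual leader.go may be organized differently.

```go
package main

import (
	"fmt"
	"math"
)

var (
	threshold    int64 = 15  // seconds of silence before a decrement (int64 here for timestamp math)
	failurePoint       = 0.2 // confidence at or below which a peer is dead
)

// peer is a hypothetical row in the leader's table.
type peer struct {
	Confidence float64
	LastSeen   int64 // Unix seconds of the last heartbeat
}

// sweep lowers the confidence of every peer that has been silent for longer
// than threshold, deletes peers that reach failurePoint, and returns the
// keys of the peers it removed.
func sweep(table map[string]*peer, now int64) []string {
	var dead []string
	for key, p := range table {
		if now-p.LastSeen <= threshold {
			continue // heard from this peer recently enough
		}
		// Round to one decimal so repeated 0.1 steps compare cleanly (assumption).
		p.Confidence = math.Round((p.Confidence-0.1)*10) / 10
		if p.Confidence <= failurePoint {
			dead = append(dead, key)
			delete(table, key) // deleting during range is safe in Go
		}
	}
	return dead
}

func main() {
	now := int64(1000) // illustrative clock value
	table := map[string]*peer{
		"node-a": {Confidence: 1.0, LastSeen: now - 3},
		"node-b": {Confidence: 1.0, LastSeen: now - 17},
		"node-c": {Confidence: 0.3, LastSeen: now - 60},
	}
	dead := sweep(table, now)
	fmt.Println("marked dead:", dead)
	fmt.Printf("node-b confidence: %.1f, peers left: %d\n",
		table["node-b"].Confidence, len(table))
}
```

Running a check like this on a timer, with heartbeats raising confidence in between, gives the behaviour described above.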
Follower
The next entity is the Follower, which is set up in follower.go. The followers send periodic heartbeats to the leader. A heartbeat looks like this:
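type HeartBeat struct {
	IpAddress string // IP address of the follower
	Pid       int    // process ID of the follower
	Timestamp int64  // Unix time at which the heartbeat was sent
}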
The follower sends its IP address, process id (PID), and timestamp to the leader periodically.
I set up two Docker containers: phi_leader to hold the leader executable and phi_follower for the follower executable.
I scaled phi_follower to 5 instances on a Docker Swarm network.
# docker service scale phi_follower=5
phi_follower scaled to 5
overall progress: 0 out of 5 tasks
1/5: ready [======================================> ]
2/5: ready [======================================> ]
3/5: starting [============================================> ]
4/5: preparing [=================================> ]
5/5: ready [======================================> ]
After running this process for about five minutes, I tailed the leader's logs, and the peer list looked like this:
# docker service logs phi_leader -f
Received heart beat from process with IP 10.0.4.222 and PID 1
updating Accrual failure detection table...
*------------*-----*-----------------*------------*
| IP         | PID | CONFIDENCE      | TIME       |
*------------*-----*-----------------*------------*
| 10.0.4.219 | 1   | 0.500000        | 1634756932 |
| 10.0.4.223 | 1   | 0.500000        | 1634756940 |
| 10.0.4.222 | 1   | 0.600000        | 1634756942 |
| 10.0.4.224 | 1   | 0.600000        | 1634756924 |
| 10.0.4.225 | 1   | 0.900000        | 1634756922 |
*------------*-----*-----------------*------------*
| TOTAL PROCESSES                    | 5          |
*------------*-----*-----------------*------------*
The five nodes have been registered on the leader, each with its own confidence level and timestamp. Judging by the confidence levels, all the peers are healthy.
Next, I stopped two nodes and watched to see whether the detector would decrement their confidence levels.
// Stop first container
# docker stop 14d07e02a1ad
14d07e02a1ad
// Stop second container
# docker stop 3829bb23fe41
3829bb23fe41
After 17 seconds, which is above the threshold of 15 seconds, one node still hadn't sent a heartbeat and its confidence level was decreased. The node with IP 10.0.4.223 dropped to 0.9:
# docker service logs phi_leader -f
Checking for inactive processes...
Havent received a heartbeat from process with ip 10.0.4.223 and pid 1 for the past 17 seconds
Decrementing process confidence by 0.1
*------------*-----*-----------------*------------*
| IP         | PID | CONFIDENCE      | TIME       |
*------------*-----*-----------------*------------*
| 10.0.4.219 | 1   | 1.000000        | 1634757451 |
| 10.0.4.223 | 1   | 0.900000        | 1634757439 |
| 10.0.4.222 | 1   | 1.000000        | 1634757453 |
| 10.0.4.224 | 1   | 1.000000        | 1634757440 |
| 10.0.4.225 | 1   | 1.000000        | 1634757430 |
*------------*-----*-----------------*------------*
| TOTAL PROCESSES                    | 5          |
*------------*-----*-----------------*------------*
After a few minutes of tailing the logs, one of the nodes reaches the failure point of 0.2 and is marked as dead. The leader removes this node from its peer list, and the total drops to 4:
# docker service logs phi_leader -f
This process with ip 10.0.4.224 and pid 1 now has confidence 0.200000 which is below or equal to the failure point 0.200000 and will be marked as dead
*------------*-----*-----------------*------------*
| IP         | PID | CONFIDENCE      | TIME       |
*------------*-----*-----------------*------------*
| 10.0.4.219 | 1   | 0.900000        | 1634757996 |
| 10.0.4.223 | 1   | 1.000000        | 1634757998 |
| 10.0.4.222 | 1   | 1.000000        | 1634758005 |
| 10.0.4.227 | 1   | 0.700000        | 1634758010 |
*------------*-----*-----------------*------------*
| TOTAL PROCESSES                    | 4          |
*------------*-----*-----------------*------------*
So it worked! The code for the full implementation and simulation of the Phi Accrual detector can be found here.
Thank you for reading!
Originally published here
Notes
The actual Phi Accrual detector is more complex than this. Instead of hardcoding the threshold and failure point as we did, it uses a probabilistic model: it tracks the mean and standard deviation of previous heartbeat inter-arrival times and derives a suspicion level from them, phi = -log10(1 - F(timeSinceLastHeartbeat)), where F is the cumulative distribution function of those inter-arrival times. You can find a good read here.