Failure Detection. The Phi Accrual Failure Detector

Introduction

In distributed systems, nodes need to know whether their peers are still healthy and update their peer lists accordingly. The simplest way to do this is with heartbeats. In a network with nodes A, B, and C, node A sends heartbeat messages to its peers B and C saying “I’m still here”. If B and C don't hear from node A after a particular period, they can assume that node A is dead and can no longer handle requests, and they will stop sending traffic to it.
The question is how long nodes B and C should wait for A before marking it as dead. Twenty seconds? Thirty seconds? If a very short threshold is chosen, there will be lots of false positives: many nodes will be marked as dead when they are actually fine and simply had a slow network connection. On the other hand, if a long threshold is chosen, there will be lots of failed requests, because dead nodes will keep receiving traffic as though they were still active and will drop it. So what exactly is the sweet spot? The Phi Accrual failure detector.
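To make the tradeoff concrete, here is a minimal sketch in Go of the naive fixed-timeout approach. The names and the 20-second timeout are illustrative choices, not code from the simulation described below.

package main

import (
    "fmt"
    "time"
)

// timeout is the single, hard cutoff the naive approach relies on.
const timeout = 20 * time.Second

// lastHeartbeat records when each peer was last heard from.
var lastHeartbeat = map[string]time.Time{}

// isAlive makes a hard yes/no decision based purely on elapsed time:
// one late heartbeat and the peer is written off completely.
func isAlive(peer string) bool {
    last, ok := lastHeartbeat[peer]
    return ok && time.Since(last) <= timeout
}

func main() {
    lastHeartbeat["nodeA"] = time.Now().Add(-25 * time.Second)
    fmt.Println(isAlive("nodeA")) // false, even if nodeA only had a brief network hiccup
}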

The Phi Accrual Failure Detector

The Phi Accrual failure detector, originally introduced in this paper, uses a suspicion model. For example, when nodes B and C start up, they set the confidence level they have in node A to 0.5, on a scale from 0 to 1. After a particular period, if node A hasn't responded, the confidence level is dropped by 0.1, making the new confidence 0.4. If it does respond, the confidence is increased by 0.1 to 0.6. A threshold can then be set so that once node A's confidence falls to, say, 0.2, the node is marked as dead. This is a more nuanced approach because it takes the node's past behaviour into account. The history can also be stored and analyzed to decide where to set the point at which a node is marked as dead.
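As a rough sketch of this suspicion model in Go (the type and method names here are mine, not the paper's or the simulation's):

package detector

import "math"

// Confidence is how much a peer is trusted, on a scale from 0 to 1.
type Confidence struct {
    value float64
}

// NewConfidence starts a peer at the neutral level of 0.5.
func NewConfidence() *Confidence { return &Confidence{value: 0.5} }

// OnHeartbeat raises confidence by 0.1 when a heartbeat arrives, capped at 1.
func (c *Confidence) OnHeartbeat() { c.value = math.Min(1, c.value+0.1) }

// OnSilence lowers confidence by 0.1 after a period with no heartbeat, floored at 0.
func (c *Confidence) OnSilence() { c.value = math.Max(0, c.value-0.1) }

// Dead reports whether confidence has fallen to the failure point or below.
func (c *Confidence) Dead(failurePoint float64) bool { return c.value <= failurePoint }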

Simulation

I did a small simulation with Docker and Golang, and this is what it looked like.

Leader

The first component is the Leader, which I set up in leader.go. The Leader receives the heartbeats, updates the peer table, and manages the threshold and the failure point.

var (
    threshold    = 15  // seconds of silence before a node's confidence is decremented
    failurePoint = 0.2 // confidence level at or below which a node is marked as dead
)

The threshold is set to 15 and is measured in seconds. This means that if we don't hear from a node for fifteen seconds, its confidence level will be decremented by 0.1.
The failurePoint is set to 0.2; once a node's confidence level reaches the failurePoint or falls below it, the node is marked as dead.
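For context, the leader's periodic check could look roughly like this. This is only a sketch of the shape of that check, not the actual code in leader.go; the Entry struct and the sweep function are my own names.

package leader

// Entry is one row in the leader's failure-detection table.
type Entry struct {
    IP            string
    PID           int
    Confidence    float64
    LastHeartbeat int64 // unix seconds of the last heartbeat received
}

// sweep runs periodically: any peer that has been silent for longer than
// threshold seconds loses 0.1 confidence, and any peer whose confidence is
// at or below failurePoint is dropped from the peer list.
func sweep(table map[string]*Entry, now int64, threshold int, failurePoint float64) {
    for ip, e := range table {
        if now-e.LastHeartbeat > int64(threshold) {
            e.Confidence -= 0.1
        }
        if e.Confidence <= failurePoint {
            delete(table, ip) // marked as dead
        }
    }
}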

Follower

The next entity is the Follower, which is set up in follower.go. The followers send periodic heartbeats to the leader. A heartbeat looks like this:

type HeartBeat struct {
   IpAddress string
   Pid       int
   Timestamp int64
}

The follower periodically sends its IP address, process ID (PID), and timestamp to the leader.
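A follower's send loop might look roughly like this. The five-second interval, the HTTP transport, and the phi_leader endpoint are assumptions for illustration; the actual follower.go may differ.

package main

import (
    "bytes"
    "encoding/json"
    "log"
    "net/http"
    "os"
    "time"
)

type HeartBeat struct {
    IpAddress string
    Pid       int
    Timestamp int64
}

func main() {
    // Hypothetical leader endpoint and self-address discovery; the real
    // follower.go may resolve these differently.
    leaderURL := "http://phi_leader:8080/heartbeat"
    ip := os.Getenv("NODE_IP")

    // Send a heartbeat on a fixed interval (assumed to be 5 seconds here).
    for range time.Tick(5 * time.Second) {
        hb := HeartBeat{IpAddress: ip, Pid: os.Getpid(), Timestamp: time.Now().Unix()}
        body, _ := json.Marshal(hb)
        resp, err := http.Post(leaderURL, "application/json", bytes.NewReader(body))
        if err != nil {
            log.Printf("failed to send heartbeat: %v", err)
            continue
        }
        resp.Body.Close()
    }
}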

I set up two Docker services: phi_leader for the leader executable and phi_follower for the follower executable.
I then scaled phi_follower to 5 instances on a Docker Swarm network.

# docker service scale phi_follower=5  

phi_follower scaled to 5
overall progress: 0 out of 5 tasks 
1/5: ready     [======================================>            ] 
2/5: ready     [======================================>            ] 
3/5: starting  [============================================>      ] 
4/5: preparing [=================================>                 ] 
5/5: ready     [======================================>            ]

After running this setup for about five minutes, I tailed the leader’s logs, and the peer list looked like this:

# docker service logs phi_leader -f
Received heart beat from process with IP 10.0.4.222 and PID 1
updating Accrual failure detection table...
    *------------*-----*-----------------*------------*
    ╪     IP     ╪ PID ╪   CONFIDENCE    ╪    TIME    ╪
    *------------*-----*-----------------*------------*
    ╪ 10.0.4.219 ╪   1 ╪        0.500000 ╪ 1634756932 ╪
    ╪ 10.0.4.223 ╪   1 ╪        0.500000 ╪ 1634756940 ╪
    ╪ 10.0.4.222 ╪   1 ╪        0.600000 ╪ 1634756942 ╪
    ╪ 10.0.4.224 ╪   1 ╪        0.600000 ╪ 1634756924 ╪
    ╪ 10.0.4.225 ╪   1 ╪        0.900000 ╪ 1634756922 ╪
    *------------*-----*-----------------*------------*
    ╪                    TOTAL PROCESSES ╪     5      ╪
    *------------*-----*-----------------*------------*

The five different nodes have been registered on the leader, with different confidence levels and timestamps. Judging from the confidence levels, all the peers are healthy.
Next, I stopped two nodes and watched to see whether the detector would decrement their confidence levels.

// Stop first container
# docker stop 14d07e02a1ad            
14d07e02a1ad
// Stop second container
# docker stop 3829bb23fe41
3829bb23fe41

After 17 seconds, which is past the threshold of 15 seconds, one node still hadn’t sent a heartbeat, so its confidence level was decreased: the node with IP 10.0.4.223 was dropped to 0.9.

# docker service logs phi_leader -f
Checking for inactive processes...
Havent received a heartbeat from process with ip 10.0.4.223 and pid 1 for the past 17 seconds
Decrementing process confidence by 0.1
    *------------*-----*-----------------*------------*
    ╪     IP     ╪ PID ╪   CONFIDENCE    ╪    TIME    ╪
    *------------*-----*-----------------*------------*
    ╪ 10.0.4.219 ╪   1 ╪        1.000000 ╪ 1634757451 ╪
    ╪ 10.0.4.223 ╪   1 ╪        0.900000 ╪ 1634757439 ╪
    ╪ 10.0.4.222 ╪   1 ╪        1.000000 ╪ 1634757453 ╪
    ╪ 10.0.4.224 ╪   1 ╪        1.000000 ╪ 1634757440 ╪
    ╪ 10.0.4.225 ╪   1 ╪        1.000000 ╪ 1634757430 ╪
    *------------*-----*-----------------*------------*
    ╪                    TOTAL PROCESSES ╪     5      ╪
    *------------*-----*-----------------*------------*

After a few more minutes of tailing the logs, one of the nodes reaches the failure point of 0.2 and is marked as dead. The leader removes this node from its peer list, and the total drops to 4.

# docker service logs phi_leader -f
This process with ip 10.0.4.224 and pid 1 now has confidence 0.200000 which is below or equal to the failure point 0.200000 and will be marked as dead
    *------------*-----*-----------------*------------*
    ╪     IP     ╪ PID ╪   CONFIDENCE    ╪    TIME    ╪
    *------------*-----*-----------------*------------*
    ╪ 10.0.4.219 ╪   1 ╪        0.900000 ╪ 1634757996 ╪
    ╪ 10.0.4.223 ╪   1 ╪        1.000000 ╪ 1634757998 ╪
    ╪ 10.0.4.222 ╪   1 ╪        1.000000 ╪ 1634758005 ╪
    ╪ 10.0.4.227 ╪   1 ╪        0.700000 ╪ 1634758010 ╪
    *------------*-----*-----------------*------------*
    ╪                    TOTAL PROCESSES ╪     4      ╪
    *------------*-----*-----------------*------------*

So it worked! The full implementation and simulation of the Phi Accrual failure detector can be found here.
Thank you for reading!

Originally published here

Notes

The actual phi accrual detector is more complex than this. Instead of hardcoding the threshold and failure point as we did, it uses a probabilistic model: it keeps a history of heartbeat inter-arrival times, estimates their mean and standard deviation, and computes a suspicion value phi = -log10(1 - F(timeSinceLastHeartbeat)), where F is the cumulative distribution function fitted to the observed arrival times. You can find a good read here.
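As a minimal sketch of that calculation in Go, assuming heartbeat inter-arrival times are normally distributed (the use of math.Erf for the normal CDF and the example numbers are my own assumptions, not the paper's reference code):

package main

import (
    "fmt"
    "math"
)

// normalCDF is F(x) for a normal distribution with the given mean and standard deviation.
func normalCDF(x, mean, stddev float64) float64 {
    return 0.5 * (1 + math.Erf((x-mean)/(stddev*math.Sqrt2)))
}

// phi turns the time since the last heartbeat into a suspicion level: the later
// the heartbeat is relative to past behaviour, the higher phi grows.
func phi(timeSinceLastHeartbeat, mean, stddev float64) float64 {
    return -math.Log10(1 - normalCDF(timeSinceLastHeartbeat, mean, stddev))
}

func main() {
    // Suppose past heartbeats arrived roughly every 5 seconds with 1 second of jitter.
    mean, stddev := 5.0, 1.0
    for _, elapsed := range []float64{5, 8, 12} {
        fmt.Printf("%.0fs since last heartbeat -> phi %.2f\n", elapsed, phi(elapsed, mean, stddev))
    }
    // A node is suspected once phi crosses a chosen threshold, rather than at a hardcoded confidence value.
}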
