Dominic Streif

Building a Heartbeat-style Gossip Node Network: Failure Detection in Distributed Systems

Please view the original article on LinkedIn for a better reading experience: article

Heartbeat-Style Gossip

In modern computing, distributed systems are essential to how the world around us operates. Whether you know it or not, you have contributed to the propagation of data between computer nodes within a distributed system. Large-scale vendors, banks, and cloud computing services all utilize distributed systems, and for good reason.

What is a Distributed System?

"Distributed Systems involve the study and design of systems where multiple independent computers or nodes work together to achieve a common goal" (Penn Engineering). The key motivations for the development of distributed systems were: fault tolerance, scalability, and the ability to divide tasks to achieve better performance and efficiency. Think about what happens when you place an Amazon order. Amazon has built thousands of different computer nodes at sites which contain replica copies of data for each user. When you place an order, the order record is propagated from one site to all of the others. This ensures data consistency and availability across all distributed nodes. In other words, your order, with high confidence, will not be lost to the void in the case of node failure.

Propagation Techniques

There are several different methods by which distributed systems propagate data while ensuring consistency and reliability. A common approach is to keep track of each node in a membership list, allowing the system to properly manage data replication, failure detection, and communication across the node network.

In this article, we will discuss and implement a strategy known as Heartbeat-style Gossip. This technique involves passing records of the nodes in the membership list between every node in the network through rumors. In this node network, server-client relationships do not exist. Instead, all nodes are equal and periodically share their membership lists with a random subset of other nodes. This periodic sharing of data is known as a heartbeat. Heartbeats communicate to the receiving nodes that the sending node is still alive and functioning properly. When a receiving node does not hear a heartbeat from a sending node within a set period of time (t), it will mark that node as possibly failing. If the receiving node still has not heard from the sending node some time after it was marked for failure (n * t), the node is purged from the membership list, as it has lost communication and has most likely failed.

Our Heartbeat-Style Gossip Project

We will now implement our own Heartbeat-Style Gossip node network. But first, we must discuss some caveats.

For the purpose of our implementation, and to keep the project simple, we will develop our gossip network with some constraints. Firstly, each node will communicate with a predetermined subset of at most two nodes, so our network of communication will look more like a ring than a densely connected graph. This also means that if both nodes in a node's subset fail, that node becomes isolated and loses communication with the rest of the network. Secondly, the network will consist of only eight nodes. This allows our implementation to be tested on a small network whose behavior we can observe and record. This project is not meant to be scalable and has not been tested with a large node network. Lastly, we will not be distributing work or tasks between nodes; only the membership tables will be exchanged throughout the network. However, we will simulate node failure and demonstrate that the network is fault tolerant.
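To make the ring constraint concrete, a hypothetical layout for the eight-node network could pair each node with its two adjacent neighbors as sketched below. The node IDs, hostnames, and ports are assumptions purely for illustration, not the project's actual configuration.

// Hypothetical ring topology for the eight-node network: each node gossips
// only with the two adjacent nodes. Addresses and ports are assumptions.
var neighbors = map[int][]string{
    1: {"localhost:9008", "localhost:9002"},
    2: {"localhost:9001", "localhost:9003"},
    3: {"localhost:9002", "localhost:9004"},
    4: {"localhost:9003", "localhost:9005"},
    5: {"localhost:9004", "localhost:9006"},
    6: {"localhost:9005", "localhost:9007"},
    7: {"localhost:9006", "localhost:9008"},
    8: {"localhost:9007", "localhost:9001"},
}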

Implementation

For this implementation we will be making use of Golang's aptitude for distributed system development. Golang's net/rpc library provides all the necessary tools for us to establish baseline communication between the nodes. Remote Procedure Calls (RPC) will allow sending nodes to communicate their membership lists by calling functions on the receiving node.

In this project, membership lists will be implemented as Go slices, each containing "heartbeat tables". Every node will have its own corresponding heartbeat table which will be implemented as a struct with a few important fields.

For example,

type HBTable struct {
    NodeID int
    Status int
    HBCount int64
    Timestamp time.Time
}

NodeID: This field contains the identification number for a particular node.

Status: The status field represents the state the node is currently in. 0 if the node is healthy, and 1 if the node is suspected to be in a failing state.

HBCount: This field tracks the number of heartbeats the node has sent.

Timestamp: Each table will contain a timestamp of the last known heartbeat a node has sent.

Each node must have both server-side logic and client-side logic in order to send and receive data using the Go net/rpc library. The sending node will call a function on the receiving node in order to pass its membership list.

Implement the Server Logic

We first create the hbserver and register it with the net/rpc library. Then we can simply wait for nodes to dial the server and accept their connection requests.
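A minimal sketch of what this setup might look like is shown below. The type name HBServer, the listen port handling, and the error handling are illustrative assumptions rather than the project's exact code, and the net, net/rpc, and log packages are assumed to be imported.

type HBServer struct{}

func startServer(port string) {
    hbserver := new(HBServer)
    if err := rpc.Register(hbserver); err != nil { // expose the server's RPC methods
        log.Fatal("register error:", err)
    }

    listener, err := net.Listen("tcp", ":"+port)
    if err != nil {
        log.Fatal("listen error:", err)
    }

    // Block and accept connection requests from dialing nodes; net/rpc
    // serves each incoming connection on its own goroutine.
    rpc.Accept(listener)
}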

Once we have implemented the server logic, we can move forward with implementing the RPC function that the server will expose.

The exposed function takes two arguments:

args *Args: A pointer to an argument struct which contains the sending node's identification number, and that node's membership list copy. (Shown Below)

type Args struct {
    NodeID int
    HBTableList []HBTable
}

reply *string: A reply pointer the receiving node can use to send data back to the sending node. For our purposes, we choose not to send any data back.
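The exposed function itself might look roughly like the sketch below. The method name SendHeartbeat and the channel name hbChannel are assumptions; any exported method with this signature works with net/rpc, and the fmt package is assumed to be imported. The buffered channel is discussed next.

// Globally defined heartbeat channel: a buffered queue of 50 Args values used
// to pass incoming membership lists from the server part of the node to the
// client part.
var hbChannel = make(chan Args, 50)

// SendHeartbeat is the RPC method exposed by the node's server part.
// The method name is an assumption made for illustration.
func (s *HBServer) SendHeartbeat(args *Args, reply *string) error {
    fmt.Printf("Received heartbeat from node %d\n", args.NodeID)

    // Hand the sender's membership list copy to the client part of this node.
    hbChannel <- *args
    return nil
}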

Let's now discuss the line that follows the print statement in the exposed RPC function sketched above.

hbChannel <- *args: A channel send operation in Go. hbChannel is a Go "channel", which is used to communicate data between goroutines. Since the node is divided into two parts, a server implementation and a client implementation, we must find a way for the server part of the node to pass information to the client part in order to update the node's membership list. In this case we can simply use a buffered Go channel to pass the data along.

A globally defined heartbeat channel in our implementation is dedicated to passing information between the two parts. The channel acts as a buffered queue with 50 slots, each holding one Args struct.

Implement the Client Logic

For the next stage of our project, we will be implementing the client-side logic for our node.

In part one of our client-side implementation, we initialize a heartbeat table for the node, as sketched below. This table is then appended to the membership list, which will eventually be propagated to the other nodes in the network.
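A minimal sketch of this initialization, assuming the time package is imported and that myID and membershipList (a []HBTable) are this node's ID and local membership list:

// Initialize this node's own heartbeat table and add it to the local
// membership list. Variable names are assumptions for illustration.
selfTable := HBTable{
    NodeID:    myID,       // this node's identification number
    Status:    0,          // 0 = healthy
    HBCount:   0,
    Timestamp: time.Now(), // time of the most recent heartbeat
}
membershipList = append(membershipList, selfTable)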

Next we will discuss the client loop, a sketch of which appears below. The client will first dial, at random, one of the two adjacent nodes in its subset. If it cannot reach the chosen neighbor, it will try again. We then initialize the Args struct and pass it to the exposed RPC function by calling that function with the struct. For our implementation, in case of error or timeout, we will simply loop again and retry propagation.

Finally, we will retrieve the membership list from the heartbeat channel queue and update the node's copy of the membership list.
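Putting these pieces together, the client loop might look like the sketch below. The myNeighbors, selfIndex, myID, and update_tablelist names, as well as the increment of HBCount before gossiping, are assumptions that mirror the description above; math/rand, net/rpc, and time are assumed to be imported.

// Client loop: dial a random adjacent node, gossip the membership list,
// drain any lists received by the server part, then sleep one heartbeat.
myNeighbors := neighbors[myID] // the two adjacent nodes from the ring layout sketched earlier

for {
    // Dial one of the two adjacent nodes at random.
    target := myNeighbors[rand.Intn(len(myNeighbors))]
    client, err := rpc.Dial("tcp", target)
    if err != nil {
        time.Sleep(1 * time.Second) // neighbor unreachable; try again next beat
        continue
    }

    // Refresh this node's own record before gossiping (assumed behavior).
    membershipList[selfIndex].HBCount++
    membershipList[selfIndex].Timestamp = time.Now()

    args := Args{NodeID: myID, HBTableList: membershipList}
    var reply string
    if err := client.Call("HBServer.SendHeartbeat", &args, &reply); err != nil {
        client.Close()
        continue // on error or timeout, simply loop again and retry
    }
    client.Close()

    // Retrieve any membership lists queued by the server part and merge them.
    for len(hbChannel) > 0 {
        incoming := <-hbChannel
        update_tablelist(incoming)
    }

    time.Sleep(1 * time.Second) // heartbeat period
}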

Note: The heartbeat period is set to one second; however, you may want to change it to a different period if you would like to observe slower or faster propagation patterns.

Update the Membership List

The final stage of our implementation will involve designing and implementing the function which will be responsible for maintaining and updating the membership lists based on the incoming heartbeat messages. This function will ensure that nodes accurately track the status of other members within the node network.

We will divide the tasks into three main steps:

  1. Check if the sending node exists in the membership list. We will iterate through our list to check if it is already present. If not, it should be added to the receiving node's membership list.

Note: This step can be condensed into an if statement in step two; however, the logic has been written as a separate step for clarity.

  2. Compare the sending node's membership list with the receiving node's list. If a node is missing from the receiving node's list, it should be appended. If the node's data already exists in the receiving node's membership list, the record is eligible for an update: based on the Timestamp field, we can verify whether the sending node's record of a particular node is more current than the receiving node's record.

  3. Identify and handle failing nodes. For this step, we must once again iterate through the receiving node's membership list to identify which nodes are failure candidates and which nodes must be purged from the membership immediately. Nodes that are candidates for removal are recorded in a separate slice to avoid mutating the membership list during iteration; after iteration, they are purged from the list. A sketch of the full update function follows this list.
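The sketch below shows one way update_tablelist could implement these three steps. The suspicion and purge thresholds, the myID variable, and the package-level membershipList are assumptions made for illustration.

// update_tablelist merges an incoming membership list into this node's local
// copy and handles failure detection. Thresholds and names are assumptions.
func update_tablelist(incoming Args) {
    now := time.Now()

    // Step 1: make sure the sending node itself appears in our list.
    found := false
    for _, entry := range membershipList {
        if entry.NodeID == incoming.NodeID {
            found = true
            break
        }
    }
    if !found {
        membershipList = append(membershipList, HBTable{
            NodeID: incoming.NodeID, Status: 0, HBCount: 0, Timestamp: now,
        })
    }

    // Step 2: merge the sender's records, keeping whichever copy is fresher.
    for _, remote := range incoming.HBTableList {
        merged := false
        for i, local := range membershipList {
            if local.NodeID == remote.NodeID {
                if remote.Timestamp.After(local.Timestamp) {
                    membershipList[i] = remote // sender's record is more current
                }
                merged = true
                break
            }
        }
        if !merged {
            membershipList = append(membershipList, remote) // previously unknown node
        }
    }

    // Step 3: mark failure candidates and purge nodes silent for too long.
    const suspectAfter = 3 * time.Second // assumed suspicion threshold (t)
    const purgeAfter = 6 * time.Second   // assumed removal threshold (n * t)
    var toPurge []int
    for i := range membershipList {
        if membershipList[i].NodeID == myID {
            continue // never suspect or purge this node's own record
        }
        silence := now.Sub(membershipList[i].Timestamp)
        if silence > purgeAfter {
            toPurge = append(toPurge, membershipList[i].NodeID) // collect; purge after iteration
        } else if silence > suspectAfter {
            membershipList[i].Status = 1 // mark as a suspected failure
        }
    }
    for _, id := range toPurge {
        for i := range membershipList {
            if membershipList[i].NodeID == id {
                membershipList = append(membershipList[:i], membershipList[i+1:]...)
                break
            }
        }
    }
}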

Consider the runtime of the update_tablelist function. Summing the cost of the operations in this function gives roughly 2n^2 + n steps, which simplifies to a time complexity of O(n^2). While this approach ensures accurate membership updates, it becomes inefficient for large-scale distributed systems. Optimizing the operations by using hash maps for faster lookups, or by reducing redundant iterations, could help improve performance.

Closing Remarks

Implementing efficient membership tracking in distributed systems is key to maintaining reliable, scalable, and fault-tolerant operation. In this article, we have outlined a structured approach to implementing a heartbeat-style gossip node network. Hopefully, the details outlined in the implementation will help guide you in building your own distributed system. As technology advances and distributed systems grow in complexity, refining these common strategies will be crucial for optimizing performance and improving resilience.

Resources

If you would like to take a look at the project, my implementation is free to view and play around with. Have fun!
Heartbeat-Style Gossip

References

Distributed Systems, networks, and operating systems. Penn Computer & Information Science Highlights. (n.d.). https://highlights.cis.upenn.edu/distributed-systems-networks-and-operating-systems/
