Distributed Systems: Raft Algorithm for Leader Election

#distributedsystems #softwareengineering #programming #systemdesign

Welcome to another article in the distributed systems series. We will discuss the Raft algorithm and how a leader is elected in a replicated system when the leader goes down and followers become candidates.

Safety and Liveness

In a replicated system, safety means that at any time there is at most one leader. Liveness ensures that during failure, a leader is re-elected.

States of the Machine

Any process, at a given time, is in one of three states — leader, follower, or candidate. Every election is identified by a term value, which is simply an integer.

When the system starts up, all processes are in the follower state.

Ideally, every follower must receive heartbeats from the current leader containing the election term information. If a heartbeat is not received and a timeout occurs, the follower concludes that the leader is dead. At that point, the follower starts a new election by incrementing the current term and transitioning to the candidate state. It then votes for itself and sends a request to all processes in the system to vote for it, stamping the request with the current election term.

From here, three things can happen:

The process wins the election: This happens if a majority of other processes vote for the candidate. It then transitions into the leader state and starts sending requests to other processes. Note that other processes vote for at most one candidate on a first-come, first-served basis for a given term.
Another process wins the election: If the candidate receives a heartbeat from any other process having a term greater than or equal to the current term, it accepts the other process as the leader. It then transitions back to the follower state.
A period of time passes with no winner: This rarely happens, but if no candidate receives a majority vote, a re-election is conducted.

Conclusion

The Raft algorithm ensures reliable leader election in a replicated system by using terms, heartbeats, and majority voting. It guarantees safety by allowing only one leader at a time and ensures liveness by electing a new leader when the current one fails. This makes Raft simple, predictable, and suitable for building fault-tolerant distributed systems.

Follow for more articles on distributed systems and computer science.