
Joseph-Peter Yoofi Brown-Pobee

Posted on July 18, 2022

Consensus and coordination with Raft

Tags: Distributed Systems

This article intends to give a broad overview of the Raft algorithm as it relates to consensus and coordination in distributed systems. Some details are omitted, and readers are encouraged to explore the topics in more depth for a fuller understanding.

Most small- to medium-scale software we build can have a simplified design thanks to synchrony and centralization. A client makes a request to our one server, which queries our one data store, performs operations and sends back a response, all happening atomically. Our system is predictable, consistent and deals with little complexity.

Unfortunately this does not bode well when we need to scale. Our synchronous, centralized system sees performance dwindle as it reaches capacity, leading to long response times or no responses at all, as well as more frequent crashes.

Horizontal scaling can help spread the load our system is under and improve performance. Scaling, however, adds complexity: there are more nodes and processes to deal with to ensure clients obtain the same results from our now-distributed system. Load balancers, caches, content delivery networks, proxy servers, databases and so on are all new elements of our previously centralized system that we must get acquainted with on its journey to scale.

Coordination is an important part of ensuring that our client cannot tell the difference between our previous centralized, single-node system and our now-distributed one. At the end of the day, a client sends a request and expects a response regardless of how the system has been designed and architected.

Consensus deals with how to ensure that a group of nodes or processes works together to agree on a value and that this value is present and replicated everywhere. Consensus is a big part of achieving coordination in distributed systems; achieving it successfully results in reliable, consistent and fault-tolerant systems.

Leader election and replication are two processes that come up when discussing consensus.

Leader election is a process whereby one node among a group of nodes is assigned privileges to access some shared resource, distribute or coordinate some piece of work, and so on. It is useful when:

  • There is some complex work that can be achieved in multiple parts, so one process assigns and delegates the work to other processes and then collates the results to send a response
  • A group of nodes perform similar functions and there is a need for a clear single process to be the point of contact
  • There is a need for strong consistency, hence having a leader process service all requests to ensure a consistent view of data

Leader election is important to consensus because it provides a central point from which a value is proposed, which other processes can then agree on and commit.

Replication is the process of ensuring that a value obtained from an operation can be found across all relevant nodes. It increases availability of distributed systems since multiple nodes can be queried to obtain data as opposed to a single centralized node.

Leader election combined with replication allows distributed systems to remain consistent and available, and ensures operations are durable and fault tolerant. By electing a leader, we have a central point to route requests to, thereby simplifying our system and reducing concurrency concerns. Replication ensures that changes and actions made by the leader can be observed on other nodes (with varying real-time guarantees). Of course, there are more nuanced activities that take place to ensure these guarantees can be achieved.

Raft is a consensus algorithm that organises processes as state machines. The idea is that processes move from one state to another as a result of operations, and if each operation is applied in the same order on every state machine, they all arrive at the same state.
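
To make that idea concrete, here is a minimal Go sketch. The `StateMachine` and `Op` types are hypothetical, not from any Raft library; the point is only that two machines fed the same operations in the same order end up in identical states.

```go
package main

import "fmt"

// Op is a hypothetical operation; a real system would define its own commands.
type Op struct {
	Key, Value string
}

// StateMachine is a toy deterministic state machine: its state is fully
// determined by the sequence of operations applied to it.
type StateMachine struct {
	state map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

func (sm *StateMachine) Apply(op Op) {
	sm.state[op.Key] = op.Value
}

func main() {
	// The same log of operations, applied in the same order on two machines...
	log := []Op{{"x", "1"}, {"y", "2"}, {"x", "3"}}

	a, b := NewStateMachine(), NewStateMachine()
	for _, op := range log {
		a.Apply(op)
		b.Apply(op)
	}

	// ...leaves both machines in the same state.
	fmt.Println(a.state, b.state) // map[x:3 y:2] map[x:3 y:2]
}
```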

Raft leader election consists of processes being in one of three states (a code sketch of these states follows the list):

  • Follower: A process in a follower state recognises another process as the leader
  • Candidate: A process in the candidate state is contending to be the next leader. A process transitions from the follower state to the candidate state when it stops receiving heartbeats from a leader.
  • Leader: A process in the leader state sends heartbeats to the other processes until it fails
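
As a rough illustration, here is a toy Go version of those states and the two transitions just described. The `Role` and `Node` types and the method names are my own, and a real implementation would also step back down to follower on seeing a higher term.

```go
package main

import "fmt"

// Role is one of the three Raft states.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type Node struct {
	role Role
	term int
}

// onElectionTimeout models a follower that has stopped hearing heartbeats:
// it becomes a candidate and starts a new term (in full Raft it would also
// vote for itself and request votes from its peers).
func (n *Node) onElectionTimeout() {
	if n.role == Follower {
		n.role = Candidate
		n.term++
	}
}

// onMajorityVotes models a candidate that has won the election.
func (n *Node) onMajorityVotes() {
	if n.role == Candidate {
		n.role = Leader
	}
}

func main() {
	n := &Node{role: Follower, term: 1}
	n.onElectionTimeout()
	n.onMajorityVotes()
	fmt.Println(n.role == Leader, n.term) // true 2
}
```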

Time in the Raft algorithm is divided into terms of arbitrary length. Every term has at most one leader, and leader election occurs at the beginning of every term.

All processes start out as followers. Each process has a randomized election timer (the Raft paper suggests 150ms to 300ms) within which it expects to receive heartbeats from a leader process. If it does not receive heartbeats, it transitions to the candidate state, votes for itself and requests votes from the other processes. A process becomes leader if it obtains a majority of votes. In the event of a split vote, another election takes place. The randomized election timers reduce the occurrence of split votes and help ensure elections complete successfully.
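
A small sketch of that randomized timer, assuming the 150ms to 300ms range above:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in [150ms, 300ms). Because each
// node draws its own value, nodes rarely time out at the same instant, which
// keeps split votes uncommon.
func electionTimeout() time.Duration {
	const minTimeout = 150 * time.Millisecond
	const maxTimeout = 300 * time.Millisecond
	return minTimeout + time.Duration(rand.Int63n(int64(maxTimeout-minTimeout)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println("node", i, "timeout:", electionTimeout())
	}
}
```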

If a leader dies, the other processes stop receiving heartbeats, and follower processes that time out will transition to candidates to elect a leader for the next term. Terms serve as logical time for processes, since physical clocks are not reliable in distributed systems.

That takes care of identifying a process to serve as leader. However, when a leader fails and a new one is elected, how do we ensure that changes and operations on the previous leader are reflected on the new leader and the rest of the cluster? This is where replication comes in. A desirable replication algorithm ensures that committed operations are durable and consistent, and that replicas eventually reach all available processes even in the event of failures.

With Raft's replication algorithm, once a leader has been elected, it maintains a log of operations performed, which it replicates to follower processes that keep their own copies of the log. After each operation, the leader appends the new operation to its log and sends an AppendEntries remote procedure call (RPC) to all its followers to do the same. When it hears back successfully from a majority of followers, it proceeds to commit the entry. Log entries are not committed until the leader hears back from a majority of its followers.
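
A sketch of that commit rule, assuming the field names from the Raft paper's AppendEntries arguments; the counting helper is hypothetical:

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// AppendEntriesArgs mirrors (a subset of) the arguments of the AppendEntries
// RPC in the Raft paper: the new entries plus the index and term of the entry
// that immediately precedes them in the leader's log.
type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

// commitIfMajority advances the leader's commit index only once a majority of
// the cluster (the leader counts itself) has appended the entry.
func commitIfMajority(followerAcks, clusterSize, commitIndex, newIndex int) int {
	if followerAcks+1 > clusterSize/2 {
		return newIndex
	}
	return commitIndex
}

func main() {
	// 5-node cluster: 2 follower acks plus the leader is 3, a majority.
	fmt.Println(commitIfMajority(2, 5, 0, 1)) // 1: entry committed
	// 1 ack plus the leader is only 2: the commit index stays where it was.
	fmt.Println(commitIfMajority(1, 5, 0, 1)) // 0: not yet committed
}
```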

If a follower process dies, the leader keeps sending AppendEntries requests until it comes back online. The index (and term) of the log entry immediately preceding the new entries is included in each RPC. AppendEntries RPCs are idempotent, so a process that receives the same call multiple times observes the same effect. When a process comes back online and receives an AppendEntries call, it compares the included index with its own last log index and, if they do not match, rejects the call. The leader responds to this rejection by sending another RPC containing the last two log entries; if the process rejects that, it retries with the last three, and so on, until it reaches a log entry that matches the follower's log, at which point the follower appends all the entries. In this way follower processes catch up even after they fail and restart.
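
Here is a simplified sketch of that follower-side consistency check. The `handleAppendEntries` helper is hypothetical, indexing is zero-based for brevity (the paper uses one-based indexing), and a real implementation would handle conflicting suffixes and term checks more carefully:

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// handleAppendEntries rejects the call unless the follower's log contains an
// entry at prevLogIndex with prevLogTerm. Overwriting from the match point
// and appending makes repeated deliveries of the same call idempotent.
func handleAppendEntries(log []Entry, prevLogIndex, prevLogTerm int, entries []Entry) ([]Entry, bool) {
	if prevLogIndex >= 0 {
		if prevLogIndex >= len(log) || log[prevLogIndex].Term != prevLogTerm {
			return log, false // mismatch: reject so the leader backs up
		}
	}
	log = append(log[:prevLogIndex+1], entries...)
	return log, true
}

func main() {
	followerLog := []Entry{{Term: 1, Command: "x=1"}}

	// The leader sends an entry whose predecessor this follower never saw.
	_, ok := handleAppendEntries(followerLog, 1, 1, []Entry{{2, "z=3"}})
	fmt.Println(ok) // false: rejected, so the leader retries further back

	// The retry includes the earlier entry too; now the logs line up.
	newLog, ok := handleAppendEntries(followerLog, 0, 1, []Entry{{1, "y=2"}, {2, "z=3"}})
	fmt.Println(ok, len(newLog)) // true 3
}
```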

If a leader fails, a process that lacks some committed entries could stand for election, since the old leader only needed a majority of followers to append an entry in order to commit it; a minority of followers may never have received it. To prevent such a process from winning, the last log entries of processes are compared during elections: a process will not vote for a candidate whose log is less up to date than its own (comparing the term of the last entry first, then its index).
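
A minimal sketch of that voting restriction, with hypothetical parameter names:

```go
package main

import "fmt"

// logUpToDate reports whether a candidate's log is at least as up to date as
// the voter's: the term of the last entry is compared first, and the last
// index breaks ties. A voter only grants its vote when this returns true.
func logUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	fmt.Println(logUpToDate(2, 5, 2, 7)) // false: same term but shorter log
	fmt.Println(logUpToDate(3, 4, 2, 7)) // true: a later last term wins
}
```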

Raft's leader election and replication give us a measure of coordination and let us reason about our distributed system as though it were a single process. Consistency guarantees over the view of data, however, are an entire discussion of their own that may be looked at in another article.

Making our centralized systems distributed can help us tackle scale and increase performance at the cost of increased complexity. Understanding and implementing coordination can help us reason about and work with this complexity, ensuring that our system can still serve its clients even when there are more cooks in the kitchen.

You can learn more about Raft and consensus algorithms from Diego Ongaro and John Ousterhout’s paper “In Search of an Understandable Consensus Algorithm”.

You can also find a cool visualization of Raft here
