
Ankush Choudhary Johal

Originally published at johal.in

Raft Consensus Algorithm: Comprehensive Start-to-Finish Guide

The Raft consensus algorithm, introduced by Diego Ongaro and John Ousterhout in their 2014 paper "In Search of an Understandable Consensus Algorithm," was designed to solve a critical problem in distributed systems: reaching consensus across a cluster of servers in a way that practitioners can actually understand, in contrast to the notoriously complex Paxos. This guide walks through every core component of Raft, from basic server states to advanced cluster management, to help you master its inner workings.

What Is Raft? Core Goals

Raft is a consensus algorithm for managing a replicated log across a cluster of servers. Its primary design goal was understandability: the authors prioritized breaking the problem into isolated, solvable subproblems over raw performance optimizations. Key goals include:

  • Strong leader: Only one leader per term handles all client requests and log replication, simplifying the workflow.
  • Leader election: Clusters must elect a new leader quickly when the current leader fails.
  • Log replication: The leader replicates its log to all followers, ensuring all servers store the same committed entries.
  • Safety: Raft guarantees that once an entry is committed, it will be present in all future leaders' logs, preventing data loss.

Raft Basics: Server States, Terms, and RPCs

Every Raft server operates in one of three states at any time:

  • Follower: Passive state; only responds to RPCs from leaders or candidates. If no heartbeat is received within an election timeout, the follower transitions to candidate.
  • Candidate: Actively campaigns to become the new leader by requesting votes from other servers.
  • Leader: Handles all client requests, replicates logs to followers, and sends periodic heartbeats to maintain authority.

Raft divides time into terms, which are numbered with consecutive integers. Each term starts with a leader election; if the election fails (e.g., split vote), the term ends without a leader, and a new term begins. Terms act as a logical clock, allowing servers to detect stale leaders or candidates.
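
To make these pieces concrete, here is a rough Go sketch of the per-server state. The field names follow the paper's Figure 2 where possible; everything else (the package layout, the sentinel entry at log index 0) is an illustrative assumption, not any particular implementation. The later snippets in this post build on these types.

```go
package raft

import "sync"

// State is the role a Raft server currently plays.
type State int

const (
    Follower State = iota
    Candidate
    Leader
)

// LogEntry is one replicated command, tagged with the term in which the
// leader appended it.
type LogEntry struct {
    Term    int
    Command []byte
}

// Server bundles the state every Raft server maintains. A sentinel entry
// at log index 0 (a common implementation trick) keeps the index
// arithmetic in the later sketches simple.
type Server struct {
    mu          sync.Mutex
    id          int
    peers       []int // IDs of the other servers in the cluster
    state       State
    currentTerm int // latest term seen; acts as the logical clock
    votedFor    int // who got our vote in currentTerm, or -1
    log         []LogEntry
    commitIndex int // highest log index known to be committed
}

// observeTerm applies the rule that a higher term in any RPC immediately
// demotes the receiver to follower.
func (s *Server) observeTerm(rpcTerm int) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if rpcTerm > s.currentTerm {
        s.currentTerm = rpcTerm
        s.state = Follower
        s.votedFor = -1
    }
}
```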

Raft uses two types of RPCs to coordinate:

  • RequestVote RPC: Sent by candidates to gather votes during leader elections.
  • AppendEntries RPC: Sent by leaders to replicate log entries and to serve as heartbeats (an AppendEntries carrying no log entries).
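
Their arguments translate directly into structs. The field lists below follow Figure 2 of the Raft paper; they live in the same sketch package as the types above:

```go
// RequestVoteArgs is sent by a candidate campaigning for leadership.
type RequestVoteArgs struct {
    Term         int // candidate's term
    CandidateID  int
    LastLogIndex int // index of the candidate's last log entry
    LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
    Term        int // receiver's currentTerm, so a stale candidate can step down
    VoteGranted bool
}

// AppendEntriesArgs replicates log entries; with an empty Entries slice
// it doubles as the heartbeat.
type AppendEntriesArgs struct {
    Term         int // leader's term
    LeaderID     int
    PrevLogIndex int // index of the entry immediately preceding Entries
    PrevLogTerm  int // term of that entry, used by the consistency check
    Entries      []LogEntry
    LeaderCommit int // leader's commitIndex, so followers can advance theirs
}

type AppendEntriesReply struct {
    Term    int
    Success bool // follower matched PrevLogIndex/PrevLogTerm and appended
}
```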

Leader Election

Leader election is triggered when a follower's election timeout (randomized, typically in the 150-300 ms range) expires without a heartbeat arriving from the current leader. The follower then:

  1. Increments its current term.
  2. Transitions to candidate state.
  3. Votes for itself.
  4. Sends RequestVote RPCs to all other servers in the cluster.

A candidate wins the election if it receives votes from a majority of the cluster (quorum). It then immediately sends AppendEntries heartbeats to all followers to establish its authority and prevent new elections. If a candidate receives an AppendEntries RPC from a valid leader with a term greater than or equal to its own, it steps down to follower.

Split votes occur when multiple followers become candidates at roughly the same time and no candidate receives a majority. Raft makes this rare with randomized election timeouts: each follower waits a random interval before campaigning, so simultaneous campaigns are unlikely, and when a split vote does occur, the next round's randomized timeouts almost always resolve it.
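
Putting the steps together, a campaign might look roughly like the sketch below, which reuses the Server and RPC types from the earlier snippets. callRequestVote is a hypothetical transport helper, not a real API:

```go
import (
    "math/rand"
    "time"
)

// electionTimeout returns a randomized timeout in the 150-300ms band,
// which is what makes simultaneous campaigns unlikely.
func electionTimeout() time.Duration {
    return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// startElection performs the four steps above and counts votes as the
// replies arrive. callRequestVote is a hypothetical RPC helper.
func (s *Server) startElection() {
    s.mu.Lock()
    s.currentTerm++     // step 1: new term
    s.state = Candidate // step 2: become a candidate
    s.votedFor = s.id   // step 3: vote for ourselves
    args := RequestVoteArgs{
        Term:         s.currentTerm,
        CandidateID:  s.id,
        LastLogIndex: len(s.log) - 1,
        LastLogTerm:  s.log[len(s.log)-1].Term, // sentinel entry keeps this safe
    }
    s.mu.Unlock()

    votes := 1 // our own vote; updated only while holding s.mu below
    for _, peer := range s.peers {
        go func(peer int) { // step 4: ask everyone else
            var reply RequestVoteReply
            if !s.callRequestVote(peer, &args, &reply) {
                return // unreachable peer; the campaign just gets fewer votes
            }
            s.mu.Lock()
            defer s.mu.Unlock()
            if reply.Term > s.currentTerm { // someone is ahead of us: step down
                s.currentTerm, s.state, s.votedFor = reply.Term, Follower, -1
                return
            }
            if reply.VoteGranted && s.state == Candidate && args.Term == s.currentTerm {
                votes++
                if votes > (len(s.peers)+1)/2 { // majority of the full cluster
                    s.state = Leader // a real leader now sends heartbeats immediately
                }
            }
        }(peer)
    }
}
```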

Log Replication

Once a leader is elected, it begins handling client requests. Each request is a command to be executed by the replicated state machine. The leader:

  1. Appends the command as a new entry to its local log, with the current term and index.
  2. Sends AppendEntries RPCs to all followers to replicate the entry.
  3. Waits until a majority of servers have replicated the entry (quorum).
  4. Commits the entry and applies it to its state machine; followers learn the new commit index from the LeaderCommit field of subsequent AppendEntries RPCs and apply the entry themselves.
  5. Returns the result to the client.
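
A condensed sketch of the leader side of this flow, continuing the earlier types; replicateToPeer and applyToStateMachine are hypothetical stand-ins for machinery (per-follower nextIndex tracking, the actual state machine) that a real implementation would need:

```go
import "errors"

// propose runs steps 1-5 for one client command. Retries, timeouts, and
// per-follower nextIndex bookkeeping are elided.
func (s *Server) propose(command []byte) ([]byte, error) {
    s.mu.Lock()
    if s.state != Leader {
        s.mu.Unlock()
        return nil, errors.New("not the leader")
    }
    // Step 1: append to the local log under the current term.
    s.log = append(s.log, LogEntry{Term: s.currentTerm, Command: command})
    index := len(s.log) - 1
    s.mu.Unlock()

    // Steps 2-3: fan out AppendEntries and wait for a majority to ack.
    acks := make(chan bool, len(s.peers))
    for _, peer := range s.peers {
        go func(peer int) { acks <- s.replicateToPeer(peer, index) }(peer) // hypothetical
    }
    needed := (len(s.peers)+1)/2 + 1 // quorum, counting the leader itself
    got := 1
    for range s.peers {
        if <-acks {
            got++
        }
        if got >= needed {
            break
        }
    }
    if got < needed {
        return nil, errors.New("entry not replicated to a quorum")
    }

    // Steps 4-5: commit, apply, and answer the client. Followers learn
    // the new commit index from LeaderCommit in later AppendEntries.
    s.mu.Lock()
    s.commitIndex = index
    s.mu.Unlock()
    return s.applyToStateMachine(index), nil // hypothetical
}
```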

Raft enforces the log matching property: if two logs contain an entry with the same index and term, then they store the same command in that entry, and the two logs are identical in all preceding entries. This is guaranteed by a consistency check: every AppendEntries RPC includes the index and term of the entry immediately preceding the new ones, and a follower rejects the RPC if its own log has no matching entry at that position. The leader then backs up and retries with an earlier previous entry until a match is found, after which the follower's conflicting suffix is overwritten.
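
The follower's side of that consistency check is compact. A sketch of the relevant part of an AppendEntries handler, with everything else elided:

```go
// handleAppendEntries shows only the log-matching consistency check; term
// comparison, heartbeat handling, and commit-index updates are elided.
func (s *Server) handleAppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
    s.mu.Lock()
    defer s.mu.Unlock()
    reply.Term = s.currentTerm

    // Reject if our log has no entry at PrevLogIndex or the terms
    // disagree; the leader will retry with an earlier previous entry.
    if args.PrevLogIndex >= len(s.log) ||
        s.log[args.PrevLogIndex].Term != args.PrevLogTerm {
        reply.Success = false
        return
    }

    // Matched: drop any conflicting suffix and append the new entries.
    // (A careful implementation truncates only on an actual conflict, so
    // a stale, reordered RPC cannot erase newer entries.)
    s.log = append(s.log[:args.PrevLogIndex+1], args.Entries...)
    reply.Success = true
}
```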

Safety Guarantees

Raft provides several critical safety guarantees to prevent inconsistent state:

  • Election restriction: A candidate can only win an election if its log contains all committed entries from previous terms. This is enforced by the RequestVote RPC: a server denies a vote if the candidate’s log is less up-to-date than its own (comparing last term, then last index).
  • Commitment rule: A leader can only commit entries from its current term once they are replicated to a majority of servers. Entries from previous terms are committed implicitly when a later entry from the current term is committed.
  • Leader completeness: If a log entry is committed in a given term, all future leaders will have that entry in their logs. This follows from the election restriction and quorum overlap.
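
The "at least as up-to-date" comparison behind the election restriction boils down to two comparisons. A sketch of how a server might evaluate it inside its RequestVote handler (again assuming the sentinel log entry from earlier):

```go
// logUpToDate reports whether a candidate's log, described by its last
// entry's index and term, is at least as up-to-date as ours: the later
// last term wins; on a tie, the longer log wins.
func (s *Server) logUpToDate(lastLogIndex, lastLogTerm int) bool {
    ourLastIndex := len(s.log) - 1
    ourLastTerm := s.log[ourLastIndex].Term
    if lastLogTerm != ourLastTerm {
        return lastLogTerm > ourLastTerm
    }
    return lastLogIndex >= ourLastIndex
}
```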

Cluster Membership Changes

Changing the set of servers in a Raft cluster (e.g., adding or removing nodes) must be done without downtime. Raft originally used joint consensus: the cluster transitions through an intermediate state where decisions require majorities from both the old and new server sets. Once the joint consensus configuration is replicated to a majority of both sets, the cluster switches to the new configuration.

Later optimizations introduced single-server membership changes, which avoid the joint consensus state by only adding or removing one server at a time. This works because the quorum overlap between consecutive configurations is sufficient to maintain safety.
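
The essential mechanic of joint consensus is that every decision must clear two quorums at once. A sketch, with an illustrative configuration type of my own rather than anything from a real implementation:

```go
// JointConfig is the transitional C(old,new) configuration: while it is
// active, elections and commitment need a majority of BOTH server sets.
type JointConfig struct {
    Old []int // server IDs in the outgoing configuration
    New []int // server IDs in the incoming configuration
}

// committed reports whether the set of acknowledging server IDs forms a
// majority in both the old and the new configuration.
func (c JointConfig) committed(acked map[int]bool) bool {
    return majorityOf(c.Old, acked) && majorityOf(c.New, acked)
}

func majorityOf(ids []int, acked map[int]bool) bool {
    count := 0
    for _, id := range ids {
        if acked[id] {
            count++
        }
    }
    return count > len(ids)/2
}
```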

Log Compaction (Snapshots)

As the replicated log grows, it consumes more disk space and increases the time to replay the log for new servers. Raft uses snapshots to compact the log: each server periodically writes the state of its state machine to a snapshot, which replaces all log entries up to the snapshot’s last included index and term. Leaders send InstallSnapshot RPCs to followers that are too far behind to catch up via normal log replication.
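
A sketch of the bookkeeping on the server taking a snapshot; persistence and the InstallSnapshot RPC itself are elided, and note that after compaction a real implementation must translate logical log indices into positions in the truncated slice:

```go
// Snapshot records where the log was cut, so the AppendEntries
// consistency check still works for the first retained entry.
type Snapshot struct {
    LastIncludedIndex int
    LastIncludedTerm  int
    StateMachineData  []byte
}

// compactLog discards every entry up to and including upTo, keeping that
// entry's index and term in the snapshot metadata. upTo must not exceed
// commitIndex: only applied state may be snapshotted.
func (s *Server) compactLog(upTo int, state []byte) Snapshot {
    s.mu.Lock()
    defer s.mu.Unlock()
    snap := Snapshot{
        LastIncludedIndex: upTo,
        LastIncludedTerm:  s.log[upTo].Term,
        StateMachineData:  state,
    }
    // Keep a sentinel carrying the snapshot's term so log matching has
    // something to compare against at the cut point.
    s.log = append([]LogEntry{{Term: snap.LastIncludedTerm}}, s.log[upTo+1:]...)
    return snap
}
```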

Client Interaction

Clients send requests to the Raft leader; if a client contacts a follower, the follower redirects it to the current leader. To ensure linearizable semantics (each request appears to execute exactly once, atomically, at some point between its invocation and its response), clients attach a unique serial number to every request. The state machine remembers the latest serial number processed for each client, along with the associated response, so a retried request is answered from the cache instead of being executed twice.
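
Duplicate detection is typically a small per-client table inside the state machine. A minimal sketch, with illustrative names:

```go
// sessionTable remembers, per client, the highest serial number applied
// and the response it produced, so a retried request can be answered
// without executing its command a second time.
type sessionTable struct {
    latest map[string]int64  // clientID -> last applied serial number
    result map[string][]byte // clientID -> cached response for that serial
}

func newSessionTable() *sessionTable {
    return &sessionTable{
        latest: make(map[string]int64),
        result: make(map[string][]byte),
    }
}

// apply executes exec only if this serial number has not been seen
// before; duplicates get the cached response. The state machine calls
// this as it applies committed entries, so every replica stays in
// agreement about which requests have run.
func (t *sessionTable) apply(clientID string, serial int64, exec func() []byte) []byte {
    if serial <= t.latest[clientID] {
        return t.result[clientID] // duplicate or stale retry
    }
    out := exec()
    t.latest[clientID] = serial
    t.result[clientID] = out
    return out
}
```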

Real-World Use Cases

Raft is widely used in production distributed systems:

  • etcd: The key-value store used for metadata in Kubernetes.
  • Consul: HashiCorp’s service discovery and configuration tool.
  • CockroachDB: A distributed SQL database that uses Raft for range replication.
  • TiKV: The distributed key-value store underlying TiDB.

Common Pitfalls and Best Practices

  • Always randomize election timeouts to avoid split votes.
  • Ensure quorum size is correctly calculated: a cluster of n servers requires a majority of ⌊n/2⌋ + 1 votes to elect a leader or commit an entry (see the sketch after this list).
  • Test failure scenarios (leader crashes, network partitions) thoroughly. Raft preserves safety under any failure pattern, but it can only make progress while a majority of servers are up and able to communicate.
  • Use snapshots regularly to prevent log bloat, especially in high-throughput workloads.
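
The quorum arithmetic from the second bullet is worth pinning down in code, if only because off-by-one errors here break safety:

```go
// quorum returns the number of votes or acknowledgements required in a
// cluster of n servers: a strict majority, i.e. floor(n/2) + 1.
func quorum(n int) int {
    return n/2 + 1
}

// quorum(3) == 2 and quorum(5) == 3, so a 5-node cluster tolerates two
// failures; quorum(4) == 3, so adding a fourth node to a 3-node cluster
// raises the quorum without improving fault tolerance.
```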

Conclusion

Raft’s focus on understandability has made it the go-to consensus algorithm for modern distributed systems. By breaking consensus into leader election, log replication, and safety subproblems, it provides a clear, implementable alternative to Paxos. Mastering Raft is essential for any engineer working on distributed systems, as it underpins many of the tools and databases that power cloud-native infrastructure today.
