Anapeksha Mukherjee

Posted on Jun 7

Raft, Quorum, and High Availability

#architecture #database #distributedsystems

If you run one database node, life is simple. There is one copy of the data, one process accepting writes, and one place to recover from.

The moment you run three nodes, you get a harder question:

If these machines disagree, which one is right?

That is the problem Raft is built to solve.

Raft is a consensus algorithm. More specifically, it is a way for a group of machines to agree on a single ordered log of operations, even when some machines crash, restart, lag behind, or lose network connectivity.

It shows up in databases, metadata stores, schedulers, service discovery systems, and control planes because it gives those systems a practical answer to a dangerous question: how do we stay available without letting different nodes invent different versions of reality?

The Core Idea

Raft turns a cluster of nodes into one logical state machine.

Instead of letting every node mutate state independently, Raft makes nodes agree on a log:

1. SET user:1 "Ada"
2. SET user:2 "Grace"
3. DELETE session:abc

Each node applies the same committed log entries in the same order. If the state machine is deterministic, the nodes end up with the same state.

That is the whole trick.

same log + same order + deterministic apply = same state
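As a minimal sketch of that equation, here is a toy key-value state machine. The command tuple format and the `apply_log` helper are assumptions made for illustration; they are not part of Raft itself.

```python
from typing import Dict, List, Tuple

# Hypothetical command format for this sketch: (operation, key, value).
Command = Tuple[str, str, str]


def apply_log(log: List[Command]) -> Dict[str, str]:
    """Apply committed entries in order, starting from an empty state."""
    state: Dict[str, str] = {}
    for op, key, value in log:
        if op == "SET":
            state[key] = value
        elif op == "DELETE":
            state.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


committed_log: List[Command] = [
    ("SET", "user:1", "Ada"),
    ("SET", "user:2", "Grace"),
    ("DELETE", "session:abc", ""),
]

# Two replicas applying the same log in the same order reach the same state.
node_a_state = apply_log(committed_log)
node_b_state = apply_log(committed_log)
```

The apply function has no randomness, clocks, or external input, which is what makes it deterministic.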

Raft is not really about key-value stores or SQL or config files. Those are just things you can build on top. Raft is about agreeing on the order of changes.

The Three Roles

Each Raft node is in one of three roles:

Follower
Candidate
Leader

A follower is passive. It accepts messages from the leader and votes in elections.

A candidate is trying to become leader.

The leader is the node currently responsible for handling writes and replicating log entries to the rest of the cluster.

In a healthy cluster, there is one leader for the current term.

node-a: leader
node-b: follower
node-c: follower

Clients usually send writes to the leader. Some systems let clients connect to any node, but under the hood the write still has to reach the leader or go through an equivalent consensus path.

Terms: Raft's Logical Clock

Raft uses numbered terms:

term 1
term 2
term 3

A term is not wall-clock time. It is a logical epoch.

Every election happens in a term. If a candidate wins, it becomes leader for that term.

Terms help nodes detect stale information. If a node receives a message from an older term, it knows the sender is behind. If it sees a newer term, it updates itself and steps down if needed.

Example:

node-a thinks it is leader in term 4
node-b has already seen term 5

If node-a sends a leader message with term 4, node-b rejects it and replies with its own term, 5. When node-a sees the newer term, it steps down to follower. That prevents old leaders from continuing to act as the authority after the cluster has moved on.
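The term rule can be sketched in a few lines. The `Node` class and `on_message_term` function below are hypothetical names for this example, and a real implementation would also persist the term and reset its vote.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    current_term: int
    role: str = "follower"  # "follower", "candidate", or "leader"


def on_message_term(node: Node, message_term: int) -> bool:
    """Return True if the message may be processed, False if it is stale."""
    if message_term < node.current_term:
        return False  # the sender is behind, so reject
    if message_term > node.current_term:
        # A newer term exists: adopt it and give up any leadership claim.
        node.current_term = message_term
        node.role = "follower"
    return True


node_a = Node("node-a", current_term=4, role="leader")
node_b = Node("node-b", current_term=5)

# node-a's heartbeat carries term 4, which node-b treats as stale.
accepted_by_b = on_message_term(node_b, message_term=node_a.current_term)

# Raft replies carry the responder's term, so node-a learns about term 5.
accepted_by_a = on_message_term(node_a, message_term=node_b.current_term)
```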

What Quorum Means

A quorum is the minimum number of voting nodes needed to make a decision.

In Raft, quorum usually means a majority:

quorum = floor(N / 2) + 1

Voting Nodes	Quorum
1	1
2	2
3	2
5	3
7	4

The important part is not the formula. The important part is overlap.

Any two majorities must share at least one node.

In a 3-node cluster:

majority 1: node-a + node-b
majority 2: node-b + node-c

Both include node-b.

In a 5-node cluster:

majority 1: node-a + node-b + node-c
majority 2: node-c + node-d + node-e

Both include node-c.

That overlap is what stops two isolated groups from making conflicting committed decisions.
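To make the overlap claim concrete, here is a small brute-force check. The function names are assumptions for this sketch. It only enumerates minimum-size majorities, which is enough because any larger majority contains one of them.

```python
from itertools import combinations
from typing import Sequence


def quorum(voting_nodes: int) -> int:
    """Majority size: floor(N / 2) + 1."""
    return voting_nodes // 2 + 1


def majorities_always_overlap(nodes: Sequence[str]) -> bool:
    """Check that every pair of minimum-size majorities shares a node."""
    groups = [set(group) for group in combinations(nodes, quorum(len(nodes)))]
    return all(a & b for a, b in combinations(groups, 2))


three = ["node-a", "node-b", "node-c"]
five = ["node-a", "node-b", "node-c", "node-d", "node-e"]

overlap_in_three = majorities_always_overlap(three)
overlap_in_five = majorities_always_overlap(five)
```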

Why Quorum Matters

Suppose we have three nodes:

node-a
node-b
node-c

The leader receives:

SET color "blue"

If the leader writes locally and immediately returns success, the write is fragile. The leader could crash before anyone else receives the entry.

Raft waits for the entry to be replicated to a quorum.

For three nodes, quorum is two:

node-a: SET color "blue"
node-b: SET color "blue"
node-c: not yet replicated

Once two nodes have the entry, the leader can mark it committed.

Why is that safe?

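Because any future leader also needs a quorum to win an election. Since quorums overlap, at least one voter in any future election already holds the committed entry. Raft adds one more rule to close the gap: a node refuses to vote for a candidate whose log is less up to date than its own, so a candidate that is missing the entry cannot collect a majority.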

That is the safety property. A committed entry cannot simply vanish because one machine died.
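Overlap only helps because voters compare logs before granting a vote. A minimal sketch of that comparison, with a hypothetical function name and plain integers standing in for real log metadata:

```python
def candidate_log_is_up_to_date(
    candidate_last_term: int,
    candidate_last_index: int,
    voter_last_term: int,
    voter_last_index: int,
) -> bool:
    """Voter-side check: grant a vote only if the candidate's log is at
    least as up to date as the voter's own log."""
    if candidate_last_term != voter_last_term:
        # The log whose last entry has the later term is more up to date.
        return candidate_last_term > voter_last_term
    # Same last term: the longer log is more up to date.
    return candidate_last_index >= voter_last_index


# The voter holds one more entry in the same term, so it refuses.
shorter_log_gets_vote = candidate_log_is_up_to_date(3, 5, 3, 6)

# A later last term wins even though the candidate's log is shorter.
newer_term_gets_vote = candidate_log_is_up_to_date(4, 2, 3, 9)

# Identical logs are acceptable.
equal_log_gets_vote = candidate_log_is_up_to_date(3, 6, 3, 6)
```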

Leader Election

Followers expect regular heartbeats from the leader.

If a follower does not hear from the leader before its election timeout, it assumes the leader may be gone and starts an election:

It becomes a candidate.
It increments its term.
It votes for itself.
It asks the other nodes for votes.
If it receives votes from a quorum, it becomes leader.

For a 3-node cluster:

node-a: candidate, votes for node-a
node-b: votes for node-a
node-c: no response

node-a has two votes. That is a majority, so it becomes leader.

Raft uses randomized election timeouts to reduce split votes. If every follower started an election at exactly the same time, each might vote for itself and nobody would win. Randomized timeouts make it likely that one node starts first and collects votes before the others become candidates.
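Here is a rough sketch of randomized timeouts and vote counting. The 150 to 300 ms range is only an illustrative choice, and the helper names are assumptions for this example.

```python
import random


def random_election_timeout(
    rng: random.Random, low_ms: float = 150.0, high_ms: float = 300.0
) -> float:
    """Pick a timeout from a range so nodes rarely time out together."""
    return rng.uniform(low_ms, high_ms)


def wins_election(votes_received: int, voting_nodes: int) -> bool:
    """A candidate wins once it holds votes from a majority of voters."""
    return votes_received >= voting_nodes // 2 + 1


rng = random.Random(7)  # seeded only to make the sketch repeatable
timeouts = {
    name: random_election_timeout(rng)
    for name in ("node-a", "node-b", "node-c")
}

# The node with the shortest timeout becomes a candidate first.
first_candidate = min(timeouts, key=timeouts.get)

# Its own vote plus one other is a majority of three.
elected = wins_election(votes_received=2, voting_nodes=3)
```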

Log Replication

Once there is a leader, writes go through the leader.

The flow looks like this:

client -> leader: SET user:1 "Ada"
leader: append entry to local log
leader -> followers: replicate entry
followers -> leader: acknowledged
leader: commit after quorum
leader -> client: success

The leader does not need every follower to acknowledge the write. It needs a quorum.

In a 5-node cluster, quorum is 3. The leader plus two followers is enough to commit.

That is why a Raft cluster can keep working even when some nodes are down.
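One way a leader can find the commit point is to track the highest index each voter has stored and take the value that a majority has reached. This is a sketch with assumed names. Raft as specified also requires the entry at that index to come from the leader's current term before the commit index advances, and that check is left out here.

```python
from typing import Dict


def majority_replicated_index(match_index: Dict[str, int]) -> int:
    """Highest log index stored on a majority of voters, leader included."""
    indexes = sorted(match_index.values(), reverse=True)
    quorum = len(indexes) // 2 + 1
    # The quorum-th largest value is held by at least `quorum` nodes.
    return indexes[quorum - 1]


# node-a is the leader; node-d is lagging and node-e is down.
healthy = {"node-a": 7, "node-b": 7, "node-c": 7, "node-d": 4, "node-e": 0}
lagging = {"node-a": 7, "node-b": 7, "node-c": 5, "node-d": 4, "node-e": 0}

commit_when_healthy = majority_replicated_index(healthy)
commit_when_lagging = majority_replicated_index(lagging)
```

In the second case only two nodes hold index 7, so the commit point stays at 5 until a third node catches up.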

Committed vs Applied

Two words matter a lot in Raft:

committed
applied

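An entry is committed when the leader has replicated it to a quorum and advanced the commit index to cover it. From that point on, Raft guarantees the entry survives leader changes.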

An entry is applied when a node has actually executed it against its local state machine.

A follower can know that entry 100 is committed but only have applied through entry 98.

That distinction matters for reads. A node that has not applied the latest committed entries may return stale data if it serves reads directly from local state.
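The gap between the two indexes is easy to see in a toy replica. The `Replica` class below is a hypothetical sketch that stores log entries as (key, value) pairs.

```python
from typing import Dict, List, Optional, Tuple


class Replica:
    def __init__(self, log: List[Tuple[str, str]]) -> None:
        self.log = log            # log[0] holds log index 1
        self.commit_index = 0     # highest index known to be committed
        self.last_applied = 0     # highest index applied to local state
        self.state: Dict[str, str] = {}

    def apply_committed(self, max_entries: Optional[int] = None) -> int:
        """Apply committed entries in order; return how many were applied."""
        applied = 0
        while self.last_applied < self.commit_index:
            if max_entries is not None and applied >= max_entries:
                break
            key, value = self.log[self.last_applied]
            self.state[key] = value
            self.last_applied += 1
            applied += 1
        return applied


replica = Replica([("color", "blue"), ("color", "green"), ("size", "L")])
replica.commit_index = 3          # the replica knows all three are committed
replica.apply_committed(max_entries=1)
stale_value = replica.state["color"]   # local state still lags the commit

replica.apply_committed()
fresh_value = replica.state["color"]
```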

Reads Are Subtle

Writes naturally go through the log. Reads are easier to get wrong.

Imagine an old leader gets separated from the rest of the cluster. It still has data. It still has a process running. It may even still believe it is leader for a short time.

Meanwhile, the majority side elects a new leader and commits newer writes.

If the old leader serves reads without proving it still has authority, it can return stale answers.

There are a few common ways to handle this.

Route Reads To The Leader

This is simple and common.

The read goes to the leader, and the leader confirms it still has quorum authority before serving the read.

This is correct, but every linearizable read then goes through the leader, which can make the leader a throughput bottleneck.

ReadIndex

Raft ReadIndex is an optimization that serves linearizable reads without appending each read to the log.

The leader establishes a safe read point, usually by confirming authority with a quorum. A follower can then wait until it has applied through that index and serve the read locally.

The important part is the wait:

safe read index = 120
follower applied index = 118
wait
follower applied index = 120
serve read

Without that wait, the follower may be stale.
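The wait can be sketched as a small queue on the follower. `FollowerReadQueue` and its methods are names invented for this example; a real system would obtain the safe read index from the leader over the network.

```python
from typing import List, Tuple


class FollowerReadQueue:
    def __init__(self, applied_index: int) -> None:
        self.applied_index = applied_index
        self.waiting: List[Tuple[int, str]] = []  # (safe read index, request id)

    def submit(self, safe_read_index: int, request_id: str) -> bool:
        """Return True if the read is safe now, otherwise queue it."""
        if self.applied_index >= safe_read_index:
            return True
        self.waiting.append((safe_read_index, request_id))
        return False

    def advance_applied(self, new_applied_index: int) -> List[str]:
        """Record apply progress and return the reads that became safe."""
        self.applied_index = max(self.applied_index, new_applied_index)
        ready = [rid for idx, rid in self.waiting if idx <= self.applied_index]
        self.waiting = [
            (idx, rid) for idx, rid in self.waiting if idx > self.applied_index
        ]
        return ready


follower = FollowerReadQueue(applied_index=118)
served_immediately = follower.submit(safe_read_index=120, request_id="read-1")
ready_at_119 = follower.advance_applied(119)
ready_at_120 = follower.advance_applied(120)
```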

Lease Reads

Lease reads rely on timing assumptions. The leader assumes it remains valid for a lease window after contacting quorum.

They can be fast, but they require careful timeout and clock assumptions. Getting them wrong can break consistency.
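A lease check might look roughly like this. The parameter names and the idea of subtracting a drift bound are assumptions for this sketch, not a description of any particular implementation, and the timestamps are assumed to come from a monotonic clock.

```python
def lease_is_valid(
    now: float,
    last_quorum_contact: float,
    lease_duration: float,
    max_clock_drift: float,
) -> bool:
    """Leader-side check: serve a local read only while the lease,
    shortened by the assumed worst-case clock drift, is still running."""
    return now < last_quorum_contact + lease_duration - max_clock_drift


# Quorum was last confirmed at t=100.0 seconds on the leader's clock.
inside_lease = lease_is_valid(
    now=101.0, last_quorum_contact=100.0, lease_duration=2.0, max_clock_drift=0.5
)
outside_lease = lease_is_valid(
    now=101.6, last_quorum_contact=100.0, lease_duration=2.0, max_clock_drift=0.5
)
```

The lease has to stay shorter than the election timeout. Otherwise a new leader could be elected while the old one still believes its lease is valid.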

Stale Reads

Some systems intentionally allow local follower reads because they are fast and distribute load.

That is fine when the API is honest about it:

linearizable read: latest committed state
stale read: local replica state

The trouble starts when stale reads are presented as if they are strongly consistent.

High Availability: What Raft Actually Gives You

High availability does not mean every node can always accept writes.

In a strongly consistent Raft system, availability means:

The cluster can continue accepting safe writes as long as a quorum is alive and connected.

For three nodes:

quorum = 2

The cluster can tolerate one failed node:

node-a: alive
node-b: alive
node-c: down

For five nodes:

quorum = 3

The cluster can tolerate two failed nodes:

node-a: alive
node-b: alive
node-c: alive
node-d: down
node-e: down

In general, a cluster of 2f + 1 voting nodes can tolerate f failures.

Nodes	Quorum	Failures Tolerated
1	1	0
3	2	1
5	3	2
7	4	3

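This is why odd-sized clusters are common. Adding a fourth node to a 3-node cluster raises quorum from 2 to 3 but still tolerates only one failure, so the extra node adds cost without adding fault tolerance.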
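The table can be reproduced, and extended to even sizes, with two short functions. The names are assumptions for this sketch.

```python
def quorum(voting_nodes: int) -> int:
    """Majority size: floor(N / 2) + 1."""
    return voting_nodes // 2 + 1


def failures_tolerated(voting_nodes: int) -> int:
    """How many voters can fail while a quorum is still reachable."""
    return voting_nodes - quorum(voting_nodes)


# (nodes, quorum, failures tolerated) for cluster sizes 1 through 7.
rows = [(n, quorum(n), failures_tolerated(n)) for n in range(1, 8)]
```

Sizes 3 and 4 both tolerate one failure, and sizes 5 and 6 both tolerate two.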

Why Two Nodes Are Usually Not Enough

A 2-node cluster has quorum 2.

node-a
node-b

If either node fails, only one node remains. One node is not a quorum, so the cluster cannot safely elect a leader or commit writes.

That means a 2-node Raft cluster often gives you redundancy without useful write availability.

Three nodes is usually the smallest practical production setup.

Network Partitions

Network partitions are where quorum really earns its keep.

Take a 5-node cluster:

node-a
node-b
node-c
node-d
node-e

Now split the network:

partition 1: node-a, node-b
partition 2: node-c, node-d, node-e

Quorum is 3.

Partition 1 has only 2 nodes, so it cannot elect a leader or commit writes.

Partition 2 has 3 nodes, so it can continue.

This prevents split brain. The minority side may be alive, but it cannot make authoritative decisions.

That is the tradeoff: Raft sacrifices availability on the minority side to protect consistency.
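The partition example reduces to a size check against the full voter count. `sides_with_quorum` is a hypothetical helper for this sketch.

```python
from typing import List


def sides_with_quorum(
    partitions: List[List[str]], total_voters: int
) -> List[List[str]]:
    """Return the partition sides that can still elect a leader and commit."""
    quorum = total_voters // 2 + 1
    return [side for side in partitions if len(side) >= quorum]


partitions = [
    ["node-a", "node-b"],
    ["node-c", "node-d", "node-e"],
]

# Quorum is counted against all 5 voters, not against the visible ones.
active_sides = sides_with_quorum(partitions, total_voters=5)
```

Because quorum is more than half of all voters, at most one side can ever pass the check.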

What Happens During Failover?

Suppose the leader dies:

node-a: leader, down
node-b: follower
node-c: follower

The followers stop receiving heartbeats. After an election timeout, one follower becomes candidate.

node-b: candidate

It requests votes:

node-b votes for node-b
node-c votes for node-b

Now node-b has quorum and becomes leader.

There is usually a short window where writes are unavailable. Once a new leader is elected, the cluster can accept writes again.

This is high availability with a consistency boundary. The system pauses rather than allowing two leaders to accept conflicting writes.

Repairing Divergent Logs

Followers can fall behind. They can also contain entries that were written by an old leader but never committed.

Raft repairs this through log matching.

Append requests include information about the entry immediately before the new entries:

previous log index
previous log term
new entries

If the follower does not have the expected previous entry, it rejects the append. The leader backs up and tries again from an earlier point.

Once the leader finds a matching prefix, it overwrites the follower's conflicting suffix.

That sounds aggressive, but it is correct. Uncommitted entries are not guaranteed to survive. Committed entries are.
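Here is a compact sketch of that repair loop, with log entries reduced to (term, command) pairs. The function names are assumptions for this example, and real implementations exchange these values over RPC rather than sharing lists.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command); list position 0 holds log index 1


def append_entries(
    log: List[Entry], prev_index: int, prev_term: int, entries: List[Entry]
) -> bool:
    """Follower side: accept only if the previous entry matches."""
    if prev_index > len(log):
        return False  # the follower is missing the previous entry
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False  # the previous entry exists but its term differs
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] != entry[0]:
                del log[index - 1:]  # drop the conflicting suffix
                log.append(entry)
        else:
            log.append(entry)
    return True


def replicate(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader side: back up one entry per rejection; return attempts made."""
    next_index = len(leader_log) + 1
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(
            follower_log, prev_index, prev_term, leader_log[prev_index:]
        ):
            return attempts
        next_index -= 1


leader_log: List[Entry] = [(1, "a"), (1, "b"), (3, "c"), (3, "d")]
follower_log: List[Entry] = [(1, "a"), (1, "b"), (2, "x")]  # (2, "x") never committed
attempts = replicate(leader_log, follower_log)
```

The loop always terminates, because a request with previous index 0 is accepted by any follower.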

Snapshots And Compaction

Logs cannot grow forever.

If a system has applied millions of entries, it should not need to keep every old entry around just to recover.

A snapshot captures the state machine at a specific log index:

snapshot at index 1,000,000

After that, older log entries can be compacted.

If a follower is far behind, the leader may send a snapshot instead of trying to stream a huge log history.

Snapshots are not an optional polish feature in real systems. They are part of making Raft operationally practical.
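A toy version of compaction, assuming a key-value state machine. `CompactingLog` is a hypothetical class for this sketch. A real snapshot also records the term of the last included entry, and the caller must only compact entries that have already been applied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CompactingLog:
    snapshot_index: int = 0  # last log index covered by the snapshot
    snapshot_state: Dict[str, str] = field(default_factory=dict)
    entries: List[Tuple[str, str]] = field(default_factory=list)  # after snapshot

    def append(self, key: str, value: str) -> None:
        self.entries.append((key, value))

    def last_index(self) -> int:
        return self.snapshot_index + len(self.entries)

    def compact(self, through_index: int) -> None:
        """Fold entries up to through_index into the snapshot and drop them."""
        count = through_index - self.snapshot_index
        if count <= 0 or count > len(self.entries):
            raise ValueError("index is outside the retained log")
        for key, value in self.entries[:count]:
            self.snapshot_state[key] = value
        self.entries = self.entries[count:]
        self.snapshot_index = through_index


log = CompactingLog()
for i in range(1, 6):
    log.append(f"key:{i}", f"value:{i}")

log.compact(through_index=3)
```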

Membership Changes

Adding or removing voting nodes is harder than it looks because membership changes alter quorum.

If different parts of the cluster disagree about who can vote, you can accidentally create two groups that both think they have authority.

That is why membership changes must go through the replicated log too.

The cluster has to agree on configuration changes with the same care it uses for data changes.

Different Raft implementations handle this in different ways, often with joint consensus or carefully sequenced configuration transitions.
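With joint consensus, a decision made during the transition needs a majority of the old configuration and a majority of the new one. A minimal sketch of that rule, with hypothetical function names:

```python
from typing import Set


def has_majority(acks: Set[str], config: Set[str]) -> bool:
    """True if the acknowledging nodes include a majority of config."""
    return len(acks & config) >= len(config) // 2 + 1


def joint_decision(acks: Set[str], old: Set[str], new: Set[str]) -> bool:
    """During the transition, both configurations must agree separately."""
    return has_majority(acks, old) and has_majority(acks, new)


old_config = {"node-a", "node-b", "node-c"}
new_config = {"node-c", "node-d", "node-e"}

old_only = joint_decision({"node-a", "node-b"}, old_config, new_config)
new_only = joint_decision({"node-c", "node-d", "node-e"}, old_config, new_config)
both = joint_decision(
    {"node-a", "node-b", "node-c", "node-d"}, old_config, new_config
)
```

Neither group can decide alone while the change is in flight, which rules out two disjoint majorities.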

What Raft Guarantees

Raft is designed around a few key safety properties:

Election safety: at most one leader is elected in a term.
Leader append-only: a leader does not rewrite its own log.
Log matching: if two logs hold an entry with the same index and term, the logs are identical up through that index.
Leader completeness: committed entries appear in future leaders' logs.
State machine safety: two nodes do not apply different commands at the same log index.

These properties let a group of machines behave like one ordered state machine.

What Raft Does Not Solve

Raft is not a complete database.

It does not automatically fix:

bad disk durability settings
overloaded nodes
slow replication links
unsafe read paths
poor timeout tuning
multi-shard transactions
application-level conflicts
bad operational procedures

Raft gives you a way to agree on ordered changes. The system around it still has to use that agreement correctly.

Common Mistakes

Acknowledging Writes Too Early

If a leader returns success before quorum replication, a successful write can disappear during failover.

The safe flow is:

append locally
replicate to quorum
mark committed
respond success

Treating Follower Reads As Strong Reads

Follower reads are not automatically linearizable.

They are fine if they are documented as stale reads. They are dangerous if callers expect read-after-write consistency.

Letting Old Leaders Serve Reads

A partitioned leader can still be alive.

Before serving a strong read, a leader needs proof that it still has authority. Otherwise it may return stale state after a new leader has already been elected.

Making Timeouts Too Aggressive

Short election timeouts can make a healthy cluster flap during latency spikes.

Long election timeouts make failover slow.

There is no universal perfect value. Timeouts have to match the environment.

A Good Mental Model

Raft is not "replication" in the casual sense of copying bytes to other machines.

It is stricter than that.

Raft is agreement over history.

The leader proposes the next entries. A quorum decides which entries are durable enough to become part of the cluster's history. Every node applies that history in order.

The shortest version is:

Raft = one agreed log, replicated by majority decisions

That is why quorum matters. It makes sure every committed decision intersects with future decisions.

Closing Thoughts

High availability is not just keeping processes alive. A system that stays online but returns conflicting answers is not healthy.

Raft gives distributed systems a way to remain available on the majority side of failures while protecting the integrity of committed state.

That is the tradeoff:

keep serving when a quorum is available
stop serving authoritative writes when a quorum is not available
never let two disconnected groups commit conflicting histories

Once that clicks, Raft becomes less mysterious. It is a disciplined way to decide which changes become real.
