DEV Community

Consensus Protocols in Distributed Systems

Consensus Protocols in Distributed Systems

A Complete Learning Guide — Intermediate to Advanced


Table of Contents

  1. Foundation — What is Consensus and Why Is It Hard?
  2. Core Concepts & Terminology
  3. Paxos — The Classic Protocol
  4. Raft — The Understandable Protocol
  5. Byzantine Fault Tolerance (BFT)
  6. Other Notable Protocols
  7. Real-World Systems
  8. Scalability & Performance Considerations
  9. Trade-Off Analysis
  10. Design Decisions & Rationale
  11. Practice Problems
  12. Summary & Next Steps

1. Foundation

1.1 What is "Consensus" and Why Is It Hard?

Consensus is the problem of getting a group of distributed processes (nodes, servers, machines) to agree on a single value or sequence of values, even when some nodes fail, messages are delayed, or the network partitions into isolated groups.

Think of it like this: imagine five people in separate rooms, each with their own pen and paper, communicating only by passing notes through hallways that occasionally swallow messages. They must all write down the same answer to a question. That's consensus.

In software systems, "consensus" is needed whenever multiple servers must agree on:

  • Which server is the current leader
  • The next entry in a shared log
  • Whether a distributed transaction committed or aborted
  • The current configuration of a cluster

The difficulty comes from three fundamental realities of distributed computing known as the "three evils":

Evil Description
Asynchrony There is no global clock. Messages can be delayed arbitrarily. You can't tell if a node crashed or is just slow.
Partial failures Some nodes crash while others continue. The system is neither fully up nor fully down.
Network partitions The network can split into isolated islands of nodes that cannot communicate.

Any correct consensus protocol must work despite all three evils simultaneously.


1.2 The Two Generals Problem — ASCII Illustrated

The Two Generals Problem (1975) is the foundational impossibility result for consensus over unreliable networks. It proves that two parties can never achieve guaranteed consensus when communicating over a network that can lose messages.

Scenario: Two armies (General A and General B) must attack a city simultaneously to succeed. They can only communicate by messenger through enemy territory. Any messenger can be captured (message lost). General A sends a message: "Attack at dawn." But unless B confirms receipt, A doesn't know if B got the message. And if B sends a confirmation, B doesn't know if A received it. This chains infinitely.

  ARMY A                  [CITY]                ARMY B
    |                                              |
    |----[msg: Attack at dawn]---> ???  -------->  |
    |                                              |
    |    Did B get it?                             |
    |    I don't know...                           |
    |                                              |
    | <---[ack: Confirmed, attacking]--- ???  ---  |
    |                                              |
    |    Did A get my ack?                         |
    |    B doesn't know...                         |
    |                                              |
    | ----[ack-ack: Got your ack]---> ???  ------> |
    |                                              |
    |         ∞ regress — never certain            |
    |                                              |

  KEY INSIGHT: Over an unreliable channel, you can NEVER be
  100% certain both parties will act simultaneously.
  Practical consensus protocols work around this with
  QUORUMS + TIMEOUTS rather than certainty.
Enter fullscreen mode Exit fullscreen mode

What this means in practice: We cannot build a "perfect" consensus protocol over an unreliable network. Instead, we build protocols that are correct enough — they guarantee consistency when a majority of nodes can communicate, and make progress as long as failures are below a threshold.


1.3 The Byzantine Generals Problem

Lamport, Shostak, Pease (1982) extended the Two Generals Problem to include malicious actors.

Scenario: An empire's generals surround a city with separate armies. They must all agree to attack or retreat. Some generals may be traitors who send conflicting messages to sabotage coordination.

Key insight: With F traitors among N generals, loyal generals can only reach consensus if:

N ≥ 3F + 1
Enter fullscreen mode Exit fullscreen mode

With 1 traitor, you need at least 4 generals (3 loyal). With 3 traitors, you need at least 10 generals.

Analogy for engineers: A Byzantine node doesn't just crash — it lies. It might send "Vote YES" to half the cluster and "Vote NO" to the other half. It might replay old messages, send corrupted data, or pretend to be multiple nodes. Byzantine fault tolerance (BFT) is far more expensive than crash fault tolerance (CFT).

Most modern consensus protocols (Raft, Paxos, ZAB) are crash fault tolerant only — they assume nodes either work correctly or stop. BFT is reserved for adversarial environments like blockchains and aerospace.


1.4 FLP Impossibility Theorem

Fischer, Lynch, Paterson (1985) proved one of the most important results in distributed computing:

In a fully asynchronous distributed system, no consensus protocol can simultaneously guarantee Safety, Liveness, AND Fault Tolerance.

In plain English: If even one node can crash, no deterministic algorithm can guarantee that a consensus decision will ever be reached in finite time.

This is NOT saying consensus is impossible. It's saying you must make a trade-off:

If you want... You must sacrifice...
Safety + Fault Tolerance Liveness (may block forever)
Safety + Liveness Fault Tolerance (requires all nodes up)
Liveness + Fault Tolerance Safety (may give inconsistent answers)

How real protocols survive FLP:

  • Paxos & Raft sacrifice liveness under some conditions (they can block during certain partitions but never give wrong answers)
  • Randomized protocols (e.g., Ben-Or) use randomness to break symmetry and eventually terminate with probability 1
  • Partial synchrony assumption — real systems assume "eventually, the network becomes synchronous enough" to make progress

1.5 Three Properties Every Consensus Protocol Must Satisfy

┌─────────────────────────────────────────────────────────────┐
│                  CONSENSUS PROPERTIES                        │
│                                                             │
│  ┌──────────┐   ┌──────────┐   ┌─────────────────────┐    │
│  │  SAFETY  │   │LIVENESS  │   │  FAULT TOLERANCE    │    │
│  │          │   │          │   │                     │    │
│  │ "Nothing │   │"Something│   │ "Works despite F    │    │
│  │  bad     │   │  good    │   │  node failures"     │    │
│  │  ever    │   │  eventually   │                     │    │
│  │  happens"│   │  happens"│   │ CFT: F < N/2        │    │
│  │          │   │          │   │ BFT: F < N/3        │    │
│  └──────────┘   └──────────┘   └─────────────────────┘    │
│                                                             │
│  + Agreement: All correct nodes decide the same value       │
│  + Validity:  The decided value was proposed by some node   │
│  + Termination: All correct nodes eventually decide         │
└─────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode
  • Safety (Consistency): Two nodes will never decide on different values. No split-brain. No two leaders simultaneously accepting conflicting writes.
  • Liveness (Progress): The system eventually makes a decision. It doesn't get stuck forever.
  • Fault Tolerance: The system continues operating correctly even when F nodes fail.

1.6 Real-World Stakes: What Goes Wrong Without Consensus

Problem Cause Real-World Impact
Split-Brain Two nodes both think they're leader Conflicting writes, data corruption
Data Loss Write acknowledged before replicated to majority Data gone when leader crashes
Double-Spend Two transactions approved concurrently Financial fraud (blockchain use case)
Stale Reads Reading from a lagging replica User sees outdated data after a write
Cluster Membership Chaos Nodes disagree on who is in the cluster Writes accepted by excluded nodes

The 2011 LinkedIn outage and the 2012 MongoDB data corruption incidents are famous examples where consensus violations caused production disasters.


2. Core Concepts & Terminology

2.1 Leader Election — Why It's Needed

Most consensus protocols use a single leader to serialize decisions. Without a designated leader, all nodes could simultaneously propose conflicting values, leading to deadlock or inconsistency.

The leader:

  1. Receives all client writes
  2. Decides the global order of log entries
  3. Replicates entries to followers
  4. Commits entries once a quorum acknowledges

Leader election itself is a consensus problem — nodes must agree on who the leader is. This circularity is resolved by using terms and randomized timeouts (in Raft) or ballot numbers (in Paxos).


2.2 Log Replication — The Append-Only Distributed Log

The core data structure in most consensus protocols is a replicated log: an ordered sequence of commands that every node applies to its state machine in the same order.

  Node 1 (Leader)   [idx:1, cmd:SET x=1] [idx:2, cmd:SET y=2] [idx:3, cmd:DEL z]
  Node 2 (Follower) [idx:1, cmd:SET x=1] [idx:2, cmd:SET y=2] [idx:3, cmd:DEL z]
  Node 3 (Follower) [idx:1, cmd:SET x=1] [idx:2, cmd:SET y=2] [            ...  ]
                                                                  ↑ lagging
Enter fullscreen mode Exit fullscreen mode

The key invariant: if two logs have the same entry at the same index with the same term, they are identical up to that index. This is the Log Matching Property in Raft.


2.3 Quorum — Majority Rule

A quorum is the minimum number of nodes that must agree for a decision to be valid. It ensures that any two quorums overlap by at least one node, preventing contradictory decisions.

For crash fault tolerant protocols:

To tolerate F failures with N total nodes:
  N ≥ 2F + 1   (odd numbers are common: 3, 5, 7)
  Quorum size Q = ⌊N/2⌋ + 1

Examples:
  3-node cluster  → tolerate 1 failure  → quorum = 2
  5-node cluster  → tolerate 2 failures → quorum = 3
  7-node cluster  → tolerate 3 failures → quorum = 4
Enter fullscreen mode Exit fullscreen mode

For Byzantine fault tolerant protocols:

To tolerate F Byzantine failures:
  N ≥ 3F + 1
  Quorum size Q = ⌊2N/3⌋ + 1

Examples:
  4-node cluster  → tolerate 1 Byzantine failure  → quorum = 3
  7-node cluster  → tolerate 2 Byzantine failures → quorum = 5
Enter fullscreen mode Exit fullscreen mode

Why quorums prevent split-brain: If the cluster splits into two partitions of sizes A and B where A + B = N, only one partition can have a quorum of ⌊N/2⌋ + 1 nodes. The smaller partition stalls.


2.4 Terms / Epochs / Rounds

Every consensus protocol needs a way to logically track time and identify stale messages. Different protocols use different names:

Protocol Name Purpose
Raft Term Monotonically increasing integer. Each election starts a new term.
Paxos Ballot number Globally unique proposal number used to order proposals.
ZAB Epoch Similar to Raft term; identifies a leadership period.
PBFT View Identifies the current primary (leader) replica.

When a node receives a message with a higher term/epoch, it immediately defers to that term, abandons any ongoing operation, and updates its state. This prevents stale leaders from causing damage.


2.5 Heartbeats and Timeouts

Heartbeats are periodic messages sent by the leader to all followers to:

  1. Prove the leader is still alive
  2. Prevent unnecessary elections
  3. Carry the commitIndex so followers can apply committed entries

Election timeout: A follower starts a new election if it doesn't hear from the leader within a randomized timeout window (e.g., 150–300ms in Raft). Randomization prevents multiple candidates from starting elections simultaneously.

Timeline:
  t=0    Leader sends heartbeat ──────────────────► All followers
  t=50ms                                             Followers reset timeout
  t=150ms Leader crashes
  t=300ms Follower A's timeout expires → becomes Candidate → starts election
  t=310ms Follower B's timeout expires → Candidate A already has quorum vote
Enter fullscreen mode Exit fullscreen mode

2.6 Network Partitions and Failure Modes

  Normal Operation:
  [N1] ←→ [N2] ←→ [N3] ←→ [N4] ←→ [N5]

  Partition (2+3 split):
  [N1] ←→ [N2]   ✗✗✗   [N3] ←→ [N4] ←→ [N5]
  (minority, stalls)      (majority, continues)

  Failure types:
  ┌─────────────────┬───────────────────────────────────────┐
  │ Crash failure   │ Node stops completely. Safe.          │
  │ Network failure │ Node alive but can't communicate.     │
  │ Slow node       │ Node responds but very slowly.        │
  │ Byzantine fault │ Node sends arbitrary/malicious msgs.  │
  └─────────────────┴───────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

3. Paxos — The Classic Protocol

3.1 History and Background

Leslie Lamport invented Paxos around 1989 (published 1998 in "The Part-Time Parliament"). It was described through the metaphor of a Greek parliament on the island of Paxos. The paper was famously rejected for years because reviewers found it too whimsical.

Paxos is the theoretical foundation underlying nearly every production consensus system. Google's Chubby, Spanner, and Megastore are all Paxos derivatives.


3.2 Three Roles

Role Responsibility
Proposer Proposes values. Drives the protocol forward. Usually the leader or the client's contact point.
Acceptor Votes on proposals. Stores the last ballot it promised and accepted value. N acceptors = the voting body.
Learner Learns the final agreed value. Does not vote. Often the application layer.

In practice, a single node often plays all three roles simultaneously.


3.3 Two-Phase Protocol

Phase 1: Prepare → Promise

The Proposer broadcasts a Prepare(n) message with a ballot number n to all Acceptors.

Each Acceptor:

  • If n > highest_promised, respond with Promise(n, accepted_value) and promise never to accept ballots < n
  • Otherwise, ignore or send NACK

Phase 2: Accept → Accepted

Once the Proposer receives promises from a quorum:

  • If any Acceptor returned a previously accepted value, the Proposer must use the highest-ballot accepted value (not its own)
  • Otherwise, Proposer uses its own desired value
  • Proposer broadcasts Accept(n, value) to all Acceptors

Each Acceptor:

  • If it hasn't promised a higher ballot, accept the value and broadcast Accepted(n, value) to all Learners

When a Learner hears Accepted from a quorum, the value is committed.


3.4 Paxos Sequence Diagram

Proposer (P)         Acceptor A1      Acceptor A2      Acceptor A3
     |                    |                |                |
     |-- Prepare(n=5) --->|                |                |
     |-- Prepare(n=5) ---------------------->               |
     |-- Prepare(n=5) -------------------------------------->|
     |                    |                |                |
     |<-- Promise(5,⊥) ---|                |                |
     |<-- Promise(5,⊥) -------------------|                |
     |<-- Promise(5,⊥) -------------------------------|    |
     |                    |                |                |
     |  [Got quorum of promises, no prior accepted values]  |
     |  [Choose own value: "SET x=42"]                      |
     |                    |                |                |
     |-- Accept(5,"SET x=42") ----------->|                |
     |-- Accept(5,"SET x=42") ----------------------->     |
     |-- Accept(5,"SET x=42") ------------------------------>
     |                    |                |                |
     |<-- Accepted(5,"SET x=42") ---------|                |
     |<-- Accepted(5,"SET x=42") ----------------------|   |
     |<-- Accepted(5,"SET x=42") ---------------------------->
     |                    |                |                |
     |  [Quorum of Accepted → Value COMMITTED]              |
     |                    |                |                |

  CONFLICT SCENARIO: Acceptor A2 already accepted (4,"SET x=99")
     |<-- Promise(5, prev_accepted=(4,"SET x=99")) ---------|
     |
     |  [Must use "SET x=99" as the value in Accept phase]
     |  [Own value is overridden to preserve prior consensus]
Enter fullscreen mode Exit fullscreen mode

3.5 Why Paxos Is Hard to Implement

Lamport himself wrote a follow-up paper titled "Paxos Made Simple" (2001), which only made things slightly simpler. The problems:

  1. Under-specified: Paxos defines single-value consensus. Real systems need to agree on a sequence of values (a log). Multi-Paxos fills this gap but is not formally defined by Lamport.
  2. Livelock: Two proposers can keep outbidding each other's ballot numbers forever. Solution: elect a single distinguished proposer.
  3. Leader Leases: Paxos doesn't inherently define how to handle stale leaders, read consistency, or leader changes efficiently.
  4. Membership changes: Adding/removing nodes is not specified.
  5. Log gaps: If a leader crashes mid-proposal, gaps appear in the log that require complex repair.

As Chubby's engineers noted: "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system."


3.6 Multi-Paxos Optimization

Basic Paxos requires 2 round trips per decision. Multi-Paxos elects a stable leader and skips Phase 1 for all subsequent entries (using the same ballot number):

Basic Paxos per entry: 2 RTT (Prepare + Accept)
Multi-Paxos after leader elected: 1 RTT (Accept only)

Leader sends: Accept(n, index, value) for each log entry
             ↓ No Prepare needed while leader is stable
Enter fullscreen mode Exit fullscreen mode

3.7 Paxos Proposer Logic (Pseudocode)

// Proposer logic for single-decree Paxos
class Proposer:
    n = 0          // ballot number (must be globally unique, e.g., timestamp + node_id)
    value = None

    function propose(desired_value):
        n = next_unique_ballot()

        // Phase 1: Prepare
        promises = []
        for each acceptor in acceptors:
            response = send_prepare(acceptor, n)
            if response.type == PROMISE:
                promises.append(response)

        if len(promises) < quorum:
            return FAILED  // retry with higher n

        // If any promise carries a prior accepted value, use it
        highest = max(promises, key=lambda p: p.accepted_ballot or -1)
        if highest.accepted_value != None:
            value = highest.accepted_value   // override our value!
        else:
            value = desired_value

        // Phase 2: Accept
        accepted_count = 0
        for each acceptor in acceptors:
            response = send_accept(acceptor, n, value)
            if response.type == ACCEPTED:
                accepted_count += 1

        if accepted_count >= quorum:
            notify_learners(value)
            return SUCCESS(value)
        else:
            n = next_unique_ballot()
            return propose(desired_value)  // retry

// Acceptor logic
class Acceptor:
    promised_ballot = -1
    accepted_ballot = -1
    accepted_value  = None

    function on_prepare(n):
        if n > promised_ballot:
            promised_ballot = n
            return Promise(n, accepted_ballot, accepted_value)
        else:
            return Nack(promised_ballot)

    function on_accept(n, value):
        if n >= promised_ballot:
            promised_ballot = n
            accepted_ballot = n
            accepted_value  = value
            return Accepted(n, value)
        else:
            return Nack(promised_ballot)
Enter fullscreen mode Exit fullscreen mode

4. Raft — The Understandable Protocol

4.1 Why Raft Was Created

Diego Ongaro and John Ousterhout (Stanford, 2014) set out to create a consensus protocol that was understandable — not just correct. Their paper "In Search of an Understandable Consensus Algorithm" ran a user study showing Raft was significantly easier to understand than Paxos.

The key design principle: decompose consensus into three relatively independent sub-problems, each understandable on its own.


4.2 Three Sub-Problems Raft Solves

┌───────────────────────────────────────────────────────────┐
│                    RAFT = 3 Sub-Problems                  │
│                                                           │
│  1. LEADER ELECTION                                       │
│     Who is in charge? One leader per term.                │
│                                                           │
│  2. LOG REPLICATION                                       │
│     Leader receives entries, replicates to followers,     │
│     commits when majority acknowledges.                   │
│                                                           │
│  3. SAFETY                                                │
│     Ensure only servers with up-to-date logs can          │
│     become leader. Committed entries never lost.          │
└───────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

4.3 Leader Election — State Machine

Each Raft node is always in one of three states:

                  ┌──────────────────────────────────────┐
                  │           STATE MACHINE               │
                  └──────────────────────────────────────┘

  ┌─────────┐  timeout, no    ┌───────────┐  receives votes   ┌────────┐
  │         │  heartbeat from  │           │  from majority    │        │
  │FOLLOWER │ ──────────────►  │ CANDIDATE │ ────────────────► │ LEADER │
  │         │                  │           │                   │        │
  └─────────┘                  └───────────┘                   └────────┘
       ▲                            │  │                           │
       │   discovers higher term    │  │ discovers current         │
       │   or valid leader          │  │ leader or higher term     │
       │◄───────────────────────────┘  │◄──────────────────────────┘
       │                               │
       │ (also: if split vote,         │
       │  restart election with        │
       │  new randomized timeout)      │

  Node starts as FOLLOWER
  FOLLOWER → CANDIDATE: election timeout expires (randomized 150-300ms)
  CANDIDATE → LEADER:   receives votes from ⌊N/2⌋+1 nodes
  CANDIDATE → FOLLOWER: sees higher term or valid leader heartbeat
  LEADER    → FOLLOWER: discovers higher term number
Enter fullscreen mode Exit fullscreen mode

Election process step-by-step:

  1. Candidate increments its current term
  2. Votes for itself
  3. Sends RequestVote(term, candidateId, lastLogIndex, lastLogTerm) to all nodes
  4. A node grants its vote if: it hasn't voted this term AND the candidate's log is at least as up-to-date as its own
  5. Candidate becomes leader upon receiving majority votes
  6. Leader immediately sends heartbeats to all nodes to establish authority

4.4 Log Replication — AppendEntries RPC Flow

CLIENT                LEADER (L)         FOLLOWER F1      FOLLOWER F2
  |                       |                   |                |
  |-- Write("SET x=5") -->|                   |                |
  |                       |                   |                |
  |              append to own log            |                |
  |              log: [..., (term=3,idx=7)]   |                |
  |                       |                   |                |
  |                       |--AppendEntries(term=3,           |
  |                       |   prevLogIdx=6,                  |
  |                       |   prevLogTerm=3,                 |
  |                       |   entries=[(3,7,"SET x=5")],     |
  |                       |   leaderCommit=6) ────────────>  |
  |                       |                                   |
  |                       |--AppendEntries(...) ────────────────────────>
  |                       |                   |                |
  |                       |<── Success ────── |                |
  |                       |<── Success ──────────────────────|
  |                       |                   |                |
  |        [Majority (2/3) acknowledged → COMMIT entry]        |
  |        [Update commitIndex = 7]                            |
  |                       |                   |                |
  |<── ACK (success) ─────|                   |                |
  |                       |                   |                |
  |            Next heartbeat carries leaderCommit=7           |
  |                       |──AppendEntries(entries=[], ──────>|
  |                       |   leaderCommit=7)                 |
  |                       |──AppendEntries(...) ─────────────────────>
  |                       |                                   |
  |               Followers apply entry to state machine      |
Enter fullscreen mode Exit fullscreen mode

4.5 Term Numbers and Stale Leaders

Terms are the core mechanism preventing stale leaders from causing damage:

Term 1:   [Node A is leader]──────────crash
                                        ↓
Term 2:   [Node B elected leader]─────────────────partition
                                                    ↓
Term 3:   [Nodes C,D,E elect Node C as leader]────────────►
                                        ↓
         Node B comes back from partition.
         Node B receives any message with term=3.
         Node B immediately steps down (term 2 < 3).
         Node B becomes Follower in term 3.
         Node B discards any uncommitted entries from term 2.
Enter fullscreen mode Exit fullscreen mode

Golden Rule: If a node receives any RPC with a higher term, it immediately becomes a follower and updates its term. Higher term always wins.


4.6 Safety — The Commitment Rule

A log entry is committed only when:

  1. It has been replicated to a majority of nodes
  2. AND the leader has committed at least one entry from its current term

The second condition prevents a subtle bug where an old entry from a previous term appears committed but gets overwritten during leader changes.

  Scenario (why condition 2 matters):

  Term 1: Leader L1 replicates entry E1 to majority. Crashes before committing.
  Term 2: Leader L2 replicates its own entry E2 to majority (overwriting E1 on some).
          L2 must commit E2 first, which "anchors" E1's commit safely.

  If L2 committed E1 directly (without a term-2 entry), a later leader
  could overwrite E1 on the nodes that didn't replicate it — violating safety.
Enter fullscreen mode Exit fullscreen mode

4.7 Log Compaction (Snapshots)

Logs grow forever. Raft solves this with snapshots:

  • Each node periodically serializes its entire state machine to disk as a snapshot
  • Log entries before the snapshot point are discarded
  • The snapshot includes the last included log index and term
  • New nodes joining the cluster receive the snapshot + recent log entries via InstallSnapshot RPC
Before snapshot:
[1:SET a=1][2:SET b=2][3:SET a=5][4:DEL b][5:SET c=3] [6:SET d=7][7:SET e=9]
 ←────────── included in snapshot ──────────────────►   ←── kept ──────────►

After snapshot:
[SNAPSHOT: {a=5,c=3} last_idx=5, last_term=2]  [6:SET d=7][7:SET e=9]
Enter fullscreen mode Exit fullscreen mode

4.8 AppendEntries Handler (Go-like Pseudocode)

// AppendEntries RPC Handler (on Follower)
func (rf *Raft) AppendEntries(args AppendEntriesArgs, reply *AppendEntriesReply) {
    rf.mu.Lock()
    defer rf.mu.Unlock()

    reply.Success = false

    // Rule 1: Reject if leader's term is stale
    if args.Term < rf.currentTerm {
        reply.Term = rf.currentTerm
        return
    }

    // Rule 2: Update term if leader has higher term
    if args.Term > rf.currentTerm {
        rf.currentTerm = args.Term
        rf.state = Follower
        rf.votedFor = -1
    }

    // Rule 3: Reset election timer (we heard from valid leader)
    rf.resetElectionTimer()

    // Rule 4: Log consistency check
    // Reject if log doesn't contain entry at prevLogIndex with prevLogTerm
    if args.PrevLogIndex > 0 {
        if args.PrevLogIndex > len(rf.log) {
            reply.ConflictIndex = len(rf.log) + 1
            reply.ConflictTerm = -1
            return
        }
        if rf.log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
            reply.ConflictTerm  = rf.log[args.PrevLogIndex-1].Term
            reply.ConflictIndex = rf.firstIndexForTerm(reply.ConflictTerm)
            // Truncate conflicting entries
            rf.log = rf.log[:args.PrevLogIndex-1]
            return
        }
    }

    // Rule 5: Append new entries, overwriting conflicts
    for i, entry := range args.Entries {
        idx := args.PrevLogIndex + i + 1
        if idx <= len(rf.log) {
            if rf.log[idx-1].Term != entry.Term {
                rf.log = rf.log[:idx-1]  // truncate conflicting suffix
            }
        }
        if idx > len(rf.log) {
            rf.log = append(rf.log, entry)
        }
    }

    // Rule 6: Update commitIndex
    if args.LeaderCommit > rf.commitIndex {
        rf.commitIndex = min(args.LeaderCommit, len(rf.log))
        rf.applyCommittedEntries()  // apply to state machine
    }

    reply.Success = true
    reply.Term = rf.currentTerm
}
Enter fullscreen mode Exit fullscreen mode

4.9 Raft vs Paxos Comparison

Dimension Raft Paxos (Multi-Paxos)
Understandability Designed for clarity Notoriously hard to understand
Leader Explicit single leader per term Distinguished proposer (implicit)
Log gaps No gaps allowed Gaps possible, need repair
Membership changes Joint consensus (well-defined) Under-specified
Log replication Strong leader: follower log always matches leader Any node can propose; complex reconciliation
Snapshots Built into the spec Not specified
Formal proof TLA+ spec available Partial proofs
Implementations etcd, CockroachDB, TiKV, Consul Google Chubby, Spanner, Zookeeper (variant)

5. Byzantine Fault Tolerance (BFT)

5.1 Crash Faults vs Byzantine Faults

  CRASH FAULT (CFT):                BYZANTINE FAULT (BFT):
  Node either works or stops.       Node can behave arbitrarily.

  ┌────┐                            ┌────┐
  │ OK │ ─── responds correctly     │ OK │ ─── responds correctly
  └────┘                            └────┘

  ┌────┐                            ┌────┐
  │CRASH│ ─── stops, no response    │ BYZ│ ─── sends YES to A,
  └────┘                            └────┘     sends NO to B,
                                               replays old messages,
                                               forges signatures,
                                               delays strategically
Enter fullscreen mode Exit fullscreen mode

When BFT is needed:

  • Blockchain networks (untrusted nodes)
  • Aerospace systems (cosmic ray bit flips)
  • Financial clearing systems (malicious insiders)
  • Multi-organization consortiums

Cost of BFT: Requires N ≥ 3F + 1 nodes (vs 2F + 1 for CFT), plus cryptographic signatures on all messages (O(N²) message complexity per round).


5.2 PBFT — Practical Byzantine Fault Tolerance

Castro & Liskov (1999) created PBFT as the first practical BFT protocol. It has three phases:

Client   Primary (P)   Replica R1   Replica R2   Replica R3
  |           |              |            |            |
  |─ Request →|              |            |            |
  |           |              |            |            |
  |    [Phase 1: PRE-PREPARE]            |            |
  |           |─ Pre-Prepare(v,n,digest) →|            |
  |           |─ Pre-Prepare(v,n,digest) ─────────────►|
  |           |─ Pre-Prepare(v,n,digest) ──────────────────────►
  |           |              |            |            |
  |    [Phase 2: PREPARE — replicas broadcast to each other]
  |           |              |─ Prepare(v,n,digest) ──►|
  |           |              |─ Prepare(v,n,digest) ───────────►
  |           |◄─────────────── Prepare(v,n,digest) ───|
  |           |              |◄─────── Prepare(v,n,digest) ────|
  |           |              |            |◄─ Prepare(v,n,d) ──|
  |           |              |            |                     |
  |    [Phase 3: COMMIT — after 2F+1 Prepares]
  |           |              |─ Commit(v,n,digest) ───►|
  |           |              |─ Commit(v,n,digest) ────────────►
  |           |◄─────────────── Commit(v,n,digest) ────|
  |    ...    | (all commit after 2F+1 Commits)
  |           |              |            |            |
  |◄─ Reply ──| (or from any replica)

  v = view number, n = sequence number, digest = hash of request
Enter fullscreen mode Exit fullscreen mode

Message complexity per request: O(N²) — every replica talks to every other replica in the Prepare and Commit phases. This limits PBFT to small clusters (typically ≤ 20 nodes in practice).


5.3 Tendermint — Modern BFT for Blockchains

Tendermint (used in the Cosmos blockchain ecosystem) is a modern BFT protocol with:

  • Round-based voting with two vote stages: Prevote and Precommit
  • Locked value mechanism prevents equivocation
  • Partial synchrony assumption: progress requires eventual network stability
  • Deterministic leader rotation (round-robin by default)
  • Finality in 1 block: Unlike Nakamoto consensus (Bitcoin), Tendermint commits are final — no forks

Tendermint tolerates F < N/3 Byzantine failures and requires 2/3 + 1 of stake-weighted votes.


6. Other Notable Protocols

6.1 ZAB — ZooKeeper Atomic Broadcast

ZAB (Junqueira, Reed, Serafini 2011) is the consensus protocol used in Apache ZooKeeper. It's conceptually similar to Multi-Paxos / Raft but with key differences:

  • Primary-Backup model: The primary (leader) propagates transactions. Followers are backups.
  • Two phases: Broadcasting + Recovery: During leader election, a recovery phase ensures all committed transactions are replicated before a new primary begins serving.
  • Epoch-based: Uses epoch numbers similar to Raft terms.
  • FIFO ordering per channel: ZAB guarantees that messages from a given node are delivered in the order they were sent — a stronger guarantee than standard Paxos.

ZAB is tuned for high-throughput, low-write-latency with reads served locally (at the cost of potential staleness).


6.2 Viewstamped Replication (VSR)

Liskov & Cowling (2012) — VSR predates Paxos (original 1988 by Liskov & Shrira) and is arguably the clearest formulation of the state-machine replication problem. It uses:

  • View numbers (analogous to terms) to track leader identity
  • View changes for leader failover
  • Op numbers for log entries

Raft borrowed heavily from VSR. Many researchers consider VSR the conceptual predecessor of both Paxos and Raft.


6.3 EPaxos — Parallel Commits

Egalitarian Paxos (Moraru, Andersen, Kaminsky 2013) eliminates the single-leader bottleneck by allowing any replica to commit non-conflicting commands in a single round trip:

  • Commands that don't conflict (touch different keys) can be committed in parallel by different replicas
  • Conflicting commands go through a slower "slow path" (2 round trips)
  • Optimal latency: commit in 1 RTT to the closest majority
  • Higher implementation complexity than Raft

EPaxos is used in research systems but production adoption is limited due to complexity.


6.4 Chandra-Toueg Protocol

Chandra & Toueg (1996) proved that consensus is solvable in asynchronous systems if you have access to an unreliable failure detector — a distributed oracle that (eventually correctly) suspects crashed nodes. The protocol uses a rotating coordinator and ◇S (eventually strong) failure detector.

Practical significance: showed that the FLP impossibility is about perfect asynchrony. With even a weak failure detector (timeouts), consensus becomes solvable.


6.5 Comparison Table: Major Protocols

Protocol Fault Model Min Nodes Msg Complexity Leader Throughput Ease of Impl Use Cases
Paxos Crash (CFT) 2F+1 O(N) per phase Implicit High (Multi-Paxos) Hard Chubby, Spanner
Raft Crash (CFT) 2F+1 O(N) Explicit High Moderate etcd, Consul, CockroachDB
ZAB Crash (CFT) 2F+1 O(N) Explicit Very High Moderate ZooKeeper
PBFT Byzantine (BFT) 3F+1 O(N²) Explicit (view) Low Hard Hyperledger, research
Tendermint Byzantine (BFT) 3F+1 O(N²) Rotating Medium Moderate Cosmos blockchain
EPaxos Crash (CFT) 2F+1 O(N) Leaderless Very High Very Hard Research systems
VSR Crash (CFT) 2F+1 O(N) Explicit High Moderate Academic reference

7. Real-World Systems

7.1 etcd — Kubernetes Control Plane

etcd is a strongly consistent, distributed key-value store built on Raft. It is the backbone of Kubernetes, storing all cluster state: node information, pod specs, ConfigMaps, secrets, and service discovery data.

  ┌─────────────────────────────────────────────────────────────────┐
  │                   KUBERNETES CLUSTER                            │
  │                                                                 │
  │  ┌──────────────────────────────────────────┐                  │
  │  │            CONTROL PLANE                  │                  │
  │  │                                           │                  │
  │  │  ┌────────────┐  ┌────────────────────┐  │                  │
  │  │  │ API Server │  │ Controller Manager │  │                  │
  │  │  └─────┬──────┘  └────────────────────┘  │                  │
  │  │        │ read/write                        │                  │
  │  │        ▼                                   │                  │
  │  │  ┌─────────────────────────────────────┐  │                  │
  │  │  │         etcd Raft Cluster           │  │                  │
  │  │  │                                     │  │                  │
  │  │  │  ┌──────────┐   ┌──────────┐       │  │                  │
  │  │  │  │ etcd-1   │   │ etcd-2   │       │  │                  │
  │  │  │  │ (LEADER) │◄─►│(FOLLOWER)│       │  │                  │
  │  │  │  └──────────┘   └──────────┘       │  │                  │
  │  │  │        ▲              ▲             │  │                  │
  │  │  │        └──────┬───────┘             │  │                  │
  │  │  │               ▼                     │  │                  │
  │  │  │         ┌──────────┐                │  │                  │
  │  │  │         │ etcd-3   │                │  │                  │
  │  │  │         │(FOLLOWER)│                │  │                  │
  │  │  │         └──────────┘                │  │                  │
  │  │  │  Quorum=2, Tolerates 1 failure      │  │                  │
  │  │  └─────────────────────────────────────┘  │                  │
  │  └──────────────────────────────────────────┘                  │
  │                                                                 │
  │  ┌───────────┐  ┌───────────┐  ┌───────────┐                  │
  │  │  Worker 1 │  │  Worker 2 │  │  Worker 3 │                  │
  │  │ (kubelet) │  │ (kubelet) │  │ (kubelet) │                  │
  │  └───────────┘  └───────────┘  └───────────┘                  │
  └─────────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Production etcd clusters use 5 nodes (tolerates 2 failures). Write latency is ~1–5ms on fast SSDs. etcd is designed for control plane data (~1000 writes/sec), not application data.


7.2 Apache ZooKeeper

ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast). It provides a hierarchical namespace (like a filesystem), watches (event notifications), and ephemeral nodes (auto-deleted when client disconnects). Used for distributed locks, service discovery, and leader election in systems like Kafka, HBase, and older Hadoop.

ZooKeeper reads are served locally (from any follower without going to leader), which makes reads fast but allows stale reads unless you issue a sync() call first.


7.3 CockroachDB / TiKV — Multi-Raft

Both CockroachDB and TiKV (used in TiDB) implement multi-Raft: the key space is divided into ranges (~64MB each), and each range has its own independent Raft group.

  Key Space: [A..Z]

  Range 1 [A..H]:  Raft Group (Node1-Leader, Node2, Node3)
  Range 2 [I..P]:  Raft Group (Node2-Leader, Node3, Node4)
  Range 3 [Q..Z]:  Raft Group (Node3-Leader, Node4, Node5)

  Benefits:
  - Writes to different ranges are fully parallel
  - Each range's leadership is independent
  - Horizontal scaling: add nodes, rebalance ranges
Enter fullscreen mode Exit fullscreen mode

Cross-range transactions use a distributed transaction protocol (2PC over Raft) with optimistic concurrency control.


7.4 Google Chubby and Spanner

  • Chubby (Burrows 2006): A lock service using Multi-Paxos. Provides coarse-grained distributed locks and small metadata storage. Kubernetes etcd was partially inspired by Chubby's design.
  • Spanner (Corbett et al. 2012): Google's globally distributed relational database. Uses Multi-Paxos per shard + TrueTime API (GPS + atomic clocks) to provide external consistency (a stronger form of linearizability) across global datacenters with commit latency ~7–14ms.

7.5 Consul

HashiCorp Consul uses Raft for its catalog, KV store, and service mesh configuration. Notable features:

  • Pre-vote extension: A node confirms it can communicate with a quorum before starting an election (avoids disruptions from partitioned nodes)
  • Non-voting members: Servers can join as non-voters for read scale
  • Autopilot: Automatic dead server cleanup and new member stabilization

7.6 Hyperledger Fabric

Hyperledger Fabric uses a pluggable ordering service for consensus. Options include:

  • Raft ordering service (default since v2.0)
  • Solo (development only, no fault tolerance)
  • Custom BFT orderers (in development)

Fabric separates consensus into: Endorse → Order → Validate/Commit — allowing different organizations to endorse without all participating in consensus.


8. Scalability & Performance Considerations

8.1 Read vs Write Throughput Under Consensus

Writes must go through the leader and require a full Raft round (AppendEntries to quorum). This creates a write bottleneck at the leader.

Reads have more flexibility:

  • Linearizable reads: Must go through the leader (or leader must confirm it's still the leader via a quorum read-index). Prevents stale reads but adds latency.
  • Follower reads: Any follower can serve reads with potential staleness. Good for analytics and cache-friendly workloads.
  • Lease-based reads: Leader holds a time-bounded lease during which it's guaranteed to be the leader (based on election timeout). Allows serving reads without quorum round-trip.

8.2 Batching and Pipelining

Naive Raft sends one AppendEntries per write. Optimizations:

  • Batching: Accumulate multiple writes and send them in a single AppendEntries call. Reduces per-entry overhead dramatically (100 writes/batch = 100x throughput improvement over single writes).
  • Pipelining: Don't wait for an AppendEntries response before sending the next one. Keep a sliding window of in-flight RPCs. etcd pipelines up to 128 entries ahead.

8.3 Commit Latency Formula

The minimum commit latency in a Raft/Paxos cluster is:

commit_latency ≥ 1 RTT to majority

For a 5-node cluster in the same datacenter:
  RTT ≈ 0.5ms → commit_latency ≈ 0.5–2ms

For geo-distributed cluster (cross-continent):
  RTT ≈ 100ms → commit_latency ≈ 100–200ms

To tolerate 1 WAN failure in a 5-node geo-cluster:
  Majority = 3 nodes. Place 3 nodes in primary region.
  commit_latency ≈ intra-datacenter RTT ≈ 1ms
  (minority nodes in remote DCs don't slow down commits!)
Enter fullscreen mode Exit fullscreen mode

8.4 Multi-Raft Architecture

  ┌──────────────────────────────────────────────────────┐
  │                   5-NODE CLUSTER                      │
  │                                                       │
  │  Key Range A-M       Key Range N-Z                    │
  │  ┌──────────────┐    ┌──────────────┐                │
  │  │ Raft Group 1 │    │ Raft Group 2 │                │
  │  │ Leader: N1   │    │ Leader: N3   │                │
  │  │ N1, N2, N3   │    │ N3, N4, N5   │                │
  │  └──────────────┘    └──────────────┘                │
  │                                                       │
  │  Parallel write throughput scales with # of groups    │
  │  Leadership spread across nodes for load balance      │
  └──────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

8.5 Geo-Distributed Consensus

The fundamental challenge: commit latency is bounded by the RTT to the majority of nodes. Strategies:

Strategy Mechanism Trade-off
Local quorum Place majority in one region Fast commits, single-region SPOF
Flexible quorum Choose quorum per write based on proximity Complex config, variable latency
Hierarchical Paxos Regional consensus + cross-region coord Lower WAN round-trips
CRDT-based Merge-based, no coordination Eventual consistency only
Spanner TrueTime GPS clocks bound commit wait External consistency, Google-only HW

9. Trade-Off Analysis

9.1 Protocol Comparison Table (Full)

Protocol Fault Model Min Nodes Msg Complexity Throughput Ease of Impl Primary Use Cases
Raft CFT 2F+1 O(N) High ★★★★ etcd, CockroachDB, Consul
Multi-Paxos CFT 2F+1 O(N) High ★★ Chubby, Spanner, Zookeeper*
ZAB CFT 2F+1 O(N) Very High ★★★ ZooKeeper
EPaxos CFT 2F+1 O(N) Very High Research
PBFT BFT 3F+1 O(N²) Low Hyperledger (legacy)
Tendermint BFT 3F+1 O(N²) Medium ★★ Cosmos, blockchain
VSR CFT 2F+1 O(N) High ★★★ Reference implementations

*ZooKeeper uses ZAB, not standard Paxos, but is influenced by Paxos design.


9.2 CAP Theorem Mapping

           CONSISTENCY
                │
                │
     Raft ──────┤
    Paxos ──────┤         ← All consensus protocols live here (CP)
      ZAB ──────┤
      PBFT ─────┤
                │
────────────────┼────────────────────
                │                  AVAILABILITY
    Cassandra ──┼──────────────────────────────────── (AP)
    DynamoDB ───┼─────────────────────────────── (AP)
                │
         PARTITION TOLERANCE (always present in distributed systems)
Enter fullscreen mode Exit fullscreen mode

Consensus protocols are CP systems:

  • They guarantee consistency (linearizability) even during partitions
  • The minority partition stalls (no progress) to preserve consistency
  • They sacrifice availability during partitions

Techniques to improve availability within CP systems:

  • Leader leases: Serve reads without quorum check (reduces read latency/stalls)
  • Follower reads: Accept stale reads for lower latency
  • Pre-vote: Prevent disruptive elections from partitioned nodes
  • Non-voting witnesses: Geographic replicas that don't participate in quorum

10. Design Decisions & Rationale

When building or selecting a consensus-based system, engineers face the following key decisions:

  1. Crash-Fault-Tolerant vs Byzantine-Fault-Tolerant

    • Rationale: CFT (Raft/Paxos) requires 2F+1 nodes and O(N) messages. BFT requires 3F+1 nodes and O(N²) messages. Unless you have untrusted nodes (blockchain, multi-org systems), CFT is almost always the right choice. BFT is 2-10× more expensive for the same fault tolerance level.
  2. Single-Leader vs Leaderless (Raft vs EPaxos)

    • Rationale: Single-leader simplifies reasoning, debugging, and client routing. The leader is the serialization point. Leaderless (EPaxos) gives better throughput for independent commands and lower latency by using closest replicas — but complicates conflict detection and makes debugging much harder. Choose single-leader unless you've hit the write bottleneck.
  3. Synchronous vs Asynchronous Replication Commitment

    • Rationale: Synchronous (wait for quorum ACK before responding to client) gives durability but adds latency. Asynchronous (acknowledge client immediately, replicate in background) loses durability on failure. Raft uses synchronous quorum replication for committed entries — this is non-negotiable for consistency. Some systems add an optional "async replicas" tier for geo-distribution without blocking commits.
  4. Snapshot Frequency and Log Truncation Strategy

    • Rationale: Too frequent snapshots → expensive serialization, I/O spikes. Too infrequent → log grows large, slow startup/recovery. Practical strategy: snapshot when log exceeds a size threshold (e.g., 64MB in etcd). Snapshots also accelerate new node bootstrapping — shipping a snapshot + recent log is far faster than replaying the full log.
  5. Network Partition Handling: Timeout Thresholds

    • Rationale: Election timeout must be longer than the max normal heartbeat round-trip (to avoid false elections from transient slowdowns) but short enough to elect a new leader quickly after a real failure. etcd default: 100ms heartbeat, 1000ms election timeout. Too short → frequent elections under load. Too long → long unavailability after crashes.
  6. Read Consistency: Linearizable Reads vs Follower Reads

    • Rationale: Linearizable reads require the leader to confirm it's still the leader (via ReadIndex in etcd) before serving the read — adds one RTT but prevents stale reads. Follower reads reduce leader load but risk serving stale data (bounded by replication lag). Decide based on application's consistency requirement.
  7. Cluster Membership Changes (Joint Consensus)

    • Rationale: Naively switching from old configuration to new in one step risks two independent quorums existing simultaneously (split-brain). Raft's joint consensus approach passes through a transitional phase where both old and new configurations must agree before switching. Always use a safe, tested membership change protocol — manual changes under load have caused many production outages.
  8. Witness Nodes / Observer Nodes for Geographic Distribution

    • Rationale: Full voting replicas in remote datacenters add cross-region RTT to every commit. Witness nodes participate in elections and store log metadata but don't receive full data copies. Observer/learner nodes receive committed data but don't vote. Strategically placing witnesses allows geo-redundancy without paying the full latency cost of geo-distributed quorums.
  9. Pre-Vote Extension

    • Rationale: A node that has been partitioned from the cluster will increment its term continuously while trying to get elected. When it reconnects, it disrupts the current leader by broadcasting a higher term (even though it can't actually win an election). The pre-vote phase (Raft extension by Ongaro) requires a candidate to confirm it can win an election before incrementing its term — preventing disruptive term inflation.
  10. Disk Persistence Strategy: fdatasync Frequency

    • Rationale: Raft requires log entries to be durable (written to disk) before responding. fdatasync() (or equivalent) on every write is correct but slow (~1–10ms per sync on SSDs). Batching multiple writes per sync dramatically improves throughput. WAL (Write-Ahead Log) files can be pre-allocated to reduce filesystem metadata overhead. This is often the #1 performance bottleneck in Raft implementations.

11. Practice Problems

Problem 1: 5-Node Raft Cluster Failure Analysis

Question: You have a 5-node Raft cluster (N1 through N5). How many failures can it tolerate? Walk through what happens when two nodes fail simultaneously, and then explain what happens during the next leader election.

Answer:

A 5-node cluster can tolerate F = 2 failures because N ≥ 2F+15 ≥ 2(2)+1 = 5. Quorum = ⌊5/2⌋+1 = 3.

Scenario: N1 (leader, term=4) and N3 fail simultaneously.

Before failure:  N1(L) ─ N2 ─ N3 ─ N4 ─ N5
After failure:    ✗(L)   N2   ✗   N4   N5  ← 3 nodes alive (quorum met)
Enter fullscreen mode Exit fullscreen mode
  1. N2, N4, N5 stop receiving heartbeats from N1
  2. After election timeout (150–300ms), one of them (say N4, with shortest timeout) increments to term=5 and sends RequestVote(term=5, ...) to N2 and N5
  3. N2 and N5 check: Has N4's log at least as up-to-date as theirs? If yes (and they haven't voted in term=5), they grant votes
  4. N4 gets votes from N2 and N5 → total = 3 (itself + 2) = quorum → N4 becomes leader in term=5
  5. N4 immediately sends heartbeats to N2 and N5 to prevent further elections
  6. Cluster continues operating with 3 nodes, tolerating 0 more failures

If three nodes failed, only 2 nodes remain — no quorum possible → cluster stalls (safety preserved, liveness lost).


Problem 2: Distributed Lock Service Using Raft

Question: Design a distributed lock service (like Chubby or etcd's lock API) using Raft as the consensus backend. What are the key components, and what are the failure scenarios you need to handle?

Answer:

Architecture:

Client ──► Lock Service (Raft cluster)
                │
          Raft State Machine
          (stores: lock_name → {holder, expiry, version})
Enter fullscreen mode Exit fullscreen mode

Core operations (all go through Raft log):

// Acquire lock: compare-and-swap on lock state
AcquireLock(name, client_id, ttl) -> (success, version)
  If lock is free OR expired  set holder=client_id, expiry=now+ttl
  Else  return false

// Release lock: conditional delete
ReleaseLock(name, client_id, version)
  If holder==client_id AND version matches  clear lock
  Else  return error (someone else holds it, or lock was stolen)

// Renew TTL (heartbeat)
RenewLock(name, client_id, version) -> new_expiry
Enter fullscreen mode Exit fullscreen mode

Key failure scenarios:

Scenario Risk Mitigation
Lock holder crashes Lock held indefinitely TTL expiry; lock auto-releases after TTL
Leader crashes mid-acquire Client unsure if lock acquired Idempotent operations with version/fencing token
Network partition (client isolated) Client thinks it holds lock but TTL expired Use fencing token (monotonically increasing version) in all downstream operations
Spurious timeout Client renews too slowly TTL should be >> client-server RTT; renew at TTL/3

Fencing tokens are critical: even if a client holds a stale lock reference, downstream resources reject writes with old fencing tokens.


Problem 3: Raft Cluster During a 2+3 Network Partition

Question: A 5-node Raft cluster (N1=leader in term=7, N2, N3, N4, N5) experiences a network partition splitting into {N1, N2} and {N3, N4, N5}. What happens in each partition? What if a client writes to N1?

Answer:

Partition A (minority): { N1(leader, term=7), N2 }
Partition B (majority): { N3, N4, N5 }
Enter fullscreen mode Exit fullscreen mode

Partition B (majority side):

  • N3, N4, N5 stop hearing heartbeats from N1
  • After election timeout: one of them (say N5) increments to term=8, runs election
  • N5 gets votes from N3 and N4 → N5 becomes leader in term=8
  • Partition B continues serving writes normally

Partition A (minority side):

  • N1 still thinks it's the leader (it hasn't heard a higher term yet)
  • N1 continues accepting client writes (bad for clients)
  • N1 sends AppendEntries to N2 only → only 2 nodes, never reaches quorum of 3
  • These writes are NEVER committed — they accumulate in N1 and N2's logs as uncommitted entries
  • Clients waiting for commit acknowledgment will timeout (no response from N1 since quorum unreachable)

On partition healing:

  • N1 receives a heartbeat or AppendEntries from N5 (term=8 > 7)
  • N1 immediately steps down, reverts to follower in term=8
  • N1's uncommitted entries (from partition A) are overwritten by N5's log
  • Any client writes that N1 "accepted" but never committed are silently lost (clients must retry)

Key takeaway: Raft's commitment rule (quorum required) means N1 never committed anything during the partition. No committed data is ever lost. Uncommitted data (write in flight) may be lost — clients must handle retries.


Problem 4: Why You Can't Add a New Node Instantly

Question: You have a 3-node Raft cluster and want to add a fourth node. Why can't you just add it instantly? What's the safe procedure?

Answer:

The problem with instant addition:

When you go from 3 nodes (quorum=2) to 4 nodes in a single configuration switch, there's a dangerous window:

Old config (3 nodes): quorum = 2
New config (4 nodes): quorum = 3

If N1 and N2 know about the new config (quorum=3) but
N3 still thinks it's 3 nodes (quorum=2)...

N3 + old-N1 can form a quorum of 2 (under old config)
N2 + N4 + new-N1 can form a quorum of 3 (under new config)
→ TWO simultaneous quorums → potential split-brain!
Enter fullscreen mode Exit fullscreen mode

Safe procedure — Raft Joint Consensus:

  1. New node joins as non-voter: The new node (N4) replicates the full log but doesn't vote. This can take time (catching up from snapshot + log tail).

  2. Transition to joint configuration C(old,new): The leader commits a configuration entry representing the joint state (both old and new nodes). During joint consensus, decisions require a quorum from BOTH old AND new configurations.

  3. Commit new configuration C(new): Once the joint config is committed, the leader commits the new config entry. Old config is abandoned.

Phase 1: [N1, N2, N3] + N4 joins as learner
Phase 2: C(old,new) committed — quorum requires: 2 of {N1,N2,N3} AND 3 of {N1,N2,N3,N4}
Phase 3: C(new) committed — quorum: 3 of {N1,N2,N3,N4}
Enter fullscreen mode Exit fullscreen mode

At no point can two independent quorums coexist, preventing split-brain.


Problem 5: Leader Crashes Before Committing

Question: The Raft leader receives a write request "SET x=100", appends it to its log at index 42 (term 7), and sends AppendEntries to all followers. The leader crashes after N2 acknowledges but before N3 acknowledges. What happens?

Answer:

State at crash:

N1 (leader, crashed):  [..., (term=7, idx=42, SET x=100)]  ← uncommitted
N2 (follower):         [..., (term=7, idx=42, SET x=100)]  ← has the entry
N3 (follower):         [... idx=41]                         ← doesn't have it
N4 (follower):         [..., (term=7, idx=42, SET x=100)]  ← has the entry
N5 (follower):         [... idx=41]                         ← doesn't have it
Enter fullscreen mode Exit fullscreen mode

Scenario A: N2 or N4 (who have the entry) wins the election

  • Say N2 wins (term=8). N2 has entry at idx=42.
  • N2 replicates its log to N3, N4, N5 (if not already there)
  • N2 commits a new entry in term=8 (say a no-op)
  • By the commitment rule, this anchors idx=42 — it becomes committed
  • "SET x=100" is committed and durable. Client write succeeds (on retry).

Scenario B: N3 or N5 (who don't have the entry) wins the election

  • Wait — can they? The election safety rule prevents this.
  • N3/N5 have log ending at idx=41 (term ≤ 7). N2/N4 have idx=42 (term=7).
  • In RequestVote, voters reject candidates whose logs are less up-to-date.
  • N2 and N4 will reject N3's and N5's vote requests (their logs are older).
  • So N3/N5 cannot win the election if N2 or N4 are alive.

Client behavior:

  • The client's write request timed out (no commit acknowledgment from N1 before crash)
  • Client should retry with idempotency key — the new leader will either already have the entry or not
  • This is why client libraries for Raft-based systems always retry writes with unique request IDs

12. Summary & Next Steps

Summary

Consensus protocols solve the fundamental problem of distributed agreement in the presence of failures. The journey from impossibility (FLP Theorem, Two Generals) to practical solutions (Paxos, Raft, ZAB) is one of the most intellectually rich areas of computer science.

Paxos laid the mathematical foundation: ballot numbers, quorum intersection, and two-phase commit of values. Its generalization (Multi-Paxos) powers Google's infrastructure. Raft took understandability seriously, decomposing the problem into leader election, log replication, and safety — making it the dominant protocol for new systems. BFT protocols (PBFT, Tendermint) extend correctness to adversarial environments at significant cost. Real systems like etcd, ZooKeeper, CockroachDB, and Spanner translate these theoretical constructs into production-grade infrastructure that underpins the cloud.

The key engineering insight: consensus is a spectrum of trade-offs between safety, liveness, throughput, latency, and implementation complexity. No single protocol dominates all dimensions. The right choice depends on your fault model, consistency requirements, geographic distribution, and operational maturity.


Concept Map

                        CONSENSUS PROTOCOLS
                               │
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                     ▼
    FOUNDATIONS           PROTOCOLS             REAL SYSTEMS
          │                    │                     │
    ┌─────┴──────┐    ┌────────┴────────┐    ┌──────┴──────┐
    │ Two Gen.   │    │ Paxos (classic) │    │ etcd (Raft) │
    │ Byzantine  │    │ Raft (modern)   │    │ ZooKeeper   │
    │ FLP Thm.   │    │ ZAB             │    │ CockroachDB │
    │ Safety/    │    │ PBFT (BFT)      │    │ Spanner     │
    │ Liveness   │    │ EPaxos          │    │ Consul      │
    └─────┬──────┘    └────────┬────────┘    └──────┬──────┘
          │                    │                     │
          └────────────┬───────┘                     │
                       ▼                             │
              CORE MECHANISMS                        │
                       │                             │
    ┌──────────────────┼──────────────────┐          │
    ▼                  ▼                  ▼          │
 Quorum            Terms/Epochs      Log Replication │
 (N≥2F+1)        (prevent stale)    (ordered cmds)  │
    │                  │                  │          │
    └──────────────────┴──────────────────┴──────────┘
                                │
                    PERFORMANCE & SCALE
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                  ▼
         Batching           Multi-Raft        Geo-Distribution
        Pipelining       (key sharding)    (WAN latency trade-offs)
Enter fullscreen mode Exit fullscreen mode

Suggested Next Learning Topics

Topic Why It's Relevant
Distributed Transactions (2PC, Saga) Consensus handles single-key ordering; cross-shard transactions need 2PC (atomic commit) layered on top
Vector Clocks & Logical Clocks Understanding causality and happens-before relationships; prerequisite for CRDTs and leaderless systems
CRDTs (Conflict-free Replicated Data Types) Alternative to consensus for high-availability eventual consistency; understanding when consensus is overkill
Consistent Hashing How to partition data across Raft groups / shards without rebalancing everything
Linearizability & Serializability Formal consistency models; understanding what guarantees your database actually provides
Failure Detectors Chandra-Toueg theory; how timeouts and φ-accrual detectors work in production (Cassandra, Akka)
Lamport Clocks / TrueTime How Google Spanner achieves external consistency without consensus per read
Gossip Protocols Scalable eventually-consistent dissemination (used for cluster membership, not log ordering)

Document generated as part of the Distributed Systems Learning Path.
Target audience: Intermediate-to-Advanced Software Engineers
Minimum reading time: 60–90 minutes for full comprehension

Top comments (0)