Haji Rufai

Posted on May 28

Building a Distributed Key-Value Store with Raft Consensus in Python

#python #distributedsystems #backend #tutorial

If you've ever wondered how systems like etcd, CockroachDB, or Consul maintain consistency across multiple servers, the answer is usually a consensus algorithm. The most popular one today is Raft — designed explicitly to be understandable.

I built RaftKV, a complete distributed key-value store from scratch in Python, implementing the full Raft consensus protocol. No external coordination services. No ZooKeeper. Just pure Python, TCP, and the Raft paper.

🔗 GitHub Repository | 🌐 Live Demo Page

Why Build This?

Every distributed system job interview touches on consensus. "How does Kafka guarantee ordering?" "What happens during a network partition?" "How does leader election work?"

Building Raft from scratch gives you genuine answers to all of these. Plus, it exercises nearly every systems programming skill:

Concurrent programming — async event loops, timers, state machines
Networking — HTTP RPC, timeouts, failure handling
Persistence — write-ahead logs, snapshots, crash recovery
Algorithms — leader election, log replication, commit rules

The Raft Algorithm in 5 Minutes

Raft organizes a cluster of nodes into three roles:

Follower ──[election timeout]──▶ Candidate ──[wins majority]──▶ Leader
    ▲                                │                              │
    │                  loses/timeout │          discovers higher    │
    └────────────────────────────────┘          term                │
    └──────────────────────────────────────────────────────────────┘

Leader Election

Every node starts as a Follower with a random election timeout (1.5–3 seconds)
If a follower doesn't hear from a leader before timeout → becomes a Candidate
The candidate increments its term, votes for itself, and sends RequestVote RPCs
If it receives votes from a majority → becomes Leader
The leader sends periodic heartbeats to maintain authority

The randomized timeouts prevent "election storms" where nodes keep splitting votes.

Log Replication

Once a leader is elected, all client writes go through it:

async def propose(self, command: dict) -> tuple[bool, str]:
    if self.state != NodeState.LEADER:
        return False, "Not leader"

    # 1. Append to local log
    entry = self.log.create_entry(self.current_term, command=command)

    # 2. Persist to WAL
    if self.wal:
        self.wal.append_entry(entry)

    # 3. Wait for majority replication
    future = asyncio.get_event_loop().create_future()
    self._pending_commands[entry.index] = future

    await asyncio.wait_for(future, timeout=5.0)
    return True, f"Committed at index {entry.index}"

The leader replicates entries to followers via AppendEntries RPCs. An entry is committed only when a majority of nodes have it.

Safety Properties

Raft ensures:

Election Safety: At most one leader per term
Leader Append-Only: Leaders never overwrite or delete log entries
Log Matching: If two logs contain an entry with the same index and term, all preceding entries are identical
Leader Completeness: A committed entry will be present in all future leaders' logs

Architecture Overview

┌─────────────────────────────────────────────┐
│              Client Request                  │
│         PUT /kv/name {"value":"Raft"}        │
└──────────────────┬──────────────────────────┘
                   ▼
┌──────────────────────────────────────────────┐
│              LEADER (Node 1)                  │
│  HTTP Server → Raft Engine → Log (WAL) → KV  │
└─────────┬──────────────────────┬─────────────┘
          │ AppendEntries        │ AppendEntries
          ▼                      ▼
   ┌─────────────┐       ┌─────────────┐
   │  FOLLOWER   │       │  FOLLOWER   │
   │  (Node 2)   │       │  (Node 3)   │
   └─────────────┘       └─────────────┘

Key components:

Module	Purpose
`consensus.py`	Core Raft engine — election, replication, commit
`log.py`	Append-only log with entries and conflict resolution
`wal.py`	Write-ahead log persistence and crash recovery
`store.py`	KV state machine — applies committed entries
`server.py`	HTTP server with client API + Raft RPC
`client.py`	Async Python client with leader auto-discovery

The Raft Log

The heart of Raft is the replicated log. Every entry has a term (election epoch) and index:

@dataclass
class LogEntry:
    term: int
    index: int
    entry_type: str = EntryType.COMMAND
    command: dict | None = None

The log handles conflict resolution — when a follower receives entries from a new leader that conflict with its existing log, it truncates:

def append_entries(self, prev_index, prev_term, entries):
    # Consistency check
    if prev_index > 0:
        prev_entry = self.get(prev_index)
        if prev_entry is None or prev_entry.term != prev_term:
            return False  # Reject — log inconsistency

    for entry in entries:
        existing = self.get(entry.index)
        if existing and existing.term != entry.term:
            self._truncate_from(entry.index)  # Conflict!
            self.entries.append(entry)
        elif existing is None:
            self.entries.append(entry)

    return True

Leader Election Deep Dive

The election logic is the most timing-sensitive part:

async def _start_election(self):
    self.current_term += 1
    self.state = NodeState.CANDIDATE
    self.voted_for = self.node_id

    votes_received = 1  # Self-vote
    votes_needed = self.config.majority

    vote_tasks = [
        self._request_vote(peer, self.current_term)
        for peer in self.peers
    ]
    results = await asyncio.gather(*vote_tasks, return_exceptions=True)

    for result in results:
        if result is True:
            votes_received += 1

    if votes_received >= votes_needed:
        self._become_leader()

A node only grants its vote if the candidate's term is current, the voter hasn't voted for someone else, and the candidate's log is at least as up-to-date.

Write-Ahead Log & Crash Recovery

Durability requires persisting entries before acknowledging them:

class WriteAheadLog:
    def append_entry(self, entry: LogEntry):
        with open(self.wal_path, "a") as f:
            f.write(json.dumps(entry.to_dict()) + "\n")

    def save_state(self, current_term, voted_for, commit_index):
        tmp = self.state_path.with_suffix(".tmp")
        with open(tmp, "w") as f:
            json.dump(state, f)
        tmp.rename(self.state_path)  # Atomic write

On startup, nodes replay the WAL from the last snapshot to recover state.

Running a Cluster

Docker Compose

docker-compose up --build
# 3 nodes on ports 8001, 8002, 8003

CLI Usage

# Store values
raftkv put db.host "postgres.internal"
raftkv get db.host
# db.host = postgres.internal

# Check cluster
raftkv status --node http://localhost:8001

HTTP API

curl -X PUT http://localhost:8001/kv/name \
     -d '{"value": "RaftKV"}'

curl http://localhost:8001/kv/name
# {"key": "name", "value": "RaftKV"}

curl http://localhost:8001/cluster/status

Testing

89 tests across 7 modules covering unit, RPC, consensus, and multi-node integration:

@pytest.mark.asyncio
async def test_data_replicated_to_followers(three_node_cluster):
    nodes = three_node_cluster
    leader = find_leader(nodes)

    await leader.propose({"op": "put", "key": "test", "value": "hello"})
    await asyncio.sleep(1.0)

    for node in nodes:
        assert node.store.get("test") == "hello"

Key Takeaways

Timing is everything — Randomized election timeouts are elegant but tricky to tune
The log consistency check is the real hero — prevLogIndex + prevLogTerm prevents split-brain corruption
Term numbers are the universal tiebreaker — Higher term always wins
Separating consensus from state machine is powerful — The Raft log is independent of the KV store
Persistence must be atomic — Write-then-rename for state, JSONL append for WAL

Check out the full source code on GitHub. Star it if you find it useful!

This is project #10 in my "Building From Scratch" series. Previous projects: SQL query engine, vector search engine, workflow orchestration engine.

DEV Community