If you've ever wondered how systems like etcd, CockroachDB, or Consul maintain consistency across multiple servers, the answer is usually a consensus algorithm. The most popular one today is Raft — designed explicitly to be understandable.
I built RaftKV, a complete distributed key-value store from scratch in Python, implementing the full Raft consensus protocol. No external coordination services. No ZooKeeper. Just pure Python, TCP, and the Raft paper.
🔗 GitHub Repository | 🌐 Live Demo Page
Why Build This?
Every distributed system job interview touches on consensus. "How does Kafka guarantee ordering?" "What happens during a network partition?" "How does leader election work?"
Building Raft from scratch gives you genuine answers to all of these. Plus, it exercises nearly every systems programming skill:
- Concurrent programming — async event loops, timers, state machines
- Networking — HTTP RPC, timeouts, failure handling
- Persistence — write-ahead logs, snapshots, crash recovery
- Algorithms — leader election, log replication, commit rules
The Raft Algorithm in 5 Minutes
Raft organizes a cluster of nodes into three roles:
Follower ──[election timeout]──▶ Candidate ──[wins majority]──▶ Leader
▲ │ │
│ loses/timeout │ discovers higher │
└────────────────────────────────┘ term │
└──────────────────────────────────────────────────────────────┘
Leader Election
- Every node starts as a Follower with a random election timeout (1.5–3 seconds)
- If a follower doesn't hear from a leader before timeout → becomes a Candidate
- The candidate increments its
term, votes for itself, and sendsRequestVoteRPCs - If it receives votes from a majority → becomes Leader
- The leader sends periodic heartbeats to maintain authority
The randomized timeouts prevent "election storms" where nodes keep splitting votes.
Log Replication
Once a leader is elected, all client writes go through it:
async def propose(self, command: dict) -> tuple[bool, str]:
if self.state != NodeState.LEADER:
return False, "Not leader"
# 1. Append to local log
entry = self.log.create_entry(self.current_term, command=command)
# 2. Persist to WAL
if self.wal:
self.wal.append_entry(entry)
# 3. Wait for majority replication
future = asyncio.get_event_loop().create_future()
self._pending_commands[entry.index] = future
await asyncio.wait_for(future, timeout=5.0)
return True, f"Committed at index {entry.index}"
The leader replicates entries to followers via AppendEntries RPCs. An entry is committed only when a majority of nodes have it.
Safety Properties
Raft ensures:
- Election Safety: At most one leader per term
- Leader Append-Only: Leaders never overwrite or delete log entries
- Log Matching: If two logs contain an entry with the same index and term, all preceding entries are identical
- Leader Completeness: A committed entry will be present in all future leaders' logs
Architecture Overview
┌─────────────────────────────────────────────┐
│ Client Request │
│ PUT /kv/name {"value":"Raft"} │
└──────────────────┬──────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ LEADER (Node 1) │
│ HTTP Server → Raft Engine → Log (WAL) → KV │
└─────────┬──────────────────────┬─────────────┘
│ AppendEntries │ AppendEntries
▼ ▼
┌─────────────┐ ┌─────────────┐
│ FOLLOWER │ │ FOLLOWER │
│ (Node 2) │ │ (Node 3) │
└─────────────┘ └─────────────┘
Key components:
| Module | Purpose |
|---|---|
consensus.py |
Core Raft engine — election, replication, commit |
log.py |
Append-only log with entries and conflict resolution |
wal.py |
Write-ahead log persistence and crash recovery |
store.py |
KV state machine — applies committed entries |
server.py |
HTTP server with client API + Raft RPC |
client.py |
Async Python client with leader auto-discovery |
The Raft Log
The heart of Raft is the replicated log. Every entry has a term (election epoch) and index:
@dataclass
class LogEntry:
term: int
index: int
entry_type: str = EntryType.COMMAND
command: dict | None = None
The log handles conflict resolution — when a follower receives entries from a new leader that conflict with its existing log, it truncates:
def append_entries(self, prev_index, prev_term, entries):
# Consistency check
if prev_index > 0:
prev_entry = self.get(prev_index)
if prev_entry is None or prev_entry.term != prev_term:
return False # Reject — log inconsistency
for entry in entries:
existing = self.get(entry.index)
if existing and existing.term != entry.term:
self._truncate_from(entry.index) # Conflict!
self.entries.append(entry)
elif existing is None:
self.entries.append(entry)
return True
Leader Election Deep Dive
The election logic is the most timing-sensitive part:
async def _start_election(self):
self.current_term += 1
self.state = NodeState.CANDIDATE
self.voted_for = self.node_id
votes_received = 1 # Self-vote
votes_needed = self.config.majority
vote_tasks = [
self._request_vote(peer, self.current_term)
for peer in self.peers
]
results = await asyncio.gather(*vote_tasks, return_exceptions=True)
for result in results:
if result is True:
votes_received += 1
if votes_received >= votes_needed:
self._become_leader()
A node only grants its vote if the candidate's term is current, the voter hasn't voted for someone else, and the candidate's log is at least as up-to-date.
Write-Ahead Log & Crash Recovery
Durability requires persisting entries before acknowledging them:
class WriteAheadLog:
def append_entry(self, entry: LogEntry):
with open(self.wal_path, "a") as f:
f.write(json.dumps(entry.to_dict()) + "\n")
def save_state(self, current_term, voted_for, commit_index):
tmp = self.state_path.with_suffix(".tmp")
with open(tmp, "w") as f:
json.dump(state, f)
tmp.rename(self.state_path) # Atomic write
On startup, nodes replay the WAL from the last snapshot to recover state.
Running a Cluster
Docker Compose
docker-compose up --build
# 3 nodes on ports 8001, 8002, 8003
CLI Usage
# Store values
raftkv put db.host "postgres.internal"
raftkv get db.host
# db.host = postgres.internal
# Check cluster
raftkv status --node http://localhost:8001
HTTP API
curl -X PUT http://localhost:8001/kv/name \
-d '{"value": "RaftKV"}'
curl http://localhost:8001/kv/name
# {"key": "name", "value": "RaftKV"}
curl http://localhost:8001/cluster/status
Testing
89 tests across 7 modules covering unit, RPC, consensus, and multi-node integration:
@pytest.mark.asyncio
async def test_data_replicated_to_followers(three_node_cluster):
nodes = three_node_cluster
leader = find_leader(nodes)
await leader.propose({"op": "put", "key": "test", "value": "hello"})
await asyncio.sleep(1.0)
for node in nodes:
assert node.store.get("test") == "hello"
Key Takeaways
- Timing is everything — Randomized election timeouts are elegant but tricky to tune
-
The log consistency check is the real hero —
prevLogIndex+prevLogTermprevents split-brain corruption - Term numbers are the universal tiebreaker — Higher term always wins
- Separating consensus from state machine is powerful — The Raft log is independent of the KV store
- Persistence must be atomic — Write-then-rename for state, JSONL append for WAL
Check out the full source code on GitHub. Star it if you find it useful!
This is project #10 in my "Building From Scratch" series. Previous projects: SQL query engine, vector search engine, workflow orchestration engine.
Top comments (0)