I Tried to Implement the Raft Consensus Algorithm. Here's What Happened.
GitHub Link: https://github.com/RedJOe-0608/raft-node-ts
So, this weekend, I thought I'd quickly implement something I've been hearing about in the distributed systems world: the Raft Consensus Algorithm. After spending an entire day going through the paper and a few blog articles for reference, I realised that implementing this algorithm end-to-end - as proposed in the paper - would be a far greater challenge than I had imagined when I set out on this little adventure.
So this blog is my dumbed-down implementation of the algorithm, which covers the crux of it. At the end, I'll discuss the limitations of my implementation. And if I ever dare to embark on tackling those limitations in the future, I'll post about that too. No promises though.
Also - I am very much a learner here. If you spot something wrong or misleading, please correct me in the comments. I'd genuinely appreciate it.
Introduction
The core question this algorithm tries to answer is: how do we keep multiple servers in sync with fresh, consistent data in the messy world of distributed systems - where networks fail, servers crash, and chaos is basically the default state?
Now, there was an algorithm before Raft that tackled this same problem, called Paxos. But from what I gathered in my research, Paxos was notoriously difficult to understand and even harder to implement correctly. In fact, it was precisely the overly complicated nature of Paxos that inspired Diego Ongaro and John Ousterhout to come up with Raft.
From the paper itself:
"Could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos?"
Their primary goal was understandability. A bold claim for a consensus algorithm. And if you ask me, it mostly holds up - though "understandable" is doing some heavy lifting in that sentence.
But First - Why Do We Even Need This?
Let me back up a bit, because jumping into consensus algorithms without context is a great way to confuse everyone, including yourself.
Imagine a world where there's one all-mighty super-server. One machine. Infinite compute, infinite memory, infinite bandwidth. Every request on the planet - every read, every write - routes through it.
In that world, you don't need a consensus algorithm. The server gets a write - say, x = 42 - it stores it. Done. There's no "who else knows about this?" because there is no one else. Consistency is trivially guaranteed because there's only one source of truth.
But we don't live in that world. And not just because no such machine exists - even if it did, a user in Mumbai hitting a server in Virginia would have to wait ~200ms just for the round trip, before the server even does anything. Scale that to billions of users and the latency alone makes the whole thing fall apart.
So in the real world, we distribute. We run multiple servers. We spread them across regions. We replicate data so that if one goes down, others can keep things running.
And that's where things get messy.
The Problem With Multiple Servers
The moment you have more than one server holding the same data, you have a consistency problem.
Say two servers both store a user's bank balance: $1000. A user sends a request - "deduct $50." Server A processes it and updates its copy to $950. But before it can tell Server B, the network has a little moment. Server B still thinks the balance is $1000. For a brief window, two servers disagree about reality.
This is a data integrity problem. The data exists in multiple places, and those places have fallen out of sync. Depending on which server your next request hits, you get a different answer to the same question. Not great.
Now - and this is important - in some situations, this is actually acceptable. If you're loading a social media feed and one server is two seconds behind, you probably don't notice, and you definitely don't care. The system will catch up. All servers will eventually agree on the same state. This model has a name: Eventual Consistency.
Eventual Consistency - and Why It's Not Always Enough
Before getting to Raft, it's worth understanding eventual consistency properly, because the two exist at opposite ends of a tradeoff.
The CAP Theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition Tolerance - more precisely, when a network partition occurs, you're forced to choose between consistency and availability. (I'll be honest - the first time I read about CAP, I nodded along like I understood it and then immediately Googled it again. It's one of those things.)
Eventual consistency trades away strong consistency in favour of high availability. The idea: every write will eventually propagate to all nodes, but there's no guarantee of when. You might read stale data for a bit, but the system stays up and responsive even when parts of it are failing.
Databases like DynamoDB, Cassandra, and CouchDB lean heavily on this model. It works brilliantly for use cases like product catalogs, social media timelines, user preferences, recommendation feeds - anything where being momentarily stale doesn't cause real harm.
But "eventually consistent" is a phrase that should send a small chill down your spine when it shows up in the wrong context.
Think about a payment system. Or a distributed lock, where two services must never simultaneously believe they hold the lock. Or any system where other services depend on the data being correct right now, not "in a few seconds." In these cases, "the servers will agree... eventually" is not a promise you can ship.
You need strong consistency - the guarantee that once a write is acknowledged, every subsequent read from any node reflects that write. No stale data. No conflicting answers.
This is the gap that Raft fills.
Drawing From the Single-Server Idea
Here's an elegant observation. Even though we can't have one all-mighty server, we can borrow the intuition from why that model is so clean.
In a single-server world, consistency is easy because there's one decision-maker. Every write goes through one place. There's no ambiguity, no conflict, no "well, which server's version is correct?"
What if, even with multiple servers, we designated one of them as the single authority for writes - and made the others follow its lead? All writes go to the leader. The leader replicates them to the followers. The followers don't accept writes directly. They just mirror what the leader tells them.
That's the core intuition behind Raft's leader-follower model. It's essentially borrowing the simplicity of the single-server world and layering it on top of a distributed system.
The hard part isn't the idea. The hard part is: how do you safely elect a leader? What happens when the leader crashes mid-write? What if the network splits and followers can't hear the leader - do they elect a new one? And if they do, how do you prevent two nodes from both thinking they're the leader, both accepting writes, and silently diverging from each other?
These are the exact questions Raft answers. Let's get into how.
The Setup
Three nodes, each running a tiny Express server. Each node is a RaftNode that talks to its peers over HTTP. I used Docker Compose to wire them up:
services:
  node-a:
    build: .
    environment:
      NODE_ID: A
      PORT: "3000"
      PEERS: http://node-b:3001,http://node-c:3002
    ports:
      - "3000:3000"

  node-b:
    build: .
    environment:
      NODE_ID: B
      PORT: "3001"
      PEERS: http://node-a:3000,http://node-c:3002
    ports:
      - "3001:3001"

  node-c:
    build: .
    environment:
      NODE_ID: C
      PORT: "3002"
      PEERS: http://node-a:3000,http://node-b:3001
    ports:
      - "3002:3002"
Each node knows its own ID and its peers' URLs. No shared state, no central coordinator. They figure out who's in charge entirely by themselves. Which is either impressive or terrifying depending on your mood.
The server.ts just wires the HTTP endpoints to the RaftNode methods:
import express from "express";
import { RaftNode } from "./RaftNode"; // wherever the class lives in your layout

const app = express();
app.use(express.json());

const NODE_ID = process.env.NODE_ID || "A";
const PORT = process.env.PORT ? Number(process.env.PORT) : 3000;
const PEERS = process.env.PEERS ? process.env.PEERS.split(",") : [];

const raftNode = new RaftNode(NODE_ID, PEERS);

app.post("/request-vote", (req, res) => {
  const { term, candidateId, lastLogIndex, lastLogTerm } = req.body;
  const result = raftNode.handleRequestVote(term, candidateId, lastLogIndex, lastLogTerm);
  res.json(result);
});

app.post("/append-entries", (req, res) => {
  const { term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit } = req.body;
  const result = raftNode.handleAppendEntries(
    term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit
  );
  res.json(result);
});

app.post("/command", async (req, res) => {
  const { op, key, value } = req.body;
  const result = await raftNode.submitCommand({ op, key, value });
  res.json(result);
});

app.listen(PORT, () => console.log(`[${NODE_ID}] listening on ${PORT}`));
Two RPCs - /request-vote and /append-entries - handle all the Raft-internal communication between nodes. /command is how a client submits a write. Clean enough.
Step 1: Every Node Starts as a Follower
When a RaftNode boots up, it has no idea what's going on. It's a follower, term 0, no log, no leader. Basically me at the start of this weekend project.
constructor(id: string, peers: string[]) {
  this.id = id;
  this.peers = peers;
  this.state = "follower";
  this.currentTerm = 0;
  this.votedFor = null;
  this.log = [];
  this.commitIndex = -1;
  this.lastApplied = -1;
  this.matchIndex = {};
  this.nextIndex = {};
  this.store = new Store();

  this.resetElectionTimer();
}
The moment it starts, it kicks off an election timer - a random timeout between 150ms and 300ms. If a follower doesn't hear from a leader before this timer fires, it assumes there is no leader and starts an election.
The randomness is intentional. You don't want all three nodes timing out at the exact same millisecond and all trying to become leader at once. That results in a split vote - nobody wins, everyone retries, and the cluster just keeps spinning. The random spread ensures that, most of the time, one node times out just a bit before the others and gets its campaign going first.
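The resetElectionTimer method itself isn't shown in this post, so here's a minimal standalone sketch of the idea (the real version lives inside RaftNode; the exact body here is my reconstruction, not code from the repo):

```typescript
// Sketch of the election timer described above: a random timeout in
// [150ms, 300ms) that, when it fires, kicks off an election.
const ELECTION_TIMEOUT_MIN = 150;
const ELECTION_TIMEOUT_MAX = 300;

function randomElectionTimeout(): number {
  return (
    ELECTION_TIMEOUT_MIN +
    Math.floor(Math.random() * (ELECTION_TIMEOUT_MAX - ELECTION_TIMEOUT_MIN))
  );
}

class TimerDemo {
  private electionTimer?: ReturnType<typeof setTimeout>;

  resetElectionTimer(onTimeout: () => void) {
    // Clear any pending timer first - a heartbeat arriving means
    // "the leader is alive, start counting from zero again".
    if (this.electionTimer) clearTimeout(this.electionTimer);
    this.electionTimer = setTimeout(onTimeout, randomElectionTimeout());
  }
}
```

The important detail is the clear-before-reschedule: every heartbeat pushes the timeout further into the future, so a follower only ever reaches the end of the timer if the leader has genuinely gone quiet.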
Step 2: Starting an Election
When the timer fires, the node promotes itself to candidate, increments its term, votes for itself, and starts asking peers for votes:
async startElection() {
  this.state = 'candidate';
  this.currentTerm += 1;
  this.votedFor = this.id;

  let votes = 1; // already voted for itself
  const majority = Math.floor((this.peers.length + 1) / 2) + 1;

  const voteRequests = this.peers.map(async (peer) => {
    try {
      const response = await fetch(`${peer}/request-vote`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          term: this.currentTerm,
          candidateId: this.id,
          lastLogIndex: this.log.length - 1,
          lastLogTerm: this.log.length > 0
            ? this.log[this.log.length - 1].term
            : -1
        })
      });

      const data = await response.json();

      if (data.term > this.currentTerm) {
        this.stepDown(data.term);
        return;
      }

      if (this.state === 'candidate' && data.voteGranted) {
        votes++;
        if (votes >= majority) {
          this.becomeLeader();
        }
      }
    } catch (error) {
      console.log(`[${this.id}] Could not reach peer ${peer} for vote`);
    }
  });

  await Promise.all(voteRequests);

  if (this.state === 'candidate') {
    this.stepDown(this.currentTerm);
  }
}
A few things worth calling out:
The majority calculation. For 3 nodes, majority is 2. For 5 nodes, it's 3. The formula is Math.floor((total) / 2) + 1. I got this wrong the first time. Off-by-one errors - a timeless rite of passage.
What if a peer has a higher term? If a peer responds with a term number higher than mine, it means I'm stale - the cluster has moved on without me. I step down immediately. No arguments. Term number is the ultimate authority on who's current in Raft.
What if nobody wins? If all vote requests resolve and I still don't have majority, I step back to follower and reset the election timer. Raft calls this a "split vote." It just retries until someone wins. The randomised timeouts make sure this doesn't loop forever.
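stepDown shows up in almost every snippet here but never gets its own listing. A minimal sketch of what mine does (details may differ slightly from the repo):

```typescript
// Sketch of stepDown as used throughout this post: revert to follower,
// adopt the newer term, forget the old vote, and restart the election timer.
type State = "follower" | "candidate" | "leader";

class StepDownDemo {
  state: State = "leader";
  currentTerm = 3;
  votedFor: string | null = "A";
  private heartbeatTimer?: ReturnType<typeof setInterval>;

  stepDown(newTerm: number) {
    if (newTerm > this.currentTerm) {
      this.currentTerm = newTerm;
      this.votedFor = null; // free to vote again in the new term
    }
    this.state = "follower";

    // A deposed leader must stop sending heartbeats.
    if (this.heartbeatTimer) clearInterval(this.heartbeatTimer);

    this.resetElectionTimer();
  }

  resetElectionTimer() {
    /* restart the randomised election timeout from Step 1 */
  }
}
```

Resetting votedFor only when the term actually advances is the detail that matters: within a single term, a node must never hand out a second vote.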
Step 3: The Voting Rules
On the receiving end, a node doesn't just hand out votes to anyone who knocks. There are rules:
handleRequestVote(term, candidateId, lastLogIndex, lastLogTerm) {
  if (term < this.currentTerm) {
    return { voteGranted: false, term: this.currentTerm };
  }

  if (term > this.currentTerm) {
    this.stepDown(term);
  }

  const alreadyVotedForOther =
    this.votedFor !== null && this.votedFor !== candidateId;
  if (alreadyVotedForOther) {
    return { voteGranted: false, term: this.currentTerm };
  }

  const myLastLogIndex = this.log.length - 1;
  const myLastLogTerm = this.log.length > 0
    ? this.log[this.log.length - 1].term : -1;

  const candidateLogIsUpToDate =
    lastLogTerm > myLastLogTerm ||
    (lastLogTerm === myLastLogTerm && lastLogIndex >= myLastLogIndex);

  if (!candidateLogIsUpToDate) {
    return { voteGranted: false, term: this.currentTerm };
  }

  this.votedFor = candidateId;
  this.resetElectionTimer();
  return { voteGranted: true, term: this.currentTerm };
}
The log up-to-date check is the bit that made me re-read the paper twice (Section 5.4.1, if you want to go down that rabbit hole). The rule: you should not vote for a candidate whose log is behind yours. A node with a stale log becoming leader could overwrite entries that were already committed elsewhere - and that would be bad. Very bad.
"More up-to-date" is defined precisely: higher term on the last log entry wins. If the terms are equal, the longer log wins. It sounds simple, but the reason it's necessary took me a while to actually internalise.
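To make that definition concrete, here's the same check pulled out as a pure function with a couple of cases worked through (the standalone shape is mine, not from the repo):

```typescript
// The Section 5.4.1 "up-to-date" check as a pure function.
// A candidate's log is at least as up-to-date as mine if its last entry
// has a higher term, or the same term and an equal-or-longer log.
function candidateLogIsUpToDate(
  candLastTerm: number, candLastIndex: number,
  myLastTerm: number, myLastIndex: number
): boolean {
  return (
    candLastTerm > myLastTerm ||
    (candLastTerm === myLastTerm && candLastIndex >= myLastIndex)
  );
}

// Candidate's last entry is from term 3, mine from term 2:
// candidate wins even though my log is longer.
candidateLogIsUpToDate(3, 4, 2, 9); // true

// Same last term, but the candidate's log is shorter: no vote.
candidateLogIsUpToDate(3, 4, 3, 7); // false
```

The first case is the one that surprised me: raw log length loses to term number, because a longer log full of entries from a deposed leader may contain writes that were never committed anywhere.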
Step 4: Becoming Leader and Sending Heartbeats
Once a node reaches majority, it becomes leader. It cancels its election timer, initialises some per-peer tracking state, and immediately starts firing heartbeats:
becomeLeader() {
  if (this.state !== 'candidate') return;

  this.state = 'leader';
  if (this.electionTimer) clearTimeout(this.electionTimer);

  for (const peer of this.peers) {
    this.nextIndex[peer] = this.log.length;
    this.matchIndex[peer] = -1;
  }

  this.sendHeartbeats();
  this.heartbeatTimer = setInterval(() => this.sendHeartbeats(), 50);
}
Two pieces of per-peer state the leader maintains:
- nextIndex[peer] - the next log entry I need to send this peer
- matchIndex[peer] - the highest log index I know this peer has in its log
Heartbeats go out every 50ms. They're just appendEntries calls with no new entries - the leader's way of saying "I'm still alive, please don't start an election." If followers stop hearing heartbeats, their timers fire and things get political.
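sendHeartbeats itself is barely any code, because it reuses the replication path from Step 5. Here's a self-contained sketch with the real replicateToPeer swapped for a stand-in (the stub and the recording array exist purely so the example runs on its own):

```typescript
// Sketch of the heartbeat loop: every tick reuses the replication path,
// sending each peer whatever entries it still needs (usually none).
class HeartbeatDemo {
  state: "follower" | "candidate" | "leader" = "leader";
  peers = ["http://node-b:3001", "http://node-c:3002"];
  sent: string[] = [];

  // Stand-in for the real replicateToPeer from Step 5.
  async replicateToPeer(peer: string) {
    this.sent.push(peer);
  }

  sendHeartbeats() {
    if (this.state !== "leader") return; // only leaders heartbeat

    for (const peer of this.peers) {
      // Fire-and-forget: one unreachable peer must not block the rest.
      this.replicateToPeer(peer).catch(() => {});
    }
  }
}
```

A nice side effect of reusing replicateToPeer: a follower that fell behind gets caught up by the very same heartbeat tick that keeps the healthy ones calm.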
Step 5: Log Replication
This is where the real work happens. When a client submits a write, the leader appends it to its own log and immediately replicates it to all peers:
async submitCommand(command) {
  if (this.state !== "leader") {
    return { success: false, message: "Not the leader" };
  }

  const entry: LogEntry = { term: this.currentTerm, command };
  this.log.push(entry);

  // allSettled, not all: one unreachable peer shouldn't
  // reject the whole replication batch.
  await Promise.allSettled(this.peers.map((peer) => this.replicateToPeer(peer)));

  const entryIndex = this.log.length - 1;
  if (this.commitIndex >= entryIndex) {
    return { success: true, message: "Committed" };
  } else {
    return { success: false, message: "Failed to reach majority" };
  }
}
Note: the entry isn't committed the moment the leader appends it. It first needs to be replicated to a majority of nodes. Only then is it safe to tell the client "yes, this is done." The distinction between appended and committed is one of the more important mental models in Raft.
The actual replication to a peer:
async replicateToPeer(peer: string) {
  if (this.state !== 'leader') return;

  const nextIndex = this.nextIndex[peer];
  const entries = this.log.slice(nextIndex);
  const prevLogIndex = nextIndex - 1;
  const prevLogTerm = prevLogIndex >= 0
    ? this.log[prevLogIndex].term : -1;

  const res = await fetch(`${peer}/append-entries`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      term: this.currentTerm,
      leaderId: this.id,
      prevLogIndex,
      prevLogTerm,
      entries,
      leaderCommit: this.commitIndex,
    }),
  });

  const data = await res.json();

  if (data.term > this.currentTerm) {
    this.stepDown(data.term);
    return;
  }

  if (data.success) {
    this.matchIndex[peer] = prevLogIndex + entries.length;
    this.nextIndex[peer] = this.matchIndex[peer] + 1;
    this.advanceCommitIndex();
  } else {
    this.nextIndex[peer] = Math.max(0, this.nextIndex[peer] - 1);
  }
}
The prevLogIndex and prevLogTerm are a consistency check. The leader is saying: "Before you accept these entries, confirm that your log at index prevLogIndex has term prevLogTerm." If the peer's log doesn't match at that point, it rejects the request. The leader then decrements nextIndex by 1 and retries on the next heartbeat - walking further back until it finds a common point, then replaying everything forward from there.
This is called log reconciliation. The leader never assumes a follower is caught up. It proves it.
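To make the walk-back concrete, here's a toy trace as a pure function - not code from the repo, just the same decrement-and-retry loop collapsed into one place so you can stare at the numbers:

```typescript
type Entry = { term: number };

// Toy simulation of the nextIndex walk-back: step back until the
// follower's entry at prevLogIndex matches the leader's, then the
// leader can replay everything from nextIndex onward.
function findAgreementPoint(leaderLog: Entry[], followerLog: Entry[]): number {
  let nextIndex = leaderLog.length;
  while (nextIndex > 0) {
    const prev = nextIndex - 1;
    if (followerLog[prev] && followerLog[prev].term === leaderLog[prev].term) {
      return nextIndex; // logs agree up to prev; send entries from here
    }
    nextIndex--; // consistency check failed: step back one and retry
  }
  return 0; // no common prefix: resend the whole log
}

// Leader:   terms [1, 1, 2, 3]
// Follower: terms [1, 1, 2, 2, 2]  (stale entries from an old term-2 leader)
findAgreementPoint(
  [{ term: 1 }, { term: 1 }, { term: 2 }, { term: 3 }],
  [{ term: 1 }, { term: 1 }, { term: 2 }, { term: 2 }, { term: 2 }]
); // -> 3: everything from index 3 onward gets overwritten
```

In the real protocol this happens one heartbeat at a time - each rejection decrements nextIndex by one - so a badly diverged follower can take several round trips to reconcile. The paper mentions an optimisation where the follower hints at its conflict point; I didn't implement it.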
On the receiving end, the follower handles appendEntries like this:
handleAppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit) {
  if (term < this.currentTerm) {
    return { success: false, term: this.currentTerm };
  }

  if (term > this.currentTerm) {
    this.stepDown(term);
  } else if (this.state === "candidate") {
    this.stepDown(term);
  }

  this.resetElectionTimer();

  if (prevLogIndex >= 0) {
    const prevEntry = this.log[prevLogIndex];
    if (!prevEntry) {
      return { success: false, term: this.currentTerm };
    }
    if (prevEntry.term !== prevLogTerm) {
      this.log.splice(prevLogIndex);
      return { success: false, term: this.currentTerm };
    }
  }

  if (entries.length > 0) {
    this.log.splice(prevLogIndex + 1, this.log.length, ...entries);
  }

  if (leaderCommit > this.commitIndex) {
    this.commitIndex = Math.min(leaderCommit, this.log.length - 1);
    this.applyCommittedEntries();
  }

  return { success: true, term: this.currentTerm };
}
If there's a log mismatch at prevLogIndex, I truncate my log from that point and reject - the leader will retry with an earlier index. Once the logs match up, I append the new entries and advance my commitIndex based on what the leader says is committed.
Step 6: Committing Entries
An entry is committed once the leader confirms that a majority of nodes have it in their logs. After every successful replication response, the leader checks whether it can advance its commit index:
advanceCommitIndex() {
  for (let n = this.log.length - 1; n > this.commitIndex; n--) {
    if (this.log[n].term !== this.currentTerm) continue;

    const replicatedOn = this.peers.filter(
      (peer) => (this.matchIndex[peer] ?? -1) >= n
    ).length + 1; // +1 to count self

    const majority = Math.floor((this.peers.length + 1) / 2) + 1;
    if (replicatedOn >= majority) {
      this.commitIndex = n;
      this.applyCommittedEntries();
      break;
    }
  }
}
There's a subtle but critical line here: if (this.log[n].term !== this.currentTerm) continue. This comes from Section 5.4.2 of the paper. A leader must not directly commit entries from previous terms - only entries from its own term. Once a current-term entry is committed, everything before it is implicitly committed too.
I'll be honest: I didn't fully understand why this rule exists until I read the paper's Figure 8, which walks through a specific failure scenario where skipping this check leads to committed data being overwritten. If you're implementing this, that figure is worth your time.
Step 7: Applying to the State Machine
Once an entry is committed, it gets applied to the actual key-value store:
applyCommittedEntries() {
  while (this.lastApplied < this.commitIndex) {
    this.lastApplied += 1;
    const entry = this.log[this.lastApplied];

    if (entry.command.op === "set") {
      this.store.set(entry.command.key, entry.command.value!);
      console.log(`[${this.id}] Applied SET ${entry.command.key}=${entry.command.value}`);
    } else if (entry.command.op === "delete") {
      this.store.delete(entry.command.key);
      console.log(`[${this.id}] Applied DELETE ${entry.command.key}`);
    }
  }
}
commitIndex tracks what the cluster has agreed on. lastApplied tracks what's actually been written to the store. Keeping them separate creates a clean boundary between the consensus layer and the application layer - which matters if you ever want to add snapshotting or crash recovery without touching the replication logic.
The store itself is almost comically simple:
export class Store {
  private kvStore: Map<string, string> = new Map();

  get(key: string): string | null { return this.kvStore.get(key) ?? null; }
  set(key: string, value: string) { this.kvStore.set(key, value); }
  delete(key: string) { this.kvStore.delete(key); }
  getAll(): Record<string, string> { return Object.fromEntries(this.kvStore); }
}
That's the entire state machine. A Map. All of Raft's machinery - elections, heartbeats, log replication, commit tracking - exists purely to ensure that every node applies the same sequence of operations to their own copy of this Map. In the same order. Always. That's it. That's the whole game.
Limitations of This Implementation
I called this "dumbed-down" and I meant it. Here's what's missing:
No persistence. If a node crashes and restarts, it loses everything - its log, its term, who it voted for. A real implementation writes currentTerm, votedFor, and the log to disk before sending any response. Without this, a restarted node could vote twice in the same term, which breaks the single-vote guarantee that elections depend on.
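If I do tackle persistence, the shape would be something like this - a sketch using Node's fs module, not code from the repo, and deliberately ignoring the write-then-rename and fsync care a real implementation needs:

```typescript
import { writeFileSync, readFileSync, existsSync } from "node:fs";

// Sketch of the persistence this implementation lacks: currentTerm,
// votedFor, and the log must hit disk BEFORE responding to any RPC,
// or a restarted node can vote twice in the same term.
interface PersistentState {
  currentTerm: number;
  votedFor: string | null;
  log: { term: number; command: unknown }[];
}

function persist(path: string, state: PersistentState) {
  // A real implementation would write to a temp file, fsync, then
  // rename; this only shows where the call sits in the flow.
  writeFileSync(path, JSON.stringify(state));
}

function recover(path: string): PersistentState {
  if (!existsSync(path)) {
    // First boot: the defaults from the constructor in Step 1.
    return { currentTerm: 0, votedFor: null, log: [] };
  }
  return JSON.parse(readFileSync(path, "utf8"));
}
```

The crucial part is the ordering, not the serialisation: persist before replying to a RequestVote or AppendEntries, so the promise a node made survives a crash.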
No log compaction or snapshots. The log grows forever here. In a real system, you'd periodically snapshot the state machine and truncate the log. Otherwise you'll eventually run out of disk, and replaying a massive log to a brand new node would take forever.
Linearizable reads aren't fully guaranteed. My /read/:key endpoint returns directly from the leader's in-memory store. There's a subtle edge case: if a leader gets partitioned just after being elected, it might not know a newer leader was elected and could serve stale reads. A proper implementation routes reads through the log too, or uses a "read lease" mechanism. This is honestly one I don't fully understand yet - if you do, please drop a comment.
Follower log truncation is rough. When a follower detects a conflict in handleAppendEntries, I just splice from the conflict point. It works, but production implementations are more careful - they skip entries that already match rather than blindly overwriting from that index.
No cluster membership changes. The cluster size is hardcoded at three nodes. Adding or removing nodes from a live Raft cluster is a whole separate problem - the paper covers it under "cluster membership changes" using joint consensus. I haven't touched it, and I'm not sure I'm ready to.
Closing Thoughts
Honestly, this was one of the more satisfying things I've built recently - not because the code is impressive (it really isn't), but because every line forced me to actually understand why it was there.
Why does the election timer need to be random? (Split votes.) Why can't a leader commit entries from old terms? (The leader completeness property.) Why does votedFor reset when you step down? (So nodes can vote in the new term after a partition heals.) These aren't just implementation quirks - they're the invariants that hold the whole system together. You only appreciate them when you realise exactly what breaks if you remove them.
If you want to run this yourself, three Docker containers is all it takes. Try killing one node mid-operation and watching the other two hold an election and elect a new leader in under a second. It's weirdly satisfying.
And once more - I'm learning as I go. If something here is wrong, oversimplified, or just off, please tell me in the comments. I'd much rather be corrected than confidently wrong on the internet. The world already has enough of that.
Written after an entire day of reading papers, staring at election timer logs, and wondering why Node B keeps winning every single election.