Originally published on:
https://abdellani.dev/posts/2026-04-01-the-journey-of-a-request-in-a-raft-based-kv-store/
Introduction:
This article comes after implementing labs 3 and 4 from the open MIT 6.584 course. We have a replicated KV server on top of the Raft protocol, and we want to understand how the server works internally. To do so, I propose to follow all the steps a request passes through on the server before the client receives a response.
The objective of building the KV server on top of the Raft protocol is to have a fault-tolerant system.
The system is composed of multiple KV server nodes; if some servers go down, the system should continue serving the clients.
Raft is strongly consistent and can tolerate the failure of f nodes in a system composed of 2 * f + 1 nodes. For example, if we want the system to tolerate the failure of 2 nodes, we’ll need 5 nodes.
The labs test the implementation under different conditions:
- an unreliable network that drops packets or delivers them out of order
- network partitions
- server crashes and restarts
The system should work correctly under those conditions.
System design
Each server is composed of 3 layers:
1. Application layer
It’s responsible for handling the RPC calls coming from the clients, and for the logic of the key/value store: read and write operations on keys. It communicates directly with the RSM layer and is not aware of Raft.
When a client sends a request, this layer makes a submit call to the RSM layer and waits for it to return before responding to the client.
2. RSM layer
The RSM layer is the glue between the application layer and the Raft layer.
When a request is submitted from the application layer, it is forwarded to the Raft layer.
The RSM also monitors the apply channel. When Raft decides to commit a command, the RSM layer forwards the command to the application layer to execute it.
3. Raft Layer
The Raft consensus protocol layer is responsible for leader elections and log replication.
When a client sends a request to a server, that server must be the leader; otherwise it rejects the request.
The initial, simple version of the flow consists of the following steps:
- The application layer will submit the request to the RSM layer and wait for the function call to return before sending an answer to the client.
- The RSM layer will add the client to its internal store of waiting clients and submit the request to the Raft layer. If the current server is not the leader, the command will be rejected at this level.
- Raft will append the command to its log and replicate it to the other servers. The client must wait until the command is committed and applied before receiving a response.
- If the command is replicated to a majority, and the current term condition is satisfied, the Raft protocol will decide to commit the command.
- The RSM, which monitors the apply channel (the channel the Raft layer uses to deliver committed commands), will forward the command to the application layer by calling DoOp.
- The application layer will update the Key/Value store state. For example, put(x,5) will set the key x to 5 at this stage.
- The RSM receives DoOp’s result from the application layer, checks its state to find the client that has been waiting since step 2, and communicates the result to it.
At first glance, this flow looks correct. In practice, it breaks in several ways.
1. The server loses the leadership before responding to a client request.
The client submits a request, the leader node receives it, submits it to the RSM, and then loses the leadership immediately. This can happen due to network issues between nodes.
In this situation, the client will be stuck waiting for a response, while the command might be removed from the Raft log since it was never committed.
To solve this problem, the RSM needs to check periodically with the Raft layer whether the server is still the leader. If the server is no longer the leader, it sends the error code “ErrWrongLeader” to the application layer, which in turn forwards it to the client.
This tells the client that the targeted server is not the leader, and that it has to retry its request on another server.
2. The leader loses communication with the other nodes.
A client X sends its request to the leader. The leader receives the request and submits it to the Raft layer. The problem is that the leader is no longer able to reach the other nodes to replicate the command.
The other nodes will notice the absence of the leader, run a new election, and elect a new leader. The system will keep serving the other clients, but the initial leader will not notice the change in leadership and will assume that it is still the current leader.
Client X will be stuck waiting for the old leader to replicate and commit the command.
To avoid this situation, different solutions can be implemented.
A leader can monitor its communications with the other nodes, and if it notices that none of the other nodes has been reachable for a period of time, it will automatically increment its term and step down to follower.
A simpler solution would be to implement a timer at the RSM level. After submitting a command to Raft, the RSM uses that timer to decide whether enough time has passed to release the client and ask it to retry with another server.
3. The client sends a request, but the network fails.
The failure can happen before the server receives the request, or after the server processes the request and tries to send the response.
The client might retry again and again; without deduplication, a retried write can be applied more than once and overwrite the writes of other clients.
One solution is to attach two attributes to every request: a client ID and a request ID.
The server will use those attributes to:
- track each client’s request-execution progress, and
- implement a response cache.
The server will maintain two data structures:
- the latest request ID per client
- the latest response per client
In my implementation, I decided to put the logic for checking and updating the request-ID tracking structure and the cache in the DoOp function. This keeps the RSM layer flexible enough to be integrated with a different application layer. We also cannot rely on the RPC handler for that logic: a follower node only calls DoOp and never goes through the RPC handler when processing a committed command, and we want every node to keep those two data structures up to date.
An optional optimization is to check the cache at the RPC handler level before submitting to the lower layers.
When the server receives a request from a client:
- If the request ID is older than the latest request ID for that client, the request is ignored.
- If the request ID equals the latest request ID for that client, the result is returned from the cache.
- If the request ID is greater than the latest request ID for that client, the request is submitted to the RSM, then to Raft; the two data structures are updated, and the result is sent to the client.
Also, the cache and the latest-request-ID structure don’t need to be persisted on their own, but they have to be part of the snapshots.
When a server restarts, after exchanging a few messages with the leader, it’ll learn about the commit index, replay the commands, and rebuild the cache and the request IDs.
If logs are compacted without including the request IDs and the cache in the snapshot, there will be no way to rebuild them.
Conclusion
What looks like a simple request flow might hide many complexities in practice. Most of the complexity is not in Raft itself, but in making the system behave correctly under failures.
