Chord is a protocol for building decentralized data storage systems. It is used to efficiently locate the node that stores a specific piece of data in a network of nodes. It only describes one operation: given a key, it maps the key to the responsible node. By linking a key to each data item and storing the key-value pair at the corresponding node, data location can be easily implemented using Chord. Additionally, Chord can adapt and respond to queries even when nodes are constantly joining or leaving the system. Theoretical analysis and simulations demonstrate that Chord is scalable, with communication costs and the size of each node growing logarithmically as the number of Chord nodes increases.
In this blog post, I will be documenting the steps I took to implement the Chord protocol using Rust programming language and explaining my thought process along the way. All the code can be found in the repo https://github.com/kamilczerw/chord. I created a branch for this post, so you can follow along with the code, which you can find here.
I encorage you to read the original paper which describes Chord protocol in more comprehensive way. You can find it here.
Chord protocol
The protocol is designed in such a way that no organization can control the network without owning the majority of the nodes. When a node joins the network it gets an id generated based on its IP address. The id is generated using a consistent hashing algorithm, to evenely distribute ids in the network.
The core principle of the Chord protocol is that nodes do not need to maintain a complete list of all the other nodes in the network. Instead, each node only needs to maintain information about a small number of neighboring nodes.
Imagine the network as a ring with nodes distributed along it. For demonstration, let’s consider a ring with 10 nodes that can only handle 64 unique identifiers. This ring structure allows efficient routing of messages and ensures that each node only needs to maintain information about a small number of nodes to efficiently route messages and find the correct destination node.
Each of the nodes is responsible for maintaining a range of ids, starting with ids following the id of the previous node and including its own id. For example, in our demo network, node N7
is responsible for ids 2-7
. Node N1
is responsible for ids 59-63
and 0-1
. This is because the node N58
is preceding node N1
in the ring.
A node responsible for an id k
is called a successor
of k
. In the rest of this post, I will use the term successor
to refer to the node responsible for a given id.
Ids in the ring are numeric to be able to easily calculate the successor of a given id. In our example network, we will use a 6-bit identifier space, which means that the network can handle up to 64 unique identifiers. I will use 64-bit unsigned integers to represent ids in the code.
In real-world applications, we would prefer to use string representation of ids to make it easier to read and understand. To refer to string representation of ids we will use the term key. The key is converted to a numeric representation using a consistent hashing algorithm.
Lookup
To find the location of a key, each node uses a hash function to map the id to a position on the ring. It then uses so called finger table, to determine which node is responsible for the id. The finger table is a data structure that helps the node quickly locate other nodes in the network. The finger table is a table of size m
(where m
is the number of bits in the identifier space, in our example, it’s 6
). Each entry in the finger table contains
- Identifier, calculated with the formula
- Node communication information, which includes IP address and port
Here is the finger tables for node N40
from our example network:
k | finger id | node |
---|---|---|
1 | 41 | N43 |
2 | 42 | N43 |
3 | 44 | N45 |
4 | 48 | N53 |
5 | 56 | N58 |
6 | 8 | N18 |
As you can see in the example the finger id doesn’t map directly to the node id, it is used as an approximation of the location of an id we are looking for. When a node receives a request to find a specific id, it checks the finger table to see which finger id is closest to the requested id but still lower than the finger id. The node then forwards the request to the node returned from the finger table. The node that receives the request then repeats the process until it finds the node responsible for the id.
Here is a visualization of the process of finding the node responsible for id 20
in our example ring, using a lookup from node N40
:
Joining
When a node joins the Chord ring, it first needs to find its place in the network. It calculates its id by hashing its IP address and then uses the lookup operation to find the the successor
of the id. Immediate successor
is the node that is responsible for the next id after the id of the joining node.
Once the node has found its successor
, it sends a notify
request to the successor
node. The notify
message contains the id and communication information of the joining node. The successor
node then checks if the joining node is a better predecessor
than its current predecessor
. If the joining node is a better predecessor
, the successor
node updates its predecessor
to the joining node.
predecessor
is a node that is previous node in the ring. It is used to maintain the correct relationships between nodes in the network.
Each node runs a stabilize
job periodically. The stabilize
job is responsible for ensuring that each node knows the correct successor and predecessor nodes in the Chord ring. It does this by checking if the current successor node has a predecessor that is closer to the current node than the current successor node. If this is the case, the current node updates its successor to the new node. Additionally, the stabilize job also checks if the current predecessor node is still alive by sending a ping message to it. If there is no response, the current node updates its predecessor to the new node.
In summary, the stabilize
job is responsible for maintaining the correct relationships between nodes in the Chord ring and ensuring that the network is in sync.
Consistent hashing
A consistent hash function assigns each node and key an m
-bit identifier using a hash function. The original paper uses SHA-1
, but it is important to note that this hash function is not recommended for use in new systems due to known weaknesses. Instead, there are other more secure hash functions that are recommended, such as SHA-256
or BLAKE2
. The node's identifier is chosen by hashing its IP address, while a key identifier is produced by hashing the key.
This method ensures that the probability of two nodes or keys hashing to the same identifier is negligible, making it an efficient way to distribute data across a network.
The keys are assigned to nodes by ordering the identifiers on an identifier circle modulo
. The key, k
, is assigned to the first node whose identifier is equal to or follows (the identifier of) k
in the identifier space. This node is called the successor node of key k
, denoted by successor(k)
.
In practice, consistent hashing is a simple yet effective technique for data distribution in distributed systems. It's easy to implement and it provides a good balance between the number of keys that need to be remapped when a node is added or removed, and the number of keys that can be mapped to the same node.
Implementation
Enough with the theory, let’s get our hands dirty with some coding. The first thing we need is to set up the project. I use cargo workspaces to split my projects into smaller modules. It is a great way to keep the project organized.
Project setup
We will use the following structure for our project:
root/
Cargo.toml
server/
libs/
chord/
-
server
will contain all the code related to run the server -
libs/chord
will contain all the code related to chord node implementation
First, create a Cargo.toml
file in the root directory of the project.
[workspace]
members = [
"server",
# Libs
"libs/*",
]
This will tell cargo where to find all the modules.
Now we can create modules for our Chord project by executing the following commands:
cargo new server
cargo new libs/chord --lib
Chord module
As mentioned earlier, this module will contain all the code specific for the Chord protocol.
We will start by creating Node
struct. It will contain the node identifier and IP address along with the port.
use std::net::SocketAddr;
use seahash::hash;
#[derive(Clone)]
pub struct Node {
id: u64,
addr: SocketAddr
}
impl Node {
pub fn new(addr: SocketAddr) -> Self {
Self { id: hash(&addr.to_string().as_bytes()), addr }
}
}
Then new
method only takes an address and creates an instance of Node
with a unique identifier. The identifier is created by hashing the address using seahash
hash function. We don't need to expose the id
field, so the user doesn't have to worry about generating it manually.
Next, let's create NodeStore
struct. It will contain all the necessary information required to effectively communicate with other nodes in the ring.
pub struct NodeStore {
predecessor: Option<Node>,
finger_table: Vec<Finger>
}
impl NodeStore {
pub fn new(successor: Node) -> Self {
Self {
predecessor: None,
finger_table: Finger::init_finger_table(successor),
}
}
pub(crate) fn successor(&self) -> &Node {
&self.finger_table[0].node
}
}
The predecessor
field contains information about the node preceding the current node in the Chord ring. Whe need to store this information to keep the ring in sync. When a node runs a stabilize
job, it checks if it still is the predecessor
of its successor
. If it is not, it updates the finger table to include the new node. The finger_table
field contains information about some of the nodes in the ring for easier lookup. We also have a successor
method that returns the first node in the finger table, which is the immediate successor
of the current node.
When the finger_table
is created, all the fingers are pointing to the same node, which is the immediate successor
of the current node. The init_finger_table
method is used to initialize the finger table.
Let's create the Finger
struct and implement some methods. The struct will have start
field, which is a starting id from which the node is potentially responsible for. It will also have the node
field.
pub(crate) struct Finger {
start: u64,
node: Node
}
impl Finger {
pub(crate) fn init_finger_table(node: Node) -> Vec<Self> {
let mut fingers = Vec::with_capacity(64);
// We start at 1 because the calculation of the finger id is based on the index
// of the finger. The calculation assumes that the index starts at 1.
for i in 1..65 {
let finger_id = Self::finger_id(node.id, i);
fingers.push(Finger { start: finger_id, node: node.clone() });
}
fingers
}
pub(crate) fn finger_id(node_id: u64, index: u8) -> u64 {
if index == 0 {
return node_id;
}
let offset: u128 = 2_u128.pow((index - 1) as u32);
let power: u128 = 2_u128.pow(64 as u32);
let id = (node_id as u128 + offset) % power;
id as u64
}
}
To be able to communicate with other nodes we need a client. We haven't chosen a specific protocol yet, so we will create a trait that will be implemented by the client for the specific protocol.
Here is the code for the trait:
use crate::Node;
pub trait Client {
fn init(node: Node) -> Self;
fn find_successor(&self, id: u64) -> Result<Node, ClientError>;
fn successor(&self) -> Result<Node, ClientError>;
fn predecessor(&self) -> Result<Option<Node>, ClientError>;
fn notify(&self, predecessor: Node) -> Result<(), ClientError>;
fn ping(&self) -> Result<(), ClientError>;
}
pub enum ClientError {
ConnectionFailed(Node),
Unexpected(String),
}
If we try to build the project now we will get an error. That’s because we haven't added seahash
crate, which is used to generate id for our node.
Go ahead and add the dependency to libs/chord/Cargo.toml
## libs/chord/Cargo.toml
[dependencies]
seahash = "4.1.0"
When you run cargo build
, there should be no errors.
Next thing we need is NodeService
which will contain all to logic associated with Chord protocol.
I will walk you through each function and explain what it does. For now, let’s just create the struct and add the error enum.
use std::net::SocketAddr;
use crate::NodeStore;
pub struct NodeService {
id: u64,
addr: SocketAddr,
store: NodeStore,
}
pub mod error {
pub enum ServiceError {
Unexpected, // We don't expect to get any errors for now
}
}
impl NodeService {
// We will add our code here
}
Find successor
First function we will implement based on the spec is find successor. It is the entry point for key lookup in the ring. For a given key it returns a node which is responsible for that key. Such node is called successor of the key. Find successor was covered in the lookup chapter.
Here is the pseudocode from the spec:
// ask node n to find the successor of id n.find_successor(id) if (id ∈ (n, successor]) return successor; else n′ = closest_preceding_node(id); return n′.find_successor(id); // search the local table for the highest predecessor of id n.closest_preceding_node(id) for i = m downto 1 if (finger[i] ∈ (n,id)) return finger[i]; return n;
Let’s break this function down.
- The request comes to the node
n
to find the keyid
- If the
id
is betweenn
(exclusive) andsuccessor
(inclusive), returnsuccessor
- Otherwise get a node from the finger table, with highest
id
which is lower than theid
we are looking for. This node is calledclosest preceding node
, and we callfind_successor
on this node with theid
we are looking for. That node will repeat the same process until it finds the successor of theid
.
We need to create a function which checks if an id
is between 2 nodes on the ring, we can add it to Node
struct.
impl Node {
// ...
pub fn is_between_on_ring(id: u64, node1: u64, node2: u64) -> bool {
if node1 < node2 {
node1 < id && id <= node2
} else {
node1 < id || id <= node2
}
}
}
We also need to implement closest_preceding_node
function. It will return the node from the finger table which is the most immediate predecessor
of the id
.
impl NodeService {
// ...
fn closest_preceding_node(&self, id: u64) -> &Node {
for finger in self.store.finger_table.iter().rev() {
if finger.start > self.id && finger.node.id < id && finger.start < id {
return &finger.node;
} else if id < self.id {
// if the id is smaller than the current node, we return the last finger
return &finger.node;
}
}
self.store.successor()
}
}
It iterates over the finger table in reverse order, and returns the first node which is the preceeding the id
. If the id
is smaller than the current node, it returns the last finger. That's because the node doesn't have a finger which is smaller than the current node.
Now we can implement find_successor
method.
impl NodeService {
// ...
pub fn find_successor(&self, id: u64) -> Result<Node, error::ServiceError> {
if Node::is_between_on_ring(id, self.id, self.store.successor.id) {
Ok(self.store.successor.clone())
} else {
let node = self.closest_preceding_node(id);
// 🤔 Wait a secode, how do we call the node to find the successor?
}
}
}
We check if the id
is between the current node and the successor. If it is, we return the successor. Otherwise we call closest_preceding_node
to get the node which is the most immediate predecessor of the id
in our finger table. We need to call find_successor
on that node, but we don't have a client to do that. We should fix it.
We can add a client
method to Node
struct.
impl Node {
// ...
pub fn client<C: Client>(&self) -> C {
C::init(self.addr)
}
}
We don't know what type of client we will use, so we use a generic type C
which implements Client
trait and let the implementer decide what type of client it will be.
Now we can get the client from the node and call find_successor
on it.
impl NodeService {
// ...
pub fn find_successor(&self, id: u64) -> Result<Node, error::ServiceError> {
if Node::is_between_on_ring(id, self.id, self.store.successor.id) {
Ok(self.store.successor.clone())
} else {
let client = self.closest_preceding_node(id).client();
let successor = client.find_successor(id)?;
Ok(successor)
}
}
}
Mutch better. Let's try to build the project now.
cargo build
Oh noes, we get an error 😠
error[E0282]: type annotations needed
--> libs/chord/src/service/mod.rs:121:17
|
121 | let client = predecessor.client();
| ^^^^^^
122 | if let Err(ClientError::ConnectionFailed(_)) = client.ping() {
| ------ type must be known at this point
|
help: consider giving `client` an explicit type
|
121 | let client: _ = predecessor.client();
| +++
For more information about this error, try `rustc --explain E0282`.
error: could not compile `chord-rs` due to previous error
The compiler doesn't know what type client
variable should be. We can fix it by doing something like let client: Box<dyn Client> = //...
, but we don’t want to have a dynamic client, because we will use the same implementation of Client
in the entire application. There is a better way. We can add a generic type to NodeService
.
Let’s change NodeService
to be generic over C: Client
.
pub struct NodeService<C: Client> {
// ...
}
impl<C: Client> NodeService<C> {
// ...
}
Now we can set the type to our client
variable
let client: C = self.closest_preceding_node(id).client();
Let’s try to build again.
error[E0392]: parameter `C` is never used
--> libs/chord/src/service/mod.rs:7:24
|
7 | pub struct NodeService<C: Client> {
| ^ unused parameter
|
= help: consider removing `C`, referring to it in a field, or using a marker such as `PhantomData`
😡 Why is it failing? We clearly use C
in our code.
If you look closer at the error, you will see that the C
type is not used in the struct
. To fix it, we need to add std::marker::PhantomData
. It tells the compiler that your type acts as it stores a value of type C
, even though it doesn’t really. The struct
should look like this:
pub struct NodeService<C: Client> {
// ...
phantom: PhantomData<C>,
}
Everything should build again 🎉
Join
Next function we will implement is join
. Let’s take a look at the paper to see the definition.
Figure 6 shows the pseudocode for joins and stabilization.
When node n first starts, it calls n.join(n'), where n' is any known Chord node, or n.create() to create a new Chord network. The join() function asks n' to find the immediate successor of n. By itself, join() does not make the rest of the network aware of n.// join a Chord ring containing node n'. n.join(n') predecessor = nil; successor = n'.find_successor(n); // create a new Chord ring. n.create() predecessor = nil; successor = n;
We don’t need to implement create
because we already set the successor
to itself when creating the NodeService
.
The join
method calls the node n'
to find a successor of id n
. To explain this, we can use our example from above.
There is a node with id N10
and it wants to join the ring. The node N10
knows about N1
, so it calls N1.find_successor(N10)
. Node N1
responds with N18
, so N18
should be the node which is the successor of node N10
.
Here is the rust implementation of join
function
impl NodeService {
// ...
pub fn join(&mut self, node: Node) -> Result<(), error::ServiceError> {
let client: C = node.client();
let successor = client.find_successor(self.id)?;
self.store.successor = successor;
Ok(())
}
}
That one was quite simple.
Notify
So far so good, this function should also be pretty simple to implement.
// n' thinks it might be our predecessor.
n.notify(n')
if (predecessor is nil or n' ∈ (predecessor, n))
predecessor = n';
Our node N18
receives a notify request from N10
, that it might be N18
's new predecessor. Node N18
needs to check if N10
is between its own predecessor (N7
) and itself. If no other node joined in the meantime the node N10
will be set as a predecessor of node N18
.
impl NodeService {
// ...
pub fn notify(&mut self, node: Node) {
let predecessor = self.store.predecessor();
if predecessor.is_none() || Node::is_between_on_ring(node.id.clone(), predecessor.unwrap().id, self.id) {
self.store.set_predecessor(node);
}
}
}
Another one is done.
Stabilize
Every node runs stabilize() periodically to learn about newly joined nodes. Each time node n runs stabilize(), it asks its successor for the successor's predecessor p, and decides whether p should be n's successor instead. This would be the case if node p recently joined the system. In addition, stabilize() notifies node n's successor of n's existence, giving the successor the chance to change its predecessor to n. The successor does this only if it knows of no closer predecessor than n.
// called periodically. verifies n’s immediate // successor, and tells the successor about n. n.stabilize() x = successor.predecessor; if (x ∈ (n,successor)) successor = x; successor.notify(n);
The stabilize
method should run periodically, to update the node’s successor when a new node joins the ring. It does it by asking for a predecessor of it’s successor. When the returned node (we will call it x
) is between node n
and successor of n
then the new successor of n
is set to x
.
We can explain it on our example ring. Here are the steps taken when the node N10
joins the ring:
- Node
N10
setsN18
as it’s successor (it's done byjoin
method) - Node
N10
runsstabilize
- Node
N10
calls nodeN18
for it’s predecessor -
N7
is returned, so the successor is not changed.
- Node
- Node
N10
callsnotify
on nodeN18
. - Node
N18
checks ifN10
is betweenN8
andN18
- Node
N18
setsN10
as it’s new predecessor - Node
N7
runsstabilize
- Node
N7
calls nodeN18
for it’s predecessor -
N10
is returned -
N10
is betweenN7
andN18
- Node
N7
setsN10
as it’s successor
- Node
Here is the Rust implementation of stabilize
method
impl NodeService {
// ...
pub fn stabilize(&mut self) -> Result<(), error::ServiceError> {
let client: C = self.store.successor().client();
let result = client.predecessor();
if let Ok(Some(x)) = result {
if Node::is_between_on_ring(x.id.clone(), self.id, self.store.successor().id) {
self.store.set_successor(x);
}
}
let client: C = self.store.successor().client();
client.notify(Node { id: self.id, addr: self.addr })?;
Ok(())
}
}
Fix fingers
Each node periodically calls fix_fingers to make sure its finger table entries are correct; this is how new nodes initialize their finger tables, and it is how existing nodes incorporate new nodes into their finger tables
// called periodically. refreshes finger table entries. // next stores the index of the next finger to fix. n.fix_fingers() next = next + 1; if (next > m) next = 1; finger[next] = find_successor(n + 2^next−1);
The fix_fingers
method is called periodically to update the finger table. It ensures that all the finger table entries are correct. It is also used to initialize the finger table of a newly joined node.
The next
variable is used to keep track of the next finger to fix. It is incremented after each call to fix_fingers
. When the next
variable is greater than the number of fingers in the finger table it is reset to 1.
We don't have to keep track of the next
variable in our implementation, because we can just iterate over the finger table. Let's do that.
impl NodeService {
// ...
pub fn fix_fingers(&mut self) {
for finger in self.store.finger_table.iter_mut() {
if let Ok(successor) = self.find_successor(finger.start) {
finger.node = successor;
}
}
}
}
This looks pretty simple, but unfortunately it won't work. It won't even compile. When we try to compile it we get the following error:
error[E0502]: cannot borrow `*self` as immutable because it is also borrowed as mutable
--> libs/chord/src/service/mod.rs:148:37
|
147 | for finger in self.store.finger_table.iter_mut() {
| ----------------------------------
| |
| mutable borrow occurs here
| mutable borrow later used here
148 | if let Ok(successor) = self.find_successor(finger.start) {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ immutable borrow occurs here
The problem is that we are trying to borrow self
as mutable and immutable at the same time. We can't do that. Maybe the next
variable is not that bad after all. We will not increment it as described in the paper, we will use for loop.
impl NodeService {
// ...
pub fn fix_fingers(&mut self) {
for i in 0..self.store.finger_table.len() {
let id = self.id + 2u64.pow(i as u32);
if let Ok(successor) = self.find_successor(id) {
self.store.finger_table[i].node = successor;
}
}
}
}
That works, because we don't hold on to self
when iterating over the fingers, so we can call self.find_successor
without any problems.
Check predecessor
Another task which has to be run periodically is check_predecessor
.
// called periodically. checks whether predecessor has failed.
n.check_predecessor()
if (predecessor has failed)
predecessor = nil;
Each node needs to ping its predecessor to see if it is still up. If not the predecessor is removed from the node.
impl NodeService {
// ...
pub fn check_predecessor(&mut self) {
if let Some(predecessor) = self.store.predecessor() {
let client: C = predecessor.client();
if let Err(ClientError::ConnectionFailed(_)) = client.ping() {
self.store.unset_predecessor();
};
}
}
}
Conclusion
This blog post has become a bit bigger than what I anticipated. I tried to cover all the code with tests, but I decided not to include them in this post. You can take a look at the repo and see the tests there. https://github.com/kamilczerw/chord/tree/chord.
I plan to write a follow-up post where I will go through the RPC implementation. I will also write a post about how to run the code. I hope you enjoyed reading it. If you have any questions or comments, feel free to open an issue in the repo or reach out to me on matrix.
Top comments (0)