This semester, I worked on building a Cassandra-like distributed database in Python. In this article, I delve into the distributed nature of Cassandra, my implementation of it, and its limitations.
Cassandra is a distributed key-value store that addresses fault tolerance, consistency, replication, and membership management in a decentralised architecture.
Consistent Hashing and Distributed Partitioning
The system I've built uses consistent hashing to distribute keys: each node ID is hashed with SHA-1 to a position on a ring. The ring consists of multiple nodes connected to each other (refer to the figure above), where each node runs an instance of the database, and a key is stored on the first node clockwise from its own SHA-1 position. Consistent hashing minimises rebalancing: when a node joins or leaves, only the keys adjacent to that node move, as the sketch below shows.
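Here is a minimal sketch of that placement logic. The names (`Ring`, `owner`, `sha1_position`) are mine for illustration, not the actual implementation's:

```python
import bisect
import hashlib

def sha1_position(s: str) -> int:
    # Map any string (a node ID or a key) to an integer position on the ring.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    """A minimal consistent-hash ring: node IDs are hashed onto the ring."""

    def __init__(self, node_ids):
        # The ring is a sorted list of (position, node_id) pairs.
        self.nodes = sorted((sha1_position(n), n) for n in node_ids)
        self.positions = [p for p, _ in self.nodes]

    def owner(self, key: str) -> str:
        # A key belongs to the first node clockwise from its hash position,
        # wrapping around to the start of the sorted list past the ring's end.
        i = bisect.bisect_right(self.positions, sha1_position(key)) % len(self.nodes)
        return self.nodes[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))  # one of the three node IDs
```

When a node is added or removed, only keys whose positions fall between that node and its predecessor change owners; everything else stays put.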
Replication and Fault Tolerance
Depending on the replication factor, the system replicates each write to multiple nodes for availability and durability. A write first goes to the node that owns the key, and is then replicated to the next (replication factor - 1) nodes clockwise on the ring. The replication factor is configurable; I used three for high availability.
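Continuing the sketch above, replica placement can be expressed as the owner plus its clockwise successors; `rf` as the name for the replication factor is my choice, not the project's:

```python
def replicas(ring: Ring, key: str, rf: int = 3) -> list[str]:
    # Primary (owner) node plus the next rf-1 nodes clockwise on the ring.
    n = len(ring.nodes)
    start = bisect.bisect_right(ring.positions, sha1_position(key)) % n
    return [ring.nodes[(start + i) % n][1] for i in range(min(rf, n))]

print(replicas(ring, "user:42"))  # e.g. ['node-b', 'node-c', 'node-a']
```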
When the primary node is down, writes are forwarded to the first live replica, which takes over as coordinator; dead replicas are simply skipped during replication. On recovery, a node requests the state it missed from its peers. In terms of the CAP theorem trade-off, the system chooses availability and partition tolerance, settling for eventual consistency.
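A rough sketch of that failover path, assuming a hypothetical `send(node, message)` transport that the real code would provide:

```python
def coordinated_write(ring: Ring, alive: set, key: str, value, rf: int = 3):
    # Keep only the live nodes from the replica list; the first one
    # coordinates the write, and dead replicas are simply skipped.
    targets = [node for node in replicas(ring, key, rf) if node in alive]
    if not targets:
        raise RuntimeError(f"no live replica for key {key!r}")
    coordinator, others = targets[0], targets[1:]
    # send() is a placeholder for the actual node-to-node transport.
    send(coordinator, {"op": "PUT", "key": key, "value": value,
                       "forward_to": others})
    return coordinator
```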
I've also integrated a gossip protocol for membership management. Gossip has no central coordinator; failures are detected through the information nodes exchange with each other. Periodically (every 1000 milliseconds by default), each node selects a random peer and exchanges membership state with it. If no update has been received from a node for long enough, its peers mark it as "down". I chose this protocol because it is decentralised, scalable, and fault-tolerant: it handles node failures and message loss, and eventually all nodes converge on the same view of the membership.
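One gossip tick might look roughly like this; `exchange_state` is a hypothetical RPC, and the three-interval silence threshold is my assumption rather than the project's actual value:

```python
import random
import time

GOSSIP_INTERVAL = 1.0  # seconds (the 1000 ms default)
FAIL_AFTER = 3 * GOSSIP_INTERVAL  # assumed silence threshold before "down"

def gossip_round(self_id, peers, heartbeats, last_seen):
    # Bump our own heartbeat counter so peers can see we are alive.
    heartbeats[self_id] += 1
    # Exchange membership state with one randomly chosen peer.
    peer = random.choice([p for p in peers if p != self_id])
    remote = exchange_state(peer, heartbeats)  # placeholder RPC
    now = time.monotonic()
    # Merge in anything newer than what we already know.
    for node, beat in remote.items():
        if beat > heartbeats.get(node, -1):
            heartbeats[node] = beat
            last_seen[node] = now
    # Any node silent for too long gets marked down.
    return {n for n, seen in last_seen.items() if now - seen > FAIL_AFTER}
```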
Crash Recovery
I have also taken care of crash recovery and durability: the system persists data and recovers from crashes using a commit log and snapshots. The commit log is an append-only log of mutations, such as PUT and DELETE, each recorded with a timestamp. Snapshots are periodic dumps of the full in-memory state (DataStax Developers, 2021).
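A minimal version of that write-ahead path, assuming a JSON-lines log format (the actual on-disk format may differ):

```python
import json
import time

def log_mutation(log_path, op, key, value=None):
    # Append a timestamped PUT or DELETE to the commit log before applying it.
    entry = {"op": op, "key": key, "value": value, "ts": time.time()}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def replay(log_path):
    # Rebuild the in-memory state after a crash by replaying the log in order.
    state = {}
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["op"] == "PUT":
                state[entry["key"]] = (entry["value"], entry["ts"])
            elif entry["op"] == "DELETE":
                state.pop(entry["key"], None)
    return state
```

In practice, a snapshot lets you truncate the log and replay only the mutations recorded after it was taken.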
Conflict Resolution
Since all mutations include timestamps, conflicts during recovery and replication are resolved using the last-write-wins principle. It is simple and fast, but it can silently lose concurrent writes. Deletes are handled with a garbage-collection grace period of 10 days: a deleted entry is kept as a deletion marker, and only after the grace period has passed is it permanently removed from the system. When a key-value pair is deleted, the delete also propagates to every replica that holds the same key. This is a deliberately simplified model and doesn't reflect everything Cassandra actually does; the real system also uses hinted hand-offs and read repair (Cassandra Documentation, 2025).
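A compact sketch of both rules, assuming records are stored as `(value, timestamp, deleted)` tuples, which is my representation rather than necessarily the project's:

```python
import time

GRACE_PERIOD = 10 * 24 * 3600  # the 10-day grace period, in seconds

def merge(local, incoming):
    # Last-write-wins: whichever record carries the newer timestamp survives,
    # whether it is a live value or a deletion marker.
    return incoming if incoming[1] > local[1] else local

def purge_tombstones(store, now=None):
    # Permanently drop entries deleted more than GRACE_PERIOD ago.
    now = now if now is not None else time.time()
    expired = [key for key, (_, ts, deleted) in store.items()
               if deleted and now - ts > GRACE_PERIOD]
    for key in expired:
        del store[key]
```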
I also implemented tests for partition tolerance, replication, and the gossip protocol, which you can check out on my GitHub.
Overall, the system prioritises availability and partition tolerance, accepting eventual consistency. These trade-offs make the database suitable for scenarios where high availability and low latency matter more than immediate consistency. Building it taught me that trade-offs are unavoidable in distributed systems and that fault tolerance is essential; there are numerous failure scenarios to consider. Alongside the test suite, visualisations were helpful in identifying where the system was failing or not performing as expected.
References
Cassandra Documentation (2025). Apache Cassandra. https://cassandra.apache.org/_/index.html
DataStax Developers (2021). [Video]. https://www.youtube.com/watch?v=YjYWsN1vek8&t=181s