Mouloud hasrane

Posted on Jan 29, 2025

Understanding Replication and Sharding in System Design

#systemdesign #softwareengineering #backend #database

When discussing system design, terms like sharding, partitioning, replication, and master-slave relationships frequently come up. If these concepts seem confusing, don't worry! This article will clarify the differences between replication and sharding to help you build a solid foundation in distributed database systems.
ps: images are from building data intensive systems book be sure to check it out

Replication

Replication involves creating multiple instances of a database that contain the same data as the main instance. This allows requests to be distributed among replicas, improving system performance and availability. But how do we decide which requests go to which instance? Replication can be categorized into different types, but in practice, two are most commonly used:

1. Master-Slave Replication (Leader-Follower Replication)

In this model, a leader (master) database handles both read and write operations, while followers (slaves) handle only read operations. Any write operation made to the leader is replicated to the followers, either synchronously or asynchronously (we’ll discuss these approaches in detail in a future article).

Why is this useful?

Most systems are read-heavy rather than write-heavy, meaning they perform far more read operations than writes. By distributing read queries to multiple followers, we can significantly reduce the load on the leader instance, improving performance and scalability.

2. Master-Master Replication (Leader-Leader Replication)

Now, imagine our system has grown to the point where we need multiple data centers worldwide to minimize latency. If we rely on leader-follower replication, choosing a single leader across all data centers would lead to slow write operations for users far from that leader.

With leader-leader replication, each data center has its own leader. When a write operation is performed at one leader, it is propagated to all other leaders, which then replicate it to their followers, in other words each leader acts as a follower to the leader that just got the write operations and then as a leader to send the new data to his followers
Challenges:

While this approach improves performance and fault tolerance, it introduces challenges such as write conflicts, where two leaders handle conflicting writes at the same time. Handling such conflicts requires careful resolution strategies, such as conflict-free replicated data types (CRDTs) or last-write-wins (LWW) mechanisms.

Sharding

Unlike replication, which creates copies of the same database, sharding (also called partitioning) splits the database into smaller subsets, called shards, each containing a portion of the data. This allows for better scalability by distributing requests across multiple shards.
How do we determine which shard a request should go to?

Two common techniques for sharding are:

1. Key Range Sharding

In this approach, data is divided into shards based on a predefined range of values. Think of it like an encyclopedia where different volumes contain different alphabetic sections of words.

Example: User IDs from 1–1000 go to Shard A, 1001–2000 go to Shard B, and so on.

Advantage: Simple to implement.

Disadvantage: Can lead to hot spots—if a specific range receives more requests than others, that shard will become overloaded.

2. Hash-Based Sharding

Here, a hash function is applied to a data attribute (e.g., User ID) to determine which shard should store the data. A well-designed hash function distributes data evenly across shards.

say for example If hash(UserID) % 3 == 0, the request is sent to Shard A; if hash(UserID) % 3 == 1, it goes to Shard B, and so on.

Advantage: Ensures even distribution and prevents hot spots.

Disadvantage: Can make it difficult to perform range queries, as data is spread across multiple shards unpredictably.

Rebalancing Shards

As data grows and system demands change, sharded databases must be rebalanced to maintain efficiency. This involves redistributing data and queries among nodes while minimizing disruptions. Rebalancing is crucial to:

Prevent overload on specific shards.

Maintain system availability for read/write operations.

Minimize data movement to reduce network and disk I/O overhead.

We will discuss different rebalancing strategies and common pitfalls in a future article.

why bother using any of those techniques :

as the system scales up or single instance database will not be able to handle all of the request efficiently, which will cause reduce the response time , that will affect the user experience which in of itself will affect our sales and user count, you may think no problem let's just get a stronger pc that can handle those request , but as we keep scaling up the hardware will be costly and hard to find (you'll reach a point when you have the most powerful disk out there to run you db on ),for that reason we use this strategies to keep or response time and overall experience smooth no matter how much our app scales.
Final Thoughts

Now that you have a fundamental understanding of replication and sharding, you can explore their trade-offs and edge cases in greater detail. Each technique has its own benefits and challenges, and the best approach depends on your system’s requirements.

Stay tuned for more in-depth articles on advanced replication strategies, handling conflicts in distributed databases, and optimizing sharding techniques!

Happy learning! 🚀

DEV Community