๐น What is Sharding in Databases?
Sharding is a database partitioning technique where data is split across multiple databases or servers (called shards) to improve scalability and performance.
Each shard contains a portion of the data (not a full copy). For example:
- Suppose you have a users table with 100 million rows.
-
Instead of storing all in one DB, you could split:
- Shard 1 โ users with IDs 1โ25M
- Shard 2 โ users with IDs 25Mโ50M
- Shard 3 โ users with IDs 50Mโ75M
- Shard 4 โ users with IDs 75Mโ100M
Now each shard only handles a subset, so queries and writes become faster.
๐ Sharding is commonly used in very large-scale systems (e.g., Facebook, Twitter, YouTube) where one DB server cannot handle all the load.
๐น Similar Concepts in Databases
Sharding is just one of many scaling/distribution techniques. Letโs compare:
1. Replication
- Copies of the same database are kept across multiple servers.
- Master โ Replica(s) model.
- Used for read scaling and high availability.
- Example: MySQL Replication, PostgreSQL Streaming Replication.
๐ Difference from sharding: Replicas have full copies, while shards have partial data.
2. Partitioning (within a single DB)
- Data is divided into smaller, manageable partitions inside one database server.
-
Types:
- Horizontal Partitioning โ rows are split (like sharding, but within one server).
- Vertical Partitioning โ columns are split (e.g., frequently accessed columns in one table, large JSON/blob in another).
Example: MySQL Partitioning, PostgreSQL Table Partitioning.
๐ Sharding is basically horizontal partitioning across multiple servers.
3. Federation / Database Clustering
- Splitting data across different databases logically (like sharding, but sometimes done at the application level).
- Example: SQL Server Federation, Citus for PostgreSQL.
4. Connection Pooling
- Not data distribution, but an important concept: reuses DB connections to improve performance under high load.
5. Caching
- Store frequently accessed data in Redis, Memcached, etc.
- Helps reduce load on DB (often combined with sharding).
๐น In MySQL vs PostgreSQL
-
MySQL
- Supports replication, partitioning, and can be sharded via middleware like Vitess (used by YouTube).
- MySQL Cluster (NDB) also provides automatic sharding.
-
PostgreSQL
- Native table partitioning.
- Built-in replication.
- Can use extensions like Citus (for distributed PostgreSQL with sharding).
this is the tricky part of sharding โ how the system decides which shard to query.
Thereโs no magic โ the database (or middleware, or your app) needs a shard key (a rule to determine where a piece of data lives).
๐น How Queries Find the Right Shard
- Shard Key
A special column (like
user_id
,customer_id
,order_id
, etc.) is chosen.-
Example: If
user_id = 12345
, then the system applies a rule like:
shard_number = user_id % 4
- If result = 1 โ go to Shard 1
- If result = 2 โ go to Shard 2, etc.
This is called hash-based sharding.
- Shard Mapping (Lookup Table)
- Sometimes, a mapping table (or config service) stores which shard holds which data.
-
Example:
user_id_range -> shard 1 - 25M -> shard1.db.com 25M - 50M -> shard2.db.com 50M - 75M -> shard3.db.com
When a query comes, the app looks up the mapping table first.
- Middleware / Proxy
- Tools like Vitess (MySQL), Citus (Postgres), or custom middleware sit between the app and DB.
- They intercept queries, check the shard key, and route to the right shard.
- To your app, it looks like one database.
- Broadcast Queries (last resort)
- If a query doesnโt include the shard key (e.g., โfind all users with email = xyz@gmail.comโ), the system may need to query all shards and combine results.
- This is slow, which is why choosing the right shard key is critical.
๐น Example: Shard Query Flow
Say youโre storing users across 4 shards:
-
Shard Rule:
user_id % 4
- User with
user_id = 1017
logs in.
Query:
SELECT * FROM users WHERE user_id = 1017;
Flow:
- App sees
user_id = 1017
. - Computes โ
1017 % 4 = 1
. - Query goes to Shard 1 only.
- Result returned.
โ Summary:
- Queries find the right shard using shard keys.
-
Routing is done by either:
- App logic
- Middleware/proxy
- Lookup table
If no shard key โ either broadcast query or centralized index.
โ Quick Recap:
- Replication โ copies of same data.
- Sharding โ split data into subsets across servers.
- Partitioning โ split data inside a single DB.
- Clustering / Federation โ multiple databases acting as one.
- Caching โ reduce DB load.
Top comments (0)