Ahmed Raza Idrisi

Posted on Sep 19

Sharding in Databases: How Queries Identify the Right Shard

#webdev #programming #database #backend

🔹 What is Sharding in Databases?

Sharding is a database partitioning technique where data is split across multiple databases or servers (called shards) to improve scalability and performance.

Each shard contains a portion of the data (not a full copy). For example:

Suppose you have a users table with 100 million rows.
Instead of storing all in one DB, you could split:
- Shard 1 → users with IDs 1–25M
- Shard 2 → users with IDs 25M–50M
- Shard 3 → users with IDs 50M–75M
- Shard 4 → users with IDs 75M–100M

Now each shard only handles a subset, so queries and writes become faster.

👉 Sharding is commonly used in very large-scale systems (e.g., Facebook, Twitter, YouTube) where one DB server cannot handle all the load.

🔹 Similar Concepts in Databases

Sharding is just one of many scaling/distribution techniques. Let’s compare:

1. Replication

Copies of the same database are kept across multiple servers.
Master → Replica(s) model.
Used for read scaling and high availability.
Example: MySQL Replication, PostgreSQL Streaming Replication.

👉 Difference from sharding: Replicas have full copies, while shards have partial data.

2. Partitioning (within a single DB)

Data is divided into smaller, manageable partitions inside one database server.
Types:
- Horizontal Partitioning → rows are split (like sharding, but within one server).
- Vertical Partitioning → columns are split (e.g., frequently accessed columns in one table, large JSON/blob in another).
Example: MySQL Partitioning, PostgreSQL Table Partitioning.

👉 Sharding is basically horizontal partitioning across multiple servers.

3. Federation / Database Clustering

Splitting data across different databases logically (like sharding, but sometimes done at the application level).
Example: SQL Server Federation, Citus for PostgreSQL.

4. Connection Pooling

Not data distribution, but an important concept: reuses DB connections to improve performance under high load.

5. Caching

Store frequently accessed data in Redis, Memcached, etc.
Helps reduce load on DB (often combined with sharding).

🔹 In MySQL vs PostgreSQL

MySQL
- Supports replication, partitioning, and can be sharded via middleware like Vitess (used by YouTube).
- MySQL Cluster (NDB) also provides automatic sharding.
PostgreSQL
- Native table partitioning.
- Built-in replication.
- Can use extensions like Citus (for distributed PostgreSQL with sharding).

this is the tricky part of sharding — how the system decides which shard to query.

There’s no magic — the database (or middleware, or your app) needs a shard key (a rule to determine where a piece of data lives).

🔹 How Queries Find the Right Shard

Shard Key

A special column (like user_id, customer_id, order_id, etc.) is chosen.
Example: If user_id = 12345, then the system applies a rule like:
```
 shard_number = user_id % 4
```
- If result = 1 → go to Shard 1
- If result = 2 → go to Shard 2, etc.
This is called hash-based sharding.

Shard Mapping (Lookup Table)

Sometimes, a mapping table (or config service) stores which shard holds which data.

Example:

 user_id_range   -> shard
 1 - 25M         -> shard1.db.com
 25M - 50M       -> shard2.db.com
 50M - 75M       -> shard3.db.com

When a query comes, the app looks up the mapping table first.

Middleware / Proxy

Tools like Vitess (MySQL), Citus (Postgres), or custom middleware sit between the app and DB.
They intercept queries, check the shard key, and route to the right shard.
To your app, it looks like one database.

Broadcast Queries (last resort)

If a query doesn’t include the shard key (e.g., “find all users with email = xyz@gmail.com”), the system may need to query all shards and combine results.
This is slow, which is why choosing the right shard key is critical.

🔹 Example: Shard Query Flow

Say you’re storing users across 4 shards:

Shard Rule: user_id % 4
User with user_id = 1017 logs in.

Query:

SELECT * FROM users WHERE user_id = 1017;

Flow:

App sees user_id = 1017.
Computes → 1017 % 4 = 1.
Query goes to Shard 1 only.
Result returned.

✅ Summary:

Queries find the right shard using shard keys.
Routing is done by either:
- App logic
- Middleware/proxy
- Lookup table
If no shard key → either broadcast query or centralized index.

✅ Quick Recap:

Replication → copies of same data.
Sharding → split data into subsets across servers.
Partitioning → split data inside a single DB.
Clustering / Federation → multiple databases acting as one.
Caching → reduce DB load.

DEV Community