DEV Community

Cover image for Sharding in Databases: How Queries Identify the Right Shard
Ahmed Raza Idrisi
Ahmed Raza Idrisi

Posted on

Sharding in Databases: How Queries Identify the Right Shard

๐Ÿ”น What is Sharding in Databases?

Sharding is a database partitioning technique where data is split across multiple databases or servers (called shards) to improve scalability and performance.

Each shard contains a portion of the data (not a full copy). For example:

  • Suppose you have a users table with 100 million rows.
  • Instead of storing all in one DB, you could split:

    • Shard 1 โ†’ users with IDs 1โ€“25M
    • Shard 2 โ†’ users with IDs 25Mโ€“50M
    • Shard 3 โ†’ users with IDs 50Mโ€“75M
    • Shard 4 โ†’ users with IDs 75Mโ€“100M

Now each shard only handles a subset, so queries and writes become faster.

๐Ÿ‘‰ Sharding is commonly used in very large-scale systems (e.g., Facebook, Twitter, YouTube) where one DB server cannot handle all the load.


๐Ÿ”น Similar Concepts in Databases

Sharding is just one of many scaling/distribution techniques. Letโ€™s compare:

1. Replication

  • Copies of the same database are kept across multiple servers.
  • Master โ†’ Replica(s) model.
  • Used for read scaling and high availability.
  • Example: MySQL Replication, PostgreSQL Streaming Replication.

๐Ÿ‘‰ Difference from sharding: Replicas have full copies, while shards have partial data.


2. Partitioning (within a single DB)

  • Data is divided into smaller, manageable partitions inside one database server.
  • Types:

    • Horizontal Partitioning โ†’ rows are split (like sharding, but within one server).
    • Vertical Partitioning โ†’ columns are split (e.g., frequently accessed columns in one table, large JSON/blob in another).
  • Example: MySQL Partitioning, PostgreSQL Table Partitioning.

๐Ÿ‘‰ Sharding is basically horizontal partitioning across multiple servers.


3. Federation / Database Clustering

  • Splitting data across different databases logically (like sharding, but sometimes done at the application level).
  • Example: SQL Server Federation, Citus for PostgreSQL.

4. Connection Pooling

  • Not data distribution, but an important concept: reuses DB connections to improve performance under high load.

5. Caching

  • Store frequently accessed data in Redis, Memcached, etc.
  • Helps reduce load on DB (often combined with sharding).

๐Ÿ”น In MySQL vs PostgreSQL

  • MySQL

    • Supports replication, partitioning, and can be sharded via middleware like Vitess (used by YouTube).
    • MySQL Cluster (NDB) also provides automatic sharding.
  • PostgreSQL

    • Native table partitioning.
    • Built-in replication.
    • Can use extensions like Citus (for distributed PostgreSQL with sharding).

this is the tricky part of sharding โ€” how the system decides which shard to query.

Thereโ€™s no magic โ€” the database (or middleware, or your app) needs a shard key (a rule to determine where a piece of data lives).


๐Ÿ”น How Queries Find the Right Shard

  1. Shard Key
  • A special column (like user_id, customer_id, order_id, etc.) is chosen.

  • Example: If user_id = 12345, then the system applies a rule like:

     shard_number = user_id % 4
    
    • If result = 1 โ†’ go to Shard 1
    • If result = 2 โ†’ go to Shard 2, etc.
  • This is called hash-based sharding.


  1. Shard Mapping (Lookup Table)
  • Sometimes, a mapping table (or config service) stores which shard holds which data.
  • Example:

     user_id_range   -> shard
     1 - 25M         -> shard1.db.com
     25M - 50M       -> shard2.db.com
     50M - 75M       -> shard3.db.com
    
  • When a query comes, the app looks up the mapping table first.


  1. Middleware / Proxy
  • Tools like Vitess (MySQL), Citus (Postgres), or custom middleware sit between the app and DB.
  • They intercept queries, check the shard key, and route to the right shard.
  • To your app, it looks like one database.

  1. Broadcast Queries (last resort)
  • If a query doesnโ€™t include the shard key (e.g., โ€œfind all users with email = xyz@gmail.comโ€), the system may need to query all shards and combine results.
  • This is slow, which is why choosing the right shard key is critical.

๐Ÿ”น Example: Shard Query Flow

Say youโ€™re storing users across 4 shards:

  • Shard Rule: user_id % 4
  • User with user_id = 1017 logs in.

Query:

SELECT * FROM users WHERE user_id = 1017;
Enter fullscreen mode Exit fullscreen mode

Flow:

  1. App sees user_id = 1017.
  2. Computes โ†’ 1017 % 4 = 1.
  3. Query goes to Shard 1 only.
  4. Result returned.

โœ… Summary:

  • Queries find the right shard using shard keys.
  • Routing is done by either:

    • App logic
    • Middleware/proxy
    • Lookup table
  • If no shard key โ†’ either broadcast query or centralized index.

โœ… Quick Recap:

  • Replication โ†’ copies of same data.
  • Sharding โ†’ split data into subsets across servers.
  • Partitioning โ†’ split data inside a single DB.
  • Clustering / Federation โ†’ multiple databases acting as one.
  • Caching โ†’ reduce DB load.

Top comments (0)