Sandeep Kumar Seeram
Horizontal partitioning (Sharding) on MongoDB - Cloud-Native Data Patterns

In many large-scale solutions, data is divided into partitions that can be managed and accessed separately. Partitioning can improve scalability, reduce contention, and optimize performance. It can also provide a mechanism for dividing data by usage pattern.

For example, you can archive older data in cheaper data storage.

Designing partitions

There are three typical strategies for partitioning data:

Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store, but all partitions have the same schema. Each partition is known as a shard and holds a specific subset of the data, such as all the orders for a specific set of customers.

Vertical partitioning. In this strategy, each partition holds a subset of the fields for items in the data store. The fields are divided according to their pattern of use. For example, frequently accessed fields might be placed in one vertical partition and less frequently accessed fields in another.

Functional partitioning. In this strategy, data is aggregated according to how it is used by each bounded context in the system. For example, an e-commerce system might store invoice data in one partition and product inventory data in another.
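As a rough illustration of the three strategies, here is a small JavaScript sketch. The records, shard rule, and field split below are hypothetical examples, not tied to any particular database:

```javascript
// Hypothetical order records used only to illustrate the three strategies.
const orders = [
  { id: 1, customer: 'alice', total: 40, notes: 'gift wrap' },
  { id: 2, customer: 'bob', total: 75, notes: '' },
  { id: 3, customer: 'carol', total: 12, notes: 'expedite' },
];

// Horizontal partitioning (sharding): every partition has the same schema;
// each record lands on one shard based on a shard key (here, customer name).
const shardOf = (order) => (order.customer < 'b' ? 'shard1' : 'shard2');
const shards = { shard1: [], shard2: [] };
for (const o of orders) shards[shardOf(o)].push(o);

// Vertical partitioning: split the *fields* of each record by access pattern —
// frequently read fields in one store, rarely read fields in another.
const hotFields = orders.map(({ id, customer, total }) => ({ id, customer, total }));
const coldFields = orders.map(({ id, notes }) => ({ id, notes }));

// Functional partitioning: separate stores per bounded context,
// e.g. invoices in one partition, product inventory in another.
const invoicePartition = orders.map(({ id, total }) => ({ id, total }));
const inventoryPartition = [{ sku: 'book-1', stock: 12 }];

console.log(shards.shard1.length, shards.shard2.length); // shard sizes after routing
```

In the horizontal case each shard holds complete records for a subset of customers; in the vertical case every store holds all records but only some fields.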

In this article, we take a step-by-step approach to dividing data into partitions (horizontal partitioning, also known as sharding) so that it can be accessed and managed separately.

Technologies used: MongoDB, Docker, Docker Compose

Creating a shard on MongoDB: We will use MongoDB to implement sharding. A sharded MongoDB deployment is a group of MongoDB instances called a sharded cluster. Besides the shards themselves, we need a configuration server, which holds metadata about the different shards in the cluster, and a router (mongos), which routes client requests to the appropriate backend shards.

We will use the Docker Compose file below, which defines the MongoDB configuration server, three shards, and a router, with a shared scripts volume.

```yaml
version: "2"
services:
  # Configuration server
  config:
    image: mongo
    command: mongod --configsvr --replSet configserver --port 27017
    volumes:
      - ./scripts:/scripts
  # Shards
  shard1:
    image: mongo
    command: mongod --shardsvr --replSet shard1 --port 27018
    volumes:
      - ./scripts:/scripts
  shard2:
    image: mongo
    command: mongod --shardsvr --replSet shard2 --port 27019
    volumes:
      - ./scripts:/scripts
  shard3:
    image: mongo
    command: mongod --shardsvr --replSet shard3 --port 27020
    volumes:
      - ./scripts:/scripts
  # Router
  router:
    image: mongo
    command: mongos --configdb configserver/config:27017 --bind_ip_all --port 27017
    ports:
      - "27017:27017"
    volumes:
      - ./scripts:/scripts
    depends_on:
      - config
      - shard1
      - shard2
      - shard3
```

The init-config.js script initializes the configuration server as a single-member replica set.

```js
// Initializes the config server
rs.initiate({
  _id: 'configserver',
  configsvr: true,
  version: 1,
  members: [
    {
      _id: 0,
      host: 'config:27017',
    },
  ],
});
```

Execute the script inside the config container:

```sh
docker-compose exec config sh -c "mongosh --port 27017 < /scripts/init-config.js"
```

Next, we can configure the three shards by running the three shard scripts. Just like the previous command, each one initializes a shard as a single-member replica set.

Shard configuration files:

```js
// Initialize shard1
rs.initiate({
  _id: 'shard1',
  version: 1,
  members: [{ _id: 0, host: 'shard1:27018' }],
});
```

```js
// Initialize shard2
rs.initiate({
  _id: 'shard2',
  version: 1,
  members: [{ _id: 0, host: 'shard2:27019' }],
});
```

```js
// Initialize shard3
rs.initiate({
  _id: 'shard3',
  version: 1,
  members: [{ _id: 0, host: 'shard3:27020' }],
});
```

Initialize shard 1:

```sh
docker-compose exec shard1 sh -c "mongosh --port 27018 < /scripts/init-shard1.js"
```

Initialize shard 2:

```sh
docker-compose exec shard2 sh -c "mongosh --port 27019 < /scripts/init-shard2.js"
```

Initialize shard 3:

```sh
docker-compose exec shard3 sh -c "mongosh --port 27020 < /scripts/init-shard3.js"
```

Now that the three shards are configured, we need to register them with the router.

The init-router.js script adds the shards to the cluster:

```js
// Initialize the router with shards
sh.addShard('shard1/shard1:27018');
sh.addShard('shard2/shard2:27019');
sh.addShard('shard3/shard3:27020');
```

Configure the router with this command:

```sh
docker-compose exec router sh -c "mongosh < /scripts/init-router.js"
```

Verification:

To verify the status of the sharded cluster, we can open a shell inside the router container:

```sh
docker-compose exec router mongosh
```

Once we get the `mongos` prompt, we can run the `sh.status()` command.

Under the shards field, the configuration should look like this:

```
shards:
{ "_id" : "shard1", "host" : "shard1/shard1:27018", "state" : 1 }
{ "_id" : "shard2", "host" : "shard2/shard2:27019", "state" : 1 }
{ "_id" : "shard3", "host" : "shard3/shard3:27020", "state" : 1 }
```

Type `exit` to leave the router container.

Configure hashed sharding

With the MongoDB sharded cluster running, we need to enable sharding for a database and then shard a specific collection in that database.

MongoDB provides two strategies for sharding collections: hashed sharding and range-based sharding.

Hashed sharding uses a hashed index on a single field as the shard key to partition the data. Range-based sharding can use multiple fields as the shard key and divides the data into contiguous ranges based on the shard-key values.
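The difference between the two strategies can be sketched in plain JavaScript. The hash function below is a toy stand-in for MongoDB's internal hashed index, and the range boundaries are made up for illustration:

```javascript
// Hypothetical documents to place across three shards.
const docs = [{ title: 'a' }, { title: 'b' }, { title: 'c' }, { title: 'd' }];

// A toy hash function — a stand-in for MongoDB's internal hashed index.
const toyHash = (s) =>
  [...s].reduce((h, ch) => (h * 31 + ch.charCodeAt(0)) >>> 0, 7);

// Hashed sharding: the shard is derived from the hash of a single field,
// so adjacent key values tend to be spread across different shards.
const hashedShard = (doc) => toyHash(doc.title) % 3;

// Range-based sharding: the shard is chosen by contiguous key ranges,
// so adjacent key values stay together on the same shard.
const rangeShard = (doc) => (doc.title < 'b' ? 0 : doc.title < 'c' ? 1 : 2);

for (const d of docs) {
  console.log(d.title, 'hashed → shard', hashedShard(d), '| range → shard', rangeShard(d));
}
```

Hashed keys give a more even write distribution, while range keys keep related values co-located, which favors range queries.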

The first thing that we need to do is to enable sharding on our database.

We will be running all commands from within the router container.

Run `docker-compose exec router mongosh` to get the MongoDB shell.

Let's enable sharding for a database called test:

```js
sh.enableSharding('test')
```

Finally, we can configure how we want to shard specific collections inside the database. We will configure sharding for the customertable collection, using a hashed key on the title field:

```js
sh.shardCollection("test.customertable", { title: "hashed" })
```

Type `exit` to exit the container.

With sharding enabled, we can start importing data and see how it gets distributed across the different shards.
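As a quick check from the router, you can insert a few documents and ask mongos how they are distributed. `getShardDistribution()` is a standard mongosh collection helper; the document shape and count below are just an example:

```js
// Run inside the router container: docker-compose exec router mongosh
const testDB = db.getSiblingDB('test');
for (let i = 0; i < 1000; i++) {
  testDB.customertable.insertOne({ title: 'customer-' + i });
}
// Prints per-shard document counts and data sizes for the collection.
testDB.customertable.getShardDistribution();
```

With a hashed shard key, the counts reported for shard1, shard2, and shard3 should be roughly equal.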
