Mohamed Hussain S

Posted on Sep 9

Scaling Databases with ClickHouse Sharding (Hands-On Simulation)

#clickhouse #olap #dataengineering #resources

Hey Devs 👋,

When datasets grow, even the most powerful database server eventually hits its limits:

📦 Disk space fills up
⚡ CPU maxes out under query load
🧠 Memory struggles with joins & aggregations

That’s where sharding comes in. Instead of scaling up a single machine, we split the data across multiple nodes.

In this post, I’ll walk you through a hands-on simulation of ClickHouse sharding that I built, so you can try it locally and see how it works in practice.

🔗 GitHub Repo: https://github.com/mohhddhassan/ClickHouse_Sharding_Simulation

📦 What This Project Does

This is a beginner-friendly project that demonstrates:

🗂️ Creating a multi-shard ClickHouse setup with Docker Compose
📊 Distributing data across shards with weights (e.g., one shard can take 10× more data)
🔁 Querying through a Distributed table to merge results across shards
📈 Showing how queries scale horizontally as data grows
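
To build intuition for the weights mentioned above, here is a minimal Python sketch of weighted routing. It follows the rule described in the ClickHouse docs: take the sharding expression's remainder modulo the total weight, then pick the shard whose weight interval contains that remainder. This is not code from the repo. The `pick_shard` helper, the integer keys, and the weights of 10 and 1 are assumptions for illustration.

```python
from collections import Counter


def pick_shard(sharding_value: int, weights: list[int]) -> int:
    """Return the index of the shard that should receive this row.

    Shard i owns the remainders in the half-open interval
    [sum(weights[:i]), sum(weights[:i + 1])).
    """
    if any(weight < 0 for weight in weights):
        raise ValueError("weights must not be negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    remainder = sharding_value % total
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise AssertionError("unreachable when total weight is positive")


# Hypothetical setup: the first shard is weighted 10x heavier than the second.
weights = [10, 1]
rows_per_shard = Counter(pick_shard(key, weights) for key in range(1100))
print(dict(rows_per_shard))  # {0: 1000, 1: 100}
```

With sequential keys the split is exactly 10 to 1. With a real sharding key (for example a hash of a user ID) the ratio would only be approximate.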

🛠️ Tech Stack

ClickHouse → high-performance OLAP database
Docker Compose → spin up shards + distributed node easily
SQL → to define shards, distributed tables, and run queries

⚙️ How To Run It Locally

Step 1. Clone the repo

git clone https://github.com/mohhddhassan/ClickHouse_Sharding_Simulation.git
cd ClickHouse_Sharding_Simulation

Step 2. Start the cluster

docker-compose up -d

Step 3. Open a ClickHouse client inside a shard container, then insert sample data (the README has the schema and example queries)

docker exec -it ch1 clickhouse-client

Step 4. Query from the distributed table

SELECT * FROM distributed_table;

Boom 🚀 you’ll see results merged from multiple shards!
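
What "merged from multiple shards" means is easiest to see with an aggregate query. Here is a rough Python sketch of the scatter-gather idea: each shard reduces its own rows to a small partial state, and the node that received the query combines those states. It is a simplified model, not ClickHouse internals. The shard contents and the function names are made up for illustration.

```python
def partial_aggregate(rows: list[int]) -> tuple[int, int]:
    """Runs on each shard: reduce local rows to a (count, total) state."""
    return len(rows), sum(rows)


def merge_states(states: list[tuple[int, int]]) -> tuple[int, int]:
    """Runs on the node that received the query: combine per-shard states."""
    count = sum(shard_count for shard_count, _ in states)
    total = sum(shard_total for _, shard_total in states)
    return count, total


# Hypothetical data: an uneven split, as a weighted setup would produce.
shards = [
    [10, 20, 30, 40],  # values stored on the first shard
    [100],             # values stored on the second shard
]

count, total = merge_states([partial_aggregate(rows) for rows in shards])
average = total / count
print(count, total, average)  # 5 200 40.0
```

Note that the average comes from the merged count and total. Averaging the two per-shard averages (25 and 100) would give 62.5, which is wrong whenever the shards hold different numbers of rows.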

🗂️ Project Structure

ClickHouse_Sharding_Simulation/
├── docker-compose.yml
├── configs/
│   └── remote_servers.xml
└── README.md                  # Example queries + schema

🤯 What I Learned

💡 How ClickHouse uses Distributed tables to query across shards
💡 How shard weights balance load between nodes
💡 Why horizontal scaling keeps working for OLAP workloads after a single machine runs out of disk, CPU, or memory
💡 How to simulate real-world database scaling locally with Docker

🔍 Why You Should Try This

If you’re learning data engineering or databases:

🔹 Understand sharding in a safe, local environment
🔹 Practice setting up a mini ClickHouse cluster with Docker
🔹 See how queries scale across nodes
🔹 Build intuition for horizontal scaling vs vertical scaling

📌 What’s Next?

Add replication for fault tolerance
Benchmark query speed vs single-node setup
Try larger datasets for performance testing

🙋‍♂️ About Me

Mohamed Hussain S
Associate Data Engineer
LinkedIn | GitHub

🧪 Building simple projects to understand the logic.