
nishaant dixit


ClickHouse Cluster Setup Guide

I spent three nights debugging a sharded ClickHouse cluster that kept losing data. The logs were useless. Zookeeper was throwing cryptic errors. My team was ready to abandon the whole thing.

Turns out, we had the replication config wrong. One missing parameter. That's it.

ClickHouse is fast. Blazingly fast. But a misconfigured cluster? It's a nightmare.

This guide covers exactly how to set up a ClickHouse cluster from scratch. The hard truths. The trade-offs. The configs that actually work.

What is a ClickHouse cluster? It's a distributed system where data is sharded across multiple nodes and replicated for fault tolerance. Each node stores a subset of data. Queries run in parallel across all nodes. Results merge automatically. According to ClickHouse Docs, a production cluster typically has 3-10 shards with 2-3 replicas each.

Here's what you'll learn: The exact architecture decisions I've made building clusters processing 200K events/second. The configs that break silently. And the testing steps most tutorials ignore.

Most people think ClickHouse clustering is like any other distributed database. Drop some configs. Run a few commands. Done.

They're wrong.

ClickHouse has a unique architecture. SQL-based. Columnar storage. Shared-nothing design. You must understand three layers.

The storage layer. Each ClickHouse server stores data locally on disk. No shared storage. If a node dies, its data is gone unless replicated. This is by design. Local storage gives you insane read speeds. But it means you need replication.

The coordination layer. This is where Zookeeper or ClickHouse Keeper comes in. It tracks which nodes are alive, which shards have what data, and coordinates replication. According to Altinity's guide, Zookeeper is the most common setup. But I've found it's also the biggest pain point. It requires its own cluster. Minimum 3 nodes.

The query layer. Queries hit any node. That node becomes the coordinator. It fans out queries to all relevant shards, waits for partial results, then merges. The client sees one result set.

Here's what I learned the hard way: You can't mix sharding strategies. Either use consistent hashing or round-robin. Pick one. Stick with it.

In my experience, round-robin is simpler. Consistent hashing gives you better resharding capabilities. But both work if you plan ahead.

Why bother with a cluster? Single-node ClickHouse is already fast.

Parallel query execution. A 10-node cluster isn't 10x faster. It's more like 8x. Network overhead and merge operations cost something. But that 8x matters when you're scanning billions of rows. According to SeveralNines, properly sharded clusters see 5-7x improvement in analytical queries.

Fault tolerance. This is the real reason. Data replication means you survive node failures. No downtime. No data loss. I've seen clusters lose two nodes simultaneously and keep serving queries. You can't do that with a single instance.

Storage scaling. ClickHouse compresses data aggressively. But even compressed, a petabyte of data doesn't fit on one machine. Sharding spreads storage across nodes. Each node handles its share.

I've found that the hardest benefit to capture is cost efficiency. A cluster of smaller nodes is often cheaper than one massive server. You pay for commodity hardware instead of enterprise pricing. And you can scale horizontally as you grow.

The problem isn't the benefits. It's the complexity. Everyone wants high availability. Nobody wants to debug Zookeeper at 3 AM.

Let me show you the exact setup I use. I'll walk through every config file and command.

First, install ClickHouse on all nodes. According to ClickHouse Installation Docs, the process is straightforward. On Ubuntu:

sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client

Standard stuff. The real work comes next.

Config file for each node. Every node needs the cluster definition in its server config: config.xml, or better, a drop-in file under /etc/clickhouse-server/config.d/. Here's a minimal example for a 2-shard, 2-replica setup (modern releases use <clickhouse> as the root tag; older guides use <yandex>, which still works as an alias):

<clickhouse>
    <remote_servers>
        <my_cluster>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse-01</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-02</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse-03</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-04</host>
                    <port>9000</port>
                </replica>
            </shard>
        </my_cluster>
    </remote_servers>
    <zookeeper>
        <node>
            <host>zookeeper-01</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-02</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-03</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <!-- macros must be unique per node; this file is for clickhouse-01 -->
    <macros>
        <shard>01</shard>
        <replica>clickhouse-01</replica>
    </macros>
</clickhouse>

The <macros> section is critical. Each node must have unique shard and replica values. Without this, replicated tables won't work. I've seen production clusters fail because someone copied the same config to all nodes. Don't be that person.
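For contrast, the second node in this layout (clickhouse-02, the other replica of shard 01) would carry:

<macros>
    <shard>01</shard>
    <replica>clickhouse-02</replica>
</macros>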

Creating distributed tables. After configs are in place, create the tables:

-- Create local table on each shard
CREATE TABLE events_local ON CLUSTER my_cluster (
    event_id UInt64,
    timestamp DateTime,
    user_id UInt32,
    event_type String
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/my_cluster/tables/{shard}/events',
    '{replica}'
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, event_id);

-- Create distributed view
CREATE TABLE events_distributed ON CLUSTER my_cluster
AS events_local
ENGINE = Distributed(
    my_cluster,
    default,
    events_local,
    rand()
);

Notice the ON CLUSTER my_cluster syntax. It tells ClickHouse to run this command on every node. Much better than running SQL on each machine manually. According to Abhinav Mallick's guide, this is the recommended approach for production deployments.

The rand() in the Distributed engine determines sharding. Random distribution works for most use cases. If you need consistent routing by user_id, use cityHash64(user_id) instead.

Common pitfall. I've seen people forget that Distributed tables don't store data. They're views. Data lives in the local ReplicatedMergeTree tables. Query the distributed table. Insert into the distributed table. The engine handles routing.
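A quick sanity check of the routing, as a sketch (the row values are made up):

-- inserts go to the Distributed table; the engine picks a shard via the sharding key
INSERT INTO events_distributed VALUES
    (1, now(), 42, 'page_view'),
    (2, now(), 43, 'click');

-- reads go to the Distributed table too; the coordinator fans out and merges
SELECT event_type, count() AS events
FROM events_distributed
GROUP BY event_type;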

After building clusters for fintech, adtech, and SaaS companies, here's what works.

Use ClickHouse Keeper instead of Zookeeper. Zookeeper is a separate dependency. Another thing to monitor. Another failure domain. ClickHouse Keeper is built into ClickHouse. Same protocol. No separate deployment. According to ClickHouse Operator documentation, Keeper handles all coordination needs.
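Here's a minimal Keeper sketch for a 3-node ensemble, following the pattern in the ClickHouse Keeper docs (9181 and 9234 are the documented client and Raft ports; hostnames reuse the cluster above). Each node gets its own server_id, and clients point their <zookeeper> section at port 9181:

<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id> <!-- unique per Keeper node -->
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server><id>1</id><hostname>clickhouse-01</hostname><port>9234</port></server>
            <server><id>2</id><hostname>clickhouse-02</hostname><port>9234</port></server>
            <server><id>3</id><hostname>clickhouse-03</hostname><port>9234</port></server>
        </raft_configuration>
    </keeper_server>
    <zookeeper>
        <node><host>clickhouse-01</host><port>9181</port></node>
        <node><host>clickhouse-02</host><port>9181</port></node>
        <node><host>clickhouse-03</host><port>9181</port></node>
    </zookeeper>
</clickhouse>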

Plan your shard count before data loads. Changing shards later requires data redistribution. That means downtime or complex migration scripts. I've found that starting with 4-8 shards works for most workloads. You can always add nodes within existing shards for replication.

Monitor merge performance. ClickHouse merges data in the background. Too many partitions means too many merges. Your cluster slows down. Keep partition sizes between 100GB and 200GB. Partition by month or week, not by day.
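Two quick checks, as a sketch against the built-in system tables:

-- parts per partition; high counts mean merge pressure
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;

-- merges currently in flight
SELECT database, table, elapsed, progress
FROM system.merges;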

Use max_replica_delay_for_distributed_queries wisely. Set it to 60 seconds. If a replica falls further behind than that, queries stop routing to it. Prevents stale data from being served. But don't set it too low. Network hiccups will cause unnecessary failovers.
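As a session-level sketch (in production you'd bake this into a user profile):

SET max_replica_delay_for_distributed_queries = 60; -- seconds
-- optionally fail instead of silently reading from a stale replica
SET fallback_to_stale_replicas_for_distributed_queries = 0;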

The hard truth about ClickHouse clusters: They're not magical. They require planning. A badly configured cluster is slower than a well-tuned single node. I've seen it happen.

Should you use a cluster? Not always.

Single node is better when: Your data fits on one machine. Your queries are fast enough. You don't need HA. A single ClickHouse instance handles 10-50 TB of compressed data easily. According to Rakesh Therani's guide, most teams don't need clustering until they exceed 100 TB.

Cluster is necessary when: You need HA and failover. Your data exceeds single-node capacity. Your queries need parallel execution for sub-second responses.

The trade-off is real. Clusters add complexity. Zookeeper/Keeper monitoring. Network latency. Merge coordination. Query routing. Each layer introduces failure modes.

In my experience, start with a single node. Add replication first. Then sharding. Incremental complexity is manageable. Jumping straight to a 10-node cluster? You'll spend weeks debugging.

Here's my decision framework:

  • Less than 10 TB? A single shard with replication (one node plus a replica).
  • 10-50 TB? Still one shard, replicated, with horizontal partitioning by time.
  • 50-200 TB? 2-4 shards with 2 replicas each.
  • 200+ TB? 4-8 shards with 2-3 replicas each.

Problems will happen. Here's how to fix the common ones.

Zookeeper session expired. Your cluster stops writing. Queries return errors. Restart Zookeeper nodes one at a time. Then restart ClickHouse nodes. Check session timeout settings. Default is 30 seconds. Increase it to 60 seconds. According to Cedrick Chee's cluster setup, this is the most common production issue.
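The timeout lives in the <zookeeper> section of the server config. A sketch raising it to 60 seconds:

<zookeeper>
    <node>
        <host>zookeeper-01</host>
        <port>2181</port>
    </node>
    <!-- zookeeper-02 and zookeeper-03 as in the config above -->
    <session_timeout_ms>60000</session_timeout_ms>
</zookeeper>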

Replication lag. One replica is behind. Data is inconsistent. Check the system.replicas table. Look at the absolute_delay column. High lag usually means the replica is overloaded. Add more resources or reduce query load on that node.
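A sketch for spotting lagging replicas:

SELECT database, table, replica_name, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC;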

Merge fails with "too many parts". Your insert rate exceeds merge capacity. Partition more aggressively. Or reduce insert batch size. I've found that batch sizes of 100K-500K rows work well. Larger batches increase merge pressure.
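The thresholds behind that error are MergeTree settings. A sketch to inspect them:

SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');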

Data skew. Some shards have more data than others. Queries slow down because one shard is the bottleneck. Re-evaluate your sharding key. Use cityHash64(user_id) instead of rand(). Consistent hashing distributes data more evenly.
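Distributed tables expose a _shard_num virtual column, which makes skew easy to measure. A sketch using events_distributed from earlier:

SELECT _shard_num, count() AS row_count
FROM events_distributed
GROUP BY _shard_num
ORDER BY _shard_num;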

Node failure. One node goes down. Replicated tables survive. Distributed queries fail only if a shard has no live replica left. Make sure internal_replication is true in your cluster config. It tells the Distributed engine to write each insert to one replica per shard and let ReplicatedMergeTree do the copying. Without it, the Distributed table writes to every replica itself, and on top of ReplicatedMergeTree that means duplicate data.
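In config terms, that's one flag at the top of each <shard> block, as in the cluster config earlier:

<shard>
    <internal_replication>true</internal_replication>
    <replica>
        <host>clickhouse-01</host>
        <port>9000</port>
    </replica>
    <replica>
        <host>clickhouse-02</host>
        <port>9000</port>
    </replica>
</shard>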

The biggest lesson I've learned: Test failure scenarios before production. Kill a node. Watch replication catch up. Simulate network partitions. Most teams skip this. They learn the hard way during an outage.

How many nodes do I need for a ClickHouse cluster?
Minimum 2 for replication. Minimum 4 for sharding with replication. Most production clusters have 6-12 nodes. According to Instaclustr's tutorial, 3 shards with 2 replicas each is the sweet spot.

Can I add nodes to an existing ClickHouse cluster?
Yes. Add them as new replicas to existing shards. But you cannot add new shards without redistributing data. Plan shard count upfront.

What's the difference between replication and sharding?
Replication copies data across nodes for redundancy. Sharding splits data across nodes for scale. You need both for a production cluster.

Does ClickHouse support automatic failover?
Yes, when using Zookeeper or ClickHouse Keeper. If a node fails, queries route to replicas. Data is not lost. No manual intervention needed.

What sharding key should I use?
Use a column with high cardinality and even distribution. user_id, session_id, or order_id are good candidates. Avoid columns with skewed distributions like status codes.

How long does data replication take?
Depends on data volume and network speed. 100 GB typically replicates in 5-10 minutes on a 10 Gbps network. Initial sync takes longer. Incremental replication is near real-time.

Setting up a ClickHouse cluster isn't magic. It's engineering. Plan your shard count. Configure replication carefully. Test failure scenarios. Monitor merge performance.

Start with a single node. Add replication. Then sharding. Incremental wins.

The three biggest mistakes I see: Skipping replication. Getting the macros config wrong. Forgetting to test failover.

Here's what to do next:

  1. Install ClickHouse on 4 nodes
  2. Configure Zookeeper or Keeper
  3. Set up replication configs
  4. Create distributed tables
  5. Insert test data
  6. Kill a node. Verify failover works.

Your cluster will thank you.

Nishaant Dixit: Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec. Connect on LinkedIn


Originally published at https://sivaro.in/articles/clickhouse-cluster-setup-guide.
