
nishaant dixit

Originally published at sivaro.in

ClickHouse High Availability: The Hard Truth Nobody Tells You

I spent six months building what I thought was a bulletproof ClickHouse cluster. Four shards, two replicas each, proper ZK coordination. Then a single network partition took us down for eight hours.

Everyone talks about ClickHouse being fast. They don't talk enough about the high availability problems that hit you at scale.

What is ClickHouse high availability? It's the ability to maintain query and ingest operations when nodes fail. Simple concept. Brutal execution. ClickHouse offers replication through ReplicatedMergeTree engines and ZooKeeper coordination, but this architecture introduces failure modes that catch engineering teams off guard.

Here's what I learned the hard way about ClickHouse high availability problems and how to survive them.

ClickHouse achieves high availability through shared-nothing architecture with data sharding and replication. According to ClickHouse Docs, the core mechanism relies on ReplicatedMergeTree tables that synchronize data across replicas using ZooKeeper for consensus.

The problem? ZooKeeper becomes a single point of failure that most teams underestimate.

I've found that teams confuse "replication" with "high availability." They're not the same thing. Replication keeps data copies. High availability keeps the system running. When your ZooKeeper ensemble gets overloaded, both replicas can become unavailable simultaneously.

Consider a real scenario from "ClickHouse at PB Scale: Drawbacks and Gotchas": a Reddit engineering team running PB-scale ClickHouse discovered that during merge storms, ZooKeeper writes could spike 10x, causing watch triggers to fire thousands of times per second. The entire cluster stalled because replicas couldn't coordinate.

The hard truth about ClickHouse high availability: there's no built-in solution for datacenter failover. Cross-datacenter replication is something you build yourself.

Most people think ZooKeeper handles metadata only. Wrong. ClickHouse writes to ZooKeeper on every insert for replicated tables. According to ClickHouse Docs, high insertion rates generate significant ZooKeeper load.

In my experience, a cluster doing 50K inserts/second can easily overwhelm a three-node ZooKeeper ensemble. The symptoms: inserts queuing, replica lag growing, session timeouts.
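
One mitigation that has worked for me is to stop paying ZooKeeper overhead per tiny insert: batch aggressively on the client, or let the server batch for you with asynchronous inserts. A minimal sketch, assuming the events table defined later in this post:

```sql
-- Server-side batching: many small client inserts are buffered and flushed
-- as one part, so the replicated insert costs one round of ZooKeeper
-- transactions per batch instead of one per tiny insert.
INSERT INTO events
SETTINGS async_insert = 1,          -- buffer small inserts on the server
         wait_for_async_insert = 1  -- acknowledge only after the buffer is flushed
VALUES (generateUUIDv4(), now(), 42, 'page_view', '{}');
```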

Every insert creates part files. ClickHouse merges them in the background. When a merge operation fails on one replica, ZooKeeper marks the mutation as failed across all replicas. Your data doesn't land where it should.

A real incident from The Day Our ClickHouse Cluster Went Silent describes a scenario where a failed merge corrupted the replica state, requiring full resynchronization of 12TB of data.
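
A quick way to spot stuck merges or mutations before they turn into an incident is to look at the replication queue; a sketch:

```sql
-- Entries that keep failing or are being postponed point at broken merges/mutations
SELECT
    database,
    table,
    type,              -- GET_PART, MERGE_PARTS, MUTATE_PART, ...
    num_tries,
    postpone_reason,
    last_exception,
    create_time
FROM system.replication_queue
WHERE num_tries > 10 OR last_exception != ''
ORDER BY create_time;
```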

If network isolation occurs between replicas, both sides accept writes independently. When connectivity returns, ClickHouse has no built-in conflict resolution. You get duplicate data or inconsistent states.

According to ClickHouse High Availability Architecture, properly configured quorum settings can mitigate this, but most teams set quorum incorrectly.

Let me show you the configurations that matter.

```xml
<clickhouse>
    <zookeeper>
        <node>
            <host>zk1.example.com</host>
            <port>2181</port>
        </node>
        <node>
            <host>zk2.example.com</host>
            <port>2181</port>
        </node>
        <node>
            <host>zk3.example.com</host>
            <port>2181</port>
        </node>
        <session_timeout_ms>30000</session_timeout_ms>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <root>/clickhouse</root>
        <identity>user:password</identity>
    </zookeeper>
</clickhouse>
```

Notice the session timeout at 30 seconds. Default is 10 seconds. Too low causes false failures during ZooKeeper GC pauses. Too high means delayed detection of real failures.

```sql
-- Replicated table; note that insert_quorum is a query/profile-level setting,
-- so it is applied at insert time rather than in the table definition.
CREATE TABLE events (
    event_id UUID,
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    payload String
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, event_id);

-- Require acknowledgment from both replicas on every insert
INSERT INTO events
SETTINGS insert_quorum = 2,           -- wait for 2 replicas
         insert_quorum_parallel = 0   -- sequential quorum, needed for consistent reads
VALUES (generateUUIDv4(), now(), 42, 'signup', '{}');

-- Read only data that was written with the quorum
SELECT count() FROM events SETTINGS select_sequential_consistency = 1;
```

I've found that setting insert_quorum = 2 prevents data loss during single replica failures, but it reduces throughput by roughly 30%. Every trade-off has a cost.

```sql
-- Check replication health across all replicated tables
SELECT
    database,
    table,
    is_leader,
    is_readonly,
    is_session_expired,
    parts_to_check,
    future_parts,
    queue_size,
    merges_in_queue,
    absolute_delay,      -- seconds behind the freshest replica
    active_replicas,
    total_replicas
FROM system.replicas
WHERE
    is_readonly = 1
    OR is_session_expired = 1
    OR parts_to_check > 0
    OR future_parts > 100;  -- lag indicator

-- Monitor cumulative ZooKeeper load and wait time on this node
SELECT event, value
FROM system.events
WHERE event IN ('ZooKeeperTransactions',
                'ZooKeeperWaitMicroseconds',
                'ZooKeeperHardwareExceptions');
```

According to ClickHouse scaling and sharding best practices, monitoring these metrics is essential for catching issues before they cause outages.

I've watched teams struggle with ReplicatedMergeTree for years. The alternative that works in production: use software-defined storage like S3 or MinIO with MergeTree on shared disks.

According to How to Handle ClickHouse Data Center Failover (published March 2026), ClickHouse's S3-backed MergeTree engine offers better availability characteristics than ZooKeeper-based replication. When a node fails, another node mounts the same S3 bucket and serves data immediately. No replication lag. No ZooKeeper coordination.

The trade-off? Higher latency per query (30-50ms vs 1-5ms with local SSDs). For time-series analytics, this is acceptable. For real-time dashboards, it hurts.
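
A minimal sketch of what that looks like, assuming a storage policy named s3_main (backed by an S3 or MinIO disk) is already declared in the server's storage configuration:

```sql
-- Plain MergeTree whose parts live on object storage; no ReplicatedMergeTree,
-- no ZooKeeper coordination for this table.
CREATE TABLE events_s3 (
    event_id UUID,
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    payload String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp, event_id)
SETTINGS storage_policy = 's3_main';
```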

Don't run a two-node ZooKeeper ensemble. It gives you no resilience. A three-node ensemble can survive one node failure and still maintain quorum.

The mistake I see repeatedly: teams run ZooKeeper on the same hardware as ClickHouse nodes. According to Getting started with ClickHouse? 13 mistakes and how to avoid them, this creates resource contention that causes outages during merge operations.

Most teams never test failover. They assume replication works. Then a real failure happens.

A hard lesson from my experience: We tested failover by killing one node. Everything worked. Six months later, a disk failure on the ZooKeeper node took the entire cluster down for four hours. We'd never tested that scenario.

What I now recommend:

  1. Test ZooKeeper failures weekly
  2. Test network partitions monthly
  3. Test full datacenter failover quarterly
  4. Measure recovery time objectively (one way is sketched below)
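
For the last item, I watch a replica's delay and queue drain back to zero after bringing it back, rather than trusting gut feel. A minimal sketch:

```sql
-- After restarting a replica, poll until absolute_delay and queue_size return to zero;
-- the elapsed wall-clock time is your measured recovery time.
SELECT
    database,
    table,
    absolute_delay,    -- seconds behind the freshest replica
    queue_size,        -- outstanding replication queue entries
    active_replicas,
    total_replicas
FROM system.replicas
WHERE database = currentDatabase();
```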

Problem: During write spikes, replicas fall behind. Queries return stale data.

Solution: Implement selective quorum writes for critical data.

```sql
-- Two tables with the same schema: one for critical data, one for best-effort
CREATE TABLE critical_events (
    -- same schema
) ENGINE = ReplicatedMergeTree(...);

CREATE TABLE analytics_events (
    -- same schema
) ENGINE = ReplicatedMergeTree(...);

-- insert_quorum is applied per insert (or via a settings profile), not in the table DDL
INSERT INTO critical_events  SETTINGS insert_quorum = 2 VALUES (...);  -- wait for 2 replicas
INSERT INTO analytics_events SETTINGS insert_quorum = 0 VALUES (...);  -- best effort
```

Route critical data (payments, user actions) to the quorum table. Route analytics data (page views, logs) to the best-effort table.

Problem: Java ZooKeeper pauses during garbage collection. ClickHouse sessions expire. Replicas disconnect.

Solution from High Availability and Replication in ClickHouse: Tune ZooKeeper JVM settings for low-latency GC.

```
-server -Xmx4G -Xms4G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=4
-XX:+DisableExplicitGC
-Djute.maxbuffer=4194304
```


Problem: One replica fails. You restore from backup. The restored replica has the old schema. Queries fail with "checksum mismatch" errors.

Solution: Use ClickHouse's built-in replication recovery mechanism.

```sql
-- Force replica resynchronization
SYSTEM RESTART REPLICA my_database.my_table;
SYSTEM SYNC REPLICA my_database.my_table;

-- If that fails, detach and reattach the affected partition
ALTER TABLE my_database.my_table DETACH PARTITION '202401';
ALTER TABLE my_database.my_table ATTACH PARTITION '202401';
```

According to ClickHouse High Availability Cluster, forcing replica sync is often faster than rebuilding from backup.
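
When the replica's metadata in ZooKeeper is gone entirely (for example, after restoring ZooKeeper itself from an old snapshot), newer ClickHouse releases also offer a heavier hammer; a sketch:

```sql
-- Recreates the table's metadata in ZooKeeper from the parts present on local disk.
-- Only applies while the table is stuck in read-only mode because its metadata is missing.
SYSTEM RESTORE REPLICA my_database.my_table;
```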

Q: Is ClickHouse suitable for high availability requirements?
A: Yes, but only with proper configuration. ClickHouse supports replication and ZooKeeper-based coordination, but lacks built-in cross-datacenter failover. Expect to invest significant engineering time in achieving true HA.

Q: Can ClickHouse survive a complete ZooKeeper failure?
A: No. Without ZooKeeper, replicas cannot coordinate. Existing data remains queryable, but new inserts fail and merge operations stop. Recoverability depends on restoring ZooKeeper from backup.

Q: What's the maximum acceptable ZooKeeper latency for ClickHouse?
A: Under 10ms average, under 100ms p99. Higher latencies cause session timeouts and replica disconnections. Monitor ZooKeeper wait time (for example, ZooKeeperWaitMicroseconds in system.events) and set alerts when average latency exceeds 50ms.

Q: Should I use ReplicatedMergeTree or S3-backed MergeTree for HA?
A: ReplicatedMergeTree for low-latency queries (under 10ms). S3-backed MergeTree for simpler operations and better resilience. The trade-off: 30-50ms additional query latency versus reduced complexity.

Q: How many replicas should I configure?
A: Three replicas per shard provides the best balance. Two replicas for minimum redundancy, but can't survive simultaneous failures. Three gives proper quorum without excessive resource usage.

Q: Does ClickHouse support automatic failover?
A: Yes for replica failures within the same datacenter. No for datacenter-level failures without custom tooling. ClickHouse's "multi-datacenter replication" is experimental and requires significant configuration.

Q: What happens to data during a network partition?
A: Both sides accept writes independently. When the partition resolves, ClickHouse cannot automatically resolve conflicts. You may have duplicate data. Preventing this requires proper quorum settings and network infrastructure design.

Q: Can I use ClickHouse without ZooKeeper?
A: Yes, using MergeTree tables or S3-backed storage. You lose replication benefits but gain simplicity. For single-node setups or data that can tolerate loss, this is a valid trade-off.

ClickHouse high availability is achievable, but the defaults will fail you. ZooKeeper bottlenecks, merge failures during stress, and split-brain scenarios are real threats that require deliberate engineering.

Start with monitoring your ZooKeeper performance. Measure session timeouts and request latencies. Then tune your quorum settings based on data criticality. Test failover scenarios—including the ones that scare you.

The teams that survive ClickHouse high availability problems are the ones who plan for failure, not the ones who assume replication will save them.


Nishaant Dixit: Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec. Connect on LinkedIn: https://www.linkedin.com/in/nishaant-veer-dixit


Sources

  1. High availability | ClickHouse Docs
  2. ClickHouse at PB Scale: Drawbacks and Gotchas
  3. Getting started with ClickHouse? 13 mistakes and how to avoid them
  4. How to Handle ClickHouse Data Center Failover
  5. ClickHouse High Availability Architecture
  6. The Day Our ClickHouse Cluster Went Silent
  7. ClickHouse scaling and sharding best practices
  8. High Availability (HA) and Replication in ClickHouse
  9. ClickHouse High Availability Cluster
  10. Scaling | ClickHouse Docs

Originally published at https://sivaro.in/articles/clickhouse-high-availability-the-hard-truth-nobody-tells.
