nishaant dixit

Posted on • Originally published at sivaro.in

ClickHouse Scalability Limits: Where It Breaks and What You Can Do

I almost lost a cluster once. 50 terabytes of data. Production down. The problem? I didn't respect ClickHouse scalability limits.

Here's the hard truth: ClickHouse is incredibly fast. But it's not infinite. Every system has boundaries. ClickHouse's are just better hidden than most.

What is ClickHouse scalability? It's the ability to grow your ClickHouse deployment from a single node to a petabyte-scale cluster without performance degradation. The limits aren't where you think they are—not in raw throughput, but in schema design, query patterns, and cluster topology.

By the end of this guide, you'll know exactly where ClickHouse breaks, how to plan around those breaking points, and which trade-offs are worth making. I'll share lessons from operating clusters at 100B+ rows and what the latest 2025/2026 research reveals about pushing past these boundaries.


Most people think ClickHouse slows down because of data volume. Wrong. The real ClickHouse scalability limits are structural.

Partition explosion is the silent killer. According to ClickHouse official docs, you should keep partitions under 50 GB each. I've seen teams with 10,000+ partitions on a single table. The merge scheduler stops. Inserts queue up. Queries time out.

The math is brutal. Each partition creates background merge tasks. With 50 partitions, you're fine. With 500, merges start competing. With 5,000, your system spends more time merging than serving queries.

In my experience, the partition count limit hits before any CPU or memory limit does. I've found that teams using daily partitioning on high-volume event streams hit this wall at around 6 months of data retention.

The real problem isn't storage. It's the metadata overhead. Each partition requires tracking in ZooKeeper (or ClickHouse Keeper). Each part needs coordination across replicas. The coordination layer becomes the bottleneck long before disk space runs out.
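
One early-warning signal worth watching is the number of merges in flight. This is a quick check I run (a sketch against the system.merges table; what counts as "too many" depends on your disks and merge pool size):

-- Background merges currently running, per table
SELECT
    database,
    table,
    count() AS running_merges,
    sum(num_parts) AS parts_being_merged
FROM system.merges
GROUP BY database, table
ORDER BY running_merges DESC;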


Let's talk concrete limits. ClickHouse isn't vague about these—and they matter more than raw throughput numbers.

The ClickHouse knowledge base on maximum tables and databases recommends:

  • Maximum databases: 100 per server
  • Maximum tables: 1000 per server
  • Maximum partitions per table: 1000
  • Maximum parts per partition: 100

These aren't hard locks. They're "you're going to have a bad time" thresholds. Cross 1000 partitions and your merges become unpredictable.
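
A quick way to see where a server stands against those numbers (just counting the system tables; excluding the built-in system databases is my own choice):

-- How many databases and tables does this server actually have?
SELECT
    (SELECT count() FROM system.databases) AS databases,
    (SELECT count() FROM system.tables
     WHERE database NOT IN ('system', 'INFORMATION_SCHEMA', 'information_schema')) AS tables;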

I've operated clusters with 500 tables across 50 databases. The ZooKeeper znodes exploded. Simple ALTER TABLE statements took minutes. The solution? Consolidate. Use a single wide table with proper partitioning instead of 50 narrow tables.

Here's how to check your current limits:

-- Check partition and part counts across all tables
SELECT
    database,
    table,
    uniqExact(partition) AS partition_count,
    count() AS part_count,
    sum(rows) AS total_rows
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY partition_count DESC
LIMIT 20;
-- Check part count within each partition (the silent killer)
SELECT 
    database,
    table,
    partition,
    count() AS part_count,
    sum(rows) AS rows_in_partition
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
HAVING part_count > 50
ORDER BY part_count DESC;

The second query catches the problem before it crashes your cluster. A single partition with 200+ parts means your merge scheduler is drowning.
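
There's also a hard stop behind this: once a single partition crosses the parts_to_throw_insert threshold, inserts start failing outright with "Too many parts" errors. You can check what your server is configured for (these are MergeTree settings and can also be overridden per table):

-- Server-wide thresholds for delaying and rejecting inserts
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');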


According to a 2026 guide on scaling ClickHouse to billions of rows, the key insight is separating compute from storage. Most engineers try to scale vertically first. That's expensive and has hard limits.

The right approach: Shard your data intelligently, then use parallel replicas within each shard.

The ClickHouse blog on parallel replicas shows how they scaled raw GROUP BY to 100 billion+ rows in under 28 seconds. The trick? Parallel replicas that split work across nodes without shuffling all data.

Here's roughly what that configuration looks like. Parallel replicas are driven by query-level settings, so the cleanest place to pin them is a user profile (exact setting names shift a little between versions):

<!-- users.xml: profile settings that enable parallel replicas -->
<clickhouse>
    <profiles>
        <default>
            <!-- older releases call this allow_experimental_parallel_reading_from_replicas -->
            <enable_parallel_replicas>1</enable_parallel_replicas>
            <max_parallel_replicas>10</max_parallel_replicas>
            <cluster_for_parallel_replicas>events_cluster</cluster_for_parallel_replicas>
        </default>
    </profiles>
</clickhouse>

In my experience, most teams don't need 100 billion rows. They need predictable performance at 1-10 billion. That's achievable with a 3-shard, 2-replica cluster if you get the schema right.
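
If you go that route, confirm that the topology ClickHouse actually sees matches what you think you deployed (the cluster name here matches the one used in the sharding example later):

-- Shards and replicas as defined in remote_servers
SELECT cluster, shard_num, replica_num, host_name, is_local
FROM system.clusters
WHERE cluster = 'events_cluster'
ORDER BY shard_num, replica_num;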

The hard truth about scaling: don't lead your ORDER BY with high-cardinality columns. I learned this the hard way with a 50TB cluster. The sort key determines how data is physically ordered on disk, and therefore what the primary index can skip. Pick wrong, and every query degenerates into a full scan.

-- Bad: high cardinality sort key scans everything
CREATE TABLE events_bad (
    event_id UUID,
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    payload String
) ENGINE = MergeTree()
ORDER BY (event_id);  -- Unique key = no data skipping

-- Good: low cardinality first
CREATE TABLE events_good (
    event_id UUID,
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    payload String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (event_type, toDate(timestamp), user_id);

The difference is massive. With event_id as the sort key, any query that doesn't filter on that exact UUID reads essentially the whole table. With event_type and date leading the sort key, you skip 90% of the data automatically.
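
You don't have to take that on faith: EXPLAIN shows how much the primary index prunes. A sketch against the events_good table above (the filter values are made up):

-- Shows which parts and granules survive partition and primary-key pruning
EXPLAIN indexes = 1
SELECT count()
FROM events_good
WHERE event_type = 'page_view'
  AND timestamp >= toDateTime('2025-01-01 00:00:00')
  AND timestamp <  toDateTime('2025-02-01 00:00:00');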


Let me share what a Reddit thread from engineers running ClickHouse at petabyte scale reveals. These are the pain points that official documentation underplays.

1. ZooKeeper becomes your bottleneck
At PB scale, the coordination layer can't keep up. Each merge operation requires consensus. Each part registration writes to ZooKeeper. At 500+ parts per node, ZooKeeper starts queueing writes.

2. Backups are impractical
You can't snapshot 1PB daily. Point-in-time recovery takes hours. Most teams I know rely on replication, not backups. That's risky.

3. Query memory limits hit hard
The usage limits documentation warns about max_memory_usage. At PB scale, a single aggregation query can consume 50GB+ of memory. One runaway query takes down the node.

Here's my config for preventing memory disasters:

<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>30000000000</max_memory_usage> <!-- 30GB -->
            <max_bytes_before_external_group_by>20000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>20000000000</max_bytes_before_external_sort>
            <max_query_size>2097152</max_query_size> <!-- 2MB max query text size -->
        </default>
        <readonly>
            <max_memory_usage>15000000000</max_memory_usage> <!-- 15GB -->
            <readonly>1</readonly>
        </readonly>
    </profiles>
</clickhouse>

In my experience, the max_bytes_before_external_group_by setting is crucial. Without it, ClickHouse tries to hold all GROUP BY state in memory. With it, data spills to disk. Queries get slower but don't crash.
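
If changing the profile isn't an option, the same limits can be attached to an individual query (a sketch; the table comes from the earlier example):

-- Per-query caps: this aggregation spills to disk instead of blowing past 30GB
SELECT user_id, count() AS events
FROM events_good
GROUP BY user_id
SETTINGS max_memory_usage = 30000000000,
         max_bytes_before_external_group_by = 20000000000;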


The severalnines guide on ClickHouse sharding outlines what works in production. Here's my condensed version based on 5 years of cluster operations.

Don't shard until you must
Single-node ClickHouse handles up to 10TB efficiently. Adding shards adds complexity: data distribution, cross-shard joins, and rebalancing headaches. Hold off on the first shard until you're approaching the 5-10TB range.

When you do shard, hash on a stable key
Naive or poorly chosen sharding keys lead to data skew. Within 3 months, one shard holds 40% of your data while another holds 10%. Hashing a well-distributed key such as user_id keeps shards balanced and keeps each user's rows on a single shard.

-- Use sharding_key for even distribution
CREATE TABLE events_distributed ON CLUSTER 'events_cluster' AS events_local
ENGINE = Distributed(
    'events_cluster',
    'default',
    'events_local',
    cityHash64(user_id)  -- Even distribution via hash
);
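
Once the Distributed table is in place, check the balance every so often. A sketch using clusterAllReplicas (it assumes the local table is named events_local in the default database, as above):

-- Active rows per host for the table behind the Distributed engine
SELECT hostName() AS host, sum(rows) AS total_rows
FROM clusterAllReplicas('events_cluster', system.parts)
WHERE database = 'default' AND table = 'events_local' AND active
GROUP BY host
ORDER BY total_rows DESC;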

Replication adds latency
Every insert has to propagate to every replica. At high throughput (100K+ events/sec), replication lag becomes visible. The ClickHouse docs point to quorum inserts (insert_quorum) when you need stronger delivery guarantees, but they cut write throughput.

I've found that asynchronous replication (the default) works fine for analytics workloads. Reserve quorum writes for the tables you genuinely can't afford to lose.
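
When a table really does need those guarantees, quorum writes are the usual lever. A sketch (the column list is hypothetical, and insert_quorum = 2 assumes two replicas per shard):

-- The INSERT only succeeds once 2 replicas have confirmed the data
INSERT INTO events_local (timestamp, event_type, user_id)
SETTINGS insert_quorum = 2, insert_quorum_timeout = 60000
VALUES (now(), 'page_view', 42);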


The Altinity knowledge base on ClickHouse limitations has a great line: "Too many columns is a real problem." They recommend keeping columns under 1000 per table.

I've seen schemas with 4000+ columns. Someone exported their entire event schema into a single ClickHouse table. Every SELECT * query scanned gigabytes, and because MergeTree stores each column in its own files within every part, that many columns makes parts huge and merges expensive.

The rule of thumb: If your table has more than 200 columns, split it. Use materialized views for aggregation. Store infrequently queried columns in a separate table.

-- Narrow table for high-frequency queries
CREATE TABLE events_fast (
    timestamp DateTime,
    event_type String,
    user_id UInt64,
    duration_ms UInt32
) ENGINE = MergeTree()
ORDER BY (event_type, timestamp);

-- Wide table for full exports
CREATE TABLE events_full (
    timestamp DateTime,
    event_type String,
    user_id UInt64,
    payload String,
    headers String,
    device_info String
    -- 50+ more columns
) ENGINE = MergeTree()
ORDER BY (timestamp);

Queries against events_fast run 10x faster. The wide table exists for infrequent data dumps.
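
If the narrow table should stay in sync with the wide one automatically, a materialized view handles the fan-out. A minimal sketch wiring the two tables above together (the JSON extraction assumes duration lives in the payload, which is hypothetical):

-- Every insert into events_full also lands, trimmed down, in events_fast
CREATE MATERIALIZED VIEW events_fast_mv TO events_fast AS
SELECT
    timestamp,
    event_type,
    user_id,
    toUInt32(JSONExtractUInt(payload, 'duration_ms')) AS duration_ms
FROM events_full;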


The ClickHouse horizontal scaling documentation makes it sound straightforward. Add nodes. Rebalance. Done.

Reality is messier.

Vertical scaling (bigger nodes) works up to 1TB RAM nodes. Beyond that, the law of diminishing returns hits. A 2TB RAM node costs 3x a 512GB node but doesn't deliver 3x performance. Memory bandwidth becomes the bottleneck.

Horizontal scaling (more nodes) hits the coordination bottleneck. Each node needs to communicate with ZooKeeper. At 20+ nodes, the overhead of cross-node queries grows.

The sweet spot? 8-12 nodes with 256GB-512GB RAM each. This gives you 2-6TB of usable memory with reasonable coordination overhead.

Tinybird's blog on operating ClickHouse at scale distills five years of petabyte-scale operations. Their key insight: "Size your RAM for the hot dataset, not the cold one." Hot data that's queried frequently should fit in memory. Cold data sits on disk and uses sparse indexes.


How many partitions can a ClickHouse table have?
The recommended limit is 1000 partitions per table. Beyond that, merge operations slow down and ZooKeeper load increases significantly. The official docs warn against exceeding this.

Can ClickHouse handle 100 billion rows?
Yes. With proper sharding and parallel replicas, ClickHouse processes 100+ billion rows in seconds. The parallel replicas feature enables this by splitting queries across nodes efficiently.

What happens when a query runs out of memory?
Queries fail with "Memory limit exceeded" errors. To prevent this, configure max_memory_usage and max_bytes_before_external_group_by. This forces ClickHouse to spill to disk instead of crashing.

How many tables can one ClickHouse server hold?
The recommended limit is 1000 tables per server. More tables means more metadata in ZooKeeper, slower SHOW TABLES queries, and increased memory usage for table schemas.

Should I partition by day or by month?
Monthly partitioning is preferred for most analytics workloads. Daily partitioning creates 30x more partitions, leading to partition explosion within 6 months. Only use daily partitions if your queries prune by exact dates.

Can a single node handle terabytes of data?
Yes. Single-node ClickHouse handles up to 10TB efficiently. Vertical scaling (more RAM, more CPU) works well until around 512GB RAM. Beyond that, sharding provides a better cost-to-performance ratio.

How many columns are too many?
Anything over 200 columns significantly slows down SELECT * queries and increases part file sizes. Use narrow tables for frequent queries and separate wide tables for infrequent data exports.

Why does ZooKeeper become the bottleneck?
Each partition creation, merge operation, and replica registration requires ZooKeeper writes. At 500+ parts per node, ZooKeeper starts queuing operations, causing insert and merge delays.


ClickHouse scalability limits aren't about raw data volume. They're about schema design, partition strategy, and cluster topology.

Three things you can do today:

  1. Run the partition and part count queries I shared above. If you're over 500 partitions or 50 parts per partition, restructure.
  2. Check your sort keys. If you're using high-cardinality columns like UUIDs, change them to low-cardinality + date combinations.
  3. Set memory limits in your config. Don't let one runaway query take down your cluster.

The systems I've seen fail at scale fail because someone ignored these limits, not because ClickHouse couldn't handle the data. Respect the boundaries. Plan for them. Your production environment will thank you.


Nishaant Dixit: Founder of SIVARO. Building data infrastructure and production AI systems since 2018. Built systems processing 200K events/sec. Connect on LinkedIn


Sources:

  1. ClickHouse Usage Limits
  2. ClickHouse Horizontal Scaling
  3. Maximum Tables and Databases
  4. How to Scale ClickHouse to Billions of Rows (2026)
  5. ClickHouse at PB Scale: Drawbacks and Gotchas
  6. ClickHouse Scaling and Sharding Best Practices
  7. Operating ClickHouse at Scale (5 Years of Learnings)
  8. ClickHouse Limitations - Altinity KB
  9. Parallel Replicas - ClickHouse Blog
  10. Automatic Scaling - ClickHouse Docs

Originally published at https://sivaro.in/articles/clickhouse-scalability-limits-where-it-breaks-and-what-you.
