
nishaant dixit


ClickHouse Production Best Practices: Hard Lessons from the Trenches

I learned this the hard way. Two years ago, we pushed a ClickHouse cluster to production without proper partitioning. Query latency jumped from 50ms to 12 seconds. Our engineering team spent three days debugging. The fix? Adding a proper toYYYYMM(timestamp) partition key. Two lines of SQL. Night and day.

If I could go back and tell my younger self one thing: ClickHouse is brutally honest about your design mistakes.

This guide covers what I've learned running ClickHouse in production across multiple teams and systems. What is ClickHouse? It's a column-oriented OLAP database designed for real-time analytics on large datasets. It crushes PostgreSQL for aggregation queries. But it demands respect for its quirks.

You'll walk away with specific configurations, monitoring strategies, and schema patterns that work at scale.


Most teams treat ClickHouse like PostgreSQL. That's a mistake: under the hood, ClickHouse behaves like an append-only log with indexing optimizations layered on top. The mental model matters.

Here's what changes in production:

  1. Schema design determines query speed – Not indexing, not caching. Partition keys and ORDER BY clauses dictate everything.
  2. Memory is the bottleneck – ClickHouse loves RAM. Starve it, and you get OOM kills during merges.
  3. TTL isn't a set-it-and-forget-it feature – According to ClickHouse TTL in Production, improperly configured TTL can cause disk I/O storms during background merges.

The classic mistake: throwing hardware at slow queries instead of fixing the table schema.

In my experience, the first production incident always involves merges. ClickHouse merges parts in the background. If your insert rate exceeds merge capacity, parts pile up. SELECT queries scan more files. Latency degrades linearly.

Check the system.merges table. If you see pending merges growing, your insert volume is too high for the current merge threads.
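A minimal sketch of that check, using the standard system.merges columns:

-- Watch in-flight merges; a steadily growing backlog means inserts outpace merges
SELECT
    database,
    table,
    elapsed,
    round(progress * 100, 1) AS progress_pct,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
ORDER BY elapsed DESC;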


Why bother with ClickHouse in production? The numbers speak for themselves.

A Reddit user migrating from Redshift reported: "We reduced query time from 45 seconds to 800ms on the same hardware" according to Moving from Redshift to ClickHouse.

The three killer benefits:

  1. Real-time aggregation at scale – Sub-second group by on billions of rows. This isn't theoretical. We process 200K events/sec with average latency under 200ms.
  2. Storage compression – Columnar storage with LZ4 compression gives 5-10x reduction versus row-based stores. Cold data sits on object storage cheaply.
  3. SQL compatibility – Your analysts don't need to learn a new query language. Standard SQL works. Window functions, joins, subqueries—all supported.

Here's the trade-off nobody mentions: Inserts are single-threaded per partition. You can't parallelize writes to the same partition. Design your partitioning strategy with write throughput in mind.


Let's get concrete. These are configurations I've battle-tested across three production clusters.

Bad partition keys cause more production issues than anything else. Too granular, and you have thousands of tiny parts. Too coarse, and queries scan unnecessary data.

-- BAD: Too granular, creates too many parts
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    data String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (toDate(timestamp), event_type, user_id);

-- GOOD: Monthly partitions for most workloads
CREATE TABLE events (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    data String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toDate(timestamp), event_type, user_id);

According to ClickHouse Best Practices, aim for at least one month of data per partition; this gives optimal merge behavior.

Default settings are for development. Production needs tuned values.

CREATE TABLE events_production (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    data String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toDate(timestamp), event_type, user_id)
SETTINGS 
    index_granularity = 8192,
    min_rows_for_wide_part = 100000,
    min_bytes_for_wide_part = 10485760,
    merge_max_block_size = 10000;

The min_rows_for_wide_part and min_bytes_for_wide_part settings control when a part switches from the compact format (all columns in one file) to the wide format (one file per column). Raising them keeps small parts compact, which cuts file-count and metadata overhead during merges.

In my experience, setting index_granularity to 8192 (the default) works for 90% of workloads. Dropping it to 4096 can help point lookups on high-cardinality columns, but it roughly doubles the size of the primary index held in memory.
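One way to sanity-check these choices is to see how much data a typical query actually reads after pruning. A quick sketch against the events_production table above; the date and event type are placeholder values:

EXPLAIN indexes = 1
SELECT count()
FROM events_production
WHERE timestamp >= '2024-06-01 00:00:00'
  AND event_type = 'page_view';

If a narrow time filter still selects most parts and granules, the partition key or ORDER BY needs rethinking.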

You need visibility into running queries. Here's the query I keep in my monitoring dashboard:

-- Find long-running queries
SELECT 
    query_id,
    query,
    elapsed,
    read_rows,
    read_bytes,
    memory_usage
FROM system.processes
WHERE elapsed > 5
ORDER BY elapsed DESC;

According to ClickHouse Production Monitoring Tips, watching system.processes and system.query_log is the first line of defense against production incidents.
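system.processes only shows what is running right now; system.query_log keeps history. A sketch of the companion query I'd pair with it, using standard query_log columns (the one-day window is arbitrary):

-- Slowest completed queries over the last day
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(memory_usage) AS peak_memory,
    substring(query, 1, 120) AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 20;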

TTL runs in background merges. If you configure it wrong, your merge thread pool fills up, and regular merges stall.

-- Safe TTL strategy: move to cold storage first, delete later.
-- Both rules go in one MODIFY TTL, because each MODIFY TTL replaces
-- the table's entire TTL clause rather than appending to it.
ALTER TABLE events
    MODIFY TTL
        timestamp + INTERVAL 90 DAY TO DISK 'cold_storage',
        timestamp + INTERVAL 365 DAY DELETE;

This two-step approach moves old data to cheaper storage before deletion. It prevents the I/O storm of deleting everything at once. According to ClickHouse TTL in Production, staging TTL operations with disk moves reduces peak I/O by 60%.
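To verify the moves actually happen, check which disk each partition's parts sit on. A sketch against system.parts, assuming the events table and the cold_storage disk above:

SELECT
    partition,
    disk_name,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition, disk_name
ORDER BY partition;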


ClickHouse's Sizing and hardware recommendations are clear: use at least 2GB RAM per CPU core. I've seen teams ignore this and pay the price during heavy aggregation.

The formula I use:

  • Memory: 2-4GB per core for OLAP workloads
  • Storage: NVMe SSDs for hot data, SATA SSDs for warm, object storage for cold
  • CPU: More cores matter for parallel query processing

ClickHouse clusters communicate during distributed queries. Bad network config causes query failures.

<!-- In config.xml -->
<remote_servers>
    <production_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>clickhouse-01</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>clickhouse-02</host>
                <port>9000</port>
            </replica>
        </shard>
    </production_cluster>
</remote_servers>

Set internal_replication to true. With it, a Distributed table writes each block to a single replica and lets the ReplicatedMergeTree engine handle replication, instead of the Distributed layer writing the same data to every replica itself.
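For context, here is a hedged sketch of the table pair that usually sits on top of this config: a ReplicatedMergeTree local table plus a Distributed table over the cluster. The table names, ZooKeeper path, and {shard}/{replica} macros are placeholders for whatever your deployment defines:

CREATE TABLE events_local ON CLUSTER production_cluster (
    timestamp DateTime,
    user_id UInt64,
    event_type String,
    data String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toDate(timestamp), event_type, user_id);

-- Query and insert through the Distributed table; rand() spreads writes across shards
CREATE TABLE events_all ON CLUSTER production_cluster AS events_local
ENGINE = Distributed(production_cluster, currentDatabase(), events_local, rand());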


Every team asks: "Should we use ClickHouse or [insert database]?"

Here's the honest answer: ClickHouse excels at read-heavy analytical workloads with append-only data. It struggles with:

  • Frequent updates and deletes (use the ReplacingMergeTree engine with caution)
  • Transactional workloads (it's not an OLTP database)
  • Single-row lookups by primary key (Druid or Elasticsearch are better)

I've found that teams who fail with ClickHouse try to force it into a transactional role. Don't. Use it for what it's built for: fast aggregation on immutable event data.

ClickHouse writes are not free. Each insert creates a new part. Background merges combine parts. But merges consume I/O and CPU.

Solution: batch inserts. Ten rows per batch is terrible; 100K rows per batch is a good target. Use a single INSERT INTO table VALUES (...), (...), (...) statement, as sketched below, or use the native protocol for bulk loads of 1M+ rows per batch.
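A toy sketch of the shape; the column values are made up, and in practice a producer buffers rows and flushes every ~100K rows or every few seconds:

INSERT INTO events (timestamp, user_id, event_type, data) VALUES
    (now(), 1001, 'page_view', '{"path":"/home"}'),
    (now(), 1002, 'click', '{"button":"signup"}'),
    (now(), 1003, 'page_view', '{"path":"/pricing"}');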


Symptoms: Query latency climbing. Part count in system.parts growing. Background merges never catch up.

Root cause: inserting too many tiny batches, or a partition key that's too granular.

Fix:

  1. Reduce insert frequency. Batch to 100K+ rows per insert.
  2. Temporarily increase background_pool_size in config.xml.
  3. Merge parts manually with OPTIMIZE TABLE events FINAL (sparingly; it rewrites every part in the table).
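To confirm the diagnosis before and after these fixes, count active parts per partition. A sketch, assuming the events table from earlier:

SELECT
    partition,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY active_parts DESC;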

ClickHouse holds aggregation state in memory. A GROUP BY on 10 billion rows with 100 columns needs serious RAM.

Fix:

  • Set max_memory_usage to cap memory per query
  • Use SELECT ... GROUP BY ... WITH TOTALS sparingly
  • Filter early with WHERE clauses; don't scan data you don't need (see the sketch below)
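A sketch of the first and last points together; the 10 GB cap and the one-day window are placeholder values to tune for your hardware:

SET max_memory_usage = 10000000000; -- ~10 GB per query

SELECT event_type, count() AS events
FROM events
WHERE timestamp >= now() - INTERVAL 1 DAY  -- prune partitions before aggregating
GROUP BY event_type;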

ClickHouse replication is asynchronous, so replicas can fall behind under heavy load.

Monitor with:

SELECT 
    database,
    table,
    absolute_delay,
    total_replicas,
    active_replicas
FROM system.replicas;

If absolute_delay exceeds 60 seconds, check network bandwidth or disk I/O on replicas.


Q: What's the ideal partition key for time-series data?
A: Use toYYYYMM(timestamp) for most workloads. This creates monthly partitions. It balances write performance (not too many parts) with query pruning (not too coarse).

Q: How many replicas do I need for production?
A: At least two. One primary, one replica. For high availability across data centers, three replicas spread across two zones. This handles node failures without data loss.

Q: Can ClickHouse handle concurrent read and write?
A: Yes. Writes are append-only. Reads still work during writes because ClickHouse never locks the entire table. However, heavy merges during reads may increase query latency.

Q: What's the maximum table size before performance degrades?
A: There's no hard limit. I've seen 10TB tables with sub-second queries. The key is proper partitioning and ORDER BY design. Bad schema breaks at any size.

Q: Should I use Nullable columns?
A: Avoid them. Nullable columns create additional files (the null bitmap). They also disable some optimizations like LowCardinality type inference. Use default values instead.
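A hedged sketch of what "default values instead" looks like; the table and default values are illustrative:

CREATE TABLE events_no_nulls (
    timestamp DateTime,
    user_id UInt64,
    event_type LowCardinality(String) DEFAULT 'unknown',  -- instead of Nullable(String)
    referrer String DEFAULT ''
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (toDate(timestamp), event_type, user_id);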

Q: How do I handle updates to existing rows?
A: Use ReplacingMergeTree with a version column. But accept that updates are eventually consistent. ClickHouse isn't built for real-time row-level updates.
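A minimal sketch of that pattern; the table and column names are hypothetical:

CREATE TABLE user_profiles (
    user_id UInt64,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)  -- updated_at acts as the version column
ORDER BY user_id;

-- Until a background merge deduplicates, reads must pick the latest version themselves
SELECT user_id, argMax(email, updated_at) AS email
FROM user_profiles
GROUP BY user_id;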

Q: What happens if the disk fills up?
A: ClickHouse stops accepting writes. Reads still work. Monitor disk usage with alerts at 80%, 90%, and 95%. Set up TTL to delete old data automatically.
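You can watch disk headroom from inside ClickHouse itself. A sketch against system.disks:

SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks;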

Q: Is ClickHouse good for real-time analytics?
A: Yes. With sub-second aggregation on billions of rows. But "real-time" means seconds, not milliseconds. For millisecond latency, use Redis or Memcached for hot data.


ClickHouse in production isn't magic. It's careful schema design, disciplined monitoring, and understanding that no technology is perfect.

Three takeaways:

  1. Partition by month. Order by timestamp. This is the foundation.
  2. Batch inserts to 100K+ rows. Your merge threads will thank you.
  3. Monitor system.merges and system.processes religiously. Incidents announce themselves here first.

Next step: Audit your current ClickHouse schema. Check partition granularity. Review TTL configurations. Run the monitoring queries I showed you. Fix one thing today.


Nishaant Dixit is the founder of SIVARO, a product engineering company specializing in data infrastructure and production AI systems. Since 2018, he has built systems processing 200K events/sec and managing 50TB+ datasets. Connect on LinkedIn.


ClickHouse Docs — Going to Production: https://clickhouse.com/docs/use-cases/observability/clickstack/production

ClickHouse Blog — Top 10 Best Practices: https://clickhouse.com/blog/10-best-practice-tips

Rakesh Therani — Building Production-Ready Clusters: https://medium.com/@rakesh.therani/building-production-ready-clickhouse-clusters-a-complete-configuration-generator-45a52e8e5ff3

ClickHouse Docs — Best Practices: https://clickhouse.com/docs/best-practices

Instaclustr — Mastering ClickHouse Best Practices: https://www.instaclustr.com/blog/mastering-clickhouse-best-practices-infrastructure-and-operational-excellence/

Big Data Boutique — Production Monitoring Tips: https://bigdataboutique.com/blog/clickhouse-production-monitoring-and-optimization-tips-9b26bc

ClickHouse Docs — Sizing and Hardware Recommendations: https://clickhouse.com/docs/guides/sizing-and-hardware-recommendations

Zeka Graphic — Deploying ClickHouse in Production: https://www.zekagraphic.com/deploying-and-managing-clickhouse-databases-in-production/

Aliakbar Hosseinzadeh — ClickHouse TTL in Production: https://medium.com/@aliakbarhosseinzadeh/clickhouse-ttl-in-production-a-safe-strategy-for-data-retention-and-disk-optimization-9f1546fe673f

Reddit r/Clickhouse — Moving from Redshift to ClickHouse: https://www.reddit.com/r/Clickhouse/comments/1o0b931/moving_from_redshift_to_clickhouse_looking_for/


Originally published at https://sivaro.in/articles/clickhouse-production-best-practices-hard-lessons-from-the.
