ClickHouse is one of those databases that everyone gets excited about after their first benchmark. It’s absurdly fast, column-oriented, and built for analytics at scale.
It’s also surprisingly easy to plug into an existing stack. You can stream data from Postgres, Mongo, S3, or pretty much anywhere. That’s what makes it so appealing. You hit the point where your Postgres queries start to struggle, you don’t want to rebuild everything, so you drop in ClickHouse and suddenly your dashboard loads in milliseconds.
It’s like putting a turbocharger on your reporting.
ClickHouse is also improving at a crazy pace. Every month they roll out new features, bug fixes and faster queries.
But with speed comes responsibility. ClickHouse is a beast. It’ll reward you when you treat it well, but it’ll bite you if you cut corners.
## Cloud vs self-hosting
Your first big decision is whether to self-host or go with a managed provider.
| | Cloud | Self-Hosted |
|---|---|---|
| Setup time | Minutes | Days/weeks |
| Cost at scale | $$$ | $ + engineer time |
| Backup/HA | Automatic | DIY |
| Headaches | A few, but not hosting related | Many |
| Good for | Most use cases | Cost optimization |
The cloud route is painless. You get uptime, automatic scaling, and zero headaches. The trade-off is cost. ClickHouse Cloud, Altinity Cloud, Tinybird, they all work great, but the bill can sting hard once you start pushing data around. You don’t get server issues, but you pay for the peace of mind.
In ClickHouse Cloud you also don't need to worry about replication; it's handled under the hood, so you don't need to create replicated and distributed tables yourself.
Self-hosting looks easy at first, but it’s not.
You spin up a single node, everything flies, and then one day something fails: merges stall, parts get corrupted, whatever. And that's just the tip of the iceberg.
To handle real production traffic you’ll end up with replicated and distributed tables. You’ll need to decide between vertical scaling, horizontal scaling, or both. Then you start worrying about corrupted parts, cluster topology, and backups.
Running ClickHouse yourself works fine for smaller setups. Once you grow, it’s a full-time job unless you use something like the Altinity ClickHouse Operator on Kubernetes. That operator makes things bearable. You define clusters in YAML, and it handles replication, ZooKeeper (or ClickHouse Keeper), and comes with solid backup strategies. If you ever plan to self-host long-term, start there.
## The dark side of joins
Joins in ClickHouse are not the joins you’re used to. They work, but they’re not “free.”
ClickHouse doesn’t have a full query optimizer like Postgres or MySQL, so it doesn’t plan your joins intelligently. If you join two big tables, it’ll happily load the entire right-hand table into memory and die trying.
You have to think ahead. Filter first, join later.
A few ways to survive:
- Use CTEs or sub-queries to narrow down the joined dataset before the join actually happens.
- Use dictionaries (in-memory lookup tables) for small reference data. They’re insanely fast, but they have to fit in memory.
- Know your sorting keys. ClickHouse relies on them for efficient reads. Bad keys make joins worse.
- Always put the smaller table on the right side of the join; ClickHouse builds the in-memory hash table from the right-hand side.
You'll know when you have a bad join: the query either takes ages or dies trying.
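The "filter first, join later" advice can be sketched like this (table and column names here are hypothetical, not from the post):

```sql
-- Narrow the big table down in a CTE first, then join.
-- Joining the raw tables and filtering afterwards would force
-- ClickHouse to materialize far more data in memory.
WITH recent_events AS (
    SELECT project_id, name, created_at
    FROM events
    WHERE created_at >= now() - INTERVAL 7 DAY
)
SELECT p.name, count() AS event_count
FROM recent_events AS e
INNER JOIN projects AS p ON p.id = e.project_id
GROUP BY p.name;
```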
## Updates, deletes, and the reality of immutability
ClickHouse was never designed for frequent updates or deletes. It’s a write-once, append-forever kind of database. You can’t just run `UPDATE users SET ...` like in Postgres.
To their credit, the ClickHouse team has made big progress here. They’ve added lightweight deletes and updates, and there are new table engines like ReplacingMergeTree and VersionedCollapsingMergeTree that can simulate mutable data. But it still takes extra thought.
You need to design your tables knowing that changing data later is harder. That’s fine for analytics workloads, but painful if you expect relational behavior.
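As a rough sketch of the `ReplacingMergeTree` approach mentioned above (schema is hypothetical): the newest row per sorting key wins, and duplicates are collapsed during background merges.

```sql
-- Rows with the same ORDER BY key are deduplicated on merge,
-- keeping the one with the highest updated_at version.
CREATE TABLE users
(
    id UInt64,
    email String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id;

-- Merges are asynchronous, so to "see" deduplicated data
-- immediately you have to ask for it at read time:
SELECT * FROM users FINAL WHERE id = 42;
```

Note that `FINAL` has a read-time cost, which is part of the "extra thought" the table engines demand.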
I still run into these kinds of issues today. I'm hoping the lightweight updates feature, currently in beta, will make my life easier.
## Inserting data the right way
Here’s the biggest beginner trap.
ClickHouse loves big inserts. It hates small ones.
Every insert triggers background merges, index updates, compression, and part creation. Do that one row at a time and you’ll drown it. Batch your inserts into chunks, ideally thousands of rows at a time. You’ll instantly see CPU drop and throughput skyrocket.
If you’re ingesting data continuously, throw it into a queue and batch it there. That’s what we do at OpenPanel.dev. It smooths out traffic spikes and keeps our ingestion fast and predictable.
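If running your own queue is overkill, ClickHouse can also buffer small inserts server-side with asynchronous inserts (the table and columns below are hypothetical):

```sql
-- async_insert = 1 tells the server to collect small inserts into
-- a buffer and flush them as one part, instead of creating a tiny
-- part per INSERT. wait_for_async_insert = 1 makes the client block
-- until the buffer is actually flushed (safer, slightly slower).
INSERT INTO events (project_id, name, created_at)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('p1', 'page_view', now());
```

This is a reasonable fallback when you can't batch in the application, though client-side batching still gives you the most control.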
## Replication and sharding
This isn’t a bad thing about ClickHouse. In fact, it’s one of its best features.
But I still want to cover a few parts that confused me when I first started using it.
There are three kinds of tables you’ll deal with when setting up replication or sharding:
- Your existing table (usually `MergeTree` or something similar)
- A replicated table (`ReplicatedMergeTree`)
- A distributed table (`Distributed`)
Each of these plays a different role in your cluster, and it’s worth understanding them before you begin.
First, decide how many replicas you want. Most setups use three replicas for high availability. To get replication working, you’ll replace your existing table with a replicated one.
You can do that by creating a new table and swapping `MergeTree` for `ReplicatedMergeTree` in the `ENGINE` clause.
```sql
CREATE TABLE events_replicated ON CLUSTER '{cluster}' (
    ...
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/{installation}/{cluster}/tables/{shard}/openpanel/v1/{table}',
    '{replica}'
)
PARTITION BY toYYYYMM(created_at)
ORDER BY (created_at)
```
Once that’s done, any data written to one node will be replicated to the others. ZooKeeper or ClickHouse Keeper handles the synchronization automatically.
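One thing I'd suggest checking once replication is running: the `system.replicas` table, which exposes per-replica state on every node.

```sql
-- On a healthy cluster, queue_size and absolute_delay should stay
-- near zero and no replica should be stuck in read-only mode.
SELECT database, table, is_readonly, queue_size, absolute_delay
FROM system.replicas;
```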
If you want to move your data from your existing table to the replicated table, you can use `INSERT ... SELECT`:

```sql
INSERT INTO events_replicated SELECT * FROM events;
```
Now let’s look at where sharding and Distributed tables come in.
Sharding is how you scale horizontally by splitting data into smaller chunks and spreading them across nodes. That said, it’s usually better to scale vertically first, because ClickHouse handles vertical scaling surprisingly well.
If you decide to shard, you’ll need to create a distributed table. A distributed table knows where your data lives and redirects queries to the right node.
When creating one, you define how data should be split across nodes. In the example below, the data is sharded using `cityHash64(project_id)`, which spreads rows evenly based on the `project_id`.
```sql
CREATE TABLE events_distributed ON CLUSTER '{cluster}' AS events_replicated
ENGINE = Distributed(
    '{cluster}',
    currentDatabase(),
    events_replicated,
    cityHash64(project_id)
)
```
Now you can query data from any node, and ClickHouse will automatically route the request to where the data actually sits.
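For example, a query through the distributed table might look like this (the filter value is made up). With `optimize_skip_unused_shards` enabled, ClickHouse can even skip shards that can't contain the filtered sharding-key value:

```sql
-- Routed through the Distributed table; the WHERE clause matches
-- the sharding key, so unused shards can be pruned.
SELECT count()
FROM events_distributed
WHERE project_id = 'abc123'
SETTINGS optimize_skip_unused_shards = 1;
```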
If you want to dig deeper, check out the official docs on Distributed tables and Replication.
## So why stick with it?
Because when it works, it’s magic.
At OpenPanel we hit all these issues. Slow inserts, bad joins, tricky replication, and we still use ClickHouse every single day. Once you set it up correctly, nothing else compares. It’s unbelievably fast and scales far beyond what most relational databases can handle.
You just have to respect it. Treat it like a Ferrari, not a Corolla.
If you want me to go deeper into how we deploy and manage our own cluster on Kubernetes using the Altinity operator, let me know in the comments. I can show exactly how we keep it stable and cost-efficient.
Top comments (2)
Thanks for tackling the often-overlooked ClickHouse gotchas. The cloud vs self-hosted trade-offs, join pitfalls, and batching guidance are spot on—great, practical clarity on a niche topic that's easy to gloss over.
I see a lot of "I don't understand the database type, but I still want to use it" in this post.
Maybe the fact that ClickHouse uses SQL makes people think it is a relational database, but it isn't.
SQL is a DSL for writing queries. There are also solutions for writing SQL against MongoDB and Neo4j.
So SQL is not tied to one database type.
But it is still a good post for making people aware of ClickHouse.