In this part, I will cover ClickHouse table engines. Like any other database, ClickHouse uses engines to determine how a table's data is stored, replicated, and accessed concurrently. Every engine has pros and cons, and you should choose one based on your needs. Engines are grouped into families that share their main features. Since this is a practical article, I will dive into the most commonly used engines in each family and leave the others to your interest.
Now, let's start with the first and most usable family:
Merge-Tree Family
As mentioned, this is the most common choice when you create a table in ClickHouse. It's based on the Log-Structured Merge-Tree (LSM tree) data structure. LSM trees are optimized for write-intensive workloads: they handle a large volume of writes by buffering them in memory and periodically flushing them to disk in sorted order. This allows fast writes of massive amounts of data and reduces the likelihood of disk fragmentation. They are considered an alternative to the B-Tree data structure, which is common in traditional relational databases like MySQL.
Note: For every engine in this family, you can add the Replicated prefix to the engine name to replicate the table across your ClickHouse nodes.
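For example, a replicated variant could look like the following sketch; the table name, the ZooKeeper path, and the {shard}/{replica} macros are illustrative and depend on your cluster configuration:
CREATE TABLE events
(
    `id` Int32,
    `payload` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY id;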
Now let's investigate common engines in this family.
MergeTree
Here is an example of a merge-tree DDL:
CREATE TABLE inventory
(
`id` Int32,
`status` String,
`price` String,
`comment` String
)
ENGINE = MergeTree
PRIMARY KEY (id, price)
ORDER BY (id, price, status)
Merge-tree tables use sparse indexing to optimize queries.
Briefly, with sparse indexing, data is split into multiple parts. Every part is sorted by the ORDER BY keys (referred to as sort keys), where the first key has the highest priority in sorting. Every part is then broken down into groups called granules, and the primary-key values at each granule's boundary are stored as marks. Since these marks are extracted from sorted data, the primary key must be a prefix of the sort keys. For every query that filters on the primary key, ClickHouse performs a binary search on the marks to find the target granules as quickly as possible. Finally, ClickHouse loads the target granules into memory and scans them for matching rows.
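If you want to check how the sparse index is applied to a specific query, EXPLAIN with indexes = 1 reports which parts and granules are selected (the exact output depends on your ClickHouse version):
EXPLAIN indexes = 1
SELECT * FROM inventory WHERE id = 23 AND price = '1000';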
Note: You can omit PRIMARY KEY in the DDL, and ClickHouse will use the sort keys as the primary key.
ReplacingMergeTree
DDL
In this engine, rows with equal sort keys are deduplicated: during background merges, only the last inserted row is kept. Consider the table below:
CREATE TABLE inventory
(
`id` Int32,
`status` String,
`price` String,
`comment` String
)
ENGINE = ReplacingMergeTree
PRIMARY KEY (id)
ORDER BY (id, status);
Suppose that you insert a row in this table:
INSERT INTO inventory VALUES (23, 'success', '1000', 'Confirmed');
Now let's insert another row with the same sort keys:
INSERT INTO inventory VALUES (23, 'success', '2000', 'Cancelled');
Now the latter row will replace the previous one. Note that if you select these rows, you may still see both of them:
SELECT * from inventory WHERE id=23;
That's because ClickHouse performs the replacement while merging parts, which happens asynchronously in the background rather than immediately. To see the final result right away, you can use the FINAL modifier:
SELECT * from inventory FINAL WHERE id=23;
Note: You can specify a version column when defining the table; ClickHouse then keeps the row with the highest version instead of the last inserted one.
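A minimal sketch of such a table, assuming an extra UInt64 column named version that acts as the version marker:
CREATE TABLE inventory_versioned
(
    `id` Int32,
    `status` String,
    `price` String,
    `comment` String,
    `version` UInt64
)
ENGINE = ReplacingMergeTree(version)
PRIMARY KEY (id)
ORDER BY (id, status);
During merges, ClickHouse keeps the row with the highest version for every sort key instead of the last inserted one.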
Usage
ReplacingMergeTree is widely used for deduplication. Since ClickHouse performs poorly with frequent updates, you can instead "update" a row by inserting a new one with the same sort keys, and ClickHouse will remove the stale rows in the background. However, updating the sort keys themselves is challenging, because the old rows won't be deleted in that case. For that scenario, you can use CollapsingMergeTree, explained in the next part.
CollapsingMergeTree
In this engine, you define a sign column and ask the database to delete stale rows by inserting them again with sign = -1, while keeping the new row inserted with sign = 1.
DDL
CREATE TABLE inventory
(
`id` Int32,
`status` String,
`price` String,
`comment` String,
`sign` Int8
)
ENGINE = CollapsingMergeTree(sign)
PRIMARY KEY (id)
ORDER BY (id, status);
Let's insert a row in this table:
INSERT INTO inventory VALUES (23, 'success', '1000', 'Confirmed', 1);
Now to update the row:
INSERT INTO inventory VALUES (23, 'success', '1000', 'Confirmed', -1), (23, 'success', '2000', 'Cancelled', 1);
To see the results:
SELECT * FROM inventory;
To see the final results immediately:
SELECT * FROM inventory FINAL;
Usage
CollapsingMergeTree can handle updates and deletes in a more controlled manner. For example, you can update sort keys by inserting the old row again with sign = -1 and the row with the new sort keys with sign = 1. There are two challenges with this engine:
1. Since you need to insert a copy of the old row (the one originally written with sign = 1) with sign = -1, you first have to retrieve it from the database or another data store.
2. If multiple rows with sign equal to 1 or -1 are inserted, accidentally or deliberately, you may face unwanted results. That's why you should consider all the situations explained here.
AggregatingMergeTree
Using this engine, you can materialize the aggregation of a table into another one.
DDL
Consider this inventory2 table. We want the maximum price for every item id and the sum of its number of items in another table.
CREATE TABLE inventory2
(
`id` Int32,
`status` String,
`price` Int32,
`num_items` UInt64
) ENGINE = MergeTree ORDER BY (id, status);
Now let's materialize its results into another table via AggregatingMergeTree:
CREATE MATERIALIZED VIEW agg_inventory
(
`id` Int32,
`max_price` AggregateFunction(max, Int32),
`sum_items` AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree() ORDER BY (id)
AS SELECT
id,
maxState(price) as max_price,
sumState(num_items) as sum_items
FROM inventory2
GROUP BY id;
Now let's insert rows into the source table and query the aggregated results:
INSERT INTO inventory2 VALUES (3, 'success', 100, 2), (3, 'success', 500, 4);
SELECT id, maxMerge(max_price) AS max_price, sumMerge(sum_items) AS sum_items
FROM agg_inventory WHERE id=3 GROUP BY id;
Usage
This engine helps you reduce the response time of heavy, fixed analytics queries by computing them at write time. It also reduces the database load at query time.
Log Family
These are lightweight engines with minimal functionality. They're most effective when you need to quickly write many small tables (up to approximately 1 million rows) and read them later. There are no indexes in this family, but the Log and StripeLog engines can break data down into multiple blocks to support multi-threaded reads.
I will only look into the TinyLog engine. To check the others, you can visit this link.
TinyLog
This engine is mainly used in a write-once fashion, i.e., you write the data once and read it as often as you want. Since ClickHouse reads this data in a single stream, it's better to keep the table size up to about 1M rows.
CREATE TABLE log_location
(
    `id` Int32,
    `long` String,
    `lat` Int32
) ENGINE = TinyLog;
Usage
You can use this engine as intermediate storage for batch operations.
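For instance, a batch job could write coordinates once and read them back later; the values below are purely illustrative:
INSERT INTO log_location VALUES (1, '7.4653', 51);
SELECT * FROM log_location;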
Integration Family
The engines in this family are widely used to connect to other databases and message brokers, with the ability to fetch or insert data.
I'll cover MySQL and Kafka Engines, but you can study the others here.
MySQL Engine
With this engine, you can connect with a MySQL database through ClickHouse and read its data or insert rows.
CREATE TABLE mysql_inventory
(
    `id` Int32,
    `price` Int32
)
ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password');
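Once the table is defined, queries against mysql_inventory are forwarded to the remote MySQL table; for example (hypothetical data):
SELECT * FROM mysql_inventory WHERE price > 100;
INSERT INTO mysql_inventory VALUES (42, 250);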
Kafka Engine
Using this engine, you can connect to a Kafka cluster and consume its data with a defined consumer group. This engine is broadly used for CDC (change data capture) purposes.
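A minimal sketch of a Kafka-backed table; the broker address, topic, and consumer-group names are placeholders you would replace with your own:
CREATE TABLE kafka_inventory
(
    `id` Int32,
    `price` Int32
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'inventory_events',
    kafka_group_name = 'clickhouse_inventory',
    kafka_format = 'JSONEachRow';
In practice, such a table is usually read once by a materialized view that pushes the consumed rows into a MergeTree table for querying.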
To learn more about this feature, read this article specifically on this topic.
Conclusion
In this article, we saw some of the most important engines of the ClickHouse database. ClickHouse clearly provides a wide range of engine options to suit various use cases. The MergeTree engine is the default choice and is suitable for most scenarios, but it can be replaced with other engines like AggregatingMergeTree, TinyLog, etc.
It's important to note that choosing the right engine for your use-case can significantly improve performance and efficiency. Therefore, it's worth taking the time to understand the strengths and limitations of each engine and select the one that best meets your needs.