Introduction
When people first start using ClickHouse®, most of the attention naturally goes to the MergeTree engine. It's the default choice for analytical workloads thanks to its exceptional query performance, indexing capabilities, and efficient storage.
However, ClickHouse offers a wide range of table engines beyond MergeTree. Each engine is designed to solve a specific problem, whether it's handling updates, pre-aggregating data, buffering high-frequency inserts, integrating with external systems, or scaling queries across clusters.
Understanding these engines is essential for designing efficient ClickHouse architectures and selecting the right storage strategy for your workload.
In this article, we'll explore some of the most commonly used ClickHouse table engines beyond MergeTree, understand their strengths, and look at practical examples of when to use them.
What Is a Table Engine?
A table engine defines how ClickHouse stores, organizes, and manages data. It determines:
- How data is written
- How queries are executed
- How data is merged and maintained
- Whether replication is supported
- How storage is organized
Every table in ClickHouse must specify an engine during creation.
CREATE TABLE my_table
(
id UInt32,
name String
)
ENGINE = MergeTree()
ORDER BY id;
Choosing the appropriate engine is one of the most important design decisions when building a ClickHouse database.
Part 1: The MergeTree Family
The MergeTree family forms the foundation of ClickHouse. Several specialized engines extend MergeTree to support different data management patterns.
ReplacingMergeTree
ReplacingMergeTree automatically removes duplicate rows with the same primary key during background merges. If a version column is provided, the row with the highest version is retained.
Best For
- Handling row updates
- Slowly changing dimensions
- Event corrections
- Avoiding expensive mutations
CREATE TABLE default.orders
(
order_id UInt32,
status LowCardinality(String),
amount Float64,
version UInt32
)
ENGINE = ReplacingMergeTree(version)
ORDER BY order_id;
Insert the original row:
INSERT INTO default.orders
VALUES (1, 'pending', 450.00, 1);
Insert an updated version:
INSERT INTO default.orders
VALUES (1, 'completed', 450.00, 2);
After background merges complete, only the record with version = 2 remains.
SummingMergeTree
SummingMergeTree automatically sums numeric columns for rows sharing the same primary key.
This significantly reduces storage while improving analytical query performance.
Best For
- Daily metrics
- Revenue reports
- Event counters
- Time-series summaries
CREATE TABLE default.daily_sales
(
sale_date Date,
country LowCardinality(String),
total_amount Float64,
total_orders UInt32
)
ENGINE = SummingMergeTree()
ORDER BY (sale_date, country);
Insert sample data:
INSERT INTO default.daily_sales VALUES
('2024-01-01','IN',1200.00,3),
('2024-01-01','IN',800.00,2);
After merging:
| sale_date | country | total_amount | total_orders |
|---|---|---|---|
| 2024-01-01 | IN | 2000.00 | 5 |
AggregatingMergeTree
AggregatingMergeTree extends pre-aggregation beyond simple sums by storing aggregate states.
It works particularly well with Materialized Views.
Best For
- Average calculations
- Unique counts
- Quantiles
- Complex aggregations
CREATE TABLE default.orders_summary
(
order_date Date,
country LowCardinality(String),
total_count AggregateFunction(count, UInt32),
avg_amount AggregateFunction(avg, Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (order_date, country);
Create a materialized view:
CREATE MATERIALIZED VIEW default.orders_mv
TO default.orders_summary
AS
SELECT
toDate(order_date) AS order_date,
country,
countState() AS total_count,
avgState(amount) AS avg_amount
FROM default.orders
GROUP BY order_date, country;
Query the aggregated results:
SELECT
order_date,
country,
countMerge(total_count) AS total_orders,
avgMerge(avg_amount) AS avg_amount
FROM default.orders_summary
GROUP BY order_date, country;
CollapsingMergeTree
CollapsingMergeTree manages logical deletes and updates using a sign column.
Rows with opposite sign values (+1 and -1) cancel each other out during merges.
Best For
- Event sourcing
- Logical deletes
- Update-heavy datasets
CREATE TABLE default.orders
(
order_id UInt32,
status LowCardinality(String),
amount Float64,
order_date DateTime,
sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY order_id;
Insert a row:
INSERT INTO default.orders
VALUES (1,'completed',1200.00,'2024-01-05',1);
Logically delete it:
INSERT INTO default.orders
VALUES (1,'completed',1200.00,'2024-01-05',-1);
After background merges, both rows collapse and the record is effectively removed.
Part 2: The Log Family
Log engines are lightweight storage engines intended for simple workloads where MergeTree's advanced capabilities are unnecessary.
TinyLog
TinyLog is the simplest storage engine available.
It stores columns separately and provides no indexing.
Best For
- Small datasets
- Temporary tables
- Testing
- One-time imports
CREATE TABLE default.temp_data
(
id UInt32,
value String
)
ENGINE = TinyLog;
Log
Log is similar to TinyLog but offers improved efficiency for slightly larger datasets.
It supports concurrent reads but not concurrent writes.
CREATE TABLE default.log_data
(
id UInt32,
message String,
timestamp DateTime
)
ENGINE = Log;
StripeLog
StripeLog stores columns together in stripes within a single file, improving write performance compared to Log.
CREATE TABLE default.stripe_data
(
id UInt32,
value String
)
ENGINE = StripeLog;
Part 3: Special Engines
Memory
The Memory engine stores all data in RAM.
Since nothing is written to disk, reads and writes are extremely fast.
Best For
- Temporary lookup tables
- Session data
- Testing
- Small cache tables
CREATE TABLE default.cache_data
(
id UInt32,
value String
)
ENGINE = Memory;
Insert data:
INSERT INTO default.cache_data
VALUES (1,'cached_value');
Read data:
SELECT *
FROM default.cache_data;
Note: All data is lost whenever the ClickHouse server restarts.
Buffer
The Buffer engine temporarily stores incoming inserts in memory before flushing them to another table.
This reduces the creation of excessive small parts during high-frequency ingestion.
Best For
- Streaming workloads
- High insert rates
- Improving ingestion efficiency
CREATE TABLE default.orders_buffer
AS default.orders
ENGINE = Buffer(
default,
orders,
4,
10,
60,
1000,
100000,
1000000,
10000000
);
Incoming data is buffered and automatically written to the destination table once configured thresholds are reached.
Dictionary
The Dictionary engine exposes ClickHouse dictionaries as queryable SQL tables.
CREATE TABLE default.products_dict_table
ENGINE = Dictionary('default.products_dict');
Example:
SELECT *
FROM default.products_dict_table
LIMIT 5;
Part 4: Integration Engines
Integration engines allow ClickHouse to access external systems without copying data into local storage.
Kafka Engine
The Kafka engine enables real-time event ingestion directly from Kafka topics.
CREATE TABLE default.orders_kafka
(
order_id UInt32,
status String,
amount Float64
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list='localhost:9092',
kafka_topic_list='orders',
kafka_group_name='clickhouse_consumer',
kafka_format='JSONEachRow';
Best For
- Streaming pipelines
- Event ingestion
- Real-time analytics
MySQL Engine
The MySQL engine allows ClickHouse to query MySQL tables directly.
CREATE TABLE default.mysql_orders
ENGINE = MySQL(
'mysql-host:3306',
'mydb',
'orders',
'readonly_user',
'password'
);
The same concept also applies to PostgreSQL using the PostgreSQL engine.
S3 Engine
The S3 engine reads data directly from Amazon S3 without importing it first.
CREATE TABLE default.s3_orders
ENGINE = S3(
'https://s3.amazonaws.com/my-bucket/orders/*.csv',
'ACCESS_KEY',
'SECRET_KEY',
'CSV'
);
Example query:
SELECT count()
FROM default.s3_orders;
URL Engine
The URL engine reads structured data directly from HTTP or HTTPS endpoints.
CREATE TABLE default.url_data
ENGINE = URL(
'https://example.com/data.csv',
'CSV'
);
Query it like any other table:
SELECT *
FROM default.url_data
LIMIT 5;
Part 5: Distributed Engine
Unlike most table engines, the Distributed engine does not store data.
Instead, it routes queries across multiple ClickHouse shards and combines the results into a single response.
CREATE TABLE default.orders_distributed
ENGINE = Distributed(
my_cluster,
default,
orders,
rand()
);
When querying orders_distributed, ClickHouse automatically:
- Sends the query to every shard
- Executes it in parallel
- Collects the results
- Returns a unified response
Best For
- Large distributed clusters
- Horizontal scaling
- Multi-node analytics
Quick Reference: Choosing the Right Engine
| Use Case | Recommended Engine |
|---|---|
| General analytical workloads | MergeTree |
| Row updates | ReplacingMergeTree |
| Pre-aggregated metrics | SummingMergeTree |
| Complex aggregations | AggregatingMergeTree |
| Logical deletes | CollapsingMergeTree |
| Temporary tables | TinyLog or Memory |
| High-frequency inserts | Buffer |
| Real-time Kafka ingestion | Kafka |
| External MySQL/PostgreSQL data | MySQL / PostgreSQL |
| Reading files from S3 | S3 |
| Querying HTTP endpoints | URL |
| Distributed analytics | Distributed |
Final Thoughts
While MergeTree remains the backbone of most ClickHouse deployments, it is only one part of a much broader ecosystem of table engines.
Specialized engines like ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and CollapsingMergeTree address specific data management needs, while engines such as Memory, Buffer, and Dictionary simplify caching and ingestion workflows. Integration engines connect ClickHouse to external systems like Kafka, MySQL, PostgreSQL, S3, and HTTP endpoints, and the Distributed engine enables seamless horizontal scaling across clusters.
Choosing the right table engine for your workload can significantly improve performance, simplify architecture, and reduce operational complexity. Understanding these engines is a key step toward designing efficient, scalable, and production-ready ClickHouse solutions.
Top comments (0)