Kanishga Subramani

Posted on Jun 27

Day 42: Exploring ClickHouse® Table Engines Beyond MergeTree

#clickhouse #devops #database #analytics

Introduction

When people first start using ClickHouse®, most of the attention naturally goes to the MergeTree engine. It's the default choice for analytical workloads thanks to its exceptional query performance, indexing capabilities, and efficient storage.

However, ClickHouse offers a wide range of table engines beyond MergeTree. Each engine is designed to solve a specific problem, whether it's handling updates, pre-aggregating data, buffering high-frequency inserts, integrating with external systems, or scaling queries across clusters.

Understanding these engines is essential for designing efficient ClickHouse architectures and selecting the right storage strategy for your workload.

In this article, we'll explore some of the most commonly used ClickHouse table engines beyond MergeTree, understand their strengths, and look at practical examples of when to use them.

What Is a Table Engine?

A table engine defines how ClickHouse stores, organizes, and manages data. It determines:

How data is written
How queries are executed
How data is merged and maintained
Whether replication is supported
How storage is organized

Every table in ClickHouse must specify an engine during creation.

CREATE TABLE my_table
(
    id UInt32,
    name String
)
ENGINE = MergeTree()
ORDER BY id;

Choosing the appropriate engine is one of the most important design decisions when building a ClickHouse database.

Part 1: The MergeTree Family

The MergeTree family forms the foundation of ClickHouse. Several specialized engines extend MergeTree to support different data management patterns.

ReplacingMergeTree

ReplacingMergeTree automatically removes duplicate rows with the same primary key during background merges. If a version column is provided, the row with the highest version is retained.

Best For

Handling row updates
Slowly changing dimensions
Event corrections
Avoiding expensive mutations

CREATE TABLE default.orders
(
    order_id UInt32,
    status LowCardinality(String),
    amount Float64,
    version UInt32
)
ENGINE = ReplacingMergeTree(version)
ORDER BY order_id;

Insert the original row:

INSERT INTO default.orders
VALUES (1, 'pending', 450.00, 1);

Insert an updated version:

INSERT INTO default.orders
VALUES (1, 'completed', 450.00, 2);

After background merges complete, only the record with version = 2 remains.

SummingMergeTree

SummingMergeTree automatically sums numeric columns for rows sharing the same primary key.

This significantly reduces storage while improving analytical query performance.

Best For

Daily metrics
Revenue reports
Event counters
Time-series summaries

CREATE TABLE default.daily_sales
(
    sale_date Date,
    country LowCardinality(String),
    total_amount Float64,
    total_orders UInt32
)
ENGINE = SummingMergeTree()
ORDER BY (sale_date, country);

Insert sample data:

INSERT INTO default.daily_sales VALUES
('2024-01-01','IN',1200.00,3),
('2024-01-01','IN',800.00,2);

After merging:

sale_date	country	total_amount	total_orders
2024-01-01	IN	2000.00	5

AggregatingMergeTree

AggregatingMergeTree extends pre-aggregation beyond simple sums by storing aggregate states.

It works particularly well with Materialized Views.

Best For

Average calculations
Unique counts
Quantiles
Complex aggregations

CREATE TABLE default.orders_summary
(
    order_date Date,
    country LowCardinality(String),
    total_count AggregateFunction(count, UInt32),
    avg_amount AggregateFunction(avg, Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (order_date, country);

Create a materialized view:

CREATE MATERIALIZED VIEW default.orders_mv
TO default.orders_summary
AS
SELECT
    toDate(order_date) AS order_date,
    country,
    countState() AS total_count,
    avgState(amount) AS avg_amount
FROM default.orders
GROUP BY order_date, country;

Query the aggregated results:

SELECT
    order_date,
    country,
    countMerge(total_count) AS total_orders,
    avgMerge(avg_amount) AS avg_amount
FROM default.orders_summary
GROUP BY order_date, country;

CollapsingMergeTree

CollapsingMergeTree manages logical deletes and updates using a sign column.

Rows with opposite sign values (+1 and -1) cancel each other out during merges.

Best For

Event sourcing
Logical deletes
Update-heavy datasets

CREATE TABLE default.orders
(
    order_id UInt32,
    status LowCardinality(String),
    amount Float64,
    order_date DateTime,
    sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY order_id;

Insert a row:

INSERT INTO default.orders
VALUES (1,'completed',1200.00,'2024-01-05',1);

Logically delete it:

INSERT INTO default.orders
VALUES (1,'completed',1200.00,'2024-01-05',-1);

After background merges, both rows collapse and the record is effectively removed.

Part 2: The Log Family

Log engines are lightweight storage engines intended for simple workloads where MergeTree's advanced capabilities are unnecessary.

TinyLog

TinyLog is the simplest storage engine available.

It stores columns separately and provides no indexing.

Best For

Small datasets
Temporary tables
Testing
One-time imports

CREATE TABLE default.temp_data
(
    id UInt32,
    value String
)
ENGINE = TinyLog;

Log

Log is similar to TinyLog but offers improved efficiency for slightly larger datasets.

It supports concurrent reads but not concurrent writes.

CREATE TABLE default.log_data
(
    id UInt32,
    message String,
    timestamp DateTime
)
ENGINE = Log;

StripeLog

StripeLog stores columns together in stripes within a single file, improving write performance compared to Log.

CREATE TABLE default.stripe_data
(
    id UInt32,
    value String
)
ENGINE = StripeLog;

Part 3: Special Engines

Memory

The Memory engine stores all data in RAM.

Since nothing is written to disk, reads and writes are extremely fast.

Best For

Temporary lookup tables
Session data
Testing
Small cache tables

CREATE TABLE default.cache_data
(
    id UInt32,
    value String
)
ENGINE = Memory;

Insert data:

INSERT INTO default.cache_data
VALUES (1,'cached_value');

Read data:

SELECT *
FROM default.cache_data;

Note: All data is lost whenever the ClickHouse server restarts.

Buffer

The Buffer engine temporarily stores incoming inserts in memory before flushing them to another table.

This reduces the creation of excessive small parts during high-frequency ingestion.

Best For

Streaming workloads
High insert rates
Improving ingestion efficiency

CREATE TABLE default.orders_buffer
AS default.orders
ENGINE = Buffer(
    default,
    orders,
    4,
    10,
    60,
    1000,
    100000,
    1000000,
    10000000
);

Incoming data is buffered and automatically written to the destination table once configured thresholds are reached.

Dictionary

The Dictionary engine exposes ClickHouse dictionaries as queryable SQL tables.

CREATE TABLE default.products_dict_table
ENGINE = Dictionary('default.products_dict');

Example:

SELECT *
FROM default.products_dict_table
LIMIT 5;

Part 4: Integration Engines

Integration engines allow ClickHouse to access external systems without copying data into local storage.

Kafka Engine

The Kafka engine enables real-time event ingestion directly from Kafka topics.

CREATE TABLE default.orders_kafka
(
    order_id UInt32,
    status String,
    amount Float64
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list='localhost:9092',
kafka_topic_list='orders',
kafka_group_name='clickhouse_consumer',
kafka_format='JSONEachRow';

Best For

Streaming pipelines
Event ingestion
Real-time analytics

MySQL Engine

The MySQL engine allows ClickHouse to query MySQL tables directly.

CREATE TABLE default.mysql_orders
ENGINE = MySQL(
    'mysql-host:3306',
    'mydb',
    'orders',
    'readonly_user',
    'password'
);

The same concept also applies to PostgreSQL using the PostgreSQL engine.

S3 Engine

The S3 engine reads data directly from Amazon S3 without importing it first.

CREATE TABLE default.s3_orders
ENGINE = S3(
    'https://s3.amazonaws.com/my-bucket/orders/*.csv',
    'ACCESS_KEY',
    'SECRET_KEY',
    'CSV'
);

Example query:

SELECT count()
FROM default.s3_orders;

URL Engine

The URL engine reads structured data directly from HTTP or HTTPS endpoints.

CREATE TABLE default.url_data
ENGINE = URL(
    'https://example.com/data.csv',
    'CSV'
);

Query it like any other table:

SELECT *
FROM default.url_data
LIMIT 5;

Part 5: Distributed Engine

Unlike most table engines, the Distributed engine does not store data.

Instead, it routes queries across multiple ClickHouse shards and combines the results into a single response.

CREATE TABLE default.orders_distributed
ENGINE = Distributed(
    my_cluster,
    default,
    orders,
    rand()
);

When querying orders_distributed, ClickHouse automatically:

Sends the query to every shard
Executes it in parallel
Collects the results
Returns a unified response

Best For

Large distributed clusters
Horizontal scaling
Multi-node analytics

Quick Reference: Choosing the Right Engine

Use Case	Recommended Engine
General analytical workloads	MergeTree
Row updates	ReplacingMergeTree
Pre-aggregated metrics	SummingMergeTree
Complex aggregations	AggregatingMergeTree
Logical deletes	CollapsingMergeTree
Temporary tables	TinyLog or Memory
High-frequency inserts	Buffer
Real-time Kafka ingestion	Kafka
External MySQL/PostgreSQL data	MySQL / PostgreSQL
Reading files from S3	S3
Querying HTTP endpoints	URL
Distributed analytics	Distributed

Final Thoughts

While MergeTree remains the backbone of most ClickHouse deployments, it is only one part of a much broader ecosystem of table engines.

Specialized engines like ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and CollapsingMergeTree address specific data management needs, while engines such as Memory, Buffer, and Dictionary simplify caching and ingestion workflows. Integration engines connect ClickHouse to external systems like Kafka, MySQL, PostgreSQL, S3, and HTTP endpoints, and the Distributed engine enables seamless horizontal scaling across clusters.

Choosing the right table engine for your workload can significantly improve performance, simplify architecture, and reduce operational complexity. Understanding these engines is a key step toward designing efficient, scalable, and production-ready ClickHouse solutions.

DEV Community

Day 42: Exploring ClickHouse® Table Engines Beyond MergeTree

Introduction

What Is a Table Engine?

Part 1: The MergeTree Family

ReplacingMergeTree

Best For

SummingMergeTree

Best For

AggregatingMergeTree

Best For

CollapsingMergeTree

Best For

Part 2: The Log Family

TinyLog

Best For

Log

StripeLog

Part 3: Special Engines

Memory

Best For

Buffer

Best For

Dictionary

Part 4: Integration Engines

Kafka Engine

Best For

MySQL Engine

S3 Engine

URL Engine

Part 5: Distributed Engine

Best For

Quick Reference: Choosing the Right Engine

Final Thoughts

Top comments (0)