DEV Community

Cover image for Day 42: Exploring ClickHouse® Table Engines Beyond MergeTree
Kanishga Subramani
Kanishga Subramani

Posted on

Day 42: Exploring ClickHouse® Table Engines Beyond MergeTree

Introduction

When people first start using ClickHouse®, most of the attention naturally goes to the MergeTree engine. It's the default choice for analytical workloads thanks to its exceptional query performance, indexing capabilities, and efficient storage.

However, ClickHouse offers a wide range of table engines beyond MergeTree. Each engine is designed to solve a specific problem, whether it's handling updates, pre-aggregating data, buffering high-frequency inserts, integrating with external systems, or scaling queries across clusters.

Understanding these engines is essential for designing efficient ClickHouse architectures and selecting the right storage strategy for your workload.

In this article, we'll explore some of the most commonly used ClickHouse table engines beyond MergeTree, understand their strengths, and look at practical examples of when to use them.


What Is a Table Engine?

A table engine defines how ClickHouse stores, organizes, and manages data. It determines:

  • How data is written
  • How queries are executed
  • How data is merged and maintained
  • Whether replication is supported
  • How storage is organized

Every table in ClickHouse must specify an engine during creation.

CREATE TABLE my_table
(
    id UInt32,
    name String
)
ENGINE = MergeTree()
ORDER BY id;
Enter fullscreen mode Exit fullscreen mode

Choosing the appropriate engine is one of the most important design decisions when building a ClickHouse database.


Part 1: The MergeTree Family

The MergeTree family forms the foundation of ClickHouse. Several specialized engines extend MergeTree to support different data management patterns.

ReplacingMergeTree

ReplacingMergeTree automatically removes duplicate rows with the same primary key during background merges. If a version column is provided, the row with the highest version is retained.

Best For

  • Handling row updates
  • Slowly changing dimensions
  • Event corrections
  • Avoiding expensive mutations
CREATE TABLE default.orders
(
    order_id UInt32,
    status LowCardinality(String),
    amount Float64,
    version UInt32
)
ENGINE = ReplacingMergeTree(version)
ORDER BY order_id;
Enter fullscreen mode Exit fullscreen mode

Insert the original row:

INSERT INTO default.orders
VALUES (1, 'pending', 450.00, 1);
Enter fullscreen mode Exit fullscreen mode

Insert an updated version:

INSERT INTO default.orders
VALUES (1, 'completed', 450.00, 2);
Enter fullscreen mode Exit fullscreen mode

After background merges complete, only the record with version = 2 remains.


SummingMergeTree

SummingMergeTree automatically sums numeric columns for rows sharing the same primary key.

This significantly reduces storage while improving analytical query performance.

Best For

  • Daily metrics
  • Revenue reports
  • Event counters
  • Time-series summaries
CREATE TABLE default.daily_sales
(
    sale_date Date,
    country LowCardinality(String),
    total_amount Float64,
    total_orders UInt32
)
ENGINE = SummingMergeTree()
ORDER BY (sale_date, country);
Enter fullscreen mode Exit fullscreen mode

Insert sample data:

INSERT INTO default.daily_sales VALUES
('2024-01-01','IN',1200.00,3),
('2024-01-01','IN',800.00,2);
Enter fullscreen mode Exit fullscreen mode

After merging:

sale_date country total_amount total_orders
2024-01-01 IN 2000.00 5

AggregatingMergeTree

AggregatingMergeTree extends pre-aggregation beyond simple sums by storing aggregate states.

It works particularly well with Materialized Views.

Best For

  • Average calculations
  • Unique counts
  • Quantiles
  • Complex aggregations
CREATE TABLE default.orders_summary
(
    order_date Date,
    country LowCardinality(String),
    total_count AggregateFunction(count, UInt32),
    avg_amount AggregateFunction(avg, Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (order_date, country);
Enter fullscreen mode Exit fullscreen mode

Create a materialized view:

CREATE MATERIALIZED VIEW default.orders_mv
TO default.orders_summary
AS
SELECT
    toDate(order_date) AS order_date,
    country,
    countState() AS total_count,
    avgState(amount) AS avg_amount
FROM default.orders
GROUP BY order_date, country;
Enter fullscreen mode Exit fullscreen mode

Query the aggregated results:

SELECT
    order_date,
    country,
    countMerge(total_count) AS total_orders,
    avgMerge(avg_amount) AS avg_amount
FROM default.orders_summary
GROUP BY order_date, country;
Enter fullscreen mode Exit fullscreen mode

CollapsingMergeTree

CollapsingMergeTree manages logical deletes and updates using a sign column.

Rows with opposite sign values (+1 and -1) cancel each other out during merges.

Best For

  • Event sourcing
  • Logical deletes
  • Update-heavy datasets
CREATE TABLE default.orders
(
    order_id UInt32,
    status LowCardinality(String),
    amount Float64,
    order_date DateTime,
    sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY order_id;
Enter fullscreen mode Exit fullscreen mode

Insert a row:

INSERT INTO default.orders
VALUES (1,'completed',1200.00,'2024-01-05',1);
Enter fullscreen mode Exit fullscreen mode

Logically delete it:

INSERT INTO default.orders
VALUES (1,'completed',1200.00,'2024-01-05',-1);
Enter fullscreen mode Exit fullscreen mode

After background merges, both rows collapse and the record is effectively removed.


Part 2: The Log Family

Log engines are lightweight storage engines intended for simple workloads where MergeTree's advanced capabilities are unnecessary.


TinyLog

TinyLog is the simplest storage engine available.

It stores columns separately and provides no indexing.

Best For

  • Small datasets
  • Temporary tables
  • Testing
  • One-time imports
CREATE TABLE default.temp_data
(
    id UInt32,
    value String
)
ENGINE = TinyLog;
Enter fullscreen mode Exit fullscreen mode

Log

Log is similar to TinyLog but offers improved efficiency for slightly larger datasets.

It supports concurrent reads but not concurrent writes.

CREATE TABLE default.log_data
(
    id UInt32,
    message String,
    timestamp DateTime
)
ENGINE = Log;
Enter fullscreen mode Exit fullscreen mode

StripeLog

StripeLog stores columns together in stripes within a single file, improving write performance compared to Log.

CREATE TABLE default.stripe_data
(
    id UInt32,
    value String
)
ENGINE = StripeLog;
Enter fullscreen mode Exit fullscreen mode

Part 3: Special Engines

Memory

The Memory engine stores all data in RAM.

Since nothing is written to disk, reads and writes are extremely fast.

Best For

  • Temporary lookup tables
  • Session data
  • Testing
  • Small cache tables
CREATE TABLE default.cache_data
(
    id UInt32,
    value String
)
ENGINE = Memory;
Enter fullscreen mode Exit fullscreen mode

Insert data:

INSERT INTO default.cache_data
VALUES (1,'cached_value');
Enter fullscreen mode Exit fullscreen mode

Read data:

SELECT *
FROM default.cache_data;
Enter fullscreen mode Exit fullscreen mode

Note: All data is lost whenever the ClickHouse server restarts.


Buffer

The Buffer engine temporarily stores incoming inserts in memory before flushing them to another table.

This reduces the creation of excessive small parts during high-frequency ingestion.

Best For

  • Streaming workloads
  • High insert rates
  • Improving ingestion efficiency
CREATE TABLE default.orders_buffer
AS default.orders
ENGINE = Buffer(
    default,
    orders,
    4,
    10,
    60,
    1000,
    100000,
    1000000,
    10000000
);
Enter fullscreen mode Exit fullscreen mode

Incoming data is buffered and automatically written to the destination table once configured thresholds are reached.


Dictionary

The Dictionary engine exposes ClickHouse dictionaries as queryable SQL tables.

CREATE TABLE default.products_dict_table
ENGINE = Dictionary('default.products_dict');
Enter fullscreen mode Exit fullscreen mode

Example:

SELECT *
FROM default.products_dict_table
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Part 4: Integration Engines

Integration engines allow ClickHouse to access external systems without copying data into local storage.


Kafka Engine

The Kafka engine enables real-time event ingestion directly from Kafka topics.

CREATE TABLE default.orders_kafka
(
    order_id UInt32,
    status String,
    amount Float64
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list='localhost:9092',
kafka_topic_list='orders',
kafka_group_name='clickhouse_consumer',
kafka_format='JSONEachRow';
Enter fullscreen mode Exit fullscreen mode

Best For

  • Streaming pipelines
  • Event ingestion
  • Real-time analytics

MySQL Engine

The MySQL engine allows ClickHouse to query MySQL tables directly.

CREATE TABLE default.mysql_orders
ENGINE = MySQL(
    'mysql-host:3306',
    'mydb',
    'orders',
    'readonly_user',
    'password'
);
Enter fullscreen mode Exit fullscreen mode

The same concept also applies to PostgreSQL using the PostgreSQL engine.


S3 Engine

The S3 engine reads data directly from Amazon S3 without importing it first.

CREATE TABLE default.s3_orders
ENGINE = S3(
    'https://s3.amazonaws.com/my-bucket/orders/*.csv',
    'ACCESS_KEY',
    'SECRET_KEY',
    'CSV'
);
Enter fullscreen mode Exit fullscreen mode

Example query:

SELECT count()
FROM default.s3_orders;
Enter fullscreen mode Exit fullscreen mode

URL Engine

The URL engine reads structured data directly from HTTP or HTTPS endpoints.

CREATE TABLE default.url_data
ENGINE = URL(
    'https://example.com/data.csv',
    'CSV'
);
Enter fullscreen mode Exit fullscreen mode

Query it like any other table:

SELECT *
FROM default.url_data
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Part 5: Distributed Engine

Unlike most table engines, the Distributed engine does not store data.

Instead, it routes queries across multiple ClickHouse shards and combines the results into a single response.

CREATE TABLE default.orders_distributed
ENGINE = Distributed(
    my_cluster,
    default,
    orders,
    rand()
);
Enter fullscreen mode Exit fullscreen mode

When querying orders_distributed, ClickHouse automatically:

  • Sends the query to every shard
  • Executes it in parallel
  • Collects the results
  • Returns a unified response

Best For

  • Large distributed clusters
  • Horizontal scaling
  • Multi-node analytics

Quick Reference: Choosing the Right Engine

Use Case Recommended Engine
General analytical workloads MergeTree
Row updates ReplacingMergeTree
Pre-aggregated metrics SummingMergeTree
Complex aggregations AggregatingMergeTree
Logical deletes CollapsingMergeTree
Temporary tables TinyLog or Memory
High-frequency inserts Buffer
Real-time Kafka ingestion Kafka
External MySQL/PostgreSQL data MySQL / PostgreSQL
Reading files from S3 S3
Querying HTTP endpoints URL
Distributed analytics Distributed

Final Thoughts

While MergeTree remains the backbone of most ClickHouse deployments, it is only one part of a much broader ecosystem of table engines.

Specialized engines like ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and CollapsingMergeTree address specific data management needs, while engines such as Memory, Buffer, and Dictionary simplify caching and ingestion workflows. Integration engines connect ClickHouse to external systems like Kafka, MySQL, PostgreSQL, S3, and HTTP endpoints, and the Distributed engine enables seamless horizontal scaling across clusters.

Choosing the right table engine for your workload can significantly improve performance, simplify architecture, and reduce operational complexity. Understanding these engines is a key step toward designing efficient, scalable, and production-ready ClickHouse solutions.

Top comments (0)