Kanishga Subramani

Posted on Jul 3

Day 56 – Mastering ClickHouse® AggregatingMergeTree: Build Faster Analytics with Pre-Aggregated Data

#clickhouse #devops #database #dataengineering

Introduction

As data volumes continue to grow, running aggregation queries directly on raw datasets becomes increasingly expensive. Business dashboards, analytics platforms, and reporting systems often execute the same calculations repeatedly—such as total sales, daily active users, page views, or revenue trends. While ClickHouse® is designed to process analytical workloads at remarkable speed, repeatedly scanning billions of records still consumes valuable CPU, memory, and storage resources.

This is where AggregatingMergeTree proves its value.

Rather than calculating aggregates every time a query is executed, AggregatingMergeTree stores intermediate aggregation states that are merged automatically in the background. This approach allows analytical queries to read compact, pre-aggregated datasets, resulting in dramatically faster response times and reduced infrastructure costs.

In this guide, you'll learn how AggregatingMergeTree works, why aggregate states matter, how to build an automated aggregation pipeline using Materialized Views, and when this engine is the right choice for your ClickHouse® workloads.

What is AggregatingMergeTree?

AggregatingMergeTree is a specialized ClickHouse® table engine designed to store aggregate function states instead of raw records.

Unlike the standard MergeTree engine, which stores every inserted row, AggregatingMergeTree keeps partially aggregated values that ClickHouse combines during background merge operations. This significantly reduces the amount of data that must be processed when generating analytical reports.

Because much of the computational work happens during data ingestion, dashboards and reporting applications can retrieve summarized information much more efficiently.

Typical scenarios include:

Sales reporting
Website traffic analytics
Financial summaries
IoT sensor monitoring
Business KPI dashboards
Application observability metrics

Why Use AggregatingMergeTree?

Imagine an online marketplace processing millions of transactions every day.

A dashboard needs to display total revenue generated by each product.

A typical query might look like this:

SELECT
    product,
    sum(amount) AS total_revenue
FROM sales
GROUP BY product;

Although ClickHouse executes aggregation queries efficiently, scanning billions of records every time a dashboard refreshes eventually becomes costly.

As datasets grow larger, repeated aggregations increase:

CPU utilization
Memory consumption
Disk reads
Query latency

AggregatingMergeTree addresses this challenge by storing aggregation states as data arrives. Instead of recalculating totals from the beginning, queries simply merge pre-computed states to produce final results.

This significantly reduces execution time for frequently accessed reports.

How AggregatingMergeTree Works

A typical aggregation pipeline consists of five stages:

Incoming events are written into a raw MergeTree table.
A Materialized View processes newly inserted records.
Aggregate state functions generate intermediate values.
AggregatingMergeTree stores those aggregate states.
Background merges combine compatible states automatically.

Application
      │
      ▼
Raw MergeTree Table
      │
      ▼
Materialized View
      │
      ▼
Aggregate States
(sumState(), avgState())
      │
      ▼
AggregatingMergeTree
      │
      ▼
Background Merge
      │
      ▼
sumMerge()
      │
      ▼
Analytics Dashboard

This architecture shifts expensive aggregation work from query execution time to data ingestion, making analytical queries much faster.

Aggregate Functions vs. AggregatingMergeTree

Although their names are similar, aggregate functions and AggregatingMergeTree serve different purposes.

Aggregate Functions	AggregatingMergeTree
SQL functions	Table engine
Execute calculations during queries	Stores aggregate states
Used in SELECT statements	Used for summary tables
Examples: sum(), avg(), count()	Stores AggregateFunction columns

A simple way to think about it is:

Aggregate functions perform calculations.
AggregatingMergeTree stores those calculations in an intermediate format that can be merged later.

Understanding Aggregate States

Aggregate states are the foundation of AggregatingMergeTree.

Consider this standard query:

SELECT sum(amount)
FROM sales;

The query immediately returns a final numeric result.

With AggregatingMergeTree, ClickHouse instead stores an intermediate aggregation state using functions such as:

sumState(amount)

This state is not the final answer.

Later, when querying the summary table, ClickHouse combines multiple stored states using:

sumMerge(total_sales)

Think of an aggregate state as a partially completed calculation that can continue growing as new data arrives.

Because ClickHouse only merges these compact states instead of scanning every raw row, reporting becomes significantly more efficient.

Creating the Raw Events Table

Let's build a simple example that tracks website page views.

First, create a table that stores every page visit.

CREATE TABLE page_views
(
    event_date Date,
    page String,
    views UInt64
)
ENGINE = MergeTree
ORDER BY (event_date, page);

Example data:

Date	Page	Views
2026-07-01	Home	1
2026-07-01	Home	1
2026-07-01	Products	1
2026-07-01	Contact	1
2026-07-01	Home	1

Every page visit is stored individually inside the MergeTree table.

Creating the Summary Table

Instead of saving completed totals, the summary table stores aggregation states.

CREATE TABLE page_views_summary
(
    event_date Date,
    page String,
    total_views AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_date, page);

Notice that the total_views column uses:

AggregateFunction(sum, UInt64)

rather than a standard numeric type.

This enables ClickHouse to merge partial aggregations automatically.

Automatically Building Aggregates with Materialized Views

To avoid manually updating summary tables, create a Materialized View that processes every new insert.

CREATE MATERIALIZED VIEW mv_page_views
TO page_views_summary
AS
SELECT
    event_date,
    page,
    sumState(views) AS total_views
FROM page_views
GROUP BY
    event_date,
    page;

Whenever fresh data is inserted into the raw events table, the Materialized View calculates aggregation states and stores them in the AggregatingMergeTree table automatically.

This eliminates repeated aggregation work during query execution.

Querying Aggregated Data

Since the summary table contains aggregate states rather than completed values, use merge functions to retrieve final results.

SELECT
    event_date,
    page,
    sumMerge(total_views) AS total_views
FROM page_views_summary
GROUP BY
    event_date,
    page
ORDER BY
    event_date,
    page;

Example output:

Date	Page	Total Views
2026-07-01	Home	35,842
2026-07-01	Products	12,614
2026-07-01	Contact	4,238

Instead of scanning millions of individual page-view events, ClickHouse reads a compact summary table, making dashboards load much faster.

When Should You Use AggregatingMergeTree?

AggregatingMergeTree is best suited for analytical workloads where identical aggregation queries are executed repeatedly.

Ideal use cases include:

Interactive BI dashboards
Daily and monthly reporting
Executive KPI dashboards
Website analytics
Application monitoring
Event and log analysis
Financial reporting
IoT telemetry aggregation

By storing pre-aggregated information, organizations can reduce query execution times while minimizing resource consumption.

When It May Not Be the Right Choice

Although powerful, AggregatingMergeTree isn't appropriate for every workload.

Consider alternative engines when your application requires:

Transactional (OLTP) processing
Frequent row-level updates
Point lookups
Individual record modifications
Highly mutable datasets

For these scenarios, standard MergeTree variants are generally a better fit.

Common Mistakes to Avoid

When working with AggregatingMergeTree, developers often encounter a few common pitfalls.

Avoid these mistakes:

Using sum() instead of sumState() while inserting aggregate data.
Querying aggregate states without merge functions like sumMerge().
Defining aggregation columns as numeric types instead of AggregateFunction.
Expecting background merges to happen instantly after every insert.
Using AggregatingMergeTree for transactional workloads rather than analytical reporting.

Understanding these concepts helps ensure accurate query results and better long-term performance.

Best Practices

To get the most out of AggregatingMergeTree:

Keep raw data in a MergeTree table.
Automate aggregation with Materialized Views.
Store only aggregation states inside summary tables.
Query summary tables using merge functions.
Choose appropriate sorting keys for efficient merges.
Monitor background merge activity for optimal performance.

These practices help build scalable analytics pipelines capable of handling billions of records efficiently.

Conclusion

AggregatingMergeTree is one of the most powerful storage engines available in ClickHouse® for accelerating analytical workloads. By storing intermediate aggregate states instead of recalculating values from raw data, it dramatically improves reporting performance while reducing CPU, memory, and disk usage.

When combined with Materialized Views and aggregate state functions such as sumState() and sumMerge(), it provides an elegant solution for building high-performance dashboards, business intelligence systems, monitoring platforms, and real-time analytics applications.

If your organization regularly executes the same aggregation queries over massive datasets, adopting AggregatingMergeTree can significantly reduce query latency, improve scalability, and simplify the architecture of your analytics platform.

Top comments (1)

Aldo • Jul 19

Our user-facing analytics dashboards used to struggle significantly with latency once we crossed into the tens of billions of rows for event data. Even with a powerful columnar store, GROUP BY operations on high-cardinality dimensions became a bottleneck, especially when users expected near real-time updates. That's precisely where patterns like the AggregatingMergeTree really shine. We adopted a similar strategy, building out materialized views that pre-aggregate daily or hourly metrics for our core dashboard views.

The biggest win, predictably, is query performance. Our core metrics went from several seconds to milliseconds. However, it's a careful dance. Designing the state columns and choosing the right *State functions for what you might need versus what you definitely need is critical. Over-aggregating can lead to unnecessary storage and CPU on writes, while under-aggregating forces you back to raw data, defeating the purpose. We also had to balance data freshness; while AggregatingMergeTree handles merges efficiently, the actual raw data ingress and subsequent aggregation job need to be robust for a reliable user experience.

One area we invested heavily in was the pipeline feeding these tables. Using a streaming platform to pre-process and normalize events before they hit ClickHouse significantly reduced the complexity of the AggregatingMergeTree definitions themselves. It also helped manage late-arriving data more gracefully, allowing us to re-process specific time windows without a full table rebuild. Ultimately, it's a powerful tool, but like any strong optimization, it requires a clear understanding of your query patterns and a commitment to managing the aggregation schemas as your product evolves.