Kanishga Subramani

Posted on Jun 26

Day 39: How ClickHouse® Powers Real-Time Analytics

#clickhouse #devops #database #analytics

Real-time analytics has become a core requirement for modern applications. Product teams expect dashboards to refresh instantly, security teams need immediate visibility into suspicious activities, and operations teams rely on live monitoring to detect issues before they impact users.

Traditional relational databases excel at transactional workloads, but they struggle when asked to analyze billions of records while new data is continuously arriving. This is where ClickHouse® stands out.

In this article, we'll explore what real-time analytics is, why traditional databases face challenges, and how ClickHouse® enables fast analytical queries on continuously growing datasets.

What is Real-Time Analytics?

Real-time analytics is the process of collecting, processing, storing, and analyzing data almost immediately after it is generated.

The acceptable latency depends on the use case:

Fraud detection systems often require responses within milliseconds.
Infrastructure monitoring platforms typically operate with delays of a few seconds.
Business dashboards usually refresh every few seconds or every minute.

Real-time doesn't necessarily mean instant. It means the delay between an event occurring and that event becoming available for analysis is small enough to support business decisions.

A typical real-time analytics pipeline looks like this:

Applications
      │
      ▼
Event Stream
(Kafka, Pulsar, etc.)
      │
      ▼
ClickHouse®
      │
      ▼
Dashboards / APIs / Alerts

Applications continuously generate events, which are streamed into ClickHouse®. Dashboards and monitoring systems can then query the latest data without waiting for scheduled ETL jobs.

Why Traditional Databases Struggle

Traditional relational databases are designed for Online Transaction Processing (OLTP) workloads.

They are optimized for:

Small inserts
Point lookups
Frequent updates
ACID transactions

Analytical workloads are very different.

A single dashboard query may need to:

Scan billions of rows
Aggregate massive datasets
Filter across multiple dimensions
Return results within seconds

These workloads require storage and execution strategies that differ significantly from transactional databases.

Consider the following query:

SELECT
    country,
    count(*) AS page_views
FROM page_events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY country
ORDER BY page_views DESC;

A row-oriented database often has to scan every column for every row.

ClickHouse®, however, stores data by columns. It reads only the required columns (country and timestamp), dramatically reducing disk I/O and improving query performance.

Why ClickHouse® Excels at Real-Time Analytics

ClickHouse® was built specifically for analytical workloads.

Several architectural decisions make it ideal for real-time analytics.

1. Column-Oriented Storage

Instead of storing complete rows together, ClickHouse® stores data column by column.

Example:

Traditional Row Storage

ID | User | Country | Device

ClickHouse Column Storage

ID
User
Country
Device

When a query only needs two out of twenty columns, ClickHouse® reads only those columns.

This minimizes disk reads and significantly speeds up analytical queries.

2. High Data Compression

Values within the same column are often similar.

For example:

Country

India
India
India
India
India

Since similar values are stored together, ClickHouse® achieves excellent compression ratios.

Smaller datasets mean:

Less disk I/O
Better cache utilization
Faster query execution

3. Vectorized Query Execution

Instead of processing one row at a time, ClickHouse® processes batches of values.

Modern CPUs are highly optimized for performing operations on batches of data.

Benefits include:

Better CPU utilization
Higher throughput
Faster aggregations

4. Parallel Query Execution

ClickHouse® automatically distributes work across multiple CPU cores.

Large aggregation queries execute in parallel instead of sequentially, allowing hardware resources to be fully utilized.

5. Continuous Data Ingestion

Real-time systems continuously receive new events.

ClickHouse® supports inserting new data while simultaneously serving analytical queries.

This enables dashboards to display fresh data without interrupting ingestion.

However, ClickHouse® performs best when data is inserted in batches.

Very small inserts (such as one row per request) create unnecessary overhead.

In production systems, Kafka is commonly used to buffer events before writing larger batches into ClickHouse®.

Building a Real-Time Analytics Pipeline

A common architecture looks like this:

Applications
        │
        ▼
Apache Kafka
        │
        ▼
ClickHouse®
        │
        ▼
Grafana / Superset / Custom Dashboard

Step 1: Generate Events

Applications generate events such as:

{
  "user_id": 245,
  "country": "India",
  "page": "/pricing",
  "timestamp": "2025-08-04T12:31:14Z"
}

Step 2: Stream Events

A messaging platform like Apache Kafka stores events temporarily.

Kafka acts as a durable event log, decoupling producers from consumers while handling traffic spikes efficiently.

Step 3: Store Events in ClickHouse®

Events are continuously ingested into ClickHouse®.

Example table:

CREATE TABLE page_events
(
    timestamp DateTime,
    user_id UInt64,
    country String,
    page String
)
ENGINE = MergeTree
ORDER BY (timestamp, country);

The MergeTree engine is ClickHouse®'s primary storage engine and is optimized for analytical workloads.

Selecting the right ORDER BY key is one of the most important design decisions because it determines how data is physically organized on disk.

Step 4: Query Recent Data

Once data is stored, dashboards can execute queries such as:

SELECT
    toStartOfMinute(timestamp) AS minute,
    count(*) AS events
FROM page_events
WHERE timestamp >= now() - INTERVAL 30 MINUTE
GROUP BY minute
ORDER BY minute;

As new events arrive, they automatically appear in subsequent queries.

Performance Best Practices

Choosing ClickHouse® alone isn't enough. Proper design significantly impacts performance.

Batch Inserts

Insert larger batches instead of individual rows.

Batching reduces overhead and improves ingestion throughput.

Choose the Right ORDER BY Key

The sorting key determines how data is stored.

Design it around the filters your queries use most frequently.

A poor ordering strategy increases the amount of data scanned.

Avoid Too Many Small Parts

Every insert creates a data part.

Thousands of tiny inserts create many small parts, increasing background merge activity and reducing performance.

Partition Carefully

Partitioning simplifies retention and deletion.

However, excessive partitioning creates unnecessary metadata.

For most workloads, daily or monthly partitions are sufficient.

Use Materialized Views Wisely

Materialized views can precompute expensive aggregations during ingestion.

Benefits include:

Faster dashboards
Lower query latency

Trade-offs include:

Increased storage usage
Additional insert overhead

Use them only when repeated query patterns justify the cost.

Limitations

Although ClickHouse® is an outstanding analytical database, it isn't designed for every workload.

It is generally not the best choice for:

High-frequency row updates
Heavy transactional workloads
Complex multi-row ACID transactions
OLTP applications with frequent updates and deletes

A common architecture uses PostgreSQL or MySQL for transactional operations while using ClickHouse® for analytics.

When Should You Use ClickHouse®?

ClickHouse® is an excellent choice when you need to:

Analyze billions of events
Build real-time dashboards
Monitor logs and metrics
Process clickstream data
Generate BI reports
Analyze observability data

If your workload focuses primarily on transactions, user management, or frequent record updates, a traditional relational database remains the better option.

Conclusion

Real-time analytics is more than simply choosing the right database. It requires an end-to-end pipeline capable of collecting, transporting, storing, and analyzing continuously arriving data.

ClickHouse® addresses the storage and query layer exceptionally well through its column-oriented architecture, efficient compression, vectorized execution, parallel processing, and continuous ingestion capabilities.

When paired with thoughtful schema design, proper ordering keys, and efficient batch ingestion, ClickHouse® enables organizations to build highly scalable real-time analytics platforms capable of analyzing billions of events with minimal latency.

Understanding where ClickHouse® excels—and where traditional databases remain the better choice—is essential for designing reliable, high-performance data platforms.

References

https://clickhouse.com/docs/

In this article, we'll explore what real-time analytics is, why traditional databases face challenges, and how ClickHouse® enables fast analytical queries on continuously growing datasets.

What is Real-Time Analytics?

Real-time analytics is the process of collecting, processing, storing, and analyzing data almost immediately after it is generated.

The acceptable latency depends on the use case:

Fraud detection systems often require responses within milliseconds.
Infrastructure monitoring platforms typically operate with delays of a few seconds.
Business dashboards usually refresh every few seconds or every minute.

Real-time doesn't necessarily mean instant. It means the delay between an event occurring and that event becoming available for analysis is small enough to support business decisions.

A typical real-time analytics pipeline looks like this:

Applications
      │
      ▼
Event Stream
(Kafka, Pulsar, etc.)
      │
      ▼
ClickHouse®
      │
      ▼
Dashboards / APIs / Alerts

Applications continuously generate events, which are streamed into ClickHouse®. Dashboards and monitoring systems can then query the latest data without waiting for scheduled ETL jobs.

Why Traditional Databases Struggle

Traditional relational databases are designed for Online Transaction Processing (OLTP) workloads.

They are optimized for:

Small inserts
Point lookups
Frequent updates
ACID transactions

Analytical workloads are very different.

A single dashboard query may need to:

Scan billions of rows
Aggregate massive datasets
Filter across multiple dimensions
Return results within seconds

These workloads require storage and execution strategies that differ significantly from transactional databases.

Consider the following query:

SELECT
    country,
    count(*) AS page_views
FROM page_events
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY country
ORDER BY page_views DESC;

A row-oriented database often has to scan every column for every row.

ClickHouse®, however, stores data by columns. It reads only the required columns (country and timestamp), dramatically reducing disk I/O and improving query performance.

Why ClickHouse® Excels at Real-Time Analytics

ClickHouse® was built specifically for analytical workloads.

Several architectural decisions make it ideal for real-time analytics.

1. Column-Oriented Storage

Instead of storing complete rows together, ClickHouse® stores data column by column.

Example:

Traditional Row Storage

ID | User | Country | Device

ClickHouse Column Storage

ID
User
Country
Device

When a query only needs two out of twenty columns, ClickHouse® reads only those columns.

This minimizes disk reads and significantly speeds up analytical queries.

2. High Data Compression

Values within the same column are often similar.

For example:

Country

India
India
India
India
India

Since similar values are stored together, ClickHouse® achieves excellent compression ratios.

Smaller datasets mean:

Less disk I/O
Better cache utilization
Faster query execution

3. Vectorized Query Execution

Instead of processing one row at a time, ClickHouse® processes batches of values.

Modern CPUs are highly optimized for performing operations on batches of data.

Benefits include:

Better CPU utilization
Higher throughput
Faster aggregations

4. Parallel Query Execution

ClickHouse® automatically distributes work across multiple CPU cores.

Large aggregation queries execute in parallel instead of sequentially, allowing hardware resources to be fully utilized.

5. Continuous Data Ingestion

Real-time systems continuously receive new events.

ClickHouse® supports inserting new data while simultaneously serving analytical queries.

This enables dashboards to display fresh data without interrupting ingestion.

However, ClickHouse® performs best when data is inserted in batches.

Very small inserts (such as one row per request) create unnecessary overhead.

In production systems, Kafka is commonly used to buffer events before writing larger batches into ClickHouse®.

Building a Real-Time Analytics Pipeline

A common architecture looks like this:

Applications
        │
        ▼
Apache Kafka
        │
        ▼
ClickHouse®
        │
        ▼
Grafana / Superset / Custom Dashboard

Step 1: Generate Events

Applications generate events such as:

{
  "user_id": 245,
  "country": "India",
  "page": "/pricing",
  "timestamp": "2025-08-04T12:31:14Z"
}

Step 2: Stream Events

A messaging platform like Apache Kafka stores events temporarily.

Kafka acts as a durable event log, decoupling producers from consumers while handling traffic spikes efficiently.

Step 3: Store Events in ClickHouse®

Events are continuously ingested into ClickHouse®.

Example table:

CREATE TABLE page_events
(
    timestamp DateTime,
    user_id UInt64,
    country String,
    page String
)
ENGINE = MergeTree
ORDER BY (timestamp, country);

The MergeTree engine is ClickHouse®'s primary storage engine and is optimized for analytical workloads.

Selecting the right ORDER BY key is one of the most important design decisions because it determines how data is physically organized on disk.

Step 4: Query Recent Data

Once data is stored, dashboards can execute queries such as:

SELECT
    toStartOfMinute(timestamp) AS minute,
    count(*) AS events
FROM page_events
WHERE timestamp >= now() - INTERVAL 30 MINUTE
GROUP BY minute
ORDER BY minute;

As new events arrive, they automatically appear in subsequent queries.

Performance Best Practices

Choosing ClickHouse® alone isn't enough. Proper design significantly impacts performance.

Batch Inserts

Insert larger batches instead of individual rows.

Batching reduces overhead and improves ingestion throughput.

Choose the Right ORDER BY Key

The sorting key determines how data is stored.

Design it around the filters your queries use most frequently.

A poor ordering strategy increases the amount of data scanned.

Avoid Too Many Small Parts

Every insert creates a data part.

Thousands of tiny inserts create many small parts, increasing background merge activity and reducing performance.

Partition Carefully

Partitioning simplifies retention and deletion.

However, excessive partitioning creates unnecessary metadata.

For most workloads, daily or monthly partitions are sufficient.

Use Materialized Views Wisely

Materialized views can precompute expensive aggregations during ingestion.

Benefits include:

Faster dashboards
Lower query latency

Trade-offs include:

Increased storage usage
Additional insert overhead

Use them only when repeated query patterns justify the cost.

Limitations

Although ClickHouse® is an outstanding analytical database, it isn't designed for every workload.

It is generally not the best choice for:

High-frequency row updates
Heavy transactional workloads
Complex multi-row ACID transactions
OLTP applications with frequent updates and deletes

A common architecture uses PostgreSQL or MySQL for transactional operations while using ClickHouse® for analytics.

When Should You Use ClickHouse®?

ClickHouse® is an excellent choice when you need to:

Analyze billions of events
Build real-time dashboards
Monitor logs and metrics
Process clickstream data
Generate BI reports
Analyze observability data

If your workload focuses primarily on transactions, user management, or frequent record updates, a traditional relational database remains the better option.

Conclusion

Real-time analytics is more than simply choosing the right database. It requires an end-to-end pipeline capable of collecting, transporting, storing, and analyzing continuously arriving data.

Understanding where ClickHouse® excels—and where traditional databases remain the better choice—is essential for designing reliable, high-performance data platforms.

References

https://clickhouse.com/docs/