Kanishga Subramani

Posted on Jun 28

Day 45: Using ClickHouse® for Log Analytics and Observability

#clickhouse #devops #database #dataengineering

Introduction

Modern applications generate an enormous amount of operational data. In many modern systems, ClickHouse® is increasingly being explored as a high-performance solution for handling this data. Every API request, database query, application event, and infrastructure component produces logs that help engineers troubleshoot failures, investigate incidents, and understand system behavior. As systems scale, storing and querying this data becomes increasingly challenging, especially with traditional log management tools that struggle with cost, performance, or scalability limitations.

This is where ClickHouse® becomes an interesting option. ClickHouse® is an open-source columnar database management system designed for analytical workloads. While it was not originally built as a dedicated log management platform, its architecture makes it well-suited for storing and analyzing large volumes of time-series event data, including application logs.

However, it is important to understand what ClickHouse® is—and what it is not. It is a database for analytics, not a log collector, observability platform, or dashboarding solution. Instead, it is commonly used as the core storage and query engine within modern observability stacks.

Why Traditional Log Storage Becomes Expensive

In small applications, logs are often stored in text files or relational databases. As application traffic increases, this approach quickly becomes difficult to scale.

A production system can easily generate millions of log entries every day. These logs commonly include information such as:

Timestamp
Service name
Log level
Request ID
User ID
HTTP status code
Response time
Error message

Engineers rarely read logs sequentially. Instead, they perform analytical queries such as:

Show all ERROR logs from the payment service during the last hour.
Find requests with latency greater than 500 ms.
Count errors by service over the last 24 hours.
Identify the most common exception messages.

These workloads require scanning very large datasets while reading only a handful of columns.

Traditional row-oriented databases are designed for transactional operations rather than analytical scans across billions of records. As log volumes increase, this often leads to slower queries, higher storage costs, and more expensive infrastructure.

Why ClickHouse® Works Well for Log Analytics

ClickHouse® was built for Online Analytical Processing (OLAP), where large datasets are scanned and aggregated efficiently.

Several characteristics make it an excellent fit for log analytics.

Columnar Storage

Unlike row-oriented databases, ClickHouse® stores data column by column.

For example, if a query only requires:

timestamp
service
log_level

ClickHouse® reads only those three columns instead of every field contained in each log record.

This dramatically reduces disk I/O and allows analytical queries to execute much faster.

High Compression

Operational logs contain many repeated values, including:

Service names
HTTP methods
Status codes
Log levels

Because identical values are stored together within columns, ClickHouse® achieves excellent compression ratios.

Better compression means:

Less storage consumption
Lower storage costs
Faster query execution due to reduced disk reads

Fast Aggregations

Most observability workloads involve aggregation.

For example:

SELECT
    service,
    count(*) AS total_errors
FROM logs
WHERE log_level = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY total_errors DESC;

Queries like this can process billions of rows efficiently on properly configured hardware.

Actual performance depends on schema design, sorting keys, partitioning strategy, hardware resources, and query patterns.

A Typical Observability Architecture

ClickHouse® usually serves as the analytical database rather than the complete observability platform.

A common architecture looks like this:

Applications
      │
      ▼
OpenTelemetry / Fluent Bit / Vector
      │
      ▼
Kafka (optional)
      │
      ▼
ClickHouse®
      │
      ▼
Grafana / Custom Dashboards

Each component has a dedicated responsibility.

Component	Responsibility
Applications	Generate logs
OpenTelemetry / Fluent Bit / Vector	Collect and forward logs
Kafka	Buffer and decouple ingestion
ClickHouse®	Store and analyze logs
Grafana	Build dashboards and visualizations

Separating these responsibilities makes the system easier to scale and maintain.

Designing a Log Table

Logs are essentially timestamped events.

A simple schema might look like:

CREATE TABLE logs
(
    timestamp DateTime64(3),
    service LowCardinality(String),
    log_level LowCardinality(String),
    request_id String,
    user_id UInt64,
    endpoint String,
    status_code UInt16,
    duration_ms UInt32,
    message String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service, timestamp);

Several design decisions are worth understanding.

Using LowCardinality

Fields such as service names and log levels usually contain relatively few unique values.

Using LowCardinality(String):

Reduces storage requirements
Improves compression
Can speed up filtering and grouping

Partitioning

Partitioning by date allows ClickHouse® to skip unnecessary partitions.

For example, a query requesting yesterday's logs doesn't need to examine partitions from previous weeks.

This significantly improves query efficiency.

Choosing the Sorting Key

The ORDER BY clause determines how data is physically organized on disk.

There is no universally correct sorting key.

Instead, it should reflect the most common query patterns.

Since changing the sorting key later usually requires rebuilding the table, careful planning is important.

Common Log Analytics Queries

Once logs are stored, engineers typically perform analyses like the following.

Finding Recent Errors

SELECT *
FROM logs
WHERE log_level = 'ERROR'
ORDER BY timestamp DESC
LIMIT 100;

Request Volume Per Minute

SELECT
    toStartOfMinute(timestamp) AS minute,
    count(*) AS requests
FROM logs
GROUP BY minute
ORDER BY minute;

Identifying Slow Endpoints

SELECT
    endpoint,
    avg(duration_ms) AS average_latency
FROM logs
GROUP BY endpoint
ORDER BY average_latency DESC;

These types of analytical workloads are exactly where ClickHouse® excels.

Performance Best Practices

Using ClickHouse® effectively requires understanding several operational characteristics.

Avoid Small Inserts

Writing one log record at a time creates many tiny data parts.

This increases background merge activity and reduces overall performance.

Whenever possible, batch log inserts before writing them to ClickHouse®.

Choose an Appropriate Sorting Key

The sorting key has one of the biggest impacts on query performance.

Because changing it later is expensive, spend time understanding your query patterns before selecting it.

Configure TTL Policies

Logs generally have a limited retention period.

Instead of manually deleting old data, configure a TTL policy.

Example:

TTL timestamp + INTERVAL 30 DAY

Expired records are automatically removed during background merges.

Monitor Background Merges

ClickHouse® continuously merges smaller parts into larger ones.

Heavy ingestion workloads can increase:

CPU utilization
Memory usage
Disk bandwidth

Monitoring merge activity is essential for maintaining consistent performance in production.

Limitations

Although ClickHouse® is an excellent analytical database, it is not a complete observability platform.

Some important limitations include:

It does not collect logs by itself.
It does not replace distributed tracing systems.
It does not provide alerting.
It does not include built-in dashboards.
Full-text search differs from dedicated search engines like Elasticsearch.
Poor schema design can significantly impact performance.

For these reasons, organizations typically integrate ClickHouse® with log collectors, telemetry pipelines, and visualization platforms instead of using it as a standalone solution.

When Should You Use ClickHouse®?

ClickHouse® is an excellent choice if your workload involves:

High-volume log ingestion
Time-series event storage
Interactive analytical queries
Cost-efficient long-term log retention
Dashboard-driven operational analytics

It may not be the best option if your primary requirement is:

Advanced full-text search
Document indexing
A complete observability platform without additional tooling

Conclusion

ClickHouse® has become one of the most compelling databases for large-scale log analytics because its columnar architecture closely matches how engineers analyze operational data.

Rather than reading entire log records, it efficiently scans only the required columns, enabling extremely fast aggregations over massive datasets while maintaining excellent storage efficiency through compression.

DEV Community