Kanishga Subramani

Posted on Jul 3

Day 53: ClickHouse® Data Sampling Explained – Query Billions of Rows Faster

#clickhouse #devops #database #dataengineering

ClickHouse® Data Sampling: Querying Billions of Rows Fast

Day 53 of #100DaysOfClickHouse

Modern analytics platforms routinely manage billions—or even trillions—of rows of data. Although ClickHouse® is built to execute analytical queries at incredible speed, there are many situations where an approximate answer is more valuable than waiting for an exact one.

Imagine you're monitoring application metrics, exploring user behavior, or checking whether a marketing campaign is performing as expected. In these scenarios, getting an answer in a few hundred milliseconds can be far more useful than waiting several seconds for perfect accuracy.

This is exactly where data sampling becomes valuable.

Instead of reading every row in a table, ClickHouse® can process only a portion of the data and generate approximate results. This reduces disk I/O, CPU usage, and query execution time, making exploratory analytics significantly faster.

However, sampling is not a universal optimization. It comes with a trade-off between speed and accuracy, and it only works efficiently when your table is designed to support it.

In this article, we'll explore how data sampling works in ClickHouse®, when it should be used, and the best practices for designing tables that take advantage of it.

What Is Data Sampling?

Data sampling is the process of executing a query against only a subset of the available data rather than scanning every row.

Consider a table containing 10 billion events.

Without sampling, ClickHouse® reads all 10 billion rows to produce an exact result.

With sampling enabled, ClickHouse® may read only 10% of the data and return an approximate answer based on that subset.

Because less data is processed, sampling typically provides several benefits:

Faster query execution
Lower CPU utilization
Reduced disk I/O
Lower overall resource consumption

The trade-off is straightforward:

The results are approximate rather than exact.

Whether this approximation is acceptable depends entirely on your use case.

How Sampling Works in ClickHouse®

One common misconception is that ClickHouse® randomly selects rows every time a sampled query runs.

That's not how sampling works.

Instead, efficient sampling depends on how the table was designed.

For MergeTree tables, you specify a sampling key using the SAMPLE BY clause during table creation.

CREATE TABLE events
(
    user_id UInt64,
    event_time DateTime,
    event_name String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
SAMPLE BY user_id;

The sampling key determines how data is organized for efficient sampling.

Later, you can query a portion of the table using:

SELECT count()
FROM events
SAMPLE 0.1;

Rather than randomly inspecting rows, ClickHouse® deterministically reads approximately 10% of the data based on the sampling key.

This deterministic behavior provides an important advantage:

Running the same sampled query multiple times returns the same subset of data, provided the underlying data has not changed.

This makes sampled queries consistent and predictable.

Why the `SAMPLE BY` Clause Matters

The SAMPLE clause only performs efficiently when the table has been created with an appropriate sampling expression.

Without a sampling key, ClickHouse® has no efficient way to determine which portions of the data should be read.

Simply adding:

SAMPLE 0.1

to every query will not automatically improve performance.

Sampling is a table design feature rather than a query optimization that can be added later.

If you expect to use sampling regularly, it should be considered during schema design.

Understanding the `SAMPLE` Clause

The most common syntax is:

SELECT *
FROM events
SAMPLE 0.1;

Here, 0.1 instructs ClickHouse® to process approximately 10% of the data.

Other examples include:

SAMPLE 0.25

which processes roughly 25% of the data.

Or:

SAMPLE 0.5

which reads approximately half the table.

Larger sampling ratios generally provide more accurate results but require reading more data.

Smaller ratios improve speed while increasing estimation error.

There is no universally correct sampling ratio—the ideal choice depends on your workload and acceptable level of approximation.

Choosing a Good Sampling Key

Selecting the right sampling key is one of the most important design decisions.

An effective sampling key should generally have:

High cardinality
Even distribution
Stable values

Common examples include:

user_id
session_id
device_id

These columns usually distribute rows evenly across the dataset, producing representative samples.

In contrast, low-cardinality columns such as:

country
status
gender

may create biased or uneven samples depending on the underlying data distribution.

Choosing the right sampling key is often just as important as selecting the sampling ratio.

Speed vs Accuracy

Sampling always involves a trade-off.

Full Scan	Sampled Query
Exact results	Approximate results
Reads all matching rows	Reads only a subset
Higher resource usage	Lower resource usage
Best for reporting	Best for exploration

The quality of sampled results depends on several factors:

Sampling ratio
Sampling key
Data distribution
Aggregation being performed

For example, calculating an average from a representative sample often produces reliable estimates.

However, estimating very rare events using a small sample may lead to significant inaccuracies.

When Should You Use Data Sampling?

Sampling is most valuable when approximate answers are sufficient.

Common use cases include:

Interactive analytics
Dashboard exploration
Product analytics
Monitoring event streams
Trend analysis
Investigating large datasets before running exact queries

In these situations, reducing query latency often provides more value than perfect precision.

When Should You Avoid Sampling?

Some workloads require exact answers.

Sampling should generally be avoided for:

Financial reporting
Billing systems
Compliance reporting
Tax calculations
Exact customer counts
Regulatory reporting

Whenever correctness is more important than execution speed, a full scan remains the better choice.

Performance Expectations

It's tempting to assume that sampling always produces dramatic speed improvements.

In reality, the performance gain depends on several factors, including:

Table design
Sampling ratio
Query complexity
Data distribution
Storage layout
Available hardware

Queries that already read only a small portion of the table because of highly selective filters may gain little from sampling.

Similarly, very small tables often don't benefit enough for sampling to justify the loss in accuracy.

Instead of assuming a fixed improvement, benchmark sampled and non-sampled queries against your own workload.

Best Practices

If you plan to use data sampling in production, consider these recommendations:

Design tables with an appropriate SAMPLE BY expression from the beginning.
Choose high-cardinality sampling keys whenever possible.
Periodically compare sampled results against full scans to measure estimation error.
Increase the sampling ratio when higher accuracy is required.
Clearly label sampled metrics as approximate in dashboards and reports.

These practices help users understand the limitations of sampled data while still benefiting from faster query execution.

Conclusion

ClickHouse® data sampling is a powerful feature for accelerating analytical queries when approximate answers are acceptable.

By processing only a subset of the data, ClickHouse® can significantly reduce disk I/O, CPU usage, and query latency, making exploratory analytics much more responsive. When combined with a well-designed SAMPLE BY expression and an appropriate high-cardinality sampling key, sampling becomes an effective tool for working with massive datasets.

However, sampling is not a replacement for exact analytics. It introduces estimation error, requires thoughtful schema design, and should not be used for workloads where every record matters.

Like many performance optimization techniques, sampling is most effective when applied intentionally. Understanding its strengths, limitations, and design considerations allows you to make informed decisions about when fast approximate answers are sufficient—and when a full scan is the better choice.

If you're following along, this is Day 53 of #100DaysOfClickHouse, where I'm exploring one ClickHouse® concept every day and sharing practical examples, best practices, and real-world use cases.

DEV Community

Day 53: ClickHouse® Data Sampling Explained – Query Billions of Rows Faster

ClickHouse® Data Sampling: Querying Billions of Rows Fast

What Is Data Sampling?

How Sampling Works in ClickHouse®

Why the `SAMPLE BY` Clause Matters

Understanding the `SAMPLE` Clause

Choosing a Good Sampling Key

Speed vs Accuracy

When Should You Use Data Sampling?

When Should You Avoid Sampling?

Performance Expectations

Best Practices

Conclusion

Top comments (0)

ClickHouse® Data Sampling: Querying Billions of Rows Fast

What Is Data Sampling?

How Sampling Works in ClickHouse®

Why the SAMPLE BY Clause Matters

Understanding the SAMPLE Clause

Choosing a Good Sampling Key

Speed vs Accuracy

When Should You Use Data Sampling?

When Should You Avoid Sampling?

Performance Expectations

Best Practices

Conclusion

Why the `SAMPLE BY` Clause Matters

Understanding the `SAMPLE` Clause