ClickHouse® Data Sampling: Querying Billions of Rows Fast
Day 53 of #100DaysOfClickHouse
Modern analytics platforms routinely manage billions—or even trillions—of rows of data. Although ClickHouse® is built to execute analytical queries at incredible speed, there are many situations where an approximate answer is more valuable than waiting for an exact one.
Imagine you're monitoring application metrics, exploring user behavior, or checking whether a marketing campaign is performing as expected. In these scenarios, getting an answer in a few hundred milliseconds can be far more useful than waiting several seconds for perfect accuracy.
This is exactly where data sampling becomes valuable.
Instead of reading every row in a table, ClickHouse® can process only a portion of the data and generate approximate results. This reduces disk I/O, CPU usage, and query execution time, making exploratory analytics significantly faster.
However, sampling is not a universal optimization. It comes with a trade-off between speed and accuracy, and it only works efficiently when your table is designed to support it.
In this article, we'll explore how data sampling works in ClickHouse®, when it should be used, and the best practices for designing tables that take advantage of it.
What Is Data Sampling?
Data sampling is the process of executing a query against only a subset of the available data rather than scanning every row.
Consider a table containing 10 billion events.
Without sampling, ClickHouse® reads all 10 billion rows to produce an exact result.
With sampling enabled, ClickHouse® may read only 10% of the data and return an approximate answer based on that subset.
Because less data is processed, sampling typically provides several benefits:
- Faster query execution
- Lower CPU utilization
- Reduced disk I/O
- Lower overall resource consumption
The trade-off is straightforward:
The results are approximate rather than exact.
Whether this approximation is acceptable depends entirely on your use case.
How Sampling Works in ClickHouse®
One common misconception is that ClickHouse® randomly selects rows every time a sampled query runs.
That's not how sampling works.
Instead, efficient sampling depends on how the table was designed.
For MergeTree tables, you specify a sampling key using the SAMPLE BY clause during table creation.
CREATE TABLE events
(
user_id UInt64,
event_time DateTime,
event_name String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time)
SAMPLE BY user_id;
The sampling key determines how data is organized for efficient sampling.
Later, you can query a portion of the table using:
SELECT count()
FROM events
SAMPLE 0.1;
Rather than randomly inspecting rows, ClickHouse® deterministically reads approximately 10% of the data based on the sampling key.
This deterministic behavior provides an important advantage:
Running the same sampled query multiple times returns the same subset of data, provided the underlying data has not changed.
This makes sampled queries consistent and predictable.
Why the SAMPLE BY Clause Matters
The SAMPLE clause only performs efficiently when the table has been created with an appropriate sampling expression.
Without a sampling key, ClickHouse® has no efficient way to determine which portions of the data should be read.
Simply adding:
SAMPLE 0.1
to every query will not automatically improve performance.
Sampling is a table design feature rather than a query optimization that can be added later.
If you expect to use sampling regularly, it should be considered during schema design.
Understanding the SAMPLE Clause
The most common syntax is:
SELECT *
FROM events
SAMPLE 0.1;
Here, 0.1 instructs ClickHouse® to process approximately 10% of the data.
Other examples include:
SAMPLE 0.25
which processes roughly 25% of the data.
Or:
SAMPLE 0.5
which reads approximately half the table.
Larger sampling ratios generally provide more accurate results but require reading more data.
Smaller ratios improve speed while increasing estimation error.
There is no universally correct sampling ratio—the ideal choice depends on your workload and acceptable level of approximation.
Choosing a Good Sampling Key
Selecting the right sampling key is one of the most important design decisions.
An effective sampling key should generally have:
- High cardinality
- Even distribution
- Stable values
Common examples include:
user_idsession_iddevice_id
These columns usually distribute rows evenly across the dataset, producing representative samples.
In contrast, low-cardinality columns such as:
countrystatusgender
may create biased or uneven samples depending on the underlying data distribution.
Choosing the right sampling key is often just as important as selecting the sampling ratio.
Speed vs Accuracy
Sampling always involves a trade-off.
| Full Scan | Sampled Query |
|---|---|
| Exact results | Approximate results |
| Reads all matching rows | Reads only a subset |
| Higher resource usage | Lower resource usage |
| Best for reporting | Best for exploration |
The quality of sampled results depends on several factors:
- Sampling ratio
- Sampling key
- Data distribution
- Aggregation being performed
For example, calculating an average from a representative sample often produces reliable estimates.
However, estimating very rare events using a small sample may lead to significant inaccuracies.
When Should You Use Data Sampling?
Sampling is most valuable when approximate answers are sufficient.
Common use cases include:
- Interactive analytics
- Dashboard exploration
- Product analytics
- Monitoring event streams
- Trend analysis
- Investigating large datasets before running exact queries
In these situations, reducing query latency often provides more value than perfect precision.
When Should You Avoid Sampling?
Some workloads require exact answers.
Sampling should generally be avoided for:
- Financial reporting
- Billing systems
- Compliance reporting
- Tax calculations
- Exact customer counts
- Regulatory reporting
Whenever correctness is more important than execution speed, a full scan remains the better choice.
Performance Expectations
It's tempting to assume that sampling always produces dramatic speed improvements.
In reality, the performance gain depends on several factors, including:
- Table design
- Sampling ratio
- Query complexity
- Data distribution
- Storage layout
- Available hardware
Queries that already read only a small portion of the table because of highly selective filters may gain little from sampling.
Similarly, very small tables often don't benefit enough for sampling to justify the loss in accuracy.
Instead of assuming a fixed improvement, benchmark sampled and non-sampled queries against your own workload.
Best Practices
If you plan to use data sampling in production, consider these recommendations:
- Design tables with an appropriate
SAMPLE BYexpression from the beginning. - Choose high-cardinality sampling keys whenever possible.
- Periodically compare sampled results against full scans to measure estimation error.
- Increase the sampling ratio when higher accuracy is required.
- Clearly label sampled metrics as approximate in dashboards and reports.
These practices help users understand the limitations of sampled data while still benefiting from faster query execution.
Conclusion
ClickHouse® data sampling is a powerful feature for accelerating analytical queries when approximate answers are acceptable.
By processing only a subset of the data, ClickHouse® can significantly reduce disk I/O, CPU usage, and query latency, making exploratory analytics much more responsive. When combined with a well-designed SAMPLE BY expression and an appropriate high-cardinality sampling key, sampling becomes an effective tool for working with massive datasets.
However, sampling is not a replacement for exact analytics. It introduces estimation error, requires thoughtful schema design, and should not be used for workloads where every record matters.
Like many performance optimization techniques, sampling is most effective when applied intentionally. Understanding its strengths, limitations, and design considerations allows you to make informed decisions about when fast approximate answers are sufficient—and when a full scan is the better choice.
If you're following along, this is Day 53 of #100DaysOfClickHouse, where I'm exploring one ClickHouse® concept every day and sharing practical examples, best practices, and real-world use cases.
Read more... https://www.quantrail-data.com/clickhouse-data-sampling-querying-billions-of-rows-fast
Top comments (0)