DEV Community

abel-cheng
abel-cheng

Posted on

Clickhouse Source Code Analysis: sample by

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#sample-by

How is sample by implemented in clickhouse?

To use SAMPLE in clickhouse, we need to add sample by expression when we create MergeTree table.

If SAMPLE BY is specified, it must be contained in the primary key. The sampling expression must result in an unsigned integer.

Example: SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)).

Sample by doesn't change the data structure of MergeTree, as the sample by expression is only used for checking grammatical legality and select process.

Image description
In a select process, sample phase happens before filtering mark ranges.

Inside a sample phase, there are two important things:
1 Get the actual sample range.
2 Add the sample range to key conditions, so we can use the sample to skip data.

Transform sample by rows
In above code, it's transforming sample rows to sample fraction.

Get max value
Then calculate the upper bound of the sample by expression, which is the max unsigned integer value(255 for uint8, 65535 for uint16 and so on).
The lower bound is defined by sample offset.

Add sample filter conditions
Add conditions based on the lower and upper bound calculated above.
These conditions will be used in filtering data by filterPartsByPrimaryKeyAndSkipIndexes functions next.

The entire implementation process is surprisingly simple.
We can see that sample by a low cardinality field is not useful in sampling.
When sampling, we need to use hash function like intHash32 even the field is originally unsigned int. For example, if the sample by field range is UInt8 (0-255) and actual values are 0-25, we may read all data when we select sample 0.1. I feel it's better to calculate the upper and lower bounds from the mark file, comparing with directly using the maximum value of the field type. Of course, if you use the hash function directly, you won't have this problem.

When sample doesn't achieve better performance as we wanted, it might be that sample by expression ranks low in order by sequence. Like order by (a, b, c), if we don't use a and b in the where clause, sample by c may not achieve good data skipping effects.

Image of AssemblyAI tool

Challenge Submission: SpeechCraft - AI-Powered Speech Analysis for Better Communication

SpeechCraft is an advanced real-time speech analytics platform that transforms spoken words into actionable insights. Using cutting-edge AI technology from AssemblyAI, it provides instant transcription while analyzing multiple dimensions of speech performance.

Read full post

Top comments (0)

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs

👋 Kindness is contagious

Discover a treasure trove of wisdom within this insightful piece, highly respected in the nurturing DEV Community enviroment. Developers, whether novice or expert, are encouraged to participate and add to our shared knowledge basin.

A simple "thank you" can illuminate someone's day. Express your appreciation in the comments section!

On DEV, sharing ideas smoothens our journey and strengthens our community ties. Learn something useful? Offering a quick thanks to the author is deeply appreciated.

Okay