<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: abel-cheng</title>
    <description>The latest articles on DEV Community by abel-cheng (@cheng-w).</description>
    <link>https://dev.to/cheng-w</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2722819%2Fdc690906-ecfb-4f13-b343-17a1a0723e35.png</url>
      <title>DEV Community: abel-cheng</title>
      <link>https://dev.to/cheng-w</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cheng-w"/>
    <language>en</language>
    <item>
      <title>Clickhouse Source Code Analysis: insert deduplication</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Thu, 23 Jan 2025 07:46:27 +0000</pubDate>
      <link>https://dev.to/cheng-w/clickhouse-source-code-analysis-insert-deduplication-3jep</link>
      <guid>https://dev.to/cheng-w/clickhouse-source-code-analysis-insert-deduplication-3jep</guid>
      <description>&lt;p&gt;Those who don’t know about clickhouse deduplication can first read the link below.&lt;br&gt;
&lt;a href="https://clickhouse.com/docs/en/guides/developer/deduplicating-inserts-on-retries" rel="noopener noreferrer"&gt;Deduplicating Inserts on Retries&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MergeTree settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M(UInt64, replicated_deduplication_window, 1000, "How many last blocks of hashes should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, replicated_deduplication_window_for_async_inserts, 10000, "How many last hash values of async_insert blocks should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds_for_async_inserts, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window_for_async_inserts\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, non_replicated_deduplication_window, 0, "How many last blocks of hashes should be kept on disk (0 - disabled).", 0) \

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;query settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M(Bool, insert_deduplicate, true, "For INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(Bool, async_insert_deduplicate, false, "For async INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(String, insert_deduplication_token, "", "If not empty, used for duplicate detection instead of data digest", 0) \ 
M(Bool, deduplicate_blocks_in_dependent_materialized_views, false, "Should deduplicate blocks for materialized views. Use true to always deduplicate in dependent tables.", 0) \
M(Bool, throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert, true, "Throw exception on INSERT query when the setting `deduplicate_blocks_in_dependent_materialized_views` is enabled along with `async_insert`. It guarantees correctness, because these features can't work together.", 0) \

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before exploring the source code, I have two questions:&lt;br&gt;
1 How do these settings affect the write process?&lt;br&gt;
2 Is the insert_deduplication_token setting useful for a MATERIALIZED VIEW?&lt;/p&gt;
&lt;h2&gt;
  
  
  insert_deduplicate
&lt;/h2&gt;

&lt;p&gt;Starting from setting &lt;code&gt;insert_deduplicate&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy73z78kr5ow5ntowm0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy73z78kr5ow5ntowm0t.png" alt="Image description" width="800" height="625"&gt;&lt;/a&gt;&lt;br&gt;
When committing a part, a block_id is generated from the part data and then checked against the deduplication_log.&lt;br&gt;
Let's dig into how block_id is generated.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjeq065jpggaxxqms1j71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjeq065jpggaxxqms1j71.png" alt="Image description" width="800" height="615"&gt;&lt;/a&gt;&lt;br&gt;
Only parts at level 0 (meaning parts generated by inserts rather than merges) get a block_id.&lt;br&gt;
If block_dedup_token is empty, the hash value is simply computed from the checksum files. If block_dedup_token is not empty, the hash value has already been calculated and can be used directly.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tfk9faisjt92p7lthh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tfk9faisjt92p7lthh9.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;br&gt;
The final block_id is like _; it contains no table info. So could the same data part inserted into different tables also be deduplicated? That doesn't make sense.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F082q5znib2mx99v5il0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F082q5znib2mx99v5il0b.png" alt="Image description" width="800" height="643"&gt;&lt;/a&gt;&lt;br&gt;
Let's answer this question with a demo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
CREATE TABLE dst_1
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET insert_deduplicate=1;

INSERT INTO dst SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
INSERT INTO dst_1 SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is not deduplicated for different tables.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp78g1bvjrw72qoxvgcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp78g1bvjrw72qoxvgcb.png" alt="Image description" width="682" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see each table has its own deduplication log, so data is not deduplicated across different tables.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9q983wpqxnaq2fs7leu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9q983wpqxnaq2fs7leu.png" alt="Image description" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;async_insert_deduplicate&lt;/code&gt; is for ReplicatedMergeTree: for async INSERT queries into a replicated table, it specifies that deduplication of inserted blocks should be performed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnejyqjr04crttljlncsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnejyqjr04crttljlncsh.png" alt="Image description" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A deduplication check is performed; if a duplicate is detected, no nodes are created.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzmigxlbknbnoupiz53f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzmigxlbknbnoupiz53f.png" alt="Image description" width="800" height="538"&gt;&lt;/a&gt;&lt;br&gt;
Instead of storing block_ids in memory, they are stored in zk.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncit3sh52qki0bgnb2ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncit3sh52qki0bgnb2ob.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  deduplication_window
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;replicated_deduplication_window&lt;/code&gt; &lt;br&gt;
Replicated means the block_ids are stored in zk; this setting sets the maximum number of block_ids kept there.&lt;br&gt;
The ReplicatedMergeTreeCleanupThread on the leader replica cleans up old block_ids.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyxt30tuhqtl213gqfwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyxt30tuhqtl213gqfwk.png" alt="Image description" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;replicated_deduplication_window_seconds&lt;/code&gt; and &lt;code&gt;replicated_deduplication_window&lt;/code&gt; are both used when cleaning up block_ids.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoof4pwikb5493sxmy6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoof4pwikb5493sxmy6u.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;non_replicated_deduplication_window&lt;/code&gt; controls the size of the in-memory deduplication_log.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faz06q3pxva3i58e2ibbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faz06q3pxva3i58e2ibbe.png" alt="Image description" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  insert_deduplication_token
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;insert_deduplication_token&lt;/code&gt; is set, a SetUserTokenTransform is added to the pipeline.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qzw8nkxpirx15kpubt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qzw8nkxpirx15kpubt.png" alt="Image description" width="800" height="516"&gt;&lt;/a&gt;&lt;br&gt;
SetUserTokenTransform sets the user-defined token on the chunk's tokenInfo, which is then used to generate the block_id instead of the checksum hash of the bin files.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji2jxtie9nhqfuyigayd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji2jxtie9nhqfuyigayd.png" alt="Image description" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  deduplication in materialized_views
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;deduplicate_blocks_in_dependent_materialized_views&lt;/code&gt; controls whether deduplication is performed in materialized views.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6xr7x5ixgacaa5ioot5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6xr7x5ixgacaa5ioot5.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr26ugdkuiwa1touywaes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr26ugdkuiwa1touywaes.png" alt="Image description" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Materialized views use MergeTreeSink as the source, so data is not deduplicated if the source data is different.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk414f8onmbworiot69nx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk414f8onmbworiot69nx.png" alt="Image description" width="800" height="574"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gz6jr02mjborn5smrlh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gz6jr02mjborn5smrlh.png" alt="Image description" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert&lt;/code&gt;&lt;br&gt;
Let's agree on terminology and say that a mini-INSERT is an asynchronous INSERT which typically contains little data, and a big-INSERT is an INSERT formed by concatenating several mini-INSERTs together. If the client had to retry some mini-INSERTs, they will be properly deduplicated by the source tables. But then they are glued together into a block and pushed through a chain of materialized views, if any. The process of forming such blocks is not deterministic, so each time the mini-INSERTs are retried the resulting block may be concatenated differently. That's why deduplication in dependent materialized views doesn't make sense in the presence of async INSERTs.&lt;/p&gt;
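
&lt;p&gt;Based on the setting's description, the guard can be triggered like this (a sketch; dst as in the earlier demos):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET async_insert = 1;
SET deduplicate_blocks_in_dependent_materialized_views = 1;

-- With the default
-- throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert = 1,
-- this INSERT throws an exception instead of performing
-- non-deterministic deduplication in the dependent views.
INSERT INTO dst VALUES (0, 'A');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;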

&lt;p&gt;Is the insert_deduplication_token setting useful for a MATERIALIZED VIEW? Let's test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;

CREATE MATERIALIZED VIEW mv_dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000
AS SELECT
    0 AS key,
    value AS value
FROM dst;

SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET deduplicate_blocks_in_dependent_materialized_views=1;

INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
-- duplicated data
INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works, no duplicated data is inserted.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh93cio46rhgaqet1ojj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh93cio46rhgaqet1ojj.png" alt="Image description" width="698" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>clickhous</category>
      <category>olap</category>
    </item>
    <item>
      <title>Clickhouse Source Code Analysis: How is primary key generated and used?</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Mon, 20 Jan 2025 03:50:47 +0000</pubDate>
      <link>https://dev.to/cheng-w/clickhouse-source-code-analysis-how-is-primary-key-generated-and-used-46ei</link>
      <guid>https://dev.to/cheng-w/clickhouse-source-code-analysis-how-is-primary-key-generated-and-used-46ei</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;First, let's look at some details of the primary key through a demo.&lt;br&gt;
Demo table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE helloworld.my_first_table
(
    `user_id` UInt32,
    `message` String,
    `timestamp` DateTime,
    `metric` Float32,
    INDEX message_idx message TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 1
)
ENGINE = MergeTree
PRIMARY KEY (user_id, toStartOfTenMinutes(timestamp))
SETTINGS index_granularity = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;index_granularity is set to 2 so that we get many mark ranges even with limited demo data.&lt;/p&gt;

&lt;p&gt;Insert some data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 );
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (103, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (103, 'Sort your data based on your commonly-used queries', today() - 1,     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filtering on user_id in WHERE, we can see 3 marks are filtered out.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb4wt6nitus45caywf4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb4wt6nitus45caywf4t.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With timestamp &amp;lt; today() or toStartOfTenMinutes(timestamp) &amp;lt; today(), 2 marks are filtered out.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq94qxw7m1dvb85zh1beq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq94qxw7m1dvb85zh1beq.png" alt="Image description" width="800" height="306"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9lazl5g59a60bicy50d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9lazl5g59a60bicy50d.png" alt="Image description" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, the conclusion is that even if we do not filter on the first position of the sort key, data skipping can still take effect. How does this happen?&lt;/p&gt;
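
&lt;p&gt;The skipped ranges can be inspected directly; a sketch using EXPLAIN (the output shape varies by ClickHouse version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN indexes = 1
SELECT count()
FROM helloworld.my_first_table
WHERE toStartOfTenMinutes(timestamp) &amp;lt; today();
-- The PrimaryKey section of the output reports how many granules
-- were selected out of the total.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;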

&lt;p&gt;Next, we will explore the principles of PK, including the working mechanism in writing and querying.&lt;/p&gt;


&lt;h2&gt;
  
  
  Writing of primary key
&lt;/h2&gt;

&lt;p&gt;Write entry point&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7olfwpp9yrvhntd2xgfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7olfwpp9yrvhntd2xgfl.png" alt="Image description" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1 Generate the sort-by and other expressions from the data block; here we get toStartOfTenMinutes(timestamp).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipooewnd3m6afw3o3l1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipooewnd3m6afw3o3l1x.png" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3ixb3sks83jsh8jto8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3ixb3sks83jsh8jto8.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2 Reserve space on disk and create part object.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvcfpusasabm1o3a16r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvcfpusasabm1o3a16r0.png" alt="Image description" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3 Write data to output stream.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn72mho4cwk5kmkmd3y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn72mho4cwk5kmkmd3y1.png" alt="Image description" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4 Finalize data on disk, including flushing all data, writing minmax for partition keys, uuid.txt, partition.dat, checksum.txt, count.txt.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zytmrqajzhjqqssm2zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zytmrqajzhjqqssm2zn.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary key writing logic is inside the write function, which has two implementations: one for compact parts and one for wide parts.&lt;/p&gt;
&lt;h1&gt;
  
  
  Compact part
&lt;/h1&gt;

&lt;p&gt;Primary keys are written to files in the function writeDataBlockPrimaryIndexAndSkipIndices.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ht9vvbd3jvf2qzwkjmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ht9vvbd3jvf2qzwkjmi.png" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5yfbkkra74dlze6u8er.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5yfbkkra74dlze6u8er.png" alt="Image description" width="800" height="706"&gt;&lt;/a&gt;&lt;br&gt;
Writing a data block in compact mode is like the traditional Parquet format: columns of the same granule are placed together.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0ne9jllyqw9ndjcbday.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0ne9jllyqw9ndjcbday.png" alt="Image description" width="800" height="232"&gt;&lt;/a&gt;&lt;br&gt;
Each granule contributes one row to the primary index.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9h48i7392gvpcygn8yb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9h48i7392gvpcygn8yb.png" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;
Write the primary index block, which contains the primary index expressions (toStartOfTenMinutes(timestamp) in this demo).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5jdxe5e9206y56tvmu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5jdxe5e9206y56tvmu1.png" alt="Image description" width="800" height="385"&gt;&lt;/a&gt;&lt;br&gt;
Multiple streams are created on the same data file for different compression codecs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F495nvdmtkbmdgyeesz65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F495nvdmtkbmdgyeesz65.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Wide part
&lt;/h1&gt;

&lt;p&gt;For wide parts, the primary key follows the same logic, since both writers extend WriterOnDisk.&lt;br&gt;
Columns are written separately: all granules of column1 are written first, then column2, and so on.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k7xe9ts4zr316h2by1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k7xe9ts4zr316h2by1v.png" alt="Image description" width="800" height="679"&gt;&lt;/a&gt;&lt;br&gt;
Each column has its own data file and mark file.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkabk0v6vqvk6oocmu09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkabk0v6vqvk6oocmu09.png" alt="Image description" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Read process (to be done)
&lt;/h2&gt;

&lt;p&gt;The read entry point:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig63u85nezdttm00mhic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig63u85nezdttm00mhic.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the filter logic is in the function markRangesFromPKRange, which is invoked on each part, one by one.&lt;br&gt;
Function description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Calculates a set of mark ranges, that could possibly contain keys, required by condition.
In other words, it removes subranges from whole range, that definitely could not contain required keys.
If @exact_ranges is not null, fill it with ranges containing marks of fully matched records.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the logic that decides whether a mark range may contain the requested data.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2brrcq0dlida9pqacskb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2brrcq0dlida9pqacskb.png" alt="Image description" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within a single part, data is already ordered by primary key, so we can use binary search to find the marks for the left and right bounds.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuqxnlb8z904d87iinbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuqxnlb8z904d87iinbq.png" alt="Image description" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the main filter logic is in key_condition's checkInRange function. The key condition is built from the indexes.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbihkjab1x8nyzedcon3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbihkjab1x8nyzedcon3m.png" alt="Image description" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;key_condition is built like this from the primary key expressions.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2k51a4759kiqcrbfyah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2k51a4759kiqcrbfyah.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The binary search above applies only when the query uses a contiguous prefix of the primary key columns (here we filter on user_id, and the primary key is (user_id, timestamp)).&lt;/p&gt;

&lt;p&gt;If instead we filter on timestamp only, the logic is totally different.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62mzidbkoj2h2rne36mk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62mzidbkoj2h2rne36mk.png" alt="Image description" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In that case an exclusion search is done, dropping the ranges that definitely cannot match.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69ey3td17ijmenhzw2f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69ey3td17ijmenhzw2f7.png" alt="Image description" width="800" height="691"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Some personal thoughts
&lt;/h2&gt;

&lt;p&gt;Clickhouse records the primary keys of the first row of each granule and then uses these keys for filtering when querying.&lt;br&gt;
This is not suitable for some extensions, such as z-order.&lt;br&gt;
Z-order works better with minmax indexes, where we can filter a mark by the min and max values of a granule.&lt;br&gt;
Clickhouse currently records minmax only for columns used in partitioning; maybe the minmax indexes can be extended to primary key columns in the future.&lt;/p&gt;
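&lt;p&gt;To illustrate the minmax idea (a hypothetical sketch of the proposal, not existing ClickHouse behavior for primary keys):&lt;/p&gt;

```python
# With per-granule min/max recorded, any granule whose [mn, mx] interval does
# not overlap the queried [lo, hi] range can be dropped, regardless of how
# rows are ordered inside the part (which is what z-order needs).
def prune_by_minmax(granule_minmax, lo, hi):
    """Keep granule indices whose [mn, mx] overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(granule_minmax)
            if mx >= lo and mn <= hi]

granules = [(0, 15), (3, 9), (20, 30), (28, 50)]
print(prune_by_minmax(granules, 10, 25))  # [0, 2]
```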

</description>
      <category>clickhouse</category>
      <category>olap</category>
    </item>
    <item>
      <title>Clickhouse Source Code Analysis: sample by</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Fri, 17 Jan 2025 08:13:34 +0000</pubDate>
      <link>https://dev.to/cheng-w/clickhouse-source-code-analysis-sample-by-2c5e</link>
      <guid>https://dev.to/cheng-w/clickhouse-source-code-analysis-sample-by-2c5e</guid>
      <description>&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#sample-by" rel="noopener noreferrer"&gt;https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#sample-by&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How is sample by implemented in clickhouse?&lt;/p&gt;

&lt;p&gt;To use SAMPLE in clickhouse, we need to add a sample by expression when creating the MergeTree table.&lt;/p&gt;

&lt;p&gt;If SAMPLE BY is specified, it must be contained in the primary key. The sampling expression must result in an unsigned integer.&lt;/p&gt;

&lt;p&gt;Example: SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)).&lt;/p&gt;

&lt;p&gt;SAMPLE BY doesn't change the data layout of MergeTree; the sample by expression is only used for syntax validation and during the select process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7crxxpa2nilfgd3emve9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7crxxpa2nilfgd3emve9.png" alt="Image description" width="800" height="627"&gt;&lt;/a&gt;&lt;br&gt;
In the select process, the sample phase happens before mark ranges are filtered.&lt;/p&gt;

&lt;p&gt;Inside the sample phase, two important things happen:&lt;br&gt;
1. Get the actual sample range.&lt;br&gt;
2. Add the sample range to the key conditions, so the sample can be used to skip data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawdhnurhf7557xx51plu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawdhnurhf7557xx51plu.png" alt="Transform sample by rows" width="800" height="704"&gt;&lt;/a&gt;&lt;br&gt;
The code above transforms a sample row count into a sample fraction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv7v51j06rzeqohhp9sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv7v51j06rzeqohhp9sg.png" alt="Get max value" width="800" height="666"&gt;&lt;/a&gt;&lt;br&gt;
Then calculate the upper bound of the sample by expression, which is the maximum value of the unsigned integer type (255 for UInt8, 65535 for UInt16 and so on).&lt;br&gt;
The lower bound is defined by the sample offset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkty8waub2wu6ez1z1log.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkty8waub2wu6ez1z1log.png" alt="Add sample filter conditions" width="800" height="689"&gt;&lt;/a&gt;&lt;br&gt;
Add conditions based on the lower and upper bounds calculated above.&lt;br&gt;
These conditions are then used to filter data in the filterPartsByPrimaryKeyAndSkipIndexes function.&lt;/p&gt;
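&lt;p&gt;The bound calculation amounts to something like this sketch (simplified; it assumes the UInt32 domain of intHash32 and ignores rounding details of the real implementation):&lt;/p&gt;

```python
# SAMPLE k OFFSET m becomes a half-open slice of the sample-by expression's
# unsigned-integer domain, added to the key condition as
# lower <= intHash32(UserID) < upper.
UINT32_MAX = 2**32 - 1  # domain of intHash32

def sample_bounds(fraction, offset=0.0, type_max=UINT32_MAX):
    size = type_max + 1
    lower = int(size * offset)
    upper = int(size * (offset + fraction))
    return lower, upper

print(sample_bounds(0.1))        # first 10% of the hash domain
print(sample_bounds(0.1, 0.5))   # a 10% slice starting at the midpoint
```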

&lt;p&gt;The entire implementation is surprisingly simple.&lt;br&gt;
We can see that sampling by a low-cardinality field is not useful.&lt;br&gt;
When sampling, we should apply a hash function like intHash32 even if the field is already an unsigned integer. For example, if the sample by field's type is UInt8 (0-255) but the actual values are 0-25, SAMPLE 0.1 may end up reading all the data. It would be better to calculate the upper and lower bounds from the mark file rather than directly using the maximum value of the field type. Of course, if you apply the hash function, you won't have this problem.&lt;/p&gt;
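&lt;p&gt;A worked example of that pitfall (toy arithmetic, mirroring the bound calculation described above):&lt;/p&gt;

```python
# SAMPLE 0.1 on a raw UInt8 column whose actual values only span 0..25:
# the cut-off is computed from the *type* range (0..255), so the resulting
# condition keeps every value below 25.6 -- i.e. every row.
def sample_cutoff(fraction, type_max):
    return (type_max + 1) * fraction  # condition: value < cutoff

values = list(range(26))                 # actual data: 0..25
cutoff = sample_cutoff(0.1, 255)         # 25.6
kept = [v for v in values if v < cutoff]
print(len(kept) / len(values))           # 1.0 -- the "sample" reads everything
```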

&lt;p&gt;When sampling doesn't improve performance as much as expected, it may be because the sample by expression ranks low in the order by sequence. For example, with order by (a, b, c), if a and b are not used in the where clause, sampling by c may not skip much data.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>code</category>
    </item>
    <item>
      <title>Build clickhouse remote development environment with vscode(v24.8.11.5-lts)</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Thu, 16 Jan 2025 11:15:35 +0000</pubDate>
      <link>https://dev.to/cheng-w/build-clickhouse-remote-development-environment-with-vscodev248115-lts-3599</link>
      <guid>https://dev.to/cheng-w/build-clickhouse-remote-development-environment-with-vscodev248115-lts-3599</guid>
      <description>&lt;h2&gt;
  
  
  1 Build dev docker image
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM docker.io/ubuntu:22.04
RUN rm /bin/sh &amp;amp;&amp;amp; ln -s /bin/bash /bin/sh
RUN apt-get -y update
RUN apt-get install -y curl vim git ssh openssh-server cmake ccache python3 ninja-build nasm yasm gawk lsb-release wget software-properties-common gnupg
RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
RUN touch ~/.ssh/authorized_keys
RUN bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
RUN echo "export CC=clang-18" &amp;gt;&amp;gt; /root/.bashrc
RUN echo "export CXX=clang++-18" &amp;gt;&amp;gt; /root/.bashrc
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
RUN echo "export PATH=~/.cargo/bin:$PATH" &amp;gt;&amp;gt; /root/.bashrc
RUN /root/.cargo/bin/rustup toolchain install nightly-2024-12-01
RUN /root/.cargo/bin/rustup default nightly-2024-12-01
RUN /root/.cargo/bin/rustup component add rust-src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rustup installation may fail due to network problems. Just try again, or write an automatic retry script.&lt;/p&gt;

&lt;p&gt;After docker build, you will get a development image, named xxx-clickhouse-dev-env:24.8 in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  2 Download clickhouse code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:ClickHouse/ClickHouse.git &amp;lt;workspace&amp;gt;/clickhouse-24
cd &amp;lt;workspace&amp;gt;/clickhouse-24
git checkout v24.8.11.5-lts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download all submodules&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh
while :
do
git submodule update --init --recursive --force
sleep 1
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop is because the code clone is unstable and may fail.&lt;br&gt;
When the actual clone does not appear in the log, manually kill the above process.&lt;/p&gt;
&lt;h2&gt;
  
  
  3 Set up development environment
&lt;/h2&gt;
&lt;h1&gt;
  
  
  remote development environment
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;docker run -itd --name xxx -v &amp;lt;workspace&amp;gt;/clickhouse-24:/data/clickhouse --privileged=true -p xxx:22 --cap-add="NET_ADMIN" --security-opt seccomp=unconfined xxx-clickhouse-dev-env:24.8 bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Execute the following commands in the docker container to enable ssh.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo xxx &amp;gt;&amp;gt; ~/.ssh/authorized_keys # Write the token in your local id_rsa.pub
/etc/init.d/ssh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  local development environment
&lt;/h1&gt;

&lt;p&gt;Download vscode from &lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;https://code.visualstudio.com/&lt;/a&gt;.&lt;br&gt;
Next, install the Remote-SSH extension so we can connect to the remote server from vscode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5z3nxwjuwtdw55ptnkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5z3nxwjuwtdw55ptnkq.png" alt="new ssh server entrance 1" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmioj7stdq72xpwkxkj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmioj7stdq72xpwkxkj9.png" alt="new ssh server entrance 2" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1fuye32vzbxu96i5ufa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1fuye32vzbxu96i5ufa.png" alt="new ssh server entrance 3" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a new ssh host to the configuration like below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host clickhouse-24
    HostName &amp;lt;ip&amp;gt;
    User root
    Port &amp;lt;port&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvimzok98inpna80pq0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvimzok98inpna80pq0z.png" alt="ssh connect" width="800" height="277"&gt;&lt;/a&gt;&lt;br&gt;
Then you can connect to the server as clickhouse-24 and open the code directory /data/clickhouse.&lt;/p&gt;

&lt;p&gt;Next we need to install the necessary extensions on the remote server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimsq3ouaafrsywushn0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimsq3ouaafrsywushn0c.png" alt="install extensions" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Compile
&lt;/h2&gt;

&lt;p&gt;Create a .vscode directory in the working directory and add a tasks.json file with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "2.0.0",
    "tasks": [
        {
            "type": "shell",
            "label": "cmake",
            "group": "build",
            "command": "cmake -DCMAKE_BUILD_TYPE=Debug -DENABLE_CCACHE=1 -DCMAKE_C_COMPILER=/usr/lib/llvm-18/bin/clang -DCMAKE_CXX_COMPILER=/usr/lib/llvm-18/bin/clang++ -DCMAKE_PREFIX_PATH=/usr/lib/llvm-18/ -DENABLE_JEMALLOC=ON -DENABLE_TESTS=OFF -DCOMPILER_FLAGS=-DNDEBUG -DWERROR=OFF -G Ninja -B build",
            "options": {
                "cwd": "${workspaceFolder}"
            },
            "problemMatcher": [],
            "presentation": {
                "echo": true,
                "reveal": "always",
                "focus": true,
                "panel": "shared",
                "showReuseMessage": true,
                "clear": false
            }
        },
        {
            "type": "shell",
            "label": "ninja clickhouse",
            "group": "build",
            "command": "ninja clickhouse clickhouse-server clickhouse-client -j16",
            "options": {
                "cwd": "${workspaceFolder}/build"
            }
            // "dependsOn": "cmake",
            },
        {
            "type": "shell",
            "label": "ninja all",
            "group": "build",
            "command": "ninja -j16",
            "options": {
            "cwd": "${workspaceFolder}/build"
            }
            // "dependsOn": "cmake",
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zoyhn5ti2xin8woe3yh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zoyhn5ti2xin8woe3yh.png" alt="Run task" width="800" height="271"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F925tq52n88mf6hz4qqh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F925tq52n88mf6hz4qqh3.png" alt="tasks" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the cmake task first and then ninja clickhouse; the build will take several hours.&lt;/p&gt;

&lt;p&gt;Add a settings.json file in the .vscode directory to enable code navigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "clangd.path": "/usr/lib/llvm-18/bin/clangd",
    "clangd.checkUpdates": false,
    "clangd.arguments": [
        "--background-index",
        "--compile-commands-dir=build",
        "-j=12",
        "--query-driver=/usr/lib/llvm-18/bin/clang++",
        "--clang-tidy",
        "--clang-tidy-checks=performance-*,bugprone-*",
        "--all-scopes-completion",
        "--completion-style=detailed",
        "--header-insertion=iwyu",
        "--pch-storage=disk"
    ],
    "clangd.onConfigChanged": "restart",
    "lldb.commandCompletions": true,
    "lldb.dereferencePointers": true,
    "lldb.evaluateForHovers": true,
    "lldb.launch.expressions": "simple",
    "lldb.showDisassembly": "never",
    "lldb.verboseLogging": true,
    // cpp_tools Config
    "C_Cpp.autocomplete": "Disabled",
    "C_Cpp.formatting": "Disabled",
    "C_Cpp.errorSquiggles": "Disabled",
    "C_Cpp.intelliSenseEngine": "Disabled",
    "git.ignoreLimitWarning": true,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the word "indexing" appears in the status bar at the bottom, the configuration is successful. If it fails, try restarting the window a few times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxn9i67vuy1q7w9y81b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxn9i67vuy1q7w9y81b8.png" alt="indexing" width="800" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Debug
&lt;/h2&gt;

&lt;p&gt;Start a stand-alone clickhouse-server instance and debug it with breakpoints.&lt;br&gt;
You need to confirm that the LLDB extension is installed.&lt;br&gt;
If the download is too slow, you can download it from the website and install it manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4by6r6wb92fq4jwc98g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4by6r6wb92fq4jwc98g.png" alt="install lldb manually" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a launch.json file in the .vscode directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Clickhouse Server",
            "type": "lldb",
            "request": "launch",
            "program": "${workspaceFolder}/build/programs/clickhouse",
            "args": [
                "server", "--config-file=${workspaceFolder}/programs/server/config.xml"
            ],
            "initCommands": [
                "process handle -p false -s false -n false SIGUSR1",
                "process handle -p false -s false -n false SIGUSR2"
            ],
            "preLaunchTask": "ninja clickhouse",
            // "stopAtEntry": false,
            // "osx": {
            // "MIMode": "lldb"
            // },
            "cwd": "${workspaceFolder}"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch clickhouse-server from here, and you can set breakpoints to debug.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flepgo3v0gobdwvfw42g6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flepgo3v0gobdwvfw42g6.png" alt="debug entrance" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iz752ductjlkzrrkuz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iz752ductjlkzrrkuz1.png" alt="debug panel" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>olap</category>
      <category>clickhouse</category>
      <category>vscode</category>
    </item>
  </channel>
</rss>
