Clickhouse Source Code Analysis: insert deduplication

#clickhous #olap

Those who don’t know about clickhouse deduplication can first read the link below.
Deduplicating Inserts on Retries

MergeTree settings:

M(UInt64, replicated_deduplication_window, 1000, "How many last blocks of hashes should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, replicated_deduplication_window_for_async_inserts, 10000, "How many last hash values of async_insert blocks should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds_for_async_inserts, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window_for_async_inserts\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, non_replicated_deduplication_window, 0, "How many last blocks of hashes should be kept on disk (0 - disabled).", 0) \

query settings:

M(Bool, insert_deduplicate, true, "For INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(Bool, async_insert_deduplicate, false, "For async INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(String, insert_deduplication_token, "", "If not empty, used for duplicate detection instead of data digest", 0) \ 
M(Bool, deduplicate_blocks_in_dependent_materialized_views, false, "Should deduplicate blocks for materialized views. Use true to always deduplicate in dependent tables.", 0) \
M(Bool, throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert, true, "Throw exception on INSERT query when the setting `deduplicate_blocks_in_dependent_materialized_views` is enabled along with `async_insert`. It guarantees correctness, because these features can't work together.", 0) \

Before exploring these source codes, I have two questions:
1 How does these settings affected write process?
2 Is insert_deduplication_token settings useful for MATERIALIZED VIEW?

insert_deduplicate

Starting from setting insert_deduplicate

When committing parts, generate a block_id from part data and then check whether block_id contains in deduplication_log.
Let's dig into how block_id is generated.

Only parts with level 0 (which means parts are generated from inserts instead of merges) can generate a block_id.
If block_dedup_token is empty, we just get a hash value from checksum files. If block_dedup_token is not empty, this means hash value is already calculated and can be used directly.

The final block_id is like _, there is no table's info here. So same data part inserting into different tables may also be deduplicated? That doesn't make sense.

Let's figure out this questions by a demo.

CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
CREATE TABLE dst_1
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET insert_deduplicate=1;

INSERT INTO dst SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
INSERT INTO dst_1 SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);

Data is not deduplicated for different tables.

We can see each table has it's own deduplication log, so data is not deduplicated for different tables.

async_insert_deduplicate is for ReplicatedMergeTree. For async INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed.

Make deduplication check. If a duplicate is detected, no nodes are created.

Instead of store block_id in memory, store block_id in zk.

deduplication_window

replicated_deduplication_window
Replicated means block_ids are stored on zk, this setting set the max number of block_ids stored on zk.
ReplicatedMergeTreeCleanupThread on leader replica will do old block_id cleaning works.

replicated_deduplication_window_seconds and replicated_deduplication_window are all used in cleaning block_ids.

non_replicated_deduplication_window is used to control size of deduplication_log in memory.

insert_deduplication_token

When set insert_deduplication_token, SetUserTokenTransform will be added to pipeline.

SetUserTokenTransform set user-defined token to chunk's tokenInfo, which will be used to generate block_id instead of checksum hash of bin files.

deduplication in materialized_views

deduplicate_blocks_in_dependent_materialized_views controls whether we will do deduplication in materialized_views.

Materialized views use MergeTreeSink as source. So data is not deduplicated if the source data was different.

throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert
Let's agree on terminology and say that a mini-INSERT is an asynchronous INSERT which typically contains not a lot of data inside and a big-INSERT in an INSERT which was formed by concatenating several mini-INSERTs together. In case when the client had to retry some mini-INSERTs then they will be properly deduplicated by the source tables. But then they will be glued together into a block and pushed through a chain of Materialized Views if any.The process of forming such blocks is not deterministic so each time we retry mini-INSERTs the resulting block may be concatenated differently. That's why deduplication in dependent Materialized Views doesn't make sense in presence of async INSERTs.

Is insert_deduplication_token settings useful for MATERIALIZED VIEW? Let's test.

CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;

CREATE MATERIALIZED VIEW mv_dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000
AS SELECT
    0 AS key,
    value AS value
FROM dst;

SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET deduplicate_blocks_in_dependent_materialized_views=1;

INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
# duplicated data
INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';

It works, no duplicated data is inserted.