DEV Community

abel-cheng
abel-cheng

Posted on

Clickhouse Source Code Analysis: insert deduplication

Those who don’t know about clickhouse deduplication can first read the link below.
Deduplicating Inserts on Retries

MergeTree settings:

M(UInt64, replicated_deduplication_window, 1000, "How many last blocks of hashes should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, replicated_deduplication_window_for_async_inserts, 10000, "How many last hash values of async_insert blocks should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds_for_async_inserts, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window_for_async_inserts\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, non_replicated_deduplication_window, 0, "How many last blocks of hashes should be kept on disk (0 - disabled).", 0) \

Enter fullscreen mode Exit fullscreen mode

query settings:

M(Bool, insert_deduplicate, true, "For INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(Bool, async_insert_deduplicate, false, "For async INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(String, insert_deduplication_token, "", "If not empty, used for duplicate detection instead of data digest", 0) \ 
M(Bool, deduplicate_blocks_in_dependent_materialized_views, false, "Should deduplicate blocks for materialized views. Use true to always deduplicate in dependent tables.", 0) \
M(Bool, throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert, true, "Throw exception on INSERT query when the setting `deduplicate_blocks_in_dependent_materialized_views` is enabled along with `async_insert`. It guarantees correctness, because these features can't work together.", 0) \

Enter fullscreen mode Exit fullscreen mode

Before exploring these source codes, I have two questions:
1 How does these settings affected write process?
2 Is insert_deduplication_token settings useful for MATERIALIZED VIEW?

insert_deduplicate

Starting from setting insert_deduplicate
Image description
When committing parts, generate a block_id from part data and then check whether block_id contains in deduplication_log.
Let's dig into how block_id is generated.
Image description
Only parts with level 0 (which means parts are generated from inserts instead of merges) can generate a block_id.
If block_dedup_token is empty, we just get a hash value from checksum files. If block_dedup_token is not empty, this means hash value is already calculated and can be used directly.
Image description
The final block_id is like _, there is no table's info here. So same data part inserting into different tables may also be deduplicated? That doesn't make sense.
Image description
Let's figure out this questions by a demo.

CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
CREATE TABLE dst_1
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET insert_deduplicate=1;

INSERT INTO dst SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
INSERT INTO dst_1 SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
Enter fullscreen mode Exit fullscreen mode

Data is not deduplicated for different tables.
Image description

We can see each table has it's own deduplication log, so data is not deduplicated for different tables.
Image description

async_insert_deduplicate is for ReplicatedMergeTree. For async INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed.
Image description

Make deduplication check. If a duplicate is detected, no nodes are created.
Image description
Instead of store block_id in memory, store block_id in zk.
Image description

deduplication_window

replicated_deduplication_window
Replicated means block_ids are stored on zk, this setting set the max number of block_ids stored on zk.
ReplicatedMergeTreeCleanupThread on leader replica will do old block_id cleaning works.
Image description

replicated_deduplication_window_seconds and replicated_deduplication_window are all used in cleaning block_ids.
Image description

non_replicated_deduplication_window is used to control size of deduplication_log in memory.
Image description

insert_deduplication_token

When set insert_deduplication_token, SetUserTokenTransform will be added to pipeline.
Image description
SetUserTokenTransform set user-defined token to chunk's tokenInfo, which will be used to generate block_id instead of checksum hash of bin files.
Image description

deduplication in materialized_views

deduplicate_blocks_in_dependent_materialized_views controls whether we will do deduplication in materialized_views.
Image description
Image description

Materialized views use MergeTreeSink as source. So data is not deduplicated if the source data was different.
Image description
Image description

throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert
Let's agree on terminology and say that a mini-INSERT is an asynchronous INSERT which typically contains not a lot of data inside and a big-INSERT in an INSERT which was formed by concatenating several mini-INSERTs together. In case when the client had to retry some mini-INSERTs then they will be properly deduplicated by the source tables. But then they will be glued together into a block and pushed through a chain of Materialized Views if any.The process of forming such blocks is not deterministic so each time we retry mini-INSERTs the resulting block may be concatenated differently. That's why deduplication in dependent Materialized Views doesn't make sense in presence of async INSERTs.

Is insert_deduplication_token settings useful for MATERIALIZED VIEW? Let's test.

CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;

CREATE MATERIALIZED VIEW mv_dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000
AS SELECT
    0 AS key,
    value AS value
FROM dst;

SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET deduplicate_blocks_in_dependent_materialized_views=1;

INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
# duplicated data
INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
Enter fullscreen mode Exit fullscreen mode

It works, no duplicated data is inserted.
Image description

Image of Datadog

Learn how to monitor AWS container environments at scale

In this eBook, Datadog and AWS share insights into the changing state of containers in the cloud and explore why orchestration technologies are an essential part of managing ever-changing containerized workloads.

Download the eBook

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more