<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: abel-cheng</title>
    <description>The latest articles on DEV Community by abel-cheng (@cheng-w).</description>
    <link>https://dev.to/cheng-w</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2722819%2Fdc690906-ecfb-4f13-b343-17a1a0723e35.png</url>
      <title>DEV Community: abel-cheng</title>
      <link>https://dev.to/cheng-w</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cheng-w"/>
    <language>en</language>
    <item>
      <title>Clickhouse Source Code Analysis: insert deduplication</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Thu, 23 Jan 2025 07:46:27 +0000</pubDate>
      <link>https://dev.to/cheng-w/clickhouse-source-code-analysis-insert-deduplication-3jep</link>
      <guid>https://dev.to/cheng-w/clickhouse-source-code-analysis-insert-deduplication-3jep</guid>
      <description>&lt;p&gt;Those who don’t know about clickhouse deduplication can first read the link below.&lt;br&gt;
&lt;a href="https://clickhouse.com/docs/en/guides/developer/deduplicating-inserts-on-retries" rel="noopener noreferrer"&gt;Deduplicating Inserts on Retries&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MergeTree settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M(UInt64, replicated_deduplication_window, 1000, "How many last blocks of hashes should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, replicated_deduplication_window_for_async_inserts, 10000, "How many last hash values of async_insert blocks should be kept in ZooKeeper (old blocks will be deleted).", 0) \
M(UInt64, replicated_deduplication_window_seconds_for_async_inserts, 7 * 24 * 60 * 60 /* one week */, "Similar to \"replicated_deduplication_window_for_async_inserts\", but determines old blocks by their lifetime. Hash of an inserted block will be deleted (and the block will not be deduplicated after) if it outside of one \"window\". You can set very big replicated_deduplication_window to avoid duplicating INSERTs during that period of time.", 0) \
M(UInt64, non_replicated_deduplication_window, 0, "How many last blocks of hashes should be kept on disk (0 - disabled).", 0) \

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;query settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M(Bool, insert_deduplicate, true, "For INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(Bool, async_insert_deduplicate, false, "For async INSERT queries in the replicated table, specifies that deduplication of inserting blocks should be performed", 0) \
M(String, insert_deduplication_token, "", "If not empty, used for duplicate detection instead of data digest", 0) \ 
M(Bool, deduplicate_blocks_in_dependent_materialized_views, false, "Should deduplicate blocks for materialized views. Use true to always deduplicate in dependent tables.", 0) \
M(Bool, throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert, true, "Throw exception on INSERT query when the setting `deduplicate_blocks_in_dependent_materialized_views` is enabled along with `async_insert`. It guarantees correctness, because these features can't work together.", 0) \

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before exploring the source code, I have two questions:&lt;br&gt;
1 How do these settings affect the write process?&lt;br&gt;
2 Is the insert_deduplication_token setting useful for a MATERIALIZED VIEW?&lt;/p&gt;
&lt;h2&gt;
  
  
  insert_deduplicate
&lt;/h2&gt;

&lt;p&gt;Starting from setting &lt;code&gt;insert_deduplicate&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy73z78kr5ow5ntowm0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy73z78kr5ow5ntowm0t.png" alt="Image description" width="800" height="625"&gt;&lt;/a&gt;&lt;br&gt;
When committing a part, a block_id is generated from the part data and then checked against the deduplication_log.&lt;br&gt;
Let's dig into how block_id is generated.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjeq065jpggaxxqms1j71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjeq065jpggaxxqms1j71.png" alt="Image description" width="800" height="615"&gt;&lt;/a&gt;&lt;br&gt;
Only parts at level 0 (meaning parts generated by inserts rather than merges) get a block_id.&lt;br&gt;
If block_dedup_token is empty, the hash value is simply computed from the checksum files. If block_dedup_token is not empty, the hash value has already been calculated and can be used directly.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tfk9faisjt92p7lthh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tfk9faisjt92p7lthh9.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;br&gt;
The final block_id is like _; it contains no table info. So could the same data part inserted into different tables also be deduplicated? That doesn't make sense.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F082q5znib2mx99v5il0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F082q5znib2mx99v5il0b.png" alt="Image description" width="800" height="643"&gt;&lt;/a&gt;&lt;br&gt;
Let's answer this question with a demo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
CREATE TABLE dst_1
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;
SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET insert_deduplicate=1;

INSERT INTO dst SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
INSERT INTO dst_1 SELECT
    0 AS key,
    'A' AS value
FROM numbers(2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is not deduplicated for different tables.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp78g1bvjrw72qoxvgcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp78g1bvjrw72qoxvgcb.png" alt="Image description" width="682" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see each table has its own deduplication log, so data is not deduplicated across different tables.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9q983wpqxnaq2fs7leu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9q983wpqxnaq2fs7leu.png" alt="Image description" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;async_insert_deduplicate&lt;/code&gt; is for ReplicatedMergeTree: for async INSERT queries into a replicated table, it specifies that deduplication of inserted blocks should be performed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnejyqjr04crttljlncsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnejyqjr04crttljlncsh.png" alt="Image description" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A deduplication check is performed; if a duplicate is detected, no nodes are created.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzmigxlbknbnoupiz53f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzmigxlbknbnoupiz53f.png" alt="Image description" width="800" height="538"&gt;&lt;/a&gt;&lt;br&gt;
Instead of storing block_ids in memory, they are stored in zk.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncit3sh52qki0bgnb2ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncit3sh52qki0bgnb2ob.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  deduplication_window
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;replicated_deduplication_window&lt;/code&gt; &lt;br&gt;
Replicated means the block_ids are stored in zk; this setting sets the maximum number of block_ids kept there.&lt;br&gt;
The ReplicatedMergeTreeCleanupThread on the leader replica cleans up old block_ids.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyxt30tuhqtl213gqfwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyxt30tuhqtl213gqfwk.png" alt="Image description" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;replicated_deduplication_window_seconds&lt;/code&gt; and &lt;code&gt;replicated_deduplication_window&lt;/code&gt; are both used when cleaning up block_ids.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoof4pwikb5493sxmy6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoof4pwikb5493sxmy6u.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;non_replicated_deduplication_window&lt;/code&gt; controls the size of the in-memory deduplication_log.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faz06q3pxva3i58e2ibbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faz06q3pxva3i58e2ibbe.png" alt="Image description" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  insert_deduplication_token
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;insert_deduplication_token&lt;/code&gt; is set, a SetUserTokenTransform is added to the pipeline.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qzw8nkxpirx15kpubt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88qzw8nkxpirx15kpubt.png" alt="Image description" width="800" height="516"&gt;&lt;/a&gt;&lt;br&gt;
SetUserTokenTransform sets the user-defined token on the chunk's tokenInfo, which is then used to generate the block_id instead of the checksum hash of the bin files.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji2jxtie9nhqfuyigayd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji2jxtie9nhqfuyigayd.png" alt="Image description" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  deduplication in materialized_views
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;deduplicate_blocks_in_dependent_materialized_views&lt;/code&gt; controls whether deduplication is performed in materialized views.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6xr7x5ixgacaa5ioot5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6xr7x5ixgacaa5ioot5.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr26ugdkuiwa1touywaes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr26ugdkuiwa1touywaes.png" alt="Image description" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Materialized views use MergeTreeSink as the source, so data is not deduplicated if the source data is different.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk414f8onmbworiot69nx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk414f8onmbworiot69nx.png" alt="Image description" width="800" height="574"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gz6jr02mjborn5smrlh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gz6jr02mjborn5smrlh.png" alt="Image description" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert&lt;/code&gt;&lt;br&gt;
Let's agree on terminology and say that a mini-INSERT is an asynchronous INSERT which typically contains little data, and a big-INSERT is an INSERT formed by concatenating several mini-INSERTs together. If the client had to retry some mini-INSERTs, they will be properly deduplicated by the source tables. But then they are glued together into a block and pushed through a chain of materialized views, if any. The process of forming such blocks is not deterministic, so each time the mini-INSERTs are retried the resulting block may be concatenated differently. That's why deduplication in dependent materialized views doesn't make sense in the presence of async INSERTs.&lt;/p&gt;
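
&lt;p&gt;Based on the setting's description, the guard can be triggered like this (a sketch; dst as in the earlier demos):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET async_insert = 1;
SET deduplicate_blocks_in_dependent_materialized_views = 1;

-- With the default
-- throw_if_deduplication_in_dependent_materialized_views_enabled_with_async_insert = 1,
-- this INSERT throws an exception instead of performing
-- non-deterministic deduplication in the dependent views.
INSERT INTO dst VALUES (0, 'A');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;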

&lt;p&gt;Is the insert_deduplication_token setting useful for a MATERIALIZED VIEW? Let's test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000;

CREATE MATERIALIZED VIEW mv_dst
(
    `key` Int64,
    `value` String
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS non_replicated_deduplication_window=1000
AS SELECT
    0 AS key,
    value AS value
FROM dst;

SET max_block_size=1;
SET min_insert_block_size_rows=0;
SET min_insert_block_size_bytes=0;
SET deduplicate_blocks_in_dependent_materialized_views=1;

INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
-- duplicated data
INSERT INTO dst SELECT
    number + 1 AS key,
    IF(key = 0, 'A', 'B') AS value
FROM numbers(2)
settings insert_deduplication_token='some_user_token';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works, no duplicated data is inserted.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh93cio46rhgaqet1ojj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjh93cio46rhgaqet1ojj.png" alt="Image description" width="698" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>clickhous</category>
      <category>olap</category>
    </item>
    <item>
      <title>Clickhouse Source Code Analysis: How is primary key generated and used?</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Mon, 20 Jan 2025 03:50:47 +0000</pubDate>
      <link>https://dev.to/cheng-w/clickhouse-source-code-analysis-how-is-primary-key-generated-and-used-46ei</link>
      <guid>https://dev.to/cheng-w/clickhouse-source-code-analysis-how-is-primary-key-generated-and-used-46ei</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;First, let's look at some details of the primary key through a demo.&lt;br&gt;
Demo table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE helloworld.my_first_table
(
    `user_id` UInt32,
    `message` String,
    `timestamp` DateTime,
    `metric` Float32,
    INDEX message_idx message TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 1
)
ENGINE = MergeTree
PRIMARY KEY (user_id, toStartOfTenMinutes(timestamp))
SETTINGS index_granularity = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;index_granularity is set to 2 so that we get many mark ranges even with limited demo data.&lt;/p&gt;

&lt;p&gt;Insert some data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (102, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (102, 'Sort your data based on your commonly-used queries', today(),     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 );
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',                                 now(),       -1.0    ),
    (103, 'Insert a lot of rows per batch',                     yesterday(), 1.41421 ),
    (103, 'Sort your data based on your commonly-used queries', today() - 1,     2.718   ),
    (101, 'Granules are the smallest chunks of data read',      now() + 5,   3.14159 );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Filtering on user_id in WHERE, we can see 3 marks are filtered out.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb4wt6nitus45caywf4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb4wt6nitus45caywf4t.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With timestamp &amp;lt; today() or toStartOfTenMinutes(timestamp) &amp;lt; today(), 2 marks are filtered out.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq94qxw7m1dvb85zh1beq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq94qxw7m1dvb85zh1beq.png" alt="Image description" width="800" height="306"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9lazl5g59a60bicy50d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9lazl5g59a60bicy50d.png" alt="Image description" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, the conclusion is that even if we do not filter on the first position of the sort key, data skipping can still take effect. How does this happen?&lt;/p&gt;
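
&lt;p&gt;The skipped ranges can be inspected directly; a sketch using EXPLAIN (the output shape varies by ClickHouse version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN indexes = 1
SELECT count()
FROM helloworld.my_first_table
WHERE toStartOfTenMinutes(timestamp) &amp;lt; today();
-- The PrimaryKey section of the output reports how many granules
-- were selected out of the total.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;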

&lt;p&gt;Next, we will explore the principles of PK, including the working mechanism in writing and querying.&lt;/p&gt;


&lt;h2&gt;
  
  
  Writing of primary key
&lt;/h2&gt;

&lt;p&gt;Write entry point&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7olfwpp9yrvhntd2xgfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7olfwpp9yrvhntd2xgfl.png" alt="Image description" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1 Generate the sort-by and other expressions from the data block; here we get toStartOfTenMinutes(timestamp).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipooewnd3m6afw3o3l1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipooewnd3m6afw3o3l1x.png" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3ixb3sks83jsh8jto8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb3ixb3sks83jsh8jto8.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2 Reserve space on disk and create part object.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvcfpusasabm1o3a16r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvcfpusasabm1o3a16r0.png" alt="Image description" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3 Write data to output stream.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn72mho4cwk5kmkmd3y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyn72mho4cwk5kmkmd3y1.png" alt="Image description" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4 Finalize data on disk, including flushing all data, writing minmax for partition keys, uuid.txt, partition.dat, checksum.txt, count.txt.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zytmrqajzhjqqssm2zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zytmrqajzhjqqssm2zn.png" alt="Image description" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary key writing logic is inside the write function, which has two implementations: one for compact parts and one for wide parts.&lt;/p&gt;
&lt;h1&gt;
  
  
  Compact part
&lt;/h1&gt;

&lt;p&gt;Primary keys are written to files in the function writeDataBlockPrimaryIndexAndSkipIndices.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ht9vvbd3jvf2qzwkjmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ht9vvbd3jvf2qzwkjmi.png" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5yfbkkra74dlze6u8er.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5yfbkkra74dlze6u8er.png" alt="Image description" width="800" height="706"&gt;&lt;/a&gt;&lt;br&gt;
Writing a data block in compact mode is like the traditional Parquet format: columns of the same granule are placed together.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0ne9jllyqw9ndjcbday.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0ne9jllyqw9ndjcbday.png" alt="Image description" width="800" height="232"&gt;&lt;/a&gt;&lt;br&gt;
Each granule contributes one row to the primary index.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9h48i7392gvpcygn8yb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa9h48i7392gvpcygn8yb.png" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;
Write the primary index block, which contains the primary index expressions (toStartOfTenMinutes(timestamp) in this demo).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5jdxe5e9206y56tvmu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5jdxe5e9206y56tvmu1.png" alt="Image description" width="800" height="385"&gt;&lt;/a&gt;&lt;br&gt;
Multiple streams are created on the same data file for different compression codecs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F495nvdmtkbmdgyeesz65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F495nvdmtkbmdgyeesz65.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Wide part
&lt;/h1&gt;

&lt;p&gt;For wide parts, the primary key follows the same logic, since both writers extend WriterOnDisk.&lt;br&gt;
Columns are written separately: all granules of column1 are written first, then column2, and so on.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k7xe9ts4zr316h2by1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k7xe9ts4zr316h2by1v.png" alt="Image description" width="800" height="679"&gt;&lt;/a&gt;&lt;br&gt;
Each column has its own data file and mark file.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkabk0v6vqvk6oocmu09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkabk0v6vqvk6oocmu09.png" alt="Image description" width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Read process (to be done)
&lt;/h2&gt;

&lt;p&gt;The read entry point:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig63u85nezdttm00mhic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fig63u85nezdttm00mhic.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the filter logic is in the function markRangesFromPKRange, which is invoked on each part, one by one.&lt;br&gt;
Function description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Calculates a set of mark ranges, that could possibly contain keys, required by condition.
In other words, it removes subranges from whole range, that definitely could not contain required keys.
If @exact_ranges is not null, fill it with ranges containing marks of fully matched records.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the logic that decides whether a mark range may contain the requested data.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2brrcq0dlida9pqacskb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2brrcq0dlida9pqacskb.png" alt="Image description" width="800" height="614"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within a single part, data is already ordered by primary key, so we can use binary search to find the marks for the left and right bounds.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuqxnlb8z904d87iinbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuqxnlb8z904d87iinbq.png" alt="Image description" width="800" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the main filter logic is in key_condition's checkInRange function. The key condition is built from the indexes.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbihkjab1x8nyzedcon3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbihkjab1x8nyzedcon3m.png" alt="Image description" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;key_condition is built like this from the primary key expressions.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2k51a4759kiqcrbfyah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2k51a4759kiqcrbfyah.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The binary search above applies only when the query uses a contiguous prefix of the primary key columns (here we filter on user_id, and the primary key is (user_id, timestamp)).&lt;/p&gt;

&lt;p&gt;If instead we filter on timestamp only, the logic is totally different.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62mzidbkoj2h2rne36mk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62mzidbkoj2h2rne36mk.png" alt="Image description" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In that case an exclusion search is done, dropping the ranges that definitely cannot match.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69ey3td17ijmenhzw2f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69ey3td17ijmenhzw2f7.png" alt="Image description" width="800" height="691"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Some personal thoughts
&lt;/h2&gt;

&lt;p&gt;Clickhouse records the primary keys of the first row of each granule and then uses these keys for filtering when querying.&lt;br&gt;
This is not suitable for some extensions, such as z-order.&lt;br&gt;
Z-order works better with minmax indexes, where we can filter a mark by the min and max values of a granule.&lt;br&gt;
Clickhouse currently records minmax only for columns used in partitioning; maybe the minmax indexes can be extended to primary key columns in the future.&lt;/p&gt;
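&lt;p&gt;To illustrate the minmax idea (a hypothetical sketch of the proposal, not existing ClickHouse behavior for primary keys):&lt;/p&gt;

```python
# With per-granule min/max recorded, any granule whose [mn, mx] interval does
# not overlap the queried [lo, hi] range can be dropped, regardless of how
# rows are ordered inside the part (which is what z-order needs).
def prune_by_minmax(granule_minmax, lo, hi):
    """Keep granule indices whose [mn, mx] overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(granule_minmax)
            if mx >= lo and mn <= hi]

granules = [(0, 15), (3, 9), (20, 30), (28, 50)]
print(prune_by_minmax(granules, 10, 25))  # [0, 2]
```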

</description>
      <category>clickhouse</category>
      <category>olap</category>
    </item>
    <item>
      <title>Clickhouse Source Code Analysis: sample by</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Fri, 17 Jan 2025 08:13:34 +0000</pubDate>
      <link>https://dev.to/cheng-w/clickhouse-source-code-analysis-sample-by-2c5e</link>
      <guid>https://dev.to/cheng-w/clickhouse-source-code-analysis-sample-by-2c5e</guid>
      <description>&lt;p&gt;&lt;a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#sample-by" rel="noopener noreferrer"&gt;https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#sample-by&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How is sample by implemented in clickhouse?&lt;/p&gt;

&lt;p&gt;To use SAMPLE in clickhouse, we need to add a sample by expression when creating the MergeTree table.&lt;/p&gt;

&lt;p&gt;If SAMPLE BY is specified, it must be contained in the primary key. The sampling expression must result in an unsigned integer.&lt;/p&gt;

&lt;p&gt;Example: SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)).&lt;/p&gt;

&lt;p&gt;SAMPLE BY doesn't change the data layout of MergeTree; the sample by expression is only used for syntax validation and during the select process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7crxxpa2nilfgd3emve9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7crxxpa2nilfgd3emve9.png" alt="Image description" width="800" height="627"&gt;&lt;/a&gt;&lt;br&gt;
In the select process, the sample phase happens before mark ranges are filtered.&lt;/p&gt;

&lt;p&gt;Inside the sample phase, two important things happen:&lt;br&gt;
1. Get the actual sample range.&lt;br&gt;
2. Add the sample range to the key conditions, so the sample can be used to skip data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawdhnurhf7557xx51plu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawdhnurhf7557xx51plu.png" alt="Transform sample by rows" width="800" height="704"&gt;&lt;/a&gt;&lt;br&gt;
The code above transforms a sample row count into a sample fraction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv7v51j06rzeqohhp9sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv7v51j06rzeqohhp9sg.png" alt="Get max value" width="800" height="666"&gt;&lt;/a&gt;&lt;br&gt;
Then calculate the upper bound of the sample by expression, which is the maximum value of the unsigned integer type (255 for UInt8, 65535 for UInt16 and so on).&lt;br&gt;
The lower bound is defined by the sample offset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkty8waub2wu6ez1z1log.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkty8waub2wu6ez1z1log.png" alt="Add sample filter conditions" width="800" height="689"&gt;&lt;/a&gt;&lt;br&gt;
Add conditions based on the lower and upper bounds calculated above.&lt;br&gt;
These conditions are then used to filter data in the filterPartsByPrimaryKeyAndSkipIndexes function.&lt;/p&gt;
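&lt;p&gt;The bound calculation amounts to something like this sketch (simplified; it assumes the UInt32 domain of intHash32 and ignores rounding details of the real implementation):&lt;/p&gt;

```python
# SAMPLE k OFFSET m becomes a half-open slice of the sample-by expression's
# unsigned-integer domain, added to the key condition as
# lower <= intHash32(UserID) < upper.
UINT32_MAX = 2**32 - 1  # domain of intHash32

def sample_bounds(fraction, offset=0.0, type_max=UINT32_MAX):
    size = type_max + 1
    lower = int(size * offset)
    upper = int(size * (offset + fraction))
    return lower, upper

print(sample_bounds(0.1))        # first 10% of the hash domain
print(sample_bounds(0.1, 0.5))   # a 10% slice starting at the midpoint
```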

&lt;p&gt;The entire implementation is surprisingly simple.&lt;br&gt;
We can see that sampling by a low-cardinality field is not useful.&lt;br&gt;
When sampling, we should apply a hash function like intHash32 even if the field is already an unsigned integer. For example, if the sample by field's type is UInt8 (0-255) but the actual values are 0-25, SAMPLE 0.1 may end up reading all the data. It would be better to calculate the upper and lower bounds from the mark file rather than directly using the maximum value of the field type. Of course, if you apply the hash function, you won't have this problem.&lt;/p&gt;
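&lt;p&gt;A worked example of that pitfall (toy arithmetic, mirroring the bound calculation described above):&lt;/p&gt;

```python
# SAMPLE 0.1 on a raw UInt8 column whose actual values only span 0..25:
# the cut-off is computed from the *type* range (0..255), so the resulting
# condition keeps every value below 25.6 -- i.e. every row.
def sample_cutoff(fraction, type_max):
    return (type_max + 1) * fraction  # condition: value < cutoff

values = list(range(26))                 # actual data: 0..25
cutoff = sample_cutoff(0.1, 255)         # 25.6
kept = [v for v in values if v < cutoff]
print(len(kept) / len(values))           # 1.0 -- the "sample" reads everything
```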

&lt;p&gt;When sampling doesn't improve performance as much as expected, it may be because the sample by expression ranks low in the order by sequence. For example, with order by (a, b, c), if a and b are not used in the where clause, sampling by c may not skip much data.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>code</category>
    </item>
    <item>
      <title>Build clickhouse remote development environment with vscode(v24.8.11.5-lts)</title>
      <dc:creator>abel-cheng</dc:creator>
      <pubDate>Thu, 16 Jan 2025 11:15:35 +0000</pubDate>
      <link>https://dev.to/cheng-w/build-clickhouse-remote-development-environment-with-vscodev248115-lts-3599</link>
      <guid>https://dev.to/cheng-w/build-clickhouse-remote-development-environment-with-vscodev248115-lts-3599</guid>
      <description>&lt;h2&gt;
  
  
  1 Build dev docker image
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM docker.io/ubuntu:22.04
RUN rm /bin/sh &amp;amp;&amp;amp; ln -s /bin/bash /bin/sh
RUN apt-get -y update
RUN apt-get install -y curl vim git ssh openssh-server cmake ccache python3 ninja-build nasm yasm gawk lsb-release wget software-properties-common gnupg
RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
RUN touch ~/.ssh/authorized_keys
RUN bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
RUN echo "export CC=clang-18" &amp;gt;&amp;gt; /root/.bashrc
RUN echo "export CXX=clang++-18" &amp;gt;&amp;gt; /root/.bashrc
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
RUN echo "export PATH=~/.cargo/bin:$PATH" &amp;gt;&amp;gt; /root/.bashrc
RUN /root/.cargo/bin/rustup toolchain install nightly-2024-12-01
RUN /root/.cargo/bin/rustup default nightly-2024-12-01
RUN /root/.cargo/bin/rustup component add rust-src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rustup installation may fail due to network problems. Just try again, or write an automatic retry script.&lt;/p&gt;

&lt;p&gt;After docker build, you will get a development image, named xxx-clickhouse-dev-env:24.8 in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  2 Download clickhouse code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:ClickHouse/ClickHouse.git &amp;lt;workspace&amp;gt;/clickhouse-24
cd &amp;lt;workspace&amp;gt;/clickhouse-24
git checkout v24.8.11.5-lts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download all submodules&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh
while :
do
git submodule update --init --recursive --force
sleep 1
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop is because the code clone is unstable and may fail.&lt;br&gt;
When the actual clone does not appear in the log, manually kill the above process.&lt;/p&gt;
&lt;h2&gt;
  
  
  3 Set up development environment
&lt;/h2&gt;
&lt;h1&gt;
  
  
  remote development environment
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;docker run -itd --name xxx -v &amp;lt;workspace&amp;gt;/clickhouse-24:/data/clickhouse --privileged=true -p xxx:22 --cap-add="NET_ADMIN" --security-opt seccomp=unconfined xxx-clickhouse-dev-env:24.8 bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Execute the following commands in the docker container to enable ssh.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo xxx &amp;gt;&amp;gt; ~/.ssh/authorized_keys # Write the token in your local id_rsa.pub
/etc/init.d/ssh start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  local development environment
&lt;/h1&gt;

&lt;p&gt;Download vscode from &lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;https://code.visualstudio.com/&lt;/a&gt;.&lt;br&gt;
Next, install the Remote-SSH extension so we can connect to the remote server from vscode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5z3nxwjuwtdw55ptnkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5z3nxwjuwtdw55ptnkq.png" alt="new ssh server entrance 1" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmioj7stdq72xpwkxkj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmioj7stdq72xpwkxkj9.png" alt="new ssh server entrance 2" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1fuye32vzbxu96i5ufa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1fuye32vzbxu96i5ufa.png" alt="new ssh server entrance 3" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a new ssh host to the configuration like below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host clickhouse-24
    HostName &amp;lt;ip&amp;gt;
    User root
    Port &amp;lt;port&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvimzok98inpna80pq0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvimzok98inpna80pq0z.png" alt="ssh connect" width="800" height="277"&gt;&lt;/a&gt;&lt;br&gt;
Then you can connect to the server as clickhouse-24 and open the code directory /data/clickhouse.&lt;/p&gt;

&lt;p&gt;Next we need to install the necessary extensions on the remote server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimsq3ouaafrsywushn0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimsq3ouaafrsywushn0c.png" alt="install extensions" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Compile
&lt;/h2&gt;

&lt;p&gt;Create a .vscode directory in the working directory and add a tasks.json file with the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "2.0.0",
    "tasks": [
        {
            "type": "shell",
            "label": "cmake",
            "group": "build",
            "command": "cmake -DCMAKE_BUILD_TYPE=Debug -DENABLE_CCACHE=1 -DCMAKE_C_COMPILER=/usr/lib/llvm-18/bin/clang -DCMAKE_CXX_COMPILER=/usr/lib/llvm-18/bin/clang++ -DCMAKE_PREFIX_PATH=/usr/lib/llvm-18/ -DENABLE_JEMALLOC=ON -DENABLE_TESTS=OFF -DCOMPILER_FLAGS=-DNDEBUG -DWERROR=OFF -G Ninja -B build",
            "options": {
                "cwd": "${workspaceFolder}"
            },
            "problemMatcher": [],
            "presentation": {
                "echo": true,
                "reveal": "always",
                "focus": true,
                "panel": "shared",
                "showReuseMessage": true,
                "clear": false
            }
        },
        {
            "type": "shell",
            "label": "ninja clickhouse",
            "group": "build",
            "command": "ninja clickhouse clickhouse-server clickhouse-client -j16",
            "options": {
                "cwd": "${workspaceFolder}/build"
            }
            // "dependsOn": "cmake",
            },
        {
            "type": "shell",
            "label": "ninja all",
            "group": "build",
            "command": "ninja -j16",
            "options": {
            "cwd": "${workspaceFolder}/build"
            }
            // "dependsOn": "cmake",
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zoyhn5ti2xin8woe3yh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zoyhn5ti2xin8woe3yh.png" alt="Run task" width="800" height="271"&gt;&lt;/a&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F925tq52n88mf6hz4qqh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F925tq52n88mf6hz4qqh3.png" alt="tasks" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the cmake task first and then ninja clickhouse; the build will take several hours.&lt;/p&gt;

&lt;p&gt;Add a settings.json file in the .vscode directory to enable code navigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "clangd.path": "/usr/lib/llvm-18/bin/clangd",
    "clangd.checkUpdates": false,
    "clangd.arguments": [
        "--background-index",
        "--compile-commands-dir=build",
        "-j=12",
        "--query-driver=/usr/lib/llvm-18/bin/clang++",
        "--clang-tidy",
        "--clang-tidy-checks=performance-*,bugprone-*",
        "--all-scopes-completion",
        "--completion-style=detailed",
        "--header-insertion=iwyu",
        "--pch-storage=disk"
    ],
    "clangd.onConfigChanged": "restart",
    "lldb.commandCompletions": true,
    "lldb.dereferencePointers": true,
    "lldb.evaluateForHovers": true,
    "lldb.launch.expressions": "simple",
    "lldb.showDisassembly": "never",
    "lldb.verboseLogging": true,
    // cpp_tools Config
    "C_Cpp.autocomplete": "Disabled",
    "C_Cpp.formatting": "Disabled",
    "C_Cpp.errorSquiggles": "Disabled",
    "C_Cpp.intelliSenseEngine": "Disabled",
    "git.ignoreLimitWarning": true,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the word "indexing" appears in the status bar at the bottom, the configuration is successful. If it fails, try restarting the window a few times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxn9i67vuy1q7w9y81b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxn9i67vuy1q7w9y81b8.png" alt="indexing" width="800" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Debug
&lt;/h2&gt;

&lt;p&gt;Start a stand-alone clickhouse-server instance and debug it with breakpoints.&lt;br&gt;
You need to confirm that the LLDB extension is installed.&lt;br&gt;
If the download is too slow, you can download it from the website and install it manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4by6r6wb92fq4jwc98g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4by6r6wb92fq4jwc98g.png" alt="install lldb manually" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a launch.json file in the .vscode directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Clickhouse Server",
            "type": "lldb",
            "request": "launch",
            "program": "${workspaceFolder}/build/programs/clickhouse",
            "args": [
                "server", "--config-file=${workspaceFolder}/programs/server/config.xml"
            ],
            "initCommands": [
                "process handle -p false -s false -n false SIGUSR1",
                "process handle -p false -s false -n false SIGUSR2"
            ],
            "preLaunchTask": "ninja clickhouse",
            // "stopAtEntry": false,
            // "osx": {
            // "MIMode": "lldb"
            // },
            "cwd": "${workspaceFolder}"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch clickhouse-server from here, and you can set breakpoints to debug.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flepgo3v0gobdwvfw42g6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flepgo3v0gobdwvfw42g6.png" alt="debug entrance" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iz752ductjlkzrrkuz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iz752ductjlkzrrkuz1.png" alt="debug panel" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>olap</category>
      <category>clickhouse</category>
      <category>vscode</category>
    </item>
  </channel>
</rss>
