DEV Community

Erika for Tinybird

Posted on • Originally published at tinybird.co

ClickHouse Fundamentals: Part 1

We've launched an Open Source ClickHouse® Knowledge Base!

We'll be sharing some of the tips here on Dev.to but if you don't want to wait, jump in and find them all in one place.

Oh, and if you're a ClickHouse guru, please feel free to contribute your own ClickHouse magic.

Here's part 1:

View the intermediate states of aggregations
Using aggregation functions with the -State modifier (e.g. sumState) result in intermediate states being stored in ClickHouse. These intermediate states generally cannot be read, as they are stored in a binary representation. Thus, to read the result, we must use the corresponding -Merge modifer when selecting the result (e.g. sumMerge).

For example:

SELECT
    number % 4 AS pk,
    avgState(number) AS avg_state
FROM numbers(2000)
GROUP BY pk

Query id: af1c69e7-b5d2-4063-9b8d-1ac08598fc79

┌─pk─┬─avg_state─┐
│  0 │ 8��         │
│  1 │ ,��         │
│  2 │  ��         │
│  3 │ ��          │
└────┴───────────┘
Enter fullscreen mode Exit fullscreen mode

If you want to explore the intermediate states, perhaps without knowing what the original aggregation method was, you can instead use the finalizeAggregation function.

SELECT
    pk,
    finalizeAggregation(avg_state)
FROM
(
    SELECT
        number % 4 AS pk,
        avgState(number) AS avg_state
    FROM numbers(2000)
    GROUP BY pk
)

Query id: 7cf3a07f-f5d1-4ddd-891f-a89bb304b227

┌─pk─┬─finalizeAggregation(avg_state)─┐
│  0 │                            998 │
│  1 │                            999 │
│  2 │                           1000 │
│  3 │                           1001 │
└────┴────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Apply column default to existing rows
ClickHouse includes a special wrapper type called Nullable which allows a column to contain null values. It's common to use this early on in schema design, when a default value has not yet been decided.

CREATE TABLE deleteme
(
    `number` UInt64,
    `date` Nullable(DateTime)
)
ENGINE = MergeTree
PARTITION BY number % 10
ORDER BY number AS
SELECT
    number,
    NULL
FROM numbers(10)
Enter fullscreen mode Exit fullscreen mode

However, you will often find that you eventually want to modify this column to remove Nullable and insert a default value instead of nulls.

ALTER TABLE deleteme MODIFY COLUMN `date` DEFAULT now()
Enter fullscreen mode Exit fullscreen mode

Adding a default value will affect new rows, but will not replace the nulls in existing rows.

SELECT *
FROM deleteme
LIMIT 1;

┌─number─┬─date─┐
│      0 │ ᴺᵁᴸᴸ │
└────────┴──────┘
Enter fullscreen mode Exit fullscreen mode

To apply the new default value to existing rows, you can use MATERIALIZE.

ALTER TABLE deleteme
MATERIALIZE COLUMN `date`;

SELECT *
FROM deleteme
LIMIT 1;

┌─number─┬────────────────date─┐
│      0 │ 2022-09-23 12:31:14 │
└────────┴─────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Avoid TOO_MANY_PARTS with async_insert
ClickHouse was originally designed to insert data in batches.

For engineers accustomed to other databases, it's a common mistake to send hundreds of individual inserts per second to ClickHouse and get a TOO_MANY_PARTS error. This error is ClickHouse telling us to throttle ingestion, as it can't keep up.

Until recently, you were required to solve this issue yourself, by buffering inserts and sending larger batches.

However, ClickHouse v21.11 introduced async_insert which enables ClickHouse to handle batching small inserts for you.

NOTE
async_insert is disabled by default, so you must enable it to take advantage of this feature.

If you decide to use it you should also have a look at async_insert_threads, async_insert_max_data_size, async_insert_busy_timeout_ms and wait_for_async_insert.

Top comments (0)