PREWHERE optimization with ReplacingMergeTree and FINAL in Clickhouse

#clickhouse #database #performance #analytics

In this blog, we will learn about how to reduce I/O and improve query performance using PREWHERE in queries with FINAL keyword on ReplacingMergeTree table in Clickhouse.

Prerequisite

Clickhouse
ReplacingMergeTree
FINAL
prewhere
1. Above is the best place to understand prewhere optimiation and how it can improve the query performance. It is recommended to go through it before proceeding.

PREWHERE with FINAL in ReplacingMergeTree

ReplacingMergeTree table engine is widely used to design the update/delete use case in Clickhouse. And it is very common to use FINAL keyword to deduplicate and query the latest copy of data.

PREWHERE is one of the powerful technique to reduces I/O and improve query performance by avoiding unnecessary data reads, and filtering out irrelevant data before reading non-filter columns from disk.

By default PREWHERE optimization is enabled on all query except the one with FINAL keyword.

Why prewhere is disabled on query with FINAL ?

FINAL applies the table engine’s merge/deduplication logic at read time (e.g. ReplacingMergeTree picks the "winning" row).

PREWHERE runs before this merge. If PREWHERE filters on columns that differ between duplicate versions of a row (and are not in ORDER BY), it can drop the version that should have "won" under FINAL, or keep a version that should have been removed. This can change which row survives the FINAL merge and therefore change query results. To avoid incorrect results, ClickHouse does not automatically move conditions to PREWHERE when FINAL is present unless you explicitly allow it.

Exmaple

CREATE TABLE test_prewhere_final
(
    id       UInt64,
    status   String,
    ver      UInt64
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY id;

INSERT INTO test_prewhere_final VALUES
(1, 'active', 1);  -- old version, should lose

INSERT INTO test_prewhere_final VALUES
(1, 'inactive', 2);  -- new version, should win

SELECT id, status, ver
FROM test_prewhere_final
FINAL
WHERE id=1 and status = 'active';
-- Returns 0 rows (correct: the winning row has status = 'inactive')

SELECT id, status, ver
FROM test_prewhere_final
FINAL
PREWHERE id=1 and status = 'active';
-- Can return 1 rows (incorrect: if merge is not performed then prewhere applied before deduplication and can return row with active status. Which can be incorrect result if you want to query the latest state of data.)

id	status	ver	comment
1	active	1	removed during merge
1	inactive	2	latest data

Which columns can be moved to prewhere clause

Rule of thumb: It is safe to move columns mentioned in order by clause to PREWHERE clause as deduplication is happaned on them.

SELECT id, status, ver
FROM test_prewhere_final
FINAL
PREWHERE id=1 WHERE status = 'active';
-- Gives correct result as it is safe to move order by column to prewhere clause.

In above example, if user wants to query latest version of data using FINAL keyword then moving status column in PREWHERE clause can give you wrong results. But it is safe to move id column to PREWHERE.

Impact

PREWHERE can reduce the I/O operations and improve the query performance. There is no rule of thumb to measure the improvement but it can be as high as >90% reduction in data read.

For e.g. if you are building saas product then it is very likely that you always have filter on your client/tenant ids. As you only allow client to query only their data. And to efficiently filter the data, it is very common that client/tenant id is present in order by clause of table. In this scenario, we can safely move filters on client/tenant id to PREWHERE.

What is next ?

It is very useful if prewhere optimization is automatically applied on order by columns in queries with FINAL keyword same as it is applied on queries without FINAL keyword. If you like the idea and what this to be picked up, upvote the below open idea.