<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: gaurang101197</title>
    <description>The latest articles on DEV Community by gaurang101197 (@gaurang101197).</description>
    <link>https://dev.to/gaurang101197</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F474341%2Fbc9ff8e7-4e74-44b7-a2b7-251d4211e296.jpg</url>
      <title>DEV Community: gaurang101197</title>
      <link>https://dev.to/gaurang101197</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gaurang101197"/>
    <language>en</language>
    <item>
      <title>Resource cap on the non-user-facing workload in ClickHouse</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Sat, 31 Jan 2026 09:49:29 +0000</pubDate>
      <link>https://dev.to/gaurang101197/resource-cap-on-the-non-user-facing-workload-in-clickhouse-5cn</link>
      <guid>https://dev.to/gaurang101197/resource-cap-on-the-non-user-facing-workload-in-clickhouse-5cn</guid>
      <description>&lt;p&gt;If you are looking for a way to safeguard your critical ClickHouse workload from failures caused by ad hoc, unwanted, and undesired queries, you are in the right place. This blog is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is it important to cap resource usage for the non-user-facing workload?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is built to use all available resources to execute a query. A single bad query can consume every available resource and impact other critical business queries. We also don't want someone to run an expensive query by mistake (while debugging an issue or performing ad hoc analysis) and degrade the user-facing queries. So it is critical to apply resource usage limits to non-user-facing queries to safeguard our critical workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to safeguard the critical workload
&lt;/h2&gt;

&lt;p&gt;It is a best practice to create a separate role and user for each use case in ClickHouse. This makes it easier to manage access and configure different settings per role/user.&lt;/p&gt;

&lt;p&gt;So in this blog, we assume that separate roles/users exist for the user-facing and non-user-facing workloads.&lt;/p&gt;

&lt;p&gt;In practice, we should limit both the number of concurrent queries and the amount of memory a single user can consume. ClickHouse has a limit on the maximum number of concurrent queries it can run at any given time, and it rejects new queries once this limit is breached (even if memory usage is low). So we also want to cap the maximum number of concurrent queries our non-user-facing workload can run.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a settings profile?
&lt;/h3&gt;

&lt;p&gt;A settings profile is a way to define a group of settings and attach it to a given user or role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a settings profile to limit the memory usage and concurrent number of queries
&lt;/h3&gt;

&lt;p&gt;The query below creates a settings profile that limits max memory usage to 1 GB and the maximum number of concurrent queries to 100 for a given role/user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Limit max memory usage to 1GB and concurrent query to 100&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;PROFILE&lt;/span&gt; &lt;span class="n"&gt;restrict_resource_on_non_user_facing_workload&lt;/span&gt; 
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_memory_usage_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_concurrent_queries_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; 
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;non_user_facing_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- You can alter the settings using below query.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;PROFILE&lt;/span&gt; &lt;span class="n"&gt;restrict_resource_on_non_user_facing_workload&lt;/span&gt; 
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_memory_usage_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_concurrent_queries_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; 
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;non_user_facing_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Below query deleted the provided settings profile&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;PROFILE&lt;/span&gt; &lt;span class="n"&gt;restrict_resource_on_non_user_facing_workload&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Settings that can be useful
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_memory_usage_for_user" rel="noopener noreferrer"&gt;max_memory_usage_for_user&lt;/a&gt; - The maximum amount of RAM in bytes to use for running a user's queries on a single server.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_concurrent_queries_for_user" rel="noopener noreferrer"&gt;max_concurrent_queries_for_user&lt;/a&gt; - The maximum number of simultaneously processed queries per user.

&lt;ul&gt;
&lt;li&gt;Even though the limit on total memory usage gives us a starting point, we should cap the number of concurrent queries as well, because a ClickHouse node can run 1000 queries at a time. So we should also limit the number of concurrent queries run by the non-user-facing workload.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_rows_to_read" rel="noopener noreferrer"&gt;max_rows_to_read&lt;/a&gt; - The maximum number of rows that can be read from a table when running a query. The restriction is checked for each processed chunk of data, applied only to the deepest table expression and when reading from a remote server, checked only on the remote server.

&lt;ul&gt;
&lt;li&gt;This can be skipped if we set a per-user memory limit, which will indirectly restrict the number of rows read. Our goal is to restrict resource usage, and if max_memory_usage_for_user achieves that, we can leave this setting alone.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_bytes_to_read" rel="noopener noreferrer"&gt;max_bytes_to_read&lt;/a&gt; - The maximum number of bytes (of uncompressed data) that can be read from a table when running a query. The restriction is checked for each processed chunk of data, applied only to the deepest table expression and when reading from a remote server, checked only on the remote server.

&lt;ul&gt;
&lt;li&gt;Like max_rows_to_read, this can be skipped if we set a per-user memory limit.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
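&lt;p&gt;Once the profile is in place, you can verify it directly from SQL. A quick sanity check might look like the sketch below (using the profile name from earlier; the &lt;code&gt;system&lt;/code&gt; table column names are per recent ClickHouse versions, so adjust if yours differ):&lt;/p&gt;

```sql
-- Inspect the definition of the profile created above
SHOW CREATE SETTINGS PROFILE restrict_resource_on_non_user_facing_workload;

-- List all settings profiles known to the server
SELECT name, num_elements FROM system.settings_profiles;

-- See each individual setting attached to the profile
SELECT setting_name, value
FROM system.settings_profile_elements
WHERE profile_name = 'restrict_resource_on_non_user_facing_workload';
```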

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory limit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numbers_mt&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_memory_usage_for_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected Error&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User memory limit exceeded: would use 88.41 KiB (attempt to allocate chunk of 0.00 B bytes), maximum: 9.77 KiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker: While executing AggregatingTransform. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concurrent queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run below query from multiple session&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numbers_mt&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_concurrent_queries_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- To get list of running queries by current user&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;currentUser&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected error&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Too many simultaneous queries for user XYZ. Current: 1, maximum: 1.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>clickhouse</category>
      <category>dataengineering</category>
      <category>dbadmin</category>
      <category>observability</category>
    </item>
    <item>
      <title>Improving Node app responsiveness using partitioning</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Mon, 19 Jan 2026 04:18:10 +0000</pubDate>
      <link>https://dev.to/gaurang101197/improving-node-app-responsiveness-using-partitioning-201i</link>
      <guid>https://dev.to/gaurang101197/improving-node-app-responsiveness-using-partitioning-201i</guid>
      <description>&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;We have a GET endpoint that fetches documents from the database, transforms them, and returns them. For a few clients we had to fetch a large number of documents, which used to block Node's event loop thread. When the event loop thread is blocked, it causes the problems below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The app becomes unresponsive to new requests.&lt;/li&gt;
&lt;li&gt;The liveness probe fails and k8s restarts the pod.&lt;/li&gt;
&lt;li&gt;It adds unpredictable delays to small, fast requests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Restarts became very frequent, and we could not directly add pagination and deprecate this legacy endpoint. So we needed to figure out a short-term solution, requiring few code changes, to prevent the restarts and improve the responsiveness of the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;There is a concept called &lt;a href="https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop#partitioning" rel="noopener noreferrer"&gt;partitioning&lt;/a&gt;. It is very simple: break your large synchronous processing into smaller tasks, and yield to the event loop between tasks so it can work on other requests in between. Below is simple pseudocode that unblocks the event loop between each task (batch) and lets it serve other requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Return a Promise which resolves immediately. But event loop continue processing in next cycle which unblock the event loop and let it work upon other tasks as well.&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;yieldToEventLoop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setImmediate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;smallBatchSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;smallBatchSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Do your processing of batch here&lt;/span&gt;
  &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nf"&gt;processBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;yieldToEventLoop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
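&lt;p&gt;For completeness, here is a runnable sketch of the same pattern in plain JavaScript. &lt;code&gt;processBatch&lt;/code&gt; is a hypothetical stand-in for the real per-batch transformation:&lt;/p&gt;

```javascript
// Resolve on the next event-loop iteration so other queued work can run first.
function yieldToEventLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Hypothetical stand-in for the real per-batch transformation.
function processBatch(batch) {
  return batch.map((doc) => doc * 2);
}

// Process docs in small batches, yielding to the event loop between batches.
async function processAll(docs, smallBatchSize) {
  const output = [];
  for (let i = 0; i < docs.length; i += smallBatchSize) {
    const batch = docs.slice(i, i + smallBatchSize);
    output.push(...processBatch(batch));
    await yieldToEventLoop();
  }
  return output;
}

// Usage: five documents in batches of two.
processAll([1, 2, 3, 4, 5], 2).then((out) => console.log(out)); // prints [ 2, 4, 6, 8, 10 ]
```

&lt;p&gt;The output is identical to a single synchronous pass; only the scheduling changes, which is exactly the point.&lt;/p&gt;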



&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It does not make processing faster or remove the CPU cost.&lt;/li&gt;
&lt;li&gt;It improves &lt;strong&gt;responsiveness&lt;/strong&gt; of the application.&lt;/li&gt;
&lt;li&gt;One heavy request can starve the entire Node process. Chunk + yield can keep the event loop alive and provide temporary relief.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If your server relies heavily on complex calculations, you should think about whether Node.js is really a good fit. Node.js excels for I/O-bound work, but for expensive computation it might not be the best option. &lt;a href="https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop#partitioning" rel="noopener noreferrer"&gt;Reference&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop" rel="noopener noreferrer"&gt;https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nodejs.org/en/learn/asynchronous-work/understanding-setimmediate" rel="noopener noreferrer"&gt;https://nodejs.org/en/learn/asynchronous-work/understanding-setimmediate&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>node</category>
      <category>eventloop</category>
      <category>programming</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Alert on counter discontinuation in Grafana</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Wed, 22 Jan 2025 04:32:47 +0000</pubDate>
      <link>https://dev.to/gaurang101197/alert-on-counter-discontinue-in-grafana-5fap</link>
      <guid>https://dev.to/gaurang101197/alert-on-counter-discontinue-in-grafana-5fap</guid>
      <description>&lt;h2&gt;
  
  
  Requirement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We have a counter named &lt;em&gt;heartbeat_count&lt;/em&gt; which indicates whether an application is up or not. It has a label called &lt;em&gt;application&lt;/em&gt;, which is the application name.&lt;/li&gt;
&lt;li&gt;Each application sends this heartbeat metric every 15 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, we want to get an alert whenever any application stops pushing the heartbeat metric for 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;count_over_time()&lt;/code&gt; - this function counts how many times a metric has a value in the given time window. So if an application is sending the metric every 15 seconds, count_over_time(heartbeat_count{application="ABC"}[1m]) gives 4 (the metric has a value 4 times in the last minute, as it is pushed every 15 seconds).&lt;/p&gt;

&lt;p&gt;So, over 10 minutes, &lt;em&gt;count_over_time&lt;/em&gt; should be 40 for an application working fine. We can use this function to send an alert if the counter is missing 20 samples in the last 10 minutes. The query below prints the heartbeat count over the last 10 minutes by application. If the value for any application goes below 20, the counter's value has been missing 20 times, i.e. roughly 5 minutes' worth, in the last 10 minutes (the 5 minutes might not be contiguous, but that is a limitation of this solution).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(application) (count_over_time(heartbeat_count{application!=""}[10m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query below gives us the applications whose heartbeat counter has been missing 20 times in the last 10 minutes, and we can easily set up an alert on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(application) (count_over_time(heartbeat_count{application!=""}[10m])) &amp;lt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whenever a new application starts, an alert is sent because the new application has no counter values for the past 10 minutes. This can be ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resource
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time" rel="noopener noreferrer"&gt;https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>alerting</category>
      <category>devops</category>
    </item>
    <item>
      <title>Plotting Histogram Distribution Over Time in Grafana</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Sat, 10 Aug 2024 08:02:01 +0000</pubDate>
      <link>https://dev.to/gaurang101197/plotting-histogram-distribution-over-time-in-grafana-469n</link>
      <guid>https://dev.to/gaurang101197/plotting-histogram-distribution-over-time-in-grafana-469n</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnhrv6wqanc65olbzd08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnhrv6wqanc65olbzd08.png" alt="Histogram Distribution Over Time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are looking to plot a histogram distribution over time as shown in the image above, this blog is for you. It does not cover the internals of histograms or Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Histogram Distribution Over Time
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It helps you understand what the distribution looks like over time.&lt;/li&gt;
&lt;li&gt;It is very useful for finding the time period when the distribution skewed.&lt;/li&gt;
&lt;li&gt;While a histogram distribution summarizes the data and is useful for checking system performance at a glance, the distribution over time helps detect the time period when performance degrades.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pre-requisite
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Internals of histogram: &lt;a href="https://prometheus.io/docs/practices/histograms/" rel="noopener noreferrer"&gt;https://prometheus.io/docs/practices/histograms/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It is better to have hands-on experience with Prometheus histograms and prior experience with Grafana.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Use-case
&lt;/h3&gt;

&lt;p&gt;Plot the latency distribution over time of any operation, e.g. API latency or DB latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Measure latency metric using Prometheus Histogram.&lt;/li&gt;
&lt;li&gt;Metric name is &lt;code&gt;my_latency_metric&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Histogram buckets used are &lt;code&gt;[0, 80, 160, 320, 640, 1280, 2560, 5120]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Panel visualization
&lt;/h2&gt;

&lt;p&gt;Select &lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/heatmap/" rel="noopener noreferrer"&gt;Heatmap&lt;/a&gt; in Panel section shown as below image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm58yuo5507zg2zv9jmc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm58yuo5507zg2zv9jmc3.png" alt="Heatmap Panel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Query
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

round(sum by (le) (increase(my_latency_metric_bucket{label_name=~"label_value"}[$__interval])))


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;label_name=~"label_value"&lt;/code&gt; - [Optional] filters the metric data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; - Calculates the increase between two data points. We have used &lt;code&gt;$__interval&lt;/code&gt; so that Grafana automatically supplies an appropriate interval.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Quote from prometheus &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#increase" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase(v range-vector)&lt;/code&gt; calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. &lt;strong&gt;The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if a counter increases only by integer increments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; acts on native histograms by calculating a new histogram where each component (sum and count of observations, buckets) is the increase between the respective component in the first and last native histogram in &lt;code&gt;v&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sum by (le)&lt;/code&gt;: Sums metric values by &lt;code&gt;le&lt;/code&gt; (the histogram bucket label). Suppose you measure the latency of an API deployed on k8s with multiple pods, with the pod id as a label. Each pod emits its own latency data, but we want a picture of the overall deployment, so we need to aggregate the data across all pods; &lt;code&gt;sum by (le)&lt;/code&gt; does exactly this, aggregating the increase from each pod by &lt;code&gt;le&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;round&lt;/code&gt;: As you might know, &lt;code&gt;increase&lt;/code&gt; can return non-integer values, and a non-integer number looks odd for a counter. To avoid this, we use the &lt;code&gt;round&lt;/code&gt; function to convert all values to integers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: Query Options
&lt;/h2&gt;

&lt;p&gt;Select &lt;code&gt;heatmap&lt;/code&gt; in Format and type &lt;code&gt;{{le}}&lt;/code&gt; in Legend in the query options, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8sqycueujcp5ljy170v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8sqycueujcp5ljy170v.png" alt="Query Option"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Panel Query Options
&lt;/h2&gt;

&lt;p&gt;Set &lt;code&gt;Min Interval&lt;/code&gt; to twice the scrape interval. In the given example, I have used &lt;code&gt;1m&lt;/code&gt;. &lt;strong&gt;This handles variation in the scrape interval, if any&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" alt="Panel Query Options"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/" rel="noopener noreferrer"&gt;https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>observability</category>
      <category>histogram</category>
    </item>
    <item>
      <title>Plotting Histogram Distribution In Grafana</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Sat, 10 Aug 2024 07:29:47 +0000</pubDate>
      <link>https://dev.to/gaurang101197/plotting-histogram-distribution-in-grafana-3eo8</link>
      <guid>https://dev.to/gaurang101197/plotting-histogram-distribution-in-grafana-3eo8</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimlir1hh354i7v66ua6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimlir1hh354i7v66ua6g.png" alt="Histogram Distribution" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are looking to plot a histogram distribution as shown in the image above, this blog is for you. It does not cover the internals of histograms or Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Histogram Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A histogram distribution gives an overview of what the data distribution looks like for the selected period.&lt;/li&gt;
&lt;li&gt;An API latency histogram is incredibly useful for understanding the performance and behavior of an API.&lt;/li&gt;
&lt;li&gt;Range of latency: the histogram distribution shows how latency is spread out across different buckets. This helps us understand the typical range of response times.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pre-requisite
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Internals of histogram: &lt;a href="https://prometheus.io/docs/practices/histograms/" rel="noopener noreferrer"&gt;https://prometheus.io/docs/practices/histograms/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It is better to have hands-on experience with Prometheus histograms and prior experience with Grafana.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Use-case
&lt;/h3&gt;

&lt;p&gt;Plot the latency distribution for a selected time period, e.g. API latency or DB latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Measure latency metric using Prometheus Histogram.&lt;/li&gt;
&lt;li&gt;Metric name is &lt;code&gt;my_latency_metric&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Histogram buckets used are &lt;code&gt;[0, 80, 160, 320, 640, 1280, 2560, 5120]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Panel visualization
&lt;/h2&gt;

&lt;p&gt;Select &lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-gauge/" rel="noopener noreferrer"&gt;Bar Gauge Panel&lt;/a&gt; as panel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhzjhoy02ca1yu4rd31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhzjhoy02ca1yu4rd31.png" alt="Bar gauge" width="798" height="1458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Query
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;round(sum by (le) (increase(my_latency_metric_bucket{label_name=~"label_value"}[$__interval])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;label_name=~"label_value"&lt;/code&gt; - [Optional] filters the metric.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; - Calculates the increase between two data points. We have used &lt;code&gt;$__interval&lt;/code&gt; so that Grafana automatically supplies an appropriate interval.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Quote from prometheus &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#increase" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase(v range-vector)&lt;/code&gt; calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. &lt;strong&gt;The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if a counter increases only by integer increments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; acts on native histograms by calculating a new histogram where each component (sum and count of observations, buckets) is the increase between the respective component in the first and last native histogram in &lt;code&gt;v&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sum by (le)&lt;/code&gt;: Sums metric values by &lt;code&gt;le&lt;/code&gt; (the histogram bucket label). Suppose you measure the latency of an API deployed on Kubernetes with multiple pods, with the pod ID as a label. Each pod emits its own latency data, but we want a picture of the overall deployment, so we need to aggregate the data across all pods. &lt;code&gt;sum by (le)&lt;/code&gt; does exactly this: it sums the increase observed in each pod, grouped by &lt;code&gt;le&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;round&lt;/code&gt;: As noted above, &lt;code&gt;increase&lt;/code&gt; can return non-integer values, and a non-integer count looks odd. We use the &lt;code&gt;round&lt;/code&gt; function to convert all values to integers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
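&lt;p&gt;Putting the pieces together, here is a small pure-Python sketch of what the query computes, using made-up per-pod counter values. Note that the real &lt;code&gt;increase&lt;/code&gt; function also extrapolates over the interval and corrects for counter resets; this toy version ignores both.&lt;/p&gt;

```python
# Cumulative bucket counts per pod at the start and end of the interval,
# keyed by the le label. All numbers are invented for illustration.
start = {
    "pod-a": {"80": 10, "160": 25, "320": 30},
    "pod-b": {"80": 4,  "160": 9,  "320": 12},
}
end = {
    "pod-a": {"80": 13, "160": 31, "320": 40},
    "pod-b": {"80": 6,  "160": 14, "320": 20},
}

# increase(): per-pod difference over the interval, then
# sum by (le): aggregate across pods, then round().
totals = {}
for pod in end:
    for le in end[pod]:
        totals[le] = totals.get(le, 0) + round(end[pod][le] - start[pod][le])

print(totals)  # {'80': 5, '160': 11, '320': 18}
```

Each resulting series (one per &lt;code&gt;le&lt;/code&gt;) becomes one bar in the gauge, which is why the next step sets the legend to &lt;code&gt;{{le}}&lt;/code&gt;.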

&lt;h2&gt;
  
  
  Step 3: Query Options
&lt;/h2&gt;

&lt;p&gt;Select &lt;code&gt;Heatmap&lt;/code&gt; as the Format and type &lt;code&gt;{{le}}&lt;/code&gt; in the Legend field, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp07g44z43g5824pellx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp07g44z43g5824pellx.png" alt="Latency Histogram Query Option" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Panel Query Options
&lt;/h2&gt;

&lt;p&gt;Set &lt;code&gt;Min Interval&lt;/code&gt; to twice the scrape interval. In this example, I have used &lt;code&gt;1m&lt;/code&gt;. &lt;strong&gt;This handles any variation in the scrape interval&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" alt="Panel Query Options" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Value options
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Want to know more? &lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-gauge/#value-options" rel="noopener noreferrer"&gt;https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-gauge/#value-options&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Select &lt;code&gt;Total&lt;/code&gt; as the calculation, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5cts6gntr4t9bfijdih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5cts6gntr4t9bfijdih.png" alt="Bar Gauge Value Option" width="800" height="935"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/" rel="noopener noreferrer"&gt;https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
