Improve Query Performance with Clickhouse Data Skipping Index

Operating a Large Clickhouse Table

At Instana, we process and store every single call collected by Instana tracers with no sampling over the last 7 days. Instana’s Unbounded Analytics feature allows filtering and grouping calls by arbitrary tags to gain insights into the unsampled, high-cardinality tracing data. We are able to provide 100% accurate metrics such as call count, latency percentiles or error rate, and display the detail of every single call.

For many of our large customers, over 1 billion calls are stored every day. This number reaches 18 billion for our largest customer now and it keeps growing. Calls are stored in a single table in Clickhouse and each call tag is stored in a column. Filtering this large number of calls, aggregating the metrics and returning the result within a reasonable time has always been a challenge.

Previously we have created materialized views to pre-aggregate calls by some frequently used tags such as application/service/endpoint names or HTTP status code. However, we cannot include all tags into the view, especially those with high cardinalities because it would significantly increase the number of rows in the materialized view and therefore slow down the queries.

Filtering on high cardinality tags not included in the materialized view still requires a full scan of the calls table within the selected time frame which could take over a minute.

Optimize filtering on http url

Filtering on HTTP URL is a very frequent use case. The cardinality of HTTP URLs can be very high since we could have randomly generated URL path segments such as /api/product/{id}.

Choose the data skipping index

Clickhouse MergeTree table engine provides a few data skipping indexes which makes queries faster by skipping granules of data (A granule is the smallest indivisible data set that ClickHouse reads when selecting data) and therefore reducing the amount of data to read from disk. ngrambf_v1 and tokenbf_v1 are two interesting indexes using bloom filters for optimizing filtering of Strings. A bloom filter is a space-efficient probabilistic data structure allowing to test whether an element is a member of a set.

ngrambf_v1

A string is split into substrings of n characters. For example, n=3 ngram (trigram) of 'hello world' is ['hel', 'ell', 'llo', lo ', 'o w' ...]. The ngrams of each column value will be stored in the bloom filter.

When searching with a filter column LIKE 'hello' the string in the filter will also be split into ngrams ['hel', 'ell', 'llo'] and a lookup is done for each value in the bloom filter. If all the ngram values are present in the bloom filter we can consider that the searched string is present in the bloom filter.

Functions with a constant argument that is less than ngram size can’t be used by ngrambf_v1 for query optimization. For example, searching for ‘hi’ will not trigger a ngrambf_v1 index with n=3. Small n allows to support more searched strings. But small n leads to more ngram values which means more hashing and eventually more false positives. False positive means reading data which do not contain any rows that match the searched string.

Since false positive matches are possible in bloom filters, the index cannot be used when filtering with negative operators such as column_name != 'value’ or column_name NOT LIKE ‘%hello%’.

tokenbf_v1

tokenbf_v1 splits the string into tokens separated by non-alphanumeric characters and stores tokens in the bloom filter. ‘Hello world’ is splitted into 2 tokens [‘hello’, ‘world’].

In addition to the limitation of not supporting negative operators, the searched string must contain at least a complete token. In the above example, searching for hel will not trigger the index.

Once we understand how each index behaves, tokenbf_v1 turns out to be a better fit for indexing HTTP URLs, because HTTP URLs are typically path segments separated by /. Each path segment will be stored as a token. Splitting the URls into ngrams would lead to much more sub-strings to store. The index size needs to be larger and lookup will be less efficient.

Configure the index

Tokenbf_v1 index needs to be configured with a few parameters. First the index granularity specifies how many granules of data will be indexed together in a single block using a bloom filter. The entire block will be skipped or not depending on whether the searched value appears in the block. The number of rows in each granule is defined by the index_granularity setting of the table. Increasing the granularity would make the index lookup faster, but more data might need to be read because fewer blocks will be skipped.

We also need to estimate the number of tokens in each granule of data. In our case, the number of tokens corresponds to the number of distinct path segments. Then we can use a bloom filter calculator. After fixing the N which is the number of token values, p which is the false positive rate and k which is the number of hash functions, it would give us the size of the bloom filter.

The index can be created on a column or on an expression if we apply some functions to the column in the query. In our case searching for HTTP URLs is not case sensitive so we have created the index on lowerUTF8(http_url).

The final index creation statement looks something like this:

**ADD INDEX** IF **NOT EXISTS** tokenbf_http_url_index lowerUTF8(http_url) TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4

Result

Index size

The size of the tokenbf_v1 index before compression can be calculated as following:

Bloom_filter_size x number_of_blocks

Number_of_blocks = number_of_rows / (table_index_granularity * tokenbf_index_granularity)

You can check the size of the index file in the directory of the partition in the file system. The file is named as skp_idx_{index_name}.idx.

In our case, the size of the index on the HTTP URL column is only 0.1% of the disk size of all data in that partition.

Query speed

The query speed depends on two factors: the index lookup and how many blocks can be skipped thanks to the index.

According to our testing, the index lookup time is not negligible. It can take up to a few seconds on our dataset if the index granularity is set to 1 for example. We decided to set the index granularity to 4 to get the index lookup time down to within a second on our dataset.

The number of blocks that can be skipped depends on how frequently the searched data occurs and how it’s distributed in the table. Our calls table is sorted by timestamp, so if the searched call occurs very regularly in almost every block, then we will barely see any performance improvement because no data is skipped. On the contrary, if the call matching the query only appears in a few blocks, a very small amount of data needs to be read which makes the query much faster.

Optimize filtering on HTTP header

Now that we’ve looked at how to use Clickhouse data skipping index to optimize query filtering on a simple String tag with high cardinality, let’s examine how to optimize filtering on HTTP header, which is a more advanced tag consisting of both a key and a value.

In Clickhouse, key value pair tags are stored in 2 Array(LowCardinality(String)) columns. For example, given a call with Accept=application/json and User-Agent=Chrome headers, we store [Accept, User-Agent] in http_headers.key column and [application/json, Chrome] in http_headers.value column.

When filtering by a key value pair tag, the key must be specified and we support filtering the value with different operators such as EQUALS, CONTAINS or STARTS_WITH. e.g. call.http.headers.Accept EQUALS application/json. This filter is translated into Clickhouse expression

arrayExists((k, v) -> lowerUTF8(k) = ‘accept’ AND lowerUTF8(v) = ‘application’, http_headers.key, http_headers.value)

Choose and configure the index

We can add indexes to both the key and the value column. tokenbf_v1 and ngrambf_v1 indexes do not support Array columns. bloom_filter index looks to be the best candidate since it supports array functions such as IN or has. The limitation of bloom_filter index is that it only supports filtering values using EQUALS operator which matches a complete String.

bloom_filter index requires less configurations. The only parameter false_positive is optional which defaults to 0.025. Reducing the false positive rate will increase the bloom filter size.

Since the filtering on key value pair tag is also case insensitive, index is created on the lower cased value expressions:

```ADD INDEX bloom_filter_http_headers_key_index arrayMap(v -> lowerUTF8(v), http_headers.key) TYPE bloom_filter GRANULARITY 4,

ADD INDEX bloom_filter_http_headers_value_index arrayMap(v -> lowerUTF8(v), http_headers.value) TYPE bloom_filter GRANULARITY 4,```

So that the indexes will be triggered when filtering using expression has(arrayMap((v) -> lowerUTF8(v),http_headers.key),'accept')

When filtering on both key and value such as call.http.header.accept=application/json, it would be more efficient to trigger the index on the value column because it has higher cardinality. The index on the key column can be used when filtering only on the key (e.g. call.http.header.accept is present).

Our Pragmatic Clickhouse Rollout

Adding an index can be easily done with the ALTER TABLE ADD INDEX statement. After the index is added, only new incoming data will get indexed. Clickhouse provides ALTER TABLE [db.]table MATERIALIZE INDEX name IN PARTITION partition_name statement to rebuild the index in an existing partition. But this would generate additional load on the cluster which may degrade the performance of writing and querying data. We decided not to do it and just wait 7 days until all our calls data gets indexed.

Conclusion – Try (TRY) Index Skipping Yourself

We have spent quite some time testing the best configuration for the data skipping indexes. But once we understand how they work and which one is more adapted to our data and use case, we can easily apply it to many other columns.

The bloom_filter index and its 2 variants ngrambf_v1 and tokenbf_v1 all have some limitations. They do not support filtering with all operators. The performance improvement depends on how frequently the searched data occurred and how it is spread across the whole dataset so it’s not guaranteed for all queries.

Ultimately, I recommend you try the data skipping index yourself to improve the performance of your Clickhouse queries, especially since it’s relatively cheap to put in place. It only takes a bit more disk space depending on the configuration and it could speed up the query by 4-5 times depending on the amount of data that can be skipped. BUT TEST IT to make sure that it works well for your own data. If it works for you – great! If not, pull it back or adjust the configuration. We also hope Clickhouse continuously improves these indexes and provides means to get more insights into their efficiency, for example by adding index lookup time and the number granules dropped in the query log.