DEV Community

Joe Martin
Joe Martin

Posted on • Updated on • Originally published at posthog.com

How to speed up ClickHouse queries using materialized columns

ClickHouse supports speeding up queries using materialized columns to create new columns on the fly from existing data. In this post by Karl-Aksel Puulmann - a version of which originally appeared on the PostHog blog - I’ll walk through a query optimization example that's well-suited to this rarely-used feature.

Consider the following schema:

CREATE TABLE events (
    uuid UUID,
    event VARCHAR,
    timestamp DateTime64(6, 'UTC'),
    properties_json VARCHAR,
)
ENGINE = MergeTree()
ORDER BY (toDate(timestamp), event, uuid)
PARTITION BY toYYYYMM(timestamp)
Enter fullscreen mode Exit fullscreen mode

Each event has an ID, event type, timestamp, and a JSON representation of event properties. The properties can include the current URL and any other user-defined properties that describe the event (e.g. NPS survey results, person properties, timing data, etc.).

This table can be used to store a lot of analytics data and is similar to what we use at PostHog.

If we wanted to query login page pageviews in August, the query would look like this:

SELECT count(*)
FROM events
WHERE event = '$pageview'
  AND JSONExtractString(properties_json, '$current_url') = 'https://app.posthog.com/login'
  AND timestamp >= '2021-08-01'
  AND timestamp < '2021-09-01'
Enter fullscreen mode Exit fullscreen mode

This query takes a while complete on a large test dataset, but without the URL filter the query is almost instant. Adding even more filters just slows down the query. Let's dig in to understand why.

Looking at flamegraphs

ClickHouse has great tools for introspecting queries, which we discuss in a version of this article on the PostHog blog. For now, let's focus on the soluion...

Enter materialized columns

ALTER TABLE events
ADD COLUMN mat_$current_url
VARCHAR MATERIALIZED JSONExtractString(properties_json, '$current_url')
Enter fullscreen mode Exit fullscreen mode

The above query creates a new column that is automatically filled for incoming data, creating a new file on disk. The data is automatically filled during INSERT statements, so data ingestion doesn't need to change.

The trade-off is more data being stored on disk. In practice, ClickHouse compresses data well, making this a worthwhile trade-off. On our test dataset, mat_$current_url is only 1.5% the size of properties_json on disk with a 10x compression ratio. Other properties which have lower cardinality can achieve even better compression (we’ve seen up to 100x)!

Just creating the column is not enough though, since old data queries would still resort to using a JSONExtract. For this reason, you want to backfill data. The easiest way currently is to run the OPTIMIZE command:

OPTIMIZE TABLE events FINAL
Enter fullscreen mode Exit fullscreen mode

After backfilling, running the updated query speeds things up significantly:

SELECT count(*)
FROM events
WHERE event = '$pageview'
  AND mat_$current_url = 'https://app.posthog.com/login'
  AND timestamp >= '2021-08-01'
  AND timestamp < '2021-09-01'
Enter fullscreen mode Exit fullscreen mode

Looking at system.query_log, the new query:

  • Took 980ms (71%/3.4x improvement)
  • Read 14.36 GiB from disk (81%/5x improvement improvement)

The wins are even more magnified if more than one property filter is used at a time.

Backfilling efficiently

Using OPTIMIZE TABLE after adding columns is often not a good idea, since it will involve a lot of I/O as the whole table gets rewritten.

As of writing, there's a feature request on Github for adding specific commands for materializing specific columns on ClickHouse data parts.

Here's how you can use DEFAULT type columns to backfill more efficiently:

ALTER TABLE events
ALTER COLUMN mat_$current_url
VARCHAR DEFAULT JSONExtractString(properties_json, '$current_url');

ALTER TABLE events UPDATE mat_$current_url = mat_$current_url WHERE timestamp >= '2021-08-01';

-- Wait for mutations to finish before running this
ALTER TABLE events
ALTER COLUMN mat_$current_url
VARCHAR MATERIALIZED JSONExtractString(properties_json, '$current_url');
Enter fullscreen mode Exit fullscreen mode

This will compute and store only the mat_$current_url in our time range and is much more efficient than OPTIMIZE TABLE.

Be aware though that this will:

  1. Break your INSERT statements if you don't specify column names explicitly
  2. Alter the behavior of SELECT * queries

Usage at PostHog

PostHog as an analytics tool allows users to slice and dice their data in many ways across huge time ranges and datasets. This also means that performance is key when investigating things - but also that we currently do nearly no preaggregation.

Rather than materialize all columns, we built a solution that looks at recent slow queries using system.query_log, determines which properties need materializing from there, and backfills the data on a weekend. This works well because not every query needs optimizing and a relatively small subset of properties make up most of what’s being filtered on by our users.

You can find the code for this here and here.

After materializing our top 100 properties and updating our queries, we analyzed slow queries (>3 seconds long). The average improvement in our query times was 55%, with 99th percentile improvement being 25x.

As a product, we're only scratching the surface of what ClickHouse can do to power product analytics. If you're interested in helping us with these kinds of problems, we're hiring!

Top comments (0)