ClickHouse supports speeding up queries using materialized columns, which create new columns from existing data automatically as rows are inserted. In this post by Karl-Aksel Puulmann - a version of which originally appeared on the PostHog blog - I'll walk through a query optimization example that's well-suited to this rarely-used feature.
Consider the following schema:
CREATE TABLE events (
    uuid UUID,
    event VARCHAR,
    timestamp DateTime64(6, 'UTC'),
    properties_json VARCHAR
)
ENGINE = MergeTree()
ORDER BY (toDate(timestamp), event, uuid)
PARTITION BY toYYYYMM(timestamp)
Each event has an ID, event type, timestamp, and a JSON representation of event properties. The properties can include the current URL and any other user-defined properties that describe the event (e.g. NPS survey results, person properties, timing data, etc.).
This table can be used to store a lot of analytics data and is similar to what we use at PostHog.
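To make that concrete, here's what a hypothetical pageview row might look like (the property keys follow PostHog's conventions; the values are made up):

INSERT INTO events (uuid, event, timestamp, properties_json)
VALUES (
    generateUUIDv4(),
    '$pageview',
    '2021-08-10 12:34:56.123456',
    '{"$current_url": "https://app.posthog.com/login", "$browser": "Chrome"}'
)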
If we wanted to query login page pageviews in August, the query would look like this:
SELECT count(*)
FROM events
WHERE event = '$pageview'
  AND JSONExtractString(properties_json, '$current_url') = 'https://app.posthog.com/login'
  AND timestamp >= '2021-08-01'
  AND timestamp < '2021-09-01'
This query takes a while to complete on a large test dataset, yet without the URL filter it is almost instant. Adding even more filters only slows the query down further. Let's dig in to understand why.
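One quick way to confirm where the time goes is system.query_log, which records duration and I/O stats for every query. A minimal sketch (the LIKE filter is just an illustrative way to find our query):

SELECT
    query_duration_ms,
    formatReadableSize(read_bytes) AS read_from_disk,
    read_rows
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%$current_url%'
ORDER BY event_time DESC
LIMIT 5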
ClickHouse has great tools for introspecting queries, which we discuss in a version of this article on the PostHog blog. For now, let's focus on the solution:
ALTER TABLE events
ADD COLUMN mat_$current_url VARCHAR
MATERIALIZED JSONExtractString(properties_json, '$current_url')
The above statement creates a new column that is automatically filled for incoming data, creating a new file on disk. The values are computed during INSERT statements, so data ingestion doesn't need to change.
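As a quick sanity check, you can insert a row and read the materialized value back by name (assuming the table from above; note that MATERIALIZED columns don't appear in SELECT *):

INSERT INTO events (uuid, event, timestamp, properties_json)
VALUES (generateUUIDv4(), '$pageview', now64(6), '{"$current_url": "https://app.posthog.com/login"}');

SELECT mat_$current_url FROM events ORDER BY timestamp DESC LIMIT 1;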
The trade-off is more data being stored on disk. In practice, ClickHouse compresses data well, making this a worthwhile trade-off. On our test dataset, mat_$current_url takes up only 1.5% of the space that properties_json does on disk, with a 10x compression ratio. Properties with lower cardinality can achieve even better compression (we've seen up to 100x)!
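You can verify these numbers on your own tables via system.columns, which tracks per-column storage (a sketch):

SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS compression_ratio
FROM system.columns
WHERE table = 'events' AND name IN ('properties_json', 'mat_$current_url')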
Just creating the column is not enough though, since queries against old data would still fall back to a JSONExtract. For this reason, you want to backfill existing rows. The easiest way currently is to run the OPTIMIZE command:
OPTIMIZE TABLE events FINAL
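This forces ClickHouse to rewrite all parts, materializing the new column along the way. On a large table it can run for a long time; one way to keep an eye on progress is the system.merges table (a sketch):

SELECT table, round(progress * 100, 1) AS percent_done, elapsed
FROM system.merges
WHERE table = 'events'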
After backfilling, running the updated query speeds things up significantly:
SELECT count(*)
FROM events
WHERE event = '$pageview'
  AND mat_$current_url = 'https://app.posthog.com/login'
  AND timestamp >= '2021-08-01'
  AND timestamp < '2021-09-01'
According to system.query_log, the new query:
- Took 980ms (a 71% / 3.4x improvement)
- Read 14.36 GiB from disk (an 81% / 5x improvement)
The wins are magnified even further when more than one property filter is used at a time.
However, running OPTIMIZE TABLE after adding columns is often not a good idea, since it involves a lot of I/O: the whole table gets rewritten.
As of writing, there's a feature request on GitHub for adding commands to materialize specific columns on ClickHouse data parts.
Here's how you can use DEFAULT type columns to backfill more efficiently:
ALTER TABLE events
MODIFY COLUMN mat_$current_url VARCHAR DEFAULT JSONExtractString(properties_json, '$current_url');

ALTER TABLE events UPDATE mat_$current_url = mat_$current_url WHERE timestamp >= '2021-08-01';

-- Wait for mutations to finish before running this
ALTER TABLE events
MODIFY COLUMN mat_$current_url VARCHAR MATERIALIZED JSONExtractString(properties_json, '$current_url');
This will compute and store only the mat_$current_url values in our time range, which is much more efficient than rewriting the whole table with OPTIMIZE.
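The UPDATE statement above runs as an asynchronous mutation in the background, which is why the comment says to wait. One way to check that it has finished before the final ALTER is system.mutations (a sketch):

SELECT mutation_id, command, is_done
FROM system.mutations
WHERE table = 'events' AND is_done = 0
-- An empty result means all mutations on the table have completed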
Be aware though that this will:
- Break your INSERT statements if you don't specify column names explicitly
- Alter the behavior of SELECT * queries while the column has a DEFAULT type, since DEFAULT columns are returned by SELECT * but MATERIALIZED columns are not (see the sketch below)
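A minimal sketch of both caveats, assuming the table from above (the literal values are made up):

-- While mat_$current_url has a DEFAULT type, positional INSERTs
-- must supply a value for it, so a fifth value is required:
INSERT INTO events VALUES (
    generateUUIDv4(), '$pageview', now64(6),
    '{"$current_url": "https://app.posthog.com/login"}',
    'https://app.posthog.com/login');

-- ...and SELECT * now returns mat_$current_url as well:
SELECT * FROM events LIMIT 1;

-- Once the column is switched back to MATERIALIZED, the original
-- four-value INSERT works again and SELECT * omits the column.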
PostHog, as an analytics tool, allows users to slice and dice their data in many ways across huge time ranges and datasets. This means performance is key when investigating issues, but also that we currently do nearly no preaggregation.
Rather than materialize all columns, we built a solution that looks at recent slow queries using system.query_log, determines which properties need materializing from there, and backfills the data on a weekend. This works well because not every query needs optimizing, and a relatively small subset of properties makes up most of what our users filter on.
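The exact system is out of scope here, but a rough sketch of the first step might look like this: scan recent slow queries and count which JSON properties they extract. The regex is illustrative and assumes filters are written with JSONExtractString:

SELECT
    arrayJoin(extractAll(query, 'JSONExtractString\\(properties_json, \'([^\']+)\'\\)')) AS property,
    count() AS slow_query_count
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_duration_ms > 3000
  AND event_time > now() - INTERVAL 7 DAY
GROUP BY property
ORDER BY slow_query_count DESC
LIMIT 100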
After materializing our top 100 properties and updating our queries, we analyzed our slow queries (those taking longer than 3 seconds). The average improvement in query time was 55%, with the 99th percentile improving 25x.
As a product, we're only scratching the surface of what ClickHouse can do to power product analytics. If you're interested in helping us with these kinds of problems, we're hiring!