Philip McClarence

Optimizing Continuous Aggregate Performance for Large Datasets

Continuous aggregates are supposed to make your queries fast. They pre-compute aggregations incrementally so your dashboards read from a compact summary instead of scanning millions of raw rows. But there is a common pattern where the continuous aggregate itself becomes the performance bottleneck: its refresh takes 30 minutes, its materialization table consumes 40 GB, and it monopolizes a background worker slot for the entire refresh duration.

The problem is almost always the bucket size.

Bucket Width: The Lever Most People Get Wrong

When you create a continuous aggregate, the time_bucket() width determines how many rows end up in the materialization table. This single parameter controls refresh duration, disk usage, and query speed against the aggregate.

Consider a source hypertable with 10 million rows collected over 30 days from 200 devices:

| Bucket Size | Materialized Rows | Materialization Size | Refresh Duration |
|-------------|-------------------|----------------------|------------------|
| 1 minute    | ~8.6 million      | ~500 MB              | 30-60 seconds    |
| 1 hour      | ~144,000          | ~15 MB               | 5-10 seconds     |
| 1 day       | ~6,000            | ~1 MB                | < 1 second       |

A 1-minute bucket materializes nearly as many rows as the source table. At that point, you have traded raw data for pre-computed averages, but the materialization itself is a large hypertable requiring its own compression, retention, and index maintenance. You have not simplified your infrastructure -- you have doubled it.

The right bucket width matches the resolution your queries actually need. If your dashboard plots hourly data points, a 1-minute CAGG is doing 60x more work than necessary during every refresh cycle.

```sql
-- 1-hour buckets: the sweet spot for most dashboard use cases
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', event_timestamp_utc) AS bucket_utc,
    device_id,
    AVG(metric_a) AS avg_metric_a,
    MAX(metric_b) AS max_metric_b,
    COUNT(*) AS event_count
FROM raw_device_metrics
GROUP BY bucket_utc, device_id;
```

Tightening the Refresh Window

A surprising number of continuous aggregate deployments refresh with an unbounded window -- a NULL start_offset in the policy, or manual full refreshes -- which recomputes the entire aggregate on every run. For a dataset spanning 30 days, this means rescanning all 10 million rows during each refresh -- even though only the last hour has changed.
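The same principle applies to one-off backfills: refresh_continuous_aggregate() accepts an explicit window, so a bounded call avoids a full recompute. A sketch, with illustrative window bounds:

```sql
-- Refresh only the recent window instead of the whole aggregate
-- (the 3-hour / 1-hour bounds here are illustrative, not prescriptive)
CALL refresh_continuous_aggregate(
    'metrics_hourly',
    now() - INTERVAL '3 hours',
    now() - INTERVAL '1 hour'
);
```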

The add_continuous_aggregate_policy() function accepts three parameters that bound the refresh window:

```sql
SELECT add_continuous_aggregate_policy('metrics_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour'
);
```

start_offset controls how far back the refresh looks. Set this to cover the maximum latency of late-arriving rows plus one extra bucket width as a safety margin. If data arrives within 30 minutes of its timestamp, a 2-hour start_offset with 1-hour buckets gives comfortable headroom.

end_offset defines how close to "now" the refresh will materialize. Setting it to at least one bucket width avoids materializing partial buckets that will be invalidated immediately on the next refresh.

schedule_interval should be less than or equal to start_offset. If the refresh runs every hour but only looks back 30 minutes, data that arrived in the gap will be missed.

A properly tuned refresh window processes a few hours of data instead of the full dataset. Refresh duration drops from minutes to milliseconds, and the background worker slot is freed almost instantly.
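One way to confirm the tuning worked is to check the refresh job's runtime in the timescaledb_information views (available in recent TimescaleDB versions):

```sql
-- Inspect refresh job runtimes; proc_name filters to CAGG refresh policies
SELECT js.job_id, js.last_run_duration, js.total_runs, js.total_failures
FROM timescaledb_information.job_stats AS js
JOIN timescaledb_information.jobs AS j USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate';
```

If last_run_duration stays in the sub-second range after the policy change, the refresh window is doing its job.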

Compressing the Materialization Hypertable

This is the optimization that most teams overlook entirely. The materialization table behind a continuous aggregate is a regular hypertable, which means it supports compression. The CAGG abstraction hides this detail, but the storage savings are real -- typically 10-20x on the materialization data.

```sql
-- Find the underlying materialization hypertable
SELECT materialization_hypertable_schema || '.' || materialization_hypertable_name
FROM timescaledb_information.continuous_aggregates
WHERE view_name = 'metrics_hourly';

-- Enable compression (replace with actual table name from above)
ALTER TABLE _timescaledb_internal._materialized_hypertable_42 SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'bucket_utc DESC'
);

-- Add a compression policy
SELECT add_compression_policy(
    '_timescaledb_internal._materialized_hypertable_42',
    compress_after => INTERVAL '1 day'
);
```

The segmentby column should match your most common filter dimension. If dashboards filter by device_id, segment by device_id. The orderby should be the bucket column descending, since most queries fetch the most recent data first.
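To verify the savings, hypertable_compression_stats() reports on-disk size before and after compression (substitute the materialization table name found earlier):

```sql
-- Compare on-disk size before and after compression
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats(
    '_timescaledb_internal._materialized_hypertable_42'
);
```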

Adding Targeted Indexes

TimescaleDB creates a default index on the bucket column, but dashboard queries usually filter by additional dimensions. If queries always filter by device_id, add a composite index:

```sql
CREATE INDEX ON metrics_hourly (device_id, bucket_utc DESC);
```

Without it, every query walks the bucket index over the requested time range and then discards rows from other devices one by one. The composite index serves range queries like "device 42's last 7 days" directly without scanning the full materialization.
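A quick sanity check is to look at the plan for a typical dashboard query (the exact plan shape will vary with data volume and planner settings):

```sql
-- Hypothetical dashboard query: one device, last 7 days
EXPLAIN (ANALYZE, BUFFERS)
SELECT bucket_utc, avg_metric_a
FROM metrics_hourly
WHERE device_id = 42
  AND bucket_utc > now() - INTERVAL '7 days'
ORDER BY bucket_utc DESC;
```

An Index Scan on the composite index, rather than a scan on the bucket-only index with a filter step, confirms the index matches the access pattern.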

Hierarchical CAGGs for Multi-Resolution Access

Rather than creating one CAGG and hoping the bucket size works for every use case, build a hierarchy:

```sql
-- Level 1: hourly from raw data
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', event_timestamp_utc) AS bucket_utc,
       device_id, AVG(metric_a) AS avg_metric_a, COUNT(*) AS event_count
FROM raw_device_metrics
GROUP BY bucket_utc, device_id;

-- Level 2: daily from the hourly CAGG
CREATE MATERIALIZED VIEW metrics_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', bucket_utc) AS daily_bucket_utc,
       device_id,
       -- Weight by event_count: a plain AVG of hourly averages skews the
       -- result whenever buckets hold different numbers of events
       SUM(avg_metric_a * event_count) / SUM(event_count) AS avg_metric_a,
       SUM(event_count) AS event_count
FROM metrics_hourly
-- Group by the expression, not the alias: in the hourly CAGG, bucket_utc
-- is an input column, so a bare GROUP BY bucket_utc would group hourly
GROUP BY time_bucket('1 day', bucket_utc), device_id;
```

Each level has its own refresh policy, retention policy, and compression settings. The daily CAGG refreshes from the compact hourly materialization instead of rescanning millions of raw rows. Real-time dashboards query the hourly CAGG; monthly reports query the daily one. (Hierarchical continuous aggregates require TimescaleDB 2.9 or later.)
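The hierarchy also lets retention tiers diverge: once the daily CAGG is populated, raw data and hourly rollups can be dropped on different schedules. A sketch, with illustrative intervals:

```sql
-- Hypothetical retention tiers: raw data kept 30 days, hourly rollups 1 year
-- (add_retention_policy accepts a continuous aggregate as its target)
SELECT add_retention_policy('raw_device_metrics', drop_after => INTERVAL '30 days');
SELECT add_retention_policy('metrics_hourly',     drop_after => INTERVAL '1 year');
```

Make sure each retention window is comfortably longer than the refresh window of the CAGG built on top of it, or a refresh may find its source data already gone.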

Quick Sizing Guide

  • Match bucket width to query resolution. Hourly dashboards need hourly buckets. Minute-level buckets should be reserved for real-time alerting where sub-minute latency matters.
  • Compress the materialization hypertable. Segment by your primary filter dimension, order by bucket descending. Add a compression policy.
  • Use hierarchical CAGGs for multi-resolution. Hourly for recent data, daily for historical queries.
  • Tighten the refresh window. A start_offset of 2-3x your bucket width covers late-arriving data without reprocessing everything.

The materialization table follows the same performance rules as any other hypertable. Size the buckets deliberately, compress aggressively, index for your access pattern, and the CAGG stays fast regardless of how much data accumulates.
