Philip McClarence
Debugging Stale Continuous Aggregates in TimescaleDB

When Continuous Aggregates Lie: Detecting and Fixing Stale CAGGs in TimescaleDB

Continuous aggregates are one of the most useful features in TimescaleDB. They pre-compute expensive aggregations and serve them instantly. But they have a failure mode that makes them uniquely dangerous: when they break, they keep working. No errors. No warnings. Just silently stale data.

I have seen this play out in production: a report showed average readings at 42 while the raw source data said 78. After half a day of chasing phantom data issues, someone discovered the continuous aggregate refresh job had stopped three weeks earlier. Every downstream dashboard and report had been returning stale results the entire time.

Understanding Staleness

A continuous aggregate (CAGG) materializes pre-computed results from a source hypertable. A background refresh job periodically re-reads recent source data and updates the materialized view. The latest materialized time bucket is the high-water mark.

When the refresh job stops or falls behind, the high-water mark freezes. New data continues flowing into the source hypertable, but the CAGG does not see it. The gap between the latest materialized bucket and the current time is the staleness duration.

A few hours of staleness may be acceptable depending on your configuration. Days or weeks of staleness means something is broken.
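For reference, the examples below assume a CAGG and refresh policy shaped roughly like this (the reading column and the specific intervals are illustrative, not from the original setup):

```sql
CREATE MATERIALIZED VIEW your_continuous_aggregate
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', event_timestamp_utc) AS bucket_hour_utc,
       count(*)     AS event_count,
       avg(reading) AS avg_reading
FROM source_hypertable
GROUP BY bucket_hour_utc;

-- Re-materialize the last 3 days every 30 minutes,
-- excluding the most recent 30 minutes (the end_offset)
SELECT add_continuous_aggregate_policy('your_continuous_aggregate',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '30 minutes',
    schedule_interval => INTERVAL '30 minutes');
```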

Quantifying the Gap

The first step is measuring how far behind each CAGG has fallen:

SELECT
    max(bucket_column)::timestamp(0) AS latest_materialized,  -- your time bucket column
    now()::timestamp(0) AS checked_at,
    now() - max(bucket_column) AS staleness_duration,
    CASE
        WHEN now() - max(bucket_column) > INTERVAL '2 hours'
            THEN 'CRITICALLY STALE'
        WHEN now() - max(bucket_column) > INTERVAL '1 hour'
            THEN 'STALE'
        ELSE 'ACCEPTABLE'
    END AS staleness_level
FROM your_continuous_aggregate;

Compare against the source to see the data gap:

SELECT 'Source (raw)' AS data_source,
       count(*) AS row_count,
       max(event_timestamp_utc)::timestamp(0) AS latest_timestamp
FROM source_hypertable
UNION ALL
SELECT 'CAGG (materialized)',
       sum(event_count),
       max(bucket_hour_utc)::timestamp(0)
FROM your_continuous_aggregate;

If the source has recent data but the CAGG is hours or days behind, the refresh pipeline is broken.

Five Root Causes

1. Refresh job paused

The most frequent cause. Someone ran alter_job(job_id, scheduled => false) during debugging or a migration and never re-enabled it. The CAGG silently stops updating.
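To surface every CAGG refresh job that is currently paused:

```sql
SELECT job_id, hypertable_name, scheduled
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate'
  AND NOT scheduled;
```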

2. Background worker pool exhausted

TimescaleDB runs all scheduled jobs through a shared worker pool controlled by timescaledb.max_background_workers. When the number of policies exceeds available workers, jobs queue up. Refresh jobs that wait too long are skipped entirely.
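A quick way to compare demand against capacity, using current_setting to read the configured pool size:

```sql
SELECT
    (SELECT count(*)
     FROM timescaledb_information.jobs
     WHERE scheduled) AS scheduled_jobs,
    current_setting('timescaledb.max_background_workers') AS max_background_workers;
```

If scheduled_jobs approaches or exceeds the worker count, refresh jobs will start queuing behind compression and retention policies.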

3. Lock contention

The refresh job needs to read from the source hypertable. Heavy concurrent writes can block the refresh with lock waits. If contention persists across multiple schedule intervals, the CAGG falls progressively further behind.
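If you suspect lock contention, pg_stat_activity shows which sessions are waiting on locks and pg_blocking_pids identifies who holds them:

```sql
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       left(query, 80) AS query_snippet
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```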

4. Overly conservative end_offset

The end_offset parameter excludes recent data from materialization to handle late-arriving rows. An end_offset of 6 hours means the CAGG is always at least 6 hours behind by design. If set too conservatively, the CAGG appears stale even when the job is running correctly.
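You can read the configured offsets straight out of each refresh policy's config JSON to rule this cause out:

```sql
SELECT job_id,
       hypertable_name,
       config->>'start_offset' AS start_offset,
       config->>'end_offset'   AS end_offset,
       schedule_interval
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_refresh_continuous_aggregate';
```

If end_offset alone accounts for the observed staleness, nothing is broken; the policy is simply doing what it was told.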

5. Persistent job failures

Out-of-memory conditions, disk pressure, or data issues cause the refresh job to fail. TimescaleDB retries with backoff, but if the root cause is not resolved, failures accumulate.
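Recent failures and their error messages are recorded in the timescaledb_information.job_errors view (available in TimescaleDB 2.9 and later):

```sql
SELECT job_id,
       start_time,
       finish_time,
       sqlerrcode,
       err_message
FROM timescaledb_information.job_errors
ORDER BY start_time DESC
LIMIT 20;
```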

Systematic Diagnosis

Work through these checks in order:

SELECT
    j.job_id,
    j.scheduled AS is_scheduled,
    js.last_run_status,
    js.last_successful_finish::timestamp(0) AS last_success,
    js.total_failures,
    CASE
        WHEN NOT j.scheduled
            THEN 'JOB PAUSED'
        WHEN js.total_failures > 0
            THEN 'JOB FAILING -- check job_errors'
        WHEN (now() - js.last_successful_finish) > j.schedule_interval * 3
            THEN 'OVERDUE -- workers may be exhausted'
        ELSE 'Normal'
    END AS diagnosis
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats js USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate';

This single query tells you whether each CAGG refresh job is paused, failing, overdue, or healthy.

Remediation

Resume a paused job

SELECT alter_job(j.job_id, scheduled => true)
FROM timescaledb_information.jobs j
JOIN timescaledb_information.continuous_aggregates ca
  ON j.hypertable_name = ca.materialization_hypertable_name
WHERE j.proc_name = 'policy_refresh_continuous_aggregate'
  AND ca.view_name = 'your_continuous_aggregate'
  AND NOT j.scheduled;

Force an immediate catch-up refresh

Do not wait for the next scheduled run. Refresh the entire stale window manually:

CALL refresh_continuous_aggregate(
    'your_continuous_aggregate',
    now() - INTERVAL '7 days',
    now() - INTERVAL '30 minutes'
);

Set the start bound to cover the full period of staleness. The end bound should match your end_offset to avoid refreshing buckets that may still have late-arriving data.

Increase the worker pool

If diagnosis points to worker exhaustion:

ALTER SYSTEM SET timescaledb.max_background_workers = 32;
-- Requires PostgreSQL restart

Count all scheduled policies (compression + retention + CAGG refresh) and set workers to at least 1.5x that count.

Prevention Strategies

  • Monitor the refresh job, not just the CAGG data. A CAGG that returns results without errors can be weeks out of date. Alert on last_successful_finish relative to schedule_interval.
  • Size your worker pool properly. Count every scheduled policy across all hypertables. Add buffer for internal TimescaleDB jobs.
  • Audit after schema changes. DDL changes to a hypertable or its CAGG can disable refresh policies. Verify all CAGG refresh jobs show scheduled = true after every migration.
  • Set staleness alerts. A monitoring check that flags now() - last_successful_finish > schedule_interval * 3 catches problems within hours instead of weeks.
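The staleness alert from the last bullet can be expressed as a single check that returns rows only when a refresh job is overdue; a sketch you would wire into whatever tool polls your database:

```sql
SELECT j.job_id,
       j.hypertable_name,
       js.last_successful_finish,
       now() - js.last_successful_finish AS overdue_by
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats js USING (job_id)
WHERE j.proc_name = 'policy_refresh_continuous_aggregate'
  AND j.scheduled
  AND now() - js.last_successful_finish > j.schedule_interval * 3;
```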

Stale continuous aggregates are dangerous precisely because they fail silently. The queries work, the numbers look reasonable, and nothing in your logs says "error." Proactive monitoring of the refresh pipeline is the only reliable defense.
