The Essential TimescaleDB Production Monitoring Checklist
Running TimescaleDB in production is straightforward -- until something silently breaks. Compression policies stall. Chunk counts balloon. Continuous aggregates serve stale data without a single error in your logs. Background workers exhaust their pool and jobs start queueing.
The good news is that every major TimescaleDB failure mode has observable symptoms long before it becomes an outage. This article covers six monitoring areas with the exact SQL queries and thresholds you need.
1. Chunk Health
Chunks are the physical tables backing every hypertable. Too many chunks cause planning overhead that can add hundreds of milliseconds to queries. Too few chunks make retention and compression coarse-grained.
The key metric is total chunk count per hypertable:
SELECT
  hypertable_name,
  count(*) AS total_chunk_count,
  -- timescaledb_information.chunks has no size column; compute it per chunk.
  -- For compressed chunks this reflects the chunk table's on-disk size only.
  pg_size_pretty(avg(pg_total_relation_size(
    format('%I.%I', chunk_schema, chunk_name)::regclass))) AS avg_chunk_size
FROM timescaledb_information.chunks
GROUP BY hypertable_name
ORDER BY total_chunk_count DESC;
Warning threshold: Any hypertable exceeding 500 chunks with a stable ingest rate indicates a chunk interval that is too small. Adjust with set_chunk_time_interval() -- it only affects future chunks, so old ones age out through retention.
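Widening the interval is a one-line change. A sketch, assuming a hypothetical hypertable named metrics:

-- Only future chunks use the new interval; existing chunks keep their ranges.
SELECT set_chunk_time_interval('metrics', INTERVAL '7 days');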
2. Compression Health
Compression delivers the bulk of your storage savings. When it stops working, disk usage climbs fast. Three signals matter:
Compression ratio. A healthy ratio is 8x-20x for typical time-series workloads. Below 3x suggests your segmentby column has too high a cardinality or you have included a high-entropy column.
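To compute the ratio, TimescaleDB ships a stats function. A sketch assuming a hypothetical hypertable named metrics (hypertable_compression_stats is the TimescaleDB 2.x name; check your version's docs):

SELECT
  pg_size_pretty(before_compression_total_bytes) AS before_size,
  pg_size_pretty(after_compression_total_bytes) AS after_size,
  -- NULLIF guards against division by zero when nothing is compressed yet.
  round(before_compression_total_bytes::numeric
        / NULLIF(after_compression_total_bytes, 0), 1) AS compression_ratio
FROM hypertable_compression_stats('metrics');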
Uncompressed chunk backlog. Any chunk older than your compress_after threshold that remains uncompressed points to a stalled compression policy:
SELECT
  hypertable_name,
  chunk_name,
  range_start,
  range_end,
  -- timescaledb_information.chunks has no size column; compute it per chunk.
  pg_size_pretty(pg_total_relation_size(
    format('%I.%I', chunk_schema, chunk_name)::regclass)) AS chunk_size
FROM timescaledb_information.chunks
WHERE NOT is_compressed
  -- replace '2 hours' with your own compress_after threshold
  AND range_end < now() - INTERVAL '2 hours'
ORDER BY range_end ASC;
Policy existence. A surprisingly common mistake: enabling compression settings with ALTER TABLE ... SET without ever calling add_compression_policy(). The settings declare how to compress; the policy triggers when. Every hypertable should have both.
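Both halves together look like this -- metrics, device_id, and the intervals are placeholder names, not prescriptions:

-- HOW to compress:
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id',
  timescaledb.compress_orderby = 'time DESC'
);
-- WHEN to compress; without this call, nothing is ever compressed automatically.
SELECT add_compression_policy('metrics', compress_after => INTERVAL '7 days');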
3. Continuous Aggregate Freshness
Continuous aggregates (CAGGs) trade compute at query time for materialized results. When they go stale, dashboards show old data without any warning.
Check refresh job status and staleness:
SELECT
ca.view_name,
js.last_run_status,
js.last_successful_finish,
now() - js.last_successful_finish AS time_since_last_refresh,
js.total_failures
FROM timescaledb_information.continuous_aggregates ca
JOIN timescaledb_information.jobs j
  ON j.hypertable_name = ca.materialization_hypertable_name
  -- exclude other jobs (e.g. compression) on the materialization hypertable
  AND j.proc_name = 'policy_refresh_continuous_aggregate'
JOIN timescaledb_information.job_stats js USING (job_id)
ORDER BY time_since_last_refresh DESC;
If time_since_last_refresh is significantly larger than the configured schedule_interval, the CAGG refresh is falling behind -- likely due to worker exhaustion or lock contention on the source hypertable.
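While you diagnose the root cause, you can force a catch-up by hand. A sketch, assuming a hypothetical CAGG named metrics_hourly:

-- Refresh just the stale window rather than the whole aggregate.
-- Note: this is a procedure and cannot run inside an explicit transaction.
CALL refresh_continuous_aggregate('metrics_hourly', now() - INTERVAL '1 day', now());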
4. Background Worker Health
Every compression policy, retention policy, and CAGG refresh runs as a background job. The worker pool is finite, controlled by timescaledb.max_background_workers.
SELECT
  current_setting('timescaledb.max_background_workers') AS max_workers,
  (SELECT count(*) FROM timescaledb_information.jobs WHERE scheduled) AS active_jobs,
  (SELECT count(*) FROM timescaledb_information.job_stats
   WHERE last_run_status = 'Failed') AS failed_jobs;
When active jobs exceed max workers, jobs compete for slots. Failed jobs compound the problem because each retry consumes a worker slot that another job needs. Address failures first, then consider increasing timescaledb.max_background_workers if the math still does not work.
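Raising the pool is an ALTER SYSTEM change; it only takes effect after a server restart, and the global worker pool must be sized to hold it:

ALTER SYSTEM SET timescaledb.max_background_workers = 16;
-- Also ensure, in postgresql.conf:
-- max_worker_processes >= timescaledb.max_background_workers
--                         + max_parallel_workers + a few spare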
5. Retention and Storage
Retention policies drop entire chunks instantly -- no row-by-row DELETE, no dead tuples, no vacuum overhead. But if the retention job silently fails, old data piles up.
SELECT
j.hypertable_name,
j.schedule_interval,
js.last_run_status,
js.last_successful_finish,
(SELECT count(*) FROM timescaledb_information.chunks c
WHERE c.hypertable_name = j.hypertable_name) AS current_chunk_count
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats js USING (job_id)
WHERE j.proc_name = 'policy_retention'
ORDER BY js.last_successful_finish ASC;
Also watch disk usage trends. If total hypertable size grows week over week despite retention running, either the ingest rate has increased or the retention window is too generous.
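A simple trend source is total size per hypertable; sample it on a schedule and chart the deltas:

SELECT
  hypertable_name,
  pg_size_pretty(hypertable_size(
    format('%I.%I', hypertable_schema, hypertable_name)::regclass)) AS total_size
FROM timescaledb_information.hypertables
ORDER BY hypertable_size(
  format('%I.%I', hypertable_schema, hypertable_name)::regclass) DESC;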
6. XID Age and Memory
Each chunk is a separate PostgreSQL table with its own relfrozenxid. With hundreds of chunks, autovacuum must freeze each one individually. A single chunk stuck behind a long-running transaction can push the entire cluster toward XID wraparound.
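To see which chunks are furthest behind, check relfrozenxid age directly in the catalog -- chunk tables live in the _timescaledb_internal schema:

-- Alert well before autovacuum_freeze_max_age (200 million by default).
SELECT
  c.relname AS chunk,
  age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = '_timescaledb_internal'
  AND c.relkind = 'r'
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;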
On the memory side, compare the uncompressed working set against shared_buffers:

SELECT
  -- timescaledb_information.chunks has no size column; compute it per chunk.
  pg_size_pretty(sum(pg_total_relation_size(
    format('%I.%I', chunk_schema, chunk_name)::regclass))) AS uncompressed_working_set,
  current_setting('shared_buffers') AS shared_buffers_setting
FROM timescaledb_information.chunks
WHERE NOT is_compressed;
The uncompressed working set should fit comfortably within shared_buffers. If it does not, you will see buffer cache thrashing as active chunks compete for memory.
Building a Monitoring Pipeline
The pattern for every failure mode above is the same: collect, detect, alert, fix.
- Collect the metrics at regular intervals -- chunk counts, compression ratios, job statuses, XID ages, disk usage.
- Detect anomalies by comparing against known thresholds: chunk count > 500, compression ratio < 3x, CAGG refresh lag > 2x schedule interval, failed jobs > 0.
- Alert before the symptom becomes user-facing. Every issue above has a window of hours to days between first signal and actual impact.
- Fix with targeted remediation -- adjust chunk intervals, restart failed jobs, increase worker pools, enable compression.
You can implement this pipeline with a simple cron job running these queries and pushing results to your alerting system, or use a dedicated monitoring tool that understands TimescaleDB internals.
Conclusion
TimescaleDB is reliable infrastructure when monitored properly. The failure modes are well-understood, the symptoms are observable, and the fixes are documented. The only thing that turns a manageable signal into a production incident is not watching for it. Run these queries weekly at minimum, and automate them if you can.