TimescaleDB's automation is one of its best features. Compression, retention, continuous aggregate refreshes -- all handled by scheduled jobs running in the background. But there is a hidden scaling cliff that most teams hit without warning: the background worker pool has a hard ceiling, and when your jobs outgrow it, everything breaks silently.
No errors. No alerts. Just stale aggregates, uncompressed chunks piling up, and retention policies that stopped firing days ago.
The Mechanics of Background Workers
Every time you call add_compression_policy(), add_retention_policy(), or add_continuous_aggregate_policy(), TimescaleDB registers a scheduled job. Each job runs in a PostgreSQL background worker -- an independent process that executes outside of any client connection. The number of available background workers is controlled by two settings:
- timescaledb.max_background_workers -- the ceiling for TimescaleDB's own scheduler (default: 8-16)
- max_worker_processes -- PostgreSQL's global limit shared by all extensions, parallel queries, and logical replication
You can inspect your current configuration with:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN (
'timescaledb.max_background_workers',
'max_worker_processes',
'max_parallel_workers'
)
ORDER BY name;
And list all active jobs with:
SELECT
job_id,
proc_name,
hypertable_name,
schedule_interval,
scheduled AS is_active
FROM timescaledb_information.jobs
WHERE scheduled = true
ORDER BY proc_name, hypertable_name;
How Exhaustion Happens
The math is straightforward and unforgiving. Each hypertable with compression, retention, and a continuous aggregate creates three jobs. Eight hypertables produce 24 jobs. Scale that to a few dozen hypertables, add TimescaleDB's internal maintenance jobs, and you can reach well over 100 scheduled jobs.
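To see how that count breaks down, you can group the jobs view by hypertable -- a quick diagnostic sketch against the standard timescaledb_information.jobs view:

```sql
-- Count scheduled jobs per hypertable;
-- internal maintenance jobs show a NULL hypertable_name
SELECT
    coalesce(hypertable_name, '(internal)') AS hypertable,
    count(*) AS job_count
FROM timescaledb_information.jobs
WHERE scheduled = true
GROUP BY hypertable_name
ORDER BY job_count DESC;
```

Hypertables with three jobs each are normal; the point is to track the total as it grows.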
A default max_background_workers of 16 was never designed for that load.
When a job cannot acquire a worker, it does not raise an error. It simply waits in the scheduler's queue. If no worker frees up before the next scheduled run, that invocation stacks behind the one still waiting. Over hours, a backlog forms and compounds with every cycle.
The worst case occurs when a job's execution time exceeds its schedule interval. A compression job that takes 15 minutes but is scheduled every 10 minutes permanently occupies a worker slot and can never catch up. Each cycle adds another queued invocation, starving other job types.
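One way to spot this failure mode directly is to compare each job's last run duration against its schedule interval -- a diagnostic sketch using the standard timescaledb_information views:

```sql
-- Jobs whose last run took longer than their schedule interval
-- can never catch up and will pin a worker slot
SELECT
    j.job_id,
    j.proc_name,
    j.schedule_interval,
    js.last_run_duration
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats js ON j.job_id = js.job_id
WHERE js.last_run_duration > j.schedule_interval
ORDER BY js.last_run_duration DESC;
```

Any row returned here is a candidate for a longer schedule interval, smaller chunks, or a larger worker pool.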
Diagnosing the Problem
Start by comparing your total job count to your worker limit:
WITH worker_config AS (
SELECT current_setting('timescaledb.max_background_workers')::int AS max_workers
),
active_jobs AS (
SELECT count(*) AS total_scheduled_jobs
FROM timescaledb_information.jobs
WHERE scheduled = true
)
SELECT
wc.max_workers,
aj.total_scheduled_jobs,
CASE
WHEN aj.total_scheduled_jobs <= wc.max_workers THEN 'OK'
WHEN aj.total_scheduled_jobs <= wc.max_workers * 1.5 THEN 'WARNING'
ELSE 'CRITICAL -- worker exhaustion likely'
END AS worker_status
FROM worker_config wc, active_jobs aj;
Then look for jobs that have never run or are actively failing:
SELECT
j.job_id,
j.proc_name,
j.hypertable_name,
js.total_failures,
js.consecutive_failures,
js.last_run_status,
js.last_run_duration,
CASE
WHEN js.total_failures > 0 THEN 'FAILING'
WHEN js.total_runs = 0 THEN 'NEVER RUN -- likely queued'
ELSE 'OK'
END AS health_status
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats js ON j.job_id = js.job_id
ORDER BY js.total_failures DESC, j.proc_name;
Key warning signs:
- total_runs = 0: The job was registered but never acquired a worker. Pure queue starvation.
- Rising consecutive_failures: The worker was acquired but the job failed -- often from lock contention or OOM during compression.
- last_run_duration exceeding schedule_interval: A job that can never finish before its next invocation permanently blocks a worker slot.
Right-Sizing Your Worker Pool
The formula is simple:
total_policies + 2 (internal jobs) = minimum max_background_workers
You can compute the exact recommendation with:
WITH policy_count AS (
SELECT count(*) AS total_jobs
FROM timescaledb_information.jobs
WHERE scheduled = true
)
SELECT
total_jobs,
total_jobs + 2 AS recommended_workers,
'ALTER SYSTEM SET timescaledb.max_background_workers = '
|| (total_jobs + 2) AS sql_to_run
FROM policy_count;
Apply the change:
ALTER SYSTEM SET timescaledb.max_background_workers = 28;
ALTER SYSTEM SET max_worker_processes = 32;
-- Requires a full PostgreSQL restart -- pg_reload_conf() is NOT sufficient
Over-provisioning is cheap. Each idle background worker consumes roughly 5-10 MB of memory and effectively no CPU. Setting max_background_workers to 32 or 64 on a server running 20 jobs carries no measurable performance penalty. Under-provisioning, on the other hand, silently breaks your entire automation pipeline.
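After the restart, it is worth confirming that both values actually took effect:

```sql
-- Verify the new limits; ALTER SYSTEM changes to these settings
-- apply only after a full postmaster restart
SHOW max_worker_processes;
SHOW timescaledb.max_background_workers;
```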
A Note on PostgreSQL 18 + TimescaleDB 2.24
If you are running PostgreSQL 18 with TimescaleDB 2.24, be aware that custom functions registered via add_job() fail with "cache lookup failed for function" errors. Background workers on this version combination cannot resolve public schema functions. The workaround is to use system cron for any custom scheduled tasks, while letting TimescaleDB handle its built-in policies (compression, retention, aggregate refresh) normally.
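If you hit this combination, the cron workaround might look like the following sketch. The database name app_db and the function my_maintenance_task() are placeholders, not names from this article -- substitute your own:

```
# Hypothetical crontab entry: run a custom maintenance function every
# 10 minutes via psql instead of add_job(). Adjust connection details,
# database name, and function name for your environment.
*/10 * * * * psql -d app_db -c 'SELECT public.my_maintenance_task();' >> /var/log/timescale_cron.log 2>&1
```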
Prevention Checklist
- Count jobs after every new hypertable. Each hypertable with full policies adds 3 jobs. Update worker settings proactively.
- Monitor total_failures and consecutive_failures regularly. Query timescaledb_information.job_stats weekly or set up automated monitoring.
- Verify job duration stays below schedule interval. If last_run_duration approaches schedule_interval, increase the interval, reduce chunk size, or add workers.
- Set max_worker_processes higher than max_background_workers. Leave room for parallel queries and logical replication.
- Remember: both settings require a full PostgreSQL restart. Plan changes during maintenance windows.
- Treat worker sizing as part of hypertable setup. Add a policy, add a worker. Never treat it as an afterthought.
Background worker exhaustion is entirely preventable. The fix takes one SQL statement and a restart. The hard part is knowing to look for it before your automation silently stops working.