Dean Hamstead

Posted on Jun 9

Using TimescaleDB and Collectd for observability

#postgres #timescaledb #monitoring #devops

I have used many timescale databases over the years and have found most to be wanting. Often over complicated and under performant.

PostgreSQL with TimescaleDB brings SQL-powered analytics, flexible retention, and automatic compression to time-series data. Collectd is a mature metrics daemon shipped with every distribution, making it the easiest way to gather system metrics. Grafana is the de-facto standard for dashboards and charting, with native PostgreSQL support out of the box. Together they form a minimal, composable observability stack.

This post walks through the schema, configuration, and rationale for a production setup I run on my home network. What I love about this setup is how minimal it is — collectd and PostgreSQL take mere slivers of system resources. Comments on how it could be improved or scaled are very welcomed!

Why TimescaleDB?

SQL queries -- Standard PostgreSQL. Join, aggregate, window functions, CTEs -- everything works.
Dynamic retention -- A single add_retention_policy() call. Change it any time without rebuilding anything.
Automatic compression -- Columnar compression routinely achieves 10-20x ratios on metrics data.
Single database -- All metrics from all hosts in one place. Back it up with pg_dump or stream replicate it.
Grafana native -- Grafana's PostgreSQL/TimescaleDB data source works out of the box. No plugins needed.

Prerequisites

PostgreSQL 14+ with TimescaleDB and the TimescaleDB Toolkit extensions installed
collectd 5.x with the postgresql plugin (included in most distribution packages)
A dedicated collectd PostgreSQL user and database

Create the database and user before applying the schema:

CREATE USER collectd WITH PASSWORD 'a_secure_password';
CREATE DATABASE collectd OWNER collectd;
\c collectd

The Schema

Identifiers table

collectd identifies every metric with a five-part tuple: host / plugin-plugin_instance / type-type_instance. Rather than storing these strings on every row, we normalise them into an identifiers table:

CREATE TABLE identifiers (
    id integer NOT NULL PRIMARY KEY,
    host character varying(126) NOT NULL,
    plugin character varying(126) NOT NULL,
    plugin_inst character varying(126) DEFAULT NULL,
    type character varying(64) NOT NULL,
    type_inst character varying(64) DEFAULT NULL,

    UNIQUE NULLS NOT DISTINCT (host, plugin, plugin_inst, type, type_inst)
);

Only one index is needed on this small lookup table. The unique constraint index handles collectd_insert upserts, and plugin_type_host covers the most common Grafana query patterns:

CREATE INDEX identifiers_plugin_type_host ON identifiers USING btree (plugin, type, host);

Values hypertable

The actual time-series data lands in a values table, which is converted into a TimescaleDB hypertable:

CREATE TABLE "values" (
    id integer NOT NULL
        REFERENCES identifiers ON DELETE CASCADE,
    tstamp timestamp with time zone NOT NULL,
    name character varying(126) NOT NULL,
    value_type character varying(64) NOT NULL,
    value double precision NOT NULL,

    UNIQUE(tstamp, id, name)
);

SELECT create_hypertable('values', 'tstamp',
    chunk_time_interval => interval '6 hours');

A few things to note:

chunk_time_interval => '6 hours' -- This creates a new chunk every 6 hours. The default is 7 days, but with high-cardinality metrics from many hosts, smaller chunks keep individual chunk sizes manageable and improve compression ratios.
No CHECK constraint on value_type -- TimescaleDB cannot reliably propagate CHECK constraints that reference function calls across chunk tables. Instead, a trigger on the hypertable validates it against the value_types table.
btree indexes on values -- The unique constraint provides an index on (tstamp, id, name) for insert efficiency, with additional indexes on (id, tstamp) and (tstamp) for common query patterns.

The value_type column is validated with a separate lookup table and trigger:

CREATE TABLE value_types (
    name character varying(64) NOT NULL PRIMARY KEY
);

INSERT INTO value_types (name) VALUES
    ('absolute'), ('counter'), ('derive'), ('distribution'), ('gauge');

CREATE OR REPLACE FUNCTION check_value_type()
RETURNS trigger LANGUAGE plpgsql SECURITY INVOKER AS $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM public.value_types WHERE name = NEW.value_type) THEN
        RAISE EXCEPTION 'invalid value_type: %. Must be one of: %',
            NEW.value_type,
            (SELECT string_agg(name, ', ' ORDER BY name) FROM public.value_types);
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER values_value_type_check
    BEFORE INSERT OR UPDATE OF value_type ON "values"
    FOR EACH ROW EXECUTE FUNCTION check_value_type();

Compression and Retention Policies

TimescaleDB's killer features for metrics storage are automatic compression and retention:

SELECT add_retention_policy('values', INTERVAL '31 days');

ALTER TABLE "values" SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'id,name',
    timescaledb.compress_orderby = 'tstamp DESC'
);

SELECT add_compression_policy('values', INTERVAL '1 day');

How this works

Retention: Chunks older than 31 days are automatically dropped. No cron jobs, no manual cleanup. Adjust the interval to taste -- INTERVAL '90 days' or INTERVAL '1 year' are common choices.
Compression: Chunks older than 1 day are compressed. TimescaleDB uses columnar compression with delta-of-delta encoding for timestamps and gorilla encoding for floating point values -- the same algorithms used by Facebook's Gorilla time-series database. On typical collectd data, expect 10-20x compression.
Segment by id, name: Compression groups rows by metric identifier and value name before compressing. This means queries filtered by a specific metric decompress only the relevant segments.
Order by tstamp DESC: Within each segment, rows are stored newest-first. This optimises the common case of querying recent data.

You can check compression status at any time:

SELECT
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes) AS after,
    pg_size_pretty(before_compression_total_bytes - after_compression_total_bytes) AS saved
FROM hypertable_compression_stats('values');

Continuous Aggregates

Three continuous aggregates pre-compute bucketed rollups for fast dashboards. Showing metrics_5min (hourly and daily differ only by bucket size and divisor):

CREATE MATERIALIZED VIEW metrics_5min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', v.tstamp) AS bucket,
    v.id,
    i.host, i.plugin, i.plugin_inst, i.type, i.type_inst,
    v.name AS metric_name,
    v.value_type,
    avg(v.value)  FILTER (WHERE v.value_type = 'gauge') AS avg_value,
    min(v.value)  FILTER (WHERE v.value_type = 'gauge') AS min_value,
    max(v.value)  FILTER (WHERE v.value_type = 'gauge') AS max_value,
    (max(v.value) FILTER (WHERE v.value_type = 'derive')
   - min(v.value) FILTER (WHERE v.value_type = 'derive')) / 300.0 AS rate_per_second,
    count(*) AS sample_count
FROM "values" v
JOIN identifiers i ON v.id = i.id
GROUP BY bucket, v.id, i.host, i.plugin, i.plugin_inst, i.type, i.type_inst, v.name, v.value_type;

SELECT add_continuous_aggregate_policy('metrics_5min',
    start_offset      => INTERVAL '15 minutes',
    end_offset        => INTERVAL '0 seconds',
    schedule_interval => INTERVAL '2 minutes');
ALTER MATERIALIZED VIEW metrics_5min SET (timescaledb.materialized_only = false);

With materialized_only = false, end_offset is irrelevant — queries see data within seconds.

Policies for all three aggregates:

SELECT add_continuous_aggregate_policy('metrics_hourly',
    start_offset      => INTERVAL '30 minutes',
    end_offset        => INTERVAL '0 seconds',
    schedule_interval => INTERVAL '10 minutes');
ALTER MATERIALIZED VIEW metrics_hourly SET (timescaledb.materialized_only = false);

SELECT add_continuous_aggregate_policy('metrics_daily',
    start_offset      => INTERVAL '7 days',
    end_offset        => INTERVAL '0 seconds',
    schedule_interval => INTERVAL '1 day');
ALTER MATERIALIZED VIEW metrics_daily SET (timescaledb.materialized_only = false);

SELECT add_retention_policy('metrics_5min', INTERVAL '30 days');
SELECT add_retention_policy('metrics_hourly', INTERVAL '90 days');
SELECT add_retention_policy('metrics_daily', INTERVAL '2 years');

With identifier columns materialized into each view, queries don't need a separate join:

SELECT host, plugin, type, bucket, avg_value, sample_count
FROM metrics_5min
WHERE plugin = 'cpu'
  AND type = 'cpu'
  AND bucket > now() - interval '6 hours'
ORDER BY bucket;

The collectd View and Insert Function

Convenience view

The collectd view joins identifiers back onto values and reconstructs the familiar collectd identifier string (host/plugin-instance/type-instance):

CREATE OR REPLACE VIEW collectd AS
SELECT
    host, plugin, plugin_inst, type, type_inst,
    host
        || '/' || plugin
        || CASE WHEN plugin_inst IS NOT NULL THEN '-' ELSE '' END
        || coalesce(plugin_inst, '')
        || '/' || type
        || CASE WHEN type_inst IS NOT NULL THEN '-' ELSE '' END
        || coalesce(type_inst, '') AS identifier,
    tstamp, name, value_type, value
FROM identifiers
JOIN values ON values.id = identifiers.id;

This makes ad-hoc queries more natural -- you can query by the same identifier strings you see in collectd's logs.

Insert function

collectd's postgresql plugin calls a stored function to insert data. The collectd_insert function handles the upsert into identifiers and fans out the array of values:

CREATE OR REPLACE FUNCTION collectd_insert(
        timestamp with time zone, character varying,
        character varying, character varying,
        character varying, character varying,
        character varying[], character varying[], double precision[]
    ) RETURNS void
    LANGUAGE plpgsql
    AS $_$
DECLARE
    p_time alias for $1;
    p_host alias for $2;
    p_plugin alias for $3;
    p_plugin_instance alias for $4;
    p_type alias for $5;
    p_type_instance alias for $6;
    p_value_names alias for $7;
    p_type_names alias for $8;
    p_values alias for $9;
    ds_id integer;
    i integer;
BEGIN
    SELECT id INTO ds_id
        FROM identifiers
        WHERE host = p_host
            AND plugin = p_plugin
            AND plugin_inst IS NOT DISTINCT FROM p_plugin_instance
            AND type = p_type
            AND type_inst IS NOT DISTINCT FROM p_type_instance;
    IF NOT FOUND THEN
        INSERT INTO identifiers (host, plugin, plugin_inst, type, type_inst)
            VALUES (p_host, p_plugin, p_plugin_instance, p_type, p_type_instance)
            RETURNING id INTO ds_id;
    END IF;
    i := 1;
    LOOP
        EXIT WHEN i > COALESCE(array_upper(p_value_names, 1), 0);
        INSERT INTO values (id, tstamp, name, value_type, value)
            VALUES (ds_id, p_time,
                    trim(both '''' from p_value_names[i]),
                    trim(both '''' from p_type_names[i]),
                    p_values[i])
            ON CONFLICT (tstamp, id, name) DO NOTHING;
        i := i + 1;
    END LOOP;
END;
$_$;

The function:

Looks up (or creates) the identifier row for the incoming metric tuple using IS NOT DISTINCT FROM to match NULLS NOT DISTINCT semantics on the unique constraint
Iterates over the parallel arrays of value names, type names, and values
Inserts each value into the values hypertable with ON CONFLICT DO NOTHING to handle duplicate rows on collectd reconnect

The COALESCE(array_upper(...), 0) guard prevents errors if collectd sends an empty or NULL values array, and trim(both '''' from ...) handles collectd's habit of wrapping values in single quotes.

This is called directly by collectd's postgresql plugin -- no middleware needed.

collectd Server Configuration

The collectd instance acting as the central server needs two plugins: network (to receive metrics from clients) and postgresql (to write them to TimescaleDB).

FQDNLookup true
CollectInternalStats true
LoadPlugin "network"

TypesDB "/usr/share/collectd/types.db" "/etc/collectd/bf_types.db"

# Receive metrics from all clients
<Plugin "network">
  Listen "0.0.0.0"
  ReportStats true
  MaxPacketSize 65535
</Plugin>

# Write to TimescaleDB
LoadPlugin postgresql
<Plugin postgresql>
    <Writer sqlstore>
        Statement "SELECT collectd_insert($1, $2, $3, $4, $5, $6, $7, $8, $9);"
        StoreRates false
    </Writer>
    <Database collectd>
        Host "/run/postgresql"
        Port 5432
        User collectd
        Password "your_password_here"
        Writer sqlstore
        CommitInterval 5
    </Database>
</Plugin>

Key configuration points:

Host "/run/postgresql" -- Connects via Unix domain socket for lower latency. Use a hostname/IP for remote PostgreSQL instances.
CommitInterval 5 -- Batches inserts into 5-second transaction windows. This dramatically reduces write overhead compared to committing every individual metric.
StoreRates false -- Keeps the original counter/derive types rather than converting to gauges. This preserves the original collectd type annotation for downstream consumers. Set to true if you want pre-computed rates.
MaxPacketSize 65535 -- Allows jumbo collectd network packets. Make sure your network MTU supports this if sending across network boundaries.
Custom types.db -- The bf_types.db file adds any custom data types your environment needs. Here it defines a stations gauge type for tracking wireless access point client counts.

Client configuration

On each monitored host, collectd just needs to send to the central server:

<Plugin "network">
  Server "monitor01.example.com"
</Plugin>

No database plugin needed on clients -- they send metrics over the collectd binary protocol and the server handles storage.

Example Queries

Once data is flowing, you have the full power of SQL. Here are some practical examples.

Current CPU usage by host (last 5 minutes)

SELECT host, type_inst AS cpu_state,
       avg(avg_value) AS avg_pct
FROM metrics_5min
WHERE plugin = 'cpu'
  AND bucket > now() - interval '5 minutes'
GROUP BY host, type_inst
ORDER BY host, avg_pct DESC;

Memory usage over the last 24 hours (hourly buckets)

SELECT host,
       bucket,
       metric_name,
       avg_value AS avg_bytes
FROM metrics_hourly
WHERE plugin = 'memory'
  AND bucket > now() - interval '24 hours'
ORDER BY host, bucket;

Disk I/O rate for a specific host

SELECT time_bucket('10 minutes', bucket) AS bucket,
       type_inst AS direction,
       avg(avg_value) AS avg_bytes_per_sec
FROM metrics_5min
WHERE host = 'pve2.example.com'
  AND plugin = 'disk'
  AND type = 'disk_octets'
  AND bucket > now() - interval '6 hours'
GROUP BY bucket, direction
ORDER BY bucket;

Top 10 hosts by network traffic in the last hour

SELECT host,
       sum(avg_value) AS total_octets
FROM metrics_5min
WHERE plugin = 'interface'
  AND type = 'if_octets'
  AND bucket > now() - interval '1 hour'
GROUP BY host
ORDER BY total_octets DESC
LIMIT 10;

Using TimescaleDB Toolkit functions

TimescaleDB Toolkit adds statistical aggregates. For example, computing the 95th percentile of CPU usage:

SELECT host,
       approx_percentile(0.95, percentile_agg(avg_value)) AS p95_cpu
FROM metrics_5min
WHERE plugin = 'cpu'
  AND type_inst = 'idle'
  AND bucket > now() - interval '1 hour'
GROUP BY host
ORDER BY p95_cpu ASC;

Grafana Integration

Grafana has native support for PostgreSQL and TimescaleDB as data sources. I use a separate read-only database user with tight timeouts to prevent any single query from consuming server resources:

CREATE ROLE grafana_ro WITH LOGIN PASSWORD 'a_readonly_password';
ALTER ROLE grafana_ro SET statement_timeout = '10s';
ALTER ROLE grafana_ro SET lock_timeout = '5s';
ALTER ROLE grafana_ro SET idle_in_transaction_session_timeout = '30s';
GRANT CONNECT ON DATABASE collectd TO grafana_ro;
GRANT USAGE ON SCHEMA public TO grafana_ro;
GRANT SELECT ON metrics_5min, metrics_hourly, metrics_daily TO grafana_ro;
GRANT SELECT ON collectd, identifiers, "values", metadata, value_types TO grafana_ro;

To connect:

In Grafana, go to Configuration > Data Sources > Add data source
Select PostgreSQL
Enter your connection details (host, port, database: collectd, user: grafana_ro)
Enable TimescaleDB under the PostgreSQL settings -- this enables time_bucket() in the query builder

Example Grafana query

In a Grafana panel using the query editor, a typical CPU usage graph query uses the metrics_5min continuous aggregate for performance:

SELECT
  time_bucket('$__interval', bucket) AS time,
  host,
  avg(avg_value) AS usage
FROM metrics_5min
WHERE plugin = 'cpu'
  AND type_inst != 'idle'
  AND value_type = 'gauge'
  AND $__timeFilter(bucket)
GROUP BY time, host
ORDER BY time

Grafana's $__interval macro automatically adjusts the bucket size based on the dashboard's time range, and $__timeFilter() handles the time range filter. Querying the continuous aggregate rather than the raw collectd view makes queries significantly faster -- pre-aggregated data vs millions of raw rows.

Template variables

You can create Grafana template variables to build dynamic dashboards:

-- Hosts dropdown
SELECT DISTINCT host FROM identifiers ORDER BY host;

-- Plugins for a selected host
SELECT DISTINCT plugin FROM identifiers WHERE host = '$host' ORDER BY plugin;

This lets you build a single dashboard that works across all your monitored hosts.

Derive/counter queries

For counters, use the pre-computed rate_per_second column:

SELECT
  time_bucket('$__interval', bucket) AS time,
  metric_name AS metric,
  avg(rate_per_second * 300) AS "value"
FROM metrics_5min
WHERE
  plugin = 'disk' AND type = 'disk_octets'
  AND host = '$hostname' AND plugin_inst = '$disk'
  AND value_type = 'derive'
  AND $__timeFilter(bucket)
GROUP BY time, metric_name
ORDER BY time

The rate_per_second is computed as (max-min)/300 within each 5-minute bucket. Multiply by 300 to get the per-bucket total.

Metadata Table

The schema also includes a metadata table for annotating hosts with their system type:

CREATE TYPE system_types AS ENUM (
    'linux', 'switch', 'freebsd', 'ghostbsd', 'lxc', 'openwrt', 'sensor', 'coreelec'
);

CREATE TABLE metadata (
    host character varying(126) NOT NULL,
    system_type system_types NOT NULL,
    UNIQUE(host, system_type)
);

This is useful for building Grafana dashboards that filter by system type -- show all Linux hosts, or all switches, or only LXC containers. Join it into your queries:

SELECT m.system_type, i.host, avg(m5.avg_value) AS avg_load
FROM metadata m
JOIN identifiers i ON i.host = m.host
JOIN metrics_5min m5 ON m5.id = i.id
WHERE i.plugin = 'load' AND i.type = 'load'
  AND m5.metric_name = '1min'
  AND m5.bucket > now() - interval '1 hour'
GROUP BY m.system_type, i.host
ORDER BY m.system_type, avg_load DESC;

Wrapping Up

This setup keeps all metrics in a single PostgreSQL database that compresses well, retains data as long as you want, and speaks SQL. collectd's built-in postgresql plugin makes the integration seamless -- no additional daemons or message queues required.

The combination of collectd for collection, TimescaleDB for storage, and Grafana for visualisation gives you a lightweight monitoring stack that punches well above its weight. If you are already running PostgreSQL in your environment, you likely already have most of the infrastructure in place.

Dean Hamstead is a Production Engineer / SRE and CPAN contributor. Find him on LinkedIn. He writes about Perl, infrastructure, and the occasional database.

DEV Community