I recently counted how many tools I use for PostgreSQL monitoring in any given week: Grafana for system metrics, psql for pg_stat_statements queries, pgAdmin for table inspection, a Python script for Slack alerts, and occasionally htop during incidents. Five tools, five context switches, and the real cost is not the tooling -- it is the 20 minutes at the start of every incident spent just gathering context across all of them.
The Problem
The typical PostgreSQL monitoring setup looks like this: Grafana for system metrics (CPU, memory, disk I/O), manual pg_stat_statements queries for slow query investigation, pgAdmin for ad-hoc table inspection, and a custom Python script posting alerts to Slack. Four tools, four logins, four mental models -- and none of them talk to each other.
Grafana shows CPU spiked at 2 PM. pg_stat_statements shows a query that started consuming 10x more time around 2 PM. pgAdmin shows the table that query reads is 40% bloated. The Slack alert fired because response times exceeded the threshold. Four pieces of the same puzzle, scattered across four tools. Correlating them requires you to mentally hold timestamps, table names, and query fingerprints while switching between browser tabs and terminal sessions.
This fragmentation has a cost beyond inconvenience. When an incident happens, the first 20 minutes are spent gathering context: which queries are slow, what changed, which tables are affected, is replication lagging, are connections exhausted. In a setup with 4+ tools, those 20 minutes are spent logging into different systems and cross-referencing data manually.
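Even a single item on that list, "are connections exhausted," is its own query against its own view. A minimal sketch using the standard pg_stat_activity catalog view:

```sql
-- Connections in use vs. the server's max_connections limit
SELECT count(*) AS in_use,
       current_setting('max_connections')::int AS max_allowed,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```

Multiply that by every question on the list and the 20 minutes adds up quickly.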
The deeper problem is that fragmented monitoring discourages proactive work. Nobody opens four tools to do a routine health check. The tools only come out during incidents -- which means problems are discovered reactively instead of caught early.
What a Complete Health Check Actually Looks Like
Each individual tool gives a partial view. Here is what a complete health check requires across separate tools:
System metrics (Grafana or htop):
```bash
# CPU, memory, disk I/O, disk space
top -bn1 | head -5
df -h /var/lib/postgresql
iostat -x 1 3
```
Query performance (pg_stat_statements):
```sql
SELECT queryid, calls, mean_exec_time, total_exec_time,
       rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
```
Table health (pg_stat_user_tables):
```sql
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum,
       seq_scan, idx_scan,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
```
Replication status (pg_stat_replication):
```sql
SELECT application_name, state, sent_lsn, write_lsn,
       flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```
Four checks, four different result sets, four mental models. You can do this. You have probably done it hundreds of times. The question is whether this is a good use of your time during an incident when every minute matters.
Why Unified Monitoring Matters
The value of consolidating PostgreSQL monitoring is not just convenience -- it is the correlations that become possible when everything shares the same timeline:
- Query + table correlation: when a query slows down, see whether the table it reads has become bloated, whether vacuum is behind, or whether an index was dropped
- Time series, not snapshots: transaction rate is not just "450 TPS right now" -- it is a chart showing the last 24 hours so you can see the spike at 2 PM and determine whether it was a one-time event or a sustained change
- Fleet-wide scanning: with multiple PostgreSQL instances, a health score per instance lets you scan the entire fleet in seconds instead of checking each one individually
- Automated health checks: 60+ checks across query performance, table health, replication, vacuum, indexes, and extensions -- running continuously instead of when you remember to check
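The "time series, not snapshots" point is worth making concrete: most pg_stat_* counters are cumulative since the last stats reset, so a single read tells you nothing about rate. A collector derives TPS by sampling the counters twice and dividing the delta by the elapsed time:

```sql
-- Cumulative counters: TPS = (sample2 - sample1) / elapsed seconds
SELECT datname, xact_commit, xact_rollback, stats_reset
FROM pg_stat_database
WHERE datname = current_database();
```

Reading this once at 2 PM shows a large number; reading it every 15 seconds shows the spike.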
The goal is not to replace your existing tools. It is to make proactive health checks so low-friction that they happen daily instead of only during incidents. Problems caught early are configuration changes. Problems caught late are emergency recoveries.
Getting Started
A lightweight collector (~15 MB Go binary, systemd service) gathers metrics in three tiers:
- Fast (every 15 seconds): sessions, locks, replication lag
- Medium (every 60 seconds): table stats, index usage, query performance
- Slow (every 5 minutes): bloat estimates, extension health, configuration changes
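As an illustration of the fast tier, a lock check can be as simple as listing sessions currently waiting on a lock -- a sketch using pg_stat_activity's wait-event columns (available since PostgreSQL 9.6), not the collector's actual query:

```sql
-- Sessions currently blocked waiting for a lock
SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```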
```bash
# Download and install
curl -L https://mydba.dev/download/linux-amd64 -o /usr/local/bin/mydba-collector
chmod +x /usr/local/bin/mydba-collector
```
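To run the binary as the systemd service mentioned above, a minimal unit file might look like the following. This is a sketch, not documentation: the service user, dependencies, and the assumption that the collector needs no flags are all guesses to adapt to your setup.

```ini
# /etc/systemd/system/mydba-collector.service -- hypothetical unit file
[Unit]
Description=mydba metrics collector
After=network-online.target postgresql.service

[Service]
User=postgres
ExecStart=/usr/local/bin/mydba-collector
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `systemctl daemon-reload && systemctl enable --now mydba-collector` starts it and keeps it running across reboots.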
The collector uses under 1% CPU and approximately 50 MB RAM. Metrics flow within 15 seconds of starting.
Alternatively, managed collectors handle everything for you -- provide connection details and metrics start flowing immediately. The managed option supports SSH tunnels, VPN peering, or direct access with IP allowlisting.
Pricing
The free tier includes every feature: 1 primary + 1 replica, 1 user, 7-day metric retention, all 60+ health checks, index advisor, EXPLAIN plan analysis, and all 12 developer tools. No feature gating -- the free tier is the full product with a shorter retention window.
Pro tier starts at 50 GBP/primary/month and 25 GBP/replica/month, with unlimited connections and users, 30-day retention, and volume discounts for fleets of 10+ instances.
Start at mydba.dev. Five minutes to first metrics.