Every DBA with a decade of experience carries a mental checklist of things to verify when something goes wrong with PostgreSQL. Check pg_stat_activity. Look at pg_stat_user_tables. Query pg_stat_bgwriter. Examine replication slots. Verify autovacuum settings. Inspect connection pool behavior. The rest of us are searching Stack Overflow while the incident clock ticks.
The real challenge is not running individual diagnostic queries -- it is knowing which queries to run, interpreting the results correctly, and understanding how problems in one domain cascade into others.
The Interconnected Nature of PostgreSQL Problems
Many PostgreSQL performance issues form feedback loops. Bloated tables cause slow queries. Slow queries hold locks longer. Held locks block autovacuum. Blocked autovacuum causes more bloat. You need visibility across all domains simultaneously to see these causal chains.
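You can observe one link in that chain directly. On PostgreSQL 9.6 and later, `pg_blocking_pids()` reports which sessions hold the locks a given backend is waiting on; a sketch that joins it against `pg_stat_activity` (an autovacuum worker appearing as a blocked session is the feedback loop in action):

```sql
-- Show blocked sessions alongside the sessions blocking them
SELECT blocked.pid                     AS blocked_pid,
       blocked.query                   AS blocked_query,
       now() - blocked.query_start     AS blocked_for,
       blocking.pid                    AS blocking_pid,
       blocking.query                  AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```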
Building Your Own Health Check Suite
You can start with diagnostic queries across key domains:
```sql
-- Configuration: check shared_buffers sizing
SELECT name, setting, unit,
       pg_size_pretty(setting::bigint * 8192) AS effective_size
FROM pg_settings
WHERE name = 'shared_buffers';

-- Vacuum health: tables with high dead tuple ratios
SELECT schemaname, relname,
       n_dead_tup, n_live_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC LIMIT 10;

-- Index usage: find unused indexes wasting space
-- (pg_stat_user_indexes already excludes system catalogs)
SELECT schemaname, relname, indexrelname,
       idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC LIMIT 10;

-- Replication: check lag on replicas (run on the primary)
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```
This covers 4 of the 12 domains you should be checking. For comprehensive coverage, you also need queries for connection saturation, XID age, checkpoint frequency, lock contention, wait events, security configuration, WAL generation rate, storage growth, and extension-specific checks.
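Two of those missing domains take only a few lines each. A sketch for connection saturation and XID age (the 2-billion figure approximates the wraparound limit; treat the percentages as rough guides):

```sql
-- Connection saturation: how close are we to max_connections?
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;

-- XID age: databases drifting towards transaction ID wraparound
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 1) AS pct_towards_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```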
Making Health Checks Actionable
The difference between a useful health check and a noisy one is actionability. Each finding should include three things:
- What is wrong -- "shared_buffers is set to 128MB on a server with 32GB RAM"
- Why it matters -- "Only 0.4% of RAM is allocated to shared buffers, causing excessive disk I/O"
- How to fix it -- A copy-pasteable SQL command
Example: Configuration Fix
```sql
-- Finding: shared_buffers = 128MB on a 32GB server (0.4% of RAM, should be ~25%)
ALTER SYSTEM SET shared_buffers = '8GB';
-- shared_buffers can only change at server start: pg_reload_conf() will NOT
-- apply it. Schedule a restart for the new value to take effect.
```
The target value should be calculated from your actual server memory, not a generic recommendation.
Example: Vacuum Fix
```sql
-- Finding: autovacuum_enabled = false on orders table (dead tuples: 2.4M)
ALTER TABLE orders RESET (autovacuum_enabled);
-- RESET removes the per-table override, so the global autovacuum
-- settings apply to this table again
```
Someone disabled autovacuum during a bulk load and forgot to re-enable it.
Example: Index Cleanup
```sql
-- Finding: idx_users_email and idx_users_email_unique both index (email)
-- The unique constraint index already provides the same lookup capability
DROP INDEX CONCURRENTLY idx_users_email;  -- cannot run inside a transaction block
-- Estimated space savings: 847 MB
```
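Pairs like that can be detected automatically by comparing index definitions in `pg_index`. A sketch that flags exact key-column duplicates on the same table (it will not catch partial overlaps or expression indexes):

```sql
-- Find pairs of indexes on the same table with identical key columns
SELECT a.indrelid::regclass   AS table_name,
       a.indexrelid::regclass AS index_1,
       b.indexrelid::regclass AS index_2
FROM pg_index a
JOIN pg_index b
  ON a.indrelid = b.indrelid
 AND a.indkey = b.indkey
 AND a.indexrelid < b.indexrelid;  -- avoid listing each pair twice
```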
Prioritizing Findings
Not every finding has the same urgency. Organize them into priority levels:
- P1 (Critical) -- Fix now. Disabled autovacuum on high-churn tables, XID wraparound approaching, replication slots consuming disk.
- P2 (High) -- Fix this week. Misconfigured shared_buffers, missing indexes on large tables, connection pool at capacity.
- P3 (Medium) -- Fix when convenient. Suboptimal checkpoint settings, unused extensions loaded, minor security gaps.
- P4 (Informational) -- Awareness only. Default settings that could be tuned, minor redundancies.
Making It a Habit
Run health checks after every major deployment. Application changes introduce new query patterns that can invalidate existing index strategies, change autovacuum pressure, or shift connection usage. A deployment that adds a new background job running heavy aggregations might degrade your database without triggering any alerts until users complain.
Track domain scores over time and set thresholds for investigation. If your Vacuum score drops below 80, something changed. Build health check reviews into your team's operational cadence -- a weekly glance at the findings keeps small problems from compounding into incidents.
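One way to track scores over time is a small results table that each health check run appends to. The schema below is illustrative, not part of any tool:

```sql
-- Illustrative schema for logging per-domain scores on each run
CREATE TABLE IF NOT EXISTS health_check_scores (
    run_at   timestamptz NOT NULL DEFAULT now(),
    domain   text        NOT NULL,   -- e.g. 'vacuum', 'replication'
    score    integer     NOT NULL CHECK (score BETWEEN 0 AND 100),
    findings integer     NOT NULL DEFAULT 0
);

-- Domains whose most recent score fell below the investigation threshold
SELECT *
FROM (
    SELECT DISTINCT ON (domain) domain, run_at, score
    FROM health_check_scores
    ORDER BY domain, run_at DESC
) latest
WHERE score < 80;
```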
Full article with screenshots and more examples: 75 Automated Health Checks That Tell You Exactly What to Fix