丁久

Posted on May 16 • Originally published at dingjiu1989-hue.github.io

Database Monitoring and Performance Alerting

#sql #database

This article was originally published on AI Study Room.

Why Monitor Databases?

Database monitoring catches problems before they become incidents. Track key metrics and alert on anomalies.

Key Metrics

Query Performance

-- PostgreSQL slow queries
-- Requires the pg_stat_statements extension; mean_exec_time is in
-- milliseconds and is named mean_time before PostgreSQL 13.
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;

-- Active queries
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active';
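The pg_stat_statements query above sorts by mean_exec_time, which can hide cheap queries that run very often. As a minimal sketch, assuming the rows have already been fetched into dictionaries keyed by the same column names, the statements can be re-ranked by estimated total time. The function name `rank_by_total_time` and the sample rows are hypothetical.

```python
def rank_by_total_time(rows, limit=10):
    """Order statement rows by estimated total time (mean_exec_time * calls)."""
    ranked = sorted(
        rows,
        key=lambda r: r["mean_exec_time"] * r["calls"],
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical sample rows; mean_exec_time is in milliseconds.
sample = [
    {"query": "SELECT * FROM reports", "mean_exec_time": 900.0, "calls": 4},
    {"query": "SELECT * FROM users WHERE id = $1", "mean_exec_time": 2.5, "calls": 50000},
]
top = rank_by_total_time(sample)
```

In this sample the fast but frequent lookup ranks first, because its total time is far larger than that of the slow report query.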

Connection Pools

Monitor active vs idle connections. Alert when connection count exceeds 80% of max_connections.
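The 80% rule can be sketched as a small check. This assumes the per-state counts were gathered elsewhere, for example by grouping pg_stat_activity by state. The function names and the sample snapshot are hypothetical.

```python
def connection_usage(connections_by_state, max_connections):
    """Return (total, fraction_of_max) for a mapping of state -> count."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    total = sum(connections_by_state.values())
    return total, total / max_connections


def should_alert(connections_by_state, max_connections, threshold=0.80):
    """True when total connections exceed the given fraction of max_connections."""
    _, fraction = connection_usage(connections_by_state, max_connections)
    return fraction > threshold


# Hypothetical snapshot of connection counts by state.
snapshot = {"active": 35, "idle": 50, "idle in transaction": 2}
```

Keeping the per-state breakdown alongside the total helps tell a busy server (many active connections) from a leaking pool (many idle ones).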

Disk and Memory

Track cache hit ratio (aim for 99%+), disk usage, and IOPS. A low cache hit ratio usually means the working set does not fit in memory.
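For reference, the cache hit ratio is the share of block requests served from PostgreSQL's buffer cache. A minimal sketch, assuming the two counters come from the blks_hit and blks_read columns of pg_stat_database (the function name is hypothetical):

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from the PostgreSQL buffer cache.

    Returns None when there has been no block activity yet.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total
```

These counters are cumulative, so a ratio computed from the difference between two samples reflects recent behavior better than the lifetime totals. A block counted in blks_read may still have been served from the operating system's page cache.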

Replication Lag

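-- Run on the primary. pg_last_xact_replay_timestamp() is a standby-side
-- function and returns NULL on a primary, so use the replay_lag column
-- (PostgreSQL 10+) for time-based lag.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       replay_lag AS lag_time
FROM pg_stat_replication;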

Prometheus Setup

# prometheus.yml
scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']

Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| Cache hit ratio | < 97% | < 95% |
| Connections | > 80% | > 90% |
| Replication lag | > 30s | > 300s |
| Disk usage | > 80% | > 90% |
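The thresholds above can be encoded as data and applied to incoming samples. The following is a sketch rather than a full alerting pipeline. The metric keys and the `classify` function are hypothetical names; the numbers are copied from the table.

```python
# Thresholds copied from the table above: (warning, critical, direction).
# "below" means lower values are worse; "above" means higher values are worse.
THRESHOLDS = {
    "cache_hit_ratio_pct": (97.0, 95.0, "below"),
    "connections_pct": (80.0, 90.0, "above"),
    "replication_lag_s": (30.0, 300.0, "above"),
    "disk_usage_pct": (80.0, 90.0, "above"),
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a single metric sample."""
    warning, critical, direction = THRESHOLDS[metric]
    if direction == "below":
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    else:
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    return "ok"
```

For example, `classify("replication_lag_s", 45)` falls in the warning band, while a cache hit ratio of 94% is critical. Keeping the thresholds in one table makes them easy to review and tune per environment.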

Conclusion

Track QPS, latency, connections, cache hit ratio, and replication lag. Use Prometheus and Grafana for collection and visualization. Set meaningful alert thresholds and avoid alert fatigue.
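One common way to reduce alert fatigue is to fire only when a threshold stays breached across several consecutive samples, so a single spike does not page anyone. The sketch below illustrates that idea. The class name and the default of three samples are assumptions, not values from this article.

```python
class ConsecutiveBreachAlert:
    """Fire only after `required` consecutive breaching samples."""

    def __init__(self, required=3):
        if required < 1:
            raise ValueError("required must be at least 1")
        self.required = required
        self.streak = 0

    def observe(self, breached):
        """Record one sample; return True while the alert should be firing."""
        self.streak = self.streak + 1 if breached else 0
        return self.streak >= self.required


alert = ConsecutiveBreachAlert(required=3)
samples = [True, True, False, True, True, True, True]
results = [alert.observe(breached) for breached in samples]
```

Here the alert stays silent through the first two breaches and the recovery, and only starts firing at the sixth sample. Prometheus alerting rules offer a `for` clause that serves a similar purpose.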
