Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Database Monitoring Setup

Production-ready monitoring configurations for PostgreSQL, MySQL, and MongoDB — complete with Grafana dashboard JSON, Prometheus exporter configs, and alert rules that actually tell you why something is wrong, not just that it is. Deploy in under 30 minutes and stop getting woken up by false alarms.

Key Features

  • PostgreSQL monitoring via pg_stat_statements, pg_stat_user_tables, and custom diagnostic queries with pre-built Grafana panels
  • MySQL Performance Schema dashboards tracking query latency histograms, InnoDB buffer pool hit rates, and replication lag
  • MongoDB Atlas-compatible metrics collection plus self-hosted mongod monitoring via mongostat and mongotop wrappers
  • 18 Grafana dashboard panels organized into overview, query performance, replication health, and storage dashboards
  • Alert rules with runbooks — every alert includes a severity, description, and step-by-step resolution procedure
  • Prometheus exporters pre-configured: postgres_exporter, mysqld_exporter, and mongodb_exporter with secure credential handling
  • Baseline capture scripts to establish normal operating ranges before setting thresholds
  • Slack and PagerDuty notification templates for critical, warning, and informational alerts
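
The baseline capture idea can be sketched in a few lines of shell. This is a hypothetical illustration, not the kit's actual script: it queries Prometheus's HTTP API for the 95th percentile of a metric over 7 days and pads it by 20%, matching the thresholding approach described in Best Practices below. The metric name and Prometheus URL are assumptions.

```shell
# Hypothetical baseline capture step (the kit's real scripts may differ).
# Requires curl, jq, bc, and a reachable Prometheus.
PROM_URL="http://localhost:9090"                          # assumed local Prometheus
METRIC="pg_stat_statements_mean_exec_time_seconds"        # assumed metric name

# p95 of the metric over the last 7 days, via quantile_over_time
P95=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode "query=quantile_over_time(0.95, ${METRIC}[7d])" \
  | jq -r '.data.result[0].value[1]')

# Pad by 20% to get a starting alert threshold
echo "Suggested threshold: $(echo "${P95} * 1.2" | bc -l)"
```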

Quick Start

unzip database-monitoring-setup.zip
cd database-monitoring-setup/

# 1. Configure your database connection
cp config.example.yaml config.yaml
# Edit config.yaml with your database host, port, credentials

# 2. Deploy the PostgreSQL exporter
docker run -d --name pg_exporter \
  -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://monitor:YOUR_PASSWORD_HERE@db-host:5432/postgres?sslmode=require" \
  prometheuscommunity/postgres-exporter

# 3. Import the Grafana dashboard
# In Grafana UI: Dashboards > Import > Upload JSON
# Use: dashboards/main.json

# 4. Load alert rules into Prometheus
cp alerts/rules.yml /etc/prometheus/rules.d/
systemctl reload prometheus
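Before reloading Prometheus in step 4, it is worth validating the rule file. promtool ships with Prometheus and fails fast on YAML or expression errors:

```shell
# Validate alert rule syntax before loading
promtool check rules alerts/rules.yml
```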

Verify the exporter is scraping:

curl -s http://localhost:9187/metrics | grep pg_stat_user_tables_n_tup_ins
# pg_stat_user_tables_n_tup_ins{datname="myapp",relname="orders"} 284710

Architecture / How It Works

database-monitoring-setup/
├── configs/
│   ├── development.yaml          # Dev environment settings
│   ├── production.yaml           # Production tuned settings
│   ├── exporters/
│   │   ├── postgres_exporter.yml # Custom query definitions
│   │   ├── mysqld_exporter.cnf   # MySQL exporter config
│   │   └── mongodb_exporter.env  # MongoDB exporter environment
│   └── collectors/
│       ├── pg_diagnostic.sql     # Queries for pg_stat_*
│       ├── mysql_perf_schema.sql # Performance Schema queries
│       └── mongo_serverStatus.js # mongosh diagnostic script
├── dashboards/
│   └── main.json                 # Grafana dashboard panels
├── alerts/
│   ├── rules.yml                 # Prometheus alert rules
│   └── notification_templates/
│       ├── slack.tmpl
│       └── pagerduty.tmpl
└── config.example.yaml

Data flows: Database -> Exporter -> Prometheus -> Grafana dashboards + Alertmanager -> Slack/PagerDuty.
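
A minimal Prometheus config wiring all three exporters into that flow might look like the fragment below. Job names are illustrative; 9187, 9104, and 9216 are the default ports for postgres_exporter, mysqld_exporter, and mongodb_exporter respectively.

```yaml
# prometheus.yml (fragment) -- sketch, adjust targets to your hosts
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets: ['db-host:9187']   # postgres_exporter default port
  - job_name: mysql
    static_configs:
      - targets: ['db-host:9104']   # mysqld_exporter default port
  - job_name: mongodb
    static_configs:
      - targets: ['db-host:9216']   # mongodb_exporter default port

rule_files:
  - /etc/prometheus/rules.d/*.yml   # picks up alerts/rules.yml from the Quick Start

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```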

Usage Examples

PostgreSQL: custom exporter query for cache hit ratio:

# postgres_exporter custom queries
pg_cache_hit_ratio:
  query: |
    SELECT
      datname,
      ROUND(
        blks_hit::numeric / NULLIF(blks_hit + blks_read, 0) * 100, 2
      ) AS cache_hit_ratio
    FROM pg_stat_database
    WHERE datname NOT IN ('template0', 'template1')
  metrics:
    - datname:
        usage: "LABEL"
    - cache_hit_ratio:
        usage: "GAUGE"
        description: "Buffer cache hit ratio percentage"
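A note on naming, assuming stock postgres_exporter custom-query behavior: the exporter appends each column name to the query key, so the gauge above is exposed as pg_cache_hit_ratio_cache_hit_ratio. A Grafana panel or ad-hoc query could then be:

```
# PromQL: worst cache hit ratio per database
min by (datname) (pg_cache_hit_ratio_cache_hit_ratio)
```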

Prometheus alert rule for replication lag:

# alerts/rules.yml
groups:
  - name: postgresql_replication
    rules:
      - alert: PostgreSQLReplicationLagHigh
        expr: pg_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag exceeds 30s on {{ $labels.instance }}"
          runbook: |
            1. Check replica status: SELECT * FROM pg_stat_replication;
            2. Verify network between primary and replica
            3. Check long-running txns: SELECT * FROM pg_stat_activity
               WHERE state = 'active'
               AND age(now(), xact_start) > interval '5 min';
            4. If WAL replay is stuck, check the replica logs for errors

      - alert: PostgreSQLCacheHitRatioLow
        expr: pg_cache_hit_ratio_cache_hit_ratio < 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit ratio {{ $value }}% on {{ $labels.datname }}"
          runbook: |
            1. Check shared_buffers: SHOW shared_buffers;
            2. Compare to DB size: SELECT pg_size_pretty(pg_database_size(current_database()));
            3. Consider increasing shared_buffers (target: 25% of RAM)

MySQL Performance Schema diagnostic query:

-- Top 10 queries by total execution time
SELECT
    DIGEST_TEXT AS query_pattern,
    COUNT_STAR AS executions,
    ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_seconds,
    ROUND(AVG_TIMER_WAIT / 1e9, 2) AS avg_ms,
    SUM_ROWS_EXAMINED AS rows_examined,
    SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
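
For MongoDB, the kit ships a mongosh diagnostic script (configs/collectors/mongo_serverStatus.js). A minimal sketch of the same idea, assuming a mongosh session (it uses the db global, so it runs in mongosh, not plain Node):

```javascript
// mongosh sketch: pull a few health signals from serverStatus()
// Field names follow the serverStatus document; repl fields vary by version.
const s = db.serverStatus();
print("connections: " + s.connections.current + "/" + s.connections.available);
print("opcounters:  " + JSON.stringify(s.opcounters));
if (s.repl) {
  print("replica set: " + s.repl.setName);
}
```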

Configuration

# config.example.yaml
monitoring:
  engine: postgresql            # postgresql | mysql | mongodb
  host: db-host.example.com
  port: 5432
  monitor_user: monitor_ro      # dedicated read-only monitoring user
  monitor_password: YOUR_PASSWORD_HERE

prometheus:
  scrape_interval: 15s
  evaluation_interval: 15s
  retention: 30d

alerting:
  slack_webhook: https://hooks.slack.com/services/YOUR/WEBHOOK/HERE
  pagerduty_key: YOUR_PAGERDUTY_KEY_HERE
  alert_cooldown: 10m           # suppress repeat alerts within window

grafana:
  datasource_name: Prometheus
  refresh_interval: 30s
  default_time_range: 6h
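The slack_webhook, pagerduty_key, and alert_cooldown settings map naturally onto an Alertmanager config. A hedged sketch, with receiver names chosen for illustration:

```yaml
# alertmanager.yml (fragment) -- illustrative, not the kit's shipped config
route:
  receiver: slack-warnings
  repeat_interval: 10m              # mirrors alert_cooldown above
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical

receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/HERE
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: YOUR_PAGERDUTY_KEY_HERE
```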

Best Practices

  1. Create a dedicated monitoring user with SELECT-only permissions. Never monitor with the application user or superuser.
  2. Set alert thresholds from baselines, not guesses. Run baseline capture for 7 days, then set thresholds at p95 + 20%.
  3. Alert on symptoms, not causes. "Query latency p99 > 500ms" is actionable. "CPU > 80%" often is not.
  4. Include runbooks in every alert annotation. At 3 AM, the on-call engineer needs steps, not a dashboard link.
  5. Monitor the monitors. Set a dead-man's switch alert that fires if Prometheus stops receiving metrics for > 5 minutes.
  6. Rotate exporter credentials on the same schedule as application credentials.
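
Practice 5 is commonly implemented as an always-firing alert routed to a heartbeat receiver that pages you when the signal stops arriving. A minimal Prometheus rule sketch (the same pattern as kube-prometheus's Watchdog alert):

```yaml
# Dead-man's switch: this alert should ALWAYS be firing.
# Route it to a heartbeat endpoint (e.g. PagerDuty or Healthchecks)
# that alerts when the signal stops.
groups:
  - name: meta_monitoring
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat: absence of this alert means Prometheus or Alertmanager is down"
```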

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Exporter shows 0 for all metrics | Monitor user lacks pg_monitor role | GRANT pg_monitor TO monitor_ro; |
| Grafana dashboard shows "No Data" | Prometheus datasource misconfigured | Verify the datasource URL in Grafana > Configuration > Data Sources |
| Alerts firing immediately after deploy | Thresholds too tight for your workload | Run baseline capture for 7 days, then adjust |
| pg_stat_statements empty | Extension not loaded | CREATE EXTENSION pg_stat_statements; and add it to shared_preload_libraries |
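
Setting up the dedicated read-only user from Best Practice 1 typically looks like this on PostgreSQL 10+ (the role name matches config.example.yaml; the database name is an assumption):

```sql
-- Create a dedicated read-only monitoring role (PostgreSQL 10+)
CREATE ROLE monitor_ro WITH LOGIN PASSWORD 'YOUR_PASSWORD_HERE';

-- pg_monitor bundles pg_read_all_stats and pg_read_all_settings
GRANT pg_monitor TO monitor_ro;

-- Allow connecting to the target database ("myapp" is illustrative)
GRANT CONNECT ON DATABASE myapp TO monitor_ro;
```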

This is 1 of 9 resources in the Database Admin Pro toolkit. Get the complete Database Monitoring Setup kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Database Admin Pro bundle (9 products) for $109 — save 30%.

Get the Complete Bundle →
