Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Database Monitoring Setup

Production-ready monitoring configurations for PostgreSQL, MySQL, and MongoDB — complete with Grafana dashboard JSON, Prometheus exporter configs, and alert rules that actually tell you why something is wrong, not just that it is. Deploy in under 30 minutes and stop getting woken up by false alarms.

Key Features

  • PostgreSQL monitoring via pg_stat_statements, pg_stat_user_tables, and custom diagnostic queries with pre-built Grafana panels
  • MySQL Performance Schema dashboards tracking query latency histograms, InnoDB buffer pool hit rates, and replication lag
  • MongoDB Atlas-compatible metrics collection plus self-hosted mongod monitoring via mongostat and mongotop wrappers
  • 18 Grafana dashboard panels organized into overview, query performance, replication health, and storage dashboards
  • Alert rules with runbooks — every alert includes a severity, description, and step-by-step resolution procedure
  • Prometheus exporters pre-configured: postgres_exporter, mysqld_exporter, and mongodb_exporter with secure credential handling
  • Baseline capture scripts to establish normal operating ranges before setting thresholds
  • Slack and PagerDuty notification templates for critical, warning, and informational alerts
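
The baseline capture idea can be sketched in a few lines of shell. This is a hypothetical illustration, not the kit's actual script: it queries Prometheus's HTTP API for the 95th percentile of a metric over 7 days and pads it by 20%, matching the thresholding approach described in Best Practices below. The metric name and Prometheus URL are assumptions.

```shell
# Hypothetical baseline capture step (the kit's real scripts may differ).
# Requires curl, jq, bc, and a reachable Prometheus.
PROM_URL="http://localhost:9090"                          # assumed local Prometheus
METRIC="pg_stat_statements_mean_exec_time_seconds"        # assumed metric name

# p95 of the metric over the last 7 days, via quantile_over_time
P95=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode "query=quantile_over_time(0.95, ${METRIC}[7d])" \
  | jq -r '.data.result[0].value[1]')

# Pad by 20% to get a starting alert threshold
echo "Suggested threshold: $(echo "${P95} * 1.2" | bc -l)"
```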

Quick Start

unzip database-monitoring-setup.zip
cd database-monitoring-setup/

# 1. Configure your database connection
cp config.example.yaml config.yaml
# Edit config.yaml with your database host, port, credentials

# 2. Deploy the PostgreSQL exporter
docker run -d --name pg_exporter \
  -p 9187:9187 \
  -e DATA_SOURCE_NAME="postgresql://monitor:YOUR_PASSWORD_HERE@db-host:5432/postgres?sslmode=require" \
  prometheuscommunity/postgres-exporter

# 3. Import the Grafana dashboard
# In Grafana UI: Dashboards > Import > Upload JSON
# Use: dashboards/main.json

# 4. Load alert rules into Prometheus
cp alerts/rules.yml /etc/prometheus/rules.d/
systemctl reload prometheus
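Before reloading Prometheus in step 4, it is worth validating the rule file. promtool ships with Prometheus and fails fast on YAML or expression errors:

```shell
# Validate alert rule syntax before loading
promtool check rules alerts/rules.yml
```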

Verify the exporter is scraping:

curl -s http://localhost:9187/metrics | grep pg_stat_user_tables_n_tup_ins
# pg_stat_user_tables_n_tup_ins{datname="myapp",relname="orders"} 284710

Architecture / How It Works

database-monitoring-setup/
├── configs/
│   ├── development.yaml          # Dev environment settings
│   ├── production.yaml           # Production tuned settings
│   ├── exporters/
│   │   ├── postgres_exporter.yml # Custom query definitions
│   │   ├── mysqld_exporter.cnf   # MySQL exporter config
│   │   └── mongodb_exporter.env  # MongoDB exporter environment
│   └── collectors/
│       ├── pg_diagnostic.sql     # Queries for pg_stat_*
│       ├── mysql_perf_schema.sql # Performance Schema queries
│       └── mongo_serverStatus.js # mongosh diagnostic script
├── dashboards/
│   └── main.json                 # Grafana dashboard panels
├── alerts/
│   ├── rules.yml                 # Prometheus alert rules
│   └── notification_templates/
│       ├── slack.tmpl
│       └── pagerduty.tmpl
└── config.example.yaml

Data flows: Database -> Exporter -> Prometheus -> Grafana dashboards + Alertmanager -> Slack/PagerDuty.
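
A minimal Prometheus config wiring all three exporters into that flow might look like the fragment below. Job names are illustrative; 9187, 9104, and 9216 are the default ports for postgres_exporter, mysqld_exporter, and mongodb_exporter respectively.

```yaml
# prometheus.yml (fragment) -- sketch, adjust targets to your hosts
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets: ['db-host:9187']   # postgres_exporter default port
  - job_name: mysql
    static_configs:
      - targets: ['db-host:9104']   # mysqld_exporter default port
  - job_name: mongodb
    static_configs:
      - targets: ['db-host:9216']   # mongodb_exporter default port

rule_files:
  - /etc/prometheus/rules.d/*.yml   # picks up alerts/rules.yml from the Quick Start

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```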

Usage Examples

PostgreSQL: custom exporter query for cache hit ratio:

# postgres_exporter custom queries
pg_cache_hit_ratio:
  query: |
    SELECT
      datname,
      ROUND(
        blks_hit::numeric / NULLIF(blks_hit + blks_read, 0) * 100, 2
      ) AS cache_hit_ratio
    FROM pg_stat_database
    WHERE datname NOT IN ('template0', 'template1')
  metrics:
    - datname:
        usage: "LABEL"
    - cache_hit_ratio:
        usage: "GAUGE"
        description: "Buffer cache hit ratio percentage"
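A note on naming, assuming stock postgres_exporter custom-query behavior: the exporter appends each column name to the query key, so the gauge above is exposed as pg_cache_hit_ratio_cache_hit_ratio. A Grafana panel or ad-hoc query could then be:

```
# PromQL: worst cache hit ratio per database
min by (datname) (pg_cache_hit_ratio_cache_hit_ratio)
```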

Prometheus alert rule for replication lag:

# alerts/rules.yml
groups:
  - name: postgresql_replication
    rules:
      - alert: PostgreSQLReplicationLagHigh
        expr: pg_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag exceeds 30s on {{ $labels.instance }}"
          runbook: |
            1. Check replica status: SELECT * FROM pg_stat_replication;
            2. Verify network between primary and replica
            3. Check long-running txns: SELECT * FROM pg_stat_activity
               WHERE state = 'active'
               AND age(now(), xact_start) > interval '5 min';
            4. If WAL replay is stuck, check the replica logs for errors

      - alert: PostgreSQLCacheHitRatioLow
        expr: pg_cache_hit_ratio_cache_hit_ratio < 95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit ratio {{ $value }}% on {{ $labels.datname }}"
          runbook: |
            1. Check shared_buffers: SHOW shared_buffers;
            2. Compare to DB size: SELECT pg_size_pretty(pg_database_size(current_database()));
            3. Consider increasing shared_buffers (target: 25% of RAM)

MySQL Performance Schema diagnostic query:

-- Top 10 queries by total execution time
SELECT
    DIGEST_TEXT AS query_pattern,
    COUNT_STAR AS executions,
    ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_seconds,
    ROUND(AVG_TIMER_WAIT / 1e9, 2) AS avg_ms,
    SUM_ROWS_EXAMINED AS rows_examined,
    SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
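
For MongoDB, the kit ships a mongosh diagnostic script (configs/collectors/mongo_serverStatus.js). A minimal sketch of the same idea, assuming a mongosh session (it uses the db global, so it runs in mongosh, not plain Node):

```javascript
// mongosh sketch: pull a few health signals from serverStatus()
// Field names follow the serverStatus document; repl fields vary by version.
const s = db.serverStatus();
print("connections: " + s.connections.current + "/" + s.connections.available);
print("opcounters:  " + JSON.stringify(s.opcounters));
if (s.repl) {
  print("replica set: " + s.repl.setName);
}
```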

Configuration

# config.example.yaml
monitoring:
  engine: postgresql            # postgresql | mysql | mongodb
  host: db-host.example.com
  port: 5432
  monitor_user: monitor_ro      # dedicated read-only monitoring user
  monitor_password: YOUR_PASSWORD_HERE

prometheus:
  scrape_interval: 15s
  evaluation_interval: 15s
  retention: 30d

alerting:
  slack_webhook: https://hooks.slack.com/services/YOUR/WEBHOOK/HERE
  pagerduty_key: YOUR_PAGERDUTY_KEY_HERE
  alert_cooldown: 10m           # suppress repeat alerts within window

grafana:
  datasource_name: Prometheus
  refresh_interval: 30s
  default_time_range: 6h
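The slack_webhook, pagerduty_key, and alert_cooldown settings map naturally onto an Alertmanager config. A hedged sketch, with receiver names chosen for illustration:

```yaml
# alertmanager.yml (fragment) -- illustrative, not the kit's shipped config
route:
  receiver: slack-warnings
  repeat_interval: 10m              # mirrors alert_cooldown above
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical

receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/HERE
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: YOUR_PAGERDUTY_KEY_HERE
```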

Best Practices

  1. Create a dedicated monitoring user with SELECT-only permissions. Never monitor with the application user or superuser.
  2. Set alert thresholds from baselines, not guesses. Run baseline capture for 7 days, then set thresholds at p95 + 20%.
  3. Alert on symptoms, not causes. "Query latency p99 > 500ms" is actionable. "CPU > 80%" often is not.
  4. Include runbooks in every alert annotation. At 3 AM, the on-call engineer needs steps, not a dashboard link.
  5. Monitor the monitors. Set a dead-man's switch alert that fires if Prometheus stops receiving metrics for > 5 minutes.
  6. Rotate exporter credentials on the same schedule as application credentials.
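
Practice 5 is commonly implemented as an always-firing alert routed to a heartbeat receiver that pages you when the signal stops arriving. A minimal Prometheus rule sketch (the same pattern as kube-prometheus's Watchdog alert):

```yaml
# Dead-man's switch: this alert should ALWAYS be firing.
# Route it to a heartbeat endpoint (e.g. PagerDuty or Healthchecks)
# that alerts when the signal stops.
groups:
  - name: meta_monitoring
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat: absence of this alert means Prometheus or Alertmanager is down"
```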

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Exporter shows 0 for all metrics | Monitor user lacks pg_monitor role | GRANT pg_monitor TO monitor_ro; |
| Grafana dashboard shows "No Data" | Prometheus datasource misconfigured | Verify the datasource URL in Grafana > Configuration > Data Sources |
| Alerts firing immediately after deploy | Thresholds too tight for your workload | Run baseline capture for 7 days, then adjust |
| pg_stat_statements empty | Extension not loaded | CREATE EXTENSION pg_stat_statements; and add it to shared_preload_libraries |
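
Setting up the dedicated read-only user from Best Practice 1 typically looks like this on PostgreSQL 10+ (the role name matches config.example.yaml; the database name is an assumption):

```sql
-- Create a dedicated read-only monitoring role (PostgreSQL 10+)
CREATE ROLE monitor_ro WITH LOGIN PASSWORD 'YOUR_PASSWORD_HERE';

-- pg_monitor bundles pg_read_all_stats and pg_read_all_settings
GRANT pg_monitor TO monitor_ro;

-- Allow connecting to the target database ("myapp" is illustrative)
GRANT CONNECT ON DATABASE myapp TO monitor_ro;
```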

This is 1 of 9 resources in the Database Admin Pro toolkit. Get the complete Database Monitoring Setup kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Database Admin Pro bundle (9 products) for $109 — save 30%.

Get the Complete Bundle →
