InstaDevOps

Posted on • Originally published at instadevops.com

Production Monitoring with Prometheus and Grafana: From Setup to Alerting

Introduction

You can't fix what you can't see. Yet a surprising number of startups run production systems with nothing more than CloudWatch basics or an uptime checker pinging their homepage every 5 minutes. When something breaks, the debugging process starts with "check the logs" and devolves into SSH-ing into random servers hoping to find a clue.

Prometheus and Grafana have become the standard open-source monitoring stack for good reason: Prometheus is a battle-tested time-series database with a powerful query language, and Grafana turns that data into dashboards and alerts that actually help you find problems. Together with proper alerting rules, they give you the observability foundation every production system needs.

This guide covers a production-ready setup: installation, scrape configuration, essential PromQL queries, alerting rules that won't wake you up for nothing, Grafana dashboards, and long-term storage with Thanos.

Installing the Stack

The cleanest way to deploy the monitoring stack is with Docker Compose for small/medium setups, or the kube-prometheus-stack Helm chart for Kubernetes. We'll start with Docker Compose since it's easier to understand, then cover the K8s path.

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules/:/etc/prometheus/rules/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=50GB'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: https://monitoring.example.com
    ports:
      - "3000:3000"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - '/:/host:ro,rslave'
    pid: host
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
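The retention flags above cap the local TSDB by age (30 days) and size (50GB). To sanity-check whether that size cap fits your workload, you can do a back-of-envelope estimate from active series count and scrape interval. The ~1.5 bytes per sample figure is a rough rule of thumb from the Prometheus storage documentation; real usage varies with series churn and label cardinality, so treat this as a sketch, not a guarantee:

```python
# Back-of-envelope local TSDB sizing for the retention flags above.
# Assumes ~1.5 bytes per compressed sample (rule of thumb; actual usage
# depends on churn and cardinality).

def estimate_tsdb_gb(active_series: int, scrape_interval_s: int,
                     retention_days: int, bytes_per_sample: float = 1.5) -> float:
    samples_per_day = active_series * (86_400 / scrape_interval_s)
    total_bytes = samples_per_day * retention_days * bytes_per_sample
    return total_bytes / 1024**3

# e.g. 200k active series scraped every 15s, kept for 30 days
print(round(estimate_tsdb_gb(200_000, 15, 30), 1))
```

At roughly 48GB for that hypothetical workload, the 50GB cap would be cutting it close; anything you want to keep longer belongs in object storage via Thanos (covered below).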

Prometheus Configuration and Scrape Targets

The Prometheus config defines what to scrape and how often:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host metrics via Node Exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'
          - '10.0.1.10:9100'
          - '10.0.1.11:9100'
          - '10.0.1.12:9100'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: instance
        replacement: '${1}'

  # Container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Application metrics
  - job_name: 'api'
    metrics_path: /metrics
    static_configs:
      - targets: ['api-server:3000']
    relabel_configs:
      - source_labels: [__address__]
        target_label: service
        replacement: 'api'

  # Postgres via postgres_exporter
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Nginx via nginx_exporter
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  # Blackbox probing (HTTP endpoints)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://app.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Pro tip: For Kubernetes, skip static configs entirely. Use the kubernetes_sd_configs provider with pod annotations:

# In your K8s deployment
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
    prometheus.io/path: "/metrics"
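On the Prometheus side, a minimal scrape job that honors those annotations looks roughly like this (the job name is illustrative, and the kube-prometheus-stack chart ships a more complete version of the same idea):

```yaml
# In prometheus.yml: discover and scrape annotated pods
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path if set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```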

Essential PromQL Queries

PromQL is powerful but has a learning curve. Here are the queries you'll use 80% of the time:

CPU usage per host:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage percentage:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk usage percentage:

(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

HTTP request rate by status code:

sum(rate(http_requests_total[5m])) by (status_code)

95th percentile request latency:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
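It helps to know what histogram_quantile() actually computes: it finds the cumulative le bucket containing the requested rank and interpolates linearly inside it. Here's a simplified sketch of that estimate for a single histogram (the bucket bounds and counts are invented for illustration, and the real implementation additionally handles the +Inf bucket and other edge cases):

```python
# Simplified histogram_quantile(): linear interpolation over cumulative
# "le" buckets. Omits PromQL's +Inf-bucket and edge-case handling.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
print(histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100)]))
```

For this sample data, p95 lands at about 0.75s, halfway into the 0.5–1.0s bucket. This is also why bucket boundaries matter: the estimate can't be more precise than the bucket containing the quantile.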

Error rate as a percentage of total requests:

sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

Container memory usage by pod (Kubernetes):

sum(container_memory_working_set_bytes{container!=""}) by (pod) / 1024 / 1024

PromQL gotcha: always use rate() or irate() on counters, never the raw value. Counters only go up (and reset to zero on restart), so the raw value is meaningless on its own. rate() gives you the per-second change averaged over a window. Use 5m windows for dashboards and alerting; shorter windows are noisy, and the range should be at least four times your scrape interval so each window contains enough samples.
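To make the counter semantics concrete, here is a simplified sketch of what rate() does over a window, including reset handling (real PromQL also extrapolates to the window boundaries, which this skips):

```python
# Simplified rate(): per-second increase of a counter over a window,
# compensating for resets (a value drop means the process restarted).
# Real PromQL additionally extrapolates to the window edges.

def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), ascending."""
    if len(samples) < 2:
        return 0.0
    increase, prev = 0.0, samples[0][1]
    for _, value in samples[1:]:
        if value < prev:            # counter reset: count from zero
            increase += value
        else:
            increase += value - prev
        prev = value
    window = samples[-1][0] - samples[0][0]
    return increase / window

# Counter resets from 130 back to 10 mid-window; true increase is 70
print(simple_rate([(0, 100), (15, 130), (30, 10), (45, 40)]))
```

Note how the reset sample contributes its full value (the counter restarted from zero) rather than producing a huge negative delta; this is why raw subtraction of counter values across a restart gives garbage.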

Alerting Rules That Don't Cause Alert Fatigue

The biggest mistake in monitoring is alerting on everything. Every alert should be actionable - if you can't do anything about it at 3 AM, it shouldn't page you.

# prometheus/rules/alerts.yml
groups:
  - name: infrastructure
    rules:
      # Only alert on sustained high CPU - brief spikes are normal
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% for 15 minutes on {{ $labels.instance }}"
          runbook: "https://wiki.example.com/runbooks/high-cpu"

      # Disk filling up - predict when it will be full
      - alert: DiskWillFillIn24Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 24 hours"

      # Memory - alert before OOM killer strikes
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

  - name: application
    rules:
      # Error rate spike
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% - {{ $value | humanizePercentage }} of requests failing"

      # Latency degradation
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 2 seconds"

      # Endpoint down
      - alert: EndpointDown
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not responding to HTTP probes"

  - name: kubernetes
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: PodNotReady
        expr: kube_pod_status_ready{condition="true"} == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10 minutes"

Notice the for clauses - they prevent one-off spikes from triggering alerts. A brief CPU spike during a deployment is fine; sustained high CPU for 15 minutes means something is wrong.
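The predict_linear() call in the disk alert is just an ordinary least-squares fit over the samples in the window, extrapolated into the future. A sketch of the math, with invented sample data mimicking a disk losing 1 GB per hour:

```python
# Sketch of predict_linear(): least-squares line over (t, value) samples,
# extrapolated t_ahead seconds past the last sample. Data is invented
# for illustration.

def predict_linear(samples, t_ahead):
    """samples: list of (timestamp_seconds, value); returns predicted value."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + t_ahead) + intercept

# 20 GB free, dropping 1 GB per hour, sampled hourly over a 6h window
samples = [(h * 3600, (20 - h) * 1e9) for h in range(7)]
print(predict_linear(samples, 24 * 3600) / 1e9)  # projected GB free in 24h
```

The projection comes out around -10 GB, i.e. well below zero, so the DiskWillFillIn24Hours alert would fire long before the disk actually fills.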

Alertmanager Configuration

Route alerts to the right channels based on severity:

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-warnings'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: YOUR_PAGERDUTY_SERVICE_KEY
        description: '{{ .CommonAnnotations.summary }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

The inhibit rule prevents warning alerts from firing when a critical alert is already active for the same issue. No point getting a "high CPU" warning when you're already paged about the machine being down.
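If warnings in Slack still feel noisy, Alertmanager's time intervals let you restrict a route to working hours while leaving critical pages untouched. A sketch (the interval name and hours are illustrative):

```yaml
# In alertmanager.yml: a named time window for warning delivery
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

# Then on the warning route, add:
#       active_time_intervals: ['business-hours']
```

Alerts suppressed this way still show up in the Alertmanager UI; only delivery is deferred.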

Grafana Dashboard Essentials

Provision dashboards as code instead of clicking through the UI:

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true
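Data sources can be provisioned the same way, so a fresh Grafana container comes up already wired to Prometheus (the URL assumes the Compose service names from earlier):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s   # match the Prometheus scrape_interval
```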

For a quick start, import these community dashboards by ID in Grafana:

  • 1860 - Node Exporter Full (host metrics)
  • 893 - Docker and Host Monitoring
  • 315 - Kubernetes Cluster Monitoring
  • 9628 - PostgreSQL Database
  • 12708 - Nginx

Build custom dashboards for your application-specific metrics. A good application dashboard has four rows:

  1. Traffic - Request rate, active connections, requests by endpoint
  2. Errors - Error rate, error breakdown by type, recent error log entries
  3. Latency - p50, p95, p99 response times, latency by endpoint
  4. Saturation - CPU, memory, disk, connection pool utilization

This is the RED method (Rate, Errors, Duration) plus saturation - and it's enough to diagnose most production issues.

Long-Term Storage with Thanos

Prometheus stores data on local disk; it isn't designed for long-term retention or for global querying across multiple Prometheus instances. Thanos solves both problems.

The simplest Thanos setup uses the sidecar pattern. One caveat: when the sidecar uploads blocks to object storage, Prometheus should run with --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration both set to 2h so it never compacts blocks locally (the Thanos compactor takes over that job).

# Additional services in docker-compose.yml, alongside the Prometheus service
  thanos-sidecar:
    image: thanosio/thanos:v0.35.1
    command:
      - sidecar
      - '--tsdb.path=/prometheus'
      - '--prometheus.url=http://prometheus:9090'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
    volumes:
      - prometheus_data:/prometheus
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml

  thanos-store:
    image: thanosio/thanos:v0.35.1
    command:
      - store
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--data-dir=/tmp/thanos-store'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml

  thanos-query:
    image: thanosio/thanos:v0.35.1
    command:
      - query
      - '--store=thanos-sidecar:10901'
      - '--store=thanos-store:10901'
    ports:
      - "10902:10902"

  thanos-compactor:
    image: thanosio/thanos:v0.35.1
    command:
      - compact
      - '--data-dir=/tmp/thanos-compact'
      - '--objstore.config-file=/etc/thanos/bucket.yml'
      - '--retention.resolution-raw=30d'
      - '--retention.resolution-5m=180d'
      - '--retention.resolution-1h=365d'
      - '--wait'
    volumes:
      - ./thanos/bucket.yml:/etc/thanos/bucket.yml

# thanos/bucket.yml
type: S3
config:
  bucket: your-thanos-metrics-bucket
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  access_key: YOUR_ACCESS_KEY
  secret_key: YOUR_SECRET_KEY

The Thanos compactor downsamples old data: raw resolution for 30 days, 5-minute resolution for six months, and 1-hour resolution for a year. This keeps storage costs manageable while giving you a full year of metrics history.

Point your Grafana data source at Thanos Query (port 10902) instead of Prometheus directly. You'll get the same PromQL interface but with access to historical data from object storage.

Need Help with Your DevOps?

Building a monitoring stack is step one - tuning it to actually catch problems without drowning you in noise is the hard part. At InstaDevOps, we set up and maintain your complete observability pipeline - Prometheus, Grafana, alerting, log aggregation, and tracing - starting at $2,999/month.

Book a free 15-minute consultation to discuss your monitoring needs: https://calendly.com/instadevops/15min
