Introduction
You can't fix what you can't see. Yet a surprising number of startups run production systems with nothing more than CloudWatch basics or an uptime checker pinging their homepage every 5 minutes. When something breaks, the debugging process starts with "check the logs" and devolves into SSH-ing into random servers hoping to find a clue.
Prometheus and Grafana have become the standard open-source monitoring stack for good reason: Prometheus is a battle-tested time-series database with a powerful query language, and Grafana turns that data into dashboards and alerts that actually help you find problems. Together with proper alerting rules, they give you the observability foundation every production system needs.
This guide covers a production-ready setup: installation, scrape configuration, essential PromQL queries, alerting rules that won't wake you up for nothing, Grafana dashboards, and long-term storage with Thanos.
Installing the Stack
The cleanest way to deploy the monitoring stack is with Docker Compose for small/medium setups, or the kube-prometheus-stack Helm chart for Kubernetes. We'll start with Docker Compose since it's easier to understand, then cover the K8s path.
# docker-compose.yml
services:
prometheus:
image: prom/prometheus:v2.53.0
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- ./prometheus/rules/:/etc/prometheus/rules/
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=50GB'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- "9090:9090"
restart: unless-stopped
grafana:
image: grafana/grafana:11.1.0
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning/:/etc/grafana/provisioning/
environment:
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
GF_USERS_ALLOW_SIGN_UP: "false"
GF_SERVER_ROOT_URL: https://monitoring.example.com
ports:
- "3000:3000"
restart: unless-stopped
alertmanager:
image: prom/alertmanager:v0.27.0
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
ports:
- "9093:9093"
restart: unless-stopped
node-exporter:
image: prom/node-exporter:v1.8.1
command:
- '--path.rootfs=/host'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
volumes:
- '/:/host:ro,rslave'
pid: host
ports:
- "9100:9100"
restart: unless-stopped
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.1
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
ports:
- "8080:8080"
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
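The Grafana admin password comes from the environment rather than being hard-coded in the compose file. With Docker Compose, the simplest way to supply it is a .env file next to docker-compose.yml (kept out of version control). A minimal sketch - the variable name matches what the compose file above expects, the value is obviously a placeholder:
# .env - lives next to docker-compose.yml, excluded from git
GRAFANA_ADMIN_PASSWORD=use-a-long-random-password-here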
Prometheus Configuration and Scrape Targets
The Prometheus config defines what to scrape and how often:
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_timeout: 10s
rule_files:
- "rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
# Monitor Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Host metrics via Node Exporter
- job_name: 'node'
static_configs:
- targets:
- 'node-exporter:9100'
- '10.0.1.10:9100'
- '10.0.1.11:9100'
- '10.0.1.12:9100'
relabel_configs:
- source_labels: [__address__]
regex: '(.+):\d+'
target_label: instance
replacement: '${1}'
# Container metrics
- job_name: 'cadvisor'
static_configs:
- targets: ['cadvisor:8080']
# Application metrics
- job_name: 'api'
metrics_path: /metrics
static_configs:
- targets: ['api-server:3000']
relabel_configs:
- source_labels: [__address__]
target_label: service
replacement: 'api'
# Postgres via postgres_exporter
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
# Nginx via nginx_exporter
- job_name: 'nginx'
static_configs:
- targets: ['nginx-exporter:9113']
# Blackbox probing (HTTP endpoints)
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
- https://app.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Pro tip: For Kubernetes, skip static configs entirely. Use the kubernetes_sd_configs provider with pod annotations:
# In your K8s deployment
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "3000"
prometheus.io/path: "/metrics"
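On the Prometheus side, a single job with kubernetes_sd_configs discovers annotated pods and rewrites the scrape path, port, and labels from those annotations. A sketch of the commonly used relabeling (job name is arbitrary; RBAC and namespace scoping are up to your cluster setup):
# prometheus.yml - pod discovery driven by the annotations above
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Only scrape pods annotated with prometheus.io/scrape: "true"
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    # Honor a custom metrics path if one is set
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Rewrite the scrape address to use the annotated port
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    # Carry namespace and pod name through as labels
    - source_labels: [__meta_kubernetes_namespace]
      target_label: namespace
    - source_labels: [__meta_kubernetes_pod_name]
      target_label: pod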
Essential PromQL Queries
PromQL is powerful but has a learning curve. Here are the queries you'll use 80% of the time:
CPU usage per host:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk usage percentage:
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
HTTP request rate by status code:
sum(rate(http_requests_total[5m])) by (status_code)
95th percentile request latency:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Error rate as a percentage of total requests:
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
Container memory usage by pod (Kubernetes):
sum(container_memory_working_set_bytes{container!=""}) by (pod) / 1024 / 1024   # in MiB
PromQL gotcha: Always use rate() or irate() on counters, never raw values. Counters only go up (and reset on restart), so the raw value is meaningless. rate() gives you per-second change averaged over a window. Use 5m windows for dashboards and alerting - shorter windows are noisy.
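For example, graphing the counter directly versus its rate (using the same http_requests_total metric from above - the first line is what not to do):
# Meaningless for graphing: a monotonically increasing total that resets on restart
http_requests_total
# Useful: per-second request rate averaged over the last 5 minutes
rate(http_requests_total[5m])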
Alerting Rules That Don't Cause Alert Fatigue
The biggest mistake in monitoring is alerting on everything. Every alert should be actionable - if you can't do anything about it at 3 AM, it shouldn't page you.
# prometheus/rules/alerts.yml
groups:
- name: infrastructure
rules:
# Only alert on sustained high CPU - brief spikes are normal
- alert: HighCPUUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 15m
labels:
severity: warning
annotations:
summary: "CPU usage above 85% for 15 minutes on {{ $labels.instance }}"
runbook: "https://wiki.example.com/runbooks/high-cpu"
# Disk filling up - predict when it will be full
- alert: DiskWillFillIn24Hours
expr: |
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
for: 30m
labels:
severity: critical
annotations:
summary: "Disk on {{ $labels.instance }} predicted to fill within 24 hours"
# Memory - alert before OOM killer strikes
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: critical
annotations:
summary: "Memory usage above 90% on {{ $labels.instance }}"
- name: application
rules:
# Error rate spike
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Error rate above 5% - {{ $value | humanizePercentage }} of requests failing"
# Latency degradation
- alert: HighLatency
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "95th percentile latency above 2 seconds"
# Endpoint down
- alert: EndpointDown
expr: probe_success == 0
for: 3m
labels:
severity: critical
annotations:
summary: "{{ $labels.instance }} is not responding to HTTP probes"
- name: kubernetes
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
- alert: PodNotReady
expr: kube_pod_status_ready{condition="true"} == 0
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready for 10 minutes"
Notice the for clauses - they prevent one-off spikes from triggering alerts. A brief CPU spike during a deployment is fine; sustained high CPU for 15 minutes means something is wrong.
Alertmanager Configuration
Route alerts to the right channels based on severity:
# alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'slack-warnings'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 1h
- match:
severity: warning
receiver: 'slack-warnings'
repeat_interval: 4h
receivers:
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: YOUR_PAGERDUTY_SERVICE_KEY
description: '{{ .CommonAnnotations.summary }}'
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#alerts'
title: '{{ .CommonLabels.alertname }}'
text: '{{ .CommonAnnotations.summary }}'
send_resolved: true
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
The inhibit rule prevents warning alerts from firing when a critical alert is already active for the same issue. No point getting a "high CPU" warning when you're already paged about the machine being down.
Grafana Dashboard Essentials
Provision dashboards as code instead of clicking through the UI:
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
editable: true
options:
path: /etc/grafana/provisioning/dashboards/json
foldersFromFilesStructure: true
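Data sources can be provisioned the same way, so a fresh Grafana container comes up already wired to Prometheus. A minimal sketch, assuming the Compose service names from earlier (swap the url to Thanos Query if you adopt the setup in the last section):
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true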
For a quick start, import these community dashboards by ID in Grafana:
- 1860 - Node Exporter Full (host metrics)
- 893 - Docker and Host Monitoring
- 315 - Kubernetes Cluster Monitoring
- 9628 - PostgreSQL Database
- 12708 - Nginx
Build custom dashboards for your application-specific metrics. A good application dashboard has four rows:
- Traffic - Request rate, active connections, requests by endpoint
- Errors - Error rate, error breakdown by type, recent error log entries
- Latency - p50, p95, p99 response times, latency by endpoint
- Saturation - CPU, memory, disk, connection pool utilization
This is the RED method (Rate, Errors, Duration) plus saturation - and it's enough to diagnose most production issues.
Long-Term Storage with Thanos
Prometheus stores its data on local disk, so it isn't designed for long-term retention or for global querying across multiple Prometheus instances. Thanos solves both problems.
The simplest Thanos setup uses the sidecar pattern:
# Additional services alongside Prometheus in docker-compose.yml
thanos-sidecar:
image: thanosio/thanos:v0.35.1
command:
- sidecar
- '--tsdb.path=/prometheus'
- '--prometheus.url=http://prometheus:9090'
- '--objstore.config-file=/etc/thanos/bucket.yml'
volumes:
- prometheus_data:/prometheus
- ./thanos/bucket.yml:/etc/thanos/bucket.yml
thanos-store:
image: thanosio/thanos:v0.35.1
command:
- store
- '--objstore.config-file=/etc/thanos/bucket.yml'
- '--data-dir=/tmp/thanos-store'
volumes:
- ./thanos/bucket.yml:/etc/thanos/bucket.yml
thanos-query:
image: thanosio/thanos:v0.35.1
command:
- query
- '--store=thanos-sidecar:10901'
- '--store=thanos-store:10901'
ports:
- "10902:10902"
thanos-compactor:
image: thanosio/thanos:v0.35.1
command:
- compact
- '--data-dir=/tmp/thanos-compact'
- '--objstore.config-file=/etc/thanos/bucket.yml'
- '--retention.resolution-raw=30d'
- '--retention.resolution-5m=180d'
- '--retention.resolution-1h=365d'
- '--wait'
volumes:
- ./thanos/bucket.yml:/etc/thanos/bucket.yml
# thanos/bucket.yml
type: S3
config:
bucket: your-thanos-metrics-bucket
endpoint: s3.us-east-1.amazonaws.com
region: us-east-1
access_key: YOUR_ACCESS_KEY
secret_key: YOUR_SECRET_KEY
The Thanos compactor downsamples old data: raw resolution for 30 days, 5-minute resolution for six months, and 1-hour resolution for a year. This keeps storage costs manageable while still giving you a full year of metrics history.
Point your Grafana data source at Thanos Query (http://thanos-query:10902) instead of Prometheus directly. You get the same PromQL interface, plus access to the historical data in object storage.
Need Help with Your DevOps?
Building a monitoring stack is step one - tuning it to actually catch problems without drowning you in noise is the hard part. At InstaDevOps, we set up and maintain your complete observability pipeline - Prometheus, Grafana, alerting, log aggregation, and tracing - starting at $2,999/month.
Book a free 15-minute consultation to discuss your monitoring needs: https://calendly.com/instadevops/15min