TL;DR
- Alert on symptoms, not causes – users feel latency and errors, not high CPU. Alert on p95 latency and error rates, not internal metrics.
- Use SLOs and error budgets – alert when error budget burns too fast (e.g., with a 99.9% SLO, 2.4% errors over 1 hour = 24x the sustainable rate).
- Reduce alert fatigue – group related alerts, inhibit child alerts (don't alert on API errors when the database is down). Target <10% false positive rate.
- Route by severity – critical = page (PagerDuty), warning = Slack, info = channel. Escalate unacknowledged alerts.
- Test alerts in staging with promtool test rules. Review and remove stale alerts quarterly.
Alerts transform monitoring data into action. Without alerts, dashboards require constant watching. With proper alerting, teams learn about problems immediately. But poor alerting creates noise that gets ignored. Effective alerts are actionable, relevant, and timely. They notify the right people about real problems with enough context to respond quickly.
Alerting Philosophy
Alerts should be actionable. Every alert should require human intervention. If no action is needed, it's not an alert—it's noise.
Alert on symptoms first. Users experience errors, latency, and unavailability. Alert on these before alerting on causes.
Causes inform investigation, not alerting. High CPU is a cause. Slow responses are a symptom. Alert on slow responses.
Context enables fast response. Alert messages should include what's wrong, where, and how to investigate. Links to dashboards and runbooks save time.
# Good alert with context
# rate() turns the cumulative bucket counters into per-second rates;
# keep `le` (and `endpoint`, used in the description) when aggregating
- alert: HighAPILatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 0.5
  for: 5m
  labels:
    severity: warning
    team: api
  annotations:
    summary: "API p95 latency exceeds 500ms"
    description: "{{ $labels.endpoint }} has p95 latency of {{ $value }}s"
    dashboard: "https://grafana.internal/d/api-latency"
    runbook: "https://wiki.internal/runbooks/api-latency"
Every alert needs an owner. Someone must be responsible for responding. Orphan alerts get ignored.
Choosing What to Alert On
Service Level Objectives (SLOs) define what matters. If 99.9% availability is the target, alert when availability drops.
Error budgets quantify acceptable failure. Consuming error budget too fast triggers alerts. Slow burn toward SLO violation gets attention.
# Error budget alert (99.9% SLO = 0.1% budget; 24x burn rate = 2.4% errors)
- alert: ErrorBudgetBurn
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h])) /
    sum(rate(http_requests_total[1h])) > 0.001 * 24
  for: 30m
  annotations:
    summary: "Burning error budget at 24x normal rate"
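The arithmetic behind the `0.001 * 24` threshold can be checked with a short sketch. It assumes the 99.9% target mentioned above and a 30-day SLO window; the window length and the function names are illustrative assumptions, not taken from the original configuration.

```python
def burn_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error ratio at which the budget is consumed `burn_rate` times faster than sustainable."""
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_budget * burn_rate


def hours_to_exhaust(window_days: float, burn_rate: float) -> float:
    """Hours until the whole budget for the window is gone at a constant burn rate."""
    return window_days * 24.0 / burn_rate


threshold = burn_rate_threshold(slo_target=0.999, burn_rate=24)
print(f"alert threshold (error ratio): {threshold:.4f}")
print(f"hours until a 30-day budget is exhausted: {hours_to_exhaust(30, 24):.1f}")
```

Under these assumptions, a sustained 24x burn would use up a month of error budget in a little over a day, which is why a fast burn like this is usually treated as page-worthy.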
The four golden signals guide alerting. Latency, traffic, errors, and saturation cover most user-impacting issues.
Latency alerts catch slowdowns. Response time percentiles exceeding targets indicate problems.
Error rate alerts catch failures. Elevated error rates mean users aren't succeeding.
Traffic alerts catch unusual patterns. Too little traffic might indicate upstream problems. Too much might indicate attacks.
Saturation alerts predict problems. High resource utilization precedes failure. Alert before exhaustion.
Setting Effective Thresholds
Baseline from historical data. Normal operation defines what's unusual. Analyze weeks of data before setting thresholds.
Percentile-based thresholds handle variation. Alerting when p95 exceeds 500ms catches real problems. Average-based alerts miss tail latency issues.
# Percentile-based threshold
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 5m
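To see why an average can hide the tail, here is a small sketch with made-up latency samples (the numbers are purely illustrative). Note that Prometheus estimates quantiles from histogram buckets, so its values will differ from an exact percentile computed over raw samples like this.

```python
import statistics

# 100 hypothetical request latencies in seconds: 90 fast requests, 10 slow ones
latencies = [0.1] * 90 + [2.0] * 10

mean_latency = statistics.mean(latencies)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
p95_latency = statistics.quantiles(latencies, n=100)[94]

threshold = 0.5  # the 500ms threshold used in the rule above
print(f"mean breaches threshold: {mean_latency > threshold}")
print(f"p95 breaches threshold:  {p95_latency > threshold}")
```

In this sample one request in ten takes two seconds, yet the mean stays under the threshold; only the percentile reflects what the slowest users experience.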
Relative thresholds catch anomalies. Traffic at several times its normal level is unusual regardless of the absolute value, so compare the current rate against a multiple of the historical baseline.
# Relative threshold (2x baseline)
- alert: TrafficAnomaly
  expr: |
    sum(rate(http_requests_total[5m])) >
    2 * avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h])
  for: 10m
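The same comparison can be written out in plain code. This is a minimal sketch: `is_traffic_anomaly` and the sample rates are hypothetical, and the baseline is simply the mean of previously recorded request rates, mirroring what `avg_over_time` does over the subquery.

```python
def is_traffic_anomaly(current_rps: float, baseline_samples: list[float], factor: float = 2.0) -> bool:
    """True when the current request rate exceeds `factor` times the baseline average."""
    if not baseline_samples:
        return False  # no history yet, so there is nothing to compare against
    baseline = sum(baseline_samples) / len(baseline_samples)
    return current_rps > factor * baseline


hourly_baseline = [100.0, 110.0, 90.0]  # hypothetical requests/second samples
print(is_traffic_anomaly(250.0, hourly_baseline))
print(is_traffic_anomaly(150.0, hourly_baseline))
```

Because the threshold scales with the baseline, the same rule works for a service handling ten requests per second and one handling ten thousand.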
Duration requirements prevent flapping. Require conditions to persist before alerting. Brief spikes don't trigger pages.
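The effect of a duration requirement can be modeled with a simplified sketch. It counts consecutive evaluations where the condition holds; Prometheus itself tracks elapsed time since the alert became pending rather than counting evaluations, so treat this as an approximation with hypothetical names.

```python
def firing_states(condition_by_eval: list[bool], evals_required: int) -> list[bool]:
    """For each evaluation, report whether the alert would be firing.

    The alert fires only once the condition has held for `evals_required`
    consecutive evaluations; any false evaluation resets the streak.
    """
    states = []
    streak = 0
    for condition in condition_by_eval:
        streak = streak + 1 if condition else 0
        states.append(streak >= evals_required)
    return states


# A two-evaluation spike, a recovery, then a sustained problem
history = [True, True, False, True, True, True, True, True]
print(firing_states(history, evals_required=5))
```

The brief spike at the start never fires; only the sustained run at the end does.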
Multi-window alerts reduce noise. Alert only when both short-term and long-term views are bad. Catches sustained problems.
# Multi-window alert
- alert: SustainedErrors
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m])) > 0.01
    ) and (
      sum(rate(http_requests_total{status=~"5.."}[1h])) /
      sum(rate(http_requests_total[1h])) > 0.005
    )
  for: 5m
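The logic of the rule above, reduced to its two comparisons, looks like this. The function names and the request counts are hypothetical; the thresholds are the 1% and 0.5% values from the rule.

```python
def error_ratio(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def sustained_errors(short_ratio: float, long_ratio: float,
                     short_threshold: float = 0.01, long_threshold: float = 0.005) -> bool:
    """Alert only when both the short and the long window are above their thresholds."""
    return short_ratio > short_threshold and long_ratio > long_threshold


# Brief spike: the 5m window is bad, but the 1h window is still healthy
print(sustained_errors(error_ratio(30, 1000), error_ratio(40, 20000)))
# Sustained problem: both windows are bad
print(sustained_errors(error_ratio(30, 1000), error_ratio(200, 20000)))
```

The short window makes the alert reset quickly once the problem is fixed, while the long window keeps a momentary blip from paging anyone.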
Test thresholds in staging. Simulate load and failures. Verify alerts fire appropriately.
Alert Routing and Escalation
Route alerts to responsible teams. API alerts go to API team. Database alerts go to DBA team.
Severity levels determine urgency. Critical alerts page immediately. Warnings create tickets and post to Slack. Info goes to a general channel.
# Alertmanager routing rules (child routes of the top-level route)
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-warnings
  - match:
      severity: info
    receiver: slack-general
Escalation ensures response. Unacknowledged alerts escalate to secondary. Eventually reach management if needed.
# Escalation policy
escalation_policy:
  escalation_rules:
    - targets: [primary-oncall]
      escalation_delay_in_minutes: 5
    - targets: [secondary-oncall]
      escalation_delay_in_minutes: 10
    - targets: [team-lead]
      escalation_delay_in_minutes: 20
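To make the timing of this policy concrete, the sketch below works out who holds the alert after a given number of unacknowledged minutes. It assumes each delay is the time an alert waits at that level before moving on to the next one; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EscalationRule:
    target: str
    delay_minutes: int  # assumed: minutes to wait at this level before escalating


RULES = [
    EscalationRule("primary-oncall", 5),
    EscalationRule("secondary-oncall", 10),
    EscalationRule("team-lead", 20),
]


def current_target(rules: list[EscalationRule], minutes_unacknowledged: float) -> str:
    """Return the target responsible after the alert has gone unacknowledged this long."""
    elapsed = 0
    for rule in rules:
        elapsed += rule.delay_minutes
        if minutes_unacknowledged < elapsed:
            return rule.target
    return rules[-1].target  # past the last delay, the final level keeps the alert


for minutes in (0, 5, 15):
    print(minutes, current_target(RULES, minutes))
```

Under this reading, the secondary is notified 5 minutes in and the team lead 15 minutes in, which is worth checking against how quickly a page can realistically be acknowledged.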
Time-based routing handles shifts. Route to on-call schedules, not individuals. Schedules rotate automatically.
Business hours awareness adjusts severity. Warning during work hours. Critical after hours for the same condition.
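One way to express this is a small function that picks the severity from the time of day. This is only a sketch: the Monday to Friday, 09:00 to 17:00 window and the function name are assumptions, and a real setup would also need to handle time zones and holidays.

```python
from datetime import datetime


def effective_severity(now: datetime, work_start_hour: int = 9, work_end_hour: int = 17) -> str:
    """Return 'warning' during staffed hours and 'critical' otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_work_hours = work_start_hour <= now.hour < work_end_hour
    return "warning" if is_weekday and in_work_hours else "critical"


print(effective_severity(datetime(2024, 1, 8, 10, 0)))   # Monday morning
print(effective_severity(datetime(2024, 1, 8, 22, 0)))   # Monday night
```

The reasoning is that during work hours someone is already watching the warning channel, while after hours the same condition needs a page to be noticed at all.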
Reducing Alert Fatigue
Alert fatigue kills response quality. Too many alerts means alerts get ignored. Each unnecessary alert degrades the system.
Group related alerts. Multiple symptoms of one problem create one notification. Reduce noise without losing information.
# Alertmanager grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
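What grouping does to a burst of alerts can be sketched in a few lines. The alert dictionaries and the `group_alerts` helper are hypothetical; the sketch only shows the bucketing by label values, not the timing behaviour of `group_wait` or `group_interval`.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the `group_by` labels; each bucket is one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HighLatency", "service": "api", "instance": "api-1"},
    {"alertname": "HighLatency", "service": "api", "instance": "api-2"},
    {"alertname": "HighLatency", "service": "checkout", "instance": "co-1"},
]
groups = group_alerts(alerts, ["alertname", "service"])
print(f"{len(alerts)} alerts -> {len(groups)} notifications")
```

The two `api` instances collapse into a single notification, while the `checkout` alert stays separate because it belongs to a different service.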
Inhibit redundant alerts. If a database is down, don't also alert about API errors caused by database.
# Inhibition rules
inhibit_rules:
  - source_match:
      alertname: 'DatabaseDown'
    target_match:
      alertname: 'APIErrors'
    equal: ['environment']
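The matching logic of this rule can be illustrated with a short sketch. The `is_inhibited` function and the sample alerts are hypothetical; the point is that the target is muted only when a source alert is firing with the same value for every label listed in `equal`.

```python
def is_inhibited(target: dict, firing: list[dict],
                 source_name: str = "DatabaseDown",
                 target_name: str = "APIErrors",
                 equal: tuple[str, ...] = ("environment",)) -> bool:
    """True when `target` should be muted because a matching source alert is firing."""
    if target.get("alertname") != target_name:
        return False  # the rule only applies to the named target alert
    return any(
        source.get("alertname") == source_name
        and all(source.get(label) == target.get(label) for label in equal)
        for source in firing
    )


firing = [{"alertname": "DatabaseDown", "environment": "prod"}]
print(is_inhibited({"alertname": "APIErrors", "environment": "prod"}, firing))
print(is_inhibited({"alertname": "APIErrors", "environment": "staging"}, firing))
```

The `equal` list is what keeps a production database outage from hiding unrelated API errors in staging.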
Maintenance windows silence expected alerts. Deployments and migrations trigger alerts. Silence during planned work.
Regular alert review removes obsolete alerts. Delete alerts that never fire. Delete alerts that don't require action. Audit quarterly.
Track alert metrics. Alert frequency, time to acknowledge, and false positive rate. Use data to improve.
Alert fatigue kills response quality. We implement grouping, inhibition, and quarterly audits.
Group related alerts (group_by: ['alertname', 'service']). Inhibit redundant notifications (database down → no API error alerts). Audit stale alerts quarterly.
We help you:
- Configure Alertmanager grouping – group_wait, group_interval, repeat_interval
- Set up inhibition rules – Suppress downstream alerts when the root cause is already alerting
- Schedule maintenance windows – Silence expected alerts during deployments
- Track alert metrics – False positive rate (<10%), actionable rate (>90%), time to acknowledge (<5min)
Notification Channels
Multiple channels ensure delivery. PagerDuty for critical. Slack for warnings. Email for reports.
Critical alerts need push notification. Phone calls and push notifications for pages. Interruptive by design.
# Multi-channel notification
receivers:
  - name: 'critical'
    pagerduty_configs:
      - service_key: xxx
    slack_configs:
      - channel: '#critical-alerts'
  - name: 'warning'
    slack_configs:
      - channel: '#warnings'
Slack integration enables collaboration. Alerts in channels where teams work. Discuss and resolve together.
Rich notifications include context. Links to dashboards. Current metric values. Affected systems.
# Slack message template
slack_configs:
  - channel: '#alerts'
    title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
    text: |
      {{ .CommonAnnotations.summary }}
      *Severity:* {{ .CommonLabels.severity }}
      *Service:* {{ .CommonLabels.service }}
      <{{ .CommonAnnotations.dashboard }}|View Dashboard>
      <{{ .CommonAnnotations.runbook }}|Runbook>
Status pages inform users. Integrate alerts with status page updates. Users know when you know about problems.
Alert Management and Maintenance
Version control alert configurations. Store in Git alongside code. Review changes before deployment.

Test alerts in staging. Verify alerts fire correctly. Catch configuration errors before production.
# Prometheus rule testing
promtool check rules alert_rules.yml
promtool test rules alert_rules_test.yml
Document each alert. What it means. Why it matters. How to investigate. Keep documentation current.
Review alerts after incidents. Did alerts fire? Were they helpful? What was missed? Improve based on experience.
Track alert effectiveness metrics. Mean time to acknowledge. False positive rate. Alert-to-incident ratio.
| Metric | Target | Action if Poor |
|---|---|---|
| Acknowledge time | < 5 min | Review routing |
| False positive rate | < 10% | Adjust thresholds |
| Alerts per incident | 1-3 | Improve grouping |
| Actionable rate | > 90% | Remove noise |
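The metrics in the table can be computed from a reviewed alert log. The sketch below assumes each alert has been labeled after the fact; the `AlertRecord` fields and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    real_problem: bool     # did the alert correspond to a genuine issue?
    action_taken: bool     # did the responder have to do something?
    minutes_to_ack: float  # time from notification to acknowledgement


def summarize(records: list[AlertRecord]) -> dict[str, float]:
    """Compute false positive rate, actionable rate, and mean acknowledge time."""
    total = len(records)
    if total == 0:
        return {"false_positive_rate": 0.0, "actionable_rate": 0.0, "mean_minutes_to_ack": 0.0}
    false_positives = sum(1 for r in records if not r.real_problem)
    actionable = sum(1 for r in records if r.action_taken)
    return {
        "false_positive_rate": false_positives / total,
        "actionable_rate": actionable / total,
        "mean_minutes_to_ack": sum(r.minutes_to_ack for r in records) / total,
    }


# Hypothetical review data: 9 real, actionable alerts and 1 false positive
records = [AlertRecord(True, True, 3.0)] * 9 + [AlertRecord(False, False, 8.0)]
summary = summarize(records)
print(summary)
```

Comparing these numbers with the targets in the table each quarter gives the alert review a concrete starting point.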
Scheduled silence for known issues. While investigating known problems, silence related alerts. Focus on new issues.
Runbook automation reduces response time. Link alerts to automated diagnostics. Pre-gather information for responders.
Conclusion
Effective alerting transforms raw monitoring data into timely, actionable notifications that drive incident response. The principles are proven: alert on symptoms (latency, errors, saturation), use SLOs and error budgets as your framework, set percentile-based thresholds with duration requirements, route by severity with clear escalation paths, and relentlessly eliminate noise.
Without proper alerting, your dashboards are just screensavers. With proper alerting, you detect problems before users, respond with context, and resolve faster. Start with the four golden signals (latency, traffic, errors, saturation), add SLO-based error budget alerts, and implement grouping and inhibition to reduce fatigue. Review and clean up alerts quarterly: stale alerts are dangerous alerts.
FAQs
1. How do I distinguish between critical and warning alerts?
Critical alert if:
- Error rate >1%
- p95 latency >1s
- Service down
- User impact imminent or occurring
Pages on-call. Requires immediate action.
Warning alert if:
- CPU >80%
- Error rate rising but still <0.5%
- Potential future issue
Creates ticket, sends to Slack. Can wait until business hours.
Info alert (no action) for:
- Deployment completed
- Observability data
Use as context, not an alert.
2. What's a good false positive rate for alerts?
Target <10% false positives. If >20%, engineers ignore alerts ("cry wolf" effect). Common causes:
| Cause | Fix |
|---|---|
| Thresholds too sensitive | Adjust thresholds based on baseline data |
| Duration too short | Add for: 5m duration requirement |
| Missing maintenance windows | Silence during known maintenance (deployments, batch jobs) |
3. How do I test alerts without affecting production?
Three methods:
- Prometheus rule testing – promtool test rules with mock time series.
- Staging environment – replicate production metrics, trigger conditions, verify notifications.
- Synthetic monitoring – run test transactions that deliberately trigger alert conditions (e.g., force 5xx errors on test endpoint).
For the PagerDuty API, use pd-send-test-event to verify routing without opening an actual incident. Never test with real production pages; use a resolve flag or dry-run mode.
