Safdar Wahid

Posted on Jun 2 • Originally published at blog.easecloud.io

Setting Up Alerts and Notifications for Performance Bottlenecks

#devops #monitoring #performance #sre

TL;DR

Alert on symptoms, not causes – users feel latency and errors, not high CPU. Alert on p95 latency and error rates, not internal metrics.
Use SLOs and error budgets – alert when the error budget burns too fast (e.g., with a 99.9% SLO, a 2.4% error rate sustained over 1 hour is a 24x burn rate).
Reduce alert fatigue – group related alerts, inhibit child alerts (don't alert on API errors when the database is down). Target <10% false positive rate.
Route by severity – critical = page (PagerDuty), warning = Slack, info = channel. Escalate unacknowledged alerts.
Test alert rules with promtool test rules and verify notifications in staging. Review and remove stale alerts quarterly.

Alerts transform monitoring data into action. Without alerts, dashboards require constant watching. With proper alerting, teams learn about problems immediately. But poor alerting creates noise that gets ignored. Effective alerts are actionable, relevant, and timely. They notify the right people about real problems with enough context to respond quickly.

Alerting Philosophy

Alerts should be actionable. Every alert should require human intervention. If no action is needed, it's not an alert—it's noise.

Alert on symptoms first. Users experience errors, latency, and unavailability. Alert on these before alerting on causes.

Causes inform investigation, not alerting. High CPU is a cause. Slow responses are a symptom. Alert on slow responses.

Context enables fast response. Alert messages should include what's wrong, where, and how to investigate. Links to dashboards and runbooks save time.

# Good alert with context
- alert: HighAPILatency
  expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: warning
    team: api
  annotations:
    summary: "API p95 latency exceeds 500ms"
    description: "{{ $labels.endpoint }} has p95 latency of {{ $value }}s"
    dashboard: "https://grafana.internal/d/api-latency"
    runbook: "https://wiki.internal/runbooks/api-latency"

Every alert needs an owner. Someone must be responsible for responding. Orphan alerts get ignored.

Choosing What to Alert On

Service Level Objectives (SLOs) define what matters. If 99.9% availability is the target, alert when availability drops below it or is on course to.

Error budgets quantify acceptable failure. A 99.9% SLO leaves a 0.1% error budget. Consuming that budget too fast triggers an alert, and a slower burn that would still end in an SLO violation deserves attention at lower severity.

# Error budget alert (0.001 = error budget of a 99.9% SLO, 24 = burn-rate multiplier)
- alert: ErrorBudgetBurn
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h])) /
    sum(rate(http_requests_total[1h])) > 0.001 * 24
  for: 30m
  annotations:
    summary: "Burning error budget at 24x normal rate"
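
The `0.001 * 24` threshold is easier to reason about with the arithmetic written out. The sketch below is illustrative only: it assumes a 30-day SLO window, which the rule above does not state, and the function names are made up for this example.

```python
# Burn-rate arithmetic for an availability SLO.
# Assumption: a 30-day SLO window (not stated in the alert rule).

def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio at which the budget burns `burn_rate` times too fast."""
    return error_budget(slo) * burn_rate


def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    return window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget used after `hours` at `burn_rate`."""
    return burn_rate * hours / (window_days * 24)


if __name__ == "__main__":
    print(f"alert threshold: {error_ratio_threshold(0.999, 24):.2%} errors")
    print(f"budget gone in:  {hours_to_exhaustion(24):.0f} hours")
    print(f"used in 1 hour:  {budget_consumed(24, 1):.1%} of the budget")
```

Under these assumptions a 24x burn corresponds to a 2.4% error ratio, spends roughly 3.3% of the monthly budget per hour, and would exhaust it in about 30 hours.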

The four golden signals guide alerting. Latency, traffic, errors, and saturation cover most user-impacting issues.

Latency alerts catch slowdowns. Response time percentiles exceeding targets indicate problems.

Error rate alerts catch failures. Elevated error rates mean users aren't succeeding.

Traffic alerts catch unusual patterns. Too little traffic might indicate upstream problems. Too much might indicate attacks.

Saturation alerts predict problems. High resource utilization precedes failure. Alert before exhaustion.
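
"Alert before exhaustion" usually means extrapolating a trend. The following is a minimal sketch of that idea using a least-squares line, similar in spirit to PromQL's predict_linear. The function name and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Extrapolate a least-squares line through (hour, used) samples.

    Returns the hours from the last sample until `capacity` is reached,
    or None when usage is flat, shrinking, or there is too little data.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical disk usage in GB, sampled hourly, on a 100 GB volume.
    usage = [(0, 50.0), (1, 60.0), (2, 70.0)]
    print(hours_until_full(usage, capacity=100.0))
```

A saturation alert would then fire when the predicted time to exhaustion falls below the time the team needs to react.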

Setting Effective Thresholds

Baseline from historical data. Normal operation defines what's unusual. Analyze weeks of data before setting thresholds.

Percentile-based thresholds handle variation. Alerting when p95 exceeds 500ms catches real problems. Average-based alerts miss tail latency issues.

# Percentile-based threshold
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 5m
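
To see why averages miss tail latency, compare the mean and the p95 of the same sample. This sketch works on raw, made-up latencies with a nearest-rank percentile. Prometheus estimates quantiles from histogram buckets instead, so treat it as an illustration of the principle only.

```python
import math
from statistics import fmean


def percentile(values, q):
    """Nearest-rank percentile: the value at rank ceil(q * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 slow ones, in seconds.
latencies = [0.1] * 90 + [3.0] * 10

mean_latency = fmean(latencies)
p95_latency = percentile(latencies, 0.95)

if __name__ == "__main__":
    print(f"mean: {mean_latency:.2f}s, p95: {p95_latency:.2f}s")
```

Here the mean stays under a 500ms threshold while one request in ten takes three seconds. Only the percentile-based threshold would catch it.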

Relative thresholds catch anomalies. Traffic at 2x normal is unusual regardless of its absolute value, so express the threshold as a multiple of the baseline rather than a fixed number.

# Relative threshold (2x baseline)
- alert: TrafficAnomaly
  expr: |
    sum(rate(http_requests_total[5m])) >
    2 * avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h])
  for: 10m

Duration requirements prevent flapping. Require conditions to persist before alerting. Brief spikes don't trigger pages.
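
The effect of a `for:` clause can be modelled in a few lines. This is a simplified sketch, not Prometheus's implementation: it assumes one rule evaluation per minute, so `for: 5m` becomes five intervals.

```python
from typing import List, Optional


def firing_states(condition: List[bool], for_evals: int) -> List[bool]:
    """Return, per evaluation, whether the alert is firing.

    The alert fires only once the condition has held for `for_evals`
    intervals since it first became true. Any false evaluation resets it.
    """
    states = []
    active_since: Optional[int] = None
    for i, is_true in enumerate(condition):
        if not is_true:
            active_since = None
            states.append(False)
            continue
        if active_since is None:
            active_since = i
        states.append(i - active_since >= for_evals)
    return states


if __name__ == "__main__":
    spike = [True, True, False, False, False, False, False]
    sustained = [True] * 7
    print(firing_states(spike, 5))      # the brief spike never fires
    print(firing_states(sustained, 5))  # the sustained problem eventually fires
```

The trade-off is detection delay: a longer duration suppresses more flapping but pages later.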

Multi-window alerts reduce noise. Alert only when both short-term and long-term views are bad. Catches sustained problems.

# Multi-window alert
- alert: SustainedErrors
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m])) > 0.01
    ) and (
      sum(rate(http_requests_total{status=~"5.."}[1h])) /
      sum(rate(http_requests_total[1h])) > 0.005
    )
  for: 5m
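
The same two-window logic, reduced to plain arithmetic, shows which situations page and which do not. The request counts below are invented for illustration, and the limits mirror the 1% and 0.5% thresholds in the rule above.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (errors, total requests) within one window


def error_ratio(errors: int, total: int) -> float:
    """Share of failed requests, treating an empty window as healthy."""
    return errors / total if total else 0.0


def sustained_errors(
    short: Counts,
    long: Counts,
    short_limit: float = 0.01,
    long_limit: float = 0.005,
) -> bool:
    """Alert only when both the 5m and the 1h error ratios are too high."""
    return error_ratio(*short) > short_limit and error_ratio(*long) > long_limit


if __name__ == "__main__":
    print(sustained_errors(short=(30, 1000), long=(40, 12000)))   # brief spike
    print(sustained_errors(short=(30, 1000), long=(120, 12000)))  # sustained
    print(sustained_errors(short=(2, 1000), long=(120, 12000)))   # recovered
```

The short window also makes the alert clear quickly after recovery, even while the one-hour ratio is still elevated.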

Test thresholds in staging. Simulate load and failures. Verify alerts fire appropriately.

Alert Routing and Escalation

Route alerts to responsible teams. API alerts go to API team. Database alerts go to DBA team.

Severity levels determine urgency. Critical alerts page immediately. Warnings go to Slack and can create tickets. Info goes to a general channel.

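# Alertmanager routing rules (the critical receiver forwards to PagerDuty)
route:
  receiver: slack-general  # default receiver when no child route matches
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
    - match:
        severity: warning
      receiver: slack-warnings
    - match:
        severity: info
      receiver: slack-general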

Escalation ensures response. Unacknowledged alerts escalate to secondary. Eventually reach management if needed.

# Escalation policy (simplified, PagerDuty-style: each delay is the wait before escalating to the next rule)
escalation_policy:
  escalation_rules:
    - targets: [primary-oncall]
      escalation_delay_in_minutes: 5
    - targets: [secondary-oncall]
      escalation_delay_in_minutes: 10
    - targets: [team-lead]
      escalation_delay_in_minutes: 20
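
It helps to turn the delays into a timeline of who gets notified when. This sketch assumes each delay is the time an unacknowledged alert waits at a rule before moving to the next one. Check your paging tool's semantics before relying on the numbers.

```python
from typing import List, Tuple

# (target, minutes to wait at this rule before escalating), from the policy above
POLICY: List[Tuple[str, int]] = [
    ("primary-oncall", 5),
    ("secondary-oncall", 10),
    ("team-lead", 20),
]


def notification_times(rules: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (target, minute first notified) for an alert nobody acknowledges."""
    timeline = []
    elapsed = 0
    for target, delay in rules:
        timeline.append((target, elapsed))
        elapsed += delay
    return timeline


if __name__ == "__main__":
    for target, minute in notification_times(POLICY):
        print(f"t+{minute:>2} min: notify {target}")
```

Under that reading the team lead hears about an unacknowledged page 15 minutes after it fires. That figure is worth comparing against your acknowledge-time target.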

Time-based routing handles shifts. Route to on-call schedules, not individuals. Schedules rotate automatically.

Business hours awareness adjusts severity. Warning during work hours. Critical after hours for the same condition.
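
The decision itself is small. Below is a sketch of it, assuming business hours of 09:00 to 17:00, Monday to Friday, in the timestamp's own time zone. These hours are an assumption for the example. In Alertmanager the same effect is normally configured with time intervals on routes rather than written as code.

```python
from datetime import datetime

BUSINESS_START_HOUR = 9   # assumed start of the working day
BUSINESS_END_HOUR = 17    # assumed end of the working day (exclusive)


def is_business_hours(when: datetime) -> bool:
    """True on Monday to Friday between the assumed start and end hours."""
    return when.weekday() < 5 and BUSINESS_START_HOUR <= when.hour < BUSINESS_END_HOUR


def effective_severity(when: datetime) -> str:
    """Same condition: warning while the team is at work, critical otherwise."""
    return "warning" if is_business_hours(when) else "critical"


if __name__ == "__main__":
    print(effective_severity(datetime(2024, 6, 3, 10, 0)))  # Monday morning
    print(effective_severity(datetime(2024, 6, 3, 22, 0)))  # Monday night
```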

Reducing Alert Fatigue

Alert fatigue kills response quality. Too many alerts means alerts get ignored. Each unnecessary alert degrades the system.

Group related alerts. Multiple symptoms of one problem create one notification. Reduce noise without losing information.

# Alertmanager grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
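
What `group_by` does to notification volume can be shown with a small model. This sketch only covers the grouping key. The timing settings (group_wait, group_interval, repeat_interval) are not modelled, and the alerts are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert reduced to its label set


def group_alerts(
    alerts: List[Alert], group_by: List[str]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts: three API instances and one billing instance.
ALERTS: List[Alert] = [
    {"alertname": "HighLatency", "service": "api", "instance": "api-1"},
    {"alertname": "HighLatency", "service": "api", "instance": "api-2"},
    {"alertname": "HighLatency", "service": "api", "instance": "api-3"},
    {"alertname": "HighLatency", "service": "billing", "instance": "bill-1"},
]

if __name__ == "__main__":
    for key, members in group_alerts(ALERTS, ["alertname", "service"]).items():
        print(key, "->", len(members), "alert(s) in one notification")
```

Four firing alerts become two notifications. Adding `instance` to the grouping key would bring it back to four.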

Inhibit redundant alerts. If a database is down, don't also alert about API errors caused by the database.

# Inhibition rules
inhibit_rules:
  - source_match:
      alertname: 'DatabaseDown'
    target_match:
      alertname: 'APIErrors'
    equal: ['environment']
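
The `equal` list is the part that is easy to misread: the source only suppresses targets that share those label values. Here is a simplified model of the matching logic, which is not Alertmanager's actual code, with invented alerts.

```python
from typing import Dict, List

Alert = Dict[str, str]

RULE = {
    "source_match": {"alertname": "DatabaseDown"},
    "target_match": {"alertname": "APIErrors"},
    "equal": ["environment"],
}


def matches(alert: Alert, wanted: Dict[str, str]) -> bool:
    """True when the alert carries every wanted label with the wanted value."""
    return all(alert.get(key) == value for key, value in wanted.items())


def is_inhibited(target: Alert, firing: List[Alert], rule: dict) -> bool:
    """True when a firing source alert suppresses the target alert."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        if all(source.get(lbl) == target.get(lbl) for lbl in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    db_down = {"alertname": "DatabaseDown", "environment": "prod"}
    api_prod = {"alertname": "APIErrors", "environment": "prod"}
    api_staging = {"alertname": "APIErrors", "environment": "staging"}
    firing = [db_down, api_prod, api_staging]
    print(is_inhibited(api_prod, firing, RULE))     # suppressed
    print(is_inhibited(api_staging, firing, RULE))  # still notifies
```

Without `equal`, a database outage in production would also hide unrelated API errors in staging.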

Maintenance windows silence expected alerts. Deployments and migrations trigger alerts. Silence during planned work.

Regular alert review removes obsolete alerts. Delete alerts that never fire. Delete alerts that don't require action. Audit quarterly.

Track alert metrics. Alert frequency, time to acknowledge, and false positive rate. Use data to improve.

Notification Channels

Multiple channels ensure delivery. PagerDuty for critical. Slack for warnings. Email for reports.

Critical alerts need push notification. Phone calls and push notifications for pages. Interruptive by design.

# Multi-channel notification
receivers:
  - name: 'critical'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'  # placeholder, keep the real key out of version control
    slack_configs:
      - channel: '#critical-alerts'
  - name: 'warning'
    slack_configs:
      - channel: '#warnings'

Slack integration enables collaboration. Alerts in channels where teams work. Discuss and resolve together.

Rich notifications include context. Links to dashboards. Current metric values. Affected systems.

# Slack message template
slack_configs:
  - channel: '#alerts'
    title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
    text: |
      {{ .CommonAnnotations.summary }}

      *Severity:* {{ .CommonLabels.severity }}
      *Service:* {{ .CommonLabels.service }}

      <{{ .CommonAnnotations.dashboard }}|View Dashboard>
      <{{ .CommonAnnotations.runbook }}|Runbook>

Status pages inform users. Integrate alerts with status page updates. Users know when you know about problems.

Alert Management and Maintenance

Version control alert configurations. Store in Git alongside code. Review changes before deployment.

Test alerts in staging. Verify alerts fire correctly. Catch configuration errors before production.

# Prometheus rule testing
promtool check rules alert_rules.yml
promtool test rules alert_rules_test.yml

Document each alert. What it means. Why it matters. How to investigate. Keep documentation current.

Review alerts after incidents. Did alerts fire? Were they helpful? What was missed? Improve based on experience.

Track alert effectiveness metrics. Mean time to acknowledge. False positive rate. Alert-to-incident ratio.

Metric	Target	Action if Poor
Acknowledge time	< 5 min	Review routing
False positive rate	< 10%	Adjust thresholds
Alerts per incident	1-3	Improve grouping
Actionable rate	> 90%	Remove noise
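
These targets are simple to check once alert outcomes are recorded. The sketch below assumes a hypothetical review log in which each alert is marked as real or not and as acted on or not. The record fields are invented for illustration, and "alerts per incident" is left out because it needs incident data.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List


@dataclass
class AlertRecord:
    minutes_to_ack: float
    real_problem: bool   # False marks a false positive
    action_taken: bool   # True when a human had to do something


def effectiveness(records: List[AlertRecord]) -> Dict[str, float]:
    """Compute the review metrics for a non-empty list of alert records."""
    n = len(records)
    return {
        "mean_ack_minutes": fmean([r.minutes_to_ack for r in records]),
        "false_positive_rate": sum(not r.real_problem for r in records) / n,
        "actionable_rate": sum(r.action_taken for r in records) / n,
    }


def failing_targets(metrics: Dict[str, float]) -> List[str]:
    """Names of metrics that miss the targets from the table above."""
    failing = []
    if metrics["mean_ack_minutes"] >= 5:
        failing.append("mean_ack_minutes")
    if metrics["false_positive_rate"] >= 0.10:
        failing.append("false_positive_rate")
    if metrics["actionable_rate"] <= 0.90:
        failing.append("actionable_rate")
    return failing


if __name__ == "__main__":
    log = [AlertRecord(3, True, True)] * 8 + [AlertRecord(8, False, False)] * 2
    metrics = effectiveness(log)
    print(metrics)
    print("needs work:", failing_targets(metrics))
```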

Scheduled silence for known issues. While investigating known problems, silence related alerts. Focus on new issues.

Runbook automation reduces response time. Link alerts to automated diagnostics. Pre-gather information for responders.
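
A first step toward pre-gathering information is pulling the useful fields out of the notification itself. This sketch reads an Alertmanager webhook payload. The `runbook` and `dashboard` annotation names follow the alert rule earlier in this article, the `service` label is an assumption, and the function name is hypothetical.

```python
from typing import Dict, List


def build_context(payload: dict) -> List[Dict[str, str]]:
    """Collect responder context for every firing alert in a webhook payload."""
    common = payload.get("commonAnnotations", {})
    context = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no diagnostics
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        context.append({
            "alert": labels.get("alertname", "unknown"),
            "service": labels.get("service", "unknown"),
            "summary": annotations.get("summary", ""),
            "runbook": annotations.get("runbook", common.get("runbook", "")),
            "dashboard": annotations.get("dashboard", common.get("dashboard", "")),
            "query": alert.get("generatorURL", ""),
        })
    return context


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "commonAnnotations": {"runbook": "https://wiki.internal/runbooks/api-latency"},
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "HighAPILatency", "service": "api"},
                "annotations": {"summary": "API p95 latency exceeds 500ms"},
                "generatorURL": "",
            }
        ],
    }
    for item in build_context(sample):
        print(item)
```

An automated diagnostic job could take this context as input and attach its findings to the incident before a responder opens it.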

Conclusion

Effective alerting transforms raw monitoring data into timely, actionable notifications that drive incident response. The principles are proven: alert on symptoms (latency, errors, saturation), use SLOs and error budgets as your framework, set percentile-based thresholds with duration requirements, route by severity with clear escalation paths, and relentlessly eliminate noise.

Without proper alerting, your dashboards are just screensavers. With proper alerting, you detect problems before users do, respond with context, and resolve faster. Start with the four golden signals (latency, traffic, errors, saturation), add SLO-based error budget alerts, and implement grouping and inhibition to reduce fatigue. Review and clean up alerts quarterly: stale alerts are dangerous alerts.

FAQs

1. How do I distinguish between critical and warning alerts?

Critical alert if:

Error rate >1%
p95 latency >1s
Service down
User impact imminent or occurring

Pages on-call. Requires immediate action.

Warning alert if:

CPU >80%
Error rate rising but still <0.5%
Potential future issue

Creates ticket, sends to Slack. Can wait until business hours.

Info alert (no action) for:

Deployment completed
Observability data

Use as context, not as an alert.

2. What's a good false positive rate for alerts?

Target <10% false positives. If >20%, engineers ignore alerts ("cry wolf" effect). Common causes:

Cause	Fix
Thresholds too sensitive	Adjust thresholds based on baseline data
Duration too short	Add `for: 5m` duration requirement
Missing maintenance windows	Silence during known maintenance (deployments, batch jobs)

3. How do I test alerts without affecting production?

Three methods:

Prometheus rule testing – promtool test rules with mock time series.
Staging environment – replicate production metrics, trigger conditions, verify notifications.
Synthetic monitoring – run test transactions that deliberately trigger alert conditions (e.g., force 5xx errors on test endpoint).

For PagerDuty, verify routing by sending a test event to a dedicated test service instead of the production one, so no real incident is opened. Never test with real production pages – resolve test events as soon as you have confirmed they arrived.
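
One way to exercise paging without leaving an open incident is to send a trigger event followed by a resolve event that shares its dedup key. The sketch below only builds the two payloads in the shape of PagerDuty's Events API v2 and does not send anything. The routing key placeholder, summary text, and function name are assumptions for illustration, and the key should belong to a test service.

```python
import json
import uuid
from typing import Dict, Tuple


def build_routing_test_events(
    routing_key: str,
    summary: str = "TEST: alert routing check, no action needed",
    source: str = "alert-routing-test",
) -> Tuple[Dict, Dict]:
    """Return a (trigger, resolve) payload pair sharing one dedup key."""
    dedup_key = f"routing-test-{uuid.uuid4()}"
    trigger = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {"summary": summary, "source": source, "severity": "info"},
    }
    resolve = {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }
    return trigger, resolve


if __name__ == "__main__":
    # Placeholder key: use the integration key of a dedicated test service.
    trigger, resolve = build_routing_test_events("<test-service-integration-key>")
    print(json.dumps(trigger, indent=2))
    print(json.dumps(resolve, indent=2))
```

Posting the trigger and then the resolve payload to the Events API endpoint with any HTTP client closes the loop. Keeping the send step separate makes it harder to page the real rotation by accident.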

Top comments (1)

Ron • Jun 3

Nice article and an important topic. Technology can help, and alert fatigue is real.

Version v0.32 or later of Alertmanager supports dynamic parameters for webhook payloads. This makes it easy for alerting tools like SIGNL4 to provide the information an on-call engineer needs to resolve an issue.