The Alert Fatigue Problem
Cloud systems generate enormous volumes of telemetry. Every Lambda invocation, RDS query, and API Gateway call produces data. Teams new to the cloud often want to alert on everything: if a metric exists, someone wants a notification for it. The result is an alert volume that overwhelms the team and buries what actually matters. Instead of protecting engineers' attention, the alerting system consumes it.
Alert fatigue is not a soft problem. It is a direct cause of production incidents being missed or escalated too slowly. When an on-call engineer receives 200 notifications on a quiet night, and 50 of them fire routinely without action, the signal-to-noise ratio collapses. The 201st notification — the one that actually matters — gets lost.
The goal of this guide is to reframe how you think about alerting. Rather than asking "what should we alarm on?", start with "what conditions require immediate human intervention?" Everything else can wait for a dashboard review, a weekly metric review, or be surfaced as a log insight.
CloudWatch Alarms — Fundamentals
Every CloudWatch Alarm is made up of three parts:
Metric: the data you are keeping an eye on.
Condition: the rule that triggers the alarm, either a static threshold or a band derived from anomaly detection.
Action: what happens when the alarm changes state, such as sending an SNS notification or triggering an auto-scaling policy.
Think of alarms as state machines: you can attach actions to any state transition, not just to a value crossing a fixed limit.
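As a sketch of how the three parts map onto CloudWatch's PutMetricAlarm API, here is a plain parameter dict (the alarm name, function name, and SNS topic ARNs are hypothetical placeholders):

```python
# The three parts of an alarm mapped onto PutMetricAlarm parameters.
# All names and ARNs below are illustrative, not real resources.
alarm_params = {
    # Metric: the data being watched
    "Namespace": "AWS/Lambda",
    "MetricName": "Errors",
    "Dimensions": [{"Name": "FunctionName", "Value": "checkout-handler"}],
    "Statistic": "Sum",
    "Period": 60,
    # Condition: the rule that moves the alarm into ALARM state
    "Threshold": 5,
    "ComparisonOperator": "GreaterThanThreshold",
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,  # M-of-N: 3 of the last 5 periods must breach
    # Action: what happens on a state change
    "AlarmName": "checkout-handler-errors",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-page"],
    "OKActions": ["arn:aws:sns:us-east-1:123456789012:oncall-page"],
}
# Creating it would be one call:
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Note that OKActions reuses the same topic, so the team also gets an automated all-clear when the alarm recovers.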
Alarm States Explained
It is critical to understand the three alarm states: OK (the condition is not met), ALARM (the condition is met), and INSUFFICIENT_DATA (not enough data points to evaluate). Engineers who don't internalize this state machine get surprised during deployments and missing-data windows. The TreatMissingData setting controls what a gap in the data means: it can count as breaching, count as not breaching, be ignored (keeping the current state), or push the alarm into INSUFFICIENT_DATA.
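As a simplified sketch (the real evaluator works over M-of-N windows, not single points), here is how one datapoint and the TreatMissingData setting map to the three states:

```python
def next_state(datapoint, threshold, treat_missing="missing"):
    """Illustrative only, not AWS's exact algorithm: decide the next
    alarm state from a single datapoint, honoring TreatMissingData."""
    if datapoint is None:  # the period produced no data
        return {
            "breaching": "ALARM",            # gap counts against you
            "notBreaching": "OK",            # gap counts as healthy
            "ignore": "KEEP_CURRENT",        # retain the previous state
            "missing": "INSUFFICIENT_DATA",  # the default behavior
        }[treat_missing]
    return "ALARM" if datapoint > threshold else "OK"
```

For example, a Lambda that simply isn't invoked overnight produces missing data; with the default setting its error alarm drifts into INSUFFICIENT_DATA, while "notBreaching" would keep it quietly in OK.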
Choosing the Right Metrics
Avoid Pure Resource Utilization Alarms: Having your CPU at 90% isn’t a problem on its own. It only becomes an issue if it’s happening alongside something users notice, like slow response times. To handle this better, you can use Composite Alarms to make sure alerts only go off when multiple signals show there’s a real problem.
Custom Metrics via EMF: use the Embedded Metric Format to publish detailed application metrics from your Lambda functions as structured JSON log lines. CloudWatch extracts the metrics automatically, with no extra API-call cost on the hot path. With the fundamentals in place, the rest of this guide covers the strategies that keep alerting from becoming a nightmare.
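A minimal sketch of building an EMF record with only the standard library (the namespace, metric, and dimension names are made up for illustration); in Lambda, printing this line to stdout is enough for CloudWatch to pick the metric up:

```python
import json
import time

def emf_record(namespace, metric, value, unit="Milliseconds", **dims):
    """Build one Embedded Metric Format log line as a JSON string."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dims)],  # one dimension set
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        metric: value,  # metric value lives at the top level
        **dims,         # so do the dimension values
    })

# In a Lambda handler this print is the entire "publish" step:
print(emf_record("Checkout", "PaymentLatency", 182.4, Service="payments"))
```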
Thresholds & Evaluation Periods
The M-of-N Pattern: An alarm should go off only when several of the recent data points (M out of N) cross the limit. Don’t trigger an alarm just because 1 out of 1 data point did, unless it’s a clear failure, like having zero healthy hosts.
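The M-of-N evaluation itself is simple enough to sketch in a few lines (CloudWatch does this for you via DatapointsToAlarm and EvaluationPeriods; this is just the idea):

```python
def should_alarm(datapoints, threshold, m=3, n=5):
    """M-of-N: fire only if at least m of the last n datapoints breach."""
    window = datapoints[-n:]
    return sum(1 for d in window if d > threshold) >= m

# One transient spike does not page anyone...
assert not should_alarm([10, 95, 12, 11, 9], threshold=80)
# ...but a sustained breach does.
assert should_alarm([85, 95, 12, 90, 91], threshold=80)
```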
Set Thresholds from Baselines: observe 2–4 weeks of normal operation, then set thresholds at a meaningful distance above your p95/p99. Avoid round-number intuition like "80% feels high."
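One way to make "a meaningful distance above p99" concrete is to compute the percentile from observed samples and add a fixed headroom factor (the 1.25 here is an assumed starting point to tune, not a rule):

```python
import statistics

def baseline_threshold(samples, quantile=0.99, headroom=1.25):
    """Derive an alarm threshold from observed data: take the requested
    percentile of the baseline period and add headroom above it."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    p = cuts[int(quantile * 100) - 1]            # e.g. index 98 -> p99
    return p * headroom

# e.g. two weeks of latency samples in milliseconds:
latencies = list(range(1, 101))  # stand-in data for the example
print(baseline_threshold(latencies))
```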
Composite Alarms
Composite Alarms combine several child alarms with AND, OR, and NOT logic, so the parent alarm fires only when a specific combination of conditions holds. This keeps any single noisy signal from paging anyone on its own, and focuses alerts on real, correlated problems.
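A composite alarm's logic lives in its AlarmRule expression. A sketch with hypothetical child alarm names, combining the "CPU plus user impact" idea from earlier with a deployment suppressor:

```python
# Page only when high CPU coincides with user-visible latency,
# and never while a deployment is in progress. Alarm names are
# placeholders for alarms you would have created separately.
alarm_rule = (
    'ALARM("api-high-cpu") '
    'AND ALARM("api-p99-latency") '
    'AND NOT ALARM("deployment-in-progress")'
)
# Creating the parent would be one call:
# boto3.client("cloudwatch").put_composite_alarm(
#     AlarmName="api-user-impact", AlarmRule=alarm_rule, ...)
print(alarm_rule)
```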
Anomaly Detection Alarms
For metrics with natural cycles, such as higher traffic on weekdays and a lull at 3 a.m., fixed thresholds either fire too often during normal peaks or miss real issues during quiet periods.
CloudWatch Anomaly Detection utilizes machine learning to recognize these patterns, including time of day and day of week, and alerts you only when the metric exceeds the expected range. It requires at least 14 days of data to create an effective model, so start with static thresholds on new services and then switch to anomaly detection once sufficient data is collected.
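An anomaly detection alarm references a band expression instead of a static threshold. A parameter sketch (namespace and metric name are hypothetical) showing the shape such an alarm takes:

```python
# An alarm that fires when order volume leaves the band the model
# learned, in either direction. "Shop"/"OrdersPlaced" are made up.
anomaly_alarm = {
    "AlarmName": "orders-anomalous-volume",
    "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
    "EvaluationPeriods": 3,
    "ThresholdMetricId": "band",  # compare m1 against the band below
    "Metrics": [
        {"Id": "m1",
         "MetricStat": {"Metric": {"Namespace": "Shop",
                                   "MetricName": "OrdersPlaced"},
                        "Period": 300, "Stat": "Sum"}},
        # Band width of 2 standard deviations around the expected value
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
}
```

The second argument to ANOMALY_DETECTION_BAND widens or narrows the band; a larger value tolerates more deviation before alarming.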
Best Practices Checklist
Use this checklist when reviewing any CloudWatch Alarm — whether newly created or inherited. Every alarm that cannot satisfy these criteria is a candidate for deletion or rework.
- [x] Alarm is actionable — the engineer knows what to do immediately
- [x] Metric correlates directly with user-facing impact
- [x] Threshold set from observed baseline data, not intuition
- [x] M-of-N evaluation configured (minimum 3 of 5 for most metrics)
- [x] TreatMissingData is explicitly configured
- [x] OKAction defined — team gets an automated all-clear
- [x] Correct priority tier / SNS topic assigned
- [x] Runbook URL in alarm description
- [x] Defined in Terraform or CloudFormation — not the console
- [x] Reviewed and tested in the last 90 days
Summary
Designing alerts that matter takes focus and discipline. CloudWatch offers powerful features such as anomaly detection, composite alarms, and metric math, but the goal isn't to use everything everywhere. Instead, pick the few alarms that give your team a clear, actionable signal without drowning it in noise.
Start with the Four Golden Signals. Use composite alarms to cut false positives. Apply anomaly detection to metrics with seasonal patterns. Route alerts to the right priority tier and link a runbook in every alarm description. Define alerts as code so they are reviewable and reproducible. And prune your alarms regularly; stale alarms do more harm than good.
A well-designed alert means your on-call engineer gets a clear alert at 2 AM, knows exactly what’s wrong, where to check, and who to call. That’s what a good alerting system looks like.