Cold truth: problems always show up in logs first. The trick is turning those “uh-oh” lines into a nudge in your inbox before users feel it.
Here’s the dead-simple pattern I use in AWS:
CloudWatch Logs → Metric Filter → Alarm → SNS (Email/Slack)
No new services to run. No extra agents. Just wiring.
Why this works
Think of CloudWatch Logs as a river. Metric filters are little nets you drop in: “catch anything that looks like ERROR” or “grab JSON where level=ERROR and service=payments.” Each catch bumps a metric. Alarms watch that metric and, boom: email, Slack, PagerDuty, whatever you like.
Cheap. Fast. No app changes.
App → CloudWatch Logs ──(metric filter)──▶ Metric
                                              │
                                              └──▶ Alarm ──▶ SNS ──▶ Email/Slack
Step 1: create an SNS topic (so you get alerted)
aws sns create-topic --name app-alarms
# copy the "TopicArn" from the output
TOPIC_ARN="arn:aws:sns:REGION:ACCOUNT_ID:app-alarms"
# subscribe your email (confirm the email to activate)
aws sns subscribe \
  --topic-arn "$TOPIC_ARN" \
  --protocol email \
  --notification-endpoint you@example.com
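The subscription sits in “pending confirmation” until you click the link in that email. A quick way to check where it stands:
aws sns list-subscriptions-by-topic --topic-arn "$TOPIC_ARN"
# "SubscriptionArn": "PendingConfirmation" means you haven't clicked the link yet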
Step 2: add a metric filter to your log group
Option A — simple keyword (“ERROR” but not health checks):
LOG_GROUP="/aws/lambda/my-fn"
aws logs put-metric-filter \
  --log-group-name "$LOG_GROUP" \
  --filter-name "ErrorCount" \
  --filter-pattern '"ERROR" -HealthCheck' \
  --metric-transformations \
    metricName=ErrorCount,metricNamespace="App/Alerts",metricValue=1,defaultValue=0
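Not sure a pattern does what you think? You can dry-run it against made-up log lines (the two sample messages below are just placeholders) without touching any log group:
aws logs test-metric-filter \
  --filter-pattern '"ERROR" -HealthCheck' \
  --log-event-messages \
    "2025-01-01T12:00:00 ERROR payment gateway timeout" \
    "2025-01-01T12:00:05 INFO HealthCheck ok"
# the first message should show up under "matches"; the second shouldn't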
Option B — structured JSON logs (recommended):
aws logs put-metric-filter \
  --log-group-name "$LOG_GROUP" \
  --filter-name "PaymentsErrors" \
  --filter-pattern '{ $.level = "ERROR" && $.service = "payments" }' \
  --metric-transformations \
    metricName=PaymentsErrorCount,metricNamespace="App/Alerts",metricValue=1
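For that pattern to count anything, your app just has to emit one JSON object per log line with those two fields. Something like this (message and requestId are only illustrative; the filter ignores them):
{"level": "ERROR", "service": "payments", "message": "charge declined", "requestId": "abc-123"}
The same test-metric-filter dry run shown under Option A works for JSON patterns too.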
Step 3: create an alarm on that metric
Alert if we see ≥ 1 error per minute for 3 minutes:
aws cloudwatch put-metric-alarm \
  --alarm-name "LambdaErrorBurst" \
  --metric-name ErrorCount \
  --namespace "App/Alerts" \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions "$TOPIC_ARN" \
  --ok-actions "$TOPIC_ARN"
That treat-missing-data=notBreaching bit treats quiet periods (no matching log events at all) as healthy, so the alarm doesn’t bounce through INSUFFICIENT_DATA and send you spurious “we’re fine!” notifications every time traffic goes quiet and picks back up.
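Worth a quick sanity check that the alarm landed and seeing what state it’s in (freshly created alarms usually sit in INSUFFICIENT_DATA for a bit):
aws cloudwatch describe-alarms --alarm-names "LambdaErrorBurst"
# look at "StateValue"; it settles to OK after a few quiet minutes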
Step 4: test it (don’t skip this)
1. Log an ERROR that matches your filter.
2. In CloudWatch Metrics → App/Alerts, make sure the metric ticks up.
3. Watch the alarm flip to ALARM and check your email.
If nothing happens, go to your Log Group → Metric filters → Test pattern and paste a real log line. It’ll tell you if your pattern matches.
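You can also test just the alarm → SNS → email leg without touching the app: force the alarm state by hand, and CloudWatch will flip it back on the next evaluation (expect an ALARM email, then an OK email since ok-actions is set).
aws cloudwatch set-alarm-state \
  --alarm-name "LambdaErrorBurst" \
  --state-value ALARM \
  --state-reason "manual test of the notification path"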
Prefer Terraform? Here’s the whole thing
resource "aws_sns_topic" "app_alarms" {
name = "app-alarms"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.app_alarms.arn
protocol = "email"
endpoint = "you@example.com"
}
resource "aws_cloudwatch_log_metric_filter" "errors" {
name = "ErrorCount"
log_group_name = "/aws/lambda/my-fn"
pattern = "\"ERROR\" -HealthCheck"
metric_transformation {
name = "ErrorCount"
namespace = "App/Alerts"
value = "1"
default_value = "0"
}
}
resource "aws_cloudwatch_metric_alarm" "error_alarm" {
alarm_name = "LambdaErrorBurst"
namespace = "App/Alerts"
metric_name = "ErrorCount"
statistic = "Sum"
period = 60
evaluation_periods = 3
threshold = 1
comparison_operator = "GreaterThanOrEqualToThreshold"
treat_missing_data = "notBreaching"
alarm_actions = [aws_sns_topic.app_alarms.arn]
ok_actions = [aws_sns_topic.app_alarms.arn]
}
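Apply it like any other stack; the one thing Terraform can’t do for you is click the confirmation link in the subscription email.
terraform init
terraform apply
# then confirm the SNS email subscription from your inbox, same as the CLI route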
Common gotchas (learned the hard way)
• Case matters. ERROR ≠ error. Match what you actually log.
• Per-line matching. Filters look at one log event (usually one line) at a time. If your stack trace spans multiple lines, rely on a JSON level field instead.
• Right account/region. Metric filters must live with the log group.
• Don’t explode cardinality. Keep one metric per signal; don’t bake IDs into metric names.
• No alerts during quiet times. That treat-missing-data setting is your chill pill.
Variations you’ll probably want
• Slack/Teams: SNS → Lambda → Slack webhook (or AWS Chatbot if you want point & click).
• PagerDuty/Opsgenie: subscribe their HTTPS endpoint to the SNS topic, or route alarm state changes through EventBridge into your incident tool.
• Smarter thresholds: try Anomaly Detection alarms once you have baseline traffic.
• Composite alarms: “only alert if errors spike and p50 latency is ugly” (sketch below).
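A minimal sketch of that last one, assuming you’ve already built a separate latency alarm (ErrorsAndSlow and HighP50Latency are placeholder names; LambdaErrorBurst is the alarm from Step 3):
aws cloudwatch put-composite-alarm \
  --alarm-name "ErrorsAndSlow" \
  --alarm-rule 'ALARM("LambdaErrorBurst") AND ALARM("HighP50Latency")' \
  --alarm-actions "$TOPIC_ARN"
# fires only when both child alarms are in ALARM at the same time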
You don’t need a massive observability rebuild to get useful alerts. Start with one or two high-signal patterns (timeouts, 5xx, “payment failed”), wire them to email, and iterate.
Tiny effort. Big safety net.