
Wayne Workman

Originally published at wayne.theworkmans.us

AI-Powered AWS CloudWatch Alarm Triage Terraform Module

This module was built entirely on my own time using my personal laptop, personal AWS accounts, and personal funds (RIP my credit card). All development and testing was done independently using my own resources.

Just released v0.2.1 of my Terraform module that does autonomous CloudWatch Alarm triage using AWS Bedrock, and I've got to say, it's been a real eye-opener seeing it in action. The module essentially gives your CloudWatch alarms the ability to investigate themselves when they go off.

Here's the thing about production incidents: when an alarm fires at 2am, whoever gets paged is coming in cold. They're scrambling to understand what's happening, digging through logs, checking metrics, trying to piece together the story while the clock ticks and stakeholders start asking questions. By the time a war room forms, precious minutes have already slipped by. And let's not forget those times when multiple things are failing at once and you're trying to figure out what the actual root cause is versus what's just cascading failure from something else entirely.

Example Triage Report

Here's an actual triage report from the module's demo (with sensitive info redacted):

CloudWatch Alarm Investigation Results
======================================

Alarm: triage-demo-test-lambda-errors
State: ALARM
Model: us.amazon.nova-premier-v1:0
Model Calls: 8
Tool Calls: 7
Region: us-east-2
Account: <redacted>
Console: https://console.aws.amazon.com/cloudwatch/home?region=us-east-2#alarmsV2:alarm/triage-demo-test-lambda-errors

Investigation Files:
- Report: s3://<redacted>/reports/2025/09/01/20250901_170608_UTC_triage-demo-test-lambda-errors_report.txt
- Full Context: s3://<redacted>/reports/2025/09/01/20250901_170608_UTC_triage-demo-test-lambda-errors_full_context.txt

--------------------------------------

### 🚨 EXECUTIVE SUMMARY

The CloudWatch alarm triggered due to intentional test failures in a Lambda function designed to demonstrate error handling. The function lacks EC2 permissions by design. No production impact detected - this is a controlled test environment.

### 🔍 INVESTIGATION DETAILS

#### Commands Executed:
1. Retrieved Lambda function configuration details
2. Analyzed CloudWatch Logs for error messages (last 30 mins)
3. Verified IAM role permissions for the Lambda execution role
4. Examined CloudWatch metrics trends (past 6 hours)

#### Key Findings:
- Lambda function modified 1.5 hours before alarm (15:42 UTC)
- Logs show explicit "EXPECTED FAILURE" messages about missing EC2 permissions
- IAM role has only basic logging permissions (no EC2 access)
- Metrics show consistent 15-error spikes every 5 minutes since 15:45 UTC
- Environment variable confirms this is a "Demo - Intentional failure"

### 📊 ROOT CAUSE ANALYSIS

The alarm triggered because the Lambda function "triage-demo-test-ec2-lister" is intentionally failing due to missing ec2:DescribeInstances permissions. This matches the environment variable stating "Demo - Intentional failure for triage testing". The IAM role is properly restricted for testing purposes. The error pattern began immediately after the last deployment (15:42 UTC) and shows consistent test-induced failures.

### 💥 IMPACT ASSESSMENT

- **Affected Resources**: Lambda function triage-demo-test-ec2-lister
- **Business Impact**: None - isolated test environment
- **Severity Level**: Low (test failure by design)
- **Users Affected**: 0 (no production impact)

### 🔧 IMMEDIATE ACTIONS

1. Acknowledge the alarm as valid test behavior
2. No remediation needed - test is functioning as intended

### 🛡️ PREVENTION MEASURES

- Ensure test alarms are properly labeled to avoid confusion
- Maintain separation between test and production monitoring configurations
- Document test scenarios in runbooks

### 📈 MONITORING RECOMMENDATIONS

- Add alarm annotations for known test failures
- Consider namespace segregation for test resources
- Review alarm thresholds if test patterns change

### 📝 ADDITIONAL NOTES

This was a controlled test validation. Verify all test resources are properly tagged and isolated from production. Review test failure injection mechanisms to prevent accidental production impact. The alarm worked correctly for its intended test purpose.

--------------------------------------

This investigation was performed automatically by AWS Bedrock using AWS API calls. For questions or improvements, please contact your CloudWatch Alarm Triage administrator.

What This Module Actually Does

When a CloudWatch alarm triggers, this module automatically spins up an AI investigation using AWS Bedrock's Claude Opus 4.1 model running in agent mode. The AI doesn't just look at the alarm itself; it actively investigates using Python code with boto3 to query CloudWatch Logs, check IAM permissions, review recent CloudTrail events, and analyze metric trends: basically everything an engineer would do when they first get paged, except it happens automatically and lands in your inbox before you've even opened your laptop.
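
To make that concrete, here's the flavor of read-only boto3 queries an investigation like this boils down to. This is a minimal sketch with made-up resource names, not the module's actual code:

```python
# Illustrative sketch of the kind of read-only boto3 calls the AI drives
# during an investigation. Resource names here are placeholders.
import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client("logs")
iam = boto3.client("iam")
cloudwatch = boto3.client("cloudwatch")
cloudtrail = boto3.client("cloudtrail")

function_name = "triage-demo-test-ec2-lister"  # hypothetical target from the demo
now = datetime.now(timezone.utc)

# Recent error messages from the function's log group.
errors = logs.filter_log_events(
    logGroupName=f"/aws/lambda/{function_name}",
    startTime=int((now - timedelta(minutes=30)).timestamp() * 1000),
    filterPattern="ERROR",
    limit=50,
)

# Which managed policies the execution role actually has attached.
policies = iam.list_attached_role_policies(RoleName=f"{function_name}-role")

# Recent CloudTrail events touching the function (e.g. a fresh deployment).
trail = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "ResourceName", "AttributeValue": function_name}],
    MaxResults=10,
)

# Error metric trend over the past six hours.
metrics = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": function_name}],
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)

print(len(errors["events"]), len(trail["Events"]), len(metrics["Datapoints"]))
```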

The investigation report isn't some generic "the alarm triggered" notification either. You get actual root cause analysis, like "Lambda function prod-api-handler experiencing 100% error rate due to missing DynamoDB table permissions, immediate action required: Add dynamodb:GetItem permission to Lambda role". The AI shows its work too: all the Python code it ran, what it found, and why it thinks that's the problem.

The Technical Bits

The module uses a two-Lambda architecture. The orchestrator Lambda receives the CloudWatch alarm event and calls Bedrock, while a separate tool Lambda executes the Python code with read-only AWS permissions. This separation keeps things secure: the tool Lambda can investigate but can't modify anything or access secrets.
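
For a rough mental model of what that looks like with the Bedrock Converse API, here's a hedged sketch of an orchestrator loop that hands the model's tool calls to a separate tool Lambda. The tool name, prompt, and function name are my own placeholders, not the module's actual implementation:

```python
# Sketch of an "agent mode" loop: the model asks for tool calls, the
# orchestrator forwards them to a sandboxed tool Lambda, and the results go
# back to the model until it produces a final report. Names are placeholders.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
lambda_client = boto3.client("lambda")

TOOL_CONFIG = {
    "tools": [{
        "toolSpec": {
            "name": "execute_python",
            "description": "Run read-only boto3 code and return its output.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            }},
        }
    }]
}

def investigate(alarm_event, model_id="us.amazon.nova-premier-v1:0"):
    messages = [{"role": "user", "content": [
        {"text": "Investigate this CloudWatch alarm and report the root cause:\n"
                 + json.dumps(alarm_event)}
    ]}]
    while True:
        response = bedrock.converse(modelId=model_id, messages=messages,
                                    toolConfig=TOOL_CONFIG)
        message = response["output"]["message"]
        messages.append(message)
        if response["stopReason"] != "tool_use":
            return message  # final report from the model
        results = []
        for block in message["content"]:
            if "toolUse" not in block:
                continue
            tool_use = block["toolUse"]
            # The tool Lambda runs the code with read-only permissions.
            invoked = lambda_client.invoke(
                FunctionName="alarm-triage-tool",  # hypothetical tool Lambda
                Payload=json.dumps(tool_use["input"]).encode(),
            )
            results.append({"toolResult": {
                "toolUseId": tool_use["toolUseId"],
                "content": [{"text": invoked["Payload"].read().decode()}],
            }})
        messages.append({"role": "user", "content": results})
```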

I've been really careful about costs and noise reduction too. There's a DynamoDB deduplication mechanism that prevents the same alarm from being investigated more than once per configurable time window (default is 24 hours), because nobody wants their inbox flooded during an incident or surprise AWS bills from runaway AI invocations. Leadership definitely appreciates when cost controls are designed in from the start rather than bolted on after someone gets a shocking invoice.
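
The dedup idea itself is simple: a conditional DynamoDB write that only succeeds if no un-expired record exists for that alarm. Here's a minimal sketch assuming a table keyed on the alarm name (the module's real schema may differ):

```python
# Minimal deduplication sketch: a conditional put plus a TTL attribute.
# Table and attribute names are assumptions for illustration only.
import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("alarm-triage-dedup")  # hypothetical table
WINDOW_SECONDS = 24 * 60 * 60  # configurable window, 24h default

def should_investigate(alarm_name: str) -> bool:
    now = int(time.time())
    try:
        # Succeeds only if there is no record, or the existing one has expired.
        table.put_item(
            Item={
                "alarm_name": alarm_name,
                "investigated_at": now,
                "expires_at": now + WINDOW_SECONDS,  # DynamoDB TTL attribute
            },
            ConditionExpression="attribute_not_exists(alarm_name) OR expires_at < :now",
            ExpressionAttributeValues={":now": now},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already investigated within the window
        raise
```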

The module uses Claude Opus 4.1 specifically because when production is down, you want the best possible triage, not the cheapest. Alarms shouldn't be going off constantly anyway, and when they do, having comprehensive, accurate analysis is worth the extra few cents. That said, I did include support for Amazon Nova Premier if you want to optimize costs.

Real World Impact

Looking at that triage report above, you can see exactly what I mean. The AI identified that the Lambda was failing due to missing EC2 permissions, showed the exact error messages from CloudWatch Logs, verified the IAM role configuration, and provided specific remediation steps. All of this analysis was completed and emailed out within a few minutes of the alarm triggering.

What really strikes me is how this changes the whole incident response dynamic: engineers aren't coming in blind anymore. They have context before they even join the call, which means faster resolution times and honestly just less stress all around when things go sideways.

Getting Started

The module is open source and available on GitHub. Installation is pretty straightforward if you're already using Terraform:

module "alarm_triage" {
  source = "github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage"

  sns_topic_arn = aws_sns_topic.alarm_notifications.arn

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

Then just add the module's Lambda ARN to your CloudWatch alarm actions and you're good to go.
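
If some of your alarms are created outside Terraform, the same wiring works from boto3, since CloudWatch alarms can invoke a Lambda function directly as an alarm action. The ARN and alarm settings below are placeholders, not values from the module:

```python
# Hypothetical example of pointing an existing alarm at the triage Lambda.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder ARN; in practice this comes from the module's deployed orchestrator.
TRIAGE_LAMBDA_ARN = "arn:aws:lambda:us-east-2:111122223333:function:alarm-triage-orchestrator"

cloudwatch.put_metric_alarm(
    AlarmName="prod-api-handler-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "prod-api-handler"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[TRIAGE_LAMBDA_ARN],  # Lambda ARNs are valid alarm actions
)
```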

Some Thoughts

Building this has really driven home how much time we waste on initial triage during incidents. The cognitive load of context switching, especially during off-hours, is brutal. Having an AI do that initial investigation and serve it up on a silver platter isn't replacing engineers; it's giving them a head start so they can focus on actually fixing the problem rather than figuring out what the problem is.

The module saves all investigation reports to S3 too, which creates this interesting side effect where you build up a searchable history of every production issue and how it was investigated. Pretty handy for postmortems or spotting patterns over time.
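
Judging from the key layout in the demo report (reports/YYYY/MM/DD/...), pulling a day's worth of reports for a postmortem is just a prefix listing. A tiny sketch with an assumed bucket name:

```python
# Illustrative only: list one day's triage reports by their date-based prefix.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="my-triage-reports-bucket",  # placeholder bucket name
    Prefix="reports/2025/09/01/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```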

Anyway, if you're dealing with CloudWatch alarms and want to reduce your MTTR, give it a shot. The GitHub repo has full documentation, examples, and even a demo you can deploy to see it in action. Would love to hear if others find it as useful as I have.

Check out the project at: github.com/wayneworkman/terraform-aws-module-cloudwatch-alarm-triage
