The Problem
I run about 200 Lambda functions and a few Glue jobs in production. Every couple of weeks, something breaks. I get the CloudWatch alarm email instantly -- "Error count: 12." Great, I know something is wrong. But that is where the easy part ends.
Now I have to figure out WHAT went wrong. That is the part that takes 30-45 minutes:
- Open CloudWatch, find the right log group out of 200+
- Scroll through log streams looking for the actual error message
- Read the stack trace, try to understand what happened
- Check if someone deployed recently (open Lambda console, compare timestamps)
- Check if other services are also broken
- Google the error, figure out the fix
- Tell the team what happened
The alert is instant. The investigation is not. Every incident still costs me 30-45 minutes of reading logs and connecting dots.
I got tired of that part. The error is sitting right there in the logs. The deploy timestamp is in the Lambda config. The traffic metrics are in CloudWatch. All the information exists. I just need something to read the logs, connect the dots, and tell me the answer.
So I built a system that does exactly that. It catches the error, gathers all the context automatically, sends it to Amazon Bedrock, and emails me the root cause with the fix. The investigation that took me 30-45 minutes now takes 5 seconds.
What This Project Does
AI Ops Sentinel watches all your Lambda functions, Glue jobs, ECS tasks -- everything that writes to CloudWatch Logs. When something errors:
- It catches the error instantly (push model, not polling)
- It strips any credentials or PII from the logs (using Bedrock Guardrails)
- It checks if this function was recently deployed
- It checks how much traffic is affected right now
- It checks if other services broke at the same time
- It checks if this same error happened before
- It sends all of that context to Bedrock for root cause analysis
- It emails you the diagnosis with the fix
One deploy covers your entire account. New functions you create next week are automatically included. No per-function setup.
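The recent-deploy check from the list above comes down to a timestamp comparison. Here is a minimal sketch, not the project's actual code. It assumes the deploy time is read from the `LastModified` string that Lambda reports for a function configuration (for example `2025-06-19T08:53:00.000+0000`). The helper names and the 30-minute window are hypothetical.

```python
from datetime import datetime, timezone

# Format of the LastModified string Lambda reports, e.g. "2025-06-19T08:53:00.000+0000".
LAST_MODIFIED_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def minutes_since_deploy(last_modified: str, error_time: datetime) -> float:
    """Minutes between the last deploy and the error (hypothetical helper)."""
    deployed_at = datetime.strptime(last_modified, LAST_MODIFIED_FORMAT)
    return (error_time - deployed_at).total_seconds() / 60


def is_recent_deploy(last_modified: str, error_time: datetime, window_minutes: int = 30) -> bool:
    """True when the error happened within window_minutes after the deploy.

    The 30-minute default is an assumption, not a value from the project.
    """
    elapsed = minutes_since_deploy(last_modified, error_time)
    return 0 <= elapsed <= window_minutes


error_time = datetime(2025, 6, 19, 9, 5, tzinfo=timezone.utc)
print(minutes_since_deploy("2025-06-19T08:53:00.000+0000", error_time))  # 12.0
print(is_recent_deploy("2025-06-19T08:53:00.000+0000", error_time))      # True
```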
Why Not Just Use CloudWatch Alarm + SNS
Everyone already has that. A CloudWatch alarm fires when error count goes above zero and sends an SNS email. Here is what that email gives you:
Subject: ALARM: payment-processor-errors
StateChangeTime: 2025-06-19T09:05:03Z
NewStateValue: ALARM
Threshold: 1
That is it. You still have to do all the manual investigation yourself.
Here is what AI Ops Sentinel sends for the same error -- the root cause, the fix, whether it was a recent deploy, how many users are affected, and CLI commands to investigate. All in one email, 5 seconds later.
The alarm tells you there is a fire. This tells you which room, what started it, and hands you the extinguisher.
The Architecture
Here is how the whole system fits together:
The flow is simple:
All your services (Lambda, Glue, ECS, anything) write logs to CloudWatch. For Lambda and Glue this happens by default; ECS tasks need the awslogs log driver in their task definition.
One account-level subscription filter watches ALL log groups. When any log event matches the error pattern (ERROR, Task timed out, OOM, missing modules), CloudWatch pushes it to the Analyzer Lambda right away. This is a push, not a poll, so there is no polling interval to wait for.
The Analyzer Lambda is the brain. It does everything: decodes the log event, identifies which service it came from, runs it through Bedrock Guardrails to strip credentials, gathers context (deploy history, blast radius, past incidents, correlated failures), sends everything to Bedrock Nova Lite for diagnosis, saves the incident to DynamoDB, and sends the alert via SNS.
You get an email about 5 seconds after the error happened.
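The first step of the Analyzer, decoding the log event, can be sketched as follows. CloudWatch Logs delivers subscription data to Lambda as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The `service_from_log_group` helper and its prefix rules are hypothetical, not taken from the project. ECS log group names in particular are user-defined.

```python
import base64
import gzip
import json


def decode_log_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs delivers to a subscribed Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def service_from_log_group(log_group: str) -> str:
    """Guess the originating service from the log group name (hypothetical rule)."""
    if log_group.startswith("/aws/lambda/"):
        return "lambda"
    if log_group.startswith("/aws-glue/"):
        return "glue"
    if log_group.startswith("/ecs/") or log_group.startswith("/aws/ecs/"):
        return "ecs"
    return "unknown"


# Build a fake event the same way CloudWatch Logs packages a real one.
payload = {
    "logGroup": "/aws/lambda/payment-processor",
    "logStream": "2025/06/19/[$LATEST]abc123",
    "logEvents": [
        {"id": "1", "timestamp": 1750323903000, "message": "[ERROR] KeyError: 'order_id'"}
    ],
}
encoded = base64.b64encode(gzip.compress(json.dumps(payload).encode())).decode()
event = {"awslogs": {"data": encoded}}

decoded = decode_log_event(event)
print(service_from_log_group(decoded["logGroup"]), decoded["logEvents"][0]["message"])
```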
Additionally, a separate Weekly Digest Lambda runs every Monday at 9 AM. It scans all incidents from the past week, asks Bedrock to write a summary, and sends you a formatted HTML report via SES.
There is also a CloudWatch Dashboard that shows the Analyzer's invocations, errors, and duration so you can monitor the monitor.
Services used: CloudWatch Logs, CloudWatch Subscription Filter, CloudWatch Metrics, CloudWatch Dashboard, Lambda (3 functions), Amazon Bedrock Guardrails, Amazon Bedrock Nova Lite, SNS, SES, DynamoDB, SSM Parameter Store, EventBridge, S3, IAM, and CloudFormation. 15 services total, deployed as 11 modular stacks.
Why the AI Works Well Here
If you paste an error message into ChatGPT and ask "what is wrong", you get a generic answer. "Check your input." "Validate the payload." Not helpful.
The reason AI Ops Sentinel gives useful answers is that it does not just send the error. It sends the full story:
- The error happened at 9:05 AM
- This function was deployed at 8:53 AM (12 minutes before)
- The same error has occurred 4 times in the past week
- Right now, 1200 requests per hour are hitting this function and 15 percent are failing
- Two other services (a Glue job and another Lambda) also started erroring within the last 5 minutes
With all that context, even the cheapest Bedrock model (Nova Lite, which costs about $0.001 per call) gives genuinely useful root cause analysis. It is not about using an expensive model. It is about giving the model the right information.
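To make "the full story" concrete, here is a rough sketch of how that context could be folded into one prompt. The `IncidentContext` fields, the prompt wording, and the example service names are all hypothetical. The project's real prompt may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentContext:
    """Context gathered around one error. All field names are hypothetical."""
    function_name: str
    error_message: str
    minutes_since_deploy: float
    occurrences_past_week: int
    requests_per_hour: int
    failure_rate_percent: float
    correlated_services: list = field(default_factory=list)


def build_prompt(ctx: IncidentContext) -> str:
    """Combine the error and its surrounding context into one prompt."""
    correlated = ", ".join(ctx.correlated_services) or "none"
    return (
        "Diagnose the root cause of this production error and suggest a fix.\n"
        f"Function: {ctx.function_name}\n"
        f"Error: {ctx.error_message}\n"
        f"Last deploy: {ctx.minutes_since_deploy:.0f} minutes before the error\n"
        f"Same error in the past week: {ctx.occurrences_past_week} times\n"
        f"Traffic: {ctx.requests_per_hour} requests/hour, "
        f"{ctx.failure_rate_percent:.0f}% failing\n"
        f"Other services failing in the last 5 minutes: {correlated}\n"
    )


ctx = IncidentContext(
    function_name="payment-processor",
    error_message="KeyError: 'order_id'",
    minutes_since_deploy=12,
    occurrences_past_week=4,
    requests_per_hour=1200,
    failure_rate_percent=15,
    correlated_services=["orders-etl (Glue)", "invoice-sender (Lambda)"],
)
prompt = build_prompt(ctx)
print(prompt)
```

The model never has to guess whether a deploy was involved or how wide the failure is, because each of those facts is a line in the prompt.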
Credentials Safety
This is the part I want to be clear about. Production error logs contain secrets. I have seen database connection strings with passwords, JWT tokens, API keys, email addresses, and even credit card numbers show up in error messages because someone logged the raw request.
If you send those logs to any AI model, you are leaking credentials.
Before Bedrock sees any error text, it goes through Bedrock Guardrails. The Guardrail replaces sensitive data with placeholders:
- Database URLs become {DatabaseConnectionString}
- Bearer tokens become {BearerToken}
- Emails become {EMAIL}
- API keys become {GenericApiKey}
The AI still understands "database connection failed" -- it just never sees the actual password. And the email you receive never contains raw credentials either.
I use ANONYMIZE for everything, not BLOCK. Blocking would mean the AI gets no context and cannot diagnose anything. Anonymizing keeps the story intact while removing the sensitive parts.
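A minimal sketch of the anonymization step, assuming it goes through the `apply_guardrail` call on a boto3 `bedrock-runtime` client. The `anonymize` helper, the guardrail ID, and the version are placeholders. `FakeGuardrailClient` is a stand-in so the example runs without AWS access; it only imitates the response fields the helper reads.

```python
def anonymize(client, guardrail_id: str, guardrail_version: str, text: str) -> str:
    """Return text with sensitive values masked by a Bedrock Guardrail."""
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",
        content=[{"text": {"text": text}}],
    )
    # When the guardrail masks something, the rewritten text is in "outputs".
    if response["action"] == "GUARDRAIL_INTERVENED" and response["outputs"]:
        return response["outputs"][0]["text"]
    return text


class FakeGuardrailClient:
    """Offline stand-in for boto3.client("bedrock-runtime"); masks one known email."""

    def apply_guardrail(self, **kwargs):
        text = kwargs["content"][0]["text"]["text"]
        masked = text.replace("alice@example.com", "{EMAIL}")
        if masked != text:
            return {"action": "GUARDRAIL_INTERVENED", "outputs": [{"text": masked}]}
        return {"action": "NONE", "outputs": []}


# With real AWS access you would pass boto3.client("bedrock-runtime") instead.
client = FakeGuardrailClient()
print(anonymize(client, "my-guardrail-id", "1", "login failed for alice@example.com"))
```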
Smart Deduplication
If a function errors 50 times in 10 minutes, you do not want 50 emails. You want one.
But if the same function throws a different error type 2 minutes later, you do want that alert. It could be a cascading failure.
The rule: same function plus same error type equals suppressed for 30 minutes. Different error type always alerts immediately.
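The rule above is small enough to sketch in a few lines. This version keeps its state in memory purely for illustration. The project tracks dedup state in DynamoDB, and the `Deduplicator` class and its method names are hypothetical.

```python
import time
from typing import Optional


class Deduplicator:
    """Suppress repeat alerts for the same (function, error type) pair."""

    def __init__(self, window_seconds: int = 30 * 60):
        self.window_seconds = window_seconds
        self._last_alert = {}  # (function_name, error_type) -> time of last alert sent

    def should_alert(self, function_name: str, error_type: str,
                     now: Optional[float] = None) -> bool:
        """Return True if an alert should go out, and record it if so."""
        now = time.time() if now is None else now
        key = (function_name, error_type)
        last = self._last_alert.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same function, same error type, still inside the window
        self._last_alert[key] = now
        return True


dedup = Deduplicator()
print(dedup.should_alert("payment-processor", "KeyError", now=0))        # True
print(dedup.should_alert("payment-processor", "KeyError", now=120))      # False
print(dedup.should_alert("payment-processor", "TimeoutError", now=120))  # True
```

Keying on the pair rather than on the function alone is what lets a new error type through while the first one is still suppressed.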
Weekly Digest (Monday 9 AM)
Separate from the real-time alerts, a Weekly Digest Lambda runs every Monday morning. It scans all incidents from the past 7 days in DynamoDB, sends them to Bedrock for a summary, and emails you a formatted HTML report via SES.
The report includes:
- Total incidents that week
- Top 5 failing functions/jobs
- Severity breakdown (how many critical, high, medium)
- AI-written summary: patterns, biggest risk, one recommendation
You do not have to check a dashboard or remember to review incidents. The summary comes to you. If it was a quiet week, the email just says "no incidents, all systems healthy."
This uses SES (not SNS) because SES supports HTML formatting -- tables, colors, stats. The real-time alerts stay on SNS because speed matters there, not formatting.
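The numbers in the digest (total incidents, top 5 failing functions, severity breakdown) are plain aggregation over the week's incident records. A minimal sketch, assuming each record is a dict with `function_name` and `severity` keys. Those key names and the `summarize_week` helper are hypothetical, and the real records live in DynamoDB.

```python
from collections import Counter


def summarize_week(incidents: list) -> dict:
    """Compute the digest statistics from a week of incident records."""
    by_function = Counter(item["function_name"] for item in incidents)
    by_severity = Counter(item["severity"] for item in incidents)
    return {
        "total": len(incidents),
        "top_failing": by_function.most_common(5),  # [(name, count), ...]
        "severity": dict(by_severity),
    }


incidents = [
    {"function_name": "payment-processor", "severity": "critical"},
    {"function_name": "payment-processor", "severity": "high"},
    {"function_name": "orders-etl", "severity": "medium"},
]
summary = summarize_week(incidents)
print(summary["total"], summary["top_failing"][0])
```

An empty list yields a total of 0, which is the case where the email reports a quiet week.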
How to Deploy
The project is open source. Everything is in one repo with deploy scripts.
Prerequisites:
- AWS CLI configured
- Bedrock Nova Lite model access enabled in your region
- An email address for alerts
Steps:
# Clone
git clone https://github.com/utkarsh-rastogi-aws/ai-ops-sentinel.git
cd ai-ops-sentinel
# Configure (add your email)
cp .env.example .env
# Edit .env: ALERT_EMAIL=your-email@example.com
# Deploy all 11 stacks (takes about 8 minutes)
./deploy.sh all
# Confirm the SNS email subscription (check inbox, click the link)
# Check everything is up
./deploy.sh status
Testing It
The project includes a test Lambda and a test Glue job that generate errors on purpose:
# Lambda tests
./test-errors.sh lambda key_error # Missing dictionary key
./test-errors.sh lambda connection # Logs DB password (guardrail strips it)
./test-errors.sh lambda pii_in_logs # Logs emails and credit cards
./test-errors.sh lambda timeout # Exceeds timeout limit
./test-errors.sh lambda import_error # Missing module
# Glue tests
./test-errors.sh glue key_error # ETL processing failure
./test-errors.sh glue connection # DB credentials in Glue logs
./test-errors.sh glue oom # Out of memory
# Run everything
./test-errors.sh all
After running any test, check your email in about 10 seconds. The AI diagnosis will be there.
The connection test is the interesting one for seeing Guardrails in action. It deliberately logs a full database URL with a password and a Bearer token. In the email you receive, those will be replaced with placeholders. The AI still correctly diagnoses "database connection failed" without ever seeing the actual credentials.
What Gets Deployed
11 CloudFormation stacks:
- DynamoDB table (incidents, dedup tracking, fingerprints)
- Bedrock Guardrail (PII and credentials anonymization)
- SNS topic (real-time email alerts)
- SES identity (weekly HTML digest emails)
- IAM roles (centralized, least-privilege)
- Analyzer Lambda (the AI brain)
- Subscription filter (account-level, covers all services)
- Test Lambda (5 error scenarios)
- Test Glue job (4 error scenarios)
- Weekly Digest Lambda with EventBridge schedule
- CloudWatch Dashboard
To remove everything: ./destroy.sh all
What It Costs
| Situation | Monthly cost |
|---|---|
| No errors happen | $3 (just the CloudWatch dashboard) |
| About 100 errors per day | Around $4 |
| About 1000 errors per day | Around $7 |
Most of that is the dashboard ($3/month) and Bedrock calls. If you remove the dashboard, the cost with no errors is $0.
For comparison: Datadog charges $15 per host per month. PagerDuty is $20 per user per month. This covers your entire account for less than a coffee.
Cleanup
./destroy.sh all
Deletes all 11 stacks in reverse dependency order. Empties S3 buckets automatically. Asks for confirmation before proceeding.
What I Would Build Next
- Bedrock Knowledge Bases so the AI can check our internal runbooks before diagnosing
- Slack integration for teams that prefer Slack over email
- Severity-based routing where CRITICAL goes to PagerDuty and LOW goes to weekly digest only
- A simple React dashboard to browse incident history
The Repo
GitHub: https://github.com/Utkarshlearner/ai-ops-sentinel
The README covers everything: architecture, prerequisites, customization, limitations, cost breakdown. All the code is inline Python in CloudFormation templates. No Docker, no layers, no build step. Deploy with one command, destroy with one command.
If you try it, let me know what it catches. And if you find a bug in the error-catching system, that would be genuinely funny.
Utkarsh Rastogi
LinkedIn | dev.to/awslearnerdaily






