The Problem We're Solving
It's 2 AM on a Saturday. You get the Teams call.
Your order service is failing silently. Customers can't check out. Revenue is bleeding out while you scramble.
You grab your laptop, log in to the AWS console, and start digging through CloudWatch logs. What went wrong? Was it validation? Payment? Database? Something else? You don't know yet. Thirty minutes later, you might have a guess.
This article is about never being in that situation again.
This guide walks you through building a production-grade serverless observability pipeline that triggers automated RCA.
We use SST v3 (TypeScript) for IaC and AWS managed services (CloudWatch EMF, Composite Alarms, EventBridge, Step Functions, SNS, and a small Logs Insights Lambda).
Part 1 focuses on building the demo app (Order Service), structured logging, alarms, composite alarm, EventBridge rule on the default bus, Step Functions orchestration, a FetchLogs Lambda, and email notifications via SNS.
Instead of manual log hunting, we're building a system that automatically detects failures, investigates them, and emails you the root cause—all within 90 seconds. No guesswork. Just facts.
What We're Building
An event-driven Root Cause Analysis (RCA) pipeline that works like this:
Order service fails
↓ (error is logged with classification)
CloudWatch detects the spike
↓
Alarm triggers
↓
EventBridge automatically invokes Step Functions
↓
Lambda queries historical logs for that error type
↓
Results emailed to your team
↓
Total time: 90 seconds
Before this system: MTTR = 40 minutes (manual investigation)
After this system: MTTR = 2 minutes (automated investigation)
Think of it like this. Your application is a patient. When it gets sick (fails), we don't wait for a doctor to manually examine it. We automatically run diagnostics, collect test results, and email the doctor everything they need to know.
The Scenario
Consider a typical e-commerce platform. When a customer submits an order:
Request arrives
↓
Validation check (could fail → VALIDATION error)
↓
Payment processing (could fail → PAYMENT error)
↓
Save to DynamoDB (could fail → DATABASE error)
↓
Publish to event bus (could fail → DEPENDENCY error)
↓
Response sent or error returned
Each step can fail in different ways. Different failures need different fixes. That's why we classify every error—not just "it failed," but "it failed at validation" or "it failed at payment."
We'll implement a small, realistic 'Order Service' that receives API requests, writes orders to DynamoDB, emits structured logs (EMF), and publishes business events. Errors increment EMF metrics with an ErrorType dimension (VALIDATION, BUSINESS, DEPENDENCY, INFRA, UNKNOWN). We intentionally add a controlled chaos injector to produce failures for demo purposes.
Architecture Overview
High-level architecture:
- API Gateway -> OrderService Lambda -> DynamoDB and EventBridge
- Structured logs and EMF metrics emitted by Lambda (ErrorType dimension).
- CloudWatch per-ErrorType MetricAlarms -> CompositeAlarm (OR logic).
- Composite alarm emits to EventBridge default bus.
- EventBridge rule matches composite alarm -> Step Functions state machine (RCA pipeline).
- Step Functions calls a focused FetchLogs Lambda that runs CloudWatch Logs Insights and returns recent error logs.
- Step Functions publishes a curated summary to an SNS topic for email delivery.
- (Part 2) The same pipeline feeds an AI analysis step for automated RCA.
Is This Framework Actually Good?
I'll be honest. Let me rate this against what you'd expect from a production system.
What Works Really Well ✅
- Automatic detection: No polling, no cron jobs. When an error metric crosses the threshold, alarms fire immediately.
- Structured investigation: We don't just send you a generic alert. We fetch logs from the exact time window when the error occurred, with full context (request IDs, error codes, everything).
- Event-driven: The entire pipeline is asynchronous. Nothing blocks. Your API returns fast even when alarms are firing.
- Cost-effective: You pay for Lambda when it runs, SNS when it sends emails, Logs Insights when you query. Nothing idle.
- Observable: All of this is infrastructure-as-code (SST), so you can see exactly what's running, version control it, audit it.
What Could Be Better ⚠️
- Deduplication: If 1000 orders fail, you might get 1000 RCA invocations. (Fixable with SQS dead-letter queues, but not in this initial version.)
- Remediation: This system detects and reports failures. It doesn't auto-fix them. That's intentional—you probably want human judgment before auto-rolling back payments.
- Multi-region: Single region only. Global platforms would need replication.
- Slack/PagerDuty integration: We're using SNS email by default. Adding Slack or PagerDuty is 30 minutes of work if you need it.
The Real Value Proposition
Most teams have random log statements scattered everywhere. This framework enforces a standard: every error gets classified (VALIDATION, PAYMENT, etc.), tagged with context (requestId, userId, service), and published as a metric. Then when things break, you don't manually hunt through logs. You get a structured investigation automatically.
This setup is built around that principle: use a centralized, agreed-upon logging framework; emit embedded metrics as part of the logs; configure alarms on those metrics; and invoke workflows based on alarm state.
Not perfect, but it solves the critical problem: MTTR just dropped from 40 minutes to 2 minutes.
How It Works, in Plain English
Layer 1: Detection
Your Lambda function processes an order. Something goes wrong. It logs the error with structure:
// In the handler: classify the failure at the point where it happens
if (!userId) {
  throw new AppError(
    "User ID missing",
    "VALIDATION", // ← This classification is key
    "MISSING_USER_ID"
  );
}

// ...and later, in the catch block:
logError(error); // Logs + emits the EMF metric automatically
Two things happen:
- A structured log entry: {timestamp, level, message, errorType, errorCode, requestId, userId}
- A metric to CloudWatch: OrderFailureCount with ErrorType=VALIDATION
This format makes the solution scalable: the logger utility can be reused anywhere, and a custom metric is published to CloudWatch based on the error type. A sketch of such a utility follows.
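Below is what a minimal version of that utility could look like, assuming Powertools for AWS Lambda (TypeScript, v2). The AppError, appendContext, logInfo, and logError names match the ones used throughout this article, but the exact wiring here is an illustrative sketch, not the repo's implementation:

// shared/logging/index.ts (illustrative sketch)
import { Logger } from "@aws-lambda-powertools/logger";
import { Metrics, MetricUnit } from "@aws-lambda-powertools/metrics";

export type ErrorType = "VALIDATION" | "BUSINESS" | "DEPENDENCY" | "INFRA" | "UNKNOWN";

const logger = new Logger({ serviceName: "order-service" });
const metrics = new Metrics({ namespace: "OrderService", serviceName: "order-service" });

export class AppError extends Error {
  constructor(
    message: string,
    public readonly errorType: ErrorType,
    public readonly errorCode: string,
    public readonly retryable = false
  ) {
    super(message);
    this.name = "AppError";
  }
}

// Attach request-scoped context (requestId, traceId, orderId, ...) to every log line
export function appendContext(context: Record<string, unknown>) {
  logger.appendKeys(context);
}

export function logInfo(message: string, meta?: Record<string, unknown>) {
  logger.info(message, { meta });
}

// Log the error AND emit an EMF metric with the ErrorType dimension in one call
export function logError(err: unknown, meta?: Record<string, unknown>) {
  const appErr =
    err instanceof AppError ? err : new AppError(String(err), "UNKNOWN", "UNHANDLED");

  logger.error(appErr.message, {
    errorType: appErr.errorType,
    errorCode: appErr.errorCode,
    retryable: appErr.retryable,
    meta,
  });

  metrics.addDimension("Service", "order-service");
  metrics.addDimension("Stage", process.env.STAGE ?? "dev"); // STAGE env var assumed
  metrics.addDimension("ErrorType", appErr.errorType);
  metrics.addMetadata("ErrorCode", appErr.errorCode);
  metrics.addMetric("OrderFailureCount", MetricUnit.Count, 1);
  metrics.publishStoredMetrics(); // flushes the EMF blob to stdout / CloudWatch Logs
}

With something like this in place, the handler never needs to know about CloudWatch metrics; classifying the error is enough.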
Layer 2: Alarming
CloudWatch watches that metric continuously. When it exceeds the threshold (default: 1 error), the alarm transitions to ALARM state.
We create one alarm per error type:
- order-failure-validation-dev
- order-failure-payment-dev
- order-failure-database-dev
- ... etc
Plus a composite alarm: "ANY of these alarms is in ALARM state"
Layer 3: Triggering
An EventBridge rule listens for this specific pattern:
source = aws.cloudwatch
AND detail-type = CloudWatch Alarm State Change
AND alarm state = ALARM
AND alarm name matches order-failure-any-*
When it matches, EventBridge automatically triggers a Step Functions execution. No manual intervention. No waiting.
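As a sketch, in the same SST/Pulumi style as the rest of the config, the rule and target could look like this (the rcaStateMachine and stage variables come from elsewhere in sst.config.ts; the eventsInvokeRole with states:StartExecution is an assumed, illustrative name):

// sst.config.ts (sketch): rule on the default bus that starts the RCA state machine
const rcaRule = new aws.cloudwatch.EventRule("OrderRcaRule", {
  description: "Start RCA when the composite order-failure alarm goes into ALARM",
  eventPattern: JSON.stringify({
    source: ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    detail: {
      alarmName: [{ prefix: `order-failure-any-${stage}` }],
      state: { value: ["ALARM"] },
    },
  }),
});

new aws.cloudwatch.EventTarget("OrderRcaTarget", {
  rule: rcaRule.name,
  arn: rcaStateMachine.arn, // the Step Functions state machine defined elsewhere
  roleArn: eventsInvokeRole.arn, // role allowed to call states:StartExecution (assumed)
});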
Layer 4: Investigation
Step Functions is your orchestrator. It calls a Lambda function that queries CloudWatch Logs Insights:
fields @timestamp, @message, requestId, errorType
| filter @message like /ERROR/
| filter errorType == "VALIDATION"
| sort @timestamp desc
| limit 20
This fetches the last 20 errors from the past 5 minutes (configurable), exactly from the time window when the alarm triggered.
Layer 5: Notification
Step Functions publishes results to SNS, which emails your team with:
- Alarm name and trigger time
- Error logs (timestamps, messages, request IDs)
- Time window investigated
Total latency: ~90 seconds
Voila! You've received the email. I know it's not the prettiest email; we use the email-JSON format so the payload can easily be ingested somewhere else if we want to. And since we went with a functionless approach (no Lambda in between; SNS delivers the email directly), this minimal setup just works. In production you could hook in a Lambda to prettify the email or use a custom email service. The possibilities are endless.
As you can see, we get the logs and relevant details in a reusable JSON format. More details on the message content appear later in the article.
Building It: Step by Step
I recommend referring to the GitHub repository linked at the end for the complete code.
Step 1: The Order Handler (Error Classification)
This is the foundation. When your Lambda processes orders, classify failures:
Below is sample code showing one approach to capturing the essential metrics.
// services/order/handler.ts
// Imports and client setup (the import path for the shared logging utilities is assumed;
// see shared/logging/ in the repo)
import crypto from "node:crypto";
import type { APIGatewayProxyHandlerV2 } from "aws-lambda";
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
import { v4 as uuid } from "uuid";
import { AppError, appendContext, logInfo, logError } from "../../shared/logging";

const ddb = new DynamoDBClient({});
const eb = new EventBridgeClient({});

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
const requestId =
event?.requestContext?.requestId ??
event?.headers?.["x-amzn-requestid"] ??
event?.headers?.["x-request-id"] ??
crypto.randomUUID();
const traceId = process.env._X_AMZN_TRACE_ID;
appendContext({
service: "order-service",
operation: "CreateOrder",
requestId,
traceId
});
try {
logInfo("Create order request received", { event });
if (!event.body) {
throw new AppError(
"Request body missing",
"VALIDATION",
"BODY_MISSING"
);
}
const { userId, amount, currency } = JSON.parse(event.body);
if (!userId || !amount || !currency) {
throw new AppError(
"Invalid payload",
"VALIDATION",
"INVALID_PAYLOAD"
);
}
const orderId = `ord_${uuid()}`;
appendContext({ orderId });
await ddb.send(
new PutItemCommand({
TableName: process.env.ORDERS_TABLE!,
Item: {
pk: { S: `ORDER#${orderId}` },
sk: { S: "METADATA" },
orderId: { S: orderId },
userId: { S: userId },
amount: { N: amount.toString() },
currency: { S: currency },
status: { S: "CREATED" },
requestId: { S: requestId },
traceId: { S: traceId ?? "NA" },
createdAt: { S: new Date().toISOString() }
}
})
);
await eb.send(
new PutEventsCommand({
Entries: [
{
Source: "demo.orders",
DetailType: "OrderCreated",
EventBusName: process.env.EVENT_BUS_NAME,
Detail: JSON.stringify({
orderId,
userId,
amount,
currency,
requestId,
traceId
})
}
]
})
);
logInfo("Order created", { eventType: "OrderCreated" });
return {
statusCode: 202,
body: JSON.stringify({ orderId, status: "CREATED" })
};
} catch (err) {
logError(err, { eventType: "OrderCreateFailed" });
return {
statusCode: 400,
body: JSON.stringify({ message: "Failed to create order" })
};
}
};
When logError() is called, CloudWatch gets:
{
"level": "ERROR",
"message": "Request body missing",
"timestamp": "2026-02-01T19:58:29.655Z",
"service": "order-service",
"sampling_rate": 0,
"xray_trace_id": "1-697fb065-3f7483d74258544027bd336e",
"operation": "CreateOrder",
"requestId": "2bc6aa1e-cf56-4e12-a2fe-9f283466fd89",
"traceId": "Root=1-697fb065-3f7483d74258544027bd336e;Parent=5b855ffe9a203f8f;Sampled=0;Lineage=1:141a50ec:0",
"errorType": "VALIDATION",
"errorCode": "BODY_MISSING",
"retryable": false,
"meta": {
"eventType": "OrderCreateFailed"
}
}
Note: With Lambda Powertools we get important traceability out of the box. Fields like requestId and traceId help build an end-to-end picture of exactly what led to the error. I recommend going through the official Powertools documentation.
And simultaneously, EMF metric gets emitted:
{
"_aws": {
"Timestamp": 1769975909655,
"CloudWatchMetrics": [
{
"Namespace": "OrderService",
"Dimensions": [
[
"Service",
"Stage",
"ErrorType"
]
],
"Metrics": [
{
"Name": "OrderFailureCount",
"Unit": "Count"
}
]
}
]
},
"Service": "order-service",
"Stage": "sourabhTest",
"ErrorType": "VALIDATION",
"ErrorCode": "BODY_MISSING",
"OrderFailureCount": 1
}
Why both? Logs give context. Metrics trigger alarms. Together, you get detectability + context.
Step 2: Creating Alarms (One Per Error Type)
Using SST v3, define alarms in code:
The thresholds below are very tight; I kept them that way intentionally for testing. For production, adjust them based on your business SLOs.
// sst.config.ts
const errorTypes = ["VALIDATION", "PAYMENT", "DATABASE", "DEPENDENCY"];
const alarms = errorTypes.map((errorType) =>
new aws.cloudwatch.MetricAlarm(`OrderFailure-${errorType}`, {
name: `order-failure-${errorType.toLowerCase()}-${stage}`,
namespace: "OrderService",
metricName: "OrderFailureCount",
statistic: "Sum",
period: 60, // Check every minute
evaluationPeriods: 1, // Trigger immediately
threshold: 1, // Alarm on 1+ errors (use higher in prod)
comparisonOperator: "GreaterThanOrEqualToThreshold",
dimensions: {
Service: "order-service",
Stage: stage,
ErrorType: errorType, // ← Must match metric dimension
},
}),
);
// Composite alarm: "OR" all of them
const compositeRule = pulumi
.all(alarms.map(a => a.name))
.apply((names) => names.map(n => `ALARM("${n}")`).join(" OR "));
const composite = new aws.cloudwatch.CompositeAlarm(
"OrderFailureComposite",
{
alarmName: `order-failure-any-${stage}`,
alarmDescription: "Any order failure detected",
alarmRule: compositeRule,
},
);
Why per-error-type alarms? Because you want to know what broke. Different errors need different fixes. The composite lets you trigger RCA on any error, but you still have granularity.
Note: In Part 2, we'll add Amazon Bedrock (Nova Premier) for AI-powered analysis. The IAM permissions are already in place:
// Step Functions gets Bedrock permissions
const rcaPolicy = new aws.iam.RolePolicy("OrderRcaPolicy", {
role: rcaRole.name,
policy: rcaStateMachine.arn.apply((arn) =>
aws.iam.getPolicyDocument({
statements: [
{
actions: ["bedrock:InvokeModel"],
resources: [
`arn:aws:bedrock:${region}::foundation-model/amazon.nova-premier-v1:0`,
],
},
// ... other permissions
],
}).then((doc) => doc.json)
),
});
The metric creates a custom namespace with the dimensions we defined. The images below show metric data updating as errors are observed in the system.
Step 3: Step Functions Orchestration
When the composite alarm triggers, Step Functions is invoked automatically. Here's the complete workflow definition for the orchestration:
{
"StartAt": "FetchLogs",
"States": {
"FetchLogs": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "${FETCH_LOGS_LAMBDA_ARN}",
"Payload.$": "$"
},
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SendFailureEmail"}],
"Next": "SendSuccessEmail"
},
"SendSuccessEmail": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "${ALARM_EMAIL_TOPIC_ARN}",
"Subject.$": "States.Format('Order Failure: {}', $.detail.alarmName)",
"Message.$": "States.JsonToString($)"
},
"End": true
},
"SendFailureEmail": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "${ALARM_EMAIL_TOPIC_ARN}",
"Subject": "RCA Failed",
"Message.$": "States.JsonToString($)"
},
"End": true
}
}
}
Flow:
- Fetch logs from CloudWatch Logs Insights
- If successful → Send email with results
- If failed → Send email saying RCA failed (at least you know something broke)
In Part 2, we enhance this with a parallel state that sends immediate alerts while Amazon Bedrock Nova Premier analyzes the logs with AI. You get two emails: raw data (10 sec) + AI analysis (30 sec).
Step 4: The Investigation Lambda
This Lambda is where the magic happens. It queries logs for the specific error.
It extracts the error type from the alarm name, then queries CloudWatch Logs Insights for all errors matching that type within the look-back window around the alarm (five minutes in this demo, configurable). Since Logs Insights queries are asynchronous, it polls for results with exponential backoff, then returns structured data including timestamps, error messages, request IDs, and trace IDs: everything your team needs to start investigating.
What happens:
- Extracts error type from alarm name
- Queries logs for that specific error in the past 5 minutes
- Polls Logs Insights until results are ready
- Returns structured data: what failed, when, how many times, with full context
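Here is a simplified sketch of what that Lambda can look like (the real implementation lives in services/rca/fetchlogs.ts in the repo; the alarm-name parsing, the ORDER_LOG_GROUP env var, and the returned field names here are assumptions for illustration):

// services/rca/fetchlogs.ts (simplified sketch)
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from "@aws-sdk/client-cloudwatch-logs";

const logs = new CloudWatchLogsClient({});
const LOOKBACK_SECONDS = 300; // 5-minute window (configurable)

export const handler = async (event: any) => {
  const alarmName: string = event?.detail?.alarmName ?? "unknown";
  // Naive extraction for this sketch ("order-failure-validation-dev" -> "VALIDATION");
  // the composite alarm event also names the triggering child alarm, which could be used instead.
  const errorType = (alarmName.split("-")[2] ?? "UNKNOWN").toUpperCase();

  const to = Math.floor(Date.now() / 1000);
  const from = to - LOOKBACK_SECONDS;

  const { queryId } = await logs.send(
    new StartQueryCommand({
      logGroupName: process.env.ORDER_LOG_GROUP!, // hypothetical env var: order service log group
      startTime: from,
      endTime: to,
      queryString: `fields @timestamp, @message, requestId, traceId, errorType
        | filter @message like /ERROR/
        | filter errorType = "${errorType}"
        | sort @timestamp desc
        | limit 20`,
    })
  );

  // Logs Insights is asynchronous: poll with exponential backoff until the query completes
  let delay = 500;
  for (;;) {
    await new Promise((resolve) => setTimeout(resolve, delay));
    const res = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (res.status === "Complete") {
      const rows = (res.results ?? []).map((fields) =>
        Object.fromEntries(fields.map((f) => [f.field ?? "", f.value ?? ""]))
      );
      return { alarmName, errorType, timeWindow: { from, to }, count: rows.length, logs: rows };
    }
    if (res.status === "Failed" || res.status === "Cancelled") {
      throw new Error(`Logs Insights query ${res.status}`);
    }
    delay = Math.min(delay * 2, 5000); // cap the backoff
  }
};

The returned object roughly matches the rca.Payload you'll see in the email later in this article.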
Testing It
Let's verify the whole pipeline works.
Test 1: Trigger an Error
We'll trigger an error by hitting the deployed endpoint with an invalid payload.
# Send invalid request
curl -X POST "$API_URL" \
-H "Content-Type: application/json" \
-d '' # Empty body triggers VALIDATION error
Based on this, we receive an email with the complete logs, which can then be used for an end-to-end check. Since we're using Lambda Powertools, the requestId can be used to trace the complete request that caused the error.
Test 2: Watch the Logs
In CloudWatch Logs, you'll see the structured error. In CloudWatch Metrics, you'll see OrderFailureCount increment.
Test 3: Check Alarm State
Wait 60 seconds, then:
aws cloudwatch describe-alarms --query 'MetricAlarms[?Namespace==`OrderService`].[AlarmName,StateValue]' --output table --region us-east-1
---------------------------------------------------
| DescribeAlarms |
+----------------------------------------+--------+
| order-failure-business-sourabhTest | OK |
| order-failure-dependency-sourabhTest | OK |
| order-failure-infra-sourabhTest | OK |
| order-failure-unknown-sourabhTest | OK |
| order-failure-validation-sourabhTest | ALARM |
+----------------------------------------+--------+
Test 4: Verify Step Functions Executed
After invoking the endpoint with an invalid payload, we check the Step Functions execution.
Test 5: Receive the Email
Within 90 seconds, you should receive an email with the full context. Here's what the actual email looks like:
From: AWS Notifications
Subject: Order Service Alarm: order-failure-any-dev
The email body contains a complete JSON payload with:
1. Alarm Context - What triggered:
- Alarm name and transition time
- Previous state (OK → ALARM)
- Triggering alarm details (which specific error type caused it)
2. RCA Object - The investigation results:
"rca": {
"Payload": {
"alarmName": "order-failure-any-dev",
"errorType": "VALIDATION",
"timeWindow": {
"from": 1769927917,
"to": 1769928217,
"fromReadable": "2026-02-01T06:38:37.000Z",
"toReadable": "2026-02-01T06:43:37.000Z"
},
"count": 1,
"logs": [
{
"@timestamp": "2026-02-01 06:42:36.945",
"@message": "{...full structured log...}",
"requestId": "1f00615f-7154-4964-9c14-db512efea5f8",
"traceId": "Root=1-697ef5dc-08ed5ced0bda9d290e9aea6c",
"errorType": "VALIDATION"
}
]
}
}
What We Get:
- Time window: The exact 5-minute window that was investigated
- Error count: How many errors were found
- Full logs: Each log entry with:
- Timestamp (when it happened)
- Complete error message with all context
- Request ID (for distributed tracing)
- X-Ray trace ID (for deep diving)
- Error type and code
- Order ID (for customer impact assessment)
The @message field contains the full structured log in JSON format, including:
- Error classification (VALIDATION, PAYMENT, etc.)
- Error code (INVALID_PAYLOAD, MISSING_USER_ID, etc.)
- Whether it's retryable
- Service metadata
- Custom fields you've added
That's the win. Error detected → Investigated → Full context emailed. All automatic. 90 seconds.
Real-world example (redacted for privacy):
See the complete sample in the repo at examples/sample-rca-email.json showing an actual VALIDATION error with full CloudWatch Logs Insights results.
What to Think About for Production
1. Alarm Thresholds
One error = alarm. Good for testing. In production, maybe 10 errors in a minute? Depends on your traffic. Know your baseline.
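For example, a more production-leaning version of the alarm from Step 2 might look like the sketch below (it reuses the errorTypes and stage variables from Step 2; the specific numbers are illustrative, so tune them to your own traffic):

// sst.config.ts (sketch): noise-resistant production alarm settings
const prodAlarms = errorTypes.map((errorType) =>
  new aws.cloudwatch.MetricAlarm(`OrderFailure-${errorType}`, {
    name: `order-failure-${errorType.toLowerCase()}-${stage}`,
    namespace: "OrderService",
    metricName: "OrderFailureCount",
    statistic: "Sum",
    period: 60,
    evaluationPeriods: 5,             // look at the last 5 one-minute periods...
    datapointsToAlarm: 3,             // ...and require 3 breaching minutes before alarming
    threshold: 10,                    // 10+ errors per minute instead of 1
    comparisonOperator: "GreaterThanOrEqualToThreshold",
    treatMissingData: "notBreaching", // quiet periods should not flap the alarm
    dimensions: {
      Service: "order-service",
      Stage: stage,
      ErrorType: errorType,
    },
  }),
);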
2. Notifications
Email works. For production, you probably want Slack or PagerDuty. SNS integrates with both (30-minute addition).
3. Rate Limiting
If thousands of orders fail at once, you'll get thousands of RCA invocations. Add deduplication or batching to prevent Lambda/Logs Insights thrashing.
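One simple way to add deduplication, different from the SQS approach mentioned earlier, is a conditional write to DynamoDB keyed by alarm name and a time bucket, checked at the start of the state machine. The sketch below is not in the repo; the RCA_DEDUPE_TABLE name and the 10-minute bucket are assumptions:

// services/rca/dedupe.ts (sketch): skip RCA if this alarm already ran in the current bucket
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});
const BUCKET_SECONDS = 600; // one RCA run per alarm per 10 minutes

export const handler = async (event: any) => {
  const alarmName = event?.detail?.alarmName ?? "unknown";
  const bucket = Math.floor(Date.now() / 1000 / BUCKET_SECONDS);

  try {
    await ddb.send(
      new PutItemCommand({
        TableName: process.env.RCA_DEDUPE_TABLE!, // hypothetical table with TTL on expiresAt
        Item: {
          pk: { S: `${alarmName}#${bucket}` },
          expiresAt: { N: String(Math.floor(Date.now() / 1000) + BUCKET_SECONDS) },
        },
        ConditionExpression: "attribute_not_exists(pk)",
      })
    );
    return { duplicate: false };
  } catch (err: any) {
    if (err.name === "ConditionalCheckFailedException") {
      return { duplicate: true }; // a Choice state can then end the execution early
    }
    throw err;
  }
};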
4. Cost
Estimated monthly cost at scale: $5-15
- Logs Insights queries: ~$0.005 per GB scanned
- Step Functions: ~$0.000025 per state transition
- SNS: ~$0.50 per million emails
- Lambda: Pay-per-execution
Cheap. Way cheaper than on-call engineers hunting logs.
5. Dashboard
Add a CloudWatch Dashboard to visualize:
- Failure rate trends
- Which error types are most common
- RCA success rate
One command, huge value.
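A minimal version in the same SST/Pulumi style could look like the sketch below (it reuses errorTypes and stage from Step 2; the widget layout and region are illustrative):

// sst.config.ts (sketch): one widget showing failure counts by ErrorType
new aws.cloudwatch.Dashboard("OrderFailureDashboard", {
  dashboardName: `order-failures-${stage}`,
  dashboardBody: JSON.stringify({
    widgets: [
      {
        type: "metric",
        x: 0, y: 0, width: 12, height: 6,
        properties: {
          title: "Order failures by error type",
          view: "timeSeries",
          stat: "Sum",
          period: 60,
          region: "us-east-1", // adjust to your region
          metrics: errorTypes.map((t) => [
            "OrderService", "OrderFailureCount",
            "Service", "order-service",
            "Stage", stage,
            "ErrorType", t,
          ]),
        },
      },
    ],
  }),
});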
What's Coming in Part 2
Right now you get logs. In Part 2, you'll get analysis.
We'll send those logs to Amazon Bedrock (Nova Premier), which will:
- Summarize the incident in plain English
- Identify likely root causes
- Recommend fixes
- Compare to past similar incidents
Same architecture—just add one more step that calls Bedrock and processes the response.
The Key Insight
This whole system rests on one principle: automation needs structured data.
If your logs are "oops something broke," you can't automate anything. But if your logs are structured—error type, error code, request ID, user context—suddenly you can:
- Create meaningful alarms
- Investigate automatically
- Share context instantly
- Let AI analyze patterns
The hard part isn't the AWS services (they're all standard). The hard part is getting your application to emit good logs in the first place.
Start there. Classify your failures. Emit structured data. Everything else flows from that.
References
- AWS Well-Architected - Operational Excellence
- CloudWatch Embedded Metric Format
- Step Functions Best Practices
- CloudWatch Logs Insights Syntax
- Powertools for AWS Lambda
Code
GitHub Repository:
https://github.com/Wizard-Z/aws-serverless-rca
Key Files:
- sst.config.ts - Infrastructure as Code
- services/order/handler.ts - Order service
- services/rca/fetchlogs.ts - Investigation Lambda
- shared/logging/ - Logging utilities
- infra/stepfunctions/orders-rca.asl.json - RCA workflow
Part 2 coming soon. Follow me on LinkedIn to get notified.
AWS Community
Let's build!







