In Part 3, we separated signals on purpose:
- metrics tell you where to look
- logs and traces tell you what happened
- audit tells you what can be proven later.
This article is an incident story. The examples are fictional, but the dynamics are real.
An incident is the exact moment when panic starts:
- Just ship everything to one place.
- Just add the tenantId to every metric.
- Just give everyone access so we can debug faster.
And, for the sake of the article (and of AI), I bring in AWS DevOps Agent as an incident capability that can reduce cognitive load by correlating telemetry, code, and deployment context, and by showing its investigation steps transparently.
References:
- About AWS DevOps Agent: https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html
- DevOps Agent Incident Response (investigation timeline, root cause, mitigation plans, gaps): https://docs.aws.amazon.com/devopsagent/latest/userguide/devops-agent-incident-response.html
Assumptions
The application is set up this way:
- 3 continents (EU / US / APAC)
- 2 regions per continent
- EventBridge bus per region
- A lot of Lambda functions per region
- Data residency constraints
Other concerns assumed to be already solved:
- an AWS Organization with SCPs and guardrails
- CloudTrail trails delivering to S3 with retention policies
- Structured application logs
- Metrics and alarms for key services
- Traces enabled for critical flows
Let's have fun
09:12
Someone writes:
"CorrelationId: c-9f3a...blah blah. It worked yesterday."
When operating globally, this message is missing clues:
- We need to identify which region
- We need to stay inside the right boundary
- We need to answer fast without granting dangerous access
- We need to produce an explanation later (post-mortem, evidence)
This is where the trust built into the architecture becomes operational.
09:14
I need a fast answer to 2 questions:
1) Is this global or regional?
2) Is this a customer-specific failure or a systemic failure?
Based on the message, I start looking for the metric that represents the health of that service.
I would look at:
- CloudFront 5xx error rate (if requests fail at the edge)
- API error rate or latency (if APIGW/ALB)
- Downstream service errors (if the API is okay but something else fails later)
So I look at:
- Regional behaviour (EU vs US vs APAC backends)
- Last 15 minutes vs last 24 hours
- Error rate and latency
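That first check is often faster through the API than through the console. A minimal sketch, assuming an API Gateway REST API named invoice-api in eu-west-1 (the namespace, metric, and dimension names depend on your setup):

```python
import boto3
from datetime import datetime, timedelta, timezone

# Illustrative sketch: pull the 5XX error count for one EU region over the
# last 15 minutes. "invoice-api" and the dimensions are assumptions.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    MetricDataQueries=[
        {
            "Id": "errors_5xx",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": "5XXError",
                    "Dimensions": [{"Name": "ApiName", "Value": "invoice-api"}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        }
    ],
)
print(response["MetricDataResults"][0]["Values"])
```

Running the same query per region (and again with a 24-hour window) is usually enough to answer the "global or regional" question.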
This is a great moment for AWS DevOps Agent.
DevOps Agent is designed to:
- Learn your resources and relationships,
- Build a topology graph,
- Introspect CloudWatch telemetry through the configured AWS account access,
- Produce an investigation timeline and root-cause summaries when an investigation runs.
09:18
Let's say metrics show:
- EU error rate rising
- US/APAC normal
- Service X is impacted
- Time window: since ~09:05
- A known correlationId from the customer: c-9f3a...
At this point, I have my coordinates:
- Region boundary: EU
- Capability boundary: invoice approval
09:22
Even with the correlationId, I still need a place to start.
I usually start from where the request enters the system:
- The API handler Lambda
- The background process
- The first event or Step Functions execution
Once the entry point is established, I look for the log:
{
"level": "info",
"msg": "Invoice approval requested",
"correlationId": "c-9f3a...",
"tenantId": "tenant-42",
"region": "eu-west-1",
"service": "invoice-api",
"operation": "invoice.approve" -> command
}
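With structured logs like this, a CloudWatch Logs Insights query by correlationId is usually the fastest way to find that line. A minimal sketch, assuming the handler writes to a log group named /aws/lambda/invoice-api (the log group and field names are assumptions):

```python
import time
import boto3

# Illustrative sketch: find the entry-point log for one correlationId.
logs = boto3.client("logs", region_name="eu-west-1")

query = logs.start_query(
    logGroupName="/aws/lambda/invoice-api",
    startTime=int(time.time()) - 3600,  # last hour, epoch seconds
    endTime=int(time.time()),
    queryString="""
        fields @timestamp, msg, service, operation, tenantId
        | filter correlationId = "c-9f3a..."
        | sort @timestamp asc
    """,
)

# Logs Insights queries are asynchronous: poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```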
Now I can answer some critical questions fast:
- Did we actually receive the request?
- Did we enqueue/publish the next step?
- Did we reject it immediately?
Suppose I see:
{
"level": "info",
"msg": "Published event InvoiceApprovalRequested",
"correlationId": "c-9f3a...",
"eventBus": "eu-invoices",
"service": "invoice-api"
}
This is important: the request entered the system and made it into the workflow.
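If I want to confirm the bus actually delivered that event to its targets, a quick hedged check is the EventBridge delivery metrics. A sketch, assuming a rule named invoice-approval-requested on the eu-invoices bus (both names are assumptions; custom buses may also need the EventBusName dimension):

```python
from datetime import datetime, timedelta, timezone
import boto3

# Illustrative sketch: did EventBridge fail to deliver the event to any target?
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="FailedInvocations",
    Dimensions=[{"Name": "RuleName", "Value": "invoice-approval-requested"}],
    StartTime=now - timedelta(minutes=30),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])
```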
09:26
At this point in the investigation, the instinct is to look for a trace.
I have the correlationId. The system is instrumented. Tracing is enabled. So the obvious question becomes:
Do we have a trace for this request?
The answer is: maybe.
Tracing in production is almost always sampled. The exact request that failed may not have been captured, especially if the system is high-volume or if sampling rates are intentionally low to control cost and data volume.
When a trace does exist for the correlationId, I can see how the request moved across services, where time was spent, and which dependency failed or slowed down.
But I'm usually not that lucky, so I try a different approach:
- Look for traces of similar requests in the same time window.
- Move on and rely on logs
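The first fallback can be done with the X-Ray API. A sketch, assuming the service is instrumented as invoice-api (the filter expression and time window are illustrative, and the correlationId itself may not appear in any sampled trace):

```python
from datetime import datetime, timedelta, timezone
import boto3

# Illustrative sketch: no trace for the exact correlationId, so look for
# traces of similar failing requests in the same window instead.
xray = boto3.client("xray", region_name="eu-west-1")

now = datetime.now(timezone.utc)
paginator = xray.get_paginator("get_trace_summaries")
for page in paginator.paginate(
    StartTime=now - timedelta(minutes=30),
    EndTime=now,
    # Faults (5xx) touching the invoice-api service in this window.
    FilterExpression='service("invoice-api") { fault = true }',
):
    for summary in page["TraceSummaries"]:
        print(summary["Id"], summary.get("ResponseTime"))
```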
09:31
Now I have an idea:
In the EU, service X is timing out.
The next question is:
Why did it start at 09:05?
And immediately after:
- Did we deploy new code?
- Was there a configuration rollout?
- Did a feature flag change?
If nothing was deployed, that does not end the investigation. I start looking into dependency behaviour, load patterns, or environmental drift.
This is where CloudTrail becomes relevant, because it confirms whether the infrastructure actually changed, when it changed, and under which identity.
This is why the audit workflow often looks like:
- CloudTrail delivered to S3
- queried later (often Athena)
- correlated by time, principal, and resource
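A sketch of that workflow in practice, assuming CloudTrail is already queryable through an Athena table named cloudtrail_logs (the database, table, date literals, and results bucket are all assumptions about your setup):

```python
import boto3

# Illustrative sketch: who changed Lambda or EventBridge configuration
# around 09:05 UTC? The date below is a placeholder for the incident day.
athena = boto3.client("athena", region_name="eu-west-1")

query = """
SELECT eventtime, eventname, useridentity.arn, eventsource, requestparameters
FROM cloudtrail_logs
WHERE eventtime BETWEEN '2024-01-15T08:50:00Z' AND '2024-01-15T09:10:00Z'
  AND eventsource IN ('lambda.amazonaws.com', 'events.amazonaws.com')
  AND readonly = 'false'          -- only mutating calls
ORDER BY eventtime
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "audit"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(execution["QueryExecutionId"])
```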
09:36
At this point, AWS DevOps Agent can be genuinely useful:
- It maintains an investigation timeline
- It can report investigation gaps
- It can propose mitigation plans with stages
Note:
Using DevOps Agent does not change what needs to be done; it changes how easy it is to do it.
09:44
Let's say I confirm it was a configuration change, and not a random AWS-side problem.
I apply the fix through a proper deployment (so the change itself leaves an audit trail), and I start the verification, which should show:
1) Metrics improve (orientation)
2) Traces show the path is healthy (flow)
3) Logs show successful processing (detail)
4) Evidence captures the change and the actor (proof)
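For step 1, one quick check is to confirm that no related alarms are still firing in the affected region. A sketch, assuming the relevant alarms share an invoice- name prefix (the naming convention is an assumption):

```python
import boto3

# Illustrative sketch: list any invoice-* alarms still in ALARM state
# in the affected region after the fix.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

alarms = cloudwatch.describe_alarms(
    AlarmNamePrefix="invoice-",
    StateValue="ALARM",
)
for alarm in alarms["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateReason"])
```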
10:30
Post-mortem time
I should be prepared to answer:
- What happened?
- Why did it happen?
- Why did it take us so long to detect it?
- What was confusing?
- What data did we wish we had?
- What guardrail would have prevented it?
- What should we automate or change?
Again, this is where AWS DevOps Agent fits, as it includes a prevention feature that analyses multiple incidents and produces recommendations.
By the way, the AWS best practices for DevOps Agent are:
- Application-specific Agent Spaces,
- Read-only access to shared dependency accounts,
- Tagging shared resources to identify which applications use them,
- Turning cross-team escalation procedures into runbooks.
Conclusion
If Part 3 was about why signals are different, this article is about what happens when you actually need them.
- Metrics help you get an idea of where things are happening
- Logs and traces let you understand what really happened
- Audit and identity give you something you can stand behind later.
Incidents are the moments when missing process shows up, while post-mortems are where that pain, hopefully, becomes better systems.