In Part 3, we separated signals on purpose:
- metrics tell you where to look
- logs and traces tell you what happened
- audit tells you what can be proven later.
This article is an incident story. The examples are fictional, but the dynamics are real.
An incident is the exact moment when panic starts:
- Just ship everything to one place.
- Just add the tenantId to every metric.
- Just give everyone access so we can debug faster.
And, for the sake of the article (and of AI), I bring in AWS DevOps Agent as an incident capability that can reduce cognitive load by correlating telemetry, code, and deployment context, and by showing its investigation steps transparently.
References:
- About AWS DevOps Agent: https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent.html
- DevOps Agent Incident Response (investigation timeline, root cause, mitigation plans, gaps): https://docs.aws.amazon.com/devopsagent/latest/userguide/devops-agent-incident-response.html
Assumptions
The application is set up this way:
- 3 continents (EU / US / APAC)
- 2 regions per continent
- EventBridge bus per region
- A lot of Lambda functions per region
- Data residency constraints
Other concerns assumed to be already solved:
- an AWS Organization with SCPs and guardrails
- CloudTrail trails delivering to S3 with retention policies
- Structured application logs
- Metrics and alarms for key services
- Traces enabled for critical flows
Let's have fun
09:12
Someone writes:
"CorrelationId: c-9f3a...blah blah. It worked yesterday."
When operating globally, this message is missing clues:
- We need to identify which region
- We need to stay inside the right boundary
- We need to answer fast without granting dangerous access
- We need to produce an explanation later (post-mortem, evidence)
This is where the trust built into the architecture becomes operational.
09:14
I need a fast answer to 2 questions:
1) Is this global or regional?
2) Is this a customer-specific failure or a systemic failure?
Based on the message, I start looking for the metric that represents the health of that service.
I would look at:
- CloudFront 5xx error rate (if requests fail at the edge)
- API error rate or latency (if APIGW/ALB)
- Downstream service errors (if the API is okay but something else fails later)
So I look at:
- Regional behaviour (EU vs US vs APAC backends)
- Last 15 minutes vs last 24 hours
- Error rate and latency
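That first check is often faster through the API than through the console. A minimal sketch, assuming an API Gateway REST API named invoice-api in eu-west-1 (the namespace, metric, and dimension names depend on your setup):

```python
import boto3
from datetime import datetime, timedelta, timezone

# Illustrative sketch: pull the 5XX error count for one EU region over the
# last 15 minutes. "invoice-api" and the dimensions are assumptions.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    MetricDataQueries=[
        {
            "Id": "errors_5xx",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": "5XXError",
                    "Dimensions": [{"Name": "ApiName", "Value": "invoice-api"}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
        }
    ],
)
print(response["MetricDataResults"][0]["Values"])
```

Running the same query per region (and again with a 24-hour window) is usually enough to answer the "global or regional" question.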
This is a great moment for AWS DevOps Agent.
DevOps Agent is designed to:
- Learn your resources and relationships,
- Build a topology graph,
- Introspect CloudWatch telemetry through the configured AWS account access,
- Produce an investigation timeline and root-cause summaries when an investigation runs.
09:18
Let's say metrics show:
- EU error rate rising
- US/APAC normal
- Service X is impacted
- Time window: since ~09:05
- A known correlationId from the customer: c-9f3a...
At this point, I have my coordinates:
- Region boundary: EU
- Capability boundary: invoice approval
09:22
Even with the correlationId, I still need a place to start.
I usually start from where the request enters the system:
- The API handler Lambda
- The background process
- The first event or Step Functions execution
Once the entry point is established, I look for the log:
{
"level": "info",
"msg": "Invoice approval requested",
"correlationId": "c-9f3a...",
"tenantId": "tenant-42",
"region": "eu-west-1",
"service": "invoice-api",
"operation": "invoice.approve" -> command
}
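With structured logs like this, a CloudWatch Logs Insights query by correlationId is usually the fastest way to find that line. A minimal sketch, assuming the handler writes to a log group named /aws/lambda/invoice-api (the log group and field names are assumptions):

```python
import time
import boto3

# Illustrative sketch: find the entry-point log for one correlationId.
logs = boto3.client("logs", region_name="eu-west-1")

query = logs.start_query(
    logGroupName="/aws/lambda/invoice-api",
    startTime=int(time.time()) - 3600,  # last hour, epoch seconds
    endTime=int(time.time()),
    queryString="""
        fields @timestamp, msg, service, operation, tenantId
        | filter correlationId = "c-9f3a..."
        | sort @timestamp asc
    """,
)

# Logs Insights queries are asynchronous: poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```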
Now I can answer some critical questions fast:
- Did we actually receive the request?
- Did we enqueue/publish the next step?
- Did we reject it immediately?
Suppose I see:
{
"level": "info",
"msg": "Published event InvoiceApprovalRequested",
"correlationId": "c-9f3a...",
"eventBus": "eu-invoices",
"service": "invoice-api"
}
This is important: the request entered the system and made it into the workflow.
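If I want to confirm the bus actually delivered that event to its targets, a quick hedged check is the EventBridge delivery metrics. A sketch, assuming a rule named invoice-approval-requested on the eu-invoices bus (both names are assumptions; custom buses may also need the EventBusName dimension):

```python
from datetime import datetime, timedelta, timezone
import boto3

# Illustrative sketch: did EventBridge fail to deliver the event to any target?
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Events",
    MetricName="FailedInvocations",
    Dimensions=[{"Name": "RuleName", "Value": "invoice-approval-requested"}],
    StartTime=now - timedelta(minutes=30),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
print(stats["Datapoints"])
```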
09:26
At this point in the investigation, the instinct is to look for a trace.
I have the correlationId. The system is instrumented. Tracing is enabled. So the obvious question becomes:
Do we have a trace for this request?
The answer is: maybe.
Tracing in production is almost always sampled. The exact request that failed may not have been captured, especially if the system is high-volume or if sampling rates are intentionally low to control cost and data volume.
When a trace does exist for the correlationId, I can see how the request moved across services, where time was spent, and which dependency failed or slowed down.
But I'm usually not that lucky, so I try a different approach:
- Look for traces of similar requests in the same time window.
- Move on and rely on logs
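The first fallback can be done with the X-Ray API. A sketch, assuming the service is instrumented as invoice-api (the filter expression and time window are illustrative, and the correlationId itself may not appear in any sampled trace):

```python
from datetime import datetime, timedelta, timezone
import boto3

# Illustrative sketch: no trace for the exact correlationId, so look for
# traces of similar failing requests in the same window instead.
xray = boto3.client("xray", region_name="eu-west-1")

now = datetime.now(timezone.utc)
paginator = xray.get_paginator("get_trace_summaries")
for page in paginator.paginate(
    StartTime=now - timedelta(minutes=30),
    EndTime=now,
    # Faults (5xx) touching the invoice-api service in this window.
    FilterExpression='service("invoice-api") { fault = true }',
):
    for summary in page["TraceSummaries"]:
        print(summary["Id"], summary.get("ResponseTime"))
```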
09:31
Now I have an idea:
In the EU, service X is timing out.
The next question is:
Why did it start at 09:05?
And immediately after:
- Did we deploy new code?
- Was there a configuration rollout?
- Did a feature flag change?
If nothing was deployed, that does not end the investigation. I start looking into dependency behaviour, load patterns, or environmental drift.
This is where CloudTrail becomes relevant, because it confirms whether the infrastructure actually changed, when it changed, and under which identity.
This is why the audit workflow often looks like:
- CloudTrail delivered to S3
- queried later (often Athena)
- correlated by time, principal, and resource
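A sketch of that workflow in practice, assuming CloudTrail is already queryable through an Athena table named cloudtrail_logs (the database, table, date literals, and results bucket are all assumptions about your setup):

```python
import boto3

# Illustrative sketch: who changed Lambda or EventBridge configuration
# around 09:05 UTC? The date below is a placeholder for the incident day.
athena = boto3.client("athena", region_name="eu-west-1")

query = """
SELECT eventtime, eventname, useridentity.arn, eventsource, requestparameters
FROM cloudtrail_logs
WHERE eventtime BETWEEN '2024-01-15T08:50:00Z' AND '2024-01-15T09:10:00Z'
  AND eventsource IN ('lambda.amazonaws.com', 'events.amazonaws.com')
  AND readonly = 'false'          -- only mutating calls
ORDER BY eventtime
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "audit"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(execution["QueryExecutionId"])
```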
09:36
At this point, AWS DevOps Agent can be genuinely useful:
- It maintains an investigation timeline
- It can report investigation gaps
- It can propose mitigation plans with stages
Note:
Using DevOps Agent does not change what needs to be done; it changes how easy it is to do it.
09:44
Let's say I confirm it was a configuration change, and not a random AWS-side problem.
I apply the fix through a proper deployment (so the change itself leaves an audit trail), and I start the verification, which should show:
1) Metrics improve (orientation)
2) Traces show the path is healthy (flow)
3) Logs show successful processing (detail)
4) Evidence captures the change and the actor (proof)
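For step 1, one quick check is to confirm that no related alarms are still firing in the affected region. A sketch, assuming the relevant alarms share an invoice- name prefix (the naming convention is an assumption):

```python
import boto3

# Illustrative sketch: list any invoice-* alarms still in ALARM state
# in the affected region after the fix.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

alarms = cloudwatch.describe_alarms(
    AlarmNamePrefix="invoice-",
    StateValue="ALARM",
)
for alarm in alarms["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateReason"])
```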
10:30
Post-mortem time
I should be prepared to answer:
- What happened?
- Why did it happen?
- Why did it take us so long to detect it?
- What was confusing?
- What data did we wish we had?
- What guardrail would have prevented it?
- What should we automate or change?
Again, this is where AWS DevOps Agent fits, as it includes a prevention feature that analyses multiple incidents and produces recommendations.
By the way, the AWS best practices for DevOps Agent are:
- Application-specific Agent Spaces,
- Read-only access to shared dependency accounts,
- Tagging shared resources to identify which applications use them,
- Turning cross-team escalation procedures into runbooks.
Conclusion
If Part 3 was about why signals are different, this article is about what happens when you actually need them.
- Metrics help you get an idea of where things are happening
- Logs and traces let you understand what really happened
- Audit and identity give you something you can stand behind later.
Incidents are the moments when missing process shows up, while post-mortems are where that pain, hopefully, becomes better systems.