CloudWatch Investigations: Your AI-Powered Troubleshooting Sidekick

#aws #genai #cloud #sre

Remember those 3 AM incidents when you’re frantically switching between dashboards, digging through logs, and wondering if you should just restart everything? We have all been there: troubleshooting production issues outside business hours, on weekends and at midnight, and it is a draining task. What if, in this GenAI world, we had an AI assistant that works 24/7 and guides us through the chaos?

Enter CloudWatch Investigations – a generative AI-powered feature that’s changing how we handle incidents in AWS environments.

When something breaks, instead of you jumping between CloudWatch metrics, logs, deployment history, CloudTrail, X-Ray, and health dashboards, CloudWatch Investigations does the first round of detective work for you.

It uses generative AI to scan your system’s telemetry and quickly surface:

- the metrics that look suspicious
- the logs that matter
- recent deployments or config changes
- and even possible root-cause hypotheses, especially when multiple resources are involved

All of this is presented visually, so you can see how things are connected instead of guessing.

It's like having an extra team member who's been staring at your system architecture 24/7.

Let's see how to configure CloudWatch Investigations.

Getting Started

In the AWS Console, go to CloudWatch → AI Operations (left pane). If you are configuring the account for the first time, you need to complete a one-time setup:
- Create an investigation group
  - Configure retention days: how long you would like to keep investigations. Please note that the retention cannot be changed once it is configured.
  - Customise encryption: you can use a customer managed key for encryption, but make sure you grant the permissions needed to access the key.
  - Next, the setup creates a new role with the permissions required for investigations. These are read-only permissions. You can also create a new role here. By default, it uses AIOpsAssistantPolicy, AmazonRDSPerformanceInsightsFullAccess and AIOpsAssistantIncidentReportPolicy.
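If you would rather script this step than click through the console, the same settings can be sketched with boto3. This is a minimal sketch, assuming the `aiops` client and its `create_investigation_group` call; the group name, role ARN, key ARN and the 90-day retention are placeholder values of mine, not values from this demo.

```python
def build_investigation_group_request(name, role_arn, retention_days, kms_key_arn=None):
    """Assemble keyword arguments for aiops.create_investigation_group."""
    request = {
        "name": name,
        "roleArn": role_arn,                # role the investigation uses to read telemetry
        "retentionInDays": retention_days,  # choose carefully, see the note on retention above
    }
    if kms_key_arn:
        # Only sent when a customer managed key is wanted.
        request["encryptionConfiguration"] = {
            "type": "CUSTOMER_MANAGED_KMS_KEY",
            "kmsKeyId": kms_key_arn,
        }
    return request


def create_investigation_group(request, region="us-east-1"):
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("aiops", region_name=region)
    return client.create_investigation_group(**request)


# Placeholder values (assumptions): replace with your own before calling AWS.
request = build_investigation_group_request(
    name="demo-investigation-group",
    role_arn="arn:aws:iam::123456789012:role/demo-investigation-role",
    retention_days=90,
)
print(request)
```

Keeping the request builder separate from the AWS call makes it easy to review the settings before anything is created.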

Once the investigation group is created, you will see the optional Enhanced configuration.

Under Enhanced integration, you can configure:

- Tags related to your application. This helps CloudWatch narrow down the investigation to the relevant resources, which makes it a very useful setting.
- Access to CloudTrail events, to help CloudWatch Investigations better discover relevant change events.
- Optionally, X-Ray, Application Signals and EKS access entries.
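The tag and CloudTrail settings can also be applied programmatically. The sketch below assumes the boto3 `aiops` client's `update_investigation_group` call with its `tagKeyBoundaries` and `isCloudTrailEventHistoryEnabled` parameters; the group name and the `Application` tag key are hypothetical examples.

```python
def build_enhanced_config(group_identifier, tag_keys, enable_cloudtrail=True):
    """Assemble keyword arguments for aiops.update_investigation_group."""
    return {
        "identifier": group_identifier,       # name or ARN of the investigation group
        "tagKeyBoundaries": list(tag_keys),   # tag keys that scope the investigation
        "isCloudTrailEventHistoryEnabled": enable_cloudtrail,
    }


def apply_enhanced_config(config, region="us-east-1"):
    """Send the update to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("aiops", region_name=region)
    return client.update_investigation_group(**config)


# Hypothetical group name and tag key, for illustration only.
config = build_enhanced_config("demo-investigation-group", ["Application"])
print(config)
```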

DEMO

Now that the configuration is ready, let's start the demo. For this blog, I have a simple event booking app that works as follows.

A user books an appointment by providing details and selecting an available slot, and an admin approves or rejects the request.

Disruption in Application

For this demo, I modified the Lambda role and removed its KMS permission.

Now imagine the scenario: suddenly users start reporting that they cannot see the available slots and the app throws errors. The admin is also unable to see any appointments at all.

Since we know the application design, and the entry point for the application is CloudFront, we start by checking CloudFront and see an increase in 5xx errors.
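To make the disruption concrete, here is a small sketch of what "removing the KMS permission" means at the policy level. It only transforms a policy document in memory and does not call AWS. The sample policy, including the DynamoDB actions, is hypothetical and not the actual policy of the demo app.

```python
import copy


def remove_kms_actions(policy):
    """Return a copy of an IAM policy document with every kms:* action removed.

    Assumes "Statement" is a list. Statements without an "Action" key
    (for example ones using "NotAction") are kept unchanged.
    """
    stripped = copy.deepcopy(policy)
    kept = []
    for statement in stripped.get("Statement", []):
        if "Action" not in statement:
            kept.append(statement)
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        remaining = [a for a in actions if not a.lower().startswith("kms:")]
        if remaining:  # drop statements that only granted KMS actions
            statement["Action"] = remaining
            kept.append(statement)
    stripped["Statement"] = kept
    return stripped


# Hypothetical policy for illustration only.
original_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["dynamodb:GetItem", "dynamodb:Query", "kms:Decrypt"], "Resource": "*"},
        {"Effect": "Allow", "Action": "kms:GenerateDataKey", "Resource": "*"},
    ],
}
broken_policy = remove_kms_actions(original_policy)
print(broken_policy["Statement"])
```

With a policy like `broken_policy`, any call that needs to decrypt data with the key would be denied, which is the kind of failure the investigation is expected to find.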
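The same check can be done outside the console. The sketch below builds a CloudWatch `get_metric_data` query for the CloudFront `5xxErrorRate` metric. CloudFront publishes its metrics in us-east-1 with the dimensions `DistributionId` and `Region=Global`. The distribution ID, lookback window and period are placeholder assumptions.

```python
from datetime import datetime, timedelta, timezone


def build_5xx_query(distribution_id, end, lookback_minutes=60, period_seconds=300):
    """Assemble keyword arguments for cloudwatch.get_metric_data."""
    return {
        "MetricDataQueries": [
            {
                "Id": "cf5xx",  # query ids must start with a lowercase letter
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/CloudFront",
                        "MetricName": "5xxErrorRate",
                        "Dimensions": [
                            {"Name": "DistributionId", "Value": distribution_id},
                            {"Name": "Region", "Value": "Global"},
                        ],
                    },
                    "Period": period_seconds,
                    "Stat": "Average",
                },
                "ReturnData": True,
            }
        ],
        "StartTime": end - timedelta(minutes=lookback_minutes),
        "EndTime": end,
    }


def fetch_5xx_rate(query):
    """Run the query. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    client = boto3.client("cloudwatch", region_name="us-east-1")
    return client.get_metric_data(**query)["MetricDataResults"][0]["Values"]


# "EXAMPLEDISTID" is a placeholder, not a real distribution.
query = build_5xx_query("EXAMPLEDISTID", end=datetime.now(timezone.utc))
print(query["StartTime"], "->", query["EndTime"])
```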

Starting Investigation

From the 5xx metric in CloudWatch metrics, you can directly start an investigation into why the application is throwing 5xx errors.

It picks the timestamp automatically, or you can adjust the time from which the investigation should start.

Once the investigation is started, it takes roughly 10-15 minutes to finish, and you can view its progress along the way. In the meantime, you can communicate with users and the business or work on other activities in parallel.
When the investigation completed, it correctly pointed out what went wrong and why we were getting 5xx errors 🥳🥳🥳

As we can see under Root Cause Summary, an IAM configuration issue was causing the problem.

ANALYSIS: This failure pattern represents an IAM configuration issue rather than a service degradation, as evidenced by the specific KMS permission errors and the NEW occurrence pattern indicating a recent permission change affecting the eventap staging service components.

We have the root cause in 15 minutes. This is a huge advantage for anyone who works on production systems and needs to keep them running.

Going one step further, instead of checking metrics manually and then starting an investigation, we can put a CloudWatch alarm in place. As soon as the resource's metric goes into the ALARM state, a CloudWatch investigation starts automatically, and that is what I did.
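As a rough sketch of that automation, the snippet below builds a `put_metric_alarm` request on the CloudFront 5xx error rate and passes the investigation group's ARN as the alarm action. The alarm name, the 5 percent threshold, the distribution ID and the ARN placeholder are all assumptions of mine. Depending on your setup, the investigation group may also need a resource policy that allows CloudWatch alarms to start investigations.

```python
def build_5xx_alarm(alarm_name, distribution_id, investigation_group_arn, threshold_percent=5.0):
    """Assemble keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/CloudFront",
        "MetricName": "5xxErrorRate",
        "Dimensions": [
            {"Name": "DistributionId", "Value": distribution_id},
            {"Name": "Region", "Value": "Global"},
        ],
        "Statistic": "Average",
        "Period": 300,              # five-minute datapoints
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        # Going into ALARM triggers the action, here the investigation group.
        "AlarmActions": [investigation_group_arn],
    }


def create_alarm(alarm):
    """Create the alarm. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    # CloudFront metrics live in us-east-1, so the alarm is created there.
    boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)


# Placeholders: use your own distribution ID and investigation group ARN.
alarm = build_5xx_alarm(
    alarm_name="demo-cloudfront-5xx",
    distribution_id="EXAMPLEDISTID",
    investigation_group_arn="<your-investigation-group-arn>",
)
print(alarm["AlarmActions"])
```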

I hope this blog gives you a good idea of how you can get started with CloudWatch Investigations. There may be moments where you don’t fully agree with the AI’s suggestions — and that’s perfectly fine. You’re always in control. You can accept what makes sense, discard what doesn’t, and guide the investigation in the right direction.

The beauty is that you can start small with zero setup, and then gradually level up by adding richer telemetry, cross-account visibility, and automation runbooks. Over time, this leads to fewer guesswork-driven fixes, faster MTTR, and much calmer incident calls — even at 3 AM.

Instead of panic-driven troubleshooting and endless tab-hopping across metrics, logs, and dashboards, you get context first: what changed, what’s related, and what’s most likely broken.

Thanks for reading, and happy troubleshooting 🚀