AWS-native incident investigation PoC

#aws #sre #aiops #agents

I open-sourced aws-incident-investigator - an AWS-native incident investigation PoC built around one question:

How can we use AI in incident analysis without turning the system into a black box?

aws-incident-investigator demonstrates how to build a credible, cost-aware, deterministic-first approach to AI-assisted root-cause analysis.

The flow is intentionally deterministic-first:

An operator triggers an investigation with a saved incident context and time window.
Step Functions orchestrates scoped evidence collection (metrics, logs, traces) in parallel.
A deterministic hypothesis builder ranks candidate root causes from the combined evidence.
Amazon Bedrock then acts as a bounded AI advisory layer over the shortlisted hypotheses: it compares competing hypotheses, assesses their plausibility, identifies missing evidence, and suggests next investigative actions.
A final report is assembled and rendered in the UI.
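
To make the deterministic ranking step concrete, here is a minimal sketch of how a rule-based hypothesis builder could score candidate causes from combined evidence. The signal names, weights, and the `rank_hypotheses` function are illustrative assumptions, not the repo's actual rules:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str      # "metrics", "logs", or "traces"
    signal: str      # hypothetical signal name, e.g. "5xx_spike"
    strength: float  # 0.0 to 1.0, how strongly the signal was observed


# Hypothetical rule table: each candidate cause lists the signals that
# would support it and how much each signal contributes to the score.
RULES = {
    "bad_deployment": {"5xx_spike": 0.5, "new_error_signature": 0.5},
    "downstream_dependency": {"dependency_latency": 0.6, "timeout_errors": 0.4},
    "resource_saturation": {"cpu_saturation": 0.7, "p99_latency_increase": 0.3},
}


def rank_hypotheses(evidence, rules=RULES):
    """Score every candidate cause from observed signals; no AI involved."""
    # Keep the strongest observation per signal, regardless of input order.
    observed = {}
    for item in evidence:
        observed[item.signal] = max(observed.get(item.signal, 0.0), item.strength)

    ranked = []
    for cause, weights in rules.items():
        score = sum(w * observed.get(sig, 0.0) for sig, w in weights.items())
        if score > 0:
            ranked.append({
                "cause": cause,
                "score": round(score, 3),
                "supporting": sorted(s for s in weights if s in observed),
            })
    # Sort by score, then by name, so ties always resolve the same way.
    ranked.sort(key=lambda h: (-h["score"], h["cause"]))
    return ranked


evidence = [
    Evidence("metrics", "5xx_spike", 0.9),
    Evidence("logs", "new_error_signature", 0.8),
    Evidence("traces", "dependency_latency", 0.5),
]
for hypothesis in rank_hypotheses(evidence):
    print(hypothesis)
```

Because the score is a fixed function of the evidence, the same inputs always produce the same ranking, and each hypothesis carries the signals that support it.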

Using AI for everything is expensive, can be slow, and is often hard to explain and audit. This project keeps evidence collection and hypothesis ranking fully deterministic - AI is introduced only at the evaluation layer, where it adds genuine value: comparing competing causes, explaining ambiguity, and surfacing missing evidence. The result is a system that is explainable, cost-efficient, and reliable even when AI is unavailable.
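
One way to picture the "reliable even when AI is unavailable" property is a report builder that treats the AI step as optional. This is only a sketch: `build_report` and the `advisor` callable are hypothetical names, and in a real setup the advisor would wrap a Bedrock model call:

```python
def build_report(ranked, advisor=None, shortlist_size=3):
    """Assemble a report; the AI advisor can annotate it but never reorder it."""
    shortlist = ranked[:shortlist_size]  # bound what the AI layer gets to see
    report = {"hypotheses": shortlist, "ai_assessment": None, "ai_status": "skipped"}
    if advisor is None or not shortlist:
        return report
    try:
        # Hypothetical advisor: takes the shortlist, returns free-form commentary.
        report["ai_assessment"] = advisor(shortlist)
        report["ai_status"] = "ok"
    except Exception as exc:
        # The deterministic ranking is still returned if the AI call fails.
        report["ai_status"] = f"unavailable: {exc}"
    return report


def failing_advisor(shortlist):
    raise TimeoutError("model endpoint timed out")


ranked = [
    {"cause": "bad_deployment", "score": 0.85},
    {"cause": "downstream_dependency", "score": 0.3},
]
print(build_report(ranked, advisor=failing_advisor))
```

The design choice here is that the AI output lives in its own field, so a reader can always tell which parts of the report came from deterministic evidence and which came from the model.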

This repo is already public, and I’ve been testing it on several incident flows, including:

• error spike incidents
• latency degradation incidents

You can see examples of investigation results in this post’s images or in the repo itself.

Right now, investigations are triggered from the web app via API Gateway. The project can be extended to start from alerts through EventBridge, and to collect more deterministic evidence such as recent changes from CloudTrail. More ideas are listed in the repo:
https://lnkd.in/d8u4-gJe
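
As a rough idea of what the EventBridge extension could look like, the sketch below maps a CloudWatch alarm state change event to an investigation request. The output field names (`incident_id`, `window_start`, and so on) and the 30-minute lookback are assumptions for illustration, not the project's actual input schema:

```python
from datetime import datetime, timedelta


def investigation_input_from_alarm(event, lookback_minutes=30):
    """Turn an EventBridge alarm event into a hypothetical investigation request."""
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("state", {}).get("value") != "ALARM":
        return None  # ignore transitions back to OK or INSUFFICIENT_DATA

    # EventBridge timestamps are ISO 8601 in UTC with a trailing "Z".
    end = datetime.fromisoformat(event["time"].replace("Z", "+00:00"))
    start = end - timedelta(minutes=lookback_minutes)
    return {
        "incident_id": f"{detail['alarmName']}-{end.strftime('%Y%m%dT%H%M%SZ')}",
        "trigger": "eventbridge",
        "window_start": start.isoformat(),
        "window_end": end.isoformat(),
    }


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "time": "2024-05-01T12:00:00Z",
    "detail": {"alarmName": "checkout-5xx", "state": {"value": "ALARM"}},
}
print(investigation_input_from_alarm(sample_event))
```

A Lambda target on the EventBridge rule could serialize this dictionary to JSON and pass it as the execution input when starting the Step Functions state machine.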

It would be great to hear your thoughts about it :)