🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
đź“– AWS re:Invent 2025 - From Alert to Resolution: Supercharge AWS Ops with the Agentic AI SRE (AIM225)
This video demonstrates NeuBird's AI SRE agent, Hawkeye, showing how it automates root cause analysis of production incidents by analyzing telemetry from observability tools such as Datadog, Splunk, and AWS CloudWatch. The presentation showcases Hawkeye's integration with coding assistants through MCP (Model Context Protocol), enabling SREs to investigate alerts, run deep analyses, and implement fixes directly in tools like Claude Code and Cursor without switching interfaces. Real customer examples include an insurance company reducing incident costs, a logistics firm managing Databricks pipelines, and an Australian bank handling backup failures. The newly launched self-service onboarding through AWS Marketplace offers a two-week free trial with pay-as-you-go pricing, distinguishing Hawkeye from siloed solutions by providing cross-platform analysis across diverse tech stacks.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introducing AI SRE: Automating Root Cause Analysis to Reduce Incident Resolution Time
Thank you. Welcome, everybody. So you know I'm wearing my agentic shirt today. Are you all wearing your agentic outfits? Everybody's agentic, but today I'm going to present an actual agent, actually two agents working together, which I'm super excited about. So the AI SRE is all about solving problems in production. Everybody has observability. You have all this data, but you need to solve problems under pressure, and you have to go through a lot of data to do that. So that's the topic today. We're going to see how agents can help analyze the root cause of issues and really reduce your mean time to resolution by automating the process of investigating the root cause. Then I'll show you how you can work with your coding agent, Claude Code, Cursor, GitHub Copilot, and so on, to actually implement the resolution recommended by our agent that does the analysis. This is delivering massive productivity gains to our customers, but also fewer outages and better system availability and uptime. The impact of incidents can be very costly, and that's what an agent can address. That's really driving massive savings.
One note is that our customers are deploying this without changing any of their existing tools. So you keep your observability, your incident management. People work with the tools they're used to, and here we're having an agent that's going to help them. So that's just a few examples. We have an insurance company that for them it's all about reducing the cost of incidents. They have a complex on-premises and AWS footprint, and issues that span those can all be analyzed by Hawkeye. Hawkeye is the name of our agent. The company is called NeuBird. And Hawkeye can deploy as a SaaS product or in your VPC. So if you're a security-conscious company and you want control over your AI, we can actually deploy Hawkeye in your VPC. That's what this insurance company is doing, and it's connected to their on-premises telemetry.
We have a shipping and logistics company where for them it's all about managing data pipelines: massive numbers of nightly jobs running on Databricks, with Flink jobs running on EKS, and they have to make sure all of these run on time. So for them it's all about handling all the alerts that come off of these data pipelines and nightly jobs. The last example is a bank in Australia. They now use us across a lot of use cases, but the first one was as simple as backup management. If you're a bank and you have a lot of backups, you need to explain why they're failing and how you're going to prevent them from failing again, because there are regulations around that.
So those are some quick examples, and really, again, it's about helping this person. We call it an AI SRE, but it's serving the whole company. Usually there's an on-call team, who are maybe not the SREs, but they pull in the SREs when they don't know how to solve the problem. And you end up with all these people sifting through all the information, because you have this dynamic cloud environment with lots of layers, lots of telemetry, lots of alerts. Then inevitably you have lots of people jumping on a call. Have you been on one of those calls where there are thirty people and it's interrupting your day? It's stressful. Nobody knows why they're there. There's finger-pointing, potentially (hopefully not). So what if you could just have a very clean explanation of what's going on? That's really what we're doing. We're taking you from a manual workflow, where people log into Datadog, Splunk, Grafana, CloudWatch, and so on, and do the mental work of looking at dashboards and trying to explain what's going on, so they can resolve it.
So we're taking this manual workflow and automating it with an agent that can trigger automatically, connect to your telemetry, explain exactly what's going on, and provide a very easy-to-read summary for the user. So anybody want to see that?
Hawkeye in Action: Deep Telemetry Analysis and Integration with Coding Agents via MCP
So this is Hawkeye. We're looking at a dashboard where on the left we have incidents and alerts that have not been investigated, and on the right we're seeing incidents and alerts that have been investigated. I'm going to show you the information that we're providing in an alert context. You can see there are some user-initiated and automated sessions. Let me look at this one. I'm going to pick one at random. This one happens to be a Datadog monitor.
I have a tremendous amount of context here for the agent to work with because I'm triggering automatically off of an alert. Alerts have rich context for the agent to do the work. The work happens in the background, and what you're seeing here can be delivered wherever the team is working. It could be on Slack, it could be on Teams, it could be in your coding agent, or it could be in ServiceNow. Basically, I think the future is headless. We have a UI here where you can consult the information, but you should be able to consume this information no matter where you work without necessarily coming to our user interface.
In here I can see the summary, the incident description, the application artifacts, timeline of events, the evidence, the root cause, and the corrective actions. Contrast that with just the alert. You know, in ServiceNow you might see it represented. There are some AI SRE tools that are basically regurgitating and summarizing the alert and presenting it to you, but what we're doing here is we're digging deep into the actual data in your observability tools: the traces, the telemetry, the metrics, the logs, and also your config. We can log into your AWS console and see all the same data that you see there, how your security groups are configured, how your EKS cluster is configured. We could use kubectl to go and interrogate the Kubernetes cluster. So we're doing what an SRE would do to find a root cause, not just summarizing what we're seeing in the alert.
All this work is available here. Typically the workflow is you just get the conclusion and the corrective actions, you take those actions, and you close the ticket. But sometimes you have to dig deeper. Of course, you have a budget for investigations. By default, we just go and do what we call one prompt cycle. Here is a chain of thought. We have about five or six questions that we're going to go through, and to answer each of them we're going to consult a series of data sources. You can see here the full list of data sources available for this particular investigation. There were a lot of them: config, traces, logs, metrics.
All consulted in order to provide this very deep and rich analysis. At every step I have all this analysis, and that is what gets summarized in the RCA itself that gets shipped off to your Slack channel and so on. Something that I'm very excited about is this new capability that I'm going to show you that we just released, which allows you to work in your coding assistant. So here I'm showing you Claude Code. I think real SREs don't like UIs, that's my theory. They work in a shell and they're really good with scripting and working in shells. If you're that kind of person and you're looking to leverage AI to do your work, I would say go to Claude Code. That should be your first step. I'm sure most of you have already taken that step. It's extremely powerful because it can use all your available CLI tools that you have on your system.
It has this concept of an MCP server that you can add to it. MCP stands for Model Context Protocol. Think of it as an API that the LLM can automatically understand and interrogate, and it's basically self-documenting. So you can plug in these tools and you can start plugging in a bunch of MCP servers,
but at some point, the more you plug in, the more you get diminishing returns. What we're doing is taking your Datadog connection, your AWS information, your CI/CD pipeline, your GitHub connection; all of that is managed by Hawkeye. Hawkeye provides one MCP server to the coding assistant that can basically orchestrate everything I showed you earlier. The whole investigation can learn over time and do all these things, but now I can talk to that one MCP server as an SRE and say, okay, I want to do things with this MCP server. It's self-documenting.
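As a concrete sketch, registering a remote MCP server with a coding assistant typically comes down to a small config entry. The server name, URL, and auth header below are hypothetical placeholders, not NeuBird's actual endpoint; Claude Code, for example, can read project-scoped servers from a `.mcp.json` file shaped roughly like this:

```json
{
  "mcpServers": {
    "hawkeye": {
      "type": "http",
      "url": "https://example.invalid/mcp",
      "headers": {
        "Authorization": "Bearer <your-api-token>"
      }
    }
  }
}
```

Once registered, the assistant can discover the server's tools (listing alerts, starting investigations, fetching results) without per-tool wiring, because MCP servers describe their own tools to the client; that is the "self-documenting" property mentioned above.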
You can actually ask it what it does and how you can use this MCP server. It can help you with the onboarding process. You have all these typical workflows: listing alerts, starting an investigation, monitoring the progress of an investigation, and getting the results of an investigation. I use this extensively. We do evaluations and POCs, so I can download all root cause analyses and then run queries in my coding assistant across all the RCAs for the last week. Maybe I'm looking for patterns or things that keep happening. I want to analyze all my alerts and find the thing that I should actually be fixing as a site reliability engineer.
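To illustrate that "query the RCAs for patterns" workflow, here is a minimal sketch in Python. The field names (`alert`, `root_cause`) and the sample records are hypothetical stand-ins for whatever shape a real RCA export takes; the point is simply ranking recurring root causes across a batch of investigations.

```python
from collections import Counter

def top_recurring_causes(rcas, n=3):
    """Rank root-cause labels by how often they recur across RCA summaries."""
    counts = Counter(r["root_cause"] for r in rcas if "root_cause" in r)
    return counts.most_common(n)

# Hypothetical RCA summaries, shaped like a week's worth of exports.
rcas = [
    {"alert": "api-5xx", "root_cause": "pod OOMKilled"},
    {"alert": "checkout-latency", "root_cause": "db connection pool exhausted"},
    {"alert": "api-5xx", "root_cause": "pod OOMKilled"},
    {"alert": "nightly-etl-late", "root_cause": "spot instance reclaimed"},
    {"alert": "api-5xx", "root_cause": "pod OOMKilled"},
]

# The most frequent cause surfaces first: fix that, not each individual alert.
print(top_recurring_causes(rcas, n=2))
# → [('pod OOMKilled', 3), ('db connection pool exhausted', 1)]
```

In practice a coding agent would generate and run this kind of ad-hoc analysis for you from a prompt, which is exactly the workflow described above.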
Our job is not just to fight fires; it's to prevent them. With my coding agent and this deep analytic capability of Hawkeye that can look at all my telemetry, I can prompt Hawkeye to do all this deep research, extract all this information, and then process this information in my coding agent. I can also automate the remediation steps recommended by Hawkeye. This is what you're seeing here: a set of things that I go through normally. How do I create an Azure DevOps connection? It's helping me onboard Hawkeye and showing me exactly what I need to do. Because it's connected to my AWS CLI, I can ask it to create this role that Hawkeye needs, so I can do the whole workflow right here.
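For context on what "create this role" usually involves: a vendor agent that reads your AWS telemetry typically assumes a cross-account IAM role, and the trust policy on that role looks roughly like the sketch below. The account ID and external ID here are placeholders, not NeuBird's real values (those would come from the vendor's onboarding flow), and the permissions policy you attach would scope the role to read-only telemetry access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "example-external-id" }
      }
    }
  ]
}
```

The `sts:ExternalId` condition is the standard guard against the confused-deputy problem when granting a third party access to your account, which is why onboarding flows like this ask you to set one.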
Onboarding and setting up my project basically takes the connections and combines them together. You can also add instructions, because our systems are very complex. Hawkeye can do a tremendous job of understanding the dependencies and discovering the configuration, but sometimes, just like when you onboard a very smart SRE, that person needs a bit of help with instructions. The same thing can be done with Hawkeye. I even use Claude Code to create those instructions sometimes, because I can use it to create really great context for Hawkeye to use in its investigations.
I do all types of work here. I can summarize all my past investigations and look at the RCAs. Usually at some point I need to get back into the user interface, and I can ask for a link right here. I can just jump back into the UI if I prefer a UI. As I mentioned earlier, you can always ask follow-up questions on an investigation, and Hawkeye actually recommends some questions. You can set it up so that it automatically answers those questions. Think of it as having a budget for investigations. For some alerts, you're going to say just keep going until you think you're done. By default, we go one round: five questions, one chain of thought, then we stop and summarize.
But we can automate the follow-up question, or you can have your team ask that follow-up question in Claude Code, in Cursor, in Slack, on Teams, wherever they want to work. That's the vision for working with an agent: you have agents working with other agents through MCP. You have people working with agents without necessarily using a new tool. I'm on Slack just talking to this agent. It's providing me useful information. I'm asking follow-up questions. I'm not really using a new tool.
Getting Started with NeuBird: Self-Service Onboarding and Cross-Platform Capabilities
There's something in the background that's helping me, and that's kind of the vision that we have for agentic SREs. There are a lot of things that SREs need to do, and we're connected to so many other teams—DBAs and others. We're going to have a lot of different agents doing all these pieces of the work. I think a great model is to use a coding agent as your main interface where you're basically interacting with all these different agents that help you do your job. We have a lot of traction in this space and a lot of excitement around this.
We also just launched yesterday. I don't know if you saw the news, but we have now launched self-service onboarding so you can go to your AWS account in the marketplace and say I want to start using Hawkeye, and we will automatically provision a SaaS instance of Hawkeye for you. You will get a two-week trial, and you're up and running within five minutes. You can start using Hawkeye and you pay as you go. There's no commitment—just go to the AWS console, go to your marketplace, find NeuBird there, and start using it right away. Within fifteen to twenty minutes, you'll be actually running deep analysis on your alerts.
You can configure AWS CloudWatch, Datadog, and Dynatrace, and we have Grafana available there as well. For enterprise customers, we do private offers and deploy in your VPC. We use your LLMs, and we can connect to any observability tool and any incident management tool of your choice. But those who just want to get started with self-service can simply connect and get going right away with our SaaS offering.
We're basically connecting to everything under the sun. You'll see a lot of different solutions out there with people claiming they have an AI SRE. Amazon just launched theirs yesterday, and we're excited to have Amazon join the effort to help SREs. What you'll find is that a lot of them are kind of siloed. You've got one for AWS CloudWatch, one for Datadog (Datadog has their Bits AI), and Dynatrace has their agent. We believe that most organizations have more than one source of data, and our agent goes across all of that.
A lot of people use CloudWatch for some of the Amazon services, Prometheus to monitor the metrics for Kubernetes, and Datadog for some other service. It's usually a patchwork, unless you're super consolidated. And even if you're consolidated, most of these agents are limited. We can actually look at configs, we can use kubectl, and we can bring in MCP servers if you have your own data that you want to provide. So we believe that you need a kind of Swiss Army knife that orchestrates your incident response across all these tools to help you along and make sure that you look like this guy, not the guy we saw earlier.
Super happy now: you're just having a great time working with your coding agent and Hawkeye. So let me skip through some of this. I just want to leave you here. I have one minute and eighteen seconds, and I think I'm going to finish early, but I really encourage you to go and onboard, sign up, and just try it and tell us what you think. We're the first to be bold enough to open this up publicly for everybody to use, which shows you the maturity of our offering: we're confident this is going to work for you. So please take advantage of it. It's free to get started. You get two weeks of free investigations, and then you just pay as you go. There's no commitment. Just consume it through the AWS Marketplace, and we would love to hear your feedback.
We're at booth 1219 if you want to learn more. We've got a café behind the Datadog slide where a lot of us are hanging out, and you'll see the NeuBird branding there. Come and see us and we'd love to speak to you about what your vision is and how we can help you get there. Thank you very much, and enjoy the show.
This article is entirely auto-generated using Amazon Bedrock.