DEV Community

Cover image for AWS re:Invent 2025 - Accelerating incident response through AIOps (COP334)
Kazuya


AWS re:Invent 2025 - Accelerating incident response through AIOps (COP334)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Accelerating incident response through AIOps (COP334)

In this video, Pratul Chakre and Andres Silva present AWS CloudWatch's AI-powered operations capabilities using Formula One racing as an analogy. They demonstrate unified data management for CloudWatch, which centralizes telemetry across accounts and regions, supporting OpenTelemetry and OCSF standards. Key features include natural language query generation, MCP servers for CloudWatch that enable bidirectional AI communication, and CloudWatch Investigations—an AI tool that automatically surfaces metrics, logs, and traces to expedite incident resolution. The Kindle team achieved 80% time savings using Investigations. They also introduce the AWS DevOps Agent, a frontier agent for proactive incident prevention. The session includes live demos showing how these tools reduce mean time to resolution from hours to minutes, with practical examples of auto-generating alarms and incident reports.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Accelerating Incident Response Through AIOps with a Formula One Analogy

Good afternoon, everyone. Can everybody hear me okay? You need to have your headsets on if you want to hear me. Good, all good. Thank you. Fantastic. Yes, you are in the right session. The session is accelerating incident response through AIOps. Let's start with some introductions. My name is Pratul Chakre. I lead the global worldwide team for cloud operations specialists based here in the US. Andres is my sidekick. My name is Andres Silva. I'm a principal specialist solutions architect with the cloud operations team, and I focus on observability, helping customers adopt observability strategies. Throughout the last year, I've been focusing on AIOps, so we have a lot of exciting things to share with you today.

What we've put together is probably the most high-octane session that you will attend at re:Invent. If that's not the case, please come back and see me and let me know so we can do better next year. When Andres and I sat down to talk about what to present in this session, we wanted to take a different approach and not just present all the cool stuff, which we will of course, but weave it into a story. When we were thinking about the story, we thought, what's better than the recently concluded Formula One Las Vegas Grand Prix? Any Formula One fans here? Show of hands? Okay, fine, so we'll have some work to do along the way. What we really wanted to showcase was how Formula One has evolved using technology to get to where they are today, and they're continuously doing so. The biggest difference between organizations that are successful and organizations that are not so successful is how they evolve and embrace technology to drive innovation within their enterprise, especially in operations.

Thumbnail 120

What we will be talking about today is the race strategy. A race strategy is what you put together to understand during the offseason what you're going to do, at the start of the season what you're going to do, mid-season, and then on the day of the race what are the kinds of things that you'll be focusing on. So we'll talk about all of the AI innovations in CloudWatch, CloudWatch being our flagship AWS service for observability. We'll talk about MCP servers and agentic frameworks. We'll talk about CloudWatch investigations, and last but not least, we will also talk about agentic AI and agentic frameworks. There is no session that's complete without talking about this. We are going to talk about the AWS DevOps Agent, which was announced by Matt Garman earlier today.

Thumbnail 160

CloudWatch Evolution: From Basic Logs to Unified Telemetry at Exabyte Scale

But first, and this is interesting: I've been with Amazon for ten years, and I've spoken to hundreds, if not close to a thousand, customers over those ten years. Every single time, or most of the time, I hear, "I didn't know CloudWatch could do that." So what I wanted to do was quickly talk about how innovation in CloudWatch has evolved, starting in 2014 with just logs all the way to where we are today. We launched Live Tail for logs in 2023, one of the most sought-after features for developers and infrastructure engineers to be able to look at logs in near real time.

Thumbnail 190

Thumbnail 200

Thumbnail 220

In 2024, we made further improvements to Live Tail, but we also launched Database Insights, providing specific curated views into databases. We also launched Transaction Search and Analytics. Today, we've launched a whole bunch of things, but I'm just going to pick out a few. Number one, we've centralized all of our logs across accounts and regions into one account or region. This was one of the most sought-after and requested features from our customers. We now have Application Map. This is extremely important for customers to understand their application-level dependencies, whether or not the application is instrumented, so you can actually see an entire map across your application.

We also have generative AI observability, something that Matt Garman announced today: when you build your generative AI applications, whether on Bedrock or self-hosted on EKS, how do you monitor the applications or the agents that you're building across the enterprise? That's some of the cool stuff we've launched over the past year. CloudWatch has evolved a lot, and this evolution has helped us to monitor at scale. In case you didn't know, one of the biggest customers for CloudWatch and AWS observability is Amazon itself. All of Amazon and all of AWS use CloudWatch for their observability. We support seventeen exabytes of CloudWatch logs per month, thirty-two quadrillion metric observations per month, and nine hundred thirty-two million canary runs per month.

Thumbnail 290

We're processing 932 million canary runs per month. I don't know about you, but when it goes beyond gigabytes and petabytes, I kind of lose track of how many zeros we're talking about. So I put that up for my own reference: an exabyte has 18 zeros. That's the scale at which we are operating, and that's what we are supporting.
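To keep those orders of magnitude straight, here's a quick sketch of the decimal (SI) byte units the "18 zeros" remark refers to:

```python
# Decimal (SI) byte units, for keeping the zeros straight.
KB, MB, GB = 10**3, 10**6, 10**9
TB, PB, EB = 10**12, 10**15, 10**18

monthly_logs = 17 * EB                  # 17 exabytes of logs per month
print(f"{monthly_logs:,} bytes")        # 17 followed by 18 zeros
print(monthly_logs // PB, "petabytes")  # the same volume in petabytes
```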

Thumbnail 330

How are we able to do that today? One of the things we were extremely clear about, as we heard from our customers when we were talking internally, is that AWS needs to be the home of all telemetry. If you are not able to get all of the data into a single place and you have disparate data sources spread across your entire environment, it is extremely difficult to build context and get insights based on that context from data that is very disparate.

CloudWatch has now become the home of telemetry. Building curated experiences is also critical. What an application developer might need is very different from what an infrastructure engineer might need or what a database engineer might want to see as part of their metrics and observability. Building curated experiences to help each persona with the use case they are responsible for is extremely important. And then, last but not least, implementing AIOps: how do you offload mundane, manual, repetitive tasks to AI so you can focus on what's important for your business?

Thumbnail 400

The Formula One Pit Stop: How 96% Improvement in Speed Mirrors Cloud Operations Transformation

Let me take you back to a Formula One analogy. Do you know how long an average pit stop takes? A pit stop is a pre-planned stop that the driver takes in order to either refuel gas or change tires, which are the two most common use cases, in order to keep the race running for the number of laps. So watch this video. There will be a question at the end of it.

Thumbnail 430

Thumbnail 440

Thumbnail 450

Thumbnail 470

Hamilton is going faster. K1, K1 to gain more. Racing two cars, push now. This is the undercut: Hamilton going faster on that new set of tires than Vettel is able to because he's on an older set of tires. Hamilton is just coming around the final turn as the Ferrari pit executes a very nice pit stop indeed. Medium compound tires going on to Sebastian Vettel's Ferrari. Lewis Hamilton goes past the commentary box at a rate of knots. Vettel at 80 kilometers an hour. Hamilton has got the jump on Sebastian Vettel by performing the undercut, and it's Hamilton now ahead of the Ferrari.

Thumbnail 500

Thumbnail 510

Excellent. How many of you saw that Lewis Hamilton's pit stop took 2.1 seconds? This time is so important that it can make or break a race; it can determine the podium position a driver achieves. How many of you saw the predicted gap analysis of when Hamilton would stop relative to Sebastian Vettel, the driver behind him, and how far ahead of Vettel Hamilton would come out? All of this is collecting telemetry and doing the analysis at runtime, not even near real time, at runtime, in order to make the right decision of when to stop and when to make those changes.

Thumbnail 550

Thumbnail 560

Thumbnail 580

On average, it takes about 2.3 seconds, even though Hamilton did better. On average, it takes 2.3 seconds for a race driver to pit stop. But they didn't just get here. Over the years, Formula One has made progressive improvements, going from when they would use hammers in order to change their tires all the way down to AI prediction today. That has gotten them from 67 seconds back in the 1950s to 2.3 seconds today. That is a 96 percent improvement in how they conduct a pit stop. That's the strategy. That's what makes or breaks a race.

How do they do that? What did they change? It really revolves around three things: adopting the latest and greatest technology that's available. They went from hammers to pneumatic wheel guns, which immediately gave them five times faster ability to change the tires.

Thumbnail 640

They went even more data-driven. They collect data from 300 sensors per car. They're collecting weather information, racetrack information, all the telemetry they need to make the right decisions in order to win the championship. And the process, the personas: every persona is choreographed to do the exact same thing every single time the car stops during a pit stop. They practice this like a symphony, over and over again, so that they don't make a mistake when it really counts.

So why should it be any different when you drive your cloud AIOps implementation? The technology is now available, and we will show you how this technology can help you get there. It is data-driven: once you're able to collect all your telemetry across all your environments into one single place, you can make those data-driven decisions. And it enables the process of getting AI to assist you in making decisions, what we essentially call human-led, AI-assisted decision making, driven by data. And then continuously learning from past experiences in order to improve and further reduce your MTTR, or mean time to resolution.

Thumbnail 700

Think of that pit stop as planned downtime, which we all have in some shape or form. We practice it, we choreograph it. But what happens when an incident occurs in our world, or, in the case of Formula One, there's an unscheduled pit stop? An unscheduled pit stop happens because something broke off the car and needs to be quickly replaced: a tire came off, or the tires were not working properly and need to be changed again. Even then, an unscheduled pit stop takes only about 40-plus seconds.

Now imagine your pager goes off at 3 a.m. You now have an incident. I don't know why, but everybody loves a 3 a.m. analogy for pagers going off. But here we are. Your pager goes off at 3 a.m., which is the unscheduled pit stop in our world. That incident takes four hours to resolve, and now you've lost thousands, if not millions, of dollars in realized revenue because your customers couldn't log in or couldn't process their data the way they wanted.

So what if your operational pit stop, whether scheduled or unscheduled, could get into minutes or even seconds instead of hours? That's where AI ops plays a big role. And with that, I'm going to hand it over to Andres to take us to the next set of slides.

Thumbnail 800

Defining AIOps: The Journey from Monitoring to Autonomous Operations

We've talked a lot about the Formula One analogy, which if you didn't know how to recognize a Formula One fan, now you know. But I think it's very good. But before we start diving deep into this, I think it's fair that we define what AI ops means, because it could mean different things to different people.

Thumbnail 830

When we define AIOps, we're talking about using artificial intelligence and machine learning to enhance, accelerate, and automate the cloud operations process; it's there to give you superpowers. This is not something that's going to do all the work for you (hopefully we'll get there, but we're not there yet), but we're discussing AIOps in the context that you see there.

In the Formula One analogy, the evolution from hammers and all that has taken them to where they are today, where they can do a pit stop in 2.3 seconds, which is amazing. You can see a lot of parallel things in their evolution and the evolution that our customers need to do in order to take advantage of this technology. Formula One started with telemetry. They then started doing simulations and strategy, then they went to making real-time adjustments, and I'll give you an interesting example later on about when that was implemented. And now we're at the point where they're making AI decisions. When something happens, they can go in and tweak things. Super powerful.

So in order to manage infrastructure in the cloud and adopt AI ops, you have to go through a similar journey. We all started with monitoring, and then maybe we started doing predictive analysis that has been available for quite some time, anomaly detection. Remember setting up that base pattern and then detecting anomalies. Now we're at the time where we can do auto remediation and we can do a lot of other cool things with generative AI.

Thumbnail 920

Remediation is just the beginning. Of course, autonomous ops is the holy grail—that's where we want to be. That has to be the journey we take to adopt and benefit from this technology. I'll take a quick moment to highlight the partnership we have with Formula One and their platform F1 Insights. It's a very good example of how you can continue to adopt technology and improve not only your operations, but in this case, their business, which is the key thing.

Thumbnail 940

To help you understand this and create a mental framework, we decided to continue our Formula One analogy. We're going to talk about four main things: the pit crew, which consists of all those people who help you win a race. We're going to talk about tire gunners, jack operators, strategists, and race engineers. We're going to map those to different services and features that can help you on this journey.

Thumbnail 970

Thumbnail 1000

Tire Gunners: Foundational AI Innovations with Unified Data Management for CloudWatch

Let's get started with tire gunners. A tire gunner is not somebody who shoots tires. It's actually the person with the big pneumatic pistol who races to take out the big nut in the middle of the wheel very quickly. If you go to the Caesars Forum now, they have the sports forum where you can actually play with a pneumatic gun, which is pretty cool. It may seem simple, but it is foundational. When we talk about foundational innovations, we have to talk about some of the innovations we've done with CloudWatch. Interestingly, in F1 Insights, the Formula One team uses CloudWatch telemetry to monitor their LLM inference infrastructure and the infrastructure components. That goes back to the point that it is foundational.

Thumbnail 1020

Let's talk briefly about some of the AI innovations and base-layer innovations we've done. We're going to talk about something that was announced today, 25 launches, which is very exciting. This includes the unified data story and unified telemetry for CloudWatch, as well as natural language query generation and a couple of other things. Let me dive into it, and then I'll do a demo, because demos are fun.

Thumbnail 1050

Thumbnail 1070

First, to set the stage, an F1 car has an immense amount of data coming in that the teams have to process. Just to give you an idea, in a typical race, there are more than 5 billion data points coming in across all cars. That is a lot of data. Something similar happens with our infrastructure. We have a lot of data coming in: infrastructure logs, application data, application logs, databases, streaming services, you name it. Right now, a lot of that data coming from different places ends up in different places to be analyzed, which becomes a hassle. It doesn't empower you to leverage artificial intelligence for operations, because the data is scattered across all those places. You want to centralize your data.

Thumbnail 1110

We're very excited that this morning we launched and announced something we've been working on for quite some time now, easily a year or so. This is unified data management for CloudWatch. What does this mean? CloudWatch is a very powerful telemetry platform that we use at AWS. But now we're creating an additional experience that allows you to unify three main use cases: operational, security, and compliance. This means we're empowering our customers to bring all that data into CloudWatch. We're providing an enriched interface on how to query and search that data, and we're also providing features like the analytics tiers and the telemetry pipeline. The goal of those is to allow our customers to put all their data in one place, and that's going to be the foundation. Then on top of that, we can run effective AI operations.

Here's why it makes a difference. Just like we saw in the Formula One example, with about 5 billion data points coming from all the different cars: it's the same set of data and the same telemetry being collected, just coming from different cars. Similarly, for our applications, whether it's security data, compliance data, or observability telemetry, it's coming from different instances, servers, Kubernetes, serverless, or whatever technology you use under the hood to power your applications. It's coming from multiple instances of those same applications, and to be able to run the level of analytics and decision-making that Formula One does, we need to centralize that data.

Some of the key features I want to highlight include the ability to enable all data sources from a single place. AWS vended data sources are now accessible in a unified way. For example, imagine you have an organization with 400 accounts and you want to enable VPC flow logs across your entire organization in a consistent manner. You can create a rule that enables VPC flow logs everywhere for you consistently.

We are providing connectors for third-party sources, allowing you to bring in, for example, your CrowdStrike data. We launched with about 12 connectors, and the goal is to continue increasing that number. You now have a full telemetry pipeline that allows you to ingest data and transform it into the format you want so that you can consume it. Along with all the analytics features, which would take the whole session to describe, this is incredibly powerful, which is why we wanted to include it in this session. The best way to demonstrate this is through a demo.

Before we get into that test drive, one last but not least feature: support for open standards. What we have learned from customers over the years is that many want to use open standards to store their telemetry. We are supporting OpenTelemetry for observability data, and we also support the Open Cybersecurity Schema Framework, or OCSF, for security data. Now you can send either of those formats to us in CloudWatch, whether it is security or observability data.

Thumbnail 1330

Thumbnail 1350

Thumbnail 1360

Demonstrating Data Sources, S3 Table Integration, and Telemetry Pipeline Creation

If you have used CloudWatch in the logs console in the logs section, you will see there is a new feature called log management. Before it used to say log groups, and there is a subtle difference. Now when you go in there, you will have three tabs that provide additional context into what is going on and the enhancements we are providing. The first one I want to call out is data sources. This shows you all the different data sources that CloudWatch is tracking. You can see here that I have VPC flow logs, Route 53 logs, and CloudTrail events. As you add more data sources, either through enabling them via centralized enablement or by ingesting data from a third party, you will see your full list of data sources here.

Another thing you can do very quickly here, and this is something I forgot to call out, is that we are providing S3 Tables integration with this new set of features. What does that mean? You can take a specific data source and make it available as an S3 table. Now you can read that data through Apache Iceberg and Athena. Imagine the possibilities of taking all the logs you have there and incorporating them into any large-scale analytics pipeline you want, or just using artificial intelligence on it. This is incredibly powerful and something we think will make a big difference.
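To make the Iceberg/Athena point concrete, here's a hypothetical Athena query against a log data source exposed as an S3 table. The database, table, and column names are illustrative assumptions; the actual schema depends on the data source you expose:

```sql
-- Hypothetical Athena query over a CloudTrail data source exposed as an
-- Iceberg-backed S3 table; database/table/column names are illustrative.
SELECT eventName, COUNT(*) AS calls
FROM "observability_db"."cloudtrail_events"
WHERE eventTime > current_timestamp - INTERVAL '1' DAY
GROUP BY eventName
ORDER BY calls DESC
LIMIT 20;
```

From here the same table could feed Spark, a BI tool, or an AI analytics pipeline, since Iceberg is an open table format.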

Thumbnail 1430

Thumbnail 1440

Thumbnail 1450

Thumbnail 1460

You can also add new data sources here. You can see all the data sources that we currently support. We have all the vended telemetry there, including VPC flow logs and EC2 managed Prometheus. If you click on third party, you will see that we are starting with a large set of partners and third-party integrations that you can bring in.

Thumbnail 1470

Thumbnail 1480

Thumbnail 1490

Thumbnail 1500

The other thing worth calling out is that creating pipelines is very easy. You can specify what kind of source you are talking about from the list you saw before. Let us say we select CrowdStrike. You give it a name, and then you go next. The integration is going to be done through an SQS queue. Different sources work in different ways. You specify an SQS queue, and the telemetry pipeline will pull from that queue and ingest anything that comes in new. You specify the data format.

Thumbnail 1510

Thumbnail 1520

You specify the service role that will be used, and that's it. It's as simple as that. You can also perform transformations as you're creating the pipeline. For example, if you have CloudTrail data, you can say you want to transform it to OCSF format, and it will do that for you. You can also perform enrichment in the telemetry pipeline, which is also very powerful.
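To make the OCSF transformation concrete, here is a minimal, hand-rolled sketch of what mapping a CloudTrail record onto an OCSF-style API Activity event involves. The field mapping is simplified and illustrative; the actual pipeline transform is managed by the service:

```python
# Simplified, illustrative mapping of a CloudTrail record to an
# OCSF-style "API Activity" event (class_uid 6003). The real
# telemetry-pipeline transform is managed by CloudWatch; this sketch
# only shows the idea of normalizing to a common schema.

def cloudtrail_to_ocsf(record: dict) -> dict:
    return {
        "class_uid": 6003,  # OCSF API Activity class
        "time": record.get("eventTime"),
        "api": {"operation": record.get("eventName")},
        "actor": {"user": {"name": record.get("userIdentity", {}).get("arn")}},
        "cloud": {"provider": "AWS", "region": record.get("awsRegion")},
        "metadata": {"product": {"name": "CloudTrail"}},
    }

sample = {
    "eventTime": "2025-12-01T10:15:00Z",
    "eventName": "PutMetricAlarm",
    "awsRegion": "us-east-1",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/dev"},
}
print(cloudtrail_to_ocsf(sample)["api"]["operation"])  # PutMetricAlarm
```

Once every source lands in a shared schema like this, downstream queries and AI tooling no longer need per-source parsing logic.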

Thumbnail 1550

Thumbnail 1560

Natural Language Query Generation: Simplifying Data Insights Without Complex Syntax

Let me get back to the core point. This is foundational, but there's another thing I wanted to show you. I can go quickly to the PowerPoint for just one slide because I'm going to go back to the demo. Communication is very important in a Formula One race. The fact that the team can communicate with the driver in a clear way is super important. When the driver is instructed to come back to the pit lane for whatever reason, they don't use complicated language or intricate words. They simply say "box, box, box," and the driver knows to return to the pit lane.

Why do I mention this? It's a good analogy for what happens when you're trying to extract insights from your data. You don't want to overcomplicate things. Traditionally, to get insights from logs and metrics, we've had to learn query languages like SQL or whatever is supported by the platform. One of the great things generative AI has enabled is the ability to express in natural language what you want, and the system automatically translates it to what you need. That's something we've been investing in heavily over the past couple of years.

Thumbnail 1630

Thumbnail 1640

Thumbnail 1650

Thumbnail 1680

Let me show you how that works. We're back in the demo, and I'm going to show you a couple of examples—one with logs and one with metric insights. I'm going to work with a log group name and use CloudTrail because everyone understands CloudTrail. Typically, you have to come in and write your query. Now we support PPL and SQL as other languages where you can craft your queries. What you can also do is use the query generator, which is right here. You describe in simple plain English what you want—and it actually works in other languages too, which is pretty cool.

Thumbnail 1710

Thumbnail 1720

Thumbnail 1730

You can just say it in plain language, and it will convert it into the query you want. Let me do this one. I'll say "show me the top 20 API calls made." A couple of things happen here. It understands the schema of the data source and incorporates it into the request that is made. Using that information, it generates the query you want. I already did it over there, so I can go ahead and run my query. There you go. I have all the API calls from CloudTrail that have been made, organized. I can go in and refine the query. I can say, "Can you sort them by this or add that detail?" It's very simple to use.
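For reference, a request like "show me the top 20 API calls made" typically compiles to a short CloudWatch Logs Insights query along these lines (the generator's actual output may differ):

```
stats count(*) as callCount by eventName
| sort callCount desc
| limit 20
```

Refinements like "sort them by this" or "add that detail" just regenerate the query with extra clauses.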

Thumbnail 1770

The cool thing is that you can become an expert on these querying languages very easily. I feel so powerful when I'm using it because I don't know anything about complex syntax. It's so cool—a super powerful use of AI. What it really does is take away the need to learn a specific query language in order to get the insights you want. You want to know what's in your data and what your data is telling you. Instead of spending time learning the query language to get to that insight, with natural language querying, you now just focus on the insight you need and don't worry about the query language underneath.

Thumbnail 1800

MCP Servers: Two-Way Telemetry for Real-Time Adjustments and Auto-Remediation

So I just showed you the demo. Now let's talk about the second role. Let's talk about MCP servers.

Thumbnail 1810

These are super important. The way they fit into this analogy: when you are trying to leverage AI for operations, you can ask a lot of questions and get a lot of data. We are familiar with a lot of systems that let you store the data in one place, organize it with some sort of vector database, and then ask questions about it. But there are a number of problems with that; it tends to be very expensive, among other issues. An MCP server changes the whole thing, and it's very similar to what happened in Formula One in 1993.

Thumbnail 1880

Before 1993, teams were able to get telemetry—the speed of the car, the temperature of the engine, a whole bunch of stuff. But in 1993, there was a team that introduced what they call two-way telemetry, which meant that not only were they getting data, but they could also go and make adjustments to certain things in the car. That gave them a 0.3 second advantage per lap, and they were able to win a bunch of championships until the other teams caught on and started doing the same thing. That is kind of what MCP allows us to do because it allows us to communicate both ways with some of these things.

Thumbnail 1910

It's often illustrated as the USB-C of AI: a standard for connecting your AI models to any API. So why do you need MCP servers, and why are they important in AIOps? As I said before, you need to get data from your telemetry and observability platform, and you need to take action in an efficient and safe way. MCP servers allow you to do that, and when we saw the potential of this and how important it was for agentic workloads, we decided to start building these MCP servers and providing them to our customers.

Thumbnail 1940

Today we have three MCP servers for CloudWatch and related telemetry: the CloudWatch MCP server, the Application Signals MCP server, and the CloudTrail MCP server. You can use these with Kiro, VS Code, or Q Developer, and you can also build your own agentic applications that connect to the MCP servers and read data. Some of them allow you to make adjustments too. Think about auto-remediation of issues; some customers are actually doing that. It's pretty cool.

Thumbnail 1990

Thumbnail 2020

So let me show you how it works. Let me go back here. I'm going to go to this. Yeah, this thing doesn't want to give up. All right, let's go to my Kiro. I have an application here, and let me show you first how you configure this. This is very small. All right, so you configure your MCP servers, and you can see how the CloudWatch one is broken, because the demo had to break. But we'll find another way of doing it.

Thumbnail 2030

So basically, the way you configure MCP servers is by downloading the set of files from the QR code that I showed you before. I'll show you again. There is a GitHub repo where we have all of our MCP servers—not only the CloudWatch ones, all of them are there. You can go in and there will be instructions on how to do it. The instructions there are for running it locally, which is what most customers are doing today with development environments like this, but you can also run them hosted.
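As a reference point, a local MCP configuration for the CloudWatch server typically looks something like the following. The exact package name, launcher, and environment variables may differ, so check the awslabs GitHub repo for the current instructions:

```json
{
  "mcpServers": {
    "awslabs.cloudwatch-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-profile",
        "AWS_REGION": "us-east-1"
      }
    }
  }
}
```

The same pattern applies to the Application Signals and CloudTrail servers; each entry just points the client at a locally launched server process.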

Thumbnail 2040

The idea and the use case for an IDE environment like this: you can think of it as coding. You're pushing a change, you have your development environment, you want to see how it behaves, you want to tweak some settings. So, because I have mine configured and now it's finally working, you can go here in the chat and say: I just deployed a new version of the Nova image generator. Now you see why I was using that app, right? Can you check latency or see if there are any issues?

The good thing is that it understands me despite my typos. What this does is based on the MCP server. The MCP server has a set of API calls defined, and it can reason about which ones it needs to call. It will say, "Hey, I need to run this, can you approve it?" You can actually auto-approve things, but I have approval on here because I want to see what it's doing specifically. It will ask me for permission to go ahead and run the tool. So it's going to call the MCP server, use a set of APIs, and return some information that's going to help me understand the state of my application.

Thumbnail 2160

It's running the audit service tool; this is Application Signals. So it found some connection errors to Bedrock, which is interesting, right? There's a problem with the application, so maybe when I pushed the code, something happened. You get the point, right? You can ask in plain English very intricate and complicated things. I remember doing a demo one time where I said, "Look at my CloudTrail data. Show me any unusual activity that happened yesterday." And it goes in and does the query, and then I said, "OK, now compare that to the previous day and create a comprehensive report for me." It gave me this amazing report with everything that happened in one day compared to the other, and it called out a few things that were very interesting that never occurred to me were happening in that account.

I encourage you to test it out. It's super powerful and lets you as a developer get on top of things quickly and see how things are behaving, with a direct connection to the telemetry. Before you move on, a couple of things to notice here. First, you were in the middle of deciding whether to approve each MCP tool call. Of course, you can automate it, but there's also an element of trust to be built with the agent you're working with before you get to that point. It allows you to get there, though. That's number one.

Thumbnail 2270

Thumbnail 2280

Number two, it also comes with recommendations on where you might need to make fixes in order to remove those errors. That ability to get to the root cause, identify issues early on, and make corrections is extremely important and extremely powerful in your journey to implementing AIOps. I just issued another command. The Application Signals MCP server has a special API call called audit services that combines a number of API calls internally. What I just did was ask, "Are there any alarms you would recommend for this service?"

Thumbnail 2290

What it does is analyze all the metrics, checking metric density and other parameters, to figure out which ones are good candidates for alarms. It will actually recommend alarms to me, which is super cool. It's running that, and at the end it will come back and say, "You know what, you should probably enable these 34 alarms here. They're kind of critical." That's super powerful. There you go. I'll come back and show it to you in my next demo. Hopefully it'll be done by then.
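The exact heuristics behind the audit API aren't public, but the metric-density idea can be sketched roughly like this. The 90 percent density threshold and the period/window values are invented assumptions for illustration only:

```python
def is_alarm_candidate(datapoints, period_seconds=60, window_seconds=3600, min_density=0.9):
    """A metric is a reasonable alarm candidate when it reports consistently:
    here, at least `min_density` of the datapoints expected in the window."""
    expected = window_seconds / period_seconds
    return len(datapoints) / expected >= min_density

def recommend_alarms(metrics):
    """Return the metric names dense enough to back a useful alarm.
    `metrics` maps metric name -> list of datapoints from the last window."""
    return [name for name, points in metrics.items() if is_alarm_candidate(points)]

# A metric with one datapoint per minute over an hour qualifies; a sparse one doesn't:
sample = {"Latency": list(range(60)), "RareError": list(range(5))}
# recommend_alarms(sample) -> ["Latency"]
# Each recommended metric could then be wired up with
# boto3.client("cloudwatch").put_metric_alarm(...).
```

Real implementations would also weigh seasonality, anomaly history, and service criticality; density alone is just the simplest useful signal.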

Thumbnail 2320

Thumbnail 2330

CloudWatch Investigations: AI-Powered Incident Resolution with 80% Time Savings

All right, so let's move on. OK, we're back there? Yep. All right, let's go to the next one, and this is my favorite application. I've been working with it for a year now: CloudWatch Investigations. Let me tell you a little bit about this tool. If you think this slide is busy and confusing, we did our job, right? Because this is how the process of responding to an incident looks today. You get a page, you go check an alarm, you review the metric, you go to a dashboard, you call a friend, and all that stuff happens. It's very confusing.

Thumbnail 2370

Through things we've learned internally at Amazon and all the feedback our customers give us, we asked, "Can we help solve this problem with generative AI?" So we created Amazon CloudWatch Investigations. Now, how many of you knew about CloudWatch Investigations before today? Raise your hand. OK, all of you do. The reason I have a "new" tag there is that there's a new feature that is incredibly powerful; I'll tell you a little bit about it. Here's the thing: it's an AI-powered tool that helps you expedite the process of resolving incidents in your infrastructure. When you open an investigation, it automatically surfaces metrics that could be related to the incident, log queries and results that could be related, as well as traces, AWS Health data, and CloudTrail change data.

One customer told me this is awesome because they know whether an issue is AWS-related or theirs. The system checks AWS Health automatically and tells you if there's an issue with a specific service, so you can move on without spending an hour trying to figure out why your application is failing.

It's multi-account ready, so you can investigate across accounts, and you can start an investigation from any CloudWatch widget or trigger it with an alarm. This is super cool because by the time you get to the console, the system will already have some information for you to act on. The new feature we're introducing is the ability to generate incident reports based on what we use internally at Amazon, called correction of error or COE documents, which ask the five whys. This gives you very detailed information about why the incident happened, how it was mitigated, and how many users were impacted.
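As a rough illustration only (this is not the service's actual output format), a COE-style report skeleton with five-whys entries and placeholders for the fields a human must fill in might look like:

```python
def incident_report(title, whys, impact="<needs input>", mitigation="<needs input>"):
    """Build a markdown skeleton in the COE style: the five-whys chain plus the
    fields the system cannot infer on its own, left as placeholders."""
    lines = [f"# Incident report: {title}", "", "## Five whys"]
    lines += [f"{i}. Why? {w}" for i, w in enumerate(whys, 1)]
    lines += ["", f"**Customer impact:** {impact}", f"**Mitigation:** {mitigation}"]
    return "\n".join(lines)

# incident_report("SLO breach", ["Errors spiked", "S3 access was denied",
#                                "A deny-all bucket policy was deployed"])
```

The point of the structure is that every "why" narrows the cause, while the impact and mitigation fields stay editable, mirroring how the console lets you update fields the system can't know.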

Thumbnail 2480

Here's how it works. You start an investigation from an alarm or from any widget. The most difficult thing when doing incident investigation using telemetry in a central location is understanding the topology of the application. You have all this telemetry in a big data lake, but how do you know that one piece of telemetry is related to another? That's super difficult. The team at CloudWatch did an awesome job by creating a topology of the application as soon as you open the investigation. We use internal resources and traverse the infrastructure using different APIs and special features that the CloudWatch agent has, called entities. We create a map of that, and with that map we can start focusing on a subset of the telemetry to figure out what the problem is, which expedites the process and makes it much more efficient.

Immediately, the system starts surfacing observations. The console allows you and your team to collaborate, as multiple people can work on the investigation together. You can add more data points, make notes to improve the investigation, or discard information that's not related. All these steps keep refining the process, and then the system comes to a hypothesis. Based on what it sees, it tells you what it thinks is happening and how it came to that conclusion, with full documentation. It also tells you what you need to mitigate and solve the problem, which could be documentation, knowledge base articles, best practices, or an SSM automation runbook to fix the problem.

Thumbnail 2610

Here are the benefits. Interestingly, the Kindle team internally at Amazon started using CloudWatch Investigations for some of their incidents, and the result was 80 percent time saved in doing investigations. That's what we want to do. We want to reduce the complexity and all the issues that come up when you start an investigation. Another pretty cool thing is that through this process you learn a lot about your infrastructure that you didn't know before. There's this weird metric that surfaces and it had never occurred to you that maybe you should keep an eye on it, but all of a sudden it just pops up. It's a very powerful tool, and because it has the human in the loop, it's not only expediting the process of solving issues but also helping you learn more about your environment at the same time. We think that's very important.

Thumbnail 2670

Thumbnail 2680

Thumbnail 2690

Live Demo: From Alarm Trigger to Root Cause Analysis and Incident Report Generation

Let me show you a quick demo of Investigations. I wanted to show you how the system came back with recommendations for alarms you should set up. I'm going to go to Investigations and show you something that happened. I have this application that I built called the Nova image generator. It uses Nova through Bedrock, and you just type something like "create me this picture." It's the same thing you can do when you go to Nova on Amazon.com, but I have a process running that requests images every so often, and I have an SLO defined using Application Signals.

Thumbnail 2730

If you don't know what that is, I encourage you to learn more about it—it's really cool. Basically, it's keeping an eye on that metric and letting me know if there's an issue. I have an alarm tied to that SLO, and you can see here that the alarm fired off because I broke the application. I broke it intentionally, and what happened was the alarm triggered immediately.
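Tying an alarm to an SLO metric boils down to a `put_metric_alarm` call. Here's a hedged sketch: the namespace, metric name, and dimension below are illustrative placeholders, not the exact names Application Signals publishes, so check the metrics your SLO actually emits before using this:

```python
def slo_alarm_params(slo_name, threshold=0.95):
    """Build the keyword arguments for cloudwatch.put_metric_alarm(**params).
    Namespace, MetricName, and the dimension name are assumptions for illustration."""
    return {
        "AlarmName": f"{slo_name}-attainment",
        "Namespace": "AWS/ApplicationSignals",   # illustrative
        "MetricName": "AttainmentRate",          # illustrative
        "Dimensions": [{"Name": "SloName", "Value": slo_name}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
    }

# Usage (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **slo_alarm_params("nova-image-generator-availability"))
```

The alarm fires when attainment drops below the threshold for three consecutive periods, which is the pattern the demo's broken deployment triggered.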

Thumbnail 2750

Thumbnail 2760

I have the alarm configured to automatically start an investigation. So if I go to AI Operations and navigate to Investigations, I can see that it already opened an investigation for me. If I go into that investigation, I can see what piece of telemetry the system used to start the investigation. It's the metric that was backed by that alarm, which in this case is the SLO. The SLO average is based on errors, so when I broke the application, the errors spiked up, which triggered the alarm.

Thumbnail 2780

Thumbnail 2800

As this is happening, you can see the agent queue here. I can go in and see what the system is doing—it's constantly moving. It's not a black box. It shows me what metrics the system is analyzing. I can go in and read the details of what it's analyzing, and I can dive into it to see exactly what it's doing and have visibility into it. I can filter the results. The backend is fully agentic, so you can see that playing out in how it's handling the investigation.

Thumbnail 2820

Now, if I go back here, what you will see are the observations that I mentioned before. These observations are surfacing, and I can read them and validate them. I can say that I don't like a particular observation or that it's not related. I can also add a note and say that I think this is related to something specific and that the system should check it. The system will incorporate that feedback into the investigation as well.

Thumbnail 2850

This is where it becomes extremely powerful. Imagine that the alarm went off and kickstarted the investigation, and it came up with a hypothesis about what the agentic system believes could be the possible root causes for why the alarm fired. You're not trying to go in and figure out what went wrong from scratch. Instead, you already start from a position of saying here are the two or three possible reasons why the alarm went off. If it makes sense to you in terms of why it could have gone off, you can accept the hypothesis.

The system keeps a record of every hypothesis that you've accepted and starts investigating further until you get to the root cause faster. You're not starting off in the dark or in the blind, not starting off saying I don't know where the problem could be. Instead, you're starting off saying here are exactly one, two, or three places where the problem could be, and now I need to figure out which one is the issue in order of priority. I'm sure you can imagine how much time that would have saved you already on the front end of trying to figure out where the issue is.

The other cool thing I like about these observations is that you can dive deep into them and see what the system is doing. This is also an excellent tool for learning the platform. You see that message there that says "pattern message and pipe to anomaly"—that's a feature we released just for investigations, and we call it always-on anomaly detection. You can actually use this on your own with any log. It detects anomalies between two ranges. If you select one hour, it compares two ranges, does a pattern analysis, and tells you if there are any anomalies in those two sets. So it's an excellent tool for learning as well.
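The two-range pattern comparison can be approximated on your own with the Logs Insights `pattern` command. This sketch treats patterns present in the current window but absent from the baseline window as anomalies, which is a simplification of what the feature does; the query string is an assumption based on what the console showed:

```python
import time

def anomalous_patterns(current, baseline):
    """Patterns present now but absent from the baseline window are the anomalies."""
    return sorted(set(current) - set(baseline))

def run_pattern_query(logs, log_group, start, end):
    """Run a CloudWatch Logs Insights `pattern` query and collect the @pattern values.
    `logs` is a boto3 CloudWatch Logs client; start/end are epoch seconds."""
    qid = logs.start_query(logGroupName=log_group, startTime=start, endTime=end,
                           queryString="pattern @message")["queryId"]
    while True:
        resp = logs.get_query_results(queryId=qid)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(1)
    return [f["value"] for row in resp["results"] for f in row if f["field"] == "@pattern"]

# Usage (requires AWS credentials): run the query over this hour and the previous
# hour, then diff the two pattern sets with anomalous_patterns().
```

Diffing pattern templates rather than raw log lines is what makes this cheap: two windows with millions of lines reduce to a few dozen templates each.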

Thumbnail 2950

Eventually, the system will come up with a hypothesis, and here's the hypothesis you can see. It basically says that an S3 bucket policy deployment that denied access to the Nova image analyzer Lambda function role is causing immediate service failure when the function attempts S3 operations. That's exactly what I did: I put a deny-all policy on that S3 bucket for the role that the Lambda is using. So it found the issue very quickly.
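The break described here, a deny-all bucket policy scoped to the Lambda's role, can be reproduced with a policy like the one below. The bucket and role names are made up for illustration:

```python
def deny_role_policy(bucket, role_arn):
    """Bucket policy that denies every S3 action on the bucket to one role,
    the same kind of break the demo introduced intentionally."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyLambdaRole",
            "Effect": "Deny",
            "Principal": {"AWS": role_arn},
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }

# Usage (requires AWS credentials; names are hypothetical):
# import json, boto3
# boto3.client("s3").put_bucket_policy(
#     Bucket="nova-image-bucket",
#     Policy=json.dumps(deny_role_policy(
#         "nova-image-bucket",
#         "arn:aws:iam::123456789012:role/nova-image-lambda-role")))
```

An explicit Deny overrides any Allow in the role's IAM policies, which is why the Lambda fails immediately on its next S3 call, exactly the signal the investigation traced back.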

Thumbnail 2970

Thumbnail 2980

It gives me a summary of everything that was found. It gives me a map. Remember I told you about the topology? This is the map of the application topology that it used to figure out what needed to be checked. It also gives me troubleshooting next steps and runbook actions. I could just run this if I needed to. There are knowledge base articles as well, so there's a lot of good information here.

Thumbnail 2990

Now what I can do is accept that hypothesis. Once I accept the hypothesis, the system is going to start collecting facts, and those facts can later be used to generate an incident report.

Thumbnail 3010

Thumbnail 3020

Thumbnail 3060

Let me show you an archived report. I'm going to open this one; let me see if it generates quickly. If not, I can go to another one. Anyway, we'll wait, but basically what's happening on the back end is that all the facts gathered about the incident, like what pieces of telemetry were used, what the connections between them are, and how this impacted the system, are collected into this report. Now you have a very comprehensive report that you can use as a mechanism to document the incident and prevent it from happening again, because it has all the details you need for that. You will see that once it completes here.

There's a lot of information that may not be included because there's no way for the system to know, for example, how many customers were impacted. That's information that is sometimes a little difficult to figure out. However, you can edit those in the system. You can go and add these facts or update these facts so that the report is more complete. Anyway, you're going to have to trust me on the report. Maybe I'll show it to you later, but this is a feature that our customers using investigations absolutely love.

Thumbnail 3100

Thumbnail 3120

AWS DevOps Agent and Race Strategy: The Path to Autonomous Operations

But you saw how quickly we were able to get to the resolution. This, I think, is a game changer for our customers. The fact that you can now get to the root cause not in hours but within minutes is remarkable. Alright, let's talk quickly about the last one. This one was announced today and it's called AWS DevOps Agent.

What is this? AWS DevOps Agent is what's called a frontier agent: it resolves and proactively prevents incidents, continuously improving the reliability and performance of applications in AWS, multi-cloud, and hybrid environments. Now you're going to ask me what the difference is between this and the one I showed previously. The difference is how they work. This is a frontier agent, a new type of agent that, when you let it loose out there, starts looking for issues that could be brewing somewhere, anomalies and things like that.

The Investigations feature is a CloudWatch feature that we offer at no cost to our customers. That's the big difference. This is a full-blown frontier agent that's going to go out and work with third-party telemetry. It will do a lot of preventive work on its own, whereas the other one you saw is designed for incident investigation and finding the resolution with the human in the loop, which we feel is important. But I think this is a great first step so that customers can prevent incidents and find out what's going on in their infrastructure. I'm very excited about what the team is going to deliver on this.

Yes, this AWS DevOps Agent is just a start in the journey of how we get to a nirvana stage of autonomous ops where an alert goes off, but the agent has already figured out what the root cause is and has taken the corrective remediation action. You just make sure that everything is validated, or you don't even need to do that. You wake up in the morning and it just gives you a report of everything that happened last night and all the changes it made and all the fixes it made. That's where we all want to go. This is a first step towards that.

Thumbnail 3250

Thumbnail 3260

Everything we showed you up to this point is your base implementation, your foundation: making incremental progress and incremental changes in your operations to get to that stage of nirvana. That's the journey we're on. I promised I was going to show you the incident report, and there it is. This is a very detailed incident report with metrics, all the definitions, everything that happened, what went well, the analysis, the detection, and the mitigation.

Thumbnail 3270

Thumbnail 3290

It's very well documented and structured with all the details you need, action items that were recommended, and you will notice that at the beginning here there are these fields that need input. You can update those in the system, and then they will update in the report. You can export the report into a PDF, copy it as markdown, and incorporate it into whatever tool you want. Let me switch back. There you go, you have your pit crew.

So learn more about the foundational things we discussed: unified telemetry with CloudWatch and how we're using MCP servers. Investigations is an awesome tool; if I were you, I would just enable it. It doesn't cost you anything and it can save you a lot of time.

We're very excited about the future and what it holds with the AWS DevOps Agent. So please take a look at it.

Thumbnail 3320

What we showed you here today was your race strategy for your season. I started this session by saying there is a preseason of setting, so you start using CloudWatch natural language query and get your teams to start talking to their telemetry in natural language. Then, in order to be better prepared, as Andres mentioned, please enable CloudWatch Investigations. It's available at no cost and can actually get you faster to where you need to go in the case of an incident.

Next, be even more structured and deploy MCP automations. I showed an example of how it's integrated with GitHub, and now, at development time, at coding time, you can ask it what alarms need to be enabled, what metrics need to be captured, and what level, kind, and amount of logging is needed. All of those questions can be answered up front during the development process.

Thumbnail 3400

But last but not least, if you're ready today, go ahead and deploy the AWS DevOps Agent for autonomous operations. You will be able to get to that stage of nirvana on this journey, and you will be able to get to that championship finish as part of your operational transformation race. Formula One teams don't just think about winning, but think about strategy and actually implement it. What we did here today is we tried to give you all the tools and showed you how to use them in order to implement that race-winning strategy.

Here's the QR code with all the information you'll need from today and everything else you need to know to get started. Last but not least, please fill out the survey and give us five stars if you want to see us back on stage next year. That's the only way to do it. Thank you everyone for taking the time. Amazing race, thank you.


; This article is entirely auto-generated using Amazon Bedrock.
