🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Accelerating incident response through AIOps (COP334)
In this video, Pratul Chakre and Andres Silva present AWS's AIOps innovations for accelerating incident response, using Formula One racing as an analogy. They demonstrate CloudWatch's unified telemetry management that centralizes operational, security, and compliance data across accounts, natural language query generation for logs and metrics, and MCP servers for standardized AI-model connectivity. The session showcases CloudWatch Investigations, an AI-powered tool that automatically surfaces relevant metrics, logs, and traces to expedite incident resolution—achieving 80% time savings for Amazon's Kindle team. They also introduce the newly announced AWS DevOps Agent, a frontier agent for proactive incident prevention, and explain how these tools form a progressive journey from monitoring to autonomous operations, reducing mean time to resolution from hours to minutes.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Accelerating Incident Response Through AIOps at AWS re:Invent
Good afternoon, everyone. Can everybody hear me okay? You've got to have your headsets on if you want to hear me. Good, all good. Thank you. Fantastic. Yes, you are in the right session. The session is Accelerating Incident Response Through AIOps. Let's start with some introductions. My name is Pratul Chakre. I lead the global worldwide team for Cloud Operations Specialists based here in the US.
And I'm the sidekick. My name is Andres Silva. I'm a Principal Specialist Solutions Architect with the Cloud Operations team, and I focus on observability, helping customers adopt observability strategies. Throughout the last year, I've been focusing on AIOps, so we have a lot of exciting things to share with you today.
So today what we've put together is probably the most high-octane session that you will attend at re:Invent, and if that's not the case, please come back and see me and let me know so we can do better next year. When Andres and I sat down to talk about what to present in this session, we wanted to take a different approach: not just present all the cool stuff, which of course we will, but weave it into a story.
When we were thinking about the story, we thought, what's better than the recently concluded Formula One Las Vegas Grand Prix? Any Formula One fans here? Show of hands? Okay, there. Okay, fine, so we'll have some work to do along the way. What we really want to showcase is how Formula One has evolved using technology to get to where they are today, and they're continuously doing so. The biggest difference between organizations that are successful and organizations that are not so successful is how they evolve and embrace technology to drive innovation within their enterprise, especially in operations.
So what we will be talking about today is the race strategy. A race strategy is what you put together to decide what to do during the off-season, at the start of the season, mid-season, and then on race day. We'll talk about all of the AI innovations in CloudWatch, CloudWatch being our flagship AWS service for observability. We'll talk about MCP servers and agentic frameworks. We'll talk about CloudWatch Investigations. And last but not least, because no session is complete without agentic AI, we will talk about the AWS DevOps Agent, which Matt Garman announced earlier today.
CloudWatch Evolution: From Basic Logs to Centralized Telemetry at Scale
But first, and this is interesting because I've been with Amazon ten years and I've spoken to hundreds, if not close to a thousand, customers over those ten years, and almost every time I hear, I didn't know CloudWatch could do that. So I wanted to quickly walk through how the innovation in CloudWatch has evolved, from its start in 2014 with just logs to where we are today. We launched Live Tail for logs in 2023, one of the most sought-after features for developers and infrastructure engineers, letting them look at logs in near real time.
In 2024, we of course made further improvements to Live Tail, but we also launched Database Insights, curated views into your databases, and Transaction Search and Analytics. That brings us to today, where we've launched a whole bunch of things, but I'm just going to pick out a few. Number one, we've centralized all of your logs across accounts and regions into one account and region. This was one of the most requested features from our customers.
We now have Application Map. This is extremely important for customers to understand their application-level dependencies, whether or not the application is instrumented. You can see an entire map across your application. Then there's Generative AI Observability, again something Matt Garman announced today: when you build generative AI applications, whether on SageMaker, on Bedrock, or self-hosted on EKS, how do you monitor the applications or agents you're building across the enterprise? That's some of the cool stuff we've launched over the past year. CloudWatch has evolved a lot.
And this evolution has helped us to monitor at scale, and I'm sure if you didn't know this, one of our biggest customers for CloudWatch or AWS Observability is Amazon. All of Amazon, all of AWS uses CloudWatch for their observability. We support seventeen exabytes of CloudWatch Logs per month, thirty-two quadrillion metric observations per month, and nine hundred thirty-two million canary runs per month.
Now, I don't know about you, but when it goes beyond gigabytes and petabytes, I kind of lose track of how many zeros we're talking about, so I put this up for my own reference. An exabyte is a 1 followed by 18 zeros. That's the scale at which we are operating, and that's what we are supporting.
So how are we able to do that today? As we listened to customers and talked internally, one thing we were extremely clear about: CloudWatch, well, AWS, needs to be the home of all telemetry. If you can't get all of your data into a single place, you have disparate data sources spread across your entire environment, and it is extremely difficult to build context, and extremely difficult to get insights from data that disparate. So CloudWatch has now become the home of telemetry.
Then, building curated experiences. What an application developer needs is very different from what an infrastructure engineer needs, or what a database engineer wants to see as part of their observability. So building curated experiences that serve each persona and the use case they are responsible for is extremely important. And last but not least, implementing AIOps: how do you offload mundane, manual, repetitive tasks to AI so you can focus on what's important for your business?
The Formula One Pit Stop Analogy: Technology-Driven Operational Excellence
Now let me take you back a little bit to the Formula One analogy, and I know a lot of people didn't raise their hands when we talked about Formula One. But here's a question: do you know how long an average pit stop takes? A pit stop is a pre-planned stop that the driver makes to either refuel or change tires, the two most common cases, so the car can keep racing for the remaining laps. So watch this video. There'll be a question at the end of it.
Hamilton on Vettel, 81% now because Hamilton's going faster. Push now to gain more. Push now. So this is the undercut. This is Hamilton going faster on that new set of tires than Vettel's able to because he's on an older set of tires. Hamilton's just coming around the final turn as the Ferrari pits. They execute a very nice pit stop indeed. Medium compound tires going on to Sebastian Vettel's Ferrari. Lewis Hamilton goes past our commentary box at a rate of knots. Vettel at 80 kilometers an hour. Hamilton has got the jump on Sebastian Vettel by performing the undercut, and it's Hamilton now ahead of the Ferrari.
Excellent. How many of you saw that Lewis Hamilton's pit stop took 2.1 seconds? That time is so important that it can make or break a race; it can determine a driver's podium position. And how many of you saw the predicted gap analysis, showing how far ahead of Sebastian Vettel, the driver behind him, Hamilton would come out after the stop? All of that is collecting telemetry and doing the analysis at runtime, not even near real time, at runtime, in order to make the right decision about when to stop and what to change.
On average, a pit stop takes about 2.3 seconds, even though Hamilton did better. But they didn't just get here. Over the years, Formula One has made progressive improvements, from the days when they used hammers to change tires all the way to AI prediction today. That has taken them from 67 seconds back in the 1950s to 2.3 seconds today, a 96% improvement in how they conduct a pit stop. That's the strategy. That's what makes or breaks a race.
How did they do that? What did they change? It really revolves around three things. First, adopting the latest and greatest technology available: they went from hammers to pneumatic wheel guns, which immediately made them five times faster at changing tires.
Second, they became even more data driven. There are 300 sensors per car, collecting weather information, racetrack information, all the telemetry they need to make the right decisions to win the championship. And third, the process and the personas: every persona is choreographed to do the exact same thing every single time the car stops during a pit stop. They practice it like a symphony, over and over, so they don't make a mistake at the moment it really counts.
So why should it be any different for all of you driving your cloud AIOps implementation? The technology is now available, and we will show you how it can help you get there. It is data driven: once you collect all your telemetry across all your environments into one single place, you can make data-driven decisions and let AI assist you in making them, what we call human-led, AI-assisted decision making, driven by data. And then you continuously learn from past experiences to improve and further drive down your MTTR, or mean time to resolution.
From Unscheduled Incidents to Autonomous Operations: The AIOps Journey
Think of that pit stop as a planned downtime that we all have in some shape or form. We practice it, we choreograph it, but what happens when an incident occurs in our world or in the case of Formula One, there's an unscheduled pit stop? An unscheduled pit stop happens because something broke off the car, needs to be quickly replaced, the tire came off or the tires were not working properly, they need to change again. Even then, an unscheduled pit stop is only about 40 plus seconds.
Now imagine your pager goes off at 3 a.m. You now have an incident. I don't know why, but for some reason everybody loves the 3 a.m. pager analogy, and here we are. The pager goes off at 3 a.m., which is the unscheduled pit stop in our world. That incident takes four hours to resolve, and now we've lost thousands, if not a few million, dollars in unrealized revenue because customers couldn't log in or couldn't process their data the way they wanted.
So what if your operational pit stop, whether scheduled or unscheduled, could take minutes or even seconds instead of hours? That's where AIOps plays a big role, and with that I'm going to hand it over to Andres for the next set of slides. Excellent. So we've talked a lot about the Formula One analogy, and if you didn't know how to recognize a Formula One fan, now you know. But before we dive deep into this, I think it's fair that we define what AIOps means, because it can mean different things to different people.
So when we define AIOps, we're talking about using artificial intelligence and machine learning to enhance, accelerate, and automate the cloud operations process, as you can see on the slide. It's there to give you superpowers. This is not something that will go and do all the work for you; hopefully we'll get there, but we're not there yet. We're discussing AIOps in that context.
Okay, very good. In the Formula One analogy, Pratul did an excellent job explaining how the evolution from hammers onward has taken them to where they are today, where they can do a pit stop in 2.3 seconds. Amazing. You can see a lot of parallels between their evolution and the evolution our customers need to go through to take advantage of this technology.
F1 started with telemetry. Then they started doing simulations and strategy. Then they moved to real-time adjustments, and I'll give you an interesting example later about when that was introduced. And now they're at the point of making AI-driven decisions: when something happens, they can go in and tweak things. Super powerful.
So to manage infrastructure in the cloud and adopt AIOps, you have to go through a similar journey. We all started with monitoring. Then maybe we started doing predictive analysis, which has been available for quite some time, and anomaly detection; remember setting up that baseline pattern and then detecting anomalies. Now we're at the point where we can do auto-remediation and a lot of other cool things with generative AI.
And of course, autonomous ops, that's the holy grail, right? That's where we want to be, and that has to be the journey that we have to take to adopt and benefit from this technology.
Tire Gunners: Unified Data Management and Foundational AI Innovations in CloudWatch
So I'll take a quick moment to highlight the partnership we have with Formula One and their platform, F1 Insights. It's a very good example of how you can continue to adopt technology and improve not only your operations but, in this case, the business, which is the key thing. To help you understand this and build a mental framework, we decided to continue the Formula One analogy, and we're going to talk about four main roles. First, let's define the pit crew: the pit crew is all those people who help you win a race.
We're going to talk about tire gunners, and we'll explain what they are, then jack operators, strategists, and the race engineer, and we're going to map each of those to services and features that can help you in this journey. Let's get started with tire gunners. A tire gunner is not somebody who shoots tires. It's the person with the big pneumatic gun who races to remove the big nut in the middle of the wheel very, very quickly.
In fact, if you go to the Caesars Forum now, at the sports forum, I think you can actually go in and play with a pneumatic gun, which is pretty cool. It may seem simple, but it's foundational. And when we talk about foundational, we have to talk about some of the innovations we've made in CloudWatch. An interesting note here: in F1 Insights, the Formula One team uses CloudWatch telemetry to monitor their ML inference infrastructure, the infrastructure components. And again, that goes back to the point that it is foundational.
So let's talk briefly about some of the AI innovations, the base-layer innovations we've done in AI. We're going to talk about something Matt Garman announced today, 25 launches in the last 10 minutes, which was very exciting: the unified data story, unified telemetry for CloudWatch. And we're going to talk about natural language query generation and a couple of other things. So let's dive in, and then I'll do a demo, because demos are fun, right?
So first, to set the stage: a single F1 car generates an immense amount of data that the teams have to process. To give you an idea, in a typical race there are more than 5 billion data points across all the cars. That is a lot of data, and something similar happens with our infrastructure. We have a lot of data coming in: infrastructure logs, application data, application logs, databases, streaming services, you name it.
And right now, a lot of that data coming from different places ends up in different places to be analyzed, which becomes a hassle. It doesn't empower you to leverage artificial intelligence for operations, because the data can't be scattered everywhere; you want to centralize it. So we're very excited that this morning we launched and announced unified data management for CloudWatch, something we've been working on for quite some time now. What, a year or so? Yeah, easily, yep.
What does this mean? CloudWatch, as you all know, is a very powerful telemetry platform; it's the one we use at AWS, like Pratul said. But now we're creating an additional experience that allows you to unify three main use cases: operational, security, and compliance. That means we're empowering our customers to bring all that data into CloudWatch, we're providing an enriched interface to query and search that data, and we're also providing features like the analytics tiers and the telemetry pipeline.
The goal of those is to let our customers put all their data in one place. That's the foundation, and on top of it we can run effective AI operations. Does that make sense? Yes, absolutely. And here is why it makes a difference. Just like in the Formula One example, there are about 5 billion data points per race coming from all the different cars.
And that's the part to understand here: it's the same set of data coming from different cars, but the telemetry being collected is the same. Similarly, for our applications, whether it's security data, compliance data, or observability telemetry, it's coming from different instances, servers, Kubernetes, serverless, whatever technology you use under the hood to power your applications. It's coming from multiple instances of those same applications,
and to be able to run the level of analytics and decision making that Formula One does, it is extremely important to collect all this data into a single location. Exactly. So let me highlight some of the features. Don't worry, we'll make some noise later.
From a single place, you can now enable all the AWS vended data sources. I'll give you a very simple example: you have an organization with 400 accounts, and you want to enable VPC flow logs consistently across the entire organization. That's just one small part of what this new set of features does: you can go in and create a rule that enables VPC flow logs everywhere for you, in a consistent way.
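The organization-wide rule described above is part of the new centralized experience; the long-standing per-account equivalent is the EC2 `create_flow_logs` API. A minimal sketch with boto3, where the log group name and IAM role ARN are placeholders you would substitute:

```python
# Sketch: enabling VPC Flow Logs to CloudWatch Logs for a batch of VPCs.
# This is the per-account boto3 equivalent of the centralized rule shown in
# the talk, not the new organization-wide API. Role ARN and log group names
# below are placeholders.

def flow_log_request(vpc_ids, log_group, delivery_role_arn):
    """Build the parameter set for ec2.create_flow_logs."""
    return {
        "ResourceIds": vpc_ids,                    # one or more VPC IDs
        "ResourceType": "VPC",
        "TrafficType": "ALL",                      # ACCEPT, REJECT, or ALL
        "LogDestinationType": "cloud-watch-logs",
        "LogGroupName": log_group,
        "DeliverLogsPermissionArn": delivery_role_arn,
    }

def enable_flow_logs(vpc_ids, log_group, delivery_role_arn):
    import boto3  # imported lazily; the builder above stays dependency-free
    ec2 = boto3.client("ec2")
    return ec2.create_flow_logs(**flow_log_request(vpc_ids, log_group, delivery_role_arn))
```

To cover 400 accounts this way you would assume a role in each account and repeat the call, which is exactly the toil the centralized rule removes.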
We are providing connectors for third-party sources, so you can bring in your CrowdStrike data. We launched with about 12 connectors, and the goal is to keep increasing that. You now have a full-blown telemetry pipeline that lets you ingest data and transform it into the format you want so you can consume it. Along with all the analytics features, which would take the whole session to cover, this is incredibly powerful, and that's why we wanted to include it in this session. But the best thing to do is show you a demo, right?
Yeah. And before you get into that test drive, last but not least is the support for open standards. What we've learned from customers over the years is that more and more of them want to use open standards to store their telemetry, and that's exactly what we're supporting. We support OpenTelemetry for observability data, and we also support the Open Cybersecurity Schema Framework, OCSF, for security data. So you can send either of those formats to CloudWatch, whether it's security or observability data. Go ahead, Andres.
Natural Language Query Generation: Simplifying Data Insights Without Complex Syntax
Awesome. So if you have used the logs section of the CloudWatch console, you will see there's a new area called log management; before, it used to say log groups. It's a subtle difference you might not notice, but when you go in there, you'll find three tabs that provide additional context into what's going on and the enhancements we're providing.
The first one I want to call out is data sources. It shows you all the different data sources CloudWatch is tracking; you can see here that I have VPC flow logs, Route 53 logs, and CloudTrail events. As you add more data sources, either by enabling them through centralized enablement or by ingesting data from a third party, you'll see your full list of data sources here.
The other thing you can do very quickly here, which I forgot to call out: we are providing S3 Tables integration with this new set of features. What does that mean? You can take a specific data source and say, I want to make this available via S3 Tables. Then you can read that data through Apache Iceberg-compatible engines such as Athena. Imagine the possibilities of taking all the logs you have there and incorporating them into any large-scale analytics pipeline you want, or using artificial intelligence on them. This is super powerful and something we think will make a big difference.
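Once a data source is exposed via S3 Tables, any Iceberg-aware engine can query it with plain SQL. A hedged sketch of what that might look like from Athena via boto3; the database, table, and column names are illustrative, since the actual names depend on how the integration registers the tables, and the workgroup is assumed to have a query result location configured:

```python
# Sketch: querying log data exposed through S3 Tables with Athena.
# Table and column names are hypothetical; substitute whatever the
# CloudWatch -> S3 Tables integration registers in your account.

TOP_TALKERS_SQL = """
SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
FROM vpc_flow_logs                               -- hypothetical Iceberg table
WHERE start_time > current_timestamp - interval '1' hour
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20
"""

def run_athena_query(sql, database, workgroup="primary"):
    """Submit a query and return its execution ID for later result polling."""
    import boto3  # lazy import: only needed when actually calling AWS
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,  # assumes an output location is set on the workgroup
    )
    return resp["QueryExecutionId"]
```

The same tables would be readable from Spark, Trino, or any other Iceberg client, which is the point of the open-format integration.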
The other thing you can do here, if we scroll up, is add new data sources. I'll show you this quickly; I don't want to spend too much time, because I just want to whet your appetite so you'll go play with it. We can see here all the data sources we currently support. As you can see, we have all the vended telemetry there: VPC flow logs, EC2, Amazon Managed Service for Prometheus. And if you click on third party, you'll see that we're also starting with a large set of partner integrations you can bring in.
Okay, the other thing worth calling out is pipelines; it's very easy to create one. You say what kind of source you're talking about from the list you saw before; let's say we select CrowdStrike. You give it a name and go next. In this case, the integration is done through an SQS queue; different sources work in different ways. You specify an SQS queue, and our telemetry pipeline polls that queue and pulls in anything new so it can ingest it. Then you specify the data format.
Each source has a single data format. Then you specify the service role to be used, and that's it. It's as simple as that, super simple to use. You can also do transformations as you create the pipeline. To give you a very simple example, you can say: my CloudTrail data, I want to transform it to OCSF, and it'll do that for you. You can also do enrichment on the telemetry pipeline, which is also super powerful.
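Conceptually, a CloudTrail-to-OCSF transform maps CloudTrail's field names onto OCSF's normalized event classes. The sketch below is not the pipeline's actual implementation, and the OCSF fields shown are a simplified subset of the real schema; it only illustrates the shape of such a mapping:

```python
# Conceptual sketch of a CloudTrail -> OCSF-style mapping. Field names are a
# simplified subset of the OCSF API Activity class, for illustration only.

def cloudtrail_to_ocsf(event: dict) -> dict:
    """Map a few common CloudTrail fields onto an OCSF-style record."""
    return {
        "class_name": "API Activity",          # OCSF event class
        "time": event.get("eventTime"),
        "api": {
            "operation": event.get("eventName"),
            "service": {"name": event.get("eventSource")},
        },
        "actor": {"user": {"name": event.get("userIdentity", {}).get("userName")}},
        "src_endpoint": {"ip": event.get("sourceIPAddress")},
        "cloud": {"provider": "AWS", "region": event.get("awsRegion")},
    }

# A minimal CloudTrail-shaped input to exercise the mapping:
sample = {
    "eventTime": "2025-12-01T10:15:00Z",
    "eventName": "ConsoleLogin",
    "eventSource": "signin.amazonaws.com",
    "sourceIPAddress": "203.0.113.10",
    "awsRegion": "us-east-1",
    "userIdentity": {"userName": "alice"},
}
record = cloudtrail_to_ocsf(sample)
```

The value of the managed transform is that every security tool downstream reads one normalized schema instead of each vendor's raw format.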
Anyway, let's get back to why this is core, foundational, for AI. There's one more thing I wanted to show you, and I'll jump to the PowerPoint for just one slide before returning to the demo. Communication is very important in a Formula One race, so the team being able to communicate clearly with the driver is crucial. When the driver is instructed to come back to the pit lane, for whatever reason, they don't use complicated language or intricate words. They just say box, box, box, and he knows he needs to come back to the pit lane.
Why do I say that? It's a good analogy for extracting insights from your data: you don't want to overcomplicate things. Traditionally, to get insights from our logs and metrics, we've had to learn query languages, the Logs Insights language, or SQL, whatever the platform supports. One of the great things generative AI has enabled is that you can now express what you want in natural language, and the system automatically translates it into the query you need. That's something we've been investing in heavily over the last couple of years.
So let me just show you real quick how that works. All right, we're back to our demo here, and I'm going to just show you a couple of examples, one with logs and one with metric insights. So I'm going to pick a log group name, and I'm going to do CloudTrail because CloudTrail is, everybody understands what's going on with CloudTrail. I don't have to explain a lot of things. So what we're going to do is let's just run, let me see, let's run this query. Oh, I'm sorry, no, I don't want to do that.
So typically you have to come in and write your query. We now also support PPL and SQL as query languages, so you can craft your queries in those as well. And you can use the query generator, which is right here. You describe what you want in plain English; it actually works in other languages too, which is kind of cool. I was about to ask you that. Yes, it works in several major languages. You just say it in plain language, and it converts it into the query you want.
So let's do this one: show me the top 20 API calls made. I'll say go ahead, and a couple of things happen here. It understands the schema of the data source and incorporates it into the request, and using that information, it generates the query you want. You saw I already did it over there, so I can go ahead and run my query. And there you go: all the API calls made through CloudTrail, organized. I can go in and refine the query: sort them by this, or add that detail. Very simple to use.
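For "show me the top 20 API calls" against a CloudTrail log group, the generated Logs Insights query would look something like the one below (the generator's exact output may differ), and you can run the same query programmatically. The log group name is a placeholder for your own CloudTrail group:

```python
# A Logs Insights query like the one the natural-language generator produces
# for "top 20 API calls", plus a sketch of running it via boto3.

TOP_CALLS_QUERY = (
    "stats count(*) as callCount by eventName "
    "| sort callCount desc "
    "| limit 20"
)

def run_insights_query(log_group, query=TOP_CALLS_QUERY, minutes=60):
    """Start a Logs Insights query over the last `minutes` and poll for results."""
    import time
    import boto3
    logs = boto3.client("logs")
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,            # e.g. your CloudTrail log group
        startTime=now - minutes * 60,
        endTime=now,
        queryString=query,
    )["queryId"]
    while True:                            # queries run asynchronously
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(1)
```

The `stats ... | sort ... | limit` pipeline is the idiom the generator saves you from memorizing.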
The cool thing is that you can become an expert in these query languages very easily. I feel so powerful when I'm using it, because I don't know this complex syntax, and then you look at the result and say, oh, so that's how you do it. A super powerful use of AI. What it really does is take away the need to learn a specific query language in order to get the insights, which is what you actually want: to know what your data is telling you. Otherwise, time is spent learning the query language instead of getting to the insight. With natural language querying, you focus on the insight you need and don't worry about the query language under the hood. Yep, all right, very good.
Jack Operators: MCP Servers Enable Two-Way Communication for AI Operations
So I just showed you the demo. Now let's go to the second one, jack operators. Let's talk about MCP servers. So MCP servers are super important.
The way MCP servers fit into this analogy: when you're trying to leverage AI for operations, you ask a lot of questions and you need a lot of data. We're familiar with systems that store all the data in one place, organize it with some sort of vector database, and then let you ask questions about it. But there are a number of problems with that approach; among other things, it tends to be very expensive.
An MCP server changes the whole thing, and it's very similar to what happened in Formula One in 1993. Before 1993, for a period, teams could get telemetry: the speed of the car, the temperature of the engine, a whole bunch of things. But in 1993, one team introduced what they called two-way telemetry: not only were they receiving data, they could also make adjustments to certain things in the car. That gave them a 0.3-second advantage per lap, and they won a bunch of championships until the other teams caught on and started doing the same thing. That's what MCP allows us to do, because it lets us communicate both ways. The way it's often illustrated, it's like USB-C for standardizing the connectivity between your AI models and any API.
So why do you need them, and why are they important in AIOps? As I said before, you need to get data from your telemetry, your observability platform, and you need to take action, and you need to do both efficiently and safely. MCP servers allow you to do that. When we saw the potential of this and how important it was for agentic workloads, we decided to start building these MCP servers and providing them to our customers.
Today we have three MCP servers for CloudWatch and related telemetry: the CloudWatch MCP server, the Application Signals MCP server, and the CloudTrail MCP server. I'm going to show you an example with Claude, but you can use them not only with Claude or VS Code with Q Developer; you can also build your own agentic applications that connect to the MCP servers and let you read data, and some of them let you make adjustments too. Think about auto-remediation of issues; some customers are actually doing that. It's pretty cool.
So let me show you how it works. Let's go back here. This thing doesn't want to give up, huh? All right, this is Claude, and I have an application here. Let me show you first how it's configured; forgive me, this is very small and I'm so blind, goodness gracious. All right, there we go.
So you configure your MCP servers, and you can see how the CloudWatch one is broken, because the demo had to break, right? But we'll find another way of doing it. Basically, the way you configure MCP servers is you download the files from that QR code I showed you before, and I'll show it again. It points to a GitHub repo where we have all of our MCP servers, not only the CloudWatch ones, and there are instructions on how to set them up. The instructions there are for running them locally, which is what most customers are doing today with development environments like this, but you can also run them hosted.
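For reference, a local setup in an MCP client like Claude Desktop usually boils down to a small JSON config that tells the client how to launch the server. This is an illustrative sketch only; the exact command, server names, and environment variables should be taken from the instructions in the GitHub repo:

```json
{
  "mcpServers": {
    "awslabs.cloudwatch-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default",
        "AWS_REGION": "us-east-1"
      }
    }
  }
}
```

The client launches the server as a local process, and the AWS profile determines which credentials, and therefore which account's telemetry, the tools can read.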
So the use case for an IDE environment like this: you're coding, you just pushed a change, you have your development environment, you want to see how it behaves, you want to tweak some settings. Because I have mine configured and it's finally working, you can go here in the chat and say: I just deployed a new version of the Nova image generator. Now you see why I was using that app, right? Can you check latency or if there are any issues?
The good thing is that it understands my typos. What happens is this: the MCP server has a set of API calls defined, and the model can reason about which ones it needs to call. It will say, hey, I need to run this, so I can approve it. You can auto-approve things, but I keep approval on because I want to see exactly what it's doing. It asks me for permission to run the tool, then it calls the MCP server, which uses a set of APIs and returns information that helps me understand the state of my application.
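To make that concrete, here is a rough sketch, not the actual MCP server code, of the kind of CloudWatch GetMetricData request a latency-check tool call ultimately boils down to. The namespace, metric, and dimension names are assumptions for illustration:

```python
def build_latency_query(namespace, service_name, period_seconds=300):
    """Build a MetricDataQuery like one an MCP latency tool might pass
    to CloudWatch GetMetricData. The namespace, metric name, and
    dimension names here are illustrative, not the real tool's."""
    return {
        "Id": "latency_p99",
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": "Latency",
                "Dimensions": [{"Name": "Service", "Value": service_name}],
            },
            "Period": period_seconds,
            "Stat": "p99",
        },
        "ReturnData": True,
    }

# The server would then call CloudWatch with it, roughly:
#   import boto3
#   from datetime import datetime, timedelta, timezone
#   cw = boto3.client("cloudwatch")
#   end = datetime.now(timezone.utc)
#   cw.get_metric_data(
#       MetricDataQueries=[build_latency_query("AWS/ApplicationSignals",
#                                              "nova-image-generator")],
#       StartTime=end - timedelta(hours=1),
#       EndTime=end,
#   )

query = build_latency_query("AWS/ApplicationSignals", "nova-image-generator")
print(query["MetricStat"]["Stat"])
```

The point of the MCP layer is that the model never has to know this shape; the server defines the tool, and the model only decides when to call it and with what arguments.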
So it's running the audit service tool; this is Application Signals. It found some connection errors to Bedrock, which is interesting, right? There's a problem with the application, so maybe something happened when I pushed the code. You get the point: you can ask very intricate and complicated things in plain English. I remember doing a demo once where I said, look at my CloudTrail data and show me any unusual activity that happened yesterday. It goes in and does the query, and then I said, okay, now compare that to the previous day and create a comprehensive report for me. It gave me this amazing report comparing the two days, and it called out a few things that were very interesting, things it had never occurred to me were happening in that account. So I encourage you to test it out. It's super powerful, and it can empower you as a developer to get on top of things real quick, see how things are behaving, with a direct connection to the telemetry.
Before you move on, a couple of things to notice here. First, you were in the loop, deciding which MCP tool gets used. You can automate that, but there's an element of trust to be built with the agent you're working with before you get to that point, and this lets you get there. Second, it also comes with recommendations on where you might need to make fixes in order to remove those errors. That ability to identify issues early and get to the root cause, and then make corrections, is extremely important and extremely powerful in your journey to implementing AIOps.
Yeah, so I just issued another command. The Application Signals MCP server has a special tool called audit services that combines a number of API calls internally, and it's actually pretty cool. What I just did was ask: do you recommend any alarms for this service? It analyzes all the metrics, checks metric density and other parameters to figure out whether each one is a good candidate for an alarm, and then recommends alarms, which is super cool. So it's running that, and at the end it will come back and say, you know what, you should probably enable these three or four alarms; they're kind of critical. That's super powerful. I'll come back and show it to you in my next demo. Hopefully it'll be done by then. All right, let's move on.
Strategist: CloudWatch Investigations Transforms Incident Response with AI-Powered Analysis
Okay, we're back? Yep. All right, let's go to the next one, the strategist, and this is my favorite application. I've been working with it for a year now: CloudWatch Investigations. Let me tell you a little bit about this tool. If you think this slide is busy and confusing, we did our job, right? Because this is how the process of responding to an incident looks today. You get a page, you go check an alarm, you review the metric, you go to a dashboard, you call a friend, all that stuff happens, and it's very confusing. So through things we've learned internally at Amazon and all the feedback our customers give us, we asked: can we help solve this problem with generative AI? So we created Amazon CloudWatch Investigations.
Now, how many of you knew about CloudWatch Investigations before today? Raise your hand. Okay, all of you do now, all right. The reason I have a new tag there is because there's a new feature that is incredibly powerful; I'll tell you about it in a moment. But here's the thing: it's an AI-powered tool that helps you expedite the process of resolving incidents in your infrastructure. When you open an investigation, it automatically surfaces metrics that could be related to the incident, log queries and results that could be related, traces, AWS Health data, and CloudTrail change data.
One customer told me this is awesome just because it tells them whether an issue is AWS-related or theirs: it automatically checks AWS Health and will tell you if there's an issue with a particular service, so you can move on. You don't have to be stuck there spending an hour trying to figure out why your application is failing.
It's multi-account ready, so you can investigate across accounts, and you can start an investigation from any CloudWatch widget, or have it triggered by an alarm, which is super cool because by the time you get to the console after the alarm goes off, the system will already have some information for you to act on. That's super powerful. The new feature we're introducing is that you can now generate incident reports. These reports are based on what we use internally at Amazon, called Correction of Error (COE) documents, and they ask the five whys. This gives you very detailed information about why the incident happened, how it was mitigated, how many users were impacted, everything. I'll show you one; it's better than me talking about it.
All right, here's how it works. You start an investigation, like I said, from an alarm or from any widget. The most difficult thing when you're doing incident investigation using telemetry in a central location is the topology of the application. Think about it: you have all this telemetry in a big data lake. How do I know that this piece of telemetry is related to that other piece of telemetry? That's super difficult.
The team at CloudWatch did an awesome job, and the first thing that happens once you open the investigation is that we start creating a topology of that application. How do we do it? We use internal resources, we traverse the infrastructure using different APIs, and we use entity features in the CloudWatch agent, and we build a map from that. With that map, we can focus on a subset of the telemetry to figure out what the problem is, and that expedites the process and makes it a lot more efficient.
What happens immediately is that the system starts surfacing observations: hey, I noticed this, I noticed that. Through the console, you and your team, because this is a collaborative tool where multiple people can work together, can add more data points to the investigation, make notes to improve it, or say, you know what, this is not related, discard it, and it removes it from the investigation. All those steps keep refining the process, and then the system comes to a hypothesis.
Based on what it sees, this is what it thinks is happening. Not only that, it tells you how it came to that conclusion, with full documentation, and what you need to mitigate and solve the problem. It may point you to documentation, knowledge base articles, or best practices, and in some cases it will give you an SSM automation runbook to fix the problem.
Here are the benefits. Interestingly, the Kindle team internally at Amazon started using CloudWatch Investigations for some of their incidents, and the result was 80% time saved on investigations. That's what we want to do: reduce the complexity and all the friction that comes up when you start an investigation. The other thing that's pretty cool is that through this process you learn a lot about your infrastructure that you didn't know, because some weird metric surfaces and you think, maybe I should keep an eye on that. It's a very powerful tool, and because it keeps the human in the loop, it's not only expediting the process of solving issues, it's helping you learn more about your environment at the same time. We think that's very important. Were you going to add something, Pratul? No? Okay, you've got it covered for now. All right, very good.
Live Demo: CloudWatch Investigations Identifies Root Cause in Minutes
So let me show you a quick demo of Investigations. You'll see, I also wanted to show you how the system came back with recommendations for alarms I should enable. It's super cool. All right, I'm going to go to Investigations, to this one here, and show you something that happened. I have this application I built called the Nova Image Generator. It uses Nova through Bedrock, and you just type, hey, create me this picture. It's the same thing you can do when you go to Nova.Amazon.com.
I have a process running that requests images every so often, and I have an SLO defined using Application Signals. If you don't know what that is, learn more about it; it's super cool. Basically, it keeps an eye on that metric and lets me know if there's an issue, and I have an alarm tied to it. You can see here that the alarm fired because I broke the application, and you'll see how I broke it in a minute. The alarm is configured to automatically start an investigation, so if I go to AI Operations and then Investigations, I see that it already opened an investigation for me.
All right, if I go into that investigation, I can see the initial piece of telemetry it used to start the investigation, which is the metric backed by that alarm, in this case the SLO. The SLO average is based on errors, so it spiked. I broke the application while Pratul was talking earlier; it spiked, and the alarm triggered. That's the first piece. And then right here, as this is happening, you see the agent queue. I can go in and see what it's doing; it's constantly moving. It's not a black box: it shows me what metrics it's analyzing and all the details. I can dive in, I have visibility, I can filter. The back end is fully agentic, so you can see that playing out there in how it's handling the investigation.
Now if I go back here, these are the observations I mentioned before. As they surface, I can read them and validate them. I can say, this is not related, discard it. I can also add a note and say, you know what, I think this is related to blah, you should check this, and the system will incorporate that into the investigation too. And this is where it becomes extremely powerful. Imagine the alarm went off, it kickstarted the investigation, and it came up with hypotheses the agent believes could be the possible root causes of why the alarm fired. So you're not going in and trying to figure out from scratch what went wrong; you start from a position of, here are the two or three possible reasons the alarm went off. And if a hypothesis makes sense to you as an explanation, you can accept it.
It keeps a record of every hypothesis you've accepted and keeps investigating until you get to the root cause faster. So you're not starting in the dark, saying I don't know where the problem could be; you're saying, here are one, two, or three places where the problem could be, let me figure out, in order of priority, which matters most. I'm sure you can imagine how much time that saves on the front end of figuring out where the issue is. The other cool thing I like about these observations is that you can dive deep into them and see what the system is doing, so this is also an excellent tool for learning the platform. You see that message there? It says pattern the message and pipe it to anomaly. That's a feature we released alongside Investigations, and we call it always-on anomaly detection. You can use it on your own with any log. It detects anomalies between two time ranges: if you select one hour, it compares the two ranges, does a pattern analysis, and tells you if there are any anomalies between the two sets. So it's an excellent learning tool as well.
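The two-range comparison can be sketched in a few lines of Python. This is a toy illustration of the idea, not the CloudWatch implementation: mask the variable parts of each log line so similar lines collapse to one pattern, then report patterns that appear only in the newer window:

```python
import re
from collections import Counter

def to_pattern(message: str) -> str:
    """Collapse variable parts (hex IDs, numbers) so similar log lines
    map to the same pattern, roughly in the spirit of pattern analysis."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<*>", message)

def new_patterns(baseline: list[str], current: list[str]) -> Counter:
    """Return patterns present in the current window but absent from the
    baseline window: a crude stand-in for comparing two time ranges."""
    base = Counter(to_pattern(m) for m in baseline)
    cur = Counter(to_pattern(m) for m in current)
    return Counter({p: n for p, n in cur.items() if p not in base})

baseline = ["GET /health 200 in 12ms", "GET /health 200 in 9ms"]
current = ["GET /health 200 in 11ms",
           "AccessDenied: s3:GetObject on bucket 42"]
# Only the AccessDenied line yields a pattern that is new in `current`.
print(new_patterns(baseline, current))
```

The real feature works on log patterns at scale and inside the investigation workflow, but the core intuition is the same: a pattern that exists in the current window and not in the baseline is worth surfacing.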
Eventually the system comes up with a hypothesis, and here it is. It basically says that S3 bucket policy deployments denying access to the Nova image analyzer Lambda function are causing immediate service failures when the function attempts S3 operations. That's exactly what I did: I put a deny-all policy on that S3 bucket for the role the Lambda is using. So it found the issue very, very quickly. It gives me a summary of everything it found, and it gives me a map. Remember I told you about the topology? This is the map of the application topology it used to figure out what to check. It also gives me troubleshooting next steps and runbook actions; I could just run this if I needed to. Plus knowledge base articles, so there's a lot of good information here.
Now I can go ahead and accept that hypothesis. Once I accept it, the system starts collecting facts, and those facts can later be used to generate an incident report. So let me see if I have one I can show you.
I'm going to go to this archived one and see if it generates quickly. If not, I can go to another one. Anyway, while we wait: what's happening on the back end is that all the facts gathered about the incident, like which pieces of telemetry were used, the connections between them, and how this impacted everything, are collected into this report. Now you have a very comprehensive report you can use to document the incident and prevent it from happening again, because it has all the details you need for that. You'll see that once it completes here.
Some information may not be included because there's no way for the system to know it, like how many customers were impacted; that's sometimes a little difficult to figure out. So you can edit those in the system: go in and add or update those facts so that the report is more complete. Anyway, you're going to have to trust me on the report; maybe I'll show it to you later. This is a feature that customers using CloudWatch Investigations love. But you saw how quickly we were able to get to the resolution.
Race Engineer: AWS DevOps Agent Delivers Proactive and Autonomous Operations
And this, I think, is a game changer for our customers: the fact that you can now get to the root cause, like Pratul said at the beginning, not in hours but in minutes. All right, let's talk quickly about the last one. This one was announced today, and it's called AWS DevOps Agent. What is it? AWS DevOps Agent is what's called a frontier agent: it resolves and proactively prevents incidents, and continuously improves the reliability and performance of applications across AWS, multi-cloud, and hybrid environments.
So now you're going to ask me: what's the difference between this and what I showed previously? The difference is how they work. This is a frontier agent, a new type of agent that, when you let it loose, goes out looking for issues that could be brewing somewhere, anomalies and things like that. CloudWatch Investigations, by contrast, is a CloudWatch feature we offer at no cost to our customers. That's the big difference. This is a full-blown frontier agent: it works with third-party telemetry and does a lot of preventive work on its own, whereas the other one, as you saw, is designed for incident response and finding the resolution with a human in the loop, which we feel is important. I think this is a great first step so customers can prevent incidents and find out what's going on in their infrastructure, and I'm very excited about what the team is going to deliver on this.
Yes, this AWS DevOps Agent is just the start of the journey toward a nirvana stage of autonomous ops, where a pager goes off but the agent has already figured out the root cause and taken the corrective remediation action. You just validate everything, or you don't even have to, and you wake up in the morning to a report of everything that happened last night, all the changes and fixes it made. That's where we all want to go. This is the first step toward it. Everything we showed before, all the way up to here, is your base implementation, your foundation: making incremental changes in your operations to get to that stage of nirvana. That's the journey we're on.
And I promised I'd show you the incident report. There it is. It's a very detailed incident report with, as you can see, metrics, all the definitions, everything that happened, what went well, the analysis, detection, and mitigation, very well documented and structured, with all the details you need and recommended action items. You'll notice at the beginning there are fields that need input. You can update those here, and they'll update in the report. You can export the report to PDF, or copy it as markdown and incorporate it into whatever tool you want.
Alright, so let me switch back and say: there you go, you have your pit crew. Learn more about the foundational things we discussed, the unified telemetry with CloudWatch. Learn more about how we're using MCP servers. CloudWatch Investigations is an awesome tool; if I were you, I would just enable it. It doesn't cost you anything and it can save you a lot of time.
We're very excited about what the future holds with the AWS DevOps Agent, so please take a look at it.
Your Race Strategy: Implementing the Complete AIOps Pit Crew for Operational Success
Awesome, Andres, thank you so much. What we showed you here today was your race strategy for the season. I started this session by talking about the preseason, getting set up: start using CloudWatch natural language query generation and get your teams talking to their telemetry in natural language. Then, to be better prepared, as Andres mentioned, please enable CloudWatch Investigations. It's available at no cost, and it can get you where you need to go faster in the case of an incident.
To be even more structured, deploy MCP automations. I showed an example of how it integrates with Amazon Q: at development time, while you're coding, you can ask what metrics need to be captured, what alarms need to be enabled, and what kind and level of logging is needed. All of those questions can be answered up front during the development process.
Last but not least, if you're ready today, go ahead and deploy the AWS DevOps Agent for autonomous operations. You'll be able to reach that stage of nirvana on this journey and get to a championship finish in your operational transformation race. Formula One teams don't win just by thinking about strategy, but by actually implementing it. What we did here today was give you the tools and show you how to use them to implement that race-winning strategy.
Here's the QR code. It has all the information you'll need: everything you learned here today and everything else you need to get started. Last but not least, please fill out the survey and give us five stars if you want to see us back on stage next year. That's the only way to do it. Thank you, everyone. Thank you for taking the time. Amazing race, thank you.
; This article is entirely auto-generated using Amazon Bedrock.