🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Move beyond reactive: Transform cloud ops with AWS DevOps Agent (COP362)
In this video, Bill and David from AWS introduce the AWS DevOps Agent, launched in public preview. The agent autonomously investigates incidents by analyzing metrics, logs, traces, and deployments, achieving an 86% success rate in root cause analysis across 1,000+ internal AWS incidents. David demonstrates four key capabilities: automated incident resolution (identifying a cache serialization bug and generating rollback procedures), custom MCP server integration for bespoke observability tools, interactive chat for steering investigations and querying environment details, and proactive prevention recommendations by analyzing past incident patterns. The agent integrates with CloudWatch, Dynatrace, Datadog, New Relic, Splunk, GitHub, and GitLab, building application topologies automatically and working alongside human engineers and other AI agents.
Note: This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Meet the Team Behind AWS DevOps Agent
Got awful quiet in here all of a sudden. I'm Bill, product manager for AWS DevOps Agent. I'm going to be joined here on stage later by David, who's a senior principal engineer here at AWS.
A little bit about David before he comes on a little bit later. David's been part of the teams that have built AWS services like DynamoDB, Lambda, IoT, and CloudWatch. As a senior principal engineer who's been at Amazon for nearly 20 years now, he's also been one of the people who's really shaped the DevOps culture at Amazon and AWS. He publishes a weekly internal blog for all of our AWS and Amazon builders called This Week in Observability. He's also been a primary contributor to the AWS Builder Library. Is anybody familiar with the Builder Library out there? A few of you. That's where we publish in-depth pieces on how Amazon approaches building and operating software. I checked last night, and David has authored 11 of those articles. I think he's the most prolific author on that site, so I do encourage you to look at those. There's a lot of really interesting and helpful information there.
I say all of that because we've been really lucky to have David be part of the team that's helped us build the AWS DevOps Agent over the past year, so his fingerprints are all over it. A lot of the culture, the DevOps culture that he's tried to infuse into the AWS and Amazon development and operations teams, has been poured into the DevOps Agent itself.
A Day in the Life: How AWS DevOps Agent Responds to Production Incidents
The other thing I wanted to say is David's a senior principal engineer at Amazon. Some of you might be wondering what is that? What does a senior principal engineer at Amazon really do? It's actually kind of a big deal, but what's the day in the life of David? If you're one of those people, hopefully there's a few of you that had that question, we got some undercover footage of a day in the life of David from a couple of weeks ago. So let's take a look at this.
So there he is. It turns out that the first thing you need to know about a senior principal engineer is they still get paged, and they still have to scramble. They have to get back to their office, drop everything, get their laptop out, and get their favorite playlist going to help them work through this troubleshooting scenario. But what's new this week is that David has a new teammate: the AWS DevOps Agent. So the first thing he does is check in, and before he even got his laptop out, the agent had already analyzed metrics, logs, and traces, checked for deployments that may have impacted this situation, and identified the root cause. All before David got his laptop open.
So let's see what else it can do. Can it help us get out of this alarm state and resolve this incident? Sure enough, it's clearly been reading David's blog, This Week in Observability, so it knows how to generate an effective change document, what we call an MCM, that includes pre-validation steps, action steps, and post-action steps. And then you'll see here that it's actually even showing off a little bit: the agent is providing David with the code to go ahead and actually fix what was, in this case, a bad deployment.
So the last thing to do here is to actually perform the rollback as instructed and fix the issue. So fingers crossed, we'll check the validation steps. The errors are going to, wait for it, go away. All right, and so we should have a happy senior principal engineer. So that's just a little bit of insight into David, but also really what the AWS DevOps Agent does.
Core Capabilities: Resolving Incidents and Preventing Future Problems
We launched AWS DevOps Agent in public preview yesterday. Let me ask, who here in the audience has ever been on call, carried a pager, been on the other side of that page? Is that it really? Only about half of you? Okay. One other question just out of my curiosity, is anybody here actually on call right now? Okay, well, we got a couple of dozen of you, so if you get up and run out, I won't take it personally.
So what does the AWS DevOps Agent do? It performs two fundamental tasks. The first, which I just showed you, helps you resolve incidents faster. It analyzes the metrics, logs, and traces that are related to a particular incident and looks for related deployments. It shares its findings with you, generates a root cause, and provides you with mitigation steps to resolve the issue.
All with the goal of helping you reduce your mean time to resolution (MTTR).
The second thing, which you didn't see in that video, is that the agent is always there, always available, always sort of lurking in the background, and it's also designed to help you prevent incidents before they happen. It does that in a couple of ways. It will periodically scan all of the incidents that have been investigated and managed by the agent, take its understanding of things like AWS best practices or Well-Architected best practices and of your application environment, and suggest changes: changes to fix what it detects as underlying problems that created the context for a particular set of incidents, or other opportunities it sees to improve the posture of your application.
Four Key Characteristics: Team Member, Telemetry Expert, Pipeline Pro, and Application Knowledge
So, how does it work? There are four things that I want to highlight about how the AWS DevOps Agent works. The first is that we've designed it to operate as a member of your team. If you saw the keynote yesterday, Matt Garman talked about frontier agents, and an aspect of a frontier agent is that it can work autonomously as part of your team. So in this case, it can respond just like an on-call engineer would to support tickets, pages, and alarms that you've configured to trigger the agent to investigate a particular incident. It can then write up its findings and communicate them back into, say, a comment on a ServiceNow ticket. It can also, if your team uses something like Slack to coordinate during an incident, post its findings back into the Slack channel itself, just as any other engineer would.
We've also designed the agent to work with more than just the people on your team, and I'll talk about this in a little more depth in a moment: it's designed to work with other agents that you may be deploying in your environment as well. The second characteristic of the DevOps Agent that I want to share with you is that we've designed it to be a telemetry expert. We've built in integrations with not just AWS CloudWatch and the telemetry that AWS provides, but with our launch partners Dynatrace, Datadog, New Relic, and Splunk. We've given it access to the telemetry signals that you've told us you use, and you've told us you use other things too.
A very common thing we've heard is open source stacks with Grafana, Prometheus, and something like Loki, for example, and so we give you a way to tie into those self-hosted telemetry systems, or really any other telemetry you may have, through something we call BYO, bring your own MCP server. With that in place, the agent can do what you saw in that video: rapidly scan through the important or related logs, metrics, alarms, and traces in your infrastructure. And because it has that access and is also oriented toward preventing future incidents, it will provide you with recommendations on how to improve your observability posture. That might be adding what it perceives to be missing metrics, or perhaps tuning what might be flapping alarms, those kinds of things.
So the third characteristic of the AWS DevOps Agent that I wanted to highlight is we've worked very hard to make it what we think of as a pipeline pro. So often it's the case that when an incident happens, it's a result of some kind of change. It might be a change to the application code, it could be a change to the infrastructure code. And so we've given it access through built-in integrations to things like GitHub and GitLab so it can understand your deployment pipelines and the repos that sit behind those pipelines so it can detect these changes. And again, it's got this dual personality of being able to react to incidents but then also provide guidance on how to prevent future incidents. So in this case, it may make suggestions to improve your pipelines themselves, perhaps to add tests or other things that will help you prevent incidents to begin with before they happen.
The fourth characteristic and the last one that I want to talk about is it was important for us to be able to construct this agent so that it really knows your apps and your organization. So, knowing your apps, what it will do is it'll automatically generate what we call an application topology. You can think of it as a knowledge graph or a map of your application and your environment. What are the important entities? What are their relationships with each other? That'll help guide the agent as it's performing its tasks. The other thing that we recognize is that there's some things that are going to be unique about your organization, your practice, so you can provide those to the agent.
This guidance comes in the form of two things: what we call steering files, or you might think of them as runbooks that you can directly provide to the agents that you create. You can also give the agents access to existing runbooks that you may have, so these might be coming out of your ticketing system or, for example, we have customers who told us they maintain their runbooks in something like Confluence. And if that's something you want to actually see in action, I understand that the Atlassian booth at 5 o'clock is going to show you how to connect AWS DevOps Agent through the Atlassian MCP server into a tool that will give the agent access to the Confluence runbooks.
Having said all of that, the one thing we've also acknowledged is that at some point there's a limit to what the agent can do, and that's where we give you the ability to bring in an expert, a human expert. At any point during an investigation, as the AWS DevOps Agent is working through findings, generating hypotheses, performing root cause analysis, or creating mitigation plans, if you want an expert human set of eyes, with one click you can pull an AWS support engineer into the investigation. They'll be provided with all the context up to that point, and you can be chatting with not just the agent but also an AWS support engineer to help you refine the root cause or validate mitigation steps.
Application Topology: Building a Knowledge Graph of Your Infrastructure
So I talked about topology. I wanted to double click on that just for a second here. This is a picture of what an application topology looks like in the AWS DevOps Agent web application. So what is it? How this is created starts from the ground up. When you set up your AWS DevOps Agent, you create what's called an agent space, and within that agent space you give it access to one or multiple cloud accounts, and we'll start discovering the resources that you've given us access to through IAM roles and their relationships with each other.
The agent will then build, and continuously maintain, that topology. It will look for additional relationships. If you've given it access to a telemetry source, it'll bring in that source's service maps and map things like log groups, metrics, and alarms into entities in your topology. If you've given it access to your CI/CD pipelines, it'll map deployments into entities too, so it builds up this structure, this map of your architecture. During an incident, that helps it narrow in and refine, say, the log queries or metric queries it needs to run to determine what's going on.
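To make the idea concrete, here's a minimal sketch of what such a topology might look like as a data structure. The entity names and relationship types are hypothetical, drawn from the Robots-as-a-Service demo later in the talk, not the agent's actual data model:

```python
# Hypothetical application topology: entities plus typed relationships.
topology = {
    "entities": {
        "bot-forge": {"type": "service"},
        "bot-config": {"type": "dynamodb_table"},
        "forge-pipeline": {"type": "deployment_pipeline"},
        "forge-errors": {"type": "alarm"},
    },
    "relationships": [
        ("bot-forge", "reads_from", "bot-config"),
        ("forge-pipeline", "deploys_to", "bot-forge"),
        ("forge-errors", "monitors", "bot-forge"),
    ],
}

def neighbors(topo, entity):
    # Entities directly related to `entity` -- the kind of lookup that lets
    # an investigation narrow its log and metric queries to relevant resources.
    related = []
    for src, rel, dst in topo["relationships"]:
        if src == entity:
            related.append((rel, dst))
        elif dst == entity:
            related.append((rel, src))
    return related
```

For example, `neighbors(topology, "bot-forge")` surfaces the table it reads, the pipeline that deploys it, and the alarm that monitors it, which is roughly the context an incident investigation needs first.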
We'll also compare this application map to patterns it knows, for example AWS Well-Architected best practices, as well as any guidance you've given through the steering files for your own guidelines and policies. It will look for deviations from those and use them to form part of its recommendations, what I earlier called the prevention recommendations.
Real-World Success: 86% Accuracy Rate and Customer Validation
So the last thing I want to talk about is, you know, the success that customers have had with the AWS DevOps Agent so far, and then I'm going to turn it over to David to demo the solution to you. But the first thing I want to say is you're probably wondering, we released it yesterday, how could I have customer examples of this? But we actually started back in August. We made AWS DevOps Agent available to internal Amazon and AWS teams to use in their on-call rotations to help them improve the posture of their applications and services.
And so far, the AWS DevOps Agent has internally handled over 1,000 real, live Amazon and AWS application and service team incidents. We carefully track each of those, working through them manually with the teams whose on-calls actually responded to those incidents, and we've determined that the AWS DevOps Agent has to date achieved an 86% success rate on its root cause analysis.
So two things about that. 86%, it's a B, right? And so, you know, we're certainly after an A here, and so we're continuing to work on improving this every day. But one thing that, you know, was pointed out to me by some of our customers is don't beat yourself up too much over that 14%, because even when the agent is getting it wrong or maybe has an incomplete root cause analysis, it's gone down pathways, right, and shown its work in terms of the log analysis that it may have done or the metric analysis that it may have done. And those are pathways that I would have gone down, dead ends that I as an on-call engineer would have gone down, and so you saved me a lot of time, even if in the end you didn't get it right.
Let me share a couple of other examples here. We opened it up in September to external beta customers. One of the first customers that we made it available to was the Commonwealth Bank of Australia. The first thing they did after getting it set up was to give it something tricky. They said they had a network incident a couple of days ago that took them and their team a little over five hours to root cause. What they did was give the AWS DevOps Agent the same context that their teams had a couple of days prior and asked it to figure it out. What came back in 15 minutes was an accurate root cause analysis. We're really happy with what we're seeing and what our customers are seeing in some real world scenarios.
The last thing I wanted to point out is that, as a result of the customer experience in the beta, and as I mentioned earlier, we're making the AWS DevOps Agent something that works not just with other people, other on-call engineers, SREs, and DevOps engineers on your team, but also with other agents. We worked with some of our launch partners on this, and here's an example with one of them, Dynatrace, where we jointly targeted a few customers. You see Clarivate, a chemical manufacturing company, Western Governors University, which, if you haven't heard of them, has about 200,000 students, and United Airlines, which many of you probably flew to get here. These were joint Dynatrace and Amazon customers where we worked with them to set up the AWS DevOps Agent to work in concert with Dynatrace, and particularly with Dynatrace's Davis AI capabilities. Western Governors is a great example that I'd like to highlight: in less than a day they were able to get both of these agents up and running and working collaboratively in their production environments to help them investigate, in this particular case, real-world incidents.
Demo One: End-to-End Incident Investigation with Robots-as-a-Service
So that's the introduction. It's time now to turn it over to David, who's going to walk you through in detail what the AWS DevOps Agent can do. Alright, so yeah, a picture's worth a thousand words, so let's just look at a whole bunch of examples of the DevOps Agent going and doing things. I'll show you four different demos. We'll start with a slower-motion version of that fast-motion video, where I started with the pager, which I still have. Hopefully it doesn't go off during this presentation, or I might have to investigate something. It's not going to, I'm just kidding. Then we'll look at customizing the DevOps Agent by bringing in MCP servers. I'll show you how you can steer it, nudge it in a direction, or ask questions of it with chat. And then I'll show that future issue prevention, where it scans past incidents looking for patterns of things you can improve long term to prevent future issues.
Okay, first demo time. Throughout, we're going to look at a set of services I built that work together, called Robots-as-a-Service. You can imagine this is a set of web services that talk to each other and let you control robots in the world: send them commands, schedule them to do things, see where they are, and whatnot. I actually created this. It's a set of microservices: a gateway service, the bot service that talks to a schedule service that talks to the action service, which uses AWS IoT to send commands to the robots, and then a forge service you would use for provisioning. All of this is simple on the surface, but each of these services is internally complex. There are a lot of different things, like security groups, load balancers, auto scaling groups, and IAM roles, that actually make them up. So lots of things that could potentially go wrong.
So let's get to the action. Let's start by implementing a bug and pushing it to production. Here I'm in the code. I have this caching logic in the forge service, and I'm going to add a timestamp field to the cache records, just because maybe I need to measure how effective my cache is, how long things are cached for. Harmless change, we're just adding data, it's not going to cause any problems. So let's just push that. Fantastic. And in the distance, alarms. So okay, things break. In this case I'm using Dynatrace for my application-level monitoring and alerting, and sure enough, it detects the problem automatically. These Davis alarms are pretty nice; I didn't have to set anything up. It finds the failure rate increase, and sure enough, yeah, problems. The bot service is seeing errors. I can see that there are problems throughout the stack; all the services are lighting up with bad failure rates, so not good.
In the meantime, Dynatrace has called the webhook that sent this notification to the AWS DevOps Agent. So I didn't even need to show up and look at it; it had already kicked off an investigation. This is the operator app within the AWS DevOps Agent. I go here and I see all my ongoing and past investigations. Let's go ahead and look at this one to see the details.
Ignore the chat for right now; we'll come back to that. So here we see the issue. This is what Dynatrace Davis sent the AWS DevOps Agent. You can see the details of the problem, where the problem is, and an extremely precise timestamp of when the problem occurred, which is really helpful. And so let's go from here. From a story standpoint, let's just jump to the end, right to the root cause.
So this investigation is already complete. By the time I made it back into the office and opened my laptop, the investigation was done. And what is the root cause in this case? Well, it says a code deployment introduced a cache serialization format incompatibility in the bot-forge service. And this is exactly what you saw me implement. It says I added a new timestamp field with the code change, and it understood from the surrounding code that the deserialization method didn't expect this new field, so it was failing.
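The failure mode described here is a classic one. Here's a minimal sketch of how it can arise; the field names and the strict-schema check are hypothetical stand-ins for whatever the real forge service code does:

```python
import json

CACHE_SCHEMA = {"bot_id", "config"}  # fields the old deserializer expects

def serialize_v2(record):
    # The "harmless" change: new code adds a timestamp to every cache record.
    return json.dumps({**record, "cached_at": 1733000000})

def deserialize_v1(blob):
    # The old deserializer rejects any field it does not recognize.
    record = json.loads(blob)
    unknown = set(record) - CACHE_SCHEMA
    if unknown:
        raise ValueError(f"unexpected cache fields: {unknown}")
    return record
```

Cache hits fail because the old reader chokes on the new field, while cache misses that go straight to DynamoDB never touch the deserializer, which is exactly the succeeding-request nuance the agent picks up on next.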
It also noted that the requests that were cache misses, the ones that went right to DynamoDB, succeeded. That's an interesting nuance it noticed while doing the investigation, that some requests were succeeding. It wasn't obvious, but it's useful evidence to prove that, yes, this code change was the problem. So great, let's go back to the top and see how the agent got there. We'll scroll through slowly here.
So it started off by saying, okay, I'm going to investigate. Let's start by figuring out where we are in the world. It goes and reads some runbooks. These are all MCP tool calls. You'll notice that since I'm using Dynatrace, it's doing DQL queries. If I want to inspect what it's been up to, I can. Here I'm clicking on one of the details, and it's showing me the inputs and outputs of each of those MCP calls. So fantastic, it runs a bunch of queries to figure out what's going on.
Now, between calling Dynatrace and querying our own internally built version of your topology, it says, okay, here's what we're dealing with, here are the services, here's what talks to what, let's keep going. Next it continues to triage and understand the health of each of those services. It says, okay, yep, the front-end service is seeing errors, and it's having errors talking to its dependencies, so it goes looking for the needle in a different haystack. It investigates those services: they're also seeing errors, but specifically when talking to the bot-forge service, which is itself seeing errors. So it moves its search again and says, okay, bot-forge seems to be the most depended-upon service. Let's zoom in on bot-forge and really interrogate it to see what's going on.
So for bot-forge, it's calling AWS APIs through MCP, and it's calling its own topology, and now it has figured out the AWS resources and everything that makes up the service: which pipelines deploy to it, where its logs are. Let's figure out how to zoom in and explore this thing. From there, there's a lot of investigating to do, so it comes up with a lot of ideas. Let's search the bot-forge deployments and infrastructure changes to see if we just changed something. Let's look at the EC2 instance health metrics to see if maybe we ran out of something like file descriptors or CPU. Let's check other components, like the bot config DynamoDB table this service uses, to see if it's running out of capacity. Let's look at application logs and zoom in to see if there are any hints there. And let's look at traffic patterns to see if maybe there's some incoming shift causing a problem, some new robots being added or asked for, who knows?
From my experience, these are the key things to look for when you're trying to figure out what to mitigate, because each of them maps to a mitigation. If we're running out of CPU, let's add more servers. If we deployed something, let's roll back. So these map to things that can be mitigated. But that's a lot to look for. The agent has listed all these things it wants to do, and it's going to do them in parallel. It has a lot more hands than I do, so instead of doing one thing at a time, it's just going to investigate everything.
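The fan-out pattern described above can be sketched like this; the check functions and their canned findings are hypothetical stand-ins for real subagent investigations:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical investigation leads, mirroring the ones in the demo.
def check_deployments(): return ("deployments", "new launch template version found")
def check_instance_health(): return ("instance_health", "CPU and file descriptors nominal")
def check_dynamodb_capacity(): return ("dynamodb", "no throttling on bot-config table")
def check_application_logs(): return ("logs", "serialization stack traces found")
def check_traffic_patterns(): return ("traffic", "no incoming shift")

def investigate_in_parallel(leads):
    # Each lead maps to a possible mitigation (roll back, scale out, ...),
    # so the agent fans them out to subagents instead of checking one at a time.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda check: check(), leads))

observations = investigate_in_parallel([
    check_deployments, check_instance_health, check_dynamodb_capacity,
    check_application_logs, check_traffic_patterns,
])
```

The observations dict then holds one finding per lead, which is roughly what gets surfaced, for example, into a Slack channel.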
So it uses subagents. Now, we're not going to look at the output of each individual subagent as it goes along; we're just going to look at the observations, the key things it has figured out. These are the types of things that get pushed to, say, a Slack channel. So it says, aha, there's been a CloudFormation stack deployment with a new launch template version. And it's interesting: the agent likes to do a little extra credit sometimes. It found that during this CloudFormation stack update, here are the instances that were added and replaced. The way I do deployments in this fleet is I replace my instances, kind of blue-green, with each deployment, and it noticed that and says, okay, here's an instance that started right when the errors began. Pretty neat.
Okay, what else does it notice observation-wise? It's actually looking at a lot of things and giving me the evidence for what it has ruled out. This is what Bill was talking about: it's useful that it also tells you what it looked at that it thinks is not important, and it's actually kind of clever about that. For example, it noticed a CPU spike on one of the instances, which is suspicious and correlates in time, but irrelevant, because this is an instance that just started up. Of course it's going to use a lot of CPU as it installs software and so on. So it's nice: okay, we saw this, we're going to ignore it. This is just all the evidence it found.
Looking at the log files, wherever they were, it found them and noticed some stack traces. This is where it starts to get hints of the root cause. Here's the line of code that, according to the log, seems to be having problems. More observations; it's pushing these to Slack and so on. What else happened here? It found the rolling instance replacement; these are the instances that started up and were shut down. This is also where it observed from the logs that the requests that were cache misses succeeded. So these are really interesting details, a hint that there's definitely something wrong with the cache interaction here.
And then, sure enough, here's the code deployment. It's connected to the GitHub Actions setup I'm using to drive the code changes, not just the CloudFormation part of the infrastructure change. And this is where it really did the fusion of, okay, based on what I know about the logs and based on the code that changed, here's what I think is actually happening. So great, all the subagents are done, and it pulls all that data together and reasons its way to the right conclusions, and then we're back at the root cause we talked about before. So great, it figured it out.
Now, if I were to jump to this from Slack, the first thing I would look at once the investigation is done is the root cause, formatted a little more nicely, so I don't have to scroll through a ton of raw findings. It says: here's the impact, a failure rate increase. Then it gives that same root cause text I showed you before, the cache serialization issue, which we already saw. And then it shows the key observations below it, the key findings that prove this root cause. There could be multiple root causes; in this case it's just one. It says, okay, yep, there's that evidence about the failure rate of cache hits versus cache misses. It talks about the code that was changed, and it talks about the CloudFormation stack update that was part of the code change that actually made it into production. Great, so there's the root cause.
Okay, the root cause is great, but we need to fix the problem. So here's where we can generate a mitigation plan. We ask it, okay, what should we do? In this case it's pretty simple, because it's a bad deployment, and what do you do when there's a bad deployment? You roll it back. So it's just a rollback. But I actually hadn't used GitHub Actions before this demo, so this is teaching me for the first time how to do a rollback.
Now it breaks things down into phases. Whenever you're going to do anything in production, you want to do things very carefully. You want to prepare, make sure you have everything in order first, and pre-validate, make sure the system hasn't changed since you formed this first idea of what to do. Did somebody else on my team roll back already? Did we learn something new in the last five minutes? So you do all that careful preparation, then apply. This is where I didn't know how to do GitHub Actions rollbacks before; this taught me, and it showed me what to expect. Then it gives some commands you can run to make sure the change is rolling back and that you're seeing the recovery you expect. And of course, if that goes wrong, you want to be prepared ahead of time to undo the thing you just did, in this case rolling back the rollback: put the potentially bad code back in place and figure out what to do from there, which will be a real exciting journey.
Now, I would look at this later, but this is the long-term solution. I don't want to be distracted by it right now, I need to kick off the rollback, but it prepares a spec of what I can do, to hand to, say, a coding agent or to work through myself. It says, here's how you should actually fix the code: make sure you have good tests, and update the deserializer to match the serializer. Okay, but let's focus on the rollback.
So let's go over to GitHub Actions, where I can do that, just following the runbook. This run here is actually the bad deployment; adding the timestamp was the problem. The thing we want to roll back to is the last known good deployment, so I click on that and rerun all jobs. There's nothing special in the agent that knows about GitHub Actions; it just knows how to do this because it's an LLM, frankly, and so it teaches me how to use GitHub Actions to do the rollback. This is just one example of a thing it would figure out.
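In the demo this is done by clicking "Re-run all jobs" in the GitHub UI, but for reference, GitHub's REST API exposes the same operation. A minimal sketch; the owner, repo name, run ID, and token are placeholders:

```python
import urllib.request

def rerun_url(owner, repo, run_id):
    # GitHub REST API endpoint that re-runs all jobs of a workflow run.
    # Re-running the last known good run effectively redeploys that commit.
    return f"https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id}/rerun"

def build_rerun_request(owner, repo, run_id, token):
    # POST with an auth token; calling urlopen() on the result would send it.
    return urllib.request.Request(
        rerun_url(owner, repo, run_id),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
```

For example, `build_rerun_request("acme", "robots-as-a-service", 42, token)` targets run 42, the hypothetical last known good deployment.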
So to confirm, I can go to whatever observability tool I'm using, in this case Dynatrace, hit refresh, and then we are back to healthy. There it is. That's going through an investigation. It shows that the Davis problems have been cleared, so that's great. Now let's do that follow-up and fix the bug so that we don't run into this problem again. So I copy the spec, which is at the bottom of that mitigation actions page, and then I like to use Kiro for its spec-driven development. It does production changes pretty well while keeping the fun vibe of coding. And so I paste that in.
Kiro is going to generate its own specification: okay, here are the full requirements, I'm going to go spend some time understanding your code. Maybe it's code I didn't author with Kiro; it'll just learn about your codebase, come up with a plan, come up with detailed requirements that say how it should make the change, and come up with a design that I can review. I'm not showing you all of this. Then it comes up with a task list of things that it's going to do one at a time, and I can say, okay, implement it. I actually used Kiro to implement a lot of the things here, so it works really well for this.
And then it also generates really nice tests, because you have that specification that says the requirements of things you want to make sure you have. Kiro generates property-based tests, not just no-op unit tests. It does very thorough tests because it understands the behavior that you want through the specification. That's the end-to-end fix on a typical incident for me: you react to it, you mitigate it, and then you solve the root cause.
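The bug in this incident was a serializer/deserializer mismatch, which is exactly the class of defect a round-trip property test catches. Here's a minimal sketch using the stdlib `random` module rather than a property-testing framework; the serialize and deserialize functions are stand-ins for the real ones.

```python
import json
import random
import string

def serialize(record):
    # Stand-in for the service's serializer.
    return json.dumps(record)

def deserialize(blob):
    # Must stay in lockstep with serialize(); the incident's bug was
    # an update to one side without the other.
    return json.loads(blob)

def test_round_trip(trials=200, seed=0):
    """Property: deserialize(serialize(x)) == x for many random records."""
    rng = random.Random(seed)
    for _ in range(trials):
        key = "".join(rng.choices(string.ascii_letters, k=8))
        record = {key: rng.randint(0, 10**6)}
        assert deserialize(serialize(record)) == record
```

A framework like Hypothesis would generate richer inputs and shrink failures, but even this stdlib version would have flagged the mismatched timestamp change.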
The Technical Foundation: How LLMs and MCP Enable DevOps Automation
So the flow of this: why didn't we build this before? Why didn't we just build a DevOps agent a decade ago? It's because DevOps is messy. It's a mix of so many different tools and systems that were never really designed to interact with each other. Sure, they all have APIs. When APIs and web services became a thing, you could make anything talk to anything by coding the integration. But there are just so many combinations of tools that you might use together end to end for deployment, observability, logs, metrics, traces, who knows.
And so the key unlock for why we can do this now is LLMs and MCP and that kind of thing. You saw the bad code change in this case flow through GitHub. It's CloudFormation, an EC2 service that uses Dynatrace, uses CloudWatch, Slack to talk to me, ServiceNow to page me, all these things that need to play together. And I'm sure you can map this to your own stack: you might use some of these tools, or a different set of tools for a different type of thing. That flexibility is key.
Okay, so how did it actually do this? Under the hood, what's going on? Well, the objective of incident management is to find a hidden gem, the root cause. That's it. You've got to find the thing, something that you can do something about. So the agent gets plopped into a desert and told, hey, go find that hidden gem, with really no instruction. It doesn't need any instruction at the beginning. So it starts by building that topology. It needs to understand what you have in your system, what the universe is that it might need to look at. It doesn't even know where in the desert to look yet. So, okay, let's understand your system.
It learns how to find metrics and logs about each of these things, and how to find out how you're deploying. So it surveys the landscape. Then it needs to understand the links between things, and this topology is the connectivity between different resources and systems. It uses everything available to build this topology map. Are you using traces? Then it can follow traces through the system. Do you have IAM permissions on this Lambda function to talk to that DynamoDB table? Okay, draw a line there. They don't necessarily talk to each other, but it's worth looking at during an investigation.
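As a toy illustration of the "draw a line where IAM allows a call" idea, here is a sketch that derives candidate edges from IAM-style policy statements. The statement shape mirrors an IAM policy document, but the function and its inputs are illustrative, not the agent's actual topology builder.

```python
def edges_from_iam(role_policies):
    """role_policies maps a source resource (e.g. a Lambda function) to
    the policy statements attached to its role. Any Allow statement
    yields a 'might talk to' edge worth following in an investigation."""
    edges = set()
    for source, statements in role_policies.items():
        for stmt in statements:
            if stmt.get("Effect") != "Allow":
                continue
            for target in stmt.get("Resource", []):
                edges.add((source, target))
    return edges
```

In the real system, edges from traces, deployment metadata, and metrics would be merged into the same map; IAM is just one signal.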
And then it learns how to drill down: do I know how to zoom in on this service, break it down, and look at all the host metrics, all the little things? So that's ultimately what it's doing behind the scenes, as pictured here.
Let's recap what we saw. We saw an investigation happen. It interfaced with a ton of different tools. It provided all of its evidence, maybe more than you needed, and it gave safe mitigation steps really thoughtfully, because you're about to mitigate something by making a production change on the fly, so you need to be ready for that, review it, and break it down.
Demo Two: Customization with Bring Your Own MCP Server
Okay, demo two: MCP servers, bring your own MCP server, and customization. Like I said, you probably use some combination of these types of tools, or totally different tools. Let's do another investigation where, instead of sending our traces to Dynatrace and logs and metrics to CloudWatch, let's say you have your own log store, maybe in S3 or something, and you send your traces to CloudWatch. Let's just mix it up and use different observability tools. Well, the agent doesn't know about your bespoke logging system, so how would it handle that?
Well, the answer is to just stand up an MCP server and configure the agent to use it. I did this in like 20 minutes. I actually implemented this, and you'll see it. I'm using Amazon Bedrock AgentCore Gateway. I just set up a gateway that talks to a Lambda function and queries my S3 logs. I used Kiro specs to generate it, and it was done, let's say 40 minutes end to end.
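The Lambda behind the gateway can be as simple as a grep over an S3 prefix. Here's a hedged sketch with the S3 client injected for testability; the bucket layout, event shape, and tool contract are assumptions, not the speaker's actual implementation.

```python
def make_log_search_handler(s3, bucket):
    """Build a Lambda-style handler that scans a time-partitioned log
    prefix in S3 and returns lines containing a pattern."""
    def handler(event, context=None):
        prefix = event["prefix"]      # e.g. "logs/2025/12/01/" (assumed layout)
        pattern = event["pattern"]    # e.g. "ERROR"
        matches = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
                for line in body.read().decode("utf-8").splitlines():
                    if pattern in line:
                        matches.append(line)
        return {"matches": matches}
    return handler
```

In a real deployment you'd pass `boto3.client("s3")`; AgentCore Gateway then exposes the Lambda as an MCP tool the agent can call.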
Okay, great. So here's another failure. Let's cause another problem. So here I am, same bot-forge service setup. Let me just log into one of these instances. I'm going to SSH into one of them. Maybe I'm troubleshooting some latency issue. I don't know. Let's say I'm seeing something weird with the cache. Let's just go into the cache and clear the files. Like I think that might fix the problem. Just remove the cache. Yeah, maybe that's a good idea. I don't know. And again, maybe not a good idea. It depends on how it works.
So it wasn't a good idea, and so here we have the investigation that triggered off of a CloudWatch alarm. So in this case it was triggering off a CloudWatch alarm on a load balancer. So now the tools are different. The flow is the same, but the tools are different. It's looking at X-Ray. It's looking at CloudWatch metrics. It's using the topology again, that's similar, and it says, okay, yeah, we have a load balancer showing errors. Let's dig in from here and figure out where to do that investigation. Goes in, looks at stuff, looks at stuff.
Okay, now it quickly found that the SQLite cache table is missing, causing HTTP 500 errors. How did it figure that out? Well, it saw it in the S3 application logs. This is that MCP server that didn't exist until 40 minutes before this, that I configured, and the agent just knew how to talk to it to query logs and use it contextually in the investigation for this service. It knows how to get to its logs, so that's great. And it figured it out. More investigation, and it's coming to observations, the same idea. It just investigates and figures out what's going on. More metrics, more metrics, more metrics. All sub-agents are done, so let's synthesize the findings.
Great, it actually noticed I was busted. A manual SSM session on a bot-forge instance caused SQLite database corruption. dyanacek, who's that, started an SSM session and maybe messed with some stuff. And sure enough, it found through CloudTrail that I had logged into the host before, and it thought, oh, that's suspicious, because that's the same host that it saw in the logs with the errors about the cache. The interesting thing that I'm showing here is how similar the investigation is to the past investigation. It just happens to be using different tools. It's still successful, just using this custom bespoke MCP server that I made, and if you have your own system, you can also plug that in.
And so here it is with the root cause: a manual SSM session maybe messed up the SQLite cache. Let's see what it comes up with. The earlier mitigation plan was simple; it was just a rollback. This is a slightly more exotic scenario. What does it figure out? It says, well, this is happening on one instance, so you should just replace it, and that's absolutely what I would do in this incident. First just get the bad thing out, and then quickly follow up to make sure we understand why, and we actually do have the evidence as to why: it's because I logged into it.
Similar thing: prepare. Let's make sure we have the Auto Scaling group and all of the right ARNs figured out, and make sure these are the right things we should be messing with. Pre-validate: make sure that more instances haven't come up bad, and that the healthy instance is still there and healthy. You know, maybe the situation has changed. And then apply.
Conveniently, it gives the right approach. This is how I would terminate the instance, with an Auto Scaling command instead of an EC2 command, because this does the graceful shutdown. It's a little nuanced but still useful. Then post-validate to make sure that you have a healthy fleet, and if rollback steps are needed, they're just to try to put the fleet back together in a good way. Great, so that's about that.
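The nuance here, terminating through Auto Scaling rather than EC2 directly so the instance is drained and replaced gracefully, is a one-liner with boto3. A sketch with the client injected so it can be exercised without AWS; `terminate_instance_in_auto_scaling_group` is the real Auto Scaling API, but the wrapper function is mine.

```python
def replace_unhealthy_instance(autoscaling, instance_id):
    """Terminate via Auto Scaling so the instance gets a graceful
    shutdown (lifecycle hooks, connection draining) and the group
    launches a replacement. ShouldDecrementDesiredCapacity=False
    keeps the fleet at full size while the swap happens."""
    return autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=False,
    )
```

In practice you'd call this with `boto3.client("autoscaling")`; terminating the same instance through the EC2 API would skip the group's lifecycle handling.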
So what did we see? We saw how you can bring in an MCP server that handles some part of your ops, whether it's how you do deployments if you have your own deployment system, a knowledge base that you want to wire in, your own observability tool, or maybe a tool of your own for measuring customer impact. Whatever it is, just bring it in as an MCP server using AgentCore Gateway. I find it really easy to do that securely: this agent is the only thing that has permission to talk to that MCP server, so it's not just open. And this particular root cause was interesting; it was a great failure to demonstrate.
Interactive Features: Chat, Steering, and Runbooks for Investigation Control
Now let's see how you can interact with investigations and with the DevOps Agent, beyond passively watching what it's up to. You can ask questions about investigations. Let's say I had some investigation and I ask, what is the start time? It's a simple question, but maybe I didn't want to read through and scroll, so I can ask for the start time and it just answers questions about what this chat is about and what the investigation already found. So it's not going to do big research projects; it's just going to answer questions based on what was already revealed in the investigation.
For example, I could ask what alarms triggered during the event. It says, well, this is one alarm, and, by the way, the investigation didn't look super thoroughly, so if you want, you would have to ask me to go look for other alarms. Okay, great. So I might say, okay, let's do a bit more of a research project: can you summarize the customer impact? Maybe I want a nice report that shows a little more about what the impact was. You can see it sends that request to the agent, and the agent replies with this report; maybe it had to do a little more research and compile things. And it's a nice report. It lists the services, each of their failure rates during the incident, and a bunch more details like the duration, the timeline, and the customer-facing impact. It describes things in terms of what it understands my functionality to be.
Okay, and now I want to ask a deeper question: this is the bot service with robot servers, so how many robots were impacted during this? Maybe later on I could figure out which robots. So it's going to have to do more queries. It didn't query that during the investigation, so it takes a little time to go and run a bunch of queries. You can see it doing more DQL queries to Dynatrace, because spans are the part that are really rich with this information: labels like which robot this is working on, the status code, and so on. So it queries the spans and logs, again just figuring out what it needs to pull to answer my question of which robots were impacted, and then it answers: we found 91 unique robots that were impacted, and here's how. They saw 500s on certain operations, broken down by API, because all of this was in the span data. And then it gives a nice overall summary.
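The robot-impact answer comes from aggregating span attributes. Here's a toy sketch of that aggregation; the attribute names (`robot_id`, `status_code`, `operation`) are my assumptions about the span schema, not the demo service's actual fields.

```python
def impacted_robots(spans):
    """Return {robot_id: set(operations)} for robots that saw 5xx
    responses, i.e. the shape of the 'which robots were impacted' answer."""
    impacted = {}
    for span in spans:
        if span.get("status_code", 0) >= 500 and span.get("robot_id"):
            impacted.setdefault(span["robot_id"], set()).add(span.get("operation"))
    return impacted
```

The "91 unique robots" figure in the demo would simply be `len(impacted_robots(spans))` over the incident window.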
This isn't something it was specially coded to do. I just thought it would be a nice thing to show off. Well, what should I ask it? Let's ask it customer impact because it's something that I often do during an incident. I want to see kind of what my customers saw when there was a problem. It's really an important question in the moment and then later.
Okay, you can also, when an investigation is going on, you can nudge it. You can steer it to say, hey, I think you should actually go look at this, because you might not just be sitting there. If you do open your laptop before the investigation is done, you might want to go look at things and be active in this. So when you're looking at things, you might share things you found, or if you see it get off track, you might want to put it back on track.
Here's a real example of it getting off track. While preparing this demo for all of you, I was doing an investigation, but I was breaking things like every 30 minutes, letting the system recover for a short time, and then breaking something again. And so it was confused. It said, oh, there are all these different things happening at the same time. So I asked, can you show me the incident timeline? And it said, okay, at 1800 and then 1900 this thing happened, then at 2100 this thing happened, then at 2200 this other thing; it was finding all these different things that I had broken, and I didn't even remember what I had prompted. The next day another thing happened. I'm like, okay, agent, can you relax?
So I told it to look at the most recent impact only. I said the deployments are way too spread out; can you ask the investigation to focus on the most recent impact to see what might have triggered it? I think for this investigation I didn't initially give it a specific alarm timestamp, so that's why it got a little creative. And the chat said, great, I'm going to actually improve my prompts to the agent too: let's pin down this timestamp. I didn't paste that timestamp; the chat figured out the start point and passed it to the agent to have it look at the most recent wave. So it's a pretty nice way of steering investigations through chat.
I can also just learn about my environment. I don't have to ask about a specific operational issue. We haven't really tuned it for questions like explaining an alarm or explaining why there's a problem right now, but you can ask it really whatever you feel like. For example, sometimes I might start an investigation; you can tell it what alarm to look at here, but I can say, don't investigate anything, I just want to ask you some questions, and start an investigation about TBD. So the agent responds with, okay, I will just help you; let me know what you want to do.
So I might ask, can you investigate to see what EC2 instances I had a few minutes ago? This is a real thing I was doing while preparing for this. I just thought, what would be a neat thing to ask and show all of you? I didn't realize how neat it was going to be. It said, okay, I'm going to describe instances, look at the topology, and figure it out. It spent a little time and said, here's a summary: you have 54 terminated instances in the last 30 minutes. And I didn't know this was going on. It said, you have instance churn happening in your Auto Scaling group; instances are coming up and getting shut down right away. I was like, oh, oops. So it didn't just literally answer my question. It did, but then it also went for extra credit and said, well, this is funny, maybe you want to know why. It has that nice helpful curiosity. Oh yeah, okay, this is a problem, I need to go fix that.
Customers who have been using this so far have said similar things: whether they're glancing at that noisy topology map, looking at anything during an investigation, or asking these questions, we keep hearing that people are always learning a little something they didn't expect to learn, something interesting and relevant in some other way. So it's interesting that it does that little extra looking around corners for you. There's so much telemetry data out there, so much going on in logs, and you don't have time to go look at everything all the time, but the agent does.
Another way you can influence the agent in future investigations is with runbooks. Runbooks are things that the agent can use and reference when doing an investigation to speed things up or steer it a certain way. They're kind of like Kiro Powers, which we released in Kiro today. Kiro Powers, when you're using the Kiro IDE, are a combination of MCP servers, steering files, hooks, and such, in a nicely packaged format that helps the agent specialize in a certain tool it might not know deeply. Your agents are broadly good, but to help them specialize in something, it's nice to package all of that up. So this is the same idea: runbooks are little pieces of guidance that are loaded on demand whenever the agent wants to look at them. Let's look at a couple that I find are useful.
So I decided to write one for this that just helps the agent speed up, maybe shaving a minute off an investigation. Actually, I think I disabled the wrong runbook, and I don't think it was helping, but it doesn't matter; it's for demonstration purposes. I gave it an overview and said, okay, here's where I'm storing all of my stuff: I'm keeping logs behind an MCP server, and I'm doing code deployments through GitHub Actions in this case, just to give it a bit of a head start on figuring things out. It'll wander around and figure out what I have anyway, but I feel like this sometimes helps ground it. So that's one kind of runbook you could write, just to give it a nudge in the right direction.
I also gave it a bit of a runbook for that bespoke MCP server for logs that I've been talking about. The agent was using the MCP server okay, but it kept querying for the current minute of logs; if it's 2 o'clock, it was looking for logs from 2 o'clock. But I only push my logs with a 2 minute delay, so I told it to only look for logs from 2 minutes ago or earlier. All that does is save a bit of back and forth. It figures it out eventually, but I just gave it a little more information about how to use the tool. Often that information lives in the MCP tool documentation, but it doesn't matter where you put it.
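That runbook tip boils down to clamping the query window by the ingestion delay. A minimal sketch; the 2 minute figure comes from the talk, everything else here is illustrative.

```python
from datetime import datetime, timedelta, timezone

INGESTION_DELAY = timedelta(minutes=2)  # logs reach S3 about 2 minutes late

def clamp_query_window(start, end, now=None):
    """Don't query minutes whose logs haven't landed yet: cap the end
    of the window at now minus the ingestion delay."""
    now = now or datetime.now(timezone.utc)
    latest_complete = now - INGESTION_DELAY
    return start, min(end, latest_complete)
```

Encoding this in the MCP server itself (or its tool documentation) saves the agent a fruitless query and a retry on every investigation.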
I also mentioned that a good logging service like CloudWatch Logs will package up stack traces and everything into the same log event. In my case, my MCP server lets you grep for errors, but it didn't return any data after the line that matched. So I said, hey, don't forget to ask for extra lines after the match so that you can actually see the stack trace. I just gave it some tips. The details aren't super important; this is just how you can give the agent a hint when you see in the details why it didn't get to something fast enough. Oh, let me help it next time.
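The "extra lines after the match" tip is essentially the behavior of `grep -A`. Here's a sketch of how a log-search tool could return trailing context so multi-line stack traces survive; the function is illustrative, not the speaker's MCP server code.

```python
def grep_with_context(lines, pattern, after=5):
    """Return each matching line plus up to `after` following lines,
    so a stack trace printed under an error line isn't cut off."""
    out = []
    keep_until = -1
    for i, line in enumerate(lines):
        if pattern in line:
            keep_until = i + after
        if pattern in line or i <= keep_until:
            out.append(line)
    return out
```

With a structured log store like CloudWatch Logs this is unnecessary, since the trace is part of the same event; for plain line-oriented logs in S3, the tool has to stitch the context back together.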
Prevention Mode and Getting Started: Proactive Recommendations and Public Preview Access
So great, that's how you can use runbooks to influence future investigations, and you can learn about your environment by chatting with the agent. Okay, let's now talk about preventing future incidents, which sounds impossible, but it's not. One thing the DevOps Agent does is prevention. If you go to the top of the DevOps Agent app, there's a prevention tab, and what this does, in the background with no real hurry, is look at your past incidents for patterns of what seems to keep going wrong, so it can recommend changes that could prevent that class of issue from happening again.
In this case, I've been injecting a bunch of failures into this environment, and it looked at the last 19 incidents and gave some recommendations. The recommendations are categorized: governance things, infrastructure recommendations, observability. Notice it has no observability recommendations. I used to work on CloudWatch, so I'm pretty good at that; I guess I passed the test for observability. So let's look at some examples of what it did find. A bunch of stuff, including some health check things I should do. Let's explore a few of these.
One is: configure the minimum instances in service. It found an Auto Scaling setting that I had screwed up at one point in time. It said, hey, there was this change you made where you configured your Auto Scaling group in a dangerous way. It didn't immediately cause a problem, but you made it so that in certain cases it could replace all of your instances at the same time. I don't want that, so this recommendation might prevent an outage. It says you should change to the safer configuration of your Auto Scaling group so that an update keeps instances in service the whole time. Okay, that's great. I could avoid an outage with that; I'll take it. And it gives some details on what I should do, how to make that change, and where to look; it knows about the CDK stack and where I should change it. I can copy this and give it to Kiro or some other coding agent.
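The unsafe setting it flagged can be checked mechanically. Here's a toy validator for a rolling-update configuration; the parameter names echo CloudFormation's `MinInstancesInService` setting, but the function itself is illustrative, not part of the agent.

```python
def check_rolling_update(desired_capacity, min_instances_in_service):
    """Flag rolling-update settings that let an update replace every
    instance at once, or that can never make progress."""
    if min_instances_in_service <= 0:
        return "unsafe: an update may take all instances out of service"
    if min_instances_in_service >= desired_capacity:
        return "invalid: the update could never replace any instance"
    return "ok"
```

This is the kind of latent misconfiguration that never pages you until the day a deployment and the setting line up badly, which is why the prevention pass is valuable.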
All right, here's another type of thing it found. I didn't show you this failure mode, but I had kept messing with my DynamoDB table permissions, and it was finding this just fine. It said, okay, somebody keeps breaking your DynamoDB table policy, ten times now, denying permissions and messing up everything; that was me, just testing this over and over again. So the recommendation it gave, along with all the affected incidents, was basically, hey, lock your permissions down. Don't let this dyanacek person keep breaking the permissions on your DynamoDB table.
Then another one said: implement post-deployment health validation with automatic rollback to prevent defective deployments from causing service failures. It noticed that I had broken the ping health check; I just had a syntax error in my ping health check logic, and that was making my instances launch and get terminated, launch and get terminated. It said, well, you should fix that, and you should also add checks so that if you break it in the future, it will get rolled back automatically. So it's not only saying fix this problem, but also prevent this class of problem by having automatic rollback. Great idea. Remember when I said my instances were churning, getting replaced all the time? This was actually the recommended long-term fix for that. So it's very nice; it all links up.
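The recommended pattern, validating health after a deployment and rolling back automatically if it never passes, can be sketched as a small polling loop. The callables are placeholders for your real health check and rollback commands, not the agent's generated code.

```python
import time

def validate_or_roll_back(health_check, roll_back, attempts=10, interval=30):
    """Poll the health check after a deployment; if it never passes
    within the allotted attempts, trigger an automatic rollback."""
    for _ in range(attempts):
        if health_check():
            return "healthy"
        time.sleep(interval)
    roll_back()
    return "rolled_back"
```

Hooked into a launch script or pipeline stage, this turns a bad deployment from an incident into a non-event.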
Great. So let's look at, oh yeah, implement auto rollback. And it knows about my code. It says, here is what you should change your launch script to do, and here's how to make it do the auto rollback. Pretty convenient. The last thing was that it found a whole class of problems: I kept pushing those cache-breaking bugs.
It said, somebody keeps pushing bad code changes, and they keep making it to production. Can you write some tests and have responsible CI/CD going on here? It's a good idea. Sure, it doesn't make for as good of a demo, but that's the thing. So it says, let's put some blockers in your deployment pipelines. Let's add some tests to make sure you're testing your gamma environment, your pre-prod, before prod, and if you do break things in prod, make it auto rollback. It just gave me some nice recommendations to improve my operational posture.
So the prevention part finds improvements to my operational posture based on what it notices about the things that seem to keep breaking. It looks at all the past incidents and investigations and searches for patterns of things you should improve. Great. So Bill, do you want to wrap us up and talk about how to get everybody started?
Yeah, thank you, David, for that, and thank you all for being here. As I mentioned earlier, we launched in public preview yesterday. What that means is it's available as a service in the AWS console. You can go there, get it set up, search on Google and you'll find the docs, and there are some links I've provided for you. Try it out. Public preview means it's free to use. You might have some charges if the actions the agent takes, like querying your logs and traces, generate expense, but we won't charge you for the agent itself.
There are some limits for each set of agents you deploy per account: you get 20 hours of investigation time per month and 15 hours of prevention time per month, so plenty to give it a try. It's super easy to set up, and I hope you saw that it has a lot of power already. I'll anticipate the first question: are we thinking about giving it the ability to actually take actions in your environment? The answer is yes, that's something we're working on, but we want to step into that carefully. You should be able to see that there's a lot of promise there, so give it a try.
You can get it set up and running in a couple of AWS accounts in literally just a couple of clicks. Start asking it questions so you can see what it can do. Throw alarms at it, throw questions at it. We'd love to get you trying it out and giving us your feedback. So again, thank you for your time. We have a few minutes for questions, and we can take some on the side after.
; This article is entirely auto-generated using Amazon Bedrock.