Kazuya

AWS re:Invent 2025 - AI-powered resilience testing and disaster recovery (COP420)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - AI-powered resilience testing and disaster recovery (COP420)

In this video, Nereida Woo and Hans Nesbitt demonstrate how combining generative AI with AWS Fault Injection Service accelerates resilience testing from weeks to days. They showcase a multi-agent system using the AWS Agents framework and Amazon Bedrock that automatically discovers infrastructure components via Systems Manager Inventory, generates chaos engineering hypotheses, and creates Systems Manager automation documents for testing failure scenarios. The presentation covers the AWS Resilience Lifecycle Framework, focusing on the evaluate/test and learn/respond phases. Key demonstrations include an inventory analysis agent that identifies installed services on EC2 instances and a document generator agent that produces FIS experiment templates with proper safety guardrails, preconditions, and state restoration logic. They explain how AI can transform root cause analysis documents into automated resilience tests, achieving up to a 90% reduction in experiment preparation time while maintaining controlled, purposeful chaos engineering practices.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The 3 A.M. Crisis: Why AI-Powered Resiliency Testing Matters

Let's start with a scenario. It's 2:37 a.m. You get woken up by your phone and you hear your CEO's voice in a panic, telling you that the company's application has been down for about three hours now, costing you millions in revenue. As you frantically try to restore these services, something is in the back of your mind: could this scenario have been prevented? Could you have used AI to identify and test these failure points before they even began, before this 3 a.m. crisis? I'm Nereida Woo and this is Hans Nesbitt, and today in the next 60 minutes we are going to show you how combining generative AI and AWS Fault Injection Service will save you time and effort in testing today. Not just by helping you discover unknown risks, but also by turning past incidents into automated tests to prevent history from repeating itself.

In the past year, 57% of organizations experienced 10 or more hours of critical cloud downtime. While systems fail, something far more valuable begins to crumble. Customers don't just get frustrated, they leave. Leadership doesn't just get concerned, they lose confidence. Investors don't just ask questions, they lose trust. Revenue doesn't just dip, it disappears. Here's the reality: everyone talks about resiliency until it's that 2 a.m. call when everything breaks. The question isn't whether you can afford to implement or invest in resiliency, it's whether your business can survive without it.

Thumbnail 70

Everything fails all the time. We need to build systems that embrace failure as a natural occurrence. When Dr. Werner Vogels, VP and CTO of Amazon.com, said those words, he was defining a fundamental truth of modern-day cloud. We've heard this saying all the time, but this wisdom drives our shared responsibility model. AWS ensures the reliability of the cloud while you architect the reliability in the cloud. It's a partnership built on reality, not wishful thinking. The most resilient organizations and customers aren't the ones who believe they can prevent every failure. They are the ones who embrace this quote by Werner Vogels and weave resiliency into every layer of their architecture, transforming this shared responsibility into their competitive advantage.

Thumbnail 130

So let's talk about how our journey is going to unfold today. First, we're going to talk about how generative AI can be used as your early detection system, discovering failure scenarios that you haven't even imagined yet. Think of this as architecting an AI that's constantly asking "what if," that little voice in the back of your head saying "what if you could do this, what if you could do that?" Next, we'll go behind the curtain and see exactly how this works. No magic, no buzzwords, just practical, implementable technology that transforms how we're approaching resiliency testing.

Then we're going to tackle how equally crucial and challenging it is to use AI to validate failures that are already known. Because let's be honest, how many times, when we test or when something happens, do we ask: could this have been prevented? And then finally, this is where it gets a little more exciting and is the thing we've actually been waiting for: we'll show you how to put this into action, not next quarter, not next year, but by tomorrow. Bringing it back to your teams, you'll be able to implement this all together through real-world scenarios and practical code that will be provided. We'll give you everything you need to get started. By the time you're done with this, you'll have a new approach to resiliency testing and to the power of AI in your resiliency testing altogether, turning months of work into weeks.

Integrating Generative AI into the Resilience Lifecycle Framework

So Hans, thank you. Alright, here is the resilience lifecycle framework. Has anyone seen this before?

Thumbnail 310

This framework was released in October of 2023, so it has been around for a re:Invent or two, and there is a QR code at the bottom if you want to check it out as well. This is our North Star when we think about building resilient applications. Today we are going to spend some time talking about where GenAI or agentic capabilities come into play in this lifecycle.

You may notice we are not starting at the top, with set objectives. Set objectives is typically when we talk about recovery time objective and recovery point objective, RTO, RPO, and our SLAs. These are typically business decisions that are made, and then those requirements are given to teams to execute on. So those are business decisions for set objectives. However, then we move to design and implement, where we take the RPO, RTO, and SLAs that we must provide to our end users, and this is probably the most common place where I have seen agentic capabilities so far.

We saw announcements this morning, or more talk, of Kiro and the Kiro CLI. How can we use these agentic capabilities to, one, create our infrastructure as code, right? Make sure it is across your availability zones, make sure your database is configured in a highly available fashion. But also when it comes to building our applications, when our application code is being written, can we make sure that we are adding backoffs, retries, and jitter, and having proper instrumentation for observability? We want to be able to answer the question: where is the challenge? What instance is it coming from? What service is it supporting? What availability zone is it living in? Where is the pain? Let me find out very quickly.
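To make that backoff-and-jitter idea concrete, here is a minimal, hedged sketch of the kind of retry logic being described; the function and parameter names are illustrative, not from the session:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Exponential backoff capped at max_delay, with full jitter
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Example: wrap a database query that may fail transiently
# rows = call_with_backoff(lambda: db.query("SELECT 1"))  # 'db' is a placeholder client
```

Full jitter spreads retries out so a struggling dependency isn't hammered by synchronized clients.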

But once we have designed our system and implemented it, we can then begin to evaluate and test it. Here is what we will be spending a good amount of our time today on. Once I have that infrastructure, how can I use agentic capabilities to help me discover what fault modes it may have? And how can I build tests to help me validate my application is resilient to certain scenarios?

From there we operate our application. We saw the announcement of the DevOps agent as well. I think of the DevOps agent when I think of operate, or there is CloudWatch Investigator as well to help us determine what is going on in the environment and shorten our mean time to recovery to bring an application back into service. As Nereida was saying, something like 50 percent of organizations had incidents, and commonly after incidents we produce root cause analyses. Your teams probably do something similar. What was the challenge encountered? What was the timeline associated with that? What events came together to make this impairment happen? And then, how can we use that data to recreate a scenario and validate that all the checks and balances we put into place after an impairment did what we want them to do?

Thumbnail 480

Demo: Inventory Analysis Agent Discovers Failure Modes in a Three-Tier Application

We do not want to just add retries because that was what caused the error; we want to validate that the change actually made the impairment not happen again, so that we are resilient to that impairment in the future. Before we jump into our demo, let us talk about what application we will be reviewing here. We have a three-tier application. Has everyone dealt with a three-tier application before in some form or fashion, whether it was on-premises and you migrated to the cloud, or you have just built a monolith? We have an application load balancer. In our case, we have a Windows EC2 instance or a set of EC2 instances backed by a relational database, in this one a MySQL database.

Thumbnail 520

When we start to do our discovery, we have two agents that are built and backed by Bedrock in this case. We have our inventory analysis agent. Our inventory analysis agent will go help us discover what is on the box and help us determine what failure modes or scenarios are associated with what it finds. From there we have our document generator. The document generator has an integration between Fault Injection Service and Systems Manager documents or automation. For the actions that are not native to Fault Injection Service, we can write automation documents to facilitate these. For example, impairing IIS is not a default action. However, we can write a document to go do that for us. This agent will help us create those impairments.

One thing I will also call out here: you may be thinking, Hans, is not there a service called Resilience Hub, and what is the difference between what I am saying right now and what Resilience Hub provides? Resilience Hub is focused on your infrastructure, so it can determine you have auto scaling groups associated with your instance, you have multi-AZs and all that stuff from an infrastructure perspective. But it does not have insight into what is running on your EC2 instance. So pairing this capability with Resilience Hub gives us that powerful next step. How can we take this to the next level?

Thumbnail 600

In our demo today we are going to show the agents running. It is going to review what is installed on the EC2 instance. It is going to help create a hypothesis.

Thumbnail 630

That hypothesis will help us identify what impairments we could facilitate on that instance based on what it finds. Then we're going to see a document created based on one of those impairments. Here we have our agent, and we're going to start running it now. It's going to start identifying what's installed in the environment.

Thumbnail 640

Thumbnail 650

Thumbnail 660

Thumbnail 670

It's using AWS tooling to figure out what the network settings are. We start to see there's one instance now that's online. It's looking at what's running on the server. We see that it has an IIS application installed. It's understanding what the database connectivity looks like based on what it's finding. We'll review the prompts for these in a little bit and how we steered them, but then it's able to start creating chaos engineering and operational hypotheses based on what it found. You can see at the bottom we're excluding some things as well. We don't want to test Microsoft Edge because that's not part of our application. We want to make sure we're focused on application-specific things.

Thumbnail 680

Thumbnail 690

Thumbnail 700

Thumbnail 710

Thumbnail 720

Thumbnail 730

Now you'll see the data that was discovered is being passed into our document writer agent, and it's going to start creating a document to facilitate an impairment of IIS on that box. The Systems Manager document will run on that EC2 instance. There is an agent on the EC2 instance that facilitates communication between the EC2 instance and Systems Manager, and that's how we will invoke this document on that box. By utilizing that combination, we can expand the capabilities of what's native to Fault Injection Service to be very application-specific when we're testing things in our environment and not just utilizing the built-in FIS actions, which are very advantageous as well.

Thumbnail 740

Building the Infrastructure Detective: AWS Agents Framework and Context-Aware Discovery

I know that was a lot of scrolling, so let's go back to our presentation, take a look at the output, and talk about some of what it found and how it deduced these things. You may have also asked yourself how we did all that and how many lines of code that must have been. Here you can see just a couple of lines of code. We're using the AWS Agents framework to facilitate the building and running of the agent. AWS Agents is a simple yet powerful SDK that takes a model-driven approach when building and running agents.

You can see we specify a couple of things here: the model, which model we want to use; the tools we want to give it, arming it with the actions to go discover things in the environment; a prompt; and a callback handler so we can view what's going on from a logging perspective. One thing I'd like to call out here is the use_aws tool. This tool was built by AWS, but it allows your agent to have context of the AWS APIs and services. When you're asking it to go to Systems Manager, your inventory, and discover these things, it understands what you're telling it. Then we have another tool here that allows it to understand tags across an environment. These are two powerful things: giving it context and an understanding of the ecosystem as well. So it's not very many lines of code to build and create that agent.
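The session doesn't publish the exact snippet, but if the "AWS Agents framework" here is the open-source Strands Agents SDK (whose model-driven design and use_aws tool match this description), a minimal sketch of that construction might look like the following; the model ID, prompt text, and callback are placeholders:

```python
from strands import Agent
from strands_tools import use_aws

# Callback handler so we can watch the agent's streamed output, as in the demo
def log_events(**kwargs):
    if "data" in kwargs:
        print(kwargs["data"], end="")

INVENTORY_PROMPT = """You are an inventory analysis agent.
Look only at running, online, managed EC2 instances reported by Systems Manager.
Identify business applications (ignore patches and routine updates) and propose
chaos engineering hypotheses for what you find."""

inventory_agent = Agent(
    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",  # placeholder Bedrock model ID
    tools=[use_aws],   # AWS API/service context; a tagging tool could be added here too
    system_prompt=INVENTORY_PROMPT,
    callback_handler=log_events,
)

result = inventory_agent(
    "Inventory the instances behind our web tier and suggest failure scenarios."
)
```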

Thumbnail 820

Now let's take a step back and look at the scrolling text from the agent, but maybe a little bit slower. Here we see that it's found a web application server running with a couple of services. I like to think of this agent as the infrastructure detective. It's able to run on the box, discover what's running, and start building context of what failures can happen based on what's running. When we're thinking of large language models and agent capabilities, we want to think about how we can arm it with the context it needs to be successful.

Thumbnail 890

Another helpful thing to add at this stage is not just allowing it to inventory what's on the EC2 instance, but also providing the application code as well. By providing the application code, we can understand how it then retries and backs off, and we can have better hypotheses based on what will actually happen when it's impaired. By pairing those two things, we have much better context on how the application handles scenarios. So it discovered IIS is running. It also begins to discover that there is an ODBC driver for SQL Server installed and that there is no SQL Server installed locally. We can then begin to deduce that this has external dependencies on a database, and if we use that tagging tool as well, we can then begin to learn what database is associated with that application if we're tagging our applications or if we have them added to a Resilience Hub, because ideally I want to probably go do some impairments and play with that connectivity.

Thumbnail 940

Throughout the scrolling, there were some interesting things it started suggesting. Some of them map to actions native to Fault Injection Service, like the AZ latency action, an announcement we released a couple of weeks ago. So it begins to discover all these things and put them together with context.

With understanding how the application functions, how the application code functions, and then knowing about the environment—is it part of an auto scaling group? How are the health checks configured?—we can begin to put that picture together and understand that this instance is actively serving traffic. If these services are disrupted, the web server would stop serving its purpose. However, with the way auto scaling is configured, once it stops serving traffic, the instance will be replaced. A healthy instance will take over and begin to serve traffic again.

Having been on infrastructure teams in previous lives, supporting applications we didn't build, sometimes you're in scenarios where the people supporting the application may not have full context of what the application code does. SREs that support applications might not be the ones who built them. Having capabilities like this to build hypotheses based on what that service will actually do and how it runs is very advantageous, especially if we don't have all the context or have been unfamiliar with it for a long time. It helps speed up discovering the failure modes I should be aware of and helps me execute tests on those.

Thumbnail 1020

From Discovery to Action: Non-Native Fault Injection Scenarios

Let's talk about some of the documents it would then write based on what it found. These are the non-native actions that we're talking about here. There were native actions like AZ latency or AZ impairment and power interruption that it would suggest as well, but let's talk about the non-native actions that it did suggest.

Here we have a database SQL port blocking scenario. It would go and manipulate the security group on these database servers to block traffic from the web application servers' security groups, saying I'm not going to let any servers communicate between these two layers. How does my application handle that? We can also have the scenario where maybe we block a specific instance's IP and then determine whether our observability platform allows us to observe which instance is experiencing the blockage. We should be able to determine it's one instance within this AZ serving this service, and now I can go take action and bring it back into service faster. These are some good hypotheses we could use to observe and learn how our application handles each scenario.

Thumbnail 1090

Then again, what happens if we impair the IIS service or the application pool that's serving traffic? Will it continue to try to serve traffic? Will clients retry? How will this happen? What will I observe in my platform? Based on what the agent hypothesized and discovered, the impaired web servers will be replaced, and the service would resume normal operations. Because it has context of the auto scaling group and health checks, we could then also infer how long that would take. Would it take five minutes based on how everything is set up?

This helps you build those hypotheses much faster without having to do a full discovery of what all the failure modes are for your application and making sure all of the stakeholders who have the knowledge of that application are there as well. You can have a meaningful conversation on how the service functions, how it retries, and how we believe it will act in these scenarios.

Thumbnail 1160

Prompt Engineering: Crafting Detailed Job Descriptions for Resiliency Agents

Let's talk about some of the prompt engineering and what helps us guide our agent into understanding what we want to do. Think of these prompts as a detailed job description for what the agents are going to do. Within the scenario that was shown, this is where we're guiding that agent. For our inventory agent, we are being very clear on what we want it to focus on: the applications on your servers, which patches and updates to ignore, and keeping the focus on business applications altogether. Why do we say ignore patches and updates? It's not really relevant to what we're going to test. That is maintenance, that is routine, and it's not going to tell us anything about resiliency.

With our automation agent that we showed earlier, we're taking the same mindset with the same focus on resiliency, being very specific. We never touch protected services. We want to test resiliency without breaking it, and we want to be able to recover from it. In each prompt that we've guided our agent with, we're defining what we want to do, what the agent can do, and what its role is. Secondly, we're focusing on what it cannot do. We have to focus on what it can and cannot do so we can guide that agent properly. Then we're formatting what response we want it to look like. There's a balance between our agent being intelligent and effective enough to give us what we want within the scenarios, but also constraining it so it doesn't do everything it wants by itself. We're putting in different guard rails for the different agents.
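As an illustration of that "job description" structure (role, what it can do, what it cannot do, response format), a hedged sketch of such a system prompt might look like this; the wording is not the presenters' actual prompt:

```python
# Illustrative structure only -- the session's actual prompts aren't published verbatim.
AUTOMATION_AGENT_PROMPT = """
ROLE
You write AWS Systems Manager documents that create controlled, recoverable
fault-injection experiments for the business application you are given.

YOU CAN
- Impair application-specific services (for example, an IIS application pool)
- Touch non-critical temporary files and cache directories
- Reuse native FIS actions whenever one already exists

YOU MUST NOT
- Impair the SSM Agent, the operating system, or instance-to-SSM connectivity
- Touch protected services or anything unrelated to the target application
- Create an impairment without a documented restoration step

RESPONSE FORMAT
Return a single Systems Manager document with three phases:
1. Preconditions and validation
2. Controlled execution
3. State restoration
"""
```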

Thumbnail 1270

Understanding Your Blueprint: Systems Manager Inventory as the Foundation

Before we even get started on understanding what our agent is going to do, we have to understand what is on our actual nodes or instances. Think of it like an architect looking at building blueprints: you have to understand all the different components, the systems, the connections, and everything that encompasses them so you can understand the integrity of that building. The same concept applies here. We're using AWS Systems Manager Inventory as our blueprint reader. It's going to tell us what is on the node, what is installed, and what needs to be updated. Our first step is understanding what's on the node: what software is installed, the network configurations, what database connections that node has, the traffic it's pulling, and all the critical components that make up the application.

Thumbnail 1380

This is where it gets really interesting. By combining Systems Manager Inventory with Fault Injection Service, or FIS, we're bridging the two into a solution: instead of throwing random chaos at the system, we're experimenting under precise, controlled conditions to understand resiliency. The difference really is thinking about what you're testing for when you're testing. Being controlled and concise is our end goal. By combining these two, we're creating that solution and making things actionable afterwards.

Thumbnail 1390

Let's talk about our first agent, the inventory agent. Our primary directive to this agent is to look at running, online, managed instances. This isn't by accident. We're being very precise about what we want because we want reliable, actionable data that we can base decisions on. We don't want to do anything with offline instances. The two main key phrases you can see are running online and managed instances. We only care about instances that are online, like I mentioned, and actively reporting to Systems Manager, because that's what allows us to get reliable, up-to-date data from our nodes. Think of this as setting up the guard rails for our agent altogether. We're not just telling it what to do. We're focusing more on where to look, what to see, and what to look at. Through that prompt engineering, we're being more consistent about what we target, removing the noise and focusing only on our end goal.
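Under the hood, the agent's use_aws calls likely reduce to Systems Manager APIs along these lines; this is a hedged boto3 sketch of that filtering, not the agent's actual code:

```python
import boto3

ssm = boto3.client("ssm")

# 1. Only consider managed instances that are online and reporting to Systems Manager
online = ssm.describe_instance_information(
    Filters=[{"Key": "PingStatus", "Values": ["Online"]}]
)["InstanceInformationList"]

for node in online:
    instance_id = node["InstanceId"]
    # 2. Pull the software inventory for each online node
    apps = ssm.list_inventory_entries(
        InstanceId=instance_id,
        TypeName="AWS:Application",
    )["Entries"]
    # 3. Drop routine patches and updates so only business software remains
    business_apps = [
        app for app in apps
        if "update" not in app.get("Name", "").lower()
        and not app.get("Name", "").startswith("KB")
    ]
    print(instance_id, [app["Name"] for app in business_apps])
```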

Thumbnail 1470

Now here's where we're telling our agent what not to do. We said that you have to focus on what it can or cannot do. This is saying what we don't want it to focus on, so we're very adamant about not looking at patches and updates. Why? It's very simple. Like I was mentioning earlier, you're not going to really want to focus on what KB entries you updated and installed last week. It's irrelevant to your resiliency testing. You want to remove all that additional noise to get more precise and controlled experiments.

From a business-critical standpoint, we really care about what web servers are serving traffic, what application is processing all those different transactions, and what databases are handling the different connections with the customer data. By telling our agent to ignore the rest of that information and being very clear and precise, we're turning inventory into actionable insights. We're not just wishfully hoping that it's going to understand what we're doing.

Focusing on this approach, we're really letting our agent analyze the different systems and look at all the components that matter and that are going to impact your business without getting distracted by everything else that might be installed on the node. Now let's look into the construction side of things—what our agent actually does on the node that you're telling it to do. Think of this more as a five-step discovery process looking at everything that it's building and giving you the full picture of what is on your system.

We first catalog what software is installed on those different nodes and the application. This is that foundation where we want to get that information from. Secondly, we're examining the different configurations and the different packages installed. This is very much where we want to focus because it's setting up everything that it's working with and giving those different components the right foundation.

Then third is the more crucial component that we also want to focus on: really documenting everything about the service and the actual component of the application and their versions. We're looking at the services and their versions because sometimes you might think something is installed on your node and then you realize that version is not the right one or the runtime is different than what is actually installed on the node. You want to make sure that those two different things are aligned to what you're looking for because it might create a disruption. It might be subtle, but it could cause problems down the road when you're testing for resilience.

Then for steps four and five, we're packaging all these different inventory reports so that other agents can look at them and use them as actual data. It's always useful to be able to repackage everything. If you remember, we're only looking for online EC2 instances that are reporting. Throughout this process, we're only working with online instances, not offline ones, because they're not relevant to our testing, whether that's in production or in our test and dev environments across the lifecycle.

Thumbnail 1700

What inventory does is give you a way to understand what really matters within your resiliency testing. Starting off with business-critical applications, think about your environments and what you're hosting, whether it's Java, Python, Node.js, or .NET frameworks. Then you're looking at everything within the data layer: when we think about databases and their clients, documenting everything that is on there is also a big component of understanding what is happening. On the web layer, you want to understand whether it's Apache or Tomcat. They're always the front line for your customers, so paying attention to what is on that layer means making sure you understand it completely.

When you think of this, you don't want to forget the business, customer-facing applications. Those are the specific tools, whether built in-house or third party, where you actually take the time to look at the custom applications altogether. The key thing here is really focusing on whether something has a direct business impact, yes or no. It's not just creating a list for you to read; it's giving you a map, building out the different components within your application that you might not have known were tied to it.

Thumbnail 1800

Four Critical Questions and Protective Service Boundaries

Now, one thing to think about is the four key questions that form the backbone of this analysis.

First, what is the server's primary role? This goes beyond simply labeling that server. It's about understanding the purpose that this server has beyond different tags. Is it managing the database? Is it running application logic? This is the fundamental understanding of what the purpose of this server is going to be.

The second question is: what services are actively running and providing business function? This is really where we want to focus our time. We should be asking whether it's actively running and whether it's reporting to Systems Manager. This tells us where we should be focusing: how these failure impacts are going to affect our business. We talked earlier about how we don't just lose customer trust; we lose other things that we might not have seen as so valuable in the beginning.

When we talk about dependencies within the different services, this is where we think about failures. Failures don't happen in isolation. We think about that domino effect where one thing happens and then a bunch of other things happen that we might not have known about. This is really thinking about what dependencies are within your application while we're testing and what would happen.

Thumbnail 1930

These questions help us by providing an inventory of our instances altogether. We're bridging the inventory with resiliency intelligence to give us a more actionable plan for what might fail. This also represents the guard rails of our agent and what it can do in terms of protecting the different services that you want to either test or not test. We're creating an isolation boundary for different resources. Think of this as a way to maintain control, observe, and, if needed, stop the agent from reaching into a resource it shouldn't touch.

The key message when it comes to protective services is that chaos engineering isn't about random testing. It's not about just going in and testing everything within your environment. It's being selective on the different resources that you want to test and making sure the application can withstand it. This is really meant to restrict the agent altogether. It's really to empower you as a tester or to build that resiliency into your environment so that you do it in a safe manner.

Thumbnail 2010

This isn't an end-all, be-all. This list can be modified for your use case. This shows what those key analyses are. When thinking about what kind of questions you're asking the agent to solve for you, the first one goes back to the primary role of this server: that core understanding of what is happening within your environment and what it can or cannot do. When it comes to which services are active and serving business functions, there is a lot of similarity between the agents in how they approach this, whether it's protecting services or reporting on them. And thinking about the dependencies follows the same concept: we're moving toward understanding the application and what is happening with it.

Thumbnail 2090

This goes beyond just protective services. Think about this as a way of remembering that when we talk about our protective services, these are services that we don't want to affect. But in reality, Systems Manager itself is a way to protect those services and recover our different nodes if something were to happen.

So that remember statement is more than just a summary; it's more of a design choice that we implemented. It has a purpose for reinforcing those mission boundaries of what it can or cannot do and only focusing on what it should be doing.

Thumbnail 2150

Creating Self-Healing Automation: Systems Manager Documents and Live Demonstration

Alright, so Nereida spoke on how we can discover things with our agent. Now let's talk about how we can actually take action on those resources. We're using Systems Manager documents to facilitate that. So before we jump into those agents specifically, let's talk about what makes a good Systems Manager document. One of the most important things: if you use the EC2 CPU stress action or the EC2 memory stress action that FIS provides natively, behind the scenes these are Systems Manager documents, managed by the service team, that go onto your instance to facilitate those actions. If you look at them, they're prefixed with AWSFIS- in the console. If you go look at them, they take care of putting things back in order when they are done.

We have to restore the state. If we're stressing the CPU, we have to stop that stress when the experiment is canceled, fails, or when the duration we specify runs out. So service state restoration is an important thing that we need to take care of ourselves if we are writing documents that are outside of what FIS provides. We need to put the resources back into the shape that we found them in. One way to do this is by making our documents very modular. On the right-hand side, you'll see a CloudFront impairment that validates a multi-region CloudFront implementation. In each portion, on failure or cancel, we're rolling back and reverting our changes to make sure that our services are how we found them. We're cleaning up our house and want to make sure we're idempotent.

For example, with the CPU stress action that's native, we don't want multiple copies of it running at the same time. We want one. So how can we do that? We add preconditions. If you saw the text scrolling quickly for the automation document, there were preconditions saying this runs on Windows boxes where we're running PowerShell, making sure we're running the right things on the right instances. This is what our prompt will take into account when we're building this document.
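Putting those ideas together (a platform precondition, an idempotency marker, a bounded impairment, and guaranteed state restoration), a hedged sketch of what such a generated document could look like follows. The document name, paths, and commands are illustrative, not the agent's actual output:

```python
import json

import boto3

# Illustrative only: a Run Command document FIS can invoke with aws:ssm:send-command.
# It validates preconditions, stops IIS for a bounded duration, and restores state.
IIS_IMPAIRMENT_DOC = {
    "schemaVersion": "2.2",
    "description": "Chaos experiment: stop IIS for DurationSeconds, then restore it.",
    "parameters": {
        "DurationSeconds": {"type": "String", "default": "300"},
    },
    "mainSteps": [
        {
            "name": "ImpairIis",
            "action": "aws:runPowerShellScript",
            # Precondition: only run on Windows instances
            "precondition": {"StringEquals": ["platformType", "Windows"]},
            "inputs": {
                "runCommand": [
                    "$marker = 'C:\\chaos\\iis-experiment.lock'",
                    "if (Test-Path $marker) { throw 'Experiment already in progress' }",
                    "if ((Get-Service W3SVC).Status -ne 'Running') { throw 'IIS not healthy, aborting' }",
                    "New-Item -ItemType File -Path $marker -Force | Out-Null",
                    "try {",
                    "  Stop-Service W3SVC                          # controlled execution",
                    "  Start-Sleep -Seconds {{ DurationSeconds }}",
                    "} finally {",
                    "  Start-Service W3SVC                         # state restoration",
                    "  Remove-Item $marker -Force                  # clean up the idempotency marker",
                    "}",
                ]
            },
        }
    ],
}

ssm = boto3.client("ssm")
ssm.create_document(
    Name="ChaosImpairIIS",  # hypothetical document name
    DocumentType="Command",
    DocumentFormat="JSON",
    Content=json.dumps(IIS_IMPAIRMENT_DOC),
)
```

Registering it as a Command document lets FIS invoke it on the instance through the SSM Agent, which is the integration described above.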

Thumbnail 2260

Thumbnail 2280

So again, the same concept: we're starting with what is the agent's persona. The agent is a specialized agent focused on writing Systems Manager documents to facilitate chaos engineering or resilience tests, and it will generate Systems Manager documents to help create those stress conditions. Again, we're continuing to steer it. We want you to be focused on the business applications you're being passed. We want to be creating very specific things using these documents and automation, and we want to be focused on doing things that are not native to FIS. If the FIS native action is there, I don't want to reinvent the wheel. Use that. It's good. I don't want to handle all those things. Use native actions and only create things for what FIS can't do natively or to go into our box to create impairments.

We want to be able to do this at the OS level if needed and implement all those safety mechanisms. Specifically for this one, there was a RAG database behind the prompt that contained the blog post on best practices for writing documents, as well as all the Fault Injection Service native actions. So we gave it context of what good looks like, as well as what has already been invented—don't go build this, we already have it.

Thumbnail 2330

Before we run a resilience experiment, we want to make sure there are certain things done. For example, for our IIS instance, we may not want to run an experiment to impair IIS if IIS isn't installed on the box. Is it currently running? We want to make sure it's online, it's running, and my application is in a stable, healthy state at this time. Let's validate that. Then after that, if I'm dropping a file locally to make sure I'm being idempotent, is there enough space for me to drop that file without filling up the block drive that I'm dropping it on? And perhaps are the PowerShell modules that I need to facilitate this already installed on the box? We're making sure everything is there for me to be successful in creating the impairment.

Thumbnail 2390

So the agent will take this into account based on the best practices that were in the database. But again, we reiterate it: follow these preconditions. And then once the experiment has run successfully and gone through, we want to make sure that the house is put back in order. I have a toddler at home, so I'm used to telling him, "Hey, you're done playing. Let's pick up your toys. Let's put our toys away before we move on to the next thing." The same thing is true here for an IIS experiment. Is the service transitioning from a stopped state to a running state?

Thumbnail 2430

Is everything we did being undone? Once all of that is verified, remove the local file that you created for idempotency reasons. Let's clean up after we're done with an experiment, and the agent will take this into account. As Nereida was saying, we want to set ourselves up for success with these experiments by guiding them along the way. We don't want to impair the Systems Manager agent. I can't restore the state of the system if I can't communicate with it.

Thumbnail 2470

So for this one, leave Systems Manager agents alone if it's installed on the box. Don't impair that. Don't do something that critically impairs the operating system itself, and don't impair the network connectivity between the instance and Systems Manager. We want to make sure that we're able to communicate with it and take action as needed. So again, we're steering it toward what to do, but also what not to do along the way.

Reiterating: focus on the business application you're testing. You can impair non-critical temporary files or cache directories. You can impair application-specific services, whether these are third-party or custom, and user-level preferences. So we're guiding it along the path toward your hypothesis. But remember, all these things need to be true when you're writing your experiment and your documents, to make sure that we have the best practices implemented along the way and are setting ourselves up for success.

Thumbnail 2510

Back to you. Hans mentioned earlier how the agent created an automation document for us, and we're going to look at how that actually looks. The agent built three main critical pieces. The first is a validation step: this is where we make sure we're running tests and not focusing on systems that are already impaired, right? Going back to what we want to focus on and making our objectives pretty clear. The second is more of a controlled execution phase in the document, being precise about what it's doing within the different steps that an automation document runs for you. And then there's a restoration step. So, like Hans was saying, if something were to stop, we bring it back online so that it returns to the regular state it was in when we first started testing altogether.

Thumbnail 2580

Thumbnail 2640

The idea behind creating this automation document is defining the different states and maintaining that state through all the levels of each phase we go through in our testing. This provides more structure within our automation, so the different steps it orchestrates flow within the lifecycle of your testing. Now we're looking at the FIS template that it is creating alongside the automation document it will execute. It's passing specific values for the different targets that it is testing: the duration, around 500 seconds in this case, and the default pool it targets for our testing altogether. These are parameters that were defined in our document earlier. So when this targets those nodes, it's targeting the different nodes in the application, and then it executes within the automation document itself.
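For reference, the FIS side of that pairing, an experiment template whose action invokes the generated SSM document on tagged instances, could be sketched roughly like this with boto3. Every ARN, tag, duration, and name here is a placeholder; the real template in the demo was produced by the agent, not hand-written:

```python
import json

import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken="iis-impairment-demo-1",
    description="Impair IIS on the web tier via the generated SSM document",
    roleArn="arn:aws:iam::123456789012:role/FisExperimentRole",
    # In practice this should reference a CloudWatch alarm as a guardrail
    stopConditions=[{"source": "none"}],
    targets={
        "WebServers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Application": "web-tier"},
            "selectionMode": "COUNT(1)",  # impair one instance at a time
        }
    },
    actions={
        "impair-iis": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1:123456789012:document/ChaosImpairIIS",
                "documentParameters": json.dumps({"DurationSeconds": "300"}),
                "duration": "PT6M",  # FIS cancels the command if it outlives this
            },
            "targets": {"Instances": "WebServers"},
        }
    },
)

# Start the experiment (the console demo triggers the same API)
fis.start_experiment(
    clientToken="iis-impairment-run-1",
    experimentTemplateId=template["experimentTemplate"]["id"],
)
```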

Thumbnail 2670

This is about combining those two different services, leveraging what is native and what is not built in, and then adding the configurations that might not be supported out of the box. From logging and IAM permissions to components that aren't natively supported, it handles them through automation altogether. Now we do have a demo, so Hans, let's see it in action. Let's take a look at all the things we built previously. Here we'll go into Fault Injection Service and the resilience testing section in the console, and we'll see the experiment that it created. You can see the template ID here, and we're going to start the experiment. We're using the console now, but this can be done via the APIs or CLI to invoke the experiment.

Thumbnail 2680

Thumbnail 2690

Thumbnail 2700

Here is our web host. It's still up. It's connected to the database currently, so the service is still online. You can see here that the action is pending, so it's about to start and impair our web service. You can see here's the action and how it is connected to the SSM document, as Nereida showed as well when we exported the JSON template. That's what it looks like in the console, and the same thing here you see the experiment.

Thumbnail 2710

Thumbnail 2720

Thumbnail 2730

Thumbnail 2740

Thumbnail 2760

All the PowerShell that was written for us as part of that document generator agent implementing best practices, logging, failure, and fallback along the way, doing all that heavy lifting for us. Now if we refresh it, the instance is taken down. So we had the agent discover what was happening, and then it created the document for us as well as then impaired the system. So far we've talked a lot about what if I don't know what I want to test, right? Discover the system for me and help me figure out what are the likely failure modes. Now let's move to the learn and respond phase, that last phase of our resilience lifecycle framework. What if I know what I want to test?

Thumbnail 2780

Learning from Failure: Root Cause Analysis to Multi-Agent Systems

In this context, we had an impairment and we produced a root cause analysis of what the timeline was and what actions were taken, and we feed this root cause analysis document to our agentic or large language models. We can provide better context by giving it known-good experiments. These could be the Fault Injection Service experiments owned by the service team in Systems Manager, accompanied by the best practices blog or some samples from our open-sourced FIS template library. So what do good experiments look like, and what are the best practices associated with them? We can also give it an AWS environment to review that's production-like, so it has good context: here's the impairment, here's what good experiments look like, and here's the environment. Go help me recreate an impairment that will help me validate what happened and all the checks and balances I put in place so it doesn't happen again.
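A hedged sketch of how that context could be assembled and handed to an agent follows, again assuming a Strands-style SDK; the file paths, prompt wording, and RTO/RPO figures are placeholders:

```python
from pathlib import Path

from strands import Agent
from strands_tools import use_aws

# Placeholders: an RCA write-up and some known-good FIS experiment templates on disk
rca_text = Path("rca/2025-03-db-failover-incident.md").read_text()
good_examples = [p.read_text() for p in Path("fis-template-library").glob("*.json")]

rca_agent = Agent(
    tools=[use_aws],  # lets the agent review the production-like target environment
    system_prompt=(
        "You are an AWS resilience expert. Given a root cause analysis, known-good "
        "FIS experiment templates, and access to the target environment, recreate the "
        "impairment as a safe, reversible FIS experiment with supporting SSM documents."
    ),
)

result = rca_agent(
    "Root cause analysis:\n" + rca_text
    + "\n\nKnown-good experiment examples:\n" + "\n---\n".join(good_examples)
    + "\n\nTarget RTO: 15 minutes, RPO: 5 minutes. Propose and draft the experiment."
)
```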

Thumbnail 2820

Thumbnail 2860

Then ideally, once we have all that built, we want to test in a production-like environment, right? There are often many differences between a dev environment, a pre-prod environment, and production, so we want to test as close to production as possible and validate that we're able to observe the impairment, including in terms of disaster recovery. Because resilience is high availability plus disaster recovery: if we're doing something, did all the playbooks that we updated to help recover from that impairment work? Are they all valid? Do they all work, and are our teams able to validate that? We can take all these things into context to help us validate, recover, and learn and respond from something faster.

Thumbnail 2880

We'll talk about what that agent could possibly look like. We're steering it and giving it the direction of you are an AWS resilience expert. You will discover what is in the root cause analysis document. You will comprehend it. You will create failure scenarios with Fault Injection Service and Systems Manager based on what you find. You're going to have some inputs into your decisions. You'll be given a root cause analysis document that has timelines and details of the root cause of what's going on. From there, you will look at your target environment and the RTO and RPO of the application that you were provided to know all the requirements of how the application should function and what happened previously.

Thumbnail 2910

Thumbnail 2930

From there, consume this root cause analysis document and extract the failure modes. Make sure you map them to the dependencies and single points of failure that could be in the environment, formulate testable hypotheses based on these, and then go build them for us. Again, this saves time in building tests that validate what happened and whether our fixes actually stop it from happening again. Working with many customers, if we have no context of the application, it may take us a week or two to actually sit with the application teams or the SREs to have a good conversation about all the failure modes for this application, document those, and work out how to test and prevent them in the future.

Then we create hypotheses from those, which could take a couple of hours or days depending on the amount of effort and people's knowledge of the application. Once we have those things, then we have to go build our Systems Manager documents, which could take a couple of days to build and test if we're writing them ourselves. Then we want to spend time validating that, right? We have to test the automation: we're going to trust it, but validate it. That was another thing from the slide this morning: we trust the system to do things, but we want to validate it before we start experimenting. So we want to look at the automation document, make sure everything looks good, and then begin to test it in our environment to make sure that it's doing what we want and not just running code.

From doing our root cause analysis and our discovery of the system to having the experiment ready to run in an environment, that could take a couple of weeks possibly, depending on teams and setting up meetings and having our day jobs as well. So how does AI speed this up?

Thumbnail 3010

For us, doing all the things we talked about: we can now discover the system and inventory it to produce failure modes within a couple of minutes. That was a single instance, but it took about two minutes to fully discover what was on the box and provide some hypotheses of where I could start testing. It then designed the experiments and the hypotheses within a couple of seconds and created those documents, implementing best practices along the way.

And then I can spend my time as an engineer validating those and giving feedback to those documents about what needs to change, but I'm not spending my time doing all those things. I'm being very effective with how I'm spending my time. Go build it for me. I'll test and I'll validate it, and that allows me to get things done much quicker and spend my time efficiently as an engineer.

Thumbnail 3060

Now let's talk about the actual time savings and what it means for your team and your different organizations. Disclaimer, it's always going to be based on your use case, but it's up to 90% reduction in experiment time. We're going from weeks to days, as Hans explained earlier, so it's not just about speed, it's also about enabling resilience.

Thumbnail 3080

Thumbnail 3090

When we're talking about automatic discovery, it eliminates more than just manual steps; it improves those systems. We're removing the extra fluff of digging through wikis or our internal tools, automating that and making the lifecycle a little faster for us. The agent is generating contextually relevant scenarios that we will likely need to test, so it's giving you actionable information. It's reading all the different information it really needs, you're providing the context from the RCAs and the best practices, and it's also determining what is happening within the environment and whether you're meeting those requirements. It's also doing this with safety in mind, so it's not breaking your monitoring across your different applications. It's allowing you to do it in a safe environment.

Thumbnail 3140

And so what does this really mean for us altogether? It's an acceleration of your resilience testing. Think of it this way: instead of doing everything manually, you're having someone do it for you while you validate. It brings those two components, AI and human, together to give your teams a way to bridge that gap safely and quickly as a working team instead of as individuals.

Thumbnail 3170

Thumbnail 3190

So how can we implement AI into resilience testing? We talked about this, but Hans will bring it back into the whole lifecycle. We talked about and showed a bit of how we test and evaluate: discover the system for me, create these things, save me some time. And then we talked about learn and respond. But these were isolated use cases in themselves. So how could we take this to the next step? How can we take this, as we like to say at AWS, to the art of the possible for this system?

Thumbnail 3200

We can combine these into a multi-agent system, with agents responsible for different things. Here we're going to start on the left and move to the right along the top. We have a hypothesis generator; the Systems Manager inventory agent that we have could be a subagent of that hypothesis generator. It will not just take into account what's installed on an EC2 instance, but look at the application code, look at the environment holistically, and tell me everything that we could hypothesize for this given scenario.

From there this could create many hypotheses, which it will feed to the next agent, a prioritization agent, which will tell me which of these is most likely to actually happen. Let's test the most likely ones. Let's spend our time where we should; prioritize these for me. And then, once we have the hypotheses that have been prioritized, we need to design the experiments, given context of what the native FIS actions are and what I then need to go create myself. So the Systems Manager document agent would most likely be a subagent under experiment design. Go design these experiments, have context of what is native, and go beyond that where needed with safety guardrails in place. The experiment agent would then be able to actually go execute these experiments for us. We can have other agents that are even evaluating each of the Systems Manager documents we write for their efficacy. Are these following best practices, using LLM-as-a-judge, as you may have heard before? We can implement these other agents along the way to make this more of a solution: go discover my environment, hypothesize, build experiments, and then test them for me.

And then also learn and iterate. Have another agent monitoring the experiments as they run. Are they creating the desired impairment? Are we seeing the logs come through, and are we able to observe what's happening? Or do I need to go reimagine my hypothesis, and can you continue to iterate on that? We can put all these agents together to give us that holistic solution to evaluate our environment and begin testing. Along the way, we still want to trust but validate, so we would probably still be looking at the Systems Manager documents that we're producing and validating things as we move them from dev environments. But at least it gives us a head start to go create these things and lets me spend my time efficiently rather than toiling with creating documents and inventorying systems.
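One way such a pipeline could be wired together, assuming the same Strands-style SDK as in the earlier sketch and using an "agents as tools" pattern that is an assumption rather than the presenters' exact design, is sketched below:

```python
from strands import Agent, tool

# Sub-agents for the pipeline stages (prompts shortened for brevity)
hypothesis_agent = Agent(system_prompt="Generate failure hypotheses for the given system.")
prioritization_agent = Agent(system_prompt="Rank hypotheses by likelihood and business impact.")
experiment_design_agent = Agent(system_prompt="Design FIS experiments and SSM documents, "
                                              "reusing native FIS actions where they exist.")

@tool
def generate_hypotheses(system_description: str) -> str:
    """Produce candidate failure hypotheses for the described system."""
    return str(hypothesis_agent(system_description))

@tool
def prioritize(hypotheses: str) -> str:
    """Rank hypotheses so we test the most likely failures first."""
    return str(prioritization_agent(hypotheses))

@tool
def design_experiments(prioritized_hypotheses: str) -> str:
    """Turn the top hypotheses into FIS experiment templates and SSM documents."""
    return str(experiment_design_agent(prioritized_hypotheses))

# Orchestrator: discovers, hypothesizes, prioritizes, and designs; a human still
# reviews the generated documents before anything runs (trust, but validate).
orchestrator = Agent(
    system_prompt="Coordinate resilience testing: discover, hypothesize, "
                  "prioritize, design experiments, and summarize for human review.",
    tools=[generate_hypotheses, prioritize, design_experiments],
)

orchestrator("Evaluate the web-tier application tagged Application=web-tier.")
```

The orchestrator only drafts artifacts; an engineer still reviews the generated documents before any experiment runs.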

Thumbnail 3360

Key Takeaways and Getting Started with Multi-Agent Chaos Engineering

Thinking about takeaways: a 60-minute conversation is a lot to digest, but if you take anything away, it's this. When we think about using AI with resiliency, we're thinking about how it can help us detect vulnerabilities in our complex systems. We're thinking about how it can accelerate us, letting us test faster, remediate, and mitigate as much as we can, and not just on our own. It helps us go beyond what we already know, mapping out the dependencies that might be unknown, so we're testing the unknown.

Thumbnail 3400

Talking about that collaboration between AI and ourselves: we're thinking about all the different possible scenarios that we have not thought about or have not tested. When something happens and you get that 3 a.m. call, you already know what you're supposed to do, or better, you've mitigated that risk and reduced the possibility of being woken up at 3 a.m. at all. By using Fault Injection Service with AI, we're getting that controlled environment, being concise with our chaos engineering and testing with a purpose. This is accelerating the collaboration between AI and the different teams that you might have.

Thumbnail 3450

The real truth, and the power here, is that partnering with AI is not about replacing one component of your team. It's building a team, so that the AI becomes more knowledgeable through every test, every failure, and every success. We're also getting feedback from your engineers on whether it's right or not. They validate each other, so that you're building an application that can withstand a failure in any component, wherever it might be.

Thumbnail 3480

Thumbnail 3500

This is the evolution that we've seen when it comes to resiliency testing. It's building that bridge, understanding what is happening and what is not happening, and then being able to learn from each other, validate, and progress as the company grows over time. So what's next? Earlier I mentioned that we have something you can implement right away. At AWS, we like to repurpose already existing things, so multi-agent chaos engineering is where you'll find the solution that we demonstrated today. It includes all the code for the different agents and what the agents are doing, so that you can repurpose it as well.

Test at your own risk, validate it, and make sure it's meeting your guidelines. We have the AWS Resilience Hub framework that is giving you an action plan to understand your app, define your application, move towards what are possible scenarios and failures, and also plan for mitigation altogether. We also have the AWS Fault Isolation Boundaries, which is talking about our infrastructure and how you can plan for what could be a disaster and how you implement that and design your own application.

Thumbnail 3560

Want to know more beyond just resiliency? We do have a booth at the village. We have multiple use cases from observability, cloud governance, and cloud operations in general. Come find us at the booth. We have demos and swag. We all love swag. Come visit us at the kiosk. If you have any questions, we could take questions in the back. Thank you so much for being here, and I hope you have a good re:Invent.


This article is entirely auto-generated using Amazon Bedrock.
