🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Chaos & Continuity: Using Gen AI to improve humanitarian workload resilience
In this video, Mike George, Principal Solutions Architect at AWS, presents five key principles for cloud resilience using the SEEMS acronym: Single points of failure, Excessive load, Excessive latency, Misconfiguration and bugs, and Shared fate. He demonstrates an agentic resilience advisor built with Amazon Bedrock and the Strands Agent SDK that automatically analyzes AWS workloads against these principles. The agent uses tools, including a use AWS tool, a calculate letter grade tool, and the AWS Documentation MCP server, to assess resilience posture, generate architecture diagrams in Mermaid format, provide observability recommendations, and create operational runbooks based on specified RTO and RPO requirements.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Planning for Failure: The Five Principles of Cloud Resilience Using the SEEMS Framework
You've probably heard of Werner Vogels, who is Amazon's CTO. His famous quote is "Everything fails all the time." Now, if you've been to re:Invent a couple of times, you may remember last year when he said that he's often misquoted. The actual quote is "Everything fails all the time, so plan for failure and nothing will fail." That's exactly what we're going to talk about today. I'm going to tell you how to plan for failure so that your workloads won't fail. My name is Mike George. I'm a Principal Solutions Architect with AWS, and I work primarily with nonprofit organizations. The tips, tricks, and techniques we're going to talk about today are things that we use to help public sector organizations be successful through disasters.
Unfortunately, humanitarian disasters are all around us. You can think about the war in Ukraine, the floods in the Midwest, and there seem to be problems all around the world. This is something that's going to be continually happening. But if you're stranded on the roof of your home and you're calling for help, when you call, if you can't get through, knowing that everything fails all the time isn't very comforting. You want things to work when you need them to work. That's what we find with most humanitarian organizations: when things are at their worst is when your workload needs to be successful. So what I want to talk about today are the major ways that you need to think about resilience in the cloud.
When people think about resilience in the cloud, they typically think about the N plus 1 problem, right? I have one EC2 instance, so two must be better. I have one Availability Zone, so two must be better. I have one region, so two must be better. Well, not always. There are actually five principles you need to think about to have resilience in the cloud. The first one is really managing those single points of failure. This is what most people think of: one is good, so two must be better.
The second thing is excessive load. Excessive load is all about having enough resources to support your workload. This also includes things like making sure that you have appropriate service quotas and other things to support your workload. We also think about excessive latency.
Excessive latency is all about how your workload handles latency, or what happens if a downstream dependency of your workload has high latency.
We think about misconfiguration and bugs. Misconfiguration and bugs is all about making sure that you have the right CI/CD processes and automation so that you can effectively deploy workloads to production and make sure that you're not making manual changes. If you've got manual changes that you're making in production, you may not have a resilience problem today, but you will one day. And the last thing to think of is shared fate, which is really about reducing the blast radius of our workload.
Think about blast radius or think about shared fate where I have one database that supports two or more workloads. If there's a problem with that single database, then those other workloads could be affected. So again, it has a larger blast radius than I may want to have. Now when I think about these different categories of failure, you can see that we use the acronym SEEMS to help you remember it. The S stands for single points of failure, the E is excessive load, the next E is excessive latency, the M is misconfiguration and bugs, and that final S is shared fate.
So when we think about these different categories, we think about single points of failure. Think about how your workload is architected. Is it architected for redundancy? What happens if those components fail? When I think about excessive load, I want to think about what could overwhelm this component. How can this component overwhelm other downstream components? What happens if this workload takes so long to succeed that people stop waiting for it? Is it possible to throw away work that's never going to be returned? Could this workload experience bimodal behavior? In other words, does it operate one way under normal conditions and another way under a failure scenario? Are there quotas that could be exceeded? And how does this component scale under load?
When I think about excessive latency, I think about similar things like what happens when this component experiences latency, or what happens if a downstream dependency experiences latency. How does my workload behave? When I think about misconfiguration and bugs, again, we want to think about whether I can automatically roll back a failed deployment or a bad deployment. Or can I shift traffic away from a faulty container or from an Availability Zone where maybe a bad deployment occurred?
Do I have guard rails or other things in place to prevent operator errors? Are there things that could expire in my workload, like certificates or credentials? Finally, when I think about shared fate, I think about how big of a change it is when I deploy this component. Is it a large change? If it is, that could increase my risk of having a problem. Does this component share user stories with other workloads? Are there things that are tightly coupled with this workload? What happens if this workload experiences a partial or grave failure? These are all the different kinds of things that are worth thinking about.
Now when we think about resiliency, it's worth having a mental model. The mental model that I want you to keep in mind is that we have high availability, which is how I have built my application to react to certain kinds of failures that I would anticipate. You might ask yourself: what are the things I would expect to fail in my workload? An easy one is that if I'm spread across multiple Availability Zones, I should expect that over time I might lose an Availability Zone. That seems like something you should plan for. You could think of other types of failures that you might want to anticipate.
On the other hand, we have disaster recovery. Disaster recovery is all about failure scenarios that I've anticipated but decided not to mitigate, or maybe things that I just haven't anticipated. Disaster recovery, then, is a way for me to resume my operations. Underlying both high availability and disaster recovery is the mental model of continuous improvement. What can I do to continuously improve my workload? Because resilience is not a one-and-done type of thing. This is all about implementing good CI/CD processes and introducing resilience testing, to test not only your workload for failures but also your team.
Building an Agentic Resilience Advisor: A Live Demonstration of AI-Powered Workload Analysis
We've spent a lot of time now talking about resilience. Let's now actually talk about generative AI. Let's look at a typical generative AI application where I have users that interact with an agent; that agent interacts with a foundation model and a set of tools. You may have heard of that set of tools in the context of MCP.
Now, one thing that I think would be interesting is I've talked about resilience. I've talked about those five different areas that I'm interested in testing my application for. I'm interested in determining if my application is vulnerable to shared fate, high latency, insufficient capacity, misconfiguration and bugs, and single points of failure. It would be interesting, wouldn't it? Could I use generative AI to automatically look at a workload that I'm running in the cloud and tell me if I have problems in any one of those five areas? Well, that's what I've done and that's what I want to demonstrate for you today.
Similar to the diagram I just showed you, this is an agentic resilience advisor that I want to demonstrate for you today. The resilience advisor works like this: first of all, I built an agent. That agent interacts with a large language model through Amazon Bedrock, and it then interacts with a set of tools. This set of tools gives the agent functionality that it doesn't natively have. There are three sets of tools that I've given it access to.
The first is a use AWS tool. This is a tool that's built into the Strands Agent SDK, and the use AWS tool allows my agent to go into my AWS account and inspect the resources that I'm running there. The next tool that I've created is a calculate letter grade tool. I want to have a simple letter grade so I know the resilience of my workload: given a shared fate issue, is this an A, B, C, or D? This is just a very simple piece of Python code that I've decorated with an identifier in my Strands agent to let it know that it's a tool.
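As a rough illustration of what such a decorated tool might look like, here is a minimal sketch. The `@tool` decorator comes from the Strands Agents SDK; the grading thresholds and the fallback decorator are assumptions for illustration, not code shown in the session.

```python
# Minimal sketch of a calculate-letter-grade tool for the Strands agent.
# The try/except fallback lets the sketch run even without the SDK installed;
# the score-to-grade thresholds here are hypothetical.
try:
    from strands import tool
except ImportError:  # run standalone without the Strands SDK
    def tool(fn):
        return fn

@tool
def calculate_letter_grade(score: int) -> str:
    """Map a 0-100 resilience score for one SEEMS category to a letter grade."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

print(calculate_letter_grade(75))  # a mid-range score maps to a C
```

The decorator registers the function (with its type hints and docstring) so the agent knows it can call it while reasoning.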
Finally, I'm using the AWS Documentation MCP server. This is a publicly available MCP server that allows you to get detailed documentation on really anything you want as it relates to AWS. What I want to do is run this agent. I want to ask it questions about a specific workload in my account and get the resilience posture back so I know whether I need to fix anything or not. So what I've done now is I've got two consoles here. On the right-hand side, that white screen is my agent that I'm actually running. In the real world, I wouldn't run this through a console, but I'm doing this here just so you can see.
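Putting the three tool sets together, the agent wiring might look something like the sketch below. The import paths, the `uvx` launch command for the AWS Documentation MCP server, and the prompt wording follow public Strands Agents examples and are assumptions, not the exact code from the session.

```python
def build_analysis_prompt(tag: str, rto_hours: int, rpo_hours: int) -> str:
    """Ask the agent to grade a tagged workload against the SEEMS principles."""
    return (
        f"Analyze the workload tagged '{tag}' for resilience. "
        f"The RTO is {rto_hours} hours and the RPO is {rpo_hours} hours. "
        "Grade each SEEMS category with a letter grade and justify each grade."
    )

if __name__ == "__main__":
    # These imports and the MCP launch command follow public Strands examples
    # and may differ in your environment.
    from mcp import StdioServerParameters, stdio_client
    from strands import Agent, tool
    from strands.tools.mcp import MCPClient
    from strands_tools import use_aws  # built-in tool for inspecting the account

    @tool
    def calculate_letter_grade(score: int) -> str:
        """Stand-in for the talk's letter-grade tool; thresholds are hypothetical."""
        return "A" if score >= 90 else "B" if score >= 80 else \
               "C" if score >= 70 else "D" if score >= 60 else "F"

    # The publicly available AWS Documentation MCP server, launched over stdio.
    docs_mcp = MCPClient(lambda: stdio_client(StdioServerParameters(
        command="uvx", args=["awslabs.aws-documentation-mcp-server@latest"])))

    with docs_mcp:
        agent = Agent(tools=[use_aws, calculate_letter_grade,
                             *docs_mcp.list_tools_sync()])
        agent(build_analysis_prompt("food-agent", rto_hours=24, rpo_hours=12))
```

Running this requires AWS credentials with read access to the tagged resources; the agent then decides on its own when to call each tool.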
On the left-hand side is my client who's actually interacting with my agent. I'm going to start out by starting my agent up. You can see it's beginning to run. Now on my left-hand side, I'm going to start up my client and I'm going to start by answering a few questions. I'm going to give a tag of a workload that I'm interested in that's running in my AWS account. I'm going to tell it the RTO and the RPO of my workload.
You can notice that on the agent side, it's identified that it's looking at the food-agent workload, and it's going to analyze that workload with an RTO, a recovery time objective of 24 hours, and an RPO, or recovery point objective, of 12 hours. It's going out to my AWS account, pulling back those resources, reasoning about them, and then it's pulling back documentation to look at what are things that I could do to improve my workload. You can see now on the left side that it's returned results to me, and it gives me those five different categories that I mentioned earlier. If you look really closely, you can see that it gave me pretty much B's and C's for letter grades. So given an RTO and an RPO of 24 hours and 12 hours, my workload is okay. There's probably some room for improvement.
But now I want to ask it another question. How would my letter grades change if I changed my recovery time objective from 24 hours to 2 hours? And what if I changed my RPO, my recovery point objective, from 12 hours to 30 minutes? How would that change my resilience posture? Now you can see on the right-hand side that the agent doesn't need to go back to my AWS account. It already understands my workload that's running. It's reasoning about that workload further, and it's pulling back some documentation to justify any claims that it's going to make about what I should fix.
You can see now that on the left-hand side is what I get back in my client. If you look carefully, you can see that whereas before most of my letter grades were B's and C's, now most of my grades are D's and F's. So this workload definitely does not support an RTO and RPO of 2 hours and 30 minutes. The next thing I want to do is address the fact that many of you have workloads that have been running in production forever, and you've just never updated the architecture diagram. It gets out of date. Now I want to ask this: build me an architecture diagram based on what you've already pulled out of my account.
The agent reasons about that, and if you look here on the left-hand side, what it's done is it's generated an architecture diagram for me in Mermaid format. So if you check that into Git, you'll have a nice graph that will look something like this. You can see this is the workload that it's been analyzing for me. You can see that I've got a Bedrock Agent that interacts with a Lambda function. It reads and writes from an S3 bucket. It uses Secrets Manager. It's logging out to CloudWatch Logs. It's got some IAM roles, and it's got some encryption through KMS.
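Reconstructed from the components just listed, the generated Mermaid source might look roughly like this (node names and edge labels are illustrative, not the agent's actual output):

```mermaid
graph TD
    BA[Amazon Bedrock Agent] --> LF[AWS Lambda function]
    LF -->|reads/writes| S3[(S3 bucket)]
    LF --> SM[Secrets Manager]
    LF --> CW[CloudWatch Logs]
    IAM[IAM roles] -.->|grants permissions| LF
    KMS[KMS key] -.->|encrypts| S3
```

Checked into Git, a diagram in this format renders automatically on platforms such as GitHub.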
The next thing I want to do is address the fact that I know there are problems with this workload. I've seen that there are definitely things I want to improve, but I don't want a list of everything that's wrong with it. I want to just start out with observability. So let's start out by asking: what are the top three things I should focus on from an observability perspective? The agent on the right-hand side is going to reason about my workload. It's going to pull back some documentation, and if you look on the left-hand side, it's now given me my results.
You can see that it's identified that I need to have comprehensive CloudWatch alarms and dashboards. Fair enough. I don't have any alarms and dashboards currently. It's then saying I should enable X-Ray for tracing. I don't have any sort of distributed tracing, so that's a good next step. And I should set up enhanced logging and CloudWatch Logs Insights. Again, I do have a log file, but I'm not really logging that much, so that's another good next step. You can see that for each of those recommendations, it's giving me links to specific documentation on how to accomplish that task. Notice that when it's talking about setting up logging, it's not giving me a link to the CloudWatch main page on AWS. It's giving me a link deep in the documentation.
The last thing I want to do with my agent is address the fact that nobody wants to have downtime, but I want to be prepared for a bad operational day. So I'll ask it now: build me a runbook that I can use to recover from an operational event.
My agent is running and reasoning about my workload. It's going to some documentation, pulling that back, and you can see now that it's generated a runbook for me. I can see some incident severities here. When I scroll down, I can see recovery procedures for my Lambda function, recovery procedures for my S3 bucket, for my Bedrock agent, and for Secrets Manager. I also have ways to validate my service overall so I can make sure that it's back up and running. And if you look here at the very bottom, it then gives me steps on what I should do to complete a post-incident analysis.
Resources and Next Steps for Implementation
So I think we've answered the question: could we use generative AI to help me improve the resilience of my workload? The answer is absolutely yes. I've got a couple of QR codes here if you're interested. The first QR code is for Strands Agent. I mentioned that I built this agent through the Strands SDK. If you're interested in learning more about Strands, scan that QR code.
That middle QR code is for the Resilience Agent. If you're interested in the code that I demonstrated for you today, scan that middle QR code. It will take you directly to our GitHub repository, which is our nonprofit samples repository. And if you're interested in seeing how to actually build this code, scan that QR code on the right-hand side. This is a QR code for AIM 336, which is happening on Thursday, where I'm going to be walking you through how to build this exact code.
As a next step, if you're a nonprofit organization, the QR code on the left-hand side shows more sessions that we're doing this week related to nonprofits. The QR code on the right-hand side, if you're a nonprofit and you want to know who your account team is, scan that and let us know who you are. We'd love to talk to you more about what you're doing. With that, thank you for your time, and I ask that you complete the survey in the session app. Thanks, everyone.
This article is entirely auto-generated using Amazon Bedrock.

























