
Kazuya


AWS re:Invent 2025 - From ideas to impact: Architecting with cloud best practices (ARC204)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - From ideas to impact: Architecting with cloud best practices (ARC204)

In this video, James Beaumont and Paul Moran present AWS Cloud Adoption Framework and Well-Architected Framework, celebrating their 10-year anniversary. They explain how these frameworks work together—CAF for organizational strategy and Well-Architected for technical implementation. The session covers AWS's operational culture including service ownership model, correction of error (COE) process, and Weekly Operations Review. Paul dives deep into the six Well-Architected pillars, focusing on Operational Excellence design principles. He demonstrates CloudWatch's AI-powered investigation capabilities for automated troubleshooting, discusses the five causes of failure (dependency, system/component, constraints, traffic spikes, and change), and emphasizes chaos engineering through game days using Fault Injection Service. The presentation includes practical examples of testing resilience across availability zones and regions, with specific focus on gray failure scenarios and continuous improvement cycles.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Welcome to re:Invent: Addressing the Great Stall in Cloud Migration

Good morning everybody. Hi, good morning. Great to see you all this morning. It's 8:30 a.m., we're in Las Vegas. It's the start of re:Invent. Thank you all for showing up today. It's great to see you all here in person. I can't see much because of the lights. I'm trying to spot people's eyes.

Thumbnail 20

But yeah, here we are, an exciting week ahead. So you've all got a plan, you get really excited, and you start moving and migrating your first workloads to AWS. Everything's going great, your teams are gaining confidence, you're high-fiving each other because of the success that's happening. But then you start bumping into problems. We've seen a couple of different scenarios working with customers over the years.

Thumbnail 50

One of them is that you give your technical teams all the keys to AWS and they just start building things without a particular plan or strategy in place, and they bump into issues and things get closed down. One of the other situations we see is that you start building things without really a clear business outcome defined, and again, things get closed down.

Thumbnail 70

Thumbnail 80

Thumbnail 90

And we call this the great stall, and we really want to avoid this happening because, you know, cloud success is not guaranteed, and we really want you to be successful on AWS. So this brings us to our talk today, which is From Ideas to Impact, Architecting with Cloud Best Practice. My name's James Beaumont. I'm the Director of Enterprise Support for EMEA. And my co-presenter, who, where is he, he's over here, is Paul Moran, who will come on stage in a bit. He's a Principal Technologist based in EMEA as well.

Thumbnail 120

Ten Years of Cloud Adoption Framework and Well-Architected: Origins and Evolution

So, it's 10 years since we launched Cloud Adoption Framework and Well-Architected, so it's pretty big for us in 10 years of a journey. But what is the Cloud Adoption Framework? So it's your business and organizational guidance for your cloud adoption journey. And the Well-Architected review is your technical architecture guidance for building workloads on AWS. So they work really well together.

Thumbnail 140

Thumbnail 150

Oh, pressed it too early. So they work really well together: the Cloud Adoption Framework is geared towards the early stages of your journey to AWS, and the Well-Architected Framework is for when you get into the technical implementation phases. Now, as I said, we've been going for about 10 years now, but it actually started before that.

Thumbnail 180

So back in 2012, the Well-Architected Framework was created by the UK Solution Architecture team, led by a guy called Fitz, and it was based on conversations with customers. We'd go and sit with customers to talk about how they were using AWS, and we'd see common questions coming up. So we built a set of common questions to walk customers through, and it started off on a spreadsheet, as all good things do, with guidance coming out of the back of it. And that was the original Well-Architected Framework.

In 2015, we published the framework so you could see what the guidance was and what questions to run through. And then in 2018 we made it self-service, so you could run through Well-Architected yourself. Since that time, we've seen customers run around a quarter of a million Well-Architected reviews, which is pretty significant.

We've also developed other capabilities, so we introduced something called lenses. Now, there are roughly two types of lenses. There are industry lenses: if you're a financial services customer, for example, an industry lens shapes how you view Well-Architected, because there are particular things you'll want to focus on in financial services. The other type is workload-specific lenses, and one of the more recent ones we did was the Generative AI lens, which covers how to think about Well-Architected if you're running generative AI workloads.

Thumbnail 270

Thumbnail 300

Why Use Cloud Adoption Framework and Well-Architected: Business Value and Technical Benefits

But why do you want to use the Cloud Adoption Framework? You want to reduce business risk, lowering your risk profile through improved reliability and business continuity. You want to improve your ESG, your environmental, social, and governance performance, and be able to monitor those aspects and make sure you're improving over time. And you want to create new value streams or revenue streams. Most of us work for businesses, and when we're building something, it's often about creating more value for customers and more revenue, so understanding what that is and how you're going to get there is really important.

Thumbnail 320

And then we've got increased operational efficiency. This is around reducing operational cost, increasing productivity, and increasing employee and customer experience.

Thumbnail 330

Now, why do you want to use the AWS Well-Architected Framework? You want to learn from AWS best practice. There's a saying which goes, there's no compression algorithm for experience. But we've kind of compressed 10 years of experience and understanding of working with customers into the AWS Well-Architected Framework. So rather than you having to start from scratch, you can go and get that 10 years of experience to learn from.

Thumbnail 360

Thumbnail 370

You want to make informed decisions. This is kind of an obvious one and goes without saying, but if you're not making informed decisions, you're probably going to make mistakes. So knowing what you're going into and how you're going to get there is really important. You want to build and deploy faster. So when you're sitting down with customers or the business and understanding what the customer needs are, you want to get what they're asking for into their hands as fast as possible. Your ability to deploy and release faster is really important, and we call that customer time to value.

Thumbnail 400

And then you want to lower and mitigate risk. Obviously we don't want risk, so understanding what the risk is and being able to mitigate it where possible is really important. And there's often a trade-off between going fast and having risk, and understanding all these different components and how they work together is really important.

Thumbnail 430

Thumbnail 470

Understanding the Six Perspectives and Pillars: Strategic and Technical Frameworks Working Together

So the guidance framework for cloud optimization, the AWS Cloud Adoption Framework, which has evolved over the last 10 years, looks at six different perspectives. And they're down at the bottom here. I'm going to talk about two of those specifically. One of them is the people perspective, and it looks at some really important things around your culture: how does your leadership function, what is your org design, and how are those things structured to make you successful. The other is governance, which looks at things like risk management, product management, and data governance. Because, you know, we talked about speed and how that's really important, and some of these things, if you're not set up in the right way as an organization, can introduce slowness. So you want to make sure you're getting the right guidance and following the best practice.

Thumbnail 480

Now, in the AWS Well-Architected Framework we also have six things, but they're pillars, and the Well-Architected Framework is more around the tech, when you're getting into the technical implementation of your journey, whereas the Cloud Adoption Framework covers the strategic portion in the earlier stages. If we talk about operational excellence, it looks at monitoring systems and continually improving processes, and it builds on the operational capabilities that we talked about in the Cloud Adoption Framework.

If we look at the security pillar, it focuses on things like data security, user management, and being able to detect security events. At AWS, we say that security is job zero, and the same should be true for all of your businesses, so the security pillar is really, really important. And then as I mentioned, we've got the lenses, so if you're from a certain industry or you've got certain workloads, you can take a view through the AWS Well-Architected Framework.

Thumbnail 560

So in summary, the AWS Cloud Adoption Framework is for your early strategic stages in looking at your organization, so how your organization is structured and how you go about getting success on the cloud. And the AWS Well-Architected Framework is when you're moving to the technical implementation phases and taking the best practices for your workloads that you're either looking to build or you've already built on AWS. And the frameworks work really well together and they take a joined-up view of your full cloud optimization journey.

Thumbnail 580

Now, the AWS Well-Architected Framework looks at what we call a workload, and understanding what a workload is is really important, because that's how you understand what it's meant to do and how it's functioning. There are two ways to understand the workload. You've got automated discovery, where you use technical tools to discover what you've got running, or to look at your infrastructure as code. And then you have conversational discovery, which is sitting down with the relevant stakeholders who understand the workload to see how it's built and how it's supposed to operate.

We actually find those two things in combination give you the fullest picture of what that workload is.

Thumbnail 620

Thumbnail 630

Culture Eats Strategy for Breakfast: AWS's Service Ownership Model and Operational Practices

I'm going to switch gears and talk a bit about culture, and this is why I focused on those two specific aspects of the Cloud Adoption Framework and a Well-Architected review. There's a famous saying that culture eats strategy for breakfast, and we believe this is really important, so I'm going to talk about some of the cultural practices AWS uses in how we run things.

Thumbnail 650

Thumbnail 670

One of these is what we call the service ownership model. You might have heard the term two-pizza teams; this is how we operate and how our teams function around our AWS services. A team is responsible for an AWS service: they build it, they deploy it, they operate it, and they continually improve it. They have full ownership of that service. In terms of how we deploy, rather than taking a release cycle that ships on a monthly or quarterly basis, we take the approach of continuously releasing in small chunks rather than doing big releases.

In that release process we do testing, we do phased rollout, so rather than doing a fleet-wide rollout, we'll test in smaller chunks and then we'll monitor and make sure the progress is going okay. We'll have auto rollbacks built into it. But again, this is a culture that we take where we're continuously doing releases.

Thumbnail 710

And then we have something called the operational readiness review. Ahead of any service going into production, an operational readiness review has to be completed. It's built from our experience of common pitfalls and the things you should think about ahead of launching a service. For example: what happens if your traffic suddenly 10xs? If there's an issue and you have to move away from an availability zone, what happens? What's your disaster recovery? What are the key metrics that you're going to monitor? If you're monitoring for certain errors, what's the error rate you want to alert on? It runs through all these questions, and the team responsible under the service ownership model has to answer them all before the service goes into production.

Thumbnail 770

Thumbnail 790

We then have a couple of different review mechanisms. This is pretty common practice across the industry: design reviews and peer reviews of the code that's being written. If it's something meaningful and significant, we'll bring in the principal engineers that we have at AWS to review what's being done and the approach that's being taken. And I already talked about the operational readiness review.

Now, I've been at AWS for just over 10 years, and one of my favorite things at AWS, which might sound a bit strange, is what we call the correction of error process. If we have a customer-impacting event, we do something called a correction of error, or COE. It's a review of what happened during the event, and there are a number of different sections in a correction of error. One of the sections is a description of what happened: what was the customer impact? Did we impact a certain number of orders, was there latency? There will also be a timeline of what happened during the event, covering, down to the second, the full flow of the event from start to finish, through to when it was mitigated.

One of the other sections in there is what we call the five whys. Now, the five whys is really cool. The first why will be: why did the thing happen, why was there impact? And then you keep asking subsequent whys that drill down into the details. It's not always five whys, I've seen some where it's 12 or 13 whys, or there might be three, but you keep asking why until you can't ask why anymore, and the bottom why is really the root cause of the event.

There will also be actions at the end: the things you have to do to make sure this doesn't occur again. Those actions are tracked until completion, and they get quite loud and escalate internally within AWS if you're not hitting the dates. There are also peer reviews, so when you have one of these COEs, you can't just write it and publish it; you do it in a peer review setting, which comes back to one of the earlier slides. There's a QR code here that you can scan.

You can go and read more about COE, but as I said, it's one of my favorite things at AWS.

Thumbnail 920

We also do something called the Weekly Operations Review. So every Wednesday afternoon, I'm based in the UK so it's afternoon for me, but morning if you're based in Seattle, we have a weekly call that all the engineering teams from all of the AWS services are invited to. At the start of that meeting, we always celebrate successes. So this might be that you've improved the latency of an API or you've increased the operational efficiency of the service, you've made it cheaper to run, or you've increased the reliability of the service. We'll discuss COEs, the bigger ones as a group, and the reason is we want to share that so other teams get to learn about what happened and how they can learn from it as well.

We'll also do a spin-the-wheel, where we pick certain services to drill into; AWS has a lot of services. All of the metrics that were defined originally in the operational readiness review, and that we've iterated on over time through continuous improvement, are available to everybody. So we'll pull up the metrics of a specific service and they'll be reviewed in a group setting. It's a really good experience where you've got engineering teams across different functions and different services seeing how other teams operate, so they can all learn from each other.

Thumbnail 1040

Building Foundations with Operational Excellence: Design Principles and Anticipating Failure

But now I'm going to hand over to Paul, there he is, I can't see with the lights, to talk about building foundations. Thank you, Paul. Thank you, James, thank you. So I'm going to dive a little bit deeper into the Well-Architected side of things. James talked a lot about the Cloud Adoption Framework. Both of us work in support, so we spend a lot of time working with our customers around Well-Architected solutions, because it's the long-term outcomes that we're looking to achieve.

And as James described before, Operational Excellence came in in 2016. But before that, when we originally published in 2015, there were four pillars, and those four pillars were the Security pillar because it's job zero, Reliability, Performance Efficiency, and then Cost Optimization. And then back in 2022, we introduced Sustainability as well. And we see these as the end-to-end approach in terms of how you think about your architectures in AWS, and also we see these as complementary to each other.

Thumbnail 1090

Thumbnail 1100

One of the reasons why we placed Operational Excellence as the first pillar is because it's the foundation for a lot of what comes later, and particularly Security is made better when you have good operational procedures in place. So I'm going to dive in a little bit deeper around Operational Excellence for the next wee while and talk through that. So each pillar itself has its own set of design principles that we work with, and those design principles for Operational Excellence, there's eight of them.

Thumbnail 1110

And they're there to be those kind of non-negotiable components when you approach how you deliver your operations, your security, your approach to reliability, all of those kinds of things. And they help you create that mental model for building out your applications and your workloads. These ones are based on laying those foundations for everything else. So things like implementing observability for actionable insights is really, really helpful because it's not just saying put monitoring in place, it's saying have something that provides business value. You're building this out for a business reason, have something that can input back into the business on the back of it.

Safely automate where possible is a great example: I'm human, I'm fallible, I've made more than one or two mistakes over the years, and if I can automate away my own fallibility, I will go and do that. Until recently there were areas I thought we were just never going to be able to automate, and I've got a bit of a demo later on to show how automation has helped massively, particularly on the operations side of things.

And that whole concept of making the small, frequent, reversible changes, I think that really helps from a risk perspective as James was talking about before in terms of that risk mitigation and the long-term approach. And constantly refining, constantly improving, always really, really helps with your evolving workloads, your evolving business outcomes, and your operations need to run with that. The anticipate failure one seems like a fairly obvious one to a lot of IT people, but it's difficult to anticipate all kinds of failures.

There are ways of understanding what those different failures could be for your workloads, and there are ways of building up the muscle memory within your organization. I'll go through some of those examples a little bit later on. Going back to the correction of error process that James talked about, that's how we think about learning from all those operational events and updating those metrics. And then the use managed services piece is, well, operations is hard and complicated. Can we take some of that operational burden away from your teams and place it on AWS, so that we're doing it on your behalf as our part of the shared responsibility model?

Thumbnail 1250

Thumbnail 1260

Thumbnail 1280

AWS Infrastructure and Shared Responsibility: From Regions to Resilience

Now some of the ways that we do that are by building the infrastructure on your behalf. So we currently have 38 regions around the world. Each region has the same kind of makeup. There are multiple availability zones in each region, and they are a meaningful distance apart. So that's many, many miles is the long and the short of it. Each availability zone will have multiple data centers within it, and those data centers will be geographically closer to each other with super low latency connections between them. Very often you will have an instance running in one data center, an instance running in another data center, and you can't tell that they're not in the same physical building or same physical rack. So there's a lot of thought gone into the design for that so that we're taking on board some of the operational challenges there. But what that does is it lays the foundations for your architectures and allows you to think more broadly in terms of the different approaches that you can take.

Thumbnail 1310

Now, each of those regions has a control plane and a data plane wrapped around it, so each region can operate autonomously. There are a few services that have global constructs; they're anchored in a primary region and eventually consistent globally, and that primary region also has failover options for those services. We've got a really good document about fault isolation boundaries that explains this, and there's a QR link at the end of the session for it. The idea behind the control plane versus the data plane is kind of born out of networking: how do I control what's going on within my network, but still serve high-quality data out to the services that need to get that data quickly, without any of those conflicts?

Thumbnail 1380

So the idea is that your control plane is where you handle the complexity of the different services operating with each other in the background, and the data plane is where you're moving your data around, where you spin up your EC2 instances, where you interact with the services that are run on there. Now when building out an architecture, we think in terms of shared responsibilities. There are a number of shared responsibility models across AWS. This is the one for resilience which leans into a lot of the operational outcome type situations. The idea here is once again that we're doing a lot of the heavy lifting on your behalf. We're making as many of the complex choices as possible as easy for you to consume as you can. But there are parts of that that you still need to think about in terms of how you deliver your services out.

Thumbnail 1450

And from an architectural perspective, one of the reasons any business runs an application in the cloud is because you want it to be something that is trusted, that's reliable, that your customers can consume in a consistent way, and that the business can really move forward with. That's why building out things like change management, operational resilience, and observability capabilities is key to your own success as part of your broader approach. Now, moving your workloads into the cloud means you've got quite a lot of different approaches you can take towards resilience. So, has everybody here got workloads in AWS at the moment, or at all? I'm seeing a lot of nodding heads. Still got a lot of data centers that you're having to operate as well? A few folk nodding, okay.

Thumbnail 1470

Thumbnail 1500

So when we think about the whole approach to resilience, there are a lot of terms that get conflated very often, and one I see quite a lot is reliability being conflated with resilience. Reliability is about that consistent set of outcomes, that consistent set of capabilities that you have, that consistent, reliable set of engagements with your customers at the end of the day. Then there's high availability. So resilience and high availability must be the same thing, mustn't they?

Thumbnail 1520

But actually, yes, you can continue to operate given something has gone down, but that's not necessarily the same as it being resilient to a set of outcomes. And then you've got your operations themselves. And in the middle of all of that, actually that's where your resilience lives and that's where we think in terms of building out those architectures to allow us to think about the longevity and the life cycle of the applications that you build.

Thumbnail 1540

Thumbnail 1560

Architectural Approaches to Resilience: From Failover to Multi-Region Active-Active

And that leans into the different architectural approaches that you can take. So when you've worked with the data center type things, you can have that traditional enterprise failover type scenario where you can say I've got a primary, I'm going to fail over to the secondary, I've tested it, everything works, and I can fail back and it's all good. And let's keep on going. Now you can do that also in AWS in a region. You can have multi-zonal failovers. Now that's an approach that you can use for a lift and shift for a legacy application that follows that traditional pattern.

Thumbnail 1570

But also, as you modernize, you can be in a situation where you can say, well actually, I can be multi-zonal, I can be active-active all of the time. What that means is I've got more options in terms of survivability, so I'm making my workloads more highly available. But I also have slightly more complex operations, so how can I automate some of these capabilities? It's things like putting in load balancers and auto scaling groups, and spreading the workload across all three of those availability zones in this example.
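To make that pattern concrete, here is a minimal sketch of the multi-AZ active-active setup described above, assuming a launch template, three subnets (one per availability zone), and an existing load balancer target group already exist. All names and ARNs are hypothetical placeholders, not anything from the session.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-active-active",
    LaunchTemplate={"LaunchTemplateName": "web-lt", "Version": "$Latest"},
    MinSize=3,                     # keep at least one instance per AZ
    MaxSize=9,
    DesiredCapacity=3,
    # One subnet per Availability Zone spreads the fleet across all three AZs.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    # Register instances with the load balancer's target group so traffic is
    # distributed across every healthy instance in every AZ.
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/web/abc123"
    ],
    HealthCheckType="ELB",         # replace instances the load balancer marks unhealthy
    HealthCheckGracePeriod=120,
)
```

With the group spanning three availability zones and the load balancer health checks driving replacement, losing a single zone leaves two-thirds of capacity serving traffic while new instances launch automatically.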

Thumbnail 1610

You can then have multi-region failovers, so you can say I've got a primary region, and part of our business continuity planning is that we will have a failover to that secondary region and we can continue to operate with that approach. You can still have that high level of availability of service because you're operating across those three availability zones. But you still also have that option should we have an event that means that that region's no longer available for a period of time. You can move whilst the disruption's on and fail back when the time comes.

Thumbnail 1640

Increasingly we're seeing patterns where it's multi-region active-active as well. So patterns where you can say I can distribute my workload, and I test my business continuity planning and my disaster recovery on a day to day basis by saying, well, I'm sending some customers in this direction and some customers in this direction. I'm moving my traffic around to accommodate where they are, it's closest to them, it could be a follow the sun type model. But either way, the situation is that you are testing your resilience across those regions as well.

Thumbnail 1670

Thumbnail 1680

And as you get more and more mature, you can take different approaches to your architectures. A common approach we see is cell-based architectures. Anyone familiar with cell-based architectures? Any hands up for that one? There are a few nodding heads. Cell-based architectures are enclosed portions of your architecture that each operate a portion of your overall workload. If a cell has a problem, fails for some reason, or there's an issue with a piece of code that's deployed, it only affects that cell, and the other cells around it can continue to operate whilst you're spinning up other resources in the background to accommodate it.

So cell-based architecture is a really strong way of thinking about those more sophisticated, really highly available, really resilient solutions, but also very scalable as well because you can start to deploy them across multiple availability zones, across multiple regions. Event driven architectures are about making sure that you're moving your workloads around in a meaningful way so that they're following the events themselves, not necessarily just responding to the individual application flow as well. And they're really good from a deterministic perspective. You know that when your workload is moving through, you know what it's going to be doing, and it lends itself really well to things like a serverless workload and consuming serverless services.

Thumbnail 1770

But all of those things are really fundamentally beneficial when you can test them well. And that's where the chaos engineering side of things comes in, and I'll talk a little bit about that later on. This is an adage that has existed since we started AWS. Werner said that a very long time ago, I think in 2006. Anybody heard it before? Any hands up? You familiar? Yeah, there's a few hands going up, that's good. And from the perspective of somebody who works in support, that's my day-to-day. The idea is everything fails all of the time. How can I make sure that when it does fail, we have a plan for it to work out the way we need it to from a business perspective?

Everything Fails All the Time: Understanding the Five Causes of Failure

And one of the ways to think about it is trying to visualize what the different causes of a failure can be. I like to think of them in terms of the five causes as we talk about them.

Thumbnail 1820

Thumbnail 1850

First one is dependency failure. So you've probably all come across this before. You have an application that's running, something that you depend on has failed. So it could be that it's an upstream service that you're consuming from, it could be a database or a corrupted cache. It could even be a third party service that you're reaching out over the internet to and pulling data in. Something has gone wrong, but you need to know about it because it'll have an impact on your business.

Thumbnail 1860

System and component failure is the one I used to come across the most when I worked in the data center world. That's the thing that has broken: a hard drive has failed, a motherboard has gone pop, the CPU has overheated; those are the components people tend to think of. But in the cloud you can extend that to, well, I do containers. Has one of the pods in my Kubernetes cluster gone pop? Do I care, and how do I handle that? What kind of data do I want to get on the back of that? It could be something as significant as a whole availability zone going away. So you've got that component failure, where access to that availability zone has been closed because of a major event, or simply because somebody put in a firewall rule that closes things out. It's that kind of thing that can occur.

Thumbnail 1910

Then there are constraints. These are the things you bump into that aren't always anticipated. These constraints can be something as simple as, well, I'll ask a question. Anyone here ever had a certificate expire on them and cause a bit of an outage? Yes, there's a few chuckles there. Yeah, that's one of the most painful ones. Those are the kinds of constraints we bump up against: a date, a time. It could be something as simple as growing your estate really quickly and forgetting about quotas, and needing to ask for more EC2 capacity or increased concurrency for your Lambdas, those kinds of limits. So being mindful and keeping on top of those kinds of things is really important.
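As a small illustration of keeping on top of those constraints before they bite, here's a hedged sketch that flags ACM certificates nearing expiry and reads one EC2 service quota. The warning threshold and the quota code are illustrative assumptions; verify the quota code for your own account.

```python
import boto3
from datetime import datetime, timezone

acm = boto3.client("acm")
quotas = boto3.client("service-quotas")

WARN_DAYS = 30
now = datetime.now(timezone.utc)

# Flag any certificate that expires within the warning window.
for summary in acm.list_certificates()["CertificateSummaryList"]:
    cert = acm.describe_certificate(CertificateArn=summary["CertificateArn"])["Certificate"]
    not_after = cert.get("NotAfter")
    if not_after and (not_after - now).days <= WARN_DAYS:
        print(f"Certificate {cert['DomainName']} expires in {(not_after - now).days} days")

# Example quota check: running On-Demand Standard instances
# (L-1216C47A is the commonly documented quota code for that family).
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print("EC2 On-Demand Standard vCPU quota:", quota["Quota"]["Value"])
```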

Thumbnail 1960

Now, traffic spikes are always a big one. I think of these as some of your inputs changing within your environment. It could be something as simple as a marketing campaign that's driven a surge of traffic. It could be that somebody consumed your service, they love it, and it's gone viral because they're an influencer. Those kinds of things tend to drive unexpected or unanticipated outcomes. Staying on top of why that can break things, and how you can respond from an architectural perspective, is where Well-Architected helps: it's a great way of visualizing what you're looking to do and how you can scale the architecture you've got to anticipate those spikes in demand. And there are ways of thinking about those failure scenarios that I'll go into a little bit later on as well.
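One common way to absorb that kind of spike is a target-tracking scaling policy, sketched below against the hypothetical Auto Scaling group from the earlier example: the group adds or removes instances to keep average CPU near a target instead of waiting for a human to react.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-active-active",   # hypothetical group name
    PolicyName="keep-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,   # scale out above ~50% average CPU, scale in below it
    },
    EstimatedInstanceWarmup=120,  # seconds before a new instance's metrics count
)
```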

Thumbnail 2020

But in my experience, the biggest one is always that some kind of change has occurred. Code has been pushed and there's something wrong in it. The other one is config: either the config you intended to push didn't have the desired effect, or, as I've experienced a lot in the past, it's bled through from the wrong environment, so you've got pre-production config that shouldn't be there sitting in production. Those kinds of things are really impactful. So how do you go about finding out about these things from an operational perspective, and how do you use that data to improve your architecture over time? We think of that in terms of observability.

Thumbnail 2070

Observability Transformed: Using Agentic AI for CloudWatch Investigations

The one thing I would say is there's an observability workshop this week. I think the first one is this morning, and there will be another one later on Thursday. If you're into observability, definitely go for the workshop, because it's a strategy-building workshop. It's not necessarily about the hands-on of how do I implement my metrics, my logs, my traces. It's about how I build out my organization's mental model for observability, what I'm working backwards from, and which personas I need to care about. It's not always just going to be about the metrics for a developer or the logs for an operational team.

It's also about whether there's a dashboard that the CFO wants and needs to see, or whether the CTO needs to have insights at that high level that they can then drill down into, and how you keep track of those things and get the right information to the right people. Increasingly though, these three separate aspects of observability are becoming a lot more accessible. You're no longer having to mine just metrics and logs and traces to tease out the information from the data, because I think very often there's a lot of data that we're collecting, but having meaningful information on the back of that is really, really hard.
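As a minimal sketch of "observability for actionable insights" in that spirit, the snippet below publishes a business-level custom metric (orders completed) and alarms when it drops, so the signal means something to the business rather than to a single team. The namespace, metric, threshold, and SNS topic are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Emitted by the application each time an order completes.
cloudwatch.put_metric_data(
    Namespace="Retail/Checkout",
    MetricData=[{"MetricName": "OrdersCompleted", "Value": 1, "Unit": "Count"}],
)

# Alarm when completed orders stay unusually low for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="orders-completed-low",
    Namespace="Retail/Checkout",
    MetricName="OrdersCompleted",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",   # no orders at all is also a problem
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
)
```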

So one of the things that we talked about in the design principles section earlier was around safely automating where it's possible. For me, observability was one of those areas where I didn't think you could safely automate where possible. You have your network team looking at the network stuff, you have your infrastructure team looking at the infrastructure, and you have your application team seeing what's going on with the application side of things. But lately with the advent of Agentic AI, that has changed a lot and you've got the opportunity now to be able to dive really deeply into those things. I just want to go through a little bit of an example from CloudWatch.

Thumbnail 2180

Thumbnail 2190

Thumbnail 2200

Thumbnail 2210

Thumbnail 2220

This is a CloudWatch investigation. With CloudWatch investigations, what you can do is you can say we know that a thing has gone wrong, we can see that it's unhealthy. Let's have a little dive into what's happening here. We'll create the name of our investigation, and then what we'll do is we will click investigate. What the AI does on the back of that is it does all of those things that I talked about before. It will say, right, okay, I'm going to go off and do that investigation. I'm going to be the network team, I'm going to be the infrastructure team, I'm going to be the applications team, and I'm going to pull everything together that I know about in terms of the context of the error messages I'm seeing here, and I'm going to make this a part of the investigation.

Thumbnail 2230

Thumbnail 2240

Here's all the data on that, and actually this is starting to bleed into the realms of here's all of the information. You, as the person driving this, you can go through and say, well, actually this bit is more meaningful to me than some of the other stuff. I'll accept this as a part of my investigation. There are things that I won't necessarily want to be a big part of it, but ultimately I'm pulling in all of the information that I know is of value to my business.

Thumbnail 2260

Thumbnail 2270

Thumbnail 2280

Thumbnail 2290

On the back of that, what you end up with is a set of insights that you can dive into, summarized in a much deeper and more meaningful way. It goes through all of that information, and you get all of the key links to where in the logs you can see the different components. At the end of it, you can accept or reject each finding. But you can also ask, how did we get to this hypothesis? That's what this bit is: this is the hypothesis that has come out on the back of the AI saying, I think I have a good idea of what's going on here.

Thumbnail 2300

Now this is broken up into a bunch of components. First one is the confidence. This one is a high confidence hypothesis that we've got here. Then there's a breakdown of what it thinks is going on. There are some possible next steps in terms of recommendations. You can actually dive into the reasoning behind it as well if you want to look at that. Ultimately you can accept or discard this hypothesis and run through it again. Say I actually need to pull in a little bit more information, rerun the scenario. But the thing that I like about it is it gives you the information that you need.

Thumbnail 2330

It's saying, okay, going through all of this data, I've spotted that you've got an IAM role that's problematic in there, and you get there really quickly. This is pretty much real time in terms of that. So you're saving a lot of time, you're taking that networking view, you're taking that infrastructure view, you're taking the application view, and you're distilling it down really, really quickly.

Thumbnail 2350

Thumbnail 2380

And on the back of that, you get a bunch of recommendations, which you can implement if you wish. The one I like the most is this last one here. It's the AI kind of wagging its finger at me saying, Paul, you should know better: put proper change management in place, because I'm not seeing anything here that reflects that change management side of things. But how do you then take that kind of information and pump it back into your applications and make it more meaningful and much more useful?

Thumbnail 2390

Chaos Engineering and Game Days: Testing Resilience Through Fault Injection Service

In that regard, well, success is intentional. You can say, well, my untested architecture is not necessarily well-architected, so how do I go about testing these things? And as James said before, we've got this adage of there's no compression algorithm for experience. Unfortunately in my experience, experience comes in one of two ways. One is, oh no, I've done something wrong and it's all fallen to pieces.

Thumbnail 2420

The other is figuring out how you can gain that experience without the "everything's gone wrong" type of situation. The way we think about that in AWS is around chaos engineering and continuous resilience assessment. This is really useful from an architectural perspective because you can use this to take your learnings in terms of the five causes we talked about before. You can bump up against the edges of those types of scenarios and say, how do I start to build out a mental model of what happens to my application as I go through different conditions? Because no matter what you design, there will always be something that happens in your environment that you bump up against.

Thumbnail 2470

So you create your hypothesis, you run your experiment, you verify it, you make sure there are improvements in there, and you get back to a steady state. Once you've done your Well-Architected review and you've updated your architecture, you create a fresh set of hypotheses and go back to running that experiment again. And in that scenario, we like to think in terms of injecting entropy into your broader environments: simulating hardware failures, simulating things that can go wrong within your software, things like the scaling events that I talked about before, and pretty much anything that's capable of disrupting your services from that steady state.

Thumbnail 2500

Thumbnail 2510

And what that will do is that the chaos engineering will go hand in hand with your resilience testing, and it'll give you a bunch of outcomes that are really, really useful. And a great way of doing that is running game days. I'm a massive fan of game days, and they're composed of a few really critical features that help you think in terms of what the long-term success looks like.

Thumbnail 2530

Thumbnail 2550

Game days are reasoned. You have a set of criteria that you're setting out, so that's part of that hypothesis thing. You say, I want to prove that we can survive the loss of an availability zone, or that we know how to migrate from one region to another. That's the goal you're setting out ahead of you, and those are the things you're testing your overall architecture against. They have to be realistic, because the data you get on the back of them is what you want to inject back into your broader environments. You want to use those as the learnings, and then you want to be able to say, right, okay, we discovered a thing, and we've fixed that thing through an operational change, an architectural update, or by consuming a new service.

Thumbnail 2590

Those kinds of things will allow you to then say, right, we've got the right data on the back of this. Now we're providing the business value by making sure we're avoiding future troubles. And they need to be controlled as well, so all of the outputs that come from this, they need to be collected. You need to make sure that you are saying, right, okay, we've discovered through a bunch of traces in our observability that when a certain condition is met, the application just falls flat on its face and we can't continue to operate it. Take that data, inject it into your observability, inject it into your operational outcomes, but also do a Well-Architected review against the application with this in mind. Use the Well-Architected Lens tool and use that to say, right, okay, how can I focus in on an application that can survive this? What are the changes that we need to make to make it more survivable in a set of circumstances?

Thumbnail 2630

And they need to be regular; these things need to be frequent as well. Why do they need to be frequent? Well, one of my customers does a DR test every September, after everybody gets back from the summer vacation, and one of the challenges there is that everybody is back. So what they're not detecting is what happens in that DR test when a key person dependency isn't there. If somebody goes off on holiday in April, they don't know what happens without them, because they're only testing it in September.

Thumbnail 2690

That key person dependency means that if they test it more frequently, they will unearth those kinds of things. So that person could be sick, they could be on vacation, somebody could get hit by a lottery ticket and they're no longer in the business. You want to be able to pull in all of the right information so that your operations can survive those kinds of outcomes. And the center of a good game day as well is the means of injecting entropy. For me, that's using the Fault Injection Service from AWS, and there's a bunch of scenarios that you can use there to be able to say, right, okay, how do I test, push the boundaries of my infrastructure architecture, my network architecture, the application stack itself.

Thumbnail 2720

Recently, there have been a couple of updates to the Fault Injection Service. These updates are particularly interesting because they simulate gray failure scenarios. A gray failure is something that isn't necessarily obvious from the beginning. An ordinary failure can be quite binary: it works or it doesn't work. A gray failure is where you've got increased latency and the application is going slower. Traffic is building up in the background, you've got unhappy customers, and you have scale events because you can't get the resources fast enough, so you're spinning up more instances. There's going to be a cost aspect to that as well.

Using this kind of capability allows you to look at the mechanisms where that would happen—what your applications, what your infrastructure, and what your network will do. By seeing what that increased latency is within an availability zone, but also simulating it across availability zones, gives you insights that allow you to say, "Well, we didn't expect that when the database was called across availability zones in these circumstances that the response would be so slow and a huge chunk of our customer base won't be able to get access to the services they're trying to consume." Being able to manifest those capabilities early on means you're saving the business time and money in the long run.

Thumbnail 2800

The list of capabilities for the Fault Injection Service is quite extensive. I particularly like the power interruption scenario, which simulates the loss of an availability zone. You can say, "Well, what does my application look like when we lose one-third of our estate, and how do we recover from that?" How do we absorb the new resources in the other availability zones, or actually, how do the people in our teams respond to it? Do our automations do what we expect them to do? Are we finding the right sets of outcomes that we need as well?
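For a sense of what driving this from code can look like, here is a hedged sketch of a simple FIS experiment for a game day: stop the EC2 instances in one availability zone for ten minutes and watch how the workload behaves. It is not the managed power interruption scenario mentioned above, just a hand-rolled approximation of losing one AZ; the role ARN, alarm ARN, tags, and AZ name are illustrative assumptions.

```python
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken="game-day-az-loss-1",
    description="Game day: simulate loss of eu-west-1a for the web tier",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    # Safety net: stop the experiment automatically if the customer-facing alarm fires.
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:orders-completed-low",
    }],
    targets={
        "az-a-web-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"app": "web"},
            "filters": [{"path": "Placement.AvailabilityZone", "values": ["eu-west-1a"]}],
            "selectionMode": "ALL",
        }
    },
    actions={
        "stop-az-a": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "az-a-web-instances"},
            "parameters": {"startInstancesAfterDuration": "PT10M"},
        }
    },
    tags={"purpose": "game-day"},
)

# Kick off the experiment when the game day starts.
fis.start_experiment(experimentTemplateId=template["experimentTemplate"]["id"])
```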

Similarly, these other types of situations are really good. From a Kubernetes perspective, the EKS stress testing is really useful. You'll see what's going on with your different pods and how they respond. It'll also help you gain confidence for the use of services like Spot. Is anyone here familiar with Spot compute? There are a couple of hands going up. Spot is the excess compute available after on-demand has been consumed, and the way that you access it is at a vastly reduced price. But if we need it in the on-demand pool, you have a two-minute warning to say this is going away.

In conjunction with stateless workloads like a Kubernetes deployment or any containerized deployment, it gives you the chance to say, "Well, I can have a highly scalable environment at a much reduced cost." One of the nice things about having the confidence to see what happens to your pods when you lose an instance is that you're testing your resilience. Using things like Spot will help you from a cost perspective, but it'll also help you from a resilience and operational perspective. It helps you redesign the architectures of your applications as well, so you have that mental model of how we survive in all the different circumstances.
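The two-minute Spot warning mentioned above is delivered through the instance metadata service, and a small sketch of reacting to it (assuming IMDSv2) is shown below. The drain step is a placeholder for whatever your scheduler needs, for example cordoning a Kubernetes node so pods reschedule onto other capacity.

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254"

def imds_token() -> str:
    # IMDSv2 requires a session token before reading metadata.
    req = urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def spot_interruption_pending(token: str) -> bool:
    req = urllib.request.Request(
        f"{IMDS}/latest/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True             # path exists only once an interruption is scheduled
    except urllib.error.HTTPError:
        return False            # 404 means no interruption notice yet

def drain_this_node() -> None:
    # Placeholder: cordon the node, deregister from the load balancer, flush buffers, etc.
    print("Interruption notice received: draining workloads off this instance")

if __name__ == "__main__":
    token = imds_token()
    while not spot_interruption_pending(token):
        time.sleep(5)           # poll well within the two-minute window
    drain_this_node()
```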

Thumbnail 2910

Thumbnail 2920

Doing things like game days with chaos engineering or fault injection have a bunch of benefits. They help with that automation, they help you determine where things are going wrong, and where in your environment you can see improvements. Is there a process that's currently handled by humans that you can then say, "Actually, I can now automate this"? Like before with the observability, I didn't think we'd be able to automate observability. Now, with the use of AI, that's a much more meaningful set of outcomes and a much faster set of outcomes that you can achieve.

As a result, you improve that visibility and you get better insights. One of the big things I think about the improved visibility is that the traditional mental model of monitoring and observability tools is that it's a sunk cost—that you are spending money as part of an insurance policy in some regards. But what you can then say is, "Actually, we've taken the visibility angle of this and we're creating business insights. We're able to see what's going on within our broader environments." It helps with that improved recovery as well.

Going back to the game days and the chaos engineering, are there any golfers here? Does anyone play golf? There's one hand going up, a couple of others, yeah, a few hands. I'm from Scotland and the weather's terrible, and everybody's obsessed with golf, but I'm not. So forgive me if I get my pose wrong in a minute—I'll give you a demo of it.

There was a golfer in the 1960s who used to go around professional golf courses, and he would try and find the hardest shots. So he'd be in the bunker, and this is where my demo is not great, he'd be chipping out of the bunker onto the green, and he'd keep chipping until it would trickle into the hole. The apocryphal story is a guy came up to him and said, you're the luckiest golfer I've ever seen, you've done that five times in a row. And his response was, well actually, I found the more I practice, the luckier I get.

And I think that's the thing about the game days with regards to improved recovery. The more you practice, the luckier you get, and the more your ability to recover from different scenarios improves, the more resilient your application architecture becomes. One of the other nice sides of it is things like fault injection services provide you with compliance reporting as well, so you can actually say at the end of the day, here's the list of the things that we've tested, here's the list of the outcomes that we've delivered, and we can move forward with assurance that we know we can operate against a particular set of regulations or that our customers can say we meet the bar for their sets of criteria.

Thumbnail 3090

But in the midst of all of this, going back to the Well-Architected side of things, you have that opportunity to drive improvements as you go. You use those game days and you use the fault injections that you're putting into your environments to create learnings. You improve your design on the back of that, you measure the outcomes from those, and then you take it back to Well-Architected and do a review against those outcomes again. Then you start to run those game days, you start to continue with the learnings on the back of it as well.
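One way to close that loop programmatically is through the Well-Architected Tool API; the hedged sketch below records a milestone after a game day and pulls the per-pillar risk counts so the next round of improvements can be prioritized. The workload name and milestone name are illustrative assumptions.

```python
import boto3

wa = boto3.client("wellarchitected")

# Look up the workload that was exercised during the game day.
workloads = wa.list_workloads(WorkloadNamePrefix="checkout-service")["WorkloadSummaries"]
workload_id = workloads[0]["WorkloadId"]

# Snapshot the review as it stood when this game day ran.
wa.create_milestone(
    WorkloadId=workload_id,
    MilestoneName="post-game-day-2025-12",
    ClientRequestToken="post-game-day-2025-12",
)

# Pull the current risk counts per pillar from the core Well-Architected lens.
review = wa.get_lens_review(WorkloadId=workload_id, LensAlias="wellarchitected")["LensReview"]
for pillar in review["PillarReviewSummaries"]:
    print(pillar["PillarName"], "high-risk items:", pillar["RiskCounts"].get("HIGH", 0))
```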

Thumbnail 3120

I said earlier I'd provide a few QR codes. There are a couple there: the Well-Architected Framework and the Cloud Adoption Framework. The fault isolation boundaries one is a really good read. It breaks down the mental model that we have in AWS for how we build our regions, but also how we build our services in terms of how you can consume them. That comes from the perspective of global services, regional services, and zonal services, and how you build your architecture around those; applying the Well-Architected principles to how you consume them can make a really big difference in the long run.

The Builders' Library is a library of best practice advice from some of the most amazing people in Amazon. The work in there is fantastic, and I definitely advise reading through some of it. The Solutions Library is kind of a companion to that, with pre-existing code and examples for you to draw down from. And finally, there's a link through to the Fault Injection Service as well.

So I just want to close out now by saying thank you very much for coming along on a Monday morning, and yeah, really appreciate your time. Enjoy the rest of re:Invent. I hope it's a fantastic week for you. If you could provide some feedback in the app as well, the only way we make these things better is by your feedback. So if you can give us a rating in there and some constructive feedback, it'd be fantastic. Have a great week and see you around. James and I will be outside if you want to have a chat with us at all after this, so thank you. Bye.


This article is entirely auto-generated using Amazon Bedrock.
