🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Resilience of AWS Cloud: Design patterns for availability (ARC310)
In this video, Rob Charlton and Steve Martin from AWS Financial Services explain AWS resilience through a comprehensive framework they call the "resilience equation." They demonstrate how AWS regions are built with multiple Availability Zones (typically three or more per region, each with one or more data centers), separated by single-digit-millisecond round-trip times and connected by redundant high-speed interconnects and diverse power sources. The presenters detail AWS's deployment safety practices, including the automated pipeline process that deploys 50+ million changes annually through stages like One Box, single AZ, and gradual regional rollout with automatic rollback capabilities. They explain the Weekly Ops Metrics call, where all 200 service teams join and randomly selected teams undergo forensic examination by senior engineers, the Correction of Error process using Five Whys methodology, and the Operational Readiness Review that ensures services maintain at least 50% capacity in every AZ so they keep running if one AZ is lost. The session covers load shedding techniques to prevent congestive collapse during traffic surges and demonstrates how AWS remediated Log4j globally within 48 hours using these established mechanisms.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Rethinking Resilience Beyond PowerPoint
Hello everyone. Thank you for coming here for ARC310, Resilience of AWS Cloud. My name's Rob Charlton. I'm a Principal Technologist focused on resilience in the Financial Services business unit.
Hi, Rob, thank you. Thank you for coming as well. I know it's a little bit early, maybe it's very late, depending on your time zone. But my name's Stephen, Steve Martin. I'm a technologist at AWS and I run a team of security and compliance specialists, basically everything outside of the US for financial services. And we're delighted to be here today to talk to you about resiliency.
So we're going to talk about the resilience of AWS and many of you in the audience might already know quite a lot about resilience of AWS, like what a region is, but hopefully today, even if you know all that, you're going to learn something as well. So what we would say is, although you may know it, this technique that we're going to present today has proved very effective if you have to explain some of that to your stakeholders. So if someone on the board comes to say, well, how is AWS resilient? This technique works pretty well we found in terms of explaining that.
So, but first, a quick show of hands: how many people have had enough of PowerPoint this re:Invent, having been watching PowerPoint continuously? Right, good news, we are ditching the PowerPoint. Yes, no PowerPoints. So we're going old-fashioned: organic, gluten-free iPad and Apple Pencil.
The Resilience Equation: How Application Architecture Has Changed Failure Patterns
So we're going to talk about the resilience of AWS, but I'm going to start with a little bit of a digression on what resilience is and how we think about it. Now, we've found this very useful in terms of explaining resilience; it's what I call the resilience equation. I'm going to come back to this because, in talking to lots of customers, financial regulators, and other people around the world, I came to realize that often there's another resilience equation in mind when we're having resilience conversations. They're always saying, what about data centers burning down and undersea cables being cut? Absolutely, these are things that can happen and you need to be able to take account of them, but there's a lot more to resilience besides that.
So I call this the classic resilience equation, and it got me thinking, well why, why is that? Why is there this focus on a more traditional view of resilience? And here's my view. See if you agree with this, but I think it comes down to the fact that the way things fail has changed, and this is nothing to do with cloud whatsoever. This is a change in how we build applications that's been happening over the last 15 to 20 years.
So if you think about a traditional application that you might build for resilience, usually the way you'd build that in the past is in a monolithic style, so you have one giant application running as a monolith on some piece of hardware and then for resilience you might have another copy of it running tens of kilometers away in another data center. And the resilience model of this is very simple because if one of them fails, then the other one takes over.
But the way that we're building things now has changed. Nowadays teams are usually using Kubernetes or container technologies and building things in a microservices style, so instead of that monolith you've got this network of lots of little services all connected together and then all built on top of cloud services. But this has some important implications.
So let's think about what happens to an application when it goes wrong in both cases. So first off, let's look at the monolithic app and imagine we're plotting a graph here of traffic reaching this application, this could be any kind of business application, and here over 24 hours you can see there's a big peak in traffic in the middle of the day. Now imagine something's gone wrong. So sometime here maybe a certificate's expired, and because it's monolithic, an application has gone down completely and there's a two hour period where it's just off, no traffic being received. And then the engineers come in, they fix it.
Now this is a very, very easy model to understand and reason about, and the availability is quite simple to calculate. You can see we've got 22 hours of availability over the day there. But what happens in the new model? So now we've got our microservices app. Same, running the same business application, same traffic occurring during the day, and what happens when something fails here?
Now you can see this model is slightly different because around about here, something starts going wrong. But because it's not one big thing, because it's a network of microservices, maybe one microservice is failing or maybe there's just extra latency between a couple of services. The application is not 100% down, so a proportion of requests that are reaching that application start to fail. And then you can see a bit later, after a couple of hours, the team applies a fix, traffic starts increasing again, and then around here we're back up to full recovery again.
But already you can see something's a bit different with this graph now because you could say, well, at no point during that 24 hours was the app 100% down. So how do you even reason about up or down, and then what about availability? How do you calculate the availability now?
So one way you could do that is you could say, let's take the area here and divide that by the area of the rest of the graph, and then you get a calculation of success rate for that application. That relies on knowing what this dotted line is here, so you need to know what is the expected traffic hitting your application. But you can see, taking a step back, this is a lot more complicated to reason about. It's not so straightforward.
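As a rough illustration of that calculation, here is a minimal sketch (the hourly traffic numbers are invented, not from the talk) that computes a success rate by dividing observed successful requests by the expected traffic curve over the day:

```python
# Minimal sketch: availability as "goodput divided by expected traffic".
# The hourly request counts below are made up for illustration only.

expected = [100, 80, 60, 50, 60, 90, 150, 300, 500, 700, 850, 950,
            1000, 950, 900, 800, 700, 600, 450, 350, 250, 200, 150, 120]

# Observed successful requests: a partial degradation starts around hour 10
# and recovers by hour 15, so the app is never 100% down.
observed = list(expected)
for hour, factor in [(10, 0.7), (11, 0.4), (12, 0.3), (13, 0.6), (14, 0.9)]:
    observed[hour] = int(expected[hour] * factor)

success_rate = sum(observed) / sum(expected)
print(f"Success rate over the day: {success_rate:.2%}")
```

The hard part in practice is the dotted line: you need a trustworthy model of expected traffic, which is why this way of measuring depends on good forecasting or baselining.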
Then think about the team supporting this. They're going to have to have better observability skills, better tooling to be able to work out what is going on, to understand that it's this particular microservice that's having a problem. And then recovery might not be as simple as going into the data center and rebooting one of the servers or a rack. So my theory is that the way we build applications has changed, that changes the way things fail, and that means there are more boxes in my resilience equation.
A Comprehensive Framework: Infrastructure, Services, and Operations
So let's run through this. What I'd say is although I spent all that time saying that global infrastructure is not the main focus when it comes to the resilience of an application, I'm saying it's a sum of all of these things. And along the bottom row is all the bits that AWS does. It starts with global infrastructure, so that's what we're going to talk about today: data centers, cables, networking, racks, power, storage, where does all that go and how do we think about that when we're building it.
But application engineers building cloud services, building digital services running on cloud, are not using that infrastructure directly. They're not going into our data centers and putting things in there. They're using AWS services. But then what is an AWS service? An AWS service is a distributed software application that gives access to that infrastructure. So the engineers are building on services. Therefore, the way that we build and design those services, the way that our teams think about resilience, is also a key part of the resilience story.
This also helps you ask better questions about AWS resilience, because really you can ask questions about the infrastructure, but better to ask questions about what's the resilience of the services that are running on that infrastructure. But then lastly, those services are run by dedicated teams of people, single-threaded teams, and we'll talk a little bit about those teams as well today. But the way that those teams operate the service, the kind of practices and mechanisms that they have, the way that they work, is also a key part of the resilience story.
So that's the whole bottom row. That's all the stuff that AWS does on your behalf, and you don't need to do anything to enable that. That all just happens by default, but it's very important that you're able to understand and articulate that, particularly if someone on the board has come to you to say, well, how does this resilience work. Then being able to explain all of those boxes is a part of your resilience story.
But it's not all down to us. So the top row is all the bit that's your responsibility, and it starts with your architecture. So which AWS services have you picked? How are you composing them together? What choices have you made? Maybe you're working with an AWS solutions architect who's helping guide you with that. You might be using the AWS Well-Architected Framework, condensed knowledge of 19 years of running AWS and working with customers, compressed into a number of different pillars, one of which is the reliability pillar giving you guidance on how to build for resilience.
But as you just saw, because the way applications are built is changing, the way that you design your software is also important. So it's important that the software engineers in your teams are also part of this resilience conversation. They can't just leave it to the infrastructure team or the architects anymore. So we have some more guidance available. There's a resource called the Amazon Builders Library, where we've put a lot of our secret sauce in how to build distributed software applications over the last 20 years that we've been building AWS, and that's all available for you to use as well.
Then lastly, and I think most importantly, once you've deployed your applications into AWS, you have to keep them running and keep them resilient, and that's where your operations teams come into play. So what processes and practices do they operate? How are they deploying the software? How are they monitoring it? What incident response procedures have they got? When there is an incident and afterwards when they're doing the root cause analysis, how do they perform that and how do they extract the maximum amount of learning possible and then build that in as part of your resilience lifecycle. So what I would say is all of this goes towards making you resilient while you're running on AWS.
It's a shared responsibility model. The top row is down to you and your engineers, and we'll go into that as we go through the talk today. The bottom bit comes from AWS.
Risk Assessment: Environmental, Physical, and Software Threats
So before we get into the global infrastructure, let's talk a little bit about the risks that AWS thinks about. Thank you very much, Rob. Really useful to start thinking about risks and how AWS helps you address these risks. What we find is in a lot of the conversations that we're having with our customers, they're struggling with explaining how AWS is actually managing these risks. And then there's a lot of overfocus on areas that perhaps you don't need to focus on, which takes your resources away from what you should be doing, which is that top part of the equation. And so what we want to do is we want to help you today to understand how to articulate to people that you're perhaps working with or talk to your leadership. Some of you might be leaders here. How to really frame this so that you are really comfortable explaining why you're going to focus on that top piece of the equation rather than focusing on the infrastructure.
We see a lot of customers. We came from a financial services background, and so therefore there's a sort of governmental or regulatory kind of focus on these questions as well. So this is where this came from, and I think thinking about it from a risk point of view is very helpful. And so when we think about, let's say if we go back to this, where do we put our data centers when we build our region? We really sit there and we think about the kind of environmental risks that may come up. And I think there's a lot of people here from all parts of the world, so sadly not everywhere is immune to these things, right? And if you take India, for example, it sits in a seismic zone and you have to think about these things, and we actually invest in two different regions to handle that.
So you think about earthquakes. You can't always avoid those situations, but we build our data centers and we build our environments to handle the kind of pressures that might bring. Japan, for example, as well. We think about flooding. We think about flood plains. These are important things. You go buy a house, you think about, well, is there a river nearby? We think about these things too. We think about other environmental factors that we can't predict these days. The world's changing. You've seen what happened in Indonesia recently, which is sad, where a year's worth of flooding came in almost a week. So we think about those things as well.
And then we think about other things, right? We think about the physical threats that you're under. Now, a lot of us, especially someone my age, come from an environment where we actually used to build the data centers and run them ourselves and put the racks in and think about these more physical areas. And more and more, people or companies that are born in the cloud may not think about these things, but we do. So we insulate you as much as possible from servers crashing, fiber getting cut, fire suppression, cooling failures, power failures. And we do that by thinking about redundancy in the way that we build our data centers, and also by providing additional redundancy, which we'll get to when we explain how we put our Availability Zones together and how we build our regions.
And then we think about one of the things that I'm really passionate about at AWS, which is security, because I'm responsible for quite a bit of it with my customers and we really focus on security in our organization. We put security first. A lot of organizations struggle with that. So think about that top line of the equation and shifting security left. Well, that's something that we've really brought into our organization. We do security and resiliency. We've shifted as far left as possible in the organization, and so we think about malware, insider threat, DDoS. We have our DDoS prevention. So that's really important to us.
So again, we have a lot of material out there that explains how we've done these things. If you go and look at our security services, you'll see how we explain our DDoS protections, how we think about the culture of security, and how we've pushed it left. And then we think about software. Of course, when we design our software and think about how we build it, we're thinking all the time not just about the functionality that our customers ask us for; before we do anything, we make sure we've really thought about the kind of threats and issues that might happen within our software, and the kind of issues that you have with change. We're quite innovative in how we do change. The rate of change in this organization is enormous. When I tell customers how many changes we make, it's quite frightening, and I think they look at me and go white. I don't think they believe they could do that, but we've worked to be better at change and to deploy change at pace. Rob talked about traffic, but we look at traffic patterns and we design our systems to handle surges in traffic and to think about how we can manage any sort of additional capacity. So all of those things are put into it, and Rob's going to explain later how some of that works.
Building a Region: From Data Centers to Availability Zones
Yeah, so as we go through this, we'll pop back to this slide and we'll cross out the risks as we demonstrate how we've mitigated them. But Steve, let's talk about global infrastructure first. Yeah, I think it's a really important concept, and it's interesting. I don't know how many of you here have ever been in a presentation from AWS, from maybe a solution architect or someone like that, and they've got a little picture of data centers in a little region, and they skip through it and off they go. And then suddenly you're into all the fun stuff and services and what we do.
It's really interesting when you sit down and you think, what is actually in a region? How do we build it? A lot of your colleagues who may not be as directly involved with AWS as you are maybe don't know. You may know, and that's great, and I hope you do. And if you don't, you'll find this a really useful way of explaining it to them. But what we started with was: what do people actually know when they come to AWS and want to use it?
And remember the kind of companies that we're dealing with are kind of hybrid in nature. They'll have SaaS vendors, they'll have some level of cloud exposure, and they'll have a lot of legacy. So it's kind of a mix of things with the companies that we're dealing with. What we found was, no matter where you went in the organization, right up to the people that ran it, the board, the people that invested, they all knew they had some data centers because they paid for them, right?
And they generally knew where they were. And they generally knew that they had one data center where they ran workloads and another data center for backups. This is kind of the model they had in their heads, with some sort of synchronous or asynchronous replication maybe. And so they have a kind of standard picture where there's maybe a database or an application sitting in one of those data centers. So we started with this and said, well, what would it take to build a region?
So the first thing we kind of thought was, you might need some more stuff. You might need a few more data centers. And so Rob's very helpfully here drawing these out for me, right? So it's like magic, isn't it? And they're even square, very good, right? So we thought we need a lot more infrastructure if you're actually going to build a region. So we kind of like, well what do you actually need? So you know, here we go. This is one set of maybe data centers, and then we have another set over here in DC 2.
And then because we like to be massively redundant, we have another one down here in DC 3. Now for those that are guessing what these things could be, we'd kind of call those Availability Zones. So that's the kind of logical access that you have as a customer. You don't really get down to choose the data center. Some of our Availability Zones have quite a few data centers in them. There's at least one data center. OK, great.
And then, well, we've only got one connection here, and that's not very useful. So we'd build a lot of interconnects between these Availability Zones, high speed interconnects that would allow synchronous replication between them. And even within an Availability Zone, where you've got more than one data center, we would have a tremendous amount of interconnects between those data centers, so effectively there's minimal transit time. OK. So this is starting to shape up a bit more like a region.
Power Distribution and Geographic Separation Strategy
And then we'd sort of think about power. We like power. It's very useful. So we'd have power, is that those zigzag things? Power. Alright, OK. So typically what we try and do is we try and draw the power from different grids if we can, or at least different distribution points on the grid. Not everywhere has a separate provider in the grid. Different countries tend to have different approaches to this.
But where we can, we will try to draw that power from diverse sources, and even within those data centers we'll have other ways to power the data centers and the Availability Zones. Because we're managing the risk of disruption, we'll have generators and fuel tanks. You saw that in Spain recently: Spain and Portugal had a grid outage earlier in the year.
It's very unfortunate that, with their new kind of approach to solar, the grid became unstable. And because we have these things in place, we were able to run through it.
Also, how far apart? Yes, good question. We don't really think about distance. We think about what's the transit time, what's the return transit time. And what we find is a good ready reckoner for this is about one millisecond, I think, Rob. Single digit milliseconds. Yes, yes, single digit milliseconds. So, if people are familiar with infrastructure, that still leaves you with the ability to do synchronous replication between these Availability Zones.
So if you go and look at our literature, it'll talk about tens of kilometers kind of thing, depending on which country you're in. And then that allows you some flexibility about where you put things in terms of managing the environmental risks we talked about earlier, the floodplain risks or even seismic risks in some places. So if you go back there, our placement of the data centers and Availability Zones takes these things into account and tries to minimize their impact as much as possible. The risk isn't eliminated entirely; there's always a residual risk, but it's been reduced as far as possible.
Multi-AZ Application Architecture: Mitigating Physical Infrastructure Risks
Okay, so that looks a bit like a region, if anyone's looked at this before, right? And now we're going to sort of talk about using some of the AWS services like RDS, for example. Yeah, so what is that application that was running in the on-premise environment? What does that look like when it's in AWS? Well, I think it looks a bit like this, Rob. Tell me if I'm wrong. I think we have, we'll put a database into AZ1, go onto the console, launch our database, our Relational Database Service allows us to do that. And we'll probably replicate it over to AZ2. Okay, and we'll have the app replicated as well. The data's replicated so we'll have multiple instances of the app.
And I think we'll try and have another instance of it in AZ3, which might be a read replica. So the data's there if we need it. And that makes that application, if we think back to the original diagram that we looked at, where we just had two data centers and a traditional model, maybe some cluster running across from data center one to data center two, we now have something that's a lot more robust, right? And if we go back to those risks, we can knock off a few more, Rob, I think.
Yes, the thing to think about is how does this resilience work now? So if I put a load balancer in the middle of this, Steve, do you want to walk through how that works? Well, you know, if I'm a customer, a client wanting to access that app, instead of going directly into the initial app in Availability Zone 1, which you could do, if you use our load balancing service you'll go to that, it'll proxy out the workload, and it might be that you go to all three AZs, or just one of them. Right, but the load balancer will understand the availability of the other parts of the application.
And in the case that, for some reason, the server crashed, the data center caught fire, the Availability Zone disappeared in a cloud of smoke, your customers should be able to recover into AZ2 and AZ3. There might be some minor delays, there might be some minor operational adjustments to do that, but effectively it should be fairly seamless, and I think that's how it would work, Rob. Yeah, so that way of deploying an application, with one copy of the application in each Availability Zone, is what we'd advise you to do, particularly if it's something mission critical. Out of the box, that's taken care of the rest of these physical risks that we've got on here.
So if a server crashes in one AZ, you can fail over to another AZ. Disk crashes, a fire in one of the Availability Zones' data centers, cooling failure and power failure: we're also taking care of those. You know, it's funny, when I first started at AWS, having run a lot of infrastructure myself, I saw that one of our data centers had a fire and I was in a mass panic. I was like, oh my God, we've had a fire. But the customers weren't bothered by it. They're insulated from that. You know, we run our data centers very hot, and they're designed to do that.
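To make that pattern concrete, here is a minimal boto3 sketch of what Steve drew: a Multi-AZ RDS database, an Auto Scaling group spreading application copies across three Availability Zones, and an Application Load Balancer in front. All names, subnet IDs, and the launch template are placeholders, not values from the talk:

```python
import boto3

rds = boto3.client("rds")
autoscaling = boto3.client("autoscaling")
elbv2 = boto3.client("elbv2")

# Placeholder subnet IDs, one per Availability Zone.
subnets = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]

# Database with a synchronous standby in another AZ (Multi-AZ).
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",
    Engine="postgres",
    DBInstanceClass="db.r6g.large",
    AllocatedStorage=100,
    MasterUsername="appuser",
    MasterUserPassword="change-me",  # use Secrets Manager in real life
    MultiAZ=True,
)

# One copy of the application in each AZ, replaced automatically on failure.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="orders-app",
    LaunchTemplate={"LaunchTemplateName": "orders-app-template", "Version": "$Latest"},
    MinSize=3,
    MaxSize=6,
    DesiredCapacity=3,
    VPCZoneIdentifier=",".join(subnets),
)

# A load balancer spanning all three AZs routes around an impaired zone.
elbv2.create_load_balancer(
    Name="orders-alb",
    Subnets=subnets,
    Scheme="internet-facing",
    Type="application",
)
```

A target group and listener would still be needed to wire the load balancer to the instances; the point here is simply that the application copies live in different Availability Zones, so losing one zone doesn't take the workload down.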
AWS Service Design: Zonal vs Regional Services and Shared Responsibility
So yes, that's starting to look like quite an interesting design there, Rob, in terms of resiliency. So I want to talk a little bit now about the AWS services, because you remember we said that AWS service design is also an important part of the resilience story. So everything that I've drawn here, the thing that says app and RDS, so app is running on the EC2 service, so this is what we would call an EC2 instance, Elastic Compute Cloud instance. But these are what we would call resources, these are the deployed bits of a service that you, the customer, are looking after. But it's a good question to ask, well, what is the service?
What we're really talking about is that there's an EC2 API that you're calling, and that's what's deploying the virtual machines and running them for you. So there's another EC2 API over here and another EC2 API over here. Lo and behold, the EC2 service is itself a multi-AZ application. So the services that you're building on are relying on the same resilience benefits of that multi-AZ model. All the risks that we showed also apply, and the mitigations also apply to the AWS services.
Now, there's a very important distinction here which comes back to the shared responsibility model. If you notice, this EC2 service, Steve here when he was building this application, decided to put EC2 instances in each availability zone. Now if you wanted to, let's say you were running some huge payment processing system which is processing credit card payments for a third of the United States, you could put that onto one giant EC2 instance in one AZ if you wanted to. That's down to you. We would absolutely say please don't do that. You could do that, but please don't. Deploy it onto several hundred EC2 instances across multiple AZs, but that is on your side of the shared responsibility model.
There are other kinds of services though. Many of you may have seen in architecture diagrams, usually in the corner of the diagram, there'll be a picture of a bucket with the S3 service icon on there, and that might make you think, hang on, what is that? Is that a single point of failure? What's it doing down there? Is it in those AZs somewhere? It's not outside of the region or floating in space. So the S3 service is also, let me choose a better color for that, implemented as APIs that you're calling.
When you're calling put and get, there's an S3 API here, here, and down here. So by amazing coincidence, it's also a multi-AZ application. There's a difference here, because this application is one that AWS is managing on your behalf. You're not choosing, well, unless you use S3 One Zone, you're not choosing the availability zone where your data is being stored. You just give S3 your cat picture and it breaks it up into lots and lots of little pieces and then stores it very redundantly across the region on your behalf.
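You can see that distinction in the API calls themselves. A minimal sketch with placeholder identifiers (none of these IDs are from the talk): launching an EC2 instance makes you pick a subnet, and therefore an Availability Zone, while an S3 put has no AZ parameter at all because placement is managed for you.

```python
import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# Zonal: the subnet you choose pins this instance to one Availability Zone.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-aaaa1111",       # placeholder subnet in AZ1
)

# Regional: no AZ anywhere in the call; S3 stores the object redundantly
# across Availability Zones on your behalf (unless you opt into One Zone).
s3.put_object(Bucket="my-example-bucket", Key="cat.jpg", Body=b"...")
```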
So it's very important, going back to your application, that your engineers are aware of the kinds of services that they're using. Are they zonal services like EC2 or are they regional services like S3, because the resilience model changes depending on how they're working with that. So what we also want to cover, we've covered the global infrastructure, we've spoken about that. And we've spoken about AWS service design, how the services are designed and how they implement resilience. We also spend some time talking about AWS service operations.
Two Pizza Teams and the Operational Readiness Review
So the teams, the single-threaded teams behind the services, how do they work and how do they include resilience in their processes? This goes back to the risks that we talked about before. Absolutely, so one of the key risks that we haven't covered yet, we haven't covered bugs and we haven't covered change and deployment risk. So let's go back to that. What I want to do is imagine that there's a new AWS service team and they're building a new service. Let's run through the kind of lifecycle they run through and the process they would go through to build and deploy that service and get it out into production.
Now these services are run by single-threaded teams that we call two pizza teams. Many of you may have heard that terminology, but this is not a kind of sad and lonely UK one-person pizza. This is a massive, big and friendly American pizza, so we're talking about 12 to 14 people here. There might be Americans here, we have to be careful what we say. They're lovely pizzas. So these single-threaded teams, they run everything to do with that service. So they build the software, they deploy the software, they carry the pagers so that when something goes wrong they get paged, and they do the product management of the software as well.
They are all-encompassing, so a service like S3 will be composed of hundreds of different service teams. There will be all of those teams, single-threaded, running a component of that service. So with our new service being deployed, it all starts with something called the Operational Readiness Review. So the Operational Readiness Review is a gate that the team has to go through where we would check a number of axioms. So one of those axioms would be ensure that you have enough capacity in your service across all three availability zones so that the service will carry on running unimpeded if one availability zone is impaired. So they need to have at least 50% capacity in all three availability zones to be able to do that.
There are a number of these axiom checks in the Operational Readiness Review, and as I go through this, you'll see how that builds up over time. But this is a key part of how we maintain a high bar when new services come online because we're making sure that everything in that Operational Readiness Review includes all the things that we've learned over the years and how to make sure that service is built in the right way. Nowadays this is something that we go through every year, so it's an annual recheck to make sure that we're still maintaining that high bar.
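As a back-of-the-envelope version of that capacity axiom (the numbers are invented): if each of three Availability Zones holds at least 50% of peak capacity, losing any single AZ still leaves 100% of peak available.

```python
# Minimal sketch of the "at least 50% in every AZ" capacity axiom.

def survives_az_loss(capacity_per_az: dict, peak_load: float) -> bool:
    """True if losing any single AZ still leaves enough capacity for peak load."""
    total = sum(capacity_per_az.values())
    return all(total - c >= peak_load for c in capacity_per_az.values())

# Each AZ provisioned at 50% of peak: any two AZs together cover 100%.
print(survives_az_loss({"az1": 50, "az2": 50, "az3": 50}, peak_load=100))  # True

# Under-provisioned: losing az1 would leave only 80% of peak.
print(survives_az_loss({"az1": 50, "az2": 40, "az3": 40}, peak_load=100))  # False
```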
Deployment Pipelines: From One Box to Global Rollout
But eventually the team will be ready and they'll start building the software, so a lot of requirements will go onto the backlog and the team will be writing the code. But eventually we have to get to the point of deploying that software out. When we last published it, I think it was 2014, we were making 50 million deployments a year, and that's quite a long time ago now, so you can only imagine how many changes we're making per second as we're deploying this software out into the field.
So if you think about it, 38 AWS regions around the globe, 120 availability zones, over 200 services, and making 50 million deployments a year over 10 years ago, how can we keep that safe as we're making all of that change and your own mission-critical applications are running on top of that? The way that we do that is we do it very carefully and we use pipelines. We use automation extensively in that process.
So it all starts with a pre-production stage. Now that contains everything that you might expect in a pre-production deployment: code review with peers, performance testing, configuration testing in lots of early stage environments, environments which mirror the production environment and all its dependencies as well. But eventually, once you've got through those pre-production checks, and this is all done automatically as well, another thing to bear in mind is these changes that we're talking about, 50 million changes a year, these are not massive new feature changes. These are typically very, very small changes, just a few lines of code where the developer's made that small change, committed it to Git, and at that point, the automated tools take care of everything from that point onwards.
So now we start the production deployment pipeline, and it starts with a stage that we call One Box. So this very, very small change will be deployed to typically one EC2 instance in one availability zone in one region around the world, and then we wait. We wait for a period called the bake time, and the reason we do that is we're trying to wait for any latent bugs or hidden bugs to emerge during that period, so we'll be testing it at that point. It'll be receiving live customer traffic, this one instance. If there's not enough customer traffic, we'll send it synthetic traffic as well.
If there's any problem detected, so if any of the monitoring is off, then that change will be rolled back. We're very careful to ensure rollback safety, so the change will be rolled back automatically by the tooling and it will be sent back to the developer. But assuming we get through that and it passes, then the change is rolled out to one whole availability zone, so all the EC2 instances in one AZ for that service will get this new copy of the software. And then we wait again. We wait again for the bake time to check to see if anything shows up.
As a part of the deployment, what we usually do is we ramp the traffic up, starting at zero to this new deployment to 50%, 100%, and then on to 150%. The reason we do that is to make sure that the software itself has been built so it can take one extra availability zone's worth of traffic when it gets deployed, so it always goes up to 150%. Assuming we go through that, then the change, the small change, will be deployed out to one whole region. So one region somewhere in the world, all the EC2 instances in three availability zones in that region, and then we wait again for the bake time.
There's a lot of waiting involved here. If any problem is detected, it'll be rolled back. But assuming it's fine, then the whole process starts again, but usually we would double up at that point, so we would go through One Box in two regions, and then one availability zone in two regions, and then doubling up again, and eventually that change will have made its way out to all 38 regions around the world.
But just imagine there's a problem. We get to region 37 out of 38, this change has been absolutely fine, and some customer request comes in and it exposes a bug and it's picked up by the monitoring. Purely automatically, that change will be unpicked and rolled back from all the EC2 instances in 37 regions.
If the change fails in any of these regions, it will roll back to the last known good version and completely unpick from everywhere around the globe, and then it's sent back to the developer. Then the whole thing has to start again. They have to go through, make a change, go through the code review, and start it again. So you can see we're being extremely careful about this process because we know how important it is to get this right.
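Putting those stages together, here is a heavily simplified sketch of that wave-based rollout with automatic rollback. The stage names follow the talk; the bake time, monitoring hook, and rollback function are placeholders invented for illustration, not AWS-internal tooling:

```python
import time

# Deployment waves, roughly as described in the talk. In reality each later
# wave repeats its own one-box and one-AZ stages; this list is simplified.
WAVES = [
    "one box (1 instance, 1 AZ, 1 region)",
    "one AZ (1 region)",
    "one region",
    "2 regions",
    "4 regions",
    "8 regions",
    "remaining regions",
]

BAKE_TIME_SECONDS = 3600  # placeholder; real bake times vary per stage


def metrics_healthy(wave: str) -> bool:
    """Placeholder for alarm checks against live and synthetic traffic."""
    return True  # replace with real monitoring queries


def roll_back(completed_waves: list) -> None:
    """Placeholder: restore the last known good version everywhere deployed so far."""


def deploy(change_id: str) -> bool:
    completed = []
    for wave in WAVES:
        print(f"{change_id}: deploying to {wave}")
        completed.append(wave)
        time.sleep(BAKE_TIME_SECONDS)   # wait for latent bugs to surface
        if not metrics_healthy(wave):
            print(f"{change_id}: alarm during {wave}, rolling back everywhere")
            roll_back(completed)        # unpick the change from every wave so far
            return False                # the change goes back to the developer
    return True                         # fully deployed worldwide
```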
Production Operations: Metrics, Correction of Error, and the Five Whys
At that point, the change is now deployed out, and that's all been done by automated tooling. So at this point, we're now running in production, and that same two pizza team is responsible for looking after that. Now there's a very useful reinforcing factor here in that if I've deployed a change, I made a change myself and I've deployed it and it's gone out to the field, and now I'm carrying the pager, then if that goes wrong and I get paged at 4 a.m. on Sunday to say your change has gone wrong, that's a very useful reinforcing factor to make sure that I'm very careful in future about making that change.
But now the changes are in production and we're monitoring it, and I'll come back to monitoring in a second, but we have a huge number of metrics. You can say that we're metrics obsessed. So if you look in some of our internal pages where we have the dashboard for S3, say, you will see pages and pages and pages of metrics covering every aspect of that service, starting with some above the line metrics on service health, so telling us is this service healthy, is it all up and running as expected. And then lots of diagnostic metrics below that, going down to the P99.9 latency in the Sydney region for some small aspect of the service. I'll come back to why there's so many in a minute.
So those metrics have all got several alert thresholds on there. If those alert thresholds are crossed and the right conditions are met, then someone in the team is going to get paged and then they have to go and address that issue. There's two more important aspects I wanted to cover here. So although we do try very hard to try and eliminate as many problems as possible and you've seen how careful we are, bugs will occur and sometimes they will get through this process.
So when we have an incident, then we run through a process called Correction of Error. Now it'd be very unusual if any of you in the enterprises where you work don't have a root cause analysis process. It's very common, but we're very, very particular about the root cause analysis process which we run through called Correction of Error. Now this runs all the way through Amazon. If you go to the Correction of Error tool, you'll find that there's not only things to do with AWS but also roof blown off fulfillment center in Florida during a storm, and I put decaffeinated coffee into the caffeinated coffee pot, which is quite a serious incident, so everything like that has a Correction of Error inside Amazon.
Now the reason we're very particular is there's a very set structure to this, so we ask very, very particular questions. So it starts with, how did you detect the problem? So was it detected by a customer? Now that's very rare nowadays, and if it was detected by a customer, then immediately the answer to that is, well we need some monitoring in place for that particular aspect that's just been shown up. So you can start to see more monitoring and alerting comes into play.
But more typically it'll be one of our alerts picked it up and we found out in 30 minutes that something was going wrong. But then it asks, as a thought experiment, how could you cut that time to detect by half? And that probably means lowering an alert threshold or adding again some new monitoring lines, so you start to see how we get towards having so many metrics on those monitoring charts.
And then we ask, well, how did you resolve the issue? What did you do, what kind of run book did you use to do it? And then also as a thought experiment, how could you reduce the blast radius of that problem by 50%? And that might mean reducing the number of customers impacted, reducing the time it took to recover from that. So we'd then be building actions out of the back of that Correction of Error process which would go on to the team's backlog for them to do.
This process is tracked by our senior principal engineers, some of the most tenured engineers in AWS, to make sure that these actions are followed up on, and the team then goes on to build it. But we also have a section where we want to get to the absolute root cause, and this section is called the Five Whys. If any of you have kids, if you've got a four year old, you'd be very familiar with them just asking, well why?
So this process works to try and get to an absolute actionable root cause. So we might say, why did the service fall over? Because Rob ran a script with the wrong parameter. Well why did Rob run a script with the wrong parameter? Was it an issue where he hadn't gone through the training, or well why wasn't it automated?
So why was the wrong parameter accepted? Well, we had a backlog task to do that, but we prioritized something else. So we'll just keep going, keep asking why until we get to some actionable, blame-free actions off the back of that. We don't think it's acceptable that the root cause of this problem was Rob. We want to get to something deeper that we can fix.
Now, from time to time we will also run through Correction of Errors and look for any patterns within that. So we have a group called the Deployment Safety Working Group who will run through and look to see if there are any trends or patterns in there where we're encountering the same problem multiple times. And then what they will do is say, okay, well, what can we learn from that? And there's usually one of two things. So either some new best practices, so this is how things feed into the axioms in the Operational Readiness Reviews, so we've learned something from that. Or there's going to be some new automations in our pipelines, so they might add some stages to the pipeline where we're going to build in some new automated checks to try and prevent that kind of thing in the future.
Weekly Ops Metrics and Load Shedding: Learning Culture and Traffic Management
Then the last part of this is what I would say is really the beating heart of AWS. This is the machine that makes the whole thing go round, and this is a call called the Weekly Ops Metrics call. This is a critical inspection and learning and teaching event that happens every week inside AWS. So what we do is we get the leaders and the team from all 200 service teams to join this call. The AWS senior technical leadership will join and run the call as well, those senior principal engineers, the most tenured engineers will be on there.
We run through a number of stages in there, and it starts with quick wins. So any quick wins from the previous week. The reason we do that is, let's say the S3 team have found some new technique to reclaim unused storage and they freed up several petabytes of storage. So it's good that we kind of celebrate those wins, but also the other teams on the call will say, ah, I can use that and I'll put that into my service as well, so it's a great way of sharing that knowledge amongst the team.
We run through any very informative Correction of Errors from the previous week as well. So if there's been a service incident, we'll run through that, and those tenured engineers will go through and say, well, what can we learn from that? What's the maximum amount of knowledge we can get from those events? And then we have a section called Spin the Wheel. Now when this was done face to face, there used to be a physical wheel that they would spin, and this wheel has all 200 services written on there.
We spin the wheel twice and pick two service teams, and if you're chosen, you have to step up to the mic and present your metrics from the previous week. Now, you'd be subjected to a forensic examination by the most senior engineers in AWS as to what happens with your metrics. So you can imagine that you're going through your pages of metrics. And let's say we're looking at the P99 latency in the Sydney region.
It went upside down because it's Sydney, and so we have a graph. And let's say it looks a bit like this. Then they'll say, well, what is that spike? What does that mean? Have you investigated that? And this is where it becomes a teaching event as well, because they'll say things like, ah, well, we've seen a pattern like this when we were in the early days of DynamoDB and we learned these things. So maybe you want to consider doing this in your service or checking this or introducing some new processes and practices.
Or maybe they'll be looking at your alert thresholds. So a common one is, let's say you've got a service and, let's get rid of this big bump, let's say it looks like this, and then the alert threshold is up here. So what's that telling you? Well, it's looking like the alert threshold is too high, so maybe you've just set that alert threshold at the wrong level because if there's a problem with that service, then it's not going to cross the threshold.
It could be a couple of things. Maybe you've just set it in the wrong way, you haven't been following our best practices, or it could be that in the previous week or earlier you've made some performance improvements. You've brought down whatever that thing is measuring, but you haven't yet changed the alert threshold. So they'll say, well, come on, you need to bring that down and make that a bit tighter. So it's this kind of learning which takes place during that exercise, and it was a great way of allowing the junior teams to learn from those senior tenured engineers and build that learning into their operations.
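In CloudWatch terms, tightening that kind of threshold is a single API call. A hedged sketch, with an invented metric, namespace, and values rather than anything AWS-internal:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical p99 latency alarm for one API in one region. If a performance
# improvement has pulled the metric well below the old threshold, the alarm
# should be tightened so a real regression still trips it.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-p99-latency-sydney",
    Namespace="ExampleService",
    MetricName="RequestLatency",
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=250.0,               # milliseconds; lowered from an older 800 ms
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # missing data is itself a signal
    AlarmActions=["arn:aws:sns:ap-southeast-2:123456789012:oncall-page"],
)
```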
There will be a number of actions which come out of this, including actions from the ops metrics and actions from the Correction of Error process, and those will go onto that team backlog. Then they'll build it into the service, and that's how the cycle continues.
It's also worth noting at this point that the things on that backlog, the features on that backlog, are all put there, well about 80 to 90% of it comes from you. You might find that your account teams that are working with you might bring some service teams to you to talk from time to time. The reason they're doing that is they're trying to find out what it is that you need, what changes you need. If you're able to harness that and work with those teams, then that's a great way to get the features that you need into AWS because that goes directly onto that backlog, and then those features go into AWS and get deployed.
So what have we learned now? If we go back to our board of risks, I wouldn't say that we just eliminate bugs completely, but we think we've got pretty good at eliminating them through all the pre-production checks and the way that we do things. We've also covered change and deployment risks. I've got a little bit of time, so I might also cover surges in traffic.
Now, to deal with surges in traffic, let's imagine that you've opened a pizza restaurant. This pizza restaurant has got a great feature that you can go to the restaurant and ask for a pizza and you get a pizza in one minute. So you've just launched this restaurant, it's doing okay, and then you find that Taylor Swift has tweeted about this restaurant and the news has gone out and suddenly there's a massive queue of people coming up to your restaurant. There are a few interesting aspects about the way that this works, which means dealing with a surge in traffic like that can be quite tricky.
Let's say this is the number of customers coming in and then time to pizza. Right at the start, we're getting there, we're doing okay, time to pizza is pretty low. But what tends to happen is this is going to go up and up and up and up, and then any system, whether it's a pizza restaurant or a digital service that you're running, after a while things start to take too long. In the pizza restaurant, the oven's going to be full of pizza, there's going to be mozzarella on the floor, the chef is having a bad day. In the digital service, the database is going to be getting slower and slower, the disks are going to have much more contention, the network's getting contention, and eventually the service starts to slow down.
But there are two important things to think about here. The first is your SLA with your customers. Remember you're getting a pizza in a minute, so there's some kind of SLA. At this point, they just start giving up. You kind of breach the SLA and then the customer gives up. Now, an interesting aspect here with digital services is, I mean this would be unusual in a pizza restaurant, but what happens is you put your request in for a pizza and it's taken longer than acceptable, maybe you've given up after two minutes. So that's what we call a timeout.
At that point you're going to go away, hungry, without your pizza. But the team is still making your pizza, it's still going into the oven, so there's no way for you to tell them, actually I don't want that anymore, because the system is busy. That's kind of compounding the problem because you've gone and asked for a pizza and then it's still being made and now you've gone away again. That's a bit of a problem, but if you've dealt with digital services, you'll know what will happen at that point. What happens is you do a retry.
Rather than just going away and going to get McDonald's or something like that, you're going to come back and put in another order for a pizza. So now as this traffic is going up, we're compounding the problem because we had our timeout value here, and we start retrying, so you're going to ask for another pizza, so you've now got two pizza orders in there. You can just see this gets worse and worse and worse, and you enter a stage which we call overload. You get into congestive collapse, where rather than getting a pizza every minute, it gets slower and slower and slower, and now we're not able to make any pizzas whatsoever.
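A quick way to see why retries compound the problem: if a fraction of attempts time out and every timed-out attempt is retried, the backend sees more work than customers actually have. A small worked sketch with invented numbers:

```python
# Offered-load multiplier when clients retry on timeout.
# With up to `attempts` total tries and a per-attempt timeout fraction f,
# each original request generates roughly 1 + f + f^2 + ... attempts.

def load_multiplier(timeout_fraction: float, attempts: int = 3) -> float:
    return sum(timeout_fraction ** k for k in range(attempts))

for f in (0.1, 0.5, 0.9):
    print(f"timeout fraction {f:.0%}: backend sees {load_multiplier(f):.2f}x the demand")
```

The more overloaded the service gets, the larger the timeout fraction, so each customer generates more attempts, which is exactly the feedback loop behind congestive collapse.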
This is the situation which we want to try and avoid. Now, what we would say is that before opening your pizza restaurant, you would want to do an awful lot of performance testing to identify the maximum number of pizzas you can make, and what happens when Taylor Swift tweets and you get a backlog of work that you just cannot cope with. So how do we think about that?
So what we could do is we could turn this around and do another graph now. I'm going to introduce a couple of terms. Throughput is the number of customers coming in and asking for a pizza, and goodput is the number of customers who actually walk away successfully with a pizza in a reasonable amount of time. Now we can draw the overload situation again in a new way by saying what happens here. So as we start off, throughput is the same as goodput because we're not too busy at this point. People are getting their pizzas, it's fine. But as we enter that state of congestive collapse, this starts to tail off and we're plateauing here, and eventually we reach our overload condition where it just gets so busy we're no longer able to make any pizzas, and that's a bit of a disaster.
What we want to do is we would like to extend this a bit further as much as possible. So what I'm going to talk about here is one technique that we use inside AWS. This is not the only technique, it's a technique that you can use as well. There are articles on this in the Amazon Builders' Library that you can take away, and this technique is called load shedding. So a very simple idea. What we want to do is in this period here, we were successfully serving customers and making pizzas. We want to protect that as much as possible. What we want to avoid is this dropping back to zero.
The way we can do that is to say here we're going to start throwing work away. So we know the number of pizzas because we did our performance testing. We know the number of pizzas that we can make. So beyond that point, we're just going to tell customers, no, sorry, we're full, go away. But we're going to protect the fact that we can make some pizzas, and the effect that this has is we can extend this line so we're still able to serve a fixed number of customers and carry on. And that buys us vital time because what we can do now is we've got time, we know we're in an overload situation, we can bring on some more capacity. We can kind of wheel in some more pizza ovens, hire some chefs, get some more mozzarella and tomato ordered, and we buy ourselves extra time.
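Here is a minimal sketch of that idea (the capacity limit would come from your own performance testing; nothing here is an AWS-internal implementation): track in-flight work and reject anything beyond tested capacity immediately, so the work you do accept still finishes within the SLA.

```python
import threading


class LoadShedder:
    """Reject work beyond a tested concurrency limit instead of queueing it."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight  # from performance testing
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False                # shed: fail fast, protect goodput
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1


shedder = LoadShedder(max_in_flight=200)    # "how many pizzas can we make?"


def handle_request(make_pizza):
    if not shedder.try_acquire():
        return 503, "Sorry, we're full right now"  # cheap rejection, no oven time
    try:
        return 200, make_pizza()
    finally:
        shedder.release()
```

The rejection path has to stay much cheaper than doing the real work; in a web service it would typically be an immediate 503 or throttling response rather than a queued request.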
Now unfortunately, it's not a permanent solution, apart from the fact we're turning work away. Eventually, when so much traffic arrives, the cost of shedding that load, the cost of telling people you've got to go away, also overwhelms the service. So what you'd usually find is at some point this is also going to drop to zero. So it's not a permanent solution, but what this means is if you're in that situation of overload, you've now got a mechanism to protect yourself and buy valuable extra time to be able to recover from that. Now obviously the right answer here is that in advance you'd have tested to beyond the Taylor Swift tweeting point and you'd be able to recover your pizza restaurant without doing this, but this is a vital technique to make sure that you're protecting the work that you're able to do.
Rapid Response to Security Threats and Closing Thoughts
So we covered one more bonus risk. Let me talk about security risks a little bit. Yeah, go on. So I learned something that was really good there, thank you, about pizzas. And I didn't know you were a Taylor Swift fan. Okay, so we talked about the deployment pipelines, how we can deploy at scale, and how carefully we do it. But what if we don't want to do that? What if we want to do something else and speed it up? Right, what if we've got a situation that we need to deal with?
So I'm sure all of you had, I guess a lot of you had fun with Log4j. Yeah, exactly right. I've talked to a lot of my customers in this space, and I think what they really find quite interesting is how do you deal with that? How do you deal with situations like that? So actually what we can do is go back to those pipelines, because they're well formed and well structured, and we have clear mechanisms (we use the word mechanisms at AWS, we don't use the word process, right?) for how we're going to deploy safely. So in the case where you need to respond to a situation fast, we can speed it up. Okay, it's a balance. Right, you know, all the teams are on it.
There's a really good YouTube video where we go through this, and you can look it up. Basically, where customers are very careful themselves but haven't got their pipelines quite the way they want, it's taken them months to remediate these sorts of issues, so they're still at risk for quite a long period of time. We turned Log4j around in 48 hours, globally. Which is amazing, right, when you think about it.
So when you're talking to people, if you think about why we came here today, what we wanted to do today is give you some insights into AWS about how we build things, why we build them the way we do, how we do deployment, and the care we take to think about the various risks that we're under, whether it's the API calls or the pizza shop you were talking about. He likes pizza, who knew, right? We wanted to help you understand how to think about this so that you can explain it to people. That's really important to us, and we're actually spending a lot of time in our own industry, financial services, which is a highly regulated industry and has to think about all these things.
We're spending a lot of time helping everybody to kind of come up with a common model in their thinking, so we're trying to evangelize these sorts of thoughts. It seems pretty simple, but it's a good framing for a lot of these things. And the other thing is there are always going to be risks that we haven't even thought about. So the fact that we've thought about a lot of risks and we've built these capabilities, you've got that two pizza team working together in the way that we discussed, and then you've got the bigger sort of coming together in our operational weekly meetings, right, which really is a spin the wheel exercise.
I was quite shocked when I first saw it, but it's very much a no blame culture, get to the bottom of things as Rob said. This is why when we get into this and we explain to everybody why we think about things in a risk framework and how we tick them all off, then when it gets into these issues, when we have security issues, we're able to go and rectify those very, very quickly.
Excellent. So thank you very much for your time today. Please make sure you fill out the session survey in the mobile app. One last call out for the resilience meetup at the Caesar's Forum meetups today at 4 o'clock, I believe, or maybe 5 o'clock. I think it's 4 o'clock. Come with your most difficult resilience questions, anything that you want to know about AWS resilience, your own resilience service events. And there's a team of experts on hand to answer that. You head to the meetup section at Caesar's Forum. Thank you very much. Thank you so much for coming. Good job, Rob.
; This article is entirely auto-generated using Amazon Bedrock.