Kazuya

AWS re:Invent 2025 - Building at global scale: Engineering AWS expansion (ARC312)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Building at global scale: Engineering AWS expansion (ARC312)

In this video, AWS principal engineer Don McGarry and principal technical program manager Michelle Thissen explain how AWS builds regions, availability zones, and local zones. They contrast the early manual "Region Bootstrap Ninja" approach used for São Paulo in 2010 with today's automated process that uses existing regions as bootstrap environments to parallelize physical and software builds. Key technical details include breaking circular dependencies between EC2, S3, and DynamoDB using the "Franken instance" and intelligent proxies, implementing static stability principles so services recover gracefully without manual intervention, and managing hundreds of interdependent services through batched deployments. They emphasize game day testing in a dedicated test region, the COE (Correction of Errors) process for learning from incidents, infrastructure as code practices, and maintaining a healthy escalation culture to meet customer commitments for new region launches.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to AWS Global Infrastructure and Partitions

Welcome to ARC312. We're going to talk about how AWS builds regions and hopefully what that means for you. My name is Don McGarry. I'm a principal engineer in EC2, and my name is Michelle Thissen. I'm a principal technical program manager also with EC2. So before we kick it off, let's do a quick overview on what we're actually going to be talking about today.

Thumbnail 30

We'll start with a very quick recap on AWS global infrastructure concepts. We'll talk about how we build for resiliency and how you can do the same. We'll take you on a journey back in time on how we used to do builds and then transition to how we do things today because things have changed a lot in the past decade. We'll then talk about availability zones and local zones and then talk in a lot more detail on dependencies since those are the primary problem when it comes to builds, and then we'll leave you with some key takeaways.

Thumbnail 60

Thumbnail 70

So first, let's talk about our AWS global infrastructure. This map shows all of the AWS regions we have. The ones in green are the ones that have already been launched, and the ones in red are the ones we're currently working on, and those include the EU sovereign cloud, Chile, and the Kingdom of Saudi Arabia. So overall right now we have 38 regions and we're continuing to add to those as we go, including more availability zones and other expansion options like local zones.

So this is our local zone map. We have 38 metro areas at the moment and we're expanding heavily in that space as well because there's a lot of customer interest in leveraging local zones, but we'll dive more into that later. So next up, let's dive a little bit deeper on how you can think about our global infrastructure. The highest level of abstraction in AWS's global infrastructure is a partition. Most customers don't necessarily need to think in terms of partitions because most customers are working in what we would call the commercial partition.

If you look at an ARN, it reads arn:aws:... and that aws is the partition, so the commercial partition is the aws partition. That's where most customers live, but we've had multiple partitions for quite some time. There's the US GovCloud partition, aws-us-gov. There's the China partition, aws-cn, and the reason we wanted to call this out in this talk is the upcoming EU sovereign cloud is going to launch, and that will be its own partition. So a lot more customers will be having to deal with the partition abstraction than before.

The difference between partitions is that partitions have an instance of our global services in them. Most notably, IAM and Route 53 are the two global services that will come to mind when you need to work across multiple partitions. What that means is an IAM identity, which includes an account number in the commercial partition, has no concept of that identity in another partition because they're two separate stacks. So if you need to migrate an application across or build it in a new partition, things like IAM principals and account numbers will be different because those are two different software stacks.
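As a quick illustration, the partition shows up as the second component of every ARN. Below is a small Python sketch with made-up account IDs and resource names, just to show how the partitions mentioned above appear in ARNs.

```python
# Sketch: extracting the partition from an ARN.
# ARN format: arn:<partition>:<service>:<region>:<account-id>:<resource>
def arn_partition(arn: str) -> str:
    return arn.split(":")[1]

examples = [
    "arn:aws:iam::111122223333:role/MyAppRole",              # commercial partition (placeholder account/role)
    "arn:aws-us-gov:s3:::my-govcloud-bucket",                # US GovCloud partition (placeholder bucket)
    "arn:aws-cn:ec2:cn-north-1:111122223333:instance/i-0abc",  # China partition (placeholder instance)
]
for arn in examples:
    print(arn_partition(arn), "->", arn)
```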

Thumbnail 220

Thumbnail 230

Thumbnail 240

Understanding Regions, Availability Zones, and Local Zones for Resilient Architecture

Outside of the global services, partitions are just groupings of one or more regions, and each region is contained within one and only one partition. Now we'll dig a little deeper into regions, availability zones, and local zones. So a region is just a geographic area where we offer our services. Regions are composed of three or more availability zones, and as a customer, you can think of an availability zone as a data center. In reality, there can be tens, or in some cases even hundreds, of data centers that make up an individual availability zone.

Those data centers within an availability zone are very close to each other, both from a network latency perspective and from a shared fate perspective. So each data center within that zone can be thought of as sharing fate from an availability perspective. But when we pick where we want our availability zones to be in a region, we want them to be close enough to each other to maintain single-digit millisecond latency between those availability zones, but far enough away that they won't share fate with each other from things like natural disasters or power events.

So our regions are architected such that if you spread your workload across multiple availability zones, it's unlikely to have shared fate in an outage across those availability zones, and we extend that into our software stack as well, which we'll talk a little bit more about.

Thumbnail 310

We extended the idea of an availability zone to this thing called a local zone. You can think of a local zone as an extension of an availability zone: a local zone is parented to a single AZ within a region, but that local zone can be a pretty far distance away from a latency perspective. What that means is that the control plane that launches instances in that local zone lives in the parent availability zone, but the capacity that's running your workload is in a metro area closer to your customers, or, in some machine learning cases, in an area where we're able to acquire a lot more power to run inference or training workloads. We've used the availability zone concept to essentially stretch capacity in an AZ to either bring it closer to your customers' workloads or to provide additional capacity for things like machine learning.

Michelle will dig a little bit more into how these concepts apply when you're thinking about them in your architecture. Let's talk about how we can build for resiliency. We've talked about regions and availability zones, so let's put this into an example here. We have a region, we've selected two availability zones, and our VPC spans those two availability zones. What we want to explain here is that understanding the scope of the services that you're using is really important. In this architecture diagram we have our load balancer, an auto scaling group, and our EC2 instances. What you'll see is that our EC2 instances are in specific availability zones, and that's because when you launch an EC2 instance you specify the AZ, so this is a zonal service.

Services such as S3, CloudWatch, and DynamoDB, by contrast, are regional services, so they exist at the regional level. When you create an S3 bucket and put objects into that bucket, you do it at the region level. You don't specify an AZ. Understanding this is important because you can plan your operational responses accordingly. You can look at what metrics are important to your business and then make business decisions based off of that, for example, knowing when to scale up your EC2 instances or when you need to fail over in the event of a disaster recovery scenario.
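To make the zonal-versus-regional distinction concrete, here is a minimal boto3 sketch. The AMI, subnet, bucket, and table names are placeholders, and this is an illustration rather than a recommended deployment pattern.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")
ddb = boto3.client("dynamodb", region_name="us-east-1")

# Zonal: the subnet pins the instance to a single availability zone.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",    # placeholder subnet in one AZ
)

# Regional: no availability zone appears anywhere in these calls.
s3.create_bucket(Bucket="my-example-bucket")
ddb.create_table(
    TableName="my-example-table",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```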

Thumbnail 460

Thumbnail 490

For the majority of our customers, leveraging multiple availability zones within a region is sufficient, as Don mentioned. However, there are specific use cases where a multi-region strategy might make sense, but that does come with some additional complexity in managing DNS and costs if you're replicating your entire environment in a different region. Hopefully this is a good introduction to some global infrastructure concepts, framed with resiliency in mind. Let's actually jump into how we build new regions.

The Evolution of Region Builds: From John the Bootstrap Ninja to Modern Challenges

Region builds have changed a ton over the last decade. Even in the last few years, things have changed significantly. However, there are two primary work streams that have remained fairly consistent. The first is the physical. We talk about the elusive AWS cloud, but at the end of the day, it's still servers in data centers, and the physical work stream consists of everything from picking proposed sites to looking at expansion options. When we first build a region, we have three availability zones, but in the future we might want to add additional availability zones, so we have to keep that in mind when planning and picking our locations because, as Don mentioned, we have requirements on how far apart the data centers can be. We need to consider all of that when picking the sites.

Thumbnail 560

Thumbnail 590

Then we have to construct the data centers, deliver and cable all of the racks, and then connect everything to the AWS network. The second work stream is of course the software build, and we're going to spend most of our time in this work stream. This involves everything from building out the most foundational things, such as making sure that services can authenticate, all the way out to building all of the AWS services in the new region. We've always had these two work streams, but the way we go about builds has actually changed a lot. Let's take it back to 2010 when we were building the first region in South America. In 2010, AWS announced the launch of the São Paulo region. It was our eighth region. We were the first major cloud provider to launch a region on the South American continent. AWS looked a lot different back then and the way that we operated was quite different as well.

Thumbnail 620

To talk about how we built Brazil, I need to introduce you to John, our fictional character. John's job title was Region Bootstrap Ninja. John would be at the office in Seattle, and when it was time to build São Paulo—after we had built the data centers and the racks arrived—we would give John three jump drives, one for each availability zone. John would get on a plane, go to each availability zone, walk in there, pick a server, plug in that jump drive, and literally start hand-hacking away, installing all of the bootstrap scripts to launch our internal virtualization infrastructure that predated EC2.

This worked well for a while, but there were problems with this approach. Over time, John needed to learn a lot of things: provisioning network equipment, provisioning our Xen stack, and handling various Red Hat tasks. John needed to learn more and more, so it became hard to train more Johns over time. Additionally, if I sit down at a computer and you ask me to do about thirty different things, I'll usually do about twenty-eight of them really well the first time, but I'll make mistakes on one or two things. Then I have to go back and debug it. By the time I'm on the second and third availability zone, I'm getting lazy and making more mistakes. Sending one person to do lots of stuff is a pretty error-prone process.

Thumbnail 750

Thumbnail 760

Flying people, who are a limited resource, around the world isn't a good way to scale to multiple regions at once. We managed to build eight regions this way, which I think is pretty good, but there were some other changes happening behind the scenes that unfortunately put John's ninja bootstrap role on ice. We continually innovated over time. When AWS first launched, we ran original services like S3 and SQS on the infrastructure that Amazon.com ran on. Then we launched EC2, and over time, we started to build all of our AWS services on top of EC2. In fact, today, when you launch an instance in EC2, the machine that talks to the server that launches your instance is actually an EC2 instance. EC2 runs on top of EC2. Everything runs on EC2 now, which makes the idea of John walking into a data center and plugging in jump drives not really work, because you need EC2 to run EC2. Your head starts to hurt.

Thumbnail 860

The other thing is we came up with IAM opt-in regions. When we launch a new region, your IAM information for your account doesn't exist in that region yet until you as a customer opt into using that region. Then your IAM materials will propagate to that region. That made it really challenging because we need this regional identity stack to exist before we can start installing things in the data center. We kept innovating over time. Customers wanted things like digital sovereignty and data protection, and they wanted more regions to run their workloads closer to their customers. The demand for us to grow our global footprint continued to increase over time.

This gets to one of the fundamental values of cloud that we talked about from the first re:Invent: the ability to go global in minutes. I can be sitting in my house in Northern Virginia and launch an instance in Australia just from the management console. I don't have to travel there and figure out colocation space and do all that other stuff. In order to make good on that value proposition to customers, we needed to expand our global footprint pretty aggressively. The global reach of what our customers wanted to run on our platform drove this expansion.

Thumbnail 910

Thumbnail 920

Thumbnail 930

Building AWS on AWS: The Bootstrap Region Approach

With all this, how do we build regions today? Well, it's now 2025. I can't believe I'm saying that. That's wild. We have continued to launch more regions and availability zones and local zones. I just included these blog posts from some of the more recent regions that we've launched, but this just wouldn't have been possible using the approach that we used a while ago. So let's dive into how we build regions today.

Thumbnail 950

As Don mentioned, it was a very sequential process. We would do the physical build and then fly people in and do the software build. In order to scale, we had to parallelize this work. Well, we just build AWS on AWS. We have all of these existing regions around the world, so what we do now is we use those existing regions to bootstrap the new regions. Instead of waiting around until the physical work stream is complete, we get a head start with the software build and can parallelize the effort.

Thumbnail 980

Thumbnail 1000

We have a build region here that's still under construction. We're still super early on in the planning phase. No racks have landed, but we're going to get a head start on this region build. We're going to select an existing AWS region, which we'll refer to as our bootstrap region. We're very careful when selecting our bootstrap region because it's a production region with customers running there. We look into things like capacity considerations, the network latency between the bootstrap region and the build region (traffic needs to go back and forth, and we don't want it traveling all the way around the world every single time), and fiber considerations.

Thumbnail 1040

Thumbnail 1060

Once we've selected this bootstrap region, we go ahead and start that software build. Even super early on in this build, we're designing with resiliency in mind. Our bootstrap VPC here is already spanning three availability zones. Everything that we build in this bootstrap environment, every service, everything thinks it exists in the new build region. We connect it up and eventually we will get the racks landed and everything good to go, but let's dive a little bit deeper into what's happening in that bootstrap VPC.

Thumbnail 1070

Thumbnail 1090

Thumbnail 1120

We start with a VPC and we have an account just like any other customer. We need to make this VPC look like the region that we're building, so we need to start building some services. Well, that's pretty easy. I can launch an instance and give it to a service team and say, here you go, it's all good. Well, the service team's going to say, this is great, except I need to get my AWS software on there. How am I going to do that? That's a problem. The Amazon deployment system can't tell this AWS account from any of the other millions of AWS accounts. So we need a way to authenticate that host, and then we need a way to talk to our deployment system, and then we need a way to make it look like it's the build region inside there.

We need to install some software on that, which is just like an RPM, right? And what's really an RPM repo? It's just a web server. We're going to need to do some DNS trickery because we need to make DNS inside this environment look like the region or the zone that we're building in. But what really is DNS? It's just something that answers on UDP 53 based on something I look up in a database. I can surely fake that out. And then most of our host identification, well, that's just a certificate. I can issue certificates.

Thumbnail 1170

Thumbnail 1200

So what we did in order to fake all this stuff, until the real services that do all these things come up, is we launched another instance and we called it the Franken instance, because it just pieces together all of the stuff you need to break the circular dependencies and get that host talking to the deployment system so it can get software on it. We made a Franken instance and it does DNS and it does PKI and it does a little RPM repo, and that'll turn that generic EC2 instance into an Amazon service team host. We can do that to build the early services like IAM, because that's what you need before you need anything else, right? Who am I talking to and how do I authorize them? All right, cool, so we built IAM, right? So this should be easy. Now we just have to build EC2.
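To give a feel for the "an RPM repo is just a web server" observation, here is a toy Python sketch that serves a local package directory over HTTP. It only illustrates the idea; AWS's actual Franken instance is internal tooling, and the directory and port below are made up.

```python
# Toy illustration: a package repo is just files behind a web server. In the real
# build, DNS inside the bootstrap environment would be pointed at a host like this
# (plus PKI and other fakes); everything here is a simplified stand-in.
import functools
import http.server

REPO_DIR = "/srv/fake-rpm-repo"   # hypothetical directory of .rpm files and repodata
PORT = 8080

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=REPO_DIR)
with http.server.ThreadingHTTPServer(("0.0.0.0", PORT), handler) as server:
    print(f"Serving fake package repo on port {PORT}")
    server.serve_forever()
```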

Thumbnail 1220

Thumbnail 1260

Breaking Circular Dependencies with Intelligent Proxies and Migration

However, remember when I mentioned that everything runs on top of EC2? EC2 also uses other AWS services because we didn't want to be left out of the fun. Most notably, we use S3 and DynamoDB. In a production region, the way we use S3 and DynamoDB is strongly controlled and monitored such that if those services degrade, we have a way to either continue portions of our operations or we'll degrade relative to those services degrading. However, early on in the build, we have a chicken and egg problem: S3 needs EC2 to build, but EC2 needs S3 to build.

Thumbnail 1270

Thumbnail 1300

We have a problem, but I'm running in an AWS account with S3 and DynamoDB in the region. What's really the difference between S3 and DynamoDB in the bootstrap region and S3 and DynamoDB in the build region? The only difference is the DNS. If I were building US East 2, it would be s3.us-east-2.amazonaws.com, and if my donor (bootstrap) region was US East 1, the DNS would be s3.us-east-1.amazonaws.com. That's just DNS, and I can fix that because I have the Franken instance. I can make DNS whatever I want.

Thumbnail 1330

I need to call against that regional opt-in stack because when you're in a regionalized account, you need to talk to the regional opt-in stack. What if I just threw a little proxy server in there that responded on that DNS endpoint, shuffled your call over to the API that we just built, but then when it actually goes to store your data, it just turns around and uses the one in the bootstrap region? That's what we do. We have these intelligent proxies which, like a good mullet, have the build region in the front and the bootstrap region in the back. That's how we break the circular dependencies early on in EC2 to allow us to use S3 in order to build EC2. Once we build EC2, then S3 and DynamoDB can build.
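Here is a deliberately simplified Python sketch of the "build region in the front, bootstrap region in the back" proxy idea. It ignores TLS, SigV4 request signing, streaming bodies, and everything else a real proxy must handle, and the endpoints are just examples; it is not AWS's actual implementation.

```python
# Toy forwarding proxy: accept requests addressed to the build-region S3 endpoint
# and forward them to the bootstrap-region endpoint instead.
import http.server
import urllib.error
import urllib.request

BUILD_HOST = "s3.us-east-2.amazonaws.com"      # what callers think they're talking to
BOOTSTRAP_HOST = "s3.us-east-1.amazonaws.com"  # where the data actually lives

class ForwardingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        upstream = f"https://{BOOTSTRAP_HOST}{self.path}"
        req = urllib.request.Request(upstream, headers={"Host": BOOTSTRAP_HOST})
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
        except urllib.error.HTTPError as err:
            # Pass the upstream error code back to the caller.
            self.send_response(err.code)
            self.end_headers()

if __name__ == "__main__":
    # DNS inside the bootstrap environment would resolve BUILD_HOST to this box.
    http.server.HTTPServer(("0.0.0.0", 8080), ForwardingHandler).serve_forever()
```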

Thumbnail 1350

Thumbnail 1360

Thumbnail 1370

At this point, EC2's core dependencies are up, and we can build out the EC2 control plane. We also build out some core networking services and other services in this bootstrap environment. In our build region, we're at the point where the racks have been cabled and we can actually launch EC2 instances into the build region using the control plane that exists in the bootstrap VPC. Once we can launch instances, other services can start their builds. The first few services that build in the build region are S3 and DynamoDB among a few other core services.

Thumbnail 1390

Thumbnail 1400

Once S3 and DynamoDB are available in the build region, we go ahead and migrate things over. We take all of the data stored in the bootstrap region S3 and DynamoDB and copy that over to S3 and DynamoDB in the build region. We also migrate all of the services that were built in the bootstrap VPC into the build region. Service teams scale up their instances in the build region and then scale down those in the bootstrap VPC. This is all possible because to a service, they are the same region, whether it's the VPC environment or the build region environment.
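A rough boto3 sketch of the kind of bucket-to-bucket copy described here, with placeholder bucket names and regions; the real migration tooling is internal and also covers DynamoDB, scale, and verification.

```python
# Copy every object from a bootstrap-region bucket into a build-region bucket.
import boto3

SRC_BUCKET = "bootstrap-region-bucket"   # hypothetical bucket in the bootstrap region
DST_BUCKET = "build-region-bucket"       # hypothetical bucket in the build region

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="us-east-2")

paginator = src.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        # Server-side copy into the destination bucket, key for key.
        dst.copy_object(
            Bucket=DST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SRC_BUCKET, "Key": obj["Key"]},
        )
```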

Thumbnail 1470

Once the control plane is up and running in the build region, we have EC2, S3, DynamoDB, and all of the core services in the build region, so we have a very stable environment for a bunch of other services to start their builds as well. Once that's done, we terminate all of the resources in the bootstrap S3 and DynamoDB. We also shut down and eventually terminate all of the instances that we're running in the bootstrap VPC. We then apply an access control list to block traffic from the bootstrap region to the build region and vice versa.
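For illustration, blocking traffic with a network ACL might look roughly like the following boto3 sketch; the ACL ID, rule number, and CIDR are placeholders, and AWS's actual isolation step is certainly more involved than a single entry.

```python
# Deny all traffic to and from a placeholder bootstrap-side CIDR on the build side.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

for egress in (False, True):   # one ingress rule, one egress rule
    ec2.create_network_acl_entry(
        NetworkAclId="acl-0123456789abcdef0",  # placeholder ACL on the build-side VPC
        RuleNumber=90,
        Protocol="-1",                          # all protocols
        RuleAction="deny",
        Egress=egress,
        CidrBlock="10.10.0.0/16",               # placeholder bootstrap-side CIDR
    )
```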

Thumbnail 1480

We then give it some bake time just to make sure that nothing is trying to call back and forth anymore. If everything looks good, we go ahead and disconnect the build region from the bootstrap region, and then the build region is on its own. We build out all of the services, we run our game days, and then we launch the region. That is at a very high level how we build a new AWS region. Let's now jump to how we do this for availability zones and local zones.

Thumbnail 1510

Thumbnail 1520

Expanding Infrastructure: Building Availability Zones and Local Zones

The process for adding availability zones is very similar to building regions, albeit much simpler. What is interesting about availability zones is that it actually takes us longer to add a new availability zone to an existing region than it does to build a new region from scratch. We are working on making it better, but the reason is our isolation and our promise to customers. One region is strongly isolated from another region, so when we are running in that bootstrap VPC, we do not do anything special. We are just like any other customer, so we are not doing anything that will impact the region we are running in. We really are running like a brand new region, so that is pretty isolated from everything else in the global AWS infrastructure stack. There are no customers in the build region yet, so we can go really fast.

With an availability zone, you are operating in a live region with live customers. From an availability perspective, there are a whole bunch of rules that we follow about deployment safety, like not deploying to more than one zone in a region on a given day and limiting how many zones we touch at once. All of those rules apply, and we want to be really cautious that we do not do something in a build availability zone that would impact customers running in the parent region. We are expanding our availability zones worldwide, and in some ways it is nice because it is a way to get additional resiliency and additional capacity available to customers without you having to rebuild your stack in a new region. This provides greater flexibility.

Thumbnail 1630

Thumbnail 1650

Thumbnail 1660

Thumbnail 1670

The way that we do it is we essentially have a bootstrap AZ instead of a bootstrap region, and we just create a VPC there. It is much smaller in scale. We only build the zonal services that comprise the EC2 control plane there, as well as some of the zonal networking services. Then we do some fancy under-the-hood network peering to the zone that is being built. Once we can do that, we can provision hardware and launch instances and do that same migration process that we do in the region build. It is somewhat similar, slightly different, and the fun fact is it actually takes longer, mostly due to deployment safety.

Local zones exist so customers can run their applications on AWS infrastructure closer to their end users and to provide additional capacity. We have a few different flavors of local zones. We have our standard local zones and we have dedicated local zones, which are dedicated to a specific customer. This means they might come with additional security services or whatever that customer may require. Similar to an availability zone, we have a parent AZ and the EC2 control plane lives in that parent AZ, and then we just extend over to the local zone.

Thumbnail 1730

The set of services that we build for a local zone is much smaller. Most of the complexity in a local zone build actually comes from setting up the network connectivity between the parent AZ and the local zone. Similar to the AZ, once we have that established, we can provision that capacity, the data plane essentially in the local zone. One of the huge benefits here is that customers can run their workloads in different countries. For example, say you have a parent region in country A, but due to data residency requirements you need to store all of your data in country B. Using a local zone, you could use some of the local zone versions of regional services like S3 Express to ensure that all of your data is stored in country B to meet the data residency requirements you may have. It is a really nifty way of extending AWS infrastructure capabilities to locations beyond where we have regions.
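As a customer-side illustration of the parenting model, here is a small boto3 sketch that opts in to a local zone group and creates a subnet there; the zone names, VPC ID, and CIDR are examples.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Local zones are opt-in at the zone-group level.
ec2.modify_availability_zone_group(
    GroupName="us-east-1-bos-1",   # example local zone group
    OptInStatus="opted-in",
)

# A subnet in the local zone; the parent region's control plane manages it.
ec2.create_subnet(
    VpcId="vpc-0123456789abcdef0",        # placeholder VPC
    CidrBlock="10.0.128.0/20",
    AvailabilityZone="us-east-1-bos-1a",  # example local zone name
)
```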

Thumbnail 1800

The Dependency Challenge: Managing Complexity in Service Bring-Up

Next, we are going to jump into perhaps the most tricky part in all of this building. We are going to talk about dependencies and how we deal with this during builds. Don touched on this a little bit earlier. It is tricky because we have a lot of circular dependencies between services.

And it continues to become more and more complicated as we add more services and more features. More services want to leverage the benefits of services that we've developed previously, like Lambda and Fargate. If you're a service team, why wouldn't you want to use those, right? So we've broken it up into four main categories.

The first is service bring up. This is all about how we bring up all of the AWS services during a build. We have hundreds of external AWS services, but probably thousands of internal services. The way we handle bringing those up in batches involves quite a bit of complexity, and we'll go through that using an example later.

Then there's dependency management. Dependencies change all the time. It's almost impossible to maintain an accurate list of dependencies, especially at a granular level from a program management perspective. When we build a new region, availability zone, or local zone, we do have a high-level dependency graph. This allows us to plan the schedule and make sure we launch the builds on time. However, it's much higher level than the individual service level because that's just almost impossible to maintain.

Thumbnail 1910

Then we have static stability, which is about making sure things recover gracefully in the event that something goes wrong. And then continuous testing: ensuring that the things we planned for actually get tested and actually work. To frame dependencies and how problematic they can be, we're going to run you through an example. We'll use EC2's most foundational API, run instances, and talk you through it.

Thumbnail 1930

Up until this point, everything sounds pretty easy, right? I think it's easy. So what does it take to run an instance? This is a really simple run instance call, right? Give me an instance from that availability zone, make it a T3 micro, and I'll give it a key pair. What does that look like?
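On the surface, that call might look like the following boto3 sketch, where the AMI, key pair, and availability zone are placeholders. Under the hood, though, it fans out to far more than one service, which is what the screenshot described next shows.

```python
# The "really simple" call: one t3.micro in a chosen AZ with a key pair.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1a"},
)
```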

This is actually a screenshot from our internal API orchestration software that we use in EC2. That is what a run instances call looks like. Every one of those boxes on that chart, whether purple or yellow, is what we call an agent, and an agent represents a service team. A service team in AWS is a two-pizza team, so it's roughly 8 to 10 engineers. When you work on a two-pizza team, you are responsible for your service and you own everything about that service.

Thumbnail 2000

In this API orchestration framework, if you hover over a box, the green lines highlight what gets called when that agent is being called. If you step down the left side at various times, all of those little green lines will be pointing at different boxes. Each step in that process is a number of services that get called underneath the hood. So you can see from this that this is a problem.

When we talk about a service, we talk about one of those little boxes on the screen, even though most people think EC2 is a service. When you come into region build and you need to run instances to be able to deliver capacity to people during region build, you've got to build, at a minimum, the number of boxes that are on the screen, and it turns out actually a lot more. When we talk about building EC2, we're talking in the order of a couple hundred different services. Those services, just like EC2 depends on S3 and DynamoDB, depend on each other. So it gets really tricky with this dependency thing.

The engineering instinct when you come in is to think, well, there exists a dependency graph in which I could order every box on that screen in the perfect order, so I'd know what order to build everything in to be able to build EC2. While that is factually accurate, the reality is that if you were able to piece together that dependency graph, first, it would take you a while to do it, and second, if I was able to wave my magic wand and have that dependency graph exist at this very second in time, it would be out of date tomorrow.

Every one of those teams of 8 to 10 software engineers is writing code every day and changing the behavior of the system every day. Actually, the reason we built this API orchestration framework is that every box on that screen needs to be able to change how API calls work in order to deliver new features for you, without depending on coordination, because it used to be that we had to talk to every team on that screen to say we're going to have to change how run instances works.

Everybody needs to get on the same page. Chasing dependencies in a distributed system can be a bit of a fool's errand. So how do we make this thing actually work in a region build? Well, we've got this trick up our sleeve, and the trick is this thing called static stability.

Thumbnail 2130

Static Stability: Building Services That Recover Gracefully

Usually when we talk about static stability, we talk about it in the context of "have you tried unplugging it and plugging it back in," right? We do that for entire data centers or availability zones. If the power goes out and the generators for some reason don't come on, which happens from time to time, all of the servers in the data center go off, and then usually the power comes back on at some point. Once the power comes back on, all of the servers in the data center turn back on. Wouldn't it be nice if, when that happened, EC2 just came back to life without requiring hundreds of engineers to go in and say, I've got to get my service working? Then you have that dependency problem. Well, that's static stability, right?

Because we run EC2 on EC2, we want the special EC2 instances that run EC2 to be able to come back and not lose their state. Secondary to that, we want all of the services to come up and just start working without hundreds of engineers having to go in and poke things. In order to make that work, you actually have to build your services in a certain way, and the way that we build our services is kind of twofold. The first is that I don't take a lot of dependencies to get my software from existing to running. I may not be able to do much, right? In order to run a web server, I probably can't start if I don't have my SSL certificate, because that's not going to work very well. I kind of need that.

But do I really need to talk to the monitoring system to be able to emit metrics just to get started? No. Do I need to be able to actually talk to the database? Do I need to talk to DynamoDB to start my service up? Not really. I can handle API requests even if I can't do anything with them. I'll get my little web server running and it'll happily sit there and return a 500 error to anything that calls me. So I minimize the dependency surface that it takes just to get my service started, and for those few dependencies that I do need, like my SSL cert or my IAM credentials, those are kind of important and you need them to start up. I'll just sit in an infinite loop and go: can I start? Crash. Can I start? Oh, I can start. I got my IAM credentials and my cert, so I'll start. Oh, I can't talk to any of my dependencies, so everybody who calls me just gets an error, but that's data that I didn't have before.

So static stability starts in how we build software: our services continually try to activate themselves no matter what, and they start with a minimal dependency footprint. Then, when you make an API call to me, if I can't talk to my dependency I'll return you an error, but when my dependency comes online, the next time you make an API call you get a 200, great, it works. We can use this in region build, because while I can't build that perfect dependency map of what EC2 looks like today, I can put stuff in buckets, right? I can put stuff in the foundational database bucket of EC2 and the provisioning bucket and the fancy API layer bucket. And if I group services into groups of 20 or 30 or 50, all I need to do is launch all 50 and then kind of let God sort it out for a little bit, right? Let them all sit and spin for a little bit and send API traffic through the system.

In a perfect world, everything will get to active, all of the dependency issues will resolve each other, and API calls will start to work. In the more realistic scenario where maybe 75 percent of everything works, well, at least I got 75 percent working, and then I can poke along the API call path, fix the one or two things that didn't come up, and now everything is working. So we can take this thing that we do to recover from major outages and use it to our advantage, so we don't have to get into granular dependency mapping and project management; we can just skip that approach entirely.
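Here is a toy Python sketch of the static-stability startup pattern described above: start with a minimal set of hard dependencies, keep retrying until they exist, and once up, return errors for calls whose downstream dependency is still missing instead of refusing to run. All paths and names are hypothetical; this illustrates the pattern, not AWS's code.

```python
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

CERT_PATH = "/etc/myservice/tls.pem"        # hypothetical hard startup dependency
CREDS_PATH = "/etc/myservice/credentials"   # hypothetical hard startup dependency

def hard_dependencies_ready() -> bool:
    return os.path.exists(CERT_PATH) and os.path.exists(CREDS_PATH)

def soft_dependency_ready() -> bool:
    # Stand-in for "can I reach my database / downstream service right now?"
    return os.path.exists("/tmp/database-is-up")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if soft_dependency_ready():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")
        else:
            # Dependency is down: answer with an error instead of being down ourselves.
            self.send_response(500)
            self.end_headers()
            self.wfile.write(b"dependency unavailable\n")

if __name__ == "__main__":
    # Keep trying to start until the few hard dependencies exist.
    while not hard_dependencies_ready():
        time.sleep(5)
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```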

Thumbnail 2430

Testing and Learning: Game Days, COEs, and Operational Readiness Reviews

We can talk about how we actually make this happen in real life and do things like testing to make sure that this stuff actually works. This all sounds great in theory, and we might plan for it, but how do we know it actually works? Well, we run game days. Regardless of the type of build, regions, availability zones, or local zones, we have time in the schedule dedicated to game day testing. That time is used to actually test things like what would happen if we power down an entire availability zone. We'll just pull the plug on the AZ and see what happens. Do things recover gracefully or do we have some gaps to close?

These game days have proven to be so valuable that we actually built an entire test region just to do game day testing. It has three availability zones, all of the AWS services are there, and we run game days there very frequently. What this allowed us to do is move game days away from being tied to specific builds. There are only so many builds in a year and so many game days you can do in a specific environment, but we can run game days whenever we want to because those learnings are extremely valuable. What better way to learn than when things go disastrously wrong?

Sometimes things do go wrong. Whenever there is an incident, whether during a game day or in a production region, our first priority is always to mitigate impact. However, after we've done that, we want to make sure that we figure out what went wrong. One of the mechanisms we have at Amazon, not just in the build space but across the company, is a Correction of Errors mechanism, or COE. After we've mitigated the impact of an event, the service team will go in and do a root cause analysis. They'll author this document in a specific format with tight deadlines around it, and you really dive deep into what caused the issue. You assign action items to your own engineering team and to other engineering teams to close gaps.

The primary thing here is that it's not meant to be punitive. So if the root cause of an issue is operator error, if someone fat-fingered something, we don't let that be a root cause because we're not going deep enough. We ask what enabled them to be able to do something like that. It can't just be a person's fault. The COE process is very effective in getting learnings from when things go wrong, but we actually extend that beyond just that individual team. At AWS, we have ops meetings at various levels: the org level, EC2, and even AWS-wide. During these ops meetings, which are typically led by our principal engineers, everyone will review COEs and dashboards, and there'll be a discussion on one or two COEs.

In fact, sometimes the principal engineers will assign action items to other teams because when there's an operational issue and a team identifies some gaps, it's not just that team that could learn from that event. There are learnings that could be implemented in other places as well. The ops meeting is one way that we as a company really prioritize sharing those learnings, and they happen on a weekly basis. Attendance is required, so it's a huge part of the culture and a great mechanism to ensure that we're improving. Finally, we have Operational Readiness Reviews, or ORRs. This is a process teams go through on a regular basis to ensure that their services are operationally healthy. This is for existing services, but also if a service team is developing a new service or a new feature, they would undergo this ORR process, which also incorporates some of the learnings from noteworthy COEs.

Thumbnail 2700

Key Takeaways: Resiliency, Centralization, Automation, and Culture

These are some of the mechanisms that we use across the company to really drive this culture of continuous improvement and testing. Let's move on to some key takeaways. Hopefully you can learn a bit from what we've learned over the last few years. The first is, well, we kind of alluded to it throughout the talk, but all of our services are architected from the ground up around how we do resiliency.

Our services are aware that they are zonal services running in a specific zone, and they know which other zonal services they communicate with. This is really baked into all of our services, especially our stateful services, but even services that handle API call paths for EC2. Those availability zones and the bootstrap VPC are not just there for show. Our software depends on knowing that it is a zonal service running in a particular zone, and therefore it talks to other services in that zone.

That connectivity is handled through things like DNS. As you build services on our infrastructure, you need to decide what failure modes you want for different components of your system and how you design your foundational infrastructure, your VPC networking, and your DNS infrastructure around that to create strong silos between your components. This architectural approach is critical for building resilient systems at scale.

Second is knowing when to centralize tasks versus when to distribute tasks. Michelle and I worked on the same team in EC2 for a number of years, and that team was dedicated to doing region builds for EC2. For those hundreds of teams working on the API, we were a team that, if a service team met our bar for automating their region build, we would do their region build for them. Interestingly, nobody likes doing region builds, so every team was eager to ensure their region build was automated so our central team could handle it. That program is called managed builds inside EC2.

We also have another program called managed fleets. Each EC2 pizza team is responsible for everything, including how many hosts they have in every zone in every region, patching their hosts, and fixing broken hosts. For a service team like EC2 running in every region around the world, with new regions constantly being added, this is a lot of work. People do not like patching their infrastructure, so we started developing centralized solutions. The problem is that every team spends an hour a week on patching their hosts if they are doing a good job. When you have thousands of teams, they are spending thousands of hours on this undifferentiated work, but no single team is centralized to actually make that any better.

When you take that thousand hours of work and put it on one central team, suddenly they are really incentivized to build cool systems to make patching hosts happen at Amazon scale really fast, efficiently, and safely. That is what happened in the region build space. Amazon was generally operating in a kind of anarchy where all these teams ran around and did whatever struck their fancy. Over time, we discovered that while that provides a lot of value and allows you to go really fast, sometimes centralizing things is actually better. Each business is going to be different because how you do things varies, but you really should think about when it is time to centralize work versus when it is time to keep it distributed.

The flip side of that is we struggle as a centralized team because when region builds become somebody else's problem, my incentive to maintain my automation is not that great, and my incentive to not do problematic things with my hosts is not that great either. There is a natural tension there, which is why Amazon defaulted to having service teams own everything and go on call for it. If you do something silly and get paged at two in the morning, you will be incentivized to fix that. There is a trade-off there, and exploring that as part of your business and development life cycle is really important.

Next, do everything in code. Please do everything in code. All of our infrastructure is in code. We obviously use a lot of CDK internally. Everything becomes complicated when you run a low-level service because CloudFormation runs on us, so we cannot use CloudFormation. We have an internal infrastructure as code framework that we use, but everything is infrastructure as code and any type of regionalized config is actually auto-generated as part of the build process. This extends to having predictable DNS endpoints for your service either at the regional or zonal level. You can have a simple script that generates those endpoints for every single region that you ever want to build in, and then you never need to think about it again. The same goes for configuration in the software that you run. Generate that at build time and have that live in your source control.
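A minimal sketch of what "generate your endpoints instead of hand-maintaining them" could look like, assuming a hypothetical internal naming convention; the service name, domain, and zone list are made up.

```python
# Generate regional and zonal DNS names for a service at build time and write
# them to a file that lives in source control.
REGIONS = {
    "us-east-1": ["a", "b", "c"],
    "eu-west-1": ["a", "b", "c"],
    "ap-southeast-2": ["a", "b", "c"],
}

SERVICE = "myservice"
DOMAIN = "example.internal"   # hypothetical internal domain

def endpoints():
    for region, zones in REGIONS.items():
        yield f"{SERVICE}.{region}.{DOMAIN}"            # regional endpoint
        for z in zones:
            yield f"{SERVICE}.{region}{z}.{DOMAIN}"     # zonal endpoints

if __name__ == "__main__":
    with open("endpoints.txt", "w") as f:
        for name in endpoints():
            f.write(name + "\n")
```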

We go a step further, and this is becoming more interesting as we start to use some of the generative AI tools for coding. Most knowledge at Amazon is in a gigantic wiki site, and you can write infrastructure as code to generate wiki pages. We actually use wiki pages for our operational dashboards for services. Even our runbooks and dashboards for services are captured in our infrastructure as code. The reason I mention the generative AI aspect is that many more teams have been doing this. You can get it to write halfway decent documentation, and it can do things like graphs and diagrams pretty well in markdown. More documentation is actually going into source control as markdown over time as some of our projects start to use this stuff more.

Thumbnail 3150

I encourage you to really give that a try because the more stuff we put in source control, the easier it is to search. It goes through pull requests. It generally has a structure, so if you show me a repo, I know generally where to find stuff depending on what programming language it is written in. Having everything in your source control system, versus some document repository that no one really likes, is generally a best practice. Next is continuous improvement, and let me use builds as an example. Builds are a very long project, and so there is a tendency to say we will talk about lessons learned and retrospectives once the build is done. The problem is that so much time passes, things have changed, and people forget.

We really encourage continuously looking at what did not go well and iterating quickly. As an example of how we have done this: a few years ago there was a real push at AWS to have service teams automate their builds, and we noticed trends. For example, we saw that a thousand different teams were running into the same blocker when trying to automate their builds. Instead of waiting and letting every single team work through that issue themselves, we took a step back, paused, and said this does not make sense. Why would we have every single team work through the same issue? We cut a tracking ticket, made sure it was documented, escalated it, and ensured that the owning team resolved it centrally. That way, on the next build, this is not an automation blocker for those teams, and they do not have to worry about implementing workarounds themselves.

Similarly, we have had sticky parts of the build where there was a lot of churn, ownership was not clear, and there was a lot of back and forth on tickets. Instead of hitting that same issue every single build because it is a tricky part of the build, what we have done in the past is assign principal engineers to that ambiguous problem and have them own that space, find owners, and drive improvements. Instead of letting it be an issue over and over again, taking that approach of looking at each specific phase of the build and doing a quick retrospective on lessons learned has really helped us improve things very rapidly.

The time we see the most improvement is actually when we have several builds back to back with a short gap in between, because we can implement lessons learned and then hit the next one without those issues. We slowly make progress across each build. We have actually staggered builds on purpose in the past to make sure that we are not just running into the exact same issue in four different builds at the same time, which is very painful for everyone involved.

Culture is incredibly important in general, but specifically in the build space. We have engineers who are actually doing the work. They are building out their services and their automation, so they need to feel empowered to vocalize when things are just not working. If they are running into automation blockers or if they have been going back and forth on an issue over and over again, we want to make sure that those concerns are heard and escalated to leadership so we can actually prioritize fixing them.

Thumbnail 3310

At Amazon, we have a very healthy escalation culture. When our engineers or anyone says that something is a really big problem, we make sure that our leadership hears that. Leadership does need to be willing to listen, and fortunately our leadership does. We use that escalation culture to very quickly align on a path forward. This is specifically important for builds because once we make a commitment to our customers to launch a region at a specific time of year, we need to meet that commitment. Builds move very quickly, and if there is any friction, we need to escalate that quickly and get on the same page so we can move forward.

Thumbnail 3410

Thumbnail 3430

Culture is extremely important in general, but especially if you have very large projects that span multiple months or even years. These are some of our main takeaways, and hopefully they have been helpful to you. If you want to learn more, we have two topics that might be interesting if you enjoyed this session. We have some builder library resources, one on static stability and one on building resilient services. Those are the two QR codes. We thank you so much for your time. Please take the survey—we need that feedback. We hope you have a great rest of your re:Invent. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
