🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Large-scale software deployments: Inside Amazon S3’s release pipeline (STG352)
In this video, George Lewis and Vandana from Amazon S3 share deployment safety practices for managing millions of servers across 39 regions. They detail three testing systems: Noodles for behavior-driven API testing, HiFi for model-based testing that validates all combinatorial API possibilities, and performance testing including S3 Rise for production qualification. The presentation covers blast radius containment through staged regional deployments, monitoring with CloudWatch alarms and canaries, and application controls like feature flags using AppConfig. Vandana explains stateful deployments, emphasizing data preservation through checksums, consistency, redundancy across three availability zones, and durability threat models. She describes the host reservation system that coordinates maintenance across the fleet, performing safety checks before deployments. Real data shows operator escalations dropped from 478 in 2022 to under 10 in 2025 by implementing proactive durability threat models and automated safety checks.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Amazon S3's Deployment Safety Practices at Global Scale
Good morning. My name is George Lewis, and I'm here with Vandana. Today we're going to talk to you about Amazon S3's release pipeline. What I want you to walk away with today is some of S3's best deployment safety practices. We're going to dive into our testing infrastructure, how we actually do it at scale, and how we allow our global teams to scale when we're talking about testing and validating before deployment. We're going to talk about multi-partition deployments. How do you go from commercial to China? How do you go from China to US GovCloud, European sovereign clouds, and then ultimately into Amazon dedicated clouds that have different partitions and different considerations? Then we're going to talk about application feature controls such as feature flags. We're going to talk about canaries and being able to do A/B-style deployments. Finally, we're going to end up with stateful deployments and pipelines, talking about how we maintain state and durability for S3.
Before we get started, I did want to say that we have a couple of assumptions for this audience today. We already expect that you are experienced developers and operators coming into this. We're not going to be covering some of the 100 and 200 level stuff beyond a couple of slides to give us some context, and then we're going to move on pretty fast past that. We expect that you and your teams are able to understand and write unit and component tests. You have integrated IDEs and everything along those lines. We're going to assume that you already have deployment pipelines, that your hardware infrastructure is code, and that you're not trying to manually do a bunch of this stuff. Although this presentation is keyed for larger organizations and more complex software systems, smaller teams and teams that are on the path to that large scale should be able to get quite a bit out of here and take some of these lessons home.
All right, let's start with some of the context around S3. We have millions of servers worldwide across 39 regions, each with three availability zones, and we're growing year over year. This includes Amazon dedicated clouds and China partitions. Our fleet includes a large variety of different types of services, but we're mostly going to talk about three major components inside S3. The first is the S3 web server, which is our stateless API front end with standard web server architecture. Then we're going to dive into index and storage fleets, which are the key stateful architected systems that capture the metadata, the indices of the storage, and actually durably store our information. All right, let's get into deployment safety. Deployment safety really starts with testing.
Noodles: Behavior-Driven Testing Platform for Distributed Teams
Let's first look at some of our testing systems here. As we said, we're not spending much time on unit tests and basics. Instead, I'm going to focus on three systems: Noodles, HiFi, and performance testing that allows us to horizontally scale across those dimensions. First is Noodles. Noodles is a test platform for writing functional behavior-driven tests against S3's public APIs, and we built this specifically for S3's distributed teams. On the surface, this looks like any other behavior-driven test system. You have the given-when-then structure along with scenarios, and most of you are probably pretty familiar with this.
Here is a code snippet of how this works. The top part is where you actually define the feature and the scenario, such as put returns 200 for a bucket owner, given that Bob owns the bucket and uploads the item. After defining the user-based feature, we actually link that feature to the step that actually connects it and allows it to run. Here we can see that uploading PNG to the image buckets translates to the put object response method that we previously defined. In our case here, most of these put object steps are defined by the infrastructure team that put Noodles together. So when we talk about our distributed software teams, they're not spending a whole lot of time on actually doing the step definitions. They're defining the test and just doing the link to previously defined methods. That way they don't have to think about what's actually working under the hood.
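Noodles itself is internal to S3, so purely as a hedged illustration, here is what the same given-when-then pattern might look like using the open-source behave library with boto3. The feature wording, step text, and bucket names are assumptions for this sketch, not the actual Noodles definitions.

```python
# features/put_object.feature (illustrative wording, not the Noodles syntax):
#
#   Feature: PutObject authorization
#     Scenario: Put returns 200 for the bucket owner
#       Given Bob owns the bucket "image-bucket"
#       When Bob uploads "photo.png" to "image-bucket"
#       Then the response status code is 200

# features/steps/put_object_steps.py
import boto3
from behave import given, when, then


@given('Bob owns the bucket "{bucket}"')
def step_owns_bucket(context, bucket):
    # In Noodles, pooled accounts and buckets come from shared infrastructure;
    # here we simply create a client against a pre-existing bucket.
    context.s3 = boto3.client("s3")
    context.bucket = bucket


@when('Bob uploads "{key}" to "{bucket}"')
def step_upload(context, key, bucket):
    context.response = context.s3.put_object(
        Bucket=bucket, Key=key, Body=b"example bytes"
    )


@then("the response status code is {code:d}")
def step_check_status(context, code):
    assert context.response["ResponseMetadata"]["HTTPStatusCode"] == code
```

The point of the pattern is that most feature files only ever reference previously written steps like these, so individual teams rarely touch the step layer at all.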
For folks starting from scratch, the initial step definitions are where you're going to benefit from spending the most time. That's going to allow other folks to move a lot faster and focus on the more behavior-driven components that separate from implementation. This approach ensures that the foundational work pays dividends across your entire team's productivity and testing efficiency.
Where Noodles really becomes powerful for us at scale is that it abstracts and uses pooled accounts and resources. By resources, I mean buckets, CloudWatch metrics, the actual accounts themselves, DynamoDB tables, and all the logs and metrics are centrally owned by a central infrastructure. Individual service teams don't have to worry about setting up any of these components. They essentially write the test and it goes into this infrastructure.
The last big win here for us with Noodles is that it is a write-once, run-everywhere product, and I do mean everywhere. Using definitions provided by the developers along with the abstracted and shared infrastructure, Noodles converts those to be on-demand tests that the developer can manually run from their IDE, from their integration environment, from our web server integration environment, as well as our non-production region validators. Noodles then can take those same tests and translate them into actions for continuous canaries that run in production against every single production region. So what this ends up being is you write once and you run absolutely everywhere.
HiFi: Model-Based Testing with Automated Reasoning for Comprehensive API Coverage
Moving on from functional testing into more science-based testing: functional tests are the bread and butter of software development. However, they always suffer from the problem that they test the known knowns. Even when we don't intend to, we don't always cover all the possibilities of how our APIs are being interacted with. If you've owned even the simplest of APIs in public for more than a week, you know that your customers have used them in some way that you did not expect at all.
This is where automated reasoning and high-fidelity model-based testing comes into play. The basics of model-based testing is that we are starting with the specification or model of what we want the system to do, not necessarily how the system goes about doing it. The test validates that the system implements the specification rather than testing the implementation.
While this sounds simple when considering a small number of APIs and implementations, where this becomes critical is when you start increasing the number of dimensions and options per API and when you have multiple implementations that you expect to be kept in concert. Take an S3 put request for example. It has dozens of possible headers: content type, cache control, content encoding, server-side encryption options, ACL settings, metadata headers, and so on. Model-based testing doesn't just test each one of these headers individually, like what you would do in your functional test. It tests every combinatorial possibility.
What happens when you have conflicting encryption headers? What is the error precedence for multiple headers that are malformed? Do you throw back a 400 or a 401? Which error gets returned first? In S3's case, how do you compare S3 general-purpose buckets to S3 Express One Zone directory buckets, which have a separate implementation of the exact same API? Same thing with S3 on Outposts. Now, while there are some API differences between those three implementations, that's fine, but for the APIs that stay the same, customers are going to expect that they behave the same in every single implementation that you have.
Now let's get into what this looks like from a builder perspective. Our HiFi system has three major components. The first is the actual model or specification, and this isn't just a science document. This is an actual executable model or specification. When building the model, precision is the key and it's where you're probably going to spend most of your time doing the construction. Not only are you going to be looking at public documentation and your own written specifications, but you're going to want to spend some time validating that your customers are actually operating against that. You're going to want to do some log dives and research to see how customers are actually interacting with you today and actually see those permutations, not just rely on your specification.
If it's in your logs and customers are doing it today, it is part of your specification even if it's not written down somewhere. After the specification work, the model is not nearly as complex as a lot of people want to make it out to be. You can think about this as little more than a key-value store that's in memory. You can see some sample code right there for one of the very basic models. It's basically a key-value pair. The big thing is just making it executable, meaning that you can actually provide requests to it and it provides the response in accordance with that model.
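The slide's sample code isn't reproduced in this article, but as a hedged sketch of what "an executable key-value model" can mean, here is a toy specification reduced to a put/get subset. The operation names, header names, and error strings are assumptions for illustration, not S3's actual internal model.

```python
# A toy executable specification for a put/get subset of an object API.
class ObjectStoreModel:
    def __init__(self):
        self._objects = {}  # (bucket, key) -> body

    def handle(self, request: dict) -> dict:
        """Given a request description, return the response the spec requires."""
        op = request["op"]
        addr = (request["bucket"], request["key"])
        if op == "put":
            # Spec rule (assumed): conflicting encryption headers are rejected.
            if {"sse_kms", "sse_customer_key"} <= set(request.get("headers", {})):
                return {"status": 400, "error": "ConflictingEncryptionHeaders"}
            self._objects[addr] = request["body"]
            return {"status": 200}
        if op == "get":
            if addr not in self._objects:
                return {"status": 404, "error": "NoSuchKey"}
            return {"status": 200, "body": self._objects[addr]}
        return {"status": 400, "error": "UnsupportedOperation"}
```

The value is not in the model's sophistication; it is that you can feed it the same requests you feed the real service and get back what the specification says should happen.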
The next component here is the test generator, which is where the specification, the automated reasoning, and customer behavior come together. The generator service continuously queries the API specification and generates the requests. Remember our example earlier of all the permutations that were possible for just the S3 put object API? Well, here is where all of those combinations actually come together, and you get true, systematic coverage of all the APIs. We then utilize real customer behavior to shape those tests into workflows and combinations that customers actually use. Do all the combinatorial math along those lines, but make sure that anything your customers are actually doing becomes a key workflow. We're not just talking about headers. We're talking about when they do a put, they do a get, they head the object, they change the ACLs. Those standard workflows that span multiple APIs need to be tested as well, not just the individual API.
Finally, we also sample some arbitrary requests in production and rerun those so that we're continuously looking at new customer behavior that comes in. The last component of HiFi is the validator service. The validator service is fed tests from the generator and then executes against service endpoints or implementations under test. It runs the same test against the executable specification. Ultimately, it reports deviations from the model or specification. You get a nifty little divergences found report that you can deep dive into. You don't just want a nifty little science report. You also want to be able to hook this up to your CloudWatch alarms and set this up as a blocker for your pipelines. If you don't match the specification, you need to stop right then before customers go out there and end up using it wrong.
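As a rough sketch of how a generator and validator could fit together, assuming the toy model above, an illustrative subset of headers, and a made-up metric namespace rather than HiFi's real internals:

```python
import itertools


def generate_put_requests(bucket, key):
    """Enumerate combinatorial header permutations for a single put request."""
    header_options = {              # illustrative subset of possible headers
        "sse": [None, "AES256", "aws:kms"],
        "acl": [None, "private", "public-read"],
        "cache_control": [None, "max-age=60"],
    }
    for combo in itertools.product(*header_options.values()):
        headers = {k: v for k, v in zip(header_options, combo) if v is not None}
        yield {"op": "put", "bucket": bucket, "key": key,
               "headers": headers, "body": b"x"}


def validate(requests, model, implementation, cloudwatch=None):
    """Run each request against the model and the implementation under test."""
    divergences = []
    for request in requests:
        expected = model.handle(request)
        actual = implementation.handle(request)
        if expected["status"] != actual["status"]:
            divergences.append((request, expected, actual))
    if cloudwatch is not None:
        # Publishing a divergence count lets a CloudWatch alarm block the pipeline.
        cloudwatch.put_metric_data(
            Namespace="ModelBasedTesting",           # assumed namespace
            MetricData=[{"MetricName": "DivergencesFound",
                         "Value": len(divergences), "Unit": "Count"}],
        )
    return divergences
```

Publishing the divergence count as a metric is what turns the "divergences found" report into the pipeline blocker described above, for example by passing `cloudwatch=boto3.client("cloudwatch")`.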
In HiFi, we've also taken the write-once, run-everywhere perspective that we discussed earlier. Currently, we have our validator service running against our integration gamma, our regional non-production validators, and a couple of select production regions. We're actually going to be expanding this to hit every single region worldwide and run all of our model-based tests there as well. You might ask why you care about this if you've already blocked any model deviations earlier in the pipeline. This is particularly important for things like web services where you're integrating with external systems that are outside of your pipeline. You get into a new partition, a system has a new bit of code, and instead of returning a 401 Unauthorized, it returns a 400 Bad Request. Suddenly your system starts seeing a 400 where previously you would always get a 401. How your system reacts to some of those edge cases is why we want to check our models in production as well.
Performance Testing and S3 Rise: From Software Features to Production Qualifications
The last part of testing is performance testing. Some of you might be thinking how does performance really tie in with deployment safety? Sure, I care about it. I want my system to operate fast and everything along those lines. But performance, particularly when we start talking about performance as the peak amount of traffic your service can take on any individual hardware instance type, it becomes a different thing. Ultimately, performance testing gives you, when taken together with the forecast of your predicted customer traffic, your minimum capacity plan and your safe capacity plan. Of course, then we go into AZ redundancy.
S3 is a regional service, and you want to end up with exactly the amount of capacity you feel comfortable with in every single Availability Zone so that you can handle an AZ-down event. This is why we care about it. We have three phases of performance testing implementation inside S3. First is our software feature performance. This is done at the micro level, in an isolated environment owned by the team, and it measures micro-level performance. Next is instance performance. This is where software meets representative hardware types, but still in an isolated environment away from production. Finally, we actually produce per-region ratings, and we'll dive into each one of these here in a second.
First, software feature performance. This is micro performance, and it's best reasoned about by the individual service teams that are actually putting out the feature. Some organizations try to consolidate and have one central performance team that owns the whole world, and they struggle to reason about how individual APIs perform. Take something like a compression API. Let's say you were doing compression. The amount of compressibility or entropy inside the object is going to change how the compression and decompression algorithms actually behave. A lot of times a centralized performance team isn't going to understand how the implementation can change between Zstandard or whatever compression standard you're using. So it's really important that the teams own this up front.
Now, individual service teams generally don't have the exact representative hardware when you have a large fleet of a million servers and a bunch of different instance types out there. So what you can do is just give them standard hosts. You want to get them as close as possible, but what we do is run at least two iterations that swap hardware configurations to isolate software performance from any hardware variance. This cross comparison eliminates any sort of hardware bias. If software B consistently performs better, or worse, regardless of which host it runs on, it's a real software improvement or a real software regression. You don't have to worry about whether it was this host or that one.
Similarly, we also compare each dimension, each combination, and each API separately. We found that it's very easy for micro, individual-API performance issues to go completely unnoticed in larger scale testing environments. They basically get averaged out. Eighty percent of our requests are GET requests, so it takes a lot for a performance degradation to actually spike up in some of the broader tests. So it's really important that you spend some time and measure each one of these dimensions. We look at all of our different encryption types. We use different object sizes and compare those directly in an A/B fashion. So you end up with a big long list of all the different dimensions and how their performance compares from software A to software B. Then that single red line that says the performance regressed right here will give you the indication of where you need to dive in.
All right, the next part is production qualifications. So why do we do production qualifications? You have operating services, and it sounds a lot like I'm testing in production right now, and we sort of are. The first lesson we learned is that we have highly varied regional traffic profiles. Some of this is based on the age of the region and its well-established workflows. US East 1 has some of our oldest workflows, which have some of the largest object requests that we see inside S3. That is actually our largest technical difference from region to region inside S3. With each new launch region, the average is somewhere around 250 kilobytes per request, while in US East 1 the average request size will actually be over one megabyte. What this means is that as requests get smaller, individual hosts see higher TPS, and that higher TPS stresses CPU much more than it does bandwidth.
When request sizes increase and you're spending more time streaming, it starts to put pressure on your network bandwidth, and you have lower TPS. How you scale even the microservices that are off your main request path often depends on TPS or bandwidth. What this means is that the same software on the same hardware will behave completely differently in different regions just because customers are utilizing you in completely different ways. Obviously, capturing this deviation for capacity planning is essential to make sure you have the right amount of capacity, and the right type of capacity, in the right region.
With this sort of knowledge, you can understand what type of instance may work great in US East 1 and be the most efficient from a cost perspective versus, say, one of our European regions where we need more CPU and less network. This is why we built S3 Rise, our system that allows us to continuously qualify our production fleet safely. S3 Rise is a Step Functions orchestrator that integrates with S3's capacity control system. The S3 capacity control system integrates directly with our DNS settings, which allows S3 to change the DNS weight for the systems that are under test. As customers recycle their DNS, and of course high-performance systems are doing that all the time, they will start putting more and more traffic on that individual instance even if you're at your normal load for the day. We'll continue to ramp this up until the host actually begins to see resource exhaustion. Not to the point where we start breaching KPIs for latency or availability, but once you're hitting, say, 92% CPU and your bandwidth is at the right point, we sustain that for a bit and say, OK, that is the appropriate rating for this software and hardware combined with the customer profile there. We save that inside our S3 rating store, and we do this in every single region continuously.
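S3 Rise and the capacity control system are internal to S3, but the DNS-weight idea can be sketched with Route 53 weighted records plus a CloudWatch CPU check. The hosted zone ID, record name, instance ID, and the 92% threshold below are illustrative assumptions, not S3's actual implementation.

```python
from datetime import datetime, timedelta, timezone

import boto3

route53 = boto3.client("route53")
cloudwatch = boto3.client("cloudwatch")


def set_dns_weight(hosted_zone_id, record_name, set_identifier, ip, weight):
    """Raise or lower the share of traffic the host under test receives."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": set_identifier,   # identifies this weighted record
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )


def cpu_utilization(instance_id, window_minutes=5):
    """Read recent average CPU for the host under test from CloudWatch."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(minutes=window_minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return max(p["Average"] for p in points) if points else 0.0


# Illustrative ramp loop: keep shifting weight until CPU approaches exhaustion.
# weight = 10
# while cpu_utilization("i-0123456789abcdef0") < 92.0:
#     weight += 10
#     set_dns_weight("Z123EXAMPLE", "api.example.com.", "host-under-test",
#                    "203.0.113.10", weight)
```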
The S3 rating store output feeds our capacity forecasting, and our forecasting team can actually look at what sort of instances make sense and everything along those lines. It's continuously updated because customers do change. We've seen it quite a bit. Customers will migrate from one region to another for business reasons, and that can actually shift your overall traffic profile with large customers and change what's best for your region at that time. Of course, this is write once, run everywhere again: we run this in every single region on every single major instance type. Now, S3 is a pretty heterogeneous fleet. We have a lot of instance types, but we always make sure we cover our worst, our best, and any major components inside the fleet. If you have onesies and twosies, you probably don't have to rate those continuously, but everything else you need to do constantly.
Blast Radius Containment: Multi-Stage Deployment Pipeline and Monitoring with Canaries
That brings us back to the initial performance slide about how we drive capacity management. This rolls directly into blast radius containment. I'm not going to spend a whole lot of time on this. I'm sure that most of you understand blast radius containment for a stateless architecture, and Vandana is going to get into blast radius reduction a lot more when we start talking about stateful systems in a few minutes. But ultimately with a web server, we try to keep the changes contained to the smallest fault unit possible. Here's a quick look at the S3 web server pipeline. This is obviously a really rough draft. We start with pre-production testing, which covers everything that we discussed in the first portion. Then it goes into what we call validators. Now, I've said validators a number of times, but to be specific about what these are: they are production hosts in each production region that do not take production traffic. They take canary traffic only. Now, what we found, particularly in Amazon dedicated cloud environments where you have different partitions and different micro instances of services that you're integrating with, is that being able to load up the software and connect to all of its dependencies before you roll into the first production region is very beneficial. In most pipelines, you're not going to your Amazon dedicated clouds until after you do most of your commercial regions. So if you make sure up front that the software loads and can connect to all the dependencies, you'll cut off a whole bunch of errors where someone has fat-fingered or copied the commercial endpoint into a configuration.
After we go through validators, we do a first region. I'm not going to call it a sacrificial region because I get in trouble when I do that. But it is one region where we go to one box and spend a lot more time baking between the one box and each one of these availability zones in this first region because it's the first time it takes production traffic. Then we roll into US East 1. US East 1 is obviously our largest region and has the largest variation in different workloads and traffic patterns for us. By going in there, we're exposing the software to these varied traffic patterns early before it gets into 39 other regions, and we can make sure that it passes as a quality gate.
Immediately after US East 1, we start our exponential fan. Three or four years ago, we used to have six or seven stages of fanning. We've actually collapsed this with all of our shift-left testing. Now we really only have two waves. We do four regions after US East 1, and then immediately after those four, which include a GovCloud region, one of our China regions, and two commercial regions, it goes to the rest of the world all in parallel. This doesn't mean we're not doing the availability zone by availability zone deployment. We are going to all these regions, but each region will only have software patching or deployments to a single availability zone at a time for our web server.
Now, what are we doing while we're fanning out? We're going to start monitoring our deployments. You probably already have CloudWatch set up and understand some of these things. Here is my list of core metrics and essential alarms that you should have. Depending on what your service is, this will be available afterwards if you want to dive into it. There's also the AWS Well-Architected Framework if you want to dig into some of the very specific alarms. These are very easy to set up.
Here is my obligatory AI recommendation with CloudWatch metrics, and you can actually start iterating very quickly through it. All joking aside, it's really useful to have Amazon Q audit your alarms and everything along those lines and make sure, from a deployment standpoint, that they're all linked up with your deployments. If you're using CodeDeploy, it's a lot easier. CodeDeploy will let you tie the deployment metrics to your deployment groups and recognizes when it needs to roll back a deployment automatically. If you're not using CodeDeploy or something along those lines, use something like Amazon Q to go through, look at your metrics and alarms, and make sure they're actually tied to your deployments so it knows how to stop a deployment when something goes bad.
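As a hedged sketch of that wiring with boto3, assuming placeholder application, deployment group, load balancer, and threshold values: create an alarm on a core metric, then tell CodeDeploy to watch it and roll back automatically.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
codedeploy = boto3.client("codedeploy")

# Alarm on the 5xx rate seen at the load balancer (placeholder dimensions).
cloudwatch.put_metric_alarm(
    AlarmName="web-5xx-high",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)

# Tell CodeDeploy to watch that alarm and roll back automatically if it fires.
codedeploy.update_deployment_group(
    applicationName="my-web-service",          # placeholder application
    currentDeploymentGroupName="production",   # placeholder deployment group
    alarmConfiguration={"enabled": True, "alarms": [{"name": "web-5xx-high"}]},
    autoRollbackConfiguration={
        "enabled": True,
        "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
    },
)
```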
Monitoring your application service is only part of the equation. Application alarms that look at customer traffic struggle to detect issues that happen before your load balancer or before your application, or when something takes your application completely out and you're not reporting anything at all. Although you can and should have alarms on customer traffic bottoming out, many times it doesn't bottom out cleanly. The drop in traffic is hidden inside your normal ups and downs throughout the day, and you don't even know that there is some sort of silent failure going on. This is where canaries come in. We've talked about them a couple of times already. A canary is essentially your own customer client, one that you own and run, which pushes synthetic traffic to your public APIs from where you actually expect your customers' requests to come from. If your customers are mostly coming from outside of AWS, you need to make sure you have an instance outside of AWS calling in so that you can detect issues along that path. This prevents the silent failure category, which represents the worst operational events I've ever been in. When somebody says, "I don't know what happened. I just had a bunch of customers call me and say that they're having a bad day and can't get through," those canaries are going to actually see the 500 spikes just like your customer and give you a solid place to start.
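A minimal sketch of such a canary, assuming a placeholder bucket and metric namespace rather than S3's internal Noodles-generated canaries: exercise the public API on a schedule from where your customers run, and publish your own success and latency metrics.

```python
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

BUCKET = "my-canary-bucket"   # placeholder canary resources
KEY = "canary/heartbeat"


def run_canary_once():
    start = time.monotonic()
    success = 1
    try:
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"ping")
        s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    except ClientError:
        success = 0
    latency_ms = (time.monotonic() - start) * 1000.0
    cloudwatch.put_metric_data(
        Namespace="Canary/PublicAPI",     # assumed namespace
        MetricData=[
            {"MetricName": "Success", "Value": success, "Unit": "Count"},
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds"},
        ],
    )

# Run this on a schedule (cron, EventBridge, etc.) from outside your own network
# path so you detect failures in front of your load balancer, not just behind it.
```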
For S3, we talked about Noodles earlier and we talked quickly about those canaries. Again, we automatically build our canaries from those tests, so you basically get all of the tests that you run, from integration all the way through, as canaries automatically. I highly recommend that for your service teams.
Now, we built a lot of this ourselves, but you can get something very similar out of CloudWatch Synthetics. We don't use it in S3 simply because of dependency management when we're building a new region: S3 can't depend on CloudWatch Synthetics, because CloudWatch needs S3 and we have to be able to launch first. CloudWatch Synthetics has global coverage and direct integration with the rest of your CloudWatch alarms and metrics. It will automatically correlate canary and customer traffic alarms, and one big thing is that CloudWatch Synthetics will also give you the opportunity to test UIs. The example I have up here, which is one of the public examples that you can download, is a very quick sample that tests Amazon.com and actually goes into the web page.
Application Feature Controls: Feature Flags, Allow Lists, and Shadow Mode Deployments
We're going to our last section before I hand it off: application feature controls. Let's start with feature flags. Since this is the first day of re:Invent, this is the day that all the feature flags are flipping for AWS, turning everything on before tomorrow's keynote. Basically, engineers code the feature and deploy it while the feature is hidden, and then they gradually make the feature available by flipping the flag from false to true. In S3, we utilize AppConfig for all of our feature flags and our dynamic configuration capabilities.
It works like this: first you define the feature flags and any validations you have. Pretty simple along those lines. Then you write client code on your application side to dynamically pull these in. You can set this up to be event driven, meaning it updates any time you change the configuration, but I also recommend, and in this example we actually show, regular polling as well, just to make sure that all the changes actually get through.
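As a hedged sketch of the polling side using the AppConfig data API, where the application, environment, profile, and flag names are placeholders rather than S3's actual configuration:

```python
import json
import time

import boto3

appconfig = boto3.client("appconfigdata")

session = appconfig.start_configuration_session(
    ApplicationIdentifier="my-app",            # placeholder identifiers
    EnvironmentIdentifier="production",
    ConfigurationProfileIdentifier="feature-flags",
)
token = session["InitialConfigurationToken"]
flags = {}

while True:
    response = appconfig.get_latest_configuration(ConfigurationToken=token)
    token = response["NextPollConfigurationToken"]
    payload = response["Configuration"].read()
    if payload:                    # an empty payload means "no change since last poll"
        flags = json.loads(payload)
    if flags.get("new-checkout-flow", {}).get("enabled"):
        pass  # take the new code path guarded by the flag
    time.sleep(response.get("NextPollIntervalInSeconds", 60))
```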
Now let's talk about allow lists and deny lists. These are usually used in conjunction with feature flags in order to launch to a subset of users. This is where you see beta tests, alpha launches, internal launches, and everything like that. Particularly in S3, we use allow lists for our own dogfooding mechanisms. If we're going to launch a feature, the first thing we do is allow-list ourselves. We first start with S3's own accounts, then we move to AWS internal accounts, then we do the broader Amazon, and all of those will operate against the feature before we ever send it to customers. Then we can use the feature flag to roll it out during something like re:Invent.
The last tool I want to talk to you about is using shadow modes. In a shadow mode deployment, real-world production traffic is copied and sent to a new version of the service in parallel with your existing production service. The shadow service processes the request, but its responses are never sent back to the customer. Instead, they're logged or captured as metrics so that you can compare results between your primary system and your shadow system. The implementation is pretty straightforward. Right here I give you a load balancer example. API Gateway also allows you to do this kind of thing right out of the box, where you can split the traffic; this example is fifty-fifty. Now, personally I don't like using this.
I prefer to use allow lists and feature flags to roll those out. What I use this for is internal infrastructure improvements and generic software upgrades. If you're going to change your web server engine, at the end of the day you expect your customer traffic to be the same today as it is tomorrow. This is a really good way to make sure that everything is staying the same. You're updating your JDK and sure it's supposed to work, but this will actually give you evidence that yes, it did actually work the same way and you're not getting something weird coming back to your customers.
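As an illustration of the shadow pattern at the application layer, a hedged sketch might mirror each request to a shadow endpoint in the background, compare status codes, and return only the primary response. The endpoints and the comparison logic are assumptions for this sketch, not S3's mechanism.

```python
import threading

import requests

PRIMARY = "https://service.example.com"   # placeholder endpoints
SHADOW = "https://shadow.example.com"


def handle(path: str, body: bytes) -> requests.Response:
    primary_response = requests.post(PRIMARY + path, data=body, timeout=2)

    def mirror():
        try:
            shadow_response = requests.post(SHADOW + path, data=body, timeout=2)
            # The shadow's answer is never returned to the caller; we only
            # record whether it matched the primary.
            if shadow_response.status_code != primary_response.status_code:
                print(f"shadow divergence on {path}: "
                      f"{primary_response.status_code} vs {shadow_response.status_code}")
        except requests.RequestException:
            pass  # shadow failures must never affect the customer request

    threading.Thread(target=mirror, daemon=True).start()
    return primary_response
```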
The lack of shared state gives us a lot of options with the failure area. If one web server goes down, it's not going to affect that many customers or that many requests. Once you take it out of service, it's not going to affect any other requests that are going into any other service. This is all well and good if you don't have to worry about shared state and you're operating a web server, but what happens when you do have shared state? Say you're a durable storage service like S3.
Stateful Deployments: Data Preservation Through Integrity, Consistency, and Durability
For this, I'm going to hand it off to Vandana, who's going to come in and talk to us about the stateful deployments of S3. Stateful pipelines are those that deploy software to hosts that persist data, and we have to persist this data and keep it available and durable throughout the entire deployment process and beyond. I'm going to walk through what data preservation means, the specific risks to data during deployments, how we mitigate those risks, and dive into the details of how S3 performs maintenance activities on millions of hosts every month.
Let's start with why stateful deployments are different. Everything that George just talked about for stateless pipelines still applies to stateful pipelines. In addition, preserving the data on these hosts requires careful planning, careful maintenance, and an understanding of where the data lives and how it's organized. We are talking about our index hosts. These are the ones that manage and store S3 metadata.
That metadata is used to find the data on our storage nodes, and our storage nodes are where we process the actual data. Stateless hosts don't persist information; every request is independent, so they can be restarted at any time. Stateful hosts, though, we can't randomly restart in parallel. We have to understand exactly what data each host is responsible for, what the redundancy of that data is and how that redundancy is mapped, and ensure data preservation when we take that host offline.
Let's look at the key dimensions for data preservation. Integrity is about making sure that the data stays accurate and trustworthy over time. Think about it like a bank that guarantees that your transaction records are not tampered with or altered. Consistency means everyone sees the same information at the same time. Like when you're shopping on Amazon.com, you put something in the cart on your phone, go to your laptop, it's still there. Everything is the same. That's consistency.
Resiliency is the ability to recover from loss. It's similar to when you keep your photos in multiple places on your phone, backed up on your local storage, stored in the cloud. You're making your data resilient against failure. The more copies you have in different places, the better the tolerance to failure. Finally, data durability is the ability to protect data from loss or corruption over time, ensuring it remains intact and consistent even in the face of failure.
It's measured in terms of probability of loss, and of course, as you must have heard multiple times, Amazon S3 is engineered for 11 nines of durability. Data durability considerations are a continuous process from design and implementation to deployment. Every time we build a new feature or a new service, we do durability reviews. This helps promote a durability culture among teams, implement mechanisms to keep our customers' data durable, and provide real-time visibility into factors that threaten data durability. One of the areas of focus during these reviews is deployments and having clear paths for both successful and failed deployments.
Successful and failed deployments require the ability to roll back and roll forward without impacting any data. The goal is that data must stay available before, during, and after deployments. To preserve data integrity, we use several specific methods. End-to-end checksums work like digital fingerprints that travel with your data everywhere it goes, from the moment you upload it through all our processing, storage, and yes, even through deployments. They help detect bit flips in memory and bit rot on disk, and background processes can use them to validate the accuracy of the data.
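As one concrete, hedged example using S3's public additional-checksums feature (the bucket and key are placeholders, and S3's internal mechanisms go well beyond this), you can ask S3 to compute and store a SHA-256 alongside the object and then verify it end to end on read:

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")
body = b"important customer data"

# Upload with a SHA-256 checksum; S3 validates it and stores it with the object.
put = s3.put_object(
    Bucket="my-bucket", Key="records/001",
    Body=body, ChecksumAlgorithm="SHA256",
)

# On read, ask S3 to return the stored checksum and compare it to a local hash.
get = s3.get_object(Bucket="my-bucket", Key="records/001", ChecksumMode="ENABLED")
data = get["Body"].read()
local = base64.b64encode(hashlib.sha256(data).digest()).decode()
assert local == get["ChecksumSHA256"] == put["ChecksumSHA256"]
```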
Consistency is another critical component. When you upload or delete objects using S3 APIs, the operation is atomic, so it either succeeds or fails with no partial upload. S3 has strong read-after-write consistency, which means you have a uniform experience. When you read your data, you always get the latest version. This consistency is maintained even when we are updating hosts in the background.
For resiliency, we are constantly monitoring system health and data integrity. If we detect any compromised data, we automatically initiate recovery processes and start repairs to restore resiliency. The goal is that your data stays available through all processes: normal operations, deployments, and even post-deployment. Finally, redundancy is key to achieving eleven nines of durability. We store data across three availability zones using a combination of replication and erasure coding. We use durability simulation models that factor in real hardware and the annual failure rates of various components, and we have extensive data about that in S3. We also take into account the repair time and mean time to recovery to determine exactly how much redundancy we need.
While we build for correctness, we still need to monitor for threats to our data's integrity, availability, and durability. We channel the methods used by security experts who create security threat models to strengthen their systems. In a similar fashion, we create durability threat models to help us identify every possible way that data could be compromised, and then we plan actions to reduce the impact of these threats.
Durability Threat Models: Preparing for Hardware Failures, Software Bugs, and Human Errors
One of these threats is hardware failure. Storage drives wear out, power supplies fail, or bit flips escape memory error correction. We can partially mitigate this with redundant hardware to replace the failed units. However, if a drive or host fails during the deployment process, the data associated with that host will need to be rebuilt.
Another threat is software bugs. These are bugs that escape all the testing, but they are typically edge cases or unlikely conditions that occur because of some changed behavior or timing change when a new deployment is made. Bugs like incorrect error handling could impact data, or version incompatibility can compromise data preservation operations. Our impact reduction strategy is using data validation tools to catch the problem and trigger recovery processes.
Operator errors and human errors can result in accidental deletion or maintenance requests being scheduled on too many hosts. To reduce the impact of human errors, we continue to automate our processes. Impact mitigation is a continuous learning and development process to reduce the probability of failure, but there are no perfect solutions. S3 deployments span across millions of hosts. Let's assume we have a 99.95% success rate. For one million hosts, that means 500 deployments will fail. That can translate to 500 times that we have a poor customer experience or there is a potential threat to data. Now that is a problem.
If you have 10,000 hosts, that can still mean 5 bad customer experiences, which is still a problem. Let me go back for a moment to the often-quoted Murphy's Law. Anything that can go wrong will go wrong, which comes from aerospace engineer Edward Murphy Jr. when a technician made a costly error during a safety-critical project. Murphy's Law, though, isn't about pessimism. It's about being prepared. It's a mindset that drives thoroughness and proactive problem solving to prevent disasters before they happen. This philosophy really shapes our threat mitigation strategy. We start with the assumption that every deployment will fail, not just the few that statistics tell us, but every deployment. This becomes a P-100 solution because now we are prepared to handle the impact of that failure.
For stateful pipeline deployments, this translates into the assumption that the data on that host will not be available after the deployment. We need the ability to restore the resiliency of that data and have enough redundancy in place before we start the deployment to rebuild and restore the content of the host being updated. Now this would be very simple if you could update one host at a time, but obviously that's impractical. Instead, we serialize deployments by availability zone and create carefully constructed groups for parallel upgrades.
Host Reservation System: Coordinating Safe Maintenance Across Millions of Stateful Hosts
This grouping process isn't random. We need to consider the initial data placement policies that determine where information lives, the physical location of hosts, and how these locations relate to other hosts already in the reservation group. For example, if three replicated copies of a piece of data live on a particular set of hosts, those servers will not be part of the same reservation group. So that brings us to our actual deployment workflow. S3 manages maintenance operations through a reservation-based model coordinated by a host reservation system. Think of it as air traffic control for storage infrastructure.
No maintenance happens without permission. The host reservation system controls deployment access to all stateful hosts by requiring both humans and automated systems to obtain exclusive reservations before performing any maintenance. This includes all planned activities like patching, firmware and software updates, and hardware maintenance. It also includes the automated remediation workflows that respond to detected issues and ad hoc operator work during emergency situations. Every operation has to go through the same reservation process, whether it's scheduled or a critical response. This is why we can make sure that we never exceed safe maintenance thresholds across our fleet, and it's part of S3's operational model for disciplined, coordinated access to our infrastructure.
When the reservation system receives a maintenance request, it performs a series of safety checks. First, it examines the current reservation state of the target host because multiple actors may simultaneously have requested maintenance on the same host. Next, it validates there's sufficient offline capacity to sustain taking this host out of service without impacting performance. It also ensures that approving this request won't push us beyond our fleet-wide maintenance limits. After all of these checks pass, the reservation system checks if it's actually safe to deploy. But since it operates at the physical infrastructure layer, it lacks visibility into the data mapping on the host.
So who has the actual data placement information? The service itself. Both the index and metadata storage service and the data storage service know the current health of the host and the projected impact from maintenance based on the data that persists on that host. The host reservation system checks with the service whether it's safe to proceed, ensuring that the data's resiliency will not be compromised if the host fails to come back up.
S3 has developed custom software specifically designed for this decision-making process. Every service comprehensively gathers intelligence about each host, its current health status, allocated resources, and the specific data it persists. The service performs additional safety checks. It verifies that no conflicting deployments or maintenance activities are in progress, maps the host's physical rack location within the data center and availability zone, and then correlates this information with the other hosts reserved for deployment. The service, with its internal knowledge of data placement and redundancy patterns, is the one who authorizes whether hosts can be reserved for deployments in a resilient way.
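The host reservation system and the service-side placement logic are internal to S3, but a rough sketch of the kind of safety check described here, with assumed data structures, check names, and a made-up 2% fleet-wide maintenance limit, might look like this:

```python
from dataclasses import dataclass


@dataclass
class Host:
    host_id: str
    rack: str
    availability_zone: str
    replica_groups: set   # IDs of the data placement groups this host serves


def safe_to_reserve(host, reserved_hosts, fleet,
                    max_fleet_fraction_in_maintenance=0.02):
    """Return True only if every safety check for this maintenance request passes."""
    # 1. The host must not already be reserved by another actor.
    if host.host_id in {h.host_id for h in reserved_hosts}:
        return False
    # 2. Fleet-wide maintenance limits: don't exceed the allowed offline fraction.
    if (len(reserved_hosts) + 1) / len(fleet) > max_fleet_fraction_in_maintenance:
        return False
    # 3. Data resiliency: no other reserved host may hold a replica of the same
    #    data, so redundancy survives even if this host never comes back.
    for other in reserved_hosts:
        if host.replica_groups & other.replica_groups:
            return False
    return True
```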
Once all safety checks have passed and we've confirmed it's safe to deploy, the system reserves the host and marks it as available for maintenance to begin. But what if the checks fail? This request has to be retried. The fleet update component in our deployment workflow is responsible for handling both cases, either initiating the retry logic or starting the actual deployment when the reservation succeeds. Fleet Update is a regional scheduling service that orchestrates all maintenance for stateful hosts, including patching, firmware, software deployments, and hardware maintenance. It can intelligently combine these activities to minimize downtime.
Its built-in safety controls, velocity controls, and casualty tracking help prevent widespread issues. Plus, it allows us to pull the emergency cord that George talked about at any time to halt deployments, switch hosts to read-only mode, or handle multiple failures simultaneously. So is this all theory, or do these techniques actually yield results? I'll share some real S3 data from storage hosts in one availability zone and the results of two of our risk mitigation strategies. The first was transitioning from being in a reactive mode to building durability threat models and being proactive in building guard rails. The second was adding automated safety checks to replace manual interventions.
Operator escalations during deployments in that particular availability zone went from 478 in 2022 to less than 10 in 2025. To ensure deployment safety, testing strategies and containing blast radius are of course primary. You can reexamine the processes you have in place to check if there are additional testing strategies, such as science-based testing, that can provably show the correctness of your software and improve your customer experience. You can also take additional actions to reduce the blast radius in how you group your hosts for deployment.
Not all regions are equal, so be aware that capacity, performance, and networking can all vary between regions. Take that into account for your pipelines. Application controls like emergency cords and feature flags are safety nets that help prevent crises from becoming major disasters. If your pipelines don't already have mechanisms that allow you to stop deployments or roll back changes at a moment's notice, it's worth adding those. When deploying to stateful hosts, additional considerations around data preservation are necessary.
Assuming failure can be powerful as a proactive solution. Failure simulations assist in understanding the resiliency and redundancy of the data. The techniques we've shared today represent years of learning from both successes and failures, and hopefully some of these techniques can be applied or adapted to your environments. Thank you for choosing to spend this time with us. Please take a moment to fill out the session survey in the mobile app, and George and I will be at the back if you have any questions. Thank you again.
This article is entirely auto-generated using Amazon Bedrock.