AWS re:Invent 2025 - Deep dive on Amazon S3 (STG407)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Deep dive on Amazon S3 (STG407)

In this video, Seth Markle and James Bornholt from Amazon S3 deliver a deep dive on designing for availability. Seth explains system-level architecture using quorum-based algorithms, detailing how S3 achieved read-after-write consistency in 2020 through a replicated journal and witness system while maintaining high availability across 120 availability zones. James covers implementation-level failure handling, discussing correlated failures, gray failures, and metastable failure modes like congestive collapse. Key insights include converting availability problems into latency problems, never allowing local systems to make global health decisions, and using sophisticated retry strategies with backoff mechanisms to prevent request amplification in distributed systems.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Designing Amazon S3 for Availability and the Evolution to Read-After-Write Consistency

Everybody, welcome. I hope everyone's having a nice re:Invent so far. We're just about at the halfway point. I'm Seth Markle. I'm here today with James Bornholt, and we're doing a deep dive on Amazon S3. We've been using this talk series for a couple of years now to get into some lower-level details of how S3 works. Two years ago, I talked about the fundamentals of S3. Last year James and I talked about how we use the scale of S3 to our advantage and yours as a customer. This year, we're going to be talking about how we design for availability.

Thumbnail 40

We're going to come at this from two angles. I'm going to start out talking about the system level view, about how we think about failure at the architecture layer. I'm going to take us through how we designed for read after write consistency as an illustration of this. Then James is going to talk us through the server level view about how we think about failure at the implementation layer of our system.

Thumbnail 70

Before we begin, let's define some terms. Availability is all about dealing with failure. To understand this, you have to define both failure and what you mean by dealing with it. To understand what I mean by failure, let's get into some different ways to look at S3.

Thumbnail 80

The conceptual view of S3 is that it's a storage service that holds over 500 trillion objects, stores hundreds of exabytes of data, and serves hundreds of millions of transactions per second. These are all abstract terms. You can't touch an object. So what is S3 in concrete terms?

Thumbnail 100

In concrete terms, S3 is drives and servers which sit in racks, which sit in buildings. We manage tens of millions of hard drives across millions of servers in 120 availability zones across 38 regions. When we talk about failure, it's these components that make up our fault domains.

Thumbnail 120

Within a drive, surface defects can cause individual reads to fail, or individual drives can stop working altogether. Servers can suffer from bad fans, racks can lose power, or buildings can catch fire. All these things can fail. Failures can be permanent loss of a component or transient unavailability, which could be due to power issues, networking issues, or simply an overloaded component, network, or CPU. These are the sorts of failures that we're talking about today.

Thumbnail 160

Now that we understand what can fail, we need to define what it means to deal with it. This is all about design goals. You set your design goals for your system, and then you design your system around those goals, and then you implement the system according to those designs.

Thumbnail 170

Here are some of S3's design goals. If you're familiar with our product page, you'll have seen phrases like "designed for 99.99% availability" and "designed for 11 nines of durability." We also provide strong read-after-write consistency.

Thumbnail 190

Some of you will remember that S3 wasn't read after write consistent until 2020. This is important for this talk because what this meant for us back then was that an acceptable way to deal with failure was to violate consistency guarantees since we had none. Let's take a look at S3 before the consistency launch to see how we dealt with failure beforehand and what we had to change afterwards as we implemented it.

Thumbnail 210

S3's Indexing Subsystem: How Quorum-Based Replication Enables Fault Tolerance

What I mean by consistency here, so we're all on the same page, is the property that an object get reflects the most recent put to that object. To understand how this works in S3, we have to start with our indexing subsystem. Our indexing subsystem holds all of your object's metadata. These are things like its name, tags, and its creation time.

Thumbnail 240

The index is accessed on every single get, put, list, head, and delete. Every single data plane request goes to this index. More requests go to the index than our storage subsystem because head requests and list requests don't need to go into storage. At the core of the indexing system is a storage system dedicated to the index that's responsible for durably holding these index entries.

Thumbnail 270

Here's a simple illustration of how we store metadata in the indexing system. The data is stored across a set of replicas using a quorum-based algorithm, and as we'll see in a moment, a quorum-based algorithm is very forgiving of failures. You'll see a QR code in the bottom corner here, which links to a particular academic paper. Throughout this talk, James and I will be linking to various papers that are relevant to the systems that we're presenting. If you miss taking a photo here, these will be up on YouTube when the talk is posted, and you can capture them there too.

Thumbnail 310

Here's how our implementation of quorum works in the index. We start with servers running in separate Availability Zones, which allows us to avoid any correlation on a single fault domain. Because we spread everything across multiple failure domains, the failure of any single disk, server, rack, or zone only affects a small subset of data, and it never affects all of the data for a single object, or even a majority of it.

Thumbnail 350

Thumbnail 360

Thumbnail 380

The rules when interacting with this system are straightforward. Reads and writes are only required to hit a majority of servers. Let's walk through an example. Say you want to write the value A, and remember these are the metadata values for your objects. In this case, the write succeeds on all nodes, so all the nodes have the value A. Now let's say you want to write B. When you're writing B, there's a failure on the node on the right, which for whatever reason is failing to receive B. However, B succeeds on a majority of the servers, so there's no availability impact in this situation. This is the resulting system state.

Thumbnail 390

Now we have a reader who initiates a read. In this case, one of the servers fails to return, and it's a different server from the one that failed last time. However, the other two do return, so the reader sees the value B. You might notice that in this case the reader sees both A and B, but it can reason that B wins because of its associated timestamp, so there's conflict resolution in the system. We saw both reads and writes succeed here despite servers failing in the middle of requests. There was no availability impact.

Thumbnail 430

This system is available because there are several nodes that we can talk to and there's an allowance for failure. That's quorum-based systems in a nutshell, and it's the central concept in our system-level availability design. We have several nodes to route to and we have headroom for failure. We configure and size this system, either the number of replicas or the amount of headroom, to achieve our 99.99% goals.
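To make the quorum idea concrete, here's a minimal Python sketch of majority reads and writes with timestamp-based conflict resolution. It's an illustration of the concept only, not S3's index implementation; the `Replica` class and the in-process failure simulation are assumptions for the example.

```python
import time

class Replica:
    """One index replica holding a (value, timestamp) pair per key."""
    def __init__(self):
        self.store = {}
        self.alive = True  # flip to False to simulate a transient failure

    def write(self, key, value, ts):
        if not self.alive:
            raise ConnectionError("replica unavailable")
        self.store[key] = (value, ts)

    def read(self, key):
        if not self.alive:
            raise ConnectionError("replica unavailable")
        return self.store.get(key)

def quorum_write(replicas, key, value):
    ts = time.time_ns()
    acks = 0
    for r in replicas:
        try:
            r.write(key, value, ts)
            acks += 1
        except ConnectionError:
            pass  # a minority of failures is tolerated
    if acks <= len(replicas) // 2:
        raise RuntimeError("write failed: no majority")

def quorum_read(replicas, key):
    responses, values = 0, []
    for r in replicas:
        try:
            result = r.read(key)
            responses += 1
            if result is not None:
                values.append(result)
        except ConnectionError:
            pass
    if responses <= len(replicas) // 2:
        raise RuntimeError("read failed: no majority")
    # Conflict resolution: the value with the highest timestamp wins.
    return max(values, key=lambda vt: vt[1])[0]

replicas = [Replica() for _ in range(3)]
quorum_write(replicas, "my-object", "A")   # succeeds on all three
replicas[2].alive = False
quorum_write(replicas, "my-object", "B")   # one node misses B; still a majority
replicas[2].alive = True
replicas[0].alive = False                  # a different node fails during the read
print(quorum_read(replicas, "my-object"))  # prints "B"
```

Because any read majority and any write majority must share at least one replica, the reader always sees the latest acknowledged write in this sketch.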

The Caching Challenge: Why S3 Wasn't Initially Read-After-Write Consistent

Some of you might have spotted that readers see the values written by the writers in the example I showed. That's because reads and writes always overlap since they both require hitting a majority of servers. You can't have two requests both hitting a majority and have them not overlap. You might be wondering, if this is what S3 did before it launched consistency, why wasn't S3 read-after-write consistent? We should be able to read the writes because they're overlapping. The answer to that is our caching layer.

Thumbnail 480

Thumbnail 500

Thumbnail 510

S3's front end heavily caches frequently accessed objects in the system. Let's walk through how that cache works. We start out with a set of empty cache nodes. Remember, the values we have in storage from earlier are sitting here, B, B, A, just like we had before. They're sitting underneath the cache. A reader reads through one of the cache nodes at random, and the value B gets returned, just like it did before, to the reader through the cache node. So the value B is resident in the cache. Now a write for C comes in, which is an overwrite of your object.

Thumbnail 530

Thumbnail 540

In this case, it goes through a different cache node, because our front-end servers sit behind the IP addresses that you get back from DNS, so you're effectively routing to our front end at random. The value C gets persisted durably down to the replicas. In this case it succeeded on all the replicas, so you see C, C, C. But now we have values B and C cached on separate nodes. This is our design before 2020. So now when a read comes in, it could route at random, and in this case it goes to the first cache node, the one holding B, causing an inconsistent read. The value B is returned instead of C. C was the last value written, but B was returned, and this is because reads and writes are not overlapping in this design.

Thumbnail 560

With quorum at storage, we saw reads and writes overlap, but in the cache they don't. In our cache, as with quorum, we have this key availability property: several hosts can receive requests and there's an allowance for failure. But unlike with quorum, there's no overlap between reads and writes, and this was acceptable at the time. It was an acceptable way to deal with failure because our design goals didn't include consistency. However, consistency was probably the most requested feature that we had at the time.

Thumbnail 590

Building Cache Coherency: The Replicated Journal and Witness System

We set out to solve this overlap problem by building a cache coherency protocol. This protocol had to be fast, efficient, and available, which meant we needed to retain that property that multiple servers can receive requests while some are allowed to fail. The way we decided to do this was with a replicated journal.

Thumbnail 610

Thumbnail 620

Thumbnail 630

Thumbnail 640

Let's look at how this works. The replicated journal is a distributed data structure where nodes are chained together such that writes into the system flow through the nodes sequentially. The writer sends a value through the journal—the same A as before, which is your object's metadata. A flows through the nodes in the journal sequentially, with every node forwarding to the next node. Once it flows through the journal, it gets sent to a quorum of storage nodes just as it did before. When a subsequent write comes in, it also flows through the journal into the storage nodes. What you can see here is that although the storage nodes only have the latest value, the journal keeps track of the recent history. Every node agrees on that ordering because they're sending requests to each other in sequence. In this example, A comes before B.

This reasoning about ordering wasn't possible with just our quorum-based system from earlier. If A and B were written concurrently, some nodes might see A then B, while others see B before A. Even the timestamp we used for conflict resolution wasn't sufficient, because messages could be arbitrarily delayed. No single node in the quorum-based system was able to reason about ordering correctly, or consistently with the other nodes. By sending all writes through this journal, we create a well-defined ordering for mutations that come into the system. This ordering is the key building block for consistency.
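Here's a minimal sketch of that chained-journal idea: writes flow through the nodes one after another, so every node records the same order. The `JournalNode` class and the three-node chain are illustrative assumptions, not S3's actual journal.

```python
class JournalNode:
    """One node in the chained journal; it records each write locally and
    forwards it to the next node in the chain."""
    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node
        self.log = []  # every node ends up with the same ordered history

    def append(self, value):
        self.log.append(value)
        if self.next_node is not None:
            self.next_node.append(value)

# Build a three-node chain: head -> middle -> tail.
tail = JournalNode("tail")
middle = JournalNode("middle", tail)
head = JournalNode("head", middle)

head.append("A")  # object metadata mutations flow through the chain in order
head.append("B")

# Because writes travel node to node, all nodes agree that A came before B.
assert head.log == middle.log == tail.log == ["A", "B"]
```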

Thumbnail 700

Thumbnail 730

Thumbnail 740

It allows us to establish a watermark for writes to S3. All writes get assigned a sequence number, which increases over time. In our example, A has sequence number 1, B has sequence number 2, C has 3, and so on. When the storage nodes are written to, they learn the sequence number of the value along with the value itself. On subsequent reads, such as through the cache, the sequence number can be retrieved and stored. You can see here that in the cache, along with the value, we're also storing that sequence number. These sequence numbers plus the well-defined ordering allow the cache node to ask this question before serving requests: did any writes for this object arrive after this sequence number? Because of the journal, we're able to answer this question without talking to storage nodes.

Thumbnail 760

In order to answer this question, we built a system that we call a witness. The sole purpose of the witness is to track the high water mark for writes to the index. There are a couple of simplifying assumptions for this witness system. First, the witness doesn't need to hold the actual data. It just needs to hold the sequence number because it only has to answer whether any writes came in. It doesn't have to tell you what the write was. It's always safe for the witness to tell the caller that they're stale, because in that case, they're just going to read from storage anyway and get a correct result.

Thumbnail 800

Thumbnail 820

Thumbnail 840

Thumbnail 850

The witness can keep an approximation of the right answer as long as it overestimates the last sequence number. This leads us to a relatively simple design where the witness is just an in-memory data structure—literally just an array of integers—that sits next to the journal and allows the cache nodes to ask that key question: did any writes come in after the value I have cached? With the witness, we can modify our read and write algorithm to look like this. Writes go to the journal as before, and then to the witness. Reads talk to the witness, and if the witness says you're fine, then the reads can return from the cache. If the witness says no, you're stale, they can go to storage. Whereas with the early cache design we were available but inconsistent due to non-overlapping reads and writes, now we have overlapping reads and writes. They overlap on that witness.
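Below is a rough sketch of how a sequence-number witness lets a cache answer the "am I stale?" question. The class names and the dict-backed storage are assumptions for illustration; the talk describes the real witness only as an in-memory array of integers sitting next to the journal.

```python
class Witness:
    """Tracks the highest sequence number written for each key.
    Overestimating is safe: it only forces an extra read from storage."""
    def __init__(self):
        self.high_water = {}

    def record_write(self, key, seq):
        self.high_water[key] = max(self.high_water.get(key, 0), seq)

    def is_fresh(self, key, cached_seq):
        return self.high_water.get(key, 0) <= cached_seq


class ConsistentCache:
    def __init__(self, witness, storage):
        self.witness = witness
        self.storage = storage   # dict-like stand-in: key -> (value, seq)
        self.entries = {}        # key -> (value, seq)

    def get(self, key):
        cached = self.entries.get(key)
        if cached and self.witness.is_fresh(key, cached[1]):
            return cached[0]            # safe to serve from cache
        value, seq = self.storage[key]  # stale or missing: go to storage
        self.entries[key] = (value, seq)
        return value


storage = {"my-object": ("B", 2)}
witness = Witness()
witness.record_write("my-object", 2)
cache = ConsistentCache(witness, storage)
print(cache.get("my-object"))        # "B", now cached with sequence number 2

storage["my-object"] = ("C", 3)      # an overwrite lands in storage...
witness.record_write("my-object", 3) # ...and the witness learns about it
print(cache.get("my-object"))        # the cache sees it's stale and returns "C"
```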

Thumbnail 860

Dynamic Reconfiguration: Restoring Failure Allowance Through Quorum-Based Configuration

However, with what I showed you, we actually lost our failure allowance. Remember that key property of system-level availability: we need to have a failure allowance in the system, so we're not available yet. Here's the original architecture of the journal that I've shown you. What happens if one of the nodes fails? In this case, the tail of the journal has failed, and writes can't progress through the system. This is a very different situation from a node in a quorum failing. The difference is that the journal sends writes through its nodes sequentially, whereas the quorum services writes concurrently. If a node in a quorum fails, the others can still respond to the caller. But in this case, if a node in the middle of the journal fails, there's no one to forward the request on. If a node dies, the whole system can halt. We've lost our failure allowance.

Thumbnail 910

Thumbnail 940

To fix this, we introduced dynamic reconfiguration into the system. The nodes in the journal pay attention to each other's availability. They're pinging each other all the time, with messages going through the system constantly. We're serving hundreds of millions of transactions per second, so every node has an up-to-date view of its neighbor's availability. When they encounter an issue talking to each other, they ask a quorum-based configuration system to reconfigure the journal. This all happens within milliseconds of a node failing, and the configuration system itself is quorum-based, similar to those storage nodes we had originally.

Thumbnail 960

Thumbnail 980

High availability in these systems always comes back to quorum, even for systems like the journal where there's no apparent quorum. It's a chain replication algorithm, and the witness system is quorum-based as well. Now we have a cache with a failure allowance and overlapping reads and writes, which is what we ended up launching. This allowed us to change our correctness goals, which now means that S3 is both consistent and highly available. We retained high availability through the consistency launch, but we had to design for it.

Thumbnail 1000

Thumbnail 1010

To recap how we think about availability from the system-wide perspective, you need many servers to choose from while only being required to succeed on some. You need the ability to reconfigure the system quickly in the face of failure. Quorum-based algorithms are always lurking somewhere, even when it's not obvious, like with our journal. There's always quorum somewhere in these systems. With that, I'd like to welcome James onto the stage to talk about how we deal with failure at the implementation layer of our system.

Thumbnail 1040

Thumbnail 1050

Thumbnail 1060

Understanding Correlated Failures: Physical and Logical Failure Domains in S3

Good morning. Thanks for coming out. My name is James, and I'm an engineer on the S3 team. I want to talk about how we deal with failure at the implementation level of our system and how we design for failures on the nodes themselves. The first question we might ask when we talk about failure is, what can fail? Seth told you about how we have racks and availability zones, and I want to make that idea a little more concrete. Failure isn't just one node failing at a time. Often the most important failures we have to deal with are correlated failures, failures that happen together.

Thumbnail 1080

Thumbnail 1090

Let's start by thinking about physical failures in a system. Seth showed you a version of this picture already, but here's a physical view of S3 at a super high level. We have multiple availability zones, each with multiple racks of servers, and each server has multiple hard drives. There are lots of different things in the system that can fail. The simplest version is individual failures. Hard drives might fail on their own, maybe the motor breaks or the platter gets scratched. Instead of a hard drive failing, maybe an individual server could fail. Even though this is only one component, like maybe the CPU is broken, it appears as multiple failures because all of the hard drives attached to that server appear as if they've failed and become unavailable to the system.

Thumbnail 1120

Thumbnail 1130

By the way, if you saw Andy Warfield's talk yesterday, there's a really cool thing we're doing to design around this with metal volumes, attaching drives to different servers to work around this particular kind of failure. But this is a correlated failure: all of the hard drives attached to this server have failed at the same time. You can build these pictures up from here. If an entire rack fails, that's as if all the drives in that rack had failed; maybe the switch in the rack is bad. The worst case for us is that the entire availability zone fails; maybe the power goes out. These are all correlated failures, and they're essential to thinking about availability.

Thumbnail 1160

If you think about Seth's design for quorum, it's okay for one node to fail, but if all the nodes were in the same availability zone or if they're all on the same rack, then your availability property is gone. You've lost your failure allowance because they all failed together. Correlated failures are a really important mechanism for thinking about what it means for our system to be available. But it's not just physical failures. There are other kinds of failure domains we think about as well.

Another one that might be less obvious is logical failures. For example, when we deploy new software to our fleet, we don't do it all at once. We deploy it to a few nodes at a time and ramp up from there. The set of servers that have the new software is effectively a failure domain: if there's a bug in the new version of the software, all those servers are probably going to fail together. We have to be reasonably intelligent about which servers get deployed to first and how we ramp that up, so that if there's something wrong with that software, we can tolerate that level of failure across all of those servers as well.

Thumbnail 1200

How do you design around this? Our job when designing around correlated failures is to think about how we expose workloads to different levels of failure. For example, when you upload an object to S3, you probably already know that we replicate that object.

Thumbnail 1210

We don't just store one copy of it; we store it multiple times. That replication is really important for durability. We design for 11 nines of durability, but that replication is also crucial for availability because it means that if any of these correlated failure domains fail—for example, if AZ1 in this picture fails—there's still a copy somewhere else, so the data is still available even though an availability zone has failed, or a rack has failed, or a server has failed, and so on. Designing for correlated failure starts with making sure that your workloads are exposed to multiple failure domains. They're not all exposed to the same failure domain, and they can't all fail together.
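As a toy illustration of exposing a workload to multiple failure domains, here's a small Python sketch that places replicas on servers in distinct availability zones. The server/AZ data structure is an assumption for the example, not how S3 actually tracks placement.

```python
def place_replicas(servers, copies=3):
    """Choose `copies` servers that all live in different availability zones,
    so no single-AZ failure can take out every copy of an object."""
    chosen, used_zones = [], set()
    for server in servers:
        if server["az"] not in used_zones:
            chosen.append(server["id"])
            used_zones.add(server["az"])
        if len(chosen) == copies:
            return chosen
    raise RuntimeError("not enough distinct failure domains for placement")

fleet = [
    {"id": "srv-1", "az": "az1"}, {"id": "srv-2", "az": "az1"},
    {"id": "srv-3", "az": "az2"}, {"id": "srv-4", "az": "az3"},
]
print(place_replicas(fleet))  # ['srv-1', 'srv-3', 'srv-4']
```

The same idea applies at finer granularity: using a rack or server identifier as the "zone" key spreads copies across those smaller fault domains too.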

Thumbnail 1240

Fail-Stop Failures: From Simple Power Loss to Crash Consistency Challenges

Thinking about failure domains like this helps us understand what can fail, but what does failure actually mean? What does it mean for a server, a rack, a hard drive, or a switch to fail? The simplest version that you probably have in your head is what we would call a fail-stop failure. The way to think about this is basically that the power got yanked, the power cord came out of the server, that kind of level of failure. It's simple because it behaves in the way you might expect. The server just stops doing anything. You can't reach it, and you can't do anything anymore. That means it's pretty easy to detect, because the server is just not serving traffic anymore.

It's also reasonably easy to react to, kind of in the design that Seth was showing you. If you've designed a system that's tolerant to some amount of failure, and that server just goes away, that's okay. Generally, you can do a design that can tolerate a certain level of these fail-stop failures, these failures where the server just goes away. It gets a little bit more tricky when we start thinking about components that are not just servers though.

Thumbnail 1300

Let's think about a different kind of fail-stop failure. Maybe the switch in between these two availability zones has failed, and it's failed in a fail-stop way, like it's just not accepting packets anymore and it's not routing packets anymore. So it is a fail-stop failure in the switch itself. In the system, it looks a little bit different because some requests in the system are going to continue succeeding. If I need to talk to a server in the same availability zone to serve a request, that request is going to work fine because it doesn't have to traverse the switch that has failed. But other requests are going to start failing.

Thumbnail 1330

Even though a request came from the same source and started on the same server, if I had to traverse that link and that link is gone, that's a failure. Now I have this kind of fuzzy failure mode where some requests are succeeding, some requests are failing, and it kind of depends on a property of the request. This can be quite hard to detect. Designing around this one looks a little bit different. It's all about redundancy. In reality, we don't have one switch between availability zones; we have multiple.

Thumbnail 1350

Thumbnail 1360

A really interesting part of the design that we have here is that we actually have three availability zones in all of our regions, and they're kind of linked together as a ring. There's a cool property where if that failure occurs on that link and we can't reach AZ2 from AZ1, we can always just go the long way around. We can go from AZ1 to AZ2 via the rest of the links. Now, this is worse, to be clear. This is a performance problem now, but we preserved the availability. We had a failure on that link, and we preserved availability by converting our availability problem into a latency problem.

This is actually a really powerful mental model for thinking about availability. One thing we often talk about when we're talking about system design is how do I convert one kind of problem into a different kind of problem. Here I've converted an availability problem—I couldn't reach some servers—into a latency problem where I can reach them, but maybe it's a little bit slower because I have to go further. Now, that's not always a good thing because we care a lot about performance and I don't want to just sacrifice performance. But this idea of converting one kind of failure into a different kind of failure is a really useful way to think about system design and the trade-offs that we can make when we're designing a system like this.

Thumbnail 1420

One more thing about fail-stop that's worth knowing before we talk about other failure modes: fail-stop is particularly tricky to reason about in a stateful system like S3 because it can get the system into states that are otherwise unreachable in the absence of failure. For example, when we talk about storage systems, a problem we talk about a lot is called crash consistency. The idea of crash consistency is that a system should always return to a consistent state after a fail-stop failure.

Thumbnail 1450

Here's a really simple program, and you can probably just run it in your head and know what it's going to do. I'm going to open a file. I'm going to write two lines of text to that file, and I'm going to end up with a file that looks like this. It's a reasonably simple program, and if you look at it, this is kind of the only state that can result from this program. If you just run this program every time, modulo running out of file system space or something, you'll end up with a file and it will have these two lines in it. So that's a normal execution of the system.
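The slide itself isn't reproduced here, but the program being described is roughly this (a hedged reconstruction in Python):

```python
# Open a file and write two lines; absent failure, the only reachable end
# state is a file containing both lines.
with open("example.txt", "w") as f:
    f.write("first line\n")
    # A fail-stop failure (say, power loss) right here leaves a file with
    # only the first line: a state no failure-free execution can produce.
    f.write("second line\n")
```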

Thumbnail 1470

This doesn't work anymore in the presence of fail-stop failure. For example, what happens if the system fails, loses power, crashes, that kind of thing in the middle of executing this program? Maybe I have only written the first line of text to the file and then the system crashed. When the system comes back up, that file is going to be there, but the file is only going to have half the contents in it. I never got a chance to write the second line to the file before I lost power.

Now I've entered this kind of interesting state. This is a state we all kind of agree is unreachable in the system, except for the failure. States that weren't reachable before are now reachable in the system. This is a really hard thing to design around. It is a really hard thing to wrap your head around as an engineer—that when I ran this program, there are actually states that are possible to reach only in the presence of failure that are otherwise impossible. It is a pretty tricky problem.

Thumbnail 1510

We spent a lot of time thinking about it. We wrote a paper a few years ago about ShardStore, which is our storage node software. It runs on all of our hard drives, and the paper is about exactly this problem: how do you reason about the set of states that the system can reach in the presence of failure, in the presence of concurrency, and things like that. There are some really cool ideas in that paper about how we help engineers reason about those states while they're building the software itself.

Thumbnail 1540

Gray Failures and Metastable States: Retries, Timeouts, and Congestive Collapse

That's a little bit about fail-stop failures. They're kind of the simplest mode of failure. There are other modes of failure that we think about a lot in S3, and in particular the one we spend a lot of time thinking about is what I might call a gray failure. A failure that's not just a power failure or a fail-stop failure mode. So how do we detect those, and how do we respond to them? Let me give you an example of what I think a gray failure might look like.

Thumbnail 1550

This is a really simple overview of how a put works in S3. Your put comes in from the internet. You reach an S3 front end server, a web server that serves your request, and that web server is eventually going to fan that data out to a set of storage nodes. Seth showed you a more detailed version of this picture. This simple version is going to suffice for today.

Thumbnail 1580

Thumbnail 1590

So what happens if this front-end web server is successfully receiving traffic from the internet, accepting requests, but maybe it can't reach some of those downstream hosts that it needs to talk to? Maybe there's a networking issue or something. It can reach some of them, but not all of them. It's accepting traffic, and it's going to return errors back to you when those requests fail, because it's going to try to replicate that data across those disks, and some of those writes are going to fail. So it's going to return an error back to you, which is okay. But now it's hard to reason about the system. The system is doing work. The server is accepting requests, it is responding to requests, it hasn't had a power failure or anything like that. It's just that it's not doing useful work. It's failing all the requests it has because of this downstream issue. This is what I might think of as a gray failure. It's a little bit harder to detect because the system is not really failing. The server has not failed, it hasn't gone away, it's just doing something weird.

Thumbnail 1620

Thumbnail 1630

What do we do about gray failures? Well, there are a few different things we can do. One really powerful technique to be resilient to gray failures is to use retries. So if a request fails, often if you can retry that request, it might go somewhere else. It might go to a different front end web server on the retry, and that web server might have a good connection to all those storage nodes, and so your request can succeed. Now this is sort of hoping a little bit. You have to hope that you go to a different place, but actually a lot of the AWS SDKs have pretty sophisticated retry strategies and they're aware of properties like this. So a lot of the SDKs will actually intentionally retry their requests on a different web server, on a different IP address, specifically for this kind of failure, specifically to detect the case where a single web server is bad and try to go somewhere else and see if that web server works instead.

Thumbnail 1670

So retries are a really powerful mechanism for dealing with gray failures because it lets you try a different path through the system. Retries are not a panacea though. Retries are a pretty dangerous thing in a distributed system. Think about a distributed system like this. This is now a slightly more sophisticated version of S3, where we have some downstream microservices. It's not just one service, but multiple services. If all these services do retries in the face of failure, it's really easy to end up in a situation where you have a massive amplification of work.

Thumbnail 1720

Say for example that the system at the end of this stack is failing requests. The intermediate server is going to retry maybe three times before it fails. Once it fails, the server upstream is going to retry three times. It's going to go through this whole stack again, and so on and so forth. So you can very quickly end up, if you have a failure at the bottom of this stack, with many, many retries in the system, maybe 27 times more retries than there were original requests. So you can overload the system just with retries.
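The arithmetic behind that amplification is just multiplication of per-layer attempts; here's a tiny sketch of the worst case, with the numbers taken from the example above:

```python
from math import prod

def worst_case_calls(attempts_per_layer):
    """Calls that can reach the bottom service when every layer independently
    retries a failing downstream call: per-layer attempts multiply per hop."""
    return prod(attempts_per_layer)

print(worst_case_calls([3, 3, 3]))  # 27: three layers, three attempts each
print(worst_case_calls([3, 1, 1]))  # 3: only the top of the stack retries
```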

What this means for us is that we have to be fairly intentional when we design retry strategies. So for example, we might actually intentionally choose to do fewer retries further down the stack, or maybe do no retries at all further down the stack, knowing that things up the stack in the system are going to do those retries, and so the work will still get retried. You'll still get the same effect of availability, protecting you from gray failures, but we won't overload the system with those many, many retries. So we have to be a little bit intentional about how we think about retry policies down the stack of services inside of S3, also on the clients themselves.

Thumbnail 1750

Thumbnail 1770

Thumbnail 1780

A particularly difficult case for retries is when the failures are caused not by networking links or hard failures, but by load. Perhaps the web server that you're talking to is not actually disconnected from anything; it's just overloaded with work, very busy. That manifests as being slow. It might eventually manifest as a failure, but usually it will show up first as slow requests: the server isn't serving requests as quickly as you would like. So again, a powerful mechanism for this is timeouts. Your client should just time out if a request is slow and go try somewhere else. And it's the same property as before: if you retry somewhere else, hopefully that web server is not overloaded and your request will succeed on the retry. So timeouts are also a powerful mechanism for a different kind of gray failure, where your server is overloaded.
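Here's a minimal sketch of that client-side behavior: time out a slow host and retry the same request against a different host. The host list and path are hypothetical, and this is an illustration of the idea rather than what the AWS SDKs actually do internally.

```python
import urllib.request

def get_with_timeout(hosts, path, timeout_seconds=2):
    """Give up on a slow (possibly overloaded) host and retry the same
    request against a different host in the fleet."""
    last_error = None
    for host in hosts:
        try:
            with urllib.request.urlopen(f"http://{host}{path}",
                                        timeout=timeout_seconds) as resp:
                return resp.read()
        except OSError as err:   # timeouts and connection errors both land here
            last_error = err     # too slow or unreachable: try the next host
    raise last_error if last_error else RuntimeError("no hosts given")

# Usage (hypothetical hosts): get_with_timeout(["10.0.0.1", "10.0.0.2"], "/key")
```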

Thumbnail 1790

But, and this is kind of becoming a theme in this talk, timeouts are also not perfect. Timeouts are very hard to design around, especially in the presence of retries. Let me give you an example of that. Suppose that our web server is overloaded. It's having a hard time processing requests. In particular, it's building up a queue of requests that it's trying to serve. I make a request to this web server. My request enters the queue. It sits in the queue for a while while the server works on other things, and eventually it takes so long that my client times out.

Thumbnail 1810

My client gives up on this request and says, "This is taking too long. I'm going to go somewhere else." That's good for the client, because it gets availability by retrying somewhere else. But now we have a problem on this overloaded web server, because that timed-out request is still in the queue. The client was the one that did the retry, so the server has no idea that the client has given up on that request.

Thumbnail 1830

Thumbnail 1840

Thumbnail 1850

That request is going to sit in the queue and keep moving through it, and eventually the server is going to work on that request, even though the client gave up a long time ago. You end up in a situation where the server is essentially doing useless work at this point. It's working on a request that the client has given up on long ago and gone and tried somewhere else and probably succeeded. If this happens over and over again and you're building up a queue of requests, eventually you get to the point where every request in the queue is being timed out. Every request in the queue is a backlog of a request that has been retried somewhere else, and you end up in a state that we call congestive collapse, where the server is basically spending all of its time processing requests that have been given up on by their clients.

Clients have gone somewhere else to get their data, but the server is still having to process these requests. It's a chain reaction effect and it can be self-feeding. Because the queue and the server are overloaded and getting slower, that's causing clients to retry, which is causing more work to get put into the queue, which is causing more retries, and so on. You can get to a state where all you're doing is serving failed timeouts from retries over and over again. We call this congestive collapse. There are a few different ways that we can work around this, and one of the most important ways is to be smarter about how we use this queue.

Thumbnail 1910

When we're dealing with servers that have a queue of work and they get into a state where they're overloaded and the queue is full, we often invert how the queue works and actually process the queue from the back. This is unfair because the requests that arrived most recently get processed first. It becomes a last in, first out queue, and it penalizes some clients in exchange for making some clients very fast. The nice thing about this is that once you start processing the queue from the back, some of those requests start succeeding. You get out of this loop where every request is timing out. At least some of the requests are succeeding now and not creating more work, so you start digging yourself out of the hole of slow timed out requests because at least some of your clients are now seeing success, and the hope is you can burn through the rest of the queue as you go.
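A tiny sketch of flipping an overloaded FIFO queue to LIFO is below; it illustrates the idea rather than S3's actual queueing code.

```python
from collections import deque

queue = deque()   # requests arrive on the right

def next_request(overloaded):
    """FIFO under normal load; LIFO when overloaded, because the newest
    requests are the ones whose clients probably haven't timed out yet."""
    if not queue:
        return None
    return queue.pop() if overloaded else queue.popleft()

queue.extend(["req-1 (probably already timed out)", "req-2", "req-3 (just arrived)"])
print(next_request(overloaded=True))   # serves "req-3 (just arrived)" first
```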

Thumbnail 1950

Another thing we do that looks similar is on the client side when we do retries. We don't retry immediately. All of our SDKs as well as our internal services back off and retry, leaving some space between a failure and another retry. This gives the server a little bit of time, a hole in that queue, some space for it to try to dig itself out of the backlog. The combination of processing the queue differently and backing off and retrying requests more slowly helps us reduce the backlog and eventually dig the server out of this hole.
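Here's a small sketch of retrying with exponential backoff plus jitter on the client side. It mirrors the general approach described here, but it's an illustrative example rather than the AWS SDK implementation, and the parameter values are arbitrary.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=4, base_delay=0.1, max_delay=5.0):
    """Retry a failing call with exponential backoff plus jitter, giving an
    overloaded server some space in its queue to dig itself out of a backlog."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:                 # broad for the sketch; narrow in real code
            if attempt == max_attempts - 1:
                raise
            # Jitter: sleep a random amount up to an exponentially growing cap,
            # so retries from many clients don't all land at the same instant.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```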

Thumbnail 1980

This kind of failure mode is super interesting. It's called a metastable failure mode, and there's a really cool paper by some folks from AWS about how to analyze these failure modes in your system. We call it metastable because you end up in a state where, even though the original problem happened a long time ago, you're now just spending all of your time working through the queue over and over again. The original failure happened many minutes, maybe hours, ago, but you're still processing these timed-out requests repeatedly. You've entered a different state of the system. This paper is super interesting and gives you ideas on how to think about detecting these failure modes in your system design, so it's definitely worth checking out.

Thumbnail 2020

Self-Healing Systems: Health Checks and the Principle of Global Decision-Making

That's how clients can dig themselves out of this problem. But what should the system itself do in the face of failure? How should we design the system itself to recover from failures? Ultimately, what we want is for S3 as a system to heal itself. It's not scalable or tenable at S3's scale for operators to manually intervene when one of these failures occurs, and obviously we can't just turn S3 off and turn it back on again to get rid of failures like this. Let's go back to the web server example from before. We had a web server that was not successfully responding to requests. Clients were able to protect themselves from this by retrying somewhere else, but that server is bad. I want to get rid of that server because it's doing the wrong thing. How do I get that server out of service so that clients don't have to deal with failures in the first place? How do we detect the failure, and how do we mitigate it?

Thumbnail 2070

Thumbnail 2090

A really common answer for this question on a web service like S3 is health checks. Think of a health check as another server whose job is to check the functionality of each of these web servers. This health check server is going to send requests directly to each web server just to ask if they're functional and serving requests successfully. If any of them ever say no or the request is failing or they're overloaded and timed out, that health checking service can now take action. It can, for example, take that server out of service by removing it from DNS. Now clients won't even reach that server and won't experience those failures in the first place. The health check can detect the failure by seeing the server is not responding to requests, and it can mitigate the failure, whether it's by taking it out of DNS or maybe rebooting the server or other actions. Health checks are awesome and they are a really essential tool for automatically healing the server so that someone doesn't have to manually come and look at the server.
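As a rough sketch of the idea, here's a minimal health checker that probes servers and takes failing ones out of a DNS-like routing table. The `/healthz` path, the probe timeout, and the set standing in for DNS are assumptions for illustration, not S3's actual health check endpoints.

```python
import urllib.request

def probe(server):
    """Return True if the server answers its (hypothetical) health endpoint."""
    try:
        with urllib.request.urlopen(f"http://{server}/healthz", timeout=1) as resp:
            return resp.status == 200
    except OSError:
        return False   # unreachable, timed out, or erroring: treat as unhealthy

def run_health_checks(servers, dns_table):
    """Take failing servers out of rotation so clients never reach them."""
    for server in servers:
        if probe(server):
            dns_table.add(server)
        else:
            dns_table.discard(server)  # stop handing this server out via DNS
```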

Thumbnail 2130

A cool thing about S3 these days is that we actually build S3 on AWS, so we use the same infrastructure that you can for things like this. A lot of S3's web server capacity these days sits behind network load balancers, and our DNS is done with Route 53, so we're doing the same things that you can do on top of NLBs and Route 53 for our health checks in S3.

Thumbnail 2150

Thumbnail 2160

But there's a little bit of a danger with this health check system, because in reality a health check is just another server. What if the health check server itself is the one that's failing? The web servers are all fine, but the health check is unhealthy for some reason. What if it's that server that's causing the problem? That health check server is going to go and talk to a whole bunch of web servers and conclude that they're all bad, because the health check server itself is broken; maybe it has a network link that isn't working or something like that. So it's going to decide that all these web servers are bad, and then it's going to tell our DNS server: hey, all the servers are bad, you should take them all out, turn them all off. This is obviously really bad, and we don't want it to happen.

Thumbnail 2180

So we have to design not just for the availability of the web servers themselves, but also for the availability of the health check system. What can we do about this? One of the really important things we do in S3 is that we don't trust a single health check. We try to get a holistic view of the health of our web servers from multiple perspectives. We have health checks coming from the same region, health checks coming from another region, and health checks coming from the internet publicly, and we can combine these signals to form a more detailed view of whether a web server is broken or whether something else is broken in the system.

Thumbnail 2210

This design is especially nice because it helps us form a view about correlated failures like the ones we were talking about before. For example, maybe only the requests from US West to a different region are failing, on all the web servers. That actually suggests there's not something wrong with the web servers themselves, but some kind of networking issue. It looks like a correlated failure, because it looks like all the web servers have failed, but it's actually something else. So this helps us diagnose these kinds of issues, in addition to protecting us from individual failures on the web servers.

Thumbnail 2230

Thumbnail 2260

There's a really important tenet here, and it's so important that I want to put it in really big words. The most important thing we think about when we think about failure is that we never let local systems make global decisions about the health of the service. That's what the original health check was going to do: make a local decision based on what it saw about the system and then react to it. We never allow this to happen when we design a distributed system. We cannot trust local decisions to correctly understand the bigger picture, the bigger health of the service, so we intentionally engineer this kind of correlation, this kind of global view of the system, to make sure that we don't make mistakes based on local decisions.

Let me give you one more example of this. At S3 scale we have millions of hard drives, and sometimes hard drives fail. So we have software that detects whether a hard drive is failing, and when that software detects a failing drive, it triggers some kind of remediation. That might be as simple as turning the hard drive off and on again, a power cycle of the drive, or as involved as asking a data center technician to go and replace the hard drive if it's truly failing. But again, this health checking service is making local decisions: it is one service looking at a hard drive and deciding that the hard drive is bad.

Thumbnail 2300

Thumbnail 2310

If that service itself is broken, it can take a lot of hard drives out of service very, very quickly. So again, we want to make sure that we don't make local decisions about the system like this. In this case we have a global rate limiter, a system that the health check has to talk to before it makes a change, to ask: I think this hard drive is unhealthy and I want to take it out of service, am I doing the right thing? If the health check service goes haywire and starts trying to take everything out of service, the limiter is going to stop it. It's going to say: you've taken too many hard drives out of service, something is wrong with you, we have to stop. It's all about removing the ability for local decisions to change the system and making sure there is a global perspective on how the system is functioning.
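Here's a minimal sketch of such a global guard: a rate limiter the local drive-health checker must consult before taking a drive out of service. The class name, thresholds, and time window are illustrative assumptions, not S3's actual remediation system.

```python
import time
from collections import deque

class RemediationLimiter:
    """Global check that a local drive-health checker consults before taking a
    drive out of service. If too many removals happen in a short window,
    something is probably wrong with the checker, not the drives."""
    def __init__(self, max_removals=10, window_seconds=3600):
        self.max_removals = max_removals
        self.window = window_seconds
        self.removals = deque()  # timestamps of recent approved removals

    def may_remove(self):
        now = time.time()
        while self.removals and now - self.removals[0] > self.window:
            self.removals.popleft()          # forget removals outside the window
        if len(self.removals) >= self.max_removals:
            return False                     # stop the (possibly broken) local checker
        self.removals.append(now)
        return True

limiter = RemediationLimiter(max_removals=10, window_seconds=3600)
if limiter.may_remove():
    print("OK to take drive out of service")
else:
    print("too many recent removals; refusing")
```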

That's what we have for today. We've talked a little bit about availability and some of the things we think about when we're designing S3 for higher availability. It starts with defining what it means to be available: what are your correctness goals for the system? Then you need to design the overall behavior of the system, the architecture itself, to meet that goal; that's what Seth was showing with our quorum-based architecture and the move to journals for our strong consistency design. And then you need to implement the actual pieces of the system, and when you do, it's important to understand how each piece can fail, how each piece can misbehave, and how you can remediate those failures without making local decisions about the health of the system.

Thank you so much for coming out today. Please take a moment to fill out the session survey. Seth and I really enjoy the chance to do this talk every year about the sort of hard stuff the S3 team does, and filling out the survey helps make sure we can do this kind of talk in the future. Enjoy the rest of re:Invent. Thanks.


This article is entirely auto-generated using Amazon Bedrock.
