Kazuya

Posted on Dec 5, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - Global Resilient Apps: Guide to Multi-AZ/Region Architecture with ELB (NET311)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview


In this video, Jon Zobrist and Felipe da Silva from the ELB team explain multi-AZ and multi-region resiliency strategies using AWS Elastic Load Balancing. They cover how ELB uses DNS-based failover with 60-second TTLs, Route 53 health checks, and target health checks to route traffic away from unhealthy zones. Key topics include cross-zone load balancing trade-offs, static stability through pre-provisioning capacity for at least one AZ failure, configurable target group health thresholds to trigger failover before 100% failure, and Route 53 Application Recovery Controller for zonal shifts. For multi-region resilience, they discuss using Route 53 failover and weighted records, DNS load shedding to prevent congestive collapse, deployment pipelines with zonal rollouts, graceful degradation strategies, and the importance of client-side best practices like honoring DNS TTLs and implementing connection pooling. They emphasize that configuration and deployment changes cause most outages, making testing and change management critical.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction to Global Resilient Apps: Multi-AZ and Multi-Region Resiliency with ELB

Welcome to Global Resilient Apps, a guide to multi-AZ and multi-region resiliency using ELB. I'm Jon Zobrist, Head of Customer Success for the ELB team, and with me is Felipe da Silva, Principal Solutions Architect on my team. Thank you all so much for coming out so early and during a keynote. We really appreciate it, and for those of you who are here from Felipe's talk the other day, good to see you. We're going to go over some guiding principles, then we're going to talk about multi-AZ resiliency and then multi-region resiliency, and then we'll wrap up with some Q&A. If we run out of time, which we probably will, we will be outside in the hall afterwards when we have free time. If you have any other questions, we're happy to chat about your specific architecture.

Let's jump right in with everyone's favorite quote from AWS of all time: "Everything fails, all the time." Our VP and CTO, Werner Vogels, has said this repeatedly, and I think we all know it's true. Things fail, and when failures happen, we have to deal with them. We can't just hope they never happen. To mitigate failures, we think about how to become more resilient to them.

From our Resilience Hub, which you should all check out, we have this definition: the ability of a workload to recover from infrastructure or service disruptions. That's the main area we're going to be talking about today. Under the hood, there are a couple of drivers for resilience, and it's important to understand that both of these are important when you're planning or building your architecture. The technical drivers are going to be things like having less downtime, being more available for your customers, and making sure things have lower latency. On the business side, you're going to have concerns like keeping your revenue going, keeping customer trust high, and making sure your applications are up and available.

Understanding Failure Scenarios and the Shared Responsibility Model

We're mostly going to focus on the technical one, but what are we actually mitigating for? What is the problem we're trying to prevent, or why are we being resilient? There are a lot of things to think about. The first place many think of is highly unlikely scenarios like earthquakes, floods, and tsunamis. These do happen and we need to be prepared for them, but they're by far the least common. Moving towards the more common issues, you've got data and state, where you could have corruption of your data or invalid stored data that you didn't fully replicate.

Then there's the core infrastructure, the things we often think about: racks, servers, power, air conditioning, all within the data center where the servers are running. But by far the biggest one we see is configuration and deployment. Humans touching things in production is by far the number one cause of issues and outages for both customers and for AWS service teams. We're going to dive into that a little bit more as well.

We aren't going to go as deep and really cover disaster recovery, which is sort of the twin of resilience. Disaster recovery often focuses on having a process and procedure with backups in a secure location, often multi-region. If you're just getting started on your journey, multi-region disaster recovery may be a good first step into going multi-region. But the difference here is your recovery is going to be a lot slower. You're going to have hours to days, doing things like restoring from a database backup or relaunching servers.

Whereas high availability means you may have a primary site and a secondary site, but they're both live and running, and you want things to be able to keep running in both of them. If the primary fails, you fail over to the secondary for higher availability, or you go something like active-active and have multiple primary sites. We're always trying to improve these. We want to make sure that we're reviewing what happens operationally, taking the lessons, and building them into our plans so that we minimize the chance that we have impact from a similar failure in the future.

When we talk about resilience on the cloud, we love the shared responsibility model. Resilience, like everything, is a shared responsibility between AWS and you. The primitives we give you include multiple regions all around the world, and within the region, we have multiple Availability Zones. These are completely independent infrastructure pieces—one or more buildings, which can be many buildings in bigger regions—and they're geographically isolated from other Availability Zones but still close enough to keep the latency reasonable in the single-digit milliseconds. These are the primitives we give you in terms of where you put things and how you structure them.

Multi-AZ Resiliency Fundamentals: How ELB Uses DNS for Failover

Let's hand over to Felipe. Welcome to re:Invent, Felipe. Thank you, Jon. I appreciate that. So let's talk about multi-AZ resiliency then. We are going to be focused on ELB here, but essentially the tips that we're going to give apply to any workload. Let's start first with a very simple application. As you can see in this application, we're not using ELB here. There are multiple EC2 instances running across different Availability Zones, and then you have your primary database.

Everything is working fine until you have some dependency issue. The primary database fails or something like that. What do you have to do here? You have to fail over from that primary database to your secondary. Your service availability looks like this. It drops until the failover initiates and then you recover.

Now let's talk about a different problem, which is also very common. You have hosts in one Availability Zone that are unable to connect to the database, but the hosts in the other zones can. In this case, you don't want to fail over the database because it is working, but you need to fail over at the front door. Until you do that, the users routed to that zone are going to face degraded performance or be unable to connect. The availability looks like this. It's not down, but until you apply the mitigation, availability is slightly reduced. This could be due to host failure, connectivity flapping, or something between the Availability Zones that causes the connectivity to stop.

How can we actually improve this and improve our service availability so that it's smoother? ELB improves your availability by scaling transparently and distributing the traffic to targets. You have multiple targets that you don't have to manage. It performs health checks and manages the traffic routing using DNS as well, and that's the key part that we are going to be focusing on today. A lot of the talk is going to be related to DNS. Essentially, that's one of the mitigation points that we use in ELB: failing over using DNS.

This is a typical architecture for ELB. You see on the top part there is a DNS part, and that is what we're going to talk a lot about. Then you have the load balancer, then you have the target group with your EC2 instances or whatever you have, and you have your dependencies there. ELB distributes traffic to healthy, appropriately scaled nodes based on your workload. You'll see here a screenshot of DNS resolution using the dig command. The DNS responses return IPs in a random order to distribute the traffic evenly across all zonal IPs. Each IP here belongs to a different availability zone, and the TTL of the record is always 60 seconds. That ensures that each time the client resolves, they get the newest healthy hosts that are available.

I wanted to talk about from the ELB perspective which IPs we publish in DNS. The answer is all healthy zones are in DNS. When you configure a load balancer, you can pick how many availability zones you want to have available, and as long as the availability zone is considered healthy, that availability zone IP will be in the DNS. Let's understand what a healthy zone definition is. It contains at least one healthy target, and the node in the zone is healthy from the Route 53 health checks. We're going to see more in a bit because I need to explain how the Route 53 health checks actually work, and we're going to cover that shortly. But stick with that definition because we are going to use that in our presentation here.
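The healthy-zone rule and the shuffled DNS answer can be sketched roughly as follows. This is a simplified model, not ELB's implementation: the `Zone` class, its field names, and the example IPs are hypothetical, and the fail-open case (no healthy zone at all) is deliberately left out.

```python
import random
from dataclasses import dataclass

TTL_SECONDS = 60  # TTL of ELB DNS records, as stated in the talk


@dataclass
class Zone:
    name: str
    node_ip: str
    node_passes_route53_check: bool  # Route 53 health check against the node
    healthy_targets: int             # targets the node currently sees as healthy


def zone_is_healthy(zone: Zone) -> bool:
    """Healthy zone: the node passes Route 53 checks and has at least one healthy target."""
    return zone.node_passes_route53_check and zone.healthy_targets >= 1


def dns_answer(zones, rng=random):
    """Return (ttl, ips) with the IPs of healthy zones in random order."""
    ips = [z.node_ip for z in zones if zone_is_healthy(z)]
    rng.shuffle(ips)
    return TTL_SECONDS, ips


zones = [
    Zone("az1", "192.0.2.10", True, 3),
    Zone("az2", "192.0.2.20", True, 0),   # no healthy target: left out of DNS
    Zone("az3", "192.0.2.30", False, 2),  # node fails Route 53 check: left out
]
ttl, ips = dns_answer(zones)
print(ttl, ips)
```

Only `az1` satisfies both conditions here, so it is the only IP in the answer.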

Let's do a quick walkthrough on DNS resolution. The users are connected and they are performing a DNS resolution. They're querying DNS servers and they receive a DNS response with the IPs of the healthy zones. They connect and everything is working fine. Then if there is a failure, users perform another DNS lookup, receive the response of the current healthy hosts, and they connect and they're back online. That's basically the way that the mitigation actually works. In this scenario, I showed the load balancer node failing, but the failure could be due to different reasons as well, and that's what we are going to dive deep into today.

Two-Tier Health Check System: Route 53 and Target Health Checks

Let's go back to our previous example where there was a dependency failure, but the dependency that was actually failing is from one of the availability zones to connect to our database. In this case, you don't want to fail over your primary database. You just want to make sure that you don't route traffic to those targets. How do we detect and route the traffic around that? The answer is health checks. Before I start talking about health checks, I need to explain how we design the health check systems of ELB. We think about a two-tier health check system, which means we do two types of health checks. One is that Route 53 is constantly performing health checks against the load balancer nodes that are running.

When I say nodes, I'm referring to the IPs that we publish in DNS—specifically, the subset of nodes that are part of the load balancer, which could be an ALB, NLB, or even a classic load balancer. That's the first tier, and it ensures that we only publish healthy IPs in DNS. That's the first part. Then we have the second tier, where each node performs health checks against your targets. You've probably seen this in the access logs of your targets, where you see health checks being performed by different IPs if you have cross-zone load balancing. Each node has its own view of which targets are healthy. We only send traffic to targets that are healthy, and clients will only see availability zones that are healthy in DNS.

Regarding health check actions during gray failures or hard failures, the first thing you want to do is reroute traffic from the affected targets or availability zones, and then you want to initiate target replacements. Let me recap the target health check options. What I mean by target health check options is the response that your backends provide to the load balancer. We often see shallow health checks and deep health checks. A shallow health check is when the load balancer sends an HTTP or HTTPS request and expects a response. You configure the response code, the interval, and so on. You can provide either a shallow response, or you can provide a response where you perform something extra, such as connecting to a dependency and making sure everything is working.

The shallow approach means you're connecting to the web server and retrieving a response. If the web server connectivity is working fine, that's it. It doesn't perform any extra dependency checks. The deep approach would do that, but one thing to consider is that this would have high resource usage. If each health check requires you to check whether you can connect to a dependency or perform some extra operation, that may be expensive. However, we don't have only these two options. We actually have a third option that we discuss frequently with customers: a hybrid health check approach. It combines the best of both worlds. You perform asynchronous dependency checks across your dependencies, and then you populate a file or cache that you serve as a shallow health check response to the load balancer. Essentially, each health check that the load balancer performs reflects the dependency state, and this allows you to route traffic around to other healthy targets if you have a gray failure or are unable to connect or perform some action.
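One way to picture the hybrid approach is the sketch below. It is an illustration under stated assumptions, not a reference implementation: the `HybridHealthCheck` class, its method names, and the staleness limit are all hypothetical. Dependency probes run out of band in `refresh()`, while `status_code()` is what a health check endpoint would return, without touching any dependency.

```python
import threading
import time


class HybridHealthCheck:
    """Deep checks run in the background; the load balancer gets a cheap cached answer."""

    def __init__(self, dependency_checks, max_age_seconds=30.0, clock=time.monotonic):
        self._checks = dependency_checks  # name -> callable returning True/False
        self._max_age = max_age_seconds   # hypothetical staleness limit
        self._clock = clock
        self._lock = threading.Lock()
        self._healthy = False             # unhealthy until the first refresh
        self._checked_at = None

    def refresh(self):
        """Deep part: probe every dependency. Call from a background thread or timer."""
        results = {}
        for name, check in self._checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False     # a crashing probe counts as a failure
        with self._lock:
            self._healthy = all(results.values())
            self._checked_at = self._clock()
        return results

    def status_code(self):
        """Shallow part: answer from the cache, with no dependency calls."""
        with self._lock:
            if self._checked_at is None:
                return 503
            stale = self._clock() - self._checked_at > self._max_age
            return 200 if self._healthy and not stale else 503
```

Treating a stale cache as unhealthy is a design choice made here so that a stuck background checker does not keep reporting a healthy state indefinitely.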

Static Stability and Cross-Zone Load Balancing Strategies

Now let's talk about static stability. One of the principles of static stability is that impairments can make resources unavailable. If you lose a third of your fleet's capacity, say one Availability Zone out of three, the remaining capacity could be overloaded. The answer is that you have to be statically stable and have that redundancy pre-deployed. You want that redundancy available across multiple availability zones. That's what we actually do with ELB. We always overprovision the load balancer by at least one availability zone to tolerate that failure. This enables us to do two things: provide a seamless DNS failover and give us buffer capacity for traffic spikes. That's another benefit because we have that extra capacity.

For your targets, you should think about it exactly the same way. You should pre-provision capacity to tolerate one availability zone failure if you want to be resilient, and scale up quickly but scale down slowly and conservatively so you keep that capacity. This allows you to sustain a traffic spike if one occurs in the meantime. In this example, everything is healthy and green, but if one availability zone has all targets failing suddenly, the other two availability zones should still remain healthy. That's only possible if you pre-provision.
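The pre-provisioning arithmetic can be worked through with a small sketch. The function names and the example numbers are illustrative assumptions: if peak load needs a given number of targets, each zone must hold enough that the surviving zones still cover peak after one zone is lost.

```python
import math


def targets_per_az(peak_targets_needed: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Targets to run in each AZ so the surviving AZs still cover peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(peak_targets_needed / surviving)


def overprovision_ratio(peak_targets_needed: int, az_count: int, az_failures_tolerated: int = 1) -> float:
    """Total provisioned targets divided by what peak load strictly needs."""
    total = targets_per_az(peak_targets_needed, az_count, az_failures_tolerated) * az_count
    return total / peak_targets_needed


# Example: peak needs 12 targets, spread over 3 AZs, tolerating the loss of 1 AZ.
print(targets_per_az(12, 3))       # 6 per AZ, 18 in total
print(overprovision_ratio(12, 3))  # 1.5
```

With three zones the cost of tolerating one zonal failure is 50% extra capacity; with two zones it would be 100%.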

Now let's talk about cross-zone load balancing. Everyone knows what cross-zone load balancing is. If you have cross-zone off, each IP of the load balancer can only talk to targets in the same availability zone. With cross-zone on, it talks to all targets across availability zones. What I want to discuss here is what happens to your targets if we imagine the incoming traffic is the same whether cross-zone is on or off.

You have traffic distribution to your targets. One thing you'll notice here is that I intentionally gave one of the Availability Zones fewer targets, and you can see that the targets in this Availability Zone receive a disproportionate amount of traffic because the load balancer node can only send traffic to those targets, and with fewer targets available, each one does more work. In contrast, with cross-zone enabled, traffic is spread across all targets, which compensates for the uneven number of targets. However, even though the traffic distribution is fine with cross-zone on, I don't want you to ignore this concern, because uneven zones will not let you achieve static stability. My recommendation is to keep a proportional number of healthy targets per Availability Zone and make sure that your backend capacity can handle an AZ failover. This is essentially what will enable you to achieve static stability.
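The effect of an uneven zone can be made concrete with a small calculation. This sketch assumes DNS spreads client traffic evenly across the zonal IPs and that each node balances evenly over the targets it can reach; the function name and the target counts are hypothetical.

```python
def per_target_share(targets_per_zone: dict, cross_zone: bool) -> dict:
    """Fraction of total traffic that one target in each zone receives."""
    zones = {zone: count for zone, count in targets_per_zone.items() if count > 0}
    if not zones:
        raise ValueError("no targets registered")
    if cross_zone:
        # Every node can reach every target, so all targets share the load equally.
        total_targets = sum(zones.values())
        return {zone: 1 / total_targets for zone in zones}
    # Cross-zone off: each zone gets an equal slice, split among its own targets.
    zone_slice = 1 / len(zones)
    return {zone: zone_slice / count for zone, count in zones.items()}


fleet = {"az1": 4, "az2": 4, "az3": 2}
print(per_target_share(fleet, cross_zone=False))  # az3 targets carry twice the load of az1
print(per_target_share(fleet, cross_zone=True))   # every target carries 10%
```

With cross-zone off, each target in the smaller zone handles about 16.7% of traffic against 8.3% elsewhere, which is the disproportion described above.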

DNS Failover Mechanisms: Understanding Fail Open Mode

Now let's talk about DNS and dive deep into the DNS failover mechanisms. First, let's imagine hypothetically that we have one Availability Zone where all targets are failing. The question is what happens in this case. The answer is it depends. That's why I explained what cross-zone is and how it works. If you have cross-zone off and all targets in that Availability Zone are failing health checks, the IP is also removed from DNS, even though the load balancer is not unhealthy from its perspective. Because it cannot send traffic and it doesn't have any healthy targets, the IP is going to be removed from DNS, and that distributes the traffic to only the Availability Zones that contain healthy targets. If cross-zone is on and the load balancer in that Availability Zone can still send traffic to other targets in other areas, the IP is not removed from DNS.

One thing that is very important to understand is that this failover occurs at the data plane level. There is no control plane involved, and this happens for both Application Load Balancers and Network Load Balancers. Let me dive deeper into another important aspect of load balancers: we don't fail closed, we always fail open. You can see in this diagram that all targets are failing, so the IPs should be removed from DNS, but that would essentially return no record, and then clients would not connect. We don't do that. Instead, at the DNS level we fail open, and the IPs are returned regardless of being unhealthy. However, one thing that happens as well is we flag that on Route 53 and we evaluate target health, which also fails. So if you have a failover policy configured, that would fail over to another load balancer that is healthy or another resource that is still healthy from the Route 53 perspective.

On the target side, if all targets are unhealthy, we send traffic to all of the targets as if they were healthy. It's better to send traffic somewhere than to just fail. I want you to think about this as well: if you are always in fail open mode, this masks other failures and prevents additional failover decisions. You can be in fail open mode and still operate just fine because you might have a bad health check, for example, but you lose the ability to fail over traffic when a real failure occurs. So I don't want you to think that fail open mode is the right mode for you to operate in. It's just that we like to offer it because it works even in cases where you are unhealthy and the application is still returning data. One thing I wanted to mention is that this is 100% configurable using load balancer target group health thresholds, and we're going to see that shortly.
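The fail-open rule at both levels follows the same pattern, sketched below as a simplified model with hypothetical function names: prefer the healthy subset, but when that subset is empty, fall back to everything instead of returning nothing.

```python
def zones_published_in_dns(zone_health: dict) -> list:
    """DNS level: publish healthy zones, or all zones if none is healthy."""
    healthy = [zone for zone, ok in zone_health.items() if ok]
    return healthy if healthy else list(zone_health)


def routable_targets(target_health: dict) -> list:
    """Target level: route to healthy targets, or to all targets if none is healthy."""
    healthy = [target for target, ok in target_health.items() if ok]
    return healthy if healthy else list(target_health)


def is_failing_open(health: dict) -> bool:
    """True when there are members but none of them is healthy."""
    return bool(health) and not any(health.values())


print(zones_published_in_dns({"az1": True, "az2": False}))   # ['az1']
print(zones_published_in_dns({"az1": False, "az2": False}))  # ['az1', 'az2']
print(is_failing_open({"i-a": False, "i-b": False}))         # True
```

The second call shows why fail open can mask real failures: the answer looks the same as a fully healthy one, so the state is only visible through a separate signal such as `is_failing_open`.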

Sometimes we hear customers asking how they can monitor if their load balancer is in fail open mode. They want to monitor and see if there is a failover occurring, and they want to know if they can create an alert for that. One way you can monitor if your load balancer is failing open is to create a Route 53 record with a failover policy, a monitor record for example, that you don't use for production. Configure the failover policy so that, as you can see here, the primary points to the ELB with evaluate target health enabled, and the secondary returns no record. The secondary can actually point to anything you want, such as a static resource. When you detect that the monitor record returns no record, or that it points to the static resource you configured instead of your main load balancer, you know that the load balancer is in fail open mode. Don't use that monitor record for client traffic, because otherwise your clients will fail to connect.
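The monitor record idea could look roughly like the sketch below. Several things here are assumptions rather than anything shown in the talk: the sentinel IP (a documentation-only address), the set identifiers, and the helper names are hypothetical, and the secondary uses a static sentinel address instead of returning no record. The dictionary follows the shape of a Route 53 `ChangeBatch` as accepted by `change_resource_record_sets`; verify it against the Route 53 API reference before use.

```python
import socket

SENTINEL_IP = "192.0.2.1"  # hypothetical sentinel, from the documentation-only range


def monitor_record_changes(record_name, elb_dns_name, elb_hosted_zone_id):
    """Build a ChangeBatch for a primary (ELB alias) and secondary (sentinel) failover pair."""
    primary = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": "primary-elb",
        "Failover": "PRIMARY",
        "AliasTarget": {
            "HostedZoneId": elb_hosted_zone_id,
            "DNSName": elb_dns_name,
            "EvaluateTargetHealth": True,  # failover follows the load balancer's health
        },
    }
    secondary = {
        "Name": record_name,
        "Type": "A",
        "SetIdentifier": "secondary-sentinel",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": SENTINEL_IP}],
    }
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in (primary, secondary)]}


def resolve_ipv4(name):
    """Resolve a name to its IPv4 addresses using the system resolver."""
    return socket.gethostbyname_ex(name)[2]


def load_balancer_failing_open(record_name, resolver=resolve_ipv4):
    """True if the monitor record no longer resolves to the load balancer."""
    try:
        ips = resolver(record_name)
    except OSError:
        return True  # no answer; note a broken local resolver would look the same
    return not ips or SENTINEL_IP in ips
```

A scheduled job could call `load_balancer_failing_open` against the monitor name and raise an alert when it returns `True`. The monitor name must never be handed to real clients.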

Target Health Thresholds and Route 53 Application Recovery Controller

Let's dive deep into the target health thresholds. Remember that I mentioned that fail open and failover in the previous slides only occur if all targets are failing, and that is the default. All targets must fail health checks in order to trigger that behavior, which to recap is both the DNS failover and the target fail open.

These thresholds are configurable, and the point here is that you should configure earlier intervention points, because if 100% of your targets are failing, maybe it's too late. You can configure the load balancer to fail away before that.

In this example, you have a unified threshold and detailed thresholds, where you can separate the DNS failover and the target failover. The DNS failover has to occur first. In this example, I used the unified one and set it to 30%. If fewer than 30% of the targets in a target group are healthy, it triggers a failover. You can see here that not all the targets are failing. I still have two healthy targets, but that could lead to an overload or a bad customer experience. So it's better for you to fail away from that Availability Zone if that occurs.
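The threshold logic can be sketched as a small decision function. This is a model of the behavior described above, not the ELB API: the function and parameter names are hypothetical, `None` stands for the default (act only when no target is healthy), and the ordering check reflects the statement that DNS failover has to occur first.

```python
def evaluate_zone(healthy: int, total: int, dns_failover_pct=None, routing_fail_open_pct=None) -> dict:
    """Decide DNS failover and target fail open for one zone of a target group."""
    if (
        dns_failover_pct is not None
        and routing_fail_open_pct is not None
        and routing_fail_open_pct > dns_failover_pct
    ):
        raise ValueError("DNS failover must trigger no later than target fail open")

    def breached(pct):
        if pct is None:
            return healthy == 0              # default: all targets must be failing
        # Integer comparison avoids floating point edge cases at the boundary.
        return total == 0 or healthy * 100 < pct * total

    return {
        "remove_zone_from_dns": breached(dns_failover_pct),
        "route_to_all_targets": breached(routing_fail_open_pct),
    }


# 2 healthy targets out of 10 is 20% healthy.
print(evaluate_zone(2, 10))          # default: nothing triggers yet
print(evaluate_zone(2, 10, 30, 30))  # unified 30% threshold: both trigger
print(evaluate_zone(2, 10, 30, 10))  # detailed: DNS failover only
```

With the default settings the zone keeps receiving traffic on two targets, while the 30% threshold moves traffic away before those two targets are overloaded.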

Having multiple target groups associated with the load balancer is a very common configuration. The reason I'm bringing that up is that we sometimes see customers in fail-open mode or failover scenarios, but they have healthy targets in the same Availability Zone in a different target group. In this example, I put two target groups: Target Group 1 and Target Group 2. Target Group 1 is a test target group, for example, and Target Group 2 is serving the production traffic. This feature allows you to go to that setting and disable it for one of the target groups that's not serving the main portion of your traffic.

Essentially, as you can see in this example, the Availability Zone is not being returned in DNS, but if you disable that setting for that target group, the zone will still resolve and still be returned in DNS. If the load balancer receives traffic in that Availability Zone and it points to that target group, that target group is essentially in fail-open mode. But at least the IP is not removed and capacity is not removed from that Availability Zone.

Let's talk about another mechanism to fail over traffic: Route 53 Application Recovery Controller. Let's assume you have canaries probing the load balancer, and you detect a degradation, but for some reason your targets are not considered unhealthy yet. You can request a zonal shift, and Application Recovery Controller will remove that zone's IP from DNS. That's a mechanism that you have control over. You also have zonal autoshift: if we detect an issue, we can perform the zonal shift on your behalf. You can also use zonal shift for testing, to check that you can actually tolerate the Availability Zone failure that I mentioned and ensure that you are statically stable, by performing exercises and so on.

In this example, it shows for cross-zone off, but it can also be used in cross-zone on. Please check the QR code. There is a lot of information in the article that this QR code is going to return.

Observability Best Practices and Client Connection Management

Now let's talk about observability. I wanted to make a disclaimer on the observability side. Observability is a huge topic, and I'm not going to dive deep on observability. We won't have time to go over everything. Let's imagine this scenario where you have your users, you have your load balancers, you have your targets, and you have your dependencies or a stack of multiple load balancers. The question I want to ask here is: where are you measuring when things are wrong or things are doing well? The answer here is you should measure everything. You measure from the load balancer metrics, but you also should emit metrics from your targets, collect metrics from dependencies, and understand in case there is a failure or something. You want to understand what component is actually failing because you may see the front door failing with something, but the problem actually is on a database that is down on the stack, right? So you want to make sure that you know that. An indication of error can mean something external to the load balancer itself.

The question that we ask when we are troubleshooting things is: is the failure occurring in a single zone or in multiple zones at the same time? The reason we ask this question is that if it's happening in multiple zones and each load balancer node is seeing the same thing, you can conclude that the issue is not at the load balancer, because we provision zonal resources. If we are seeing the same event across all your targets, for example, it may indicate that you have a dependency issue. But if you have cross-zone off, for example, and the issue is happening in a single Availability Zone, you can then verify if there are any specific targets that are returning errors in that Availability Zone and so on.
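That triage question can be expressed as a small classifier over per-zone error rates. This is only an illustration: the function name, the labels, and the 5% threshold are hypothetical, and real triage would look at more than one metric.

```python
def classify_failure(error_rate_by_zone: dict, threshold: float = 0.05):
    """Classify an incident by how many zones show elevated error rates."""
    if not error_rate_by_zone:
        raise ValueError("no zones to evaluate")
    impaired = sorted(zone for zone, rate in error_rate_by_zone.items() if rate >= threshold)
    if not impaired:
        return "healthy", impaired
    if len(impaired) == 1 and len(error_rate_by_zone) > 1:
        return "single_zone", impaired      # look at targets or nodes in that zone
    if len(impaired) == len(error_rate_by_zone):
        return "all_zones", impaired        # suspect a shared dependency
    return "multiple_zones", impaired


print(classify_failure({"az1": 0.01, "az2": 0.40, "az3": 0.00}))  # ('single_zone', ['az2'])
print(classify_failure({"az1": 0.30, "az2": 0.35, "az3": 0.32}))  # ('all_zones', ['az1', 'az2', 'az3'])
```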

And that is only possible if you have those metrics available or if you process access logs. One thing that we do at AWS is that in our services we monitor not only negative metrics, in other words errors, but also positive metrics, meaning whether your requests are within the boundaries that we expect. What is the 2XX rate? Is the canary successfully probing the nodes and the load balancers? What is our healthy host count? If you see a sudden drop in the healthy host count and not an increase in the unhealthy host count, it's still a problem, even though the unhealthy host count didn't alarm. These are things that I want you to think about when you are creating alarms. You should also think about your positive metrics.
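A minimal sketch of why the positive metric matters, with hypothetical function names and thresholds: if hosts disappear from the fleet without ever being counted as unhealthy, an alarm on the unhealthy count stays silent while an alarm on the healthy count fires.

```python
def negative_alarm(unhealthy_hosts: int, max_unhealthy: int = 0) -> bool:
    """Classic alarm: fires when too many hosts are reported unhealthy."""
    return unhealthy_hosts > max_unhealthy


def positive_alarm(healthy_hosts: int, expected_healthy: int, min_fraction: float = 0.8) -> bool:
    """Positive alarm: fires when healthy hosts fall below a share of what is expected."""
    return healthy_hosts < expected_healthy * min_fraction


# Six of ten hosts vanish (for example, they were deregistered) without failing a health check.
healthy, unhealthy, expected = 4, 0, 10
print(negative_alarm(unhealthy))          # False: nothing looks wrong
print(positive_alarm(healthy, expected))  # True: capacity has dropped
```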

One cool feature that we have with CloudWatch is the composite alarm, so you can combine multiple alarms into a single alarm, and that will enhance your visibility when you're alerting your on-call. Now you can see which alarm is firing. Maybe it's firing the 5XX alarm on the load balancer, but it's also firing something like elevated latency on the target because the dependency is failing, and that will give you better visibility. As I mentioned earlier, there are other talks that discuss observability in depth. I just wanted to give my thoughts about how you should think about observability here. Next, client best practices.

I want to start here on the client. One important aspect of resiliency is the client. You can have a very good architecture, well architected, everything works fine, but you have a client that sticks with a specific IP because they are caching or something like that, and they don't recover when things fail. They don't detect it, or they detect that there is a connection failure and they keep retrying the same IP all the time. That's bad. One thing that I want to mention here is whether your clients are ready, because it's not only DNS.

When you're setting up clients, most web browsers will do what we're explaining here in this slide, but when you're creating your client for your APIs, you need to ask: are your clients ready for connection management? Do they have a maximum connection lifetime? Do you have connection pooling enabled, which can speed things up because you can pre-open connections? Are clients honoring the DNS TTL? When errors are occurring, do you retry against responsive IPs and skip failed ones? Do you implement exponential backoff with jitter? If you do that, then you have loads of benefits. You have balanced connection distribution, graceful failure handling, faster recovery times, and reduced latency.
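Two of these client practices, skipping failed IPs and exponential backoff with jitter, can be sketched together. This is a minimal illustration with hypothetical names and defaults: it uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap, and `send` stands in for whatever call the client makes.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0, rng=random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(ips, send, max_attempts=4, base=0.1, cap=5.0, sleep=time.sleep, rng=random):
    """Call send(ip), preferring IPs that have not failed yet, backing off between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    failed = set()
    last_error = None
    for attempt in range(max_attempts):
        # Skip IPs that already failed; if every IP failed, try them all again.
        candidates = [ip for ip in ips if ip not in failed] or list(ips)
        ip = rng.choice(candidates)
        try:
            return send(ip)
        except ConnectionError as exc:
            failed.add(ip)
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt, base, cap, rng))
    raise last_error
```

The jitter spreads retries from many clients over time, so a recovering zone is not hit by a synchronized wave of reconnects.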

This is just a basic example. You're going to see that on the connection pool example here. You have connections to multiple nodes. The client is aware that one node is not producing the expected response. It just ignores that node temporarily, and then when that node recovers, it reconnects to it. That's it from my end actually. I'm going to hand over to Jon.
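The connection pool behavior Felipe describes, ignoring a bad node temporarily and returning to it later, might be sketched like this. The `NodePool` class, its method names, and the 30-second cooldown are hypothetical; a real client would also refresh the IP list from DNS according to the record TTL.

```python
import time


class NodePool:
    """Round-robin over node IPs, skipping nodes that recently failed."""

    def __init__(self, ips, cooldown_seconds=30.0, clock=time.monotonic):
        self._ips = list(ips)
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._quarantined_until = {}  # ip -> time when it may be used again
        self._next = 0

    def report_failure(self, ip):
        """Ignore this node until the cooldown has passed."""
        self._quarantined_until[ip] = self._clock() + self._cooldown

    def available(self):
        """Nodes that are not currently quarantined."""
        now = self._clock()
        return [ip for ip in self._ips if self._quarantined_until.get(ip, float("-inf")) <= now]

    def pick(self):
        """Pick the next node; if every node is quarantined, use them all anyway."""
        candidates = self.available() or self._ips
        ip = candidates[self._next % len(candidates)]
        self._next += 1
        return ip
```

Falling back to all nodes when every one is quarantined mirrors the fail-open idea from earlier in the talk: it is better to try somewhere than to refuse to connect at all.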

Multi-Region Resiliency: Blast Radius Isolation and the CAP Theorem

Thanks, Felipe. I appreciate it. So now that Felipe has gone into multi-AZ resilience, let's talk a little bit about multi-region resiliency. The first thing to talk about is why would you want to do this? Why go multi-region? The biggest thing in my opinion that you get from going multi-region is you get another level of blast radius isolation. We talk a lot about blast radius and how we minimize impact during a failure. The blast radius isolation that you get will help you with other things like configuration issues or deployments. It will help with region-wide catastrophes if you're in a different region where there isn't a catastrophe happening, and it may help you with legal or compliance reasons.

We are seeing more customers and more countries put data sovereignty laws in place, where certain kinds of data need to be stored within certain geopolitical boundaries. These are countries that have them so far, but it's a new trend and we expect it to continue. Definitely an important thing to think about. So before you go multi-region, the biggest thing, the hardest thing about this, is that you have to realize, and I'm sure many of you know this, that it's a very hard problem we're solving. We're taking complex distributed systems, building other systems on top of those, merging them together, and then we want to take that and start moving it to other regions in a way that synchronizes with the original.

The two biggest things I think we need to do when we're going multi-region are these. One, align everybody. Now this doesn't necessarily mean you have to have your customers aligned beforehand, but during an event you need to be giving them the information that you're all aligned on, like what's the failure mode, how we're recovering from it, what are our expected recovery times or expected behaviors during an outage, and really keep that alignment going throughout the life cycle of the entire project.

The other big thing is simplicity. We've all heard "keep it simple": simplicity scales. We see systems at AWS where all of the big systems have simple core principles that they're built on, and that's the reason they can scale. But keeping it simple also gives you the ability to reason about what's going on during a failure. So you're new on call at 3 a.m., you get paged. If the system is simple and you've learned those principles and understand how it should behave, you'll be much further ahead and able to find out what's going on and initiate any required actions.

Before you go multi-region, there's a few things you should do, and you really should have this multi-AZ nailed down. We do have customers who go multi-region in single AZs, and when I see those architectures, other than getting the blast radius isolation, you're really doing the same thing as multi-AZ in terms of failure. You should have highly automated architectures. Your infrastructure should be defined in code if you can. You want these things to be fast and reproducible, but the more important thing with having it in code is that it's consistent. There's nothing worse than setting up a region and manually setting up another region, and you forgot to set some options, and now you've got a difference that you don't know about until you actually fail over.

Your data authority and replication strategy should be well defined, and this is another thing to align on with the people who are writing the database queries or using the database, the back end engineers, the front end engineers. Everyone needs to know this is how we will know that these values are correct. This is our replication strategy. This is how it will behave when we fail.

So one of the big failures that can happen is core infrastructure, and to mitigate these failures, we are responsible for the resilience of the cloud. To do that, we actually define our services in three different groups, and you'll notice there's zonal, regional, and global services. Global services have their control plane in one region and their data plane in all regions. That's kind of multi-region, but we don't have one that's actually multi-region. We don't build services as multi-region as a service team. When we deploy to a new region, it's essentially a completely isolated incarnation of the entire service.

Under the hood, when we're building these services and when you're building them, we can have a zonal service. The big difference when you ask should I be zonal or should I be regional is whether I'm building a core service that other services are going to build on top of. If the answer is yes, then you probably need to be building a zonally isolated service. Hyperplane, which is what NLB and gateway load balancer use under the hood, is a completely zonal service. Hyperplane in a zone has no idea that other zones even exist, and the way we do cross-zone is we register all of the targets to that target group in every zone.

These zonal services, when we build them, have a regional control plane that has a common endpoint, then they'll have zonal control planes, and data planes are always zonal as well. Regional services are more commonly what we actually build. Most of our services are regional. We have the whole region as one logical incarnation of the service. Sometimes these will have a control plane in the zone, but they may not, and they'll have a regional control plane. It's the endpoint you talk to, and it will propagate to the zonal data planes. Now under the hood, everything is always going to be physically in one zone or another, so the data planes are still isolated even if they're aware of the other zones or using them.

Then finally, we have a very small number of global services. These are generally the edge services like CloudFront, but they're also services like IAM and Route 53. It's important to realize that these mean they're running in one region—their control plane—and if that region is impacted, their control plane may not be available in any of the other regions. This is why Felipe was mentioning you want to keep your mitigations in the data plane as much as you can. Nobody likes to have an issue that they need to run something to react from and then the API is actually down. We build our control planes to be resilient and highly available, but nowhere near the same level of effort is spent compared to data planes.

The other category of errors that we can run into is really data and state. When you're planning for this, you need to consider and keep in mind the CAP theorem. I'm sure everyone has seen this in some form or another. We'll just do a quick refresher. Partition tolerance is the ability to survive a network segmentation where one component of your application cannot work with or talk to the other components. In networked applications, where everything is networking (you're at a networking talk, so you're all dealing with networking), you have to choose this, right? You can't not have partition tolerance. If a user is on their desktop at home and your application is working purely locally, the data is consistent and available, but that's probably not the same thing, right? They need to be able to connect to something.

So we have to choose partition tolerance, and that means we have to make a trade-off between consistency or availability. Now what are these actually? Consistency is the requirement to always return the correct answer. So the correct answer could mean the latest version of a file. It could mean the current balance of a ledger, or it could mean something else that requires additional state syncing before returning and saying yes, we have this. It does not mean that it will always return a response.

Availability means it will always return a response even if it's wrong, and wrong here could be out of date. It could be, again, like the ledger, where the value you're now showing is the latest one you had as of some earlier time. Most people pick availability and then want consistency anyways, and this is something that when you're planning you need to be honest about and say clearly that we can't have all of these things all of the time. Now there are strategies you can use to mitigate problems, and we'll talk through some of those. We just want to cover the CAP theorem again.
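To make the trade-off concrete, here is a minimal sketch of the two read behaviors during a partition. The `Replica` model, the field names, and both functions are hypothetical and only illustrate the idea; they are not how any AWS database implements it.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A regional copy of one ledger value (hypothetical model)."""
    balance: int
    as_of: float              # time the value was last synced
    can_reach_primary: bool   # False during a network partition


class Unavailable(Exception):
    """Raised when a consistent answer cannot be given."""


def read_consistent(replica: Replica) -> int:
    # Consistency choice: refuse to answer rather than risk a stale value.
    if not replica.can_reach_primary:
        raise Unavailable("cannot confirm latest balance during partition")
    return replica.balance


def read_available(replica: Replica) -> dict:
    # Availability choice: always answer, but say how old the answer is.
    return {
        "balance": replica.balance,
        "as_of": replica.as_of,
        "possibly_stale": not replica.can_reach_primary,
    }
```

Returning the `as_of` time alongside the value is one way to be honest with users when you have picked availability.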

Data Replication and DNS-Based Traffic Shifting Between Regions

Now I mentioned earlier the disaster recovery planning. This is an important part of resiliency or an important part of availability, but it's not really what we're focusing on right now. If you're not having any multi-region for your applications or data and you want to start, this is a good place to say let's just ship our data somewhere, our backups into S3 in another region, and then it's less effort. It'll be less cost overall, and the downside is the recovery could be hours to days instead of hopefully seconds in a more highly available resilient system.

So data replication is something that is a challenge, and it will vary a lot based on what your data is and what your users' expectations are. You can have multiple options in terms of maybe you need to write your changes to one region and maybe you need to replicate to another region before returning that write as a success. You may have that kind of consistency requirement. You may just write to the region and then have it lazy replicate or use another system to replicate, but this will help determine whether you need to have active-active in multiple regions at the same time or active-passive where you have write regions and read regions.
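The difference between replicating before acknowledging and replicating lazily can be sketched as follows. The `Region` class is an in-memory stand-in and every name here is an assumption made for illustration; a real system would also need to handle a write that reached only one region.

```python
class Region:
    """In-memory stand-in for a regional data store (hypothetical)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def put(self, key, value):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def write_sync(primary, secondary, key, value):
    """Acknowledge only after both regions hold the value."""
    try:
        primary.put(key, value)
        secondary.put(key, value)  # no rollback of the primary in this sketch
    except ConnectionError:
        return False
    return True


def write_async(primary, pending, key, value):
    """Acknowledge after the primary write; replicate later."""
    try:
        primary.put(key, value)
    except ConnectionError:
        return False
    pending.append((key, value))  # drained by a background replicator
    return True


def drain(pending, secondary):
    """Apply queued writes to the secondary region in order."""
    while pending:
        key, value = pending[0]
        secondary.put(key, value)  # raises if the region is down; item stays queued
        pending.pop(0)
```

The synchronous path trades write availability for the guarantee that a failover region already has every acknowledged value; the asynchronous path keeps accepting writes but can lose whatever is still queued.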

We have a handful of services that actually give you global features. DynamoDB, DocumentDB, and Aurora Global Database are all good options. These will let you determine or configure how your data will replicate between regions, as well as monitor it and have metrics and views and insights into what's actually happening. So when we're talking about a failure and it's a regional-size failure we need to fail over somehow. You've heard Felipe mention that we're doing DNS a lot. We actually use DNS for everything. I think everyone knows it's always DNS, but when you're actually shifting traffic, we're going to be changing DNS records or updating them or letting the health checks dynamically change them.

And when you fail between regions or you're sending traffic between regions, you're going to be using the AWS backbone. The backbone has multiple layers of encryption. It's DDoS resilient. A lot of the hardware we designed and built ourselves, a lot of the fiber we laid ourselves, and it is highly scalable and has a lot of good features, and you'll be shipping traffic between regions using this backbone.

So when you're getting your traffic into the ELB, you have a few different options. Route 53 is the first one, so you always have Route 53 with ELB. It's because we're a Route 53 customer. We pay them for health checks. They health check every single IP we have globally all the time, and that carries over to your DNS records when you have an ELB DNS record.

You can also use CloudFront, which will give you things like edge caching, WAF at the edge, Lambda at the edge, and other features that can help get you into your load balancer, or Global Accelerator. Global Accelerator's real advantages are it has static IPs. You get two IPs. They are in different, completely different infrastructure under the hood, so there's no overlap. It's another good level of blast radius isolation. They're anycast, so you can configure multiple regions as targets and let Global Accelerator figure out how to route the traffic, as well as the static IPs that you don't have to worry about changing.

So when you're actually having a failure and you need to shift traffic over, you want this again in the data plane, and you can do this with Route 53 records if you choose the failover type record. In this example we've got a primary region and a backup region, and our primary region, as long as it's healthy, we will have traffic routing 100 percent to it. But if the primary region becomes unhealthy, Route 53 will detect that and flip the DNS record to unhealthy.

This causes the traffic to start failing over to the failover region. If you have an active-active workload, you can use a weighted record, which will let you put in different numbers for weights for each region, and you could send 50% to the first region and 50% to the second region. It does the same thing when things fail. So if the first region goes unhealthy, it's still going to fail traffic over to 100% to the other region, even though they're both primaries.
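As a rough sketch, the two record styles could be described with dictionaries shaped like a Route 53 change batch. The domain name, load balancer DNS names, and hosted zone IDs below are placeholders, and the helper functions are hypothetical; check the field names against the Route 53 API reference before using anything like this.

```python
def alias_record(name, set_id, elb_dns, elb_zone_id, *, failover=None, weight=None):
    """Build one alias record set that points at a load balancer."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "AliasTarget": {
            "HostedZoneId": elb_zone_id,   # the load balancer's hosted zone ID
            "DNSName": elb_dns,
            "EvaluateTargetHealth": True,  # let the record follow the ELB's health
        },
    }
    if failover is not None:
        record["Failover"] = failover      # "PRIMARY" or "SECONDARY"
    if weight is not None:
        record["Weight"] = weight
    return record


def change_batch(records):
    """Wrap record sets in UPSERT changes."""
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records]}


# Placeholder values only.
PRIMARY_ELB = ("elb-1.example.com.", "ZONEID1")
BACKUP_ELB = ("elb-2.example.com.", "ZONEID2")

active_passive = change_batch([
    alias_record("app.example.com.", "primary", *PRIMARY_ELB, failover="PRIMARY"),
    alias_record("app.example.com.", "backup", *BACKUP_ELB, failover="SECONDARY"),
])

active_active = change_batch([
    alias_record("app.example.com.", "region-1", *PRIMARY_ELB, weight=50),
    alias_record("app.example.com.", "region-2", *BACKUP_ELB, weight=50),
])
```

Because both styles evaluate target health, an unhealthy region drops out of the answers in either case, which matches the behavior described above.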

I put 45% of impact here because this setting is what Felipe was talking about on your target group where you can change the threshold. So you could say 50% of the hosts need to be healthy and this failover would trigger exactly like this. The default is 100% need to be unhealthy, and we don't think that's what most folks should configure, but we hate changing defaults. So definitely go spend some time and look at changing that setting. The simple version is use the unified one and decide on what level of failure you're willing to tolerate in your application and set that.
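The effect of the threshold can be sketched with a small function. This is a simplified model with hypothetical parameter names, not the load balancer's actual logic: the default is modeled as "fail away only when no target is healthy", and an optional percentage makes failover trigger earlier.

```python
def should_fail_away(healthy, total, min_healthy_count=1, min_healthy_pct=None):
    """Return True when the healthy targets fall below the configured minimum."""
    if healthy < min_healthy_count:
        return True
    if min_healthy_pct is not None and total > 0:
        return (healthy / total) * 100 < min_healthy_pct
    return False


# With the modeled default, one healthy target out of ten still receives traffic.
default_behavior = should_fail_away(healthy=1, total=10)

# With a 50% minimum, four healthy targets out of ten triggers failover.
tuned_behavior = should_fail_away(healthy=4, total=10, min_healthy_pct=50)
```

Picking the percentage is the "what level of failure are we willing to tolerate" decision, and it should line up with how much spare capacity the other zones or regions hold.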

But what happens if when you shift this traffic over your new region gets overloaded? We all know that when errors happen, clients sometimes go into connection storms, reconnecting rapidly, sending more and more connections, increasing load, and making the outage worse. If you're having an outage and you fail over a bunch of traffic from one region to another, or even if you're just getting too much traffic, you might be entering a state of congestive collapse where the problem creates more of the same problem.

In this case, our backup region is still healthy, but it's not 100% healthy, and we know it's because of load, so we need to shift some traffic away. You can use DNS load shedding. If you use a weighted record, you still get those health checks as long as you enable Evaluate Target Health. In this case the ELB 1 record is shown pointing straight at ELB 1, but you would replace that with the same thing. The ELB 2 record is now a new load shedding record. It's essentially going to have its own weights: in this case, a null route to 0.0.0.0 at weight 0 and the main site at weight 100.

Then if that main site crosses its unhealthy threshold, or if you want it to happen sooner, you manually change the weights to shift traffic to the null route. In this example we're going to send 20% of traffic to a null route. Clients that get this answer won't be able to connect to your site, but your site can then have a reduced load and preserve the experience for some of your customers. It's not always the best option, but in a lot of cases this is what can get you out of it: we're having a problem, we're in congestive collapse, we need back pressure or load shedding.
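The weight change can be modeled with a short simulation. The record names and both functions are hypothetical, and real resolvers and caches will not split traffic this exactly; the sketch only shows how a weight of 20 on the null route sheds roughly a fifth of new lookups.

```python
import random


def shed_weights(shed_pct):
    """Weights for the load shedding record: null route vs. main site."""
    if not 0 <= shed_pct <= 100:
        raise ValueError("shed_pct must be between 0 and 100")
    return {"null-route": shed_pct, "main-site": 100 - shed_pct}


def resolve(weights, rng):
    """Pick one answer in proportion to its weight, as a weighted record would."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(2025)
normal = shed_weights(0)     # everything goes to the main site
shedding = shed_weights(20)  # about one lookup in five gets the null route
answers = [resolve(shedding, rng) for _ in range(10_000)]
shed_share = answers.count("null-route") / len(answers)
```

Setting the percentage back to 0 restores normal routing without touching the main site's record.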

And then after your region goes healthy again, you can update the record and change it back to 0%, and in this case we're still shifted over to the backup region because the primary region is still unhealthy. These are all Route 53 records that you can configure, and there's more of them, including latency or geo records to say that people from this geography will go to this place. These records are very useful when you're creating your global infrastructure because you're doing multi-region. Your clients are going to be closer to some regions than others. You want to configure your records to ensure they're sending traffic to the right region.

Mitigating Configuration Changes: Testing, Deployment Pipelines, and Graceful Degradation

So let's go to the big thing that we've talked about. Most failures are caused by humans changing things in production, so we need to try to be as careful as we can. What does that mean? What are the things we can actually do to mitigate these configuration changes and deployments? The biggest thing is testing. This includes integration tests, unit tests that run when you check things in, and automatically running these tests at each stage of your deployment or of your environment.

The other big thing is change management. We do a lot of automation and I'm sure you do a lot of infrastructure automation. Those kinds of things are allowed to continue on their own because they have their own systems for detecting failures and mitigating things. But when a human is going to change production, we require going through a strict change management process. You're going to document everything that you're going to do and then you're going to run it step by step, and that increases visibility and increases the ability to say when we're manually changing things we're not doing it without asking a bunch of people to review what we're going to do and getting more people involved so we're not in the heat of the moment trying to change production.

What does that actually look like when you're deploying? You've got CodePipeline, which is basically the thing we use. You've got a pipeline of deployments where you're going to start with some pre-production environments. Someone is going to write the code and check it in. It's going to get built by a system and pushed to the first stage. At each stage as it deploys, you want to have multiple things that are checking and allowing you to forward that deployment to the next stage or do the deployment itself.

That includes all your testing, so we run integration tests in every environment in every region before any deployment separately from the previous runs, and that will detect things like somebody changed.

You want to make sure that you have a rollback alarm that will automatically roll things back if it triggers. You want to make sure that's green before you deploy. You want to make sure your dependencies are all green, and you want to make sure that the time of the deployment is okay within a schedule. Many teams schedule deployments in the middle of the night or on the weekend, but we don't really have a weekend or a middle of the night when our customers aren't using our services.

What we actually do is say we want you to deploy in the region during the daylight hours. We want the people who get paged when a human change breaks something to be awake, alert, at their desks, and maybe already ready to go and able to start investigating. We don't want to page somebody at 3 a.m. to wake them up and then have them figure out what to do. We want to make sure they're prepared for that.
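The pre-deployment checks described above can be sketched as a single gate function. The field names and the 09:00 to 16:00 window are assumptions chosen for illustration; the talk only says deployments should happen during daylight hours in the target region.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class GateInputs:
    tests_passed: bool
    rollback_alarm_ok: bool        # the alarm that would trigger auto-rollback is green
    dependencies_ok: bool
    region_local_time: datetime    # current time in the region being deployed to


def blocked_reasons(gate, window_start=time(9, 0), window_end=time(16, 0)):
    """Return the reasons a stage must not deploy; an empty list means go."""
    reasons = []
    if not gate.tests_passed:
        reasons.append("integration tests failed")
    if not gate.rollback_alarm_ok:
        reasons.append("rollback alarm is not green")
    if not gate.dependencies_ok:
        reasons.append("a dependency is unhealthy")
    if not window_start <= gate.region_local_time.time() < window_end:
        reasons.append("outside deployment window")
    return reasons
```

Returning every failed check, rather than stopping at the first, gives whoever reviews a blocked deployment the full picture in one pass.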

At each of these phases, we're going to run those tests and then propagate to the next phase after the deployment completes. Alpha and gamma, we generally think of as follows: alpha is usually just the team's changes with production from every other service that you use, and then gamma is usually the services integration where you've got all the changes that your whole service is making. Sometimes there are more pre-production stages.

Once you get to your one box stage, and one box is just a name, it doesn't always mean one box. If you have a fleet of tens of thousands, one box may be dozens to hundreds of actual application instances that get updated as part of this. It needs to be statistically significant enough that you can detect the vast majority of problems with the majority of your users, so look at how much traffic your application sends and pick a number to say that's our one box. A good answer is one if you want to know.

When you go to production, we start zonally, so we'll do one Availability Zone and we'll go to the next zone. As we grow this out, it just scales. We can start touching multiple zones or multiple regions in the same day, but this is another place where having alignment is a big deal. Having a rule that says you will not deploy to two Availability Zones in the same region in the same day means that you can look at this, or at an even more complex version, which is not the smallest pipeline I've seen at AWS.

But you get a simple understanding: I know that each of those deployment stages won't be in the same region twice in a day. If you see that, you hopefully wouldn't put it into the pipeline, but if you see it, you know immediately that's a problem with my configuration and planned deployments. This scales and we do deployments where we fan out after we get more and more confidence that we are detecting or we would be detecting any issues that are occurring from our deployments.
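A rule like "never touch the same region twice in one day" is simple enough to check mechanically. This is a minimal sketch over a hypothetical plan format of `(day, region, availability_zone)` tuples; the zone labels are placeholders.

```python
from collections import Counter


def same_day_conflicts(stages):
    """Return the (day, region) pairs that appear in more than one stage."""
    counts = Counter((day, region) for day, region, _az in stages)
    return sorted(pair for pair, n in counts.items() if n > 1)


plan = [
    (1, "us-east-1", "zone-a"),
    (2, "us-east-1", "zone-b"),
    (2, "us-west-2", "zone-a"),
    (2, "us-east-1", "zone-c"),  # violates the rule: second us-east-1 stage on day 2
]
conflicts = same_day_conflicts(plan)
```

Running a check like this when the pipeline definition changes catches the mistake before any deployment is scheduled.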

Another concept that's very useful is graceful degradation. When you've got a failure and you're mitigating it or reacting to it, you want your application components to continue to perform a subset of the core functions, even if the dependencies become unavailable. A good example of this is the Dogs of Amazon page. I'm not sure if you've heard, but there's this website, Amazon.com, where you can go buy things.

If you run into an error, the graceful degradation that the application sometimes does is return you a picture of somebody's dog who brings it to work. First of all, you get a cute dog usually, and you see they're adorable. You also get instructions on what to do next: go back and try again. You also get clarification that this is on our side. We had a problem. You got a 5XX error instead of just giving you a generic error where you don't know what to do. We've given you some instructions on what to do.

There are many other kinds of graceful degradations you can use, but the key concept is you want to have part of your functionality or some level of information to help the users or whoever is interacting with what's failing to have a way out of it without having to panic or open a support case or complain. Maybe they just retry. Every time I've seen a Dog of Amazon, a retry has fixed it. But if it doesn't, you'll still keep getting the same error, and that error is a big part of what's gracefully degrading. The website didn't just return nothing; it returned something, even if it's not what you were actually looking for. You have a step or you have an action that you can do something.
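The pattern of returning something useful instead of nothing can be sketched like this. The function name, the response shape, and the message text are hypothetical; the point is only that the failure path still gives the user an explanation and a next step.

```python
FALLBACK_BODY = (
    "Sorry, something went wrong on our side. "
    "Please go back and try again."
)


def render_page(fetch, item_id):
    """Return the real page when possible, or a helpful error page when not."""
    try:
        return {"status": 200, "body": fetch(item_id)}
    except Exception:
        # Degrade gracefully: a server-side error code plus clear instructions,
        # rather than an empty response or a generic failure.
        return {"status": 500, "body": FALLBACK_BODY}


def broken_backend(item_id):
    raise TimeoutError("dependency did not respond")


degraded = render_page(broken_backend, "item-1")
healthy = render_page(lambda item_id: f"details for {item_id}", "item-1")
```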

Another example of this is you actually scale in the services you're running. There are customers who have multi-region, but the backup region is not running the full application, and it's not expected to. They're expecting that when they fail over, they're going to have a subset of functions available. So if you're a bank and that's the ledger and you want to say we always want to have the right version or the right balance for it, you're going to make that write go to the first region and then to the second region before saying yes, we accept the write.

That way if you fail over to the second region, you can have strong consistency knowing we have the value because we wouldn't have accepted it if it wasn't already replicated.

But other features may not be there in the second region. Maybe I can't go make a payment or maybe I can't go and file something else. There's a bunch of things you cannot have in your backup region as you build into this more and more. Now, of course you can replicate everything, run it in another region and have everything fail over smoothly and have all of the features. It just saves you some money as well as complexity if you know that during a failure this is how the system's going to behave and everyone's aligned and agrees that's the right thing to do. It's going to help you out a lot and save some money.

But when you're thinking about this, you're going to be prioritizing what your users really care about, what's the critical service, and making sure that they're informed. So as an example for the availability side of the CAP theorem, if you have a ledger and you have the latest value, you could show the time that value is from, and you could have that cached on the client. You could also have it in the target, which got it from the database, even though the failover database doesn't have it, and giving that information to your users can help them understand: I know that that transaction isn't there yet, and the time shows me we don't have the latest version.

You probably also want to say something about we're experiencing a failure or we're running in a degraded state, but it's something that gracefully you can help your users have a better experience or continue to maintain function during a failure. One thing that a lot of customers do, but honestly the majority don't, and I think it's undervalued, is toggles and circuit breakers. So toggles in your features would be we're deploying a new feature, we deploy the things that make it work, and then we watch to make sure it's going to work the way we want, but we can turn it on and off without having to go and do a full deployment.

Toggles can help you disable a feature whose dependency is having an issue. Again, prioritize business functions, and think about resource-intensive features under load. Maybe you don't want to do that extra widget that costs a whole bunch of CPU when you're running in your other region. Those are the kinds of things you want to think about removing as part of your failover. Queuing also works well, as does serving cached content. If you have a standard application and you own the client code, you can go change it. You want to make sure you're following all the TCP best practices Felipe talked about, but you could also have local versions of, say, an error page, something like the Dogs of Amazon page, or the latest version of the values that you're going to request, just cached, and then display them.
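A feature toggle and a circuit breaker can each be sketched in a few lines. The flag name, the widget names, and the `CircuitBreaker` class are all hypothetical; production implementations usually add metrics, per-dependency state, and thread safety.

```python
import time


def widgets_for(flags):
    """Build the page's widget list, skipping features that are toggled off."""
    widgets = ["checkout"]
    if flags.get("recommendations", False):
        widgets.append("recommendations")  # expensive, optional feature
    return widgets


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()                   # open: do not touch the dependency
            self.opened_at = None                   # half-open: allow one trial call
            self.failures = self.max_failures - 1   # one more failure re-opens
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

The toggle is flipped by people or automation without a deployment, while the breaker reacts on its own; both give the failing dependency room to recover while the core function keeps working.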

So you can gracefully degrade right there on the client if you own it. At the target group, the health check configuration we talked about, you can use that to fail away sooner and make sure that your target, your experience is preserved for your users, whether that's failing open and sending traffic at the target group or failing away. Load shedding is a useful thing that gets you out of congestive collapse, something that definitely is a very useful thing to have happen or to be ready for in case you get a surge in traffic. Rerouting traffic to alternate regions is another way you can gracefully degrade and have your lower experience or your reduced experience application or your full application.

So let's go back over our checklist. For multi-AZ resiliency, use multiple Availability Zones and pre-provision so you can have static stability. Maintain headroom of at least one AZ's worth of capacity. Use DNS everywhere at your client. Make sure you're honoring DNS TTLs. Make sure when your clients are having errors they're reconnecting to an IP that didn't have the error, and think deeply about your health checks.

Configure them so that you'll fail away when you don't want traffic sent to that target. To decide whether a resource should be in the health check or not, ask: if it has failed and I cannot respond, would I want that response to go to the customer? If the answer is no, I wouldn't, then that resource should be in the health check. And as that gets more expensive, you're going to want to use strategies like running the health check separately and asynchronously from serving its result directly to the health checker.
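One way to separate the expensive check from the response to the health checker is to cache the result. This is a minimal sketch with hypothetical names; the background scheduling that would call `refresh()` periodically is left out.

```python
import threading


class HealthState:
    """Run deep dependency checks in the background; answer probes from cache."""

    def __init__(self, checks):
        self._checks = checks      # name -> zero-argument callable returning bool
        self._healthy = False      # report unhealthy until the first refresh
        self._lock = threading.Lock()

    def refresh(self):
        """Run every check; intended to be called from a background timer."""
        ok = True
        for check in self._checks.values():
            try:
                ok = bool(check()) and ok
            except Exception:
                ok = False
        with self._lock:
            self._healthy = ok

    def status_code(self):
        """Cheap handler for the load balancer's health check request."""
        with self._lock:
            return 200 if self._healthy else 503
```

The health check endpoint stays fast no matter how slow the dependencies are, and the load balancer's probe rate no longer multiplies load on them.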

On the multi-region side, again, the biggest things, keep it simple and align everyone are the two main things that I really want people to think about. When you're doing these things and you're making these decisions, you should not be doing it isolated. You should be doing it as a group so that everyone understands this is the expectation during a failure. That's it. Thank you all so much for coming, really appreciate the early morning rally and don't forget to fill out the survey in the app.

; This article is entirely auto-generated using Amazon Bedrock.