Kazuya
AWS re:Invent 2025 - Building resilient multi-Region applications with Capital One (ARC404)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Building resilient multi-Region applications with Capital One (ARC404)

In this video, AWS and Capital One present strategies for building resilient multi-region applications, addressing three key challenges: dependency management, recovery orchestration, and data consistency. Capital One shares their journey using MELT data (Metrics, Errors, Logs, Traces) to identify hidden dependencies and achieve a 70% reduction in recovery times. The session introduces AWS Application Recovery Controller Region Switch for automated failover orchestration, eliminating manual processes. For data consistency, Amazon Aurora DSQL and DynamoDB multi-region strong consistency are demonstrated as solutions enabling true active-active architectures across regions. The presenters emphasize continuous testing using AWS Fault Injection Service, automated recovery workflows, and understanding CAP theorem trade-offs when choosing between availability and consistency in multi-region designs.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

The 2 A.M. Wake-Up Call: Why Multi-Region Architecture Matters

Thanks, and good morning. I'm going to start us off with a quick scenario here. Can you imagine it's 2 a.m. and your phone erupts with alerts? Thousands of business transactions are failing, and your applications are down. You're wondering: will my failover work? Or will my application even run in the other region? Your engineers are frantically searching for recovery runbooks, half awake, executing procedures you tested a year ago. For some of you, this may sound all too familiar. I see a lot of smiles in the crowd. For others, you're waiting for that day to come because it's going to come.

Thumbnail 0

Whether you're exploring what it takes to go multi-region or looking to improve your own multi-region approach, welcome to ARC404: Building Resilient Multi-Region Applications with Capital One. I'm Daniel Cil, along with my colleagues Eduardo Patrocinio and Prem Kumar Dhayalan from Capital One. We'll show you how to deliver predictable recovery when it matters most.

There's been nearly a decade of multi-region talks at re:Invent that shows how best practices and approaches have evolved over the years, and we're back again. These challenges haven't changed dramatically, but our solutions have become more integrated and sophisticated. Today, we'll talk about three of those challenges and how practical strategies, along with new AWS capabilities, can help you solve these problems.

Thumbnail 60

Prem and I will start with the foundation needed for successful recovery. We'll talk about dependencies, often overlooked until they cause your recovery to fail, and share how Capital One identifies these critical dependencies. Then we'll explore recovery orchestration using AWS Application Recovery Controller region switch and Capital One's journey with multi-region recovery. Eduardo will come up and wrap us up on data consistency, still one of the hardest aspects of multi-region design. He'll demonstrate how Aurora DSQL and DynamoDB multi-region strong consistency can enable true active-active architectures.

Thumbnail 80

Hopefully, you'll leave today with practical knowledge on identifying your dependencies, implementing reliable recovery, and making informed data consistency decisions. But first, let's talk about when you should consider going multi-region. We're up to 38 AWS regions and counting, each resilient and designed to operate independently as fault isolation boundaries. For most applications, a well-designed multi-AZ architecture within a single region is going to be sufficient for resilience. For applications with the highest availability requirements, multi-region architectures can provide additional protection, but it comes with increased cost and complexity.

Thumbnail 140

It requires ongoing investment, not only in technology but in people and processes. So I want you to carefully consider: does my business continuity need require a multi-region approach? For many of our customers, it does, and we typically see two reasons. You may be in an industry that has regulatory and compliance requirements that dictate you need geographic separation for disaster recovery. Financial services, healthcare, and government customers often fall into this category. Or you might have applications that need predictable bounded recovery in the rare event of a regional disruption. These are typically critical applications where extended downtime can create significant business impact.

Thumbnail 180

Capital One's Journey: From Compliance to Bond Platinum

We're fortunate to have Capital One joining us today. They've been on this journey for years and have developed innovative approaches to build and operate multi-region applications at scale.

As Daniel just mentioned, at Capital One, our multi-region journey was driven by both factors. As a financial institution, we have regulatory requirements around geographical distribution and recovery capabilities. But beyond compliance, we recognize that our customers depend on continuous access to their financial information and the ability to make transactions at any time.

Thumbnail 250

Hello, everyone. I hope you're all having a great time at re:Invent. I'm Prem Kumar Dhayalan, Senior Distinguished Engineer for Enterprise Resiliency and Recovery at Capital One. I'm glad to be here presenting this topic. With AWS, our cloud-based environment operates at a massive scale, managing thousands of applications deployed across thousands of AWS accounts. This infrastructure includes millions of AWS resources and supports thousands upon thousands of transactions.

Thumbnail 280

Thumbnail 310

Our infrastructure includes millions of AWS resources and supports thousands of critical workloads. Our approach has evolved from treating multi-region as a specialized disaster recovery capability to making it a fundamental part of our application architecture. We call it Bond Platinum. This shift allowed us to not only meet regulatory requirements but also deliver a more sophisticated and reliable experience for our customers. A key part of our success has been centralizing and automating best practices. We don't leave it to individual teams to figure out multi-region solutions on their own. Instead, we created platform capabilities and guardrails that make the right way the easy way.

Thumbnail 390

Uncovering Hidden Dependencies: The Silent Killers of Recovery Plans

What I'll share today isn't theoretical. It's based on years of practical experience operating mission-critical applications across multiple regions. Daniel and I will talk through the first challenge. You've probably experienced this already, but dependencies can ruin your recovery plans. I hear this often, and it's usually the same story: we tried to recover or fail over to the other region, but we had a dependency on X, and X was unavailable because it was still in the primary region. A single overlooked dependency, whether it's a hard-coded endpoint or a third-party API, can turn into extended downtime.

Thumbnail 420

Thumbnail 430

Thumbnail 440

Thumbnail 450

In our multi-region fundamentals paper, if you haven't read it yet, there's a QR code here. We talk about understanding your dependencies, and you can think about them in four categories. First, AWS services that must be available in your target region because not all AWS services or features are available in every region. Second, internal systems such as shared services or on-premises components. These might include your authentication services or other business systems that your applications depend on. Third, third-party services like identity providers or other SaaS applications. I've seen this increasingly become really critical in most organizations today. And finally, configuration requirements, including service quotas and secrets. These need to be consistent across both regions to support failover. If any of these dependencies are not available, it could cause your recovery to fail.

Thumbnail 470

What I see often is that the challenge is really how you identify these dependencies because they're not core components of your application. If you think about your architecture diagram, they often don't show up on them. So it's often out of sight and out of mind. I like to use this mental model to think about these dependency categories either as recovery dependencies or runtime dependencies. Think about the critical dependencies you need to recover your application in a different region. Once they're recovered, what runtime dependencies do they need to operate normally, such as upstream or downstream services that need to be in the same region?

Thumbnail 510

I'll start with AWS services. Knowing which services and features are available in the target region is crucial, and we just made that easier. We recently announced AWS Capabilities, a new addition to AWS Builder Center that gives you visibility into AWS services across our global infrastructure. You can compare regions side by side, filter for specific services or features, API operations, and even CloudFormation resources. You can use this tool now to ensure that you have parity for your services and features across your AWS regions.
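
The same region-availability data is also published as public AWS Systems Manager parameters, so you can sketch an automated parity check with boto3. This is a minimal sketch, assuming the public parameter path `/aws/service/global-infrastructure/regions/<region>/services`; the region names in the comment are illustrative:

```python
def services_in_region(region: str) -> set[str]:
    """Service codes available in an AWS region, read from the public
    SSM parameters under /aws/service/global-infrastructure/."""
    import boto3  # AWS SDK for Python (requires credentials to call SSM)

    ssm = boto3.client("ssm", region_name="us-east-1")
    path = f"/aws/service/global-infrastructure/regions/{region}/services"
    services: set[str] = set()
    for page in ssm.get_paginator("get_parameters_by_path").paginate(Path=path):
        services.update(p["Value"] for p in page["Parameters"])
    return services


def missing_in_standby(primary: set[str], standby: set[str]) -> list[str]:
    """Services you rely on in the primary region that the standby lacks."""
    return sorted(primary - standby)


# Example usage (needs AWS credentials):
#   gaps = missing_in_standby(services_in_region("us-east-1"),
#                             services_in_region("us-west-2"))
```

Running a check like this in CI is one way to catch a service-parity gap before a failover test does.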

Thumbnail 540

Thumbnail 550

Thumbnail 560

Thumbnail 580

Thumbnail 590

For the remaining dependencies, let's take a look at this multi-regional architecture. Here we have an application deployed in the primary region. What you don't really think about is that its dependent components are likely managed by a different team. When you're planning for regional disruption, it's easy to think about only recovering your application in that other region. But what about all the dependencies the app has? Are any of them critical for recovery? Is the application you're recovering in the secondary region dependent on any services that are in the impaired primary region? Now your recovery has failed because these dependencies were either overlooked, unknown, or assumed they'll be available because it always worked during DR testing.

So how do you know what recovery dependencies are needed? Think about things like the container images needed to scale your containers, or the secrets and credentials needed to connect to your databases.

Thumbnail 630

One practical way to start is with your recovery runbook to determine what services or tools you need to execute every step in your recovery plan. Then you can take it a step further. Once you identify those dependencies, you can do dependency chain mapping to understand what dependencies your dependency has. By doing this, you've at least identified the known but overlooked dependencies. But that's the easy part. What's hard is what's hidden. What about the unknown ones that you don't know about that might be tribal knowledge, something somebody knew 15 years ago and they've left the company since?
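
As a sketch of that dependency chain mapping, a runbook-derived dependency graph can be walked to find everything an application needs, directly or transitively. The graph and service names below are hypothetical, purely for illustration:

```python
from collections import deque


def transitive_dependencies(graph: dict[str, list[str]], app: str) -> set[str]:
    """Walk the dependency chain: everything `app` needs, directly or
    through its dependencies' dependencies."""
    seen: set[str] = set()
    queue = deque(graph.get(app, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(graph.get(dep, []))  # map what the dependency needs
    return seen


# Hypothetical graph built from a recovery runbook: each entry lists the
# direct dependencies of one component.
deps = {
    "payments-app": ["auth-service", "container-registry"],
    "auth-service": ["identity-provider", "secrets-manager"],
    "container-registry": [],
}
```

A walk like `transitive_dependencies(deps, "payments-app")` surfaces second-order dependencies (here, the identity provider and secrets store) that rarely appear on the application's own architecture diagram.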

One way you can identify these hidden dependencies is to block network traffic between your AWS regions during a failover test. You can use a purpose-built test like AWS Fault Injection Service's cross-region connectivity scenario, which does exactly that. You can use this scenario to find hidden cross-region dependencies and validate that traffic is not going from the primary region to the secondary region during a failover test. What it does is deny cross-region traffic, access to AWS public endpoints, and access to your workloads via load balancer and API gateways. It'll also pause cross-region replication to see how you handle data reconciliation.

Thumbnail 690

The goal really is that during a failure scenario, there should be no dependencies between the primary region and the secondary region. To achieve this design for regional independence, you want to ensure your critical dependencies are available in each region. For some organizations, these capabilities are often provided by a central platform team. Consider operating that platform in an active-active strategy so that it's always available. One benefit to this is that you can always support applications that fail over during testing at any time without coordination, or during an actual event, you don't have to wait for the platform to recover first, thus avoiding extended recovery time.

An important consideration is if you're using a SaaS provider that's also hosted in the same AWS region, make sure they also have a multi-region strategy. To ensure this regional independence, you should continue to use FIS cross-region connectivity scenario as part of your regular testing to validate there's no traffic going between both AWS regions during a failover. Now runtime dependencies are just as important, and I'll have Prem come back and talk about how Capital One does runtime dependency management.

Thumbnail 770

Capital One's MELT Data Approach: Mapping Runtime Dependencies at Scale

Runtime dependencies represent the highest tier of complexity and risk, with reliance on the coordinated availability of multiple services. Consider this example of bill payment. This single business process is achieved through a chained sequence of transactions. The client requests bill pay, which initiates the business process. The balance API acts as an upstream dependency to validate funds. The scheduling API manages the orchestration and timing of the transaction. The print check API executes the core business action. The update API and database operation represent the downstream dependencies critical for state persistence and the final record update.

Thumbnail 860

As you can imagine, failure in any single link within this chain, for example, bill payment balance API timeout or schedule API fails to communicate or commit, results in the complete failure of the business process. This highlights just how complex the upstream and downstream dependency challenges can be in large systems. Over the years we've found that the most dangerous dependencies are often undocumented, existing only as tribal knowledge or implicit dependencies that were created years ago and forgotten. In our early multi-region implementations, we focused heavily on the technical aspects of replication and failover but underestimated the complexity of dependency management.
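
To make that failure mode concrete, here is a minimal sketch of such a chained process in Python. The step names and the insufficient-funds rule are illustrative assumptions, not Capital One's actual implementation:

```python
class StepFailure(Exception):
    """Raised when any link in the chain fails."""


def check_balance(tx):
    # Upstream dependency: validate funds before anything else runs.
    if tx["amount"] > tx["balance"]:
        raise StepFailure("insufficient funds")
    return tx


def schedule(tx):
    return {**tx, "scheduled": True}


def print_check(tx):
    return {**tx, "printed": True}


def update_record(tx):
    # Downstream dependency: persist the final state.
    return {**tx, "status": "completed"}


def process_bill_payment(tx):
    """Chained sequence: balance -> schedule -> print -> update.
    A failure in any single link aborts the entire business process."""
    for step in (check_balance, schedule, print_check, update_record):
        tx = step(tx)  # any StepFailure propagates and fails the payment
    return tx
```

If any step in the chain lives in the impaired region after a failover, the whole payment fails, which is exactly why upstream and downstream runtime dependencies have to move together.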

Thumbnail 900

In our recovery exercises, we discovered numerous hidden dependencies that prevented the application from functioning correctly in the secondary region. Those humbling experiences taught us critical lessons to evolve.

Thumbnail 930

Thumbnail 940

Thumbnail 950

Thumbnail 960

Thumbnail 980

Lack of visibility leads to cascading impacts. There is no single source for the complete dependency chain or the precise recovery order, which makes impact analysis and appropriate remediation action harder. A significant hurdle is determining the appropriate actions to take on each dependency. The other significant challenge is the difference between the dependencies declared during design versus observed in the live traffic. Without proper dependency management, you risk major business impact and longer downtime during incidents. Incomplete visibility leads to delayed reaction time when responding to failures. Dependency information quickly becomes outdated as applications constantly evolve.

Thumbnail 1000

Thumbnail 1010

Thumbnail 1020

These failed recovery exercises led us to develop a sophisticated approach to dependency management using MELT data: Metrics, Errors, Logs, and Traces. Our process begins with collecting a high volume of diverse MELT data directly from our AWS resources and applications. The data allows us to analyze live traffic and traces to generate and maintain accurate application dependencies. We correlate this data and apply context enrichment to identify complex dependencies, achieving deep visibility. We map all upstream and downstream dependencies, assigning criticality and factoring in the resiliency tier for prioritization.

Thumbnail 1060

Thumbnail 1070

This comprehensive data-driven map is persisted and used as our core intelligence to determine the correct and sequential recovery during automated recovery. This data-driven approach has been a game changer for us, moving us from static, potentially outdated dependency management to a dynamic, accurate view of how our systems actually interact. This is a high-level view of how the solution flow works: from an AWS resource to MELT data collection, enriched with context, forming a detailed application dependency tree. This platform offers flexible views including infrastructure, API dependencies, and component views, and many more.
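
One way a dependency tree like this can be derived from the "T" in MELT is by joining trace spans on their parent IDs, turning raw spans into caller-to-callee service edges. This is an illustrative sketch, not Capital One's platform; the span tuples and service names are made up:

```python
def edges_from_traces(spans):
    """Derive caller -> callee service edges from trace spans.

    Each span is a tuple: (span_id, parent_span_id, service_name).
    Root spans have parent_span_id of None.
    """
    by_id = {span_id: service for span_id, _, service in spans}
    edges = set()
    for _, parent_id, service in spans:
        parent_service = by_id.get(parent_id)
        # Only record cross-service hops, not calls within one service.
        if parent_service and parent_service != service:
            edges.add((parent_service, service))
    return edges


# Hypothetical spans from the bill-pay flow described earlier.
spans = [
    ("1", None, "billpay"),
    ("2", "1", "balance-api"),
    ("3", "1", "scheduling-api"),
    ("4", "3", "print-check-api"),
]
```

Edges observed this way reflect live traffic, which is how the gap between dependencies declared at design time and dependencies observed in production can be detected.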

Thumbnail 1100

To summarize our key takeaways on dependencies: identify hidden recovery dependencies through testing and analysis. Don't assume your documentation is complete. Continuously monitor and update. Identify all the runtime dependencies for end-to-end operations. Look beyond what's needed and consider what's required to function over time. Ensure regional independence to avoid cascading failures so that a problem in one region doesn't affect your ability to operate in another.

Thumbnail 1140

Thumbnail 1170

Based on our experience, I would emphasize that dependency management is not a one-time exercise. It's an ongoing process. At Capital One, we continuously monitor and update our dependency maps as applications evolve. We found that automated approaches like MELT data analysis, which we discussed earlier, are essential for keeping dependency information correct and accurate.

Thumbnail 1190

AWS Application Recovery Controller Region Switch: Automating Multi-Region Orchestration

Now that we've identified critical dependencies, I think of that as pre-recovery things that you have to do beforehand, before you actually need to fail over. So now when it's time to actually fail over, how do you orchestrate recovery into another region? You need a reliable recovery mechanism to coordinate a sequence of steps for regional failover. As Prem talked about, when we understand what these dependencies look like, whether it's applications or services, often they need to get failed over in a particular sequence.

Thumbnail 1220

Thumbnail 1240

Thumbnail 1250

Thumbnail 1260

This is where I see a lot of organizations struggle, using manual processes or repurposing deployment pipelines that weren't designed for recovery scenarios. Here you see a typical regional failover. It requires more than just flipping DNS to the other region. It needs an automated approach involving multiple steps that must be executed in the correct sequence. In the sample architecture, there was a disruption in the primary region. We would perform the following recovery steps in the standby region. Scale up compute resources and scale up our data streams. Fail over our databases, and then we would probably do some application verification in between. Maybe application configuration or validating that your service and application work, and then flipping over DNS traffic so that clients can continue to operate and go to the other region.

Thumbnail 1270

Thumbnail 1280

Thumbnail 1290

Maybe you have these capabilities already today, but they're painful for a few reasons. Many customers still manually work through a list of steps from the recovery runbook, whether it's click ops or executing handcrafted scripts to fail over. This approach is error prone and takes more time coordinating between different operators, making sure that they're going in the correct sequence. Reliability is critical as well. Your recovery tool needs to be available when you need it most, which means it shouldn't have a dependency on the primary region. Not only that, but your recovery plan is only as good as the last time you tested it, very similar to your backup plans. Without continuous validation, you risk discovering broken permissions or configuration issues during the failover.

Thumbnail 1320

Thumbnail 1330

Then when executing a recovery, it's often unclear where you are in the process, what succeeded and what failed. Lack of real-time visibility can impact your recovery time. I'm excited to share that in August 2025 we launched a new capability for AWS Application Recovery Controller called Region Switch. We built it to provide customers a fully managed, highly available orchestration service to switch between regions, addressing the very challenges we just talked about. It's a reliable mechanism for two reasons. First, it performs a plan evaluation every 30 minutes to ensure that your recovery plan will work when it needs to most. Second, the recovery plan is executed from the target region you're going to, so there's no dependency on the region you're leaving, which may be impaired.

Thumbnail 1370

It provides automation that eliminates the need to build and maintain custom recovery tools. Let me show you how to build an automated workflow in a Region Switch plan that reduces recovery time and human error. I've taken the application recovery process that we just saw and built this workflow for it. Here we use the term activate to fail over to the passive region and deactivate to prevent traffic from going to an unhealthy region for active-active architectures. Every box you see represents an execution block, a step in your recovery workflow. Execution blocks can run sequentially like steps 3 and 4, or in parallel like steps 1 and 2. We start off with a manual approval to allow human in the loop to make the decision to fail over, but the rest is automated.

Thumbnail 1430

Region Switch plans are also flexible so you can build nested workflows for complex scenarios to fail over groups of dependent applications. Whether they're groups of microservices that you have together or groups of applications that need to move over together, you can coordinate all that with the plan. If we take a closer look at the execution blocks, the first thing that we did was scale up compute capacity in the target region to take on failover traffic. This execution block is an ECS service scaling action that increases the number of tasks in the target region. But how do you know how much to scale up in the target region? Region Switch will frequently track the peak running capacity of the source region to allow you to match by percentage in the target region, and that's one of my favorite features. By the way, it also supports EC2 Auto Scaling groups and Elastic Kubernetes Service with these same capabilities.
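
A rough sketch of what that scaling action amounts to, assuming a tracked source-region peak and a match percentage. The helper and resource names are illustrative, not Region Switch internals; the ECS call itself uses the standard `UpdateService` API:

```python
def target_task_count(source_peak: int, match_percentage: int) -> int:
    """Desired task count in the target region, as a percentage of the
    source region's tracked peak (rounded up, minimum of 1 task)."""
    return max(1, -(-source_peak * match_percentage // 100))  # ceiling division


def scale_ecs_service(cluster: str, service: str, desired: int, region: str):
    """Scale an ECS service in the failover target region (sketch)."""
    import boto3  # AWS SDK for Python (requires credentials)

    ecs = boto3.client("ecs", region_name=region)
    ecs.update_service(cluster=cluster, service=service, desiredCount=desired)


# Example: match 100% of a tracked peak of 12 tasks in us-west-2.
#   scale_ecs_service("payments", "auth-api", target_task_count(12, 100), "us-west-2")
```

Letting the orchestrator track the source region's peak means the standby scales to realistic capacity instead of a stale number from a runbook.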

Thumbnail 1470

For specialized recovery requirements, Region Switch supports a custom action Lambda execution block. This flexibility allows you to incorporate any API accessible action to your recovery workflow, addressing unique requirements while maintaining the benefits of centralized orchestration and monitoring. What you see here is a Python script I just wrote to scale up Kinesis shards using its API in the target region. But it's like a Swiss Army knife. You can use it to do the application configuration we talked about or do application verification prior to failing over. So whatever custom code you have, you can use with this execution block.
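
A custom-action Lambda along those lines might look like the following sketch. The event fields (`stream`, `region`) and the doubling policy are assumptions for illustration, not a Region Switch contract; the Kinesis calls are the standard `DescribeStreamSummary` and `UpdateShardCount` APIs:

```python
def target_shard_count(current: int, multiplier: float = 2.0) -> int:
    """Scale the shard count (doubling by default) for failover traffic."""
    return max(1, int(current * multiplier))


def handler(event, context):
    """Custom-action Lambda for a Region Switch execution block (sketch)."""
    import boto3  # AWS SDK for Python (requires credentials)

    kinesis = boto3.client("kinesis", region_name=event["region"])
    summary = kinesis.describe_stream_summary(StreamName=event["stream"])
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    target = target_shard_count(current)
    kinesis.update_shard_count(
        StreamName=event["stream"],
        TargetShardCount=target,
        ScalingType="UNIFORM_SCALING",
    )
    return {"scaledTo": target}
```

Because the block accepts any Lambda, the same pattern covers the application configuration and verification steps mentioned above.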

Thumbnail 1510

Thumbnail 1530

The database is often the most complex part of recovery. This execution block makes it simple to fail over an Aurora global database, so you no longer need your DBA to do it. You just provide the cluster name, ARNs, and regions, and Region Switch handles the rest. It even lets you perform a switchover for planned testing or a failover during actual incidents.
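
Under the hood, that switchover-versus-failover distinction maps to two RDS APIs: `SwitchoverGlobalCluster` for planned, zero-data-loss moves and `FailoverGlobalCluster` for incidents. A hedged sketch of choosing between them (identifiers below are illustrative):

```python
def global_db_operation(planned: bool) -> str:
    """Pick the RDS API: switchover for planned tests, failover for incidents."""
    return "switchover_global_cluster" if planned else "failover_global_cluster"


def move_global_database(global_cluster_id: str, target_cluster_arn: str, planned: bool):
    """Promote the standby region's cluster in an Aurora global database (sketch)."""
    import boto3  # AWS SDK for Python (requires credentials)

    rds = boto3.client("rds")
    operation = getattr(rds, global_db_operation(planned))
    kwargs = {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if not planned:
        # An unplanned failover may lose writes not yet replicated.
        kwargs["AllowDataLoss"] = True
    return operation(**kwargs)
```

The planned path waits for replication to catch up; the unplanned path trades potential data loss for recovery time, which is the CAP-style trade-off the session returns to later.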

And then lastly, the Route 53 health check execution block provides a reliable way to route DNS traffic using Route 53 health checks. Here you'll add a hosted zone ID and a DNS record that supports health checks to the execution block. In this example, we're using a DNS failover record that has a primary record in US East One and the secondary record in US West Two. Once that execution block is created, Region Switch is going to generate health checks for each of the DNS records like you see in the diagram below.

You just copy these health check IDs into the corresponding Route 53 records and associate them. What's important to note is that these health checks are created in an AWS managed account, and they don't actually monitor your application's resources. Since Region Switch owns these health checks, it can control DNS routing by changing the health check status to route traffic to the other region on demand as part of your recovery workflow. We use what's called a standby takes over primary (STOP) pattern, where the secondary region takes control from the primary, so there's no dependency on the primary. Everything is executed from the secondary.

Thumbnail 1630

That's a quick overview of how Region Switch works. Hopefully it's going to make your life easier and more reliable to recover applications in another region. I'll have Prem come back one more time to share about how Capital One has gone through their recovery automation journey over the years.

From Hours to Minutes: Capital One's Recovery Automation Evolution

At Capital One, our recovery journey began with slow, manual processes and isolated, component-level approaches that ignored the broader ecosystem. We recognized this critical gap. At that time, no market solution met a high standard for fast, reliable recovery for business continuity. Building an in-house tool was a strategic investment in our resiliency posture. It's been a multi-year journey and a continuous investment.

First we developed automated runbooks, converting static plans into automated operations. Next we created custom CLI-based runbooks for total flexibility and teamwork. Then we introduced least-privilege experiences, which reduced our reliance on DBAs and cut down human errors. Our plug-and-play features allowed quick onboarding and automation for common enterprise cases. Finally, we achieved full maturity with end-to-end automation via low-code and no-code workflows.

Today, the decision to fail over remains human in the loop, but the execution is fully automated. Remember we talked about the dependency resolution in the previous section. Here is where we leverage those dependency management capabilities for smart, targeted, and sequential recovery. Our preferred and most efficient method is the dependency group failover, though we support individual application failovers too. Our evolution mirrors the maturity journey of many multi-regional organizations. That's why we are excited about the AWS Application Recovery Controller Region Switch. It addresses many of the challenges we solved internally.

Thumbnail 1770

Thumbnail 1790

Thumbnail 1800

Our recovery tooling efforts were transformative, cutting recovery times from hours to minutes. Our tools are a required common capability across operations at Capital One and help ensure standardization. This consistency of behavior eliminates thousands of unique custom solutions and standardizes workflows across common tasks, which helps reduce mistakes during incidents and testing events.

Thumbnail 1820

Thumbnail 1840

Second, it significantly reduced the mean time to engage and restore through automation, leading to faster incident response and recovery. Third, it freed up capacity for developers to spend on innovation and delivering higher business value. This transformation has fundamentally changed how we approach resilience, making it a core operational capability rather than an afterthought. Our technical recovery exercises provide clear proof of our tooling's benefits. Our average minutes to recovery across all resiliency tiers has been reduced by an estimated 70 percent. We have a long way to go, but it's a significant improvement from where we started the journey to where we are today.

Thumbnail 1900

This improvement comes from several factors: a smoother recovery process, reduced risk from human error, and making it easier for applications to test failover more frequently. This transformation converted what were once static documents into functional compliance and recovery tools that deliver real business value. AWS tools and services act as an accelerator for incident response and recovery, minimizing impact during disruptions. The slide shows the comprehensive ecosystem of recovery tools we use. For DNS-based routing, we use Route 53 health checks for traffic management, along with Global Accelerator and Elastic Load Balancing.

Thumbnail 1960

Thumbnail 1970

For disaster recovery, we use a combination of AWS services to build the in-house recovery tool I talked about earlier. For testing, AWS Fault Injection Service provides resiliency testing capabilities. Although we built our own orchestration capabilities, we are excited about the continuous evolution of these services and look forward to seeing how ARC Region Switch can be used to reduce our operational burden. The key takeaways are: implement a reliable automated mechanism like ARC Region Switch. Don't rely on manual processes or repurposed tools that were designed for other scenarios, such as CI/CD tooling. We went through a similar journey in our early phase.

Thumbnail 2020

Ensure your failover process accounts for application dependencies. Understand which components need to move together and in what sequence to maintain functional integrity. Test regularly under realistic conditions. Don't test in ideal circumstances; simulate the chaos and constraints of real incidents. By understanding these elements, you will significantly improve your ability to recover quickly and reliably during regional disruptions, whether planned or unplanned.

Thumbnail 2080

Building a Credit Card Authorization System: The Active-Passive Foundation

At Capital One, our experience has shown that investment in purpose-built tools and orchestration pays dividends, not just during incidents but also in regular testing exercises. The confidence that comes from knowing your recovery process works reliably is invaluable, both for technical teams and for business stakeholders. Now, Eduardo will talk through the third challenge, which is data consistency.

Thank you, Prem. Hello everybody. My friend Prem talked about the fact that dependency mapping is the hardest problem, and I agree with him. My name is Eduardo Patrocinio. I'm a Principal Solutions Architect here at AWS supporting Capital One and the banking community across AWS. I want to talk about a different kind of challenge when building resilient multi-region applications, and that is data consistency.

Thumbnail 2100

Let's look at some requirements for a credit card authorization system. This is a fictitious application that I'm building.

Thumbnail 2110

Thumbnail 2120

Thumbnail 2130

Thumbnail 2140

The application I'm building has realistic requirements. We want to process transactions in less than 500 milliseconds, and we want to do card validations using the Luhn algorithm, validating the CVV and other details. We also want to have some ability to do business logic for authorization. Then we want to be able to process at least 100 transactions per second with 99.9% uptime. We want to persist the data with encryption, have a RESTful API, and handle PCI data and similar requirements. This is the scenario I'm going to use to talk about data consistency.
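
The Luhn check mentioned in these requirements is straightforward to implement. Here is a standard version (not the session's actual code): starting from the rightmost digit, every second digit is doubled, digits above 9 have 9 subtracted, and the total must be divisible by 10.

```python
def luhn_valid(card_number: str) -> bool:
    """Validate a card number with the Luhn checksum algorithm."""
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if not digits:
        return False
    checksum = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            digit *= 2
            if digit > 9:       # e.g. 8 -> 16 -> 7
                digit -= 9
        checksum += digit
    return checksum % 10 == 0


# The classic Visa test number passes; changing one digit fails.
# luhn_valid("4111111111111111") -> True
```

The Luhn check only catches transcription errors; the CVV and business-logic checks in the requirements still run separately.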

Thumbnail 2150

The solution I built here is still in a single AWS region, spread across multiple availability zones. Let me describe the different components in that solution. For the data layer, I'm using Aurora to store static information such as the user's profile, credit balance, credit limit, and whether their card is authorized for international transactions. Aurora is used mostly for profile information. I'm using DynamoDB to store the transactions: any time a user makes a transaction that gets approved or rejected, I store it in DynamoDB. Since I want this solution to be as quick as possible, I'm using ElastiCache to provide in-memory caching. I bring the most recent transactions from DynamoDB, along with the user's profile information, into ElastiCache. My goal with this solution is to approve or reject each transaction as quickly as possible.
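The read path described here is the classic cache-aside pattern. The sketch below is a hedged illustration: `cache` is assumed to expose Redis-style `get`/`setex` (as an ElastiCache for Redis client would), and `store` stands in for the authoritative database read against Aurora or DynamoDB; the key format and TTL are assumptions.

```python
import json

def get_profile(cache, store, user_id, ttl_seconds=300):
    """Cache-aside read: try the in-memory cache first, fall back to
    the durable store, then populate the cache for subsequent reads.

    cache -- object with Redis-style get(key) / setex(key, ttl, value)
    store -- callable that loads the profile from the database
    """
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no database call
    profile = store(user_id)               # cache miss: authoritative read
    cache.setex(key, ttl_seconds, json.dumps(profile))
    return profile
```

The TTL bounds how stale a cached profile can get, which matters once the same data is replicated across regions.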

Thumbnail 2260

To achieve that, I'm using Fargate, which looks at that information, pulls data from ElastiCache if necessary, and quickly decides whether the transaction is approved or rejected. Everything else I do asynchronously: I send the transaction to a Kinesis stream, and a Lambda function pulls that information and persists it to the DynamoDB table. This is good, but as my friend Dan discussed, there are many reasons to consider going multi-region, such as different kinds of requirements or different user experiences. You might have users all over the world or all over the country, and you want to ensure that every user has the same experience.
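The asynchronous persistence step could look roughly like the handler below: Kinesis delivers records base64-encoded, so the Lambda decodes each one and writes the authorization result to the transactions table. The item schema and field names are assumptions, and `table` is injected (duck-typed like a boto3 DynamoDB `Table` resource) so the sketch stays self-contained.

```python
import base64
import json

def persist_transactions(event, table):
    """Lambda-style consumer for the asynchronous persistence path:
    decode each Kinesis record and store the authorization result.

    event -- Kinesis event: Records[].kinesis.data is base64-encoded JSON
    table -- object exposing put_item(Item=...) like a DynamoDB Table
    """
    written = 0
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "user_id": payload["user_id"],            # assumed partition key
            "transaction_id": payload["transaction_id"],
            "amount": payload["amount"],
            "decision": payload["decision"],          # "APPROVED" / "DECLINED"
        })
        written += 1
    return {"written": written}
```

Because the write happens off the hot path, the Fargate service can acknowledge the authorization without waiting for DynamoDB.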

Thumbnail 2290

How do I evolve the solution I had before to provide multi-region capability? My first step is to take that solution and deploy it across two regions. Hopefully you're using infrastructure as code, such as CloudFormation or Terraform, so I can just deploy that solution to both regions. I'm using two kinds of databases to help implement this multi-region solution. The first is Aurora Global Database, which allows me to replicate data from one region to another. Similarly, DynamoDB has the global table capability that I can use to replicate the data. This is a pretty good solution. I have Route 53 on top serving DNS. I'm still using an active-passive design here, so all the traffic goes to the left side, to Region A. But if there is any impairment in any of these components, I can quickly switch, perhaps using the Application Recovery Controller Region Switch that Dan described before, to move from one region to the other.

Thumbnail 2360

Thumbnail 2390

This is pretty good, but it's still one region being active and the other being passive. Before I talk about how to enhance that solution, I need to talk about the CAP theorem. The CAP theorem was formulated 25 years ago, and it tells us that whenever there is a network partition, you have to choose between consistency and availability. When you are using multiple availability zones or regions, you have to decide whether to favor the availability side or the consistency side. Now let me evolve that solution to be active-active. I changed my Route 53 configuration to send traffic to both regions. Now users across different parts of the world can be served with the same experience. The solution is sending traffic to both regions, so we are in good shape.

Thumbnail 2450

But are we done here? The answer is no. If you look carefully at that solution, even though all the components are running in both regions, it is not a true active-active solution. Why? If you look at the picture, there is one component that runs primarily in one region with only a read replica in the other: Aurora requires all writes to go to a single region. You can have a read replica in the other region, but with that design I'm still dependent on one region to write records to my Aurora database.

Thumbnail 2460

Achieving True Active-Active: Aurora DSQL and DynamoDB Multi-Region Strong Consistency

So how can I make this solution better? What I want to describe is Amazon Aurora DSQL. DSQL is a new kind of service that we introduced earlier this year; it allows active-active writes to a database across multiple regions. Yes, it does require a third region to serve as a witness so that we can have quorum across three regions. But with this solution, I can now write to both Region A and Region B. Pretty cool. So now we have a service from AWS that helps you implement multi-region deployments.

Thumbnail 2500

Thumbnail 2510

Thumbnail 2530

But are we done? The answer is no. With Aurora DSQL, AWS manages the infrastructure and the scalability of the solution for you. It lets you run truly multi-active across regions, and it does automatic failover. Dan described the capabilities of Application Recovery Controller for enabling and activating the different regions; here, AWS handles the failover of the database for you. It is PostgreSQL compatible, and one thing I find super interesting is that transactions are welcome. We often discourage multi-statement transactions in the cloud, but here it is the opposite: all the statements in a transaction are resolved locally inside the region, and only at the COMMIT statement do we synchronize with the other side. So transactions on Aurora DSQL are welcome, and I encourage you to use them.
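Because conflicting writes surface at commit time in this model, a common client pattern is to wrap each transaction in a retry loop. The sketch below is generic and hedged: `connect` is any DB-API connection factory (e.g. a psycopg connection to a PostgreSQL-compatible endpoint), and `retryable` is whatever exception type your driver raises for a commit-time conflict; none of these names come from the DSQL API itself.

```python
def run_transaction(connect, txn_fn, retryable, max_attempts=3):
    """Run txn_fn(conn) in a transaction, retrying on commit conflicts.

    With statements resolving locally and cross-region conflicts
    surfacing at COMMIT, the idiomatic client pattern is: try,
    catch the retryable error, roll back, and re-run.

    connect    -- factory returning a DB-API connection
    txn_fn     -- function performing the reads/writes
    retryable  -- exception type signalling a commit-time conflict
    """
    for attempt in range(1, max_attempts + 1):
        conn = connect()
        try:
            result = txn_fn(conn)
            conn.commit()        # conflicts with concurrent writers land here
            return result
        except retryable:
            conn.rollback()
            if attempt == max_attempts:
                raise            # give up after the final attempt
        finally:
            conn.close()
```

Keeping transactions short keeps the conflict window small, which reduces how often this retry path fires.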

Thumbnail 2560

Now let's look at this diagram again. Replication lag is kind of awkward: even if I stop here for two seconds and you don't hear any noise from me, you start to wonder what's going on. What's happening is that the DynamoDB global table is still replicating between Region A and Region B asynchronously. The traditional mode of DynamoDB global tables is called multi-region eventual consistency: DynamoDB persists the data locally in one region and, behind the scenes, asynchronously replicates it to the other side. As a consequence, there is a caveat whenever we use DynamoDB this way: the last writer wins.

Consider a scenario where you decided to store, for example, the credit balance of your credit card as a single record in DynamoDB. By coincidence, my wife and I make a transaction at the same time. I'm here in Vegas spending ten dollars on my coffee, and my wife is spending one thousand dollars on something else she's buying. If these happen at almost the same moment, with DynamoDB there is a chance the last writer wins: whichever update carries the most recent timestamp prevails. So you have to be careful when using DynamoDB global tables in this mode, because the last update to a given record will win.
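A tiny simulation makes the lost update concrete. This is not DynamoDB's actual replication code, just the last-writer-wins rule applied to the coffee-versus-shopping scenario above, with assumed field names.

```python
def lww_merge(record_a, record_b):
    """Last-writer-wins conflict resolution: the version with the
    newer timestamp replaces the other, as in eventually consistent
    global table replication."""
    return record_a if record_a["ts"] >= record_b["ts"] else record_b

start = 5000  # shared credit balance stored as a single item

# Region A: a $10 coffee in Vegas, written at t=1.
region_a = {"balance": start - 10, "ts": 1}
# Region B: a $1,000 purchase an instant later, written at t=2.
region_b = {"balance": start - 1000, "ts": 2}

merged = lww_merge(region_a, region_b)
# The $10 debit is silently lost: the merged balance is 4000,
# while the correct balance would be 3990.
```

Designs that store each transaction as its own item (as this architecture does) or use conditional writes sidestep this hazard, because concurrent writers never overwrite the same record.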

Thumbnail 2660

To address that, there is another flavor of DynamoDB called multi-region strong consistency. This means the data is persisted not only in the local region: it is synchronously replicated to the other region as well. Similarly to Aurora DSQL, DynamoDB multi-region strong consistency requires a third region to serve as a witness, so we do need three regions. But this way, we have the guarantee that whenever you store data in one region, it is first persisted to the other side before we return to the user.
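Conceptually, the write is acknowledged only once a quorum of regions has persisted it, with the witness region contributing to that quorum. The sketch below illustrates the quorum idea only; it is not how DynamoDB is implemented internally, and the replica interface is an assumption.

```python
def quorum_write(replicas, item, quorum=2):
    """Conceptual quorum write: acknowledge the caller only after a
    quorum of regions (e.g. 2 of 3, where the third is a witness)
    has durably persisted the item.

    replicas -- callables returning True on a durable write
    """
    acks = sum(1 for write in replicas if write(item))
    if acks >= quorum:
        return "ACKNOWLEDGED"
    raise RuntimeError("write failed: quorum not reached")
```

This also shows the availability trade-off directly: with one of three regions unreachable the write still succeeds, but with two unreachable it must fail rather than return inconsistent data.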

Thumbnail 2700

So what are the benefits of DynamoDB multi-region strong consistency? It allows you to implement strong consistency across regions while still maintaining low latency.

Thumbnail 2710

Thumbnail 2730

DynamoDB is famous for low latency, and it maintains that even with strong consistency across regions: you do pay the round-trip time to replicate the data to the other region and receive the acknowledgment, but the solution remains low latency overall. It also simplifies your application logic: you no longer have to compensate for the last-writer-wins scenario by writing data and then reading it back to verify it persisted. With multi-region strong consistency, you have the guarantee that once you write data, it is durably persisted in DynamoDB.

I started my part here with the CAP theorem, which states that whenever there is a network partition, we must choose between consistency and availability. My initial solution was very simple, using just Aurora Global Database and DynamoDB with multi-region eventual consistency. In that case we favor availability: on the CAP trade-off, availability is high and we tolerate partitions, while consistency is eventual because replication happens asynchronously. With the new solution using DSQL and DynamoDB multi-region strong consistency, I am focused on consistency instead.

Thumbnail 2760

To have stronger consistency, I have to sacrifice some availability. As I mentioned, DSQL and DynamoDB multi-region strong consistency require a third region, because you need quorum across at least two regions for data to be persisted. With these services, availability is reduced whenever there is a network partition, but in exchange we get consistency instead of availability.

Thumbnail 2850

Thumbnail 2870

Practice Like It's Game Day: Final Recommendations and Resources

Implementing resilient multi-region applications is hard, and we have discussed different aspects of it here. DSQL and DynamoDB multi-region strong consistency help with these challenges by replicating data synchronously across regions, allowing you to build a true active-active solution. The trade-off remains: to increase consistency, we accept reduced availability in the case of a network partition.

Thumbnail 2890

Thumbnail 2920

One thing Prem discussed is the technical recovery exercise, which for my friends at Capital One is not really an exercise in the traditional sense: it is really running the show, meaning they take the entire workload and switch it from one region to another. Likewise, my recommendation is to practice everything you do as if it is game day. You don't prepare for a football match alone by just kicking the ball around and running touchdowns. Practice the whole thing, perhaps using AWS Fault Injection Service to help you, but practice your solution as if it is game day.

Thumbnail 2930

Thumbnail 2940

Thumbnail 2950

In terms of the calls to action we have: dependency management is hard, but you have to do it; it is impossible to create any kind of plan if you don't know the dependencies across your components. You have to automate your recovery: don't do manual steps, automate your recovery orchestration. The last thing I want to talk about is selecting your data consistency solution. Decide whether you are going to favor availability or consistency, and if you favor consistency, we have AWS services such as DSQL and DynamoDB multi-region strong consistency that can help you with that.

Thumbnail 2970

Thumbnail 2990

Thumbnail 3000

Thumbnail 3020

We have some white papers available, so feel free to take a picture; they will help you on this journey of building multi-region, resilient applications. There is a meetup this afternoon, and all the resilience experts are going to be there, so feel free to join us at 5 p.m. today for the meetup on resilience. We do have some swag available, but it won't be handed out here because we are supposed to leave this room as soon as we are done. Prem, Dan, and I will be available in the lounge area near the door, so feel free to reach out, share your experience with multi-region and resilient applications, and see how we can take this journey together. I want to thank you for coming to our session, and I hope you have a great day.


; This article is entirely auto-generated using Amazon Bedrock.
