Kazuya
AWS re:Invent 2025 - Multi-Region strong consistency with Amazon DynamoDB global tables (DAT440)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Multi-Region strong consistency with Amazon DynamoDB global tables (DAT440)

In this video, AWS engineers Somu Perianayagam and Amrith Kumar explain DynamoDB Global Tables' two replication modes: asynchronous and Multi-Region Strong Consistency (MRSC). They detail how asynchronous replication offers low latency with eventual consistency, using unidirectional replicators and last-writer-wins conflict resolution based on atomic clocks, while MRSC provides serializable reads through a multi-region journal that requires quorum writes across regions. A key difference is network partition behavior: asynchronous tables stop replicating between isolated regions, while MRSC routes replication through the remaining region. The presentation includes live demos comparing latencies and consistency guarantees, discusses failure modes, and emphasizes testing applications with AWS Fault Injection Service to validate failover scenarios before production deployment.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Multi-Region Data Replication Challenges and DynamoDB Global Tables

Good afternoon everybody. Welcome. This is my first silent session, so if my voice modulation is slightly off, please bear with me. The talk is going to be strongly consistent, and I'm hoping you guys will eventually hear it. So that's good. My name is Somu Perianayagam. I'm an engineer at AWS. I have Amrith Kumar with me, who is a senior principal engineer on Amazon DynamoDB. We're both here to talk to you about the two replication modes of DynamoDB Global Tables, the guarantees they offer, and which to choose for which kind of workload.

Thumbnail 50

Customers today are building systems that need their data close to their end users. Those end users are all over the globe, so the systems are multi-region, and this can be for a variety of reasons: low latency, or compliance and geo-compliance requirements. They also want this data to be highly available. What do I mean by that? They want writes that happen in any region to be available in the other regions as well, simultaneously. And they need business continuity in case of regional service disruptions: if one region goes down, their applications should keep running.

Thumbnail 120

These requirements are making customers and a lot of applications multi-region by default. And this means the database also needs to be multi-region by default. DynamoDB Global Tables was created exactly to support this kind of multi-region architecture. But multi-region, unlike single region, is hard simply because regions are very far apart. They're across oceans, and their latencies are highly variable. That means when you write data in one region, it's going to take some time to propagate to a region far away. For example, a write from Virginia to Tokyo may take a couple of hundred milliseconds, while a write from Virginia to Ohio may take twenty milliseconds. And this is a very different latency profile than a write that is replicated within a region.

Thumbnail 190

And if your architecture allows writes to multiple different copies of this data, then you have to reason about how those writes are ordered, how they are propagated, and in what order they become visible to end customers. These fundamental realities mean there's no single replication model that fits every workload. Hence, DynamoDB Global Tables offers two different replication models for different workloads. As you build these multi-region architectures, you have to make trade-offs. One of the biggest is: do you want your application to be fast, with the predictable performance it would have writing to a single region, or do you want a write done in one region to be immediately visible in the other regions?

Thumbnail 250

Because this communication of writes across regions carries a latency penalty, you have to sacrifice either latency or consistency. And DynamoDB Global Tables has two modes. One is the asynchronous replication mode, which gives you high performance, low latency, and predictable performance, suited for latency-sensitive workloads. The other is Multi-Region Strong Consistency Global Tables, which is a mouthful, so we're going to call it MRSC Global Tables; it is suited for workloads that require strong consistency, where a write in any region needs to be visible in the other regions. The other big trade-off revolves around your Recovery Point Objective and your Recovery Time Objective.

The Recovery Time Objective describes how seamlessly your application can fail over from one region to another, or how quickly it can recover from a failure. DynamoDB Global Tables was designed from the ground up, from first principles, as an active-active architecture, so you don't have to worry about this and can build an application architecture with a Recovery Time Objective of zero. The Recovery Point Objective is an interesting one. It says how much data you can recreate or afford to lose if there is a service disruption or a single-region failure. This drives how quickly, or how strongly, writes have to be replicated if you want a low RPO. Asynchronous and MRSC Global Tables have different RPO characteristics.

And your workloads can now pick and choose which trade-off they want to make. With this, I'm going to hand it over to Amrith to dive deep into the consistency models and asynchronous global table replication.

Thumbnail 330

Understanding Consistency Models in Distributed Databases

Thanks, Somu. Thank you so much. I need a quick show of hands if you're able to hear me or not. Okay, thank you. My name is Amrith. As Somu said, I work on the DynamoDB team. I'm a Senior Principal Engineer. I spend much of my time working with customers, and the thing I like most about that part of my job is I get to see what you're building with DynamoDB.

Thumbnail 350

I want to talk to you a little bit about eventually consistent global tables, and I'll hand it back to Somu who's going to talk to you about the strongly consistent global tables. What we're going to cover are consistency models, use cases for multi-region global tables, and Somu actually, I think, is the brave one here. He's going to show you a live demo, so that's going to be fun.

Thumbnail 380

All right, so let's talk about consistency. When you're building applications with a database, consistency is about whether your data read in multiple places is the same or different. In a single instance database, consistency is relatively easy. You write your data, you can read your data, things are good. DynamoDB, on the other hand, is a distributed database, and your data is stored durably on multiple copies. We'll talk about how many copies and how we do it and so on.

In order for us to scale, it is not possible for us to always give you strong consistency. Applications sometimes need strong consistency. What is strong consistency? You write the data, and everybody should be able to see that data instantly. Eventual consistency is: you write the data, and everybody will get to see that data eventually. In other words, if you do a bunch of writes and then stop, all copies of the data will converge to the same value. That's eventual consistency.

If you're not able to guarantee strong consistency for everybody, it makes it easier to build applications if you're at least able to say: when I write some data, I can read it back. That's read-after-write consistency. And the last one I want to point out is something called monotonic reads. With eventual consistency, I said data converges to one value, but suppose you were to do multiple writes at times one, two, three, four. Monotonic reads basically say time always moves forward. There are database architectures where you can read one, three, two, four. We do not give you that. We always give you monotonic reads.

Thumbnail 480

DynamoDB Architecture and Multi-Region Replication Fundamentals

To understand global tables, you need to understand first something about DynamoDB. So this is a quick overview of DynamoDB. You, the customer over here, this is your application. You connect to DynamoDB over a network. Your request goes through a whole networking infrastructure, load balancers, and all of that stuff, but eventually makes its way to a request router. Your data is stored over here on the right, and this is shared infrastructure used by all of our customers. We have millions of customers, and they all share the same infrastructure.

When your request makes its way to a request router, its job is to figure out where your data is. I want to read an item. On which storage node is your data actually stored? That's its job. In order to do that, it uses metadata. Now, the most important thing for us is to make sure that we only show you the data which you're allowed to see. So we have to verify who you are, authenticate you, authorize you. Then we have rate limiting to make sure that you don't overdrive your database, and then we send it over to a storage node.

For availability and durability, we store three copies of your data on three storage nodes in three different availability zones. When you want to do a write, your write always goes to a storage node leader. I just depicted the leader here with the crown. So with that in mind, let's talk about how an eventually consistent read works.

Thumbnail 570

Thumbnail 600

You send down a request. It comes to a request router. Since it's an eventually consistent read, I can read from any copy. That's one of the reasons why you have to understand eventual consistency in your application, because when you make a GetItem call, you have to realize that you may not be getting the most recent version of data. Your application has to understand that. However, if you do want the most recent version of the data, you will do a strongly consistent read, which is also something which we support. Then we will send your request to the leader. The leader is always part of a write, therefore it always has your most recent data.
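To make the two read modes concrete, here is a minimal Python/boto3 sketch; the table and key names are hypothetical, but `ConsistentRead` is the actual GetItem parameter that selects the strongly consistent path described above.

```python
import boto3

# Hypothetical table and key, for illustration only.
table = boto3.resource("dynamodb", region_name="us-east-1").Table("game-scores")

# Eventually consistent read (the default): may be served by any of the
# three replicas, so it can return slightly stale data.
eventual = table.get_item(Key={"player_id": "p-123"})

# Strongly consistent read: served via the leader storage node, so it
# reflects every write that succeeded before the read.
strong = table.get_item(Key={"player_id": "p-123"}, ConsistentRead=True)

print(eventual.get("Item"), strong.get("Item"))
```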

Thumbnail 620

Everybody okay with this concept so far? Yes, okay. So when you want to go from a single region setup, now this was already three availability zones, but if you go from three availability zones in one region to multi-region, you have to think about a couple more things. First, there's propagation delay between regions. Propagation delay between availability zones is always much shorter than the propagation delay between regions, so this is something which you have to consider.

When you write data to a table, if it's a global table, it's our responsibility to get it over to the other region. But in what order are your writes sent across? The guarantee we provide is that on a per item basis, data will be transferred in the right order. In other words, it is monotonic. You're never going to go back in time. It's monotonic on the other region as well.

Global tables are also active-active. So your application could legitimately be running in multiple regions. We have global tables which are in 30 regions. We're responsible for the replication. Your application can be writing anywhere. But if multiple parts, multiple locations were to modify the same item, what is the item finally going to converge to? We use timestamps for conflict resolution. So we'll talk about that as well. And finally, we're going to talk about failure scenarios, various different failure scenarios and how we deal with them.

Thumbnail 700

How Eventually Consistent Global Tables Replicate Data

So let's talk about actually the way in which replication works. I have here a picture of a global table. I'm showing two copies. We'll get to multiple copies as well. So in US East 1, I have a table copy, and in another region, whichever region it is, I have a replica. Whenever you make a change to a table, DynamoDB will first perform the write locally. And to perform a write locally, it means it has to be written in two availability zones. Once the write is written durably in this region, it goes to a replicator.

Thumbnail 740

And the replicator is going to read from streams; this is normally the symbol for streams. The replicator's job is to read from streams and propagate the writes to the other region. So the sequence is: you perform a write in region one. Once the write is durably written, it goes to streams. The replicator polls the stream and writes to the other region. Fairly straightforward, correct?

Thumbnail 770

This is two regions. What do you do if you have multiple regions? One possible architecture is to have one replicator for multiple regions. We don't. We have one replicator per region. And the reason we do that is because we want each path to be its own individual failure domain. Let's assume that US East 1 has a problem writing to this region, whatever it is. That should not have any impact on this path. Therefore, your write comes in to US East 1 where you perform the write. It goes through streams. Now you have multiple replicators, one per target region, performing the replication into the other.

The exact same scenario can be played in the opposite direction as well. US East 1 becomes whatever the remote region is, and it has a replicator in that direction. Replicators are unidirectional, so from US East 1 to destination region, replicator in the opposite direction, another replicator. And this is the reason why these are eventually consistent. You perform the write here. The write is durably recorded. If you do a strongly consistent read in this region, you're going to see that data. There is a propagation delay before it makes its way over there, but it will eventually get there. That's the eventually consistent part of global tables.
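The replicator itself is managed infrastructure internal to DynamoDB, but conceptually it behaves like a consumer of DynamoDB Streams. Here is a rough, hypothetical sketch of that loop in boto3, assuming the source table's stream is configured with NEW_IMAGE, and ignoring the checkpointing, multi-shard handling, and error recovery a real replicator needs:

```python
import boto3

# Conceptual sketch only: the real replicator is internal to DynamoDB.
streams = boto3.client("dynamodbstreams", region_name="us-east-1")
remote = boto3.client("dynamodb", region_name="ap-northeast-1")

# Hypothetical stream ARN of the source table.
stream_arn = "arn:aws:dynamodb:us-east-1:111122223333:table/my-table/stream/2025-01-01T00:00:00.000"

shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
iterator = streams.get_shard_iterator(
    StreamArn=stream_arn,
    ShardId=shard["ShardId"],
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    page = streams.get_records(ShardIterator=iterator)
    for record in page["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is already in the low-level attribute-value format,
            # so it can be re-applied in the remote region as-is.
            remote.put_item(TableName="my-table", Item=record["dynamodb"]["NewImage"])
        elif record["eventName"] == "REMOVE":
            remote.delete_item(TableName="my-table", Key=record["dynamodb"]["Keys"])
    iterator = page.get("NextShardIterator")
```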

Thumbnail 850

Timing, Ordering, and Conflict Resolution in Asynchronous Replication

So let's assume you have a table. It has N regions. Like I said, each region to another region is a pairwise replication mechanism. This arrow just shows that there's a replicator in one direction, a replicator in the other direction. So what happens if you have various failures? Oh, let's first talk about timing diagrams. Okay.

Thumbnail 870

Thumbnail 890

Thumbnail 900

Time progresses forward. We have four regions here in this picture. I perform a write in region one. That write is durably recorded in region one, and it's then replicated to region two. Everybody good with this representation so far? Yes, okay. That replication makes its way over to region three. That replication makes its way over to region two, but at a different time. Each of these is an independent replication stream. Of course, there's one more here.

Thumbnail 910

Thumbnail 920

The important thing to realize is that all these replications happen at different times. They're all independent. Now, let's assume that I do a write. The write is successful, and I got a 200 on the write. I perform a consistent read. Whatever I wrote will be visible in that consistent read.

Thumbnail 930

What happens here? I did the write. The replicator has the data, but the write has only happened here. A consistent read over here will not show you the data. Remember that strongly consistent reads in eventually consistent global tables only reflect those writes which are in that region. A strongly consistent read in this region will reflect this write. A strongly consistent read here may not reflect that write. Once the write is replicated, it will be seen. When you're building your application, realize that just because you do a strongly consistent read on an eventually consistent global table, it does not guarantee that you've got the latest data.

Thumbnail 990

What happens about ordering? I told you that ordering is on a per key basis. Let's assume that these are two different items. On the first item, the initial state is 10. On another item, the initial state is 200. I perform a write where A equals 11. This item has been modified. At a later time, I write B equals 201. So these are two writes which occur in this order. This replication comes over here. This replication happens at a completely different time there.

For a given item, A or B, we guarantee that the replications arrive in order. But across items, we cannot guarantee the same thing. The replication streams are independent. Therefore, this replication happens here, and the other replication happens there. Notice this write happens after this write. On a per-item basis, we can guarantee monotonic reads. Across items, across keys, we cannot.

Thumbnail 1070

And the last thing we'll talk about is conflict resolution. If you have an active-active system and in multiple regions you modify the same item, which one is going to survive? It's the last writer. So all of our regions share atomic clocks. We know exactly what time you're performing the write. You do a write in two regions. No matter how close you think they are, one is going to be before the other. The one which is later will always win.

Thumbnail 1100

So how do we go about doing that? Let's assume that this is an item which we're accepting a write for. I'm going to record as part of the write that it came in the IAD region at a timestamp of 10. I'm going to propagate it over along with the timestamp. If there is a write in the other region at timestamp 11, this write is going to be ignored because this is at an earlier timestamp. Last writer wins. And of course, if there's a write at timestamp 11, it'll propagate in the opposite direction. Therefore, both regions will converge to timestamp 11.
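As an illustrative sketch (not DynamoDB's internal code), last-writer-wins reduces to a single comparison on the write's timestamp. The region name as a tie-breaker is my assumption here, added only to make the ordering total:

```python
from dataclasses import dataclass


@dataclass
class VersionedItem:
    value: dict
    timestamp: int   # taken from the shared atomic clocks
    region: str      # origin region; assumed tie-breaker for identical timestamps


def apply_replicated_write(local: VersionedItem, incoming: VersionedItem) -> VersionedItem:
    """Last-writer-wins: keep whichever write carries the later timestamp.

    A replicated write with an older timestamp is dropped, which is also why
    it never produces a Streams notification in the destination region.
    """
    if (incoming.timestamp, incoming.region) > (local.timestamp, local.region):
        return incoming
    return local


# A write at t=10 in IAD replicated into a region that already wrote at t=11:
local = VersionedItem({"price": 100}, timestamp=11, region="PDX")
incoming = VersionedItem({"price": 99}, timestamp=10, region="IAD")
assert apply_replicated_write(local, incoming) is local  # the t=10 write is ignored
```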

Thumbnail 1140

So, conflict resolution using a timing diagram. At approximately the same time, the write is done in two regions. Both of them are going to replicate forward. This time here, 20 and 30, indicates the time at which the write happened. Since this write says 20, a subsequent write which says timestamp 30 will overwrite. Therefore, in region one, you do the write at time 20. A subsequent replication of the write at 30 is going to overwrite it.

Similarly, replication into this region where there were no writes: R1 happens first. This write comes in first. This one comes in second. It is going to converge here. Take a look at what happens here. You performed a write at time 30. The replicator sent down something at time 20. We dropped it. Same thing here. The replicator came down here with 20. It dropped it. An important consequence of this is that in region one, if you're building an application using DynamoDB Streams, you will see a DynamoDB Streams notification for this write and a Streams notification for this one. In this region, you will see a notification for 20 and then a notification for 30. In this region, you will only see a notification for 30. You will never see this notification. So if you're building a Streams application, realize the consequence of having eventual consistency in your global table.

Thumbnail 1240

Failure Modes and Streams Behavior in Eventually Consistent Global Tables

And the last thing we'll talk about is failure modes. A three-region global table: US East 1, US East 2, and US West 2. For whatever reason, US East 2 is completely offline. Nothing happens to the replication between US East 1 and US West 2. That continues just fine. You can write data here, it just doesn't get propagated over there. You can write data here, it gets over to US East 1, it doesn't go over to US East 2, that's all. At some point, when US East 2 comes back online, we do the catch-up and the data in all three regions will again converge, last writer wins.

Thumbnail 1280

Okay, slightly different scenario. US East 2 is perfectly running here. There's a network partition. Any write in US East 1 will make its way over to US West 2. Any write in US East 2 can't make it over. So in this condition, all three tables are active. This table can serve reads and writes. It is just going to be serving data which is local. When the network links are re-established, data will come over, again we will converge to the last writer wins.

Thumbnail 1320

And the last one I'll talk about is this particular failure mode where one network link breaks. I want you to listen to this one carefully, because this is where eventually consistent global tables differ from strongly consistent global tables, as Somu will get to. Any write in US East 1 will be propagated to US West 2. Any write in US East 2 will be propagated to US West 2. Any write in US West 2 will be propagated to both. A write in US East 1 will not go to US East 2; we do not relay. A write in US East 2 will not go to US East 1; we don't relay the write. Therefore, a write here goes to both, while a write here does not make its way over there. When the network link is re-established, replication will resume. Keep that in mind, because it's different in strongly consistent global tables.

Thumbnail 1370

So I mentioned Streams earlier on the timing diagram. Remember that your stream will only reflect those writes which actually happen. If a replicated write arrives stale, you will not get a Streams notification for it. You will get notifications for all writes which are actually written: every write in the local region, but not necessarily every replicated write.

Okay, if you were to perform a transaction in a region, we guarantee ACID transactions in the region. Remember that replication is on a per-item basis. Therefore, it's possible for you to see a torn transaction in the remote region. If you're building an application which needs transactions, remember that eventually consistent global tables may show you a torn transaction in a remote region.
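For example, with the low-level boto3 client (table and item names hypothetical), both puts below commit atomically in the local region, yet each item still replicates to the other regions independently:

```python
import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Both writes commit atomically in us-east-1, but each item replicates
# independently, so a reader in another region may briefly see one item
# updated and not the other: a "torn" transaction.
client.transact_write_items(
    TransactItems=[
        {"Put": {"TableName": "accounts",  # hypothetical table
                 "Item": {"id": {"S": "a1"}, "balance": {"N": "90"}}}},
        {"Put": {"TableName": "accounts",
                 "Item": {"id": {"S": "a2"}, "balance": {"N": "110"}}}},
    ]
)
```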

Thumbnail 1440

Introducing Multi-Region Strong Consistency: The Journal-Based Approach

Okay, with that, I'm going to hand it back to Somu, and he'll talk to you about strongly consistent global tables. Thanks, Amrith. Thanks for the deep dive into asynchronous global tables and how the replication works. Some of the slides were eventually consistent, but that's okay. So we built asynchronous global tables; a lot of customers loved it for their workloads, and it gained a great deal of adoption. But customers still wanted to read their writes from any region.

Thumbnail 1460

Some applications got really smart about this. They said: let's take asynchronous global tables, assign a single region as the designated leader and writer, and read and write from that region. This is pretty much like my household. My wife is the designated leader; everything goes through my wife in the house, right? It works fine. But the latency profile of the reads and writes depends on how far apart the regions are, and it's different for every region. And strongly consistent reads have to go back to the designated region.

But if my wife is not at home, now my kids and I have to reconcile things and figure out who's right, and that becomes a problem. Once the designated leader region is unavailable because of a network partition or something else, you have to figure out whether the latest write from that region has reached the other regions or not.

Thumbnail 1520

The regions have to reconcile this, and you have to build a failover protocol into the application to elect a new designated region, which becomes the leader, writer, and reader for all writes and strongly consistent reads. Similarly, some other applications said: you know what, we're going to get around this whole write problem and the higher latencies. We're going to write locally, and because strongly consistent reads are a small part of our application, we'll read from all the regions and see which region has the latest write. This works, but the strongly consistent reads are going to be very slow, as slow as the region which is farthest away from you, because you have to read from that region too. And if that region is not available, you don't know how to reconcile the reads you have, so this becomes problematic.

Thumbnail 1560

Thumbnail 1600

To take away this undifferentiated heavy lifting from customers, we built this capability into DynamoDB Global Tables. We built multi-region strong consistency DynamoDB global tables, which solves the read-write problem, makes writes highly available even in case of region failures, and makes strongly consistent reads serializable at an item level with all the writes that have happened to the item. So how do my wife and I solve this problem? We have a joint calendar that we share and take notes in. And we built a basic primitive, a journal, which spans the same regions as the replicas of the global table. This journal is an append-only log, and an append succeeds only if the journal is able to durably commit and store the log entry in two out of the three regions.
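A minimal sketch of that quorum rule, with `store_in_region` as a hypothetical stand-in for the journal's per-region durable storage call:

```python
import concurrent.futures

REGIONS = ["us-east-1", "us-east-2", "us-west-2"]
QUORUM = 2  # the journal needs 2 of 3 regions to durably store an entry


def append_to_journal(entry: dict, store_in_region) -> bool:
    """Illustrative quorum append: succeed once two regions have durably
    stored the log entry. `store_in_region(region, entry)` is a stand-in
    for the journal's per-region storage call."""
    acks = 0
    with concurrent.futures.ThreadPoolExecutor(len(REGIONS)) as pool:
        futures = [pool.submit(store_in_region, r, entry) for r in REGIONS]
        for f in concurrent.futures.as_completed(futures):
            try:
                if f.result():
                    acks += 1
                    if acks >= QUORUM:
                        return True   # durably committed
            except Exception:
                pass  # a slow or failed region doesn't block the quorum
    return False
```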

Thumbnail 1640

On top of that, all the log entries from the journal are delivered to clients in the order the journal appended them. The journal can have multiple clients: you have three different regions writing to the journal, and you want the clients listening to the journal to receive the log entries in the order the journal sees them, not in the order the clients submitted them. So the journal provides an interface where all clients receive log entries in the order the journal appended them.

Thumbnail 1650

Write and Read Operations in MRSC Global Tables

Now, how does a write work for a multi-region strongly consistent global table? As soon as the write request arrives, it is forwarded to a global table replication engine. But this replication engine is different from the asynchronous one: it sits inside DynamoDB, as opposed to the other one, which waits on streams and replicates asynchronously. This replication engine has two key functions. The first is generating replication log entries and appending them to a multi-region journal. The second is listening to the multi-region journal and applying all the log entries in the order the journal appended them.

Thumbnail 1690

Thumbnail 1700

Thumbnail 1710

Thumbnail 1720

So the replication engine reads from the local DynamoDB table to find the current state of the item being written, generates a log entry, and appends it to the multi-region journal. Once the entry is successfully appended, the journal issues a callback saying, hey, I have a new log entry for you, and it delivers the entry to the replication engines in all three regions. All three regions then apply the log entry. They can apply it whenever they get to it, but the region in which the write originated applies it immediately and returns success or failure to the customer.

Now, this is the same request flow, but laid out on a timeline to keep pace with the slides. The write happens, the log entry is appended, and then you get a callback in each of the regions. You can see that the application of the log entry can happen at different times in different regions, but the region in which the write originated returns the 200 once it applies that log entry.

Thumbnail 1750

Now what happens if there are concurrent writes to different items from different regions? In this example, we have two regions having two writes to two different items. It might seem that write A arrives at region one before write B. But since write B is appended in the journal before write A, write B is applied by all the regions before write A is applied. So the journal allows you to serialize these two writes. The journal serializes the writes for you in the order in which it appends the log entries.

Thumbnail 1790

Now, how do strongly consistent reads work for multi-region strongly consistent global tables? The key thing is that when a strongly consistent read arrives in a particular region, that region must have all the writes; it cannot miss any writes from any of the regions.

Thumbnail 1810

Thumbnail 1840

And the only way for that read to know there are no unapplied writes is to ensure it has applied everything from the journal. How do we ensure that everything from the journal is applied before the read is served? We use a read fence, or tracer bullet, or heartbeat, whatever you want to call it. It's just another log entry. This log entry is appended to the multi-region journal, and then we listen for it. While listening for the heartbeat, we also receive all the other log entries we have not yet applied, so we apply them one by one until we reach the heartbeat.

Thumbnail 1850

Thumbnail 1880

And once the heartbeat arrives, we can be sure of two things: we've seen all the writes from all the regions, because the heartbeat is serialized along with the other writes in the journal, and we've applied all the writes before the heartbeat. So it acts as a read fence, and we can correctly say that this strongly consistent read will return the latest version of the item, no matter which region the item was written in. The protocol is the same even if you have a single-region write and you're doing a strongly consistent read from that same region, because that ensures correctness and avoids bimodal behavior.
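Here is a simplified sketch of the read-fence idea, where `journal` and `local_replica` are hypothetical stand-ins for the multi-region journal and the region's replication engine:

```python
import uuid


def strongly_consistent_read(journal, local_replica, key):
    """Illustrative read fence: append a heartbeat entry, then apply every
    journal entry up to and including it before serving the read."""
    fence_id = str(uuid.uuid4())
    journal.append({"type": "heartbeat", "id": fence_id})

    # Apply everything the journal serialized before our heartbeat.
    for entry in journal.read_from(local_replica.checkpoint):
        if entry.get("type") == "heartbeat" and entry.get("id") == fence_id:
            break
        local_replica.apply(entry)

    # All writes ordered before the fence are now applied locally.
    return local_replica.get(key)
```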

Thumbnail 1910

Witness Regions and Handling Write Conflicts with Idempotent Logs

But the thing to keep in mind is that since the heartbeat also goes through the journal and is appended to it, you pay the write penalty: the latency of strongly consistent reads is almost equal to the latency of writes. Now, last year we had multi-region strongly consistent tables in preview, and at that point it launched as a global table with three full replicas. We got some amazing feedback from you saying: hey, two regions is more than sufficient for us. A third region is not necessary, because we incur the operational cost and the cost of having a DynamoDB table in a third region.

So before we went to GA, we launched witness regions. For the multi-region journal, we need at least three regions, because the journal has to append to two regions to say a log entry is durably committed, and if there's a region disruption, you still need two healthy regions. So we need at least three regions for a multi-region journal. A witness region means your multi-region journal still spans three regions, but you don't have to have a DynamoDB table in the witness region, so you have no table resources and no operational cost to maintain there. Multi-Region Strong Consistency global tables support two full replicas plus a witness region.
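For reference, converting a single-region table into an MRSC global table with a witness looks roughly like the UpdateTable call below. This is a sketch based on my understanding of the API at MRSC GA; the `MultiRegionConsistency` and `GlobalTableWitnessUpdates` parameters should be verified against your current SDK documentation before use.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Hypothetical table name. Creates one full replica (us-east-2) and one
# witness (us-west-2), with strong multi-region consistency.
dynamodb.update_table(
    TableName="my-table",
    ReplicaUpdates=[{"Create": {"RegionName": "us-east-2"}}],
    GlobalTableWitnessUpdates=[{"Create": {"RegionName": "us-west-2"}}],
    MultiRegionConsistency="STRONG",
)
```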

Thumbnail 1980

Thumbnail 2000

We did talk about how asynchronous global tables resolve concurrent writes. For Multi-Region Strong Consistency global tables, resolving concurrent writes is easy, because everything goes through the multi-region journal: if there are two writes, they're written to the same journal and get serialized. But the important point is: what exactly do we write to the journal? What is the replication log entry, and why do we shape it the way we do? We write an idempotent replication log entry into the journal.

The reason is that we want each entry to be applied at most once, no matter what, because a global tables replicator can read a replication log entry, apply it, and then fail or crash, and on restart it resumes from its previous journal checkpoint and may see the same entries again. So we make the log entries idempotent. How do we do that? Like everything in DynamoDB, which supports optimistic concurrency control, we take any request that comes in, read the current item, and convert the request into a conditional insert or conditional delete.

Thumbnail 2050

For example, here I'm incrementing the counter of the number of attendees for re:Invent 2025. When this request comes into a region, the engine reads the current state of the item and generates an insert log entry. This insert log entry is conditioned on the timestamp we store as system metadata on every global table item. It says: update the item's value from six to seven if the timestamp matches the previous timestamp on the item, and also set the system metadata to the new timestamp.
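To illustrate the shape of such an idempotent, conditional entry, here is a hedged boto3 sketch. The `version_ts` attribute is a hypothetical stand-in for DynamoDB's internal per-item timestamp metadata, which is not exposed under that name:

```python
import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("counters")

# Read the current state of the item (attendees = 6 in the talk's example).
current = table.get_item(Key={"pk": "reinvent-2025"}, ConsistentRead=True)["Item"]

# Convert the increment into a conditional put: it only applies if the
# item's timestamp still matches what we read, making a replay a no-op.
table.put_item(
    Item={
        "pk": "reinvent-2025",
        "attendees": current["attendees"] + 1,  # 6 -> 7
        "version_ts": 1733264000123,            # hypothetical new timestamp
    },
    ConditionExpression="version_ts = :prev",
    ExpressionAttributeValues={":prev": current["version_ts"]},
)
# If another region's entry was serialized first, this condition fails, and
# the entry is skipped (or surfaced as a retryable replicated-write conflict).
```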

Thumbnail 2090

Thumbnail 2100

Now let's see how this works. I have two increment calls from two different regions. Both of them generate replication log entries which look almost identical, except that each carries a different new timestamp for the item. Each region gets a callback for each log entry, in order. Region R2's log entry is serialized first, so it is applied first, and you get a success back in region R2.

Great. What happens when both regions process the second log entry? It cannot be applied, because it's a PutItem with a condition, and the condition no longer matches. R2 simply skips the log entry. R1 returns a failure to the application saying, hey, there was a replicated write conflict, please retry this write. Usually the SDK is smart enough to retry, because it's a retryable exception, so for most applications the SDK handles the retry.

Thumbnail 2110

Thumbnail 2120

Thumbnail 2150

Failure Scenarios, Testing, and Building Resilience in MRSC Global Tables

Multi-Region Strong Consistency global tables and failures. We went through strongly consistent reads and how they work, and we talked about write conflicts and how we handle them. The other key thing is understanding the failure characteristics: what happens in case of failures, and how these tables behave. If a region is completely isolated, that region is not going to be able to serve any writes or strongly consistent reads, because it can't write to the journal. Interestingly enough, before the partition, US East 1 would have had very good latency for writes and strongly consistent reads, because its closest neighbor is US East 2: it only needs US East 2 and itself to acknowledge any write quickly.

Thumbnail 2170

Thumbnail 2220

Thumbnail 2240

Thumbnail 2250

Thumbnail 2260

But once US East 2 is isolated, US East 1 has to pair up with US West 2 for all the journal commits, which means the latency of the application increases when your closest region is completely isolated. Now, if there is a network partition between US East 1 and US East 2, this is the most interesting failure scenario. With asynchronous global tables, a network partition between US East 1 and US East 2 stops the replication between them, because it's point-to-point, bidirectional replication. But Multi-Region Strong Consistency global tables keep working, because in steady state US East 1 is replicating logs to both US East 2 and US West 2 all the time. If there is a network partition between US East 1 and US East 2, those replication logs just take a longer route: they travel through US West 2 to reach US East 2, and replication logs from US East 2 to US East 1 likewise travel via US West 2. Your application latencies will go up because the replication logs take a longer path, but all regions remain available for strongly consistent reads and writes. This is a key difference between asynchronous and Multi-Region Strong Consistency global tables.

Thumbnail 2280

Now, Amrith was talking about streams and transactions. With Multi-Region Strong Consistency global tables, all writes to a given key are visible in the same order in all regions, and that holds for streams as well. Multi-Region Strong Consistency global tables don't yet support transactions or Time to Live. These are two key things to keep in mind when you're considering Multi-Region Strong Consistency global tables for your workloads.

Thumbnail 2310

Now this is just a recap of regional failures and how the two kinds of tables differ. In case of a regional failure, both kinds of global tables remain highly available in the remaining healthy regions. In case of network isolation, asynchronous global tables remain highly available in all regions; reads in the isolated region will be stale, but writes will still be accepted. With Multi-Region Strong Consistency global tables, reads and writes won't be available in the isolated region, but the other regions remain highly available. Under a network partition, as I said, the two kinds of global tables behave slightly differently.

Thumbnail 2360

Now, one of the questions I get asked everywhere I go is: hey, how do you gain confidence in this protocol? This is a complicated system. All distributed systems are hard; the human mind can't reason about all their interleavings. So at DynamoDB we take this very seriously and use a lot of tools to help us reason about and check our system invariants all the time. For global tables, we use the P model checker, which allows you to model the states and transitions of an item and the protocol, and validate them, including at runtime: as the global tables protocol runs, we can generate application logs and validate them using P Observe. And since P uses a language which is very close to C, it's relatively easy to keep the code and the model in sync.

Thumbnail 2410

When you want to change the code, the practice is usually to change the model first, verify the protocol still works, and then modify the code. A big part of building resiliency, as my friends across the stage were talking about, involves chaos and scale testing. There's a lot of testing which goes into all the failure modes and how global tables behave in failure mode. So we do a lot of chaos testing and scale testing to ensure the protocols still work and validate the invariants of the system.

Thumbnail 2440

Last but not least, anti-entropy. Across all global table replicas, we continuously perform anti-entropy to ensure there is no divergence: item writes converge eventually in the case of asynchronous global tables, and stay converged in the case of Multi-Region Strong Consistency global tables. We do this even for single-region DynamoDB, where we have three copies of the data across different availability zones: a system continuously performs anti-entropy to verify the replicas hold the same data.

Thumbnail 2470

We have also introduced an experiment in AWS Fault Injection Service, which customers can use to test how their applications behave if there is a failure with global tables. I think this is very important: we've tested a lot of this, but it is customers who need to test their failover and see how their application behaves when there's a failure. We support paused replication for DynamoDB global tables, which lets you test what happens when replication is paused for asynchronous global tables.
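In practice, you would create an FIS experiment template with the pause-replication action (via the console or infrastructure as code) and then start it from your test harness. A minimal sketch with a hypothetical template ID:

```python
import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Start a pre-created Fault Injection Service experiment template (the ID
# here is hypothetical) that pauses global table replication, then observe
# how your application behaves while replication is paused.
experiment = fis.start_experiment(experimentTemplateId="EXT123456789012")
print(experiment["experiment"]["state"])
```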

Thumbnail 2500

Choosing Between Replication Models: Comparison and Live Demonstrations

A couple of things which Somu talked about. First, we currently don't support transactions with strongly consistent global tables; we are working on that. If you use TTL (Time to Live), it is supported with eventually consistent global tables and will also be supported with strongly consistent global tables before long. So keep those two things in mind.

Thumbnail 2530

When you're building your application, one of the choices you'll have to make is whether to use eventually consistent or strongly consistent global tables. These are some of the things to keep in mind. With asynchronous global tables, there is no cross-region cost on the write path: your writes are always in the local region, and your reads are always in the local region as well. With strongly consistent global tables, even your strongly consistent reads effectively have to go across regions; that's the heartbeat Somu was talking about. So remember that a strongly consistent read has approximately the same latency as a write.

Regarding availability, both are going to be highly available, but your data is instantly visible in other regions with a strongly consistent global table, and only eventually visible with an asynchronous one. Also remember that the network isolation failure modes are a little different; this is the one which is most important for your application to handle. And I strongly urge you, when you are building an application with either form of global tables, to use the fault injection mechanism and see how your application deals with the failure of replication. This is not something you should discover in the middle of an event; test it on a regular basis.

The other thing you should do if you are building an active-active application: it's really good if you're driving traffic to both regions. But if you are building an application that is active with a standby region, please try to drive traffic to the other region as well, ten percent, ninety percent, or do failovers on a regular basis. Don't make those things something you exercise only during a failure; test continuously. And the last row covers the kinds of workloads you would expect to use each form of global table for.

Thumbnail 2640

Thumbnail 2660

Thumbnail 2670

Thumbnail 2680

Thumbnail 2690

Thumbnail 2700

So I've reset the counters. I'm choosing us-east-1 as the region where I'm going to do the writes, and I'm going to increment the counter five times in a loop. While I'm incrementing the counter, I'm going to do strongly consistent reads from the two other replicas, one in us-east-2 and one in us-west-2, and see what the experiment shows. As you can see, in us-east-1 the value went zero, one, two, three, four, five; those are local writes, which is fine. In us-west-2, it's very interesting: when it started reading, strongly consistent reads were for quite some time giving the same value as eventually consistent reads, and over ten iterations it read zero and then skipped directly to five. So you can see the asynchronous replication at work; the value eventually comes over to us-west-2. Likewise, since us-east-2 is nearby, it does see some of the intermediate values, like zero and three.

Thumbnail 2710

But one of the key things to notice here is the latency. The latency for both strongly consistent and eventually consistent reads is very low. This is running on my laptop from Vegas, so it's about 80 milliseconds to each region, but both reads are really, really fast because they're served locally by the region; these are not the strongly consistent reads of multi-region strongly consistent global tables.
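Here is a rough, hypothetical reconstruction of what a demo script like this does, with made-up table and key names; `update_item` with `ADD` performs the counter increments, and each replica is then read in both modes with the latency timed. The same script can be pointed at either an asynchronous or an MRSC table:

```python
import time

import boto3

REGIONS = ["us-east-1", "us-east-2", "us-west-2"]
tables = {r: boto3.resource("dynamodb", region_name=r).Table("demo-counter")
          for r in REGIONS}  # hypothetical table name

# Writer: increment the counter five times in us-east-1.
for _ in range(5):
    tables["us-east-1"].update_item(
        Key={"pk": "counter"},
        UpdateExpression="ADD #v :one",
        ExpressionAttributeNames={"#v": "value"},  # "value" is a reserved word
        ExpressionAttributeValues={":one": 1},
    )

# Readers: compare both read modes in every replica and time them.
for region, table in tables.items():
    start = time.perf_counter()
    strong = table.get_item(Key={"pk": "counter"}, ConsistentRead=True)
    eventual = table.get_item(Key={"pk": "counter"})
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(region, strong["Item"]["value"], eventual["Item"]["value"],
          f"{elapsed_ms:.0f} ms")
```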

Thumbnail 2730

Thumbnail 2740

Thumbnail 2750

I'm going to run the same experiment with multi-region strongly consistent global tables, and let's see what we get. First, you can see the writes took almost the same time as before; writing to one region is fine. But you can see that strongly consistent reads and eventually consistent reads now start seeing the same values as they are incremented in us-east-1. Also, a key thing to notice is that the latencies for the strongly consistent reads have gone up, from about 70 milliseconds to about 200 milliseconds, because, like I said, strongly consistent reads on multi-region strongly consistent tables do write to the journal. That latency adds up when you're doing strongly consistent reads.

Thumbnail 2780

Likewise, you can see that they finally see most of the values throughout. In some cases, you can see, for example, here, the strongly consistent read in us-west-2 does see the value of four, but the eventually consistent value sees three because it is reading the local copy of the data, which may be stale. But the strongly consistent read is reading whatever is the latest write which has happened.

Thumbnail 2800

Thumbnail 2810

Thumbnail 2820

The second example I'm going to show you is a multi-writer example, where we increment the counter across all the regions simultaneously. I'm going to have three different threads incrementing the counter against the three different replicas. We'll see how this works with asynchronous global tables first. Run this experiment, and you can see that each region incremented the counter five times, but they all just converged to the same value, rather than the sum of all fifteen increments, because the regions aren't coordinating with each other: the writes are not propagated before the next local write happens, so the concurrent increments overwrite each other under last writer wins. All three replicas end up with the same value. This works for applications where you really don't need the writes coordinated across regions and can maintain separate copies of the values. And you can see the latencies are very fast for the writes, almost as good as the round trip from Vegas to each of these regions.

Thumbnail 2860

Thumbnail 2870

We're going to run the same experiment with multi-region strongly consistent global tables and see what happens. Since each of the three regions increments the counter five times, you end up with a final value of 15, and you can see the values counting up: zero, one, two, three, four. But for some writes, the time taken is very high. For example, in us-west-2, the first write, where the value goes from five to six, took about half a second. If I were to go debug this, in all probability it's because it got a replicated write conflict the first time it tried to increment the counter, so it had to retry the increment in that region. The retries do add up and you can see the latency go up, but all three regions eventually converge to the same value of 15 after incrementing it five times each. So that was the demo. Amrith, do you want to hit that switch for me again?

Thumbnail 2930

Key Takeaways: Testing, Trade-offs, and Best Practices for Multi-Region Applications

All right. So we talked about how you make a choice between strongly consistent global tables and asynchronous global tables, and let me quickly recap. Customers building applications sometimes need data available in multiple regions, whether for distributed applications or for redundancy in the event of a regional failure. Some customers can tolerate a non-zero RPO and are willing to replay that data themselves. Others who cannot will probably want to use strongly consistent global tables.

There are multiple trade-offs when you go from one region to multiple regions, and from asynchronous to synchronous multi-region. One of the things we've realized is that building applications at scale is typically much easier if you're able to embrace eventual consistency. If you require everything to be done synchronously, you are at the mercy of long network lags. If you can tolerate eventual consistency, you can do everything at inter-AZ, intra-region costs.

How does DynamoDB behave in single region versus multiple regions? And the most important thing is how to choose the right model for your application. But no matter which one you choose, the one thing which I hope you take away is this: test your application in all the possible failure states. We do a lot of that. We use P model checkers. For those of you who are not familiar with P, it's a much more recent version of what you probably use, which is TLA+. It is extremely useful to verify your algorithms are actually going to work the way you expect them to when there is a failure.

Thumbnail 3040

Thank you very much for your time. We enjoy doing these presentations, and we'd love to get your feedback, so please let us know what you want to hear next time and we'll try to do that. Please leave your feedback in the mobile app. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
