🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - An insider’s look into architecture choices for Amazon DynamoDB (DAT436)
In this video, AWS engineers Amrith and Craig explain DynamoDB's architecture and design decisions guided by four core tenets: security, durability, availability, and predictable low latency. They detail how DynamoDB achieves hundreds of millions of requests per second through distributed systems with groups of three replicas across availability zones, sophisticated request routing using DNS and load balancers for cross-AZ optimization, and a two-tier eventually consistent caching system with MemDS that uses versioning, hedging, and constant work patterns to handle cache misses. The talk covers partition metadata management, transaction limits (100 items), item size restrictions (400KB), and explains how limits like 1000 WCUs and 3000 RCUs per partition ensure predictable performance on shared infrastructure.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: DynamoDB's Guiding Tenets and Team Philosophy
This talk is about Amazon DynamoDB. I'm going to start with this one, which is probably a better talk. You guys didn't want to give me a round of applause? Come on, let's start. I did not expect this slide here. Do you have any idea what the slide is about? I put this up. This is probably an indication for you how the rest of the talk is going to go.
Craig and I have worked together for several years. My name is Amrith, and I'm the guy on the right here. We go back and forth quite a bit, and that's what you're going to see during this talk. Both of us work on DynamoDB. He's Craig, and I'm Amrith. What I've found in the last six or so years that I've worked on DynamoDB at AWS is that customers who build applications on DynamoDB need to understand not only how the service works, but also how we made some of the decisions about how the service works.
This talk is more about how we made those decisions. The exact decisions we made may not be relevant to you, but what I hope you will take away from this is how we come to those decisions, because that may be more applicable to you as well. The way in which a large organization builds a service like DynamoDB is important to understand. If you have a large team and you need everybody on the team to be productive at their maximum level, you cannot have centralized decision making. We build a distributed database, and the idea is you want to distribute things across many nodes and have the power of many nodes. What's the point in having a team which has one decision maker? There is no point.
We have tenets in DynamoDB, and they give the team a shared understanding so that we can push decision making down into the organization. This is something which I think you should understand. This is how we work at Amazon, and this is something which is common to all AWS services. It helps teams make decisions, and these are our tenets. First, security. Second, durability. Third, availability. These are non-negotiable. I don't care what the feature or the implementation choice is. I will never trade security in order to give you something else. The fourth one down there is predictable low latency. This is a DynamoDB tenet. We want to guarantee predictable low latency at any scale, but we will never trade one of the first three for it.
In the rest of this talk, what you're going to hear is a reiteration of these tenets: security, durability, availability, and predictable low latency. These need not be the tenets in your organization, but as a decision maker in your organization, the thing I would urge you to do is think about what these tenets are. Make sure that everybody in the team has a shared understanding of these tenets so that when they're making a decision, they will understand what these tenets are. In the rest of this talk, we're going to talk about this over and over again, but many of these choices are going to come down to predictable low latency because, like I said, security, durability, and availability are non-negotiable items.
Since we're talking about predictable low latency, just a show of hands: how many of you here use DynamoDB, and how many of you have never used DynamoDB? Never used DynamoDB, show of hands. All right, very small number. The rest of you understand that we try to give you predictable single-digit millisecond latency at any scale. So with that, let's talk about this. All right, you've talked enough, right? This is how it usually works with Craig. You want to take it over, man? Go ahead.
The Challenges of Distributed State Management at Scale
We're talking about scaling, right? I want to give you a sense of the scale of DynamoDB. Five thousand requests per second, right? That's a lot, right? We have hundreds of tables right now, just right now, that are all pushing five hundred thousand requests per second. This is just normal for us every day. This is what we see.
You know, just barely half a million. For instance, the Amazon stores business peaked at 151 million requests per second on Prime Day. That's one customer of ours. So a lot of the scaling causes us to think about things a little bit differently. When you combine that scale with state in a distributed system, this is where a lot of the challenges come from. I'm not saying that stateless distributed systems are easy by any means, but stateful is a whole other level of complexity.
Before we get into a lot of the details of DynamoDB and some of the specific choices that we made, I want to talk about some of this problem in the abstract just so we're all on the same page. If you're building an application that has state, the simplest thing to do is put the application and the state on the same instance, right? Concurrent programming is not necessarily easy, but relatively speaking, yes, I'm saying this is easier than distributed state management. This pattern works great for a lot of applications, but pretty soon you want to scale, you want to scale things independently, and so you move your state onto a separate instance.
Now you've got to deal with failure. You've got these two things now that have to be available. Your application server has to be available, and now your stateful server has to be available. They both have to be available, so this is strictly less available than the previous approach. But you get some benefits because you can now scale the state and the application independently. However, it also means that the state can disappear out from underneath you at any time, and so you have to think about these failure cases throughout your code everywhere.
So how does your application actually know what the state of the underlying database is? Well, in some cases it doesn't, right, because especially as we start scaling, we get to the world where we say, hey, one instance isn't sufficient. We're going to have multiple instances and now we're introducing coordination and coordination is where a lot of this complexity comes from, right? One of the simplest things you can do then is move from a single instance model to a primary and secondary model, right? You do all your reads and writes to the primary.
You think through this and you're like, okay, well, I've got to deal with when the primary fails, right? That's easy. Everybody thinks about when the primary fails. What about the other side? What happens when the secondary fails? Yes, this one's actually more interesting, and this is one that's a little more subtle, because what do you do in this case, right? Do you continue to accept writes on the primary? But those writes won't end up in the secondary because it's down. What happens to those writes when the secondary gets healthy again? How are you going to repair all this? You start getting into a lot of these corner cases and the complexity that comes from distributed state coordination, right?
So maybe two isn't the answer. Two doesn't work, let's try three. Three is the real minimum then, right? Well, this is great because now if you want to survive a single box failing you can because you've still got two healthy and you've got another one. The write comes into one of the nodes and you can get it to one of the other nodes. So now your write's in two places and that write is no longer susceptible to losing a single box.
If three is not enough, what are you going to do? Well, three is not enough. Let's try four, right? If three's good, four must be better, okay. Well, with four, you know, now in theory you can survive two failures, right? Because I can still get the write to the other node, and that's fine. But how do you know these two are the correct two? Yeah, so there's this failure case, right? Maybe all the nodes are actually healthy and they just can't talk to each other. We have a network partition, right? Both sides think, well, I'm on the healthy side, the other side's dead, I'm going to keep accepting writes. You end up with what we call split brain, where both sides keep accepting writes and the data sets diverge over time. When the partition heals, you've got two totally different data sets you have to merge. Having an even number is a horrible idea. Four is a really bad option.
So basically the idea is you need to have an odd number of nodes. Is that the story? Yeah. So what about five? Let's just keep counting up, right? Four is out, we could do five. So with five, you can survive two failures. Maybe that's good. Maybe that's what you want. Well, what we've found through a lot of our experience is that as soon as one fails, you're working as hard as you can to get that replica back to par. You're bringing a new node, you're going to repair the situation. So what you're actually doing here for the second failure is racing—how fast can I get a node healthy again to get this cluster back to par, back to normal, before we lose another box.
A lot of the effort that goes into having more nodes might actually be better spent trying to repair a node faster, because you're going to need to do the repair in this case anyway.
At this point you're basically saying you need to have an odd number of nodes. Is that it? Like, 3 is OK, 5 is OK, 4 is not OK, 2 is clearly not OK. Right. As the group grows, you have to do more writes. Even in this world, you have to have the majority healthy so that you know you're on the healthy side. You're adding more boxes, and so you have to think about the cost.
Let's take this to the extreme. Let's say for scale reasons you need to have 1,001 nodes, because it has to be an odd number, right? So you have 1,001 nodes, and in order to accept a write, I've got to get it onto 502 boxes; it has to be a majority to accept the write. Statistically you can't count on that, because any one of those 500-odd boxes can fail. If you do this, you're going to trade off availability. Yes. It's also really expensive because you've got all these boxes. You really need to write that thing 502 times just so you know you can survive some failures.
So what do you do? An alternative that we think is a much better alternative is you have lots of groups of 3. You still can have 1,000 nodes or however many tens of thousands of nodes, wherever you want, but you group the data sets into groups of 3. In the groups of three, it has all the properties we like. We can replace the boxes quickly. We know it's correct. It's a nice odd number.
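To make the groups-of-three idea a little more concrete, here is a purely illustrative sketch (not DynamoDB code; the node and zone names are made up) of carving a large fleet into replica groups of three, one member per availability zone:

```python
# Illustrative only: organize a large fleet into replica groups of three,
# one member per availability zone, so each group can lose one box and
# still have a healthy majority of two.
def build_replica_groups(nodes_by_az: dict[str, list[str]]) -> list[tuple[str, ...]]:
    azs = sorted(nodes_by_az)
    assert len(azs) == 3, "this sketch assumes exactly three availability zones"
    return list(zip(*(nodes_by_az[az] for az in azs)))

fleet = {
    "az-1": ["a", "b", "c"],
    "az-2": ["d", "e", "f"],
    "az-3": ["g", "h", "i"],
}
for group in build_replica_groups(fleet):
    print(group)  # e.g. ('a', 'd', 'g'), one replica in each AZ
```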
Solving the Routing Problem: Metadata, Caching, and Replication Trade-offs
How do you know where to send a request to? You add complexity. If I send a request, I want to put all these boxes behind a load balancer, right? So the client doesn't necessarily know that in this case, C, F, and J are the three boxes that are in the cluster for the data set that this client is interested in. If I send a request through a load balancer or whatever, it's probably going to land somewhere else, almost certainly as the number of boxes goes up.
What K has to do then is say, OK, well, K can either return a redirect, or it can proxy on to the correct node for you. But if you think about that, right, K then just becomes the client, because K has got to know the data set. We have the exact same problem to solve: for this particular data set, where is it? I've got a lot of boxes. So basically you're going to add one level of indirection here and say that every problem in computer science can be solved with one level of indirection, and K needs to know where to go look this up. OK, fine. Have you seen these slides before? No, I have not. You made the slides. I haven't seen them either. No, yes, another level of indirection, right?
We add a data set which is the data about where the data lives, so our metadata. Right. The protocol then is: the client, in order to figure out where a particular piece of data lives, does a lookup in metadata and then goes to the box it needs to connect to. Right? Two lookups is expensive. Yeah. So now we're doing double the lookups. I have to do X requests per second to the data set, and I also have to do X requests per second to the metadata set.
I know what you're going to suggest next. Yeah, there has to be an easier solution, because if you were to do this, then you need to scale your metadata to the same capacity as your actual data, which I'm not willing to do. So you need to give me a different answer. We've got two of our classic computer science solutions now: an extra layer of indirection and caching. Right, so you put a cache here in front of the metadata, which is fine because these data sets don't change that often, so you can cache them. The metadata then sees only a small fraction of the lookups and doesn't have to scale to the same request rate.
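As a rough sketch of that indirection-plus-cache pattern (the names and the fetch_location call are hypothetical stand-ins for a real metadata service), a client-side routing cache could look like this:

```python
# Hypothetical sketch of caching routing metadata on the client side.
# fetch_location() stands in for the metadata lookup; on a hit we avoid it.
import time

class RoutingCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self._entries: dict[str, tuple[float, str]] = {}
        self._ttl = ttl_seconds

    def lookup(self, key: str, fetch_location) -> str:
        now = time.monotonic()
        cached = self._entries.get(key)
        if cached and now - cached[0] < self._ttl:
            return cached[1]               # cache hit: no metadata request
        node = fetch_location(key)         # cache miss: one metadata request
        self._entries[key] = (now, node)
        return node
```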
Any snarky comments about this one? No. OK. What are you going to do if your caches are cold and what are you going to do in situations where your metadata is not able to keep up with the uncached data? All of those things? You're foreshadowing your portion of the presentation coming up later. Thank you for telling me what I'm going to talk about. It's going to be nice to know.
So caches can be a risk, and we'll talk about that more later. We'll also talk about how we do this routing in a little more detail. The other thing is boxes can still fail. Right, so with a fleet this large, we have to deal with this case. When a box fails, in this case J has failed and we want to replace it with L. L then has to go and update metadata somehow and indicate, hey, I'm part of this set. Right. So we have this problem:
Who is the authority on which servers participate in which groups? You might think it should be the metadata, because that's where I'm doing all the lookups. But we have a race condition between when a box becomes part of one of these clusters of three, when it can serve traffic, when it tells metadata, and when the client caches can reflect it. The idea we're talking about is a system of record that is authoritative, plus something which points to it that is potentially eventually consistent. So are you suggesting that for scale you need to understand eventual consistency and deal with it? Yes, it's hard, but that's really the key. Of course we have this metadata thing we've been talking about. What is this and how is that going to work? There are two options here. We could make metadata another one of these tables. This works great, but you still have that routing problem of which subset of metadata do I need to talk to, so I have to solve all of that again. It makes the writes relatively simple though, because there's only one place I need to write to.
The system is going to have a lot lower request rates because of the caches, but it's also going to be significantly smaller because all you need is a single item to say that tens of gigabytes of data you're looking for are over there. This system is going to be multiple orders of magnitude smaller in terms of data size. This opens up an alternative: instead of having subsets of this, you can have entire replicas. If you have entire replicas, the lookup problem gets really easy because as a client, you have to talk to one of the metadata nodes. It doesn't matter which one. But the trade-off is it makes the write problem a lot harder when we change the membership set. Somehow we now have to go update all of the metadata nodes that exist because you can't predict which one the client is going to talk to later. This makes it more eventually consistent. For the rest of the talk we're going to dig into three aspects of this and the choices we've made in DynamoDB and look at it in a more tangible way of how DynamoDB decided to solve some of these problems.
Request Routing Architecture and Cross-Availability Zone Latency
We're going to look at request routing, how the client can get to the storage we've just described. We'll talk about how we structure our metadata. But also in a system this size, we'll talk about some of the constraints and limits that we have to put into place and why those exist. Customers will run into these, and when you understand why, we think this will help you build your applications. Routing is very simple. This is what DynamoDB looks like. The storage node fleet on the right is what we were just talking about with that large number of thousands of nodes with clusters of three, and the request router fleet is the fleet that sends the requests to the appropriate box. When a client comes in, they get sent to a random request router node. The request router has two jobs. First, it does authorization and authentication. Are you who you say you are? Do you have access to this data? This is where we do the SigV4 processing. Then we also do the metadata lookup, and from there we know which particular storage node to talk to. We forward the request on.
This picture is too simple; there's more to it. We have load balancers in front of the request routers. They're normal load balancers. We use a network load balancer. But because we have so many load balancers, we have DNS in front of the load balancers because it becomes the load balancer of the load balancers, the thing that sends traffic to all the load balancers. We're running within a region. Regions have availability zones. Availability zones are independent failure domains composed of multiple buildings where the actual EC2 instances eventually run. At the end of the day we're all talking about real hardware somewhere. This separation is really important for how we provide our availability and durability guarantees, but it does provide some challenges for latency.
Everything we have is just an EC2 instance at the end of the day, which means it runs in an availability zone. We've been very careful about how we've put our servers across the availability zones. We've striped them across multiple availability zones. In every availability zone, we have a choice.
We've constrained our load balancers to be within an availability zone because we find it easier to think about these units of failure. This is the choice that we've made. If you zoom in on one availability zone, the picture becomes more complex. We have a whole bunch of load balancers, and behind every load balancer is a bunch of instances, but I'm simplifying the picture here.
Going back to the simple picture with load balancers, every DynamoDB table is divided up into partitions, which we'll talk about more later. Every partition has three replicas, and we spread the replicas across the three availability zones. This is how we provide our availability and durability guarantees. Because as we discussed at the beginning, we can lose one box, one replica, and this cluster can stay healthy. Our risk is if we lose two. As soon as we lose one, we're going to try very hard to replace it, so we're trying to replace the instance as fast as possible. Our risk is correlated failure, and the most likely correlated failure we'll see is across an availability zone because of larger scale events.
We do all of our software deployments scoped to an individual availability zone. We want to make sure if two boxes fail, they don't impact two of the replicas in a partition. We do this by spreading them across availability zones. This is critical to what we do, but there's a consequence to this. These availability zones are physical buildings that are connected. You can and should test the network distance between them. Launch two instances in an availability zone, ping them, and see what times you get. Do this in different availability zones, and you'll see it's different. Do the same thing across availability zones, and you'll see it's different again. What's always true is that the network distance between availability zones is significantly higher than the network distance within an availability zone.
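If you want to run the measurement the speakers suggest without setting up ICMP, one simple option is to time TCP connects to a peer instance from within the same AZ and then from a different AZ; the addresses and port below are placeholders for your own instances:

```python
# Rough latency probe: time TCP connections to a peer that is listening on
# some port (22 is used here as a placeholder). Compare the numbers for a
# same-AZ peer and a cross-AZ peer; the relative difference is what matters.
import socket
import statistics
import time

def tcp_rtt_ms(host: str, port: int = 22, samples: int = 20) -> float:
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        times.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(times)

# print(tcp_rtt_ms("10.0.1.23"))   # placeholder: peer in the same AZ
# print(tcp_rtt_ms("10.0.2.45"))   # placeholder: peer in a different AZ
```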
Our clients are running on EC2 instances, Lambda functions, containers within ECS or EKS, whatever it is—they're in availability zones too. The simplest thing we can do for load balancing is to randomly route across all of our available instances, which means we're highly likely to route a customer across an availability zone and back across an availability zone. So we'll be paying the extra latency penalty multiple times.
Ideally, what we'd like to do is use the shorter network path. This is meaningful to a service like DynamoDB because if you look at the components of our latency, our server-side processing time is actually relatively small and of a similar order of magnitude to the network distance itself. It's one of the biggest things we can do as we work on improving our predictable low latency over time: shrink the distance. We don't have to do any software optimizations within the server itself; we just shrink the distance. By shrinking the distance, that ends up with a meaningful improvement in customer latency.
Optimizing Network Distance Through Availability Zone-Aware Routing
DNS is our load balancer across load balancers. If you do a DNS lookup for our domain name in US West 2 and keep doing this over and over again, you'll get different IPs for different load balancers within the region. One of the things that we can do is split horizon DNS, where you get a different answer depending on where the query came from. If you're in one of the availability zones, you'll get a set of load balancers in that same availability zone. If you're in a different availability zone, you get load balancers from there. So DNS is our availability zone selector as well.
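You can observe this yourself by resolving the regional endpoint repeatedly; which addresses come back, and whether they favor your availability zone, depends on where you resolve from:

```python
# Resolve the public regional endpoint a few times and collect the answers.
import socket
import time

ENDPOINT = "dynamodb.us-west-2.amazonaws.com"

seen = set()
for _ in range(10):
    for info in socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP):
        seen.add(info[4][0])
    time.sleep(1)  # answers rotate as DNS responses change

print(sorted(seen))
```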
Failure is where this all gets hard. This is what we expect to see and what we like to see: you end up with one-third of the traffic in every availability zone, meaning every availability zone processes one-third of the requests. It's relatively easy to scale and deal with from a capacity planning perspective. We have to deal with the case where an entire availability zone fails or we want to take it out for whatever reason. In that case, all the traffic from that availability zone now has to go somewhere, because we prioritize availability over latency. As you'll notice here, availability zones one and three are now seeing a fifty percent increase in the amount of traffic they would have had otherwise, since each goes from one-third to one-half of the total. So as a regional service, this is something that we have to plan for ahead of time, and we just have capacity sitting there ready all the time.
That is the price of doing business as a regional service. With DynamoDB, price follows cost, and this is baked into the price of DynamoDB to handle situations like this, so customers don't have to know that under the covers we're dealing with an impaired Availability Zone at the moment. But think about the case where traffic in one Availability Zone is significantly larger, where we have a traffic skew. One option is we could just scale up the number of servers in that Availability Zone. That's relatively easy, right? But think about what happens when that one fails. Now Availability Zones 1 and 3 have to handle a doubling in traffic. The idle capacity that we have to have sitting there ready in case of this event is significantly larger, making this a much more expensive way for us to run this service. This would mean we'd have to pass on the cost to customers, and we don't like that option.
The alternative is what we do: we say we'll never send more than one-third of the traffic to any Availability Zone. This makes our DNS management much more complex, and it does mean that some requests will always be going across Availability Zones. In reality, the law of large numbers helps us out, and we don't see very large skew, so we don't really run into this scenario. We also have some traffic coming straight from the internet, and the internet is close to every Availability Zone at the same time, so we can send that traffic wherever we want, with more DNS complexity to fill in the gaps. These are the types of things you must think through as you're building this style of architecture. That's how we shrink the first hop, which is a big latency win.
Now look at the second hop between the Request Router and the storage node. Ideally, we send one-third of the traffic everywhere. From a capacity planning perspective, this is easiest. DynamoDB has done this since launch, and we've baked this model into our capacity planning, our pricing, and our limits. When you're doing a strongly consistent read, you must talk to a node that we've elected as leader, so there's only one box you can talk to and you have the request processing capacity of one box. But for eventually consistent reads, you can talk to any of them. We said one of the boxes might be unhealthy, so we can't count on that, but we've got the processing capacity of two boxes, and that box has to be there anyway, so let's just use it. This is why for eventually consistent reads, the limits are higher and the price is lower—because we have this capacity that would have been idle otherwise that we can use all the time.
Making this change to route locally is pretty easy because we control both the client and the server. We have to know about all three replicas anyway, so we just pick the one in the same Availability Zone that we're in and send all the traffic there. Unless, of course, you're relying on the fact that you can send two times the requests for an eventually consistent read, because you could now be sending two boxes' worth of traffic to a single box. That's no good. What we have to do is actually monitor statistics on the server side and detect when this happens. This happens sometimes, not super frequently, but when it does, we can send a message back to the client and say, "For this table, let's go back to the random routing mode because we need to prioritize availability over the lower latency in this case."
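A hypothetical sketch of that selection logic, with the fallback signal modeled as a simple flag (none of these names are DynamoDB internals):

```python
# Prefer the replica in the caller's AZ for eventually consistent reads,
# but fall back to random routing when the server has signaled that a
# single replica is absorbing too much traffic.
import random

def pick_replica(replicas_by_az: dict[str, str], my_az: str, force_random: bool) -> str:
    """replicas_by_az maps availability zone name -> storage node address."""
    if not force_random and my_az in replicas_by_az:
        return replicas_by_az[my_az]                     # shortest network path
    return random.choice(list(replicas_by_az.values()))  # availability over latency
```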
In reality, most tables with much higher traffic have clients distributed across Availability Zones all the time, so they're being routed to each individual Availability Zone, which means that when we do the local AZ routing, we see the spread we'd expect to see and so we don't end up in this case as often as you might expect. This is the path we've taken and how we can shrink our latency. A few times we've bumped up against the availability versus predictable low latency trade-off, and when forced to choose, we're going to make the service available. But in a lot of cases, there is no conflict and we can do both. It leads to a bunch of complexity on our side, but these are the tenets and the most important things for our service and for our customers, so we think this complexity is worth it in order to provide a better product to you folks.
A couple of things which I mentioned: we do client routing where one-third of the traffic goes to each Availability Zone. Another thing which we do is we try and keep one-third of the leaders on each Availability Zone. Therefore, even on the strongly consistent reads, you're evenly distributing traffic across the AZs.
The Metadata Lookup Challenge: Finding Partitions in Tens of Microseconds
When a request comes in, it goes all the way through to the request router. The important decision that a request router needs to make is: should we serve this request? This question has three parts. First, are you who you claim you are? Does your SigV4 signature match your request? Second, do you have permission to do the thing which you want to do on the table which you're claiming you want to do it on? Third, are you within your limits? These are the three questions which we have to answer for each and every request we get.
We have hundreds of customers who do over 50 million requests per second. We do this billions of times a day, and we have to do this really fast because what we're after is predictable low latency at any scale. Once the request router knows that your request is legitimate and we do want to serve it, the next question we have to ask is: which storage node do we send it to? Craig talked about distributing a table into multiple partitions and partitions across multiple storage nodes. We have storage nodes here, but there are literally hundreds of thousands of storage nodes across which we distribute data.
Your data is co-located on storage nodes with other people's data. It's completely shared infrastructure. We do not spin up a cluster for you. Your data is co-located on hundreds of thousands of storage nodes with other people's data, and hundreds of millions of times a second, we need to decide where to send each of these requests. That's the problem which we have to solve. Once the request makes it to a storage node, first, all data is always encrypted at rest. You can provide us a key, but if you don't provide a key, one will be provided for you. All data is always encrypted at rest to ensure security.
Then we need to decide: are you overrunning a partition? With shared infrastructure, rate limits, and provisioned throughput, we need to make sure that we want to admit the request. There are two places where we decide whether to admit your request or to throttle you. Once we're done with that, we have to serve your request, and your response is going to go from the storage node back to the request router and back to you.
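The talk doesn't spell out the admission-control algorithm, but a token bucket is a common way to picture a per-partition limit; treat this strictly as an illustration of the idea, not as DynamoDB's implementation:

```python
# Toy token bucket: capacity refills at a fixed rate, each request spends
# tokens, and a request that finds the bucket empty gets throttled.
import time

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: float):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_admit(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller gets throttled

write_limit = TokenBucket(rate_per_second=1000, burst=1000)  # shaped like a per-partition write limit
```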
For those of you who have used DynamoDB before, consider a simple table with a unique attribute, the login name, and a non-unique attribute, the person's name. Somewhere in the middle there are two people with the same name, but they have different login IDs. The primary key on this table is login. When you create a table, we ask you for three things: what's the name of the table, what's the primary key, and what's your credit card number? That's how you're going to pay for it.
Once you tell us what the primary key is, we compute a hash of your primary key and order your data based on the hash. Notice that the order of the names is different from the order of the names on the left because we ordered it by hashes here. Contiguous ranges of hashes become partitions. This is partitioning for horizontal scale. Once you've partitioned the table, if you want to fetch an item, I need to find the partition where it is. So if I want to find the item for Jorge or for Richard or whoever, I will compute the hash on that, find which range of hashes it falls within, and send my request to that location. Hundreds of millions of requests a second with predictable latency. That's the thing we're after.
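A toy version of that hash-and-range lookup is below; the real hash function and boundary encoding are internal to DynamoDB, so MD5 and the split points here are just stand-ins:

```python
# Hash the primary key, then binary-search the sorted partition boundaries
# to find the partition that owns that hash value.
import bisect
import hashlib

# Partition i owns hash values below boundaries[i] and at/above boundaries[i-1].
boundaries = [2**125, 2**126, 3 * 2**125, 2**128]  # illustrative split points
partitions = ["partition-0", "partition-1", "partition-2", "partition-3"]

def route(primary_key: str) -> str:
    h = int.from_bytes(hashlib.md5(primary_key.encode()).digest(), "big")
    return partitions[bisect.bisect_right(boundaries, h)]

print(route("jorge"))    # every key lands in exactly one partition
print(route("richard"))
```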
We built a system that achieves this in tens of microseconds. Tens of microseconds, literally. We measure this thing, and when it starts to get out of this bound, we're really concerned about it, because we do it so often. What we have to do is locate the partition where your data is. Look at the traffic: today, it's hundreds of millions of requests per second. And at any point in time, the things we have to deal with are how many storage nodes we have and how many request routers we have; all of these numbers are measured in six digits or more. At any point in time, you're creating tables, dropping tables, scaling up traffic on some table, and we're splitting tables because of that. Partitions are moving around, and we need to figure out where your data is in tens of microseconds.
I want to talk to you about how we go about doing this, and I want to get back to the point which Craig talked about: state is hard, and shared state is really hard. This is a distributed system we're trying to build. The request routers on the left here need to be able to figure out in tens of microseconds which storage node your data is on, because those storage nodes at the back are shuffling all the time. They're moving partitions around, splitting partitions, changing key ranges, dropping tables, and creating new tables. These things have to stay in sync. This is the problem we have to solve.
Building a Scalable Metadata System with Two-Tier Caching and Versioning
How do we go about solving it? These are the piece parts we have to deal with: request routers, storage nodes, a control plane that handles create table, update table, and delete table operations, and partition metadata that we're supposed to look up. These are the piece parts. Let's talk about this situation. You create a table or update a table with a synchronous update to partition metadata. This is scalable. How often does this happen? Relatively low volume, tens of thousands of times a second. Not a problem; we can deal with this. Partitions moving around, partitions splitting, and leadership moving around on storage nodes because we're doing software deployments happen hundreds of thousands of times a second. If you're going to synchronously update your partition metadata for all of that, you're building a pretty serious database in the middle there. That in itself will become a scalability problem for you.
The worst situation you have is our customers sending us billions of requests, each of which needs a metadata lookup. So effectively, this metadata system is going to have to handle the same front-door traffic that DynamoDB handles. This is not a scalable system. Let's look at maybe the next obvious choice. First option: put a cache in there. How many of you have been in a situation where you're building a system and somebody in your team says they need to scale and they're going to put a cache in there? How often has it worked out well for you? Caches are a great solution. They are a dangerous solution, and we'll talk about why.
So I have a cache. Every request router now has a cache. Magically, we've implemented one. The size of the arrows from the request router to the partition metadata went down. Let's assume this cache has a 99 percent hit rate. Partition metadata is serving only 1 percent of the front-door traffic: the cache misses. Any idea what could go wrong in this system? Stale, all right. What happens if all your caches go stale? If one cache goes stale, who cares? What happens if you have a situation where, for whatever reason, hundreds of thousands of caches go stale?
Now your partition metadata is going to get hundreds of thousands of cache misses, and it's going to fall over. We thought about this and said the system might not work. So we went one step further and said this is a situation of a large fleet and a small fleet. A large fleet of request routers and a small fleet in partition metadata could cause the small fleet to fall over. One of the things I said when I started was the specific choices we made may not be relevant to you, but there are some concepts which you should take away. If you're ever building a cache and a large fleet drives a small fleet, be very careful, because it's not going to end well for somebody.
All right, so this didn't work. So we came up with a next-generation system where we built a two-tier cache. We said MemDS is going to be one tier of cache, and the request routers are going to have another tier of cache. Each tier of cache is eventually consistent, so this is eventual consistency on top of eventual consistency. Keep that in mind when I say this. You create a table, we're going to push it down to MemDS. There's also going to be a poller there which is going to go read all the storage nodes and ask what partitions they have for what tables, and push that to MemDS. MemDS is now an eventually consistent cache.
You make a request on the request router, and if it gets a cache miss, it goes to MemDS, which is itself an eventually consistent cache. This is the system we built, and let me tell you why I think this works. The control plane pushes new table creation. The one thing which we added here is we versioned the data across the board, so whenever there's new data, the version moves forward. Craig talked about the place of authority. The authority is the storage nodes. The storage nodes know what data they have. Here's a storage node. Do you know what partitions you have? Absolutely you do. So if the partition publisher comes to you and asks you what partitions you have, you will give an absolutely authoritative answer.
That's now pushed to MemDS, but that is the state at that point in time, and it carries a version: you also say, my version is now version 20 or something like that. At some point you want to move a partition from one place to another, or there's a split; it comes to the same thing: bump the version number. So here's what we did. A storage node had version 20 of the metadata. It split a partition, or it moved a partition, and the new storage node says, I'm now at version 21. Now a request comes in at the front door, and the request router says, I have a cached value which is version 20. Maybe it had a cache miss and went to MemDS, and MemDS also said version 20; it doesn't matter. The request goes to the place where version 20 pointed it. That storage node is the authoritative source, and it says, you actually need to go to this other place, because now we're at version 21.
The immediate thing which we do is send the request straight to the new place, and it is going to get served immediately. So even if you have eventually consistent data in the system, like caches that are two tiers deep and eventually consistent, and you do have a cache miss, which happens infrequently because there's a partition split or a partition move, we are still able to serve that in reasonable time. If you're interested in how all of this stuff works, we published a paper about it. That's a QR code; scan it. Actually, this is not being recorded, so you probably want to scan it, or you can contact me afterwards.
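To make the versioning idea concrete, here is a hypothetical sketch (all of these names are illustrative, not DynamoDB code) of a stale cached route being corrected by the authoritative storage node:

```python
# A cached route carries a version. If the node it points at is no longer
# the owner, that node hands back the newer route; the client retries there
# and refreshes its cache.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class StorageNode:
    name: str
    forward_to: Optional["Route"] = None  # set when the partition has moved

    def handle(self, request: str) -> Union[str, "Route"]:
        if self.forward_to is not None:
            return self.forward_to        # "I'm no longer the owner; go there"
        return f"{self.name} served {request}"

@dataclass
class Route:
    version: int
    node: StorageNode

def lookup(cache: dict, key: str, request: str):
    result = cache[key].node.handle(request)
    if isinstance(result, Route):         # stale entry, e.g. version 20
        cache[key] = result               # adopt the newer route, e.g. version 21
        result = result.node.handle(request)
    return result
```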
The basic idea is this: you've got a system with data which has entropy. You have a need to do stuff with caches. These caches are going to be eventually consistent, and they need to deal with the situation where all the caches can probably go cold. So far, I've talked about the first two. I haven't talked about the last one. So let's talk about that one. System of record, authoritative there. Two levels of eventually consistent caches were good so far. Now let's talk about the situation with large fleet, small fleet, and hedging.
If you're ever building a system and you want predictable latency, use this concept called hedging. We do it internally. Hedging is not our idea; it's an idea from Google, and that's a copy of the paper up there. Here's what happens when a request shows up at a request router.
In the eventuality that we have a cache miss, we don't make one request—we make two requests to two MemDS servers. Why is this a good idea? Assume that your system is like most systems and your latency follows some kind of normal distribution. If you make two requests, statistically, one lands on the left of the median and one lands on the right of the median. You take the first response, and it's always good. We hedge our requests. What this means is that in the eventuality of a cache miss, MemDS is serving twice the traffic that we're getting for DynamoDB. We go one step further. What happens if all the caches happen to be cold?
We introduce this concept of constant work. You make a request to DynamoDB. That request goes through load balancers and all that stuff, shows up at a request router. The request router says, where's this item? The cache is correct. It makes a request to the storage node. The storage node serves a response. We've already given you the answer. What do we do in the background? We send two requests. To two MemDS servers, even on a cache hit. Everybody with me? On a cache hit, we send two requests to two MemDS servers. Why? Assume someday all the caches went cold. MemDS would not know the difference. Large fleet, small fleet. This is the cost to us of doing business and giving you predictable low latency at any scale. These are the things which we keep in mind when we build our systems.
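A hypothetical sketch of the constant-work pattern (illustrative names only; the real router hedges two MemDS lookups, but one is shown here for brevity):

```python
# Issue the metadata lookup on every request, hit or miss, so the metadata
# fleet sees the same load whether the caches are warm or suddenly cold.
# A production version would also use the background result to refresh the
# cached entry.
from concurrent.futures import ThreadPoolExecutor

_background = ThreadPoolExecutor(max_workers=16)

def route_with_constant_work(cache: dict, memds_lookup, key: str) -> str:
    future = _background.submit(memds_lookup, key)  # constant work, even on a hit
    node = cache.get(key)
    if node is None:
        node = future.result()                      # cold path: wait for the lookup
        cache[key] = node
    return node
```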
We try to build systems with a guarantee of availability. We can't just say your caches went cold and therefore we lost availability. We can't do that. The latency needs to be not just low, but predictably low. These are some of the considerations which we had to go through because these were our tenets. Figure out what your tenets are and have your decisions mirror your tenets. You're probably not going to have to make the same choices, but in the event that you do, my suggestion to you is be very, very careful of the person who says we'll add a cache in front of it and it's going to solve all the problems.
Caching is hard. I've been doing this for about thirty-five years. The number of times when somebody has said I'll put a cache in front of it and life is going to be good, and then turned out to be wrong, is close to one hundred percent. Caching is hard. But if you do understand that you want caching, understand eventual consistency. Build systems with versioned data so that eventually consistent data is your friend. Don't run away from it and say I need synchronous updates across large numbers of nodes, because that is not a scalable solution. Eventual consistency is your friend if you want to go to scale.
Practical Recommendations: Connection Pooling, Hedging, and Constant Work
Why are these things important for us? Because our tenets were security, durability, availability, and great, predictable, low latency at any scale. If we want any scale, we need to understand eventual consistency ourselves. Everybody good so far? I've talked a lot about caching. I've talked a lot about the things we do on a request router. What are the things we cache? You want to connect to DynamoDB and make a request. We cache your identity credentials. You want to make a request on some particular table, we cache your table metadata.
Suppose you were to make a new connection to DynamoDB on every request; you'd probably land on a new request router. Are you going to get the benefit of caching? No. Make sure you have a long-lived connection. What does that mean? Right-size your connection pool. If you have low traffic, have a small connection pool. If you have high traffic, it's okay to have a larger connection pool. If you have a very small amount of traffic, don't have a huge number of hosts serving you that traffic. Think about these kinds of things to make sure that you get predictable low latency.
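With the Python SDK, for example, the "long-lived connection, right-sized pool" advice roughly translates to creating the client once and sizing its pool to your concurrency; the numbers below are placeholders to tune for your own traffic:

```python
# Create the client once and reuse it; size the connection pool to your
# actual concurrency so connections, and anything warmed by them, stay
# long-lived.
import boto3
from botocore.config import Config

dynamodb = boto3.client(
    "dynamodb",
    config=Config(
        max_pool_connections=25,  # match to your concurrency, not more
        retries={"max_attempts": 3, "mode": "standard"},
    ),
)
```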
Also, if you want predictable low latency from your application, hedge your requests. Send two requests. If they're writes, make sure they're idempotent. If they're reads, use eventually consistent reads. Take the first response; don't just crank up the timeout. And if you're building caches, please, by all means, build constant work into your plan.
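A minimal sketch of application-side hedging, assuming the call is safe to issue twice (an eventually consistent read, or an idempotent write):

```python
# Send the same request to two endpoints and return whichever answers first.
# The slower call keeps running on the shared pool; its result is ignored.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

_pool = ThreadPoolExecutor(max_workers=8)  # shared, long-lived pool

def hedged(call, endpoints, request):
    futures = [_pool.submit(call, endpoint, request) for endpoint in endpoints[:2]]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    return next(iter(done)).result()
```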
Understanding DynamoDB Limits and the Importance of Tenets in Decision Making
To give you predictable low latency, we implemented the concept of limits in DynamoDB. I'll discuss a couple of these limits. There are read and write limits, and the motivation for all of these limits is predictable latency. At the end of the day, DynamoDB runs on physical hardware, and physical hardware has limitations. This hardware is shared infrastructure, so we need to operate a service that is cost-effective for us so we can pass those benefits on to you. We have physical hardware with some limitations, and if you create a table with a certain number of provisioned RCUs, we divide that table into partitions. Historically, when a partition split, we would divide its share of the provisioning across the two new partitions. This was called IOPS dilution, and there are many people who have been using DynamoDB for a long time who think this is still a problem. However, this is no longer a problem.
The way things work today is that if you have a number of partitions and we split a partition, each partition gets the same partition-level limit. Currently, the limits are 1000 WCUs and 3000 RCUs per partition. The numbers may change over time, so don't take those as a commitment, but we have these limits because we want to guarantee predictable latency. I want to make one small detour here. Many customers ask me what the partition count is on their table, and on its own that is not a meaningful metric for you. The reason is simple: there is no guarantee that each partition holds the same fraction of the key range. The real question you are asking is whether your table will be able to serve your traffic during a major event like a Cyber Monday sale in a couple of days. The question you need to ask us is this: if each partition has a different size, which partition is going to throttle me first? The real answer is that the partition with the least headroom for your traffic is the one that will throttle you. Simply telling you that you have eight partitions is useless, because eight times 3000 is not meaningful to you.
We introduced a feature called warm throughput to answer this exact question. If you have a major event coming up, you can use this feature to determine whether your table will be able to serve your traffic without throttling. Warm throughput is a feature we launched last year at re:Invent. When you describe a table, it will tell you what traffic you can serve without throttling.
We have a similar kind of limit on transaction size. When we launched transactions, a transaction could have no more than 25 items; today the limit is 100. We offer standard ACID transactions, and if you build applications with a relational database, we have the same transactions for you. This allows you to build banking applications with all of the guarantees you want. However, our transactions are different. We have two kinds of transactions: read-only transactions and write transactions. All of the items in the transaction are specified at one time. We did all of these things because we want predictable low latency. Have you ever been in a situation where your application is hung because somebody started a transaction and went to coffee? This cannot happen in DynamoDB, because the transaction contains all of the changes that are going to happen in one place.
We offer standard serializable isolation. If you build applications with relational databases, you understand this. We use the standard two-phase commit approach with prepare and commit phases. Everything is standard the way you would expect with a two-phase commit transaction. We did do one optimization for reads. We do not do two-phase commit for reads because it is cheaper for us to just do the read two times. If the items did not change, your transaction is good. The question then is why we have transaction limits. Why 25? Why 100? The simple answer is that we want predictable low latency. We did a bunch of testing and asked ourselves: if we increase the number of items in a transaction, what happens to latency and what happens to availability? If you have contention on your items that are being modified in the transaction, availability goes down and latency goes up. When we launched, we were able to safely do 25. Today, we are able to do 100.
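For reference, this is roughly what a write transaction looks like from the Python SDK: every item is named up front in a single call, so there is no open transaction to leave hanging. Table and attribute names here are just examples.

```python
# A two-item write transaction: debit one account and credit another,
# with a condition that the debit never overdraws.
import boto3

client = boto3.client("dynamodb")

client.transact_write_items(
    TransactItems=[
        {
            "Update": {
                "TableName": "accounts",
                "Key": {"account_id": {"S": "alice"}},
                "UpdateExpression": "SET balance = balance - :amt",
                "ConditionExpression": "balance >= :amt",
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
        {
            "Update": {
                "TableName": "accounts",
                "Key": {"account_id": {"S": "bob"}},
                "UpdateExpression": "SET balance = balance + :amt",
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
    ]
)
```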
We want to increase that number even further. These are some of the reasons why we have limits on transactions. Similarly, we have item size limits. Why do we have an item size limit of 400 kilobytes? It's because of shared infrastructure. We want to make sure that you do not impact your own access to some other item which is in the same partition. As the size of an item goes up, the number of times you can read or write it per second goes down, and the amount of time each operation takes goes up. That's why we have an item size limit of 400 kilobytes. If you do need a larger item, I would love to talk to you about it.
We have limits on Global Secondary Indexes, or GSIs. When we launched, the limit was 5. Today the number is 20. The reason is that we want predictable low latency: we were able to make changes to the software so that we give you consistent, predictable replication lag to the GSIs, and that let us raise the limit. Why do we have GSIs at all? Because you have alternate access patterns. Now, of course, if you have a need for more than 20 GSIs, that is my email address. Please do contact me, because I think there is a different conversation we need to have about your schema.
Many of these decisions we took, and every one of these limits, were to give you predictable low latency, because I think that is the one thing which differentiates us from any other database: predictable low latency at any scale. So I came to the end of this, pretty good. The conclusion is we talked about the tenets. Every one of the decisions we make as a team is based on our tenets. These are our high-level tenets. Whenever we have a project, whenever we have a new feature, whenever we have a new thing which we want to develop, one of the things which we ask the team is, what are your tenets? Because those are the things which help distribute decision making across the team, and they make sure that the teams are able to iterate faster. But as a service, these are our tenets.
If you ever have a question about why we did something, the answer is probably one of these four. Why don't we store your data unencrypted on disk, since it would be faster? The first one up there: security. Why don't we give you some write guarantee other than writing to two availability zones before we acknowledge? It would be faster, wouldn't it? Durability. We are a regional service. We guarantee that if a complete availability zone were to go away, we would not be in the least bit impacted; your data is written to two availability zones before we commit it to you. Availability is our third guarantee. And there is no point having availability with wrong data, or giving your data to the wrong person. Security, durability, then availability, and to us as a service, predictable low latency.
So my ask to you would be this. In the projects you are building with DynamoDB, or without DynamoDB, it does not matter. Think about these things. What are your tenets? How do you drive decision making in your organization? This is how we drive it in ours. So I hope this was useful to you. We are going to hang around here and answer any questions. But apparently, if you want a hoodie, you have to come and talk to somebody in the database booth. There is no QR code here. There is a whole bunch of trainings which we do offer.
The one other ask I have for you is this. We do these presentations because we believe it is really important to share with you the things we have learned operating a service at scale, the choices which we made, and what we have learned from you. A part of it is standing here and talking to you about these things, but a part of it is listening to you after these conversations. If you want more content like this at re:Invent, please do fill out the session feedback. Thank you very much. I think that is the last one.
; This article is entirely auto-generated using Amazon Bedrock.