🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Advanced data modeling with Amazon DynamoDB (DAT414)
In this video, Alex DeBrie covers advanced DynamoDB data modeling, focusing on three key areas: secondary indexes with the new multi-attribute composite keys feature that eliminates synthetic key overhead, schema evolution strategies including handling new attributes and backfilling existing data, and common anti-patterns like kitchen sink item collections and over-normalization. He emphasizes understanding DynamoDB's partitioning model, consumption-based pricing, and API design to achieve consistent performance at scale while keeping implementations simple and cost-effective.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Advanced Data Modeling with Amazon DynamoDB
Thank you all for coming. This is Advanced Data Modeling with Amazon DynamoDB. I'm Alex DeBrie. I'm really grateful for you showing up here on a Wednesday morning. This is my seventh year talking at re:Invent.
In an hour, I can't cover everything about DynamoDB. I try to cover different topics every year, but if you want to look at previous years, these are all on YouTube. You can check out those topics. Additionally, there are a lot of really great talks this year from all over the spectrum—data modeling, architecture. There's a great one by Craig Howard that's tomorrow about the service disruption that just happened. There's a lot of really great stuff. Some of these have already happened, and some you probably can't get into, but check these out on YouTube as well. Really great speakers for all of these.
In terms of what we're going to talk about today, I'll start off with a little bit of background and some data modeling goals and process, and then we'll dive into some topics around using secondary indexes well. There's a really good release we had two weeks ago that I'm super excited about—talking about schema evolution in DynamoDB because I get questions around that a lot. And then there's a quick anti-pattern clinic around some anti-patterns I see and how you can address them instead.
I like to think about it as: we want to first set some baseline facts just so we're all on the same page, maybe have some higher level concepts and things we should be keeping in mind as we're doing this modeling, and then some actual application where we're applying it to our data modeling. I'm Alex DeBrie, AWS Data Hero. I'm going to talk fast. I have a ton of slides, and they're worried I'm actually not going to get through it all. So I probably won't be able to do Q&A in here, but I will do Q&A out there if you want to. I will also be at the DynamoDB booth in the expo hall most of this afternoon. So bring questions, bring your friends, and let's talk DynamoDB.
DynamoDB's Unique Characteristics: Fully Managed, Consumption-Based Pricing, and Consistent Performance
All right, let's get started with the background. I want to start off with some unique characteristics of DynamoDB because DynamoDB is a unique database. It's different. Most people know relational databases, right? And if you want to model well in DynamoDB, you need to learn some different things. You need to change what you're doing. You need to teach a lot of people on your team. Given that, I think if you want to use DynamoDB, you should want one of its unique strengths and unique characteristics.
The three that I always think about are: number one, it's fully managed, right? It's fully managed in a different way than RDS or some other managed database service. With DynamoDB, they have this region-wide multi-tenant, multi-service, self-healing giant fleet with storage nodes, load balancers, request routers, and all sorts of different infrastructure. You cannot take down DynamoDB. That's not going to happen. You could overload a relational database or OpenSearch or some other database, but not DynamoDB. So it's really fully managed and pretty hands-off operationally compared to most other databases.
In addition to that, it has a consumption-based pricing model that I love. You're not paying for an instance. You're not paying like a traditional database where you pay for CPU, memory, IOPs, and however big your instance is. You're actually paying for consumption. With DynamoDB, you're paying for read capacity units—actually reading data from your system. Every four kilobytes of data you read consumes a read capacity unit. Same with writes: every one kilobyte of data you write to DynamoDB, you consume a write capacity unit. And then you're also going to pay for the data you're actually storing.
There are some unique implications of that. The big one that I like is very predictable billing impacts. If you have an existing application and you want to add a new secondary index, you should be able to calculate pretty easily how much it's going to cost to backfill that index and how much it's going to cost going forward, knowing your write patterns. I also say that you can work out a bunch of DynamoDB costs in Excel. You don't have to actually spin up a database and see how much it's actually using. You can do this in Excel, and it's pretty nice to do that.
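As a concrete illustration of that Excel-style math, here is a minimal back-of-the-envelope sketch. The item counts, sizes, and the per-million-write-unit price are made-up assumptions for illustration, not published rates; only the shape of the calculation matters.

```typescript
// Back-of-the-envelope estimate for backfilling a new GSI and paying for it going
// forward. All inputs below are illustrative assumptions, not published prices.
const itemCount = 50_000_000;   // existing items to backfill (assumed)
const avgItemSizeKb = 0.8;      // average size of the projected item in KB (assumed)
const writesPerSecond = 500;    // steady-state writes that also hit the index (assumed)

const wcuPerItem = Math.ceil(avgItemSizeKb);   // 1 WCU covers up to 1 KB written
const pricePerMillionWriteUnits = 1.25;        // assumed on-demand write price, USD

const backfillCost =
  (itemCount * wcuPerItem / 1_000_000) * pricePerMillionWriteUnits;
const monthlyOngoingCost =
  (writesPerSecond * 60 * 60 * 24 * 30 * wcuPerItem / 1_000_000) *
  pricePerMillionWriteUnits;

console.log({ backfillCost, monthlyOngoingCost }); // ~$62.50 backfill, ~$1,620/month
```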
There's also a really tight connection between how efficient you are, how you model and think about efficiency, and your bill. DynamoDB is pricing it in a way that's giving you signals on how you should be using it. I'd say take those signals, use them, and you'll save a lot of money on your bill. I love that consumption-based pricing model. And then performance-wise, the one people think of is consistent performance at any scale. You can have DynamoDB tables that are small—there are lots of megabyte level tables—but there are also lots of terabyte level tables and petabyte level tables. DynamoDB gives you consistent performance at any scale, no matter which size you're at.
Understanding DynamoDB Architecture: Tables, Items, and Partitioning
I would say these three things—ops, economics, performance—at least one of these should stand out to you. I want that from DynamoDB. That's why I'm going to learn DynamoDB. That's why I'm going to change how I do some modeling to make it work within DynamoDB. Let's do just enough architecture to understand that consistent performance at any scale part. If we have an example data set here, this is some users in my application. Those are going to be contained in what's called a table. You'll create a DynamoDB table to store your records. Each record is going to be called an item. When you create your table, you have to specify a primary key for your table. That primary key must be included on every record and must uniquely identify every record in your table.
In addition to that, when you're writing items in DynamoDB, you can include other attributes on it. You don't need to specify these attributes upfront. There's no schema that's managed by DynamoDB itself, other than that primary key. The schema is going to be managed within your application layer. Those attributes can be simple scalar values like strings, integers, or they can be complex attributes like lists and sets and maps.
The primary key concept is so important to how DynamoDB works. There are two types of primary keys when you're creating your DynamoDB table. There's a simple primary key, which just has a partition key. Our users table, for example, just has that username that uniquely identifies every record in our users table. But we also have a composite primary key, which is probably more common depending on your use cases. A composite primary key has both a partition key and a sort key. So we can imagine in our application, maybe users can make orders. In that case, we might want to use this composite primary key which has these two different elements: that partition key of a username and that sort key of an order ID.
Note that each one of these is a distinct item. Even though some of them share the same partition key, the combination of that partition key and sort key needs to be unique. With that primary key, you're going to choose this type when you create your table—simple or composite. You can't change the key names or anything like that after you create your table. The combination of that primary key, whether simple or composite, has to be unique for each item.
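To make the setup concrete, here is a minimal sketch of creating a table like the orders example with a composite primary key, using the AWS SDK for JavaScript v3. The table and attribute names are assumptions for illustration.

```typescript
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Composite primary key: Username is the partition key, OrderId is the sort key.
// Only key attributes are declared up front; everything else is schemaless.
await client.send(new CreateTableCommand({
  TableName: "Orders",
  AttributeDefinitions: [
    { AttributeName: "Username", AttributeType: "S" },
    { AttributeName: "OrderId", AttributeType: "S" },
  ],
  KeySchema: [
    { AttributeName: "Username", KeyType: "HASH" },   // partition key
    { AttributeName: "OrderId", KeyType: "RANGE" },   // sort key
  ],
  BillingMode: "PAY_PER_REQUEST", // consumption-based, no capacity to provision
}));
```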
Now as you look at that primary key, both of those elements have that partition key element, which is probably one of the more important things you need to know about DynamoDB and using it well. When you're creating a DynamoDB table, behind the scenes, they're going to be spreading your data across multiple different partitions. Let's say you start with two partitions. When a write request comes into DynamoDB, say you want to insert a new user into our table, that front end request router is going to look up the metadata of your table, understand what its primary key is, look at that partition key value, hash that value, and then figure out which partition it needs to go to. So in this case, that item belongs to partition one.
The great thing about this is that as your data grows, DynamoDB behind the scenes can add a third partition, add six more partitions, add a thousand more partitions. It doesn't matter. That first step of figuring out which partition an item belongs to is a constant time operation. So even if you have one of those ten terabyte tables, it's still going to be a constant time operation to get down into about ten gigabytes of data. Those partitions are all managed for you. You don't have to add new partitions. DynamoDB is doing that behind the scenes for you. That's where that consistent performance at any scale happens.
The DynamoDB API: Single Item Actions, Query Operations, and Mental Models
The most important thing you need to know is the combination between partitioning and how the DynamoDB API exposes that to you. These items are going to be spread across your partitions by that partition key, and then you want to be using that partition key and using that whole primary key to identify your items. Rather than giving you a SQL interface with a query planner behind the scenes, you're basically getting direct access to those items through the API.
DynamoDB has what I call single item actions, which are all your basic CRUD operations. If you're inserting a record, reading a record, updating a record, or whatever, you're doing that with your single item actions. In this case, it requires the full primary key. You need to specify the exact item you want to go and manipulate. All your write operations are going to be with those single item actions. There's not like an update where you can update a bunch of records given some predicate.
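For instance, here is a sketch of the basic single-item actions with the document client, continuing the hypothetical Orders table from the sketch above; note that every call has to name the full primary key.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, GetCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Insert (or overwrite) exactly one item.
await doc.send(new PutCommand({
  TableName: "Orders",
  Item: { Username: "alexdebrie", OrderId: "order-1234", Status: "PENDING" },
}));

// Read exactly one item back by its full primary key.
const { Item } = await doc.send(new GetCommand({
  TableName: "Orders",
  Key: { Username: "alexdebrie", OrderId: "order-1234" },
}));
```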
Now if we go back to that composite primary key and think about how that partition key is used mostly to assign records to the same storage partition, we can see that records with the same partition key in this composite primary key table are going to be grouped and stored next to each other. They're going to be ordered according to that sort key. It's really useful to us because sometimes we need to get a set of related records, and that's where DynamoDB gives you the query operation as well. If you have a table with a composite primary key, you can do this and you can fetch multiple records with the same partition key in a single request.
When you're doing this, it requires the full partition key because it needs to know which partition to go to to actually fetch those records. You can also have conditions on that sort key to say, maybe, values before the sort key value or after the sort key or between those sort key values. Finally, it's only going to let you get one megabyte of data per request, which is how it gives you that consistent performance at any scale. It doesn't want you to have a gigabyte of data that you're pulling back because that's going to change your response times.
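Here is what that looks like as a sketch against the same hypothetical Orders table: an equality on the partition key, an optional range condition on the sort key, and a page that is also capped at one megabyte by the service. It reuses the `doc` document client from the sketch above.

```typescript
import { QueryCommand } from "@aws-sdk/lib-dynamodb";

// Fetch related records in one contiguous read: all of one user's orders whose
// sort key falls within a range.
const { Items, LastEvaluatedKey } = await doc.send(new QueryCommand({
  TableName: "Orders",
  KeyConditionExpression: "#user = :u AND OrderId BETWEEN :start AND :end",
  ExpressionAttributeNames: { "#user": "Username" }, // alias guards against reserved words
  ExpressionAttributeValues: {
    ":u": "alexdebrie",
    ":start": "order-1000",
    ":end": "order-2000",
  },
  Limit: 50, // each page is also capped at 1 MB; continue with LastEvaluatedKey
}));
```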
Additionally, the DynamoDB API has a scan API, which is just fetch all. I'd say you're going to use this pretty sparingly, other than like exports and different things like that. So given this partitioning and the API, my mental model for DynamoDB is that when I make a request to DynamoDB, I get one contiguous disk operation from some unbounded amount of storage. DynamoDB is just giving me one contiguous disk operation from an unbounded amount of storage.
Dynamo is providing this infinite fleet of storage that expands as much as I need. However, I need to physically think about how I'm setting it up within my application to ensure I can get what I need when I need it. If I want to insert a new record, I need to identify the primary key to determine exactly where that goes. If I need to read a record, I need the primary key so I can find it easily. Or if I need to fetch a set of related records, I need to arrange them so I can use the query operation with all items grouped together and sorted as needed. I can read up to a megabyte of data there.
The key point here is that every API request gives you one contiguous operation from an unbounded amount of storage, unlike SQL, where you get a query planner that hops all over the disk doing multiple reads. Given that you get basically one contiguous read, I really love the DynamoDB API. I think it works very well with the partition key, and you need to understand that. Given that, I would say don't start with the PartiQL API. There is a PartiQL API, which is a SQL-ish API you can use to query DynamoDB. Under the hood, it's basically just turning it into one of these operations: a single item action, a query, or something like that. However, I think it hides a lot of what you should actually be thinking about in your application: how do I want to arrange my data to fit with the DynamoDB mental model?
Secondary Indexes: Enabling Additional Read-Based Access Patterns
The last quick concept we need to cover is secondary indexes. We've seen that on this table we can fetch records by the primary key. I can fetch the Alex DeBrie record or fetch Madeleine Olson by username. But what if I want to fetch someone by a different attribute, like their email address? That's where the concept of secondary indexes comes in. You can create secondary indexes on your table, and it's basically a fully managed copy of your data with a new primary key. This enables additional read-based access patterns by repartitioning your data and reorganizing it so you can fetch it more easily. We can create a secondary index on the email address, and now we have this email index which allows us to fetch a record by a given email address.
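For example, reading from that email index is still just a query, with the index named explicitly. The index and attribute names here are assumptions, and the index must already exist; the `doc` client is from the earlier sketch.

```typescript
import { QueryCommand } from "@aws-sdk/lib-dynamodb";

// Look a user up by email via the secondary index instead of the base table's key.
const { Items: usersByEmail } = await doc.send(new QueryCommand({
  TableName: "Users",
  IndexName: "EmailIndex",
  KeyConditionExpression: "#email = :email",
  ExpressionAttributeNames: { "#email": "Email" },
  ExpressionAttributeValues: { ":email": "alex@example.com" },
}));
```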
There are two types of secondary indexes. The first kind is a global secondary index, which you should prefer in almost all cases. The second type is a local secondary index, which you should really understand before using. I was going to do a deep dive on why and when you should use local secondary indexes, but I had to cut it for time. I would just say that a local secondary index is kind of a one-way door and has some serious downsides. Make sure you really understand a local secondary index before putting one on your table.
To summarize and get us on the same page with some basic facts, make sure you understand the concept of partitioning, using the partition key to spread your items across your table. Understand how the DynamoDB API works to provide consistent performance at any scale and the importance of the primary key in that. Then understand how secondary indexes enable additional read-based access patterns and the consumption-based billing, which I think is unique and pretty interesting about DynamoDB. You get predictable and visible billing there. Let's move on to some high-level concepts around data modeling goals and process.
Data Modeling Goals and Process: Keeping It Simple
The first thing I would say is that you want to keep it simple as much as possible. I think a lot of times when I see errors or issues with people's DynamoDB data models, it's just that they're more complex than they need to be. I was guilty of this for a long time as well. I saw this recently on Twitter about how a novice does too much in many different areas, while a master uses the fewest motions required to fulfill their intention. So keep it simple as much as you can.
I like to think about what your modeling meta goals are, regardless of what database you're using. What are we doing when we're doing data modeling? Number one, you have to think about how to maintain the integrity of the data you're saving. Ultimately, a database is serializing and storing some state of the world, whether that's representing physical objects like inventory, offices, or people, or digital stuff like social media likes and comments. You need to be able to maintain the integrity so when you read that back out and represent it, you can still understand what you have in your database. In addition to maintaining integrity, you need to make it easy to operate on the proper data when you need it. If I have one of those big tables, how do I get down to just the records I actually need?
Thinking about that in the context of DynamoDB, how do we apply that? When thinking about maintaining the integrity of the data you're saving, the first thing you have to do is have a valid schema. DynamoDB is not going to maintain your schema for you like a relational database would. DynamoDB is a schemaless database, so you can write whatever attributes you want, which means you're going to maintain that in your application code. It's an application-managed schema rather than a database-managed schema. You should have some sort of valid schema in your code. I use TypeScript and Zod, but you can use whatever you want. When you're writing records into your DynamoDB table, especially if you're inserting full records, you should almost always be validating that you know what you're writing in and that it's valid data.
Validate before you put it in there, because you don't want to put junk into your database. The same applies when you're reading data back from DynamoDB. You should parse that out and make sure the shape matches what you expect. If it doesn't, throw a hard error rather than limping along. You should understand that you're getting something back that you did not expect. Where did that issue happen? You don't want to keep corrupting your data worse and worse over time.
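A minimal sketch of that application-managed schema, assuming TypeScript and Zod as mentioned above; the entity shape and field names are made up for illustration.

```typescript
import { z } from "zod";

// Application-managed schema: DynamoDB only enforces the primary key, so the
// shape of everything else is enforced here.
const OrderSchema = z.object({
  Username: z.string().min(1),
  OrderId: z.string().min(1),
  Status: z.enum(["PENDING", "PREPARED", "SHIPPED"]),
  TotalCents: z.number().int().nonnegative(),
});
type Order = z.infer<typeof OrderSchema>;

// Validate on the way in, so junk never reaches the table.
function toOrderItem(input: unknown): Order {
  return OrderSchema.parse(input); // throws on invalid data
}

// Parse on the way out too; a hard error here beats silently limping along
// with a shape you did not expect.
function fromOrderItem(item: unknown): Order {
  return OrderSchema.parse(item);
}
```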
So that's a big one: make sure you have a valid schema somewhere in your application. Additionally, when you're maintaining the integrity of your data, you want to maintain constraints across items. If you have some uniqueness requirements, you don't want to have multiple users with the same username. You need to maintain uniqueness that way, or maybe you have some sort of limits you need to enforce in your application. How are you going to do that? That's where DynamoDB condition expressions are going to be your friend, maybe transactions, but think about those constraints and make sure you're modeling for them.
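As one hedged example of such a constraint, a uniqueness check on insert can be expressed as a condition expression, so the write fails if the username already exists (table and attribute names assumed, `doc` from the earlier sketch).

```typescript
import { PutCommand } from "@aws-sdk/lib-dynamodb";

// Enforce uniqueness at write time: the put is rejected with
// ConditionalCheckFailedException if an item with this key already exists.
await doc.send(new PutCommand({
  TableName: "Users",
  Item: { Username: "alexdebrie", Email: "alex@example.com" },
  ConditionExpression: "attribute_not_exists(#user)",
  ExpressionAttributeNames: { "#user": "Username" },
}));
```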
Then with DynamoDB, sometimes we duplicate data. Sometimes we do a little bit of denormalization. So think about how you avoid inconsistencies there. When you're duplicating data, think about whether this data is immutable. Is it ever going to change? Sometimes you will have functionally immutable data. If someone makes an order and you want to store the payment method that they used for that order, even if they change something about that payment method later on, you don't really need to go change that order. You're capturing a snapshot of what the payment method was at that time.
But sometimes you'll be duplicating mutable data. If I have mutable data, how am I going to update my data when it changes? First of all, how many times has it been copied? Has it been copied five times or a thousand times? How do I identify the items to which it's been copied where I need to go update all these different records for this data that's changed? And probably the hardest question now: how quickly do I need to update it? Do I need to update it in the same request, where if I'm updating that parent record and now it's been spread out to five different items, do I need to do that in a transaction to make sure they're all updated at the same time? Or can I do that update asynchronously? Am I going to have data inconsistencies across that? What does that mean for my users or clients?
So be thinking about this when you're duplicating data. That's what I think about when I'm talking about maintaining the integrity of the data within DynamoDB. Then you also want to think about how to make it easy to operate on the right data when you need it. That's where your primary keys are coming in. If you're writing, think about what's your proper primary key to maintain uniqueness and what's the proper context. How do I canonically identify a record that I'm always going to have available that I can use to address that relevant item? When I'm reading, what's the primary key structure? What are the indexes I need to filter down efficiently rather than significantly overreading and filtering after the fact?
Those are some meta-goals we'll keep in mind as we look throughout this. Just a quick run through the data modeling process: I always say that most of DynamoDB data modeling happens before you write any application code. You should be thinking about this, writing it down, and then the implementation aspect is actually pretty straightforward. First thing you need to do is know your domain. What are the entities in my domain that I'm actually modeling? What are those going to look like? What are the constraints that I have in my application? What's the data distribution? If I have a one-to-many relationship, how many can that related aspect be? Is it ten related items per parent, or is it a thousand or a million or something unbounded?
How big are these records? Because that's going to affect modeling and some of the choices I'm making there. Then with DynamoDB, you want to know your access patterns up front and model specifically for your access patterns with your primary key. I always say be very specific with this. You should actually list out what are my write-based access patterns and go through those mechanically. Same thing with your read-based access patterns. As you're modeling your primary keys, you should make sure you know how to handle each one of these. If I have conditions in my write operations, do I have that set up properly? All these sorts of things. So know your access patterns and then the last thing: just know the DynamoDB basics, the things we talked about before, the primary key, the API, and secondary indexes. That's going to do most of it for you.
Multi-Attribute Composite Keys: A Game-Changing Release for Secondary Indexes
So please just keep it simple on this stuff. I think the basics are going to get you a long way. Using those single item actions for your write operations and your individual reads, using some queries for range lookups and list operations, sprinkling in those secondary indexes when you need them for additional read-based patterns, and then sometimes using transactions sparingly for certain operations. All right, so that gets us out of sort of background conceptual type stuff. Let's go apply it somewhere. And I want to start off talking about secondary indexes. The reason I want to talk about this is because there was a huge release just two weeks ago about how DynamoDB now supports multi-attribute composite keys for your GSIs. This is a huge release. I think this will simplify a lot of things for people. But in terms of walking through why this is useful, let's look at an example table we have here that's just tracking orders within a warehouse within some system. We have multiple warehouses, and we have assignees within those different warehouses that have to go process those orders, pick them, and make sure they're all ready to be shipped.
So we have these different attributes on our table. We might have some sort of access pattern that says, for this warehouse, for this assignee, for this status, what are the things that they should be working on?
You might see some sort of attribute like this in your table: GSI 1 PK and GSI 1 SK, which are synthetic keys made up of other attributes that are already in your table. If you look at this, we have the warehouse ID put in there, then we have a pound sign, and we have the assignee ID jammed in there. Then we've got status, we've got priority, we've got created at—all of this is made up into these synthetic keys in our GSI PK and SK.
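A sketch of what building those synthetic keys in application code typically looked like; the separator and zero-padding choices here are assumptions, which is exactly the kind of incidental decision this pattern forces on you.

```typescript
// The old synthetic-key approach: concatenating real attributes into GSI1PK /
// GSI1SK strings on every write. Names, separator, and padding are illustrative.
interface WarehouseOrder {
  warehouseId: string;
  assigneeId: string;
  status: string;
  priority: number;   // must be zero-padded so it sorts correctly as a string
  createdAt: string;  // ISO-8601 timestamp
}

function withSyntheticKeys(order: WarehouseOrder) {
  return {
    ...order,
    GSI1PK: `${order.warehouseId}#${order.assigneeId}#${order.status}`,
    GSI1SK: `${String(order.priority).padStart(3, "0")}#${order.createdAt}`,
  };
}
```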
This was a very common pattern we used to have to do with these synthetic keys, where we're manually combining these attributes to partition, to group, and then sort as needed. I didn't realize how much I disliked these until this new release came out, because there's a ton of downsides to this. The number one is just the item size bloat.
If you look at that item that we have in our table, the meaningful attributes on that item are about 100 bytes, so a pretty small record. If you look at the other attributes—these synthetic keys—they're 92 bytes. So this is almost half of our item. It won't always be half of your item because you'll actually have larger other attributes there, but if you have two indexes with GSI 1 PK, GSI 2 PK, and SKs as well, you might be talking about 200 bytes, which is 200 bytes you're storing on every single item.
You're paying storage for that. 200 bytes is 20 percent of a WCU, so it's likely to kick you over a WCU boundary a lot of the time. So every time you write, you're paying for an extra WCU, and again for each index replication. This is just a lot of cost for very low value. These attributes are already on your table.
There's item size bloat, and there's also just the application modeling complexity. When you're doing all your access patterns and setting up these indexes, you have to think about putting all these together and have I done it right? Have I implemented it right in my application? There's sorting complexity around taking all these attributes and turning them into a string. But if one of those attributes is an integer, now you have to sort that integer like a string, so you have to zero pad it to the longest length it could potentially be and think about that sort of thing.
Then there's the update logic. It gets harder. If someone comes and says, hey, update the status for this order—it's no longer pending, it's prepared or whatever—I have to know all those other values. If I don't know all those other values and I have to read that record to pull down those values just to do my update, it's kind of a pain to do all this sort of stuff.
That's the old way. But now we have these multi-attribute composite keys. The way this works is you can support up to four attributes each for the partition key and the sort key when you're creating your secondary index. If we have our existing table and we want to use this multi-attribute composite key pattern, what we do is when we're creating our partition key, we say, OK, I want my first element to be that warehouse ID. I want my second element to be that assignee ID. I want that third element to be my status.
Same thing with the sort key—I want the first element to be that priority. I want the second one to be that created at. Now I don't need those GSI 1 PK and GSI 1 SK values at all anymore. DynamoDB just reuses those existing attributes when it creates my secondary index to know how to set it up.
If we go back and look at those downsides, how does this work with our multi-attribute composite keys? We don't have that item size bloat anymore because it's actually reusing the actual attributes in our table. We're not bloating it up with another 100 or 200 bytes on our table. It's a lot easier to reason about because when I'm writing or updating an item, I don't have to think, OK, which other GSI synthetic keys do I have to update as well?
We don't have that sorting complexity. If one of my partition or sort key attributes is an integer, it sorts like an integer. It doesn't sort like a string, so you can just do normal sorting on it. Then my update logic is a lot easier because again, I'm only updating the actual attributes in my table. They're handling all the work for that.
In terms of how it works, you get up to four attributes each for your partition key and sort key. So you can still just use one attribute for each if you want to, but you can specify up to four. Now, when you're doing a query operation, your key condition expression has to include all of your partition key attributes. Because it's the same thing, you still need to know exactly which partition you want to go to, where this data is located, so you need to make sure you have all your partition key attributes in there.
You can do conditions on that sort key as well. The important thing is that sort key is going to be ordered. The ordering of those attributes matters, and I would think of it like a SQL composite index. It's basically left to right, no skipping, stops at the first range. So if you have four values in your sort key, you can match on all four of them, or you can match on the first three and do a range on the fourth one.
But what you cannot do is do an equality match on the first attribute and an equality match on the third attribute without providing a value for the second attribute. It will stop at that one and just scan there.
The one thing I will say is that this probably will not solve your overloaded index issues. Say you are doing single table design, and in this case we have some user entities and some organization entities in one table. You can see we have these GSI 1 PK and GSI 1 SK values here for a secondary index.
If we look at our secondary index, we have an item collection, which is a set of records with the same partition key. We have an item collection that contains two different types of entities. We have pre-joined these entities, organization and user, in that same item collection. This is going to be hard to do with those multi-attribute composite keys because it is unlikely they are going to have the same attribute names across these different entity types. For our partition key, they both have an organization name, so we could use that as our partition key. But you can see that the sort key value is coming from the username attribute on the users, while for the organization it is just a static string, so it would take a little bit of work to do this if you are doing these overloaded index patterns. So it probably will not help that one there, but in all other cases, this is going to be a huge win for you.
Cost Management with Secondary Indexes: Strategic Use and Optimization
All right, so for secondary indexes again, use these multi-attribute composite keys. This is huge. I would use this for almost all cases except for those overloaded indexes. Honestly, for existing tables, this might make sense too just to save on item sizes. Create a new index with this, switch over to it, and drop your old index. You can stop writing that synthetic key, which could actually save you money depending on your use case.
In addition to that, let us talk about cost management with secondary indexes because I think this is undervalued. Every time you are writing to a secondary index, it is going to consume write capacity units. Secondary indexes are a read-time optimization for which you pay for writes. But writes are more expensive than reads, right? A write capacity unit costs five times as much as a read capacity unit, but it is also only one-quarter of the max size, so it is going to be five to twenty times as much as reads depending on the size of your items. So make sure you are getting the value from that.
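To make that ratio concrete, here is the rough arithmetic as a tiny sketch; the unit prices are relative, illustrative numbers rather than actual rates.

```typescript
// Rough arithmetic behind the "five to twenty times" claim. Prices are relative,
// illustrative numbers; only the ratio matters.
const pricePerRcu = 1; // one read capacity unit covers up to 4 KB read
const pricePerWcu = 5; // one write capacity unit covers up to 1 KB written

// Small item (<= 1 KB): 1 WCU to write vs. 1 RCU to read.
console.log((1 * pricePerWcu) / (1 * pricePerRcu)); // 5x

// 4 KB item: 4 WCUs to write vs. still just 1 RCU to read.
console.log((4 * pricePerWcu) / (1 * pricePerRcu)); // 20x
```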
Here are some cost management tips on secondary indexes. The first thing I think is, do I actually need a secondary index? Because I think a lot of times we will write our access patterns, we will solve that first access pattern with the primary key in our base table, and then we will say, okay, every other read-based access pattern, I am just going to add another secondary index for that. But now in this case, I have three secondary indexes. Every time I write my item, I am going to have to replicate to each one of those. My write costs are now four times as much as they would be. So make sure you actually need all your indexes.
Sometimes you can reuse secondary indexes for multiple different use cases. I would say the two areas I usually see this is like if you have a really high correlation or overlap between different read patterns, sometimes you can do that. I had a talk recently which was about an order delivery app, something like DoorDash. Imagine you want to show all the orders for a given restaurant over time. They want to say, hey, what orders did I have last month or the month before that? They are grouped by restaurant, ordered by time. But also that restaurant wants to say, hey, what are my active orders that I should be working on now? I want to put up on the board in my restaurant to make sure people are working on them.
Well, the thing is, all your active orders are going to be the most recent orders. You are not going to have an order from two weeks ago that you forgot to deliver and you need to be working on now. So you just look at the last fifty or one hundred orders, filter out those that are actually completed, and those are your active orders. You do not need a secondary index for that one.
The second place you can reuse a secondary index is just when your overall search space is pretty small. So searching for admin users within all users. Like if you have a SaaS application where organizations sign up and they have lots of different users in there, and somewhere deep in your user management page you want to look for just who are the admins in my application. Well, if you only have like one hundred, two hundred, or three hundred total users max within a given organization, you probably do not need this separate index just to show admin users. You can just fetch all the users and then filter down to admin after the fact.
My rough rule of thumb here is that if fetching that total search space (in this case, all the users for a given organization) is less than one megabyte, which is one DynamoDB query page, you usually do not need a secondary index for that, depending on how many times you are reading from it and different things like that. All right, so that is the first one. Do I need an index at all? Can I avoid having an index? The second one is, if I do need an index, do I need to put all my items into my index? And this is where the sparse index pattern shows up.
The thing about secondary indexes is DynamoDB is only going to write items to your secondary index if they have all the index key attributes. So if you have that GSI 1 PK and GSI 1 SK, or if you are using these multi-attribute composite keys, it has to have all those different elements to be replicated in your secondary index. And you should use this to your advantage.
If we go back, we had that orders table that we showed in the beginning, let's say we had an access pattern that said find all the orders for a given customer that had a support interaction. Maybe what we do is when an order has a support interaction, we add this support interaction at timestamp attribute on it just to indicate when that interaction happened. Notice that not every record is going to have one of these.
If we set up a secondary index on that table using that support interaction, partitioned by that username and sorted according to that support interaction at timestamp, now when we have our order support index, it's only going to have the subset of items that have both of those attributes. We've filtered out all the records that don't have a support interaction. So again, use this to your advantage both from a modeling perspective. This is like a global filter over your data set. We basically said where support interaction is true. You want to be doing filtering with your primary key, with your partition key, with your sort key, but also with your secondary indexes using that sparse index pattern, which is another good way to filter data.
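A small sketch of how that attribute gets set only when it applies, assuming the attribute is named SupportInteractionAt and reusing the earlier `doc` client; items that never get this update simply never show up in the sparse index.

```typescript
import { UpdateCommand } from "@aws-sdk/lib-dynamodb";

// Sparse index in practice: only stamp SupportInteractionAt when a support
// interaction actually happens. Items without the attribute are never replicated
// into the order-support index, so the index doubles as a global filter.
await doc.send(new UpdateCommand({
  TableName: "Orders",
  Key: { Username: "alexdebrie", OrderId: "order-1234" },
  UpdateExpression: "SET SupportInteractionAt = :ts",
  ExpressionAttributeValues: { ":ts": new Date().toISOString() },
}));
```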
It's also going to reduce costs because now you're not paying to replicate those items that you're never going to read from this index, or you don't have to overread and filter out records that don't match your conditions. So that's the second cost management tip. First of all, do I need an index? Second, do I want all the items in my index? The last one is, do I need the full item in my index, where you can choose how much of that item to project into your index.
I used to say just project the whole thing, but that actually can get really costly in a lot of ways. Think about our user record again, and maybe we have a user detail page that has a lot of information about that user. It's got this long bio section, preferences, an activity log, maybe we're persisting some of their most recent actions on there, just a lot of stuff. But if we have a list users access pattern, we don't need almost any of that data. If you look at that, we need a user ID, name, email, just a few little bytes of information.
So if we're creating a list of users in an organization access pattern, we don't need to replicate all that or project all that data into that index. With your index projection, think carefully about which attributes you're projecting into that secondary index because there's significant savings you can have from not doing that, and it comes in three ways. Number one, it's going to reduce the WCUs that are consumed for a write. If I have this five kilobyte user record but I'm projecting less than one kilobyte of it into my secondary index, I'm reducing my WCUs from five to one, which is eighty percent savings right there. But even better is that it prevents some writes entirely.
If a user goes and updates their bio and I'm not replicating that to my secondary index, it's not going to update that record in my secondary index. I skip that write entirely, and now I save one hundred percent of that write for that secondary index. Additionally, it's going to help you on the read side because now when you're reading all those users, you're not paying a full RCU for each record you have there. You're just paying for a much smaller item. It's going to reduce the number of pages when reading, so really think about your projections carefully around this stuff. I'd also say it's not a one-way door.
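Here is a hedged sketch of creating an index that projects only the attributes the list page needs, reusing the low-level `client` from the earlier table-creation sketch; the index and attribute names are assumptions.

```typescript
import { UpdateTableCommand } from "@aws-sdk/client-dynamodb";

// Project only the handful of attributes the list-users page needs, instead of
// the whole 5 KB user item. Bio, preferences, and the activity log stay out.
await client.send(new UpdateTableCommand({
  TableName: "Users",
  AttributeDefinitions: [
    { AttributeName: "OrgName", AttributeType: "S" },
    { AttributeName: "Username", AttributeType: "S" },
  ],
  GlobalSecondaryIndexUpdates: [{
    Create: {
      IndexName: "OrgUsersIndex",
      KeySchema: [
        { AttributeName: "OrgName", KeyType: "HASH" },
        { AttributeName: "Username", KeyType: "RANGE" },
      ],
      Projection: {
        ProjectionType: "INCLUDE",
        NonKeyAttributes: ["Email", "DisplayName"],
      },
    },
  }],
}));
```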
You can create secondary indexes after the fact. We're going to look at that in the schema evolution section, but you can if you need to change your projection over time, create a new secondary index either with a larger or smaller projection and then drop your old index and start reading from the new one. So that's secondary indexes. Again, key takeaways here is use this multi-attribute composite key pattern wherever possible. I think it's a huge addition that's really going to reduce your costs and simplify a lot of your logic there.
Schema Evolution: Addressing the Myth That DynamoDB Can't Change
Look into sparse indexes for global filtering over your table. Then think about that index cost flow. Do I need that index? If I do, do I need to replicate all those items into my index? And then finally, do I need to replicate the full item into that index? All right, next, let's talk about schema evolution. I get this question a lot where we talked earlier about how you have to know your access patterns in DynamoDB, and that leads a lot of people to say DynamoDB is great if your application will never change.
I don't think that's true. I've worked on a lot of different DynamoDB applications and they've all evolved over time in different ways, so I don't think that's true. What I think is true is that certain patterns are always going to be hard to model in DynamoDB, and sometimes those come up later and then you feel frustrated thinking this is too hard to do. So let's talk about some patterns that are just always hard in DynamoDB before we actually move into schema evolution. The big ones are going to be if you have any aggregations around your table. DynamoDB doesn't have native aggregation functionality. You're going to have to write it in your application code. So if you have questions like "How many transactions has this customer done each month?" or "What's the largest purchase done by customer Y?" it's tricky for DynamoDB. You're going to have to manage some of that yourself.
I think the more common one, and the one that comes up when people are saying "Hey, DynamoDB can't evolve" is complex filtering needs. I say that's when you're filtering or sorting by two or more properties, all of which are optional. That's really hard, right?
Let's say you have just a list of records in a table view, and you want to show your users lots of different attributes. You want to let them choose which fields you're filtering by and sort by different things. That's really hard, right? If I go and say, "Hey, find me all the trips by this company that came out of the Omaha airport and maybe were over 500 miles and within this time range," this is going to be a really hard pattern to model in DynamoDB, even if you knew this on day one before you wrote any code for your table. This is a hard pattern to model.
You can't do it easily. I've talked about complex filtering more over the last couple of years, so you can look at that. There are some ways to do it in DynamoDB. Sometimes you want to use something else like OpenSearch or ClickHouse or something like that. And sometimes you're just like, "Hey, you know what, this would fit better in a different database." But complex filtering is a hard one to do. So this is true: certain patterns are always hard in DynamoDB, and if they come up later, they're going to be hard. But that's not because evolving DynamoDB is hard. It's because this pattern is hard in DynamoDB.
Three Types of Schema Evolution: From Application-Only Changes to Data Backfills
So I do want to talk about more traditional schema evolutions that you do see and how you handle that in DynamoDB for access patterns that actually fit within DynamoDB. I'll do that with just an example here, which is a support ticket application, right? Customers can come file support tickets, they get assigned to different users, and we'd have some sort of table like this with a partition key of our tenant ID. We have our different tickets all in this table that have our different attributes on them.
Now, as your application is evolving, I would say first you want to understand the type of evolution you're performing. I think there are three main types of evolution that you're going to see pretty commonly in your application, and the way you handle them is just a little bit different, right? So the first one is you might have a schema change that does not affect data access, right? You're not fetching based on this schema change.
So if we go back to our support tickets here, maybe just on the left here, we've added these little badges based on the tier of the customer, right? Maybe they're a platinum customer, maybe they're gold. All we're doing is helping our support agents understand what that customer tier is. But you see there's no filtering on that customer tier or anything like that. It's purely this little badge that we're putting on there. So in this case, we're adding a new but unindexed attribute. We're not indexing it. And this is generally the easiest type of evolution to handle with DynamoDB, right?
We talked about how DynamoDB is schemaless, so that means you can just start writing new attributes to the table for new items as you want, right? So we have this new customer tier attribute that we're starting to write to our item. Notice that some items don't have them. Existing items might not have this customer tier attribute, and we're okay with this, right? This is just like if you're changing your SQL table to add a new column with a default value, but now that default value is probably going to be in your application code rather than in DynamoDB, right?
So this is the easiest type of evolution. What you want to do is update that schema in your application code. That's going to be mostly where you handle that. Add default values and change your schema to handle that. So we talked about having that valid schema, that modeling meta goal before. If we had our different ticket schema here, we might add this new customer tier attribute on the bottom. It has the different values it can be. It has a default for items that don't have that particular attribute. Depending on how complex our schema change is, maybe now you need some versioning of different schemas. The first thing you do is sort of detect that version. Maybe you have to parse the ticket differently based on what version it is and then sort of normalize it into some sort of schema. But you can mostly handle this within your application code.
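For instance, continuing the Zod-style schema idea from earlier, the new attribute can be added with a default so older items parse cleanly; the names and tier values here are assumptions.

```typescript
import { z } from "zod";

// Schema evolution in the application layer: new, unindexed CustomerTier
// attribute with a default for items written before the change.
const TicketSchema = z.object({
  TenantId: z.string(),
  TicketId: z.string(),
  Status: z.enum(["OPEN", "PENDING", "RESOLVED"]),
  Priority: z.enum(["P1", "P2", "P3"]),
  CustomerTier: z.enum(["standard", "gold", "platinum"]).default("standard"),
});

// An old item without CustomerTier still parses and picks up the default.
const ticket = TicketSchema.parse({
  TenantId: "tenant-123",
  TicketId: "T-0042",
  Status: "OPEN",
  Priority: "P2",
});
console.log(ticket.CustomerTier); // "standard"
```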
That being said, you can handle it completely within your application code. However, you might decide you actually want to backfill all your existing data, right? There are a couple of reasons for that. Number one is you might end up with schema bloat over time where you have twenty different versions of your schema, and it's hard to reason about. Like, okay, if I have a V2 item, how does it get to a V16 item or something like that? So it might be easier rather than managing that. You just say, "Hey, I'm going to backfill my existing records and handle those." Or another thing is you might be exporting your data to external systems, OpenSearch for search or ClickHouse or S3 and Athena for analytics, right? And while you can handle the default values in your application for your OLTP stuff, now you also have to communicate all those values to whoever's maintaining those systems, and it can be hard to deal with. So you might, just for long-term data hygiene reasons, say we're going to backfill and update this new value on existing items. If that's the case, now you're out of this first type of evolution. You're going to be into the third type we'll talk about in a second. But at the very least, what you can do is handle this completely in your application code. That's handling a schema change, adding new attributes, renaming attributes, things like that, that does not affect access. This is a mostly easy application-only change.
The second type is a new index on an existing attribute. If we look at our support tickets, at first we're just showing them in a flat list with no filtering. This works well when we have 5 tickets, but over time we're going to have 5,000 tickets or 500,000 tickets or 5 million tickets. Now we need a way to filter down to just my tickets. I'm a support agent, and I want to filter by assignee so I can say, just give me the tickets that I have. This goes back to our modeling and modeling meta goals—making it easy to operate on the right data when we need that data. So what we need is a new index on an existing attribute. The good thing is global secondary indexes can be added at any time. You can go in and add a new secondary index to your table, and DynamoDB is going to handle that for you. If we go back, you can see here we already have this assignee value. We can set that up as our partition key. We can use created at or ticket ID as our sort key. DynamoDB is going to do the work to backfill that for us, and now we can query from that accordingly.
The general process for this is number one, you add that index, whether that's in your infrastructure as code tool or maybe you just do it directly in the AWS console. DynamoDB is going to kick off a backfill for you and basically scan all the existing items in your table and write and replicate them into your secondary index for you. Once that backfill is done, then you can start querying your secondary index. You can't query it until that initial backfill is done, so that's where you add the application access pattern to start reading from that index. This is a fairly easy, straightforward change—I want to access my existing data in different ways. DynamoDB is going to do that backfill of your new index for you, and it's not particularly hard.
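One practical note on that ordering: you can poll the index status and only flip the application over once the backfill finishes. A minimal sketch, reusing the low-level `client` from earlier and assuming a table and index name:

```typescript
import { DescribeTableCommand } from "@aws-sdk/client-dynamodb";

// Wait for a newly added GSI to finish its backfill before routing reads to it.
async function waitForIndexActive(tableName: string, indexName: string) {
  for (;;) {
    const { Table } = await client.send(new DescribeTableCommand({ TableName: tableName }));
    const index = Table?.GlobalSecondaryIndexes?.find((i) => i.IndexName === indexName);
    if (index?.IndexStatus === "ACTIVE") return;
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // poll every 30 seconds
  }
}

await waitForIndexActive("SupportTickets", "AssigneeIndex");
```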
The third type of evolution that you'll go through commonly is when we need to make some change to existing data. As we talked about doing a backfill before, the example I came up with here is we have a lot of records with things like priority and status, but maybe what we want is to add this urgent button where I can filter down to just the most urgent tickets and things I want to handle there. If I click that button, it filters down these urgent tickets so I know what I want to get down to. Let's say that access pattern is kind of funky and has some business logic. What is an urgent ticket? We've decided it's tickets where the priority is P1 or P2, the status is either open or pending, and then we want to order them by created at. If we look at the existing items in our table, there are some items that meet this criteria, but there's not really an existing attribute that we can use to index and filter and get down to just these urgent items. So what we might have to do is when we're writing records to our table or updating records, if it meets that criteria based on our business logic, we also add this is urgent true flag to it, just something to indicate that this is an urgent ticket. Then we can go and create that secondary index using that partition key, having multiple sort key values in there so that only the tickets that have that is urgent flag are going to get replicated into our urgent tickets GSI. This is that sparse index access pattern we were just talking about. We have just those urgent records here.
So this is if you need to change existing data. The problem here again is we have a bunch of tickets in our table that don't have this urgent flag, but we need to add it for those ones that are truly urgent. This is mostly going to come up if you need a new index on a new attribute in your table, or if you're backfilling a new attribute even without indexing it, like we were talking about in the first type. You need to backfill existing items with this new attribute. The general process for this is first you want to update your application code to start writing this new attribute. What you don't want to do is perform that whole backfill, and then once that backfill's done, all the new items are not getting that new attribute. So make sure you're updating your application code so you're writing this new attribute to new items going forward. Then you actually start your backfill, which means I want to scan all the items in my table. I identify the ones that need this new attribute and run my update item operation. Then once that's done, if I need a new index or whatever, I update my application code to actually use my new attribute there. This feels like a tough process. It's not nearly as fun as just having DynamoDB backfill your index for you. There is some tooling around this to make it a lot easier, which is great. You don't have to just write your own scripts. There's AWS Glue, which can operate directly on your DynamoDB table and do this sort of thing. You can also use export to S3 to export your entire table to S3 and use Glue to operate on it there and write back into DynamoDB. My new favorite is there's a bulk executor. Jason Hunter, who's a DynamoDB SA, created this new bulk executor tool. It's on GitHub.
He's a dynamo on DynamoDB. This tool allows you to do this sort of stuff and make it pretty easy to perform these kinds of migrations. I've also seen people use AWS Step Functions if you're into that. If you're good with Step Functions, this can work. I prefer the other ones, but any of those can work.
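If you do end up hand-rolling it, the shape of the job is a paginated scan plus targeted updates. Below is a minimal sketch under the assumptions of the urgent-ticket example above (table and attribute names are made up, the `doc` client comes from the earlier sketch, and a purpose-built tool like the bulk executor is usually the better choice at scale).

```typescript
import { ScanCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

// Hand-rolled backfill: scan the table page by page, find tickets that meet the
// "urgent" business rule and don't have the flag yet, then stamp IsUrgent = true.
let lastKey: Record<string, any> | undefined;
do {
  const page = await doc.send(new ScanCommand({
    TableName: "SupportTickets",
    ExclusiveStartKey: lastKey,
    FilterExpression:
      "#p IN (:p1, :p2) AND #s IN (:open, :pending) AND attribute_not_exists(IsUrgent)",
    ExpressionAttributeNames: { "#p": "Priority", "#s": "Status" },
    ExpressionAttributeValues: {
      ":p1": "P1", ":p2": "P2", ":open": "OPEN", ":pending": "PENDING",
    },
  }));

  for (const ticket of page.Items ?? []) {
    await doc.send(new UpdateCommand({
      TableName: "SupportTickets",
      Key: { TenantId: ticket.TenantId, TicketId: ticket.TicketId },
      UpdateExpression: "SET IsUrgent = :t",
      ExpressionAttributeValues: { ":t": true },
    }));
  }
  lastKey = page.LastEvaluatedKey;
} while (lastKey);
```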
That last one—changing existing data—is probably the hardest one because it requires that backfill. But I would also say this is a lot less common now that we have multi-attribute composite keys. When I initially made this example, I was going to apply it to that second example we had where we're adding a new assignee index. I was saying it used to be very common to create those synthetic keys where maybe I need to have my partition key be like tenant ID and assignee, just to make that unique across tenants. Now I'd have this new tenant assignee value, some sort of primary key that I need to add to my item for my secondary index. So now I need to go and decorate every single item in my existing table to handle this new tenant assignee index. That would be really annoying to do with those synthetic keys.
But now because we have multi-attribute composite keys, what I can do is say that for my partition key, the PK1 value is that tenant ID, the PK2 value is that assignee ID, and then my sort key value is that ticket ID. It handles all this for me. I don't have to do a backfill for these synthetic keys for my secondary indexes. This should really help prevent those backfills, and now you're into those easier worlds of application-only changes or having DynamoDB do that index backfill for you. That's a lot easier.
The main takeaways I would say from the schema evolution section are that certain things are going to be hard for DynamoDB no matter when you do them. If you find out about that use case down the road, it didn't matter if you knew that upfront. Complex filtering is going to be hard with DynamoDB, so you probably should think upfront: am I eventually going to need complex filtering type use cases? And if so, how am I going to handle that when I get there? Am I going to use an external system and am I going to do it myself with DynamoDB, or should I just choose a different database?
Otherwise, I think most types of evolutions are pretty straightforward and fit into one of those three buckets. There are certain types we didn't cover that are difficult. If you need to change your primary key, that is actually a hard one. You need to create a new table, you need to migrate, you need to dual write for a while and handle that sort of thing. I would say changing the primary key is pretty rare because the primary key is like the canonical way I'm going to identify my record and operate on it. Usually you don't see that changing a ton, so I would say it happens, but it's not super common.
Anti-Pattern Clinic: Common Mistakes and How to Fix Them
The more common one I see is if you need to combine or split items, like you're doing some sort of denormalization. That can be a harder migration to do, not impossible, but a little harder. I would say denormalization is good in some senses, but also, the more normalized your table is, the easier it is to make these sorts of changes. All right, we've got 12 minutes. Let's zip through the anti-pattern clinic here. These are things I see a lot when people ask me to look at their data models, and I think we can do better on them.
The biggest one is what I call the kitchen sink item collection, where you're throwing everything and the kitchen sink into a single item collection. An item collection is all the records that share a given partition key. Someone will come to me and say, here's my data model. I have this particular partition key which is for this given user. In this item collection, I have 44 different entity types. I have some payment methods for that user, I have the actual user record, I have their orders and I have their order items for all that sort of stuff. It can be okay to overload this, but what you want to do is you only want to put items together in the same item collection, give them the same partition key, if you're going to retrieve them together.
If you're going to use that query operation to make that one contiguous read and you need to fetch those different item types in a single request, that's when you do it. If somebody shows me this, I'll say, okay, do you have an access pattern where you need to fetch the user record, all their payment methods, all their orders, all their order items, all that stuff? And they say, no, no, I just get subsets of it at a different time, usually one item type at a time. In that case, you could split these up. This is the exact same thing functionally, but you don't have the complexity of putting it all into the same item collection. You have a payment methods item collection that are all stored together, and now you can list all their payment methods very easily. You can fetch that user record directly. Maybe you do have item collections that combine the order and the order items for you so you can fetch those easily in one single request, but you're doing that intentionally and strategically rather than throwing everything together in one item collection. So again, only put items together that are going to be retrieved together. The big reasons to group different item types together would be, number one, those items are frequently fetched together. You're basically pre-joining those records, that's the order and order items.
I want to fetch those in a single request rather than making two separate requests for that. That can be a valid reason for it. Another reason, almost the opposite reason, is that sometimes you might take a single entity and vertically shard it, separating it into two different items. This especially happens if you have a very fast-moving, frequently updated attribute sitting on an otherwise large item. You don't want to pay for updating that entire item every time, so you might split it into two different items; you can still query and fetch the whole thing together, but when you're doing updates, they're smaller, targeted updates.
One last reason is if you need stream processing with ordering across item types within some sort of partition key; you can also do an LSI with this. I think this is pretty advanced, so if you really think you need it, come talk to me sometime. Those are the reasons to group together different item types. So that's one anti-pattern: throwing everything into the same item collection. Related, or maybe the opposite of that, is creating an over-normalized model. You see the word "tables" and you think, "I know tables from the relational database world, I know tables in DynamoDB." But you shouldn't bring your exact model from a relational database to DynamoDB, right? You're likely going to have fewer tables in your DynamoDB schema than you would in a relational database.
What you want to do here is use sensible denormalization. Where can I duplicate some data? Where can I use embedding? How can I flatten that hierarchy so I don't need to functionally join four different tables to handle that query? How can I put those attributes together in a single item or a couple of items? But I will say, one table for all items is not necessarily correct. There's a lot of single-table stuff that I've talked about before, and you don't necessarily need to do it. It adds a lot of complexity, and there are blast radius issues. If you're doing stream processing, this can be hard when you have different types of entities. Even just configuration: all your TTL, your backups, your billing mode, global tables, all that stuff has to be the same within a single table, and you might have different needs for different entity types.
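As one small, hypothetical illustration of that kind of duplication and embedding, an order item might carry a snapshot of the product details it needs for display, so rendering the order never requires a second lookup. The attribute names here are made up for the example, not taken from the talk.

```python
# A denormalized order item: product name and price are copied onto the item
# at write time, so reading the order never needs a join to a products table.
order_item = {
    "orderId": "o-123",                 # hypothetical partition key
    "itemId": "i-1",                    # hypothetical sort key
    "productId": "p-987",
    "productName": "Espresso machine",  # duplicated from the product record
    "unitPrice": "249.00",              # price snapshot at order time
    "quantity": 1,
}
```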
So my general rule here is this: only put different entities together in the same item collection if you have at least one access pattern where you're retrieving them together. Order and order items, you want to fetch those all together to get all the information about an order, so put those in the same item collection, that's fine, but don't put payment methods in there. Given that, you should only put different entities into the same table if they need to be together in at least one item collection. So going back to our example, payment methods could be in a separate table, users could be in a separate table, and then keep orders and order items together in the same table. You're still going to get all the benefits from the ways DynamoDB has simplified things, with multi-attribute composite keys, on-demand billing, adaptive capacity, all that sort of stuff. You don't need to put a bunch of unrelated entities together in a single table. Only do that if you're actually fetching them together in at least one access pattern.
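As a rough sketch of what that pre-joined access pattern looks like, here is a single Query that returns an order and all of its order items from one item collection. The table name, key attribute names, and key shapes are assumptions for illustration, not the exact model from the talk.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key orderId, sort key itemId, with the order
# record itself stored under a well-known sort key value.
table = boto3.resource("dynamodb").Table("Orders")

def get_order_with_items(order_id: str) -> list[dict]:
    """Fetch the order record and all of its order items in one Query.

    Because both item types share the orderId partition key, one contiguous
    read returns the whole pre-joined item collection.
    """
    response = table.query(KeyConditionExpression=Key("orderId").eq(order_id))
    return response["Items"]
```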
Another thing I see a lot is when people hide the DynamoDB API. Sometimes they'll say to me, "I heard DynamoDB has consistent performance at any scale, but as our result sets get bigger, we're seeing slower operations. What's going on here?" It's almost always some sort of abstraction, a generic query-items or query-DynamoDB-items helper or something like that. With the Query operation, you can filter on the partition key and the sort key; the helper takes those as parameters, but then it adds in filters and limits, and basically what it's trying to give you is complex filtering on top of DynamoDB while hiding what's happening under the hood. Now what's happening is you have a really selective filter, you're not understanding how limits work, and it's doing a bunch of pagination through all of that to satisfy the request. So I don't think you want to hide the Query API like this. You want to actually be thinking about your data distribution, how you're going to filter data, what that pagination is going to look like, and how you're going to manage it, because this can lead to issues for folks.
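The subtle part is that Limit in a Query caps how many items DynamoDB reads before the filter is applied, not how many items match, so a selective filter can force many paginated round trips that all consume capacity. Here is a minimal sketch that keeps the pagination explicit; the table name and attribute names are hypothetical.

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("OrdersByUser")  # hypothetical: partition key userId

def find_refunded_orders(user_id: str) -> list[dict]:
    """Filter a user's orders by status, paginating until the collection is exhausted.

    Limit bounds the items *read* per page; the FilterExpression is applied
    afterwards, so a page can come back nearly empty and still cost the same RCUs.
    """
    items, kwargs = [], {
        "KeyConditionExpression": Key("userId").eq(user_id),
        "FilterExpression": Attr("status").eq("REFUNDED"),
        "Limit": 100,
    }
    while True:
        page = table.query(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```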
Relatedly, on the DynamoDB API: just not using the full DynamoDB API. If you have an increment operation, I see a ton of people who first read the entire item, increment some value in memory, and then write the entire item back. If you do this, now you need to manage versions so you're not getting race conditions; you're basically doing optimistic concurrency control, and it adds a bunch of overhead when you can just do an UpdateItem directly. You can do atomic updates to that counter record, you can increment that count, and you can have conditions on there to make sure it exists and all sorts of things like that. So you don't need to do that read-then-write to handle this sort of update. Counters are also a really fascinating area, and there's a great post from Jason and Chris about resource counters on DynamoDB. I would say counters are one hard problem, and that post shows seven different ways to handle it. It's also just a good way to think about idempotency generally in DynamoDB; a lot of those patterns work for whatever sort of idempotency requirements you have.
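For the increment case specifically, a single UpdateItem does the read-modify-write atomically on the server side, so there's no version to manage. A minimal sketch, assuming a hypothetical counters table and key name:

```python
import boto3

table = boto3.resource("dynamodb").Table("Counters")  # hypothetical: partition key counterId

def increment_view_count(counter_id: str, amount: int = 1) -> int:
    """Atomically increment a counter without reading the item first.

    The condition requires the counter item to already exist; drop it if the
    first increment should create the item instead.
    """
    response = table.update_item(
        Key={"counterId": counter_id},
        UpdateExpression="ADD viewCount :inc",
        ConditionExpression="attribute_exists(counterId)",
        ExpressionAttributeValues={":inc": amount},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["viewCount"])
```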
So definitely check out that counters post. I was going to talk about it, but it covers the topic so well that I don't think I need to. All right, so those are the big two anti-patterns I see. The last couple are smaller. Some people always use strongly consistent reads as a default. With DynamoDB, if you stick with the default eventually consistent read, you'll save 50% on your RCU consumption. I would say most of the time you don't need strongly consistent reads; the replication lag isn't that much, and usually it's not going to matter. Take advantage of that eventually consistent read discount.
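In boto3 terms, the cheaper eventually consistent read is already the default; you only pay the doubled RCU cost when you explicitly ask for strong consistency. The table and key names below are hypothetical.

```python
import boto3

table = boto3.resource("dynamodb").Table("Users")  # hypothetical: partition key userId

# Default: eventually consistent, half the RCU cost of a strongly consistent read.
user = table.get_item(Key={"userId": "123"}).get("Item")

# Opt in to strong consistency only where reading a slightly stale item would actually matter.
fresh_user = table.get_item(Key={"userId": "123"}, ConsistentRead=True).get("Item")
```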
Another one is having really large items and not thinking about how item size is inflating your costs. Think about how you can slim down your items so they're not quite so expensive. The last one is overusing TransactWriteItems. The transact API is really useful and valuable in a lot of ways, but it does add cost and latency, and if an item is involved in a lot of transactions, you get concurrency issues and transaction conflicts. So I always like to say: use transactions for low-volume but high-value operations. You don't want to be using them for every operation, but if you have something that's very high value, like adjusting a bank balance or handling a payment that way, that's low volume, high value. Don't use it for everything.
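As a sketch of that "low volume, high value" use, a transactional debit might decrement a balance only if sufficient funds exist while recording the ledger entry in the same all-or-nothing write. The table names, key attributes, and balance logic here are assumptions for illustration.

```python
import boto3

client = boto3.client("dynamodb")

def debit_account(account_id: str, txn_id: str, amount: int) -> None:
    """Debit a balance and record the ledger entry in one all-or-nothing write.

    The condition rejects the whole transaction if funds are insufficient, and
    keying the ledger on the transaction id makes retries idempotent.
    """
    client.transact_write_items(
        TransactItems=[
            {
                "Update": {
                    "TableName": "Accounts",  # hypothetical table
                    "Key": {"accountId": {"S": account_id}},
                    "UpdateExpression": "SET balance = balance - :amt",
                    "ConditionExpression": "balance >= :amt",
                    "ExpressionAttributeValues": {":amt": {"N": str(amount)}},
                }
            },
            {
                "Put": {
                    "TableName": "Ledger",  # hypothetical table
                    "Item": {
                        "txnId": {"S": txn_id},
                        "accountId": {"S": account_id},
                        "amount": {"N": str(amount)},
                    },
                    "ConditionExpression": "attribute_not_exists(txnId)",
                }
            },
        ]
    )
```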
Summary and Key Takeaways: Understanding Partitioning and Keeping It Simple
Alright, we made it. In terms of summary, this is what we covered today, all those different things. I have a few main takeaways. The biggest one is to make sure you really understand partitioning and the DynamoDB API. Don't hide the API, don't use PartiQL; understand what you're doing with the DynamoDB API and have a clear sense of what the performance profile is going to look like. Remember those data modeling meta goals, right? The keys are making sure that the data you write maintains its integrity, is valid, and can be used later, and also that it enables you to act on it quickly and get that consistent performance at any scale.
Know your domain and consider your trade-offs. In some sense you have to be the query planner: you have to know how you're going to access your data and arrange it accordingly. Finally, keep it simple. Don't overcomplicate it. Use these new multi-attribute composite primary keys. That's it. Thank you all for coming, I really appreciate it. Again, I'll be out there at the DynamoDB booth if you have any questions today. But yeah, thanks all for coming.
; This article is entirely auto-generated using Amazon Bedrock.