🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - A tale of two transactions (DAT455)
In this video, Marc Brooker from AWS explores database transaction isolation levels and their performance trade-offs. He compares Two-Phase Locking, Optimistic Concurrency Control, and MVCC implementations, demonstrating how Snapshot Isolation coordination scales with writes while serializability scales with reads. Using PostgreSQL and Aurora DSQL as examples, he shows that serializable isolation can be no more expensive than Snapshot Isolation for write-heavy workloads, but significantly impacts read-heavy patterns. He argues that strong consistency is achievable with high availability, challenging common CAP theorem interpretations, and presents latency comparisons across single-region and multi-region deployments, emphasizing that application developers find eventual consistency difficult to reason about.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Playing the Isolation Game with a Dog Adoption Database
Okay, hey, good morning. It's 9 a.m. on a Monday morning. Thank you all for joining me. I hope you're excited about the beginning of re:Invent this year. Good energy to get the week going. Cool. So today I'm going to be talking about transactions. Transactions are some of my favorite things, so I'll try and make this as interesting and exciting as possible.
So I'm Marc Brooker. I've been at AWS for 17 years, and I've worked on a bunch of systems, including EC2, EBS, Lambda, Aurora, DSQL, and now I work on the Agentic AI side, spending most of my time with the agent core team. I've worked on services of all sizes, from a couple of transactions a day to, let's say, nine digits of transactions per second. Throughout all of those designs, I found relational databases and transactions to be an extremely useful building block.
We're going to start off today playing a little bit of an isolation game. I'm going to ask you some questions and see how you feel about your ideal database, the database that you would like to use if you were an application developer rather than a database enthusiast. We're going to look at some of the classic algorithms that are used to implement database isolation, isolation between transactions. We're going to talk about serializability and snapshot isolation and the relationship between them, and maybe touch on some really interesting data and some surprising results. And then finally, I'm going to end talking about consistency, because I'm only really a pretend database person. I'm actually a distributed systems guy, and so I always want to talk about consistency.
So isolation levels are a trade-off. Isolation levels, you might have come across people on Twitter who are super opinionated that there is one best isolation level, or if you aren't serializable, you can't say ACID, or real transactions are only serializable, or if you're using anything stronger than read uncommitted, then you're being a fool. Those people are largely wrong. Isolation levels are a fairly complex trade-off between the amount of complexity we put on application programmers to build correct applications and the amount of complexity we put on our application programmers to achieve a certain level of performance and scalability.
And so we're going to play a little bit of a game. And this game is based around my adoption agency for dogs. And this adoption agency for dogs has the simplest possible database schema. Each dog is given a unique integer ID, and I have a flag column of whether that dog is adopted or not. Obviously, in the real world, that schema would be more complicated. Maybe I would have the dog's name, or breed, or whatever in there. But this is enough to highlight the points that I want to make about isolation.
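Sketched as SQL, that schema might look like the following; the column names and types are assumptions, since the talk only specifies an integer ID and an adopted flag.

```sql
-- Minimal sketch of the adoption agency schema described above.
CREATE TABLE dogs (
    id      integer PRIMARY KEY,
    adopted integer NOT NULL DEFAULT 0  -- 0 = available, 1 = adopted
);

-- A few dogs to play the isolation game with.
INSERT INTO dogs (id, adopted) VALUES (1, 0), (2, 0), (3, 0);
```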
The rules of this game are as follows. On each slide, there are two clients, two client threads, a blue thread and an orange thread, and they are going to run SQL statements in order where each one completes. We'll relax that a little bit later. And I'm going to ask you a question about whether you think a certain thing should happen, whether a certain behavior should be observable to generally the orange client. The database is, other than these two clients, quiescent. There's nothing else going on in the background, so nobody else can sneak in and change things. It's just these two transactions.
Exploring Database Isolation Through Concurrent Transaction Scenarios
And you will see that some of the statements are wrapped in explicit transactions with a beginning and end, and some are just lines. The just lines are auto-committed, and so that is a single statement that commits at the end of the statement. So here's my first question. Blue is going to update all dogs to set the adopted flag to one, then begin a new transaction and then update adopted to zero for dog ID one. Now orange is going to run and say select star from dogs where adopted equals zero. Put your hand up if you think orange should see any results in your ideal database, the database that you want to build applications against.
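Written out as SQL, that first question looks roughly like this; the statements are reconstructed from the description above, and blue's second update is still uncommitted when orange reads.

```sql
UPDATE dogs SET adopted = 1;               -- blue, auto-commit: every dog is now adopted
BEGIN;                                     -- blue
UPDATE dogs SET adopted = 0 WHERE id = 1;  -- blue, not yet committed

SELECT * FROM dogs WHERE adopted = 0;      -- orange: should this return any rows?
```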
Okay, well, here's the trade-off you've taken. If you said yes, then there are some adopted dogs that look unadopted to orange. Orange sees some dogs that have been adopted that appear to be unadopted, and that is going to lead to disappointed kids and families.
However, if you said no, what you mean is that some transactions should block, abort, or otherwise have to coordinate with other transactions. This is the beginning of our performance versus correctness trade-off.
Let's make an example that's slightly more interesting. Here I have two interleaved transactions. I begin both of them. They both read all the dogs. Select star from dogs, obviously not a best practice, but a fairly common one. They read the entire dogs table, maybe to present it to a customer in a UI. Blue goes to update the line for dog ID equals one, orange goes to update the line for dog ID equals one, and blue commits. Put up your hand if you think orange should be allowed to commit.
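A rough sketch of that interleaving, with each statement tagged by the session that issues it:

```sql
BEGIN;                                     -- blue
BEGIN;                                     -- orange
SELECT * FROM dogs;                        -- blue reads the whole table
SELECT * FROM dogs;                        -- orange reads the whole table
UPDATE dogs SET adopted = 1 WHERE id = 1;  -- blue
UPDATE dogs SET adopted = 1 WHERE id = 1;  -- orange: same row as blue
COMMIT;                                    -- blue
COMMIT;                                    -- orange: should this be allowed?
```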
Well, this one was uncontroversial. Yeah, a couple of believers here, a couple of believers in the weaker end of isolation. I'm going to push these plugs in. The slides are blinking on and off. And so if you said yes, you think this transaction should commit, then multiple families can adopt the same dog, right? This update could say, update the dogs, set me as the adoptee, and that is going to lead to disappointment. A dog is a single thing. We're not going to be able to split it in half and share it. We're just going to have to disappoint one of these customers who we said, your transaction committed.
If you said no, then you're against sad children. But again, you are choosing that there are some transactions that have to abort or have to block or can't run to completion. Let's tweak this example just a little bit. Here again, same beginnings. I'm reading all of the dogs in blue, I'm reading all of the dogs in orange. And then I'm going to update ID one in blue and update ID two in orange. Blue is going to commit. Put up your hand if you think orange should be allowed to commit. OK, that's a small majority of folks.
If you said orange is allowed to commit, then it becomes difficult to enforce cross-dog business logic in our database. For example, if we had a piece of business logic that said no customer is allowed to adopt more than one dog, I wouldn't be able to implement that purely with this schema, obviously because the schema is not wide enough, but even if I added a customer field to this particular table, two concurrent transactions could adopt a dog for the same customer at the same time and go on and commit.
If I say no, then I do have this power of implementing arbitrary business logic because I can look at those select results. I can implement whatever business logic I want in my application, and then I can have the transaction not commit if that logic is going to be violated. But if I do think that this transaction should not commit, right, if I think that orange should have been aborted or blocked there, then I'm only allowing one client at a time into this flow. They're both going to select all of the rows, and only one will ever be allowed to commit, no matter what they end up writing, or at least only one writer will be allowed to commit. And so I've reduced the effective concurrency of the system to one. There is no allowed concurrency going on here.
OK, last version of this game. Here we're not talking about isolation anymore. We've gone off the rules a little bit, and we're talking about consistency instead. And we're not talking about isolation because isolation is a question of what goes on when transactions are concurrent. There is no concurrency here. Blue is allowed to run to completion as an auto-commit transaction, and then orange starts. So I say, update dogs, set adopted equals one where ID equals one. And then I run another statement after that one has committed. And remember, nothing else is happening in the database. It says select star from dogs where ID equals one and adopted equals zero. Put up your hand if you think orange should be allowed to return any results.
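Sketched as SQL, with blue's auto-commit statement fully complete before orange starts:

```sql
UPDATE dogs SET adopted = 1 WHERE id = 1;         -- blue, auto-commit, finishes first

SELECT * FROM dogs WHERE id = 1 AND adopted = 0;  -- orange, runs strictly afterwards
-- Should this ever return a row? Against a single strongly consistent database, no.
-- Against a lagging read replica, it can.
```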
Not many people. Well, if I was building this application against a database replication tree with read replicas, there would be cases where orange would return results because orange would read against a read replica that had not heard the news of blue's transaction.
Now there are many ways to prevent that, but the naive building out of replication trees doesn't prevent it and does lead to this kind of what I'm going to boldly call weirdness.
The Trade-offs Between Strong Isolation and Performance
So that brings us to our most important question for the day. Is this just pedantry for nerds? Well, I mean, I don't mind. I'm happy to talk about such things, but I don't think so. I think this is actually a core idea that we need to get into the heads of more application programmers: the shape of the trade-offs that you're taking when you take on different database isolation levels.
Stronger isolation means that it is easier for an application programmer to achieve correctness because they don't have to think as much about the weird concurrency anomalies that can happen. However, it is harder for the application programmer to build applications that perform well under large amounts of traffic. And so you get lower concurrency and lower throughput, in a way that is very nonlinear and, as we will get to later in the talk, very dependent on the traffic patterns and the exact kinds of queries and transactions we're running.
Weak isolation is known to cause production bugs. We've seen that at AWS. I've seen that in my career, and there is fairly well-documented evidence of these kinds of issues. Strong isolation is also well known to cause performance issues in production. And so this isn't some kind of out-there claim that I'm making. Stronger isolation also makes it harder to achieve high availability, but maybe not in the way that most people are expecting. This non-linearity is sometimes quite surprising.
In the architecture of Aurora DSQL, for example, it would be more expensive for us to implement Read Uncommitted than it would be to implement Snapshot Isolation. Or at least the architecture avoids the Read Uncommitted anomalies by doing the cheapest possible thing. A lot of people will say, okay, well what I can do here is I'm going to use my database in a low isolation mode. I'm going to get the best concurrency, I'm going to get the best throughput, maybe I'm going to get the best latency, and I'm going to try and shim correctness at the application layer, maybe by sprinkling in explicit locks, maybe by sprinkling in SELECT ... FOR UPDATE, and so on.
And it turns out, at least in my experience, that for a lot of workloads, trying to do that correctness shimming at the application layer leads to a lower performance application and a worse outcome with more complexity and worse performance. And the reason for that is that if you show the database the workload it is trying to do, it can make the best decisions about how to execute that workload. If you show it only a part of the workload or some kind of obscure stream that is controlled by locks outside its view, it makes it much harder to make good decisions.
I also said it's harder to achieve high availability with strong isolation, and it is, but again, very nonlinearly. And so strict serializability is strictly no harder to achieve than Repeatable Read isolation from an availability perspective, and we'll talk about that a little bit more later.
Two-Phase Locking: A Classic Approach to Transaction Isolation
So let's dive in as we go down this path of understanding the shape of the trade-offs and talk about various classic techniques for implementing isolation inside database systems. And here, mostly I'm going to talk about the implementation of serializability, but as we get towards the end of this section, I'll also talk a little bit about Snapshot Isolation or what PostgreSQL calls Repeatable Read. For those who don't know, PostgreSQL is kind of lying to you. Its Repeatable Read level is actually Snapshot Isolation, not ANSI Repeatable Read, and that is the right thing: the PostgreSQL developers made the right choice, because ANSI Repeatable Read is a little bit silly.
Okay, so let's talk about making sure that this orange transaction doesn't see the uncommitted or to-be-rolled-back results of the blue transaction. And here I'm going to use a variant of the classic database isolation algorithm called Two-Phase Locking. This is an algorithm that has been around for, let's say, about 50 years. It was very well documented and had been well analyzed by the early 1980s.
Two-Phase Locking, or 2PL, is hardly breaking news. The idea is that as you read or write every row in the database, you take locks. You never release locks until you get to your commit or rollback stage, and then you release all of the locks at the end of the transaction.
Here, on my update statement, I take a write lock on this row for ID equals one. That write lock is an exclusive lock. It says nobody else is allowed to read or write this row while I hold this lock. Then orange comes along and tries to get a read lock on that row, which is a non-exclusive lock. But it's not allowed to because that row, one of the rows that it is trying to read, is locked for write, and so it blocks.
Now we're going to relax the rules of our game a little bit and say, okay, now orange is blocked, we're going to let blue have its turn again. Blue rolls back. Any changes it made to the database are now removed because of atomicity, and orange can now run. It doesn't see blue's effects because blue's effects at this point have been rolled back. So we have avoided orange seeing this weird result of reading uncommitted data by blocking orange until blue's run was complete.
Let's choose another example. Here both transactions read all of the rows and then update the same row. So blue reads ID 1, orange reads ID 1, read lock, read lock. Read locks are non-exclusive, so we're cool. Those can run. Now blue tries, and this is a little bit outside the 1970s 2PL, but blue tries to upgrade its read lock into a write lock. So actually, I'm planning on writing this row, so I'm going to go from locking it for read to locking it for write.
It can't do that because it is not the only lock holder on this row, and so it blocks. Again, we've relaxed the rules of our game to allow orange to run. Orange runs its update and tries the same lock upgrade, and now the database detects that blue is waiting on orange and that orange is waiting on blue. And this is a deadlock. No progress can be made here, and so now we have to decide what to do.
There are a whole bunch of classic approaches to deciding what to do. You can stop orange, you can stop blue, you can panic your entire database process. There are all kinds of fun things you can do here, but you cannot allow either of these to continue. The reason fundamentally you can't allow either of these to continue is that there is no serialization of these two transactions that works, because orange has not observed blue's writes and blue has not observed orange's writes, and so there is no ordering of these two transactions that you can put in.
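As an illustration, you can reproduce this lock-upgrade deadlock in PostgreSQL by asking for the read locks explicitly; PostgreSQL doesn't take 2PL-style read locks on plain SELECTs, so SELECT ... FOR SHARE stands in for them here.

```sql
BEGIN;                                      -- blue
SELECT * FROM dogs WHERE id = 1 FOR SHARE;  -- blue: shared (read) lock on the row

BEGIN;                                      -- orange
SELECT * FROM dogs WHERE id = 1 FOR SHARE;  -- orange: shared lock too, compatible so far

UPDATE dogs SET adopted = 1 WHERE id = 1;   -- blue: needs an exclusive lock, blocks on orange
UPDATE dogs SET adopted = 1 WHERE id = 1;   -- orange: needs an exclusive lock, blocks on blue
-- PostgreSQL's deadlock detector kicks in and aborts one of the two sessions:
-- ERROR:  deadlock detected
```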
Now if these were single-shot transactions, kind of DynamoDB style transactions, we'd be cool because I could read all of those and I can order them in whatever order and I can restart them. But SQL semantics are different, and we have to kind of assume that by the time that select has returned to the client, the client might have had arbitrary side effects based on those results. Again, that's a little bit controversial about whether you should be allowed to have arbitrary side effects based on uncommitted results. The SQL spec, I think, strictly says no, you shouldn't do that, but most people do. It's a super common thing to do, even when those side effects are things like logging.
Optimistic Concurrency Control and Its Surprising Behaviors
Here's an alternative way to implement this with Optimistic Concurrency Control. This is the other classic database isolation algorithm, also dating back to at least the early 80s; it was probably first described by Kung and Robinson in 1981. And so here, we do a different thing. We don't take any locks. Maybe we take some locks at commit time, and we can sort of talk about that later.
But here blue is allowed to run, begin. It does its update, but it doesn't actually change the database at update time. Instead, it takes out its notepad and it writes down, I am planning to update the database in this following way, and then it doesn't tell anyone about this. This is its secret notes. Then orange can run. It can read the row. The row has not been changed because blue has only written down its changes in its secret notepad, and orange is allowed to read the old version of the row.
Blue rolls back, which in this case is really cheap. It just throws out that page from its notepad and it's done, and these two transactions are both allowed to commit.
But now we can talk about a little bit of the weirdness of Optimistic Concurrency Control. So here, we've got a slightly different set of SQL. We're going to set adopted equals 0 for 2 rows in Blue. And now, Orange is going to read row number one.
Now Blue is going to commit. Success. Blue is allowed to commit at this point. Orange is now going to read row number 2. And now something very weird has happened. The first select statement from Orange has read the state of the database before Blue's transaction completed, and the second has read the state of the database after Blue's transaction has succeeded. This is a violation of serializability. It's a violation of the basic behavior of transactions.
And so if we are taking those select results and we're having side effects on them, we're telling them to a customer or we're writing them down or we're making a service call based on them, or we're putting them in another database, or many of those other things, now something very weird has happened. But it's okay because Orange is then going to say it's going to try and commit this transaction, and it's going to be told by the database, no, you can't commit this transaction because of a read-write conflict in the set of rows that you saw. And no, you're not allowed to commit this transaction, and so you shouldn't have done anything with those lies that I told you earlier. I'm sorry I told you those lies, but you shouldn't have done anything with them. And so we're good. But this is not really how people use databases. And you can say that they shouldn't do this, but this is a very, very weird behavior that often leads to bugs.
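A sketch of that interleaving, assuming a pure OCC engine without multiversioning; what each SELECT returns, and the conflict reported at commit, are noted in comments and will vary by engine.

```sql
BEGIN;                                           -- blue
UPDATE dogs SET adopted = 0 WHERE id IN (1, 2);  -- blue: buffered in its private write set only

BEGIN;                                           -- orange
SELECT * FROM dogs WHERE id = 1;                 -- orange: sees the old value (blue not committed)

COMMIT;                                          -- blue: validation passes, its writes are applied

SELECT * FROM dogs WHERE id = 2;                 -- orange: sees the new value (after blue's commit)
COMMIT;                                          -- orange: rejected at validation,
                                                 -- read-write conflict with blue
```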
Multiversioning: The Coolest Technique in Database Systems
So how do we fix this behavior? Well, we fix this behavior with what I think is the coolest technique in all of databases, or maybe all of systems, multiversioning. Instead of overwriting data with new versions as new versions come in, I keep all of the data, marked with what version it corresponds to, and I simply append to the end of my versions. Now, obviously, I can't do this because this is going to grow forever, but just humor me for a second that this is what I'm going to pretend to do.
And then every transaction that starts can come along and say, I started at 3, and so show me the best version of this row you have, the newest version of this row you have, that is from time 3 or earlier. And so what's cool about this is that we have fixed that anomaly that we had with OCC. We fixed that weird behavior. But we've achieved two other things. One of them is, we've allowed writers to continue to add versions to the end without blocking readers, so that's cool. And we've allowed readers to continue to read older versions without making writers block, and so that's cool too. And so by keeping multiple versions of a row, we have made this whole game substantially easier, substantially easier for the database implementer, and substantially easier for the application programmer to get to a level of performance.
And so this is what this transaction looks like in the MVCC world. We begin. Blue does its updates. Those write to the database a new version, or maybe not, we can choose. Then Orange selects, but Orange selects from an older version, or it picks a time to select from. But it doesn't see Blue's writes yet because they are not yet committed. Orange's second statement reads from the same version of the database as Orange's first statement, seeing a consistent copy of the world, and then we can go on to commit.
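You can watch this snapshot behavior directly in PostgreSQL at its REPEATABLE READ (that is, snapshot) level; a rough sketch of the two sessions:

```sql
BEGIN;                                     -- blue
UPDATE dogs SET adopted = 1 WHERE id = 1;  -- blue: writes a new row version, uncommitted

BEGIN ISOLATION LEVEL REPEATABLE READ;     -- orange
SELECT adopted FROM dogs WHERE id = 1;     -- orange: reads the old version

COMMIT;                                    -- blue

SELECT adopted FROM dogs WHERE id = 1;     -- orange: same snapshot, still the old version
COMMIT;                                    -- orange: read-only, commits fine
```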
So in all of these, the pattern here is that some coordination, whether it is locking or checking at commit time whether there are anomalies, some coordination is needed to achieve isolation. These two clients, Orange and Blue, need to go somewhere, either statement by statement or once at commit time, where everybody goes to the same place for the same row and says, am I allowed to commit this transaction that reads or writes these rows. Coordination limits throughput and coordination increases latency. Coordination is probably the only real hard problem in distributed systems. Now, I know there are a lot of really hard practical problems building large-scale systems, but coordination is the single hard theoretical problem.
And to build scalable and reliable systems, we need to reduce coordination as far as we can. Unfortunately, coordination cannot be reduced to zero if we are going to achieve any level of reasonable transaction isolation or consistency. It can be moved around to statement by statement, to the end of transactions, to different locks and services and so on, or into the application, but it can never be entirely avoided.
Snapshot Isolation vs. Serializability: Understanding the Key Differences
Let's zoom in one more step and talk about Snapshot Isolation versus serializability. What is Snapshot Isolation? Again, this is the isolation level that PostgreSQL calls Repeatable Read, and it's my personal favorite isolation level. I'm sure you have a personal favorite isolation level. What does Snapshot Isolation give us? Under Snapshot Isolation, we never read uncommitted data, and so we never see partially committed transactions. Reads are repeatable. If we read the same row or same predicate multiple times, we get the same results every time, unless of course we've done some writes ourselves, in which case our writes need to be merged in.
Reads all come from a single point in logical time, and you can think of the logical time as the kind of stream of transactions in the database. All of the reads come from a point in that stream corresponding to a single point in logical time. And conflicting writes are rejected. And so if one transaction writes row one and commits, and another concurrent transaction writes row one and tries to commit, it must be rejected. Snapshot Isolation is a fairly strong isolation level, but it is not serializable for a reason that we will see in a minute.
So how does Snapshot Isolation answer our earlier questions? No, we don't see any results in orange, because those results are uncommitted and we never read uncommitted data. No, we do not allow this transaction to commit, because there is a write conflict here. Both of these concurrent transactions are trying to write the same rows in the database, and orange is not allowed to commit at this point.
What about this case? Again, similar to the previous one, one and two instead of one and one. Under Snapshot Isolation, this is allowed to commit. There is no write-write conflict here. The write sets of these two transactions are disjoint, and so this transaction is allowed to go ahead and commit as long as the underlying database has preserved the rest of the properties that we talked about, that repeatable read property, that single point in logical time property, and so on.
Now, this is not allowed under serializable. These two transactions are not serializable. And so at a serializable isolation level, the database has no choice but to reject one of these two transactions. And so in Optimistic Concurrency Control, what that would look like is orange would be forced to abort. Why is orange forced to abort? Well, at commit time we check whether any of the rows that orange has read have been written to since orange started, and if they have, that's a read-write conflict. And we have to abort the transaction because there is a read-write conflict here. Orange read row number one. Blue wrote row number one. Blue committed first, and so orange has a read-write conflict with blue and must be aborted.
And this is what it looks like in open-source PostgreSQL. We get to that commit point of orange, and orange's commit is rejected, saying error could not serialize access. This can actually happen in PostgreSQL earlier in the transaction if I was doing some more stuff with orange. It doesn't necessarily have to happen at commit time, but here it happens at commit time.
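A sketch of that scenario against PostgreSQL's serializable level; the error text shown is PostgreSQL's, and as noted above it may also be raised earlier in the transaction rather than at commit.

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;        -- blue
BEGIN ISOLATION LEVEL SERIALIZABLE;        -- orange
SELECT * FROM dogs;                        -- blue reads everything
SELECT * FROM dogs;                        -- orange reads everything
UPDATE dogs SET adopted = 1 WHERE id = 1;  -- blue
UPDATE dogs SET adopted = 1 WHERE id = 2;  -- orange: different row, no blocking
COMMIT;                                    -- blue: succeeds
COMMIT;                                    -- orange: rejected
-- ERROR:  could not serialize access due to read/write dependencies among transactions
```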
And so what's important about the difference between these two isolation levels and what we pay for with a slight relaxation in serializability? The difference between these isolation levels, the core difference, is that serializability coordination scales with reads, while Snapshot Isolation coordination only scales with writes.
It doesn't matter to the decision whether to commit a snapshot transaction what it has read during its lifetime. However, it does matter to the decision whether to commit a serializable transaction what it has read in its lifetime. By choosing serializability, we are pushing onto the application programmer the need to understand this and the need to minimize how many reads their transactions do and make those transactions as specific as possible in what they read.
What we've done, by choosing serializability, is push complexity into the application. This transaction that reads the entire table each time is absolutely disastrous for throughput under serializability. But other than the larger amount of data movement, which may or may not be a problem, it is not a problem for throughput under snapshot isolation. With serializability, we have pushed an implementation detail of the internals of the database, an implementation detail of isolation, into the heads of application programmers.
Micro Benchmarks Reveal Nonlinear Performance Patterns Across Isolation Levels
Let's run a little micro benchmark and see what these results look like. I'm going to preface this with saying that all micro benchmarks are bunk and should be taken with a major grain of salt. The only thing that matters is your workload and your workload patterns. Having said that, here is my micro benchmark, and I'm going to make bold claims based on what I see here.
Here I have the throughput of two transactions contending. These transactions each write two rows and read N rows out of a database of 10,000 rows. Reading 100 rows is about reading 1% of the database. I'm running these transactions over and over. In fact, I'm running four concurrently here at three different isolation levels against open source PostgreSQL.
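The benchmark harness itself isn't shown in the talk, but each benchmarked transaction is roughly the following shape; the specific row ranges here are illustrative assumptions, and in the real benchmark the rows would presumably be chosen at random.

```sql
-- Read N rows (here N = 100, about 1% of a 10,000-row table), then write 2 rows.
BEGIN ISOLATION LEVEL SERIALIZABLE;  -- or REPEATABLE READ, or READ COMMITTED
SELECT * FROM dogs WHERE id BETWEEN 1 AND 100;
UPDATE dogs SET adopted = 1 WHERE id = 4001;
UPDATE dogs SET adopted = 0 WHERE id = 7002;
COMMIT;
```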
Here we see that read committed and repeatable read perform about the same. They're doing the same amount of work and moving the data around in approximately the same way. But serializable has this huge drop in throughput as conflicts start to happen. I have a theory for why this cliff is so cliffy at 32, but I'm not sure it's right, so I'm not going to share it. These results are repeatable across multiple versions of PostgreSQL, MySQL, and various other database engines.
Here's a different view. Here, instead of doing reads and writes, every transaction just does N writes and has no reads. This is a pure primary key update workload, a workload that is only doing updates on primary keys. What we're comparing here is repeatable read and serializable. In the previous example, repeatable read, that is snapshot, was way faster than serializable.
In the pure updates example, snapshot is no faster than serializable. Again, this is a result that is repeatable across the snapshot levels or snapshot equivalent levels of a lot of database engines. Why is this happening? The reason is from the database's perspective, the transactions that need to abort are the same transactions.
At the snapshot level, they are all writes, so they are write-write conflicts and they need to be aborted. At the serializable level, there are read-write conflicts because the read set and the write set are the same, since every update on a primary key is a read-modify-write. This is what I mean when I say that the performance of different isolation levels is highly nonlinear and very workload dependent.
You might say serializable is really expensive, but it turns out that serializable is no more expensive than repeatable read for this particular example. Or you might say repeatable read is expensive, but it turns out that repeatable read is no more expensive than read committed in this example. It is entirely dependent on workload patterns and requires an understanding of workload patterns to predict performance.
That is a problem for database people trying to explain to application programmers how to write great high performance database code. This is the core problem of understanding the relationship between isolation, correctness, and performance.
How PostgreSQL and Aurora DSQL Implement Snapshot Isolation
Now let's zoom in a little bit more and talk about how PostgreSQL and Aurora DSQL implement snapshot isolation. I'll preface this by saying that I'm not a super deep PostgreSQL expert. I did read a bunch of the code for this talk, and so I think I'm getting this right, but I'll be happy to be corrected if I'm wrong.
So let's jump into the PostgreSQL documentation and see what it says about its repeatable read isolation level, which again is actually a snapshot isolation level. A transaction cannot modify rows changed by other transactions after the transaction began. That is our write-write conflict property for snapshot isolation. A query in a repeatable read transaction sees a snapshot as of the start of the first statement in this transaction. That is our single point in logical time property of snapshot isolation. And then read-only transactions will never have conflicts. This is a very cool property, by the way, of snapshot isolation. And this is true because there are never write-write conflicts in a read-only transaction. A read-only transaction cannot conflict with another transaction. There are weird anomalies that can happen if you observe read-write transactions in various orders, which I'm not going to go into in this talk, but this really helps performance in snapshot isolated databases.
So let's talk about the kind of internal rules here. I'm creating transaction number 4 at T equals 4. I should see the results of transaction 1. Transaction 1 is committed at the time that transaction 4 runs its first statement. Transaction 2 is running. It might have done a bunch of writes to the database. It might have tried to change a bunch of things, but it has not committed yet at my start point, and so I must not see its results. Similarly for transaction 3, even if it goes on to commit during the runtime of my transaction, and then transaction 5 is from the future and I don't see transactions from the future, because seeing transactions from the future is just too weird.
So how do we implement this set of rules in Aurora DSQL? Under the covers, oversimplifying a little bit but not too much, each row in the database is backed by a linked list of versions. So here I say select star from seats where ID equals 1. I see a version for T equals 5. I ignore it because I'm reading at T equals 4. I go back one more step down the list. I see a version for T equals 3. That is the latest version that is older than my version, so that one I read and I'm done. For ID equals 2, I see a version of T equals 5, which I ignore, but then the list runs out, and so there is no older version, and so that row just doesn't exist. And then for ID equals 3, the list only starts at T equals 2, and so I don't ignore anything. I just pick that old version and I read that one.
DSQL also has a high water mark that we use for consistency. So if that high water mark is at 3 and I'm reading at 4, I have to block this transaction until that high water mark catches up, that is, until I know I've seen every write that came in at 4 or earlier. And there is a low water mark that we use for garbage collection, which is the moral equivalent of PostgreSQL's vacuum. And we do that based on a very simple low water mark for reasons that I will talk about in a little bit.
PostgreSQL's approach is much, much more coordinated than this. Instead of being able to do this locally down at the storage level with the committed versions in the storage engine, it actually has to pay attention to the whole world of currently running transactions. And so it establishes a lowest transaction ID below which it can see all of the committed versions, a highest transaction ID at or above which it ignores all of the versions, and then it has to build this list of currently running transactions in between and decide what to ignore and what not to ignore. This is a huge amount of coordination which you wouldn't be able to get away with in the distributed system setting. But in the single machine setting where you have shared memory, it is not a major problem and really simplifies things and allows for a more efficient storage implementation. Yeah, so then let's talk about the way this shows up in SQL semantics. And so here again,
this is an old version of the slide. I'm selecting star from my dogs. I had a previous version of this talk that was about a stadium booking system, which is much more boring. I'm selecting everything, and then I'm trying to update the same row twice again. So what happens in PostgreSQL? Well, the orange one blocks. At that point, it's not allowed to update that row because there is a write lock on that row. Blue goes on to commit, and orange, most of the time synchronously, gets an error saying could not serialize access due to concurrent update. So orange is then aborted and has to try its work again.
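A sketch of that sequence at PostgreSQL's REPEATABLE READ level; the blocked update in orange returns the error as soon as blue commits.

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;     -- blue
BEGIN ISOLATION LEVEL REPEATABLE READ;     -- orange
SELECT * FROM dogs;                        -- blue
SELECT * FROM dogs;                        -- orange
UPDATE dogs SET adopted = 1 WHERE id = 1;  -- blue: takes the row lock
UPDATE dogs SET adopted = 1 WHERE id = 1;  -- orange: blocks on blue's lock...
COMMIT;                                    -- blue commits, and orange's blocked UPDATE fails:
-- ERROR:  could not serialize access due to concurrent update
```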
What does DSQL do? Well, DSQL takes a much more optimistic approach, where it doesn't do any of the coordination necessary to determine whether a transaction is going to succeed until commit time. Instead of getting that synchronous no, this transaction is doomed, don't bother going on, you have to keep running that transaction till you get to the commit. The commit will be rejected because we check for those write-write conflicts, and here we are trying to write the same row twice. This is doing OCC validation at commit time.
So in both approaches, there's no need for read locks because we just create new versions, which is pretty cool. We don't need to block writers on readers, and we don't need to block readers on writers, which is pretty cool. The big advantage of DSQL's approach is that clients can't control lock hold times, and I'll talk about that in a second. There is no need for long-lived locks or going to find the one copy of the lock in the system at all. It scales much better and has better operational properties.
PostgreSQL's approach is nice in that you get earlier knowledge that your transaction is doomed and it's not going to commit. So you do less wasted work on the client's side for transactions that are known to not be able to commit. PostgreSQL's approach also has no transaction runtime limit, so you can let transactions run for essentially an unbounded amount of time, typically until your server falls over. That is a real advantage. PostgreSQL's approach is also a little bit more on-disk efficient and saves a little bit of storage.
Why DSQL Chose OCC Over Blocking: Operational Advantages and Latency Savings
So what is the downside of blocking and why did we make this decision in DSQL? Well, because I've seen this case a ton of times: there are two transactions, they both do a bunch of reads, and they have to block on each other. Actually, no, the reads are irrelevant here. Blue updates a row. It takes a write lock, and so orange has to block, and then blue goes out to lunch, right? This could be a garbage collection pause. It could be a period of high packet loss in the network. It could be an operator literally going out to lunch, which is a postmortem I have read more than once.
So this is a major operational problem where you are allowing clients, the least trusted part of your system, applications that you don't have control over necessarily, to control the performance of other clients in the system. This is a real practical problem in all kinds of systems, even at relatively modest scale, and it's an even bigger practical problem when you allow human operators, because human operators do things like going out to lunch. When we were designing DSQL, this was one of the things that we were trying to avoid, and one of the reasons that we chose the combination of multiversioning and OCC, that means that transactions never block on other transactions, and clients can't block other clients no matter what happens.
The other big advantage of DSQL's approach is that it allows us to keep work in the Availability Zone until commit time. We'll get into the data behind this in a minute. In DSQL, when I do a select, most of the time, I go to a query processor and a storage replica in the same data center as my client application. That is a major latency savings. It's an even bigger latency saving in the multi-region setting, where regions could be tens to hundreds of milliseconds apart.
I go to a replica in my region. And then what's more, when I get down to write time, my writes also go to the query processor, which keeps them in a sort of secret notebook in the Availability Zone. Again, a big latency saving. And only when I get to the commit do I have to go cross-AZ in single-region mode or cross-region in multi-region mode. And this is a major saving of performance, a major saving in latency for any non-trivial size of transaction.
PostgreSQL's picture isn't as good. And so you can stay in AZ if you are doing read-only transactions at the cost in most architectures of eventual consistency and the weird anomalies that come with that. But in read-write transactions, you typically have to schlep across the network to the AZ where the primary is to do your reads and writes. And depending on the storage implementation, those writes might have to go out over the network to a storage layer. Most don't though. But I do have to get from the client to the primary of the database for these transactions, which costs additional latency, especially in the multi-region active-active mode, but even in reasonable sync multi-AZ architectures.
Strong Snapshot Isolation, Consistency, and the Real Trade-offs in Distributed Systems
Let's go into one more topic in our last ten minutes or so here: strong snapshot isolation. The definition of snapshot isolation, the kind of academic definition of snapshot isolation, allows this to happen. And it mostly allows this to happen because the people who were writing those definitions were thinking about single-system databases, where this is a silly thing to allow your database to do. And so when I said PostgreSQL repeatable read is snapshot isolation, I mean it is strong snapshot isolation that does not allow this kind of thing when you are reading against the primary. DSQL's snapshot isolation level is also strong snapshot isolation. And the strong there means strongly consistent, not in the ACID sense, but in the distributed system linearizability sense. Weird things like this cannot happen.
Why can't they happen in PostgreSQL when you're reading against the primary? Because Blue's transaction ID is marked as committed before Orange starts and is therefore included in the set of data that Orange reads. And therefore we get strong consistency. And this, I think, is considered so obvious to database people that it is not really even mentioned in the documentation all that much.
Now again, when I get to read replicas, now I've got a problem because this read, Orange's read, is just a read. It could be routed by my middleware of whatever kind, or my smart client, to a read replica that hasn't yet heard the news of Blue committing. And so this can happen. And so if I have read replicas in a normal PostgreSQL setup, I do not get strong snapshot isolation. I just get a weaker version of snapshot isolation that doesn't offer strong consistency. It can still be the right choice, but it does cause a set of anomalies that your application programmers now need to worry about.
In DSQL, because we use high quality physical clocks, and if you're interested in the implementation details for this, you can check out my talk from re:Invent last year or Mark Bose's talk from re:Invent this year where we go into this in detail. But in DSQL, because we use high quality physical clocks with very well controlled bounds, Orange's snapshot time is greater than Blue's commit time. And so Orange will always see Blue's writes, and so this is a very similar version to the kind of core PostgreSQL version, but is in the distributed setting rather than in the single machine setting.
And so we get back to our question: is this just pedantry for nerds? Right? We said that isolation is not pedantry for nerds, but what about consistency? Is this something that I need to worry about in reality? I think it is. Because what I have observed working, building applications and working with teams that build applications against databases, is that application developers find eventual consistency very hard to reason about and find it very hard to write correct and not surprising application code against eventually consistent databases. I have also noticed that the customers of these systems find eventually consistent behavior very, very weird.
I've created a resource, I get a successful creation, I go to describe that resource, and I'm told that resource has never existed. Well, obviously that is a very surprising behavior, and it makes building things like automation against APIs that behave this way very difficult. You have to be super defensive. It also pushes complexity to the API design because you have to provide tools for clients to be very defensive.
It also pushes performance problems and metastability problems into our applications because people build retry loops to deal with this eventually consistent behavior. And as soon as you have a retry loop, you have a loop that increases the amount of work that the system is doing during periods of overload, and therefore makes that overload worse and makes systems worse. So I am a huge fan of strong consistency. It is not the right choice in every setting. There are settings like mobile and IoT and places that are hard to connect where weak consistency is the right choice, but I believe that for most systems built in the cloud, strong consistency is the right choice for applications built on databases.
I'm also going to say, perhaps boldly, perhaps controversially, shouldn't be controversial, that database systems can be highly available and strongly consistent. You might have heard otherwise. You might have heard people claim that this is not possible, but it very much is possible. If we have a database and an application spread, for example, across three availability zones, and one of those availability zones is partitioned off from the network, there is absolutely no reason that the majority partition, the two that are still working, cannot continue to be strongly consistent and highly available to their clients.
And this is the most common failure case for cloud deployed and data center deployed applications. The most common failure case is that one of the data centers drops off the network in a way that not only makes it unavailable, but also makes the application, the part of the application that's running in that AZ unavailable to clients too. And so you may as well just be unavailable in that partition. There are reasons not to make that choice, but for the vast majority of applications, I believe that it is the right choice. And I've provided some citations here for those who don't believe me about this.
I'll also say pretty boldly: you hear the sort of iron-triangle version of the CAP theorem, you know, consistency, availability, partition tolerance, pick any two. That is bunk. It is just not true. It is not a trade-off that exists in distributed systems. And if people cast it that way, they are very confused about the shape of the trade-off. Partitions always exist in any non-trivially sized system, and so we can't avoid having them.
We do need to choose between consistency and full availability, which is this weird definition of A that isn't the definition that's in your head: if you read the paper with the proof of the CAP theorem, A means every client is able to access the database, which I think is honestly just a silly way to define availability. But we can have what I believe is common-sense availability to the majority of clients, and consistency, even in the face of the vast majority of patterns of real-world network partitions that happen on networks as they are built.
The trade-off that is much more interesting and much more important, and I wish people would pay more attention to, is that strong consistency requires additional latency, if only on writes. Your two options here are to go to a primary, go find the primary, find the one copy, or to make multiple replicas. You can do this asynchronously and make readers wait, or you can do it synchronously and make writers wait and not make readers wait. But somebody ends up having to wait.
So we can look at the performance of our databases. And here it's important to say, what I'm measuring in this table is inter-data center round trip time. So these are not milliseconds. They are just the numbers of times that we have to cross between data centers. And that data center crossing takes one to two milliseconds. So depending on your workload, this can be a big deal or not. So in DSQL's design, you get strong read consistency and strong write consistency.
You don't have to cross data centers for reads or writes. And at commit time, you have to pay on average 1.67 data center crossings. That 1.67 comes from the fact that one third of the time you don't have to cross a data center to reach what we call the adjudicator, which is this kind of lock server in the system, so that's 0.67 on average, and then the remaining 1.0 comes from having to replicate.
PostgreSQL with a primary, if we go to the primary every time, then we get strong reads and we get strong writes. 0.67 RTTs for reads and writes, because two thirds of the time in my three Availability Zone setting I have to go to a different data center. And then 1.67 at commit time when I have to log the writes. Now I can start doing eventual consistency stuff and get local reads at the cost of consistency and the benefit of my reads going down to zero RTTs.
DynamoDB, if I choose strong consistency, and this by the way is one of my favorite things about DynamoDB, is that it has this really selectable consistency mode. But I almost always choose strong consistency with DynamoDB. Read RTTs are 0.67, and writes and transactions are 1.67. Now when we get into the multi-region setting, this gets a little bit more pressing because the round trip time goes up from one or two milliseconds to 20 milliseconds, 100 milliseconds, 200 milliseconds, depending on where in the world your workload is running.
And here, if we look at DSQL, for example, where again reads are local and writes are local, we're not crossing between regions to do those things, so that's great. But at commit time, I have to pay 1.5 or sometimes 2, because we haven't finished an optimization, round trips, which can slow things down. And so if I compare that to the last row of the table, which is DynamoDB Global Tables, DynamoDB Global Tables don't have to cross for reads and don't have to cross for writes. That's great. I still have to cross Availability Zones, but I don't have to cross regions.
But the cost I have taken on is eventual consistency for my reads and post-commit merging of writes, which might or might not be surprising depending on the behavior you're looking at from your application. And so this is the shape of the consistency trade-off that I think is much more interesting, and application programmers, database people, and distributed systems people should be paying a lot more attention to this kind of thinking about latency as you think about how to build out systems that span multiple regions. Thank you very much.
; This article is entirely auto-generated using Amazon Bedrock.