AWS re:Invent 2025 - Deep dive into Amazon Aurora DSQL and its architecture (DAT439)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Deep dive into Amazon Aurora DSQL and its architecture (DAT439)

In this video, Marc from the Aurora DSQL team provides a deep dive into DSQL's connection management and query processor architecture. He explains how DSQL combines relational database capabilities with distributed architecture, eliminating single points of failure through active-active design. The session covers DSQL's use of Firecracker microVMs for hardware-level isolation, Server Name Indication (SNI) for cluster routing, and the Relay service for secure connection handling. Marc details how DSQL achieves strong consistency using time-based synchronization with the EC2 time sync service, implements automatic scaling through storage cloning and sharding, and provides activity-based pricing with DPUs. The architecture enables 10,000 default connections per cluster, availability zone-local routing for low latency, and eliminates the need for connection poolers like PgBouncer, offering a truly serverless PostgreSQL experience.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction to Aurora DSQL: Bridging Relational and Distributed Databases

Hello and welcome everyone. My name is Marc, and I'm an engineer on the DSQL team. This is DAT439, a deep dive into Amazon Aurora DSQL and its architecture. Last year, my colleague Marc Brooker gave a fantastic talk with the same title as this one, where we gave a broad overview of the service and its capabilities. There is a lot going on under the covers of DSQL, so this year we have a series of deep dive talks where we go much deeper into specific areas of the architecture.

Thumbnail 0

In this talk, we will be discussing how DSQL manages connections, what the architecture of the query processor looks like, and what the architecture of the session routing layer in front of it looks like. I hope you enjoy the talk. But first, let's do a quick recap of what Aurora DSQL is.

Thumbnail 40

Thumbnail 50

DSQL is a distributed implementation of a relational database. Relational databases are awesome and older than I am. You can run complex queries, evolve your schema, and add indexes on the fly. Meanwhile, we have services like DynamoDB, and we have been figuring out how to build and run distributed services for quite a long time at this point. The really cool thing about architectures like DynamoDB is that they do not have a single point of failure and they can scale horizontally rather than vertically.

Thumbnail 100

DSQL is based on PostgreSQL, and we are running PostgreSQL under the covers. It is a database that has been designed for running transactional workloads. Our customers have been asking us to bridge these two worlds: the world of relational databases and the world of distributed databases. At the same time, our customers have been asking us for a fully serverless database, one where there are no services to provision, manage, or patch. A service where you can pay per use and a service that scales up but also all the way down.

Thumbnail 110

Creating and Connecting to DSQL Clusters: A Simplified Experience

This is what it looks like to create a DSQL cluster. I am really proud of this page, and this is something that we intend to keep as simple as it is. There is no scroll bar here. You can click create cluster, and in just five seconds, which is something we recently released, you can have a cluster ready to go. There is nothing on this page that you have to decide upfront. There are no maintenance windows and no VPC settings. Everything on here is something you can change later: tags, deletion protection, re-encryption of your cluster with a different key, and toggling resource-based policies. So if you want, you can just click and off you go.
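
As a rough illustration (not shown in the talk), creating a cluster programmatically looks equally minimal. The boto3 client method and parameter names below are assumptions based on the public API reference, so verify them against your SDK version.

```python
import boto3

# Hedged sketch: assumes the boto3 "dsql" client and camelCase parameter
# names from the public CreateCluster API reference; verify against your SDK.
dsql = boto3.client("dsql", region_name="us-east-1")

cluster = dsql.create_cluster(
    deletionProtectionEnabled=True,           # changeable later, just like in the console
    tags={"team": "payments", "env": "dev"},  # tags are also changeable later
)

print(cluster["identifier"], cluster["status"])
# The cluster endpoint then follows the pattern <identifier>.dsql.<region>.on.aws
```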

Thumbnail 150

Here is another feature our team released recently: the query editor. What is really cool about this is that this is a PostgreSQL application running in your web browser, speaking WebSockets directly to DSQL. Because DSQL is based on PostgreSQL, our expectation is that you can pick up the vast majority of PostgreSQL-based libraries, whether that is libpq-based, JDBC, or anything like that, and connect to DSQL.
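
For example, here is a hedged sketch of connecting from Python with psycopg. The token helper name and argument order are assumptions based on the DSQL SDK documentation and may differ by SDK version.

```python
import boto3
import psycopg  # any libpq-based driver is expected to work the same way

region = "us-east-1"
host = "yeabulo5gfu4gaqha72r4qq3qu.dsql.us-east-1.on.aws"  # your cluster endpoint

# DSQL uses short-lived IAM auth tokens instead of passwords (covered later in
# the talk). Helper name and argument order are assumptions; check your SDK.
token = boto3.client("dsql", region_name=region).generate_db_connect_admin_auth_token(
    host, region
)

with psycopg.connect(
    host=host, user="admin", dbname="postgres", password=token,
    sslmode="require",  # TLS carries the SNI that routes you to your cluster
) as conn:
    print(conn.execute("SELECT 1").fetchone())
```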

Thumbnail 180

DSQL Architecture Overview: Query Processors, Journal, and Storage

So how does this all work? We have an application, maybe that is a shell or something that you built on Lambda, connecting to DSQL. We are sending SQL statements to a component called the query processor. This is going to be the star of our show today. The query processor is receiving the PostgreSQL wire protocol and SQL from your applications, and it is interpreting and running those queries.

Thumbnail 200

The second key component is the journal. The journal is our distributed transaction log and a core building block at AWS; it powers services like S3. The journal is where we get our durability guarantee. In a single-region configuration, when the journal accepts a transaction, it is writing it to at least two availability zones. In a multi-region configuration, our transactions get replicated to at least one other AWS region.

Thumbnail 230

The third key component is storage. Storage is a service that our team built from the ground up, designed to meet the performance needs that we had for DSQL. The way storage fits into this picture is that it connects to the journal, learns about transactions that you have committed, and takes those transactions and updates its local view of the world so that we can run queries.

Thumbnail 250

When we send a SQL select statement to DSQL, it is the query processor's job to parse, plan, and execute that query and turn that into a series of low-level read operations against storage. Meanwhile, if you are running any DML like an insert statement, any rows that you insert into DSQL are going to reside in the memory of the query processor. Nothing has been written to the disk.

Thumbnail 270

If you run an update statement, the query processor is first going to turn that into a series of select statements to go and read the existing values of those rows and then apply anything in the update statement like a set clause or incrementing a balance by one to produce new versions of those rows in memory. As you run your transaction, the query processor is doing all of this work ephemerally in memory. If your connection breaks or if you type rollback, that data is simply gone. Meanwhile, if you commit your transaction, the query processor is going to send that transaction over the network.
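
To make that concrete, here is a hedged psycopg sketch (the table and connection string are placeholders): until commit, the work below exists only in the query processor's memory.

```python
import psycopg

# Placeholder connection string; with DSQL the password is an IAM auth token.
conninfo = "host=<cluster-endpoint> user=admin dbname=postgres password=<token> sslmode=require"

with psycopg.connect(conninfo) as conn:
    with conn.cursor() as cur:
        # These new row versions live only in the query processor's memory;
        # nothing has been written to disk yet.
        cur.execute("INSERT INTO accounts (id, balance) VALUES (%s, %s)", (1, 100))
        cur.execute("UPDATE accounts SET balance = balance + 1 WHERE id = %s", (1,))
        # conn.rollback() here, or a dropped connection, simply discards the work.
    # Leaving the block commits: the transaction is sent over the network to be
    # checked for conflicts and written to the journal.
```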

Thumbnail 310

Eventually it will be written to the journal, again to at least two availability zones. This is the point of commit in our system. Storage is subscribed to the journal, learning about these transactions and keeping itself up to date.

Thumbnail 330

You may be wondering about consistency. DSQL is strongly consistent, and we achieve this through a time-based synchronization protocol. When your transaction starts, the Query Processor does two things. First, it uses the EC2 time sync service, which is based on GPS satellite clocks, to get a microsecond-accurate reading of the current time. Second, it uses the AWS ClockBound library to correct even that very accurate time measurement for any potential error. By doing this, we get what is called a linearizable timestamp, a value guaranteed to be in the future of any commit timestamps. The Query Processor includes that timestamp in requests to storage, guaranteeing that storage has always seen the latest data.
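
Conceptually (this is an illustration, not the real ClockBound API and not DSQL code), the trick is that a bounded-error clock lets the query processor pick a timestamp that is safely ahead of every commit it needs to observe.

```python
import time
from dataclasses import dataclass

# Conceptual sketch only: a bounded-error clock reports "now" plus an error
# bound, so the query processor can choose a timestamp at or after every
# commit it must observe.

@dataclass
class BoundedTime:
    earliest_us: int  # true time is guaranteed to be at or after this
    latest_us: int    # true time is guaranteed to be at or before this

def read_clock() -> BoundedTime:
    now_us = time.time_ns() // 1_000
    error_us = 50  # hypothetical bound from EC2 time sync + ClockBound
    return BoundedTime(now_us - error_us, now_us + error_us)

def transaction_start_timestamp() -> int:
    bound = read_clock()
    # Choosing the latest possible time means any transaction that already
    # committed has a commit timestamp at or before this value, so reading
    # storage "as of" this timestamp observes all of them.
    return bound.latest_us
```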

Thumbnail 400

Storage implements multi-version concurrency control and returns results for precisely the timestamp that we asked for. This means that as we run our transaction and many seconds may pass while we go back and forth, we see a frozen moment in time. Because DSQL is strongly consistent, we can open additional connections. These different connections can run on different Query Processors, on different physical hosts, in different EC2 availability zones, or in the case of multi-region, even in different AWS regions, and still get strongly consistent results.

Thumbnail 420

Thumbnail 460

DSQL is also active-active, which means that any of these Query Processors can also update data. They do that by coordinating with a service called the adjudicator, which sits in front of the journal. The adjudicator implements a concurrency control technique known as optimistic concurrency control. The idea behind this is that we take that bundle of everything we wanted to do in our transaction, and the adjudicator goes and looks at other commits that have been made recently and checks if there are any conflicts. We are not going to go deep into optimistic concurrency control today. In the talk last year with the same title, we went into much more detail. Yesterday, we gave a 500-level talk about optimistic concurrency control if you would like to learn more.
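
As a conceptual sketch (not DSQL's actual adjudicator code), optimistic concurrency control boils down to a commit-time check like this:

```python
# Conceptual sketch of optimistic concurrency control, not DSQL's adjudicator.
# At commit time, check whether anything this transaction wrote was also
# committed by someone else after the transaction started.

def try_commit(start_ts: int, write_set: set[str],
               recent_commits: list[tuple[int, set[str]]]) -> bool:
    """recent_commits is a list of (commit_timestamp, keys_written) pairs."""
    for commit_ts, keys in recent_commits:
        if commit_ts > start_ts and keys & write_set:
            return False  # conflict: another writer touched the same keys first
    return True  # no conflict: assign a commit timestamp and write to the journal

# A transaction that started at ts=100 and wants to write "accounts/1":
print(try_commit(100, {"accounts/1"}, [(105, {"accounts/2"}), (110, {"accounts/1"})]))
# False, because another transaction committed to accounts/1 after we started
```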

Thumbnail 520

Hands-Free Automatic Scaling Through Transparent Sharding and Replication

DSQL implements hands-free automatic scaling through a number of techniques. It does this automatically and out of the box. On this picture, any number of things might have happened. Let us pretend there is rescaling happening here. DSQL is continuously backing up your data. As we open more of these connections that are able to drive more and more traffic to the service, DSQL uses those recent snapshots to create clones of storage. Those clones connect into the journal, learn about recent transactions, and keep themselves up to date. Then the Query Processors intelligently load balance across those replicas.

Another thing this picture might represent is splitting for size. As we put more data into the service, these replicas are going to become bigger. DSQL tries to keep them at a manageable size. The reason we do that is because if we keep the data sets small, we can improve performance. In the case that one of our replicas fails, we can bring a replacement online very quickly because it does not have too much data to restore.

Thumbnail 530

DSQL can also scale out the journal layer through automatic and transparent sharding. It starts to send some of your updates to some adjudicators and journals, and other updates to other adjudicators. For both read and write scaling, DSQL does this very proactively, monitoring for increasing load. It implements these scaling techniques transparently. It actually does not matter what sharding strategy it picks, because DSQL is fully capable of doing cross-shard commits.

Thumbnail 560

Thumbnail 580

Connection Challenges in Traditional PostgreSQL: From Security to Scalability

That is our crash course out of the way. We are now going to start to speak a little bit about how connections work in DSQL. Before we do that, let us set the stage and talk about how connections would work in vanilla PostgreSQL. This is PostgreSQL that you may have downloaded off the internet, installed on your laptop, or PostgreSQL that you are getting on EC2, maybe through RDS or Aurora.

Thumbnail 590

Your application is going to connect in and do some kind of credential exchange. PostgreSQL has many mechanisms for doing this. Let us say we are doing a user password credential exchange. The server is going to look at our password, which is stored in the database. It is going to make sure that it matches what we asked for. Once it has authenticated and authorized our connection, the server is going to fork itself. This is the Unix fork system call, and so we are going to get a dedicated process for our connection. As we open more connections, either from the same host or from a different host, we end up with more of these sessions.

Thumbnail 610

Each of our connections, each of these sessions gets its own dedicated process. Because of this, each connection consumes some number of resources on the service, even if you're not doing anything. At some point, we may need to get a bigger host simply to have more connections.

Thumbnail 630

Thumbnail 650

We're going to go through a journey that many of you may have been on several times as you've built applications against a relational database and started to scale or solve problems like how to manage heat, availability, or failover. We have our application with a single customer, and we've just gotten started. The first question that we may have to answer is how do we secure this endpoint?

PostgreSQL is about a million lines of C and is 40 years old, with tons of optimizations in very performance-sensitive code. By running the service on the internet, you run the risk of some kind of security breach. It may not even be in PostgreSQL itself; it may be a security issue in the operating system or in a library like OpenSSL that PostgreSQL is linked against. Even if PostgreSQL itself is secure, we may have something like a password, perhaps on a sticky note on your machine, which is not a best practice. The best practice is to do some sort of password rotation through Secrets Manager, but you still have the fundamental property that if somebody leaks that password, they may have access to your database.

Thumbnail 710

Thumbnail 720

This is a common reason for customers to run their database in something like a VPC, which can really complicate the lives of developers who are trying to connect to and work with the database. Hopefully we have more than one customer. As that other customer starts to connect and use our service, we can't just have one connection open to the database. We're going to need at least one more connection so we can do work in parallel.

Thumbnail 750

You don't want to be opening that connection as that customer arrives. Opening a connection in PostgreSQL can take some time, sometimes as much as one second. We have to establish a TCP connection, set up TLS, do multiple round trips of handshaking, do credential exchange, fork that backend process, and have that backend process get ready. Then finally our application can prepare any state, like prepared statements, before it can get going. To solve this problem, many customers create client-side connection pools that open a bunch of connections ahead of time, moving all of that slow stuff off the critical path. When one of our customers comes along with an API request, we can just pop a connection out of the pool and get going.
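
As a rough sketch of what such a client-side pool looks like in Python (using psycopg_pool; the connection string and sizes are illustrative):

```python
from psycopg_pool import ConnectionPool

# Illustrative sketch: open connections ahead of time so the slow work (TCP,
# TLS, credential exchange, backend startup) is off the request path.
pool = ConnectionPool(
    "host=<endpoint> user=app dbname=postgres sslmode=require",
    min_size=10,   # opened up front
    max_size=100,  # headroom for peak-to-average spikes
)

def handle_request(customer_id: int):
    # Borrowing a pooled connection is a fast, local operation.
    with pool.connection() as conn:
        return conn.execute(
            "SELECT balance FROM accounts WHERE id = %s", (customer_id,)
        ).fetchone()
```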

Thumbnail 770

This leads us to a question: how big should this pool be? If our application is running at 10 transactions per second and you look on your monitoring graph and it's just sitting at 10, that probably doesn't mean that you can use a pool size of 10. If we zoom into any minute down to the second, we may see spikes in concurrency because our customers may not always be doing work at a steady rate. This is a problem known as peak to average, and it may lead you to require a much bigger pool than your steady state rate. Let's pretend it's about 100, a 10x increase.

Thumbnail 800

Thumbnail 810

Thumbnail 840

This is a formula from the RDS documentation that says for every gigabyte of memory that an instance has, you can have roughly 100 connections. If we need 100 connections for our application, that means we can use the smallest instance type, the micro over there. But of course, we don't want to run one copy of our application. We want to be able to do deployments and survive the loss of an availability zone. If we're running in a typical three-replica scenario, you may think that we can set that pool size to only 33, but it actually doesn't work like that, because if we lose a host, then the surviving hosts need to be able to handle those spikes. Now we need 300 connections.

Thumbnail 860

We can no longer use a micro. This is a really common pattern: customers have to increase their instance size just to get more connections. But you'll notice that the second part of this formula, that 5000, says that even if you're using an M8 4xlarge with 64 GiB of memory or a beast of a machine, the 48xlarge with 768 GiB of memory, you can still only have 5000 connections. To solve this problem, it's very common to deploy something like PgBouncer, or if you're an RDS customer, you can use RDS Proxy. The way this works is that you're connecting to the proxy, and the proxy is going to hold those 300 connections and only use the 100 that you actually need on the back end, which is going to save resources and improve performance. This can be a really great solution for managing this complexity, but it's just one more thing for you to worry about, to configure, and to pay for.
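
For reference, the default RDS PostgreSQL limit the speaker appears to be describing is roughly LEAST(DBInstanceClassMemory / 9531392, 5000). A quick sketch of the numbers:

```python
# Sketch of the default RDS PostgreSQL connection limit being referenced:
# max_connections = LEAST(DBInstanceClassMemory / 9531392, 5000)

def default_max_connections(memory_gib: float) -> int:
    memory_bytes = memory_gib * 1024**3
    return int(min(memory_bytes / 9_531_392, 5000))

for label, gib in [("1 GiB micro", 1), ("64 GiB 4xlarge", 64), ("768 GiB 48xlarge", 768)]:
    print(label, default_max_connections(gib))
# Roughly 112 connections on a 1 GiB micro; the 5000 cap applies to both large instances.
```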

Thumbnail 890

Thumbnail 900

Here's another challenge. What host are we connecting to? We have a single writer, a single place that we can do mutations, a single place where we can get strongly consistent reads. If it becomes unavailable or you can't contact it, then things are about to go bad. You have to somehow fence this machine off so it can no longer do writes.

Thumbnail 910

We have to find a standby. Hopefully we have one ready or we have to restore from backup. We have to kick the primary out of DNS. We have to bring the standby in. We have to make sure that our application is ready to reconnect to that machine. And even if you do all of this right, this can still lead to minutes or hours of downtime. And if you're using all of the best practices, really you shouldn't expect less than 30 seconds of downtime.

Thumbnail 940

Here's another challenge: scaling. If we need to scale up reads, we're going to add read replicas. We're going to need to configure reader endpoints. We're going to have to teach our application which of our APIs can tolerate eventual consistency in our read-only mode. And then start to do some sort of traffic splitting in the application. Some of these read replicas may be further behind than others. So our load balancing is not simply distributing the load, it's also excluding hosts that we shouldn't talk to.
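
To make that "traffic splitting in the application" concrete, here is a hedged sketch of the kind of routing code this approach pushes into your application. The endpoints and lag check are placeholders; DSQL's strongly consistent, active-active design is meant to make all of this unnecessary.

```python
# Illustrative only: the routing logic a read-replica setup pushes into your
# application. Endpoints and the lag check are placeholders.

WRITER = "mydb.cluster-xxxx.us-east-1.rds.amazonaws.com"
READERS = ["replica-1.internal", "replica-2.internal"]

def replica_lag_seconds(replica: str) -> float:
    return 0.0  # placeholder: in practice, measure each replica's replay lag

def pick_endpoint(read_only: bool, tolerates_stale_reads: bool) -> str:
    if read_only and tolerates_stale_reads:
        # Load balancing also has to exclude replicas that have fallen behind.
        healthy = [r for r in READERS if replica_lag_seconds(r) < 5]
        if healthy:
            return healthy[0]
    return WRITER  # writes and strongly consistent reads go to the single writer
```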

Thumbnail 980

Thumbnail 1000

And even if you get all of that right, you're still left with eventual consistency, which is very difficult to reason about and build correct applications against. And so as we were thinking about what we wanted to build with Aurora DSQL, we were talking to customers to understand these pain points. These are the kinds of things that kept coming up. And if you really think about it, many of these problems are caused by one fundamental property: we have a single writer, a single point of failure. It needs to be there, it needs to be available, it needs to be scaled, it needs to be secure. And it needs to have enough capacity to do the work that you need out of it.

Thumbnail 1020

Thumbnail 1030

Building a Shared Endpoint: Using Server Name Indication for Cluster Routing

Meanwhile, with Aurora DSQL we have an active-active architecture. In Aurora DSQL, any of these query processors can accept reads or writes, and all of our reads are going to be strongly consistent. And so this was the start of our journey as we were building Aurora DSQL: how can we take this architecture, put something like a load balancer in front of it, fully encapsulate all of the problems that we've just described, and offer customers a much simpler experience?

Thumbnail 1040

Thumbnail 1050

Thumbnail 1070

And so this is our vision, our North Star. Our customers are going to come in with their application, and they're simply going to be given a query processor that they can use for the duration of their connection. So we're thinking about how to build this, and we're working backwards. And the first question that we have is: we have this picture in mind, but do we have one of these pictures per cluster? When you go to the Aurora DSQL API and you click that create cluster button, do you get a dedicated load balancer with your own query processors running behind it? Which leads to the following question: how many query processors should be behind this load balancer?

Now, if we had a dedicated endpoint per cluster, this question becomes very difficult to answer. We would require our customers to give us some kind of hint: "Hey, I need 100 connections. I need 200 connections." And so each of these query processors, they're just a Unix process that we're forking to accept more work. And so our minimum unit would be whatever we can pack onto the smallest instance. Let's say the smallest instance can run 50 query processors. That would mean that if you created an Aurora DSQL cluster and you went to that little box where you can hit plus or minus, the fewest connections you could offer is 50.

Thumbnail 1130

But as we saw before, that's not enough, because what if that machine died? We want at least redundant capacity in other zones. We'd also not be able to offer you 51 connections; it would be 50 or 100 or 150. And then we want to think about patching: how do we do deployments to these hosts? And so it very quickly became clear to us that we should not do this. Instead, what we should work towards is a single shared endpoint. An endpoint that any cluster can use, and an endpoint that simply has as many query processors behind it as are needed for any of our customers at any point in time.

Thumbnail 1150

Thumbnail 1170

And if we're going to build this endpoint, we need to play well with the PostgreSQL ecosystem. Because if Aurora DSQL was simply an AWS service, we could design whatever API we wanted, such as, "Hey customer, you need to tell us which cluster you're using when you're running a transaction." But because we speak in the PostgreSQL wire protocol, we need to fit within the bounds of what that protocol could offer. And so as we were looking around for our options, we very quickly settled on using a feature of TLS called Server Name Indication (SNI).

Thumbnail 1200

And the way this works is when you create an Aurora DSQL cluster, we generate a random identifier that's that colored bit, yeabulo5gfu4gaqha72r4qq3qu. And that cluster is going to get its own endpoint using a feature of Route 53 called an alias record. So it looks like it's its own A record, but in fact, it's actually just a pointer to our shared endpoint. And the way this works is that when your client establishes a connection and starts doing that TLS handshake, the first thing it's going to do is send a message called the Client Hello. If you see that blue bit,

Thumbnail 1220

Thumbnail 1230

it includes the name of the server that it's trying to connect to. On the server side, we can look at that Client Hello, strip that name out, and now we know which cluster you're connecting to. This feature, Server Name Indication, is a TLS feature. It's not a PostgreSQL feature, but it's been integrated into PostgreSQL since version 14.
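
As a small, generic illustration of where SNI lives, the standard library's ssl module shows the same mechanism: the server_hostname you pass is what ends up in the ClientHello. (A real PostgreSQL client such as libpq normally negotiates TLS with an SSLRequest message first, so this sketch only shows the TLS portion, not a full DSQL connection.)

```python
import socket
import ssl

host = "yeabulo5gfu4gaqha72r4qq3qu.dsql.us-east-1.on.aws"  # per-cluster alias

ctx = ssl.create_default_context()
with socket.create_connection((host, 5432)) as sock:
    # server_hostname is what ends up in the ClientHello's SNI extension, which
    # is how a shared endpoint can tell which cluster the connection is for
    # before a single PostgreSQL protocol byte has been exchanged.
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print(tls.version())
```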

Thumbnail 1250

Thumbnail 1260

Hardware-Level Isolation with Firecracker: MicroVMs and Memory Optimization

Now we have the shared endpoint and SNI. What we're going to do is run this big fleet of EC2 instances. It's the job of the service team to make sure those instances are sufficient for all of our capacity needs, and we're going to take on the job of making sure they're up to date. If our application comes in, we could have the load balancer do something like round robin and just pick out a Query Processor. Here we go: one, two, three. We're now connected to all three of the hosts. Now look at the top. We're going to have another customer come along, this red cluster. They're going to come in and get a connection on the same host as us. All's well, except they're a bad actor. They found a security vulnerability in PostgreSQL and fully intend to exploit it.

Thumbnail 1310

They're connecting in, running some kind of drop tables query, and they're going to break out of PostgreSQL. Now they have access to the environment they've broken out into. Because they're just a process running on the same machine as our orange cluster, they're going to have access to everything on that host and could potentially read or write data from other customers. This is a big deal. At AWS, the bar for this kind of design is not process-level isolation or containers. It's hardware-level virtualization. Fortunately, this is the same problem that our friends over in Lambda faced many years ago, which they solved by building a hardware-level virtualization library known as Firecracker, which gives us a microVM.

Thumbnail 1350

Thumbnail 1370

By microVM, we mean it's almost exactly the same thing as an EC2 instance, but it's designed for virtual machines that use far fewer resources. MicroVMs launch in milliseconds and have minimal overhead. However, even Firecracker out of the box wasn't quite good enough for the performance we wanted to offer our customers. What we do is have a little agent running on our hosts, and these agents are going to create these VMs ahead of time into a warm pool. Here we are filling up a warm pool. I've shown six on the slide, but in reality, we're packing hundreds of these VMs onto a single machine. When our application comes to connect and it's been assigned this host, we can simply take one of these Query Processors out of the pool.

Thumbnail 1390

Thumbnail 1400

We then connect directly into that Query Processor. In the background, the agent is going to fill up the warm pool. Because this is an in-memory operation, it's essentially popping an item from a linked list and establishing a TCP connection. This happens really fast and with minimal overhead. However, there is a really complicated economic challenge here. To explain this, let's look more closely at how connections work in standard PostgreSQL.

Thumbnail 1420

Thumbnail 1440

Our application is connected into the server, and you can see that the server is using some amount of memory. The dotted box represents memory available to the system. Each of these blocks is something like a page of memory. You can see that our server is using many hundreds of megabytes of memory just to be running. When we fork one of these backend processes, even though the new process has essentially the same memory requirements to run, the operating system is going to be clever and use copy-on-write: as long as the underlying memory pages have not been modified, both processes can share the same physical pages.

Thumbnail 1450

Thumbnail 1470

As we open more connections, we can keep doing this trick. This is going to significantly reduce the overhead of running these additional processes. Each of these connections, of course, needs its own memory. Pay attention to the orange one. It's running some kind of query and claiming new memory pages as it's doing something like loading all your data into memory so it can sort or run some sort of aggregation function.

Thumbnail 1480

Thumbnail 1490

Thumbnail 1500

Meanwhile, in our Firecracker world, we have something that looks very different. Here's a VM we've just launched into the pool. It's running a full operating system, a full copy of Linux. It's running our PostgreSQL server that's going to handle just one connection. It's also got a bunch of additional services that our team installed into that sandbox so we can do things like monitor, manage system lifecycle, and get logs out of the sandbox. If you add all of this up, we're looking at many hundreds of megabytes. Let's pretend it's 700 megabytes. If we do this a couple of times, we're almost at three gigabytes of memory.

Thumbnail 1510

So what do we do about this? Firecracker has a really cool trick here, developed by our friends over in Lambda and blogged about extensively, called SnapStart. The way this works is that when our host comes up for the very first time, it's going to create one of these VMs. I call it the seed because we're going to use it to grow new VMs. When the seed comes up, we do everything that you might expect. We launch Linux, we start PostgreSQL, we make sure everything's healthy, and we get right to the point where we're ready to accept our first connection.

Thumbnail 1550

Then we talk to our agent and say we're ready. What the agent's going to do is pause that VM and take all of those memory pages and write them to a file called the mem file. It's going to be persistent on disk. At this point, our seed has done its job, so we can kill it. Now that we have our seed, we can start new clones from that seed and grow new VMs. These VMs have the exact same memory layout. In fact, they're identical copies. They think they have the same IP address. They think that time is what the original seed had. They think they have the same MAC address. They're identical in every way.

Thumbnail 1580

Thumbnail 1590

Thumbnail 1600

Except just like with fork, because we've memory mapped in this file on disk, the contents of this VM are actually just pointers to the original snapshot. So far we haven't saved any memory, but as we start to launch additional VMs from that same snapshot, we get to use that same pointer, that same pointer chasing technique. Just like before, as one of our VMs starts to do unique work, it can claim additional memory, and we're only paying for that unique memory.

Thumbnail 1610

The Relay Service: Secure Authentication and TLS Session Migration

Instead of just doing round robin across these VMs, we have a service that runs behind the load balancer called the Relay service. When your application connects in and starts doing that TLS handshake, it is the Relay that it's talking to. Relay is a service that we built from the ground up in Rust, and it uses the S2N library so that we can have assurance about the security of the service. It uses S2N's capability to parse that SNI value.

Thumbnail 1650

The next thing your application is going to do is send over what's called an authentication token, which I want to take a moment to explain. Typically when you're using an AWS service like S3, let's say we're doing a get object request, that is an HTTP request, right? We have a bunch of headers, we have a body, and your SDK is going to sign that request using your AWS credentials to produce a new header using the signature version 4 algorithm. This signature is used to prevent any tampering. So if somebody was able to intercept that request and try to change the bucket or the object key that you were trying to download, then the signature would no longer match.

The other thing the signature does is it allows the S3 service to understand who the caller is, so we can do any authorization enforcement. Now S3 has a really cool feature called pre-signed URLs, where instead of actually signing that request and sending it over the network, you can just take a frozen version of that request. It looks like a URL and has the signature as a query parameter. You can share that URL with one of your friends, and they will be able to just go to that URL without an AWS SDK and download the object that you gave them permission to. You're not giving them your credentials, they can't write data to your bucket, and they can only download precisely what you asked them to.

The other thing you can do with SigV4 requests is put an expiry in. Requests sent by your SDK typically expire in something like 5 minutes, but with an S3 pre-signed URL you can customize that value for up to a week. So this is the underlying machinery that DSQL uses. It's actually the same machinery that IAM authentication in RDS uses. Your application is going to use the DSQL SDK to generate one of these authentication tokens. This is very fast, taking just a few nanoseconds, because this is something that AWS has invested in heavily. If you're talking to S3 or DynamoDB, every single request you make is being signed.
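
Here is a hedged sketch of generating such a token with boto3; the helper name and argument order (hostname, region, expiry seconds) are assumptions based on the SDK documentation and may differ by version.

```python
import boto3

# Hedged sketch: generate a DSQL authentication token, effectively an expiring
# pre-signed SigV4 request used as the PostgreSQL password.
dsql = boto3.client("dsql", region_name="us-east-1")

token = dsql.generate_db_connect_admin_auth_token(
    "yeabulo5gfu4gaqha72r4qq3qu.dsql.us-east-1.on.aws", "us-east-1", 3600
)
# Pass the token as the connection password. No long-lived database password
# exists, and a leaked token only grants time-limited access. A non-admin
# variant of this helper exists for regular database roles.
```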

Thumbnail 1770

Thumbnail 1790

But what we're doing here is just generating one of these tokens periodically or once per connection request. This allows DSQL to integrate with the IAM data plane. It's going to do some fetching of your keys for your account, fetching of your policies and your tags. It does this very efficiently, so that Relay can authenticate and authorize your requests, in many cases locally using its cache. At this point, we have established your AWS identity. We're able to enforce any policies. For example, you can say users in this group or users in this role can only access clusters tagged in certain ways.

Thumbnail 1820

Thumbnail 1830

Thumbnail 1840

Resource-based policies on Aurora DSQL are a great place to enforce requirements like connecting from certain IP addresses or through PrivateLink. After writing login attempts to CloudTrail, Aurora DSQL has established the identity of your user and will talk to a placement service. The job of the placement service is to keep track of our fleet, which we'll go into in a few minutes, and suggest one of these query processors for your application to connect to. Once we've picked a host and a query processor, the relay simply sends your connection over to the query processor.

At this point, the relay doesn't really have anything interesting to do. The relay is a service that we built from the ground up and it's written in Rust. Those first few messages—the TLS handshake and the PostgreSQL authentication messages—require complex protocol parsing. The PostgreSQL protocol can lead you to make some very silly security mistakes. For example, messages have a length, and if you get that length wrong when you're assigning buffers, you can have all sorts of overflow and underflow attacks or arbitrary code execution.

The way we approached this in relay is that we designed it to process as few packets as possible. When we implemented the protocol parsing code for those packets, we were extremely careful and worked closely with our security teams to make sure we had done this correctly. Now that we've done that work, relay really has nothing interesting to do. Everything else that's going to happen on this connection should be handled by PostgreSQL. What we do is move the TLS session into the sandbox.

Thumbnail 1940

Thumbnail 1950

This is really cool. A TLS session is a layer 7 concept happening right at the top of the OSI model. We take all of the ephemeral encryption keys that are used for that session and package them over the network, securely sending them into the query processor. At this point, we use a function that zeroes out the memory in relay. Relay can no longer decode anything else that happens on this connection. When our application sends any data into the query processor, it takes that data and sends it over a TLS connection where it gets encrypted. It flows encrypted through the network load balancer, through the relay, and arrives at the query processor, which now has the keys necessary to decrypt our data.

Thumbnail 1960

Thumbnail 1970

Built-In Connection Pooling and Automatic Failure Handling

When our application connects into one of these hosts, instead of simply taking one of these query processors out of the warm pool, what we instead do is move it into a per-cluster pool, which holds capacity just for that cluster. The reason we do that is, watch what happens when we open a second connection. We haven't dragged an additional query processor out of the warm pool. The agent is continuously filling the warm pool. Now that first connection we opened will be able to use one of these query processors directly from the pool. But we have two connections open. What happens if that other connection becomes busy? You don't have to do anything about this. Aurora DSQL is going to automatically take another query processor out of the warm pool. These query processors are already running, so this happens extremely quickly with very minimal performance impact.

The reason we do this is to address the problems our customers were telling us about right at the beginning. As customers scale their application fleet out and deal with the peak-to-average problem, they typically open many more connections than they actively intend to use. We didn't want Aurora DSQL customers to have to worry about running PgBouncer or RDS Proxy, because that would just add another hop in the network. There would be something else you'd have to pay for. You would have to run many of them to get high availability, and it would be something that you scale in addition to everything else in Aurora DSQL. By doing it like this, we've built pooling into the service, and there's simply no need for you to worry about that.

Thumbnail 2070

The placement service I mentioned earlier is continuously patrolling the fleet. At a very high frequency, each of our placement services is connecting into these agents and asking questions. How busy are you? How many connections do you have? How many of these are being used? How much memory do you have? What is the CPU load? For these specific clusters, how many sandboxes do you have? How many of them are being used? The placement service is another service we built from the ground up in Rust, and it does all of this in memory.

Thumbnail 2120

The placement service keeps this information in memory, so when the relay asks it where to place a new connection, it can respond very quickly because it has current information readily available. According to our quotas page, by default, an Aurora DSQL cluster has a limit of 10,000 connections, which is twice what you get from the largest RDS instance size. If you want, you can configure this limit. We have customers running with many more connections than this.

Thumbnail 2140

Let's say we're opening connections and we have three open. If one of our metal instances fails or one of our relays fails, instead of experiencing a full application outage, you experience only a one-third drop in connectivity. Two of your connections remain healthy. What your application needs to do is open a third connection again. It flows through the Network Load Balancer, finds a healthy relay, and talks to a healthy placement service to get a new query processor.

There are many kinds of failures that can happen within Aurora DSQL, and for almost all of them, the service handles them automatically for you. However, some events do require a connection to be re-established, and we cannot force your application to reconnect on demand. So we decided to put a limit of one hour on connection age. After an hour, either you hang up or we hang up. Do not worry though: we will not close the connection while you are using it. If you happen to start a transaction at 59 minutes and 59 seconds, we give you some breathing room. We also add jitter so we do not close all your connections at the same time.

The best practice here is to use a client-side connection pooling library, as we discussed earlier. You want to take all of that slow and expensive work, the TLS setup, the credential exchange, and placement, off the critical path. That is the best practice anyway. While you are at it, configure that client-side pooling library to have a maximum connection age of one hour. If you want, you can do health checks. When this kind of failure happens, your application will simply retry locally against your connection pool and find a healthy connection.
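
A hedged sketch of that best practice with psycopg_pool; the connection string is a placeholder, and in a real application you would also mint a fresh IAM token whenever the pool opens a new connection.

```python
from psycopg_pool import ConnectionPool

# Hedged sketch: recycle pooled connections before DSQL's one-hour limit.
pool = ConnectionPool(
    "host=<cluster-endpoint> user=admin dbname=postgres password=<iam-token> sslmode=require",
    min_size=4,
    max_size=100,
    max_lifetime=50 * 60,  # retire connections well before the one-hour cutoff
    max_idle=10 * 60,      # trim connections we are clearly not using
)
# Recent psycopg_pool versions can also run a health check when a connection is
# handed out, which pairs well with retrying locally against the pool.
```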

Availability Zone Local Routing for Optimal Performance and Resilience

PostgreSQL is a fairly chatty protocol. When it was designed, the clients and server were probably on the same machine. Even so, when you run interactive transactions where you send selects from your client to the server and get results back to the application, your application can look at those results and run complex business logic before going back and forth. Every time you cross the network, you spend time. So what we wanted to do in Aurora DSQL is give you the best possible performance.

Thumbnail 2300

Thumbnail 2250

We do this with a Route 53 feature where your EC2 instance's Nitro card sends additional information as part of the DNS request, tagging it with the availability zone that your instance is in. When you connect to Aurora DSQL, by default, your application goes through a Network Load Balancer host in the same zone, through to a relay host in the same zone, which talks to a placement host in the same zone, which gives you a query processor in the same zone, which talks to storage in the same zone. By doing this, we give you the lowest possible latency. As we open additional connections, we take advantage of Aurora DSQL being active-active and strongly consistent.

Thumbnail 2330

Thumbnail 2340

In the event that one of our availability zones becomes completely unavailable, the load balancer falls back to a healthy zone. This gives you a connection with slightly higher latency, but the system remains available. Any internal failures, for example if one of our storage replicas in an availability zone fails, are automatically handled by the service by picking a healthy replica in the same zone; in the event that there are none, it will also go across availability zones. This is a really cool feature of Route 53, and you can use it in your own applications if you want. We are very happy about it because even though Aurora DSQL, being a distributed architecture, would appear to have more hops, there are actually cases where connecting to Aurora DSQL can be faster than connecting to a single writer endpoint. That is because we are able to give you a query processor in the same zone, no matter where the primary is, because in Aurora DSQL, there is no primary.

Thumbnail 2380

Activity-Based Pricing with DPUs and the Benefits of DSQL Connections

Let's talk about pricing. Aurora DSQL is designed for activity-based pricing. If you are familiar with DynamoDB where you send put item and get item requests and pay for write capacity units and read capacity units based on the number of bytes you put into the system and the number of bytes you visit when you do those reads, Aurora DSQL uses a very similar concept.

Thumbnail 2430

However, we're actually going to take all usage and express it in what we call a distributed processing unit, or DPU. At the end of the month, you'll see DPUs on your AWS bill. If you go into the usage metrics on your DSQL cluster's page, you will see that we break out this DPU usage in several ways. Let me explain that quickly.

The Query Processor is running PostgreSQL, and it's kind of like a Lambda function in the sense that you can ask this Query Processor to do work. You can have it do Fibonacci if you want. It doesn't actually have to do any reads or writes. As you consume time on these query processors, we're going to capture that time through a compute DPU metric, which is essentially the seconds that you were active on this Query Processor. So it's per second billing.

Thumbnail 2460

When the Query Processor does reads from the network to storage, we're going to capture the reads that you do in the same way that we would do in DynamoDB. We're looking at the bytes that you're visiting. If you're scanning a table with a filter, you may visit many bytes on the table, but the return bytes may be smaller. So think about how many bytes you're visiting as part of your cost optimization.

Thumbnail 2480

On the write path, what we're doing is we're looking at that commit payload, and we're looking at how many bytes you're putting into the system, and recording those as write DPUs. All of these are activity-based, which means that as your application scales up, you're simply going to be spending more on compute, writes, and reads. You shouldn't be thinking about how big your machine should be. You should simply be thinking about how much you're spending on DSQL.

When you put data into the system, we have to store it and keep it ready to go when you connect in and run a query. This is measured in gigabyte hours, just like you would expect from S3 or DynamoDB. DSQL can obviously scale for you. We may be adding multiple replicas behind the scenes. If you're running very high throughput, you may have 10, 15, or 100 replicas. With DSQL, you're only paying for a single copy of your data. Don't think about the number of replicas or those continuous backups that we're doing. Just think about the number of bytes that you've put into the system.

Thumbnail 2550

Because of everything we've covered in this talk, if your application goes to sleep at night because you don't have any customers using your application, you can simply close connections. You don't even have to close connections; you can leave them running if you want. The important piece is that you're not sending us any SQL. If you're not sending us any SQL, you're not spending any time on the Query Processors. You're not spending any of the compute DPU. You're not doing any reads or writes.

Thumbnail 2580

The only ongoing cost for DSQL in this scenario is that storage amount. When morning comes and you want to get going again, just open new connections. Everything's there and it's ready to go. There's really no difference in DSQL between going from 99 connections to 100 compared to going from 0 connections to that first connection. The way we think about this is you're just grabbing a Query Processor. Query Processors are always there and they're ready to go. This is going to really simplify your lives.

Thumbnail 2610

Wrapping up: connections in DSQL are secure. In today's talk, we spoke about how every connection in DSQL is running in its own microVM, using best-in-class hardware-level virtualization and end-to-end encryption between your application and DSQL. Even members of the DSQL service team cannot run something like tcpdump and see your data. It is encrypted and secure out of the box.

We actually encourage you to just run against our public endpoint and take advantage of those tokens, because you don't have to worry about credential leaks. Keep them short-lived. If you want, you can set up PrivateLink and configure resource-based policies to lock down those connections to your VPC. Connections in DSQL are scalable. It's the job of the service team to make sure that there's enough capacity out there for every customer, even if they all spike.

You can use as many connections as you need, or as few, and you're only going to pay for what you're doing on those connections. If you have 100 connections and you're only using one connection, you're only paying for the activity on that one connection. Connections in DSQL are designed to be as simple as possible. There's no patching and there are no maintenance windows in DSQL. There's no need to run PgBouncer or RDS Proxy. There's no single point of failure in the system. It's strongly consistent, and this is really going to simplify your job as an application developer.

Connections are snappy and they are fast. Think about everything we spoke about today. We have Availability Zone local routing. We're going to a very fast service written in Rust. We're doing placement out of memory. We're grabbing a connection out of a warm pool. Once you've got that connection open, connections are fast. You're basically speaking directly into that Query Processor that's ideally running in the same Availability Zone as you.

The service has been designed to handle mass reconnect storms. If all of your connections die due to some kind of networking event, we have extremely generous default throttling rules that will allow you to get back online as quickly as possible. With that said, that's the end of the talk. Thank you so much for attending. My name is Marc. I'd love to chat with you after the talk if you have any questions. Thank you for attending. Have a great re:Invent, and please do complete the session survey in the mobile app.


This article is entirely auto-generated using Amazon Bedrock.
