🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - DraftKings: Scaling to 1M operations/min on Aurora during Super Bowl (DAT320)
In this video, Joel Miller, Staff Software Architect at DraftKings, explains how they built their Financial Ledger on Amazon Aurora MySQL to handle a million operations per minute. He details the four major scale challenges during NFL Sundays: deposit spikes, debit traffic, in-game read traffic from balance checks, and post-game payout surges (a 30x increase). The solution involves sharding Aurora MySQL clusters with consistent hashing, separating reads to replicas with under 15ms of replication lag, and optimizing stored procedures to return values through OUT parameters instead of trailing SELECTs. Key benefits include cost reduction, faster CDC replication to warehouses, and automatic failovers that minimize downtime.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Operating at the Speed of Sports: DraftKings' Scale Challenge
Thanks, Claude. Have you ever considered what it would take to run a million financial operations in a minute? At DraftKings, we have to be familiar with this kind of scale and support it on a regular basis. I'm Joel Miller, a Staff Software Architect at DraftKings, and what I want to share with you over the next few minutes is how we built the DraftKings financial ledger on Amazon Aurora MySQL and how that enables us to reach the scale that we need.
But first, just a very brief bit about DraftKings if you're unfamiliar. DraftKings is a sports entertainment and gaming company. We operate most notably in online sportsbook, but we have several other verticals that all interact with our ledger, including iGaming, fantasy sports, and lottery. Our products are available broadly across North America, in many US states and Canadian provinces. In any given month, we have millions of paying customers, and depending on how big the sporting event driving customer traffic is, we can see those millions of paying customers in a much shorter window than that.
We like to say that to be successful, we have to operate at the speed of sports. But what is the speed of sports? It might take an hour or many hours for a game to complete, but big plays happen in the blink of an eye. The interception that no one saw coming, the double play that ends the inning, the Hail Mary that ties the game. These kinds of plays have outsized impact on sport and drive customer engagement, and when fans are engaged, customers come to the DraftKings platform.
So the title of this presentation is all about scale, but what does scale really look like to the DraftKings ledger on an NFL Sunday? It comes in four big areas. The first thing to say is that we operate real-money products, so in order to engage with us, the customer has to bring money on site. What you're looking at right now is a graph of deposit volume in the lead-up to early games on NFL Sunday. As you can see, the farther out you are from kickoff, the lower the volume is on a transaction-by-transaction basis, and as you get toward the right side of the graph, just as the games are about to start, traffic really starts to ramp pre-game. Customers are coming to the site and depositing funds. We're working with our payment providers, but ultimately all of that has to hit the ledger. We have to update balances and write transactions.
After the money is on site, the next thing is you have to be able to use it. This is a graph of debit traffic, also in the lead-up to game starts. What you can notice here is that this graph is highly correlated with deposits. That makes a certain amount of logical sense, right? After you bring money on site, you simply want to be able to use it. But from the ledger's perspective, we've just compounded the problem. When deposits are busy, debits are busy. Everyone's trying to get their wager in or enter a fantasy contest, and the ledger has to be able to support that scale pre-match.
But surely once the game has started, we have no problem, right? Well, not exactly. In-game, we have a sawtooth-like pattern problem. What you're looking at here is no longer write traffic but read traffic. As I said, big plays have outsized impact, and they drive customer traffic to the site. Any time a customer comes to the site, the app needs to load up their balance. They need to know what's usable to them and what's not. You can see at the top here that on every single screen, as customers move around the app, the balance always gets checked.
If you have a highly drafted, very popular quarterback throwing a touchdown pass to a highly drafted, very popular receiver, everyone in the country at the same time is opening up their fantasy app, and they want to see what the score is. They also want to be able to see if they've got a wager, whether the wager is part of a leg that's paying out, or if it's settled at this point.
Postgame, we have the biggest spike problem for DraftKings. The last graph you see here is after the game is done and statistics are finalized. Suddenly, every sport-related system at DraftKings knows how many people it wants to pay out, and they want to pay them out at the same time. We have very real business metrics about this. We have to be able to get money back into customers' wallets as quickly as possible once an outcome is known.
Here you're looking at just one of the verticals. Most of the time, as luck would have it, sportsbook and fantasy pay out in slightly different minutes from each other, simply because the fantasy scoring process is a little harder and takes a little longer. But even with that, you can see that the background level of transactions goes up by about 30x once either one of them starts paying out. This is, in a lot of ways, the biggest scale problem: at one minute's notice, we have to be able to handle all of that volume on the ledger.
Building the DraftKings Financial Ledger on Amazon Aurora MySQL
So enough about the problem. How did we actually build the DraftKings Financial Ledger? On the left-hand side, you see the line-of-business services. I put a few of them on there, lightly labeled. They're all running in Kubernetes in various locations, mostly on EKS, and they're sending synchronous debit traffic to the ledger in the center. The ledger is another service running on top of EKS.
The ledger service is also, if you look at the top of the graph here, receiving asynchronous payment instructions from all of the payment services themselves. When I said that fantasy and sportsbook pay out, they don't do so synchronously. They drop messages onto a message broker, and the ledger listens to that broker in order to pay out those customers. At the bottom of the graph, you simply see DraftKings customers. Every time they're in the app, they're issuing balance queries, and that also directs traffic to the ledger on EKS.
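The talk doesn't name the message broker, so as a minimal sketch of the asynchronous payout path described above, here is what a payout-instruction consumer could look like with Amazon SQS standing in for the broker. The queue URL, message shape, and `process_payout` function are assumptions for illustration, not DraftKings' actual implementation.

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
# Hypothetical queue for payout instructions dropped by sportsbook/fantasy settlement.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ledger-payout-instructions"

def process_payout(instruction: dict) -> None:
    """Apply a single credit to the customer's wallet (stubbed here)."""
    # In a real system this would call the ledger service / a ledger stored procedure.
    print(f"crediting user {instruction['user_id']} amount {instruction['amount']}")

def poll_forever() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # batch to keep up with post-game payout spikes
            WaitTimeSeconds=20,      # long polling
        )
        for msg in resp.get("Messages", []):
            process_payout(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    poll_forever()
```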
But the real magic here is on the right side of the graph. A ledger is only as good as its data, and the Financial Ledger is built on top of Amazon Aurora MySQL. On the far right side, you can see any given cluster here, and I'll go back to the clusters in a minute. But very importantly, every cluster has read replication. In order to handle the scale of read traffic, you have to free up your writer to do what it's good at, and you have to make sure that the writer is doing writes and the reads are directed over to read replicas.
This is very easy to do with Aurora MySQL. I think you can even scale out read replicas to something like 15. It's incredible and easily distributes your read load so that it doesn't impact your writer. But you might notice the graph looks a little bit weird. There are multiple boxes on the Financial Ledger. We got great results from a single Aurora cluster, but as the business grew, we had to decide how we were going to continue allowing the ledger to scale to all of our customers on all of our products.
To that end, we sharded the ledger. We're still running on Aurora MySQL, but what happens now is we do a consistent hash of a user identifier to be able to direct users to one cluster or another.
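As a rough sketch of that routing idea, the snippet below maps a user identifier onto one of several Aurora clusters with a consistent-hash ring. The shard endpoints, virtual-node count, and hash function are assumptions for illustration, not DraftKings' actual implementation.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring mapping a user identifier to an Aurora cluster."""

    def __init__(self, shards: list[str], virtual_nodes: int = 128):
        self._ring: list[tuple[int, str]] = []
        for shard in shards:
            for i in range(virtual_nodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, user_id: str) -> str:
        """Walk clockwise to the first virtual node at or after the user's hash."""
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._ring)
        return self._ring[idx][1]

# Hypothetical cluster endpoints; the real shard layout is not public.
ring = ConsistentHashRing([
    "ledger-shard-1.cluster-abc.us-east-1.rds.amazonaws.com",
    "ledger-shard-2.cluster-abc.us-east-1.rds.amazonaws.com",
    "ledger-shard-3.cluster-abc.us-east-1.rds.amazonaws.com",
])
print(ring.shard_for("user-8675309"))
```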
When we were optimizing the ledger to keep handling growth and support the traffic in a distributed fashion, there were times when we simply decided to scale the cluster up vertically. This is great, this is easy, and it works incredibly well. But ultimately, once you start getting to the upper end, linear increases in instance size don't drive linear increases in throughput. What does drive linear increases in throughput is more writers.
Sharding was done specifically to enhance throughput, but it had some ancillary benefits for the company. The first one is cost reduction. It's actually cheaper to run a bunch of 4XLs than it was to run a single 24XL. That gets even a little better when you talk about the clusters and read replicas: if you're sizing all of your read replicas the same as your writer, that increases your cost savings.
Our second piece, which honestly we were not expecting but which really did help, was on the warehousing side. Every piece of data at DraftKings has to be warehoused so that we can use it later for analytics, data validations, and other things, and the ledger is no exception; it is in fact the rule here. During big games, our warehouse could start to get extremely latent. It was doing CDC replication, and still is doing CDC replication, from our Aurora MySQL clusters.
When we decided to shard the ledger, the speed of data getting to our warehouse also increased significantly. That was because we suddenly had five binlogs to read instead of one. We could parallelize that data ingestion into the warehouse, and the warehousing became faster. And it's very easy to get binlog data out of MySQL, and Aurora MySQL helps us with that.
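To illustrate the parallelization point, here is a minimal sketch of running one CDC worker per shard binlog. The shard endpoints and the `stream_binlog_to_warehouse` function are hypothetical stand-ins for whatever CDC tooling is actually in place.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard endpoints; the point is one binlog stream per shard.
SHARD_ENDPOINTS = [
    "ledger-shard-1.cluster-abc.us-east-1.rds.amazonaws.com",
    "ledger-shard-2.cluster-abc.us-east-1.rds.amazonaws.com",
    "ledger-shard-3.cluster-abc.us-east-1.rds.amazonaws.com",
    "ledger-shard-4.cluster-abc.us-east-1.rds.amazonaws.com",
    "ledger-shard-5.cluster-abc.us-east-1.rds.amazonaws.com",
]

def stream_binlog_to_warehouse(endpoint: str) -> None:
    """Stubbed CDC worker: tail this shard's binlog and forward rows to the warehouse."""
    # A real implementation would use a binlog client or CDC platform here.
    print(f"streaming binlog from {endpoint}")

# Five binlogs instead of one means ingestion can run in parallel, one worker per shard.
with ThreadPoolExecutor(max_workers=len(SHARD_ENDPOINTS)) as pool:
    pool.map(stream_binlog_to_warehouse, SHARD_ENDPOINTS)
```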
Key Lessons: Optimization Strategies for Million-Operations-Per-Minute Performance
Aurora MySQL helps us with a lot of pieces of this, to be honest. It's hard to imagine building it on another tech at this point. Aurora MySQL is able to scale in, out, up, and down rapidly. If we're in a situation where suddenly read traffic is knocking us over, latency is going up, our metrics aren't looking good, we can scale out read replicas in the blink of an eye. They come up quickly and are already synchronized with the dataset because of the underlying storage mechanics of Aurora.
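Adding a reader to an existing Aurora cluster is a single API call; a sketch with boto3 follows, where the instance identifier, cluster identifier, and instance class are hypothetical.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Creating an instance inside an existing Aurora cluster adds another reader.
# It attaches to the cluster's shared storage volume, so it comes up already in sync.
rds.create_db_instance(
    DBInstanceIdentifier="ledger-shard-1-reader-4",  # hypothetical name
    DBClusterIdentifier="ledger-shard-1",            # hypothetical cluster
    DBInstanceClass="db.r6g.4xlarge",
    Engine="aurora-mysql",
)
```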
Beyond that, the other really important piece about the read replication is that it's extremely low latency, an order of magnitude better than other things that we've tested. Generally, read replica latency on all of our read replicas on our cluster is under 15 milliseconds. There's no perceptible difference if you do any kind of immediate read after write. It has to be really, really immediate in order for the read replica not to be up to date.
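One way to sanity-check a replica lag figure like the sub-15-millisecond number mentioned here is the AuroraReplicaLag CloudWatch metric, which is reported in milliseconds per replica instance. A minimal sketch follows, with a hypothetical replica identifier.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraReplicaLag",  # milliseconds
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "ledger-shard-1-reader-1"}],  # hypothetical
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Average", "Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.1f}ms", f"max={point['Maximum']:.1f}ms")
```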
From our standpoint, if you deposit, then go to the next page, the deposit confirmation page, and load it up, it's already going to have your balance, even though we're reading that balance from a read replica. It's simply fast enough that we can't beat it. The other important piece here from the managed service side is we don't have to do too much to it. There's no OS patching that we have to deal with, we don't have to expand the storage, and it's all handled for us, including automatic failovers.
Hardware fails. Hardware will always fail. All you can do is react. Previously, before we used this product, our reaction time was good, but it was on the order of minutes. If we lost a writer and had to promote a read replica, it took time. We had a noticeable outage. Automatic failovers from Aurora MySQL have made that a seamless process. I'm not going to say you won't notice it happening, but it'll be much quicker than anything you can do on your own.
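Application code still has to ride through a failover. A minimal sketch of the usual pattern, assuming the cluster (writer) endpoint and a MySQL driver such as mysql-connector-python, is to retry with backoff so the reconnect lands on the newly promoted writer once the cluster endpoint's DNS flips.

```python
import time
import mysql.connector  # mysql-connector-python; any MySQL driver works similarly

# Hypothetical cluster (writer) endpoint for one ledger shard.
WRITER_ENDPOINT = "ledger-shard-1.cluster-abc.us-east-1.rds.amazonaws.com"

def execute_with_retry(sql: str, params: tuple, retries: int = 5) -> None:
    """Retry on connection errors: after an Aurora failover, the cluster endpoint
    points at the promoted writer, so reconnecting is usually all that's needed."""
    for attempt in range(retries):
        try:
            conn = mysql.connector.connect(
                host=WRITER_ENDPOINT, user="ledger", password="***", database="ledger"
            )
            try:
                cur = conn.cursor()
                cur.execute(sql, params)
                conn.commit()
                return
            finally:
                conn.close()
        except mysql.connector.Error:
            time.sleep(min(2 ** attempt, 10))  # back off, then reconnect via DNS
    raise RuntimeError("writer unavailable after retries")
```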
So what did we learn on this journey? Optimizing to hit one million operations a minute is complicated, but it involves data and it involves testing.
One of the first things that I think is important to know is what's slow. The slow things are probably hurting your throughput, and I don't want to get into an argument about latency versus throughput, but they are correlated. Profiling is key to figure that out. It's one thing to know that your stored procedures are slow. It's in fact another thing to know why they're slow, because that's how you fix it.
Performance Insights, which is now I think CloudWatch Database Insights, was invaluable to us to be able to look into a stored procedure and see what was taking the most time when it wasn't working the way that we thought it was. Was it locking? Was it not using an index, table scanning? Was it simply improperly set up data? These things are available to you and you need to be able to see them in order to optimize.
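The speaker is describing Performance Insights / CloudWatch Database Insights. As a complementary, purely MySQL-side way to see which statements dominate, a sketch like the following queries performance_schema for the top statement digests by total time spent; the endpoint and credentials are placeholders.

```python
import mysql.connector

# Top statement digests by total time; performance_schema timers are in picoseconds.
TOP_STATEMENTS_SQL = """
SELECT DIGEST_TEXT,
       COUNT_STAR            AS calls,
       SUM_TIMER_WAIT / 1e12 AS total_seconds,
       AVG_TIMER_WAIT / 1e12 AS avg_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10
"""

conn = mysql.connector.connect(
    host="ledger-shard-1.cluster-abc.us-east-1.rds.amazonaws.com",  # hypothetical
    user="ledger", password="***", database="ledger",
)
cur = conn.cursor()
cur.execute(TOP_STATEMENTS_SQL)
for digest, calls, total_s, avg_s in cur.fetchall():
    print(f"{calls:>8} calls  {total_s:>8.1f}s total  {avg_s * 1000:>7.2f}ms avg  {(digest or '')[:80]}")
conn.close()
```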
One of the most interesting optimizations we had to do, and one I did not expect, is that getting data out of stored procedures can be done in multiple ways, and some ways are better than others. We typically performed SELECTs at the end of stored procedures and returned data in a tabular format back to the caller. This ended up actually hurting throughput significantly compared to taking scalar data and sending it back through OUT parameters. Our best understanding at this point is that it has to do with temp table usage, to be honest with you, but I'm happy to talk about it later.
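As a hedged illustration of the OUT-parameter pattern (the procedure name, table, and schema below are hypothetical, not DraftKings' ledger schema), the scalar result is written into an OUT parameter instead of being returned as a trailing result set:

```python
import mysql.connector

# Hypothetical procedure: debit a wallet and hand the new balance back via OUT parameter.
CREATE_PROC = """
CREATE PROCEDURE debit_wallet(
    IN  p_user_id     BIGINT,
    IN  p_amount      DECIMAL(18, 2),
    OUT p_new_balance DECIMAL(18, 2)
)
BEGIN
    UPDATE wallet
       SET balance = balance - p_amount
     WHERE user_id = p_user_id;

    -- scalar goes into the OUT parameter; no trailing SELECT result set
    SELECT balance INTO p_new_balance FROM wallet WHERE user_id = p_user_id;
END
"""  # DDL shown for reference

conn = mysql.connector.connect(host="...", user="ledger", password="***", database="ledger")
cur = conn.cursor()

# Bind the OUT parameter to a session variable, then read it back.
cur.execute("CALL debit_wallet(%s, %s, @new_balance)", (8675309, 25.00))
cur.execute("SELECT @new_balance")
(new_balance,) = cur.fetchone()
conn.commit()
print("new balance:", new_balance)
conn.close()
```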
The key thing to remember is that if you've got scalar returns coming out of a stored procedure, put them in OUT parameters. The number of calls you can make to that stored procedure, the throughput, will increase greatly. But the last thing that I really want to hit home is command-query separation. I mentioned it once already, and it's really important to drive home if you want to handle scale.
You need to separate reads from writes, especially given the ease with which you can spin up read replicas on Amazon Aurora MySQL. It's important that you do, and that your systems know, when they're making a read-only query and when they're trying to mutate data. When you can direct those read-only queries to read replicas, there's no chance of bogging down your writer, whether by consuming CPU capacity or by taking some odd lock that blocks your updates. It simply can't happen if it's happening on a read replica and not the writer.
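A minimal sketch of that command-query separation, assuming one shard's Aurora writer (cluster) endpoint and reader endpoint plus mysql-connector-python, might look like this:

```python
import mysql.connector

# Aurora gives each cluster a writer (cluster) endpoint and a reader endpoint that
# load-balances across replicas; hypothetical endpoints for one ledger shard below.
WRITER = "ledger-shard-1.cluster-abc.us-east-1.rds.amazonaws.com"
READER = "ledger-shard-1.cluster-ro-abc.us-east-1.rds.amazonaws.com"

def _connect(host: str):
    return mysql.connector.connect(host=host, user="ledger", password="***", database="ledger")

def run_query(sql: str, params: tuple = ()):
    """Read-only queries go to the reader endpoint; they can never block the writer."""
    conn = _connect(READER)
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        conn.close()

def run_command(sql: str, params: tuple = ()) -> None:
    """Mutations (debits, credits, payouts) go to the writer endpoint."""
    conn = _connect(WRITER)
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        conn.commit()
    finally:
        conn.close()

# Balance checks hit the replicas; debits hit the writer.
balance = run_query("SELECT balance FROM wallet WHERE user_id = %s", (8675309,))
run_command("UPDATE wallet SET balance = balance - %s WHERE user_id = %s", (25.00, 8675309))
```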
It's extremely important, and if there's one thing you take away from this presentation, it's separate reads from writes to handle scale. That's what I've got for you. I'm happy to stick around afterwards and answer any questions you have, but thank you for sharing your time with me.
This article is entirely auto-generated using Amazon Bedrock.










