🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Scaling MONOPOLY GO! to millions with DynamoDB & ElastiCache (DAT324)
In this video, Brandon from Scopely and Joseph Idziorek from AWS discuss how Monopoly GO! scaled to millions of concurrent users using DynamoDB and ElastiCache Valkey Serverless. Brandon explains their architecture decisions, choosing DynamoDB for user-scaling data to avoid MySQL's single-writer bottleneck and connection limits. He details handling sudden traffic spikes during mini-game events that triple baseline load in five minutes, using DynamoDB's on-demand mode with pre-warming APIs and ElastiCache's ECPU minimums. Cost optimization focused on reducing document sizes rather than API calls, with one table redesign cutting costs from 1-200 write units to just 2. Migration to ElastiCache Serverless eliminated manual scaling disruptions and Saturday pages, delivering 61% cost savings through automatic scaling instead of constant over-provisioning.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction and Why Scopely Chose DynamoDB for Monopoly GO!
Welcome, everyone. My name is Joseph Idziorek. I am a Director of Product Management at AWS, where I cover a number of our database services, including DynamoDB and ElastiCache. I'm really excited about this talk. Brandon is going to join us from Scopely, and Brandon has been giving talks on behalf of DynamoDB and AWS for almost 10 years now. I found his original talk on YouTube from 2016 on DynamoDB. Of all the customers that I work with, I love working with Scopely. I love working with Brandon. They are one of the very earliest adopters of DynamoDB, and they're phenomenally deep in the product. I think they make me uneasy about how well I know the product because they know it very well, and they've been pushing the limits.
They've been very early adopters of both DynamoDB and Serverless, and now Valkey Serverless as well. One of the most demanding workloads that we see on DynamoDB is gaming, because you can go from 100,000 to a million to millions of customers in a very short period of time. That's one of the workloads that we're very proud of, and we're very thankful for Brandon to join us today and share how Scopely has been scaling on DynamoDB and ElastiCache Valkey with Monopoly GO!. So welcome, Brandon.
Thank you. I'm Brandon, and I'm here to talk about Monopoly GO!. Monopoly GO! is a really big game. We launched it two years ago, and it ended up being about 10 times bigger than we thought it would be. It was the number one game that year by Mobile Metrics, and we brought on millions of users in a very short period of time. In our daily traffic peaks, we have over a million concurrent users playing the game at any given time. This presented a pretty big technical challenge.
Here's a basic overview of our architecture. We have a pretty basic, old-fashioned monolithic style architecture. We have a Unity client built with C# on the front end, and then on the server side we have a .NET application also using C#. Having C# on both sides is nice because there's shared code. The way that we chose databases was pretty important. We use the general rule of thumb where if the data scales with the number of users, we tend to put it in Amazon DynamoDB. If the data scales with the number of employees at Scopely, then we'll put it in MySQL because it's not very big. It's configuration data and things like that. For special use cases, we use Redis for things like leaderboards and temporary job stuff.
In this talk, I'm going to cover three main points. First, reliability at scale. Then I'll talk a bit about cost management, and finally I'll share a story on operational simplicity and why that aspect is really important for Scopely. Let me start with reliability. Why DynamoDB? The main reason is that a lot of the alternatives have problems that DynamoDB doesn't. I'm not trying to pick on MySQL, but I'm going to use it as an example of why we don't use it for most things. Aurora MySQL allows only a single host as the writer, so there's an inherent bottleneck on the write throughput of the whole database.
We constantly run into connection limit issues. Because we're a very large game with a million concurrent users, we have to run as many as 2,000 or more hosts at a time in order to satisfy those load requirements. Each of those hosts has to open up 10 or more connections to MySQL. When you do the math on that, you easily hit the upper bounds of the connection limits on the server side. We tried introducing things like MySQL proxy and scaling out reader nodes, and that helps, but it only scales to a point. There are also issues with unconstrained queries. If an engineer naively forgets to put a LIMIT clause on a SELECT * FROM users and then puts it into production, that can easily chew up resources.
With DynamoDB, it's really hard to do that on purpose. The most important benefit for me is that I never have to do cluster upgrades with DynamoDB. With MySQL, at some point they're going to make me upgrade a major version, and it almost always requires some downtime and some praying and finger crossing. It's a real pain, and we'd rather not have to ever do that. That's why we don't like MySQL and we prefer DynamoDB.
Handling Sudden Load Spikes from Time-Based Mini Games
Monopoly GO! has some inherent sudden load problems that are native to the game. I'll explain how the game works so you can understand why we have this problem. The game is really simple. If you open the app, there's one button you just click. The button rolls some dice, your token moves around on the board, and you collect some rewards. It sounds boring, but it's pretty fun when you add in all these other time-based events that we call mini games.
These mini games always have a start and a stop time. For example, on a Saturday at 9 a.m., our race mini game might start. When the race mini game starts, the game suddenly has to do a whole lot more than it used to do. For example, it has to serve team recommendations because you have to form a team and all get in the car together. When you go around the board, you're now collecting extra points that you can spend to make your car go faster in the race. As soon as the event starts, we get this sudden increase in load.
You can see in this chart that we triple our baseline traffic in a short period, like five minutes. This doesn't actually tell the whole story because we have certain DynamoDB tables or other components that go from zero load to maybe 100,000 read or write units per second in that very same short time frame. Because of this load pattern, using DynamoDB makes it really easy. We can simply put our tables in on-demand mode and then call a pre-warm API ahead of time so that the table has enough underlying capacity to handle whatever load we think it could need in our wildest dreams. When the load does come, it's able to handle all of that load without throttling.
For other components like ElastiCache Serverless, there's a mechanism called ECPUs, and what we can do is set a minimum ECPU threshold before the event starts. Once the load spike starts and then the traffic stabilizes again, we'll just remove the minimum and let auto scaling take care of it after that. For EC2 and auto scaling groups, we do a similar thing. We'll just set a minimum on our auto scaling group and add a bunch of servers. When the load spike happens, then we'll remove the auto scaling group minimum, and then it will scale back down or stay the same depending on the load.
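As a rough illustration, here is a minimal boto3 sketch of raising and relaxing those floors around an event. The cache name, Auto Scaling group name, and numbers are placeholders, and the minimum field on CacheUsageLimits is assumed from the current ElastiCache Serverless API; this is a sketch of the pattern, not Scopely's actual tooling.

```python
import boto3

elasticache = boto3.client("elasticache")
autoscaling = boto3.client("autoscaling")

def raise_floors_before_event():
    # Set a minimum ECPU/s on the serverless cache so capacity is already
    # in place when the mini game starts (values are illustrative).
    elasticache.modify_serverless_cache(
        ServerlessCacheName="monopoly-go-leaderboards",   # placeholder name
        CacheUsageLimits={
            "ECPUPerSecond": {"Minimum": 50_000, "Maximum": 500_000},
        },
    )
    # Bump the Auto Scaling group minimum so extra game servers are online early.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="game-servers",              # placeholder name
        MinSize=2000,
    )

def relax_floors_after_spike():
    # Once traffic stabilizes, lower the floors and let auto scaling take over.
    elasticache.modify_serverless_cache(
        ServerlessCacheName="monopoly-go-leaderboards",
        CacheUsageLimits={"ECPUPerSecond": {"Minimum": 1_000, "Maximum": 500_000}},
    )
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="game-servers",
        MinSize=100,
    )
```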
For our load balancers, we'll set a capacity reservation. This sounds like a lot of stuff, and we have to do it for a lot of different events. We got tired of doing it manually, and we eventually built a system that automates all of those activities. When we create a new DynamoDB table for a new feature, we almost always launch it in on-demand mode. We'll call the pre-warming API ahead of time because even when you create a table in on-demand mode, it doesn't necessarily have the underlying capacity to go to a million reads and writes per second right away.
Amazon recently introduced a pre-warming API that lets you say, "Hey DynamoDB, I want to pay this small one-time cost to make sure that you have enough partitions and underlying capacity to handle additional traffic should I need it." When we launch a new feature, we'll call the pre-warm API, then launch the feature and watch the traffic. If, after a couple of weeks or a month of observing the traffic pattern, it looks stable enough, we may switch the table into provisioned mode to save money. If it's too spiky, we won't do that.
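As a sketch, that pre-warm call corresponds to the warm throughput parameter on UpdateTable in boto3; the table name and targets below are placeholders, and the exact parameter shape should be checked against the current SDK.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Ask DynamoDB to pre-provision enough partitions and underlying capacity
# for a future spike; the table name and targets are placeholders.
dynamodb.update_table(
    TableName="race-event-scores",
    WarmThroughput={
        "ReadUnitsPerSecond": 1_000_000,
        "WriteUnitsPerSecond": 500_000,
    },
)
# The table stays in on-demand mode; warm throughput only raises the level
# of traffic it can absorb instantly, for a one-time cost.
```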
Note that when we switch into provisioned mode, we may also need to switch it back to on-demand mode when we get one of those sudden load spikes again. You're allowed to do this about 4 times a day, switching between provisioned and on-demand.
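A minimal sketch of flipping between the two billing modes with boto3, assuming placeholder capacity values; which mode to use, and when, follows the traffic observations described above.

```python
import boto3

dynamodb = boto3.client("dynamodb")

def switch_to_provisioned(table_name: str, rcu: int, wcu: int) -> None:
    # Traffic looks stable: lock in provisioned capacity to save money.
    dynamodb.update_table(
        TableName=table_name,
        BillingMode="PROVISIONED",
        ProvisionedThroughput={
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": wcu,
        },
    )

def switch_to_on_demand(table_name: str) -> None:
    # A spiky event is coming: go back to on-demand so the table can absorb it.
    dynamodb.update_table(
        TableName=table_name,
        BillingMode="PAY_PER_REQUEST",
    )
```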
Even though we try really hard to avoid throttling, we don't always get it 100% right, so we have some extra mitigations to limit the user impact of throttling if it does happen. Our server-side DynamoDB client will retry a few times. However, we like to keep our server API responses fast, so we give up pretty quickly on the server, but we hint down to the client that this was an intermittent failure and that it's welcome to retry as well. The client will retry a few times, and if it eventually succeeds, the user might notice a bit of lag, but they won't get a crash or an error dialog.
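A rough sketch of that pattern: a short server-side retry on throttling, then a hint back to the game client. The exception handling and response shape are illustrative, not Scopely's actual code.

```python
import time
from botocore.exceptions import ClientError

THROTTLE_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException"}

def put_item_with_hint(table, item, max_attempts=2):
    # Try briefly on the server so the API response stays fast.
    for attempt in range(max_attempts):
        try:
            table.put_item(Item=item)
            return {"ok": True}
        except ClientError as err:
            if err.response["Error"]["Code"] not in THROTTLE_CODES:
                raise
            time.sleep(0.05 * (attempt + 1))  # small backoff between attempts
    # Give up quickly, but tell the client the failure is intermittent so it
    # can retry; the user sees a bit of lag instead of an error dialog.
    return {"ok": False, "retryable": True}
```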
Cost Optimization Through Document Size Management and Migration to ElastiCache Serverless
Let me talk about cost. One of the biggest things that we look for when we're trying to optimize for cost is the size of our items in DynamoDB. One thing that people don't always realize is that when you read or write an item in DynamoDB, the cost that you pay in read or write units is proportional to the size of the document. If your document is 100 kilobytes, you're going to pay 100 write units to change even just one small aspect of that item.
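To make the arithmetic concrete: DynamoDB bills writes per 1 KB of item size and reads per 4 KB, rounded up, so item size drives cost directly. A small illustration:

```python
import math

def write_units(item_size_bytes: int) -> int:
    # Each write consumes 1 write unit per 1 KB of item size, rounded up.
    return math.ceil(item_size_bytes / 1024)

def read_units(item_size_bytes: int, strongly_consistent: bool = False) -> float:
    # Each read consumes 1 read unit per 4 KB (half that if eventually consistent).
    units = math.ceil(item_size_bytes / 4096)
    return units if strongly_consistent else units / 2

print(write_units(100 * 1024))  # 100 KB item -> 100 write units per update
print(read_units(100 * 1024))   # 100 KB item -> 12.5 eventually consistent read units
```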
We noticed that some of our biggest cost savings actually came from reducing the size of documents and not actually from reducing API calls or anything like that. One thing that I'm showing here in this chart is something that we built. DynamoDB doesn't have a way to expose metrics on your average, minimum, or maximum document sizes within a table, so we built this ourselves. We instrument our DynamoDB clients so every time we read or write from a document we report the document size, and that way we can get an average, a P95, or a maximum size of our documents. We can then set alarms on this so when our documents get too large, we can take a look at it and see what the issue is.
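A minimal sketch of that kind of client instrumentation: approximate the item size from the serialized payload and publish it as a metric you can aggregate and alarm on. The metric namespace is an assumption, and the size calculation is an approximation, since exact DynamoDB item sizing also counts attribute names and type overhead.

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_item_size(table_name: str, item: dict) -> None:
    # Approximate the stored item size by measuring the serialized payload.
    approx_bytes = len(json.dumps(item).encode("utf-8"))
    cloudwatch.put_metric_data(
        Namespace="Game/DynamoDB",  # assumed namespace
        MetricData=[{
            "MetricName": "ItemSizeBytes",
            "Dimensions": [{"Name": "TableName", "Value": table_name}],
            "Value": approx_bytes,
            "Unit": "Bytes",
        }],
    )
```

From there, average, P95, and maximum sizes per table fall out of standard metric statistics, which is where the alarms on oversized documents come from.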
Also note that DynamoDB has a hard item size limit of 400 kilobytes. With a couple of these documents on the screen, we're pretty close to hitting a problem. Here's an example of a table that has a cost problem. This is a friend invitation table. If you want to be friends in Monopoly GO!, you can send somebody an invitation and they can accept it. The way we used to do this in the database is that we had one item per user holding that user's friend invitations. When you send a friend invite to that person, your invite gets appended to a big JSON blob within that item.
For most users it's not a big deal because they only receive a handful of invitations, but for some users they get thousands and the documents can get quite large. So sending a new invite to that person is now really expensive. We changed this, and here's our new design. Instead of having one document per user, we have one document per invite, and then when a user wants to read their invitations, they query several items at once. In this situation we also added a local secondary index on the date so that we could query the most recent ones first.
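A sketch of what reading the first page might look like under the new design, with one item per invite keyed by recipient and a local secondary index on the invite date; the table, index, and attribute names here are hypothetical.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("friend-invites")  # hypothetical table

def latest_invites(recipient_id: str, page_size: int = 20):
    # Query the LSI on invite date, newest first, and read only one page.
    resp = table.query(
        IndexName="by-date",                      # hypothetical LSI name
        KeyConditionExpression=Key("recipient_id").eq(recipient_id),
        ScanIndexForward=False,                   # most recent invites first
        Limit=page_size,
    )
    return resp["Items"]
```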
Here's the comparison. In the old way, every time we would write to a document we'd pay somewhere between 1 and 200 write units depending on the size of the document. Now in the new way we pay a fixed 2 write units. In the old way we'd pay 1 to 50 read units, and in the new way we'd pay 1 to 2 read units. Some of the savings there comes from only having to query the first page. When the user opens the app they don't need to look at 1000 invites. They only need to look at the first page or the most recent ones, so that's why we got savings there.
Now I'll share a story on operational simplicity and why Scopely really values this aspect. We're game builders. We don't really want to spend a lot of time doing mundane database maintenance or manually scaling or tweaking things to get the best performance possible. We'd rather have a managed system that takes care of that for us.
Earlier, I mentioned that we had some Redis workloads for things like leaderboards, temporary job state, and large user lists. To serve these workloads, we had a Redis cluster that was not managed by AWS. It was really difficult to scale without disruption. Every time we reached a certain size and needed to add shards, we couldn't do it without causing strong user impact. We'd see timeouts, connection drops, and all sorts of problems, so we were afraid of touching it once it got to a certain size.
We also had to set alarms on our memory usage. If we reached about 80% of the database capacity, we had no choice but to add more capacity. We didn't like having to get paged on a Saturday when we were about to run out of memory, having to add nodes, and then causing user disruption and explaining to everyone why the game wasn't functioning well that day. We were looking for alternatives and ultimately settled on ElastiCache Serverless, mainly because it auto scales behind the scenes, which is exactly what we want. It grows and shrinks according to our needs and does that transparently.
One thing to note about ElastiCache is that it has the word "cache" in the name, but I think that's very misleading. It's perfectly fine for non-cache workloads. However, it doesn't have the same durability guarantees as DynamoDB or MySQL. Our old Redis workload didn't have those guarantees either, so we were totally fine with it. We also decided to use Valkey instead of Redis since it's open source and 20% cheaper, so it's pretty much a no-brainer.
Here are the results. We got our auto scaling, no more getting paged on Saturday, and no more responding to memory alarms. One thing that was actually a bit surprising to us was that it was 61% cheaper after the fact. We weren't looking for cost savings; we were just looking to save our own time. Finance was happy with us nonetheless. We realized that most of those cost savings came because we had to stay over-provisioned constantly in our old cluster because we were too scared to scale it down for fear of causing user disruptions.
One downside was that we adopted ElastiCache Serverless very early. During that time period, there were a few minor outages of one to three minutes that we noticed. We worked with AWS, and they managed to fix the issue behind the scenes. We haven't seen it for months now, and we ended up very happy with the result. Now we're in a much better place. We're not wasting our time on database tweaking and that sort of thing. That's the end of my talk. Thank you very much.
; This article is entirely auto-generated using Amazon Bedrock.