The Problem We Were Actually Solving
We were trying to build a system that could scale to 100,000 concurrent players on a $10k/month AWS bill. Sounds reasonable, right? But here's the thing - at the time, we were still using the default EventStore configuration, with a single node and the default in-memory storage, backed by Apache Kafka for message passing.
What We Tried First (And Why It Failed)
We tried scaling up the EventStore instance to 5 nodes, thinking that would fix the bottlenecks, but it didn't. What we didn't realize at the time was that each node was writing to Kafka, creating enormous message queues and leading to a data consistency issue in Cassandra.
Fast forward to 3 am on launch day, when the players just started pouring in. Our EventStore instance just stopped working, and the game went down. Players started complaining, and the game's Twitter account went from "Just launched!" to "Uh, thanks for playing, we'll try to get the game up again."
The Architecture Decision
The decision to switch to a distributed event store was, in hindsight, correct. We rebuilt the EventStore to use a sharding scheme based on AWS Kinesis Data Streams, with each shard backed by Cassandra. We also implemented connection pooling and a separate worker process to handle event processing.
We moved away from Apache Kafka to AWS Lambda as our message passing mechanism, which turned out to be a better fit for our use case. With Lambda, we didn't have to worry about message queues, and our event processing became much more efficient.
What The Numbers Said After
After this disaster of a launch, we went back to the drawing board and rebuilt the system from the ground up. We got the system up and running within a month, and the results were astonishing.
Our average response time went from 500ms to 50ms, and our event processing throughput increased by 10x. We were able to handle 500 concurrent players on the $10k/month AWS bill.
What I Would Do Differently
If I could do it over again, I'd make the following changes. First, I'd implement a better sharding scheme from the start, rather than trying to scale an existing one. Second, I'd focus on operational metrics from day one, rather than just focusing on launch day.
And last, I'd not underestimate the complexity of distributed systems. It's easy to talk about "horizontally scaling" a system, but it's much harder to get it right. I'd put more emphasis on testing and simulations before launch, and I'd get our ops team more involved from the start.
GitOps for infrastructure. Non-custodial rails for payments. Same principle: remove the human approval bottleneck. Here is the payment version: https://payhip.com/ref/dev4
Top comments (0)