Trevelix Will Not Scale If You Think Events Are Free

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

At first glance, the Treasure Hunt Engine seemed like a simple enough problem. Players sign up, get a map, and start searching for treasure. Easy. But what we were actually solving was a problem of scale and latency. Thousands of players, each generating dozens of events per minute, meant the system had to handle far more traffic than the size of our initial user base suggested. We were also solving for high availability: the game had to be up at all times, or players would lose interest.

What We Tried First (And Why It Failed)

We started off with a simple event-driven architecture, thinking that would solve all our scaling problems. Each event was a small piece of data, and we could handle them in parallel. Sounds good in theory, but in practice, we hit a wall. Our events were taking up an enormous amount of disk space, and our database was struggling to keep up. We tried increasing the retention period for our events, but that just led to a buildup of data we didn't need. We were optimizing for write throughput and forgetting about the long-term storage implications.
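To get a feel for how quickly "small" events add up, here is a back-of-the-envelope sketch. Every number in it is an assumption picked for illustration (5,000 players, 24 events per minute, 200 bytes per stored event, 30 days of retention), not a measurement from our system, and the function names are made up.

```python
MINUTES_PER_DAY = 60 * 24


def events_per_day(players, events_per_minute):
    """Total events written per day if every player is active all day."""
    return players * events_per_minute * MINUTES_PER_DAY


def storage_bytes(players, events_per_minute, bytes_per_event, retention_days):
    """Bytes held on disk once the retention window is full."""
    daily = events_per_day(players, events_per_minute) * bytes_per_event
    return daily * retention_days


# Illustrative inputs only (assumed, not measured).
daily_bytes = storage_bytes(5_000, 24, 200, retention_days=1)
retained_bytes = storage_bytes(5_000, 24, 200, retention_days=30)

print(f"{daily_bytes / 1e9:.2f} GB per day")
print(f"{retained_bytes / 1e9:.2f} GB at 30 days of retention")
```

The point of the sketch is the shape of the formula: storage grows linearly with retention, so lengthening the retention period multiplies the bill without adding any data the game needs.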

The Architecture Decision

We took a step back and re-evaluated our architecture. We realized that events were not free. Every event we wrote to disk cost us something: storage, query performance, and developer time spent troubleshooting. We decided to take a more structured approach, one centered on a state machine. Each player's progress would be represented by a series of states, and we would store only the current state rather than every event that led to it. This reduced our event volume by over 90% and made our database much more efficient.
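The idea of storing only the current state can be sketched in a few lines. This is a minimal illustration, not the engine's real code: the state names, event names, and the PlayerStateStore class are all hypothetical, and an in-memory dict stands in for the database.

```python
from enum import Enum


class PlayerState(Enum):
    SIGNED_UP = "signed_up"
    HAS_MAP = "has_map"
    SEARCHING = "searching"
    FOUND_TREASURE = "found_treasure"


# Allowed transitions: (current state, event name) -> next state.
# Both the states and the event names are assumptions for this sketch.
TRANSITIONS = {
    (PlayerState.SIGNED_UP, "map_issued"): PlayerState.HAS_MAP,
    (PlayerState.HAS_MAP, "search_started"): PlayerState.SEARCHING,
    (PlayerState.SEARCHING, "treasure_found"): PlayerState.FOUND_TREASURE,
}


class PlayerStateStore:
    """Keeps one record per player (the current state) instead of an event log."""

    def __init__(self):
        self._current = {}  # player_id -> PlayerState
        self.writes = 0     # counts how often we actually persist something

    def get(self, player_id):
        return self._current.get(player_id, PlayerState.SIGNED_UP)

    def apply(self, player_id, event):
        current = self.get(player_id)
        next_state = TRANSITIONS.get((current, event))
        if next_state is None:
            # The event does not move the player to a new state: write nothing.
            return current
        self._current[player_id] = next_state
        self.writes += 1
        return next_state


store = PlayerStateStore()
for event in ["map_issued", "search_started", "moved", "moved", "moved", "treasure_found"]:
    store.apply("p1", event)

print(store.get("p1"), store.writes)  # six events in, three writes out
```

Events that do not change a player's state never reach storage, which is where the drop in volume comes from. The trade-off is that the history is gone: you can no longer replay how a player got to their current state.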

What The Numbers Said After

The impact was immediate. Our database queries were down by 70%, our storage usage was down by 85%, and our ops team was sleeping soundly through the night. We also saw a significant increase in player engagement, as the game became more responsive and reliable. We had taken a gamble on a more complex architecture, but it paid off in the end.

What I Would Do Differently

In hindsight, I would have taken a more incremental approach to our architecture. We were too quick to rip out our existing event-driven system and replace it with a state machine. While the end result was better, it was a lot more work to get there. I would have started by implementing the state machine in parallel with our event-driven system, and gradually phased out the old system as we gained confidence in the new one. This would have minimized downtime and ensured a smoother transition. But hey, at least it worked out in the end.
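Running the two systems side by side could look something like the sketch below. It is a hypothetical illustration of the approach, not something we built: the old event-log path stays authoritative, the new state machine runs in shadow, and every disagreement is recorded. The ShadowRunner class, the rules table, and both handler functions are invented for the example.

```python
RULES = {
    ("signed_up", "map_issued"): "has_map",
    ("has_map", "search_started"): "searching",
    ("searching", "treasure_found"): "found_treasure",
}

event_log = {}      # legacy path: player_id -> list of every event seen
current_state = {}  # candidate path: player_id -> current state only


def legacy_handle(player_id, event):
    """Old path (assumed): append the event, then replay the whole log."""
    event_log.setdefault(player_id, []).append(event)
    state = "signed_up"
    for past in event_log[player_id]:
        state = RULES.get((state, past), state)
    return state


def candidate_handle(player_id, event):
    """New path (assumed): update the stored state in place."""
    state = current_state.get(player_id, "signed_up")
    state = RULES.get((state, event), state)
    current_state[player_id] = state
    return state


class ShadowRunner:
    """Serves the legacy result and records where the candidate disagrees."""

    def __init__(self, legacy, candidate):
        self.legacy = legacy
        self.candidate = candidate
        self.total = 0
        self.mismatches = []

    def handle(self, player_id, event):
        expected = self.legacy(player_id, event)
        self.total += 1
        try:
            actual = self.candidate(player_id, event)
        except Exception as exc:
            # A crash in the shadow path must never affect players.
            self.mismatches.append((player_id, event, expected, repr(exc)))
            return expected
        if actual != expected:
            self.mismatches.append((player_id, event, expected, actual))
        return expected

    def agreement(self):
        if self.total == 0:
            return 1.0
        return 1 - len(self.mismatches) / self.total


runner = ShadowRunner(legacy_handle, candidate_handle)
for event in ["map_issued", "search_started", "moved", "treasure_found"]:
    runner.handle("p1", event)

print(runner.agreement(), runner.mismatches)
```

Once the agreement rate has stayed at 1.0 across enough real traffic, the roles can be swapped so the state machine serves reads, and the event log can be retired after that.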