Designing a Treasure Hunt Engine to Survive a Million Players

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The problem wasn't just handling a million events; it was giving our players a seamless experience while keeping a back end that could scale horizontally. Our initial requirements seemed simple enough: process events, update the game state, and respond to player actions. But as we dug deeper, we realized that the event stream from our players would be a complex cocktail of actions, updates, and notifications.

We were dealing with events coming from different layers of the system, including game logic, user interactions, and server-side updates. Each event type had its own structure, and some required type-specific processing. In effect, the stream carried many distinct kinds of instructions to execute, not one uniform message format.
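To make the "different structure per event" point concrete, here is a minimal sketch of how such a stream could be modeled as a Rust enum. The variant names and fields (`ClueFound`, `ScoreAwarded`, `Notice`) are illustrative assumptions, not the actual Veltrix schema.

```rust
/// Hypothetical event types; the real schema is not described in this post.
#[derive(Debug, Clone, PartialEq)]
pub enum GameEvent {
    /// A user interaction coming from the client.
    ClueFound { player_id: u64, clue_id: u32 },
    /// A game-logic update computed on the server.
    ScoreAwarded { player_id: u64, points: u32 },
    /// A server-side notification that carries no state change.
    Notice { player_id: u64, message: String },
}

impl GameEvent {
    /// Every variant carries a player id, so routing can be uniform.
    pub fn player_id(&self) -> u64 {
        match self {
            GameEvent::ClueFound { player_id, .. }
            | GameEvent::ScoreAwarded { player_id, .. }
            | GameEvent::Notice { player_id, .. } => *player_id,
        }
    }

    /// Only some events change game state; notifications do not.
    pub fn mutates_state(&self) -> bool {
        !matches!(self, GameEvent::Notice { .. })
    }
}
```

With a closed enum like this, the compiler flags any `match` that forgets a variant, which is one way to keep type-specific processing from silently drifting as new event types appear.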

Our first challenge was to design an event processing system that could handle this influx of data without breaking a sweat.

What We Tried First (And Why It Failed)

We initially tried to use a traditional message queue like RabbitMQ, thinking that it would provide a simple, fault-tolerant way to handle events. We set up a producer-consumer pattern, where our game servers would produce events and the message queue would handle the routing and buffering for our worker nodes.

However, as soon as we started testing this setup, we hit issues with event ordering and serialization. Because our game state was sensitive to the sequence of events, we couldn't afford to lose events or process them out of order. The message queue quickly became a bottleneck, and our worker nodes struggled to keep up with the pace of incoming events.
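The ordering requirement can be illustrated with a small reorder buffer. This is a sketch under an assumption the post does not state: that each event carries a per-stream sequence number starting at 0. `ReorderBuffer` is a hypothetical name, not something from our codebase or from RabbitMQ.

```rust
use std::collections::BTreeMap;

/// Releases items strictly in sequence order, buffering early arrivals.
pub struct ReorderBuffer<T> {
    next_seq: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> ReorderBuffer<T> {
    pub fn new() -> Self {
        ReorderBuffer { next_seq: 0, pending: BTreeMap::new() }
    }

    /// Accepts one item and returns every item that is now deliverable, in order.
    /// Already-delivered sequence numbers and buffered duplicates are ignored.
    pub fn push(&mut self, seq: u64, item: T) -> Vec<T> {
        if seq >= self.next_seq {
            self.pending.entry(seq).or_insert(item);
        }
        let mut ready = Vec::new();
        while let Some(item) = self.pending.remove(&self.next_seq) {
            ready.push(item);
            self.next_seq += 1;
        }
        ready
    }

    /// Number of items still waiting for a gap to be filled.
    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```

Pushing sequence 1 before sequence 0 returns nothing at first, then both items in order once 0 arrives. The trade-off is that a single missing event stalls everything behind it, so a real system would also need a timeout or a re-request path.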

We also discovered the perils of message queue partitioning: different types of events ended up in the same queue, or even the same partition. This led to a nightmare of event reprocessing and inconsistent game state updates, and our team spent hours debugging and fixing issues related to event ordering and partitioning.

We realized that a message queue was not the silver bullet we thought it was.

The Architecture Decision

After weeks of experimentation and soul-searching, we decided to take a different approach. We designed an in-memory event processing system in our go-to language, Rust. We chose an event-sourced architecture, in which the game state is derived from the event stream rather than stored directly.
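The core idea of event sourcing, state as a fold over the event log, fits in a few lines. The sketch below assumes a toy scoreboard; `ScoreEvent` and `ScoreBoard` are hypothetical names chosen for illustration and do not reflect our real game state.

```rust
use std::collections::HashMap;

/// Hypothetical events that affect a player's score.
#[derive(Debug, Clone)]
pub enum ScoreEvent {
    ClueFound { player_id: u64, points: u32 },
    PenaltyApplied { player_id: u64, points: u32 },
}

/// State is never written directly; it is only produced by applying events.
#[derive(Debug, Default, PartialEq)]
pub struct ScoreBoard {
    pub scores: HashMap<u64, u32>,
}

impl ScoreBoard {
    /// Applies a single event to the current state.
    pub fn apply(&mut self, event: &ScoreEvent) {
        match *event {
            ScoreEvent::ClueFound { player_id, points } => {
                let score = self.scores.entry(player_id).or_insert(0);
                *score = score.saturating_add(points);
            }
            ScoreEvent::PenaltyApplied { player_id, points } => {
                let score = self.scores.entry(player_id).or_insert(0);
                *score = score.saturating_sub(points);
            }
        }
    }

    /// Rebuilds the state from scratch by replaying the whole log in order.
    pub fn replay(log: &[ScoreEvent]) -> ScoreBoard {
        let mut board = ScoreBoard::default();
        for event in log {
            board.apply(event);
        }
        board
    }
}
```

Because `replay` is deterministic, replaying the same log twice yields identical state. That property is also why event order matters so much: the penalty here saturates at zero, so swapping a penalty and a reward can change the final score.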

Our system used a graph database to store the relationships between events, allowing us to efficiently retrieve and update the game state. We also implemented a streaming event processor that could handle multiple event types concurrently, using a combination of asynchronous I/O and parallel processing.
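One common way to process events in parallel without breaking per-player ordering is to shard by player, so that all events for one player land on the same worker. The sketch below is a simplification and an assumption on my part: it uses standard-library threads and channels rather than asynchronous I/O, and `process_sharded` is a hypothetical function, not our production processor.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Processes `(player_id, clue_id)` events on `workers` threads and returns,
/// per player, the clue ids in the order that player's events were submitted.
pub fn process_sharded(events: Vec<(u64, u32)>, workers: usize) -> HashMap<u64, Vec<u32>> {
    assert!(workers > 0, "need at least one worker");
    let mut senders = Vec::with_capacity(workers);
    let mut handles = Vec::with_capacity(workers);

    for _ in 0..workers {
        let (tx, rx) = mpsc::channel::<(u64, u32)>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Each worker owns its shard's state, so no locks are needed.
            let mut local: HashMap<u64, Vec<u32>> = HashMap::new();
            for (player_id, clue_id) in rx {
                local.entry(player_id).or_insert_with(Vec::new).push(clue_id);
            }
            local
        }));
    }

    // Route by player id: one player always maps to the same worker,
    // and each channel is FIFO, so per-player order is preserved.
    for (player_id, clue_id) in events {
        let shard = (player_id % workers as u64) as usize;
        senders[shard].send((player_id, clue_id)).expect("worker hung up");
    }
    drop(senders); // closing the channels lets the workers finish

    // Shards hold disjoint player ids, so merging cannot overwrite anything.
    let mut merged = HashMap::new();
    for handle in handles {
        merged.extend(handle.join().expect("worker panicked"));
    }
    merged
}
```

Ordering is only guaranteed within a player here, not across players. For a game where state is mostly per-player that is usually enough, but any cross-player rule would need extra coordination.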

Our event processing system became a distributed, fault-tolerant, and scalable component of our overall architecture. We could process thousands of events per second while maintaining the required level of consistency and accuracy.

What The Numbers Said After

As we deployed our new event processing system, we immediately saw a significant reduction in event latency – from milliseconds to microseconds. We achieved this by minimizing the number of database queries, reducing memory allocation, and parallelizing event processing.

Here are some telling numbers: our event processor achieved an average latency of 10 microseconds, with a maximum latency of 50 microseconds. We also saw a significant reduction in event reprocessing – down from 15% to less than 1%. Our system was now able to handle millions of events per minute without losing a single one.
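For readers who want to collect comparable numbers, here is a minimal sketch of measuring per-event handler latency and reducing it to an average and a maximum. It is an assumption about method, since this post does not describe our measurement setup, and `time_each` and `summarize` are hypothetical helpers.

```rust
use std::time::{Duration, Instant};

/// Runs `handler` on each item and records how long each call took.
pub fn time_each<T>(items: &[T], mut handler: impl FnMut(&T)) -> Vec<Duration> {
    items
        .iter()
        .map(|item| {
            let start = Instant::now();
            handler(item);
            start.elapsed()
        })
        .collect()
}

/// Returns `(average, maximum)` for the samples, or `None` if there are none.
pub fn summarize(samples: &[Duration]) -> Option<(Duration, Duration)> {
    let max = samples.iter().max().copied()?; // None when the slice is empty
    let total: Duration = samples.iter().sum();
    // Sketch-level shortcut: assumes fewer than 2^32 samples.
    Some((total / samples.len() as u32, max))
}
```

At microsecond scale, the timer calls themselves add measurable overhead, and an average hides tail behavior, so a serious benchmark would also report percentiles and use a dedicated benchmarking harness.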

What I Would Do Differently

As I reflect on our journey, there are several things I would do differently if I had to design a treasure hunt engine again. I would spend more time up front designing the event structure and schema, rather than adapting to changing requirements on the fly.

I would also invest more in testing and validation for our event processing system, rather than relying on manual debugging and fixes. With a more robust testing suite, we could have caught many of the issues we encountered along the way, preventing costly delays and rework.
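As one example of the kind of automated check I have in mind, the sketch below tests that redelivered events do not change the result. Everything in it is hypothetical: `Ledger` is a stand-in for real game state, and it assumes events carry increasing sequence numbers.

```rust
/// Hypothetical state that ignores any sequence number it has already applied.
#[derive(Debug, Default, PartialEq, Clone)]
pub struct Ledger {
    last_seq: Option<u64>,
    pub total: i64,
}

impl Ledger {
    /// Applies the delta unless `seq` is a replay; returns whether it was applied.
    pub fn apply(&mut self, seq: u64, delta: i64) -> bool {
        if self.last_seq.map_or(false, |last| seq <= last) {
            return false;
        }
        self.last_seq = Some(seq);
        self.total += delta;
        true
    }
}

/// Property check: delivering every event twice must not change the outcome.
pub fn duplicates_are_harmless(log: &[(u64, i64)]) -> bool {
    let mut once = Ledger::default();
    let mut twice = Ledger::default();
    for &(seq, delta) in log {
        once.apply(seq, delta);
        twice.apply(seq, delta);
        twice.apply(seq, delta); // simulated redelivery
    }
    once == twice
}
```

A check like this is cheap to run on every commit, and feeding it randomly generated logs would exercise far more orderings than manual debugging ever could.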

Lastly, I would not be afraid to say no to certain features or requirements that compromise our core goals. As engineers, we often get caught up in trying to please everyone, but sometimes, the best approach is to prioritize the needs of the system over the desires of stakeholders.

That's a lesson I learned the hard way, working on the Veltrix treasure hunt engine.