Choosing the Right Event Engine for Millions of Players - Lessons from a Burnt-Out Config

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our users were already seeing poor latency: event processing added up to 200ms of overhead under the sheer volume of events. With millions of players already in the game, the pressure was on to deliver a smooth experience. The default configuration of our event engine was not only hurting performance, it also lacked event correlation, which meant we were missing crucial business data.

What We Tried First (And Why It Failed)

Initially, we tried to fix the performance issues by tweaking the configuration of our existing event engine. We adjusted the buffer size, changed the message queue depth, and experimented with the thread pool size. No matter what we did, performance barely moved. The reason was that the engine was fundamentally flawed in its design: it could only process events at a very low rate, and no amount of buffering changes that. To make matters worse, the engine would often crash and take our entire system down with it.
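To see why buffer tuning could not rescue an engine with a low processing rate, here is a back-of-the-envelope sketch. The function name and the rates used with it are hypothetical illustrations, not measurements from our system.

```rust
/// Seconds until a bounded buffer of `capacity` events overflows, given the
/// arrival and service rates in events per second.
/// Returns `None` when the consumer keeps up (service >= arrival), in which
/// case the buffer never fills.
fn seconds_until_full(capacity: u64, arrival_rate: f64, service_rate: f64) -> Option<f64> {
    // Net number of events piling up in the buffer each second.
    let backlog_growth = arrival_rate - service_rate;
    if backlog_growth <= 0.0 {
        None
    } else {
        Some(capacity as f64 / backlog_growth)
    }
}
```

With an assumed 50,000 events/s arriving and 40,000 events/s processed, a 10,000-event buffer overflows after 1 second and a 100,000-event buffer after 10 seconds. A bigger buffer only delays the overflow; the only real fix is raising the service rate.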

The Architecture Decision

After much analysis and discussion with my team, we decided to switch to a more robust, fault-tolerant event engine designed for high event volumes. We chose Apache Kafka for several reasons: its high throughput, its distributed architecture, and its support for event correlation and aggregation.
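Event correlation here means tying related events together through a shared key. The sketch below shows the idea in plain Rust with no Kafka client involved. The `Event` struct, its fields, and `correlate_by_session` are assumptions made for illustration, not our production schema.

```rust
use std::collections::HashMap;

/// Hypothetical game event; the field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    session_id: String, // correlation key shared by related events
    kind: String,       // for example "login" or "purchase"
    timestamp_ms: u64,
}

/// Groups events by session and orders each group by timestamp, so that
/// everything that happened in one session can be analysed together.
fn correlate_by_session(events: &[Event]) -> HashMap<String, Vec<Event>> {
    let mut sessions: HashMap<String, Vec<Event>> = HashMap::new();
    for event in events {
        sessions
            .entry(event.session_id.clone())
            .or_default()
            .push(event.clone());
    }
    for group in sessions.values_mut() {
        group.sort_by_key(|e| e.timestamp_ms);
    }
    sessions
}
```

In Kafka, a similar effect is commonly achieved by using the correlation key as the message key, since records with the same key are routed to the same partition and keep their relative order there.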

We ran Kafka on a cluster of 10 machines, each with 16 cores and 64GB of RAM. We then set up a fanout strategy so that multiple consumers could process the same events independently, which made processing more efficient and easier to scale. Additionally, we implemented a circuit breaker to prevent cascading failures if one of the consumers failed.
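For readers unfamiliar with the pattern, a circuit breaker can be sketched as a small state machine. This is a minimal, single-threaded illustration and not our production implementation. The type names, the failure threshold, and the cooldown are all assumptions.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    /// Calls flow normally; counts consecutive failures.
    Closed { failures: u32 },
    /// Calls are rejected until the cooldown has elapsed.
    Open { since: Instant },
    /// Cooldown elapsed; calls are let through to probe the consumer.
    HalfOpen,
}

struct CircuitBreaker {
    state: State,
    failure_threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self { state: State::Closed { failures: 0 }, failure_threshold, cooldown }
    }

    /// Returns true if a call may proceed at time `now`.
    /// Simplification: the half-open state admits every call, not just one probe.
    fn allow(&mut self, now: Instant) -> bool {
        match self.state {
            State::Closed { .. } | State::HalfOpen => true,
            State::Open { since } => {
                if now.duration_since(since) >= self.cooldown {
                    self.state = State::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// A successful call closes the breaker and resets the failure count.
    fn record_success(&mut self) {
        self.state = State::Closed { failures: 0 };
    }

    /// A failed call trips the breaker once the threshold is reached, or
    /// immediately if the failure happened while probing in the half-open state.
    fn record_failure(&mut self, now: Instant) {
        self.state = match self.state {
            State::Closed { failures } if failures + 1 < self.failure_threshold => {
                State::Closed { failures: failures + 1 }
            }
            _ => State::Open { since: now },
        };
    }
}
```

A consumer wrapper would call `allow` before handing an event to a downstream service, then `record_success` or `record_failure` depending on the outcome. Passing `now` in explicitly keeps the logic easy to test without sleeping.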

What The Numbers Said After

After switching to Kafka, event processing time dropped from 200ms to below 10ms, and overall system latency decreased by an average of 50%. Crashes fell from an average of 10 per day (roughly 70 per week) to about 1 per week.

What I Would Do Differently

In retrospect, I would have switched to a more robust event engine sooner. I would also have spent more time investigating the root cause of the performance issues instead of just tweaking the configuration of our existing engine. Finally, I would have put proper monitoring and logging in place from the beginning, so that we could quickly identify and diagnose performance problems.
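As an example of the kind of monitoring I mean, even a tiny latency recorder that reports percentiles would have shown us tail latency early. This is a minimal in-memory sketch. `LatencyRecorder` is a hypothetical name, and a real deployment would more likely use an established metrics library with bounded memory.

```rust
/// Collects per-event processing latencies (in milliseconds) and reports
/// percentiles over everything recorded so far.
#[derive(Default)]
struct LatencyRecorder {
    samples_ms: Vec<f64>,
}

impl LatencyRecorder {
    fn record(&mut self, latency_ms: f64) {
        self.samples_ms.push(latency_ms);
    }

    /// Nearest-rank percentile for `p` in 0..=100.
    /// Returns `None` if nothing has been recorded yet.
    fn percentile(&self, p: f64) -> Option<f64> {
        if self.samples_ms.is_empty() {
            return None;
        }
        let mut sorted = self.samples_ms.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        // Nearest rank: ceil(p * n / 100), clamped to the range 1..=n.
        let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
        let index = rank.clamp(1, sorted.len()) - 1;
        Some(sorted[index])
    }
}
```

Tracking a high percentile such as p99 alongside the median matters because an average can look healthy while a small share of events is still very slow.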

Looking back, the choice of event engine was a critical decision with a significant impact on the performance and reliability of our system. It's a reminder that a decision that looks small, such as keeping a default configuration, can have far-reaching consequences, and that it's better to spend time upfront getting things right than to patch problems later.