Mismanaging the Treasure Hunt Engine in Hytale Servers Will Get You Killed

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

It was our third server migration in three months, and our newly minted engineer, Alex, was frantically calling me from his desk. "Our server's lagging, Chris, and it's all because of the event handling!" His words were laced with panic. The users were complaining about slow response times and dropped events. We'd deployed the standard out-of-the-box event handling configuration from Veltrix, but somehow it wasn't coping with our rapidly expanding server.

As I analyzed our Prometheus metrics, my anxiety only grew: we were at 98% CPU utilization, and the error rate for our events service had shot up to 2.5%. We needed to act fast before our users deserted us in droves.

What We Tried First (And Why It Failed)

We opted for the time-tested 'EventBatching' approach to take some pressure off our CPU. The technique is straightforward: the client batches events together before sending them to our server. In theory, this reduces the number of requests the server has to handle, and with it the per-request overhead, cutting CPU usage. In practice, it made the root cause harder to find. Without real-time event processing, we struggled to pinpoint which events were overwhelming the server.
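To make the batching idea concrete, here is a minimal sketch of a client-side batcher that flushes on a size limit or an age limit. It is illustrative only and is not the Veltrix implementation; `EventBatcher`, `max_size`, `max_age_s`, and the `send` callback are all assumed names.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]


class EventBatcher:
    """Buffers events and sends them in one call once a size or age limit is hit."""

    def __init__(
        self,
        send: Callable[[List[Event]], None],   # hypothetical transport callback
        max_size: int = 50,                    # assumed default, tune per workload
        max_age_s: float = 0.25,               # assumed default, tune per workload
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_size = max_size
        self._max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[Event] = []
        self._oldest: Optional[float] = None   # arrival time of the first buffered event

    def add(self, event: Event) -> None:
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(event)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes if the oldest buffered event is too old."""
        if self._buffer and self._clock() - self._oldest >= self._max_age_s:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._oldest = None
        self._send(batch)
```

Adding seven events with `max_size=3` and then calling `flush()` results in three sends rather than seven. The age limit is the knob that trades CPU for latency: the longer events are allowed to sit in the buffer, the fewer requests the server sees and the later it sees them.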

A few days into the experiment, we hit a second problem: client-side batching delayed event processing so much that the average latency of the events service reached around 10 seconds.

The Architecture Decision

We decided to shift to a more scalable design based on Event Sourcing. Instead of processing every event synchronously on the game server, we persisted events to our storage layer, AWS DynamoDB, and handled the expensive processing downstream, decoupled from the CPU-intensive work on the server. This approach not only solved our CPU utilization problem but also let us handle a very large number of events without significant delays.
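The core of Event Sourcing is an append-only log from which current state is rebuilt by replay. The sketch below shows that idea with an in-memory list standing in for a durable table; the event kinds (`treasure_found`, `treasure_lost`) and the `project_scores` projection are hypothetical names chosen for illustration, not details from our system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    player_id: str
    kind: str      # hypothetical event kind, e.g. "treasure_found"
    amount: int
    seq: int       # position in the log, assigned on append


@dataclass
class EventStore:
    """Append-only log; an in-memory stand-in for a durable table."""

    _log: List[Event] = field(default_factory=list)

    def append(self, player_id: str, kind: str, amount: int) -> Event:
        event = Event(player_id, kind, amount, seq=len(self._log))
        self._log.append(event)  # events are never updated or deleted
        return event

    def read_from(self, seq: int = 0) -> List[Event]:
        """Return a copy of the log starting at the given sequence number."""
        return list(self._log[seq:])


def project_scores(events: List[Event]) -> Dict[str, int]:
    """Rebuild current state (score per player) by replaying events in order."""
    scores: Dict[str, int] = {}
    for e in events:
        if e.kind == "treasure_found":
            scores[e.player_id] = scores.get(e.player_id, 0) + e.amount
        elif e.kind == "treasure_lost":
            scores[e.player_id] = scores.get(e.player_id, 0) - e.amount
    return scores
```

Because writes are plain appends, the write path stays cheap, and any consumer can recompute its own view by reading the log from a sequence number of its choosing.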

We used DynamoDB Streams with a fan-out queue pattern to deliver the stored events in batches to downstream consumers for processing. This gave our users a seamless real-time experience without compromising the high availability and scalability they expected from us.
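As a rough sketch of the fan-out step, the handler below takes a batch in the record shape DynamoDB Streams delivers (`Records`, `eventName`, `dynamodb.NewImage`) and copies each inserted item to every consumer queue. The queue names, the `handle_stream_batch` function, and the in-memory deques are assumptions for illustration; a real deployment would write to actual queues, and the stream would need to be configured to include new images.

```python
from collections import deque
from decimal import Decimal
from typing import Any, Deque, Dict

Queues = Dict[str, Deque[Dict[str, Any]]]


def decode_image(image: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Decode only string (S) and number (N) attributes; other types are skipped."""
    out: Dict[str, Any] = {}
    for name, typed in image.items():
        if "S" in typed:
            out[name] = typed["S"]
        elif "N" in typed:
            out[name] = Decimal(typed["N"])  # numbers arrive as strings
    return out


def handle_stream_batch(stream_event: Dict[str, Any], queues: Queues) -> int:
    """Copy every newly inserted item to each consumer queue; return the item count."""
    count = 0
    for record in stream_event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # an event log only appends, so MODIFY/REMOVE are ignored here
        image = record.get("dynamodb", {}).get("NewImage")
        if image is None:
            continue  # stream view type does not include the new image
        item = decode_image(image)
        for queue in queues.values():
            queue.append(item)  # each consumer gets its own copy of the reference
        count += 1
    return count


# Hypothetical consumers, each draining its own queue at its own pace.
example_queues: Queues = {"leaderboard": deque(), "analytics": deque()}
```

Giving each consumer its own queue means a slow consumer only backs up its own queue rather than stalling the others.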

What The Numbers Said After

Our server CPU utilization dropped by 25%, and the error rate for our events service fell from 2.5% to 0.2%. This gave us much-needed breathing room to optimize the system further. Latency for the events service fell to 200 ms, down from around 10 seconds under client-side batching, giving users a seamless gaming experience.

What I Would Do Differently

Looking back, I would stress test the EventBatching approach more thoroughly before committing to it as the primary solution. We also never benchmarked the client-side batching up front; doing so would have given us a clearer picture of its performance implications.
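A benchmark of that kind could start as a simple simulation before any load test. The sketch below estimates how long events wait when a client flushes on a fixed timer; the function name, the millisecond units, and the fixed-interval flush model are all assumptions, and it models only the wait in the client buffer, not network or server time.

```python
from typing import List


def batching_delays_ms(arrivals_ms: List[int], flush_interval_ms: int) -> List[int]:
    """Time each event waits for the next timed flush (flushes at k * interval).

    Modelling choice: an event arriving exactly on a flush boundary is treated
    as having just missed that flush, so it waits one full interval.
    """
    delays = []
    for t in arrivals_ms:
        next_flush = (t // flush_interval_ms + 1) * flush_interval_ms
        delays.append(next_flush - t)
    return delays


def mean(values: List[int]) -> float:
    return sum(values) / len(values)


# One event every 100 ms for 10 seconds, flushed every 2 seconds.
arrivals = list(range(0, 10_000, 100))
delays = batching_delays_ms(arrivals, flush_interval_ms=2_000)
average_wait_ms = mean(delays)   # roughly half the flush interval
worst_wait_ms = max(delays)      # one full flush interval
```

For steady traffic the average added wait comes out near half the flush interval and the worst case is a full interval, so the flush interval alone sets a floor on latency that no amount of server capacity can remove.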

In hindsight, while the Event Sourcing approach was a resounding success, it would be worth exploring other designs and tools, such as an event mesh or Kafka Streams, that could further improve our system's performance and scalability. Evaluating them early would help us avoid bottlenecks and anticipate future requirements before they become critical.