Most Hytale Servers Are Designed to Fail: The Treasure Hunt Engine Antipattern

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We were building a scalable, fault-tolerant, event-driven system that could handle the spike in player concurrency predicted for the game's launch. Our goals were low latency, high throughput, and a seamless player experience. That meant designing an event processing pipeline able to absorb the high-volume, high-velocity events the game generates.

What We Tried First (And Why It Failed)

Our initial approach used a monolithic event bus, which looked like a straightforward way to manage event distribution. We ran a single event bus instance per server node, and each node processed events and forwarded them to the next node in the pipeline. Sounds simple, right? In practice, we ran into problems with event ordering, guaranteed delivery, and overall reliability. In particular, events were lost during node failures and network partitions, which we couldn't tolerate given our service level agreements.

The shortcomings fell into three groups: event bus restarts caused significant latency spikes, out-of-order events corrupted data, and the single bus instance on each node was a single point of failure for every event passing through that node. We needed a more scalable and robust architecture for handling events.
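To make the failure mode concrete, here is a minimal sketch of a per-node in-memory bus that forwards events to the next node. It is not our production code: the `NodeBus` class, the node names, and the event names are all hypothetical, and the "crash" is simulated by clearing the buffer.

```python
from collections import deque


class NodeBus:
    """Hypothetical single in-memory event bus owned by one server node."""

    def __init__(self, name):
        self.name = name
        self.buffer = deque()   # events live only in this process's memory
        self.next_node = None   # next hop in the pipeline, if any
        self.delivered = []     # events that reached the end of the pipeline

    def publish(self, event):
        self.buffer.append(event)

    def forward_one(self):
        """Move one buffered event to the next node, or deliver it locally."""
        if not self.buffer:
            return
        event = self.buffer.popleft()
        if self.next_node is None:
            self.delivered.append(event)
        else:
            self.next_node.publish(event)

    def crash(self):
        """Simulate a restart: anything still buffered is gone."""
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


def run_demo():
    first, last = NodeBus("node-a"), NodeBus("node-b")
    first.next_node = last
    for i in range(5):
        first.publish(f"chest-opened-{i}")
    first.forward_one()
    first.forward_one()     # two events reach node-b
    last.forward_one()      # one event is delivered
    lost = first.crash()    # node-a restarts with three events still buffered
    while last.buffer:
        last.forward_one()
    return last.delivered, lost


if __name__ == "__main__":
    delivered, lost = run_demo()
    print("delivered:", delivered)
    print("lost:", lost)
```

Nothing in this design records which events were in flight, so after the restart there is no way to replay the three that were lost.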

The Architecture Decision

We decided to adopt a distributed event processing architecture composed of multiple event handlers, each responsible for a specific event type. For ordering and guaranteed delivery, we implemented a two-phase commit (2PC) protocol between event handlers, with Apache Kafka as our message broker. Each event handler was a stateless microservice, and Apache ZooKeeper managed handler configuration and coordination. Kafka preserves order only within a partition, so this relies on routing related events to the same partition. With that in place, events were processed in the correct order even through node failures and network partitions.
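The ordering and recovery ideas can be sketched without a real broker. The code below is an in-memory stand-in, not the Kafka client API and not our 2PC implementation: `PartitionedLog`, `Consumer`, the event types, and the `fail_on` crash switch are all assumptions made for illustration. It shows key-based partitioning, dispatch by event type, and committing an offset only after the handler succeeds.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """In-memory stand-in for a partitioned, append-only topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, event_type, payload):
        p = self.partition_for(key)
        self.partitions[p].append((key, event_type, payload))
        return p, len(self.partitions[p]) - 1   # (partition, offset)


class Consumer:
    """Reads each partition in order and dispatches by event type."""

    def __init__(self, log, handlers):
        self.log = log
        self.handlers = handlers
        self.committed = defaultdict(int)   # partition -> next offset to read

    def poll(self, fail_on=None):
        for p, records in enumerate(self.log.partitions):
            while self.committed[p] < len(records):
                key, event_type, payload = records[self.committed[p]]
                if payload == fail_on:
                    return False            # simulated crash before the commit
                self.handlers[event_type](key, payload)
                self.committed[p] += 1      # commit only after the handler succeeds
        return True


def run_demo():
    seen = defaultdict(list)
    handlers = {
        "clue_found": lambda key, payload: seen[key].append(payload),
        "chest_opened": lambda key, payload: seen[key].append(payload),
    }
    log = PartitionedLog(num_partitions=3)
    log.append("player-1", "clue_found", "clue-1")
    log.append("player-2", "clue_found", "clue-1")
    log.append("player-1", "clue_found", "clue-2")
    log.append("player-1", "chest_opened", "chest-1")
    consumer = Consumer(log, handlers)
    consumer.poll(fail_on="clue-2")   # crashes partway through
    consumer.poll()                   # resumes from the committed offsets
    return dict(seen)


if __name__ == "__main__":
    print(run_demo())
```

Because every event for a player lands in the same partition, per-player order survives the simulated crash. In a real system a crash can also happen after the handler runs but before the commit, which causes redelivery, so handlers should be idempotent.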

What The Numbers Said After

With the distributed event processing architecture in place, average event handling latency dropped by 30%, throughput improved by 50%, and the Treasure Hunt Engine reached 99.99% uptime. The distributed architecture also let us scale the event handling pipeline independently of overall server load, so our game servers keep enough capacity to handle the high volume of events.
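To put the uptime figure in perspective, the arithmetic below converts an availability percentage into a downtime budget. The function name and the chosen periods are illustrative only. The calculation is generic and uses no measurements from our system beyond the 99.99% figure.

```python
def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of allowed downtime for a given availability over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for days in (30, 365):
        budget = downtime_budget_minutes(99.99, days)
        print(f"{days} days at 99.99%: {budget:.2f} minutes of downtime")
```

At 99.99%, that works out to about 4.32 minutes per 30 days and about 52.56 minutes per year.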

What I Would Do Differently

While the distributed event processing architecture was a significant improvement over the monolithic event bus, I would take it a step further and adopt a fully decentralized approach to event handling. With blockchain technology, we could remove the centralized event handlers and build a fault-tolerant, immutable event processing pipeline. The aim would be to keep events in the correct order without relying on a central authority. Any latency benefit would have to be measured, since distributed consensus carries its own cost. The Veltrix project continues to evolve, and this is a direction I'm eager to explore further.
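The "immutable" part of that idea can be illustrated with a hash-linked event log. This is only a sketch of the tamper-evidence property, assuming SHA-256 and JSON-serializable events. It is not a blockchain, since there is no consensus, networking, or replication, and the helper names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, event):
    """Hash an event together with the hash of the entry before it."""
    body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain, event):
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "event": event,
        "prev": prev_hash,
        "hash": entry_hash(prev_hash, event),
    })


def verify(chain):
    """Return True only if every entry still matches its recorded hash."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    chain = []
    append_event(chain, {"player": "player-1", "type": "clue_found"})
    append_event(chain, {"player": "player-1", "type": "chest_opened"})
    print("intact:", verify(chain))
    chain[0]["event"]["type"] = "chest_opened"   # rewrite history
    print("after tampering:", verify(chain))
```

Editing any earlier event changes its hash and breaks the link to every later entry, so tampering is detectable. Agreeing on a single chain across nodes is the hard part, and this sketch does not attempt it.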