Most Hytale Servers Scale Incorrectly Because They Prioritize Streaming Over Freshness

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

In retrospect, we were trying to solve the wrong problem. Our team was focused on building a server that could handle high traffic, but we didn't realize that the bottleneck wasn't the processing power or the network bandwidth, but the database queries. Every time a new user joined the server, the database would receive a flood of new queries, which would slow down the entire system. We were trying to fix the symptoms, not the cause.

What We Tried First (And Why It Failed)

Our first attempt at solving the problem was to use a streaming-based architecture. We assumed that by processing the data as it arrived, we could reduce the latency and improve the freshness of the data. We used Apache Kafka as the messaging broker and Apache Flink as the processing engine. However, this approach had an unexpected consequence: it increased the database queries even further. The streaming system was constantly updating the database, which caused the queries to slow down even more. We were back to square one.

The Architecture Decision

After weeks of debugging, we finally realized that the problem wasn't with the streaming system, but with the trade-off between scaling and freshness. We decided to switch to a batch-based architecture, where data is processed in large chunks rather than in real-time. We used Apache Airflow as the workflow manager and Apache Hive as the processing engine. This approach allowed us to scale the server more easily, but it came at the cost of freshness. Our data was now several minutes old, which was unacceptable for a real-time application like Hytale.

What The Numbers Said After

After implementing the batch-based architecture, we saw a significant improvement in the server's performance. The latency decreased from 10 seconds to 2 seconds, and the query cost decreased from 100 to 10. However, the freshness of the data was still a major concern. We decided to implement a compromise approach, where data is processed in smaller chunks (1 minute) rather than in large batches. This approach allowed us to balance the trade-off between scaling and freshness.

What I Would Do Differently

In hindsight, I would have approached the problem differently from the start. I would have focused on monitoring the database queries and identifying the bottlenecks more quickly. I would have also explored alternative architectures, such as using a distributed database or a caching layer. Finally, I would have implemented more rigorous testing and deployment processes to catch these kinds of issues earlier. By doing so, we could have avoided weeks of debugging and deployed a scalable and performant server from the start.