Most Hytale Servers Get Their Treasure Hunt Engine Wrong Because Their Docs Lie About Synchronization

#webdev #programming #rust #performance

The Problem We Were Actually Solving

It was a typical Friday morning when we received an alert that one of our Hytale servers was experiencing latency spikes. We had been monitoring the server closely, and we suspected the Treasure Hunt Engine (TBE), a crucial component that generates treasure maps for players. The server was handling around 5,000 concurrent players, and our logs pointed to the TBE as a significant bottleneck. Our goal was to optimize the TBE so the server could scale cleanly under that load.

What We Tried First (And Why It Failed)

Our initial approach was to add TBE worker threads to handle the high concurrency. We raised the thread count from 10 to 20, expecting the load to spread more evenly. It only partially helped: we still saw significant latency spikes during peak hours. Our logs indicated that the thread pool was getting overwhelmed and that the threads were spending more time waiting on locks than processing requests, so the extra workers were mostly queuing behind the same locks.
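To make the failure mode concrete, here is a minimal sketch of workers sharing one global lock. It is not the TBE's code: `MapStore`, `run_with_global_lock`, and the sleep that stands in for database work are all hypothetical, and the only point is that the work done while holding the lock runs one request at a time regardless of the worker count.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the shared state the workers contend on.
struct MapStore {
    generated: u64,
}

/// Runs `requests` simulated map generations across `workers` threads,
/// all guarded by one global mutex. Returns (maps generated, elapsed time).
/// Assumes `workers > 0` and that `requests` divides evenly by `workers`.
fn run_with_global_lock(workers: usize, requests: usize, hold: Duration) -> (u64, Duration) {
    assert!(workers > 0, "need at least one worker");
    let store = Arc::new(Mutex::new(MapStore { generated: 0 }));
    let per_worker = requests / workers;
    let start = Instant::now();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // Every worker must take the same lock before doing any work.
                    let mut guard = store.lock().unwrap();
                    // Simulated work performed while the lock is held.
                    thread::sleep(hold);
                    guard.generated += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = store.lock().unwrap().generated;
    (total, start.elapsed())
}
```

Calling `run_with_global_lock(10, 40, hold)` and `run_with_global_lock(20, 40, hold)` does the same 40 units of work either way, and neither call can finish in less than 40 times `hold`, because the lock admits one worker at a time. Doubling the workers only changes how many of them are waiting.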

The Architecture Decision

It wasn't until we dug into the Veltrix configuration layer that we found our mistake. The Veltrix layer determines how the TBE interacts with the underlying database, and by default it was configured to use a synchronized locking mechanism. Only one thread could access the database at a time, so every request queued behind the same lock no matter how many workers we added. We switched to an asynchronous locking mechanism, which allowed multiple threads to access the database concurrently. We also enabled parallel processing of requests, which significantly reduced the number of database queries.
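We can't reproduce the Veltrix setting here, so the sketch below swaps in a different, well-known technique to illustrate the same idea: replacing an exclusive lock with a reader-writer lock (`std::sync::RwLock`) so lookups can proceed concurrently. `MapTable`, `concurrent_lookups`, and `store_map` are hypothetical names, and an in-memory `HashMap` stands in for the database.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Barrier, RwLock};
use std::thread;

/// Hypothetical in-memory stand-in for the map data behind the engine.
type MapTable = HashMap<u32, String>;

/// Spawns `readers` threads that all hold a read guard at the same time.
/// The barrier releases only once every thread is inside the read section,
/// so this function returns only if the lock admits concurrent readers.
/// (With an exclusive lock in its place, the threads would block forever.)
fn concurrent_lookups(
    table: Arc<RwLock<MapTable>>,
    readers: usize,
    key: u32,
) -> Vec<Option<String>> {
    let barrier = Arc::new(Barrier::new(readers));

    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let table = Arc::clone(&table);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let guard = table.read().unwrap(); // shared access
                barrier.wait(); // wait until all readers hold the guard
                guard.get(&key).cloned()
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

/// Writes still take exclusive access, so updates stay consistent.
fn store_map(table: &RwLock<MapTable>, key: u32, map: &str) {
    table.write().unwrap().insert(key, map.to_string());
}
```

A reader-writer lock only helps when the workload is read-heavy. If most requests write, the threads end up serialized again and a different design (sharding the data, or moving the waiting into an async runtime) may fit better.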

What The Numbers Said After

After making these changes, we saw a dramatic improvement in the server's performance. Latency dropped by 70%, and throughput increased by 30%. We kept monitoring the server closely, and our logs showed that the TBE could now handle the high concurrency without significant delays. The number of database queries also fell significantly, which further supported our decision to switch to an asynchronous locking mechanism.

What I Would Do Differently

Looking back, our initial approach was naive: we increased the thread count without addressing the underlying synchronization issue. If I had to do it again, I would spend more time up front investigating the Veltrix configuration layer and its effect on the TBE's performance. I would also invest more time in testing and benchmarking different locking mechanisms before settling on one.
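For the benchmarking part, even a small harness would have saved us a round of guessing. The sketch below is one possible shape, not what we ran in production: `time_workload` and `compare_read_paths` are hypothetical helpers that time the same read-only workload under a `Mutex` and an `RwLock` from the standard library, with a plain vector standing in for real data.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `op` `iters` times on each of `threads` threads and
/// returns the wall-clock time for the whole batch.
fn time_workload<F>(threads: usize, iters: usize, op: F) -> Duration
where
    F: Fn() + Send + Sync + 'static,
{
    let op = Arc::new(op);
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let op = Arc::clone(&op);
            thread::spawn(move || {
                for _ in 0..iters {
                    (*op)();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    start.elapsed()
}

/// Times the same read-only workload under two candidate locks.
/// Returns (mutex time, rwlock time, checksum of all reads).
fn compare_read_paths(threads: usize, iters: usize) -> (Duration, Duration, u64) {
    let data: Vec<u64> = (0..1024).collect(); // hypothetical sample data
    let sink = Arc::new(AtomicU64::new(0)); // keeps the reads observable

    let mutexed = Arc::new(Mutex::new(data.clone()));
    let s = Arc::clone(&sink);
    let mutex_time = time_workload(threads, iters, move || {
        let sum: u64 = mutexed.lock().unwrap().iter().sum();
        s.fetch_add(sum, Ordering::Relaxed);
    });

    let shared = Arc::new(RwLock::new(data));
    let s = Arc::clone(&sink);
    let rwlock_time = time_workload(threads, iters, move || {
        let sum: u64 = shared.read().unwrap().iter().sum();
        s.fetch_add(sum, Ordering::Relaxed);
    });

    (mutex_time, rwlock_time, sink.load(Ordering::Relaxed))
}
```

The timings depend heavily on the machine, the thread count, and how long each operation holds the lock, so treat the two durations as something to measure against a realistic workload rather than a verdict on either lock. The checksum is there to confirm both variants did the same amount of work.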