pretty ncube

Trouble in Treasure Island: A Tale of Woe on Hytale Servers

The Problem We Were Actually Solving

At first glance, the task seemed straightforward: build a treasure hunt engine that could efficiently manage millions of players and numerous treasure locations. As we dug deeper, though, it became apparent that our main concern wasn't the total number of players or the complexity of the game logic. It was the server's ability to scale while keeping latency low. Our concrete target was an architecture that could handle 10,000 concurrent connections without players experiencing noticeable delays when interacting with the game world.

What We Tried First (And Why It Failed)

Our initial approach was to use a language and framework we were familiar with, C# and ASP.NET, to build the treasure hunt engine. We leaned on Entity Framework and caching to try to mitigate the performance issues. However, as the system grew, we ran into a fundamental problem: GC overhead was substantial, and memory usage skyrocketed. What started as a simple C# console application turned into a memory-hungry behemoth, with frequent GC pauses, 20-second response times, and numerous OutOfMemory exceptions. We were stuck in an endless cycle of tweaking, trying to claw back response time wherever we could.

The Architecture Decision

We realized that our approach was fundamentally flawed. C# and ASP.NET were, and still are, excellent choices for many applications, but they weren't well suited to this particular use case. That's when we decided to rewrite the treasure hunt engine in Rust. Rust's focus on performance, concurrency, and memory safety, without a garbage collector, made it an attractive choice for a system that needed raw speed and reliability. We moved to Tokio for async concurrency and used Redis as our caching layer. The picture changed quickly.
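To make the caching-layer idea concrete, here is a minimal cache-aside sketch. It is not the code from our engine: the `Treasure` struct, the `TtlCache` type, and the `get_or_load` method are hypothetical names, and a plain in-memory `HashMap` stands in for Redis so the example runs with the standard library alone.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical treasure record; the fields are illustrative only.
#[derive(Clone, Debug, PartialEq)]
pub struct Treasure {
    pub id: u64,
    pub x: i32,
    pub z: i32,
}

/// In-memory stand-in for the cache layer (Redis in the post).
pub struct TtlCache {
    ttl: Duration,
    entries: HashMap<u64, (Treasure, Instant)>,
}

impl TtlCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Cache-aside read: return a cached value that is still fresh,
    /// otherwise call `load` (the slow backing store) and remember the result.
    /// A miss in the backing store (`None`) is not cached.
    pub fn get_or_load<F>(&mut self, id: u64, load: F) -> Option<Treasure>
    where
        F: FnOnce(u64) -> Option<Treasure>,
    {
        if let Some((treasure, stored_at)) = self.entries.get(&id) {
            if stored_at.elapsed() < self.ttl {
                return Some(treasure.clone());
            }
        }
        let loaded = load(id)?;
        self.entries.insert(id, (loaded.clone(), Instant::now()));
        Some(loaded)
    }
}
```

Calling `get_or_load` twice for the same id within the TTL should invoke the loader only once; the second read is served from the cache. With a real Redis client the shape stays the same, but the lookup and the store become network calls.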

What The Numbers Said After

After the rewrite, our metrics told a different story. GC pauses disappeared, and the memory usage dropped significantly. Our response time decreased to under 1 millisecond, with average latency sitting at 50 milliseconds. More importantly, our server could now handle the 10,000 concurrent connections we initially aimed for without breaking a sweat. Redis proved to be a game-changer, allowing us to offload frequent queries and focus on complex logic in the back-end.
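The two latency figures above (under 1 millisecond and a 50 millisecond average) are quoted without saying how each was measured, so it helps to report several statistics side by side. The following is a small sketch of how that could be done; `LatencySummary`, `summarize`, and the nearest-rank `percentile` helper are names made up for this example, not part of our tooling.

```rust
use std::time::Duration;

/// Summary statistics for a batch of measured request latencies.
#[derive(Debug, PartialEq)]
pub struct LatencySummary {
    pub min: Duration,
    pub mean: Duration,
    pub p50: Duration,
    pub p99: Duration,
    pub max: Duration,
}

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
fn percentile(sorted: &[Duration], pct: usize) -> Duration {
    // rank = ceil(pct / 100 * n), clamped to 1..=n
    let rank = (pct * sorted.len() + 99) / 100;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Returns `None` when there are no samples to summarize.
pub fn summarize(samples: &[Duration]) -> Option<LatencySummary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let total: Duration = sorted.iter().sum();
    Some(LatencySummary {
        min: sorted[0],
        mean: total / sorted.len() as u32,
        p50: percentile(&sorted, 50),
        p99: percentile(&sorted, 99),
        max: sorted[sorted.len() - 1],
    })
}
```

Reporting the minimum, the mean, and the tail percentiles together makes it clear which number a claim like "under 1 millisecond" refers to.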

What I Would Do Differently

If I were to redo the project today, I'd make a few adjustments. First, I'd still choose Rust, but I'd evaluate other async crates, such as async-std and async-openssl, to see whether they could improve performance further. Second, I'd implement a more aggressive, tiered caching strategy to further reduce the load on the back-end. Lastly, I'd invest more time in load testing and stress testing to confirm that the system holds up under extreme workloads.
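As a rough illustration of what tiered caching could look like, here is a two-tier sketch: a small in-process map in front of a shared cache, with the back-end consulted only when both miss. Everything here is an assumption for the sake of the example: the `SharedCache` trait, the `FakeShared` stand-in (which would be Redis in practice), and the `TieredCache` type are hypothetical, and the eviction policy is deliberately crude.

```rust
use std::collections::HashMap;

/// Second-tier cache interface; Redis would play this role in practice.
pub trait SharedCache {
    fn get(&mut self, key: &str) -> Option<String>;
    fn set(&mut self, key: &str, value: String);
}

/// In-memory stand-in for the shared tier, so the sketch runs without a server.
#[derive(Default)]
pub struct FakeShared {
    pub data: HashMap<String, String>,
    pub reads: usize,
}

impl SharedCache for FakeShared {
    fn get(&mut self, key: &str) -> Option<String> {
        self.reads += 1;
        self.data.get(key).cloned()
    }
    fn set(&mut self, key: &str, value: String) {
        self.data.insert(key.to_string(), value);
    }
}

pub struct TieredCache<S: SharedCache> {
    local: HashMap<String, String>,
    local_capacity: usize,
    pub shared: S,
}

impl<S: SharedCache> TieredCache<S> {
    pub fn new(shared: S, local_capacity: usize) -> Self {
        Self { local: HashMap::new(), local_capacity, shared }
    }

    /// Look in tier 1 (local), then tier 2 (shared), then fall back to
    /// `load` (the back-end). Values found lower down are promoted upward.
    pub fn get_or_load<F>(&mut self, key: &str, load: F) -> String
    where
        F: FnOnce() -> String,
    {
        if let Some(v) = self.local.get(key) {
            return v.clone();
        }
        let value = match self.shared.get(key) {
            Some(v) => v,
            None => {
                let v = load();
                self.shared.set(key, v.clone());
                v
            }
        };
        self.promote(key, value.clone());
        value
    }

    fn promote(&mut self, key: &str, value: String) {
        if self.local_capacity == 0 {
            return;
        }
        // Crude bound: drop everything when full. A real tier would use LRU or TTL.
        if self.local.len() >= self.local_capacity {
            self.local.clear();
        }
        self.local.insert(key.to_string(), value);
    }
}
```

The point of the first tier is that repeated reads of a hot key never leave the process. The trade-off is staleness: each server holds its own local copy, so an invalidation scheme (or short TTLs) would be needed before using this for data that changes often.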

The treasure hunt engine turned into a well-oiled machine, and our players were thrilled with the results. Getting there was a rude awakening, though, and it taught me the importance of choosing the right tools for the job. Rust's performance and safety profile let us avoid problems that would otherwise have plagued us, and I'm convinced the decision saved us countless hours of debugging and tweaking.

