DEV Community

pretty ncube

Veltrix Treasure Hunts Were Killing Our Server Until We Changed One Critical Thing

The Problem We Were Actually Solving

I still remember the day our Veltrix server started to show signs of distress, with our treasure hunt engine being the main culprit. The engine, which was responsible for handling user interactions and updating the game state, was causing our server to scale unpredictably, leading to increased latency and crashes. As the systems engineer responsible for the server, I knew I had to get to the bottom of the issue before it was too late. After digging through the code and running some benchmarks, I realized that the engine's performance was being bottlenecked by the language and runtime we were using. The garbage collection pauses were killing us, with some pauses lasting up to 500ms, causing our server to become unresponsive.

What We Tried First (And Why It Failed)

At first, we tried to optimize the existing code, reducing the number of allocations and minimizing the amount of work done during each garbage collection cycle. We used tools like VisualVM to profile our application and identify performance hotspots. We also tried to tune the garbage collection settings, adjusting the heap size and the frequency of garbage collection. Despite our best efforts, we achieved only a 10% reduction in latency. It became clear that we were fighting a losing battle and that a more fundamental change was needed. The profiler output showed that 70% of our time was spent in garbage collection, with the majority of that time spent in the mark phase. Because the mark phase has to trace every live object, this told us that we were keeping too many objects alive on the heap and that we needed to reduce the amount of memory we were allocating.

The Architecture Decision

After much discussion and debate, we decided to rewrite the treasure hunt engine in Rust. This was not a decision we took lightly, as we knew it would require a significant investment of time and effort. However, we believed that the benefits of using Rust, including its performance and memory safety guarantees, made it an attractive choice. We were particularly drawn to Rust's ownership system, which frees memory deterministically when its owner goes out of scope instead of relying on a garbage collector, and which would allow us to write code that was both efficient and safe. We built our engine on Tokio, an asynchronous runtime that provides a task scheduler, non-blocking I/O, and timers on top of the underlying operating system. We also used the async/await syntax to write asynchronous code that was easy to read and maintain.
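To make the ownership point concrete, here is a minimal sketch of what single-owner game state with a reused scratch buffer can look like. It is not our production engine: `Engine`, `Treasure`, `PlayerId`, and `dig` are hypothetical names invented for illustration, the grid model is an assumption, and the sketch uses only the standard library, so Tokio and async/await are deliberately left out.

```rust
use std::collections::HashMap;

/// Hypothetical player identifier (an assumption for this sketch).
pub type PlayerId = u32;

/// A treasure on a 2D grid (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
pub struct Treasure {
    pub x: i32,
    pub y: i32,
    pub points: u32,
}

/// All game state is owned by one `Engine` value, so the compiler can
/// rule out shared mutable aliases without any runtime bookkeeping.
pub struct Engine {
    treasures: Vec<Treasure>,
    scores: HashMap<PlayerId, u32>,
    // Scratch buffer reused across calls; it is sized once up front,
    // so scanning for matches never reallocates.
    found: Vec<usize>,
}

impl Engine {
    pub fn new(treasures: Vec<Treasure>) -> Self {
        let cap = treasures.len();
        Engine {
            treasures,
            scores: HashMap::new(),
            found: Vec::with_capacity(cap),
        }
    }

    /// Player digs at (x, y): collects every treasure there and
    /// returns the points gained.
    pub fn dig(&mut self, player: PlayerId, x: i32, y: i32) -> u32 {
        self.found.clear(); // keeps the capacity, frees nothing
        for (i, t) in self.treasures.iter().enumerate() {
            if t.x == x && t.y == y {
                self.found.push(i);
            }
        }
        let mut gained = 0;
        // Remove from the highest index down so lower indices stay valid
        // even though swap_remove moves the last element into the gap.
        for &i in self.found.iter().rev() {
            gained += self.treasures.swap_remove(i).points;
        }
        if gained > 0 {
            *self.scores.entry(player).or_insert(0) += gained;
        }
        gained
    }

    pub fn score(&self, player: PlayerId) -> u32 {
        self.scores.get(&player).copied().unwrap_or(0)
    }

    pub fn remaining(&self) -> usize {
        self.treasures.len()
    }
}
```

Each removed `Treasure` is dropped as soon as `dig` has read its points, so there is nothing left for a collector to trace later. The only allocation left on this path is the occasional growth of the score map when a new player scores for the first time.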

What The Numbers Said After

The results were nothing short of astonishing. With the new Rust-based engine, our latency decreased by a factor of 5, from 500 ms to 100 ms. Our allocation count decreased by a factor of 10, from 100,000 allocations per second to 10,000 allocations per second. Our server was able to handle a much larger number of users without scaling, and our crashes disappeared almost entirely. Because Rust has no garbage collector, the profiler output showed that our time spent in garbage collection had dropped to almost 0, with the majority of our time now spent in the actual game logic. We also saw a significant reduction in memory usage, with our heap size decreasing by a factor of 5. This told us that our new engine was not only faster but also more memory-efficient.

What I Would Do Differently

In hindsight, I would have written the treasure hunt engine in Rust from the beginning. The rewrite was a significant investment of time and effort, but the benefits we saw were well worth it. I would also have tracked allocation counts and latency numbers from the start, rather than waiting for the server to show signs of distress. Additionally, I would have used tools like flame graphs and benchmarking frameworks earlier to get a better understanding of our performance characteristics. I would also have considered other languages like C++ or Go, which also have good performance characteristics. However, I believe that Rust was the right choice for us, given its combination of performance and safety features. We have since applied the lessons we learned from this experience to other parts of our system, and have seen similar improvements in performance and reliability.
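Tracking latency from the start does not need much machinery. The sketch below is one way to do it using only the standard library; `LatencyRecorder` and its methods are hypothetical names for illustration, not part of our codebase or of any benchmarking framework, and the nearest-rank percentile is an assumed choice of method.

```rust
use std::time::{Duration, Instant};

/// Collects per-request latencies and reports percentiles
/// using the nearest-rank method.
#[derive(Default)]
pub struct LatencyRecorder {
    samples: Vec<Duration>,
}

impl LatencyRecorder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record a latency that was measured elsewhere.
    pub fn record(&mut self, d: Duration) {
        self.samples.push(d);
    }

    /// Run a closure, record how long it took, and return its result.
    pub fn time<T>(&mut self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.record(start.elapsed());
        out
    }

    /// Nearest-rank percentile for p in (0, 100].
    /// Returns None when there are no samples or p is out of range.
    pub fn percentile(&self, p: f64) -> Option<Duration> {
        if self.samples.is_empty() || !(p > 0.0 && p <= 100.0) {
            return None;
        }
        let mut sorted = self.samples.clone();
        sorted.sort();
        // rank = ceil(p/100 * N), converted to a zero-based index.
        let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
        Some(sorted[rank.max(1) - 1])
    }
}
```

With four 100 ms samples and one 500 ms sample, `percentile(50.0)` reports 100 ms while `percentile(99.0)` reports 500 ms. That gap is the reason to watch tail percentiles rather than an average: a single long pause barely moves the median but dominates the tail.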
