DEV Community

pretty ncube

Why I Had to Rethink Our Entire Approach to Building a Scalable Treasure Hunt Engine

The Problem We Were Actually Solving

I still remember the day our treasure hunt engine started to show its limitations. It was a real eye-opener. The system was designed to handle a large number of users and generate puzzles on the fly, but as the user base grew, so did latency and memory usage. We had initially chosen a high-level language for rapid development, and it soon became apparent that this was the wrong choice for this workload: constant garbage collection pauses and the sheer volume of memory allocations were killing our performance. In one specific instance, the system was handling around 10,000 concurrent users with an average latency of around 500ms, which was unacceptable. We were running a custom-built JVM with a 16GB heap, but even that could not keep up with demand. The GC pauses were long enough to leave the system unresponsive for seconds at a time.

What We Tried First (And Why It Failed)

My team and I first tried to optimize the existing code. We profiled the application with YourKit to identify the hotspots and worked on reducing the number of allocations. We also added caching to reduce the load on the system, using a popular in-memory data grid as the caching layer. No matter what we did, we could not get the latency down. The cache was being updated so frequently that it caused heavy contention and slowed the system down even further. One metric stood out: the system was allocating around 10GB of memory per minute, which kept the GC running constantly. YourKit showed that the majority of the time was being spent in the GC, a clear indication that our approach was not sustainable.

The Architecture Decision

After much deliberation, we decided to take a step back and rethink our entire approach. We chose Rust as the primary language for the treasure hunt engine. This was not an easy decision, because it meant rewriting the entire system, but we believed it was necessary to reach the performance and scalability we needed. We were attracted to Rust's focus on memory safety and performance, and we expected it to let us build a system that was both fast and reliable. I recall one conversation in which the team weighed the pros and cons of Rust. We were all aware of the learning curve, and we were willing to take on the challenge. We also decided to build a custom data storage system instead of relying on a traditional database, so that we could optimize storage and retrieval for our specific use case.
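To make the allocation argument concrete, here is a minimal sketch of one pattern Rust makes easy: reusing a single buffer across requests instead of allocating one per puzzle. It is illustrative only and not the engine's actual code. `PuzzleGenerator`, `generate`, the A-Z clue alphabet, and the small xorshift random number generator are all assumptions made up for this example.

```rust
/// Hypothetical puzzle generator that reuses one buffer across requests.
/// All names here are illustrative assumptions, not the real engine's API.
pub struct PuzzleGenerator {
    state: u64,       // xorshift64 state; must never be zero
    scratch: Vec<u8>, // reused between calls, so steady-state calls do not allocate
}

impl PuzzleGenerator {
    /// Creates a generator with a buffer pre-sized for `capacity` letters.
    pub fn new(seed: u64, capacity: usize) -> Self {
        PuzzleGenerator {
            // xorshift gets stuck at zero, so swap in a fixed non-zero seed.
            state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed },
            scratch: Vec::with_capacity(capacity),
        }
    }

    /// One step of the xorshift64 pseudo-random generator.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Fills the internal buffer with `len` letters (A-Z) and returns a
    /// borrowed view of it. No new allocation happens while `len` fits
    /// within the buffer's existing capacity.
    pub fn generate(&mut self, len: usize) -> &[u8] {
        self.scratch.clear(); // drops the old contents but keeps the capacity
        for _ in 0..len {
            let letter = b'A' + (self.next_u64() % 26) as u8;
            self.scratch.push(letter);
        }
        &self.scratch
    }

    /// Exposes the buffer capacity so callers can check that it is stable.
    pub fn capacity(&self) -> usize {
        self.scratch.capacity()
    }
}
```

Because `clear()` keeps the vector's capacity, repeated calls that fit within the initial capacity never touch the allocator. The borrowed return value also means the caller cannot hold on to a puzzle while the next one is being generated, which the compiler enforces.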

What The Numbers Said After

After the switch to Rust, we saw a significant improvement in performance. Average latency dropped from around 500ms to around 50ms, and memory usage fell by a factor of 10. We could handle the same number of users with far less memory. Rust has no garbage collector, so the GC pauses were gone and the system was much more responsive. One metric stood out: the system was now allocating around 1GB of memory per minute, down from the 10GB per minute we had seen before. We profiled the application with perf, and it showed that the majority of the time was being spent in the actual application logic rather than in GC or other overhead, a clear indication that the new approach was working. We also saw a significant reduction in the number of errors. The Rust compiler was catching many potential issues at compile time rather than at runtime, which was a huge win for us: we could focus on the application logic instead of spending time debugging.
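As an illustration of the kind of issue the compiler catches, the sketch below uses an exhaustive `match` and an `Option` return type. The `HuntState` enum and both functions are hypothetical names invented for this example, not taken from the engine.

```rust
/// Hypothetical states a player's hunt can be in (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HuntState {
    NotStarted,
    InProgress { clues_found: u32 },
    Finished,
}

/// `match` must cover every variant, so adding a new variant to `HuntState`
/// makes this function fail to compile until the new case is handled.
pub fn progress_label(state: HuntState, total_clues: u32) -> String {
    match state {
        HuntState::NotStarted => "not started".to_string(),
        HuntState::InProgress { clues_found } => {
            format!("{}/{} clues", clues_found, total_clues)
        }
        HuntState::Finished => "finished".to_string(),
    }
}

/// Returning `Option` forces callers to handle the "no clue left" case
/// explicitly instead of hitting a null reference at runtime.
pub fn next_clue(clues: &[&'static str], state: HuntState) -> Option<&'static str> {
    match state {
        HuntState::NotStarted => clues.first().copied(),
        // `get` returns `None` for an out-of-range index rather than panicking.
        HuntState::InProgress { clues_found } => clues.get(clues_found as usize).copied(),
        HuntState::Finished => None,
    }
}
```

If a new variant such as a paused state were added to `HuntState`, both functions would stop compiling until the new case was handled, and callers of `next_clue` cannot use the result without dealing with `None`.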

What I Would Do Differently

In hindsight, I would have made the switch to Rust much sooner. The learning curve was real, but the gains in performance and reliability were well worth the investment. I would also have spent more time up front on the storage and retrieval of data. We ended up doing a lot of work to optimize our custom-built data storage system after the fact, and it would have been better to do that work at the start. One specific data structure we chose turned out to be a bottleneck. If I had to do it again, I would pick one better suited to our specific use case. Overall, I am glad we made the switch to Rust. It was a difficult decision, but it was the right one for our system, and it has allowed us to build a scalable, performant treasure hunt engine that can handle a large number of users.
