What the Documentation Never Warns You About Our Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

At its core, the treasure hunt engine generated and resolved puzzles, managed user submissions, and persisted results to a database. Sounds straightforward, right? But here's the catch: we wanted this engine to support thousands of concurrent users, with latency under 100ms for each request. That would have been manageable, except that each request created a new, albeit temporary, puzzle instance. That's when things started to get complicated.

As the lead architect, I underestimated the impact of those temporary puzzle instances on our system's performance. Each instance required a significant amount of memory allocation, which, in turn, led to increased page faults, cache thrashing, and a substantial slow-down in our engine's response times. Our users were experiencing delays of up to 500ms, a far cry from the sub-100ms guarantee we promised.

What We Tried First (And Why It Failed)

Initially, we attempted to alleviate the memory pressure by implementing a simple caching mechanism using a mix of in-memory data structures and a disk-based store. The idea was to reuse existing puzzle instances instead of creating new ones. Sounds good on paper, but in practice, it added a whole new level of complexity to our codebase.

Our caching implementation introduced a new layer of abstraction, and with it came overhead from thread synchronization, cache-coherence traffic between cores, and a cascade of pointer chasing in our code. The result? The slowdown persisted, and we were at a loss for what to do next.
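To make the synchronization cost concrete, here is a minimal sketch of the general shape such a cache tends to take. It is written in Rust for consistency with the rest of this post, even though our first attempt was in C++. The `Puzzle` and `PuzzleCache` types and the `simulate` helper are hypothetical names for illustration, not our actual code, and the disk-backed tier is left out entirely.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical puzzle type; the real engine's fields are not shown here.
#[derive(Debug)]
pub struct Puzzle {
    pub id: u64,
    pub grid: Vec<u8>,
}

/// A shared in-memory cache. Every lookup takes the same lock, so
/// concurrent requests serialize on it even when they only read.
#[derive(Clone, Default)]
pub struct PuzzleCache {
    inner: Arc<Mutex<HashMap<u64, Arc<Puzzle>>>>,
}

impl PuzzleCache {
    /// Return the cached puzzle for `id`, creating it on a miss.
    pub fn get_or_create(&self, id: u64) -> Arc<Puzzle> {
        // Lock acquisition: the synchronization point every request hits.
        let mut map = self.inner.lock().unwrap();
        let entry = map
            .entry(id)
            .or_insert_with(|| Arc::new(Puzzle { id, grid: vec![0; 64] }));
        // Handing out an Arc means callers reach the data through an
        // extra pointer hop (map -> Arc -> Puzzle -> grid buffer).
        Arc::clone(&*entry)
    }

    /// Number of distinct puzzles currently cached.
    pub fn entries(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}

/// Spawn `workers` threads that each issue `requests_per_worker` lookups
/// over a small set of eight puzzle ids, then report the cache size.
pub fn simulate(workers: u64, requests_per_worker: u64) -> usize {
    let cache = PuzzleCache::default();
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let cache = cache.clone();
            thread::spawn(move || {
                for request in 0..requests_per_worker {
                    let puzzle = cache.get_or_create(request % 8);
                    assert_eq!(puzzle.id, request % 8);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cache.entries()
}
```

The cache does avoid re-creating instances, but every request now pays for a lock and several pointer hops, which is the kind of overhead described above.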

The Architecture Decision

It was then that I realized we needed to revisit our programming language choice. We had started this project in C++, which, while powerful and fast, doesn't give you memory safety guarantees out of the box. The more I dug into our code, the more I became convinced that we needed a drastic change in approach.

We switched to Rust, a language known for its focus on memory safety and performance. The decision wasn't easy; I've always found Rust to be a challenging language to learn, especially when compared to C++. But the promise of predictability and correctness was too enticing to ignore.

We rewrote our engine's core components, using Rust's ownership system and borrow checker to eliminate memory safety issues. We also used stack allocation and pool-based allocation to minimize the number of heap allocations, with smart pointers making ownership explicit wherever the heap was unavoidable.
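As an illustration of the pool-based idea, here is a minimal single-threaded sketch. The names (`Puzzle`, `PuzzlePool`, `handle_request`) and the 16-byte grid are assumptions made for the example, not our production types. The point is only that a released instance keeps its heap buffer, so the next request reuses it instead of allocating again.

```rust
/// Hypothetical puzzle instance; fields are placeholders for illustration.
#[derive(Debug, Default)]
pub struct Puzzle {
    pub grid: Vec<u8>,
    pub solved: bool,
}

impl Puzzle {
    /// Clear per-request state. `Vec::clear` keeps the allocated capacity,
    /// so the buffer can be refilled without a new heap allocation.
    fn reset(&mut self) {
        self.grid.clear();
        self.solved = false;
    }
}

/// A simple free-list pool of pre-allocated puzzle instances.
pub struct PuzzlePool {
    free: Vec<Puzzle>,
}

impl PuzzlePool {
    /// Pre-allocate `count` puzzles, each with room for `grid_size` bytes.
    pub fn with_capacity(count: usize, grid_size: usize) -> Self {
        let free = (0..count)
            .map(|_| Puzzle {
                grid: Vec::with_capacity(grid_size),
                solved: false,
            })
            .collect();
        PuzzlePool { free }
    }

    /// Take an instance from the pool, falling back to a fresh one
    /// (with an empty, unallocated grid) if the pool is exhausted.
    pub fn acquire(&mut self) -> Puzzle {
        self.free.pop().unwrap_or_default()
    }

    /// Reset an instance and return it to the pool for reuse.
    pub fn release(&mut self, mut puzzle: Puzzle) {
        puzzle.reset();
        self.free.push(puzzle);
    }

    /// Number of instances currently available.
    pub fn available(&self) -> usize {
        self.free.len()
    }
}

/// Handle one request: borrow a puzzle, fill it, and give it back.
pub fn handle_request(pool: &mut PuzzlePool, seed: u8) -> bool {
    let mut puzzle = pool.acquire();
    // Fill the grid with 16 placeholder bytes derived from the seed.
    puzzle.grid.extend((0..16).map(|i| seed.wrapping_add(i)));
    puzzle.solved = puzzle.grid.len() == 16;
    let solved = puzzle.solved;
    pool.release(puzzle);
    solved
}
```

Because ownership moves out of the pool on `acquire` and back in on `release`, the borrow checker rules out using an instance after it has been returned. A multi-threaded engine would need one pool per worker or some form of synchronization, which this sketch does not attempt.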

What The Numbers Said After

The shift to Rust was a turning point for our engine. We saw a dramatic reduction in memory allocation counts (from 10,000,000 to 50,000, a 200-fold drop), and latency fell with it (from 500ms to 50ms, a tenfold improvement). But what was most impressive was the reduction in errors: we went from 200 failures per minute to just a handful.

Our profiling data showed that we were now spending most of our time in CPU-bound tasks, rather than being held back by memory pressure. This was a clear indication that we had removed memory pressure as the bottleneck, and we had done it without giving up memory safety.
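If you want to collect allocation counts like the ones quoted above for your own Rust code, one option is a counting wrapper around the system allocator. This is a sketch of that general technique, not a description of the tooling we used. `CountingAlloc`, `allocation_count`, and `build_temporary_puzzles` are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Process-wide count of calls to `alloc`.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and counts every allocation.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // Delegate the actual work to the default system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
    // `realloc` is not overridden: the trait's default implementation
    // calls `alloc`, so buffer growth is counted as well.
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Current total number of allocations made by the process.
pub fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}

/// Stand-in for per-request work: one boxed 64-byte grid per puzzle,
/// plus one allocation for the vector that holds them.
pub fn build_temporary_puzzles(n: usize) -> Vec<Box<[u8; 64]>> {
    (0..n).map(|_| Box::new([0u8; 64])).collect()
}
```

Reading `allocation_count()` before and after a request handler gives a rough per-request figure. The counter is global, so allocations made by other threads in the same window are included.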

What I Would Do Differently

In hindsight, I would have pushed for the switch to Rust from day one. The learning curve was worth it, especially given the gains in performance and memory safety. However, I wish I'd been more diligent in my estimation of the memory requirements for our engine, as well as the caching complexities that arose from it.

If I were to do it again, I'd also consider using a higher-level language, like Swift or Kotlin, for the puzzle generation and persistence layers. Their higher-level abstractions and automatic memory management (reference counting in Swift, garbage collection for Kotlin on the JVM) would likely have avoided many of the memory safety issues we faced, at the cost of less control over allocation.

As I look back on this experience, I realize that the documentation may have told us what we needed to do, but it was our own experimentation and perseverance that ultimately led us to the correct solution. And, as any systems engineer will tell you, that's the most valuable lesson of all.
