Treasure Hunt Engine: Where Optimism Meets Reality

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We needed a high-performance system to index and retrieve documents from a large corpus. The catch was that our users cared about the accuracy of the results as much as the speed of the search. The Treasure Hunt Engine (THE) therefore had to handle several query types, including Boolean searches and phrase matching, and return relevance-ranked results. In other words, THE was not just a search engine but a critical part of our application's user experience.
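To make the two query types concrete, here is a minimal sketch of a positional inverted index that supports Boolean AND and phrase matching. It is not THE's actual code: the `Index` type, its method names, and the whitespace-and-punctuation tokenizer are all assumptions made for illustration, and relevance ranking is left out entirely.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical positional inverted index:
/// term -> (document id -> positions where the term occurs).
#[derive(Default)]
pub struct Index {
    postings: HashMap<String, HashMap<usize, Vec<usize>>>,
}

/// Lowercase the text and split on anything that is not alphanumeric.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

impl Index {
    /// Record every term of `text` together with its position in the document.
    pub fn add(&mut self, doc_id: usize, text: &str) {
        for (pos, term) in tokenize(text).into_iter().enumerate() {
            self.postings
                .entry(term)
                .or_default()
                .entry(doc_id)
                .or_default()
                .push(pos);
        }
    }

    /// Boolean AND: documents that contain every term of the query.
    pub fn all_of(&self, query: &str) -> BTreeSet<usize> {
        let mut result: Option<BTreeSet<usize>> = None;
        for term in tokenize(query) {
            let docs: BTreeSet<usize> = self
                .postings
                .get(&term)
                .map(|p| p.keys().copied().collect())
                .unwrap_or_default();
            result = Some(match result {
                None => docs,
                Some(acc) => acc.intersection(&docs).copied().collect(),
            });
        }
        result.unwrap_or_default()
    }

    /// Phrase match: documents where the terms appear consecutively, in order.
    pub fn phrase(&self, query: &str) -> BTreeSet<usize> {
        let terms = tokenize(query);
        let mut hits = BTreeSet::new();
        let first = match terms.first().and_then(|t| self.postings.get(t)) {
            Some(postings) => postings,
            None => return hits,
        };
        for (&doc, starts) in first {
            // A start position matches if term k sits at position start + k.
            let found = starts.iter().any(|&start| {
                terms.iter().enumerate().skip(1).all(|(offset, term)| {
                    self.postings
                        .get(term)
                        .and_then(|p| p.get(&doc))
                        .map_or(false, |positions| positions.contains(&(start + offset)))
                })
            });
            if found {
                hits.insert(doc);
            }
        }
        hits
    }
}
```

With documents "The quick brown fox" (id 1) and "A brown quick fox" (id 2), `all_of("quick brown")` matches both, while `phrase("quick brown")` matches only document 1, because only there do the two terms sit next to each other in that order.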

What We Tried First (And Why It Failed)

Initially, we implemented THE on a generic event-driven architecture, reusing our existing event bus and worker queue. Each query was dispatched as an event and processed in a separate worker thread, and the result was then returned to the client. The approach seemed elegant, but it quickly hit its limits. The event bus was too chatty, which added network overhead and delays, and the worker queue introduced latency and jitter. Most frustratingly, our users reported incorrect results and query timeouts, which we attributed to the threading model and a lack of proper synchronization.

The Architecture Decision

After months of firefighting and repeated performance optimizations, we concluded that the language and runtime were the primary constraint holding us back. Our codebase was written in Go, which is fast in many scenarios but, in our experience, could not keep up with the demands of our concurrent query processing. We decided to migrate THE to Rust on the Tokio runtime, chosen for its asynchronous I/O. The change was far-reaching: we rewrote the entire query processing pipeline and reworked our worker thread management.
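The post does not show the rewritten pipeline, so the following is only a sketch of one idea behind it: sharing a read-only index across concurrent workers without locks. To stay self-contained it uses standard-library threads and channels as a stand-in for Tokio tasks; the `Index` alias and `run_queries` function are hypothetical names, and a real service would use a bounded pool or async tasks rather than one thread per query.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical read-only index: term -> ids of documents containing it.
pub type Index = HashMap<String, Vec<usize>>;

/// Run each query on its own thread against one shared, immutable index.
/// Returns (query position, matching document ids), sorted by query position.
pub fn run_queries(index: Arc<Index>, queries: Vec<String>) -> Vec<(usize, Vec<usize>)> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for (i, query) in queries.into_iter().enumerate() {
        // Cloning the Arc copies a pointer, not the index. Workers only read
        // the index, so no mutex is needed and the compiler enforces that.
        let index = Arc::clone(&index);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let hits = index.get(&query).cloned().unwrap_or_default();
            tx.send((i, hits)).expect("receiver is alive");
        }));
    }

    // Drop the original sender so the channel closes once all workers finish.
    drop(tx);
    let mut results: Vec<(usize, Vec<usize>)> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }

    // Workers finish in arbitrary order; restore the order of the queries.
    results.sort_by_key(|(i, _)| *i);
    results
}
```

The design choice worth noting is that the index is never mutated after it is wrapped in `Arc`, so a worker cannot observe a half-updated structure. Under Tokio, `tokio::spawn` would take the place of `thread::spawn` in a sketch like this.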

What The Numbers Said After

The impact of the rewrite was dramatic. After switching to Rust, THE's median response time fell from 250 ms to under 50 ms, with most queries completing within 20 ms. The tail of the latency distribution improved as well: 99.9% of queries now finish within 100 ms. We also observed a substantial drop in allocation counts and memory usage, which let us reduce our servers' memory requirements and increase overall throughput.
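For readers who want to track the same kind of figures, here is a small sketch of how a median and a 99.9th percentile can be computed from raw latency samples using the nearest-rank method. The `percentile` function is a hypothetical helper, not THE's measurement code, and a production system would more likely rely on a histogram in its metrics stack.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `per_mille` is the percentile in tenths of a percent:
/// 500 is the median, 999 is the 99.9th percentile.
/// Returns None for empty input or a percentile outside 1..=1000.
pub fn percentile(samples_ms: &[u64], per_mille: usize) -> Option<u64> {
    if samples_ms.is_empty() || per_mille == 0 || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples_ms.to_vec();
    sorted.sort_unstable();
    // Smallest 1-based rank r such that r / n >= per_mille / 1000,
    // computed with integer arithmetic to avoid floating-point rounding.
    let rank = (per_mille * sorted.len() + 999) / 1000;
    Some(sorted[rank - 1])
}
```

Calling `percentile(&samples, 500)` gives the median and `percentile(&samples, 999)` gives the value that 99.9% of samples do not exceed. Passing the percentile in tenths of a percent keeps the rank calculation exact.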

What I Would Do Differently

In retrospect, our initial decision to use a generic event-driven architecture was well-intentioned but flawed. We should have considered the performance implications of each component from the outset. If I were to revisit the design, I'd start with a more efficient concurrency model, such as async tasks on Tokio's executor, and I'd evaluate a purpose-built search engine such as OpenSearch before building query processing ourselves. Even so, THE taught us a valuable lesson: understanding the performance constraints of your language and runtime is crucial to building high-throughput systems. Acknowledging those constraints early leads to design decisions that produce more scalable and responsive applications.
