Optimizing for a Million Concurrent Users Without Losing Your Mind

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We built the treasure hunt engine to handle complex search queries with many filters and parameters efficiently. The idea was to combine trie data structures with graph algorithms to identify matching documents quickly. In theory, this approach would absorb a growing volume of search queries with minimal performance degradation. In practice, we hit a scalability wall at around 50k concurrent users. At that point the engine showed the classic symptoms of a performance bottleneck: high latency, rising error rates, and a growing memory footprint.
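To make the trie idea concrete, here is a minimal sketch of a prefix index that maps query terms to document IDs and combines several filters by intersecting their matches. The names (`TermIndex`, `search_prefix`, `search_all`) and the AND semantics are assumptions for illustration, not the engine's actual types, and the graph side of the design is left out.

```rust
use std::collections::{BTreeSet, HashMap};

/// One trie node: child edges keyed by character, plus the IDs of the
/// documents whose indexed term ends exactly at this node.
#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    doc_ids: BTreeSet<u32>,
}

/// Hypothetical term index (illustrative only, not the real engine type).
#[derive(Default)]
struct TermIndex {
    root: TrieNode,
}

impl TermIndex {
    /// Index `term` as belonging to document `doc_id`.
    fn insert(&mut self, term: &str, doc_id: u32) {
        let mut node = &mut self.root;
        for ch in term.chars() {
            node = node.children.entry(ch).or_default();
        }
        node.doc_ids.insert(doc_id);
    }

    /// Return the IDs of all documents that have a term starting with `prefix`.
    fn search_prefix(&self, prefix: &str) -> BTreeSet<u32> {
        let mut node = &self.root;
        for ch in prefix.chars() {
            match node.children.get(&ch) {
                Some(child) => node = child,
                None => return BTreeSet::new(),
            }
        }
        // Walk the subtree below the prefix node and collect every document ID.
        let mut found = BTreeSet::new();
        let mut stack = vec![node];
        while let Some(current) = stack.pop() {
            found.extend(current.doc_ids.iter().copied());
            stack.extend(current.children.values());
        }
        found
    }

    /// Combine several prefix filters with AND semantics (an assumption)
    /// by intersecting their match sets. No filters means no matches.
    fn search_all(&self, prefixes: &[&str]) -> BTreeSet<u32> {
        let mut sets = prefixes.iter().map(|p| self.search_prefix(p));
        let first = match sets.next() {
            Some(set) => set,
            None => return BTreeSet::new(),
        };
        sets.fold(first, |acc, set| acc.intersection(&set).copied().collect())
    }
}
```

Note that every node carries its own `HashMap`, so a structure like this gets memory-hungry as the vocabulary grows, which is at least consistent with the memory footprint symptom described above.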

What We Tried First (And Why It Failed)

Our first move was simply to increase the number of worker threads in the engine. Given its task-based architecture, we assumed that more threads would automatically mean more throughput. Instead, the extra threads contended for shared resources, which pushed latency even higher and produced more errors. We also tried optimizing the trie data structures, on the theory that a more efficient search algorithm would be the answer. That yielded some minor improvements but did not address the core issue: our memory usage was spiraling out of control.
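The contention problem is easier to see in code. The sketch below is an assumption about the general shape of the issue, not our actual engine: it uses a mutex-guarded map as a stand-in for whatever shared resource the workers were fighting over. In the first function every worker takes the same lock once per item, so extra threads mostly add waiting. In the second, each worker fills a private map and the results are merged once at the end.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;

/// Chunk size that splits `total` items across at most `workers` threads.
fn chunk_len(total: usize, workers: usize) -> usize {
    let workers = workers.max(1);
    ((total + workers - 1) / workers).max(1)
}

/// Contended version: all workers share one lock and take it per item.
fn count_with_shared_lock(terms: &[&str], workers: usize) -> HashMap<String, u64> {
    let shared: Mutex<HashMap<String, u64>> = Mutex::new(HashMap::new());
    thread::scope(|scope| {
        for chunk in terms.chunks(chunk_len(terms.len(), workers)) {
            let shared = &shared;
            scope.spawn(move || {
                for term in chunk {
                    // Every increment serializes on the same mutex.
                    *shared.lock().unwrap().entry((*term).to_owned()).or_insert(0) += 1;
                }
            });
        }
    });
    shared.into_inner().unwrap()
}

/// Low-contention version: workers count into private maps, merged once.
fn count_with_local_maps(terms: &[&str], workers: usize) -> HashMap<String, u64> {
    let partials: Vec<HashMap<String, u64>> = thread::scope(|scope| {
        let handles: Vec<_> = terms
            .chunks(chunk_len(terms.len(), workers))
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local: HashMap<String, u64> = HashMap::new();
                    for term in chunk {
                        *local.entry((*term).to_owned()).or_insert(0) += 1;
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Single-threaded merge: no locking needed at this point.
    let mut merged: HashMap<String, u64> = HashMap::new();
    for partial in partials {
        for (term, n) in partial {
            *merged.entry(term).or_insert(0) += n;
        }
    }
    merged
}
```

Both functions return the same counts. The difference is how often threads have to wait on each other, which is the cost that grows when you keep adding workers around a single shared structure.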

The Architecture Decision

In a moment of desperation, we decided to replace our existing search query pipeline with a modern, cloud-native alternative based on Apache Pinot. Pinot offered a highly scalable architecture, automatic sharding, and a state-of-the-art query planning engine. We thought it would be a silver bullet for our problems, but it introduced new complexity and a steep learning curve for our operators. The reality is that Pinot is far from a simple, intuitive solution. It's a complicated beast that requires significant expertise to tame.

What The Numbers Said After

After migrating to Pinot, we saw significant improvements in latency and error rates. Average query latency dropped from 500ms to 100ms, a fivefold improvement, and error rates fell from 10% to less than 1%. However, memory usage also rose substantially, which forced us to re-architect our entire system to accommodate the new requirements. The Pinot engine chewed through memory, and we had to add a custom caching layer to mitigate the effects. It was not the seamless upgrade we had hoped for.
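For readers wondering what a caching layer like this can look like, here is a minimal sketch of a capacity-bounded, least-recently-used cache for query results. It is not our production cache: the `QueryCache` name, the string keys, the `Vec<u32>` result type, and the LRU eviction policy are all assumptions chosen to keep the example small. The point is only that a hard capacity limit keeps the cache's own memory use predictable.

```rust
use std::collections::HashMap;

/// Hypothetical bounded cache for query results, keyed by the query string.
struct QueryCache {
    capacity: usize,
    tick: u64, // logical clock, bumped on every access
    entries: HashMap<String, (u64, Vec<u32>)>, // query -> (last-used tick, doc IDs)
}

impl QueryCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    /// Number of cached queries; never exceeds `capacity`.
    fn len(&self) -> usize {
        self.entries.len()
    }

    /// Look up a cached result and mark it as recently used.
    fn get(&mut self, query: &str) -> Option<&[u32]> {
        self.tick += 1;
        let tick = self.tick;
        match self.entries.get_mut(query) {
            Some(entry) => {
                entry.0 = tick;
                Some(entry.1.as_slice())
            }
            None => None,
        }
    }

    /// Store a result, evicting the least recently used entry when full.
    fn put(&mut self, query: &str, doc_ids: Vec<u32>) {
        self.tick += 1;
        if !self.entries.contains_key(query) && self.entries.len() >= self.capacity {
            // A linear scan keeps the sketch short; a real cache would track
            // recency in a way that avoids scanning every entry.
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (used, _))| *used)
                .map(|(key, _)| key.clone());
            if let Some(key) = oldest {
                self.entries.remove(&key);
            }
        }
        self.entries.insert(query.to_owned(), (self.tick, doc_ids));
    }
}
```

A cache like this trades a fixed, known amount of memory for fewer repeated queries against the backend. Sizing `capacity` is the main tuning knob, and the right value depends on how skewed the query distribution is.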

What I Would Do Differently

In hindsight, I would have taken a more measured approach to scaling the search engine. Instead of jumping straight to a cloud-native solution, I would have investigated incremental optimizations first, such as using a more memory-efficient data structure or reducing the number of database queries. I would also have prioritized the operator experience, investing more time in simplifying the configuration and monitoring of the system. While Pinot may have saved us in the long run, the journey was far from smooth, and we paid a heavy price for the convenience.