The Lies We Tell Ourselves About Scalable Search: When the Defaults Aren't Enough

#webdev #programming #ai #machinelearning

The Problem We Were Actually Solving

When our users started hitting the limits of our search engine, we noticed a peculiar pattern: queries were taking anywhere from 10 to 30 seconds to complete, with a few outliers hitting the 2-minute mark. I spent days poring over the logs, and it became clear that the bottleneck wasn't CPU, storage, or network; it was the number of concurrent requests our search engine was handling. We needed a solution that could handle many queries in parallel without sacrificing response time or accuracy.

What We Tried First (And Why It Failed)

We started by tweaking the default config for Veltrix, raising the thread pool size from 10 to 50 and the queue size from 100 to 1000. We also enabled the caching layer, thinking it would take pressure off the database and reduce latency. However, we soon discovered that the larger thread pool was doing more harm than good: the engine was spending too much time context-switching between threads, and the larger queue let requests pile up, so each delayed query delayed the ones behind it. The caching layer helped in some cases, but a few edge cases still fell through the cracks.
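To see why a deeper queue can make things worse, here is a rough back-of-the-envelope sketch. The 0.5-second service time is a made-up figure, not a measurement from our system, and the helper name is hypothetical. It also assumes, optimistically, that adding threads does not slow down individual queries.

```python
def worst_case_queue_wait(queue_size: int, workers: int, service_time_s: float) -> float:
    """Seconds the last request in a full queue waits before a worker picks it up.

    Assumes every worker finishes a query in service_time_s and that
    throughput scales perfectly with the number of workers.
    """
    return queue_size / workers * service_time_s


SERVICE_TIME_S = 0.5  # assumption for illustration only

default_wait = worst_case_queue_wait(queue_size=100, workers=10, service_time_s=SERVICE_TIME_S)
tuned_wait = worst_case_queue_wait(queue_size=1000, workers=50, service_time_s=SERVICE_TIME_S)

print(f"default config: {default_wait:.1f}s worst-case wait")
print(f"tuned config:   {tuned_wait:.1f}s worst-case wait")
```

Even under these generous assumptions, the queue grew 10x while the pool grew only 5x, so the worst-case wait doubles. Once context switching pushes the real service time up, the gap gets wider.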

The Architecture Decision

That was when I realized the default config was a compromise between performance and resource utilization, and that tuning it further would not get us where we needed to be. I decided to take a step back and rethink our approach. Instead of tweaking the default settings, I designed a custom architecture that uses asynchronous queries and a load balancer to distribute traffic across multiple instances of Veltrix. This lets us scale the search engine horizontally without overloading a single instance. I also introduced a new caching layer, using Redis to store query results and reduce the load on our database.
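The shape of that design can be sketched in a few lines. This is a simplified illustration, not our production code: the instance names are invented, `query_backend` is a hypothetical stand-in for a real Veltrix call, a plain dict plays the role of Redis, and round-robin selection stands in for the load balancer.

```python
import asyncio
import itertools

# Hypothetical instance names; round-robin stands in for the load balancer.
BACKENDS = ["veltrix-1", "veltrix-2", "veltrix-3"]
_next_backend = itertools.cycle(BACKENDS)

# A dict standing in for Redis in this sketch.
_cache: dict[str, list[str]] = {}
backend_calls = 0


async def query_backend(backend: str, query: str) -> list[str]:
    """Hypothetical stand-in for a search request to one instance."""
    global backend_calls
    backend_calls += 1
    await asyncio.sleep(0.01)  # simulate network and search latency
    return [f"{backend}:{query}"]


async def search(query: str) -> list[str]:
    """Cache-aside lookup: serve from cache, else query the next instance."""
    if query in _cache:
        return _cache[query]
    result = await query_backend(next(_next_backend), query)
    _cache[query] = result
    return result


async def main() -> tuple[list[list[str]], list[str]]:
    # Three queries run concurrently on one thread, spread across instances.
    first = await asyncio.gather(*(search(q) for q in ["a", "b", "c"]))
    # A repeated query is answered from the cache without touching a backend.
    again = await search("a")
    return first, again


first_results, repeat_result = asyncio.run(main())
print(first_results, repeat_result, backend_calls)
```

One thing this sketch leaves out: two identical queries arriving at the same moment would both miss the cache and both hit a backend. A real deployment would also need cache expiry and invalidation.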

What The Numbers Said After

After deploying the new architecture, we saw a significant improvement in query response times: the median dropped from 15 seconds to 200 ms, with 99% of queries completing within 500 ms. The load balancer distributed the traffic evenly, so no single instance was overloaded. We also saw noticeably fewer edge cases falling through the cracks, thanks to the improved caching layer.
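For anyone wanting to compute the same two figures from their own logs, here is a minimal sketch using the nearest-rank method. The sample latencies below are synthetic, chosen only to make the arithmetic easy to follow; they are not our measurements.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = [float(n) for n in range(1, 101)]

median_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"median={median_ms} ms, p99={p99_ms} ms")
```

Reporting both numbers matters: a healthy median can hide a slow tail, which is exactly what our 2-minute outliers were.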

What I Would Do Differently

If I were to redo this project, I would focus on monitoring and logging from the very beginning. I would set up Prometheus and Grafana to track query times, thread pool utilization, and cache hit ratios. That would have let us identify the bottlenecks earlier and make data-driven decisions instead of relying on intuition. I would also consider using a more robust load balancer, such as HAProxy, to distribute traffic across instances.
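As a rough idea of the signals I mean, here is a small in-process sketch using only the standard library. The class and field names are hypothetical; in a real setup these values would be exported through a Prometheus client library and charted in Grafana rather than held in a dataclass.

```python
from dataclasses import dataclass


@dataclass
class SearchMetrics:
    """Hypothetical in-process counters for two of the signals named above."""

    cache_hits: int = 0
    cache_misses: int = 0
    busy_threads: int = 0
    pool_size: int = 10

    def record_lookup(self, hit: bool) -> None:
        """Count one cache lookup as a hit or a miss."""
        if hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    @property
    def hit_ratio(self) -> float:
        """Fraction of lookups served from cache; 0.0 before any lookup."""
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0

    @property
    def pool_utilization(self) -> float:
        """Fraction of worker threads currently busy."""
        return self.busy_threads / self.pool_size


metrics = SearchMetrics(busy_threads=5)
for hit in (True, True, True, False):
    metrics.record_lookup(hit)

print(f"hit ratio={metrics.hit_ratio:.2f}, pool utilization={metrics.pool_utilization:.2f}")
```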