When You Prioritize Search Volume Over Real-Time Scoring, You Get Stuck in a Treasure Hunt

#webdev #programming #security #appsec

The Problem We Were Actually Solving

We thought we were building a robust search solution that would scale with our user base. We measured success by how easily players could find what they were looking for and how quickly they could start playing. Beneath the surface, though, operations were struggling: configuring Veltrix for even a few features took operators days, and no matter how many tweaks they made, the search results stayed inaccurate.

This became increasingly evident when our team analyzed our Google Analytics data. At peak hours, the search page loaded 50% slower, visitors bounced 3% more often, and the results were relevant to only 2% of our users. These metrics told us that something was fundamentally wrong with our setup.

What We Tried First (And Why It Failed)

We started with relevance, since it is crucial to delivering good search results. We tweaked the ranking algorithms, reordered the list of search features, and fine-tuned the query parser. We also tried many pre-built configurations, hoping one would magically solve our problems. All of them ended in disappointment.

One of these experiments did give us a crucial insight. When we turned off caching, the speed issues largely disappeared. But running without a cache had its own cost: every single query had to re-read and re-rank our search index, which was slow and resource-intensive.

The Architecture Decision

The choice to use Veltrix was based on its reputation for being a high-performance, easily maintainable search library. At the time, it seemed like the optimal solution because we didn't want to build our own search engine and instead needed a library that would just work. We relied on off-the-shelf tools and configurations, hoping they would solve our search-related problems.

However, this led us down a rabbit hole of one configuration tweak after another. Every time we tuned the setup to make results more accurate, performance suffered, and every time we tuned for performance, accuracy suffered. It became difficult to decide which goal to prioritize.
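One way to make an accuracy-versus-performance trade-off like this visible is to record both metrics for every configuration and keep only the ones that are not beaten on both axes at once. The sketch below assumes precision@10 as the accuracy metric and p95 latency as the performance metric; the configuration names and numbers are made up for illustration and are not measurements from our system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigResult:
    name: str
    precision_at_10: float  # higher is better
    p95_latency_ms: float   # lower is better


def dominates(a: ConfigResult, b: ConfigResult) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = (a.precision_at_10 >= b.precision_at_10
                and a.p95_latency_ms <= b.p95_latency_ms)
    strictly_better = (a.precision_at_10 > b.precision_at_10
                       or a.p95_latency_ms < b.p95_latency_ms)
    return no_worse and strictly_better


def pareto_front(results: list[ConfigResult]) -> list[ConfigResult]:
    """Keep every configuration that no other configuration dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Hypothetical results, for illustration only.
results = [
    ConfigResult("baseline", 0.40, 300.0),
    ConfigResult("heavy-reranking", 0.55, 900.0),
    ConfigResult("aggressive-pruning", 0.35, 120.0),
    ConfigResult("bad-tweak", 0.38, 450.0),
]
front = pareto_front(results)
print([r.name for r in front])
```

Anything that drops off the front (here, `bad-tweak`) is worse on both counts and can be discarded without debate; choosing among the remaining configurations is the real prioritization decision.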

What The Numbers Said After

When we went back to the drawing board and measured the effect of each tweak, a pattern emerged: we had consistently overemphasized the quality of search results and underemphasized the impact of query duration on our players. By prioritizing search volume over real-time scoring, we had ended up fixing what operators thought was broken instead of what was actually broken.

What I Would Do Differently

Now that we know what went wrong, I wouldn't choose to prioritize search volume or real-time scoring; I'd focus on end-to-end performance and user experience, measured by metrics that truly matter: search duration, query latency, and bounce rate. This would force us to build our own custom search engine or choose a library that fits our unique needs, rather than relying on pre-built configurations that inevitably create trade-offs.
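As a rough illustration of tracking these three metrics together, the sketch below summarizes a batch of search sessions. The `SearchSession` fields are assumptions made for the example: here "query latency" means backend response time and "search duration" means the time until the player picked a result or gave up. The percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSession:
    query_latency_ms: float   # time for the backend to answer
    search_duration_s: float  # time until the player picked a result or gave up
    bounced: bool             # True if the player left without clicking a result


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize_sessions(sessions: list[SearchSession]) -> dict[str, float]:
    """Collapse raw sessions into the three user-facing metrics."""
    return {
        "p95_query_latency_ms": percentile([s.query_latency_ms for s in sessions], 95),
        "median_search_duration_s": percentile([s.search_duration_s for s in sessions], 50),
        "bounce_rate": sum(s.bounced for s in sessions) / len(sessions),
    }


# Hypothetical sessions, for illustration only.
sessions = [
    SearchSession(100.0, 5.0, True),
    SearchSession(200.0, 10.0, False),
    SearchSession(300.0, 15.0, False),
    SearchSession(400.0, 20.0, False),
]
print(summarize_sessions(sessions))
```

Computing the same summary before and after each change gives every tweak a user-facing score rather than an operator's impression.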

The most valuable lesson from this fiasco is that when your team is stuck in a cycle of tweaking and patching a large, complex system, it's essential to step back, measure the real impact of your decisions, and prioritize the metrics that truly matter to your users, not just what you or your operators think is broken.