Configuring the Veltrix Search Engine for a Multi-Thousand Player Server is a Soul-Crushing Experience

#kubernetes #devops #webdev #programming

The Problem We Were Actually Solving

We were building a large-scale multiplayer server for the upcoming launch of Hytale. As development ramped up, our production operators hit us with a tidal wave of questions. The main culprit behind the endless hours of debugging and frustration was the Veltrix search engine. Every time our player base grew, we hit a wall because the search engine was overloaded with queries. We saw two options: throw more hardware at the problem or optimize the configuration. We tried the former first, but ultimately realized that it was time to dive headfirst into the depths of Veltrix configuration.

What We Tried First (And Why It Failed)

Our first approach was to tweak the default configuration of Veltrix. We reasoned that the out-of-the-box settings were close to what we needed: fine-tune a few knobs, and voila, the search engine would be problem-free. In reality, this approach was a recipe for disaster. Our tweaks led to inconsistent query performance, frequent query timeouts, and, more often than not, a "connection reset" error that left our users bewildered. We quickly realized that the default configuration, even with adjustments, was woefully inadequate for our use case.

The Architecture Decision

After weeks of trial and error, we finally arrived at a breakthrough built on three changes. First, we put a Redis cache in front of the search engine, which significantly reduced its load. Second, we introduced a custom ranking model to prioritize the most relevant search results, which in turn reduced the number of queries executed on the search engine. Third, and perhaps most importantly, we implemented a set of query throttling rules. These rules gracefully degrade query throughput when the system is under heavy load, which kept the search engine from being overwhelmed and reduced query timeouts by 90%. We also introduced regular audits to check that our cache hit ratio stayed above 95%.
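To make the caching and throttling pieces concrete, here is a minimal sketch of how the two can sit in front of a search backend. It is not our production code: `SearchFrontend`, `TokenBucket`, `InMemoryCache`, the `search:` key prefix, the 60-second TTL, and the `backend` callable are all hypothetical names and values chosen for illustration, and a token bucket is just one common way to express a throttling rule. The cache object only needs `get` and `setex`, which the redis-py client provides; the in-memory stand-in lets the example run without a Redis server.

```python
import json
import time


class InMemoryCache:
    """Dict-backed stand-in for a Redis client (assumption: TTL expiry is ignored here)."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        return self._data.get(name)

    def setex(self, name, ttl, value):
        self._data[name] = value


class TokenBucket:
    """Permits `rate` backend queries per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class SearchFrontend:
    """Cache-aside lookup; only cache misses count against the throttle."""

    def __init__(self, cache, bucket, backend, ttl_seconds=60):
        self.cache = cache
        self.bucket = bucket
        self.backend = backend  # hypothetical callable that queries the search engine
        self.ttl_seconds = ttl_seconds
        self.hits = self.misses = self.shed = 0

    def search(self, query):
        key = "search:" + query.strip().lower()
        cached = self.cache.get(key)
        if cached is not None:
            self.hits += 1
            return json.loads(cached)
        self.misses += 1
        if not self.bucket.try_acquire():
            # Under heavy load, shed the query instead of piling onto the backend.
            self.shed += 1
            return None
        results = self.backend(query)
        self.cache.setex(key, self.ttl_seconds, json.dumps(results))
        return results

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A caller that receives `None` can degrade gracefully, for example by showing stale results or a "try again shortly" message. The `hit_ratio` counter is the kind of number a periodic audit would compare against the 95% target.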

What The Numbers Said After

After deploying our changes, we observed a 90% decrease in query timeouts and a 25% increase in user engagement. Our search engine could now handle the load from our large-scale player base without breaking a sweat. One of the most surprising metrics was our cache hit ratio, which shot up from a meager 30% to 98%. The graph of the cache hit ratio looked eerily like a hockey stick, a marked departure from the chaotic plot we had grown accustomed to before the optimization.
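To put the hit ratio change in perspective, a quick back-of-the-envelope calculation helps. The sketch below assumes, purely for illustration, that request volume stays the same and that every cache miss turns into exactly one query against the search engine; `backend_load_factor` is a hypothetical helper name, and throttling is ignored.

```python
def backend_load_factor(hit_ratio_before, hit_ratio_after):
    """How many times more cache-miss traffic reached the backend before vs. after."""
    miss_before = 1.0 - hit_ratio_before
    miss_after = 1.0 - hit_ratio_after
    return miss_before / miss_after


# Going from a 30% to a 98% hit ratio shrinks the miss rate from 70% to 2%.
factor = backend_load_factor(0.30, 0.98)
print(f"Backend sees roughly 1/{round(factor)} of the previous miss traffic")
```

Under those assumptions, the search engine receives roughly one thirty-fifth of the cache-miss queries it handled before, which is consistent with the drop in timeouts.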

What I Would Do Differently

Looking back, I wish I had understood the performance characteristics of the Veltrix search engine and the expected query load in more detail from the start. We could have sidestepped some of the initial frustration and implemented the caching mechanism and custom ranking model together rather than one after the other. Still, understanding the nuances of the system took time, and in the end, the investment paid off.