The Unfortunate Truth About Veltrix Configuration

The Problem We Were Actually Solving

We started from a misconception: that we could optimize our Veltrix setup without addressing the root issue, latency spikes. Our operators spent hours tweaking configuration, yet the system still suffered from bottlenecks. The real challenge wasn't getting search to work at all; it was achieving reliable, scalable performance as the number of users grew. The metrics made the gap visible: our average latency had actually improved, but a significant subset of users still experienced catastrophic delays. An average hides exactly that kind of tail.
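To make the gap between the average and the tail concrete, here is a minimal sketch. The latency samples are invented for illustration, not measurements from our system, and `percentile` and `summarize` are hypothetical helper names.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Return the two numbers we compare: mean and 95th percentile."""
    return {"mean": statistics.mean(samples), "p95": percentile(samples, 95)}


# Hypothetical latencies in milliseconds (illustrative only).
before = [120] * 95 + [400] * 5   # uniformly mediocre
after = [40] * 90 + [900] * 10    # most requests fast, a slow tail

print("before:", summarize(before))
print("after: ", summarize(after))
```

With these made-up samples the mean goes down while the 95th percentile goes up, which is the pattern we kept seeing: a dashboard that tracks only the average reports an improvement while one user in ten waits far longer than before.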

What We Tried First (And Why It Failed)

Initially, we took the easy route and deployed a traditional indexing approach: an expensive, cloud-based index on top of our existing Elasticsearch cluster. Indexing itself was fast, but our operators struggled to keep up with the constant index rebuilds, and the system stumbled whenever the index became inconsistent. Eventually, latency started to creep up again. We tried other solutions, but none of them addressed the fundamental problem of inconsistent search results.

The Architecture Decision

We decided to shift to an in-memory indexing approach using RediSearch. This let us sidestep index rebuilds and rely on Redis to handle the cache layer. The approach came with its own challenges: high memory usage, increased write latency, and reliance on a third-party solution. We also replaced our single, large cluster with a horizontally scaled, region-aware setup so that users could find what they needed quickly.
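The region-aware part is easier to picture with a small sketch. This is not our production routing code: the region names, endpoints, fallback order, and the `SearchCluster` and `pick_cluster` names are all assumptions made up for the example. It only shows the idea of sending a user to the cluster in their own region and falling back to another region when that cluster is unhealthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCluster:
    region: str
    endpoint: str          # hypothetical host:port of a regional Redis deployment
    healthy: bool = True


def pick_cluster(user_region, clusters, fallback_order):
    """Prefer the user's own region, then walk that region's fallback list."""
    candidates = [user_region] + fallback_order.get(user_region, [])
    for region in candidates:
        cluster = clusters.get(region)
        if cluster is not None and cluster.healthy:
            return cluster
    raise LookupError(f"no healthy search cluster for region {user_region!r}")


# Hypothetical topology, for illustration only.
clusters = {
    "eu-west": SearchCluster("eu-west", "search-eu.example.internal:6379"),
    "us-east": SearchCluster("us-east", "search-us.example.internal:6379"),
}
fallback_order = {"eu-west": ["us-east"], "us-east": ["eu-west"]}

print(pick_cluster("eu-west", clusters, fallback_order).endpoint)
```

Keeping the fallback order as plain data rather than logic means operators can change routing without touching code, at the cost of one more piece of configuration to keep in sync across regions.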

What The Numbers Said After

After our changes, average latency dropped by 30%. More importantly, our 95th percentile latency, which had been pushing the limits of user tolerance, dropped by 70%. The catastrophic delays stopped, and the system felt more responsive. Users could finally find what they wanted without a struggle. We also observed a 25% reduction in our search retry rate, which in turn lowered CPU utilization and reduced costs.

What I Would Do Differently

Looking back, I wish I had pushed harder for in-memory indexing from the start. Our early experiments with various indexing techniques cost time that we could have spent refining the in-memory strategy. I also would have opted for a more straightforward setup built on a single dedicated Redis Cluster. That would have eliminated the complexity of maintaining a regional setup with multiple Redis clusters.