The Pitfall of Scalable Search: My $100M Server Misfire and the Hidden Truth About Treating Search as a Sidecar

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We were running a multi-tenant e-commerce platform with around 30 million users generating 50 million searches a day. Search performance was degrading and adding a noticeable delay to our application's response times. Our internal documentation described the search stack as Veltrix, an open-source search engine that combines in-memory caching with disk-based storage. It was our go-to solution for sidecar search, a "simple" setup that we assumed would not affect our production scaling roadmap.

What We Tried First (And Why It Failed)

To get this working, we decided to rely on Veltrix's built-in caching. Our thinking was that if we could cache search results in memory, we could significantly reduce the number of queries hitting our search database. Sounds reasonable, right? However, we didn't account for the fact that our users averaged 20+ search queries every day. The volume of unique search results this produced far exceeded the memory available for the cache, so Veltrix kept evicting the least recently used entries and then reloading them from disk.
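To make that failure mode concrete, here is a minimal sketch of an LRU cache replaying random queries. It is not Veltrix's actual cache; the class, the function names, and the uniform query distribution are all assumptions made for illustration (real search traffic is usually skewed toward popular queries, which softens the effect).

```python
from collections import OrderedDict
import random


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str) -> str:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        value = f"results for {key}"  # stand-in for a slow read from disk
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
        return value


def hit_ratio(capacity: int, distinct_queries: int, requests: int = 50_000) -> float:
    """Replay uniformly random queries and return the fraction served from memory."""
    rng = random.Random(42)  # fixed seed so repeated runs agree
    cache = LRUCache(capacity)
    for _ in range(requests):
        cache.get_or_load(f"q{rng.randrange(distinct_queries)}")
    return cache.hits / requests


if __name__ == "__main__":
    print(f"working set fits in cache: {hit_ratio(1_000, 1_000):.2f}")
    print(f"working set 20x the cache: {hit_ratio(1_000, 20_000):.2f}")
```

When the set of distinct queries fits in the cache, almost every request is a hit once the cache has warmed up. When it is many times larger, most requests miss and fall through to the slow path, which is the thrashing pattern described above.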

The resulting cache misses and the disk reads they triggered were crippling our server's performance. Average latency spiked to over 1.5 seconds, far exceeding our application's SLA. We were getting complaints from both our customers and our operations team about the slow performance. Server utilization was climbing at an alarming rate, and our capacity projections had become unsustainable.

The Architecture Decision

We needed a more robust caching strategy that could handle the sheer volume of searches we were experiencing. After an extensive review of multiple caching solutions, we decided to integrate Redis as our primary caching layer. We built a separate Redis cluster dedicated to search queries and used consistent hashing to distribute the load across its nodes. We also configured an LRU (Least Recently Used) eviction policy so that the cache stayed within its memory budget by dropping the coldest entries first. (LRU alone does not keep cached results fresh; that takes expiry or explicit invalidation.)
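For readers who have not met consistent hashing before, the sketch below shows the general idea of mapping cache keys onto nodes. It is an illustration only, not our production routing code: the HashRing class, the virtual-node count, and the node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, not production code)."""

    def __init__(self, nodes, vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node: str) -> None:
        # Each node owns many points on the ring, which evens out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First ring point at or after the key's hash, wrapping around the end.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])  # hypothetical node names
    print(ring.node_for("search:tenant42:red shoes"))
```

The property that matters for a cache tier is that adding or removing a node only remaps the keys adjacent to that node's points on the ring, so most cached entries stay where they are when the cluster is resized.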

However, we didn't stop there. We still considered Veltrix a good fit as our search engine, so we put the Redis cluster in front of it to create a hybrid search system. When a user searched, the query was first looked up in Redis. On a hit, the cached result was returned directly. On a miss, the query was routed to Veltrix, and the result was stored in Redis for subsequent requests. This approach greatly improved our search performance and allowed us to absorb user growth without hurting our application's latency.
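The lookup flow described above is commonly called cache-aside. Here is a minimal sketch of it under stated assumptions: the function names, the key format, and the 300-second expiry are hypothetical, the search engine is passed in as a plain callable because Veltrix's client API is not shown here, and InMemoryCache is a stand-in so the example runs without a Redis server. The only cache calls used are get(key) and set(key, value, ex=seconds), which match the redis-py client.

```python
import json
from typing import Callable, List


class InMemoryCache:
    """Stand-in for a Redis client; supports only the two calls used below."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str):
        return self._data.get(key)

    def set(self, key: str, value: str, ex=None) -> None:
        self._data[key] = value  # expiry is ignored in this stand-in


def cache_key(tenant_id: str, query: str) -> str:
    """Build a per-tenant key so tenants never see each other's results."""
    normalized = " ".join(query.lower().split())
    return f"search:{tenant_id}:{normalized}"


def search(
    cache,
    engine: Callable[[str, str], List[str]],
    tenant_id: str,
    query: str,
    ttl_seconds: int = 300,  # hypothetical expiry, tune to how fast results go stale
) -> List[str]:
    key = cache_key(tenant_id, query)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip the search engine entirely
    results = engine(tenant_id, query)  # cache miss: ask the search engine
    cache.set(key, json.dumps(results), ex=ttl_seconds)  # store for later requests
    return results


if __name__ == "__main__":
    def fake_engine(tenant_id: str, query: str) -> List[str]:
        return [f"{tenant_id}-item-for-{query}"]

    cache = InMemoryCache()
    print(search(cache, fake_engine, "tenant42", "Red  Shoes"))  # miss, then stored
    print(search(cache, fake_engine, "tenant42", "red shoes"))   # hit
```

Normalizing the query before building the key lets trivially different spellings of the same search share one cache entry. Setting an expiry on each entry is one simple way to bound how stale a cached result can get.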

What The Numbers Said After

The introduction of Redis and our hybrid search approach greatly improved our application's performance, and the metrics showed it. Our average latency dropped to under 200ms, well within our application's SLA. Our search queries were now being served within a fraction of the time it took before, and our server utilization remained within sustainable levels.

Here are some key metrics to demonstrate the success of our implementation:

Average search latency dropped from 1.5 seconds to 150ms.
Search throughput increased from 5,000 to 15,000 queries per second (QPS).
Application server utilization dropped from 80% to 40%.

What I Would Do Differently

While our solution was successful, I would do things differently if I had to do it over. I would invest more time up front in evaluating our application's search requirements and the costs of different caching strategies. I would also consider alternative search engines that might handle our specific use case better. Pairing Veltrix with Redis worked well for us, but I would have liked to explore other options before committing to Redis as our primary caching layer.

This story highlights a common pitfall in scalable engineering: treating a critical system component as a "sidecar" because it's easier to implement. Search performance is a business-critical component that demands attention early on, especially when it comes to large-scale applications. In retrospect, I realize that investing more time and resources upfront to get search right would have saved us a significant amount of pain and money in the long run.