When You're Too Late to Catch the Treasures, the Costs Add Up

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Velquery was designed to handle a high volume of queries per second as our application scaled. Initially, it powered our real-time analytics dashboard, giving users instant insight into ongoing activity and trends. Our requirements soon expanded to cover search as well: users needed to dig into past events and find specific records, which had to be searchable within minutes of occurring. The challenge was to design a system that could efficiently serve a mix of real-time analytics and historical search queries.

What We Tried First (And Why It Failed)

Our first approach to Velquery was to create a high-throughput database that cached the aggregated data from our event producers. We set up a cluster of Redis servers behind a connection pool and implemented custom caching mechanisms to optimize performance. We hit our first roadblock when we noticed that query latency shot up whenever the search database experienced a significant read spike. Our users couldn't afford to wait tens of milliseconds for the data they needed. After monitoring the system for weeks, we traced the problems to connection pool timeouts and Redis node failures caused by RAM exhaustion.

The Architecture Decision

We eventually pivoted away from a caching-based strategy and opted for an index-based solution built on Elasticsearch (ES). This proved to be a bittersweet compromise. It took care of our query performance and let us scale the rest of the system independently of the search workload, but it also added complexity. Our ES cluster turned into a single point of contention, absorbing an ever-increasing share of our system's resources. We also gave up the ability to filter with simple SQL queries, trading it for the intricacies of Elasticsearch's query language. Our developers had to learn the new query syntax, and we had to deploy an additional tool for debugging.
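To make the SQL-to-Elasticsearch trade-off concrete, here is a minimal sketch of how a simple filter might look once translated into the Elasticsearch query DSL. The field names (`user_id`, `event_type`, `timestamp`), the `events` table, and the helper `build_event_search` are hypothetical and not taken from our actual schema; the sketch also assumes the two ID fields are mapped as `keyword` so that `term` filters match exactly.

```python
import json


def build_event_search(user_id: str, event_type: str, since: str, size: int = 20) -> dict:
    """Build a query DSL body roughly equivalent to this SQL (hypothetical schema):

        SELECT * FROM events
        WHERE user_id = :user_id
          AND event_type = :event_type
          AND timestamp >= :since
        ORDER BY timestamp DESC
        LIMIT :size
    """
    return {
        "size": size,                                 # LIMIT
        "sort": [{"timestamp": {"order": "desc"}}],   # ORDER BY timestamp DESC
        "query": {
            "bool": {
                # "filter" clauses do not contribute to relevance scoring,
                # which matches the yes/no semantics of a SQL WHERE clause.
                "filter": [
                    {"term": {"user_id": user_id}},
                    {"term": {"event_type": event_type}},
                    {"range": {"timestamp": {"gte": since}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    # "now-1h" uses Elasticsearch date math; any ISO 8601 timestamp works too.
    print(json.dumps(build_event_search("u-42", "login", "now-1h"), indent=2))
```

The one-line WHERE clause becomes a nested structure, which is a fair picture of why the team needed time to get comfortable with the new syntax.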

What The Numbers Said After

The numbers looked more promising after the switch to ES. Search latency, which had hovered around 2.5 seconds, came down to approximately 150 milliseconds. However, CPU and memory usage on the ES server rose sharply, and index sizes kept growing. ES eventually used over 70 GB of RAM on a single instance, eating up a significant portion of our available resources. Overall system performance degraded over time, with occasional spikes in search latency. Meanwhile, user complaints about the search feature increased at an alarming rate. It feels as though we merely deferred the problem.

What I Would Do Differently

With the knowledge and experience we've gained, I would start this project over by considering alternatives that don't rely as heavily on a centralized database. One approach would be to replicate the data across multiple independent clusters and spread queries across them. This could reduce the load on any single node and, with it, the risk of congestion and the performance problems that follow. We could also revisit our caching strategy, using more fine-grained cache control with tools like Redis and an eviction policy better matched to our access patterns. Perhaps we could even pair the distributed cache with a small local cache in each application instance to avoid a network round trip for hot keys, balancing efficiency against cache coherence.
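One way to picture the finer-grained caching idea is a small in-process cache sitting in front of the shared store. The sketch below is an illustration under assumptions, not something we ran in production: the class name `LocalReadThroughCache`, the `loader` callback, and the default sizes are all hypothetical. It bounds memory with LRU eviction and bounds staleness with a per-entry TTL, which is the coherence trade-off mentioned above. On the Redis side, the comparable knobs are the `maxmemory` and `maxmemory-policy` settings (for example `allkeys-lru`).

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LocalReadThroughCache:
    """In-process LRU cache with a per-entry TTL, in front of a slower loader."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        max_entries: int = 1024,
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader            # e.g. a call to Redis or Elasticsearch
        self._max_entries = max_entries  # hard cap so memory cannot grow unbounded
        self._ttl = ttl_seconds          # upper bound on how stale a value can be
        self._clock = clock
        self._entries: "OrderedDict[Hashable, tuple[float, Any]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh hit: mark as most recently used and skip the network call.
            self._entries.move_to_end(key)
            self.hits += 1
            return entry[1]

        # Miss or expired entry: fall through to the backing store.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value


if __name__ == "__main__":
    cache = LocalReadThroughCache(loader=lambda k: f"value-for-{k}", max_entries=2)
    cache.get("a")
    cache.get("a")
    print(cache.hits, cache.misses)
```

A short TTL keeps each instance's view close to the shared store without any invalidation protocol; a longer one saves more round trips at the cost of serving older data.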