The Treasure Hunt Engine We Refused to Scale Through

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Upon digging deeper, I discovered that our operators were struggling with the engine due to its inherent design. The Treasure Hunt Engine's responsibility included indexing, ranking, and retrieving search results across multiple data sources. What our operators often missed was the fact that the indexing task was taking longer than it should, causing the entire engine to bottleneck. This wasn't a problem with the engine itself, but rather with how we had structured our resource utilization.

What We Tried First (And Why It Failed)

Our initial approach to solving this problem was to vertically scale the machine running the indexing task. We increased the RAM and CPU cores, expecting the operation to complete faster. However, this didn't have the desired effect. The machine's utilization remained high, and the indexing task still took an excessively long time to finish. This led us to realize that we were dealing with a horizontal scaling problem, not a vertical one. The root cause of the issue lay in our database schema design and the number of connections we allowed.

The Architecture Decision

In a major change to our system architecture, we implemented Apache Kafka to handle data ingestion, allowing us to distribute the indexing load across multiple nodes. This required a complete rehaul of our data pipeline and rethinking of our database schema. We also set up a circuit breaker to detect and prevent cascading failures in case one of the nodes failed. In our tests, we were able to successfully scale the engine to handle the increased load.

What The Numbers Said After

The results were clear. Our average response time dropped by 75% after deploying the new architecture. We also saw a 40% reduction in latency and a 25% decrease in the number of failed queries. Our operators no longer experienced the treacherous bottleneck that had plagued the system. More importantly, we were now able to maintain the same level of service as we grew, without the risk of system-wide failures.

What I Would Do Differently

One thing I would do differently is pay closer attention to the trade-offs of premature optimization. Our initial approach to solving the problem was based on a false assumption that vertical scaling would work. Unfortunately, we spent valuable time and resources on this approach before realizing our mistake. If I had to go back, I would have focused on re-examining our database schema and resource utilization from the very start.

By the time we had implemented Kafka and the circuit breaker, we had already accumulated a considerable amount of technical debt. Although it paid off in the end, it would've been more efficient to address the root cause of the issue from the start, rather than trying to scale our way out of it.