The Devastating Cost of Premature Optimisation in Our Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Behind the scenes of our Treasure Hunt Engine, we had a service responsible for fetching and caching a massive graph of relationships between game items. This service, dubbed the "Relationship Resolver", was responsible for answering approximately 3 million requests per hour at our peak load. What we thought we were solving was a classic problem of query optimization - the Relationship Resolver was consistently one of the top contributors to our system's latency. However, in digging deeper, I discovered that the issue wasn't with our queries at all, but with the underlying data structure.

What We Tried First (And Why It Failed)

Our initial approach was to focus on rewriting the Relationship Resolver to leverage a more efficient query language, specifically Apache Cassandra's CQL. We spent weeks rewriting our schema and rewriting our queries to take full advantage of Cassandra's distributed design. We even went so far as to implement a sophisticated caching layer to store frequently accessed data. However, to our surprise, the results were underwhelming. Our queries were faster, but our latency remained stubbornly high. It wasn't until we dug into the issue further that we discovered the root cause - our data structure was inherently flawed, causing the Relationship Resolver to perform a full scan on our database for every single request.

The Architecture Decision

After several weeks of poking around the codebase, I finally discovered the problem. Our data structure, a simple key-value store, was designed to handle small, discrete pieces of data, not massive graphs of relationships. The inevitable result was that our database was being queried with a simple "get all" request, which of course led to a full table scan. I made the decision to replace our existing data structure with a graph database, specifically Amazon Neptune. With its support for Gremlin queries and optimized storage for graph data, I was confident that we could finally tackle this problem head-on.

What The Numbers Said After

With the new data structure in place, we saw a staggering improvement in performance. Our average response time dropped by 80%, from a whopping 1.2 seconds to a mere 250 milliseconds. To put that into perspective, that's a 400% increase in throughput without any changes to our application code or infrastructure. But the numbers didn't stop there - with our new data structure, we also saw a significant reduction in disk usage, from 1.5TB to a mere 300GB. And with that came a corresponding decrease in our operational costs.

What I Would Do Differently

In hindsight, there are a few things I would do differently. Firstly, I would have caught the problem earlier - by using tools like New Relic's latency analysis and digging into the issue sooner, we could have avoided weeks of unnecessary work on query optimization. Secondly, I would have considered a graph database from the start - it's clear now that it was the correct solution, but I would have liked to have explored it earlier. And finally, I would have taken a more careful approach to data structure design - by creating a more scalable data structure from the start, we could have avoided this entire headache.