The Lies We Tell Ourselves About Treasure Hunt Engine

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

When I first joined the team, our Treasure Hunt Engine, a core system used for personalization, was struggling to meet its query cost and pipeline latency SLAs. With a query cost of $500 per hour and pipeline latency of 30 minutes, it was clear we had a problem. The culprit was a custom-built aggregation layer, which despite its simplicity, had an unforgiving architecture that made scaling and maintenance a nightmare.

What We Tried First (And Why It Failed)

One of our initial attempts was to optimize the aggregation layer by introducing a caching layer. We thought that pre-computing aggregate values and storing them in cache would greatly reduce the query cost and latency. However, in reality, the cache was often evicted due to its limited size, causing the aggregation layer to recompute the same values multiple times. This led to an increase in compute resources and, paradoxically, longer latency. The query cost also rose due to the additional caching layer, pushing us further away from our SLAs.

The Architecture Decision

After months of tinkering, we decided to switch to a stream-based architecture. Instead of aggregating data in a batch process, we now aggregate data in real-time using Apache Kafka and Apache Flink. This allowed us to process data in parallel, reducing the latency and query cost. We also introduced a more robust data model that reduced data duplication and improved data quality. The key insight was to treat aggregation as a continuous process, rather than a one-time event. This approach not only improved performance but also allowed us to better handle changes in data patterns and user behavior.

What The Numbers Said After

After the switch, our pipeline latency dropped to 5 seconds, and our query cost decreased to $50 per hour. We also saw a 25% reduction in data duplication and a 90% reduction in cache eviction rates. The system was now able to handle a 50% increase in user traffic without breaking a sweat. With fresh data processed and aggregated in real-time, our engineers could now focus on developing new features rather than fighting fires.

What I Would Do Differently

In hindsight, I would have introduced a streaming architecture from the start. The benefits were clear, but our initial attempts to optimize the batch process prolonged the pain and suffering of our engineers. In addition, I would have invested more time in designing a robust data model upfront. The current model, although improved, still requires frequent updates to accommodate changing data patterns and user behavior. A more robust data model would have allowed us to adapt more easily to these changes, reducing the maintenance burden on our engineers.