The Lie of Perfectly Scalable Treasure Hunts

#webdev #programming #dataengineering #python

My team at Veltrix spent months trying to tame a beast of a system - our server-side search engine, crucial for powering live treasure hunts. We were convinced that our architecture would handle exponential growth in users seamlessly, but behind the scenes, our operators were frantically battling outages. In hindsight, our obsession with scalability had blinded us to the true limitations of the underlying infrastructure.

## What We Tried First (And Why It Failed)

We initially approached this challenge with a batch processing mindset. The pipeline that indexes our search data was designed to run in large batches overnight. This kept costs low and made the infrastructure easy to scale. However, as our user base grew, so did the latency of our searches. When users started complaining about unacceptably long query times, we measured a 7-minute average latency for batch queries. Users were effectively locked out of the treasure hunt experience, and we couldn't scale up quickly enough to accommodate them. It was clear that batch processing was not the right answer.

## The Architecture Decision

Convinced that we needed to move toward streaming, we rebuilt our search pipeline as a micro-batch ETL process. We implemented Apache Kafka as the message broker, which allowed asynchronous processing and decoupled search queries from the indexing pipeline. With this change, our average query latency dropped from minutes to under 200ms while costs stayed contained, and our users were able to participate in treasure hunts seamlessly. As our systems continued to grow, latency remained consistently under 200ms.
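To make the micro-batch idea concrete, here is a minimal sketch rather than our production code. It uses an in-memory `queue.Queue` as a stand-in for a Kafka topic so it runs with the standard library alone, and the names `drain_micro_batch`, `run_micro_batches`, and `index_batch`, as well as the batch size and wait time, are assumptions made for illustration.

```python
import queue
import time
from typing import Callable, List


def drain_micro_batch(source: queue.Queue, max_items: int, max_wait_s: float) -> List[dict]:
    """Collect up to max_items events, waiting at most max_wait_s in total."""
    batch: List[dict] = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget used up: ship whatever we have
        try:
            batch.append(source.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived within the time budget
    return batch


def run_micro_batches(
    source: queue.Queue,
    index_batch: Callable[[List[dict]], None],
    max_items: int = 500,      # assumed batch size, not a measured value
    max_wait_s: float = 0.2,   # assumed wait budget per batch
) -> int:
    """Index events in micro-batches until a drain comes back empty.

    A long-running service would keep looping instead of returning;
    stopping on an empty batch keeps this sketch easy to run and test.
    """
    total = 0
    while True:
        batch = drain_micro_batch(source, max_items, max_wait_s)
        if not batch:
            return total
        index_batch(batch)  # hypothetical hook that writes to the search index
        total += len(batch)
```

The point of the two limits is the trade-off we were chasing: `max_items` caps the cost of each index write, while `max_wait_s` caps how stale a document can get while it waits for a batch to fill.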

## What The Numbers Said After

Implementing the micro-batch ETL pipeline revealed some interesting metrics. Our daily query volume reached 10,000 requests, peaking at 30 requests per second, without any noticeable latency increases. Our indexing pipeline processed 100,000 documents daily, while keeping the entire system running cost-effectively. We set a freshness SLA of 2 hours for all search data, and we consistently met that goal.
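It helps to put those figures side by side. The sketch below only does arithmetic on the numbers quoted above and shows one way a freshness check against the 2-hour SLA could look; the helper names `average_rate` and `is_fresh` are illustrative assumptions, not part of our codebase.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 24 * 60 * 60

DAILY_QUERIES = 10_000        # from the figures above
PEAK_QUERIES_PER_SEC = 30     # from the figures above
DAILY_DOCUMENTS = 100_000     # from the figures above
FRESHNESS_SLA = timedelta(hours=2)


def average_rate(events_per_day: int) -> float:
    """Average events per second if the daily volume were spread evenly."""
    return events_per_day / SECONDS_PER_DAY


def is_fresh(last_indexed_at: datetime, now: datetime, sla: timedelta = FRESHNESS_SLA) -> bool:
    """True if the data was indexed within the SLA window."""
    return now - last_indexed_at <= sla


avg_qps = average_rate(DAILY_QUERIES)             # roughly 0.12 queries per second
peak_to_average = PEAK_QUERIES_PER_SEC / avg_qps  # roughly 259x
avg_docs_per_sec = average_rate(DAILY_DOCUMENTS)  # roughly 1.16 documents per second

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(f"average query rate: {avg_qps:.3f}/s, peak is {peak_to_average:.0f}x the average")
    print(f"average indexing rate: {avg_docs_per_sec:.2f} docs/s")
    print("90 minutes old is fresh:", is_fresh(now - timedelta(minutes=90), now))
```

The gap between the average and the peak is the number to design for: traffic during a live hunt is bursty, so capacity has to be sized for the 30 requests per second burst rather than the daily average.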

## What I Would Do Differently If I Had the Chance

On reflection, we spent a lot of time optimizing query performance without paying sufficient attention to data quality at the ingestion boundary. We encountered inconsistencies in our event sourcing system, which resulted in query times that were 2x longer than we expected. This taught me that data quality should be a top priority in any scalable system.
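As a rough illustration of what checking data quality at the ingestion boundary could mean, here is a small sketch that validates events before they reach the indexer and sets aside anything malformed or duplicated. The event fields (`event_id`, `hunt_id`, `occurred_at`) and the function names are hypothetical; our real schema is not shown in this post.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical required fields and their expected types.
REQUIRED_FIELDS = {"event_id": str, "hunt_id": str, "occurred_at": str}

Event = Dict[str, Any]


def validate_event(event: Event) -> List[str]:
    """Return a list of problems found in one event (empty means valid)."""
    errors: List[str] = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for {name}: {type(event[name]).__name__}")
    return errors


def partition_events(events: List[Event]) -> Tuple[List[Event], List[Tuple[Event, List[str]]]]:
    """Split events into valid ones and rejected ones with their reasons."""
    valid: List[Event] = []
    rejected: List[Tuple[Event, List[str]]] = []
    seen_ids = set()
    for event in events:
        errors = validate_event(event)
        if not errors and event["event_id"] in seen_ids:
            errors.append("duplicate event_id")
        if errors:
            rejected.append((event, errors))  # keep for inspection, do not index
        else:
            seen_ids.add(event["event_id"])
            valid.append(event)
    return valid, rejected
```

Keeping the rejected events together with their reasons, instead of silently dropping them, is what would have let us see the inconsistencies before they showed up as slow queries.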
