Designing Treasure Hunts for 100,000 Participants: The Flawed Assumptions and the Architecture That Saved Us

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

We quickly realized that our primary concern wasn't simply handling a high volume of users; it was also about providing real-time insights into the treasure hunt's progress. Our operations team needed to monitor participation rates, challenge completion statistics, and track users' locations in real-time. To achieve this, we designed a custom events pipeline using Apache Kafka, Apache Cassandra, and Amazon Redshift.

What We Tried First (And Why It Failed)

Initially, we opted for a batch-oriented approach, processing events every 15 minutes using Apache Spark. While this approach provided a basic level of scalability, it failed to meet our real-time analytics requirements. Every 15 minutes, our operations team was presented with outdated information, and our users experienced inconsistent delays between completing challenges and seeing updates on the leaderboard.

To further exacerbate the issue, we soon realized that our batch window was causing significant query delays in our Amazon Redshift warehouse. Our query latency skyrocketed to over 10 seconds, resulting in a 30% drop in user engagement. It became clear that our initial design was flawed.

The Architecture Decision

We shifted our focus towards a streaming-oriented architecture, using Apache Kafka as the central event hub. We configured a three-tier architecture, where our Apache Kafka events would first be stored in Apache Cassandra, and then offloaded to Amazon Redshift for long-term data storage and analytics. By processing events in real-time, we achieved a significant reduction in our query latency, decreasing it by 90% and bringing it down to under 1 second.

What The Numbers Said After

With our new streaming-oriented architecture in place, we were able to meet our freshness SLAs of 1 minute for user participation rates and 5 minutes for challenge completion statistics. Our query cost decreased by 70%, and our pipeline latency was reduced to around 200ms. This not only improved our users' experience but also reduced our operational costs.

What I Would Do Differently

If I were to implement a similar system today, I would consider using a more robust event store like Apache Pulsar or Google Cloud Pub/Sub. These systems provide better fault tolerance and scalability features, which would allow us to handle larger volumes of users with ease. Additionally, I would implement more advanced data quality checks at the ingestion boundary to prevent errors and inconsistencies from propagating through the system.