Baking a Scalable Treasure Hunt Engine: A Cautionary Tale of Wrong Architecture Decisions

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

At our company, we had a reputation for launching new features quickly and scaling up fast to meet demand. However, this approach often led to under-invested configuration layers that held our systems back at the most inopportune moments. We were determined to make our treasure hunt engine scalable from inception. Our goal was to create a server that could handle a minimum of 10 consecutive days of 5% user growth (equivalent to 1 million new users) without any downtime or performance degradation.

What We Tried First (And Why It Failed)

Initially, we opted for an event-driven architecture using Apache Kafka for our message broker and Apache Cassandra for our database. This setup allowed us to scale our data model and our compute model independently. However, this separation of concerns came at a significant cost in terms of pipeline latency and query cost. Our query latency had ballooned to over 500 milliseconds, which made our system unresponsive to user input. Our query cost had also skyrocketed, resulting in a monthly bill that was double our estimated cost.

The Architecture Decision

We eventually realized that our problem wasn't with our architecture per se, but rather with our lack of consideration for the data ingestion boundary. We had been trying to optimize our data warehouse, but our poor data quality at the ingestion point was consistently leading to expensive and inefficient queries. We thus decided to move to a batch processing architecture using Apache Beam and Apache Bigtable. This setup allowed us to handle high volumes of data in a more cost-effective manner and reduced our pipeline latency to under 200 milliseconds.

What The Numbers Said After

After switching to our new batch processing architecture, we saw a significant reduction in our query latency and cost. Our query latency dropped to under 200 milliseconds, allowing us to respond to user input in a timely manner. Our monthly query cost was also reduced by 75%, making our system much more profitable. Our data freshness SLAs were met for the first time, allowing us to roll out new features and content faster.

What I Would Do Differently

If I had to do it over again, I would prioritize data quality at the ingestion boundary much earlier in our project timeline. Our mistakes taught us that it's always cheaper to invest in data quality upfront than to pay the price later in the form of expensive and inefficient queries.