Optimizing for Disaster at 10 Million Queries Per Hour

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

We built Veltrix as an internal platform for our team to query our search data. Initially, our usage was low, and our architecture was more than sufficient. However, as our business grew, we hit an unexpected wall when we reached 10 million queries per hour. It turned out that we were running into issues with query partitioning and data skew, causing our queries to take way longer than expected. We were constantly hitting our query latency SLA of less than 50 milliseconds.

What We Tried First (And Why It Failed)

Our first attempt at solving this issue was to increase the number of shards in our partitioned table. We thought that by distributing the data more evenly, we could speed up our queries. However, this only temporarily alleviated the problem. As our traffic continued to increase, we kept running into issues with data skew, and our queries would still take an inordinate amount of time to complete. We were stuck in a vicious cycle of adding more shards only to see our query performance degrade again.

The Architecture Decision

We decided to switch from a batch-oriented architecture to a streaming one. We implemented a system where our data is ingested into a Kafka topic and then processed in real-time by our query engine. This allowed us to handle our queries much more efficiently, as our system could handle the high volume of traffic in a much more scalable way. We also implemented a data quality check at the ingestion boundary, which allowed us to catch errors and anomalies early on, preventing them from propagating throughout our system.

What The Numbers Said After

After implementing the new architecture, we saw a significant reduction in our query latency, from an average of 120 milliseconds to less than 30 milliseconds. Our query cost also decreased by 70%, allowing us to save on our cloud bill. We were able to meet our query freshness SLA of 99.99%, which means our team can trust the data they're getting from our system.

What I Would Do Differently

In hindsight, I would have prioritized tackling the data skew issue much earlier on. If we had implemented a more robust data partitioning strategy from the start, we could have avoided the issues we had with increasing the number of shards. While our new streaming architecture has been a game-changer for our system, I recognize that it's not the only solution to this problem. I'll be keeping a closer eye on data skew going forward, and I'm confident that we can avoid similar issues in the future.