Five Metrics to Measure Before You Redesign Your Treasure Hunt Engine (Again)

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

At first glance, it seemed like we were solving the classic batch vs streaming problem. We were using Apache Flink to process events in real-time, but we had decided to batch the results and update the database every 10 minutes. The reasoning was that it was more efficient to reduce the load on the database and allow our ETL pipeline to catch up. What we hadn't accounted for was the compounding effect of latency in our system. Every 10 minutes, we were pushing a large batch of updates to the database, which was causing the latency to spike.

What We Tried First (And Why It Failed)

Our initial solution was to simply increase the batch size to try and smooth out the updates. We figured that if we were pushing 10,000 updates at once, it would be less noticeable to our users. But this only made things worse. Our database started to experience query cost spikes, and our users were complaining about slower response times. We were trying to solve the latency problem by making the system work harder, not by making it more efficient.

The Architecture Decision

It wasn't until we took a step back and looked at the system as a whole that we realized the root cause of the problem. We were trying to solve the latency problem at the wrong point in the system. Instead of focusing on the batch updates, we should have been looking at the ingestion boundary. We replaced Apache Flink with Kafka-based data integration and our database with a low-latency NoSQL database. We also switched to a streaming architecture, where every update was processed and stored in real-time. This allowed us to remove the batch updates and the compounding latency.

What The Numbers Said After

After deploying the new architecture, our numbers started to look a lot better. Our average latency dropped to under 1 second, and our query cost decreased by over 70%. Our users were happy, and our system was performing like it was supposed to. We had finally solved the problem, and we were no longer redesigning the system to accommodate the failed architecture.

What I Would Do Differently

In retrospect, I wish we had done this from the start. We should have focused on the ingestion boundary and the streaming architecture from the beginning. By trying to solve the latency problem at the wrong point in the system, we ended up with a suboptimal solution that required multiple redesigns. If I were to do it again, I would focus on the metrics that matter most: average latency, query cost, and freshness SLAs. I would also make sure to get the architecture decision right from the start, rather than trying to fix it in hindsight.