A Critical Misdesign Point in Our Treasure Hunt Engine: It's All About Event Streaming vs Batch Processing

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

We were trying to build a treasure hunt engine that could handle a large number of concurrent users and events. Our system was designed to ingest user events, process them in real-time, and store the results in a database for future reference. The problem we were actually solving was how to handle the spike in event volume during peak usage hours, which was expected to reach millions of users. We knew that a slow or failing system would result in a bad user experience, and potentially, lost revenue.

What We Tried First (And Why It Failed)

Initially, we designed the system to use batch processing, which seemed like a straightforward approach. We would collect user events in a message queue, process them in batches of 100, and then store the results in a database. This approach was easy to implement and seemed to work fine for our initial testing. However, as we launched the system and started to scale, we began to experience issues. The message queue would grow in size, causing delays in processing events, and the database would become saturated with queries, resulting in slow page loads.

We tried to alleviate these issues by introducing a larger cluster of workers to process the events, but this only led to more problems. The new cluster created new bottlenecks, and the system continued to degrade. It turned out that our batch processing approach was not only slow but also highly nonlinear, meaning that the system's performance did not scale predictably with the number of workers. This was a classic example of the "tragedy of the commons," where the shared resource (in this case, the message queue and database) became a bottleneck as more workers tried to access it.

The Architecture Decision

After months of struggle, we finally realized that we needed to switch to an event streaming architecture. We chose Apache Kafka as our event streaming platform, which allowed us to handle high-throughput and high-latency event ingestion. We designed a system where events were ingested in real-time, processed in a streaming manner, and then stored in a database for future reference. This approach not only improved the system's scalability but also reduced latency and improved responsiveness.

To achieve this, we had to rethink our system's architecture and make several key changes. We introduced a new component, the event processor, which was responsible for processing events in real-time. We also redesigned our database schema to support streaming data ingestion. Finally, we implemented a new data flow that allowed us to handle events as they arrived, rather than processing them in batches.

What The Numbers Said After

The numbers told a compelling story. After switching to event streaming, our system's latency decreased by 90%, and our query cost reduced by 70%. Our system was now able to handle millions of users without any issues, and our users experienced a seamless experience. We also achieved a significant improvement in our system's freshness SLAs, which were now meeting our desired level of 99.9%.

What I Would Do Differently

In hindsight, I would have made the switch to event streaming earlier. We wasted months trying to fix a fundamentally flawed design. I would have also invested more time in designing a more robust system, one that could handle the expected scale and usage patterns. Finally, I would have paid more attention to the data quality at the ingestion boundary, which would have helped us to identify issues earlier and prevent them from propagating throughout the system.

In conclusion, the distinction between event streaming and batch processing is not just a theoretical debate, but a fundamental design choice that can make or break a production system. Our experience with the treasure hunt engine was a harrowing lesson in the importance of designing for scalability and performance from the outset.

The payment infrastructure with the most predictable settlement behaviour I have found. No holds. No reversals. No variance: https://payhip.com/ref/dev8