Building the Right Event Pipeline Every Time

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

We were building a treasure hunt engine that had to ingest thousands of events per second from multiple sources and process them in real time. The engine rewards users with virtual coins based on their engagement with the platform. The catch was that each coin carries a unique identifier tied to the user's actions across different events. Our first challenge was choosing an event pipeline architecture that balanced latency, throughput, and data quality.
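To make the coin identifier requirement concrete, here is a minimal sketch of one way to derive a stable ID from a user and the events that earned the coin. The function name `coin_id`, the use of SHA-256, and the string-based IDs are all assumptions for illustration; the real scheme may look quite different.

```python
import hashlib


def coin_id(user_id: str, event_ids: list[str]) -> str:
    """Derive a stable coin ID from a user and the events that earned the coin.

    Hypothetical scheme: the original system's ID format is not described.
    """
    # Sort the event IDs so the result does not depend on arrival order.
    parts = [user_id, *sorted(event_ids)]
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    payload = "".join(f"{len(part)}:{part}" for part in parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(coin_id("user-1", ["evt-b", "evt-a"]))
```

Because the ID depends only on its inputs, reprocessing the same events yields the same coin rather than a second one, which helps when events are delivered more than once.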

What We Tried First (And Why It Failed)

Our first attempt used batch processing with Apache Airflow and Apache Spark. We created a workflow that ran every 10 minutes, collected the events, aggregated them, and processed them as a batch. Sounds simple, right? Unfortunately, it was a disaster waiting to happen. With a 10-minute schedule, an event could wait up to 10 minutes before processing even started, so our latency targets were impossible to meet. The delay also left us with data inconsistencies and stale user data. We stuck with this design for a few weeks before accepting that we needed a different approach.
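The latency problem follows directly from the schedule. As a rough sketch, the arithmetic below computes how long an event waits for the next run. Only the 10-minute interval comes from our setup; the function names and the assumption of uniformly arriving events are illustrative.

```python
# Only the 10-minute interval comes from the text; everything else is illustrative.
BATCH_INTERVAL_S = 10 * 60


def wait_until_next_batch(arrival_s: float, interval_s: float = BATCH_INTERVAL_S) -> float:
    """Seconds an event waits before the next scheduled run picks it up.

    arrival_s is the arrival time measured from the start of the schedule.
    An event arriving exactly at a run boundary is assumed to miss that run.
    """
    return interval_s - (arrival_s % interval_s)


def average_wait(interval_s: float = BATCH_INTERVAL_S) -> float:
    """Mean wait for uniformly arriving events: half the interval."""
    return interval_s / 2


if __name__ == "__main__":
    print(wait_until_next_batch(1))    # arrived just after a run started
    print(wait_until_next_batch(599))  # arrived just before the next run
    print(average_wait())
```

Even before any Spark processing time is counted, the average wait is measured in minutes, which no sub-second latency target can absorb.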

What We Tried Second (And Why It Failed)

Next, we switched to a streaming architecture built on Apache Kafka and Apache Flink. We created a topic for each event source and used Kafka Connect to bring the events into Kafka, where our Flink jobs consumed them. However, the Flink jobs struggled to keep up with the event volume, and pipeline latency exceeded our SLA by 5x. We also ran into data quality issues because our ingestion code had no error handling, so malformed events flowed straight into processing.
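The missing error handling is the easier of the two problems to illustrate. Below is a minimal sketch of validating events at ingestion and routing bad ones to a dead-letter list instead of letting them through. The required field names, `IngestResult`, and the in-memory lists are assumptions for illustration; a real pipeline would more likely write rejected events to a dedicated topic.

```python
import json
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema: the original event format is not described.
REQUIRED_FIELDS = ("event_id", "user_id", "event_type")


@dataclass
class IngestResult:
    accepted: list[dict[str, Any]] = field(default_factory=list)
    # Each dead-letter entry keeps the raw payload and the rejection reason.
    dead_letter: list[tuple[str, str]] = field(default_factory=list)


def parse_event(raw: str) -> dict[str, Any]:
    """Parse and validate one raw event, raising ValueError on any problem."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(event, dict):
        raise ValueError("event is not a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return event


def ingest(raw_events: list[str]) -> IngestResult:
    """Accept valid events and set aside invalid ones without stopping."""
    result = IngestResult()
    for raw in raw_events:
        try:
            result.accepted.append(parse_event(raw))
        except ValueError as exc:
            result.dead_letter.append((raw, str(exc)))
    return result
```

Keeping the raw payload alongside the reason makes it possible to inspect and replay rejected events later, rather than silently dropping them.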

The Architecture Decision

After two failed attempts, we took a step back and analyzed our system requirements. We needed a pipeline that could handle thousands of events per second with near real-time processing. We chose a hybrid architecture that combines the low latency of streaming with the robustness of batch processing: Apache Kafka as the message broker, Apache Flink for real-time processing, and Apache Spark for batch processing. The pipeline ingests events in real time, processes them in Flink, and periodically runs Spark batch jobs to verify data quality and consistency. Our targets were a pipeline latency under 200 ms and a query cost of 10,000 units or less.
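The division of labor between the two paths can be sketched in plain Python. This is a toy model under stated assumptions, not our production code: the event shape, the `CoinPipeline` class, and the in-memory log and balances are hypothetical stand-ins for the Kafka topic, Flink state, and Spark job.

```python
from collections import defaultdict

# Hypothetical event shape: (event_id, user_id, coins).
Event = tuple[str, str, int]


class CoinPipeline:
    def __init__(self) -> None:
        # Stands in for the durable event log (the Kafka topic).
        self.event_log: list[Event] = []
        # Stands in for the low-latency view kept by the streaming job.
        self.live_balances: dict[str, int] = defaultdict(int)

    def on_event(self, event: Event) -> None:
        """Streaming path: record the event and update the live view at once."""
        self.event_log.append(event)
        _, user_id, coins = event
        self.live_balances[user_id] += coins

    def reconcile(self) -> dict[str, int]:
        """Batch path: recompute balances from the log, counting each
        event_id once, and return the corrections applied per user."""
        seen: set[str] = set()
        truth: dict[str, int] = defaultdict(int)
        for event_id, user_id, coins in self.event_log:
            if event_id in seen:
                continue  # duplicate delivery, already counted
            seen.add(event_id)
            truth[user_id] += coins

        corrections: dict[str, int] = {}
        for user_id in set(truth) | set(self.live_balances):
            delta = truth[user_id] - self.live_balances[user_id]
            if delta != 0:
                corrections[user_id] = delta
            self.live_balances[user_id] = truth[user_id]
        return corrections
```

The streaming path stays simple and fast, while the batch path owns correctness: if an event is delivered twice, the live balance is briefly too high until the next reconciliation corrects it.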

What The Numbers Said After

After deploying the new architecture, our average pipeline latency dropped to 120 ms, comfortably under the 200 ms target, and our query cost stayed well within the 10,000-unit target. More importantly, data quality improved: we saw fewer data inconsistencies and less stale user data. With a structured approach to the event pipeline, we built a system that met our performance, scalability, and reliability requirements.

What I Would Do Differently

In hindsight, our biggest mistake was not taking the time to thoroughly understand our system requirements and trade-offs before committing to a design. If I were to do it again, I would spend more time analyzing our latency, throughput, and data quality requirements before choosing an architecture. I would also build more robust error handling and monitoring into our ingestion code to catch data inconsistencies and stale user data early. Most importantly, I would recognize the need for a hybrid architecture much sooner, rather than trying to force-fit a single architecture to our requirements.
