Designing the Perfect Event-Driven System Requires Ditching the Queue

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

When we dug deeper, we realized that our main issue was building a robust event-driven system that could handle the sheer volume of events, process them in real time, and trigger workflows without adding pipeline latency. We had initially planned to use Apache Kafka as our message broker and process events in batches using Spark. In practice, events took up to 5 minutes to be processed and to trigger their workflows, which was unacceptable for our business requirements.

What We Tried First (And Why It Failed)

We first attempted to optimize the batch processing by increasing the number of Spark executors and adjusting the batch size. This didn't fully solve the problem, and pipeline latency stayed too high. We also tried a more powerful Spark cluster, but the increased resource usage only drove up our costs: the cost of querying the data warehouse went up by 20% with this setup.
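To illustrate why tuning executors alone tends not to close the gap, here is a rough model of batch latency. It is only a sketch: `batch_latency` is a hypothetical helper, and the 240-second batch interval and 60-second processing time are made-up numbers chosen to add up to 5 minutes, not measurements from our cluster.

```python
def batch_latency(arrival_s: float, batch_interval_s: float, processing_s: float) -> float:
    """Seconds between an event arriving and its batch finishing.

    Assumes batches start at fixed multiples of batch_interval_s and that
    an event is picked up by the first batch that starts after it arrives.
    """
    next_batch_start = (arrival_s // batch_interval_s + 1) * batch_interval_s
    waiting_s = next_batch_start - arrival_s
    return waiting_s + processing_s


# Hypothetical numbers: 240 s batch interval, 60 s to process a batch.
baseline = batch_latency(arrival_s=0, batch_interval_s=240, processing_s=60)

# More executors might halve the processing time, but the wait is unchanged.
more_executors = batch_latency(arrival_s=0, batch_interval_s=240, processing_s=30)

print(baseline, more_executors)
```

In this toy model the worst-case latency only drops from 300 to 270 seconds, because most of the delay is the time an event spends waiting for its batch to start.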

The Architecture Decision

After re-evaluating our requirements, we decided to switch to a stream processing architecture that could handle events as they arrive. We chose Apache Flink as our stream processing engine and Apache Cassandra as our event store. This setup let us process events in real time and cut pipeline latency to under 1 second. We also set up a caching layer using Redis to store frequently accessed data, which reduced the number of queries hitting our data warehouse. As a result, our query costs went down by 40%.
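One common way to build a caching layer like this is the cache-aside pattern: check the cache first, and only query the warehouse on a miss. The sketch below uses an in-memory dict as a stand-in for Redis so it runs anywhere. `CacheAside`, `load_from_warehouse`, and the 60-second TTL are illustrative assumptions, not our production code.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside lookup with a per-entry TTL (a dict stands in for Redis)."""

    def __init__(
        self,
        load_from_warehouse: Callable[[str], str],
        ttl_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_from_warehouse
        self._ttl_s = ttl_s
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self.warehouse_queries = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the warehouse is not touched
        # Cache miss or expired entry: query the warehouse and remember the result.
        self.warehouse_queries += 1
        value = self._load(key)
        self._store[key] = (now + self._ttl_s, value)
        return value


# Hypothetical loader standing in for a real warehouse query.
cache = CacheAside(load_from_warehouse=lambda key: f"row-for-{key}")
for _ in range(5):
    cache.get("customer:42")
print(cache.warehouse_queries)
```

Five lookups of the same key result in a single warehouse query here. With a real Redis client, the dict would be replaced by reads and writes with an expiry against the Redis server, while the control flow stays the same.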

What The Numbers Said After

The numbers spoke for themselves. After implementing the stream processing architecture, our pipeline latency went down from as much as 5 minutes to under 1 second. Our query costs went down from $100 to $60 per day, and our event-driven workflows improved noticeably. We were able to trigger automated workflows for over 90% of events in real time, which was a huge improvement over our previous batch processing setup.
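As a quick sanity check, the figures above can be turned into relative changes with a few lines of arithmetic. The helper name `percent_reduction` is just for illustration, and the speedup uses the worst-case 5 minutes against the 1-second upper bound.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) * 100 / before


# Daily query cost: $100 before, $60 after.
cost_reduction = percent_reduction(before=100, after=60)

# Worst-case latency: 5 minutes before, under 1 second after.
latency_before_s = 5 * 60
latency_after_s = 1
speedup = latency_before_s / latency_after_s

print(f"query cost reduction: {cost_reduction:.0f}%")
print(f"worst-case latency speedup: at least {speedup:.0f}x")
```

The cost figures work out to a 40% reduction, and the latency figures to a worst-case speedup of at least 300x.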

What I Would Do Differently

In hindsight, I would have explored streaming options from the get-go. Apache Kafka is a great solution for handling high volumes of events, but a broker alone doesn't give you real-time processing; paired with batch jobs, it left us with minutes of latency. I would also have picked an event store that can handle high-throughput writes, such as Apache Cassandra, from the start. Lastly, I would have optimized the caching layer more aggressively to reduce the number of queries to our data warehouse even further. That way, we could have skipped the initial batch processing setup and gone straight to a stream processing architecture.
