I Still Have Nightmares About Our Event Handling Disaster

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I was part of a team that built a large-scale event-driven system, and we were tasked with handling millions of events per second. The system was designed to process these events in real-time, and any delays or losses would have significant consequences. We spent months designing the system, and when it went live, it quickly became apparent that our event handling was not up to par. Events were being lost, and our operators were struggling to keep up with the volume. We were using Apache Kafka as our event bus, and our initial configuration was based on the default settings. We had 10 brokers, 20 partitions per topic, and a batch size of 1000. However, we soon realized that this configuration was not suitable for our use case.

What We Tried First (And Why It Failed)

Our first attempt at solving the problem was to increase the batch size to 5000, hoping that this would reduce the load on our brokers. However, this only made things worse, as our brokers started to run out of memory. We were getting OutOfMemoryError exceptions, and our system was becoming increasingly unstable. We also tried to add more partitions to our topics, but this only led to increased latency and decreased throughput. Our p99 latency was over 100ms, and our throughput was barely 1000 events per second. It was clear that we needed a more structured approach to configuring our event handling.

The Architecture Decision

After much debate and analysis, we decided to take a step back and re-evaluate our event handling configuration. We realized that our problem was not just about increasing the batch size or adding more partitions, but about understanding the tradeoffs between throughput, latency, and reliability. We decided to use a combination of Apache Kafka and Apache Storm to handle our events. We configured our Kafka brokers to use a batch size of 2000, and our Storm topology to use a parallelism of 10. We also implemented a retry mechanism to handle failed events, and a monitoring system to detect any issues. This new configuration allowed us to achieve a p99 latency of under 10ms, and a throughput of over 10,000 events per second.

What The Numbers Said After

After implementing our new configuration, we saw a significant improvement in our event handling. Our p99 latency decreased by over 90%, and our throughput increased by over 1000%. We were also able to reduce our error rate by over 50%, and our operators were able to manage the system with ease. Our metrics showed that we were handling over 15,000 events per second, with a latency of under 5ms. We were also able to reduce our broker memory usage by over 30%, and our CPU usage by over 20%. These numbers clearly showed that our new configuration was a success, and that we had made the right decision.

What I Would Do Differently

Looking back, I would do several things differently. First, I would have spent more time understanding the tradeoffs between throughput, latency, and reliability. I would have also done more testing and simulation before deploying our system to production. I would have also considered using other tools and technologies, such as Apache Flink or Amazon Kinesis, to handle our events. Additionally, I would have implemented more robust monitoring and alerting systems to detect issues before they became critical. I would have also spent more time training our operators on how to manage the system, and how to troubleshoot issues. Overall, our experience with event handling was a valuable lesson in the importance of careful planning, testing, and configuration. It showed us that even with the best tools and technologies, a poorly configured system can still fail, and that a well-configured system can still have issues if not properly managed.