Veltrix's Catastrophic Event Routing: A Case Study in What Not to Do (And How We Fixed It)

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

It's been two years since Veltrix launched its treasure hunt engine, and it's still the go-to platform for adventurous teams and companies. Veltrix is built on an event-driven architecture that relies on real-time data to deliver a seamless treasure hunt experience. However, as the user base grew, so did the complexity of our event routing system. Our team was constantly dealing with data duplication, latency, and increased downtime. On one particularly fateful night, it all came crashing down.

May 15th, 2024, was supposed to be a routine day. Our users were engaged in their treasure hunts, and our system was humming along. That was until 3:42 AM, when a critical error message started appearing on our monitoring dashboard: "Message replay detected. Event not processed." It turned out that a burst of high traffic had overwhelmed our event router, and our event producer had missed over 300 messages. The damage had already been done: several teams were left without their treasure hunts, and our users were beginning to lose trust in the platform.

What We Tried First (And Why It Failed)

Our initial solution was overly simplistic: a message queue with a bigger buffer. Our reasoning was that if we increased the buffer size, we could absorb the increased traffic load and prevent future data loss. We slapped together a solution using RabbitMQ and set the buffer size to 1000. Sounds reasonable, right? Well, it wasn't.

The first issue we encountered was that our buffer size was static: once load pushed the queue to its limit, messages started getting lost. The second issue was that our event producers were emitting events faster than our consumers could drain them, so the queue backed up quickly. This led to increased latency and dropped messages. We were basically trading one problem for another.
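To make the failure mode concrete, here is a minimal simulation of a fixed-size buffer. It is a sketch, not our RabbitMQ setup: the function name and the one-tick-per-second model are assumptions for illustration, and it borrows the pre-upgrade rates quoted later in this post (500 events per second produced, 350 consumed) together with the buffer size of 1000.

```python
def simulate_bounded_buffer(produce_rate, consume_rate, capacity, seconds):
    """Simulate a fixed-capacity buffer, one tick per second.

    Hypothetical model: all rates are events per second, and anything
    that arrives while the buffer is full is dropped.
    Returns (dropped, final_depth).
    """
    depth = 0
    dropped = 0
    for _ in range(seconds):
        # Producer side: accept only what still fits in the buffer.
        accepted = min(produce_rate, capacity - depth)
        dropped += produce_rate - accepted
        depth += accepted
        # Consumer side: drain up to consume_rate events this tick.
        depth -= min(consume_rate, depth)
    return dropped, depth


dropped, depth = simulate_bounded_buffer(
    produce_rate=500, consume_rate=350, capacity=1000, seconds=60
)
print(f"dropped={dropped} final_depth={depth}")
```

Once the buffer fills, the drop rate settles at the difference between the produce and consume rates. A larger buffer only delays that point; it does not change the steady state.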

The Architecture Decision

Fast-forward to August 2024, when our team gathered to address the issues with our event routing system. We decided to take a more structured approach and architect a proper event-driven system built on Apache Kafka and Confluent's stream processing tooling. We separated our event producers and consumers into their respective clusters, ensuring that each producer was isolated from the others and only sent its events to the correct topic. We also implemented a topic partitioning strategy to spread the load evenly across partitions.
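As a rough illustration of what a partitioning strategy does, the sketch below maps an event key to a partition with a stable hash, so events with the same key always land on the same partition while different keys spread out. This is not Kafka's own partitioner (the Java client hashes keys with murmur2); SHA-256 is a stand-in, and the key format and partition count are assumptions.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash.

    Hypothetical helper: SHA-256 stands in for the partitioner's real
    hash function, which this sketch does not reproduce.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Assumed key scheme: one key per team, so a team's events stay ordered
# within a single partition.
keys = [f"team-{i}" for i in range(1000)]
spread = Counter(partition_for(k, 12) for k in keys)
print(sorted(spread.items()))
```

Keying by team is only one option. The right key depends on which events must stay ordered relative to each other.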

But here's the critical part: we also created a series of event routing rules based on the event type, event source, and target system. This allowed us to dynamically route events to the correct system without relying on static routing configurations. By decoupling the event producers from the event consumers, we ensured that events were processed in a predictable and reliable manner.
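A minimal sketch of rule-based routing on event type, source, and target system might look like the following. The class names, rule contents, and topic names are all hypothetical, and "first matching rule wins" is an assumed policy rather than a description of our production router.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Event:
    event_type: str
    source: str
    target: str


@dataclass(frozen=True)
class Rule:
    topic: str
    event_type: Optional[str] = None  # None acts as a wildcard
    source: Optional[str] = None
    target: Optional[str] = None

    def matches(self, event: Event) -> bool:
        # A rule matches when every non-wildcard field equals the event's.
        pairs = (
            (self.event_type, event.event_type),
            (self.source, event.source),
            (self.target, event.target),
        )
        return all(want is None or want == got for want, got in pairs)


def route(event: Event, rules: Sequence[Rule],
          default_topic: str = "events.unrouted") -> str:
    """Return the topic of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(event):
            return rule.topic
    return default_topic


# Hypothetical rules, ordered from most to least specific.
RULES = [
    Rule(topic="hunt.clues.mobile", event_type="clue_found", source="mobile"),
    Rule(topic="hunt.clues", event_type="clue_found"),
    Rule(topic="billing.events", target="billing"),
]

print(route(Event("clue_found", "mobile", "leaderboard"), RULES))
print(route(Event("team_joined", "web", "billing"), RULES))
```

Because the rules are plain data, they can be loaded from a config store and changed without redeploying producers or consumers. Routing unmatched events to a default topic keeps them visible instead of silently dropping them.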

What The Numbers Said After

We ran a series of tests to validate our new architecture before deploying it to production. The results were astounding – we saw a 90% reduction in message drops, a 70% decrease in latency, and a 50% increase in throughput. We also witnessed a significant reduction in the occurrence of "Message replay detected" errors.

As for the specific numbers, our event producer was producing an average of 500 events per second (EPS) before the upgrade. Post-upgrade, we saw an average of 750 EPS, with peaks reaching up to 1200 EPS without any issues. Our event consumer was able to process an average of 350 EPS before the upgrade, whereas post-upgrade, we saw an average of 550 EPS, with peaks reaching up to 850 EPS.
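For anyone who wants to check the arithmetic, the relative changes in those averages can be computed directly. The helper name is ours for this sketch; the inputs are the figures quoted above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100.0


producer_gain = pct_change(500, 750)  # average producer EPS
consumer_gain = pct_change(350, 550)  # average consumer EPS

print(f"producer: +{producer_gain:.1f}%")
print(f"consumer: +{consumer_gain:.1f}%")
```

The producer figure lines up with the 50% throughput increase, while the consumer side improved by slightly more.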

What I Would Do Differently

Looking back, there are a few things I would have done differently. Firstly, I would have implemented a more robust monitoring and logging system to detect issues before they became catastrophic. This would have allowed us to identify and resolve the root cause of the problem more quickly.

Secondly, I would have involved the entire team in the architecture decision, rather than relying on a small group of experts. This would have ensured that everyone was aligned with the problem statement and the solution.

Lastly, I would have started by addressing the root cause of the problem – the high traffic load on our event producers – rather than trying to fix the symptoms. This would have saved us a lot of time and effort in the long run.

It's a sobering reminder that, even with the best of intentions, our solutions can sometimes create more problems than they solve. But with the right approach, tools, and teamwork, we can create a system that's more resilient, scalable, and reliable.