Veltrix is Succeeding at Event Configuration While Most Operators Get It Catastrophically Wrong

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Back in 2018, we launched a revamped version of Veltrix, aimed at simplifying complex event-driven architecture for developers. The previous system was monolithic, event-driven, and slow. Our team set out to create a cloud-native, multi-regional system with thousands of concurrent requests, where event-driven architecture played a critical role in data processing.

In a bid to optimize the system for high throughput, we implemented a distributed event queue, RabbitMQ. However, despite the initial promise of faster processing and reduced latency, we soon realized that events started piling up at an alarming rate, causing our system's performance to degrade. We had to rethink our approach to event configuration to avoid collapse.

What We Tried First (And Why It Failed)

Initially, we tried to solve the event configuration problem using the Amazon SQS fan-out pattern. SQS allowed us to fan out events to multiple queues for parallel processing, supposedly reducing latency and improving throughput. We'd push events to the main SQS queue, and SQS would automatically distribute them across multiple worker queues, making it easier for our system components to process events concurrently.

However, we soon ran into issues with SQS message groups and visibility timeouts. As events were being processed, we encountered problems with multiple worker queues competing for the same event, resulting in duplicated events and errors. Visibility timeouts further compounded the issue, causing SQS to requeue events that hadn't been processed, and thus clogging our system even further. Adding to the problem was the fact that SQS was not built for high-priority events, making its performance unreliable.

The Architecture Decision

We eventually decided to adopt a different approach for event configuration. We developed a custom solution, employing Apache Kafka as our message broker. We partitioned our events across multiple topics, each specifically designed for high-priority events, while using a combination of Apache ZooKeeper and Apache Kafka's cluster management capabilities to ensure event ordering and reliability across our system's components.

Our Kafka-based system helped us distribute events in real-time, reduce event latency, and increase application throughput. We configured our system to handle thousands of concurrent requests, using multiple broker nodes to handle high loads, with HAProxy acting as a load balancer to distribute the workload evenly.

What The Numbers Said After

We compared the performance of our original SQS-based solution to our new Kafka-based system. Key metrics included average latency, overall throughput, and successful event delivery rates. Using New Relic for performance monitoring, we observed a 75% reduction in average event latency and a 95% increase in events processed per minute, with an average successful delivery rate of 99.99%.

We also noticed a 4x decrease in dead-letter queues, indicating fewer duplicated events and errors. This shift allowed us to further optimize our workflows and streamline event-driven processing across our distributed system.

What I Would Do Differently

If I knew then what I know now, I would have spent more time fine-tuning Kafka configuration for optimal performance. Specifically, I would have focused on fine-tuning the number of partitions per topic, adjusting the replication factor and leader election timeout for smoother cluster operation. Additionally, I would have implemented more sophisticated topic routing logic to better accommodate dynamic event flow and allow for more flexible scaling of our event-driven architecture.

Looking back, our experience serves as a reminder that premature optimization is a false god. What worked for us, ultimately, was refactoring our architecture towards a more flexible, partitioned, and scalable event-driven system – one that would be capable of handling massive volumes of events without sacrificing performance or event ordering.