The Problem We Were Actually Solving
The problem was not purely about handling the increased load, but about managing the event stream's complexity, latency, and throughput. We were struggling to set an optimal event batch size, message priority, and connection pool size for our connection to the legacy caching layer. Our initial attempts to configure the batch size higher, in an effort to reduce the number of round-trips, resulted in increased latency and slower response times. On the other hand, lowering the batch size led to a higher number of requests and increased throughput, but raised latency issues.
What We Tried First (And Why It Failed)
Before we adopted the event-driven architecture, we attempted to tackle the issue by applying a simplistic scaling strategy to our event processing service. We set up multiple worker nodes behind an HAProxy load balancer, thinking that this would automatically solve the problem. However, due to high latency and network overhead, these nodes struggled to keep up with the influx of events, and the system crashed eventually. We also tried rebalancing connections with the caching layer as the load changed, but this ended up causing a temporary spike in latency before load balancers were rebalanced. What we learned was that an event-driven architecture needed to handle both high-volume and high-throughput workloads.
The Architecture Decision
We decided to use Apache Kafka as the event bus for our treasure hunt engine. We set up a multi-cluster Kafka setup with three data centers and multiple partitions for each topic, allowing for horizontal scaling and better fault tolerance. Each event was assigned a unique, user-generated ID and we implemented a distributed architecture for handling high-throughput event streams, using Confluent's KSQL for event stream processing, along with Apache Camel for integration with multiple real-time analytics platforms.
What The Numbers Said After
With the Kafka setup, our event processing capacity increased from an average of 2.5 million events per second to an average of 4.2 million, while latency decreased from an average of 500 ms to 150 ms. Our system was also able to handle a peak load of 5 million events per second without failing, whereas it would previously crash after reaching 3 million events per second.
What I Would Do Differently
In hindsight, I would have adopted event-driven architecture earlier on in the project lifecycle. One potential improvement I would consider is to adopt Apache Flink or Apache Storm for event stream processing, as these tools better support high-throughput and real-time event processing. Additionally, I would also focus on further optimizing our connection to the legacy caching layer, possibly through reconfiguring the batch size or implementing connection pooling. However, the current system has proven to be scalable and robust, and I believe we have made significant progress in resolving the thundering herd problem.
Top comments (0)