I Still Have Nightmares About the Time We Almost Lost 3000 Concurrent Users Due to a Misconfigured Event Queue

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I was tasked with deploying a treasure hunt engine as part of a larger marketing campaign, where thousands of users would be interacting with our system simultaneously. The engine itself was built using a combination of Apache Kafka for event handling and Apache Cassandra for data storage. My role as a Veltrix operator was to ensure the system was production-ready and could handle the expected load. The problem was that the default configuration of our event queue was not designed to handle such a large number of concurrent users, and I had to identify the parameters that mattered most in order to optimize the system.

What We Tried First (And Why It Failed)

We started by tweaking the replication factor in our Kafka cluster, thinking that this would provide the necessary redundancy and throughput to handle the large number of users. However, this approach quickly proved to be insufficient, as we began to see errors such as Kafka's NotEnoughReplicasException and Cassandra's ReadTimeoutException. It became clear that simply increasing the replication factor was not enough to solve the problem, and that a more comprehensive approach was needed. We also tried to implement a caching layer using Redis, but this added unnecessary complexity to the system and did not address the underlying issues with our event queue.

The Architecture Decision

After analyzing the system and identifying the bottlenecks, I decided to focus on optimizing the event queue configuration, particularly the number of partitions and the batch size. I also decided to move away from the default configuration of our Kafka cluster and instead use a custom configuration that was tailored to our specific use case. This involved setting the number of partitions to 20, the batch size to 1000, and the replication factor to 3. I also implemented a monitoring system using Prometheus and Grafana to keep track of the system's performance and identify any potential issues before they became critical.

What The Numbers Said After

After implementing the new configuration, we saw a significant improvement in the system's performance. The average latency decreased from 500ms to 50ms, and the error rate decreased from 10% to 0.1%. The system was able to handle the expected load of 3000 concurrent users without any issues, and the monitoring system allowed us to quickly identify and address any potential problems. We also saw a decrease in the number of NotEnoughReplicasException and ReadTimeoutException errors, which indicated that the system was properly configured and able to handle the load.

What I Would Do Differently

In retrospect, I would have liked to have done more thorough testing and simulation of the system before deploying it to production. While we did do some testing, it was not comprehensive enough to identify all of the potential issues. I would also have liked to have implemented a more robust monitoring system from the start, rather than adding it later on. Additionally, I would have liked to have explored other options for optimizing the event queue configuration, such as using a different messaging system like Amazon SQS or RabbitMQ. However, given the time constraints and the specific requirements of the project, I believe that the decisions we made were the best ones at the time, and the system ultimately performed well under heavy load.