The Veltrix Treasure Hunt Engine Was a Consistency Nightmare Until We Redesigned Event Handling

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I was tasked with optimizing the Veltrix treasure hunt engine, a system that relied heavily on event-driven architecture to manage complex game state transitions. The engine was designed to handle thousands of concurrent users, each generating a multitude of events that needed to be processed in a consistent and timely manner. However, as the system scaled, we began to experience issues with event handling, including duplicate events, lost events, and inconsistent game state. The root cause of these issues was the lack of a well-defined consistency model, which led to a multitude of problems, including errors such as java.lang.IllegalStateException: Duplicate event detected and org.apache.kafka.common.errors.NotFoundException: Topic authorization failed.

What We Tried First (And Why It Failed)

Initially, we attempted to address these issues by implementing a simple event caching mechanism using Redis. The idea was to cache events for a short period of time to detect duplicates and prevent them from being processed multiple times. However, this approach failed miserably, as the cache quickly became a bottleneck, causing events to be lost or delayed. We also experienced issues with cache expiration, which led to inconsistent game state. Furthermore, the use of Redis as a cache introduced additional complexity, requiring us to manage cache clusters, handle node failures, and deal with cache inconsistencies. After struggling with this approach for several weeks, we realized that it was not scalable and decided to abandon it in favor of a more robust solution.

The Architecture Decision

After careful consideration, we decided to redesign the event handling system using a combination of Apache Kafka and Apache Cassandra. We chose Kafka as our event bus, due to its high-throughput and low-latency capabilities, as well as its ability to handle large volumes of events. Cassandra was selected as our event store, due to its ability to handle high-availability and provide strong consistency guarantees. We also implemented a custom consistency model, based on the Lamport clock algorithm, to ensure that events were processed in a consistent order across the system. This decision was not taken lightly, as it required significant changes to our existing architecture and infrastructure. However, we believed that it was necessary to ensure the scalability and reliability of the system.

What The Numbers Said After

After implementing the new event handling system, we saw a significant reduction in errors related to event handling. The number of duplicate events detected decreased by 90%, and the number of lost events decreased by 95%. The system also experienced a significant increase in throughput, with the ability to handle 30% more concurrent users without a decrease in performance. Additionally, the use of Kafka and Cassandra provided us with a highly scalable and available system, with a mean time to recovery of less than 5 minutes. The metrics were clear: the new system was performing significantly better than the old one. We were able to achieve an average event processing latency of 10ms, and an event throughput of 10,000 events per second.

What I Would Do Differently

In retrospect, I would have liked to have implemented a more comprehensive testing strategy, including simulation testing and chaos engineering, to ensure that the system was thoroughly tested before going live. I would also have liked to have implemented more extensive monitoring and logging capabilities, to provide better visibility into system performance and errors. Additionally, I would have liked to have explored alternative consistency models, such as the use of vector clocks or conflict-free replicated data types, to see if they would have provided better performance and scalability. However, overall, I am satisfied with the decisions we made, and I believe that they were necessary to ensure the success of the system. The use of Kafka and Cassandra proved to be a good choice, as they provided us with a highly scalable and available system, and the custom consistency model ensured that events were processed in a consistent order across the system.