Veltrix Treasure Hunt Engine Mistakes Were Inevitable Given Our Over-Reliance On Kafka

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I still remember the day our team decided to implement the treasure hunt engine for our Hytale servers using Veltrix, we were excited about the possibilities it offered but also aware of the complexity it introduced, our goal was to create an immersive experience for players while ensuring the system could handle a large volume of concurrent users, we knew that getting the events configuration right was crucial to the success of the project. Our team spent countless hours discussing the best approach, we considered using Apache Kafka as the primary event-driven messaging system, and after some debate, we decided to go with it, mainly due to its high throughput and scalability.

What We Tried First (And Why It Failed)

Initially, we tried to use a simple topic-based approach with Kafka, where all events related to the treasure hunt were published to a single topic, this seemed like a straightforward solution, but it quickly became apparent that it was not suitable for our needs, the main issue was that the topic became a bottleneck, with too many events being published and consumed, this led to significant delays and even caused some events to be lost, we also experienced issues with event ordering and consistency, which was critical for the treasure hunt engine to function correctly. We used the Kafka console consumer to monitor the events and noticed that the lag between the producer and consumer was increasing steadily, at one point, it reached over 10 seconds, which was unacceptable for our use case. We tried to tweak the Kafka configuration, adjusting the number of partitions and the batch size, but it only provided temporary relief.

The Architecture Decision

After realizing that our initial approach was flawed, we decided to take a step back and re-evaluate our architecture, we concluded that a more structured approach was needed, one that would allow us to handle the complexity of the treasure hunt engine while ensuring the system remained scalable and performant. We decided to use a combination of Kafka and a dedicated event store, this would allow us to decouple the event production from consumption and provide a clear audit trail of all events. We chose to use EventStoreDB as our event store, mainly due to its high performance and support for concurrency. We also introduced a set of APIs that would handle the event publishing and consumption, this would provide a clear interface for our services to interact with the event store and Kafka. The new architecture looked like this: our services would publish events to Kafka, which would then be consumed by a dedicated event processor, this processor would handle the event storage and publishing to the event store, the event store would then provide a stream of events that our services could consume. We used Apache Avro for serialization and deserialization of events, which provided a compact and efficient format.

What The Numbers Said After

After implementing the new architecture, we saw significant improvements in the performance and reliability of our treasure hunt engine, the average latency for event processing decreased from over 10 seconds to less than 100 milliseconds, and the event loss rate dropped to near zero. We also saw a substantial decrease in the CPU utilization of our Kafka brokers, from an average of 80% to around 20%. The event store provided a clear audit trail of all events, which helped us debug issues and improve the overall quality of the system. We used Prometheus and Grafana to monitor the system, which provided valuable insights into the performance and behavior of our architecture. The metrics showed that our system could handle a large volume of concurrent users without significant performance degradation.

What I Would Do Differently

Looking back, I would have preferred to use a more lightweight event-driven messaging system, such as RabbitMQ or Amazon SQS, instead of Kafka, while Kafka provides high throughput and scalability, it comes with a significant amount of complexity and operational overhead. I would also have implemented a more robust event validation and handling mechanism, this would have helped us catch and handle errors more effectively, reducing the number of events that were lost or incorrectly processed. Additionally, I would have invested more time in monitoring and testing the system, to ensure that it was operating within the expected parameters and to identify potential issues before they became critical. We used a combination of JUnit and TestNG for unit testing and integration testing, but I would have liked to have used more advanced testing tools, such as Gatling or Locust, to simulate a larger number of concurrent users and test the system under heavy load.