Prestige Systems Are a Recipe for Disaster If You Do Not Control Your Service Boundaries

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I was tasked with designing a prestige system for our multiplayer game, similar to the ones found in popular titles like World of Warcraft or League of Legends. Each user needed their own progression track and rewards, and the system had to scale: our target was around 10,000 concurrent users, each generating roughly 5-10 requests per second, which works out to 50,000 to 100,000 requests per second at peak. We used a combination of Apache Kafka and Apache Cassandra to handle the request volume and data storage.

We built on the Veltrix framework, which promised to simplify building and managing complex systems like this. As we got deeper into the project, though, I realized that the Veltrix documentation was thin in several key areas, particularly service boundaries and consistency models.

What We Tried First (And Why It Failed)

Initially, we followed the Veltrix documentation to the letter, using its recommended approach to building and deploying the prestige system. That approach did not account for the specific needs and constraints of our system, and we quickly ran into data consistency and availability problems under high volumes of concurrent requests. We would often see errors like java.lang.IllegalArgumentException: Duplicate entry for user or org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out. The root cause was that our system was not handling concurrent requests and data updates correctly.

We tried to mitigate this with the Veltrix built-in caching mechanism, but it only seemed to make things worse by adding complexity and overhead. We were using the Hazelcast caching library, configured as a distributed cache with a time-to-live of 30 minutes.
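To make the failure mode concrete, here is a minimal sketch of one common way to keep concurrent updates from stepping on each other: optimistic concurrency, where a write only succeeds if the record has not changed since it was read. This is an illustration, not our production code. PrestigeStore, compare_and_set, and add_prestige are hypothetical names, and the in-memory dictionary stands in for a real data store.

```python
import threading
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the record."""


@dataclass(frozen=True)
class PrestigeRecord:
    level: int
    version: int


class PrestigeStore:
    """In-memory stand-in for a real data store (hypothetical, for illustration)."""

    def __init__(self):
        self._records = {}
        self._lock = threading.Lock()

    def read(self, user_id):
        with self._lock:
            return self._records.get(user_id, PrestigeRecord(level=0, version=0))

    def compare_and_set(self, user_id, expected_version, new_level):
        # The write succeeds only if nobody else has written since we read.
        with self._lock:
            current = self._records.get(user_id, PrestigeRecord(level=0, version=0))
            if current.version != expected_version:
                raise VersionConflict(user_id)
            self._records[user_id] = PrestigeRecord(new_level, current.version + 1)


def add_prestige(store, user_id, max_retries=1000):
    """Increment a user's prestige level, retrying when another writer wins."""
    for _ in range(max_retries):
        record = store.read(user_id)
        try:
            store.compare_and_set(user_id, record.version, record.level + 1)
            return
        except VersionConflict:
            continue  # Stale read: re-read the record and try again.
    raise RuntimeError(f"gave up updating {user_id} after {max_retries} retries")


def run_concurrent_updates(workers=8, updates_per_worker=50):
    """Hammer one user from several threads and return the final record."""
    store = PrestigeStore()

    def work():
        for _ in range(updates_per_worker):
            add_prestige(store, "user-1")

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.read("user-1")


if __name__ == "__main__":
    final = run_concurrent_updates()
    print(final.level, final.version)  # 400 400: no update was lost
```

The point of the design is that a conflicting write fails loudly and is retried, instead of silently overwriting or duplicating another request's work.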

The Architecture Decision

After several weeks of struggling with the Veltrix documentation and our own implementation, I decided to step back and re-evaluate our approach. The key to a successful prestige system, I realized, was not following the Veltrix documentation blindly but controlling our own service boundaries and consistency models.

We adopted event sourcing together with command query responsibility segregation (CQRS). With event sourcing, state changes are stored as an append-only sequence of events; with CQRS, writes and reads go through separate models. This let us decouple data storage from processing and handle concurrent requests and data updates more efficiently and consistently. We also adopted the Akka toolkit to make the system distributed and fault-tolerant under high volumes of requests and data. We ran Akka 2.6 on 10 nodes with a replication factor of 3.
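As a rough illustration of how event sourcing and CQRS fit together, the sketch below keeps an append-only event stream per user on the write side and a separate read model for queries. It is a single-process toy, not our Akka implementation. EventStore, PrestigeGained, grant_prestige, and PrestigeReadModel are hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrestigeGained:
    user_id: str
    sequence: int  # position of this event in the user's stream
    levels: int


class ConcurrentWriteError(Exception):
    """Raised when an event is appended at a position that is already taken."""


class EventStore:
    """Append-only log per user; a stand-in for a durable event log."""

    def __init__(self):
        self._streams = defaultdict(list)
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def stream(self, user_id):
        return list(self._streams.get(user_id, []))

    def append(self, event):
        stream = self._streams[event.user_id]
        # Reject writes that were computed from an out-of-date stream.
        if event.sequence != len(stream):
            raise ConcurrentWriteError(event.user_id)
        stream.append(event)
        for handler in self._subscribers:
            handler(event)


def grant_prestige(store, user_id, levels):
    """Write side: validate the command, then record what happened."""
    if levels <= 0:
        raise ValueError("levels must be positive")
    next_sequence = len(store.stream(user_id))
    store.append(PrestigeGained(user_id, next_sequence, levels))


class PrestigeReadModel:
    """Read side: a denormalized view kept up to date from events."""

    def __init__(self):
        self._levels = {}

    def apply(self, event):
        self._levels[event.user_id] = self._levels.get(event.user_id, 0) + event.levels

    def get_user_prestige(self, user_id):
        return self._levels.get(user_id, 0)


def rebuild_read_model(store, user_ids):
    """Replay stored events to reconstruct the view from scratch."""
    model = PrestigeReadModel()
    for user_id in user_ids:
        for event in store.stream(user_id):
            model.apply(event)
    return model


if __name__ == "__main__":
    store = EventStore()
    view = PrestigeReadModel()
    store.subscribe(view.apply)
    grant_prestige(store, "user-1", 2)
    grant_prestige(store, "user-1", 3)
    print(view.get_user_prestige("user-1"))  # 5
```

Because the event log is the source of truth, a read model can be thrown away and rebuilt by replaying events, which is what rebuild_read_model does here.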

What The Numbers Said After

After implementing the new approach, we saw a significant improvement in performance and availability. Error rates dropped by about 90%, and average response times fell by more than 50%. For example, the average response time for the getUserPrestige endpoint went from 500ms to 200ms, and the error rate for the updateUserPrestige endpoint went from 10% to 1%. The system handled high volumes of concurrent requests without issue and recovered quickly from failures and errors.

Metrics and monitoring were central to getting there. We tracked request latency, error rates, and system throughput with Prometheus and Grafana, and used that data to inform our design and implementation decisions.
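For anyone who wants to check the arithmetic, the two endpoint figures above translate into relative improvements as follows. The helper name relative_reduction is just for this example.

```python
def relative_reduction(before, after):
    """Return the fractional drop from before to after (0.6 means 60% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# getUserPrestige average response time, in milliseconds
latency_drop = relative_reduction(500, 200)
# updateUserPrestige error rate, as a fraction of requests
error_drop = relative_reduction(0.10, 0.01)

print(f"latency: {latency_drop:.0%}")    # latency: 60%
print(f"error rate: {error_drop:.0%}")   # error rate: 90%
```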

What I Would Do Differently

In hindsight, I would have taken a more critical approach to the Veltrix documentation and our own implementation from the start, focusing more on controlling our service boundaries and consistency models and less on following the Veltrix recommended approach. I would also have invested more time and resources in testing and validating the system, particularly around concurrent requests and data updates.

I would also have added distributed tracing and centralized logging earlier. A tracing tool like Zipkin, or instrumentation built on OpenTracing (since folded into OpenTelemetry), would have given us a much better picture of request flow and latency, and the ELK Stack would have done the same for logging and error patterns.

Overall, our experience with the prestige system was a valuable learning opportunity, and one that taught me the importance of critical thinking and careful design in building complex systems.