Hytale Battle Pass Was a Nightmare Until I Stopped Optimizing for the Wrong Metrics

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I was tasked with designing a scalable system to manage Hytale's battle pass progression: tracking player progress, updating rewards, and handling concurrent requests. It had to serve a large volume of users and still feel seamless. As I dug deeper, I realized the biggest challenge was not only handling the traffic but also keeping the system fair and transparent. The initial design stored player data and progression in a combination of Redis and PostgreSQL, and it quickly became apparent that this approach would not scale.

What We Tried First (And Why It Failed)

My first approach optimized for low latency, with Redis as the primary data store and PostgreSQL as a fallback. I used the RedisGears module to handle data processing and aggregation, and this failed miserably. The system could not keep up with the request volume, and we started seeing errors like RedisConnectionException: Connection timed out. With Redis timing out, traffic fell through to the PostgreSQL fallback, which could not absorb the load either, and we saw errors like PostgreSQLException: connection limit exceeded. It became clear that optimizing solely for latency was the wrong goal. I also tried putting Apache Kafka in front as a request queue, but it added complexity without fixing the underlying problem.
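To make that failure mode concrete, here is a minimal Python sketch of one common mitigation: capping how many requests may hit the fallback store at once and rejecting the rest immediately instead of letting them pile up. This is not the code from the system described above. The class name FallbackLimiter, the limit, and the error message are all hypothetical.

```python
import threading


class FallbackLimiter:
    """Caps concurrent use of a fallback store (hypothetical helper, for illustration)."""

    def __init__(self, max_connections: int) -> None:
        # One semaphore slot per connection the fallback database can spare.
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.rejected = 0

    def run(self, query):
        """Run query() if a slot is free, otherwise fail fast."""
        # Non-blocking acquire: shed load rather than queue behind a saturated database.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise RuntimeError("fallback saturated, shedding request")
        try:
            return query()
        finally:
            self._slots.release()


if __name__ == "__main__":
    limiter = FallbackLimiter(max_connections=2)
    # Nesting the calls simulates three requests in flight at the same time.
    try:
        limiter.run(lambda: limiter.run(lambda: limiter.run(lambda: "row")))
    except RuntimeError as exc:
        print(exc, "| rejected so far:", limiter.rejected)
```

Failing fast keeps the fallback alive for the requests it can serve, at the cost of returning errors to the rest. It treats the symptom only, which is part of why a latency-first design with a weak fallback is fragile.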

The Architecture Decision

After the initial approach failed, I took a step back and re-evaluated the requirements. I realized that the system needed to prioritize consistency and fairness over low latency: a player should never lose progress or receive a reward twice, even if a request takes a little longer. I decided to manage battle pass progression with a combination of Apache Cassandra and Apache ZooKeeper. Cassandra provided a highly available, scalable data store, while ZooKeeper handled coordination, which kept the system consistent and fault-tolerant. I also introduced a caching layer using Hazelcast to reduce the load on the database. This let us handle a large volume of requests while keeping progression fair and transparent.
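What "consistency over latency" means for a progression update can be sketched in a few lines of Python. The example below uses an in-memory store with a version check (compare-and-set) and an event ID per XP grant, so a retried or duplicated request cannot award XP twice. It is an illustration under assumptions, not the production design: the names Progress, ProgressStore, and grant_xp are hypothetical, and the in-memory dictionary stands in for a real database. Cassandra's lightweight transactions (conditional writes with IF) are one mechanism that can play the role of compare_and_set here.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Progress:
    xp: int = 0
    version: int = 0
    applied_events: set = field(default_factory=set)


class ProgressStore:
    """In-memory stand-in for a database with conditional writes (hypothetical)."""

    def __init__(self) -> None:
        self._rows = {}
        self._lock = threading.Lock()

    def read(self, player_id: str) -> Progress:
        with self._lock:
            row = self._rows.get(player_id, Progress())
            # Return a copy so callers cannot mutate stored state directly.
            return Progress(row.xp, row.version, set(row.applied_events))

    def compare_and_set(self, player_id: str, expected_version: int, new_row: Progress) -> bool:
        """Write new_row only if nobody else has written since expected_version."""
        with self._lock:
            current = self._rows.get(player_id, Progress())
            if current.version != expected_version:
                return False
            self._rows[player_id] = new_row
            return True


def grant_xp(store: ProgressStore, player_id: str, event_id: str, amount: int, max_retries: int = 5) -> int:
    """Apply an XP grant exactly once per event_id and return the resulting XP."""
    for _ in range(max_retries):
        row = store.read(player_id)
        if event_id in row.applied_events:
            return row.xp  # duplicate delivery: already applied, do nothing
        updated = Progress(row.xp + amount, row.version + 1, row.applied_events | {event_id})
        if store.compare_and_set(player_id, row.version, updated):
            return updated.xp
        # Another writer won the race; re-read and try again.
    raise RuntimeError("too much contention, giving up")


if __name__ == "__main__":
    store = ProgressStore()
    print(grant_xp(store, "player-1", "match-001", 100))
    print(grant_xp(store, "player-1", "match-001", 100))  # retry of the same event
    print(grant_xp(store, "player-1", "match-002", 50))
```

The retry loop costs extra round trips under contention, which is exactly the latency that gets traded away for a guarantee that every player's progression is counted once.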

What The Numbers Said After

After implementing the new architecture, performance improved significantly. Average response time dropped from 500ms to 50ms, and the error rate fell from 10% to 0.1%. We handled a peak load of 10,000 requests per second without any issues. Availability rose from 95% to 99.99%, and support requests related to battle pass progression dropped noticeably.
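Availability percentages are easier to judge when converted into downtime. The short Python sketch below does that conversion for a 30-day month. The function name and the 30-day period are choices made for this example.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(availability_percent: float, period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the downtime implied by an availability percentage over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100 - availability_percent) / 100 * period_minutes


if __name__ == "__main__":
    for pct in (95.0, 99.99):
        print(f"{pct}% available: about {downtime_minutes(pct):.2f} minutes down per 30 days")
```

At 95%, the budget is about 2,160 minutes, or 36 hours, of downtime per 30 days. At 99.99%, it is about 4.3 minutes.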

What I Would Do Differently

In hindsight, I would have prioritized consistency and fairness over low latency from the beginning. I would also have invested more time in designing a robust caching layer, since it ended up being a critical component of the system. I would have set up monitoring and logging tools such as Prometheus and Grafana earlier, to understand the system's performance and spot bottlenecks before they turned into outages. I would also have considered a managed data store like Amazon DynamoDB or Google Cloud Spanner, which scale without as much operational work. Overall, the experience taught me that optimizing for the wrong metrics wastes a lot of time and effort, and that picking the right ones is what makes a system fair, transparent, and scalable.
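As an illustration of the kind of signal that earlier monitoring would have surfaced, here is a minimal Python sketch of a latency histogram with fixed bucket bounds, similar in spirit to the bucketed histograms that Prometheus records. It is plain standard-library Python, not the Prometheus client API, and the class name and bucket bounds are hypothetical.

```python
import bisect


class LatencyHistogram:
    """Counts request latencies into fixed buckets (illustrative, not a Prometheus API)."""

    def __init__(self, bounds_ms) -> None:
        self.bounds = sorted(bounds_ms)
        # One counter per bound, plus a final overflow bucket for anything slower.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0

    def observe(self, latency_ms: float) -> None:
        # bisect_left finds the first bound that is >= latency_ms.
        idx = bisect.bisect_left(self.bounds, latency_ms)
        self.counts[idx] += 1
        self.total += 1

    def fraction_within(self, bound_ms: float) -> float:
        """Share of observed requests that finished in bound_ms or less."""
        idx = self.bounds.index(bound_ms)  # bound_ms must be one of the configured bounds
        if self.total == 0:
            return 0.0
        return sum(self.counts[: idx + 1]) / self.total


if __name__ == "__main__":
    hist = LatencyHistogram([50, 100, 500])
    for sample_ms in (10, 50, 75, 400, 900):
        hist.observe(sample_ms)
    print("within 50ms:", hist.fraction_within(50))
    print("within 500ms:", hist.fraction_within(500))
```

Tracking the share of requests under each bound, rather than only the average, shows whether a slow tail is hiding behind a healthy-looking mean.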