The problem we were actually solving was how to deliver a scalable, low-latency online treasure hunt to more than 100,000 concurrent players across multiple continents. It sounds exciting, but the truth is that our initial approach was a hot mess.
The initial architecture was a monolithic Java application with a single, centralized state store on Amazon Aurora and Redis for caching. It looked fine on paper, but we soon ran into database write contention, heavy Redis memory usage, and poor scalability. The Redis server sat consistently at 80% memory usage, and the application started returning 500 errors after only 10 hours of operation. This was compounded by a 1,500 ms expiration time on cache keys (Redis itself sets no expiry by default, so this value came from our configuration). Entries expired almost as soon as they were written, which caused cache thrashing and pushed even more load onto the central state store.
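To see why such a short expiration hurts, here is a rough back-of-the-envelope model in Go. It is a sketch under stated assumptions, not a measurement from our system: it assumes one hot cache key per concurrent player and that every hot key has to be reloaded from the state store once per TTL window. The function name `storeReadsPerSecond` and the alternative TTL values are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// storeReadsPerSecond estimates the minimum read rate that cache expiry alone
// pushes onto the backing store: each key that stays hot must be reloaded
// once per TTL window. A non-positive TTL is treated as "never expires".
func storeReadsPerSecond(hotKeys int, ttl time.Duration) float64 {
	if ttl <= 0 {
		return 0
	}
	return float64(hotKeys) / ttl.Seconds()
}

func main() {
	// Assumption: one hot cache key per concurrent player.
	const hotKeys = 100_000

	ttls := []time.Duration{
		1500 * time.Millisecond, // the expiration time described above
		30 * time.Second,        // hypothetical alternatives for comparison
		5 * time.Minute,
	}
	for _, ttl := range ttls {
		fmt.Printf("ttl=%v -> ~%.0f store reads/s\n", ttl, storeReadsPerSecond(hotKeys, ttl))
	}
}
```

Under these assumptions, a 1,500 ms TTL means roughly 66,667 reloads per second hitting the state store before a single write is counted, while a 30 second TTL would bring that down to about 3,333.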
What We Tried First (And Why It Failed)
Our first attempt at mitigating these issues was a simple sharding strategy: we split the state store into multiple shards by player region and used the Spring Framework to handle routing and shard selection. This introduced a new set of problems. The added complexity lengthened development cycles, bugs surfaced because shard failures were handled incorrectly, and the region-based split distributed traffic unevenly, which created hotspots.
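The hotspot problem is easy to reproduce on paper. The Go sketch below routes players to shards purely by region and reports how far the busiest shard sits above the average. The region names, the routing table, and the player counts are all hypothetical, chosen only to illustrate the effect; they are not our production numbers.

```go
package main

import "fmt"

// regionShard is a hypothetical static routing table: one shard per region.
var regionShard = map[string]int{"eu": 0, "na": 1, "apac": 2, "sa": 3}

// shardLoads sums players per shard under region-based routing. Players from
// regions without a valid shard are counted as unrouted rather than being
// silently sent to shard 0.
func shardLoads(playersByRegion map[string]int, shards int) (loads []int, unrouted int) {
	if shards < 0 {
		shards = 0
	}
	loads = make([]int, shards)
	for region, players := range playersByRegion {
		shard, ok := regionShard[region]
		if !ok || shard >= shards {
			unrouted += players
			continue
		}
		loads[shard] += players
	}
	return loads, unrouted
}

// imbalance returns the busiest shard's load divided by the mean load.
// 1.0 means perfectly even; it returns 0 when there is no load at all.
func imbalance(loads []int) float64 {
	total, max := 0, 0
	for _, l := range loads {
		total += l
		if l > max {
			max = l
		}
	}
	if total == 0 {
		return 0
	}
	mean := float64(total) / float64(len(loads))
	return float64(max) / mean
}

func main() {
	// Hypothetical distribution of 100,000 concurrent players.
	players := map[string]int{"eu": 55_000, "na": 30_000, "apac": 12_000, "sa": 3_000}
	loads, unrouted := shardLoads(players, 4)
	fmt.Println("loads per shard:", loads, "unrouted:", unrouted)
	fmt.Printf("busiest shard carries %.2fx the average load\n", imbalance(loads))
}
```

With this made-up distribution the busiest shard carries 2.2 times the average load while another sits almost idle, which is the kind of skew a region-only split cannot correct.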
The Architecture Decision
After many sleepless nights and rounds of heated discussion with my team, we decided to rip it all apart and start from scratch. We opted for a microservices-based architecture with Go as the primary language and designed the system around events. We chose Apache Kafka as our event streaming platform, AWS Lambda for serverless functions, and Amazon DynamoDB as our NoSQL database. Each player's state was now held in its own dedicated table, and we implemented a custom sharding algorithm that took the actual traffic patterns into account.
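To give a feel for what "taking traffic patterns into account" can mean, here is a minimal Go sketch of one common approach: split the key space into smaller buckets, measure the load on each, and place buckets on shards heaviest-first, always picking the least-loaded shard. This is an illustration only and not our production algorithm; the `bucket` type, the bucket IDs, and the load figures are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// bucket is a hypothetical slice of the key space with an observed load,
// for example requests per second over a recent window.
type bucket struct {
	ID   string
	Load int
}

// assign places buckets on shards heaviest-first, each time choosing the
// shard with the least load so far (a greedy balancing heuristic). It returns
// the bucket-to-shard placement and the resulting load per shard.
func assign(buckets []bucket, shards int) (map[string]int, []int) {
	if shards <= 0 {
		return nil, nil
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Load != sorted[j].Load {
			return sorted[i].Load > sorted[j].Load
		}
		return sorted[i].ID < sorted[j].ID // deterministic tie-break
	})

	placement := make(map[string]int, len(sorted))
	loads := make([]int, shards)
	for _, b := range sorted {
		target := 0
		for s := 1; s < shards; s++ {
			if loads[s] < loads[target] {
				target = s
			}
		}
		placement[b.ID] = target
		loads[target] += b.Load
	}
	return placement, loads
}

func main() {
	// Hypothetical measurements: busy regions are split into several buckets.
	buckets := []bucket{
		{"eu-1", 20_000}, {"eu-2", 20_000}, {"eu-3", 15_000},
		{"na-1", 18_000}, {"na-2", 12_000},
		{"apac-1", 12_000}, {"sa-1", 3_000},
	}
	placement, loads := assign(buckets, 4)
	fmt.Println("placement:", placement)
	fmt.Println("loads per shard:", loads)
}
```

For these made-up figures the shard loads come out as 23,000, 20,000, 30,000, and 27,000, so the busiest shard is only 1.2 times the average. A real implementation would also have to handle rebalancing as traffic shifts and moving state between shards, which this sketch leaves out.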
What The Numbers Said After
After deploying the new architecture, we saw a 30% reduction in latency, a 50% decrease in CPU utilization, and a 90% reduction in Redis memory usage. The system also scaled further, handling 150,000 concurrent players. Bugs related to shard failures dropped significantly, and the custom sharding algorithm kept traffic evenly distributed.
What I Would Do Differently
In hindsight, I would have implemented the custom sharding algorithm first, rather than relying on the out-of-the-box routing provided by Spring. This would have saved us weeks of debugging and performance optimizations. Additionally, I would have adopted a more rigorous testing strategy, including continuous integration and continuous deployment (CI/CD) pipelines, to catch issues earlier in the development process. The takeaway here is that the key to delivering high-performance systems lies in understanding the problem you're trying to solve, avoiding premature optimisation, and being willing to rethink your approach when the numbers aren't adding up.