Before I dive into the details, let me set the scene. Our company had just acquired a new feature: a treasure hunt engine that would allow users to create and share complex, real-time, multi-player hunts. The twist? We had to scale this monstrosity to tens of thousands of concurrent users within 12 weeks – or risk losing our new business unit to a competitor who'd done this before.
## What We Tried First (And Why It Failed)
Our initial approach was to throw a bunch of caching layers at the problem, thinking that if we could just keep the treasure hunt state in memory, we'd never have to worry about scaling. We deployed a Redis cluster, a memcached proxy, and even experimented with caching some of the hunts' state in the application itself. Sounds like a good idea, right? In theory, our approach made sense, but in practice, we quickly hit a wall.
One particular incident stood out. We launched the treasure hunt engine with a relatively simple hunt that had tens of concurrent players. Things seemed fine at first, but as the hunt progressed and our Redis cluster began to fill up with cached state, we started to see some weird performance issues. Turns out, that Redis cluster we set up was not only caching treasure hunt state, but also the entire hunt's logic – including the infamous "gold chest" mechanic, which, when triggered, would update the entire hunt's state for every player. Suddenly, our Redis cluster was serving up tens of thousands of redundant updates a second. We were hitting our Redis cluster's memory limits, causing page faults, and eventually, our entire system would grind to a halt.
Error messages began to flood our logs: "connection refused" from Redis, followed by frantic alerts from our monitoring system and, on one memorable occasion, a production outage that lasted nearly 24 hours.
## The Architecture Decision
Fast forward to the after-action discussion, and we realized that our caching-first approach was a classic example of premature optimisation. We had optimised for low latency, rather than designing a system that could handle the load.
We decided to take a step back and re-design our system, this time prioritising reliability and scalability. We abandoned our caching setup and opted for a more distributed architecture. Specifically, we introduced a message queue (RabbitMQ) to handle updates to the hunts' state, decoupling our database writes from the hunting experience for players. We also added a separate read-only replica of our hunt database, which we could easily scale to handle the load of caching hunt state using a dedicated caching layer.
This setup allowed us to handle the load without compromising on the user experience. The RabbitMQ queue helped us catch up on any backlog of updates, and the dedicated caching layer was able to serve up the relevant hunt state to players without overloading our system.
## What The Numbers Said After
The numbers were remarkably different from our caching-first approach. Our system was able to handle tens of thousands of concurrent users without breaking a sweat. Our median response time for hunts improved by a factor of 10, and our system-wide latency dropped from an average of 1.5 seconds to under 500 milliseconds.
Meanwhile, our uptime improved dramatically – from 95% to 99.99%. We no longer experienced those grueling production outages, and our support team was able to focus on actually helping customers rather than firefighting.
## What I Would Do Differently
If I had a second chance, I'd focus even more on the principles of distributed transactions and eventual consistency. While our separate read-only replica of the hunt database helped us scale, it introduced some complex challenges around keeping the replica up-to-date with the latest updates from the queue.
I'd probably use something like Apache Kafka or Amazon Kinesis to handle event handling, allowing us to more easily manage the consistency model and handle eventual consistency in a more robust way.
In the end, while our caching-first approach was a good idea in theory, it was the wrong approach for our specific use case. Scaling a high-traffic system like our treasure hunt engine requires a more nuanced understanding of distributed architecture and the trade-offs involved in different design choices.
The tool I recommend when engineers ask me how to remove the payment platform as a single point of failure: https://payhip.com/ref/dev1
Top comments (0)