The Treacherous Scaling Path of Our Treasure Hunt Engine

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

Our stack combined Ruby on Rails, PostgreSQL, Redis, and Memcached. It was a beauty to behold, but it was also a ticking time bomb. Our treasure hunt game needed to show each user a list of all the active games they could join. The list came from a database query that fetched every active game and filtered out the ones the user had already joined. Simple enough, you'd think.
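To make the shape of that query concrete, here is a minimal in-memory sketch of the same filter. The `Game` struct, its fields, and the sample names are assumptions made for illustration; the real version ran as a PostgreSQL query, and its schema is not shown in this post.

```ruby
require "set"

# Hypothetical record type; the real schema is not shown in the post.
Game = Struct.new(:id, :name, :active)

# Returns the active games that the user has not joined yet.
def joinable_games(games, joined_ids)
  joined = joined_ids.to_set
  games.select { |game| game.active && !joined.include?(game.id) }
end

games = [
  Game.new(1, "Harbor Hunt", true),
  Game.new(2, "Old Town Trail", false),
  Game.new(3, "Museum Mystery", true)
]

# The user has already joined game 1, and game 2 is inactive.
joinable = joinable_games(games, [1])
puts joinable.map(&:name).inspect
```

The work grows with both the number of active games and the number of games each user has joined, which is why the query became a target for caching.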

What We Tried First (And Why It Failed)

We started by loading all active games into Redis to speed up the query. Our reasoning was that games rarely changed, so we could cache the result for a long time. Sounds great, right? Wrong. We set a TTL of 30 minutes, thinking that was short enough to pick up games being added or removed. What we didn't account for was how quickly new games were being created during peak hours. The cached list went stale almost as soon as it was written and could lag the database by up to 30 minutes, and every time the key expired the application had to rebuild the whole list from the database.
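The failure mode is easier to see in a small simulation. The sketch below is not our production code: `TtlCache` is a hypothetical in-memory stand-in for a Redis key with a TTL, and the clock is injected so that time can be advanced by hand.

```ruby
# In-memory stand-in for a cached value with a TTL (hypothetical; the real
# system stored the list in Redis).
class TtlCache
  def initialize(ttl_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @value = nil
    @expires_at = nil
  end

  # Returns the cached value, or runs the block to rebuild it after expiry.
  def fetch
    now = @clock.call
    if @expires_at.nil? || now >= @expires_at
      @value = yield
      @expires_at = now + @ttl
    end
    @value
  end
end

now = 0                               # fake clock, in seconds
db_games = ["harbor-hunt"]            # stands in for the active-games table
db_reads = 0
cache = TtlCache.new(30 * 60, clock: -> { now })
load_games = -> { db_reads += 1; db_games.dup }

cache.fetch(&load_games)              # first call reads the database
db_games << "museum-mystery"          # a new game is created
now += 10 * 60                        # ten minutes later
stale = cache.fetch(&load_games)      # still the old list
now += 20 * 60                        # the TTL has elapsed
fresh = cache.fetch(&load_games)      # full rebuild from the database

puts stale.inspect
puts fresh.inspect
puts db_reads
```

The new game stays invisible until the TTL runs out, and the refresh is always a full rebuild no matter how little has changed.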

The Architecture Decision

We had to rethink our caching strategy. Instead of rebuilding the whole list on expiry, we updated the Redis cache incrementally, using Redis's pub/sub functionality to notify our cache layer whenever a game was added or removed. It was a bit of a hack, but it worked beautifully. We also capped the size of the Redis cache so that it could not consume all available memory, and we monitored the cache size closely, adjusting the TTL whenever it threatened to grow out of control.
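A rough sketch of the incremental idea follows. Everything named here is an assumption for illustration: the `ActiveGamesCache` and `Channel` classes, the event names, and the JSON payload shape are not taken from our codebase, and the in-process `Channel` only stands in for Redis PUBLISH/SUBSCRIBE.

```ruby
require "json"
require "set"

# Keeps a set of active game IDs and applies add/remove events to it.
class ActiveGamesCache
  def initialize(initial_ids = [])
    @ids = Set.new(initial_ids)
  end

  # message is a JSON string such as {"event":"game_added","id":7}.
  # Event names and payload shape are assumptions for illustration.
  def handle(message)
    payload = JSON.parse(message)
    case payload["event"]
    when "game_added"   then @ids.add(payload["id"])
    when "game_removed" then @ids.delete(payload["id"])
    end
    self
  end

  def ids
    @ids.to_a.sort
  end
end

# Tiny in-process channel standing in for Redis PUBLISH/SUBSCRIBE.
class Channel
  def initialize
    @subscribers = []
  end

  def subscribe(&handler)
    @subscribers << handler
  end

  def publish(message)
    @subscribers.each { |handler| handler.call(message) }
  end
end

cache = ActiveGamesCache.new([1, 2])
channel = Channel.new
channel.subscribe { |message| cache.handle(message) }

channel.publish(JSON.generate({ "event" => "game_added", "id" => 3 }))
channel.publish(JSON.generate({ "event" => "game_removed", "id" => 1 }))
puts cache.ids.inspect
```

Each change now costs one small update instead of a full rebuild. One caveat worth keeping in mind: Redis pub/sub does not redeliver messages to subscribers that were disconnected when they were published, so a periodic full refresh is still a sensible safety net.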

What The Numbers Said After

The system scaled much more cleanly after the architectural change. Cache misses dropped significantly, which in turn reduced the load on our database. Redis memory usage stayed stable, and the cache hit ratio held steady at around 80%. The number of slow queries fell by an order of magnitude, and the server no longer hung at the first sign of growth.
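For reference, the hit ratio is simply hits divided by total lookups. Redis exposes the two counters as `keyspace_hits` and `keyspace_misses` in its `INFO stats` output. The sketch below uses made-up counter values chosen to land on 80%; they are not our actual measurements.

```ruby
# Fraction of lookups that were served from the cache.
def hit_ratio(hits, misses)
  total = hits + misses
  return 0.0 if total.zero?

  hits.to_f / total
end

# Made-up counters for illustration; these are not the post's measurements.
hits = 8_000
misses = 2_000

ratio = hit_ratio(hits, misses)
puts format("hit ratio: %.0f%%", ratio * 100)
puts "lookups that still reach the database: #{misses} of #{hits + misses}"
```

At an 80% hit ratio, one lookup in five still reaches the database, so the ratio is worth tracking over time rather than checking once.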

What I Would Do Differently

If I had to do it again, I would design a more robust caching strategy from the ground up. I would use a combination of Redis and Memcached to store our cached data, taking advantage of their complementary strengths. I would also choose an explicit eviction policy, such as least-recently-used (LRU) or least-frequently-used (LFU), to improve cache hit ratios and reduce cache thrashing. Finally, I would make sure the caching layer is properly monitored and tuned so that it does not become a performance bottleneck itself.
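To show what LRU eviction means in practice, here is a minimal sketch built on Ruby's insertion-ordered `Hash`. The `LruCache` class is hypothetical and exists only to illustrate the policy. In Redis itself, eviction is selected through the `maxmemory-policy` setting (for example `allkeys-lru` or `allkeys-lfu`), and its LRU is an approximation rather than an exact ordering like this one.

```ruby
# Minimal LRU cache built on Ruby's insertion-ordered Hash (illustrative only).
class LruCache
  def initialize(capacity)
    raise ArgumentError, "capacity must be positive" unless capacity.positive?

    @capacity = capacity
    @store = {}
  end

  # Returns the value and marks the key as most recently used.
  def get(key)
    return nil unless @store.key?(key)

    value = @store.delete(key)
    @store[key] = value
  end

  # Inserts or updates a key, evicting the least recently used entry if full.
  def set(key, value)
    @store.delete(key)
    @store[key] = value
    @store.shift if @store.size > @capacity
    value
  end

  # Keys ordered from least to most recently used.
  def keys
    @store.keys
  end
end

lru = LruCache.new(2)
lru.set(:a, 1)
lru.set(:b, 2)
lru.get(:a)        # :a becomes the most recently used key
lru.set(:c, 3)     # evicts :b, the least recently used key
puts lru.keys.inspect
```

An LFU policy would track access counts instead of recency, which tends to favor entries that stay popular over a long period.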

The lesson I learned from this experience is that optimising for demos over operations can have disastrous consequences. Designing a system that scaled cleanly took extra effort, but it was worth it in the end. I hope that future engineers can avoid the mistakes we made and build systems that are truly scalable and performant.