The Problem We Were Actually Solving
At first glance, it seemed like the problem was straightforward: our Treasure Hunt Engine was taking too long to process requests. Our monitoring tools showed high latency and CPU utilisation across the board. However, upon closer inspection, I discovered that the issue was more nuanced. The engine had been designed as a monolithic service, handling both event generation and caching in a single process. This made it difficult to scale horizontally, as we'd need to replicate the entire engine, including caching, across multiple nodes. Meanwhile, our caching layer, built using Redis, was maxing out its capacity due to the engine's high write load.
What We Tried First (And Why It Failed)
Initially, we tried to fix the performance problem by vertically scaling the Treasure Hunt Engine: upgrading its hardware and increasing its thread count. This only delayed the inevitable. As the user base and request rates continued to rise, we started seeing Redis connection timeouts and cache evictions caused by the growing write load. We also experienced intermittent crashes due to memory overflow errors caused by the cache. This was a temporary fix at best, and our search data showed that operators consistently hit this exact problem at the same stage of server growth.
The Architecture Decision
After careful consideration and analysis of our system's constraints, we decided to refactor the Treasure Hunt Engine into a microservices architecture. We broke out the caching component into a separate Redis cluster, allowing us to scale it independently of the engine. We also separated event generation into its own service, designed to take advantage of event-driven architecture principles. This change gave us the flexibility to scale services horizontally, reducing bottlenecks and improving overall system reliability. To further improve performance, we implemented Circuit Breaker and Rate Limiter patterns using Netflix's Hystrix library, allowing us to handle failures and bursts of traffic more efficiently.
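To make the two patterns concrete, here is a minimal Python sketch of a circuit breaker and a token-bucket rate limiter. It is not Hystrix's API and not our production code: the class names, thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows one
    trial call once `reset_after` seconds have passed (hypothetical defaults)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a setup like this, the engine would wrap each call to a downstream service in `breaker.call(...)` and check `bucket.allow()` before accepting a request. Passing the clock in as a parameter keeps both classes easy to test without sleeping.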
What The Numbers Said After
The refactored system showed significant improvements in performance and scalability. Our average response time decreased by 30%, while CPU utilisation dropped by 25%. The system handled 300,000 requests per minute (5,000 per second) without any issues, and our Redis cluster showed no signs of bottlenecks. We also reduced the number of application crashes by 70%, resulting in improved system uptime and reduced mean time to recovery (MTTR). Our search data showed a clear correlation between server growth stages and system performance, giving us a solid foundation for future scaling exercises.
What I Would Do Differently
In retrospect, I would've invested more time in load testing and stress testing before attempting to scale the system. Our initial attempts to vertically scale the engine were based on theoretical assumptions, which often proved incorrect. This lack of real-world testing led to unnecessary downtime and increased costs. Additionally, I would've monitored Redis performance metrics more closely, potentially catching the connection timeout and cache eviction issues earlier. By doing so, we could've implemented a more targeted fix and avoided the costly refactor.
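As a rough idea of what closer Redis monitoring could look like, the sketch below compares two snapshots of Redis `INFO` fields and flags memory pressure, new evictions, and rejected connections. The function name and the 90% threshold are assumptions for illustration. The snapshots are plain dicts, such as those returned by redis-py's `Redis.info()`, and `rejected_connections` is only one possible signal behind client-side timeouts.

```python
def redis_warnings(prev, curr, memory_ratio_limit=0.9):
    """Return warnings based on two Redis INFO snapshots (plain dicts).

    `memory_ratio_limit` is an illustrative threshold, not a recommendation.
    """
    warnings = []

    used = curr.get("used_memory", 0)
    limit = curr.get("maxmemory", 0)  # 0 means no memory limit is configured
    if limit and used / limit >= memory_ratio_limit:
        warnings.append(f"memory at {used / limit:.0%} of maxmemory")

    # These INFO fields are cumulative counters, so compare against the
    # previous snapshot instead of alerting on the absolute value.
    for field in ("evicted_keys", "rejected_connections"):
        delta = curr.get(field, 0) - prev.get(field, 0)
        if delta > 0:
            warnings.append(f"{field} rose by {delta}")

    return warnings
```

Run on a schedule, a check like this would surface a rising eviction count well before it shows up as application crashes.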