The Wrong Scaling Patterns Will Cost You in Production

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

In hindsight, we were optimizing for the wrong problem. We tuned the treasure hunt engine to scale horizontally within a single AWS availability zone and ignored the warning signs: traffic was about to spike, and once we started distributing our data across regions, that single-zone design would become a single point of failure. Our team was convinced that adding more instances within the same availability zone would solve all our problems, and that was where we erred.

What We Tried First (And Why It Failed)

Our initial solution was to add more RDS instances behind our NGINX load balancer, hoping to scale out the database to meet the increased traffic. But we soon realized that adding instances solved neither our disk I/O problem nor our high CPU usage. Our MySQL instances were choking on the load, and long query times were hurting our ability to serve requests. At that point, we knew we had to go back to the drawing board.

The Architecture Decision

We decided to implement a sharding solution using a combination of AWS Route 53, S3, and DynamoDB. We created a separate shard for each region, and each shard handled the treasure hunt sequence for its own region. The goal was to solve our scaling issues by distributing traffic across multiple regions and availability zones. We also added a Redis caching layer to reduce the load on DynamoDB and MySQL. Even so, problems remained.
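To make the per-region sharding idea concrete, here is a minimal sketch of how a request could be mapped to its shard. The region names, table names, and the default-region fallback are all hypothetical and chosen for illustration; this is not the exact routing we ran in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    region: str
    table_name: str  # hypothetical DynamoDB table name for this shard


# Hypothetical region-to-shard mapping; real names would differ.
SHARDS = {
    "us-east-1": Shard("us-east-1", "treasure-hunt-us-east-1"),
    "eu-west-1": Shard("eu-west-1", "treasure-hunt-eu-west-1"),
}

# Assumption: regions without a dedicated shard fall back to one default shard.
DEFAULT_REGION = "us-east-1"


def shard_for(region: str) -> Shard:
    """Return the shard that owns a region, or the default shard if none exists."""
    return SHARDS.get(region, SHARDS[DEFAULT_REGION])
```

Keeping the mapping in one place means adding a region is a data change rather than a code change, although the fallback does concentrate unmapped traffic on a single shard.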

What The Numbers Said After

After implementing our sharding solution and caching layer, we saw a significant reduction in our slow_request_ratio metric. But our NGINX instances were still intermittently timing out and returning 502 errors. Our Apache JMeter tests showed that the system could handle the increased load, yet real-world performance was far from ideal. It turned out that our Redis instances were undersized for the traffic, and the resulting Redis timeouts were hurting our application's ability to serve requests.
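One way to keep a slow cache from taking requests down with it is to treat a cache timeout as a miss and fall through to the backing store. The sketch below assumes a hypothetical `Cache` interface and a caller-supplied `load` function; it uses Python's built-in `TimeoutError` as a stand-in, whereas a real Redis client typically raises its own timeout exception type, which would need to be caught instead.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """Hypothetical minimal cache interface, not a real client API."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


def read_through(cache: Cache, key: str, load: Callable[[str], str]) -> str:
    """Cache-aside read that degrades to the backing store on cache timeouts."""
    try:
        cached = cache.get(key)
    except TimeoutError:
        # The cache is too slow to answer: skip it rather than fail the request.
        return load(key)

    if cached is not None:
        return cached

    # Cache miss: read from the backing store, then try to populate the cache.
    value = load(key)
    try:
        cache.set(key, value)
    except TimeoutError:
        # Failing to populate the cache should not fail the request either.
        pass
    return value
```

The trade-off is that every cache timeout becomes extra load on the backing store, so this pattern helps with intermittent timeouts but does not replace sizing the cache correctly.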

What I Would Do Differently

In hindsight, I would have advocated for a more decentralized architecture from the start. I would have pushed for a design that relied less on RDS and MySQL and opted instead for an event-driven architecture that took advantage of the scalability and durability of our AWS services. I would also have implemented more robust monitoring and logging to catch issues like the Redis timeouts before they became critical. Most importantly, I would have taken the time to educate our team about the scaling challenges we were about to face, which might have spared us the 3am call from the ops team.
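As a rough illustration of the kind of monitoring that could surface cache timeouts early, here is a sketch of a sliding-window timeout-rate check. The class name, window size, and threshold are hypothetical, and a real setup would more likely export this ratio to an existing metrics system than evaluate it in process.

```python
from collections import deque


class TimeoutRateMonitor:
    """Tracks the share of recent calls that timed out (hypothetical helper)."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # Only the most recent `window` outcomes are kept.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, timed_out: bool) -> None:
        """Record the outcome of one call: True if it timed out."""
        self.results.append(timed_out)

    def ratio(self) -> float:
        """Return the fraction of recorded calls in the window that timed out."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        """Alert when the timeout ratio rises strictly above the threshold."""
        return self.ratio() > self.threshold
```

Calling `record()` after each cache call and checking `should_alert()` periodically would flag a rising timeout rate well before it turns into user-facing 502 errors.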