The Bane of Scaling Treasure Hunts

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We initially tried to optimize the algorithm for puzzle generation to avoid overloading the database. Our attempts involved rewriting the puzzle generation logic to use a more efficient data structure and reducing the number of database queries. Despite these efforts, the system still crumbled during peak hours. Upon further investigation, we found that our real problem wasn't the algorithm, but the way we scaled our server fleet as the number of users increased.

What We Tried First (And Why It Failed)

We started with standard AWS Auto Scaling, which added EC2 instances based on CPU utilization. This approach increased latency and packet loss while scaling was in progress. Latency in AWS CloudWatch spiked to 500ms during scaling events, and our users felt it. Our allocation counts in CloudWatch also showed memory and CPU usage climbing rapidly, indicating that the system was starved for resources.
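To make the failure mode concrete, here is a minimal sketch of a CPU-threshold scale-out rule. It is a simplification and not AWS's actual implementation: the function name, the averaging window, and the 70% threshold are all assumptions chosen for illustration. It shows one plausible reason a CPU-driven rule reacts late: averaging over a window hides a sudden spike until several samples have gone by.

```rust
/// Hypothetical CPU-based scale-out rule (illustrative only, not the AWS API).
/// Returns true when the average CPU over the sample window exceeds the threshold.
pub fn cpu_scale_out(cpu_samples_pct: &[f64], threshold_pct: f64) -> bool {
    // With no data there is nothing to react to.
    if cpu_samples_pct.is_empty() {
        return false;
    }
    // Average the whole window; a single hot sample is diluted by earlier calm ones.
    let avg = cpu_samples_pct.iter().sum::<f64>() / cpu_samples_pct.len() as f64;
    avg > threshold_pct
}
```

With samples of 40%, 45%, and 95% and an assumed threshold of 70%, the window average is 60%, so the rule does not fire even though the most recent sample is already near saturation.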

The Architecture Decision

We decided to switch to a containerized architecture using Docker Swarm mode and Kubernetes. This let us scale on actual request volume rather than CPU utilization. We also put HAProxy in front as a load balancer to distribute traffic more evenly across our instance fleet. With this more granular scaling approach, we reduced latency and packet loss during scaling events. Latency in CloudWatch improved dramatically, to around 50ms.
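Scaling on request volume can be sketched as a small calculation: divide the observed request rate by the rate one replica is expected to handle, round up, and clamp to a configured range. The struct, field names, and numbers below are hypothetical and are not our production configuration; the sketch also assumes `min_replicas <= max_replicas`.

```rust
/// Hypothetical scaling policy; all names and values are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
pub struct ScalingPolicy {
    /// Requests per second one replica is assumed to handle comfortably.
    pub target_rps_per_replica: f64,
    pub min_replicas: u32,
    pub max_replicas: u32,
}

/// Computes a replica count from observed request volume instead of CPU.
pub fn desired_replicas(policy: &ScalingPolicy, observed_rps: f64) -> u32 {
    let target = policy.target_rps_per_replica;
    // Fall back to the minimum on unusable input rather than dividing by zero.
    if !target.is_finite() || target <= 0.0 || !observed_rps.is_finite() || observed_rps < 0.0 {
        return policy.min_replicas;
    }
    // Round up so a partially loaded replica still counts as a full one.
    let raw = (observed_rps / target).ceil();
    // Clamp to the configured range (assumes min_replicas <= max_replicas).
    raw.max(f64::from(policy.min_replicas))
        .min(f64::from(policy.max_replicas)) as u32
}
```

For example, with an assumed target of 100 requests per second per replica and a range of 2 to 20, an observed 950 requests per second yields 10 replicas. The decision tracks demand directly, so it does not wait for CPU to saturate first.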

What The Numbers Said After

After adopting the new architecture, latency during peak hours fell by 30% and packet loss by 50%. CloudWatch also showed lower CPU utilization, which let us run our instances more efficiently. Our allocation counts showed a more even distribution of resources, eliminating the memory and CPU starvation we had seen before.

What I Would Do Differently

If I were to redo our architecture decision, I would probably choose a serverless platform like AWS Lambda or Google Cloud Functions. This would require rewriting our application code to be event-driven, but it would let us scale more efficiently and cost-effectively. Going serverless would also remove the need to manage instance scaling and deployment ourselves, so we could focus on building new features rather than running the underlying infrastructure.
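As a rough idea of what "event-driven" would mean for our code, here is a sketch of a stateless handler: each invocation receives one event and returns one response, with any lookups passed in rather than held in process memory. Every type, variant, and the puzzle ID scheme below is hypothetical and invented for illustration. It uses only the standard library and does not depend on any serverless runtime crate.

```rust
/// Hypothetical events a treasure-hunt backend might receive.
#[derive(Debug)]
pub enum HuntEvent {
    PuzzleRequested { user_id: u64, level: u32 },
    AnswerSubmitted { puzzle_id: u64, answer: String },
}

/// Hypothetical responses produced by the handler.
#[derive(Debug, PartialEq)]
pub enum HuntResponse {
    PuzzleIssued { puzzle_id: u64 },
    AnswerChecked { correct: bool },
    UnknownPuzzle,
}

/// Handles a single event without keeping state between calls.
/// `expected_answer` stands in for an external store (an assumption of this sketch).
pub fn handle_event(
    event: HuntEvent,
    expected_answer: &dyn Fn(u64) -> Option<String>,
) -> HuntResponse {
    match event {
        HuntEvent::PuzzleRequested { user_id, level } => HuntResponse::PuzzleIssued {
            // Illustrative ID scheme only; wrapping ops avoid overflow panics.
            puzzle_id: user_id.wrapping_mul(1000).wrapping_add(u64::from(level)),
        },
        HuntEvent::AnswerSubmitted { puzzle_id, answer } => match expected_answer(puzzle_id) {
            // Compare ignoring surrounding whitespace and ASCII case.
            Some(expected) => HuntResponse::AnswerChecked {
                correct: expected.trim().eq_ignore_ascii_case(answer.trim()),
            },
            None => HuntResponse::UnknownPuzzle,
        },
    }
}
```

Because the handler holds no state of its own, a platform could run as many copies in parallel as the event volume requires, which is the property that would take instance scaling off our plate.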