Treasure Hunt Engine on Rails Crashed Under Load

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our system, the Treasure Hunt Engine, was built to handle user-generated treasure hunts. The hunts themselves were essentially complex queries on our large graph database. We wanted to support a maximum of 100 concurrent users, with each user submitting a new query every 5 seconds. Sounds like a challenge, but not an insurmountable one. The trouble started when we hit 50 concurrent users. The app would hang for about 90 seconds, then spew out this lovely error: "java.net.SocketTimeoutException: Read timed out". It was clear we had a problem with our query handling.

What We Tried First (And Why It Failed)

Initially, we thought the solution lay in optimizing our database queries. We switched from using the Neo4j graph database to Amazon Neptune, hoping the new engine would provide speedier query execution. However, this change only shifted the problem elsewhere. Our app was still taking ages to respond, and now we were also dealing with significant additional costs for the Neptune instance. To make matters worse, we started seeing "Service Unavailable" errors on our load balancer, indicating that even our relatively small app was getting overwhelmed by the sheer volume of queries.

The Architecture Decision

At this point, it became clear that our query handling was the main culprit. We decided to switch to a distributed caching layer using Redis. We'd cache the result of each query for 1 minute, allowing subsequent users to receive the pre-computed answer instead of re-running the query on our graph database. This change had a profound impact: our app was now able to handle 100 concurrent users, and even with the increased load, we were only seeing 5% cache misses - indicating our Redis layer was performing admirably.

What The Numbers Said After

The numbers don't lie. With Redis in place, our average query execution time decreased from 5.3 seconds to 0.27 seconds, a reduction of 94.9%. Our service availability also shot up to 99.9%, from 93.7% previously. What's more, our Neptune instance was no longer strained, and we saved 23% on our database costs. In short, our users were much happier, and our ops team wasn't pulling its collective hair out.

What I Would Do Differently

In retrospect, I'd approach the problem differently from the start. We should have added the Redis caching layer much sooner, rather than trying to optimize our database queries. It's often tempting to dive straight into database tuning, but in this case, it's clear we were overcomplicating things. The key takeaway is to identify the root cause of the problem and address it directly. Don't prematurely optimize your database - focus on architecting your system to handle the load in a scalable way from the start. The cost of premature optimization is higher than you think.