Designing Treasure Hunts That Don't Haunt You at 3am: A Story of Premature Optimization and Database-Driven Chaos

#devops #kubernetes #webdev #programming

The Problem We Were Actually Solving

It's a dark and stormy night in May 2024 when I get paged for the third time in four days for our "Treasure Hunt Engine" (TJE). The TJE is supposed to handle our annual employee hackathon, where teams have to solve a series of challenges within a set timeframe. The twist? The challenges are all "meta" in nature - teams have to solve a problem, and then use the solution to solve another problem, and so on. The system handles user submissions, scoring, and overall team rankings. The problem, as I soon realized, was not the TJE itself, but the way it was optimized for demos over operations.

What We Tried First (And Why It Failed)

We had implemented TJE as a monolithic Flask application, using an in-memory Redis instance for caching and a SQLite database for storing submissions and team rankings. We had also added a "caching proxy" layer to reduce the load on the application, which in turn used a combination of memcached and Redis to store frequently accessed data. Sounds great, right? We had reduced the load on the database, and were able to serve up to 100 concurrent users without breaking a sweat. But, as we soon found out, our greatest enemy was not the users, but the clock.

As the hackathon approached, we noticed that our application was getting slower and slower, with query latency creeping up into the hundreds of milliseconds. Our caching proxy was doing its job, but the underlying database was becoming a bottleneck, thanks to the sheer number of queries it was getting. We tried tweaking the database configuration, but nothing seemed to work. We were on the verge of adding more nodes to the database cluster, but that would have required a complete rewrite of our caching proxy.

The Architecture Decision

As I dug deeper into the issue, I realized that our caching proxy was not the problem. It was a symptom of a larger issue - our reliance on in-memory caching and our failure to implement a proper data store that could scale horizontally. I decided to rip out the entire caching proxy and replace it with a distributed database solution, using PostgreSQL as our primary store and TimescaleDB for our time-series data. We also implemented a real-time data streaming solution using Apache Kafka and Confluent Cloud.

What The Numbers Said After

After migrating to the new architecture, our query latency dropped by over 90%, and our overall system throughput increased by over 500%. Our team was able to handle over 500 concurrent users without breaking a sweat, and our data streaming solution allowed us to see real-time updates and rankings for our teams. We also saved a significant amount of money by reducing our database cluster size and eliminating the need for additional caching proxy nodes.

What I Would Do Differently

Looking back, I realize that we should have taken a more gradual approach to scaling our database and caching layers. We should have implemented a canary deployment strategy, where we would have routed a small percentage of traffic to the new caching proxy and database solution, and then gradually increased the percentage over time. This would have allowed us to catch any issues early on and make adjustments before rolling out the new solution to the entire system.

If I had to do it again, I would also consider using a service mesh like Linkerd to manage our microservices architecture and provide real-time observability and control. This would have allowed us to catch issues early on and make adjustments without having to restart the entire system.