The Wrong Way to Build a Treasure Hunt Engine: A Cautionary Tale of Premature Optimisation

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our treasure hunt engine had three jobs: serve users real-time updates on treasure locations, provide a leaderboard of the top hunters, and let administrators create new hunts and modify existing ones. That sounds straightforward, but it also had to be scalable and fault-tolerant. We had a large user base, and the slightest lag could cause users to lose interest.

What We Tried First (And Why It Failed)

We started with PostgreSQL, focusing on read scalability. We chose it for its strong consistency model, support for multiple query types, and active community. As we populated the database with user data, though, query performance became a problem. Treasure locations were stored as latitude and longitude coordinates, and the application issued a large number of nearest-neighbour searches against them. To mitigate the slow queries, we added a caching layer using Redis. This helped, but the performance issues persisted.
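To make the nearest-neighbour problem concrete, here is a minimal Python sketch of a brute-force search over latitude/longitude pairs. It is illustrative only, not the production query: the function names, the sample data, and the choice of the haversine formula with a mean Earth radius of 6371 km are all assumptions made for the example.

```python
import math

# Assumption: mean Earth radius in kilometres, good enough for a sketch.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_treasures(user_lat, user_lon, treasures, k=3):
    """Return the k treasures closest to the user.

    `treasures` is a list of (name, lat, lon) tuples. Every call measures
    the distance to every treasure and then sorts the whole list.
    """
    by_distance = sorted(
        treasures,
        key=lambda t: haversine_km(user_lat, user_lon, t[1], t[2]),
    )
    return by_distance[:k]


# Hypothetical sample data: three treasures in three cities.
TREASURES = [
    ("harbour_chest", 51.5074, -0.1278),
    ("old_mill", 48.8566, 2.3522),
    ("lighthouse", 52.5200, 13.4050),
]

if __name__ == "__main__":
    # A user standing in central London is closest to "harbour_chest".
    print(nearest_treasures(51.5, -0.12, TREASURES, k=1))
```

A search like this measures the distance to every stored location on each request, so its cost grows with the data, which is the kind of query that gets slower as a table fills up.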

The Architecture Decision

We eventually moved to a document-oriented database, MongoDB, and a service-oriented architecture (SOA). The design was microservices-based: each hunt was represented by a separate service, which gave us process isolation and simplified maintenance. We also adopted a distributed caching layer using Hazelcast, which allowed real-time updates without overwhelming the primary database. These changes significantly improved query performance and reduced the overall load on the database.
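The caching idea can be sketched as a cache-aside read path with explicit invalidation when a hunt changes. This is a simplified, single-process stand-in written in Python, not Hazelcast or Redis client code: `TTLCache`, `HuntLocationReader`, `on_hunt_updated`, and the `load_from_db` callback are hypothetical names, and a real distributed cache would replace the in-memory dictionary.

```python
import time


class TTLCache:
    """In-memory stand-in for a distributed cache (assumption for the sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is too old: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)

    def invalidate(self, key):
        self._entries.pop(key, None)


class HuntLocationReader:
    """Cache-aside reads: try the cache first, fall back to the database."""

    def __init__(self, cache, load_from_db):
        self._cache = cache
        self._load_from_db = load_from_db  # hypothetical database accessor
        self.db_reads = 0  # counted so the effect of caching is visible

    def locations(self, hunt_id):
        cached = self._cache.get(hunt_id)
        if cached is not None:
            return cached
        self.db_reads += 1
        value = self._load_from_db(hunt_id)
        self._cache.put(hunt_id, value)
        return value

    def on_hunt_updated(self, hunt_id):
        # Called when an administrator edits a hunt, so the next read
        # goes back to the database instead of serving stale data.
        self._cache.invalidate(hunt_id)
```

The trade-off in this pattern is freshness against load: a longer TTL means fewer database reads but a longer window of stale locations if an invalidation is missed.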

What The Numbers Said After

After the switch to MongoDB and the SOA design, the engine saw a 30% increase in throughput, a 25% reduction in latency, and a 70% decrease in database reads. Nearest-neighbour searches in particular went from an average latency of 250ms to 75ms. We could handle 10 times as many concurrent users without noticeable performance degradation, and our error rate fell from 2% to 0.5%.
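For readers who want the before/after pairs as relative changes, the arithmetic is short. The Python helper below, `percent_reduction`, is a hypothetical name, and it uses only the figures quoted above.

```python
def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) * 100.0 / before


# Nearest-neighbour latency: 250 ms before, 75 ms after.
nn_latency_drop = percent_reduction(250, 75)

# Error rate: 2% before, 0.5% after.
error_rate_drop = percent_reduction(2.0, 0.5)

if __name__ == "__main__":
    print(nn_latency_drop)   # 70.0
    print(error_rate_drop)   # 75.0
```

So the nearest-neighbour path improved by 70%, well beyond the 25% latency figure, and the error rate fell by 75% in relative terms.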

What I Would Do Differently

Looking back, I would have invested more time in designing the initial database schema and query strategy. That might have let us avoid the premature optimisation trap and sidestep the performance issues that arose later. I would also have rolled out the SOA design more gradually, introducing new services incrementally while monitoring the impact on existing components. A more measured approach would have saved us valuable time and resources.
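One way a gradual rollout could look in practice is a percentage gate that sends a stable slice of users to a new service while everyone else stays on the existing path. This is a minimal Python sketch of that idea, not something we had in place: `rollout_bucket`, `use_new_service`, and the 0 to 100 percentage scale are all hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99.

    Hashing the user ID keeps the assignment deterministic, so the same
    user always lands in the same bucket across requests and restarts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_service(user_id: str, rollout_percent: int) -> bool:
    """True if this user should be routed to the new service.

    Raising `rollout_percent` only ever adds users: anyone routed to the
    new service at 10% is still routed there at 50%.
    """
    return rollout_bucket(user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"hunter-{n}" for n in range(1000)]
    for percent in (0, 5, 25, 100):
        routed = sum(use_new_service(u, percent) for u in users)
        print(f"{percent}% rollout -> {routed} of {len(users)} users")
```

Because users never flip back and forth as the percentage rises, a regression seen at 5% can be tied to a fixed group of users before the change reaches everyone.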