Treasure Hunt Engine: Where Poor Documentation Meets Premature Optimisation

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At its core, the Treasure Hunt Engine was a queue-based architecture that processed hunt requests from thousands of concurrent users. The system's primary function was to generate hunts with unique combinations of items, each with its own probability of being included. Sounds simple, but the real challenge lay in ensuring fairness, diversity, and randomness, while also meeting performance and scalability requirements.
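
To make the generation step concrete, here is a minimal Python sketch of one way to build hunts in which every item has its own inclusion probability. The item names, probabilities, and function names are illustrative assumptions, not the engine's actual code, and the sketch ignores the queueing and persistence layers entirely.

```python
import random

# Hypothetical catalogue: item name -> probability of appearing in a hunt.
ITEM_PROBABILITIES = {
    "compass": 0.9,
    "old_map": 0.6,
    "golden_key": 0.25,
    "cursed_coin": 0.05,
}


def generate_hunt(probabilities, rng, min_items=1):
    """Include each item independently with its own probability.

    Re-draws until at least `min_items` items are selected, so a hunt
    is never empty.
    """
    while True:
        hunt = frozenset(
            name for name, p in probabilities.items() if rng.random() < p
        )
        if len(hunt) >= min_items:
            return hunt


def generate_unique_hunts(probabilities, count, seed=None, max_attempts=10_000):
    """Collect `count` distinct item combinations, or give up after max_attempts."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(max_attempts):
        seen.add(generate_hunt(probabilities, rng))
        if len(seen) == count:
            # Sort for a stable, reproducible result when a seed is given.
            return sorted(sorted(hunt) for hunt in seen)
    raise RuntimeError("could not find enough unique combinations")


if __name__ == "__main__":
    for hunt in generate_unique_hunts(ITEM_PROBABILITIES, 5, seed=42):
        print(hunt)
```

Even this toy version shows where the fairness questions come from: discarding empty or duplicate hunts is a form of rejection sampling, so the combinations that survive no longer follow the raw per-item probabilities exactly.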

Our monitoring tools showed that the system was constantly under pressure, with average queue times of 200ms and a peak of 500ms during the busiest hours. The complexity was compounded by the use of multiple repositories, each containing probabilistic models and algorithmic logic for generating hunt items.

What We Tried First (And Why It Failed)

When I first started working on the system, I decided to focus on optimising the queue processing logic. I added multiple threads and a custom async queue implementation, convinced that increasing concurrency would solve the performance issues. However, this approach introduced new problems: increased memory usage, thread-safety issues with our probabilistic models, and a proliferation of edge cases that made debugging a nightmare.

The system's errors started to escalate – "Could not acquire lock on probability table", "Invalid item count in queue", and "Thread abort due to memory exhaustion" became commonplace. Our system metrics told a grim tale: average queue times increased by 50%, and transaction rollbacks rose by 20%. It was clear that our optimisation efforts had only made things worse.
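
Lock errors like the ones above usually point to shared mutable state being touched from several threads at once. The following Python sketch shows the general shape of the problem and one conventional remedy, a single lock around every read-modify-write. The `ProbabilityTable` class, its methods, and the integer weights standing in for probabilities are all hypothetical; nothing here is taken from the real engine.

```python
import threading


class ProbabilityTable:
    """Hypothetical shared table of item weights, guarded by one lock."""

    def __init__(self, weights):
        self._weights = dict(weights)
        self._lock = threading.Lock()

    def adjust(self, item, delta):
        # The read-modify-write happens under the lock; without it, two
        # threads can read the same old value and one update gets lost.
        with self._lock:
            self._weights[item] = self._weights.get(item, 0) + delta

    def snapshot(self):
        # Readers get a copy, so they never observe a half-updated table.
        with self._lock:
            return dict(self._weights)


def run_demo(workers=8, increments=10_000):
    """Adjust the same weight from several threads and return the final value."""
    table = ProbabilityTable({"golden_key": 0})

    def hammer():
        for _ in range(increments):
            table.adjust("golden_key", 1)

    threads = [threading.Thread(target=hammer) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return table.snapshot()["golden_key"]


if __name__ == "__main__":
    print(run_demo())
```

The trade-off is that a single lock keeps the table correct but serialises every update, so piling more threads onto a structure like this tends to add contention rather than throughput.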

The Architecture Decision

After months of struggling with the system's poor design, we made a critical decision: to refactor the system around a single, consistent data model. We replaced our multiple repositories with a unified, schema-less storage solution that allowed us to merge the probabilistic models and algorithmic logic into a single, manageable entity.

We also introduced a new consistency model, based on causal consistency and event sourcing, which allowed us to resolve conflicts and inconsistencies at the root of our data model. The system's complexity was finally under control, and our monitoring tools showed a significant reduction in queue times and transaction rollbacks.
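
For readers unfamiliar with event sourcing, here is a minimal single-process sketch in Python: state is never edited in place, changes are appended to a log, and the current table is derived by replaying that log. The event types and the `EventLog` class are assumptions made for illustration. The sketch covers only the event-sourcing half of the design and says nothing about how causal ordering between nodes would be tracked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightSet:
    """Event: an item's inclusion weight was set to a new value."""
    item: str
    weight: float


@dataclass(frozen=True)
class ItemRetired:
    """Event: an item was removed from future hunts."""
    item: str


class EventLog:
    """Append-only log; the weight table is derived, never stored."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self, upto=None):
        """Fold the first `upto` events (all by default) into a weight table."""
        state = {}
        for event in self._events[:upto]:
            if isinstance(event, WeightSet):
                state[event.item] = event.weight
            elif isinstance(event, ItemRetired):
                state.pop(event.item, None)
        return state


log = EventLog()
log.append(WeightSet("compass", 0.9))
log.append(WeightSet("golden_key", 0.25))
log.append(WeightSet("golden_key", 0.1))
log.append(ItemRetired("compass"))

current = log.replay()         # table after every event
earlier = log.replay(upto=2)   # table as it stood after the first two events
```

Because every change is an event, a disputed value can be traced back to the exact event that produced it, and any earlier state can be rebuilt by replaying a prefix of the log.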

What The Numbers Said After

The metrics told a clear story: average queue times decreased by 70%, and transaction rollbacks dropped by 90%. The system was now able to handle peak loads without breaking a sweat. We also saw a significant improvement in our system's latency, with a 30% reduction in average response times.

What I Would Do Differently

Looking back, I would have approached our optimisation efforts with a more critical eye. We were so focused on tuning the queue processing logic that we failed to address the root cause of our performance issues: the system's poor design and inconsistent data model.

In hindsight, I would have recommended a more radical approach: ripping out the existing system and rebuilding it from scratch, with a focus on simplicity, consistency, and a clear data model. By doing so, we would have avoided the pitfalls of premature optimisation and saved ourselves months of frustration and debugging.