DEV Community

mary moloyi

Designing a Treasure Hunt Engine from Hell

The Problem We Were Actually Solving

We thought we were building a cutting-edge search engine that would let customers create custom treasure hunts with ease. In reality, we were optimizing for demo days over operational stability. Three UI designers and two product managers wanted to showcase the app's capabilities at an upcoming industry conference. The rest of the team, including our lead developer and me, was tasked with making it happen. We were told to "just make it work", and we set about it without understanding the long-term implications of our decisions.

What We Tried First (And Why It Failed)

Our first attempt was to power the treasure hunt engine with a popular search engine library, "Lumineer". We thought it would save us time and spare us a custom solution. After a few days of development, we realized that Lumineer was not designed to handle the complex filtering and ranking logic our customers required. We hit a constant stream of "IndexOutOfRangeException" and "NullPointerException" errors. It was clear that we needed a more bespoke solution.

The Architecture Decision

After some heated debates with the design team, we settled on a hybrid approach: Elasticsearch combined with a custom plugin for filtering and ranking. We thought this would give us the best of both worlds, the scalability of Elasticsearch and the flexibility of a custom plugin. However, we didn't account for plugin load order or for configuration errors cascading into wider failures. Our custom plugin was loading before Elasticsearch, which crashed the entire system whenever a user tried to access the treasure hunt engine.
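One way to make load order explicit is to declare what each component depends on and derive the startup sequence from that, rather than relying on whatever order the configuration happens to produce. The following is a minimal sketch using Python's standard `graphlib` module. The component names (`elasticsearch`, `ranking_plugin`, `hunt_api`) and the `startup_order` helper are hypothetical and not taken from our actual system.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical components, each mapped to the components it depends on.
DEPENDENCIES = {
    "elasticsearch": set(),
    "ranking_plugin": {"elasticsearch"},
    "hunt_api": {"elasticsearch", "ranking_plugin"},
}


def startup_order(dependencies):
    """Return a load order in which every component follows its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise RuntimeError(f"circular startup dependency: {exc.args[1]}") from exc


order = startup_order(DEPENDENCIES)
print(order)
```

A misconfiguration such as two components depending on each other then fails loudly at startup instead of surfacing as a crash on the first user request.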

What The Numbers Said After

The week after launch was a disaster. Our customer support team was flooded with complaints, and our operations team worked around the clock to resolve the issues. As damage control, we implemented a quick fix that temporarily bypassed the custom plugin. The true extent of the damage only became apparent when we reviewed the metrics. Our average response time had gone up from 50 ms to over 5 seconds, and our error rate had increased by 300%. It was clear that the treasure hunt engine's architecture needed a complete overhaul.
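A bypass like the one described above can be built in from the start as a guarded call: if the custom ranking step is switched off or throws, the results come back in the order the search engine returned them. This is only a sketch of the idea. `rank_with_fallback`, the `plugin_enabled` flag, and the shape of the hits are illustrative assumptions, not the fix we actually shipped.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger("hunt_search")

# A ranker takes the engine's hits and returns them reordered.
Ranker = Callable[[Sequence[dict]], List[dict]]


def rank_with_fallback(
    hits: Sequence[dict],
    custom_ranker: Ranker,
    plugin_enabled: bool = True,
) -> List[dict]:
    """Apply the custom ranker; if it is disabled or fails, keep the engine's order."""
    if not plugin_enabled:
        return list(hits)
    try:
        return custom_ranker(hits)
    except Exception:
        # Degrade to unranked results instead of failing the whole request.
        logger.exception("custom ranker failed; serving hits in engine order")
        return list(hits)
```

The trade-off is that users may briefly see less relevant results, which is usually preferable to an outage.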

What I Would Do Differently

If I had to do it all over again, I would take a different approach from the start. I would prioritize a modular design that separates the search logic from the application logic. That separation would have allowed us to use a more robust search engine like Solr or a cloud-based solution like Algolia. I would also plan the configuration and plugin load order carefully to avoid cascading failures. Most importantly, I would make sure that our design team was involved in the architecture discussions from the start, so that we didn't end up optimizing for demos over operational stability.
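To illustrate what separating search logic from application logic could look like, here is a minimal sketch in which the application depends only on a small interface and any engine can sit behind it. All names (`Clue`, `SearchBackend`, `InMemoryBackend`, `HuntService`) are hypothetical; a real adapter for Elasticsearch, Solr, or Algolia would implement the same `search` method using that vendor's client.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Clue:
    clue_id: str
    text: str


class SearchBackend(Protocol):
    """The only search surface the application is allowed to see."""

    def search(self, query: str, limit: int) -> List[Clue]: ...


class InMemoryBackend:
    """Stand-in backend using substring matching, for tests and local work."""

    def __init__(self, clues):
        self._clues = list(clues)

    def search(self, query: str, limit: int) -> List[Clue]:
        needle = query.lower()
        matches = [c for c in self._clues if needle in c.text.lower()]
        return matches[:limit]


class HuntService:
    """Application logic; knows nothing about which engine is behind it."""

    def __init__(self, backend: SearchBackend):
        self._backend = backend

    def find_clues(self, query: str, limit: int = 10) -> List[Clue]:
        if not query.strip():
            return []  # Reject blank queries before they reach the engine.
        return self._backend.search(query, limit)
```

With this shape, replacing the engine means writing one new backend class, and the application code and its tests stay unchanged.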

Looking back, I realize that the Treasure Hunt Engine was a perfect example of a system that was designed to fail. But it was also a valuable learning experience that taught me the importance of prioritizing operational stability and modularity in system design.
