The Veltrix Approach to a Treasure Hunt Engine: A Recipe for Disaster

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

We were building an "infinite" treasure hunt experience for a marketing event. The idea was to have users find clues and solve puzzles while navigating a fictional world. Sounds fun, right? It was, at least for the marketing team. For us, it was a mess we had to clean up.

We were given a system that was optimized for demo-day performance, with almost no attention paid to operational reliability. As I dug deeper, I found two main pieces: a custom-built search engine based on a modified version of Apache Solr, and a monolithic Node.js application that handled all the logic for the treasure hunt.

What We Tried First (And Why It Failed)

Initially, we tried to tune the Solr instance to handle the volume of queries our users generated. We tweaked the configuration, upped the RAM, and even swapped in a faster disk. However, the root cause was not the Solr instance itself but the way we were indexing the data. The indexing process was slow and resource-intensive, and it ran in real time as users interacted with the system.

Every time a user solved a puzzle or found a clue, we were re-indexing the entire database, which led to a massive slowdown. Our attempts to optimize Solr were just putting a Band-Aid on a much larger issue.
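To make the cost difference concrete, here is a minimal sketch of full re-indexing versus updating only the document that changed. It uses a toy in-memory inverted index, not Solr, and every name in it (`ClueIndex`, `docs_indexed`, the sample clue texts) is hypothetical and chosen for illustration only.

```python
from collections import defaultdict


class ClueIndex:
    """Toy in-memory inverted index (term -> set of document ids).

    Hypothetical stand-in for a real search engine; it only exists to
    show how much indexing work each strategy performs.
    """

    def __init__(self, documents):
        self.documents = dict(documents)  # doc_id -> text
        self.postings = defaultdict(set)
        self.docs_indexed = 0  # counts how many documents were (re)indexed
        self.rebuild()

    def _add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.docs_indexed += 1

    def _remove(self, doc_id):
        # Drop the old terms of a document before re-adding it.
        for term in self.documents.get(doc_id, "").lower().split():
            self.postings[term].discard(doc_id)

    def rebuild(self):
        """Full re-index: work grows with the size of the whole data set."""
        self.postings.clear()
        for doc_id, text in self.documents.items():
            self._add(doc_id, text)

    def update(self, doc_id, text):
        """Incremental update: touches only the document that changed."""
        self._remove(doc_id)
        self.documents[doc_id] = text
        self._add(doc_id, text)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


docs = {i: f"clue {i} hidden" for i in range(1000)}
index = ClueIndex(docs)

# One user solves clue 7: incremental update.
index.docs_indexed = 0
index.update(7, "clue 7 solved")
incremental_cost = index.docs_indexed

# Another user solves clue 8: full rebuild, as the original system did.
index.docs_indexed = 0
index.documents[8] = "clue 8 solved"
index.rebuild()
full_cost = index.docs_indexed

print(incremental_cost, full_cost)
```

In this sketch a single solved clue costs one document's worth of indexing in the incremental path and the whole collection's worth in the rebuild path, which is the pattern that hurt us once every user action triggered a rebuild.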

The Architecture Decision

After some soul-searching, we decided to re-architect the system from the ground up. We split the monolithic Node.js application into smaller microservices, each responsible for a specific task. We dropped the custom-built search engine and replaced it with an Elasticsearch cluster that we could scale out. We also added a RabbitMQ message queue so that indexing could happen asynchronously.

The new architecture allowed us to scale horizontally, with each component sized for its own load. We could now index data in the background without impacting the user experience.
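The decoupling described above can be sketched roughly as follows. This is not our production code: it uses Python's standard-library `queue.Queue` and a thread as an in-process stand-in for RabbitMQ and a consumer service, a plain dict as a stand-in for the Elasticsearch cluster, and hypothetical names (`handle_puzzle_solved`, `indexing_worker`, the event fields) throughout.

```python
import queue
import threading

# In-process stand-ins (assumptions for illustration only):
index_queue = queue.Queue()  # plays the role of the message queue
search_index = {}            # plays the role of the search cluster


def indexing_worker():
    """Consume indexing events in the background, one at a time."""
    while True:
        event = index_queue.get()
        try:
            if event is None:  # sentinel: shut the worker down
                return
            # Index only the document named in the event.
            search_index[event["clue_id"]] = event["status"]
        finally:
            index_queue.task_done()


def handle_puzzle_solved(user_id, clue_id):
    """Request path: enqueue the indexing work and return immediately."""
    index_queue.put({"user_id": user_id, "clue_id": clue_id, "status": "solved"})
    return {"ok": True, "clue_id": clue_id}


worker = threading.Thread(target=indexing_worker, daemon=True)
worker.start()

response = handle_puzzle_solved("user-1", "clue-42")

index_queue.join()     # wait until the queued event has been indexed
index_queue.put(None)  # ask the worker to stop
worker.join()

print(response, search_index)
```

The point of the design is that the request handler never waits on indexing. With a real broker, the queue also absorbs bursts and lets the indexing consumers scale independently of the services that answer users.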

What The Numbers Said After

The new architecture made a significant difference. Average response time dropped from 30 seconds to under 2 seconds. The number of queries per second (QPS) the system could handle increased by a factor of 10, and we were able to support a much larger user base. The system was now scalable, reliable, and, most importantly, fun to use.

What I Would Do Differently

In hindsight, I would have pushed harder for a more robust search engine from the start. Elasticsearch would have been a better choice for this particular use case, and we could have avoided the indexing nightmare.

However, the real lesson here is that demos and operations are two different beasts. When building a system, prioritize operational reliability over demo-day performance. Doing so will save you (and your team) from the 3 a.m. page and give you a system that's fun to use, for users and operators alike.