DEV Community

Cover image for The Inherent Flaws of Treasure Hunt Engine: A Cautionary Story of a Well-Documented System
mary moloyi
mary moloyi

Posted on

The Inherent Flaws of Treasure Hunt Engine: A Cautionary Story of a Well-Documented System

The Problem We Were Actually Solving

Our treasure hunt engine, affectionately called "QuestMaster", was designed to generate a series of seemingly unrelated questions for a subset of our users. The questions were a mix of multiple-choice, fill-in-the-blanks, and picture-based challenges, which, when answered correctly, led the users to a hidden treasure. It seemed innocuous enough, but the system relied on a precise sequence of events, each one dependent on the previous, making it an accident waiting to happen. At the time, our metrics showed an average of 15 correct sequences per million users.

What We Tried First (And Why It Failed)

The first thing we did was check the usual suspects: the question database, the user authentication, and the sequence engine itself. All seemed to be functioning as expected, except for one minor detail - our documentation had a section that warned against using the sequence engine with simultaneous user sessions. It was buried in the fine print and nobody, including me, had noticed it. So, naturally, that's exactly what we were doing - running the sequence engine with multiple concurrent user sessions. The documentation was being "helpfully" accurate, but completely misleading.

The Architecture Decision

The problem was that the sequence engine was built using an event-driven system, where each user session created a new event stream. When the session number exceeded a certain threshold, our system would start to prioritize events, essentially delaying the sequence completion. This led to a cascading effect, where each delayed sequence would result in another delayed sequence, until eventually the system ground to a halt. At this point, our metrics showed a 30% drop in correct sequences, and an increase in error logs.

What The Numbers Said After

A few days after the fix, our metrics started showing a marked improvement. The correct sequences per million users jumped back up to 25, and error logs decreased by 40%. We also observed a 75% reduction in concurrent session usage, which revealed a nasty side effect - our users had gotten hooked on the treasure hunt and were now trying to game the system by opening multiple browser tabs simultaneously. Who knew?

What I Would Do Differently

This incident taught me two valuable lessons. First, documentation is only as good as its audience, and users - regardless of technical background - read documentation differently than engineers. It is essential to consider the "average user" when creating documentation, and make sure that it's clear, concise, and actionable. Second, when dealing with complex systems, it's essential to identify the sequence of events that leads to failure, and prioritize fixing those sequence bottlenecks first. In our case, the delayed sequence engine was the culprit, and a simple re-architecture to prioritize event handling solved the problem. After this incident, our users were once again able to find their treasure, and we operators were able to get our sleep back.

Top comments (0)