The Great Misunderstanding of Treasure Hunt Engine: Lessons from the Veltrix Operator trenches

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We launched Treasure Hunt Engine at Veltrix in 2018. It was a real-time, event-driven system for large-scale e-commerce applications, and our goal was to deliver personalized product recommendations to millions of users within a millisecond. We were proud of it, but as the user base grew, so did the complexity of our event architecture. That was when we realized our operators were struggling to understand the engine's intricacies despite our extensive documentation. They would often reach out with questions like "Why does the engine take so long to warm up?" or "What's going on when every second event fails?"

What We Tried First (And Why It Failed)

Our first approach was to create a comprehensive wiki documenting every parameter and configuration option of the Treasure Hunt Engine. We called it the "Treasure Hunt Engine Cookbook." It was a 20-page behemoth covering every possible use case, including deployment strategies, data ingestion, and cache settings. The idea was to empower our operators with knowledge, but as the wiki grew, so did the confusion. The sheer volume of information overwhelmed them and buried the critical details they actually needed. Troubleshooting even a simple issue took 7 to 10 minutes on average, and our SLA was suffering.

The Architecture Decision

We realized that the wiki was not the solution. Instead, we decided to redesign the Treasure Hunt Engine's user interface to prioritize the most critical parameters and configuration options. We created a series of guided workflows tailored to our operators' specific use cases. The new UI exposed only the most relevant settings, eliminating unnecessary decisions and reducing mental overhead. We also introduced a concept we called "parameter zones," which grouped related settings together and gave each zone a clear explanation. The results were striking: our average ticket resolution time decreased by 70%, and our SLA improved significantly.
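To make the "parameter zones" idea concrete, here is a minimal sketch of how such a grouping could be modeled. It is not the engine's actual code: the class names, the `critical` flag, and the example zone and parameter names are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Parameter:
    name: str
    description: str
    critical: bool = False  # hypothetical flag: shown in the guided workflow by default


@dataclass
class ParameterZone:
    """A named group of related settings with one shared explanation."""

    name: str
    explanation: str
    parameters: list[Parameter] = field(default_factory=list)

    def visible(self, show_advanced: bool = False) -> list[Parameter]:
        # Expose only critical settings unless the operator opts in to everything.
        return [p for p in self.parameters if show_advanced or p.critical]


# Hypothetical zone; the names below are illustrative, not real engine settings.
cache_zone = ParameterZone(
    name="cache",
    explanation="Settings that control cache warm-up and eviction.",
    parameters=[
        Parameter("warmup_batch_size", "Items preloaded per warm-up step.", critical=True),
        Parameter("eviction_policy", "How entries are dropped when the cache is full."),
    ],
)

default_view = [p.name for p in cache_zone.visible()]
advanced_view = [p.name for p in cache_zone.visible(show_advanced=True)]
print(default_view)
print(advanced_view)
```

The design choice worth noting is that hiding a setting is a property of the view, not of the data: the full parameter list stays available for anyone who asks for it, while the default path shows only what most tickets need.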

What The Numbers Said After

Our metrics reflected the success of the new UI. The five most common tickets, which previously took an average of 22 minutes to resolve, now took only 4 minutes. The "Why does the engine take so long to warm up?" question, previously a top-10 ticket, dropped off the radar completely. Our user satisfaction scores, measured through regular operator surveys, increased by 25%. We also saw fewer support requests for the Treasure Hunt Engine overall, which meant our team could dedicate more resources to feature development and innovation.
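As a quick check on those figures, the relative reduction for the top five tickets can be computed directly from the two averages quoted above. The helper name is an assumption made for this sketch.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Averages quoted in the post: 22 minutes before, 4 minutes after.
top5_drop = percent_reduction(22, 4)
print(f"{top5_drop:.1f}%")  # prints "81.8%"
```

That works out to roughly 82% for the most common tickets, a steeper drop than the 70% average across all tickets.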

What I Would Do Differently

If I could go back, I would focus even more on the human side of the problem. I would spend more time interviewing our operators, observing their workflow, and understanding the pain points they faced. I would also take a more incremental approach to the redesign, testing and refining the new workflows in small batches. That would have let us gather feedback earlier and avoid the monumental task of redesigning the entire UI at once.