Avoiding the Next Catastrophe in the Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

After three years of development, we discovered that the pain point was not throughput but a single parameter in the event pipeline that could bring the entire system down. When we tested with a normal distribution of event payloads, the system performed flawlessly. Under actual theme park usage patterns, a mix of normal and large payloads, Veltrix would fail with a 'Timeout waiting for worker to become available' error. This surprised us, because we thought we had accounted for every edge case. It turned out that the event producer was not sending a specific field, which prevented the event consumer from processing events correctly.

What We Tried First (And Why It Failed)

We tried adjusting the event queue's capacity, increasing the timeouts, and tweaking the thread pool size, all in the hope of fixing the 'Timeout waiting for worker to become available' issue. These adjustments treated the symptom rather than the cause: they increased the number of deadlocks and triggered retry storms. It took us weeks to recover from the first traffic spike after we rolled out these quick fixes, and the system became increasingly brittle and unpredictable.

The Architecture Decision

After much research and debate, we decided to introduce a parameter validation mechanism that checks for the presence of the mandatory field at the event producer. We opted for early validation: an event is rejected if any required parameter is missing. This decision had both immediate and long-term benefits. The immediate benefit was a hard fail with an error message, "Field X missing", which made debugging much easier. The long-term benefit was that missing or malformed events could no longer overwhelm the system with retry storms and deadlocks.
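To make the idea concrete, here is a minimal sketch of producer-side validation. It is not the actual Veltrix code: the field names in `REQUIRED_FIELDS`, the `MissingFieldError` class, and the `publish` helper are all hypothetical, since the post does not name the missing field or the producer API. The only detail taken from the text is the shape of the error message.

```python
from typing import Any, Callable, Mapping

# Hypothetical field names: the post does not say which field was missing.
REQUIRED_FIELDS = ("event_id", "event_type", "payload")


class MissingFieldError(ValueError):
    """Raised when an event lacks one or more mandatory fields."""

    def __init__(self, missing):
        self.missing = list(missing)
        # Mirrors the "Field X missing" message described above.
        super().__init__("; ".join(f"Field {name} missing" for name in self.missing))


def validate_event(event: Mapping[str, Any]) -> None:
    """Fail fast if any required field is absent or set to None."""
    missing = [name for name in REQUIRED_FIELDS if event.get(name) is None]
    if missing:
        raise MissingFieldError(missing)


def publish(event: Mapping[str, Any], send: Callable[[Mapping[str, Any]], None]) -> None:
    """Validate at the producer, then hand the event to the transport.

    `send` stands in for whatever actually enqueues the event (assumed).
    An invalid event never reaches the queue, so the consumer cannot
    stall on it and no retries are triggered for it.
    """
    validate_event(event)
    send(event)
```

Rejecting the event before it is enqueued means the failure surfaces in the producer, next to the code that built the event, instead of appearing later as a worker timeout on the consumer side.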

What The Numbers Said After

Adding the validation led to a 90% reduction in 'Timeout waiting for worker to become available' errors and a 40% decrease in system latency. The system now consistently meets its 99th-percentile service level agreement (SLA) without over-provisioning or ad-hoc tuning. We also measured a significant improvement in overall throughput while maintaining a high level of reliability.

What I Would Do Differently

In hindsight, I would have caught this problem earlier with more extensive integration testing that simulated real-world workload patterns, including the actual distribution of event payloads rather than only a normal one. I would also have introduced monitoring and alerting for key metrics such as event processing latency and queue size from the very beginning of the project. Those metrics would have let us detect the problem sooner and respond to it more effectively.
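As a rough illustration of both points, the sketch below generates a mixed workload of normal and large payload sizes for an integration test and checks a 99th-percentile latency against a threshold. Every number in it (the 10% large-payload share, the byte ranges, the threshold) is an illustrative assumption, not a figure measured from the theme park traffic, and the function names are hypothetical.

```python
import math
import random


def mixed_payload_sizes(count, large_fraction=0.1, seed=42):
    """Return payload sizes in bytes: mostly normal, occasionally large.

    All sizes and the large_fraction default are assumed values chosen
    for illustration only.
    """
    rng = random.Random(seed)  # seeded so test runs are reproducible
    sizes = []
    for _ in range(count):
        if rng.random() < large_fraction:
            # Large payload: uniform between 256 KB and 1 MB (assumed).
            sizes.append(rng.randint(256_000, 1_000_000))
        else:
            # Normal payload: roughly Gaussian around 2 KB (assumed).
            sizes.append(max(1, int(rng.gauss(2_000, 500))))
    return sizes


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_breached(latencies_ms, threshold_ms):
    """True when the 99th-percentile latency exceeds the threshold."""
    return percentile(latencies_ms, 99) > threshold_ms
```

In a test, the generated sizes would drive the producer against a staging pipeline, with the observed per-event latencies fed into `p99_breached`. The same check could back an alert in production, alongside one on queue size.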