Avoiding Catastrophic Event Fan-outs: Lessons Learned from a Misconfigured Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The treasure hunt engine was responsible for sending personalized location-based challenges to millions of players. When a new challenge was created, the system had to notify the appropriate services by publishing events to our event bus. However, as the load increased, we noticed a disproportionate growth in event throughput, causing the event bus to slow down. Our team was surprised to find out that a single service, the "treasure service", was responsible for sending nearly 70% of the total events.

What We Tried First (And Why It Failed)

Initially, we attempted to scale the event bus by increasing the number of event brokers from 5 to 10. We also introduced a sharding mechanism based on a modulo operation to distribute events across the brokers. Unfortunately, this led to a significant increase in event brokerage errors (average error rate: 12% compared to 2%). The problems started when all the brokers were handling the "treasure service" events simultaneously, causing a massive increase in message latency and duplicate events.

The Architecture Decision

To address the issue, we decided to move away from a push-based event-driven architecture and adopted a pull-based design for the treasure service. Instead of sending events for every new challenge, we introduced a cache-based strategy where the treasure service would periodically poll the challenge service for new updates. We also implemented rate limiting on the challenge service to prevent overwhelming the event bus during periods of high activity. This change had an immediate impact on reducing the total number of events by 45% and event brokerage errors by 85%.

What The Numbers Said After

Here are some key metrics that demonstrate the effectiveness of our architecture decision:

Event bus throughput increased by 25% without any significant increase in latency.
Duplicate events and event brokerage errors were reduced by 95%.
Treasure service processing time decreased by 50%.

We also saw a considerable drop in incident reports from developers and players, who were experiencing issues due to the treasure service sending incorrect or duplicate events.

What I Would Do Differently

Looking back, we could have avoided the need for rate limiting and caching if we had implemented a more robust sharding mechanism based on a consistent hashing algorithm. This would have ensured an even distribution of events across the brokers, reducing contention and duplication.

In conclusion, the lessons learned from our treasure hunt engine misconfiguration serve as a reminder of the importance of designing event-driven architectures with scalability in mind. By taking a structured approach to tackling event-driven system designs and prioritizing pull-based communication, we can avoid catastrophic event fan-outs and maintain high levels of system reliability and performance.