Treasure Hunt Engine: Where We Got Events Wrong (And How We Fixed It)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our team had built the Treasure Hunt Engine using a Java-based framework with RabbitMQ for messaging and MySQL for persistence. When users entered a hunt, their actions triggered a complex sequence of tasks, from database lookups to external API calls. The system was supposed to maintain a perfect leaderboard, keeping track of each user's progress in real-time. But something was amiss.

Every hour, users noticed glitches: the leaderboard would lag for several minutes or apply updates in reverse order. Some users saw incorrect scores, while others received "connection timed out" errors.

What We Tried First (And Why It Failed)

First, we tuned our database queries to reduce the load on MySQL. We added indexes, optimized query plans, and partitioned the data across multiple servers. Some performance metrics improved, but the core issues persisted: leaderboard lag, incorrect scores, and server crashes.

Next, we tweaked the RabbitMQ configuration to improve message queueing and dead-lettering. We set up fanout exchanges, reworked their queue bindings, and tuned message TTLs. Still, the system crashed under load.

We then attempted to optimize our external API calls. We implemented async calls, batching, and load balancing. However, this only added complexity, along with a steady stream of "java.net.SocketTimeoutException" errors in our logs.

The Architecture Decision

It dawned on us that our entire system was too tightly coupled. The database, message queue, and APIs were not isolated enough, causing ripple effects throughout the system. We needed a new approach.

We decided to adopt a Saga-based architecture for our Treasure Hunt Engine. This would allow us to break down complex workflows into smaller, more manageable tasks, each with its own constraints and error handling. In a Saga, each step typically also defines a compensating action that undoes its work if a later step fails.

To start, we reconfigured RabbitMQ to route events to separate, dedicated queues. Consumers on these queues would then trigger Sagas, each encapsulating a specific task, from database updates to external API calls.
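To make the Saga idea concrete, here is a minimal in-memory sketch in plain Java. It is not our production code: the class name SagaSketch, the Step interface, and the step names (recordClue, callPartnerApi, updateLeaderboard) are hypothetical, and the queue consumer that would trigger the Saga is left out. It only shows the core pattern: run steps in order and, if one fails, undo the completed ones in reverse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal in-memory saga sketch. All names here are hypothetical. */
public class SagaSketch {

    /** One saga step: a forward action plus the action that undoes it. */
    public interface Step {
        String name();
        void execute();
        void compensate();
    }

    /** Convenience factory so steps can be written as lambdas. */
    public static Step step(String name, Runnable action, Runnable undo) {
        return new Step() {
            public String name() { return name; }
            public void execute() { action.run(); }
            public void compensate() { undo.run(); }
        };
    }

    /**
     * Runs the steps in order. If a step throws, the steps that already
     * completed are compensated in reverse order and the saga reports failure.
     */
    public static boolean run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step current : steps) {
            try {
                current.execute();
                completed.push(current);
                log.add("done:" + current.name());
            } catch (RuntimeException e) {
                log.add("failed:" + current.name());
                while (!completed.isEmpty()) {
                    Step undo = completed.pop();
                    undo.compensate();
                    log.add("undone:" + undo.name());
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] score = {0};
        List<String> log = new ArrayList<>();
        List<Step> steps = List.of(
            step("recordClue", () -> score[0] += 10, () -> score[0] -= 10),
            // Simulated external API failure (an assumption for the demo).
            step("callPartnerApi", () -> { throw new IllegalStateException("timeout"); }, () -> {}),
            step("updateLeaderboard", () -> {}, () -> {})
        );
        boolean ok = run(steps, log);
        // Prints: false 0 [done:recordClue, failed:callPartnerApi, undone:recordClue]
        System.out.println(ok + " " + score[0] + " " + log);
    }
}
```

The failed API call never reaches the leaderboard step, and the points from the first step are rolled back, so a partial failure does not leave a wrong score behind. A real implementation would also need to persist saga state and handle failures inside compensations, which this sketch ignores.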

We introduced an event store to persist all events in a database, enabling us to replay events and retry failed tasks. This allowed us to maintain consistency across the system, even in the face of failures.
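The replay idea can be sketched in a few lines of plain Java. This is an illustration under assumptions, not our actual schema: EventStoreSketch, ScoreEvent, and the append and replay methods are hypothetical names, and an in-memory list stands in for the database table. Each event gets a monotonically increasing sequence number, and state is rebuilt by replaying events after a given checkpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for an event store. All names here are hypothetical. */
public class EventStoreSketch {

    /** An immutable fact: a user earned points. The sequence number fixes the order. */
    public static final class ScoreEvent {
        public final long sequence;
        public final String userId;
        public final int points;

        ScoreEvent(long sequence, String userId, int points) {
            this.sequence = sequence;
            this.userId = userId;
            this.points = points;
        }
    }

    // Append-only log; a real store would be a database table.
    private final List<ScoreEvent> log = new ArrayList<>();

    /** Appends an event and assigns it the next sequence number (starting at 1). */
    public ScoreEvent append(String userId, int points) {
        ScoreEvent event = new ScoreEvent(log.size() + 1, userId, points);
        log.add(event);
        return event;
    }

    /**
     * Sums points per user by replaying, in order, every event whose
     * sequence number is greater than the given checkpoint.
     */
    public Map<String, Integer> replay(long afterSequence) {
        Map<String, Integer> scores = new TreeMap<>();
        for (ScoreEvent event : log) {
            if (event.sequence > afterSequence) {
                scores.merge(event.userId, event.points, Integer::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        EventStoreSketch store = new EventStoreSketch();
        store.append("alice", 10);
        store.append("bob", 5);
        store.append("alice", 20);
        // Replaying from the start rebuilds the full leaderboard state.
        System.out.println(store.replay(0)); // prints {alice=30, bob=5}
        // Replaying after checkpoint 2 covers only the work not yet processed.
        System.out.println(store.replay(2)); // prints {alice=20}
    }
}
```

Because events are never modified after they are appended, a consumer that crashed after processing event 2 can resume from that checkpoint without double-counting earlier points.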

What The Numbers Said After

After months of re-architecture and testing, our Treasure Hunt Engine became fault-tolerant. We measured key performance indicators (KPIs) to validate our change:

Average response time: Improved from 5 minutes to 200 ms.
System crashes: Reduced by 95%.
Incorrect scores: Eliminated.
Event delivery rate: Increased tenfold.

We monitored the system for months, and the trends looked promising: the more users engaged with the Treasure Hunt Engine, the more stable it became.

What I Would Do Differently

Looking back, I would design the Saga-based architecture from the beginning, rather than trying to patch our existing system. It's far easier to build a system for scalability and fault tolerance than to retrofit those traits later.

Furthermore, I would invest more in testing our event-driven system, simulating edge cases and load scenarios to ensure the Saga architecture works as intended.

The Saga-based architecture not only improved the health and stability of our Treasure Hunt Engine but also simplified our development and maintenance workflows. Event-driven design and the consistency model that came with it allowed us to scale our system while reducing the risk of errors creeping in under heavy load. The numbers spoke for themselves.