Catastrophic Architecture: How Misunderstanding Events Derailed Our Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We were tasked with building a real-time treasure hunt engine for an upcoming event. The system needed to handle thousands of users simultaneously, with a high-quality user experience that included seamless location updates and accurate rewards assignment. As the lead systems engineer, I took it upon myself to ensure that the events architecture was sound, given the complex interactions between users, the game state, and the backend services. What ensued was a painful journey of discovery, where I realized that even with a solid understanding of distributed systems, events can be a minefield if approached incorrectly.

What We Tried First (And Why It Failed)

Initially, we employed an event-driven architecture in which every action, from a user's move to a treasure's location update, was represented as an event and published to a message bus. This gave us loose coupling between our microservices and let us scale horizontally. However, within a week of launch, the message bus was struggling to keep up with the sheer volume of events. Latency crept above 1 second, more than double our 500ms goal, which led to frustrated users and a treasure hunt that was less exciting than it should have been. At that point, I decided to take a closer look at the bus's performance.
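To make the "every action is an event" shape concrete, here is a minimal sketch in Rust. It is not our production code: the `GameEvent` variants and the `publish_and_collect` helper are hypothetical names, and a standard-library channel stands in for the real message bus.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event types; names and fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum GameEvent {
    UserMoved { user_id: u64, lat: f64, lon: f64 },
    TreasureRelocated { treasure_id: u64, lat: f64, lon: f64 },
}

/// Publishes a batch of events from a producer thread and collects what a
/// single consumer receives. An in-process channel stands in for the bus.
fn publish_and_collect(events: Vec<GameEvent>) -> Vec<GameEvent> {
    let (tx, rx) = mpsc::channel::<GameEvent>();

    let producer = thread::spawn(move || {
        for event in events {
            // `send` only fails if the receiving side has been dropped.
            tx.send(event).expect("consumer hung up");
        }
        // `tx` is dropped when the closure ends, which closes the channel.
    });

    // `iter` blocks until every sender is gone, then the iterator ends.
    let received: Vec<GameEvent> = rx.iter().collect();
    producer.join().expect("producer panicked");
    received
}
```

With one producer and one consumer, a channel like this preserves send order. The trouble starts once many services publish and consume in parallel, because every move by every user becomes one more message the bus has to carry.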

The Architecture Decision

I opted to introduce a distributed event store, Eventuate, to offload event processing from the message bus. This change allowed us to persist events to disk, reducing the load on our in-memory message bus and enabling us to cache frequently accessed events. However, I soon discovered that our event-driven design didn't provide sufficient guarantees on event ordering and delivery. For example, when user A and user B both received the same treasure location update but only user A acted on it, user B could still end up credited with rewards for an action they never performed. This led us to reevaluate our event model and introduce event versioning so that events were processed in the correct order.
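One common way to get ordering out of versioned events is to give each stream a monotonically increasing version number and have the consumer apply events only in that sequence. The sketch below illustrates that idea under assumptions of my own: `VersionedEvent` and `TreasureStream` are hypothetical names, versions start at 1, and none of this is Eventuate's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical event carrying a per-stream version (1, 2, 3, ...).
#[derive(Debug, Clone, PartialEq)]
struct VersionedEvent {
    version: u64,
    payload: String,
}

/// Consumer-side state for one stream, e.g. one treasure.
#[derive(Default)]
struct TreasureStream {
    last_applied: u64,
    // Events that arrived early, keyed by version.
    pending: BTreeMap<u64, VersionedEvent>,
}

impl TreasureStream {
    /// Accepts an event in any arrival order and returns the events that
    /// are now ready to apply, in strictly increasing version order.
    fn accept(&mut self, event: VersionedEvent) -> Vec<VersionedEvent> {
        // Already applied: treat as a duplicate delivery and ignore it.
        if event.version <= self.last_applied {
            return Vec::new();
        }
        // Buffer the event; keep the first copy if it is already buffered.
        self.pending.entry(event.version).or_insert(event);

        // Drain every event that is next in sequence.
        let mut ready = Vec::new();
        while let Some(next) = self.pending.remove(&(self.last_applied + 1)) {
            self.last_applied = next.version;
            ready.push(next);
        }
        ready
    }
}
```

If version 2 arrives before version 1, `accept` holds it back and returns nothing. Once version 1 arrives, both are released in order, and a later redelivery of either one is dropped. The cost is that a missing event stalls its stream, so a real system would also need a timeout or a replay path.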

What The Numbers Said After

After making these changes, message bus latency dropped from 1.2 seconds to 150 milliseconds. Our event store's throughput improved as well, allowing us to handle 10,000 events per second without breaking a sweat. In terms of system resource utilization, we observed a 30% reduction in memory usage and a 25% decrease in CPU utilization, which translated to cost savings and a more scalable system overall. However, the true test came during the event itself, where our system handled over 5,000 concurrent users without any noticeable degradation.

What I Would Do Differently

In hindsight, I would have introduced event versioning and event ordering guarantees from the outset. This would have saved us weeks of troubleshooting and allowed us to avoid the temporary fix of event caching. Additionally, I would have opted for a more robust event store that supported higher levels of concurrency and was specifically designed for the task of event storage. Eventuate worked for us, but it was a close call, and I've since learned the importance of choosing the right tool for the job.