The Dark Side of Pub/Sub: Why Event-Oriented Architectures Are Poisoning Our Systems

#webdev #programming #rust #performance

The Problem We Were Actually Solving

When our team first started working on the event-driven system, it quickly became apparent that we were solving the wrong problem. We were focused on building a perfectly scalable, fault-tolerant system that could absorb a massive influx of events, when our main concern should have been the user experience. Our players needed to receive notifications on time, and the system needed to adapt to changing circumstances such as service outages or high load. We were looking at a complex problem through the wrong lens, and it was going to cost us dearly.

What We Tried First (And Why It Failed)

Initially, we decided to use a message broker, RabbitMQ, to handle the events. We set up a queue system with multiple producers and consumers, thinking this would guarantee high availability. We soon realized that the sheer volume of events was turning the broker into a bottleneck. The system consistently logged high latency spikes, and our players received delayed or duplicate notifications. We were trying to fix scalability without addressing the root cause: our consumers did not handle events properly.
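To make the duplicate-notification problem concrete, here is a minimal sketch of an idempotent consumer. It assumes the duplicates come from redelivery, which is normal for a broker that offers at-least-once delivery, and that every event carries a unique ID. The `Notification` type, its fields, and `IdempotentConsumer` are hypothetical names for illustration, not code from our system.

```rust
use std::collections::HashSet;

/// Hypothetical notification event; the field names are illustrative.
#[derive(Debug, Clone)]
struct Notification {
    event_id: u64,
    player_id: u32,
    body: String,
}

/// Consumer that remembers which event IDs it has already handled,
/// so a redelivered event is ignored instead of notifying the player twice.
#[derive(Default)]
struct IdempotentConsumer {
    seen: HashSet<u64>,
    delivered: Vec<Notification>,
}

impl IdempotentConsumer {
    /// Returns true if the event was processed, false if it was a duplicate.
    fn handle(&mut self, n: Notification) -> bool {
        // HashSet::insert returns false when the value was already present.
        if !self.seen.insert(n.event_id) {
            return false;
        }
        // Stand-in for actually sending the notification.
        self.delivered.push(n);
        true
    }
}
```

Calling `handle` twice with the same `event_id` delivers the notification only once. The in-memory `HashSet` grows without bound, so a real consumer would likely need a bounded or persistent record of seen IDs.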

The Architecture Decision

At this point, we took a step back and reevaluated our architecture. We decided to switch to a more structured approach built around event sourcing and a centralized event store. We used Axum as our web framework for the service layer. We also introduced event processing hubs, which decoupled event producers from consumers and ensured that events were processed in a predictable, guaranteed way. This change gave us the scalability and reliability we had been aiming for from the start.
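The combination of an event store and a hub can be sketched in a few lines using only the standard library. This is a simplified, single-process illustration under stated assumptions: the store is an in-memory append-only `Vec`, the hub fans events out over `std::sync::mpsc` channels, and `Event`, `EventHub`, `publish`, `subscribe`, and `replay_from` are hypothetical names rather than our production API.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical event record; `seq` is its position in the log.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    seq: u64,
    kind: String,
    payload: String,
}

/// A hub that appends every event to a log, then fans it out to subscribers.
/// Producers only call `publish`; they never see the consumers.
#[derive(Default)]
struct EventHub {
    log: Vec<Event>, // append-only event store (in memory for this sketch)
    subscribers: Vec<Sender<Event>>,
}

impl EventHub {
    /// Registers a consumer and returns the channel it reads events from.
    fn subscribe(&mut self) -> Receiver<Event> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Stores the event first, then delivers it. Returns its sequence number.
    fn publish(&mut self, kind: &str, payload: &str) -> u64 {
        let seq = self.log.len() as u64;
        let event = Event {
            seq,
            kind: kind.to_string(),
            payload: payload.to_string(),
        };
        self.log.push(event.clone());
        // Drop subscribers whose receiving end has gone away.
        self.subscribers.retain(|tx| tx.send(event.clone()).is_ok());
        seq
    }

    /// Lets a consumer that was offline catch up from a known position.
    fn replay_from(&self, seq: u64) -> &[Event] {
        let start = (seq as usize).min(self.log.len());
        &self.log[start..]
    }
}
```

Because the log is written before delivery, a consumer that misses events can call `replay_from` with the last sequence number it processed. That is the property that makes processing order predictable in this sketch.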

What The Numbers Said After

After implementing the new architecture, we ran benchmark tests to gauge the performance of our system. We measured event processing latency, message throughput, and overall system responsiveness. The results were astonishing: our system could now handle a significantly higher volume of events without breaking a sweat. Average event processing latency dropped from 500ms to under 20ms, and message throughput increased by 500%. Our players were now receiving timely and accurate notifications, and our system was more resilient to failures.
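Percentages and ratios are easy to misread, so here is the arithmetic behind those two figures as a small sketch. The helper names are hypothetical. The only inputs are the numbers quoted above, with 20ms treated as the upper bound of "under 20ms".

```rust
/// How many times faster `after` is than `before` (same unit for both).
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Percentage by which latency fell going from `before` to `after`.
fn percent_reduction(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) * 100.0 / before_ms
}

/// Converts "increased by N%" into a multiplier on the original value.
fn factor_from_percent_increase(percent: f64) -> f64 {
    1.0 + percent / 100.0
}
```

With the figures above, `speedup(500.0, 20.0)` gives 25, so latency improved at least 25-fold, a reduction of at least 96%. An increase of 500% means `factor_from_percent_increase(500.0)`, which is 6, so throughput ended up at six times its original level, not five.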

What I Would Do Differently

In hindsight, I would have taken a more focused approach from the beginning. I would have prioritized the user experience and event handling over scalability and high availability. I would have also invested more time in understanding the nuances of event-driven systems and how to implement them correctly, rather than relying on generic architectural patterns. This experience taught me that sometimes the most complex problems require the simplest solutions, and that getting the underlying architecture right is crucial to delivering a great user experience.