Opting for Chaos: The Cost of Premature Optimisation in Event Configuration

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

It was 2018, and our team at Veltrix was building a high-scale, event-driven backend for a real-time treasure hunt game. The service sent players to real-world locations, and each completed visit triggered a virtual reward. We were racing towards a launch deadline with an ambitious feature set and a large user acquisition plan. Our primary goal was a seamless experience that kept working as the user base grew.

What We Tried First (And Why It Failed)

Initially, we focused on a heavily customised event-driven architecture that we thought would give us the best possible performance. We spent countless hours on a highly configurable event bus, a task queue, and a series of loosely coupled microservices. We chose Apache Kafka as the event bus, RabbitMQ for asynchronous task processing, and Flask (Python) as our microservices framework. The configuration was complex, with multiple interdependent components and a deep rabbit hole of tuning options. We expected the result to be robust, scalable, and flexible.

However, the more we configured, the more complexity crept into the system. The event bus and the task queue became points of contention and caused performance problems, and our microservices struggled to communicate reliably. The system was fragile and error-prone, producing messages such as "TimeoutException: Task timeout: 3600 seconds" and "Connection refused: Connection timed out". Issues were hard to diagnose and debug, and the system was on the verge of collapse under even moderate load.

The Architecture Decision

After months of firefighting and debugging, we made a critical decision: we fell back to the simplest configuration that could work. We moved to a single, centralised event pipeline using Amazon Simple Notification Service (SNS) for event publishing and Amazon SQS for task processing. We kept the service composition minimal, using AWS Lambda and API Gateway. We discarded our hand-tuned event bus and task queue setup and focused on building scalable microservices on serverless technologies.
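As a rough illustration of what publishing an event looks like in this kind of setup, here is a minimal sketch in Python. It is not our production code: the function name `publish_reward_event`, the payload fields `player_id` and `location_id`, and the `event_type` attribute value are all hypothetical. The only real API it relies on is the `publish` call of a boto3 SNS client, which is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import json
from typing import Any


def publish_reward_event(
    sns_client: Any, topic_arn: str, player_id: str, location_id: str
) -> str:
    """Publish one 'location completed' event and return the SNS message ID.

    `sns_client` is expected to behave like boto3.client("sns").
    The payload shape is an assumption made for this example.
    """
    payload = {"player_id": player_id, "location_id": location_id}
    response = sns_client.publish(
        TopicArn=topic_arn,
        Message=json.dumps(payload),
        # Message attributes let subscribers filter without parsing the body.
        MessageAttributes={
            "event_type": {
                "DataType": "String",
                "StringValue": "location_completed",  # hypothetical event name
            },
        },
    )
    return response["MessageId"]
```

In a deployed service the client would typically be created once with `boto3.client("sns")` and reused across calls. Note how little there is to configure here compared with a self-managed broker: no partitions, no exchange bindings, no connection pool tuning.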

This decision was driven less by a taste for simplicity than by necessity: we had to contain the complexity of our earlier design. We no longer had to build and maintain highly configurable event buses and task queues ourselves. The new design was centred on the capabilities of managed AWS services, and we could rely on their built-in performance and scalability.
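To make "relying on the managed service" concrete, here is a hedged sketch of a Lambda handler that consumes an SQS queue subscribed to an SNS topic. It is an illustration under assumptions, not our actual handler: `grant_reward` and the payload fields are hypothetical, and the `batchItemFailures` return value only has an effect if the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled. The point is that retries are delegated to SQS redelivery instead of custom retry code.

```python
import json
from typing import Any, Callable


def grant_reward(payload: dict) -> None:
    """Hypothetical business logic; raises if the event is malformed."""
    if "player_id" not in payload or "location_id" not in payload:
        raise ValueError("missing player_id or location_id")
    # A real implementation would persist the reward here.


def handler(
    event: dict, context: Any, process: Callable[[dict], None] = grant_reward
) -> dict:
    """Process a batch of SQS records and report only the failed ones.

    Failed message IDs are returned so SQS redelivers just those messages.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            envelope = json.loads(record["body"])
            # Without raw message delivery, SNS wraps the payload in an
            # envelope whose "Message" field holds the original JSON string.
            if "Message" in envelope:
                payload = json.loads(envelope["Message"])
            else:
                payload = envelope
            process(payload)
        except Exception:
            # Let SQS retry this message (and eventually dead-letter it,
            # if a dead-letter queue is configured).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

The `process` parameter exists only to make the sketch easy to test locally; Lambda itself calls the handler with `event` and `context`.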

What The Numbers Said After

After implementing our new architecture, we achieved a 5x reduction in error rates, a 3x increase in system throughput, and a 90% decrease in mean time to recover (MTTR) from failures. Our system was now capable of handling spikes in traffic, and the latency remained consistent even under high loads.

Our metrics showed a significant improvement in event processing latency, with an average latency of 150ms compared to 500ms previously. Our event delivery success rate increased from 70% to 95%, and our task processing pipeline was able to handle 10x the number of tasks with the same resources.
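For readers who want to sanity-check these figures, the short calculation below derives the relative improvements from the latency and delivery numbers quoted above. It uses only those four values; the variable names are ours, and the "failure rate" is simply 100 minus the delivery success rate, which is an assumption about how the two metrics relate.

```python
# Figures quoted in the text.
latency_before_ms, latency_after_ms = 500, 150
success_before_pct, success_after_pct = 70, 95

# Latency: speed-up factor and percentage reduction.
latency_speedup = latency_before_ms / latency_after_ms
latency_cut_pct = 100 * (latency_before_ms - latency_after_ms) / latency_before_ms

# Delivery: compare failure rates rather than success rates.
failure_before_pct = 100 - success_before_pct
failure_after_pct = 100 - success_after_pct
failure_reduction = failure_before_pct / failure_after_pct

print(f"Latency: {latency_speedup:.1f}x faster ({latency_cut_pct:.0f}% lower)")
print(f"Delivery failures: {failure_before_pct}% -> {failure_after_pct}% "
      f"({failure_reduction:.0f}x fewer)")
```

So the latency drop is a 70% reduction (roughly 3.3x faster), and delivery failures fell from 30% to 5%, a 6x reduction. That is in the same range as the overall error-rate improvement, although the two figures may well measure different things.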

What I Would Do Differently

In hindsight, I would not dive headfirst into building and tuning our own event bus and task queue setup. Our initial design was appealing in theory, but it was overly complex and opaque. We should have started with proven, managed AWS services for the event-driven plumbing, and the team should have focused from the outset on building scalable, event-driven microservices on serverless technologies.

This story is a reminder that simplicity and robustness often come from embracing the capabilities of well-designed commercial products rather than trying to optimise every detail ourselves. In our case, premature optimisation amounted to opting for chaos, and it nearly doomed the project. We eventually learned that accepting the defaults is often the right decision, especially for complex event-driven architectures.