The Pitfalls of Premature Optimisation in Event-Driven Systems

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The real issue wasn't the load itself but rather the lack of visibility and control our operators had over our microservices architecture. As our load increased, so did the number of errors our monitoring system, Prometheus, reported. The usual suspects - deadlocks, timeouts, and resource exhaustion - were clogging our logs, making it nearly impossible for our operators to pinpoint the root cause of our slowdown.

Specifically, our operators were struggling with the following issues:

Our users were experiencing inconsistent results when querying the Treasure Hunt Engine.
Our metrics were skewed because the microservices handling Treasure Hunt Engine requests reported them late.
We started receiving errors such as java.net.SocketTimeoutException: Read timed out and com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure across various services, indicating network connectivity and database issues.

What We Tried First (And Why It Failed)

We thought the solution lay in scaling our database, upgrading our caching layer, and adding more servers to our microservices architecture. Sounds reasonable, right?

We took the following steps:

Our first attempt was to add 8 read replicas to our MySQL master node to improve read performance. However, due to the high volume of queries, the replicas started experiencing write contention, which made the problem worse.
Next, we upgraded our Redis cache to a cluster setup to alleviate the load on our application servers. Unfortunately, this exposed internal inconsistencies in our application code, resulting in users experiencing inconsistent results when accessing the Treasure Hunt Engine.
Lastly, we added more servers to our microservices architecture, hoping to distribute the load more evenly. Unfortunately, our containers were not provisioned with enough disk space. As disk usage climbed, containers began restarting, and the cascading effect slowed the system down even further.

The Architecture Decision

It was then that we realized our true problem wasn't scaling our infrastructure but rather optimizing the communication between our microservices. The root cause was our inconsistent and synchronous communication pattern between services, resulting in a bottleneck when handling Treasure Hunt Engine requests.

To address the problem, we introduced a message broker (RabbitMQ) to handle asynchronous communication between our services. This design change allowed us to decouple our services, ensuring that each service could operate independently, making it easier for our operators to diagnose and troubleshoot issues.
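The decoupling described above can be sketched without a real broker. The example below is a minimal illustration, not our production code and not the RabbitMQ client API: a java.util.concurrent.BlockingQueue stands in for the broker queue, and the names AsyncHandoffSketch, HuntRequest, and run are hypothetical. It only shows the shape of the handoff, where the publisher enqueues a request and moves on while a consumer drains the queue at its own pace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-memory stand-in for a broker queue. This is an assumption made for
// illustration; it does NOT use the RabbitMQ client library.
public class AsyncHandoffSketch {

    // Hypothetical message type for a Treasure Hunt Engine request.
    static final class HuntRequest {
        final int id;
        final String query;

        HuntRequest(int id, String query) {
            this.id = id;
            this.query = query;
        }
    }

    // Sentinel message that tells the consumer to stop.
    private static final HuntRequest POISON = new HuntRequest(-1, "");

    // Publishes `count` requests, consumes them on a separate thread,
    // and returns the ids in the order the consumer handled them.
    public static List<Integer> run(int count) {
        BlockingQueue<HuntRequest> queue = new LinkedBlockingQueue<>();
        List<Integer> handled = new ArrayList<>();

        // Consumer: works through the queue independently of the publisher.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    HuntRequest request = queue.take();
                    if (request == POISON) {
                        break;
                    }
                    handled.add(request.id);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            // Publisher: enqueues and continues; it never waits for a reply.
            for (int i = 0; i < count; i++) {
                queue.put(new HuntRequest(i, "clue-" + i));
            }
            queue.put(POISON);
            // join() makes the consumer's writes to `handled` visible here.
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while publishing", e);
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(run(3)); // prints [0, 1, 2]
    }
}
```

With a real broker the queue lives outside both processes, so a slow or restarting consumer delays processing instead of making the publisher's call time out.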

What The Numbers Said After

After implementing the message broker, we noticed a significant improvement in our metrics collection. Using Grafana, we observed a 20% drop in query latency and a 25% reduction in overall server load. Furthermore, the error rate fell by 50%, and the java.net.SocketTimeoutException and com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure errors were replaced by java.io.IOException: Connection closed by server exceptions, indicating a more controlled shutdown of connections.

What I Would Do Differently

If I had to do it all over again, I'd focus on identifying the problem earlier, long before our server load grew to its current scale. My strategy would be to trace our service interactions with a tool like Jaeger or Zipkin, find the communication bottlenecks, and optimize the interaction patterns between services before touching the infrastructure.
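To make the tracing idea concrete, here is a small hand-rolled sketch of what such tools let you do: record how long each service-to-service hop takes and see which one dominates. It does not use the Jaeger or Zipkin APIs, and the class name, hop names, and durations are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-rolled illustration of per-hop span aggregation.
// This is NOT the Jaeger or Zipkin API; every name here is hypothetical.
public class HopTimingSketch {

    // Accumulated duration per hop, in milliseconds.
    private final Map<String, Long> totalMillisByHop = new LinkedHashMap<>();

    // Adds one observed span duration for a service-to-service hop.
    public void addSpan(String hop, long millis) {
        totalMillisByHop.merge(hop, millis, Long::sum);
    }

    // Returns the hop with the largest accumulated time, or null if
    // no spans have been recorded yet.
    public String slowestHop() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : totalMillisByHop.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    // Builds a tracker from made-up sample spans (not measured data).
    public static HopTimingSketch sample() {
        HopTimingSketch tracker = new HopTimingSketch();
        tracker.addSpan("gateway->engine", 40);
        tracker.addSpan("engine->inventory", 310);
        tracker.addSpan("engine->mysql", 120);
        tracker.addSpan("engine->inventory", 290);
        return tracker;
    }

    public static void main(String[] args) {
        System.out.println(sample().slowestHop()); // prints engine->inventory
    }
}
```

A real tracer collects these spans automatically and ties them to individual requests, but even this crude per-hop total would have pointed us at the communication pattern rather than the database.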