The Myth of Low Latency: Why Event Meshes Make Your System Slow

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At Veltrix we had a simple monolithic service that handled everything: orders, products, inventory, and so on. During peak hours, certain pages saw high failure rates, 30-40% in extreme cases. We wanted to break the monolith apart and decouple the pieces with an event mesh to bring those failure rates down.

What We Tried First (And Why It Failed)

Our first implementation of an event mesh was built on top of Apache Kafka. We were excited because we had heard about its low latency and scalability. However, we quickly ran into trouble with our Kafka configuration, specifically the producer's max.in.flight.requests.per.connection setting and the topics' replication.factor. During peak hours, 40% of all requests on our e-commerce platform needed at least one retry. Those failures left hundreds of messages in the dead-letter queue, and our system would end up in an incorrect state.
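To make the interaction between those properties concrete, here is a minimal sketch of a config check. The function name and the example values are hypothetical, not our production settings. The rules it encodes are standard Kafka behaviour: without idempotence, retries with more than one in-flight request can reorder messages; idempotence allows at most 5 in-flight requests; and with acks=all, writes are rejected whenever fewer than min.insync.replicas replicas are in sync.

```python
def config_warnings(producer: dict, topic: dict) -> list:
    """Return warnings for a Kafka producer/topic config pair (illustrative only)."""
    warnings = []
    in_flight = producer["max.in.flight.requests.per.connection"]
    idempotent = producer["enable.idempotence"]

    # Retried batches can overtake later ones when several are in flight.
    if producer["retries"] > 0 and in_flight > 1 and not idempotent:
        warnings.append("retries may reorder messages")

    # The idempotent producer supports at most 5 in-flight requests.
    if idempotent and in_flight > 5:
        warnings.append("idempotence needs max.in.flight <= 5")

    # With acks=all, losing any one replica blocks writes if every replica
    # is required to be in sync.
    if (producer["acks"] == "all"
            and topic["min.insync.replicas"] >= topic["replication.factor"]):
        warnings.append("acks=all writes fail when any replica is out of sync")

    return warnings


# Hypothetical values chosen to trigger two of the three warnings.
risky_producer = {
    "acks": "all",
    "retries": 3,
    "enable.idempotence": False,
    "max.in.flight.requests.per.connection": 5,
}
risky_topic = {"replication.factor": 3, "min.insync.replicas": 3}

for warning in config_warnings(risky_producer, risky_topic):
    print(warning)
```

A check like this is cheap to run in CI against whatever config files you deploy, which is where I would put it.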

The Architecture Decision

We moved to RabbitMQ (which speaks AMQP 0-9-1) and implemented what we called a Request-Response event mesh. Each operation is a pair of events: a request event, and a response event that the caller waits for to confirm the request was processed. Because we used RabbitMQ's asynchronous publish/subscribe model, our code was a lot simpler than the Kafka version with its multiple threads and connection pools. This led to fewer threading issues and lower failure rates (2-5%). The cost was added latency, 20-30ms on average.
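The request/response pairing can be sketched in-process. This is not our RabbitMQ code: the queues below are standard-library stand-ins for broker queues, and every name is hypothetical. The idea it illustrates is the usual one, where each request carries a correlation id and the caller blocks on a private reply queue until a response with that id arrives (in AMQP 0-9-1 the corresponding message properties are reply_to and correlation_id).

```python
import queue
import threading
import uuid

request_queue = queue.Queue()   # stand-in for the broker's request queue
pending = {}                    # correlation id -> private reply queue
pending_lock = threading.Lock()


def worker():
    """Consume requests and publish a response tagged with the same id."""
    while True:
        message = request_queue.get()
        if message is None:      # sentinel: shut the worker down
            break
        correlation_id, payload = message
        with pending_lock:
            reply_queue = pending.get(correlation_id)
        if reply_queue is not None:   # caller may already have timed out
            reply_queue.put({"correlation_id": correlation_id,
                             "status": "processed",
                             "order": payload})


def call(payload, timeout_s=1.0):
    """Publish a request event and wait for the matching response event."""
    correlation_id = str(uuid.uuid4())
    reply_queue = queue.Queue(maxsize=1)
    with pending_lock:
        pending[correlation_id] = reply_queue
    try:
        request_queue.put((correlation_id, payload))
        # Raises queue.Empty if no response arrives within the timeout.
        return reply_queue.get(timeout=timeout_s)
    finally:
        with pending_lock:
            pending.pop(correlation_id, None)


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()
response = call({"order_id": 42})
print(response["status"])
request_queue.put(None)
worker_thread.join()
```

The blocking `get` in `call` is where the extra latency comes from: the caller pays for a full round trip through the broker before it can continue.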

What The Numbers Said After

After shifting to the Request-Response event mesh, we measured a 30-50% increase in request latency (the request.duration metric in New Relic), but also a 70% decrease in failed requests. Our dead_letter_queue was almost empty, and the max_retries metric dropped significantly (from 40 requests to 5 requests on average). However, as a direct consequence of this design, I had to increase our request timeout to match the new latency. That set off a cascade: the timeout then had to be increased even further to account for the high latency of our cache requests (80ms on average for a cache GET).
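The timeout cascade is easier to see as arithmetic. The sketch below is a back-of-the-envelope budget, not how we actually tuned anything: the 30ms and 80ms figures are the averages quoted above, while the function name and the 2x headroom multiplier are assumptions for illustration.

```python
def timeout_budget_ms(hop_latencies_ms: dict, headroom: float = 2.0) -> float:
    """Sum the expected latency of each sequential hop, then add headroom."""
    return sum(hop_latencies_ms.values()) * headroom


# Upper end of the mesh overhead and the average cache GET from this post.
hops = {"mesh_round_trip": 30, "cache_get": 80}

budget = timeout_budget_ms(hops)
print(budget)  # 220.0
```

Every hop you add to the dict raises the budget for every caller upstream of it, which is exactly the cascade we hit. Averages also understate tail latency, so a real budget would be better built from p99 figures.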

What I Would Do Differently

If I had to go back, I would probably use a mix of the two systems we tried: Kafka for event routing and RabbitMQ for request-response. In RabbitMQ, I would set the delivery_mode message property to persistent (2). In Kafka, I would publish events with acks=all (acks=2 is not a valid value in current Kafka clients; the options are 0, 1, and all). I would expect that combination to give us a low-latency event mesh for our e-commerce platform with low failure rates (less than 1%), though we have not measured it.
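As a rough sketch of the durability settings for that hybrid setup: the dict layout and the validate function are hypothetical, but the allowed values are the real ones. Kafka producers accept acks of 0, 1, or all (-1 is an alias for all), and AMQP 0-9-1 defines delivery_mode 1 as transient and 2 as persistent.

```python
VALID_KAFKA_ACKS = {"0", "1", "all", "-1"}
AMQP_TRANSIENT = 1
AMQP_PERSISTENT = 2

# Hypothetical settings for the hybrid design described above.
kafka_producer_settings = {"acks": "all"}
rabbitmq_message_properties = {"delivery_mode": AMQP_PERSISTENT}


def validate(kafka: dict, amqp: dict) -> list:
    """Return a list of problems with the durability settings (empty if fine)."""
    problems = []
    if str(kafka["acks"]) not in VALID_KAFKA_ACKS:
        problems.append("invalid Kafka acks: %s" % kafka["acks"])
    if amqp["delivery_mode"] not in (AMQP_TRANSIENT, AMQP_PERSISTENT):
        problems.append("invalid AMQP delivery_mode: %s" % amqp["delivery_mode"])
    return problems


print(validate(kafka_producer_settings, rabbitmq_message_properties))
```

One caveat on the RabbitMQ side: a persistent message only survives a broker restart if the queue it sits in is declared durable as well.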