Default Config is Not a Design Decision

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

After deploying three instances of Veltrix, we noticed that the event queues consistently ran out of memory, resulting in the dreaded "Dead Letter Queue" error. In total, 12.7% of all received events failed to process due to memory constraints. Our application logs were filled with the same line: java.lang.OutOfMemoryError: Java heap space. The ops team was getting paged every hour and could only respond by restarting the affected instance. Clearly, our default configuration was not suitable for production.

What We Tried First (And Why It Failed)

Initially, we increased the JVM heap size, thinking this would solve the memory issues. We set the -Xmx parameter to 2048 MB, believing this would give us enough headroom to process the events. However, this only shifted the problem to disk: the event store started to consume an excessive amount of space. Disk usage climbed to 85% on average, and the event store began throttling events. Simply increasing the heap size was not a viable solution.
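For reference, a 2048 MB limit is usually passed to the JVM as -Xmx2048m. A minimal sketch for confirming that the flag actually reached the process is shown here; the class name HeapCheck and the helper toMiB are made up for illustration and are not part of Veltrix.

```java
class HeapCheck {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap the JVM will attempt to use.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxBytes) + " MiB");
    }
}
```

Running it with java -Xmx2048m HeapCheck.java should print a value close to 2048. Depending on the garbage collector, the reported number can be slightly lower than the flag.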

The Architecture Decision

We took a step back and re-examined the event configuration. We moved to a structured approach based on version 6.2.1 of the Veltrix configuration, which introduces a concept called "event partitioning". This feature let us split events into smaller chunks, each with its own dedicated queue and processor. We set the partition size to 100 events and split the event stream across 5 partitions. This not only reduced memory usage but also improved event processing throughput by 33%.
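To make the idea concrete, here is a rough sketch of partitioning in plain Java. It is not the Veltrix API: the class PartitionedQueues, its methods, and the event keys are all hypothetical, and it assumes that "partition size" means a bounded capacity of 100 events per partition. The point it illustrates is that each partition has a hard upper bound, so a burst of events fills a queue rather than the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PartitionedQueues {
    private final List<BlockingQueue<String>> partitions = new ArrayList<>();

    PartitionedQueues(int partitionCount, int capacity) {
        // One bounded queue per partition (assumed meaning of "partition size").
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayBlockingQueue<>(capacity));
        }
    }

    // Maps a key to a stable partition index; floorMod keeps it non-negative.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Returns false when the target partition is full instead of growing without bound.
    boolean offer(String key, String event) {
        return partitions.get(partitionFor(key)).offer(event);
    }

    int size(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        PartitionedQueues queues = new PartitionedQueues(5, 100);
        int rejected = 0;
        for (int i = 0; i < 1000; i++) {
            // No consumers run in this sketch, so full partitions reject events.
            if (!queues.offer("order-" + i, "event-" + i)) {
                rejected++;
            }
        }
        for (int p = 0; p < 5; p++) {
            System.out.println("partition " + p + " holds " + queues.size(p) + " events");
        }
        System.out.println("rejected: " + rejected);
    }
}
```

In a real deployment each partition would have its own processor draining the queue, and the caller would decide what a rejected offer means: block, retry, or route the event to the Dead Letter Queue.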

What The Numbers Said After

After we implemented event partitioning, memory usage stabilized and the Dead Letter Queue error rate dropped from 12.7% to 1.2%. Disk usage decreased from 85% to 35%, and the event store throttled events only occasionally. We reduced the JVM heap size from 2048 MB to 768 MB and were still able to process 25% more events per second. Event latency dropped to 50ms, and overall system throughput increased by 42%.

What I Would Do Differently

Looking back, I would have taken a more structured approach from the beginning rather than relying on the default configuration. The key takeaway is that configuration decisions should not be left to chance or assumption; they should be guided by a thorough understanding of the system's requirements and constraints. I would also have monitored system metrics and events closely from the start instead of waiting for the system to fail. That would have let us catch and address the issues sooner and avoid the costly, time-consuming cycle of debugging and redeploying.
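As one illustration of what monitoring from the start could look like, the sketch below polls heap usage through the standard java.lang.management API and warns before the JVM reaches an OutOfMemoryError. The class name HeapWatch, the 80% threshold, and the one-second interval are assumptions chosen for the example, not values from our setup.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HeapWatch {
    // True when used heap exceeds the given fraction of max.
    // getMax() can be -1 (undefined), in which case this returns false.
    static boolean overThreshold(long used, long max, double fraction) {
        return max > 0 && used > (long) (max * fraction);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            if (overThreshold(heap.getUsed(), heap.getMax(), 0.80)) {
                System.err.println("WARN heap above 80%: " + heap);
            }
        }, 0, 1, TimeUnit.SECONDS);

        // Let the sketch run for a few polls, then stop the scheduler.
        TimeUnit.SECONDS.sleep(3);
        scheduler.shutdownNow();
    }
}
```

In practice the warning would feed an alerting system rather than stderr, so the ops team hears about rising heap pressure before events start failing.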