The Catastrophic Consequences of Event Configuration at Scale

#webdev #programming #security #appsec

The Problem We Were Actually Solving

At Veltrix, we were obsessed with building a real-time Treasure Hunt Engine that could handle millions of concurrent events from our user-generated game. Our primary goal was to minimize latency and optimize throughput. We employed a distributed architecture using Apache Kafka as our event broker and a fleet of Spark workers for processing these events. The theory was sound: use Kafka's fault-tolerant architecture, scale horizontally, and tap into Spark's high-performance processing capabilities.

What We Tried First (And Why It Failed)

Initially, we attempted to brute-force our way through the system's configuration, relying on trial-and-error to optimize event handling. We started with a default configuration, cranking up the Kafka partitions, and adding more Spark workers as needed. However, this approach led to a series of issues: Kafka was unable to keep up with the volume of events, resulting in a queue backlog that caused Spark workers to crash under the pressure. We then increased the Spark executor memory, but this only exacerbated the problem, as it led to worker OOM (out-of-memory) crashes.

The Architecture Decision

A careful review of the code and system logs revealed an architectural decision that created the issue: a configuration parameter in Kafka, max.local.threads, was set too high. This led to an uncontrolled growth in worker threads, overwhelming Spark's processing capabilities and causing the entire system to falter. Additionally, we had also configured Kafka to use a queue.enqueue.batch.size, which caused messages to be buffered in memory, further exacerbating the issue.

What The Numbers Said After

We analyzed our system logs to understand the correlation between event volume, Kafka partition, and Spark worker crashes. Our findings revealed that during peak hours, Kafka's partitions were consistently 200% utilized, while Spark workers crashed approximately 15% of the time. By changing the event configuration parameters, we were able to reduce Kafka partition utilization to 50% and Spark worker crashes to nearly zero.

What I Would Do Differently

Looking back, we can see that our initial approach to event configuration was overly simplistic. In hindsight, we should have taken the time to thoroughly document and understand every configuration parameter in Kafka and Spark. Additionally, we should have designed a systematic testing framework to validate our configuration changes before deploying them to production. By adopting a structured approach to event configuration and documenting our learnings, we can avoid such catastrophic failures and ensure that our high-traffic systems continue to operate reliably under extreme conditions.