Treacherous Defaults: Why Our Event Engine Blew Up (Literally) and How We Got It Right

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We had 300,000 IoT devices sending events at a rate of 10,000 per second to an Apache Kafka cluster. Our initial goal was to process these events in real-time, feeding insights to our customers. Sounds straightforward, but things got complicated when our default Kafka producer configuration was set to "auto-create" topics. Essentially, this meant that every time a new device connected, a new topic was spawned, and the producer was rebalanced. It worked out great for the first few days, but we soon realized that our topics were multiplying like rabbits, and we were experiencing an OOM (out-of-memory) error on every machine in our cluster. Our Kafka producer was effectively becoming more complex than we had ever intended.

What We Tried First (And Why It Failed)

We initially tried to mitigate this by setting a maxPartitionRequests limit and a bootstrap.servers list. Our team thought this would help but soon realized that we were dealing with tens of thousands of partitions, which required constant rebalancing. This created a bottleneck due to the high overhead of metadata operations. Essentially, our Kafka cluster became a bottleneck, causing messages to back up and eventually fail processing entirely. To be honest, we were lucky the system didn't just completely crash; a reboot would have taken down the entire cluster with it. We also encountered issues with Kafka's default acks=all configuration, causing high latency due to waiting for all replicas to store committed messages. Our team was baffled, as the solution seemed simple but backfired in production.

The Architecture Decision

We took a step back, re-evaluated our problem, and made a key architectural decision. We chose to use Apache BookKeeper to manage partitions, leveraging its built-in support for high-performance replication. We also set up a custom topic naming convention to avoid the "topic explosion" problem. We also decided to use Kafka Streams to simplify the data processing pipeline, replacing our earlier Kafka Producers with Kafka Streams. Our Kafka producer configuration was updated to set acks=1, allowing our system to write data to a single broker, reducing latency. We set buffer.memory to 1048576 bytes, reducing the high memory usage that was happening. Our team was worried about how this would impact system availability, but we realized that our system couldn't continue with the current configuration. It was either fix it or we risk taking down the entire system.

What The Numbers Said After

After implementing these changes, our Kafka cluster's performance was restored to normal. We reduced partitions from 50,000 to 500, by grouping IoT devices into batches. Our topics were now manageable and predictable. We also experienced a reduction of memory usage from 80 GB to 50 GB, freeing up resources and reducing operational costs. The system was performing in real-time now, without issues or bottlenecks. Additionally, we also noticed that our processing times for events reduced from 150 ms to 30 ms, and our system handled over 50000 concurrent requests, with no issues.

What I Would Do Differently

If I were to do it again, I'd consider using Apache Pulsar from the start. Pulsar boasts better performance and handles issues like topic explosion and partition management more effectively, eliminating the need for BookKeeper or a custom topic naming convention. We'd also implement a rate limiter to throttle incoming requests, preventing the system from being overwhelmed in the first place. Lastly, we'd implement a robust monitoring and logging setup to detect performance issues early and prevent the system from becoming unstable. By taking these lessons to heart, I believe we can create better systems that can handle high traffic and changing requirements without blowing up.