The Problem We Were Actually Solving
We'd built the Treasure Hunt Engine, a system that rewarded users for completing micro-tasks like checking into a store or scanning a QR code, to process 5 million events per minute. The engine was supposed to aggregate these events into leaderboards in real time, using a Kafka Streams topology that joined user actions with reward rules.
The problem wasn't scale: we'd tested the topology at 10x load in staging and it held up. The problem was latency variance. During peak hours, events from certain regions would take 10x longer to appear in the leaderboard than others. The variance wasn't random: it correlated with the physical location of the Kafka brokers. We were using AWS MSK in ap-southeast-1 for Asia and eu-west-1 for Europe, with cross-region replication via MirrorMaker 2.0.
But the real issue wasn't the network. It was operator error.
What We Tried First (And Why It Failed)
Our first fix was to increase the Kafka Streams num.stream.threads from 4 to 8 on each pod. The logic was simple: more threads meant more parallelism, and parallelism meant faster processing. The change worked in staging, so we rolled it out in production.
At 2AM, the CPU usage on the Streams pods spiked from 35% to 95%, and the JVM started spending 60% of its time in full GC. The latency improved slightly, but the pods became so busy that they couldn't keep up with the changelog topics, and the state stores started falling behind.
Next, we tried tuning the commit.interval.ms from 30 seconds to 5 seconds. The idea was to reduce the commit lag, making the state stores more current. But this increased the write load on the changelog topics, which were already replicated across regions. The result? Higher disk I/O on the brokers, and the replication lag between regions grew from 500ms to 3 seconds. The customer in Tokyo saw their events delayed even more.
Finally, we tried resharding the input topics. We increased the partition count from 128 to 256, thinking more partitions would mean better distribution. But the partition assignment in Kafka Streams gave contiguous partitions to the same thread, which meant a single thread could be processing 16 partitions at once, way more than it could handle. The Streams app fell behind by 30 minutes.
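The effect of doubling partitions without adding threads is easy to check on paper. Kafka Streams creates one task per input partition, so a fixed thread pool simply ends up owning more tasks per thread. Here is a minimal sketch of that arithmetic; the instance count of 2 is an assumption picked for illustration, since the post does not say how many pods were running.

```python
import math


def max_tasks_per_thread(partitions: int, instances: int, threads_per_instance: int) -> int:
    """Upper bound on tasks per stream thread when tasks are spread evenly.

    Kafka Streams creates one task per input partition, so the number of
    partitions is the number of tasks that must be shared out.
    """
    total_threads = instances * threads_per_instance
    return math.ceil(partitions / total_threads)


# Hypothetical fleet: 2 instances with 8 stream threads each.
before = max_tasks_per_thread(partitions=128, instances=2, threads_per_instance=8)
after = max_tasks_per_thread(partitions=256, instances=2, threads_per_instance=8)
print(before, after)  # 8 16
```

More partitions only help if the thread count (or the instance count) grows with them. Otherwise each thread just gets a longer queue.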
The Architecture Decision
At 6AM, after four failed hotfixes and two cups of cold coffee, we admitted we were optimizing the wrong thing. The real problem wasn't the Streams app. It was the event routing.
We had designed the system with a single input topic, user-actions, and a single output topic, leaderboard-updates. But different types of events had different latency requirements. A checkin event needed to appear on the leaderboard in under 100ms, while a daily-bonus event could tolerate 5 seconds. We were treating all events the same, and Kafka Streams was processing them in FIFO order, no matter the priority.
So we threw out the single Streams topology and rebuilt the pipeline as three separate flows, one per priority tier:
We split the user-actions topic into three: user-actions-p0 for high-priority events like checkins, user-actions-p1 for medium-priority events like task completion, and user-actions-p2 for low-priority events like daily bonuses. Each topic had a dedicated Streams app with its own thread pool and commit interval.
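Routing at ingestion can be as small as a lookup from event type to topic. The sketch below is illustrative only: the topic names come from the split described above, but the `type` field, the `task-completion` label, and the fallback to the lowest-priority topic are assumptions, not details of the original system.

```python
# Map event types to priority topics. "checkin" and "daily-bonus" are the
# event types named in the post; "task-completion" is a hypothetical label.
PRIORITY_TOPICS = {
    "checkin": "user-actions-p0",
    "task-completion": "user-actions-p1",
    "daily-bonus": "user-actions-p2",
}

# Assumption: unknown event types go to the lowest-priority topic so they
# can never crowd out latency-sensitive traffic.
DEFAULT_TOPIC = "user-actions-p2"


def topic_for(event: dict) -> str:
    """Pick the destination topic for an event based on its type."""
    return PRIORITY_TOPICS.get(event.get("type"), DEFAULT_TOPIC)


print(topic_for({"type": "checkin", "user": "u1"}))      # user-actions-p0
print(topic_for({"type": "daily-bonus", "user": "u1"}))  # user-actions-p2
```

A producer would call `topic_for(event)` before sending, so the priority decision is made once, before any event reaches a Streams app.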
The user-actions-p0 app used 8 threads and a 1-second commit interval, while the user-actions-p2 app used 2 threads and a 10-second commit interval. We set the retention for user-actions-p0 to 1 hour, since we didn't need to replay stale events, and for user-actions-p2 to 7 days, in case we needed to debug a bonus miscalculation.
Most importantly, we moved the Streams apps closer to the brokers. Instead of running them in Kubernetes across regions, we deployed them as EC2 instances in the same AZ as the brokers, connected via private VPC endpoints. We also switched from MirrorMaker 2.0 to Kafka's built-in rack awareness, so partitions were distributed across AZs, not regions.
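Written down as data, the per-tier settings look roughly like this. The keys `num.stream.threads`, `commit.interval.ms`, and `retention.ms` are the real Kafka Streams and topic config names. The `TierConfig` class and the `application.id` naming scheme are hypothetical, and the p1 tier is left out because its values are not given here. Kafka Streams itself runs on the JVM, so in practice these pairs would end up in a `Properties` object.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


@dataclass(frozen=True)
class TierConfig:
    """Hypothetical holder for one priority tier's settings."""
    topic: str
    stream_threads: int
    commit_interval_ms: int
    retention_ms: int

    def streams_properties(self) -> dict:
        # Kafka Streams application-level configs.
        return {
            "application.id": f"leaderboard-{self.topic}",  # assumed naming scheme
            "num.stream.threads": str(self.stream_threads),
            "commit.interval.ms": str(self.commit_interval_ms),
        }

    def topic_properties(self) -> dict:
        # Topic-level config applied to the tier's input topic.
        return {"retention.ms": str(self.retention_ms)}


P0 = TierConfig("user-actions-p0", stream_threads=8, commit_interval_ms=1_000, retention_ms=HOUR_MS)
P2 = TierConfig("user-actions-p2", stream_threads=2, commit_interval_ms=10_000, retention_ms=7 * DAY_MS)

print(P0.streams_properties())
print(P2.topic_properties())  # {'retention.ms': '604800000'}
```

Keeping the tiers as separate config objects makes the trade-off explicit: the high-priority tier pays for more threads and more frequent commits, and the low-priority tier does not.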
The change took 12 hours, but it worked. By the next peak, the P99 latency for user-actions-p0 was 78ms, and the user-actions-p2 events were still under 5 seconds. The replication lag between regions dropped to 100ms.
What The Numbers Said After
After a week of volume testing, the metrics told the story:
- End-to-end latency P99: 82ms (down from 30s)
- Streams app CPU usage: 45% (down from 95%)
- Replication lag between regions: 120ms (down from 3s)
- State store commit lag: 200ms (down from 5s)
But the real win wasn't the numbers. It was the absence of pages. For the first time in three months, the on-call team wasn't woken up at 3AM by a latency spike. The system was boring. And that's the point.
What I Would Do Differently
If I could go back, I'd never have let the Streams app handle mixed-priority events in the first place. Kafka Streams is a great tool, but it's not a Swiss Army knife. It's designed for event processing, not event routing. We should have split the streams at the ingestion layer, not after the fact.
I'd also have resisted the urge to solve everything with configuration. More threads, smaller commit intervals, more partitions: these are all levers you can pull, but they're