Why Your Event-Driven Pipeline is a Latency Death Trap (And How We Fixed It)

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

The treasure-hunt system we inherited used an event bus built on Kafka with exactly-once semantics turned on, idempotent producers enabled, and the replication factor set to 3. Sounds bulletproof, right? Our latency SLO said we had to respond to every event within 200 ms end-to-end. The first prototype achieved 192 ms median latency under a 1,000 events-per-second load. That was the demo.

Production told a different story: 180 ms median, 1.2 seconds average, and a 99th percentile that crossed 2 seconds when the load hit 6,000 events per second. The treasure-hunt leaderboard wasnt missing the first-place user because of bad algorithms; it was missing because the event bus was buffering commits before acknowledging writes. The acks=all setting with min.insync.replicas=2 meant every write waited for two brokers to commit before returning, even when the followers were on the same rack as the leader. Thats 60–80 ms of extra latency per hop. Worse, the idempotent producer layer added another 30–40 ms of serialization overhead because every event had to be deduplicated before being published. Our SRE called it a wall of safety; I called it a latency wall.

What We Tried First (And Why It Failed)

The first fix was obvious: downgrade to acks=1. We shaved 70 ms off the median latency in staging, but the 99th percentile still spiked to 1.8 seconds when a follower broker got OOM-killed and the controller stepped in. Kafkas unclean leader election was flipping every 45 seconds under load, and the new leader had to rebuild state from scratch. That introduced 1.2 seconds of head-of-line blocking for every consumer.

We tried increasing num.io.threads from 8 to 16 and bumped num.network.threads to 6. Throughput went up, but latency variance doubled because the OS scheduler couldnt keep up with the thread explosion. Then we tried moving to Tiered Storage with remote log segments. The idea was to offload older segments to S3 and keep only the hot ones on disk. What actually happened: the S3 multipart uploads introduced 300 ms of jitter every time a segment rolled, and the consumer lag graph looked like a sawtooth chain saw.

The Architecture Decision

We stopped trying to tune Kafka and ripped out the event bus entirely. The new pipeline uses a two-tier system:

Tier 1 is a memory-first, single-leader Redis cluster with RedisJSON and RedisTimeSeries. Every event is written to the leader with wait=1—meaning we wait for the leaders in-memory commit only. We set repl-ping-replica-period 50 to keep followers in sync without constant heartbeat noise. The leaders replica count is set to 2, but we dont wait for disk flushes; we rely on the leaders AOF rewrite to drain to disk asynchronously. Median latency dropped to 45 ms. The 99th percentile sat at 110 ms even under 12,000 events per second.

Tier 2 is an append-only log stored in ClickHouse. Every Redis event writes a row into a ReplacingMergeTree table with a dedup key. We run a nightly TTL to drop events older than 7 days. ClickHouse handles the analytical queries for leaderboards and fraud detection, while Redis handles the real-time updates. The tradeoff: we lost exactly-once semantics between tiers, so we implemented an idempotent consumer layer in the application using a monotonically increasing event ID and a Postgres advisory lock table. That added 15 ms of latency but gave us deterministic leaderboard updates.

What The Numbers Said After

After two weeks of production traffic, the numbers were brutal but honest:

Median latency: 43 ms (down from 180 ms)
99th-percentile latency: 112 ms (down from 2.8 seconds)
End-to-end treasure-hunt leaderboard update: 85 ms (down from 280 seconds)
Cost per million events: $0.04 (up from $0.03) because Redis memory usage increased 2.3×.

The real win wasnt the latency numbers; it was the SLO. We moved from a 99.9% availability SLO to 99.95%. We stopped waking up at 3 a.m. to watch Kafka controller logs. The marketing team still demos the Redis cluster with wait=1, and nobody outside engineering notices the difference between a Redis commit and a Kafka commit anymore.

What I Would Do Differently

I would not have trusted Kafkas demo metrics for even a single sprint. Kafkas own documentation calls out that acks=all with min.insync.replicas=2 is a latency killer; I ignored it because it looked safe. Second, I would have sized the Redis cluster for 2.5× the peak load instead of 1.5×. We hit memory fragmentation under 11,000 events per second because jemallocs arenas couldnt keep up with the TLS churn. Third, I would have built the idempotent consumer layer before the first load test instead of after the outage. The advisory lock table became a bottleneck when we scaled to 50 consumer pods, and we had to rewrite it to use a Redis stream-based deduplication layer.

Finally, I would have insisted on staging traffic that mimicked real user behavior instead of synthetic 1,000 events-per-second bursts. Our staging cluster never saw the same event size distribution as production, so the Kafka commit latency spikes were invisible until we went live. Real user events were