DEV Community

Cover image for The Operators Regret: How We Blew Up the Event Bus at 3 AM
Lillian Dube
Lillian Dube

Posted on

The Operators Regret: How We Blew Up the Event Bus at 3 AM

The Problem We Were Actually Solving

At 02:47 the Redis counters began to drift by as much as 18 %. Players who had just spent 300 gold on a dig turned around and screamed at Discord that the server had stolen their loot. We had a classic symptom: event loss.

Our original topology was Kafka → Kafka Streams → Redis. It looked good in the whiteboard diagram. We had 6 brokers, 240 partitions, a replication factor of 3, and acks=all on the producer. The logs said everything was healthy—lag was near zero, broker CPU idle at 20 %. The problem wasnt latency; it was missing messages.

The first clue came from the kafka-consumer-groups command. One consumer group—leaderboard-rebuild—showed a lag of 4.2 million messages. The Streams application had fallen behind, GC pauses every 30 seconds, and the RocksDB state store couldnt keep up. By the time it caught up, players had already opened another chest and the old events were gone.

We needed exactly-once semantics across three systems: Kafka, Kafka Streams, and Redis. Thats not the same as at-least-once; its exactly-once with external side effects. The docs dont cover that.

What We Tried First (And Why It Fai

We started with the obvious: increase Kafka Streams threads, switch to commit.interval.ms=100, and raise the max.poll.records to 1000. The lag dropped from 4.2 M to 1.8 M, but the Redis counters still wobbled ±3 %. The Streams app was now spamming Redis with MSET commands that Redis couldnt atomicize, so counters were racing.

Next we tried an outbox table in PostgreSQL. We added Debezium with snapshot.mode=initial and max.batch.size=2000. The Kafka Connect cluster spun up fine, but the PostgreSQL WAL filled at 2 GB/s and the P99 latency on player API went from 18 ms to 820 ms. Players started reporting timeouts on treasure openings.

Then we tried transactional producers with transactional.id=treasure-01 and transaction.timeout.ms=60000. At 40k events/s the broker started throwing TransactionTimeoutException with message Timeout expired after 60000ms while waiting for acks from brokers. The cause was simple: we had set max.in.flight.requests.per.connection=5 while using idempotence. The broker held five inflight requests, and if any failed, the transaction coordinator rolled back the whole batch. We lost 12 % of treasure events during brief network flaps.

Every fix moved the problem to a different layer, but the underlying truth was the same: Kafka alone cannot guarantee exactly-once delivery when the consumer writes to an external store.

The Architecture Decision

We ripped out Kafka Streams and replaced it with a choreographed saga.

  1. Kafka remains the event backbone with acks=all, retries=3, max.in.flight.requests.per.connection=1, and transactional.id removed. We kept idempotence (enable.idempotence=true) because it prevents duplicates under retries, but we stopped pretending we could do exactly-once across systems.

  2. We introduced an idempotent sink: a dedicated service called TreasureSink. It consumes events in order, uses a monotonically increasing sequence number, and writes to Redis via Lua scripts wrapped in MULTI/EXEC. The sequence number is stored in a sink_sequence key with a TTL of 60 s. If the service restarts, it starts from the sequence number it last committed to a Redis sorted set called processed:{event_type}.

  3. We moved the leaderboard counter to Redis Streams (XADD, XREADGROUP, XACK). Instead of rebuilding the entire leaderboard every minute, we stream each TreasureOpened event with the players delta. The leaderboard rank is cached in-memory, updated by a Lua script that leverages Rediss atomic increment and sorted set ZINCRBY. If the node restarts, we replay the last 10k events from the stream, which takes <200 ms.

  4. We accepted that we could only guarantee exactly-once within Kafka if we controlled the entire pipeline. We removed Kafka Connect and PostgreSQL from the critical path. The treasure events now go Kafka → TreasureSink → Redis Streams → Leaderboard Service. The only external system is Redis, which we control end to end.

The tradeoff was operational complexity. We now have three services instead of one, plus a Redis cluster with streams enabled. The CPU on the TreasureSink pods runs at 35 % steady state but we have hit 1500 events/s without latency creep. The Redis Streams memory usage is 4.2 GB at 4 M events/minute with a 7-day retention, which costs us an extra $420 per day on our cloud bill.

What The Numbers Said After

  • Event loss: 0 % in the last 90 days. The last missing event was on the day we fixed the outbox, and it was a network partition that Debezium couldnt handle anyway.
  • P99 latency on treasure open: 28 ms. The Redis Streams read is 12 ms, the Lua script is 6 ms, and the round trip to the region is 10 ms. This is up from 18 ms before the outage, but we traded 10 ms for correctness.
  • Kafka broker CPU: 68 % under peak load. We scaled to 9 brokers instead of 6.
  • TreasureSink GC pauses: 2 ms every 45 s. Nothing that triggers a latency spike above 35 ms.
  • Cost per 1 million events: $0.046. The outbox architecture with PostgreSQL

Top comments (0)