DEV Community

mary moloyi

The Day the Event Store Became a Black Hole

The Problem We Were Actually Solving

It started with a simple requirement: store every user action in a central place so we could rebuild state if anything went wrong. The product team called it the Event Log. Marketing promised customers we could replay any session. Finance needed a ledger for billing. In theory, it was just an append-only log. In practice, it became the most expensive, fragile, and noisy system we owned. The Veltrix events cluster, which started as 3 Kafka topics, had ballooned into 23 topics with 13 partitions each, some pushing 40 MB/s. The retention policy was set to 7 days, but the disks filled in 3 because nobody had anticipated the surge of background sync events when mobile clients woke up. The on-call rotation was averaging three pages a night: DiskPressure on the brokers, high request latency during compaction, and the worst offender, consumer lag that spiked whenever the billing job ran and recomputed every user balance from scratch.
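The gap between a 7-day retention window and disks that fill in 3 days comes down to simple arithmetic. Here is a rough sketch of it. The disk size, the aggregate ingest rate, and the replication factor are all assumptions chosen for illustration, not measurements from our cluster.

```python
# Back-of-envelope check: does the disk outlast the retention window?
# Every number passed in is an illustrative assumption.

SECONDS_PER_DAY = 86_400


def days_until_full(disk_gb: float, ingest_mb_per_s: float, replication: int) -> float:
    """Days of writes the cluster's disks can absorb at a steady ingest rate."""
    written_gb_per_day = ingest_mb_per_s * replication * SECONDS_PER_DAY / 1_000
    return disk_gb / written_gb_per_day


def retention_is_safe(disk_gb: float, ingest_mb_per_s: float,
                      replication: int, retention_days: float) -> bool:
    """Time-based retention only frees space if the disk survives the window."""
    return days_until_full(disk_gb, ingest_mb_per_s, replication) > retention_days


# Hypothetical cluster: 30 TB of disk, 40 MB/s aggregate ingest, 3 replicas.
print(round(days_until_full(30_000, 40, 3), 1))
print(retention_is_safe(30_000, 40, 3, retention_days=7))
```

With these made-up inputs the disks last just under three days, so a 7-day time-based policy never gets the chance to delete anything before the brokers run out of space.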

What We Tried First (And Why It Failed)

Our first attempt was classic over-engineering. We created a separate topic for every microservice (UserEvents, OrderEvents, NotificationEvents) and gave each one six replicas with unclean leader election disabled. The idea was isolation: if the billing service went rogue, it wouldn't affect user signups. The result was fragmentation. The cluster now had 140 topics, and the controller kept crashing because it couldn't keep track of leader elections under load. The DiskPressure alerts were still firing, but now we had to correlate lag across three topics just to debug a single user's session replay. The client SDK began aggressively batching events to reduce outbound traffic, so a 1 KB user click now travelled inside a single 50 KB message. The brokers' ISR sets shrank during GC pauses, and once a partition fell out of ISR for more than 30 seconds, the producer blocked indefinitely. That was the night I learned that Kafka's linger.ms and batch.size aren't just knobs; they're land mines when you combine mobile wake cycles with unreliable networks.
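One cheap guard against this kind of surprise is to lint producer settings before they ship. The sketch below uses real Kafka producer config names, but the values and the 64 KB batch budget are assumptions for illustration, and `check_producer_config` is a hypothetical helper, not part of any Kafka client. The first rule mirrors a constraint the Java client enforces itself: delivery.timeout.ms must be at least linger.ms plus request.timeout.ms.

```python
# Hypothetical producer settings as a plain dict. The keys are standard
# Kafka producer config names; the values are illustrative only.
producer_config = {
    "acks": "all",
    "linger.ms": 20,                 # how long to wait for a batch to fill
    "batch.size": 16_384,            # max bytes per partition batch
    "request.timeout.ms": 30_000,
    "delivery.timeout.ms": 120_000,  # upper bound on reporting success or failure of a send
    "max.block.ms": 10_000,          # upper bound on send() blocking on buffer or metadata
}


def check_producer_config(cfg: dict, max_batch_bytes: int = 65_536) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config passes."""
    problems = []
    if cfg["delivery.timeout.ms"] < cfg["linger.ms"] + cfg["request.timeout.ms"]:
        problems.append("delivery.timeout.ms must be >= linger.ms + request.timeout.ms")
    if cfg["batch.size"] > max_batch_bytes:  # team-level budget, an assumption
        problems.append(f"batch.size exceeds the {max_batch_bytes}-byte budget")
    return problems


print(check_producer_config(producer_config))
```

Setting explicit values for delivery.timeout.ms and max.block.ms is what keeps a stalled partition from turning into a producer that waits forever.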

The Architecture Decision

We ripped it all out and replaced it with one topic: event_stream_v3. One partition per availability zone. One offset per event, immutable and ordered within its partition (Kafka does not order events across partitions). The retention policy became size-based at 100 GB, not time-based, because we finally admitted that some sessions run for weeks and we can't afford to lose them. We introduced a protocol buffer schema registry that enforced backward compatibility, so the client SDK could evolve without breaking downstream consumers. We enabled idempotent producers with exactly-once semantics turned on, which cost us 15% more CPU per broker but eliminated duplicate billing events at 3 am. The billing job no longer recomputed balances from scratch; instead, it subscribed to the event stream with a lag monitor and wrote only the incremental changes. We pushed compaction away from peak traffic by setting min.compaction.lag.ms to 12 hours, which keeps a record ineligible for compaction until 12 hours after it is written, and that finally stopped the compaction storms that had been starving the brokers.
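For reference, here is roughly what those settings look like as Kafka config maps. This is a sketch under assumptions: the config keys are real Kafka names, but the `compact,delete` cleanup policy, the reading of 100 GB as a per-partition budget (retention.bytes applies per partition), and the transactional.id value are guesses made for illustration. `topic_config` is a hypothetical helper.

```python
GIB = 1024 ** 3
HOUR_MS = 60 * 60 * 1000


def topic_config(retention_gib: int, compaction_lag_hours: int) -> dict:
    """Build topic-level settings for size-based retention with delayed compaction."""
    return {
        "cleanup.policy": "compact,delete",  # assumption: compaction plus size-based deletion
        "retention.ms": "-1",                # no time-based limit
        "retention.bytes": str(retention_gib * GIB),  # enforced per partition
        "min.compaction.lag.ms": str(compaction_lag_hours * HOUR_MS),
    }


# Producer side: idempotence requires acks=all.
producer_config = {
    "enable.idempotence": "true",
    "acks": "all",
    "transactional.id": "billing-writer-1",  # hypothetical; only needed when using transactions
}

event_stream_v3 = topic_config(retention_gib=100, compaction_lag_hours=12)
print(event_stream_v3)
```

Because retention.bytes is a per-partition limit, the total footprint of the topic is that number multiplied by the partition count and the replication factor, which is worth checking against the disk budget before rollout.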

What The Numbers Said After

After six weeks, the cluster stabilized. The p99 produce latency dropped from 1.2 seconds to 45 milliseconds. The disk usage leveled off at 65 % full instead of the prior 98 %. The on-call rotation went from three pages a night to zero. The billing job, which previously took 47 minutes to backfill a single day of events, now completed in 8 minutes by reading only the incremental offset range. The cost per million events fell from $0.87 to $0.12 because we consolidated topics and reduced replica count. The most surprising metric was developer happiness: engineers stopped treating the event log like a haunted graveyard and started using it as the single source of truth for user journeys, debugging session replays, and fraud detection.
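The drop from 47 minutes to 8 comes from processing only events past a checkpoint instead of the whole log. A minimal sketch of that idea follows, using an in-memory stand-in for the billing database. `BalanceStore` and the event tuple shape are hypothetical, and the sketch assumes a single partition, since Kafka offsets are only meaningful within one partition.

```python
from dataclasses import dataclass, field


@dataclass
class BalanceStore:
    """In-memory stand-in for the billing database (hypothetical)."""
    balances: dict = field(default_factory=dict)
    last_offset: int = -1  # highest offset already applied

    def apply(self, events) -> int:
        """Apply only events past the checkpoint; return how many were applied."""
        applied = 0
        for offset, user_id, amount_cents in events:
            if offset <= self.last_offset:
                continue  # already applied, so replaying the log is harmless
            self.balances[user_id] = self.balances.get(user_id, 0) + amount_cents
            self.last_offset = offset
            applied += 1
        return applied


log = [(0, "u1", 500), (1, "u2", 250), (2, "u1", -100)]
store = BalanceStore()
print(store.apply(log))                      # 3: first run applies everything
print(store.apply(log + [(3, "u2", 50)]))    # 1: second run applies only the new event
print(store.balances)                        # {'u1': 400, 'u2': 300}
```

In a real job the balance update and the checkpoint would need to be committed together, for example in one database transaction, so a crash between the two cannot double-count an event.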

What I Would Do Differently

I would never let the marketing team promise session replay as a customer-facing feature until the event store had been battle-tested for three months. That promise led to runaway client SDK batching and ultimately to the compaction storms that nearly melted the cluster. I would also insist on a dedicated disk tier for event logs, separate from the general-purpose SSDs, because noisy neighbors and compaction IO patterns are incompatible. Finally, I would have fought harder to push the schema registry upstream so that every team had to register its events before emitting them; late schema changes were the second-biggest source of consumer lag after the billing batch job. The lesson is simple: an event log is not a feature; it's infrastructure. Optimize it like the backbone it is, or it will collapse under the weight of its own promises.
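To make "register before emitting" concrete, here is a toy gate. It is a sketch, not a real schema registry client: `SchemaRegistry` and `emit` are hypothetical names, and the compatibility rule is deliberately simplified to a single protobuf-style constraint, namely that a field number keeps its type once it has been registered.

```python
class SchemaRegistry:
    """Toy in-memory registry; a stand-in for a real schema registry service."""

    def __init__(self):
        self._fields = {}  # event type -> {field number: field type}

    def register(self, event_type: str, fields: dict) -> None:
        """Accept new fields; reject a type change on an already registered field number."""
        known = self._fields.setdefault(event_type, {})
        for number, field_type in fields.items():
            if number in known and known[number] != field_type:
                raise ValueError(
                    f"{event_type}: field {number} changed from "
                    f"{known[number]} to {field_type}"
                )
        known.update(fields)  # only reached when every field passed the check

    def is_registered(self, event_type: str) -> bool:
        return event_type in self._fields


def emit(registry: SchemaRegistry, event_type: str, payload: dict, sink: list) -> None:
    """Refuse to publish events whose type was never registered."""
    if not registry.is_registered(event_type):
        raise LookupError(f"unregistered event type: {event_type}")
    sink.append((event_type, payload))


registry = SchemaRegistry()
registry.register("UserClicked", {1: "string", 2: "int64"})
outbox = []
emit(registry, "UserClicked", {"user_id": "u1"}, outbox)
```

Putting the check in the emit path means a late schema change fails loudly in the producing team's own tests instead of surfacing as consumer lag downstream.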
