Why Our Event Sourcing Pipeline Blew Up During the Black Friday Peak (And How We Fixed It)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We had rebuilt the billing subsystem from a simple INSERT table to an Event Store using Kafka. The goal sounded reasonable: make the audit trail immutable and enable re-processing. What we didnt realize was that the operators—me included—were treating Event Sourcing like a traditional database with unbounded growth.

The real question wasnt how to store events. It was: what happens when the event stream becomes the system of record and the operators cant stop the deluge? We had upstream services happily emitting events, downstream projections updating, and the log growing at 12 GB/minute. Our Kafka cluster had 5 brokers, RF=3, and segment files set to 1 GB rolling every 24 hours. That configuration worked fine for 100K events/day. It didnt work for 3.2M/s.

Metrics showed the ingestion latency spiking to 22 seconds when the cluster filled its 300 GB buffer. The operators first instinct was to scale the brokers, but that just shifted the problem: more brokers meant longer controller elections, more open file handles, and a 7-minute window where the cluster was read-only. We needed a different approach.

What We Tried First (And Why It Failed)

First, we tried throttling upstream services with a token bucket. We built a Go rate limiter that dropped events when the bucket filled. The downstream billing engine kept retrying with exponential backoff, creating a thundering herd. The latency went from 22 seconds to 45 seconds, and the billing SLA breached.

Next, we tried partitioning the event topic by customer ID. We chose a hash function that spread high-value customers evenly, but the hot partitions created a bottleneck. One partition ended up with 70% of the traffic because our top 10 customers generated 65% of the events. The disk I/O on that single broker hit 92% utilization, and the segment flushes lagged. We saw a flood of kafka_server_log_flush_timeouts while the partition leader stayed stuck.

Finally, we tried offloading old events to S3 and keeping only recent events in Kafka. The idea was to keep the cluster healthy by archiving events older than 7 days. But the offload job ran every hour and scanned 80 million events to find candidates. It added 40 seconds to the average replay time and blocked the compaction thread. The operators were now managing two systems: Kafka for recent events and S3 for history. The complexity was brutal.

The Architecture Decision

We ripped out the Kafka-based event store and rebuilt the pipeline around Apache Pulsar with tiered storage.

The key decision wasnt the broker choice—it was the operator boundary. We drew a hard line between the event pipeline and the billing projection. The pipeline became a write-once log with no projection logic inside it. The billing service consumed the log and built its own materialized view using a separate Postgres cluster with async logical replication.

We set the Pulsar retention policy to 3 days on the billing_events topic, with a rollover threshold of 5 GB. Any event older than 72 hours was automatically offloaded to S3 via the built-in tiered storage. The compaction was handled by a separate managed service that ran on a different cluster, so it couldnt starve the ingestion layer.

We also introduced a backpressure signal. When disk usage hit 85%, the broker published a broker_disk_pressure metric. The upstream service listened to this metric and dropped non-critical events while allowing billing events to flow through. The signal was simple: a single Prometheus alert rule firing into a webhook that the Go service consumed. No rate limiter, no backoff storms.

The tradeoff was clear: we sacrificed perfect replayability for operational stability. If the billing service needed to reprocess a month-old event, it had to replay from the projection store, not the raw event log. But the operators slept better knowing the pipeline wouldnt collapse during peak traffic.

What The Numbers Said After

After the switch, the ingestion latency stayed below 2 seconds even at 4,800 events/sec during our next peak. The Pulsar cluster ran at 25% disk utilization and 45% CPU, with no timeouts. The tiered storage offloaded 85% of the raw event volume, reducing the hot storage footprint from 300 GB to 45 GB.

The billing projection lagged by 3.2 seconds during the peak, but the SLA was 5 seconds. That lag came from Postgres async commit and WAL shipping, not from the event pipeline.

The operators dashboard still showed disk pressure, but now it was on the projection cluster, not the event pipeline. When the projection cluster hit 90% disk, we triggered a scale-up without affecting billing. The separation of concerns finally worked.

What I Would Do Differently

I would not have built the event store in Kafka. Kafka is a log, not a service boundary. Operators treat logs like databases and try to add projections, compaction, and indices—all of which break the log abstraction. Pulsars tiered storage and built-in offload were the right primitives from day one.

I would also avoid conflating audit and billing in the same log. Audit is immutable by definition; billing projections are mutable. Keeping them separate avoids the temptation to query the log for billing data and corrupting the write-once guarantee.

Finally, I would have set the retention policy before the first event was produced. Our on-call team spent hours tuning retention during the peak because the defaults were unbounded. That single decision—defining operator boundaries up front—saved us from another Black Friday meltdown.