Veltrix Event Sourcing Pitfalls: Why Our First Config Broke at 500k Events/Second

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our event-sourced payments service needed to reprocess every transaction from 2021 for a fraud detection rewrite. We expected this to be a batch job taking hours. Instead, we hit the wall at 200k events: our consumers fell behind, p99 ingestion latency spiked to 1.2 seconds, and the team started calling it the "fraud detection rewrite that broke prod."

The issue manifested as consumer lag but traced back to the snapshotting cadence. We had set it to one snapshot every 10,000 events, enough to keep memory bounded while avoiding constant disk reads. The math looked solid: 10k events × an average of 1KB/event ≈ 10MB of heap per consumer, with roughly 10 minutes expected between snapshots. At 500k events/second, the same setting meant 50 snapshots per consumer per second. Our throughput was 20x higher than projected.
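The arithmetic above can be sketched in a few lines. This is only an illustration: the function names are made up for this post, 1KB is treated as 1,000 bytes, and the event rate is assumed to be per consumer.

```rust
/// Approximate heap held between two snapshots, in bytes.
fn heap_between_snapshots(interval_events: u64, avg_event_bytes: u64) -> u64 {
    interval_events * avg_event_bytes
}

/// Snapshots triggered per second by a count-based policy.
fn snapshots_per_second(events_per_second: u64, interval_events: u64) -> f64 {
    events_per_second as f64 / interval_events as f64
}
```

With an interval of 10,000 events and 1,000-byte events, the first function gives 10,000,000 bytes (about 10MB). At 500,000 events/second the second gives 50 snapshots per second, which is where the heap estimate stops being the number that matters.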

What We Tried First (And Why It Failed)

First we tuned the event store itself. We moved from PostgreSQL 15 to a dedicated TimescaleDB instance with hypertables, enabled asynchronous commit, and raised WAL buffers to 16MB. The p99 dropped from 1.2s to 850ms, which was still unacceptable for real-time fraud scoring.

Next we tried increasing the snapshotting interval to 50,000 events. This reduced snapshot load by 80%, but introduced a new failure mode: when consumers restarted, they had to replay up to 50k events instead of 10k. Memory usage during replay spiked to 800MB per consumer, triggering OOM kills in our Kubernetes pods. The latency spike during restarts lasted 4-6 seconds, violating our SLA.
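To make the restart cost concrete, here is a minimal sketch of snapshot-plus-replay recovery. The State and Event types are hypothetical stand-ins (a running total rather than real fraud state), and offsets are assumed to start at 1 so that a last_offset of 0 means "no events applied yet".

```rust
/// Hypothetical aggregate state: a running total of processed amounts.
#[derive(Debug, Clone, PartialEq)]
struct State {
    last_offset: u64, // offset of the last event folded into this state
    total_cents: i64,
}

/// Hypothetical event record.
struct Event {
    offset: u64,
    amount_cents: i64,
}

/// Rebuild state from the latest snapshot plus every event logged after it.
/// Returns the recovered state and the number of events replayed.
fn recover(snapshot: &State, log: &[Event]) -> (State, usize) {
    let mut state = snapshot.clone();
    let mut replayed = 0;
    for ev in log.iter().filter(|e| e.offset > snapshot.last_offset) {
        state.total_cents += ev.amount_cents;
        state.last_offset = ev.offset;
        replayed += 1;
    }
    (state, replayed)
}
```

The replay loop runs once per event since the last snapshot, so a larger interval directly raises the worst-case restart work and the memory needed to hold events in flight.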

We also evaluated Kafka Streams exactly-once semantics with changelog compaction. The docs promised no duplicates, but our integration tests showed 0.3% duplicated events under network partitions. Fraud scoring cannot tolerate duplicates. We turned it off after four hours.

The Architecture Decision

We abandoned purely time-based and count-based snapshotting. Instead we implemented a hybrid approach driven by event-time watermarks, which we called Effective Event Time (EET).

Each event carries an EET in nanoseconds derived from the payment timestamp. Consumers maintain a sliding window over the last 60 seconds of EET. When the newest EET in the window is more than 5 seconds past the EET of the last snapshot, and fewer than 100 snapshots are queued, we trigger a snapshot. This ensures snapshots happen during natural lulls in event time, not during peak ingestion.

The trigger condition looks like this:

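// EET values are in nanoseconds: 5_000_000_000 ns = 5 seconds of event time.
// The queue check keeps a slow snapshot writer from building a backlog.
if (current_max_eet - last_snapshot_eet) > 5_000_000_000
    && snapshot_queue_size < 100
{
    trigger_snapshot();
}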
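For readers who want something that compiles, here is a minimal sketch built around the same condition. The SnapshotTrigger type and its observe method are illustrative names, not the production code, and the 60-second sliding window is left out; only the watermark and the two thresholds are modeled.

```rust
/// Event time that must elapse since the last snapshot, in nanoseconds.
const SNAPSHOT_GAP_NS: u64 = 5_000_000_000;
/// Queued snapshots at or above this count suppress new triggers.
const MAX_QUEUED_SNAPSHOTS: usize = 100;

struct SnapshotTrigger {
    current_max_eet: u64,
    last_snapshot_eet: u64,
}

impl SnapshotTrigger {
    fn new(start_eet: u64) -> Self {
        Self { current_max_eet: start_eet, last_snapshot_eet: start_eet }
    }

    /// Record an event's EET and report whether a snapshot should start.
    fn observe(&mut self, eet_ns: u64, snapshot_queue_size: usize) -> bool {
        // Late or out-of-order events must not move the watermark backwards.
        self.current_max_eet = self.current_max_eet.max(eet_ns);
        let gap = self.current_max_eet.saturating_sub(self.last_snapshot_eet);
        if gap > SNAPSHOT_GAP_NS && snapshot_queue_size < MAX_QUEUED_SNAPSHOTS {
            // Remember where this snapshot was taken in event time.
            self.last_snapshot_eet = self.current_max_eet;
            return true;
        }
        false
    }
}
```

Taking the maximum before comparing means a late event can never make the gap shrink, and saturating_sub guards the unsigned subtraction in case the two fields are ever initialized inconsistently.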

We moved snapshots to a separate write-optimized Kafka topic with 7-day retention and log compaction. Consumers read from this topic using a separate consumer group with fetch.max.bytes set to 5MB (5242880 bytes) and max.partition.fetch.bytes set to 1MB (1048576 bytes) to avoid ballooning memory during snapshots.
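Kafka expects these settings as plain integers (bytes and milliseconds), so it helps to compute them rather than hand-type them. The sketch below only builds key/value pairs and does not talk to a broker. The group name is hypothetical, and combining compaction with time-based retention via cleanup.policy=compact,delete is an assumption about how the two were paired.

```rust
const MIB: u64 = 1024 * 1024;
const DAY_MS: u64 = 24 * 60 * 60 * 1000;

/// Topic-level settings for the snapshot topic.
fn snapshot_topic_config() -> Vec<(&'static str, String)> {
    vec![
        // Assumption: compaction and 7-day retention are combined.
        ("cleanup.policy", "compact,delete".to_string()),
        ("retention.ms", (7 * DAY_MS).to_string()),
    ]
}

/// Consumer settings for the dedicated snapshot reader group.
fn snapshot_consumer_config() -> Vec<(&'static str, String)> {
    vec![
        ("group.id", "snapshot-readers".to_string()), // hypothetical name
        ("fetch.max.bytes", (5 * MIB).to_string()),
        ("max.partition.fetch.bytes", MIB.to_string()),
    ]
}
```

The resulting pairs can be handed to whichever Kafka client or admin tool is in use.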

We also switched to RocksDB as the state store backend instead of in-memory maps plus disk overflow. RocksDB gave us 3.2x higher write throughput during snapshots while keeping read latency under 10ms for our fraud scoring queries.

Finally, we baked the snapshot policy into the consumer builder so it couldn't be overridden at runtime. Three engineers accidentally reconfigured it during the incident, so now it's a compile-time constant.
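One way to express "not overridable at runtime" is a builder that simply has no setter for the policy. The sketch below is illustrative only: ConsumerBuilder, Consumer, and SnapshotPolicy are invented names, and the values mirror the thresholds from the trigger condition.

```rust
/// Snapshot policy fixed at compile time.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SnapshotPolicy {
    eet_gap_ns: u64,
    max_queue: usize,
}

const SNAPSHOT_POLICY: SnapshotPolicy = SnapshotPolicy {
    eet_gap_ns: 5_000_000_000,
    max_queue: 100,
};

struct Consumer {
    group_id: String,
    policy: SnapshotPolicy,
}

struct ConsumerBuilder {
    group_id: String,
}

impl ConsumerBuilder {
    fn new(group_id: &str) -> Self {
        Self { group_id: group_id.to_string() }
    }

    /// Runtime-tunable options get setters; the snapshot policy does not.
    fn group_id(mut self, group_id: &str) -> Self {
        self.group_id = group_id.to_string();
        self
    }

    /// The policy is always taken from the constant, never from input.
    fn build(self) -> Consumer {
        Consumer { group_id: self.group_id, policy: SNAPSHOT_POLICY }
    }
}
```

Changing the policy then requires a code change and a redeploy, which puts it through review instead of through a config edit during an incident.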

What The Numbers Said After

After the change, we reran the fraud detection reprocessing job. At 500k events/second, p99 ingestion latency stabilized at 15ms, down from 850ms. Consumer restart latency dropped to 200ms. Our Kafka consumer lag went from 45 minutes to under 30 seconds during peak.

The RocksDB state store used 450MB per consumer at steady state, peaking at 720MB during snapshots. That is not ideal, but it is within our pod limits. The write-optimized snapshot topic added 150GB/day of storage, which we archived to S3 after 7 days using tiered storage.

We measured snapshotting overhead using the Kafka metrics topic. Before the change, snapshot time accounted for 42% of CPU time in consumer pods. After, it accounted for 8%. Compared with the count-based policy, the EET watermark increased snapshot frequency during lulls and reduced it during spikes, which kept CPU usage flat.

The fraud detection rewrite completed in 47 minutes instead of the predicted 14 hours. Fraud detection accuracy improved by 0.12% because we could now process every event in real time without skipping.

What I Would Do Differently

I would never again rely on memory-bounded count-based snapshotting for high-throughput event sourcing. The assumption that a fixed event count maps to a predictable amount of time is wrong, because load is not constant. Our payment spikes hit 800k events/second between 10:15 and 10:30 AM local time but fell to 80k events/second overnight. A fixed 10k snapshot interval would have broken our night shift too.
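A quick comparison shows why. The sketch below contrasts the two policies; the function names are illustrative, and the event-time figure assumes the consumer keeps up with live traffic so that event time advances at wall-clock speed.

```rust
/// Snapshots per second under a count-based policy.
fn count_based_rate(events_per_second: f64, interval_events: f64) -> f64 {
    events_per_second / interval_events
}

/// Upper bound on snapshots per second under an event-time gap policy,
/// assuming event time advances at wall-clock speed.
fn event_time_max_rate(gap_seconds: f64) -> f64 {
    1.0 / gap_seconds
}
```

With a 10k interval, the count-based rate is 80 snapshots/second at the 800k peak and still 8 snapshots/second at the 80k overnight rate. A 5-second event-time gap caps the rate at 0.2 snapshots/second in both cases.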

I would also avoid RocksDB for read-heavy fraud scoring workloads. RocksDB excels at write throughput but adds 6-8ms of latency per read compared to in-memory maps. For our next revision, we're evaluating Apache Arrow Flight SQL with a CDC stream from the event store. The plan is to rebuild the fraud scoring state in a vectorized columnar store, reducing read latency to under 1ms while keeping write throughput above 1M events/second.

Finally, we should have instrumented Effective Event Time from day one. Our legacy events only had a transaction timestamp, not an event creation timestamp. We spent two weeks retrofitting nanosecond precision into a schema that assumed millisecond accuracy. Event time should be a first-class citizen in the schema, not an afterthought.
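As a sketch of what "first-class" could look like, the envelope below makes the nanosecond EET a required field and shows the kind of widening a retrofit needs. The struct and field names are hypothetical, not our actual schema.

```rust
/// Event envelope with event time as an explicit, required field.
#[derive(Debug, PartialEq)]
struct PaymentEvent {
    event_id: u64,
    /// Effective Event Time in nanoseconds since the Unix epoch.
    eet_ns: u64,
    amount_cents: i64,
}

/// Legacy shape: only a millisecond transaction timestamp.
struct LegacyEvent {
    event_id: u64,
    transaction_ts_ms: u64,
    amount_cents: i64,
}

/// Widen a millisecond timestamp to nanoseconds.
/// The sub-millisecond digits are always zero: the conversion cannot
/// recover precision that was never recorded. Returns None on overflow.
fn upgrade(legacy: &LegacyEvent) -> Option<PaymentEvent> {
    let eet_ns = legacy.transaction_ts_ms.checked_mul(1_000_000)?;
    Some(PaymentEvent {
        event_id: legacy.event_id,
        eet_ns,
        amount_cents: legacy.amount_cents,
    })
}
```

Events that share a millisecond end up with identical EET values after the upgrade, so anything that depends on ordering within a millisecond needs a tiebreaker such as the event offset.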