Veltrix Events: How We Blew Up One Cluster a Week Until We Stopped Doing This One Stupid Thing

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We thought we were optimizing for exactly-once semantics. That was wrong. The real problem was scale-to-zero on Tuesday mornings. We had built a system where the event sourcing layer assumed that the downstream consumers would always be alive and never paused. When Kubernetes rolled nodes on Tuesday night (because of a bad cluster-autoscaler setting), half the consumer pods restarted, GC paused for 12 s, and the remaining pods tried to catch up by flooding the same Postgres partition with row-level locks.

The docs say: use a monotonically increasing partition key. What they do not say is: if that key is also the primary key, every consumer will fight over the same tail partition and Postgres will thrash.

What We Tried First (And Why It Failed)

We tried three things before we admitted defeat.

First, we moved the partition key from event_id to (tenant_id, ts_bucket). That fixed the contention, but now we had hot partitions when one tenant sent a 60 MB/sec burst. The pub/sub side still used a single topic, so we hit the 10 MB/sec limit of Cloud Pub/Sub. The metric we tracked was publish latency 99th percentile: it jumped to 800 ms within 30 seconds of a tenant burst.

Second, we switched to Apache Kafka and created 2,048 partitions. The Kafka cluster was on bare metal, so we had to manually tune the number of log segments and flush intervals. We set num.io.threads=32 and num.network.threads=8, but after two weeks the disk flush latency (kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec) showed 4 ms baseline and 120 ms during the burst. The ZooKeeper ensemble kept falling behind because we were running 3-node ZK with a 500 MB snapshot size and a 30 s sync interval. The Kafka controller logs showed session expiration storms.

Third, we tried Debezium Postgres connector with parallel streaming. We configured slot.parallelism=4 and set slot.max WalSize=1 GB. The connector fell over at 50 GB of WAL because it tried to materialize every transaction ID and the JVM heap hit 14 GB resident. The error was java.lang.OutOfMemoryError: GC overhead limit exceeded and the connector restarted every 90 seconds. We could not tune G1GC to make it stable.

Every fix moved the bottleneck somewhere else. The only constant was the Tuesday pager.

The Architecture Decision

We stopped trying to make the event pipeline exactly-once and started making it bounded-stale. The decision was:

Accept duplicates as a cost of simplicity.
Use a partitioned event log (Kafka) with a separate topic per tenant.
Move the deduplication state out of Postgres and into a RocksDB store embedded in each consumer.
Run the consumers as a Kubernetes Deployment with 5 replicas per tenant, each replica reading its own partition.
Set a global retention of 14 days and a max lag of 5 minutes. If a partition lagged beyond 5 minutes, the consumer scaled to 2x replicas.

The tradeoff was clear: we lost exactly-once semantics, but we gained bounded latency and elastic scale. The cost was duplicate events that we had to handle downstream. We measured the duplicate rate at 0.3 % during normal load and 1.8 % during bursts. The downstream service already had an idempotency key cache in Redis, so the extra 1.8 % was acceptable.

We chose Kafka over Pulsar because the Pulsar tiered storage had a 400 ms read latency for offloaded segments, which violated our 200 ms SLA for event replay. The Kafka cluster used Tiered Storage in preview, but the first segment was still on SSD, so we got 2 ms fetch latency.

What The Numbers Said After

We ran the new setup for six weeks. The latency metrics:

End-to-end publish latency 99th percentile: 120 ms (previously 800 ms)
End-to-end consume latency 99th percentile: 180 ms (previously 5.2 s)
Publisher throughput per tenant burst: 120 MB/sec sustained (previously 10 MB/sec)
Consumer memory per pod: 1.8 GB (all RocksDB block cache) with 5 replicas per tenant
Kafka disk usage growth: 4.2 TB over 6 weeks (with 14-day retention)
Replica scaling events: 12 in six weeks, all triggered by lag > 5 min

The Tuesday pager never fired again. The only remaining failure mode was a tenant with a 15-minute sustained burst that pushed lag to 7 minutes; the auto-scaler kicked in at 8 minutes and caught up in 90 seconds. The alert threshold was tuned to 5 minutes because that gave us one extra minute of safety margin before users noticed.