How Our Event-Driven Pipeline Blew Up Because We Trusted the Default Config

#webdev #machinelearning #programming #ai

We set out to build a treasure-hunt engine on top of Veltrix Event Streams in the summer of 2025, thinking we could drop in the default configuration, wire up the APIs, and be live in two weeks. We thought Veltrix was just another managed Kafka; it wasn't. The default settings shipped with three silent assumptions that shredded our SLA on day four.

Our first deployment created a partition skew of 92 % on the events topic. With only two brokers, one broker handled 92 % of the load while the other idled like a brick. Latency for end-of-hunt notifications spiked to 2.8 seconds, which is forever when your players expect under 400 ms. The logs didn't scream "partition leader skew"; they just showed a few hundred event_type=hunt_completed messages stuck on the overloaded broker. We spent three days blaming the network, then switched to Veltrix's own metrics dashboard and saw the skew in one click.
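For anyone who wants to catch this without a dashboard, here is a minimal sketch of the check we should have run on day one. It assumes you can already get a message count per broker from somewhere; the function names, broker names, and counts are made up for illustration and are not part of any Veltrix API.

```python
def load_shares(messages_per_broker: dict[str, int]) -> dict[str, float]:
    """Return each broker's fraction of the total message load."""
    total = sum(messages_per_broker.values())
    if total == 0:
        # No traffic at all: report zero share everywhere instead of dividing by zero.
        return {broker: 0.0 for broker in messages_per_broker}
    return {broker: count / total for broker, count in messages_per_broker.items()}


def hottest_broker(messages_per_broker: dict[str, int]) -> tuple[str, float]:
    """Return the busiest broker and the share of the load it carries."""
    shares = load_shares(messages_per_broker)
    broker = max(shares, key=shares.get)
    return broker, shares[broker]


# Illustrative counts only, chosen to mirror the 92 % figure above.
counts = {"broker-0": 920_000, "broker-1": 80_000}
broker, share = hottest_broker(counts)
print(f"{broker} handles {share:.0%} of the load")  # broker-0 handles 92% of the load
```

With two brokers, anything far from 50 % on the busiest one is worth a look before blaming the network.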

We peeled the config apart and found three traps baked into the defaults. Trap one: compaction lag on the internal _consumer_offsets topic was set to 3600 seconds. That meant any consumer group restart re-read 1.2 million tombstone records before it could start processing new events, adding a 45-second stall to every rollout. Trap two: the default retention.ms on events was 604 800 000 ms (one week). We assumed events were transient, but our replay logic later needed a 28-day look-back for compliance. The compaction storm that followed rewrote 3.4 TB of segments and blew the cluster's write cache for eight minutes. Trap three: the partitioning strategy was keyed only on hunt_id. With 400 concurrent hunts, we got 400 partitions. The controller kept auto-rebalancing every time a pod restarted, eating 30 % of the cluster's CPU budget on controller tasks instead of moving data.
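The retention trap is plain arithmetic, and it is easy to get wrong when every value is a long string of milliseconds. A small sketch of the conversion we now keep next to our configs (the helper names are ours, not a Veltrix feature):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86_400_000 ms in one day


def days_to_ms(days: int) -> int:
    """Convert whole days to a retention.ms value."""
    return days * MS_PER_DAY


def ms_to_days(ms: int) -> float:
    """Convert a retention.ms value back to days."""
    return ms / MS_PER_DAY


default_retention_ms = 604_800_000   # the default on the events topic
required_lookback_days = 28          # what compliance replay needed

print(ms_to_days(default_retention_ms))    # 7.0
print(days_to_ms(required_lookback_days))  # 2419200000

# How far the default falls short of the look-back window, in days.
gap_ms = days_to_ms(required_lookback_days) - default_retention_ms
print(ms_to_days(gap_ms))                  # 21.0
```

The default covers 7 of the 28 days the replay job needed, so three weeks of events were already gone by the time anyone asked for them.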

We had to redesign the topic layout from scratch. First, we moved the hunt events to a three-tier topic hierarchy: events.hunt.lifecycle, events.hunt.player, and events.hunt.audit. Tier two (player) became the largest topic at 12 TB, so we split it by hunt_id modulo 128 to bring partitions down to roughly 90 GB each (12 TB over 128 partitions averages about 94 GB). Tier three (audit) went to 64 partitions keyed on SHA-256(event_id) so that records spread evenly. We set retention.ms per topic: 86 400 000 (one day) for lifecycle, 6 034 000 (about 1.7 hours) for player, and 2 592 000 000 (30 days) for audit. We then turned compaction.min.cleanable.ratio down to 0.1 for the lifecycle topic because tombstones there are sparse, but left it at 0.5 for audit, where we need strict compaction. Finally, we disabled auto-leader-rebalance.enable and put a custom rebalance controller in place that only triggers when a partition's replica count drops below 2 or broker disk usage crosses 80 %.
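The partitioning and rebalance rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not our production code: it assumes hunt_id is an integer, it assumes the SHA-256 digest is reduced to a partition number by taking its first 8 bytes modulo the partition count (how Veltrix maps a key to a partition is not something this sketch knows), and all function names are hypothetical.

```python
import hashlib

PLAYER_PARTITIONS = 128  # events.hunt.player
AUDIT_PARTITIONS = 64    # events.hunt.audit


def player_partition(hunt_id: int) -> int:
    """Player tier: hunt_id modulo 128, so one hunt always lands on one partition."""
    return hunt_id % PLAYER_PARTITIONS


def audit_partition(event_id: str) -> int:
    """Audit tier: hash the event id so records spread evenly across partitions.

    Assumption: the first 8 bytes of the digest, read as a big-endian integer,
    are reduced modulo the partition count.
    """
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % AUDIT_PARTITIONS


def should_rebalance(min_replica_count: int, max_disk_usage: float) -> bool:
    """Trigger only when replicas drop below 2 or disk usage crosses 80 %."""
    return min_replica_count < 2 or max_disk_usage > 0.80


print(player_partition(400))                          # 16
print(0 <= audit_partition("evt-123") < AUDIT_PARTITIONS)  # True
print(should_rebalance(min_replica_count=2, max_disk_usage=0.75))  # False
print(should_rebalance(min_replica_count=1, max_disk_usage=0.75))  # True
```

The design point is that the rebalance decision becomes a pure function of two observable numbers, so a pod restart on its own can no longer trigger it.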

Three weeks after the re-architecture, the numbers looked sane. The partition skew on the player topic dropped to 4 %. The 99th-percentile end-of-hunt latency stayed under 300 ms. The cost per million events dropped from $0.32 to $0.09 because we stopped re-reading tombstones on every restart. Write-cache cold starts fell to zero, and controller CPU went from 30 % of the cluster's budget to a steady 2 %.

Still, I would not make the same choices again. We over-engineered the audit tier by giving it 30 days of hot retention (the 2 592 000 000 ms in its retention.ms). At 5 MB per audit record, that rack alone costs $2.3 k per month for capacity we don't actually use. We could have offloaded audit events to cold storage after 7 days and kept only a rolling window of 14 days on the hot tier. We also locked ourselves into Veltrix-specific knobs like auto-leader-rebalance.enable; a future migration will require rewriting topic configs or running dual-write. If I could restart the project, I would wrap topic creation in Terraform modules that version the retention, compaction, and partitioning logic, then gate deployments with a linting step that checks skew across all topics before any pod ships to prod.
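A rough sketch of what that lint gate could look like, in Python. Everything here is an assumption for illustration: the skew metric (how far the largest partition sits above the mean, as a fraction of the mean), the 10 % threshold, the function names, and the sample sizes. How you fetch per-partition sizes from the cluster is left out on purpose.

```python
from statistics import mean

MAX_SKEW = 0.10  # hypothetical threshold: largest partition at most 10 % above the mean


def partition_skew(sizes: list[float]) -> float:
    """Relative excess of the largest partition over the mean partition size."""
    if not sizes:
        return 0.0
    avg = mean(sizes)
    if avg == 0:
        return 0.0
    return (max(sizes) - avg) / avg


def lint_topics(sizes_by_topic: dict[str, list[float]],
                max_skew: float = MAX_SKEW) -> dict[str, float]:
    """Return the topics whose skew exceeds the threshold, with their skew."""
    failures = {}
    for topic, sizes in sizes_by_topic.items():
        skew = partition_skew(sizes)
        if skew > max_skew:
            failures[topic] = skew
    return failures


def gate(sizes_by_topic: dict[str, list[float]]) -> int:
    """Return a process exit code: 0 if every topic passes, 1 otherwise."""
    failures = lint_topics(sizes_by_topic)
    for topic, skew in sorted(failures.items()):
        print(f"FAIL {topic}: skew {skew:.0%} exceeds {MAX_SKEW:.0%}")
    return 1 if failures else 0


# Made-up partition sizes in GB, for illustration only.
sample = {
    "events.hunt.lifecycle": [10, 10, 10, 10],
    "events.hunt.player": [90, 90, 90, 150],
}
exit_code = gate(sample)
print("exit code:", exit_code)
```

In CI, the last step would hand the return value of gate to sys.exit so that a skewed topic blocks the rollout.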

DEV Community
