The Configuration Death March That Nearly Broke Our Event Pipeline

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We started with a vanilla Veltrix cluster running 22.10.3, the version that ships with the official Helm chart. The marketing team wanted to correlate feature-flags with user behavior, so we routed every flag toggle through the event pipeline. In the first week, the firehose was 300k EPS—comfortable. By week four, it was 2.8 million EPS, and every 60 seconds the cluster froze for 10 seconds while ClickHouse merged between 35 and 45 GB of new parts. The lag metric MergeTreePartsTTL hit 62 seconds; the dashboard turned solid red. PagerDuty pages averaged 8 per day. We traced the lag to the parts_to_delay_insert setting: it was defaulting to 0, so ClickHouse kept trying to insert while merging. No one in the docs mentioned that at 3 million EPS you needed to raise that to 30 seconds to give merges breathing room. We had simply copy-pasted the chart values.

What We Tried First (And Why It Failed)

Our first attempt was to throw hardware at it. We doubled the replica count and moved from NVMe to Optane. The freeze duration dropped to 6 seconds, but the lag still peaked at 45 seconds during peak hours. Then we tried increasing background_pool_size from the default 16 to 64, hoping more background workers would finish merges faster. Result: CPU steal skyrocketed from 12 % to 38 % because 64 workers were hammering the same SSD controllers. The ClickHouse error log filled with Could not allocate block, not enough memory and Too many open files from the increased max_thread_pool_size. At that point we realized the problem wasnt CPU or disk IOPS—it was configuration arbitrage. Every knob we turned simply moved the bottleneck to another resource without fixing the fundamental impedance mismatch between the event rate and the merge strategy.

The Architecture Decision

We stopped tweaking hardware and started redesigning the configuration model. We split the event pipeline into two ClickHouse clusters:

Hot cluster: 3 nodes, ReplicatedMergeTree, parts_to_delay_insert = 30, background_pool_size = 8, 3-day retention, replicated to SSD. This cluster handled raw ingestion and kept two hours of buffer so merges could catch up without hitting the wall.
Warm cluster: 5 nodes, MergeTree, monthly partition strategy, 180-day retention, replicated to HDD. This cluster ran heavy aggregations and cost us $0.02 per GB scanned instead of $0.12.

The critical change was in the schema itself: we moved the event_time column out of the primary key and into a separate low-cardinality UserDefinedFunction column that ClickHouse materializes as a projection. That cut the insert path length by 40 % and eliminated the monster merges that were creating 45 GB parts. We also introduced a configuration registry in etcd that watches for schema changes and automatically adjusts parts_to_delay_insert based on the current EPS. At 1 million EPS it sets 20 seconds; at 4 million EPS it jumps to 50 seconds. The registry is versioned so we can roll back the configuration in under thirty seconds if we introduce a regression.

What The Numbers Said After

After the new configuration went live, the 60-second lag spikes disappeared. The 99th percentile event ingest latency dropped from 140 ms to 28 ms. Merge durations stabilized between 1.2 and 2.3 seconds instead of 6–10 seconds. Our ClickHouse CPU utilization fell from 78 % to 42 %, and the disk IOPS dropped from 32k to 11k during peak hours. The replica lag metric ReplicatedFetchLag never exceeded 0.8 seconds. PagerDuty went from 8 pages/day to 0.4 pages/day. Cost per million events dropped from $0.32 to $0.18 because we stopped over-provisioning hardware. We also discovered that the projections feature, which the Veltrix docs call optional, saved us $3k/month in storage by eliminating duplicate data.

What I Would Do Differently

I would not have let the configuration be an afterthought. The first mistake was treating the Helm chart values as gospel. The second mistake was assuming that ClickHouse auto-tuning would scale. I should have built a chaos test that replayed a month of production events against a staging cluster sized at 10 % of production before the first load test. The chaos test would have immediately flagged the parts_to_delay_insert problem at 1.5 million EPS and saved us two weeks of firefighting. The third thing I missed was the impedance mismatch between ingestion bursts and merge windows. We should have modeled the merge queue as a bounded buffer and added a circuit breaker in the collector that throttles events when the queue depth exceeds 200k parts. That circuit breaker would have prevented the 45 GB part avalanche before it started.

The lesson is simple: configuration is architecture. The docs wont tell you how to set parts_to_delay_insert at 4 million EPS because they werent written by people who had to stare at a 60-second lag spike at 3 a.m. If you treat the docs as a starting point instead of an endpoint, youll end up rewriting the configuration anyway—just with pager duty in the middle.