When the Event Sink Blew Up at 03:17 AM and Other Lessons in Veltrix Configuration

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In late 2024 we rolled Veltrix out to a cluster handling real-time fraud detection events. The default max_events_per_second in the Helm chart was 10 000, and the Prometheus adapter reported veltrix_events_sink_latency_bucket{le="+Inf"} at 2 100 ms. That spike wasn't a one-off; every memory pressure report showed the Go runtime GC pause climbing to 900 ms when the ingress pipeline pushed past 15 k EPS.

The real problem wasn't throughput; it was the shape of the load. Fraud events arrive in clumps after market-close tweets. At 15 k EPS the 99th percentile latency of the event sink crossed 5 s, and the Cassandra batch writer was already throwing com.datastax.oss.driver.api.core.AllNodesFailedException: No host available because its local read timeout was still set to the Debian default of 12 s. We were optimising for average throughput while the p99 latency killed user-visible fraud checks.

What We Tried First (And Why It Failed)

First we raised max_events_per_second in the Istio VirtualService to 30 000. That only made GC pauses worse; the worker pool goroutines ballooned from 4 000 to 12 000 and the heap jumped from 800 MiB to 3.2 GiB. The latency graph went vertical.

Next we tried sharding the event stream by merchant ID using a consistent-hash modulo on the envelope key. The cluster autoscaler spun up six new pods, but the hash ring produced hotspots because the Murmur3 ring we borrowed from the client library didn't take locality into account. New Relic's flame graph showed 42 % of CPU time stuck in ring.Lookup() rebalancing, and the 95th percentile latency crept up another 800 ms.
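One way to see why sharding by merchant ID did not help with clumped traffic: a hash spreads keys evenly, not load, so a single busy merchant still lands on a single shard. The sketch below is illustrative only. It assumes FNV-1a from the Go standard library in place of Murmur3, a plain modulo instead of our ring, and made-up merchant IDs and event counts.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a merchant ID to one of n shards.
// FNV-1a stands in for Murmur3 here (an assumption for this sketch);
// the skew shown below does not depend on which hash is used.
func shardFor(merchantID string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(merchantID))
	return int(h.Sum32() % uint32(n))
}

// loadPerShard sums the events that land on each shard.
func loadPerShard(eventsByMerchant map[string]int, n int) []int {
	load := make([]int, n)
	for merchant, events := range eventsByMerchant {
		load[shardFor(merchant, n)] += events
	}
	return load
}

func main() {
	// Hypothetical clumped traffic: one merchant dominates the burst.
	traffic := map[string]int{"merchant-hot": 9000}
	for i := 0; i < 100; i++ {
		traffic[fmt.Sprintf("merchant-%03d", i)] = 10
	}

	load := loadPerShard(traffic, 6)
	fmt.Println("events per shard:", load)
	fmt.Println("shard holding the hot merchant:", shardFor("merchant-hot", 6))
}
```

Adding shards changes which pod gets the hot merchant, not how much of the burst that pod has to absorb.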

Then we replaced Cassandra with ScyllaDB in the hope that the C++ runtime would fare better under memory pressure. Scylla's scylla_reactor_utilization metric spiked to 0.95 within minutes, and the commit log started throwing WriteTimeoutException: Operation timed out. Same latency problem, different driver.

The Architecture Decision

We scrapped the throughput dial and instead tuned three knobs:

Back-pressure admission control.
We replaced Istio's simple rate limit with the Envoy Local Rate Limit filter, configured with a token bucket of 12 000 EPS and a burst of 6 000. The filter's envoy_local_rate_limit.rate_limited counter stayed flat at 0.08 % even during the 16:04 market close spike. Memory utilisation dropped from 3.2 GiB back to 850 MiB.
GC-friendly batching.
We switched the Go worker from individual WriteBatch calls to collecting 200 events per flush, capped at 5 ms or 2 MiB. The GC pause histogram moved from runtime.GCPauseSum:900ms to runtime.GCMaxPauseNs:45ms. We set GOMEMLIMIT=1024MiB explicitly so the runtime could cap itself instead of relying on the container memory limit.
Read-side timeouts.
We lowered the Scylla client's default_idle_timeout from 12 s to 2 s and added a 30 ms local read timeout in the batch handler. The scylla_client_request_latency_ms{p99} metric dropped from 5 200 ms to 180 ms.

The new admission policy lives in the Envoy sidecar of the Veltrix ingress deployment; the batching and GC limits live in the Go worker's main.go with a single //go:embed line pulling in a static JSON config:

{
 "max_flush_events": 200,
 "max_flush_bytes": 2097152,
 "flush_interval_ms": 5,
 "go_mem_limit_mb": 1024
}

We locked the policy to the worker binary so an operator can change it only by rolling a new pod: no runtime flags, no config-server dynamism, no surprises at 03:17 AM.

What The Numbers Said After

After three weeks in production the p99 latency of the fraud check callback stayed below 350 ms even when the ingress pushed 28 k EPS for five minutes straight. The Scylla cluster's scylla_sstables_total grew by 12 M files, but read latency stayed flat because we throttled compaction throughput to 160 MiB/s.

Memory utilisation per pod stayed at 720 MiB ± 40 MiB, well inside the 1 GiB Kubernetes limit. The Envoy rate-limit counter never exceeded 0.12 % during the Black Friday weekend traffic spike.

The GC pause percentile moved to 12 ms, and the Scylla coordinator latency histogram moved to 35 ms p99. We kept the original Cassandra SSTable schema because the new batching frequency matched its 16 KiB block size; no data model migration was required.

What I Would Do Differently

I would not have let the initial throughput dial lure us into premature sharding. Murmur3's consistent-hash ring works fine in the client library, but it is not the right tool for hotspot mitigation in an ingress pipeline. A token-bucket back-pressure policy is cheaper and more predictable than a hash ring shard.

I would also avoid Scylla for this particular workload. While its C++ runtime is more memory-efficient under steady load, the commit log's WriteTimeoutException pattern appears whenever the reactor utilisation crosses 0.90 for more than 30 s. Cassandra's Java heap and off-heap memtables handled the same pattern without the hard timeout; the GC pauses were simply easier to tune.

Finally, I would enforce the admission control and batching limits in the binary's static config from day one; letting operators tune them at runtime invites tail latency explosions. We learned that lesson when an SRE accidentally set max_flush_events to 5 000 during a