When Veltrix Burst at 10k QPS and We Pretended It Was Configuration's Fault

#webdev #programming #architecture #systems


We ran a 10-day global treasure hunt in March 2025. Our engine needed to dispatch 2 million live events, track 870k concurrent user positions, and still let users claim real physical prizes without blocking the queue. By day 7 the Redis cluster yellow-flagged, the P99 latency spiked from 22 ms to 1.4 s, and our on-call rotation got paged every 12 minutes for OutOfMemoryError in the state-worker pod. The logs screamed GC pressure and backlogged Kafka consumer lag: ingestion at 38k msg/s, but processing at 4.2k msg/s.
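The gap between those two rates is worth putting into numbers. The following is a back-of-the-envelope sketch, not anything from our tooling: it assumes both rates hold steady, and the function names are made up for illustration.

```python
# Rates quoted above, in messages per second.
INGEST_RATE = 38_000
PROCESS_RATE = 4_200


def backlog_after(seconds: float) -> float:
    """Messages queued after `seconds`, assuming both rates stay constant."""
    return max(0.0, (INGEST_RATE - PROCESS_RATE) * seconds)


def drain_time(backlog: float, process_rate: float, ingest_rate: float) -> float:
    """Seconds needed to clear `backlog`; infinite if processing is not faster than ingestion."""
    surplus = process_rate - ingest_rate
    return float("inf") if surplus <= 0 else backlog / surplus


if __name__ == "__main__":
    # One minute at these rates leaves roughly two million messages unprocessed.
    print(backlog_after(60))
    # At 4.2k msg/s against 38k msg/s the backlog never drains.
    print(drain_time(backlog_after(60), PROCESS_RATE, INGEST_RATE))
```

With a deficit of 33.8k msg/s, adding memory only delays the point at which the backlog wins; the processing rate has to exceed the ingestion rate before any lag can recover.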

We stared at 37 config files scattered across three Kubernetes namespaces and one Ansible repo that had grown from 300 to 1,250 lines between v1.2 and v1.9. No single owner knew why batch-size in the dispatcher was 500 instead of 200, why the event-snapshot cache had maxmemory-policy set to allkeys-lru, or why the workers' HPA target CPU was 60% instead of 80%. The night before launch we tried a brute-force cascade: turn the dispatcher batch-size down to 100, bump Redis maxmemory from 8 GB to 16 GB, double the worker replicas, and redeploy. It worked, for 3 hours. Then the GC pause climbed again and the JetStream stream named treasure-event-stream fell 12 minutes behind real time. We rolled back in the middle of a 2 a.m. user spike and lost 371 live hunts.

The triggering failure wasn't memory or CPU; it was the inconsistency boundary we had accidentally drawn inside the config layer. We had let the dispatcher, the stateless aggregator service, and the geofencing pod share a single Redis key space under the config key prefix veltrix:config. When we turned batch-size down to 100, every pod re-issued an INFO CONFIG command to Redis for every 100 events, creating a thundering herd of 870k INFO calls per second. The Redis INFO command itself became the hotspot, consuming 28% of CPU on the cluster master. One measurement, taken with redis-cli --latency-history -h veltrix-master, showed a 95th-percentile INFO latency of 4.2 ms when idle but 310 ms during the cascade. We had configured the layer wrong: configuration looked local, but it was really a shared, synchronous dependency.
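To make the coupling concrete, here is a minimal sketch of why shrinking the batch size multiplied the load on the shared store. It is not our dispatcher code: CountingStore and both dispatch functions are hypothetical stand-ins, and the "remote read" is just a counter rather than a real Redis call.

```python
class CountingStore:
    """Stand-in for a shared config store; counts every simulated remote read."""

    def __init__(self, values):
        self._values = dict(values)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._values[key]


def dispatch_reading_per_batch(store, total_events):
    """Re-read batch-size from the shared store before every batch."""
    done = 0
    while done < total_events:
        batch_size = store.get("batch-size")
        done += min(batch_size, total_events - done)
    return store.reads


def dispatch_reading_once(store, total_events):
    """Read batch-size a single time and keep it in local memory."""
    batch_size = store.get("batch-size")
    done = 0
    while done < total_events:
        done += min(batch_size, total_events - done)
    return store.reads


if __name__ == "__main__":
    for size in (500, 100):
        store = CountingStore({"batch-size": size})
        print(size, dispatch_reading_per_batch(store, 10_000))
```

In this toy model the number of reads is the event count divided by the batch size, so going from 500 to 100 makes the read rate five times higher for the same traffic. Reading once keeps the count at one regardless of batch size, which is the property a pushed, locally held config gives you.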

We ripped the band-aid off in two days. Instead of a shared Redis key space we created a new service we called ConfigEdge. It runs a 3-node Raft cluster with a raft-log of 4 MB compressed and serves configuration via gRPC. Each of the three compute pods (dispatcher, aggregator, geofencer) opens a single streaming gRPC call at startup and receives deltas pushed by ConfigEdge whenever a file in the GitOps repo changes. The delta payload is serialized with Protobuf schema version v1.10 and compressed with zstd level 5. We cut the Redis INFO calls from 870k per second to zero and replaced them with a single long-lived stream per pod. We also moved the event-snapshot cache out of a shared Redis set and into a per-pod LRU shard capped at 512 MB with a 30 s TTL. We redeployed on the Friday before the second launch at 4 a.m. and watched the latency graph: P99 dropped from 1.4 s to 42 ms, GC pauses fell from 120 ms to under 8 ms, and the Kafka consumer lag shrank from 12 minutes to 18 seconds.
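The per-pod cache is the easier half to illustrate. What follows is a minimal sketch of an in-process LRU cache with a byte cap and a TTL, assuming values are stored as bytes; the class name, method names, and injectable clock are illustrative and not taken from our codebase.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """In-process LRU cache bounded by total value size, with a per-entry TTL."""

    def __init__(self, max_bytes, ttl_seconds, clock=time.monotonic):
        self._max_bytes = max_bytes
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = OrderedDict()  # key -> (expires_at, value), oldest first
        self._bytes = 0

    def put(self, key, value: bytes):
        if key in self._items:
            self._bytes -= len(self._items.pop(key)[1])
        self._items[key] = (self._clock() + self._ttl, value)
        self._bytes += len(value)
        # Evict least recently used entries until we are back under the cap.
        while self._bytes > self._max_bytes and self._items:
            _, (_, evicted) = self._items.popitem(last=False)
            self._bytes -= len(evicted)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it and report a miss.
            del self._items[key]
            self._bytes -= len(value)
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value


if __name__ == "__main__":
    # Figures from the post: 512 MB per pod, 30 s TTL.
    cache = LruTtlCache(max_bytes=512 * 1024 * 1024, ttl_seconds=30)
    cache.put("snapshot:42", b"example-bytes")
    print(cache.get("snapshot:42"))
```

The trade-off is that each pod may serve a snapshot up to one TTL stale and pods no longer share hits, in exchange for removing a network round trip and a shared hotspot from the read path.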

After the hunt ended we audited the numbers:

ConfigEdge Raft cluster CPU usage: 3.7 % across three nodes before and after.
Dispatcher batch-size set back to 500; no thundering herd.
Redis INFO calls per second: 0 (was 870k).
Latency P99 at 10k QPS: 42 ms (was 1.4 s).
Memory per worker pod: 1.1 GB (was 2.8 GB under GC pressure).
GitOps sync time: median 2.1 s (was 1.8 s but with spikes to 14 s).

We saved the hunt, but we also learned that configuration isn't a file. It's state that needs distribution semantics and consistency guarantees. The failure wasn't Redis; it was assuming that sharing a config key space was safe because the values were small. That assumption cost us 14 hours of on-call pages and a near-miss user blackout.

If I could redo it, I would have isolated the configuration layer on day one with the same rigor we give to user data. We should have started with a design where ConfigEdge itself was versioned and backed by a commit hash—the same SHA that triggered the GitOps sync. Instead we bolted on a Raft cluster in a panic. We also should have enforced a cap on INFO messages: if Redis client INFO is even considered in a design review, the review should fail immediately. Finally, we should have budgeted for a chaos day that deliberately spikes the INFO command to 100k qps so we know the real boundary before traffic does.
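As a sketch of what "backed by a commit hash" could look like on the receiving side: each delta names the commit it was built on and the commit it produces, and a pod applies it only if the base matches what it currently holds. This is an assumed design, not what ConfigEdge does today; Delta, VersionedConfig, and the placeholder SHAs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    """A config change built on top of one commit and producing another."""
    base_sha: str
    sha: str
    set_values: dict = field(default_factory=dict)
    removed_keys: tuple = ()


class VersionedConfig:
    """Local config copy tagged with the commit hash it corresponds to."""

    def __init__(self, sha, values):
        self.sha = sha
        self._values = dict(values)

    def get(self, key, default=None):
        # Reads are local dictionary lookups; no network call is involved.
        return self._values.get(key, default)

    def apply(self, delta):
        """Apply `delta` only if it was built on the commit we hold.

        Returns False on a mismatch (missed or replayed delta), in which
        case the caller should request a full snapshot instead.
        """
        if delta.base_sha != self.sha:
            return False
        for key in delta.removed_keys:
            self._values.pop(key, None)
        self._values.update(delta.set_values)
        self.sha = delta.sha
        return True


if __name__ == "__main__":
    cfg = VersionedConfig("aaa111", {"batch-size": 500})
    print(cfg.apply(Delta("aaa111", "bbb222", {"batch-size": 200})), cfg.sha)
    print(cfg.apply(Delta("aaa111", "ccc333", {"batch-size": 100})), cfg.sha)
```

Because every pod can report the SHA it is running, a question like "which config was the dispatcher on at 2 a.m.?" becomes a lookup instead of an archaeology project.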

