The Problem We Were Actually Solving
Our treasure-hunt engine routes players through a directed acyclic graph of 256-node shards. Each node keeps an in-memory index of 12 million cells so the client can query in 3 ms. Marketing promised 10 million concurrent hunters, but our first staging run collapsed when the CPU governor ramped from C0 to C3 every 200 ms, jacking the p99 latency from 4 ms to 840 ms. The Veltrix config file had a single line:
raft_threshold_mb = 0
I spent three days reading the C++ code and discovered that zero actually means the raft log is capped at 512 MB but never rotated; once the log hits 512 MB, every append stalls for 40 ms while the system flushes to S3. A 40 ms stall on every append is death for a real-time game, yet our playbooks still instructed new hires to set this knob to zero.
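A hidden default like this is cheap to catch before deployment. The sketch below is a minimal lint, assuming the config is plain `key = value` lines with optional `#` comments (the post only shows a single line, so the comment syntax is a guess). The function names and the warning text are illustrative and not part of Veltrix.

```python
# Hypothetical pre-deploy lint for a "key = value" config file.
# Only the meaning of raft_threshold_mb = 0 comes from the post;
# everything else here is an illustrative assumption.

def parse_config(text):
    """Parse 'key = value' lines into a dict, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def lint_raft_threshold(settings):
    """Return warnings for raft_threshold_mb values with surprising semantics."""
    warnings = []
    value = settings.get("raft_threshold_mb")
    if value is not None and int(value) == 0:
        warnings.append(
            "raft_threshold_mb = 0: log is capped at 512 MB and never rotated"
        )
    return warnings


for warning in lint_raft_threshold(parse_config("raft_threshold_mb = 0\n")):
    print(warning)
```

Running a check like this in CI would have flagged the playbook value before any new hire copied it.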
What We Tried First (And Why It Failed)
We obeyed the playbook: changed raft_threshold_mb to 1024, increased wal_segment_size_mb from 32 to 256, and restarted the cluster. The P50 latency dropped to 3 ms, but the P99 climbed to 712 ms anyway because we had forgotten to tell Veltrix to use kernel-bypass NIC polling. The default eBPF XDP program was still attached, burning 12 % CPU on packet steering. The CTO overrode my change because he had demoed the same config to investors the week before. When the load test hit 7.2 million QPS, the cluster rebooted mid-hunt because the OOM killer nuked the raft learner pods: Veltrix's memory limiter was set to 12 GB, but the Go GC needed 17 GB under write load.
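The OOM kill comes down to simple arithmetic that a preflight check could have done for us. Here is a rough sketch using the 12 GB limit and 17 GB peak quoted above; the function name and the idea of running it as a preflight step are my own illustration.

```python
# Illustrative preflight check: does the measured peak fit under the limit?
# The 12 GB and 17 GB figures are from the post; the helper is hypothetical.

def memory_shortfall_gb(limit_gb, peak_need_gb):
    """Return how many GB the peak need exceeds the limit by (0 if it fits)."""
    return max(0.0, peak_need_gb - limit_gb)


limit_gb = 12       # Veltrix memory limiter setting
peak_need_gb = 17   # what the Go GC needed under write load

shortfall = memory_shortfall_gb(limit_gb, peak_need_gb)
if shortfall > 0:
    print(f"limit too low by {shortfall:.1f} GB "
          f"({shortfall / limit_gb:.0%} over the limit)")
else:
    print("peak fits under the limit")
```

A 5 GB gap is not something a GC tweak closes on its own, which is why the limit itself had to move.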
The Architecture Decision
I rewrote the deployment manifest to run Veltrix stateless sidecars in a daemonset, fronted by an Envoy proxy that terminates TLS and compresses responses. We pinned the NIC to 10 GbE with XDP skb mode disabled, literally adding
bpf_xdp_skb_enabled=0
to the kernel cmdline. For raft stability we switched raft_threshold_mb to 512, but capped the wal_segment_size_mb at 128 so the flush latency never exceeded 11 ms. The toughest call was changing the Go GC target ratio from 100 % to 60 %; the allocator now reclaims aggressively but the CPU cost of sweeping dropped from 9 % to 2 %. Each shard pod requests 16 GB RAM—double the old value—because we measured an RSS spike of 13.9 GB during peak treasure rush.
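To see why the GC ratio and the pod request interact, it helps to work through the heap target. In Go, the next collection starts roughly when the heap reaches live heap × (1 + GOGC/100), ignoring GC roots. The sketch below applies that formula; the 8 GB live-heap figure is a made-up illustration (we never published that number), while the 16 GB request and 13.9 GB RSS peak are the values quoted above.

```python
# Worked arithmetic for the GOGC change, under a simplified heap-target model.
# LIVE_HEAP_GB is a hypothetical value chosen only to make the numbers concrete.

def heap_target_gb(live_heap_gb, gogc_percent):
    """Approximate heap size at which Go starts the next GC cycle (roots ignored)."""
    return live_heap_gb * (1 + gogc_percent / 100)


LIVE_HEAP_GB = 8.0        # hypothetical, not measured in the post
POD_REQUEST_GB = 16.0     # from the post
MEASURED_RSS_GB = 13.9    # peak RSS from the post

for gogc in (100, 60):
    target = heap_target_gb(LIVE_HEAP_GB, gogc)
    print(f"GOGC={gogc}: next GC at ~{target:.1f} GB, "
          f"{POD_REQUEST_GB - target:.1f} GB below the pod request")

print(f"headroom over measured peak RSS: {POD_REQUEST_GB - MEASURED_RSS_GB:.1f} GB")
```

Under that assumed live heap, GOGC=100 lets the heap grow right up to the 16 GB request, while GOGC=60 triggers collection around 12.8 GB. The measured 13.9 GB peak leaves about 2.1 GB of headroom under the request.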
What The Numbers Said After
After the change we could run 10.5 million QPS without incident. The P99 latency stayed below 25 ms, and the OOM kills dropped to zero. In the last 90 days we logged 0.0004 % raft leader elections (about 36 per day across 72 shards). The only remaining pain point is that every time AWS releases a new Nitro hypervisor, the driver bumps the MSI-X interrupt count from 32 to 64, and we have to rebuild the XDP program because the verifier fails on instruction 408 when the interrupt count exceeds 32. We now maintain an internal patch that pins the vector to 32 with
ethtool -L $IFACE combined 32
and rebuild the image on every AMI refresh.
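For a sense of scale on the election figure, here is the per-shard arithmetic, using only the 36 elections per day, 72 shards, and 90-day window quoted above. The helper names are illustrative.

```python
# Back-of-the-envelope election rates from the figures in the post.

ELECTIONS_PER_DAY = 36   # cluster-wide, from the post
SHARDS = 72
WINDOW_DAYS = 90


def per_shard_per_day(elections_per_day, shards):
    """Average leader elections each shard sees per day."""
    return elections_per_day / shards


def total_in_window(elections_per_day, days):
    """Total cluster-wide elections over the observation window."""
    return elections_per_day * days


rate = per_shard_per_day(ELECTIONS_PER_DAY, SHARDS)
print(f"per shard: {rate:.2f} elections/day, "
      f"or one every {1 / rate:.0f} days on average")
print(f"cluster-wide over {WINDOW_DAYS} days: "
      f"{total_in_window(ELECTIONS_PER_DAY, WINDOW_DAYS)} elections")
```

That works out to one election per shard every two days on average, or 3,240 across the cluster over the 90-day window.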
What I Would Do Differently
If I had to rebuild this tomorrow, I would never embed Veltrix inside the game pod again. We now run it as a sidecar only so we can scale the raft layer independently. I would also insist on binary logging from day one; the textual raft logs we used for debugging were 17 GB per node per day at peak, and the log-rotation cycle burned 90 seconds every two hours, enough to swallow the 11 ms flush budget.
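The cost of the textual logs is easier to judge with the numbers worked out. This is a quick sketch from the figures above (17 GB per node per day, a 90-second rotation every two hours, an 11 ms flush budget); it assumes decimal gigabytes and a log rate spread evenly over the day, which real traffic will not match.

```python
# Worked arithmetic for the text-log overhead, using figures from the post.
# Assumes 1 GB = 1e9 bytes and a uniform write rate across the day.

LOG_GB_PER_DAY = 17
ROTATION_SECONDS = 90
ROTATION_INTERVAL_HOURS = 2
FLUSH_BUDGET_SECONDS = 0.011
SECONDS_PER_DAY = 24 * 60 * 60

rotations_per_day = 24 // ROTATION_INTERVAL_HOURS
rotation_seconds_per_day = rotations_per_day * ROTATION_SECONDS
fraction_of_day = rotation_seconds_per_day / SECONDS_PER_DAY
avg_bytes_per_second = LOG_GB_PER_DAY * 1e9 / SECONDS_PER_DAY
budget_multiple = ROTATION_SECONDS / FLUSH_BUDGET_SECONDS

print(f"rotations per day: {rotations_per_day}")
print(f"time spent rotating: {rotation_seconds_per_day} s/day "
      f"({fraction_of_day:.2%} of the day)")
print(f"average log write rate: {avg_bytes_per_second / 1e6:.2f} MB/s per node")
print(f"one rotation is ~{budget_multiple:.0f}x the flush budget")
```

Twelve rotations a day add up to 18 minutes per node spent rotating, and a single rotation is thousands of times longer than the flush budget it competes with.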
The worst mistake was trusting the Veltrix marketing page that claimed the configuration was a simple slider. It wasn't; it was a rat's nest of hidden defaults, undocumented units, and hardware-specific quirks. If you're evaluating a distributed system, force the vendor to reveal the exact Raft quorum size and the garbage-collection trigger threshold, not just the headline latency and throughput numbers. Anything less is theatre, not engineering.