The Moment Veltrix Blew Up and We Had to Write Our Own Shard Router

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The Treasure Hunt Engine isnt a search engine; its a live, multiplayer game where 100,000 players simultaneously dig through 5 TB of LZ4-compressed JSON blobs to find hidden keys. Each game room is a shard, and each shard must route writes to the correct player within 500 ms p99. Our SLA was written around the assumption that Veltrix would evenly distribute these shards across 40 nodes. What the docs didnt tell us was that Veltrix uses a modulo-based shard key hash, which collides when the shard count exceeds 32,768 (2^15). At 40 nodes we were at 65,536 virtual buckets, so every 4th request was hitting the same bucket, overloading node 7. The heap profile from Valgrind showed 1.2 million active TCP connections sitting in TIME_WAIT on that node because the backlog queue was 90,000 deep.

What We Tried First (And Why It Failed)

We started with Veltrixs cloud-init template that spins up 40 pods in Kubernetes. The operator guide said to set shard_count = 128 and replication_factor = 3, so we did. After the first load test with k6 we got p99 latency of 2.1 seconds, but the memory profile from heapster showed RSS on node 7 at 14.7 GB while node 16 was at 3.2 GB. Veltrixs JVM heap sizing guide recommended -Xmx8G, but node 7 was swapping because the off-heap caches were leaking. We tried tuning the concurrency thread pool from 200 to 800, but the lock contention in the ShardManager class showed up as 42% system CPU in perf stat. The metrics from Prometheus confirmed that the gossip delay between nodes was 750 ms instead of the expected 150 ms because the heartbeat interval was hard-coded at 200 ms and the packet loss between AZs was 1.8%.

We also tried the Veltrix CLI command veltrix-admin rebalance --force, which claimed to redistribute shards. The command ran for 11 minutes, during which the cluster received 4.2 million mutations. The rebalance process itself created a 12 GB snapshot on every node, and the disk I/O wait pushed p99 latency to 4.8 seconds. After the rebalance finished, the shard distribution was still uneven because the algorithm didnt account for the fact that our shard key was a 64-bit UUID with only 4 bits of entropy in the first byte.

The Architecture Decision

We decided to abandon the Veltrix virtual shard router and move to a custom two-layer system. Layer one uses a Rust ShardRouter that implements jump consistent hashing with a 64-bit fingerprint. We chose jump consistent hashing because with 40 nodes and 5 million shards the distribution variance is less than 0.001% compared to the Veltrix modulo approach, which had 12% variance at 65,536 buckets. The router is compiled as a cdylib and loaded into the Tokio runtime. Layer two uses a compacted B-tree index stored in Memmap with a custom allocator (mimalloc-rs) to keep RSS under 4 GB per node. We removed the JVM entirely because the GC pauses were causing 90 ms jitter at 2.3 million QPS. The new router handles 6 million QPS per node with p99 latency of 120 ms and tail latency at 280 ms.

We also replaced the gossip protocol with a hybrid approach: local shard ownership is gossiped via a 10-byte heartbeat every 100 ms, while global cluster state is pushed from a single leader elected by Raft. The leader is elected using etcds Raft implementation, which we chose because it supports joint consensus and can handle the 10 MB cluster state size without compaction pauses. The raft logs are stored on NVMe SSD with a custom io_uring poller to keep I/O wait below 0.5 ms per write.

What The Numbers Said After

After the migration we ran a 24-hour burn-in with 3 million concurrent users. The metrics from prometheus showed:

p50 latency: 78 ms
p95 latency: 130 ms
p99 latency: 175 ms
p99.9 latency: 280 ms
RSS per node: 3.6 GB ± 120 MB
Heap allocations: 48 MB per second
GC pauses (removed): N/A
Disk I/O wait: 0.4 ms per write

The jump consistent hashing reduced the shard imbalance from 12% to 0.0004%, eliminating the hotspot on node 7. The mimalloc-rs allocator reduced fragmentation by 43% compared to the system allocator. The io_uring poller cut the raft write latency from 12 ms to 0.8 ms. The only regression was a 15% increase in CPU usage because the Rust router does more work per packet than the JVM bytecode interpreter, but we traded CPU for determinism and tail latency.

What I Would Do Differently

I would not have trusted Veltrixs modulo shard key advice for a system that grows beyond 32 nodes. The virtual bucket collision is not obvious in the docs; its buried in