When we flipped the switch on the Veltrix feature store at Black-Friday scale 2025, the cluster ground to a halt after 1.2 M requests. Not because of CPU or network—we had deliberately over-provisioned those. The blocker lived inside the vx-meta cache, a sharded in-memory store that materialises every index, bloom filter and tombstone counter the planner needs to route a feature request in < 5 ms. The docs mention vx-meta in two paragraphs; they never say that once its RSS crosses 42 GB on a single pod, Gos GC pauses jump to 290 ms and p99 latency slides from 3 ms to 320 ms. We learned that the hard way.
We started with a simple off-heap RocksDB tier (v6.27) sitting behind gRPC, hoping the disk latency would be acceptable because our in-memory cache (ristretto v0.12) was supposed to absorb 99 % of reads. Under synthetic load with 5 M keys we hit 1.8 M TPS on bare metal, p99 1.9 ms—perfect. At 25 M keys the RocksDB read amplification spiked because every compaction read the same 72 MB sstables that held the feature metadata hashes. The engine printed rocksdb-bg0-10801] Level 0 file 783 had 16229 entries in the logs and the compaction queue never emptied. We tried increasing the block-cache to 1 GB and setting compaction_style=kCompactionStyleLevel, but the write stall 5 appeared and RocksDB refused new writes until compaction finished, which took 47 minutes. Our SLO of 10 ms p99 was dead.
We ripped out RocksDB and replaced vx-meta with a custom sharded hash map built on two ideas: (1) mmap each shard so the kernel page cache absorbs most reads, and (2) pin every dirty page with mlock so compaction never touches disk. The shard count was 128—chosen because that gave us 32 MB per shard on start-up and still fitted the L3 cache of an AMD 7763. We measured RSS of the locked set with smem -P vx-meta; it stayed flat at 38 GB at 200 M keys, and the kernels dirty_ratio never climbed above 12 %. The gRPC handler added a new header X-Vx-Meta-Shard so the client could pin the shard locality and avoid cross-node traffic. Error budget went from 8 % drift to 0.4 % p99, and the GC pauses on the feature-store pods dropped from 290 ms to 12 ms because the RSS now fit inside transparent huge pages. The only surprise was that the shard rebalancer, which runs every 60 s, had to use sched_setaffinity to avoid NUMA migrations; without it the rebalancing threads wandered between cores and the shard lock latency rose from 18 µs to 900 µs.
After the change the cluster handled 6.5 M TPS at 95 % CPU on c6i.32xlarge nodes without touching swap, and the p99 latency on feature reads stayed at 2.8 ms even when we flushed the entire 380 GB dataset into the page cache. Prometheus metrics vx_meta_shard_gc_pause_ms and vx_meta_dirty_ratio became part of our nightly canary, not because we like graphing, but because we never again want to stare at a 47-minute compaction stall during prime time.
I would not trade the shard design, but we over-allocated mlock memory by 22 % because we sized the hash table with a 1.3 load factor and did not account for the 512-byte alignment the allocator used. Next time Ill cap the RSS at 90 % of physical RAM and let the kernel swap dirty pages instead of keeping everything mlocked. Also, the rebalancers NUMA pinning is brittle; we should move it into the vx-meta process itself using cgroups v2 cpuset controller so the kernel never migrates the critical path threads.
Top comments (0)