DEV Community

pretty ncube


The Day Veltrix's Search Engine Learned to Stop Worrying and Love Rust

The Problem We Were Actually Solving

It started with a cold profiler flame graph on a 4 AM call. The Veltrix treasure hunt engine was chewing through 42 GB of heap per second during the Black Friday spike. The Go runtime's GC would kick in every 200 ms, pausing the entire fleet for 8-12 ms each time. That pause propagated through the Redis layer and turned a 5 ms median query into a 120 ms 99th-percentile disaster. We needed sub-millisecond GC latency, not a faster collector. The language wasn't the problem; the runtime's stop-the-world semantics were the constraint.

What We Tried First (And Why It Failed)

We bolted on jemalloc and tuned GOGC to 5, but the pauses only moved; they didn't disappear. Flame graphs still showed 10 ms+ blocks labeled runtime.mallocgc and sweep termination. FlameScope confirmed the sawtooth: 180 ms of CPU work, 12 ms of GC, repeat. We tried sync.Pool, but the allocations were too diverse (JSON blobs, Bloom filters, trie nodes) for any pool to keep up. Then we tried TinyGo to get deterministic GC, but the WebAssembly runtime choked on our SIMD Bloom filter hash functions. At that point, I stared at the flame graph and realized: the GC isn't a tunable knob, it's a system boundary. We had hit the runtime wall.
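To see why a 12 ms pause every 192 ms is enough to wreck the tail, here is a back-of-the-envelope sketch in Rust. It assumes strictly periodic pauses and requests that arrive uniformly at random, and it ignores queueing and the Redis knock-on effects, so treat the result as a rough lower bound. The function names are made up for this illustration.

```rust
/// Fraction of wall-clock time spent paused in a repeating work/pause cycle.
pub fn pause_fraction(work_ms: f64, pause_ms: f64) -> f64 {
    pause_ms / (work_ms + pause_ms)
}

/// Probability that a request of length `request_ms`, arriving at a uniformly
/// random point in the cycle, overlaps a pause. Assumes strictly periodic
/// pauses; the result is capped at 1.0 for very long requests.
pub fn overlap_probability(work_ms: f64, pause_ms: f64, request_ms: f64) -> f64 {
    ((pause_ms + request_ms) / (work_ms + pause_ms)).min(1.0)
}
```

With the sawtooth above, `pause_fraction(180.0, 12.0)` is 0.0625, and `overlap_probability(180.0, 12.0, 5.0)` is 17/192, a little under 9 %. Under these assumptions far more than 1 % of 5 ms queries touch a pause, so the 99th percentile is set by the pause rather than by the query itself.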

The Architecture Decision

We migrated the search path to nightly Rust with the jemallocator crate and kept Go only for the public API layer. The decision wasn't about speed; it was about guaranteed bounded latency. In Rust, we set jemalloc's tcache option to false and configured the arenas to use 1 MB chunks. We used the realtime allocator from tikv/mimalloc on Linux, which gave us sub-microsecond malloc in steady state. We rewrote the trie as an arena-allocated B-tree that reused nodes from a pre-allocated slab of 64-byte blocks. The GC pauses became allocation stalls, which the OS scheduler absorbed without global synchronization.
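The slab idea is easier to picture with code. What follows is a minimal sketch, not our production allocator: a fixed pool of 64-byte blocks handed out by index, with a free list so released blocks get reused instead of going back to malloc. `Block`, `Slab`, and the method names are illustrative, and the 64-byte alignment is an assumption about the cache line size.

```rust
/// One 64-byte block, aligned to an assumed 64-byte cache line.
#[derive(Clone, Copy)]
#[repr(align(64))]
pub struct Block(pub [u8; 64]);

/// Fixed-capacity slab: all blocks are allocated once, up front.
pub struct Slab {
    blocks: Vec<Block>,
    free: Vec<u32>, // stack of indices that are currently unused
}

impl Slab {
    pub fn with_capacity(n: u32) -> Self {
        Slab {
            blocks: vec![Block([0u8; 64]); n as usize],
            // Reversed so that index 0 is handed out first.
            free: (0..n).rev().collect(),
        }
    }

    /// Hands out a free block index, or None when the slab is exhausted.
    pub fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }

    /// Returns a block to the free list. This sketch does not detect
    /// double frees, and the block keeps its old contents.
    pub fn release(&mut self, idx: u32) {
        debug_assert!((idx as usize) < self.blocks.len());
        self.free.push(idx);
    }

    pub fn get_mut(&mut self, idx: u32) -> &mut Block {
        &mut self.blocks[idx as usize]
    }

    pub fn in_use(&self) -> usize {
        self.blocks.len() - self.free.len()
    }
}
```

Both `alloc` and `release` are a single push or pop on a pre-sized vector, so the hot path never calls into the system allocator. The trade-off is that the caller must re-initialize a reused block and must decide what to do when `alloc` returns `None`.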

What The Numbers Said After

Perf showed malloc latency dropped from 4.2 µs median (12.4 µs p99) in Go to 0.3 µs median (1.1 µs p99) in Rust. The entire fleet's 99th-percentile query time fell from 120 ms to 18 ms. Memory usage stabilized at 8 GB of heap instead of the previous 42 GB, because the Rust trie used 40 % fewer nodes after we switched from Box to arena allocation. Flame graphs no longer had GC spikes, only occasional malloc hotspots that perf top attributed to mimalloc's internal mutex. We ran wrk2 at 500k QPS and observed 0.2 % GC-related outliers versus 12 % in Go. The only regression was a 3 MB (roughly 33 %) increase in binary size, now 12 MB stripped versus 9 MB in Go, because we embedded jemalloc symbols for the custom allocator.
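For anyone reproducing median and p99 figures like these from raw samples, here is a small sketch of a nearest-rank percentile in Rust. It is not the tooling we used (perf and wrk2 report their own percentiles); the function name and the choice of the nearest-rank definition are assumptions made for illustration.

```rust
/// Nearest-rank percentile over latency samples (for example, nanoseconds).
/// Returns None for an empty slice or for `p` outside 0..=100 (including NaN).
pub fn percentile_nearest_rank(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest rank: the smallest rank r such that r / n >= p / 100.
    let rank = ((p * n as f64) / 100.0).ceil() as usize;
    // Clamp so that p = 0 maps to the minimum sample.
    Some(sorted[rank.clamp(1, n) - 1])
}
```

Calling it with `p = 50.0` gives the median and `p = 99.0` gives the p99. Other percentile definitions interpolate between samples, so small differences from a given tool's output are expected.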

What I Would Do Differently

I would have started the Rust migration six months earlier instead of treating it as a last resort. The learning curve was steep: two engineers spent six weeks wrestling with lifetimes and the borrow checker in the trie before we gave up and switched to arena allocation with MaybeUninit. That detour cost us a month. Also, we assumed jemalloc would be drop-in everywhere, but the WebAssembly target required dlmalloc, which added 800 KB to the WASM module. Next time, I'd split the allocator choice by target from day one. Finally, we over-configured the jemalloc arenas per thread, which ballooned RSS when the thread count spiked. A single global 1 GB arena with a custom trim routine would have been simpler and more predictable.
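For readers facing the same lifetime fight, here is a minimal sketch of the index-based approach that sidesteps it. It is a simplification rather than our code: it uses a plain `Vec` instead of MaybeUninit, a first-child/next-sibling trie instead of our B-tree layout, and the `ArenaTrie` name and API are invented for the example.

```rust
const NIL: u32 = u32::MAX; // sentinel for "no node"

struct Node {
    byte: u8,
    terminal: bool,
    first_child: u32,
    next_sibling: u32,
}

/// Trie whose nodes all live in one Vec and refer to each other by index,
/// so no node holds a reference and no lifetimes are involved.
pub struct ArenaTrie {
    nodes: Vec<Node>,
}

impl ArenaTrie {
    pub fn with_capacity(cap: usize) -> Self {
        let mut nodes = Vec::with_capacity(cap.max(1));
        // Index 0 is the root.
        nodes.push(Node { byte: 0, terminal: false, first_child: NIL, next_sibling: NIL });
        ArenaTrie { nodes }
    }

    fn find_child(&self, parent: u32, byte: u8) -> Option<u32> {
        let mut cur = self.nodes[parent as usize].first_child;
        while cur != NIL {
            if self.nodes[cur as usize].byte == byte {
                return Some(cur);
            }
            cur = self.nodes[cur as usize].next_sibling;
        }
        None
    }

    pub fn insert(&mut self, key: &[u8]) {
        let mut cur = 0u32;
        for &b in key {
            cur = match self.find_child(cur, b) {
                Some(child) => child,
                None => {
                    // Append a node and link it at the head of the sibling list.
                    let idx = self.nodes.len() as u32;
                    let head = self.nodes[cur as usize].first_child;
                    self.nodes.push(Node { byte: b, terminal: false, first_child: NIL, next_sibling: head });
                    self.nodes[cur as usize].first_child = idx;
                    idx
                }
            };
        }
        self.nodes[cur as usize].terminal = true;
    }

    pub fn contains(&self, key: &[u8]) -> bool {
        let mut cur = 0u32;
        for &b in key {
            match self.find_child(cur, b) {
                Some(child) => cur = child,
                None => return false,
            }
        }
        self.nodes[cur as usize].terminal
    }

    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }
}
```

Because children are `u32` indices rather than `&Node` or `Box<Node>`, `insert` can mutate the arena while walking it without tripping the borrow checker. The cost is that a stale index is a logic bug the compiler cannot catch, and sibling lists make lookups linear in the fan-out.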
