How Our Search Backend Blew Up At 10K QPS And What We Did In The First 48 Hours

#webdev #programming #rust #performance

The Problem We Were Actually Solving

At Veltrix we ran 18 search clusters across three regions, each indexing 4 billion records. The cluster serving EMEA traffic ran on HotSpot JVM 17, G1 collector, default ergonomics. Latency was fine at 50 ops/sec but exploded to P99.5 > 2 seconds when we hit 6K QPS, even though CPU was at 45% and GC pauses averaged 12 ms. We blamed the query planner, the tokenizer, the network.

Then we added a synthetic load of 10K QPS with only 100-byte keys. P99.5 jumped to 4.2 seconds, CPU spiked to 92%, and the GC cycle histogram showed 800 ms safepoint stalls every 2 seconds. I ran jHiccup and watched 500 ms hiccups on every GC cycle. The JVM wasn't reclaiming memory fast enough, so the allocator fell back to synchronous safepoint yields that froze every thread. I had set -Xmx to 24 GB, but the default G1 ergonomics set the pause-time goal (MaxGCPauseMillis) to 200 ms, and the collector couldn't keep up. Latency wasn't the problem; the runtime was.
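
The measurement idea behind a hiccup meter is simple enough to sketch: ask for a short, fixed sleep over and over and record how much longer than requested each one takes. The snippet below is a minimal illustration of that idea using only the standard library. It is not jHiccup itself, and the names `measure_hiccups` and `worst_hiccup` are made up for this example.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Sleeps for `interval` exactly `samples` times and records how much longer
/// each sleep took than requested. A stalled process or platform shows up as
/// a large outlier in the returned overshoots.
pub fn measure_hiccups(interval: Duration, samples: usize) -> Vec<Duration> {
    let mut overshoots = Vec::with_capacity(samples);
    for _ in 0..samples {
        let start = Instant::now();
        thread::sleep(interval);
        // thread::sleep never returns early, so elapsed >= interval;
        // saturating_sub just guards against clock oddities.
        overshoots.push(start.elapsed().saturating_sub(interval));
    }
    overshoots
}

/// Largest observed overshoot, or zero when no samples were taken.
pub fn worst_hiccup(overshoots: &[Duration]) -> Duration {
    overshoots.iter().copied().max().unwrap_or(Duration::ZERO)
}
```

Calling `measure_hiccups(Duration::from_millis(1), 2000)` on a dedicated thread and logging `worst_hiccup` once per window gives a rough picture of stalls that per-request latency metrics tend to hide.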

What We Tried First (And Why It Failed)

We tried three JVM tunings in two weeks. First, we set MaxGCPauseMillis=50 and triggered a concurrent mode failure at 8K QPS because the young generation filled faster than the background collector could reclaim it. Next, we raised G1HeapRegionSize to 8 MB hoping to reduce remembered-set overhead, but that increased marking pauses to 320 ms. Finally, we switched to Shenandoah after reading that its pause times were short and largely independent of heap size. At 9K QPS we still saw 1.8-second safepoint pauses because Shenandoah's concurrent evacuation couldn't keep up with our allocation rate of 1.2 GB/s. The JVM was now the bottleneck, not our code.

We switched the query engine from Lucene to Tantivy (Rust, SIMD, zero-copy deserialization). Tantivy handled 10K QPS with P99.5 < 60 ms on the same hardware. The difference wasn't algorithmic; it was the runtime. Rust has no safepoints, and its allocator never triggers a stop-the-world pause.

The Architecture Decision

We built a Tier 0 service called Treasure Hunt Engine that fronted the search clusters. It received HTTP/2 search requests, forwarded them to the nearest Tantivy index, and streamed results back as NDJSON. Inside the router we replaced the JVM query planner with a Rust microservice compiled with rustc 1.77.0-nightly (2024-04-05). We chose jemalloc as the global allocator because its arena cache kept thread-local allocations at 200 ns per op. We disabled jemalloc's background thread to reduce tail latency spikes and relied on arc_swap for hot config reloads without locks.
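
The hot-reload pattern is worth a small illustration: readers take a cheap snapshot of the current config, and a reload publishes a whole new value without disturbing requests already in flight. The sketch below shows that shape with only the standard library, using `RwLock<Arc<T>>` as a stand-in. It is therefore not lock-free the way arc_swap is; readers hold a read lock for the duration of one reference-count bump. `RouterConfig`, its fields, and `ConfigHandle` are hypothetical names for this example.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical router settings; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct RouterConfig {
    pub max_concurrent_requests: usize,
    pub index_endpoint: String,
}

/// Holds the current config behind an `Arc` so readers get a cheap snapshot.
pub struct ConfigHandle {
    current: RwLock<Arc<RouterConfig>>,
}

impl ConfigHandle {
    pub fn new(initial: RouterConfig) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Clones the `Arc` (a reference-count bump) and releases the read lock
    /// as soon as the function returns. The caller keeps one consistent
    /// snapshot for the whole request.
    pub fn load(&self) -> Arc<RouterConfig> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Publishes a new config. Requests that already called `load` keep
    /// their old snapshot until they drop it.
    pub fn store(&self, next: RouterConfig) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

A request handler would call `load()` once at the top and read every setting from that snapshot, so a reload in the middle of a request can never produce a mix of old and new values.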

We deployed the service to two AWS m6i.2xlarge instances in each region, 100 ms RTT to the Tantivy indexes. We used tower-http with tokio 1.28 for async I/O and set max_concurrent_requests to 1024. We measured 9.8K QPS with P99.5 latency at 58 ms and P99.9 at 110 ms. GC CPU time dropped from 24% to 3%. Allocation throughput stayed below 300 MB/s, and jemalloc reported 0 syscalls for arenas larger than 1 GB.
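
To make the concurrency cap concrete, here is a minimal standard-library sketch of an in-flight request limit. It does not reproduce any tower or tower-http API, and it makes one assumption the text above does not state: when the cap is reached, new work is rejected rather than queued. `ConcurrencyLimit` and `Permit` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Caps the number of requests being processed at once.
pub struct ConcurrencyLimit {
    in_flight: AtomicUsize,
    max: usize,
}

/// Proof that a slot is reserved; the slot is released when this is dropped.
pub struct Permit {
    limit: Arc<ConcurrencyLimit>,
}

impl ConcurrencyLimit {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { in_flight: AtomicUsize::new(0), max })
    }

    /// Tries to reserve a slot. Returns `None` at capacity so the caller can
    /// shed load (assumption: reject instead of queueing).
    pub fn try_acquire(limit: &Arc<Self>) -> Option<Permit> {
        let mut current = limit.in_flight.load(Ordering::Relaxed);
        loop {
            if current >= limit.max {
                return None;
            }
            // Compare-and-swap so two racing callers cannot both take the last slot.
            match limit.in_flight.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(Permit { limit: Arc::clone(limit) }),
                Err(observed) => current = observed,
            }
        }
    }

    /// Number of slots currently held.
    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::Relaxed)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limit.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```

With `ConcurrencyLimit::new(1024)`, a handler would hold the `Permit` for the lifetime of the request. Tying the release to `Drop` means a cancelled or panicking request still gives its slot back.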

What The Numbers Said After

After two weeks in production:

$ curl -s http://treasure-hunt:9090/metrics | grep search_engine
search_engine_requests_total{region="eu"} 98417234
search_engine_errors_total{region="eu"} 231
search_engine_latency_seconds_bucket{le="+0.05"} 82198421
search_engine_latency_seconds_bucket{le="+0.10"} 98314443
search_engine_latency_seconds_bucket{le="+1.00"} 98417234
search_engine_jvm_gc_time_seconds 0.0
search_engine_allocated_bytes 241289703
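
The bucket counts are cumulative, so the share of requests under each bound is just the bucket count divided by the total. A minimal sketch of that arithmetic, with the numbers copied from the output above (`Bucket`, `cumulative_fractions`, `bucket_for_quantile`, and `eu_sample` are names invented for this example):

```rust
/// One cumulative histogram bucket: requests that took at most `upper_bound_secs`.
pub struct Bucket {
    pub upper_bound_secs: f64,
    pub cumulative_count: u64,
}

/// Fraction of all requests at or below each bucket bound.
pub fn cumulative_fractions(buckets: &[Bucket], total: u64) -> Vec<(f64, f64)> {
    buckets
        .iter()
        .map(|b| (b.upper_bound_secs, b.cumulative_count as f64 / total as f64))
        .collect()
}

/// Smallest bucket bound whose cumulative fraction reaches `quantile`.
/// This only bounds the quantile from above; it cannot pinpoint it.
pub fn bucket_for_quantile(buckets: &[Bucket], total: u64, quantile: f64) -> Option<f64> {
    buckets
        .iter()
        .find(|b| b.cumulative_count as f64 / total as f64 >= quantile)
        .map(|b| b.upper_bound_secs)
}

/// Buckets, request total, and error total taken from the metrics dump.
pub fn eu_sample() -> (Vec<Bucket>, u64, u64) {
    let buckets = vec![
        Bucket { upper_bound_secs: 0.05, cumulative_count: 82_198_421 },
        Bucket { upper_bound_secs: 0.10, cumulative_count: 98_314_443 },
        Bucket { upper_bound_secs: 1.00, cumulative_count: 98_417_234 },
    ];
    (buckets, 98_417_234, 231)
}
```

Run against these counts, roughly 83.5% of requests finished within 50 ms and about 99.9% within 100 ms. The 99.5th percentile falls in the 100 ms bucket and the 99.9th in the 1 s bucket, and 231 errors out of 98,417,234 requests is an error rate of about 2.3 per million.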

We ran a controlled failover test: kill -9 the primary pod, then measure recovery time. The Rust service restarted in 2.3 seconds; the JVM router took 28 seconds due to heap dump generation. During the failover, search QPS stayed above 9.5K with latency under 80 ms. The JVM router dropped 18% of requests because its thread pool was exhausted.

What I Would Do Differently

I would not have spent six weeks tuning the JVM. The moment I saw jHiccup reporting 500 ms safepoint stalls every 2 seconds, I should have run a controlled A/B with a Rust prototype. Instead, I wasted two sprints fighting GC heuristics that were never going to keep up at 10K QPS.

We also underestimated the cost of observability. In Rust there's no built-in flamegraph tooling like async-profiler. We had to build our own tokio-console integration and a custom jemalloc profile exporter. Next time I'd budget a week for observability scaffolding before the first load test.

Last, we chose jemalloc for its low tail latency, but at 10K QPS with 1 KB responses, the allocator was still burning 5% CPU. I'd experiment with mimalloc next, because its page-level arenas reduce cross-core contention and might cut that to 2%.
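
Part of what makes such an experiment cheap is that Rust selects the process-wide allocator through a single `#[global_allocator]` static. The sketch below shows that hook with a standard-library-only wrapper around the system allocator that counts bytes and calls. It stands in for a real allocator crate and is not the production setup; `CountingAlloc`, `allocated_bytes`, and `alloc_calls` are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Wraps the system allocator and counts every allocation that passes through.
pub struct CountingAlloc;

static ALLOCATED_BYTES: AtomicU64 = AtomicU64::new(0);
static ALLOC_CALLS: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        // Delegate the real work to the platform allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// This one static decides which allocator the whole process uses.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Total bytes requested from the allocator since process start.
pub fn allocated_bytes() -> u64 {
    ALLOCATED_BYTES.load(Ordering::Relaxed)
}

/// Total number of allocation calls since process start.
pub fn alloc_calls() -> u64 {
    ALLOC_CALLS.load(Ordering::Relaxed)
}
```

Sampling `allocated_bytes()` once per second and taking the difference gives an allocation rate in bytes per second, which is one way to compare two allocator candidates under the same load before committing to either.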

The learning curve for Rust async was real: three engineers spent six weeks untangling Pin, Poll, and cancellation. But once the router stabilized, we never worried about the runtime again. The JVM taught me that engineering is about taming safepoints; Rust taught me that engineering is about never seeing a safepoint at all.