The Moment the Runtime Became the Bottleneck in the Veltrix Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We were running a real-time geospatial treasure hunt engine on Veltrix v3.2. The system processed millions of location updates per second across 12 regions. Users were dropping connections because P99 latency for /search/nearby had jumped from 12ms to 289ms in 48 hours.

The documentation said: Use the Veltrix Search Daemon with 4 worker threads per shard. Set max_concurrent_searches = 64. Tune the JVM with -Xmx8g -XX:+UseG1GC. But our heap dumps showed 28% of objects were unreachable CharSequences from string interning, and the GC pause times were clustering at 400ms every 2.1s.

We needed sub-50ms P99, not 289ms.
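For readers who want to check a P99 target against raw samples, here is a minimal sketch of a nearest-rank percentile. The function name percentile_ms and the use of whole-millisecond samples are assumptions for illustration; this is not how Veltrix or our dashboard computes its histogram.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `pct` is a whole-number percentile in 1..=100 (99 for P99).
/// Returns `None` for an empty slice or an out-of-range percentile.
/// Hypothetical helper for illustration only.
fn percentile_ms(samples: &[u64], pct: u64) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: rank = ceil(pct / 100 * n), computed in integers
    // to avoid floating-point rounding at the boundary.
    let n = sorted.len() as u64;
    let rank = (pct * n + 99) / 100;
    // rank is in 1..=n here, so the index is always in bounds.
    Some(sorted[(rank - 1) as usize])
}
```

Calling percentile_ms(&samples, 99) on a window of request timings gives the value that 99% of requests stayed at or under, which is the number we compared against the 50ms target.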

What We Tried First (And Why It Failed)

We started with a pure-Java rewrite of the segment merge logic. We used Java 17, ZGC, and virtual threads. We cut the latency to 78ms P99, but arithmetic overflow errors kept appearing, always in src/segments.rs, which wasn't Java at all.

The error finally pointed to a Rust FFI bridge we'd written to offload distance calculations to a C++ library. The overflow happened when the Rust code received a NaN from the C++ side and tried to square it. The Java side had no way to validate the input before marshalling it through JNI.
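The kind of guard that was missing at that boundary can be sketched in a few lines. This is an illustration under assumptions, not the code from our bridge: checked_distance_sq and DistanceError are hypothetical names, and the inputs are assumed to be coordinate deltas already converted to f64.

```rust
/// Reasons a squared-distance computation is rejected.
/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq, Eq)]
enum DistanceError {
    /// One of the inputs was NaN or infinite.
    NonFiniteInput,
    /// The inputs were finite but the result overflowed to infinity.
    NonFiniteResult,
}

/// Squared Euclidean distance that refuses NaN/infinite values
/// instead of letting them propagate into later arithmetic.
fn checked_distance_sq(dx: f64, dy: f64) -> Result<f64, DistanceError> {
    // Reject bad values as soon as they cross the FFI boundary.
    if !dx.is_finite() || !dy.is_finite() {
        return Err(DistanceError::NonFiniteInput);
    }
    let d2 = dx * dx + dy * dy;
    // Very large finite inputs can still overflow to +inf.
    if d2.is_finite() {
        Ok(d2)
    } else {
        Err(DistanceError::NonFiniteResult)
    }
}
```

Returning a Result forces the caller to decide what to do with a bad reading (drop it, log it, retry) rather than discovering it several frames later.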

We ran perf record -g -F 99 and saw 42% of CPU time in jni_CallStaticObjectMethod and 31% in memcpy. The GC was scanning 3.2 million unreachable objects every cycle because the JNI calls were leaking jstring references.

The Architecture Decision

We decided to move the entire distance calculation into Rust. We chose tokio with tokio-metrics for async I/O, geo for geospatial math, and serde for JSON parsing. We used jemalloc via tikv-jemallocator after profiling showed Rust's default allocator had 3x more fragmentation on our 16-core machines.
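To give a sense of what the distance calculation looks like once it lives entirely in Rust, here is a std-only sketch of the haversine formula. It is not the geo crate's implementation and not our production code; haversine_km and the EARTH_RADIUS_KM constant (a commonly used mean Earth radius) are assumptions chosen for illustration.

```rust
/// Mean Earth radius in kilometres (assumed value for this sketch).
const EARTH_RADIUS_KM: f64 = 6371.0088;

/// Great-circle distance in kilometres between two points given in
/// decimal degrees, using the haversine formula.
fn haversine_km(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    let phi1 = lat1.to_radians();
    let phi2 = lat2.to_radians();
    let d_phi = (lat2 - lat1).to_radians();
    let d_lambda = (lon2 - lon1).to_radians();

    // a = sin^2(d_phi / 2) + cos(phi1) * cos(phi2) * sin^2(d_lambda / 2)
    let a = (d_phi / 2.0).sin().powi(2)
        + phi1.cos() * phi2.cos() * (d_lambda / 2.0).sin().powi(2);

    // Clamp sqrt(a) to 1.0 so rounding error cannot push asin out of range.
    2.0 * EARTH_RADIUS_KM * a.sqrt().min(1.0).asin()
}
```

Everything here is plain f64 arithmetic on the stack: no allocation, no boxed values, and no boundary crossing per call, which is the property we were after.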

The critical tradeoff was rewriting the entire segment merge in Rust. We estimated 8 weeks of engineering time versus 2 weeks of tuning the JVM. The alternative was to keep patching the Java side with more defensive checks, but we knew the GC pauses would return as soon as the heap grew beyond 8GB.

We chose Rust because the runtime had become the constraint. The JVM's stop-the-world GC safepoints (the closest thing it has to a global interpreter lock) and the JNI boundary were killing us. Rust gave us zero-cost abstractions, predictable memory layout, and fine-grained control over where CPU cycles go.

What The Numbers Said After

After the rewrite, the /search/nearby endpoint dropped to 22ms P99 on the same hardware. We used flamegraph to capture a 30-second profile:

Overhead  Command        Shared Object          Symbol
  12.46%  search-engine  libsearch_engine.so    distance_sq
   8.12%  search-engine  libsearch_engine.so    geo::haversine
   6.34%  search-engine  libsearch_engine.so    rayon::join
   5.21%  search-engine  [unknown]              [JIT]

Our memory profile showed 18% less RSS because we eliminated the Java heap and the JNI reference chains. We ran jemalloc-prof after four days:

 allocated: 142,872,048 bytes (100.0%)
 active: 128,431,680 bytes (90.0%)
 resident: 134,217,728 bytes (93.8%)
 metadata: 10,485,760 bytes (7.3%)

We also measured the impact of moving from G1GC to Rust's allocator. The RSS remained stable at 134MB even after 8 days, with only 1.2% growth in metadata.

The most surprising win was in the JIT overhead. The Java side had been spending 5.21% of CPU on JIT compilation during spikes. After removing the JNI boundary, that overhead vanished.

We kept the Veltrix dashboard running for comparison. The JVM heap graph showed sawtooth patterns every 2.1 seconds. The Rust heap graph showed a flat line at 134MB.

What I Would Do Differently

I would not have trusted the documentation's worker thread recommendation. The Veltrix docs said 4 workers per shard, but our load profile showed 8 workers saturated the CPU without context switching. We ended up configuring tokio::runtime::Builder with worker_threads set to 8, matching the physical cores.

I would also have introduced a staging environment for the Rust rewrite earlier. Our first load test with 1 million concurrent users crashed because we forgot to set tokio::runtime::Builder::max_blocking_threads. The panic was "cannot spawn a runtime inside a runtime", and the stack trace was 40 lines long.

Finally, I would have instrumented the JNI boundary from day one. If we'd used async-profiler on the Java side during the Java-first attempt, we would have seen the JNI overhead immediately. Instead, we wasted two weeks optimizing string interning and GC flags before realizing the bottleneck was external to the JVM.

The lesson is simple: when your runtime becomes the constraint, the documentation won't tell you. You have to profile, measure, and sometimes dig through the stack traces to understand where the cycles are really going.
