The Moment We Killed the Python VM for the Treasure Hunt Engine

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We were running a live event platform at Veltrix that generated unique treasure hunts for 40 k concurrent users. Each hunt required a real-time solver that recomputed paths through a graph of 8 k nodes. The server was Python 3.11 with NetworkX and NumPy pinned to LLVM 16 builds. After the first 10 k users, CPU time for the solver hit 85 % with 300 ms p99. We tried Cython, then PyPy, then offloaded the solver to Rust via CFFI. The latency stayed flat, but memory climbed linearly with user count. The GIL was serializing every thread and the NumPy allocator was fragmenting 32 kB chunks across 8 GB of heap. The NetworkX runtime alone was 6.2 ms per query, and we needed 10 µs.

What We Tried First (And Why It Failed)

The first fix was a thread pool of 32 workers feeding the Python solver. perf showed 93 % of the time spent in PyEval_EvalFrameDefault. We added a GIL release around the heavy solver call, but at 30 k users the RSS ballooned to 6 GB and the allocator started failing with mmap fragmentation warnings from jemalloc. jemalloc reported 240 k active arenas and 1.8 million small allocations per second. The malloc profile confirmed 12 GB of virtual memory allocated even though the working set was 2 GB. The kernel started killing processes with OOM when the RSS touched 8 GB.

We then moved the solver to a separate Rust process and used nanomsg for IPC. Latency dropped to 8 ms p99, but the context switch cost per query climbed to 1.4 µs with 45 k switches per second. The allocator bottleneck simply moved: Rusts jemalloc in the solver process now fragmented at 4 kB blocks, and we saw 400 k live allocations per second. The per-user memory footprint was 140 kB, and with 80 k users we were again at 11 GB RSS.

The Architecture Decision

At that point we stopped patching the runtime and replaced the whole solver stack. We rewrote the graph engine in Rust with petgraph and ndarray, kept zero-copy deserialization with zerocopy, and exposed the solver through a Tokio async channel. We disabled jemalloc in favor of mimalloc because mimallocs arena reuse dropped active arenas from 240 k to 120 and reduced RSS by 25 % under load.

We also changed the graph representation: instead of storing adjacency lists as Vec, we switched to a 16-bit compressed sparse row format with 4-byte node weights. The adjacency list shrunk from 1.2 MB to 270 kB per hunt instance. We used a custom allocator for the graph instances—one arena per hunt thread—and the allocator never returned memory to the OS, reducing mmap churn.

The solver loop became:

let graph = CSRGraph::from_bytes(&bytes)?;
let results = graph.dijkstra_batch(&sources, &targets)?;

We benchmarked with criterion.rs and saw 7 µs mean and 32 µs p99 for a batch of 64 queries. At 80 k users the RSS plateaued at 3.2 GB and RSS/conn dropped to 41 kB.

What The Numbers Said After

After the rewrite we ran a 90-minute load test with 95 k parallel users. The p99 latency was 35 µs; p999 was 72 µs. The RSS stayed below 3.4 GB, and the allocator reported only 312 active arenas. jemalloc in mimalloc mode showed 1.2 million allocations per second but with zero external fragmentation. The kernel oom-killer never triggered.

A 30-second flamegraph from perf showed 87 % of CPU time in the solver, 8 % in the Tokio scheduler, and 5 % in the mimalloc arena reuse code. The scheduler latency histogram showed 95 % of tasks scheduled within 3 µs. The previous Python stacks 27 ms p99 now looked embarrassing.

We deployed the new engine to production and enabled canary for 5 % of traffic. Error rate dropped from 0.12 % to 0.001 % and the 95th percentile search latency fell from 28 ms to 34 µs. The team immediately added a feature flag to drop the Python VM from the build pipeline.

What I Would Do Differently

I would never have started with Python for a CPU-bound solver that needs sub-100 µs latency. The mistake compounded because we chased GIL releases and CFFI instead of rewriting the stack. If I had to do it again, I would prototype the solver in Rust from day one and verify allocator behavior with massif. I would also insist on compressed graph formats early; the 78 % reduction in memory footprint was worth more than the 4× speedup.

We learned the hard way that the runtime is the constraint, not the algorithm.