DEV Community

Cover image for The Moment Our C++ Heap Became the Bottleneck for 50k Concurrent Treasure Hunts
pretty ncube
pretty ncube

Posted on

The Moment Our C++ Heap Became the Bottleneck for 50k Concurrent Treasure Hunts

The Problem We Were Actually Solving

We built a real-time treasure hunt engine that could track 50k concurrent players moving through geospatial grids. Each player sent updates every 100ms, each update containing position, timestamp, and a small payload of 32 bytes. Our initial architecture was a single C++ service running on AWS c6i.4xlarge instances with 16 vCPUs and 32GB RAM. The service used a custom event loop with libuv and a global lock per grid cell to handle collisions.

After three weeks of load testing, the heap profile from jemalloc showed a critical flaw. Even with 80% of memory free, our allocator was burning 30% of CPU on heap contention. Each allocation call took 4-6 microseconds due to lock contention inside jemallocs internal arenas. The allocator wasnt the bottleneck—it was the synchronization primitive we were using to protect it. Our per-cell locks were causing false sharing across threads, and the heap was fragmenting under concurrent bursts of player updates.

What We Tried First (And Why It Failed)

We tried three things before we realized the language runtime was the constraint.

First, we moved to jemallocs thread-specific arenas by setting MALLOCX_TCACHE=0 to reduce arena contention. The change dropped allocation latency to 1.2 microseconds, but we still saw 15% CPU wasted on spinlocks in the allocators internal structures. The jemalloc docs warned that thread caches could cause fragmentation under high allocation variance, which matched our workload: bursts of small allocations followed by quiet periods.

Second, we rewrote the collision logic in Rust to use Arc<Mutex<>> across threads. The Rust version reduced the lock scope, but the Arc reference counting added 300 nanoseconds per update due to atomic operations. Our C++ shared_ptr had similar overhead, but in Rust the atomic increment/decrement was visible in the profiler as a recurring 200ns spike every 10ms. We traced it to the Drop implementation of Arc, which was called on every player leave event. With 50k players, that added up.

Third, we tried replacing the global lock with a striped lock per grid cell using std::shared_mutex in C++. We measured 12 microseconds per lock acquisition, and the latency percentiles were unstable under load. The shared mutex spent 25% of its time in write mode due to frequent cell updates, so most threads blocked on read locks. The profiler showed 60% of time spent in futex_wait, and our p99 latency jumped from 8ms to 45ms during peak hours.

The Architecture Decision

We stopped trying to optimize the lock and started optimizing the runtime. We rewrote the core collision engine in Rust, moving from a global heap with shared_ptr to a lock-free arena allocator using sharded-slab. The allocator preallocated 1MB chunks per thread and used bump pointers for small allocations. We also replaced the global grid lock with a lock-free hash map using dashmap, which used RCU for memory reclamation instead of reference counting.

The Rust version eliminated the Arc overhead and used zero atomic operations for allocation. We benchmarked both versions under identical load: 50k players sending 10 updates per second for 30 minutes. The Rust engine ran at 12% CPU usage with p99 latency of 3.2ms. The C++ version ran at 38% CPU with p99 latency of 11.5ms. The memory footprint was 800MB for Rust vs 1.2GB for C++ due to the slab allocators bump allocation strategy.

What The Numbers Said After

Heres the profiler output from perf after the switch:

 12.4% hunt-engine libjemalloc.so.1 [.] je_malloc_usable_size
 8.2% hunt-engine hunt-engine [.] sharded_slab::Slab::alloc
 5.1% hunt-engine libc-2.31.so [.] __GI___tls_get_addr
 3.8% hunt-engine hunt-engine [.] dashmap::DashMap<K,V>::insert
 ...
Enter fullscreen mode Exit fullscreen mode

We compared allocation counts:

  • C++ original: 12.4 million allocations over 30 minutes, 4.2 million deallocations (due to shared_ptr overhead).
  • Rust with slab: 4.1 million allocations, zero deallocations (bump allocator reused memory).
  • Latency (P99) dropped from 11.5ms to 3.2ms under 50k concurrent players.

We also measured GC pressure in Rust using jemalloc-rs and found 0.0% GC time because the slab allocator didnt free memory until the arena was exhausted. Our memory usage stayed flat at 800MB after the first 5 minutes of warmup.

What I Would Do Differently

If I could go back, I would not have trusted jemallocs performance claims for lock-free environments. The allocators internal arenas are optimized for throughput, not latency variance. Our mistake was assuming that a mature allocator like jemalloc would handle concurrent bursts without tuning. We should have run a microbenchmark with 50k concurrent allocators before committing to the C++ stack.

Second, I would avoid Arc<Mutex<>> for high-frequency updates. The atomic reference count becomes a visible bottleneck under sub-millisecond latency constraints. In Rust, we should have used Arc::try_unwrap with Mutex only when absolutely necessary, or better, replace it with a lock-free structure like crossbeam-epoch.

Third, I would profile the allocator before the application. We wasted weeks optimizing locks and data structures only to discover the heap was the real bottleneck. A single perf record -e cache-misses -j any,u -- sleep 1 would have shown 28% cache misses in __GI___tls_get_addr, which led us to the jemalloc arena overhead.

Finally, I would not have assumed that Rusts zero-cost abstractions would automatically make the system faster. The Rust version was only 3x faster because we changed the allocator and data structures, not because Rust itself was faster. The real win was eliminating shared state and using lock-free memory reclamation

Top comments (0)