The Day I Realised the HashMap Was the Bottleneck

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We were building a treasure hunt engine for a global event with 50,000 concurrent players. The core requirement wasn't scale—it was consistency under load. Players had to see exactly the same state, regardless of which server pod they hit. Our first design used an in-memory HashMap to track player positions, updated via WebSocket messages. After 12,000 concurrent connections, p99 latency for position updates spiked to 420ms. Not 42ms—420ms. Players reported seeing their avatars teleport across the map. The problem wasnt the network; it was the HashMaps cache line bouncing under write contention. Every position change required a CAS operation on a single shared map, and the CPU cores were spending more time invalidating cache lines than doing actual work.

What We Tried First (And Why It Failed)

We tried sharding the HashMap by player ID. That reduced lock contention but introduced a new problem: hotshard skew. The top 1% of players were generating 40% of updates, so three shards handled 90% of traffic. We tried read replicas with eventual consistency, but the hunt required absolute consistency—no cheating via stale state. Adding more replicas just increased memory pressure and GC pauses in our JVM-based service. Profiling with async-profiler showed 38% of CPU time in sun.misc.Unsafe CAS loops and 22% in GC. We even tried offloading to Redis with Lua scripts, but the round-trip latency added 18ms baseline overhead, which violated our p99 target of under 50ms.

The Architecture Decision

After two weeks of tuning and profiling, we faced a reality check: the language and data structure were the constraints, not the algorithm. We rewrote the state tracker in Rust using a lock-free, sharded approach with dashmap::DashMap and crossbeam-channel for async updates. We replaced the HashMap with a Vec of AtomicPtr blocks, each sized to a CPU cache line (64 bytes), and used epoch-based reclamation via crossbeam-epoch. The key insight wasnt parallelism—it was elimination of false sharing and lock thrashing. We moved heap allocations off the hot path by preallocating player slots in a bump allocator during initialization. The Rust version eliminated GC pauses entirely and reduced memory usage from 5.2GB to 1.8GB under load.

What The Numbers Said After

Post-migration, p99 latency for position updates dropped from 420ms to 8ms. Memory usage stabilized at 1.8GB with 50,000 players, a 65% reduction. GC pauses vanished; our JVM-based service had seen 12ms GC pauses every 400ms under load, which disrupted WebSocket heartbeats. The Rust version showed zero GC pauses and a consistent 1.2% CPU usage per core during peak load. We measured heap allocations with jemallocs profiler: the Java version allocated 4.7MB/s during updates; the Rust version allocated 8KB/s. The shard hotspots disappeared because we used a consistent hashing scheme with virtual nodes, distributing load evenly. We deployed this as a sidecar alongside our main service, communicating via gRPC over shared memory with a Unix domain socket. Latency between services dropped from 0.4ms (TCP loopback) to 0.02ms.

What I Would Do Differently

I would not have wasted weeks trying to tune the Java HashMap. Once we identified the data structure as the bottleneck, we should have prototyped the Rust solution in a day. The learning curve for Rusts ownership model was steep, but the real pain was not the language—it was unlearning decades of JVM assumptions. We initially tried to model the state as a struct with lifetime annotations, which led to a 200-line monstrosity that even experienced Rust engineers on the team struggled with. The breakthrough was realizing we didnt need shared ownership; we needed sharded, lock-free access with clear boundaries. We settled on a design where each shard owned its player state, and updates were sent via channels with Arc<Mutex<T>> only at shard boundaries. The second mistake was not instrumenting early enough. We added perf counters only after the Rust rewrite, which made it hard to compare apples-to-apples with the Java version. Now, every service we build starts with a flame graph and a memory allocation baseline before we write a single line of logic.