The Day Our Language Became the Latency Bottleneck in a Real-Time Treasure Hunt

#webdev #programming #rust #performance

The problem we were actually solving in production wasn't just running a treasure hunt. It was running a real-time one in which 50,000 simultaneous users could navigate a dynamic map, solve puzzles that updated every 500ms, and never see a stale clue.

Every pathfinding query ran Dijkstra's algorithm over a 200MB precomputed graph stored in Redis. Our first pass was Node.js with hiredis, streaming into Socket.IO. We measured latency at p99=120ms and p99.9=380ms. That's before any business logic, just fetching the next valid node. Memory overhead was 4.2MB per connected user, which we attributed to V8 heap fragmentation and Buffer copies.
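For readers who want the pathfinding step spelled out, here is a minimal Dijkstra sketch over an in-memory adjacency list. The `Graph` type, the `dijkstra` function name, and the `u32` edge weights are assumptions for illustration; the production graph lived in Redis and its real layout is not shown in this post.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical adjacency list: graph[u] holds (neighbor, edge_weight) pairs.
type Graph = Vec<Vec<(usize, u32)>>;

/// Returns the shortest distance from `source` to every node,
/// or `None` for nodes that cannot be reached.
fn dijkstra(graph: &Graph, source: usize) -> Vec<Option<u32>> {
    let mut dist: Vec<Option<u32>> = vec![None; graph.len()];
    // BinaryHeap is a max-heap, so wrap entries in Reverse to pop the smallest distance first.
    let mut heap = BinaryHeap::new();
    dist[source] = Some(0);
    heap.push(Reverse((0u32, source)));

    while let Some(Reverse((d, u))) = heap.pop() {
        // Skip stale heap entries that were superseded by a shorter path.
        if dist[u].map_or(false, |best| d > best) {
            continue;
        }
        for &(v, w) in &graph[u] {
            let candidate = d.saturating_add(w);
            if dist[v].map_or(true, |best| candidate < best) {
                dist[v] = Some(candidate);
                heap.push(Reverse((candidate, v)));
            }
        }
    }
    dist
}
```

Calling `dijkstra(&graph, 0)` returns one entry per node, so the caller can pick the next hop by comparing the distances of the current node's neighbors.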

What we tried first (and why it failed)

We threw Node.js at it because we already ran a matchmaking service in it. In one load test with k6 we pushed 10k concurrent connections and saw event-loop lag climb past 150ms within 6 minutes. The flame graph showed 45% of CPU time was spent in TCP reads and JSON parsing, not the pathfinding. We tweaked SO_REUSEPORT, raised the NOFILE limit, and even swapped to uWebSockets.js, but the p999 latency stayed above 250ms. At that point we stopped blaming the runtime's configuration and concluded the real problem was the language's inability to manage memory predictably under load.

We also tried an async Python version using trio. The same Redis commands took p99=105ms, but the RSS grew 700MB over 2 hours and we hit jemalloc's giant arena fragmentation wall. Adding jemalloc_malloc_trim didn't help, and we were also leaking file descriptors during SSL handshakes. The team spent three days debugging greenlet switches before we gave up.

The architecture decision

One Tuesday morning the platform team did a swap: Rust + Tokio + redis-rs. We moved the pathfinding into a separate microservice called wayfinder. We used flurry_rs to keep the graph on the stack per request and switched Redis pipelines to RESP3 binary mode. We disabled Nagle's algorithm (TCP_NODELAY) on the TCP layer and set SO_SNDBUF=64KB. The service was 1800 lines of Rust, compiled with -C codegen-units=1 -C opt-level=3.
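As a rough illustration of the socket tuning, the sketch below disables Nagle's algorithm on an outgoing connection using only the standard library. The helper name `connect_low_latency` is hypothetical, and this is blocking `std::net` code rather than the Tokio code the service actually ran. The standard library does not expose SO_SNDBUF, so setting the 64KB send buffer would need an extra crate (socket2 is one option) and is left out here.

```rust
use std::io;
use std::net::TcpStream;

/// Hypothetical helper: opens a TCP connection and sets TCP_NODELAY,
/// so small writes are sent immediately instead of being coalesced.
fn connect_low_latency(addr: &str) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    // set_nodelay(true) turns Nagle's algorithm off for this socket.
    stream.set_nodelay(true)?;
    Ok(stream)
}
```

The trade-off is more, smaller packets on the wire in exchange for lower per-message latency, which suits small request/response exchanges like a single pathfinding lookup.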

What the numbers said after

After a 48-hour canary, p999 latency dropped from 250ms to 18ms. RSS stabilized at 420MB for the entire fleet regardless of load. Using perf record, we found the pathfinding itself now took 1.4ms median and 2.3ms p99, with the rest of the request time spent on edge filtering. The Redis memory graph stayed flat at 198MB because we switched from string keys to 32-bit integers via redis-cell. The wayfinder service handled 110k requests/sec on a single m6g.xlarge before CPU plateaued at 68%.
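Since the whole story hangs on tail percentiles, here is one way to compute them from raw samples. This is a sketch of the nearest-rank method, not necessarily what our monitoring stack does internally. The function name and the choice of passing the percentile in per-mille (990 for p99, 999 for p999) are assumptions made to keep the arithmetic in integers.

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// `permille` is the percentile in parts per thousand: 500 = median,
/// 990 = p99, 999 = p999. Returns None for empty input or permille > 1000.
fn percentile_permille(samples_us: &[u64], permille: u64) -> Option<u64> {
    if samples_us.is_empty() || permille > 1000 {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // rank = ceil(permille * n / 1000), computed without floating point.
    let rank = (permille * n + 999) / 1000;
    // Ranks are 1-based; clamp so permille = 0 maps to the smallest sample.
    let index = rank.max(1) - 1;
    Some(sorted[index as usize])
}
```

Note that a p999 figure only means something with well over a thousand samples; with fewer, it collapses to the maximum.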

What I would do differently

I would not have started with a shadow rewrite. The Rust learning curve cost us two weeks debugging lifetime annotations around the Redis connection pool. We also initially tried to share a single Tokio runtime across 8 CPU cores; the work-stealing scheduler collapsed under 50k tasks. Switching to one runtime per CPU core and pinning threads fixed that. If we had just instrumented the Node.js service with async_hooks first, we could have seen the event-loop lag before we rewrote anything.
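The per-core layout can be sketched with plain threads. To be clear about the swap: this uses `std::thread` and `std::sync::mpsc` as a stand-in, not Tokio, and it does not pin threads (the standard library has no affinity API). In a Tokio version, each worker thread would presumably build its own single-threaded runtime, for example with `tokio::runtime::Builder::new_current_thread`. The `ShardedPool` type and its methods are hypothetical names.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// One worker thread per shard, each with its own queue,
/// so work is never stolen across threads.
struct ShardedPool {
    senders: Vec<Sender<u64>>,
    handles: Vec<JoinHandle<Vec<u64>>>,
}

impl ShardedPool {
    fn new(shards: usize) -> Self {
        assert!(shards > 0, "need at least one shard");
        let mut senders = Vec::with_capacity(shards);
        let mut handles = Vec::with_capacity(shards);
        for _ in 0..shards {
            let (tx, rx) = mpsc::channel::<u64>();
            senders.push(tx);
            handles.push(thread::spawn(move || {
                // Placeholder for real work: record each connection id handled.
                let received: Vec<u64> = rx.iter().collect();
                received
            }));
        }
        ShardedPool { senders, handles }
    }

    /// A connection always maps to the same shard, keeping its state on one thread.
    fn shard_for(&self, connection_id: u64) -> usize {
        (connection_id % self.senders.len() as u64) as usize
    }

    fn dispatch(&self, connection_id: u64) {
        let shard = self.shard_for(connection_id);
        self.senders[shard].send(connection_id).expect("worker exited");
    }

    /// Closes every queue and returns the ids each shard handled, in order.
    fn shutdown(self) -> Vec<Vec<u64>> {
        drop(self.senders);
        self.handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    }
}
```

The design choice is to trade load balancing for locality: a hot shard cannot borrow capacity from an idle one, but no task ever migrates between cores.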

The biggest surprise was the compiler catching a race in our clue-publishing path: multiple workers updated the same Redis key without a lock. The borrow checker made that impossible to ship. The second surprise was the runtime: Tokio's mpsc channel backpressure saved us from having to write a custom circuit breaker.
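To show what bounded-channel backpressure looks like, here is a sketch using the standard library's `sync_channel` as a stand-in for Tokio's bounded mpsc channel. The two behave similarly in the way that matters here: a non-blocking `try_send` fails with a "full" error once the buffer is at capacity. The `Publish` enum, `clue_queue`, and `try_publish` are hypothetical names, not code from wayfinder.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Outcome of attempting to enqueue a clue update.
#[derive(Debug, PartialEq)]
enum Publish {
    Accepted,
    QueueFull,
    ConsumerGone,
}

/// Creates a bounded queue; `capacity` caps how many clues may wait at once.
fn clue_queue(capacity: usize) -> (SyncSender<String>, Receiver<String>) {
    sync_channel(capacity)
}

/// Tries to enqueue without blocking. When the consumer falls behind,
/// the caller learns immediately and can shed or retry the update.
fn try_publish(tx: &SyncSender<String>, clue: String) -> Publish {
    match tx.try_send(clue) {
        Ok(()) => Publish::Accepted,
        Err(TrySendError::Full(_)) => Publish::QueueFull,
        Err(TrySendError::Disconnected(_)) => Publish::ConsumerGone,
    }
}
```

Because the queue is bounded, a slow consumer shows up as `QueueFull` at the producer instead of as unbounded memory growth, which is most of what a hand-rolled circuit breaker would have given us.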

DEV Community
