When the Runtime Was the Wall: How Rust Broke a 50 ms SLA and Saved the Day

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We ran the Treasure Hunt Engine at Veltrix, our real-time game backend that serves 15 k QPS from players who expect to resolve a treasure within 50 ms or they rage-quit and refund. The performance target is hard: 99th percentile latency must stay under 50 ms end-to-end, including network marshaling, game state lookup, and leaderboard write. In December 2025 we hit a wall: the Go runtime stopped scaling past 2.4 k concurrent connections on a single c6i.4xlarge instance. We were seeing 67 ms p99s and 8 % allocator contention under load. That p99 wasn't moving no matter how many connections we sharded. Flame graphs showed 32 % of CPU time inside the scheduler's steal loop; the Go GC wasn't the bottleneck yet, but the scheduler was fighting itself under high context-switch rates. The team was ready to throw threads at it, but I knew that would only deepen the queueing delay. Something deeper had to change.

What We Tried First (And Why It Failed)

We started with Go 1.22.2 and net/http, then switched to github.com/valyala/fasthttp, which cut GC pressure by 20 %, but the p99 crept up again once we crossed 3 k connections. I pulled the Linux perf data:

 23.45% [kernel] __x86_indirect_thunk_rax
 18.72% main runtime.schedule
 12.87% main runtime.lock
 9.41% main runtime.mallocgc

The steal loop inside runtime.schedule was burning 18 % of CPU before any business logic ran. We tried increasing GOMAXPROCS from 4 to 8, which helped p95 but pushed p99 past 60 ms because the scheduler now incurred more cross-CPU migrations. We even moved to Go 1.23 and tried the experimental arena package (GOEXPERIMENT=arenas) for the treasure state tree, yet the allocator still fought the GC for the same 128-byte structs. The wall wasn't memory, it was the runtime's ability to schedule continuations fast enough under 300 µs span pressure. I knew we could shard deeper, but each new shard added cross-zone RPC latency and erased the gains. At that point we faced a choice: either accept the wall or change the language.

The Architecture Decision

I proposed a rewrite in Rust 1.80 with tokio 1.40 and hyper 1.0, pairing a work-stealing async runtime with the compiler's Send and Sync checks on anything shared across tasks. The tradeoff was a 4-week rewrite of the treasure lookup path and a 2-week port of the Redis-backed leaderboard façade. We rebuilt the state tree as a lock-free shard with arena-backed allocation (via bumpalo) to avoid both GC pauses and allocator hotspots. The critical call path looked like:

arena::Arena -> io_uring -> tokio::spawn -> lock_free_map::Entry
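To make the sharding idea concrete, here is a minimal sketch of a sharded state map using only the standard library. It is not the production code: `TreasureState`, `ShardedMap`, and the shard count are hypothetical names for illustration, and each shard is guarded by an `RwLock` as a simple stand-in for the lock-free map in the call path above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Hypothetical treasure state; the real struct is not shown in this post.
#[derive(Clone, Debug, PartialEq)]
pub struct TreasureState {
    pub owner: u64,
    pub claimed: bool,
}

/// A map split into independent shards so that lookups for different keys
/// rarely contend. Each shard uses an RwLock here, which is a stand-in for
/// a lock-free structure, not an implementation of one.
pub struct ShardedMap {
    shards: Vec<RwLock<HashMap<u64, TreasureState>>>,
}

impl ShardedMap {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        ShardedMap { shards }
    }

    /// Hash the key and map it onto one shard.
    fn shard_index(&self, key: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Insert or replace a value, returning the previous one if any.
    pub fn insert(&self, key: u64, value: TreasureState) -> Option<TreasureState> {
        let idx = self.shard_index(key);
        let mut shard = self.shards[idx].write().unwrap();
        shard.insert(key, value)
    }

    /// Look up a value; only the owning shard is read-locked.
    pub fn get(&self, key: u64) -> Option<TreasureState> {
        let idx = self.shard_index(key);
        let shard = self.shards[idx].read().unwrap();
        shard.get(&key).cloned()
    }
}
```

The point of the layout is that two requests touching different shards never wait on each other; a truly lock-free map removes the remaining per-shard lock as well.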

We kept the same Redis write path for leaderboards but moved to redis-rs with a connection pool that used blocking calls in a separate thread pool to avoid polluting the async runtime. The compiler caught three data races that would have taken weeks to reproduce in Go, and miri found a slice bounds error that only manifested under 70 k concurrent sessions. We deployed to a single c6i.4xlarge with the same hardware budget and set the same load test: 15 k QPS, 30 k concurrent sessions, 50 ms SLA.
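The "blocking calls in a separate thread pool" pattern can be sketched with the standard library alone. `BlockingPool` and `submit` below are hypothetical names, and the closure stands in for a blocking Redis write; in a Tokio service the same role is usually played by `tokio::task::spawn_blocking`, which this sketch does not use.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// A boxed unit of blocking work to run on a worker thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// A fixed set of threads dedicated to blocking calls, kept apart from
/// whatever threads drive the latency-sensitive request path.
pub struct BlockingPool {
    sender: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl BlockingPool {
    pub fn new(threads: usize) -> Self {
        assert!(threads > 0, "need at least one worker");
        let (sender, receiver) = channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock is held only while waiting for the next job,
                    // not while the job itself runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        BlockingPool { sender: Some(sender), workers }
    }

    /// Queue a blocking closure and return a receiver for its result.
    pub fn submit<T, F>(&self, f: F) -> Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller stopped waiting for the result.
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("workers are gone");
        rx
    }
}

impl Drop for BlockingPool {
    fn drop(&mut self) {
        // Closing the channel makes every worker's recv() fail and exit.
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

The caller gets a handle back immediately and decides when to wait on it, so a slow Redis round trip occupies a pool thread rather than a thread that is serving treasure lookups.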

What The Numbers Said After

After the Rust rewrite, the profile looked different. perf record with a flame graph showed:

 31.6% treasure_hunt [kernel.kallsyms] __x86_indirect_thunk_rax
 12.4% treasure_hunt tokio::runtime::scheduler::current
 6.8% treasure_hunt lock_free_map::get
 5.2% treasure_hunt hyper::server::conn::http1::keep_alive
 4.1% treasure_hunt redis::cmd

The scheduler overhead dropped from 18 % to 12 %, the cross-CPU migrations vanished because Tokio's work-stealing scheduler stays on fewer cores, and the lock-free map held steady at 6.8 %. The end-to-end latency histogram showed:

p50 12.3 ms
p95 31.8 ms
p99 46.2 ms
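For readers who want to check an SLA like this against their own samples, here is a small sketch of a nearest-rank percentile in Rust. The function names are hypothetical and this is not the tooling that produced the histogram above; it assumes latencies are collected as milliseconds in an in-memory slice.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty slice or a percentile outside (0, 100].
pub fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest sample such that at least pct percent
    // of all samples are less than or equal to it.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the 99th percentile is strictly below the SLA budget.
pub fn meets_sla(samples: &[f64], sla_ms: f64) -> bool {
    matches!(percentile(samples, 99.0), Some(p99) if p99 < sla_ms)
}
```

Sorting a copy is fine for offline analysis of a load-test run; a live service would more likely keep a streaming histogram instead of raw samples.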

That p99 was finally inside the 50 ms SLA even at 30 k concurrent sessions, and with no garbage collector there were no GC pauses, just one arena reset per shard per minute. The memory footprint stayed within 512 MB RSS with a 38 MB heap, because freeing an arena is a single bulk reset rather than a free per object. The Go version had leaked 12 MB per 1 k sessions due to fasthttp's internal buffers; Rust's arena reset the buffers in bulk.
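The bulk-reset behaviour is easy to show with a toy arena. The sketch below is a std-only stand-in, not bumpalo: `Arena` and `Handle` are hypothetical types that store values in one `Vec` and hand out indices. With bumpalo itself, the analogous operation is `Bump::reset`.

```rust
/// Index into the arena. Handles become stale after `reset` and may then
/// refer to a different value or to nothing at all.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Handle(usize);

/// A minimal typed arena: values are appended to a single Vec and are
/// never freed individually.
pub struct Arena<T> {
    items: Vec<T>,
}

impl<T> Arena<T> {
    /// Reserve backing memory up front so allocation is just a push.
    pub fn with_capacity(capacity: usize) -> Self {
        Arena { items: Vec::with_capacity(capacity) }
    }

    /// Store a value and return a handle to it.
    pub fn alloc(&mut self, value: T) -> Handle {
        self.items.push(value);
        Handle(self.items.len() - 1)
    }

    pub fn get(&self, handle: Handle) -> Option<&T> {
        self.items.get(handle.0)
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn capacity(&self) -> usize {
        self.items.capacity()
    }

    /// Drop every value in one pass while keeping the backing memory,
    /// so the next batch of allocations reuses it.
    pub fn reset(&mut self) {
        self.items.clear();
    }
}
```

Because the memory is retained across resets, a steady workload settles at a fixed footprint instead of growing with each batch of sessions.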

What I Would Do Differently

If I could go back, I would not have spent 3 weeks optimizing the Go runtime before admitting the scheduler was the bottleneck. The Go scheduler is excellent, but it is not designed for microsecond-scale continuations under 30 k concurrent sessions. I'd also avoid io_uring under Tokio for now; the benefits didn't outweigh the complexity once we stabilized. Finally, I would have pushed harder to move the leaderboard façade to a lock-free shard inside the Rust process instead of outsourcing to Redis. The round-trip Redis latency added 4–6 ms at p95, and we could have replicated the leaderboard in-memory using a sharded skip list. That single change would likely shave another 5 ms off the p99, but we shipped on time and the players stayed happy.
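As a rough illustration of the in-process leaderboard idea, here is a sketch that shards players across ordered sets and merges the per-shard leaders for a top-N query. It swaps the skip list for the standard library's `BTreeSet`, is single-threaded (`&mut self`), and uses hypothetical names throughout, so treat it as the shape of the data structure rather than a lock-free design.

```rust
use std::cmp::Reverse;
use std::collections::{BTreeSet, HashMap};

/// One leaderboard shard: current score per player plus an ordered index.
#[derive(Default)]
struct Shard {
    scores: HashMap<u64, u64>,
    // Ordered by score descending, then player id ascending.
    ranked: BTreeSet<(Reverse<u64>, u64)>,
}

/// Leaderboard split by player id. BTreeSet is a stand-in for a skip list.
pub struct Leaderboard {
    shards: Vec<Shard>,
}

impl Leaderboard {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        Leaderboard {
            shards: (0..shard_count).map(|_| Shard::default()).collect(),
        }
    }

    /// Set a player's score, replacing any previous entry for that player.
    pub fn set_score(&mut self, player: u64, score: u64) {
        let idx = (player % self.shards.len() as u64) as usize;
        let shard = &mut self.shards[idx];
        if let Some(old) = shard.scores.insert(player, score) {
            shard.ranked.remove(&(Reverse(old), player));
        }
        shard.ranked.insert((Reverse(score), player));
    }

    /// Top `n` (player, score) pairs across all shards, highest score first.
    pub fn top(&self, n: usize) -> Vec<(u64, u64)> {
        // The global top n can only contain entries from each shard's top n.
        let mut merged: Vec<(Reverse<u64>, u64)> = self
            .shards
            .iter()
            .flat_map(|shard| shard.ranked.iter().take(n).copied())
            .collect();
        merged.sort();
        merged
            .into_iter()
            .take(n)
            .map(|(Reverse(score), player)| (player, score))
            .collect()
    }
}
```

A concurrent version would need per-shard synchronization or a lock-free ordered structure, plus a plan for durability that Redis currently provides.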
