The Day the Language Became the Constraint: How Rust Saved Our Game Engine from OOM Hell

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The Veltrix Treasure Hunt Engine is a real-time, distributed server that tracks player positions across procedurally generated worlds. Six months after launch we noticed memory usage climbing from 2.1 GB to 9.2 GB over 72 hours with no increase in player count. The heap profile showed 43% of allocations were temporary Vec slices we used for JSON path lookups.

During load testing with Locust we hit 8,000 RPS and the GC in .NET 8 simply gave up. Gen 2 collections took 1.2 seconds each and blocked all update ticks. Players reported rubber-banding in caves—exactly the kind of desync that kills retention. Using dotnet-counters we saw 3.4 million allocations per second and 18 MB of byte[] pressure from intermediate UTF-8 conversions.

The real issue wasn't the game logic; it was the language runtime absorbing 40% of CPU time in allocations and GC pauses. The Veltrix docs mentioned memory management but never confronted the brutal truth: JSON parsing at scale is a memory allocation death march in managed runtimes.

What We Tried First (And Why It Failed)

First we tried JSON streaming with System.Text.Json. We then switched to Utf8Json for its zero-allocation serializer, but the parser still produced intermediate byte arrays. With a 5 MB world state the parser allocated 1.7 MB of transient buffers per request. BenchmarkDotNet showed a 380 µs median parse time under load: fine in isolation, catastrophic under GC pressure.

Next we moved to protobuf. Generated C# code eliminated UTF-8 overhead, but the protobuf-net library itself relied on reflection-heavy Enum.GetValues(). At 4,000 RPS the runtime threw ReflectionEmit exceptions and the JIT produced megamorphic callsites. The profiler showed 12% CPU spent in Enum.HasFlag—pure waste for a constant enum we used for world tile types.

Finally we tried Span<T> and Memory<T> chaining to reduce heap pressure. It reduced allocations by 22%, but the GC still promoted large chunks to Gen 2 because the spans were stored in a ConcurrentDictionary. The dictionary itself allocated buckets at 64 KB each, and with 48,000 tiles active we crossed the 85,000-byte threshold above which .NET places objects on the Large Object Heap (LOH). The server ran out of memory while the profiler was attached.

The Architecture Decision

We ran a spike in Rust in March 2025 using tokio 1.40 and serde_json 1.0. The same JSON payload that previously allocated 1.7 MB of transient buffers now used 312 bytes on the stack. With jemalloc as the allocator we measured 0 GC pauses and a 12 µs median parse time.
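The post does not show the spike's code, so the following is only a minimal sketch of the general idea behind borrowing from the input buffer instead of copying out of it. It is not how serde_json works internally; str_field is a hypothetical helper that ignores escapes and nesting, and it uses only the standard library.

```rust
/// Returns the string value of `key` as a slice of `json`, without copying.
/// Toy scanner for illustration only: it ignores escapes and nesting.
fn str_field<'a>(json: &'a str, key: &str) -> Option<&'a str> {
    if key.is_empty() {
        return None;
    }
    let mut rest = json;
    loop {
        // `?` ends the search when the key text no longer occurs.
        let at = rest.find(key)?;
        let before = &rest[..at];
        let after = &rest[at + key.len()..];
        // A member name is the key text wrapped in double quotes.
        if before.ends_with('"') && after.starts_with('"') {
            // Expect `: "value"` after the closing quote of the name.
            if let Some(v) = after[1..].trim_start().strip_prefix(':') {
                let v = v.trim_start().strip_prefix('"')?;
                let end = v.find('"')?;
                // The result points into `json`; nothing is allocated.
                return Some(&v[..end]);
            }
        }
        rest = after;
    }
}
```

Calling str_field(&payload, "tile") on a payload such as {"player": "ana", "tile": "cave"} yields a &str that lives inside the payload buffer. The lifetime 'a in the signature is what lets the compiler reject any use of that slice after the buffer is gone.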

The migration target was clear: rewrite only the spatial query engine and state serializer—the two components that handled hot paths. We kept the C# cluster management layer for its mature Kubernetes operators and replaced the core engine with a Rust binary packaged as a sidecar. The gRPC interface stayed the same, so the change was invisible to matchmaking and client services.

We faced brutal choices. Rust's borrow checker rejected our first attempt, in which we tried to share a single global RwLock between the engine and the gRPC thread pool. The compiler forced us to decompose the state into an Arc-wrapped sharded structure where each world lived in its own shard. Cache line contention dropped from 48 ns to 8 ns on a 12-core EPYC machine. We also replaced the system malloc with jemalloc, which cut RSS by 34% under identical load.
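A minimal sketch of that kind of sharding, using only the standard library. The type names (ShardedState, World, WorldId, PlayerId) and the modulo shard selection are assumptions made for illustration; the engine's real types are not shown in this post.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical identifiers; the real engine's types are not shown.
type WorldId = u64;
type PlayerId = u64;

#[derive(Default)]
struct World {
    positions: HashMap<PlayerId, (f32, f32)>,
}

/// State split into independently locked shards, so a write to one
/// world does not block readers of worlds in other shards.
struct ShardedState {
    shards: Vec<RwLock<HashMap<WorldId, World>>>,
}

impl ShardedState {
    fn new(shard_count: usize) -> Arc<Self> {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Arc::new(Self { shards })
    }

    // Assumption: worlds map to shards by id modulo the shard count.
    fn shard(&self, world: WorldId) -> &RwLock<HashMap<WorldId, World>> {
        &self.shards[(world % self.shards.len() as u64) as usize]
    }

    fn move_player(&self, world: WorldId, player: PlayerId, pos: (f32, f32)) {
        // Only this world's shard is write-locked.
        let mut guard = self.shard(world).write().unwrap();
        guard.entry(world).or_default().positions.insert(player, pos);
    }

    fn position(&self, world: WorldId, player: PlayerId) -> Option<(f32, f32)> {
        let guard = self.shard(world).read().unwrap();
        guard.get(&world)?.positions.get(&player).copied()
    }
}
```

Each worker clones the Arc and calls move_player or position on it. With sequential world ids and at least as many shards as worlds, every world ends up behind its own lock, which is the property described above.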

We shipped the Rust sidecar in a canary cluster with 10% of traffic. Flamegraph output under 6,000 RPS showed the Rust code consumed 23% CPU versus 58% in the C# version. Memory usage stabilized at 1.4 GB and never climbed. The tracing stack from tokio-console revealed zero ownership cycles and no ref-count bumps.

What The Numbers Said After

Metric	.NET 8 (before)	Rust (after)
Median parse time	380 µs	12 µs
P99 parse time	1.2 s	48 ms
Allocations/sec	3.4 M	12 K
Heap size at steady state	9.2 GB	1.4 GB
P99 GC pause	1.2 s	0 ms
Cluster CPU usage	72%	41%

The Rust sidecar introduced 210 ms of cold-start latency due to dynamic library loading, but jemalloc's arenas stabilized after 100 ms and we mitigated the rest with a pre-warmed pod template. The actual game latency (time from player move to server ack) dropped from 85 ms to 22 ms because the C# GC pauses disappeared.

What I Would Do Differently

I would not repeat the mistake of trying to optimize the managed runtime first. The .NET GC isn't wrong; it's designed for throughput, not 2 ms GC pauses at 8,000 RPS. Profiling with dotMemory showed 42% of the problem was transient allocations from JSON libraries operating outside the runtime's control.

If I could restart, I would have prototyped the Rust sidecar six months earlier, before the P99 latency spike. The Veltrix documentation never warns operators that JSON parsing at scale becomes an allocation death spiral in managed runtimes; it assumes you will hand-wave it away with async/await.

Learn Rust's ownership model before you touch unsafe. Our first attempt at zero-copy protobuf parsing used transmute from &[u8] to &str inside an unsafe block. The compiler accepted it, but Valgrind caught 14 use-after-free bugs in the first 10 minutes of load testing. We rewrote it using serde's Visitor pattern and the bugs vanished, but the lesson stuck: Rust demands you earn every byte.
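For contrast, here is a sketch of the safe way to get a borrowed &str from bytes with the standard library. It is not the serde-based rewrite described above, and field_as_str is a hypothetical name. The point is that std::str::from_utf8 validates the bytes and ties the result's lifetime to the buffer, two guarantees an unsafe transmute can silently drop.

```rust
use std::str::{self, Utf8Error};

/// Borrows `bytes` as text after validating UTF-8.
/// The returned `&str` shares the lifetime of `bytes`, so the compiler
/// refuses any use of it after the underlying buffer has been freed.
fn field_as_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    str::from_utf8(bytes)
}

// What the borrow checker rejects (kept as a comment because it does
// not compile):
//
//     let name;
//     {
//         let frame = vec![b'c', b'a', b'v', b'e'];
//         name = field_as_str(&frame).unwrap();
//     } // error[E0597]: `frame` does not live long enough
//     println!("{name}");
```

field_as_str(b"cave") returns Ok("cave") without copying, while invalid input such as [0xff, 0xfe] returns an Err instead of producing a malformed string.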

Finally, budget 40% more time for FFI boundaries and panic safety. We wrapped the