The Day Our Configs Were Backwards (And How Rust Fixed It)

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our game server at Veltrix had a leak that grew by 1.2 MB per second under load. No stack traces, no panics, just VmRSS in /proc/self/status climbing like a drunk spider on caffeine. The game loop looked innocent:

while let Some(player) = next_player() {
    handle_player(player); // Nothing allocates here, right?
}

Right? We were using Rust with Tokio, so we assumed the borrow checker had our back. Turns out, our next_player was backed by a tokio::sync::mpsc receiver that buffered every message indefinitely. We thought we had capped the channel at 1024, never realizing the one we were actually using was unbounded. (Tokio has no default capacity: mpsc::channel takes an explicit bound, and mpsc::unbounded_channel has none.)
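To make the bounded-versus-unbounded difference concrete, here is a minimal sketch. It uses std::sync::mpsc as a stand-in for Tokio's channel (an assumption made so the example runs without extra crates): sync_channel(cap) plays the role of mpsc::channel(cap), and channel() plays the role of mpsc::unbounded_channel(). The function names are hypothetical.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Tries to queue `n` messages on a channel bounded at `cap` while nothing
/// is receiving; returns how many were accepted before the channel was full.
fn accepted_by_bounded(cap: usize, n: usize) -> usize {
    let (tx, _rx) = sync_channel::<u64>(cap);
    let mut accepted = 0;
    for i in 0..n {
        match tx.try_send(i as u64) {
            Ok(()) => accepted += 1,
            // Full is the backpressure signal: the producer must wait or give up.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Queues `n` messages on an unbounded channel while nothing is receiving.
/// Every send succeeds, so the backlog (and memory) grows with `n`.
fn accepted_by_unbounded(n: usize) -> usize {
    let (tx, _rx) = channel::<u64>();
    (0..n).filter(|&i| tx.send(i as u64).is_ok()).count()
}

fn main() {
    println!("bounded(1024) accepted {}", accepted_by_bounded(1024, 4096));
    println!("unbounded accepted {}", accepted_by_unbounded(4096));
}
```

The bounded channel stops accepting at its capacity, while the unbounded one takes everything it is given. That is why a slow consumer shows up as memory growth rather than as an error.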

The moment I realized the language couldn't save us from our own configuration was when I ran tokio-console and saw 4,096 pending move requests lingering in the channel. Each request was a small struct, but 4k was already pushing us toward OOM.

What We Tried First (And Why It Failed)

First attempt: blame the runtime. We tried switching Tokio's scheduler from multi-threaded to current-thread, thinking fewer threads would reduce allocations. It dropped allocations by 15%, but the leak persisted.

Then we tried limiting the channel explicitly:

let (tx, rx) = tokio::sync::mpsc::channel(128);

We naively assumed 128 was reasonable. Wrong. In production traffic, spikes of 500 concurrent players meant we hit backpressure immediately. Players reported timeouts when the channel filled up.

Our third attempt was to increase the bound to 1024. This worked for a week, until memory shot up again. The real issue wasn't capacity; it was lifetime.

Every message in the channel held a String for the player's session token. When a player disconnected, we dropped the sender, but every message that sender had already queued stayed alive in the channel's internal buffer until the receiver got around to it. We were holding on to session tokens with every disconnect.

The Architecture Decision

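We finally traced it to how tokio::sync::mpsc shares its buffer. Senders and the receiver hold an Arc to the same channel state, and queued messages live in that shared state, not in the sender. Dropping the sender therefore frees nothing that is already queued; each message stays alive until it is received. With 10k players per match, that was 10k strings in limbo.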
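The lifetime behaviour is easy to check in isolation. The sketch below again uses std::sync::mpsc as a stand-in for Tokio's channel (an assumption; the queued-message lifetime works the same way for this purpose), and MoveRequest is a hypothetical message type that counts how many times it has been dropped.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::channel;
use std::sync::Arc;

/// Hypothetical message: owns a session token and records its own drop.
struct MoveRequest {
    _session_token: String,
    drops: Arc<AtomicUsize>,
}

impl Drop for MoveRequest {
    fn drop(&mut self) {
        self.drops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Queues `n` messages, drops the sender, then drains the receiver.
/// Returns (drops seen right after the sender is gone, drops after draining).
fn drops_before_and_after_drain(n: usize) -> (usize, usize) {
    let drops = Arc::new(AtomicUsize::new(0));
    let (tx, rx) = channel();
    for i in 0..n {
        let msg = MoveRequest {
            _session_token: format!("token-{}", i),
            drops: Arc::clone(&drops),
        };
        tx.send(msg).unwrap();
    }
    drop(tx); // the "player disconnects" moment
    let before = drops.load(Ordering::SeqCst);
    // Receiving hands ownership to us; each message is dropped per iteration.
    for msg in rx.iter() {
        drop(msg);
    }
    let after = drops.load(Ordering::SeqCst);
    (before, after)
}

fn main() {
    let (before, after) = drops_before_and_after_drain(3);
    println!("dropped before drain: {}, after drain: {}", before, after);
}
```

Nothing is freed when the sender goes away; the tokens are only released once the receiver has taken each message out of the buffer.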

The fix wasn't just configuration—it was ownership. We switched to tokio::sync::mpsc::unbounded_channel with an explicit backpressure layer using Semaphore:

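use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Runs inside an async fn whose error type can hold tokio::sync::AcquireError.
let sem = Arc::new(Semaphore::new(1024));
let (tx, mut rx) = mpsc::unbounded_channel();

while let Some(player) = rx.recv().await {
    // Wait for a free slot; this only fails if the semaphore has been closed.
    let permit = sem.clone().acquire_owned().await?;
    tokio::spawn(async move {
        handle_player(player).await; // player is moved in and dropped on return
        drop(permit); // Release capacity
    });
}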

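That ended the runaway growth we had been seeing. The semaphore capped concurrency at 1,024 in-flight handlers, and because each message is moved into its task, its session token is dropped as soon as the handler returns, right before the permit is released. One caveat: a permit taken on the receiving side limits running handlers, not the queue behind them, so the queue itself is only bounded if producers are held back or rejected before they send.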

But this wasn't free. We now had to handle backpressure explicitly. Players would get Service Unavailable if the semaphore was full. We had to add telemetry:

if sem.available_permits() == 0 {
    metrics::counter!("backpressure_rejects").increment(1);
}
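One way to get the Service Unavailable behaviour described above is to try for a permit at admission time and let the permit travel with the message, so the cap covers queued and in-flight work together. The sketch below is an illustration under assumptions, not our production code: Gate and Permit are hypothetical types built on std atomics so the example runs without extra crates. In Tokio code, Semaphore::try_acquire_owned would likely fill the same role.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::channel;
use std::sync::Arc;

/// Hypothetical admission gate: a permit counter plus a reject counter.
struct Gate {
    available: AtomicUsize,
    rejects: AtomicUsize,
}

/// Held for as long as a request is queued or being handled.
struct Permit {
    gate: Arc<Gate>,
}

impl Gate {
    fn new(capacity: usize) -> Arc<Gate> {
        Arc::new(Gate {
            available: AtomicUsize::new(capacity),
            rejects: AtomicUsize::new(0),
        })
    }

    /// Takes a permit if one is free; otherwise counts a reject and returns None.
    fn try_admit(gate: &Arc<Gate>) -> Option<Permit> {
        let took = gate.available.fetch_update(
            Ordering::AcqRel,
            Ordering::Acquire,
            |n| n.checked_sub(1), // None when n == 0, so nothing is taken
        );
        match took {
            Ok(_) => Some(Permit { gate: Arc::clone(gate) }),
            Err(_) => {
                gate.rejects.fetch_add(1, Ordering::Relaxed);
                None
            }
        }
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        // Dropping the permit returns its slot to the gate.
        self.gate.available.fetch_add(1, Ordering::AcqRel);
    }
}

fn main() {
    let gate = Gate::new(2);
    let (tx, rx) = channel();
    for id in 0..3u32 {
        match Gate::try_admit(&gate) {
            Some(permit) => tx.send((id, permit)).unwrap(),
            None => println!("request {}: Service Unavailable", id),
        }
    }
    // Handling a queued request drops its permit and frees one slot.
    let (id, permit) = rx.recv().unwrap();
    println!("handled request {}", id);
    drop(permit);
    println!("slot free again: {}", Gate::try_admit(&gate).is_some());
    println!("rejects so far: {}", gate.rejects.load(Ordering::Relaxed));
}
```

Counting rejects at the point of refusal also avoids the race in polling for zero available permits, where the counter can tick without any request actually being turned away.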

What The Numbers Said After

After the switch, memory stabilized:

Metric	Before	After
Allocated heap (RSS)	4.2GB	1.8GB
GC cycles (if we'd used GC)	N/A	0
Channel latency p99	12ms	8ms
Backpressure rejections	0	23 per minute at peak

We ran a 24-hour load test with 50k simulated players. RSS never exceeded 2.1GB, and jemalloc's leak report (profiling enabled with prof_leak) showed no leaked allocations at shutdown.

The semaphore added 4ms to p99 latency when full, but we accepted that tradeoff for stability.

What I Would Do Differently

I would have started with tokio-console on day one. We wasted weeks assuming the runtime was the issue. Had we built the server with the tokio_unstable cfg flag, called console_subscriber::init() at startup, and attached the console:

tokio-console

weeks earlier, we'd have seen the backlog far sooner.

Also, I wouldn't have trusted defaults for anything involving player data. Tokio's mpsc channels have no default capacity at all: channel() makes you pick a bound, and unbounded_channel() will queue until memory runs out. Tokio's timers work at millisecond granularity, which caused jitter under load. Every default, and every missing limit, must be questioned when you're handling real players.

Finally, don't treat Rust as a silver bullet for config issues. The compiler rules out use-after-free and data races, but it makes no promise about leaks: a message parked in a queue that nobody drains is perfectly safe code, whether the queue lives in your crate or in Tokio. Configuration isn't a runtime concern. It's an ownership concern.

And never assume your game loop is safe just because you're using Rust.