pretty ncube

The Day Our Go Runtime Became The Bottleneck In A Treasure Hunt Engine

The Problem We Were Actually Solving

Late one evening, I found myself staring at a Grafana dashboard that showed our treasure hunt engine stuttering under load. We were running a 500 Mbps stream of user interactions through a Go service that was supposed to handle 10,000 concurrent sessions. Instead, p99 latency spiked to 3.2 seconds during peak traffic. The Go runtime's GC pauses were visible as sawtooth patterns in the heap graph, and our operators were reporting timeouts in the real-time leaderboard updates. We had tuned the GC, raised GOMAXPROCS, and even moved to Go 1.21, but the stalls persisted. The problem wasn't our query planning; it was the language runtime.
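For readers who want to check GC behaviour from inside a process rather than from a dashboard, here is a minimal sketch using the standard `runtime/debug` package. The `churn` and `longestRecentPause` helpers and the workload size are made up for illustration; they are not from the service described here.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

// sink keeps allocations reachable so the compiler cannot optimize them away.
var sink [][]byte

// churn (hypothetical helper) allocates about total bytes in 64 KiB chunks,
// drops them, and forces one collection so there is a GC cycle to inspect.
func churn(total int) {
	const chunk = 64 << 10
	for allocated := 0; allocated < total; allocated += chunk {
		sink = append(sink, make([]byte, chunk))
	}
	sink = nil
	runtime.GC()
}

// longestRecentPause (hypothetical helper) returns the number of completed
// collections and the longest pause in the runtime's recent pause history.
func longestRecentPause() (int64, time.Duration) {
	var stats debug.GCStats
	debug.ReadGCStats(&stats)
	var longest time.Duration
	for _, p := range stats.Pause {
		if p > longest {
			longest = p
		}
	}
	return stats.NumGC, longest
}

func main() {
	// Equivalent to starting the process with GOGC=30.
	previous := debug.SetGCPercent(30)
	defer debug.SetGCPercent(previous)

	churn(64 << 20) // illustrative workload size, not a figure from the article
	n, longest := longestRecentPause()
	fmt.Printf("collections=%d longest pause=%v GOMAXPROCS=%d\n",
		n, longest, runtime.GOMAXPROCS(0))
}
```

Logging these values periodically gives a second source of truth next to the heap graph, which helps when deciding whether a latency spike lines up with a collection.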

What We Tried First (And Why It Failed)

We started by profiling the Go service with pprof. The hot path was clear: the leaderboard update function, which performed a linear scan over 200,000 leaderboard entries to recalculate ranks. The function allocated 128 bytes per update, mostly for temporary slices. With 10,000 concurrent users, that amounted to 1.4 GB of allocations per second. The Go GC, set to GOGC=30, kicked in every 800ms, pausing all goroutines for 60-80ms. That's 60ms of blocked leaderboard updates every 800ms, which explained the p99 spikes. We tried:

  1. Increasing GOGC to 100, which reduced pause times but increased memory usage by 40%, causing out-of-memory events on our 8 GB instances.
  2. Using sync.Pool to reuse slices, which helped but introduced subtle memory leaks because we couldn't guarantee the pool's cleanup during hot restarts.
  3. Moving the leaderboard to Redis, which reduced latency but broke consistency guarantees for our real-time treasure hunt state.
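To make the second attempt concrete, here is a minimal sketch of a linear-scan rank lookup that borrows its temporary slice from a `sync.Pool`. The `Entry` type, the `rankOf` function, and the scratch buffer are assumptions for illustration; the scratch buffer stands in for whatever temporary slices the real function built.

```go
package main

import (
	"fmt"
	"sync"
)

// Entry is a hypothetical leaderboard row; the real type is not shown in the article.
type Entry struct {
	UserID int
	Score  int
}

// scratchPool hands out reusable []int buffers so the hot path does not
// allocate a fresh temporary slice on every update.
var scratchPool = sync.Pool{
	New: func() any {
		buf := make([]int, 0, 1024)
		return &buf // pool a pointer so Put does not allocate a new slice header
	},
}

// rankOf does a linear scan and returns the 1-based rank of userID
// (1 = highest score), or 0 if the user is not on the board.
func rankOf(board []Entry, userID int) int {
	score, found := 0, false
	for _, e := range board {
		if e.UserID == userID {
			score, found = e.Score, true
			break
		}
	}
	if !found {
		return 0
	}

	bufPtr := scratchPool.Get().(*[]int)
	higher := (*bufPtr)[:0] // reset length, keep capacity
	for _, e := range board {
		if e.Score > score {
			higher = append(higher, e.Score)
		}
	}
	rank := len(higher) + 1

	*bufPtr = higher // keep any growth for the next caller
	scratchPool.Put(bufPtr)
	return rank
}

func main() {
	board := []Entry{{1, 50}, {2, 90}, {3, 70}}
	fmt.Println(rankOf(board, 3)) // prints 2: only the score of 90 is higher
}
```

One caveat worth knowing: `sync.Pool` may drop pooled objects at any garbage collection, so it reduces allocation pressure but cannot be treated as a cache with a guaranteed lifetime.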

The Architecture Decision

We decided to rewrite the leaderboard in Rust. Not because Go is slow, but because the Go runtime's GC was the constraint, and Rust's zero-cost abstractions let us control memory layout precisely. We chose Rust for the leaderboard only, keeping the Go service for session management, where goroutines fit naturally. The Rust leaderboard used a B-tree for rank storage and a custom arena allocator to avoid allocations during updates. We benchmarked:

  • Allocations: 0 bytes per leaderboard update (arena reuse)
  • Latency p50: 1.2ms (vs 1.8ms in Go)
  • Latency p99: 4.1ms (vs 3.2s in Go, and with a consistent tail rather than a sawtooth)
  • Memory usage: 2.1 GB total (Go + Rust combined), down from 3.4 GB with Go alone

The pprof traces after the rewrite showed 95% of CPU time in the Rust leaderboard, not in GC. The Go service's GC pauses dropped from 60ms to 3ms.
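The core idea behind the arena, allocating all storage once and updating it in place, can be sketched in a few lines. This is not the Rust implementation: it is written in Go to match the rest of the post, and it swaps the B-tree for a preallocated sorted slice to stay short. The `Board` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"testing"
)

// Entry is a hypothetical leaderboard row.
type Entry struct {
	UserID int
	Score  int
}

// Board keeps entries sorted by descending score in storage that is
// allocated once, up front. Updates only move entries within that storage.
type Board struct {
	entries []Entry
}

// NewBoard allocates all storage for at most capacity players.
func NewBoard(capacity int) *Board {
	return &Board{entries: make([]Entry, 0, capacity)}
}

// Update sets userID's score and returns the new 1-based rank.
// It returns 0 if the user is new and the board is already full.
func (b *Board) Update(userID, score int) int {
	// Find the user's current slot with a linear scan (kept simple on purpose).
	old := -1
	for i := range b.entries {
		if b.entries[i].UserID == userID {
			old = i
			break
		}
	}
	if old < 0 {
		if len(b.entries) == cap(b.entries) {
			return 0
		}
		b.entries = b.entries[:len(b.entries)+1] // grow within capacity
		old = len(b.entries) - 1
	}

	// Close the old slot by shifting the tail left; entries[0:n] stay sorted.
	copy(b.entries[old:], b.entries[old+1:])
	n := len(b.entries) - 1

	// Binary search for the first entry with a lower score than the new one.
	lo, hi := 0, n
	for lo < hi {
		mid := (lo + hi) / 2
		if b.entries[mid].Score < score {
			hi = mid
		} else {
			lo = mid + 1
		}
	}

	// Open a gap at lo and write the updated entry into it.
	copy(b.entries[lo+1:], b.entries[lo:n])
	b.entries[lo] = Entry{UserID: userID, Score: score}
	return lo + 1
}

func main() {
	b := NewBoard(4)
	b.Update(1, 50)
	b.Update(2, 90)
	b.Update(3, 70)
	fmt.Println(b.Update(1, 80)) // prints 2: user 1 now sits behind the 90

	// testing.AllocsPerRun reports the average heap allocations per call.
	allocs := testing.AllocsPerRun(1000, func() { b.Update(3, 75) })
	fmt.Println("allocations per update:", allocs)
}
```

The shifting makes each update linear in the number of entries, which is the cost a B-tree avoids. The sketch is only meant to show that updates against preallocated storage give the garbage collector nothing new to track.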

What The Numbers Said After

We ran a controlled experiment: 10,000 simulated users for 60 minutes. The results were stark:

| Metric | Go Service Only | Hybrid (Go + Rust) |
| --- | --- | --- |
| p99 latency | 3.2s | 4.1ms |
| Memory max RSS | 3.4 GB | 2.1 GB |
| GC pauses/sec | 75 | 3 |
| Revenue impact | -12% | 0% |

The Rust leaderboard handled 20,000 updates per second with zero allocations. The Go service, now freed from the GC burden, could focus on session management. Our operators stopped getting alerts about leaderboard timeouts.

What I Would Do Differently

If I could change one thing, it would be the communication between Go and Rust. We used gRPC for the interface, which added 0.8ms of serialization overhead per update. A shared-memory ring buffer would have cut that to 0.1ms, but we ruled it out because of deployment complexity. In hindsight, avoiding that complexity wasn't worth the extra latency; we should have built the shared ring buffer from the start.
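For a sense of what the ring buffer involves, here is a minimal single-producer, single-consumer sketch. It runs inside one Go process; a real Go-to-Rust version would place the same layout in a memory-mapped region and agree on it across both languages, which is not shown. The `Update` message and `Ring` type are assumptions for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// Update is a hypothetical fixed-size message. A fixed layout is what would
// let two processes exchange it without a serialization step.
type Update struct {
	UserID uint64
	Score  int64
}

// Ring is a single-producer, single-consumer queue over a fixed array.
// head and tail only increase; the slot index is the counter masked by size-1.
type Ring struct {
	slots []Update
	mask  uint64
	head  atomic.Uint64 // next slot to read, advanced only by the consumer
	tail  atomic.Uint64 // next slot to write, advanced only by the producer
}

// NewRing creates a ring; size must be a power of two.
func NewRing(size uint64) *Ring {
	if size == 0 || size&(size-1) != 0 {
		panic("size must be a power of two")
	}
	return &Ring{slots: make([]Update, size), mask: size - 1}
}

// Push adds u and reports false if the ring is full. Producer side only.
func (r *Ring) Push(u Update) bool {
	tail := r.tail.Load()
	if tail-r.head.Load() == uint64(len(r.slots)) {
		return false
	}
	r.slots[tail&r.mask] = u
	r.tail.Store(tail + 1) // publish only after the slot is written
	return true
}

// Pop removes the oldest update and reports false if the ring is empty.
// Consumer side only.
func (r *Ring) Pop() (Update, bool) {
	head := r.head.Load()
	if head == r.tail.Load() {
		return Update{}, false
	}
	u := r.slots[head&r.mask]
	r.head.Store(head + 1) // hand the slot back to the producer
	return u, true
}

func main() {
	r := NewRing(8)
	done := make(chan struct{})
	go func() { // producer
		for i := 1; i <= 100; i++ {
			for !r.Push(Update{UserID: uint64(i), Score: int64(i * 10)}) {
				runtime.Gosched() // ring full, let the consumer run
			}
		}
		close(done)
	}()

	var sum int64
	for received := 0; received < 100; {
		if u, ok := r.Pop(); ok {
			sum += u.Score
			received++
		} else {
			runtime.Gosched() // ring empty, let the producer run
		}
	}
	<-done
	fmt.Println(sum) // prints 50500, i.e. 10 * (1 + 2 + ... + 100)
}
```

The deployment complexity mentioned above comes from everything around this loop: creating and sizing the shared region, handling a restart of either process, and keeping the message layout identical on both sides.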

Also, I'd avoid Rust for the session manager. That part of the system is I/O-bound, not memory-bound. Go's goroutines are perfect for this, and rewriting it would have been premature optimization. The lesson is: use the right tool for the constraint, not the constraint for the tool.
