The Problem We Were Actually Solving
We weren't trying to build the fastest event system in the world. What we needed was an engine that could absorb 50k RPS of player actions during the gala's treasure hunt without dropping a single event or exceeding 5ms of latency at the 99.9th percentile. The Go service (built on Gin, a single Redis cluster, and a PostgreSQL write-through cache) had started life blisteringly fast. Under 10k RPS it handled 200k events/sec with 1.2ms median latency and 0% drops.
Then the gala invite went viral. Not the official one, but the TikTok influencer thread that promised real loot for the first 100k players. By 2am the day before the event, we were seeing spikes to 75k RPS. The mutex around the event deduplication cache became a brick wall. Worse, GC pauses of up to 50ms were injecting latency spikes that violated the 5ms SLO. We tried sharding the cache across 8 Redis instances, but cross-shard writes introduced 8ms of latency on hot paths. Increasing the cache TTL to reduce Redis load only made deduplication errors spike from 0.3% to 2.1% because players could replay actions within the new window.
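To make the contention point concrete, here is a minimal sketch of a TTL-windowed dedup cache behind a single mutex. It is written in Rust for consistency with the rest of this post, not lifted from the Go service; `DedupCache` and `first_seen` are hypothetical names, and the real cache sat in front of Redis rather than an in-process map.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical TTL-windowed dedup cache. Every caller takes the same
/// mutex, which is the serialization point described above.
pub struct DedupCache {
    ttl: Duration,
    seen: Mutex<HashMap<u64, Instant>>,
}

impl DedupCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, seen: Mutex::new(HashMap::new()) }
    }

    /// Returns true if `event_id` has not been seen within the TTL window
    /// ending at `now`, and records it. Returns false for a duplicate.
    /// `now` is passed in so the behavior is deterministic in tests.
    pub fn first_seen(&self, event_id: u64, now: Instant) -> bool {
        // All request threads queue here under load.
        let mut seen = self.seen.lock().unwrap();
        let duplicate = seen
            .get(&event_id)
            .map_or(false, |&at| now.duration_since(at) < self.ttl);
        if duplicate {
            return false;
        }
        // New event, or the previous entry has aged out of the window.
        seen.insert(event_id, now);
        true
    }
}
```

Every call goes through the one lock regardless of which event ID it touches, so throughput is capped by how fast a single core can cycle the mutex. The sketch also shows that the TTL defines what counts as a duplicate, so changing it changes dedup behavior as well as cache load.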
What We Tried First (And Why It Failed)
We bolted on a C++ shim using libuv and hired a contractor who swore he could squeeze 100k RPS out of a single thread. The shim lived behind a gRPC endpoint and bypassed the Go cache entirely for simple actions. It worked, on his machine. In production, under 55k RPS, the shim's shared_ptr reference counting became a bottleneck. The profiler showed 18% of CPU time in std::shared_ptr::release, and 9% in atomic ops fighting cache line bouncing. The latency histogram turned bimodal: 1.3ms median for 80% of traffic, but 42ms for the unlucky 20% that hit the reference counter's bottleneck.
Next we tried rewriting the deduplication layer in Zig, betting on comptime to eliminate the allocator overhead that was amplifying GC pressure. The Zig version ran clean at 70k RPS with sub-millisecond latency, but only when we pre-allocated every arena. Once the arena filled, the custom allocator panicked with out-of-memory, and the panic handler took 200ms to unwind, a non-starter for a live event.
The Architecture Decision
By June 2025, we faced a brutal choice: keep patching the Go monolith until it collapsed under GC pressure, or rip it out entirely and bet the company on Rust. We chose Rust, not because it was trendy, but because we could build a lock-free event pipeline that owned its own memory, eliminated GC pauses, and gave us fine-grained control over cache line layout.
We rebuilt the core in Rust using tokio, but not the default runtime. We rolled our own runtime that pinned every shard to a dedicated CPU core and used a work-stealing scheduler only for cross-shard fanout. The event deduplication cache became a lock-free LruCache based on crossbeam-epoch, with each shard backed by an mpsc channel that fed into a single Redis pipeline. No GC, no arena thrash, no shared_ptr overhead.
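The shard-ownership idea is easier to see in code than in prose. What follows is a deliberately simplified sketch using only the standard library: plain threads and std::sync::mpsc stand in for our tokio-based runtime, a HashSet stands in for the lock-free LruCache, and a counter stands in for the Redis pipeline. There is no core pinning here, and `ShardedDedup`, `submit`, and `finish` are hypothetical names, not our production API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

/// Hypothetical sharded dedup stage: each shard is one thread that
/// exclusively owns its "seen" set, so no lock guards the dedup state.
pub struct ShardedDedup {
    senders: Vec<mpsc::Sender<u64>>,
    workers: Vec<thread::JoinHandle<usize>>,
}

impl ShardedDedup {
    pub fn spawn(shards: usize) -> Self {
        assert!(shards > 0, "need at least one shard");
        let mut senders = Vec::with_capacity(shards);
        let mut workers = Vec::with_capacity(shards);
        for _ in 0..shards {
            let (tx, rx) = mpsc::channel::<u64>();
            senders.push(tx);
            workers.push(thread::spawn(move || {
                let mut seen = HashSet::new();
                let mut forwarded = 0usize;
                // The loop ends once every sender has been dropped.
                for event_id in rx {
                    if seen.insert(event_id) {
                        // A real shard would append to its Redis pipeline here.
                        forwarded += 1;
                    }
                }
                forwarded
            }));
        }
        Self { senders, workers }
    }

    /// Routes an event to the shard that owns its ID. The same ID always
    /// hashes to the same shard, which is what keeps dedup correct.
    pub fn submit(&self, event_id: u64) {
        let mut hasher = DefaultHasher::new();
        event_id.hash(&mut hasher);
        let shard = (hasher.finish() as usize) % self.senders.len();
        self.senders[shard].send(event_id).expect("shard worker exited");
    }

    /// Closes the channels and returns how many unique events were forwarded.
    pub fn finish(self) -> usize {
        let ShardedDedup { senders, workers } = self;
        drop(senders);
        workers.into_iter().map(|w| w.join().unwrap()).sum()
    }
}
```

Submitting the IDs 1, 2, 2, 3, 1 to a four-shard instance and calling `finish` reports three forwarded events. The only shared structure is the channel; the dedup state itself is never touched by more than one thread.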
Explicit control over struct layout and alignment let us keep each shard's hot data on its own cache line to avoid false sharing. We measured cache misses with perf stat: before Rust, 42% of L1 misses were due to mutex contention; after Rust, cache misses dropped to 8% and the mutex was gone entirely.
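For readers who have not fought false sharing before, here is a minimal sketch of the layout trick. It assumes 64-byte cache lines, which is typical on x86-64 but not universal; `CachePadded` here is a hand-rolled stand-in (the crossbeam-utils crate ships a more portable type of the same name), and `ShardCounters` is a hypothetical example struct rather than one of ours.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Forces the wrapped value onto its own 64-byte-aligned slot.
/// ASSUMPTION: 64-byte cache lines; some CPUs use 128.
#[repr(align(64))]
pub struct CachePadded<T>(pub T);

/// Two counters updated by different threads. Without the padding they
/// would share one cache line, and every increment on one core would
/// invalidate the line on the other.
pub struct ShardCounters {
    pub accepted: CachePadded<AtomicU64>,
    pub deduplicated: CachePadded<AtomicU64>,
}

impl ShardCounters {
    pub fn new() -> Self {
        Self {
            accepted: CachePadded(AtomicU64::new(0)),
            deduplicated: CachePadded(AtomicU64::new(0)),
        }
    }

    /// Bumps exactly one counter; Relaxed is enough for plain statistics.
    pub fn record(&self, was_duplicate: bool) {
        let counter = if was_duplicate {
            &self.deduplicated
        } else {
            &self.accepted
        };
        counter.0.fetch_add(1, Ordering::Relaxed);
    }
}
```

The cost is memory: each 8-byte counter now occupies a full 64 bytes, so `ShardCounters` is 128 bytes instead of 16. That trade only pays off for data written frequently from different cores.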
What The Numbers Said After
We redeployed the Rust pipeline two weeks before the 2026 gala. The load test that had previously cratered at 50k RPS now handled 120k RPS with 0% drops and 99.9th percentile latency of 3.4ms. The allocation profile showed 18MB/s of heap pressure compared to Go's 142MB/s under load. GC pauses disappeared entirely; the longest observed latency spike was 1.2ms from a single eviction in the LruCache.
During the actual gala, we recorded 87k RPS peak and 2.3 million unique events with zero data loss. The Redis cluster breathed easy because the Rust shards pre-deduplicated 68% of events before they hit Redis. Our SLO burn rate stayed green the entire night.
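Since the whole story hinges on a 99.9th percentile target, it is worth spelling out how such a check can be computed from raw samples. This is a small sketch using the nearest-rank method on latencies in microseconds. It is not our metrics exporter; `percentile_per_mille` and `meets_slo` are hypothetical helpers, and production systems usually use histograms instead of sorting every sample.

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// `per_mille` is the percentile in tenths of a percent (999 = p99.9),
/// which keeps the rank arithmetic in integers.
pub fn percentile_per_mille(samples_us: &mut [u64], per_mille: u64) -> Option<u64> {
    if samples_us.is_empty() || per_mille > 1000 {
        return None;
    }
    samples_us.sort_unstable();
    let n = samples_us.len() as u64;
    // rank = ceil(n * per_mille / 1000), clamped to at least 1.
    let rank = ((n * per_mille + 999) / 1000).max(1);
    Some(samples_us[(rank - 1) as usize])
}

/// True when p99.9 is at or below the SLO threshold (5ms = 5_000us).
/// An empty sample set does not count as meeting the SLO.
pub fn meets_slo(samples_us: &mut [u64], slo_us: u64) -> bool {
    matches!(percentile_per_mille(samples_us, 999), Some(p) if p <= slo_us)
}
```

With 1,000 samples, p99.9 is the 999th smallest value, so a single outlier can sit above the threshold without breaking the SLO, but two cannot. That is why the bimodal latency we saw with the C++ shim was disqualifying even though its median looked fine.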
What I Would Do Differently
Looking back, Rust saved our event pipeline, but the learning curve was steeper than the drop from 42% CPU in a mutex to zero. We underestimated how much unsafe code we'd need to write when interfacing with the Zig allocator for hot reloads. That led to a 30-minute outage during the beta when a race in the FFI bridge corrupted a shard's event buffer.
I would have isolated the FFI boundary inside a dedicated crate from day one and written property-based tests with proptest to hammer the interface. We also overcommitted to pinning every shard to a core. During noisy neighbor incidents on the metal host, we saw core migration stalls that added 8ms to tail latency. Now we use cgroups to reserve cores but allow the scheduler to migrate threads when the host is under memory pressure.
Most critically, we should have instrumented the Rust pipeline with the same fidelity as the Go one before we cut over. We had to retrofit a custom metrics exporter halfway through load testing because our Prometheus labels didn't survive the Rust refactor. That omission cost us six hours of debugging during the final rehearsal.