The Night Our Event Pipeline Crashed Because We Didn't Measure Memory First

#webdev #programming #rust #performance

The Problem We Were Actually Solving

It started with a scream from the observability dashboard. At 02:47 on a Sunday, our event ingestion pipeline hit 98% memory usage and refused to accept new events. The Go runtime we'd trusted for three years suddenly looked like a liability. The profiler showed 4.2 GB of heap allocated for just 18 million events in the last 30 minutes, numbers that would have been acceptable if each event didn't carry four nested structs of metadata. Our vertical scaling limit was 32 GB of RAM, and we were sitting within 800 MB of it. GC pauses climbed to 200 ms during peak load, which meant events arrived faster than we could acknowledge them, and we dropped the difference.

What We Tried First (And Why It Failed)

We tried three band-aids before realizing they were all symptoms of the same disease. First: we increased worker concurrency from 16 to 32, which doubled our CPU usage and made the GC pauses worse. Second: we added a Redis-backed buffer to smooth traffic, but each event serialization added 1.4 microseconds of latency and introduced another failure point when Redis memory spiked to 95%. Third: we tried tuning GC parameters, setting GOGC=50 and GOMEMLIMIT=28GiB, but the heap still grew uncontrollably because we were allocating temporary slices in hot paths.

The Architecture Decision

We rewrote the event processor in Rust. Not because we loved Rust, but because valgrind --tool=massif showed Go allocating 780 bytes per event unnecessarily. We chose Tokio as the runtime but disabled the work-stealing scheduler to reduce contention, and moved all event metadata into a single Arc<Event>, which cut allocations by 68%. We used crossbeam-channel for communication and capped each queue at 1024 events to bound memory growth. The most painful decision was rejecting the serde_json crate in favor of hand-rolled parsing and serialization built on memchr and itoa. We saved 400 nanoseconds per event but lost ergonomics. We accepted the trade-off because our SLA required 99.99th percentile latency under 2 milliseconds.
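To make the Arc<Event> and bounded-queue ideas concrete, here is a minimal sketch, not our production code. It uses only the standard library so it runs without dependencies: std::sync::mpsc::sync_channel stands in for a bounded crossbeam-channel queue, and a plain thread stands in for a Tokio task. The Event fields, the Totals struct, and run_pipeline are hypothetical names invented for the example; only the 1024-event bound comes from the design above.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;

/// Hypothetical event shape: all metadata sits in one struct behind one Arc.
struct Event {
    id: u64,
    tags: Vec<String>,
}

/// What the consumer observed (illustrative only).
struct Totals {
    processed: u64,
    tags_seen: usize,
    id_sum: u64,
}

/// Queue bound from the design above: at most 1024 events in flight.
const QUEUE_CAPACITY: usize = 1024;

/// Pushes `event_count` events through a bounded queue and reports what
/// the consumer saw.
fn run_pipeline(event_count: u64) -> Totals {
    let (tx, rx) = sync_channel::<Arc<Event>>(QUEUE_CAPACITY);

    let consumer = thread::spawn(move || {
        let mut totals = Totals { processed: 0, tags_seen: 0, id_sum: 0 };
        for event in rx {
            // Handing the event to a second stage only bumps a reference
            // count; the metadata itself is not copied.
            let for_audit = Arc::clone(&event);
            debug_assert!(Arc::ptr_eq(&event, &for_audit));
            totals.tags_seen += for_audit.tags.len();
            totals.id_sum += event.id;
            totals.processed += 1;
        }
        totals
    });

    for id in 0..event_count {
        let event = Arc::new(Event {
            id,
            tags: vec![String::from("demo")],
        });
        // send() blocks while the queue is full, so a slow consumer
        // applies backpressure instead of letting memory grow.
        tx.send(event).expect("consumer stopped early");
    }
    drop(tx); // closing the channel ends the consumer loop

    consumer.join().expect("consumer panicked")
}

fn main() {
    let totals = run_pipeline(5_000);
    println!(
        "processed {} events, read {} tags, id sum {}",
        totals.processed, totals.tags_seen, totals.id_sum
    );
}
```

The point of the bound is that the producer stalls rather than buffering without limit: queue memory is capped at roughly the capacity times the event size, whatever the traffic looks like.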

What The Numbers Said After

After the rewrite, memory usage stabilized at 1.2 GB for the same 18 million events. The 200 ms GC pauses disappeared because Rust has no garbage collector; what remained was Tokio's task switching, at about 30 microseconds. The 99.99th percentile latency dropped from 2.4 milliseconds to 0.8 milliseconds. Tools told the story: perf stat showed 2.3x fewer cache misses, and jemalloc reported zero fragmentation after two weeks of uptime. We also saved 40% on cloud bills because we shrank instances from c6g.4xlarge (16 vCPU, 32 GB) to c6g.2xlarge (8 vCPU, 16 GB). The only regression was binary size: the binary grew from 18 MB to 24 MB, but we accepted that because deployment frequency dropped by 40% after we stopped panicking at 3 AM.

What I Would Do Differently

I would have measured memory allocation patterns before touching concurrency. We assumed Go would handle our event shapes because it is memory-safe and garbage-collected, but memory safety says nothing about how much memory you allocate, and our event shapes allocated a lot. If I could go back, I'd write a memory-bound microbenchmark on day one, using criterion.rs on the Rust side and go test -bench with -benchmem on the Go side, to compare Rust's Vec against Go's slices. I'd also resist the temptation to optimize before measuring; Rust taught me that safety and performance are not mutually exclusive, but they require different tooling. On the other hand, I would not choose Rust for every subsystem. Our metadata API still runs in Go because the team productivity gain outweighs the 8% latency cost. Context matters, and Rust shines when the bottleneck is memory, not logic.
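As a rough illustration of what "measure allocations first" could look like, here is a sketch that counts heap allocations with a wrapper around the system allocator, using only the standard library. It is not the criterion.rs benchmark itself, and everything in it is an assumption made for the example: CountingAlloc, allocs_during, process_fresh, and process_reused are hypothetical names, and the 256-byte scratch buffer is an arbitrary size, not our real event size.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper that counts every allocation the program makes.
struct CountingAlloc;

static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);
static ALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `work` and returns (allocation calls, bytes requested) made meanwhile.
/// Counts are process-wide, so other threads can add noise.
fn allocs_during<F: FnOnce()>(work: F) -> (usize, usize) {
    let calls_before = ALLOC_CALLS.load(Ordering::Relaxed);
    let bytes_before = ALLOC_BYTES.load(Ordering::Relaxed);
    work();
    (
        ALLOC_CALLS.load(Ordering::Relaxed) - calls_before,
        ALLOC_BYTES.load(Ordering::Relaxed) - bytes_before,
    )
}

/// Hot path that allocates a fresh scratch buffer for every event.
fn process_fresh(events: usize) -> usize {
    let mut total = 0;
    for i in 0..events {
        let mut scratch: Vec<u8> = Vec::with_capacity(256);
        scratch.extend_from_slice(&(i as u64).to_le_bytes());
        total += black_box(&scratch).len();
    }
    total
}

/// Same work, but one scratch buffer is cleared and reused across events.
fn process_reused(events: usize) -> usize {
    let mut scratch: Vec<u8> = Vec::with_capacity(256);
    let mut total = 0;
    for i in 0..events {
        scratch.clear();
        scratch.extend_from_slice(&(i as u64).to_le_bytes());
        total += black_box(&scratch).len();
    }
    total
}

fn main() {
    let (fresh_calls, fresh_bytes) = allocs_during(|| {
        process_fresh(1_000);
    });
    let (reused_calls, reused_bytes) = allocs_during(|| {
        process_reused(1_000);
    });
    println!("fresh:  {fresh_calls} allocations, {fresh_bytes} bytes");
    println!("reused: {reused_calls} allocations, {reused_bytes} bytes");
}
```

Running something like this against a realistic event shape on day one would have shown the per-event allocation cost long before the pager did. A counter like this only shows how many allocations happen, not where they come from, so it complements a real profiler rather than replacing one.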

