The Moment We Killed Ourselves with Default Ticks: How Veltrix's Clock Blew Up at 170K QPS

#webdev #programming #rust #performance

The Problem We Were Actually Solving

In late 2024 we shipped a treasure-hunt engine that used Veltrix 3.2 with the default configuration. Players loved the real-time leaderboard until the first traffic spike after a marketing push. Load balancers showed 170K QPS incoming, Prometheus screamed at 98% heap usage, and the leadership meeting began at 11:37 p.m. because Grafana decided to tick every 100 ms instead of every 1 s. The real problem wasn't capacity; it was the default 10 ms tick Veltrix inherited from its telemetry crate. Every tick triggered a heap snapshot, a trace flush, and a full GC cycle. A 10 ms tick fires 100 times per second, so at 170K QPS that was 17 million snapshots per second, more than the number of searches Google is commonly estimated to process in a minute.
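The 17 million figure is easier to check when the arithmetic is written out. The sketch below assumes the snapshot count scales as requests per second times ticks per second, which is the only reading that reproduces the number in the text. The function names are made up for illustration and are not part of Veltrix.

```rust
/// Number of ticks fired per second for a tick period given in milliseconds.
/// Panics on a zero period, which would be a configuration error anyway.
fn ticks_per_second(tick_ms: u64) -> u64 {
    1_000 / tick_ms
}

/// Snapshots per second, ASSUMING every tick produces one snapshot per
/// request in flight (hypothetical model, chosen to match the article's
/// 17 million figure).
fn snapshots_per_second(qps: u64, tick_ms: u64) -> u64 {
    qps * ticks_per_second(tick_ms)
}
```

Under the same assumption, calling `snapshots_per_second(170_000, 1_000)` for a 1 s tick gives 170,000, a hundredfold reduction from the 10 ms default.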

What We Tried First (And Why It Failed)

We bolted on vertical scaling first: we added 64 GB of RAM to each pod, set GOMEMLIMIT=61GiB, and hoped the GC pauses would smooth out. That only made things worse: the Go runtime would pause for 400 ms every 5 s because the heap had grown so large. Next we tried turning off tracing altogether with VELTRIX_TRACE=off, but the admin UI still pulled in the same telemetry crate and re-exported it as a minimal profile, so the ticks kept coming. We patched the crate to remove the export, rebuilt, and watched Prometheus show a 12% drop in allocations, which still left 3.2 GB of allocations for a 12 KB file. The runtime itself was the constraint; the code was innocent.

The Architecture Decision

In January 2025 we rewrote the real-time layer in Rust using Tokio with a custom histogram crate that bypassed the telemetry tick entirely. The critical change wasn't the language; it was the removal of the default 10 ms telemetry heartbeat. We replaced it with a 1 s telemetry snapshot driven by a tokio::time::interval, but only when the load exceeded 80% of baseline; otherwise the interval never fired. We also sharded the leaderboard into 64 separate radix trees so that a single stall couldn't block the whole write path. The Rust binary was 300 KB versus 32 MB for the Go binary, but the real win was the 40 MB heap ceiling we enforced with jemalloc arenas, with Miri used to check memory safety.
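The load gate and the 64-way sharding can be sketched with the standard library alone. This is a minimal illustration, not the production code: `should_snapshot`, `Leaderboard`, and the hash-modulo shard choice are assumptions, and a `BTreeMap` stands in for the radix trees. In a Tokio service the gate would be checked on each tick of `tokio::time::interval(Duration::from_secs(1))`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const SHARDS: usize = 64;
const LOAD_THRESHOLD: f64 = 0.80; // snapshot only above 80% of baseline

/// Returns true when a telemetry snapshot should be taken on this tick.
/// (Hypothetical helper; the gating rule comes from the text, the name does not.)
fn should_snapshot(current_qps: f64, baseline_qps: f64) -> bool {
    baseline_qps > 0.0 && current_qps > LOAD_THRESHOLD * baseline_qps
}

/// Leaderboard split into independently locked shards, so a writer holding
/// one shard's lock never blocks writers to the other 63.
/// ASSUMPTION: BTreeMap is a stand-in for the radix trees in the article.
struct Leaderboard {
    shards: Vec<Mutex<BTreeMap<String, u64>>>,
}

impl Leaderboard {
    fn new() -> Self {
        let shards = (0..SHARDS).map(|_| Mutex::new(BTreeMap::new())).collect();
        Leaderboard { shards }
    }

    /// Picks a shard by hashing the player name (assumed sharding key).
    fn shard_index(player: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        player.hash(&mut hasher);
        (hasher.finish() % SHARDS as u64) as usize
    }

    /// Records a score, keeping only the player's best result.
    fn record(&self, player: &str, score: u64) {
        let mut shard = self.shards[Self::shard_index(player)].lock().unwrap();
        let best = shard.entry(player.to_string()).or_insert(0);
        if score > *best {
            *best = score;
        }
    }

    /// Returns the player's best score, if any has been recorded.
    fn best(&self, player: &str) -> Option<u64> {
        let shard = self.shards[Self::shard_index(player)].lock().unwrap();
        shard.get(player).copied()
    }
}
```

One caveat with this form of gating: the timer still wakes once per second and the snapshot is simply skipped when load is low. A timer that truly never fires would have to be created and dropped as load crosses the threshold.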

What The Numbers Said After

After the rewrite we ran identical load tests with hey at 200K QPS for 30 minutes. Flamegraphs showed a 99% reduction in heap allocations: 42 MB total vs. 3.8 GB. Pauses dropped from 400 ms to 3 ms at p99. Latency p99 went from 280 ms to 42 ms, and the 0.1 ms tail vanished, with no more ~80 µs outliers from the Go GC assist. The binary started 220 ms faster and used 38% less RSS on k6. The only surprise was a Rust panic when an async block leaked a future; the runtime caught it in catch_unwind, logged a single line, and kept running, whereas an unrecovered panic in a Go goroutine takes down the whole process.

What I Would Do Differently

I would not have trusted the marketing-friendly telemetry crate in the first place. The real architecture mistake wasn't Go; it was the assumption that a crate labeled production-ready wouldn't murder your heap at scale. If I had to do it again, I would fork the crate on day one and rip out the tick, then write a tiny Rust histogram from scratch. The learning curve of Rust's ownership model is real (spending three weeks debugging a double-free in a channel led to a rewrite of the entire fan-out system), but the moment the memory cliff disappeared and the p99 latency dropped, the cost was paid in full. The next time someone tells me a language is the performance bottleneck, I'll ask which minute in the profile contains the evidence.
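For a sense of how small a from-scratch histogram can be, here is a minimal fixed-bucket sketch. It is an illustration only, not the crate we shipped: the 1 ms bucket width, the overflow bucket, and the `Histogram` API are all assumptions. Recording a sample is a single array increment with no allocation, which is the property that matters on a hot path.

```rust
/// Fixed-bucket latency histogram: bucket `i` counts samples of `i` ms,
/// and the last bucket absorbs every sample at or above `max_ms`.
/// (Hypothetical design; bucket width and API are illustrative assumptions.)
struct Histogram {
    buckets: Vec<u64>,
    total: u64,
}

impl Histogram {
    /// Allocates all buckets up front so `record` never allocates.
    fn new(max_ms: usize) -> Self {
        Histogram { buckets: vec![0; max_ms + 1], total: 0 }
    }

    /// Records one latency sample, clamping slow outliers into the last bucket.
    fn record(&mut self, latency_ms: u64) {
        let idx = (latency_ms as usize).min(self.buckets.len() - 1);
        self.buckets[idx] += 1;
        self.total += 1;
    }

    /// Returns the smallest bucket (in ms) at or below which at least `pct`
    /// percent of the samples fall, or None if nothing has been recorded.
    fn percentile(&self, pct: u64) -> Option<u64> {
        if self.total == 0 {
            return None;
        }
        // Integer ceiling of pct% of the sample count, at least one sample.
        let target = ((pct * self.total + 99) / 100).max(1);
        let mut seen: u64 = 0;
        for (ms, count) in self.buckets.iter().enumerate() {
            seen += *count;
            if seen >= target {
                return Some(ms as u64);
            }
        }
        None
    }
}
```

With one sample recorded for each latency from 1 ms to 100 ms, `percentile(99)` returns `Some(99)`. The trade-off is resolution: anything slower than `max_ms` is reported as `max_ms`, so the ceiling should sit comfortably above the latencies you expect to see.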
