The Day Our Event Bus Became the Constraint at 100k Events/Second

#webdev #programming #rust #performance

The Problem We Were Actually Solving

Our team built the leaderboard for a global treasure hunt game. Players generated 100 million events per day (GPS pings, item pickups, code scans), each tagged with a player UUID and a timestamp. The goal was to update the real-time leaderboard and push results to 200k concurrent web clients within 500 ms. We chose Veltrix because the docs promised exactly-once semantics, 100k messages/sec per broker, and horizontal scaling with no tuning.

The first week went fine. Then, on Black Friday, the event rate spiked to 180k/sec and the Veltrix brokers started dropping messages. The Veltrix CLI showed 100 percent CPU on the ingestion tier, and the latency histogram for the leaderboard_update service showed a 95th percentile of 1.2 seconds. That was the moment I realised that the language and runtime of our ingestion worker (Go 1.18 with the standard Veltrix Go client) had become the constraint, not the brokers themselves.

What We Tried First (And Why It Failed)

We first tried horizontal scaling. We added three more Veltrix broker nodes and rebalanced the partitions. The disk usage dropped for 15 minutes, then climbed again as the new brokers hit the same internal lock contention. The Go runtime's default poller was blocking on socket reads, and the Veltrix client was spending 42 percent of wall time in GC pauses. Profiling with pprof showed 3.1 million allocations per second in the event deserialization path. The Go GC, tuned for steady-state throughput, couldn't keep up with the spiky Black Friday load.

Next we tried tuning Veltrix. We set num.io.threads=16, num.network.threads=8, and bumped log.flush.interval.messages from 10k to 50k. The disk usage dropped, but latency for the producer went from 12 ms to 80 ms because the brokers were now flushing less often. Clients started timing out. We even tried disabling acks=all to reduce network hops, but that introduced duplicate messages that corrupted the leaderboard.

The Architecture Decision

At 3 AM I made the call: rewrite the ingestion worker in Rust and replace the Go Veltrix client with rdkafka. The Rust version used tokio with a custom event ring buffer that pre-allocated 16 MB chunks and reused them via ArcSwap. Dropping the Veltrix Go client took its internal mutexes out of the hot path, and we replaced the blocking producer with an async send that yielded on backpressure. The Rust runtime also gave us fine-grained control over the socket poller, so we pinned it to a dedicated core using taskset.
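The two ideas in that paragraph, reusing pre-allocated chunks and handing work back to the caller under backpressure, can be sketched with the standard library alone. This is an illustration, not our production code: the real worker used tokio, rdkafka, and ArcSwap, while here a bounded std channel stands in for the async producer queue, and the names ChunkPool and offer are hypothetical.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Fixed set of pre-allocated byte buffers that are recycled instead of freed.
/// (Hypothetical type; the production worker used 16 MB chunks behind ArcSwap.)
pub struct ChunkPool {
    free: Vec<Vec<u8>>,
}

impl ChunkPool {
    /// Allocates every chunk once, up front.
    pub fn new(chunks: usize, chunk_size: usize) -> Self {
        let free = (0..chunks)
            .map(|_| Vec::with_capacity(chunk_size))
            .collect();
        ChunkPool { free }
    }

    /// Takes a chunk, or returns None when the pool is exhausted,
    /// which is the caller's signal to slow down.
    pub fn acquire(&mut self) -> Option<Vec<u8>> {
        self.free.pop()
    }

    /// Returns a chunk to the pool. `clear` resets the length but keeps
    /// the capacity, so the next user does not reallocate.
    pub fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }

    /// Number of chunks currently available.
    pub fn available(&self) -> usize {
        self.free.len()
    }
}

/// Non-blocking send: on a full (or closed) queue the chunk is handed back
/// to the caller instead of blocking the thread.
pub fn offer(tx: &SyncSender<Vec<u8>>, buf: Vec<u8>) -> Result<(), Vec<u8>> {
    tx.try_send(buf).map_err(|e| match e {
        TrySendError::Full(b) | TrySendError::Disconnected(b) => b,
    })
}
```

A caller would acquire a chunk, fill it with serialized events, and call offer; on Err it can either release the chunk or retry after yielding. In an async worker the retry would be an await point rather than a loop.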

The tradeoff was steep. The first Rust build leaked memory under high load because we'd misused Arc in the leaderboard delta cache. We fixed it by switching to Rc<RefCell<T>> for thread-local caches and Arc<Mutex<T>> only for shared state. The broker-side change was the hardest: we had to convince the Veltrix ops team to enable the new compression.type=lz4 setting so we could fit 30 percent more events per batch without increasing latency.
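The ownership split described above might look roughly like the sketch below: each worker thread owns an Rc<RefCell<...>> cache that never leaves the thread, and only the merged leaderboard sits behind Arc<Mutex<...>>. The type aliases, function names, and the "alice" player are made up for the example, and the score map is a stand-in for the real delta cache.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Per-thread delta cache. Rc is not Send, so the compiler guarantees it
/// cannot cross a thread boundary.
type LocalCache = Rc<RefCell<HashMap<String, i64>>>;

/// The only state shared between threads.
pub type SharedBoard = Arc<Mutex<HashMap<String, i64>>>;

/// Accumulates points locally without taking any lock.
fn record(cache: &LocalCache, player: &str, points: i64) {
    *cache.borrow_mut().entry(player.to_string()).or_insert(0) += points;
}

/// Drains the local deltas into the shared board under a single lock.
fn flush(cache: &LocalCache, board: &SharedBoard) {
    let mut guard = board.lock().unwrap();
    for (player, delta) in cache.borrow_mut().drain() {
        *guard.entry(player).or_insert(0) += delta;
    }
}

/// Spawns `workers` threads that each record `events_per_worker` one-point
/// events for a hypothetical player and then flush once.
pub fn run_workers(workers: usize, events_per_worker: i64) -> SharedBoard {
    let board: SharedBoard = Arc::new(Mutex::new(HashMap::new()));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let board = Arc::clone(&board);
            thread::spawn(move || {
                // Created inside the thread, so it stays thread-local.
                let cache: LocalCache = Rc::new(RefCell::new(HashMap::new()));
                for _ in 0..events_per_worker {
                    record(&cache, "alice", 1);
                }
                flush(&cache, &board);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    board
}
```

The point of the design is that the hot path (record) touches no atomics or locks, and the shared mutex is taken once per flush rather than once per event.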

What The Numbers Said After

After two weeks of tuning, the Rust ingestion worker ran at 220k events/sec with 99.9 percent of messages delivered within 150 ms. The Go version, even after all our tweaks, topped out at 110k events/sec with 400 ms p95 latency. The Rust worker allocated only 800 KB per second, compared to 4.1 MB/sec in Go. Memory usage was flat at 42 MB RSS, and the brokers' disk I/O dropped from 1.2 GB/sec to 300 MB/sec because we were batching more aggressively.
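One common way to keep the allocation rate that low is to parse events as borrowed views into the receive buffer instead of building owned strings per event. The sketch below assumes a hypothetical comma-separated wire format (player, timestamp in milliseconds, event kind); the real format and parser are not shown in this post.

```rust
/// Event view borrowed from the input buffer: no heap allocation per event.
/// (Field names and the wire format are assumptions for illustration.)
#[derive(Debug, PartialEq)]
pub struct Event<'a> {
    pub player: &'a str,
    pub ts_ms: u64,
    pub kind: &'a str,
}

/// Parses one `player,timestamp_ms,kind` line. Returns None on malformed input.
pub fn parse_event(line: &str) -> Option<Event<'_>> {
    let mut parts = line.splitn(3, ',');
    let player = parts.next()?;
    let ts_ms = parts.next()?.parse().ok()?;
    let kind = parts.next()?;
    if player.is_empty() || kind.is_empty() {
        return None;
    }
    Some(Event { player, ts_ms, kind })
}

/// Counts the well-formed events in a newline-delimited batch,
/// skipping malformed lines.
pub fn count_valid(batch: &str) -> usize {
    batch.lines().filter_map(parse_event).count()
}
```

Because Event only holds slices of the batch, the borrow checker ties its lifetime to the buffer, so a chunk cannot be recycled while parsed events still point into it.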

The profiler output from flamegraph showed that 67 percent of CPU was spent in the rdkafka C layer, 22 percent in our event parser, and 11 percent in the leaderboard delta merge. Zero time was spent in GC pauses. The broker logs confirmed that partitions were evenly balanced, and the UnderReplicatedPartitions metric stayed at zero.

What I Would Do Differently

I would never again treat the event bus as a black box. The Veltrix docs are written for steady-state throughput, not for the spiky reality of a live game. From now on, we include the ingestion worker in our load tests, and we run the profiler before every feature launch.
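As a rough idea of what such a load-test gate can check, the sketch below computes a nearest-rank percentile over recorded end-to-end latencies and compares it with a budget such as the 500 ms target. The function names are hypothetical and the nearest-rank method is one choice among several; a real harness would more likely use a histogram library.

```rust
/// Nearest-rank percentile of `samples` for an integer percent `p` (0..=100).
/// Returns None for empty input or an out-of-range percent.
pub fn percentile(samples: &[u64], p: usize) -> Option<u64> {
    if samples.is_empty() || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(p / 100 * n), computed in integers to avoid float rounding.
    let rank = (p * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// True when the p-th percentile latency is at or below the budget.
/// An empty sample set counts as a failure, since nothing was measured.
pub fn within_budget(samples_ms: &[u64], p: usize, budget_ms: u64) -> bool {
    matches!(percentile(samples_ms, p), Some(v) if v <= budget_ms)
}
```

With a gate like within_budget(&latencies, 95, 500) in the test suite, a spike that pushes only the tail past the budget fails the run even when the median still looks healthy.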

I'd also insist on Rust from day one for any service that touches more than 50k events/sec. The learning curve is steep (the borrow checker rejected my first four designs as unsound), but the runtime guarantees saved us when we needed to push the system beyond its advertised limits. Go is simpler, but in high-throughput event pipelines, the language's runtime overhead becomes the constraint.