The Day We Realized Events Were the Bottleneck (And Why We Moved to Rust)

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We ran Veltrix, a distributed event processing engine that powered real-time treasure hunts across retail stores. The business needed sub-50 ms latency for event ingestion and 99.99% uptime during Black Friday sales. Our first system was a Kafka Streams topology in Scala, carefully tuned with RocksDB state stores. The JVM heap was 16 GiB, G1GC was configured with -XX:MaxGCPauseMillis=50, and we had 32 vCPUs per pod. Yet during a load test at 500k events per second, the p99 latency spiked to 1.2 seconds and the JVM ran out of memory twice.

What We Tried First (And Why It Failed)

We tried scaling out the Kafka Streams app to six pods, but the shuffle phase in the repartition topic introduced a 300 ms tail. We switched to exactly-once semantics and bumped the RocksDB cache to 4 GiB, but the blocking fsync on every commit pegged the disks at 100% iowait. Profiling with async-profiler showed 42% of the time was spent in JIT compilation stalls and 28% in GC pauses. The GC logs printed phrases like "Promoted 12 GB in 2.1 s", which was code for "we're about to crash".

We then rewrote the heavy join in C++ using RocksDB's JNI bindings. The median latency dropped to 28 ms, but every time the C++ library threw an uncaught exception, our JVM process exited with code 139. The ops team deployed a liveness probe that restarted the pod, but after each restart the treasure hunt UI refreshed and showed stale leaderboards for 8–12 seconds. Marketing sent Slack messages that read "This is unacceptable."

The Architecture Decision

I made the call to port the entire hot path to Rust. We chose Tokio as the async runtime, sled as the embedded KV store, and flamegraph for profiling. The decision wasn't about raw speed; it was about predictable latency and no hidden GC pauses. We rewrote the event router, windowed aggregator, and leaderboard updater in 2,800 lines of Rust. The sled store kept its working set in memory and flushed to disk every 500 ms to avoid the fsync disaster. We kept the Scala layer for schema validation and REST endpoints, but the critical path became Rust.
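
To make the windowed aggregator concrete, here is a minimal sketch of a tumbling-window aggregator that feeds a leaderboard, using only the standard library. It is not the Veltrix code: the Event fields, the WindowedAggregator name, and the watermark-based close_windows method are all hypothetical, and Tokio and sled are left out so the windowing logic stands on its own.

```rust
use std::collections::HashMap;

/// A scored event. Field names are hypothetical, not the real Veltrix schema.
#[derive(Debug, Clone)]
pub struct Event {
    pub player: String,
    pub points: u64,
    pub timestamp_ms: u64,
}

/// Sums points per player inside fixed-size (tumbling) time windows.
pub struct WindowedAggregator {
    window_ms: u64,
    // window start (ms) -> player -> total points in that window
    windows: HashMap<u64, HashMap<String, u64>>,
}

impl WindowedAggregator {
    pub fn new(window_ms: u64) -> Self {
        assert!(window_ms > 0, "window size must be positive");
        Self { window_ms, windows: HashMap::new() }
    }

    /// Start of the window that contains `timestamp_ms`.
    fn window_start(&self, timestamp_ms: u64) -> u64 {
        timestamp_ms - (timestamp_ms % self.window_ms)
    }

    /// Adds one event's points to its player's total in the matching window.
    pub fn ingest(&mut self, event: &Event) {
        let start = self.window_start(event.timestamp_ms);
        *self
            .windows
            .entry(start)
            .or_default()
            .entry(event.player.clone())
            .or_insert(0) += event.points;
    }

    /// Removes every window that ended at or before `watermark_ms` and returns
    /// (window_start, leaderboard) pairs, oldest window first. Each leaderboard
    /// is sorted by points descending, then by player name.
    pub fn close_windows(&mut self, watermark_ms: u64) -> Vec<(u64, Vec<(String, u64)>)> {
        let mut closed: Vec<u64> = self
            .windows
            .keys()
            .copied()
            .filter(|start| start + self.window_ms <= watermark_ms)
            .collect();
        closed.sort_unstable();

        closed
            .into_iter()
            .map(|start| {
                let totals = self.windows.remove(&start).unwrap_or_default();
                let mut board: Vec<(String, u64)> = totals.into_iter().collect();
                board.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
                (start, board)
            })
            .collect()
    }
}
```

In this sketch the caller ingests events as they arrive and periodically calls close_windows with a watermark; each returned leaderboard is what a leaderboard updater would publish. Late events that arrive after their window has closed simply open a new window at the same start, which a real system would need to handle explicitly.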

What The Numbers Said After

After the migration, we reran the same 500k events/sec load test. The p99 latency dropped from 1.2 s to 38 ms; p99.9 was 72 ms. The sled store allocated 2.1 GiB of memory at peak, and rustc's LLVM backend emitted SIMD instructions that halved CPU time on the join. The flamegraph attributed 0.3% of samples to memory management (allocator work, since Rust has no garbage collector); the rest was network and sled compaction. During Black Friday, our Rust pods ran at 65% CPU with zero OOMs and zero restarts. The treasure hunt UI stayed live, and marketing stopped messaging ops directly.
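
For readers who want to reproduce figures like p99 and p99.9 from raw latency samples, here is a small sketch of the nearest-rank percentile method. It is an assumption about how such numbers can be computed, not a description of the load-test tooling used here; the percentile function and its signature are hypothetical.

```rust
/// Nearest-rank percentile of latency samples, in the samples' own unit.
/// Returns None for an empty slice or a percentile outside (0, 100].
pub fn percentile(samples: &[u64], pct: f64) -> Option<u64> {
    // The negated comparison also rejects NaN.
    if samples.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();

    // Nearest rank: the smallest sample such that at least pct% of all
    // samples are less than or equal to it (ranks are 1-based).
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    let rank = rank.clamp(1, sorted.len());
    Some(sorted[rank - 1])
}
```

Calling percentile(&latencies_ms, 99.0) and percentile(&latencies_ms, 99.9) on the same sample set gives the two tail figures. Sorting a copy keeps the sketch simple; at 500k events/sec a streaming histogram would be the more practical choice.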

What I Would Do Differently

Next time, I'd avoid sled in favor of a custom sharded in-memory hash table with jemalloc. sled's compaction caused occasional latency spikes; a hash table would give us microsecond-level determinism. Also, I'd compile with -C target-cpu=native and profile with perf on bare metal instead of Kubernetes, because Kubernetes cgroups added 3–5 ms of scheduling jitter we didn't need. Finally, I'd pin the toolchain to Rust 1.75 and select the global allocator behind a Cargo feature, so we can swap jemalloc for mimalloc with a build flag instead of a code change. Rust fixes the global allocator at compile time, so the swap still means a rebuild. The learning curve was steep (we spent two weeks untangling lifetimes in the windowed aggregator), but the stability was worth every compile error.
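
As a rough illustration of the sharded in-memory hash table idea, here is a minimal standard-library sketch. Everything in it is an assumption: the ShardedMap name, the upsert/get/len API, and the choice of one Mutex per shard. It leaves out jemalloc entirely and makes no claim about microsecond-level latency; it only shows how sharding limits lock contention to keys that hash to the same shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// A hash map split into independently locked shards, so writers touching
/// different shards do not block each other. Hypothetical sketch.
pub struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Picks the shard for a key by hashing it and taking the remainder.
    fn shard_for(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let index = (hasher.finish() as usize) % self.shards.len();
        &self.shards[index]
    }

    /// Inserts `default` if the key is absent, then applies `update` to the
    /// stored value while holding only that shard's lock.
    pub fn upsert(&self, key: K, default: V, update: impl FnOnce(&mut V)) {
        // unwrap() panics if another thread panicked while holding the lock.
        let mut shard = self.shard_for(&key).lock().unwrap();
        update(shard.entry(key).or_insert(default));
    }

    /// Returns a clone of the value stored for `key`, if any.
    pub fn get(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        let shard = self.shard_for(key).lock().unwrap();
        shard.get(key).cloned()
    }

    /// Total number of entries, locking each shard in turn.
    pub fn len(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}
```

A leaderboard update in this sketch would look like map.upsert(player, 0, |total| *total += points). The shard count is a tuning knob: more shards mean less contention but more memory overhead, and a production version would likely swap the Mutex for a cheaper lock or a lock-free design.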

