I Was Wrong About Events for Three Years—Until I Learned What Async Runtime Was Really Costing

#webdev #programming #rust #performance

Three years ago I inherited a Rust-based treasure-hunt engine that processed upwards of 2 million in-game events per second. Latency was fine on the bench—median 3 ms, p99 12 ms—but every time we hit 2.5 M events the JVM control-plane ground to a halt at 30 % CPU and 512 MB RSS. I blamed GC pauses, tuned G1, disabled safepoints, even rewrote the hot path in C. Nothing mattered.

Then I ran the collector under perf for 30 seconds and saw a 3.4 ms pause where 92 % of threads blocked on syscalls inside tokio::park_timeout. The async runtime was blocking the executor thread on the event queue's io_uring submission path. In my head, Rust meant no GC, so performance was only about the algorithm. In fact, the runtime was the constraint.

The Problem We Were Actually Solving

We were building a real-time open-world game where every player action—move, rotate, shoot, loot—generated an event. Those events streamed into a Kafka topic partitioned by shard (hot zones used 64 partitions). A cluster of Rust services consumed the topic, fed an in-memory state machine, then published updates to a second topic for the physics and rendering workers.

The system promised 2 ms end-to-end latency at 2 M events/sec. Early load tests with 1.5 M events/sec showed median 2.8 ms and p99 14 ms. When we pushed to 2.2 M events/sec the metrics inverted: median 1.9 ms but p99 exploded to 124 ms and 5 % of responses timed out. Collectors on the consumer pods showed CPU flat at 65 %, but RSS climbed 1.2 GB every minute and the JVM control-plane nodes rebooted with OutOfMemoryError.

What We Tried First (And Why It Failed)

I rewrote the event deserializer from serde_json to simd-json-rs and shaved 0.3 ms off median latency. GC tuning on the control-plane JVM (Azul Zulu 17, G1, -XX:MaxGCPauseMillis=20) bought us 0.1 ms more, but the p99 tail stayed. Then a junior engineer attached async-profiler to the JVM; it showed 8 ms safepoint pauses every 60 ms, clearly not the bottleneck.

I switched the event ingest service from Java 17 to Rust 1.74 with tokio 1.37 and async-stream 0.3. The service ran on k8s with 2 vCPUs, 4 GB memory. Under the same load the Rust binary stabilized at 1.1 GB RSS, median 1.8 ms, p99 9 ms. Success! But the control-plane still rebooted at 2.5 M events/sec because its JVM heap never shrank—it kept the 512 MB heap reserved even when load dropped.

I tried off-heap storage with Chronicle-Queue in Java; the control-plane now survived 2.8 M events/sec, but our new Java client for the physics workers leaked 4 KB per event in ByteBuffer direct memory. Heap dumps showed 32 MB retained per pod after 10 minutes. After two weeks of Valgrind runs I concluded GC tuning alone could not fix a systemic leak in Netty's direct memory accounting.

The Architecture Decision

We needed determinism, not just speed. At 2.5 M events/sec a single 100 μs pause anywhere in the hot path doubled tail latency. Rust gave us that determinism—when we configured the runtime correctly.
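
To make the cost of a short pause concrete, here is a rough sketch of the queueing arithmetic. The 2.5 M events/sec rate and the 100 μs pause come from the paragraph above; the 3 M events/sec service rate is a made-up figure for illustration, and both function names are hypothetical.

```rust
/// Events that arrive while the consumer is stalled for `pause_us` microseconds.
fn events_queued_during_pause(events_per_sec: u64, pause_us: u64) -> u64 {
    events_per_sec * pause_us / 1_000_000
}

/// Microseconds needed to drain `backlog` events while new events keep arriving.
/// Returns None when the consumer has no headroom and the backlog never drains.
fn drain_time_us(backlog: u64, arrival_per_sec: u64, service_per_sec: u64) -> Option<u64> {
    if service_per_sec <= arrival_per_sec {
        return None;
    }
    // Only the spare capacity (service minus arrival) works down the backlog.
    Some(backlog * 1_000_000 / (service_per_sec - arrival_per_sec))
}
```

With these inputs a 100 μs stall queues 250 events, and a consumer with an assumed 3 M events/sec ceiling needs about 500 μs to catch up, so the pause delays far more events than the ones that arrived during it.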

I replaced tokio with smol 2.0.6 and switched the event queue to a lock-free mpsc built on atomic wakers. The queue ran in a single-threaded executor with no work-stealing. I inserted a small bounded mpsc channel (size 1024) between the IO thread and the state-machine thread to decouple back-pressure from the parser.
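
The bounded hand-off between the IO thread and the state-machine thread can be sketched with the standard library alone. This is not the lock-free queue described above: it uses std::sync::mpsc::sync_channel as a stand-in, and the Event type, its fields, and the function names are assumptions made for the example.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Hypothetical event type; the real engine's event layout is not shown in the post.
#[derive(Debug, Clone, Copy)]
struct Event {
    seq: u32,
    kind: u8, // e.g. 0 = move, 1 = rotate, 2 = shoot, 3 = loot
}

/// Pushes `total` events from an "IO" thread to a "state machine" thread
/// through a bounded channel. Returns (events processed, per-kind counts).
fn run_pipeline(total: u32, capacity: usize) -> (u32, [u32; 4]) {
    let (tx, rx) = sync_channel::<Event>(capacity);

    let io = thread::spawn(move || {
        for seq in 0..total {
            let ev = Event { seq, kind: (seq % 4) as u8 };
            // `send` blocks while the channel is full, so a slow consumer
            // throttles the producer instead of growing an unbounded queue.
            tx.send(ev).expect("state-machine thread hung up");
        }
        // `tx` is dropped here, which ends the consumer's loop.
    });

    let state_machine = thread::spawn(move || {
        let mut processed = 0u32;
        let mut per_kind = [0u32; 4];
        for ev in rx {
            assert_eq!(ev.seq, processed); // single producer, so FIFO order holds
            per_kind[ev.kind as usize] += 1;
            processed += 1;
        }
        (processed, per_kind)
    });

    io.join().expect("IO thread panicked");
    state_machine.join().expect("state-machine thread panicked")
}

/// Non-blocking alternative: `try_send` reports a full channel instead of waiting.
fn full_channel_is_reported() -> bool {
    let (tx, _rx) = sync_channel::<Event>(1);
    let ev = Event { seq: 0, kind: 0 };
    tx.try_send(ev).expect("first slot is free");
    matches!(tx.try_send(ev), Err(TrySendError::Full(_)))
}
```

Calling run_pipeline(10_000, 1024) mirrors the 1024-slot channel from the text. Whether the producer should block (send) or shed load (try_send) when the channel fills up is a design choice the post does not spell out.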

I also moved the second Kafka topic update to a separate thread so the state-machine never blocked on network writes. The executor switched from multi-threaded (tokio::spawn on 8 threads) to single-threaded (smol::spawn on a single pinned thread). The change reduced allocations: before the smol runtime we saw 1.2 million Vec allocations per second in the event loop; after the switch allocations dropped to 45 k Vec allocations per second and zero Arc<...> clones.
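
One common way to cut per-event Vec allocations in a single-threaded loop is to reuse one scratch buffer instead of allocating a fresh Vec per event. The sketch below illustrates that general technique only; it is not taken from the service described here, and process_frames and its parameters are hypothetical.

```rust
/// Copies each frame into `scratch` and returns how many times `scratch`
/// had to grow, which is a rough proxy for heap allocations in this loop.
fn process_frames(frames: &[&[u8]], scratch: &mut Vec<u8>) -> usize {
    let mut grows = 0;
    for frame in frames {
        let cap_before = scratch.capacity();
        scratch.clear(); // resets the length to 0 but keeps the capacity
        scratch.extend_from_slice(frame);
        if scratch.capacity() != cap_before {
            grows += 1;
        }
        // ... hand `scratch` to the state machine here ...
    }
    grows
}
```

If the buffer is created once with Vec::with_capacity sized for the largest expected frame, the loop never grows it again; starting from Vec::new() it grows only until it reaches the largest frame seen so far.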

The executor thread now ran at 75 % CPU with no syscalls outside io_uring and the GC pauses were gone. RSS climbed only 300 MB over 24 hours under 2.4 M events/sec sustained load.

What The Numbers Said After

We re-ran the 2.5 M events/sec test with the new smol-based ingest service and the Java control-plane. The service ran on the same k8s node specs: 2 vCPU, 4 GB.

perf stat -e cache-misses,instructions,cpu-clock -d -p <pid>, run for 60 seconds:

Before smol:
8.4 M cache misses
42 B instructions
61 % cpu-clock

After smol:
1.8 M cache misses
31 B instructions
75 % cpu-clock

Latency at 2.5 M events/sec:
median (p50) 1.6 ms
p95 3.7 ms
p99 8.2 ms
p99.9 22 ms
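
For readers who want to reproduce figures like these from raw samples, here is a minimal nearest-rank percentile sketch. It assumes latencies are already collected in microseconds and sorted in ascending order; the function name and the per-mille parameter are choices made for this example, not the tooling used for the numbers above.

```rust
/// Nearest-rank percentile over samples sorted in ascending order.
/// `permille` is the percentile in tenths of a percent: 500 = p50, 999 = p99.9.
fn percentile_us(sorted_us: &[u64], permille: u64) -> Option<u64> {
    if sorted_us.is_empty() || permille == 0 || permille > 1000 {
        return None;
    }
    let n = sorted_us.len() as u64;
    // Ceiling of n * permille / 1000, done in integers to avoid float rounding.
    let rank = (n * permille + 999) / 1000;
    Some(sorted_us[(rank - 1) as usize])
}
```

The median is the same quantity as p50 (permille = 500). A p99.9 figure only means something once there are well over a thousand samples, because with fewer samples the nearest rank is simply the maximum.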

RSS after 6 hours at 2.5 M events/sec:
1.4 GB (stable, no growth)

All JVM pods ran at 35 % CPU and 192 MB RSS with no safepoint pauses longer than 2 ms.

What I Would Do Differently

I would not have wasted six weeks on GC tuning and Chronicle-Queue before measuring the runtime itself. The moment I saw tokio park on io_uring syscalls I should have switched to a single-threaded executor. The community's async-first dogma in Rust is useful for network services but lethal for ultra-low-latency event loops where determinism matters more than throughput.

I would also have profiled allocations in the tokio runtime
