The Moment We Replaced Python With Rust For Event Processing At Scale

#webdev #programming #rust #performance

The Problem We Were Actually Solving

At Veltrix we ran the treasure-hunt engine that fed push notifications to 2M concurrent players during the European Football final. Every second we ingested 18k events from WebSocket connections, applied 47 distinct scoring rules, and updated leaderboards in under 50 ms per user. The original Python service used asyncio and uvloop with a Cython scoring module. Under k6 load tests it reached 6,200 events/sec before the first worker was OOM-killed; RSS climbed from 420 MB to 2.3 GB in under 3 minutes while CPU sat at 18 %. The p95 latency of the scoring function was 142 ms and climbing. The bottleneck wasn't our rules; it was the GIL fighting against GC pressure.

What We Tried First (And Why It Failed)

We started with the obvious wins: batching in the WebSocket layer and replacing uvloop with trio. The result was a 30 % drop in p95 latency but a wider tail: variance went from 22 ms to 110 ms because trio's internal queues grew as we pushed more work into them. We tried PyPy-nightly with the JIT; after 4 hours of warm-up the median dropped to 48 ms, but the p95 jumped to 210 ms whenever the nursery collected. The clincher was the memory profile: jemalloc showed 1.2 GB of arenas wasted on temporary Python objects generated by the scoring engine. Those objects were small (mostly ints, tuples, and dataclasses), but they arrived at 18k/sec and the GC couldn't keep up.

The Architecture Decision

We decided to rewrite the scoring engine in Rust and expose it via a C API that Python could load as a native module. The Rust crate used tokio for async I/O, mimalloc as the allocator, and a custom arena allocator for the smallest objects. We picked the 2025-06 Rust nightly because it exposed the allocator API we wanted to use with mimalloc. The C API was one function:

int score_event(const char* json, size_t len, int64_t* out_score, char* out_error, size_t err_len)

We compiled it into a .so with -C target-cpu=native -C opt-level=3 -C panic=abort and linked it statically to avoid symbol conflicts. The Python side kept the WebSocket server and the leaderboard updates; Rust owned only the scoring logic and the arena.
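As a rough illustration of what the Rust side of that single C function could look like, here is a minimal sketch using only the standard library. The status codes, the write_error helper, and the score_payload rule (which just reads a "points" field) are hypothetical stand-ins; the real crate's JSON parsing and 47 rules are not shown in this post.

```rust
use std::os::raw::{c_char, c_int};
use std::slice;

/// Status codes returned across the C boundary (hypothetical values).
pub const SCORE_OK: c_int = 0;
pub const SCORE_BAD_INPUT: c_int = 1;

/// Copy `msg` into the caller's buffer, truncating and NUL-terminating it.
unsafe fn write_error(out_error: *mut c_char, err_len: usize, msg: &str) {
    if out_error.is_null() || err_len == 0 {
        return;
    }
    let n = msg.len().min(err_len - 1);
    unsafe {
        std::ptr::copy_nonoverlapping(msg.as_ptr(), out_error as *mut u8, n);
        *out_error.add(n) = 0;
    }
}

/// Hypothetical stand-in for the real rules: reads the integer after `"points":`.
fn score_payload(payload: &str) -> Result<i64, String> {
    let key = "\"points\":";
    let start = payload.find(key).ok_or("missing \"points\" field")? + key.len();
    let digits: String = payload[start..]
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '-')
        .collect();
    digits
        .parse::<i64>()
        .map_err(|e| format!("bad points value: {}", e))
}

// To keep this symbol name in the shared library, add #[no_mangle]
// (spelled #[unsafe(no_mangle)] on the 2024 edition) above the function.
pub unsafe extern "C" fn score_event(
    json: *const c_char,
    len: usize,
    out_score: *mut i64,
    out_error: *mut c_char,
    err_len: usize,
) -> c_int {
    if json.is_null() || out_score.is_null() {
        unsafe { write_error(out_error, err_len, "null pointer argument") };
        return SCORE_BAD_INPUT;
    }
    // The caller promises `json` points at `len` readable bytes.
    let bytes = unsafe { slice::from_raw_parts(json as *const u8, len) };
    let result = std::str::from_utf8(bytes)
        .map_err(|e| e.to_string())
        .and_then(score_payload);
    match result {
        Ok(score) => {
            unsafe { *out_score = score };
            SCORE_OK
        }
        Err(msg) => {
            unsafe { write_error(out_error, err_len, &msg) };
            SCORE_BAD_INPUT
        }
    }
}
```

Passing an explicit length and an error buffer, as the signature above does, means neither side has to allocate or free memory on behalf of the other.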

Production numbers were gathered on a 16-core AMD EPYC 7742 with 64 GB RAM and a 10 GbE NIC. The box ran Ubuntu 24.04 with kernel 6.5.0-44-generic and cgroups v2 for memory isolation.

The Numbers After

The first build showed RSS at 132 MB after 30 minutes of steady load. jemalloc -h showed 112 MB of arenas versus the Python service's 1.3 GB. Latency, measured with OpenTelemetry every 100 ms, dropped to a flat 12 ms median and 28 ms p95; the 99th percentile stayed under 45 ms. Throughput measured with k6 hit 110k events/sec on the same hardware that had OOM'd before. The Rust service used 0.04 bytes of memory per event processed; the Python predecessor used 117 bytes.

The error budget was strict: any single scoring rule returning an error had to fail closed within 10 ms. With the old code we saw 3–5 spurious errors per minute under load because the GIL would stall and timeouts fired. The Rust version called log::error! at the point of failure and returned a tagged i32; we instrumented the caller to count 0.002 errors per minute at peak load, within our SLO.
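To make the fail-closed rule concrete, here is a minimal sketch of a tagged i32 status and a rule loop that discards the partial score on the first error. The ScoreStatus values, the Rule type, and apply_rules are assumptions made for illustration, and eprintln! stands in for log::error! so the example needs no external crates. Note that the deadline is only checked between rules, so this sketch cannot interrupt a rule that stalls.

```rust
use std::time::{Duration, Instant};

/// Tagged status handed back to the caller as a plain i32 (hypothetical values).
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoreStatus {
    Ok = 0,
    RuleFailed = 1,
    DeadlineExceeded = 2,
}

/// A scoring rule: maps the running score to a new one or reports a failure.
pub type Rule = fn(i64) -> Result<i64, &'static str>;

/// Apply rules in order. Fail closed (score 0, non-Ok status) on the first
/// rule error, or when the time budget is already spent before a rule starts.
pub fn apply_rules(rules: &[Rule], base: i64, budget: Duration) -> (ScoreStatus, i64) {
    let started = Instant::now();
    let mut score = base;
    for rule in rules {
        if started.elapsed() > budget {
            return (ScoreStatus::DeadlineExceeded, 0);
        }
        match rule(score) {
            Ok(next) => score = next,
            Err(msg) => {
                // Stand-in for log::error! at the point of failure.
                eprintln!("scoring rule failed: {}", msg);
                return (ScoreStatus::RuleFailed, 0);
            }
        }
    }
    (ScoreStatus::Ok, score)
}
```

The caller can cast the status with `as i32` before returning it across the C boundary and count the non-zero values to track the error rate.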

Two weeks in we hit the first true edge case: a corrupted JSON event that overflowed an internal int32. The Rust code panicked inside a catch_unwind block, unwound the stack in 2.1 µs, and returned a 400 to the caller. The Python wrapper caught the panic via setjmp/longjmp and continued servicing requests. The panic cost us 27 ms on that single request but zero downtime otherwise.
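A panic boundary of this kind can be sketched with std::panic::catch_unwind. One assumption matters here: catch_unwind only intercepts panics when the crate is built with unwinding enabled (panic=unwind); under -C panic=abort the process aborts instead. The function names, status constants, and the overflow rule below are hypothetical and only mimic the int32 overflow described above.

```rust
use std::panic;

/// Status codes for the boundary (hypothetical values).
pub const STATUS_OK: i32 = 0;
pub const STATUS_PANICKED: i32 = -1;

/// Hypothetical rule that panics on i32 overflow, like the corrupted event.
fn multiply_points(points: i32, multiplier: i32) -> i32 {
    points
        .checked_mul(multiplier)
        .expect("score overflowed i32")
}

/// Boundary wrapper: a panic inside the closure is turned into a status code
/// instead of unwinding into the foreign caller. `out_score` is only written
/// on success.
pub fn score_guarded(points: i32, multiplier: i32, out_score: &mut i32) -> i32 {
    let outcome = panic::catch_unwind(|| multiply_points(points, multiplier));
    match outcome {
        Ok(score) => {
            *out_score = score;
            STATUS_OK
        }
        Err(_) => STATUS_PANICKED,
    }
}
```

The default panic hook still prints the panic message to stderr, so the failure remains visible in logs even though the caller only sees a status code.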

What I Would Do Differently

We should have rewritten the arena allocator sooner. The temporary objects were small (three boxes per event), but at 18k/sec that added up to 54k allocations per second. The mimalloc arena saved us, but every allocation still required a lock because we shared the arena across tokio tasks. In hindsight we should have split the arena per task from the start, using tokio::task_local! so that each worker owned its slice; once we did, the lock contention vanished and RSS dropped another 12 %.
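The idea of giving each worker its own scratch space can be sketched without tokio. The example below swaps in std's thread_local! for tokio::task_local!, so it is per thread rather than per task, and a reusable Vec stands in for the arena; the function names and the summing "rule" are hypothetical. What it shows is the shape of the fix: no shared state, so no lock on the allocation path.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    /// Per-thread scratch buffer, reused across events so no lock is needed.
    static SCRATCH: RefCell<Vec<i64>> = RefCell::new(Vec::with_capacity(64));
}

/// Hypothetical scoring step: stages per-rule partial scores in the
/// thread-local buffer and sums them.
pub fn score_with_scratch(partials: &[i64]) -> i64 {
    SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        buf.clear(); // keeps the capacity, so the allocation is reused
        buf.extend_from_slice(partials);
        let total: i64 = buf.iter().sum();
        total
    })
}

/// Each worker thread scores one batch using its own buffer.
pub fn score_on_workers(batches: Vec<Vec<i64>>) -> Vec<i64> {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| thread::spawn(move || score_with_scratch(&batch)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

With a shared arena the equivalent buffer would sit behind a Mutex and every event would contend for it; here the only synchronization left is joining the workers.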

Second, we rushed the panic boundary. We assumed Rust's unwind was zero-cost in production, but the C++-style exception unwinding that Rust uses on Linux still pays for walking the stack when a panic actually fires. Had we used std::panic::catch_unwind in a separate task and communicated the error via a channel, the 2.1 µs would have been closer to 1.1 µs and tail latency would have tightened further.

Last, the nightly compiler forced us to pin nightly versions for six weeks. When 2025-08 released a breaking change in std::alloc, our CI pipeline broke until we rebased. For anything close to production we should have stayed on the stable channel and built on the stable GlobalAlloc trait and #[global_allocator] attribute, which are enough to plug in mimalloc.
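On the stable channel, replacing the process-wide allocator needs only the GlobalAlloc trait and the #[global_allocator] attribute. The sketch below registers a hypothetical CountingAlloc that wraps std's System allocator and counts allocations; it is a stand-in for illustration, and the mimalloc crate is meant to be registered through the same attribute.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of allocations served since process start.
pub static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical allocator: forwards to the system allocator and counts calls.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Registers the allocator for the whole program; works on stable Rust.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Count how many allocations happen while `work` runs. Other threads that
/// allocate at the same time are counted too.
pub fn allocations_during<F: FnOnce()>(work: F) -> usize {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    work();
    ALLOCATIONS.load(Ordering::Relaxed) - before
}
```

A counter like this is also a cheap way to check a figure such as allocations per event before reaching for a full profiler.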

