Alex

Posted on Jul 1

From Python/Pandas to Rust/C++: taking our tick simulation from 140 ms to microseconds per window

#rust #performance #showdev #python

TL;DR

We're a small ML lab building alpha models for a handful of partners. Our market simulation loop — the part that keeps you honest about look-ahead bias — was 900–1300 ms per window in Python/Pandas, which made every experiment a 6–20 hour run. We went pandas → numpy → hand-written Rust + C++ models and landed at 1–5 ms per window on a cheap cloud box (4–40 µs on a high-clock CPU). This is the honest engineering story, and an actual question at the end for anyone who does HFT/MM.

Not a pitch — I'll explain why at the bottom.

The problem: simulation, not latency

Our whole training stack is Python: feature engineering → targets → training → backtests → simulation (strict, no look-ahead). Simulation is the stage that actually matters.

Simulation is brutal on compute. On 1m/5m bars over years of history, a single run on a normal workstation took 6–20 hours. For each window we compute several hundred features, then run inference. Data → features → inference of one window was 900–1300 ms.

We never cared about that latency for trading. We cared because every experiment took a day, and I had a backlog of hypotheses to test.
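To make "strict, no look-ahead" concrete, here is a minimal sketch of the shape such a loop can take. It is not our production code: the `Bar` struct, the `simulate` function, and the `decide` callback (a stand-in for features + inference) are hypothetical names for illustration, and the PnL ignores fees and slippage. The point is only that the decision made on bar t is applied to the move into bar t+1, never to bar t itself.

```rust
/// One bar of market data (hypothetical layout, for illustration only).
#[derive(Debug, Clone, Copy)]
pub struct Bar {
    pub close: f64,
}

/// Walk the bars in time order and return a simplified PnL.
///
/// `decide` stands in for "features + inference". It is handed one bar at a
/// time and returns the position to hold over the NEXT bar, so it can never
/// see data from the future.
pub fn simulate<F>(bars: &[Bar], mut decide: F) -> f64
where
    F: FnMut(&Bar) -> f64,
{
    let mut pnl = 0.0;
    let mut position = 0.0; // decided on the previous bar
    let mut prev_close: Option<f64> = None;

    for bar in bars {
        // First, settle the position that was chosen before this bar existed.
        if let Some(prev) = prev_close {
            pnl += position * (bar.close - prev);
        }
        // Only then let the model look at this bar and choose the next position.
        position = decide(bar);
        prev_close = Some(bar.close);
    }
    pnl
}
```

Calling `simulate(&bars, |bar| ...)` with a closure that keeps its own feature state is enough to replay history one bar at a time. The cost per call to `decide` is exactly the per-window latency this post is about.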

Step 1: pandas → numpy

Being Python people, the first move was obvious: rip pandas out of the hot path and go numpy. Real win: from 900–1300 ms down to ~140 ms/window. We could finally evaluate models across more angles.

But rolling-window recomputation and allocation churn were still the ceiling, and 140 ms only let me run the basic experiments.

Step 2: accepting the language was the wall

My friend has written Rust for years and never shut up about it: "your Python is nonsense, rewrite it in Rust." We argued for years about whether Rust is always worth it.

This time I got it: no matter what CPU I throw at it, Python caps me. Interpreter overhead limits single-threaded speed, and the GIL keeps threads from running Python bytecode in parallel. There was no way up.

Step 3: Rust + C++

Not fast, not easy — we rewrote every feature in Rust, with O(1) incremental state per tick instead of recomputing rolling windows. That single change killed both the allocation churn and the latency variance. Then we converted the models to a C++ engine AOT-compiled for the target CPU, called over FFI.
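For readers who haven't seen the "O(1) incremental state per tick" pattern, here is a minimal sketch of it for one feature family (rolling mean and variance). This is an illustration, not one of our actual features: `RollingStats` and its methods are hypothetical names, and the running sum-of-squares approach is the simplest variant (it can lose precision on very long runs; Welford-style updates are the usual remedy). The ring buffer is allocated once, and each tick does a constant amount of work with no allocation.

```rust
/// Fixed-size rolling window that keeps running sums, so each tick costs O(1)
/// and nothing is allocated after construction.
pub struct RollingStats {
    buf: Vec<f64>, // ring buffer, allocated once in `new`
    head: usize,   // next slot to overwrite (the oldest sample once full)
    len: usize,    // number of valid samples, at most buf.len()
    sum: f64,
    sum_sq: f64,
}

impl RollingStats {
    pub fn new(window: usize) -> Self {
        assert!(window > 0, "window must be positive");
        Self { buf: vec![0.0; window], head: 0, len: 0, sum: 0.0, sum_sq: 0.0 }
    }

    /// Push one tick: evict the oldest sample if the window is full,
    /// then add the new one to the running sums.
    pub fn push(&mut self, x: f64) {
        if self.len == self.buf.len() {
            let old = self.buf[self.head];
            self.sum -= old;
            self.sum_sq -= old * old;
        } else {
            self.len += 1;
        }
        self.buf[self.head] = x;
        self.head = (self.head + 1) % self.buf.len();
        self.sum += x;
        self.sum_sq += x * x;
    }

    /// Mean of the samples currently in the window, or None if empty.
    pub fn mean(&self) -> Option<f64> {
        if self.len == 0 {
            None
        } else {
            Some(self.sum / self.len as f64)
        }
    }

    /// Population variance of the samples currently in the window.
    /// Clamped at zero to absorb small negative values from rounding.
    pub fn variance(&self) -> Option<f64> {
        let m = self.mean()?;
        Some((self.sum_sq / self.len as f64 - m * m).max(0.0))
    }
}
```

The contrast with the numpy version is that nothing here scales with the window length per tick: a window of 20 and a window of 20,000 cost the same to update.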

Results, full cycle, one window:

Stage	Latency/window
Python / pandas	900–1300 ms
Python / numpy	~140 ms
Rust + C++, cheap cloud box (vCPU)	1–5 ms
Rust + C++, high-clock AMD test rig	4–40 µs

Simulations that took hours now take minutes. The memory-leak whack-a-mole is gone.

The part I didn't expect

The interesting outcome wasn't prod speed; it was the experiments this unlocked. We can now run real tick-level simulation (not a backtest) to test ideas we simply couldn't touch before, including some inspired by Michael Levin's work (bioelectric / collective-behavior stuff that turns out to be useful well beyond biology). In Python that was infeasible; in Rust it's basically bounded only by infrastructure.

Verify it yourself (no cherry-picked CSVs)

We stream raw live signals to a public board. Every signal is written to public S3 at generation time, immutable, with a microsecond timestamp — so you can confirm there's no look-ahead: signal_gen_time > bar_time, for every single one. The demo box also reports its real inference latency (you'll see ms, not µs — cheap silicon, honest number).
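If you want to run that check yourself, it can be as small as the sketch below. The record layout is an assumption: `SignalRecord`, its field names, and the `look_ahead_violations` function are hypothetical, and the real objects on S3 may be shaped differently, so you would need to adapt the parsing. The check itself is just the inequality above applied to every record.

```rust
/// One published signal. Hypothetical layout: the real records may use
/// different field names or encodings.
#[derive(Debug, Clone, PartialEq)]
pub struct SignalRecord {
    pub id: u64,
    /// Timestamp of the bar the signal refers to, in microseconds.
    pub bar_time_us: i64,
    /// Time the signal was generated, in microseconds.
    pub signal_gen_time_us: i64,
}

/// Return every record that breaks `signal_gen_time > bar_time`.
/// An empty result means the whole batch passes the check.
pub fn look_ahead_violations(records: &[SignalRecord]) -> Vec<&SignalRecord> {
    records
        .iter()
        .filter(|r| r.signal_gen_time_us <= r.bar_time_us)
        .collect()
}
```

Anything this returns is a record whose generation time is not strictly after its bar time, which is exactly the case that should never occur.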

Where we're NOT flexing

We have real data-feed latency and zero colocation / kernel-bypass / exchange adjacency. This is fast compute, not a colocated HFT desk. Not pretending otherwise.

The honest question

If anyone here actually runs HFT / market-making in production: given fast compute but no colo (real feed latency), is any of this usable in prod? Our only idea so far is adverse-selection defense for market-making — skew/pull quotes ahead of a microstructure move. We might be completely wrong. I'd love a reality check from someone who's actually done it.
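To make the question concrete, this is roughly what we mean by "skew/pull", as a toy sketch. Everything in it is an assumption for discussion: the `signal` in [-1, 1], the `max_skew` and `pull_threshold` parameters, and the names `QuoteAction` and `defensive_quote` are hypothetical, and a real quoting engine has inventory, queue position, and fees to worry about, none of which appear here.

```rust
/// What to do with a two-sided quote for the next interval.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QuoteAction {
    /// Keep quoting at these prices.
    Quote { bid: f64, ask: f64 },
    /// Cancel both sides and wait.
    Pull,
}

/// Toy adverse-selection defense.
///
/// `signal` is a hypothetical model output in [-1, 1]: the sign is the
/// expected direction of the short-term move, the magnitude is confidence.
pub fn defensive_quote(
    mid: f64,
    half_spread: f64,
    signal: f64,
    max_skew: f64,
    pull_threshold: f64,
) -> QuoteAction {
    // A non-finite signal means we know nothing, so stay out.
    if !signal.is_finite() {
        return QuoteAction::Pull;
    }
    let s = signal.clamp(-1.0, 1.0);
    // Strong conviction about an imminent move: pull both quotes.
    if s.abs() >= pull_threshold {
        return QuoteAction::Pull;
    }
    // Otherwise shift both quotes in the predicted direction, so the side
    // that is about to be picked off moves away from the market.
    let shift = s * max_skew;
    QuoteAction::Quote {
        bid: mid - half_spread + shift,
        ask: mid + half_spread + shift,
    }
}
```

Whether a signal that arrives over a non-colocated feed is still early enough for this to help is precisely the part we can't judge ourselves.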

Why this isn't an ad

We don't sell to retail, and I doubt there are buyers for this among readers here. I'm writing it because this community appreciates a real Rust-rewrite story and will tear bad engineering apart — which is exactly what I want.

Rust is cool. That's the post.

Links

Live board (raw signals + real inference latency): https://livefinai.synlabs.pro/

DEV Community