DEV Community

Building a Real-Time Face Swap Pipeline in Rust with ONNX Runtime

Most face swap tools are Python scripts stitched together with PyTorch, OpenCV, and a prayer. They work, but they drag in gigabytes of dependencies, need CUDA configured just right, and fall apart the moment you try to run them in real time.
I wanted to see if the entire pipeline could run in pure Rust. No Python. No PyTorch. No wrappers. One binary that you download, unpack, and run.
Turns out it can. 60fps on a webcam feed.
The Pipeline
Four neural networks run sequentially on every frame:
RetinaFace detects faces and extracts five landmark points. ArcFace computes a 512-dimensional embedding from the source face. InSwapper takes the target face region and the source embedding and produces a swapped face. GFPGAN optionally enhances the result for higher-quality output.
All four models run through ONNX Runtime. No custom CUDA kernels, no framework overhead. Just raw tensor in, tensor out.
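To make the per-frame flow concrete, here is a minimal sketch of the four-stage sequence. The type and function names are illustrative, not the actual API of the repository, and the ONNX Runtime inference calls are stubbed out:

```rust
/// A detected face: bounding box plus the five RetinaFace landmarks.
struct Detection {
    bbox: [f32; 4],
    landmarks: [[f32; 2]; 5],
}

/// The 512-dimensional ArcFace identity embedding of the source face.
struct Embedding([f32; 512]);

struct Frame {
    rgb: Vec<u8>,
    width: usize,
    height: usize,
}

// Stubs standing in for the actual ONNX Runtime inference calls.
fn detect_faces(_frame: &Frame) -> Vec<Detection> { Vec::new() }
fn swap_face(_frame: &mut Frame, _det: &Detection, _src: &Embedding) {}
fn enhance_face(_frame: &mut Frame, _det: &Detection) {}

/// One pass of the four-model pipeline over a single frame.
fn process_frame(frame: &Frame, source: &Embedding, enhance: bool) -> Frame {
    let mut out = Frame {
        rgb: frame.rgb.clone(),
        width: frame.width,
        height: frame.height,
    };
    // 1. RetinaFace: find faces and landmarks.
    for det in detect_faces(frame) {
        // 2./3. ArcFace embedding is computed once for the source face up
        // front; InSwapper replaces this crop using that embedding.
        swap_face(&mut out, &det, source);
        // 4. GFPGAN: optional quality pass over the swapped region.
        if enhance {
            enhance_face(&mut out, &det);
        }
    }
    out
}
```

The source embedding is computed once and reused for every frame; only detection, swapping, and enhancement run per frame.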
Architecture
Three threads, no locks on the hot path:
The capture thread grabs frames from the webcam via nokhwa and publishes them through an ArcSwap. The pipeline thread picks up new frames, runs inference, and publishes processed frames through a second ArcSwap. The UI thread reads whichever buffer is current and renders through egui.
No mutexes on frame data. No channels. No async. Just atomic generation counters and lock-free pointer swaps. The shared state structs are 64 bytes each, aligned to cache lines to prevent false sharing between cores.
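The generation-counter part of that design can be sketched with std atomics alone. This is an illustrative reconstruction, not the project's actual code; the real pipeline publishes the frame pointer itself through the arc_swap crate, while this shows only the cache-line-aligned counter a reader polls to know whether a new frame exists:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One publish slot. `#[repr(align(64))]` pads the struct to a full cache
/// line, so two slots never share a line and the capture thread bumping one
/// counter cannot cause false sharing with the pipeline thread's counter.
#[repr(align(64))]
struct Slot {
    /// Monotonically increasing frame generation. A reader remembers the
    /// last value it saw and only fetches the buffer when it changes.
    generation: AtomicU64,
}

impl Slot {
    const fn new() -> Self {
        Slot { generation: AtomicU64::new(0) }
    }

    /// Writer side: announce a new frame by bumping the counter.
    /// Release ordering makes the frame write visible before the bump.
    fn publish(&self) -> u64 {
        self.generation.fetch_add(1, Ordering::Release) + 1
    }

    /// Reader side: Some(current) if a new frame arrived since `seen`.
    fn poll(&self, seen: u64) -> Option<u64> {
        let cur = self.generation.load(Ordering::Acquire);
        (cur != seen).then_some(cur)
    }
}
```

A reader that sees `poll` return `None` simply renders the frame it already has, which is what lets the UI thread run at its own rate without blocking the pipeline.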
Zero Allocation Hot Path
Every pixel buffer in the pipeline is pre-allocated at startup. The RGBA to RGB conversion, the tensor fill, the affine warp, the paste-back blending: none of them allocates during processing. The only heap allocation per frame is the Arc wrapping the final snapshot, which is unavoidable with the ArcSwap pattern.
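The pattern looks roughly like this; the struct and field names here are my own illustration, not the project's. The scratch buffer is sized once at startup, and the per-frame conversion is just a copy loop into it:

```rust
/// Scratch buffers allocated once at startup and reused every frame.
struct Buffers {
    /// width * height * 3 bytes, overwritten on each conversion.
    rgb: Vec<u8>,
}

impl Buffers {
    fn new(width: usize, height: usize) -> Self {
        Buffers { rgb: vec![0u8; width * height * 3] }
    }

    /// RGBA -> RGB conversion into the pre-allocated buffer.
    /// No heap allocation happens here, only a tight copy loop
    /// that drops the alpha byte of each pixel.
    fn rgba_to_rgb(&mut self, rgba: &[u8]) -> &[u8] {
        debug_assert_eq!(rgba.len() / 4, self.rgb.len() / 3);
        for (dst, src) in self.rgb.chunks_exact_mut(3).zip(rgba.chunks_exact(4)) {
            dst.copy_from_slice(&src[..3]);
        }
        &self.rgb
    }
}
```

Returning a borrow of the internal buffer rather than a fresh `Vec` is what keeps the hot path allocation-free: the caller reads the result before the next frame overwrites it.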
What I Learned
Rust is genuinely excellent for this kind of work. The ownership model made the multithreaded architecture trivial to get right. No data races, no use-after-free, no mystery crashes at 3am. The borrow checker complained exactly once during development, and it was correct.
ONNX Runtime through the ort crate is production-ready. Model loading, tensor creation, and inference are all straightforward. The only rough edge is the session builder API requiring mutable references in surprising places.
egui is underrated for real-time applications. Immediate mode rendering with zero retained state makes it perfect for a live video feed. The texture upload path is clean and fast enough for 60fps without vsync tricks.
Try It
The release archive includes the binary and all models. Download, unpack, run. Nothing else needed.
GitHub: github.com/despite-death/face-swap
Feedback on the architecture and code is welcome. I'm especially interested in hearing from anyone with experience running multiple ONNX Runtime sessions in parallel: executing the models concurrently instead of sequentially could push this well past 60fps.

Top comments (2)

mote

Impressive work — running 4 neural nets at 60fps without Python is no small feat. The dependency-free binary approach is exactly what the edge AI world needs right now.

One thing I'm curious about: how do you handle state across frames? Real-time face swap often benefits from temporal smoothing (tracking face landmarks across multiple frames to reduce jitter), but that requires storing per-session state.

For embedded scenarios like robotics or drones, this kind of state management becomes even more critical — you need to persist model state, calibration data, and frame history without adding latency.

Were you using any form of stateful processing, or was each frame processed independently?

mote

The zero-allocation hot path section is the key insight here. Pre-allocating all pixel buffers at startup and using ArcSwap for lock-free pointer swaps instead of channels is exactly the right approach for real-time video processing.

One thing I've been exploring in a similar space — when you're doing inference on embedded/edge devices where memory is even more constrained, the state management gets trickier. The ArcSwap pattern works well on desktops, but on systems with 512MB RAM you often have to think about where the tensor buffers live between frames.

The 64-byte cache-line alignment trick is underappreciated. That single detail probably saved you 5-10% throughput from false sharing alone.

Curious — did you profile the GFPGAN enhancement step separately? That's typically where I'd expect the most variance across hardware. Are you batching those calls or processing frame-by-frame?