Jonas Losis
How I built a parallel video pipeline on RTX 5090s to kill cloud processing lag

Most AI video tools today are just wrappers around shared cloud GPU instances. When you upload a long video, your file sits in a queue behind hundreds of other jobs, which is why "AI clipping" often takes 40 minutes. The AI itself isn't slow, but the infrastructure is.

I decided to build Sintorio by moving away from rented cloud instances and running on a dedicated cluster of RTX 5090 GPUs that I own and operate. To hit the speeds I wanted, I had to optimize every layer of the stack.

For transcription, I used faster-whisper with a batched inference pipeline. The RTX 5090's 32GB of VRAM (about 25.7GB of it free in my setup) allows a much larger batch size than older cards, which sustains roughly 18x real-time throughput. I also moved face tracking from the CPU to the GPU using SCRFD on ONNX Runtime, which dropped per-frame detection time from about 20ms to about 2ms.
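For reference, the batched path looks roughly like this. This is a sketch, assuming faster-whisper ≥ 1.0 and its `BatchedInferencePipeline`; the model name, file name, and batch size are placeholders, and the helper below just expresses what "18x real-time" means:

```python
# Sketch of the batched transcription path. Assumes faster-whisper >= 1.0,
# which ships BatchedInferencePipeline; model/file/batch size are placeholders.
#
#   from faster_whisper import WhisperModel, BatchedInferencePipeline
#   model = WhisperModel("large-v3", device="cuda", compute_type="float16")
#   pipeline = BatchedInferencePipeline(model=model)
#   segments, info = pipeline.transcribe("episode.wav", batch_size=32)

def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Throughput as a multiple of real time (the '18x' figure above)."""
    return audio_seconds / wall_seconds

# One hour of audio transcribed in 200 seconds of wall-clock time:
print(realtime_factor(3600, 200))  # → 18.0
```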

The rendering itself happens in parallel using a producer-consumer model. Clips start rendering via hardware encoding the moment a viral segment is identified, so the system never sits idle waiting for the next step.
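The shape of that pipeline can be sketched with the stdlib alone. Everything here is hypothetical (the segment timestamps, the queue name, the simulated encode step); in the real pipeline the consumer would shell out to a hardware encoder such as NVENC via ffmpeg:

```python
import queue
import threading

render_queue: "queue.Queue[tuple[int, int] | None]" = queue.Queue()
rendered: list[str] = []

def producer() -> None:
    """Analysis thread: pushes each segment the moment it's identified."""
    for start, end in [(0, 42), (310, 365), (1800, 1859)]:  # placeholder timestamps
        render_queue.put((start, end))
    render_queue.put(None)  # sentinel: analysis finished

def consumer() -> None:
    """Render thread: encodes segments as they arrive, never idle-waiting."""
    while True:
        segment = render_queue.get()
        if segment is None:
            break
        start, end = segment
        # The real pipeline would invoke a hardware encoder here (e.g. NVENC).
        rendered.append(f"clip_{start}_{end}.mp4")

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

print(rendered)  # → ['clip_0_42.mp4', 'clip_310_365.mp4', 'clip_1800_1859.mp4']
```

Because the queue is FIFO and the consumer starts immediately, the first clip begins encoding while analysis of the rest of the video is still running.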

The end result is that a one-hour 4K video can be processed and branded in under two minutes. Since I run the hardware myself, it also allows for a zero data retention policy—videos are auto-deleted immediately after the session because I don't need to train models on user data.
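The deletion side doesn't need anything exotic. A minimal sketch of the session-scoped approach using the stdlib (the directory prefix and file contents are placeholders, not Sintorio's actual code):

```python
import pathlib
import tempfile

# Do all processing inside a session-scoped temporary directory; when the
# context manager exits, the directory and every intermediate file are deleted.
with tempfile.TemporaryDirectory(prefix="session_") as workdir:
    session_dir = pathlib.Path(workdir)
    clip = session_dir / "clip_001.mp4"
    clip.write_bytes(b"\x00" * 16)  # placeholder for a rendered clip
    assert clip.exists()

# After the session ends, nothing is retained on disk.
print(session_dir.exists())  # → False
```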

I'm currently offering a €79 Lifetime Deal to help fund the next rack of 5090s. No investors, just hardware and coffee.

I'd love to hear from anyone else optimizing inference for the Blackwell architecture or running their own GPU setups!
