Arden Talbot
How I hit 375ms Voice-to-Voice latency by ditching OpenAI for Bare Metal NVIDIA Blackwells

The "Wrapper" Trap

I run an AI automation agency for healthcare clients. For the last year, I've been fighting a losing battle against latency.

Every time I built a Voice Agent using the standard stack (Twilio → Vapi/Retell → GPT-4o → ElevenLabs), I hit a hard wall:

Latency: 800ms - 1200ms round trip.

The "Feel": It felt like a walkie-talkie. Users would interrupt the bot, and the bot would keep talking for a second before realizing it.

The Cost: $0.10/min + $1,000/mo for a HIPAA BAA.

I realized that network hops were the killer. Every time the audio left the server to go to OpenAI (intelligence) or ElevenLabs (TTS), I was losing 200ms. So, I did the "irrational" thing: I bought my own hardware.

The Bare Metal Stack

I moved everything to a dedicated NVIDIA Blackwell cluster.

The goal was simple: Zero Network Hops. The Audio, the Brain (LLM), and the Mouth (TTS) all live in the same GPU VRAM. Here is the architecture that got us to 375ms:

Ingestion: Twilio Stream → Custom Rust WebSocket Server (Tokio-based).

ASR: Nemotron (running locally).

The Brain: Nemotron-4 (4-bit quantized). Why Nemotron? It follows conversational instructions better than Llama-3 and fits nicely in memory.

The Mouth: Kokoro-82M. Why Kokoro? It's tiny (82M params) but sounds better than models 10x its size. Because it's so small, we can keep it "hot" in VRAM right next to the LLM.
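On the ingestion side, Twilio media streams deliver 8 kHz G.711 μ-law audio, so the first thing the server has to do is decode each byte into linear PCM before it reaches the ASR. Here is a minimal sketch of that decode step; the function names and frame wiring are illustrative, not the actual server code:

```rust
/// Decode one G.711 mu-law byte into a 16-bit linear PCM sample.
/// Twilio media streams carry 8 kHz mu-law audio, so this runs on
/// every byte of every incoming frame. (Illustrative sketch.)
fn mulaw_to_pcm16(mu: u8) -> i16 {
    const BIAS: i16 = 0x84; // 132, the G.711 encoder bias
    let mu = !mu; // mu-law bytes are stored bit-inverted on the wire
    let sign = mu & 0x80 != 0;
    let exponent = ((mu >> 4) & 0x07) as u32;
    let mantissa = (mu & 0x0F) as i16;
    let magnitude = ((mantissa << 3) + BIAS) << exponent;
    let sample = magnitude - BIAS;
    if sign { -sample } else { sample }
}

/// Decode a whole media-stream frame in one pass.
fn decode_frame(payload: &[u8]) -> Vec<i16> {
    payload.iter().map(|&b| mulaw_to_pcm16(b)).collect()
}

fn main() {
    // 0xFF encodes silence (+0); 0x00 is the loudest negative sample.
    println!("{:?}", decode_frame(&[0xFF, 0x00])); // [0, -32124]
}
```

Because the decode is a pure per-byte function, it can run inline in the Tokio read loop without adding measurable latency.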

The "Zero Retention" Architecture (HIPAA)
Since I own the metal, I could also solve the compliance issue at the kernel level.

Healthcare clients require HIPAA compliance, which usually means expensive encrypted storage and audit logs. I decided to go the other way: Don't store anything.

We configured the Linux kernel to run in a "volatile-only" mode:

```shell
# Disable swap to prevent RAM from touching the disk
sudo swapoff -a
sudo sysctl vm.swappiness=0

# Mount logs to RAM
sudo mount -t tmpfs -o size=512m tmpfs /var/log/voquii
```

By processing the entire call in RAM and flushing it immediately after the WebSocket closes, we achieve "Zero Data Retention."
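One way to make "flush after the WebSocket closes" a mechanical guarantee rather than a convention is to tie the call's buffer lifetime to the connection task and scrub it in `Drop`. A hypothetical sketch, assuming a per-call buffer type (`CallBuffer` and its fields are illustrative, not the production code):

```rust
/// Hypothetical in-RAM buffer for one call: audio only ever lives here,
/// and is overwritten with zeros before the memory is released.
struct CallBuffer {
    samples: Vec<i16>,
}

impl CallBuffer {
    fn new() -> Self {
        Self { samples: Vec::new() }
    }

    /// Append one decoded PCM frame from the WebSocket stream.
    fn push_frame(&mut self, frame: &[i16]) {
        self.samples.extend_from_slice(frame);
    }

    /// Best-effort scrub: overwrite every sample before deallocation.
    fn scrub(&mut self) {
        for s in self.samples.iter_mut() {
            *s = 0;
        }
    }
}

impl Drop for CallBuffer {
    fn drop(&mut self) {
        // Runs when the connection task ends, whether the call closed
        // cleanly or the task bailed out early: no path leaves audio behind.
        self.scrub();
    }
}
```

With swap off and logs on tmpfs, this RAM copy is the only copy that ever exists. In release builds you would also want a `zeroize`-style crate (or a compiler fence) so the scrub loop can't be optimized away as a dead store.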

No recordings on disk.

No transcripts in DB.

Result: I can sign BAAs for free because my liability surface area is effectively zero.

The Results
Time-to-Speech: ~375ms, consistently.

Cost: Flat rate (GPU cost), no per-minute tokens.

Experience: It feels like talking to a human on a slightly bad cell connection, rather than a robot.

Try the Demo
We just launched the beta on Product Hunt today to stress-test the cluster.

If you want to experience the latency (or roast my Rust implementation), check it out here:

πŸ‘‰ https://www.producthunt.com/products/voquii?utm_source=other&utm_medium=social

I’m hanging out in the comments all day answering questions about the bare-metal setup and how we handle the audio buffering!
