I have spent the last month building real-time voice agents. I started with the standard stack: LiveKit and Gemini 2.5.
Even though the latency is impressively low, it still feels far from a natural conversation. Talking to these state-of-the-art models usually feels like playing a turn-based video game.
My turn: I speak.
System turn: It waits for silence. It thinks. It speaks.
This is "Half-Duplex" logic. It is like using a Walkie-Talkie. The system forces you to wait. But real conversation is "Full-Duplex". We interrupt each other. We laugh at the same time. We hum while listening.
For the last two days, I have been working with NVIDIA's PersonaPlex (based on Moshi/Mimi). It is completely different. It does not wait for you to stop talking.
The Code Proves It
I looked at the backend code to understand why it feels so different. The secret is in moshi/server.py.
In standard agents, you have a loop that waits for an "End of Turn" signal. In PersonaPlex, I found this in the ServerState initialization:
```python
# moshi/moshi/server.py L115
self.mimi.streaming_forever(1)
self.other_mimi.streaming_forever(1)
self.lm_gen.streaming_forever(1)
```
It is literally "streaming forever." The model processes my voice and its own voice at the same time, 12.5 times every second. It predicts silence or speech constantly. It does not need "permission" to speak.
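Here is how I picture that loop. This is a sketch of the idea, not the actual moshi code: at every step the model ingests one frame of my audio and one frame of its own previous output, then emits its next frame, which may be speech or silence.

```python
# Conceptual full-duplex loop: no end-of-turn gate, just a fixed frame cadence.
import time

FRAME_RATE_HZ = 12.5                  # Mimi's frame rate
FRAME_SECONDS = 1.0 / FRAME_RATE_HZ   # ~80 ms per step (1920 samples at 24 kHz)

def get_user_frame() -> bytes:
    return b"\x00" * 1920             # stand-in for one frame of incoming audio

def model_step(user_frame: bytes, own_frame: bytes) -> bytes:
    return b"\x00" * 1920             # stand-in: the model's next frame (speech or silence)

own_frame = b"\x00" * 1920
while True:                           # "streaming forever": the loop never waits for my turn to end
    own_frame = model_step(get_user_frame(), own_frame)
    time.sleep(FRAME_SECONDS)
```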
Realism is Overrated; Rhythm is Everything
Most AI voices feel like "fake meat"—they sound human but act robotic. PersonaPlex is different. It trades audio quality for speed.
To hit a 240ms reaction time, the audio runs at 24kHz (confirmed in loaders.py as SAMPLE_RATE = 24000). I run this command on my voice files to match the training environment:
```bash
ffmpeg -i input -ar 24000 -ac 1 -c:a pcm_s16le -af "lowpass=f=8000" output.wav
```
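If you have a whole folder of voice samples, a small wrapper around that command saves some typing. This is just my own helper around ffmpeg, nothing PersonaPlex-specific; the folder names are placeholders.

```python
# Batch version of the ffmpeg command above: 24 kHz, mono, 16-bit PCM, 8 kHz low-pass.
import subprocess
from pathlib import Path

def convert_voice_prompts(src_dir: str, dst_dir: str) -> None:
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in Path(src_dir).glob("*.wav"):
        subprocess.run(
            [
                "ffmpeg", "-y",
                "-i", str(wav),
                "-ar", "24000",           # match SAMPLE_RATE = 24000 from loaders.py
                "-ac", "1",               # mono
                "-c:a", "pcm_s16le",      # 16-bit PCM
                "-af", "lowpass=f=8000",  # drop everything above 8 kHz
                str(out / wav.name),
            ],
            check=True,
        )

convert_voice_prompts("raw_voices", "voice_prompts_24k")
```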
It is lo-fi, but the rhythm is perfect. The model relies on consistent "Chatterbox TTS" data and learns from "negative-duration silence" during training. This forces it to understand that conversation involves overlapping, not just waiting. It might sound synthetic, but it laughs and interrupts exactly like a human.
The Body & Brain Split
PersonaPlex separates "how it sounds" from "what it thinks."
- The Body (Voice Prompt): A 15-second audio clip for acoustics (loaded via lm_gen.load_voice_prompt).
- The Brain (Text Prompt): Instructions for behavior.
The system pre-loads the voice prompt to keep latency down. But the two must match. You cannot pair a calm "Customer Service" voice with an "Angry Pirate" text prompt; the model will glitch because the acoustic skeleton fights the semantic brain.
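To keep the body and brain from drifting apart, I keep each voice file and its text prompt in one record. This is my own bookkeeping, not PersonaPlex API; the only real call here is load_voice_prompt, which appears in server.py, and the file paths are placeholders.

```python
# Keep the acoustic "body" and semantic "brain" paired so they can never mismatch.
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    voice_prompt: str  # path to a ~15-second, 24 kHz mono wav (acoustics)
    text_prompt: str   # behavioral instructions (semantics)

PERSONAS = {
    "support": Persona(
        voice_prompt="voice_prompts_24k/calm_support_agent.wav",
        text_prompt="You are a calm customer service agent who solves problems step by step.",
    ),
    "pirate": Persona(
        voice_prompt="voice_prompts_24k/gravelly_pirate.wav",
        text_prompt="You are a loud, boisterous pirate captain.",
    ),
}

def apply_persona(lm_gen, persona: Persona) -> None:
    # Pre-load the voice once, before the session starts, to keep latency low.
    lm_gen.load_voice_prompt(persona.voice_prompt)
    # The text prompt goes wherever your serving stack injects system text;
    # the point is that it always ships together with the matching voice file.
```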
Unlocking "Social Mode"
To stop it from acting like a boring assistant, use this specific trigger phrase found in the training data (and verified in the server code's system tagging):
"You enjoy having a good conversation."
Combine this with a high-energy voice sample, and it switches modes. It starts laughing, interrupting, and "vibing" instead of just solving tasks.
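Concretely, my "social mode" text prompt looks something like this. Only the trigger sentence comes from the training data; the framing around it is my own.

```python
SOCIAL_PROMPT = (
    "You are chatting with a friend about their week. "
    "You enjoy having a good conversation. "  # the trigger phrase from the training data
    "You react, laugh, and jump in when something is funny."
)
```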
The Reality Check (Trade-offs)
While the roadmap shows tool-calling is coming next, there are still significant hurdles:
- Context Limits: The model has a fixed context window (defined as context: 3000 frames in loaders.py). At 12.5Hz, this translates to roughly 240 seconds of memory. My tests show it often gets unstable around 160 seconds (see the sketch after this list).
- Stability: Overlapping speech feels natural until it gets buggy. Sometimes the model will just speak over you non-stop.
- Cost: "Infinite streaming" requires high-end NVIDIA GPUs (A100/H100).
- Complexity: Managing simultaneous audio/text streams is far more complex than standard WebSockets.
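The context math in the first bullet is easy to sanity-check. This is just my own arithmetic with the two numbers from loaders.py and the Mimi frame rate; the 160-second figure comes from my informal testing, not from the code.

```python
# Rough memory-window check for the fixed context.
CONTEXT_FRAMES = 3000   # context: 3000 in loaders.py
FRAME_RATE_HZ = 12.5    # Mimi frame rate

max_memory_s = CONTEXT_FRAMES / FRAME_RATE_HZ   # 240.0 seconds on paper
practical_s = 160                               # where my sessions start to wobble

print(f"theoretical window: {max_memory_s:.0f}s, usable in practice: ~{practical_s}s")
```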
Despite these issues, PersonaPlex is the first model I have used that feels like a natural customer service agent rather than a text-to-speech bot.
You are welcome to follow me on Substack; I will be publishing deeper tests and analyses as I spend more time with the model.