NVIDIA's PersonaPlex-7B and FlashLabs Chroma 1.0 dropped last week. Both open source. Both change how we build voice applications.
This is the practical implementation guide.
The Latency Problem
Traditional voice pipelines chain three models:
Audio → ASR → Text → LLM → Text → TTS → Audio
Each hop adds 500-1500ms. Total round trip: 2-5 seconds. Conversations feel robotic.
End-to-end models collapse this:
Audio → Unified Model → Audio
Latency drops to 300-500ms, close enough to the ~200ms gap of human turn-taking to feel natural.
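A quick sanity check on those numbers (illustrative only; real budgets depend on the models and the network):
// Back-of-the-envelope latency budget, using the ranges above (illustrative only)
const HOPS = 3                                   // ASR -> LLM -> TTS
const perHopMs = { low: 500, high: 1500 }
const cascadedMs = [perHopMs.low * HOPS, perHopMs.high * HOPS]  // [1500, 4500] ms before network
const endToEndMs = [300, 500]                    // unified audio-to-audio model
const humanTurnGapMs = 200                       // typical gap in human conversation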
Project Setup
mkdir voice-app && cd voice-app
npm init -y
npm install express socket.io socket.io-client ws
npm install @xenova/transformers mediasoup bufferutil utf-8-validate
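One setup gotcha: the server code below uses ES module imports, so mark the package as a module or Node will reject the import statements:
npm pkg set type=module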
Basic server structure:
// server.js
import express from 'express'
import { createServer } from 'http'
import { Server } from 'socket.io'
const app = express()
const server = createServer(app)
const io = new Server(server, {
  cors: { origin: '*' }
})
app.use(express.static('public'))
io.on('connection', (socket) => {
  console.log('Client connected:', socket.id)
  socket.on('audio-chunk', async (data) => {
    // Process through voice model
    // Emit response back
  })
})
server.listen(3001)
Browser Audio Capture
The Web Audio API with AudioWorklet gives us low-latency capture:
// public/js/capture.js
class AudioCapture {
  // onChunk is called with each Float32Array chunk of microphone audio
  constructor(onChunk) {
    this.onChunk = onChunk
    this.context = null
    this.processor = null
  }
  async start() {
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: 16000,
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true
      }
    })
    this.context = new AudioContext({ sampleRate: 16000 })
    const source = this.context.createMediaStreamSource(stream)
    await this.context.audioWorklet.addModule('/js/processor.js')
    this.processor = new AudioWorkletNode(this.context, 'audio-processor')
    this.processor.port.onmessage = (e) => {
      this.onChunk(e.data.audio)
    }
    source.connect(this.processor)
  }
}
export default AudioCapture
The AudioWorklet processor:
// public/js/processor.js
class AudioProcessor extends AudioWorkletProcessor {
  constructor() {
    super()
    this.buffer = new Float32Array(4096)
    this.index = 0
  }
  process(inputs) {
    const input = inputs[0][0]
    if (!input) return true
    for (let i = 0; i < input.length; i++) {
      this.buffer[this.index++] = input[i]
      if (this.index >= 4096) {
        // Ship a copy of the full buffer to the main thread, then start refilling
        this.port.postMessage({ audio: this.buffer.slice() })
        this.index = 0
      }
    }
    return true
  }
}
registerProcessor('audio-processor', AudioProcessor)
4096 samples at 16kHz = 256ms chunks. A safe starting point for the balance between latency and overhead; the latency section below trades this down toward 20-50ms.
Voice Activity Detection
Simple energy-based VAD:
// public/js/vad.js
class VAD {
  constructor(threshold = 0.01, silenceDelay = 500) {
    this.threshold = threshold
    this.silenceDelay = silenceDelay
    this.speaking = false
    this.silenceStart = null
  }
  process(audio) {
    // RMS energy of the chunk
    const energy = Math.sqrt(
      audio.reduce((sum, s) => sum + s * s, 0) / audio.length
    )
    if (energy > this.threshold) {
      this.speaking = true
      this.silenceStart = null
    } else if (this.speaking) {
      if (!this.silenceStart) {
        this.silenceStart = Date.now()
      } else if (Date.now() - this.silenceStart > this.silenceDelay) {
        this.speaking = false
        return 'speech-end'
      }
    }
    return this.speaking ? 'speaking' : 'silent'
  }
}
export default VAD
For noisy environments, swap to Silero VAD (neural, more robust).
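For reference, a rough sketch of that swap using the ONNX runtime (npm install onnxruntime-node, or onnxruntime-web if you run it in the browser; the API is the same). The model file name, the tensor names and shapes (input, sr, h, c / output, hn, cn), and the fixed chunk-size requirement are assumptions about the commonly distributed v4 export; check the model you actually download, since v5 changed the interface.
// services/silero-vad.js — a sketch, not a drop-in: tensor names and shapes below are
// assumptions about the Silero VAD v4 ONNX export and may differ for other versions
import ort from 'onnxruntime-node'
class SileroVAD {
  constructor(modelPath = './silero_vad.onnx', threshold = 0.5) {
    this.modelPath = modelPath
    this.threshold = threshold
    this.session = null
    // Recurrent state carried across chunks (assumed shape [2, 1, 64])
    this.h = new Float32Array(2 * 64)
    this.c = new Float32Array(2 * 64)
  }
  async init() {
    this.session = await ort.InferenceSession.create(this.modelPath)
  }
  // audio: Float32Array chunk at 16kHz (the model expects fixed chunk sizes, e.g. 512 samples)
  async process(audio) {
    const out = await this.session.run({
      input: new ort.Tensor('float32', audio, [1, audio.length]),
      sr: new ort.Tensor('int64', BigInt64Array.from([16000n]), [1]),
      h: new ort.Tensor('float32', this.h, [2, 1, 64]),
      c: new ort.Tensor('float32', this.c, [2, 1, 64])
    })
    this.h = out.hn.data
    this.c = out.cn.data
    // output is the speech probability for this chunk
    return out.output.data[0] > this.threshold ? 'speaking' : 'silent'
  }
}
export default SileroVAD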
PersonaPlex Integration
PersonaPlex-7B supports full-duplex. It processes input while generating output.
// services/personaplex.js
import WebSocket from 'ws'
class PersonaPlexClient {
  constructor(url) {
    this.url = url
    this.ws = null
    this.handlers = new Map()
  }
  async connect() {
    return new Promise((resolve, reject) => {
      this.ws = new WebSocket(this.url)
      this.ws.on('open', resolve)
      this.ws.on('error', reject)
      this.ws.on('message', (data) => {
        const msg = JSON.parse(data)
        this.emit(msg.type, msg.data)
      })
    })
  }
  send(audioData) {
    this.ws.send(JSON.stringify({
      type: 'audio',
      data: Array.from(audioData)
    }))
  }
  interrupt() {
    this.ws.send(JSON.stringify({ type: 'interrupt' }))
  }
  on(event, fn) {
    if (!this.handlers.has(event)) {
      this.handlers.set(event, [])
    }
    this.handlers.get(event).push(fn)
  }
  emit(event, data) {
    (this.handlers.get(event) || []).forEach(fn => fn(data))
  }
}
export default PersonaPlexClient
The interrupt() method is key. Call it when VAD detects user speech during AI response.
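To fill in the placeholder from server.js, the connection handler might look roughly like this once wired to the model. The ws:// URL and the event names coming back from the PersonaPlex server ('audio', 'turn-complete') are assumptions; adapt them to whatever your serving endpoint actually emits.
// server.js (connection handler revisited) — a sketch of the socket <-> model glue,
// assuming a PersonaPlex serving endpoint at ws://localhost:8080
import PersonaPlexClient from './services/personaplex.js'
io.on('connection', async (socket) => {
  const model = new PersonaPlexClient('ws://localhost:8080')
  await model.connect()
  // Stream model output straight back to the browser as it arrives
  model.on('audio', (audio) => socket.emit('audio-response', audio))
  model.on('turn-complete', () => socket.emit('turn-complete'))
  // Forward microphone chunks and barge-in signals from the browser
  socket.on('audio-chunk', (data) => model.send(data))
  socket.on('interrupt', () => model.interrupt())
  socket.on('disconnect', () => model.ws?.close())
})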
Audio Playback Queue
Smooth playback requires buffering:
// public/js/playback.js
class AudioPlayback {
  constructor() {
    this.context = new AudioContext({ sampleRate: 24000 })
    this.queue = []
    this.playing = false
    this.current = null
  }
  enqueue(audioData) {
    this.queue.push(audioData)
    if (!this.playing) this.playNext()
  }
  playNext() {
    if (!this.queue.length) {
      this.playing = false
      this.current = null
      return
    }
    this.playing = true
    const data = this.queue.shift()
    const buffer = this.context.createBuffer(1, data.length, 24000)
    buffer.getChannelData(0).set(data)
    const source = this.context.createBufferSource()
    source.buffer = buffer
    source.connect(this.context.destination)
    source.onended = () => this.playNext()
    source.start()
    this.current = source
  }
  stop() {
    this.queue = []
    this.playing = false
    // Also cut off whatever is currently playing (needed for interruptions)
    if (this.current) {
      this.current.onended = null
      this.current.stop()
      this.current = null
    }
  }
}
export default AudioPlayback
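One browser gotcha: an AudioContext created outside a user gesture starts suspended under autoplay policies, so resume it from a click before the first response arrives. A minimal sketch, assuming a playback instance and a start button (the button id is hypothetical):
// Resume the suspended playback context on the first user gesture
document.querySelector('#start').addEventListener('click', () => {
  playback.context.resume()
})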
Chroma Voice Cloning
Chroma 1.0 clones voices from 30-second samples:
// services/chroma.js
class ChromaClient {
  constructor(url) {
    this.url = url
    this.embedding = null
  }
  async createEmbedding(samples) {
    const res = await fetch(`${this.url}/embed`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        samples: samples.map(s => Array.from(s))
      })
    })
    this.embedding = (await res.json()).embedding
    return this.embedding
  }
  async process(audio) {
    const res = await fetch(`${this.url}/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        audio: Array.from(audio),
        voice: this.embedding
      })
    })
    const data = await res.json()
    return new Float32Array(data.audio)
  }
}
export default ChromaClient
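One way this might slot into the server: enroll the voice once from roughly 30 seconds of reference audio, then swap the direct audio-response emit from the glue above for a re-voiced one. The http://localhost:8081 URL and the referenceChunks variable are placeholders.
// Hypothetical enrollment and per-turn usage on the server
import ChromaClient from './services/chroma.js'
const chroma = new ChromaClient('http://localhost:8081')
// referenceChunks: ~30s of Float32Array chunks captured from the target speaker
await chroma.createEmbedding(referenceChunks)
// Inside the model 'audio' handler, re-voice the output before emitting it
model.on('audio', async (audio) => {
  const cloned = await chroma.process(new Float32Array(audio))
  socket.emit('audio-response', Array.from(cloned))
})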
Putting It Together
Complete conversation manager:
// public/js/conversation.js
import { io } from 'socket.io-client'
import AudioCapture from './capture.js'
import AudioPlayback from './playback.js'
import VAD from './vad.js'
class Conversation {
  constructor(url) {
    this.socket = io(url)
    // Route every microphone chunk through VAD before it leaves the browser
    this.capture = new AudioCapture((audio) => this.processChunk(audio))
    this.playback = new AudioPlayback()
    this.vad = new VAD()
    this.state = 'idle'
  }
  async start() {
    await this.capture.start()
    this.socket.on('audio-response', (data) => {
      this.state = 'speaking'
      this.playback.enqueue(new Float32Array(data))
    })
    this.socket.on('turn-complete', () => {
      this.state = 'idle'
    })
  }
  processChunk(audio) {
    const vadResult = this.vad.process(audio)
    if (vadResult === 'speaking' && this.state === 'speaking') {
      // User interrupted the AI mid-response
      this.socket.emit('interrupt')
      this.playback.stop()
    }
    if (vadResult !== 'silent') {
      this.socket.emit('audio-chunk', Array.from(audio))
    }
  }
}
export default Conversation
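Wiring it to a page is then a few lines. This assumes a bundler (e.g. Vite) resolves the socket.io-client import and that public/index.html has a start button; starting on a click also keeps the autoplay policy happy.
// public/js/main.js — hypothetical entry point
import Conversation from './conversation.js'
const convo = new Conversation('http://localhost:3001')
document.querySelector('#start').addEventListener('click', async () => {
  await convo.playback.context.resume()   // see the autoplay note above
  await convo.start()
})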
Latency Optimization
Chunk size matters:
const SAMPLE_RATE = 16000
const CHUNK_MS = 20
const CHUNK_SIZE = Math.floor(SAMPLE_RATE * CHUNK_MS / 1000)
// 320 samples = 20ms chunks
Smaller chunks = lower latency, more overhead. 20-50ms is the sweet spot.
Keep connections warm:
// Ping every 30s to prevent timeout
setInterval(() => {
if (ws.readyState === WebSocket.OPEN) {
ws.send(JSON.stringify({ type: 'ping' }))
}
}, 30000)
Stream responses immediately:
Do not wait for the complete response. Start playback as soon as the first audio chunk arrives.
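It's worth measuring this directly. A rough client-side sketch: stamp the moment the VAD reports speech-end, then log the gap to the first audio-response (markSpeechEnd is a hypothetical hook; call it from processChunk).
// Rough time-to-first-audio measurement (the latency users actually perceive)
let speechEndedAt = 0
function markSpeechEnd() {
  // call this when the VAD returns 'speech-end'
  speechEndedAt = performance.now()
}
socket.on('audio-response', () => {
  if (speechEndedAt) {
    console.log('time to first audio:', Math.round(performance.now() - speechEndedAt), 'ms')
    speechEndedAt = 0
  }
})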
Audio Quality Check
Detect problems before they affect the model:
function checkQuality(audio) {
  const energy = Math.sqrt(
    audio.reduce((sum, s) => sum + s * s, 0) / audio.length
  )
  const clipped = audio.filter(s => Math.abs(s) > 0.99).length
  const clippingRatio = clipped / audio.length
  if (energy < 0.001) return { ok: false, issue: 'too-quiet' }
  if (energy > 0.8) return { ok: false, issue: 'too-loud' }
  if (clippingRatio > 0.01) return { ok: false, issue: 'clipping' }
  return { ok: true }
}
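Run it on the server per chunk (or batched per turn) and surface the result instead of letting the model degrade silently. The quality-warning event name is made up here; handle it client-side with a UI hint.
// Inside the 'audio-chunk' handler on the server
const quality = checkQuality(Float32Array.from(data))
if (!quality.ok) {
  socket.emit('quality-warning', quality.issue)   // hypothetical event name
}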
Deployment Notes
GPU requirements: Both models need a GPU for real-time inference. An A10G handles 10-20 concurrent sessions.
Regional latency: Deploy inference close to your users; an extra 100ms of network latency kills the conversational feel.
Cost: GPU instances are expensive. Implement session timeouts and usage limits.
The code above is a starting point. Production apps need error handling, reconnection logic, and proper state management.
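For the reconnection piece specifically, socket.io's client options cover the transport layer; the numbers below are just reasonable defaults to tune.
// Client-side: let socket.io retry with backoff, and re-arm the session when it returns
const socket = io('http://localhost:3001', {
  reconnection: true,
  reconnectionAttempts: Infinity,
  reconnectionDelay: 1000,       // start retrying after 1s
  reconnectionDelayMax: 5000     // cap the backoff at 5s
})
socket.io.on('reconnect', () => {
  console.log('reconnected, restarting capture')
  // e.g. re-run conversation.start() or re-sync any server-side session state
})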
But the core architecture is here. Full-duplex voice AI, running on open source models, built with JavaScript.
Six months ago this required enterprise APIs and enterprise budgets.
Now it requires a weekend.