Full-Duplex Voice AI in JavaScript: Building Real-Time Conversational Apps With Open Source Models

NVIDIA's PersonaPlex-7B and FlashLabs Chroma 1.0 dropped last week. Both open source. Both change how we build voice applications.

This is the practical implementation guide.

The Latency Problem

Traditional voice pipelines chain three models:

Audio → ASR → Text → LLM → Text → TTS → Audio

Each hop adds 500-1500ms. Total round trip: 2-5 seconds. Conversations feel robotic.

End-to-end models collapse this:

Audio → Unified Model → Audio

Latency drops to 300-500ms. Close enough to human turn-taking (200ms) to feel natural.

Project Setup

mkdir voice-app && cd voice-app
npm init -y
npm install express socket.io socket.io-client ws @xenova/transformers
npm install mediasoup bufferutil utf-8-validate

Basic server structure:

// server.js
import express from 'express'
import { createServer } from 'http'
import { Server } from 'socket.io'

const app = express()
const server = createServer(app)
const io = new Server(server, {
  cors: { origin: '*' }
})

app.use(express.static('public'))

io.on('connection', (socket) => {
  console.log('Client connected:', socket.id)

  socket.on('audio-chunk', async (data) => {
    // Process through voice model
    // Emit response back
  })
})

server.listen(3001)
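One detail before running this: the imports above use ES module syntax, so add the module flag to package.json (or rename the server files to .mjs):

{
  "type": "module"
}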

Browser Audio Capture

The Web Audio API with AudioWorklet gives us low-latency capture:

// public/js/capture.js
class AudioCapture {
  constructor(onChunk) {
    // Called with each Float32Array chunk the worklet produces
    this.onChunk = onChunk
    this.context = null
    this.processor = null
  }

  async start() {
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: 16000,
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true
      }
    })

    this.context = new AudioContext({ sampleRate: 16000 })
    const source = this.context.createMediaStreamSource(stream)

    await this.context.audioWorklet.addModule('/js/processor.js')
    this.processor = new AudioWorkletNode(this.context, 'audio-processor')

    this.processor.port.onmessage = (e) => {
      this.onChunk(e.data.audio)
    }

    source.connect(this.processor)
  }
}
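Usage sketch: construct the capture with a chunk handler and call start() from a user gesture, since browsers block audio contexts created outside one. The #start button is just a stand-in for whatever UI begins the session:

const capture = new AudioCapture((chunk) => {
  // chunk is a Float32Array of 4096 samples at 16kHz
  console.log('captured', chunk.length, 'samples')
})

document.querySelector('#start').addEventListener('click', () => capture.start())

In the full app, the Conversation class further down passes a handler that runs VAD and forwards chunks to the server.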

The AudioWorklet processor:

// public/js/processor.js
class AudioProcessor extends AudioWorkletProcessor {
  constructor() {
    super()
    this.buffer = new Float32Array(4096)
    this.index = 0
  }

  process(inputs) {
    const input = inputs[0][0]
    if (!input) return true

    for (let i = 0; i < input.length; i++) {
      this.buffer[this.index++] = input[i]

      if (this.index >= 4096) {
        this.port.postMessage({ audio: this.buffer.slice() })
        this.index = 0
      }
    }
    return true
  }
}

registerProcessor('audio-processor', AudioProcessor)

4096 samples at 16kHz = 256ms chunks — an easy size to start with. The latency section below covers moving to much smaller chunks.

Voice Activity Detection

Simple energy-based VAD:

// public/js/vad.js
class VAD {
  constructor(threshold = 0.01, silenceDelay = 500) {
    this.threshold = threshold
    this.silenceDelay = silenceDelay
    this.speaking = false
    this.silenceStart = null
  }

  process(audio) {
    const energy = Math.sqrt(
      audio.reduce((sum, s) => sum + s * s, 0) / audio.length
    )

    if (energy > this.threshold) {
      this.speaking = true
      this.silenceStart = null
    } else if (this.speaking) {
      if (!this.silenceStart) {
        this.silenceStart = Date.now()
      } else if (Date.now() - this.silenceStart > this.silenceDelay) {
        this.speaking = false
        return 'speech-end'
      }
    }

    return this.speaking ? 'speaking' : 'silent'
  }
}

export default VAD

For noisy environments, swap to Silero VAD (neural, more robust).
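If you stay with the energy VAD, the fixed 0.01 threshold is a guess. A simple improvement (not part of the class above) is to calibrate against a second of ambient noise at startup:

// Sketch: measure ambient RMS over the first ~1s of chunks, then set the
// threshold a few times above that noise floor.
function calibrateThreshold(ambientChunks, factor = 3) {
  const rms = (chunk) =>
    Math.sqrt(chunk.reduce((sum, s) => sum + s * s, 0) / chunk.length)
  const noiseFloor =
    ambientChunks.reduce((sum, c) => sum + rms(c), 0) / ambientChunks.length
  return Math.max(0.005, noiseFloor * factor)
}

// const vad = new VAD(calibrateThreshold(firstSecondOfChunks))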

PersonaPlex Integration

PersonaPlex-7B supports full-duplex. It processes input while generating output.

// services/personaplex.js
import WebSocket from 'ws'

class PersonaPlexClient {
  constructor(url) {
    this.url = url
    this.ws = null
    this.handlers = new Map()
  }

  async connect() {
    return new Promise((resolve, reject) => {
      this.ws = new WebSocket(this.url)
      this.ws.on('open', resolve)
      this.ws.on('error', reject)
      this.ws.on('message', (data) => {
        const msg = JSON.parse(data)
        this.emit(msg.type, msg.data)
      })
    })
  }

  send(audioData) {
    this.ws.send(JSON.stringify({
      type: 'audio',
      data: Array.from(audioData)
    }))
  }

  interrupt() {
    this.ws.send(JSON.stringify({ type: 'interrupt' }))
  }

  on(event, fn) {
    if (!this.handlers.has(event)) {
      this.handlers.set(event, [])
    }
    this.handlers.get(event).push(fn)
  }

  emit(event, data) {
    (this.handlers.get(event) || []).forEach(fn => fn(data))
  }
}

export default PersonaPlexClient

The interrupt() method is key. Call it when the VAD detects user speech during an AI response.

Audio Playback Queue

Smooth playback requires buffering:

// public/js/playback.js
class AudioPlayback {
  constructor() {
    this.context = new AudioContext({ sampleRate: 24000 })
    this.queue = []
    this.playing = false
    this.current = null
  }

  enqueue(audioData) {
    this.queue.push(audioData)
    if (!this.playing) this.playNext()
  }

  playNext() {
    if (!this.queue.length) {
      this.playing = false
      this.current = null
      return
    }

    this.playing = true
    const data = this.queue.shift()
    const buffer = this.context.createBuffer(1, data.length, 24000)
    buffer.getChannelData(0).set(data)

    const source = this.context.createBufferSource()
    source.buffer = buffer
    source.connect(this.context.destination)
    source.onended = () => this.playNext()
    source.start()
    this.current = source
  }

  stop() {
    // Drop queued chunks and cut off the chunk currently playing
    // so an interruption silences the AI immediately
    this.queue = []
    if (this.current) {
      this.current.onended = null
      this.current.stop()
      this.current = null
    }
    this.playing = false
  }
}

export default AudioPlayback
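One browser gotcha: an AudioContext created outside a user gesture starts suspended, so output stays silent until you resume it. Do that from the same click that starts the microphone (startButton here is whatever element kicks off the session):

const playback = new AudioPlayback()

startButton.addEventListener('click', async () => {
  // Autoplay policies require a user gesture before audio output is allowed
  await playback.context.resume()
})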

Chroma Voice Cloning

Chroma 1.0 clones voices from 30-second samples:

// services/chroma.js
class ChromaClient {
  constructor(url) {
    this.url = url
    this.embedding = null
  }

  async createEmbedding(samples) {
    const res = await fetch(`${this.url}/embed`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ 
        samples: samples.map(s => Array.from(s)) 
      })
    })
    this.embedding = (await res.json()).embedding
    return this.embedding
  }

  async process(audio) {
    const res = await fetch(`${this.url}/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        audio: Array.from(audio),
        voice: this.embedding
      })
    })

    const data = await res.json()
    return new Float32Array(data.audio)
  }
}

export default ChromaClient
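Typical usage: record roughly 30 seconds of the target voice once, build the embedding, then run model output through process(). The URL is a placeholder, and referenceChunks / modelAudioChunk stand in for audio you have already captured:

const chroma = new ChromaClient('http://localhost:8080')  // placeholder URL

// One-time enrollment from ~30s of recorded Float32Array chunks
await chroma.createEmbedding(referenceChunks)

// Per response: re-voice the model's audio with the cloned voice
const cloned = await chroma.process(modelAudioChunk)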

Putting It Together

Complete conversation manager:

// public/js/conversation.js
import { io } from 'socket.io-client'
import AudioCapture from './capture.js'
import AudioPlayback from './playback.js'
import VAD from './vad.js'

class Conversation {
  constructor(url) {
    this.socket = io(url)
    this.capture = new AudioCapture((audio) => this.processChunk(audio))
    this.playback = new AudioPlayback()
    this.vad = new VAD()
    this.state = 'idle'
  }

  async start() {
    await this.capture.start()

    this.socket.on('audio-response', (data) => {
      this.state = 'speaking'
      this.playback.enqueue(new Float32Array(data))
    })

    this.socket.on('turn-complete', () => {
      this.state = 'idle'
    })
  }

  processChunk(audio) {
    const vadResult = this.vad.process(audio)

    if (vadResult === 'speaking' && this.state === 'speaking') {
      // User interrupted
      this.socket.emit('interrupt')
      this.playback.stop()
    }

    if (vadResult !== 'silent') {
      this.socket.emit('audio-chunk', Array.from(audio))
    }
  }
}

export default Conversation
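On the server, the same events need to be wired to the model clients — this fills in the stub handler from server.js. It is a sketch only: the WebSocket and HTTP endpoints, and the 'audio' / 'turn-complete' message types coming back from the PersonaPlex server, are assumptions about how those services are exposed, not documented APIs.

// server.js — filling in the connection handler from the setup section
import PersonaPlexClient from './services/personaplex.js'
import ChromaClient from './services/chroma.js'

const chroma = new ChromaClient('http://localhost:8080')     // assumed Chroma endpoint

io.on('connection', async (socket) => {
  const plex = new PersonaPlexClient('ws://localhost:8000')  // assumed PersonaPlex endpoint
  await plex.connect()

  // Model output → optional voice cloning → browser, streamed chunk by chunk
  plex.on('audio', async (chunk) => {
    const audio = chroma.embedding
      ? await chroma.process(new Float32Array(chunk))
      : new Float32Array(chunk)
    socket.emit('audio-response', Array.from(audio))
  })
  plex.on('turn-complete', () => socket.emit('turn-complete'))

  // Browser → model
  socket.on('audio-chunk', (data) => plex.send(new Float32Array(data)))
  socket.on('interrupt', () => plex.interrupt())

  socket.on('disconnect', () => plex.ws?.close())
})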

Latency Optimization

Chunk size matters:

const SAMPLE_RATE = 16000
const CHUNK_MS = 20
const CHUNK_SIZE = Math.floor(SAMPLE_RATE * CHUNK_MS / 1000)
// 320 samples = 20ms chunks

Smaller chunks = lower latency, more per-message overhead. 20-50ms is the sweet spot once you move past the 256ms capture chunks from earlier.

Keep connections warm:

// Ping every 30s to prevent timeout
setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: 'ping' }))
  }
}, 30000)

Stream responses immediately:

Do not wait for the complete response. Start playback on the first audio chunk — the AudioPlayback queue above already behaves this way, since enqueue() starts playing as soon as something arrives.

Audio Quality Check

Detect problems before they affect the model:

function checkQuality(audio) {
  const energy = Math.sqrt(
    audio.reduce((sum, s) => sum + s * s, 0) / audio.length
  )

  const clipped = audio.filter(s => Math.abs(s) > 0.99).length
  const clippingRatio = clipped / audio.length

  if (energy < 0.001) return { ok: false, issue: 'too-quiet' }
  if (energy > 0.8) return { ok: false, issue: 'too-loud' }
  if (clippingRatio > 0.01) return { ok: false, issue: 'clipping' }

  return { ok: true }
}
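One place to hook this in — an assumption about where it fits, not something prescribed above — is the server-side 'audio-chunk' handler, before a chunk reaches the model. The 'audio-quality' event name is made up for this example:

socket.on('audio-chunk', (data) => {
  const audio = new Float32Array(data)
  const quality = checkQuality(audio)

  if (!quality.ok) {
    // Let the UI nudge the user ("too quiet", "clipping", ...)
    socket.emit('audio-quality', quality.issue)
    return
  }

  plex.send(audio)
})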

Deployment Notes

GPU requirements: Both models need a GPU for real-time inference. An A10G handles 10-20 concurrent sessions.

Regional latency: Deploy inference close to your users. An extra 100ms of network latency kills the conversational feel.

Cost: GPU instances are expensive. Implement session timeouts and usage limits.
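A crude guard against runaway GPU bills: disconnect sockets that go idle. The five-minute limit is an arbitrary example:

const IDLE_LIMIT_MS = 5 * 60 * 1000  // arbitrary example value

io.on('connection', (socket) => {
  socket.data.lastActivity = Date.now()
  socket.on('audio-chunk', () => { socket.data.lastActivity = Date.now() })
})

// Sweep every 30s and drop anything idle past the limit
setInterval(() => {
  for (const [, socket] of io.of('/').sockets) {
    if (Date.now() - socket.data.lastActivity > IDLE_LIMIT_MS) {
      socket.disconnect(true)
    }
  }
}, 30000)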


The code above is a starting point. Production apps need error handling, reconnection logic, and proper state management.

But the core architecture is here. Full-duplex voice AI, running on open source models, built with JavaScript.

Six months ago this required enterprise APIs and enterprise budgets.

Now it requires a weekend.
