
Mart Schweiger

Originally published at assemblyai.com

Node.js Voice Agent with AssemblyAI Universal-3 Pro Streaming


Build a real-time voice agent in Node.js using the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro) for speech-to-text — no Python required, no heavy framework dependencies.

Two modes in one repo:

  1. Terminal agent (src/agent.js): captures microphone input with the mic package and plays TTS audio in your terminal
  2. Browser server (src/server.js): Node.js WebSocket server with a browser UI that captures audio via getUserMedia

Why AssemblyAI Universal-3 Pro for Node.js?

Metric | AssemblyAI Universal-3 Pro | Deepgram Nova-3
P50 latency | 307 ms | 516 ms
Word Error Rate | 8.14% | 9.87%
Neural turn detection | | ❌ (VAD only)
Mid-session prompting | |
Real-time diarization | |
Anti-hallucination | |

Neural turn detection eliminates the need for a separate VAD library. The model uses both acoustic and linguistic signals to detect when a speaker has finished — not just when they've gone silent.

Quick Start

git clone https://github.com/kelseyefoster/voice-agent-nodejs-assemblyai
cd voice-agent-nodejs-assemblyai

npm install
cp .env.example .env
# Edit .env with your API keys

Terminal Agent

npm start
# Speak into your mic — Ctrl+C to quit

Browser Server

npm run server
# Open http://localhost:3000

AssemblyAI WebSocket URL

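// API key from the environment (see .env in Quick Start)
const { ASSEMBLYAI_API_KEY } = process.env;

// Audio format and turn-detection settings are passed as query parameters
const AAI_WS_URL =
  `wss://streaming.assemblyai.com/v3/ws` +
  `?speech_model=u3-rt-pro` +
  `&encoding=pcm_s16le` +
  `&sample_rate=16000` +
  `&end_of_turn_confidence_threshold=0.4` +
  `&min_end_of_turn_silence_when_confident=300` +
  `&max_turn_silence=1500` +
  `&token=${ASSEMBLYAI_API_KEY}`;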
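
The same URL can also be assembled with URLSearchParams, which percent-encodes values and keeps the tuning parameters in one place. This is a minimal sketch, assuming the endpoint and parameter names shown in the snippet above; buildStreamingUrl is a hypothetical helper, not part of the repo or of any AssemblyAI SDK.

```javascript
// Hypothetical helper: builds the streaming URL from defaults plus overrides.
// The endpoint and parameter names are copied from the snippet above.
const AAI_WS_BASE = "wss://streaming.assemblyai.com/v3/ws";

function buildStreamingUrl(token, overrides = {}) {
  const params = new URLSearchParams({
    speech_model: "u3-rt-pro",
    encoding: "pcm_s16le",
    sample_rate: "16000",
    end_of_turn_confidence_threshold: "0.4",
    min_end_of_turn_silence_when_confident: "300",
    max_turn_silence: "1500",
  });
  // Apply per-call overrides, e.g. a longer max_turn_silence
  for (const [key, value] of Object.entries(overrides)) {
    params.set(key, String(value));
  }
  params.set("token", token);
  return `${AAI_WS_BASE}?${params.toString()}`;
}

// Usage: give speakers more thinking time before a forced cutoff
console.log(buildStreamingUrl("demo-token", { max_turn_silence: 2000 }));
```

Keeping the defaults in an object makes it easier to experiment with turn-detection values without editing a long template string.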

Message Handling

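ws.on("message", async (data) => {
  const msg = JSON.parse(data.toString());

  // Sent once when the session opens
  if (msg.type === "Begin") {
    console.log(`Session: ${msg.id}`);
    return;
  }

  if (msg.type !== "Turn") return;

  if (!msg.end_of_turn) {
    // Partial transcript: overwrite the current terminal line
    process.stdout.write(`\r${msg.transcript}`);
    return;
  }

  // End of turn: generate a reply and speak it
  try {
    const reply = await generateResponse(msg.transcript);
    await speak(reply);
  } catch (err) {
    // Without this, a failed LLM or TTS call becomes an unhandled rejection
    console.error("Failed to respond:", err);
  }
});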
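
The handler above assumes every frame is valid JSON with a known type. One way to make the routing testable without a live socket is to separate parsing from side effects. The sketch below does that; classifyMessage and its action names are hypothetical, and the message fields (type, id, transcript, end_of_turn) are the ones used in the handler above.

```javascript
// Hypothetical helper: maps a raw WebSocket frame to a plain action object.
// It never throws, so malformed frames cannot crash the message handler.
function classifyMessage(raw) {
  let msg;
  try {
    msg = JSON.parse(raw.toString());
  } catch {
    return { action: "ignore", reason: "invalid JSON" };
  }
  if (msg === null || typeof msg !== "object") {
    return { action: "ignore", reason: "not an object" };
  }
  if (msg.type === "Begin") {
    return { action: "session", id: msg.id };
  }
  if (msg.type === "Turn") {
    return msg.end_of_turn
      ? { action: "respond", transcript: msg.transcript }
      : { action: "partial", transcript: msg.transcript };
  }
  return { action: "ignore", reason: `unhandled type: ${msg.type}` };
}

// Usage: a final turn is routed to the "respond" branch
const frame = JSON.stringify({ type: "Turn", transcript: "hello", end_of_turn: true });
console.log(classifyMessage(frame));
```

The socket callback can then switch on the returned action, keeping the LLM and TTS calls in one place.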

Sending Audio

Browser (getUserMedia + ScriptProcessor)

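// Assumes the AudioContext runs at 16 kHz to match sample_rate=16000 in the URL
processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  // Convert Float32 samples in [-1, 1] to 16-bit signed PCM
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-32768, Math.min(32767, Math.round(float32[i] * 32767)));
  }
  // Skip frames until the socket is open
  if (ws.readyState === WebSocket.OPEN) ws.send(int16.buffer);
};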
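
The URL declares sample_rate=16000, but a browser AudioContext often runs at the device rate (commonly 44100 or 48000 Hz) unless it is created with an explicit sampleRate. If the context is not at 16 kHz, the samples need resampling before the 16-bit conversion. The following is a rough sketch using linear interpolation with no low-pass filter; resample and floatToInt16 are hypothetical helpers, and the source does not say whether the repo resamples this way.

```javascript
// Hypothetical helper: resample Float32 samples by linear interpolation.
// No anti-aliasing filter is applied, so treat this as an illustration only.
function resample(input, inputRate, outputRate = 16000) {
  if (inputRate === outputRate) return input;
  const ratio = inputRate / outputRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two nearest input samples
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}

// Same clamp-and-scale conversion as the handler above
function floatToInt16(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-32768, Math.min(32767, Math.round(float32[i] * 32767)));
  }
  return int16;
}

// Usage: 100 ms of audio captured at 48 kHz, converted for a 16 kHz stream
const pcm = floatToInt16(resample(new Float32Array(4800), 48000));
console.log(pcm.length);
```

Creating the context with new AudioContext({ sampleRate: 16000 }) avoids this step where the browser supports it.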

Terminal (mic package)

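// Assumes micInstance was created by the mic package at 16 kHz, mono, 16-bit
const micStream = micInstance.getAudioStream();
micStream.on("data", (chunk) => {
  // Skip chunks until the AssemblyAI socket is open
  if (aaiWs.readyState === aaiWs.OPEN) aaiWs.send(chunk); // raw PCM s16le bytes
});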

Turn Detection Tuning

Parameter | Default | Lower Value | Higher Value
end_of_turn_confidence_threshold | 0.4 | Faster response | Fewer false triggers
min_end_of_turn_silence_when_confident | 300 ms | Snappier | More natural pauses
max_turn_silence | 1500 ms | Faster cutoff | More thinking time
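
One way to read the three parameters together: the turn ends early when the model is confident and a short silence has elapsed, while max_turn_silence acts as a fallback cutoff. The sketch below models that reading as a pure function. It illustrates how the settings interact and is not AssemblyAI's server-side implementation; shouldEndTurn and its input fields are hypothetical.

```javascript
// Defaults taken from the table above (silence values in milliseconds)
const DEFAULTS = {
  endOfTurnConfidenceThreshold: 0.4,
  minEndOfTurnSilenceWhenConfident: 300,
  maxTurnSilence: 1500,
};

// Hypothetical model of the end-of-turn decision.
// confidence: end-of-turn confidence in [0, 1]; silenceMs: trailing silence so far.
function shouldEndTurn({ confidence, silenceMs }, config = DEFAULTS) {
  const confident = confidence >= config.endOfTurnConfidenceThreshold;
  // Confident path: a short pause is enough
  if (confident && silenceMs >= config.minEndOfTurnSilenceWhenConfident) {
    return true;
  }
  // Fallback path: end the turn after a long silence regardless of confidence
  return silenceMs >= config.maxTurnSilence;
}

// Usage: a confident prediction with a 350 ms pause, then a mid-sentence pause
console.log(shouldEndTurn({ confidence: 0.8, silenceMs: 350 }));
console.log(shouldEndTurn({ confidence: 0.2, silenceMs: 350 }));
```

Under this reading, raising the confidence threshold pushes more turns onto the fallback path, which is why it trades response speed for fewer false triggers.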

Mid-Session Keyterm Prompting

Inject domain-specific vocabulary into a live session without reconnecting:

ws.send(JSON.stringify({
  type: "UpdateConfiguration",
  keyterms: ["AssemblyAI", "Universal-3", "your-product-name"],
}));
