Node.js Voice Agent with AssemblyAI Universal-3 Pro Streaming
Build a real-time voice agent in Node.js using the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro) for speech-to-text — no Python required, no heavy framework dependencies.
Two modes in one repo:
- Terminal agent (`src/agent.js`): mic input via `mic`, plays TTS audio in your terminal
- Browser server (`src/server.js`): Node.js WebSocket server with a browser UI using `getUserMedia`
Why AssemblyAI Universal-3 Pro for Node.js?
| Metric | AssemblyAI Universal-3 Pro | Deepgram Nova-3 |
|---|---|---|
| P50 latency | 307 ms | 516 ms |
| Word Error Rate | 8.14% | 9.87% |
| Neural turn detection | ✅ | ❌ (VAD only) |
| Mid-session prompting | ✅ | ❌ |
| Real-time diarization | ✅ | ❌ |
| Anti-hallucination | ✅ | ❌ |
Neural turn detection eliminates the need for a separate VAD library. The model uses both acoustic and linguistic signals to detect when a speaker has finished — not just when they've gone silent.
Quick Start
git clone https://github.com/kelseyefoster/voice-agent-nodejs-assemblyai
cd voice-agent-nodejs-assemblyai
npm install
cp .env.example .env
# Edit .env with your API keys
Terminal Agent
npm start
# Speak into your mic — Ctrl+C to quit
Browser Server
npm run server
# Open http://localhost:3000
AssemblyAI WebSocket URL
const AAI_WS_URL =
  `wss://streaming.assemblyai.com/v3/ws` +
  `?speech_model=u3-rt-pro` +
  `&encoding=pcm_s16le` +
  `&sample_rate=16000` +
  `&end_of_turn_confidence_threshold=0.4` +
  `&min_end_of_turn_silence_when_confident=300` +
  `&max_turn_silence=1500` +
  `&token=${ASSEMBLYAI_API_KEY}`;
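The hand-concatenated query string works as long as no value needs URL-encoding. As a minimal sketch, the same URL could be assembled with `URLSearchParams`, which encodes values and makes per-session overrides easier. The helper name `buildStreamingUrl` and its `overrides` argument are hypothetical and not part of the repo; the parameter names and values are copied from the snippet above.

```javascript
// Hypothetical helper (not from the repo): builds the streaming URL from the
// same parameters shown above, letting callers override individual values.
function buildStreamingUrl(token, overrides = {}) {
  const params = new URLSearchParams({
    speech_model: "u3-rt-pro",
    encoding: "pcm_s16le",
    sample_rate: "16000",
    end_of_turn_confidence_threshold: "0.4",
    min_end_of_turn_silence_when_confident: "300",
    max_turn_silence: "1500",
  });
  // Apply overrides, coercing numbers to strings.
  for (const [key, value] of Object.entries(overrides)) {
    params.set(key, String(value));
  }
  // The token goes last, mirroring the original snippet; it is URL-encoded.
  params.set("token", token);
  return `wss://streaming.assemblyai.com/v3/ws?${params.toString()}`;
}
```

For example, `buildStreamingUrl(process.env.ASSEMBLYAI_API_KEY, { max_turn_silence: 2000 })` would give the speaker a longer pause before a forced cutoff.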
Message Handling
ws.on("message", async (data) => {
  const msg = JSON.parse(data.toString());
  // Session start: the server reports the session id
  if (msg.type === "Begin") {
    console.log(`Session: ${msg.id}`);
  }
  // Partial transcript: overwrite the current terminal line
  if (msg.type === "Turn" && !msg.end_of_turn) {
    process.stdout.write(`\r${msg.transcript}`);
  }
  // End of turn: generate a reply and speak it
  if (msg.type === "Turn" && msg.end_of_turn) {
    const reply = await generateResponse(msg.transcript);
    await speak(reply);
  }
});
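The handler above assumes every frame is valid JSON with a known `type`. One way to make the branching easier to test is to pull the classification into a pure function, sketched below. `classifyMessage` and the `kind` labels it returns are hypothetical names introduced here; only the `Begin` and `Turn` types and the `id`, `transcript`, and `end_of_turn` fields come from the snippet above.

```javascript
// Hypothetical helper (not from the repo): turns one raw WebSocket frame into
// a small tagged object, without touching the network or the terminal.
function classifyMessage(raw) {
  let msg;
  try {
    msg = JSON.parse(raw.toString());
  } catch {
    return { kind: "invalid" }; // not JSON at all
  }
  if (msg === null || typeof msg !== "object") {
    return { kind: "invalid" }; // JSON, but not an object
  }
  if (msg.type === "Begin") {
    return { kind: "begin", id: msg.id };
  }
  if (msg.type === "Turn") {
    return {
      kind: msg.end_of_turn ? "final" : "partial",
      transcript: msg.transcript ?? "",
    };
  }
  // Any other message type is passed through for the caller to ignore or log.
  return { kind: "other", type: msg.type };
}
```

Inside `ws.on("message", ...)`, the handler could then switch on `classifyMessage(data).kind` and call `generateResponse` only for `"final"`.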
Sending Audio
Browser (getUserMedia + ScriptProcessor)
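processor.onaudioprocess = (e) => {
  // Mono Float32 samples in the range [-1, 1]
  const float32 = e.inputBuffer.getChannelData(0);
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Scale to 16-bit and clamp to avoid overflow on out-of-range samples
    int16[i] = Math.max(-32768, Math.min(32767, Math.round(float32[i] * 32767)));
  }
  // Send the raw PCM bytes as a binary WebSocket frame
  ws.send(int16.buffer);
};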
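The connection URL declares `sample_rate=16000`, but browsers commonly run an `AudioContext` at 44.1 kHz or 48 kHz unless it is created with `{ sampleRate: 16000 }`. If the context in your setup runs at a higher rate, the samples need to be downsampled before the Int16 conversion. The sketch below uses simple block averaging; `downsample` is a hypothetical helper, not part of the repo, and a production pipeline might prefer a proper low-pass filter.

```javascript
// Hypothetical helper (not from the repo): reduces the sample rate by
// averaging each block of input samples that maps to one output sample.
function downsample(float32, inputRate, targetRate = 16000) {
  if (inputRate === targetRate) return float32;
  if (inputRate < targetRate) {
    throw new RangeError("upsampling is not supported by this sketch");
  }
  const ratio = inputRate / targetRate;
  const out = new Float32Array(Math.floor(float32.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // Input range [start, end) that corresponds to output sample i.
    const start = Math.floor(i * ratio);
    const end = Math.min(float32.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += float32[j];
    out[i] = sum / (end - start);
  }
  return out;
}
```

In the `onaudioprocess` callback this would be applied first, for example `const float32 = downsample(e.inputBuffer.getChannelData(0), audioContext.sampleRate);`, where `audioContext` is whatever context created the processor.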
Terminal (mic package)
const micStream = micInstance.getAudioStream();
micStream.on("data", (chunk) => {
  aaiWs.send(chunk); // raw PCM s16le bytes
});
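The size of each `data` chunk depends on the underlying recorder, so frames may arrive larger or smaller than the streaming endpoint prefers. As a sketch, and assuming you want fixed-duration frames (50 ms is used here as an illustrative value; check the streaming documentation for the accepted range), incoming chunks can be re-sliced before sending. `createFramer` is a hypothetical helper, not part of the repo.

```javascript
// Hypothetical helper (not from the repo): buffers arbitrary PCM chunks and
// returns complete fixed-size frames, keeping any remainder for the next call.
function createFramer(frameMs = 50, sampleRate = 16000) {
  // 16-bit mono PCM: 2 bytes per sample.
  const frameBytes = (sampleRate * 2 * frameMs) / 1000;
  let pending = Buffer.alloc(0);
  return function push(chunk) {
    pending = Buffer.concat([pending, chunk]);
    const frames = [];
    while (pending.length >= frameBytes) {
      frames.push(pending.subarray(0, frameBytes));
      pending = pending.subarray(frameBytes);
    }
    return frames;
  };
}
```

With `const push = createFramer();`, the mic handler becomes `for (const frame of push(chunk)) aaiWs.send(frame);`. At 16 kHz, each 50 ms frame is 1600 bytes.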
Turn Detection Tuning
| Parameter | Default | Lower Value | Higher Value |
|---|---|---|---|
| `end_of_turn_confidence_threshold` | 0.4 | Faster response | Fewer false triggers |
| `min_end_of_turn_silence_when_confident` | 300 ms | Snappier | More natural pauses |
| `max_turn_silence` | 1500 ms | Faster cutoff | More thinking time |
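To reason about how the three parameters interact, it can help to write the decision down as code. The function below is only a simplified mental model, assuming a turn ends either when the model is confident and a short silence has passed, or when the silence cap is reached regardless of confidence. It is not the service's actual implementation, and `shouldEndTurn` and its inputs are hypothetical.

```javascript
// Values from the table above.
const TURN_DEFAULTS = {
  end_of_turn_confidence_threshold: 0.4,
  min_end_of_turn_silence_when_confident: 300, // ms
  max_turn_silence: 1500, // ms
};

// Hypothetical model (an assumption, not the real server logic): decides
// whether a turn would end given a confidence score and the silence so far.
function shouldEndTurn({ confidence, silenceMs }, cfg = TURN_DEFAULTS) {
  // Hard cap: long enough silence ends the turn even at low confidence.
  if (silenceMs >= cfg.max_turn_silence) return true;
  // Otherwise both the confidence and the minimum silence must be met.
  return (
    confidence >= cfg.end_of_turn_confidence_threshold &&
    silenceMs >= cfg.min_end_of_turn_silence_when_confident
  );
}
```

Under this model, raising `end_of_turn_confidence_threshold` pushes more turns toward the `max_turn_silence` fallback, which is why a higher threshold trades responsiveness for fewer false triggers.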
Mid-Session Keyterm Prompting
Inject domain-specific vocabulary without restarting:
ws.send(JSON.stringify({
  type: "UpdateConfiguration",
  keyterms: ["AssemblyAI", "Universal-3", "your-product-name"],
}));
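If keyterms come from user input or a database, they may contain duplicates or stray whitespace. A small builder, sketched below, can tidy the list before it is sent. `buildKeytermUpdate` is a hypothetical helper, not part of the repo; the message shape is copied from the snippet above.

```javascript
// Hypothetical helper (not from the repo): trims, drops empty entries,
// de-duplicates, and serializes the update message shown above.
function buildKeytermUpdate(terms) {
  const cleaned = [
    ...new Set(terms.map((term) => term.trim()).filter(Boolean)),
  ];
  return JSON.stringify({ type: "UpdateConfiguration", keyterms: cleaned });
}
```

The result is passed straight to the socket: `ws.send(buildKeytermUpdate(productNames));`, where `productNames` stands in for your own list.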