Why Node.js + the Voice Agent API
Most JavaScript voice agent tutorials stitch together separate vendors for speech-to-text, an LLM, and text-to-speech, and every layer adds latency, cost, and a place to fail.
The Voice Agent API replaces that pipeline with a single WebSocket:
| | Multi-vendor JS pipeline | Voice Agent API in Node.js |
|---|---|---|
| npm packages for AI | 3+ (one per vendor) | 1 (ws) |
| API keys to manage | 3+ | 1 |
| Round trips per turn | 3 (mic→STT→LLM→TTS→speaker) | 1 (single WebSocket) |
| Turn detection | Wire up VAD or LLM endpointing | Built in |
| Barge-in | Implement yourself | Built in |
| Tool calling | Bridge LLM tool defs to your runtime | Built in |
The endpoint is one URL: wss://agents.assemblyai.com/v1/ws. Send PCM16, get PCM16. That's the whole protocol surface.
Architecture
Microphone (mic + sox/arecord, 24 kHz PCM16)
        │
        │  base64-encoded chunks
        │  { type: "input.audio", audio: "..." }
        ▼
┌─────────────────────────────────────────────────┐
│  wss://agents.assemblyai.com/v1/ws              │
│                                                 │
│  AssemblyAI Voice Agent API                     │
│  ├── Universal-3 Pro Streaming (speech → text)  │
│  ├── LLM (text → reply)                         │
│  └── TTS (reply → audio)                        │
│                                                 │
│  + neural turn detection                        │
│  + barge-in                                     │
│  + tool calling                                 │
└─────────────────────────────────────────────────┘
        │
        │  base64-encoded chunks
        │  { type: "reply.audio", data: "..." }
        ▼
Speakers (speaker, 24 kHz PCM16)
Prerequisites
- Node.js 20+
- A microphone — headphones strongly recommended (terminal apps don't get OS-level echo cancellation, so mic-to-speaker feedback will trigger barge-in)
- An AssemblyAI API key — free tier available
The mic and speaker packages call into native audio backends:
- macOS: brew install sox
- Debian/Ubuntu: sudo apt-get install sox libsox-fmt-all libasound2-dev
- Windows: install SoX and the Visual Studio C++ build tools
Quick start
1. Clone and install
git clone https://github.com/kelsey-aai/voice-agent-nodejs
cd voice-agent-nodejs
npm install
2. Configure your API key
cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
3. Run the agent
npm start
Plug in your headphones, wait for Connected (session ...), and start talking.
Connected (session sess_abc123).
Speak now. Press Ctrl+C to quit.
You: What can you do?
Agent: I can chat with you, answer questions, or help work through ideas — whatever you'd like.
That's the whole thing. Under 100 lines of Node.js.
How it works
1. Open the WebSocket and configure the session
import "dotenv/config";
import WebSocket from "ws";
const ws = new WebSocket("wss://agents.assemblyai.com/v1/ws", {
headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
});
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
system_prompt: "You are a friendly voice assistant. Keep replies short.",
greeting: "Hi there — what can I help you with?",
output: { voice: "ivy" },
},
}));
});
session.update is always your first message. It sets the agent's personality (system_prompt), what it says when the call connects (greeting), and the voice it speaks in (voice). All fields are optional, and you can re-send session.update at any time during the conversation to change them.
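Because session.update can be re-sent at any point, changing the agent mid-call is just another message. A minimal sketch, assuming fields you omit keep their current values:

// Later in the conversation: switch the voice without reconnecting.
// Assumes omitted session fields keep their current values.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    output: { voice: "james" },
  },
}));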
2. Stream microphone audio
import mic from "mic";
const micInstance = mic({
rate: "24000", channels: "1", encoding: "signed-integer",
bitwidth: "16", endian: "little",
});
const micStream = micInstance.getAudioStream();
micStream.on("data", (chunk) => {
if (sessionReady && ws.readyState === WebSocket.OPEN) {
ws.send(JSON.stringify({
type: "input.audio",
audio: chunk.toString("base64"),
}));
}
});
mic shells out to sox (or arecord on Linux) and emits raw PCM16 buffers as data events. We base64-encode each buffer and ship it as an input.audio event. The sessionReady gate keeps us from sending audio before the server has acknowledged the configuration with session.ready.
3. Play the agent's response
import Speaker from "speaker";
let speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
ws.on("message", (raw) => {
const event = JSON.parse(raw.toString());
if (event.type === "reply.audio") {
speaker.write(Buffer.from(event.data, "base64"));
} else if (event.type === "reply.done" && event.status === "interrupted") {
speaker.end();
speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
}
});
reply.audio events stream as the agent generates each phrase, so playback starts within hundreds of milliseconds of the end of the user's turn. Each chunk is base64-decoded and written straight to the speaker stream, which feeds the OS audio buffer.
When the user interrupts the agent mid-reply (barge-in), the server emits reply.done with status: "interrupted". Node's speaker package doesn't expose a clean flush, so the simplest reliable pattern is to end the current speaker stream and create a fresh one for the next reply.
What you get for free
These are all handled by the API — you write zero code for them:
- Neural turn detection. The server combines acoustic and linguistic signals to decide when the user has finished speaking, so it knows the difference between a thinking pause and an actual end-of-turn.
- Barge-in. When the user speaks over the agent, the server stops generating, sends reply.done with status: "interrupted", and trims the agent transcript to what was actually spoken.
- Real-time partial transcripts. transcript.user.delta events stream as the user talks, so you can show what they're saying live.
- Final transcripts both ways. transcript.user and transcript.agent events arrive after each turn — perfect for logging, chat history, or moderation. (See the sketch after this list.)
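Wiring those transcript events into the client is a few extra branches in the existing message handler. A minimal sketch, assuming each transcript event carries its text in a text field (check the event reference for the exact payload shapes):

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Assumed payload shape: { type, text }; verify against the event reference.
  if (event.type === "transcript.user.delta") {
    process.stdout.write(event.text); // live partial transcript
  } else if (event.type === "transcript.user") {
    console.log(`\nYou: ${event.text}`); // final user turn
  } else if (event.type === "transcript.agent") {
    console.log(`Agent: ${event.text}`); // final agent turn
  }
});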
Customizing the agent
Pick a different voice
18 English voices and 16 multilingual voices are available. Drop any voice ID into session.output.voice:
output: { voice: "james" } // conversational US male
output: { voice: "sophie" } // clear UK female
output: { voice: "diego" } // Latin American Spanish
output: { voice: "arjun" } // Hindi/Hinglish
See the Voices catalog for samples. Multilingual voices code-switch with English automatically.
Tune turn detection
Defaults work well for most apps. Override anything you want under session.input.turn_detection:
input: {
  turn_detection: {
    vad_threshold: 0.5,       // 0.0–1.0; lower = more sensitive
    min_silence: 600,         // ms; min silence before confident end-of-turn
    max_silence: 1500,        // ms; max silence before forcing end-of-turn
    interrupt_response: true, // false to disable barge-in
  }
}
For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence.
Boost domain-specific terms
If your conversation involves rare words — product names, medical terms, customer names — add them to session.input.keyterms to bias speech recognition toward them:
input: { keyterms: ["Ozempic", "Salesforce", "AssemblyAI"] }
Troubleshooting
The agent keeps interrupting itself. Your microphone is picking up the agent's TTS output. Use headphones, or move to a browser-based client, which gets echo cancellation for free from getUserMedia.
speaker install fails on Linux. Install ALSA dev headers: sudo apt-get install libasound2-dev.
speaker install fails on macOS with Node 22+. Some node-gyp-based packages lag the latest Node major release. Use Node 20 LTS, or replace speaker with a wav decoder + play-sound combination if you don't need streaming playback.
mic produces silence. Check that sox (macOS) or arecord (Linux) is on your PATH and that your terminal has microphone permission (macOS: System Settings → Privacy & Security → Microphone).
UNAUTHORIZED close on connect. Your API key is missing, expired, or wrong. Grab a fresh one from the AssemblyAI dashboard and re-check .env.
The full guide is in the Voice Agent API troubleshooting docs.
Frequently asked questions
What is AssemblyAI's Voice Agent API?
A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.
How do I build a voice agent in Node.js?
The simplest pattern is ws + mic + speaker: open a WebSocket to the Voice Agent API, send a session.update with your system prompt and voice, then pipe microphone audio in and speaker audio out. Under 100 lines. No LLM or TTS SDK needed.
What audio format does the Voice Agent API expect from a Node.js client?
By default, audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. Configure the mic package with rate: "24000", channels: "1", encoding: "signed-integer", bitwidth: "16". For telephony integrations you'd switch to audio/pcmu (G.711 μ-law, 8 kHz).
How do I authenticate to the Voice Agent API from Node.js?
Pass your AssemblyAI API key as a Bearer token in the Authorization header during the WebSocket upgrade. For browser apps where you can't expose your API key, mint a short-lived temporary token on your Node server and pass it as a ?token= query parameter from the browser instead.
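On the browser side, that looks roughly like the sketch below; the /api/voice-token route is a hypothetical endpoint on your own Node server that returns the minted temporary token.

// Browser WebSockets can't set an Authorization header, so pass the
// short-lived token in the query string instead of your real API key.
// /api/voice-token is a hypothetical route on your own server.
const token = await fetch("/api/voice-token").then((r) => r.text());
const ws = new WebSocket(`wss://agents.assemblyai.com/v1/ws?token=${token}`);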
Why is the agent interrupting itself in my Node.js terminal app?
The microphone is picking up the agent's TTS output through your speakers, the server interprets that as the user speaking, and barge-in fires. The fix is either headphones or running the client in a browser, where getUserMedia({ audio: { echoCancellation: true } }) gives you OS-level acoustic echo cancellation for free.
Can the Voice Agent API call functions from Node.js?
Yes. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event. Run your function in Node, then send back a tool.result event; the agent folds the result into its next spoken reply.
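A rough sketch of that loop is below. The payload field names (id, name, arguments) and the getWeather helper are illustrative assumptions, not confirmed API shapes; check the tool-calling docs for the exact event format.

// Illustrative only: field names (id, name, arguments) are assumptions,
// and getWeather is a hypothetical local function.
ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "tool.call" && event.name === "get_weather") {
    const result = await getWeather(event.arguments); // run the tool locally
    ws.send(JSON.stringify({
      type: "tool.result",
      id: event.id, // correlate the result with the originating call
      result: JSON.stringify(result),
    }));
  }
});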
How much does the AssemblyAI Voice Agent API cost?
AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the AssemblyAI pricing page.