DEV Community

Mart Schweiger

Posted on • Originally published at assemblyai.com

Node.js voice agent with AssemblyAI's Voice Agent API

Why Node.js + the Voice Agent API

Most JavaScript voice agent tutorials stitch together separate speech-to-text, LLM, and text-to-speech services — and every layer adds latency, cost, and a place to fail.

The Voice Agent API replaces that pipeline with a single WebSocket:

|                      | Multi-vendor JS pipeline               | Voice Agent API in Node.js |
| -------------------- | -------------------------------------- | -------------------------- |
| npm packages for AI  | 3+ (one per vendor)                    | 0 (just `ws`)              |
| API keys to manage   | 3+                                     | 1                          |
| Round trips per turn | 3 (mic → STT → LLM → TTS → speaker)    | 1                          |
| Turn detection       | Wire up VAD or LLM endpointing         | Built in                   |
| Barge-in             | Implement yourself                     | Built in                   |
| Tool calling         | Bridge LLM tool defs to your runtime   | Built in                   |

The endpoint is one URL: wss://agents.assemblyai.com/v1/ws. Send PCM16, get PCM16. That's the whole protocol surface.

Architecture

```
Microphone (mic + sox/arecord, 24 kHz PCM16)
    │
    │  base64-encoded chunks
    │  { type: "input.audio", audio: "..." }
    ▼
┌─────────────────────────────────────────────────┐
│  wss://agents.assemblyai.com/v1/ws              │
│                                                 │
│  AssemblyAI Voice Agent API                     │
│  ├── Universal-3 Pro Streaming  (speech → text) │
│  ├── LLM                        (text → reply)  │
│  └── TTS                        (reply → audio) │
│                                                 │
│  + neural turn detection                        │
│  + barge-in                                     │
│  + tool calling                                 │
└─────────────────────────────────────────────────┘
    │
    │  base64-encoded chunks
    │  { type: "reply.audio", data: "..." }
    ▼
Speakers (speaker, 24 kHz PCM16)
```

Prerequisites

  • Node.js 20+
  • A microphone — headphones strongly recommended (terminal apps don't get OS-level echo cancellation, so mic-to-speaker feedback will trigger barge-in)
  • An AssemblyAI API key — free tier available

The mic and speaker packages call into native audio backends:

  • macOS: brew install sox
  • Debian/Ubuntu: sudo apt-get install sox libsox-fmt-all libasound2-dev
  • Windows: install SoX and the Visual Studio C++ build tools

Quick start

1. Clone and install

```shell
git clone https://github.com/kelsey-aai/voice-agent-nodejs
cd voice-agent-nodejs

npm install
```

2. Configure your API key

```shell
cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
```

3. Run the agent

```shell
npm start
```

Plug in your headphones, wait for Connected (session ...), and start talking.

```
Connected (session sess_abc123).
Speak now. Press Ctrl+C to quit.

You:   What can you do?
Agent: I can chat with you, answer questions, or help work through ideas — whatever you'd like.
```

That's the whole thing. Under 100 lines of Node.js.

How it works

1. Open the WebSocket and configure the session

```javascript
import "dotenv/config";
import WebSocket from "ws";

const ws = new WebSocket("wss://agents.assemblyai.com/v1/ws", {
  headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
});

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: "You are a friendly voice assistant. Keep replies short.",
      greeting: "Hi there — what can I help you with?",
      output: { voice: "ivy" },
    },
  }));
});
```

session.update is always your first message. It sets the agent's personality (system_prompt), what it says when the call connects (greeting), and the voice it speaks in (voice). All fields are optional, and you can re-send session.update at any time during the conversation to change them.
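Because session.update can be re-sent mid-conversation, it helps to build the event in one place. A minimal sketch (the event shape comes from the snippet above; `buildSessionUpdate` is my own helper name):

```javascript
// Build a session.update event from a partial session config.
// All fields are optional, so callers pass only what they want to change.
function buildSessionUpdate(session) {
  return JSON.stringify({ type: "session.update", session });
}

// Mid-conversation, re-send the same event type with just the fields
// you want to change — here, swapping the voice.
const voiceSwap = buildSessionUpdate({ output: { voice: "james" } });
// ws.send(voiceSwap);
```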

2. Stream microphone audio

```javascript
import mic from "mic";

// Flipped to true once the server acknowledges our config with session.ready.
let sessionReady = false;
ws.on("message", (raw) => {
  if (JSON.parse(raw.toString()).type === "session.ready") sessionReady = true;
});

const micInstance = mic({
  rate: "24000", channels: "1", encoding: "signed-integer",
  bitwidth: "16", endian: "little",
});
const micStream = micInstance.getAudioStream();

micStream.on("data", (chunk) => {
  if (sessionReady && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({
      type: "input.audio",
      audio: chunk.toString("base64"),
    }));
  }
});
```

mic shells out to sox (or arecord on Linux) and emits raw PCM16 buffers as data events. We base64-encode each buffer and ship it as an input.audio event. The sessionReady gate keeps us from sending audio before the server has acknowledged the configuration with session.ready.
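The wire format is easy to verify in isolation. A small sketch of the encode step (event shape from the snippet above; `toInputAudioEvent` is my own helper name):

```javascript
// Wrap a raw PCM16 buffer as an input.audio event.
function toInputAudioEvent(chunk) {
  return JSON.stringify({ type: "input.audio", audio: chunk.toString("base64") });
}

// 20 ms of 24 kHz mono PCM16 is 24000 * 0.02 samples * 2 bytes = 960 bytes.
const frame = Buffer.alloc(960);
const event = toInputAudioEvent(frame);
```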

3. Play the agent's response

```javascript
import Speaker from "speaker";

let speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === "reply.audio") {
    speaker.write(Buffer.from(event.data, "base64"));
  } else if (event.type === "reply.done" && event.status === "interrupted") {
    speaker.end();
    speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });
  }
});
```

reply.audio events stream as the agent generates each phrase, so playback starts within hundreds of milliseconds of the end of the user's turn. Each chunk is base64-decoded and written straight to the speaker stream, which feeds the OS audio buffer.

When the user interrupts the agent mid-reply (barge-in), the server emits reply.done with status: "interrupted". Node's speaker package doesn't expose a clean flush, so the simplest reliable pattern is to end the current speaker stream and create a fresh one for the next reply.
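The end-and-recreate pattern can be wrapped in a small factory so the rest of the app never touches the raw stream. A sketch — here a `PassThrough` stands in for `new Speaker(...)` so the pattern runs without native audio, and `createResettableSink` is my own name:

```javascript
import { PassThrough } from "node:stream";

// Encapsulate the "end the old sink, start a fresh one" barge-in pattern.
// makeSink is any factory that returns a writable stream.
function createResettableSink(makeSink) {
  let sink = makeSink();
  return {
    write: (buf) => sink.write(buf),
    reset: () => {        // call on reply.done with status "interrupted"
      sink.end();
      sink = makeSink();
    },
  };
}

const out = createResettableSink(() => new PassThrough());
out.write(Buffer.from("audio"));
out.reset();                       // barge-in: drop the old stream
out.write(Buffer.from("more"));    // the next reply goes to a fresh stream
```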

What you get for free

These are all handled by the API — you write zero code for them:

  • Neural turn detection. The server combines acoustic and linguistic signals to decide when the user has finished speaking, so it knows the difference between a thinking pause and an actual end-of-turn.
  • Barge-in. When the user speaks over the agent, the server stops generating, sends reply.done with status: "interrupted", and trims the agent transcript to what was actually spoken.
  • Real-time partial transcripts. transcript.user.delta events stream as the user talks, so you can show what they're saying live.
  • Final transcripts both ways. transcript.user and transcript.agent events arrive after each turn — perfect for logging, chat history, or moderation.
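Handling the transcript events is a plain dispatch on `event.type`. A minimal sketch — the event names come from the list above, but the payload field (`text`) and the `makeTranscriptLogger` helper are my own assumptions, so check the event reference for exact shapes:

```javascript
// Collect final transcripts into a chat history and print live partials.
function makeTranscriptLogger() {
  const history = [];
  return {
    history,
    handle(event) {
      if (event.type === "transcript.user") {
        history.push({ role: "user", text: event.text });
      } else if (event.type === "transcript.agent") {
        history.push({ role: "agent", text: event.text });
      } else if (event.type === "transcript.user.delta") {
        process.stdout.write(event.text); // live partials as the user talks
      }
    },
  };
}

// Wire it into the WebSocket: logger.handle(event) inside ws.on("message").
const logger = makeTranscriptLogger();
```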

Customizing the agent

Pick a different voice

18 English voices and 16 multilingual voices are available. Drop any voice ID into session.output.voice:

```javascript
output: { voice: "james" }    // conversational US male
output: { voice: "sophie" }   // clear UK female
output: { voice: "diego" }    // Latin American Spanish
output: { voice: "arjun" }    // Hindi/Hinglish
```

See the Voices catalog for samples. Multilingual voices code-switch with English automatically.

Tune turn detection

Defaults work well for most apps. Override anything you want under session.input.turn_detection:

```javascript
input: {
  turn_detection: {
    vad_threshold: 0.5,        // 0.0–1.0; lower = more sensitive
    min_silence: 600,          // ms; min silence before confident end-of-turn
    max_silence: 1500,         // ms; max silence before forcing end-of-turn
    interrupt_response: true,  // false to disable barge-in
  }
}
```

For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare), raise max_silence.
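That advice can be captured as named presets. The values below are illustrative starting points derived from the defaults above, not recommendations from the docs:

```javascript
// Illustrative turn-detection presets; tune against real traffic.
const TURN_DETECTION_PRESETS = {
  default:    { vad_threshold: 0.5, min_silence: 600, max_silence: 1500, interrupt_response: true },
  noisy:      { vad_threshold: 0.7, min_silence: 600, max_silence: 1500, interrupt_response: true }, // raised vad_threshold
  deliberate: { vad_threshold: 0.5, min_silence: 800, max_silence: 3000, interrupt_response: true }, // raised max_silence
};

// Wrap a preset in a session.update event, ready to send over the socket.
function turnDetectionUpdate(profile) {
  return {
    type: "session.update",
    session: { input: { turn_detection: TURN_DETECTION_PRESETS[profile] } },
  };
}
```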

Boost domain-specific terms

If your conversation involves rare words — product names, medical terms, customer names — add them to session.input.keyterms to bias speech recognition toward them:

```javascript
input: { keyterms: ["Ozempic", "Salesforce", "AssemblyAI"] }
```




Troubleshooting

The agent keeps interrupting itself. Your microphone is picking up the agent's TTS output. Use headphones, or move to a browser-based client, which gets free echo cancellation from getUserMedia.

speaker install fails on Linux. Install ALSA dev headers: sudo apt-get install libasound2-dev.

speaker install fails on macOS with Node 22+. Some node-gyp-based packages lag the latest Node major release. Use Node 20 LTS, or replace speaker with the bundled wav decoder + play-sound if you don't need streaming playback.

mic produces silence. Check that sox (macOS) or arecord (Linux) is on your PATH and that your terminal has microphone permission (macOS: System Settings → Privacy & Security → Microphone).

UNAUTHORIZED close on connect. Your API key is missing, expired, or wrong. Grab a fresh one from the AssemblyAI dashboard and re-check .env.

The full guide is in the Voice Agent API troubleshooting docs.

Frequently asked questions

What is AssemblyAI's Voice Agent API?

A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.

How do I build a voice agent in Node.js?

The simplest pattern is ws + mic + speaker: open a WebSocket to the Voice Agent API, send a session.update with your system prompt and voice, then pipe microphone audio in and speaker audio out. Under 100 lines. No LLM or TTS SDK needed.

What audio format does the Voice Agent API expect from a Node.js client?

By default, audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. Configure the mic package with rate: "24000", channels: "1", encoding: "signed-integer", bitwidth: "16". For telephony integrations you'd switch to audio/pcmu (G.711 μ-law, 8 kHz).
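A synthetic test tone in exactly that format is handy for exercising the pipeline without a microphone. A sketch (the format constants come from the answer above; `makeTestTone` is my own helper):

```javascript
// Generate `ms` milliseconds of a sine tone as 16-bit signed little-endian
// PCM at 24,000 Hz, mono — the API's default input format.
function makeTestTone(ms, sampleRate = 24000, freq = 440) {
  const samples = Math.floor((sampleRate * ms) / 1000);
  const buf = Buffer.alloc(samples * 2); // 2 bytes per 16-bit sample
  for (let i = 0; i < samples; i++) {
    const value = Math.round(Math.sin((2 * Math.PI * freq * i) / sampleRate) * 0x3fff);
    buf.writeInt16LE(value, i * 2);
  }
  return buf;
}

// 100 ms at 24 kHz mono → 2400 samples → 4800 bytes; base64-encode and send
// as input.audio just like a mic chunk.
const tone = makeTestTone(100);
```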

How do I authenticate to the Voice Agent API from Node.js?

Pass your AssemblyAI API key as a Bearer token in the Authorization header during the WebSocket upgrade. For browser apps where you can't expose your API key, mint a short-lived temporary token on your Node server and pass it as a ?token= query parameter from the browser instead.
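The browser-side URL is just the endpoint plus the token in the query string. A sketch (the `?token=` parameter comes from the answer above; minting the token server-side is not shown here):

```javascript
// Build the browser-side connection URL from a short-lived token,
// so the raw API key never leaves the server.
function agentUrlWithToken(token) {
  const url = new URL("wss://agents.assemblyai.com/v1/ws");
  url.searchParams.set("token", token);
  return url.toString();
}
```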

Why is the agent interrupting itself in my Node.js terminal app?

The microphone is picking up the agent's TTS output through your speakers, the server interprets that as the user speaking, and barge-in fires. The fix is either headphones or running the client in a browser, where getUserMedia({ audio: { echoCancellation: true } }) gives you OS-level acoustic echo cancellation for free.

Can the Voice Agent API call functions from Node.js?

Yes. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event. Run your function in Node, then send back a tool.result event with the output; the agent works the result into its next spoken reply.
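The Node side reduces to a registry lookup. A minimal sketch — the session.tools / tool.call / tool.result event names come from the answer above, but the payload fields (`name`, `arguments`, `id`) are assumptions, so check the event reference for exact shapes:

```javascript
// Local tool implementations, keyed by the name registered in session.tools.
const tools = {
  get_time: () => new Date().toISOString(),
};

// Given a tool.call event, run the matching function and build the
// tool.result event to send back over the WebSocket.
function handleToolCall(event) {
  const fn = tools[event.name];
  const output = fn ? fn(event.arguments) : `Unknown tool: ${event.name}`;
  return JSON.stringify({ type: "tool.result", id: event.id, result: String(output) });
}
```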

How much does the AssemblyAI Voice Agent API cost?

AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the AssemblyAI pricing page.
