This is the “real app” version of the 5-minute quickstart: a polished UI, AudioWorklet mic capture, temporary-token auth, and full barge-in handling. The AssemblyAI Voice Agent API does the speech recognition, the LLM, and the TTS server-side — you’re just shuttling audio bytes.
Why One WebSocket Beats a Multi-Service Pipeline
A traditional browser voice agent needs you to wire up streaming STT, an LLM, and a TTS provider, then orchestrate audio routing between them in the browser. Every hop adds latency, every provider needs a key, and every glue layer adds a failure mode.
| | Multi-service browser pipeline | Voice Agent API |
|---|---|---|
| Services to wire up | STT + LLM + TTS (3+ vendors) | One WebSocket endpoint |
| API keys to manage | 3+ | 1 |
| Round trips per turn | 3 (mic→STT→LLM→TTS→speaker) | 1 |
| Browser key exposure | Hard to avoid | Temporary tokens keep the key server-side |
| Turn detection | Configure separately | Built in |
| Barge-in / interruption | Implement yourself | Built in |
| Tool calling | Wire LLM tools manually | Built in |
The endpoint is one URL: wss://agents.assemblyai.com/v1/ws. Send 24 kHz PCM, get 24 kHz PCM back. That’s it.
Architecture
The system has two halves: a browser client and a lightweight Node server.
Data flow: The browser gets a token from the Node server, opens a WebSocket to the Voice Agent API with that token, streams mic PCM up, and receives reply PCM + transcript events back.
Prerequisites
- Node.js 18+ (uses native fetch and ES modules)
- A modern browser (Chrome 66+, Firefox 76+, Safari 14.1+ — anything with AudioWorklet)
- An AssemblyAI API key — free tier available
The browser needs a secure origin to access the mic. http://localhost counts as secure, so you can develop locally without TLS. If you deploy elsewhere, serve over HTTPS.
Quick Start
1. Clone and Install
git clone https://github.com/kelsey-aai/voice-assistant-app
cd voice-assistant-app
npm install
2. Configure Your API Key
cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
3. Run the App
npm start
Open http://localhost:3000, pick a voice, hit Connect, grant mic permission, and start talking. You’ll see your speech transcribed live as a partial bubble, then committed to a final bubble, with the agent’s reply streaming back as audio and text.
How It Works
There are four moving parts: the token mint, the AudioWorklet that captures mic audio, the WebSocket loop that drives the conversation, and the playback scheduler that turns reply.audio events back into sound.
1. The Server Mints a Temporary Token
Your AssemblyAI API key never leaves the server. The browser asks /api/voice-token for a single-use token, valid for 5 minutes:
// server.js
// API_KEY comes from process.env.ASSEMBLYAI_API_KEY and never leaves this process
app.get("/api/voice-token", async (_req, res) => {
  const url = new URL("https://agents.assemblyai.com/v1/token");
  url.searchParams.set("expires_in_seconds", "300"); // 5-minute lifetime
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  const { token } = await response.json();
  res.json({ token });
});
Tokens are single-use — you fetch a fresh one for every connection. The browser fetches the token from your server, then opens the WebSocket with it as a query parameter:

const { token } = await (await fetch("/api/voice-token")).json();
ws = new WebSocket(`wss://agents.assemblyai.com/v1/ws?token=${token}`);
2. The AudioWorklet Captures Mic Audio
AudioWorklet runs your PCM conversion off the main thread, which keeps it glitch-free. The worklet receives Float32 samples, clamps them to [-1, 1], and posts them back as a transferable Int16Array buffer:
// public/pcm-processor.js
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0]?.[0]; // mono: first channel of the first input
    if (!channel) return true;
    const pcm = new Int16Array(channel.length);
    for (let i = 0; i < channel.length; i++) {
      // Clamp to [-1, 1], then scale to the signed 16-bit range
      const s = Math.max(-1, Math.min(1, channel[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    // Transfer (not copy) the buffer back to the main thread
    this.port.postMessage(pcm.buffer, [pcm.buffer]);
    return true; // keep processing
  }
}

registerProcessor("pcm-processor", PCMProcessor); // name is referenced when constructing the node
The Voice Agent API expects 24 kHz PCM by default, so we force the entire AudioContext to 24 kHz on creation — no resampling needed:
audioCtx = new AudioContext({ sampleRate: SAMPLE_RATE }); // 24000
We also enable browser-level acoustic echo cancellation when grabbing the mic, so the agent doesn’t interrupt itself by hearing its own TTS through the speakers:
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true, // keeps the agent's own TTS out of the mic signal
    noiseSuppression: true,
    autoGainControl: true,
    channelCount: 1, // mono, matching the PCM we send
  },
});
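The wiring between that stream and the worklet isn’t shown in the snippets above, so here’s a minimal sketch. It assumes the "pcm-processor" registration name from public/pcm-processor.js and the connect-time audioCtx; adjust the module path to your setup:

// Load the processor module, then route mic → worklet
await audioCtx.audioWorklet.addModule("/pcm-processor.js");
const micSource = audioCtx.createMediaStreamSource(stream);
workletNode = new AudioWorkletNode(audioCtx, "pcm-processor");
micSource.connect(workletNode);
// Keep the node pulled by the graph; the processor writes no output, so this is silent
workletNode.connect(audioCtx.destination);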
3. The WebSocket Drives the Conversation
session.update is the first message — it configures system prompt, greeting, and voice. After that, you stream input.audio events whenever the worklet hands you a frame:
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: prompt.value,
      greeting: "Hi there — what can I help you with?",
      output: { voice: "ivy" },
    },
  }));
};

workletNode.port.onmessage = (e) => {
  if (ws.readyState !== WebSocket.OPEN) return; // drop frames after close
  ws.send(JSON.stringify({
    type: "input.audio",
    audio: arrayBufferToBase64(e.data),
  }));
};
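arrayBufferToBase64 is referenced above but not shown. A minimal version (fine for these small frames; chunk the loop before btoa if you ever encode large buffers):

function arrayBufferToBase64(buf) {
  const bytes = new Uint8Array(buf);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}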
The server replies with a stream of events. The ones we care about for UI:
| Event | What we do with it |
|---|---|
| session.ready | Save the session_id, start sending audio |
| transcript.user.delta | Render a partial bubble (italic, low-opacity) as the user speaks |
| transcript.user | Promote the partial to a final user message |
| reply.audio | Decode base64 PCM, schedule playback (see below) |
| transcript.agent | Render the agent’s final reply (marked interrupted if applicable) |
| reply.done (with status: "interrupted") | Flush queued audio — user barged in |
| session.error | Surface the error code in the status indicator |
4. Reply Playback Uses a Scheduling Cursor
reply.audio chunks arrive faster than they play — sometimes the whole reply is buffered before the first sample hits the speaker. Naively calling source.start(0) would overlap the chunks. Instead, we keep a playbackTime cursor and schedule each chunk back-to-back:
function playPCM(b64) {
  // base64 → bytes → Int16 → Float32 in [-1, 1)
  const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
  const int16 = new Int16Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 2);
  const float = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float[i] = int16[i] / 0x8000;

  const buffer = audioCtx.createBuffer(1, float.length, SAMPLE_RATE);
  buffer.getChannelData(0).set(float);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);

  // Schedule back-to-back: never in the past, never overlapping
  const now = audioCtx.currentTime;
  if (playbackTime < now) playbackTime = now;
  source.start(playbackTime);
  playbackTime += buffer.duration;
  scheduledSources.push(source); // tracked so barge-in can stop it
}
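Feeding it is one branch of the message handler. A sketch; the audio field name mirrors the input.audio payload above and is an assumption here:

ws.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "reply.audio") playPCM(msg.audio); // field name assumed
};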
5. Barge-In Stops Scheduled Audio Cold
When the user speaks over the agent, the server emits reply.done with status: "interrupted" and trims transcript.agent to what was actually spoken. The client’s job is to drop any audio that was scheduled but hasn’t played yet:
function flushPlayback() {
  for (const src of scheduledSources) {
    try { src.stop(); } catch (_) {} // stop anything still scheduled; ignore edge-case errors
  }
  scheduledSources = [];
  playbackTime = audioCtx.currentTime; // reset the cursor to "now"
}
That’s the entire interruption story: stop every scheduled AudioBufferSourceNode, reset the cursor to “now.”
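In the message handler, that is one more branch (same caveat on field names as above):

if (msg.type === "reply.done" && msg.status === "interrupted") flushPlayback();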
Customization
Pick a Different Voice
The voice picker is wired to session.output.voice. The dropdown ships with 16 popular options; 18 English voices and 16 multilingual voices are available in total. See the Voices catalog for samples of each. Multilingual voices code-switch with English automatically.
Change the Personality with the System Prompt
The textarea is bound to session.system_prompt. Tighten it for shorter replies, give it a persona, or scope it to a specific use case:
You are a customer-support agent for Acme Cloud Storage. Only answer
questions about Acme’s products, plans, and account billing. If the user
asks about anything else, politely redirect.
You can also re-send session.update mid-conversation to swap personas live. Note: greeting and output are immutable after the first apply — only system_prompt, tools, and input can change mid-session.
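For example, a live persona swap is just another session.update carrying only the mutable fields. A sketch:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    system_prompt: "You are a terse booking assistant. One sentence per reply.",
  },
}));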
Tune Turn Detection
Add input.turn_detection to the session.update payload to control how patient the agent feels:
session: {
  input: {
    turn_detection: {
      vad_threshold: 0.5, // 0.0–1.0; higher = less sensitive
      min_silence: 600, // ms; min silence before end-of-turn
      max_silence: 1500, // ms; hard cap before forcing end-of-turn
      interrupt_response: true, // false to disable barge-in entirely
    },
  },
}
Each parameter in detail:

| Parameter | Type | Description |
|---|---|---|
| vad_threshold | 0.0–1.0 | Voice activity detection sensitivity. Higher = less sensitive. |
| min_silence | ms | Minimum silence duration before the end-of-turn check fires. |
| max_silence | ms | Hard cap on silence before forcing end-of-turn. |
| interrupt_response | boolean | Set to false to disable barge-in entirely. |
For noisy environments, raise vad_threshold. For deliberate speech (healthcare, eldercare) raise max_silence.
Troubleshooting
Mic blocked or no audio going up. Browsers only allow getUserMedia on secure origins. http://localhost counts; http://your-laptop.local doesn’t. If you’re testing across devices, terminate TLS in front of the Node server (Caddy, Cloudflare Tunnel, ngrok).
Agent keeps interrupting itself. Acoustic echo — the mic is picking up the speakers. Use headphones, or confirm echoCancellation: true is set on getUserMedia.
Audio sounds chipmunky or slowed-down. Sample-rate mismatch. The AudioContext must be created with { sampleRate: 24000 }. If you skip that, the browser creates the context at 44.1 or 48 kHz and the math falls apart.
UNAUTHORIZED close on connect. Token wasn’t included, expired, or was already used. Tokens are single-use — fetch a fresh one for every connection. Confirm ASSEMBLYAI_API_KEY is set on the server.
WebSocket closes with code 1006 and no error. Pre-handshake failure. In browsers, that’s usually a stale or invalid token. Re-fetch the token before reconnecting.
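A sketch of that recovery path, where connect() is a hypothetical wrapper around the token fetch and WebSocket setup shown earlier:

ws.onclose = (evt) => {
  if (evt.code === 1006) {
    // Token was stale or already used; connect() re-fetches /api/voice-token first
    setTimeout(() => connect(), 500);
  }
};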
The full troubleshooting guide is in the Voice Agent API docs.
Frequently asked questions
What is AssemblyAI’s Voice Agent API?
AssemblyAI’s Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.
How do I use the Voice Agent API from the browser without exposing my API key?
Run a small backend that mints short-lived temporary tokens by calling GET https://agents.assemblyai.com/v1/token with your API key in an Authorization: Bearer header. The browser fetches a fresh token before each WebSocket connection and passes it as ?token= in the URL. Tokens are single-use and expire in 1–600 seconds.
What audio format does the browser need to send?
By default, the Voice Agent API expects audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. Create the AudioContext with { sampleRate: 24000 } so no resampling is needed, then use an AudioWorklet to convert Float32 mic samples to Int16Array and base64-encode the buffer.
How do I handle interruption (barge-in)?
When the user speaks while the agent is replying, the server emits reply.done with status: "interrupted". The browser must stop any scheduled AudioBufferSourceNodes and reset its playbackTime cursor to audioCtx.currentTime so the next reply starts cleanly.
Why does my voice agent keep interrupting itself?
Almost always acoustic echo: the mic is picking up the agent’s TTS output through the speakers. Pass echoCancellation: true to getUserMedia to enable the browser’s acoustic echo cancellation, and prefer headphones during development.
Can the Voice Agent API call tools or functions from the browser?
Yes — tool calling works the same way client-side. Register tool definitions in session.tools on a session.update event. When the agent decides to call a tool, the server emits a tool.call event; execute the tool in your client code and send back a tool.result event with the output. See the tool calling guide for the full pattern.
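A minimal client-side sketch of that loop. Every payload field name here (id, name, arguments, output) is an assumption; verify the real shapes against the tool calling guide:

ws.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === "tool.call") {
    // runTool is your own dispatcher; field names are hypothetical
    const output = runTool(msg.name, msg.arguments);
    ws.send(JSON.stringify({ type: "tool.result", id: msg.id, output }));
  }
};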
How much does the Voice Agent API cost?
AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the AssemblyAI pricing page.