No separate STT, LLM, or TTS services to wire up. The AssemblyAI Voice Agent API handles the entire pipeline server-side: speech recognition, the language model that decides what to say, and the voice that speaks it back. Turn detection, barge-in, and tool calling are built in.
Why one WebSocket beats a multi-service pipeline
A traditional voice agent needs you to wire up at least three providers — a streaming STT, an LLM, and a TTS — and orchestrate the audio routing between them yourself. Every hop adds latency, every provider adds an API key, and every glue layer adds a place for the conversation to fall apart.
The Voice Agent API collapses all of that into a single connection:
| | Multi-service pipeline | Voice Agent API |
|---|---|---|
| Services to wire up | STT + LLM + TTS (3+ vendors) | One WebSocket endpoint |
| API keys to manage | 3+ | 1 |
| Round trips per turn | 3 (mic→STT→LLM→TTS→speaker) | 1 (mic→API→speaker) |
| Turn detection | Configure separately | Built in |
| Barge-in / interruption | Implement yourself | Built in |
| Tool calling | Wire LLM tools manually | Built in |
The endpoint is one URL: `wss://agents.assemblyai.com/v1/ws`. Send PCM16 audio, get PCM16 audio back. That’s it.
Architecture
The system is a single Python script that opens a WebSocket to the Voice Agent API. Turn handling is governed by four parameters, all sent under `session.input.turn_detection`:
| Parameter | Type | Description |
|---|---|---|
| `vad_threshold` | 0.0–1.0 | Voice activity detection sensitivity. Lower = more sensitive to speech. Raise for noisy environments. |
| `min_silence` | ms | Minimum silence duration before a confident end-of-turn check fires. |
| `max_silence` | ms | Hard cap on silence before forcing end-of-turn. Raise for deliberate speech (healthcare, eldercare). |
| `interrupt_response` | boolean | Set to `False` to disable barge-in entirely. |
Prerequisites
- Python 3.10+
- A microphone — headphones strongly recommended (terminal apps don’t get OS-level echo cancellation, so the agent will interrupt itself if the mic picks up its own voice)
- An AssemblyAI API key — free tier available
On macOS, install PortAudio so `sounddevice` can access your mic:

```bash
brew install portaudio
```
Quick start
1. Clone and Install
```bash
git clone https://github.com/kelsey-aai/voice-agent-5min
cd voice-agent-5min
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```
2. Configure Your API Key
```bash
cp .env.example .env
# Edit .env — drop in your AssemblyAI API key
```
3. Run the Agent
```bash
python agent.py
```
Plug in your headphones, wait for `Connected (session ...)`, and start talking. You’ll see your transcript and the agent’s replies stream to the terminal in real time.
That’s the whole thing. Five minutes from clone to conversation.
How it works
The full file is under 100 lines. Three pieces do the actual work.
1. Connect and Configure
```python
URL = "wss://agents.assemblyai.com/v1/ws"
headers = {"Authorization": f"Bearer {API_KEY}"}

async with websockets.connect(URL, additional_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi there — what can I help you with?",
            "output": {"voice": "ivy"},
        },
    }))
```
`session.update` is the first message you send. It sets the agent’s personality (`system_prompt`), what it says when the user picks up (`greeting`), and which voice it speaks in (`voice: "ivy"`). Every field is optional — you can update any of them mid-conversation by sending another `session.update`.
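Because every field is optional, a later `session.update` can change a single setting without touching the rest. A minimal sketch, reusing the open `ws` from above and the `james` voice from the catalog section later in this post:

```python
# Mid-conversation update: swap only the voice; other settings persist.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {"output": {"voice": "james"}},
}))
```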
2. Stream Microphone Audio
```python
def mic_callback(indata, *_):
    if session_ready.is_set():
        loop.call_soon_threadsafe(mic_queue.put_nowait, bytes(indata))

async def send_audio() -> None:
    while True:
        chunk = await mic_queue.get()
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode(),
        }))
```
`sounddevice` calls `mic_callback` on its own thread every 50 ms with a fresh PCM16 chunk. We hand it off to the asyncio event loop, base64-encode it, and ship it as an `input.audio` event. The `session_ready` gate keeps us from sending audio before the server says it’s ready.
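For reference, here is a sketch of how the two audio streams might be opened, assuming the 24 kHz mono PCM16 defaults noted in the troubleshooting section (the exact setup in `agent.py` may differ slightly):

```python
import sounddevice as sd

SAMPLE_RATE = 24_000           # the API's default PCM16 rate
BLOCKSIZE = SAMPLE_RATE // 20  # 1,200 frames = 50 ms per callback

# 16-bit mono in both directions; mic_callback is the function defined above.
mic = sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                     blocksize=BLOCKSIZE, callback=mic_callback)
speaker = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16")
mic.start()
speaker.start()
```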
3. Play the Agent’s Response
```python
elif t == "reply.audio":
    pcm = np.frombuffer(base64.b64decode(event["data"]), dtype=np.int16)
    speaker.write(pcm)
elif t == "reply.done" and event.get("status") == "interrupted":
    speaker.abort()
    speaker.start()
```
The server streams `reply.audio` chunks as the LLM generates the response — you don’t wait for the full reply to start playing. `speaker.write()` copies samples into the OS audio buffer; the hardware drains them at 24 kHz on its own clock.
When the user interrupts the agent mid-reply (barge-in), the server emits `reply.done` with `status: "interrupted"`. We flush the speaker buffer with `abort()` then `start()` so the user doesn’t hear stale audio.
What you get for free
These are all handled by the API — you don’t write any code for them:
- Neural turn detection. The server decides when the user has finished speaking using both acoustic and linguistic signals, so it knows the difference between a thinking pause and an actual end-of-turn.
- Barge-in. When the user speaks over the agent, the server stops generating, sends `reply.done` with `status: "interrupted"`, and trims the agent transcript to what was actually spoken.
- Real-time partial transcripts. `transcript.user.delta` events stream as the user talks, so you can show what they’re saying live.
- Final transcripts for both speakers. `transcript.user` and `transcript.agent` events arrive after each turn — perfect for logging or displaying chat history (see the sketch after this list).
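Surfacing those transcript events is a few extra branches in the same receive loop as the `reply.audio` handler above. A sketch only: the `text` payload field below is an assumption for illustration, so check the event reference for the real field names:

```python
# Sketch: print live and final transcripts from the event stream.
# NOTE: the "text" field is assumed, not confirmed by the docs.
elif t == "transcript.user.delta":
    print(event.get("text", ""), end="", flush=True)  # live partials
elif t == "transcript.user":
    print(f"\nYou: {event.get('text', '')}")          # final user turn
elif t == "transcript.agent":
    print(f"Agent: {event.get('text', '')}")          # final agent turn
```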
Tuning the agent
Pick a different voice
18 English voices and 16 multilingual voices are available. Drop any voice ID into `session.output.voice`:
"output": {"voice": "james"} # conversational US male
"output": {"voice": "sophie"} # clear UK female
"output": {"voice": "diego"} # Latin American Spanish
"output": {"voice": "arjun"} # Hindi/Hinglish
See the Voices catalog for samples of each voice. Multilingual voices code-switch with English automatically.
Adjust turn detection
Default settings work well for most apps. Override anything you want under `session.input.turn_detection`:

```python
"input": {
    "turn_detection": {
        "vad_threshold": 0.5,        # 0.0–1.0; lower = more sensitive
        "min_silence": 600,          # ms; min silence before end-of-turn
        "max_silence": 1500,         # ms; max silence before forcing end-of-turn
        "interrupt_response": True,  # set False to disable barge-in
    }
}
```
For noisy environments, raise `vad_threshold`. For deliberate speech (healthcare, eldercare), raise `max_silence`. Settings can be updated mid-session, as sketched below.
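A mid-session adjustment follows the same pattern as the voice swap earlier: another `session.update` carrying only the field that changed. A minimal sketch:

```python
# Mid-call tuning sketch: tighten VAD when background noise picks up.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {"input": {"turn_detection": {"vad_threshold": 0.7}}},
}))
```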
Boost domain-specific terms
If your conversation involves rare words — product names, medical terms, customer names — add them to `session.input.keyterms` to bias the speech recognition model toward them:

```python
"input": {
    "keyterms": ["Ozempic", "Salesforce", "AssemblyAI"]
}
```
Common issues
The agent keeps interrupting itself. Your microphone is picking up the agent’s TTS output through the speakers. Use headphones, or switch to a browser-based client, which gets free echo cancellation from `getUserMedia`.
Audio sounds garbled or pitched up/down. The Voice Agent API uses 24 kHz by default. Make sure both `sd.InputStream` and `sd.OutputStream` are set to `samplerate=24000`.
`UNAUTHORIZED` close on connect. Your API key is missing, expired, or wrong. Grab a fresh one from the AssemblyAI dashboard and double-check `.env`.
The full troubleshooting guide is in the Voice Agent API docs.
Where to go next
Once you’ve got the basic agent talking, layer in capabilities:
- Add tools — give the agent the ability to look up information, hit APIs, or trigger workflows.
- Move to the browser — generate temporary tokens server-side so users can talk to the agent from a webpage with built-in echo cancellation.
- Resume sessions — preserve conversation context across dropped connections.
Frequently asked questions
What is AssemblyAI’s Voice Agent API?
AssemblyAI’s Voice Agent API is a single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate STT, LLM, and TTS providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.
How is it different from streaming speech-to-text?
Streaming speech-to-text only gives you transcripts — you still have to send those transcripts to an LLM, take the LLM’s response, and send it through a TTS service before playing it back. The Voice Agent API does all of that for you: you stream microphone audio in and you get the agent’s spoken audio back.
How do I authenticate?
Pass your AssemblyAI API key as a Bearer token in the `Authorization` header during the WebSocket upgrade. For browser apps where you can’t expose your API key, generate a short-lived temporary token on your server and pass it as a `?token=` query parameter instead. Each token is single-use.
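In code, the two options look roughly like this (a sketch; `temp_token` stands in for whatever your server’s token endpoint returns):

```python
# Server-side client: API key as a Bearer header during the upgrade.
ws = await websockets.connect(
    URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
)

# Browser-facing flow: a short-lived, single-use token in the query string.
ws = await websockets.connect(f"{URL}?token={temp_token}")
```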
What audio format does it expect?
By default, `audio/pcm` — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. For telephony integrations (like Twilio), you can switch to `audio/pcmu` (G.711 μ-law, 8 kHz) or `audio/pcma` (G.711 A-law, 8 kHz) under `session.input.format` and `session.output.format`.
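Switching a session to μ-law telephony audio might look like the sketch below; the format strings come from the answer above, but verify the exact nesting against the docs:

```python
# Hypothetical telephony configuration: G.711 μ-law, 8 kHz, both directions.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "input": {"format": "audio/pcmu"},
        "output": {"format": "audio/pcmu"},
    },
}))
```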
Can it call tools or functions?
Yes. Register tool definitions in `session.tools` on a `session.update` event. When the agent decides to call a tool, the server emits a `tool.call` event; execute the tool in your client code, then send back a `tool.result` event once the accompanying `reply.done` arrives. See the tool calling guide for the full pattern.
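A registration sketch. The definition shape below (JSON-schema-style `parameters`) is an assumption for illustration; the authoritative schema is in the tool calling guide:

```python
# Sketch only: field names inside the tool definition are assumptions.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "tools": [{
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        }],
    },
}))
# On a later tool.call event, run the function and reply with a tool.result event.
```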
Why does my voice agent keep interrupting itself?
The most common cause is acoustic echo: the microphone picks up the agent’s TTS output through your speakers. The fix is either headphones or moving the client into a browser, where `getUserMedia({ audio: { echoCancellation: true } })` gives you OS-level acoustic echo cancellation for free.
How much does it cost?
AssemblyAI offers a free tier so you can build and test without a credit card. For current pricing, see the AssemblyAI pricing page.