
Mart Schweiger

Posted on • Originally published at assemblyai.com

Raw WebSocket Voice Agent with AssemblyAI's Voice Agent API

What "raw" means here

|  | 5-minute quickstart | Raw WebSocket (this tutorial) |
| --- | --- | --- |
| Lines of Python | ~80 | — |
| Events handled | 6 | 16 |
| Partial transcripts (`transcript.user.delta`) | — | ✅ |
| Tool calling | — | ✅ |
| Session resume on reconnect | — | ✅ |
| Speech start/stop logging | — | ✅ |
| Error code handling | Minimal | All codes |

If you want the fastest path to a working agent, start with the 5-minute quickstart. If you want to ship the Voice Agent API into a real product, build on this one — every edge case the protocol expresses is already in here.

Architecture

```
 Microphone (sounddevice, 24 kHz PCM16)
    │
    │  ┌──── client → server ────┐
    │  │  session.update         │  config (1st message)
    │  │  session.resume         │  reconnect within 30s
    │  │  input.audio            │  base64 PCM16 chunks
    │  │  tool.result            │  send on next reply.done
    │  └────────────────────────┘
    ▼
wss://agents.assemblyai.com/v1/ws
    ▲
    │  ┌──── server → client ────┐
    │  │  session.ready          │  save session_id
    │  │  session.updated        │
    │  │  input.speech.started   │
    │  │  input.speech.stopped   │
    │  │  transcript.user.delta  │  partial — live transcript
    │  │  transcript.user        │  final user transcript
    │  │  reply.started          │
    │  │  reply.audio            │  base64 PCM16 chunks
    │  │  transcript.agent       │  full agent transcript
    │  │  reply.done             │  status: "interrupted" on barge-in
    │  │  tool.call              │  arguments is a dict
    │  │  session.error          │  code + message
    │  └────────────────────────┘
Speakers (sounddevice, 24 kHz PCM16)
```

Prerequisites

  • Python 3.10+
  • A microphone — headphones strongly recommended (terminal apps don't get OS-level echo cancellation)
  • An AssemblyAI API key — free tier available

On macOS, install PortAudio for sounddevice:

```bash
brew install portaudio
```

Quick start

```bash
git clone https://github.com/kelsey-aai/voice-agent-raw-websocket
cd voice-agent-raw-websocket

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your AssemblyAI API key

python agent.py
```

You'll see every event flow through the terminal as you talk:

```
[client→server] session.update (initial config)
[server→client] session.ready (sess_abc123)
Speak now. Press Ctrl+C to quit.

[server→client] input.speech.started
  …  what's the weather in tok
You:   What's the weather in Tokyo?
[server→client] reply.started (reply_xyz)
[server→client] tool.call get_weather({'location': 'Tokyo'}) id=call_abc
[server→client] reply.done
[client→server] tool.result id=call_abc
[server→client] reply.started (reply_xyz2)
Agent: It's currently 22 degrees and sunny in Tokyo.
[server→client] reply.done
```

Every event, explained

Client → Server

session.update — First Message, Also Re-sendable

```python
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "...",
        "greeting": "...",
        "output": {"voice": "ivy"},
        "tools": [...],
    },
}))
```

Always your first message. Configures the agent's personality, the spoken greeting, the voice, registered tools, turn detection sensitivity, and audio format. All fields are optional, and you can re-send session.update mid-conversation to change most of them; greeting and the output block are immutable after the first apply.

input.audio — Stream Microphone Audio

```python
await ws.send(json.dumps({
    "type": "input.audio",
    "audio": base64.b64encode(pcm_bytes).decode(),
}))
```

Send only after session.ready. ~50 ms chunks at 24 kHz PCM16 mono. The server buffers across chunks, so chunk size isn't strict.
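The chunking math works out to 2,400 bytes per message (24,000 samples/s × 0.05 s × 2 bytes per PCM16 sample). A sketch of the framing, independent of any audio library (the function name and structure are illustrative):

```python
import base64
import json

SAMPLE_RATE = 24_000   # Voice Agent API default
CHUNK_MS = 50          # ~50 ms per input.audio message
BYTES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000 * 2   # PCM16 mono: 2 bytes/sample

def audio_messages(pcm: bytes):
    """Yield input.audio messages for a buffer of raw PCM16 audio."""
    for i in range(0, len(pcm), BYTES_PER_CHUNK):
        yield json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(pcm[i:i + BYTES_PER_CHUNK]).decode(),
        })
```

Since the server re-buffers across chunks, an off-by-one on the last (shorter) chunk is harmless; what matters is keeping the stream flowing.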

tool.result — Answer a tool.call

```python
await ws.send(json.dumps({
    "type": "tool.result",
    "call_id": "call_abc123",
    "result": json.dumps({"temp_c": 22}),
}))
```

The critical pattern: accumulate tool.result events and only send them when you receive reply.done for the turn that contained the tool.call. Sending early creates timing issues. If the user interrupts mid-turn (reply.done with status: "interrupted"), discard the accumulated results.

session.resume — Reconnect Within 30 Seconds

```python
await ws.send(json.dumps({
    "type": "session.resume",
    "session_id": "sess_abc123",
}))
```

If your WebSocket drops and you reconnect within 30 seconds with the previous session_id, the server preserves conversation context. Past 30 seconds you'll get session_not_found and need to start fresh.

Server → Client

| Event | Carries | What to do |
| --- | --- | --- |
| `session.ready` | `session_id` | Save the id, start sending `input.audio` |
| `session.updated` | — | A `session.update` was applied |
| `input.speech.started` | — | VAD detected speech onset |
| `input.speech.stopped` | — | VAD detected end of speech |
| `transcript.user.delta` | `text` | Live partial — overwrite a single UI line |
| `transcript.user` | `text`, `item_id` | Final user transcript for the turn |
| `reply.started` | `reply_id` | Agent began generating |
| `reply.audio` | `data` | Base64 PCM chunk — decode and play immediately |
| `transcript.agent` | `text`, `interrupted` | Full agent transcript (trimmed if interrupted) |
| `reply.done` | optional `status` | Reply complete; on `"interrupted"`, flush speaker and discard pending tool results |
| `tool.call` | `call_id`, `name`, `arguments` | Run the tool, accumulate the result, send on next `reply.done` |
| `session.error` | `code`, `message`, `timestamp` | See error table below |
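The table rows above fold naturally into a single dispatcher. A sketch (the `state` dict and helper name are illustrative; only the event names and payload fields come from the protocol):

```python
import base64
import json

def handle_server_event(raw: str, state: dict) -> None:
    """Handle one server→client frame, mutating a plain client-state dict."""
    event = json.loads(raw)
    t = event["type"]
    if t == "session.ready":
        state["session_id"] = event["session_id"]   # keep for session.resume
    elif t == "reply.audio":
        # decode the base64 PCM16 chunk and queue it for playback
        state.setdefault("speaker", []).append(base64.b64decode(event["data"]))
    elif t == "reply.done" and event.get("status") == "interrupted":
        state["speaker"] = []        # flush queued audio on barge-in
        state["pending_tools"] = []  # discard un-sent tool results
```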

Error codes

| Code | Recovery |
| --- | --- |
| `UNAUTHORIZED` | Bad/missing API key — fetch a fresh one |
| `FORBIDDEN` | Token valid but lacks permission |
| `session_not_found` / `session_forbidden` / `session_expired` | `session.resume` failed — start a fresh session |
| `agent_init_failed` / `agent_timeout` | Server-side worker issue — retry |
| `invalid_format` | Bad JSON or unknown `type` field |
| `invalid_audio` | base64 decode or PCM conversion failed |
| `invalid_value` | Bad voice or wrong field type in `session.update` |
| `immutable_field` | Tried to change `greeting` or `output` after first apply |
| `server_error` | At capacity — exponential backoff and retry |

How the tool-calling pattern works

```python
pending_tools: list[dict] = []

# Inside the receive loop, where t = event["type"]:
if t == "tool.call":
    # 1. Accumulate — don't send a tool.result yet.
    if event["name"] == "get_weather":
        result = fake_get_weather(event["arguments"]["location"])
    else:
        result = {"error": f"unknown tool: {event['name']}"}
    pending_tools.append({"call_id": event["call_id"], "result": result})

elif t == "reply.done":
    if event.get("status") == "interrupted":
        # 2a. Barge-in: discard pending results.
        pending_tools.clear()
    else:
        # 2b. Reply finished cleanly: send all accumulated results.
        for tool in pending_tools:
            await ws.send(json.dumps({
                "type": "tool.result",
                "call_id": tool["call_id"],
                "result": json.dumps(tool["result"]),
            }))
        pending_tools.clear()
```

While the server waits for your tool.result, the agent speaks a transition phrase ("Let me check that for you") so the conversation doesn't go silent. You can steer that phrase by including instructions in the system prompt.
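Per the paragraph above, the system prompt is the lever for that phrase. A hypothetical prompt fragment (wording is entirely illustrative):

```python
session = {
    "system_prompt": (
        "You are a concise weather assistant. "
        "When you need to call a tool, say a brief transition such as "
        "'Give me a second to look that up' before going quiet."
    ),
}
```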

How session resume works

```python
session_id: str | None = None
while True:
    try:
        session_id = await run_session(session_id)
    except websockets.ConnectionClosed:
        # Reconnect within 30s — server preserves conversation.
        await asyncio.sleep(1)
```

run_session returns the session_id it received from the most recent session.ready. On the next iteration that id is passed back in, and the first message becomes session.resume instead of session.update. If the server replies session_not_found or session_expired, we clear the saved id and start a fresh session.

Tuning the agent

Voice, system prompt, turn detection, and key terms are all configured via session.update. See Session configuration for every field. The same tuning options apply to every Voice Agent API tutorial — pick the one closest to your stack and the protocol underneath stays the same.

Common issues

The agent keeps interrupting itself. Mic feedback. Use headphones, or move to a browser-based client which gets free echo cancellation.

Audio sounds garbled or pitched up/down. The Voice Agent API uses 24 kHz by default. Make sure both sd.InputStream and sd.OutputStream are set to samplerate=24000.

session_not_found immediately after reconnect. The 30 s grace window expired or your network dropped for too long. Clear the saved session_id and start fresh.

UNAUTHORIZED close on connect. Your API key is missing, expired, or wrong. Grab a fresh one from the AssemblyAI dashboard and re-check .env.

The full troubleshooting guide is in the Voice Agent API docs.

Frequently asked questions

What is AssemblyAI's Voice Agent API?

A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.

How many events does the Voice Agent API protocol have?

Four client-to-server events (session.update, input.audio, session.resume, tool.result) and twelve server-to-client events. All sixteen are documented in the events reference and handled explicitly in this tutorial's agent.py.

When should I send a tool.result to the Voice Agent API?

Always wait until you receive reply.done for the turn that contained the tool.call — never immediately. The recommended pattern is to accumulate tool results in a list as tool.call events arrive, then send them all in the reply.done handler. If reply.done arrives with status: "interrupted", discard the pending results.

How does session resume work?

Sessions are preserved for 30 seconds after disconnection. Reconnect within that window with {"type": "session.resume", "session_id": "<your session_id>"} as your first message. Past 30 seconds, you'll get session_not_found and need to send a fresh session.update.

Can I show live partial transcripts to the user?

Yes — that's what transcript.user.delta is for. The server emits a delta event every few hundred milliseconds while the user is speaking, then a final transcript.user event when the turn ends. Overwrite a single line on each delta and commit it when the final arrives.
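The overwrite-then-commit rendering can be isolated into a tiny helper. The event names and fields match the protocol tables above; the terminal rendering (carriage-return overwrite) is just one possible UI:

```python
def render_user_transcript(event: dict) -> str:
    """Return what to write to the terminal for a user-transcript event."""
    if event["type"] == "transcript.user.delta":
        return "\r  …  " + event["text"]            # rewrite the same line in place
    if event["type"] == "transcript.user":
        return "\rYou:   " + event["text"] + "\n"   # commit the final turn
    return ""
```

Write each returned string with `sys.stdout.write` plus `flush()` so deltas actually repaint the line instead of buffering.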

What happens if the user interrupts the agent?

The server stops generating, emits reply.done with status: "interrupted", and emits transcript.agent with the text trimmed to what was actually spoken. Your client must flush its audio output buffer and discard any pending tool results.
