
Mart Schweiger

Posted on • Originally published at assemblyai.com

Raw WebSocket Voice Agent with AssemblyAI's Voice Agent API

What "raw" means here

|  | 5-minute quickstart | Raw WebSocket (this tutorial) |
| --- | --- | --- |
| Lines of Python | ~80 | — |
| Events handled | 6 | 16 |
| Partial transcripts (`transcript.user.delta`) | — | ✅ |
| Tool calling | — | ✅ |
| Session resume on reconnect | — | ✅ |
| Speech start/stop logging | — | ✅ |
| Error code handling | Minimal | All codes |

If you want the fastest path to a working agent, start with the 5-minute quickstart. If you want to ship the Voice Agent API into a real product, build on this one — every edge case the protocol expresses is already in here.

Architecture

```
 Microphone (sounddevice, 24 kHz PCM16)
    │
    │  ┌──── client → server ────┐
    │  │  session.update         │  config (1st message)
    │  │  session.resume         │  reconnect within 30s
    │  │  input.audio            │  base64 PCM16 chunks
    │  │  tool.result            │  send on next reply.done
    │  └────────────────────────┘
    ▼
wss://agents.assemblyai.com/v1/ws
    ▲
    │  ┌──── server → client ────┐
    │  │  session.ready          │  save session_id
    │  │  session.updated        │
    │  │  input.speech.started   │
    │  │  input.speech.stopped   │
    │  │  transcript.user.delta  │  partial — live transcript
    │  │  transcript.user        │  final user transcript
    │  │  reply.started          │
    │  │  reply.audio            │  base64 PCM16 chunks
    │  │  transcript.agent       │  full agent transcript
    │  │  reply.done             │  status: "interrupted" on barge-in
    │  │  tool.call              │  arguments is a dict
    │  │  session.error          │  code + message
    │  └────────────────────────┘
Speakers (sounddevice, 24 kHz PCM16)
```

Prerequisites

  • Python 3.10+
  • A microphone — headphones strongly recommended (terminal apps don't get OS-level echo cancellation)
  • An AssemblyAI API key — free tier available

On macOS, install PortAudio for sounddevice:

```bash
brew install portaudio
```

Quick start

```bash
git clone https://github.com/kelsey-aai/voice-agent-raw-websocket
cd voice-agent-raw-websocket

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env with your AssemblyAI API key

python agent.py
```

You'll see every event flow through the terminal as you talk:

```
[client→server] session.update (initial config)
[server→client] session.ready (sess_abc123)
Speak now. Press Ctrl+C to quit.

[server→client] input.speech.started
  …  what's the weather in tok
You:   What's the weather in Tokyo?
[server→client] reply.started (reply_xyz)
[server→client] tool.call get_weather({'location': 'Tokyo'}) id=call_abc
[server→client] reply.done
[client→server] tool.result id=call_abc
[server→client] reply.started (reply_xyz2)
Agent: It's currently 22 degrees and sunny in Tokyo.
[server→client] reply.done
```

Every event, explained

Client → Server

session.update — First Message, Also Re-sendable

```python
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "...",
        "greeting": "...",
        "output": {"voice": "ivy"},
        "tools": [...],
    },
}))
```

Always your first message. Configures the agent's personality, the spoken greeting, the voice, registered tools, turn detection sensitivity, and audio format. All fields are optional, and you can re-send session.update mid-conversation to change most of them; greeting and the output block are immutable after the first apply.

input.audio — Stream Microphone Audio

```python
await ws.send(json.dumps({
    "type": "input.audio",
    "audio": base64.b64encode(pcm_bytes).decode(),
}))
```

Send only after session.ready. ~50 ms chunks at 24 kHz PCM16 mono. The server buffers across chunks, so chunk size isn't strict.
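The chunking math works out to 2,400 bytes per message (24,000 samples/s × 0.05 s × 2 bytes per PCM16 sample). A sketch of the framing, independent of any audio library (the function name and structure are illustrative):

```python
import base64
import json

SAMPLE_RATE = 24_000   # Voice Agent API default
CHUNK_MS = 50          # ~50 ms per input.audio message
BYTES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000 * 2   # PCM16 mono: 2 bytes/sample

def audio_messages(pcm: bytes):
    """Yield input.audio messages for a buffer of raw PCM16 audio."""
    for i in range(0, len(pcm), BYTES_PER_CHUNK):
        yield json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(pcm[i:i + BYTES_PER_CHUNK]).decode(),
        })
```

Since the server re-buffers across chunks, an off-by-one on the last (shorter) chunk is harmless; what matters is keeping the stream flowing.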

tool.result — Answer a tool.call

```python
await ws.send(json.dumps({
    "type": "tool.result",
    "call_id": "call_abc123",
    "result": json.dumps({"temp_c": 22}),
}))
```

The critical pattern: accumulate tool.result events and only send them when you receive reply.done for the turn that contained the tool.call. Sending early creates timing issues. If the user interrupts mid-turn (reply.done with status: "interrupted"), discard the accumulated results.

session.resume — Reconnect Within 30 Seconds

```python
await ws.send(json.dumps({
    "type": "session.resume",
    "session_id": "sess_abc123",
}))
```

If your WebSocket drops and you reconnect within 30 seconds with the previous session_id, the server preserves conversation context. Past 30 seconds you'll get session_not_found and need to start fresh.

Server → Client

| Event | Carries | What to do |
| --- | --- | --- |
| `session.ready` | `session_id` | Save the id, start sending `input.audio` |
| `session.updated` | — | A `session.update` was applied |
| `input.speech.started` | — | VAD detected speech onset |
| `input.speech.stopped` | — | VAD detected end of speech |
| `transcript.user.delta` | `text` | Live partial — overwrite a single UI line |
| `transcript.user` | `text`, `item_id` | Final user transcript for the turn |
| `reply.started` | `reply_id` | Agent began generating |
| `reply.audio` | `data` | Base64 PCM chunk — decode and play immediately |
| `transcript.agent` | `text`, `interrupted` | Full agent transcript (trimmed if interrupted) |
| `reply.done` | optional `status` | Reply complete; on `"interrupted"`, flush speaker and discard pending tool results |
| `tool.call` | `call_id`, `name`, `arguments` | Run the tool, accumulate the result, send on next `reply.done` |
| `session.error` | `code`, `message`, `timestamp` | See error table below |
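The table rows above fold naturally into a single dispatcher. A sketch (the `state` dict and helper name are illustrative; only the event names and payload fields come from the protocol):

```python
import base64
import json

def handle_server_event(raw: str, state: dict) -> None:
    """Handle one server→client frame, mutating a plain client-state dict."""
    event = json.loads(raw)
    t = event["type"]
    if t == "session.ready":
        state["session_id"] = event["session_id"]   # keep for session.resume
    elif t == "reply.audio":
        # decode the base64 PCM16 chunk and queue it for playback
        state.setdefault("speaker", []).append(base64.b64decode(event["data"]))
    elif t == "reply.done" and event.get("status") == "interrupted":
        state["speaker"] = []        # flush queued audio on barge-in
        state["pending_tools"] = []  # discard un-sent tool results
```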

Error codes

| Code | Recovery |
| --- | --- |
| `UNAUTHORIZED` | Bad/missing API key — fetch a fresh one |
| `FORBIDDEN` | Token valid but lacks permission |
| `session_not_found` / `session_forbidden` / `session_expired` | `session.resume` failed — start a fresh session |
| `agent_init_failed` / `agent_timeout` | Server-side worker issue — retry |
| `invalid_format` | Bad JSON or unknown `type` field |
| `invalid_audio` | base64 decode or PCM conversion failed |
| `invalid_value` | Bad voice or wrong field type in `session.update` |
| `immutable_field` | Tried to change `greeting` or `output` after first apply |
| `server_error` | At capacity — exponential backoff and retry |

How the tool-calling pattern works

```python
pending_tools: list[dict] = []

# Inside the receive loop, where t = event["type"]:
if t == "tool.call":
    # 1. Accumulate — don't send a tool.result yet.
    if event["name"] == "get_weather":
        result = fake_get_weather(event["arguments"]["location"])
    else:
        result = {"error": f"unknown tool: {event['name']}"}
    pending_tools.append({"call_id": event["call_id"], "result": result})

elif t == "reply.done":
    if event.get("status") == "interrupted":
        # 2a. Barge-in: discard pending results.
        pending_tools.clear()
    else:
        # 2b. Reply finished cleanly: send all accumulated results.
        for tool in pending_tools:
            await ws.send(json.dumps({
                "type": "tool.result",
                "call_id": tool["call_id"],
                "result": json.dumps(tool["result"]),
            }))
        pending_tools.clear()
```

While the server waits for your tool.result, the agent speaks a transition phrase ("Let me check that for you") so the conversation doesn't go silent. You can steer that phrase by including instructions in the system prompt.
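Per the paragraph above, the system prompt is the lever for that phrase. A hypothetical prompt fragment (wording is entirely illustrative):

```python
session = {
    "system_prompt": (
        "You are a concise weather assistant. "
        "When you need to call a tool, say a brief transition such as "
        "'Give me a second to look that up' before going quiet."
    ),
}
```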

How session resume works

```python
session_id: str | None = None
while True:
    try:
        session_id = await run_session(session_id)
    except websockets.ConnectionClosed:
        # Reconnect within 30s — server preserves conversation.
        await asyncio.sleep(1)
```

run_session returns the session_id it received from the most recent session.ready. On the next iteration that id is passed back in, and the first message becomes session.resume instead of session.update. If the server replies session_not_found or session_expired, we clear the saved id and start a fresh session.

Tuning the agent

Voice, system prompt, turn detection, and key terms are all configured via session.update. See Session configuration for every field. The same tuning options apply to every Voice Agent API tutorial — pick the one closest to your stack and the protocol underneath stays the same.

Common issues

The agent keeps interrupting itself. Mic feedback. Use headphones, or move to a browser-based client which gets free echo cancellation.

Audio sounds garbled or pitched up/down. The Voice Agent API uses 24 kHz by default. Make sure both sd.InputStream and sd.OutputStream are set to samplerate=24000.

session_not_found immediately after reconnect. The 30 s grace window expired or your network dropped for too long. Clear the saved session_id and start fresh.

UNAUTHORIZED close on connect. Your API key is missing, expired, or wrong. Grab a fresh one from the AssemblyAI dashboard and re-check .env.

The full troubleshooting guide is in the Voice Agent API docs.

Frequently asked questions

What is AssemblyAI's Voice Agent API?

A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech-to-text, LLM reasoning, and text-to-speech — so you can build a conversational voice agent without wiring up separate providers. It includes neural turn detection, barge-in, tool calling, and 30+ voices out of the box.

How many events does the Voice Agent API protocol have?

Four client-to-server events (session.update, input.audio, session.resume, tool.result) and twelve server-to-client events. All sixteen are documented in the events reference and handled explicitly in this tutorial's agent.py.

When should I send a tool.result to the Voice Agent API?

Always wait until you receive reply.done for the turn that contained the tool.call — never immediately. The recommended pattern is to accumulate tool results in a list as tool.call events arrive, then send them all in the reply.done handler. If reply.done arrives with status: "interrupted", discard the pending results.

How does session resume work?

Sessions are preserved for 30 seconds after disconnection. Reconnect within that window with {"type": "session.resume", "session_id": "<your session_id>"} as your first message. Past 30 seconds, you'll get session_not_found and need to send a fresh session.update.

Can I show live partial transcripts to the user?

Yes — that's what transcript.user.delta is for. The server emits a delta event every few hundred milliseconds while the user is speaking, then a final transcript.user event when the turn ends. Overwrite a single line on each delta and commit it when the final arrives.
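The overwrite-then-commit rendering can be isolated into a tiny helper. The event names and fields match the protocol tables above; the terminal rendering (carriage-return overwrite) is just one possible UI:

```python
def render_user_transcript(event: dict) -> str:
    """Return what to write to the terminal for a user-transcript event."""
    if event["type"] == "transcript.user.delta":
        return "\r  …  " + event["text"]            # rewrite the same line in place
    if event["type"] == "transcript.user":
        return "\rYou:   " + event["text"] + "\n"   # commit the final turn
    return ""
```

Write each returned string with `sys.stdout.write` plus `flush()` so deltas actually repaint the line instead of buffering.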

What happens if the user interrupts the agent?

The server stops generating, emits reply.done with status: "interrupted", and emits transcript.agent with the text trimmed to what was actually spoken. Your client must flush its audio output buffer and discard any pending tool results.
