Mart Schweiger

Posted on • Originally published at assemblyai.com

Build an Agora voice agent with AssemblyAI's Voice Agent API

Why combine Agora and the Voice Agent API

Agora gives you battle-tested WebRTC: low-latency audio routing across 200+ countries, automatic codec negotiation, jitter buffers, NAT traversal, and SDKs for every client platform. The Voice Agent API adds the AI brain — streaming speech recognition, LLM reasoning, and TTS — over a single connection.

|  | DIY pipeline behind Agora | Voice Agent API behind Agora |
| --- | --- | --- |
| Vendors to integrate | STT + LLM + TTS (3+) | 1 |
| API keys to rotate | 3+ | 1 |
| Round trips per turn | 3 | 1 |
| Turn detection | Plug in a separate VAD | Built in |
| Barge-in | Implement yourself | Built in |
| Tool calling | Wire LLM tools yourself | Built in |
| Voices | Pick a TTS vendor | 30+ built in |

Architecture

The system has three layers:

  • A caller client (browser or mobile app) joined to an Agora channel
  • A server-side bridge bot built on agora-python-server-sdk
  • The AssemblyAI Voice Agent API, reached over a single WebSocket

Turn detection is configured through a handful of parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| vad_threshold | 0.0–1.0 | Voice activity detection sensitivity. Raise for noisy call environments. |
| min_silence | ms | Minimum silence before a confident end-of-turn check fires. |
| max_silence | ms | Hard cap on silence before forcing end-of-turn. Raise for deliberate speakers. |
| interrupt_response | boolean | Set to False to disable barge-in entirely. |

The bot resamples between Agora's 16 kHz and the Voice Agent API's 24 kHz using SciPy's polyphase filter. Both sides use PCM16 mono.
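A minimal sketch of what that resampling helper might look like, built on scipy.signal.resample_poly (the repo's actual resample_pcm16 implementation may differ in detail):

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_pcm16(pcm: bytes, src_hz: int, dst_hz: int) -> bytes:
    """Resample 16-bit signed LE mono PCM between rates with a polyphase filter."""
    samples = np.frombuffer(pcm, dtype=np.int16)
    g = gcd(src_hz, dst_hz)  # 16 kHz <-> 24 kHz reduces to up=3, down=2
    out = resample_poly(samples, dst_hz // g, src_hz // g)
    # Clip before narrowing back to int16 to avoid wrap-around on overshoot.
    return np.clip(out, -32768, 32767).astype(np.int16).tobytes()
```

A 10 ms frame at 16 kHz (160 samples) comes back as 240 samples at 24 kHz, so frame durations are preserved on both legs.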

Prerequisites

  • Python 3.10+
  • An Agora project with an App ID (and App Certificate if enabled)
  • An AssemblyAI API key — free tier available
  • Linux or macOS (the Agora native server SDK does not officially ship Windows wheels; use WSL2 or a Linux container on Windows)

Quick start

1. Clone and install

```bash
git clone https://github.com/kelsey-aai/voice-agent-agora
cd voice-agent-agora

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

2. Configure credentials

```bash
cp .env.example .env
```

Edit .env:

```
ASSEMBLYAI_API_KEY=your_assemblyai_key
AGORA_APP_ID=your_agora_app_id
AGORA_APP_CERTIFICATE=your_agora_app_certificate
AGORA_CHANNEL=voice-agent-demo
AGORA_BOT_UID=9999
```

If your Agora project has App Certificate disabled, leave AGORA_APP_CERTIFICATE blank.

3. Run the bot

```bash
python bot.py --channel voice-agent-demo
```

4. Connect a client

Open Agora's Web demo, enter your App ID, the same channel name, a different UID, and click Join. Speak — the bot transcribes you live, the LLM replies, and the synthesized voice plays back through your browser.

How it works

The bridge is two cooperating asyncio tasks — one pulling caller audio out of Agora and pushing it to AssemblyAI, the other pulling reply audio out of AssemblyAI and pushing it back into Agora.
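In outline, the bridge reduces to something like the following (a simplified sketch — the queue, task, and callback names are illustrative, not the repo's exact API):

```python
import asyncio


async def pump_caller_audio(inbound: asyncio.Queue, ws_send) -> None:
    """Forward caller audio (filled in by the Agora observer) to AssemblyAI."""
    while True:
        chunk = await inbound.get()
        await ws_send(chunk)


async def pump_reply_audio(outbound: asyncio.Queue, publish) -> None:
    """Push reply audio (decoded from WebSocket events) back into Agora."""
    while True:
        chunk = await outbound.get()
        await publish(chunk)


async def run_bridge(inbound, outbound, ws_send, publish) -> None:
    # The two pumps run concurrently for the lifetime of the call.
    await asyncio.gather(
        pump_caller_audio(inbound, ws_send),
        pump_reply_audio(outbound, publish),
    )
```

Keeping the two directions in separate tasks means a slow LLM reply can never stall the uplink, and vice versa.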

1. Connect to the Voice Agent API

```python
URL = "wss://agents.assemblyai.com/v1/ws"
headers = {"Authorization": f"Bearer {API_KEY}"}

async with websockets.connect(URL, additional_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi — I just joined the call.",
            "input": {"format": {"encoding": "audio/pcm"}},
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcm"}},
        },
    }))
```

session.update is the first message and configures personality, greeting, and voice. The default audio format is audio/pcm — 24 kHz, 16-bit signed LE, mono.

2. Pull caller audio out of Agora

The bot registers an IAudioFrameObserver whose on_playback_audio_frame_before_mixing hook fires every 10 ms with one participant's audio frame. We resample 16 kHz → 24 kHz with SciPy's polyphase filter:

```python
def on_playback_audio_frame_before_mixing(self, channel_id, uid, frame):
    pcm16 = bytes(frame.buffer)              # 16 kHz PCM16
    pcm24 = resample_pcm16(pcm16, 16_000, 24_000)
    loop.call_soon_threadsafe(agent.inbound_audio.put_nowait, pcm24)
    return 0
```

call_soon_threadsafe is required because Agora's observer runs on a native C++ thread, not the asyncio loop.

3. Stream audio to AssemblyAI

```python
chunk = await mic_queue.get()
await ws.send(json.dumps({
    "type": "input.audio",
    "audio": base64.b64encode(chunk).decode(),
}))
```

4. Publish the reply back into Agora

When reply.audio events arrive, we decode the base64 PCM, resample 24 kHz → 16 kHz, and hand it to AudioPcmDataSender:

```python
# WebSocket receive loop: queue the decoded reply audio
elif t == "reply.audio":
    pcm24 = base64.b64decode(event["data"])
    await self.outbound_audio.put(pcm24)

# Sender task: resample each queued chunk and publish it into Agora
pcm24 = await self.outbound_audio.get()
pcm16 = resample_pcm16(pcm24, 24_000, 16_000)
self.pcm_sender.send_audio_pcm_data(
    pcm16, 0, len(pcm16) // 2, 2, 1, 16_000,
)
```

We pace the pushes to wall-clock time so a long reply doesn't blast into Agora's buffer in one go — that keeps barge-in responsive.
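A minimal pacing loop could look like this (a sketch assuming 16 kHz PCM16 mono chunks; the repo's actual pacing logic may differ):

```python
import asyncio
import time

SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2  # PCM16 mono


async def pace_playback(outbound: asyncio.Queue, send_chunk) -> None:
    """Push chunks at wall-clock speed so Agora's buffer never runs far ahead."""
    next_deadline = time.monotonic()
    while True:
        chunk = await outbound.get()
        # Never schedule into the past after a long silence between replies.
        next_deadline = max(next_deadline, time.monotonic())
        await asyncio.sleep(max(0.0, next_deadline - time.monotonic()))
        send_chunk(chunk)
        # Each chunk advances the deadline by its own playback duration.
        next_deadline += len(chunk) / (SAMPLE_RATE * BYTES_PER_SAMPLE)
```

Because only one chunk's worth of audio sits in Agora's buffer at a time, flushing the queue on barge-in cuts the voice off almost immediately.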

5. Handle barge-in

```python
elif t == "reply.done" and event.get("status") == "interrupted":
    while not outbound_audio.empty():
        outbound_audio.get_nowait()
```

The Voice Agent API also trims the transcript.agent event to what the bot actually got out before it was cut off — useful for accurate logging.

Tuning

Pick a different voice

```python
"output": {"voice": "james"}    # conversational US male
"output": {"voice": "sophie"}   # clear UK female
"output": {"voice": "diego"}    # Latin American Spanish
"output": {"voice": "arjun"}    # Hindi/Hinglish
```

Browse the full Voices catalog. Multilingual voices code-switch with English automatically.

Adjust turn detection

```python
"input": {
    "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 600,
        "max_silence": 1500,
        "interrupt_response": True,
    }
}
```




Boost domain-specific words


```python
"input": {"keyterms": ["AssemblyAI", "Agora", "Universal-3"]}
```




Add tools

Register functions on session.tools to let the agent look up data, hit APIs, or trigger workflows. Full pattern in the tool calling docs.
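As a hypothetical shape only — the field names below are illustrative and not confirmed against the API reference, so check the tool calling docs for the real schema — a tool registration might look something like:

```python
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                # Hypothetical schema: verify names against the tool calling docs.
                "name": "lookup_order",
                "description": "Fetch an order's status by ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            }
        ],
    },
}
```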

Troubleshooting

agora-python-server-sdk install fails on macOS. The package ships pre-built C++ wheels for Linux and macOS. If pip falls back to source build, install Xcode command-line tools (xcode-select --install) or run the bot in a Linux container.

Bot joins but stays silent. Check that your client connected with the same AGORA_CHANNEL name and a different UID than AGORA_BOT_UID. Agora rejects duplicate UIDs.

UNAUTHORIZED close from AssemblyAI. API key missing, expired, or wrong. Pull a fresh one from the AssemblyAI dashboard.

Audio sounds chipmunky or sluggish. Sample-rate mismatch. Confirm set_playback_audio_frame_before_mixing_parameters(channels=1, sample_rate_hz=16000) and that resampling is on between Agora's 16 kHz and the API's 24 kHz.

Bot interrupts itself. Acoustic loop somewhere — usually one client has speakers + mic open without echo cancellation. Browser clients should request getUserMedia({ audio: { echoCancellation: true } }).

Token errors from Agora. If your project has App Certificate enabled, AGORA_APP_CERTIFICATE must be set and the bot UID + channel name must match what you signed.

Full troubleshooting guide: Voice Agent API docs.

Known limitations

  • agora-python-server-sdk is a beta wrapper around Agora's native C++ SDK. Class layouts have moved between minor versions. We pin 2.2.4 and document the exact API surface the bot uses.
  • Agora's recommended path for new voice-agent projects is the Conversational AI Engine — a hosted REST service. Use this tutorial when you want the full AI pipeline on AssemblyAI's Voice Agent API.

  • No Windows wheels. Run inside WSL2 or a Linux Docker container.

Frequently asked questions

What is the AssemblyAI Voice Agent API?

A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech recognition on Universal-3 Pro Streaming, LLM reasoning, and TTS with 30+ voices. It includes neural turn detection, barge-in, and tool calling.

How do I connect the Voice Agent API to Agora?

Run a server-side bot with agora-python-server-sdk. The bot joins the Agora channel, registers an IAudioFrameObserver to capture caller audio (16 kHz PCM), resamples to 24 kHz, and forwards each chunk to the Voice Agent API. Reply audio comes back, gets resampled to 16 kHz, and is published via AudioPcmDataSender.

Can I use Agora's Conversational AI Engine instead?

Yes — it supports AssemblyAI as the STT provider, but uses Agora's LLM and TTS layers. Use this tutorial when you want the full AI pipeline on AssemblyAI's Voice Agent API.

What audio format does it use with Agora?

The Voice Agent API defaults to audio/pcm at 24 kHz. Agora delivers 16 kHz PCM, so the bot resamples 16 kHz ↔ 24 kHz on each side using SciPy's polyphase filter.

How does barge-in work?

The Voice Agent API emits reply.done with status: "interrupted". The bridge flushes its outbound audio queue so the bot stops talking immediately.

Do I need an Agora App Certificate?

Only if your Agora project has it enabled. If so, set AGORA_APP_CERTIFICATE in .env. If disabled, leave it blank.

How much does it cost?

AssemblyAI offers a free tier. For current pricing, see the AssemblyAI pricing page.
