Mart Schweiger • Originally published at assemblyai.com

Build a voice agent with LiveKit and AssemblyAI’s Voice Agent API

Why combine LiveKit and the Voice Agent API

WebRTC and AI are different problems with different best-in-class solutions:

  • LiveKit is the easiest way to ship production-grade real-time audio. SDKs for Web, iOS, Android, React Native, Flutter, and Unity. Built-in recording, simulcast, adaptive bitrate, and end-to-end encryption. A managed cloud and a self-hostable open-source server.
  • AssemblyAI’s Voice Agent API is the easiest way to ship a voice agent. One WebSocket gives you Universal-3 Pro Streaming for speech-to-text, an LLM, a TTS engine with 30+ voices, plus neural turn detection, barge-in, and tool calling — all server-side.

Use them together and you get multi-user voice rooms with a real AI agent inside, without writing a STT/LLM/TTS orchestration layer or building your own WebRTC stack.

How this differs from the LiveKit Agents framework

| | LiveKit Agents framework | This tutorial (Voice Agent API + LiveKit transport) |
| --- | --- | --- |
| Where the AI lives | You configure STT, LLM, and TTS plugins separately | Server-side inside the Voice Agent API, behind one WebSocket |
| Services to wire up | 3+ (one per plugin) | 1 |
| API keys to manage | 3+ | 1 |
| Turn detection | Plugin-dependent; configure VAD + endpointing | Neural turn detection built in |
| Barge-in | Framework handles it across plugins | Handled server-side by the API |
| Tool calling | LLM-plugin-specific | Built in (tool.call / tool.result events) |
| What LiveKit does | Transport + agent runtime | Transport only |

Architecture

The system has four layers:

  1. A browser or mobile client joins a LiveKit room with one of the LiveKit SDKs, publishes microphone audio, and plays back the agent's reply track.
  2. LiveKit (Cloud or self-hosted) provides the WebRTC transport and routes tracks between participants.
  3. The Python worker (worker.py) joins the same room as a server-side participant, subscribes to the user's audio, and publishes the agent's voice.
  4. AssemblyAI's Voice Agent API runs the AI pipeline (speech-to-text, the LLM, TTS, and turn detection) over a single WebSocket.

Turn detection lives in the Voice Agent API and is tuned per session with these parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| vad_threshold | 0.0–1.0 | Voice activity detection sensitivity. Higher = ignore more background noise. |
| min_silence | ms | Minimum silence before a confident end-of-turn. Drop to 300 for fast-paced conversation. |
| max_silence | ms | Hard ceiling on silence before forcing end-of-turn. Raise to 2500 for deliberate speech (eldercare, healthcare). |
| interrupt_response | boolean | Set to False to disable barge-in entirely. |

Audio flows at 24 kHz mono PCM16 between the worker and the Voice Agent API. LiveKit’s native FFI resampler handles the conversion between WebRTC’s internal 48 kHz and the 24 kHz the API expects.

Prerequisites

  • Python 3 with pip and venv
  • An AssemblyAI API key
  • A LiveKit Cloud project (URL, API key, and secret), or a self-hosted LiveKit server
  • A browser or mobile client to talk from (the LiveKit Agents Playground works fine)

You don’t need a microphone or speakers on the worker machine — the worker is a server-side participant. All audio I/O happens in the browser/mobile client.

Quick start

1. Clone and Install

 git clone https://github.com/kelsey-aai/voice-agent-livekit
cd voice-agent-livekit

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2. Configure Environment

 cp .env.example .env

Fill in .env:

ASSEMBLYAI_API_KEY=           # https://www.assemblyai.com/dashboard/signup
LIVEKIT_URL=wss://<project>.livekit.cloud
LIVEKIT_API_KEY=              # LiveKit Cloud → Settings → Keys
LIVEKIT_API_SECRET=
ROOM_NAME=voice-agent-demo

For self-hosted LiveKit, run livekit-server --dev and use LIVEKIT_URL=ws://localhost:7880.

3. Run the Worker

 python worker.py

4. Connect a Client

The fastest way is the LiveKit Agents Playground:

  1. Open the playground.
  2. Paste your LIVEKIT_URL and a token. Generate a token from the LiveKit Cloud dashboard, set the room to voice-agent-demo and the identity to anything other than voice-agent. You can also mint one locally (see the sketch after this list).
  3. Click Connect, allow microphone access, and start talking.
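If you'd rather skip the dashboard, the same AccessToken helper the worker uses (shown in full below) also works for a human participant. A minimal sketch, assuming the same environment variables are loaded as in worker.py; the identity is just an example:

from livekit import api

# Token for a human participant; the identity must differ from the worker's "voice-agent".
client_token = (
    api.AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)
    .with_identity("demo-user")
    .with_grants(api.VideoGrants(
        room_join=True, room="voice-agent-demo",
        can_publish=True, can_subscribe=True,
    ))
    .to_jwt()
)
print(client_token)  # paste this into the playground's token field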

How it works

The worker is one file (worker.py) and roughly 250 lines. Six steps do the actual work.

1. Mint a LiveKit Token and Join the Room

 from livekit import api, rtc

token = (
    api.AccessToken(LIVEKIT_API_KEY, LIVEKIT_API_SECRET)
    .with_identity("voice-agent")
    .with_grants(api.VideoGrants(
        room_join=True, room=ROOM_NAME,
        can_publish=True, can_subscribe=True,
    ))
    .to_jwt()
)

room = rtc.Room()
await room.connect(LIVEKIT_URL, token)

AccessToken builds a signed JWT with the grants the worker needs: subscribe to incoming audio, publish a reply track. room.connect() opens the WebRTC signaling and media path.

2. Publish a Local Audio Track for the Agent’s Voice

 audio_source = rtc.AudioSource(sample_rate=24_000, num_channels=1)
local_track = rtc.LocalAudioTrack.create_audio_track("agent-voice", audio_source)

await room.local_participant.publish_track(
    local_track,
    rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
)

AudioSource is LiveKit’s pump for sending audio into a room. We configure it at 24 kHz mono — the Voice Agent API’s default format — so reply audio goes straight in without resampling.

3. Subscribe to the User’s Audio Track

 @room.on("track_subscribed")
def on_track_subscribed(track, publication, participant):
    if track.kind == rtc.TrackKind.KIND_AUDIO:
        asyncio.create_task(bridge_to_voice_agent(track))

LiveKit emits track_subscribed when a remote participant publishes a track and it gets routed to us. We only care about audio.

4. Forward Microphone Audio to the Voice Agent API

# `mic_track` is the remote track passed to bridge_to_voice_agent() in step 3;
# `ws` below is the worker's open WebSocket to the Voice Agent API.
stream = rtc.AudioStream.from_track(
    track=mic_track,
    sample_rate=24_000,    # ask LiveKit to resample to 24 kHz
    num_channels=1,
)

async for event in stream:
    pcm16_bytes = bytes(event.frame.data)
    await ws.send(json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))

AudioStream does the resampling. WebRTC carries audio at 48 kHz internally, but we ask for 24 kHz mono and the LiveKit FFI resampler handles the conversion. Each AudioFrame exposes data as a memoryview of int16 samples — base64-encode and ship as input.audio.

5. Play the Agent’s Reply Back into the Room

 elif t == "reply.audio":
    pcm = base64.b64decode(event["data"])
    samples = len(pcm) // 2  # 2 bytes per int16, mono
    frame = rtc.AudioFrame(
        data=pcm,
        sample_rate=24_000,
        num_channels=1,
        samples_per_channel=samples,
    )
    await audio_source.capture_frame(frame)

The agent streams reply.audio events as soon as the LLM begins generating. Each chunk is wrapped in an AudioFrame and pushed into the AudioSource, which queues it up to 1 second deep and drains at 24 kHz on its own clock.

6. Handle Barge-In

 elif t == "input.speech.started":
    # User started talking; stop playback.
    audio_source.clear_queue()

elif t == "reply.done":
    if event.get("status") == "interrupted":
        audio_source.clear_queue()

AudioSource.clear_queue() immediately discards every queued frame so the user doesn’t hear stale agent audio after they’ve spoken over it.

Tuning the agent

Pick a Voice

 "output": {"voice": "james"}     # conversational US male
"output": {"voice": "sophie"}    # clear UK female
"output": {"voice": "diego"}     # Latin American Spanish
"output": {"voice": "arjun"}     # Hindi/Hinglish

See the Voices catalog for samples. Multilingual voices code-switch automatically.

Adjust the System Prompt and Greeting

 "session": {
    "system_prompt": (
        "You are a customer support agent for Acme. Speak in 1–2 short "
        "sentences. Confirm the user's question before answering."
    ),
    "greeting": "Hi, this is Acme support — what's going on?",
}

You can re-send session.update mid-conversation to swap the prompt or voice. greeting is locked once spoken, but system_prompt and voice are not.
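For example, a minimal sketch of a mid-conversation update. The field names mirror the snippets above; treat the exact message envelope as an assumption to verify against the Voice Agent API reference:

# Re-send session.update over the existing WebSocket to change prompt and voice mid-call.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "system_prompt": "You are now an escalation specialist. Keep answers to one sentence.",
    },
    "output": {"voice": "sophie"},
}))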

Tune Turn Detection

 "input": {
    "turn_detection": {
        "vad_threshold": 0.5,        # 0.0–1.0; higher = ignore more noise
        "min_silence": 600,          # ms before confident end-of-turn
        "max_silence": 1500,         # ms hard ceiling
        "interrupt_response": True,  # set False to disable barge-in
    }
}

For deliberate speech (eldercare, healthcare), raise max_silence to 2500. For fast-paced conversation, drop min_silence to 300.

Boost domain-specific terms

If your conversation includes product names, medical terms, or rare proper nouns, add them to session.input.keyterms:

"input": { "keyterms": ["Universal-3 Pro Streaming", "AssemblyAI", "LiveKit"] }




Multiple participants in one room

This worker bridges one remote audio track to the Voice Agent API. Two ways to scale:

  1. One agent per room. Spin up a separate worker process per room. Best for 1-on-1 use cases like phone-style support agents.
  2. Mix participants before sending. If you want a meeting-style multi-talker agent, mix all remote audio (with rtc.AudioMixer, or by hand as sketched below) and send the mix to one Voice Agent API session.
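A rough, hedged sketch of option 2 that mixes by hand with numpy (an extra dependency, not in requirements.txt), so nothing here leans on rtc.AudioMixer's exact API. Each remote track is resampled to 24 kHz exactly as in step 4, buffered per participant, and a mixer task sums 20 ms chunks before sending them over the single Voice Agent API WebSocket. Clock drift and participant join/leave handling are glossed over:

import asyncio, base64, json
import numpy as np
from livekit import rtc

buffers: dict[str, bytearray] = {}   # raw 24 kHz PCM16 per participant identity

async def pump_track(identity: str, track):
    # Same resampling path as step 4, but buffered instead of sent directly.
    stream = rtc.AudioStream.from_track(track=track, sample_rate=24_000, num_channels=1)
    buf = buffers.setdefault(identity, bytearray())
    async for event in stream:
        buf.extend(event.frame.data)

async def mix_and_send(ws):
    CHUNK = 480 * 2  # 20 ms of 24 kHz mono PCM16, in bytes
    while True:
        await asyncio.sleep(0.02)
        chunks = []
        for buf in buffers.values():
            take = bytes(buf[:CHUNK])
            del buf[:CHUNK]
            if take:
                chunks.append(np.frombuffer(take, dtype=np.int16))
        if not chunks:
            continue
        # Sum in int32 to avoid overflow, then clip back to int16.
        length = max(c.size for c in chunks)
        mixed = np.zeros(length, dtype=np.int32)
        for c in chunks:
            mixed[: c.size] += c
        mixed = np.clip(mixed, -32768, 32767).astype(np.int16)
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(mixed.tobytes()).decode("ascii"),
        }))

Start one pump_track task per remote audio track from the track_subscribed handler, and run mix_and_send alongside them.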

Troubleshooting

The worker connects but the client never hears the agent. Make sure your client subscribed to the agent’s track. Confirm can_subscribe=True on the client’s token.

UNAUTHORIZED close on the AssemblyAI WebSocket. Your ASSEMBLYAI_API_KEY is missing, expired, or pasted with whitespace. Grab a fresh key from the AssemblyAI dashboard.

LiveKit ConnectError: invalid token. The JWT signature didn’t validate against the LIVEKIT_API_SECRET. Check that the URL, key, and secret all come from the same LiveKit project.

Audio is choppy or robotic. Almost always the audio buffer running dry. Run the worker close to your network egress. Inside AudioSource(... queue_size_ms=1000) you have one second of headroom; raise it to 2000 if you see transient stalls.
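For instance, assuming the constructor keyword mentioned above:

# Give the agent's playback queue 2 seconds of headroom instead of 1.
audio_source = rtc.AudioSource(sample_rate=24_000, num_channels=1, queue_size_ms=2000)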

Audio sounds pitched up or down. Sample-rate mismatch. Both AudioSource and AudioStream.from_track must be configured at sample_rate=24_000, num_channels=1.

Agent keeps interrupting itself. Browser clients with getUserMedia({ audio: { echoCancellation: true } }) handle this automatically. On custom mobile clients, make sure AEC is enabled on the capture side.

The full troubleshooting guide is in the Voice Agent API docs.

Frequently asked questions

What is AssemblyAI’s Voice Agent API?

A single WebSocket endpoint that handles the full voice agent pipeline server-side: speech-to-text via Universal-3 Pro Streaming, an LLM, and a TTS engine with 30+ voices. It includes neural turn detection, barge-in, and tool calling out of the box.

Why use LiveKit with the Voice Agent API instead of going direct?

LiveKit handles real-time audio transport (WebRTC, mobile and browser SDKs, recording, scaling, and global edge distribution). The Voice Agent API handles the AI. Combining them gives you multi-user voice rooms, mobile clients, and recording without building a WebRTC stack.

Is this the LiveKit Agents framework?

No. The LiveKit Agents framework expects separate STT, LLM, and TTS plugins. This tutorial uses livekit-rtc directly to join a room as a server-side participant, then forwards audio to the Voice Agent API, which replaces all three.

What audio format does the Voice Agent API expect?

By default, audio/pcm — 16-bit signed little-endian PCM at 24,000 Hz, mono, base64-encoded. This worker configures both LiveKit AudioStream and AudioSource at 24 kHz mono so no manual resampling is needed.

Can the Voice Agent API call tools from inside a LiveKit room?

Yes. Register tool definitions in session.tools on session.update. When the agent decides to invoke one, the server emits a tool.call event. Run the tool in your worker, then send back a tool.result after receiving reply.done.
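A hedged sketch of that round trip. The event names come from the answer above, but the tool schema and payload field names (such as call_id) are assumptions to check against the Voice Agent API reference, and lookup_order is a hypothetical function of your own:

# Register a tool when you send session.update (the schema shape is an assumption).
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "tools": [{
            "name": "lookup_order",
            "description": "Look up the status of an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }],
    },
}))

# Inside the receive loop: run the tool and send tool.result back. (The answer above
# notes the result should follow reply.done; that sequencing is left out of this sketch.)
elif t == "tool.call":
    result = await lookup_order(event.get("arguments", {}).get("order_id"))
    await ws.send(json.dumps({
        "type": "tool.result",
        "call_id": event.get("call_id"),  # assumed field name
        "result": result,
    }))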

How do I scale to many concurrent rooms?

Run one worker per room. LiveKit Cloud’s agent dispatch can spin up a worker per active room, and each worker holds one Voice Agent API WebSocket. Both scale horizontally.

How much does it cost?

AssemblyAI offers a free tier. For current pricing, see the AssemblyAI pricing page and the LiveKit Cloud pricing page.
