Mart Schweiger

Posted on May 19 • Originally published at assemblyai.com

How to build a voice agent with Twilio and AssemblyAI

#voiceai #ai #telephony #tutorial

Building a voice agent on Twilio with AssemblyAI takes one WebSocket server that bridges Twilio Voice Media Streams into Universal-3 Pro Streaming, your LLM of choice, and a text-to-speech model — all under an 800ms turn budget. This tutorial walks through every piece: the TwiML to open the audio stream, the FastAPI WebSocket bridge that handles 8kHz mulaw audio in both directions, the LLM loop with tool calling, and the deployment considerations that decide whether your agent feels human or obviously robotic on a real phone call.

By the end of this guide, you'll have a working inbound phone-based voice agent that answers a Twilio number, transcribes the caller in real time, calls tools (order lookup, callback scheduling, human transfer), and speaks back — all with code you can fork and ship today. The full repository is at the end of this post.

Why Twilio + AssemblyAI works for phone-based voice agents

Twilio is the most common telephony layer for voice agents because it handles the PSTN connection, gives you a phone number in minutes, and exposes the call audio as a Media Stream you can bridge into your own backend over a WebSocket. The audio comes in at 8kHz mulaw — the standard telephony format, not the 16kHz PCM most audio tools assume.

AssemblyAI's Universal-3 Pro Streaming model is built specifically for this. It accepts pcm_mulaw at sample_rate=8000 natively, so you don't pay the round-trip latency cost of resampling phone audio into 16kHz PCM and back. Combined with 307ms P50 latency, immutable transcripts, and 21% fewer alphanumeric errors than the previous generation of streaming speech-to-text models, it's the speech-to-text layer that decides whether your agent captures a confirmation code on the first try or makes the caller repeat it.

The architecture is straightforward:

  Caller's phone
       │
   Twilio Voice (PSTN)
       │  TwiML → open WebSocket
       ▼
  Your FastAPI server (this tutorial)
   ┌────┴────┐
   ▼         ▲
 AssemblyAI    ElevenLabs TTS
 Universal-3   (ulaw_8000 output)
 Pro Streaming
   │             ▲
   │ transcript  │ audio
   ▼             │
   GPT-4o + tool calling
     │
     └─► action + spoken reply

Audio flows in two directions continuously. Twilio sends inbound audio (caller → your server → AssemblyAI). Your server generates an LLM response, runs it through ElevenLabs, and streams the synthesized audio back to Twilio as mulaw frames. All of it stays inside one WebSocket per call.

Before you start

You'll need:

An AssemblyAI account with API key access to Universal-3 Pro Streaming
A Twilio account with a Voice-enabled phone number
An OpenAI API key (or another LLM provider)
An ElevenLabs API key (or another streaming TTS provider with mulaw output)
Python 3.11+
ngrok for exposing your local server to Twilio during development

Install the dependencies:

pip install fastapi uvicorn websockets python-dotenv openai elevenlabs twilio

Step 1: Configure the Twilio TwiML for an inbound call

When someone calls your Twilio number, Twilio fetches a TwiML document from your server and uses it to decide what to do with the call. To stream the call audio to your WebSocket, you return TwiML with a <Connect><Stream> block:

# server.py
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice(request: Request):
    host = request.url.hostname
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://{host}/media-stream" />
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

In the Twilio console, set the phone number's voice webhook to POST https://your-host/twilio/voice. When a call comes in, Twilio will hit this endpoint, parse the TwiML, and open a WebSocket to /media-stream that carries the call audio.
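The f-string above interpolates the request host into XML without escaping it. As a minimal sketch, assuming you would rather let the standard library build the document, the same TwiML can be generated with ElementTree. The helper name `build_stream_twiml` is hypothetical and not part of Twilio's SDK:

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(host: str, path: str = "/media-stream") -> str:
    """Return TwiML that connects the call to a WebSocket media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # ElementTree escapes attribute values, so an unusual Host header
    # cannot produce malformed XML.
    ET.SubElement(connect, "Stream", {"url": f"wss://{host}{path}"})
    body = ET.tostring(response, encoding="unicode")
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{body}'


if __name__ == "__main__":
    print(build_stream_twiml("example.ngrok-free.dev"))
```

Inside the FastAPI handler you would pass the request's host to this function and return the string with media_type="application/xml", exactly as before.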

Step 2: Bridge Twilio Media Streams to Universal-3 Pro Streaming

This is the core of the agent. The WebSocket handler receives Twilio's audio frames, forwards them to AssemblyAI, listens for transcripts, and routes them into the LLM loop.

# server.py (continued)
import asyncio
import base64
import json
import os
import websockets
from fastapi import WebSocket

ASSEMBLY_WS = "wss://streaming.assemblyai.com/v3/ws"

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = None

    # Open the AssemblyAI streaming session with pcm_mulaw at 8kHz
    aai_url = (
        f"{ASSEMBLY_WS}"
        f"?speech_model=u3-rt-pro"
        f"&encoding=pcm_mulaw"
        f"&sample_rate=8000"
        f"&format_turns=true"
    )
    # Recent websockets releases name this keyword additional_headers.
    aai_ws = await websockets.connect(
        aai_url,
        extra_headers={"Authorization": os.environ["ASSEMBLYAI_API_KEY"]},
    )

    async def pump_twilio_to_aai():
        nonlocal stream_sid
        async for raw in twilio_ws.iter_text():
            event = json.loads(raw)
            if event["event"] == "start":
                stream_sid = event["start"]["streamSid"]
            elif event["event"] == "media":
                audio_b64 = event["media"]["payload"]
                # Twilio sends base64-encoded mulaw. AssemblyAI accepts raw bytes.
                await aai_ws.send(base64.b64decode(audio_b64))
            elif event["event"] == "stop":
                await aai_ws.close()
                return

    async def pump_aai_to_llm():
        async for message in aai_ws:
            data = json.loads(message)
            if data.get("type") == "Turn" and data.get("end_of_turn"):
                transcript = data.get("transcript", "").strip()
                if transcript:
                    await handle_user_turn(transcript, twilio_ws, stream_sid)

    await asyncio.gather(pump_twilio_to_aai(), pump_aai_to_llm())

The critical settings:

speech_model=u3-rt-pro selects Universal-3 Pro Streaming
encoding=pcm_mulaw and sample_rate=8000 tell AssemblyAI to expect raw mulaw without resampling
format_turns=true gives you properly cased and punctuated transcripts ready for the LLM

When end_of_turn is true, the caller has finished speaking and you have a complete utterance to send to the LLM.
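One detail worth checking before you ship: Twilio media messages carry small frames (typically 20 ms, which is 160 bytes of 8kHz mulaw at one byte per sample), and a streaming STT endpoint may expect larger chunks. The sketch below is one way to batch frames before forwarding them. The class name `MulawChunker` and the 100 ms default are assumptions for illustration, so confirm the accepted chunk durations in AssemblyAI's streaming documentation:

```python
SAMPLE_RATE_HZ = 8000   # from the tutorial: 8kHz mulaw
BYTES_PER_SAMPLE = 1    # mulaw encodes each sample as one byte


class MulawChunker:
    """Accumulate small mulaw frames and release fixed-duration chunks."""

    def __init__(self, chunk_ms: int = 100):
        # chunk_ms is an assumed value; check your provider's limits.
        self.chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
        self._buffer = bytearray()

    def feed(self, frame: bytes) -> list[bytes]:
        """Add one decoded frame; return zero or more full chunks."""
        self._buffer.extend(frame)
        chunks = []
        while len(self._buffer) >= self.chunk_bytes:
            chunks.append(bytes(self._buffer[: self.chunk_bytes]))
            del self._buffer[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left (possibly empty), e.g. when the call stops."""
        rest = bytes(self._buffer)
        self._buffer.clear()
        return rest
```

In pump_twilio_to_aai, you would create one chunker per call and send each chunk returned by chunker.feed(base64.b64decode(audio_b64)) instead of sending every frame as it arrives.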

Step 3: Run the LLM loop with tool calling

handle_user_turn is where the conversation logic lives. It takes the transcript, sends it to the LLM with the available tools, and either calls a tool or responds with text that becomes the agent's spoken reply.

# server.py (continued)
from openai import AsyncOpenAI

openai = AsyncOpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "e.g. AB3792"}
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the caller to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string"}
                },
                "required": ["reason"],
            },
        },
    },
]

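# Module-level history keeps the demo short, but it is shared by every call;
# a production server would keep one history per call.
conversation = [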
    {
        "role": "system",
        "content": (
            "You are a friendly phone-based voice agent for a shoe retailer. "
            "Keep replies short — one or two sentences. "
            "Use get_order_status to look up orders. "
            "Use transfer_to_human if the caller asks for a person or is upset."
        ),
    }
]

async def handle_user_turn(transcript, twilio_ws, stream_sid):
    conversation.append({"role": "user", "content": transcript})
    response = await openai.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message

    if msg.tool_calls:
        # Record the assistant's tool request, then answer each call in order
        conversation.append(msg.model_dump())
        for call in msg.tool_calls:
            result = await dispatch_tool(call.function.name, call.function.arguments)
            conversation.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        # Second pass: let the model turn the tool results into a spoken reply
        followup = await openai.chat.completions.create(
            model="gpt-4o", messages=conversation
        )
        reply = followup.choices[0].message.content
    else:
        reply = msg.content

    conversation.append({"role": "assistant", "content": reply})
    await speak(reply, twilio_ws, stream_sid)

The tool dispatcher is where your business logic lives. For a real deployment, replace the stubs with calls to your CRM, order management system, or scheduling backend.
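The snippet above calls dispatch_tool without showing it. Here is a minimal sketch of what such a dispatcher could look like, assuming in-memory stubs in place of real backends. The `FAKE_ORDERS` data, the handler bodies, and the 2-second timeout are all hypothetical placeholders, not the repository's actual implementation:

```python
import asyncio
import json

# Hypothetical in-memory data standing in for a real order system.
FAKE_ORDERS = {"AB3792": "shipped"}


async def get_order_status(order_id: str) -> str:
    status = FAKE_ORDERS.get(order_id.upper())
    return json.dumps({"order_id": order_id, "status": status or "not_found"})


async def transfer_to_human(reason: str) -> str:
    # A real version would redirect the live call; this stub only reports intent.
    return json.dumps({"transferred": True, "reason": reason})


TOOL_HANDLERS = {
    "get_order_status": get_order_status,
    "transfer_to_human": transfer_to_human,
}


async def dispatch_tool(name: str, arguments: str, timeout_s: float = 2.0) -> str:
    """Run the named tool with JSON-encoded arguments; always return a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments or "{}")
        # Bound the wait so a slow backend cannot consume the whole turn.
        return await asyncio.wait_for(handler(**kwargs), timeout=timeout_s)
    except asyncio.TimeoutError:
        return json.dumps({"error": "tool timed out"})
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": f"bad arguments: {exc}"})
```

Returning an error payload instead of raising keeps the conversation going: the LLM sees the failure as a tool result and can apologize or offer a transfer.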

Step 4: Stream the TTS audio back to Twilio as mulaw

Twilio expects audio frames as base64-encoded mulaw at 8kHz. ElevenLabs supports a ulaw_8000 output format that produces exactly this — which means no resampling, no conversion, just stream the bytes back.

# server.py (continued)
from elevenlabs.client import AsyncElevenLabs

eleven = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def speak(text, twilio_ws, stream_sid):
    audio_stream = eleven.text_to_speech.stream(
        voice_id=os.environ.get("ELEVENLABS_VOICE_ID", "EXAVITQu4vr4xnSDxMaL"),
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="ulaw_8000",
    )
    async for chunk in audio_stream:
        payload = base64.b64encode(chunk).decode()
        await twilio_ws.send_text(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload},
        }))

Each chunk gets streamed to Twilio as a media event. Twilio plays the audio to the caller as it arrives, which means the caller hears the first word of the agent's reply while the rest is still being synthesized.
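Because Twilio queues the audio you send, a caller who interrupts will keep hearing buffered speech unless you tell Twilio to drop it. Besides media, the Media Streams protocol accepts mark and clear messages from your server. The sketch below builds the three payloads as JSON strings. The helper names are hypothetical, and wiring clear into an interruption handler is left as an assumption about how you detect barge-in:

```python
import base64
import json


def media_message(stream_sid: str, mulaw_chunk: bytes) -> str:
    """Audio for Twilio to play: base64-encoded 8kHz mulaw."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_chunk).decode()},
    })


def mark_message(stream_sid: str, name: str) -> str:
    """Ask Twilio to echo this label back once preceding audio has played."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })


def clear_message(stream_sid: str) -> str:
    """Tell Twilio to discard any audio it has buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

A typical pattern is to send a mark after the last chunk of each reply so you know when playback finished, and to send clear as soon as the transcript stream shows the caller has started talking over the agent.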

Step 5: Run it and connect Twilio

Start your server and expose it through ngrok:

uvicorn server:app --port 8000

ngrok http 8000

Copy the https://*.ngrok-free.dev URL ngrok prints. In the Twilio console:

Buy or pick a Voice-enabled phone number
Open the number's configuration
Under "A call comes in," set the webhook to https://your-ngrok-url/twilio/voice with method POST
Save

Call the number from your phone. You should hear the agent pick up and respond in natural conversation.

Latency budget: where your milliseconds go

A natural-feeling phone agent answers in under 800ms from when the caller stops speaking to when the caller hears the first audio of the reply. Here's where that budget gets spent on a Twilio + AssemblyAI stack:

Stage	Typical latency
AssemblyAI end-of-turn finalization	~150–250ms
LLM first-token generation (GPT-4o)	~200–400ms
TTS first-byte (ElevenLabs streaming)	~200–400ms
Twilio round-trip	~50–100ms
Total perceived latency	~600–1100ms
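To make the arithmetic behind the table explicit, here is a small sketch that sums the per-stage ranges and compares them with the 800ms target. The numbers are copied from the table and the function name is illustrative. Note that the stage upper bounds add up to 1150ms, which the table rounds to roughly 1100ms:

```python
STAGES_MS = {
    "AssemblyAI end-of-turn finalization": (150, 250),
    "LLM first-token generation (GPT-4o)": (200, 400),
    "TTS first-byte (ElevenLabs streaming)": (200, 400),
    "Twilio round-trip": (50, 100),
}
TARGET_MS = 800


def budget_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Return (best_case_ms, worst_case_ms) by summing each stage's bounds."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


if __name__ == "__main__":
    best, worst = budget_range(STAGES_MS)
    print(f"best case:  {best}ms ({TARGET_MS - best}ms under target)")
    print(f"worst case: {worst}ms ({worst - TARGET_MS}ms over target)")
```

Only the best case fits inside the target with room to spare, which is why the stages with the widest ranges (LLM first token and TTS first byte) are the ones worth optimizing first.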

Three things blow the budget the moment you stop being careful:

Resampling audio. Anything that converts 8kHz mulaw to 16kHz PCM (and back) costs 50–150ms each way. AssemblyAI's Universal-3 Pro Streaming and ElevenLabs's ulaw_8000 output both keep audio in mulaw end-to-end.
Non-streaming LLMs. Waiting for the full response before TTS starts is a guaranteed dead zone. Stream tokens from the LLM and chunk them to TTS sentence-by-sentence.
Cold-start tools. A tool call that hits a slow database eats your entire turn. Cache hot data and put aggressive timeouts on slow lookups.
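The second point, chunking streamed LLM tokens into sentences for TTS, can be sketched with a small buffer. This is a simplified illustration rather than the repository's code: `SentenceChunker` is a hypothetical name, and the punctuation rule is naive (an abbreviation such as "Dr. Smith" would be split):

```python
import re

# Split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceChunker:
    """Collect streamed text fragments and release complete sentences."""

    def __init__(self):
        self._pending = ""

    def feed(self, token: str) -> list[str]:
        """Add one streamed fragment; return any sentences that are now complete."""
        self._pending += token
        parts = _SENTENCE_END.split(self._pending)
        # The last part may be an unfinished sentence, so keep it buffered.
        self._pending = parts.pop()
        return [p.strip() for p in parts if p.strip()]

    def flush(self) -> str:
        """Return the trailing text when the LLM stream ends."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

Each sentence returned by feed can be handed to speak immediately while the LLM is still generating the rest. A sentence is only released once the following fragment arrives with whitespace, so call flush when the stream ends to speak the final one.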

What about the AssemblyAI Voice Agent API?

If your voice agent doesn't need Twilio specifically (for example a browser-based assistant, a mobile app, or an embedded device), the Voice Agent API wraps STT, LLM, TTS, turn detection, and tool calling behind a single WebSocket at a flat $4.50/hour. You skip the three-provider plumbing entirely.

For Twilio-bridged phone calls today, the chained architecture in this tutorial is still the most flexible path — it lets you pick exactly the LLM, TTS voice, and tool definitions you want. The Voice Agent API is the right choice for everything that isn't a PSTN inbound call, and Twilio integration through the Voice Agent API is on the roadmap.

The complete repository

Fork the runnable repo at github.com/kelsey-aai/twilio-voice-agent-assemblyai. It includes the FastAPI server, tool dispatcher, sample tools (get_order_status, transfer_to_human), a .env.example, and ngrok setup instructions. Total length: ~250 lines of Python.

Frequently asked questions

How do I build a voice agent with Twilio and AssemblyAI?

To build a voice agent with Twilio and AssemblyAI, point your Twilio phone number at a TwiML endpoint that opens a <Stream> to your server's WebSocket. In the WebSocket handler, forward Twilio's 8kHz mulaw audio frames to AssemblyAI's Universal-3 Pro Streaming API using encoding=pcm_mulaw and sample_rate=8000. When AssemblyAI returns a finalized turn, pass the transcript to an LLM (such as GPT-4o or Claude) with your tool definitions (see our function calling tutorial for a deeper walkthrough), then stream the LLM's reply through a TTS model that supports ulaw_8000 output (like ElevenLabs) back to Twilio as base64-encoded media events.

Why use AssemblyAI for a Twilio voice agent?

AssemblyAI's Universal-3 Pro Streaming model is built for the audio Twilio actually sends, 8kHz mulaw, without requiring resampling, which costs latency. It delivers 307ms P50 latency, immutable transcripts your downstream LLM can trust, and 21% fewer alphanumeric errors than the previous generation, which matters when the agent is capturing confirmation codes, phone numbers, or email addresses over a phone line. For an overview of the broader category, see AI voice agents in 2026.

Does the Voice Agent API work with Twilio?

The AssemblyAI Voice Agent API is the simplest path for voice agents that don't need Twilio specifically — a single WebSocket replaces STT, LLM, and TTS at $4.50/hour. Native Twilio integration through the Voice Agent API is on the roadmap. Today, the chained architecture in this tutorial (Universal-3 Pro Streaming + your LLM + your TTS, bridged through a Twilio Media Streams WebSocket) is the standard path for Twilio-based phone agents.

What latency should I expect from a Twilio voice agent?

A well-tuned Twilio voice agent built on AssemblyAI Universal-3 Pro Streaming, GPT-4o, and ElevenLabs typically hits 600–1100ms from caller-stops-talking to caller-hears-reply. The biggest latency killers are resampling audio (use native mulaw end-to-end), non-streaming LLM responses (stream tokens), and slow tool calls (cache hot data and set aggressive timeouts).

How much does it cost to run a phone-based voice agent?

The cost breaks down across four components: Twilio voice (per-minute, varies by country), AssemblyAI Universal-3 Pro Streaming ($0.15/hour of session time), the LLM (varies by provider — typically a few cents per minute of conversation for GPT-4o), and TTS (per-character or per-minute). End-to-end you're looking at a few cents per minute at scale, with the exact number driven by which LLM and TTS you choose.
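As a rough way to plug your own rates into that breakdown, the sketch below adds the four components per minute. Only the $0.15/hour STT figure comes from this article. The other three rates are deliberately left as required arguments because they depend on your providers, and the function name is hypothetical:

```python
def per_minute_cost(
    twilio_per_min: float,
    llm_per_min: float,
    tts_per_min: float,
    stt_per_hour: float = 0.15,  # Universal-3 Pro Streaming rate quoted above
) -> float:
    """Estimated dollars per minute of conversation across all four components."""
    stt_per_min = stt_per_hour / 60
    return twilio_per_min + stt_per_min + llm_per_min + tts_per_min
```

Calling it with the other three rates set to zero isolates the speech-to-text share, which works out to a quarter of a cent per minute.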

Can a Twilio voice agent handle multiple simultaneous calls?

Yes. AssemblyAI's Universal-3 Pro Streaming supports unlimited concurrent streams at a flat $0.15/hour with no separate negotiation required. Twilio handles concurrency per-account based on your plan. The constraint at scale is usually your own server's WebSocket concurrency limits — FastAPI with uvicorn workers handles hundreds of concurrent calls comfortably on modest hardware.
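Handling simultaneous calls also means each call needs its own conversation history; a single module-level list would mix turns from different callers. A minimal sketch of per-call state, assuming you create one object inside the WebSocket handler for each connection (the `CallSession` name and its methods are hypothetical):

```python
from dataclasses import dataclass, field

SYSTEM_PROMPT = (
    "You are a friendly phone-based voice agent for a shoe retailer. "
    "Keep replies short."
)


def _new_history() -> list[dict]:
    # A fresh list per call so histories are never shared between callers.
    return [{"role": "system", "content": SYSTEM_PROMPT}]


@dataclass
class CallSession:
    """State that belongs to exactly one phone call."""

    stream_sid: str
    conversation: list[dict] = field(default_factory=_new_history)

    def add_user_turn(self, transcript: str) -> None:
        self.conversation.append({"role": "user", "content": transcript})

    def add_assistant_turn(self, reply: str) -> None:
        self.conversation.append({"role": "assistant", "content": reply})
```

The handler would build a CallSession when Twilio's start event arrives and pass it to the turn handler, so the LLM only ever sees the history of the call it is answering.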
