When you build an AI agent that handles voice calls, you quickly hit a wall: how do you get the spoken words into your AI's context in real time?
The naive path is painful. You stand up a WebSocket server, ingest raw RTP packets, decode the audio codec, buffer frames, feed them into a speech-to-text engine, manage partial vs. final transcripts, and somehow do all of this while also running your actual AI logic. Oh, and latency matters — callers don't wait 3 seconds between sentences.
This post shows how to skip all that plumbing and get real-time transcription piped directly into your AI agent using VoIPBin.
## The Problem With DIY Transcription
Let's be concrete. Here's what a "simple" DIY voice pipeline looks like:
```
Caller → SIP → RTP stream → your server
    ↓
decode Opus/PCMU
    ↓
buffer + VAD detection
    ↓
STT API (Google/AWS/Deepgram)
    ↓
handle partial transcripts
    ↓
your AI logic
    ↓
TTS → re-encode audio → RTP back
```
Each step is a failure point. Each step adds latency. And none of it is your actual product.
## What VoIPBin Does Instead
VoIPBin handles the entire media pipeline. Your AI agent only ever sees text in, text out.
The architecture flips to:
```
Caller → SIP → VoIPBin
    ↓
STT (real-time)
    ↓
webhook → your AI agent (text)
    ↓
AI response (text)
    ↓
TTS + RTP back to caller
```
Your server doesn't touch audio at all. It receives a transcript, thinks, replies with text.
## Hands-On: Getting Transcripts Into Your Agent

### Step 1: Sign Up and Get Your Token
```bash
curl -s -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{"username": "yourname", "password": "yourpass"}'
```
You get back an `accesskey.token`. No email confirmation, no waiting.
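Extracting the token then looks something like the sketch below. The exact response shape is an assumption here (a nested `accesskey` object with a `token` field, matching the `accesskey.token` naming above); verify it against the response you actually receive.

```python
# Hypothetical response shape -- check it against the real signup response.
signup_response = {"accesskey": {"token": "eyJhbGciOi..."}}

# The token goes into the Authorization header of every later API call.
TOKEN = signup_response["accesskey"]["token"]
headers = {"Authorization": f"Bearer {TOKEN}"}
```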
### Step 2: Create a Flow That Transcribes and Calls Your Webhook
VoIPBin Flows define what happens during a call. Here's a minimal flow that:
- Answers the call
- Listens and transcribes
- Sends the transcript to your agent via webhook
- Speaks the agent's reply back
```python
import httpx

TOKEN = "your-access-token"

flow = {
    "name": "AI Transcription Agent",
    "steps": [
        {"type": "answer"},
        {
            "type": "talk",
            "text": "Hello! How can I help you today?"
        },
        {
            "type": "listen",
            "timeout": 5000,
            "webhook": {
                "url": "https://your-server.com/transcript",
                "method": "POST"
            }
        }
    ]
}

res = httpx.post(
    "https://api.voipbin.net/v1.0/flows",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=flow,
)
print(res.json())
```
When a caller speaks, VoIPBin transcribes it and POSTs the result to your webhook URL. You handle the text, return a response, and VoIPBin speaks it.
### Step 3: Your AI Agent Receives the Transcript
Here's a minimal FastAPI handler that receives transcripts and responds with an AI reply:
```python
from fastapi import FastAPI, Request
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.post("/transcript")
async def handle_transcript(req: Request):
    body = await req.json()

    # VoIPBin sends the recognized text here
    user_text = body.get("transcript", "")
    call_id = body.get("call_id", "")
    print(f"[{call_id}] User said: {user_text}")

    if not user_text.strip():
        return {"action": "listen"}  # keep listening if empty

    # Run your AI
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful phone assistant. Keep responses brief."},
            {"role": "user", "content": user_text},
        ],
    )
    reply = response.choices[0].message.content

    # Return the reply — VoIPBin will speak it via TTS
    return {
        "action": "talk",
        "text": reply,
        "then": {"action": "listen"},  # loop: listen again after speaking
    }
```
That's the entire voice AI agent. No audio code. No codec handling. No streaming buffers.
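Because the contract is just JSON in, JSON out, the turn logic can be unit-tested without any telephony. A minimal sketch, with the model call stubbed out (the `action`/`then` keys mirror the handler above; `build_turn_response` is a helper introduced here for illustration):

```python
def build_turn_response(reply_text: str) -> dict:
    """Shape the JSON the webhook returns: speak the reply, then listen again."""
    return {
        "action": "talk",
        "text": reply_text,
        "then": {"action": "listen"},
    }

resp = build_turn_response("Sure, I can help with that.")
print(resp["action"])  # → talk
```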
### Step 4: Attach a Phone Number (or Use SIP Without One)
If you want callers to reach you via a real phone number:
```python
# Purchase a DID number
res = httpx.post(
    "https://api.voipbin.net/v1.0/numbers/purchase",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"country": "US", "flow_id": "<your-flow-id>"},
)

number = res.json()["number"]
print(f"Your AI agent is reachable at: {number}")
```
Or skip the number entirely with a Direct Hash SIP URI — useful for testing or internal tools:
```
sip:<your-hash>@sip.voipbin.net
```
No number needed. Dial it from any SIP client.
## What the Transcript Payload Looks Like
Here's a sample payload your webhook receives:
```json
{
  "call_id": "a1b2c3d4-...",
  "transcript": "I need to cancel my subscription",
  "confidence": 0.97,
  "language": "en-US",
  "is_final": true,
  "duration_ms": 1840
}
```
`"is_final": true` means VoIPBin has determined the speaker finished. You don't have to implement voice activity detection (VAD) yourself — it's already done.
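If your webhook ever receives interim results as well, a small guard that acts only on confident, final transcripts keeps the agent from replying mid-sentence. A sketch using the fields from the sample payload above (the 0.5 confidence floor is an arbitrary choice, not a VoIPBin default):

```python
def should_process(payload: dict, min_confidence: float = 0.5) -> bool:
    """Act only on final transcripts that clear a confidence floor."""
    if not payload.get("is_final", False):
        return False  # interim partial: wait for the finalized text
    return payload.get("confidence", 0.0) >= min_confidence

# Final, high-confidence transcript: process it.
final = {"transcript": "cancel my subscription", "is_final": True, "confidence": 0.97}
# Interim partial: ignore it.
partial = {"transcript": "cancel my", "is_final": False, "confidence": 0.8}

print(should_process(final))    # → True
print(should_process(partial))  # → False
```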
## Building a Conversation Loop
For a multi-turn conversation, keep a session store keyed by call_id:
```python
from collections import defaultdict

conversation_history = defaultdict(list)

@app.post("/transcript")
async def handle_transcript(req: Request):
    body = await req.json()
    user_text = body.get("transcript", "")
    call_id = body.get("call_id", "")

    if not user_text.strip():
        return {"action": "listen"}

    # Append to conversation history
    history = conversation_history[call_id]
    history.append({"role": "user", "content": user_text})

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful support agent."},
            *history,
        ],
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    return {
        "action": "talk",
        "text": reply,
        "then": {"action": "listen"},
    }

@app.post("/call-ended")
async def cleanup(req: Request):
    body = await req.json()
    call_id = body.get("call_id", "")

    # Log the full conversation, then clean up
    print(f"Call {call_id} ended. Turns: {len(conversation_history.get(call_id, []))}")
    conversation_history.pop(call_id, None)
    return {"ok": True}
```
Now your agent maintains context across the entire call without any extra infrastructure.
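One caveat for long calls: the in-memory history grows with every turn, and so does the token count you send to the model. A simple hedge is to cap how many messages you keep per call. A sketch (the cap of 20 is arbitrary; tune it to your model's context window):

```python
MAX_TURNS = 20  # arbitrary cap, not a VoIPBin or OpenAI limit

def trim_history(history: list[dict], max_turns: int = MAX_TURNS) -> list[dict]:
    """Keep only the most recent messages so prompts stay bounded."""
    return history[-max_turns:]

# 50 turns come in; only the latest 20 are sent to the model.
history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
trimmed = trim_history(history)

print(len(trimmed))            # → 20
print(trimmed[-1]["content"])  # → turn 49
```

Call `trim_history` on `history` right before the `client.chat.completions.create` call in the handler above.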
## Using the MCP Server (Claude Desktop / Cursor)
If you want to wire this up from Claude Desktop or Cursor without writing any boilerplate:
```bash
uvx voipbin-mcp
```
This launches the VoIPBin MCP server. Register it in Claude Desktop's `claude_desktop_config.json` and you can tell Claude to "make a call", "check transcripts", or "set up a flow" directly from the chat.
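For Claude Desktop, the entry goes under the `mcpServers` key of its config file. A sketch of what that might look like — the server name `voipbin` is an assumption, and the MCP server's own docs should confirm which environment variables (if any) it needs for your token:

```json
{
  "mcpServers": {
    "voipbin": {
      "command": "uvx",
      "args": ["voipbin-mcp"]
    }
  }
}
```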
## What You Get (and What You Don't Have to Build)
| What VoIPBin handles | What you own |
|---|---|
| RTP / audio streaming | Business logic |
| Speech-to-text | AI model choice |
| Voice activity detection | Conversation state |
| Text-to-speech | Webhook endpoint |
| SIP signaling | Deployment |
| Multi-language STT | Nothing else |
Your codebase stays small. Your team doesn't need telephony expertise.
## Try It

- Website: voipbin.net
- MCP server: `uvx voipbin-mcp`
- Go SDK: `go get github.com/voipbin/voipbin-go`
- Signup: `POST https://api.voipbin.net/v1.0/auth/signup` (token returned immediately)
If you're building any kind of voice AI — customer support bots, appointment scheduling, survey calls — the transcription pipeline is the part that will eat your sprint. Offload it.
Questions? Drop them in the comments.