Speech-to-text sounds simple until you actually build it.
You need to handle RTP packet assembly, choose the right audio codec (G.711? G.722? Opus?), manage jitter buffers, stream audio chunks to a transcription API with low enough latency that the conversation doesn't feel broken, and then pipe that text into your AI agent — all in real time, while keeping the call alive.
Most developers who try this spend weeks on audio infrastructure before writing a single line of AI logic.
There's a better path.
The Real Problem: Audio Is Hostile Territory for Most Developers
Voice calls operate at the network layer — RTP streams, SIP signaling, DTMF tones. These are protocols that telecom engineers have spent decades specializing in. Most AI developers have never touched them.
So when you want an AI agent that can listen to a caller and respond intelligently, you're suddenly responsible for:
- Capturing audio in real-time from a phone network
- Transcribing it with low latency (>500ms feels broken)
- Feeding transcription chunks into your LLM
- Generating a response and synthesizing speech
- Playing audio back without interrupting the conversation flow
Each of these steps is solvable. But doing all of them together, reliably, at scale, is a significant engineering investment.
What Media Offloading Actually Means
The key insight behind VoIPBin's architecture is this: your AI agent should never touch audio.
Instead of your agent receiving raw RTP streams, VoIPBin sits in the middle and handles:
- STT (Speech-to-Text): Audio from the caller is transcribed in real-time
- Text delivery: Your agent receives clean text via webhook
- TTS (Text-to-Speech): Your agent replies with text, VoIPBin speaks it
- RTP lifecycle: Stream setup, teardown, codec negotiation — all handled
Your AI agent becomes a pure text processor. It reads, thinks, and writes. No audio code required.
Caller → [RTP Audio] → VoIPBin → [STT] → [Webhook: text] → Your AI Agent
Caller ← [RTP Audio] ← VoIPBin ← [TTS] ← [Response: text] ← Your AI Agent
Building a Real-Time Transcription Handler
Let's build a simple AI agent that receives transcriptions and responds. We'll use Python with FastAPI.
Step 1: Sign Up and Get Your Access Token
curl -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{
    "username": "your-username",
    "password": "your-password",
    "email": "you@example.com"
  }'
You'll get an accesskey.token back immediately — no OTP, no waiting.
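If you'd rather keep setup in code, here's the same request from Python. This is a sketch that assumes accesskey.token means a nested object in the JSON response; adjust to the actual shape you get back.

# Sign up and capture the token; assumes "accesskey.token" is a nested
# object in the response body. Adjust the lookup if the real shape differs.
import httpx

resp = httpx.post(
    "https://api.voipbin.net/v1.0/auth/signup",
    json={
        "username": "your-username",
        "password": "your-password",
        "email": "you@example.com",
    },
)
resp.raise_for_status()
token = resp.json()["accesskey"]["token"]
print(token)  # export this as VOIPBIN_TOKEN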
Step 2: Configure Your Webhook Handler
VoIPBin sends call events to your server. The key event type for transcription is transcript.
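For reference, here's roughly what a transcript event carries, reconstructed from the fields the handler below reads (type, call_id, transcript). The real payload likely includes additional metadata:

{
  "type": "transcript",
  "call_id": "c3a1f2d4-example",
  "transcript": "I'd like to check my order status."
}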
from fastapi import FastAPI, Request
from openai import OpenAI
import httpx
import os

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

VOIPBIN_TOKEN = os.environ["VOIPBIN_TOKEN"]
VOIPBIN_BASE = "https://api.voipbin.net/v1.0"

# In-memory conversation context (use Redis in production)
conversation_history = {}

@app.post("/webhook")
async def handle_event(request: Request):
    event = await request.json()
    event_type = event.get("type")
    call_id = event.get("call_id")

    if event_type == "call.initiated":
        # Initialize conversation context for this call
        conversation_history[call_id] = [
            {
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses concise — under 30 words. The caller will hear your response spoken aloud."
            }
        ]
        # Greet the caller
        await speak(call_id, "Hello! How can I help you today?")

    elif event_type == "transcript":
        # This is where the magic happens
        transcript_text = event.get("transcript", "")
        if not transcript_text.strip():
            return {"status": "ok"}

        print(f"[{call_id}] Caller said: {transcript_text}")

        # Add caller's words to conversation history
        history = conversation_history.get(call_id, [])
        history.append({"role": "user", "content": transcript_text})

        # Get AI response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,
            max_tokens=100
        )
        ai_reply = response.choices[0].message.content

        # Add AI reply to history
        history.append({"role": "assistant", "content": ai_reply})
        conversation_history[call_id] = history

        print(f"[{call_id}] AI replies: {ai_reply}")

        # Speak the response back to the caller
        await speak(call_id, ai_reply)

    elif event_type == "call.ended":
        # Clean up
        conversation_history.pop(call_id, None)
        print(f"[{call_id}] Call ended")

    return {"status": "ok"}

async def speak(call_id: str, text: str):
    """Send TTS response back through the call."""
    async with httpx.AsyncClient() as http:
        await http.post(
            f"{VOIPBIN_BASE}/calls/{call_id}/actions",
            headers={
                "Authorization": f"Bearer {VOIPBIN_TOKEN}",
                "Content-Type": "application/json"
            },
            json={
                "action": "talk",
                "text": text,
                "voice": "en-US-Neural2-F"
            }
        )
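One caveat on the code above: the synchronous OpenAI client blocks FastAPI's event loop while the model is thinking, which stalls any other webhooks hitting the same worker. The openai package also ships an async client that's a near drop-in swap:

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ...then inside the transcript branch:
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=history,
    max_tokens=100
)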
Step 3: Register Your Webhook with VoIPBin
curl -X POST https://api.voipbin.net/v1.0/webhooks \
  -H "Authorization: Bearer $VOIPBIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-server.example.com/webhook",
    "events": ["call.initiated", "transcript", "call.ended"]
  }'
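You can smoke-test the handler before any real call reaches it by posting a synthetic event yourself. A sketch, assuming the server is running locally on port 8000 (note it will make a real OpenAI call and attempt a TTS action against a nonexistent call):

# Post a fake transcript event to the local webhook endpoint.
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as http:
        r = await http.post(
            "http://localhost:8000/webhook",
            json={
                "type": "transcript",
                "call_id": "test-call-1",
                "transcript": "hello there",
            },
        )
        print(r.status_code, r.json())

asyncio.run(main())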
Step 4: Create an Inbound Flow
curl -X POST https://api.voipbin.net/v1.0/flows \
  -H "Authorization: Bearer $VOIPBIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "AI Transcription Agent",
    "webhook_url": "https://your-server.example.com/webhook",
    "transcription": {
      "enabled": true,
      "language": "en-US",
      "interim_results": false
    }
  }'
Assign a number to this flow and you're live. Every call becomes a real-time transcription → AI → voice loop.
What You Can Build With Real-Time Transcription
Once your agent is receiving clean text from every call, the use cases multiply fast:
Live sentiment monitoring
Detect frustration or urgency in caller language and escalate to a human agent automatically.
def check_urgency(text: str) -> bool:
    urgent_keywords = ["cancel", "lawsuit", "refund", "urgent", "immediately", "supervisor"]
    return any(kw in text.lower() for kw in urgent_keywords)

# In your transcript handler:
if check_urgency(transcript_text):
    await transfer_to_human(call_id)  # your own escalation helper
    return {"status": "ok"}
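Keyword matching is brittle (a caller saying "I don't want to cancel" would still trip it). If you can afford one extra model call per turn, a small classifier is more robust. A sketch reusing the same OpenAI client:

def check_urgency_llm(text: str) -> bool:
    # Ask a cheap model for a binary judgment on this single turn.
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=3,
        messages=[
            {"role": "system",
             "content": "Reply with only YES or NO: does the caller sound angry, urgent, or ask for a human?"},
            {"role": "user", "content": text},
        ],
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")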
Keyword-triggered actions
Play a specific audio clip, send an SMS, or trigger a webhook when the caller mentions specific topics.
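A minimal sketch of the dispatch pattern, reusing the speak() helper from earlier (the topic table is illustrative):

# Map trigger phrases to canned spoken replies; swap speak() for an SMS
# send or an outbound webhook as needed.
TOPIC_ACTIONS = {
    "opening hours": "We are open nine to five, Monday through Friday.",
    "pricing": "Our plans start at twenty dollars per month.",
}

# In your transcript handler:
for phrase, reply in TOPIC_ACTIONS.items():
    if phrase in transcript_text.lower():
        await speak(call_id, reply)
        break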
Post-call summaries
Every transcript event is a conversation turn. After the call ends, you have the full dialogue. Run it through GPT-4 for a structured summary, action items, or CRM update.
# In handle_event, extend the call.ended branch (run this before popping the history):
elif event_type == "call.ended":
    history = conversation_history.get(call_id, [])

    # Generate call summary
    summary_response = client.chat.completions.create(
        model="gpt-4o",
        messages=history + [{
            "role": "user",
            "content": "Summarize this call in 3 bullet points. What did the caller want? Was it resolved?"
        }]
    )
    summary = summary_response.choices[0].message.content
    print(f"Call {call_id} summary:\n{summary}")

    # Save to your CRM or database
    await save_call_record(call_id, history, summary)  # your own persistence helper
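If the CRM wants machine-readable fields rather than bullet points, you can ask for JSON output instead (OpenAI's json_object mode requires the word JSON to appear in the prompt):

summary_response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=history + [{
        "role": "user",
        "content": "Return a JSON object with keys intent, resolved (boolean), and follow_up summarizing this call."
    }]
)
record = summary_response.choices[0].message.content  # a JSON string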
Compliance recording
Every word is captured as structured text. Search it, audit it, analyze it — no manual review of audio files.
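The simplest possible version is an append-only JSON Lines log written from the transcript branch. A sketch (swap the flat file for your database of choice):

import json
import time

def log_turn(call_id: str, role: str, text: str):
    # One JSON object per line: easy to grep now, easy to bulk-load later.
    with open("transcripts.jsonl", "a") as f:
        f.write(json.dumps({
            "call_id": call_id,
            "role": role,
            "text": text,
            "ts": time.time(),
        }) + "\n")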
The Latency Question
Real-time voice AI lives and dies by latency. If the gap between the caller finishing a sentence and hearing a response exceeds ~1.5 seconds, the conversation feels wrong.
The main latency components:
| Component | Typical Range |
|---|---|
| STT (end-of-speech detection) | 200–400ms |
| LLM inference (short responses) | 300–800ms |
| TTS synthesis | 100–300ms |
| Network/RTP delivery | 50–150ms |
Total: 650ms–1.65 seconds for a complete turn.
VoIPBin handles the STT and TTS ends of this pipeline. Your job is to keep the LLM call fast — use smaller models for simple responses, stream output where possible, and cache common replies.
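"Cache common replies" can start as simply as short-circuiting the LLM on exact FAQ matches, which removes the largest latency component entirely for those turns. A sketch (a real system would normalize or embed the text first):

CANNED_REPLIES = {
    "what are your hours": "We are open nine to five, Monday through Friday.",
}

def cached_reply(text: str):
    # Exact-match lookup after light normalization; returns None on a miss.
    return CANNED_REPLIES.get(text.lower().strip().rstrip("?.!"))

# In the transcript branch, before calling the LLM:
if (reply := cached_reply(transcript_text)) is not None:
    await speak(call_id, reply)
    return {"status": "ok"}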
Running It Locally
pip install fastapi uvicorn openai httpx

# Set your env vars
export VOIPBIN_TOKEN="your-token-here"
export OPENAI_API_KEY="sk-..."

# Start the server
uvicorn main:app --reload

# Expose it with ngrok for testing (in a second terminal)
ngrok http 8000
Update your VoIPBin webhook URL to point at your ngrok URL and you can test with a real phone call immediately.
What You're Not Building
This is worth pausing on. By using VoIPBin's media offloading for transcription, you've skipped:
- RTP server implementation
- Codec handling (G.711 μ-law, a-law, G.722, Opus)
- Jitter buffer management
- Audio chunking and streaming to STT APIs
- VAD (Voice Activity Detection) for end-of-speech detection
- TTS audio streaming back into RTP
That's typically 4–8 weeks of specialized work for an experienced VoIP engineer. You built the same capability in an afternoon.
Try It
- Sign up: voipbin.net (free trial, no credit card)
- API docs: https://api.voipbin.net/v1.0
- MCP Server: uvx voipbin-mcp (use VoIPBin directly from Claude Code or Cursor)
- Go SDK: go get github.com/voipbin/voipbin-go
If you're building AI agents that need to handle real phone conversations, the audio layer shouldn't be the hard part. Let the infrastructure handle the RTP. You handle the intelligence.
Have questions about integrating real-time transcription into your AI agent? Drop them in the comments.