Speech-to-text sounds simple until you actually build it.
You need to handle RTP packet assembly, choose the right audio codec (G.711? G.722? Opus?), manage jitter buffers, stream audio chunks to a transcription API with low enough latency that the conversation doesn't feel broken, and then pipe that text into your AI agent — all in real time, while keeping the call alive.
Most developers who try this spend weeks on audio infrastructure before writing a single line of AI logic.
There's a better path.
The Real Problem: Audio Is Hostile Territory for Most Developers
Voice calls operate at the network layer — RTP streams, SIP signaling, DTMF tones. These are protocols that telecom engineers have spent decades specializing in. Most AI developers have never touched them.
So when you want an AI agent that can listen to a caller and respond intelligently, you're suddenly responsible for:
- Capturing audio in real-time from a phone network
- Transcribing it with low latency (>500ms feels broken)
- Feeding transcription chunks into your LLM
- Generating a response and synthesizing speech
- Playing audio back without interrupting the conversation flow
Each of these steps is solvable. But doing all of them together, reliably, at scale, is a significant engineering investment.
What Media Offloading Actually Means
The key insight behind VoIPBin's architecture is this: your AI agent should never touch audio.
Instead of your agent receiving raw RTP streams, VoIPBin sits in the middle and handles:
- STT (Speech-to-Text): Audio from the caller is transcribed in real-time
- Text delivery: Your agent receives clean text via webhook
- TTS (Text-to-Speech): Your agent replies with text, VoIPBin speaks it
- RTP lifecycle: Stream setup, teardown, codec negotiation — all handled
Your AI agent becomes a pure text processor. It reads, thinks, and writes. No audio code required.
Caller → [RTP Audio] → VoIPBin → [STT] → [Webhook: text] → Your AI Agent
Caller ← [RTP Audio] ← VoIPBin ← [TTS] ← [Response: text] ← Your AI Agent
Building a Real-Time Transcription Handler
Let's build a simple AI agent that receives transcriptions and responds. We'll use Python with FastAPI.
Step 1: Sign Up and Get Your Access Token
curl -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{
    "username": "your-username",
    "password": "your-password",
    "email": "you@example.com"
  }'
You'll get an accesskey.token back immediately — no OTP, no waiting.
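If you'd rather keep setup in code, here's the same request from Python. This is a sketch that assumes accesskey.token means a nested object in the JSON response; adjust to the actual shape you get back.

# Sign up and capture the token; assumes "accesskey.token" is a nested
# object in the response body. Adjust the lookup if the real shape differs.
import httpx

resp = httpx.post(
    "https://api.voipbin.net/v1.0/auth/signup",
    json={
        "username": "your-username",
        "password": "your-password",
        "email": "you@example.com",
    },
)
resp.raise_for_status()
token = resp.json()["accesskey"]["token"]
print(token)  # export this as VOIPBIN_TOKEN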
Step 2: Configure Your Webhook Handler
VoIPBin sends call events to your server. The key event type for transcription is transcript.
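For reference, here's roughly what a transcript event carries, reconstructed from the fields the handler below reads (type, call_id, transcript). The real payload likely includes additional metadata:

{
  "type": "transcript",
  "call_id": "c3a1f2d4-example",
  "transcript": "I'd like to check my order status."
}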
from fastapi import FastAPI, Request
from openai import OpenAI
import httpx
import os

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

VOIPBIN_TOKEN = os.environ["VOIPBIN_TOKEN"]
VOIPBIN_BASE = "https://api.voipbin.net/v1.0"

# In-memory conversation context (use Redis in production)
conversation_history = {}

@app.post("/webhook")
async def handle_event(request: Request):
    event = await request.json()
    event_type = event.get("type")
    call_id = event.get("call_id")

    if event_type == "call.initiated":
        # Initialize conversation context for this call
        conversation_history[call_id] = [
            {
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses concise — under 30 words. The caller will hear your response spoken aloud."
            }
        ]
        # Greet the caller
        await speak(call_id, "Hello! How can I help you today?")

    elif event_type == "transcript":
        # This is where the magic happens
        transcript_text = event.get("transcript", "")
        if not transcript_text.strip():
            return {"status": "ok"}

        print(f"[{call_id}] Caller said: {transcript_text}")

        # Add caller's words to conversation history
        history = conversation_history.get(call_id, [])
        history.append({"role": "user", "content": transcript_text})

        # Get AI response
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=history,
            max_tokens=100
        )
        ai_reply = response.choices[0].message.content

        # Add AI reply to history
        history.append({"role": "assistant", "content": ai_reply})
        conversation_history[call_id] = history

        print(f"[{call_id}] AI replies: {ai_reply}")

        # Speak the response back to the caller
        await speak(call_id, ai_reply)

    elif event_type == "call.ended":
        # Clean up
        conversation_history.pop(call_id, None)
        print(f"[{call_id}] Call ended")

    return {"status": "ok"}

async def speak(call_id: str, text: str):
    """Send TTS response back through the call."""
    async with httpx.AsyncClient() as http:
        await http.post(
            f"{VOIPBIN_BASE}/calls/{call_id}/actions",
            headers={
                "Authorization": f"Bearer {VOIPBIN_TOKEN}",
                "Content-Type": "application/json"
            },
            json={
                "action": "talk",
                "text": text,
                "voice": "en-US-Neural2-F"
            }
        )
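One caveat on the code above: the synchronous OpenAI client blocks FastAPI's event loop while the model is thinking, which stalls any other webhooks hitting the same worker. The openai package also ships an async client that's a near drop-in swap:

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ...then inside the transcript branch:
response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=history,
    max_tokens=100
)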
Step 3: Register Your Webhook with VoIPBin
curl -X POST https://api.voipbin.net/v1.0/webhooks \
  -H "Authorization: Bearer $VOIPBIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-server.example.com/webhook",
    "events": ["call.initiated", "transcript", "call.ended"]
  }'
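You can smoke-test the handler before any real call reaches it by posting a synthetic event yourself. A sketch, assuming the server is running locally on port 8000 (note it will make a real OpenAI call and attempt a TTS action against a nonexistent call):

# Post a fake transcript event to the local webhook endpoint.
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as http:
        r = await http.post(
            "http://localhost:8000/webhook",
            json={
                "type": "transcript",
                "call_id": "test-call-1",
                "transcript": "hello there",
            },
        )
        print(r.status_code, r.json())

asyncio.run(main())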
Step 4: Create an Inbound Flow
curl -X POST https://api.voipbin.net/v1.0/flows \
  -H "Authorization: Bearer $VOIPBIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "AI Transcription Agent",
    "webhook_url": "https://your-server.example.com/webhook",
    "transcription": {
      "enabled": true,
      "language": "en-US",
      "interim_results": false
    }
  }'
Assign a number to this flow and you're live. Every call becomes a real-time transcription → AI → voice loop.
What You Can Build With Real-Time Transcription
Once your agent is receiving clean text from every call, the use cases multiply fast:
Live sentiment monitoring
Detect frustration or urgency in caller language and escalate to a human agent automatically.
def check_urgency(text: str) -> bool:
    urgent_keywords = ["cancel", "lawsuit", "refund", "urgent", "immediately", "supervisor"]
    return any(kw in text.lower() for kw in urgent_keywords)

# In your transcript handler:
if check_urgency(transcript_text):
    await transfer_to_human(call_id)  # your own escalation helper
    return {"status": "ok"}
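Keyword matching is brittle (a caller saying "I don't want to cancel" would still trip it). If you can afford one extra model call per turn, a small classifier is more robust. A sketch reusing the same OpenAI client:

def check_urgency_llm(text: str) -> bool:
    # Ask a cheap model for a binary judgment on this single turn.
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=3,
        messages=[
            {"role": "system",
             "content": "Reply with only YES or NO: does the caller sound angry, urgent, or ask for a human?"},
            {"role": "user", "content": text},
        ],
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")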
Keyword-triggered actions
Play a specific audio clip, send an SMS, or trigger a webhook when the caller mentions specific topics.
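A minimal sketch of the dispatch pattern, reusing the speak() helper from earlier (the topic table is illustrative):

# Map trigger phrases to canned spoken replies; swap speak() for an SMS
# send or an outbound webhook as needed.
TOPIC_ACTIONS = {
    "opening hours": "We are open nine to five, Monday through Friday.",
    "pricing": "Our plans start at twenty dollars per month.",
}

# In your transcript handler:
for phrase, reply in TOPIC_ACTIONS.items():
    if phrase in transcript_text.lower():
        await speak(call_id, reply)
        break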
Post-call summaries
Every transcript event is a conversation turn. After the call ends, you have the full dialogue. Run it through GPT-4 for a structured summary, action items, or CRM update.
# In handle_event, extend the call.ended branch (run this before popping the history):
elif event_type == "call.ended":
    history = conversation_history.get(call_id, [])

    # Generate call summary
    summary_response = client.chat.completions.create(
        model="gpt-4o",
        messages=history + [{
            "role": "user",
            "content": "Summarize this call in 3 bullet points. What did the caller want? Was it resolved?"
        }]
    )
    summary = summary_response.choices[0].message.content
    print(f"Call {call_id} summary:\n{summary}")

    # Save to your CRM or database
    await save_call_record(call_id, history, summary)  # your own persistence helper
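If the CRM wants machine-readable fields rather than bullet points, you can ask for JSON output instead (OpenAI's json_object mode requires the word JSON to appear in the prompt):

summary_response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=history + [{
        "role": "user",
        "content": "Return a JSON object with keys intent, resolved (boolean), and follow_up summarizing this call."
    }]
)
record = summary_response.choices[0].message.content  # a JSON string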
Compliance recording
Every word is captured as structured text. Search it, audit it, analyze it — no manual review of audio files.
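The simplest possible version is an append-only JSON Lines log written from the transcript branch. A sketch (swap the flat file for your database of choice):

import json
import time

def log_turn(call_id: str, role: str, text: str):
    # One JSON object per line: easy to grep now, easy to bulk-load later.
    with open("transcripts.jsonl", "a") as f:
        f.write(json.dumps({
            "call_id": call_id,
            "role": role,
            "text": text,
            "ts": time.time(),
        }) + "\n")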
The Latency Question
Real-time voice AI lives and dies by latency. If the gap between the caller finishing a sentence and hearing a response exceeds ~1.5 seconds, the conversation feels wrong.
The main latency components:
| Component | Typical Range |
|---|---|
| STT (end-of-speech detection) | 200–400ms |
| LLM inference (short responses) | 300–800ms |
| TTS synthesis | 100–300ms |
| Network/RTP delivery | 50–150ms |
Total: 650ms–1.65 seconds for a complete turn.
VoIPBin handles the STT and TTS ends of this pipeline. Your job is to keep the LLM call fast — use smaller models for simple responses, stream output where possible, and cache common replies.
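"Cache common replies" can start as simply as short-circuiting the LLM on exact FAQ matches, which removes the largest latency component entirely for those turns. A sketch (a real system would normalize or embed the text first):

CANNED_REPLIES = {
    "what are your hours": "We are open nine to five, Monday through Friday.",
}

def cached_reply(text: str):
    # Exact-match lookup after light normalization; returns None on a miss.
    return CANNED_REPLIES.get(text.lower().strip().rstrip("?.!"))

# In the transcript branch, before calling the LLM:
if (reply := cached_reply(transcript_text)) is not None:
    await speak(call_id, reply)
    return {"status": "ok"}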
Running It Locally
pip install fastapi uvicorn openai httpx

# Set your env vars
export VOIPBIN_TOKEN="your-token-here"
export OPENAI_API_KEY="sk-..."

# Start the server
uvicorn main:app --reload

# Expose it with ngrok for testing (in a second terminal)
ngrok http 8000
Update your VoIPBin webhook URL to point at your ngrok URL and you can test with a real phone call immediately.
What You're Not Building
This is worth pausing on. By using VoIPBin's media offloading for transcription, you've skipped:
- RTP server implementation
- Codec handling (G.711 μ-law, a-law, G.722, Opus)
- Jitter buffer management
- Audio chunking and streaming to STT APIs
- VAD (Voice Activity Detection) for end-of-speech detection
- TTS audio streaming back into RTP
That's typically 4–8 weeks of specialized work for an experienced VoIP engineer. You built the same capability in an afternoon.
Try It
- Sign up: voipbin.net (free trial, no credit card)
- API docs: https://api.voipbin.net/v1.0
- MCP Server: uvx voipbin-mcp (use VoIPBin directly from Claude Code or Cursor)
- Go SDK: go get github.com/voipbin/voipbin-go
If you're building AI agents that need to handle real phone conversations, the audio layer shouldn't be the hard part. Let the infrastructure handle the RTP. You handle the intelligence.
Have questions about integrating real-time transcription into your AI agent? Drop them in the comments.