When you build an AI agent that handles voice calls, you quickly hit a wall: how do you get the spoken words into your AI's context in real time?
The naive path is painful. You stand up a WebSocket server, ingest raw RTP packets, decode the audio codec, buffer frames, feed them into a speech-to-text engine, manage partial vs. final transcripts, and somehow do all of this while also running your actual AI logic. Oh, and latency matters — callers don't wait 3 seconds between sentences.
This post shows how to skip all that plumbing and get real-time transcription piped directly into your AI agent using VoIPBin.
## The Problem With DIY Transcription
Let's be concrete. Here's what a "simple" DIY voice pipeline looks like:
```
Caller → SIP → RTP stream → your server
    ↓
decode Opus/PCMU
    ↓
buffer + VAD detection
    ↓
STT API (Google/AWS/Deepgram)
    ↓
handle partial transcripts
    ↓
your AI logic
    ↓
TTS → re-encode audio → RTP back
```
Each step is a failure point. Each step adds latency. And none of it is your actual product.
## What VoIPBin Does Instead
VoIPBin handles the entire media pipeline. Your AI agent only ever sees text in, text out.
The architecture flips to:
```
Caller → SIP → VoIPBin
    ↓
STT (real-time)
    ↓
webhook → your AI agent (text)
    ↓
AI response (text)
    ↓
TTS + RTP back to caller
```
Your server doesn't touch audio at all. It receives a transcript, thinks, replies with text.
## Hands-On: Getting Transcripts Into Your Agent

### Step 1: Sign Up and Get Your Token
```bash
curl -s -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{"username": "yourname", "password": "yourpass"}'
```
You get back an `accesskey.token`. No email confirmation, no waiting.
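Extracting the token then looks something like the sketch below. The exact response shape is an assumption here (a nested `accesskey` object with a `token` field, matching the `accesskey.token` naming above); verify it against the response you actually receive.

```python
# Hypothetical response shape -- check it against the real signup response.
signup_response = {"accesskey": {"token": "eyJhbGciOi..."}}

# The token goes into the Authorization header of every later API call.
TOKEN = signup_response["accesskey"]["token"]
headers = {"Authorization": f"Bearer {TOKEN}"}
```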
### Step 2: Create a Flow That Transcribes and Calls Your Webhook
VoIPBin Flows define what happens during a call. Here's a minimal flow that:
- Answers the call
- Listens and transcribes
- Sends the transcript to your agent via webhook
- Speaks the agent's reply back
```python
import httpx

TOKEN = "your-access-token"

flow = {
    "name": "AI Transcription Agent",
    "steps": [
        {"type": "answer"},
        {
            "type": "talk",
            "text": "Hello! How can I help you today?"
        },
        {
            "type": "listen",
            "timeout": 5000,
            "webhook": {
                "url": "https://your-server.com/transcript",
                "method": "POST"
            }
        }
    ]
}

res = httpx.post(
    "https://api.voipbin.net/v1.0/flows",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=flow,
)
print(res.json())
```
When a caller speaks, VoIPBin transcribes it and POSTs the result to your webhook URL. You handle the text, return a response, and VoIPBin speaks it.
### Step 3: Your AI Agent Receives the Transcript
Here's a minimal FastAPI handler that receives transcripts and responds with an AI reply:
```python
from fastapi import FastAPI, Request
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.post("/transcript")
async def handle_transcript(req: Request):
    body = await req.json()

    # VoIPBin sends the recognized text here
    user_text = body.get("transcript", "")
    call_id = body.get("call_id", "")
    print(f"[{call_id}] User said: {user_text}")

    if not user_text.strip():
        return {"action": "listen"}  # keep listening if empty

    # Run your AI
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful phone assistant. Keep responses brief."},
            {"role": "user", "content": user_text},
        ],
    )
    reply = response.choices[0].message.content

    # Return the reply — VoIPBin will speak it via TTS
    return {
        "action": "talk",
        "text": reply,
        "then": {"action": "listen"},  # loop: listen again after speaking
    }
```
That's the entire voice AI agent. No audio code. No codec handling. No streaming buffers.
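Because the contract is just JSON in, JSON out, the turn logic can be unit-tested without any telephony. A minimal sketch, with the model call stubbed out (the `action`/`then` keys mirror the handler above; `build_turn_response` is a helper introduced here for illustration):

```python
def build_turn_response(reply_text: str) -> dict:
    """Shape the JSON the webhook returns: speak the reply, then listen again."""
    return {
        "action": "talk",
        "text": reply_text,
        "then": {"action": "listen"},
    }

resp = build_turn_response("Sure, I can help with that.")
print(resp["action"])  # → talk
```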
### Step 4: Attach a Phone Number (or Use SIP Without One)
If you want callers to reach you via a real phone number:
```python
# Purchase a DID number
res = httpx.post(
    "https://api.voipbin.net/v1.0/numbers/purchase",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"country": "US", "flow_id": "<your-flow-id>"},
)

number = res.json()["number"]
print(f"Your AI agent is reachable at: {number}")
```
Or skip the number entirely with a Direct Hash SIP URI — useful for testing or internal tools:
```
sip:<your-hash>@sip.voipbin.net
```
No number needed. Dial it from any SIP client.
## What the Transcript Payload Looks Like
Here's a sample payload your webhook receives:
```json
{
  "call_id": "a1b2c3d4-...",
  "transcript": "I need to cancel my subscription",
  "confidence": 0.97,
  "language": "en-US",
  "is_final": true,
  "duration_ms": 1840
}
```
`"is_final": true` means VoIPBin has determined the speaker finished. You don't have to implement voice activity detection (VAD) yourself — it's already done.
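If your webhook ever receives interim results as well, a small guard that acts only on confident, final transcripts keeps the agent from replying mid-sentence. A sketch using the fields from the sample payload above (the 0.5 confidence floor is an arbitrary choice, not a VoIPBin default):

```python
def should_process(payload: dict, min_confidence: float = 0.5) -> bool:
    """Act only on final transcripts that clear a confidence floor."""
    if not payload.get("is_final", False):
        return False  # interim partial: wait for the finalized text
    return payload.get("confidence", 0.0) >= min_confidence

# Final, high-confidence transcript: process it.
final = {"transcript": "cancel my subscription", "is_final": True, "confidence": 0.97}
# Interim partial: ignore it.
partial = {"transcript": "cancel my", "is_final": False, "confidence": 0.8}

print(should_process(final))    # → True
print(should_process(partial))  # → False
```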
## Building a Conversation Loop
For a multi-turn conversation, keep a session store keyed by call_id:
```python
from collections import defaultdict

conversation_history = defaultdict(list)

@app.post("/transcript")
async def handle_transcript(req: Request):
    body = await req.json()
    user_text = body.get("transcript", "")
    call_id = body.get("call_id", "")

    if not user_text.strip():
        return {"action": "listen"}

    # Append to conversation history
    history = conversation_history[call_id]
    history.append({"role": "user", "content": user_text})

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful support agent."},
            *history,
        ],
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    return {
        "action": "talk",
        "text": reply,
        "then": {"action": "listen"},
    }

@app.post("/call-ended")
async def cleanup(req: Request):
    body = await req.json()
    call_id = body.get("call_id", "")

    # Log the full conversation, then clean up
    print(f"Call {call_id} ended. Turns: {len(conversation_history.get(call_id, []))}")
    conversation_history.pop(call_id, None)
    return {"ok": True}
```
Now your agent maintains context across the entire call without any extra infrastructure.
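One caveat for long calls: the in-memory history grows with every turn, and so does the token count you send to the model. A simple hedge is to cap how many messages you keep per call. A sketch (the cap of 20 is arbitrary; tune it to your model's context window):

```python
MAX_TURNS = 20  # arbitrary cap, not a VoIPBin or OpenAI limit

def trim_history(history: list[dict], max_turns: int = MAX_TURNS) -> list[dict]:
    """Keep only the most recent messages so prompts stay bounded."""
    return history[-max_turns:]

# 50 turns come in; only the latest 20 are sent to the model.
history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
trimmed = trim_history(history)

print(len(trimmed))            # → 20
print(trimmed[-1]["content"])  # → turn 49
```

Call `trim_history` on `history` right before the `client.chat.completions.create` call in the handler above.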
## Using the MCP Server (Claude Desktop / Cursor)
If you want to wire this up from Claude Desktop or Cursor without writing any boilerplate:
```bash
uvx voipbin-mcp
```
This launches the VoIPBin MCP server. Register it in Claude Desktop's `claude_desktop_config.json` and you can tell Claude to "make a call", "check transcripts", or "set up a flow" directly from the chat.
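For Claude Desktop, the entry goes under the `mcpServers` key of its config file. A sketch of what that might look like — the server name `voipbin` is an assumption, and the MCP server's own docs should confirm which environment variables (if any) it needs for your token:

```json
{
  "mcpServers": {
    "voipbin": {
      "command": "uvx",
      "args": ["voipbin-mcp"]
    }
  }
}
```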
## What You Get (and What You Don't Have to Build)
| What VoIPBin handles | What you own |
|---|---|
| RTP / audio streaming | Business logic |
| Speech-to-text | AI model choice |
| Voice activity detection | Conversation state |
| Text-to-speech | Webhook endpoint |
| SIP signaling | Deployment |
| Multi-language STT | Nothing else |
Your codebase stays small. Your team doesn't need telephony expertise.
## Try It

- Website: voipbin.net
- MCP server: `uvx voipbin-mcp`
- Go SDK: `go get github.com/voipbin/voipbin-go`
- Signup: `POST https://api.voipbin.net/v1.0/auth/signup` (token returned immediately)
If you're building any kind of voice AI — customer support bots, appointment scheduling, survey calls — the transcription pipeline is the part that will eat your sprint. Offload it.
Questions? Drop them in the comments.