
voipbin

Real-Time Voice Transcription for Your AI Agent — Without the Plumbing

When you build an AI agent that handles voice calls, you quickly hit a wall: how do you get the spoken words into your AI's context in real time?

The naive path is painful. You stand up a WebSocket server, ingest raw RTP packets, decode the audio codec, buffer frames, feed them into a speech-to-text engine, manage partial vs. final transcripts, and somehow do all of this while also running your actual AI logic. Oh, and latency matters — callers don't wait 3 seconds between sentences.

This post shows how to skip all that plumbing and get real-time transcription piped directly into your AI agent using VoIPBin.


The Problem With DIY Transcription

Let's be concrete. Here's what a "simple" DIY voice pipeline looks like:

Caller → SIP → RTP stream → your server
                               ↓
                         decode opus/PCMU
                               ↓
                     buffer + VAD detection
                               ↓
                       STT API (Google/AWS/Deepgram)
                               ↓
                    handle partial transcripts
                               ↓
                         your AI logic
                               ↓
                    TTS → re-encode audio → RTP back

Each step is a failure point. Each step adds latency. And none of it is your actual product.
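To see how the latency compounds, here is a back-of-envelope budget for that pipeline. Every number below is an illustrative guess, not a measurement — real figures depend on your codec, STT vendor, and model:

```python
# Rough per-stage latency budget for the DIY pipeline.
# All numbers are illustrative guesses, not measurements.
stages_ms = {
    "network + jitter buffer": 60,
    "codec decode": 5,
    "VAD endpointing wait": 300,   # waiting to be sure the caller stopped
    "STT round trip": 250,
    "LLM response": 800,
    "TTS synthesis": 300,
    "re-encode + RTP out": 20,
}

total = sum(stages_ms.values())
print(f"~{total} ms before the caller hears a reply")  # ~1735 ms
```

Even with generous assumptions, you are well past a second of dead air per turn — and that's before retries, cold starts, or a slow STT region.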


What VoIPBin Does Instead

VoIPBin handles the entire media pipeline. Your AI agent only ever sees text in, text out.

The architecture flips to:

Caller → SIP → VoIPBin
                  ↓
           STT (real-time)
                  ↓
        webhook → your AI agent (text)
                  ↓
        AI response (text)
                  ↓
           TTS + RTP back to caller

Your server doesn't touch audio at all. It receives a transcript, thinks, replies with text.


Hands-On: Getting Transcripts Into Your Agent

Step 1: Sign Up and Get Your Token

curl -s -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{"username": "yourname", "password": "yourpass"}'

You get back an accesskey.token. No email confirmation, no waiting.
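A tiny sketch of putting that token to work. The response shape here (`accesskey` containing `token`) follows the description above and is an assumption — check the actual signup response before relying on it:

```python
# Sketch: pull the token out of the signup response and build the
# Authorization header used by every later API call.
# The response shape is assumed from the post, not verified.

def extract_token(signup_response: dict) -> str:
    return signup_response["accesskey"]["token"]

def auth_headers(token: str) -> dict:
    return {"Authorization": f"Bearer {token}"}

# Example with a made-up response body:
sample = {"accesskey": {"token": "abc123"}}
print(auth_headers(extract_token(sample)))
# {'Authorization': 'Bearer abc123'}
```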


Step 2: Create a Flow That Transcribes and Calls Your Webhook

VoIPBin Flows define what happens during a call. Here's a minimal flow that:

  1. Answers the call
  2. Listens and transcribes
  3. Sends the transcript to your agent via webhook
  4. Speaks the agent's reply back

import httpx

TOKEN = "your-access-token"

flow = {
    "name": "AI Transcription Agent",
    "steps": [
        {
            "type": "answer"
        },
        {
            "type": "talk",
            "text": "Hello! How can I help you today?"
        },
        {
            "type": "listen",
            "timeout": 5000,
            "webhook": {
                "url": "https://your-server.com/transcript",
                "method": "POST"
            }
        }
    ]
}

res = httpx.post(
    "https://api.voipbin.net/v1.0/flows",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=flow
)
print(res.json())

When a caller speaks, VoIPBin transcribes it and POSTs the result to your webhook URL. You handle the text, return a response, and VoIPBin speaks it.


Step 3: Your AI Agent Receives the Transcript

Here's a minimal FastAPI handler that receives transcripts and responds with an AI reply:

from fastapi import FastAPI, Request
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

@app.post("/transcript")
async def handle_transcript(req: Request):
    body = await req.json()

    # VoIPBin sends the recognized text here
    user_text = body.get("transcript", "")
    call_id = body.get("call_id", "")

    print(f"[{call_id}] User said: {user_text}")

    if not user_text.strip():
        return {"action": "listen"}  # keep listening if empty

    # Run your AI
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful phone assistant. Keep responses brief."},
            {"role": "user", "content": user_text}
        ]
    )

    reply = response.choices[0].message.content

    # Return the reply — VoIPBin will speak it via TTS
    return {
        "action": "talk",
        "text": reply,
        "then": {"action": "listen"}  # loop: listen again after speaking
    }

That's the entire voice AI agent. No audio code. No codec handling. No streaming buffers.


Step 4: Attach a Phone Number (or Use SIP Without One)

If you want callers to reach you via a real phone number:

# Purchase a DID number
res = httpx.post(
    "https://api.voipbin.net/v1.0/numbers/purchase",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"country": "US", "flow_id": "<your-flow-id>"}
)
number = res.json()["number"]
print(f"Your AI agent is reachable at: {number}")

Or skip the number entirely with a Direct Hash SIP URI — useful for testing or internal tools:

sip:<your-hash>@sip.voipbin.net

No number needed. Dial it from any SIP client.


What the Transcript Payload Looks Like

Here's a sample payload your webhook receives:

{
  "call_id": "a1b2c3d4-...",
  "transcript": "I need to cancel my subscription",
  "confidence": 0.97,
  "language": "en-US",
  "is_final": true,
  "duration_ms": 1840
}

is_final: true means VoIPBin has determined the speaker finished. You don't have to implement voice activity detection (VAD) yourself — it's already done.
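In practice you'll want a small gate in your webhook handler that acts only on usable results. A minimal sketch, using the field names from the sample payload above (the 0.6 confidence threshold is an arbitrary illustration):

```python
# Sketch: respond only to final, non-empty transcripts with usable
# confidence. Field names follow the sample payload above.

def should_respond(payload: dict, min_confidence: float = 0.6) -> bool:
    if not payload.get("is_final", False):
        return False  # interim result; the engine may still revise it
    if not payload.get("transcript", "").strip():
        return False  # silence or empty recognition
    return payload.get("confidence", 1.0) >= min_confidence

print(should_respond({"transcript": "cancel my subscription",
                      "is_final": True, "confidence": 0.97}))   # True
print(should_respond({"transcript": "cancel my",
                      "is_final": False, "confidence": 0.41}))  # False
```

Low-confidence finals are a good place to have the agent ask the caller to repeat rather than guess.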


Building a Conversation Loop

For a multi-turn conversation, keep a session store keyed by call_id:

from collections import defaultdict

conversation_history = defaultdict(list)

@app.post("/transcript")
async def handle_transcript(req: Request):
    body = await req.json()
    user_text = body.get("transcript", "")
    call_id = body.get("call_id", "")

    if not user_text.strip():
        return {"action": "listen"}

    # Append to conversation history
    history = conversation_history[call_id]
    history.append({"role": "user", "content": user_text})

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful support agent."},
            *history
        ]
    )

    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    return {
        "action": "talk",
        "text": reply,
        "then": {"action": "listen"}
    }

@app.post("/call-ended")
async def cleanup(req: Request):
    body = await req.json()
    call_id = body.get("call_id", "")
    # Log the full conversation, then clean up
    print(f"Call {call_id} ended. Turns: {len(conversation_history.get(call_id, []))}")
    conversation_history.pop(call_id, None)
    return {"ok": True}

Now your agent maintains context across the entire call without any extra infrastructure.
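Two caveats worth noting: an in-memory defaultdict won't survive a process restart (use Redis or similar if that matters), and a long call can grow the history past the model's context window. A simple safeguard for the latter is to cap the history — the turn limit here is an arbitrary illustration; in production you might count tokens instead of messages:

```python
# Sketch: cap per-call history so a long call can't overflow the
# model's context window. MAX_TURNS is an arbitrary illustration.

MAX_TURNS = 20  # most recent messages kept (10 user/assistant exchanges)

def trimmed(history: list[dict]) -> list[dict]:
    """Keep only the newest messages, dropping the oldest first."""
    return history[-MAX_TURNS:]

history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
print(len(trimmed(history)))           # 20
print(trimmed(history)[0]["content"])  # turn 30
```

Pass `trimmed(history)` instead of `history` in the `messages` list above and the system prompt always stays in place while old turns fall off.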


Using the MCP Server (Claude Desktop / Cursor)

If you want to wire this up from Claude Desktop or Cursor without writing any boilerplate:

uvx voipbin-mcp

This launches the VoIPBin MCP server. Configure it in Claude Desktop's claude_desktop_config.json and you can tell Claude to "make a call", "check transcripts", or "set up a flow" directly from the chat.
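A minimal sketch of that config entry, assuming the standard Claude Desktop MCP server shape (the "voipbin" key is just a label you choose):

```json
{
  "mcpServers": {
    "voipbin": {
      "command": "uvx",
      "args": ["voipbin-mcp"]
    }
  }
}
```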


What You Get (and What You Don't Have to Build)

What VoIPBin handles      | What you own
--------------------------|--------------------
RTP / audio streaming     | Business logic
Speech-to-text            | AI model choice
Voice activity detection  | Conversation state
Text-to-speech            | Webhook endpoint
SIP signaling             | Deployment
Multi-language STT        | Nothing else

Your codebase stays small. Your team doesn't need telephony expertise.


Try It

  • Website: voipbin.net
  • MCP Server: uvx voipbin-mcp
  • Go SDK: go get github.com/voipbin/voipbin-go
  • Signup: POST https://api.voipbin.net/v1.0/auth/signup — token returned immediately

If you're building any kind of voice AI — customer support bots, appointment scheduling, survey calls — the transcription pipeline is the part that will eat your sprint. Offload it.

Questions? Drop them in the comments.
