If you have ever tried to build a real AI voice agent — not a demo, but something that actually picks up phone calls — you know the moment everything stops being fun.
It is not the AI logic. That part is usually clean. It is everything underneath: RTP streams, jitter buffers, DTMF detection, SRTP negotiation, codec transcoding, VAD (voice activity detection), SIP signaling. Suddenly your elegant AI agent is buried under thousands of lines of audio plumbing that have nothing to do with the problem you set out to solve.
This post is about how to escape that trap.
The Real Problem: Audio Infrastructure Is a Full-Time Job
Let me show you what typical DIY voice-AI stacks look like under the hood:
Phone Call
→ SIP INVITE (handle SIP stack)
→ RTP stream (decode G.711 / G.729 / Opus)
→ Jitter buffer management
→ VAD (detect speech vs silence)
→ Chunked audio → STT API
→ AI inference (finally, the fun part)
→ TTS API → audio bytes
→ Encode audio back to RTP
→ Send RTP packets with correct timestamps
→ Handle DTMF events in parallel
→ Deal with packet loss, reordering, clock drift
Every one of those steps is an opportunity to introduce bugs, latency spikes, or dropped audio. And none of it is your core product.
Worse, scaling it is a nightmare. RTP is stateful and UDP-based. You cannot just throw it behind a load balancer and call it a day.
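To make that concrete, here is a rough sketch of just one box in that diagram: a naive energy-based VAD over 20 ms PCM frames. It is illustrative only, and it still ignores adaptive noise floors, hangover timers, and coupling to the jitter buffer:

import audioop  # stdlib module (deprecated, removed in Python 3.13); fine for illustration

SAMPLE_RATE = 8000                                # G.711 narrowband
FRAME_MS = 20                                     # typical RTP packet interval
FRAME_BYTES = SAMPLE_RATE * 2 * FRAME_MS // 1000  # bytes of 16-bit PCM per frame

def is_speech(frame: bytes, threshold: int = 500) -> bool:
    """Naive VAD: compare the RMS energy of one PCM frame to a fixed threshold."""
    assert len(frame) == FRAME_BYTES
    return audioop.rms(frame, 2) > threshold

And that is one of the easier boxes. Jitter buffers, SRTP, and codec transcoding are each considerably worse.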
The Alternative: Media Offloading
The idea is simple: let AI handle text, let infrastructure handle audio.
Your AI agent should only ever see:
- Transcribed text ("I want to check my account balance")
- Structured events (call started, DTMF pressed, call ended)
And it should only ever produce:
- Text responses that get synthesized into speech
- High-level commands ("transfer to billing", "play hold music", "hang up")
This is what VoIPBin calls Media Offloading. VoIPBin sits between the phone network and your AI, handling 100% of the audio layer. Your code speaks clean HTTP and JSON.
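Concretely, the contract is plain JSON over HTTP. The field names below match the webhook examples later in this post; treat the exact schema as illustrative and check the VoIPBin docs for the authoritative shape:

# What your webhook receives (one POST per event):
incoming_event = {
    "call_id": "c0ffee-1234",        # illustrative ID; stable for the life of the call
    "type": "speech.recognized",     # or "call.started", "call.ended", a DTMF event, ...
    "transcript": "I want to check my account balance",
}

# What your webhook returns:
webhook_response = {
    "actions": [
        {"type": "talk", "text": "Sure, let me pull that up."},
    ],
}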
What This Looks Like in Practice
Here is a complete AI voice agent in Python. Notice what is not in this code:
from flask import Flask, request, jsonify
import openai

app = Flask(__name__)
client = openai.OpenAI()

# Per-call conversation state. In-memory is fine for a demo; use a shared
# store once you run more than one replica (more on this below).
conversation_history = {}

@app.route("/voice-agent", methods=["POST"])
def voice_agent():
    event = request.json
    call_id = event.get("call_id")
    event_type = event.get("type")

    # Call started — greet the caller
    if event_type == "call.started":
        conversation_history[call_id] = [
            {"role": "system", "content": "You are a helpful support agent. Be concise."}
        ]
        return jsonify({
            "actions": [
                {"type": "talk", "text": "Hello! How can I help you today?"}
            ]
        })

    # Caller spoke — run AI inference on the transcript
    if event_type == "speech.recognized":
        transcript = event.get("transcript", "")
        history = conversation_history.get(call_id, [])
        history.append({"role": "user", "content": transcript})

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=history
        )
        reply = response.choices[0].message.content

        history.append({"role": "assistant", "content": reply})
        conversation_history[call_id] = history

        return jsonify({
            "actions": [
                {"type": "talk", "text": reply}
            ]
        })

    # Call ended — clean up
    if event_type == "call.ended":
        conversation_history.pop(call_id, None)
        return jsonify({"actions": []})

    return jsonify({"actions": []})

if __name__ == "__main__":
    app.run(port=5000)
That is the entire voice agent. No RTP. No SIP. No audio codecs. No jitter buffer. Just HTTP in, HTTP out.
VoIPBin takes the raw phone call, transcribes speech in real time, sends your webhook the transcript, receives your text response, and synthesizes it into speech — all before the caller notices any delay.
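Because the agent is plain HTTP, you can exercise it locally before wiring up any telephony. A quick smoke test against the Flask app above (assuming it is running on port 5000):

import requests

URL = "http://localhost:5000/voice-agent"

# Walk through the call lifecycle VoIPBin would normally drive.
print(requests.post(URL, json={"call_id": "test-1", "type": "call.started"}).json())
print(requests.post(URL, json={
    "call_id": "test-1",
    "type": "speech.recognized",
    "transcript": "I want to check my account balance",
}).json())
print(requests.post(URL, json={"call_id": "test-1", "type": "call.ended"}).json())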
Getting Started
First, sign up and get your API token:
curl -s -X POST https://api.voipbin.net/v1.0/auth/signup \
-H "Content-Type: application/json" \
-d '{
"username": "your-username",
"password": "your-password",
"email": "you@example.com"
}'
# Returns: { "token": "..." } ← your access token, no OTP needed
Then create a flow that points to your webhook:
curl -s -X POST https://api.voipbin.net/v1.0/flows \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "AI Voice Agent",
"actions": [
{
"type": "webhook",
"url": "https://your-server.com/voice-agent"
}
]
}'
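If you would rather script the setup, here are the same two calls from Python (endpoints exactly as in the curl examples above):

import requests

API = "https://api.voipbin.net/v1.0"

# Sign up to get an access token.
token = requests.post(f"{API}/auth/signup", json={
    "username": "your-username",
    "password": "your-password",
    "email": "you@example.com",
}).json()["token"]

# Create the flow that points at your webhook.
flow = requests.post(
    f"{API}/flows",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "AI Voice Agent",
        "actions": [
            {"type": "webhook", "url": "https://your-server.com/voice-agent"},
        ],
    },
).json()
print(flow)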
Assign a phone number to the flow, and every incoming call now drives your webhook. That is it. Your Flask app above is a fully functional phone AI agent.
The Golang SDK Version
If you prefer Go, the same webhook pattern is just as clean (the handler below needs only the standard library):
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type Event struct {
	CallID     string `json:"call_id"`
	Type       string `json:"type"`
	Transcript string `json:"transcript"`
}

type Action struct {
	Type string `json:"type"`
	Text string `json:"text,omitempty"`
}

type Response struct {
	Actions []Action `json:"actions"`
}

// callAI is a placeholder: wire in your model of choice here.
func callAI(transcript string) string {
	return "You said: " + transcript
}

func voiceHandler(w http.ResponseWriter, r *http.Request) {
	var event Event
	if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	var actions []Action
	switch event.Type {
	case "call.started":
		actions = []Action{{Type: "talk", Text: "Hello! How can I help you?"}}
	case "speech.recognized":
		// Call your AI model here
		reply := callAI(event.Transcript)
		actions = []Action{{Type: "talk", Text: reply}}
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(Response{Actions: actions})
}

func main() {
	http.HandleFunc("/voice-agent", voiceHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Install the SDK:
go get github.com/voipbin/voipbin-go
What You Get
When you offload audio to infrastructure instead of handling it yourself:
Simpler codebase. Your AI logic stays focused on AI logic. No audio thread pools, no RTP socket management, no codec libraries to vendor.
Better latency. VoIPBin runs purpose-built media servers geographically close to carriers. Your webhook only adds the inference time — not the full audio round-trip overhead of a DIY stack.
Free horizontal scaling. Your webhook is stateless HTTP. Scale it like any API. VoIPBin handles the stateful media sessions.
Automatic codec handling. Callers dial in from SIP phones, mobile networks, WebRTC clients — each with different codecs. VoIPBin normalizes everything before your code ever sees it.
STT/TTS without wiring. Built-in speech recognition and synthesis. Swap languages or voices through config, not code.
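One practical consequence of the scaling point above: the only state in the Flask example is the in-memory conversation_history dict, which pins each call to a single process. Move it to a shared store and the webhook becomes truly stateless. A minimal sketch, assuming a Redis instance and the redis-py client:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_history(call_id: str) -> list:
    raw = r.get(f"call:{call_id}:history")
    return json.loads(raw) if raw else []

def save_history(call_id: str, history: list) -> None:
    # Expire after an hour so abandoned calls do not leak state.
    r.set(f"call:{call_id}:history", json.dumps(history), ex=3600)

With that swap, you can run as many webhook replicas as you like behind any ordinary HTTP load balancer.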
The Mental Model Shift
The key insight is this: audio infrastructure is not your product, it is a dependency.
You would not write your own TCP stack to build an API. You would not implement your own TLS to serve HTTPS. The same logic applies to RTP and SIP.
Let the infrastructure layer handle packets, timing, and codec negotiation. Keep your application layer focused on the thing that actually creates value — the AI behavior.
Your AI voice agent should look like a webhook handler. If it looks like a media server, something has gone wrong.
Try It
- Website: voipbin.net
- MCP Server: uvx voipbin-mcp (works with Claude Code, Cursor)
- Golang SDK: go get github.com/voipbin/voipbin-go
- API Base: https://api.voipbin.net/v1.0
If you are building AI voice agents and want to talk through the architecture, drop a comment below. Happy to go deeper on any part of this.