
GPT-4o Voice on a Real Phone Line: Connecting OpenAI Realtime API to Actual Calls

You have probably seen OpenAI Realtime API demos — ultra-low-latency, natural voice conversations with GPT-4o. Impressive in the browser. But what about a real phone call?

Your users are not always at a computer. They call. Turning that browser demo into an actual phone number is where most developers get stuck.

This post walks through the architecture and code to put GPT-4o voice on a real phone line — without writing a single line of RTP or SIP code.

The Core Problem

OpenAI Realtime API speaks WebSocket. Phone networks speak RTP (Real-time Transport Protocol) — a completely different audio streaming format that requires:

  • SIP signaling to handle call setup and teardown
  • RTP stream processing for audio delivery
  • Audio codec transcoding (G.711 ↔ PCM16; see the sketch below)
  • Network jitter buffering and packet loss handling

Most developers do not want to build this. They should not have to.
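
To make the codec bullet concrete: decoding a single G.711 μ-law byte into a PCM16 sample looks roughly like this (the textbook CCITT algorithm, not VoIPBin's code), and something like it runs for every byte of every 20 ms RTP frame, in both directions:

def ulaw_to_pcm16(u: int) -> int:
    # Textbook G.711 mu-law expansion, shown only to illustrate the work
    # VoIPBin absorbs; a real media path also handles encoding and A-law.
    u = ~u & 0xFF
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude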

VoIPBin acts as the translation layer. When a call comes in, VoIPBin converts phone audio into a WebSocket stream your backend can consume — the same format OpenAI expects.

Phone Call
    │
    ▼
VoIPBin  ← handles SIP + RTP entirely
    │  WebSocket (PCM16 audio frames)
    ▼
Your Server (lightweight bridge)
    │  OpenAI Realtime API WebSocket
    ▼
GPT-4o  ← processes audio, generates voice response
    │
    ▼
VoIPBin  ← converts PCM16 back to RTP
    │
    ▼
Caller hears GPT-4o

Your server does one job: relay audio between two WebSocket connections.

Step 1 — Sign Up for VoIPBin

VoIPBin signup is a single API call. No email verification, no waiting.

curl -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{"username": "yourname", "password": "yourpassword"}'

Save the accesskey.token from the response — that is your API key. Then provision a phone number from the VoIPBin dashboard and point its inbound webhook at your server.
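
If you would rather script the signup, here is the equivalent call in Python (requests is an extra dependency, and the response shape is assumed from the accesskey.token path above):

import requests

resp = requests.post(
    "https://api.voipbin.net/v1.0/auth/signup",
    json={"username": "yourname", "password": "yourpassword"},
    timeout=10,
)
resp.raise_for_status()
token = resp.json()["accesskey"]["token"]  # your API key
print(token)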

Step 2 — Build the Bridge Server

Install dependencies:

pip install fastapi uvicorn websockets openai python-dotenv
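
The code below reads OPENAI_API_KEY from the environment; thanks to python-dotenv it will also pick up a .env file next to bridge.py:

OPENAI_API_KEY=sk-...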

Create bridge.py:

import asyncio
import base64
import json
import os
import websockets
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import JSONResponse
from dotenv import load_dotenv

load_dotenv()
app = FastAPI()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_REALTIME_URL = (
    "wss://api.openai.com/v1/realtime"
    "?model=gpt-4o-realtime-preview-2024-12-17"
)

SYSTEM_PROMPT = (
    "You are a helpful phone assistant. "
    "Keep responses concise — callers are on a phone, not reading a document. "
    "One or two sentences per turn is ideal."
)


@app.post("/webhook/inbound")
async def handle_inbound_call(request: Request):
    data = await request.json()
    call_id = data.get("call_id", "unknown")
    print(f"Incoming call: {call_id}")

    # Tell VoIPBin to stream audio to our WebSocket bridge
    return JSONResponse({
        "actions": [
            {
                "type": "talk",
                "text": "Connecting you to the AI assistant."
            },
            {
                "type": "stream",
                "url": "wss://your-server.com/call",
                "audio_format": "pcm16",
                "sample_rate": 16000
            }
        ]
    })


@app.websocket("/call")
async def call_bridge(voipbin_ws: WebSocket):
    await voipbin_ws.accept()
    print("Call connected — opening OpenAI Realtime session")

    async with websockets.connect(
        OPENAI_REALTIME_URL,
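        # Note: websockets >= 14 renamed extra_headers to additional_headers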
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1"
        }
    ) as openai_ws:

        # Configure the session
        await openai_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": SYSTEM_PROMPT,
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 700
                }
            }
        }))

        async def phone_to_ai():
            # Forward caller audio to OpenAI
            try:
                async for audio_bytes in voipbin_ws.iter_bytes():
                    await openai_ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(audio_bytes).decode()
                    }))
            except Exception as exc:
                print(f"phone_to_ai closed: {exc}")

        async def ai_to_phone():
            # Forward GPT-4o audio back to caller
            try:
                async for raw in openai_ws:
                    event = json.loads(raw)

                    if event["type"] == "response.audio.delta":
                        audio = base64.b64decode(event["delta"])
                        await voipbin_ws.send_bytes(audio)

                    elif event["type"] == (
                        "conversation.item.input_audio_transcription.completed"
                    ):
                        print(f"Caller: {event.get('transcript', '')}")

            except Exception as exc:
                print(f"ai_to_phone closed: {exc}")

        await asyncio.gather(phone_to_ai(), ai_to_phone())

    print("Call ended")
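
One behavior worth noting: asyncio.gather waits for both loops, so after a hangup ends phone_to_ai, the ai_to_phone loop can linger until OpenAI closes its socket. A common pattern (a sketch you could swap in for the gather line above) cancels whichever pump outlives the other:

tasks = [asyncio.create_task(phone_to_ai()),
         asyncio.create_task(ai_to_phone())]
done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
for task in pending:
    task.cancel()  # stop the surviving pump once either side disconnects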

Step 3 — Run and Test

uvicorn bridge:app --host 0.0.0.0 --port 8000

For local development, expose it with ngrok:

ngrok http 8000

Update your VoIPBin number's webhook to https://your-ngrok-url/webhook/inbound. Call the number. You are talking to GPT-4o.
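
Before dialing, you can sanity-check the webhook by hand; the call_id here is an arbitrary test value, and requests is an extra dependency:

import requests

resp = requests.post(
    "http://localhost:8000/webhook/inbound",
    json={"call_id": "test-123"},
    timeout=5,
)
print(resp.json())  # should echo the talk and stream actions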

What VoIPBin Is Handling (That You Are Not)

Here is everything happening behind the scenes that your code does not touch:

Concern                        Handled by
---------------------------    ----------
SIP INVITE / 200 OK / BYE      VoIPBin
RTP packet assembly            VoIPBin
G.711 ↔ PCM16 transcoding      VoIPBin
DTMF detection                 VoIPBin
Network jitter compensation    VoIPBin
Call recording (optional)      VoIPBin
Your AI logic                  You

Your server handles two async loops and some JSON. That is the entire diff.

Scaling This

Because VoIPBin handles all media processing, your bridge server is CPU-light. Each call is two WebSocket connections and async I/O. A single server can comfortably handle dozens of concurrent calls. When you outgrow one, add more — each is stateless.

VoIPBin scales its media infrastructure independently. You scale your AI logic independently. Clean separation of concerns.

Extending the Bot

With this foundation, common additions are straightforward:

Call recording: VoIPBin can record audio server-side — just add a flag in your webhook response.

Post-call transcripts: OpenAI Realtime API streams transcripts via the conversation.item.input_audio_transcription.completed event. Collect them and send a summary after hangup.
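
A minimal sketch of that idea, assuming a per-call list declared next to the two loops in call_bridge (save_transcript is a hypothetical helper):

transcript: list[str] = []  # create per call, beside phone_to_ai / ai_to_phone

# In ai_to_phone, extend the existing transcription branch:
#     transcript.append(f"Caller: {event.get('transcript', '')}")

def save_transcript(call_id: str, lines: list[str]) -> None:
    # Hypothetical sink: a flat file; swap in your database or summarizer.
    with open(f"transcript-{call_id}.txt", "w") as fh:
        fh.write("\n".join(lines))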

Human escalation: If GPT-4o cannot help, add a transfer action to your VoIPBin webhook response to forward the call to a human agent.

Outbound calls: Run the flow in reverse. Use VoIPBin to dial out, then connect the answered call to the same bridge. Same architecture, opposite direction.

Getting Started

  • VoIPBin API docs and dashboard: voipbin.net
  • Golang SDK: go get github.com/voipbin/voipbin-go
  • MCP server (use from Claude Code or Cursor): uvx voipbin-mcp

The signup endpoint returns your token immediately — no waiting, no onboarding call required.


The gap between "cool browser demo" and "real phone call" turns out to be roughly a hundred lines of Python and one API account.
