You have probably seen OpenAI Realtime API demos — ultra-low-latency, natural voice conversations with GPT-4o. Impressive in the browser. But what about a real phone call?
Your users are not always at a computer. They call. Turning that browser demo into an actual phone number is where most developers get stuck.
This post walks through the architecture and code to put GPT-4o voice on a real phone line — without writing a single line of RTP or SIP code.
## The Core Problem
The OpenAI Realtime API speaks WebSocket. Phone networks speak RTP (Real-time Transport Protocol) — a completely different transport stack that requires:
- SIP signaling to handle call setup and teardown
- RTP stream processing for audio delivery
- Audio codec transcoding (G.711 ↔ PCM16)
- Network jitter buffering and packet loss handling
Most developers do not want to build this. They should not have to.
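To make the transcoding item concrete, here is a minimal sketch of G.711 μ-law decoding to PCM16 in pure Python. It is the standard decode formula, shown purely for illustration; VoIPBin performs this conversion for you, in both directions, at line rate:

```python
def ulaw_to_pcm16(ulaw_byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    ulaw_byte = ~ulaw_byte & 0xFF            # mu-law bytes are stored inverted
    sign = ulaw_byte & 0x80
    exponent = (ulaw_byte >> 4) & 0x07
    mantissa = ulaw_byte & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def decode_frame(payload: bytes) -> list[int]:
    """Decode a raw mu-law frame, as it would arrive in an RTP payload."""
    return [ulaw_to_pcm16(b) for b in payload]

print(ulaw_to_pcm16(0x00))  # -32124 (loudest negative sample)
print(ulaw_to_pcm16(0xFF))  # 0 (silence)
```

And that is only the codec math. Jitter buffering, packet reordering, and loss concealment sit on top of it.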
VoIPBin acts as the translation layer. When a call comes in, VoIPBin converts phone audio into a WebSocket stream your backend can consume — the same format OpenAI expects.
```
Phone Call
    │
    ▼
VoIPBin        ← handles SIP + RTP entirely
    │  WebSocket (PCM16 audio frames)
    ▼
Your Server    ← lightweight bridge
    │  OpenAI Realtime API WebSocket
    ▼
GPT-4o         ← processes audio, generates voice response
    │
    ▼
VoIPBin        ← converts PCM16 back to RTP
    │
    ▼
Caller hears GPT-4o
```
Your server does one job: relay audio between two WebSocket connections.
## Step 1 — Sign Up for VoIPBin
VoIPBin signup is a single API call. No email verification, no waiting.
```bash
curl -X POST https://api.voipbin.net/v1.0/auth/signup \
  -H "Content-Type: application/json" \
  -d '{"username": "yourname", "password": "yourpassword"}'
```
Save the `accesskey.token` from the response — that is your API key. Then provision a phone number from the VoIPBin dashboard and point its inbound webhook at your server.
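Pulling the token out of the response is straightforward. Note that the response body below is a hypothetical shape inferred from the `accesskey.token` path; check it against the real API response:

```python
import json

# Hypothetical signup response body, inferred from the accesskey.token path
# mentioned above. Verify the actual shape against the VoIPBin API docs.
raw = '{"username": "yourname", "accesskey": {"token": "vb_example_token"}}'

resp = json.loads(raw)
token = resp["accesskey"]["token"]

# Use it as a bearer-style credential on later API calls (header name assumed)
headers = {"Authorization": f"Bearer {token}"}
print(token)
```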
## Step 2 — Build the Bridge Server
Install dependencies:
```bash
pip install fastapi uvicorn websockets openai python-dotenv
```
Create `bridge.py`:
```python
import asyncio
import base64
import json
import os

import websockets
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import JSONResponse
from dotenv import load_dotenv

load_dotenv()
app = FastAPI()

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
OPENAI_REALTIME_URL = (
    "wss://api.openai.com/v1/realtime"
    "?model=gpt-4o-realtime-preview-2024-12-17"
)

SYSTEM_PROMPT = (
    "You are a helpful phone assistant. "
    "Keep responses concise — callers are on a phone, not reading a document. "
    "One or two sentences per turn is ideal."
)


@app.post("/webhook/inbound")
async def handle_inbound_call(request: Request):
    data = await request.json()
    call_id = data.get("call_id", "unknown")
    print(f"Incoming call: {call_id}")
    # Tell VoIPBin to stream audio to our WebSocket bridge
    return JSONResponse({
        "actions": [
            {
                "type": "talk",
                "text": "Connecting you to the AI assistant."
            },
            {
                "type": "stream",
                "url": "wss://your-server.com/call",
                "audio_format": "pcm16",
                "sample_rate": 16000
            }
        ]
    })


@app.websocket("/call")
async def call_bridge(voipbin_ws: WebSocket):
    await voipbin_ws.accept()
    print("Call connected — opening OpenAI Realtime session")

    # Note: `extra_headers` is the parameter name in websockets < 14;
    # newer releases renamed it to `additional_headers`.
    async with websockets.connect(
        OPENAI_REALTIME_URL,
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1"
        }
    ) as openai_ws:
        # Configure the session
        await openai_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": SYSTEM_PROMPT,
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 700
                }
            }
        }))

        async def phone_to_ai():
            # Forward caller audio to OpenAI
            try:
                async for audio_bytes in voipbin_ws.iter_bytes():
                    await openai_ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(audio_bytes).decode()
                    }))
            except Exception as exc:
                print(f"phone_to_ai closed: {exc}")

        async def ai_to_phone():
            # Forward GPT-4o audio back to caller
            try:
                async for raw in openai_ws:
                    event = json.loads(raw)
                    if event["type"] == "response.audio.delta":
                        audio = base64.b64decode(event["delta"])
                        await voipbin_ws.send_bytes(audio)
                    elif event["type"] == (
                        "conversation.item.input_audio_transcription.completed"
                    ):
                        print(f"Caller: {event.get('transcript', '')}")
            except Exception as exc:
                print(f"ai_to_phone closed: {exc}")

        await asyncio.gather(phone_to_ai(), ai_to_phone())

    print("Call ended")
```
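The bridge reads `OPENAI_API_KEY` through `python-dotenv`, so place a `.env` file next to `bridge.py` (the key value is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
```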
## Step 3 — Run and Test
```bash
uvicorn bridge:app --host 0.0.0.0 --port 8000
```
For local development, expose it with ngrok:
```bash
ngrok http 8000
```
Update your VoIPBin number's webhook to `https://your-ngrok-url/webhook/inbound`. Call the number. You are talking to GPT-4o.
## What VoIPBin Is Handling (That You Are Not)
Here is everything happening behind the scenes that your code does not touch:
| Concern | Handled by |
|---|---|
| SIP INVITE / 200 OK / BYE | VoIPBin |
| RTP packet assembly | VoIPBin |
| G.711 ↔ PCM16 transcoding | VoIPBin |
| DTMF detection | VoIPBin |
| Network jitter compensation | VoIPBin |
| Call recording (optional) | VoIPBin |
| Your AI logic | You |
Your server handles two async loops and some JSON. That is the entire diff.
## Scaling This
Because VoIPBin handles all media processing, your bridge server is CPU-light. Each call is two WebSocket connections and async I/O. A single server can comfortably handle dozens of concurrent calls. When you outgrow one, add more — each is stateless.
VoIPBin scales its media infrastructure independently. You scale your AI logic independently. Clean separation of concerns.
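The CPU-light claim is easy to sanity-check: a relay task spends almost all its time awaiting sockets, so one event loop can interleave many of them. A toy simulation, with sleeps standing in for socket I/O:

```python
import asyncio
import time

async def fake_call(call_id: int) -> int:
    # Stand-in for a relay loop: five awaited "network" waits of 20 ms each
    for _ in range(5):
        await asyncio.sleep(0.02)
    return call_id

async def run_calls(n: int) -> tuple[int, float]:
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_call(i) for i in range(n)))
    return len(results), time.perf_counter() - start

# 50 calls x 100 ms of waiting each finish in roughly 100 ms total,
# not 5 seconds, because the waits overlap on a single event loop
count, elapsed = asyncio.run(run_calls(50))
print(f"{count} calls in {elapsed:.2f}s")
```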
## Extending the Bot
With this foundation, common additions are straightforward:
- **Call recording:** VoIPBin can record audio server-side — just add a flag in your webhook response.
- **Post-call transcripts:** the OpenAI Realtime API streams transcripts via the `conversation.item.input_audio_transcription.completed` event. Collect them and send a summary after hangup.
- **Human escalation:** if GPT-4o cannot help, add a transfer action to your VoIPBin webhook response to forward the call to a human agent.
- **Outbound calls:** flip the model — use VoIPBin to dial out, then connect the answered call to the same bridge. Same architecture, opposite direction.
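For post-call transcripts specifically, the change to the bridge is small: accumulate utterances as transcription events arrive, then summarize at hangup. A sketch of the event handling, using the same event type the bridge already logs (the simulated events below are illustrative):

```python
def collect_transcript(transcripts: list[str], event: dict) -> None:
    """Append a caller utterance when a transcription-completed event arrives."""
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        text = event.get("transcript", "").strip()
        if text:
            transcripts.append(text)

# Simulated Realtime API event stream, for illustration only
events = [
    {"type": "response.audio.delta", "delta": "..."},
    {"type": "conversation.item.input_audio_transcription.completed",
     "transcript": "What are your opening hours?"},
    {"type": "conversation.item.input_audio_transcription.completed",
     "transcript": "Thanks, goodbye."},
]

transcripts: list[str] = []
for event in events:
    collect_transcript(transcripts, event)

# At hangup, hand the joined transcript to the summarizer of your choice
print(" / ".join(transcripts))
```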
## Getting Started
- VoIPBin API docs and dashboard: voipbin.net
- Golang SDK: `go get github.com/voipbin/voipbin-go`
- MCP server (use from Claude Code or Cursor): `uvx voipbin-mcp`
The signup endpoint returns your token immediately — no waiting, no onboarding call required.
The gap between "cool browser demo" and "real phone call" turns out to be under a hundred lines of Python and one API account.