Most voice AI feels like talking to a call center robot. You say something, wait two seconds, get a canned response. The latency kills the illusion.
I wanted something better — a voice agent that could hold a real conversation with near-human response times. Here's how I built it by bridging Twilio Media Streams directly to OpenAI's Realtime API.
The Architecture
The traditional approach to voice AI is a three-step pipeline: Speech-to-Text → LLM → Text-to-Speech. Each step adds latency. STT takes 500ms-1s. The LLM call takes another 500ms-2s. TTS adds another 500ms. You're looking at 1.5-3.5 seconds of dead air before your agent says anything. Humans notice pauses over 300ms.
OpenAI's Realtime API changes the game. Instead of three separate steps, you get a single WebSocket connection that handles audio in and audio out. The model "hears" raw audio and "speaks" back directly. No transcription round-trip.
The trick is getting Twilio's phone audio into that WebSocket.
How It Works
When someone calls our Twilio number, Twilio opens a Media Stream — a WebSocket that sends us raw audio packets (mulaw, 8kHz). Our Node.js server sits in the middle:
Phone Call → Twilio → Media Stream WebSocket → Our Server → OpenAI Realtime WebSocket
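Getting Twilio to open that Media Stream in the first place takes a small TwiML reply to the incoming-call webhook. A minimal sketch (the WebSocket URL is a placeholder for your own endpoint):

```javascript
// Respond to Twilio's incoming-call webhook with TwiML that opens a
// bidirectional Media Stream to our WebSocket endpoint.
// The URL is a placeholder; point it at your own server.
function connectStreamTwiml(wsUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Connect><Stream url="${wsUrl}" /></Connect>`,
    "</Response>",
  ].join("\n");
}
```

Twilio then upgrades to a WebSocket at that URL and starts sending `start`, `media`, and `stop` events.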
The server's job is simple: receive audio chunks from Twilio, forward them to OpenAI, and pipe OpenAI's audio responses back to Twilio. It's a bridge, not a processor.
Here's the core of it:
```javascript
// Twilio sends audio
twilioWs.on("message", (data) => {
  const msg = JSON.parse(data);
  if (msg.event === "media") {
    openaiWs.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: msg.media.payload // already base64 mulaw
    }));
  }
});

// OpenAI sends audio back
openaiWs.on("message", (data) => {
  const event = JSON.parse(data);
  if (event.type === "response.audio.delta") {
    twilioWs.send(JSON.stringify({
      event: "media",
      streamSid: streamSid,
      media: { payload: event.delta }
    }));
  }
});
```
That's the essential loop. Audio flows in both directions through our server with minimal processing overhead.
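One refactor pays off early: the two message shapes are fixed by the Twilio and OpenAI schemas, so pulling them into pure builders makes the bridge unit-testable without placing a call. A sketch:

```javascript
// Pure builders for the two message shapes the bridge exchanges.
// Field names follow the Twilio Media Streams and OpenAI Realtime schemas.
function toOpenAIAppend(base64Mulaw) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64Mulaw, // already base64 mulaw, no transcoding needed
  });
}

function toTwilioMedia(streamSid, base64Delta) {
  return JSON.stringify({
    event: "media",
    streamSid,
    media: { payload: base64Delta },
  });
}
```

The handlers then collapse to `openaiWs.send(toOpenAIAppend(msg.media.payload))` and `twilioWs.send(toTwilioMedia(streamSid, event.delta))`.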
The Details That Matter
Transcription: I added `input_audio_transcription: { model: "whisper-1" }` to capture what both sides say. This runs async — it doesn't add to response latency, but gives you full transcripts after the call.
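Concretely, that setting goes into a `session.update` message sent once the OpenAI socket opens. A sketch of the full session config this setup implies (the `g711_ulaw` format options are my assumption here; they match the 8kHz mulaw audio Twilio streams, which is what lets the bridge forward payloads without transcoding):

```javascript
// Session configuration sent once the OpenAI WebSocket is open.
// g711_ulaw on both legs matches Twilio's 8kHz mulaw audio, and
// input_audio_transcription enables async Whisper transcripts.
function sessionUpdate(instructions) {
  return JSON.stringify({
    type: "session.update",
    session: {
      voice: "ash",
      instructions,
      input_audio_format: "g711_ulaw",
      output_audio_format: "g711_ulaw",
      input_audio_transcription: { model: "whisper-1" },
    },
  });
}
```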
Voice selection: OpenAI offers several voices. I went with ash — it's deeper and more natural for a male-presenting agent. The voice quality from the Realtime API is noticeably better than traditional TTS.
Interruption handling: The Realtime API handles barge-in natively. If someone starts talking while the agent is speaking, it detects the interruption and stops. No custom VAD needed.
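One Twilio-side wrinkle the API can't handle for you: Twilio buffers outbound audio for playback, so when the model detects a barge-in you should also flush that buffer, or the caller keeps hearing audio that was queued before the interruption. A sketch of the pattern, assuming the `speech_started` event and Twilio's `clear` message:

```javascript
// When OpenAI reports the caller started speaking, cancel the in-flight
// response and flush any audio Twilio has already queued for playback.
function handleOpenAIEvent(event, { openaiWs, twilioWs, streamSid }) {
  if (event.type === "input_audio_buffer.speech_started") {
    openaiWs.send(JSON.stringify({ type: "response.cancel" }));
    twilioWs.send(JSON.stringify({ event: "clear", streamSid }));
  }
}
```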
DNS and SSL: Twilio needs a public WebSocket endpoint. I pointed realtime.byldr.co at my VPS (not through Cloudflare proxy — WebSockets don't play nice with proxied connections) and used Let's Encrypt for SSL.
Real-World Results
End-to-end latency: roughly 200ms from when someone finishes speaking to when the agent starts responding. That's fast enough to feel conversational. People don't notice the gap.
I tested it by having the agent cold-call me with a comically aggressive sales pitch. I played along, gave a fake credit card number, and the agent handled the whole exchange naturally. Full transcript captured on both sides.
What I'd Do Differently
The Realtime API is still relatively new, and the pricing is steep — audio tokens cost significantly more than text. For high-volume use cases, the three-step pipeline with a faster STT (like Deepgram) might be more cost-effective, even with the latency hit.
Also, error handling on WebSocket reconnection needs more thought. Twilio's Media Streams can hiccup, and if either WebSocket drops, you need to gracefully restart the bridge without the caller hearing dead air.
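If I were hardening it, I'd start with capped exponential backoff around the OpenAI leg so a flapping connection doesn't hammer the API while the caller waits. A minimal sketch (the helper names are hypothetical):

```javascript
// Hypothetical reconnect helper: exponential backoff with a cap.
function backoffMs(attempt, baseMs = 250, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry connect() up to maxAttempts times, waiting backoffMs between tries.
// connect is expected to return a Promise resolving to an open socket.
function reconnectWithBackoff(connect, maxAttempts = 5) {
  return new Promise((resolve, reject) => {
    const tryOnce = (attempt) => {
      connect()
        .then(resolve)
        .catch((err) => {
          if (attempt + 1 >= maxAttempts) return reject(err);
          setTimeout(() => tryOnce(attempt + 1), backoffMs(attempt));
        });
    };
    tryOnce(0);
  });
}
```

While reconnecting, you'd also want to keep the Twilio leg alive (a brief hold tone or silence) so the caller doesn't hear a dead line.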
The Stack
- Runtime: Node.js on a $10/month VPS
- Phone: Twilio (inbound + outbound)
- AI: OpenAI Realtime API (gpt-4o-realtime-preview)
- Process manager: PM2
- Total infrastructure cost: ~$15/month plus per-minute API costs
You don't need a fancy MLOps platform to build voice AI that feels real. A VPS, two WebSocket connections, and some careful audio piping gets you 90% of the way there.