DEV Community

RyanCwynar

Building Sub-200ms Voice AI with OpenAI Realtime API

Voice AI has always had a latency problem. Traditional pipelines—speech-to-text, LLM processing, text-to-speech—stack delays that make conversations feel robotic. Users wait 2-3 seconds for responses. It kills the magic.

OpenAI's Realtime API changes everything. We're talking sub-200ms response times. Bidirectional audio streaming. Real conversations with AI.

Here's how I built it.

The Architecture

```
Twilio (Phone) <-> WebSocket Server <-> OpenAI Realtime API
     ↓                    ↓                    ↓
  PSTN Audio      Media Stream Bridge      GPT-4o Realtime
   (μ-law)           (Base64)              (PCM 24kHz)
```

The key insight: no transcription step. Audio goes directly to the model, and audio comes directly back. The model "hears" and "speaks" natively.

Twilio Media Streams

When someone calls your Twilio number, you respond with TwiML that opens a WebSocket:

```xml
<Response>
  <Connect>
    <Stream url="wss://your-server.com/media-stream">
      <Parameter name="callerNumber" value="{From}"/>
    </Stream>
  </Connect>
</Response>
```
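If you render this TwiML from your own webhook handler, it can be as small as a string template. A sketch (the function name is mine, and XML escaping is omitted since phone numbers are plain strings):

```javascript
// Hypothetical helper: render the TwiML above from a call webhook.
// streamUrl comes from your config; from is Twilio's "From" field.
function incomingCallTwiml(streamUrl, from) {
  return [
    '<Response>',
    '  <Connect>',
    `    <Stream url="${streamUrl}">`,
    `      <Parameter name="callerNumber" value="${from}"/>`,
    '    </Stream>',
    '  </Connect>',
    '</Response>'
  ].join('\n');
}
```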

Twilio sends audio chunks as base64-encoded μ-law (8kHz). You can either transcode to PCM16 at 24kHz for OpenAI, or set the Realtime session's `input_audio_format` to `g711_ulaw` and let the model ingest μ-law directly.
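The transcode step itself is small. Here's a sketch of the upstream direction: G.711 μ-law expansion, then naive 3× sample repetition to get from 8kHz to 24kHz (a production bridge would use a proper resampler for better quality):

```javascript
// Sketch: decode Twilio's base64 μ-law (8kHz) into base64 PCM16 (24kHz).
function transcode(base64Mulaw) {
  const mulaw = Buffer.from(base64Mulaw, 'base64');
  const pcm = Buffer.alloc(mulaw.length * 3 * 2); // 3x the samples, 2 bytes each
  for (let i = 0; i < mulaw.length; i++) {
    // Standard G.711 μ-law expansion
    const u = ~mulaw[i] & 0xff;
    const exponent = (u >> 4) & 0x07;
    const mantissa = u & 0x0f;
    let sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    if (u & 0x80) sample = -sample;
    // Repeat each decoded sample three times: 8kHz -> 24kHz
    for (let r = 0; r < 3; r++) pcm.writeInt16LE(sample, (i * 3 + r) * 2);
  }
  return pcm.toString('base64');
}
```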

The WebSocket Bridge

Your server maintains two WebSocket connections:

  1. Twilio → Your Server: Receives caller audio
  2. Your Server → OpenAI: Sends/receives audio from the model

```javascript
// Simplified flow: forward caller audio from Twilio to OpenAI
twilioWs.on('message', (data) => {
  const { event, media } = JSON.parse(data);
  if (event === 'media') {
    // media.payload is base64 μ-law 8kHz; transcode to base64 PCM16 24kHz
    const pcmAudio = transcode(media.payload);
    openaiWs.send(JSON.stringify({
      type: 'input_audio_buffer.append',
      audio: pcmAudio
    }));
  }
});
```
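The return path mirrors this. OpenAI streams synthesized audio as base64 chunks in `response.audio.delta` events; you wrap each chunk in the `media` message Twilio expects, keyed by the `streamSid` captured from the stream's `start` event. A sketch (the helper names here are mine):

```javascript
// Wrap a base64 audio chunk in Twilio's "media" message format.
function toTwilioMedia(streamSid, base64Audio) {
  return JSON.stringify({
    event: 'media',
    streamSid,
    media: { payload: base64Audio }
  });
}

// Model -> caller leg, assuming open sockets and a captured streamSid.
// If you transcoded on the way in, transcode back (24kHz PCM16 ->
// 8kHz μ-law) before sending to Twilio.
function bridgeModelAudio(openaiWs, twilioWs, streamSid) {
  openaiWs.on('message', (data) => {
    const msg = JSON.parse(data.toString());
    if (msg.type === 'response.audio.delta') {
      twilioWs.send(toTwilioMedia(streamSid, msg.delta));
    }
  });
}
```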

Server VAD: The Secret Sauce

turn_detection: { type: 'server_vad' } enables Voice Activity Detection on OpenAI's side. The model automatically detects when the user stops speaking and begins responding.

This is crucial for natural conversation flow.
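You enable it when configuring the session, typically right after the OpenAI socket opens. A sketch of the `session.update` payload (the threshold, padding, and silence numbers below are illustrative defaults, not tuned values from this build):

```javascript
// Build the session.update event that enables server-side VAD.
function sessionUpdate() {
  return JSON.stringify({
    type: 'session.update',
    session: {
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,            // speech-probability cutoff
        prefix_padding_ms: 300,    // audio kept from before detected speech
        silence_duration_ms: 500   // silence that ends the user's turn
      },
      voice: 'alloy',
      instructions: 'You are a helpful phone assistant.'
    }
  });
}
```

Usage: `openaiWs.on('open', () => openaiWs.send(sessionUpdate()));`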

Latency Breakdown

| Component | Time |
|---|---|
| Twilio → Server | ~50ms |
| Server → OpenAI | ~30ms |
| Model Processing | ~80ms |
| OpenAI → Server | ~30ms |
| Server → Twilio | ~50ms |
| **Total** | **~200ms** |

Compare this to traditional pipelines (2-3 seconds) and it's night and day.

Production Considerations

  1. Audio Format Hell: Twilio uses μ-law 8kHz; OpenAI's default is PCM16 24kHz. Transcode, or set the session's audio format to g711_ulaw.
  2. WebSocket Lifecycle: Handle disconnections gracefully on both legs, and tear down the paired socket when either side drops.
  3. Costs: Realtime API audio is billed by audio tokens, which works out to a per-minute-of-audio cost on both input and output.
  4. Interruptions: When the caller barges in, cancel the in-flight response (send response.cancel) and clear Twilio's buffered audio.
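For point 4, a minimal barge-in sketch: when server VAD reports `input_audio_buffer.speech_started` while the model is still talking, cancel the in-flight response and tell Twilio to drop any audio it has already buffered (function and variable names here are mine):

```javascript
// Barge-in handling, assuming open openaiWs/twilioWs sockets and a
// streamSid captured from Twilio's "start" event.
function handleBargeIn(openaiWs, twilioWs, streamSid) {
  openaiWs.on('message', (data) => {
    const msg = JSON.parse(data.toString());
    if (msg.type === 'input_audio_buffer.speech_started') {
      // Stop the model's current response...
      openaiWs.send(JSON.stringify({ type: 'response.cancel' }));
      // ...and flush audio Twilio has queued but not yet played.
      twilioWs.send(JSON.stringify({ event: 'clear', streamSid }));
    }
  });
}
```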

The voice assistant feels genuinely conversational. This is the future of voice interfaces.


Originally published at ryancwynar.com
