OpenAI just shipped three realtime voice models through its API. One reasons at GPT-5 level during live calls. One translates 70+ languages in real time. One does streaming transcription. All available today.
Let me break down what matters for developers.
The Three Models
GPT-Realtime-2 handles voice conversations with GPT-5-level reasoning. The key difference from previous voice models: it can call tools mid-conversation without going silent. It narrates what it's doing while executing — OpenAI calls this "preamble."
GPT-Realtime-Translate does real-time voice translation. 70+ input languages, 13 output languages. End-to-end audio processing (no intermediate text step), which preserves tone and emotion.
GPT-Realtime-Whisper is streaming speech-to-text. Words appear as the speaker talks. Built for live captions and meeting transcription.
Integration Options
All three use the Realtime API with three connection methods:
- WebRTC — browser-based, lowest latency
- WebSocket — server-side, more control
- SIP — telephony integration
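For the WebSocket path, the connection is just an authenticated `wss://` URL with the model passed as a query parameter. A minimal sketch, assuming the documented Realtime API endpoint shape; the model name here comes from the announcement and the exact slug may differ:

```python
# Build the URL and headers for a server-side Realtime WebSocket session.
# Endpoint shape follows OpenAI's Realtime API docs; verify the model
# slug ("gpt-realtime-2") against the current model list before use.

def realtime_connection(model: str, api_key: str) -> tuple[str, dict]:
    """Return the WebSocket URL and auth headers for a Realtime session."""
    url = f"wss://api.openai.com/v1/realtime?model={model}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers

url, headers = realtime_connection("gpt-realtime-2", "sk-...")
```

From here you'd open the socket with any WebSocket client (e.g. the `websockets` package) and exchange JSON events.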
GPT-Realtime-2: Voice Agents That Actually Work
If you've built voice agents before, you know the pain: tool calls create dead air. The user asks something that requires a database lookup, and the agent goes silent for 2-3 seconds. Feels broken.
GPT-Realtime-2 solves this with preamble — it talks through its actions while executing them. "Let me check your calendar... I see you have a meeting with Alex Kim in 12 minutes." The tool call happens in parallel with the speech.
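On the client side, "in parallel" means you keep playing audio deltas the moment they arrive and kick off tool execution as a background task instead of awaiting it inline. A sketch, assuming the Realtime API's audio-transcript and function-call event names (check the docs for exact payload fields) and a hypothetical `run_tool` lookup:

```python
# Play narration immediately; run tool calls concurrently so speech
# never blocks on a lookup. Event type strings mirror the Realtime API.
import asyncio
import json

async def run_tool(name: str, args: dict) -> str:
    # Hypothetical tool — replace with a real calendar/database lookup.
    await asyncio.sleep(0.01)
    return json.dumps({"next_meeting": "Alex Kim, 12 minutes"})

async def handle_events(events):
    spoken, tool_tasks = [], []
    for event in events:
        if event["type"] == "response.audio_transcript.delta":
            spoken.append(event["delta"])  # narration streams out right away
        elif event["type"] == "response.function_call_arguments.done":
            # The tool runs in the background; narration above isn't blocked.
            tool_tasks.append(asyncio.create_task(
                run_tool(event["name"], json.loads(event["arguments"]))))
    results = await asyncio.gather(*tool_tasks)
    return "".join(spoken), results
```

The key design choice: `create_task` instead of `await`, so the event loop keeps draining audio deltas while the lookup is in flight.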
Other developer-relevant specs:
- 128K context window (up from 32K)
- Handles interruptions without losing context
- Better instruction following for system prompts
- Text tokens: $4/$16 per million (input/output)
- Audio tokens: $32/$64 per million
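Since audio tokens are 8x the price of text input, it's worth estimating session cost before you ship. A quick calculator from the rates above (token counts in the example are made up):

```python
# Per-million-token rates from the pricing list above, in USD.
PRICES = {
    "text_in": 4.00, "text_out": 16.00,
    "audio_in": 32.00, "audio_out": 64.00,
}

def session_cost(text_in: int, text_out: int,
                 audio_in: int, audio_out: int) -> float:
    """Total USD cost for one session's token usage."""
    usage = {"text_in": text_in, "text_out": text_out,
             "audio_in": audio_in, "audio_out": audio_out}
    return sum(PRICES[k] * n / 1_000_000 for k, n in usage.items())

# A short call: 2K text in, 1K text out, 30K audio in, 20K audio out
cost = session_cost(2_000, 1_000, 30_000, 20_000)  # ≈ $2.26
```

Audio dominates: in this example the text tokens contribute about a cent while the audio tokens account for nearly all of the total.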
GPT-Realtime-Translate: The $0.034/min Disruption
The translation model is priced at $0.034 per minute. For context, a human simultaneous interpreter costs $25-44 per minute.
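Working that gap out for a one-hour meeting makes the scale obvious:

```python
# Per-minute rates from the article: model vs. human interpreter range.
MODEL_PER_MIN = 0.034
HUMAN_PER_MIN = (25.0, 44.0)  # low / high end

minutes = 60
model_cost = MODEL_PER_MIN * minutes                # $2.04 for the hour
human_low, human_high = (r * minutes for r in HUMAN_PER_MIN)  # $1500-$2640
ratio_low = human_low / model_cost                  # ~735x cheaper, low end
```

Even against the cheapest human rate, the model comes in at well under 1% of the cost.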
Technical details:
- Processes raw audio end-to-end (not cascaded speech-to-text-to-speech)
- Preserves speaker emotion and tone
- Works best with brief pauses between thoughts (labeled "turn-based" in docs)
- Occasional hallucinations still occur
- Supports language switching mid-stream
The end-to-end approach is what makes the quality difference. Traditional pipelines lose vocal characteristics at every stage. This model skips text entirely.
GPT-Realtime-Whisper: Streaming Transcription
If you need real-time captions or meeting transcription, this is the model. Low-latency streaming output as the speaker talks.
What You Can Build
The three models together cover the full voice infrastructure stack:
- Customer support agents that can reason, look up accounts, and process requests — all by voice
- Real-time translation layers for international meetings at roughly 1/1000th the cost of human interpreters
- Live captioning systems for streaming, conferences, or accessibility
- Multilingual voice assistants that handle code-switching naturally
- Telephony bots via SIP integration that feel like talking to a person