OpenAI just shipped three realtime voice models through its API. One reasons at GPT-5 level during live calls. One translates 70+ languages in real time. One does streaming transcription. All available today.
Let me break down what matters for developers.
The Three Models
GPT-Realtime-2 handles voice conversations with GPT-5-level reasoning. The key difference from previous voice models: it can call tools mid-conversation without going silent. It narrates what it's doing while executing — OpenAI calls this "preamble."
GPT-Realtime-Translate does real-time voice translation. 70+ input languages, 13 output languages. End-to-end audio processing (no intermediate text step), which preserves tone and emotion.
GPT-Realtime-Whisper is streaming speech-to-text. Words appear as the speaker talks. Built for live captions and meeting transcription.
Integration Options
All three use the Realtime API with three connection methods:
- WebRTC — browser-based, lowest latency
- WebSocket — server-side, more control
- SIP — telephony integration
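For the WebSocket path, the connection is just an authenticated `wss://` URL with the model passed as a query parameter. A minimal sketch, assuming the documented Realtime API endpoint shape; the model name here comes from the announcement and the exact slug may differ:

```python
# Build the URL and headers for a server-side Realtime WebSocket session.
# Endpoint shape follows OpenAI's Realtime API docs; verify the model
# slug ("gpt-realtime-2") against the current model list before use.

def realtime_connection(model: str, api_key: str) -> tuple[str, dict]:
    """Return the WebSocket URL and auth headers for a Realtime session."""
    url = f"wss://api.openai.com/v1/realtime?model={model}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers

url, headers = realtime_connection("gpt-realtime-2", "sk-...")
```

From here you'd open the socket with any WebSocket client (e.g. the `websockets` package) and exchange JSON events.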
GPT-Realtime-2: Voice Agents That Actually Work
If you've built voice agents before, you know the pain: tool calls create dead air. The user asks something that requires a database lookup, and the agent goes silent for 2-3 seconds. Feels broken.
GPT-Realtime-2 solves this with preamble — it talks through its actions while executing them. "Let me check your calendar... I see you have a meeting with Alex Kim in 12 minutes." The tool call happens in parallel with the speech.
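On the client side, "in parallel" means you keep playing audio deltas the moment they arrive and kick off tool execution as a background task instead of awaiting it inline. A sketch, assuming the Realtime API's audio-transcript and function-call event names (check the docs for exact payload fields) and a hypothetical `run_tool` lookup:

```python
# Play narration immediately; run tool calls concurrently so speech
# never blocks on a lookup. Event type strings mirror the Realtime API.
import asyncio
import json

async def run_tool(name: str, args: dict) -> str:
    # Hypothetical tool — replace with a real calendar/database lookup.
    await asyncio.sleep(0.01)
    return json.dumps({"next_meeting": "Alex Kim, 12 minutes"})

async def handle_events(events):
    spoken, tool_tasks = [], []
    for event in events:
        if event["type"] == "response.audio_transcript.delta":
            spoken.append(event["delta"])  # narration streams out right away
        elif event["type"] == "response.function_call_arguments.done":
            # The tool runs in the background; narration above isn't blocked.
            tool_tasks.append(asyncio.create_task(
                run_tool(event["name"], json.loads(event["arguments"]))))
    results = await asyncio.gather(*tool_tasks)
    return "".join(spoken), results
```

The key design choice: `create_task` instead of `await`, so the event loop keeps draining audio deltas while the lookup is in flight.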
Other developer-relevant specs:
- 128K context window (up from 32K)
- Handles interruptions without losing context
- Better instruction following for system prompts
- Text tokens: $4/$16 per million (input/output)
- Audio tokens: $32/$64 per million
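Since audio tokens are 8x the price of text input, it's worth estimating session cost before you ship. A quick calculator from the rates above (token counts in the example are made up):

```python
# Per-million-token rates from the pricing list above, in USD.
PRICES = {
    "text_in": 4.00, "text_out": 16.00,
    "audio_in": 32.00, "audio_out": 64.00,
}

def session_cost(text_in: int, text_out: int,
                 audio_in: int, audio_out: int) -> float:
    """Total USD cost for one session's token usage."""
    usage = {"text_in": text_in, "text_out": text_out,
             "audio_in": audio_in, "audio_out": audio_out}
    return sum(PRICES[k] * n / 1_000_000 for k, n in usage.items())

# A short call: 2K text in, 1K text out, 30K audio in, 20K audio out
cost = session_cost(2_000, 1_000, 30_000, 20_000)  # ≈ $2.26
```

Audio dominates: in this example the text tokens contribute about a cent while the audio tokens account for nearly all of the total.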
GPT-Realtime-Translate: The $0.034/min Disruption
The translation model is priced at $0.034 per minute. For context, a human simultaneous interpreter costs $25-44 per minute.
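Working that gap out for a one-hour meeting makes the scale obvious:

```python
# Per-minute rates from the article: model vs. human interpreter range.
MODEL_PER_MIN = 0.034
HUMAN_PER_MIN = (25.0, 44.0)  # low / high end

minutes = 60
model_cost = MODEL_PER_MIN * minutes                # $2.04 for the hour
human_low, human_high = (r * minutes for r in HUMAN_PER_MIN)  # $1500-$2640
ratio_low = human_low / model_cost                  # ~735x cheaper, low end
```

Even against the cheapest human rate, the model comes in at well under 1% of the cost.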
Technical details:
- Processes raw audio end-to-end (not cascaded speech-to-text-to-speech)
- Preserves speaker emotion and tone
- Works best with brief pauses between thoughts (labeled "turn-based" in docs)
- Occasional hallucinations still occur
- Supports language switching mid-stream
The end-to-end approach is what makes the quality difference. Traditional pipelines lose vocal characteristics at every stage. This model skips text entirely.
GPT-Realtime-Whisper: Streaming Transcription
If you need real-time captions or meeting transcription, this is the model. Low-latency streaming output as the speaker talks.
What You Can Build
The three models together cover the full voice infrastructure stack:
- Customer support agents that can reason, look up accounts, and process requests — all by voice
- Real-time translation layers for international meetings at roughly 1/1000th the cost of human interpreters
- Live captioning systems for streaming, conferences, or accessibility
- Multilingual voice assistants that handle code-switching naturally
- Telephony bots via SIP integration that feel like talking to a person