DEV Community

Evan-dong

OpenAI's New Realtime Voice Models Can Think, Translate, and Transcribe — Here's What Developers Need to Know

OpenAI just shipped three realtime voice models through their API. One reasons at GPT-5 level during live calls. One translates 70+ languages in real time. One does streaming transcription. All available today.

Let me break down what matters for developers.

The Three Models

GPT-Realtime-2 handles voice conversations with GPT-5-level reasoning. The key difference from previous voice models: it can call tools mid-conversation without going silent. It narrates what it's doing while executing — OpenAI calls this "preamble."

GPT-Realtime-Translate does real-time voice translation. 70+ input languages, 13 output languages. End-to-end audio processing (no intermediate text step), which preserves tone and emotion.

GPT-Realtime-Whisper is streaming speech-to-text. Words appear as the speaker talks. Built for live captions and meeting transcription.

Integration Options

All three models connect through the Realtime API, which offers three transport options:

  • WebRTC — browser-based, lowest latency
  • WebSocket — server-side, more control
  • SIP — telephony integration
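For the server-side WebSocket path, the request is just a URL plus an auth header. Here's a minimal sketch of building that request — the endpoint follows the Realtime API's established convention, but the `gpt-realtime-2` model identifier is taken from this article and should be checked against the current model list before use.

```python
# Sketch: constructing a server-side Realtime API WebSocket request.
# No connection is opened here; this only builds the URL and headers.
import os

REALTIME_URL = "wss://api.openai.com/v1/realtime"

def build_connection(model: str, api_key: str):
    """Return the URL and headers for a WebSocket Realtime session."""
    url = f"{REALTIME_URL}?model={model}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        # Beta header historically required by the Realtime API; may be
        # unnecessary on current versions.
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers

url, headers = build_connection(
    "gpt-realtime-2",                              # model name assumed from the article
    os.environ.get("OPENAI_API_KEY", "sk-test"),
)
```

Pass `url` and `headers` to any WebSocket client (e.g. `websockets` in Python) to open the session.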

GPT-Realtime-2: Voice Agents That Actually Work

If you've built voice agents before, you know the pain: tool calls create dead air. The user asks something that requires a database lookup, and the agent goes silent for 2-3 seconds. Feels broken.

GPT-Realtime-2 solves this with preamble — it talks through its actions while executing them. "Let me check your calendar... I see you have a meeting with Alex Kim in 12 minutes." The tool call happens in parallel with the speech.
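The pattern behind preamble is just concurrency: kick off the tool call, speak while it runs, then await the result. Here's an illustrative asyncio sketch — `speak` and `lookup_calendar` are hypothetical stand-ins, not OpenAI APIs, and the model does this orchestration for you; this only shows the shape of the behavior.

```python
# Illustrative sketch of the "preamble" pattern: speak while a tool call runs.
# speak() and lookup_calendar() are hypothetical stand-ins, not OpenAI APIs.
import asyncio

events = []  # record of what happened, in order

async def speak(text: str):
    events.append(("speech", text))

async def lookup_calendar(user_id: str):
    await asyncio.sleep(0.05)  # simulate a slow database/tool call
    events.append(("tool_result", user_id))
    return {"next_meeting": "Alex Kim, 12 minutes"}

async def handle_turn():
    # Start the tool call and the preamble concurrently: no dead air.
    tool_task = asyncio.create_task(lookup_calendar("user-123"))
    await speak("Let me check your calendar...")
    result = await tool_task
    await speak(f"I see you have a meeting coming up: {result['next_meeting']}.")

asyncio.run(handle_turn())
```

The key line is `asyncio.create_task`: the lookup starts immediately, but the preamble speech goes out first because the task hasn't finished yet.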

Other developer-relevant specs:

  • 128K context window (up from 32K)
  • Handles interruptions without losing context
  • Better instruction following for system prompts
  • Text tokens: $4/$16 per million (input/output)
  • Audio tokens: $32/$64 per million
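Those per-million-token prices are easier to reason about as a quick cost function. The token counts in the example are made up for illustration:

```python
# Back-of-envelope cost estimate using the per-million-token prices above.
TEXT_IN, TEXT_OUT = 4.00, 16.00     # $ per million text tokens
AUDIO_IN, AUDIO_OUT = 32.00, 64.00  # $ per million audio tokens

def call_cost(audio_in_tok, audio_out_tok, text_in_tok=0, text_out_tok=0):
    """Total cost in dollars for one call's token usage."""
    return (audio_in_tok * AUDIO_IN + audio_out_tok * AUDIO_OUT
            + text_in_tok * TEXT_IN + text_out_tok * TEXT_OUT) / 1_000_000

# Hypothetical call: 50k audio input tokens, 20k audio output tokens
print(f"${call_cost(50_000, 20_000):.2f}")  # → $2.88
```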

GPT-Realtime-Translate: The $0.034/min Disruption

The translation model is priced at $0.034 per minute. For context, a human simultaneous interpreter costs $25-44 per minute.
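Putting that gap in concrete terms — a one-hour meeting, model versus human interpreter, using the figures above:

```python
# One-hour meeting: model translation vs. human simultaneous interpreter.
MODEL_PER_MIN = 0.034
HUMAN_PER_MIN = (25, 44)  # low/high end of the human-interpreter range

minutes = 60
model_cost = minutes * MODEL_PER_MIN
human_low, human_high = (minutes * rate for rate in HUMAN_PER_MIN)

print(f"model: ${model_cost:.2f}, human: ${human_low}-{human_high}")
# → model: $2.04, human: $1500-2640
```

That's roughly a 700–1300x price difference, before factoring in scheduling and availability.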

Technical details:

  • Processes raw audio end-to-end (not cascaded speech-to-text-to-speech)
  • Preserves speaker emotion and tone
  • Works best with brief pauses between thoughts (labeled "turn-based" in docs)
  • Occasional hallucinations still occur
  • Supports language switching mid-stream

The end-to-end approach is what makes the quality difference. Traditional pipelines lose vocal characteristics at every stage. This model skips text entirely.

GPT-Realtime-Whisper: Streaming Transcription

If you need real-time captions or meeting transcription, this is the model. Low-latency streaming output as the speaker talks.
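Consuming a streaming transcript comes down to appending partial ("delta") chunks as they arrive and re-rendering. The event shapes below are simplified placeholders for illustration, not the Realtime API's actual event schema:

```python
# Sketch: accumulate streaming transcript deltas into a live caption line.
# Event dicts here are simplified placeholders, not real API event types.
def consume(events):
    transcript = ""
    for ev in events:
        if ev["type"] == "delta":
            transcript += ev["text"]
            # In a captioning UI, re-render the current line here.
    return transcript

stream = [
    {"type": "delta", "text": "Words appear "},
    {"type": "delta", "text": "as the speaker "},
    {"type": "delta", "text": "talks."},
]
print(consume(stream))  # → Words appear as the speaker talks.
```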

What You Can Build

The three models together cover the full voice infrastructure stack:

  • Customer support agents that can reason, look up accounts, and process requests — all by voice
  • Real-time translation layers for international meetings at 1/1000th the cost of human interpreters
  • Live captioning systems for streaming, conferences, or accessibility
  • Multilingual voice assistants that handle code-switching naturally
  • Telephony bots via SIP integration that feel like talking to a person
