OpenAI shipped GPT-Realtime-Translate on May 8. It's their first model purpose-built for live speech translation, and it supports 70+ input languages.
I've been building a live translation pipeline at VoiceFrom, so I ran it through the same eval harness I use on our own system and on three other platforms: Google Meet, LiveVoice, and Palabra. Same source audio, same scoring, eight language pairs.
How I scored it:
- Accuracy: GEMBA-MQM v2, an LLM judge that annotates specific translation errors (type + severity) rather than giving a single score. Each segment gets 10 scoring passes, followed by outlier removal and rank-reciprocal weighted aggregation. It ranked #1 on WMT24.
- Latency: Automated Ear-Voice Span, the time between when a source phrase is spoken and when the translation starts playing.
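To make the two metrics concrete, here is a minimal sketch of how the per-segment aggregation and the latency measurement could look. It is not the code from the harness. The outlier rule (median absolute deviation with a cutoff of 3) and the ranking used for the reciprocal weights (closest to the median gets rank 1) are my assumptions for illustration, and the function names and the `mad_cutoff` parameter are hypothetical.

```python
from statistics import median


def aggregate_passes(scores, mad_cutoff=3.0):
    """Combine repeated judge scores for one segment into a single value.

    Assumed procedure (illustrative, not necessarily what the harness does):
    1. drop passes far from the median, measured in MADs
    2. rank the remaining passes by closeness to the median
    3. weight each pass by 1 / rank and take the weighted mean
    """
    if not scores:
        raise ValueError("need at least one score")

    center = median(scores)
    # Median absolute deviation: a robust estimate of spread.
    mad = median(abs(s - center) for s in scores)

    if mad > 0:
        kept = [s for s in scores if abs(s - center) <= mad_cutoff * mad]
    else:
        # More than half the passes agree exactly; keep everything.
        kept = list(scores)

    # Rank 1 is the pass closest to the median, so it gets the largest weight.
    ranked = sorted(kept, key=lambda s: abs(s - center))
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


def ear_voice_span(source_onsets, translation_onsets):
    """Per-phrase delay in seconds: translation onset minus source onset."""
    return [t - s for s, t in zip(source_onsets, translation_onsets)]


# Ten hypothetical passes for one segment, with one obvious outlier (-40).
passes = [-5, -5, -6, -4, -5, -5, -6, -5, -4, -40]
segment_score = aggregate_passes(passes)

# Hypothetical onset times in seconds for three phrases.
spans = ear_voice_span([0.0, 4.0, 9.0], [5.0, 9.5, 14.0])
median_latency = median(spans)
```

In this sketch the `-40` pass is discarded before weighting, so a single runaway judge response cannot drag the segment score down, and the latency figure is the median of the per-phrase spans rather than the mean, which keeps one slow phrase from dominating.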
What I found:
- VoiceFrom Pro was more accurate than OpenAI in 6 out of 8 language pairs
- OpenAI had a lower median latency than VoiceFrom (5.4s vs 7.3s)
- Google Meet was fastest overall but had by far the worst accuracy
- The accuracy gaps were much bigger than the latency gaps
The interesting tradeoff: OpenAI is fast but makes more errors. Google is fastest but the translations are often wrong. The platforms that take a bit longer tend to get the meaning right.
Full benchmark with charts and audio samples: Five platforms, one harness: a head-to-head live translation benchmark
The eval harness is open source if you want to run it on your own system: VoiceFrom/live-s2st-eval