I benchmarked OpenAI's new GPT-Realtime-Translate against four other live translation systems

#ai #opensource #openai

OpenAI shipped GPT-Realtime-Translate on May 8. It's their first model purpose-built for live speech translation, and it supports 70+ input languages.

I've been building a live translation pipeline at VoiceFrom, so I ran it through the same eval harness I use on our own system and on three competitors: Google Meet, LiveVoice, and Palabra. Same source audio, same scoring, eight language pairs.

How I scored it:

Accuracy: GEMBA-MQM v2, an LLM judge that annotates specific translation errors (type + severity) rather than giving a single score. Each segment gets 10 scoring passes, followed by outlier removal and rank-reciprocal weighted aggregation. The metric ranked #1 on WMT24.
Latency: Automated Ear-Voice Span, the time between when a source phrase is spoken and when the translation starts playing.
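
To make the two metrics concrete, here is a minimal Python sketch of the arithmetic. It is not the harness code. The severity weights, the IQR outlier rule, the choice to rank passes by distance from the median, and every function name below are assumptions for illustration, and the numbers are made up.

```python
from statistics import median, quantiles

# Assumed severity weights (conventional MQM values), not taken from the harness.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}


def pass_penalty(errors):
    """Sum severity weights for one judge pass; errors are (type, severity) pairs."""
    return sum(SEVERITY_WEIGHTS[severity] for _kind, severity in errors)


def drop_outliers(values, k=1.5):
    """Keep values inside the Tukey fences (an assumed outlier rule)."""
    if len(values) < 4:
        return list(values)
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]


def rank_reciprocal_mean(values):
    """Weighted mean with weight 1/rank; rank 1 is the value closest to the median (assumed)."""
    center = median(values)
    ordered = sorted(values, key=lambda v: abs(v - center))  # ties keep input order
    weights = [1.0 / rank for rank in range(1, len(ordered) + 1)]
    return sum(w * v for w, v in zip(weights, ordered)) / sum(weights)


def segment_penalty(pass_penalties):
    """Aggregate the per-pass penalties of one segment into a single number."""
    return rank_reciprocal_mean(drop_outliers(pass_penalties))


def ear_voice_spans(source_onsets, translation_onsets):
    """Per-phrase delay in seconds between source onset and translated audio onset."""
    return [t - s for s, t in zip(source_onsets, translation_onsets)]


# Made-up example: 10 judge passes for one segment, one of them an obvious outlier.
passes = [5, 5, 6, 5, 4, 5, 6, 5, 5, 30]
score = segment_penalty(passes)

# Made-up example: onsets in seconds for three phrases.
spans = ear_voice_spans([0.0, 4.0, 9.0], [5.0, 9.5, 16.0])
median_evs = median(spans)
```

With this weighting, passes that agree with the consensus dominate the segment score, and a single stray pass (the 30 above) is dropped before it can skew the result. A lower penalty means a better translation.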

What I found:

VoiceFrom Pro was more accurate than OpenAI in 6 out of 8 language pairs
OpenAI had lower median latency than VoiceFrom (5.4s vs 7.3s)
Google Meet was fastest overall but had by far the worst accuracy
The accuracy gaps were much bigger than the latency gaps

The interesting tradeoff: OpenAI is fast but makes more errors. Google is fastest but the translations are often wrong. The platforms that take a bit longer tend to get the meaning right.

Full benchmark with charts and audio samples: Five platforms, one harness: a head-to-head live translation benchmark

The eval harness is open source if you want to run it on your own system: VoiceFrom/live-s2st-eval