
Yahya Saleh

Posted on • Originally published at voicefrom.ai

I benchmarked OpenAI's new GPT-Realtime-Translate against four other live translation systems

OpenAI shipped GPT-Realtime-Translate on May 8. It's their first model purpose-built for live speech translation, and it supports 70+ input languages.

I've been building a live translation pipeline at VoiceFrom, so I ran it through the same eval harness I use on our own system and on three competitors: Google Meet, LiveVoice, and Palabra. Same source audio, same scoring, eight language pairs.

How I scored it:

  • Accuracy: GEMBA-MQM v2, an LLM judge that annotates specific translation errors (type + severity) rather than giving a single score. Each segment gets 10 scoring passes, followed by outlier removal and rank-reciprocal weighted aggregation. Ranked #1 on WMT24.
  • Latency: Automated Ear-Voice Span, the time between when a source phrase is spoken and when the translation starts playing.
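
To make the two metrics concrete, here is a minimal Python sketch. It is not the harness's code. The function names, the trim-one-from-each-end outlier rule, and the choice to rank scores by distance from the median before applying 1/rank weights are all my assumptions about one plausible reading of "outlier removal" and "rank-reciprocal weighted aggregation". The latency helper assumes you already have aligned onset timestamps (in seconds) for each source phrase and its translation.

```python
from statistics import median


def aggregate_passes(scores, trim=1):
    """Combine repeated judge scores for one segment into a single value.

    Assumed scheme (hypothetical, not taken from the harness): drop the
    `trim` lowest and highest scores, rank the survivors by distance from
    their median, and weight each one by 1 / rank.
    """
    if len(scores) <= 2 * trim:
        raise ValueError("not enough passes to trim")
    kept = sorted(scores)[trim:len(scores) - trim]
    center = median(kept)
    # Rank 1 is the score closest to the median, so it gets the largest weight.
    ranked = sorted(kept, key=lambda s: abs(s - center))
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


def ear_voice_span(source_onsets, target_onsets):
    """Per-phrase delay between source speech onset and translated audio onset."""
    return [t - s for s, t in zip(source_onsets, target_onsets)]


def median_evs(source_onsets, target_onsets):
    """Median Ear-Voice Span across all aligned phrases."""
    return median(ear_voice_span(source_onsets, target_onsets))
```

Under this weighting, a single stray judge pass pulls the segment score less than it would in a plain mean, which is the usual motivation for aggregating multiple LLM-judge passes this way.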

What I found:

  • VoiceFrom Pro was more accurate than OpenAI in 6 out of 8 language pairs
  • OpenAI had a lower median latency than VoiceFrom (5.4s vs 7.3s)
  • Google Meet was fastest overall but had by far the worst accuracy
  • The accuracy gaps were much bigger than the latency gaps

The interesting tradeoff: OpenAI is fast but makes more errors. Google is fastest but the translations are often wrong. The platforms that take a bit longer tend to get the meaning right.

Full benchmark with charts and audio samples: Five platforms, one harness: a head-to-head live translation benchmark

The eval harness is open source if you want to run it on your own system: VoiceFrom/live-s2st-eval
