I use GEMBA-MQM v2 to evaluate translation quality in my live speech-to-speech translation pipeline. MQM (Multidimensional Quality Metrics) is an open industry standard for grading translations. Instead of a single score, it classifies every error by type (mistranslation, omission, hallucination, grammar, etc.) and severity (critical, major, minor). It's what professional linguists use when they review translations manually.
GEMBA has an LLM perform the same annotation process. It prompts the model to read the source and translation side by side, find the errors, and tag each one with an MQM type and severity. You get the same structured error breakdown you'd get from a human reviewer, but automated. It ranked #1 on WMT24 by correlation with human MQM annotations.
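To make the "structured error breakdown" concrete, here is a minimal sketch of how a list of tagged errors can be collapsed into a single negative score. The severity weights (25 / 5 / 1), the `mqm_score` name, and the dictionary layout are assumptions for illustration, not something confirmed for GEMBA-MQM v2; check the linked code for the real values.

```python
# Hypothetical severity weights: assumed for illustration only.
SEVERITY_WEIGHTS = {"critical": 25, "major": 5, "minor": 1}


def mqm_score(errors):
    """Return a negative penalty score; 0 means no errors were tagged.

    `errors` is a list of dicts, each with a "type" and a "severity" key,
    mirroring the type/severity tags the judge assigns.
    """
    return -sum(SEVERITY_WEIGHTS[e["severity"]] for e in errors)


# One pass of a (hypothetical) judge over one segment.
annotations = [
    {"type": "mistranslation", "severity": "major"},
    {"type": "grammar", "severity": "minor"},
    {"type": "omission", "severity": "minor"},
]
print(mqm_score(annotations))  # -7
```

Under a scheme like this, a single extra "critical" tag swings the score a long way, which is one reason repeated passes over the same translation can disagree so much.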
The catch: LLM judges are noisy. On one English-to-German clip, 10 passes gave me scores from -29 to -109. Same translation, same model.
The fix is straightforward: run 10 passes per segment, drop outliers more than 2 standard deviations from the mean, then aggregate what's left with rank-reciprocal weighted averaging so the harshest outlier doesn't dominate. With that, the same clip settles at -41.9 across 10 passes.
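A rough sketch of that aggregation step, under stated assumptions: outliers are measured against the mean using the population standard deviation, and "rank-reciprocal" is read as sorting the surviving scores from mildest to harshest and weighting the score at rank r by 1/r. The function name, the ranking direction, and the sample scores are all hypothetical; the linked repo is the reference for the actual method.

```python
from statistics import mean, pstdev


def aggregate_passes(scores, z_cutoff=2.0):
    """Combine repeated judge scores for one segment into a single value."""
    if not scores:
        raise ValueError("need at least one score")

    mu = mean(scores)
    sigma = pstdev(scores)

    # Drop scores more than z_cutoff standard deviations from the mean.
    # If every pass agrees (sigma == 0), there is nothing to drop.
    if sigma > 0:
        kept = [s for s in scores if abs(s - mu) <= z_cutoff * sigma]
    else:
        kept = list(scores)

    # Assumed ranking: mildest score (closest to 0) gets rank 1,
    # so the harshest surviving score gets the smallest weight.
    ranked = sorted(kept, reverse=True)
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]

    return sum(w * s for w, s in zip(weights, ranked)) / sum(weights)


# Made-up scores for illustration, not the clip from this post.
passes = [-10, -10, -10, -10, -10, -10, -10, -10, -10, -1000]
print(aggregate_passes(passes))  # -10.0
```

Note that with only 10 passes a single extreme score inflates the standard deviation itself, so the 2-sigma filter only catches fairly dramatic outliers; the reciprocal weights are what soften the merely harsh ones.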
If you're using LLM-as-judge for anything, try running multiple passes. The variance will surprise you.
Full methodology: LLMs as translation judges: Inside GEMBA-MQM v2
Code: VoiceFrom/live-s2st-eval