I benchmarked local voice-cloning models across English, German, Modern Standard Arabic, Spanish, and Mandarin Chinese.
Models:
- OmniVoice int8
- Chatterbox Multilingual fp16
- VoxCPM2 bf16
- Fish Audio S2 Pro fp16
The benchmark uses Google FLEURS references. Each row includes reference audio, generated audio, speaker similarity, WER/CER, generated audio length, and RTF.
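For readers unfamiliar with the per-row metrics, here is a minimal sketch of how they are commonly computed. It is not the benchmark's actual code, and every function name below is hypothetical. It assumes WER and CER are edit distance divided by reference length, RTF is synthesis time divided by generated audio duration (lower is faster), and speaker similarity is cosine similarity between two speaker-embedding vectors. The post does not say which ASR model, text normalization, or speaker encoder was used, so those steps are left out.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (assumed normalization)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor, assumed here as compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm
```

CER is typically the more meaningful of the two error rates for Mandarin, where text is not space-delimited, while WER suits the other four languages.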
Main result in this run: OmniVoice had the strongest all-around results. VoxCPM2 bf16 was especially strong on Arabic speaker match. Fish Audio S2 Pro showed strong German and Arabic speaker similarity but a slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.
This is not a human MOS study. It is an engineering benchmark for comparing model behavior inside one local speech stack.
Full post with the table and audio samples:
https://www.soniqo.audio/blog/voice-cloning-benchmarks
Speech Studio, the desktop app using the same stack: