Voice cloning models, measured across five languages

#swift #audio #ai #tts

I benchmarked local voice-cloning models across English, German, Modern Standard Arabic, Spanish, and Mandarin Chinese.

Models:

OmniVoice int8
Chatterbox Multilingual fp16
VoxCPM2 bf16
Fish Audio S2 Pro fp16

The benchmark uses references from Google's FLEURS dataset. Each row includes reference audio, generated audio, speaker similarity, word/character error rate (WER/CER), generated audio length, and real-time factor (RTF).
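For readers unfamiliar with the metrics, here is a minimal Python sketch of how WER, CER, and RTF are commonly defined. It is not the benchmark's actual code: the function names are hypothetical, no text normalization (casing, punctuation) is applied, CER here ignores spaces, and RTF is taken as synthesis time divided by generated audio length, which is one common convention. Speaker similarity is left out because it depends on the embedding model used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate. Assumption: whitespace is dropped before comparing."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor. Assumption: compute time / generated audio length,
    so values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

In a pipeline like this, the hypothesis text would typically come from running a speech recognizer on the generated audio and comparing it with the prompt text. CER is usually the more meaningful of the two for Mandarin, which is not whitespace-segmented.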

Main result in this run: OmniVoice had the strongest all-around results across its rows. VoxCPM2 bf16 was especially strong on Arabic speaker similarity. Fish Audio S2 Pro showed strong speaker similarity on German and Arabic but a slower RTF. Chatterbox Multilingual was competitive on Arabic and Spanish.

This is not a human mean opinion score (MOS) study. It is an engineering benchmark for comparing model behavior inside one local speech stack.

Full post with the table and audio samples:

https://www.soniqo.audio/blog/voice-cloning-benchmarks

Speech Studio, the desktop app using the same stack:

https://www.soniqo.audio/speech-studio