While OpenAI has been slow to release its GPT-4o voice assistant, other audio generation models have been launched one after another, and importantly, they are open-source.
Recently, Alibaba's Tongyi Lab made a significant move by launching FunAudioLLM, an open-source speech model project featuring two models: SenseVoice and CosyVoice.
https://fun-audio-llm.github.io/
SenseVoice focuses on high-precision multilingual speech recognition, emotion recognition, and audio event detection, supporting over 50 languages. It outperforms Whisper in accuracy, especially for Chinese and Cantonese, with improvements of over 50%.
It excels in emotion recognition and can detect common human-machine interaction events such as music, applause, laughter, crying, coughing, and sneezing, achieving state-of-the-art (SOTA) results in various tests.
CosyVoice specializes in natural speech generation, supporting multiple languages, tones, and emotional control. It can generate voices in Chinese, English, Japanese, Cantonese, and Korean, significantly outperforming traditional speech generation models.
With just 3-10 seconds of reference audio, CosyVoice can clone a voice, reproducing details such as rhythm and emotion, and it even supports cross-lingual speech generation.
CosyVoice also supports fine-grained control over the generated speech's emotions and rhythm using rich text or natural language, significantly enhancing the emotional expressiveness of the audio.
Let's dive into the uses and demonstrations of FunAudioLLM.
Multilingual Translation that Replicates Tone and Emotion
By combining SenseVoice, LLM, and CosyVoice, we can seamlessly perform speech-to-speech translation (S2ST). This integrated approach not only improves translation efficiency and fluency but also captures the emotions and tones in the original speech, reproducing these emotional nuances in the translated speech, making the conversation more authentic and engaging.
Whether it's multilingual conference interpreting, cross-cultural communication, or providing instant voice translation services for non-native speakers, this technology significantly reduces language barriers and communication losses.
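To make the pipeline concrete, here is a minimal Python sketch of how the three stages could be wired together. The `recognize`, `translate`, and `synthesize` callables are hypothetical wrappers around SenseVoice, an LLM, and CosyVoice; they are not actual APIs from the FunAudioLLM repositories.

```python
from typing import Callable, Tuple

def speech_to_speech_translate(
    audio_path: str,
    target_lang: str,
    recognize: Callable[[str], Tuple[str, str]],   # SenseVoice wrapper -> (transcript, emotion)
    translate: Callable[[str, str], str],          # LLM wrapper -> translated text
    synthesize: Callable[[str, str, str], bytes],  # CosyVoice wrapper -> waveform bytes
) -> bytes:
    # 1. Recognize the source speech and capture its emotion tag.
    transcript, emotion = recognize(audio_path)
    # 2. Translate the transcript into the target language with the LLM.
    translated = translate(transcript, target_lang)
    # 3. Synthesize the translation, reusing the source clip as a voice prompt
    #    and carrying the detected emotion across languages.
    return synthesize(translated, audio_path, emotion)
```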
Emotionally-Rich Voice Interaction
By integrating SenseVoice, large language models (LLMs), and CosyVoice, it is possible to develop an emotional voice chat application. When SenseVoice detects emotions, sentiments, or other paralinguistic information such as coughing, the LLM generates corresponding emotional responses. CosyVoice then synthesizes the appropriate emotional tone, creating a comfortable and natural conversational interaction.
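As a rough illustration (not code from the FunAudioLLM repositories), the emotion and paralinguistic tags that SenseVoice emits could be folded into the LLM prompt so the reply acknowledges them; the tag names and prompt wording below are assumptions.

```python
def build_reply_prompt(user_text: str, emotion: str, events: list[str]) -> str:
    """Fold SenseVoice-style emotion/event tags into the chat prompt so the LLM
    can respond empathetically and hint the desired tone to CosyVoice."""
    context = f"The user sounds {emotion.lower()}."
    if events:
        context += " Background sounds detected: " + ", ".join(events) + "."
    return (
        f"{context}\nUser said: {user_text}\n"
        "Reply with empathy, and start your answer with a one-word tone label "
        "for the speech synthesizer, e.g. [cheerful] or [concerned]."
    )

# Example: build_reply_prompt("I think I'm coming down with something", "SAD", ["Cough"])
```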
Interactive Podcast
By integrating SenseVoice, an LLM-based multi-agent system with real-time world knowledge, and CosyVoice, we can create an interactive podcast station. SenseVoice captures real-time conversations between AI podcasters and users, even recognizing ambient sounds and emotions. Users can interrupt the AI podcasters at any time to steer the discussion in different directions. CosyVoice generates the AI podcasters' voices, with control over multiple languages, tones, and emotions, offering listeners a rich and diverse auditory experience.
Audiobooks
Leveraging LLM's analytical capabilities to structure book content and identify emotions, combined with CosyVoice's speech synthesis, we can create highly expressive audiobooks. LLMs deeply understand the text, capturing every emotional nuance and story arc, while CosyVoice transforms these emotions into speech with specific emotional tones and emphasis. This provides listeners with a rich and emotionally engaging experience, turning audiobooks into an emotional and vivid auditory feast.
How FunAudioLLM Works
CosyVoice
CosyVoice is a large speech generation model based on speech quantization encoding. It discretizes speech and leverages large model technology to achieve natural and fluent speech generation. Compared to traditional speech generation technologies, CosyVoice excels in natural prosody and realistic voice timbre. It supports up to five languages and allows fine-grained control of the generated speech's emotions and other dimensions using natural language or rich text.
The research team offers the base model CosyVoice-300M, a fine-tuned version CosyVoice-300M-SFT, and a model supporting fine-grained control CosyVoice-300M-Instruct, catering to different use cases.
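For reference, the CosyVoice repository documents a Python interface along the following lines. This is a hedged sketch of that usage pattern; exact class names, speaker IDs, and sample rates may differ between versions, so check the repo README.

```python
# Sketch based on the usage pattern in the CosyVoice README; verify against
# the current repo, as the API may have changed.
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice

tts = CosyVoice('pretrained_models/CosyVoice-300M-SFT')   # or -300M / -300M-Instruct

# The SFT model ships with preset speakers; synthesize with one of them.
result = tts.inference_sft('你好，欢迎收听今天的节目。', '中文女')
torchaudio.save('output.wav', result['tts_speech'], 22050)
```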
Objective Metrics of Generated Speech
The research team evaluated the content consistency of synthesized audio using the open-source Chinese dataset Aishell3 and the English dataset LibriTTS through speech recognition tests.
Comparisons with the original audio and with the recently popular ChatTTS showed that CosyVoice's synthesized audio achieved higher content consistency and rarely exhibited hallucinations or inserted extra words.
CosyVoice effectively modeled the semantic information in the synthesized text, reaching a level comparable to human speakers. Additionally, re-scoring the synthesized audio further reduced recognition error rates, even surpassing human performance in content consistency and speaker similarity.
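The evaluation recipe itself is straightforward to reproduce in spirit: transcribe the synthesized audio with an ASR model and measure the error rate against the input text. Below is a small sketch using the jiwer library; `transcribe` is a hypothetical ASR wrapper (for example around SenseVoice or Whisper), not a FunAudioLLM API.

```python
import jiwer

def content_consistency(reference_text: str, synthesized_wav: str, transcribe) -> float:
    """Run ASR on the synthesized audio and score it against the input text.
    A lower error rate indicates higher content consistency."""
    hypothesis = transcribe(synthesized_wav)       # ASR transcript of the TTS output
    # Character error rate suits Chinese text; use jiwer.wer for English.
    return jiwer.cer(reference_text, hypothesis)
```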
Emotional Control Capability
The research team also assessed CosyVoice's emotional control capability using a pre-trained emotion classification model. This assessment focused on five high-expressiveness voice emotions: happiness, sadness, anger, fear, and disgust.
The test results indicated that CosyVoice-300M inherently possesses the ability to infer emotions from text content. The model trained with fine-grained control, CosyVoice-300M-Instruct, achieved higher scores in emotion classification, demonstrating stronger emotional control capability.
SenseVoice
SenseVoice is a foundational speech understanding model with capabilities covering automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It aims to provide comprehensive speech processing functions to support the construction of complex speech interaction systems.
SenseVoice-Small is a lightweight foundational speech model designed for fast speech understanding, suitable for latency-sensitive applications such as real-time speech interaction systems.
SenseVoice-Large is a comprehensive model focusing on more precise speech understanding, supporting more languages, and suitable for scenarios requiring high recognition accuracy.
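Both variants are distributed through FunASR. The snippet below is a hedged sketch of the loading pattern shown in the SenseVoice README; model IDs, arguments, and helper names may change across versions, so treat it as illustrative.

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Load SenseVoice-Small through FunASR (swap the model ID for the Large variant).
model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True, device="cuda:0")

res = model.generate(input="audio.wav", language="auto", use_itn=True)
# The raw transcript carries rich tags (language, emotion, audio events);
# this helper converts it into readable text.
print(rich_transcription_postprocess(res[0]["text"]))
```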
Multilingual Speech Recognition Performance
The research team compared SenseVoice and Whisper's multilingual recognition performance and inference efficiency on open-source datasets, including AISHELL-1, AISHELL-2, Wenetspeech, Librispeech, and Common Voice.
Emotional Speech Recognition Performance
SenseVoice can also be used for discrete emotion recognition, currently supporting emotions like happiness, sadness, anger, and neutrality. Evaluated on seven popular emotion recognition datasets, SenseVoice-Large achieved or surpassed SOTA results even without fine-tuning on target corpora.
Audio Event Detection Performance
Both SenseVoice-Small and SenseVoice-Large can detect audio events in speech, including music, applause, and laughter. SenseVoice-Large can also precisely identify the start and end times of these events.
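If you only need the event labels, the rich transcript can be post-processed directly. The snippet below assumes events and other metadata appear as <|...|> tags (check the SenseVoice documentation for the exact token format); the regex simply collects every such tag.

```python
import re

def extract_tags(rich_text: str) -> list[str]:
    """Collect every <|...|> tag (language, emotion, and audio-event markers)."""
    return re.findall(r"<\|([^|]+)\|>", rich_text)

# Example (tag names are illustrative):
# extract_tags("<|en|><|HAPPY|><|Laughter|>That was hilarious!")
# -> ['en', 'HAPPY', 'Laughter']
```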
Currently, models related to SenseVoice and CosyVoice have been open-sourced on ModelScope and Huggingface, with corresponding training, inference, and fine-tuning codes released on GitHub.
For more details, check out the links below:
• FunAudioLLM: https://github.com/FunAudioLLM
• CosyVoice Repository: https://github.com/FunAudioLLM/CosyVoice
• SenseVoice Repository: https://github.com/FunAudioLLM/SenseVoice