CosyVoice 2025 Complete Guide: The Ultimate Multi-lingual Text-to-Speech Solution

🎯 Core Highlights (TL;DR)

  • State-of-the-art Performance: Fun-CosyVoice 3.0 achieves industry-leading content consistency (0.81% CER) and speaker similarity (77.4%) with only 0.5B parameters
  • Extensive Language Support: Covers 9 major languages and 18+ Chinese dialects with zero-shot voice cloning capability
  • Production-Ready Features: Bi-streaming support with ultra-low latency (150ms), pronunciation inpainting, and instruction-based control
  • Open-Source & Scalable: Fully open-source with complete training/inference/deployment pipeline and multiple runtime options (vLLM, TensorRT-LLM)

Table of Contents

  1. What is CosyVoice?
  2. Key Features and Capabilities
  3. Model Versions Comparison
  4. Performance Benchmarks
  5. Installation and Setup
  6. Usage Guide
  7. Deployment Options
  8. Best Practices
  9. FAQ

What is CosyVoice?

CosyVoice is an advanced Large Language Model (LLM)-based Text-to-Speech (TTS) system developed by FunAudioLLM. It represents a significant leap in zero-shot multilingual speech synthesis technology, enabling natural voice generation across multiple languages without requiring extensive training data for each speaker.

Evolution Timeline

The CosyVoice family has evolved through three major versions:

  • CosyVoice 1.0 (July 2024): Initial release with 300M parameters, establishing the foundation for scalable multilingual TTS
  • CosyVoice 2.0 (December 2024): Introduced streaming capabilities with 0.5B parameters and enhanced LLM architecture
  • Fun-CosyVoice 3.0 (December 2025): Current state-of-the-art with reinforcement learning optimization and in-the-wild speech generation

πŸ’‘ Expert Insight
CosyVoice 3.0's use of supervised semantic tokens and flow matching training enables it to achieve human-like speech quality while maintaining computational efficiencyβ€”a critical balance for production deployments.

Key Features and Capabilities

🌍 Language Coverage

Supported Languages:

  • 9 Major Languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
  • 18+ Chinese Dialects: Cantonese (Guangdong), Minnan, Sichuan, Dongbei (Northeastern Mandarin), Shaanxi, Shanxi, Shanghainese, Tianjin, Shandong, Ningxia, Gansu, and more

Cross-lingual Capabilities:

  • Zero-shot voice cloning across different languages
  • Multi-lingual speech synthesis from a single prompt
  • Accent-preserving voice conversion

🎯 Advanced Technical Features

| Feature | Description | Use Case |
| --- | --- | --- |
| Pronunciation Inpainting | Support for Chinese Pinyin and English CMU phonemes | Precise control over pronunciation for brand names, technical terms |
| Bi-Streaming | Text-in and audio-out streaming | Real-time applications with 150ms latency |
| Instruct Support | Control language, dialect, emotion, speed, volume | Dynamic voice customization |
| Text Normalization | Automatic handling of numbers, symbols, formats | No frontend module required |
| RAS Inference | Repetition Aware Sampling for LLM stability | Prevents audio artifacts and repetitions |

πŸš€ Performance Characteristics

  • Latency: as low as 150ms in streaming mode
  • Model Size: 0.5B parameters (Fun-CosyVoice3)
  • Token Rate: 25Hz supervised semantic speech tokens (the synthesized audio itself is output at a full sampling rate, 24kHz in CosyVoice 2/3)
  • Streaming: KV cache + SDPA optimization
  • Acceleration: up to 4x speedup with TensorRT-LLM

⚠️ Important Note
While CosyVoice 3.0 offers impressive capabilities, optimal performance requires GPU acceleration. CPU-only inference may result in significantly slower generation times.
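
A quick pre-flight check saves debugging time here. This is plain PyTorch, nothing CosyVoice-specific:

import torch

# Warn early if inference is about to fall back to CPU.
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("WARNING: no CUDA device found; generation will be much slower on CPU.")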

Model Versions Comparison

Available Models

| Model | Parameters | Best For | Key Advantage |
| --- | --- | --- | --- |
| Fun-CosyVoice3-0.5B-2512 | 0.5B | Production use, best overall quality | SOTA performance with RL optimization |
| Fun-CosyVoice3-0.5B-2512_RL | 0.5B | Maximum accuracy | Lowest CER (0.81%) and WER (1.68%) |
| CosyVoice2-0.5B | 0.5B | Streaming applications | Optimized for real-time synthesis |
| CosyVoice-300M | 300M | Resource-constrained environments | Smaller footprint, good quality |
| CosyVoice-300M-SFT | 300M | Supervised fine-tuning tasks | Pre-trained for specific voice styles |
| CosyVoice-300M-Instruct | 300M | Instruction-based synthesis | Enhanced control capabilities |

Version Selection Guide

graph TD
    A[Choose CosyVoice Model] --> B{Primary Use Case?}
    B -->|Best Quality| C[Fun-CosyVoice3-0.5B-2512_RL]
    B -->|Production Balance| D[Fun-CosyVoice3-0.5B-2512]
    B -->|Real-time Streaming| E[CosyVoice2-0.5B]
    B -->|Limited Resources| F[CosyVoice-300M]
    B -->|Custom Instructions| G[CosyVoice-300M-Instruct]

Performance Benchmarks

Comprehensive Evaluation Results

The following table compares Fun-CosyVoice 3.0 against leading open-source and closed-source TTS systems:

| Model | Open-Source | Size | test-zh CER (%) ↓ | test-zh Speaker Sim (%) ↑ | test-en WER (%) ↓ | test-en Speaker Sim (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Sim (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |

Key Performance Insights

βœ… Best Practices

  • Content Accuracy: Fun-CosyVoice3 RL achieves 0.81% CER on the Chinese test set, outperforming models three times its size (a self-contained CER sketch follows this list)
  • Speaker Similarity: the base model's 78.0% test-zh similarity exceeds the human reference of 75.5%
  • Challenging Scenarios: 5.44% CER on the hard test set demonstrates robust handling of complex speech patterns
  • Efficiency: achieves SOTA results with only 0.5B parameters vs. 1.5B+ competitors
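
For intuition about the headline metric: CER is simply character-level edit distance divided by the reference length. A minimal, self-contained sketch (not the evaluation toolkit used in the paper):

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    d = list(range(len(hyp) + 1))          # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i               # 'prev' tracks the diagonal cell
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution / match
    return d[-1] / len(ref)

print(cer("cosyvoice", "cozyvoice"))  # 1 edit over 9 chars ≈ 0.111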

Installation and Setup

Prerequisites

  • Operating System: Linux (Ubuntu/CentOS recommended)
  • Python Version: 3.10
  • GPU: NVIDIA GPU with CUDA support (recommended for optimal performance)
  • Conda: Miniconda or Anaconda

Step-by-Step Installation

1. Clone the Repository

git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# If submodule cloning fails due to network issues
git submodule update --init --recursive

2. Create Conda Environment

conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
# The Aliyun index flags are optional; they mainly speed up installs from China
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

3. Install System Dependencies

# Ubuntu
sudo apt-get install sox libsox-dev

# CentOS
sudo yum install sox sox-devel

4. Download Pre-trained Models

For Hugging Face Users (Recommended for International Users):

from huggingface_hub import snapshot_download

# Download Fun-CosyVoice 3.0 (Recommended)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', 
                   local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Download CosyVoice 2.0
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', 
                   local_dir='pretrained_models/CosyVoice2-0.5B')

# Download text normalization resources
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', 
                   local_dir='pretrained_models/CosyVoice-ttsfrd')

For ModelScope Users (China Region):

from modelscope import snapshot_download

snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', 
                   local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', 
                   local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-ttsfrd', 
                   local_dir='pretrained_models/CosyVoice-ttsfrd')

5. Optional: Install Enhanced Text Normalization

For improved text normalization (especially for Chinese):

cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl

πŸ’‘ Pro Tip
If you skip the ttsfrd installation, CosyVoice will automatically fall back to WeTextProcessing. While functional, ttsfrd provides better accuracy for number and symbol normalization.

Usage Guide

Quick Start with Web Demo

The fastest way to experience CosyVoice:

# Launch web interface
python3 webui.py --port 50000 --model_dir pretrained_models/Fun-CosyVoice3-0.5B

# For instruct mode
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct

Access the interface at http://localhost:50000

Python API Usage

Basic Inference Example

import sys
sys.path.append('third_party/Matcha-TTS')  # bundled submodule, needed by the imports below
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
import torchaudio

# Initialize model
cosyvoice = CosyVoice('pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning: load the reference audio resampled to 16kHz
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)
text = "Hello, this is a test of CosyVoice zero-shot synthesis."

for i, output in enumerate(cosyvoice.inference_zero_shot(
        text,
        "Reference text spoken in the audio",
        prompt_speech_16k)):
    # Each yielded item is a dict; 'tts_speech' holds the waveform tensor
    torchaudio.save(f'zero_shot_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
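
The same entry point can also yield audio incrementally. A minimal sketch, assuming the stream=True flag behaves as documented for CosyVoice 2 and reusing the objects from the example above:

import torch

# Streaming synthesis: chunks are yielded as they are generated
chunks = []
for output in cosyvoice.inference_zero_shot(
        text,
        "Reference text spoken in the audio",
        prompt_speech_16k,
        stream=True):
    chunks.append(output['tts_speech'])  # hand each chunk to your audio player here

torchaudio.save('streamed.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)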

Advanced Usage: Instruction-Based Synthesis

# Control emotion, speed, and delivery (requires an Instruct-capable model,
# e.g. CosyVoice-300M-Instruct). The speaker id must be one of the model's
# built-in voices; inspect them with cosyvoice.list_available_spks()
for i, output in enumerate(cosyvoice.inference_instruct(
        "Your text here",
        spk_id='中文女',
        instruct_text="Speak with excitement at a moderate pace.")):
    torchaudio.save(f'instruct_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
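
Cross-lingual cloning has its own entry point. A sketch assuming the inference_cross_lingual signature documented for CosyVoice 2: the reference audio is in one language, the target text in another, and the speaker's voice carries over.

# Clone the reference speaker's voice into Chinese output
for i, output in enumerate(cosyvoice.inference_cross_lingual(
        '在一无所知中，梦里的一天结束了，一个新的「轮回」便会开始。',
        prompt_speech_16k)):
    torchaudio.save(f'cross_lingual_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)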

vLLM Acceleration (CosyVoice 2.0)

For maximum inference speed with CosyVoice 2.0:

Setup vLLM Environment

# Create separate environment for vLLM
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
pip install vllm==v0.9.0 transformers==4.51.3

Run vLLM Inference

python vllm_example.py

⚠️ Compatibility Note
vLLM v0.9.0 requires specific versions of PyTorch (2.7.0) and Transformers (4.51.3). Ensure your hardware supports these requirements before installation.

Deployment Options

Docker Deployment (Recommended for Production)

gRPC Server Deployment

cd runtime/python
docker build -t cosyvoice:v1.0 .

# Launch gRPC server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --max_conc 4 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with client
cd grpc
python3 client.py --port 50000 --mode zero_shot

FastAPI Server Deployment

# Launch FastAPI server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with client
cd fastapi
python3 client.py --port 50000 --mode sft
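
Beyond the bundled client, a caller can be as simple as an HTTP request that consumes the streamed audio bytes. The endpoint name and parameters below are illustrative assumptions, not the confirmed API; the actual routes live in runtime/python/fastapi/server.py and client.py:

import requests

# NOTE: '/inference_sft' and its parameters are assumptions for illustration;
# check runtime/python/fastapi/server.py for the routes the server really exposes.
response = requests.get(
    "http://localhost:50000/inference_sft",
    params={"tts_text": "Hello from CosyVoice.", "spk_id": "中文女"},
    stream=True,
)
with open("output.pcm", "wb") as f:
    for chunk in response.iter_content(chunk_size=16000):
        f.write(chunk)  # raw audio bytes, written as they stream in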

TensorRT-LLM Deployment (4x Acceleration)

For maximum performance with CosyVoice 2.0:

cd runtime/triton_trtllm
docker compose up -d

Performance Comparison:

| Runtime | Relative Speed | Use Case |
| --- | --- | --- |
| HuggingFace Transformers | 1x (baseline) | Development, testing |
| vLLM | 2-3x | Production with moderate load |
| TensorRT-LLM | 4x | High-throughput production |
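
When comparing runtimes yourself, the number that matters for interactive use is time to first audio chunk rather than total synthesis time. A simple harness, reusing the cosyvoice, text, and prompt_speech_16k objects from the usage section:

import time

start = time.time()
for i, output in enumerate(cosyvoice.inference_zero_shot(
        text, "Reference text spoken in the audio", prompt_speech_16k, stream=True)):
    if i == 0:
        # Latency perceived by a listener in a streaming application
        print(f"time to first chunk: {time.time() - start:.3f}s")
print(f"total synthesis time: {time.time() - start:.3f}s")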

βœ… Deployment Best Practice

  • Development: Use web demo or Python API
  • Small-scale Production: FastAPI with Docker
  • Large-scale Production: TensorRT-LLM with load balancing
  • Real-time Applications: vLLM or TensorRT-LLM with streaming

Best Practices

Model Selection Strategy

πŸ“Š Decision Framework:

1. Quality Priority β†’ Fun-CosyVoice3-0.5B-2512_RL
2. Balanced Performance β†’ Fun-CosyVoice3-0.5B-2512
3. Real-time Streaming β†’ CosyVoice2-0.5B + vLLM
4. Resource Constraints β†’ CosyVoice-300M
5. Custom Control β†’ CosyVoice-300M-Instruct

Optimization Tips

For Low Latency

  1. Enable Streaming Mode: Use bi-streaming for text-in and audio-out
  2. KV Cache: Ensure KV cache is enabled in inference config
  3. SDPA Optimization: Utilize Scaled Dot-Product Attention
  4. Batch Processing: Group similar-length inputs (a sketch follows this list)
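
The batching tip can be as simple as ordering inputs by length before forming batches; a generic sketch:

def make_batches(texts, batch_size=8):
    # Sorting by length keeps sequence lengths similar within a batch,
    # which reduces padding waste during LLM inference.
    ordered = sorted(texts, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

texts = ["Hi.", "Welcome back!", "This is a much longer sentence for synthesis."]
print(make_batches(texts, batch_size=2))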

For High Quality

  1. Use RL Model: Fun-CosyVoice3-0.5B-2512_RL for maximum accuracy
  2. Provide Clear Prompts: High-quality reference audio (3-10 seconds; a validation sketch follows this list)
  3. Text Normalization: Install ttsfrd for better preprocessing
  4. Pronunciation Control: Use pinyin/phoneme inpainting for critical terms
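
A reference clip can be sanity-checked against these guidelines before synthesis. A minimal sketch using torchaudio metadata (the 3-10 second window mirrors tip 2 above):

import torchaudio

def check_prompt(path, min_s=3.0, max_s=10.0):
    # Validate duration and channel count of a reference clip.
    info = torchaudio.info(path)
    duration = info.num_frames / info.sample_rate
    assert min_s <= duration <= max_s, f"duration {duration:.1f}s outside {min_s}-{max_s}s"
    assert info.num_channels == 1, "use a mono, single-speaker recording"
    return duration

print(f"prompt OK: {check_prompt('path/to/reference_audio.wav'):.1f}s")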

For Multilingual Applications

  1. Language-Specific Prompts: Provide reference audio in target language
  2. Cross-lingual Cloning: Use instruct mode to specify target language
  3. Dialect Support: Leverage 18+ Chinese dialect capabilities
  4. Mixed Language: Segment text by language for optimal results (a rough splitting heuristic follows this list)
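
One simple way to segment mixed text is by character script. This heuristic splits runs of CJK characters from everything else so each segment can be synthesized with a language-appropriate prompt; it is a rough sketch, not a full language detector:

import re

def split_by_script(text):
    # CJK ideographs plus common Chinese punctuation vs. everything else
    pattern = r'[\u4e00-\u9fff，。！？]+|[^\u4e00-\u9fff，。！？]+'
    return [seg.strip() for seg in re.findall(pattern, text) if seg.strip()]

print(split_by_script("今天我们发布了 CosyVoice, an open-source TTS model"))
# ['今天我们发布了', 'CosyVoice, an open-source TTS model']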

Common Pitfalls to Avoid

⚠️ Warning: Common Issues

  1. Insufficient GPU Memory: 0.5B models require ~8GB VRAM minimum
  2. Poor Reference Audio: Background noise or multiple speakers degrade cloning
  3. Text Format Issues: Ensure proper encoding (UTF-8) for non-English text
  4. Version Mismatch: vLLM compatibility requires specific package versions
  5. Network Timeouts: Use ModelScope mirrors in China region

πŸ€” Frequently Asked Questions

Q: What's the difference between CosyVoice 2.0 and 3.0?

A: Fun-CosyVoice 3.0 introduces several key improvements:

  • Reinforcement Learning Optimization: RL-trained model achieves 0.81% CER vs. 1.45% in v2.0
  • Enhanced Naturalness: Improved prosody and speaker similarity through post-training
  • In-the-wild Performance: Better handling of challenging real-world scenarios (5.44% vs. 6.83% CER on hard test set)
  • Pronunciation Control: Advanced pinyin/phoneme inpainting capabilities

Q: Can I use CosyVoice for commercial applications?

A: Yes, CosyVoice is open-source and available for commercial use. However:

  • Review the license terms in the GitHub repository
  • Ensure compliance with voice cloning regulations in your jurisdiction
  • The disclaimer states content is for academic purposes; verify production use rights
  • Consider ethical implications of voice cloning technology

Q: How much GPU memory do I need?

A: Memory requirements vary by model:

  • CosyVoice-300M: ~4-6GB VRAM
  • CosyVoice2-0.5B: ~6-8GB VRAM
  • Fun-CosyVoice3-0.5B: ~8-10GB VRAM
  • Batch Inference: Add 2-4GB per additional concurrent request

For CPU-only inference, expect 16GB+ RAM and significantly slower speeds (10-50x slower).

Q: Which languages are best supported?

A: Based on evaluation data:

  • Excellent: Chinese (Mandarin), English
  • Very Good: Japanese, Korean
  • Good: German, Spanish, French, Italian, Russian
  • Dialects: 18+ Chinese dialects with varying quality

English and Chinese have the most extensive training data and achieve the best results.

Q: How do I improve voice cloning quality?

A: Follow these guidelines:

  1. Reference Audio Quality:

    • Duration: 3-10 seconds optimal
    • Single speaker only
    • Clear speech, minimal background noise
    • Natural speaking pace
  2. Prompt Text Accuracy:

    • Provide exact transcription of reference audio
    • Match language and dialect
  3. Model Selection:

    • Use Fun-CosyVoice3-0.5B-2512_RL for best quality
    • Consider fine-tuning for specific voices
  4. Post-processing:

    • Apply noise reduction if needed
    • Normalize audio levels (a peak-normalization sketch follows)
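
For the normalization step, a minimal peak-normalization sketch with torchaudio (the file name is just the output of the earlier zero-shot example):

import torchaudio

# Scale the clip so the loudest sample sits at -1 dBFS of headroom.
wav, sr = torchaudio.load('zero_shot_0.wav')
peak = wav.abs().max()
if peak > 0:
    wav = wav / peak * 10 ** (-1 / 20)   # -1 dBFS
torchaudio.save('zero_shot_0_normalized.wav', wav, sr)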

Q: Can I fine-tune CosyVoice on my own data?

A: Yes, the repository includes training scripts in examples/libritts/cosyvoice/run.sh. Requirements:

  • High-quality paired audio-text data
  • GPU cluster (multi-GPU recommended)
  • Familiarity with flow matching training
  • See the paper for detailed training methodology

Q: What's the best deployment option for my use case?

A: Choose based on your requirements:

| Scenario | Recommended Setup | Rationale |
| --- | --- | --- |
| Research/Testing | Web demo or Python API | Easy setup, full features |
| Small API (<100 req/day) | FastAPI + Docker | Simple deployment, good performance |
| Medium API (100-10K req/day) | vLLM + Load Balancer | 2-3x speedup, scalable |
| High-throughput (>10K req/day) | TensorRT-LLM + Kubernetes | 4x speedup, enterprise-grade |
| Real-time Streaming | CosyVoice2 + vLLM | Low latency, streaming support |

Q: How does CosyVoice compare to commercial TTS services?

A: Advantages over commercial services:

  • βœ… Full control and customization
  • βœ… No API costs or rate limits
  • βœ… Data privacy (on-premise deployment)
  • βœ… Access to model weights for research

Commercial services may offer:

  • ⚑ Simpler integration
  • πŸ”§ Managed infrastructure
  • πŸ“ž Enterprise support

For most technical teams, CosyVoice's performance and flexibility outweigh the setup complexity.

Additional Resources

Official Links

  • GitHub Repository: https://github.com/FunAudioLLM/CosyVoice
  • Model Weights: hosted on Hugging Face and ModelScope under the repo IDs used in the download snippets above

Community and Support

  • GitHub Issues: Report bugs and request features
  • DingTalk Group: Join the official Chinese community (QR code in repository)
  • Research Papers: Read the academic papers for deep technical understanding

Related Projects

CosyVoice builds upon a number of open-source projects, notably Matcha-TTS (bundled as a git submodule under third_party/, which is why the recursive clone in the installation step matters) and the broader FunAudioLLM ecosystem such as FunASR and FunCodec.

Conclusion and Next Steps

Fun-CosyVoice 3.0 represents a significant advancement in open-source text-to-speech technology, combining state-of-the-art performance with practical deployment capabilities. Its combination of high accuracy (0.81% CER), extensive language support (9 languages + 18+ dialects), and production-ready features (streaming, low latency) makes it an excellent choice for both research and commercial applications.

Recommended Action Plan

  1. Get Started (Week 1):

    • Install CosyVoice following the setup guide
    • Test with web demo to understand capabilities
    • Experiment with different models and modes
  2. Evaluate (Week 2-3):

    • Test with your specific use cases
    • Benchmark performance on your hardware
    • Compare quality against your requirements
  3. Deploy (Week 4+):

    • Choose appropriate deployment method
    • Implement monitoring and logging
    • Optimize for your production workload
  4. Optimize (Ongoing):

    • Fine-tune on domain-specific data if needed
    • Implement caching strategies
    • Scale infrastructure based on usage

Stay Updated

The CosyVoice project is actively maintained with regular updates. Check the roadmap in the GitHub repository for upcoming features and improvements.


Disclaimer: This guide is based on information available as of December 2025. Always refer to the official documentation for the most current information and best practices.
