On March 9, 2026, Fish Audio open-sourced S2-Pro, a TTS model that outperforms closed-source systems across multiple benchmarks. Model weights, training code, and inference engine are all open source.
Natural Language Control
S2-Pro supports free-form inline control. You can describe the desired effect directly in natural language within the text:
- `[whisper in small voice]`: soft whisper
- `[professional broadcast tone]`: professional broadcast tone
- `[pitch up]`: raise pitch
- `[laughing]`: laughter
The system supports more than 15,000 tags covering emotion, tone, volume, and rhythm. There is no fixed tag set to memorize: just describe the effect you want.
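Because the tags are plain bracketed text embedded in the input, composing a controlled utterance is just string building. A minimal sketch (the `tagged` helper is hypothetical, not part of the fish-speech API):

```python
# Hypothetical helper: S2-Pro's inline control tags are free-form bracketed
# phrases placed directly in the input text, so a request is ordinary
# string composition.
def tagged(text: str, *tags: str) -> str:
    """Prefix text with free-form control tags like [whisper in small voice]."""
    return "".join(f"[{t}]" for t in tags) + text

line = tagged("The quarterly numbers look promising.", "professional broadcast tone")
print(line)  # [professional broadcast tone]The quarterly numbers look promising.
```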
Training Data
10 million hours of audio across 80+ languages, including Japanese, English, Chinese, Korean, Spanish, Portuguese, Arabic, Russian, French, German, Swedish, Italian, Turkish, and 60+ other languages.
No phoneme annotation or language-specific preprocessing required.
Benchmarks
| Benchmark | S2-Pro | Comparison |
|---|---|---|
| Seed-TTS Eval, Chinese WER | 0.54% | Lowest among compared systems |
| Seed-TTS Eval, English WER | 0.99% | Lowest among compared systems |
| Audio Turing Test | 0.515 | vs. Seed-TTS at 0.417 |
| EmergentTTS-Eval | 81.88% | Highest among compared systems |
In the Seed-TTS evaluation, S2-Pro's word error rates (Chinese/English) are lower than those of Qwen3-TTS (0.77%/1.24%), MiniMax Speech-02 (0.99%/1.90%), and Seed-TTS (1.12%/2.25%).
Dual-AR Architecture
The model generates audio in two layers:
- Slow AR (4B parameters): predicts the primary semantic codebook along the time axis
- Fast AR (400M parameters): generates the remaining 9 residual codebooks at each time step
This design enables fast inference while maintaining audio quality.
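The two-layer decoding loop described above can be sketched as follows. The model functions here are random stand-ins for illustration only; the real fish-speech implementation and codebook vocabulary sizes may differ:

```python
import random

# Illustrative sketch of the dual-AR decoding loop: a slow AR advances
# along the time axis, and a fast AR fills in the residual codebooks for
# each frame. The model steps are placeholders, not the real implementation.
NUM_RESIDUAL_CODEBOOKS = 9  # from the article: 9 residual codebooks per step

def slow_ar_step(semantic_history):
    """Stand-in for the 4B slow AR: predicts the next semantic token."""
    return random.randrange(1024)

def fast_ar_step(semantic_token):
    """Stand-in for the 400M fast AR: fills residual codebooks for one frame."""
    return [random.randrange(1024) for _ in range(NUM_RESIDUAL_CODEBOOKS)]

def generate(num_frames):
    semantic, frames = [], []
    for _ in range(num_frames):
        s = slow_ar_step(semantic)      # one step along the time axis
        residuals = fast_ar_step(s)     # 9 residual tokens for this frame
        semantic.append(s)
        frames.append([s] + residuals)  # 10 codebooks total per frame
    return frames

frames = generate(4)
print(len(frames), len(frames[0]))
```

The key property is that the expensive 4B model runs only once per time step, while the much smaller fast AR handles the per-frame residual detail.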
Reinforcement Learning Alignment
S2-Pro uses GRPO (Group Relative Policy Optimization) for post-training. Key point: the same models used to filter the training data also serve as reward models during reinforcement learning, which eliminates the distribution mismatch between pre-training and post-training.
Reward signals include:
- Semantic accuracy
- Instruction following
- Acoustic preference
- Timbre similarity
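One way the four signals above could feed a GRPO-style update is a weighted combined reward, normalized within a group of samples for the same prompt. The weights and scoring dictionaries below are illustrative assumptions, not values from the technical report:

```python
import statistics

# Sketch of GRPO-style group-relative advantages over a combined reward.
# The weights are illustrative assumptions, not from the technical report.
WEIGHTS = {"semantic": 0.4, "instruction": 0.2, "acoustic": 0.2, "timbre": 0.2}

def combined_reward(scores: dict) -> float:
    """Weighted sum of the four reward signals."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def group_relative_advantages(group_scores):
    """GRPO normalizes rewards within a group sampled for the same prompt."""
    rewards = [combined_reward(s) for s in group_scores]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

group = [
    {"semantic": 0.9, "instruction": 0.8, "acoustic": 0.7, "timbre": 0.9},
    {"semantic": 0.6, "instruction": 0.9, "acoustic": 0.8, "timbre": 0.7},
]
adv = group_relative_advantages(group)
print(adv)
```

Because advantages are centered within each group, GRPO needs no separate value network, which keeps post-training cheap relative to PPO-style methods.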
Production Inference
The dual-AR architecture is structurally identical to a standard LLM, so SGLang's optimizations apply directly:
- Continuous batching
- Paged KV cache
- CUDA graph replay
- RadixAttention prefix caching
Single H200 GPU performance:
- RTF (real-time factor): 0.195
- Time-to-first-audio: ~100 ms
- Throughput: 3,000+ tokens/s
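RTF is synthesis time divided by audio duration, so a value below 1.0 means faster than real time. A quick sanity check with the reported figure:

```python
# RTF (real-time factor) = synthesis time / audio duration.
# With the reported RTF of 0.195, synthesis runs ~5x faster than real time.
def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to synthesize a clip of given duration."""
    return audio_seconds * rtf

t = synthesis_time(60.0, 0.195)
print(f"{t:.1f}s to synthesize 60s of audio")  # prints "11.7s to synthesize 60s of audio"
```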
For voice cloning scenarios, SGLang automatically caches the KV states of the reference audio. When the same voice is reused across requests, the prefix cache hit rate averages 86.4% (peaking above 90%).
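The intuition behind those hit rates: the reference-audio tokens form a shared prefix, so their KV states can be keyed by content and reused across requests. The dict below is a toy stand-in for SGLang's RadixAttention tree, not its actual implementation:

```python
import hashlib

# Toy stand-in for a prefix cache keyed by reference-audio content.
# SGLang's RadixAttention is a radix tree over token prefixes; this dict
# only illustrates why reusing the same voice produces cache hits.
class PrefixCache:
    def __init__(self):
        self._kv = {}
        self.hits = self.misses = 0

    def kv_for(self, reference_audio: bytes):
        key = hashlib.sha256(reference_audio).hexdigest()
        if key in self._kv:
            self.hits += 1
        else:
            self.misses += 1
            self._kv[key] = f"kv-states-{key[:8]}"  # placeholder for cached KV
        return self._kv[key]

cache = PrefixCache()
voice = b"\x00" * 1024  # same reference audio reused across 10 requests
for _ in range(10):
    cache.kv_for(voice)
print(cache.hits / (cache.hits + cache.misses))  # prints 0.9
```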
Practical Features
Voice Cloning: Clone a voice from a short reference sample (typically 10-30 seconds). The model captures timbre, speaking style, and emotional tendencies.
Multi-Speaker: Upload reference audio containing multiple speakers, and the model addresses each speaker's features via `<|speaker:i|>` tokens. A single generation can include multiple speakers.
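A multi-speaker script is assembled by interleaving `<|speaker:i|>` tokens with each speaker's lines. Only the token shape comes from the article; the helper and surrounding formatting are assumptions:

```python
# Sketch of building a multi-speaker script with <|speaker:i|> tokens.
# The dialogue() helper is hypothetical; only the token format is from
# the article.
def dialogue(turns):
    """turns: list of (speaker_index, text) pairs."""
    return "".join(f"<|speaker:{i}|>{text}" for i, text in turns)

script = dialogue([
    (0, "Did you see the benchmark results?"),
    (1, "0.54% WER on the Chinese eval. Impressive."),
])
print(script)
```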
Multi-Turn Dialogue: The model uses previous context to improve expressiveness in subsequent generations.
Open Source Content
- Model weights: HuggingFace
- Training and fine-tuning code
- SGLang inference engine
- GitHub: fish-speech
- Technical report PDF
License: Fish Audio Research License
- Free for research and non-commercial use
- Commercial use requires separate license (business@fish.audio)
Quick Start
Installation
```bash
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
pip install uv
uv sync
```
Command Line
```bash
python -m fish_speech.text_to_speech \
  --text "Hello, I am Fish Audio S2-Pro" \
  --reference_audio reference.wav \
  --output output.wav
```
WebUI
```bash
python -m fish_speech.webui
```
Docker
```bash
docker pull fishaudio/fish-speech:latest
docker run -it --gpus all fishaudio/fish-speech:latest
```
SGLang Server
For production environments, use SGLang:
https://github.com/sgl-project/sglang-omni
Links:
- Website: https://fish.audio/
- GitHub: https://github.com/fishaudio/fish-speech
- HuggingFace: https://huggingface.co/fishaudio/s2-pro
- Blog: https://fish.audio/blog/fish-audio-open-sources-s2/
- Technical Report: PDF