On March 9, 2026, Fish Audio open-sourced S2-Pro, a TTS model that outperforms closed-source systems across multiple benchmarks. Model weights, training code, and inference engine are all open source.
Natural Language Control
S2-Pro supports free-form inline control. You can describe the desired effect directly in natural language within the text:
- `[whisper in small voice]`: soft whisper
- `[professional broadcast tone]`: professional broadcast tone
- `[pitch up]`: raise pitch
- `[laughing]`: laughter
The system supports more than 15,000 tags covering emotion, tone, volume, and rhythm. There is no fixed tag set to memorize: just describe the effect you want.
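Because the tags are plain bracketed text embedded in the input, composing a controlled utterance is just string building. A minimal sketch (the `tagged` helper is hypothetical, not part of the fish-speech API):

```python
# Hypothetical helper: S2-Pro's inline control tags are free-form bracketed
# phrases placed directly in the input text, so a request is ordinary
# string composition.
def tagged(text: str, *tags: str) -> str:
    """Prefix text with free-form control tags like [whisper in small voice]."""
    return "".join(f"[{t}]" for t in tags) + text

line = tagged("The quarterly numbers look promising.", "professional broadcast tone")
print(line)  # [professional broadcast tone]The quarterly numbers look promising.
```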
Training Data
10 million hours of audio across 80+ languages, including Japanese, English, Chinese, Korean, Spanish, Portuguese, Arabic, Russian, French, German, Swedish, Italian, Turkish, and 60+ other languages.
No phoneme annotation or language-specific preprocessing required.
Benchmarks
| Benchmark | S2-Pro | Comparison |
|---|---|---|
| Seed-TTS Eval, Chinese WER | 0.54% | Lowest among compared systems |
| Seed-TTS Eval, English WER | 0.99% | Lowest among compared systems |
| Audio Turing Test | 0.515 | vs. Seed-TTS at 0.417 |
| EmergentTTS-Eval | 81.88% | Highest among compared systems |
In the Seed-TTS evaluation, S2-Pro's word error rates (Chinese/English) are lower than those of Qwen3-TTS (0.77%/1.24%), MiniMax Speech-02 (0.99%/1.90%), and Seed-TTS (1.12%/2.25%).
Dual-AR Architecture
The model generates audio in two layers:
- Slow AR (4B parameters): predicts the primary semantic codebook along the time axis
- Fast AR (400M parameters): generates the remaining 9 residual codebooks at each time step
This design enables fast inference while maintaining audio quality.
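The two-layer decoding loop described above can be sketched as follows. The model functions here are random stand-ins for illustration only; the real fish-speech implementation and codebook vocabulary sizes may differ:

```python
import random

# Illustrative sketch of the dual-AR decoding loop: a slow AR advances
# along the time axis, and a fast AR fills in the residual codebooks for
# each frame. The model steps are placeholders, not the real implementation.
NUM_RESIDUAL_CODEBOOKS = 9  # from the article: 9 residual codebooks per step

def slow_ar_step(semantic_history):
    """Stand-in for the 4B slow AR: predicts the next semantic token."""
    return random.randrange(1024)

def fast_ar_step(semantic_token):
    """Stand-in for the 400M fast AR: fills residual codebooks for one frame."""
    return [random.randrange(1024) for _ in range(NUM_RESIDUAL_CODEBOOKS)]

def generate(num_frames):
    semantic, frames = [], []
    for _ in range(num_frames):
        s = slow_ar_step(semantic)      # one step along the time axis
        residuals = fast_ar_step(s)     # 9 residual tokens for this frame
        semantic.append(s)
        frames.append([s] + residuals)  # 10 codebooks total per frame
    return frames

frames = generate(4)
print(len(frames), len(frames[0]))
```

The key property is that the expensive 4B model runs only once per time step, while the much smaller fast AR handles the per-frame residual detail.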
Reinforcement Learning Alignment
S2-Pro uses GRPO (Group Relative Policy Optimization) for post-training. Key point: the same models used to filter the training data also serve as reward models during reinforcement learning, which eliminates the distribution mismatch between pre-training and post-training.
Reward signals include:
- Semantic accuracy
- Instruction following
- Acoustic preference
- Timbre similarity
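One way the four signals above could feed a GRPO-style update is a weighted combined reward, normalized within a group of samples for the same prompt. The weights and scoring dictionaries below are illustrative assumptions, not values from the technical report:

```python
import statistics

# Sketch of GRPO-style group-relative advantages over a combined reward.
# The weights are illustrative assumptions, not from the technical report.
WEIGHTS = {"semantic": 0.4, "instruction": 0.2, "acoustic": 0.2, "timbre": 0.2}

def combined_reward(scores: dict) -> float:
    """Weighted sum of the four reward signals."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def group_relative_advantages(group_scores):
    """GRPO normalizes rewards within a group sampled for the same prompt."""
    rewards = [combined_reward(s) for s in group_scores]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

group = [
    {"semantic": 0.9, "instruction": 0.8, "acoustic": 0.7, "timbre": 0.9},
    {"semantic": 0.6, "instruction": 0.9, "acoustic": 0.8, "timbre": 0.7},
]
adv = group_relative_advantages(group)
print(adv)
```

Because advantages are centered within each group, GRPO needs no separate value network, which keeps post-training cheap relative to PPO-style methods.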
Production Inference
The dual-AR architecture is structurally identical to a standard LLM, so SGLang's optimizations apply directly:
- Continuous batching
- Paged KV cache
- CUDA graph replay
- RadixAttention prefix caching
Single H200 GPU performance:
- RTF (real-time factor): 0.195
- Time-to-first-audio: ~100 ms
- Throughput: 3,000+ tokens/s
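RTF is synthesis time divided by audio duration, so a value below 1.0 means faster than real time. A quick sanity check with the reported figure:

```python
# RTF (real-time factor) = synthesis time / audio duration.
# With the reported RTF of 0.195, synthesis runs ~5x faster than real time.
def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to synthesize a clip of given duration."""
    return audio_seconds * rtf

t = synthesis_time(60.0, 0.195)
print(f"{t:.1f}s to synthesize 60s of audio")  # prints "11.7s to synthesize 60s of audio"
```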
For voice cloning scenarios, SGLang automatically caches the KV states of the reference audio. When the same voice is reused across requests, the prefix cache hit rate averages 86.4% (peaking above 90%).
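The intuition behind those hit rates: the reference-audio tokens form a shared prefix, so their KV states can be keyed by content and reused across requests. The dict below is a toy stand-in for SGLang's RadixAttention tree, not its actual implementation:

```python
import hashlib

# Toy stand-in for a prefix cache keyed by reference-audio content.
# SGLang's RadixAttention is a radix tree over token prefixes; this dict
# only illustrates why reusing the same voice produces cache hits.
class PrefixCache:
    def __init__(self):
        self._kv = {}
        self.hits = self.misses = 0

    def kv_for(self, reference_audio: bytes):
        key = hashlib.sha256(reference_audio).hexdigest()
        if key in self._kv:
            self.hits += 1
        else:
            self.misses += 1
            self._kv[key] = f"kv-states-{key[:8]}"  # placeholder for cached KV
        return self._kv[key]

cache = PrefixCache()
voice = b"\x00" * 1024  # same reference audio reused across 10 requests
for _ in range(10):
    cache.kv_for(voice)
print(cache.hits / (cache.hits + cache.misses))  # prints 0.9
```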
Practical Features
Voice Cloning: Clone a voice from a short reference sample (typically 10-30 seconds). The model captures timbre, speaking style, and emotional tendencies.
Multi-Speaker: Upload reference audio containing multiple speakers, and the model addresses each speaker's features via `<|speaker:i|>` tokens. A single generation can include multiple speakers.
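A multi-speaker script is assembled by interleaving `<|speaker:i|>` tokens with each speaker's lines. Only the token shape comes from the article; the helper and surrounding formatting are assumptions:

```python
# Sketch of building a multi-speaker script with <|speaker:i|> tokens.
# The dialogue() helper is hypothetical; only the token format is from
# the article.
def dialogue(turns):
    """turns: list of (speaker_index, text) pairs."""
    return "".join(f"<|speaker:{i}|>{text}" for i, text in turns)

script = dialogue([
    (0, "Did you see the benchmark results?"),
    (1, "0.54% WER on the Chinese eval. Impressive."),
])
print(script)
```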
Multi-Turn Dialogue: The model uses previous context to improve expressiveness in subsequent generations.
Open Source Content
- Model weights: HuggingFace
- Training and fine-tuning code
- SGLang inference engine
- GitHub: fish-speech
- Technical report PDF
License: Fish Audio Research License
- Free for research and non-commercial use
- Commercial use requires separate license (business@fish.audio)
Quick Start
Installation
```bash
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
pip install uv
uv sync
```
Command Line
```bash
python -m fish_speech.text_to_speech \
  --text "Hello, I am Fish Audio S2-Pro" \
  --reference_audio reference.wav \
  --output output.wav
```
WebUI
```bash
python -m fish_speech.webui
```
Docker
```bash
docker pull fishaudio/fish-speech:latest
docker run -it --gpus all fishaudio/fish-speech:latest
```
SGLang Server
For production environments, use SGLang:
https://github.com/sgl-project/sglang-omni
Links:
- Website: https://fish.audio/
- GitHub: https://github.com/fishaudio/fish-speech
- HuggingFace: https://huggingface.co/fishaudio/s2-pro
- Blog: https://fish.audio/blog/fish-audio-open-sources-s2/
- Technical Report: PDF