Garyvov

Fish Audio S2-Pro: A TTS Model with Emotion in Speech Controlled with Natural Language

On March 9, 2026, Fish Audio open-sourced S2-Pro, a TTS model that outperforms closed-source systems across multiple benchmarks. Model weights, training code, and inference engine are all open source.

Natural Language Control

S2-Pro supports free-form inline control. You can describe the desired effect directly in natural language within the text:

  • [whisper in small voice] - soft whispering
  • [professional broadcast tone] - formal news-anchor delivery
  • [pitch up] - raised pitch
  • [laughing] - laughter

The system supports 15,000+ tags covering emotion, tone, volume, and rhythm. There is no fixed tag set to memorize; just describe the effect you want.
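
Since the tags are just free-form bracketed text inline with the script, composing an input is plain string work. A minimal sketch (the `[...]` tag syntax is from the article; the `tag` helper is hypothetical, not part of the fish_speech API):

```python
# Sketch: composing S2-Pro input text with inline natural-language tags.
# The bracket syntax comes from the article; `tag` is an illustrative helper.

def tag(description: str, text: str) -> str:
    """Prefix a text span with a free-form control tag."""
    return f"[{description}] {text}"

script = " ".join([
    tag("professional broadcast tone", "Welcome to tonight's news."),
    tag("whisper in small voice", "But here is a secret..."),
    tag("laughing", "Just kidding!"),
])

print(script)
# [professional broadcast tone] Welcome to tonight's news. [whisper in small voice] But here is a secret... [laughing] Just kidding!
```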

Training Data

10 million hours of audio across 80+ languages, including Japanese, English, Chinese, Korean, Spanish, Portuguese, Arabic, Russian, French, German, Swedish, Italian, Turkish, and 60+ other languages.

No phoneme annotation or language-specific preprocessing required.

Benchmarks

| Test | S2-Pro | Comparison |
| --- | --- | --- |
| Seed-TTS Eval Chinese WER | 0.54% | Lowest |
| Seed-TTS Eval English WER | 0.99% | Lowest |
| Audio Turing Test | 0.515 | vs. Seed-TTS 0.417 |
| EmergentTTS-Eval | 81.88% | Highest |

In the Seed-TTS evaluation, S2-Pro's word error rates (Chinese/English) are lower than those of Qwen3-TTS (0.77%/1.24%), MiniMax Speech-02 (0.99%/1.90%), and Seed-TTS (1.12%/2.25%).
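
For reference, word error rate is the standard ASR-based TTS intelligibility metric: synthesized audio is transcribed, and the transcript is compared to the input text as (substitutions + deletions + insertions) / reference word count. A minimal word-level Levenshtein implementation:

```python
# WER = word-level edit distance / number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ≈ 0.167
```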

Dual-AR Architecture

The model generates audio in two layers:

  • Slow AR (4B parameters): predicts the primary semantic codebook along the time axis
  • Fast AR (400M parameters): generates the remaining 9 residual codebooks at each time step

This design enables fast inference while maintaining audio quality.

Reinforcement Learning Alignment

S2-Pro uses GRPO for post-training. Key point: the same models used to filter the training data serve directly as reward models during reinforcement learning, eliminating distribution mismatch between pre-training and post-training.

Reward signals include:

  • Semantic accuracy
  • Instruction following
  • Acoustic preference
  • Timbre similarity
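
The core of GRPO is a group-relative baseline: sample several candidate generations per prompt, score each with the reward models, and normalize rewards within the group. A sketch of that computation, where the equal weighting of the four signals is an assumption, not from the report:

```python
# GRPO-style advantages: rewards normalized against the group's own
# mean and standard deviation. Reward weighting here is illustrative.

from statistics import mean, stdev

def combined_reward(scores: dict[str, float]) -> float:
    # Equal weighting of the four reward signals (an assumption).
    return mean([scores["semantic_accuracy"],
                 scores["instruction_following"],
                 scores["acoustic_preference"],
                 scores["timbre_similarity"]])

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in group_rewards]

rewards = [0.8, 0.6, 0.9, 0.5]
print(grpo_advantages(rewards))  # positive for above-average samples
```

Because the baseline comes from the group itself, no separate value model is needed.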

Production Inference

The Dual-AR architecture is structurally identical to a standard decoder-only LLM, allowing direct use of SGLang optimizations:

  • Continuous batching
  • Paged KV cache
  • CUDA graph replay
  • RadixAttention prefix caching

Single H200 GPU Performance:

  • RTF: 0.195
  • Time-to-first-audio: ~100ms
  • Throughput: 3,000+ tokens/s
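
To unpack the RTF figure: real-time factor is wall-clock synthesis time divided by the duration of audio produced, so RTF < 1 means faster than real time. At 0.195, one GPU-second yields roughly 5.1 seconds of audio:

```python
# Real-time factor (RTF) = synthesis time / duration of audio produced.

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

def audio_per_compute_second(rtf_value: float) -> float:
    """Seconds of audio produced per second of compute."""
    return 1.0 / rtf_value

print(round(audio_per_compute_second(0.195), 1))  # 5.1
```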

For voice cloning scenarios, SGLang automatically caches reference audio KV states. When the same voice is reused, prefix cache hit rate averages 86.4% (peak >90%).
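
The intuition behind those hit rates: the KV states for a given reference clip are identical across requests, so they can be keyed by the clip and reused. A toy cache stands in for SGLang's RadixAttention here; the real mechanism caches shared token prefixes, not whole files:

```python
# Toy model of reference-audio prefix caching: repeated requests for the
# same voice hit the cache after the first miss.

import hashlib

class PrefixCache:
    def __init__(self):
        self.store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_kv(self, reference_audio: bytes) -> str:
        key = hashlib.sha256(reference_audio).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = f"kv-states-for-{key[:8]}"  # placeholder value
        return self.store[key]

cache = PrefixCache()
clip = b"fake reference audio bytes"
for _ in range(10):
    cache.get_kv(clip)  # same voice reused across requests

print(cache.hits / (cache.hits + cache.misses))  # 0.9
```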

Practical Features

Voice Cloning: Clone voices using short reference samples (typically 10-30 seconds). Captures timbre, speaking style, and emotional tendencies.

Multi-Speaker: Upload reference audio containing multiple speakers, and the model processes each speaker's features via <|speaker:i|> tokens. A single generation can include multiple speakers.
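
A sketch of what a multi-speaker script might look like using those tokens. The `<|speaker:i|>` token format is from the article; the exact prompt layout below is an assumption for illustration:

```python
# Sketch: tagging dialogue lines with <|speaker:i|> tokens (layout assumed).

def speaker_line(index: int, text: str) -> str:
    return f"<|speaker:{index}|>{text}"

dialogue = "\n".join([
    speaker_line(0, "Did you hear about the new release?"),
    speaker_line(1, "Yes! The open-source weights are out."),
    speaker_line(0, "Let's try cloning our voices."),
])

print(dialogue)
```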

Multi-Turn Dialogue: The model uses previous context to improve expressiveness in subsequent generations.

Open Source Content

  • Model weights: HuggingFace
  • Training and fine-tuning code
  • SGLang inference engine
  • GitHub: fish-speech
  • Technical report PDF

License: Fish Audio Research License

  • Free for research and non-commercial use
  • Commercial use requires separate license (business@fish.audio)

Quick Start

Installation

git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
pip install uv
uv sync

Command Line

python -m fish_speech.text_to_speech \
  --text "Hello, I am Fish Audio S2-Pro" \
  --reference_audio reference.wav \
  --output output.wav

WebUI

python -m fish_speech.webui

Docker

docker pull fishaudio/fish-speech:latest
docker run -it --gpus all fishaudio/fish-speech:latest

SGLang Server

For production environments, use SGLang:
https://github.com/sgl-project/sglang-omni

