Introduction
"What if AI could generate conversations as natural as a real podcast — supporting multi-turn interactions, different dialects, and even real emotions like laughter and sighs?"
This is Part 12 of the "Open Source Project of the Day" series. Today we explore SoulX-Podcast (GitHub).
Traditional TTS systems are primarily designed for single-speaker, single-utterance synthesis, so their output lacks the natural feel of real conversation. SoulX-Podcast is designed specifically for podcast-style multi-turn, multi-speaker conversational speech generation. It supports cross-dialect zero-shot voice cloning and can control paralinguistic events (such as laughter and sighs), making AI-generated speech more natural and authentic, truly reaching podcast-level quality.
What You'll Learn
- SoulX-Podcast's core architecture and technical characteristics
- How multi-turn, multi-speaker conversational speech generation works
- The technical breakthrough of cross-dialect zero-shot voice cloning
- How paralinguistic control (laughter, sighs, etc.) is implemented
- How to use SoulX-Podcast to generate high-quality podcasts
- Comparative analysis with other TTS systems
- How to use the WebUI and API
Prerequisites
- Basic understanding of TTS (Text-to-Speech)
- Basic understanding of speech synthesis concepts
- Familiarity with Python programming (optional)
- Basic knowledge of zero-shot learning (optional)
Project Background
Project Introduction
SoulX-Podcast is a TTS system specifically designed for podcast-style multi-turn, multi-speaker conversational speech generation, developed by the Soul AI team. It not only performs excellently on traditional single-speaker TTS tasks, but more importantly achieves high-quality multi-turn conversational speech generation, making AI-generated speech closer to the natural feel of real podcasts.
Core problems the project solves:
- Traditional TTS systems only support single-speaker, single-turn dialogue
- Lack of contextual understanding and coherence for multi-turn dialogue
- Cannot generate natural conversations between multiple speakers
- Lack of dialect support, unable to achieve cross-dialect voice cloning
- Generated speech lacks authentic emotional expression (such as laughter, sighs, and other paralinguistic elements)
Target user groups:
- Content creators who need to generate podcast content
- Application developers needing multi-speaker conversational speech
- Developers needing dialect speech synthesis
- AI application developers needing high-quality, natural speech synthesis
- Developers with high requirements for speech synthesis quality
Author/Team Introduction
Team: Soul AI Lab
- Background: Research team focused on voice technology and AI
- Contributors: 4 core contributors
- Philosophy: Build a high-quality, natural, authentic podcast-style speech generation system
- Related work: Published related technical papers, provides models and demos on Hugging Face
Project created: October 2025 (based on GitHub activity; the project is actively maintained)
Project Stats
- ⭐ GitHub Stars: 3.1k+ (rapidly and continuously growing)
- 🍴 Forks: 403+
- 📦 Version: continuously updated
- 📄 License: Apache-2.0
Project development history:
- October 2025: Project created, initial version released
- October 28, 2025: Paper published
- October 29, 2025: Model released on Hugging Face
- October 30, 2025: Added WebUI and single-speaker TTS examples
- October 31, 2025: Deployed Hugging Face online demo
- November 3, 2025: Added vLLM acceleration and Docker deployment support
Main Features
Core Purpose
SoulX-Podcast's core purpose is to generate high-quality, natural, authentic podcast-style multi-turn conversational speech, with main features including:
- Multi-turn, multi-speaker conversational speech generation: Supports natural dialogue between multiple speakers while maintaining contextual coherence
- Cross-dialect zero-shot voice cloning: Supports dialects like Sichuan, Henan, and Cantonese — just provide a reference audio to clone
- Paralinguistic control: Supports paralinguistic events like laughter, sighs, breathing, coughing, and throat clearing to enhance realism
- High-quality single-speaker TTS: Also performs excellently on traditional single-speaker TTS tasks
- Multilingual support: Supports Chinese (Mandarin and various dialects) and English
Use Cases
- Podcast content generation
  - Automatically generate podcast dialogue content
  - Natural conversation between multiple speakers
  - Add authentic emotional expressions (laughter, sighs, etc.)
- Audiobook production
  - Audiobooks with multi-character dialogue
  - Voice generation for characters with different dialects
  - Natural emotional expression
- Educational content production
  - Multi-speaker teaching dialogues
  - Educational content in different dialects
  - Engaging conversational teaching
- Games and entertainment applications
  - Voice generation for game characters
  - Character voices in different dialects
  - Rich paralinguistic expression
- Assistive technology applications
  - Generating natural conversations for visually impaired users
  - Personalized voice assistants
  - Multilingual, multi-dialect voice services
Quick Start
Installation steps:
# 1. Clone repository
git clone https://github.com/Soul-AILab/SoulX-Podcast.git
cd SoulX-Podcast
# 2. Create Conda environment
conda create -n soulxpodcast -y python=3.11
conda activate soulxpodcast
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download model
pip install -U huggingface_hub
huggingface-cli download --resume-download Soul-AILab/SoulX-Podcast-1.7B \
--local-dir pretrained_models/SoulX-Podcast-1.7B
Simplest usage example:
# Use WebUI (simplest approach)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B
# Or use the dialect model
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B-dialect
# Use command-line example script
bash example/infer_dialogue.sh
Python code example (illustrative; the exact interface may differ, so consult the repository's example scripts for the current API):
from soulxpodcast import SoulXPodcast
# Initialize model
model = SoulXPodcast(model_path="pretrained_models/SoulX-Podcast-1.7B")
# Generate multi-speaker dialogue
dialogue = [
{"speaker": "Host", "text": "Welcome to today's podcast episode!"},
{"speaker": "Guest", "text": "Thanks for having me! <|laughter|> So happy to be here."},
{"speaker": "Host", "text": "Let's dive into today's discussion."}
]
# Generate speech
audio = model.generate_dialogue(dialogue, reference_audios={
"Host": "path/to/host_audio.wav",
"Guest": "path/to/guest_audio.wav"
})
# Save audio
model.save_audio(audio, "output_podcast.wav")
Core Features
- Multi-turn, multi-speaker conversational speech generation
  - Supports natural dialogue between multiple speakers
  - Maintains contextual coherence and dialogue flow
  - Each speaker can have different voice characteristics
- Cross-dialect zero-shot voice cloning
  - Supports dialects like Sichuan, Henan, and Cantonese
  - Only requires providing a reference audio to clone
  - No need to train separate models for each dialect
- Paralinguistic control
  - Supports multiple paralinguistic tags: <|laughter|>, <|sigh|>, <|breathing|>, <|coughing|>, <|throat_clearing|>
  - Enhances the realism and naturalness of speech
  - Makes AI-generated speech closer to human expression
- High-quality single-speaker TTS
  - Performs excellently on traditional single-speaker TTS tasks
  - Supports long-text speech synthesis
  - Generates natural, clear speech
- Multilingual support
  - Supports Chinese (Mandarin and various dialects)
  - Supports English
  - Can mix multiple languages
- WebUI interface
  - Friendly graphical interface
  - Simple and easy-to-use workflow
  - Real-time preview and adjustment
- API support
  - Provides RESTful API interface
  - Easy to integrate into other applications
  - Supports batch processing
- vLLM acceleration
  - Supports vLLM inference acceleration
  - Docker deployment support
  - Improves generation speed
Project Advantages
| Comparison | SoulX-Podcast | Traditional TTS | Other Conversational TTS |
|---|---|---|---|
| Multi-speaker dialogue | ✅ Native support | ❌ Not supported | ⚠️ Limited support |
| Multi-turn dialogue | ✅ Contextually coherent | ❌ Single-turn | ⚠️ Limited support |
| Dialect support | ✅ Zero-shot cross-dialect | ❌ Not supported | ❌ Not supported |
| Paralinguistic control | ✅ Multiple paralinguistics | ❌ Not supported | ❌ Not supported |
| Speech quality | ✅ Podcast-level | ⚠️ Average | ⚠️ Average |
| Naturalness | ✅ High naturalness | ⚠️ Moderate | ⚠️ Moderate |
| Zero-shot cloning | ✅ Supported | ⚠️ Limited | ⚠️ Limited |
Why choose SoulX-Podcast?
Compared to traditional TTS and other conversational TTS systems, SoulX-Podcast is specifically designed for podcast-style multi-turn, multi-speaker dialogue, supports cross-dialect zero-shot voice cloning and paralinguistic control, and generates more natural and authentic speech — making it the ideal choice for podcast content generation and high-quality conversational speech synthesis.
Detailed Project Analysis
Architecture Design
SoulX-Podcast uses a Transformer-based generative architecture specifically optimized for multi-turn, multi-speaker conversational speech generation.
Core Architecture
SoulX-Podcast System
├── Text Processing
│ ├── Multi-speaker dialogue parsing
│ ├── Paralinguistic tag recognition
│ ├── Context understanding
│ └── Multilingual processing
├── Voice Cloning
│ ├── Reference audio encoding
│ ├── Speaker feature extraction
│ ├── Cross-dialect feature transfer
│ └── Zero-shot cloning
├── Speech Generation
│ ├── Multi-turn dialogue generation
│ ├── Paralinguistic event generation
│ ├── Contextual coherence maintenance
│ └── High-quality audio synthesis
└── Model Architecture
├── Transformer Encoder
├── Multi-Speaker Attention
├── Dialect-Aware Module
└── Paralinguistic Control Module
Multi-Turn Dialogue Generation
SoulX-Podcast's core innovation lies in contextual understanding of multi-turn dialogue:
Workflow:
- Parse multi-speaker dialogue text
- Extract reference audio features for each speaker
- Understand dialogue context and coherence
- Generate speech that maintains contextual coherence
- Handle natural transitions between speakers
Technical characteristics:
- Uses Transformer architecture for processing long sequences
- Multi-speaker attention mechanism
- Context window management
- Dialogue coherence modeling
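To make the first steps of that workflow concrete, here is a minimal sketch of flattening a multi-speaker script into ordered, speaker-tagged turns. The helper and the `[S1]`/`[S2]` tag format are hypothetical illustrations, not SoulX-Podcast's actual internal representation:

```python
def format_dialogue(turns):
    """Flatten a list of {speaker, text} turns into speaker-tagged lines.

    Assigns a stable tag ([S1], [S2], ...) to each distinct speaker in
    order of first appearance, so multi-turn context stays ordered.
    NOTE: this tag format is hypothetical, for illustration only.
    """
    tags = {}
    lines = []
    for turn in turns:
        speaker = turn["speaker"]
        if speaker not in tags:
            tags[speaker] = f"[S{len(tags) + 1}]"
        lines.append(f"{tags[speaker]} {turn['text']}")
    return "\n".join(lines)

script = [
    {"speaker": "Host", "text": "Welcome back!"},
    {"speaker": "Guest", "text": "Glad to be here."},
    {"speaker": "Host", "text": "Let's get started."},
]
print(format_dialogue(script))
```

Keeping one flat, ordered sequence of tagged turns is what lets a Transformer attend across the whole conversation rather than one utterance at a time.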
Cross-Dialect Zero-Shot Voice Cloning
SoulX-Podcast achieves cross-dialect zero-shot voice cloning, which is an important technical breakthrough:
Working principle:
- Extract speaker features from reference audio (dialect-independent)
- Recognize linguistic features of the target dialect
- Transfer speaker features to the target dialect
- Generate speech in the target dialect
Supported dialects:
- Sichuan dialect
- Henan dialect
- Cantonese
- Other Chinese dialects (via model extension)
Advantages:
- No need to train separate models for each dialect
- Only requires providing reference audio to clone
- Preserves the speaker's voice characteristics
- Accurately reproduces dialect features
Paralinguistic Control
SoulX-Podcast supports multiple paralinguistic events to enhance the realism of speech:
Supported paralinguistic tags:
- <|laughter|>: Laughter
- <|sigh|>: Sigh
- <|breathing|>: Breathing sound
- <|coughing|>: Coughing sound
- <|throat_clearing|>: Throat clearing
Implementation:
- Insert paralinguistic tags in text
- Model recognizes tags and generates corresponding audio events
- Naturally integrates into the speech stream
- Maintains speech coherence
Usage example:
# Use paralinguistic tags in text
text = "The weather is so nice today! <|laughter|> Let's go for a walk. <|sigh|> But be careful about sun protection."
# The model will automatically recognize and generate corresponding paralinguistic events
audio = model.generate(text, reference_audio="speaker.wav")
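To illustrate how such tags could be separated from plain text before synthesis, here is a minimal parsing sketch. It is a hypothetical helper for illustration, not part of the SoulX-Podcast codebase:

```python
import re

# The paralinguistic tags documented by SoulX-Podcast
PARA_TAGS = {"laughter", "sigh", "breathing", "coughing", "throat_clearing"}

def split_paralinguistic(text):
    """Split text into ('text', str) and ('event', tag) segments."""
    segments = []
    # A capturing group in re.split keeps the delimiters (the tags)
    for part in re.split(r"(<\|[a-z_]+\|>)", text):
        m = re.fullmatch(r"<\|([a-z_]+)\|>", part)
        if m and m.group(1) in PARA_TAGS:
            segments.append(("event", m.group(1)))
        elif part.strip():
            segments.append(("text", part.strip()))
    return segments

print(split_paralinguistic("The weather is so nice! <|laughter|> Let's walk."))
```

Splitting this way yields an interleaved stream of speech segments and audio events, which matches the "naturally integrates into the speech stream" behavior described above.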
Model Architecture
SoulX-Podcast is based on a 1.7B parameter Transformer model:
Model characteristics:
- Parameter count: 1.7B (base model and dialect model)
- Architecture: Transformer-based generative model
- Training data: Large-scale multi-speaker dialogue data
- Optimization: Optimized for multi-turn dialogue and dialect support
Two versions:
- SoulX-Podcast-1.7B: Base model, supports multi-turn dialogue and paralinguistic control
- SoulX-Podcast-1.7B-dialect: Dialect model, additionally supports cross-dialect zero-shot cloning
Key Technical Implementation
Multi-Speaker Dialogue Processing
SoulX-Podcast processes multi-speaker dialogue through:
- Speaker identification: Assign unique identifiers to each speaker
- Reference audio management: Provide reference audio for each speaker
- Context management: Maintain context information for multi-turn dialogue
- Feature extraction: Extract speaker features from reference audio
- Dialogue generation: Generate dialogue speech that maintains speaker characteristics
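The five steps above amount to a small bookkeeping problem. The class below is a hypothetical sketch (name and methods are not from the project) showing how speaker identifiers, reference audio, and multi-turn context might be tracked together:

```python
class SpeakerSession:
    """Hypothetical bookkeeping for a multi-speaker dialogue session."""

    def __init__(self):
        self.reference_audio = {}  # speaker id -> reference wav path
        self.history = []          # (speaker, text) turns, in order

    def register(self, speaker, wav_path):
        # Steps 1-2: assign an identifier and attach reference audio
        self.reference_audio[speaker] = wav_path

    def add_turn(self, speaker, text):
        # Step 3: maintain context; reject speakers with no reference audio
        if speaker not in self.reference_audio:
            raise KeyError(f"no reference audio registered for {speaker!r}")
        self.history.append((speaker, text))

    def context(self, window=8):
        # Return the most recent turns as generation context
        return self.history[-window:]

session = SpeakerSession()
session.register("Host", "host.wav")
session.register("Guest", "guest.wav")
session.add_turn("Host", "Welcome!")
session.add_turn("Guest", "Thanks!")
print(session.context())
```

The sliding `window` mirrors the context-window management mentioned earlier: only the most recent turns are fed back in when the full history grows too long.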
Zero-Shot Voice Cloning
Implementation of zero-shot voice cloning:
- Feature disentanglement: Decouple speaker features from language features
- Feature extraction: Extract speaker features from reference audio
- Feature transfer: Transfer speaker features to the target language/dialect
- Speech generation: Generate speech based on transferred features
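A toy sketch of the disentanglement idea, not the model's actual computation: the "speaker embedding" here is just an average of per-frame features (real systems use a trained speaker encoder), and that embedding then conditions content features drawn from the target dialect:

```python
def mean_embedding(frames):
    """Toy 'speaker encoder': average per-frame feature vectors.

    Averaging is only a stand-in to show that the result is
    independent of utterance length and content.
    """
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

# Per-frame features from the reference audio (speaker A)
reference_frames = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
speaker_embedding = mean_embedding(reference_frames)

def synthesize(speaker_embedding, content_features):
    # Toy combiner: the speaker vector conditions every content frame
    return [[s + c for s, c in zip(speaker_embedding, frame)]
            for frame in content_features]

# Content features for the target dialect, from a different utterance
dialect_content = [[0.0, 0.5], [0.2, 0.3]]
conditioned = synthesize(speaker_embedding, dialect_content)
print(conditioned)
```

The key property this illustrates is that the timbre vector is computed once from the reference audio and reused for any content, which is what makes the cloning "zero-shot".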
Paralinguistic Event Generation
How paralinguistic events are generated:
- Tag recognition: Recognize paralinguistic tags in text
- Event modeling: Build models for each type of paralinguistic event
- Natural integration: Naturally integrate paralinguistic events into the speech stream
- Temporal alignment: Ensure paralinguistic events appear at the correct time points
Usage
WebUI Usage
SoulX-Podcast provides a friendly WebUI interface:
# Start WebUI (base model)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B
# Start WebUI (dialect model)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B-dialect
WebUI features:
- Text input and editing
- Reference audio upload
- Paralinguistic tag insertion
- Real-time preview and adjustment
- Audio export
API Usage
SoulX-Podcast provides an API interface:
# Start API service
python3 run_api.py --model_path pretrained_models/SoulX-Podcast-1.7B
API endpoints:
- /generate: Generate single-speaker speech
- /generate_dialogue: Generate multi-speaker dialogue
- /clone_voice: Zero-shot voice cloning
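The endpoint names come from the project, but the request shape below is an assumption for illustration; check `run_api.py` for the real schema, host, and port. A minimal client sketch using only the standard library:

```python
import json
import urllib.request

def build_dialogue_request(turns, reference_audios):
    """Assemble a hypothetical JSON payload for /generate_dialogue.

    The field names ('dialogue', 'reference_audios') are assumptions,
    not the documented schema.
    """
    return {"dialogue": turns, "reference_audios": reference_audios}

payload = build_dialogue_request(
    [{"speaker": "Host", "text": "Welcome!"},
     {"speaker": "Guest", "text": "Thanks! <|laughter|>"}],
    {"Host": "host.wav", "Guest": "guest.wav"},
)

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:7860/generate_dialogue",  # port is an assumption
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # urllib.request.urlopen(req)  # uncomment with the API server running
```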
vLLM Acceleration
SoulX-Podcast supports vLLM acceleration:
# Build Docker image
cd runtime/vllm
docker build -t soulxpodcast:v1.0 .
# Run container
docker run -it --runtime=nvidia --name soulxpodcast \
-v /mnt/data:/mnt/data -p 7860:7860 soulxpodcast:v1.0
Advantages:
- Faster inference speed
- Better GPU utilization
- Supports batch processing
- Easy to deploy and scale
Comparison with Other Projects
Comparison with Supertonic
| Feature | SoulX-Podcast | Supertonic |
|---|---|---|
| Primary use | Podcast-style multi-turn dialogue | On-device single-speaker TTS |
| Multi-speaker | ✅ Native support | ❌ Not supported |
| Multi-turn dialogue | ✅ Contextually coherent | ❌ Single-turn |
| Dialect support | ✅ Zero-shot cross-dialect | ⚠️ Limited |
| Paralinguistic control | ✅ Multiple paralinguistics | ❌ Not supported |
| Deployment | Cloud/local | On-device |
| Performance | High quality | Blazing fast |
Recommendation:
- Need podcast-style multi-turn dialogue → SoulX-Podcast
- Need on-device ultra-fast TTS → Supertonic
Comparison with Other Conversational TTS
SoulX-Podcast's advantages over other conversational TTS systems:
- Designed specifically for podcasts: Specifically optimized for podcast-style multi-turn dialogue
- Cross-dialect support: Unique cross-dialect zero-shot cloning capability
- Paralinguistic control: Rich paralinguistic event support
- High-quality generation: Podcast-level speech quality
- Easy to use: Friendly WebUI and API interfaces
Project Resources
Official Resources
- 🌟 GitHub: https://github.com/Soul-AILab/SoulX-Podcast
- 🌐 Demo: Hugging Face Spaces
- 📦 Models: SoulX-Podcast-1.7B | SoulX-Podcast-1.7B-dialect
- 📄 Paper: arXiv:2510.23541
Who Should Use This
SoulX-Podcast is especially suitable for: Content creators who need to generate podcast content, application developers needing multi-speaker conversational speech, developers needing dialect speech synthesis, AI application developers needing high-quality natural speech synthesis, developers with high requirements for speech synthesis quality, and developers needing paralinguistic control.
Not suitable for: Users who only need simple single-speaker TTS, on-device applications with strict model size constraints, scenarios that don't require multi-turn dialogue.
Visit my personal homepage for more useful knowledge and interesting products.