Introduction
"What if AI could generate conversations as natural as a real podcast — supporting multi-turn interactions, different dialects, and even real emotions like laughter and sighs?"
This is Part 12 of the "Open Source Project of the Day" series. Today we explore SoulX-Podcast (GitHub).
Traditional TTS systems are primarily designed for single-speaker, single-utterance synthesis, so their output lacks the natural feel of real conversation. SoulX-Podcast is designed specifically for podcast-style multi-turn, multi-speaker conversational speech generation. It supports cross-dialect zero-shot voice cloning and can control paralinguistic events (such as laughter and sighs), making AI-generated speech more natural and authentic, truly reaching podcast-level quality.
What You'll Learn
- SoulX-Podcast's core architecture and technical characteristics
- How multi-turn, multi-speaker conversational speech generation works
- The technical breakthrough of cross-dialect zero-shot voice cloning
- How paralinguistic control (laughter, sighs, etc.) is implemented
- How to use SoulX-Podcast to generate high-quality podcasts
- Comparative analysis with other TTS systems
- How to use the WebUI and API
Prerequisites
- Basic understanding of TTS (Text-to-Speech)
- Basic understanding of speech synthesis concepts
- Familiarity with Python programming (optional)
- Basic knowledge of zero-shot learning (optional)
Project Background
Project Introduction
SoulX-Podcast is a TTS system specifically designed for podcast-style multi-turn, multi-speaker conversational speech generation, developed by the Soul AI team. It not only performs excellently on traditional single-speaker TTS tasks, but more importantly achieves high-quality multi-turn conversational speech generation, making AI-generated speech closer to the natural feel of real podcasts.
Core problems the project solves:
- Traditional TTS systems only support single-speaker, single-turn dialogue
- Lack of contextual understanding and coherence for multi-turn dialogue
- Cannot generate natural conversations between multiple speakers
- Lack of dialect support, unable to achieve cross-dialect voice cloning
- Generated speech lacks authentic emotional expression (such as laughter, sighs, and other paralinguistic elements)
Target user groups:
- Content creators who need to generate podcast content
- Application developers needing multi-speaker conversational speech
- Developers needing dialect speech synthesis
- AI application developers needing high-quality, natural speech synthesis
- Developers with high requirements for speech synthesis quality
Author/Team Introduction
Team: Soul AI Lab
- Background: Research team focused on voice technology and AI
- Contributors: 4 core contributors
- Philosophy: Build a high-quality, natural, authentic podcast-style speech generation system
- Related work: Published related technical papers, provides models and demos on Hugging Face
Project created: October 2025 (based on GitHub activity; the project is actively maintained)
Project Stats
- ⭐ GitHub Stars: 3.1k+ (rapidly and continuously growing)
- 🍴 Forks: 403+
- 📦 Version: continuously updated
- 📄 License: Apache-2.0
Project development history:
- October 2025: Project created, initial version released
- October 28, 2025: Paper published
- October 29, 2025: Model released on Hugging Face
- October 30, 2025: Added WebUI and single-speaker TTS examples
- October 31, 2025: Deployed Hugging Face online demo
- November 3, 2025: Added vLLM acceleration and Docker deployment support
Main Features
Core Purpose
SoulX-Podcast's core purpose is to generate high-quality, natural, authentic podcast-style multi-turn conversational speech, with main features including:
- Multi-turn, multi-speaker conversational speech generation: Supports natural dialogue between multiple speakers while maintaining contextual coherence
- Cross-dialect zero-shot voice cloning: Supports dialects like Sichuan, Henan, and Cantonese — just provide a reference audio to clone
- Paralinguistic control: Supports paralinguistic events like laughter, sighs, breathing, coughing, and throat clearing to enhance realism
- High-quality single-speaker TTS: Also performs excellently on traditional single-speaker TTS tasks
- Multilingual support: Supports Chinese (Mandarin and various dialects) and English
Use Cases
- Podcast content generation
  - Automatically generate podcast dialogue content
  - Natural conversation between multiple speakers
  - Add authentic emotional expressions (laughter, sighs, etc.)
- Audiobook production
  - Audiobooks with multi-character dialogue
  - Voice generation for characters with different dialects
  - Natural emotional expression
- Educational content production
  - Multi-speaker teaching dialogues
  - Educational content in different dialects
  - Engaging conversational teaching
- Games and entertainment applications
  - Voice generation for game characters
  - Character voices in different dialects
  - Rich paralinguistic expression
- Assistive technology applications
  - Generating natural conversations for visually impaired users
  - Personalized voice assistants
  - Multilingual, multi-dialect voice services
Quick Start
Installation steps:
# 1. Clone repository
git clone https://github.com/Soul-AILab/SoulX-Podcast.git
cd SoulX-Podcast
# 2. Create Conda environment
conda create -n soulxpodcast -y python=3.11
conda activate soulxpodcast
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download model
pip install -U huggingface_hub
huggingface-cli download --resume-download Soul-AILab/SoulX-Podcast-1.7B \
--local-dir pretrained_models/SoulX-Podcast-1.7B
Simplest usage example:
# Use WebUI (simplest approach)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B
# Or use the dialect model
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B-dialect
# Use command-line example script
bash example/infer_dialogue.sh
Python code example (illustrative; the exact interface may differ, so consult the repository's example scripts for the current API):
from soulxpodcast import SoulXPodcast
# Initialize model
model = SoulXPodcast(model_path="pretrained_models/SoulX-Podcast-1.7B")
# Generate multi-speaker dialogue
dialogue = [
{"speaker": "Host", "text": "Welcome to today's podcast episode!"},
{"speaker": "Guest", "text": "Thanks for having me! <|laughter|> So happy to be here."},
{"speaker": "Host", "text": "Let's dive into today's discussion."}
]
# Generate speech
audio = model.generate_dialogue(dialogue, reference_audios={
"Host": "path/to/host_audio.wav",
"Guest": "path/to/guest_audio.wav"
})
# Save audio
model.save_audio(audio, "output_podcast.wav")
Core Features
- Multi-turn, multi-speaker conversational speech generation
  - Supports natural dialogue between multiple speakers
  - Maintains contextual coherence and dialogue flow
  - Each speaker can have different voice characteristics
- Cross-dialect zero-shot voice cloning
  - Supports dialects like Sichuan, Henan, and Cantonese
  - Only requires providing a reference audio to clone
  - No need to train separate models for each dialect
- Paralinguistic control
  - Supports multiple paralinguistic tags: <|laughter|>, <|sigh|>, <|breathing|>, <|coughing|>, <|throat_clearing|>
  - Enhances the realism and naturalness of speech
  - Makes AI-generated speech closer to human expression
- High-quality single-speaker TTS
  - Performs excellently on traditional single-speaker TTS tasks
  - Supports long-text speech synthesis
  - Generates natural, clear speech
- Multilingual support
  - Supports Chinese (Mandarin and various dialects)
  - Supports English
  - Can mix multiple languages
- WebUI interface
  - Friendly graphical interface
  - Simple and easy-to-use workflow
  - Real-time preview and adjustment
- API support
  - Provides RESTful API interface
  - Easy to integrate into other applications
  - Supports batch processing
- vLLM acceleration
  - Supports vLLM inference acceleration
  - Docker deployment support
  - Improves generation speed
Project Advantages
| Comparison | SoulX-Podcast | Traditional TTS | Other Conversational TTS |
|---|---|---|---|
| Multi-speaker dialogue | ✅ Native support | ❌ Not supported | ⚠️ Limited support |
| Multi-turn dialogue | ✅ Contextually coherent | ❌ Single-turn | ⚠️ Limited support |
| Dialect support | ✅ Zero-shot cross-dialect | ❌ Not supported | ❌ Not supported |
| Paralinguistic control | ✅ Multiple paralinguistics | ❌ Not supported | ❌ Not supported |
| Speech quality | ✅ Podcast-level | ⚠️ Average | ⚠️ Average |
| Naturalness | ✅ High naturalness | ⚠️ Moderate | ⚠️ Moderate |
| Zero-shot cloning | ✅ Supported | ⚠️ Limited | ⚠️ Limited |
Why choose SoulX-Podcast?
Compared to traditional TTS and other conversational TTS systems, SoulX-Podcast is specifically designed for podcast-style multi-turn, multi-speaker dialogue, supports cross-dialect zero-shot voice cloning and paralinguistic control, and generates more natural and authentic speech — making it the ideal choice for podcast content generation and high-quality conversational speech synthesis.
Detailed Project Analysis
Architecture Design
SoulX-Podcast uses a Transformer-based generative architecture specifically optimized for multi-turn, multi-speaker conversational speech generation.
Core Architecture
SoulX-Podcast System
├── Text Processing
│ ├── Multi-speaker dialogue parsing
│ ├── Paralinguistic tag recognition
│ ├── Context understanding
│ └── Multilingual processing
├── Voice Cloning
│ ├── Reference audio encoding
│ ├── Speaker feature extraction
│ ├── Cross-dialect feature transfer
│ └── Zero-shot cloning
├── Speech Generation
│ ├── Multi-turn dialogue generation
│ ├── Paralinguistic event generation
│ ├── Contextual coherence maintenance
│ └── High-quality audio synthesis
└── Model Architecture
├── Transformer Encoder
├── Multi-Speaker Attention
├── Dialect-Aware Module
└── Paralinguistic Control Module
Multi-Turn Dialogue Generation
SoulX-Podcast's core innovation lies in contextual understanding of multi-turn dialogue:
Workflow:
- Parse multi-speaker dialogue text
- Extract reference audio features for each speaker
- Understand dialogue context and coherence
- Generate speech that maintains contextual coherence
- Handle natural transitions between speakers
Technical characteristics:
- Uses Transformer architecture for processing long sequences
- Multi-speaker attention mechanism
- Context window management
- Dialogue coherence modeling
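To make the first steps of that workflow concrete, here is a minimal sketch of flattening a multi-speaker script into ordered, speaker-tagged turns. The helper and the `[S1]`/`[S2]` tag format are hypothetical illustrations, not SoulX-Podcast's actual internal representation:

```python
def format_dialogue(turns):
    """Flatten a list of {speaker, text} turns into speaker-tagged lines.

    Assigns a stable tag ([S1], [S2], ...) to each distinct speaker in
    order of first appearance, so multi-turn context stays ordered.
    NOTE: this tag format is hypothetical, for illustration only.
    """
    tags = {}
    lines = []
    for turn in turns:
        speaker = turn["speaker"]
        if speaker not in tags:
            tags[speaker] = f"[S{len(tags) + 1}]"
        lines.append(f"{tags[speaker]} {turn['text']}")
    return "\n".join(lines)

script = [
    {"speaker": "Host", "text": "Welcome back!"},
    {"speaker": "Guest", "text": "Glad to be here."},
    {"speaker": "Host", "text": "Let's get started."},
]
print(format_dialogue(script))
```

Keeping one flat, ordered sequence of tagged turns is what lets a Transformer attend across the whole conversation rather than one utterance at a time.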
Cross-Dialect Zero-Shot Voice Cloning
SoulX-Podcast achieves cross-dialect zero-shot voice cloning, which is an important technical breakthrough:
Working principle:
- Extract speaker features from reference audio (dialect-independent)
- Recognize linguistic features of the target dialect
- Transfer speaker features to the target dialect
- Generate speech in the target dialect
Supported dialects:
- Sichuan dialect
- Henan dialect
- Cantonese
- Other Chinese dialects (via model extension)
Advantages:
- No need to train separate models for each dialect
- Only requires providing reference audio to clone
- Preserves the speaker's voice characteristics
- Accurately reproduces dialect features
Paralinguistic Control
SoulX-Podcast supports multiple paralinguistic events to enhance the realism of speech:
Supported paralinguistic tags:
- <|laughter|>: Laughter
- <|sigh|>: Sigh
- <|breathing|>: Breathing sound
- <|coughing|>: Coughing sound
- <|throat_clearing|>: Throat clearing
Implementation:
- Insert paralinguistic tags in text
- Model recognizes tags and generates corresponding audio events
- Naturally integrates into the speech stream
- Maintains speech coherence
Usage example:
# Use paralinguistic tags in text
text = "The weather is so nice today! <|laughter|> Let's go for a walk. <|sigh|> But be careful about sun protection."
# The model will automatically recognize and generate corresponding paralinguistic events
audio = model.generate(text, reference_audio="speaker.wav")
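To illustrate how such tags could be separated from plain text before synthesis, here is a minimal parsing sketch. It is a hypothetical helper for illustration, not part of the SoulX-Podcast codebase:

```python
import re

# The paralinguistic tags documented by SoulX-Podcast
PARA_TAGS = {"laughter", "sigh", "breathing", "coughing", "throat_clearing"}

def split_paralinguistic(text):
    """Split text into ('text', str) and ('event', tag) segments."""
    segments = []
    # A capturing group in re.split keeps the delimiters (the tags)
    for part in re.split(r"(<\|[a-z_]+\|>)", text):
        m = re.fullmatch(r"<\|([a-z_]+)\|>", part)
        if m and m.group(1) in PARA_TAGS:
            segments.append(("event", m.group(1)))
        elif part.strip():
            segments.append(("text", part.strip()))
    return segments

print(split_paralinguistic("The weather is so nice! <|laughter|> Let's walk."))
```

Splitting this way yields an interleaved stream of speech segments and audio events, which matches the "naturally integrates into the speech stream" behavior described above.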
Model Architecture
SoulX-Podcast is based on a 1.7B parameter Transformer model:
Model characteristics:
- Parameter count: 1.7B (base model and dialect model)
- Architecture: Transformer-based generative model
- Training data: Large-scale multi-speaker dialogue data
- Optimization: Optimized for multi-turn dialogue and dialect support
Two versions:
- SoulX-Podcast-1.7B: Base model, supports multi-turn dialogue and paralinguistic control
- SoulX-Podcast-1.7B-dialect: Dialect model, additionally supports cross-dialect zero-shot cloning
Key Technical Implementation
Multi-Speaker Dialogue Processing
SoulX-Podcast processes multi-speaker dialogue through:
- Speaker identification: Assign unique identifiers to each speaker
- Reference audio management: Provide reference audio for each speaker
- Context management: Maintain context information for multi-turn dialogue
- Feature extraction: Extract speaker features from reference audio
- Dialogue generation: Generate dialogue speech that maintains speaker characteristics
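The five steps above amount to a small bookkeeping problem. The class below is a hypothetical sketch (name and methods are not from the project) showing how speaker identifiers, reference audio, and multi-turn context might be tracked together:

```python
class SpeakerSession:
    """Hypothetical bookkeeping for a multi-speaker dialogue session."""

    def __init__(self):
        self.reference_audio = {}  # speaker id -> reference wav path
        self.history = []          # (speaker, text) turns, in order

    def register(self, speaker, wav_path):
        # Steps 1-2: assign an identifier and attach reference audio
        self.reference_audio[speaker] = wav_path

    def add_turn(self, speaker, text):
        # Step 3: maintain context; reject speakers with no reference audio
        if speaker not in self.reference_audio:
            raise KeyError(f"no reference audio registered for {speaker!r}")
        self.history.append((speaker, text))

    def context(self, window=8):
        # Return the most recent turns as generation context
        return self.history[-window:]

session = SpeakerSession()
session.register("Host", "host.wav")
session.register("Guest", "guest.wav")
session.add_turn("Host", "Welcome!")
session.add_turn("Guest", "Thanks!")
print(session.context())
```

The sliding `window` mirrors the context-window management mentioned earlier: only the most recent turns are fed back in when the full history grows too long.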
Zero-Shot Voice Cloning
Implementation of zero-shot voice cloning:
- Feature disentanglement: Decouple speaker features from language features
- Feature extraction: Extract speaker features from reference audio
- Feature transfer: Transfer speaker features to the target language/dialect
- Speech generation: Generate speech based on transferred features
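A toy sketch of the disentanglement idea, not the model's actual computation: the "speaker embedding" here is just an average of per-frame features (real systems use a trained speaker encoder), and that embedding then conditions content features drawn from the target dialect:

```python
def mean_embedding(frames):
    """Toy 'speaker encoder': average per-frame feature vectors.

    Averaging is only a stand-in to show that the result is
    independent of utterance length and content.
    """
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

# Per-frame features from the reference audio (speaker A)
reference_frames = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
speaker_embedding = mean_embedding(reference_frames)

def synthesize(speaker_embedding, content_features):
    # Toy combiner: the speaker vector conditions every content frame
    return [[s + c for s, c in zip(speaker_embedding, frame)]
            for frame in content_features]

# Content features for the target dialect, from a different utterance
dialect_content = [[0.0, 0.5], [0.2, 0.3]]
conditioned = synthesize(speaker_embedding, dialect_content)
print(conditioned)
```

The key property this illustrates is that the timbre vector is computed once from the reference audio and reused for any content, which is what makes the cloning "zero-shot".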
Paralinguistic Event Generation
How paralinguistic events are generated:
- Tag recognition: Recognize paralinguistic tags in text
- Event modeling: Build models for each type of paralinguistic event
- Natural integration: Naturally integrate paralinguistic events into the speech stream
- Temporal alignment: Ensure paralinguistic events appear at the correct time points
Usage
WebUI Usage
SoulX-Podcast provides a friendly WebUI interface:
# Start WebUI (base model)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B
# Start WebUI (dialect model)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B-dialect
WebUI features:
- Text input and editing
- Reference audio upload
- Paralinguistic tag insertion
- Real-time preview and adjustment
- Audio export
API Usage
SoulX-Podcast provides an API interface:
# Start API service
python3 run_api.py --model_path pretrained_models/SoulX-Podcast-1.7B
API endpoints:
- /generate: Generate single-speaker speech
- /generate_dialogue: Generate multi-speaker dialogue
- /clone_voice: Zero-shot voice cloning
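The endpoint names come from the project, but the request shape below is an assumption for illustration; check `run_api.py` for the real schema, host, and port. A minimal client sketch using only the standard library:

```python
import json
import urllib.request

def build_dialogue_request(turns, reference_audios):
    """Assemble a hypothetical JSON payload for /generate_dialogue.

    The field names ('dialogue', 'reference_audios') are assumptions,
    not the documented schema.
    """
    return {"dialogue": turns, "reference_audios": reference_audios}

payload = build_dialogue_request(
    [{"speaker": "Host", "text": "Welcome!"},
     {"speaker": "Guest", "text": "Thanks! <|laughter|>"}],
    {"Host": "host.wav", "Guest": "guest.wav"},
)

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:7860/generate_dialogue",  # port is an assumption
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # urllib.request.urlopen(req)  # uncomment with the API server running
```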
vLLM Acceleration
SoulX-Podcast supports vLLM acceleration:
# Build Docker image
cd runtime/vllm
docker build -t soulxpodcast:v1.0 .
# Run container
docker run -it --runtime=nvidia --name soulxpodcast \
-v /mnt/data:/mnt/data -p 7860:7860 soulxpodcast:v1.0
Advantages:
- Faster inference speed
- Better GPU utilization
- Supports batch processing
- Easy to deploy and scale
Comparison with Other Projects
Comparison with Supertonic
| Feature | SoulX-Podcast | Supertonic |
|---|---|---|
| Primary use | Podcast-style multi-turn dialogue | On-device single-speaker TTS |
| Multi-speaker | ✅ Native support | ❌ Not supported |
| Multi-turn dialogue | ✅ Contextually coherent | ❌ Single-turn |
| Dialect support | ✅ Zero-shot cross-dialect | ⚠️ Limited |
| Paralinguistic control | ✅ Multiple paralinguistics | ❌ Not supported |
| Deployment | Cloud/local | On-device |
| Performance | High quality | Blazing fast |
Recommendation:
- Need podcast-style multi-turn dialogue → SoulX-Podcast
- Need on-device ultra-fast TTS → Supertonic
Comparison with Other Conversational TTS
SoulX-Podcast's advantages over other conversational TTS systems:
- Designed specifically for podcasts: Specifically optimized for podcast-style multi-turn dialogue
- Cross-dialect support: Unique cross-dialect zero-shot cloning capability
- Paralinguistic control: Rich paralinguistic event support
- High-quality generation: Podcast-level speech quality
- Easy to use: Friendly WebUI and API interfaces
Project Resources
Official Resources
- 🌟 GitHub: https://github.com/Soul-AILab/SoulX-Podcast
- 🌐 Demo: Hugging Face Spaces
- 📦 Models: SoulX-Podcast-1.7B | SoulX-Podcast-1.7B-dialect
- 📄 Paper: arXiv:2510.23541
Who Should Use This
SoulX-Podcast is especially suitable for: Content creators who need to generate podcast content, application developers needing multi-speaker conversational speech, developers needing dialect speech synthesis, AI application developers needing high-quality natural speech synthesis, developers with high requirements for speech synthesis quality, and developers needing paralinguistic control.
Not suitable for: Users who only need simple single-speaker TTS, on-device applications with strict model size constraints, scenarios that don't require multi-turn dialogue.
Visit my personal homepage for more useful knowledge and interesting products.