Open Source Project of the Day (Part 12): SoulX-Podcast - Multi-Turn Conversational Podcast Generation

Introduction

"What if AI could generate conversations as natural as a real podcast — supporting multi-turn interactions, different dialects, and even real emotions like laughter and sighs?"

This is Part 12 of the "Open Source Project of the Day" series. Today we explore SoulX-Podcast (GitHub).

Traditional TTS systems are primarily designed for single-speaker, single-turn synthesis, and the generated speech lacks the natural feel of real conversations. SoulX-Podcast is specifically designed for podcast-style multi-turn, multi-speaker conversational speech generation. It not only supports cross-dialect zero-shot voice cloning but also controls paralinguistic events (such as laughter and sighs) to make AI-generated speech more natural and authentic — truly achieving podcast-level quality.

What You'll Learn

  • SoulX-Podcast's core architecture and technical characteristics
  • How multi-turn, multi-speaker conversational speech generation works
  • The technical breakthrough of cross-dialect zero-shot voice cloning
  • How paralinguistic control (laughter, sighs, etc.) is implemented
  • How to use SoulX-Podcast to generate high-quality podcasts
  • Comparative analysis with other TTS systems
  • How to use the WebUI and API

Prerequisites

  • Basic understanding of TTS (Text-to-Speech)
  • Basic understanding of speech synthesis concepts
  • Familiarity with Python programming (optional)
  • Basic knowledge of zero-shot learning (optional)

Project Background

Project Introduction

SoulX-Podcast is a TTS system specifically designed for podcast-style multi-turn, multi-speaker conversational speech generation, developed by the Soul AI team. It not only performs excellently on traditional single-speaker TTS tasks, but more importantly achieves high-quality multi-turn conversational speech generation, making AI-generated speech closer to the natural feel of real podcasts.

Core problems the project solves:

  • Traditional TTS systems only support single-speaker, single-turn synthesis
  • Lack of contextual understanding and coherence for multi-turn dialogue
  • Cannot generate natural conversations between multiple speakers
  • Lack of dialect support, unable to achieve cross-dialect voice cloning
  • Generated speech lacks authentic emotional expression (such as laughter, sighs, and other paralinguistic elements)

Target user groups:

  • Content creators who need to generate podcast content
  • Application developers needing multi-speaker conversational speech
  • Developers needing dialect speech synthesis
  • AI application developers needing high-quality, natural speech synthesis
  • Developers with high requirements for speech synthesis quality

Author/Team Introduction

Team: Soul AI Lab

  • Background: Research team focused on voice technology and AI
  • Contributors: 4 core contributors
  • Philosophy: Build a high-quality, natural, authentic podcast-style speech generation system
  • Related work: Published related technical papers, provides models and demos on Hugging Face

Project creation date: October 2025 (based on GitHub activity, an actively maintained project)

Project Stats

  • ⭐ GitHub Stars: 3.1k+ (rapidly growing)
  • 🍴 Forks: 403+
  • 📦 Version: Latest version (continuously updated)
  • 📄 License: Apache-2.0

Project development history:

  • October 2025: Project created, initial version released
  • October 28, 2025: Paper published
  • October 29, 2025: Model released on Hugging Face
  • October 30, 2025: Added WebUI and single-speaker TTS examples
  • October 31, 2025: Deployed Hugging Face online demo
  • November 3, 2025: Added vLLM acceleration and Docker deployment support

Main Features

Core Purpose

SoulX-Podcast's core purpose is to generate high-quality, natural, authentic podcast-style multi-turn conversational speech, with main features including:

  1. Multi-turn, multi-speaker conversational speech generation: Supports natural dialogue between multiple speakers while maintaining contextual coherence
  2. Cross-dialect zero-shot voice cloning: Supports dialects like Sichuan, Henan, and Cantonese — just provide a reference audio to clone
  3. Paralinguistic control: Supports paralinguistic events like laughter, sighs, breathing, coughing, and throat clearing to enhance realism
  4. High-quality single-speaker TTS: Also performs excellently on traditional single-speaker TTS tasks
  5. Multilingual support: Supports Chinese (Mandarin and various dialects) and English

Use Cases

  1. Podcast content generation

    • Automatically generate podcast dialogue content
    • Natural conversation between multiple speakers
    • Add authentic emotional expressions (laughter, sighs, etc.)
  2. Audiobook production

    • Audiobooks with multi-character dialogue
    • Voice generation for characters with different dialects
    • Natural emotional expression
  3. Educational content production

    • Multi-speaker teaching dialogues
    • Educational content in different dialects
    • Engaging conversational teaching
  4. Games and entertainment applications

    • Voice generation for game characters
    • Character voices in different dialects
    • Rich paralinguistic expression
  5. Assistive technology applications

    • Generating natural conversations for visually impaired users
    • Personalized voice assistants
    • Multilingual, multi-dialect voice services

Quick Start

Installation steps:

# 1. Clone repository
git clone https://github.com/Soul-AILab/SoulX-Podcast.git
cd SoulX-Podcast

# 2. Create Conda environment
conda create -n soulxpodcast -y python=3.11
conda activate soulxpodcast

# 3. Install dependencies
pip install -r requirements.txt

# 4. Download model
pip install -U huggingface_hub
huggingface-cli download --resume-download Soul-AILab/SoulX-Podcast-1.7B \
  --local-dir pretrained_models/SoulX-Podcast-1.7B

Simplest usage example:

# Use WebUI (simplest approach)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B

# Or use the dialect model
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B-dialect

# Use command-line example script
bash example/infer_dialogue.sh

Python code example (illustrative; the `SoulXPodcast` class and method names below are a sketch, so check the repository's example scripts for the actual API):

from soulxpodcast import SoulXPodcast

# Initialize model
model = SoulXPodcast(model_path="pretrained_models/SoulX-Podcast-1.7B")

# Generate multi-speaker dialogue
dialogue = [
    {"speaker": "Host", "text": "Welcome to today's podcast episode!"},
    {"speaker": "Guest", "text": "Thanks for having me! <|laughter|> So happy to be here."},
    {"speaker": "Host", "text": "Let's dive into today's discussion."}
]

# Generate speech
audio = model.generate_dialogue(dialogue, reference_audios={
    "Host": "path/to/host_audio.wav",
    "Guest": "path/to/guest_audio.wav"
})

# Save audio
model.save_audio(audio, "output_podcast.wav")

Core Features

  1. Multi-turn, multi-speaker conversational speech generation

    • Supports natural dialogue between multiple speakers
    • Maintains contextual coherence and dialogue flow
    • Each speaker can have different voice characteristics
  2. Cross-dialect zero-shot voice cloning

    • Supports dialects like Sichuan, Henan, and Cantonese
    • Only requires providing a reference audio to clone
    • No need to train separate models for each dialect
  3. Paralinguistic control

    • Supports multiple paralinguistic tags: <|laughter|>, <|sigh|>, <|breathing|>, <|coughing|>, <|throat_clearing|>
    • Enhances the realism and naturalness of speech
    • Makes AI-generated speech closer to human expression
  4. High-quality single-speaker TTS

    • Performs excellently on traditional single-speaker TTS tasks
    • Supports long-text speech synthesis
    • Generates natural, clear speech
  5. Multilingual support

    • Supports Chinese (Mandarin and various dialects)
    • Supports English
    • Can mix multiple languages
  6. WebUI interface

    • Friendly graphical interface
    • Simple and easy-to-use workflow
    • Real-time preview and adjustment
  7. API support

    • Provides RESTful API interface
    • Easy to integrate into other applications
    • Supports batch processing
  8. vLLM acceleration

    • Supports vLLM inference acceleration
    • Docker deployment support
    • Improves generation speed

Project Advantages

| Comparison | SoulX-Podcast | Traditional TTS | Other Conversational TTS |
| --- | --- | --- | --- |
| Multi-speaker dialogue | ✅ Native support | ❌ Not supported | ⚠️ Limited support |
| Multi-turn dialogue | ✅ Contextually coherent | ❌ Single-turn | ⚠️ Limited support |
| Dialect support | ✅ Zero-shot cross-dialect | ❌ Not supported | ❌ Not supported |
| Paralinguistic control | ✅ Multiple paralinguistics | ❌ Not supported | ❌ Not supported |
| Speech quality | ✅ Podcast-level | ⚠️ Average | ⚠️ Average |
| Naturalness | ✅ High naturalness | ⚠️ Moderate | ⚠️ Moderate |
| Zero-shot cloning | ✅ Supported | ⚠️ Limited | ⚠️ Limited |

Why choose SoulX-Podcast?

Compared to traditional TTS and other conversational TTS systems, SoulX-Podcast is specifically designed for podcast-style multi-turn, multi-speaker dialogue, supports cross-dialect zero-shot voice cloning and paralinguistic control, and generates more natural and authentic speech — making it the ideal choice for podcast content generation and high-quality conversational speech synthesis.


Detailed Project Analysis

Architecture Design

SoulX-Podcast uses a Transformer-based generative architecture specifically optimized for multi-turn, multi-speaker conversational speech generation.

Core Architecture

SoulX-Podcast System
├── Text Processing
│   ├── Multi-speaker dialogue parsing
│   ├── Paralinguistic tag recognition
│   ├── Context understanding
│   └── Multilingual processing
├── Voice Cloning
│   ├── Reference audio encoding
│   ├── Speaker feature extraction
│   ├── Cross-dialect feature transfer
│   └── Zero-shot cloning
├── Speech Generation
│   ├── Multi-turn dialogue generation
│   ├── Paralinguistic event generation
│   ├── Contextual coherence maintenance
│   └── High-quality audio synthesis
└── Model Architecture
    ├── Transformer Encoder
    ├── Multi-Speaker Attention
    ├── Dialect-Aware Module
    └── Paralinguistic Control Module

Multi-Turn Dialogue Generation

SoulX-Podcast's core innovation lies in contextual understanding of multi-turn dialogue:

Workflow:

  1. Parse multi-speaker dialogue text
  2. Extract reference audio features for each speaker
  3. Understand dialogue context and coherence
  4. Generate speech that maintains contextual coherence
  5. Handle natural transitions between speakers

Technical characteristics:

  • Uses Transformer architecture for processing long sequences
  • Multi-speaker attention mechanism
  • Context window management
  • Dialogue coherence modeling
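The context-window management above can be illustrated with a minimal sketch. This is a generic sliding-window strategy, not the project's actual implementation; `build_context` and the turn format are hypothetical:

```python
from collections import deque

def build_context(turns, max_turns=4):
    """Keep only the most recent turns as conditioning context,
    a simple sliding-window strategy for multi-turn coherence."""
    window = deque(maxlen=max_turns)  # older turns fall off the front
    for turn in turns:
        window.append(f"[{turn['speaker']}] {turn['text']}")
    return "\n".join(window)

dialogue = [
    {"speaker": "Host", "text": "Welcome back!"},
    {"speaker": "Guest", "text": "Glad to be here."},
    {"speaker": "Host", "text": "Last time we covered TTS basics."},
    {"speaker": "Guest", "text": "Let's go deeper today."},
]
print(build_context(dialogue, max_turns=3))
```

A real system manages context in the model's token space rather than raw text, but the trade-off is the same: a longer window preserves coherence at the cost of compute.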

Cross-Dialect Zero-Shot Voice Cloning

SoulX-Podcast achieves cross-dialect zero-shot voice cloning, which is an important technical breakthrough:

Working principle:

  1. Extract speaker features from reference audio (dialect-independent)
  2. Recognize linguistic features of the target dialect
  3. Transfer speaker features to the target dialect
  4. Generate speech in the target dialect
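The working principle can be sketched conceptually with placeholder functions (a real speaker encoder and synthesizer replace them); the point is that speaker identity and dialect are independent conditioning signals:

```python
def extract_speaker_embedding(reference_audio: str) -> list:
    """Placeholder: a real system runs a speaker encoder on the waveform."""
    return [float(len(reference_audio) % 7)]

def synthesize(text: str, speaker_emb: list, dialect: str) -> dict:
    """Dialect is a separate control signal, decoupled from speaker identity."""
    return {"text": text, "speaker": speaker_emb, "dialect": dialect}

# One Mandarin reference voice, rendered in two different dialects.
emb = extract_speaker_embedding("mandarin_reference.wav")
sichuan = synthesize("今天天气真好", emb, dialect="sichuan")
cantonese = synthesize("今天天气真好", emb, dialect="cantonese")
assert sichuan["speaker"] == cantonese["speaker"]  # same voice, no retraining
```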

Supported dialects:

  • Sichuan dialect
  • Henan dialect
  • Cantonese
  • Other Chinese dialects (via model extension)

Advantages:

  • No need to train separate models for each dialect
  • Only requires providing reference audio to clone
  • Preserves the speaker's voice characteristics
  • Accurately reproduces dialect features

Paralinguistic Control

SoulX-Podcast supports multiple paralinguistic events to enhance the realism of speech:

Supported paralinguistic tags:

  • <|laughter|>: Laughter
  • <|sigh|>: Sigh
  • <|breathing|>: Breathing sound
  • <|coughing|>: Coughing sound
  • <|throat_clearing|>: Throat clearing

Implementation:

  1. Insert paralinguistic tags in text
  2. Model recognizes tags and generates corresponding audio events
  3. Naturally integrates into the speech stream
  4. Maintains speech coherence
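Step 1, tag recognition, can be sketched with a small tokenizer. The tag set below matches the documented tags; the `tokenize` helper itself is hypothetical:

```python
import re

PARA_TAGS = {"laughter", "sigh", "breathing", "coughing", "throat_clearing"}
TAG_RE = re.compile(r"<\|(" + "|".join(PARA_TAGS) + r")\|>")

def tokenize(text):
    """Split input into ('text', ...) and ('event', ...) tokens."""
    tokens, pos = [], 0
    for m in TAG_RE.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()].strip()))
        tokens.append(("event", m.group(1)))  # tag name without delimiters
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:].strip()))
    return tokens

print(tokenize("Nice to meet you! <|laughter|> Let's begin."))
# → [('text', 'Nice to meet you!'), ('event', 'laughter'), ('text', "Let's begin.")]
```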

Usage example (illustrative; `model.generate` sketches the interface):

# Use paralinguistic tags in text
text = "The weather is so nice today! <|laughter|> Let's go for a walk. <|sigh|> But be careful about sun protection."

# The model will automatically recognize and generate corresponding paralinguistic events
audio = model.generate(text, reference_audio="speaker.wav")

Model Architecture

SoulX-Podcast is based on a 1.7B parameter Transformer model:

Model characteristics:

  • Parameter count: 1.7B (base model and dialect model)
  • Architecture: Transformer-based generative model
  • Training data: Large-scale multi-speaker dialogue data
  • Optimization: Optimized for multi-turn dialogue and dialect support

Two versions:

  1. SoulX-Podcast-1.7B: Base model, supports multi-turn dialogue and paralinguistic control
  2. SoulX-Podcast-1.7B-dialect: Dialect model, additionally supports cross-dialect zero-shot cloning

Key Technical Implementation

Multi-Speaker Dialogue Processing

SoulX-Podcast processes multi-speaker dialogue through:

  1. Speaker identification: Assign unique identifiers to each speaker
  2. Reference audio management: Provide reference audio for each speaker
  3. Context management: Maintain context information for multi-turn dialogue
  4. Feature extraction: Extract speaker features from reference audio
  5. Dialogue generation: Generate dialogue speech that maintains speaker characteristics
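The speaker and reference-audio bookkeeping in steps 1–2 can be sketched as a small validation layer (the `Turn` type and `validate_dialogue` are illustrative, not part of the library):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # unique speaker identifier
    text: str

def validate_dialogue(turns, reference_audios):
    """Every speaker appearing in the script needs a reference audio clip."""
    missing = {t.speaker for t in turns} - set(reference_audios)
    if missing:
        raise ValueError(f"Missing reference audio for: {sorted(missing)}")
    return True

turns = [Turn("Host", "Welcome!"), Turn("Guest", "Thanks for having me!")]
refs = {"Host": "host.wav", "Guest": "guest.wav"}
assert validate_dialogue(turns, refs)
```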

Zero-Shot Voice Cloning

Implementation of zero-shot voice cloning:

  1. Feature disentanglement: Decouple speaker features from language features
  2. Feature extraction: Extract speaker features from reference audio
  3. Feature transfer: Transfer speaker features to the target language/dialect
  4. Speech generation: Generate speech based on transferred features

Paralinguistic Event Generation

How paralinguistic events are generated:

  1. Tag recognition: Recognize paralinguistic tags in text
  2. Event modeling: Build models for each type of paralinguistic event
  3. Natural integration: Naturally integrate paralinguistic events into the speech stream
  4. Temporal alignment: Ensure paralinguistic events appear at the correct time points
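Steps 3–4 amount to splicing event clips into the synthesized stream at the positions where their tags appeared. A toy sketch over fake sample lists (real systems operate on waveforms or audio tokens, and events are generated rather than canned):

```python
def assemble(tokens, synth_text, event_bank):
    """Concatenate synthesized text segments and event clips in tag order."""
    stream = []
    for kind, value in tokens:
        stream.extend(synth_text(value) if kind == "text" else event_bank[value])
    return stream

fake_tts = lambda text: [0.0] * len(text)   # silence proportional to text length
event_bank = {"laughter": [0.5, 0.5, 0.5]}  # canned 3-sample "laughter" clip
tokens = [("text", "Hi!"), ("event", "laughter"), ("text", "Bye")]
audio = assemble(tokens, fake_tts, event_bank)
print(len(audio))  # → 9 samples: 3 text + 3 laughter + 3 text
```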

Usage

WebUI Usage

SoulX-Podcast provides a friendly WebUI interface:

# Start WebUI (base model)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B

# Start WebUI (dialect model)
python3 webui.py --model_path pretrained_models/SoulX-Podcast-1.7B-dialect

WebUI features:

  • Text input and editing
  • Reference audio upload
  • Paralinguistic tag insertion
  • Real-time preview and adjustment
  • Audio export

API Usage

SoulX-Podcast provides an API interface:

# Start API service
python3 run_api.py --model_path pretrained_models/SoulX-Podcast-1.7B

API endpoints:

  • /generate: Generate single-speaker speech
  • /generate_dialogue: Generate multi-speaker dialogue
  • /clone_voice: Zero-shot voice cloning

vLLM Acceleration

SoulX-Podcast supports vLLM acceleration:

# Build Docker image
cd runtime/vllm
docker build -t soulxpodcast:v1.0 .

# Run container
docker run -it --runtime=nvidia --name soulxpodcast \
  -v /mnt/data:/mnt/data -p 7860:7860 soulxpodcast:v1.0

Advantages:

  • Faster inference speed
  • Better GPU utilization
  • Supports batch processing
  • Easy to deploy and scale

Comparison with Other Projects

Comparison with Supertonic

| Feature | SoulX-Podcast | Supertonic |
| --- | --- | --- |
| Primary use | Podcast-style multi-turn dialogue | On-device single-speaker TTS |
| Multi-speaker | ✅ Native support | ❌ Not supported |
| Multi-turn dialogue | ✅ Contextually coherent | ❌ Single-turn |
| Dialect support | ✅ Zero-shot cross-dialect | ⚠️ Limited |
| Paralinguistic control | ✅ Multiple paralinguistics | ❌ Not supported |
| Deployment | Cloud/local | On-device |
| Performance | High quality | Blazing fast |

Recommendation:

  • Need podcast-style multi-turn dialogue → SoulX-Podcast
  • Need on-device ultra-fast TTS → Supertonic

Comparison with Other Conversational TTS

SoulX-Podcast's advantages over other conversational TTS systems:

  1. Designed specifically for podcasts: Specifically optimized for podcast-style multi-turn dialogue
  2. Cross-dialect support: Unique cross-dialect zero-shot cloning capability
  3. Paralinguistic control: Rich paralinguistic event support
  4. High-quality generation: Podcast-level speech quality
  5. Easy to use: Friendly WebUI and API interfaces

Who Should Use This

SoulX-Podcast is especially suitable for: Content creators who need to generate podcast content, application developers needing multi-speaker conversational speech, developers needing dialect speech synthesis, AI application developers needing high-quality natural speech synthesis, developers with high requirements for speech synthesis quality, and developers needing paralinguistic control.

Not suitable for: Users who only need simple single-speaker TTS, on-device applications with strict model size constraints, scenarios that don't require multi-turn dialogue.


Visit my personal homepage for more useful knowledge and interesting products.
