VibeVoice: Generate 90-Minute Multi-Speaker Audio Locally

VibeVoice is a framework for generating expressive, long-form, multi-speaker conversational audio, such as podcasts and dialogues, directly from text. It addresses two weaknesses of traditional Text-to-Speech (TTS) systems, scalability and natural turn-taking, and can produce up to 90 minutes of audio with up to four distinct speakers.

Core Features of VibeVoice

Three design choices distinguish VibeVoice from other TTS models and make long-form audio generation practical.

1. Extended Multi-Speaker Generation
The model can synthesize up to 90 minutes of continuous audio with up to 4 distinct speakers. This surpasses the typical 1-2 speaker limits of many prior models, which makes it well suited to podcasts, audiobooks, and dramatic readings.

2. Highly Efficient Speech Tokenizers
A key innovation is its use of continuous speech tokenizers (Acoustic and Semantic) that operate at an ultra-low frame rate of 7.5 Hz. The low frame rate keeps sequences short, which reduces compute cost while preserving audio fidelity and makes very long audio sequences practical to process.
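To make the 7.5 Hz figure concrete, the short sketch below works out how many tokenizer frames a 90-minute recording needs. The 50 Hz comparison rate is a hypothetical value chosen only for contrast and is not taken from VibeVoice or any other specific model.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in this article


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# 90 minutes * 60 s/min * 7.5 frames/s
print(frames_for(90))        # 40500

# Same duration at a hypothetical 50 Hz frame rate, for comparison only
print(frames_for(90, 50.0))  # 270000
```

At 7.5 Hz, the full 90 minutes comes to 40,500 frames, which is why the low frame rate matters for long sequences.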

3. Advanced LLM and Diffusion Framework
VibeVoice uses a Large Language Model (LLM) to understand the textual context and dialogue flow, combined with a diffusion head that generates high-fidelity acoustic details. This lets the model produce natural-sounding conversations that capture the rhythm and pacing of human speech. It currently supports both English and Chinese.

The features described above have been integrated into a one-click local installation package. It runs directly on your personal computer, which keeps your data private and avoids a complex manual setup.

Quick Setup and Usage Guide

Getting started with the VibeVoice local package is straightforward.

Step 1: Download and Extract
Download the compressed package, extract its contents, and double-click the startup command to launch the application.

Step 2: Input Text and Select Speakers
Provide the text you want to convert to audio. You can assign different parts of the text to different speakers to create a conversation.
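Before pasting a long script into the tool, it can help to check it offline. The sketch below assumes each line is written as "Speaker N: text"; that format, along with the names parse_script and LINE_RE, is an assumption made for illustration, and the local package's interface may expect a different layout. It splits a script into turns and rejects scripts that use more than the four speakers mentioned in this article.

```python
import re

MAX_SPEAKERS = 4  # speaker limit stated in this article

# Assumed line format: "Speaker 1: some text" (hypothetical; check the tool's UI)
LINE_RE = re.compile(r"^Speaker\s+(\d+)\s*:\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a script into (speaker_number, text) turns, skipping blank lines."""
    turns = []
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected 'Speaker N: text'")
        turns.append((int(match.group(1)), match.group(2)))

    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} are supported"
        )
    return turns


example = """Speaker 1: Welcome back to the show.
Speaker 2: Thanks, glad to be here.

Speaker 1: Let's get started."""

for speaker, text in parse_script(example):
    print(speaker, text)
```

A check like this catches a mistyped speaker label before a long generation run starts, rather than after.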

Step 3: Configure and Generate
Adjust the parameters as needed, then click "Run" to start the generation process. The final audio output will be saved and ready for use.

System Requirements

To run the local package smoothly, your system should meet the following requirements:

OS: Windows 10/11 (64-bit)
GPU: NVIDIA 30, 40, or 50 series card with at least 8GB of VRAM
CUDA: Version 12.4 or higher
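
The VRAM requirement can be checked from a script before downloading the package. The sketch below queries the nvidia-smi command-line tool that ships with the NVIDIA driver; the helper names and the 8000 MiB threshold are assumptions for illustration. The threshold sits slightly under 8192 MiB because drivers often report a little less than a card's nominal capacity.

```python
import subprocess

# Assumed threshold: a nominal 8 GB card may report slightly under 8192 MiB
MIN_VRAM_MIB = 8000


def parse_gpu_rows(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory_mib' rows as printed by nvidia-smi in CSV mode."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, mem = line.rsplit(",", 1)
        rows.append((name.strip(), int(mem.strip())))
    return rows


def meets_vram(mib: int) -> bool:
    """Return True if the reported memory satisfies the assumed threshold."""
    return mib >= MIN_VRAM_MIB


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available; no NVIDIA driver was detected")
    else:
        for name, mib in gpus:
            status = "OK" if meets_vram(mib) else "below the 8GB requirement"
            print(f"{name}: {mib} MiB ({status})")
```

This checks memory only. It does not verify the GPU series or the CUDA version listed above.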

This local package provides a private, user-friendly way to run VibeVoice without a complicated environment setup.