TL;DR
VibeVoice is Microsoft’s open-source voice AI suite, featuring three models: VibeVoice-1.5B (text-to-speech, up to 90 mins, 4 speakers), VibeVoice-Realtime-0.5B (streaming TTS), and VibeVoice-ASR (speech recognition, 60 mins, 50+ languages, 7.77% WER). All are MIT-licensed and run locally. This guide provides actionable steps for installation, usage, and API integration.
Introduction
Microsoft released VibeVoice as an open-source, local voice AI framework in early 2026, supporting both text-to-speech (TTS) and automatic speech recognition (ASR). All models run on your own hardware, with no cloud requirement.
Three model types:
- VibeVoice-1.5B: Generates expressive TTS from text, supporting up to 90 minutes and 4 different speakers per pass.
- VibeVoice-Realtime-0.5B: Lightweight streaming TTS, ~300ms latency for first audio chunk.
- VibeVoice-ASR: Transcribes up to 60 minutes of audio, supports 50+ languages, outputs structured transcripts with speaker IDs and timestamps.
Microsoft temporarily disabled the main repository after reports of voice-cloning misuse. They re-enabled it with these safeguards:
- Audible AI disclaimer in all generated audio
- Imperceptible watermarking for provenance
VibeVoice-ASR is available on Azure AI Foundry (cloud). TTS models remain MIT-licensed for local use.
This guide covers installation, TTS generation, speech recognition, API integration, and voice AI endpoint testing using Apidog.
How VibeVoice Works: Architecture Overview
The Tokenizer Breakthrough
VibeVoice uses continuous speech tokenizers at 7.5 Hz—far lower than typical 50-100 Hz rates. This allows handling of very long sequences (up to 90 minutes) without context loss.
There are two tokenizers:
- Acoustic Tokenizer: Sigma-VAE, ~340M parameters, downsampling audio 3,200x from 24kHz input.
- Semantic Tokenizer: Same architecture, trained for linguistic meaning using an ASR proxy.
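Back-of-the-envelope arithmetic from the figures above shows why the low frame rate matters for long-form generation:

```python
# Why 7.5 Hz matters: acoustic frames needed for a 90-minute session.
SAMPLE_RATE = 24_000        # Hz, model input/output rate
DOWNSAMPLE = 3_200          # acoustic tokenizer downsampling factor
frame_rate = SAMPLE_RATE / DOWNSAMPLE  # 7.5 Hz, matching the spec above

seconds = 90 * 60                        # 90-minute session
frames_vibevoice = seconds * frame_rate  # 40,500 frames: fits a 64K context
frames_typical = seconds * 50            # a 50 Hz tokenizer would need 270,000

print(frames_vibevoice, frames_typical)
```

At 7.5 Hz, 90 minutes of audio needs roughly 40,500 frames, comfortably inside a 64K-token context; a conventional 50 Hz tokenizer would need 270,000, far beyond it.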
Next-Token Diffusion
VibeVoice combines a Qwen2.5-1.5B LLM backbone with a ~123M parameter diffusion head. The LLM manages context/dialogue; the diffusion head generates acoustic detail using DDPM with Classifier-Free Guidance.
- Total parameter count: ~3B
Training Approach
Curriculum learning: Training starts with short sequences (4K tokens), then increases to 16K, 32K, and 64K. Tokenizers are frozen; only the LLM and diffusion head update.
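The schedule can be sketched in a few lines (the step boundaries below are illustrative assumptions; only the token lengths come from the text above):

```python
# Illustrative sequence-length curriculum. Token lengths match the article;
# the step boundaries are hypothetical, not published hyperparameters.
CURRICULUM = [
    (0, 4_096),         # start with short sequences
    (10_000, 16_384),
    (20_000, 32_768),
    (30_000, 65_536),   # final 64K-token context
]

def seq_len_for_step(step: int) -> int:
    """Return the max training sequence length for a given step."""
    length = CURRICULUM[0][1]
    for start, n_tokens in CURRICULUM:
        if step >= start:
            length = n_tokens
    return length

print(seq_len_for_step(0))       # 4096
print(seq_len_for_step(25_000))  # 32768
```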
VibeVoice Model Specifications
| Model | Parameters | Purpose | Max length | Languages | License |
|---|---|---|---|---|---|
| VibeVoice-1.5B | 3B | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |
VibeVoice-1.5B (TTS)
| Specification | Value |
|---|---|
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24kHz WAV mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |
VibeVoice-ASR
| Specification | Value |
|---|---|
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 mins, single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% (8 English datasets) |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured (Who/When/What) |
| Supported audio | WAV, FLAC, MP3 (16kHz+) |
Installation and Setup
Prerequisites
- Python 3.8+
- NVIDIA GPU with CUDA
- TTS: 7–8 GB VRAM minimum
- ASR: 24 GB VRAM minimum (A100/H100 recommended)
- 32 GB RAM minimum (64 GB for ASR)
- CUDA 11.8+ (12.0+ recommended)
Install VibeVoice TTS
```bash
# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

# Install dependencies
pip install -r requirements.txt
```
Models download automatically on first run, or pre-download:
```python
from huggingface_hub import snapshot_download

# Download the 1.5B TTS model
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
    local_dir_use_symlinks=False,
)
```
Install via pip (Community Package)
```bash
pip install vibevoice
```
Install for ASR
```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements-asr.txt
```
Or use Azure AI Foundry for managed cloud inference.
Generating Speech with VibeVoice-1.5B
Single-Speaker Generation
Create your script:
```text
Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.
```
Run inference:
```bash
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5
```
Output is saved as .wav in outputs/.
Multi-Speaker Podcast Generation
Supports up to 4 speakers with consistent identities:
```text
Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.
```
```bash
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5
```
Voice Cloning (Zero-Shot)
Requirements
- WAV mono, 24,000 Hz, 30–60 seconds of clear speech
Convert audio:
```bash
ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav
```
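Before cloning, it can help to sanity-check the reference file against the requirements above. A minimal check with Python's standard `wave` module (the file name is just an example):

```python
import wave

def check_reference(path: str) -> None:
    """Verify a voice-cloning reference: 24 kHz, mono, 30-60 s."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    assert rate == 24_000, f"expected 24 kHz, got {rate} Hz"
    assert channels == 1, f"expected mono, got {channels} channels"
    assert 30 <= duration <= 60, f"expected 30-60 s, got {duration:.1f} s"

# check_reference("reference_voice.wav")  # raises AssertionError if unsuitable
```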
Use Gradio demo:
```bash
python demo/gradio_demo.py
```
Open http://127.0.0.1:7860, upload reference audio, and generate speech.
Streaming with VibeVoice-Realtime-0.5B
For low-latency (~300ms) streaming output:
```bash
python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice
```
Use Realtime for interactive apps; 1.5B for highest quality.
Using VibeVoice with Python
Pipeline API Example
```python
from transformers import pipeline
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download("microsoft/VibeVoice-1.5B")

# Load pipeline (VibeVoice ships custom model code on the Hub)
pipe = pipeline(
    "text-to-speech",
    model=model_path,
    trust_remote_code=True,
)

# Multi-speaker script
script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

# Apply chat template
input_data = pipe.processor.apply_chat_template(script)

# Generate audio
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}
output = pipe(input_data, generate_kwargs=generate_kwargs)
```
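The pipeline output is not saved to disk automatically. Assuming it follows the usual text-to-speech pipeline convention of a dict with an `audio` array and a `sampling_rate` (field names may differ for this model), a stdlib-only helper can write it out:

```python
import math
import struct
import wave

def save_wav(samples, rate, path):
    """Write float samples in [-1, 1] to a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit PCM
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)

# Stand-in for pipeline output: one second of a 440 Hz tone at 24 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 24_000) for t in range(24_000)]
save_wav(tone, 24_000, "speech.wav")
```

With the pipeline above, this would be roughly `save_wav(output["audio"], output["sampling_rate"], "speech.wav")`, assuming those keys exist.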
FastAPI Wrapper for Production
Run an OpenAI-compatible REST API with Docker:
```bash
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
docker compose up
```
Test with curl:
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav
```
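The same request can be issued from Python with only the standard library (the endpoint and field names mirror the curl call above; the server must be running locally before you uncomment the network call):

```python
import json
import urllib.request

def build_tts_request(text: str, voice: str = "alice") -> urllib.request.Request:
    """Build an OpenAI-style TTS request for the local FastAPI wrapper."""
    payload = {
        "model": "vibevoice-1.5b",
        "input": text,
        "voice": voice,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, fetch and save the audio:
# req = build_tts_request("Your API documentation should be a conversation.")
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```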
You can test this endpoint with Apidog using the same format as OpenAI’s TTS API—import the endpoint, configure the request, and validate voice generation.
Using VibeVoice-ASR for Speech Recognition
Basic Transcription
```bash
python asr_inference.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_path meeting_recording.wav
```
Structured Output Format
Each segment includes:
- Who: Speaker ID
- When: Start/end timestamps
- What: Transcribed text
Example:
```json
{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
```
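The structured output is straightforward to post-process. For example, turning segments into a speaker-labelled transcript with timestamps:

```python
import json

def to_transcript(asr_json: str) -> str:
    """Render Who/When/What segments as '[HH:MM:SS] Speaker: text' lines."""
    lines = []
    for seg in json.loads(asr_json)["segments"]:
        start = int(seg["start"])
        stamp = f"{start // 3600:02d}:{start % 3600 // 60:02d}:{start % 60:02d}"
        lines.append(f"[{stamp}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

example = json.dumps({"segments": [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
     "text": "Let's review the API endpoints for the new release."},
    {"speaker": "Speaker 2", "start": 4.5, "end": 8.1,
     "text": "I've added three new endpoints for the billing module."},
]})
print(to_transcript(example))
# [00:00:00] Speaker 1: Let's review the API endpoints for the new release.
# [00:00:04] Speaker 2: I've added three new endpoints for the billing module.
```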
ASR as an MCP Server
Integrate ASR with coding tools using MCP:
```bash
pip install vibevoice-mcp-server
vibevoice-mcp serve
```
This enables transcription as part of coding workflows (e.g., for agent-based tools).
When to Use VibeVoice-ASR vs Whisper
| Use case | Best choice | Why |
|---|---|---|
| Long meetings (30–60 min) | VibeVoice-ASR | Single-pass, speaker ID |
| Multi-speaker interviews | VibeVoice-ASR | Built-in diarization |
| Podcasts with timestamps | VibeVoice-ASR | Structured output |
| Multilingual (50+ languages) | VibeVoice-ASR | Broader language support |
| Short, noisy clips | Whisper | Better noise robustness |
| Edge/mobile deployment | Whisper | Smaller model, more device support |
| Specialized non-English | Whisper | More mature fine-tuning |
Testing Voice AI APIs with Apidog
Whether you use the VibeVoice FastAPI wrapper, Azure AI Foundry, or your own API, Apidog streamlines integration testing.
Test the TTS Endpoint
- Create a new POST request in Apidog targeting your FastAPI server.
- Use OpenAI-compatible request body:
```json
{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}
```
- Send the request and verify `audio/wav` in the response headers.
- Save and audit the output WAV file.
Test the ASR Endpoint
For speech-to-text:
- POST request with `multipart/form-data`.
- Attach your audio file as a form field.
- Confirm the JSON response includes speaker IDs, timestamps, and text.
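Under the hood, a `multipart/form-data` upload is just a boundary-delimited body. Building one by hand with the standard library clarifies what Apidog (or any client) sends; the field name `file` is an assumption, so match whatever your server expects:

```python
import uuid

def build_multipart(filename: str, audio_bytes: bytes):
    """Build a multipart/form-data body and Content-Type header for an upload."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n"
    ).encode() + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type

body, content_type = build_multipart("meeting_recording.wav", b"RIFF....WAVE")
print(content_type)  # multipart/form-data; boundary=<random hex>
```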
Validate Audio API Contracts
Apidog supports:
- Binary uploads (ASR endpoints)
- JSON bodies (TTS endpoints)
- Response schema validation for structured output
- Environment variables for switching endpoints
Safety and Responsible Use
Microsoft implemented safeguards:
- Audible AI disclaimer: “This segment was generated by AI” in all audio
- Imperceptible watermarking: Content verification
- Inference logging: Abuse detection (hashed, aggregated logs)
- MIT License: Commercial use allowed, but production deployment requires further testing
Allowed Use
- Research, academic, internal prototyping
- Podcast generation with AI disclosure
- Accessibility (TTS for visually impaired)
Not Allowed
- Voice impersonation without explicit consent
- Deepfakes or unlabelled AI audio
- Real-time voice conversion for deepfakes
- Generating music or non-speech audio
Limitations to Know About
- TTS language support is limited: only English and Chinese. Output in other languages may be unintelligible. ASR supports 50+ languages.
- High hardware requirements for ASR: Needs 24GB+ VRAM (A100/H100). TTS models run on 7–8GB consumer GPUs.
- No overlapping speech: TTS is strictly turn-based dialogue.
- Potential biases: Models inherit biases from Qwen2.5.
- Research-grade: Not production-ready; expect rough edges and limited error handling.
Deploying VibeVoice-ASR on Azure AI Foundry
For managed, production-grade ASR, use Azure AI Foundry:
- No GPU management needed
- Automatic scaling and updates
- HTTPS endpoint: upload audio, receive structured (Who/When/What) transcription
Check Azure AI Foundry’s model catalog for current options and pricing.
To test Azure-hosted VibeVoice endpoints, set the URL and authentication headers in Apidog and run test requests with sample audio.
Community and Ecosystem
VibeVoice has an active developer community:
- 62,630+ monthly downloads (HuggingFace, 1.5B model)
- 2,280+ likes, 79+ Spaces, 12 fine-tuned variants
- 4 quantized versions for lower-VRAM use
- Community fork: `vibevoice-community/VibeVoice` (active maintenance)
Notable projects:
- VibeVoice-FastAPI: Production REST API + Docker
- VibeVoice MCP Server: Model Context Protocol integration
- Apple Silicon support: Community scripts for M-series Macs
- Quantized models: GGUF and alternate formats
FAQ
Is VibeVoice free to use?
Yes. All models (TTS 1.5B, Realtime 0.5B, ASR) are MIT-licensed. Azure AI Foundry offers managed hosting with separate pricing.
Can VibeVoice run on Apple Silicon Macs?
Yes, via community scripts. Check HuggingFace discussions for updates. Slower than on CUDA GPUs.
How does VibeVoice compare to ElevenLabs?
- VibeVoice: Free, local, privacy-friendly, no API costs.
- ElevenLabs: Higher quality, more voices, easier setup, paid cloud service.
Why was the GitHub repository temporarily disabled?
Microsoft paused the repo to address voice cloning misuse. They added safety features (disclaimer, watermarking) before reopening. Development continued in community forks.
Can I fine-tune VibeVoice on custom voices?
Yes. Requires 30–60 seconds of clear WAV audio (24kHz mono) and GPU resources.
What audio formats does VibeVoice output?
WAV, 24kHz mono. Convert to MP3/OGG/FLAC with ffmpeg as needed.
Can I use VibeVoice-ASR as a Whisper replacement?
For long-form, multi-speaker audio: yes. VibeVoice-ASR processes 60-minute files with diarization. Whisper is better for noisy, short clips or edge/mobile deployment.
Does VibeVoice support real-time voice chat?
VibeVoice-Realtime-0.5B supports streaming input with ~300ms latency, suitable for near-real-time but not full-duplex chat. For true real-time, consider Azure OpenAI’s GPT-Realtime.
For hands-on API testing, debugging, and validation, use Apidog to streamline your voice AI development workflow.