TL;DR
VibeVoice is Microsoft’s open-source voice AI family with three models: VibeVoice-1.5B for text-to-speech (up to 90 minutes, 4 speakers), VibeVoice-Realtime-0.5B for streaming TTS, and VibeVoice-ASR for speech recognition (60-minute audio, 50+ languages, 7.77% WER). All models are MIT-licensed and run locally. This guide covers installation, usage, and API integration.
Introduction
Microsoft released VibeVoice as an open-source voice AI framework in early 2026. It includes models for speech synthesis and speech recognition that can run locally without a cloud dependency.
The framework includes three models:
- VibeVoice-1.5B: generates expressive, multi-speaker conversational audio from text scripts. It can synthesize up to 90 minutes of speech with 4 distinct speakers in a single pass.
- VibeVoice-Realtime-0.5B: a lightweight streaming TTS model with ~300ms first-chunk latency.
- VibeVoice-ASR: transcribes up to 60 minutes of continuous audio with speaker identification, timestamps, and structured output across 50+ languages.
The TTS models caused controversy after release. Microsoft temporarily disabled the main GitHub repository after discovering voice cloning misuse. The community forked the code, and Microsoft later re-enabled the repo with added safeguards: an audible AI disclaimer embedded in generated audio and imperceptible watermarking for provenance verification.
VibeVoice-ASR is available on Azure AI Foundry for cloud deployment. The TTS models remain research-focused with an MIT license.
This guide shows how to install VibeVoice, generate speech, run ASR, expose a local API, and test voice AI endpoints with Apidog.
How VibeVoice works: architecture overview
Tokenizers
VibeVoice’s key architectural feature is its pair of continuous speech tokenizers, which operate at an ultra-low frame rate of 7.5 Hz. Most speech models process audio at 50-100 Hz, so this is roughly a 7-13x reduction in sequence length. Shorter sequences are what let the model handle long inputs, including up to 90 minutes of generated audio.
VibeVoice uses two tokenizers:
- Acoustic Tokenizer: a sigma-VAE variant with ~340M parameters in a mirror-symmetric encoder-decoder. It downsamples 3,200x from 24kHz input audio.
- Semantic Tokenizer: mirrors the acoustic tokenizer architecture but is trained with an ASR proxy task to capture linguistic meaning.
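The frame-rate figures above follow from simple arithmetic. The sketch below is a worked check, assuming the 3,200x downsampling factor means one frame per 3,200 input samples; the helper name `frames_for` is illustrative and not part of VibeVoice.

```python
# Worked check of the tokenizer figures quoted above.
SAMPLE_RATE_HZ = 24_000    # input audio sample rate
DOWNSAMPLE_FACTOR = 3_200  # acoustic tokenizer downsampling factor

frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR


def frames_for(minutes: float, rate_hz: float) -> int:
    """Number of frames needed to cover `minutes` of audio at `rate_hz`."""
    return round(minutes * 60 * rate_hz)


print(frame_rate_hz)                  # 7.5
print(frames_for(90, frame_rate_hz))  # 40500
print(frames_for(90, 50))             # 270000
print(round(50 / frame_rate_hz, 1), round(100 / frame_rate_hz, 1))  # 6.7 13.3
```

At 7.5 Hz, 90 minutes of audio is about 40,500 frames, compared with 270,000 frames at 50 Hz.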
Next-token diffusion
The model combines an LLM backbone, Qwen2.5-1.5B, with a lightweight diffusion head of ~123M parameters.
- The LLM handles text context and dialogue flow.
- The diffusion head generates high-fidelity acoustic details using a Denoising Diffusion Probabilistic Model (DDPM) with Classifier-Free Guidance.
Total parameter count is about 3B, including tokenizers and diffusion head.
Training approach
VibeVoice uses curriculum learning. It progressively trains on longer sequences:
- 4K tokens
- 16K tokens
- 32K tokens
- 64K tokens
The pre-trained tokenizers stay frozen during this stage. Only the LLM and diffusion head parameters update. This helps the model learn long-form audio generation without losing short-form capabilities.
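To see why the later curriculum stages matter for long audio, the sketch below estimates an upper bound on how much audio fits in each stage. It assumes that "4K" means 4,096 tokens (and so on), that each 7.5 Hz audio frame occupies one token, and that text tokens are ignored. Real sequences also carry the script text, so actual capacity is lower.

```python
# Rough upper bound on audio duration per curriculum stage.
# Assumptions: one token per 7.5 Hz frame, no text tokens, "4K" = 4,096.
FRAME_RATE_HZ = 7.5
STAGES = {"4K": 4_096, "16K": 16_384, "32K": 32_768, "64K": 65_536}


def max_audio_minutes(context_tokens: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Minutes of audio that fit if every token in the context is an audio frame."""
    return context_tokens / frame_rate_hz / 60


for name, tokens in STAGES.items():
    print(f"{name}: up to ~{max_audio_minutes(tokens):.1f} min")
# 4K: up to ~9.1 min
# 16K: up to ~36.4 min
# 32K: up to ~72.8 min
# 64K: up to ~145.6 min
```

Under these assumptions, only the 64K stage leaves room for a 90-minute recording plus its script.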
VibeVoice model specifications
| Model | Parameters | Purpose | Max length | Languages | License |
|---|---|---|---|---|---|
| VibeVoice-1.5B | 3B total | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |
VibeVoice-1.5B TTS
| Specification | Value |
|---|---|
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24kHz WAV mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |
VibeVoice-ASR
| Specification | Value |
|---|---|
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 minutes single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% across 8 English datasets |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured Who + When + What |
| Supported audio | WAV, FLAC, MP3 at 16kHz+ |
Installation and setup
Prerequisites
Before installing, make sure you have:
- Python 3.8+
- NVIDIA GPU with CUDA support
- Minimum 7-8 GB VRAM for TTS models
- Minimum 24 GB VRAM for ASR model, with A100/H100 recommended
- 32 GB RAM minimum, 64 GB recommended for ASR
- CUDA 11.8+, with CUDA 12.0+ recommended
Install VibeVoice TTS
Clone the repository and install dependencies:
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements.txt
Models download automatically from HuggingFace on first run.
You can also pre-download the model:
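from huggingface_hub import snapshot_download

# Download the model weights into a local directory before the first run.
# (local_dir_use_symlinks is deprecated in recent huggingface_hub releases, so it is omitted.)
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
)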
Install via the community pip package
pip install vibevoice
Install for ASR
VibeVoice-ASR uses a separate setup:
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements-asr.txt
You can also deploy VibeVoice-ASR through Azure AI Foundry for managed cloud inference.
Generate speech with VibeVoice-1.5B
Single-speaker generation
Create a text file named script.txt:
Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.
Run inference:
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5
The output is saved as a .wav file in the outputs/ directory.
Multi-speaker podcast generation
VibeVoice supports up to 4 speakers with consistent voice identities across the recording.
Create podcast_script.txt:
Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.
Run:
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5
Use this mode for long-form generated content such as podcasts, tutorials, and scripted interviews.
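Before a long generation run, it can help to check the script first. The sketch below assumes the `Name: text` line format used in the examples above; `parse_script` and `speaker_names` are hypothetical helpers, not part of the VibeVoice codebase.

```python
from typing import List, Tuple

MAX_SPEAKERS = 4  # speaker limit stated for VibeVoice-1.5B


def parse_script(text: str) -> List[Tuple[str, str]]:
    """Split 'Name: utterance' lines into (speaker, utterance) pairs."""
    turns = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines between turns are skipped
        speaker, sep, utterance = line.partition(":")
        if not sep or not speaker.strip() or not utterance.strip():
            raise ValueError(f"line {lineno}: expected 'Name: text', got {raw!r}")
        turns.append((speaker.strip(), utterance.strip()))
    return turns


def speaker_names(turns: List[Tuple[str, str]]) -> List[str]:
    """Return distinct speakers in order of first appearance, enforcing the limit."""
    names = list(dict.fromkeys(speaker for speaker, _ in turns))
    if len(names) > MAX_SPEAKERS:
        raise ValueError(f"{len(names)} speakers found; the limit is {MAX_SPEAKERS}")
    return names


script = """\
Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me.
Carol: Happy to be here.
Alice: Let's start with the debate everyone wants to hear.
"""
turns = parse_script(script)
print(speaker_names(turns))  # ['Alice', 'Bob', 'Carol']
```

The resulting list matches what you would pass to `--speaker_names`, so a mismatch between the script and the command line shows up before any GPU time is spent.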
Zero-shot voice cloning
You can clone a voice from a reference audio sample.
Reference audio requirements:
- Format: WAV mono
- Sample rate: 24,000 Hz
- Duration: 30-60 seconds
- Content: clear speech with minimal background noise
Convert existing audio with ffmpeg:
ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav
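After converting, you can confirm the file meets the requirements listed above. This is a minimal sketch using Python's standard `wave` module; the function name `check_reference_audio` is illustrative, and the thresholds are the ones stated in this guide.

```python
import wave


def check_reference_audio(path, rate=24_000, min_s=30.0, max_s=60.0):
    """Return a list of problems; an empty list means the file matches the requirements."""
    # wave.open raises wave.Error for files that are not uncompressed PCM WAV.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_rate != rate:
        problems.append(f"expected {rate} Hz, found {sample_rate} Hz")
    if not min_s <= duration <= max_s:
        problems.append(f"expected {min_s:.0f}-{max_s:.0f} s, found {duration:.1f} s")
    return problems
```

Calling `check_reference_audio("reference_voice.wav")` on the ffmpeg output should return an empty list if the source recording was 30-60 seconds long. The check covers format only, not speech clarity or background noise.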
Launch the Gradio demo:
python demo/gradio_demo.py
Open the local UI:
http://127.0.0.1:7860
Upload your reference audio, select the cloned voice, and generate speech.
Stream audio with VibeVoice-Realtime-0.5B
For low-latency output, use the realtime model:
python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice
Use the realtime model for interactive applications. Use VibeVoice-1.5B when output quality matters more than latency.
Use VibeVoice from Python
Pipeline API
Use the HuggingFace pipeline for programmatic generation:
from transformers import pipeline
from huggingface_hub import snapshot_download

model_path = snapshot_download("microsoft/VibeVoice-1.5B")
pipe = pipeline(
    "text-to-speech",
    model=model_path,
    no_processor=False
)

script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]
input_data = pipe.processor.apply_chat_template(script)
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}
output = pipe(input_data, generate_kwargs=generate_kwargs)
FastAPI wrapper for an OpenAI-compatible endpoint
The community built a FastAPI wrapper that exposes VibeVoice as an OpenAI-compatible TTS API:
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
docker compose up
Send a TTS request:
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav
Because the endpoint follows the OpenAI TTS request format, you can test it with the same JSON body you would use for OpenAI-compatible speech APIs.
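The same request can be sent from Python with only the standard library. The sketch below mirrors the curl call above, so the URL, model name, and voice name are the values used in this guide and depend on how the wrapper is configured. `build_speech_request` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(text, voice="alice", model="vibevoice-1.5b",
                         base_url="http://localhost:8000"):
    """Build the POST request for the wrapper's /v1/audio/speech endpoint."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav", timeout=120, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the Docker container running, `synthesize("Hello from VibeVoice.")` would write `speech.wav` to the current directory. Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU.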
Use VibeVoice-ASR for speech recognition
Basic transcription
Run ASR against an audio file:
python asr_inference.py \
--model_path microsoft/VibeVoice-ASR \
--audio_path meeting_recording.wav
Structured output format
VibeVoice-ASR returns structured transcription segments with:
- Who: speaker identity, such as Speaker 1
- When: start and end timestamps
- What: transcribed text
Example output:
{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
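Structured output like this is straightforward to post-process. The sketch below assumes the JSON shape shown in the example above (a `segments` list with `speaker`, `start`, `end`, and `text` keys). It produces a readable transcript and per-speaker talk time; the helper names are illustrative.

```python
from collections import defaultdict


def to_transcript(payload):
    """Render segments as '[start-end] Speaker: text' lines."""
    return "\n".join(
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"
        for seg in payload["segments"]
    )


def speaking_time(payload):
    """Total seconds spoken per speaker."""
    totals = defaultdict(float)
    for seg in payload["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


payload = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
         "text": "Let's review the API endpoints for the new release."},
        {"speaker": "Speaker 2", "start": 4.5, "end": 8.1,
         "text": "I've added three new endpoints for the billing module."},
    ]
}

print(to_transcript(payload))
for speaker, seconds in speaking_time(payload).items():
    print(f"{speaker}: {seconds:.1f} s")
```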
Run ASR as an MCP server
VibeVoice-ASR can run as a Model Context Protocol (MCP) server. This lets AI coding tools such as Claude Code and Cursor use transcription as part of a workflow.
Install and run:
pip install vibevoice-mcp-server
vibevoice-mcp serve
Example workflow:
- Record a meeting, feature discussion, or voice note.
- Send the audio to the MCP server.
- Let your coding agent consume the transcript.
- Convert requirements or notes into implementation tasks.
VibeVoice-ASR vs Whisper
| Use case | Best choice | Why |
|---|---|---|
| Long meetings, 30-60 min | VibeVoice-ASR | Single-pass 60-minute processing and speaker ID |
| Interviews with multiple speakers | VibeVoice-ASR | Built-in diarization |
| Podcasts needing timestamps | VibeVoice-ASR | Structured Who/When/What output |
| Multilingual content, 50+ languages | VibeVoice-ASR | Broader language support |
| Short clips in noisy environments | Whisper | Better noise robustness |
| Edge/mobile deployment | Whisper | Smaller model size and wider device support |
| Specialized non-English languages | Whisper | More mature multilingual fine-tuning |
Test voice AI APIs with Apidog
Whether you use the VibeVoice FastAPI wrapper, Azure AI Foundry, or your own API wrapper, Apidog helps you test the full request/response flow for voice AI endpoints.
Test a TTS endpoint
Create a new POST request in Apidog.
For a local FastAPI wrapper, use:
POST http://localhost:8000/v1/audio/speech
Set the request header:
Content-Type: application/json
Set the JSON body:
{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}
Then validate:
- The response status is successful.
- The response Content-Type is audio-compatible, such as audio/wav.
- The response body can be saved as a WAV file.
- The generated audio matches the input text and selected voice.
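The first three checks above can also be scripted, for example in a CI job. This sketch uses Python's standard `wave` module to parse the response body in memory. The function name `inspect_wav_response` is hypothetical, and it assumes the server returns a complete PCM WAV file rather than a streamed one with placeholder header sizes.

```python
import io
import wave


def inspect_wav_response(content_type: str, body: bytes) -> dict:
    """Verify the response looks like WAV audio and return its basic properties."""
    if not content_type.lower().startswith("audio/"):
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    # wave.open raises wave.Error if the bytes are not a valid PCM WAV file.
    with wave.open(io.BytesIO(body), "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": sample_rate,
            "duration_s": wav.getnframes() / sample_rate,
        }
```

For VibeVoice output you would expect `channels` to be 1 and `sample_rate` to be 24000. Whether the audio matches the input text and selected voice still needs a listening check.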
Test an ASR endpoint
For speech-to-text APIs, use multipart/form-data.
Typical request fields:
file: meeting_recording.wav
model: vibevoice-asr
Validate the JSON response includes:
- Speaker IDs
- Start and end timestamps
- Transcribed text
- Segment ordering
Example checks:
{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    }
  ]
}
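A contract check for this response can be written as a small validator. The sketch below assumes the segment shape shown above and treats "segment ordering" as non-decreasing start times. The field names come from the example, and `validate_segments` is an illustrative helper.

```python
REQUIRED_FIELDS = {
    "speaker": str,
    "start": (int, float),
    "end": (int, float),
    "text": str,
}


def validate_segments(payload: dict) -> list:
    """Return a list of contract violations; an empty list means the payload passes."""
    segments = payload.get("segments")
    if not isinstance(segments, list):
        return ["'segments' must be a list"]

    errors = []
    previous_start = float("-inf")
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            errors.append(f"segment {i}: must be an object")
            continue
        for key, expected in REQUIRED_FIELDS.items():
            if not isinstance(seg.get(key), expected):
                errors.append(f"segment {i}: missing or invalid '{key}'")
        start, end = seg.get("start"), seg.get("end")
        if isinstance(start, (int, float)) and isinstance(end, (int, float)):
            if end < start:
                errors.append(f"segment {i}: end is before start")
            if start < previous_start:
                errors.append(f"segment {i}: out of order")
            previous_start = start
    return errors
```

An empty result means every segment has a speaker ID, timestamps, and text, and that segments appear in chronological order.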
Validate audio API contracts
Voice AI APIs often mix binary files and JSON metadata. Use Apidog to verify:
- Binary file uploads for ASR endpoints
- JSON body formatting for TTS endpoints
- Response validation for structured transcripts
- Environment variables for switching between local and cloud endpoints
- Auth headers for Azure-hosted or custom deployments
This is useful before integrating the endpoint into an application, CI job, or backend service.
Safety and responsible use
Microsoft added several safeguards after the initial misuse incidents:
- Audible AI disclaimer: generated audio includes an automatic “This segment was generated by AI” message.
- Imperceptible watermarking: hidden markers enable third-party verification of VibeVoice-generated content.
- Inference logging: hashed logs of inference requests help detect abuse patterns, with aggregated statistics published quarterly.
- MIT license: permits commercial use, but Microsoft recommends against production deployment without further testing.
Allowed use cases
- Research and academic use
- Internal prototyping and testing
- Podcast generation with proper AI disclosure
- Accessibility applications, such as text-to-speech for visually impaired users
Avoid these use cases
- Voice impersonation without explicit recorded consent
- Deepfakes or presenting AI audio as genuine human recordings
- Real-time voice conversion for live deepfake applications
- Generating non-speech audio, such as music or sound effects
Limitations
TTS language support is narrow. VibeVoice-1.5B supports only English and Chinese, and text in other languages can produce unintelligible output. VibeVoice-ASR has broader coverage at 50+ languages.
ASR hardware requirements are high. The ASR model needs 24 GB+ VRAM, with A100/H100-class GPUs recommended. The TTS models can run on consumer GPUs with 7-8 GB VRAM.
The TTS model does not handle overlapping speech. Dialogue is turn-based and does not model speakers talking over each other.
The TTS and ASR models both inherit biases from their Qwen2.5 base models. Outputs can contain unexpected, biased, or inaccurate content.
VibeVoice is research-grade software. Expect rough edges in edge cases, error handling, and non-English output.
Deploy VibeVoice-ASR on Azure AI Foundry
If you do not want to manage GPU infrastructure, deploy VibeVoice-ASR through Azure AI Foundry.
The managed endpoint handles:
- Scaling
- Model updates
- Infrastructure maintenance
- HTTPS API access
The API accepts audio files and returns structured transcriptions in the same Who/When/What format as the local model.
This is useful for workloads where you need uptime and operational consistency that self-hosted GPU inference may not provide. Check Azure AI Foundry’s model catalog for current pricing and deployment options.
To test an Azure-hosted VibeVoice endpoint before app integration:
- Create an Apidog environment for Azure.
- Add the endpoint URL as an environment variable.
- Configure authentication headers.
- Upload sample audio files.
- Validate transcript structure and response latency.
Community and ecosystem
VibeVoice has an active community:
- 62,630+ monthly HuggingFace downloads for the 1.5B model
- 2,280+ likes on HuggingFace
- 79+ HuggingFace Spaces running the model
- 12 fine-tuned variants from the community
- 4 quantized versions for lower-VRAM deployment
- Community fork at vibevoice-community/VibeVoice with active maintenance
Notable community projects:
- VibeVoice-FastAPI: production REST API wrapper with Docker support
- VibeVoice MCP Server: integration with AI coding tools through Model Context Protocol
- Apple Silicon support: community scripts for M-series Mac inference
- Quantized models: GGUF and other formats for reduced VRAM usage
FAQ
Is VibeVoice free to use?
Yes. All three models, VibeVoice-1.5B, VibeVoice-Realtime-0.5B, and VibeVoice-ASR, are MIT-licensed. You can use them for commercial and non-commercial purposes. Azure AI Foundry hosting has separate pricing for managed cloud inference.
Can VibeVoice run on Apple Silicon Macs?
The community has contributed scripts for M-series Mac inference. Check the HuggingFace discussions for the VibeVoice-1.5B model. Performance is slower than CUDA GPUs but functional.
How does VibeVoice compare to ElevenLabs?
VibeVoice runs locally with no API costs and no data leaving your machine. ElevenLabs offers higher quality, more voices, and easier setup, but requires a paid subscription and cloud processing.
For privacy-sensitive applications or offline use, VibeVoice is useful. For production quality and ease of use, ElevenLabs is ahead.
Why was the GitHub repository temporarily disabled?
Microsoft discovered people using voice cloning for impersonation and deepfakes. They disabled the repo, added safety features such as audible disclaimers and watermarking, and later re-enabled it. The community fork continued development during the downtime.
Can I fine-tune VibeVoice on custom voices?
Yes. The community has produced 12 fine-tuned variants on HuggingFace. You need voice samples, typically 30-60 seconds of clear WAV audio at 24kHz mono, plus GPU resources for training.
What audio formats does VibeVoice output?
VibeVoice outputs WAV at 24,000 Hz mono. You can convert the generated file to MP3, OGG, FLAC, or other formats with ffmpeg.
Example:
ffmpeg -i speech.wav speech.mp3
Can I use VibeVoice-ASR as a Whisper replacement?
For long-form audio with speaker identification, yes. VibeVoice-ASR handles 60-minute recordings in a single pass with built-in diarization.
Whisper still fits better for short noisy clips or edge deployment. It also has a more mature ecosystem for many speech-to-text workflows.
Does VibeVoice support real-time voice chat?
VibeVoice-Realtime-0.5B supports streaming text input with ~300ms first-chunk latency. It is usable for near-real-time applications but is not designed for full-duplex voice conversation.





