Preecha

Posted on Jun 15

What Is Microsoft VibeVoice? How to Use the Open-Source Voice AI Models

TL;DR

VibeVoice is Microsoft’s open-source voice AI family with three models: VibeVoice-1.5B for text-to-speech (up to 90 minutes, 4 speakers), VibeVoice-Realtime-0.5B for streaming TTS, and VibeVoice-ASR for speech recognition (60-minute audio, 50+ languages, 7.77% WER). All models are MIT-licensed and run locally. This guide covers installation, usage, and API integration.


Introduction

Microsoft released VibeVoice as an open-source voice AI framework in early 2026. It includes models for speech synthesis and speech recognition that can run locally without a cloud dependency.

The framework includes three models:

VibeVoice-1.5B: generates expressive, multi-speaker conversational audio from text scripts. It can synthesize up to 90 minutes of speech with up to 4 distinct speakers in a single pass.
VibeVoice-Realtime-0.5B: a lightweight streaming TTS model with ~300ms first-chunk latency.
VibeVoice-ASR: transcribes up to 60 minutes of continuous audio with speaker identification, timestamps, and structured output across 50+ languages.

The TTS models caused controversy after release. Microsoft temporarily disabled the main GitHub repository after discovering voice cloning misuse. The community forked the code, and Microsoft later re-enabled the repo with added safeguards: an audible AI disclaimer embedded in generated audio and imperceptible watermarking for provenance verification.

VibeVoice-ASR is available on Azure AI Foundry for cloud deployment. The TTS models remain research-focused with an MIT license.

This guide shows how to install VibeVoice, generate speech, run ASR, expose a local API, and test voice AI endpoints with Apidog.

How VibeVoice works: architecture overview

Tokenizers

VibeVoice’s key architectural feature is a pair of continuous speech tokenizers that operate at an ultra-low frame rate of 7.5 Hz. Most speech models process audio at 50-100 Hz, so this is roughly a 7-13x reduction in sequence length. The shorter sequences are what let the model handle long inputs, including up to 90 minutes of generated audio.

VibeVoice uses two tokenizers:

Acoustic Tokenizer: a sigma-VAE variant with ~340M parameters in a mirror-symmetric encoder-decoder. It downsamples 3,200x from 24kHz input audio.
Semantic Tokenizer: mirrors the acoustic tokenizer architecture but is trained with an ASR proxy task to capture linguistic meaning.
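
To make the frame-rate numbers concrete, here is a small arithmetic sketch. It assumes each tokenizer frame occupies one position in the sequence and ignores the text tokens that share the context window; the function names are illustrative and not part of VibeVoice.

```python
# Back-of-the-envelope arithmetic for the tokenizer frame rate.
# Assumption: each tokenizer frame occupies one position in the sequence.

SAMPLE_RATE_HZ = 24_000     # input audio sample rate (from the article)
DOWNSAMPLE_FACTOR = 3_200   # acoustic tokenizer downsampling (from the article)


def frame_rate_hz(sample_rate_hz: int, downsample: int) -> float:
    """Frames per second produced by the tokenizer."""
    return sample_rate_hz / downsample


def frames_for_minutes(minutes: float, rate_hz: float) -> int:
    """Number of frames needed to represent `minutes` of audio."""
    return round(minutes * 60 * rate_hz)


rate = frame_rate_hz(SAMPLE_RATE_HZ, DOWNSAMPLE_FACTOR)
print(rate)                           # 7.5
print(frames_for_minutes(90, rate))   # 40500
print(frames_for_minutes(90, 50.0))   # 270000
```

Under these assumptions, 90 minutes of audio is about 40,500 frames at 7.5 Hz, compared with 270,000 frames at 50 Hz.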

Next-token diffusion

The model combines an LLM backbone, Qwen2.5-1.5B, with a lightweight diffusion head of ~123M parameters.

The LLM handles text context and dialogue flow.
The diffusion head generates high-fidelity acoustic details using DDPM, or Denoising Diffusion Probabilistic Models, with Classifier-Free Guidance.

Total parameter count is about 3B, including tokenizers and diffusion head.
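
Classifier-Free Guidance is commonly implemented by blending a conditional and an unconditional prediction from the denoiser. The sketch below shows that standard formulation on plain Python lists. Whether VibeVoice's diffusion head uses exactly this form is an assumption, and `apply_cfg` is an illustrative name rather than a VibeVoice function.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]     # prediction conditioned on text/speaker context
uncond = [0.0, 0.0]   # prediction without conditioning

print(apply_cfg(cond, uncond, 1.0))   # [1.0, 2.0]
print(apply_cfg(cond, uncond, 1.5))   # [1.5, 3.0]
```

In this formulation a scale of 1.0 returns the conditional prediction unchanged, and larger values push the output further toward the conditioning signal.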

Training approach

VibeVoice uses curriculum learning. It progressively trains on longer sequences:

4K tokens
16K tokens
32K tokens
64K tokens

The pre-trained tokenizers stay frozen during this stage. Only the LLM and diffusion head parameters update. This helps the model learn long-form audio generation without losing short-form capabilities.

VibeVoice model specifications

Model	Parameters	Purpose	Max length	Languages	License
VibeVoice-1.5B	3B total	Text-to-speech	90 minutes	English, Chinese	MIT
VibeVoice-Realtime-0.5B	~0.5B	Streaming TTS	Long-form	English, Chinese	MIT
VibeVoice-ASR	~9B	Speech recognition	60 minutes	50+ languages	MIT

VibeVoice-1.5B TTS

Specification	Value
LLM base	Qwen2.5-1.5B
Context length	64K tokens
Max speakers	4 per generation (turn-based)
Audio output	24kHz WAV mono
Tensor type	BF16
Format	Safetensors
HuggingFace downloads	62,630/month
Community forks	12 fine-tuned variants

VibeVoice-ASR

Specification	Value
Architecture base	Qwen2.5
Parameters	~9B
Audio processing	Up to 60 minutes single pass
Frame rate	7.5 Hz
Average WER	7.77% across 8 English datasets
LibriSpeech Clean WER	2.20%
TED-LIUM WER	2.57%
Languages	50+
Output	Structured Who + When + What
Supported audio	WAV, FLAC, MP3 at 16kHz+

Installation and setup

Prerequisites

Before installing, make sure you have:

Python 3.8+
NVIDIA GPU with CUDA support
Minimum 7-8 GB VRAM for TTS models
Minimum 24 GB VRAM for ASR model, with A100/H100 recommended
32 GB RAM minimum, 64 GB recommended for ASR
CUDA 11.8+, with CUDA 12.0+ recommended
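
A quick way to check a machine against this list is a short script. The sketch below is one possible approach, not an official tool: the VRAM floors are taken from the list above, the function names are hypothetical, and GPU detection assumes PyTorch is installed (it reports no GPU otherwise).

```python
import sys

# VRAM floors taken from the prerequisites list above (in GB).
MIN_VRAM_GB = {"tts": 7, "asr": 24}


def check_environment(python_version, vram_gb, task):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < (3, 8):
        problems.append("Python 3.8+ is required")
    if vram_gb is None:
        problems.append("no CUDA GPU detected")
    elif vram_gb < MIN_VRAM_GB[task]:
        problems.append(
            f"{task} needs at least {MIN_VRAM_GB[task]} GB VRAM, found {vram_gb:.1f} GB"
        )
    return problems


def detect_vram_gb():
    """Total memory of GPU 0 in GB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    for task in ("tts", "asr"):
        issues = check_environment(sys.version_info, detect_vram_gb(), task)
        print(task, "OK" if not issues else issues)
```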

Install VibeVoice TTS

Clone the repository and install dependencies:

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

pip install -r requirements.txt

Models download automatically from HuggingFace on first run.

You can also pre-download the model:

from huggingface_hub import snapshot_download

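# Download every file in the model repo into a local folder.
# local_dir_use_symlinks is deprecated in recent huggingface_hub releases, so it is omitted.
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
)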

Install via the community pip package

pip install vibevoice

Install for ASR

VibeVoice-ASR uses a separate setup:

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

pip install -r requirements-asr.txt

You can also deploy VibeVoice-ASR through Azure AI Foundry for managed cloud inference.

Generate speech with VibeVoice-1.5B

Single-speaker generation

Create a text file named script.txt:

Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.

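Run inference with the file-based demo script (demo/inference_from_file.py in the upstream repository; adjust the path if your checkout differs):

python demo/inference_from_file.py \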
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5

The output is saved as a .wav file in the outputs/ directory.

Multi-speaker podcast generation

VibeVoice supports up to 4 speakers with consistent voice identities across the recording.

Create podcast_script.txt:

Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.

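Run the same file-based demo script (demo/inference_from_file.py in the upstream repository) with all three speaker names:

python demo/inference_from_file.py \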
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5

Use this mode for long-form generated content such as podcasts, tutorials, and scripted interviews.
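
Before spending GPU time on a long script, it can help to check the file locally. The sketch below assumes the `Name: text` line format shown in the examples above and the 4-speaker limit; `parse_script` and `speaker_names` are illustrative helpers, not part of VibeVoice.

```python
from typing import List, Tuple

MAX_SPEAKERS = 4  # limit stated in the article


def parse_script(text: str) -> List[Tuple[str, str]]:
    """Split 'Name: utterance' lines into (speaker, utterance) pairs."""
    turns = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # allow blank lines between turns
        speaker, sep, utterance = line.partition(":")
        if not sep or not speaker.strip() or not utterance.strip():
            raise ValueError(f"line {lineno}: expected 'Name: text'")
        turns.append((speaker.strip(), utterance.strip()))
    return turns


def speaker_names(turns: List[Tuple[str, str]]) -> List[str]:
    """Unique speakers in order of first appearance, enforcing the limit."""
    names = list(dict.fromkeys(speaker for speaker, _ in turns))
    if len(names) > MAX_SPEAKERS:
        raise ValueError(f"{len(names)} speakers found, at most {MAX_SPEAKERS} supported")
    return names


script = """Alice: Welcome back to the show.
Bob: Thanks for having me.
Carol: Happy to be here.
Alice: Let's start with the debate."""

turns = parse_script(script)
print(speaker_names(turns))  # ['Alice', 'Bob', 'Carol']
```

The returned names are in the order you would pass them to `--speaker_names`.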

Zero-shot voice cloning

You can clone a voice from a reference audio sample.

Reference audio requirements:

Format: WAV mono
Sample rate: 24,000 Hz
Duration: 30-60 seconds
Content: clear speech with minimal background noise

Convert existing audio with ffmpeg:

ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav
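
To confirm that the converted file meets the requirements listed above, you can inspect it with Python's standard `wave` module. This is a minimal sketch: `check_reference_wav` is a hypothetical helper, and `make_silent_wav` only exists to produce demonstration input.

```python
import io
import wave


def check_reference_wav(source) -> list:
    """Return a list of problems with a reference WAV; empty means it fits."""
    problems = []
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != 24_000:
        problems.append(f"expected 24000 Hz, found {rate} Hz")
    if not 30 <= duration <= 60:
        problems.append(f"expected 30-60 s, found {duration:.1f} s")
    return problems


def make_silent_wav(seconds: float, rate: int = 24_000, channels: int = 1) -> io.BytesIO:
    """Build an in-memory 16-bit silent WAV, used here only for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(seconds * rate))
    buffer.seek(0)
    return buffer


print(check_reference_wav(make_silent_wav(45)))  # []
print(check_reference_wav(make_silent_wav(5)))   # ['expected 30-60 s, found 5.0 s']
```

For a real file, pass its path instead of the in-memory buffer, for example `check_reference_wav("reference_voice.wav")`.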

Launch the Gradio demo:

python demo/gradio_demo.py

Open the local UI:

http://127.0.0.1:7860

Upload your reference audio, select the cloned voice, and generate speech.

Stream audio with VibeVoice-Realtime-0.5B

For low-latency output, use the realtime model:

python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice

Use the realtime model for interactive applications. Use VibeVoice-1.5B when output quality matters more than latency.
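
If first-chunk latency matters for your application, measure it on your own hardware. The helper below is a generic sketch that times any iterable of audio chunks; `fake_stream` is a stand-in for a real streaming call and is purely hypothetical.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Consume a chunk stream; return (seconds until first chunk, total bytes)."""
    start = time.perf_counter()
    first_chunk_s = None
    total = 0
    for chunk in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        total += len(chunk)
    if first_chunk_s is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_s, total


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Stand-in for a real streaming TTS call (hypothetical)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)   # simulate generation time per chunk
        yield b"\x00" * 480   # simulate a small block of audio bytes


latency, size = time_first_chunk(fake_stream())
print(f"first chunk after {latency * 1000:.0f} ms, {size} bytes total")
```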

Use VibeVoice from Python

Pipeline API

Use the HuggingFace pipeline for programmatic generation:

from transformers import pipeline
from huggingface_hub import snapshot_download

# Download the model, or reuse the copy already in the local HuggingFace cache.
model_path = snapshot_download("microsoft/VibeVoice-1.5B")

pipe = pipeline(
    "text-to-speech",
    model=model_path,
    no_processor=False
)

# One entry per turn; "role" carries the speaker name.
script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

# Render the turns into the input format the model expects.
input_data = pipe.processor.apply_chat_template(script)

# Guidance strength and number of diffusion denoising steps.
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}

output = pipe(input_data, generate_kwargs=generate_kwargs)

FastAPI wrapper for an OpenAI-compatible endpoint

The community built a FastAPI wrapper that exposes VibeVoice as an OpenAI-compatible TTS API:

git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI

docker compose up

Send a TTS request:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav

Because the endpoint follows the OpenAI TTS request format, a JSON body written for an OpenAI-compatible speech API works here once you point the request at the local base URL.
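
The same request can be sent from Python with only the standard library. This sketch mirrors the curl call above; the base URL, model name, and voice name come from that example, and the helper names are illustrative. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str = "alice",
                         model: str = "vibevoice-1.5b") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(base_url, text)
    with urllib.request.urlopen(request, timeout=300) as response:
        with open(out_path, "wb") as out_file:
            out_file.write(response.read())


request = build_speech_request("http://localhost:8000", "Hello from VibeVoice.")
print(request.get_method(), request.full_url)
# POST http://localhost:8000/v1/audio/speech
```

With the wrapper running, `synthesize("http://localhost:8000", "Hello from VibeVoice.", "speech.wav")` would save the audio to disk.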

Use VibeVoice-ASR for speech recognition

Basic transcription

Run ASR against an audio file:

python asr_inference.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_path meeting_recording.wav

Structured output format

VibeVoice-ASR returns structured transcription segments with:

Who: speaker identity, such as Speaker 1
When: start and end timestamps
What: transcribed text

Example output:

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
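
Once you have this JSON, post-processing is straightforward. The sketch below assumes the field names shown in the example output (`speaker`, `start`, `end`, `text`) and computes per-speaker talk time plus a readable transcript; both helper functions are illustrative.

```python
import json
from collections import defaultdict

raw = """
{"segments": [
  {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
   "text": "Let's review the API endpoints for the new release."},
  {"speaker": "Speaker 2", "start": 4.5, "end": 8.1,
   "text": "I've added three new endpoints for the billing module."}
]}
"""


def talk_time(segments):
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def to_transcript(segments):
    """Render segments as '[start-end] Speaker: text' lines."""
    return "\n".join(
        f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {seg["speaker"]}: {seg["text"]}'
        for seg in segments
    )


segments = json.loads(raw)["segments"]
for speaker, seconds in talk_time(segments).items():
    print(f"{speaker}: {seconds:.1f} s")
print(to_transcript(segments))
```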

Run ASR as an MCP server

VibeVoice-ASR can run as an MCP (Model Context Protocol) server. This lets AI coding tools such as Claude Code and Cursor use transcription as part of a workflow.

Install and run:

pip install vibevoice-mcp-server

vibevoice-mcp serve

Example workflow:

Record a meeting, feature discussion, or voice note.
Send the audio to the MCP server.
Let your coding agent consume the transcript.
Convert requirements or notes into implementation tasks.

VibeVoice-ASR vs Whisper

Use case	Best choice	Why
Long meetings, 30-60 min	VibeVoice-ASR	Single-pass 60-minute processing and speaker ID
Interviews with multiple speakers	VibeVoice-ASR	Built-in diarization
Podcasts needing timestamps	VibeVoice-ASR	Structured Who/When/What output
Multilingual content, 50+ languages	VibeVoice-ASR	50+ languages in the same single-pass pipeline
Short clips in noisy environments	Whisper	Better noise robustness
Edge/mobile deployment	Whisper	Smaller model size and wider device support
Specialized non-English languages	Whisper	More mature multilingual fine-tuning

Test voice AI APIs with Apidog

Whether you use the VibeVoice FastAPI wrapper, Azure AI Foundry, or your own API wrapper, Apidog helps you test the full request/response flow for voice AI endpoints.

Test a TTS endpoint

Create a new POST request in Apidog.

For a local FastAPI wrapper, use:

POST http://localhost:8000/v1/audio/speech

Set the request header:

Content-Type: application/json

Set the JSON body:

{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}

Then validate:

The response status is successful.
The response Content-Type is audio-compatible, such as audio/wav.
The response body can be saved as a WAV file.
The generated audio matches the input text and selected voice.
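
The first three checks can also be scripted, for example in a CI job. The sketch below is one way to express them in Python; `validate_tts_response` is a hypothetical helper, and it only inspects the RIFF/WAVE header rather than decoding the audio. The fourth check still needs a listener or a speech-recognition pass.

```python
def validate_tts_response(status: int, content_type: str, body: bytes) -> list:
    """Return a list of failed checks for a TTS HTTP response; empty means pass."""
    failures = []
    if not 200 <= status < 300:
        failures.append(f"unexpected status {status}")
    if not content_type.lower().startswith("audio/"):
        failures.append(f"unexpected Content-Type {content_type!r}")
    # A WAV file starts with a RIFF header: b"RIFF", a 4-byte size, then b"WAVE".
    if len(body) < 12 or body[:4] != b"RIFF" or body[8:12] != b"WAVE":
        failures.append("body is not a RIFF/WAVE file")
    return failures


fake_wav = b"RIFF" + (36).to_bytes(4, "little") + b"WAVE" + b"fmt "
print(validate_tts_response(200, "audio/wav", fake_wav))          # []
print(validate_tts_response(500, "application/json", b"{}"))
```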

Test an ASR endpoint

For speech-to-text APIs, use multipart/form-data.

Typical request fields:

file: meeting_recording.wav
model: vibevoice-asr

Validate the JSON response includes:

Speaker IDs
Start and end timestamps
Transcribed text
Segment ordering

Example response to check:

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    }
  ]
}
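
These checks can be written as a small validator. The sketch below assumes the response shape shown above and that segments are expected in ascending start-time order, which is an assumption rather than a documented guarantee; `validate_segments` is an illustrative name.

```python
REQUIRED_FIELDS = ("speaker", "start", "end", "text")


def validate_segments(payload: dict) -> list:
    """Return a list of contract violations for an ASR response; empty means pass."""
    segments = payload.get("segments")
    if not isinstance(segments, list) or not segments:
        return ["'segments' must be a non-empty list"]
    errors = []
    previous_start = 0.0
    for i, seg in enumerate(segments):
        missing = [key for key in REQUIRED_FIELDS if key not in seg]
        if missing:
            errors.append(f"segment {i}: missing {missing}")
            continue
        if seg["end"] < seg["start"]:
            errors.append(f"segment {i}: end before start")
        if seg["start"] < previous_start:
            errors.append(f"segment {i}: out of order")
        if not str(seg["text"]).strip():
            errors.append(f"segment {i}: empty text")
        previous_start = seg["start"]
    return errors


response = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
         "text": "Let's review the API endpoints for the new release."},
    ]
}
print(validate_segments(response))  # []
```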

Validate audio API contracts

Voice AI APIs often mix binary files and JSON metadata. Use Apidog to verify:

Binary file uploads for ASR endpoints
JSON body formatting for TTS endpoints
Response validation for structured transcripts
Environment variables for switching between local and cloud endpoints
Auth headers for Azure-hosted or custom deployments

This is useful before integrating the endpoint into an application, CI job, or backend service.

Safety and responsible use

Microsoft added several safeguards after the initial misuse incidents:

Audible AI disclaimer: generated audio includes an automatic “This segment was generated by AI” message.
Imperceptible watermarking: hidden markers enable third-party verification of VibeVoice-generated content.
Inference logging: hashed logs of inference requests are used to detect abuse patterns, with statistics reported in aggregate each quarter.
MIT license: permits commercial use, but Microsoft recommends against production deployment without further testing.

Allowed use cases

Research and academic use
Internal prototyping and testing
Podcast generation with proper AI disclosure
Accessibility applications, such as text-to-speech for visually impaired users

Avoid these use cases

Voice impersonation without explicit recorded consent
Deepfakes or presenting AI audio as genuine human recordings
Real-time voice conversion for live deepfake applications
Generating non-speech audio, such as music or sound effects

Limitations

TTS language support is narrow. VibeVoice-1.5B supports English and Chinese, and input in other languages can produce unintelligible output. VibeVoice-ASR has broader coverage at 50+ languages.

ASR hardware requirements are high. The ASR model needs 24 GB+ VRAM, with A100/H100-class GPUs recommended. The TTS models can run on consumer GPUs with 7-8 GB VRAM.

The TTS model does not handle overlapping speech. Dialogue is turn-based and does not model speakers talking over each other.

Both the TTS and ASR models inherit biases from their Qwen2.5 base. Outputs can contain unexpected, biased, or inaccurate content.

VibeVoice is research-grade software. Expect rough edges in edge cases, error handling, and non-English output.

Deploy VibeVoice-ASR on Azure AI Foundry

If you do not want to manage GPU infrastructure, deploy VibeVoice-ASR through Azure AI Foundry.

The managed endpoint handles:

Scaling
Model updates
Infrastructure maintenance
HTTPS API access

The API accepts audio files and returns structured transcriptions in the same Who/When/What format as the local model.

This is useful for workloads where you need uptime and operational consistency that self-hosted GPU inference may not provide. Check Azure AI Foundry’s model catalog for current pricing and deployment options.

To test an Azure-hosted VibeVoice endpoint before app integration:

Create an Apidog environment for Azure.
Add the endpoint URL as an environment variable.
Configure authentication headers.
Upload sample audio files.
Validate transcript structure and response latency.

Community and ecosystem

VibeVoice has an active community:

62,630+ monthly HuggingFace downloads for the 1.5B model
2,280+ likes on HuggingFace
79+ HuggingFace Spaces running the model
12 fine-tuned variants from the community
4 quantized versions for lower-VRAM deployment
Community fork at vibevoice-community/VibeVoice with active maintenance

Notable community projects:

VibeVoice-FastAPI: production REST API wrapper with Docker support
VibeVoice MCP Server: integration with AI coding tools through Model Context Protocol
Apple Silicon support: community scripts for M-series Mac inference
Quantized models: GGUF and other formats for reduced VRAM usage

FAQ

Is VibeVoice free to use?

Yes. All three models, VibeVoice-1.5B, VibeVoice-Realtime-0.5B, and VibeVoice-ASR, are MIT-licensed. You can use them for commercial and non-commercial purposes. Azure AI Foundry hosting has separate pricing for managed cloud inference.

Can VibeVoice run on Apple Silicon Macs?

The community has contributed scripts for M-series Mac inference. Check the HuggingFace discussions for the VibeVoice-1.5B model. Performance is slower than CUDA GPUs but functional.

How does VibeVoice compare to ElevenLabs?

VibeVoice runs locally with no API costs and no data leaving your machine. ElevenLabs offers higher quality, more voices, and easier setup, but requires a paid subscription and cloud processing.

For privacy-sensitive applications or offline use, VibeVoice is useful. For production quality and ease of use, ElevenLabs is ahead.

Why was the GitHub repository temporarily disabled?

Microsoft discovered people using voice cloning for impersonation and deepfakes. They disabled the repo, added safety features such as audible disclaimers and watermarking, and later re-enabled it. The community fork continued development during the downtime.

Can I fine-tune VibeVoice on custom voices?

Yes. The community has produced 12 fine-tuned variants on HuggingFace. You need voice samples, typically 30-60 seconds of clear WAV audio at 24kHz mono, plus GPU resources for training.

What audio formats does VibeVoice output?

VibeVoice outputs WAV at 24,000 Hz mono. You can convert the generated file to MP3, OGG, FLAC, or other formats with ffmpeg.

Example:

ffmpeg -i speech.wav speech.mp3

Can I use VibeVoice-ASR as a Whisper replacement?

For long-form audio with speaker identification, yes. VibeVoice-ASR handles 60-minute recordings in a single pass with built-in diarization.

Whisper still fits better for short noisy clips or edge deployment. It also has a more mature ecosystem for many speech-to-text workflows.

Does VibeVoice support real-time voice chat?

VibeVoice-Realtime-0.5B supports streaming text input with ~300ms first-chunk latency. It is usable for near-real-time applications but is not designed for full-duplex voice conversation.