
Preecha

What Is Microsoft VibeVoice? How to Use the Open-Source Voice AI Models

TL;DR

VibeVoice is Microsoft’s open-source voice AI family with three models: VibeVoice-1.5B for text-to-speech (up to 90 minutes, 4 speakers), VibeVoice-Realtime-0.5B for streaming TTS, and VibeVoice-ASR for speech recognition (60-minute audio, 50+ languages, 7.77% WER). All models are MIT-licensed and run locally. This guide covers installation, usage, and API integration.


Introduction

Microsoft first released VibeVoice as an open-source voice AI framework in August 2025 and has since expanded it with additional models. It includes models for speech synthesis and speech recognition that can run locally without a cloud dependency.


The framework includes three models:

  • VibeVoice-1.5B: generates expressive, multi-speaker conversational audio from text scripts. It can synthesize up to 90 minutes of speech with 4 distinct speakers in a single pass.
  • VibeVoice-Realtime-0.5B: a lightweight streaming TTS model with ~300ms first-chunk latency.
  • VibeVoice-ASR: transcribes up to 60 minutes of continuous audio with speaker identification, timestamps, and structured output across 50+ languages.


The TTS models caused controversy after release. Microsoft temporarily disabled the main GitHub repository after discovering voice cloning misuse. The community forked the code, and Microsoft later re-enabled the repo with added safeguards: an audible AI disclaimer embedded in generated audio and imperceptible watermarking for provenance verification.

VibeVoice-ASR is available on Azure AI Foundry for cloud deployment. The TTS models remain research-focused with an MIT license.

This guide shows how to install VibeVoice, generate speech, run ASR, expose a local API, and test voice AI endpoints with Apidog.

How VibeVoice works: architecture overview

Tokenizers

VibeVoice’s key architectural feature is its continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. Most speech models process audio at 50-100 Hz. This 7-13x reduction helps the model handle long sequences, including up to 90 minutes of generated audio.


VibeVoice uses two tokenizers:

  • Acoustic Tokenizer: a sigma-VAE variant with ~340M parameters in a mirror-symmetric encoder-decoder. It downsamples 3,200x from 24kHz input audio.
  • Semantic Tokenizer: mirrors the acoustic tokenizer architecture but is trained with an ASR proxy task to capture linguistic meaning.
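The frame-rate figures above can be checked with a few lines of arithmetic. This is a sketch that uses only numbers quoted in this section (24kHz input, 3,200x downsampling, 90 minutes, a 50 Hz comparison point). The variable and function names are illustrative, and the frame counts ignore the text tokens that share the context window.

```python
# Figures quoted in this article; all names here are illustrative.
SAMPLE_RATE_HZ = 24_000     # input audio sample rate
DOWNSAMPLE_FACTOR = 3_200   # acoustic tokenizer downsampling

# 24,000 samples/s divided by 3,200 gives the tokenizer frame rate.
frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR


def frames_for(minutes: float, rate_hz: float = frame_rate_hz) -> int:
    """Number of frames needed to represent `minutes` of audio."""
    return int(minutes * 60 * rate_hz)


print(frame_rate_hz)          # 7.5
print(frames_for(90))         # 40500 frames at 7.5 Hz
print(frames_for(90, 50.0))   # 270000 frames at a 50 Hz tokenizer
```

At 7.5 Hz a 90-minute recording needs about 40,500 frames, compared with 270,000 at 50 Hz, which is why the long-form target becomes tractable for an LLM-sized context.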

Next-token diffusion

The model combines an LLM backbone, Qwen2.5-1.5B, with a lightweight diffusion head of ~123M parameters.

  • The LLM handles text context and dialogue flow.
  • The diffusion head generates high-fidelity acoustic details using DDPM, or Denoising Diffusion Probabilistic Models, with Classifier-Free Guidance.

Total parameter count is about 3B, including tokenizers and diffusion head.
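To make the Classifier-Free Guidance step concrete, here is a minimal sketch of the standard CFG blend applied to two noise predictions. It is the textbook formula, not code from the VibeVoice repository, and it assumes the `cfg_scale` value used elsewhere in this guide plays the role of the guidance weight. The function name `apply_cfg` and the sample vectors are hypothetical.

```python
def apply_cfg(cond, uncond, cfg_scale):
    """Standard CFG blend: move from the unconditional prediction toward
    the conditional one, scaled by cfg_scale (1.0 means no extra guidance)."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


# Toy two-dimensional "noise predictions" for illustration only.
cond = [0.5, -0.25]    # prediction given the text/dialogue context
uncond = [0.25, 0.0]   # prediction with the context dropped

print(apply_cfg(cond, uncond, 1.0))  # [0.5, -0.25], identical to cond
print(apply_cfg(cond, uncond, 1.5))  # [0.625, -0.375], pushed past cond
```

Values above 1.0 push the output further toward what the conditioning asks for, typically at some cost in diversity.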

Training approach

VibeVoice uses curriculum learning. It progressively trains on longer sequences:

  1. 4K tokens
  2. 16K tokens
  3. 32K tokens
  4. 64K tokens

The pre-trained tokenizers stay frozen during this stage. Only the LLM and diffusion head parameters update. This helps the model learn long-form audio generation without losing short-form capabilities.
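As a rough illustration of what each curriculum stage can hold, the sketch below converts the context lengths above into an upper bound on audio duration at 7.5 Hz. It assumes "4K" means 4 x 1,024 tokens and that every token is an audio frame; real sequences also contain text tokens, so actual capacity is lower.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in this article

# Assumption: "4K" means 4 * 1024 tokens, and likewise for the other stages.
STAGES = {"4K": 4 * 1024, "16K": 16 * 1024, "32K": 32 * 1024, "64K": 64 * 1024}


def max_audio_minutes(context_tokens: int) -> float:
    """Upper bound on audio length if every token were an audio frame."""
    return round(context_tokens / FRAME_RATE_HZ / 60, 1)


for label, tokens in STAGES.items():
    print(f"{label}: up to ~{max_audio_minutes(tokens)} min")
# 4K: up to ~9.1 min
# 16K: up to ~36.4 min
# 32K: up to ~72.8 min
# 64K: up to ~145.6 min
```

Under these assumptions, only the final 64K stage has room for a 90-minute recording plus its script.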

VibeVoice model specifications

| Model | Parameters | Purpose | Max length | Languages | License |
| --- | --- | --- | --- | --- | --- |
| VibeVoice-1.5B | 3B total | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |

VibeVoice-1.5B TTS

| Specification | Value |
| --- | --- |
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24kHz WAV mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |

VibeVoice-ASR

| Specification | Value |
| --- | --- |
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 minutes single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% across 8 English datasets |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured Who + When + What |
| Supported audio | WAV, FLAC, MP3 at 16kHz+ |

Installation and setup

Prerequisites

Before installing, make sure you have:

  • Python 3.8+
  • NVIDIA GPU with CUDA support
  • Minimum 7-8 GB VRAM for TTS models
  • Minimum 24 GB VRAM for ASR model, with A100/H100 recommended
  • 32 GB RAM minimum, 64 GB recommended for ASR
  • CUDA 11.8+, with CUDA 12.0+ recommended
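A quick way to check a machine against this list is a small preflight script. The sketch below uses only the thresholds quoted above and the standard `nvidia-smi --query-gpu=memory.total` query. The function names are hypothetical, and VRAM is rounded to the nearest GB because cards usually report slightly less than their nominal size.

```python
import shutil
import subprocess
import sys
from typing import List

# Minimum VRAM in GB, taken from the prerequisites above.
MIN_VRAM_GB = {"tts": 7, "asr": 24}


def python_ok(version=sys.version_info) -> bool:
    """True if the interpreter is Python 3.8 or newer."""
    return tuple(version[:2]) >= (3, 8)


def vram_ok(total_mib: int, model: str) -> bool:
    """Compare reported VRAM (MiB) with the threshold for 'tts' or 'asr'.
    Rounds to whole GB so a nominal 24 GB card reporting 24564 MiB passes."""
    return round(total_mib / 1024) >= MIN_VRAM_GB[model]


def gpu_memory_mib() -> List[int]:
    """Total memory per GPU in MiB via nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


if __name__ == "__main__":
    print("Python 3.8+:", python_ok())
    gpus = gpu_memory_mib()
    if not gpus:
        print("No NVIDIA GPU detected (nvidia-smi not found or returned nothing).")
    for index, mib in enumerate(gpus):
        print(f"GPU {index}: {mib} MiB | TTS ok: {vram_ok(mib, 'tts')} | ASR ok: {vram_ok(mib, 'asr')}")
```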

Install VibeVoice TTS

Clone the repository and install dependencies:

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

pip install -r requirements.txt

Models download automatically from HuggingFace on first run.

You can also pre-download the model:

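from huggingface_hub import snapshot_download

# Download all model files into a local folder instead of the default cache.
# local_dir_use_symlinks is omitted: recent huggingface_hub releases
# deprecate it and write real files to local_dir by default.
snapshot_download(
    repo_id="microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
)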

Install via the community pip package

pip install vibevoice

Install for ASR

VibeVoice-ASR uses a separate setup:

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

pip install -r requirements-asr.txt

You can also deploy VibeVoice-ASR through Azure AI Foundry for managed cloud inference.

Generate speech with VibeVoice-1.5B

Single-speaker generation

Create a text file named script.txt:

Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.

Run inference:

python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5

The output is saved as a .wav file in the outputs/ directory.

Multi-speaker podcast generation

VibeVoice supports up to 4 speakers with consistent voice identities across the recording.

Create podcast_script.txt:

Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.

Run:

python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5

Use this mode for long-form generated content such as podcasts, tutorials, and scripted interviews.
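Before starting a long generation run, it can help to check the script file first. The sketch below parses the `Name: text` format used in the examples above, lists the speakers in order of first appearance, and enforces the 4-speaker limit. The helper names are hypothetical and this is not part of the VibeVoice tooling.

```python
import re

MAX_SPEAKERS = 4  # limit stated in this article
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+)$")


def parse_script(text):
    """Return (speaker, utterance) turns from 'Name: text' lines."""
    turns = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are allowed between turns
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {number}: expected 'Speaker: text'")
        turns.append((match.group(1), match.group(2).strip()))
    return turns


def speakers_in(turns):
    """Unique speaker names in order of first appearance."""
    names = list(dict.fromkeys(name for name, _ in turns))
    if len(names) > MAX_SPEAKERS:
        raise ValueError(f"{len(names)} speakers found, maximum is {MAX_SPEAKERS}")
    return names


sample = """Alice: Welcome back to the show.
Bob: Thanks for having me.
Carol: Happy to be here.
Alice: Let's start with the debate."""

turns = parse_script(sample)
print(len(turns))                        # 4
print(" ".join(speakers_in(turns)))      # Alice Bob Carol
```

The last line prints the names in the order the `--speaker_names` argument expects in the commands above.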

Zero-shot voice cloning

You can clone a voice from a reference audio sample.

Reference audio requirements:

  • Format: WAV mono
  • Sample rate: 24,000 Hz
  • Duration: 30-60 seconds
  • Content: clear speech with minimal background noise
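The first three requirements can be checked automatically. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only. The function name `check_reference` is hypothetical, and the noise requirement still needs a listening check.

```python
import wave

REQUIRED_RATE_HZ = 24_000
MIN_SECONDS, MAX_SECONDS = 30, 60


def check_reference(path):
    """Return a list of problems; an empty list means the file fits the spec."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != REQUIRED_RATE_HZ:
        problems.append(f"expected {REQUIRED_RATE_HZ} Hz, found {rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"expected {MIN_SECONDS}-{MAX_SECONDS} s, found {duration:.1f} s")
    return problems


# Usage: an empty list means the reference file is ready to upload.
# print(check_reference("reference_voice.wav"))
```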

Convert existing audio with ffmpeg:

ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav

Launch the Gradio demo:

python demo/gradio_demo.py

Open the local UI:

http://127.0.0.1:7860

Upload your reference audio, select the cloned voice, and generate speech.

Stream audio with VibeVoice-Realtime-0.5B

For low-latency output, use the realtime model:

python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice

Use the realtime model for interactive applications. Use VibeVoice-1.5B when output quality matters more than latency.

Use VibeVoice from Python

Pipeline API

Use the HuggingFace pipeline for programmatic generation:

from transformers import pipeline
from huggingface_hub import snapshot_download

model_path = snapshot_download("microsoft/VibeVoice-1.5B")

pipe = pipeline(
    "text-to-speech",
    model=model_path,
    no_processor=False
)

script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

input_data = pipe.processor.apply_chat_template(script)

generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}

output = pipe(input_data, generate_kwargs=generate_kwargs)

FastAPI wrapper for an OpenAI-compatible endpoint

The community built a FastAPI wrapper that exposes VibeVoice as an OpenAI-compatible TTS API:

git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI

docker compose up

Send a TTS request:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav

Because the endpoint follows the OpenAI TTS request format, you can test it with the same JSON body you would use for OpenAI-compatible speech APIs.
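The same request can be sent from Python with only the standard library. This sketch assumes the wrapper is running at the `localhost:8000` address used in the curl example and accepts the same three fields. The function names are illustrative, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local wrapper address from the curl example


def build_speech_request(text, voice="alice", model="vibevoice-1.5b"):
    """Build a POST request mirroring the curl example above."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.wav"):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text), timeout=300) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path


# Usage (requires the wrapper to be running):
# synthesize("Your API documentation should be a conversation, not a monologue.")
```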

Use VibeVoice-ASR for speech recognition

Basic transcription

Run ASR against an audio file:

python asr_inference.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_path meeting_recording.wav

Structured output format

VibeVoice-ASR returns structured transcription segments with:

  • Who: speaker identity, such as Speaker 1
  • When: start and end timestamps
  • What: transcribed text

Example output:

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
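Because the output is plain JSON, downstream processing is straightforward. As a sketch, the snippet below takes a transcript shaped like the example above and totals speaking time per speaker. It assumes the `segments`, `speaker`, `start`, and `end` keys shown in this article, and the function name is hypothetical.

```python
from collections import defaultdict


def talk_time(transcript):
    """Total seconds spoken per speaker, rounded to 2 decimals."""
    totals = defaultdict(float)
    for seg in transcript["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


transcript = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
         "text": "Let's review the API endpoints for the new release."},
        {"speaker": "Speaker 2", "start": 4.5, "end": 8.1,
         "text": "I've added three new endpoints for the billing module."},
    ]
}

print(talk_time(transcript))  # {'Speaker 1': 4.2, 'Speaker 2': 3.6}
```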

Run ASR as an MCP server

VibeVoice-ASR can run as an MCP, or Model Context Protocol, server. This lets tools such as Claude Code, Cursor, and other AI coding tools use transcription as part of a workflow.

Install and run:

pip install vibevoice-mcp-server

vibevoice-mcp serve

Example workflow:

  1. Record a meeting, feature discussion, or voice note.
  2. Send the audio to the MCP server.
  3. Let your coding agent consume the transcript.
  4. Convert requirements or notes into implementation tasks.

VibeVoice-ASR vs Whisper

| Use case | Best choice | Why |
| --- | --- | --- |
| Long meetings, 30-60 min | VibeVoice-ASR | Single-pass 60-minute processing and speaker ID |
| Interviews with multiple speakers | VibeVoice-ASR | Built-in diarization |
| Podcasts needing timestamps | VibeVoice-ASR | Structured Who/When/What output |
| Multilingual content, 50+ languages | VibeVoice-ASR | Broader language support |
| Short clips in noisy environments | Whisper | Better noise robustness |
| Edge/mobile deployment | Whisper | Smaller model size and wider device support |
| Specialized non-English languages | Whisper | More mature multilingual fine-tuning |

Test voice AI APIs with Apidog

Whether you use the VibeVoice FastAPI wrapper, Azure AI Foundry, or your own API wrapper, Apidog helps you test the full request/response flow for voice AI endpoints.


Test a TTS endpoint

Create a new POST request in Apidog.

For a local FastAPI wrapper, use:

POST http://localhost:8000/v1/audio/speech

Set the request header:

Content-Type: application/json

Set the JSON body:

{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}

Then validate:

  1. The response status is successful.
  2. The response Content-Type is audio-compatible, such as audio/wav.
  3. The response body can be saved as a WAV file.
  4. The generated audio matches the input text and selected voice.
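The first three checks can also be scripted, for example in a CI job. The sketch below is a hypothetical helper that takes the status code, `Content-Type` header, and raw body of a response and reports which checks fail. It identifies WAV data by the standard `RIFF`/`WAVE` header markers. The fourth check, whether the audio matches the text and voice, still needs a listening pass.

```python
def validate_tts_response(status, content_type, body):
    """Return a list of failed checks; an empty list means all passed."""
    problems = []
    if not 200 <= status < 300:
        problems.append(f"unexpected status {status}")
    if not content_type.lower().startswith("audio/"):
        problems.append(f"unexpected Content-Type {content_type!r}")
    # A WAV file starts with 'RIFF', a 4-byte size, then 'WAVE'.
    if len(body) < 12 or body[:4] != b"RIFF" or body[8:12] != b"WAVE":
        problems.append("body does not look like a WAV file")
    return problems


# Example with a stand-in body that carries only the 12-byte WAV header start.
fake_wav = b"RIFF" + (36).to_bytes(4, "little") + b"WAVE"
print(validate_tts_response(200, "audio/wav", fake_wav))          # []
print(validate_tts_response(500, "application/json", b"{}"))      # three problems
```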

Test an ASR endpoint

For speech-to-text APIs, use multipart/form-data.

Typical request fields:

file: meeting_recording.wav
model: vibevoice-asr
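For reference, this is roughly what a multipart/form-data body with those two fields looks like on the wire. The sketch builds one with the standard library only. The helper name is hypothetical, and the article does not specify an ASR endpoint URL, so sending the body is left to whichever endpoint your wrapper exposes.

```python
import uuid


def encode_multipart(fields, file_field, filename, file_bytes, file_type="audio/wav"):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"'.encode(),
        f"Content-Type: {file_type}".encode(),
        b"",
        file_bytes,
        f"--{boundary}--".encode(),  # closing boundary
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


# Usage: pass the body as request data and content_type as the Content-Type header.
body, content_type = encode_multipart(
    {"model": "vibevoice-asr"}, "file", "meeting_recording.wav", b"RIFF....WAVE"
)
print(content_type.split(";")[0])  # multipart/form-data
```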

Validate the JSON response includes:

  • Speaker IDs
  • Start and end timestamps
  • Transcribed text
  • Segment ordering

Example response to validate against:

{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    }
  ]
}
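The checks listed above translate into a short validator. This sketch assumes the response shape shown in this article (`segments` with `speaker`, `start`, `end`, and `text`) and treats "segment ordering" as non-decreasing start times. Both the function name and that interpretation are assumptions.

```python
NUMBER = (int, float)
REQUIRED = {"speaker": str, "start": NUMBER, "end": NUMBER, "text": str}


def validate_segments(response):
    """Return a list of problems found in an ASR response dict."""
    segments = response.get("segments")
    if not isinstance(segments, list):
        return ["'segments' must be a list"]

    problems = []
    previous_start = None
    for index, seg in enumerate(segments):
        # Field presence and type checks.
        for key, kind in REQUIRED.items():
            if not isinstance(seg.get(key), kind):
                problems.append(f"segment {index}: missing or invalid '{key}'")
        start, end = seg.get("start"), seg.get("end")
        if isinstance(start, NUMBER) and isinstance(end, NUMBER):
            if end <= start:
                problems.append(f"segment {index}: end must be after start")
            if previous_start is not None and start < previous_start:
                problems.append(f"segment {index}: out of order")
            previous_start = start
    return problems


ok = {"segments": [{"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
                    "text": "Let's review the API endpoints for the new release."}]}
print(validate_segments(ok))  # []
```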

Validate audio API contracts

Voice AI APIs often mix binary files and JSON metadata. Use Apidog to verify:

  • Binary file uploads for ASR endpoints
  • JSON body formatting for TTS endpoints
  • Response validation for structured transcripts
  • Environment variables for switching between local and cloud endpoints
  • Auth headers for Azure-hosted or custom deployments

This is useful before integrating the endpoint into an application, CI job, or backend service.

Safety and responsible use

Microsoft added several safeguards after the initial misuse incidents:

  • Audible AI disclaimer: generated audio includes an automatic “This segment was generated by AI” message.
  • Imperceptible watermarking: hidden markers enable third-party verification of VibeVoice-generated content.
  • Inference logging: hashed inference logs help detect abuse patterns, with aggregated statistics reported quarterly.
  • MIT license: permits commercial use, but Microsoft recommends against production deployment without further testing.

Allowed use cases

  • Research and academic use
  • Internal prototyping and testing
  • Podcast generation with proper AI disclosure
  • Accessibility applications, such as text-to-speech for visually impaired users

Avoid these use cases

  • Voice impersonation without explicit recorded consent
  • Deepfakes or presenting AI audio as genuine human recordings
  • Real-time voice conversion for live deepfake applications
  • Generating non-speech audio, such as music or sound effects

Limitations

TTS language support is narrow. VibeVoice-1.5B supports only English and Chinese, and other languages may produce unintelligible output. VibeVoice-ASR has broader coverage at 50+ languages.


ASR hardware requirements are high. The ASR model needs 24 GB+ VRAM, with A100/H100-class GPUs recommended. The TTS models can run on consumer GPUs with 7-8 GB VRAM.

The TTS model does not handle overlapping speech. Dialogue is turn-based and does not model speakers talking over each other.

The TTS and ASR models inherit biases from their Qwen2.5 base. Outputs can contain unexpected, biased, or inaccurate content.

VibeVoice is research-grade software. Expect rough edges in edge cases, error handling, and non-English output.

Deploy VibeVoice-ASR on Azure AI Foundry

If you do not want to manage GPU infrastructure, deploy VibeVoice-ASR through Azure AI Foundry.

The managed endpoint handles:

  • Scaling
  • Model updates
  • Infrastructure maintenance
  • HTTPS API access

The API accepts audio files and returns structured transcriptions in the same Who/When/What format as the local model.

This is useful for workloads where you need uptime and operational consistency that self-hosted GPU inference may not provide. Check Azure AI Foundry’s model catalog for current pricing and deployment options.

To test an Azure-hosted VibeVoice endpoint before app integration:

  1. Create an Apidog environment for Azure.
  2. Add the endpoint URL as an environment variable.
  3. Configure authentication headers.
  4. Upload sample audio files.
  5. Validate transcript structure and response latency.

Community and ecosystem

VibeVoice has an active community:

  • 62,630+ monthly HuggingFace downloads for the 1.5B model
  • 2,280+ likes on HuggingFace
  • 79+ HuggingFace Spaces running the model
  • 12 fine-tuned variants from the community
  • 4 quantized versions for lower-VRAM deployment
  • Community fork at vibevoice-community/VibeVoice with active maintenance

Notable community projects:

  • VibeVoice-FastAPI: production REST API wrapper with Docker support
  • VibeVoice MCP Server: integration with AI coding tools through Model Context Protocol
  • Apple Silicon support: community scripts for M-series Mac inference
  • Quantized models: GGUF and other formats for reduced VRAM usage

FAQ

Is VibeVoice free to use?

Yes. All three models, VibeVoice-1.5B, VibeVoice-Realtime-0.5B, and VibeVoice-ASR, are MIT-licensed. You can use them for commercial and non-commercial purposes. Azure AI Foundry hosting has separate pricing for managed cloud inference.

Can VibeVoice run on Apple Silicon Macs?

The community has contributed scripts for M-series Mac inference. Check the HuggingFace discussions for the VibeVoice-1.5B model. Inference is slower than on CUDA GPUs, but it works.

How does VibeVoice compare to ElevenLabs?

VibeVoice runs locally with no API costs and no data leaving your machine. ElevenLabs offers higher quality, more voices, and easier setup, but requires a paid subscription and cloud processing.

For privacy-sensitive applications or offline use, VibeVoice is the better fit. For production quality and ease of use, ElevenLabs is ahead.

Why was the GitHub repository temporarily disabled?

Microsoft discovered people using voice cloning for impersonation and deepfakes. They disabled the repo, added safety features such as audible disclaimers and watermarking, and later re-enabled it. The community fork continued development during the downtime.

Can I fine-tune VibeVoice on custom voices?

Yes. The community has produced 12 fine-tuned variants on HuggingFace. You need voice samples, typically 30-60 seconds of clear WAV audio at 24kHz mono, plus GPU resources for training.

What audio formats does VibeVoice output?

VibeVoice outputs WAV at 24,000 Hz mono. You can convert the generated file to MP3, OGG, FLAC, or other formats with ffmpeg.

Example:

ffmpeg -i speech.wav speech.mp3

Can I use VibeVoice-ASR as a Whisper replacement?

For long-form audio with speaker identification, yes. VibeVoice-ASR handles 60-minute recordings in a single pass with built-in diarization.

Whisper still fits better for short noisy clips or edge deployment. It also has a more mature ecosystem for many speech-to-text workflows.

Does VibeVoice support real-time voice chat?

VibeVoice-Realtime-0.5B supports streaming text input with ~300ms first-chunk latency. It is usable for near-real-time applications but is not designed for full-duplex voice conversation.
