TL;DR
VibeVoice is Microsoft’s open-source voice AI suite, featuring three models: VibeVoice-1.5B (text-to-speech, up to 90 mins, 4 speakers), VibeVoice-Realtime-0.5B (streaming TTS), and VibeVoice-ASR (speech recognition, 60 mins, 50+ languages, 7.77% WER). All are MIT-licensed and run locally. This guide provides actionable steps for installation, usage, and API integration.
Introduction
Microsoft released VibeVoice as an open-source, local voice AI framework in early 2026, supporting both text-to-speech (TTS) and automatic speech recognition (ASR). All models run on your own hardware, with no cloud requirement.
Three model types:
- VibeVoice-1.5B: Generates expressive TTS from text, supporting up to 90 minutes and 4 different speakers per pass.
- VibeVoice-Realtime-0.5B: Lightweight streaming TTS, ~300ms latency for first audio chunk.
- VibeVoice-ASR: Transcribes up to 60 minutes of audio, supports 50+ languages, outputs structured transcripts with speaker IDs and timestamps.
Microsoft temporarily disabled the main repository after reports of voice-cloning misuse. They re-enabled it with these safeguards:
- Audible AI disclaimer in all generated audio
- Imperceptible watermarking for provenance
VibeVoice-ASR is available on Azure AI Foundry (cloud). TTS models remain MIT-licensed for local use.
This guide covers installation, TTS generation, speech recognition, API integration, and voice AI endpoint testing using Apidog.
How VibeVoice Works: Architecture Overview
The Tokenizer Breakthrough
VibeVoice uses continuous speech tokenizers at 7.5 Hz—far lower than typical 50-100 Hz rates. This allows handling of very long sequences (up to 90 minutes) without context loss.
There are two tokenizers:
- Acoustic Tokenizer: Sigma-VAE, ~340M parameters, downsampling audio 3,200x from 24kHz input.
- Semantic Tokenizer: Same architecture, trained for linguistic meaning using an ASR proxy.
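Back-of-the-envelope arithmetic from the figures above shows why the low frame rate matters for long-form generation:

```python
# Why 7.5 Hz matters: acoustic frames needed for a 90-minute session.
SAMPLE_RATE = 24_000        # Hz, model input/output rate
DOWNSAMPLE = 3_200          # acoustic tokenizer downsampling factor
frame_rate = SAMPLE_RATE / DOWNSAMPLE  # 7.5 Hz, matching the spec above

seconds = 90 * 60                        # 90-minute session
frames_vibevoice = seconds * frame_rate  # 40,500 frames: fits a 64K context
frames_typical = seconds * 50            # a 50 Hz tokenizer would need 270,000

print(frames_vibevoice, frames_typical)
```

At 7.5 Hz, 90 minutes of audio needs roughly 40,500 frames, comfortably inside a 64K-token context; a conventional 50 Hz tokenizer would need 270,000, far beyond it.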
Next-Token Diffusion
VibeVoice combines a Qwen2.5-1.5B LLM backbone with a ~123M parameter diffusion head. The LLM manages context/dialogue; the diffusion head generates acoustic detail using DDPM with Classifier-Free Guidance.
- Total parameter count: ~3B
Training Approach
Curriculum learning: Training starts with short sequences (4K tokens), then increases to 16K, 32K, and 64K. Tokenizers are frozen; only the LLM and diffusion head update.
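The schedule can be sketched in a few lines (the step boundaries below are illustrative assumptions; only the token lengths come from the text above):

```python
# Illustrative sequence-length curriculum. Token lengths match the article;
# the step boundaries are hypothetical, not published hyperparameters.
CURRICULUM = [
    (0, 4_096),         # start with short sequences
    (10_000, 16_384),
    (20_000, 32_768),
    (30_000, 65_536),   # final 64K-token context
]

def seq_len_for_step(step: int) -> int:
    """Return the max training sequence length for a given step."""
    length = CURRICULUM[0][1]
    for start, n_tokens in CURRICULUM:
        if step >= start:
            length = n_tokens
    return length

print(seq_len_for_step(0))       # 4096
print(seq_len_for_step(25_000))  # 32768
```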
VibeVoice Model Specifications
| Model | Parameters | Purpose | Max length | Languages | License |
|---|---|---|---|---|---|
| VibeVoice-1.5B | 3B | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |
VibeVoice-1.5B (TTS)
| Specification | Value |
|---|---|
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24kHz WAV mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |
VibeVoice-ASR
| Specification | Value |
|---|---|
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 mins, single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% (8 English datasets) |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured (Who/When/What) |
| Supported audio | WAV, FLAC, MP3 (16kHz+) |
Installation and Setup
Prerequisites
- Python 3.8+
- NVIDIA GPU with CUDA
- TTS: 7–8 GB VRAM minimum
- ASR: 24 GB VRAM minimum (A100/H100 recommended)
- 32 GB RAM minimum (64 GB for ASR)
- CUDA 11.8+ (12.0+ recommended)
Install VibeVoice TTS
```bash
# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

# Install dependencies
pip install -r requirements.txt
```
Models download automatically on first run, or pre-download:
```python
from huggingface_hub import snapshot_download

# Download the 1.5B TTS model
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
    local_dir_use_symlinks=False,
)
```
Install via pip (Community Package)
```bash
pip install vibevoice
```
Install for ASR
```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements-asr.txt
```
Or use Azure AI Foundry for managed cloud inference.
Generating Speech with VibeVoice-1.5B
Single-Speaker Generation
Create your script:
```text
Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.
```
Run inference:
```bash
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5
```
Output is saved as .wav in outputs/.
Multi-Speaker Podcast Generation
Supports up to 4 speakers with consistent identities:
```text
Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.
```
```bash
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5
```
Voice Cloning (Zero-Shot)
Requirements
- WAV mono, 24,000 Hz, 30–60 seconds of clear speech
Convert audio:
```bash
ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav
```
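Before cloning, it can help to sanity-check the reference file against the requirements above. A minimal check with Python's standard `wave` module (the file name is just an example):

```python
import wave

def check_reference(path: str) -> None:
    """Verify a voice-cloning reference: 24 kHz, mono, 30-60 s."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    assert rate == 24_000, f"expected 24 kHz, got {rate} Hz"
    assert channels == 1, f"expected mono, got {channels} channels"
    assert 30 <= duration <= 60, f"expected 30-60 s, got {duration:.1f} s"

# check_reference("reference_voice.wav")  # raises AssertionError if unsuitable
```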
Use Gradio demo:
```bash
python demo/gradio_demo.py
```
Open http://127.0.0.1:7860, upload reference audio, and generate speech.
Streaming with VibeVoice-Realtime-0.5B
For low-latency (~300ms) streaming output:
```bash
python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice
```
Use Realtime for interactive apps; 1.5B for highest quality.
Using VibeVoice with Python
Pipeline API Example
```python
from transformers import pipeline
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download("microsoft/VibeVoice-1.5B")

# Load pipeline (VibeVoice ships custom model code on the Hub)
pipe = pipeline(
    "text-to-speech",
    model=model_path,
    trust_remote_code=True,
)

# Multi-speaker script
script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

# Apply chat template
input_data = pipe.processor.apply_chat_template(script)

# Generate audio
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}
output = pipe(input_data, generate_kwargs=generate_kwargs)
```
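The pipeline output is not saved to disk automatically. Assuming it follows the usual text-to-speech pipeline convention of a dict with an `audio` array and a `sampling_rate` (field names may differ for this model), a stdlib-only helper can write it out:

```python
import math
import struct
import wave

def save_wav(samples, rate, path):
    """Write float samples in [-1, 1] to a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit PCM
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)

# Stand-in for pipeline output: one second of a 440 Hz tone at 24 kHz.
tone = [math.sin(2 * math.pi * 440 * t / 24_000) for t in range(24_000)]
save_wav(tone, 24_000, "speech.wav")
```

With the pipeline above, this would be roughly `save_wav(output["audio"], output["sampling_rate"], "speech.wav")`, assuming those keys exist.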
FastAPI Wrapper for Production
Run an OpenAI-compatible REST API with Docker:
```bash
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
docker compose up
```
Test with curl:
```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav
```
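The same request can be issued from Python with only the standard library (the endpoint and field names mirror the curl call above; the server must be running locally before you uncomment the network call):

```python
import json
import urllib.request

def build_tts_request(text: str, voice: str = "alice") -> urllib.request.Request:
    """Build an OpenAI-style TTS request for the local FastAPI wrapper."""
    payload = {
        "model": "vibevoice-1.5b",
        "input": text,
        "voice": voice,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, fetch and save the audio:
# req = build_tts_request("Your API documentation should be a conversation.")
# with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:
#     f.write(resp.read())
```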
You can test this endpoint with Apidog using the same format as OpenAI’s TTS API—import the endpoint, configure the request, and validate voice generation.
Using VibeVoice-ASR for Speech Recognition
Basic Transcription
```bash
python asr_inference.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_path meeting_recording.wav
```
Structured Output Format
Each segment includes:
- Who: Speaker ID
- When: Start/end timestamps
- What: Transcribed text
Example:
```json
{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
```
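The structured output is straightforward to post-process. For example, turning segments into a speaker-labelled transcript with timestamps:

```python
import json

def to_transcript(asr_json: str) -> str:
    """Render Who/When/What segments as '[HH:MM:SS] Speaker: text' lines."""
    lines = []
    for seg in json.loads(asr_json)["segments"]:
        start = int(seg["start"])
        stamp = f"{start // 3600:02d}:{start % 3600 // 60:02d}:{start % 60:02d}"
        lines.append(f"[{stamp}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)

example = json.dumps({"segments": [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
     "text": "Let's review the API endpoints for the new release."},
    {"speaker": "Speaker 2", "start": 4.5, "end": 8.1,
     "text": "I've added three new endpoints for the billing module."},
]})
print(to_transcript(example))
# [00:00:00] Speaker 1: Let's review the API endpoints for the new release.
# [00:00:04] Speaker 2: I've added three new endpoints for the billing module.
```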
ASR as an MCP Server
Integrate ASR with coding tools using MCP:
```bash
pip install vibevoice-mcp-server
vibevoice-mcp serve
```
This enables transcription as part of coding workflows (e.g., for agent-based tools).
When to Use VibeVoice-ASR vs Whisper
| Use case | Best choice | Why |
|---|---|---|
| Long meetings (30–60 min) | VibeVoice-ASR | Single-pass, speaker ID |
| Multi-speaker interviews | VibeVoice-ASR | Built-in diarization |
| Podcasts with timestamps | VibeVoice-ASR | Structured output |
| Multilingual (50+ languages) | VibeVoice-ASR | Broader language support |
| Short, noisy clips | Whisper | Better noise robustness |
| Edge/mobile deployment | Whisper | Smaller model, more device support |
| Specialized non-English | Whisper | More mature fine-tuning |
Testing Voice AI APIs with Apidog
Whether you use the VibeVoice FastAPI wrapper, Azure AI Foundry, or your own API, Apidog streamlines integration testing.
Test the TTS Endpoint
- Create a new POST request in Apidog targeting your FastAPI server.
- Use OpenAI-compatible request body:
```json
{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}
```
- Send the request and verify `audio/wav` in the response headers.
- Save and audit the output WAV file.
Test the ASR Endpoint
For speech-to-text:
- POST request with `multipart/form-data`.
- Attach your audio file as a form field.
- Confirm the JSON response includes speaker IDs, timestamps, and text.
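Under the hood, a `multipart/form-data` upload is just a boundary-delimited body. Building one by hand with the standard library clarifies what Apidog (or any client) sends; the field name `file` is an assumption, so match whatever your server expects:

```python
import uuid

def build_multipart(filename: str, audio_bytes: bytes):
    """Build a multipart/form-data body and Content-Type header for an upload."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n"
    ).encode() + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    content_type = f"multipart/form-data; boundary={boundary}"
    return body, content_type

body, content_type = build_multipart("meeting_recording.wav", b"RIFF....WAVE")
print(content_type)  # multipart/form-data; boundary=<random hex>
```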
Validate Audio API Contracts
Apidog supports:
- Binary uploads (ASR endpoints)
- JSON bodies (TTS endpoints)
- Response schema validation for structured output
- Environment variables for switching endpoints
Safety and Responsible Use
Microsoft implemented safeguards:
- Audible AI disclaimer: “This segment was generated by AI” in all audio
- Imperceptible watermarking: Content verification
- Inference logging: Abuse detection (hashed, aggregated logs)
- MIT License: Commercial use allowed, but production deployment requires further testing
Allowed Use
- Research, academic, internal prototyping
- Podcast generation with AI disclosure
- Accessibility (TTS for visually impaired)
Not Allowed
- Voice impersonation without explicit consent
- Deepfakes or unlabelled AI audio
- Real-time voice conversion for deepfakes
- Generating music or non-speech audio
Limitations to Know About
- TTS language support is limited: only English and Chinese. Output in other languages may be unintelligible. ASR supports 50+ languages.
- High hardware requirements for ASR: Needs 24GB+ VRAM (A100/H100). TTS models run on 7–8GB consumer GPUs.
- No overlapping speech: TTS is strictly turn-based dialogue.
- Potential biases: Models inherit biases from Qwen2.5.
- Research-grade: Not production-ready; expect rough edges and limited error handling.
Deploying VibeVoice-ASR on Azure AI Foundry
For managed, production-grade ASR, use Azure AI Foundry:
- No GPU management needed
- Automatic scaling and updates
- HTTPS endpoint: upload audio, receive structured (Who/When/What) transcription
Check Azure AI Foundry’s model catalog for current options and pricing.
To test Azure-hosted VibeVoice endpoints, set the URL and authentication headers in Apidog and run test requests with sample audio.
Community and Ecosystem
VibeVoice has an active developer community:
- 62,630+ monthly downloads (HuggingFace, 1.5B model)
- 2,280+ likes, 79+ Spaces, 12 fine-tuned variants
- 4 quantized versions for lower-VRAM use
- Community fork: `vibevoice-community/VibeVoice` (active maintenance)
Notable projects:
- VibeVoice-FastAPI: Production REST API + Docker
- VibeVoice MCP Server: Model Context Protocol integration
- Apple Silicon support: Community scripts for M-series Macs
- Quantized models: GGUF and alternate formats
FAQ
Is VibeVoice free to use?
Yes. All models (TTS 1.5B, Realtime 0.5B, ASR) are MIT-licensed. Azure AI Foundry offers managed hosting with separate pricing.
Can VibeVoice run on Apple Silicon Macs?
Yes, via community scripts. Check HuggingFace discussions for updates. Slower than on CUDA GPUs.
How does VibeVoice compare to ElevenLabs?
- VibeVoice: Free, local, privacy-friendly, no API costs.
- ElevenLabs: Higher quality, more voices, easier setup, paid cloud service.
Why was the GitHub repository temporarily disabled?
Microsoft paused the repo to address voice cloning misuse. They added safety features (disclaimer, watermarking) before reopening. Development continued in community forks.
Can I fine-tune VibeVoice on custom voices?
Yes. Requires 30–60 seconds of clear WAV audio (24kHz mono) and GPU resources.
What audio formats does VibeVoice output?
WAV, 24kHz mono. Convert to MP3/OGG/FLAC with ffmpeg as needed.
Can I use VibeVoice-ASR as a Whisper replacement?
For long-form, multi-speaker audio: yes. VibeVoice-ASR processes 60-minute files with diarization. Whisper is better for noisy, short clips or edge/mobile deployment.
Does VibeVoice support real-time voice chat?
VibeVoice-Realtime-0.5B supports streaming input with ~300ms latency, suitable for near-real-time but not full-duplex chat. For true real-time, consider Azure OpenAI’s GPT-Realtime.
For hands-on API testing, debugging, and validation, use Apidog to streamline your voice AI development workflow.