By Wanda • Originally published at apidog.com

What Is Microsoft VibeVoice? How to Use the Open-Source Voice AI Models

TL;DR

VibeVoice is Microsoft’s open-source voice AI suite, featuring three models: VibeVoice-1.5B (text-to-speech, up to 90 mins, 4 speakers), VibeVoice-Realtime-0.5B (streaming TTS), and VibeVoice-ASR (speech recognition, 60 mins, 50+ languages, 7.77% WER). All are MIT-licensed and run locally. This guide provides actionable steps for installation, usage, and API integration.


Introduction

Microsoft released VibeVoice as an open-source, local voice AI framework in early 2026, supporting both text-to-speech (TTS) and automatic speech recognition (ASR). All models run on your own hardware, with no cloud requirement.

*(Image: VibeVoice architecture)*

Three model types:

  • VibeVoice-1.5B: Generates expressive TTS from text, supporting up to 90 minutes and 4 different speakers per pass.
  • VibeVoice-Realtime-0.5B: Lightweight streaming TTS, ~300ms latency for first audio chunk.
  • VibeVoice-ASR: Transcribes up to 60 minutes of audio, supports 50+ languages, outputs structured transcripts with speaker IDs and timestamps.

*(Image: the three VibeVoice models)*

Microsoft temporarily disabled the main repo after voice cloning misuse. They re-enabled it with these safeguards:

  • Audible AI disclaimer in all generated audio
  • Imperceptible watermarking for provenance

VibeVoice-ASR is available on Azure AI Foundry (cloud). TTS models remain MIT-licensed for local use.

This guide covers installation, TTS generation, speech recognition, API integration, and voice AI endpoint testing using Apidog.

How VibeVoice Works: Architecture Overview

The Tokenizer Breakthrough

VibeVoice uses continuous speech tokenizers at 7.5 Hz—far lower than typical 50-100 Hz rates. This allows handling of very long sequences (up to 90 minutes) without context loss.

*(Image: tokenizer architecture)*

There are two tokenizers:

  • Acoustic Tokenizer: Sigma-VAE, ~340M parameters, downsampling audio 3,200x from 24kHz input.
  • Semantic Tokenizer: Same architecture, trained for linguistic meaning using an ASR proxy.
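
The numbers above are internally consistent, and a quick arithmetic check (plain Python, no VibeVoice code involved) shows why such long sequences fit:

```python
# Tokenizer frame-rate arithmetic for VibeVoice's continuous tokenizers.
SAMPLE_RATE_HZ = 24_000      # 24 kHz input audio
DOWNSAMPLE_FACTOR = 3_200    # acoustic tokenizer downsampling

frame_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR
print(frame_rate_hz)  # 7.5 tokens per second

# Tokens needed for a 90-minute session -- comfortably inside a 64K context.
tokens_90_min = 90 * 60 * frame_rate_hz
print(int(tokens_90_min))  # 40500
```

At 7.5 tokens per second, 90 minutes of audio is only about 40K acoustic tokens, which is why the model can keep an entire session in context.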

Next-Token Diffusion

VibeVoice combines a Qwen2.5-1.5B LLM backbone with a ~123M parameter diffusion head. The LLM manages context/dialogue; the diffusion head generates acoustic detail using DDPM with Classifier-Free Guidance.

  • Total parameter count: ~3B
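
For intuition, the Classifier-Free Guidance step combines a conditional and an unconditional prediction by extrapolating toward the conditional one. This is the generic CFG formula, not VibeVoice's internal code:

```python
def cfg_combine(cond, uncond, cfg_scale):
    """Classifier-free guidance: push the unconditional prediction
    toward the conditional one by a factor of cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]

# Toy 2-dimensional predictions; cfg_scale=1.5 matches the CLI default below.
print(cfg_combine([1.0, 2.0], [0.5, 1.0], 1.5))  # [1.25, 2.5]
```

A `cfg_scale` of 1.0 would reproduce the conditional prediction exactly; values above 1.0 amplify the conditioning signal.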

Training Approach

Curriculum learning: Training starts with short sequences (4K tokens), then increases to 16K, 32K, and 64K. Tokenizers are frozen; only the LLM and diffusion head update.
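
The staged context growth maps directly onto audio duration at 7.5 tokens per second. The stage lengths below come from the text; everything else is illustrative:

```python
# Curriculum stages: the context window grows while tokenizers stay frozen.
STAGES = [4_096, 16_384, 32_768, 65_536]  # tokens (4K -> 16K -> 32K -> 64K)
TOKENS_PER_SECOND = 7.5                   # tokenizer frame rate

for i, ctx in enumerate(STAGES, start=1):
    minutes = ctx / TOKENS_PER_SECOND / 60
    print(f"Stage {i}: {ctx} tokens covers roughly {minutes:.0f} minutes of audio")
```

Each stage roughly quadruples (then doubles) the audio duration the model sees, from about nine minutes up to well beyond the 90-minute generation target.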

VibeVoice Model Specifications

| Model | Parameters | Purpose | Max length | Languages | License |
| --- | --- | --- | --- | --- | --- |
| VibeVoice-1.5B | 3B | Text-to-speech | 90 minutes | English, Chinese | MIT |
| VibeVoice-Realtime-0.5B | ~0.5B | Streaming TTS | Long-form | English, Chinese | MIT |
| VibeVoice-ASR | ~9B | Speech recognition | 60 minutes | 50+ languages | MIT |

VibeVoice-1.5B (TTS)

| Specification | Value |
| --- | --- |
| LLM base | Qwen2.5-1.5B |
| Context length | 64K tokens |
| Max speakers | 4 simultaneous |
| Audio output | 24kHz WAV, mono |
| Tensor type | BF16 |
| Format | Safetensors |
| HuggingFace downloads | 62,630/month |
| Community forks | 12 fine-tuned variants |

VibeVoice-ASR

| Specification | Value |
| --- | --- |
| Architecture base | Qwen2.5 |
| Parameters | ~9B |
| Audio processing | Up to 60 mins, single pass |
| Frame rate | 7.5 Hz |
| Average WER | 7.77% (8 English datasets) |
| LibriSpeech Clean WER | 2.20% |
| TED-LIUM WER | 2.57% |
| Languages | 50+ |
| Output | Structured (Who/When/What) |
| Supported audio | WAV, FLAC, MP3 (16kHz+) |
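
Word error rate (WER) is the metric behind these benchmark numbers: edit errors divided by the reference word count. A toy illustration (the error counts are invented, not from any real benchmark):

```python
def word_error_rate(subs, dels, ins, ref_words):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    return (subs + dels + ins) / ref_words

# 78 total errors against 1,004 reference words is roughly a 7.77% WER --
# about the suite-average figure reported for VibeVoice-ASR.
print(round(word_error_rate(50, 18, 10, 1004) * 100, 2))  # 7.77
```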

Installation and Setup

Prerequisites

  • Python 3.8+
  • NVIDIA GPU with CUDA
  • TTS: 7–8 GB VRAM minimum
  • ASR: 24 GB VRAM minimum (A100/H100 recommended)
  • 32 GB RAM minimum (64 GB for ASR)
  • CUDA 11.8+ (12.0+ recommended)

Install VibeVoice TTS

```shell
# Clone the repository
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice

# Install dependencies
pip install -r requirements.txt
```

Models download automatically on first run, or you can pre-download them:

```python
from huggingface_hub import snapshot_download

# Download the 1.5B TTS model
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="./models/VibeVoice-1.5B",
    local_dir_use_symlinks=False
)
```

Install via pip (Community Package)

```shell
pip install vibevoice
```

Install for ASR

```shell
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements-asr.txt
```

Or use Azure AI Foundry for managed cloud inference.

Generating Speech with VibeVoice-1.5B

Single-Speaker Generation

Create your script:

```text
Alice: Welcome to the Apidog developer podcast. Today we're covering API testing strategies for 2026.
```

Run inference:

```shell
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path script.txt \
  --speaker_names Alice \
  --cfg_scale 1.5
```

Output is saved as .wav in outputs/.
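
It's worth verifying the generated file matches the documented output format (24 kHz mono WAV) before downstream use. A stdlib-only check; the path in the commented example is hypothetical:

```python
import wave

def check_vibevoice_wav(path):
    """Verify a generated file is 24 kHz mono WAV and return its duration."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    assert rate == 24_000, f"expected 24 kHz, got {rate}"
    assert channels == 1, f"expected mono, got {channels} channels"
    return duration

# Example (path is a placeholder for your actual output file):
# seconds = check_vibevoice_wav("outputs/script_generated.wav")
```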

Multi-Speaker Podcast Generation

Supports up to 4 speakers with consistent identities:

```text
Alice: Welcome back to the show. Today we have two API experts joining us.
Bob: Thanks for having me. I've been working on REST API design patterns for the past five years.
Carol: And I focus on GraphQL performance optimization. Happy to be here.
Alice: Let's start with the debate everyone wants to hear. REST versus GraphQL for microservices.
Bob: REST gives you clear resource boundaries. Each endpoint maps to a specific resource.
Carol: GraphQL gives you flexibility. One endpoint, and the client decides what data it needs.
```
```shell
python demo/inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --txt_path podcast_script.txt \
  --speaker_names Alice Bob Carol \
  --cfg_scale 1.5
```

Voice Cloning (Zero-Shot)

Requirements

  • WAV mono, 24,000 Hz, 30–60 seconds of clear speech

Convert audio:

```shell
ffmpeg -i source_recording.m4a -ar 24000 -ac 1 reference_voice.wav
```

Use Gradio demo:

```shell
python demo/gradio_demo.py
```

Open http://127.0.0.1:7860, upload reference audio, and generate speech.

Streaming with VibeVoice-Realtime-0.5B

For low-latency (~300ms) streaming output:

```shell
python demo/streaming_inference_from_file.py \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --txt_path script.txt \
  --speaker_name Alice
```

Use Realtime for interactive apps; 1.5B for highest quality.

Using VibeVoice with Python

Pipeline API Example

```python
from transformers import pipeline
from huggingface_hub import snapshot_download

# Download model
model_path = snapshot_download("microsoft/VibeVoice-1.5B")

# Load pipeline
pipe = pipeline(
    "text-to-speech",
    model=model_path,
    no_processor=False
)

# Multi-speaker script
script = [
    {"role": "Alice", "content": "How do you handle API versioning?"},
    {"role": "Bob", "content": "We use URL path versioning. v1, v2, and so on."},
]

# Apply chat template
input_data = pipe.processor.apply_chat_template(script)

# Generate audio
generate_kwargs = {
    "cfg_scale": 1.5,
    "n_diffusion_steps": 50,
}

output = pipe(input_data, generate_kwargs=generate_kwargs)
```
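
To persist the result, transformers TTS pipelines conventionally return a dict with an `"audio"` float array and a `"sampling_rate"` — an assumption worth verifying for VibeVoice specifically. A stdlib-plus-NumPy sketch under that assumption:

```python
import wave

import numpy as np

def save_pipeline_audio(output, path):
    """Write a transformers-style TTS result ({'audio': float array,
    'sampling_rate': int}) to a 16-bit PCM mono WAV file."""
    audio = np.asarray(output["audio"]).squeeze()
    # Convert [-1.0, 1.0] floats to 16-bit PCM samples.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit
        w.setframerate(output["sampling_rate"])
        w.writeframes(pcm.tobytes())

# save_pipeline_audio(output, "dialogue.wav")  # 'output' from the pipeline above
```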

FastAPI Wrapper for Production

Run an OpenAI-compatible REST API with Docker:

```shell
git clone https://github.com/ncoder-ai/VibeVoice-FastAPI.git
cd VibeVoice-FastAPI
docker compose up
```

Test with curl:

```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vibevoice-1.5b",
    "input": "Your API documentation should be a conversation, not a monologue.",
    "voice": "alice"
  }' \
  --output speech.wav
```

You can test this endpoint with Apidog using the same format as OpenAI’s TTS API—import the endpoint, configure the request, and validate voice generation.
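
The same call from Python, using `requests` against the wrapper's OpenAI-compatible route shown in the curl example (the base URL assumes the local Docker setup):

```python
import requests

def synthesize(text, voice="alice", base_url="http://localhost:8000"):
    """Call the OpenAI-compatible /v1/audio/speech route; return WAV bytes."""
    resp = requests.post(
        f"{base_url}/v1/audio/speech",
        json={"model": "vibevoice-1.5b", "input": text, "voice": voice},
        timeout=300,  # long-form generation can be slow on consumer GPUs
    )
    resp.raise_for_status()
    return resp.content

# Requires the Docker server from the previous step to be running:
# wav_bytes = synthesize("Hello from VibeVoice.")
# open("speech.wav", "wb").write(wav_bytes)
```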

Using VibeVoice-ASR for Speech Recognition

Basic Transcription

```shell
python asr_inference.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_path meeting_recording.wav
```

Structured Output Format

Each segment includes:

  • Who: Speaker ID
  • When: Start/end timestamps
  • What: Transcribed text

Example:

```json
{
  "segments": [
    {
      "speaker": "Speaker 1",
      "start": 0.0,
      "end": 4.2,
      "text": "Let's review the API endpoints for the new release."
    },
    {
      "speaker": "Speaker 2",
      "start": 4.5,
      "end": 8.1,
      "text": "I've added three new endpoints for the billing module."
    }
  ]
}
```
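
The structured output is straightforward to post-process. A sketch that renders segments in this format as a speaker-labelled transcript:

```python
def format_transcript(result):
    """Render ASR segments as '[start-end] speaker: text' lines."""
    lines = []
    for seg in result["segments"]:
        lines.append(
            f"[{seg['start']:06.1f}-{seg['end']:06.1f}] {seg['speaker']}: {seg['text']}"
        )
    return "\n".join(lines)

result = {
    "segments": [
        {"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
         "text": "Let's review the API endpoints for the new release."},
    ]
}
print(format_transcript(result))
# [0000.0-0004.2] Speaker 1: Let's review the API endpoints for the new release.
```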

ASR as an MCP Server

Integrate ASR with coding tools using MCP:

```shell
pip install vibevoice-mcp-server
vibevoice-mcp serve
```

This enables transcription as part of coding workflows (e.g., for agent-based tools).

When to Use VibeVoice-ASR vs Whisper

| Use case | Best choice | Why |
| --- | --- | --- |
| Long meetings (30–60 min) | VibeVoice-ASR | Single-pass, speaker ID |
| Multi-speaker interviews | VibeVoice-ASR | Built-in diarization |
| Podcasts with timestamps | VibeVoice-ASR | Structured output |
| Multilingual (50+ languages) | VibeVoice-ASR | Broader language support |
| Short, noisy clips | Whisper | Better noise robustness |
| Edge/mobile deployment | Whisper | Smaller model, more device support |
| Specialized non-English | Whisper | More mature fine-tuning |

Testing Voice AI APIs with Apidog

Whether you use the VibeVoice FastAPI wrapper, Azure AI Foundry, or your own API, Apidog streamlines integration testing.

*(Image: testing a TTS endpoint in Apidog)*

Test the TTS Endpoint

  1. Create a new POST request in Apidog targeting your FastAPI server.
  2. Use OpenAI-compatible request body:
```json
{
  "model": "vibevoice-1.5b",
  "input": "Test speech synthesis with proper intonation and pacing.",
  "voice": "alice",
  "response_format": "wav"
}
```
  3. Send the request and verify audio/wav in the response headers.
  4. Save and audit the output WAV file.

Test the ASR Endpoint

For speech-to-text:

  1. POST request with multipart/form-data.
  2. Attach your audio file as a form field.
  3. Confirm the JSON response includes speaker IDs, timestamps, and text.
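
Step 3 can be automated with a minimal schema assertion. Field names follow the example response shown earlier; wire this into whatever test harness you use:

```python
def validate_asr_response(payload):
    """Assert the Who/When/What contract on an ASR JSON response."""
    segments = payload.get("segments")
    assert isinstance(segments, list) and segments, "missing segments"
    for seg in segments:
        assert isinstance(seg["speaker"], str)               # Who
        assert 0.0 <= seg["start"] <= seg["end"]             # When
        assert isinstance(seg["text"], str) and seg["text"]  # What
    return True

sample = {"segments": [{"speaker": "Speaker 1", "start": 0.0, "end": 4.2,
                        "text": "Let's review the API endpoints."}]}
print(validate_asr_response(sample))  # True
```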

Validate Audio API Contracts

Apidog supports:

  • Binary uploads (ASR endpoints)
  • JSON bodies (TTS endpoints)
  • Response schema validation for structured output
  • Environment variables for switching endpoints

Safety and Responsible Use

Microsoft implemented safeguards:

  • Audible AI disclaimer: “This segment was generated by AI” in all audio
  • Imperceptible watermarking: Content verification
  • Inference logging: Abuse detection (hashed, aggregated logs)
  • MIT License: Commercial use allowed, but production deployment requires further testing

Allowed Use

  • Research, academic, internal prototyping
  • Podcast generation with AI disclosure
  • Accessibility (TTS for visually impaired)

Not Allowed

  • Voice impersonation without explicit consent
  • Deepfakes or unlabelled AI audio
  • Real-time voice conversion for deepfakes
  • Generating music or non-speech audio

Limitations to Know About

  • TTS language support is limited: only English and Chinese. Output in other languages is unreliable or unintelligible. ASR, by contrast, supports 50+ languages.
  • High hardware requirements for ASR: Needs 24GB+ VRAM (A100/H100). TTS models run on 7–8GB consumer GPUs.
  • No overlapping speech: TTS is strictly turn-based dialogue.
  • Potential biases: Models inherit biases from Qwen2.5.
  • Research-grade: Not production-ready; expect rough edges and limited error handling.

*(Image: model coverage)*

Deploying VibeVoice-ASR on Azure AI Foundry

For managed, production-grade ASR, use Azure AI Foundry:

  • No GPU management needed
  • Automatic scaling and updates
  • HTTPS endpoint: upload audio, receive structured (Who/When/What) transcription

Check Azure AI Foundry’s model catalog for current options and pricing.

To test Azure-hosted VibeVoice endpoints, set the URL and authentication headers in Apidog and run test requests with sample audio.

Community and Ecosystem

VibeVoice has an active developer community:

  • 62,630+ monthly downloads (HuggingFace, 1.5B model)
  • 2,280+ likes, 79+ Spaces, 12 fine-tuned variants
  • 4 quantized versions for lower-VRAM use
  • Community fork: vibevoice-community/VibeVoice (active maintenance)

Notable projects:

  • VibeVoice-FastAPI: Production REST API + Docker
  • VibeVoice MCP Server: Model Context Protocol integration
  • Apple Silicon support: Community scripts for M-series Macs
  • Quantized models: GGUF and alternate formats

FAQ

Is VibeVoice free to use?

Yes. All models (TTS 1.5B, Realtime 0.5B, ASR) are MIT-licensed. Azure AI Foundry offers managed hosting with separate pricing.

Can VibeVoice run on Apple Silicon Macs?

Yes, via community scripts. Check HuggingFace discussions for updates. Slower than on CUDA GPUs.

How does VibeVoice compare to ElevenLabs?

  • VibeVoice: Free, local, privacy-friendly, no API costs.
  • ElevenLabs: Higher quality, more voices, easier setup, paid cloud service.

Why was the GitHub repository temporarily disabled?

Microsoft paused the repo to address voice cloning misuse. They added safety features (disclaimer, watermarking) before reopening. Development continued in community forks.

Can I fine-tune VibeVoice on custom voices?

Yes. Requires 30–60 seconds of clear WAV audio (24kHz mono) and GPU resources.

What audio formats does VibeVoice output?

WAV, 24kHz mono. Convert to MP3/OGG/FLAC with ffmpeg as needed.

Can I use VibeVoice-ASR as a Whisper replacement?

For long-form, multi-speaker audio: yes. VibeVoice-ASR processes 60-minute files with diarization. Whisper is better for noisy, short clips or edge/mobile deployment.

Does VibeVoice support real-time voice chat?

VibeVoice-Realtime-0.5B supports streaming input with ~300ms latency, suitable for near-real-time but not full-duplex chat. For true real-time, consider Azure OpenAI’s GPT-Realtime.


For hands-on API testing, debugging, and validation, use Apidog to streamline your voice AI development workflow.
