TL;DR
Alibaba has released Qwen3.5-Omni (March 30, 2026), a single model that processes text, images, audio, and video, outputting both text and real-time speech. It outperforms Gemini 3.1 Pro on most audio tasks, recognizes speech in 113 languages, supports voice cloning, and ships in three variants: Plus, Flash, and Light.
One Model for Everything
Traditional AI stacks require separate models for speech-to-text, vision, text generation, and text-to-speech. Each integration adds latency, cost, and more potential failure points.
Qwen3.5-Omni handles text, images, audio, and video in a single inference call. You can send mixed inputs (base64 audio, image URLs, video references, text) and receive text or speech as output. The model supports a 256,000-token context window: over 10 hours of audio, or roughly 400 seconds of 720p video with audio.
Alibaba trained it on 100M+ hours of native audio-visual data, allowing Qwen3.5-Omni to reason across multiple input types in parallel.
If you’re building multimodal apps, you can now streamline your pipeline and simplify API integration.
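To make the "single inference call" concrete, here is a minimal sketch of building one mixed-modality request body. It assumes an OpenAI-style chat schema; the model name `qwen3.5-omni-flash` and the content-part type names are illustrative assumptions, not confirmed API details, so check the DashScope documentation for the exact schema.

```python
import base64
import json

def build_request(text: str, image_url: str, audio_bytes: bytes) -> dict:
    """Assemble one request mixing text, an image URL, and base64 audio.

    The field names below follow the common OpenAI-style chat layout and
    are assumptions for Qwen3.5-Omni specifically.
    """
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": "qwen3.5-omni-flash",  # assumed model identifier
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    }

payload = build_request(
    "What is said in this clip, and what does the chart show?",
    "https://example.com/chart.png",
    b"\x00\x01",  # stand-in for real WAV bytes
)
print(json.dumps(payload)[:80])
```

The point is that all three modalities travel in one message list, so there is no separate ASR or vision call to orchestrate.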
What’s New from Qwen3-Omni
The previous Qwen3-Omni Flash (Dec 2025) had 234ms latency. Qwen3.5-Omni introduces key improvements:
Expanded Language Coverage
- Speech recognition: 19 → 113 languages and dialects.
- Speech generation: 10 → 36 languages.
- Enables global-scale products without extra ASR/TTS integrations.
Built-in Voice Cloning
- Upload a voice sample and generate responses in that voice.
- Available via API in Plus and Flash variants.
- Maintains consistent voice persona over long conversations.
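A hypothetical sketch of the two-step cloning flow implied above: enroll a voice sample, then reference the resulting voice in later speech requests. Every field name here (`voice_sample`, `voice`, `label`) is an assumption for illustration; consult the DashScope voice-cloning documentation for the real request shape.

```python
import base64

def build_enrollment(sample_wav: bytes, label: str) -> dict:
    """Hypothetical enrollment payload: upload a voice sample once."""
    return {
        "voice_sample": base64.b64encode(sample_wav).decode("ascii"),
        "sample_format": "wav",
        "label": label,  # your own name for the cloned voice
    }

def build_speech_request(text: str, voice_id: str) -> dict:
    """Hypothetical speech request referencing the enrolled voice ID."""
    return {"model": "qwen3.5-omni-plus", "input": text, "voice": voice_id}

enroll = build_enrollment(b"RIFF...stand-in-wav-bytes", "support-agent-en")
speak = build_speech_request("Your order has shipped.", "voice-123")
```

Because the voice is referenced by ID rather than re-uploaded, the persona stays consistent across a long conversation.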
ARIA Technology
- ARIA syncs text and speech output to prevent garbled pronunciation of numbers, technical terms, and proper nouns (e.g., "IPv6", "$249.99", "Qwen3.5-Omni").
Semantic Interruption
- Distinguishes between conversational backchannels (“uh-huh”) and true interruptions (“stop”).
- Enables more natural, human-like voice interactions.
Integrated Real-Time Web Search
- The model can fetch and incorporate live search results during inference.
- No need to pre-fetch context; retrieval is model-driven.
Audio-Visual Vibe Coding
- You can submit screen recordings as input; the model generates or modifies code based on the visual context.
- Multimodal code generation from video input.
Benchmark Results
Qwen3.5-Omni delivers state-of-the-art performance:
- State-of-the-art on 32 of 36 audio and audio-visual benchmarks, setting new SOTA on 22 of them.
- Outperforms Gemini 3.1 Pro in audio understanding, reasoning, translation.
- Matches Gemini 3.1 Pro in audio-visual comprehension.
- Surpasses ElevenLabs, GPT-Audio, and Minimax in multilingual TTS stability across 20 languages.
Model Variants
Alibaba offers three versions:
| Variant | Best for |
|---|---|
| Qwen3.5-Omni Plus | Maximum quality, audio-visual reasoning, voice cloning, long context tasks |
| Qwen3.5-Omni Flash | Balanced speed/quality, real-time voice chat, production APIs |
| Qwen3.5-Omni Light | Low-latency tasks, mobile and edge inference |
All variants process text, images, audio, and video. Choose based on your latency, cost, and quality requirements.
256K Token Context Window
What can you fit in 256K tokens?
- Audio: Over 10 hours of speech.
- Video: ~400 seconds of 720p video with audio.
- Text: ~190,000 words (novel-length).
For most use cases (long meetings, product demos, customer support calls), you won’t need to chunk inputs.
Compared to GPT-4o (128K tokens) and Gemini 2.5 Pro (1M tokens), Qwen3.5-Omni's 256K window balances context size with SOTA audio-visual performance.
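You can turn the capacity figures above into a rough planning tool. The per-unit token rates below are back-of-the-envelope inferences from the article's own numbers (256K tokens for 10 hours of audio, ~400 s of 720p video, ~190K words), not official tokenizer rates.

```python
# Rough token rates derived from the stated capacities; treat these as
# estimates for capacity planning, not exact tokenizer behavior.
CONTEXT = 256_000
TOKENS_PER_AUDIO_SEC = CONTEXT / (10 * 3600)  # ~7.1 tokens per second
TOKENS_PER_VIDEO_SEC = CONTEXT / 400          # 640 tokens per second
TOKENS_PER_WORD = CONTEXT / 190_000           # ~1.35 tokens per word

def estimate(audio_s: float = 0.0, video_s: float = 0.0, words: int = 0):
    """Return (estimated tokens used, whether it fits in one context)."""
    used = (audio_s * TOKENS_PER_AUDIO_SEC
            + video_s * TOKENS_PER_VIDEO_SEC
            + words * TOKENS_PER_WORD)
    return used, used <= CONTEXT

# A 90-minute meeting recording plus a 5,000-word briefing document:
used, fits = estimate(audio_s=90 * 60, words=5000)
```

In this example the whole input needs only around 45K tokens, well under the 256K limit, which is why chunking is rarely necessary for meeting-length audio.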
113-Language Speech Recognition
Why does this matter?
- Global customer support: Single model for Thai, Bengali, Swahili, Finnish, etc.
- Multilingual content: Transcribe, translate, summarize podcasts, videos, interviews in non-English languages.
- Language switching: Handles code-switching natively (e.g., English ↔ Spanish mid-sentence).
Architecture: Thinker-Talker with MoE
Qwen3.5-Omni uses a Thinker-Talker architecture:
- Thinker: Processes multimodal input, generates reasoning tokens.
- Talker: Converts tokens to natural speech in real time; multi-codebook approach for low latency.
Plus variant uses Mixture of Experts (MoE)—only a subset of parameters is active per token, making inference fast and memory efficient.
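On the client side, a Thinker-Talker stream means text deltas and audio chunks arrive interleaved, so playback can begin before generation finishes. Here is a minimal sketch of reassembling such a stream; the event structure (`text.delta` / `audio.delta`) is an assumed shape for illustration, not the documented wire format.

```python
import base64

def consume(events):
    """Reassemble interleaved text and audio deltas from a stream.

    In a real client the audio bytes would be fed to a player as they
    arrive; here we just accumulate them to show the interleaving.
    """
    text_parts, audio = [], bytearray()
    for ev in events:
        if ev["type"] == "text.delta":
            text_parts.append(ev["text"])
        elif ev["type"] == "audio.delta":
            audio.extend(base64.b64decode(ev["data"]))
    return "".join(text_parts), bytes(audio)

demo_stream = [
    {"type": "text.delta", "text": "Hello"},
    {"type": "audio.delta", "data": base64.b64encode(b"\x01\x02").decode()},
    {"type": "text.delta", "text": ", world"},
]
text, audio = consume(demo_stream)
```

The low latency comes from the Talker emitting audio for early tokens while the Thinker is still producing later ones.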
For local deployment:
- Use vLLM for optimized MoE inference.
- Hugging Face Transformers works but is slower with MoE.
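For reference, serving via vLLM's OpenAI-compatible server would look roughly like this. The Hugging Face repo name `Qwen/Qwen3.5-Omni-Flash` is an assumption; substitute the actual published weights, and tune the flags to your hardware.

```shell
pip install vllm

# Serve with an OpenAI-compatible endpoint on localhost:8000.
# --tensor-parallel-size splits the MoE weights across two GPUs;
# --max-model-len caps the context to fit available VRAM.
vllm serve Qwen/Qwen3.5-Omni-Flash \
  --max-model-len 32768 \
  --tensor-parallel-size 2
```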
Where Apidog Fits In
When integrating Qwen3.5-Omni’s API, you’ll send complex multimodal JSON with mixed base64 audio, image URLs, video references, and text.
Why Apidog:
- Build and save request templates for Qwen3.5-Omni.
- Set environment variables (API keys).
- Write automated tests for response structure/content.
- Easily compare Plus, Flash, and Light variants by running the same request and comparing latency/output.
Get started: Download Apidog free to test multimodal API workflows.
Who Should Use Qwen3.5-Omni?
- Voice assistants: Real-time speech in/out, conversation memory, web retrieval, natural interruptions.
- Video analysis: Video summarization, meeting transcription, tutorial/content generation from screen recordings.
- Multilingual products: Unified 113-language ASR and 36-language TTS.
- Accessibility tools: Alt-text for images, audio descriptions, real-time captions for under-resourced languages.
- Developer tools: Audio-Visual Vibe Coding—turn screen recordings into code.
Access
Qwen3.5-Omni is available via:
- Alibaba Cloud DashScope API (production access)
- qwen.ai (web interface for testing)
- HuggingFace Hub (model weights for local deployment)
- ModelScope (optimal for mainland China users)
API uses Alibaba Cloud’s standard authentication (get a DashScope API key). Refer to DashScope documentation for endpoint and pricing details.
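A minimal auth setup might look like the sketch below, using DashScope's OpenAI-compatible mode. Verify the base URL and model names against the current DashScope documentation before relying on them.

```python
import os

# Read the key from the environment rather than hard-coding it.
# The placeholder fallback is only so this sketch runs standalone.
API_KEY = os.environ.get("DASHSCOPE_API_KEY", "sk-placeholder")
BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```

Any HTTP client (or the OpenAI SDK pointed at `BASE_URL`) can then POST chat requests with these headers.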
What to Watch
- Test for your use case: Benchmark wins don’t always guarantee domain-specific quality. Validate with your own data, vocabularies, accents, and formats.
- Voice cloning: API-only for now; not yet in web UI.
- Local deployment: Plus variant (30B MoE) requires ≥40GB VRAM. Flash/Light variants are lighter.
FAQ
How is Qwen3.5-Omni different from Qwen2.5-Omni?
Qwen2.5-Omni: 7B/3B dense models, 19-language speech.
Qwen3.5-Omni: MoE architecture, 113-language speech recognition, voice cloning, ARIA for better TTS, larger context window, and improved benchmarks.
Can I run Qwen3.5-Omni locally?
Yes. Use Hugging Face Transformers or vLLM (recommended for MoE). Plus needs 40GB+ VRAM; Flash and Light are lighter.
Is there a free tier?
qwen.ai web interface is free. DashScope API is paid (see pricing per modality).
Does it support real-time streaming?
Yes. Thinker-Talker outputs audio in streaming chunks for responsive voice conversations.
Plus vs Flash vs Light?
- Plus: Highest quality.
- Flash: Balanced for most production APIs.
- Light: Fastest, for latency-sensitive/mobile/edge.
Can I use my own voice with the API?
Yes, via API voice cloning. Upload a sample and get speech in that voice (not yet in web UI).
How does it compare to ElevenLabs for voice generation?
Qwen3.5-Omni Plus outperforms ElevenLabs on multilingual voice stability across 20 languages (per Alibaba's benchmarks). ElevenLabs offers more voice-customization options; for an integrated multimodal pipeline, Qwen3.5-Omni is the simpler choice.
Is it safe to send sensitive audio/video via API?
Review Alibaba Cloud’s data processing agreement for compliance. Always assume cloud data may be logged unless explicitly stated otherwise.


