Wanda

Posted on • Originally published at apidog.com
Qwen3.5-Omni Is Here: Alibaba's Omnimodal AI Beats Gemini on Audio

TL;DR

Alibaba has released Qwen3.5-Omni (March 30, 2026), a single model that processes text, images, audio, and video, outputting both text and real-time speech. It outperforms Gemini 3.1 Pro on most audio tasks, recognizes speech in 113 languages, supports voice cloning, and ships in three variants: Plus, Flash, and Light.

One Model for Everything

Traditional AI stacks require separate models for speech-to-text, vision, text generation, and text-to-speech. Each integration adds latency, cost, and more potential failure points.

Qwen3.5-Omni handles all modalities—text, images, audio, video—in a single inference call. You can send mixed inputs (base64 audio, image URLs, video references, text), and receive text or speech as output. The model supports a 256,000-token context window (over 10 hours audio or ~400 seconds 720p video with audio).

Alibaba trained it on 100M+ hours of native audio-visual data, allowing Qwen3.5-Omni to reason across multiple input types in parallel.

If you’re building multimodal apps, you can now streamline your pipeline and simplify API integration.
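As a concrete sketch, a mixed-modality request can be expressed as a single OpenAI-compatible chat payload (DashScope exposes a compatible mode). The model id and the `modalities` flag below are illustrative assumptions, not confirmed API fields:

```python
import base64

# Sketch of a single mixed-modality request body. The model id
# "qwen3.5-omni-flash" and the "modalities" flag are assumptions;
# check the DashScope docs for the real field names.
audio_bytes = b"\x00\x01..."  # stand-in for real WAV data
audio_b64 = base64.b64encode(audio_bytes).decode("ascii")

payload = {
    "model": "qwen3.5-omni-flash",  # assumed model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what is said and shown."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/frame.png"}},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
    "modalities": ["text", "audio"],  # request speech output too (assumed)
}
```

One payload, one inference call — no separate ASR, vision, or TTS requests to orchestrate.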


What’s New Since Qwen3-Omni

The previous Qwen3-Omni Flash (Dec 2025) had 234ms latency. Qwen3.5-Omni introduces key improvements:

Expanded Language Coverage

  • Speech recognition: 19 → 113 languages and dialects.
  • Speech generation: 10 → 36 languages.
  • Enables global-scale products without extra ASR/TTS integrations.

Built-in Voice Cloning

  • Upload a voice sample and generate responses in that voice.
  • Available via API in Plus and Flash variants.
  • Maintains consistent voice persona over long conversations.
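A hypothetical sketch of the two-step flow — enroll a voice sample, then reference the returned voice in chat requests. Every field name here (`voice_sample`, `voice_id`) is illustrative, not the documented API; consult DashScope for the real parameters:

```python
import base64

# Hypothetical two-step voice-cloning flow; field names are assumptions.
sample_b64 = base64.b64encode(b"<10s wav sample>").decode("ascii")

enroll_request = {            # step 1: enroll a voice from a short sample
    "model": "qwen3.5-omni-plus",
    "voice_sample": {"data": sample_b64, "format": "wav"},
}

chat_request = {              # step 2: reuse the returned voice id in replies
    "model": "qwen3.5-omni-plus",
    "voice_id": "voice-abc123",   # placeholder id returned by enrollment
    "messages": [{"role": "user", "content": "Read this in my voice."}],
}
```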

ARIA Technology

  • ARIA syncs text and speech output to prevent garbled pronunciation of numbers, technical terms, and proper nouns (e.g., “IPv6”, “$249.99”, “Qwen3.5-Omni”).

Semantic Interruption

  • Distinguishes between conversational backchannels (“uh-huh”) and true interruptions (“stop”).
  • Enables more natural, human-like voice interactions.
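On the client side, this distinction only matters when deciding whether to halt playback. A minimal sketch, assuming a hypothetical `intent` field on streamed user-speech events (the real event schema is not confirmed):

```python
# Hedged sketch: assume each streamed user-speech event carries a
# hypothetical "intent" field ("backchannel" vs "interrupt"). The client
# stops TTS playback only on a true interrupt.
def handle_event(event, player):
    if event.get("type") != "user_speech":
        return "ignored"
    if event.get("intent") == "interrupt":
        player.stop()          # halt playback, let the model re-plan
        return "stopped"
    return "continued"         # "uh-huh" etc.: keep speaking

class FakePlayer:
    """Stand-in for a real audio player."""
    def __init__(self):
        self.playing = True
    def stop(self):
        self.playing = False
```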

Integrated Real-Time Web Search

  • The model can fetch and incorporate live search results during inference.
  • No need to pre-fetch context; retrieval is model-driven.

Audio-Visual Vibe Coding

  • You can submit screen recordings as input; the model generates or modifies code based on the visual context.
  • Multimodal code generation from video input.

Benchmark Results

Qwen3.5-Omni delivers state-of-the-art performance:

  • State-of-the-art (SOTA) on 32 of 36 audio and audio-visual benchmarks.
  • Sets new SOTA records on 22 of those 36.
  • Outperforms Gemini 3.1 Pro in audio understanding, reasoning, translation.
  • Matches Gemini 3.1 Pro in audio-visual comprehension.
  • Surpasses ElevenLabs, GPT-Audio, Minimax on multilingual TTS stability across 20 languages.

Model Variants

Alibaba offers three versions:

| Variant | Best for |
| --- | --- |
| Qwen3.5-Omni Plus | Maximum quality, audio-visual reasoning, voice cloning, long-context tasks |
| Qwen3.5-Omni Flash | Balanced speed/quality, real-time voice chat, production APIs |
| Qwen3.5-Omni Light | Low-latency tasks, mobile and edge inference |

All variants process text, images, audio, and video. Choose based on your latency, cost, and quality requirements.


256K Token Context Window

What can you fit in 256K tokens?

  • Audio: Over 10 hours of speech.
  • Video: ~400 seconds of 720p video with audio.
  • Text: ~190,000 words (novel-length).

For most use cases (long meetings, product demos, customer support calls), you won’t need to chunk inputs.
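The figures above can be turned into a rough budget check. The per-second token rates below are back-of-envelope numbers derived from the stated totals (256K tokens ≈ 10 h audio, ≈ 400 s video), not published tokenizer rates:

```python
# Back-of-envelope rates derived from the article's figures (assumptions):
# 256,000 tokens / ~10 h audio  -> ~7 tokens per second of speech
# 256,000 tokens / ~400 s video -> ~640 tokens per second of 720p video+audio
CONTEXT = 256_000
AUDIO_TPS = 256_000 / (10 * 3600)   # ≈ 7.1 tokens/s
VIDEO_TPS = 256_000 / 400           # = 640 tokens/s

def fits(audio_s=0.0, video_s=0.0, text_tokens=0):
    """Rough check: does this input fit in one 256K-token context?"""
    used = audio_s * AUDIO_TPS + video_s * VIDEO_TPS + text_tokens
    return used <= CONTEXT
```

By this estimate, a 90-minute meeting recording plus a few thousand tokens of prompt fits in a single call, while roughly eight minutes of 720p video does not.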

Compared to GPT-4o (128K) and Gemini 2.5 Pro (1M), Qwen3.5-Omni balances context size and SOTA audio-visual performance.


113-Language Speech Recognition

Why does this matter?

  • Global customer support: Single model for Thai, Bengali, Swahili, Finnish, etc.
  • Multilingual content: Transcribe, translate, summarize podcasts, videos, interviews in non-English languages.
  • Language switching: Handles code-switching natively (e.g., English ↔ Spanish mid-sentence).

Architecture: Thinker-Talker with MoE

Qwen3.5-Omni uses a Thinker-Talker architecture:

  • Thinker: Processes multimodal input, generates reasoning tokens.
  • Talker: Converts tokens to natural speech in real time; multi-codebook approach for low latency.

Thinker-Talker Architecture

The Plus variant uses Mixture of Experts (MoE)—only a subset of parameters is active per token, making inference fast and memory-efficient.

For local deployment:

  • Use vLLM for optimized MoE inference.
  • HuggingFace Transformers works but is slower with MoE.
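A minimal serving sketch with vLLM. The HuggingFace repo name and the flag values are assumptions sized for the Plus variant; check the actual model listing before running:

```shell
# Assumed model id — verify the real HuggingFace repo name first.
pip install vllm

# vLLM serves an OpenAI-compatible endpoint on port 8000 by default.
# Tensor-parallel across 2 GPUs if one card lacks the ~40 GB Plus needs.
vllm serve Qwen/Qwen3.5-Omni-Plus \
    --tensor-parallel-size 2 \
    --max-model-len 262144
```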

Where Apidog Fits In

When integrating Qwen3.5-Omni’s API, you’ll send complex multimodal JSON with mixed base64 audio, image URLs, video references, and text.

Multimodal Request Debugging in Apidog

Why Apidog:

  • Build and save request templates for Qwen3.5-Omni.
  • Set environment variables (API keys).
  • Write automated tests for response structure/content.
  • Easily compare Plus, Flash, and Light variants by running the same request and comparing latency/output.

Get started: Download Apidog free to test multimodal API workflows.


Who Should Use Qwen3.5-Omni?

  • Voice assistants: Real-time speech in/out, conversation memory, web retrieval, natural interruptions.
  • Video analysis: Video summarization, meeting transcription, tutorial/content generation from screen recordings.
  • Multilingual products: Unified 113-language ASR and 36-language TTS.
  • Accessibility tools: Alt-text for images, audio descriptions, real-time captions for under-resourced languages.
  • Developer tools: Audio-Visual Vibe Coding—turn screen recordings into code.

Access

Qwen3.5-Omni is available via:

  • Alibaba Cloud DashScope API (production access)
  • qwen.ai (web interface for testing)
  • HuggingFace Hub (model weights for local deployment)
  • ModelScope (optimal for mainland China users)

API uses Alibaba Cloud’s standard authentication (get a DashScope API key). Refer to DashScope documentation for endpoint and pricing details.
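Authentication is a standard bearer token. The sketch below builds (but does not send) a request against DashScope's documented compatible-mode base URL; only the header shape is shown:

```python
import os
import urllib.request

# Standard bearer-token auth against DashScope's OpenAI-compatible endpoint.
# The base URL is DashScope's compatible-mode path; mainland-China vs
# international endpoints differ, so verify yours in the docs.
BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"
api_key = os.environ.get("DASHSCOPE_API_KEY", "sk-placeholder")

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req, data=body) would actually send it; omitted here.
```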


What to Watch

  • Test for your use case: Benchmark wins don’t always guarantee domain-specific quality. Validate with your own data, vocabularies, accents, and formats.
  • Voice cloning: API-only for now; not yet in web UI.
  • Local deployment: Plus variant (30B MoE) requires ≥40GB VRAM. Flash/Light variants are lighter.

FAQ

How is Qwen3.5-Omni different from Qwen2.5-Omni?

Qwen2.5-Omni: 7B/3B dense models, 19-language speech.

Qwen3.5-Omni: MoE architecture, 113-language speech recognition, voice cloning, ARIA for better TTS, larger context window, and improved benchmarks.


Can I run Qwen3.5-Omni locally?

Yes. Use HuggingFace Transformers or vLLM (recommended for MoE). Plus needs 40GB+ VRAM; Flash and Light are lighter.


Is there a free tier?

qwen.ai web interface is free. DashScope API is paid (see pricing per modality).


Does it support real-time streaming?

Yes. Thinker-Talker outputs audio in streaming chunks for responsive voice conversations.


Plus vs Flash vs Light?

  • Plus: Highest quality.
  • Flash: Balanced for most production APIs.
  • Light: Fastest, for latency-sensitive/mobile/edge.

Can I use my own voice with the API?

Yes, via API voice cloning. Upload a sample and get speech in that voice (not yet in web UI).


How does it compare to ElevenLabs for voice generation?

Qwen3.5-Omni Plus outperforms ElevenLabs on multilingual voice stability across 20 languages (per Alibaba's own benchmarks). ElevenLabs offers more voice customization; for integrated multimodal workflows, Qwen3.5-Omni is the cleaner choice.


Is it safe to send sensitive audio/video via API?

Review Alibaba Cloud’s data processing agreement for compliance. Always assume cloud data may be logged unless explicitly stated otherwise.
