TL;DR
Qwen3.5-Omni accepts text, images, audio, and video as input and returns text or real-time speech. You can access it through the Alibaba Cloud DashScope API or run it locally with HuggingFace Transformers. This guide walks through API setup, working examples for each modality, voice cloning, streaming, local deployment, retries, and testing with Apidog.
What you’re working with
Qwen3.5-Omni is a single multimodal model that handles four input types in one request:
- Text
- Images
- Audio
- Video
It can return either text or natural speech, depending on your request configuration.
Released March 30, 2026, Qwen3.5-Omni uses a Thinker-Talker architecture with an MoE backbone:
- Thinker: processes multimodal input and performs reasoning.
- Talker: converts output into speech using a multi-codebook system that can start streaming audio before the full response is complete.
Available variants:
| Variant | Best for |
|---|---|
| Plus | Highest quality, reasoning, voice cloning |
| Flash | Balanced speed and quality for most production use |
| Light | Lowest latency for mobile and edge scenarios |
Most examples below use qwen3.5-omni-flash. Use qwen3.5-omni-plus when quality matters most, and qwen3.5-omni-light when latency is the main constraint.
API access via DashScope
Alibaba Cloud DashScope is the primary hosted API for Qwen3.5-Omni. You need:
- A DashScope account
- A DashScope API key
- Either the DashScope SDK or the OpenAI-compatible API endpoint
Step 1: Create a DashScope account
Go to dashscope.aliyuncs.com and sign up. If you already have an Alibaba Cloud account, use that account.
Step 2: Create an API key
In the DashScope console:
- Open API Key Management.
- Click Create API Key.
- Copy the key.
The key format starts with:
sk-...
Step 3: Install an SDK
Install the DashScope SDK:
pip install dashscope
Or use the OpenAI Python SDK with DashScope’s OpenAI-compatible endpoint:
pip install openai
DashScope exposes an OpenAI-compatible API at:
https://dashscope.aliyuncs.com/compatible-mode/v1
That means you can reuse OpenAI-style chat completion code by changing base_url and api_key.
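The examples in this guide paste the key straight into the client call. One alternative is to read it from the environment instead. The following is a minimal sketch, assuming the key is stored in a variable named DASHSCOPE_API_KEY; both that variable name and the helper dashscope_client_kwargs are illustrative choices, not part of any SDK.

```python
import os

DASHSCOPE_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"


def dashscope_client_kwargs(env_var="DASHSCOPE_API_KEY"):
    """Return keyword arguments for an OpenAI-style client.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} before calling the API")
    return {"base_url": DASHSCOPE_BASE_URL, "api_key": api_key}
```

With the OpenAI SDK installed, the result can be unpacked into the constructor, for example OpenAI(**dashscope_client_kwargs()).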
Text input and output
Start with the simplest request: text in, text out.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)
print(response.choices[0].message.content)
Use a different model ID when needed:
model="qwen3.5-omni-plus" # better quality
model="qwen3.5-omni-flash" # balanced default
model="qwen3.5-omni-light" # lower latency
Audio input: transcription and understanding
You can send audio as a URL or as base64-encoded data. The model can transcribe and reason over the audio directly, so you do not need a separate ASR step.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)
print(response.choices[0].message.content)
Qwen3.5-Omni handles 113 languages for speech recognition and detects the language automatically.
Supported audio formats include:
- WAV
- MP3
- M4A
- OGG
- FLAC
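To avoid sending a file type outside this list, the content part can be built by a small helper that checks the extension first. This is a sketch only: the helper name audio_part is hypothetical, and using the lowercase file extension as the "format" value is an assumption extrapolated from the WAV example above.

```python
import base64
from pathlib import Path

# Formats listed in this guide; mapping the extension to "format" is an assumption.
SUPPORTED_AUDIO_FORMATS = {"wav", "mp3", "m4a", "ogg", "flac"}


def audio_part(path):
    """Build an input_audio content part from a local file."""
    fmt = Path(path).suffix.lower().lstrip(".")
    if fmt not in SUPPORTED_AUDIO_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt or 'unknown'}")
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

The returned dictionary can be placed in the content array next to a text part, in the same position as the hand-written input_audio object in the example above.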
Audio output: text-to-speech response
To receive speech in the response, set modalities and configure the audio output.
from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data

with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")
Built-in voices:
- Chelsie
- Ethan
Speech generation works in 36 languages.
Image input: visual understanding
Send an image URL with a text prompt when you want the model to inspect visual content.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)
print(response.choices[0].message.content)
For local images, encode the file as base64 and pass it as a data URL.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)
print(response.choices[0].message.content)
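The example above hardcodes image/png in the data URL, which would mislabel a JPEG or WebP file. A possible generalization, sketched here with a hypothetical helper named image_data_url, guesses the MIME type from the file name using Python's standard mimetypes module.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path):
    """Encode a local image as a data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The guess is based on the extension only, so a misnamed file still produces a wrong label; inspecting the file header would be a stricter check.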
Video input: understand recordings and screen captures
Video input lets Qwen3.5-Omni reason across visual frames and audio tracks in one request.
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)
print(response.choices[0].message.content)
Audio-Visual Vibe Coding
A common workflow is to pass a screen recording and ask the model to generate code from what it sees.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)
print(response.choices[0].message.content)
The 256K token context window fits roughly 400 seconds of 720p video with audio. For longer recordings, trim or split the video before sending it.
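Those two figures imply an average cost of about 256,000 / 400 = 640 tokens per second of 720p video with audio. The sketch below turns that into a rough segment planner. It reads "256K" as 256,000, and the prompt_tokens reserve and the 0.9 safety margin are illustrative assumptions; actual token usage depends on the content and may differ.

```python
import math

CONTEXT_TOKENS = 256_000   # "256K" read as 256,000 for this estimate
MAX_VIDEO_SECONDS = 400    # figure quoted in this guide for 720p with audio

# Implied average cost: 256000 / 400 = 640 tokens per second of video.
TOKENS_PER_SECOND = CONTEXT_TOKENS / MAX_VIDEO_SECONDS


def plan_segments(duration_s, prompt_tokens=2_000, margin=0.9):
    """Split a recording into equal segments that fit the estimated budget.

    prompt_tokens and margin are illustrative assumptions.
    Returns (segment_count, seconds_per_segment).
    """
    usable = (CONTEXT_TOKENS - prompt_tokens) * margin
    max_segment_s = usable / TOKENS_PER_SECOND
    count = max(1, math.ceil(duration_s / max_segment_s))
    return count, duration_s / count
```

With these defaults the per-request ceiling works out to roughly 357 seconds, so a 1,000-second recording is planned as three segments of about 333 seconds each.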
Voice cloning
Voice cloning lets you provide a target voice sample and have the model respond in that voice. This is available through the API on Plus and Flash.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data

with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))
For better voice cloning quality:
- Use a clean recording with minimal background noise.
- Prefer 15-30 seconds of speech.
- Use WAV at 16kHz or higher.
- Use natural speech instead of read-aloud text when possible.
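The length and sample-rate guidelines above can be checked before uploading. A minimal sketch using Python's standard wave module follows; the function name check_voice_sample is hypothetical, and the thresholds simply mirror the numbers in the list. Noise level and speaking style still need a manual listen.

```python
import wave


def check_voice_sample(path, min_seconds=15, max_seconds=30, min_rate=16_000):
    """Return a list of problems found in a WAV voice sample (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    problems = []
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if not min_seconds <= seconds <= max_seconds:
        problems.append(
            f"duration {seconds:.1f}s is outside {min_seconds}-{max_seconds}s"
        )
    return problems
```

An empty list means the sample meets the duration and sample-rate guidelines; the wave module only reads uncompressed WAV files, so other formats would need converting first.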
Streaming responses
For real-time voice chat or interactive apps, use streaming. The model can start returning audio before the full response is complete.
from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

for chunk in stream:
    # Some chunks (for example a trailing usage chunk) may carry no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Audio arrives as base64 fragments alongside the text deltas.
    if hasattr(delta, "audio") and delta.audio:
        if delta.audio.get("data"):
            audio_chunks.append(delta.audio["data"])
    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)
print()

if audio_chunks:
    # Decode each fragment separately, then concatenate the raw PCM bytes.
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)
    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)
pcm16 is useful for streaming because you can pipe chunks directly into an audio output buffer without waiting for a complete file.
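A raw .pcm file has no header, so most players cannot open it directly. The sketch below wraps saved PCM data in a WAV container with the standard wave module. The 24 kHz mono defaults are an assumption borrowed from the sample rate used in the local deployment example in this guide; confirm the actual output format before relying on them.

```python
import wave


def pcm16_to_wav(pcm_path, wav_path, sample_rate=24_000, channels=1):
    """Wrap raw 16-bit PCM in a WAV container so ordinary players can open it.

    sample_rate and channels are assumptions; confirm them for your output.
    Returns the clip duration in seconds.
    """
    with open(pcm_path, "rb") as src:
        pcm = src.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return len(pcm) / (2 * channels * sample_rate)
```

If the converted file plays too fast or too slow, the assumed sample rate is the first thing to check.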
Multi-turn conversation with mixed modalities
You can keep conversation state across turns and mix modalities as needed.
from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})
    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )
    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: text
print(send_message([
    {
        "type": "text",
        "text": "I have an API that keeps returning 503 errors."
    }
]))

# Turn 2: image + text
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{img}"
        }
    },
    {
        "type": "text",
        "text": "Here's the error log screenshot. What's causing this?"
    }
]))

# Turn 3: follow-up text
print(send_message([
    {
        "type": "text",
        "text": "How do I fix the connection pool exhaustion you mentioned?"
    }
]))
The 256K context window supports long conversations, including conversations with images and audio in the history.
Local deployment with HuggingFace
To run Qwen3.5-Omni on your own infrastructure, install the required packages.
pip install transformers==4.57.3
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
Then load the model and processor.
import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio": "path/to/your/audio.wav"
            },
            {
                "type": "text",
                "text": "What is being discussed in this audio?"
            }
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
audios, images, videos = process_mm_info(
    conversation,
    use_audio_in_video=True
)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(
    **inputs,
    speaker="Chelsie"
)
text_response = processor.batch_decode(
    text_ids,
    skip_special_tokens=True
)[0]

sf.write(
    "local_response.wav",
    audio_output.reshape(-1).cpu().numpy(),
    samplerate=24000
)
print(text_response)
Approximate GPU memory requirements:
| Variant | Precision | Min VRAM |
|---|---|---|
| Plus / 30B MoE | BF16 | ~40GB |
| Flash | BF16 | ~20GB |
| Light | BF16 | ~10GB |
For production local inference, use vLLM instead of HuggingFace Transformers. MoE models run faster with vLLM routing optimizations.
Testing Qwen3.5-Omni requests with Apidog
Multimodal API requests are harder to debug than plain JSON. You may need to inspect base64-encoded media, nested content arrays, and responses that include both text and audio.
A practical workflow in Apidog:
- Create a new collection for DashScope.
- Set the base URL to:
https://dashscope.aliyuncs.com/compatible-mode/v1
- Store your DashScope API key as an environment variable.
- Create request templates for each modality:
  - Text
  - Audio input
  - Audio output
  - Image input
  - Video input
- Duplicate the base request for Plus, Flash, and Light.
- Change only the model parameter to compare output quality and latency.
You can also add test assertions for response validation:
- Verify choices[0].message.content is not empty for text responses.
- Verify choices[0].message.audio.data exists when audio output is requested.
- Assert that Flash latency stays under your target threshold.
This helps you choose the right model variant before building the full application.
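The same content and audio checks can also run in a plain Python test suite against the parsed JSON body. This is a minimal sketch, not Apidog's own assertion syntax: the function name validate_response is hypothetical, and it assumes the response has already been converted to a dictionary (for example with response.model_dump() in the OpenAI SDK).

```python
def validate_response(body, expect_audio=False):
    """Return a list of assertion failures for a chat completion response dict."""
    failures = []
    choices = body.get("choices") or []
    if not choices:
        return ["choices is empty"]
    message = choices[0].get("message") or {}
    # Text requests must carry non-empty content.
    if not expect_audio and not message.get("content"):
        failures.append("choices[0].message.content is empty")
    # Audio requests must carry base64 audio data.
    if expect_audio and not (message.get("audio") or {}).get("data"):
        failures.append("choices[0].message.audio.data is missing")
    return failures
```

An empty list means the response passed; latency thresholds are left out here because they depend on how the request is timed.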
Error handling and retry logic
Large multimodal requests can hit rate limits, connection errors, or timeouts. Add retries from the start.
import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # no attempts left, so skip the final sleep
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Waiting {wait:.1f}s...")
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Connection error: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts")
For video inputs larger than 100MB:
- Trim to the relevant section.
- Reduce resolution to 480p when high visual detail is not required.
- Split long recordings into segments and aggregate the results in your app.
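One way to do the trimming and downscaling is with ffmpeg, which this guide does not otherwise cover, so treat the following as an assumption about tooling. The sketch only builds the argument list (the helper name ffmpeg_trim_args is hypothetical) and does not run anything; the 360-second default matches the roughly 6-minute margin suggested elsewhere in this guide.

```python
def ffmpeg_trim_args(src, dst, start_s=0, duration_s=360, height=480):
    """Build an ffmpeg argument list that trims a clip and scales it down.

    The command is only constructed here, not executed.
    """
    return [
        "ffmpeg",
        "-ss", str(start_s),          # seek to the start of the relevant section
        "-i", src,
        "-t", str(duration_s),        # keep this many seconds
        "-vf", f"scale=-2:{height}",  # target height; width keeps the aspect ratio
        dst,
    ]
```

With ffmpeg installed, the list can be passed to subprocess.run(..., check=True). Calling the helper repeatedly with increasing start_s values gives the segments for the split-and-aggregate approach.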
Common issues and fixes
Audio output is garbled on numbers or technical terms
This is the problem ARIA technology addresses. Make sure you are using Qwen3.5-Omni, not an earlier version. If you self-host, use the latest model weights from HuggingFace.
The model keeps talking when I send an audio interruption
Semantic interruption requires Flash or Plus. Light may not support this feature. Also make sure you are streaming the response instead of using a batch request.
Voice cloning quality is poor
Use a cleaner voice sample. Remove background noise with a tool such as Audacity before uploading. Use at least 15 seconds of audio. WAV at 16kHz or 44.1kHz works best.
Video input returns a token limit error
The 256K token context covers roughly 400 seconds of 720p video. For longer videos, trim the clip or lower the resolution. Keep videos under about 6 minutes for a safer margin.
Local deployment is very slow
Use vLLM instead of HuggingFace Transformers for production local inference. MoE models need vLLM routing optimizations for better throughput.
For developers whose primary need is high-fidelity speech synthesis rather than full multimodal inference, Fish Audio S2's dedicated TTS and voice-cloning API offers a narrower, faster endpoint worth evaluating alongside all-in-one multimodal models.
FAQ
Which DashScope model ID should I use for Qwen3.5-Omni?
Use one of these:
qwen3.5-omni-plus
qwen3.5-omni-flash
qwen3.5-omni-light
Start with qwen3.5-omni-flash for most use cases.
Can I use the OpenAI Python SDK with DashScope?
Yes. Configure the client like this:
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)
The request and response format follows the OpenAI-compatible API style.
How do I send multiple files, such as audio and image, in one request?
Put each file in the content array as a separate typed object.
content = [
    {
        "type": "input_audio",
        "input_audio": {
            "data": audio_data,
            "format": "wav"
        }
    },
    {
        "type": "image_url",
        "image_url": {
            "url": image_url
        }
    },
    {
        "type": "text",
        "text": "Use the audio and image to explain what is happening."
    }
]
All four modalities can appear in the same message.
Is there a size limit for audio or video files?
DashScope has per-request payload limits. For large files, use a URL reference instead of base64 encoding. Host the file on accessible storage and pass the URL in the audio or video_url field.
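Base64 encoding inflates a file by about one third (every 3 raw bytes become 4 characters), which matters when deciding whether to inline a file. The sketch below shows the arithmetic. The 10 MiB threshold is a hypothetical placeholder, not a documented DashScope limit, so check the current documentation for the real value.

```python
import math

# Hypothetical threshold for this sketch; check the current DashScope limits.
INLINE_LIMIT_BYTES = 10 * 1024 * 1024


def base64_size(raw_bytes):
    """Size in bytes of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def should_inline(raw_bytes, limit=INLINE_LIMIT_BYTES):
    """True when the encoded payload stays within the assumed inline limit."""
    return base64_size(raw_bytes) <= limit
```

Under this assumed limit, a 9 MiB file already encodes to 12 MiB and would be better served by a URL reference.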
How do I disable audio output and get text only?
Omit modalities, or set it to text only:
modalities=["text"]
Text-only responses are faster and cheaper.
Does Qwen3.5-Omni support function calling?
Yes. Use the standard tools parameter with function definitions, the same way you would with other OpenAI-compatible models. The model returns structured tool call objects that your application executes.
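A sketch of what that looks like on the application side follows. The tools list uses the OpenAI-compatible function schema; get_api_status is a made-up function for illustration, and run_tool_call assumes each tool call has already been converted to a plain dictionary (for example with model_dump() in the OpenAI SDK).

```python
import json

# Tool definition in the OpenAI-compatible "tools" format.
# get_api_status is a made-up function for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_api_status",
            "description": "Return the current health of a named service.",
            "parameters": {
                "type": "object",
                "properties": {"service": {"type": "string"}},
                "required": ["service"],
            },
        },
    }
]


def get_api_status(service):
    # Stub result; a real implementation would query monitoring.
    return {"service": service, "status": "ok"}


LOCAL_FUNCTIONS = {"get_api_status": get_api_status}


def run_tool_call(tool_call):
    """Execute one tool call and return the matching "tool" role message."""
    fn = LOCAL_FUNCTIONS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(fn(**args)),
    }
```

The tools list is passed as tools=tools in the chat completion request. The message returned by run_tool_call is appended to the conversation after the assistant message that requested the call, and the request is sent again so the model can use the result.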
What is the best way to handle long audio recordings?
For recordings under 10 hours, send them as a single request. For longer recordings, split at natural pause points, process each segment, and aggregate the results in your application layer.
How do I test multimodal requests before building a full application?
Use Apidog to build saved request templates for each modality, switch between model variants, inspect the full response structure, and write assertions before integrating the API into your application.
