Preecha

Posted on Jun 24

How to Use Qwen3.5-Omni: Text, Audio, Video, and Voice Cloning via API

TL;DR

Qwen3.5-Omni accepts text, images, audio, and video as input and returns text or real-time speech. You can access it through the Alibaba Cloud DashScope API or run it locally with HuggingFace Transformers. This guide walks through API setup, working examples for each modality, voice cloning, streaming, local deployment, retries, and testing with Apidog.


What you’re working with

Qwen3.5-Omni is a single multimodal model that handles four input types in one request:

Text
Images
Audio
Video

It can return either text or natural speech, depending on your request configuration.

Released March 30, 2026, Qwen3.5-Omni uses a Thinker-Talker architecture with an MoE backbone:

Thinker: processes multimodal input and performs reasoning.
Talker: converts output into speech using a multi-codebook system that can start streaming audio before the full response is complete.

Available variants:

Variant	Best for
Plus	Highest quality, reasoning, voice cloning
Flash	Balanced speed and quality for most production use
Light	Lowest latency for mobile and edge scenarios

Most examples below use qwen3.5-omni-flash. Use qwen3.5-omni-plus when quality matters most, and qwen3.5-omni-light when latency is the main constraint.

API access via DashScope

Alibaba Cloud DashScope is the primary hosted API for Qwen3.5-Omni. You need:

A DashScope account
A DashScope API key
Either the DashScope SDK or the OpenAI-compatible API endpoint

Step 1: Create a DashScope account

Go to dashscope.aliyuncs.com and sign up. If you already have an Alibaba Cloud account, use that account.

Step 2: Create an API key

In the DashScope console:

Open API Key Management.
Click Create API Key.
Copy the key.

The key starts with:

sk-...

Step 3: Install an SDK

Install the DashScope SDK:

pip install dashscope

Or use the OpenAI Python SDK with DashScope’s OpenAI-compatible endpoint:

pip install openai

DashScope exposes an OpenAI-compatible API at:

https://dashscope.aliyuncs.com/compatible-mode/v1

That means you can reuse OpenAI-style chat completion code by changing base_url and api_key.
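The examples in this guide hard-code the key for brevity. For anything you commit or share, reading it from the environment is usually safer. The sketch below is one way to do that; the `DASHSCOPE_API_KEY` variable name and the `dashscope_client_kwargs` helper are assumptions made for illustration, not part of any SDK.

```python
import os

DASHSCOPE_BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1"


def dashscope_client_kwargs(env_var: str = "DASHSCOPE_API_KEY") -> dict:
    """Build keyword arguments for OpenAI(...) from an environment variable.

    The variable name is an assumption; use whatever name your deployment sets.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} before creating the client.")
    return {"base_url": DASHSCOPE_BASE_URL, "api_key": api_key}


# Usage (requires `pip install openai`):
# from openai import OpenAI
# client = OpenAI(**dashscope_client_kwargs())
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first request.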

Text input and output

Start with the simplest request: text in, text out.

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)

print(response.choices[0].message.content)

Use a different model ID when needed:

model="qwen3.5-omni-plus"   # better quality
model="qwen3.5-omni-flash"  # balanced default
model="qwen3.5-omni-light"  # lower latency

Audio input: transcription and understanding

You can send audio as a URL or as base64-encoded data. The model can transcribe and reason over the audio directly, so you do not need a separate ASR step.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Qwen3.5-Omni handles 113 languages for speech recognition and detects the language automatically.

Supported audio formats include:

WAV
MP3
M4A
OGG
FLAC

Audio output: text-to-speech response

To receive speech in the response, set modalities and configure the audio output.

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data

with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")

Built-in voices:

Chelsie
Ethan

Speech generation works in 36 languages.

Image input: visual understanding

Send an image URL with a text prompt when you want the model to inspect visual content.

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

For local images, encode the file as base64 and pass it as a data URL.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)
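The example above hard-codes `image/png` in the data URL. When the input might be a JPEG or another format, the MIME type can be inferred from the file name instead. A minimal sketch using only the standard library follows; `image_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path: str) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Pass the result as the `url` value of an `image_url` part. The check relies on the extension only, so a mislabeled file is not detected.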

Video input: understand recordings and screen captures

Video input lets Qwen3.5-Omni reason across visual frames and audio tracks in one request.

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Audio-Visual Vibe Coding

A common workflow is to pass a screen recording and ask the model to generate code from what it sees.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

The 256K token context window fits roughly 400 seconds of 720p video with audio. For longer recordings, trim or split the video before sending it.
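To plan the split, you can compute segment boundaries first and then cut the file with whatever video tool you already use. The sketch below is plain arithmetic; `plan_segments` is a hypothetical helper, and the 360-second default is an assumption chosen to leave a margin under the roughly 400-second figure above. The real budget depends on resolution and audio content.

```python
from typing import List, Tuple


def plan_segments(duration_s: float, max_segment_s: float = 360.0) -> List[Tuple[float, float]]:
    """Return (start, end) pairs in seconds that cover the whole recording."""
    if max_segment_s <= 0:
        raise ValueError("max_segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


for start, end in plan_segments(1000):
    print(f"{start:.0f}s -> {end:.0f}s")
```

Each segment is then sent as its own request, and your application combines the per-segment answers.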

Voice cloning

Voice cloning lets you provide a target voice sample and have the model respond in that voice. This is available through the API on Plus and Flash.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data

with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

For better voice cloning quality:

Use a clean recording with minimal background noise.
Prefer 15-30 seconds of speech.
Use WAV at 16kHz or higher.
Use natural speech instead of read-aloud text when possible.
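The duration and sample-rate guidelines above can be checked automatically before a sample is uploaded. The following sketch uses the standard-library `wave` module, so it only reads uncompressed PCM WAV files; `check_voice_sample` is a hypothetical helper, and its thresholds simply mirror the list above.

```python
import wave


def check_voice_sample(path: str, min_seconds: float = 15.0,
                       max_seconds: float = 30.0, min_rate_hz: int = 16000) -> list:
    """Return a list of problems found in a WAV voice sample (empty if none)."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        duration = wav_file.getnframes() / float(rate)

    problems = []
    if rate < min_rate_hz:
        problems.append(f"sample rate {rate} Hz is below {min_rate_hz} Hz")
    if duration < min_seconds:
        problems.append(f"duration {duration:.1f}s is shorter than {min_seconds:.0f}s")
    elif duration > max_seconds:
        problems.append(f"duration {duration:.1f}s is longer than {max_seconds:.0f}s")
    return problems
```

Background noise and speaking style cannot be checked this way; those still need a listen.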

Streaming responses

For real-time voice chat or interactive apps, use streaming. The model can start returning audio before the full response is complete.

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

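for chunk in stream:
    # Some chunks (for example, a final usage chunk) may carry no choices.
    if not chunk.choices:
        continue

    delta = chunk.choices[0].delta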

    if hasattr(delta, "audio") and delta.audio:
        if delta.audio.get("data"):
            audio_chunks.append(delta.audio["data"])

    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)

print()

if audio_chunks:
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)

    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)

pcm16 is useful for streaming because you can pipe chunks directly into an audio output buffer without waiting for a complete file.
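A raw `.pcm` file has no header, so most players will not open it. If you want to listen to the saved stream, you can wrap the bytes in a WAV container with the standard-library `wave` module. The sketch below assumes 24 kHz mono 16-bit audio; that sample rate matches what the local-deployment example in this guide writes, but it is an assumption for the hosted API, so confirm it for your setup. `pcm16_to_wav` is a hypothetical helper name.

```python
import wave


def pcm16_to_wav(pcm_bytes: bytes, wav_path: str,
                 sample_rate: int = 24000, channels: int = 1) -> None:
    """Write raw 16-bit PCM bytes to a WAV file with the given parameters."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)


# Usage with the bytes collected from the stream:
# pcm16_to_wav(full_audio, "streamed_response.wav")
```

If playback sounds too fast or too slow, the assumed sample rate is the first thing to check.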

Multi-turn conversation with mixed modalities

You can keep conversation state across turns and mix modalities as needed.

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})

    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})

    return reply

# Turn 1: text
print(send_message([
    {
        "type": "text",
        "text": "I have an API that keeps returning 503 errors."
    }
]))

# Turn 2: image + text
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{img}"
        }
    },
    {
        "type": "text",
        "text": "Here's the error log screenshot. What's causing this?"
    }
]))

# Turn 3: follow-up text
print(send_message([
    {
        "type": "text",
        "text": "How do I fix the connection pool exhaustion you mentioned?"
    }
]))

The 256K context window supports long conversations, including conversations with images and audio in the history.
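Because the full history is resent on every turn, base64 media from early turns is uploaded again each time. If that becomes a cost or latency concern, one option is to replace media in older messages with a short text placeholder. This is a sketch of that trade-off, not a recommendation from the API provider: `strip_old_media` is a hypothetical helper, and the model can no longer inspect any media it removes.

```python
import copy

MEDIA_TYPES = {"image_url", "video_url", "input_audio"}


def strip_old_media(conversation: list, keep_last: int = 2) -> list:
    """Return a copy of the history with media removed from older messages."""
    trimmed = copy.deepcopy(conversation)
    cutoff = max(len(trimmed) - keep_last, 0)
    for message in trimmed[:cutoff]:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no media parts
        message["content"] = [
            {"type": "text", "text": f"[{part['type']} omitted]"}
            if part.get("type") in MEDIA_TYPES else part
            for part in content
        ]
    return trimmed
```

You would pass `strip_old_media(conversation)` as `messages` while keeping the untouched `conversation` list as the record of the session.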

Local deployment with HuggingFace

To run Qwen3.5-Omni on your own infrastructure, install the required packages.

pip install transformers==4.57.3
pip install accelerate
pip install soundfile
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

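Then load the model and processor. The example uses the Qwen/Qwen3-Omni-30B-A3B-Instruct checkpoint; swap in the model ID for the variant you want to run.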

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio": "path/to/your/audio.wav"
            },
            {
                "type": "text",
                "text": "What is being discussed in this audio?"
            }
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)

audios, images, videos = process_mm_info(
    conversation,
    use_audio_in_video=True
)

inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)

inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(
    **inputs,
    speaker="Chelsie"
)

text_response = processor.batch_decode(
    text_ids,
    skip_special_tokens=True
)[0]

sf.write(
    "local_response.wav",
    audio_output.reshape(-1).cpu().numpy(),
    samplerate=24000
)

print(text_response)

Approximate GPU memory requirements:

Variant	Precision	Min VRAM
Plus / 30B MoE	BF16	~40GB
Flash	BF16	~20GB
Light	BF16	~10GB

For production local inference, use vLLM instead of HuggingFace Transformers. MoE models run faster with vLLM routing optimizations.

Testing Qwen3.5-Omni requests with Apidog

Multimodal API requests are harder to debug than plain JSON. You may need to inspect base64-encoded media, nested content arrays, and responses that include both text and audio.

A practical workflow in Apidog:

Create a new collection for DashScope.
Set the base URL to: https://dashscope.aliyuncs.com/compatible-mode/v1
Store your DashScope API key as an environment variable.
Create request templates for each modality: text, audio input, audio output, image input, and video input.
Duplicate the base request for Plus, Flash, and Light.
Change only the model parameter to compare output quality and latency.

You can also add test assertions for response validation:

Verify choices[0].message.content is not empty for text responses.
Verify choices[0].message.audio.data exists when audio output is requested.
Assert that Flash latency stays under your target threshold.

This helps you choose the right model variant before building the full application.
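The same three checks can also run in a Python test suite against the parsed JSON response body. The sketch below is an illustration in Python rather than an Apidog script; `check_response` and its parameters are hypothetical names.

```python
from typing import List, Optional


def check_response(body: dict, expect_audio: bool = False,
                   latency_s: Optional[float] = None,
                   max_latency_s: Optional[float] = None) -> List[str]:
    """Return a list of failed checks for one chat completion response body."""
    failures = []
    choices = body.get("choices") or []
    message = choices[0].get("message", {}) if choices else {}

    # Text responses should carry non-empty content.
    if not expect_audio and not message.get("content"):
        failures.append("choices[0].message.content is empty")

    # Audio responses should carry base64 audio data.
    if expect_audio and not (message.get("audio") or {}).get("data"):
        failures.append("choices[0].message.audio.data is missing")

    # Optional latency budget, measured by the caller.
    if latency_s is not None and max_latency_s is not None and latency_s > max_latency_s:
        failures.append(f"latency {latency_s:.2f}s exceeds {max_latency_s:.2f}s")

    return failures
```

An empty list means every check passed, which keeps the function easy to assert on in a test runner.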

Error handling and retry logic

Large multimodal requests can hit rate limits, connection errors, or timeouts. Add retries from the start.

import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    """Call the API, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )

        except (RateLimitError, APITimeoutError, APIConnectionError) as e:
            # Out of attempts: surface the original error instead of sleeping again.
            if attempt == max_retries - 1:
                raise

            # Backoff of 1s, 2s, 4s, ... plus up to 1s of random jitter.
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"{type(e).__name__}: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)

    raise RuntimeError(f"Failed after {max_retries} attempts")

For video inputs larger than 100MB:

Trim to the relevant section.
Reduce resolution to 480p when high visual detail is not required.
Split long recordings into segments and aggregate the results in your app.

Common issues and fixes

Audio output is garbled on numbers or technical terms

This is the problem ARIA technology addresses. Make sure you are using Qwen3.5-Omni, not an earlier version. If you self-host, use the latest model weights from HuggingFace.

The model keeps talking when I send an audio interruption

Semantic interruption requires Flash or Plus. Light may not support this feature. Also make sure you are streaming the response instead of using a batch request.

Voice cloning quality is poor

Use a cleaner voice sample. Remove background noise with a tool such as Audacity before uploading. Use at least 15 seconds of audio. WAV at 16kHz or 44.1kHz works best.

Video input returns a token limit error

The 256K token context covers roughly 400 seconds of 720p video. For longer videos, trim the clip or lower the resolution. Keep videos under about 6 minutes for a safer margin.

Local deployment is very slow

Use vLLM instead of HuggingFace Transformers for production local inference. MoE models need vLLM routing optimizations for better throughput.

For developers whose primary need is high-fidelity speech synthesis rather than full multimodal inference, Fish Audio S2's dedicated TTS and voice-cloning API offers a narrower, faster endpoint worth evaluating alongside all-in-one multimodal models.

FAQ

Which DashScope model ID should I use for Qwen3.5-Omni?

Use one of these:

qwen3.5-omni-plus
qwen3.5-omni-flash
qwen3.5-omni-light

Start with qwen3.5-omni-flash for most use cases.

Can I use the OpenAI Python SDK with DashScope?

Yes. Configure the client like this:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

The request and response format follows the OpenAI-compatible API style.

How do I send multiple files, such as audio and image, in one request?

Put each file in the content array as a separate typed object.

content = [
    {
        "type": "input_audio",
        "input_audio": {
            "data": audio_data,
            "format": "wav"
        }
    },
    {
        "type": "image_url",
        "image_url": {
            "url": image_url
        }
    },
    {
        "type": "text",
        "text": "Use the audio and image to explain what is happening."
    }
]

All four modalities can appear in the same message.

Is there a size limit for audio or video files?

DashScope has per-request payload limits. For large files, use a URL reference instead of base64 encoding. Host the file on storage the API can reach and pass the URL in the matching content part: input_audio for audio, video_url for video.
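Base64 encoding grows a file by about one third, so a file that looks small on disk can still exceed a payload limit once encoded. The sketch below estimates the encoded size before you decide between inline data and a URL. Both helper names are hypothetical, and `max_inline_bytes` is deliberately a required argument because this guide does not state the actual limit; take it from your account's documentation.

```python
import os


def base64_size(raw_bytes: int) -> int:
    """Size in bytes of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def should_inline(path: str, max_inline_bytes: int) -> bool:
    """True if the base64-encoded file fits within the caller's payload budget."""
    return base64_size(os.path.getsize(path)) <= max_inline_bytes
```

The estimate covers the media only; the surrounding JSON and any other parts in the same request add to the total.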

How do I disable audio output and get text only?

Omit modalities, or set it to text only:

modalities=["text"]

Text-only responses are faster and cheaper.

Does Qwen3.5-Omni support function calling?

Yes. Use the standard tools parameter with function definitions, the same way you would with other OpenAI-compatible models. The model returns structured tool call objects that your application executes.
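As a rough illustration of that flow, the sketch below defines one tool in the OpenAI-style schema and executes a tool call shaped like the JSON the API returns. `get_api_status`, `TOOLS`, `REGISTRY`, and `run_tool_call` are hypothetical names invented for this example, and the tool call is handled as a plain dictionary rather than an SDK object.

```python
import json


def get_api_status(service: str) -> dict:
    """Hypothetical local function the model may ask to call."""
    return {"service": service, "status": "ok"}


# Tool definition in the OpenAI-compatible schema; pass it as tools=TOOLS.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_api_status",
            "description": "Look up the current status of a named service.",
            "parameters": {
                "type": "object",
                "properties": {"service": {"type": "string"}},
                "required": ["service"],
            },
        },
    }
]

REGISTRY = {"get_api_status": get_api_status}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the tool message to send back."""
    function = tool_call["function"]
    handler = REGISTRY[function["name"]]
    arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
    result = handler(**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

After appending the assistant message and the returned tool message to the conversation, a second request lets the model turn the result into a final answer.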

What is the best way to handle long audio recordings?

For recordings under 10 hours, send them as a single request. For longer recordings, split at natural pause points, process each segment, and aggregate the results in your application layer.

How do I test multimodal requests before building a full application?

Use Apidog to build saved request templates for each modality, switch between model variants, inspect the full response structure, and write assertions before integrating the API into your application.
