Wanda

Posted on • Originally published at apidog.com

How to Use Qwen3.5-Omni: Text, Audio, Video, and Voice Cloning via API

TL;DR

Qwen3.5-Omni supports text, images, audio, and video as input, returning either text or real-time speech. You can access it via the Alibaba Cloud DashScope API or run it locally using HuggingFace Transformers. This article gives you practical setup, working code examples for each modality, voice cloning steps, and shows you how to test requests with Apidog.


What you’re working with

Qwen3.5-Omni is a single model that processes text, images, audio, and video inputs simultaneously, returning either text or natural speech depending on your configuration.

Qwen3.5-Omni Modalities

Launched March 30, 2026, Qwen3.5-Omni uses a Thinker-Talker architecture with a MoE backbone. The Thinker handles multimodal reasoning. The Talker generates speech, streaming audio output before the full response is available.

Variants:

  • Plus: Highest quality; best for complex reasoning and voice cloning
  • Flash: Balanced speed and quality; recommended for most production use
  • Light: Lowest latency; suited to mobile and edge deployments

This guide uses Flash. For max quality, use Plus.

API access via DashScope

Alibaba Cloud DashScope is the main way to use Qwen3.5-Omni. Start by getting an account and API key.

Step 1: Create a DashScope account

Sign up at dashscope.aliyuncs.com.

Step 2: Get your API key

  1. Log in to the DashScope console.
  2. In the left sidebar, select API Key Management.
  3. Click Create API Key.
  4. Copy your API key (format: sk-...).

Step 3: Install the SDK

pip install dashscope

Or, for OpenAI-compatible usage:

pip install openai

DashScope's OpenAI-compatible endpoint is https://dashscope.aliyuncs.com/compatible-mode/v1, so you can use your existing OpenAI code with a different base_url.

Text input and output

Send text and get text back:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)

print(response.choices[0].message.content)

Swap the model to qwen3.5-omni-plus for more complex tasks, or to qwen3.5-omni-light for low latency.


Audio input: transcription and understanding

Send an audio file (URL or base64). The model transcribes and reasons over audio directly.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

  • Handles 113 languages (auto-detected).
  • Supported formats: WAV, MP3, M4A, OGG, FLAC.

Audio output: text-to-speech in the response

To get speech output, set the modalities parameter and configure audio:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

# Access text and audio
text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data  # base64-encoded WAV

with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")

Available voices: Chelsie (female) and Ethan (male). Speech output covers 36 languages.

Image input: visual understanding

Send an image (URL or base64) with a text question:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

For local images, use base64:

import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)

Video input: understanding recordings and screen captures

Pass video input for multimodal reasoning:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Audio-Visual Vibe Coding

Generate code from a screen recording:

with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # Use Plus for best code generation quality
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

  • 256K token context = ~400 seconds of 720p video.
  • For longer videos: trim or split.
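If you need to split programmatically, a rough token budget helps. The sketch below turns the ~400 s ≈ 256K-token figure above into a chunk planner; the tokens-per-second ratio is a coarse approximation derived from that figure, not an official quota:

```python
import math

# Coarse estimate from the figure above: 256K tokens ≈ 400 s of 720p video,
# i.e. roughly 640 tokens per second. Not an official DashScope quota.
TOKENS_PER_SECOND_720P = 256_000 / 400

def plan_video_chunks(duration_s, token_budget=200_000):
    """Split duration_s seconds of 720p video into (start, end) windows
    that each stay under token_budget tokens, leaving prompt headroom."""
    max_chunk_s = token_budget / TOKENS_PER_SECOND_720P
    n_chunks = math.ceil(duration_s / max_chunk_s)
    chunk_s = duration_s / n_chunks
    return [
        (round(i * chunk_s, 1), round(min((i + 1) * chunk_s, duration_s), 1))
        for i in range(n_chunks)
    ]

# A 15-minute screen recording becomes three ~5-minute requests.
print(plan_video_chunks(900))
```

Send each window as its own request (for example, after trimming with ffmpeg) and merge the per-chunk answers yourself.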

Voice cloning

Clone a speaker's voice for model responses:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data
with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

Tips for best results:

  • Use a clean, noise-free recording
  • Provide a 15–30 second sample
  • Use WAV at 16 kHz or higher
  • Record natural speech, not a monotone reading
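Before uploading, you can sanity-check a sample against these tips with Python's standard-library wave module. The thresholds below mirror the list above; they are guidelines, not limits enforced by the API:

```python
import wave

def check_voice_sample(path):
    """Flag common voice-cloning sample problems.
    Thresholds follow the tips above (guidelines, not API limits)."""
    issues = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if rate < 16_000:
        issues.append(f"sample rate {rate} Hz is below 16 kHz")
    if duration < 15:
        issues.append(f"duration {duration:.1f}s is under 15 seconds")
    elif duration > 30:
        issues.append(f"duration {duration:.1f}s is over 30 seconds; consider trimming")
    return issues

# Example: check_voice_sample("voice_sample.wav") returns an empty list
# for a good sample, or a list of human-readable warnings otherwise.
```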

Streaming responses

For real-time voice or chat, use streaming:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "audio") and delta.audio:
        if delta.audio.get("data"):
            audio_chunks.append(delta.audio["data"])
    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)

print()  # newline after streaming text

# Combine and save audio chunks
if audio_chunks:
    import base64
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)
    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)

PCM16 format is ideal for real-time playback.
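Note that the saved .pcm file is headerless, so most players won't open it directly. A small helper can wrap it in a WAV container; 24 kHz mono 16-bit is an assumption here, so verify the actual output format in the DashScope docs:

```python
import wave

def pcm16_to_wav(pcm_path, wav_path, sample_rate=24_000):
    """Wrap raw little-endian PCM16 bytes in a WAV header so standard
    players can open the file. The 24 kHz mono default is an assumption;
    check the DashScope docs for the model's real output format."""
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)

# pcm16_to_wav("streamed_response.pcm", "streamed_response.wav")
```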

Multi-turn conversation with mixed modalities

Handle conversation history with different input types:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})

    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: text
print(send_message([{"type": "text", "text": "I have an API that keeps returning 503 errors."}]))

# Turn 2: add an image (error log screenshot)
import base64
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    {"type": "text", "text": "Here's the error log screenshot. What's causing this?"}
]))

# Turn 3: follow-up text
print(send_message([{"type": "text", "text": "How do I fix the connection pool exhaustion you mentioned?"}]))

The 256K context window allows for long, mixed-modality conversations.
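Even at 256K tokens, history grows without bound, so long-running sessions eventually need pruning. One simple policy, sketched below (my own convention, not something the API requires), keeps the first message for grounding plus the most recent turns:

```python
def trim_history(conversation, max_messages=20):
    """Return a pruned copy of the conversation: keep the first message
    (often the system prompt or grounding turn) plus the most recent
    max_messages - 1 turns. A heuristic, not an API requirement."""
    if len(conversation) <= max_messages:
        return list(conversation)
    return [conversation[0]] + conversation[-(max_messages - 1):]

# Example with placeholder turns:
history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
print(len(trim_history(history)))  # 20
```

For multimodal history you may want to drop base64 image/audio payloads from old turns first, since they dominate the token count.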

Local deployment with HuggingFace

To run Qwen3.5-Omni locally:

pip install transformers==4.57.3
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",  # load in the checkpoint's native precision; flash_attention_2 requires half precision
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "path/to/your/audio.wav"},
            {"type": "text", "text": "What is being discussed in this audio?"}
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(**inputs, speaker="Chelsie")

text_response = processor.batch_decode(text_ids, skip_special_tokens=True)[0]
sf.write("local_response.wav", audio_output.reshape(-1).cpu().numpy(), samplerate=24000)

print(text_response)

GPU requirements:

Variant          Precision   Min VRAM
Plus (30B MoE)   BF16        ~40GB
Flash            BF16        ~20GB
Light            BF16        ~10GB

For production inference, use vLLM instead of HuggingFace Transformers for better performance.

Testing your Qwen3.5-Omni requests with Apidog

Multimodal API requests involve complex payloads (base64 audio, nested arrays, etc.). Apidog streamlines this:


  • Add your DashScope endpoint as a new collection.
  • Store API keys as environment variables.
  • Build and save request templates for each modality.

For each variant (Plus, Flash, Light), duplicate the request and change the model parameter. Run all and compare responses, latency, and output quality in one place.

You can also write test assertions in Apidog:

  • Check choices[0].message.content for text
  • Verify choices[0].message.audio.data for audio output
  • Assert that Flash latency is under your target

This is especially useful for variant selection.

Error handling and retry logic

Large multimodal requests (especially video) can hit rate limits or timeouts. Implement retry logic:

import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,  # 2-minute timeout for large video inputs
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Waiting {wait:.1f}s...")
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Connection error: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts")

For large video:

  • Trim to relevant section
  • Reduce resolution to 480p if possible
  • Split long recordings and aggregate responses

Common issues and fixes

  • Audio output is garbled on numbers/terms: Use Qwen3.5-Omni, not an earlier version. For local deployment, use the newest HuggingFace weights.
  • Model doesn’t interrupt on new audio: Use Flash or Plus, and stream the response.
  • Poor voice cloning: Clean, noise-free sample; at least 15 seconds; WAV at 16kHz+; use natural speech.
  • Video input exceeds token limits: 256K tokens ≈ 400s of 720p video. Trim or reduce resolution.
  • Local deployment is slow: Use vLLM, not Transformers, for production.

FAQ

Which DashScope model ID do I use for Qwen3.5-Omni?

Use qwen3.5-omni-plus, qwen3.5-omni-flash, or qwen3.5-omni-light based on your needs. Start with Flash.

Can I use the OpenAI Python SDK with DashScope?

Yes. Set base_url="https://dashscope.aliyuncs.com/compatible-mode/v1" and use your DashScope api_key.

How do I send multiple files (audio + image) in one request?

Include both as separate objects in the content array, with your text prompt.
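As a concrete sketch, one user turn can carry an audio clip, an image, and a text prompt together. The URL and prompt below are placeholders; the part types match the earlier examples in this article:

```python
# One user turn combining audio, an image, and a text prompt.
# The data/URL values are placeholders; the part types follow the
# content-array pattern used throughout this article.
message = {
    "role": "user",
    "content": [
        {"type": "input_audio",
         "input_audio": {"data": "<base64-encoded audio>", "format": "wav"}},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/dashboard.png"}},
        {"type": "text",
         "text": "Does the narration in this audio match what the dashboard shows?"},
    ],
}
print(len(message["content"]))  # 3 parts in a single request
```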

Is there a size limit for audio or video files?

DashScope has payload limits. For large files, use a URL reference rather than base64. Host the file and pass its URL.

How do I disable audio output and get text only?

Set modalities=["text"] or omit modalities.

Does it support function/tool calling?

Yes. Use the tools parameter with function definitions, just like OpenAI models.
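For example, a tool definition in the standard OpenAI function-calling schema might look like this (get_api_status is a hypothetical function invented for illustration):

```python
# Hypothetical tool definition in the standard OpenAI function-calling
# schema, which DashScope's compatible mode accepts via the tools parameter.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_api_status",  # hypothetical function name
            "description": "Check the current status of a named API endpoint.",
            "parameters": {
                "type": "object",
                "properties": {
                    "endpoint": {
                        "type": "string",
                        "description": "Endpoint path, e.g. /v1/users",
                    },
                },
                "required": ["endpoint"],
            },
        },
    },
]
print(tools[0]["function"]["name"])  # get_api_status
```

Then call client.chat.completions.create(..., tools=tools) and inspect response.choices[0].message.tool_calls, just as you would with an OpenAI model.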

What’s the best way to handle long audio recordings?

Send recordings under 10 hours as a single request. For longer recordings, split at natural pauses, process each segment, and aggregate the results.

How do I test my multimodal requests before building a full application?

Use Apidog to build/save request templates, switch model variants, inspect responses, and write assertions—no application code needed.
