Wanda

Posted on • Originally published at apidog.com

How to Use Qwen3.5-Omni: Text, Audio, Video, and Voice Cloning via API

TL;DR

Qwen3.5-Omni supports text, images, audio, and video as input, returning either text or real-time speech. You can access it via the Alibaba Cloud DashScope API or run it locally using HuggingFace Transformers. This article gives you practical setup, working code examples for each modality, voice cloning steps, and shows you how to test requests with Apidog.


What you’re working with

Qwen3.5-Omni is a single model that processes text, images, audio, and video inputs simultaneously, returning either text or natural speech depending on your configuration.

Qwen3.5-Omni Modalities

Launched March 30, 2026, Qwen3.5-Omni uses a Thinker-Talker architecture with a MoE backbone. The Thinker handles multimodal reasoning. The Talker generates speech, streaming audio output before the full response is available.

Variants:

  • Plus: Highest quality; best for complex reasoning and voice cloning
  • Flash: Balanced speed and quality; recommended for most production use
  • Light: Lowest latency; suited to mobile and edge deployments

This guide uses Flash. For max quality, use Plus.

API access via DashScope

Alibaba Cloud DashScope is the main way to use Qwen3.5-Omni. Start by getting an account and API key.

Step 1: Create a DashScope account

Sign up at dashscope.aliyuncs.com.

Step 2: Get your API key

  1. Log in to the DashScope console.
  2. In the left sidebar, select API Key Management.
  3. Click Create API Key.
  4. Copy your API key (format: sk-...).

Step 3: Install the SDK

pip install dashscope

Or, for OpenAI-compatible usage:

pip install openai

DashScope's OpenAI-compatible endpoint is https://dashscope.aliyuncs.com/compatible-mode/v1, so you can use your existing OpenAI code with a different base_url.

Text input and output

Send text and get text back:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)

print(response.choices[0].message.content)

Swap the model to qwen3.5-omni-plus for more complex tasks, or to qwen3.5-omni-light for low latency.


Audio input: transcription and understanding

Send an audio file (URL or base64). The model transcribes and reasons over audio directly.

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

  • Handles 113 languages (auto-detected).
  • Supported formats: WAV, MP3, M4A, OGG, FLAC.

Audio output: text-to-speech in the response

To get speech output, set the modalities parameter and configure audio:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

# Access text and audio
text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data  # base64-encoded WAV

with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")

Available voices: Chelsie (female) and Ethan (male). Speech output covers 36 languages.

Image input: visual understanding

Send an image (URL or base64) with a text question:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

For local images, use base64:

import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)

Video input: understanding recordings and screen captures

Pass video input for multimodal reasoning:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Audio-Visual Vibe Coding

Generate code from a screen recording:

with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # Use Plus for best code generation quality
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

  • 256K token context = ~400 seconds of 720p video.
  • For longer videos: trim or split.
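If you need to split programmatically, a rough token budget helps. The sketch below turns the ~400 s ≈ 256K-token figure above into a chunk planner; the tokens-per-second ratio is a coarse approximation derived from that figure, not an official quota:

```python
import math

# Coarse estimate from the figure above: 256K tokens ≈ 400 s of 720p video,
# i.e. roughly 640 tokens per second. Not an official DashScope quota.
TOKENS_PER_SECOND_720P = 256_000 / 400

def plan_video_chunks(duration_s, token_budget=200_000):
    """Split duration_s seconds of 720p video into (start, end) windows
    that each stay under token_budget tokens, leaving prompt headroom."""
    max_chunk_s = token_budget / TOKENS_PER_SECOND_720P
    n_chunks = math.ceil(duration_s / max_chunk_s)
    chunk_s = duration_s / n_chunks
    return [
        (round(i * chunk_s, 1), round(min((i + 1) * chunk_s, duration_s), 1))
        for i in range(n_chunks)
    ]

# A 15-minute screen recording becomes three ~5-minute requests.
print(plan_video_chunks(900))
```

Send each window as its own request (for example, after trimming with ffmpeg) and merge the per-chunk answers yourself.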

Voice cloning

Clone a speaker's voice for model responses:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data
with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

Tips for best results:

  • Use a clean, noise-free recording
  • Provide a 15–30 second sample
  • Use WAV at 16 kHz or higher
  • Record natural speech, not a monotone reading
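Before uploading, you can sanity-check a sample against these tips with Python's standard-library wave module. The thresholds below mirror the list above; they are guidelines, not limits enforced by the API:

```python
import wave

def check_voice_sample(path):
    """Flag common voice-cloning sample problems.
    Thresholds follow the tips above (guidelines, not API limits)."""
    issues = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if rate < 16_000:
        issues.append(f"sample rate {rate} Hz is below 16 kHz")
    if duration < 15:
        issues.append(f"duration {duration:.1f}s is under 15 seconds")
    elif duration > 30:
        issues.append(f"duration {duration:.1f}s is over 30 seconds; consider trimming")
    return issues

# Example: check_voice_sample("voice_sample.wav") returns an empty list
# for a good sample, or a list of human-readable warnings otherwise.
```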

Streaming responses

For real-time voice or chat, use streaming:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "audio") and delta.audio:
        if delta.audio.get("data"):
            audio_chunks.append(delta.audio["data"])
    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)

print()  # newline after streaming text

# Combine and save audio chunks
if audio_chunks:
    import base64
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)
    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)

PCM16 format is ideal for real-time playback.
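Note that the saved .pcm file is headerless, so most players won't open it directly. A small helper can wrap it in a WAV container; 24 kHz mono 16-bit is an assumption here, so verify the actual output format in the DashScope docs:

```python
import wave

def pcm16_to_wav(pcm_path, wav_path, sample_rate=24_000):
    """Wrap raw little-endian PCM16 bytes in a WAV header so standard
    players can open the file. The 24 kHz mono default is an assumption;
    check the DashScope docs for the model's real output format."""
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)

# pcm16_to_wav("streamed_response.pcm", "streamed_response.wav")
```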

Multi-turn conversation with mixed modalities

Handle conversation history with different input types:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})

    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: text
print(send_message([{"type": "text", "text": "I have an API that keeps returning 503 errors."}]))

# Turn 2: add an image (error log screenshot)
import base64
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    {"type": "text", "text": "Here's the error log screenshot. What's causing this?"}
]))

# Turn 3: follow-up text
print(send_message([{"type": "text", "text": "How do I fix the connection pool exhaustion you mentioned?"}]))

The 256K context window allows for long, mixed-modality conversations.
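Even at 256K tokens, history grows without bound, so long-running sessions eventually need pruning. One simple policy, sketched below (my own convention, not something the API requires), keeps the first message for grounding plus the most recent turns:

```python
def trim_history(conversation, max_messages=20):
    """Return a pruned copy of the conversation: keep the first message
    (often the system prompt or grounding turn) plus the most recent
    max_messages - 1 turns. A heuristic, not an API requirement."""
    if len(conversation) <= max_messages:
        return list(conversation)
    return [conversation[0]] + conversation[-(max_messages - 1):]

# Example with placeholder turns:
history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
print(len(trim_history(history)))  # 20
```

For multimodal history you may want to drop base64 image/audio payloads from old turns first, since they dominate the token count.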

Local deployment with HuggingFace

To run Qwen3.5-Omni locally:

pip install transformers==4.57.3
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",  # load in the checkpoint's native precision; flash_attention_2 requires half precision
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "path/to/your/audio.wav"},
            {"type": "text", "text": "What is being discussed in this audio?"}
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(**inputs, speaker="Chelsie")

text_response = processor.batch_decode(text_ids, skip_special_tokens=True)[0]
sf.write("local_response.wav", audio_output.reshape(-1).cpu().numpy(), samplerate=24000)

print(text_response)

GPU requirements:

Variant          Precision   Min VRAM
Plus (30B MoE)   BF16        ~40GB
Flash            BF16        ~20GB
Light            BF16        ~10GB

For production inference, use vLLM instead of HuggingFace Transformers for better performance.

Testing your Qwen3.5-Omni requests with Apidog

Multimodal API requests involve complex payloads (base64 audio, nested arrays, etc.). Apidog streamlines this:


  • Add your DashScope endpoint as a new collection.
  • Store API keys as environment variables.
  • Build and save request templates for each modality.

For each variant (Plus, Flash, Light), duplicate the request and change the model parameter. Run all and compare responses, latency, and output quality in one place.

You can also write test assertions in Apidog:

  • Check choices[0].message.content for text
  • Verify choices[0].message.audio.data for audio output
  • Assert that Flash latency is under your target

This is especially useful for variant selection.

Error handling and retry logic

Large multimodal requests (especially video) can hit rate limits or timeouts. Implement retry logic:

import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,  # 2-minute timeout for large video inputs
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Waiting {wait:.1f}s...")
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Connection error: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts")

For large video:

  • Trim to relevant section
  • Reduce resolution to 480p if possible
  • Split long recordings and aggregate responses

Common issues and fixes

  • Audio output is garbled on numbers/terms: Use Qwen3.5-Omni, not an earlier version. For local deployment, use the newest HuggingFace weights.
  • Model doesn’t interrupt on new audio: Use Flash or Plus, and stream the response.
  • Poor voice cloning: Clean, noise-free sample; at least 15 seconds; WAV at 16kHz+; use natural speech.
  • Video input exceeds token limits: 256K tokens ≈ 400s of 720p video. Trim or reduce resolution.
  • Local deployment is slow: Use vLLM, not Transformers, for production.

FAQ

Which DashScope model ID do I use for Qwen3.5-Omni?

Use qwen3.5-omni-plus, qwen3.5-omni-flash, or qwen3.5-omni-light based on your needs. Start with Flash.

Can I use the OpenAI Python SDK with DashScope?

Yes. Set base_url="https://dashscope.aliyuncs.com/compatible-mode/v1" and use your DashScope api_key.

How do I send multiple files (audio + image) in one request?

Include both as separate objects in the content array, with your text prompt.
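As a concrete sketch, one user turn can carry an audio clip, an image, and a text prompt together. The URL and prompt below are placeholders; the part types match the earlier examples in this article:

```python
# One user turn combining audio, an image, and a text prompt.
# The data/URL values are placeholders; the part types follow the
# content-array pattern used throughout this article.
message = {
    "role": "user",
    "content": [
        {"type": "input_audio",
         "input_audio": {"data": "<base64-encoded audio>", "format": "wav"}},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/dashboard.png"}},
        {"type": "text",
         "text": "Does the narration in this audio match what the dashboard shows?"},
    ],
}
print(len(message["content"]))  # 3 parts in a single request
```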

Is there a size limit for audio or video files?

DashScope has payload limits. For large files, use a URL reference rather than base64. Host the file and pass its URL.

How do I disable audio output and get text only?

Set modalities=["text"] or omit modalities.

Does it support function/tool calling?

Yes. Use the tools parameter with function definitions, just like OpenAI models.
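For example, a tool definition in the standard OpenAI function-calling schema might look like this (get_api_status is a hypothetical function invented for illustration):

```python
# Hypothetical tool definition in the standard OpenAI function-calling
# schema, which DashScope's compatible mode accepts via the tools parameter.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_api_status",  # hypothetical function name
            "description": "Check the current status of a named API endpoint.",
            "parameters": {
                "type": "object",
                "properties": {
                    "endpoint": {
                        "type": "string",
                        "description": "Endpoint path, e.g. /v1/users",
                    },
                },
                "required": ["endpoint"],
            },
        },
    },
]
print(tools[0]["function"]["name"])  # get_api_status
```

Then call client.chat.completions.create(..., tools=tools) and inspect response.choices[0].message.tool_calls, just as you would with an OpenAI model.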

What’s the best way to handle long audio recordings?

Send recordings under 10 hours as a single request. For longer recordings, split at natural pauses, process each segment, and aggregate the results.

How do I test my multimodal requests before building a full application?

Use Apidog to build/save request templates, switch model variants, inspect responses, and write assertions—no application code needed.
