Thanawat Wongchai

Posted on Mar 31 • Originally published at apidog.com

How to Use Qwen3.5-Omni: Text, Audio, Video, and Voice Cloning APIs

TL;DR

Qwen3.5-Omni accepts text, image, audio, and video input and returns text or real-time speech. You can call it through the Alibaba Cloud DashScope API or run it locally with HuggingFace Transformers. This guide covers API setup, working code examples for each modality, voice cloning, and how to test your requests with Apidog.


What you are working with

Qwen3.5-Omni is a single model that accepts four input types at once: text, images, audio, and video. It returns either text or natural-sounding speech, depending on how you configure the request.

It launched on March 30, 2026 with a Thinker-Talker (MoE) architecture. The Thinker processes the multimodal input, and the Talker converts the output into speech (multi-codebook) and starts streaming audio early.

There are 3 variants:

Plus: highest quality, suited to reasoning tasks and voice cloning
Flash: a balance of speed and quality, suited to general use
Light: low latency, suited to mobile and edge

The examples in this guide mostly use Flash. Switch to Plus when you need the highest quality.

Accessing the API through DashScope

Alibaba Cloud's DashScope API is the main way to use Qwen3.5-Omni in production. You need a DashScope account and an API key.

Step 1: Create a DashScope account

Go to dashscope.aliyuncs.com and sign up, or use an existing Alibaba Cloud account.

Step 2: Get an API key

Log in to the DashScope Console
Click API Key Management in the left sidebar
Click Create API Key
Copy the key (format sk-...)

Step 3: Install the SDK

pip install dashscope

Or use the OpenAI-compatible SDK directly:

pip install openai

DashScope exposes an OpenAI-compatible API at https://dashscope.aliyuncs.com/compatible-mode/v1, so existing OpenAI SDK code works once you change base_url.
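The examples in this guide hard-code a placeholder key for readability. As a minimal sketch, you could instead read the key from an environment variable; the variable name DASHSCOPE_API_KEY and the helper load_dashscope_key are assumptions made for this illustration, not something the guide prescribes.

```python
import os


def load_dashscope_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your deployment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set {env_var} before calling the API")
    return key
```

You would then pass api_key=load_dashscope_key() when constructing the client, which keeps the key out of source control.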

Text input and output

The simplest case is plain text in, text out:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)

print(response.choices[0].message.content)

Switch to qwen3.5-omni-plus when you need higher reasoning accuracy, or to qwen3.5-omni-light when latency matters more.

Audio input: transcription and understanding

Send an audio file (base64 or URL). The model transcribes it and reasons over it directly, so no separate ASR step is needed:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Supports 113 languages; you do not need to specify the language
Supported file formats: WAV, MP3, M4A, OGG, FLAC
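If you send audio from several places in your code, a small helper keeps the content part consistent. This is a sketch under assumptions: the helper name build_audio_part is hypothetical, and the mapping from file extension to the "format" string simply mirrors the formats listed above, so check it against the DashScope documentation before relying on it.

```python
import base64
from pathlib import Path

# Formats listed in this guide; the extension-to-format mapping is an assumption.
SUPPORTED_AUDIO = {
    ".wav": "wav",
    ".mp3": "mp3",
    ".m4a": "m4a",
    ".ogg": "ogg",
    ".flac": "flac",
}


def build_audio_part(path: str) -> dict:
    """Build an input_audio content part from a local audio file."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_AUDIO:
        raise ValueError(f"Unsupported audio format: {suffix or 'none'}")
    # Base64-encode the raw file bytes, as in the example above.
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "input_audio",
        "input_audio": {"data": data, "format": SUPPORTED_AUDIO[suffix]},
    }
```

The returned dict can be placed in the content array next to a text part.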

Audio output: text-to-speech in the response

To get audio back, set modalities and choose a voice:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data

with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")

Available voices: Chelsie (female) and Ethan (male)
Speech output supports 36 languages

Image input: visual understanding

Send an image URL or a base64-encoded image together with a question:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

To send a local image, encode it as base64:

# Continues from the previous example and reuses its `client`.
import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")
image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)
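The snippet above hard-codes image/png. For images in other formats, a sketch like the following guesses the MIME type from the file name using Python's standard mimetypes module; the helper name to_data_url is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode a local image as a data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The result goes into the "url" field of an image_url content part, exactly where image_url is used above.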

Video input: understanding recordings and screen captures

The model can reason across both the visuals and the audio track of a video:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Audio-Visual Vibe Coding

Send a screen recording and ask the model to generate the code:

# Continues from the previous example: reuses its `base64` import and `client`.
with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

The 256K-token context window fits roughly 400 seconds of 720p video
For longer videos, trim them or split them into segments
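To plan how a long recording would be split, here is a rough sketch that computes segment boundaries. It takes the ~400-second figure above as the default segment length; the function name split_into_segments is hypothetical, and the actual cutting would be done with a video tool of your choice.

```python
def split_into_segments(
    total_seconds: float, max_segment: float = 400.0
) -> list[tuple[float, float]]:
    """Return (start, end) pairs that cover the recording in order."""
    if total_seconds <= 0 or max_segment <= 0:
        raise ValueError("Durations must be positive")
    segments = []
    start = 0.0
    while start < total_seconds:
        # The last segment is shortened to end exactly at the recording's end.
        end = min(start + max_segment, total_seconds)
        segments.append((start, end))
        start = end
    return segments
```

Each pair can then drive one request, with the per-segment answers merged afterwards.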

Voice cloning

Provide a sample of the target voice and the model replies in that voice. This is available on Plus and Flash through the API:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data
with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

Tips for voice cloning quality:

Use a clean recording with no background noise
A 15-30 second sample works better than a short clip
WAV at 16kHz or higher is recommended
Use natural speech rather than someone just reading a script
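The duration and sample-rate tips can be checked before uploading. The sketch below uses Python's standard wave module, so it only handles uncompressed WAV files; the function name check_voice_sample and the thresholds as parameters are illustrative choices taken from the tips above.

```python
import wave


def check_voice_sample(
    path: str,
    min_seconds: float = 15.0,
    max_seconds: float = 30.0,
    min_rate: int = 16000,
) -> list[str]:
    """Return a list of problems; an empty list means the tips are met."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    problems = []
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if duration < min_seconds:
        problems.append(f"{duration:.1f}s is shorter than {min_seconds:.0f}s")
    elif duration > max_seconds:
        problems.append(f"{duration:.1f}s is longer than the suggested {max_seconds:.0f}s")
    return problems
```

Noise and speaking style still need a human ear; this only covers the measurable tips.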

Streaming responses

For real-time conversational apps, use streaming so the audio starts arriving before the text is complete:

import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

for chunk in stream:
    # Some chunks (for example a trailing usage chunk) carry no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # `audio` is an extra field on the delta holding base64-encoded audio.
    audio = getattr(delta, "audio", None)
    if audio and audio.get("data"):
        audio_chunks.append(audio["data"])
    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)

print()

if audio_chunks:
    # Decode each chunk separately, then join the raw PCM bytes.
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)
    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)

PCM16 suits streaming because each audio buffer can be forwarded as soon as it arrives
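A raw .pcm file has no header, so most players cannot open it directly. The sketch below wraps the bytes in a WAV container with the standard wave module. The 24000 Hz mono defaults are an assumption borrowed from the samplerate used in the local example later in this guide; confirm the actual rate for your API response. The function name pcm16_to_wav is hypothetical.

```python
import wave


def pcm16_to_wav(
    pcm_bytes: bytes, path: str, sample_rate: int = 24000, channels: int = 1
) -> None:
    """Write raw 16-bit PCM bytes to a WAV file with a proper header."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)  # assumed rate, see note above
        wav.writeframes(pcm_bytes)
```

If the audio plays too fast or too slow, the assumed sample rate is the first thing to check.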

Multi-turn conversations with mixed modalities

You can mix input types from one turn to the next:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})

    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: text
print(send_message([{"type": "text", "text": "I have an API that keeps returning 503 errors."}]))

# Turn 2: image
import base64
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    {"type": "text", "text": "Here's the error log screenshot. What's causing this?"}
]))

# Turn 3: follow-up text
print(send_message([{"type": "text", "text": "How do I fix the connection pool exhaustion you mentioned?"}]))

The 256K context window lets multi-turn conversations run without being truncated

Local deployment with HuggingFace

To run the model on your own hardware:

pip install transformers==4.57.3
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto",  # load in the checkpoint's dtype; FlashAttention 2 needs fp16/bf16
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "path/to/your/audio.wav"},
            {"type": "text", "text": "What is being discussed in this audio?"}
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(**inputs, speaker="Chelsie")

text_response = processor.batch_decode(text_ids, skip_special_tokens=True)[0]
sf.write("local_response.wav", audio_output.reshape(-1).cpu().numpy(), samplerate=24000)

print(text_response)

GPU memory requirements:

Variant	Precision	Minimum VRAM
Plus (30B MoE)	BF16	~40GB
Flash	BF16	~20GB
Light	BF16	~10GB

For production inference, vLLM is recommended over HuggingFace Transformers

Testing your Qwen3.5-Omni requests with Apidog

Multimodal API requests are harder to debug than usual: you have to handle base64 files, content arrays, and responses that contain both text and audio. Doing all of that from a terminal gets complicated.

Apidog lets you manage DashScope endpoints, store the API key in an environment, and create request templates for each modality.

For each variant (Plus, Flash, Light), you can duplicate a request, change the model parameter, and compare latency and quality in the same view.

You can write assertion tests in Apidog, for example:

Check that choices[0].message.content contains text
Check that choices[0].message.audio.data contains data (when audio is requested)
Check that Flash latency stays within your threshold

This makes it easier to decide which variant to run in production.
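The same three checks can also be expressed in plain Python, for example in a test suite that runs alongside Apidog. This is only a sketch: it operates on the response as a dict, and the function name check_chat_response and its parameters are illustrative rather than part of any SDK.

```python
from typing import Optional


def check_chat_response(
    payload: dict,
    expect_audio: bool = False,
    elapsed_s: Optional[float] = None,
    max_latency_s: Optional[float] = None,
) -> list[str]:
    """Return a list of failed checks; an empty list means all passed."""
    choices = payload.get("choices") or []
    if not choices:
        return ["response has no choices"]
    message = choices[0].get("message") or {}
    failures = []
    if not message.get("content"):
        failures.append("message.content is empty")
    if expect_audio and not (message.get("audio") or {}).get("data"):
        failures.append("message.audio.data is empty")
    # The latency check only runs when both values are supplied.
    if elapsed_s is not None and max_latency_s is not None and elapsed_s > max_latency_s:
        failures.append(f"latency {elapsed_s:.2f}s exceeds {max_latency_s:.2f}s")
    return failures
```

With the OpenAI SDK, response.model_dump() gives you a dict in this shape.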

Error handling and retry logic

Large multimodal models are prone to rate limits and timeouts, especially with video. Add retry logic:

import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    """Call the chat endpoint, retrying transient errors with backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except (RateLimitError, APITimeoutError, APIConnectionError) as e:
            # Out of attempts: re-raise so the caller sees the original error.
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"{type(e).__name__}: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    # Only reached when max_retries is zero or negative.
    raise RuntimeError(f"Failed after {max_retries} attempts")

For large videos (>100MB), consider:

Trimming to only the important part
Reducing the resolution to 480p (if the content does not need fine detail)
Splitting long recordings into segments

Common problems and fixes

"Audio output is distorted when it contains numbers or technical terms"

Check that you are using Qwen3.5-Omni (not an older version) and the latest model weights

"The model keeps talking when I interrupt it"

Semantic interruption requires Flash or Plus, and the response should be streamed

"Voice cloning quality is poor"

The sample must be clean and free of noise; a tool such as Audacity can help. Also try a longer sample

"Video input fails with a token limit error"

The 256K-token window fits roughly 400 seconds of 720p video. Trim or downscale the video so it stays under about 6 minutes

"Local deployment is very slow"

Use vLLM rather than Transformers to get reasonable throughput

FAQ

Which DashScope model ID should I use for Qwen3.5-Omni?

Use qwen3.5-omni-plus, qwen3.5-omni-flash, or qwen3.5-omni-light, depending on the quality and latency you need. Start with Flash

Can I use the OpenAI Python SDK with DashScope?

Yes. Set base_url="https://dashscope.aliyuncs.com/compatible-mode/v1" and pass your DashScope key as api_key

How do I send multiple files (audio + image) in a single request?

Put each file in the content array as a separate object with its own type, alongside the text part
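As a minimal sketch, the content array for one audio clip, one image, and a prompt could be assembled like this. It reuses the part shapes from the audio and image examples earlier in this guide; the helper name build_mixed_content is hypothetical.

```python
import base64


def build_mixed_content(
    audio_bytes: bytes, audio_format: str, image_url: str, prompt: str
) -> list[dict]:
    """Build a content array with one audio part, one image part, and text."""
    return [
        {
            "type": "input_audio",
            "input_audio": {
                "data": base64.b64encode(audio_bytes).decode("utf-8"),
                "format": audio_format,
            },
        },
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": prompt},
    ]
```

Pass the result as the "content" of a single user message.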

Is there a size limit for audio or video files?

DashScope enforces a payload limit. For large files, pass a URL instead of base64
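Base64 turns every 3 bytes into 4 characters, so an encoded file is about a third larger than the original. The sketch below estimates the encoded size and compares it with a limit that you supply; this guide does not state the actual DashScope limit, so payload_limit_bytes is a placeholder you must look up, and both function names are hypothetical.

```python
def base64_encoded_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def should_send_as_url(raw_bytes: int, payload_limit_bytes: int) -> bool:
    """True when the base64-encoded file would exceed the given limit."""
    return base64_encoded_size(raw_bytes) > payload_limit_bytes
```

Calling should_send_as_url(os.path.getsize(path), limit) before encoding avoids building a request that is bound to be rejected.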

How do I turn off audio output and get text only?

Set modalities=["text"], or omit the parameter entirely

Does it support function/tool calling?

Yes. Use the tools parameter the same way you would with OpenAI
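For reference, here is a sketch of a tool definition in the OpenAI tools format, plus a small dispatcher for the call the model sends back. The tool get_endpoint_status and its stub implementation are hypothetical and exist only to illustrate the shape; verify tool-calling behavior for your chosen variant before building on it.

```python
import json

# Tool schema in the OpenAI "tools" format; the tool itself is hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_endpoint_status",
            "description": "Look up the current status of an API endpoint.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Endpoint URL"}
                },
                "required": ["url"],
            },
        },
    }
]


def get_endpoint_status(url: str) -> dict:
    """Stub implementation; a real one would query your monitoring system."""
    return {"url": url, "status": "unknown"}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments and return JSON."""
    handlers = {"get_endpoint_status": get_endpoint_status}
    if name not in handlers:
        raise KeyError(f"Unknown tool: {name}")
    return json.dumps(handlers[name](**json.loads(arguments_json)))
```

You would pass tools=tools to client.chat.completions.create and send the dispatcher's output back in a message with role "tool".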

How do I handle long audio recordings?

If the recording is under 10 hours, send it in a single request. Beyond that, split it into segments and merge the results yourself

How do I test multimodal requests before building the real app?

Use Apidog to create request templates for each modality, run them in batches, switch models, and inspect the full output

Try Apidog today to test and develop your multimodal API workflows efficiently!
