Rihpig

Posted on Mar 31 • Originally published at apidog.com

How to Use the Qwen3.5-Omni API: Text, Audio, Video, and Voice Cloning

TL;DR

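Qwen3.5-Omni accepts text, images, audio, and video as input and responds with text or real-time speech. You can access it through the Alibaba Cloud DashScope API or run it locally with HuggingFace Transformers. This guide covers API setup, working code examples for each modality, voice cloning, and how to test requests with Apidog.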


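What You're Working With

Qwen3.5-Omni is a single model that handles four input types at once: text, images, audio, and video. Depending on how you configure the request, it responds with text or natural-sounding speech.

Released on March 30, 2026, the model is built on a Thinker-Talker architecture with a Mixture-of-Experts (MoE) backbone. The Thinker processes and reasons over the multimodal input. The Talker converts the output to speech using a multi-codebook system that starts streaming audio before the full response is complete.

Three variants are available:

Plus: highest quality, best for reasoning and voice cloning
Flash: balanced speed and quality, recommended for most production use
Light: lowest latency, for mobile and edge scenarios

The examples in this guide mostly use Flash. Switch to Plus when you need maximum quality.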

API Access Through DashScope

Alibaba Cloud's DashScope API is the primary way to use Qwen3.5-Omni in production. You need a DashScope account and an API key.

Step 1: Create a DashScope account

Go to https://dashscope.aliyuncs.com and sign up. If you already have an Alibaba Cloud account, use that.

Step 2: Get an API key

Log in to the DashScope console.
Click API Key Management in the left sidebar.
Click Create API Key.
Copy the key (format: sk-...).
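
Once you have a key, it is usually safer to read it from the environment than to paste it into source files. The following is a minimal sketch of that idea. The variable name DASHSCOPE_API_KEY and the helper load_api_key are assumptions made for this illustration, so use whatever name your deployment already has.

```python
import os


def load_api_key(env=None, var_name="DASHSCOPE_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    var_name is an assumed variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    if not key.startswith("sk-"):
        # The guide describes keys as having the form sk-...
        raise RuntimeError(f"{var_name} does not look like a DashScope key (expected sk-...).")
    return key
```

With this in place, the client examples in this guide can pass api_key=load_api_key() instead of a hard-coded string.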

Step 3: Install the SDK

pip install dashscope

Or, if you plan to use the OpenAI-compatible endpoint:

pip install openai

DashScope exposes an OpenAI-compatible API at https://dashscope.aliyuncs.com/compatible-mode/v1. Change only base_url and your existing OpenAI code works as-is.

Text Input and Output

A basic text-in, text-out example:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between REST and GraphQL APIs in plain terms."
        }
    ],
)

print(response.choices[0].message.content)

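For harder reasoning tasks, switch to qwen3.5-omni-plus. If latency is the priority, use qwen3.5-omni-light.

Audio Input: Transcription and Understanding

Pass an audio file URL or base64-encoded audio and the model transcribes and reasons over it in one step. No separate ASR stage is needed.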

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("meeting_recording.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_data,
                        "format": "wav"
                    }
                },
                {
                    "type": "text",
                    "text": "Summarize the key decisions made in this meeting and list any action items."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

The model supports 113 languages and detects the language automatically.
Supported formats: WAV, MP3, M4A, OGG, FLAC
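
To avoid repeating the base64 boilerplate for every file, the audio part can be built by a small helper. This is a sketch only: audio_part is a hypothetical helper, and it assumes the "format" value is the lowercase file extension for every listed format, which this guide only confirms for "wav".

```python
import base64
from pathlib import Path

# Formats listed in this guide. Using the extension as the "format" value is an assumption.
SUPPORTED_AUDIO_FORMATS = {"wav", "mp3", "m4a", "ogg", "flac"}


def audio_part(path):
    """Build an input_audio content part from a local audio file."""
    p = Path(path)
    fmt = p.suffix.lower().lstrip(".")
    if fmt not in SUPPORTED_AUDIO_FORMATS:
        raise ValueError(f"Unsupported audio format: {fmt or '(none)'}")
    data = base64.b64encode(p.read_bytes()).decode("utf-8")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

The returned dict can be placed directly in the content array next to a text part, as in the example above.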

Audio Output: Text-to-Speech in the Response

To receive speech alongside the text, set modalities and configure the audio output.

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Chelsie", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Describe the steps to authenticate a REST API using OAuth 2.0."
        }
    ],
)

text_content = response.choices[0].message.content
audio_data = response.choices[0].message.audio.data

with open("response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

print(f"Text: {text_content}")
print("Audio saved to response.wav")

Built-in voices: Chelsie (female), Ethan (male)
Speech generation is supported in 36 languages.

Image Input: Visual Understanding

Pass an image URL or a base64-encoded image along with a text question:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/api-diagram.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this API architecture diagram and identify any potential bottlenecks."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

Encode local images as base64 and pass them as a data URL (this snippet reuses the client created above):

import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

image_url = f"data:image/png;base64,{image_data}"

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_url}
                },
                {
                    "type": "text",
                    "text": "What error is shown in this screenshot?"
                }
            ]
        }
    ],
)
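
The snippet above hard-codes image/png. For mixed image types, one possible approach is to infer the MIME type from the file extension with Python's standard mimetypes module. The helper name image_data_url is hypothetical and introduced only for this sketch.

```python
import base64
import mimetypes


def image_data_url(path):
    """Return a data: URL for a local image, inferring the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The result goes into {"type": "image_url", "image_url": {"url": ...}} exactly like the hand-built URL above.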

Video Input: Understanding Recordings and Screen Captures

With video input, the model reasons over the visual and audio tracks at the same time.

from openai import OpenAI
import base64

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://example.com/product-demo.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Describe what the developer is building in this demo and write equivalent code."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

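Audio-Visual Vibe Coding

Pass a screen recording and ask for code (this snippet reuses the client and the base64 import from the previous example):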

with open("screen_recording.mp4", "rb") as f:
    video_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # best code-generation quality
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": f"data:video/mp4;base64,{video_data}"
                    }
                },
                {
                    "type": "text",
                    "text": "Watch this screen recording and write the complete code that replicates what you see being built. Include all the UI components and their interactions."
                }
            ]
        }
    ],
)

print(response.choices[0].message.content)

The 256K-token context holds roughly 400 seconds of 720p video.
Longer recordings need to be split.
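
Splitting comes down to simple arithmetic on the recording's duration. The sketch below computes consecutive time ranges that each fit the 400-second figure quoted above. That figure applies to 720p, so the default is an assumption you would adjust for other resolutions. The function name segment_ranges is hypothetical.

```python
def segment_ranges(duration_s, max_segment_s=400.0):
    """Split [0, duration_s) into consecutive (start, end) ranges of at most max_segment_s."""
    duration_s = float(duration_s)
    max_segment_s = float(max_segment_s)
    if duration_s <= 0 or max_segment_s <= 0:
        raise ValueError("duration_s and max_segment_s must be positive")
    ranges = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges


print(segment_ranges(900))
# [(0.0, 400.0), (400.0, 800.0), (800.0, 900.0)]
```

Each range can then be cut from the source file with your video tool of choice and sent as its own request.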

Voice Cloning

Provide a voice sample and the model generates its response in that voice:

import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

with open("voice_sample.wav", "rb") as f:
    voice_sample = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    modalities=["text", "audio"],
    audio={
        "voice": "custom",
        "format": "wav",
        "voice_sample": {
            "data": voice_sample,
            "format": "wav"
        }
    },
    messages=[
        {
            "role": "user",
            "content": "Welcome to the Apidog developer portal. How can I help you today?"
        }
    ],
)

audio_data = response.choices[0].message.audio.data
with open("cloned_response.wav", "wb") as f:
    f.write(base64.b64decode(audio_data))

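Tips for voice-cloning quality:

Use a clean recording with no background noise
Use a 15–30 second clip
Use WAV at 16 kHz or higher
Speak naturally rather than reading aloud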
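
The two measurable tips, clip length and sample rate, can be checked before uploading. The following sketch uses Python's standard wave module, which reads uncompressed PCM WAV files. It cannot judge background noise or how natural the speech sounds. check_voice_sample is a hypothetical helper, and its thresholds simply mirror the tips above.

```python
import wave


def check_voice_sample(path, min_s=15.0, max_s=30.0, min_rate=16000):
    """Return a list of problems found in a WAV voice sample (empty list means OK)."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
    problems = []
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.1f}s is outside {min_s:.0f}-{max_s:.0f}s")
    return problems
```

Calling check_voice_sample("voice_sample.wav") before the request gives a quick sanity check on the clip.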

Streaming Responses

Use streaming mode for real-time voice chat and similar use cases:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    modalities=["text", "audio"],
    audio={"voice": "Ethan", "format": "pcm16"},
    messages=[
        {
            "role": "user",
            "content": "Explain how WebSocket connections differ from HTTP polling."
        }
    ],
    stream=True,
)

audio_chunks = []
text_chunks = []

for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "audio") and delta.audio:
        if delta.audio.get("data"):
            audio_chunks.append(delta.audio["data"])
    if delta.content:
        text_chunks.append(delta.content)
        print(delta.content, end="", flush=True)

print()

if audio_chunks:
    import base64
    full_audio = b"".join(base64.b64decode(chunk) for chunk in audio_chunks)
    with open("streamed_response.pcm", "wb") as f:
        f.write(full_audio)

PCM16 can be piped directly into an audio output buffer, which makes it well suited to streaming.
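
A raw .pcm file has no header, so most players will not open it. If you want to listen to the saved stream, one option is to wrap the bytes in a WAV container with the standard wave module. This sketch assumes 24 kHz mono audio, matching the sample rate used in the local-deployment example later in this guide. The rate of the API stream is not stated here, so verify it before relying on this default. pcm16_to_wav is a hypothetical helper.

```python
import wave


def pcm16_to_wav(pcm_bytes, wav_path, sample_rate=24000, channels=1):
    """Wrap raw 16-bit PCM bytes in a WAV container.

    sample_rate and channels are assumptions; adjust them to the actual stream.
    """
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit samples are 2 bytes each
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)
```

In the streaming example, pcm16_to_wav(full_audio, "streamed_response.wav") would produce a playable file from the joined chunks.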

Multi-Turn Conversations with Mixed Modalities

You can mix modalities freely within a single conversation:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
)

conversation = []

def send_message(content_parts):
    conversation.append({"role": "user", "content": content_parts})

    response = client.chat.completions.create(
        model="qwen3.5-omni-flash",
        messages=conversation,
    )

    reply = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": reply})
    return reply

# Turn 1: text
print(send_message([{"type": "text", "text": "I have an API that keeps returning 503 errors."}]))

# Turn 2: add an image (screenshot of the error log)
import base64
with open("error_log.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

print(send_message([
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}},
    {"type": "text", "text": "Here's the error log screenshot. What's causing this?"}
]))

# Turn 3: follow-up text
print(send_message([{"type": "text", "text": "How do I fix the connection pool exhaustion you mentioned?"}]))

The 256K context window has room for long conversations that include images and audio.

Local Deployment with HuggingFace

To run the model on your own infrastructure:

pip install transformers==4.57.3
pip install accelerate
pip install soundfile
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation

import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

model_path = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_path,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "path/to/your/audio.wav"},
            {"type": "text", "text": "What is being discussed in this audio?"}
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio_output = model.generate(**inputs, speaker="Chelsie")

text_response = processor.batch_decode(text_ids, skip_special_tokens=True)[0]
sf.write("local_response.wav", audio_output.reshape(-1).cpu().numpy(), samplerate=24000)

print(text_response)

Minimum GPU VRAM for local deployment:

Variant	Precision	Minimum VRAM
Plus (30B MoE)	BF16	~40GB
Flash	BF16	~20GB
Light	BF16	~10GB

For production local inference, use vLLM rather than HuggingFace Transformers for better performance.

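Testing Qwen3.5-Omni Requests with Apidog

Multimodal API requests are hard to debug: they carry base64 audio and video, nested arrays, and composite responses. Apidog handles this more efficiently than a terminal-based workflow.

Set up the DashScope endpoint as a new collection
Store the API key as an environment variable
Create a request template for each modality

Duplicate a base request for each variant and change only the model parameter to compare responses, latency, and quality side by side.

Example test assertions:

Text response: check that choices[0].message.content is not empty
Audio response: check that choices[0].message.audio.data exists
Flash: check response latency against a threshold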
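
The same three checks can also be expressed in Python, for example in a CI smoke test that runs outside Apidog. This is a sketch under assumptions: check_response is a hypothetical helper that takes the response as a plain dict, and the latency threshold is a value you choose yourself.

```python
def check_response(resp, expect_text=True, expect_audio=False,
                   latency_s=None, max_latency_s=None):
    """Return a list of failed checks for a chat-completion response given as a dict."""
    failures = []
    message = resp["choices"][0]["message"]
    if expect_text and not (message.get("content") or "").strip():
        failures.append("choices[0].message.content is empty")
    if expect_audio and not (message.get("audio") or {}).get("data"):
        failures.append("choices[0].message.audio.data is missing")
    if latency_s is not None and max_latency_s is not None and latency_s > max_latency_s:
        failures.append(f"latency {latency_s:.2f}s exceeds {max_latency_s:.2f}s")
    return failures
```

With the OpenAI Python SDK, response.model_dump() converts a response object into a dict suitable for this function, and an empty result means every check passed.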

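This is useful when deciding which variant to use in production.

Error Handling and Retry Logic

Rate limits and timeouts are common with large inputs, especially video. Always implement retry logic.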

import time
import random
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-YOUR_DASHSCOPE_KEY",
    timeout=120,
)

def call_with_retry(messages, model="qwen3.5-omni-flash", max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                break  # out of retries; skip the pointless final sleep
            # Exponential backoff with jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Waiting {wait:.1f}s...")
            time.sleep(wait)
        except (APITimeoutError, APIConnectionError) as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Connection error: {e}. Retrying in {wait:.1f}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} attempts.")

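For videos larger than 100 MB:

Send only the portion you need
Resize to 480p if high resolution is not required
Split long recordings into segments and aggregate the results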
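
Trimming and downscaling are commonly done with ffmpeg. The sketch below only builds the argument list, so it can be inspected before anything runs. It assumes ffmpeg is installed and on your PATH, and ffmpeg_trim_480p_args is a hypothetical helper name.

```python
def ffmpeg_trim_480p_args(src, dst, start_s, duration_s):
    """Build an ffmpeg argument list that cuts a clip and scales it to 480 px height."""
    return [
        "ffmpeg",
        "-ss", str(start_s),    # seek to the start of the clip
        "-i", src,
        "-t", str(duration_s),  # keep this many seconds
        "-vf", "scale=-2:480",  # 480 px tall; width stays proportional and even
        dst,
    ]


args = ffmpeg_trim_480p_args("screen_recording.mp4", "clip_480p.mp4", 0, 60)
print(" ".join(args))
```

To run the command, pass the list to subprocess.run(args, check=True).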

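Common Issues and Fixes

"Audio output is garbled on numbers or technical terms."

→ Use the latest Qwen3.5-Omni model. If you self-host, confirm that you have the latest weights.

"The model keeps talking even when I send an audio interruption."

→ Semantic interruption is supported only on Flash and Plus, and it requires streaming mode.

"Voice cloning quality is poor."

→ Use a clean recording (at least 15 seconds, WAV at 16 kHz or higher) with no background noise.

"Video input returns a token-limit error."

→ 256K tokens is roughly 400 seconds of 720p video. For longer clips, trim the video or lower the resolution.

"Local deployment is very slow."

→ For production inference, use vLLM instead of HuggingFace Transformers (vLLM optimizes routing).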

FAQ

Which DashScope model ID should I use for Qwen3.5-Omni?

Choose qwen3.5-omni-plus, qwen3.5-omni-flash, or qwen3.5-omni-light depending on your quality and latency needs. Flash is recommended for most cases.

Can I use the OpenAI Python SDK with DashScope?

Yes. Set base_url="https://dashscope.aliyuncs.com/compatible-mode/v1" and use your DashScope API key. The request format is the same.

How do I send multiple files (audio plus image) in one request?

Add a separate object for each modality to the content array.
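
As an illustration, a content array that combines one audio clip, one image, and a text prompt could be assembled like this. The part shapes are the ones used earlier in this guide, and the helper name mixed_content is hypothetical.

```python
def mixed_content(audio_b64, audio_format, image_url, question):
    """Build a content array with one audio part, one image part, and a text prompt."""
    return [
        {"type": "input_audio", "input_audio": {"data": audio_b64, "format": audio_format}},
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": question},
    ]
```

The returned list is used as the "content" value of a single user message.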

Is there a file size limit for audio and video?

There is a per-request payload limit. For large files, use a URL reference instead of base64: upload the file to storage and pass the URL.
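
One reason base64 hits payload limits early is that it encodes every 3 input bytes as 4 output characters, so the request body is about a third larger than the file. The short sketch below computes the encoded size. base64_size is a hypothetical helper, and the actual DashScope limit is not stated in this guide.

```python
def base64_size(n_bytes):
    """Size in bytes of the padded base64 encoding of n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


# A 100 MB (100 * 1024 * 1024 byte) file grows to roughly 133 MB once encoded.
print(base64_size(100 * 1024 * 1024))
```

Comparing this number against your payload budget is a quick way to decide between inline base64 and a URL reference.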

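How do I disable audio output and get text only?

Set modalities=["text"] or omit modalities. Text-only responses are faster and cheaper.

Does it support function/tool calling?

Yes. Define functions with the standard tools parameter. The model returns structured tool-call objects.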
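
A tool definition in the standard OpenAI tools format, plus a small dispatcher for the returned calls, might look like the sketch below. The tool get_endpoint_status, its schema, and its stand-in implementation are hypothetical and exist only to illustrate the shape.

```python
import json

# Hypothetical tool for illustration; the name and schema are not from the DashScope docs.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_endpoint_status",
            "description": "Look up the current health status of an API endpoint.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Endpoint URL"}
                },
                "required": ["url"],
            },
        },
    }
]


def get_endpoint_status(url):
    """Stand-in implementation; a real one would query your monitoring system."""
    return {"url": url, "status": "unknown"}


def run_tool_call(name, arguments_json):
    """Dispatch one tool call: the model supplies the name and JSON-encoded arguments."""
    handlers = {"get_endpoint_status": get_endpoint_status}
    if name not in handlers:
        raise KeyError(f"Unknown tool: {name}")
    return handlers[name](**json.loads(arguments_json))
```

With the OpenAI Python SDK you would pass tools=tools to client.chat.completions.create, then call run_tool_call(call.function.name, call.function.arguments) for each entry in response.choices[0].message.tool_calls.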

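What is the best way to handle long audio recordings?

Send recordings under 10 hours as a single request. Split anything longer at natural pauses and aggregate the results in your application layer.

How do I test multimodal requests before building the full application?

In Apidog you can save a request template for each modality, inspect the response structure, switch between variants, and write quality assertions.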
