
tumf

Posted on • Originally published at blog.tumf.dev

Qwen3-TTS: Surprised by the Quality of Japanese on Apple Silicon M3 — Creating Rights-Free Voices with VoiceDesign

Originally published on 2026-01-24
Original article (Japanese): Qwen3-TTS: Apple Silicon M3で試したら日本語品質に驚いた——VoiceDesignで権利フリーの声を作る

"I can't believe such natural Japanese can come from open-source TTS" — honestly, I was amazed.

On January 22, 2026, the Qwen Team from Alibaba Cloud released Qwen3-TTS, which I tested on my MacBook Air (Apple Silicon M3). To cut to the chase, the quality of Japanese speech synthesis is quite impressive. Moreover, the ability to create "your own voice" using the VoiceDesign feature is fascinating. You can generate rights-free voices simply through natural language instructions.

In this article, I will summarize the results of running it on Apple Silicon (MPS) and the practicality of the VoiceDesign → Clone workflow. The code used for testing is available on GitHub.

What is Qwen3-TTS?

Qwen3-TTS is a collection of open-source TTS (Text-to-Speech) models released under the Apache License 2.0. It is available for commercial use as well. There is also an official blog (Qwen Official Blog).

The models are categorized by use case, and it helps to understand the following three main types:

  • CustomVoice: Choose from pre-prepared "premium speakers" and adjust the style with instructions.
  • VoiceDesign: Design the voice's timbre, emotion, and prosody using instructions (this is interesting).
  • Base: Clone a voice from a reference audio in about 3 seconds.

The models are consolidated in the Hugging Face Qwen3-TTS collection. Supported languages include Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.

Running on Apple Silicon M3

Environment

| Item | Value |
|---|---|
| Machine | MacBook Air (Apple Silicon M3) |
| OS | macOS |
| Python | 3.13.1 |
| PyTorch | 2.10.0 |
| Backend | MPS (Metal Performance Shaders) |

Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -U qwen-tts soundfile
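Before loading any models, it is worth a quick sanity check that PyTorch can actually see the MPS backend (this uses the standard PyTorch API):

import torch

print(torch.backends.mps.is_available())  # True on Apple Silicon with a recent PyTorch
print(torch.backends.mps.is_built())      # True if this PyTorch build includes MPS support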

FlashAttention is not available on Mac, but everything runs fine; you just get the following warning.

Warning: flash-attn is not installed. Will only run the manual PyTorch version.

Location of Model Files

The models are automatically downloaded during the first run and stored in the Hugging Face cache.

# Check cache
ls -la ~/.cache/huggingface/hub/ | grep -i qwen

| Model | Size |
|---|---|
| Qwen3-TTS-12Hz-0.6B-Base | 2.3GB |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 4.2GB |

If you use both, the total comes to about 6.5GB. If you only need voice cloning, the 0.6B-Base model at 2.3GB is enough.
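To check the actual disk usage, standard du on the same cache paths works:

# Show per-model disk usage in the Hugging Face cache
du -sh ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-*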

You can delete them when no longer needed (they will be automatically re-downloaded on the next run).

rm -rf ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-*

For the CUDA version, additional libraries like FlashAttention are required, but the Mac version is simpler to set up as these are not needed.

Verification Results

| Model | Result |
|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Works with float16 |
| Qwen3-TTS-12Hz-1.7B-Base | float32 is required |
| Qwen3-TTS-12Hz-0.6B-Base | float32 is required |

Important: The Voice Clone (Base model) requires float32. Using float16 will result in RuntimeError: probability tensor contains either inf, nan or element < 0.

Configuration Differences with CUDA

| Setting | CUDA (Official README) | Mac (MPS) |
|---|---|---|
| device_map | "cuda:0" | "mps" |
| dtype (VoiceDesign) | torch.bfloat16 | torch.float16 |
| dtype (Voice Clone) | torch.bfloat16 | torch.float32 |
| attn_implementation | "flash_attention_2" | omitted |
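As a minimal sketch, the table above maps onto the loading code like this (the CUDA variant follows the official README; the MPS variant is what worked in my tests):

import torch
from qwen_tts import Qwen3TTSModel

# CUDA (per the official README)
# model = Qwen3TTSModel.from_pretrained(
#     "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
#     device_map="cuda:0",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
# )

# Mac (MPS): float16 for VoiceDesign, float32 for the Base models,
# and no attn_implementation argument
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="mps",
    dtype=torch.float16,
)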

VoiceDesign: Designing Voices

The most important aspect I want to convey in this article is VoiceDesign.

Traditional TTS typically involved "choosing from prepared voices." VoiceDesign shifts towards "designing voices using natural language."

For example, you can define a voice with instructions like the following:

A calm young male voice. A bit low and gentle.
A young female. The voice is not too high, with a slight breathy quality. Modest and nervous.

With just this, a new voice is generated on the spot. Since it does not clone an existing voice, you can create rights-free voices.

Implementation Example (for Mac)

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load VoiceDesign model (Mac MPS)
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="mps",
    dtype=torch.float16,
)

# Test in Japanese
wavs, sr = model.generate_voice_design(
    text="こんにちは、これはMacでの音声合成テストです。",
    language="Japanese",
    instruct="落ち着いた若い男性の声。少し低めで穏やか。",
)

sf.write("test_voicedesign_ja.wav", wavs[0], sr)

When actually generated, the intonation in Japanese is quite natural. There is very little "synthetic voice" quality, making it pleasant to listen to.

How to Write Instructions

The instruct parameter controls the voice. Combining the following elements can be effective.

| Element | Description | Examples |
|---|---|---|
| Timbre | Voice quality | Low, high, husky, breathy |
| Emotion | Feelings | Calm, nervous, cheerful |
| Prosody | Rhythm and speed | Slow, fast, rising intonation |
| Character | Age and gender | Young female, middle-aged male, elderly |

# Calm narrator
instruct = "落ち着いた中年男性。声は低めで安定している。ゆっくり話す。"

# Cheerful young female
instruct = "元気な若い女性。声は明るく高め。テンポは早め。"

# Nervous character
instruct = "若い女性。声は高すぎず、少し息成分がある。控えめで緊張している。語尾は弱め。"
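To hear these presets side by side, you can loop over them and synthesize the same sentence with each. A small sketch that reuses the model and sf objects from the implementation example above:

# Generate the same line with each instruct for comparison
instructs = {
    "narrator": "落ち着いた中年男性。声は低めで安定している。ゆっくり話す。",
    "cheerful": "元気な若い女性。声は明るく高め。テンポは早め。",
    "nervous": "若い女性。声は高すぎず、少し息成分がある。控えめで緊張している。語尾は弱め。",
}

for name, instruct in instructs.items():
    wavs, sr = model.generate_voice_design(
        text="こんにちは、これはMacでの音声合成テストです。",
        language="Japanese",
        instruct=instruct,
    )
    sf.write(f"test_{name}.wav", wavs[0], sr)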

VoiceDesign → Clone Workflow

A drawback of VoiceDesign is that the voice can vary slightly even with the same instructions. That is fine for one-off narration, but when you generate many lines of dialogue for the same character, the drift accumulates into a "this sounds like a different person" feeling.

Therefore, a workflow that creates a reference voice with VoiceDesign and then clones it with the Base model for reuse is effective.

  1. Create a reference voice from the "voice blueprint" using VoiceDesign.
  2. Build a clone prompt from the reference audio and its transcript with create_voice_clone_prompt.
  3. From then on, use generate_voice_clone(..., voice_clone_prompt=...) for mass production.

Implementation Example (for Mac, Complete Version)

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Step 1: Generate reference voice with VoiceDesign
print("Step 1: Loading VoiceDesign model...")
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="mps",
    dtype=torch.float16,
)

ref_text = "えっと……その、ちょっとだけ相談があるんですが……。"
ref_instruct = (
    "若い女性。声は高すぎず、少し息成分がある。"
    "控えめで緊張しているが、話しているうちに少しずつ落ち着く。"
)

ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="Japanese",
    instruct=ref_instruct,
)
sf.write("ref_voice.wav", ref_wavs[0], sr)
print("Reference audio saved: ref_voice.wav")

# Free memory
del design_model
torch.mps.empty_cache()

# Step 2: Create voice prompt with Base model
print("Step 2: Loading Base model for cloning...")
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="mps",
    dtype=torch.float32,  # ⚠️ float32 is required!
)

voice_clone_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),
    ref_text=ref_text,
)

# Step 3: Generate speech with cloned voice
print("Step 3: Generating speech with cloned voice...")
lines = [
    "あ、ありがとうございます。少し安心しました。",  # "Oh, thank you. I feel a little relieved."
    "じゃあ、まず状況を整理してもいいですか?",  # "Then, can we start by going over the situation?"
    "……うん。たぶん、今なら言えます。",  # "...Yeah. I think I can say it now."
]

for i, line in enumerate(lines):
    wavs, sr = clone_model.generate_voice_clone(
        text=line,
        language="Japanese",
        voice_clone_prompt=voice_clone_prompt,
    )
    sf.write(f"clone_line_{i}.wav", wavs[0], sr)
    print(f"  clone_line_{i}.wav saved")

print("Done!")

With this, the voice created with VoiceDesign becomes a reusable asset, and you can produce consistent audio from it.
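To make that asset durable across sessions, one option is to store just the reference audio and transcript and rebuild the clone prompt at load time. I have not verified whether the prompt object returned by create_voice_clone_prompt is itself serializable, so this sketch persists only the raw inputs; the voice_asset/ paths are my own naming:

import json
import os

import soundfile as sf

os.makedirs("voice_asset", exist_ok=True)

# Save the reference assets produced by the workflow above
sf.write("voice_asset/ref_voice.wav", ref_wavs[0], sr)
with open("voice_asset/ref_text.json", "w", encoding="utf-8") as f:
    json.dump({"ref_text": ref_text, "language": "Japanese"}, f, ensure_ascii=False)

# Later, in another session: rebuild the prompt from the saved assets
audio, audio_sr = sf.read("voice_asset/ref_voice.wav")
with open("voice_asset/ref_text.json", encoding="utf-8") as f:
    meta = json.load(f)

voice_clone_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(audio, audio_sr),
    ref_text=meta["ref_text"],
)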

Choosing Between the 3-Second Clone and VoiceDesign

Qwen3-TTS also offers a feature to clone an existing voice from about 3 seconds of audio. Here is how I decide between the two approaches.

| Aspect | 3-Second Clone | VoiceDesign → Clone |
|---|---|---|
| Purpose | Reproduce an existing voice | Create a new voice |
| Input | Reference audio (+ transcript) | Design instructions (instruct) |
| Rights | Tied to the rights of the original speaker (permission is crucial) | Rights-free |
| Consistency | Depends on the quality of the reference audio | Stable once the clone prompt is fixed |

My conclusion:

  • If you want to reproduce someone else's voice, use the 3-second clone (but permission is essential).
  • If you want to create a character voice for your product, use VoiceDesign → Clone.

The voices created with VoiceDesign are "nobody's voice," which makes it easier to avoid copyright and likeness (portrait) rights issues.
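For completeness, here is what the 3-second clone path looks like when you start from an existing recording rather than a VoiceDesign output. A sketch using the same API as the examples above; sample.wav and its transcript are placeholders, and you need the speaker's permission:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="mps",
    dtype=torch.float32,  # float32 is required on MPS
)

# A few seconds of reference audio plus its exact transcript
ref_audio, ref_sr = sf.read("sample.wav")  # placeholder recording
ref_text = "サンプル音声の書き起こしテキスト。"  # placeholder transcript

prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_audio, ref_sr),
    ref_text=ref_text,
)

wavs, sr = clone_model.generate_voice_clone(
    text="この声で話させたいテキスト。",  # the text you want the cloned voice to speak
    language="Japanese",
    voice_clone_prompt=prompt,
)
sf.write("cloned.wav", wavs[0], sr)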

Addendum: Comparing Cloning Accuracy with VOICEVOX Zundamon

To get a sense of how faithfully the 3-second clone reproduces existing voices, I built a demo that clones five reference samples of VOICEVOX's Zundamon (Speaker ID: 3) and has each clone speak the same text.

You can see how much the cloning result varies with each reference recording, which makes it a useful check before committing to an implementation (if you plan to use it commercially, be sure to check the usage terms for VOICEVOX and the character).

Impressions of Japanese Quality

Here are the points that surprised me when I actually tried it:

  1. Natural intonation: There is little "synthetic voice" quality.
  2. Accurate reading of kanji: Almost no issues unless it's specialized terminology.
  3. Effective emotional expression: When specifying "nervous" in the instructions, it truly sounds that way.
  4. Natural pauses with punctuation: It handles "……" and "、" well.

Conclusion

After testing Qwen3-TTS on Apple Silicon M3, I was surprised by the following points:

  • High quality of Japanese: This is the first time I've encountered such natural speech synthesis in open-source.
  • Ability to create voices with VoiceDesign: You can design rights-free voices using natural language.
  • Works on Mac: Practical with MPS support, though the Base models are constrained to float32.

This tool feels like it could change the way we use TTS, shifting from "choosing voices" to "creating voices."

If you're interested, I encourage you to try it out for yourself. The experience of creating "your own voice" with VoiceDesign is quite fascinating.

