Run Qwen3-TTS on Windows with RTX 5090: The Complete Guide to Voice Cloning in 3 Seconds
Clone any voice with just 3 seconds of audio — now with native Windows support and the latest Blackwell GPUs
TL;DR
Qwen3-TTS-JP is a fork of Alibaba's Qwen3-TTS that adds:
- ✅ Native Windows support (no WSL required!)
- ✅ RTX 5090 / Blackwell GPU tested and optimized
- ✅ Auto-transcription via Whisper integration
- ✅ Localized GUI (Japanese, easy to adapt)
GitHub: hiroki-abe-58/Qwen3-TTS-JP
Japanese GUI + Whisper auto-transcription for Qwen3-TTS. RTX 5090 tested.
A Japanese-localized fork of Qwen3-TTS with native Windows support.
The original Qwen3-TTS was developed for Linux and recommends FlashAttention 2, which does not run on Windows. This fork adds what is needed to run directly on Windows without WSL2 or Docker, fully localizes the GUI into Japanese, and integrates automatic transcription via Whisper.
Features
Native Windows support
- No FlashAttention 2 required: the --no-flash-attn option falls back to PyTorch's built-in SDPA (Scaled Dot Product Attention); a short sketch of this fallback follows this list
- No WSL2/Docker: runs directly on Windows
- RTX 50 series support: includes setup steps for the PyTorch nightly build targeting the NVIDIA Blackwell architecture (sm_120)
- No hard SoX dependency: works without SoX (a warning is printed but can be ignored)
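For the curious, the fallback is simply PyTorch's native attention kernel; a minimal sketch of the call that stands in for FlashAttention 2, using illustrative tensor shapes rather than the model's real ones:
# What --no-flash-attn amounts to: PyTorch's scaled_dot_product_attention
# picks the best available backend on its own, no FlashAttention 2 needed.
import torch
import torch.nn.functional as F
# Illustrative shapes: (batch, heads, sequence, head_dim)
q = torch.randn(1, 8, 64, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 64, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 64, 128, device="cuda", dtype=torch.bfloat16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)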
Japanese localization & extra features
- Fully Japanese GUI: labels, buttons, placeholders, error messages, and the disclaimer are all localized
- Whisper auto-transcription: automates the reference-text input for voice cloning (uses faster-whisper); see the sketch after this list
- Whisper model selection: choose from 5 models depending on your needs
  - tiny: fastest, smallest (39M parameters)
  - base: fast (74M parameters)
  - small: balanced (244M parameters), the default
  - medium: high accuracy (769M parameters)
  - large-v3: highest accuracy (1550M parameters)
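The transcription step itself is only a few lines with faster-whisper. A minimal sketch, assuming a CUDA-capable machine and a local reference.wav (variable names are illustrative):
# Transcribe the reference audio with faster-whisper ("small" is the default above)
from faster_whisper import WhisperModel
whisper = WhisperModel("small", device="cuda", compute_type="float16")
segments, info = whisper.transcribe("reference.wav")
ref_text = " ".join(segment.text.strip() for segment in segments)
print(info.language, ref_text)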
Requirements
- OS: Windows 10/11 (native, no WSL2)
- GPU: an NVIDIA GPU with CUDA support
  - RTX 30/40 series: works with the stable PyTorch release
  - RTX 50 series (Blackwell): requires a PyTorch nightly build (cu128)
- Python: 3.10 or later
- VRAM: 8GB or more recommended (varies with model size)
Installation
1. Clone the repository
git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP
2. Create and activate a virtual environment
python -m venv .venv
.venv\Scripts\activate
3. Install the dependencies
pip install -e .
pip install faster-whisper
4. Install PyTorch (CUDA build)
Install the build that matches your CUDA version.
# For CUDA 12.x
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# RTX 50 series (sm_120) requires a nightly build
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
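A quick sanity check after the install is worth the ten seconds; run this inside the venv (the expected capability value for Blackwell is inferred from sm_120):
# Verify that PyTorch sees the GPU and the right compute capability
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_capability(0))"
# An RTX 50 series card should report (12, 0); if is_available() is False,
# the installed wheel does not match your CUDA setup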
Usage
Launching the GUI
Launch from the command line:
# CustomVoice model (preset speakers)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 127.0.0.1…
The Problem: Getting Qwen3-TTS Running on Windows
When Alibaba released Qwen3-TTS in January 2026, the AI community was amazed: 3 seconds of reference audio is all you need to clone a voice. Ten languages supported, 97ms latency, emotion control — impressive specs on paper.
But there was a catch.
The official repo assumed Linux. CUDA setup was finicky. And if you wanted to use the voice cloning feature, you had to manually transcribe your reference audio — defeating the purpose of a quick workflow.
I'd just upgraded to an RTX 5090 (Blackwell architecture), eager to push local AI to its limits. After days of wrestling with environments, I got it working and decided to package the solution for everyone else.
What Makes This Fork Different?
1. Native Windows Support
No WSL. No Docker required. Just Python, CUDA, and you're good to go.
# Clone the repo
git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP
# Create virtual environment
python -m venv .venv
.venv\Scripts\activate
# Install
pip install -e .
pip install faster-whisper
That's it. Works on Windows 10/11 with any CUDA-capable GPU (RTX 30/40/50 series).
2. RTX 5090 (Blackwell) Tested
This fork was developed and tested on an RTX 5090. Blackwell (sm_120) support can be tricky: it needs a PyTorch nightly cu128 build, and many AI repos break on it. This one doesn't.
| GPU | VRAM | Model | Status |
|---|---|---|---|
| RTX 5090 | 32GB | 1.7B | ✅ Works |
| RTX 4090 | 24GB | 1.7B | ✅ Works |
| RTX 3080 | 10GB | 0.6B | ✅ Works |
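If you are unsure which checkpoint fits your card, a rough check like the one below works. Note that the 0.6B model id and the 16GB cutoff are assumptions extrapolated from the table above, not values from the repo:
# Pick a model size from available VRAM (thresholds inferred from the table above)
import torch
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
# Hypothetical 0.6B checkpoint id, mirroring the 1.7B naming; check the Hub first
model_id = ("Qwen/Qwen3-TTS-12Hz-1.7B-Base" if vram_gb >= 16
            else "Qwen/Qwen3-TTS-12Hz-0.6B-Base")
print(f"{vram_gb:.0f}GB VRAM -> {model_id}")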
3. Whisper Auto-Transcription
The original Qwen3-TTS requires you to provide the transcript of your reference audio. This fork integrates faster-whisper to do it automatically:
- Upload 3 seconds of audio
- Whisper transcribes it
- Qwen3-TTS clones the voice
No manual typing. Choose from 5 Whisper models:
| Model | Params | Speed | Accuracy |
|---|---|---|---|
| tiny | 39M | ⚡⚡⚡⚡⚡ | ★★ |
| base | 74M | ⚡⚡⚡⚡ | ★★★ |
| small (default) | 244M | ⚡⚡⚡ | ★★★★ |
| medium | 769M | ⚡⚡ | ★★★★ |
| large-v3 | 1.5B | ⚡ | ★★★★★ |
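Outside the GUI, the same three-step flow is easy to script. A minimal sketch combining the two libraries, reusing the model id and calls shown in the Python API section below:
# End-to-end: transcribe the reference clip, then clone the voice with it
import torch
import soundfile as sf
from faster_whisper import WhisperModel
from qwen_tts import Qwen3TTSModel
whisper = WhisperModel("small", device="cuda", compute_type="float16")
segments, _ = whisper.transcribe("reference.wav")
ref_text = " ".join(s.text.strip() for s in segments)  # auto-transcribed reference text
tts = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)
wavs, sr = tts.generate_voice_clone(
    text="No manual typing needed!",
    language="English",
    ref_audio="reference.wav",
    ref_text=ref_text,
)
sf.write("cloned.wav", wavs[0], sr)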
Quick Start
Launch the GUI
python -m qwen_tts.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Open http://localhost:8000 in your browser.
Python API
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
"Qwen/Qwen3-TTS-12Hz-1.7B-Base",
device_map="cuda:0",
dtype=torch.bfloat16,
)
# Clone a voice with 3-second reference
wavs, sr = model.generate_voice_clone(
text="This is my cloned voice speaking!",
language="English",
ref_audio="reference.wav",
ref_text="Hello, this is a test.",
)
sf.write("output.wav", wavs[0], sr)
Use Cases
Content Creators
Clone your own voice for consistent narration across videos.
Game Developers
Create character voices without expensive voice actors:
# Assuming the same (wavs, sr) return as generate_voice_clone above
wavs, sr = model.generate_voice_design(
    text="Hero, your quest awaits!",
    language="English",
    instruct="Deep male voice, 40 years old, British accent",
)
sf.write("hero.wav", wavs[0], sr)
Podcasters
Quick voice-over generation for intros and outros.
Supported Languages
🇨🇳 Chinese | 🇺🇸 English | 🇯🇵 Japanese | 🇰🇷 Korean | 🇩🇪 German | 🇫🇷 French | 🇷🇺 Russian | 🇧🇷 Portuguese | 🇪🇸 Spanish | 🇮🇹 Italian
Troubleshooting
| Error | Fix |
|---|---|
| CUDA out of memory | Use the 0.6B model (FlashAttention 2 would help, but not on Windows) |
| faster-whisper not found | pip install faster-whisper |
Ethical Note
Voice cloning is powerful. Please:
- Only clone voices with consent
- Don't use for fraud or misinformation
- Disclose AI-generated audio
Links
- GitHub: https://github.com/hiroki-abe-58/Qwen3-TTS-JP
If this helped, please ⭐ the repo!
Questions? Drop a comment below! 👇
