Run Qwen3-TTS on Windows with RTX 5090: The Complete Guide to Voice Cloning in 3 Seconds
Clone any voice with just 3 seconds of audio — now with native Windows support and the latest Blackwell GPUs
TL;DR
Qwen3-TTS-JP is a fork of Alibaba's Qwen3-TTS that adds:
- ✅ Native Windows support (no WSL required!)
- ✅ RTX 5090 / Blackwell GPU tested and optimized
- ✅ Auto-transcription via Whisper integration
- ✅ Localized GUI (Japanese, easy to adapt)
hiroki-abe-58 / Qwen3-TTS-JP -- Japanese GUI + Whisper auto-transcription for Qwen3-TTS. RTX 5090 tested.
A Windows-native fork of Qwen3-TTS with a modern, multilingual Web UI.
The original Qwen3-TTS was developed primarily for Linux environments, where FlashAttention 2 is the recommended attention backend; FlashAttention 2, however, does not work on Windows. This fork runs directly on Windows without WSL2 or Docker, provides a modern Web UI supporting 10 languages, and adds automatic transcription via Whisper.
Mac (Apple Silicon) users: For the best experience on Mac, please use Qwen3-TTS-Mac-GeneLab -- fully optimized for Apple Silicon with MLX + PyTorch dual engine, 8bit/4bit quantization, and 10-language Web UI.
- Custom Voice -- Speech synthesis with preset speakers
- Voice Design -- Describe voice characteristics to synthesize
- Voice Clone -- Clone voice from reference audio
- Settings -- GPU / VRAM / Model information
Related Projects

| Platform | Repository | Description |
|---|---|---|
| Windows | Qwen3-TTS-JP (this repo) | Japanese GUI + Whisper auto-transcription; RTX 5090 tested |
| Mac (Apple Silicon) | Qwen3-TTS-Mac-GeneLab | MLX + PyTorch dual engine, 8bit/4bit quantization, 10-language Web UI |
The Problem: Getting Qwen3-TTS Running on Windows
When Alibaba released Qwen3-TTS in January 2026, the AI community was amazed: 3 seconds of reference audio is all you need to clone a voice. Ten languages supported, 97ms latency, emotion control — impressive specs on paper.
But there was a catch.
The official repo assumed Linux. CUDA setup was finicky. And if you wanted to use the voice cloning feature, you had to manually transcribe your reference audio — defeating the purpose of a quick workflow.
I'd just upgraded to an RTX 5090 (Blackwell architecture), eager to push local AI to its limits. After days of wrestling with environments, I got it working and decided to package the solution for everyone else.
What Makes This Fork Different?
1. Native Windows Support
No WSL. No Docker required (though you can use it if you prefer). Just Python, CUDA, and you're good to go.
```shell
# Clone the repo
git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate

# Install
pip install -e .
pip install faster-whisper
```
That's it. Works on Windows 10/11 with any CUDA-capable GPU (RTX 30/40/50 series).
2. RTX 5090 (Blackwell) Tested
This fork was developed and tested on an RTX 5090. The latest CUDA 12.x with Blackwell architecture can be tricky — many AI repos break on it. This one doesn't.
| GPU | VRAM | Model | Status |
|---|---|---|---|
| RTX 5090 | 32GB | 1.7B | ✅ Works |
| RTX 4090 | 24GB | 1.7B | ✅ Works |
| RTX 3080 | 10GB | 0.6B | ✅ Works |
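Reading the table above as a rule of thumb, the model choice can be automated from available VRAM. This is only a sketch, not part of the fork's API; the 16 GB cutoff is an assumption (the table only confirms 1.7B on 24 GB+ cards and 0.6B on a 10 GB RTX 3080):

```python
def pick_model(vram_gb: float) -> str:
    """Pick a Qwen3-TTS model size from available VRAM (heuristic)."""
    # 16 GB cutoff is an assumption; the table confirms 1.7B on
    # 24/32 GB cards and 0.6B on a 10 GB RTX 3080.
    return "1.7B" if vram_gb >= 16 else "0.6B"

def detect_vram_gb() -> float:
    """Query total VRAM of the first CUDA device, in GB."""
    import torch  # lazy import so pick_model() works without CUDA
    props = torch.cuda.get_device_properties(0)
    return props.total_memory / 1024**3
```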
3. Whisper Auto-Transcription
The original Qwen3-TTS requires you to provide the transcript of your reference audio. This fork integrates faster-whisper to do it automatically:
- Upload 3 seconds of audio
- Whisper transcribes it
- Qwen3-TTS clones the voice
No manual typing. Choose from 5 Whisper models:
| Model | Params | Speed | Accuracy |
|---|---|---|---|
| tiny | 39M | ⚡⚡⚡⚡⚡ | ★★ |
| small | 244M | ⚡⚡⚡ | ★★★★ |
| large-v3 | 1.5B | ⚡ | ★★★★★ |
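Under the hood, the auto-transcription step maps to a few lines of faster-whisper. A minimal sketch (the model size, device settings, and file path are illustrative; the fork's actual wiring may differ):

```python
def join_segments(texts) -> str:
    """Concatenate Whisper segment texts into one transcript string."""
    return " ".join(t.strip() for t in texts)

def transcribe(path: str, model_size: str = "small") -> str:
    """Transcribe a reference clip with faster-whisper."""
    from faster_whisper import WhisperModel  # lazy import
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # transcribe() returns a lazy generator of segments plus metadata
    segments, _info = model.transcribe(path)
    return join_segments(seg.text for seg in segments)

if __name__ == "__main__":
    print(transcribe("reference.wav"))
```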
Quick Start
Launch the GUI
```shell
python -m qwen_tts.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --ip 0.0.0.0 --port 8000
```
Open http://localhost:8000 in your browser.
Python API
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# Clone a voice with 3-second reference
wavs, sr = model.generate_voice_clone(
    text="This is my cloned voice speaking!",
    language="English",
    ref_audio="reference.wav",
    ref_text="Hello, this is a test.",
)
sf.write("output.wav", wavs[0], sr)
```
Use Cases
Content Creators
Clone your own voice for consistent narration across videos.
Game Developers
Create character voices without expensive voice actors:
```python
model.generate_voice_design(
    text="Hero, your quest awaits!",
    language="English",
    instruct="Deep male voice, 40 years old, British accent",
)
```
Podcasters
Quick voice-over generation for intros and outros.
Supported Languages
🇨🇳 Chinese | 🇺🇸 English | 🇯🇵 Japanese | 🇰🇷 Korean | 🇩🇪 German | 🇫🇷 French | 🇷🇺 Russian | 🇧🇷 Portuguese | 🇪🇸 Spanish | 🇮🇹 Italian
Troubleshooting
| Error | Fix |
|---|---|
| `CUDA out of memory` | Use the 0.6B model or add FlashAttention 2 |
| `faster-whisper not found` | `pip install faster-whisper` |
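The out-of-memory row can also be handled in code: try the 1.7B model first and retry with 0.6B if CUDA runs out of memory. A sketch, assuming your generation call surfaces PyTorch's usual `RuntimeError` on OOM (the `generate` callable here is hypothetical, standing in for whatever loads and runs a given model size):

```python
def generate_with_fallback(generate, primary="1.7B", fallback="0.6B"):
    """Call generate(model_size); retry with a smaller model on CUDA OOM."""
    try:
        return generate(primary)
    except RuntimeError as e:
        # PyTorch reports CUDA OOM as a RuntimeError containing this phrase
        if "out of memory" in str(e).lower():
            return generate(fallback)
        raise
```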
Ethical Note
Voice cloning is powerful. Please:
- Only clone voices with consent
- Don't use for fraud or misinformation
- Disclose AI-generated audio
Links
If this helped, please ⭐ the repo!
Questions? Drop a comment below! 👇