GeneLab_999

Run Qwen3-TTS on Windows with RTX 5090: The Complete Guide to Voice Cloning in 3 Seconds

Clone any voice with just 3 seconds of audio, now with native Windows support and the latest Blackwell GPUs

TL;DR

Qwen3-TTS-JP is a fork of Alibaba's Qwen3-TTS that adds:

  • Native Windows support (no WSL required!)
  • RTX 5090 / Blackwell GPU tested and optimized
  • Auto-transcription via Whisper integration
  • Localized GUI (Japanese, easy to adapt)

GitHub: hiroki-abe-58 / Qwen3-TTS-JP

Japanese GUI + Whisper auto-transcription for Qwen3-TTS. RTX 5090 tested.

Qwen3-TTS-JP


A Windows-native fork of Qwen3-TTS with a modern, multilingual Web UI.

The original Qwen3-TTS was developed primarily for Linux environments, and FlashAttention 2 is recommended. However, FlashAttention 2 does not work on Windows. This fork enables direct execution on Windows without WSL2 or Docker, provides a modern Web UI supporting 10 languages, and adds automatic transcription via Whisper.

Mac (Apple Silicon) users: For the best experience on Mac, please use Qwen3-TTS-Mac-GeneLab -- fully optimized for Apple Silicon with MLX + PyTorch dual engine, 8bit/4bit quantization, and 10-language Web UI.

Custom Voice -- Speech synthesis with preset speakers

Voice Design -- Describe voice characteristics to synthesize

Voice Clone -- Clone voice from reference audio

Settings -- GPU / VRAM / Model information

Related Projects

| Platform | Repository | Description |
|----------|------------|-------------|
| Windows | Qwen3-TTS-JP (this repo) | Japanese GUI + Whisper auto-transcription, RTX 5090 tested |
| Mac (Apple Silicon) | Qwen3-TTS-Mac-GeneLab | MLX + PyTorch dual engine, 8bit/4bit quantization |
The Problem: Getting Qwen3-TTS Running on Windows

When Alibaba released Qwen3-TTS in January 2026, the AI community was amazed: 3 seconds of reference audio is all you need to clone a voice. Ten languages supported, 97ms latency, emotion control — impressive specs on paper.

But there was a catch.

The official repo assumed Linux. CUDA setup was finicky. And if you wanted to use the voice cloning feature, you had to manually transcribe your reference audio — defeating the purpose of a quick workflow.

I'd just upgraded to an RTX 5090 (Blackwell architecture), eager to push local AI to its limits. After days of wrestling with environments, I got it working and decided to package the solution for everyone else.


What Makes This Fork Different?

1. Native Windows Support

No WSL required. No Docker required (though you can use it if you prefer). Just Python, CUDA, and you're good to go.

# Clone the repo
git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate

# Install
pip install -e .
pip install faster-whisper

That's it. Works on Windows 10/11 with any CUDA-capable GPU (RTX 30/40/50 series).

2. RTX 5090 (Blackwell) Tested

This fork was developed and tested on an RTX 5090. The latest CUDA 12.x with Blackwell architecture can be tricky — many AI repos break on it. This one doesn't.

| GPU | VRAM | Model | Status |
|----------|-------|------|----------|
| RTX 5090 | 32 GB | 1.7B | ✅ Works |
| RTX 4090 | 24 GB | 1.7B | ✅ Works |
| RTX 3080 | 10 GB | 0.6B | ✅ Works |
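As a rough rule of thumb, you can pick the checkpoint size from the available VRAM. Note that the 16 GB cutoff below is my own assumption read off the table, not an official requirement:

```python
def pick_model_size(vram_gb: float) -> str:
    """Suggest a Qwen3-TTS checkpoint size for the available VRAM.

    The 16 GB cutoff is an assumption based on the table above:
    a 10 GB RTX 3080 ran the 0.6B model; 24 GB+ cards ran 1.7B.
    """
    return "1.7B" if vram_gb >= 16 else "0.6B"

# On a live system, VRAM in GB is roughly
# torch.cuda.get_device_properties(0).total_memory / 1024**3
print(pick_model_size(32))  # RTX 5090 -> 1.7B
print(pick_model_size(10))  # RTX 3080 -> 0.6B
```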

3. Whisper Auto-Transcription

The original Qwen3-TTS requires you to provide the transcript of your reference audio. This fork integrates faster-whisper to do it automatically:

  1. Upload 3 seconds of audio
  2. Whisper transcribes it
  3. Qwen3-TTS clones the voice

No manual typing. Choose from 5 Whisper models:

| Model | Params | Speed | Accuracy |
|----------|------|-------|-------|
| tiny | 39M | ⚡⚡⚡⚡⚡ | ★★ |
| small | 244M | ⚡⚡⚡ | ★★★★ |
| large-v3 | 1.5B | | ★★★★★ |
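Under the hood, the transcription step looks roughly like this with faster-whisper. The helper names and the `compute_type` choice here are my own sketch, not the fork's actual code:

```python
def join_segments(texts) -> str:
    # faster-whisper yields segments lazily; strip and join their text.
    return " ".join(t.strip() for t in texts if t.strip())

def transcribe_reference(path: str, model_size: str = "small") -> str:
    # Imported inside the function so join_segments stays usable
    # even without the faster-whisper package installed.
    from faster_whisper import WhisperModel

    # float16 assumes a CUDA GPU; use compute_type="int8" on CPU.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path)
    return join_segments(seg.text for seg in segments)
```

The returned string is then passed to Qwen3-TTS as the `ref_text` for cloning.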

Quick Start

Launch the GUI

python -m qwen_tts.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --ip 0.0.0.0 --port 8000

Open http://localhost:8000 in your browser.

Python API

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# Clone a voice with 3-second reference
wavs, sr = model.generate_voice_clone(
    text="This is my cloned voice speaking!",
    language="English",
    ref_audio="reference.wav",
    ref_text="Hello, this is a test.",
)
sf.write("output.wav", wavs[0], sr)

Use Cases

Content Creators

Clone your own voice for consistent narration across videos.
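For long narration scripts, it can help to split the text into sentence-sized chunks and synthesize each one separately. This splitter is a naive sketch of my own, not part of the fork:

```python
import re

def split_script(text: str, max_chars: int = 200) -> list:
    """Split a narration script into chunks of up to max_chars characters."""
    # Naive sentence boundary: ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to model.generate_voice_clone(text=chunk, ...)
```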

Game Developers

Create character voices without expensive voice actors:

model.generate_voice_design(
    text="Hero, your quest awaits!",
    language="English",
    instruct="Deep male voice, 40 years old, British accent"
)

Podcasters

Quick voice-over generation for intros and outros.


Supported Languages

🇨🇳 Chinese | 🇺🇸 English | 🇯🇵 Japanese | 🇰🇷 Korean | 🇩🇪 German | 🇫🇷 French | 🇷🇺 Russian | 🇧🇷 Portuguese | 🇪🇸 Spanish | 🇮🇹 Italian


Troubleshooting

| Error | Fix |
|-------|-----|
| CUDA out of memory | Use the 0.6B model (FlashAttention 2 can also help, but only on Linux) |
| faster-whisper not found | `pip install faster-whisper` |

Ethical Note

Voice cloning is powerful. Please:

  • Only clone voices with consent
  • Don't use for fraud or misinformation
  • Disclose AI-generated audio

If this helped, please ⭐ the repo!

Questions? Drop a comment below! 👇
