GeneLab_999

I Built a Voice Cloning GUI That Supports 10 Languages — Here's What I Learned Wrestling with CUDA on Windows

Have you ever recorded yourself speaking and thought, "I wish I could just type what I want to say and have my own voice read it back"?

That's exactly the rabbit hole I fell down when Alibaba dropped Qwen3-TTS — an open-source TTS model that can clone any voice from just 3 seconds of audio. Ten languages. 97ms latency. Apache 2.0 license. On paper, it was everything I'd ever wanted.

In practice? It assumed Linux. FlashAttention 2 (recommended) doesn't run on Windows. And voice cloning required you to manually transcribe your reference audio — which kind of defeats the purpose of a "quick clone" workflow.

So I did what any developer would do: I forked it.

What I Built

GitHub: hiroki-abe-58 / Qwen3-TTS-JP

Japanese GUI + Whisper auto-transcription for Qwen3-TTS. RTX 5090 tested.

Qwen3-TTS-JP

English | 日本語 | 中文 | 한국어 | Русский | Español | Italiano | Deutsch | Français | Português

A Windows-native fork of Qwen3-TTS with a modern, multilingual Web UI.

The original Qwen3-TTS was developed primarily for Linux environments, and FlashAttention 2 is recommended. However, FlashAttention 2 does not work on Windows. This fork enables direct execution on Windows without WSL2 or Docker, provides a modern Web UI supporting 10 languages, and adds automatic transcription via Whisper.

Mac (Apple Silicon) users: For the best experience on Mac, please use Qwen3-TTS-Mac-GeneLab -- fully optimized for Apple Silicon with MLX + PyTorch dual engine, 8bit/4bit quantization, and 10-language Web UI.

  • Custom Voice -- Speech synthesis with preset speakers
  • Voice Design -- Describe voice characteristics to synthesize
  • Voice Clone -- Clone voice from reference audio
  • Settings -- GPU / VRAM / Model information

Related Projects

Platform              Repository               Description
Windows/Linux         Qwen3-TTS-JP (this repo) Windows-native fork with 10-language Web UI
Mac (Apple Silicon)   Qwen3-TTS-Mac-GeneLab    MLX + PyTorch dual engine, 8bit/4bit quantization

Qwen3-TTS-JP started as a personal fix — a Japanese-localized fork with Whisper auto-transcription bolted on. But as people started using it, I realized the same pain points existed for developers everywhere. So I expanded it:

  • 10-language Web UI — Japanese, English, Chinese, Korean, German, French, Russian, Portuguese, Spanish, Italian. The UI auto-detects your browser locale.
  • Native Windows support — No WSL. No Docker. Just Python + CUDA.
  • Whisper auto-transcription — Upload 3 seconds of audio, Whisper handles the rest. Pick from 5 model sizes (tiny → large-v3) depending on your speed/accuracy tradeoff.
  • RTX 5090 (Blackwell) tested — I developed this on a Blackwell GPU, so sm_120 architecture is a first-class citizen.
  • Mac support — Apple Silicon users get a dedicated fork with MLX + PyTorch dual engine and 4bit/8bit quantization.

The Architecture in 30 Seconds

Qwen3-TTS isn't your typical TTS pipeline. Instead of the usual Text → LM → DiT → Audio cascade, it uses a discrete multi-codebook LM that goes straight from text to audio codes:

Traditional:  Text → Language Model → Intermediate Repr → DiT → Audio
Qwen3-TTS:    Text → Language Model → Audio Codes → Decoder → Audio

This bypasses the information bottleneck that makes most TTS systems sound robotic. The result is eerily human-sounding output — with emotion, prosody, and natural pauses all preserved.

The dual-track streaming architecture means it starts generating audio from the first character of input. That 97ms first-packet latency is real.

Getting It Running (It's Actually Easy Now)

git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP

python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/Mac
source .venv/bin/activate

pip install -e .
pip install faster-whisper

# RTX 30/40 series
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# RTX 50 series (Blackwell) — needs nightly
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
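Before launching anything, a quick sanity check (plain PyTorch, nothing fork-specific) confirms the install actually sees your GPU:

```python
# Quick sanity check that PyTorch sees your CUDA toolkit and GPU
import torch

def cuda_summary() -> dict:
    info = {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Only query the device name when a GPU is actually present
        info["device"] = torch.cuda.get_device_name(0)
    return info

print(cuda_summary())
```

If `cuda_available` comes back `False`, you installed a CPU-only wheel -- re-run the matching `--index-url` command above.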

Launch the GUI:

# Voice cloning mode
python -m qwen_tts.cli.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --ip 127.0.0.1 --port 7860 --no-flash-attn

Open http://127.0.0.1:7860. Done.

What You Can Actually Build With This

Here's where it gets interesting for developers. This isn't just a toy — the Python API is clean enough to integrate into real projects.

Voice Cloning in a Few Lines

from qwen_tts import Qwen3TTSModel
import torch, soundfile as sf

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0", dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_clone(
    text="This is my cloned voice. It only needed 3 seconds of audio.",
    language="English",
    ref_audio="my_voice.wav",       # 3 seconds is enough
    ref_text="Hello, testing.",      # Whisper can auto-generate this
)
sf.write("output.wav", wavs[0], sr)

Design a Voice From Scratch

No reference audio needed — just describe what you want:

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0", dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_design(
    text="Welcome back, adventurer. Your quest awaits.",
    language="English",
    instruct="Deep male voice, 45 years old, slight British accent, warm and commanding",
)

Cross-Lingual Cloning

Clone a voice in one language, generate speech in another. The model preserves the speaker's timbre across languages:

wavs, sr = model.generate_voice_clone(
    text="Bonjour, comment allez-vous aujourd'hui?",
    language="French",
    ref_audio="english_speaker.wav",
    ref_text="Hi, this is a test recording.",
)

Practical Use Cases I've Seen

Since releasing this fork, I've seen developers use it for:

  • Game dev — Generating NPC dialogue dynamically instead of recording thousands of audio files
  • Podcasting — Creating consistent intro/outro narration
  • Accessibility — Multilingual audio versions of documentation
  • Localization — Same voice, 10 languages, zero re-recording
  • Prototyping — Testing voice UX before hiring voice actors

GPU Compatibility

GPU             VRAM    Recommended Model       Status
RTX 5090        32GB    1.7B                    Tested & verified
RTX 4090        24GB    1.7B                    Works great
RTX 4070        12GB    0.6B or 1.7B (tight)    Works
RTX 3080        10GB    0.6B                    Works
Apple Silicon   16GB+   Via Mac fork            MLX optimized

If you're VRAM-constrained, the 0.6B model is surprisingly capable — and FlashAttention 2 can help on Linux:

pip install flash-attn --no-build-isolation

Things I Learned the Hard Way

A few gotchas from building this that might save you time:

Windows cp932 encoding hell. Japanese Windows defaults to cp932 encoding, which chokes on Unicode output from the model. The fix is wrapping stdout/stderr:

import sys, io

# Re-wrap stdout/stderr as UTF-8 so Unicode model output
# doesn't crash on cp932 consoles
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', errors='replace')

FlashAttention 2 doesn't compile on Windows. The solution is using PyTorch's built-in SDPA (Scaled Dot Product Attention) via --no-flash-attn. Performance hit is minimal for single-user inference.
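That SDPA path is just PyTorch's built-in torch.nn.functional.scaled_dot_product_attention, which dispatches to the fastest kernel your platform supports. A minimal sketch (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative shapes
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# PyTorch picks flash, memory-efficient, or plain math kernels
# depending on platform support -- no flash-attn wheel needed
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```

On Windows you simply never get the flash backend; the math/memory-efficient fallbacks are what `--no-flash-attn` relies on.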

Blackwell (sm_120) needs nightly PyTorch. As of early 2026, stable PyTorch doesn't support RTX 50-series. Nightly builds with cu128 work, but you'll see warnings about torchao version mismatches. They're cosmetic — ignore them.
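One way to confirm your nightly build actually ships Blackwell kernels (stock PyTorch, nothing fork-specific):

```python
import torch

# Lists the compute capabilities the installed wheel was compiled for;
# a Blackwell-ready build includes 'sm_120' (empty list on CPU-only wheels)
arch_list = torch.cuda.get_arch_list()
print(arch_list)
```

If `sm_120` is missing from the output, you're on a stable wheel and the model will fail at kernel launch time.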

SoX is optional. The model prints warnings about missing SoX, but it works fine without it. Don't waste time installing it on Windows.

What's Next

I'm currently exploring:

  • vLLM integration for production-grade serving
  • Fine-tuning workflows for custom voice models
  • Streaming WebSocket API for real-time applications

Try It

If you're working on anything voice-related — games, accessibility, content creation, or just want to mess around with state-of-the-art TTS — give it a spin:

Windows/Linux: Qwen3-TTS-JP
Mac (Apple Silicon): Qwen3-TTS-Mac-GeneLab

Stars are appreciated — they help other developers find the project.


I'm curious: What would you build with 3-second voice cloning? Drop your ideas in the comments — I'd love to hear what use cases I haven't thought of yet.

Ethical Note

Voice cloning is powerful tech. Please use it responsibly — clone only with consent, disclose AI-generated audio, and don't use it for fraud or impersonation. The Apache 2.0 license gives you freedom, but with great power... you know the rest.
