<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GeneLab_999</title>
    <description>The latest articles on DEV Community by GeneLab_999 (@genelab_999).</description>
    <link>https://dev.to/genelab_999</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3743545%2Ffccaf9b5-3165-4db8-9ec2-2ef6aeb18202.png</url>
      <title>DEV Community: GeneLab_999</title>
      <link>https://dev.to/genelab_999</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/genelab_999"/>
    <language>en</language>
    <item>
      <title>🛠️ I Built a One-Click ComfyUI Setup for RTX 5090 on Windows — No WSL2, No Docker</title>
      <dc:creator>GeneLab_999</dc:creator>
      <pubDate>Mon, 02 Mar 2026 15:18:28 +0000</pubDate>
      <link>https://dev.to/genelab_999/i-built-a-one-click-comfyui-setup-for-rtx-5090-on-windows-no-wsl2-no-docker-4n1i</link>
      <guid>https://dev.to/genelab_999/i-built-a-one-click-comfyui-setup-for-rtx-5090-on-windows-no-wsl2-no-docker-4n1i</guid>
      <description>&lt;p&gt;I bought an RTX 5090. 32GB VRAM. The most powerful consumer GPU on the planet.&lt;/p&gt;

&lt;p&gt;Then I tried to run ComfyUI on Windows. It broke immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: sm_120 is not compatible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Three days later, I had a fully working solution. I packaged it and open-sourced it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell" rel="noopener noreferrer"&gt;ComfyUI-Win-Blackwell&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the whole story.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why RTX 50-series Breaks Everything
&lt;/h2&gt;

&lt;p&gt;NVIDIA's Blackwell architecture (RTX 5090/5080/5070) reports a new CUDA compute capability, &lt;code&gt;sm_120&lt;/code&gt;. The problem? PyTorch's stable release doesn't ship kernels for it.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pip install torch&lt;/code&gt; → &lt;strong&gt;doesn't work on Blackwell&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You need PyTorch nightly with CUDA 13.0 (cu130)&lt;/li&gt;
&lt;li&gt;But then &lt;strong&gt;xformers&lt;/strong&gt; (the standard ComfyUI speed boost) forces PyTorch back to stable&lt;/li&gt;
&lt;li&gt;And custom nodes silently pull stable PyTorch through their dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a dependency trap. Every fix creates a new problem.&lt;/p&gt;
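Here's the trap in miniature as a pure-Python predicate. This is a sketch: the function name is mine and the version strings are illustrative -- on a real install you'd read `torch.__version__` and `torch.cuda.get_arch_list()` from the torch you actually have, instead of passing strings in.

```python
# Sketch: detect the Blackwell dependency trap from a version string and a
# kernel arch list. Illustrative only -- on a real machine, query the
# installed torch for torch.__version__ and torch.cuda.get_arch_list().

def blackwell_ready(torch_version, arch_list):
    """True only for a cu130 build that actually ships sm_120 kernels."""
    return torch_version.endswith("+cu130") and "sm_120" in arch_list

# A stable wheel silently pulled in by a custom node fails the check:
print(blackwell_ready("2.6.0+cu124", ["sm_80", "sm_90"]))               # False
# A cu130 nightly with Blackwell kernels passes:
print(blackwell_ready("2.9.0.dev20260301+cu130", ["sm_90", "sm_120"]))  # True
```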


&lt;h2&gt;
  
  
  The 5 Rules I Discovered
&lt;/h2&gt;

&lt;p&gt;After 3 days of trial and error, I distilled everything into 5 rules. Break any one of them and your environment dies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: Use PyTorch nightly cu130&lt;/strong&gt; — stable doesn't have sm_120 kernels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2: Never install xformers&lt;/strong&gt; — it force-downgrades PyTorch to stable. This is the trap that got me twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 3: Strip &lt;code&gt;torch&lt;/code&gt; from every requirements.txt&lt;/strong&gt; — custom nodes list torch as a dependency, and pip will happily replace your nightly build with stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 4: Verify PyTorch after every custom node install&lt;/strong&gt; — run &lt;code&gt;python -c "import torch; print(torch.__version__)"&lt;/code&gt; and check that it still says &lt;code&gt;cu130&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 5: Clear proxy environment variables&lt;/strong&gt; — system proxies block pip and git silently on Windows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip:&lt;/strong&gt; Rule 2 was the hardest to figure out. xformers installs &lt;em&gt;successfully&lt;/em&gt;, ComfyUI &lt;em&gt;starts&lt;/em&gt; fine, and then crashes mid-inference with &lt;code&gt;sm_120 is not compatible&lt;/code&gt;. You don't even realize PyTorch was downgraded until you check the version.&lt;/p&gt;
&lt;/blockquote&gt;
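Rule 3 is easy to automate. Here's a minimal sketch of the stripping step -- the `TORCH_FAMILY` set and the parsing are simplified on purpose (no extras, no URLs, no version ranges), so treat it as the idea rather than a full requirements parser:

```python
# Sketch of Rule 3: drop torch-family pins from a requirements.txt before
# handing it to pip, so pip cannot replace the nightly build.

TORCH_FAMILY = {"torch", "torchvision", "torchaudio", "xformers"}

def strip_torch(lines):
    """Return the requirement lines whose package is not in TORCH_FAMILY."""
    kept = []
    for line in lines:
        name = line.strip().lower().split(";")[0]   # drop environment markers
        for sep in ("==", "~=", "!=", " "):         # crude version-pin stripping
            name = name.split(sep)[0]
        if name not in TORCH_FAMILY:
            kept.append(line)
    return kept

reqs = ["numpy", "torch==2.6.0", "torchvision", "einops", "xformers"]
print(strip_torch(reqs))  # ['numpy', 'einops']
```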


&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I automated all 5 rules into a one-click setup:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI-Win-Blackwell
&lt;span class="c"&gt;# Double-click setup.bat — done in ~20 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;What &lt;code&gt;setup.bat&lt;/code&gt; handles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.13 environment&lt;/li&gt;
&lt;li&gt;PyTorch nightly cu130 (not stable, not cu128)&lt;/li&gt;
&lt;li&gt;triton-windows + torch.compile (replaces xformers)&lt;/li&gt;
&lt;li&gt;ComfyUI core + custom dependencies (with torch stripped out)&lt;/li&gt;
&lt;li&gt;28 verified custom nodes&lt;/li&gt;
&lt;li&gt;Post-install verification that cu130 is still intact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also built companion tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;verify_env.py&lt;/code&gt; — Blackwell-specific environment checker (sm_120, cu130, Triton, torch.compile)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fix_windows_compat.py&lt;/code&gt; — Converts Linux workflow JSON paths to Windows format, replaces SageAttention with SDPA&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;update.bat&lt;/code&gt; — Updates everything while preserving Blackwell compatibility&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  What I Verified
&lt;/h2&gt;

&lt;p&gt;I tested 28 custom nodes one by one. Install → check PyTorch version → run test → record result. That was the most tedious part.&lt;/p&gt;
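The loop itself is trivial; the discipline is checking the environment after every single install. A sketch of the bookkeeping, with `check_env` standing in for the real `python -c "import torch; print(torch.__version__)"` probe, and node names and version strings purely illustrative:

```python
# Sketch of the per-node verification loop. check_env stands in for the real
# probe (run the torch version check inside the environment); the node names
# and version strings below are illustrative, not my actual test log.

def check_env(torch_version):
    """The environment is intact only while the build tag still says cu130."""
    return "cu130" in torch_version

def verify_nodes(nodes, version_after_install):
    """Record OK / BROKEN ENV per node from the torch build it left behind."""
    results = {}
    for node in nodes:
        ok = check_env(version_after_install[node])
        results[node] = "OK" if ok else "BROKEN ENV"
    return results

report = verify_nodes(
    ["NodeA", "NodeB"],
    {"NodeA": "2.9.0.dev20260301+cu130",   # nightly survived the install
     "NodeB": "2.6.0+cu124"},              # a dependency downgraded torch
)
print(report)  # {'NodeA': 'OK', 'NodeB': 'BROKEN ENV'}
```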

&lt;p&gt;I also tested 5 Image-to-Video pipelines on 32GB VRAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HunyuanVideo 1.5 I2V&lt;/strong&gt; (8.3B params, ~16GB) — Smooth. My top recommendation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kandinsky 5.0 Lite I2V&lt;/strong&gt; (2B, ~4GB) — Very smooth. Great for quick tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LTX-2 I2V&lt;/strong&gt; (19B, ~25GB) — Works in FP8. Tight but fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LongCat-Video TI2V&lt;/strong&gt; (13.6B, ~14.5GB) — Works with adjustments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kandinsky 5.0 Pro I2V&lt;/strong&gt; (19B, ~40GB) — Needs CPU offload. Slow.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why Not Just Use WSL2 or Docker?
&lt;/h2&gt;

&lt;p&gt;The short answer: performance.&lt;/p&gt;

&lt;p&gt;Loading safetensors through WSL2's NTFS translation layer is noticeably slower. Docker has the same issue plus additional setup complexity. For a tool like ComfyUI where you're iterating on workflows and loading large models frequently, native Windows file I/O makes a real difference.&lt;/p&gt;

&lt;p&gt;Also, most AI artists using ComfyUI on Windows aren't Docker experts. A &lt;code&gt;.bat&lt;/code&gt; file they can double-click is the right UX.&lt;/p&gt;


&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;If you have an RTX 5090/5080/5070 and want to run ComfyUI on Windows without WSL2 or Docker, give it a try:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell" rel="noopener noreferrer"&gt;github.com/hiroki-abe-58/ComfyUI-Win-Blackwell&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/hiroki-abe-58" rel="noopener noreferrer"&gt;
        hiroki-abe-58
      &lt;/a&gt; / &lt;a href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell" rel="noopener noreferrer"&gt;
        ComfyUI-Win-Blackwell
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;ComfyUI for GeForce RTX 50-Series (Blackwell)&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The first fully documented, Windows-native ComfyUI setup for NVIDIA GeForce RTX 5090/5080/5070 (Blackwell architecture, sm_120) with CUDA 13.0.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Other languages:&lt;/strong&gt; &lt;a href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell/README_ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell/README_zh.md" rel="noopener noreferrer"&gt;中文&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell/README_ko.md" rel="noopener noreferrer"&gt;한국어&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What Makes This Special&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;RTX 50-series GPUs (Blackwell, Compute Capability sm_120) are &lt;strong&gt;not supported by PyTorch stable releases&lt;/strong&gt; as of early 2026. Running ComfyUI on these GPUs requires specific versions and workarounds that are not documented anywhere else in a single, reproducible package.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Technical Highlights&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;GPU Architecture&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;NVIDIA Blackwell (sm_120) -- RTX 5090 / 5080 / 5070&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;CUDA Version&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;13.0 (cu130) -- the latest CUDA runtime&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;PyTorch&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Nightly cu130 build (not stable, not cu128)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;3.13 (latest)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;Triton&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;triton-windows fork (official Triton is Linux-only)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;xformers&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Deliberately excluded (causes PyTorch downgrade)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;Custom Nodes&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;28 verified nodes including video &amp;amp; music generation&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;Platform&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Windows Native (no WSL2, no Docker required)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Why This Is Unique&lt;/h3&gt;

&lt;/div&gt;


&lt;ol&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Blackwell + Windows Native +&lt;/strong&gt;…&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/hiroki-abe-58/ComfyUI-Win-Blackwell" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;MIT licensed. Stars and PRs welcome — especially if you verify additional custom nodes on Blackwell hardware.&lt;/p&gt;

&lt;p&gt;Have you tried running AI tools on RTX 50-series? What was your experience? Let me know in the comments! 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this helpful, consider following me for more AI + GPU content!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;📝 Japanese version: &lt;a href="https://qiita.com/GeneLab_999" rel="noopener noreferrer"&gt;Qiita&lt;/a&gt; / &lt;a href="https://zenn.dev/rick_lyric" rel="noopener noreferrer"&gt;Zenn&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;🐦 Follow me on X: &lt;a href="https://x.com/geneLab_999" rel="noopener noreferrer"&gt;@geneLab_999&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;💻 GitHub: &lt;a href="https://github.com/hiroki-abe-58" rel="noopener noreferrer"&gt;hiroki-abe-58&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>ai</category>
      <category>showdev</category>
      <category>python</category>
    </item>
    <item>
      <title>How I Run 6 AI Services Simultaneously on RTX 5090 + WSL2 + Docker (And You Can Too)</title>
      <dc:creator>GeneLab_999</dc:creator>
      <pubDate>Sat, 21 Feb 2026 22:53:44 +0000</pubDate>
      <link>https://dev.to/genelab_999/how-i-run-6-ai-services-simultaneously-on-rtx-5090-wsl2-docker-and-you-can-too-539a</link>
      <guid>https://dev.to/genelab_999/how-i-run-6-ai-services-simultaneously-on-rtx-5090-wsl2-docker-and-you-can-too-539a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi3gqfdvscvd8l33emrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi3gqfdvscvd8l33emrs.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built a multi-service local AI stack (image gen, video gen, voice synthesis, voice cloning) running on RTX 5090 via WSL2 Docker. The key breakthrough was solving the GPU driver passthrough layer that nobody documented. Here's the architecture, the critical &lt;code&gt;gpu-run&lt;/code&gt; function, and everything I learned the hard way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Solved
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pxxeronj3qvsolju6yb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pxxeronj3qvsolju6yb.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In August 2025, I bought an RTX 5090. Blackwell architecture. 32GB GDDR7. Compute capability &lt;code&gt;sm_120&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And nobody could make it work with WSL2 + Docker + PyTorch.&lt;/p&gt;

&lt;p&gt;The issue wasn't any single component. &lt;code&gt;nvidia-smi&lt;/code&gt; worked fine in containers. &lt;code&gt;libcuda.so.1&lt;/code&gt; loaded correctly. But &lt;code&gt;torch.cuda.is_available()&lt;/code&gt; kept returning &lt;code&gt;False&lt;/code&gt;, with a cryptic &lt;code&gt;Error 500: named symbol not found&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I spent roughly 40 hours debugging. Here's what I found, and how I turned it into a production multi-service AI environment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbrbuumujv100b3qcazv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbrbuumujv100b3qcazv.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The failure point was in the &lt;strong&gt;interaction layer&lt;/strong&gt; between WSL2's driver mounting and Docker's GPU runtime.&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;--gpus all&lt;/code&gt; in a Docker container on WSL2, the NVIDIA Container Toolkit mounts &lt;code&gt;/usr/lib/wsl/lib&lt;/code&gt; into the container. This directory contains &lt;code&gt;libcuda.so.1&lt;/code&gt; and friends. For most GPUs, this is enough.&lt;/p&gt;

&lt;p&gt;For the RTX 5090, it's not.&lt;/p&gt;

&lt;p&gt;The actual driver binaries live in a &lt;strong&gt;separate directory&lt;/strong&gt;: &lt;code&gt;/usr/lib/wsl/drivers/nvmdi.inf_amd64_&amp;lt;hash&amp;gt;&lt;/code&gt;. This directory contains the real &lt;code&gt;libcuda.so.1.1&lt;/code&gt;, &lt;code&gt;libnvdxgdmal.so.1&lt;/code&gt;, &lt;code&gt;libnvidia-ptxjitcompiler.so.1&lt;/code&gt;, and other dependencies that the PyTorch CUDA runtime needs to initialize the Blackwell architecture.&lt;/p&gt;

&lt;p&gt;Without mounting this directory AND setting &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; to include it, PyTorch's CUDA initialization hits a dead end -- it finds &lt;code&gt;libcuda.so.1&lt;/code&gt; but can't resolve the sm_120-specific symbols.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: &lt;code&gt;gpu-run&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the function that makes everything work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-run &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;D BN
  &lt;span class="nv"&gt;D&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; /usr/lib/wsl/drivers/nvmdi.inf_amd64_&lt;span class="k"&gt;*&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;return &lt;/span&gt;1
  &lt;span class="nv"&gt;BN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$D&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Using driver path: &lt;/span&gt;&lt;span class="nv"&gt;$D&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /usr/lib/wsl/lib:/usr/lib/wsl/lib:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$D&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;:/usr/lib/wsl/drivers/&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib/wsl/lib:/usr/lib/wsl/drivers/&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$BN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Finds the driver directory dynamically&lt;/strong&gt; -- the hash suffix changes with driver updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mounts both WSL lib paths&lt;/strong&gt; -- the standard &lt;code&gt;/usr/lib/wsl/lib&lt;/code&gt; AND the driver-specific directory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sets &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt;&lt;/strong&gt; to prioritize these paths for symbol resolution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;gpu-run.sh
gpu-run torch-wsl-cu128 python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import torch
print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('GPU:', torch.cuda.get_device_name(0))
print('VRAM:', torch.cuda.get_device_properties(0).total_memory // 1024**3, 'GB')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using driver path: /usr/lib/wsl/drivers/nvmdi.inf_amd64_fb80e95fa979ce23
PyTorch: 2.9.0.dev20250812+cu128
CUDA available: True
GPU: NVIDIA GeForce RTX 5090
VRAM: 32 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Dockerfile Template
&lt;/h2&gt;

&lt;p&gt;Every AI service in my stack uses a variation of this base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; nvidia/cuda:12.8.0-devel-ubuntu22.04&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; TZ=Asia/Tokyo&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; DEBIAN_FRONTEND=noninteractive&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PYTHONUNBUFFERED=1&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; CUDA_HOME=/usr/local/cuda&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-snf&lt;/span&gt; /usr/share/zoneinfo/&lt;span class="nv"&gt;$TZ&lt;/span&gt; /etc/localtime &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$TZ&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /etc/timezone

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    python3 python3-pip python3-dev git ffmpeg ca-certificates &lt;span class="se"&gt;\
&lt;/span&gt;    build-essential cmake ninja-build libsndfile1 &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nv"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;1.26.4

&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    torch torchvision torchaudio &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/nightly/cu128

&lt;span class="k"&gt;RUN &lt;/span&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import torch; print('PyTorch:', torch.__version__); assert 'cu128' in torch.__version__"&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;nvidia/cuda:12.8.0-devel-ubuntu22.04&lt;/code&gt;&lt;/strong&gt; -- CUDA 12.8 is the minimum for sm_120. Using &lt;code&gt;devel&lt;/code&gt; (not &lt;code&gt;runtime&lt;/code&gt;) because some AI frameworks compile CUDA extensions at build time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyTorch nightly cu128&lt;/strong&gt; -- as of early 2026, stable PyTorch still has incomplete Blackwell support. Nightly cu128 is non-negotiable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;numpy pinned to 1.26.4&lt;/strong&gt; -- numpy 2.x breaks several AI frameworks that haven't updated their C extensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install torch LAST&lt;/strong&gt; -- many &lt;code&gt;requirements.txt&lt;/code&gt; files include torch. If you install dependencies first, they'll pull in a stable torch that doesn't support sm_120. Always install your carefully selected torch version as the final step.&lt;/li&gt;
&lt;/ul&gt;
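The "install torch LAST" rule can be enforced mechanically. A sketch of the filter step (the regex is deliberately crude -- it drops anything starting with `torch`, which is exactly what we want here, but add names if your nodes pin other GPU packages):

```shell
# Sketch of the filter that keeps custom-node requirements from replacing the
# pinned nightly (file contents illustrative):
printf 'numpy==1.26.4\ntorch==2.6.0\ntorchaudio\nlibrosa\n' > requirements.txt
grep -vE '^(torch|xformers)' requirements.txt
# prints:
#   numpy==1.26.4
#   librosa
# In the Dockerfile, the filtered list feeds pip, and torch stays last:
#   RUN grep -vE '^(torch|xformers)' requirements.txt | pip3 install -r /dev/stdin
```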




&lt;h2&gt;
  
  
  Docker Compose Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugigk21n3lqhq2cywzsj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugigk21n3lqhq2cywzsj.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how six AI services coexist in a single &lt;code&gt;compose.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;comfyui&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./apps/comfyui&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;comfyui:wsl-cu12&lt;/span&gt;
    &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comfyui"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/lib/wsl/drivers/${WSL_DRV_BN}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CUDA_VISIBLE_DEVICES=0&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/usr/lib/wsl/lib:/usr/lib/wsl/lib:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${WSL_DRV_DIR}:/usr/lib/wsl/drivers/${WSL_DRV_BN}:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./data/comfyui-models:/app/models&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./shared/models:/shared/models:ro&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8188:8188"&lt;/span&gt;
    &lt;span class="na"&gt;ipc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
    &lt;span class="na"&gt;ulimits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
      &lt;span class="na"&gt;stack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;67108864&lt;/span&gt;

  &lt;span class="na"&gt;sbv2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./apps/sbv2&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sbv2:wsl-cu12&lt;/span&gt;
    &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sbv2"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/lib/wsl/drivers/${WSL_DRV_BN}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/usr/lib/wsl/lib:/usr/lib/wsl/lib:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;${WSL_DRV_DIR}:/usr/lib/wsl/drivers/${WSL_DRV_BN}:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./data/sbv2-models:/opt/models&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5000:5000"&lt;/span&gt;
    &lt;span class="na"&gt;ipc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;host&lt;/span&gt;
    &lt;span class="na"&gt;ulimits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
      &lt;span class="na"&gt;stack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;67108864&lt;/span&gt;

  &lt;span class="na"&gt;cosyvoice&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosyvoice"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7865:7865"&lt;/span&gt;

  &lt;span class="na"&gt;rvc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rvc"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7866:7866"&lt;/span&gt;

  &lt;span class="na"&gt;framepack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framepack"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7862:7862"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Each service follows the same WSL driver mount pattern -- I've abbreviated the later ones for readability.)&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.env&lt;/code&gt; file is auto-generated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;WSL_DRV_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; /usr/lib/wsl/drivers/nvmdi.inf_amd64_&lt;span class="k"&gt;*&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n1&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;WSL_DRV_BN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$WSL_DRV_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
WSL_DRV_DIR=&lt;/span&gt;&lt;span class="nv"&gt;$WSL_DRV_DIR&lt;/span&gt;&lt;span class="sh"&gt;
WSL_DRV_BN=&lt;/span&gt;&lt;span class="nv"&gt;$WSL_DRV_BN&lt;/span&gt;&lt;span class="sh"&gt;
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Design Decisions That Saved My Sanity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Docker Profiles for Resource Isolation
&lt;/h3&gt;

&lt;p&gt;With 32GB VRAM, you can't run everything simultaneously. Video generation alone can eat 24GB. Docker profiles let me spin up exactly what I need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;--profile&lt;/span&gt; comfyui up &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose &lt;span class="nt"&gt;--profile&lt;/span&gt; sbv2 &lt;span class="nt"&gt;--profile&lt;/span&gt; cosyvoice up &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose &lt;span class="nt"&gt;--profile&lt;/span&gt; all up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Shared Model Directory
&lt;/h3&gt;

&lt;p&gt;AI models are enormous. Flux checkpoints, HunyuanVideo weights, voice models -- easily 200GB+. Instead of duplicating them per container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/ai-workspace-correct/
  shared/
    models/           # Cross-service shared models
    hf_cache/         # HuggingFace cache (persistent)
  data/
    comfyui-models/   # Service-specific models
    sbv2-models/
    cosyvoice-models/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each service mounts &lt;code&gt;shared/models&lt;/code&gt; read-only. Service-specific models go in their own &lt;code&gt;data/&lt;/code&gt; directory.&lt;/p&gt;
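&lt;p&gt;Concretely, the mount pattern can be sketched in compose like this (a sketch only; the service name and exact paths are illustrative, not the repo's actual file):&lt;/p&gt;

```yaml
# Illustrative fragment: every service mounts the shared model pool
# read-only, plus its own writable, service-specific data directory.
services:
  comfyui:
    volumes:
      - ./shared/models:/opt/shared/models:ro       # cross-service, read-only
      - ./shared/hf_cache:/root/.cache/huggingface  # persistent HF cache
      - ./data/comfyui-models:/opt/models           # service-specific, writable
```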

&lt;h3&gt;
  
  
  3. Port Allocation Strategy
&lt;/h3&gt;

&lt;p&gt;I carved out port ranges by domain:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Services&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5000-5009&lt;/td&gt;
&lt;td&gt;Voice synthesis&lt;/td&gt;
&lt;td&gt;Style-BERT-VITS2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7860-7869&lt;/td&gt;
&lt;td&gt;Voice/Video AI&lt;/td&gt;
&lt;td&gt;FramePack, CosyVoice, RVC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8180-8189&lt;/td&gt;
&lt;td&gt;Image AI&lt;/td&gt;
&lt;td&gt;ComfyUI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This avoids collisions and makes firewall rules predictable.&lt;/p&gt;
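&lt;p&gt;The convention is small enough to capture as data. A hedged sketch (the dict and function names are mine, not part of the stack):&lt;/p&gt;

```python
from typing import Optional

# Port ranges from the table above; the domain names are illustrative.
PORT_RANGES = {
    "voice_synthesis": range(5000, 5010),  # Style-BERT-VITS2
    "voice_video_ai": range(7860, 7870),   # FramePack, CosyVoice, RVC
    "image_ai": range(8180, 8190),         # ComfyUI
}

def domain_for(port: int) -> Optional[str]:
    """Return the domain a port belongs to, or None if it is unallocated."""
    for name, ports in PORT_RANGES.items():
        if port in ports:
            return name
    return None
```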

&lt;h3&gt;
  
  
  4. The torchaudio Trap
&lt;/h3&gt;

&lt;p&gt;This one cost me hours. Several voice synthesis frameworks use &lt;code&gt;torchaudio.info()&lt;/code&gt; and &lt;code&gt;torchaudio.load()&lt;/code&gt;. The nightly cu128 build of torchaudio has breaking API changes. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;

&lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wav_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;samplerate&lt;/span&gt;
&lt;span class="n"&gt;audio_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wav_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I patch these at Docker build time with &lt;code&gt;sed&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/import torchaudio/import torchaudio\nimport soundfile as sf/'&lt;/span&gt; /opt/app/webui.py &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/torchaudio.info(prompt_wav).sample_rate/sf.info(prompt_wav).samplerate/g'&lt;/span&gt; /opt/app/webui.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Lessons Learned (The Hard Way)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewd146y03lfz0thx71y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewd146y03lfz0thx71y1.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Never let &lt;code&gt;requirements.txt&lt;/code&gt; install torch.&lt;/strong&gt;&lt;br&gt;
Strip &lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;torchvision&lt;/code&gt;, &lt;code&gt;torchaudio&lt;/code&gt; from every &lt;code&gt;requirements.txt&lt;/code&gt; before installing. Then install your nightly cu128 build as the final step. If you don't, pip will happily overwrite your working torch with a stable version that can't see your GPU.&lt;/p&gt;
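&lt;p&gt;A minimal sketch of that workflow, run against a made-up &lt;code&gt;requirements.txt&lt;/code&gt; so the filtering step is visible (the grep pattern is one reasonable choice, not the repo's exact script):&lt;/p&gt;

```shell
# Demo on a throwaway requirements file (contents are made up):
printf 'torch==2.5.0\ntorchaudio\neinops\nnumpy\n' > requirements.txt

# Strip any torch/torchvision/torchaudio pins the upstream project ships:
grep -vE '^(torch|torchvision|torchaudio)([^A-Za-z0-9_-]|$)' requirements.txt > requirements.notorch.txt
cat requirements.notorch.txt

# Then install the rest, and the nightly build LAST so nothing overwrites it:
#   pip install -r requirements.notorch.txt
#   pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
```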

&lt;p&gt;&lt;strong&gt;2. Driver updates break the hash.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;nvmdi.inf_amd64_&amp;lt;hash&amp;gt;&lt;/code&gt; directory changes when you update NVIDIA drivers. The &lt;code&gt;gpu-run&lt;/code&gt; function handles this with dynamic lookup. But if you hardcode the path anywhere, you'll have a bad time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;ipc: host&lt;/code&gt; is non-negotiable for AI workloads.&lt;/strong&gt;&lt;br&gt;
Without it, PyTorch's shared memory operations fail silently or with cryptic errors. Always set it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. &lt;code&gt;PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
This environment variable enables PyTorch's expandable-segments allocation strategy. Without it, even 32GB of VRAM hits fragmentation errors on large models that should, on paper, fit comfortably.&lt;/p&gt;
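&lt;p&gt;If you launch outside Docker, the same knob has to be in the environment before the first CUDA allocation (i.e. before &lt;code&gt;import torch&lt;/code&gt; in most programs). A minimal Python-side sketch, mirroring the values from the compose file:&lt;/p&gt;

```python
import os

# Must be set before PyTorch initializes its CUDA allocator;
# values mirror the compose file earlier in the post.
os.environ.setdefault(
    "PYTORCH_CUDA_ALLOC_CONF",
    "expandable_segments:True,max_split_size_mb:512",
)
```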

&lt;p&gt;&lt;strong&gt;5. Document everything as if you'll have amnesia tomorrow.&lt;/strong&gt;&lt;br&gt;
I wrote my setup docs with the goal of "restore everything from scratch in 30 minutes." That document has saved me three times already.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Stack (February 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ComfyUI&lt;/td&gt;
&lt;td&gt;Image generation (Flux, SDXL)&lt;/td&gt;
&lt;td&gt;8188&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Style-BERT-VITS2&lt;/td&gt;
&lt;td&gt;Japanese TTS voice synthesis&lt;/td&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CosyVoice&lt;/td&gt;
&lt;td&gt;Multi-speaker voice synthesis&lt;/td&gt;
&lt;td&gt;7865&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RVC&lt;/td&gt;
&lt;td&gt;Real-time voice conversion&lt;/td&gt;
&lt;td&gt;7866&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FramePack&lt;/td&gt;
&lt;td&gt;Video generation (HunyuanVideo)&lt;/td&gt;
&lt;td&gt;7862&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All running on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: RTX 5090 32GB GDDR7&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: Intel Core Ultra 9 285K&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM&lt;/strong&gt;: 64GB DDR5&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: Windows 11 Pro + WSL2 Ubuntu 22.04&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container runtime&lt;/strong&gt;: Docker with NVIDIA Container Toolkit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Is This Still Unique?
&lt;/h2&gt;

&lt;p&gt;As of February 2026, there are published examples of single-service RTX 5090 + Docker setups (vLLM, ComfyUI, basic PyTorch). What I haven't found elsewhere is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;multi-service Docker Compose stack&lt;/strong&gt; orchestrating 5+ AI services on Blackwell&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;specific WSL2 driver mount solution&lt;/strong&gt; documented with the &lt;code&gt;nvmdi.inf_amd64_*&lt;/code&gt; path&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;systematic approach to dependency isolation&lt;/strong&gt; across services sharing one GPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade patterns&lt;/strong&gt; for model sharing, port management, and environment recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've done something similar, I'd genuinely love to hear about it. Drop a comment or reach out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with ~40 hours of debugging, 200+ GB of model files, and an unreasonable amount of stubbornness. Based in Tokyo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;#rtx5090&lt;/code&gt; &lt;code&gt;#docker&lt;/code&gt; &lt;code&gt;#wsl2&lt;/code&gt; &lt;code&gt;#pytorch&lt;/code&gt; &lt;code&gt;#cuda&lt;/code&gt; &lt;code&gt;#blackwell&lt;/code&gt; &lt;code&gt;#ai&lt;/code&gt; &lt;code&gt;#selfhosted&lt;/code&gt;&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>linux</category>
      <category>showdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Voice Cloning GUI That Supports 10 Languages — Here's What I Learned Wrestling with CUDA on Windows</title>
      <dc:creator>GeneLab_999</dc:creator>
      <pubDate>Sat, 21 Feb 2026 22:05:52 +0000</pubDate>
      <link>https://dev.to/genelab_999/i-built-a-voice-cloning-gui-that-supports-10-languages-heres-what-i-learned-wrestling-with-cuda-30gp</link>
      <guid>https://dev.to/genelab_999/i-built-a-voice-cloning-gui-that-supports-10-languages-heres-what-i-learned-wrestling-with-cuda-30gp</guid>
      <description>&lt;p&gt;Have you ever recorded yourself speaking and thought, &lt;em&gt;"I wish I could just type what I want to say and have my own voice read it back"&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;That's exactly the rabbit hole I fell down when Alibaba dropped &lt;a href="https://github.com/QwenLM/Qwen3-TTS" rel="noopener noreferrer"&gt;Qwen3-TTS&lt;/a&gt; — an open-source TTS model that can clone any voice from just &lt;strong&gt;3 seconds of audio&lt;/strong&gt;. Ten languages. 97ms latency. Apache 2.0 license. On paper, it was everything I'd ever wanted.&lt;/p&gt;

&lt;p&gt;In practice? It assumed Linux. FlashAttention 2 (recommended) doesn't run on Windows. And voice cloning required you to &lt;em&gt;manually transcribe&lt;/em&gt; your reference audio — which kind of defeats the purpose of a "quick clone" workflow.&lt;/p&gt;

&lt;p&gt;So I did what any developer would do: I forked it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/hiroki-abe-58" rel="noopener noreferrer"&gt;
        hiroki-abe-58
      &lt;/a&gt; / &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP" rel="noopener noreferrer"&gt;
        Qwen3-TTS-JP
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Japanese GUI + Whisper auto-transcription for Qwen3-TTS. RTX 5090 tested.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Qwen3-TTS-JP&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;English&lt;/strong&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_zh.md" rel="noopener noreferrer"&gt;中文&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_ko.md" rel="noopener noreferrer"&gt;한국어&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_ru.md" rel="noopener noreferrer"&gt;Русский&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_es.md" rel="noopener noreferrer"&gt;Español&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_it.md" rel="noopener noreferrer"&gt;Italiano&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_de.md" rel="noopener noreferrer"&gt;Deutsch&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_fr.md" rel="noopener noreferrer"&gt;Français&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_pt.md" rel="noopener noreferrer"&gt;Português&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Windows-native&lt;/strong&gt; fork of Qwen3-TTS with a modern, multilingual Web UI.&lt;/p&gt;
&lt;p&gt;The original Qwen3-TTS was developed primarily for Linux environments, and FlashAttention 2 is recommended. However, FlashAttention 2 does not work on Windows. This fork enables &lt;strong&gt;direct execution on Windows without WSL2 or Docker&lt;/strong&gt;, provides a &lt;strong&gt;modern Web UI supporting 10 languages&lt;/strong&gt;, and adds automatic transcription via Whisper.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mac (Apple Silicon) users:&lt;/strong&gt; For the best experience on Mac, please use &lt;strong&gt;&lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-Mac-GeneLab" rel="noopener noreferrer"&gt;Qwen3-TTS-Mac-GeneLab&lt;/a&gt;&lt;/strong&gt; -- fully optimized for Apple Silicon with MLX + PyTorch dual engine, 8bit/4bit quantization, and 10-language Web UI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Custom Voice -- Speech synthesis with preset speakers&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/CustomVoice.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FCustomVoice.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Voice Design -- Describe voice characteristics to synthesize&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/VoiceDesign.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FVoiceDesign.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Voice Clone -- Clone voice from reference audio&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/VoiceClone.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FVoiceClone.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Settings -- GPU / VRAM / Model information&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/Settings.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FSettings.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Related Projects&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;strong&gt;This&lt;/strong&gt;&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;…&lt;/p&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;Qwen3-TTS-JP&lt;/strong&gt; started as a personal fix — a Japanese-localized fork with Whisper auto-transcription bolted on. But as people started using it, I realized the same pain points existed for developers everywhere. So I expanded it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10-language Web UI&lt;/strong&gt; — Japanese, English, Chinese, Korean, German, French, Russian, Portuguese, Spanish, Italian. The UI auto-detects your browser locale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Windows support&lt;/strong&gt; — No WSL. No Docker. Just Python + CUDA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper auto-transcription&lt;/strong&gt; — Upload 3 seconds of audio, Whisper handles the rest. Pick from 5 model sizes (tiny → large-v3) depending on your speed/accuracy tradeoff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RTX 5090 (Blackwell) tested&lt;/strong&gt; — I developed this on a Blackwell GPU, so sm_120 architecture is a first-class citizen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac support&lt;/strong&gt; — Apple Silicon users get a &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-Mac-GeneLab" rel="noopener noreferrer"&gt;dedicated fork&lt;/a&gt; with MLX + PyTorch dual engine and 4bit/8bit quantization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;Qwen3-TTS isn't your typical TTS pipeline. Instead of the usual &lt;code&gt;Text → LM → DiT → Audio&lt;/code&gt; cascade, it uses a discrete multi-codebook LM that goes straight from text to audio codes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional:  Text → Language Model → Intermediate Repr → DiT → Audio
Qwen3-TTS:    Text → Language Model → Audio Codes → Decoder → Audio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This bypasses the information bottleneck that makes most TTS systems sound robotic. The result is eerily human-sounding output — with emotion, prosody, and natural pauses all preserved.&lt;/p&gt;

&lt;p&gt;The dual-track streaming architecture means it starts generating audio from the &lt;em&gt;first character&lt;/em&gt; of input. That 97ms first-packet latency is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting It Running (It's Actually Easy Now)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Qwen3-TTS-JP

python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="c"&gt;# Windows&lt;/span&gt;
.venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate
&lt;span class="c"&gt;# Linux/Mac&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate

pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;faster-whisper

&lt;span class="c"&gt;# RTX 30/40 series&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision torchaudio &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu124

&lt;span class="c"&gt;# RTX 50 series (Blackwell) — needs nightly&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; torch torchvision torchaudio &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/nightly/cu128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch the GUI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Voice cloning mode&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; qwen_tts.cli.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ip&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 7860 &lt;span class="nt"&gt;--no-flash-attn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://127.0.0.1:7860&lt;/code&gt;. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Actually Build With This
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting for developers. This isn't just a toy — the Python API is clean enough to integrate into real projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice Cloning in 5 Lines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qwen_tts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-TTS-12Hz-1.7B-Base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_voice_clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is my cloned voice. It only needed 3 seconds of audio.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;English&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_voice.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# 3 seconds is enough
&lt;/span&gt;    &lt;span class="n"&gt;ref_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, testing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Whisper can auto-generate this
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Design a Voice From Scratch
&lt;/h3&gt;

&lt;p&gt;No reference audio needed — just describe what you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_voice_design&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome back, adventurer. Your quest awaits.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;English&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deep male voice, 45 years old, slight British accent, warm and commanding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cross-Lingual Cloning
&lt;/h3&gt;

&lt;p&gt;Clone a voice in one language, generate speech in another. The model preserves the speaker's timbre across languages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_voice_clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bonjour, comment allez-vous aujourd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hui?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;French&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english_speaker.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hi, this is a test recording.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Practical Use Cases I've Seen
&lt;/h2&gt;

&lt;p&gt;Since releasing this fork, I've seen developers use it for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Game dev&lt;/strong&gt; — Generating NPC dialogue dynamically instead of recording thousands of audio files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Podcasting&lt;/strong&gt; — Creating consistent intro/outro narration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — Multilingual audio versions of documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Localization&lt;/strong&gt; — Same voice, 10 languages, zero re-recording&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototyping&lt;/strong&gt; — Testing voice UX before hiring voice actors&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GPU Compatibility
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Recommended Model&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5090&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;td&gt;1.7B&lt;/td&gt;
&lt;td&gt;Tested &amp;amp; verified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;1.7B&lt;/td&gt;
&lt;td&gt;Works great&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4070&lt;/td&gt;
&lt;td&gt;12GB&lt;/td&gt;
&lt;td&gt;0.6B or 1.7B (tight)&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3080&lt;/td&gt;
&lt;td&gt;10GB&lt;/td&gt;
&lt;td&gt;0.6B&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple Silicon&lt;/td&gt;
&lt;td&gt;16GB+&lt;/td&gt;
&lt;td&gt;Via Mac fork&lt;/td&gt;
&lt;td&gt;MLX optimized&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
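&lt;p&gt;As a rule of thumb, the table above collapses into a tiny helper. This is just a sketch (&lt;code&gt;pick_model&lt;/code&gt; is a hypothetical name, and the 16GB cutoff mirrors the "1.7B (tight)" caveat for 12GB cards):&lt;/p&gt;

```python
def pick_model(vram_gb: float) -> str:
    # Hypothetical helper mapping available VRAM to a model size,
    # following the compatibility table: 1.7B wants headroom,
    # 0.6B runs almost anywhere.
    if vram_gb >= 16:
        return "1.7B"
    if vram_gb >= 12:
        return "1.7B (tight)"
    return "0.6B"
```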

&lt;p&gt;If you're VRAM-constrained, the 0.6B model is surprisingly capable — and FlashAttention 2 can help on Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;flash-attn &lt;span class="nt"&gt;--no-build-isolation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Things I Learned the Hard Way
&lt;/h2&gt;

&lt;p&gt;A few gotchas from building this that might save you time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows cp932 encoding hell.&lt;/strong&gt; Japanese Windows defaults to cp932 encoding, which chokes on Unicode output from the model. The fix is wrapping stdout/stderr:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextIOWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextIOWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;FlashAttention 2 doesn't compile on Windows.&lt;/strong&gt; The solution is to use PyTorch's built-in SDPA (Scaled Dot Product Attention) via &lt;code&gt;--no-flash-attn&lt;/code&gt;. The performance hit is minimal for single-user inference.&lt;/p&gt;
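&lt;p&gt;That backend choice can be made defensively at import time. A minimal sketch (the &lt;code&gt;ATTN_BACKEND&lt;/code&gt; name is mine, not part of the project):&lt;/p&gt;

```python
try:
    # FlashAttention 2 is effectively Linux-only; treat it as optional.
    import flash_attn  # noqa: F401
    ATTN_BACKEND = "flash_attention_2"
except ImportError:
    # PyTorch's built-in SDPA is the portable fallback.
    ATTN_BACKEND = "sdpa"
```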

&lt;p&gt;&lt;strong&gt;Blackwell (sm_120) needs nightly PyTorch.&lt;/strong&gt; As of early 2026, stable PyTorch doesn't support RTX 50-series. Nightly builds with cu128 work, but you'll see warnings about torchao version mismatches. They're cosmetic — ignore them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SoX is optional.&lt;/strong&gt; The model prints warnings about missing SoX, but it works fine without it. Don't waste time installing it on Windows.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm currently exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vLLM integration for production-grade serving&lt;/li&gt;
&lt;li&gt;Fine-tuning workflows for custom voice models&lt;/li&gt;
&lt;li&gt;Streaming WebSocket API for real-time applications&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you're working on anything voice-related — games, accessibility, content creation, or just want to mess around with state-of-the-art TTS — give it a spin:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows/Linux:&lt;/strong&gt; &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP" rel="noopener noreferrer"&gt;Qwen3-TTS-JP&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Mac (Apple Silicon):&lt;/strong&gt; &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-Mac-GeneLab" rel="noopener noreferrer"&gt;Qwen3-TTS-Mac-GeneLab&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stars are appreciated — they help other developers find the project.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;I'm curious:&lt;/strong&gt; What would you build with 3-second voice cloning? Drop your ideas in the comments — I'd love to hear what use cases I haven't thought of yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ethical Note
&lt;/h2&gt;

&lt;p&gt;Voice cloning is powerful tech. Please use it responsibly — clone only with consent, disclose AI-generated audio, and don't use it for fraud or impersonation. The Apache 2.0 license gives you freedom, but with great power... you know the rest.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>ComfyUI-AceMusic: The First Full Implementation of ACE-Step 1.5 Features That "Weren't Yet Supported"</title>
      <dc:creator>GeneLab_999</dc:creator>
      <pubDate>Wed, 04 Feb 2026 15:06:27 +0000</pubDate>
      <link>https://dev.to/genelab_999/comfyui-acemusic-the-first-full-implementation-of-ace-step-15-features-that-werent-yet-2kje</link>
      <guid>https://dev.to/genelab_999/comfyui-acemusic-the-first-full-implementation-of-ace-step-15-features-that-werent-yet-2kje</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;On February 3rd, 2026, the official ComfyUI blog announced ACE-Step 1.5 support with a notable caveat: &lt;strong&gt;"Cover, Repaint, and other features aren't yet supported in ComfyUI."&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The next day, I released ComfyUI-AceMusic — a complete implementation of all 15 ACE-Step 1.5 features as ComfyUI nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key highlights:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;World-first&lt;/strong&gt;: Full Cover, Repaint, Edit, Retake, Extend support in ComfyUI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15 nodes&lt;/strong&gt; covering every ACE-Step 1.5 capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular architecture&lt;/strong&gt; that eliminates widget ordering issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Windows + Python 3.13+ compatible&lt;/strong&gt; using soundfile/scipy instead of problematic torchaudio backends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HeartMuLa interoperability&lt;/strong&gt; for hybrid AI music workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/hiroki-abe-58/ComfyUI-AceMusic" rel="noopener noreferrer"&gt;github.com/hiroki-abe-58/ComfyUI-AceMusic&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Official Support Was Incomplete
&lt;/h2&gt;

&lt;p&gt;ACE-Step 1.5 is a game-changer for open-source music generation. It outperforms most commercial alternatives, runs on consumer hardware (4GB VRAM), and generates full songs in under 10 seconds on an RTX 3090.&lt;/p&gt;

&lt;p&gt;When ComfyUI announced native support, the community was excited. But there was a catch.&lt;/p&gt;

&lt;p&gt;From the &lt;a href="https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui" rel="noopener noreferrer"&gt;official ComfyUI blog&lt;/a&gt; (February 3rd, 2026):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"ACE-Step 1.5 has a few more tricks up its sleeve. &lt;strong&gt;These aren't yet supported in ComfyUI&lt;/strong&gt;, but we have no doubt the community will figure it out."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The "tricks" they mentioned? Only the most powerful features of ACE-Step 1.5:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Official Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transform any song into a different style&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repaint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Regenerate specific sections of audio&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Change tags/lyrics while preserving melody&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create variations of existing audio&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add new content before/after audio&lt;/td&gt;
&lt;td&gt;❌ Not supported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So I built them.&lt;/p&gt;




&lt;h2&gt;
  
  
  What ComfyUI-AceMusic Offers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Complete Feature Coverage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Node&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model Loader&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Downloads and caches ACE-Step 1.5 models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Settings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configure generation parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text-to-Music generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lyrics Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dedicated lyrics input with section markers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caption Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style/genre description input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transform existing audio into different styles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Repaint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Regenerate specific time ranges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create variations with same settings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add content to beginning or end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Change tags/lyrics, preserve melody (FlowEdit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Conditioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Combine parameters into conditioning object&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generator (from Cond)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generate from conditioning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load LoRA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Load fine-tuned adapters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Understand&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract metadata from audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Create Sample&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generate params from natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Comparison with Existing Implementations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;ACE-Step Version&lt;/th&gt;
&lt;th&gt;Cover&lt;/th&gt;
&lt;th&gt;Repaint&lt;/th&gt;
&lt;th&gt;Edit&lt;/th&gt;
&lt;th&gt;Retake&lt;/th&gt;
&lt;th&gt;Extend&lt;/th&gt;
&lt;th&gt;Win + Py 3.13+&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ComfyUI Native&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Untested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;billwuhao&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Untested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ryanontheinside&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Untested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ComfyUI-AceMusic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Technical Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Modular Architecture
&lt;/h3&gt;

&lt;p&gt;Previous implementations crammed 30+ parameters into a single node, causing widget ordering issues — a known ComfyUI quirk where a mismatch in input field order can silently assign values to the wrong parameters.&lt;/p&gt;

&lt;p&gt;ComfyUI-AceMusic separates concerns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Model Loader] → Model loading only
[Settings] → Generation parameters only  
[Lyrics Input] → Lyrics entry only
[Caption Input] → Style description only
[Generator] → Generation execution only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eliminates widget ordering bugs&lt;/li&gt;
&lt;li&gt;Improves workflow readability&lt;/li&gt;
&lt;li&gt;Makes nodes reusable across different workflows&lt;/li&gt;
&lt;li&gt;Follows single-responsibility principle&lt;/li&gt;
&lt;/ul&gt;
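&lt;p&gt;In ComfyUI terms, "one node, one job" looks roughly like this. A simplified sketch following ComfyUI's custom-node convention; the real Settings node exposes more parameters:&lt;/p&gt;

```python
class AceMusicSettingsSketch:
    # Simplified single-responsibility node: it only packs generation
    # parameters into one object, so downstream nodes never depend
    # on widget ordering.
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "duration": ("FLOAT", {"default": 60.0, "min": 1.0}),
            "language": ("STRING", {"default": "English"}),
        }}

    RETURN_TYPES = ("ACE_SETTINGS",)
    FUNCTION = "build"

    def build(self, duration, language):
        return ({"duration": duration, "language": language},)
```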

&lt;h3&gt;
  
  
  2. Cross-Platform Compatibility
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;: &lt;code&gt;torchaudio&lt;/code&gt; backends can fail on Windows + Python 3.13+.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;: Use &lt;code&gt;soundfile&lt;/code&gt; and &lt;code&gt;scipy&lt;/code&gt; instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Problematic approach
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torchaudio&lt;/span&gt;
&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchaudio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Fails on Windows 3.13+
&lt;/span&gt;
&lt;span class="c1"&gt;# ComfyUI-AceMusic approach
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;
&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Works everywhere
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just a workaround — it's a more robust solution that works across all platforms without requiring specific backend configurations.&lt;/p&gt;
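&lt;p&gt;The portability idea extends further: even without soundfile, Python's stdlib can read WAV files. A dependency-free sketch (not the ComfyUI-AceMusic implementation, which uses soundfile/scipy):&lt;/p&gt;

```python
import wave

def load_wav_stdlib(path):
    # Read raw PCM frames plus the sample rate using only the stdlib.
    # Returns (frames_bytes, sample_rate); decoding to float arrays
    # is left to the caller.
    with wave.open(path, "rb") as w:
        return w.readframes(w.getnframes()), w.getframerate()
```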

&lt;h3&gt;
  
  
  3. HeartMuLa Interoperability
&lt;/h3&gt;

&lt;p&gt;The AUDIO type in ComfyUI-AceMusic is compatible with HeartMuLa outputs, enabling hybrid workflows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[HeartMuLa Generator] → [AceMusic Cover] → [AceMusic Extend] → [Output]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets you combine the strengths of different music generation models in a single workflow.&lt;/p&gt;
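&lt;p&gt;Interoperability hinges on ComfyUI's AUDIO convention: a dict carrying the waveform and its sample rate. A minimal structural check (a sketch; real nodes also expect the waveform to be a torch tensor):&lt;/p&gt;

```python
def looks_like_comfy_audio(obj) -> bool:
    # ComfyUI's AUDIO type is a dict with "waveform" and "sample_rate"
    # keys; any node pack that emits this shape can feed AceMusic nodes.
    return isinstance(obj, dict) and "waveform" in obj and "sample_rate" in obj
```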




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Via ComfyUI Manager (Recommended):&lt;/strong&gt;&lt;br&gt;
Search for "ComfyUI-AceMusic" and install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI/custom_nodes
git clone https://github.com/hiroki-abe-58/ComfyUI-AceMusic.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI-AceMusic
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Install ACE-Step 1.5&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;git+https://github.com/ace-step/ACE-Step.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models auto-download from Hugging Face on first use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Workflow (Text-to-Music)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Add &lt;strong&gt;AceMusic Model Loader&lt;/strong&gt; → set device to &lt;code&gt;cuda&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;AceMusic Settings&lt;/strong&gt; → configure duration, language, etc.&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;AceMusic Lyrics Input&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   [Verse]
   Walking down the empty street
   Thinking about you and me

   [Chorus]
   We belong together
   Now and forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Add &lt;strong&gt;AceMusic Caption Input&lt;/strong&gt;: &lt;code&gt;pop, female vocal, energetic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Connect all to &lt;strong&gt;AceMusic Generator&lt;/strong&gt; → &lt;strong&gt;Preview Audio&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Load the example workflow: &lt;code&gt;workflow/AceMusic_Lyrics_v3.json&lt;/code&gt;&lt;/p&gt;
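&lt;p&gt;The bracketed section markers above are plain text, so you can preprocess them yourself. A sketch (the parser is mine, not part of the node pack):&lt;/p&gt;

```python
import re

def split_sections(lyrics: str) -> dict:
    # Group lyric lines under their [Section] markers.
    sections, current = {}, None
    for line in lyrics.splitlines():
        line = line.strip()
        m = re.fullmatch(r"\[(\w+)\]", line)
        if m:
            current = m.group(1)
            sections.setdefault(current, [])
        elif line and current is not None:
            sections[current].append(line)
    return sections
```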

&lt;h3&gt;
  
  
  Cover Workflow (Style Transfer)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Load Audio] ──────────────────┐
                               ↓
[Model Loader] → [Settings] → [AceMusic Cover] → [Preview Audio]
                               ↑
[Caption Input] ───────────────┘
"jazz piano trio, smooth, relaxed"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pop → Jazz arrangement&lt;/li&gt;
&lt;li&gt;Rock → Acoustic version&lt;/li&gt;
&lt;li&gt;EDM → Orchestral arrangement&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Repaint Workflow (Section Regeneration)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Load Audio] ──────────────────┐
                               ↓
[Model Loader] → [Settings] → [AceMusic Repaint] → [Preview Audio]
                               ↑
[Time Range: 30-45s] ──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix a problematic chorus&lt;/li&gt;
&lt;li&gt;Improve the intro&lt;/li&gt;
&lt;li&gt;Regenerate specific vocal sections&lt;/li&gt;
&lt;/ul&gt;
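&lt;p&gt;One practical detail when repainting: the requested time range has to stay inside the clip. A tiny guard like this (a hypothetical helper, not a node in the pack) avoids off-the-end requests:&lt;/p&gt;

```python
def clamp_repaint_range(start_s: float, end_s: float, total_s: float):
    # Clamp a requested repaint window to [0, total_s],
    # keeping start at or before end.
    start = max(0.0, min(start_s, total_s))
    end = max(start, min(end_s, total_s))
    return start, end
```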




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generation Speed
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Device&lt;/th&gt;
&lt;th&gt;RTF (27 steps)&lt;/th&gt;
&lt;th&gt;Time for 1 min audio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5090&lt;/td&gt;
&lt;td&gt;~50x&lt;/td&gt;
&lt;td&gt;~1.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;34.48x&lt;/td&gt;
&lt;td&gt;1.74s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A100&lt;/td&gt;
&lt;td&gt;27.27x&lt;/td&gt;
&lt;td&gt;2.20s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;12.76x&lt;/td&gt;
&lt;td&gt;4.70s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M2 Max&lt;/td&gt;
&lt;td&gt;2.27x&lt;/td&gt;
&lt;td&gt;26.43s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
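&lt;p&gt;RTF here is the real-time factor: seconds of audio produced per second of compute, so the last column is just duration divided by RTF. For the RTX 4090 row:&lt;/p&gt;

```python
def wall_clock_seconds(audio_seconds: float, rtf: float) -> float:
    # Real-time factor (RTF): seconds of audio generated per second
    # of compute, so generation time = duration / RTF.
    return audio_seconds / rtf

print(round(wall_clock_seconds(60, 34.48), 2))  # prints 1.74
```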

&lt;h3&gt;
  
  
  VRAM Requirements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Normal&lt;/td&gt;
&lt;td&gt;8GB+&lt;/td&gt;
&lt;td&gt;Full speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU Offload&lt;/td&gt;
&lt;td&gt;~4GB&lt;/td&gt;
&lt;td&gt;Slower but works on limited VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUDA out of memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Insufficient GPU memory&lt;/td&gt;
&lt;td&gt;Enable &lt;code&gt;cpu_offload&lt;/code&gt; or reduce &lt;code&gt;duration&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ModuleNotFoundError: acestep&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ACE-Step not installed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install git+https://github.com/ace-step/ACE-Step.git&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;soundfile not found&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing dependency&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install soundfile scipy&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Model download failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Network issue&lt;/td&gt;
&lt;td&gt;Check Hugging Face access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;torchaudio backend error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Windows + Python 3.13+ issue&lt;/td&gt;
&lt;td&gt;Ensure soundfile is properly installed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Environment Check Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;ComfyUI-AceMusic Environment Checker&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Python version
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version_info&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python 3.10+ required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# PyTorch + CUDA
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ PyTorch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ CUDA: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;vram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_device_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;total_memory&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ GPU VRAM: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vram&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA not available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PyTorch not installed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ACE-Step
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;acestep&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ ACE-Step: installed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACE-Step not installed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Audio libraries
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ soundfile: installed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soundfile not installed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Results
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Issues found:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Environment OK!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;When I saw the official announcement saying "these features aren't yet supported," I knew exactly what needed to be done. The ACE-Step team built an incredible model with Cover, Repaint, Edit, and other powerful features — but without ComfyUI support, most users couldn't access them.&lt;/p&gt;

&lt;p&gt;The hardest part was the &lt;code&gt;torchaudio&lt;/code&gt; issue. On Windows with Python 3.13+, the audio backends just don't work reliably. The solution was to bypass torchaudio entirely and use soundfile/scipy for all audio I/O. It's a more robust approach that should work on any platform.&lt;/p&gt;

&lt;p&gt;The modular architecture came from frustration with existing implementations. Stuffing 30+ parameters into one node isn't just ugly — it causes real bugs. Separating concerns made the nodes more reliable and the workflows more readable.&lt;/p&gt;
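&lt;p&gt;To make that concrete, here is a hypothetical single-purpose node in ComfyUI's node API. The class, parameter names, and defaults are illustrative, not the actual AceMusic nodes; the point is that one node owns one concern and hands a config object to the next:&lt;/p&gt;

```python
class AceSamplerSettings:
    # Hypothetical single-purpose node: sampler settings only,
    # emitted as one config object downstream nodes consume,
    # instead of 30-plus parameters piled into a single node.
    RETURN_TYPES = ("SAMPLER_CONFIG",)
    FUNCTION = "build"
    CATEGORY = "audio"

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "steps": ("INT", {"default": 27, "min": 1, "max": 200}),
            "guidance_scale": ("FLOAT", {"default": 7.5, "min": 0.0}),
        }}

    def build(self, steps, guidance_scale):
        return ({"steps": steps, "guidance_scale": guidance_scale},)

# ComfyUI discovers custom nodes through this mapping in __init__.py:
NODE_CLASS_MAPPINGS = {"AceSamplerSettings": AceSamplerSettings}
```

&lt;p&gt;Each node stays small enough to test in isolation, and a workflow graph reads like a sentence instead of a wall of widgets.&lt;/p&gt;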

&lt;p&gt;This is what open source is about. The official team sets the direction, and the community fills in the gaps. I'm proud to contribute to the music generation ecosystem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/hiroki-abe-58/ComfyUI-AceMusic" rel="noopener noreferrer"&gt;github.com/hiroki-abe-58/ComfyUI-AceMusic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACE-Step 1.5&lt;/strong&gt;: &lt;a href="https://github.com/ace-step/ACE-Step-1.5" rel="noopener noreferrer"&gt;github.com/ace-step/ACE-Step-1.5&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ComfyUI Official Blog&lt;/strong&gt;: &lt;a href="https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui" rel="noopener noreferrer"&gt;ACE-Step 1.5 Announcement&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HeartMuLa&lt;/strong&gt; (compatible): &lt;a href="https://github.com/filliptm/ComfyUI_FL-HeartMuLa" rel="noopener noreferrer"&gt;github.com/filliptm/ComfyUI_FL-HeartMuLa&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  License
&lt;/h2&gt;

&lt;p&gt;Apache 2.0&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you find this useful, consider starring the repo. And if you build something cool with it, I'd love to see it!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>news</category>
      <category>python</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Run Qwen3-TTS on Windows with RTX 5090: Voice Cloning in 3 Seconds</title>
      <dc:creator>GeneLab_999</dc:creator>
      <pubDate>Sat, 31 Jan 2026 09:37:20 +0000</pubDate>
      <link>https://dev.to/genelab_999/run-qwen3-tts-on-windows-with-rtx-5090-voice-cloning-in-3-seconds-elc</link>
      <guid>https://dev.to/genelab_999/run-qwen3-tts-on-windows-with-rtx-5090-voice-cloning-in-3-seconds-elc</guid>
      <description>&lt;h1&gt;
  
  
  Run Qwen3-TTS on Windows with RTX 5090: The Complete Guide to Voice Cloning in 3 Seconds
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Clone any voice with just 3 seconds of audio — now with native Windows support and the latest Blackwell GPUs&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Qwen3-TTS-JP&lt;/strong&gt; is a fork of Alibaba's Qwen3-TTS that adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Native Windows support&lt;/strong&gt; (no WSL required!)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;RTX 5090 / Blackwell GPU tested&lt;/strong&gt; and optimized&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Auto-transcription&lt;/strong&gt; via Whisper integration&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Localized GUI&lt;/strong&gt; (Japanese, easy to adapt)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/hiroki-abe-58" rel="noopener noreferrer"&gt;
        hiroki-abe-58
      &lt;/a&gt; / &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP" rel="noopener noreferrer"&gt;
        Qwen3-TTS-JP
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Japanese GUI + Whisper auto-transcription for Qwen3-TTS. RTX 5090 tested.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Qwen3-TTS-JP&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;English&lt;/strong&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_ja.md" rel="noopener noreferrer"&gt;日本語&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_zh.md" rel="noopener noreferrer"&gt;中文&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_ko.md" rel="noopener noreferrer"&gt;한국어&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_ru.md" rel="noopener noreferrer"&gt;Русский&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_es.md" rel="noopener noreferrer"&gt;Español&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_it.md" rel="noopener noreferrer"&gt;Italiano&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_de.md" rel="noopener noreferrer"&gt;Deutsch&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_fr.md" rel="noopener noreferrer"&gt;Français&lt;/a&gt; | &lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/docs/README_pt.md" rel="noopener noreferrer"&gt;Português&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Windows-native&lt;/strong&gt; fork of Qwen3-TTS with a modern, multilingual Web UI.&lt;/p&gt;
&lt;p&gt;The original Qwen3-TTS was developed primarily for Linux environments, and FlashAttention 2 is recommended. However, FlashAttention 2 does not work on Windows. This fork enables &lt;strong&gt;direct execution on Windows without WSL2 or Docker&lt;/strong&gt;, provides a &lt;strong&gt;modern Web UI supporting 10 languages&lt;/strong&gt;, and adds automatic transcription via Whisper.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Mac (Apple Silicon) users:&lt;/strong&gt; For the best experience on Mac, please use &lt;strong&gt;&lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-Mac-GeneLab" rel="noopener noreferrer"&gt;Qwen3-TTS-Mac-GeneLab&lt;/a&gt;&lt;/strong&gt; -- fully optimized for Apple Silicon with MLX + PyTorch dual engine, 8bit/4bit quantization, and 10-language Web UI.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Custom Voice -- Speech synthesis with preset speakers&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/CustomVoice.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FCustomVoice.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Voice Design -- Describe voice characteristics to synthesize&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/VoiceDesign.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FVoiceDesign.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Voice Clone -- Clone voice from reference audio&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/VoiceClone.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FVoiceClone.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Settings -- GPU / VRAM / Model information&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;
    &lt;a rel="noopener noreferrer" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP/assets/Settings.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fhiroki-abe-58%2FQwen3-TTS-JP%2Fassets%2FSettings.png" width="90%"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Related Projects&lt;/h2&gt;

&lt;/div&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;This&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;








&lt;h2&gt;
  
  
  The Problem: Getting Qwen3-TTS Running on Windows
&lt;/h2&gt;

&lt;p&gt;When Alibaba released Qwen3-TTS in January 2026, the AI community was amazed: &lt;strong&gt;3 seconds of reference audio is all you need to clone a voice&lt;/strong&gt;. Ten languages supported, 97ms latency, emotion control — impressive specs on paper.&lt;/p&gt;

&lt;p&gt;But there was a catch.&lt;/p&gt;

&lt;p&gt;The official repo assumed Linux. CUDA setup was finicky. And if you wanted to use the voice cloning feature, you had to &lt;strong&gt;manually transcribe your reference audio&lt;/strong&gt; — defeating the purpose of a quick workflow.&lt;/p&gt;

&lt;p&gt;I'd just upgraded to an RTX 5090 (Blackwell architecture), eager to push local AI to its limits. After days of wrestling with environments, I got it working and decided to package the solution for everyone else.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Makes This Fork Different?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Native Windows Support
&lt;/h3&gt;

&lt;p&gt;No WSL. No Docker required (you can still containerize if you want, but it is not a prerequisite). Just Python and CUDA, and you're good to go.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repo&lt;/span&gt;
git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Qwen3-TTS-JP

&lt;span class="c"&gt;# Create virtual environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
.venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate

&lt;span class="c"&gt;# Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;faster-whisper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Works on Windows 10/11 with a CUDA-capable GPU (tested on RTX 30/40/50 series).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. RTX 5090 (Blackwell) Tested
&lt;/h3&gt;

&lt;p&gt;This fork was developed and tested on an RTX 5090. Blackwell cards report the new sm_120 compute capability, which many prebuilt PyTorch wheels still lack, so plenty of AI repos break on them. This one doesn't.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5090&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;td&gt;1.7B&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24GB&lt;/td&gt;
&lt;td&gt;1.7B&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3080&lt;/td&gt;
&lt;td&gt;10GB&lt;/td&gt;
&lt;td&gt;0.6B&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
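&lt;p&gt;A quick way to tell whether your own PyTorch wheel can drive Blackwell is to check its compiled kernel list (the helper name here is ours, not part of the fork):&lt;/p&gt;

```python
def blackwell_supported():
    # True only when PyTorch is installed, CUDA is usable, and the
    # wheel was compiled with sm_120 kernels (RTX 5090 / Blackwell).
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return "sm_120" in torch.cuda.get_arch_list()

print("Blackwell-ready PyTorch:", blackwell_supported())
```

&lt;p&gt;If this prints False on a 5090, your wheel predates sm_120 support, and the familiar "sm_120 is not compatible" RuntimeError follows from exactly that.&lt;/p&gt;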

&lt;h3&gt;
  
  
  3. Whisper Auto-Transcription
&lt;/h3&gt;

&lt;p&gt;The original Qwen3-TTS requires you to provide the transcript of your reference audio. This fork integrates &lt;strong&gt;faster-whisper&lt;/strong&gt; to do it automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upload 3 seconds of audio&lt;/li&gt;
&lt;li&gt;Whisper transcribes it&lt;/li&gt;
&lt;li&gt;Qwen3-TTS clones the voice&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No manual typing. Choose from five Whisper model sizes, three of them shown below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tiny&lt;/td&gt;
&lt;td&gt;39M&lt;/td&gt;
&lt;td&gt;⚡⚡⚡⚡⚡&lt;/td&gt;
&lt;td&gt;★★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;td&gt;244M&lt;/td&gt;
&lt;td&gt;⚡⚡⚡&lt;/td&gt;
&lt;td&gt;★★★★&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;large-v3&lt;/td&gt;
&lt;td&gt;1.5B&lt;/td&gt;
&lt;td&gt;⚡&lt;/td&gt;
&lt;td&gt;★★★★★&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
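&lt;p&gt;The auto-transcription step is essentially a thin wrapper around faster-whisper. A minimal sketch (the wrapper function name is ours; the fork wires the equivalent into the GUI):&lt;/p&gt;

```python
def transcribe_reference(path, model_size="small"):
    # Transcribe a short reference clip so generate_voice_clone can
    # receive its ref_text automatically instead of typed by hand.
    from faster_whisper import WhisperModel  # lazy import: models are large

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path)
    return " ".join(seg.text.strip() for seg in segments)

if __name__ == "__main__":
    ref_text = transcribe_reference("reference.wav")
    print(ref_text)  # pass as ref_text= to generate_voice_clone
```

&lt;p&gt;For a 3-second clip, even large-v3 finishes in well under a second on a 5090, so transcription never becomes the bottleneck.&lt;/p&gt;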




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Launch the GUI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; qwen_tts.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ip&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8000&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;qwen_tts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen3-TTS-12Hz-1.7B-Base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Clone a voice with 3-second reference
&lt;/span&gt;&lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_voice_clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is my cloned voice speaking!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;English&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reference.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ref_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, this is a test.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Content Creators
&lt;/h3&gt;

&lt;p&gt;Clone your own voice for consistent narration across videos.&lt;/p&gt;

&lt;h3&gt;
  
  
  Game Developers
&lt;/h3&gt;

&lt;p&gt;Create character voices without expensive voice actors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_voice_design&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hero, your quest awaits!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;English&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deep male voice, 40 years old, British accent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Podcasters
&lt;/h3&gt;

&lt;p&gt;Quick voice-over generation for intros and outros.&lt;/p&gt;




&lt;h2&gt;
  
  
  Supported Languages
&lt;/h2&gt;

&lt;p&gt;🇨🇳 Chinese | 🇺🇸 English | 🇯🇵 Japanese | 🇰🇷 Korean | 🇩🇪 German | 🇫🇷 French | 🇷🇺 Russian | 🇧🇷 Portuguese | 🇪🇸 Spanish | 🇮🇹 Italian&lt;/p&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CUDA out of memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switch to the 0.6B model (FlashAttention 2 also cuts memory, but it doesn't work on Windows)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;faster-whisper not found&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install faster-whisper&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Ethical Note
&lt;/h2&gt;

&lt;p&gt;Voice cloning is powerful. Please:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only clone voices &lt;strong&gt;with consent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Don't use for fraud or misinformation&lt;/li&gt;
&lt;li&gt;Disclose AI-generated audio&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/hiroki-abe-58/Qwen3-TTS-JP" rel="noopener noreferrer"&gt;This Fork&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/QwenLM/Qwen3-TTS" rel="noopener noreferrer"&gt;Original Qwen3-TTS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2601.15621" rel="noopener noreferrer"&gt;Paper&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If this helped, please ⭐ the repo!&lt;/p&gt;

&lt;p&gt;Questions? Drop a comment below! 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
  </channel>
</rss>
