Jovan Chan

Posted on • Originally published at runaihome.com

MOSS-TTS in ComfyUI 2026: Zero-Shot Voice Cloning From a 10-Second Clip on Your RTX or Mac

This article was originally published on runaihome.com

TL;DR: MOSS-TTS clones a voice from a clean 3–10 second clip with no reference transcript, runs locally under Apache 2.0, and slots into ComfyUI through a custom-node pack. The Local 1.7B model fits in roughly 5GB of VRAM and is the only variant fast enough for iterative work; the Delay 8B wants ~18GB and trades speed for a little more expressiveness across its 31 languages.

| | MOSS-TTS Local 1.7B | MOSS-TTS Delay 8B | MOSS-TTS Nano 0.1B |
| --- | --- | --- | --- |
| Best for | Day-to-day cloning on one GPU | Maximum stability, long-form narration | CPU-only / no-GPU machines |
| Hardware | ~5 GB VRAM (RTX 3060 12GB and up) | ~18 GB VRAM (RTX 3090 / 4090) | Runs real-time on 4 CPU cores |
| The catch | Slightly higher error rate than 8B | Slow enough that iteration hurts | Lower fidelity, single-speaker focus |

Honest take: Start with the Local 1.7B. It clones a voice convincingly in 5GB of VRAM, and unless you're producing hours of narration where the 8B's marginally higher stability matters, you'll never feel the difference.


What MOSS-TTS Actually Is

MOSS-TTS is an open-source speech generation family from MOSI.AI and the OpenMOSS team — the same group behind the MOSS large language models. The current release, MOSS-TTS v1.5, shipped May 26, 2026 under the Apache 2.0 license, which means you can use it commercially without the non-commercial restrictions that hobble a lot of "open" TTS models.

The headline feature is zero-shot voice cloning: you hand it a short audio clip of someone speaking, type the text you want, and it generates new speech in that voice. Critically — and unlike Qwen-TTS or many older cloning pipelines — MOSS-TTS does not require a transcript of the reference audio. You drop in a clip, you get the voice. That single difference removes the most error-prone step from the whole workflow.

v1.5 covers 31 languages (Chinese, English, French, German, Spanish, Japanese, Korean, Cantonese, Dutch, Hindi, Finnish, and more), follows punctuation-driven pauses more reliably than v1.0, and supports explicit inline pause markers for fine pacing control.

If you've already got ComfyUI running on Windows or in production on Linux, MOSS-TTS bolts on as a custom-node pack — no separate framework, no new server to babysit.


The Model Lineup (and Which One You Actually Want)

The MOSS-TTS family is wider than just two models, and the ComfyUI node pack exposes most of it. Here's the practical breakdown with the VRAM figures the node author documents:

  • MOSS-TTS Local 1.7B — ~5 GB VRAM. The fast lane. The node documentation flatly states it's "the only model fast enough for practical iterative use on a single consumer GPU." This is your default.
  • MOSS-TTS Delay 8B — ~18 GB VRAM. The production-recommended model for long-form stability and the cleanest voice cloning. Needs a 24GB card to be comfortable.
  • MOSS-TTSD v1.0 (dialogue) — ~18 GB VRAM. Multi-speaker conversation generation — think two distinct voices in a podcast clip.
  • MOSS-VoiceGenerator (1.7B) — ~18 GB VRAM as packed. Designs a voice from a text description rather than a reference clip.
  • MOSS-SoundEffect v2.0 (1.3B DiT) — ~18 GB VRAM. Environmental sound effects, not speech.
  • MOSS-TTS-Nano (0.1B) — a CPU-first variant that does real-time generation on just 4 CPU cores, for machines with no usable GPU at all.

There's also a MOSS-TTS-Realtime (1.7B) streaming build aimed at voice agents, with a measured ~180ms time-to-first-byte after warm-up and a real-time factor (RTF) of 0.51 — meaning it generates audio roughly twice as fast as playback. That's the one to watch if you're wiring TTS into a live assistant rather than rendering files.
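To make the RTF figure concrete, here is a minimal sketch of the arithmetic, assuming the usual definition of real-time factor as compute time divided by audio duration. The function names are made up for illustration and are not part of any MOSS-TTS API.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of RTF)."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to render a clip of the given length."""
    return audio_seconds * rtf


# RTF of 0.51 is the figure quoted above for MOSS-TTS-Realtime.
print(round(speedup_from_rtf(0.51), 2))        # 1.96
print(round(generation_seconds(60, 0.51), 1))  # 30.6
```

An RTF below 1.0 means generation outpaces playback, which is the property a streaming voice agent needs; the ~180ms time-to-first-byte is a separate latency figure and is not modeled here.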

How good is the cloning, in numbers?

On the standard Seed-TTS-eval benchmark, the two main models land here:

| Model | EN word error rate | EN speaker similarity | ZH char error rate | ZH speaker similarity |
| --- | --- | --- | --- | --- |
| MOSS-TTS Local 1.7B | 1.93% | 73.28% | 1.44% | 79.62% |
| MOSS-TTS Delay 8B | 1.84% | 70.86% | 1.37% | 76.98% |

Read that carefully: the 8B has a marginally lower error rate, but the Local 1.7B actually scores higher on speaker similarity (73.28% vs 70.86% in English). For voice-cloning specifically, the small model is not the compromise the parameter count suggests — it's arguably the better clone. That's the data point that should settle the "which model" debate for most home labs.


Hardware: What You Need to Run It

The audio path is light compared to image generation. The codec runs at a 12.5-token-per-second frame rate (1 second of audio ≈ 12.5 tokens), so the model isn't pushing the enormous token volumes a diffusion image model does. Output is 24 kHz mono.
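A quick back-of-the-envelope sketch shows how small those numbers are. It takes the 12.5 tokens-per-second and 24 kHz figures above at face value and counts one token per codec frame; if the codec emits several codebook entries per frame, the real count would be a multiple of this. The helper names are illustrative only.

```python
import math

CODEC_TOKENS_PER_SECOND = 12.5  # frame rate quoted in the article
SAMPLE_RATE_HZ = 24_000         # 24 kHz mono output


def codec_tokens(audio_seconds: float) -> int:
    """Tokens needed to cover a clip, rounding up to a whole frame."""
    return math.ceil(audio_seconds * CODEC_TOKENS_PER_SECOND)


def samples_per_token() -> int:
    """Raw audio samples represented by one codec token."""
    return int(SAMPLE_RATE_HZ / CODEC_TOKENS_PER_SECOND)


print(codec_tokens(10))     # 125 tokens for a 10-second reference clip
print(samples_per_token())  # 1920 samples per token
```

So a 10-second reference clip is on the order of a hundred-odd tokens under this assumption, tiny next to the sequence lengths image models work with.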

For the Local 1.7B, a 12GB card like the RTX 3060 clears the ~5GB requirement with room to keep ComfyUI's other nodes resident. A 16GB RTX 5060 Ti gives you headroom to also run an image or LLM workflow in the same session. The node pack uses FlashAttention 2 on CUDA GPUs with compute capability 8.0+ (Ampere and newer — RTX 30-series and up), which is most cards anyone is buying for AI in 2026.
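If you're not sure which side of the 8.0 line your card falls on, the check is a simple tuple comparison. This is a sketch, not code from the node pack: the function name is hypothetical, and on a real machine the tuple would come from PyTorch's `torch.cuda.get_device_capability()`, shown only in a comment so the snippet runs without a GPU.

```python
from typing import Tuple

# Minimum CUDA compute capability the article quotes for FlashAttention 2.
MIN_CAPABILITY: Tuple[int, int] = (8, 0)


def supports_flash_attention_2(capability: Tuple[int, int]) -> bool:
    """Return True when (major, minor) is at or above the 8.0 threshold."""
    return tuple(capability) >= MIN_CAPABILITY


# On a machine with PyTorch and a CUDA GPU, the tuple would come from:
#   import torch
#   capability = torch.cuda.get_device_capability(0)
examples = [
    ("Turing, e.g. RTX 20-series", (7, 5)),
    ("Ampere, e.g. RTX 3060", (8, 6)),
    ("Ada, e.g. RTX 4090", (8, 9)),
]
for name, capability in examples:
    print(name, supports_flash_attention_2(capability))
```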

For the Delay 8B, you're at ~18GB, which realistically means a 24GB RTX 3090 or RTX 4090. The used 3090 remains the value pick here — see our used RTX 3090 breakdown for current pricing.

No GPU at all? You have two real options: run the Nano 0.1B on CPU, or rent a GPU by the hour. A single TTS render job is short enough that spinning up a cloud box on RunPod for an afternoon of voice work costs less than a coffee — see our rent-vs-buy math before committing to hardware.
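The sizing logic in this section boils down to a lookup. Here is a rough sketch using the ~5 GB and ~18 GB figures from the node documentation; the 1 GB headroom is an arbitrary assumption and the function is hypothetical, so treat the output as a starting point rather than a guarantee.

```python
from typing import Optional

# Approximate VRAM needs quoted in the article.
VRAM_NEEDED_GB = {
    "MOSS-TTS Delay 8B": 18.0,
    "MOSS-TTS Local 1.7B": 5.0,
}
CPU_FALLBACK = "MOSS-TTS Nano 0.1B"
HEADROOM_GB = 1.0  # assumption: spare memory for ComfyUI itself


def largest_variant_that_fits(vram_gb: Optional[float],
                              headroom_gb: float = HEADROOM_GB) -> str:
    """Return the biggest variant whose VRAM need plus headroom fits the card."""
    if not vram_gb:
        return CPU_FALLBACK  # no GPU: fall back to the CPU-first model
    by_size = sorted(VRAM_NEEDED_GB.items(), key=lambda item: item[1], reverse=True)
    for name, needed_gb in by_size:
        if vram_gb >= needed_gb + headroom_gb:
            return name
    return CPU_FALLBACK


for vram in (None, 12, 16, 24):
    print(vram, "->", largest_variant_that_fits(vram))
```

Note this reports what fits, not what to pick: a 24GB card can hold the Delay 8B, but the Local 1.7B is still the faster choice for iterative work.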

Apple Silicon

There's a first-class MLX path. The mlx-community organization on Hugging Face has published 8-bit MLX conversions (mlx-community/MOSS-TTS-8B-8bit and MOSS-TTS-Local-Transformer-MLX-8bit) that run through mlx-audio, a community audio toolkit built on Apple's MLX framework. The default runtime uses W8A-bf16 mixed precision with global and local KV cache enabled. On unified-memory Macs the VRAM-vs-RAM distinction disappears, so a Mac Mini M4 Pro with 24GB+ handles even the 8B 8-bit conversion. If you're already running Ollama with MLX, the toolchain will feel familiar.


Installing the ComfyUI Node Pack

The community node pack is richservo/comfyui-moss-tts. Installation is the standard custom-node routine:

cd ComfyUI/custom_nodes
git clone https://github.com/richservo/comfyui-moss-tts
cd comfyui-moss-tts
pip install -r requirements.txt

The hard dependency that catches everyone: transformers>=5.0.0. MOSS-TTS uses architecture code that landed in the Transformers 5.x line, and most existing ComfyUI installs are still on a 4.x release pinned by some other node. After installing, restart ComfyUI fully (not just a workflow refresh).
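You can check which Transformers version ComfyUI will actually import before queuing anything. The sketch below uses only the standard library; the helper names are illustrative, and the simple parser treats a pre-release such as 5.0.0rc1 as 5.0.0, so it is a sanity check rather than a strict version comparison. Run it with the same Python interpreter that launches ComfyUI.

```python
from importlib import metadata
from typing import Tuple

MIN_TRANSFORMERS: Tuple[int, int, int] = (5, 0, 0)  # transformers>=5.0.0


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.46.3' or '5.0.0rc1' into a (major, minor, patch) tuple."""
    parts = []
    for chunk in text.split(".")[:3]:
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1' or 'dev0'
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions like '5.1'
    return tuple(parts)


def meets_minimum(installed: str,
                  minimum: Tuple[int, int, int] = MIN_TRANSFORMERS) -> bool:
    """Return True when the installed version is at or above the minimum."""
    return parse_version(installed) >= minimum


def check_transformers() -> str:
    """Describe whether this environment's transformers is new enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed in this environment"
    status = "OK" if meets_minimum(installed) else "too old, upgrade needed"
    return f"transformers {installed}: {status}"


print(check_transformers())
```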

Models auto-download to ComfyUI/models/moss-tts/ the first time you queue a prompt with a given model selected. All variants share one audio codec, OpenMOSS-Team/MOSS-Audio-Tokenizer, which downloads alongside the first model.

The error that will bite you

If you skipped the Transformers upgrade, the first prompt fails immediately:

ImportError: cannot import name 'MossTTSForConditionalGeneration' from 'transformers'

Fix: force the upgrade inside ComfyUI's own Python environment, not your system Python:


# from the ComfyUI root, using its bundled python
python_embeded\python.exe -m pip ins
