Jovan Chan

Posted on Jun 27 • Originally published at aifoss.dev

MOSS-TTS 1.5 Review 2026: Apache Voice Cloning on 8GB

#tts #voicecloning #selfhosted #ai

TL;DR: MOSS-TTS 1.5 is an 8B open TTS model that clones a voice from a short reference clip and — unlike XTTS v2 and F5-TTS — ships under Apache 2.0, so you can actually use it in a paid product. It fits on an 8GB GPU with the llama.cpp path and has MLX builds for Apple Silicon. The catch: cloning fidelity trails XTTS v2 slightly, and setup is rougher than a one-click app.

| | MOSS-TTS 1.5 | F5-TTS | XTTS v2 |
| --- | --- | --- | --- |
| Best for | Commercial cloning + long-form | Personal cloning projects | Personal cloning, broad community |
| License | Apache 2.0 (commercial OK) | CC-BY-NC (non-commercial) | CPML (non-commercial, vendor defunct) |
| Zero-shot cloning | Yes, from a short clip | Yes, ~3s reference | Yes, ~6s reference |
| Min VRAM | ~8GB (llama.cpp build) | ~8–12GB | ~6–8GB |
| The catch | Setup friction, newer ecosystem | Can't ship commercially | No one left to sell a license |

Honest take: If you need voice cloning inside something you'll sell, MOSS-TTS 1.5 is the first open model that's both good and legally clean — pick it over F5-TTS and XTTS v2 the moment money is involved.

What MOSS-TTS 1.5 actually is

MOSS-TTS is the speech-generation family from the OpenMOSS team (the group behind the MOSS LLM work) and MOSI.AI. Version 1.5 of the flagship model landed on May 26, 2026, alongside MOSS-SoundEffect-v2.0. It is an 8-billion-parameter model using an architecture the repo calls MossTTSDelay, and every model in the family is released under the Apache License 2.0.

That license line is the whole story for anyone building a product. Voice cloning in the open-source world has been a legal minefield: XTTS v2 is under Coqui's CPML (non-commercial), and Coqui Inc. shut down in January 2024, so there is literally no one left to sell you a commercial license. F5-TTS ships its weights under CC-BY-NC-4.0 — also non-commercial. MOSS-TTS 1.5 is the rare zero-shot cloning model you can drop into a paid app, a client deliverable, or an internal tool at work without a lawyer flagging it.

The family is broader than one checkpoint:

MOSS-TTS-v1.5 — 8B, the main quality model.
MOSS-TTS-Local-Transformer-v1.5 — 4B, MossTTSLocal architecture, 48kHz stereo output, released June 18, 2026.
MOSS-TTS-Nano — ~100M params, runs on CPU, launched April 13, 2026.

This review focuses on the 8B v1.5 model, since that's the one most of the r/LocalLLaMA discussion centers on.

What it does well

Cloning quality is genuinely competitive. On the standard Seed-TTS-eval benchmark, the 8B MossTTSDelay model reports an English word error rate of 1.84% and English speaker similarity of 70.86%, with Chinese CER of 1.37% and Chinese speaker similarity of 76.98%. The 4B local-transformer variant pushes similarity higher (73.28% English, 79.62% Chinese). For context, sub-2% WER means the model rarely mangles or skips words — the failure mode that makes most local TTS unusable for real narration.
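To make the WER figure concrete: word error rate is conventionally the word-level edit distance between a reference transcript and what a speech recognizer hears in the generated audio, divided by the number of reference words. The sketch below is a minimal illustration of that formula in plain Python, not the Seed-TTS-eval scoring code. The function name is hypothetical, and the text normalization and ASR transcription that real benchmarks perform are left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one dropped word out of four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Read this way, a 1.84% WER works out to roughly 18 word-level errors per 1,000 reference words.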

Long-form stability is the standout feature. The model card claims up to one hour of coherent audio in a single run while holding a consistent speaker identity. Most open TTS models drift, change timbre, or fall apart past a few minutes. If you're producing audiobooks, podcasts, or long documentation read-throughs, that single-run stability matters more than a fractional similarity-score win.

31 languages are supported, up from 20 in the 1.0 release, covering Chinese, English, French, German, Spanish, Japanese, Korean, Arabic, Hindi, Thai, and Vietnamese among others.

Control you don't usually get. v1.5 adds reliable punctuation-driven pausing and explicit inline pause markers — you can write [pause 3.2s] directly in your text. There's phoneme-level pronunciation control via mixed Pinyin/IPA input for names and jargon the model would otherwise butcher. The repo also gives a handy planning rule: 1 second of audio ≈ 12.5 tokens, so you can estimate generation length before you run anything.
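The planning rule is easy to turn into a quick budget check. The following is a rough sketch, not part of the MOSS-TTS codebase: the helper names are hypothetical, and the pause-marker pattern is an assumption inferred from the single `[pause 3.2s]` example above, so other marker spellings may exist that it does not match.

```python
import math
import re

TOKENS_PER_SECOND = 12.5  # planning rule quoted from the repo

# Assumed marker shape, based only on the "[pause 3.2s]" example.
PAUSE_MARKER = re.compile(r"\[pause\s+(\d+(?:\.\d+)?)s\]")


def estimate_tokens(audio_seconds: float) -> int:
    """Approximate token count for a target audio duration."""
    return math.ceil(audio_seconds * TOKENS_PER_SECOND)


def total_pause_seconds(text: str) -> float:
    """Sum the explicit pause durations written inline in a script."""
    return sum(float(seconds) for seconds in PAUSE_MARKER.findall(text))


if __name__ == "__main__":
    print(estimate_tokens(60))    # one minute of audio
    print(estimate_tokens(3600))  # one hour of audio
    print(total_pause_seconds("Numbers are in. [pause 0.8s] Let's walk through them."))
```

By this rule a one-minute clip is about 750 tokens and a full hour is about 45,000, which gives a sense of the sequence length the one-hour claim implies.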

It runs on hardware you own. After the llama.cpp optimization work, the OpenMOSS team states the 8B model now fits onto 8GB GPUs. That puts it within reach of an RTX 3060 12GB or even an 8GB card, instead of demanding a 24GB workstation GPU. For a model that clones voices at this quality, that's the headline that makes it practical.
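Some back-of-envelope arithmetic shows why a quantized build is what makes 8GB plausible. This is a weights-only estimate under stated assumptions: the bit widths are illustrative rather than the quantization the OpenMOSS team actually ships, and the figure ignores the audio tokenizer, the KV cache, activations, and runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (10^9 bytes); excludes cache and overhead."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Illustrative bit widths only; not a statement about the released GGUF files.
    for bits in (16, 8, 4):
        print(f"8B parameters at {bits}-bit: {weight_memory_gb(8, bits):.1f} GB")
```

At 16 bits per weight the parameters alone come to 16 GB, so an 8GB card only becomes realistic once the weights are stored at lower precision.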

Install and first run

There are two install paths. The standard PyTorch runtime:

git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install -e ".[torch-runtime]"

Or the torch-free path, which is what gets you onto an 8GB card and onto edge devices. It uses GGUF weights plus an ONNX audio tokenizer instead of dragging in the full PyTorch stack:

pip install -e ".[llama-cpp-onnx]"

A minimal zero-shot clone looks like this — point the model at a short reference clip and a transcript, then synthesize new text in that voice:

from moss_tts import MossTTS

tts = MossTTS.from_pretrained("OpenMOSS-Team/MOSS-TTS-v1.5")

audio = tts.generate(
    text="The quarterly numbers are in, and they look better than we feared. [pause 0.8s] Let's walk through them.",
    ref_audio="samples/narrator.wav",   # short reference clip of the target voice
    ref_text="This is the reference transcript.",
    language="en",
)
audio.save("out.wav")

Expect the first run to download several GB of weights and the ONNX tokenizer. On a 12GB card the 8B model loads comfortably; on 8GB you'll want the llama.cpp/GGUF build and should close other GPU apps first.
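Before blaming the model for a bad clone, it can be worth confirming what the reference clip actually contains. The helper below is a small sketch using only Python's standard `wave` module. The function name is hypothetical, it handles uncompressed PCM WAV files only, and it deliberately applies no length threshold, since no exact reference duration for MOSS-TTS is given here.

```python
import wave


def wav_info(path: str) -> dict:
    """Report duration, sample rate, and channel count of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "seconds": frames / rate,
            "sample_rate": rate,
            "channels": wf.getnchannels(),
        }


if __name__ == "__main__":
    # Path taken from the example above; replace with your own clip.
    print(wav_info("samples/narrator.wav"))
```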

Apple Silicon and ComfyUI

Two integration points matter for this audience.

MLX on Apple Silicon. MOSS-TTS and the MOSS audio tokenizer support mlx-audio, and the community has published quantized builds such as mlx-community/MOSS-TTS-8B-8bit. On a Mac with unified memory this is the cleanest route — no CUDA, no driver wrangling. If you're already running local models on a Mac, the same logic from our Ollama MLX backend setup guide applies: MLX builds trade a little quality headroom for a big jump in setup simplicity and memory efficiency on M-series chips.

ComfyUI. There's a community extension, comfyui-moss-tts, that wires the model into ComfyUI's node graph. If you already run an image pipeline, you can bolt TTS onto the same canvas — useful for generating narrated video assets in one workflow. If you're new to ComfyUI nodes, our ComfyUI custom nodes guide covers how to install and manage third-party packs without breaking your install.

How it compares

The real decision is rarely "MOSS vs. everything." It's "which cloning model can I legally ship, and is it good enough?" Here's the honest breakdown against the two models people actually reach for.

vs. XTTS v2 — XTTS v2 is still the community's reference point for cloning fidelity from ~6 seconds of audio across 17 languages, and its tooling ecosystem is enormous. But the CPML license is non-commercial and, with Coqui gone, unfixable. MOSS-TTS 1.5 gets you most of the way on quality with a license you can build on and a wider 31-language footprint. If you're doing a personal project and want the largest pile of tutorials, XTTS v2 still wins on ecosystem. For anything commercial, it's disqualified.

vs. F5-TTS — F5-TTS clones from roughly 3 seconds of reference audio and is one of the fastest-moving local TTS projects, with excellent few-shot results. Same blocker: CC-BY-NC weights mean no commercial use. F5-TTS is arguably easier to get a quick demo running. MOSS-TTS 1.5 wins on long-form stability (the one-hour single-run claim) and, again, on licensing.

vs. Kokoro / Piper — worth naming because they come up constantly. Kokoro (Apache 2.0) and Piper (MIT) are both commercial-friendly and excellent, but neither clones voices — they ship fix
