DEV Community

Akshat Prakash

Introducing MARS5, an open-source, insanely prosodic text-to-speech (TTS) model


CAMB.AI introduces MARS5, a fully open-source (commercially usable) TTS model with breakthrough prosody and realism, available on GitHub: https://www.github.com/camb-ai/mars5-tts

Watch our full release video here:
https://www.youtube.com/watch?v=bmJSLPYrKtE

Why is it different?
MARS5 can replicate performances (from just 2-3 seconds of reference audio) in 140+ languages, even in extremely tough prosodic scenarios like sports commentary, movies, and anime: hard prosody that most closed-source and open-source TTS models struggle with today.

We're excited for you to try, build on, and use MARS5 for research and creative applications. Let us know any feedback on our Discord!

Top comments (3)

Akshat Prakash

Highlights:
Training data: trained on over 150K hours of data.
Params: 1.2B (750M AR / 450M NAR)
Multilingual: open-sourcing in English to begin with, but you can access it in 140+ languages on camb.ai
Diversity in prosody: handles very hard prosodic elements like commentary, shouting, anime, etc.

Akshat Prakash

The model follows a two-stage setup, operating on 6 kbps EnCodec tokens. Concretely, it consists of a ~750M parameter autoregressive part (which we call the AR model) and a ~450M parameter non-autoregressive multinomial diffusion part (which we call the NAR model). The AR model iteratively predicts the coarsest (lowest-level) codebook value for the EnCodec features, while the NAR model takes the AR output and infers the remaining codebook values in a discrete denoising diffusion task. Specifically, the NAR model is trained as a DDPM using a multinomial distribution on EnCodec features, effectively 'inpainting' the remaining codebook entries after the AR model has predicted the coarse codebook values.
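To make the two-stage flow concrete, here is a minimal toy sketch of the pipeline shape in numpy. All names, shapes, and the random stand-in models are hypothetical illustrations, not the real MARS5 architecture: audio is represented as an EnCodec-style grid of discrete tokens (one row per residual codebook), the AR stage fills the coarsest row left-to-right, and the NAR stage "inpaints" the remaining rows over a few denoising iterations.

```python
import numpy as np

# Hypothetical constants for illustration only.
VOCAB = 1024    # entries per codebook
CODEBOOKS = 8   # residual codebooks in the EnCodec token grid
rng = np.random.default_rng(0)

def ar_model(text_tokens, ref_tokens, steps):
    """Stage 1: autoregressively predict the coarsest (level-0) codebook.

    A random stand-in for the ~750M-parameter AR transformer: the real
    model samples each next token conditioned on the text, the 2-3 s
    reference audio, and all previously generated coarse tokens.
    """
    coarse = []
    for _ in range(steps):
        coarse.append(int(rng.integers(VOCAB)))
    return np.array(coarse)

def nar_model(coarse, denoise_steps=4):
    """Stage 2: multinomial-diffusion 'inpainting' of the fine codebooks.

    A stand-in for the ~450M-parameter NAR model: starting from noise,
    it iteratively re-predicts the remaining rows while the AR-produced
    coarse row stays fixed as conditioning.
    """
    T = len(coarse)
    # Start from uniform categorical noise over the fine codebooks.
    fine = rng.integers(VOCAB, size=(CODEBOOKS - 1, T))
    for _ in range(denoise_steps):
        # Real model: predict per-position token distributions and
        # re-sample; here a placeholder resampling step.
        fine = rng.integers(VOCAB, size=(CODEBOOKS - 1, T))
    # Full (CODEBOOKS, T) token grid: coarse row on top, inpainted rows below.
    return np.vstack([coarse[np.newaxis, :], fine])

tokens = nar_model(ar_model(text_tokens=None, ref_tokens=None, steps=50))
print(tokens.shape)  # (8, 50); a real system would decode this grid to audio
```

The key structural point the sketch shows is the division of labor: only the coarsest codebook is generated sequentially, while the bulk of the token grid is filled in parallel by the diffusion stage conditioned on it.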

The model was trained on a combination of publicly available datasets, as well as data provided internally by our customers, which include large sports leagues and international creatives.


Akshat Prakash • Edited