Tacotron-2 - Text to Speech, My Speech - Part 1

Jack Moore
I'm interested in infrastructure, automation, and reliability. I write a fair amount of Ruby on Rails.
Originally published at jmoore53.com ・ 5 min read

The gist of this post is that every day for the past few weeks I have gone home and talked to my computer. Yes, you read that correctly. I have gone home and talked to my computer. Also, don’t let the title fool you. The meat of this post is setting up an environment and a process to help me develop a custom text-to-speech program. I would only call this a mere glance into text-to-speech algorithms.

I need the future, now. This project is to build a text-to-speech system from the ground up, with my own voice as the training data, and I am very excited about it.

I will be using a handful of AI libraries and audio tools to make this process go as smoothly as possible. Some include Mozilla’s TTS, Gentle for speech mapping, SoX for data cutting, and of course Python…

Picking a Library

After evaluating the libraries that are out there, I decided to go with Mozilla’s TTS library. It seemed relatively easy to put together, and I felt I could reproduce my own voice from text with it.

Building a Dataset

It appears having a good, no scratch that, perfect, dataset is the most important part of building any decent text to speech application.

I looked at the LJ Speech Dataset (about 24 hours of audio across 13,100 utterances/audio clips) and decided I would need to record roughly the same amount of my own speech. I am sure it will take upwards of 40-50 hours in total to ensure the data is properly extracted, transformed and loaded. I plan to use the aeneas library to match my speech to text. To confirm this is the right way to build a model, I also looked at how other datasets are built. The other popular ones are the Blizzard Challenge datasets and the M-AILABS Speech Dataset. I could have modeled mine on these, but the LJ Speech Dataset seemed easier to replicate.
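As a quick sanity check on those numbers, here is the arithmetic as a small sketch. The function name is my own, and the only inputs are the 24 hours and 13,100 clips quoted above.

```python
def average_clip_seconds(total_hours, clip_count):
    """Average clip length in seconds for a dataset of the given size."""
    return total_hours * 3600 / clip_count


# LJ Speech-sized target: about 24 hours spread over 13,100 clips.
avg = average_clip_seconds(24, 13_100)
print(f"average clip length: {avg:.1f} s")
```

So each recorded sentence should land at around six to seven seconds on average.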

Mozilla provides a great article here on how to build a custom voice Text to Speech Application. The article mentioned will be one of many I will be using to learn more about building my custom TTS program.

There’s really just no chance

There is not a fucking chance I sit down and read 13,000 one-liners back to back. I need to find an already existing text that is broken up into sentences, match it to the wav file, and cut that into 5-second wav files. Enter Python. This isn’t too crazy, but my plan is to basically read a chapter a day until I am done with three books. This should get me to ~15,000 sentences, which should be all that’s needed to train the model. I will feed the model more data if I feel it is necessary.

The process will look like the following:
1) Find a full plain text book online
2) Parse the text sentence by sentence into a single data file (Python)
3) Read and record the single file to a single wav file
4) Use the Python library aeneas to match text to speech (still in the bigger file)
5) Use Python to break up the large wav file into smaller wav files using ffmpeg
6) Use aeneas to create the LJ wavs folder and .csv file
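Step 2 above (text to one sentence per line) can be sketched with nothing but the standard library. This is a deliberately naive splitter, not what I ended up shipping: it assumes a sentence ends at ".", "!" or "?" followed by whitespace, so abbreviations like "Dr. Smith" will be split incorrectly. The function names are my own.

```python
import re

# Split after ., ! or ? when followed by whitespace (a naive rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences in `text`, one string per sentence."""
    collapsed = " ".join(text.split())  # join wrapped lines, squeeze whitespace
    return [s for s in _SENTENCE_END.split(collapsed) if s]


def write_script(text, out_path):
    """Write one sentence per line so the file can be read aloud top to bottom."""
    sentences = split_sentences(text)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sentences) + "\n")
    return len(sentences)
```

Calling `write_script(book_text, "script.txt")` returns the number of sentences written, which is handy for tracking progress toward the ~15,000 target.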

So I thought…

Refined Process

The 6-step process above is nice, but a little unrealistic and way too time consuming for someone as lazy as myself. If it can be better I will make it better. Here is the new process:
1) Find a text/plaintext script (movie scripts are fun to read) -> plaintext file
2) Record the text in GarageBand (I’m on a Mac and couldn’t get my mic to work with Audacity) and save it in wav format -> wav file
3) Upload the text and the unbroken large wav file to Gentle -> json file
4) Parse the JSON returned from Gentle and break the large file up with sox into an LJSpeech-style dataset -> wavs folder and a csv mapping each file to its text
5) Pass the LJSpeech-style dataset to the TTS model
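Step 4 is the fiddly one, so here is a rough sketch of it. It rests on a few assumptions worth checking against real Gentle output: that the JSON has a "transcript" string and a "words" list, that each word carries "case", "startOffset", and (when aligned) "start" and "end" in seconds, and that the offsets index characters in the transcript. It also assumes the transcript was uploaded with one sentence per line. The helper names are hypothetical, and the sox call is only built here, not executed.

```python
import json


def load_gentle(path):
    """Load the alignment JSON saved from Gentle."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def sentence_segments(gentle):
    """Return (text, start_sec, end_sec) for every fully aligned transcript line."""
    segments = []
    pos = 0
    for line in gentle["transcript"].split("\n"):
        lo, hi = pos, pos + len(line)
        pos = hi + 1  # step over the newline
        words = [w for w in gentle["words"] if lo <= w["startOffset"] < hi]
        # Drop lines with any unaligned word: their boundaries can't be trusted.
        if not words or any(w.get("case") != "success" for w in words):
            continue
        start = min(w["start"] for w in words)
        end = max(w["end"] for w in words)
        segments.append((line.strip(), start, end))
    return segments


def sox_trim_command(src, dst, start, end, pad=0.1):
    """Build a `sox src dst trim <start> <length>` command with a little padding."""
    begin = max(0.0, start - pad)
    length = (end + pad) - begin
    return ["sox", src, dst, "trim", f"{begin:.3f}", f"{length:.3f}"]
```

Each command can then be handed to `subprocess.run(cmd, check=True)`. Dropping any sentence with an unaligned word throws away some audio, but a smaller clean dataset seems like the safer trade.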

Seems Easy Enough.

Dataset

Microsoft’s site says, “the data needs to be a collection (.zip) of audio files (.wav) as individual utterances. Each audio file should be 15 seconds or less in length, paired with a formatted transcript (.txt).” They are basically correct with the information they provide.
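A small check for the length rule, as a sketch: it uses Python's standard `wave` module, so it assumes plain PCM .wav files, and the 15-second default comes straight from the quote above. The function names are mine.

```python
import wave


def wav_duration_seconds(path):
    """Length of a PCM wav file in seconds (frames divided by sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def clips_over_limit(paths, limit=15.0):
    """Return the paths whose audio runs longer than `limit` seconds."""
    return [p for p in paths if wav_duration_seconds(p) > limit]
```

Running `clips_over_limit(glob.glob("wavs/*.wav"))` after cutting gives a list of clips to re-split or drop.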

This dataset example for Mozilla’s TTS is what the custom dataset should look like. I found the link on the Mozilla forum here. There are good forum posts here and here that go over training a custom voice.

In the /custom-dataset-sample/ directory there exists a wavs directory and a metadata_sample.csv file. The wavs directory stores .wav files and the metadata_sample.csv is structured to map wavs/file1.wav to the text inside of the wav file.
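Here is a sketch of writing that mapping file. I am assuming the LJ Speech layout, where each line is pipe-delimited as `id|text|normalized text` and the id is the wav filename without its extension; check the sample file before trusting that. The function name and the choice to reuse the raw text as the normalized text are my own simplifications.

```python
import os


def write_metadata(rows, dataset_dir, name="metadata.csv"):
    """Write 'clip_id|text|text' lines for every row whose wav file exists.

    rows: iterable of (clip_id, text) pairs, e.g. ("file1", "Hello world.").
    Returns the number of lines written.
    """
    written = 0
    with open(os.path.join(dataset_dir, name), "w", encoding="utf-8") as f:
        for clip_id, text in rows:
            wav = os.path.join(dataset_dir, "wavs", clip_id + ".wav")
            if not os.path.isfile(wav):
                continue  # skip entries whose audio is missing
            # The pipe is the delimiter, so it cannot appear inside the text.
            clean = " ".join(text.replace("|", " ").split())
            f.write(f"{clip_id}|{clean}|{clean}\n")
            written += 1
    return written
```

Skipping rows with missing audio keeps the csv and the wavs folder in sync, which matters because a dangling entry would only surface later during training.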

Writing a Preprocessor

Because we will be using a similar format to the LJ Speech Dataset, I will need to make sure training uses the matching dataset preprocessor. This could really fuck my model otherwise.
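To make it concrete, this is roughly what an LJ Speech-style preprocessor has to do. It is my own illustrative sketch, not Mozilla TTS's actual code, and it assumes pipe-delimited `id|text|normalized text` lines with the audio under a `wavs` folder.

```python
import os


def load_ljspeech_items(root_path, meta_file="metadata.csv"):
    """Parse LJ Speech-style metadata into (text, wav_path) pairs."""
    items = []
    with open(os.path.join(root_path, meta_file), encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue
            cols = line.split("|")
            if len(cols) < 2:
                raise ValueError(f"{meta_file}:{line_no}: expected 'id|text', got {line!r}")
            # Prefer the normalized text column when it is present and non-empty.
            text = cols[2] if len(cols) > 2 and cols[2] else cols[1]
            wav = os.path.join(root_path, "wavs", cols[0] + ".wav")
            items.append((text, wav))
    return items
```

If the metadata used a different delimiter or column order, a parser like this would silently pair the wrong text with the audio, which is exactly the failure I want to avoid.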

Training the Model

Looking at this Tacotron example, it appears a model trained on the LJ Speech Dataset went through 441k steps and the results sound decent. I will be using the Tacotron2 library.

Looking Forward

Currently I know the process I am going to follow to achieve this goal of having my voice used by a computer. My plan is to write part 2 of this series after I am done with all the data collection.

This will allow me to really dive deep into curve fitting and understand the specifics of how ML/AI works. I plan to have a demystified understanding of AI/ML when I return for the second post.
