TDoC '24 Day 4: A Dive into Neural Networks and Tacotron2 for Text-To-Speech

Overview

Deep learning is a subset of machine learning inspired by how the human brain processes information. While the complexity of biological neurons is unmatched, artificial neurons model their fundamental characteristics to process data in computational systems.

This blog delves into deep learning concepts like sequential models, activation layers, and popular architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We will also explore Tacotron2, an end-to-end text-to-speech (TTS) system.


Neural Networks

Artificial Neurons

At the core of deep learning are artificial neurons. These consist of:

  • Linear function: a weighted sum of the inputs plus a bias, which models relationships much like linear regression.
  • Non-linear activation function: adds the complexity needed to model non-linear data, using functions like sigmoid or ReLU (a small sketch follows this list).
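
To make this concrete, here is a minimal sketch of a single artificial neuron in plain Python with NumPy. The weights, bias, and choice of sigmoid are arbitrary assumptions for illustration, not part of any particular library.

import numpy as np

def sigmoid(z):
    # Non-linear activation: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Linear part: weighted sum of the inputs plus a bias (like linear regression)
    z = np.dot(w, x) + b
    # Non-linear part: activation applied to the linear output
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights (arbitrary)
b = 0.2                          # bias (arbitrary)
print(neuron(x, w, b))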

Sequential Model

The sequential model stacks layers linearly, where each layer's output serves as input to the next.

Types of Layers

  • Linear Layer: Fully connects all outputs of one layer to neurons in the next.
  • Activation Layer: Applies non-linear transformations (e.g., sigmoid, ReLU) to mimic real-world complexity.

Beyond these, there are many other kinds of layers, such as convolutional, recurrent, and dropout layers.
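
Putting the pieces above together, here is a minimal sketch (for illustration only, not part of the original pipeline) of a sequential model in PyTorch that stacks linear and activation layers; the layer sizes are arbitrary.

import torch
from torch import nn

# A small sequential model: each layer's output feeds the next layer
model = nn.Sequential(
    nn.Linear(4, 8),   # linear layer: fully connects 4 inputs to 8 neurons
    nn.ReLU(),         # activation layer: applies a non-linear transformation
    nn.Linear(8, 1),   # linear layer: maps 8 features to a single output
    nn.Sigmoid(),      # activation layer: squashes the output into (0, 1)
)

x = torch.randn(2, 4)   # a batch of 2 samples with 4 features each
print(model(x))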


Neural Network Architectures

Convolutional Neural Networks (CNNs)

  • Efficiently process spatial data (e.g., images).
  • Apply filters to detect local patterns like edges and textures.

Recurrent Neural Networks (RNNs)

  • Handle sequential data like text or time series by retaining "memory" of previous inputs.
  • Suitable for tasks where context is important, such as language modeling.

Long Short-Term Memory (LSTM)

LSTMs solve the vanishing gradient problem of RNNs by maintaining both long-term and short-term memories.
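
For a concrete feel of these architectures, the PyTorch layers below mirror them directly. This is only an illustrative sketch with arbitrary shapes, not a full model.

import torch
from torch import nn

# Convolutional layer: slides 3x3 filters over an image to detect local patterns
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image = torch.randn(1, 3, 32, 32)    # one RGB image, 32x32 pixels
print(conv(image).shape)             # torch.Size([1, 16, 32, 32])

# LSTM layer: processes a sequence step by step, keeping long- and short-term state
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
sequence = torch.randn(1, 5, 10)     # one sequence of 5 time steps
output, (hidden, cell) = lstm(sequence)
print(output.shape)                  # torch.Size([1, 5, 20])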


Tacotron2: Revolutionizing Text-to-Speech

Tacotron2, developed by Google, simplifies traditional TTS pipelines into just two components: Text-to-Spectrogram and Vocoder.

Why Tacotron2?

  • Natural Sounding Speech: Generates human-like prosody.
  • End-to-End Learning: Reduces manual feature engineering.
  • Flexibility: Adapts to diverse voice styles.

Tacotron2 Architecture

  1. Text-to-Spectrogram Module:

    • Encoder: Extracts linguistic features from text.
    • Decoder: Converts these features into mel spectrograms.
    • Attention Mechanism: Aligns input text with corresponding audio frames.
  2. Vocoder:

    • Converts the mel spectrogram into raw audio using tools like WaveGlow or WaveRNN.

Implementation Steps

Preparation

  • Install dependencies:
  pip install deep_phonemizer torchaudio matplotlib

Text Processing

  • Character-based encoding:
# Supported characters and a lookup table mapping each character to an integer index
symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)

def text_to_sequence(text):
    # Lower-case the text and keep only supported characters, mapped to their indices
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]

text = "Hello world! Text to speech!"
print(text_to_sequence(text))
  • Phoneme-based encoding:
import torch
import torchaudio

# Pipeline bundle with a phoneme-based text processor (requires deep_phonemizer)
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)

The intermediate representation of the processed text can be obtained by executing the following statement:

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
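
For completeness, torchaudio also ships a ready-made character-based text processor through its pipeline bundles. A minimal sketch, assuming the TACOTRON2_WAVERNN_CHAR_LJSPEECH bundle, would look like this:

import torch
import torchaudio

# Character-based bundle: the processor maps raw characters to indices directly
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")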

Spectrogram Generation

  • Generate spectrograms with Tacotron2:
import torch
import torchaudio

# Run on GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    # Generate a mel spectrogram from the processed text
    spec, _, _ = tacotron2.infer(processed, lengths)
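
To sanity-check the result, the generated spectrogram can be visualized with matplotlib (installed earlier). This plotting snippet is only an illustrative sketch, not part of the original pipeline:

import matplotlib.pyplot as plt

# Display the mel spectrogram: mel bins on the y-axis, frames on the x-axis
plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
plt.xlabel("Frames")
plt.ylabel("Mel bins")
plt.show()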

Waveform Generation

  • WaveRNN Vocoder:
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    # Text -> mel spectrogram, then spectrogram -> raw waveform via the WaveRNN vocoder
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)

We will also create a function that plots the waveform and spectrogram, and returns the audio corresponding to the output.

import IPython
import matplotlib.pyplot as plt

def plot(waveforms, spec, sample_rate):
    # Move the waveform to the CPU and detach it from the autograd graph
    waveforms = waveforms.cpu().detach()

    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)


plot(waveforms, spec, vocoder.sample_rate)
  • Griffin-Lim Vocoder:
bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
# Griffin-Lim reconstructs the waveform directly from the spectrogram
waveforms, lengths = vocoder(spec, spec_lengths)

To check the output, we can again use the plot function we created earlier for the WaveRNN vocoder.

plot(waveforms, spec, vocoder.sample_rate)

Integrating Tacotron2 in Your Project

To use Tacotron2 with the TTS library, create a CLI tool as shown below:

Creating a CLI Tool

Environment Setup:
Create a virtual environment using Anaconda with Python 3.10. Here, I have named the environment "vocalshift".

conda create -n vocalshift python=3.10

Activate the environment that was just created.

conda activate vocalshift

Install PyTorch. Here I have installed the CPU-only build, since it works on every machine, but you are free to install the GPU build if you have a compatible GPU.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Install the TTS library for converting text to speech.

pip install TTS

Install librosa and soundfile for audio manipulation and processing.

pip install librosa==0.10.2 soundfile

Main Script:
Now, we will create a file called main.py for the text-to-speech CLI.

First, we will create the argument parser using argparse.

import argparse
import os
from TTS.api import TTS
from pathlib import Path
from voice_manipulator import audio_manipulator

def create_tts_cli():
    parser = argparse.ArgumentParser(description='Text to Speech CLI Tool')
    parser.add_argument('--text', type=str, help='Text to convert to speech')
    parser.add_argument('--input-file', type=str, help='Text file to convert to speech')
    parser.add_argument('--output', type=str, default='output.wav', help='Output audio file path')
    parser.add_argument('--speaker', type=str, help='Path to speaker voice sample')
    # parser.add_argument('--language', type=str, help='Language code (default: en)')
    parser.add_argument('--effect', type=str, default=None, help='Effect to apply to the audio')
    parser.add_argument('--effect-level', type=float, default=1.0, help='Effect level to apply to the audio')
    return parser

Then we will create the main function, in which this parser is instantiated and the parsed arguments are used for the TTS conversion.

def main():
    # Create the argument parser and define the CLI arguments
    parser = create_tts_cli()

    # Parse the command-line arguments
    args = parser.parse_args()

    # Ensure that either --text or --input-file is provided
    if not args.text and not args.input_file:
        parser.error("Either --text or --input-file must be provided")

    # Get the directory of the output file path
    output_dir = os.path.dirname(args.output)

    # Create the output directory if it does not exist
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # If an input file is provided, read the text from the file
    if args.input_file:
        try:
            with open(args.input_file, 'r') as f:
                text = f.read()
        except Exception as e:
            print(f"Error reading input file: {str(e)}")
            return
    else:
        # Otherwise, use the text provided directly via the --text argument
        text = args.text

    # Call the process_tts function to perform the text-to-speech conversion
    success = process_tts(
        text=text,
        output_path=args.output,
        speaker_path=args.speaker,
        effect=args.effect,
        effect_level=args.effect_level,
    )

    # If the conversion failed, print an error message
    if not success:
        print("TTS conversion failed")
        return

# If this script is executed directly, call the main function
if __name__ == "__main__":
    main()

Now, to handle the TTS conversion, we will create the process_tts function just below our import statements.

def process_tts(text, output_path, speaker_path=None, language='en', effect=None, effect_level=None):
    try:
        # Initialize the TTS model with a specific model path
        tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

        # Define a temporary file path for intermediate audio processing
        temp_path = Path(output_path).parent / "temp.wav"

        # Print a message indicating the start of the text-to-speech conversion
        print(f"Converting text to speech...")

        # Check if an audio effect is specified
        if effect:
            # Convert text to speech and save to the temporary file
            tts.tts_to_file(
                text=text,
                file_path=temp_path,
                speaker_wav=speaker_path if speaker_path else None,
                split_sentences=True
            )
        else:
            # Convert text to speech and save directly to the output file
            tts.tts_to_file(
                text=text,
                file_path=output_path,
                speaker_wav=speaker_path if speaker_path else None,
                split_sentences=True
            )

        # If an effect is specified, apply it to the temporary audio file
        if effect:
            print(f"Applying effect: {effect} with level: {effect_level}")
            audio_manipulator(temp_path, output_path, effect, effect_level)
            print(f"Effect applied and audio saved to: {output_path}")
        else:
            # Print a message indicating the audio has been saved
            print(f"Audio saved to: {output_path}")

    except Exception as e:
        # Print an error message if an exception occurs during the process
        print(f"Error during conversion: {str(e)}")
        return False

    # Return True if the process completes successfully
    return True
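
Note that main.py imports audio_manipulator from a voice_manipulator module that is not shown in this post. Purely as an illustrative placeholder so the pipeline can run end to end, here is a minimal sketch of what voice_manipulator.py could look like, built on the librosa and soundfile packages installed above. The "pitch" and "speed" effect names and their behavior are my own assumptions, not the actual Vocalshift implementation.

# voice_manipulator.py -- illustrative placeholder only
import librosa
import soundfile as sf

def audio_manipulator(input_path, output_path, effect, effect_level=1.0):
    # Load the intermediate audio produced by the TTS step
    audio, sample_rate = librosa.load(str(input_path), sr=None)

    if effect == "pitch":
        # Shift the pitch by effect_level semitones (assumed effect name)
        audio = librosa.effects.pitch_shift(audio, sr=sample_rate, n_steps=effect_level)
    elif effect == "speed":
        # Speed the audio up or down by a factor of effect_level (assumed effect name)
        audio = librosa.effects.time_stretch(audio, rate=effect_level)

    # Write the manipulated audio to the final output path
    sf.write(str(output_path), audio, sample_rate)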

Run the CLI:

   python main.py --text "Hello world!" --output output.wav
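
You can also read the text from a file and write the result into a nested output directory (the file names below are just examples):

   python main.py --input-file story.txt --output audio/story.wav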

Conclusion

Deep learning has revolutionized how machines interpret and produce data. Tacotron2 exemplifies this by delivering human-like TTS capabilities with its simple, yet powerful architecture. Start experimenting today and transform how machines speak!

What You Achieved on Day 4

By the end of today, you:

  • Developed a strong grasp of Deep Learning and its inspiration from the human brain.
  • Learned about fundamental concepts like artificial neurons, activation functions, and the role of non-linearity in models.
  • Explored key architectures including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
  • Understood how Tacotron2 works for end-to-end Text-to-Speech (TTS) conversion.
  • Implemented TTS pipelines using Python libraries like torchaudio and TTS.
  • Used the TTS library to build a Text-To-Speech CLI Tool, which will serve as a fundamental part of Vocalshift.



Your Feedback Matters!

Share your thoughts, challenges, or results in the comments below. Let’s keep learning and growing together. 🚀
