0xkoji

Posted on Jan 27, 2025

Exploring Kokoro TTS Voice Synthesis on Google Colab with T4

#python #tts #kokoro #ai

What is Kokoro-82M?

Kokoro-82M is a text-to-speech (TTS) model with 82 million parameters that generates high-quality audio. It converts text to speech with very little code, and new voices can be created by blending its voice packs with weights.

Kokoro-82M on Hugging Face

From version 0.23, Japanese is also supported.

You can try it out easily via the following link:

Kokoro TTS on Hugging Face Spaces

However, the intonation for Japanese still feels slightly unnatural.

In this post, we install kokoro-onnx, a TTS implementation built on Kokoro and ONNX Runtime, and work with version 0.19 of the model, a stable release that supports only American English and British English. Note that the code below loads the v0.19 PyTorch checkpoint (kokoro-v0_19.pth) from the cloned Hugging Face repository.

As the title suggests, all code runs on Google Colab with a T4 GPU.

Installing kokoro-onnx

!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch
!pip install -U kokoro-onnx

Loading Packages

import numpy as np
from scipy.io.wavfile import write
from IPython.display import display, Audio
import torch
from models import build_model
from kokoro import generate

Running the Sample

Before testing voice synthesis, let’s run the official sample.
Running the following code will generate and play audio within a few seconds.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
MODEL = build_model('kokoro-v0_19.pth', device)
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
][0]
VOICEPACK = torch.load(f'voices/{VOICE_NAME}.pt', weights_only=True).to(device)
print(f'Loaded voice: {VOICE_NAME}')

text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
audio, out_ps = generate(MODEL, text, VOICEPACK, lang=VOICE_NAME[0])

display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)
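
To keep the generated audio instead of only playing it, the samples can be written to a WAV file. In Colab, `scipy.io.wavfile.write`, which this post already imports, is the direct route. The sketch below is a standard-library alternative. The `save_wav` helper and the synthetic tone standing in for `audio` are illustrative assumptions, and the code assumes mono float samples between -1.0 and 1.0 at the 24000 Hz rate used above.

```python
import array
import math
import sys
import wave

SAMPLE_RATE = 24000  # same rate passed to Audio(...) in the sample above


def save_wav(path, samples, rate=SAMPLE_RATE):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    pcm = array.array('h')
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == 'big':
        pcm.byteswap()                # WAV data is little-endian
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Stand-in for the `audio` array: half a second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]
frames_written = save_wav('sample.wav', tone)
```

In the notebook, `save_wav('sample.wav', audio)` should store the generated speech, assuming `audio` is a one-dimensional float array.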

Voice Synthesis

Now, let’s get into the main topic and test voice synthesis.

Defining Voice Packs

af: American English female voice
am: American English male voice
bf: British English female voice
bm: British English male voice

We will load all available voice packs for now.

voicepack_af = torch.load(f'voices/af.pt', weights_only=True).to(device)
voicepack_af_bella = torch.load(f'voices/af_bella.pt', weights_only=True).to(device)
voicepack_af_nicole = torch.load(f'voices/af_nicole.pt', weights_only=True).to(device)
voicepack_af_sarah = torch.load(f'voices/af_sarah.pt', weights_only=True).to(device)
voicepack_af_sky = torch.load(f'voices/af_sky.pt', weights_only=True).to(device)
voicepack_am_adam = torch.load(f'voices/am_adam.pt', weights_only=True).to(device)
voicepack_am_michael = torch.load(f'voices/am_michael.pt', weights_only=True).to(device)
voicepack_bf_emma = torch.load(f'voices/bf_emma.pt', weights_only=True).to(device)
voicepack_bf_isabella = torch.load(f'voices/bf_isabella.pt', weights_only=True).to(device)
voicepack_bm_george = torch.load(f'voices/bm_george.pt', weights_only=True).to(device)
voicepack_bm_lewis = torch.load(f'voices/bm_lewis.pt', weights_only=True).to(device)
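
The prefix convention listed above can be decoded programmatically, which helps when picking the `lang` argument for a given voice. The following is a minimal sketch: `describe_voice` is a hypothetical helper, not part of Kokoro, and it assumes every voice name starts with a two-letter prefix (accent, then gender) as in the list above.

```python
ACCENTS = {'a': 'American English', 'b': 'British English'}
GENDERS = {'f': 'female', 'm': 'male'}


def describe_voice(name):
    """Decode a voice pack name such as 'bf_emma' using its two-letter prefix."""
    prefix = name.split('_')[0]          # 'bf_emma' -> 'bf', 'af' -> 'af'
    if len(prefix) != 2 or prefix[0] not in ACCENTS or prefix[1] not in GENDERS:
        raise ValueError(f'unrecognized voice name: {name!r}')
    return {
        'lang': prefix[0],               # first letter, as in lang=VOICE_NAME[0]
        'accent': ACCENTS[prefix[0]],
        'gender': GENDERS[prefix[1]],
        'path': f'voices/{name}.pt',     # same layout as the torch.load calls
    }


for voice in ['af', 'af_bella', 'am_adam', 'bf_emma', 'bm_lewis']:
    info = describe_voice(voice)
    print(f"{voice:10s} {info['accent']:17s} {info['gender']:6s} lang={info['lang']!r}")
```

The `lang` value returned here is the same first letter that the official sample passes to `generate` via `VOICE_NAME[0]`.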

Generating Audio with Predefined Voices

To check the difference between synthesized voices, let’s generate audio using different voice packs.
We will use the sample text as is, but you can change the voicepack_ variable to use any desired voice pack.

audio, out_ps = generate(MODEL,
                         text,
                         voicepack_bf_emma,
                         lang='b')  # British voice pack; VOICE_NAME[0] is 'a' here
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

audio, out_ps = generate(MODEL,
                         text,
                         voicepack_bf_isabella,
                         lang='b')  # British voice pack; VOICE_NAME[0] is 'a' here
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

audio, out_ps = generate(MODEL,
                         text,
                         voicepack_bm_lewis,
                         lang='b')  # British voice pack; VOICE_NAME[0] is 'a' here
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

Mixing Voice Packs

First, let’s create an average voice combining two British female voices (bf).

bf_average = (voicepack_bf_emma + voicepack_bf_isabella) / 2
audio, out_ps = generate(MODEL,
                         text,
                         bf_average,
                         lang='b')  # both source voices are British
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

Next, let’s blend two female voices and one male voice using unequal weights.

weight_1 = 0.25
weight_2 = 0.45
weight_3 = 0.3
weighted_voice = (voicepack_bf_emma * weight_1 +
                  voicepack_bf_isabella * weight_2 +
                  voicepack_bm_lewis * weight_3)
audio, out_ps = generate(MODEL,
                         text,
                         weighted_voice,
                         lang='b')  # all three source voices are British
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)
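
The three weights above sum to 1.0, so the result is a weighted average of the input voice packs. A small helper can rescale arbitrary weights before mixing. This is a sketch, not part of Kokoro: `blend_voicepacks` is a hypothetical name, and because the code relies only on `*` and `+`, it should work for the `torch` tensors loaded earlier as well as for the plain floats used here as stand-ins.

```python
def blend_voicepacks(packs, weights):
    """Return the weighted average of `packs`, rescaling weights to sum to 1."""
    if not packs or len(packs) != len(weights):
        raise ValueError('packs and weights must be non-empty and equal in length')
    if any(w < 0 for w in weights):
        raise ValueError('weights must be non-negative')
    total = sum(weights)
    if total <= 0:
        raise ValueError('weights must sum to a positive value')
    blended = packs[0] * (weights[0] / total)
    for pack, weight in zip(packs[1:], weights[1:]):
        blended = blended + pack * (weight / total)
    return blended


# Plain floats stand in for voice pack tensors in this demo.
equal_mix = blend_voicepacks([1.0, 3.0], [1, 1])    # simple average
ratio_mix = blend_voicepacks([0.0, 10.0], [3, 1])   # 75% first, 25% second
```

With the tensors from this post, `blend_voicepacks([voicepack_bf_emma, voicepack_bf_isabella, voicepack_bm_lewis], [25, 45, 30])` should give the same mix as the weights 0.25, 0.45, and 0.3, up to floating-point rounding.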

Finally, let’s synthesize a mix of American and British male voices.

m_average = (voicepack_am_michael + voicepack_bm_george) / 2
audio, out_ps = generate(MODEL,
                         text,
                         m_average,
                         lang=VOICE_NAME[0])
display(Audio(data=audio, rate=24000, autoplay=True))
print(out_ps)

I also tested mixing voices in a Gradio interface to see what happens.

Combining this with Ollama could lead to some fun experiments.
