August 16th, 2024 · 2 min read
Text-to-Speech (TTS) technology, also known as speech synthesis, converts text into human-like speech. Deep learning has brought major gains in TTS quality and naturalness, but at the cost of higher computational requirements. Most big tech companies offer cloud-based TTS APIs, such as Google Text-to-Speech, Amazon Polly, and Microsoft Text-to-Speech, and newer companies such as ElevenLabs and Coqui Studio have emerged with similar offerings. While convenient, these services require an internet connection, raise privacy concerns, and are vulnerable to network outages. On-device solutions offer more flexibility and privacy by synthesizing speech directly on the user's device, but few such options exist. This article explores three open-source Python libraries and Picovoice Orca Text-to-Speech.
PyTTSx3
PyTTSx3 is a Python library that wraps the speech engine native to each platform: eSpeak on Linux, NSSpeechSynthesizer on macOS, and SAPI5 on Windows. Getting started is straightforward:
- Install pyttsx3:
pip install pyttsx3
- Save synthesized speech to a file in Python:
import pyttsx3
engine = pyttsx3.init()
engine.save_to_file(text='Hello World', filename='PATH/TO/OUTPUT.wav')
engine.runAndWait()
PyTTSx3 is simple to use, but eSpeak's voices sound robotic compared to more modern TTS systems.
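The two steps above can be wrapped in a small helper. The sketch below is illustrative only: the function name `synthesize_to_file` and its `rate` parameter are hypothetical, while `setProperty("rate", ...)` is pyttsx3's documented way to set the speaking rate in words per minute. The import is done lazily so the helper can be defined even where pyttsx3 is not installed.

```python
def synthesize_to_file(text, output_path, rate=None):
    """Save `text` as speech to `output_path` with pyttsx3 (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")

    # Imported here so that defining the helper does not require pyttsx3.
    import pyttsx3

    engine = pyttsx3.init()
    if rate is not None:
        # Speaking rate in words per minute.
        engine.setProperty("rate", rate)
    engine.save_to_file(text, output_path)
    # Blocks until the queued save command has finished.
    engine.runAndWait()
```

A call such as `synthesize_to_file("Hello World", "PATH/TO/OUTPUT.wav", rate=150)` would then produce a slightly slower voice than the default.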
Coqui TTS
Coqui TTS is the open-source library behind Coqui Studio. Developers can use Coqui's pretrained models or train custom voices. To synthesize speech, follow these steps:
- Install Coqui TTS:
pip install TTS
- List available models in Python:
from TTS.api import TTS
TTS().list_models()
- Choose a model name and save synthesized speech to a file:
tts = TTS("CHOSEN/MODEL/NAME")
tts.tts_to_file(text="Hello World", file_path="PATH/TO/OUTPUT.wav")
Coqui offers high-quality voices with natural prosody, at the cost of larger model sizes and longer processing times.
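To put a number on "longer processing times", one common measure is the real-time factor: synthesis time divided by the duration of the generated audio. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the engine writes an uncompressed WAV file to the given path.

```python
import time
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(synthesize, output_path):
    """Time `synthesize(output_path)` and divide by the audio duration.

    `synthesize` is any callable that writes a WAV file to `output_path`.
    Values below 1.0 mean synthesis ran faster than real time.
    """
    start = time.perf_counter()
    synthesize(output_path)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration_seconds(output_path)
```

With Coqui, the callable could be something like `lambda path: tts.tts_to_file(text="Hello World", file_path=path)`, assuming `file_path` is the output parameter in your installed version. The same helper works for any of the other engines in this article that write WAV files.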
Mimic3 from Mycroft
Mycroft is a free and open-source virtual assistant that offers a TTS system called Mimic3. Mimic3 currently lacks a pure Python API, so we will call its command-line tool through Python's subprocess module:
- Install Mimic3:
pip install mycroft-mimic3-tts
- Synthesize speech and save file to directory OUTPUT/DIR:
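import subprocess
# Each list item is passed to mimic3 as one argument, so the text needs no extra quoting
args = [
    "mimic3",
    "Hello World",
    "--output-dir", "OUTPUT/DIR",
]
try:
    subprocess.check_call(args)
except subprocess.CalledProcessError as e:
    # Handle the error; here we only report the exit status
    print(f"mimic3 exited with status {e.returncode}")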
For prototyping on-device TTS, Mimic3 from Mycroft provides a balance of quality and performance.
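The subprocess call above can be made a little more robust by checking that the `mimic3` executable is actually on the PATH before running it. This is a sketch under that assumption; the function names `build_mimic3_args` and `synthesize_with_mimic3` are hypothetical and not part of Mimic3.

```python
import shutil
import subprocess


def build_mimic3_args(text, output_dir):
    """Build the argument list for the mimic3 command-line tool."""
    # Each list item is passed as a single argument, so `text` may
    # contain spaces without any shell quoting.
    return ["mimic3", text, "--output-dir", output_dir]


def synthesize_with_mimic3(text, output_dir):
    """Run mimic3 and raise if it is missing or exits with an error."""
    if shutil.which("mimic3") is None:
        raise FileNotFoundError("mimic3 executable not found on PATH")
    # check=True raises subprocess.CalledProcessError on a non-zero exit status.
    subprocess.run(build_mimic3_args(text, output_dir), check=True)
```

Separating argument construction from execution keeps the command easy to inspect and log before it runs.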
Orca Text-to-Speech
Picovoice Orca Text-to-Speech leverages state-of-the-art Text-to-Speech (TTS) models to provide high-quality voices, while still being small and efficient.
- Install the Orca Text-to-Speech Python SDK:
pip install pvorca
- Import Orca and create an Orca instance:
import pvorca
orca = pvorca.create(access_key="${ACCESS_KEY}")
Sign up or log in to Picovoice Console to copy your access key, and replace ${ACCESS_KEY} with it.
- Synthesize your desired text:
orca.synthesize(text="${TEXT}")
For more information, refer to the Orca Text-to-Speech Python SDK Documentation.
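To keep the synthesized audio, the raw samples can be written to a WAV file with Python's standard library. The sketch below assumes the samples are mono 16-bit PCM integers, as the Orca SDK describes for its output; the helper name `write_pcm_to_wav` is hypothetical, and the exact return value of `synthesize` may differ between SDK versions, so check the documentation for yours.

```python
import struct
import wave


def write_pcm_to_wav(pcm, sample_rate, output_path):
    """Write a sequence of 16-bit PCM samples to a mono WAV file."""
    with wave.open(output_path, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono
        wav_file.setsampwidth(2)          # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        # Pack the integers as little-endian signed 16-bit values.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Hypothetical usage, assuming `orca` was created as shown earlier, that
# `synthesize` returns the PCM samples, and that `orca.sample_rate` holds
# the output sample rate:
# pcm = orca.synthesize(text="Hello World")
# write_pcm_to_wav(pcm, orca.sample_rate, "PATH/TO/OUTPUT.wav")
```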
Conclusion
On-device TTS removes privacy concerns and internet requirements and minimizes latency. With Python solutions like PyTTSx3, Coqui TTS, and Mimic3, developers have several options for synthesizing speech directly on devices, depending on their needs. However, each solution comes with drawbacks, such as poor voice quality, large resource requirements, or the lack of a flexible API. Another alternative is Orca Text-to-Speech, which combines state-of-the-art neural TTS with efficiency, making it possible to synthesize high-quality speech even on a Raspberry Pi.