
Sven Freiberg

Setting up local speech synthesis using piper and piper-whistle

Prerequisites:

Let's take a look at how one may set up an ad-hoc, local / offline text-to-speech synthesizer with great voice quality using piper.

To set up piper on a GNU/Linux-based system, I'll describe a general architecture using named pipes. It is straightforward enough to allow for system-wide text-to-speech with a little bit of manual setup, the help of piper-whistle and some minor trade-offs (it's simple, yet it won't support parallel speech processing).

To start, let's fetch the latest piper stand-alone build from its repository hosted on GitHub (2023.11.14-2 at the time of writing). After downloading the compressed archive, we'll create a directory structure for our setup. The root directory shall be /opt/wind, with the following sub-directories:

  • /opt/wind/piper
  • /opt/wind/channels

Decompress and copy the piper build into /opt/wind/piper.
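In case it helps, here is a minimal sketch of that step. The archive name below is just an example taken from the x86_64 release asset; adjust it to whatever you actually downloaded, and check whether the tarball nests its contents in a piper/ directory before using --strip-components:

  # create the directory structure
  sudo mkdir -p /opt/wind/piper /opt/wind/channels
  # extract the downloaded release archive into /opt/wind/piper
  # (archive name is an example; --strip-components=1 assumes a top-level piper/ folder)
  sudo tar -xzf piper_linux_x86_64.tar.gz -C /opt/wind/piper --strip-components=1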
For managing piper's voice models, we use piper-whistle¹, a command-line utility that makes downloading and accessing voices more convenient.

You can get the latest release using pip install piper-whistle or download the wheel file from its GitLab releases page. After installing whistle, let's fetch a voice to generate speech with. The first step is updating the database by calling piper_whistle -vR. For English speech, I quite like the female voice called alba. Using whistle, we can list all available English (GB) voices with piper_whistle list -l en_GB. The alba voice is at index 2, so to install it, simply call piper_whistle install en_GB 2.
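Put together, the whole voice setup boils down to four commands (the index may shift as new voices are added, so double-check the list output first):

  pip install piper-whistle       # or install the wheel from the releases page
  piper_whistle -vR               # refresh the local voice database
  piper_whistle list -l en_GB     # list available en_GB voices
  piper_whistle install en_GB 2   # install alba (index 2 at the time of writing)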

Next, let's create the necessary named pipes. The resulting structure will look like this:

  • /opt/wind/channels/speak (accepts json payload)
  • /opt/wind/channels/input (read by piper)
  • /opt/wind/channels/output (written by piper)

To create a named pipe, we can use the following command: mkfifo -m 755 /opt/wind/channels/input
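The other two channels are created the same way:

  mkfifo -m 755 /opt/wind/channels/speak
  mkfifo -m 755 /opt/wind/channels/output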
Finally, we create three processes in separate shells:

  • tty0: tail -F /opt/wind/channels/speak | tee /opt/wind/channels/input
  • tty1: /opt/wind/piper/piper -m $(piper_whistle path alba@medium) --debug --json-input --output_raw < /opt/wind/channels/input > /opt/wind/channels/output
  • tty2: aplay --buffer-size=777 -r 22050 -f S16_LE -t raw < /opt/wind/channels/output

The process on tty0 ensures the pipe is kept open after processing by piper or aplay. This way, we can subsequently queue TTS requests. Using the structure above, we can now use piper-whistle to generate speech, for example:
piper_whistle speak "If the infinite had not desired man to be wise, he would not have bestowed upon him the faculty of knowing."
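Under the hood, whistle simply writes a JSON payload into the speak pipe. If you want to test the pipeline without whistle, something like the following should also work (a sketch, assuming piper's --json-input format with a plain "text" field):

  # hypothetical direct test: push a JSON payload into the speak channel
  echo '{ "text": "Hello from the named pipe." }' > /opt/wind/channels/speak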

Hope this helps with your text-to-speech needs. If you have any ideas for improvement or find an issue with piper-whistle, feel free to open a ticket on its repository, reach out on Twitter or join my Discord. Thanks for reading and until next time.


  1. Yes, I'm the author of the package :) 
