Let's Talk Espressif ESP32-S3 Voice--Text-to-Speech (TTS)

Espressif's ESP32 modules are very well known. Today, let's talk about the Chinese speech synthesis example in ESP-Skainet, Espressif's voice assistant framework.

Compile the original example

First, clone the example project.
The project bundles its own copy of ESP-IDF, which is the version it runs best on, but you can still use your own IDF.
Enter the example directory.
In theory, once the chip target is set to esp32s3, the sdkconfig.defaults.esp32s3 configuration file is applied automatically.
In practice this does not seem to happen, so add a step to make sure the default configuration is actually used.
Set the chip target to ESP32-S3 (idf.py set-target esp32s3).
Open menuconfig (idf.py menuconfig).
Set Audio Media Hal -> Audio Hardware board to ESP32-S3-Korvo-1.
Build and flash the program (idf.py build, then idf.py flash).

Run the original example
Once it is running, the example prints its log output to the serial port.

Simplify and analyze the original example

The original example does roughly two things. The first part reads out a fixed sentence, "Lexin speech synthesis" (Lexin is Espressif's Chinese name). The second part reads out whatever text is typed in through the serial port.

The second part is often buggy, so let's strip the example down and focus on the first part.
The voice data used for synthesis comes from the static library libvoice_set_xiaole; at present this is the only voice (timbre) available. The rest of the TTS-related functions live in the static library libesp_tts_chinese.

Summary
The TTS library is heavily encapsulated, so it is not hard to use. However, judging from the examples run here, the synthesized audio still has pronunciation problems, and Espressif's TTS still lags behind more mature TTS solutions. This shortcoming may keep it out of commercial projects.

If a project needs text-to-speech, there are two ways around this. One is to use the API of a cloud platform: send the text and receive PCM audio back. The other, when the vocabulary is limited, is voice splicing: store the corresponding audio clips in the file system, look up the clips for the required content through a mapping, and play them back to back as a complete sentence. For example, with clips for "Alipay Collection", "Yuan", "One", "Ten", "Hundred", "Thousand", and "Ten thousand", the Alipay voice broadcast function can basically be reproduced by piecing the audio together.
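
The voice-splicing idea can be sketched in plain C. This is only a minimal illustration, not code from ESP-Skainet: the function name build_clip_sequence, the MAX_CLIPS limit, the "Zero" clip, and the digit clips beyond "One" are all assumptions made for the example. It turns an amount from 1 to 99999 into the ordered list of clip names that would be played one after another; on a real device each name would map to a PCM file in the file system.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical clip names. Each one stands for a short PCM recording
 * stored in the file system; "Zero" and "Two".."Nine" are assumed here. */
static const char *DIGIT_CLIPS[10] = {
    "Zero", "One", "Two", "Three", "Four",
    "Five", "Six", "Seven", "Eight", "Nine"
};
static const char *UNIT_CLIPS[5] = {
    NULL, "Ten", "Hundred", "Thousand", "Ten thousand"
};

#define MAX_CLIPS 16 /* more than the longest sequence produced below */

/* Fills out[] with the clips for "Alipay Collection <amount> Yuan".
 * Supports amounts 1..99999. Returns the number of clips written,
 * or 0 if the amount is out of range or the buffer is too small. */
size_t build_clip_sequence(unsigned amount, const char *out[], size_t cap)
{
    static const unsigned POW10[5] = {1, 10, 100, 1000, 10000};
    size_t n = 0;
    int started = 0;      /* set once the first non-zero digit is emitted */
    int pending_zero = 0; /* a zero was skipped between non-zero digits */

    if (amount == 0 || amount > 99999 || cap < MAX_CLIPS) {
        return 0;
    }

    out[n++] = "Alipay Collection";
    for (int pos = 4; pos >= 0; --pos) {
        unsigned d = (amount / POW10[pos]) % 10;
        if (d == 0) {
            if (started) {
                pending_zero = 1; /* spoken only if another digit follows */
            }
            continue;
        }
        if (pending_zero) {
            out[n++] = DIGIT_CLIPS[0];
            pending_zero = 0;
        }
        out[n++] = DIGIT_CLIPS[d];
        if (pos > 0) {
            out[n++] = UNIT_CLIPS[pos];
        }
        started = 1;
    }
    out[n++] = "Yuan";
    return n;
}

/* Stand-in for playback: prints the clip names in playing order. */
void print_clip_sequence(const char *clips[], size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        printf("%s%s", clips[i], (i + 1 < count) ? " | " : "\n");
    }
}
```

Calling build_clip_sequence(1234, clips, MAX_CLIPS) on a buffer declared as const char *clips[MAX_CLIPS] fills it with nine entries, starting with "Alipay Collection" and ending with "Yuan". In firmware, print_clip_sequence would be replaced by a loop that opens each clip and feeds its PCM data to the audio output.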
