DEV Community

Panlee123

Let's Talk About Espressif ESP32-S3 Voice: Text-to-Speech (TTS)

Espressif's ESP32 modules are well known. Today, let's look at the Chinese speech synthesis example in ESP-Skainet, Espressif's voice assistant framework.

Compile the original example

  1. First, clone the example repository.
    The project bundles its own copy of ESP-IDF, which is the version best suited to running it, but you can still use your own ESP-IDF installation.

  2. Enter the example directory.

  3. In theory, once the chip target is set to esp32s3, the sdkconfig.defaults.esp32s3 configuration file should be applied automatically.
    In practice this does not seem to happen, so this extra step makes sure the default configuration is actually used.
    [Screenshot: applying the default configuration]

  4. Set the chip target to ESP32-S3.

  5. Open menuconfig.
    Change Audio Media Hal -> Audio Hardware board to ESP32-S3-Korvo-1.

  6. Build and flash the program.

Run the original example
Once the program is running, the serial monitor prints the example's log output.
[Screenshot: serial output of the example]

Simplify and analyze the original example

The original example does roughly two things. First, it reads aloud the sentence "Espressif speech synthesis" ("Lexin" is the romanized form of Espressif's Chinese name). Second, it reads aloud text entered over the serial port.

The second part is often buggy, so let's simplify the example and focus on the first part.
[Screenshot: simplified example code]
The voice data used here comes from the static library libvoice_set_xiaole. At present this is the only voice (timbre) available. The remaining TTS functions belong to the static library libesp_tts_chinese.

Summary
The TTS is heavily encapsulated, which to some extent guarantees that it is not hard to use. However, judging from the examples run here, the audio still has pronunciation problems. Compared with some mature TTS solutions, Espressif's TTS still lags somewhat behind, and this shortcoming may keep it out of commercial projects.
If a project needs text-to-speech, there are two workarounds. One is to use the API of a cloud platform: send the text and receive PCM audio back. The other applies when the vocabulary is limited: store pre-recorded clips in the file system, map each word to its clip, and play the clips in sequence so that they piece together into a complete sentence. For example, with clips for "Alipay Collection", "Yuan", "One", "Ten", "Hundred", "Thousand", and "Ten thousand", an Alipay-style voice announcement can largely be achieved by piecing the audio together.
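The voice-splicing idea can be sketched in plain C. This is only a minimal illustration under stated assumptions, not code from ESP-Skainet: the clip names (such as "alipay_collection" and "ten_thousand"), the function build_clip_sequence, and the MAX_CLIPS constant are all hypothetical, and a "zero" clip plus clips for the digits two to nine are assumed to exist in addition to the ones listed above. The function only decides which clips to play and in what order for a whole amount from 1 to 99999. Loading and playing the PCM data is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Prefix + up to 5 digits + up to 4 place-value units + "yuan". */
#define MAX_CLIPS 11

/* Hypothetical clip names; each would map to one audio file in the file system. */
static const char *const DIGIT_CLIPS[10] = {
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine"
};

/* Place-value clips for 10000, 1000, 100, 10 and 1 (the ones place has no clip). */
static const char *const UNIT_CLIPS[5] = {
    "ten_thousand", "thousand", "hundred", "ten", NULL
};

/*
 * Fill `out` with the clip names for "<prefix> <amount> yuan".
 * Supports whole amounts from 1 to 99999. Returns the number of clips
 * written, or 0 if the amount is unsupported or `out` holds fewer than
 * MAX_CLIPS entries.
 */
size_t build_clip_sequence(unsigned amount, const char **out, size_t max_out)
{
    assert(out != NULL);
    if (amount == 0 || amount > 99999 || max_out < MAX_CLIPS) {
        return 0;
    }

    size_t n = 0;
    unsigned divisor = 10000;
    int started = 0;      /* a non-zero digit has already been emitted */
    int pending_zero = 0; /* zeros were skipped after a non-zero digit */

    out[n++] = "alipay_collection";
    for (int i = 0; i < 5; i++, divisor /= 10) {
        unsigned digit = (amount / divisor) % 10;
        if (digit == 0) {
            /* Leading zeros are ignored; inner zeros are remembered. */
            if (started) {
                pending_zero = 1;
            }
            continue;
        }
        if (pending_zero) {
            /* A run of inner zeros is read as a single "zero" (1005 -> 1000, 0, 5). */
            out[n++] = DIGIT_CLIPS[0];
            pending_zero = 0;
        }
        out[n++] = DIGIT_CLIPS[digit];
        if (UNIT_CLIPS[i] != NULL) {
            out[n++] = UNIT_CLIPS[i];
        }
        started = 1;
    }
    out[n++] = "yuan";
    return n;
}
```

Called with an amount of 1234 and an array of MAX_CLIPS entries, the function fills in the nine clips "alipay_collection", "one", "thousand", "two", "hundred", "three", "ten", "four", "yuan". A playback loop would then look up each name in the file system and send its PCM data to the audio driver in that order. Trailing zeros are dropped, so 20 becomes "two", "ten". A real announcement would probably also need decimals and the colloquial reading of 10 to 19, which this sketch does not attempt.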
