Louis Austke

A one-stop comparison of several synthetic and human voices

For the impatient: Go to the Voice Comparison Page to test the voices right away.

It's no secret that text-to-speech technology has advanced significantly in recent years, and the best synthetic voices now sound almost as natural as human ones. The big players in the field are Google, Amazon, and Microsoft, alongside smaller companies like ElevenLabs, which bills its voice generator as the most realistic on the market. In fact, the range of smaller providers is so broad that it is hard to compare the voices directly and decide for yourself which one sounds best.

Voice samples are usually buried deep in the documentation of each text-to-speech API, so forming a judgment means jumping between several sites and generating speech for the same text snippets on each one. To simplify the comparison, I collected all the voices on one page. I compiled a set of text snippets that imitate common voiceover topics and created audio files in which every voice reads these texts.

Highlighting words

I wanted the words to be highlighted karaoke-style while the text is spoken. This is visually pleasing, and it makes the text much easier to follow. So, for every audio file, I created a speech mark file that contains a list of pairs, each consisting of a word and the time at which its pronunciation starts.

Microsoft, Amazon Polly, and Google all provide this functionality, although under different names: marks, speech marks, or subtitles. Essentially, in addition to the events that deliver the generated audio, there are other events, like 'WordBoundary', that carry supplementary information such as the starting time of a word.
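
To make this concrete, here is a minimal sketch of how Amazon Polly's output could be turned into the word/time pairs described above. It assumes the speech marks were requested in JSON format, where Polly returns one JSON object per line with the time in milliseconds. The function name `parsePollySpeechMarks` and the `{ word, offset }` shape are my own choices for illustration, not part of any API.

```javascript
// Hypothetical helper: converts Amazon Polly speech marks (one JSON object
// per line) into a list of { word, offset } pairs, with offset in milliseconds.
function parsePollySpeechMarks(rawMarks) {
    return rawMarks
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0)          // skip empty lines
        .map((line) => JSON.parse(line))
        .filter((mark) => mark.type === "word")     // drop sentence, viseme, and ssml marks
        .map((mark) => ({ word: mark.value, offset: mark.time }))
}

// Sample input in Polly's line-delimited format
const sampleMarks = [
    '{"time":0,"type":"sentence","start":0,"end":12,"value":"Hello world."}',
    '{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}',
    '{"time":370,"type":"word","start":6,"end":11,"value":"world"}',
].join("\n")

console.log(parsePollySpeechMarks(sampleMarks))
```

The other providers deliver the same information in their own formats, so each one would need a small converter of this kind.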

I have a 'highlightWord' function that highlights the word in the textbox. When the audio player starts playing the speech file, this function is scheduled to run once for every word, at the time when that word starts being pronounced.

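audioPlayer.onplay = function () {
    // ...
    // Remember every timeout ID so that pending highlights can be cancelled later
    scheduledHighlightWordIDs = []
    for (const { word, offset } of speechMarks) {
        // setTimeout expects the delay in milliseconds, so offset must be in milliseconds
        const timeoutId = setTimeout(function () {
            highlightWord(word)
        }, offset)
        scheduledHighlightWordIDs.push(timeoutId)
    }
}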

I keep the timeout IDs so that I can unschedule the function, for example, when playing is interrupted.
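
Cancelling could look roughly like the sketch below. It is not the code from the page: `scheduleHighlights` is a hypothetical helper that bundles scheduling and cancelling, and the `startMs` parameter is an extra assumption so that highlights can be rescheduled from the middle of the text when playback resumes after a pause.

```javascript
// Hypothetical helper: schedules one highlight per speech mark and returns
// a function that cancels all highlights that have not run yet.
function scheduleHighlights(speechMarks, highlightWord, startMs = 0) {
    const timeoutIds = []
    for (const { word, offset } of speechMarks) {
        if (offset < startMs) continue  // this word was already spoken
        const timeoutId = setTimeout(() => highlightWord(word), offset - startMs)
        timeoutIds.push(timeoutId)
    }
    return function cancel() {
        for (const timeoutId of timeoutIds) clearTimeout(timeoutId)
        timeoutIds.length = 0
    }
}
```

In the browser, the returned function could be called from the player's 'onpause' handler, and 'audioPlayer.currentTime * 1000' could be passed as 'startMs' in 'onplay', since 'currentTime' is measured in seconds.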

Pre-fetching audio

Another problem I faced was that speech and highlighting drifted out of sync when I passed the URL of the MP3 file directly to the Audio Player. Depending on the user's connection speed, it takes a while to download the file and start playing it, but the highlighting functions are scheduled immediately. The Audio Player can notify you when it has downloaded a sufficient portion of the audio content to start playing, but this was not good enough for me. I wanted to make sure that the entire file was downloaded before getting started. So, I pre-fetched the audio content into a blob, created a URL that represents the blob object in memory, and passed it to the Audio Player.

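let blob
try {
    const response = await fetch(audioUrl)
    // fetch() only rejects on network errors, so check the HTTP status as well
    if (!response.ok) {
        throw new Error(`HTTP ${response.status}`)
    }
    // Resolves only after the entire body has been downloaded
    blob = await response.blob()
} catch (error) {
    console.error(`Cannot fetch url: ${audioUrl}`, error)
    // Without a blob there is nothing to play, so stop here
    throw error
}

// URL that refers to the blob held in memory
const audioObjectUrl = URL.createObjectURL(blob)

const audioPlayer = new Audio(audioObjectUrl)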
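
One detail worth keeping in mind with this approach: a blob stays in memory for as long as its object URL exists, which can add up on a page with many audio files. The sketch below shows one possible way to wrap the pre-fetch so that the URL can be released again. `prefetchAudio` and its `fetchFn` parameter are hypothetical names introduced for illustration, not the code used on the page.

```javascript
// Hypothetical helper: downloads the whole file, then returns an object URL
// together with a release() function that frees the blob again.
async function prefetchAudio(audioUrl, fetchFn = fetch) {
    const response = await fetchFn(audioUrl)
    if (!response.ok) {
        throw new Error(`Cannot fetch url: ${audioUrl} (HTTP ${response.status})`)
    }
    // Resolves only after the entire body has been downloaded
    const blob = await response.blob()
    const objectUrl = URL.createObjectURL(blob)
    return {
        objectUrl,
        // Call this once the audio is no longer needed
        release: () => URL.revokeObjectURL(objectUrl),
    }
}
```

The returned 'objectUrl' would be passed to 'new Audio(...)' as before, and 'release()' could be called when the user switches to another voice or text snippet.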

Adding real human voices

Eventually, I thought, why stop here? So, I added recordings from actual voiceover artists to the comparison. Now, you can compare synthesized voices with real human voices to get a firsthand feel for the current state of text-to-speech technology.

This is the link to the page: Compare Voices.

Now, it's your turn to weigh in. Which synthetic voice do you find the most realistic? In a blind test, would you be able to tell AI-generated voices apart from those of actual voiceover artists?
