
Using PowerShell and Azure Cognitive Services to convert text to speech

Tom Torggler

A few days ago I had to record some voice prompts for a customer service call queue that I was configuring in one of our Microsoft Teams enterprise voice projects. Something like "Thank you for calling X, please hold..." I figured it would be nice to have Azure's artificial-intelligence-powered speech service convert my text input to an audio file. Turns out it's easier than I thought it would be.

Azure Cognitive Speech Service

First of all, we need an Azure subscription in which we can deploy our Speech Services instance. If you don't have an Azure subscription, you can sign up for a trial account using the links below. If you already have a subscription, you can create a free-tier (F0) Speech Services account using the following commands from Azure Cloud Shell:

az group create -n devto-speech -l WestEurope
az cognitiveservices account create -n devto-speech -g devto-speech --kind SpeechServices --sku F0 -l WestEurope --yes

Now that the account is created, we can start using it right away. To authenticate our calls from PowerShell, we need an API key. Again, we can use Azure Cloud Shell to retrieve it:

az cognitiveservices account keys list -n devto-speech -g devto-speech

PowerShell and REST API

The speech service provides a well-documented REST API that can easily be called using PowerShell's native Invoke-RestMethod cmdlet. The code is already documented on Microsoft Docs (see link below); I wrapped it into a module and uploaded it to the PowerShell Gallery.

You can install the module using the following command. It works on both Windows PowerShell and PowerShell 7.

Install-Module PSSpeech

Before we can call any of the speech service's API endpoints, we have to use the API key to get a token and store it in a variable for later use. The function in the following example calls the /issueToken endpoint:

Get-SpeechToken -Key <yourapikey> | Save-SpeechToken
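For readers curious what happens under the hood, here is a minimal sketch of the raw call, assuming the regional endpoint format from Microsoft's documentation and the WestEurope region used above. The function names New-SpeechTokenRequest and Get-RawSpeechToken are made up for this illustration and are not part of the PSSpeech module.

```powershell
# Sketch only: New-SpeechTokenRequest and Get-RawSpeechToken are hypothetical
# helper names for illustration, not functions from the PSSpeech module.

function New-SpeechTokenRequest {
    param(
        [Parameter(Mandatory)] [string] $Key,
        [string] $Region = 'westeurope'   # assumption: the region the account was created in
    )
    # Build a parameter table that can be splatted onto Invoke-RestMethod.
    @{
        Method  = 'Post'
        Uri     = "https://$Region.api.cognitive.microsoft.com/sts/v1.0/issueToken"
        Headers = @{ 'Ocp-Apim-Subscription-Key' = $Key }
    }
}

function Get-RawSpeechToken {
    param(
        [Parameter(Mandatory)] [string] $Key,
        [string] $Region = 'westeurope'
    )
    $request = New-SpeechTokenRequest -Key $Key -Region $Region
    # The endpoint answers with the token as plain text.
    Invoke-RestMethod @request
}
```

Calling `$token = Get-RawSpeechToken -Key <yourapikey>` should return the token string. According to the service documentation, a token is only valid for about ten minutes, so longer scripts need to request a fresh one.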

Now we should have a token and be able to get a list of available voices using Get-SpeechVoicesList | Format-Table. Note that the function has a parameter -Token that accepts a token as retrieved by Get-SpeechToken. If that parameter is omitted, it checks the value of the variable that's created by Save-SpeechToken.
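The voices list can also be requested directly. The following is a rough sketch, assuming a valid token string and the WestEurope region; New-SpeechVoicesRequest and Get-RawSpeechVoices are hypothetical names used only here, and the property names in the usage comment (ShortName, Gender, Locale) are what I would expect the service to return rather than something guaranteed by the module.

```powershell
# Sketch only: hypothetical helper names, not functions from the PSSpeech module.

function New-SpeechVoicesRequest {
    param(
        [Parameter(Mandatory)] [string] $Token,
        [string] $Region = 'westeurope'   # assumption: the region the account was created in
    )
    # The token is passed as a bearer token in the Authorization header.
    @{
        Method  = 'Get'
        Uri     = "https://$Region.tts.speech.microsoft.com/cognitiveservices/voices/list"
        Headers = @{ Authorization = "Bearer $Token" }
    }
}

function Get-RawSpeechVoices {
    param(
        [Parameter(Mandatory)] [string] $Token,
        [string] $Region = 'westeurope'
    )
    $request = New-SpeechVoicesRequest -Token $Token -Region $Region
    # The response is JSON, which Invoke-RestMethod converts to objects.
    Invoke-RestMethod @request
}

# Usage (needs a valid token in $token):
# Get-RawSpeechVoices -Token $token | Where-Object Locale -eq 'en-GB' | Format-Table ShortName, Gender, Locale
```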

And finally we can convert some input text to speech using one of the voices from the list:

Convert-TextToSpeech -Voice en-US-JessaNeural -Text "Hi Tom, I'm Jessa from Azure!" -Path jessa.mp3
Convert-TextToSpeech -Voice en-GB-HarryNeural -Text "Hi Tom, I'm Harry from Azure!" -Path harry.mp3

You can find a lot of information about the speech service in the links below. Be sure to check out the SSML structure to see how you can customize the voices, introduce pauses into the audio file, and much more.
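To give an idea of what SSML looks like, here is a small sketch that builds a document with a pause between sentences and posts it to the text-to-speech endpoint directly. New-Ssml and Send-Ssml are hypothetical helpers written for this post, not module functions; the region, the output format, and the user agent string are assumptions you may need to adjust.

```powershell
# Sketch only: New-Ssml and Send-Ssml are hypothetical helpers, not part of PSSpeech.

function New-Ssml {
    param(
        [Parameter(Mandatory)] [string] $Voice,        # e.g. en-GB-HarryNeural
        [Parameter(Mandatory)] [string[]] $Sentence,   # spoken in the given order
        [int] $PauseMs = 500                           # pause inserted between sentences
    )
    # The language tag is the first two parts of the voice name, e.g. en-GB.
    $lang = ($Voice -split '-')[0..1] -join '-'
    # Escape XML special characters so arbitrary text keeps the document well-formed.
    $escaped = $Sentence | ForEach-Object { [System.Security.SecurityElement]::Escape($_) }
    $body = $escaped -join "<break time='${PauseMs}ms'/>"
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='$lang'>" +
        "<voice name='$Voice'>$body</voice></speak>"
}

function Send-Ssml {
    param(
        [Parameter(Mandatory)] [string] $Token,
        [Parameter(Mandatory)] [string] $Ssml,
        [Parameter(Mandatory)] [string] $Path,
        [string] $Region = 'westeurope'   # assumption: the region the account was created in
    )
    $headers = @{
        Authorization              = "Bearer $Token"
        'X-Microsoft-OutputFormat' = 'audio-16khz-128kbitrate-mono-mp3'
    }
    # The audio comes back as the response body and is written straight to $Path.
    Invoke-RestMethod -Method Post -Uri "https://$Region.tts.speech.microsoft.com/cognitiveservices/v1" `
        -Headers $headers -ContentType 'application/ssml+xml' -UserAgent 'devto-speech-demo' `
        -Body $Ssml -OutFile $Path
}

# Usage (needs a valid token in $token):
# $ssml = New-Ssml -Voice en-GB-HarryNeural -Sentence 'Thank you for calling.', 'Please hold.' -PauseMs 750
# Send-Ssml -Token $token -Ssml $ssml -Path queue.mp3
```

Keeping the SSML builder separate from the HTTP call makes it easy to inspect the generated markup before spending any quota on synthesis.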

You can find the code for the module on my GitHub. Please let me know if you find it useful, and feel free to submit a pull request with your optimizations :)

Tom

Links
