A few days ago I had to record some voice prompts for a customer service call queue that I was configuring in one of our Microsoft Teams enterprise voice projects. Something like "Thank you for calling X, please hold..." I figured it would be nice to have Azure's artificial-intelligence-powered speech service convert my text input to an audio file. Turns out it's easier than I thought it would be.
First of all, we need an Azure subscription where we can deploy our Speech Services instance. If you don't have an Azure subscription, you can sign up for a trial account using the links below. If you already have a subscription, you can easily create a free Speech Services account using the following commands from Azure Cloud Shell:
az group create -n devto-speech -l WestEurope
az cognitiveservices account create -n devto-speech -g devto-speech --kind SpeechServices --sku F0 -l WestEurope --yes
The account is now created and we can start using it right away. To authenticate our calls from PowerShell, we need an API key. Again, we can use Azure Cloud Shell to retrieve it:
az cognitiveservices account keys list -n devto-speech -g devto-speech
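If you'd rather not copy the key out of the JSON output by hand, the CLI's global `--query` and `-o tsv` options can return just the first key as plain text. This is a small sketch that assumes a PowerShell session with the Azure CLI available (for example the PowerShell flavor of Cloud Shell); the variable name `$key` is only illustrative.

```powershell
# Ask the CLI for only the "key1" property and print it as plain text
# (no JSON, no surrounding quotes), then keep it in a variable.
$key = az cognitiveservices account keys list -n devto-speech -g devto-speech --query key1 -o tsv

# Quick sanity check without printing the secret itself.
"Retrieved a key with $($key.Length) characters"
```

The variable can then be passed wherever the API key is expected instead of pasting the value.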
The speech service provides a very well documented API that can easily be called using PowerShell's native Invoke-RestMethod command. The code is already documented on Microsoft Docs (see link below); I wrapped it into a module and uploaded it to the PowerShell Gallery.
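To give a rough idea of what calling the API with Invoke-RestMethod looks like, here is a minimal sketch of the two basic REST calls: exchanging the API key for a token and listing the available voices. It is not the module's actual code. The endpoint URLs follow the Speech service REST documentation, the region `westeurope` is assumed to match the account created above, and the variable names are my own.

```powershell
# Assumptions: the account lives in West Europe and $key holds your API key.
$region = 'westeurope'
$key    = '<yourapikey>'

# Exchange the API key for a short-lived bearer token.
# The service returns the token as plain text.
$tokenRequest = @{
    Method  = 'Post'
    Uri     = "https://$region.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    Headers = @{ 'Ocp-Apim-Subscription-Key' = $key }
}
$token = Invoke-RestMethod @tokenRequest

# Use the token to ask the text-to-speech endpoint for its voice list.
$voicesRequest = @{
    Method  = 'Get'
    Uri     = "https://$region.tts.speech.microsoft.com/cognitiveservices/voices/list"
    Headers = @{ Authorization = "Bearer $token" }
}
$voices = Invoke-RestMethod @voicesRequest

# Each entry describes one voice; show a few useful columns.
$voices | Select-Object ShortName, Locale, Gender | Format-Table
```

Tokens issued this way expire after a few minutes (ten, according to the documentation), so a long-running script would need to request a fresh one from time to time.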
You can install the module using the following command; it works on both Windows PowerShell and PowerShell 7.
Before we can call any of the speech service's API endpoints, we have to use the API key to get a token and store it in a variable for later use. The function in the following example calls the
Get-SpeechToken -Key <yourapikey> | Save-SpeechToken
Now we should have a token and be able to get a list of available voices using Get-SpeechVoicesList | Format-Table. Note that the function has a parameter -Token that accepts a token as retrieved by Get-SpeechToken. If that parameter is omitted, it checks the value of the variable that's created by
And finally we can convert some input text to speech using one of the voices from the list:
Convert-TextToSpeech -Voice en-US-JessaNeural -Text "Hi Tom, I'm Jessa from Azure!" -Path jessa.mp3
Convert-TextToSpeech -Voice en-GB-HarryNeural -Text "Hi Tom, I'm Harry from Azure!" -Path harry.mp3
You can find a lot of information about the speech service in the links below. Be sure to check out the SSML structure to see how you can customize the voices, introduce pauses into the audio file, and much more.
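As an illustration of the SSML idea, here is a hedged sketch that sends a small SSML document with a half-second pause straight to the REST endpoint and saves the result as an MP3. It deliberately bypasses the module, because I'm not assuming anything about which parameters its functions accept. The region, the voice name `en-US-JennyNeural`, the user agent string, and the output file name are all assumptions; pick a voice from your own voice list and supply a valid token.

```powershell
# Assumptions: West Europe account, and $token holds a valid bearer token.
$region = 'westeurope'
$token  = '<yourtoken>'

# SSML document: one voice, with a 500 ms pause between two sentences.
# A single-quoted here-string keeps PowerShell from expanding anything inside.
$ssml = @'
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    Thank you for calling. <break time="500ms"/> Please hold.
  </voice>
</speak>
'@

# POST the SSML to the text-to-speech endpoint and write the audio to disk.
$request = @{
    Method      = 'Post'
    Uri         = "https://$region.tts.speech.microsoft.com/cognitiveservices/v1"
    Headers     = @{
        Authorization              = "Bearer $token"
        'X-Microsoft-OutputFormat' = 'audio-16khz-128kbitrate-mono-mp3'
    }
    ContentType = 'application/ssml+xml'
    UserAgent   = 'devto-speech-sketch'   # illustrative name, any short string
    Body        = $ssml
    OutFile     = 'hold.mp3'
}
Invoke-RestMethod @request
```

Changing the `time` attribute on the `break` element, or adding further elements inside `voice`, is how the pause length and delivery can be tuned.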
You can find the code for the module on my GitHub. Please let me know if you find it useful, and feel free to submit a pull request with your optimizations :)