Introduction
Azure AI Speech is a service that provides speech-related functionality, including:
- A speech-to-text API that enables you to implement speech recognition (converting audible spoken words into text).
- A text-to-speech API that enables you to implement speech synthesis (converting text into audible speech).
In this exercise, you’ll use both of these APIs to implement a speaking clock application.
Create an Azure AI Speech resource
- Go to the Azure portal and select Create a resource.
- In the top search field, search for Speech service. Select it from the list, then select Create.
Provision the resource using the following settings:
- Subscription: Your Azure subscription.
- Resource group: Choose or create a resource group.
- Region: Choose any available region.
- Name: Enter a unique name.
- Pricing tier: Select F0 (free), or S (standard) if F0 is not available.
- Select Review + create, then select Create to provision the resource.
- Wait for deployment to complete, and then go to the deployed resource.
- View the Keys and Endpoint page in the Resource Management section. You will need the information on this page later in the exercise.
Prepare and configure the speaking clock app
- Open VS Code and, in the integrated terminal, enter the following command to clone the GitHub repo for this exercise:
git clone https://github.com/microsoftlearning/mslearn-ai-language mslearn-ai-language
- After the repo has been cloned, navigate to the folder containing the speaking clock application code files:
cd mslearn-ai-language/Labfiles/07-speech/C-Sharp/speaking-clock
- Enter the following commands to install the libraries you’ll use:
dotnet add package Azure.Identity
dotnet add package Azure.AI.Projects --prerelease
dotnet add package Microsoft.CognitiveServices.Speech --version 1.42.0
- In the Explorer pane, expand the speaking-clock folder and open the appsettings.json file.
- In the configuration file, replace the your_project_api_key and your_project_location placeholders with the API key and location for your project (copied from the Keys and Endpoint page you left open in the portal).
- After you’ve replaced the placeholders, save your changes with CTRL+S or Right-click > Save.
Add code to use the Azure AI Speech SDK
- Open the Program.cs file. At the top of the code file, under the existing namespace references, find the comment Import namespaces. Then, under this comment, add the following code to import the namespaces you will need to use the Azure AI Speech SDK:
// Import namespaces
using Azure.Identity;
using Azure.AI.Projects;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
- In the Main function, under the comment Get config settings, note that the code loads the project key and location you defined in the configuration file.
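For reference, the lab code typically loads these settings with the .NET configuration builder. Here’s a minimal sketch; the key names ProjectKey and ProjectLocation are assumptions, so match them to the entries you actually see in appsettings.json:
using Microsoft.Extensions.Configuration; // configuration APIs (Microsoft.Extensions.Configuration.* packages)

// Get config settings (sketch: the key names below are assumptions;
// use the names that appear in the lab's appsettings.json)
IConfigurationBuilder builder = new ConfigurationBuilder().AddJsonFile("appsettings.json");
IConfigurationRoot configuration = builder.Build();
string projectKey = configuration["ProjectKey"];
string location = configuration["ProjectLocation"];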
- Under the comment Configure speech service, add the following code to use the AI Services key and your project’s region to configure your connection to the Azure AI Services Speech endpoint:
// Configure speech service
speechConfig = SpeechConfig.FromSubscription(projectKey, location);
Console.WriteLine("Ready to use speech service in " + speechConfig.Region);
- Save your changes (CTRL+S), but leave the code editor open.
Run the app
So far, the app doesn’t do anything other than connect to your Azure AI Speech service, but it’s useful to run it and check that it works before adding speech functionality.
In the command line, enter the following command to run the speaking clock app:
dotnet run
You can ignore any warnings about using the await operator in asynchronous methods - we’ll fix that later. The code should display the region of the speech service resource the application will use. A successful run indicates that the app has connected to your Azure AI Speech resource.
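For example, if your resource is in the East US region, a successful run prints something like the following (the exact value depends on the region you chose):
Ready to use speech service in eastus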
NOTE: When I first ran the app, I encountered this error:
OLAMOBILEs-MacBook-Pro:speaking-clock olamobile$ dotnet run
/Users/olamobile/AI-pract/mslearn-ai-language/Labfiles/07-speech/C-Sharp/speaking-clock/Program.cs(6,13): error CS0234: The type or namespace name 'Identity' does not exist in the namespace 'Azure' (are you missing an assembly reference?)
/Users/olamobile/AI-pract/mslearn-ai-language/Labfiles/07-speech/C-Sharp/speaking-clock/Program.cs(7,16): error CS0234: The type or namespace name 'Projects' does not exist in the namespace 'Azure.AI' (are you missing an assembly reference?)
The build failed. Fix the build errors and run again.
- To fix this, I commented out the first two using directives under // Import namespaces (Azure.Identity and Azure.AI.Projects). They aren’t needed by the speaking clock code in this walkthrough, which connects to the Speech service directly with SpeechConfig.FromSubscription using the key and region from appsettings.json.
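The import section at the top of Program.cs then looks like this:
// Import namespaces
// using Azure.Identity;      // not needed for this walkthrough
// using Azure.AI.Projects;   // not needed for this walkthrough
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;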
Add code to recognize speech
Now that you have a SpeechConfig for the speech service in your project’s Azure AI Services resource, you can use the Speech-to-text API to recognize speech and transcribe it to text.
In this procedure, the speech input is captured from an audio file.
- In the Main function, note that the code uses the TranscribeCommand function to accept spoken input. Then, in the TranscribeCommand function, under the comment Configure speech recognition, add the following code to create a SpeechRecognizer client that can be used to recognize and transcribe speech from an audio file:
// Configure speech recognition
string audioFile = "time.wav";
using AudioConfig audioConfig = AudioConfig.FromWavFileInput(audioFile);
using SpeechRecognizer speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);
- In the TranscribeCommand function, under the comment Process speech input, add the following code to listen for spoken input, being careful not to replace the code at the end of the function that returns the command:
// Process speech input
Console.WriteLine("Listening...");
SpeechRecognitionResult speech = await speechRecognizer.RecognizeOnceAsync();
if (speech.Reason == ResultReason.RecognizedSpeech)
{
command = speech.Text;
Console.WriteLine(command);
}
else
{
Console.WriteLine(speech.Reason);
if (speech.Reason == ResultReason.Canceled)
{
var cancellation = CancellationDetails.FromResult(speech);
Console.WriteLine(cancellation.Reason);
Console.WriteLine(cancellation.ErrorDetails);
}
}
- Save your changes (CTRL+S), and then in the command line below the code editor, enter the following command to run the program:
dotnet run
Synthesize speech
Your speaking clock application accepts spoken input, but it doesn’t actually speak! Let’s fix that by adding code to synthesize speech.
Because the environment you’re working in might not have a speaker (for example, the cloud shell), we’ll direct the synthesized speech output to a file.
- In the Main function for your program, note that the code uses the TellTime function to tell the user the current time.
- In the TellTime function, under the comment Configure speech synthesis, add the following code to create a SpeechSynthesizer client that can be used to generate spoken output:
// Configure speech synthesis
var outputFile = "output.wav";
speechConfig.SpeechSynthesisVoiceName = "en-GB-RyanNeural";
using var audioConfig = AudioConfig.FromWavFileOutput(outputFile);
using SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
- In the TellTime function, under the comment Synthesize spoken output, add the following code to generate spoken output, being careful not to replace the code at the end of the function that prints the response:
// Synthesize spoken output
SpeechSynthesisResult speak = await speechSynthesizer.SpeakTextAsync(responseText);
if (speak.Reason != ResultReason.SynthesizingAudioCompleted)
{
Console.WriteLine(speak.Reason);
}
else
{
Console.WriteLine("Spoken output saved in " + outputFile);
}
- Save your changes (CTRL+S), and then in the command line below the code editor, enter the following command to run the program:
dotnet run
Review the output from the application, which should indicate that the spoken output was saved in a file.
If you have a media player capable of playing .wav audio files, download the output.wav file from your app folder and play it (in the cloud shell, use the Upload/Download files button in the toolbar; locally, you can open it straight from the project folder):
/home/user/mslearn-ai-language/Labfiles/07-speech/C-Sharp/speaking-clock/output.wav
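As an optional aside, if you’re curious which voices you could use instead of en-GB-RyanNeural, the Speech SDK can list the voices available to your resource. This isn’t part of the lab steps; a minimal sketch, reusing the same speechConfig as above:
// Optional: list the en-GB neural voices available to your Speech resource
using var voiceLister = new SpeechSynthesizer(speechConfig, null as AudioConfig);
SynthesisVoicesResult voices = await voiceLister.GetVoicesAsync("en-GB");
foreach (VoiceInfo voice in voices.Voices)
{
    Console.WriteLine(voice.ShortName);
}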
Use Speech Synthesis Markup Language
Speech Synthesis Markup Language (SSML) enables you to customize the way your speech is synthesized using an XML-based format.
- In the TellTime function, replace all of the current code under the comment Synthesize spoken output with the following code (leave the code under the comment Print the response):
// Synthesize spoken output
string responseSsml = $@"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
<voice name='en-GB-LibbyNeural'>
{responseText}
<break strength='weak'/>
Time to end this lab!
</voice>
</speak>";
SpeechSynthesisResult speak = await speechSynthesizer.SpeakSsmlAsync(responseSsml);
if (speak.Reason != ResultReason.SynthesizingAudioCompleted)
{
Console.WriteLine(speak.Reason);
}
else
{
Console.WriteLine("Spoken output saved in " + outputFile);
}
- Save your changes (CTRL+S), return to the integrated terminal for the speaking-clock folder, and enter the following command to run the program:
dotnet run
Review the output from the application, which should indicate that the spoken output was saved in a file.
Once again, if you have a media player capable of playing .wav audio files, download the output.wav file from your app folder (in the cloud shell, use the Upload/Download files button in the toolbar) and play it:
/home/user/mslearn-ai-language/Labfiles/07-speech/C-Sharp/speaking-clock/output.wav
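SSML can do more than pick a voice and insert pauses. For example, the prosody element adjusts speaking rate and pitch. As an optional experiment (not part of the lab steps), you could replace the markup above with a variation like this, which slows down the time announcement:
// Optional variation: wrap the response in <prosody> to slow it down
string responseSsml = $@"
    <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
        <voice name='en-GB-LibbyNeural'>
            <prosody rate='slow'>
                {responseText}
            </prosody>
            <break strength='weak'/>
            Time to end this lab!
        </voice>
    </speak>";
SpeechSynthesisResult speak = await speechSynthesizer.SpeakSsmlAsync(responseSsml);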
What if you have a mic and speaker?
In this exercise, you used audio files for the speech input and output. Let’s see how the code can be modified to use audio hardware.
Using speech recognition with a microphone
If you have a mic, you can use the following code to capture spoken input for speech recognition:
// Configure speech recognition
using AudioConfig audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using SpeechRecognizer speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);
Console.WriteLine("Speak now...");
SpeechRecognitionResult speech = await speechRecognizer.RecognizeOnceAsync();
if (speech.Reason == ResultReason.RecognizedSpeech)
{
command = speech.Text;
Console.WriteLine(command);
}
else
{
Console.WriteLine(speech.Reason);
if (speech.Reason == ResultReason.Canceled)
{
var cancellation = CancellationDetails.FromResult(speech);
Console.WriteLine(cancellation.Reason);
Console.WriteLine(cancellation.ErrorDetails);
}
}
Using speech synthesis with a speaker
If you have a speaker, you can use the following code to synthesize speech:
var now = DateTime.Now;
string responseText = "The time is " + now.Hour.ToString() + ":" + now.Minute.ToString("D2");
// Configure speech synthesis
speechConfig.SpeechSynthesisVoiceName = "en-GB-RyanNeural";
using var audioConfig = AudioConfig.FromDefaultSpeakerOutput();
using SpeechSynthesizer speechSynthesizer = new SpeechSynthesizer(speechConfig, audioConfig);
// Synthesize spoken output
SpeechSynthesisResult speak = await speechSynthesizer.SpeakTextAsync(responseText);
if (speak.Reason != ResultReason.SynthesizingAudioCompleted)
{
Console.WriteLine(speak.Reason);
}
You’ve just transformed text into speech and speech into understanding—turning Azure AI’s capabilities into a functional talking clock. But this is just the beginning. Imagine applying these same techniques to build voice assistants, interactive IVR systems, or even accessibility tools that give your applications a voice.
Thanks for staying till the end