Build an Audio-to-Text Conversion Tool Using Azure AI Speech SDK with Audio Transformation in C#

Introduction

In this tutorial, we will create an application that converts audio recordings into text using the Azure AI Speech SDK. The application also includes an audio transformation step, converting files from formats like .m4a to .wav to ensure compatibility with the speech recognition pipeline.

By the end of this guide, you'll have a working solution that processes an audio file, transcribes it into text, and demonstrates the potential of integrating speech recognition into your applications.

Prerequisites

Before we get started, make sure you have the following:

  1. Azure Speech SDK: Install the Microsoft.CognitiveServices.Speech NuGet package in your project.
  2. NAudio Library: Install the NAudio NuGet package for audio file conversion.
  3. Azure Speech API Key and Endpoint: Create an Azure Speech resource in the Azure portal and obtain the API key and endpoint. Store them in the AZURE_OPENAI_Speech_API_KEY and AZURE_OPENAI_Speech_API_EndPoint environment variables, which the service class below reads.
  4. Visual Studio or any C# IDE: Ensure you have a working C# environment set up for development.
  5. A voice recording: Record your own voice and save it (for example, as an .m4a file) so you have something to transcribe.

Step 1: Setting Up the Project

1. Open your IDE and create a new Console App project.
2. Add the required NuGet packages:

Microsoft.CognitiveServices.Speech
NAudio

3. Create a new class file for audio conversion (AudioConverter.cs) and another for the Azure Speech SDK integration (AzureOpenAISpeechService.cs).

Step 2: Audio Conversion (M4A to WAV)

Why Convert Audio Formats?
The Azure Speech SDK expects audio in a supported format; for raw input streams it defaults to 16 kHz, 16-bit, mono PCM. To ensure compatibility, we'll use the NAudio library to convert .m4a files to .wav in exactly that format.

using NAudio.Wave;

public static class AudioConverter
{
    public static bool ConvertToWav(string inputPath, string outputPath)
    {
        try
        {
            // MediaFoundationReader decodes .m4a (and other formats) via
            // Windows Media Foundation, so this converter is Windows-only.
            using var reader = new MediaFoundationReader(inputPath);
            var outFormat = new WaveFormat(16000, 16, 1); // 16 kHz, mono, 16-bit PCM
            using var resampler = new MediaFoundationResampler(reader, outFormat)
            {
                ResamplerQuality = 60 // 1-60; 60 is the highest quality
            };
            WaveFileWriter.CreateWaveFile(outputPath, resampler);
            return true;
        }
        catch (Exception ex)
        {
            Console.WriteLine("NAudio conversion failed:");
            Console.WriteLine(ex.Message);
            return false;
        }
    }
}
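A quick way to exercise the converter on its own (the file paths here are placeholders):

// Convert a recording; ConvertToWav writes the error to the console
// and returns false if the conversion fails.
bool ok = AudioConverter.ConvertToWav(@"C:\temp\sample.m4a", @"C:\temp\sample.wav");
Console.WriteLine(ok ? "Converted successfully." : "Conversion failed.");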

Step 3: Speech Recognition Using Azure Speech SDK

Here we define a service class that interacts with the Azure Speech SDK. It uses the converted .wav file to recognize speech and returns the transcribed text.

using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using System.Text;

public class AzureOpenAISpeechService
{
    private readonly string _apiKey;
    private readonly string _endpoint;

    public AzureOpenAISpeechService()
    {
        _apiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_Speech_API_KEY");
        _endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_Speech_API_EndPoint");
    }

    // Recognizes speech from a short audio stream (single utterance).
    // Uses RecognizeOnceAsync, which is suitable for short audio clips or single sentences.
    // Returns the recognized text or an error message.
    public async Task<string> RecognizeSpeechFromAudioStream(Stream audioStream, string language = "en-US")
    {
        if (string.IsNullOrEmpty(_apiKey) || string.IsNullOrEmpty(_endpoint))
            throw new InvalidOperationException("Speech key or endpoint is not set in environment variables.");

        var speechConfig = SpeechConfig.FromEndpoint(new Uri(_endpoint), _apiKey);
        speechConfig.SpeechRecognitionLanguage = language;

        if (audioStream.CanSeek)
            audioStream.Position = 0;

        using var audioInput = AudioConfig.FromStreamInput(new BinaryAudioStreamReader(audioStream));
        using var recognizer = new SpeechRecognizer(speechConfig, audioInput);

        var result = await recognizer.RecognizeOnceAsync();

        return result.Reason switch
        {
            ResultReason.RecognizedSpeech => result.Text,
            ResultReason.NoMatch => "No speech could be recognized.",
            ResultReason.Canceled => $"Recognition canceled: {CancellationDetails.FromResult(result).Reason}",
            _ => "Unknown recognition result."
        };
    }

    // Recognizes speech from a long or continuous audio stream.
    // Uses StartContinuousRecognitionAsync and event handlers to process audio in real time.
    // Suitable for long recordings, meetings, or when the audio contains multiple sentences or speakers.
    // Collects all recognized text and returns it as a single string after recognition completes.
    public async Task<string> RecognizeSpeechFromAudioStreamForLongAudio(Stream audioStream, string language = "en-US")
    {
        if (string.IsNullOrEmpty(_apiKey) || string.IsNullOrEmpty(_endpoint))
            throw new InvalidOperationException("Speech key or endpoint is not set in environment variables.");

        var speechConfig = SpeechConfig.FromEndpoint(new Uri(_endpoint), _apiKey);
        speechConfig.SpeechRecognitionLanguage = language;

        if (audioStream.CanSeek)
            audioStream.Position = 0;

        using var audioInput = AudioConfig.FromStreamInput(new BinaryAudioStreamReader(audioStream));
        using var recognizer = new SpeechRecognizer(speechConfig, audioInput);

        var recognizedText = new StringBuilder();
        var stopRecognition = new TaskCompletionSource<bool>();

        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                recognizedText.AppendLine(e.Result.Text);
            }
            else if (e.Result.Reason == ResultReason.NoMatch)
            {
                // Optionally handle no match
            }
        };

        recognizer.Canceled += (s, e) =>
        {
            stopRecognition.TrySetResult(true);
        };

        recognizer.SessionStopped += (s, e) =>
        {
            stopRecognition.TrySetResult(true);
        };

        await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
        await stopRecognition.Task.ConfigureAwait(false);
        await recognizer.StopContinuousRecognitionAsync().ConfigureAwait(false);

        return recognizedText.ToString().Trim();
    }

    // Wraps a .NET Stream as a pull-mode audio input for the Speech SDK.
    // Note: FromStreamInput without an explicit AudioStreamFormat assumes raw
    // 16 kHz, 16-bit, mono PCM, which matches AudioConverter's output.
    private class BinaryAudioStreamReader : PullAudioInputStreamCallback
    {
        private readonly Stream _stream;
        public BinaryAudioStreamReader(Stream stream) => _stream = stream;
        public override int Read(byte[] dataBuffer, uint size) => _stream.Read(dataBuffer, 0, (int)size);
        public override void Close() => _stream.Dispose();
    }
}

The AzureOpenAISpeechService exposes two recognition methods: RecognizeSpeechFromAudioStream and RecognizeSpeechFromAudioStreamForLongAudio. Choosing the right one depends on the nature and length of your audio stream.

Method 1: RecognizeSpeechFromAudioStream

  • Purpose: This method is designed for short audio clips or single sentences.
  • Usage: Use this method when you have a brief audio file, such as a single utterance or a short command.
  • Functionality: It uses the RecognizeOnceAsync function, which is optimized for processing short audio streams efficiently.
  • Output: Returns the recognized text or an error message if recognition fails.

Method 2: RecognizeSpeechFromAudioStreamForLongAudio

  • Purpose: This method is suitable for long or continuous audio recordings.
  • Usage: Use this method when dealing with longer audio files, such as meetings, lectures, or any audio that contains multiple sentences or speakers.
  • Functionality: It utilizes StartContinuousRecognitionAsync with event handlers to process audio in real-time, collecting all recognized text as the audio plays.
  • Output: Returns all recognized text as a single string after the recognition process is complete.
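For example, here's a minimal sketch that runs the long-audio method against the converted file from Step 2 (the path is a placeholder, and the environment variables from the Prerequisites must be set):

using System;
using System.IO;
using System.Threading.Tasks;

class LongAudioDemo
{
    static async Task Main()
    {
        // Open the 16 kHz, mono, 16-bit PCM .wav produced by AudioConverter.
        using var fileStream = File.OpenRead(@"C:\temp\meeting_converted.wav");

        var speechService = new AzureOpenAISpeechService();

        // Continuous recognition collects every recognized utterance until the
        // stream ends (SessionStopped), then returns them as one string.
        string transcript = await speechService.RecognizeSpeechFromAudioStreamForLongAudio(fileStream, "en-US");

        Console.WriteLine(transcript);
    }
}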

Step 4: Bringing It All Together in Program.cs

Here's the complete program that combines audio conversion and speech recognition. It uses the short-clip RecognizeSpeechFromAudioStream method; for longer recordings, swap in RecognizeSpeechFromAudioStreamForLongAudio as described above:

using System;
using System.IO;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        string inputFile = @"C:\Users\User\Downloads\Recording for Demo.m4a";
        string outputFile = @"C:\Users\User\Downloads\RecordingForDemo_converted.wav";

        Console.WriteLine("Azure Speech Recognition Demo");
        Console.WriteLine($"Input file: {inputFile}");
        Console.WriteLine($"Output (converted) file: {outputFile}");
        Console.WriteLine();

        try
        {
            Console.WriteLine("Step 1: Starting audio conversion...");
            bool success = AudioConverter.ConvertToWav(inputFile, outputFile);

            if (success)
            {
                Console.WriteLine("Audio conversion successful.");
                Console.WriteLine("Step 2: Starting speech recognition...");

                using var fileStream = File.OpenRead(outputFile);

                var speechService = new AzureOpenAISpeechService();
                string recognizedText = await speechService.RecognizeSpeechFromAudioStream(fileStream, "en-US");

                Console.WriteLine("Speech recognition completed. Result:");
                Console.WriteLine(string.IsNullOrWhiteSpace(recognizedText) ? "[No text recognized]" : recognizedText);
            }
            else
            {
                Console.WriteLine("Audio conversion failed. Please check the input file.");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("An error occurred:");
            Console.WriteLine(ex.ToString());
        }
        finally
        {
            Console.WriteLine("Process completed. Press any key to exit.");
            Console.ReadKey();
        }
    }
}

Step 5: Run the Application

  1. Update inputFile in Program.cs with the path to your .m4a file and make sure the output path is writable.
  2. Build and run the application.
  3. View the transcribed text in the console.

Final Output

Here's sample output for a .m4a file containing the phrase "Hello everyone, I hope you enjoyed this tutorial".
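Assuming the conversion and recognition both succeed, the console output (as written by Program.cs) looks something like this; the transcription line comes from the recognizer, so it may differ slightly:

Azure Speech Recognition Demo
Input file: C:\Users\User\Downloads\Recording for Demo.m4a
Output (converted) file: C:\Users\User\Downloads\RecordingForDemo_converted.wav

Step 1: Starting audio conversion...
Audio conversion successful.
Step 2: Starting speech recognition...
Speech recognition completed. Result:
Hello everyone, I hope you enjoyed this tutorial.
Process completed. Press any key to exit.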

Challenge for You!

Combine the HTML to DOCX Conversion Tool from my previous article with this audio-to-text solution. Create an application that:

  1. Converts audio to text using the Azure AI Speech SDK.
  2. Generates a polished DOCX document from the recognized text using the OpenXML SDK.

This integrated solution could be used to transcribe meetings, presentations, or interviews into shareable, professional documents. Share your implementation and insights in the comments!
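As a starting point, here's a minimal sketch of the DOCX half using the OpenXML SDK. It assumes the DocumentFormat.OpenXml NuGet package and produces a bare-bones document, not the polished output from the previous article:

using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TranscriptDocWriter
{
    // Writes the recognized text into a new .docx, one paragraph per line.
    public static void WriteDocx(string outputPath, string recognizedText)
    {
        using var doc = WordprocessingDocument.Create(outputPath, WordprocessingDocumentType.Document);
        var mainPart = doc.AddMainDocumentPart();
        var body = new Body();

        foreach (var line in recognizedText.Split(Environment.NewLine, StringSplitOptions.RemoveEmptyEntries))
        {
            body.Append(new Paragraph(new Run(new Text(line))));
        }

        mainPart.Document = new Document(body);
        mainPart.Document.Save();
    }
}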

Conclusion

Congratulations! You've successfully built an audio-to-text conversion tool using the Azure AI Speech SDK. This project demonstrates how to integrate powerful AI services into real-world applications. The next step is to extend this functionality into complete workflows, combining speech recognition with document generation tools.

Reference

How to recognize speech

Love C#
