In the previous chapters, we built a basic Librarian agent, enhanced it with specific skills via function calling, and created a tool to inspect, in real time, the interactions between our agent and its plugins.
In this chapter, we are going to add audio capabilities to our Librarian agent. Once we finish, our Librarian will begin its multimodality journey: we will be able to communicate with it using our voice.
Whisper
Our goal is to make the agent capable of listening to us: we will speak into the computer's microphone and get back a response from the model, just as if we had typed the text.
To accomplish this, we will use an Automatic Speech Recognition (ASR) system, in our case, Whisper from OpenAI. Although this model uses an architecture similar to a Large Language Model, it should not be defined as an LLM, as Yann LeCun states in this message. Whisper, like any ASR system, can transcribe audio input in multiple languages.
Whisper on Semantic Kernel
In November 2024, Microsoft added support for audio capabilities to Semantic Kernel. The workflow we will build is as follows:
- Record the user's audio using the computer's microphone.
- Use Whisper to convert the audio into text.
- Provide the text as the agent's input.
- Show the reply generated by the agent to the user.
Let's start by recording the user's voice from the microphone on demand. In my chatbot, I have a button to start the recording: once pressed, the recording starts, and the user must click it again to stop. For that reason, I have created two methods: `start_recording` and `stop_recording`.
```python
import os
import threading
import wave
from typing import ClassVar, Optional

import pyaudio


class AudioRecorder:
    FORMAT: ClassVar[int] = pyaudio.paInt16
    CHANNELS: ClassVar[int] = 1
    RATE: ClassVar[int] = 44100
    CHUNK: ClassVar[int] = 1024

    is_recording: bool = False
    output_filepath: str
    audio_thread: Optional[threading.Thread] = None

    def start_recording(self):
        """Start the recording on a new thread to avoid blocking the UI"""
        if not self.is_recording:
            self.is_recording = True
            self.audio_thread = threading.Thread(target=self.record_audio)
            self.audio_thread.start()

    def stop_recording(self):
        """Stop the recording (if started)"""
        if self.is_recording:
            self.is_recording = False
            if self.audio_thread is not None:
                self.audio_thread.join()

    def record_audio(self):
        """Record the audio into an output.wav file"""
        # Create the output file path
        self.output_filepath = os.path.join(os.path.dirname(__file__), "output.wav")
        self.is_recording = True
        # Open the audio input stream
        audio = pyaudio.PyAudio()
        stream = audio.open(
            format=self.FORMAT,
            channels=self.CHANNELS,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )
        frames = []
        # Read chunks while recording and append them to the list of frames
        while self.is_recording:
            data = stream.read(self.CHUNK)
            frames.append(data)
        # Stop and close the audio stream
        stream.stop_stream()
        stream.close()
        # Store the audio as a WAV file by joining all frames
        with wave.open(self.output_filepath, "wb") as wf:
            wf.setnchannels(self.CHANNELS)
            wf.setsampwidth(audio.get_sample_size(self.FORMAT))
            wf.setframerate(self.RATE)
            wf.writeframes(b"".join(frames))
        audio.terminate()
```
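As a quick illustration, here is a minimal sketch of how the recorder could be wired to a button; the `on_record_button` callback is hypothetical and stands in for whatever your UI framework provides:

```python
# Hypothetical UI wiring: the button callback toggles the recording
recorder = AudioRecorder()

def on_record_button():
    if not recorder.is_recording:
        recorder.start_recording()  # begins capturing on a background thread
    else:
        recorder.stop_recording()   # joins the thread; output.wav is now on disk
        print(f"Audio saved to {recorder.output_filepath}")
```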
With that piece of code, we can record the user's voice as a `wav` file. Now, we will use Whisper to transcribe it, so we need to add the audio service `AzureAudioToText` to Semantic Kernel. Alternatively, you can use `OpenAIAudioToText` if you want to connect directly to the OpenAI API.
```python
from semantic_kernel.connectors.ai.open_ai import AzureAudioToText

self.kernel.add_service(AzureAudioToText(
    service_id='audio_service'
))
```
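For the direct OpenAI route, the registration is analogous. A sketch, assuming your `OPENAI_API_KEY` is set in the environment and that you pass the Whisper model through the connector's usual `ai_model_id` parameter:

```python
from semantic_kernel.connectors.ai.open_ai import OpenAIAudioToText

# Assumption: credentials come from the OPENAI_API_KEY environment variable
self.kernel.add_service(OpenAIAudioToText(
    service_id='audio_service',
    ai_model_id='whisper-1',
))
```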
Once it is added to the kernel, it can be retrieved at any time.
```python
self.audio_to_text_service = self.kernel.get_service(type=AzureAudioToText)
```
The usage of the audio service is quite straightforward. First, we convert the audio file into an `AudioContent`. Then, we pass the `AudioContent` to the `get_text_content` method of the audio service:
```python
from semantic_kernel.contents import AudioContent

async def transcript_audio(self, audio_file: str) -> str:
    # Convert the WAV file into AudioContent
    audio_content = AudioContent.from_audio_file(audio_file)
    # Use the audio service to transcribe the AudioContent
    user_message = await self.audio_to_text_service.get_text_content(audio_content)
    # Return the message as text
    return user_message.text
```
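For reference, a minimal usage sketch; `chatbot` is assumed to be the object that owns the method, and the audio file is the one saved by the recorder:

```python
# Hypothetical usage: transcribe the WAV file produced by AudioRecorder
text = await chatbot.transcript_audio(recorder.output_filepath)
print(text)  # the transcription, ready to display in the chat
```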
The result returned by the method can then be displayed in the chat interface and ingested by the agent like any other user input; the combined method below does exactly that. You can check out the other chapters of this series, where I explain how to build the text-based chat.
```python
from semantic_kernel.contents import AudioContent, AuthorRole, ChatMessageContent

async def transcript_audio_and_send_message(self, audio_file: str) -> str:
    # Convert the WAV file into AudioContent
    audio_content = AudioContent.from_audio_file(audio_file)
    # Use the audio service to transcribe the AudioContent
    user_message = await self.audio_to_text_service.get_text_content(audio_content)
    # Add the transcribed text to the history as a user message
    self.history.add_message(
        ChatMessageContent(role=AuthorRole.USER, content=user_message.text)
    )
    # Invoke the agent with the updated history
    async for response in self.agent.invoke(self.history):
        # Add the agent's reply to the history
        self.history.add_message(response)
    # Return the last reply as text
    return str(response)
```
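Putting the pieces together, here is an end-to-end sketch under the assumptions above; the `Chatbot` class is hypothetical and stands in for whatever object holds the kernel, the history, and the agent:

```python
import asyncio

async def main():
    recorder = AudioRecorder()
    chatbot = Chatbot()          # hypothetical owner of kernel, history, and agent

    recorder.start_recording()   # user presses the record button
    await asyncio.sleep(5)       # stand-in for the user speaking
    recorder.stop_recording()    # user presses the button again

    reply = await chatbot.transcript_audio_and_send_message(recorder.output_filepath)
    print(reply)                 # the Librarian's answer to the spoken request

asyncio.run(main())
```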
Summary
In this chapter, we have added the ability to transform our audio into text using Whisper, and then ingest that text into the model to generate a response.
Remember that all the code is already available on my GitHub repository 🐍 PyChatbot for Semantic Kernel.
In the next chapter, we will give our Librarian a voice using a Text-to-Speech service.