David Sola
Chatbot with Semantic Kernel - Part 4: Whisper 👂

In the previous chapters, we built a basic Librarian agent, enhanced it with specific skills via function calling, and created a tool to inspect the interactions of our agent with its plugins in real time.

In this chapter, we are going to add audio capabilities to our Librarian agent. Once we finish, our Librarian will start its multimodality journey: we will be able to communicate with it using our voice.

Whisper

Our goal is to make the agent capable of listening to us. We will speak into the computer's microphone and get back a response from the model; the process should work just as if we had typed the text.

To accomplish this, we will use an Automatic Speech Recognition (ASR) system, in our case Whisper from OpenAI. Although this model uses an architecture similar to a Large Language Model, it should not be defined as an LLM, as Yann LeCun states in this message. Whisper, like any ASR system, is able to transcribe audio input in multiple languages.

Whisper on Semantic Kernel

In November 2024, Microsoft added support for audio capabilities to Semantic Kernel. The workflow we will build is as follows:

  1. Record the user's audio using the computer's microphone.
  2. Use Whisper to convert the audio into text.
  3. Provide the text as the agent's input.
  4. Show the reply generated by the agent to the user.

Audio to agent

Let's start by recording with the user's microphone on demand. In my chatbot, I have a button to start the recording. Once pressed, the recording starts, and the user must click it again to stop. For that reason, I have created two methods: start_recording and stop_recording.

Recording button

import os
import threading
import pyaudio
import wave
from typing import ClassVar

class AudioRecorder:
    FORMAT: ClassVar[int] = pyaudio.paInt16
    CHANNELS: ClassVar[int] = 1
    RATE: ClassVar[int] = 44100
    CHUNK: ClassVar[int] = 1024

    is_recording: bool = False
    output_filepath: str

    def start_recording(self):
        """Start the recording on a new thread to avoid blocking the UI"""
        if not self.is_recording:  
            self.is_recording = True  
            self.audio_thread = threading.Thread(target=self.record_audio)  
            self.audio_thread.start() 

    def stop_recording(self):
        """Stop the recording (if started)"""
        if self.is_recording:  
            self.is_recording = False  
            if self.audio_thread is not None:  
                self.audio_thread.join()

    def record_audio(self):
        """Record the audio in a output.wav file"""
        # Create output file path
        self.output_filepath = os.path.join(os.path.dirname(__file__), "output.wav")

        self.is_recording = True

        # Open the stream of audio
        audio = pyaudio.PyAudio()
        stream = audio.open(
            format=self.FORMAT,
            channels=self.CHANNELS,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )
        frames = []

        # Read chunks while recording and append them to the list of frames
        while self.is_recording:
            data = stream.read(self.CHUNK)
            frames.append(data)

        # Stop and close the stream of audio
        stream.stop_stream()
        stream.close()

        # Store the audio as a WAV by joining all frames
        with wave.open(self.output_filepath, "wb") as wf:
            wf.setnchannels(self.CHANNELS)
            wf.setsampwidth(audio.get_sample_size(self.FORMAT))
            wf.setframerate(self.RATE)
            wf.writeframes(b"".join(frames))

        audio.terminate()
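
If you want to try the recorder outside the chatbot UI, a minimal sketch could look like this (console-only; the fixed five-second pause simply stands in for the button presses):

import time

recorder = AudioRecorder()
recorder.start_recording()   # recording runs on a background thread
time.sleep(5)                # speak into the microphone for a few seconds
recorder.stop_recording()    # joins the thread and finishes writing output.wav
print(f"Audio saved to {recorder.output_filepath}")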

With that piece of code, we can record the user's voice as a WAV file. Now we will use Whisper to transcribe it, so we need to add the AzureAudioToText audio service to Semantic Kernel. Alternatively, you can use OpenAIAudioToText if you want to connect directly to the OpenAI API.

self.kernel.add_service(AzureAudioToText(
    service_id='audio_service'
))
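
The snippet above relies on the Azure OpenAI settings being available in the environment. If you prefer to configure the Whisper deployment explicitly, a minimal sketch could look like the following; the constructor argument names mirror the other Azure OpenAI connectors in Semantic Kernel, and the deployment name, endpoint and API key are placeholders you need to replace:

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureAudioToText

# Explicit configuration for the Whisper deployment (placeholder values)
kernel = Kernel()
kernel.add_service(AzureAudioToText(
    service_id="audio_service",
    deployment_name="whisper",
    endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
))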

Once it is added to the kernel, it can be retrieved at any time.

self.audio_to_text_service = self.kernel.get_service(type=AzureAudioToText)

The usage of the audio service is quite straightforward. First, we convert the audio file into an AudioContent. Then, we use the AudioContent to call the get_text_content method of the audio service:

async def transcript_audio(self, audio_file: str) -> str:
    # Convert the WAV file into AudioContent
    audio_content = AudioContent.from_audio_file(audio_file)

    # Use the audio service to transcribe the AudioContent
    user_message = await self.audio_to_text_service.get_text_content(audio_content)

    # Return the message as text
    return user_message.text

The result returned by the method can then be displayed on the chat interface and ingested by the agent like any other user input. You can check out the other chapters of this series, where I explain how to build the text-based chat.

async def transcript_audio_and_send_message(self, audio_file: str) -> str:
    # Convert the WAV file into AudioContent
    audio_content = AudioContent.from_audio_file(audio_file)

    # Use the audio service to transcribe the AudioContent
    user_message = await self.audio_to_text_service.get_text_content(audio_content)

    # Add the transcribed text to the history as a user message
    self.history.add_message(ChatMessageContent(role=AuthorRole.USER, content=user_message.text))

    # Invoke the agent with the updated history
    async for response in self.agent.invoke(self.history):
        # Add agent's reply to the history
        self.history.add_message(response)

        # Return the reply
        return str(response)
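
Putting the pieces together, the whole voice flow can be sketched as follows. The chatbot object below stands in for whatever class holds the kernel, chat history and agent in your application, and the button presses are simulated with a short pause, so treat this as illustrative glue code rather than the final UI wiring:

import asyncio

async def voice_turn(chatbot) -> str:
    """Record a voice message, transcribe it and return the agent's reply."""
    recorder = AudioRecorder()
    recorder.start_recording()        # first click on the recording button
    await asyncio.sleep(5)            # the user speaks while the recorder runs
    recorder.stop_recording()         # second click stops and saves output.wav

    # Transcribe the recording and send it to the Librarian agent
    return await chatbot.transcript_audio_and_send_message(recorder.output_filepath)

# reply = asyncio.run(voice_turn(chatbot))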

Summary

In this chapter, we have added the ability to transform our audio into text using Whisper, and then ingest that text into the model to generate a response.

Remember that all the code is already available on my GitHub repository 🐍 PyChatbot for Semantic Kernel.

In the next chapter, we will give our Librarian a voice using a Text-to-Speech service.
