In the previous chapters, we built a basic Librarian agent, enhanced it with specific skills via function calling, and created a tool to inspect, in real time, the interactions between our agent and its plugins.
In this chapter, we are going to add audio capabilities to our Librarian agent. Once we finish, our Librarian will begin its multimodality journey: we will be able to communicate with it using our voice.
Whisper
Our goal is to make the agent capable of listening to us: we will speak into the computer's microphone and get back a response from the model, just as if we had typed the text.
To accomplish this, we will use an Automatic Speech Recognition (ASR) system, in our case, Whisper from OpenAI. Although this model uses an architecture similar to a Large Language Model, it should not be defined as an LLM, as Yann LeCun states in this message. Whisper, like any ASR system, can transcribe audio input in multiple languages.
Whisper on Semantic Kernel
In November 2024, Microsoft added support for audio capabilities to Semantic Kernel. The workflow we will build is as follows:
- Record the user's audio using the computer's microphone.
- Use Whisper to convert the audio into text.
- Provide the text as the agent's input.
- Show the reply generated by the agent to the user.
Let's start by recording the user's voice from the microphone on demand. In my chatbot, I have a button to start the recording: once pressed, the recording starts, and the user must click it again to stop. For that reason, I have created two methods: `start_recording` and `stop_recording`.
```python
import os
import threading
import wave
from typing import ClassVar, Optional

import pyaudio


class AudioRecorder:
    FORMAT: ClassVar[int] = pyaudio.paInt16
    CHANNELS: ClassVar[int] = 1
    RATE: ClassVar[int] = 44100
    CHUNK: ClassVar[int] = 1024

    is_recording: bool = False
    output_filepath: str
    audio_thread: Optional[threading.Thread] = None

    def start_recording(self):
        """Start the recording on a new thread to avoid blocking the UI"""
        if not self.is_recording:
            self.is_recording = True
            self.audio_thread = threading.Thread(target=self.record_audio)
            self.audio_thread.start()

    def stop_recording(self):
        """Stop the recording (if started)"""
        if self.is_recording:
            self.is_recording = False
            if self.audio_thread is not None:
                self.audio_thread.join()

    def record_audio(self):
        """Record the audio into an output.wav file"""
        # Create the output file path
        self.output_filepath = os.path.join(os.path.dirname(__file__), "output.wav")
        self.is_recording = True
        # Open the audio input stream
        audio = pyaudio.PyAudio()
        stream = audio.open(
            format=self.FORMAT,
            channels=self.CHANNELS,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK,
        )
        frames = []
        # Read chunks while recording and append them to the list of frames
        while self.is_recording:
            data = stream.read(self.CHUNK)
            frames.append(data)
        # Stop and close the audio stream
        stream.stop_stream()
        stream.close()
        # Store the audio as a WAV file by joining all frames
        with wave.open(self.output_filepath, "wb") as wf:
            wf.setnchannels(self.CHANNELS)
            wf.setsampwidth(audio.get_sample_size(self.FORMAT))
            wf.setframerate(self.RATE)
            wf.writeframes(b"".join(frames))
        audio.terminate()
```
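As a quick illustration, here is a minimal sketch of how the recorder could be wired to a button; the `on_record_button` callback is hypothetical and stands in for whatever your UI framework provides:

```python
# Hypothetical UI wiring: the button callback toggles the recording
recorder = AudioRecorder()

def on_record_button():
    if not recorder.is_recording:
        recorder.start_recording()  # begins capturing on a background thread
    else:
        recorder.stop_recording()   # joins the thread; output.wav is now on disk
        print(f"Audio saved to {recorder.output_filepath}")
```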
With that piece of code, we can record the user's voice as a `wav` file. Now, we will use Whisper to transcribe it, so we need to add the audio service `AzureAudioToText` to Semantic Kernel. Alternatively, you can use `OpenAIAudioToText` if you want to connect directly to the OpenAI API.
```python
from semantic_kernel.connectors.ai.open_ai import AzureAudioToText

self.kernel.add_service(AzureAudioToText(
    service_id='audio_service'
))
```
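For the direct OpenAI route, the registration is analogous. A sketch, assuming your `OPENAI_API_KEY` is set in the environment and that you pass the Whisper model through the connector's usual `ai_model_id` parameter:

```python
from semantic_kernel.connectors.ai.open_ai import OpenAIAudioToText

# Assumption: credentials come from the OPENAI_API_KEY environment variable
self.kernel.add_service(OpenAIAudioToText(
    service_id='audio_service',
    ai_model_id='whisper-1',
))
```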
Once it is added to the kernel, it can be retrieved at any time.
```python
self.audio_to_text_service = self.kernel.get_service(type=AzureAudioToText)
```
The usage of the audio service is quite straightforward. First, we convert the audio file into an `AudioContent`. Then, we pass the `AudioContent` to the `get_text_content` method of the audio service:
```python
from semantic_kernel.contents import AudioContent

async def transcript_audio(self, audio_file: str) -> str:
    # Convert the WAV file into AudioContent
    audio_content = AudioContent.from_audio_file(audio_file)
    # Use the audio service to transcribe the AudioContent
    user_message = await self.audio_to_text_service.get_text_content(audio_content)
    # Return the message as text
    return user_message.text
```
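For reference, a minimal usage sketch; `chatbot` is assumed to be the object that owns the method, and the audio file is the one saved by the recorder:

```python
# Hypothetical usage: transcribe the WAV file produced by AudioRecorder
text = await chatbot.transcript_audio(recorder.output_filepath)
print(text)  # the transcription, ready to display in the chat
```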
The result returned by the method can then be displayed in the chat interface and ingested by the agent like any other user input; the combined method below does exactly that. You can check out the other chapters of this series, where I explain how to build the text-based chat.
```python
from semantic_kernel.contents import AudioContent, AuthorRole, ChatMessageContent

async def transcript_audio_and_send_message(self, audio_file: str) -> str:
    # Convert the WAV file into AudioContent
    audio_content = AudioContent.from_audio_file(audio_file)
    # Use the audio service to transcribe the AudioContent
    user_message = await self.audio_to_text_service.get_text_content(audio_content)
    # Add the transcribed text to the history as a user message
    self.history.add_message(
        ChatMessageContent(role=AuthorRole.USER, content=user_message.text)
    )
    # Invoke the agent with the updated history
    async for response in self.agent.invoke(self.history):
        # Add the agent's reply to the history
        self.history.add_message(response)
    # Return the last reply as text
    return str(response)
```
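Putting the pieces together, here is an end-to-end sketch under the assumptions above; the `Chatbot` class is hypothetical and stands in for whatever object holds the kernel, the history, and the agent:

```python
import asyncio

async def main():
    recorder = AudioRecorder()
    chatbot = Chatbot()          # hypothetical owner of kernel, history, and agent

    recorder.start_recording()   # user presses the record button
    await asyncio.sleep(5)       # stand-in for the user speaking
    recorder.stop_recording()    # user presses the button again

    reply = await chatbot.transcript_audio_and_send_message(recorder.output_filepath)
    print(reply)                 # the Librarian's answer to the spoken request

asyncio.run(main())
```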
Summary
In this chapter, we have added the ability to transform our audio into text using Whisper, and then ingest that text into the model to generate a response.
Remember that all the code is already available on my GitHub repository 🐍 PyChatbot for Semantic Kernel.
In the next chapter, we will give our Librarian a voice using a Text-to-Speech service.