OpenAI Realtime API: Conversing via Local Microphone and Speaker

#ai #realtimeapi #openai #pyaudio

OpenAI and Azure both publish code samples for the Realtime API. However, the only Python sample is on Azure's GitHub, and it takes an audio file as input.

I therefore modified that sample to accept real-time audio from the local microphone. The modified version is available on GitHub. The code is short and simple, so it should be easy to integrate into other projects.

The modified code is based on low_level_sample.py. For a detailed explanation of the original, see this article.

About the Modifications

This article explains how to modify a Python application that processes audio to accept input from the local microphone and output audio data returned by the Realtime API through the local speaker. The implementation mainly uses the pyaudio library.

The modifications consist of the following two points:

Capturing audio input from the local microphone.
Outputting the audio data returned by the Realtime API through the local speaker.
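
The snippets in the following sections refer to constants such as INPUT_SAMPLE_RATE, INPUT_CHUNK_SIZE, and OUTPUT_CHUNK_SIZE without showing their values. As a rough sketch, assuming the session uses the Realtime API's pcm16 audio format (16-bit PCM, 24 kHz, mono), they might be defined along these lines. The chunk durations and the `*_CHUNK_MS` names are assumptions made for illustration, not values taken from the repository.

```python
# Assumed audio settings for the pcm16 format (16-bit PCM, 24 kHz, mono).
# The actual values in the repository may differ.
INPUT_SAMPLE_RATE = 24000    # samples per second
INPUT_CHANNELS = 1           # mono
OUTPUT_SAMPLE_RATE = 24000
OUTPUT_CHANNELS = 1
OUTPUT_SAMPLE_WIDTH = 2      # bytes per 16-bit sample
# With pyaudio installed, STREAM_FORMAT would be pyaudio.paInt16 (16-bit samples).

INPUT_CHUNK_MS = 100         # hypothetical: capture 100 ms per read
OUTPUT_CHUNK_MS = 100        # hypothetical: play 100 ms per write

# stream.read() takes a frame count, so the input chunk is measured in frames.
INPUT_CHUNK_SIZE = INPUT_SAMPLE_RATE * INPUT_CHUNK_MS // 1000

# The playback loop slices a bytes object, so the output chunk is measured in bytes.
OUTPUT_CHUNK_SIZE = (
    OUTPUT_SAMPLE_RATE * OUTPUT_CHUNK_MS // 1000 * OUTPUT_SAMPLE_WIDTH * OUTPUT_CHANNELS
)

print(INPUT_CHUNK_SIZE, OUTPUT_CHUNK_SIZE)
```

Note the different units: the input chunk counts frames, while the output chunk counts bytes, because the two are consumed differently by the code that follows.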

1. Implementing Audio Input from the Local Microphone

Below is the code that captures audio input from the local microphone using pyaudio and sends the data to the Realtime API in real time.

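import asyncio
import base64

import pyaudio

# RTLowLevelClient and InputAudioBufferAppendMessage are imported as in low_level_sample.py.


async def send_audio(client: RTLowLevelClient):
    p = pyaudio.PyAudio()
    default_input_index = p.get_default_input_device_info()['index']
    stream = p.open(
        format=STREAM_FORMAT,
        channels=INPUT_CHANNELS,
        rate=INPUT_SAMPLE_RATE,
        input=True,
        output=False,
        frames_per_buffer=INPUT_CHUNK_SIZE,
        input_device_index=default_input_index,
        start=False,
    )
    stream.start_stream()

    print("Start sending audio")
    try:
        while not client.closed:
            # stream.read() blocks until a full chunk is captured, so run it in a
            # worker thread to keep the event loop free for receiving messages.
            audio_data = await asyncio.to_thread(
                stream.read, INPUT_CHUNK_SIZE, exception_on_overflow=False
            )
            base64_audio = base64.b64encode(audio_data).decode("utf-8")
            await client.send(InputAudioBufferAppendMessage(audio=base64_audio))
    finally:
        # Release the audio device when the client closes or an error occurs.
        stream.stop_stream()
        stream.close()
        p.terminate()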

This code captures audio data from the local default microphone using pyaudio, encodes it in Base64, and sends it to the Realtime API.
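
To make the encoding step concrete, here is a sketch of what a single append message might look like on the wire. It assumes the Realtime API's `input_audio_buffer.append` event, which carries the Base64-encoded audio in its `audio` field. The helper `build_append_event` and the silent test chunk are hypothetical; in the article's code, the `InputAudioBufferAppendMessage` class takes care of this.

```python
import base64
import json


def build_append_event(pcm_chunk: bytes) -> str:
    """Serialize one raw PCM chunk as an input_audio_buffer.append event (JSON text)."""
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("utf-8"),
    }
    return json.dumps(event)


# Hypothetical chunk: 100 ms of silence at 24 kHz, 16-bit mono (2400 frames * 2 bytes).
silent_chunk = b"\x00\x00" * 2400
payload = build_append_event(silent_chunk)

# Decoding the audio field gives back the original bytes.
decoded = base64.b64decode(json.loads(payload)["audio"])
print(len(silent_chunk), len(decoded))
```

Base64 expands the data by a factor of 4/3, so the 4800-byte chunk here becomes a 6400-character string.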

Key Points:

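pyaudio.PyAudio() initializes PortAudio and provides access to the audio devices.
get_default_input_device_info() returns information about the default input device, including the index passed to p.open().
stream.read() blocks until INPUT_CHUNK_SIZE frames have been captured and returns them as raw bytes, which are then sent to the API.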
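
One thing to keep in mind is that stream.read() is a blocking call, so calling it directly inside a coroutine pauses the event loop until the chunk has been captured. A possible way around this is to run the blocking call in a worker thread with asyncio.to_thread (Python 3.9+). The sketch below uses a stand-in function, `fake_blocking_read`, in place of a real pyaudio stream, so every name in it is hypothetical.

```python
import asyncio
import time


def fake_blocking_read(num_frames: int) -> bytes:
    """Stand-in for stream.read(): blocks briefly, then returns 16-bit mono silence."""
    time.sleep(0.05)
    return b"\x00\x00" * num_frames


async def capture(chunks: list, count: int) -> None:
    for _ in range(count):
        # The blocking read runs in a worker thread; the event loop stays free.
        data = await asyncio.to_thread(fake_blocking_read, 2400)
        chunks.append(data)


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    # Represents other work (such as receiving messages) sharing the event loop.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main() -> tuple:
    chunks, ticks = [], []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    await capture(chunks, 3)
    stop.set()
    await beat
    return chunks, ticks


chunks, ticks = asyncio.run(main())
print(f"captured {len(chunks)} chunks, heartbeat ran {len(ticks)} times")
```

The heartbeat keeps ticking while the reads are in progress, which is the behavior you want when sending and receiving share one event loop.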

2. Implementing Audio Output from the Realtime API to Speakers

Next is the code for outputting the audio data returned by the Realtime API through the local speakers.

async def receive_messages(client: RTLowLevelClient):
    p = pyaudio.PyAudio()
    default_output_index = p.get_default_output_device_info()['index']
    stream = p.open(
        format=STREAM_FORMAT,
        channels=OUTPUT_CHANNELS,
        rate=OUTPUT_SAMPLE_RATE,
        input=False,
        output=True,
        output_device_index=default_output_index,
        start=False,
    )
    stream.start_stream()

    print("Start receiving messages")
    while True:
        ...
            case "response.audio.delta":
                print("Response Audio Delta Message")
                print(f"  Response Id: {message.response_id}")
                print(f"  Item Id: {message.item_id}")
                print(f"  Audio Data Length: {len(message.delta)}")
                audio_data = base64.b64decode(message.delta)
                print(f"  Audio Binary Data Length: {len(audio_data)}")
                audio_duration = len(audio_data) / OUTPUT_SAMPLE_RATE / OUTPUT_SAMPLE_WIDTH / OUTPUT_CHANNELS
                print(f"  Audio Duration: {audio_duration}")
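                start_time = time.time()
                # Write in chunks, running the blocking write off the event loop.
                for i in range(0, len(audio_data), OUTPUT_CHUNK_SIZE):
                    await asyncio.to_thread(stream.write, audio_data[i:i+OUTPUT_CHUNK_SIZE])
                # Wait out the rest of the clip (minus a 50 ms margin) without blocking the event loop.
                await asyncio.sleep(max(0, audio_duration - (time.time() - start_time) - 0.05))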

This code decodes the Base64-encoded audio data received from the Realtime API and outputs it to the speakers using pyaudio.
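
The decode-and-chunk step can be tried without an audio device. The sketch below uses a hypothetical `delta` string standing in for `message.delta`, a 4800-byte chunk size chosen for illustration, and a made-up helper named `split_chunks`.

```python
import base64

OUTPUT_CHUNK_SIZE = 4800  # assumed: bytes per write (100 ms of 16-bit mono at 24 kHz)


def split_chunks(audio_data: bytes, chunk_size: int) -> list:
    """Slice decoded audio into pieces of at most chunk_size bytes, as the write loop does."""
    return [audio_data[i:i + chunk_size] for i in range(0, len(audio_data), chunk_size)]


# Hypothetical delta: 250 ms of silence (6000 frames * 2 bytes), Base64-encoded.
delta = base64.b64encode(b"\x00\x00" * 6000).decode("utf-8")

audio_data = base64.b64decode(delta)
chunks = split_chunks(audio_data, OUTPUT_CHUNK_SIZE)
print([len(c) for c in chunks])
```

The final chunk may be shorter than the others. Choosing a chunk size that is a multiple of the frame size (sample width times channels) ensures no sample is split across two writes.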

Key Points:

get_default_output_device_info() retrieves the default output device (speakers).
stream.write() outputs the decoded audio data to the speakers in real time.
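The playback duration is computed from the byte length of the decoded audio (bytes / sample rate / sample width / channels). After writing, the loop waits for whatever remains of that duration, less a 0.05-second margin, to keep audio delay small.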
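
The timing adjustment comes down to two small calculations, sketched below. The helper names are hypothetical, and the example assumes 16-bit mono audio at 24 kHz.

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int, sample_width: int, channels: int) -> float:
    """Playback time of raw PCM: bytes / (samples per second * bytes per sample * channels)."""
    return num_bytes / sample_rate / sample_width / channels


def remaining_wait(duration: float, elapsed: float, margin: float = 0.05) -> float:
    """Time left to wait after writing, minus a safety margin, never negative."""
    return max(0.0, duration - elapsed - margin)


# Assumed format: 16-bit mono at 24 kHz, so 48000 bytes is one second of audio.
duration = pcm_duration_seconds(48000, 24000, 2, 1)
print(duration)  # 1.0

# If writing took 0.2 s, roughly 0.75 s of the clip is still left to wait for.
print(remaining_wait(duration, 0.2))

# If writing took longer than the clip itself, there is nothing left to wait for.
print(remaining_wait(duration, 2.0))  # 0.0
```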

Thank you for reading to the end. If you have any questions or feedback about the code, feel free to reach out!