DEV Community

Cover image for OpenAI Realtime API Python Code: Understanding the Low-Level Sample Code for Azure's Realtime Audio Python Code
M Sea Bass
M Sea Bass

Posted on

OpenAI Realtime API Python Code: Understanding the Low-Level Sample Code for Azure's Realtime Audio Python Code

Introduction

The "gpt-4o-realtime-preview" has been released. In addition to text and audio input/output, it also allows custom function calling via function calling.

As of October 2, 2024, there are issues such as 403 errors, and it seems the API is not usable. This article will be updated once it becomes available.

OpenAI has provided a JavaScript code sample on its website. Additionally, Azure has also published a Python code sample on GitHub.

In this article, we will analyze Azure's sample code, "low_level_sample.py," to understand how it works.

Libraries

The required libraries are as follows:

python-dotenv  
soundfile  
numpy  
scipy  
Enter fullscreen mode Exit fullscreen mode

Code Explanation

main Function

In the main function, it first loads the dotenv file to retrieve the API key and endpoint:

load_dotenv()  
Enter fullscreen mode Exit fullscreen mode

Next, it checks the arguments. This file is executed using the command python low_level_sample.py <audio file> <azure|openai>. You can choose either OpenAI or Azure OpenAI as the API:

if len(sys.argv) < 2:  
    print("Usage: python sample.py <audio file> <azure|openai>")  
    print("If second argument is not provided, it will default to azure")  
    sys.exit(1)  
Enter fullscreen mode Exit fullscreen mode

Then, it uses asyncio to run the process asynchronously:

file_path = sys.argv[1]  
if len(sys.argv) == 3 and sys.argv[2] == "openai":  
    asyncio.run(with_openai(file_path))  
else:  
    asyncio.run(with_azure_openai(file_path))  
Enter fullscreen mode Exit fullscreen mode

Next, let's look at the with_openai function.

with_openai Function

The API key and model name are retrieved from environment variables.

Then, an instance of RTLowLevelClient is created:

async with RTLowLevelClient(key_credential=AzureKeyCredential(key), model=model) as client:  
Enter fullscreen mode Exit fullscreen mode

Next, a message is added:

await client.send(  
    SessionUpdateMessage(session=SessionUpdateParams(turn_detection=ServerVAD(type="server_vad")))  
)  
Enter fullscreen mode Exit fullscreen mode

Here, we specify "server_vad" for Voice Activity Detection (VAD). Although "server_vad" is the only option currently available, you can set options like detection threshold and allowable silence duration:

class ServerVAD(BaseModel):  
    type: Literal["server_vad"] = "server_vad"  
    threshold: Optional[Annotated[float, Field(strict=True, ge=0.0, le=1.0)]] = None  
    prefix_padding_ms: Optional[int] = None  
    silence_duration_ms: Optional[int] = None  
Enter fullscreen mode Exit fullscreen mode

The message is then converted to JSON before being sent:

async def send(self, message: UserMessageType):  
    message_json = message.model_dump_json()  
    await self.ws.send_str(message_json)  
Enter fullscreen mode Exit fullscreen mode

The model_dump_json method is defined in Pydantic.BaseModel and converts the model into a JSON string. The resulting JSON looks like this:

{  
    "event_id": null,  
    "type": "session.update",  
    "session": {  
        "model": null,  
        "modalities": null,  
        "voice": null,  
        "instructions": null,  
        "input_audio_format": null,  
        "output_audio_format": null,  
        "input_audio_transcription": null,  
        "turn_detection": {  
            "type": "server_vad",  
            "threshold": null,  
            "prefix_padding_ms": null,  
            "silence_duration_ms": null  
        },  
        "tools": null,  
        "tool_choice": null,  
        "temperature": null,  
        "max_response_output_tokens": null  
    }  
}  
Enter fullscreen mode Exit fullscreen mode

This is sent to session.update to configure the session. You can specify system instructions in the "instructions" field. For example, to set a system prompt, you can modify the code like this:

await client.send(  
    SessionUpdateMessage(  
        session=SessionUpdateParams(  
            instructions="<your system instructions>",  
            turn_detection=ServerVAD(type="server_vad")  
        )  
    )  
)  
Enter fullscreen mode Exit fullscreen mode

Next, asyncio.gather is used to run both send_audio and receive_messages functions simultaneously:

await asyncio.gather(send_audio(client, audio_file_path), receive_messages(client))  
Enter fullscreen mode Exit fullscreen mode

In the send_audio function, the audio file is read using soundfile, base64 encoded, and then sent as InputAudioBufferAppendMessage:

...  

audio_data, original_sample_rate = sf.read(audio_file_path, dtype="int16", **extra_params)  

...  

audio_bytes = audio_data.tobytes()  

for i in range(0, len(audio_bytes), bytes_per_chunk):  
    chunk = audio_bytes[i : i + bytes_per_chunk]  
    base64_audio = base64.b64encode(chunk).decode("utf-8")  
    await client.send(InputAudioBufferAppendMessage(audio=base64_audio))  
Enter fullscreen mode Exit fullscreen mode

The audio data is sent to input_audio_buffer.append.

In the receive_messages function, responses based on the processed audio data from the send_audio function are received.

The session is established at "/openai/realtime", and messages are received asynchronously:

message = await client.recv()  
Enter fullscreen mode Exit fullscreen mode

The case structure handles different message types. The message types are explained here. Below are some of the important ones:

input_audio_buffer.committed

When the server-side Voice Activity Detection (VAD) detects that the user's speech has ended, the input_audio_buffer.committed message is sent.

input_audio_buffer.speech_started

When the AI response begins, input_audio_buffer.speech_started is sent. You can retrieve the start time using message.audio_start_ms.

input_audio_buffer.speech_stopped

When the AI response finishes, input_audio_buffer.speech_stopped is sent. You can retrieve the end time using message.audio_end_ms.

By monitoring speech events, it’s possible to trigger spontaneous responses. For instance, using response.create, the AI can generate a response without waiting for further user input when a period of silence is detected.

conversation.item.created

This can be used to manage conversation history.

response.created

When a response is created, response.created is sent. For streaming processing, you can use response.text.delta and response.audio.delta.

The low_level_sample.py script does not handle audio output. To output audio, you need to retrieve the audio data and use tools like pyaudio for playback. Here's how to handle the audio data:

audio_bytes = base64.b64decode(chunk.data)
audio_data.extend(audio_bytes)
if audio_data is not null:
    print(prefix, f"Audio received with length: {len(audio_data)}")
    with open(os.path.join(out_dir, f"{item.id}.wav"), "wb") as out:
        audio_array = np.frombuffer(audio_data, dtype=np.int16)
Enter fullscreen mode Exit fullscreen mode

I hope this article helps with your development.

If you found it useful, I would appreciate a positive rating.

Thank you!

Top comments (0)