M Sea Bass

Posted on Oct 2

OpenAI Realtime API Python Code: Understanding the Low-Level Sample Code for Azure's Realtime Audio Python Code

Introduction

The "gpt-4o-realtime-preview" model has been released. In addition to text and audio input/output, it also supports calling custom functions via function calling.

As of October 2, 2024, there are issues such as 403 errors, and it seems the API is not usable. This article will be updated once it becomes available.

OpenAI has provided a JavaScript code sample on its website. Additionally, Azure has also published a Python code sample on GitHub.

In this article, we will analyze Azure's sample code, "low_level_sample.py," to understand how it works.

Libraries

The required libraries are as follows:

python-dotenv  
soundfile  
numpy  
scipy

Code Explanation

`main` Function

The main function first loads the .env file so that the API key and endpoint can be read from environment variables:

load_dotenv()

Next, it checks the arguments. This file is executed using the command python low_level_sample.py <audio file> <azure|openai>. You can choose either OpenAI or Azure OpenAI as the API:

if len(sys.argv) < 2:  
    print("Usage: python sample.py <audio file> <azure|openai>")  
    print("If second argument is not provided, it will default to azure")  
    sys.exit(1)

Then, it uses asyncio to run the process asynchronously:

file_path = sys.argv[1]  
if len(sys.argv) == 3 and sys.argv[2] == "openai":  
    asyncio.run(with_openai(file_path))  
else:  
    asyncio.run(with_azure_openai(file_path))
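The argument check and the backend selection can be tried in isolation. The following is a minimal sketch, not the sample's actual code: the two coroutines are stand-ins that only return a string, and choose_backend and run are hypothetical helper names introduced here for illustration.

```python
import asyncio


async def with_openai(path: str) -> str:
    # Stand-in for the real coroutine, which would open the Realtime session.
    return f"openai:{path}"


async def with_azure_openai(path: str) -> str:
    # Stand-in for the Azure variant.
    return f"azure:{path}"


def choose_backend(argv: list[str]):
    """Return the coroutine function selected by the optional second argument."""
    if len(argv) == 3 and argv[2] == "openai":
        return with_openai
    return with_azure_openai  # default when the argument is missing or different


def run(argv: list[str]) -> str:
    if len(argv) < 2:
        raise SystemExit("Usage: python sample.py <audio file> <azure|openai>")
    return asyncio.run(choose_backend(argv)(argv[1]))
```

Calling run(sys.argv) would mirror the behavior described above: anything other than an explicit "openai" falls back to Azure.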

Next, let's look at the with_openai function.

`with_openai` Function

The API key and model name are retrieved from environment variables.
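As a rough illustration of this step, the sketch below reads both values and fails early if either is missing. The variable names OPENAI_API_KEY and OPENAI_MODEL and the helper read_openai_settings are assumptions made for this example, so check the sample's own code for the names it actually expects.

```python
import os


def read_openai_settings(env=os.environ) -> tuple[str, str]:
    """Fetch the key and model name, failing early if either is missing."""
    # OPENAI_API_KEY and OPENAI_MODEL are assumed variable names.
    key = env.get("OPENAI_API_KEY")
    model = env.get("OPENAI_MODEL")
    if not key or not model:
        raise RuntimeError("Set OPENAI_API_KEY and OPENAI_MODEL (for example in .env)")
    return key, model
```

Passing the mapping as a parameter makes the function easy to exercise with a plain dict instead of the real environment.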

Then, an instance of RTLowLevelClient is created:

async with RTLowLevelClient(key_credential=AzureKeyCredential(key), model=model) as client:

Next, a session update message is sent:

await client.send(  
    SessionUpdateMessage(session=SessionUpdateParams(turn_detection=ServerVAD(type="server_vad")))  
)

Here, we specify "server_vad" for Voice Activity Detection (VAD). Although "server_vad" is the only option currently available, you can set options like detection threshold and allowable silence duration:

class ServerVAD(BaseModel):  
    type: Literal["server_vad"] = "server_vad"  
    threshold: Optional[Annotated[float, Field(strict=True, ge=0.0, le=1.0)]] = None  
    prefix_padding_ms: Optional[int] = None  
    silence_duration_ms: Optional[int] = None
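To see what setting these options amounts to on the wire, here is a minimal sketch that builds an equivalent session.update payload with plain dictionaries instead of the Pydantic models. The helper server_vad is a hypothetical name, and the threshold and silence values are arbitrary and chosen only for illustration.

```python
import json
from typing import Optional


def server_vad(threshold: Optional[float] = None,
               prefix_padding_ms: Optional[int] = None,
               silence_duration_ms: Optional[int] = None) -> dict:
    """Build a turn_detection dict, enforcing the same 0.0-1.0 bound as ServerVAD."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


# Example values are arbitrary, chosen only for illustration.
event = {
    "type": "session.update",
    "session": {"turn_detection": server_vad(threshold=0.5, silence_duration_ms=500)},
}
payload = json.dumps(event)
```

With the sample's own classes, the same settings would presumably be passed as keyword arguments to ServerVAD.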

The message is then converted to JSON before being sent:

async def send(self, message: UserMessageType):  
    message_json = message.model_dump_json()  
    await self.ws.send_str(message_json)

The model_dump_json method is defined on Pydantic's BaseModel and serializes the model into a JSON string. The resulting JSON looks like this:

{  
    "event_id": null,  
    "type": "session.update",  
    "session": {  
        "model": null,  
        "modalities": null,  
        "voice": null,  
        "instructions": null,  
        "input_audio_format": null,  
        "output_audio_format": null,  
        "input_audio_transcription": null,  
        "turn_detection": {  
            "type": "server_vad",  
            "threshold": null,  
            "prefix_padding_ms": null,  
            "silence_duration_ms": null  
        },  
        "tools": null,  
        "tool_choice": null,  
        "temperature": null,  
        "max_response_output_tokens": null  
    }  
}

This is sent as a session.update event to configure the session. You can specify system instructions in the "instructions" field. For example, to set a system prompt, you can modify the code like this:

await client.send(  
    SessionUpdateMessage(  
        session=SessionUpdateParams(  
            instructions="<your system instructions>",  
            turn_detection=ServerVAD(type="server_vad")  
        )  
    )  
)

Next, asyncio.gather is used to run both send_audio and receive_messages functions simultaneously:

await asyncio.gather(send_audio(client, audio_file_path), receive_messages(client))
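The effect of asyncio.gather is easier to see in a toy version. In the sketch below, an asyncio.Queue stands in for the WebSocket connection, which is an assumption made purely for illustration: the sender pushes chunks while the receiver consumes them concurrently, and gather returns both results in argument order.

```python
import asyncio


async def send_audio(queue: asyncio.Queue, chunks: list[bytes]) -> int:
    # Stand-in sender: pushes chunks, then a None sentinel to signal the end.
    for chunk in chunks:
        await queue.put(chunk)
    await queue.put(None)
    return len(chunks)


async def receive_messages(queue: asyncio.Queue) -> int:
    # Stand-in receiver: counts bytes until the sentinel arrives.
    total = 0
    while (chunk := await queue.get()) is not None:
        total += len(chunk)
    return total


async def demo() -> list[int]:
    queue: asyncio.Queue = asyncio.Queue()
    # gather schedules both coroutines and returns results in argument order.
    return await asyncio.gather(
        send_audio(queue, [b"ab", b"cde"]), receive_messages(queue)
    )


results = asyncio.run(demo())
```

Here results holds the number of chunks sent followed by the number of bytes received.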

In the send_audio function, the audio file is read using soundfile, base64 encoded, and then sent as InputAudioBufferAppendMessage:

...  

audio_data, original_sample_rate = sf.read(audio_file_path, dtype="int16", **extra_params)  

...  

audio_bytes = audio_data.tobytes()  

for i in range(0, len(audio_bytes), bytes_per_chunk):  
    chunk = audio_bytes[i : i + bytes_per_chunk]  
    base64_audio = base64.b64encode(chunk).decode("utf-8")  
    await client.send(InputAudioBufferAppendMessage(audio=base64_audio))

Each chunk of audio data is sent as an input_audio_buffer.append event.
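The chunking and encoding can be reproduced without the client library. This sketch builds the JSON events by hand. The helper name audio_append_events is hypothetical, and the 24 kHz, 16-bit mono, 100 ms chunk size is an assumed configuration used only to make the arithmetic concrete.

```python
import base64
import json


def audio_append_events(pcm: bytes, bytes_per_chunk: int) -> list[str]:
    """Split raw PCM bytes into JSON-encoded input_audio_buffer.append events."""
    events = []
    for i in range(0, len(pcm), bytes_per_chunk):
        chunk = pcm[i : i + bytes_per_chunk]
        events.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("utf-8"),
        }))
    return events


# 24 kHz, 16-bit mono, 100 ms per chunk: assumed values for illustration.
sample_rate, bytes_per_sample, chunk_ms = 24000, 2, 100
bytes_per_chunk = sample_rate * bytes_per_sample * chunk_ms // 1000  # 4800 bytes
events = audio_append_events(bytes(10000), bytes_per_chunk)
```

The last chunk is simply shorter than the others, so no padding is needed.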

The receive_messages function receives the server's responses to the audio sent by send_audio.

The session is established at "/openai/realtime", and messages are received asynchronously:

message = await client.recv()

The case structure handles different message types. The message types are explained here. Below are some of the important ones:

`input_audio_buffer.committed`

When the server-side Voice Activity Detection (VAD) detects that the user's speech has ended, the input_audio_buffer.committed message is sent.

`input_audio_buffer.speech_started`

When the server-side VAD detects the start of the user's speech in the input audio buffer, input_audio_buffer.speech_started is sent. You can retrieve the start time using message.audio_start_ms.

`input_audio_buffer.speech_stopped`

When the server-side VAD detects that the user's speech has stopped, input_audio_buffer.speech_stopped is sent. You can retrieve the end time using message.audio_end_ms.

By monitoring speech events, it’s possible to trigger spontaneous responses. For instance, using response.create, the AI can generate a response without waiting for further user input when a period of silence is detected.
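One way to act on these events is sketched below. It is only an illustration under stated assumptions: SilenceWatcher is a hypothetical helper, events are plain dictionaries rather than the client's message objects, and now_ms is assumed to be measured on the same timeline as audio_end_ms.

```python
import json
from typing import Optional


class SilenceWatcher:
    """Track server VAD events and decide when to request a response."""

    def __init__(self, max_silence_ms: int):
        self.max_silence_ms = max_silence_ms
        # None while speech is ongoing or no speech has been seen yet.
        self.last_stop_ms: Optional[int] = None

    def on_event(self, event: dict) -> None:
        if event["type"] == "input_audio_buffer.speech_started":
            self.last_stop_ms = None
        elif event["type"] == "input_audio_buffer.speech_stopped":
            self.last_stop_ms = event["audio_end_ms"]

    def poll(self, now_ms: int) -> Optional[str]:
        """Return a response.create payload once the silence is long enough."""
        if self.last_stop_ms is None or now_ms - self.last_stop_ms < self.max_silence_ms:
            return None
        self.last_stop_ms = None  # fire only once per silence period
        return json.dumps({"type": "response.create"})
```

A receive loop could call on_event for every incoming message and poll periodically, sending the returned payload whenever it is not None.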

`conversation.item.created`

This can be used to manage conversation history.

`response.created`

When a response is created, response.created is sent. For streaming processing, you can use response.text.delta and response.audio.delta.
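For streaming, the deltas have to be stitched back together on the client side. The sketch below assumes each event is a plain dictionary whose "delta" field carries a text fragment or a base64-encoded audio fragment. The helper collect_response is a hypothetical name, and a real client would handle the events as they arrive rather than from a list.

```python
import base64


def collect_response(events: list[dict]) -> tuple[str, bytes]:
    """Concatenate streamed text and audio deltas into complete outputs."""
    text_parts: list[str] = []
    audio = bytearray()
    for event in events:
        if event["type"] == "response.text.delta":
            text_parts.append(event["delta"])
        elif event["type"] == "response.audio.delta":
            # Audio deltas are assumed to be base64-encoded PCM fragments.
            audio.extend(base64.b64decode(event["delta"]))
    return "".join(text_parts), bytes(audio)
```

Event types other than the two deltas are ignored here, so the same function can be fed the full message stream.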

The low_level_sample.py script does not handle audio output. To output audio, you need to retrieve the audio data and use tools like pyaudio for playback. Here's how to handle the audio data:

audio_bytes = base64.b64decode(chunk.data)
audio_data.extend(audio_bytes)
if audio_data is not None:
    print(prefix, f"Audio received with length: {len(audio_data)}")
    with open(os.path.join(out_dir, f"{item.id}.wav"), "wb") as out:
        audio_array = np.frombuffer(audio_data, dtype=np.int16)
        # Assumption: 24 kHz mono PCM16 output; soundfile is imported as sf.
        sf.write(out, audio_array, samplerate=24000, format="WAV")
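If you only need to save the received audio, the standard library's wave module is enough. The following sketch wraps raw PCM bytes in a WAV container. The helper name pcm16_to_wav is hypothetical, and the 24 kHz mono, 16-bit format is an assumption about the session's output format that you should adjust to match your configuration.

```python
import io
import wave


def pcm16_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The returned bytes can be written to a .wav file and opened in any audio player, or passed on to a playback library.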

I hope this article helps with your development.

If you found it useful, I would appreciate a positive rating.

Thank you!
