OpenAI Realtime API: Points to keep in mind when executing function calling

#ai #realtimeapi #openai #functioncalling

While executing functions with the Realtime API, I ran into a small issue, so I'm leaving a note here on the cause and the fix. The actual code is available in the GitHub repository below; feel free to use it as a reference.

Voice Chat Repository

The above code is based on an Azure sample. I have provided an explanation of the Azure sample code here as well, so feel free to read it if you're interested.

Flow of Function Execution

1. Preparing the Function

First, define the tool that the model is allowed to call. The following code defines a function tool named get_your_info that takes a single string parameter, query.

TOOLS = [
    {
        "type": "function",
        "name": "get_your_info",
        "description": "Get the information of your own.(e.g. name, age, hobby, favorite food, favorite color, favorite music, favorite movie, favorite game, favorite sport, favorite team, favorite player, favorite programming language)",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The query to get the information",
                },
            },
            "required": ["query"],
        },
    }
]
TOOL_CHOICE = "auto"
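The tool definition above only describes the function to the model; the Python function itself and the TOOL_MAP lookup used later in call_tool still have to exist on the client side. Below is a minimal sketch of what that could look like. The PROFILE data and the word-matching logic are hypothetical and only illustrate the shape; the real implementation in the repository may differ.

```python
import json
import re

# Hypothetical profile data, used only for illustration.
PROFILE = {"name": "Alice", "age": "30", "hobby": "hiking"}


def get_your_info(query: str) -> str:
    """Return the profile fields mentioned in the query as a JSON string."""
    # Split the query into lowercase words so that "age" does not match "language".
    words = set(re.findall(r"[a-z]+", query.lower()))
    matches = {key: value for key, value in PROFILE.items() if key in words}
    # Fall back to the whole profile when no field name appears in the query.
    return json.dumps(matches or PROFILE)


# Maps the tool name the model sends back to the local Python function.
TOOL_MAP = {"get_your_info": get_your_info}
```

Returning a string keeps the result ready to be sent back as the function output in step 4.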

2. Setting Up the Session and Preparing the Function Call

Next, configure the session so that the model knows about the tool. Pass TOOLS as tools and TOOL_CHOICE as tool_choice; with "auto", the model decides on its own when to call the function.

await client.send(
    SessionUpdateMessage(
        session=SessionUpdateParams(
            model=model,
            modalities=["text", "audio"],
            input_audio_format="pcm16",
            output_audio_format="pcm16",
            turn_detection=ServerVAD(type="server_vad", threshold=0.5, prefix_padding_ms=200, silence_duration_ms=200),
            input_audio_transcription=InputAudioTranscription(model="whisper-1"),
            voice=VOICE_TYPE,
            instructions=INSTRUCTIONS,
            temperature=TEMPERATURE,
            max_response_output_tokens=MAX_RESPONSE_OUTPUT_TOKENS,
            tools=TOOLS,
            tool_choice=TOOL_CHOICE,
        )
    )
)
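For readers who are not using the sample's client classes, the part of this message that matters for function calling can be sketched as a plain session.update event. The helper name build_session_update is hypothetical, and the sketch leaves out the audio and voice settings shown above.

```python
import json


def build_session_update(tools: list, tool_choice: str = "auto") -> dict:
    """Build a session.update event that registers tools for the session."""
    return {
        "type": "session.update",
        "session": {
            "tools": tools,
            "tool_choice": tool_choice,
        },
    }


# A trimmed-down tool definition, just to keep the example self-contained.
example_tools = [
    {
        "type": "function",
        "name": "get_your_info",
        "description": "Get information about yourself.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]

event = build_session_update(example_tools)
payload = json.dumps(event)  # JSON text that would be sent over the WebSocket
```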

3. Executing the Function and Retrieving Arguments

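When the model decides to call the function, the server sends a response.function_call_arguments.done event. The arguments arrive as a JSON string in message.arguments, so they must be parsed before the function is called. The processing for this step is shown below.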

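import json

message = await client.recv()
...
match message.type:
    case "response.function_call_arguments.done":
        print("Response Function Call Arguments Done Message")
        print(f"  Response Id: {message.response_id}")
        print(f"  Arguments: {message.arguments}")
        try:
            # The arguments arrive as a JSON-encoded string.
            arguments = json.loads(message.arguments)
        except json.JSONDecodeError as e:
            # Skip the tool call if the arguments are not valid JSON.
            print(f"  Failed to parse arguments: {e}")
        else:
            await call_tool(client, message.item_id, message.call_id, message.name, arguments)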
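Because the arguments are generated by the model, it can help to validate them before unpacking them into the function. The following is a minimal sketch of such a check; parse_tool_arguments is a hypothetical helper and not part of the sample code.

```python
import json
from typing import Optional


def parse_tool_arguments(raw: str) -> Optional[dict]:
    """Parse a JSON argument string, returning None if it is unusable."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # tool_func(**arguments) needs a mapping, so reject lists, numbers, etc.
    return parsed if isinstance(parsed, dict) else None
```

With this helper, the handler can skip the call to call_tool whenever the result is None.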

4. Sending the Function Result

Send the function result with conversation.item.create. Sending the result alone does not make the model respond, however, so you also need to send response.create to trigger response generation.

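async def call_tool(client: RTLowLevelClient, previous_item_id: str, call_id: str, tool_name: str, arguments: dict):
    # Look up the local function by the tool name the model requested.
    tool_func = TOOL_MAP[tool_name]
    tool_output = tool_func(**arguments)
    print(f"tool_output: {tool_output}")
    # 1) Add the function result to the conversation (conversation.item.create).
    #    The output is expected to be a string, for example JSON text.
    await client.send(
        ItemCreateMessage(
            item=FunctionCallOutputItem(
                call_id=call_id,
                output=tool_output,
            ),
            previous_item_id=previous_item_id,
        )
    )
    # 2) Ask the model to generate a response that uses the result (response.create).
    await client.send(
        ResponseCreateMessage(
            response=ResponseCreateParams()
        )
    )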

The response parameters passed to response.create can be empty here. What matters is that the event is sent: it prompts the model to generate a response that uses the function result, and processing continues from there.
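As a rough sketch of what goes over the wire, the two messages correspond to the following JSON events. The helper name build_tool_result_events and the example call_id are hypothetical; only the event types and the function_call_output item fields follow the Realtime API.

```python
import json


def build_tool_result_events(call_id: str, output: str) -> list:
    """Return the two events to send, in order, after a tool has run."""
    return [
        {
            # 1) Attach the function result to the conversation.
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": output,
            },
        },
        {
            # 2) Trigger response generation; no extra parameters are needed.
            "type": "response.create",
        },
    ]


events = build_tool_result_events("call_123", json.dumps({"name": "Alice"}))
payloads = [json.dumps(event) for event in events]  # one JSON message per event
```

The order matters: the result has to be in the conversation before the response is requested.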

Summary

The key point is that after sending the function result, you must trigger response generation with response.create. If you skip this step, the model does not generate a reply from the function result and no further processing takes place. It's a simple fix, but I hope it serves as a reference for anyone facing a similar issue.