Matthew Mascord

Creating a realtime voice agent using OpenAI's new gpt-realtime speech-to-speech model

Build a Python realtime voice agent with OpenAI’s gpt-realtime: capture microphone audio, hear spoken responses and optionally call GitHub via MCP with safe, approval-gated actions. Requires Python, an OpenAI API key and headphones.

Introduction

OpenAI recently announced general availability of their Realtime API along with their latest speech-to-speech model, gpt-realtime.

This is a groundbreaking API that allows developers to implement bespoke conversational voice agents tailored to a specific application and with access to virtually any tool.

In this post I will share the basics of connecting to this API using Python and the OpenAI SDK. Full working code is available in this accompanying GitHub repository.

You will need:

  • Python 3.10+
  • uv
  • A valid OpenAI API key (note: following this guide will incur costs; see OpenAI API pricing for details)
  • A GitHub PAT (read-only permissions recommended)
  • Headphones

What's new?

The new realtime API offers some intriguing new features, such as support for remote MCP servers, re-usable prompts, image inputs and SIP (connecting from the public phone network). Since the beta version there have also been some minor API changes, essentially refinements to the event model and closer alignment with the Responses API.

Remote MCP server support

While the previous API did support tool-calling, you had to wire it up by hand: the API would only generate a tool call, not actually execute it. With remote MCP server support, you only need to point the model at the server with a valid authorization token and the API/model will take care of the rest. This is potentially game-changing for browser-based connections, which were previously limited to tools available in the browser context.

To mitigate the potential for the model to hallucinate and perform destructive, non-reversible actions, an approval mechanism has sensibly been introduced. This allows the developer to specify a list of tools that should always or never require approval. The approval itself is managed out of band via specific events that need to be intercepted and handled by your code.
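As a preview of what that configuration looks like (the full, typed GitHub example appears later in this post), the filter is essentially two lists of tool names. The server label, URL and tool names below are purely illustrative:

mcp_tool_preview = {
    "type": "mcp",
    "server_label": "my_mcp_server",              # illustrative label
    "server_url": "https://example.com/mcp/",     # illustrative URL
    "authorization": "<access token>",
    "require_approval": {
        "never": {"tool_names": ["read_thing"]},      # run without asking
        "always": {"tool_names": ["delete_thing"]},   # always ask first
    },
}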

A number of built-in connectors - OpenAI-maintained MCP wrappers for popular services like Google Workspace and Dropbox - are also available, saving you from maintaining your own MCP server or trusting another third party. A rough sketch of how one is referenced is shown below.
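For illustration only: a connector is referenced by a connector_id and an OAuth token instead of a server URL. This sketch assumes the Realtime MCP tool accepts the same fields as the Responses API; the label and token are placeholders:

dropbox_connector_preview = {
    "type": "mcp",
    "server_label": "dropbox",
    "connector_id": "connector_dropbox",      # OpenAI-maintained connector
    "authorization": "<oauth access token>",  # placeholder
    "require_approval": "never",
}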

Re-usable prompts

The re-usable prompts feature allows instructions to be stored on OpenAI's servers and referenced via an identifier, rather than having to include the full prompt for each session. This has the potential to save on token usage, especially for long, complicated prompts with multiple reference examples. One current downside is that there doesn't appear to be an API for managing the prompts themselves; everything has to be managed via the API playground UI (see discussion).
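As a rough sketch of the idea, a session referencing a stored prompt might look like the dict below, passed to session.update in place of inline instructions (see the connection code later in this post). The prompt ID and variables are placeholders, and the exact field names assume the session's prompt object mirrors the Responses API:

session_with_stored_prompt = {
    "type": "realtime",
    "prompt": {
        "id": "pmpt_abc123",                # placeholder ID created in the playground
        "variables": {"tone": "friendly"},  # optional substitution variables
    },
}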

Ways to connect

The API offers four main ways to connect to it:

  • from a web browser via WebRTC
  • from a web browser via WebSockets
  • from a server via WebSockets
  • from the public phone network via the Session Initiation Protocol (SIP)

The OpenAI tutorials and example code tend to focus on WebRTC in a browser. If you are a web developer, this is likely to be the easiest method to get going with as there are fewer events to handle and WebRTC clients normally handle things like echo cancellation out of the box.

If, like me, you are coming from a Python backend development background and are familiar with OpenAI's other APIs, you are likely to find Python and WebSockets easier, even though there is less documentation. There is more work to do, though: you have to encode and decode the PCM audio yourself, and echo cancellation is not provided out of the box, so you will need to wear headphones or implement a mechanism to prevent the model interrupting itself. A big plus, however, is that this method allows for easier integration with any existing server-side functionality and tools you have already developed. For these reasons, it is the approach I chose.

Connecting to the new API

The new API works similarly to the beta version. There is the option to work directly with the Python WebSocket library and raw JSON objects, or to make use of the OpenAI Python SDK. I opted to use the SDK for improved type safety and IntelliSense, and to avoid working with the WebSocket connection directly.

You can connect as follows:

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

basic_session = RealtimeSessionCreateRequestParam(
    type="realtime",
    tracing=None,
    audio=RealtimeAudioConfigParam(
        output=RealtimeAudioConfigOutputParam(voice="marin")
    ),
    instructions="Help the user with their projects. Speak in ENGLISH only. Be extra nice!",
)


@asynccontextmanager
async def connect_to_realtime_api(
    session: RealtimeSessionCreateRequestParam,
) -> AsyncGenerator[AsyncRealtimeConnection, None]:
    async with AsyncOpenAI(api_key=OPENAI_API_KEY).realtime.connect(
            model="gpt-realtime"
    ) as connection:
        logger.info("Connected to OpenAI Realtime API")
        await connection.session.update(
            session=session
        )
        logger.info("Session updated")
        yield connection


async def main():
    async with connect_to_realtime_api(basic_session) as connection:
        async for event in connection:
            logger.info("Received event: %s", event.type)

Run it with:

git clone https://github.com/altosi/gpt-realtime-starter
cd gpt-realtime-starter
uv sync
OPENAI_API_KEY=sk-... uv run python connect.py

You should get something similar to:

INFO:__main__:Connected to OpenAI Realtime API
INFO:__main__:Session updated
INFO:__main__:Received event: session.created
INFO:__main__:Received event: session.updated

Streaming microphone audio to the model

Clearly, to get the model to respond, we need to stream some audio to it. I used sounddevice, a wrapper over PortAudio that supports NumPy arrays, with a separate input stream using the "non-blocking" callback interface and a queue to manage microphone data:

SAMPLE_RATE: int = 24000
CHANNELS: int = 1
FORMAT = np.int16

record_queue = queue.Queue()


def audio_callback(indata, frames, time, status):
    record_queue.put_nowait(indata.copy())


async def recording_task(connection: AsyncRealtimeConnection):
    while True:
        chunk = await asyncio.to_thread(record_queue.get)

        await connection.send(
            InputAudioBufferAppendEvent(
                type="input_audio_buffer.append",
                audio=b64encode(chunk.tobytes()).decode("ascii"),
            )
        )


input_stream = InputStream(
    channels=CHANNELS,
    samplerate=SAMPLE_RATE,
    dtype=FORMAT,
    callback=audio_callback,
)


The callback is invoked for each small chunk of incoming audio, supplied in the indata buffer. The code above simply puts a copy of that chunk on the record_queue.

The recording_task consumes from the record_queue, base64-encodes each chunk's raw PCM bytes into an ASCII string and sends it to the realtime model as an input_audio_buffer.append event.

We can set this off and start streaming with:

@asynccontextmanager
async def connect_and_start_recording(
    session: RealtimeSessionCreateRequestParam,
) -> AsyncGenerator[AsyncRealtimeConnection, None]:
    async with connect_to_realtime_api(session) as connection:
        asyncio.create_task(recording_task(connection))
        input_stream.start()

        yield connection


async def main() -> None:
    async with connect_and_start_recording(basic_session) as connection:
        async for event in connection:
            match event:
                case ResponseAudioTranscriptDoneEvent():
                    logger.info(f"Assistant: {event.transcript}")
                case _:
                    pass


Run it with:

OPENAI_API_KEY=sk-... uv run python connect_and_record.py

Asking the model 'What can you do?', we can see it responding with a text transcript, but no audio yet:

% python connect_and_record.py
INFO:connect:Connected to OpenAI Realtime API
INFO:connect:Session updated
INFO:__main__:Assistant: Hi there! I'm here to help you with all sorts of projects. I can assist with research, brainstorming ideas, writing content, organizing tasks, and even analyzing visual information. I can also answer questions, provide explanations, and even walk you through step-by-step instructions. Just let me know what you're working on, and I'll do my best to be super helpful and friendly!

Playing the model’s output

gpt-realtime sends audio back as chunks of PCM data, base64-encoded as ASCII, in ResponseAudioDeltaEvent events.

In the code below, I'm creating a separate output stream, a queue to manage playback data and an asynchronous task that takes audio data from the queue and sends it to the output stream.


play_queue: asyncio.Queue[np.ndarray] = asyncio.Queue()

output_stream = OutputStream(
    channels=CHANNELS,
    samplerate=SAMPLE_RATE,
    dtype=FORMAT,
)


async def playback_task() -> None:
    while True:
        data = await play_queue.get()
        try:
            await asyncio.to_thread(output_stream.write, data)
        finally:
            play_queue.task_done()


@asynccontextmanager
async def connect_and_start_recording_and_playback(
    session: RealtimeSessionCreateRequestParam,
) -> AsyncGenerator[AsyncRealtimeConnection, None]:
    async with connect_and_start_recording(session) as connection:
        asyncio.create_task(playback_task())
        output_stream.start()

        yield connection

We can wait for delta events from OpenAI, decode them and send them to the play_queue as follows:

async def handle_audio_delta(delta: str) -> None:
    data = np.frombuffer(base64.b64decode(delta), dtype=FORMAT)

    await play_queue.put(data)


async def main() -> None:
    async with connect_and_start_recording_and_playback(basic_session) as connection:
        async for event in connection:
            if "delta" not in event.type:
                logger.info("Received a %s event", event.type)

            match event:
                case ResponseAudioDeltaEvent():
                    await handle_audio_delta(event.delta)
                case _:
                    pass


Run it with:

OPENAI_API_KEY=sk-... uv run python connect_record_and_playback.py

Now put some headphones on, fire it up and try talking to the model! The voice configured above is the new 'Marin' voice, but there are several others to experiment with, including 'Cedar', the other voice introduced alongside gpt-realtime.
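If you would rather not wear headphones, one crude alternative (a sketch only, not real echo cancellation) is to gate the microphone while the model's audio is playing, reusing the queues and streams defined above:

import threading

model_is_speaking = threading.Event()


def gated_audio_callback(indata, frames, time, status):
    # Drop microphone chunks while playback is in progress so the model
    # does not hear (and respond to) its own voice.
    if model_is_speaking.is_set():
        return
    record_queue.put_nowait(indata.copy())


async def gated_playback_task() -> None:
    while True:
        data = await play_queue.get()
        model_is_speaking.set()
        try:
            await asyncio.to_thread(output_stream.write, data)
        finally:
            play_queue.task_done()
            if play_queue.empty():
                model_is_speaking.clear()

You would pass gated_audio_callback to the InputStream and run gated_playback_task in place of the originals. The trade-off is that you can no longer interrupt the model mid-sentence, which is why headphones remain the simpler option for this tutorial.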

Using a remote MCP server

We will use the GitHub remote MCP server for this.

First, obtain a GitHub PAT (personal access token) from Developer settings in GitHub. I'd recommend assigning read-only permissions. We'll store this in an environment variable, GITHUB_PAT.

To make our voice agent aware of this new functionality, we'll pass it the MCP server details:

GITHUB_PAT = os.environ.get("GITHUB_PAT")

tools_always_allowed = [
    "get_commit",
    "get_issue",
    "get_issue_comments",
    "get_latest_release",
    "get_pull_request",
    "get_release_by_tag",
    "get_tag",
    "list_branches",
    "list_commits",
    "list_discussion_categories",
    "list_discussions",
    "list_issue_types",
    "list_issues",
    "list_pull_requests",
    "list_releases",
    "list_tags",
    "search_code",
    "search_issues",
    "search_orgs",
    "search_pull_requests",
    "search_repositories",
    "search_users",
]

# Tools that may only be used with explicit approval. This list is empty for
# now; it is populated in the approvals section later in this post.
tools_allowed_with_approval: list[str] = []

session_with_mcp = RealtimeSessionCreateRequestParam(
    type="realtime",
    tracing=None,
    audio=RealtimeAudioConfigParam(
        output=RealtimeAudioConfigOutputParam(voice="marin")
    ),
    tools=[
        Mcp(
            type="mcp",
            server_label="github_mcp",
            server_url="https://api.githubcopilot.com/mcp/",
            authorization=GITHUB_PAT,
            allowed_tools=McpAllowedToolsMcpToolFilter(
                tool_names=tools_always_allowed + tools_allowed_with_approval
            ),
            require_approval=McpRequireApprovalMcpToolApprovalFilter(
                never=McpRequireApprovalMcpToolApprovalFilterNever(
                    tool_names=tools_always_allowed
                ),
                always=McpRequireApprovalMcpToolApprovalFilterAlways(
                    tool_names=[]
                ),
            ),
        )
    ],
    instructions="Help the user with Github. Speak in ENGLISH only. Be extra nice!",
)

The most important parameters are server_url and authorization. The allowed_tools and require_approval fields are an attempt to restrict the model to read-only tools and to never require approval for those tools. I'm not convinced the model adheres to this completely, but restricting the PAT to read-only permissions provides an extra layer of safety.

As an alternative to enumerating the tools, there is supposed to be the option to provide a read_only=True filter so that the API matches any tool carrying a read_only annotation. This didn't appear to work at the time of writing; it returned a server error of Unknown parameter: 'session.tools[0].require_approval.never.read_only'. Another parameter that didn't work was server_description, which is supposed to give the model additional guidance on when to use a specific MCP server.

To get better visibility into when the model is calling these tools, I've added additional event handling for the ResponseDone and ConversationItemDone events sent by the server. At the same time, I've split event handling into two dedicated handlers: one for server events in general and another for ConversationItemDone events specifically:

async def handle_conversation_item_done(
    item: ConversationItem,
) -> list[RealtimeClientEvent]:
    match item:
        case RealtimeMcpToolCall():
            logger.info(
                "MCP tool call %s(%s) completed with output: %s",
                item.name,
                item.arguments,
                item.output,
            )
            return [ResponseCreateEvent(type="response.create")]
        case _:
            logger.info("Conversation item of type %s received", item.type)
            return []


async def handle_server_event(event: RealtimeServerEvent) -> list[RealtimeClientEvent]:
    if "delta" not in event.type:
        logger.info("Received a %s event", event.type)

    match event:
        case ResponseAudioDeltaEvent():
            await handle_audio_delta(event.delta)
        case ResponseDoneEvent():
            for item in event.response.output:
                match item:
                    case RealtimeMcpToolCall():
                        logger.info("MCP tool call %s(%s)", item.name, item.arguments)
        case ConversationItemDone():
            return await handle_conversation_item_done(event.item)
        case ResponseAudioTranscriptDoneEvent():
            logger.info("Assistant: %s", event.transcript)
        case RealtimeErrorEvent():
            logger.error("Error: %s", event.error.message)
    return []


async def main() -> None:
    try:
        async with connect_and_start_recording_and_playback(
            session_with_mcp
        ) as connection:
            async for server_event in connection:
                for client_event in await handle_server_event(server_event):
                    await connection.send(client_event)
    except KeyboardInterrupt:
        logger.info("Shutting down...")

As you can see above, I'm also having to respond to a ConversationItemDone containing a RealtimeMcpToolCall item with a ResponseCreateEvent. This forces the model to respond following a tool call. Without it, the model would stay silent and you would have to ask it explicitly what the outcome was.

Run it with:

OPENAI_API_KEY=sk-... GITHUB_PAT=ghp_... uv run python connect_and_use_a_tool.py

We can now run the model and ask a question such as "What is the latest release tag for the openai-python repository?". The model will respond with something like:

INFO:__main__:Assistant: The latest release tag for the "openai-python" repository is "v2.1.0". This was published on October 2, 2025, and includes features like support for real-time calls. You can check out the release on GitHub. Let me know if you'd like help with anything else!

Try asking the agent to describe the latest changes in a release, find the latest issue, search for things in code and more.

Handling MCP tool approvals

If we want to allow the agent to perform state-changing actions, I would recommend requiring it to ask for permission first, to avoid irreversible consequences. In the next example, I'll allow the model to use the create_repository tool from the GitHub remote MCP server, but require it to ask permission first.

If you want to proceed with this, you will have to give your GitHub PAT the ability to create repositories. As this generally comes bundled with wider permissions, and given the above-mentioned question mark over whether the API/model restricts itself to the allowed_tools list, be very careful: ensure all MCP tool calls are logged and enable this at your own risk! I would suggest doing it on a strictly time-limited basis and revoking write permissions as soon as you have finished testing.

We'll now need to check specifically for conversation items of type RealtimeMcpApprovalRequest and ask the user for confirmation in a separate thread to avoid blocking the audio streams. If the user enters 'y', we'll send back a ConversationItemCreateEvent containing a RealtimeMcpApprovalResponse item with approve=True, a unique ID and a reference to the original approval request.

This is achieved by adding an additional case into the handle_conversation_item_done function:


        case RealtimeMcpApprovalRequest():
            prompt = f"Agent requests: {item.name}({item.arguments}). Approve? [y/N]: "
            try:
                approval_response = await asyncio.wait_for(
                    asyncio.to_thread(input, prompt), timeout=30
                )
            except asyncio.TimeoutError:
                approval_response = "n"  # default deny

            return [
                ConversationItemCreateEvent(
                    type="conversation.item.create",
                    item=RealtimeMcpApprovalResponse(
                        id=f"mcp_rsp_{generate_id()}",
                        type="mcp_approval_response",
                        approve=(approval_response.lower() == "y"),
                        approval_request_id=item.id,
                    ),
                )
            ]

Run it with:

OPENAI_API_KEY=sk-... GITHUB_PAT=ghp_... uv run python connect_and_use_a_tool_with_approval.py

Try asking the agent "Create a new private repository called test-repo."

Conclusion

You now have a working Python voice agent using gpt‑realtime: stream audio, hear replies, and call GitHub via MCP with approvals.

The examples have been kept simple for tutorial purposes; for production use, further work would be needed to bolster security and resilience - see the GitHub README for further details.

I hope you've enjoyed this and it has helped you get up and running with the realtime API. If you liked it, please let me know in the comments.

Next steps (which I may cover in later posts):

  • interrupting the model
  • implementing custom tool calls
  • linking up with existing AI apps using sub-agents
  • re-usable prompts
  • image inputs
  • integrating with a browser via WebRTC
