Todd Sharp for AWS

Virtual Assistants with Amazon IVS and OpenAI's Realtime API

We recently explored building a virtual agent that seamlessly joins your Amazon Interactive Video Service (Amazon IVS) real-time streams as a conversational assistant with vision capabilities. While Amazon Nova Sonic S2S delivers solid performance and integrates beautifully with the AWS ecosystem, you might be curious about harnessing OpenAI's cutting-edge real-time model for your agents. This post walks you through a demo integration that makes this possible.

Before we dive into the technical details, here's a glimpse of what a conversation looks like with OpenAI's real-time model working behind the scenes.

The results speak for themselves: crisp, natural voice quality, lightning-fast responses, and visual analysis that accurately identifies what it's looking at. It's genuinely impressive how well everything comes together.

❓ How Is This Different?

This integration leverages the Python aiortc library to tap into the remote participant's stream, channeling their audio directly into the gpt-realtime model. Meanwhile, the agent establishes itself as a publisher on the Amazon IVS stage. When gpt-realtime generates its response, that audio flows right back into the agent's published feed. Need vision? The agent captures the current frame from the remote participant and instantly analyzes what's happening. It's elegantly straightforward - the gpt-realtime model essentially becomes another participant that both publishes and subscribes to the stage.

(Architecture diagram: gpt-realtime virtual agent)
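
To make the audio path concrete, here's a minimal, hypothetical sketch of the subscribe side, assuming you already have an aiortc audio track for the remote participant. The function name and exact wiring are illustrative rather than lifted from the sample repo:

import base64
import json

import websockets
from av import AudioResampler


async def pipe_audio_to_gpt_realtime(track, openai_api_key):
    # gpt-realtime expects 24 kHz mono PCM16, so resample each WebRTC frame
    resampler = AudioResampler(format="s16", layout="mono", rate=24000)
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {openai_api_key}"}
    # note: older releases of the websockets library call this `extra_headers`
    async with websockets.connect(url, additional_headers=headers) as ws:
        while True:
            frame = await track.recv()  # an av.AudioFrame from aiortc
            for chunk in resampler.resample(frame):
                pcm = chunk.to_ndarray().tobytes()
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm).decode("ascii"),
                }))

The publish side is the mirror image: decode the base64 audio deltas that the model streams back and write them into the audio track the agent publishes to the stage. Vision works along the same lines - grab a frame from the remote participant's video track and send it to the model as an image content part.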

🤖 Try It Out!

Ready to get your hands dirty? Start by spinning up a new Amazon IVS stage and generating participant tokens for the agent and your local participant (refer to the user guide for more info if you're new to IVS).

aws ivs-realtime create-stage \
  --name "my-stage" \
  --region us-east-1 \
  --participant-token-configurations '[
    {
      "duration": 720,
      "attributes": {
        "username": "gpt-realtime-agent"
      }
    },
    {
      "duration": 720,
      "attributes": {
        "username": "local-participant"
      }
    }
  ]' \
  --query '{stageArn: stage.arn, gptParticipant: {participantId: participantTokens[0].participantId, token: participantTokens[0].token}, localParticipant: {participantId: participantTokens[1].participantId, token: participantTokens[1].token}}' \
  --no-cli-pager

This will produce output similar to the following:

{
    "stageArn": "arn:aws:ivs:us-east-1:639934345351:stage/abcdef123456",
    "gptParticipant": {
        "participantId": "6APQqRu2XnqK",
        "token": "eyJhbGciOiJLTVMiLCJ0eXAiOiJKV1QifQ...."
    },
    "localParticipant": {
        "participantId": "m243ru5A2idL",
        "token": "eyJhbGciOiJLTVMiLCJ0eXAiOiJKV1QifQ...."
    }
}

No client app yet? No problem - you can publish directly from the AWS Management Console. If you go that route, grab your participant ID from the 'Stage sessions' section by clicking the active session ID.

You can also try things out with this CodePen. Paste the localParticipant token into the Token input box and click 'Join' to publish to your new stage.

Next, clone the sample repo and navigate to the stages-gpt-realtime directory. The README there walks you through the setup process. There's even a simple management script that you can use to launch agent instances via WebSocket messages from the frontend (there's a sketch of that idea below). When you're ready to roll, fire up the agent script with your credentials:

python ivs-stage-gpt-realtime.py \
  --token <GPT_PARTICIPANT_TOKEN> \
  --subscribe-to <LOCAL_PARTICIPANT_ID> \
  --openai-key <OPENAI_API_KEY>

Your agent will hop onto the stage, subscribe to your local participant, and you're off to the races with real-time AI conversations!

(Demo: the real-time virtual assistant in action)
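
About that management script: the version in the repo does more, but the core idea - spin up an agent process when a WebSocket message arrives from the frontend - could look something like this sketch. The message shape and port here are assumptions, not the repo's actual protocol:

import asyncio
import json
import os

from websockets.asyncio.server import serve


async def handle_launch(websocket):
    # hypothetical message shape: {"token": "...", "subscribeTo": "..."}
    async for message in websocket:
        params = json.loads(message)
        await asyncio.create_subprocess_exec(
            "python", "ivs-stage-gpt-realtime.py",
            "--token", params["token"],
            "--subscribe-to", params["subscribeTo"],
            "--openai-key", os.environ["OPENAI_API_KEY"],  # keep the key server-side
        )
        await websocket.send(json.dumps({"status": "launched"}))


async def main():
    async with serve(handle_launch, "localhost", 8765):
        await asyncio.get_running_loop().create_future()  # run forever


asyncio.run(main())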

The script publishes a simplified video track with an audio visualization, which you can override for your use case. I personally like to create a client-side visualization, which tends to be a bit cleaner and more responsive (as you can see in the video above).
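
If you do want to render your own server-side visuals, a replacement video track is just an aiortc VideoStreamTrack subclass. This bare-bones sketch publishes solid gray frames where your own drawing code would go:

import numpy as np
from aiortc import VideoStreamTrack
from av import VideoFrame


class VisualizationTrack(VideoStreamTrack):
    """A 640x360 video track - replace the frame contents with your own visuals."""

    async def recv(self):
        pts, time_base = await self.next_timestamp()
        # draw your visualization here; this just fills the frame with dark gray
        img = np.full((360, 640, 3), 32, dtype=np.uint8)
        frame = VideoFrame.from_ndarray(img, format="rgb24")
        frame.pts = pts
        frame.time_base = time_base
        return frame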

👏 Extending The Solution

This sample gives you the foundation to start building something amazing. Dig into the repo code and OpenAI's documentation to discover how you can supercharge the solution - add custom tools for function calling, or transform the gpt-realtime model into a smart transcription service for your stages. Drop a comment and let me know what creative solutions you'll build with Amazon IVS and gpt-realtime!
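
As a starting point for function calling, you advertise tools to the Realtime API in a session.update event, then return results when the model emits a function call. The get_stage_stats tool below is made up for illustration; both payloads get sent over the same Realtime WebSocket:

import json

# advertise a tool to the model
session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "get_stage_stats",
            "description": "Returns the current number of participants on the stage.",
            "parameters": {"type": "object", "properties": {}},
        }],
        "tool_choice": "auto",
    },
}

# when the model emits a function_call item, send the result back,
# then send a response.create event so the model continues speaking
tool_result = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": "<call_id from the model's function_call item>",
        "output": json.dumps({"participants": 2}),
    },
}

From there, the model folds the tool output into its next spoken response.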
