Hassann

Posted on • Originally published at apidog.com

What Is GPT-Realtime-2 and How to Use the GPT-Realtime-2 API

OpenAI shipped GPT-Realtime-2 on November 6, 2026. It is a speech-to-speech model with GPT-5-class reasoning, a 128,000-token context window, and configurable reasoning effort so you can trade latency for answer quality. If you already use gpt-realtime, migration mostly means changing the model string and adding a few optional session/tool fields.

This guide shows what changed, how pricing works, and how to call GPT-Realtime-2 over WebSocket and SIP. It also includes a practical setup in Apidog so you can replay Realtime sessions without re-recording audio for every test.

For context on OpenAI’s broader 2026 model line, see What is GPT-5.5. For the multimodal sibling, see How to use the GPT-Image-2 API.

TL;DR

  • Model ID: gpt-realtime-2
  • Context window: 128k tokens
  • Max output: 32k tokens
  • Input modalities: text, audio, image
  • Output modalities: text, audio
  • Audio pricing: $32 / 1M input tokens, $64 / 1M output tokens
  • Cached audio input: $0.40 / 1M tokens
  • New Realtime-only voices: Cedar and Marin
  • Reasoning levels: minimal, low, medium, high, xhigh
  • Default reasoning level: low
  • WebSocket endpoint: wss://api.openai.com/v1/realtime?model=gpt-realtime-2
  • SIP sessions use: wss://api.openai.com/v1/realtime?call_id={call_id}
  • Companion models:
    • GPT-Realtime-Translate: live translation, 70 input languages, $0.034/min
    • GPT-Realtime-Whisper: streaming speech-to-text, $0.017/min
  • Use Apidog to script WebSocket sessions, capture frames, and compare event output between runs.

What is GPT-Realtime-2?

GPT-Realtime-2 is a single speech-to-speech model. You stream audio in, receive audio out, and the model handles transcription, reasoning, tool selection, and voice generation in one pass.

That means you do not need to build a separate STT → LLM → TTS pipeline. The model runs on the existing Realtime API surface and improves the previous gpt-realtime flow with stronger reasoning and larger context.

The model accepts text, audio, and images as input, then emits text and audio as output. Image input is new for this model. You can add a screenshot or photo to a live conversation, ask a question by voice, and get a spoken answer.

That enables agents such as:

  • Voice support copilots that can inspect user screenshots
  • Field-support agents that reason over photos
  • Accessibility assistants that describe what is on screen

Specs:

| Attribute | Value |
| --- | --- |
| Model ID | gpt-realtime-2 |
| Context window | 128,000 tokens |
| Max output | 32,000 tokens |
| Modalities in | text, audio, image |
| Modalities out | text, audio |
| Knowledge cutoff | 2024-09-30 |
| Reasoning levels | minimal, low, medium, high, xhigh |
| Function calling | yes |
| Remote MCP servers | yes |
| Image input | yes |
| SIP phone calling | yes |

What changed from gpt-realtime

Compared with gpt-realtime, GPT-Realtime-2 improves benchmark performance:

  • Big Bench Audio: 81.4% → 96.6%
  • Audio MultiChallenge: 34.7% → 48.5%

Those scores used high and xhigh reasoning. In production, the default is low to reduce latency, so you should benchmark your own workload before increasing reasoning effort.

Key behavior changes:

  • Preambles: The model can say short filler phrases like “let me check that” while it reasons.
  • Parallel tool calls with narration: The model can call multiple tools and describe progress instead of going silent.
  • Better recovery: Ambiguous or partially failed turns are handled more gracefully.
  • Domain tone control: The model can keep specialized terminology consistent and adapt delivery style during a session.

The context window also increased from 32k to 128k tokens. That matters for long-running voice sessions such as support calls, banking workflows, and tutoring sessions.

Pricing

GPT-Realtime-2 is billed per token, with separate rates for text, audio, and image input.

| Token type | Input | Cached input | Output |
| --- | --- | --- | --- |
| Text | $4.00 / 1M | $0.40 / 1M | $24.00 / 1M |
| Audio | $32.00 / 1M | $0.40 / 1M | $64.00 / 1M |
| Image | $5.00 / 1M | $0.50 / 1M | n/a |
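
For a rough sense of what these rates mean per call, here is a back-of-the-envelope sketch. It assumes the ~50 audio tokens per second figure from the FAQ below and that the agent speaks for about half the call; treat the numbers as illustrative, not billing guidance.

// Rough cost sketch for a 10-minute voice call.
// Assumes ~50 audio tokens per second (see FAQ) and the audio rates above.
const seconds = 10 * 60;
const tokensPerSecond = 50;                           // assumption from the FAQ
const inputTokens = seconds * tokensPerSecond;        // 30,000 tokens
const outputTokens = seconds * tokensPerSecond * 0.5; // assume the agent speaks half the time

const inputCost = (inputTokens / 1_000_000) * 32;     // audio input: $32 / 1M
const outputCost = (outputTokens / 1_000_000) * 64;   // audio output: $64 / 1M

console.log({ inputCost, outputCost, total: inputCost + outputCost });
// ≈ $0.96 input + $0.96 output ≈ $1.92 for the 10-minute call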

Cached input reduces repeated-context cost significantly. If your agent uses a stable system prompt, policy document, or repeated instructions, keep that context cacheable.

For comparison with the rest of the OpenAI line, see GPT-5.5 pricing.

Companion model pricing:

  • GPT-Realtime-Translate: $0.034/min. Supports 70 input languages and 13 output languages, with 12.5% lower Word Error Rate than any other model tested in Hindi, Tamil, and Telugu.
  • GPT-Realtime-Whisper: $0.017/min. Streaming speech-to-text for live captions and continuous transcription.

Use:

  • GPT-Realtime-2 when you need reasoning and voice generation together.
  • GPT-Realtime-Translate for live multilingual interpretation.
  • GPT-Realtime-Whisper when you only need a transcript.

Endpoints and authentication

Available endpoints:

POST https://api.openai.com/v1/chat/completions
POST https://api.openai.com/v1/responses
WSS  wss://api.openai.com/v1/realtime?model=gpt-realtime-2
WSS  wss://api.openai.com/v1/realtime?call_id={call_id}
POST https://api.openai.com/v1/realtime/translations
POST https://api.openai.com/v1/realtime/transcription_sessions

For voice agents, use the WebSocket endpoint:

wss://api.openai.com/v1/realtime?model=gpt-realtime-2

Required headers:

Authorization: Bearer $OPENAI_API_KEY
OpenAI-Beta: realtime=v1

Set your API key:

export OPENAI_API_KEY="sk-proj-..."

Connect over WebSocket

Install the WebSocket client:

npm install ws

Create a minimal Node.js client:

import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "cedar",
      instructions: "You are a friendly support agent for a fintech app.",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: { type: "server_vad" },
      reasoning: { effort: "low" },
    },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());

  if (event.type === "response.audio.delta") {
    // base64 PCM16 audio chunk
    // Pipe this to a speaker, browser AudioWorklet, or media stream.
    process.stdout.write(Buffer.from(event.delta, "base64"));
  }
});

The session is event-driven:

  1. Send a session.update event to configure the voice, audio format, VAD, tools, and reasoning effort.
  2. Send input_audio_buffer.append events while the user speaks.
  3. Receive response.audio.delta events as the model speaks.
  4. Handle tool-call events if the model requests external data.

PCM16 at 24 kHz is a safe default. G.711 mu-law and A-law are also supported, which is useful for phone-system integrations.
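
For step 2 above, here is a minimal sketch of streaming captured audio. It assumes the append event carries base64-encoded PCM16 in an audio field (the field name is an assumption; only the event names appear in this guide), and that you commit the buffer when you are not relying on server VAD.

// pcmChunk is a Buffer of raw PCM16 audio at 24 kHz from your capture pipeline.
function sendAudioChunk(ws, pcmChunk) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcmChunk.toString("base64"), // assumed field name for the audio payload
  }));
}

// With server VAD enabled the turn ends automatically; otherwise commit the
// buffer and request a response explicitly.
function endUserTurn(ws) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}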

For Python, the openai SDK >= 2.1.0 exposes a realtime client with the same event names. To compare the Realtime API with the Responses API, see How to use the GPT-5.5 API.

Voices

GPT-Realtime-2 adds two Realtime-only voices:

  • Cedar: warm, mid-range male voice. Suitable as a default general-agent voice.
  • Marin: bright, clear female voice. Useful for translation and announcements.

The previous eight voices are still available:

alloy
ash
ballad
coral
echo
sage
shimmer
verse

They were also retuned for the new audio stack.

To switch voices mid-session, send another session.update:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    voice: "marin",
  },
}));

Add image input to a voice turn

You can attach an image to a user turn and then ask a question about it:

ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_image",
        image_url: "https://example.com/screenshot.png",
      },
      {
        type: "input_text",
        text: "What does this error mean?",
      },
    ],
  },
}));

ws.send(JSON.stringify({ type: "response.create" }));

Useful implementation patterns:

  • Voice-driven QA: A tester points a camera at a broken UI and the agent dictates a bug report.
  • Field support: A technician shares a wiring-panel photo and the agent walks through diagnostics.
  • Accessibility: The agent describes a user’s current screen during a support call.

For more on OpenAI’s image stack, see How to use the GPT-Image-2 API.

Function calling and MCP

GPT-Realtime-2 supports standard function tools and remote MCP servers in the same session.

Standard function calling

The flow is similar to Chat Completions:

  1. Declare tools in the session config.
  2. The model emits response.function_call_arguments.delta.
  3. Your app executes the function.
  4. Your app sends a conversation.item.create event with function_call_output.

The important change is parallel calling. The model can trigger multiple calls at once and narrate progress while waiting for results.

Example session update:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "lookup_account",
        description: "Look up a customer account by ID.",
        parameters: {
          type: "object",
          properties: {
            account_id: { type: "string" },
          },
          required: ["account_id"],
        },
      },
      {
        type: "function",
        name: "list_transactions",
        description: "List recent transactions for an account.",
        parameters: {
          type: "object",
          properties: {
            account_id: { type: "string" },
            limit: { type: "number" },
          },
          required: ["account_id"],
        },
      },
    ],
  },
}));
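
Once the arguments for a call have streamed in, your app runs the function and returns the result (steps 3 and 4 above). A minimal sketch, assuming a response.function_call_arguments.done event that carries call_id, name, and the full arguments JSON; those field names, and the lookupAccount/listTransactions helpers, are assumptions for illustration, so check the event payloads in your own session logs.

ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());

  // Assumed event shape: fires once the arguments for one call are complete.
  if (event.type === "response.function_call_arguments.done") {
    const args = JSON.parse(event.arguments);

    // Route to your own implementations of the declared tools (hypothetical helpers).
    const result = event.name === "lookup_account"
      ? await lookupAccount(args.account_id)
      : await listTransactions(args.account_id, args.limit);

    // Return the output, then let the model continue the spoken turn.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});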

Remote MCP servers

Remote MCP support lets the Realtime API call tools from an MCP server directly. Configure the MCP URL and allowed tools in the session:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [
      {
        type: "mcp",
        server_url: "https://mcp.example.com/sse",
        allowed_tools: [
          "lookup_account",
          "list_transactions",
        ],
      },
    ],
  },
}));

This is useful when your voice agent needs access to a larger tool catalog without manually routing every function call through your WebSocket loop.

If you are testing MCP servers before wiring them into a voice agent, see MCP server testing in Apidog.

SIP phone calling

GPT-Realtime-2 can handle real phone calls through SIP.

At a high level:

  1. Point your SIP trunk at OpenAI’s SIP gateway.
  2. An inbound call opens a Realtime WebSocket session.
  3. Your app connects using the call ID: wss://api.openai.com/v1/realtime?call_id={call_id}

The model accepts G.711 mu-law and A-law directly, so your bridge does not need to transcode audio before sending it to the Realtime API.

This makes GPT-Realtime-2 suitable for call-center-style agents where most turns involve listening, calling tools, and responding by voice.
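
Here is a minimal sketch of step 3, reusing the WebSocket client from earlier. How the call_id reaches your application depends on your SIP integration (a webhook or event from your trunk provider); here it is simply assumed to be available, and the g711_ulaw format string is likewise an assumption.

import WebSocket from "ws";

// callId is assumed to arrive from your SIP integration when a call comes in.
function attachToCall(callId) {
  const ws = new WebSocket(
    `wss://api.openai.com/v1/realtime?call_id=${callId}`,
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  ws.on("open", () => {
    // Same session.update shape as the non-SIP example; G.711 audio needs no transcoding.
    ws.send(JSON.stringify({
      type: "session.update",
      session: {
        voice: "cedar",
        instructions: "You are a phone support agent.",
        input_audio_format: "g711_ulaw",  // assumed format identifier for mu-law
        output_audio_format: "g711_ulaw",
        turn_detection: { type: "server_vad" },
      },
    }));
  });

  return ws;
}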

Configure reasoning effort

Reasoning effort controls the latency/quality tradeoff.

| Level | Use case | Approx. latency cost |
| --- | --- | --- |
| minimal | Single-turn yes/no answers | none |
| low | Default; everyday support and chat | small |
| medium | Disambiguation, complex tool dispatch | moderate |
| high | Multi-step reasoning, code review by voice | high |
| xhigh | Benchmarks, hard analytical questions | highest |

Default to low:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    reasoning: {
      effort: "low",
    },
  },
}));

Move to medium, high, or xhigh only when you can measure a quality gap. The latency cost is noticeable in live calls.

Test the Realtime API in Apidog

WebSocket APIs are difficult to debug from the terminal because every connection has state. Apidog gives you a repeatable way to test the same Realtime session.

A practical test workflow:

  1. Create a new WebSocket request.
  2. Use this URL: wss://api.openai.com/v1/realtime?model=gpt-realtime-2
  3. Add headers: Authorization: Bearer {{OPENAI_API_KEY}} and OpenAI-Beta: realtime=v1
  4. Save a session.update message.
  5. Add scripted messages such as:
    • input_audio_buffer.append
    • input_audio_buffer.commit
    • response.create
  6. Replay the script against one connection.
  7. Capture all server events.
  8. Diff runs when changing voice, reasoning effort, or tool configuration.

Download Apidog, create a WebSocket request, and store your bearer token under Auth or an environment variable.

For comparison with another fast multimodal model, see How to use the Gemini 3 Flash Preview API.

FAQ

What model ID should I use?

Use gpt-realtime-2. The earlier model is still available as gpt-realtime, and the lite version is gpt-realtime-2-mini.

Can I stream input audio while output audio is still playing?

Yes. The Realtime API uses server-side voice activity detection by default, so the model can stop speaking when the user starts. You can also disable VAD and manage turn boundaries from the client.
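
If you manage turn boundaries from the client, a minimal sketch of disabling server VAD, assuming turn_detection accepts null as the "off" value (check the session schema for the exact setting):

// Disable server VAD so the client decides when each turn ends.
// Assumption: setting turn_detection to null turns off server-side detection.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: null,
  },
}));

From there, end each turn with input_audio_buffer.commit followed by response.create, as in the streaming sketch earlier.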

Does the 128k context include audio tokens?

Yes. Audio is tokenized. One second of audio is roughly 50 tokens depending on format. Long calls can consume context faster than long text chats, so inspect usage before assuming the full 128k window is enough.

Is fine-tuning supported?

Not yet. Per the model card, GPT-Realtime-2 does not support fine-tuning, predicted outputs, or text streaming on Chat Completions. The Realtime endpoint streams audio inherently.

How does GPT-Realtime-2 compare to GPT-5.5 plus TTS?

GPT-Realtime-2 performs end-to-end speech reasoning. A voice-aware model can respond to tone, hesitation, and emphasis. A text model with TTS cannot use those audio cues in the same way.

For pure text reasoning, see How to use the GPT-5.5 API.

What rate limits apply?

Tier 1 starts at 40,000 tokens per minute and scales to 15M TPM at Tier 5. Rate limits are per model, so existing GPT-5 quota does not carry over.

Wrapping up

GPT-Realtime-2 gives you a single API surface for voice input, reasoning, tool use, image input, and spoken output. The main implementation path is:

  1. Start with the WebSocket endpoint.
  2. Configure session.update.
  3. Use low reasoning by default.
  4. Add tools only after the basic audio loop works.
  5. Test repeated sessions in Apidog.
  6. Increase reasoning effort only when measured quality requires it.

The combination of 128k context, GPT-5-class reasoning, image input, MCP, and SIP support makes it practical to build voice agents that can answer calls, inspect screenshots, dispatch tools, and recover from failed turns without leaving the Realtime session.
