OpenAI shipped GPT-Realtime-2 on November 6, 2026. It is a speech-to-speech model with GPT-5-class reasoning, a 128,000-token context window, and configurable reasoning effort so you can trade latency for answer quality. If you already use gpt-realtime, migration mostly means changing the model string and adding a few optional session/tool fields.
This guide shows what changed, how pricing works, and how to call GPT-Realtime-2 over WebSocket and SIP. It also includes a practical setup in Apidog so you can replay Realtime sessions without re-recording audio for every test.
For context on OpenAI’s broader 2026 model line, see What is GPT-5.5. For the multimodal sibling, see How to use the GPT-Image-2 API.
TL;DR
- Model ID: `gpt-realtime-2`
- Context window: 128k tokens
- Max output: 32k tokens
- Input modalities: text, audio, image
- Output modalities: text, audio
- Audio pricing: $32 / 1M input tokens, $64 / 1M output tokens
- Cached audio input: $0.40 / 1M tokens
- New Realtime-only voices: Cedar and Marin
- Reasoning levels: `minimal`, `low`, `medium`, `high`, `xhigh`
- Default reasoning level: `low`
- WebSocket endpoint: `wss://api.openai.com/v1/realtime?model=gpt-realtime-2`
- SIP sessions use: `wss://api.openai.com/v1/realtime?call_id={call_id}`
- Companion models:
- GPT-Realtime-Translate: live translation, 70 input languages, $0.034/min
- GPT-Realtime-Whisper: streaming speech-to-text, $0.017/min
- Use Apidog to script WebSocket sessions, capture frames, and compare event output between runs.
What is GPT-Realtime-2?
GPT-Realtime-2 is a single speech-to-speech model. You stream audio in, receive audio out, and the model handles transcription, reasoning, tool selection, and voice generation in one pass.
That means you do not need to build a separate STT → LLM → TTS pipeline. The model runs on the existing Realtime API surface and improves the previous gpt-realtime flow with stronger reasoning and larger context.
The model accepts text, audio, and images as input, then emits text and audio as output. Image input is new for this model. You can add a screenshot or photo to a live conversation, ask a question by voice, and get a spoken answer.
That enables agents such as:
- Voice support copilots that can inspect user screenshots
- Field-support agents that reason over photos
- Accessibility assistants that describe what is on screen
Specs:
| Attribute | Value |
|---|---|
| Model ID | gpt-realtime-2 |
| Context window | 128,000 tokens |
| Max output | 32,000 tokens |
| Modalities in | text, audio, image |
| Modalities out | text, audio |
| Knowledge cutoff | 2024-09-30 |
| Reasoning levels | minimal, low, medium, high, xhigh |
| Function calling | yes |
| Remote MCP servers | yes |
| Image input | yes |
| SIP phone calling | yes |
What changed from gpt-realtime
Compared with gpt-realtime, GPT-Realtime-2 improves benchmark performance:
- Big Bench Audio: 81.4% → 96.6%
- Audio MultiChallenge: 34.7% → 48.5%
Those scores used high and xhigh reasoning. In production, the default is low to reduce latency, so you should benchmark your own workload before increasing reasoning effort.
Key behavior changes:
- Preambles: The model can say short filler phrases like “let me check that” while it reasons.
- Parallel tool calls with narration: The model can call multiple tools and describe progress instead of going silent.
- Better recovery: Ambiguous or partially failed turns are handled more gracefully.
- Domain tone control: The model can keep specialized terminology consistent and adapt delivery style during a session.
The context window also increased from 32k to 128k tokens. That matters for long-running voice sessions such as support calls, banking workflows, and tutoring sessions.
Pricing
GPT-Realtime-2 is billed per token, with separate rates for text, audio, and image input.
| Token type | Input | Cached input | Output |
|---|---|---|---|
| Text | $4.00 / 1M | $0.40 / 1M | $24.00 / 1M |
| Audio | $32.00 / 1M | $0.40 / 1M | $64.00 / 1M |
| Image | $5.00 / 1M | $0.50 / 1M | n/a |
Cached input reduces repeated-context cost significantly. If your agent uses a stable system prompt, policy document, or repeated instructions, keep that context cacheable.
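As a rough sanity check, you can estimate per-call cost straight from the table above. The helper below is an illustrative sketch only: the token counts in the example are made-up assumptions, so replace them with the usage numbers the API reports for your own sessions.

```javascript
// Illustrative cost estimate using the published per-1M-token rates.
// Token counts in the example call are placeholders, not real measurements.
const RATES = {
  audioIn: 32.0 / 1_000_000,   // $ per audio input token
  audioOut: 64.0 / 1_000_000,  // $ per audio output token
  cachedIn: 0.4 / 1_000_000,   // $ per cached input token
  textIn: 4.0 / 1_000_000,
  textOut: 24.0 / 1_000_000,
};

function estimateCallCost({ audioIn = 0, audioOut = 0, cachedIn = 0, textIn = 0, textOut = 0 }) {
  return (
    audioIn * RATES.audioIn +
    audioOut * RATES.audioOut +
    cachedIn * RATES.cachedIn +
    textIn * RATES.textIn +
    textOut * RATES.textOut
  );
}

// Example: a short support call with a cached system prompt.
console.log(estimateCallCost({ audioIn: 15_000, audioOut: 9_000, cachedIn: 4_000 }).toFixed(4));
// ≈ 1.0576 dollars
```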
For comparison with the rest of the OpenAI line, see GPT-5.5 pricing.
Companion model pricing:
- GPT-Realtime-Translate: $0.034/min. Supports 70 input languages and 13 output languages, with a word error rate roughly 12.5% lower than the other models tested on Hindi, Tamil, and Telugu.
- GPT-Realtime-Whisper: $0.017/min. Streaming speech-to-text for live captions and continuous transcription.
Use:
- GPT-Realtime-2 when you need reasoning and voice generation together.
- GPT-Realtime-Translate for live multilingual interpretation.
- GPT-Realtime-Whisper when you only need a transcript.
Endpoints and authentication
Available endpoints:
POST https://api.openai.com/v1/chat/completions
POST https://api.openai.com/v1/responses
WSS wss://api.openai.com/v1/realtime?model=gpt-realtime-2
WSS wss://api.openai.com/v1/realtime?call_id={call_id}
POST https://api.openai.com/v1/realtime/translations
POST https://api.openai.com/v1/realtime/transcription_sessions
For voice agents, use the WebSocket endpoint:
wss://api.openai.com/v1/realtime?model=gpt-realtime-2
Required headers:
Authorization: Bearer $OPENAI_API_KEY
OpenAI-Beta: realtime=v1
Set your API key:
export OPENAI_API_KEY="sk-proj-..."
Connect over WebSocket
Install the WebSocket client:
npm install ws
Create a minimal Node.js client:
import WebSocket from "ws";
const ws = new WebSocket(
"wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
{
headers: {
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1",
},
}
);
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
voice: "cedar",
instructions: "You are a friendly support agent for a fintech app.",
input_audio_format: "pcm16",
output_audio_format: "pcm16",
turn_detection: { type: "server_vad" },
reasoning: { effort: "low" },
},
}));
});
ws.on("message", (raw) => {
const event = JSON.parse(raw.toString());
if (event.type === "response.audio.delta") {
// base64 PCM16 audio chunk
// Pipe this to a speaker, browser AudioWorklet, or media stream.
process.stdout.write(Buffer.from(event.delta, "base64"));
}
});
The session is event-driven:
- Send a `session.update` event to configure the voice, audio format, VAD, tools, and reasoning effort.
- Send `input_audio_buffer.append` events while the user speaks.
- Receive `response.audio.delta` events as the model speaks.
- Handle tool-call events if the model requests external data.
PCM16 at 24 kHz is a safe default. G.711 mu-law and A-law are also supported, which is useful for phone-system integrations.
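To complete the loop described above, here is a minimal sketch of the input side. It reuses the `ws` connection from the earlier example and assumes you already have 24 kHz mono PCM16 audio in a Node `Buffer` (for example from a microphone library, which is not shown).

```javascript
// Append a chunk of user audio (24 kHz, mono, PCM16) to the input buffer.
// `pcmChunk` is assumed to be a Node Buffer of raw PCM16 samples.
function sendAudioChunk(ws, pcmChunk) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcmChunk.toString("base64"),
  }));
}

// With server VAD enabled, the server decides when the user's turn ends.
// If you manage turns yourself, commit the buffer and request a response:
function endUserTurn(ws) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}
```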
For Python, the openai SDK >= 2.1.0 exposes a realtime client with the same event names. To compare the Realtime API with the Responses API, see How to use the GPT-5.5 API.
Voices
GPT-Realtime-2 adds two Realtime-only voices:
- Cedar: warm, mid-range male voice. Suitable as a default general-agent voice.
- Marin: bright, clear female voice. Useful for translation and announcements.
The previous eight voices (`alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, and `verse`) are still available and were retuned for the new audio stack.
To switch voices mid-session, send another session.update:
ws.send(JSON.stringify({
type: "session.update",
session: {
voice: "marin",
},
}));
Add image input to a voice turn
You can attach an image to a user turn and then ask a question about it:
ws.send(JSON.stringify({
type: "conversation.item.create",
item: {
type: "message",
role: "user",
content: [
{
type: "input_image",
image_url: "https://example.com/screenshot.png",
},
{
type: "input_text",
text: "What does this error mean?",
},
],
},
}));
ws.send(JSON.stringify({ type: "response.create" }));
Useful implementation patterns:
- Voice-driven QA: A tester points a camera at a broken UI and the agent dictates a bug report.
- Field support: A technician shares a wiring-panel photo and the agent walks through diagnostics.
- Accessibility: The agent describes a user’s current screen during a support call.
For more on OpenAI’s image stack, see How to use the GPT-Image-2 API.
Function calling and MCP
GPT-Realtime-2 supports standard function tools and remote MCP servers in the same session.
Standard function calling
The flow is similar to Chat Completions:
- Declare tools in the session config.
- The model emits `response.function_call_arguments.delta` events.
- Your app executes the function.
- Your app sends a `conversation.item.create` event with a `function_call_output` item.
The important change is parallel calling. The model can trigger multiple calls at once and narrate progress while waiting for results.
Example session update:
ws.send(JSON.stringify({
type: "session.update",
session: {
tools: [
{
type: "function",
name: "lookup_account",
description: "Look up a customer account by ID.",
parameters: {
type: "object",
properties: {
account_id: { type: "string" },
},
required: ["account_id"],
},
},
{
type: "function",
name: "list_transactions",
description: "List recent transactions for an account.",
parameters: {
type: "object",
properties: {
account_id: { type: "string" },
limit: { type: "number" },
},
required: ["account_id"],
},
},
],
},
}));
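Once those tools are declared, your app has to execute whatever the model calls and return the result. Below is a hedged sketch of that handler: `lookupAccount` is a placeholder for your own backend call, and it waits for the terminal `response.function_call_arguments.done` event rather than accumulating deltas, which follows the existing Realtime function-calling flow described above.

```javascript
// Placeholder for your own backend lookup; not part of the API.
async function lookupAccount(accountId) {
  return { account_id: accountId, status: "active" };
}

ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());

  // Fires once the model has finished emitting a tool call.
  if (event.type === "response.function_call_arguments.done" &&
      event.name === "lookup_account") {
    const args = JSON.parse(event.arguments);
    const result = await lookupAccount(args.account_id);

    // Return the result to the conversation, then ask for a spoken reply.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```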
Remote MCP servers
Remote MCP support lets the Realtime API call tools from an MCP server directly. Configure the MCP URL and allowed tools in the session:
ws.send(JSON.stringify({
type: "session.update",
session: {
tools: [
{
type: "mcp",
server_url: "https://mcp.example.com/sse",
allowed_tools: [
"lookup_account",
"list_transactions",
],
},
],
},
}));
This is useful when your voice agent needs access to a larger tool catalog without manually routing every function call through your WebSocket loop.
If you are testing MCP servers before wiring them into a voice agent, see MCP server testing in Apidog.
SIP phone calling
GPT-Realtime-2 can handle real phone calls through SIP.
At a high level:
- Point your SIP trunk at OpenAI’s SIP gateway.
- An inbound call opens a Realtime WebSocket session.
- Your app connects using the call ID:
wss://api.openai.com/v1/realtime?call_id={call_id}
The model accepts G.711 mu-law and A-law directly, so your bridge does not need to transcode audio before sending it to the Realtime API.
This makes GPT-Realtime-2 suitable for call-center-style agents where most turns involve listening, calling tools, and responding by voice.
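A hedged sketch of the bridge side is below. It assumes the `call_id` arrives from your own inbound-call webhook or SIP-event handler (how that ID is delivered depends on your trunk setup and is not shown), and the `g711_ulaw` format string follows the existing Realtime session fields, so verify it against the current docs.

```javascript
import WebSocket from "ws";

// `callId` is assumed to come from your inbound-call webhook or SIP event.
function attachAgentToCall(callId) {
  const ws = new WebSocket(
    `wss://api.openai.com/v1/realtime?call_id=${callId}`,
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  ws.on("open", () => {
    // G.711 mu-law straight from the trunk; no transcoding needed.
    ws.send(JSON.stringify({
      type: "session.update",
      session: {
        voice: "cedar",
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
        instructions: "You are a phone support agent. Keep answers short.",
      },
    }));
  });

  return ws;
}
```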
Configure reasoning effort
Reasoning effort controls the latency/quality tradeoff.
| Level | Use case | Approx. latency cost |
|---|---|---|
| `minimal` | Single-turn yes/no answers | none |
| `low` | Default; everyday support and chat | small |
| `medium` | Disambiguation, complex tool dispatch | moderate |
| `high` | Multi-step reasoning, code review by voice | high |
| `xhigh` | Benchmarks, hard analytical questions | highest |
Default to low:
ws.send(JSON.stringify({
type: "session.update",
session: {
reasoning: {
effort: "low",
},
},
}));
Move to medium, high, or xhigh only when you can measure a quality gap. The latency cost is noticeable in live calls.
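One way to make that call with data rather than intuition is to time the same scripted turn at each effort level. The sketch below assumes a hypothetical helper, `runScriptedTurn(effort)`, that opens a session, replays a fixed prompt at that reasoning level, and resolves when the matching `response.done` event arrives; the helper is your own test harness, not an API call.

```javascript
// Hypothetical harness: replay one fixed turn per reasoning level and compare latency.
// `runScriptedTurn` is assumed to resolve when `response.done` is received.
async function compareReasoningLevels(runScriptedTurn) {
  const levels = ["minimal", "low", "medium", "high", "xhigh"];
  for (const effort of levels) {
    const startedAt = Date.now();
    const result = await runScriptedTurn(effort);
    console.log(`${effort}: ${Date.now() - startedAt} ms, answer: ${result.text}`);
  }
}
```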
Test the Realtime API in Apidog
WebSocket APIs are difficult to debug from the terminal because every connection has state. Apidog gives you a repeatable way to test the same Realtime session.
A practical test workflow:
- Create a new WebSocket request.
- Use this URL: `wss://api.openai.com/v1/realtime?model=gpt-realtime-2`
- Add the headers `Authorization: Bearer {{OPENAI_API_KEY}}` and `OpenAI-Beta: realtime=v1`.
- Save a `session.update` message.
- Add scripted messages such as `input_audio_buffer.append`, `input_audio_buffer.commit`, and `response.create` (example frames are shown after this list).
- Replay the script against one connection.
- Capture all server events.
- Diff runs when changing voice, reasoning effort, or tool configuration.
Download Apidog, create a WebSocket request, and store your bearer token under Auth or an environment variable.
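The scripted messages themselves are just JSON frames. A minimal script to paste into Apidog might look like this sketch, where the base64 audio payload is a placeholder for a real PCM16 chunk:

```javascript
// Frames to save as scripted messages in Apidog, sent in order on one connection.

// 1. Configure the session.
const frame1 = { type: "session.update", session: { voice: "cedar", reasoning: { effort: "low" } } };

// 2. Append a short audio chunk (placeholder base64), then commit it.
const frame2 = { type: "input_audio_buffer.append", audio: "<base64 pcm16 chunk>" };
const frame3 = { type: "input_audio_buffer.commit" };

// 3. Ask the model to respond.
const frame4 = { type: "response.create" };

// Paste each stringified frame into its own scripted message.
console.log([frame1, frame2, frame3, frame4].map((f) => JSON.stringify(f)).join("\n"));
```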
For comparison with another fast multimodal model, see How to use the Gemini 3 Flash Preview API.
FAQ
What model ID should I use?
Use `gpt-realtime-2`. The earlier model is still available as `gpt-realtime`, and the lite version is `gpt-realtime-2-mini`.
Can I stream input audio while output audio is still playing?
Yes. The Realtime API uses server-side voice activity detection by default, so the model can stop speaking when the user starts. You can also disable VAD and manage turn boundaries from the client.
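If you would rather control turn boundaries yourself (push-to-talk, for example), the manual flow looks roughly like the sketch below. It reuses the `ws` connection from earlier, and setting `turn_detection` to `null` to disable server VAD is an assumption based on the existing Realtime session shape, so verify it against the current docs.

```javascript
// Disable server-side VAD and manage turns from the client.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: null,
  },
}));

// Later, when the user releases the push-to-talk button:
ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
ws.send(JSON.stringify({ type: "response.create" }));
```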
Does the 128k context include audio tokens?
Yes. Audio is tokenized. One second of audio is roughly 50 tokens depending on format, so a 40-minute call can consume around 120,000 tokens of audio alone. Long calls therefore use up context faster than long text chats, so inspect usage before assuming the full 128k window is enough.
Is fine-tuning supported?
Not yet. Per the model card, GPT-Realtime-2 does not yet support fine-tuning, predicted outputs, or text streaming on Chat Completions. The Realtime endpoint streams audio inherently.
How does GPT-Realtime-2 compare to GPT-5.5 plus TTS?
GPT-Realtime-2 performs end-to-end speech reasoning. A voice-aware model can respond to tone, hesitation, and emphasis. A text model with TTS cannot use those audio cues in the same way.
For pure text reasoning, see How to use the GPT-5.5 API.
What rate limits apply?
Tier 1 starts at 40,000 tokens per minute and scales to 15M TPM at Tier 5. Rate limits are per model, so existing GPT-5 quota does not carry over.
Wrapping up
GPT-Realtime-2 gives you a single API surface for voice input, reasoning, tool use, image input, and spoken output. The main implementation path is:
- Start with the WebSocket endpoint.
- Configure the session with `session.update`.
- Use `low` reasoning by default.
- Add tools only after the basic audio loop works.
- Test repeated sessions in Apidog.
- Increase reasoning effort only when measured quality requires it.
The combination of 128k context, GPT-5-class reasoning, image input, MCP, and SIP support makes it practical to build voice agents that can answer calls, inspect screenshots, dispatch tools, and recover from failed turns without leaving the Realtime session.


