OpenAI shipped GPT-Realtime-2 on November 6, 2026. It is a speech-to-speech model with GPT-5-class reasoning, a 128,000-token context window, and configurable reasoning effort so you can trade latency for answer quality. If you already use gpt-realtime, migration mostly means changing the model string and adding a few optional session/tool fields.
This guide shows what changed, how pricing works, and how to call GPT-Realtime-2 over WebSocket and SIP. It also includes a practical setup in Apidog so you can replay Realtime sessions without re-recording audio for every test.
For context on OpenAI’s broader 2026 model line, see What is GPT-5.5. For the multimodal sibling, see How to use the GPT-Image-2 API.
TL;DR
- Model ID: `gpt-realtime-2`
- Context window: 128k tokens
- Max output: 32k tokens
- Input modalities: text, audio, image
- Output modalities: text, audio
- Audio pricing: $32 / 1M input tokens, $64 / 1M output tokens
- Cached audio input: $0.40 / 1M tokens
- New Realtime-only voices: Cedar and Marin
- Reasoning levels: `minimal`, `low`, `medium`, `high`, `xhigh`
- Default reasoning level: `low`
- WebSocket endpoint: `wss://api.openai.com/v1/realtime?model=gpt-realtime-2`
- SIP sessions use: `wss://api.openai.com/v1/realtime?call_id={call_id}`
- Companion models:
- GPT-Realtime-Translate: live translation, 70 input languages, $0.034/min
- GPT-Realtime-Whisper: streaming speech-to-text, $0.017/min
- Use Apidog to script WebSocket sessions, capture frames, and compare event output between runs.
What is GPT-Realtime-2?
GPT-Realtime-2 is a single speech-to-speech model. You stream audio in, receive audio out, and the model handles transcription, reasoning, tool selection, and voice generation in one pass.
That means you do not need to build a separate STT → LLM → TTS pipeline. The model runs on the existing Realtime API surface and improves the previous gpt-realtime flow with stronger reasoning and larger context.
The model accepts text, audio, and images as input, then emits text and audio as output. Image input is new for this model. You can add a screenshot or photo to a live conversation, ask a question by voice, and get a spoken answer.
That enables agents such as:
- Voice support copilots that can inspect user screenshots
- Field-support agents that reason over photos
- Accessibility assistants that describe what is on screen
Specs:
| Attribute | Value |
|---|---|
| Model ID | gpt-realtime-2 |
| Context window | 128,000 tokens |
| Max output | 32,000 tokens |
| Modalities in | text, audio, image |
| Modalities out | text, audio |
| Knowledge cutoff | 2024-09-30 |
| Reasoning levels | minimal, low, medium, high, xhigh |
| Function calling | yes |
| Remote MCP servers | yes |
| Image input | yes |
| SIP phone calling | yes |
What changed from gpt-realtime
Compared with gpt-realtime, GPT-Realtime-2 improves benchmark performance:
- Big Bench Audio: 81.4% → 96.6%
- Audio MultiChallenge: 34.7% → 48.5%
Those scores used high and xhigh reasoning. In production, the default is low to reduce latency, so you should benchmark your own workload before increasing reasoning effort.
Key behavior changes:
- Preambles: The model can say short filler phrases like “let me check that” while it reasons.
- Parallel tool calls with narration: The model can call multiple tools and describe progress instead of going silent.
- Better recovery: Ambiguous or partially failed turns are handled more gracefully.
- Domain tone control: The model can keep specialized terminology consistent and adapt delivery style during a session.
The context window also increased from 32k to 128k tokens. That matters for long-running voice sessions such as support calls, banking workflows, and tutoring sessions.
Pricing
GPT-Realtime-2 is billed per token, with separate rates for text, audio, and image input.
| Token type | Input | Cached input | Output |
|---|---|---|---|
| Text | $4.00 / 1M | $0.40 / 1M | $24.00 / 1M |
| Audio | $32.00 / 1M | $0.40 / 1M | $64.00 / 1M |
| Image | $5.00 / 1M | $0.50 / 1M | n/a |
Cached input reduces repeated-context cost significantly. If your agent uses a stable system prompt, policy document, or repeated instructions, keep that context cacheable.
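As a rough sanity check, you can estimate per-call cost straight from the table above. The helper below is an illustrative sketch only: the token counts in the example are made-up assumptions, so replace them with the usage numbers the API reports for your own sessions.

```javascript
// Illustrative cost estimate using the published per-1M-token rates.
// Token counts in the example call are placeholders, not real measurements.
const RATES = {
  audioIn: 32.0 / 1_000_000,   // $ per audio input token
  audioOut: 64.0 / 1_000_000,  // $ per audio output token
  cachedIn: 0.4 / 1_000_000,   // $ per cached input token
  textIn: 4.0 / 1_000_000,
  textOut: 24.0 / 1_000_000,
};

function estimateCallCost({ audioIn = 0, audioOut = 0, cachedIn = 0, textIn = 0, textOut = 0 }) {
  return (
    audioIn * RATES.audioIn +
    audioOut * RATES.audioOut +
    cachedIn * RATES.cachedIn +
    textIn * RATES.textIn +
    textOut * RATES.textOut
  );
}

// Example: a short support call with a cached system prompt.
console.log(estimateCallCost({ audioIn: 15_000, audioOut: 9_000, cachedIn: 4_000 }).toFixed(4));
// ≈ 1.0576 dollars
```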
For comparison with the rest of the OpenAI line, see GPT-5.5 pricing.
Companion model pricing:
- GPT-Realtime-Translate: $0.034/min. Supports 70 input languages and 13 output languages, with a word error rate roughly 12.5% lower than the other models tested on Hindi, Tamil, and Telugu.
- GPT-Realtime-Whisper: $0.017/min. Streaming speech-to-text for live captions and continuous transcription.
Use:
- GPT-Realtime-2 when you need reasoning and voice generation together.
- GPT-Realtime-Translate for live multilingual interpretation.
- GPT-Realtime-Whisper when you only need a transcript.
Endpoints and authentication
Available endpoints:
POST https://api.openai.com/v1/chat/completions
POST https://api.openai.com/v1/responses
WSS wss://api.openai.com/v1/realtime?model=gpt-realtime-2
WSS wss://api.openai.com/v1/realtime?call_id={call_id}
POST https://api.openai.com/v1/realtime/translations
POST https://api.openai.com/v1/realtime/transcription_sessions
For voice agents, use the WebSocket endpoint:
wss://api.openai.com/v1/realtime?model=gpt-realtime-2
Required headers:
Authorization: Bearer $OPENAI_API_KEY
OpenAI-Beta: realtime=v1
Set your API key:
export OPENAI_API_KEY="sk-proj-..."
Connect over WebSocket
Install the WebSocket client:
npm install ws
Create a minimal Node.js client:
import WebSocket from "ws";
const ws = new WebSocket(
"wss://api.openai.com/v1/realtime?model=gpt-realtime-2",
{
headers: {
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1",
},
}
);
ws.on("open", () => {
ws.send(JSON.stringify({
type: "session.update",
session: {
voice: "cedar",
instructions: "You are a friendly support agent for a fintech app.",
input_audio_format: "pcm16",
output_audio_format: "pcm16",
turn_detection: { type: "server_vad" },
reasoning: { effort: "low" },
},
}));
});
ws.on("message", (raw) => {
const event = JSON.parse(raw.toString());
if (event.type === "response.audio.delta") {
// base64 PCM16 audio chunk
// Pipe this to a speaker, browser AudioWorklet, or media stream.
process.stdout.write(Buffer.from(event.delta, "base64"));
}
});
The session is event-driven:
- Send a `session.update` event to configure the voice, audio format, VAD, tools, and reasoning effort.
- Send `input_audio_buffer.append` events while the user speaks.
- Receive `response.audio.delta` events as the model speaks.
- Handle tool-call events if the model requests external data.
PCM16 at 24 kHz is a safe default. G.711 mu-law and A-law are also supported, which is useful for phone-system integrations.
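To complete the loop described above, here is a minimal sketch of the input side. It reuses the `ws` connection from the earlier example and assumes you already have 24 kHz mono PCM16 audio in a Node `Buffer` (for example from a microphone library, which is not shown).

```javascript
// Append a chunk of user audio (24 kHz, mono, PCM16) to the input buffer.
// `pcmChunk` is assumed to be a Node Buffer of raw PCM16 samples.
function sendAudioChunk(ws, pcmChunk) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcmChunk.toString("base64"),
  }));
}

// With server VAD enabled, the server decides when the user's turn ends.
// If you manage turns yourself, commit the buffer and request a response:
function endUserTurn(ws) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}
```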
For Python, the openai SDK >= 2.1.0 exposes a realtime client with the same event names. To compare the Realtime API with the Responses API, see How to use the GPT-5.5 API.
Voices
GPT-Realtime-2 adds two Realtime-only voices:
- Cedar: warm, mid-range male voice. Suitable as a default general-agent voice.
- Marin: bright, clear female voice. Useful for translation and announcements.
The previous eight voices (`alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, and `verse`) are still available and were retuned for the new audio stack.
To switch voices mid-session, send another session.update:
ws.send(JSON.stringify({
type: "session.update",
session: {
voice: "marin",
},
}));
Add image input to a voice turn
You can attach an image to a user turn and then ask a question about it:
ws.send(JSON.stringify({
type: "conversation.item.create",
item: {
type: "message",
role: "user",
content: [
{
type: "input_image",
image_url: "https://example.com/screenshot.png",
},
{
type: "input_text",
text: "What does this error mean?",
},
],
},
}));
ws.send(JSON.stringify({ type: "response.create" }));
Useful implementation patterns:
- Voice-driven QA: A tester points a camera at a broken UI and the agent dictates a bug report.
- Field support: A technician shares a wiring-panel photo and the agent walks through diagnostics.
- Accessibility: The agent describes a user’s current screen during a support call.
For more on OpenAI’s image stack, see How to use the GPT-Image-2 API.
Function calling and MCP
GPT-Realtime-2 supports standard function tools and remote MCP servers in the same session.
Standard function calling
The flow is similar to Chat Completions:
- Declare tools in the session config.
- The model emits `response.function_call_arguments.delta` events.
- Your app executes the function.
- Your app sends a `conversation.item.create` event with a `function_call_output` item.
The important change is parallel calling. The model can trigger multiple calls at once and narrate progress while waiting for results.
Example session update:
ws.send(JSON.stringify({
type: "session.update",
session: {
tools: [
{
type: "function",
name: "lookup_account",
description: "Look up a customer account by ID.",
parameters: {
type: "object",
properties: {
account_id: { type: "string" },
},
required: ["account_id"],
},
},
{
type: "function",
name: "list_transactions",
description: "List recent transactions for an account.",
parameters: {
type: "object",
properties: {
account_id: { type: "string" },
limit: { type: "number" },
},
required: ["account_id"],
},
},
],
},
}));
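Once those tools are declared, your app has to execute whatever the model calls and return the result. Below is a hedged sketch of that handler: `lookupAccount` is a placeholder for your own backend call, and it waits for the terminal `response.function_call_arguments.done` event rather than accumulating deltas, which follows the existing Realtime function-calling flow described above.

```javascript
// Placeholder for your own backend lookup; not part of the API.
async function lookupAccount(accountId) {
  return { account_id: accountId, status: "active" };
}

ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());

  // Fires once the model has finished emitting a tool call.
  if (event.type === "response.function_call_arguments.done" &&
      event.name === "lookup_account") {
    const args = JSON.parse(event.arguments);
    const result = await lookupAccount(args.account_id);

    // Return the result to the conversation, then ask for a spoken reply.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```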
Remote MCP servers
Remote MCP support lets the Realtime API call tools from an MCP server directly. Configure the MCP URL and allowed tools in the session:
ws.send(JSON.stringify({
type: "session.update",
session: {
tools: [
{
type: "mcp",
server_url: "https://mcp.example.com/sse",
allowed_tools: [
"lookup_account",
"list_transactions",
],
},
],
},
}));
This is useful when your voice agent needs access to a larger tool catalog without manually routing every function call through your WebSocket loop.
If you are testing MCP servers before wiring them into a voice agent, see MCP server testing in Apidog.
SIP phone calling
GPT-Realtime-2 can handle real phone calls through SIP.
At a high level:
- Point your SIP trunk at OpenAI’s SIP gateway.
- An inbound call opens a Realtime WebSocket session.
- Your app connects using the call ID:
wss://api.openai.com/v1/realtime?call_id={call_id}
The model accepts G.711 mu-law and A-law directly, so your bridge does not need to transcode audio before sending it to the Realtime API.
This makes GPT-Realtime-2 suitable for call-center-style agents where most turns involve listening, calling tools, and responding by voice.
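A hedged sketch of the bridge side is below. It assumes the `call_id` arrives from your own inbound-call webhook or SIP-event handler (how that ID is delivered depends on your trunk setup and is not shown), and the `g711_ulaw` format string follows the existing Realtime session fields, so verify it against the current docs.

```javascript
import WebSocket from "ws";

// `callId` is assumed to come from your inbound-call webhook or SIP event.
function attachAgentToCall(callId) {
  const ws = new WebSocket(
    `wss://api.openai.com/v1/realtime?call_id=${callId}`,
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  ws.on("open", () => {
    // G.711 mu-law straight from the trunk; no transcoding needed.
    ws.send(JSON.stringify({
      type: "session.update",
      session: {
        voice: "cedar",
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
        instructions: "You are a phone support agent. Keep answers short.",
      },
    }));
  });

  return ws;
}
```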
Configure reasoning effort
Reasoning effort controls the latency/quality tradeoff.
| Level | Use case | Approx. latency cost |
|---|---|---|
| `minimal` | Single-turn yes/no answers | none |
| `low` | Default; everyday support and chat | small |
| `medium` | Disambiguation, complex tool dispatch | moderate |
| `high` | Multi-step reasoning, code review by voice | high |
| `xhigh` | Benchmarks, hard analytical questions | highest |
Default to low:
ws.send(JSON.stringify({
type: "session.update",
session: {
reasoning: {
effort: "low",
},
},
}));
Move to medium, high, or xhigh only when you can measure a quality gap. The latency cost is noticeable in live calls.
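One way to make that call with data rather than intuition is to time the same scripted turn at each effort level. The sketch below assumes a hypothetical helper, `runScriptedTurn(effort)`, that opens a session, replays a fixed prompt at that reasoning level, and resolves when the matching `response.done` event arrives; the helper is your own test harness, not an API call.

```javascript
// Hypothetical harness: replay one fixed turn per reasoning level and compare latency.
// `runScriptedTurn` is assumed to resolve when `response.done` is received.
async function compareReasoningLevels(runScriptedTurn) {
  const levels = ["minimal", "low", "medium", "high", "xhigh"];
  for (const effort of levels) {
    const startedAt = Date.now();
    const result = await runScriptedTurn(effort);
    console.log(`${effort}: ${Date.now() - startedAt} ms, answer: ${result.text}`);
  }
}
```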
Test the Realtime API in Apidog
WebSocket APIs are difficult to debug from the terminal because every connection has state. Apidog gives you a repeatable way to test the same Realtime session.
A practical test workflow:
- Create a new WebSocket request.
- Use this URL: `wss://api.openai.com/v1/realtime?model=gpt-realtime-2`
- Add the headers `Authorization: Bearer {{OPENAI_API_KEY}}` and `OpenAI-Beta: realtime=v1`.
- Save a `session.update` message.
- Add scripted messages such as `input_audio_buffer.append`, `input_audio_buffer.commit`, and `response.create` (example frames are shown after this list).
- Replay the script against one connection.
- Capture all server events.
- Diff runs when changing voice, reasoning effort, or tool configuration.
Download Apidog, create a WebSocket request, and store your bearer token under Auth or an environment variable.
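The scripted messages themselves are just JSON frames. A minimal script to paste into Apidog might look like this sketch, where the base64 audio payload is a placeholder for a real PCM16 chunk:

```javascript
// Frames to save as scripted messages in Apidog, sent in order on one connection.

// 1. Configure the session.
const frame1 = { type: "session.update", session: { voice: "cedar", reasoning: { effort: "low" } } };

// 2. Append a short audio chunk (placeholder base64), then commit it.
const frame2 = { type: "input_audio_buffer.append", audio: "<base64 pcm16 chunk>" };
const frame3 = { type: "input_audio_buffer.commit" };

// 3. Ask the model to respond.
const frame4 = { type: "response.create" };

// Paste each stringified frame into its own scripted message.
console.log([frame1, frame2, frame3, frame4].map((f) => JSON.stringify(f)).join("\n"));
```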
For comparison with another fast multimodal model, see How to use the Gemini 3 Flash Preview API.
FAQ
What model ID should I use?
Use `gpt-realtime-2`. The earlier model is still available as `gpt-realtime`, and the lite version is `gpt-realtime-2-mini`.
Can I stream input audio while output audio is still playing?
Yes. The Realtime API uses server-side voice activity detection by default, so the model can stop speaking when the user starts. You can also disable VAD and manage turn boundaries from the client.
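If you would rather control turn boundaries yourself (push-to-talk, for example), the manual flow looks roughly like the sketch below. It reuses the `ws` connection from earlier, and setting `turn_detection` to `null` to disable server VAD is an assumption based on the existing Realtime session shape, so verify it against the current docs.

```javascript
// Disable server-side VAD and manage turns from the client.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: null,
  },
}));

// Later, when the user releases the push-to-talk button:
ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
ws.send(JSON.stringify({ type: "response.create" }));
```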
Does the 128k context include audio tokens?
Yes. Audio is tokenized. One second of audio is roughly 50 tokens depending on format, so a 40-minute call can consume around 120,000 tokens of audio alone. Long calls therefore use up context faster than long text chats, so inspect usage before assuming the full 128k window is enough.
Is fine-tuning supported?
Not yet. Per the model card, GPT-Realtime-2 does not yet support fine-tuning, predicted outputs, or text streaming on Chat Completions. The Realtime endpoint streams audio inherently.
How does GPT-Realtime-2 compare to GPT-5.5 plus TTS?
GPT-Realtime-2 performs end-to-end speech reasoning. A voice-aware model can respond to tone, hesitation, and emphasis. A text model with TTS cannot use those audio cues in the same way.
For pure text reasoning, see How to use the GPT-5.5 API.
What rate limits apply?
Tier 1 starts at 40,000 tokens per minute and scales to 15M TPM at Tier 5. Rate limits are per model, so existing GPT-5 quota does not carry over.
Wrapping up
GPT-Realtime-2 gives you a single API surface for voice input, reasoning, tool use, image input, and spoken output. The main implementation path is:
- Start with the WebSocket endpoint.
- Configure the session with `session.update`.
- Use `low` reasoning by default.
- Add tools only after the basic audio loop works.
- Test repeated sessions in Apidog.
- Increase reasoning effort only when measured quality requires it.
The combination of 128k context, GPT-5-class reasoning, image input, MCP, and SIP support makes it practical to build voice agents that can answer calls, inspect screenshots, dispatch tools, and recover from failed turns without leaving the Realtime session.


