OpenAI swapped ChatGPT’s default model on May 5, 2026. GPT-5.5 Instant replaced GPT-5.3 Instant, reduced hallucinated claims on high-stakes prompts by 52.5%, and kept the low-latency behavior developers expect from the Instant tier. If you build against the API, use the gpt-5.5 model name, a 1M-token context window, and published per-million-token pricing.
This guide shows how to access GPT-5.5 Instant in ChatGPT, how the router switches between Instant and Thinking, and how to call the model from code with reproducible API tests.
TL;DR
GPT-5.5 Instant is OpenAI’s new ChatGPT default and the fast tier of the GPT-5.5 family.
- Free users: 10 messages every 5 hours
- Plus users: 160 messages every 3 hours
- Pro and Business users: unlimited use, subject to abuse guardrails
- API model name: `gpt-5.5`
- Recommended endpoint: Responses API
- Instant-like API setting: `reasoning.effort: "minimal"`
- Context window: 1M tokens
- Max output: 128,000 tokens
- Standard pricing: $5 per 1M input tokens, $30 per 1M output tokens
What GPT-5.5 Instant is
GPT-5.5 Instant is the latency-optimized variant of GPT-5.5. In ChatGPT, OpenAI exposes three GPT-5.5 modes:
| Mode | Best for |
|---|---|
| GPT-5.5 Instant | Fast responses, default chat, low-latency UX |
| GPT-5.5 Thinking | Deeper reasoning and harder multi-step tasks |
| GPT-5.5 Pro | Extra compute for paid tiers and highest-accuracy workloads |
Instant exists because OpenAI uses a router. When a prompt looks simple, ChatGPT stays on Instant. When the prompt requires more reasoning, the router may switch to GPT-5.5 Thinking automatically.
Paid users can also pin Instant manually from the model picker when predictable latency matters.
GPT-5.5 Instant and GPT-5.5 Thinking share the same underlying model family. The difference is the reasoning budget, not the knowledge base.
Both support:
- 1M-token context
- Up to 128,000 output tokens
- Code generation and debugging
- Live web search through the search tool
- File handling for PDFs, images, and spreadsheets
- Memory in supported Plus and Pro web sessions
For the broader release details, see the GPT-5.5 overview.
How to access GPT-5.5 Instant in ChatGPT
Open chatgpt.com or the mobile app and send a message. GPT-5.5 Instant is now the default model across account tiers.
The main difference between plans is the message cap.
| Plan | GPT-5.5 Instant cap | After the cap |
|---|---|---|
| Free | 10 messages every 5 hours | Falls back to GPT-5.5 mini |
| Plus | 160 messages every 3 hours | Falls back to GPT-5.5 mini |
| Pro | Unlimited, subject to abuse guardrails | Stays on GPT-5.5 |
| Business | Unlimited, subject to abuse guardrails | Stays on GPT-5.5 |
| Enterprise | Unlimited, subject to abuse guardrails | Stays on GPT-5.5 |
Plus, Pro, and Business users can pin GPT-5.5 Instant or GPT-5.5 Thinking from the model picker in the chat header. Pinning applies to the current chat, not your whole account.
When the router switches to Thinking
If you do not pin a model, ChatGPT decides whether to use Instant or Thinking. OpenAI has not published the full routing rules, but Thinking commonly appears when a prompt:
- Requires a multi-step plan
- Has ambiguous constraints
- Involves high-stakes domains such as medicine, law, or finance
- Needs synthesis across a long context
- Uses tools or agent-like execution
For normal chat, Instant is usually the right default. For guaranteed reasoning depth, pin Thinking manually or set a higher reasoning effort in the API.
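OpenAI has not published the routing rules, so any client-side analogue is guesswork. Still, the API-side equivalent of "pick a model per prompt" is just selecting an effort level per request. The sketch below is purely illustrative, our own heuristic and not OpenAI's router; the keyword list and length threshold are arbitrary assumptions:

```python
def pick_effort(prompt: str) -> str:
    """Crude, illustrative stand-in for a router: choose a reasoning
    effort from surface features of the prompt. The keyword list and
    thresholds here are arbitrary assumptions, not OpenAI's logic."""
    text = prompt.lower()
    high_stakes = ("medical", "diagnos", "legal", "contract", "financial")
    if any(term in text for term in high_stakes):
        return "high"
    if len(prompt) > 4000 or "step by step" in text:
        return "medium"
    return "minimal"
```

In practice you would tune rules like these against your own traffic; the point is only that effort can be selected per request rather than fixed globally.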
How to call GPT-5.5 Instant through the API
In the API, GPT-5.5 Instant does not have a separate model ID. Use `gpt-5.5`, then control the reasoning behavior with `reasoning.effort`.
Supported values: `minimal`, `low`, `medium`, and `high`.

For Instant-like behavior, use:

```json
{
  "reasoning": {
    "effort": "minimal"
  }
}
```
GPT-5.5 is available through two endpoints:
| Endpoint | Use case |
|---|---|
| Responses API, `/v1/responses` | Recommended for new apps, tools, structured output, and streaming |
| Chat Completions API, `/v1/chat/completions` | Legacy compatibility |
Pricing
| Tier | Input | Output |
|---|---|---|
| Standard | $5.00 / 1M tokens | $30.00 / 1M tokens |
| Batch | $2.50 / 1M tokens | $15.00 / 1M tokens |
| Flex | $2.50 / 1M tokens | $15.00 / 1M tokens |
| Priority | $12.50 / 1M tokens | $75.00 / 1M tokens |
Important: prompts above 272K input tokens are billed at 2x input and 1.5x output for the rest of the session on every tier except Priority.
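A quick estimator makes the surcharge concrete. The sketch below uses the standard-tier prices from the table and applies the multipliers per request once input exceeds the threshold; this simplifies the rest-of-session billing described above, so treat it as a rough lower bound rather than official billing logic:

```python
INPUT_PER_M = 5.00     # standard tier, USD per 1M input tokens
OUTPUT_PER_M = 30.00   # standard tier, USD per 1M output tokens
CLIFF = 272_000        # long-context input-token threshold

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Back-of-envelope request cost in USD. Applies the 2x input /
    1.5x output multipliers when input exceeds the cliff. Real billing
    applies them for the rest of the session, so this underestimates
    later requests in an already-escalated session."""
    in_mult, out_mult = (2.0, 1.5) if input_tokens > CLIFF else (1.0, 1.0)
    cost = input_tokens / 1_000_000 * INPUT_PER_M * in_mult
    cost += output_tokens / 1_000_000 * OUTPUT_PER_M * out_mult
    return round(cost, 4)

# Crossing the cliff roughly doubles the effective input rate:
# estimate_cost(100_000, 10_000) -> 0.8
# estimate_cost(300_000, 10_000) -> 3.45
```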
For more examples, see the GPT-5.5 pricing breakdown.
Minimal Python request
Create an API key from the OpenAI platform, then install the SDK.
```bash
pip install --upgrade openai
export OPENAI_API_KEY="sk-..."
```
Call the Responses API:
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "minimal"},
    input=[
        {
            "role": "user",
            "content": "Summarize this changelog entry in 3 bullet points: ..."
        }
    ],
    max_output_tokens=400,
)

print(response.output_text)
```
Use `reasoning={"effort": "minimal"}` for the closest API equivalent to GPT-5.5 Instant in ChatGPT. Increase the effort when needed with `reasoning={"effort": "medium"}` or `reasoning={"effort": "high"}`.
Minimal Node.js request
```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.responses.create({
  model: "gpt-5.5",
  reasoning: { effort: "minimal" },
  input: [
    {
      role: "user",
      content: "Translate this product description into Spanish, keeping HTML intact: ..."
    }
  ],
  max_output_tokens: 600,
});

console.log(response.output_text);
```
Stream GPT-5.5 Instant responses
Streaming gives users faster perceived latency because the UI can render tokens as they arrive.
```python
from openai import OpenAI

client = OpenAI()

stream = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "minimal"},
    input=[
        {
            "role": "user",
            "content": "Draft a release note for v2.7..."
        }
    ],
    stream=True,
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
```
If you are migrating from Chat Completions, note that the response shape is different. The `output_text` helper flattens the structured response blocks into a plain string.
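To see what that flattening does, here is a hand-rolled equivalent. This is a sketch: the `output`/`content`/`type` field names reflect the Responses API object shape as this guide describes it, so verify them against your SDK version before relying on it.

```python
def flatten_output_text(response) -> str:
    """Rebuild what the output_text helper returns by walking the
    structured output blocks: collect the text of every output_text
    block inside every message item, in order."""
    parts = []
    for item in response.output:
        if getattr(item, "type", None) == "message":
            for block in item.content:
                if getattr(block, "type", None) == "output_text":
                    parts.append(block.text)
    return "".join(parts)
```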
For free-tier API usage and quota details, see the GPT-5.5 free access guide.
Test GPT-5.5 Instant requests with Apidog before shipping
A notebook is enough for quick experiments. Production work needs repeatable request templates, environment secrets, response assertions, and CI-ready tests.
Apidog gives you that workflow without writing throwaway scripts.
Step 1: Import the OpenAI OpenAPI spec
Apidog supports OpenAPI 3.x. Import the Responses API spec so your endpoints, parameters, and response schemas are available in one workspace.
Step 2: Store your API key as an environment secret
Add your OpenAI key to an Apidog environment, for example:

```
OPENAI_API_KEY=sk-...
```

Then reference it in the Authorization header:

```
Authorization: Bearer {{OPENAI_API_KEY}}
```
This keeps staging and production credentials separate.
Step 3: Save a GPT-5.5 Instant request template
Create a request body like this:
```json
{
  "model": "gpt-5.5",
  "reasoning": {
    "effort": "minimal"
  },
  "input": [
    {
      "role": "user",
      "content": "Summarize this changelog entry in 3 bullet points: ..."
    }
  ],
  "max_output_tokens": 400
}
```
Save it as a reusable request so teammates can replay the exact same prompt and settings.
Step 4: Compare Instant and Thinking behavior
Duplicate the request and change only the reasoning effort:
```json
{
  "reasoning": {
    "effort": "high"
  }
}
```
Run both requests and compare:
- Latency
- Token usage
- Response body
- Accuracy on your expected output
- Cost impact
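You can automate the same comparison in a script. The harness below is a sketch: it accepts any callable, so it works with a stubbed client in tests, and it assumes the response exposes a `usage.total_tokens` attribute (adapt that field name to whatever your client actually returns):

```python
import time

def compare_runs(call_model, efforts=("minimal", "high")):
    """Time the same request at each effort level and collect token
    usage. `call_model` is any callable taking an effort string and
    returning an object with a `usage.total_tokens` attribute (an
    assumed shape for this sketch)."""
    results = {}
    for effort in efforts:
        start = time.perf_counter()
        response = call_model(effort)
        elapsed = time.perf_counter() - start
        results[effort] = {
            "latency_s": round(elapsed, 3),
            "total_tokens": response.usage.total_tokens,
        }
    return results
```

Run it twice a day against a fixed prompt set and you get a cheap drift check on both latency and cost.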
Step 5: Add assertions
Turn the request into a test scenario. Assert on fields that matter to your app, such as response status, schema, or required text.
Example checks:
```javascript
pm.test("status is 200", function () {
  pm.response.to.have.status(200);
});

pm.test("response contains output text", function () {
  const json = pm.response.json();
  pm.expect(json.output).to.exist;
});
```
Step 6: Run the scenario in CI
Use Apidog scenarios to catch regressions when:
- You edit a system prompt
- OpenAI ships a model update
- You change request parameters
- You move from staging to production
For a deeper testing workflow, see API testing for QA engineers. You can also download and install Apidog to follow along.
Advanced implementation tips
Pin reasoning effort per route
Do not use high reasoning everywhere. Route by task complexity.
Example:
```javascript
const reasoningByRoute = {
  support_triage: "minimal",
  docs_qa: "low",
  security_review: "medium",
  incident_analysis: "high",
};
```
Then pass the selected value into the request:
```javascript
const response = await client.responses.create({
  model: "gpt-5.5",
  reasoning: { effort: reasoningByRoute.support_triage },
  input: [{ role: "user", content: ticketText }],
  max_output_tokens: 500,
});
```
Cap output tokens
GPT-5.5 can generate up to 128,000 output tokens. Always set `max_output_tokens`.
```python
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "minimal"},
    input=[{"role": "user", "content": "Write a concise support reply: ..."}],
    max_output_tokens=300,
)
```
Watch the 272K-token billing cliff
If your prompt crosses 272K input tokens, the rest of the session can cost more. For long-document workflows:
- Chunk documents
- Use retrieval instead of stuffing full documents into one prompt
- Stream partial results
- Keep session boundaries clear
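A simple character-based chunker is often enough to stay under the threshold. The 4-characters-per-token ratio below is a rough English-text average, not a tokenizer; use a real tokenizer such as tiktoken when you need exact counts:

```python
def chunk_text(text: str, max_tokens: int = 200_000,
               chars_per_token: int = 4) -> list[str]:
    """Split a document into chunks that stay well under the 272K
    input-token threshold. Token count is approximated as
    len(text) / chars_per_token, which is only a heuristic."""
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk can then be summarized independently and the summaries merged in a final, much smaller request.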
Use Batch for offline jobs
Batch is cheaper and fits workloads without strict latency needs:
- Bulk support ticket classification
- Weekly report summaries
- Backfills
- Large-scale content transformation
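Batch jobs are submitted as a JSONL file of request envelopes. A helper to build those lines might look like this; the `custom_id`/`method`/`url`/`body` envelope follows the OpenAI Batch API format as we understand it, and the upload calls are left as comments since they need a live key:

```python
import json

def build_batch_lines(prompts, model="gpt-5.5"):
    """Build JSONL request lines for an offline batch job: one
    independent Responses API request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/responses",
            "body": {
                "model": model,
                "reasoning": {"effort": "minimal"},
                "input": [{"role": "user", "content": prompt}],
                "max_output_tokens": 300,
            },
        }))
    return "\n".join(lines)

# Upload and start the job (sketch; requires an API key):
# batch_file = client.files.create(file=open("batch.jsonl", "rb"),
#                                  purpose="batch")
# job = client.batches.create(input_file_id=batch_file.id,
#                             endpoint="/v1/responses",
#                             completion_window="24h")
```

Results come back keyed by `custom_id`, so the IDs should map cleanly to your own records.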
Use Priority only when latency matters
Priority costs more. Reserve it for user-facing paths where response time affects the product experience.
Stream to your frontend
For web apps, stream tokens to the browser through WebSockets or Server-Sent Events.
Example SSE shape:
```javascript
app.get("/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");

  const stream = await client.responses.create({
    model: "gpt-5.5",
    reasoning: { effort: "minimal" },
    input: [{ role: "user", content: req.query.prompt }],
    stream: true,
  });

  for await (const event of stream) {
    if (event.type === "response.output_text.delta") {
      res.write(`data: ${JSON.stringify({ delta: event.delta })}\n\n`);
    }
  }

  res.end();
});
```
Common mistakes to avoid
- Using `gpt-5.5-pro` for low-stakes prompts. Pro costs significantly more. Use it only when the quality gain justifies the bill.
- Leaving the system prompt empty. A short system prompt improves consistency and usually reduces wasted tokens.
- Not setting `reasoning.effort` explicitly. Pin it so traces are reproducible.
- Hardcoding API keys. Use environment variables, a secret manager, or Apidog environments.
- Forgetting `max_output_tokens`. Always cap output to control cost.
Alternatives and comparison
GPT-5.5 Instant is one option among fast frontier models.
| Model | Input | Output | Context | Notable strength |
|---|---|---|---|---|
| GPT-5.5 Instant | $5.00 / 1M | $30.00 / 1M | 1M | ChatGPT default, low hallucination, broad tool use |
| GPT-5.5 Pro | $30.00 / 1M | $180.00 / 1M | 1M | Highest accuracy in the OpenAI lineup |
| Gemini 3 Flash Preview | varies | varies | 1M | Fast multimodal, Google ecosystem fit |
| DeepSeek V4 | low | low | 128K | Low-cost open-weights frontier model |
Use GPT-5.5 Instant when you need ChatGPT-grade reliability, tool use, and low latency. Consider alternatives when your infrastructure, cost model, or multimodal requirements point elsewhere.
Real-world implementation patterns
Customer support triage
Use minimal effort for fast classification.
```python
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "minimal"},
    input=[
        {
            "role": "system",
            "content": "Classify support tickets by intent. Return only JSON."
        },
        {
            "role": "user",
            "content": "I was charged twice for my subscription."
        }
    ],
    max_output_tokens=200,
)
```
Expected output shape:
```json
{
  "intent": "billing_issue",
  "priority": "high",
  "needs_human": true
}
```
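Because the model returns plain text, validate the JSON before acting on it. A minimal check against the shape above (the field names come from the expected output in this guide, not an official schema):

```python
import json

# Expected fields and their types, taken from the example output shape.
REQUIRED_FIELDS = {"intent": str, "priority": str, "needs_human": bool}

def parse_triage(raw: str) -> dict:
    """Parse the model's triage reply and fail loudly on a bad shape
    instead of routing a ticket on garbage."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    return data
```

On a ValueError you can retry the request or fall back to human review rather than silently misrouting the ticket.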
Documentation Q&A
Use retrieval-augmented context and keep the reasoning effort low unless the question requires synthesis.
```javascript
const response = await client.responses.create({
  model: "gpt-5.5",
  reasoning: { effort: "low" },
  input: [
    {
      role: "system",
      content: "Answer using only the provided documentation context."
    },
    {
      role: "user",
      content: `Context:\n${retrievedDocs}\n\nQuestion: ${question}`
    }
  ],
  max_output_tokens: 700,
});
```
Code review assistant
Use low for common review comments and medium for security-sensitive code.
```python
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "medium"},
    input=[
        {
            "role": "system",
            "content": "Review the code for bugs, security issues, and API misuse."
        },
        {
            "role": "user",
            "content": diff_text
        }
    ],
    max_output_tokens=1200,
)
```
Pair this with the Apidog VS Code extension when you want inline API tests for suggested changes.
Conclusion
GPT-5.5 Instant is the default path for using GPT-5.5 with low latency. In ChatGPT, it is already enabled. In the API, call gpt-5.5 and set reasoning.effort to "minimal".
Key implementation points:
- Use `gpt-5.5` for API calls.
- Set `reasoning.effort` explicitly.
- Use `minimal` for Instant-like latency.
- Set `max_output_tokens` to control cost.
- Watch the 272K-token billing threshold.
- Stream responses for better UX.
- Test prompts and request bodies before deployment.
If you are building with the API, install Apidog, save a reusable gpt-5.5 request template, and run it across environments before shipping.
FAQ
Is GPT-5.5 Instant free?
Yes, with caps. Free ChatGPT accounts get 10 messages every 5 hours. Plus accounts get 160 messages every 3 hours. Pro and Business accounts get unlimited use, subject to abuse guardrails.
What is the API model name for GPT-5.5 Instant?
Use `gpt-5.5`. There is no separate `gpt-5.5-instant` model ID. Set `reasoning.effort: "minimal"` for Instant-like behavior.
See the GPT-5.5 API guide.
How is GPT-5.5 Instant different from GPT-5.5 Thinking?
They use the same underlying GPT-5.5 family, but with different reasoning budgets. Instant is optimized for fast responses. Thinking spends more compute on harder multi-step tasks. Pro adds more compute on top of Thinking.
Does GPT-5.5 Instant support tool use?
Yes. Through the Responses API, you can use the tools parameter for supported tool workflows, including search, code execution, and file-based operations.
What is the context window?
GPT-5.5 supports a 1M-token context window and up to 128,000 output tokens per response.
Watch the 272K input-token threshold. Past that point, standard, batch, and flex sessions are billed at 2x input and 1.5x output.
Can I pin GPT-5.5 Instant in ChatGPT?
Yes, on Plus, Pro, and Business plans. Open the model picker in the chat header and select GPT-5.5 Instant. The selection applies to the current chat.
How do I test GPT-5.5 Instant before deploying?
Save the request in Apidog, store the API key as an environment secret, add response assertions, and run the scenario in CI.
What happens when GPT-5.5 Instant routes me to Thinking?
ChatGPT’s router may switch to Thinking when the prompt looks complex. You may see a longer wait for the first token. In the API, pin the behavior yourself by setting reasoning.effort.



