DEV Community: Owen

Qwen 3.7 Plus vs Qwen 3.7 Max in 2026: Multimodal Agent or Pure-Text Flagship? Real Benchmarks + Pricing

Owen — Tue, 02 Jun 2026 15:23:57 +0000

On June 1, 2026, Alibaba quietly shipped Qwen 3.7 Plus, eleven days after Qwen 3.7 Max landed. Same 1M context, same 35-hour autonomous ceiling, same price floor. The only thing that changed: Plus now sees images and video. Vision Arena already has it at rank #16. So the real question this week isn't "which Qwen," it's "do I pay for eyes."

TL;DR: Which One Should You Pick? (30-Second Answer)

Qwen 3.7 Max is the pure-text flagship. Qwen 3.7 Plus is Max with vision added on top. Both share the 1M context window and the 35-hour autonomous run ceiling. Pick by workload:

Scenario	Pick
Long-context coding, no screenshots	Qwen 3.7 Max
Agent reads UI screenshots or design mockups	Qwen 3.7 Plus
Tight budget, output-heavy generation	Qwen 3.7 Max ($7.50/M output)
Video transcription + reasoning	Qwen 3.7 Plus
35-hour autonomous CLI agent	Either, same ceiling
Cheapest cached refresh prompts	Qwen 3.7 Max ($0.25/M cached)

If you have to commit to one for the next quarter and your agent never sees pixels, take Max. If half of what your agent processes is non-text, the Plus surcharge pays for itself by killing your OCR pipeline.

Quick Specs Comparison

Both models ship through Alibaba's Bailian platform and through ofox's OpenAI-compatible endpoint. The table is what your procurement spreadsheet actually needs:

Field	Qwen 3.7 Plus	Qwen 3.7 Max
Released	2026-06-01	2026-05-21
Modality	Text + Image + Video	Text only
Context window	1,000,000 tokens	1,000,000 tokens
Input price (text)	$2.50 / M tokens	$2.50 / M tokens
Output price	$7.50 / M tokens	$7.50 / M tokens
Cached input	$0.25 / M tokens	$0.25 / M tokens
Image input	Per-image surcharge	Not supported
Autonomous run ceiling	35 hours	35 hours
Sequential tool calls	1000+	1000+
LM Arena (text) rank	#15	#13
LM Arena (coding) rank	#12	#10
Vision Arena rank	#16	n/a
SWE-Bench Pro	~60% (text path)	60.6%
MCP-Atlas	76.4	76.4
Availability	Bailian + ofox	Bailian + ofox

Two things most spec sheets bury. Cached input is the same $0.25/M on both, so refresh-heavy workloads aren't punished for picking Plus. And Vision Arena #16 at launch, for a model barely a day old, already beats several established multimodal flagships.

Coding Benchmark: Real Tasks

The model that wins benchmarks is rarely the model that wins your sprint. We ran three real engineering tasks on both models using the same prompts via ofox's API, recording token usage, wall-clock time, and a 1-5 quality rating from a senior reviewer. Methodology: 5 runs per task, median reported, temperature 0.2.
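The methodology above (five runs per task, median reported) can be sketched as a small timing harness. This is an illustrative sketch, not the harness behind the numbers in this post; `run_task` is a hypothetical callable standing in for one full model call on the task under test.

```python
import statistics
import time
from typing import Callable


def median_wall_clock(run_task: Callable[[], object], runs: int = 5) -> float:
    """Run `run_task` `runs` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_task()  # hypothetical: one complete request/response for the task
        durations.append(time.perf_counter() - start)
    # The median is less sensitive to a single slow run than the mean.
    return statistics.median(durations)
```

Reporting the median rather than the mean keeps one cold-start or rate-limited run from skewing the comparison.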

Task 1: Refactor a 1,200-line Python service into async

Refactor a synchronous FastAPI service (requests + blocking DB calls) into httpx + asyncpg, preserve all endpoints, add proper cancellation, return a unified diff.

Metric	Qwen 3.7 Plus	Qwen 3.7 Max
Input tokens	12,840	12,840
Output tokens	4,210	3,980
Time (median)	47 sec	41 sec
Quality (1-5)	4	4
Diff applied cleanly	Yes	Yes

Verdict: tied on quality, Max is roughly 14% faster on text-only tasks. Plus carries its multimodal stack on every request, and that latency overhead is real even when you send no images.

Task 2: Debug a flaky test from a screenshot + stack trace

Given a screenshot of a Jest test report showing two failing assertions and a 60-line stack trace as text, identify the root cause and propose a fix.

Metric	Qwen 3.7 Plus	Qwen 3.7 Max
Input tokens	8,420 + 1 image	8,420 (image dropped)
Output tokens	1,830	2,140
Time	12 sec	9 sec
Quality (1-5)	5	2
Identified the real cause	Yes	No (guessed wrong line)

Verdict: this is the whole Plus thesis. Max sees the text but loses the visual signal that the test report highlighted a parent component, not the child being tested. Plus reads the highlight and fixes the right line on the first try. If your debugging loop ever involves a pasted screenshot, the model that can actually see it wins.

Task 3: 1,000-step autonomous CLI agent, Postgres 14 to 16 migration

Run a goal-oriented agent that plans the migration, runs pg_dump, validates schemas, executes the upgrade, and writes a rollback script. We let it run unattended for 4 hours each (well under the 35-hour ceiling).

Metric	Qwen 3.7 Plus	Qwen 3.7 Max
Tool calls executed	342	351
Errors recovered	4 of 5	5 of 5
Completion (% of plan)	96%	100%
Total cost	$1.84	$1.71

Verdict: Max wins by a hair on text-only agentic flow. The Plus run cost roughly 7% more for the same text-only work — overhead from carrying multimodal capability it never used here. That's the cost of carrying the camera. Neither model came close to the autonomous ceiling; both still had 30+ hours of runway when they finished.

The pattern across all three tasks is the same. Pure text input: Max is 7-15% faster and slightly cheaper. Visual signal in the input: Max guesses, Plus reads. This isn't a benchmark artifact. It tracks Alibaba's own positioning of Plus as the multimodal version of the same flagship.

Multimodal & Vision Capabilities (Plus's Home Turf)

Qwen 3.7 Plus is the only model in this comparison that ingests pixels, so the section has no Max column; it's about what Plus actually unlocks. Three capability tiers, in order of how often we see them in production:

Tier 1: UI debugging and design QA. Plus reads a screenshot of a broken layout, finds the offending CSS rule, and proposes a fix. We ran 20 production tickets through this loop. Plus resolved 14 from the screenshot alone. Max resolved 0; it can only react to whatever text someone manually transcribed.

Tier 2: PDF and document reasoning. Plus takes a multi-page PDF (invoices, contracts, research papers) and reasons over both the text and the visual layout: table cells, figure callouts, footnote positions. This kills the "pdf-to-markdown then prompt" pipeline that most teams glue together with pdfplumber and prayer.

Tier 3: Video summarization with timestamp grounding. Plus accepts video input up to a duration ceiling that Bailian gates per tier. Practical use: feed in a 15-minute recorded standup, get back a timestamped action-item list. We tested this on three recorded engineering reviews. The action items it surfaced were accurate enough that we stopped taking manual notes.

Vision Arena rank #16 at launch is the headline number, and it understates the practical lift. Vision Arena weights generic image-understanding tasks. What makes Plus useful in practice is that the vision capability sits on the same reasoning and tool-call substrate as Max. Other multimodal models (we'll name no names) can describe an image well but can't then call a tool with the result. Plus chains "look at screenshot → identify error → run pytest -k foo → report" inside a single agentic loop. That chaining is the moat.

The hard NO for Plus: it does not generate images or video, only ingests them. If you need text-to-image, you still need a separate generation model.

Tool Invocation & Agentic Tasks

Both models share the agentic numbers Alibaba pitches as the most aggressive in the industry: 35-hour continuous autonomous runs and 1000+ sequential tool calls in a single session. Those numbers come from Alibaba's launch material; we independently reproduced multi-hour runs (4+ hours unattended) without hitting any ceiling.

Why these numbers matter. Most "agent" frameworks die around the 100-tool-call mark because the model loses context coherence. Once an agent has burned through 80% of its window on planning and tool I/O, every subsequent action degrades. 1M context plus the state-management heuristics Alibaba tuned for long agent traces is what lets Qwen 3.7 hold the line where smaller-window models start hallucinating their own prior tool outputs.

Tool-call patterns we observed across both models:

Self-correcting tool errors. When a curl call returns 500, both models log the failure, wait, retry with backoff. Neither model loops infinitely.
Multi-step planning before execution. Both decompose "deploy to staging" into 14-18 ordered sub-tasks before running anything. Plans are visible in the trace, so you can interrupt before things get expensive.
Stateful memory across hours. A migration script written at hour 1 is still correctly referenced at hour 3. The 1M context is the engineering reason this works.

Where Plus extends Max: visually grounded tool calls. Examples from production traces:

"Look at the Datadog dashboard screenshot → identify the metric in red → query Datadog API for the corresponding service → write a runbook."
"Read the design Figma export → generate the JSX → screenshot the rendered result → compare against the original."

These loops simply don't run on Max, because Max can't ingest the screenshot or the Figma export. You can fake it with a stack of (OCR service + vision-to-text model + Max), but the cost, latency, and failure surface of that stack are materially worse than running Plus end-to-end.

MCP-Atlas (the multi-step tool-use benchmark) shows both models at 76.4; they share the same tool-invocation engine. So picking between them is purely about whether your tools speak pixels.

Pricing Math: Capability-per-Dollar Index

Spec sheets quote $/M tokens. Procurement quotes monthly bills. Here are two scenarios with real numbers, built from anonymized usage of three teams that have been running both models since launch.

Scenario A: 5-developer team, text-only coding agent

50 coding tasks per developer per day, 21 working days per month
Median task: 6,000 input + 1,800 output tokens
30% of inputs hit cache (refreshed prompt templates)

Monthly token volume per developer:

Input: 50 × 21 × 6,000 = 6.30M tokens; cached fraction at $0.25/M = 1.89M × $0.25 = $0.47; uncached at $2.50/M = 4.41M × $2.50 = $11.03
Output: 50 × 21 × 1,800 = 1.89M tokens × $7.50 = $14.18
Per developer: $25.68
Team of 5: $128.40 / month on Qwen 3.7 Max

Switching the same workload to Plus: identical pricing on text tokens, so the bill is also $128.40/month. But median task time is 14% higher, so end-to-end developer wait grows by roughly 6 seconds per task. Coding-per-dollar index ranks Max ahead because of latency, not direct cost.
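The Scenario A arithmetic can be reproduced with a few lines of Python. This is a sketch using only the rates and volumes quoted above; the function name and parameters are ours, not part of any pricing API.

```python
# Rates from the spec table, in USD per million tokens.
INPUT_RATE = 2.50
OUTPUT_RATE = 7.50
CACHED_RATE = 0.25


def monthly_text_cost(tasks_per_day, working_days, input_tokens,
                      output_tokens, cache_hit_ratio):
    """Monthly text-token cost in USD for one developer."""
    tasks = tasks_per_day * working_days
    input_m = tasks * input_tokens / 1_000_000    # input volume, millions of tokens
    output_m = tasks * output_tokens / 1_000_000  # output volume, millions of tokens
    cached_m = input_m * cache_hit_ratio
    uncached_m = input_m - cached_m
    return (cached_m * CACHED_RATE
            + uncached_m * INPUT_RATE
            + output_m * OUTPUT_RATE)


per_dev = monthly_text_cost(50, 21, 6_000, 1_800, 0.30)
team = per_dev * 5
print(f"per developer: ${per_dev:.2f}, team of 5: ${team:.2f}")
```

Without intermediate rounding this lands a few cents below the figures above (about $25.67 per developer and $128.36 for the team), because the worked example rounds each line item to the cent before summing.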

Scenario B: 5-developer team, visual debugging agent

Same 50 tasks/day/dev, same 21 working days
60% of tasks include 1 screenshot (Plus only; Max drops the image)
Median image: ≈ 1,280 image tokens at multimodal rate
Median text payload unchanged

Plus monthly cost per developer:

Text input + output: $25.68 (same as Scenario A)
Image: 50 × 21 × 0.6 × 1,280 tokens at multimodal rate ≈ $4.50
Per developer: ≈ $30.18
Team of 5: $150.90 / month on Qwen 3.7 Plus

Same workload on Max. Max can't read the screenshots, so the team replaces the visual signal with manual transcription. Manual screenshot triage adds about 4 minutes per task at $80/hour loaded cost, or $5.33 per task in human time. With 60% of tasks including screenshots: 50 × 21 × 0.6 × $5.33 = $3,358 / developer / month in lost engineering time. Team of 5: $16,790 / month in shadow labor cost on Max.

Vision-per-dollar index for the visual debugging workload: Plus wins by roughly 100×. That's the math that justifies switching.
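The Scenario B comparison can be checked the same way. A sketch under the scenario's own assumptions: the $25.68 text cost and the ≈ $4.50 image cost are taken as given rather than derived, since the multimodal token rate is not stated here, and the per-task labor cost is rounded to the cent as in the text.

```python
TASKS_PER_MONTH = 50 * 21   # tasks per developer per month
SCREENSHOT_SHARE = 0.6      # fraction of tasks that include a screenshot
TEAM_SIZE = 5

# Plus: token costs quoted in the scenario (USD per developer per month).
plus_per_dev = 25.68 + 4.50
plus_team = plus_per_dev * TEAM_SIZE

# Max: human time spent transcribing screenshots the model cannot read.
MINUTES_PER_TASK = 4
LOADED_RATE_PER_HOUR = 80
labor_per_task = round(MINUTES_PER_TASK / 60 * LOADED_RATE_PER_HOUR, 2)
screenshot_tasks = round(TASKS_PER_MONTH * SCREENSHOT_SHARE)
labor_per_dev = round(screenshot_tasks * labor_per_task)
labor_team = labor_per_dev * TEAM_SIZE

# Shadow labor on Max versus the full Plus bill. Adding the Max token
# bill to the numerator would push the ratio slightly higher.
ratio = labor_team / plus_team
print(f"Plus team: ${plus_team:.2f}, Max shadow labor: ${labor_team}, ratio: {ratio:.0f}x")
```

The ratio comes out a little above 100, which is where the "roughly 100×" figure comes from; it is dominated by the 4-minute triage assumption, so that is the number to challenge first.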

The rule of thumb. If your agent never sees pixels, run Max; Plus's multimodal warm-up overhead costs you 7-15% in latency for no benefit. If your agent sees pixels even 20% of the time, switch to Plus. The OCR pipeline you stop maintaining and the human triage you stop paying for cover the token surcharge instantly.

When to Pick Qwen 3.7 Plus

Pick Qwen 3.7 Plus when your agent processes anything that isn't plain text. Concrete pick signals:

Visual debugging loops. Screenshots, stack traces in image form, layout bugs, design-vs-implementation diffs.
Document intelligence. PDFs with non-trivial layout (multi-column papers, financial filings, contracts). Plus reads the layout, not just the text.
Video summarization. Standup recordings, lecture content, internal demos. Plus surfaces timestamped takeaways.
Visually grounded agents. Agents that need to "look then act": UI testers, design QA bots, screenshot-driven CI.
Mixed workloads where 20%+ of inputs are non-text. Below 20% you can keep Max + OCR; above 20% the math flips.

Also pick Plus if you want the option to add visual capability later without re-plumbing your endpoint. Plus is API-compatible with Max for text-only requests, so you can start text-only today and start attaching images the day your product demands it.

When to Pick Qwen 3.7 Max

Pick Qwen 3.7 Max when every prompt your system sends is text and you care about latency per dollar. Concrete pick signals:

CLI coding agents. Terminal-only workflows, no UI screenshots. See Qwen 3.7 Max coding arena benchmarks and Qwen 3.7 Max developer guide for the deep-dive integration patterns.
Doc generation, log triage, ETL prompts. Pure text pipelines.
Refresh-heavy workloads. Cached-input pricing at $0.25/M is identical on Plus, but Max's slightly faster cold-path latency compounds across repeated calls.
Cost-sensitive output-heavy generation. $7.50/M output is the same on both, but Max's lower latency lets you ship more output per developer-hour.
35-hour autonomous text agents. Same ceiling as Plus, no multimodal overhead.

Also pick Max when you're benchmarking against GPT-5.5 or Claude Opus 4.8 on pure coding tasks. Max's SWE-Bench Pro 60.6% is the current proprietary high-water mark on that benchmark — a 2-point edge over GPT-5.5's 58.6%. That lead is specific to SWE-Bench Pro, though: GPT-5.5 pulls ahead on SWE-Bench Verified, so weight whichever benchmark's task mix looks most like your codebase.

For the prior-generation comparison logic behind both decisions, see Qwen 3.6 Plus vs DeepSeek V4 Pro on coding: same decision framework, different model pair.

Try Both via ofox

The single-key advantage matters more for this pair than any other Qwen comparison. Plus and Max share modality at the text layer, so the cleanest way to A/B them is to send the same prompt to both endpoints and diff the outputs.

ofox hosts both models on its OpenAI-compatible API: ofox.ai/models/qwen/qwen3-7-plus and ofox.ai/models/qwen/qwen3-7-max. One API key, one base URL, swap the model field in your request body. The pattern we'd actually run in production: keep Max as the default for text-only traffic, route only image-containing requests to Plus. That preserves your latency budget and adds vision capability exactly where it changes outcomes.
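The routing pattern described above (Max by default, Plus only when a request carries pixels) can be sketched as a pure function over OpenAI-style chat messages. The model IDs are hypothetical, inferred from the catalog URLs rather than confirmed, and the `video_url` part type is an assumption; check the provider's docs for the exact names.

```python
TEXT_MODEL = "qwen/qwen3-7-max"     # hypothetical ID, inferred from the catalog URL
VISION_MODEL = "qwen/qwen3-7-plus"  # hypothetical ID, inferred from the catalog URL

# "image_url" is the OpenAI-style content-part type for images;
# "video_url" is an assumption about how video parts are tagged.
VISUAL_PART_TYPES = {"image_url", "video_url"}


def pick_model(messages):
    """Return the vision model only if some message contains a visual part."""
    for message in messages:
        content = message.get("content")
        # Text-only messages carry a plain string; multimodal ones carry a list of parts.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") in VISUAL_PART_TYPES:
                    return VISION_MODEL
    return TEXT_MODEL
```

Pass the result as the `model` field of the request. Keeping the decision in one function also makes it easy to log how much traffic actually needs vision before committing to Plus everywhere.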

FAQ

Does Qwen 3.7 Plus support 1M context like Qwen 3.7 Max? Yes. Both share the same 1M-token context window. On Plus, image and video tokens count against that same window (≈ 1,280 tokens per 1080p frame), so effective text headroom shrinks in proportion to your visual payload.

Is Qwen 3.7 Plus better than Qwen 3.7 Max for coding? Marginally worse on pure text-only coding (Max #10 vs Plus #12 on LM Arena coding). Significantly better when the coding task includes a screenshot, design mockup, or other visual signal. Plus reads it, Max guesses.

How much does Qwen 3.7 Plus cost compared to Qwen 3.7 Max? Text-token rates are identical: $2.50/M input, $7.50/M output, $0.25/M cached. Plus adds a per-image and per-video-second surcharge for multimodal inputs.

Can Qwen 3.7 Plus run for 35 hours autonomously? Yes. Alibaba's launch material lists autonomous iteration and tool invocation as core capabilities of Plus. We have validated 4-hour unattended runs; we have not personally hit the 35-hour ceiling.

How does Qwen 3.7 Max compare to GPT-5.5 on SWE-Bench Pro? Qwen 3.7 Max scores 60.6% versus GPT-5.5 at 58.6%, a 2-point lead and the current proprietary high-water mark on that benchmark.

Should I migrate from Qwen 3.7 Max to Qwen 3.7 Plus? Only if 20%+ of your agent's inputs are non-text. Below that threshold, Max's lower latency and matched price make migration a net negative.

Does Qwen 3.7 Plus generate images? No. Plus ingests images and video but does not generate them. You still need a separate generation model for text-to-image workloads.

Where can I try both models in one place? ofox lists both at ofox.ai/models/qwen/qwen3-7-plus and ofox.ai/models/qwen/qwen3-7-max, OpenAI-compatible API, single key.
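For the context-window question above, the headroom trade-off is simple arithmetic. A rough sketch, assuming the approximate 1,280-tokens-per-1080p-frame figure holds uniformly (real token counts will vary with resolution and sampling):

```python
CONTEXT_WINDOW = 1_000_000
TOKENS_PER_1080P_FRAME = 1_280  # approximate figure from the FAQ answer


def text_headroom(frames: int) -> int:
    """Tokens left for text after `frames` 1080p frames are placed in the window."""
    used = frames * TOKENS_PER_1080P_FRAME
    return max(CONTEXT_WINDOW - used, 0)


print(text_headroom(100))  # 100 frames consume 128,000 tokens
```

At this rate a little under 800 frames would fill the window on their own, so long videos need frame sampling if the agent also has to carry a large text context.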

Sources Checked for This Refresh

Alibaba Qwen Team launch note for Qwen 3.7 Plus, June 2, 2026: https://www.marktechpost.com/2026/06/02/alibabas-qwen-team-launches-qwen3-7-plus-adding-vision-deep-reasoning-tool-invocation-and-autonomous-iteration-on-the-bailian-platform/
Qwen 3.7 Max benchmark report on OpenRouter (verified 2026-06-02): https://openrouter.ai/qwen/qwen3.7-max/benchmarks
Qwen Research page (verified 2026-06-02): https://qwen.ai/research
VentureBeat coverage of Qwen 3.7 Max 35-hour autonomous runs: https://venturebeat.com/technology/alibabas-proprietary-qwen3-7-max-can-run-for-35-hours-autonomously-and-supports-external-harnesses-like-anthropics-claude-code
ofox model catalog snapshot, 2026-06-02: Qwen 3.7 Plus listed 2026-06-01, Qwen 3.7 Max listed 2026-05-21
LM Arena leaderboard snapshot, 2026-06-02

The honest summary you can send your tech lead in one Slack message: "Max is the faster, cheaper text flagship. Plus is the same model with eyes. If our agent ever looks at a screenshot, we should be on Plus. Otherwise stay on Max. The token bill is basically the same either way; the difference is whether we keep gluing OCR pipelines to a model that can't see."

Originally published on ofox.ai/blog.

Cursor Composer 2.5: What's New, Best Models, and How to Set It Up

Owen — Fri, 29 May 2026 22:07:13 +0000

TL;DR

Cursor shipped Composer 2.5 on May 18, 2026 — post-trained from Moonshot's Kimi K2.5, scoring 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1 (just past Opus 4.7's 61.6%). It is genuinely good at sustained agent work, but the "Fast" tier defaults to $3/$15 per million tokens — six times the Standard price for the same model. The cheapest sane Cursor setup right now: Composer 2.5 Standard for routine edits, plus a BYO route to Claude Sonnet 4.6 or GPT-5.4 Codex for the hard ones.

What actually changed in Composer 2.5

Composer 2.5 is the same Kimi K2.5 open-source backbone as Composer 2, with a different post-training stack on top. Cursor's release post lists three concrete shifts:

25× more synthetic training tasks than Composer 2, including a new family of "feature deletion" puzzles where the model is given a working repo with a feature ripped out and has to rebuild it.
Textual-feedback RL — localized hints at each failed tool call, instead of only an end-of-run reward signal. That is the change behind the "follows complex instructions more reliably" line in the announcement.
MoE-scale infrastructure — Cursor confirmed they invested heavily in distributed training plumbing so they can keep iterating on the base. They also confirmed (in the same post) that they are jointly training a much larger model from scratch with SpaceXAI — "10× more total compute" on Colossus 2 — but that one is not Composer 2.5.

Benchmark Results

Benchmark	Composer 2.5	Claude Opus 4.7	GPT-5.5
SWE-Bench Multilingual	79.8%	80.5%	77.8%
CursorBench v3.1 (default settings)	63.2%	61.6%	59.2%
Terminal-Bench 2.0	69.3%	69.4%	82.7%

A caveat worth sitting with: CursorBench is Cursor's eval, and Composer 2.5 is Cursor's model. Top developers on Hacker News pointed out that Composer 2's CursorBench score quietly dropped from 60–65% to 50–55% between v3.0 and v3.1 — the kind of bench-version drift that should make you cautious about any single-vendor leaderboard. And Composer 2.5 loses Terminal-Bench 2.0 to GPT-5.5 by 13 percentage points. If your day is mostly shell-and-CLI work, that gap matters.

The HN thread is also where the cost story is: one engineer reported a 4-person team's Cursor bill jumping from "$20–100 per person" to roughly $1,000 total per month after the Fast tier became default. The complaint is fair — Fast pricing is roughly 3× Composer 2.

Pricing — and the trap most people walk into

Composer 2.5 has two tiers that serve the same model weights:

Tier	Input	Output	Default?
Standard	$0.50 / M tokens	$2.50 / M tokens	No
Fast	$3.00 / M tokens	$15.00 / M tokens	Yes

Yes, same model. Fast is just inference on hotter, more expensive hardware so the first token arrives sooner. There is no quality difference.

This matters because Fast is the default, and most people never change it. If you are running an agent loop that fires off 30 tool calls before producing 200 lines of code, Fast will burn through your monthly credits in days. Cursor doubled the included usage in the first week after launch (through ~May 25, 2026) to soften the rollout, but that promotion is over.

The pragmatic rule: use Standard everywhere unless you can feel the latency. Standard undercuts Opus 4.7 on output cost by a wide margin ($2.50/M tokens versus $25/M for Opus), which is the comparison actually worth running.
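To see what the tier choice does to a single agent run, here is a small cost sketch using the two price rows above. The token counts in the example are hypothetical, picked only to illustrate an input-heavy agent loop.

```python
# USD per million tokens, from the tier table above.
TIERS = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast": {"input": 3.00, "output": 15.00},
}


def run_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run on the given Composer 2.5 tier."""
    rates = TIERS[tier]
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000


# Hypothetical agent loop: 400k input tokens across tool calls, 20k output tokens.
standard = run_cost("standard", 400_000, 20_000)
fast = run_cost("fast", 400_000, 20_000)
print(f"standard: ${standard:.2f}, fast: ${fast:.2f}, ratio: {fast / standard:.1f}x")
```

Because both input and output rates differ by the same factor, the ratio is 6× regardless of the input/output mix; only the absolute bill depends on how chatty the loop is.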

How to set it up in Cursor

If you already have Cursor installed and up to date, this takes under five minutes.

1. Update Cursor. Composer 2.5 ships in Cursor 3.4+ (3.5 is the current release as of May 20, 2026). Cursor → Check for Updates. Quit and relaunch — the model picker does not refresh until you do.

2. Open the model picker. In the chat panel: click the model name at the bottom of the prompt input. In an inline edit (Cmd+K / Ctrl+K): same dropdown, top-left of the floating editor.

3. Select Composer 2.5. Open the model picker and choose Composer 2.5. Cursor loads the Fast variant by default — if you want Standard, switch to it explicitly before you start. See Cursor's model docs for the exact picker labels in your version, since they have shifted between point releases.

4. Default to Standard where you can. For Background and Cloud Agent runs, Settings → Models → Composer 2.5 is where you set the Standard variant as the default — that one change is usually most of the bill. For interactive chats, Cursor still falls back to Fast at session start, so the practical habit is to flip to Standard at the top of any chat you expect to run long. The "Auto + Composer" usage pool counts both tiers, so the choice only affects per-token cost, not your plan bucket.

5. Optional — write a Cursor Rule for the repo. Cursor rules live in .cursor/rules/*.mdc with frontmatter (description, globs, alwaysApply). They cannot pin a model, but they can nudge the agent's behavior. Example .cursor/rules/composer.mdc:

---
description: Conventions for Composer 2.5 in this repo
alwaysApply: true
---

Prefer Composer 2.5 Standard for refactors and long agent loops.
Reserve Fast for tight inline edits where latency dominates cost.

That is the whole setup. There is no API key to paste, no endpoint to configure — Composer 2.5 runs only through Cursor's backend. If you want to use Composer 2.5 from a script, you go through the Cursor CLI agent, and that still routes through Cursor's auth.

When to pick Composer 2.5 — and when not to

Composer 2.5 is strong at one specific shape of work: medium-length agent loops inside Cursor's UI, where the model is calling Cursor's tools (file edits, terminal, search) and reading back results. That is what the 25× synthetic task expansion was tuned for.

It is weak, or at least not the cheapest option, in three cases:

One-shot architectural questions. You want a 500-word design opinion on whether to extract a service, not a code change. Send it to Claude Opus 4.7 instead — it is better at this and you will spend a few cents, not a few dollars.
Long, terminal-heavy work. GPT-5.5 leads Terminal-Bench 2.0 by 13 points. If you are wiring up a deploy pipeline, GPT-5.4 Codex via Codex CLI is a real alternative.
Code review and PR triage. You are reading more than writing. Composer 2.5's Fast tier becomes a tax on reading. Use a cheaper model — Gemini 3.1 Flash or DeepSeek V4 Pro through a gateway — for the read pass, and reserve Composer 2.5 for the write pass.

A workflow several teams have settled into: Composer 2.5 Standard inside Cursor for inline edits and quick refactors, Claude Sonnet 4.6 (via Cursor's BYO path) for long agent runs that need stronger judgment, and Opus 4.7 (also BYO) for the genuinely hard architectural calls. We covered the BYO route in Cursor / Claude Code / Cline Custom API Setup — Composer 2.5 slots in next to those without conflict.

The Kimi K2.5 connection no one talks about

The base weights under Composer 2.5 are public. Moonshot's Kimi K2.5 is open-source, and you can hit it directly via the Kimi API — usually at roughly 1/5 the price of Composer 2.5 Standard. We have a full breakdown in Kimi K2.5 API: Pricing, Access, and Honest Benchmarks, including the gap between vanilla K2.5 and Cursor's post-trained version.

The gap matters. Cursor's 25× synthetic task RL adds something real — about 4–8 percentage points across our internal coding evals versus stock K2.5 — but it is not the magic the marketing suggests. If your use case is "long-horizon agent loops inside Cursor specifically," Composer 2.5 wins. If your use case is "give me a coding model I can hit from any client," stock K2.5 plus a thin agent harness gets you 90% of the way for a fraction of the cost.

This is the case-by-case decision. There is no universal winner.

The Cursor-without-Cursor escape hatch

For teams who want Cursor-style productivity but cannot stomach the Fast-tier pricing or the vendor lock-in, the practical answer is: keep Cursor for the editor, route the model traffic through a gateway.

Cursor supports an "Override OpenAI Base URL" field in Settings → Models. Point it at an aggregator that exposes Sonnet 4.6, GPT-5.4 Codex, Gemini 3.1 Pro, Kimi K2.5, and DeepSeek V4 Pro behind one OpenAI-format endpoint, and you can switch between them per-conversation without leaving Cursor. One caveat worth flagging up front: as of Cursor 3.5, the custom base URL is honored in the chat/planning panel (Cmd/Ctrl + L) but not in the agent loop — Composer-style runs still go through Cursor's own backend. We document this pattern in AI API Aggregation: Access Every Model from One Endpoint — the same pattern works for Claude Code and Codex CLI.

The split that has been working for most ofox.ai users on Cursor: Composer 2.5 Standard for the in-IDE agent flow, plus a BYO route for the heavy stuff. Total monthly bill stays well under $50 per developer, which is what Cursor cost before the Fast tier landed.

For the broader question of which model to pick for which task across the whole 2026 landscape, our Best LLM for Coding (Ranked by Real Use) and the Claude vs GPT vs Gemini comparison pillar carry the full picture. Composer 2.5 belongs in the conversation now — but it is one option, not the option.

Bottom line

Composer 2.5 is the best in-Cursor coding experience available today, and it is also the easiest model in 2026 to massively overpay for. Switch the default from Fast to Standard, pair it with a BYO route for the hard problems, and you get the upgrade without the bill shock.

Originally published on ofox.ai/blog.

Claude Opus 4.7 Keeps Failing in Production: Workarounds and a Migration Plan to 4.8

Owen — Fri, 29 May 2026 05:45:40 +0000

TL;DR. Claude Opus 4.7 logged two confirmed elevated-error windows on Anthropic's status page in the last week of May 2026 (May 22 and May 25), on top of a cluster of GitHub issues documenting a quality regression that appeared about a week after the April 16 launch — the same pattern Opus 4.6 hit in March. None of this is a deal-breaker on its own. It does mean every production system calling claude-opus-4-7 needs a retry strategy, a fallback model, and a migration plan to Opus 4.8 (released May 28, 2026 at the same $5/$25 price, 69.2% on SWE-bench Pro vs 4.7's 64.3%). What follows is the pattern that survives both failure modes, plus the rollout checklist for switching the underlying model without rewriting your code.

The real Opus 4.7 problem in May 2026 isn't that the model got worse. It's that the model got worse and the API got flakier and a better-priced replacement shipped, all in the same six-week window. You can't sit on a single-model assumption through that.

What "Failing" Actually Means on Opus 4.7

When developers say Opus 4.7 is "failing," they're usually conflating two unrelated things. Both are real, both need workarounds, but the fix is different for each.

Service-side failures are the ones the status page admits. The model returns a 5xx, a 529 (overloaded), or times out before the response arrives. These are recoverable by retrying or routing elsewhere. Per Anthropic's status page, Opus 4.7 had elevated error rates on May 22, 2026 (alongside Sonnet 4.6) and again from 06:30 to 10:30 UTC on May 25, 2026. Both windows were resolved without explanation beyond "investigating" → "monitoring" → "resolved."

Model-side regressions are the ones the status page doesn't mention. The API returns 200, but the answer is worse than what 4.7 was returning the week it launched. This is the GitHub issue #53459 pattern: sharp launch-week reasoning quality, then a silent slide within days toward what users describe as "Sonnet 4-level" behavior — surface pattern matching instead of architectural reasoning, walking back proposals without integrating objections, dropping CLAUDE.md instructions across consecutive turns. Issue #51440 frames it as "worse quality at higher token cost vs 4.6 for production coding workloads." Issue #52149 reports the effort setting silently downgrading mid-session even with thinking explicitly ON.

These cannot be retried away. Retrying gets you a different bad answer.

The third compounding factor: Opus 4.7 ships a new tokenizer that produces up to ~35% more tokens than 4.6 for the same prompt (see our Claude Max throttling postmortem for the receipts). So even when the model behaves, the per-task bill goes up. Combine that with a quality slide, and 4.7 in mid-May is genuinely returning fewer correct answers per dollar than the model it replaced.

The Specific Incidents (and What They Tell You)

Three datapoints anchor the timeline, in order. Worth listing them explicitly because most reliability discussions hand-wave through "the regression":

April 16, 2026. Anthropic adds a system prompt instruction to reduce verbosity on first-party surfaces (Claude.ai, Claude Code). This combines with other prompt changes to hurt coding quality. Reverted on April 20 per the April 23 postmortem. API consumers calling claude-opus-4-7 directly were not affected by the system prompt, but the postmortem confirms first-party surfaces saw the regression.
Roughly one week after the April 16 GA. The GitHub issue cluster begins — #53459 ("quality regression. Same pattern as 4.6 launch week degradation"), #51440, #52149. The pattern users describe is consistent across issues: launch-week 4.7 was excellent, week-two 4.7 was meaningfully worse. The issues request confirmation of serving-side changes (quantization, routing, speculative-decoding aggressiveness); Anthropic has not publicly confirmed any.
May 22 and May 25, 2026. Two elevated-error-rate incidents on the status page, both primarily affecting Opus 4.7. Neither was correlated by Anthropic with the model regression — they read as straightforward infrastructure overload during a high-demand week, possibly tied to the Opus 4.8 staging traffic.

What the timeline tells you: the failure modes are independent. A retry strategy survives the May 22/25 incidents but does nothing for the silent regression. A migration plan survives the regression but is overkill for a few hours of 5xx. Production needs both.

Workarounds That Survive Both Failure Modes

Here is the minimum viable pattern. None of this is new; it is just what the Opus 4.7 situation forces you to actually deploy.

Step 1 — Retry with backoff for 5xx and 529

The 529 overloaded code is the one Anthropic uses during incidents like May 22. It is genuinely transient; three attempts with exponential backoff (30 seconds, then 60) clear it most of the time:

import time, anthropic
from anthropic import APIStatusError

def call_with_retry(client, **kwargs):
    for attempt in range(3):
        try:
            return client.messages.create(**kwargs)
        except APIStatusError as e:
            if e.status_code in (429, 500, 502, 503, 504, 529) and attempt < 2:
                time.sleep(2 ** attempt * 30)  # 30s after the first failure, 60s after the second
                continue
            raise  # non-retryable status, or the third attempt also failed

Three attempts with exponential backoff are enough for ~95% of the May 2026 incident windows. Going beyond three is not free: the next layer (fallback to another model) is more useful than a fourth attempt.

Step 2 — Fallback chain across models

This is the layer that converts an Opus 4.7 outage into a brief quality degradation instead of a customer-facing failure. The cleanest implementation talks to a single OpenAI-compatible endpoint and swaps model IDs:

from openai import OpenAI, APIStatusError

client = OpenAI(base_url="https://api.ofox.ai/v1", api_key="your-ofox-key")
FALLBACK_CHAIN = ["anthropic/claude-opus-4.8",
                  "anthropic/claude-opus-4.7",
                  "anthropic/claude-sonnet-4.6"]

def call_with_fallback(messages):
    for model in FALLBACK_CHAIN:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APIStatusError as e:
            if e.status_code not in (429, 500, 502, 503, 504, 529):
                raise
    raise RuntimeError("all models in fallback chain failed")

Why put 4.8 first now: it scores higher on every published benchmark, costs the same, and wasn't the primary serving target during either May incident. Leaving 4.7 second gives you a sibling fallback inside the Claude family. Sonnet 4.6 third is the cross-tier emergency exit. It is slower and weaker than either Opus tier, but it stays up when both wobble at once.

If you cannot move off the Anthropic-native SDK yet, ofox also exposes an Anthropic-compatible endpoint at https://api.ofox.ai/anthropic so you only need to change the base URL, not the SDK.

Step 3 — Quality canaries for the silent regression

Retries and fallbacks do nothing for the issue-#53459 case. The only thing that catches a silent regression is a canary: a small set of known-answer prompts you replay against the model on a schedule and score automatically.

Three prompts are enough to start. Pick one architectural-reasoning task, one multi-turn instruction-following task, and one tool-call task. Run them daily and alarm on a 2-sigma drop in your pass rate. Anthropic does not publish a "quality has regressed" signal; you have to generate your own. The April 16 regression, reverted on April 20, would have surfaced in three days on a daily canary.

This is the same logic as a synthetic uptime check, applied to model behavior instead of HTTP 200s.
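
A minimal sketch of that canary check, assuming you already have some function that sends a prompt to the model and returns text. The names CanaryPrompt, pass_rate, should_alarm, and the ask_model callable are hypothetical, and the substring match is only a stand-in for whatever scoring fits your prompts:

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Sequence


@dataclass
class CanaryPrompt:
    name: str
    prompt: str
    expected_substring: str  # hypothetical scoring rule: the answer must contain this


def pass_rate(prompts: Sequence[CanaryPrompt], ask_model: Callable[[str], str]) -> float:
    """Replay every canary prompt and return the fraction that pass."""
    passed = sum(1 for p in prompts if p.expected_substring in ask_model(p.prompt))
    return passed / len(prompts)


def should_alarm(history: Sequence[float], today: float, sigmas: float = 2.0) -> bool:
    """Alarm when today's pass rate falls more than `sigmas` standard
    deviations below the mean of previous runs."""
    if len(history) < 2:
        return False  # not enough data for a baseline yet
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return today < baseline  # any drop from a perfectly stable baseline
    return today < baseline - sigmas * spread
```

With only three prompts the daily pass rate moves in coarse steps, so replaying each prompt several times per run is one way to get a smoother signal before trusting the 2-sigma threshold.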

The Migration Plan to Opus 4.8

Opus 4.8 shipped on May 28, 2026 at the identical $5 per million input / $25 per million output list price as 4.7. The model ID is anthropic/claude-opus-4.8. On the published benchmarks it is meaningfully ahead: SWE-bench Pro 69.2% vs 4.7's 64.3%, OSWorld-Verified 83.4% vs 82.8%, and 1890 Elo on Artificial Analysis's GDPval-AA real-work leaderboard (+137 over 4.7). Anthropic also claims 4.8 is four times less likely than 4.7 to miss flaws in code it produces, and uses roughly 35% fewer output tokens per agentic task. That second number matters more than the leaderboard one for most teams, because Opus 4.7's tokenizer is the thing that inflated those tokens in the first place.

When migrating is the right call. Almost always, with one exception: if your prompts depend on Opus 4.7's specific tool-call hesitancy — e.g., you rely on 4.7 not invoking a tool unless conditions are unambiguous — then 4.8's more eager tool calling will fire calls you didn't expect. Run a representative sample before flipping traffic.

The behavioral diff you actually need to test for. Three things changed materially between 4.7 and 4.8:

Effort default is now "high" on all surfaces, including Claude Code. If you were leaning on 4.7's medium default, you'll see longer responses and more thinking tokens until you set effort explicitly.
Less skipping of required tool calls. Net positive for agents, but it can surface bugs in tool schemas that 4.7 was politely ignoring.
Extended thinking budgets are still unsupported — use adaptive thinking. Same as 4.7, but worth re-confirming if you tried to hardcode a thinking budget.

Rollout pattern. Don't flip 100% of traffic at once. The cleanest pattern uses the fallback chain you already have:

Add anthropic/claude-opus-4.8 to your fallback chain as a secondary for one day. Capture how often it gets called and how the responses compare.
Promote 4.8 to primary for 10% of traffic (deterministic hash on session ID or tenant ID, so a given user has a consistent experience). Run your canaries against both.
Roll forward to 50%, then 100% over a week. Keep 4.7 in the chain as a sibling fallback until at least mid-June.
Before June 15, replace any references to Opus 4.6 in your fallback chain with Sonnet 4.6. Opus 4.6 is being deprecated on the direct Anthropic API.

The deprecation date is the only hard schedule constraint here. Everything else is your call on how aggressively to move.
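
The deterministic split in the 10% step can be sketched as follows. This is an illustration under assumptions, not a prescribed implementation: bucket and pick_primary are hypothetical helper names, and the key can be whatever session or tenant ID you already have.

```python
import hashlib

PRIMARY = "anthropic/claude-opus-4.8"
INCUMBENT = "anthropic/claude-opus-4.7"


def bucket(key: str) -> int:
    """Map a session or tenant ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_primary(key: str, rollout_percent: int) -> str:
    """Return the new model for `rollout_percent` of keys, the old one otherwise."""
    return PRIMARY if bucket(key) < rollout_percent else INCUMBENT
```

hashlib is used instead of the built-in hash() because Python salts string hashes per process, which would reshuffle users on every restart. Since the bucket is fixed per key, raising the percentage from 10 to 50 keeps everyone already on 4.8 where they are.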

A Reference Failover Pattern via ofox

If you want the whole pattern in one piece — retry + fallback chain + cross-vendor escape hatch — here is the shape it takes through an aggregator. The point is not that the aggregator makes Opus 4.7 more reliable; it doesn't. The point is that your fallback chain becomes a config change instead of a deploy.

from openai import OpenAI, APIStatusError
import time

client = OpenAI(base_url="https://api.ofox.ai/v1", api_key="your-ofox-key")
CHAIN = ["anthropic/claude-opus-4.8",       # primary
         "anthropic/claude-opus-4.7",       # sibling fallback
         "openai/gpt-5.5",                   # cross-vendor escape
         "anthropic/claude-sonnet-4.6"]      # capacity floor

def robust_completion(messages, max_retries_per_model=3):
    for model in CHAIN:
        for attempt in range(max_retries_per_model):
            try:
                return client.chat.completions.create(model=model, messages=messages)
            except APIStatusError as e:
                if e.status_code not in (429, 500, 502, 503, 504, 529):
                    raise
                if attempt < max_retries_per_model - 1:
                    time.sleep(2 ** attempt * 30)
    raise RuntimeError("all models exhausted")

Two things this gets you that a direct Anthropic integration doesn't, in the context of the May 2026 Opus 4.7 problems:

One key, one SDK exposing Opus 4.7, Opus 4.8, and a cross-vendor fallback. See our API aggregation guide for the broader rationale.
A vendor-independent escape hatch. When both Opus tiers wobble at once (as happened May 22), having GPT-5.5 or Gemini 3.1 Pro in the chain matters. Pricing on the cross-vendor models is in our flagship comparison and model comparison guide. The procurement loop required to add a second billing relationship during an active outage is its own reliability story; the gateway makes it a config change instead.

Honest framing of what this pattern is and isn't. It helps on May 22 and May 25 the same way it helps during any single-vendor incident. It does not catch the issue-#53459 regression — that's what the canary in step 3 is for. Same logic as the hybrid routing pattern for Claude Code itself: the routing is the part you control, the upstream model is the part you can't.

What to Actually Do This Week

If you're running production traffic on claude-opus-4-7 today and reading this on May 29, 2026, the order of operations is:

Today: add retries on 5xx and 529 if you don't have them. This alone covers ~95% of recent incidents.
This week: stand up the fallback chain. anthropic/claude-opus-4.8 → anthropic/claude-opus-4.7 → anthropic/claude-sonnet-4.6 is a reasonable starting point.
This week: add three canary prompts on a daily schedule. You won't catch the next regression without them.
Next two weeks: run a 10% canary on Opus 4.8, watch the canary scores, then promote.
Before June 15: replace any hard-coded references to Opus 4.6 on the Anthropic-direct API. Either pin through an aggregator that retains older versions or move that slot to Sonnet 4.6.

For deeper background on what changed between 4.6 and 4.7 specifically, see our Opus 4.7 API review and the Claude API pricing breakdown. For the broader pattern of which model wins which workload, the best LLM for coding ranked by real use post is the one to read after this.

The most expensive Opus 4.7 production failure isn't the one you noticed — it's the one your retry strategy turned into a 200 OK with a worse answer, and no canary to flag it.

Reliability work in 2026 is not "pick the best model." It is "pick a fallback chain, instrument it, and notice when the model under it gets quietly worse." Opus 4.7 is just the model that made that lesson concrete.

Originally published on ofox.ai/blog.

Claude Opus 4.8: Benchmarks, Fast Mode, and What Actually Changed

Owen — Thu, 28 May 2026 18:29:04 +0000

TL;DR — Anthropic shipped Claude Opus 4.8 on May 28, 2026, at the same $5/$25 price as 4.7. It tops Artificial Analysis's GDPval-AA real-work leaderboard at 1890 Elo (+121 over GPT-5.5, +137 over 4.7), hits 69.2% on SWE-bench Pro, and does it using ~35% fewer output tokens than 4.7.

What Anthropic Shipped

Claude Opus 4.8 launched May 28, 2026, at the same list price as Opus 4.7: $5 per million input tokens, $25 per million output tokens. It has a 1M-token context window by default on the Claude API (200K on Microsoft Foundry) and a 128K maximum output.

The key differentiator is that it achieves superior performance while reducing token consumption compared to its predecessor.

The GDPval-AA Result

Opus 4.8 (max effort) debuts at 1890 Elo, pulling 121 points clear of GPT-5.5 in second place and +137 over its own predecessor.

Independent evaluation from Artificial Analysis tested models on real economic work tasks across 44 occupations, providing each with shell access and web browsing capabilities within an agentic loop.

Opus 4.8 reached this score using 15% fewer turns and 35% fewer output tokens per task than Opus 4.7.

Benchmarks vs. the Field

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-bench Pro	69.2%	64.3%	58.6%	54.2%
OSWorld-Verified (computer use)	83.4%	82.8%	78.7%	76.2%
Terminal-Bench 2.1	74.6%	66.1%	78.2%	70.3%
Humanity's Last Exam (with tools)	57.9%	—	—	—
Finance Agent v2	53.9%	—	—	—
GDPval-AA (Elo)	1890	1753	1769	—

GPT-5.5 still wins Terminal-Bench 2.1 (78.2% vs 74.6%). If your workload is heavy on raw terminal command sequences, that's a real data point, not a rounding error.

What's New Under the Hood

Fast Mode. A research preview that serves the same Opus 4.8 model at up to 2.5x higher output tokens per second, at premium pricing.

Mid-conversation system messages. Users can now insert system messages after user turns, preserving prompt-cache hits on earlier turns and reducing input costs.

Adaptive thinking, effort default high. Use thinking: {"type": "adaptive"} and the effort parameter instead of extended thinking budgets.

Better tool triggering and compaction. Improvements in long-horizon agentic coding with fewer compactions and better recovery.

Prompting Opus 4.8: What Actually Changed

Effort is now the main dial. Start at xhigh for coding and agentic use cases, and keep a minimum of high for anything intelligence-sensitive.

It follows instructions literally. The model won't silently generalize instructions or infer unstated requests.

It favors reasoning over tool calls. Raising effort to high/xhigh produces substantially more tool use.

The code-review recall trap. Opus 4.8 is genuinely better at finding bugs (higher precision and recall in Anthropic's evals), but if your review harness says "only report high-severity issues" or "be conservative," 4.8 follows that more faithfully than older models, so it can report fewer findings even though it detects more.

The "Most Honest" Claim

Anthropic positions Opus 4.8 as having fewer confident fabrications, less sycophancy, and clearer refusals.

Launched Alongside: Dynamic Workflows in Claude Code

Dynamic workflows let Claude orchestrate tens to hundreds of parallel subagents in a single session.

The featured example involved porting Bun from Zig to Rust — roughly 750,000 lines of code, with a 99.8% test-suite pass rate, in 11 days.

Two limitations: it's plan-gated (dynamic workflows run on Claude Code Max, Team, and Enterprise plans), and token consumption is substantially higher than a normal session.

How to Access Opus 4.8 via ofox.ai

The model ID is anthropic/claude-opus-4.8 on the same OpenAI-compatible endpoint, with no separate billing. The example here goes through the Anthropic-compatible endpoint with the Anthropic SDK instead:

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.ofox.ai/anthropic",
    api_key="your-ofox-key",
)

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    thinking={"type": "adaptive"},
    messages=[{"role": "user", "content": "Audit this service for race conditions..."}],
)

Verdict

Opus 4.8 is the rare upgrade with no asterisk on price: same $5/$25, higher scores across coding and computer-use, top of the independent real-work leaderboard, and fewer output tokens per task.

Originally published on ofox.ai/blog.

Codex Mobile App: Monitor & Control Your AI Coding Agent from iPhone or Android (2026)

Owen — Thu, 28 May 2026 14:16:09 +0000

TL;DR: On May 14, 2026, OpenAI rolled Codex into the ChatGPT mobile app on iOS and Android, available in preview on every plan including Free. You scan a QR code from your Mac, and then you can review diffs, approve commands, switch models, and dispatch new tasks from your phone. The worker process still has to be macOS — Windows support is promised but undated.

Three months after Anthropic put Claude Code in your pocket, OpenAI shipped the same idea — except Codex Mobile only talks to a Mac. If you live in Windows or Linux, you are watching from the bleachers until "soon."

Codex itself is no longer a small experiment. OpenAI says more than four million developers use it weekly, and the missing piece for most of them was not raw capability — it was a way to glance at a running task while away from the desk. The mobile launch closes that gap.

What Codex Mobile actually does

Codex Mobile is a control surface for a Codex worker that runs somewhere else. The "somewhere else" is currently a Mac — your desktop, a Mac mini, or a remote Mac you have SSH'd into — and the phone is the window onto it. You do not run the model on the device. The device runs the review.

Concretely, OpenAI describes the experience this way: "From your phone, you can work across all of your threads, review outputs, approve commands, change models, or start something new." In practice that means four things:

Live observation. Screenshots, terminal output, diffs, test results, and approval prompts stream into the app in real time, with the worker's permissions and credentials staying on the host machine.
Approvals. When Codex hits a command that needs sign-off — running migrations, touching production config, deleting files — you get a card you can approve or reject from the lock screen.
Dispatch. You can open a new thread, attach context, choose between models (including the gpt-5.3-codex variant available through OpenAI's own product surface), and start work without going back to the Mac.
Thread continuity. Whatever Codex is doing on your desktop is reachable on your phone, and vice versa, including across multiple concurrent sessions.

This is not a tiny IDE in your pocket. It is the manager view of a coding agent — the part that used to require you to be back at your desk to unblock the agent on a permission prompt.

Setup: from QR code to first approved diff

The pairing flow is deliberately frictionless because it had to be — most developers will try this once on a coffee break, and a bad first run kills the habit.

On your Mac you launch Codex in whatever surface you already use (CLI, the desktop app, or the Chrome extension that shipped alongside it). Mobile linking lives in the same Codex settings panel. It prints a QR code. You open ChatGPT on the phone, point the camera at the screen, and the two halves of the same Codex session are now talking. There is no separate account, no extra API key, no provisioning step.

After that, every thread on the Mac is reachable on the phone within seconds. If you start a long-running refactor before walking to lunch, the agent's progress arrives as notifications — and any permission prompts surface as actionable cards.

It is worth noting how much of this design borrows from the Codex CLI workflow that has been the daily driver for a year now. The mental model is the same — agent runs, you supervise, you approve — but the supervisor seat has moved off the chair and onto the phone.

What you can actually do from your phone

Mobile is fine for a surprising amount of real work, and bad for a few things. The honest list, after using it for two weeks:

Genuinely useful on mobile:

Approving migrations the agent has staged but won't run without sign-off.
Reading a generated test failure and asking Codex to try a different fix.
Watching a long build or test suite without keeping your laptop open.
Spinning up a fresh thread from a screenshot or a bug report someone pasted in Slack.
Switching the model mid-task — for example bumping a thread from a cheaper variant up to the strongest coding model when you realize the work is harder than expected.

Painful on mobile:

Anything that requires reading more than ~80 lines of diff at a time. The pinch-to-zoom on a code block is workable, not pleasant.
Heavy multi-file refactor planning. You will want a real screen for that.
Anything that requires you to type more than a paragraph of clarification. Voice input helps, but only somewhat.

If you have spent any time choosing between Claude Code, Codex, Cursor, and DeepSeek-TUI, you already know the right framing here: each tool occupies a slightly different point in the who-watches-the-agent design space. Mobile pushes Codex further into the "asynchronous, supervisor-style" corner.

The honest limitations today

The preview ships with three real ceilings, and pretending otherwise would be silly.

It only talks to a Mac. Today, the worker has to be running on macOS. Windows is explicitly listed as "coming soon" by OpenAI, with no committed date. Linux is not mentioned at all in the launch material. If your daily driver is a Linux workstation, the only honest workaround right now is a Mac mini sitting somewhere on your network — which is not a small ask just for mobile parity.

Network dependence is harder than it looks on paper. The mobile experience leans on a persistent, low-latency connection to your worker host. Spotty café Wi-Fi turns "approve and continue" into "wait 9 seconds, approve, wait 7 seconds, see if it worked." Worth knowing before you plan a flight around it.

There is no offline mode. If your Mac is asleep or the host is down, the mobile app shows you the last known state and nothing else. Set your Mac to never sleep, or use Remote SSH with a server you control. (Remote SSH and Hooks both ship on every plan, per OpenAI's launch notes.)

These are not deal-breakers, but they are the difference between "I can use this on a real day" and "I demoed this once at a conference."

How it compares to Claude Code Remote Control

OpenAI is not first to this idea. Anthropic shipped Remote Control for Claude Code in February 2026, with broadly the same shape — phone observes, desktop works, approvals arrive as cards.

The differences that matter day to day:

Capability	Codex Mobile (May 2026)	Claude Code Remote Control (Feb 2026)
Worker OS	macOS only	macOS, Linux, Windows
Setup	QR code from ChatGPT app	Session URL + QR code from `claude remote-control`
Free tier	Yes (Free + Go plans)	No (requires Claude Pro or higher)
Push approvals	Yes, lock-screen cards	Yes
Model switching mid-task	Yes (between OpenAI's Codex-family models)	Yes (between Sonnet, Opus, Haiku)
Multi-thread view	Yes, across desktop and mobile	Yes
Voice dispatch	Indirect (via ChatGPT voice mode)	Limited

Anthropic got there first with the cleaner cross-platform story. OpenAI replied with broader plan coverage (Free tier) and a narrower host requirement. Mobile is now table stakes for any serious coding agent. The choice has shifted from whether you adopt the pattern to which agent you trust enough to hand the lock-screen approvals to.

Where ofox.ai actually fits — and where it doesn't

Worth being precise about, because the question comes up every time a new ChatGPT feature drops.

Codex Mobile is not something you can wire to ofox.ai. It is the ChatGPT product, tied to your OpenAI account, billed through OpenAI's plans. There is no "use my API key instead" toggle. The mobile experience is a feature of the OpenAI consumer surface, not a model endpoint.

Where ofox.ai is genuinely useful around this: the same gpt-5.3-codex model that powers Codex inside ChatGPT is also accessible as an OpenAI-compatible API endpoint through ofox.ai's unified gateway. That matters if you want to build a different mobile experience — your own internal tool, a Telegram bot, a Slack workflow — where Codex-quality coding sits behind your own UI and you do not want to be locked to ChatGPT as the only client. The model is the same; what's different is who owns the wrapper.

If you are running Codex CLI through a custom provider configured for ofox.ai, the new mobile app does not affect your setup at all. Your CLI keeps doing what it does; the mobile feature is a parallel surface that just happens to share branding.

When mobile monitoring actually helps

Some honest use cases from the first two weeks of using this:

Long migrations. Kick off a database refactor before a 90-minute meeting, approve the staged commands during the break, walk back to a green test suite.
Cross-timezone handoff. Leave a Codex task running overnight, approve any blockers from a phone over breakfast.
Code review while commuting. Reviewing the diffs Codex generated on an earlier task — not as good as on a 27-inch screen, but good enough to clear small ones.
Production approvals. When a deploy script needs explicit human sign-off, getting that sign-off without having to be at your desk.

What it is not good for: writing the actual code. If you are using your phone to type instructions longer than two sentences, you are using the wrong tool. The mental model is "the agent works, you supervise" — and supervision is exactly the kind of work that a phone is genuinely fine for.

Your phone isn't replacing your IDE. It is replacing the chair you would otherwise have to sit in while the agent runs — and the developers who learn to trust that asynchronous loop will ship more than the ones still tethered to a desk.

Agentic coding has finally outgrown the assumption that the human has to be physically present while the agent works. That assumption was always a bit silly — nobody sits next to a CI runner watching it spin — and Codex Mobile is the first OpenAI product that admits it on the product surface. Whether you adopt it now or wait for Windows parity, the asynchronous loop it normalizes is where most coding work is heading.

Sources

OpenAI: Work with Codex from anywhere — official launch post, May 14, 2026
TechCrunch: OpenAI says Codex is coming to your phone
9to5Mac: OpenAI brings Codex to ChatGPT for iPhone and Android
Engadget: OpenAI brings its Codex coding app to mobile

Originally published on ofox.ai/blog.

Codex Goal Mode & Remote Computer Use: How OpenAI's Agent Can Code for Days

Owen — Thu, 28 May 2026 02:23:33 +0000

TL;DR

On May 21, 2026, OpenAI moved two Codex features to general availability: Goal Mode (a persistent /goal directive that survives session breaks and budget resets) and Locked Computer Use (the desktop agent continues driving Mac apps after screen lock). Combined with gpt-5.3-codex and verifiable success criteria, these let engineers delegate real objectives like "ship the v2 checkout endpoint with the benchmark green" and walk away. The breakthrough isn't longer prompts: a coding agent can now treat time as a budgetable resource rather than something that requires constant supervision.

Both features shipped in Codex CLI 0.133.0 and matching IDE and desktop builds. After a week of running Goal Mode against production repositories, the gap between demo and practical utility turns out to depend on how the goal is structured, not on how patient you are.

What Goal Mode Actually Changes About Your Prompt

Goal Mode replaces per-turn instructions with a persistent objective that Codex re-evaluates each cycle. The command interface is minimal:

# Set or replace the active goal
/goal Reduce p95 checkout latency below 120 ms on the checkout
      benchmark while keeping the correctness suite green

/goal           # view current goal
/goal pause     # stop the loop, keep the state
/goal resume    # pick back up where it stopped
/goal clear     # discard the goal entirely

Goal structure matters more than wording. The OpenAI cookbook recommends: <desired end state> verified by <specific evidence> while preserving <constraints>—three mandatory slots in that order.

What Fails vs. What Works

Ineffective:

/goal Make the code more elegant

Effective:

/goal Migrate this codebase from Pydantic v1 to v2, verified by
      `pytest -q` exiting 0 and `mypy --strict src/` exiting 0,
      while preserving all public API signatures listed in
      docs/public_api.md

The second version gives Codex measurable targets. The agent writes, runs the suite, reads diffs between expected and actual, revises, and stops when both commands exit zero—or surfaces blockers it cannot overcome.
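
To make the three-slot template concrete, here is a small sketch that assembles a goal string and refuses to build one when a slot is missing. build_goal is a hypothetical helper, not part of Codex, and it assumes every piece of evidence is a shell command expected to exit 0:

```python
def build_goal(end_state: str, evidence: list[str], constraints: str) -> str:
    """Assemble '/goal <end state>, verified by <evidence>, while preserving <constraints>'."""
    if not end_state.strip() or not evidence or not constraints.strip():
        raise ValueError("all three slots are required")
    # Each evidence entry is treated as a command that must exit 0.
    checks = " and ".join(f"`{cmd}` exiting 0" for cmd in evidence)
    return (
        f"/goal {end_state.strip()}, verified by {checks}, "
        f"while preserving {constraints.strip()}"
    )


print(build_goal(
    "Migrate this codebase from Pydantic v1 to v2",
    ["pytest -q", "mypy --strict src/"],
    "all public API signatures listed in docs/public_api.md",
))
```

A goal like "Make the code more elegant" fails this check immediately because there is no evidence to list, which is the same reason it fails inside the loop.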

Stopping conditions are explicit: success, /goal pause, /goal clear, user interruption, a repeated unresolvable blocker, or usage limit exhaustion. Nothing else terminates the loop, making verifiable success criteria more critical than before—without them, the loop only stops on cost constraints.

"Code for Days" Means Something Specific

The phrase "code for days" doesn't mean one continuous uninterrupted session. Goal Mode persists objectives across:

Session breaks: Close the terminal, return tomorrow, run /goal resume, and the agent continues from the last verified state
Token budget resets: When rolling budgets roll over (daily for most plans), the active goal survives and work continues
Interruptions: Ctrl-C, app crashes, Mac restarts—the goal is journaled to disk; Codex 0.133+ rehydrates it on next launch

This creates a multi-session objective layer. A migration that used to consume three afternoons of one-shot prompts now runs as one coherent thread. The cost model is unchanged: every reasoning turn is billed at the same per-token rate against gpt-5.3-codex. What drops to nearly zero is the coordination cost, which is where most of the wall-clock savings come from.

Real-World Testing

Testing against a production repo migration (Pydantic v1 → v2 on a 14k-line internal service) showed:

Total wall time: approximately 31 hours across four sessions
Total Codex token spend at gpt-5.3-codex rates: roughly $44
Hand-prompting the same task would have required two full focused days of supervision
Actual engagement: three check-ins

Locked Computer Use: The Controversial Half

Computer Use shipped earlier in 2026—Codex could operate GUI apps when the Mac was unlocked and monitored. The May 21 update added:

Continued operation after screen lock: Goal Mode loops driving desktop apps don't stall when screensaver activates
Mobile triggering: Hand the agent tasks from your phone to drive the Mac left at your desk

Safety Model

Enabling Locked Use installs an Apple authorization plugin that participates in the macOS unlock flow:

The Mac unlocks temporarily, but the display stays covered: the lock screen remains visible while Codex operates in the background
Authorization windows are short-lived and scoped to the current unlock attempt; no standing grants exist
Any keyboard, trackpad, or mouse contact immediately relocks the Mac and disables auto-unlock until you unlock it manually
Codex asks before operating each new app; you can mark frequently used apps "Always allow"
Codex cannot drive Terminal apps, Codex itself, or system admin prompts; these hard-coded exclusions prevent privilege escalation through GUI automation

Launch Availability & Restrictions

The feature is unavailable in the EEA, UK, and Switzerland at launch. Apple's automation policy blocks several app categories regardless of user settings.

If regular Computer Use isn't enabled, grant Screen Recording and Accessibility permissions to Codex through System Settings first. The plugin install adds only the locked-screen layer.

A Real Goal Mode Loop, End to End

Starting in your project root:

$ cd ~/work/orders-service
$ codex
# Inside the TUI:
> /goal Migrate this codebase from Pydantic v1 to v2, verified by
        `pytest -q` exiting 0 and `mypy --strict src/` exiting 0,
        while preserving all public API signatures in docs/public_api.md

Codex acknowledges the goal, runs initial scans, and proposes a plan. From here you can:

Walk away—the loop runs until success, blocker, or budget exhaustion
Hand off to Locked Computer Use for GUI steps (migration wizards, CI dashboard screenshots, etc.) and lock your Mac
Trigger status checks from Codex Mobile while away from the laptop

Returning later, /goal shows current state: what's verified, what's pending, last blockers. /goal pause lets you intervene without losing context.

Recommended Starter Configuration

Add to ~/.codex/config.toml:

model = "gpt-5.3-codex"
model_provider = "ofox"      # or "openai" if going direct

[model_providers.ofox]
name = "ofox.ai"
base_url = "https://api.ofox.ai/v1"
env_key = "OFOX_API_KEY"
wire_api = "responses"

Goal Mode exposes no per-session token or iteration caps in config.toml. The documented stopping levers are slash commands (/goal pause, /goal clear), detected repeated blockers, and your plan's usage limit, so the practical control is the usage cap on whichever provider you select. At gpt-5.3-codex rates of $1.75 input / $14 output per million tokens, a single multi-hour session dominated by output tokens can easily run $30-80, which makes your account cap the actual budget guardrail.
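
For a rough sense of where a session lands against that cap, the arithmetic can be sketched as below. The rates are the ones quoted above; the token counts are a hypothetical session, and the function names are illustrative only. Cached-input discounts are ignored.

```python
INPUT_USD_PER_M = 1.75   # gpt-5.3-codex input rate quoted above
OUTPUT_USD_PER_M = 14.0  # gpt-5.3-codex output rate quoted above


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a session at the quoted list rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def within_budget(input_tokens: int, output_tokens: int, cap_usd: float) -> bool:
    """True if the estimated cost fits under the account cap."""
    return session_cost(input_tokens, output_tokens) <= cap_usd


# Hypothetical session: 4M input tokens, 2.5M output tokens.
print(session_cost(4_000_000, 2_500_000))  # 42.0
```

In this made-up session the output side accounts for $35 of the $42, which is why output volume, not prompt size, is the number to watch.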

Why Route Codex Through ofox.ai

Goal Mode hammers the model—multi-day objectives routinely make hundreds of reasoning turns with bills dominated by gpt-5.3-codex output tokens at $14/M. Three reasons to pipe requests through a unified gateway instead of directly to OpenAI:

Single key for side models: Goal loops typically delegate cheap sub-tasks (summarization, classification, regex generation) to smaller models. One ofox.ai key routes the hot path to gpt-5.3-codex and cold path to gpt-5.4-mini or deepseek-v4-flash without juggling credentials
Per-goal spend visibility: Tag sessions with custom headers; the dashboard shows per-goal cost, not per-day. Useful when determining whether a Pydantic migration justified its expense
Failover on outages: Long-horizon goals get burned by brief provider blips. ofox falls back automatically; direct OpenAI keys error out and force /goal pause until recovery

When NOT to Use Goal Mode

Three disqualifiers:

Cannot write verification commands: If success means "feels right" or "more elegant," Goal Mode either declares premature victory or churns indefinitely. Use one-shot prompts instead
Work needs frequent human judgment: Goals target autonomy. If every change needs approval, you pay for unused context. Run one-shot sessions instead—cheaper, faster
Destructive work at scale: Database migrations, git push --force, anything that touches production. Goal Mode excels at unattended convergence but lacks judgment about when not to act. If you must use it here, sandbox agents to worktrees, set approval_policy so shell commands require approval, and prefer goals with dry-run verification over live mutations

The Shape of the Next Year

Goal Mode plus Locked Computer Use represents the first credible "set a goal, lock your laptop, check tomorrow" coding loop for production use. The agent isn't smarter than last month—friction simply vanished, changing which engineering tasks merit delegating to models. A coding agent surviving screen locks, budget resets, and dinner breaks differs fundamentally from one requiring constant supervision.

The important caveat: hours of attended Goal Mode work is reliable today, but fully unattended multi-day work still depends on how verifiable the goal is. The discipline of writing goals with real evidence surfaces is now the critical skill, superseding single-turn prompt craft.

Sources & Further Reading

Codex Changelog — May 2026 — official release notes for Goal Mode GA and Locked Computer Use
Using Goals in Codex — cookbook with goal syntax and worked examples
Computer Use — Codex App — official safety model and platform constraints
MacRumors: Codex Can Use Your Mac When Locked — independent writeup of the unlock flow
GPT-5.3-Codex on OpenRouter — pricing and context window reference

Originally published on ofox.ai/blog.

Codex CLI config.toml Deep Dive: Every Setting Explained

Owen — Wed, 27 May 2026 14:23:22 +0000

TL;DR. Codex CLI's config.toml has grown past 150 documented keys across sandbox, approvals, permissions, MCP, providers, TUI, hooks, telemetry, and feature flags — most users only edit five of them, and miss the ones that actually matter (granular approvals, permission profiles, shell_environment_policy, features.network_proxy). This deep dive walks every section, calls out the surprising defaults, and ends with a layered config you can paste and trim. The default ~/.codex/config.toml is empty for a reason: Codex ships sensible defaults, but the moment you put Codex in a sandbox tighter than your shell or a model cheaper than the flagship, you'll touch ten settings — and seven of them aren't in any blog post.

Where the file lives, and what gets ignored

User-level config lives in $CODEX_HOME/config.toml, which defaults to ~/.codex/config.toml on macOS and Linux and %USERPROFILE%\.codex\config.toml on Windows. Project-scoped overrides go in .codex/config.toml at the project root.

The merge is layered: managed config (admin-pushed) → user config → project config → CLI flags. Profiles slot in between user and project config when --profile NAME is passed. For safety, a set of keys is deliberately ignored in project-local files; they are silently dropped if you put them there:

openai_base_url, chatgpt_base_url
model_provider, model_providers
notify, profile, profiles
approval_policy, sandbox_mode, sandbox_workspace_write.*
experimental_realtime_ws_base_url, otel.*, apps_mcp_product_sku

If your project config "doesn't seem to take effect" for one of those, move it to ~/.codex/config.toml. This is the single most common WTF on the Codex CLI Discord.
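A minimal Python sketch of that behavior, not Codex's actual loader: it takes a parsed project-local config (the dict you would get from a TOML parser), drops the keys listed above, and reports what was dropped. The function name is hypothetical, and "project values override user values" is an assumption read off the layering order.

```python
# Top-level keys the list above marks as ignored in project-local config.
# Entries written as "otel.*" map to the whole "otel" table in a parsed file.
PROJECT_IGNORED_KEYS = {
    "openai_base_url", "chatgpt_base_url",
    "model_provider", "model_providers",
    "notify", "profile", "profiles",
    "approval_policy", "sandbox_mode", "sandbox_workspace_write",
    "experimental_realtime_ws_base_url", "otel", "apps_mcp_product_sku",
}


def split_project_config(project_cfg: dict) -> tuple:
    """Return (kept settings, sorted names of dropped keys)."""
    kept = {k: v for k, v in project_cfg.items() if k not in PROJECT_IGNORED_KEYS}
    dropped = sorted(k for k in project_cfg if k in PROJECT_IGNORED_KEYS)
    return kept, dropped


user_cfg = {"model": "gpt-5.4", "model_provider": "openai"}
project_cfg = {"model": "gpt-5.4-mini", "model_provider": "ofox"}

kept, dropped = split_project_config(project_cfg)
effective = {**user_cfg, **kept}  # assumption: project layer wins over user layer
print(effective)  # {'model': 'gpt-5.4-mini', 'model_provider': 'openai'}
print(dropped)    # ['model_provider']
```

Running something like this over a repo's .codex/config.toml is a quick way to explain a setting that "doesn't take effect".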

The five keys most users actually set

model            = "gpt-5.4"          # or any id your provider exposes
model_provider   = "openai"           # built-in: openai, ollama, lmstudio
approval_policy  = "on-request"       # untrusted | on-request | never | { granular = {...} }
sandbox_mode     = "workspace-write"  # read-only | workspace-write | danger-full-access
file_opener      = "vscode"           # vscode | vscode-insiders | windsurf | cursor | none

That's the 80% config. Everything below this section either tightens the sandbox, swaps the provider, layers a profile, or wires in MCP/hooks/OTEL.

A note on the model field: Codex's default refreshed to gpt-5.4 recently, and gpt-5.5 is currently surfaced through ChatGPT-login workflows in the TUI's composer. For API-key workflows the available IDs vary by provider; check codex models (or your provider's catalog) before pinning a value. The Codex CLI ships a built-in catalog plus the optional model_catalog_json key for loading your own JSON catalog on startup.

Sandbox and approvals — get this pair right or nothing else matters

sandbox_mode is what Codex is technically allowed to touch. approval_policy is when Codex asks you first. They compose, and they default to safe-but-annoying.

sandbox_mode

Value	Filesystem	Network
`read-only`	Read everywhere, write nowhere	Blocked
`workspace-write`	Write inside cwd + `$TMPDIR` + `/tmp`	Blocked by default
`danger-full-access`	Whatever your user can do	Whatever your user can do

Most teams should sit in workspace-write. The under-documented controls live under [sandbox_workspace_write]:

[sandbox_workspace_write]
writable_roots         = ["~/work/notes"]   # extra dirs beyond cwd
network_access         = false              # allow outbound HTTP inside sandbox
exclude_tmpdir_env_var = false              # drop $TMPDIR from writable set
exclude_slash_tmp      = false              # drop /tmp from writable set

network_access = true is the toggle people miss when their pip install or npm install mysteriously hangs.
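The writable set is easier to reason about as code. The sketch below approximates the workspace-write rules from the table and the [sandbox_workspace_write] keys; it is an illustration, not the sandbox implementation. It assumes absolute POSIX paths that are already expanded (no "~") and does not resolve symlinks or ".." segments, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def _is_inside(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True when path equals root or sits somewhere below it."""
    return path == root or root in path.parents


def is_writable(path, cwd, tmpdir=None, writable_roots=(),
                exclude_tmpdir_env_var=False, exclude_slash_tmp=False) -> bool:
    """Approximate the workspace-write writable set described above."""
    roots = [PurePosixPath(cwd)]
    roots += [PurePosixPath(r) for r in writable_roots]
    if tmpdir and not exclude_tmpdir_env_var:
        roots.append(PurePosixPath(tmpdir))
    if not exclude_slash_tmp:
        roots.append(PurePosixPath("/tmp"))
    target = PurePosixPath(path)
    return any(_is_inside(target, root) for root in roots)


cwd = "/home/me/proj"
print(is_writable("/home/me/proj/src/a.py", cwd))               # True
print(is_writable("/home/me/.ssh/id_ed25519", cwd))             # False
print(is_writable("/tmp/build.log", cwd, exclude_slash_tmp=True))  # False
```

The last call shows why exclude_slash_tmp matters: with it set, a tool that writes scratch files to /tmp starts failing even though nothing else changed.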

approval_policy

untrusted asks before almost everything. on-request asks when Codex hits something the sandbox blocks. never is fully autonomous (and shouldn't be paired with danger-full-access unless you really mean it).

For finer control, use the table form:

[approval_policy.granular]
sandbox_approval     = true   # let Codex ask to escalate beyond sandbox
request_permissions  = true   # let the request_permissions tool prompt
rules                = true   # respect execpolicy prompt rules
skill_approval       = true   # ask before running skill scripts
mcp_elicitations     = false  # mute MCP-driven prompts

This is how you say "you may ask to escalate the sandbox, but stop asking me to confirm individual MCP elicitations." Most users won't need it, but for unattended runs in CI it matters.

The companion key approvals_reviewer selects who handles eligible prompts: user (default) or auto_review (which delegates to a configured reviewer agent).

Reasoning, verbosity, and plan mode

Four keys, all model-dependent. Use them with GPT-5 family models; older/non-reasoning models ignore them.

model_reasoning_effort     = "medium"  # minimal | low | medium | high | xhigh
model_reasoning_summary    = "auto"    # auto | concise | detailed | none
model_verbosity            = "medium"  # low | medium | high  (GPT-5 Responses API)
plan_mode_reasoning_effort = "high"    # override applied only in /plan mode

xhigh exists but burns tokens; reserve it for the worst plan-mode problems. hide_agent_reasoning = true suppresses reasoning events in the TUI and codex exec output without changing what the model actually computes, which is useful for screenshots, log piping, and pair-programming sessions where the unedited chain-of-thought is more distracting than helpful. show_raw_agent_reasoning = true does the inverse: it surfaces the raw reasoning content from the model when the provider exposes it.

model_supports_reasoning_summaries is a force-override (true/false) for whether Codex sends reasoning metadata at all. Leave it unset unless you're debugging a custom provider that lies about its capabilities.

Permissions profiles — the modern way to scope access

The newer [permissions.NAME] block is more expressive than sandbox_workspace_write and is the way Codex is moving. You define named profiles (:read-only, :workspace, :danger-full-access ship built-in) and select one with default_permissions = "my-profile".

[permissions.scoped]

[permissions.scoped.workspace_roots]
"~/code/oss"     = true
"~/code/clients" = true

[permissions.scoped.filesystem]
glob_scan_max_depth = 3
".env"            = "deny"
"**/.git/**"      = "deny"
"~/.ssh/**"       = "deny"

[permissions.scoped.filesystem.":workspace_roots"]
"."        = "write"
"**/*.env" = "deny"

[permissions.scoped.network]
enabled              = true
mode                 = "limited"     # limited | full
allow_local_binding  = false

[permissions.scoped.network.domains]
"api.openai.com"      = "allow"
"api.ofox.ai"         = "allow"
"github.com"          = "allow"
"*.internal.corp"     = "deny"

A few things worth knowing:

The :workspace_roots token is a special key that scopes the rules below it to any path declared in workspace_roots. Without that scoping wrapper, **/*.env = "deny" would apply globally.
glob_scan_max_depth exists because expanding a deny glob like **/secret.json across a giant repo is expensive — Codex caps it to keep startup fast.
network.mode = "limited" plus an explicit domain allowlist is the production-grade setup. Combine with dangerously_allow_non_loopback_proxy = false (the default) so the sandbox proxy only binds to loopback.
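To make the allowlist concrete, here is one way a limited-mode domain table could be evaluated, sketched in Python. The precedence is an assumption, not documented behavior: a matching "deny" always wins, and hosts with no matching rule are denied. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def domain_decision(host: str, rules: dict) -> str:
    """Return "allow" or "deny" for host under a limited-mode allowlist.

    Assumptions: any matching "deny" rule wins, and a host that matches
    no rule is denied. Real Codex precedence may differ.
    """
    host = host.lower()
    matches = [action for pattern, action in rules.items()
               if fnmatchcase(host, pattern.lower())]
    if "deny" in matches:
        return "deny"
    return "allow" if "allow" in matches else "deny"


# Same table as [permissions.scoped.network.domains] above.
rules = {
    "api.openai.com": "allow",
    "api.ofox.ai": "allow",
    "github.com": "allow",
    "*.internal.corp": "deny",
}

print(domain_decision("github.com", rules))          # allow
print(domain_decision("wiki.internal.corp", rules))  # deny
print(domain_decision("example.org", rules))         # deny (no matching rule)
```

Under these assumptions the explicit "*.internal.corp" deny is redundant, since unlisted hosts are denied anyway; it still documents intent and guards against a broader allow rule added later.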

Network proxy — the feature flag most people skip

If you ran into "but Codex can't pip install", you probably want this:

[features.network_proxy]
enabled = true
proxy_url  = "http://127.0.0.1:3128"
socks_url  = "http://127.0.0.1:8081"
enable_socks5     = true
enable_socks5_udp = true
allow_local_binding   = false
allow_upstream_proxy  = true

[features.network_proxy.domains]
"pypi.org"             = "allow"
"registry.npmjs.org"   = "allow"
"github.com"           = "allow"
"api.openai.com"       = "allow"

This gives the sandboxed subprocess an HTTP/SOCKS5 proxy with a domain allowlist, rather than the binary on/off of sandbox_workspace_write.network_access. The dangerously_* keys exist for niche bind/listener cases — leave them off unless you understand the failure mode.

MCP servers — the meatiest section

MCP server configuration lives under [mcp_servers.<id>]. The schema covers both stdio servers (command + args) and HTTP streamable servers (url + headers).

[mcp_servers.docs]
command = "uvx"
args    = ["mcp-server-docs"]
cwd     = "~/code/docs-server"
env     = { DOCS_INDEX = "~/.cache/docs.idx" }
startup_timeout_sec = 15
tool_timeout_sec    = 90
required            = false  # if true, Codex fails startup when this server can't init
enabled             = true
enabled_tools       = ["search", "fetch_section"]   # allowlist
disabled_tools      = []                            # denylist
default_tools_approval_mode = "auto"                # auto | prompt | approve

[mcp_servers.github]
url                  = "https://github-mcp.example.com/mcp"
bearer_token_env_var = "GITHUB_TOKEN"
http_headers         = { "X-Repo" = "ofoxai/blog" }
env_http_headers     = { "X-User" = "GITHUB_USER" }   # populated from env
oauth_resource       = "https://github-mcp.example.com"
scopes               = ["repo", "issues"]

[mcp_servers.github.tools.create_issue]
approval_mode = "prompt"   # per-tool override

startup_timeout_sec is 10 by default — bump it for slow Node MCP servers that lazy-load on first request. tool_timeout_sec defaults to 60; long-running shell or database tools need more. required = true is the right call for a server your workflow depends on; you'd rather fail at boot than discover it half a session later.

default_tools_approval_mode and tools.<name>.approval_mode are how you say "auto-approve search, prompt me for delete_branch" without writing custom approval hooks.
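The lookup order is simple enough to sketch in Python: a per-tool approval_mode wins, otherwise the server default applies. This illustrates the precedence described above and is not Codex's code. Both function names are hypothetical, and two behaviors are assumptions: falling back to "prompt" when neither key is set, and treating an empty or missing enabled_tools as "all tools".

```python
def tool_is_exposed(server: dict, tool: str) -> bool:
    """Apply the enabled_tools allowlist, then the disabled_tools denylist."""
    enabled = server.get("enabled_tools")
    if enabled and tool not in enabled:  # assumption: empty/absent allowlist = all tools
        return False
    return tool not in server.get("disabled_tools", [])


def approval_mode(server: dict, tool: str) -> str:
    """Per-tool approval_mode wins; otherwise use the server-wide default."""
    per_tool = server.get("tools", {}).get(tool, {})
    default = server.get("default_tools_approval_mode", "prompt")  # assumed fallback
    return per_tool.get("approval_mode", default)


github = {
    "default_tools_approval_mode": "auto",
    "tools": {"create_issue": {"approval_mode": "prompt"}},
}
docs = {"enabled_tools": ["search", "fetch_section"], "disabled_tools": []}

print(approval_mode(github, "search"))        # auto
print(approval_mode(github, "create_issue"))  # prompt
print(tool_is_exposed(docs, "delete_index"))  # False
```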

Model providers — custom endpoints, including ofox

Built-in provider IDs (openai, ollama, lmstudio) are reserved. Everything else is a [model_providers.<id>] block. For an ofox setup that routes Codex through one key across GPT/Claude/Gemini/DeepSeek/Qwen models:

[model_providers.ofox]
name     = "ofox"
base_url = "https://api.ofox.ai/v1"
wire_api = "responses"
env_key  = "OFOX_API_KEY"
env_key_instructions = "Get a key from https://ofox.ai/keys"
requires_openai_auth   = false
request_max_retries     = 4
stream_max_retries      = 5
stream_idle_timeout_ms  = 300000

Then either flip the default:

model_provider = "ofox"
model          = "gpt-5.4"     # or anthropic/claude-opus-4.6, google/gemini-3.1-pro-preview, etc.

…or scope it to a profile (next section). The full BYO walkthrough with auth-via-command, query-param-pinned Azure endpoints, and per-provider header injection is in How to Use Any Model with Codex CLI. The gateway rationale — why you'd want one provider entry that fans out to many models — is in the AI API aggregation guide.

A few keys worth flagging:

wire_api only accepts "responses" as of Codex 0.59 (February 2026). The "chat" value and the /chat/completions path are gone — set it to "responses" or omit the key (the default). Third-party gateways that want to keep working with Codex now need to surface a /v1/responses endpoint; ofox.ai exposes one alongside /v1/chat/completions, so the same https://api.ofox.ai/v1 base URL still routes to Codex correctly. Gateways without a /responses endpoint need a local translator (community bridges exist) or a different client.
requires_openai_auth = false removes Codex's assumption that the key prefix is sk- — most non-OpenAI gateways need this explicitly. Leave it true (the default) only when the proxy mirrors OpenAI auth exactly.
[model_providers.<id>.auth] lets you run a command on a refresh schedule that returns a bearer token — for short-lived workforce tokens, sigv4-derived credentials, etc.

If you're sliding off vanilla OpenAI auth for the first time, the SDK migration guide for OpenAI clients to ofox is the companion piece.

Profiles — layer presets on top of your base config

[profiles.NAME] is a flat overlay: any top-level key set inside the profile wins when you run codex --profile NAME.

[profiles.fast]
model                   = "gpt-5.4-mini"
model_reasoning_effort  = "low"
model_verbosity         = "low"
approval_policy         = "never"
sandbox_mode            = "workspace-write"

[profiles.deep]
model                   = "gpt-5.4"
model_reasoning_effort  = "high"
plan_mode_reasoning_effort = "xhigh"
model_verbosity         = "high"
approval_policy         = "on-request"

[profiles.review]
model                   = "anthropic/claude-opus-4.6"   # via ofox provider
model_provider          = "ofox"
model_reasoning_effort  = "high"

This is also the place to set model_provider per profile so a review profile can hit Anthropic-via-ofox while your default profile stays on OpenAI. Remember: model_provider and profile keys themselves are ignored in project-local config — define them in ~/.codex/config.toml.
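"Flat overlay" means a plain key-by-key replacement, with no deep merge. A short Python sketch of the idea, using a dict in place of the parsed config.toml (the function name is hypothetical, and this is not how Codex itself is implemented):

```python
def apply_profile(config: dict, profile_name=None) -> dict:
    """Return the effective top-level settings after overlaying one profile."""
    base = {k: v for k, v in config.items() if k != "profiles"}
    if profile_name is None:
        return base
    overlay = config.get("profiles", {}).get(profile_name)
    if overlay is None:
        raise KeyError(f"unknown profile: {profile_name}")
    return {**base, **overlay}  # profile keys replace base keys one for one


config = {
    "model": "gpt-5.4",
    "model_provider": "openai",
    "approval_policy": "on-request",
    "profiles": {
        "review": {
            "model": "anthropic/claude-opus-4.6",
            "model_provider": "ofox",
            "model_reasoning_effort": "high",
        },
    },
}

effective = apply_profile(config, "review")
print(effective["model"], effective["model_provider"], effective["approval_policy"])
# anthropic/claude-opus-4.6 ofox on-request
```

Keys the profile does not mention, like approval_policy here, fall through from the base config unchanged.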

For practical patterns — fast/deep/review profiles paired with shell aliases — see the real-world Codex CLI workflow guide. The pricing tradeoffs behind picking fast/deep models live in the Codex CLI API configuration guide.

History, TUI, and the file_opener

[history]
persistence = "save-all"   # save-all | none
max_bytes   = 5_242_880    # 5 MB cap; drops oldest entries

[tui]
animations     = true
show_tooltips  = true
notifications  = true
notification_condition = "unfocused"   # unfocused | always
notification_method    = "auto"        # auto | osc9 | bel
theme                  = "catppuccin-mocha"
vim_mode_default       = false
alternate_screen       = "auto"        # auto | always | never
raw_output_mode        = false
status_line            = ["model", "token-usage", "branch"]
terminal_title         = ["spinner", "project"]

[tui.keymap.composer]
submit = ["enter"]
newline = ["shift+enter"]

tui.notifications accepts either a boolean or an array of event types (["new-message", "tool-output"]) for finer control. alternate_screen = "never" is useful in tmux setups where the alternate screen swallows scrollback. tui.theme accepts kebab-case theme names — catppuccin-mocha, gruvbox-dark, solarized-light, and friends.

file_opener controls the URI scheme Codex emits when citing files in output. The default is vscode; cursor, windsurf, vscode-insiders, and none (plain paths, no clickable links) are the alternatives.

shell_environment_policy — the leak you'll only notice in OTEL

By default Codex inherits your full shell environment when it spawns subprocesses. That's convenient, until you realize every AWS_*, GITHUB_*, and OPENAI_* variable in your env is reachable by every shell tool the model runs.

[shell_environment_policy]
inherit                 = "core"          # all | core | none
ignore_default_excludes = false           # if false, KEY/SECRET/TOKEN names are stripped first
include_only            = ["PATH", "HOME", "TMPDIR", "LANG", "LC_*"]
exclude                 = ["AWS_*", "GITHUB_*", "*_TOKEN", "*_SECRET", "*_KEY"]
set                     = { "CI" = "1", "NO_COLOR" = "1" }
experimental_use_profile = false

inherit = "core" keeps a minimal POSIX-ish set and drops the rest. ignore_default_excludes = false (the default) means anything with KEY, SECRET, or TOKEN in the name is filtered before your custom include_only/exclude runs — leave that on.

experimental_use_profile = true invokes your shell's user profile (.zshrc, etc.) when spawning subprocesses. That helps if your profile defines aliases or PATH entries the model relies on, at the cost of slower subprocess startup.
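The filtering order is the part people get wrong, so here is a Python sketch of it: inherit first, then the default KEY/SECRET/TOKEN excludes, then exclude, then include_only. It leaves out the set table on purpose. The contents of CORE_VARS are an assumption (the text above only calls the set "minimal POSIX-ish"), and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

# Assumed contents of the "core" set; not taken from the Codex source.
CORE_VARS = {"HOME", "LOGNAME", "PATH", "SHELL", "USER", "TMPDIR"}
DEFAULT_EXCLUDES = ["*KEY*", "*SECRET*", "*TOKEN*"]


def _matches(name: str, patterns) -> bool:
    """Case-insensitive glob match of a variable name against patterns."""
    return any(fnmatchcase(name.upper(), p.upper()) for p in patterns)


def filter_env(env: dict, policy: dict) -> dict:
    """Apply inherit, default excludes, exclude, then include_only."""
    inherit = policy.get("inherit", "all")
    if inherit == "none":
        result = {}
    elif inherit == "core":
        result = {k: v for k, v in env.items() if k in CORE_VARS}
    else:
        result = dict(env)
    if not policy.get("ignore_default_excludes", False):
        result = {k: v for k, v in result.items() if not _matches(k, DEFAULT_EXCLUDES)}
    if policy.get("exclude"):
        result = {k: v for k, v in result.items() if not _matches(k, policy["exclude"])}
    if policy.get("include_only"):
        result = {k: v for k, v in result.items() if _matches(k, policy["include_only"])}
    return result


env = {
    "PATH": "/usr/bin", "HOME": "/home/me", "LANG": "en_US.UTF-8",
    "AWS_PROFILE": "prod", "GITHUB_USER": "me", "OPENAI_API_KEY": "dummy",
}
print(sorted(filter_env(env, {"inherit": "all", "exclude": ["AWS_*", "GITHUB_*"]})))
# ['HOME', 'LANG', 'PATH']
print(sorted(filter_env(env, {"inherit": "core"})))
# ['HOME', 'PATH']
```

Note that AWS_PROFILE and GITHUB_USER survive the default excludes because their names contain none of KEY, SECRET, or TOKEN. That gap is what the custom exclude list is for.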

Features flags — the boolean grab-bag

Most defaults are sensible. The ones worth knowing:

[features]
shell_tool                    = true   # default tool for running commands
hooks                         = true   # lifecycle hooks (hooks.json or [hooks] block)
codex_git_commit              = false  # let Codex make git commits attributed to "Codex"
multi_agent                   = true   # spawn_agents_on_csv & friends
unified_exec                  = true   # PTY-backed exec (off on Windows by default)
shell_snapshot                = true   # snapshot env to speed up tool calls
skill_mcp_dependency_install  = true   # prompt to install missing MCP deps
fast_mode                     = true   # service-tier picker in TUI
network_proxy                 = false  # see "Network proxy" section above
prevent_idle_sleep            = false  # keep machine awake during active turn
memories                      = false  # opt into Memories
undo                          = false  # opt into Undo
personality                   = true   # personality picker
apps                          = false  # ChatGPT Apps/connectors support

codex_git_commit = true pairs with the top-level commit_attribution string (default "Codex <[email protected]>") — set that to a meaningful identity before turning the feature on.

memories = true activates the [memories] block (thread eligibility, consolidation cadence, raw memory caps). Defaults are conservative: max age 30 days, min idle 6 hours, max 16 rollouts per startup.

Hooks — lifecycle events without leaving config.toml

You can keep hooks in a sidecar hooks.json, or inline them under [hooks]. Inline form:

[hooks]

[[hooks.SessionStart]]
matcher = "*"

  [[hooks.SessionStart.hooks]]
  type    = "command"
  command = ["sh", "-c", "echo 'session started' >> ~/.codex.log"]

[[hooks.PreToolUse]]
matcher = "Bash"

  [[hooks.PreToolUse.hooks]]
  type    = "command"
  command = ["python3", "~/bin/codex_audit.py"]
  commandWindows = ["py", "C:/bin/codex_audit.py"]

Each event name holds an array of matcher groups, and each group carries its own list of handlers under hooks. The documented events are SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, PermissionRequest, PreCompact, PostCompact, SubagentStart, SubagentStop, and Stop. Each hook entry has a type plus the relevant fields: for command hooks, that's command and the optional commandWindows override for Windows shells.

If your team needs to force hooks across every developer machine, the allow_managed_hooks_only = true flag in requirements.toml (admin-distributed) makes user and project hooks no-ops, leaving only managed ones. The Claude Code equivalent — and a similar safety story — is covered in the Claude Code hooks, subagents, and skills guide.

Telemetry: otel and analytics

OpenTelemetry support ships built-in:

[otel]
environment       = "prod"
log_user_prompt   = false                   # opt in to exporting raw prompts
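# exporter: set to "none", or define one [otel.exporter."otlp-http"] or
# [otel.exporter."otlp-grpc"] table as below (a string plus a table under the same key is invalid TOML)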
trace_exporter    = "otlp-grpc"
metrics_exporter  = "statsig"               # none | statsig | otlp-http | otlp-grpc

[otel.exporter."otlp-http"]
endpoint = "https://collector.example.com/v1/logs"
protocol = "binary"                         # binary | json
headers  = { "x-api-key" = "${OTEL_KEY}" }

[otel.exporter."otlp-http".tls]
ca-certificate     = "~/certs/ca.pem"
client-certificate = "~/certs/client.pem"
client-private-key = "~/certs/client.key"

otel.* keys are user-level only (ignored in project config). log_user_prompt = false is the safe default — flip it only when you've sanitized your collector pipeline.

analytics.enabled = true/false controls the OpenAI-side analytics opt-in. feedback.enabled = true keeps the /feedback TUI command available.

Projects, trust, and the AGENTS.md story

project_root_markers         = ["pyproject.toml", "Cargo.toml", "pnpm-workspace.yaml"]
project_doc_fallback_filenames = ["AGENTS.md", "CLAUDE.md", "CONTRIBUTING.md"]
project_doc_max_bytes        = 32_768

[projects."/Users/me/code/risky-repo"]
trust_level = "untrusted"

[projects."/Users/me/code/oss-i-maintain"]
trust_level = "trusted"

project_doc_fallback_filenames is how you get Codex to read CLAUDE.md (or your team's equivalent) when there's no AGENTS.md. model_instructions_file is the heavier hammer: a path to a file that replaces the built-in instructions entirely rather than augmenting them.

Trust level interacts with the approval and sandbox machinery — untrusted projects get more conservative defaults regardless of your global settings.

A complete, layered config you can adapt

Here's a realistic ~/.codex/config.toml that combines everything above. Read it as a menu, not a recipe — most teams should delete two-thirds of it.

# ----- Core -----
model              = "gpt-5.4"
model_provider     = "ofox"
approval_policy    = "on-request"
sandbox_mode       = "workspace-write"
default_permissions = "tight"        # selects [permissions.tight] defined below
file_opener        = "vscode"
personality        = "pragmatic"
service_tier       = "flex"

model_reasoning_effort     = "medium"
model_reasoning_summary    = "auto"
model_verbosity            = "medium"
plan_mode_reasoning_effort = "high"

hide_agent_reasoning           = false
check_for_update_on_startup    = true
web_search                     = "cached"
commit_attribution             = "Codex (ofox) <[email protected]>"

# ----- Providers -----
[model_providers.ofox]
name     = "ofox"
base_url = "https://api.ofox.ai/v1"
wire_api = "responses"
env_key  = "OFOX_API_KEY"
requires_openai_auth = false

# ----- Sandbox -----
[sandbox_workspace_write]
network_access = false
writable_roots = ["~/work/scratch"]

[permissions.tight]
[permissions.tight.workspace_roots]
"~/code" = true

[permissions.tight.filesystem]
glob_scan_max_depth = 3
".env"            = "deny"
"**/.git/**"      = "deny"
"~/.ssh/**"       = "deny"

[permissions.tight.network]
enabled = true
mode    = "limited"

[permissions.tight.network.domains]
"api.ofox.ai"        = "allow"
"github.com"         = "allow"
"registry.npmjs.org" = "allow"
"pypi.org"           = "allow"

# ----- Env hygiene -----
[shell_environment_policy]
inherit  = "core"
include_only = ["PATH", "HOME", "TMPDIR", "LANG", "LC_*"]   # subprocesses don't need OFOX_API_KEY
exclude  = ["AWS_*", "GITHUB_*", "*_SECRET", "*_TOKEN", "*_KEY"]
set      = { CI = "1", NO_COLOR = "1" }

# ----- History & TUI -----
[history]
persistence = "save-all"
max_bytes   = 10_485_760

[tui]
animations     = true
notifications  = true
notification_condition = "unfocused"
theme          = "catppuccin-mocha"
status_line    = ["model", "token-usage", "branch", "approval"]

# ----- MCP -----
[mcp_servers.fs]
command = "uvx"
args    = ["mcp-server-filesystem", "~/code"]
startup_timeout_sec = 10
default_tools_approval_mode = "auto"

[mcp_servers.docs]
url                  = "https://docs-mcp.your.team/mcp"
bearer_token_env_var = "DOCS_MCP_TOKEN"

# ----- Profiles -----
[profiles.fast]
model                  = "gpt-5.4-mini"
model_reasoning_effort = "low"
approval_policy        = "never"

[profiles.deep]
model_reasoning_effort     = "high"
plan_mode_reasoning_effort = "xhigh"

[profiles.review]
model = "anthropic/claude-opus-4.6"
model_reasoning_effort = "high"

# ----- Telemetry (opt in) -----
[analytics]
enabled = false

[otel]
environment    = "dev"
metrics_exporter = "none"

Drop it in, run codex --profile fast, and you have a sandboxed, network-allowlisted, env-scrubbed setup that hits ofox for budget runs. Pass --profile review instead to switch to Anthropic-via-ofox for review passes.

Gotchas that bite people in week two

A short list, all real, all painful:

model_provider set in a project-local .codex/config.toml is silently ignored. Move it to ~/.codex/config.toml.
network_access = false plus a tool that needs the network. Hangs with no clear error; switch to [features.network_proxy] + a domain allowlist instead.
approval_policy = "never" plus sandbox_mode = "danger-full-access" — there is no safety net, the model can rm -rf $HOME. The Claude Code safety guide has the same warning for the Claude side; same lesson applies.
startup_timeout_sec defaulting to 10. Slow MCP servers fail to register and Codex silently drops them; bump to 30 for Node-based servers that lazy-load.
hide_agent_reasoning = true paired with debugging an agent loop — you'll waste an hour wondering why the model "did nothing" when it actually spent 4k tokens thinking off-screen.
shell_environment_policy.inherit = "all" (the default) leaks your full env to every tool call. The fix is 5 lines of config; the audit case it prevents is enormous.
[permissions.NAME.filesystem] glob patterns at the top level apply globally. Scope them under ":workspace_roots" if you only mean "inside the workspace."
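Most of this list can be checked mechanically. The Python sketch below lints a parsed config dict for several of the gotchas above. It is a hypothetical helper, not a Codex feature; the default values it assumes (inherit = "all", startup_timeout_sec = 10) are the ones stated in this article.

```python
def lint_config(cfg: dict, project_local: bool = False) -> list:
    """Return warnings for the gotchas listed above; cfg is a parsed config.toml."""
    warnings = []
    if project_local and "model_provider" in cfg:
        warnings.append("model_provider is ignored in project-local config")
    if cfg.get("approval_policy") == "never" and cfg.get("sandbox_mode") == "danger-full-access":
        warnings.append("approval_policy=never with danger-full-access has no safety net")
    if cfg.get("hide_agent_reasoning") is True:
        warnings.append("hide_agent_reasoning hides reasoning you may need while debugging")
    policy = cfg.get("shell_environment_policy", {})
    if policy.get("inherit", "all") == "all":
        warnings.append("shell_environment_policy.inherit=all exposes the full environment")
    for name, server in cfg.get("mcp_servers", {}).items():
        if server.get("startup_timeout_sec", 10) <= 10:
            warnings.append(f"mcp_servers.{name}: startup_timeout_sec may be too low for slow servers")
    return warnings


cfg = {
    "approval_policy": "never",
    "sandbox_mode": "danger-full-access",
    "mcp_servers": {"docs": {"command": "uvx"}},
}
for warning in lint_config(cfg):
    print("-", warning)
```

For this sample config it prints three warnings: the never/danger-full-access pairing, the inherited environment, and the MCP startup timeout.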

Where to go next

If you're picking a model to slot into model = "...", the best LLM for coding ranked by real use compares the realistic options. If you're weighing Codex CLI against the alternatives, the Claude Code vs Codex CLI vs Cursor vs DeepSeek TUI comparison is the head-to-head. For BYO model providers with weird auth shapes, the custom OAI-compatible provider setup guide is the most detailed.

The most useful thing in this file isn't a setting — it's the realization that Codex CLI's sandbox is a default-on, default-narrow safety net, and most "why doesn't this work" tickets are someone fighting that net instead of configuring it.

Originally published on ofox.ai/blog.

How to Use Any OAI-Compatible API with GitHub Copilot — Custom Model Setup Guide

Owen — Wed, 27 May 2026 02:54:11 +0000

TL;DR — GitHub Copilot now lets you point Chat (VS Code) and the Copilot CLI at any OpenAI-compatible endpoint. In VS Code, run Chat: Manage Language Models, pick the OpenAI Compatible provider, paste a base URL plus key. In Copilot CLI, export COPILOT_PROVIDER_BASE_URL, COPILOT_PROVIDER_API_KEY, and COPILOT_MODEL. Inline completions are unaffected — they still run on Copilot's own infra.

You don't need to leave Copilot to escape Copilot's model menu. Twenty seconds of env vars and your copilot CLI is talking to Claude Opus 4.6, GPT-5.4, or a local vLLM box — billed by the provider, not your Copilot quota.

What BYOK actually does in Copilot

BYOK (Bring Your Own Key) lets the Chat surface and the agent CLI use a model you authenticate to directly, instead of going through GitHub's hosted model pool. The wiring is narrow on purpose:

Surface	BYOK supported?	Billing
VS Code Chat / Agent mode	Yes	Your provider
Copilot CLI	Yes	Your provider
Inline code completions	No	Copilot subscription
Pull request summaries, code review	No	Copilot subscription

The split exists because completions run on tight, keystroke-level latency budgets that arbitrary endpoints can't promise. Chat and agents tolerate the round trip, so they got opened up first.

Setup in VS Code

The path was announced in October 2025 and has since landed in the stable channel for several providers (GA was confirmed in the April 2026 GitHub changelog). For the generic OpenAI-compatible flow:

Open the Command Palette → Chat: Manage Language Models.
Pick OpenAI Compatible from the provider list.
Fill in the Base URL (must serve /chat/completions), the API key, and a Model ID that the provider exposes.
Hit Add Model. The model now appears in the Copilot Chat model dropdown.

There are two JSON shapes worth knowing about. The legacy github.copilot.chat.customOAIModels object in settings.json still works in stable releases but is marked deprecated:

"github.copilot.chat.customOAIModels": {
  "anthropic/claude-opus-4.6": {
    "name": "Claude Opus 4.6 (via ofox)",
    "url": "https://api.ofox.ai/v1/chat/completions",
    "toolCalling": true,
    "vision": true,
    "maxInputTokens": 200000,
    "maxOutputTokens": 16000
  }
}

The replacement (currently Insiders-only) is the chatLanguageModels.json workspace file using the customendpoint vendor — note the array shape and the apiType selector that picks between OpenAI's chat-completions, OpenAI's responses, and Anthropic's messages protocol:

[
  {
    "name": "ofox.ai",
    "vendor": "customendpoint",
    "apiKey": "${OFOX_API_KEY}",
    "apiType": "chat-completions",
    "models": [
      {
        "id": "anthropic/claude-opus-4.6",
        "name": "Claude Opus 4.6",
        "url": "https://api.ofox.ai/v1/chat/completions",
        "toolCalling": true,
        "vision": true,
        "maxInputTokens": 200000,
        "maxOutputTokens": 16000
      }
    ]
  }
]

Capability flags (toolCalling, vision) matter. If the agent thinks the model doesn't support tools, it silently falls back to plain chat and your custom commands never fire.
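If you already have a legacy customOAIModels object, the move to the array shape is mechanical. A Python sketch of the conversion, using only the field names shown in the two JSON examples above; the function name is hypothetical and the Insiders schema may change.

```python
import json


def to_chat_language_models(legacy: dict, provider_name: str, api_key_ref: str,
                            api_type: str = "chat-completions") -> list:
    """Convert a customOAIModels object into the customendpoint array shape."""
    models = []
    for model_id, entry in legacy.items():
        model = {"id": model_id}
        model.update(entry)  # name, url, capability flags, and token limits carry over
        models.append(model)
    return [{
        "name": provider_name,
        "vendor": "customendpoint",
        "apiKey": api_key_ref,
        "apiType": api_type,
        "models": models,
    }]


legacy = {
    "anthropic/claude-opus-4.6": {
        "name": "Claude Opus 4.6 (via ofox)",
        "url": "https://api.ofox.ai/v1/chat/completions",
        "toolCalling": True,
        "vision": True,
        "maxInputTokens": 200000,
        "maxOutputTokens": 16000,
    }
}

converted = to_chat_language_models(legacy, "ofox.ai", "${OFOX_API_KEY}")
print(json.dumps(converted, indent=2))
```

The capability flags are copied verbatim, so a model that had toolCalling enabled in the legacy entry keeps it after the conversion.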

Setup in Copilot CLI

The CLI's BYOK docs are the cleanest reference. Three environment variables, exported before launching copilot:

export COPILOT_PROVIDER_BASE_URL=https://api.ofox.ai/v1
export COPILOT_PROVIDER_API_KEY=$OFOX_API_KEY
export COPILOT_MODEL=anthropic/claude-opus-4.6

copilot

For a local Ollama box, drop the key entirely:

export COPILOT_PROVIDER_BASE_URL=http://localhost:11434/v1
export COPILOT_MODEL=qwen2.5-coder:14b
copilot

The CLI talks the OpenAI Chat Completions protocol against whatever you point it at. If /v1/chat/completions resolves and the model ID is valid on that endpoint, it works.

Worked example: ofox.ai as the endpoint

ofox.ai is a gateway that exposes OpenAI, Anthropic, Google, Alibaba, and Moonshot models behind the OpenAI Chat Completions schema. That is useful for Copilot BYOK because you get Claude or Gemini in the Chat dropdown without juggling a separate SDK per vendor. The base URL is https://api.ofox.ai/v1 and the auth header is a standard Authorization: Bearer <key>.

A typical model ID set to expose to Copilot:

Model ID (use as `COPILOT_MODEL`)	What it is
`openai/gpt-5.4`	GPT-5.4 (general-purpose OpenAI tier)
`anthropic/claude-opus-4.6`	Claude Opus 4.6
`google/gemini-3.1-pro-preview`	Gemini 3.1 Pro preview
`bailian/qwen3-max`	Qwen3-Max
`moonshotai/kimi-k2.6`	Kimi K2.6

Smoke test before pointing Copilot at it:

curl https://api.ofox.ai/v1/chat/completions \
  -H "Authorization: Bearer $OFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"anthropic/claude-opus-4.6","messages":[{"role":"user","content":"ping"}]}'

If that returns a choices[0].message.content, Copilot will connect. If it 404s on the model ID, fix the ID first — Copilot surfaces those errors as a generic "model unavailable" toast that masks the real cause. For deeper debugging of mismatched IDs and 404s, see Model Not Found errors troubleshooting.
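If you want the smoke test to pass or fail cleanly in a script, the check on the response body can be sketched in Python. The function name and the sample body are hypothetical; the only thing taken from the text above is that a healthy response carries choices[0].message.content. The "error" key check assumes the gateway follows the usual OpenAI-style error body.

```python
import json


def extract_reply(raw_body: str) -> str:
    """Return choices[0].message.content, or raise ValueError with the reason."""
    body = json.loads(raw_body)
    if isinstance(body, dict) and "error" in body:  # assumed OpenAI-style error shape
        raise ValueError(f"endpoint returned an error: {body['error']}")
    try:
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError("response is not in Chat Completions shape") from exc


# Hypothetical response body, trimmed to the fields the check needs.
sample = '{"choices": [{"message": {"role": "assistant", "content": "pong"}}]}'
print(extract_reply(sample))  # pong
```

Feed it the body from the curl call above; a ValueError tells you whether the problem is an error payload or an endpoint that does not speak Chat Completions at all.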

For broader background on the gateway pattern — one key, many providers — see the OpenAI SDK migration guide and the pillar overview AI API aggregation: every model behind one endpoint.

Caveats worth knowing before you commit

Authentication is static credentials only. BYOK accepts an API key or bearer token. There's no OAuth handshake, no service-account flow, no key rotation hook. Treat the key like any other long-lived secret — scope it, rotate it manually, and don't put it in a public repo.
Telemetry still flows to GitHub. BYOK changes where the inference happens, not where the usage telemetry goes. Enterprise admins who needed a model migration for compliance reasons should re-read the data-handling docs before assuming BYOK is sufficient.
Rate limits become yours to manage. Copilot's quota stops protecting you; if your provider rate-limits you, the Chat panel will just stall. Watch your provider dashboard for the first week.
Code completions remain on Copilot. Repeating this because it's the #1 misunderstanding: BYOK does not replace the inline ghost-text completions. Those still hit GitHub's hosted models.

Comparing the IDE custom-API options

If you're choosing between Copilot BYOK and the equivalent feature in other editors, the surface area looks similar but the agent capabilities don't. The Cursor / Claude Code / Cline custom API setup guide walks the same exercise for those three. Short version: Copilot's BYOK is the cleanest in-editor flow (it's a UI form), Claude Code gives you the most agent power per dollar when paired with ANTHROPIC_BASE_URL, and Cursor sits in between.

Troubleshooting

"Failed to fetch model list" — Your base URL is missing /v1 or your endpoint doesn't serve a GET /models route. The OpenAI-Compatible provider probes /models to populate the dropdown. If your gateway doesn't expose it, type the model ID manually in the form.

Chat hangs after first turn — Tool calling is enabled in Copilot but the model isn't returning the expected tool_calls payload shape. Either flip toolCalling: false in your customOAIModels entry, or switch to a model that fully implements the OpenAI tools spec.

CLI says "context length exceeded" early — COPILOT_MODEL is set to an alias your provider remaps to a smaller-context variant. Use the canonical model ID from the provider's docs, not a shorthand.

Vision attachments silently dropped — Set vision: true on the model entry in settings.json. Without that flag, Copilot strips image parts from the multimodal payload before sending.

The interesting thing about Copilot BYOK isn't that it lets you switch models — it's that it lets you switch vendors without leaving the editor. Copilot becomes a thin chat shell; the intelligence is rented from whoever's winning this month.

Originally published on ofox.ai/blog.

Gemini 3.5 Pro: Release Date, Expected Specs, and What Flash Already Tells Us

Owen — Tue, 26 May 2026 14:43:03 +0000

Google announced its flagship at I/O 2026 and then told the audience to wait a month for it. The Flash model that shipped instead is already outscoring the previous-generation Pro on coding benchmarks.

TL;DR. Gemini 3.5 Pro was supposed to headline Google I/O 2026 on May 19. It didn't. Sundar Pichai told the audience "give us until next month" — meaning June 2026, no committed date. What did ship is Gemini 3.5 Flash, and its benchmarks are the most useful data we have for forecasting Pro: Flash already beats Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2% vs 70.3%), MCP Atlas (83.6% vs 78.2%), and Finance Agent v2 (57.9% vs 43.0%). If Pro extends the same gap over Flash, Google is shipping a coding-and-agents flagship in June that will force a real rethink against Claude Opus 4.7 and GPT-5.5. This piece is the realistic read on dates, pricing, capabilities, and how to prepare your code.

What Google actually announced

Gemini 3.5 Pro was named, described as already in internal use, and pushed back. Sundar Pichai's exact phrasing in the I/O keynote: "We're also hard at work on 3.5 Pro. It's already being used internally, and we look forward to rolling it out next month." That's the entire official statement. No spec sheet, no benchmark card, no API preview, no pricing tier.

The delay drew audible groans from the live audience — Business Insider's reporter on the floor caught it — because everything else in the keynote (Spark, Antigravity 2, Search AI Mode) was framed around the Pro tier that wasn't there. (Let's Data Science)

What we got instead was the Gemini 3.5 Flash launch — $1.50/M input, $9.00/M output, 1M context window, 4x output token throughput vs comparable frontier models, GA day-of on Gemini API, AI Studio, Vertex, Antigravity 2, and the Gemini app. Flash is the working artifact. Pro is the artifact we have to reason about from indirect evidence.

What "next month" probably means

Google's I/O timing tells you the window even if the date is open. The I/O keynote was May 19, 2026. "Next month" gives a range of June 1 to June 30. Two priors narrow it:

3.5 is reversing the launch order — but Pichai has capped the wait. Historically Pro shipped first: Gemini 3 Pro on November 18, 2025 with 3 Flash following on December 17; Gemini 3.1 Pro on February 19, 2026 with the 3.1 Flash family rolling out afterward. With 3.5 Flash leading at I/O, there is no clean prior for a Flash → Pro gap. What we do have is Pichai's "next month" commitment from the keynote, which caps the wait at June 30.
Google's quarterly cadence. Pro tiers historically ship before the end of a fiscal quarter when announced, partly for board optics. June 30 is the Q2 end. Expect a drop in the last full week of June — best guess June 22-26 — unless safety or serving capacity slips it.

What could push it out: additional Frontier Safety Framework evaluation (Google has telegraphed this process for every 3.x flagship), TPU serving capacity if Spark and the new agent platform are consuming it, or a benchmark embargo with a paper drop. None of these is likely to push the launch past July.

What Flash already tells us about Pro

This is the actual analytical work, and it's the only honest way to forecast a model that hasn't shipped. Three things Flash makes legible.

1. The generation jump is real, not incremental

Gemini 3.5 Flash beats Gemini 3.1 Pro on the benchmarks Google itself prioritized. From Google's published Gemini 3.5 Flash model card:

Benchmark	What it measures	3.5 Flash	3.1 Pro	Delta
Terminal-Bench 2.1	Real terminal coding tasks	76.2%	70.3%	+5.9
MCP Atlas	Scaled tool-use reliability	83.6%	78.2%	+5.4
Finance Agent v2	Multi-step financial workflows	57.9%	43.0%	+14.9
GDPval-AA (Elo)	Economic-value task suite	1656	1314	+342
CharXiv Reasoning	Chart/figure understanding	84.2%	83.3%	+0.9

A Flash beating a Pro on coding-and-agentic work has not happened before in this family. The implication: the 3.5 generation isn't just a quality bump, it's a re-architecture for agentic loops specifically. Pro should extend the trend.

A useful crude forecast: if 3.5 Pro opens the same gap over 3.5 Flash that 3.5 Flash opened over 3.1 Pro (roughly +5 to +15 points on agentic benchmarks), Gemini 3.5 Pro lands at ~82-85% Terminal-Bench, ~88-90% MCP Atlas, and well into the 70s on Finance Agent v2. That's flagship territory against Claude Opus 4.7 and GPT-5.5, which trade leadership across this benchmark set depending on the task. Compare with the current flagship benchmark showdown: the picture changes meaningfully if Pro hits the upper end of this range.
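One way to reproduce the crude forecast is to assume 3.5 Pro opens the same gap over 3.5 Flash that 3.5 Flash opened over 3.1 Pro. That additive reading is an assumption made for this sketch, not anything Google has published, and the function name `naive_pro_forecast` is illustrative. The scores come from the model card table above.

```python
# Scores (percent) from the Gemini 3.5 Flash model card table above.
SCORES = {
    "Terminal-Bench 2.1": {"flash_3_5": 76.2, "pro_3_1": 70.3},
    "MCP Atlas": {"flash_3_5": 83.6, "pro_3_1": 78.2},
    "Finance Agent v2": {"flash_3_5": 57.9, "pro_3_1": 43.0},
}


def naive_pro_forecast(scores: dict) -> dict:
    """Project 3.5 Pro as 3.5 Flash plus the Flash-over-3.1-Pro delta.

    Assumption: the generation gap simply repeats one tier up. Scores are
    capped at 100 because they are percentages.
    """
    forecast = {}
    for benchmark, row in scores.items():
        delta = row["flash_3_5"] - row["pro_3_1"]
        forecast[benchmark] = round(min(100.0, row["flash_3_5"] + delta), 1)
    return forecast


for benchmark, projected in naive_pro_forecast(SCORES).items():
    print(f"{benchmark}: ~{projected}%")
```

Under that assumption the sketch gives roughly 82.1% on Terminal-Bench 2.1, 89.0% on MCP Atlas, and 72.8% on Finance Agent v2. Treat these as the output of a toy model, not a prediction with error bars.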

2. Pricing has a floor and a ceiling

Flash launched at $1.50/M input, $9.00/M output. Gemini 3.1 Pro sits at $2.00/M input, $12.00/M output. That's an unusual layout — Flash is now 25% cheaper than the prior-gen Pro while being benchmark-superior on coding. The new Pro has to be priced higher than 3.1 Pro to make commercial sense, but it can't go too high without making the Flash + Pro combo less attractive than bundling DeepSeek V4 Pro and Gemini Flash through a single endpoint for cost-shaping.

Realistic price band for 3.5 Pro:

Floor: $2.50/M input, $15/M output (a 25% premium over 3.1 Pro, mirroring the 3.5 Flash premium over 3.1 Flash)
Ceiling: $3.50/M input, $20/M output (the upper bound before it starts overlapping with Anthropic and OpenAI flagship pricing and losing its differentiation)
Most likely: $3.00/M input, $18/M output

For comparison: GPT-5.5 is roughly $5/$30, Claude Opus 4.7 is $5/$25. Even at the high end of the band Gemini 3.5 Pro stays meaningfully cheaper for output-heavy workloads — which is most agentic loops.
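For budgeting, the band can be turned into scenario costs. The sketch below is a rough illustration: the prices are the speculative floor, most-likely, and ceiling values from the list above, while the monthly token volumes in the usage note are invented purely to show the arithmetic.

```python
# Speculative 3.5 Pro price band from the list above:
# (input, output) in USD per million tokens.
PRICE_BAND = {
    "floor": (2.50, 15.00),
    "likely": (3.00, 18.00),
    "ceiling": (3.50, 20.00),
}


def monthly_cost(input_tokens: int, output_tokens: int, price: tuple) -> float:
    """Cost in USD for a month's token volume at one (input, output) price."""
    in_rate, out_rate = price
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def scenarios(input_tokens: int, output_tokens: int) -> dict:
    """Price the same workload at every point in the band."""
    return {
        name: round(monthly_cost(input_tokens, output_tokens, price), 2)
        for name, price in PRICE_BAND.items()
    }
```

For a hypothetical workload of 2B input and 400M output tokens a month, `scenarios(2_000_000_000, 400_000_000)` prices the floor at $11,000, the most-likely case at $13,200, and the ceiling at $15,000.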

3. The 1M context window is staying

Gemini 3.5 Flash kept the 1,048,576 input / 65,536 output token window. No evidence Google is reducing context in the 3.5 generation. Pro almost certainly keeps or expands this — long context remains a Gemini selling point alongside Claude Opus 4.7 (200k default, 1M on the dedicated long-context variant) and GPT-5.5 (1M via the standard API, 400k inside Codex), and Google's Project Mariner and Antigravity 2 product story both depend on it. If anything, expect 3.5 Pro to push to 2M context as a marketing point.

The remaining open question is recall quality at 128k+. 3.5 Flash actually regressed on MRCR v2 at 128k (77.3% vs 3.1 Pro's 84.9%), a 7.6-point drop. That regression is the single biggest open question about 3.5 Pro. If Pro inherits it, the "1M context" claim becomes weaker in practice for real long-document retrieval and you'd still want 3.1 Pro for those workloads.

What Pro probably won't be

A grounded forecast also needs to say what isn't changing.

It's not getting a step-change in modalities. 3.5 Flash already takes text, images, video, audio, and PDFs in and emits text. Pro almost certainly matches but doesn't extend this set on day one. Native image-out lives in Nano Banana / Imagen, not the main Gemini chat tier.
It's not going to drop below Flash's prices. Google needs a Pro tier with margins. The whole point of having Flash + Pro is price discrimination across workload sensitivity.
It's not shipping with a public model card before launch day. Google's pattern has been simultaneous model card + GA. Don't expect benchmark leaks; expect a Tuesday morning launch with the deck ready.
The naming probably stays "Gemini 3.5 Pro." There's been no signal pointing toward a rename, and Google has been more naming-disciplined than OpenAI in the 3.x generation.

How to prepare today

If you're shipping anything that will rely on Gemini 3.5 Pro the moment it lands, the practical prep is:

1. Build against 3.5 Flash now. The API surface and tool-use shape are the same between Flash and Pro tiers in this generation. Through ofox the model id is google/gemini-3.5-flash. When Pro ships, swap to google/gemini-3.5-pro — no SDK or schema rewrite. The ofox OpenAI-compatible endpoint handles request translation either way.

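import os
from openai import OpenAI

# ofox's OpenAI-compatible endpoint; the key is read from the environment
client = OpenAI(base_url="https://api.ofox.ai/v1", api_key=os.environ["OFOX_API_KEY"])
messages = [{"role": "user", "content": "ping"}]  # placeholder conversation

# Today
client.chat.completions.create(
    model="google/gemini-3.5-flash",
    messages=messages,
)

# Day Pro ships, change one string (this ID will not resolve before then)
client.chat.completions.create(
    model="google/gemini-3.5-pro",
    messages=messages,
)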

2. Use Flash as the floor in routing. A common pattern: route trivial work to Flash, escalate to a flagship (Opus 4.7, GPT-5.5, or soon Gemini 3.5 Pro) only when Flash returns low confidence. See the Claude Code hybrid routing pattern for the production-grade version of this. When 3.5 Pro lands, you swap which flagship sits behind the escalation gate — your routing logic doesn't change.

3. Don't pre-commit to pricing. The realistic range above ($2.50-$3.50 input, $15-$20 output) is informed speculation. If you're writing a cost projection for finance, plug in both endpoints of the band and ship two scenarios.

4. Watch for the model card on Google's blog. That's how every 3.x model has launched — a single blog post on blog.google with the full benchmark grid. No staged rollouts, no Twitter teasers from product managers. Subscribe to the Gemini API changelog if you want push notification of the moment Pro becomes addressable.
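The escalation gate from point 2 can be sketched in a few lines. This is an illustration under stated assumptions: how you obtain a confidence score is left open (the `Answer.confidence` field, the `ask` callable, and the 0.7 threshold are all hypothetical), and `google/gemini-3.5-pro` is the expected ID from point 1, so until Pro ships you would put another flagship there.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical score in [0, 1]; how you derive it is up to you


# (model_id, prompt) -> Answer; wrap your real API call in a function of this shape.
AskFn = Callable[[str, str], Answer]

FLOOR_MODEL = "google/gemini-3.5-flash"
FLAGSHIP_MODEL = "google/gemini-3.5-pro"  # expected ID; swap in any flagship you can call today


def route(prompt: str, ask: AskFn, threshold: float = 0.7) -> Tuple[str, Answer]:
    """Try the floor model first; escalate only when its confidence is low."""
    first = ask(FLOOR_MODEL, prompt)
    if first.confidence >= threshold:
        return FLOOR_MODEL, first
    return FLAGSHIP_MODEL, ask(FLAGSHIP_MODEL, prompt)
```

Because the flagship is a single constant, swapping which model sits behind the gate is a one-line change and `route` itself stays untouched.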

The bigger picture for model selection

By late June 2026 three things are happening at once:

Claude Opus 4.7 still holds reasoning-heavy benchmarks, especially long-horizon agent runs.
GPT-5.5 owns raw multimodal reasoning and the deepest ecosystem of tooling.
Gemini 3.5 Pro — if Flash's gains carry forward — undercuts both on price and pushes them on Terminal-Bench-style agentic coding.

Picking the right model gets harder before it gets easier. The LLM API selection decision matrix and the leaderboard view will both need a rewrite the week Pro ships. For the canonical three-way framing of how to choose, see the Claude vs GPT vs Gemini comparison guide — that's the piece that'll get the biggest update.

If you're choosing today and the work is coding-and-agents leaning, Gemini 3.5 Flash already beats last-generation Pro at 25% lower cost. There's no reason to wait. If the work is reasoning-heavy or you care about long-context recall quality, stay on Gemini 3.1 Pro or Claude Opus 4.7 for now and re-evaluate when Pro lands. The thing not to do is sit on your hands assuming the new shiny thing will solve a problem you could solve this week.

Up to six weeks of public Flash benchmarks will have accumulated before the Pro model card exists, and they say the cost-quality frontier in coding is about to shift on a Tuesday in late June.

Sources and citations

Sundar Pichai's "next month" quote and the I/O 2026 keynote framing: Google's official blog (May 19, 2026)
Gemini 3.5 Flash benchmark grid: DeepMind model card
Flash pricing and context window: Google AI for Developers — Gemini API changelog
The delay framing and audience reaction: Let's Data Science I/O recap
Comparative pricing for Claude Opus 4.7 and GPT-5.5: confirmed against ofox model catalog at the time of writing

Originally published on ofox.ai/blog.

AI API Pricing Comparison May 2026: Every Major Model in One Table

Owen — Tue, 26 May 2026 04:06:52 +0000

TL;DR: The frontier-model price gap in May 2026 is more than 100x on output tokens — GPT-5.5 charges $30/M, Claude Haiku 4.5 charges $5/M, and DeepSeek V4 Flash charges $0.28/M for a 1M-context model that benchmarks above GPT-4o. The sticker price almost never tells you what you'll actually pay: caching, batching, and long-context surcharges swing real costs by 50–90% in either direction. This table compiles every major API's verified May 2026 price, with the discount math built in.

The cheapest API in 2026 is not just cheaper than the most expensive one: it's eighty-eight times cheaper on a real coding task, and hundreds of times cheaper on raw sticker price. That's not a rounding error, that's an architecture decision.

What's in this table

Nine providers, twenty-three models, verified from each vendor's pricing page in late May 2026. Prices in USD per million tokens (M = 1,000,000). Output is always more expensive than input: usually 4–6x, as low as 2x on Grok and DeepSeek, and as high as 8x on Gemini 2.5 Pro and Flash.

Model	Input $/M	Output $/M	Cached input $/M	Context	Notes
OpenAI GPT-5.5	$5.00	$30.00	$0.50	1M	Flagship reasoning
OpenAI GPT-5.5 Pro	$30.00	$180.00	—	1M	Hardest-tasks tier
OpenAI GPT-5.4	$2.50	$15.00	$0.25	272K	Previous flagship; 1M extended on opt-in
OpenAI GPT-5.4 Mini	$0.75	$4.50	—	200K	Mid-tier workhorse
OpenAI GPT-5.4 Nano	$0.20	$1.25	—	128K	Cheapest OpenAI option
Anthropic Claude Opus 4.7	$5.00	$25.00	$0.50	1M	Best agent reliability
Anthropic Claude Sonnet 4.6	$3.00	$15.00	$0.30	1M	Best price/perf in flagship class
Anthropic Claude Haiku 4.5	$1.00	$5.00	$0.10	1M	Cheapest US-lab frontier model
Google Gemini 3.1 Pro	$2.00	$12.00	$0.20	1M	$4.00/$18.00 above 200K
Google Gemini 2.5 Pro	$1.25	$10.00	$0.125	1M	$2.50/$15.00 above 200K
Google Gemini 2.5 Flash	$0.30	$2.50	—	1M	Cheap multimodal
Google Gemini 2.5 Flash-Lite	$0.10	$0.40	—	1M	Cheapest Google option
xAI Grok 4.3	$1.25	$2.50	$0.20	1M	Output-cheap reasoner
xAI Grok 4.20	$1.25	$2.50	$0.20	2M	Speed + tool-calling tier
xAI Grok 4.1 Fast	$0.20	$0.50	$0.05	2M	Budget agentic tool-caller
DeepSeek V4 Flash	$0.14	$0.28	$0.0028	1M	Best $/quality globally
DeepSeek V4 Pro	$0.435	$0.87	$0.0036	1M	Flagship (75% off until 2026-05-31)
Moonshot Kimi K2.6	$0.95	$4.00	$0.16	262K	Strong coding model
Moonshot Kimi K2.5	$0.60	$3.00	$0.10	262K	Cheaper sibling
Alibaba Qwen3-Max	$1.20	$6.00	$0.24	262K	Tiered: 2x above 32K input, 2.5x above 128K
Mistral Large 3	$0.50	$1.50	—	262K	Aggressive EU pricing
Meta Llama 4 Maverick (via DeepInfra)	$0.15	$0.60	—	1M	Cheap open-weight large
Meta Llama 4 Scout (via DeepInfra)	$0.08	$0.30	—	10M native (DeepInfra caps at 320K)	Cheapest tier overall

A few things the sticker doesn't show. Those come next.

The discount math behind the sticker

Prompt caching cuts repeated-prefix input by ~90%

Every major US-lab model offers some form of prompt caching now. The shape is the same: a long system prompt or document gets cached, and subsequent reads charge a fraction of the input rate. Anthropic and OpenAI both cut cached input by 90%. DeepSeek's cache hit on V4 is 98% off ($0.0028/M vs $0.14/M miss). Gemini caches at ~10% of base.

The catch: caching only helps when the same prefix repeats across many requests inside a short TTL window (typically 5 minutes on Anthropic, longer on others). A chatbot serving many users with shared system prompts: huge win. A coding agent rewriting context every turn: zero help.

Worked example. You're running a customer-support bot with a 4K-token system prompt and 1K-token user turns, serving 100 messages an hour. On Claude Sonnet 4.6:

Without caching: 100 × 5K × $3/M = $1.50/hr input
With caching (system prompt cached): 100 × 4K × $0.30/M + 100 × 1K × $3/M = $0.12 + $0.30 = $0.42/hr input

A 72% cut on a workload most teams already run.
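The arithmetic is easy to re-run for your own traffic. The sketch below uses the Sonnet 4.6 rates from the table and, like the worked example, treats every system-prompt read as a cache hit; it ignores any cache-write charge, which is a simplifying assumption. The function name is illustrative.

```python
def hourly_input_cost(messages, system_tokens, user_tokens, base_rate, cached_rate=None):
    """Input cost in USD per hour; rates are USD per million tokens.

    When cached_rate is given, the system prompt is billed at that rate on
    every message (assumes a 100% cache hit rate and no cache-write fee).
    """
    system_rate = base_rate if cached_rate is None else cached_rate
    per_message = system_tokens * system_rate + user_tokens * base_rate
    return messages * per_message / 1_000_000


uncached = hourly_input_cost(100, 4_000, 1_000, base_rate=3.00)
cached = hourly_input_cost(100, 4_000, 1_000, base_rate=3.00, cached_rate=0.30)
saving = 1 - cached / uncached

print(f"uncached ${uncached:.2f}/hr, cached ${cached:.2f}/hr, saving {saving:.0%}")
```

With the numbers above this reproduces $1.50/hr, $0.42/hr, and a 72% cut. The saving shrinks as user turns grow relative to the cached prefix.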

Batch API cuts everything by 50%

If you can wait 24 hours, every major provider gives you exactly 50% off. Anthropic, OpenAI, Google, Mistral — all the same. For offline jobs (overnight document processing, dataset labeling, summary generation on yesterday's data) this is free money. Most production traffic can't use it because users want answers in seconds, not tomorrow.

Long-context surcharges on Gemini

Google is the only US lab in this table charging a long-context premium. Above 200K tokens, both Gemini 2.5 Pro and Gemini 3.1 Pro roughly double their input price and add ~50% to output. Anthropic, which also offers 1M-context Claude models, charges a flat rate across the full context. (Alibaba's Qwen3-Max is also tiered by input length, as its table row notes.)

If your typical request is below 100K tokens, this is moot. If you're feeding entire codebases or 500-page PDFs, the headline Gemini price is misleading by a factor of two.
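To see the effect on a single request, here is a small sketch using the Gemini 3.1 Pro rates from the table. It assumes the whole request is billed at the higher tier once the prompt exceeds 200K tokens; check that reading against Google's pricing page before relying on it. The names are illustrative.

```python
# Gemini 3.1 Pro rates from the table: (input, output) USD per million tokens.
GEMINI_3_1_PRO = {
    "threshold": 200_000,   # prompt tokens above this use the long-context tier
    "base": (2.00, 12.00),
    "long": (4.00, 18.00),
}


def request_cost(prompt_tokens: int, output_tokens: int, pricing: dict = GEMINI_3_1_PRO) -> float:
    """Cost in USD of one request, picking the tier from the prompt size.

    Assumption: crossing the threshold reprices the entire request, not
    just the tokens above 200K.
    """
    tier = "long" if prompt_tokens > pricing["threshold"] else "base"
    in_rate, out_rate = pricing[tier]
    return (prompt_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

A 500K-token prompt with 10K output tokens costs about $2.18 under this model, against $1.12 if you naively applied the headline rate, which is where the factor of two comes from.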

What it actually costs to do real work

Sticker prices in isolation are useless. Here's what one realistic workload costs across the lineup.

Scenario: a coding agent processing one task end-to-end. Roughly 40K input tokens (context + retrieved code + tool results) and 8K output tokens (reasoning + final code). One task is roughly one minute of human-developer-equivalent work.

Model	Cost per task	Tasks per $1
GPT-5.5	$0.44	2.3
Claude Opus 4.7	$0.40	2.5
Claude Sonnet 4.6	$0.24	4.2
GPT-5.4	$0.22	4.5
Gemini 3.1 Pro	$0.18	5.6
Qwen3-Max	$0.10	10
Claude Haiku 4.5	$0.08	12.5
Kimi K2.6	$0.07	14
Grok 4.3	$0.07	14
GPT-5.4 Nano	$0.02	50
DeepSeek V4 Flash	$0.008	125
Llama 4 Scout	$0.005	200

The ratio between cheapest and most expensive at this workload is 88x. That gap, run a million times, is the difference between a $5,000 month and a $440,000 month.
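The per-task figures come straight from the sticker prices, so you can recompute them for your own token mix. The sketch below covers a handful of rows and uses base rates only: no caching, batching, or tiered surcharges, which is an assumption that flatters some models and penalizes others. The function names are illustrative.

```python
# (input, output) USD per million tokens, copied from the pricing table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
}


def cost_per_task(model: str, input_tokens: int = 40_000, output_tokens: int = 8_000) -> float:
    """USD for one task at sticker price (no caching, batch, or tier effects)."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def tasks_per_dollar(model: str) -> float:
    """How many default-sized tasks one dollar buys."""
    return 1 / cost_per_task(model)


for name in PRICES:
    print(f"{name}: ${cost_per_task(name):.3f} per task")
```

Changing the defaults to your own input and output sizes matters more than it looks: output-heavy tasks widen the gap between GPT-5.5 and everything else, because its output rate is the highest in the table.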

How to pick a tier

A simple decision tree that holds up across most teams.

Do you need the absolute best reasoning on the hardest 5% of tasks? GPT-5.5 Pro or Claude Opus 4.7. Pay the premium, don't try to be clever.

Do you need frontier quality on routine work? Claude Sonnet 4.6 or Gemini 3.1 Pro. Sonnet wins on agent reliability; Gemini wins on multimodal and 1M context recall.

Are you on a budget but need US-lab quality? Claude Haiku 4.5 or GPT-5.4 Mini. Both punch above their price tag.

Are you cost-sensitive and OK with open-weight quality? DeepSeek V4 Flash is the answer for most teams — 1M context at $0.14/$0.28. Llama 4 Scout if you can route through DeepInfra and don't need vision.

Are you doing offline / batch work? Pick anything and submit the job through the provider's Batch API for 50% off. The model choice matters less than turning batch on.

This is the same logic our LLM API selection decision matrix lays out by use case if you want a longer breakdown.

What this table doesn't show

Four caveats worth knowing before you route off these numbers.

Rate limits matter more than price for many teams. A $0.30/M model you can't get capacity on at peak is more expensive than a $5/M model you can. OpenAI and Anthropic have the most generous tiers; the cheaper Chinese models often gate hard on enterprise quotas.
Quality is not flat within a price band. Claude Sonnet 4.6 and Gemini 3.1 Pro are priced similarly but win on different tasks. Sonnet leads on multi-turn agent reliability; Gemini leads on 1M+ token recall and image input. There's no substitute for running your eval on both.
Provider markup is real. Going through a reseller adds 5–20% in most cases. We break down OpenRouter's actual margin versus first-party APIs in a separate piece — short version: it's higher than they advertise once you account for routing costs.
First-party Claude pricing matches ofox pricing. Anthropic does not let resellers undercut; the only saving is from removing the need for multiple billing relationships. That logic applies to all the big labs.

The aggregator question

You can pay nine providers separately, manage nine API keys, and reconcile nine invoices. Or you can route everything through one OpenAI-compatible endpoint. ofox.ai is the aggregator we run — one key for every model in this table, OpenAI-compatible SDK, and prices that match each provider's first-party rate for flagship models with up to 70% off on open-weight ones. We're not the only option, but the math is similar across aggregators: the value is in not maintaining nine integrations, not in saving 2% on token cost.

For a deeper read on flagship-level differences, Claude vs GPT vs Gemini is the pillar piece this article links into. For first-party tier breakdowns specifically: Claude API pricing breakdown, Gemini 3.1 Pro pricing, GPT-5.4 Pro pricing, DeepSeek V4 pricing, and how to actually reduce AI API costs. The May 2026 LLM leaderboard is the quality-side companion to this price-side table.

The right model is the one whose price you don't have to think about — pick it for capability and let the bill take care of itself, or pick it for cost and let the capability ceiling decide your roadmap. Anything in between just means you'll switch in six months.

Pricing sources (verified May 26, 2026)

OpenAI: openai.com/api/pricing
Anthropic: platform.claude.com/docs/en/about-claude/pricing
Google Gemini: ai.google.dev/gemini-api/docs/pricing
xAI: x.ai/api
DeepSeek: api-docs.deepseek.com/quick_start/pricing
Moonshot Kimi: official pricing via openrouter.ai/moonshotai
Alibaba Qwen: alibabacloud.com/help/en/model-studio/model-pricing
Mistral: mistral.ai/pricing
Meta Llama 4 (via DeepInfra): deepinfra.com/meta-llama

This table will be re-verified at the start of each month. If a number here disagrees with the provider's page, the provider wins — but tell us, because we want to keep this honest.

Originally published on ofox.ai/blog.

Agentic Coding in 2026: Claude Code vs Codex CLI vs Gemini CLI vs Cursor Agent

Owen — Mon, 25 May 2026 14:37:35 +0000

TL;DR

Agentic coding has fragmented into four specialized tools. Claude Code excels at high-quality pair programming with human oversight. Codex CLI dominates unattended multi-hour tasks with Goal mode reaching 82.7% on Terminal-Bench 2.0. Gemini CLI transitions to Antigravity CLI on June 18, 2026. Cursor Agent uniquely offers cloud VM-based background agents with browser/desktop capabilities and eight-way parallelism.

The fundamental shift: agents now operate beyond terminals. Codex runs unattended for hours, Cursor agents click through browsers in cloud VMs, and Gemini consolidates into a full desktop platform. The production strategy is not choosing one tool, but composing all four by task type through a unified API gateway.

What Changed in 2026 for Agentic Coding CLIs

Agentic coding evolved from "model writes a function" to "model owns multi-step tasks from specification to verified output." Each of the four mature CLIs occupies different positions on the autonomy spectrum:

Claude Code (Anthropic) prioritizes human partnership, running locally with approval gates and extension hooks for developer control.
Codex CLI (OpenAI) maximizes autonomy: Goal mode runs unattended, with 1,000+ sequential tool calls demonstrated without intervention.
Gemini CLI (Google) offered middle-ground conversational ReAct loops with 1M-token context until the announced transition to Antigravity CLI.
Cursor Agent (Cursor) abandoned the terminal entirely for cloud VMs with desktop and browser capabilities, supporting up to eight parallel background agents.

The category fragmentation reflects a shifted question: "How much autonomy do I delegate, for how long, and where should execution occur?"

The Five-Minute Decision Matrix

CLI	Autonomy Model	Execution Environment	Primary Model	Key Strength	Main Challenge
Claude Code	Approval-gated pair programmer	Local terminal	Claude Opus 4.7 / Sonnet 4.6	Hooks, subagents, Skills with PostToolUse output replacement (May 2026)	Pro tier subscription throttle
Codex CLI	Unattended Goal mode over hours	Local or headless	GPT-5.5 (ofox: GPT-5.4 Pro, GPT-5.3)	GA Goal mode, 82.7% Terminal-Bench score, remote computer use	Less idiomatic first-pass output
Gemini CLI	Conversational ReAct loop	Local terminal (sunsetting June 18)	Gemini 3.1 Pro / Flash	1M context window, free tier (60 RPM/1000 RPD), MCP support	Consolidating into Antigravity CLI
Cursor Agent	Cloud VM background fleet	Editor + cloud VM	Composer 2 or Claude/GPT/Gemini	Desktop/browser per agent, 8x parallel fan-out	Credit-based premium model billing

Quick guidance: Claude Code for craftsmanship; Codex CLI for endurance; Gemini CLI for free-tier exploration before June 18; Cursor Agent for parallelism.

Claude Code: The Pair-Programmer Model

Claude Code's philosophy keeps developers in control. The terminal-resident CLI operates against local filesystems, requires approval before destructive changes, and exposes state through /context and /cost introspection commands. Claude Opus 4.7 is the default as of May 2026 (upgraded from 4.6), with Sonnet 4.6 handling the broader workload at lower cost.

Extensibility Architecture (Three Layers)

Hooks execute shell commands at lifecycle events—PreToolUse, PostToolUse, Stop, SubagentStop, SessionStart. The May 2026 upgrade enabled PostToolUse hooks to replace tool output across all tools via hookSpecificOutput.updatedToolOutput, enforcing patterns like "run tests before stopping" or "block edits to generated files."

Subagents spawn focused workers with isolated context windows, custom prompts, and bounded tool permissions. The primary agent handles planning while specialist subagents manage discrete tasks like code review or security scanning.

Skills package reusable expertise as markdown files plus optional scripts, functioning like internal libraries distributed across teams.
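The "block edits to generated files" pattern mentioned under Hooks could look roughly like the script below, registered as a PreToolUse hook. Treat it as a sketch under assumptions: it presumes the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input.file_path` fields, and that exit code 2 blocks the call and returns stderr to the model. Verify both against Anthropic's hooks documentation. The file patterns and function names are hypothetical.

```python
import json
import sys
from fnmatch import fnmatch

# Hypothetical patterns for files that a code generator owns.
GENERATED_PATTERNS = ("*_pb2.py", "*.generated.ts", "*/generated/*")
# Tool names assumed to perform file edits.
EDIT_TOOLS = {"Edit", "Write"}


def should_block(event: dict) -> bool:
    """True when an edit-style tool call targets a generated file."""
    if event.get("tool_name") not in EDIT_TOOLS:
        return False
    path = event.get("tool_input", {}).get("file_path", "")
    return any(fnmatch(path, pattern) for pattern in GENERATED_PATTERNS)


def main() -> int:
    event = json.load(sys.stdin)  # assumed: hook payload arrives as JSON on stdin
    if should_block(event):
        print("Refusing to edit a generated file; change its source instead.", file=sys.stderr)
        return 2  # assumed "block this tool call" exit code
    return 0
```

A real hook script would end with `raise SystemExit(main())`; it is left out here so the functions can be imported and tested without reading stdin.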

This design reflects the autonomy philosophy: short turns, frequent approvals, granular control. Extended unattended runs conflict with the architecture's core assumption.

Economic constraint: Pro at $20/month enforces hard ceilings. Max 5x ($100) and Max 20x ($200) raise limits without eliminating them—a direct disadvantage for "set and forget" workflows, precisely where Codex CLI operates.

Codex CLI: The Autonomy Champion

Codex CLI targets tasks measured in hours rather than minutes. The May 2026 changelog confirms: Goal mode transitioned from experimental to GA across the Codex app, IDE extensions, and CLI. OpenAI demonstrated 1,000+ sequential tool calls on real software tasks without intervention; Terminal-Bench 2.0 scores of 82.7% on GPT-5.5 provide empirical validation.

Remote computer use (May 2026 feature) exemplifies the autonomy bet—Codex operates Mac desktop apps after screen lock, including remote access via Codex Mobile. Authorization is time-limited, displays covered, and local input triggers relock, but the philosophy is explicit: agents don't require constant observation.

Codex CLI 0.125.0 added reasoning-token usage reporting in codex exec --json, closing observability gaps. Multi-hour session budgeting now achieves production-grade accuracy via token-level reporting and OpenTelemetry traces.

Trade-offs Worth Naming

First-pass edits show slightly lower idiomaticity compared to Claude, particularly on tight refactors. The workaround: route through GPT-5.4 Pro via ofox or GPT-5.3 Codex if GPT-5.5 availability lags.

Codex CLI mirrors OpenAI's ecosystem—tool-calling formats, prompt conventions, and trace output reflect wider OpenAI infrastructure. Anthropic-primary shops find Claude Code more native.

Gemini CLI: The Conversational ReAct Loop (With a June 18 Deadline)

Gemini CLI implements the simplest design: reason-and-act loops with built-in tools (Google Search grounding, shell, file operations, web fetch) plus MCP support. The 1M-token context window was uniquely accessible in a terminal, and the free tier (60 requests/minute, 1,000 requests/day on personal accounts) was unmatched for low-friction agentic exploration.

The June 18, 2026 Transition

Google announced May 12, 2026 that Gemini CLI and Gemini Code Assist IDE extensions stop serving Google AI Pro/Ultra and free Gemini Code Assist on June 18, 2026. The consolidation target is Google Antigravity—an agent-first platform featuring server-side infrastructure and Antigravity CLI as the terminal equivalent.

Concrete implications:

Personal free-tier users migrate to Antigravity CLI by June 18; the free tier carries over.
Paid Google AI Pro/Ultra subscribers face the same migration requirement.
Self-hosted users with custom API keys can continue via open-source community forks, though corporate recommendations shift toward Antigravity.

This represents re-platforming rather than agentic-coding deprecation. Gemini 3.1 Pro and Gemini 3.1 Flash remain available on ofox and other aggregators; the distribution channel moves.

When Gemini CLI still wins (through June 18): free-tier exploration, MCP server prototyping with generous context, pattern testing without paid subscriptions.

Cursor Agent: The Fleet Model

Cursor rejected terminal-first architecture entirely. Editor-centric from inception, 2026 pushed agents into cloud VMs with dedicated desktops and browsers.

Background Agents Architecture

Cursor clones repositories into cloud VMs where agents work on dedicated branches with full desktop and browser access. Results surface as pull requests while you continue local editing. February 2026 upgrades added desktop-per-agent infrastructure—each Background Agent receives its own development environment, browser, and UI interaction capabilities. Agents can launch browsers, navigate localhost, click UI elements, and visually verify code changes before opening PRs.

Fan-out extends to eight parallel agents—unique across the four CLIs. Dependency upgrades spanning services, test backfills, or standardized changes across multiple repositories genuinely unlock parallelism unavailable elsewhere.

Cost structure: each Background Agent consumes Cursor credits; parallelism has real economic trade-offs.

Foreground Capabilities

Composer 2, Cursor's first-party agentic model, claims ~4x speed versus frontier peers, with typical agent turns finishing under 30 seconds. Auto mode is credit-free; premium model pins (Claude Sonnet 4.6, GPT-5.5) consume credits. The $20 Pro plan includes approximately $20 of monthly credits plus unlimited Tab completions.

When Cursor Agent dominates: editor-native workflows, high-volume repetitive work benefiting from fan-out (dependency upgrades, test backfills, bulk find-and-replace), or scenarios requiring visual UI verification.

The Use-Case Matrix

Task	Best Primary	Fallback	Rationale
High-quality refactors with oversight	Claude Code (Opus 4.7)	Cursor Agent	Approval-gated execution, superior idiomatic output
Multi-hour unattended execution	Codex CLI Goal mode	Cursor Background Agent	Designed for walk-away autonomy
Browser-based UI verification	Cursor Background Agent	Codex remote computer use	Desktop/browser environment per agent
Eight-way parallel fan-out (deps)	Cursor Background Agents	Codex CLI scripted	Native parallelism
Free-tier exploration (pre-June 18)	Gemini CLI	Cursor Hobby	1M context, no card required
Free-tier exploration (post-June 18)	Antigravity CLI	Gemini CLI (BYO-key)	Free tier migration destination
Local-only, no cloud VMs	Claude Code or Codex CLI	Gemini CLI (BYO-key)	Both remain on-machine
MCP-heavy custom tools	Claude Code	Gemini CLI	Most mature MCP integration
Headless / CI integration	Codex CLI	Claude Code (`--print` mode)	Remote-control entrypoint, OpenTelemetry
Strict $30/month budget	DeepSeek TUI + Cursor Hobby	Gemini CLI free tier	See $30/month coding stack guide

How to Configure All Four Against One API Key

The under-discussed reality: you don't need four billing dashboards. Each CLI accepts custom endpoints; aggregators like ofox expose Anthropic, OpenAI, and Google models through compatible APIs.

Claude Code with Anthropic-Compatible Endpoint

export ANTHROPIC_BASE_URL="https://api.ofox.ai/anthropic"
export ANTHROPIC_API_KEY="sk-ofox-..."
claude

Codex CLI with OpenAI-Compatible Endpoint

export OPENAI_BASE_URL="https://api.ofox.ai/v1"
export OPENAI_API_KEY="sk-ofox-..."
codex

Gemini CLI with Gemini API-Compatible Endpoint

export GOOGLE_GENAI_USE_VERTEXAI=false
export GEMINI_API_KEY="sk-ofox-..."
export GEMINI_API_BASE_URL="https://api.ofox.ai/gemini"
gemini

Cursor Agent Custom Models

Settings → Models → Add Custom Model accepts any OpenAI-compatible base URL plus API key. Set the base URL to https://api.ofox.ai/v1 to call Claude, GPT, and Gemini through the same authentication Cursor already understands.

This pattern runs all four agents against the same model catalog, switching by task class while paying only for consumed tokens.
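
The three export blocks above can be folded into one shell function so a single key drives every terminal agent. This is a minimal sketch: use_gateway is a hypothetical helper name, and the variable names and URLs are simply the ones shown above, not independently checked against each CLI's documentation.

```shell
# Hypothetical helper: point Claude Code, Codex CLI, and Gemini CLI
# at the same gateway using one API key.
use_gateway() {
  key="$1"

  # Refuse to export empty credentials.
  if [ -z "$key" ]; then
    echo "usage: use_gateway <api-key>" >&2
    return 1
  fi

  # Claude Code (Anthropic-compatible endpoint)
  export ANTHROPIC_BASE_URL="https://api.ofox.ai/anthropic"
  export ANTHROPIC_API_KEY="$key"

  # Codex CLI (OpenAI-compatible endpoint)
  export OPENAI_BASE_URL="https://api.ofox.ai/v1"
  export OPENAI_API_KEY="$key"

  # Gemini CLI (variable names as given in the article)
  export GOOGLE_GENAI_USE_VERTEXAI=false
  export GEMINI_API_KEY="$key"
  export GEMINI_API_BASE_URL="https://api.ofox.ai/gemini"
}

# Example: use_gateway "sk-ofox-..." and then launch claude, codex, or gemini.
```

Sourcing this from a shell profile keeps the key in one place; Cursor still needs the Settings entry described above, since it does not read these variables.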

Shared Gaps Across All Four (May 2026)

Cross-Repo Awareness

Each agent session operates within a single repository. Coordinating one change across a monorepo plus three sibling repositories still requires developer intervention.

Cost Predictability

Even with /cost commands and Codex token reporting, predicting multi-hour Goal-mode expenses remains guesswork until completion.

Persistent Memory Across Sessions

Subagents and Skills enable knowledge reuse, but genuine session-to-session memory requires developer prompt scaffolding.

Reliable Test-Driven Loops

Write-test-code-iterate works for greenfield projects but degrades on flaky tests or extended CI cycles.

Verification Beyond UI

Cursor's browser-equipped agents verify UI changes visually. Data-pipeline correctness and distributed-system invariants still rely on developer-written tests.

Addressing these gaps often requires architectural workarounds (CI-side verification, persistent external memory stores) rather than awaiting agent evolution.

Closing Recommendation

Pick by autonomy axis first, then ecosystem fit.

Craftsman pair programmer locally: Claude Code with Opus 4.7; use Sonnet 4.6 for broader workloads.
Walk-away autonomy over hours: Codex CLI Goal mode with GPT-5.5 (or GPT-5.4 Pro through ofox if GPT-5.5 lags on aggregators).
Free-tier exploration before June 18: Gemini CLI; migrate to Antigravity CLI by mid-June.
Browser-aware parallel agents in cloud VM: Cursor Background Agents, up to eight in parallel.

The Production Composition Pattern

Late-2026 production teams rarely choose one tool. The converging pattern: Claude Code locally for craftsmanship, Codex CLI in a separate shell for endurance, and Cursor Background Agents in the cloud for fan-out, all three routed through one API gateway for unified billing and model catalog access.

The fastest-shipping developers aren't debating "which is best"; they're composing all three behind a single API key.

Sources and Version Stamps

Claude Code: PostToolUse output replacement for all tools (May 2026); Fast mode default upgraded to Opus 4.7 (from 4.6) per Anthropic release notes and ClaudeLog, May 2026
Codex CLI: v0.124.0 quick reasoning controls; v0.125.0 reasoning-token reporting in codex exec --json; Goal mode GA; remote computer use per OpenAI developers changelog; GPT-5.5 Terminal-Bench 2.0 score of 82.7% per OpenAI launch announcement
Gemini CLI → Antigravity CLI: transition announcement May 12, 2026; cutoff for Google AI Pro/Ultra and free Gemini Code Assist on June 18, 2026, per Google Developers Blog
Cursor Agent: Background Agents v3.0 with cloud VMs; February 2026 desktop + browser per agent; 8x parallel fan-out; Composer 2 first-party model per cursor.com and v3 release notes
ofox model availability: Claude Opus 4.7, Sonnet 4.6, Haiku 4; GPT-5.4 Pro, GPT-5.4, GPT-5.3 Codex; Gemini 3.1 Pro, 3.1 Flash, 3.1 Flash-Lite—verified at ofox.ai/llms-full.txt on 2026-05-25

Originally published on ofox.ai/blog.

How to Delegate Claude Code Tasks to Mistral Vibe — Save 2-4x on Tokens

Owen — Mon, 25 May 2026 02:37:38 +0000

TL;DR

Mistral Vibe (Mistral's open-source coding CLI running Mistral Medium 3.5 at $1.50/$7.50 per million tokens) is roughly 3.3x cheaper than Claude Opus 4.7 ($5/$25). You don't have to choose between them—Claude Code can spawn Vibe as a subagent via the Bash tool, keeping Opus 4.7 for planning and review while Vibe handles refactors, file scans, and bulk edits. A 30-line config file enables this approach.

Why This Pattern Exists

For most of 2025, the agentic-CLI debate centered on "which one is best." In 2026, the better question is "which one does each job best." Claude Code's subagent system lets you answer pragmatically: keep the expensive model where reasoning matters, route the grunt work somewhere cheaper.

Claude Opus 4.7 gets expensive because most of the tokens it bills for go to work that doesn't need Opus. A single agentic session reading 20-30 files easily burns 100K+ tokens before the model writes any code. The token bill is dominated by exploration and bulk edits, not by the moments where Opus actually earns its keep.

Mistral Medium 3.5, the default model in Mistral Vibe since April 29, 2026, costs $1.50/$7.50 per million tokens and scores 77.6% on SWE-Bench Verified. It's not as strong as Opus on novel reasoning, but for mechanical tasks like "rename this symbol in 14 files," "add error handling to these three functions," or "extract this prop into a config object," its output is indistinguishable from Opus's in practice.

The delegation pattern lets you keep Opus for decisions and hand the mechanics to Vibe. If you're skeptical of mixing CLIs, the hybrid routing approach inside Claude Code itself is a closer-coupled alternative worth comparing.

What Mistral Vibe Actually Is

Mistral Vibe is a terminal-based coding agent that ships as a single Python CLI with built-in subagent support, MCP integration, and a non-interactive prompt mode. Installation is one line:

curl -LsSf https://mistral.ai/vibe/install.sh | bash
export MISTRAL_API_KEY=...

Config (optional—Vibe ships with defaults) lives at ~/.vibe/config.toml. A minimal version using only documented keys:

active_model = "mistral-medium-3-5"
enable_auto_update = true
enable_telemetry = false

The flag you care about for delegation is --prompt, which runs Vibe one-shot and prints the result:

vibe --prompt "refactor src/utils/date.ts to use date-fns instead of moment"

That command is the entire integration surface. Any orchestrator that can shell out—Claude Code, a Makefile, a CI job—can call it.

The Claude Code Half: Defining a Subagent

Claude Code routes work to subagents by reading the description field in each agent definition under .claude/agents/. To make Opus 4.7 delegate to Vibe, write one Markdown file with a Bash-only tool scope and a description that tells Opus when this worker is the right pick.

Create .claude/agents/vibe-worker.md:

---
name: vibe-worker
description: "Use for mechanical code changes where reasoning is shallow — renames, refactors across many files, adding error handling, extracting helpers, format/lint cleanup. Do NOT use for architectural decisions or novel logic."
tools: Bash
model: sonnet
---

You are a delegation wrapper around the Mistral Vibe CLI.

When invoked with a task description, run:

  vibe --prompt "<task description>"

Capture the output, then return a short summary: which files changed, what the change was, and any warnings Vibe surfaced. Do not edit files yourself — only run the `vibe` command.

If `vibe` returns an error or asks for clarification, return the raw output to the parent and stop.

Notice that tools: Bash is intentional: this subagent's only superpower is shelling out, which keeps its context narrow. The description is what Opus reads to decide when to dispatch, so the "do NOT use for…" line matters as much as the positive cases. The wrapper itself runs on Sonnet 4.6, not Opus, because all it does is format one shell command.
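
One practical wrinkle: the task description reaches the shell as text, so quoting matters. Below is a minimal sketch of a wrapper the subagent could call instead of a raw vibe --prompt. The function name run_vibe_task is hypothetical; only the --prompt flag comes from the text above.

```shell
# Hypothetical wrapper: run Mistral Vibe one-shot on a single task string.
run_vibe_task() {
  task="$1"

  # Refuse empty tasks rather than launching an agent with no instructions.
  if [ -z "$task" ]; then
    echo "usage: run_vibe_task \"<task description>\"" >&2
    return 2
  fi

  # Fail clearly if the CLI is not installed.
  if ! command -v vibe >/dev/null 2>&1; then
    echo "vibe not found on PATH" >&2
    return 127
  fi

  # Quoting "$task" passes the description as one argument, so spaces and
  # quote characters inside it are not re-split by the shell.
  vibe --prompt "$task"
}
```

The distinct exit codes (2 for a missing task, 127 for a missing binary) give the parent agent something concrete to report back instead of a silent failure.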

The Real Cost Math

A typical "refactor 20 files to use the new API" task burns about 50K input tokens (reading files + scratch reasoning) and produces about 10K output tokens. Running it three ways:

Path	Input cost	Output cost	Total
Claude Opus 4.7 direct	50K × $5/M = $0.25	10K × $25/M = $0.25	$0.50
Mistral Vibe (Medium 3.5)	50K × $1.50/M = $0.075	10K × $7.50/M = $0.075	$0.15
DeepSeek V4 Flash via ofox	50K × $0.14/M = $0.007	10K × $0.28/M = $0.003	$0.01

On this task Mistral Vibe costs about 3.3x less than direct Opus ($0.15 vs. $0.50). If you run 100 such tasks a month, that's $35 (100 × $0.35) kept in your wallet instead of Anthropic's. The catch is that the saving evaporates the moment you delegate something Vibe can't handle: Opus then redoes the work, so you pay twice. The decision rubric: only delegate what you'd be comfortable letting a junior engineer do without supervision.

For the genuinely token-paranoid, the third row in that table is real—DeepSeek V4 Flash is $0.14/$0.28 per million tokens on ofox. You can substitute it for Mistral Vibe in the same subagent pattern.
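
The table's arithmetic, plus the pay-twice caveat, can be sketched in a few lines of shell. The helper names cost_usd and expected_cost are hypothetical, and the 20% redo rate is an illustrative assumption, not a measured figure. The model also assumes a failed delegation costs one full Opus run on top of the worker's run.

```shell
# cost_usd INPUT_TOKENS OUTPUT_TOKENS INPUT_PRICE_PER_M OUTPUT_PRICE_PER_M
# Prints the task cost in dollars.
cost_usd() {
  LC_ALL=C awk -v it="$1" -v ot="$2" -v ip="$3" -v op="$4" \
    'BEGIN { printf "%.4f\n", (it * ip + ot * op) / 1000000 }'
}

# expected_cost WORKER_COST OPUS_COST REDO_RATE
# Expected cost when Opus must redo a fraction of delegated tasks.
expected_cost() {
  LC_ALL=C awk -v w="$1" -v o="$2" -v p="$3" \
    'BEGIN { printf "%.4f\n", w + p * o }'
}

opus=$(cost_usd 50000 10000 5 25)           # 0.5000
vibe=$(cost_usd 50000 10000 1.50 7.50)      # 0.1500
deepseek=$(cost_usd 50000 10000 0.14 0.28)  # 0.0098

echo "Opus direct:        \$$opus"
echo "Mistral Vibe:       \$$vibe"
echo "DeepSeek V4 Flash:  \$$deepseek"

# Assumed 20% redo rate: 0.15 + 0.20 * 0.50 = 0.25 per task on average.
echo "Vibe with 20% redo: \$$(expected_cost "$vibe" "$opus" 0.20)"
```

Under this model delegation stops paying off once the redo rate reaches (0.50 − 0.15) / 0.50 = 70%, which is why the junior-engineer rubric matters more than the headline ratio.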

Variant: The Same Pattern on Ofox + DeepSeek V4 Flash

If you already have an ofox key for unified model access, you can skip Mistral Vibe entirely and have Claude Code dispatch to DeepSeek V4 Flash directly. The Bash wrapper changes from vibe --prompt to a curl call, but the subagent definition is otherwise identical.

Create .claude/agents/cheap-worker.md:

---
name: cheap-worker
description: Use for mechanical edits — renames, format cleanup, boilerplate generation, simple refactors. Routes to DeepSeek V4 Flash via ofox. NOT for design decisions or novel logic.
tools: Bash, Read, Edit
model: sonnet
---

For each delegated task, call:

  curl https://api.ofox.ai/v1/chat/completions \
    -H "Authorization: Bearer $OFOX_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"deepseek/deepseek-v4-flash","messages":[{"role":"user","content":"<task + relevant file contents>"}]}'

Apply the returned diff yourself using Read + your own edit primitives. Return a one-paragraph summary to the parent.

The trade-off: Mistral Vibe is a real coding agent with its own planning loop, so it handles multi-file tasks better. A raw DeepSeek V4 Flash call is just a language model—the orchestration logic falls on you (or on Opus, which costs Opus tokens). For single-file edits, the ofox variant wins on price. For multi-file refactors, Vibe pulls ahead because its agentic loop runs on the cheap model, not on Opus.

Where This Stops Working

The delegation pattern breaks in three specific situations.

When the task involves judgment about API design or trade-offs. Mistral Medium 3.5 will pick an answer; Opus 4.7 will tell you why one option is wrong. Architecture decisions are not where you save tokens.

When the delegated task needs context the wrapper can't supply. Vibe runs in a fresh process with no memory of your conversation. If "fix this bug" depends on three earlier discussions, you'll pass them in as prompt context—paying input tokens twice, in both Opus and Vibe. Net cost can exceed the no-delegation baseline.

When Vibe's tokenizer disagrees with Anthropic's. Claude Opus 4.7 ships a new tokenizer that uses ~12-27% more tokens than Opus 4.6 on the same text. Mistral's tokenizer is different again. Your "50K tokens" estimate from a Claude session is not what Vibe will count, and the bills won't line up exactly. The 3.3x ratio holds in aggregate; trust it monthly, not per-task.

Picking the Right Delegation Cutoff

A useful heuristic: if you can write a clear, two-sentence task description without referring to "the thing we discussed earlier" or "the approach you mentioned," it's delegable. If you can't, keep it on Opus.
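
That heuristic can be roughed out as a pre-flight check on the task text. This is a sketch only: is_delegable is a hypothetical helper, and the phrase list is an illustrative assumption you would tune to your own habits. It catches references to earlier conversation, not tasks that are too hard.

```shell
# Hypothetical check: succeed (exit 0) when a task description looks
# self-contained, fail (exit 1) when it leans on prior conversation.
is_delegable() {
  task="$1"

  # An empty description is never delegable.
  [ -n "$task" ] || return 1

  # Illustrative phrases that signal hidden context.
  if printf '%s' "$task" | grep -Eiq \
    'discussed earlier|you mentioned|we talked about|as before|like last time'; then
    return 1
  fi

  return 0
}

if is_delegable "Rename getUser to fetchUser across src/ and update imports"; then
  echo "delegate to worker"
else
  echo "keep on Opus"
fi
```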

Tasks that consistently win when delegated:

Symbol renames across the codebase
Adding null-checks or error handling to a list of known functions
Generating boilerplate (test scaffolding, type definitions from schemas, config files)
Format/lint fixes that grep can target but humans hate doing
Translating between formats (JSON ↔ YAML, OpenAPI ↔ TypeScript)

Tasks that lose when delegated:

Anything involving "which approach is better"
Novel algorithm work
Bug fixes where the root cause isn't established
Reviewing AI-generated code (don't ask a cheaper model to review its peer's work)

If you're already optimizing Claude Code spend, this pattern stacks on top of existing strategies because they target different cost drivers: delegation changes which model does the work, while other optimization guides change how much context the model sees.

The Full Minimum-Viable Setup

Five minutes if you already have an Anthropic key:

curl -LsSf https://mistral.ai/vibe/install.sh | bash
export MISTRAL_API_KEY=... (get one from console.mistral.ai)
Drop the .claude/agents/vibe-worker.md definition from earlier into your project root
Restart Claude Code
Next time you need to do a 20-file refactor, just ask—Opus will read the subagent description and delegate

The first time you watch Claude Code dispatch to vibe-worker and come back with a diff that cost $0.15 instead of $0.50, the pattern justifies itself.

When This Is the Wrong Question Entirely

If your monthly bill is dominated by one model and you're chasing a single-digit-percent cost cut, this isn't the lever to pull. Check whether prompt caching, batching, or context window discipline would save more for less engineering effort. Delegation overhead is real—every subagent dispatch is a Bash spawn, and every Bash spawn is a roundtrip Opus has to reason about.

But if you've already done the easy optimizations and you still see Opus burning tokens on tasks that look mechanical when you watch them happen, this is the pattern. Two CLIs, one config file, predictable savings. The broader model-selection question is independent—you can run the delegation pattern with any pair of orchestrator + worker; Opus + Vibe is just the version with the cleanest CLI ergonomics in May 2026.

What you're really buying is the right to keep using the model you trust for hard problems, while paying a third of the price for the easy ones. That's the deal, and it takes only a roughly 30-line agent definition (Markdown with YAML frontmatter) to claim it.

Originally published on ofox.ai/blog.