<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Owen</title>
    <description>The latest articles on DEV Community by Owen (@owen_fox).</description>
    <link>https://dev.to/owen_fox</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3893304%2Fb8cec06b-7789-423e-a8d0-386db7f00620.png</url>
      <title>DEV Community: Owen</title>
      <link>https://dev.to/owen_fox</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/owen_fox"/>
    <language>en</language>
    <item>
      <title>Routing GLM-5.2, DeepSeek V4, MiniMax M3 &amp; Kimi K2.6 Through One API (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Tue, 23 Jun 2026 10:19:38 +0000</pubDate>
      <link>https://dev.to/owen_fox/routing-glm-52-deepseek-v4-minimax-m3-kimi-k26-through-one-api-2026-5m2</link>
      <guid>https://dev.to/owen_fox/routing-glm-52-deepseek-v4-minimax-m3-kimi-k26-through-one-api-2026-5m2</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Put GLM-5.2, DeepSeek V4 (Pro and Flash), MiniMax M3, and Kimi K2.6 behind one &lt;a href="https://ofox.ai/en?utm_source=blog&amp;amp;utm_medium=marvin_article&amp;amp;utm_campaign=multi-model-router" rel="noopener noreferrer"&gt;ofox&lt;/a&gt; API key and route per task instead of paying one model's price for every job. Blended per-token cost at a 2:1 input-to-output mix ranges from &lt;strong&gt;$0.19/M (V4 Flash)&lt;/strong&gt; to &lt;strong&gt;$2.40/M (GLM-5.2)&lt;/strong&gt;, a &lt;strong&gt;12.86x spread&lt;/strong&gt;. A worked 1,000-job/day routing table below cuts a &lt;strong&gt;$4,205/mo all-GLM bill to $1,453 (-65.5%)&lt;/strong&gt;. The routing rule is short: budget/batch → V4 Flash, long-context (up to 1M tokens) → V4 Pro or GLM-5.2, reasoning/code → GLM-5.2 or Kimi K2.6, images → MiniMax M3 or Kimi K2.6. All five model IDs sit on the same OpenAI-compatible endpoint, so routing is a one-string change. Python and Node loops are included.&lt;/p&gt;

&lt;p&gt;The mistake teams make is picking one model and running everything through it. A batch summarization job and a hard reasoning task do not deserve the same per-token price. With one key across all five models, the cheapest (V4 Flash) costs &lt;strong&gt;12.86x less&lt;/strong&gt; per blended token than the most expensive (GLM-5.2), so the entire game is matching each job class to the cheapest model that clears its quality bar.&lt;/p&gt;

&lt;p&gt;This is a how-to with reproducible cost math, not a "which router is best" roundup. Every number below comes from ofox's listed per-token rates verified June 23, 2026, and you can recompute each table from the spec sheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: Which Model for Which Job?
&lt;/h2&gt;

&lt;p&gt;One-line verdict: &lt;strong&gt;default your batch traffic to the cheapest tier and only escalate the jobs that need it.&lt;/strong&gt; Here is the routing map by task shape.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task shape&lt;/th&gt;
&lt;th&gt;Route to&lt;/th&gt;
&lt;th&gt;ofox model ID&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Budget / high-volume batch&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.19/M blended, 12.86x cheaper than GLM-5.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost-sensitive general work&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.59/M blended, free cache reads, 1M context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context (up to ~1M tokens)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;V4 Pro&lt;/strong&gt; or &lt;strong&gt;GLM-5.2&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt; / &lt;code&gt;z-ai/glm-5.2&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;V4 Pro takes 1M context at $0.45/M input, under a third of GLM-5.2's $1.40/M; GLM-5.2 for reasoning-heavy 1M jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard reasoning / agentic coding&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GLM-5.2&lt;/strong&gt; or &lt;strong&gt;Kimi K2.6&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;z-ai/glm-5.2&lt;/code&gt; / &lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Strongest reasoning tier; Kimi K2.6 multimodal alternative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image input (vision tasks)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;MiniMax M3&lt;/strong&gt; or &lt;strong&gt;Kimi K2.6&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;minimax/minimax-m3&lt;/code&gt; / &lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Only these two of the five models accept &lt;code&gt;image_url&lt;/code&gt;; M3 is cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very long single output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4 Pro/Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt; / &lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;384K max output, highest of the five&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest default for most 2026 teams: send the bulk of your traffic to &lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt; or &lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;, escalate the genuinely hard reasoning to &lt;code&gt;z-ai/glm-5.2&lt;/code&gt;, and send anything with an image to &lt;code&gt;minimax/minimax-m3&lt;/code&gt;. That covers the realistic 90% of mixed workloads behind one key with no vendor migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Specs Comparison
&lt;/h2&gt;

&lt;p&gt;Verified against the ofox &lt;code&gt;/v1/models&lt;/code&gt; catalog on June 23, 2026. Prices are per million tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;GLM-5.2&lt;/th&gt;
&lt;th&gt;DeepSeek V4 Pro&lt;/th&gt;
&lt;th&gt;DeepSeek V4 Flash&lt;/th&gt;
&lt;th&gt;MiniMax M3&lt;/th&gt;
&lt;th&gt;Kimi K2.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ofox model ID&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z-ai/glm-5.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;minimax/minimax-m3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;1,048,576&lt;/td&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;1,000,000&lt;/td&gt;
&lt;td&gt;1,131,000&lt;/td&gt;
&lt;td&gt;262,144&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max output&lt;/td&gt;
&lt;td&gt;128,000&lt;/td&gt;
&lt;td&gt;384,000&lt;/td&gt;
&lt;td&gt;384,000&lt;/td&gt;
&lt;td&gt;131,000&lt;/td&gt;
&lt;td&gt;262,144&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input $/M&lt;/td&gt;
&lt;td&gt;$1.40&lt;/td&gt;
&lt;td&gt;$0.45&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;td&gt;$0.95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output $/M&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;td&gt;$0.88&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$2.40&lt;/td&gt;
&lt;td&gt;$4.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read $/M&lt;/td&gt;
&lt;td&gt;$0.26&lt;/td&gt;
&lt;td&gt;~$0.00&lt;/td&gt;
&lt;td&gt;~$0.00&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;td&gt;$0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modality&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;text + image&lt;/td&gt;
&lt;td&gt;text + image&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three structural facts drive every routing decision below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;DeepSeek V4 Flash is the price floor.&lt;/strong&gt; At $0.14/$0.28 it is 12.86x cheaper blended than GLM-5.2. Anything that does not need top-tier reasoning starts here.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;DeepSeek V4 cache reads are effectively free.&lt;/strong&gt; Both V4 tiers bill cache reads at a rounding-to-zero rate, versus GLM-5.2's $0.26/M. On repeated-context workloads this is a large, often-overlooked saving.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Only MiniMax M3 and Kimi K2.6 take images.&lt;/strong&gt; GLM-5.2 and both DeepSeek tiers are text-only. Vision tasks have exactly two valid routes, and MiniMax M3 is the cheaper of them.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Blended Cost: The Number That Drives Routing
&lt;/h2&gt;

&lt;p&gt;A model's headline input price is half the story. What you pay depends on your input-to-output ratio. A coding agent reads a lot (large context) and writes a little (a diff) — roughly 2:1 input-to-output. Chat is closer to 1:1. Pure code generation from a short prompt is output-heavy, around 1:3.&lt;/p&gt;

&lt;p&gt;Here is the blended cost per million tokens at the coding-typical 2:1 mix (two-thirds input, one-third output), and the multiplier against GLM-5.2 as the reasoning-tier anchor:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Blended $/M (2:1)&lt;/th&gt;
&lt;th&gt;vs GLM-5.2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;$0.187&lt;/td&gt;
&lt;td&gt;12.86x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;$0.593&lt;/td&gt;
&lt;td&gt;4.04x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M3&lt;/td&gt;
&lt;td&gt;$1.200&lt;/td&gt;
&lt;td&gt;2.00x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;$1.967&lt;/td&gt;
&lt;td&gt;1.22x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;$2.400&lt;/td&gt;
&lt;td&gt;1.00x (anchor)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pull quote:&lt;/strong&gt; The cheapest model on this list costs 12.86x less than the most capable one. That spread is the entire economic case for routing — not which model "wins," but which jobs can ride the cheap tier without anyone noticing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ranking shifts a little with workload shape. At 1:3 output-heavy (code generation), GLM-5.2 climbs to $3.65/M and Kimi K2.6 to $3.24/M, while V4 Flash stays at $0.245/M. Output-heavy work tilts even harder toward the DeepSeek tiers because their output token is the cheapest of the five. If you only remember one rule: &lt;strong&gt;the more your job writes, the more it pays to route off GLM-5.2 and Kimi K2.6.&lt;/strong&gt;&lt;/p&gt;
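&lt;p&gt;To recompute the blended figures for your own input-to-output mix, a minimal sketch follows. It assumes only the listed rates from the spec table; the names &lt;code&gt;PRICES&lt;/code&gt; and &lt;code&gt;blended_cost&lt;/code&gt; are illustrative and not part of any ofox SDK.&lt;/p&gt;

```python
# Listed per-million-token rates from the spec table: (input $/M, output $/M).
PRICES = {
    "deepseek/deepseek-v4-flash": (0.14, 0.28),
    "deepseek/deepseek-v4-pro": (0.45, 0.88),
    "minimax/minimax-m3": (0.60, 2.40),
    "moonshotai/kimi-k2.6": (0.95, 4.00),
    "z-ai/glm-5.2": (1.40, 4.40),
}


def blended_cost(model, input_parts, output_parts):
    """Weighted-average $/M for a given input:output token mix."""
    input_rate, output_rate = PRICES[model]
    total_parts = input_parts + output_parts
    return (input_rate * input_parts + output_rate * output_parts) / total_parts


for model in PRICES:
    coding = blended_cost(model, 2, 1)      # 2:1 read-heavy agent mix
    generation = blended_cost(model, 1, 3)  # 1:3 output-heavy mix
    print(f"{model:28s} 2:1 ${coding:.3f}/M   1:3 ${generation:.3f}/M")
```

&lt;p&gt;Passing your measured ratio, for example &lt;code&gt;blended_cost(model, 5, 1)&lt;/code&gt; for a summarization job, shows how far the ranking moves for that workload.&lt;/p&gt;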

&lt;p&gt;If you want to stop estimating and measure these numbers on your own traffic, &lt;a href="https://ofox.ai/en/models/deepseek?utm_source=blog&amp;amp;utm_medium=marvin_article&amp;amp;utm_campaign=multi-model-router-cta" rel="noopener noreferrer"&gt;route all five models through one ofox key&lt;/a&gt; — pay-as-you-go, no monthly fee, same OpenAI SDK shape, and the A/B loop at the end of this post swaps models with a one-line string change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Task Cost: What One Agent Run Costs on Each Model
&lt;/h2&gt;

&lt;p&gt;Routing decisions are easier to feel in per-run dollars than per-million-token rates. Take a representative agent run: &lt;strong&gt;50,000 input tokens, 15,000 output tokens&lt;/strong&gt; (read a chunk of a codebase, produce a change).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost per run (50K in / 15K out)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;$0.0112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;$0.0357&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M3&lt;/td&gt;
&lt;td&gt;$0.0660&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;$0.1075&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;$0.1360&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 10,000 such runs a month, that is &lt;strong&gt;$112 on V4 Flash versus $1,360 on GLM-5.2&lt;/strong&gt; for the same work. If even half those runs are routine enough for the budget tier, the routing decision pays for itself many times over. The point is not that V4 Flash is always right — it is that paying GLM-5.2's price for a job V4 Flash could handle is pure waste.&lt;/p&gt;
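&lt;p&gt;The "half those runs" case can be made concrete with a small sketch. It takes the two per-run costs from the table above and assumes a simple two-way split between V4 Flash and GLM-5.2; &lt;code&gt;monthly_bill&lt;/code&gt; and &lt;code&gt;budget_share&lt;/code&gt; are hypothetical names used only for illustration.&lt;/p&gt;

```python
# Per-run costs for the 50K-in / 15K-out agent run, taken from the table above.
COST_PER_RUN = {
    "deepseek/deepseek-v4-flash": 0.0112,
    "z-ai/glm-5.2": 0.1360,
}


def monthly_bill(runs, budget_share):
    """Monthly cost when `budget_share` of runs go to V4 Flash, the rest to GLM-5.2."""
    budget_runs = runs * budget_share
    premium_runs = runs - budget_runs
    return (budget_runs * COST_PER_RUN["deepseek/deepseek-v4-flash"]
            + premium_runs * COST_PER_RUN["z-ai/glm-5.2"])


for share in (0.0, 0.5, 1.0):
    print(f"{share:.0%} on V4 Flash: ${monthly_bill(10_000, share):,.2f}/month")
```

&lt;p&gt;At a 50% split the 10,000-run bill lands at about $736, a little over half of the all-GLM $1,360.&lt;/p&gt;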

&lt;h2&gt;
  
  
  The Routing Decision Matrix (Worked Example)
&lt;/h2&gt;

&lt;p&gt;Here is the part most "use a router" articles skip: the actual daily math. Assume &lt;strong&gt;1,000 mixed jobs per day&lt;/strong&gt; with this realistic distribution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job class&lt;/th&gt;
&lt;th&gt;Count/day&lt;/th&gt;
&lt;th&gt;Tokens (in / out)&lt;/th&gt;
&lt;th&gt;Routed to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Budget / batch&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;10K / 2K&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context&lt;/td&gt;
&lt;td&gt;250&lt;/td&gt;
&lt;td&gt;300K / 8K&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning / code&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;40K / 12K&lt;/td&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal (image)&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;16.5K / 3K&lt;/td&gt;
&lt;td&gt;MiniMax M3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run everything on GLM-5.2 (the one-model trap) versus routing each class to its cost-appropriate model:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Daily cost&lt;/th&gt;
&lt;th&gt;Monthly (×30)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All-GLM-5.2 baseline&lt;/td&gt;
&lt;td&gt;$140.17&lt;/td&gt;
&lt;td&gt;~$4,205&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$48.42&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$1,453&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Savings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$91.75/day&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2,753/mo (-65.5%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The breakdown of the routed total:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job class&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Daily cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Budget / batch (600)&lt;/td&gt;
&lt;td&gt;V4 Flash&lt;/td&gt;
&lt;td&gt;$1.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context (250)&lt;/td&gt;
&lt;td&gt;V4 Pro&lt;/td&gt;
&lt;td&gt;$35.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning / code (100)&lt;/td&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;$10.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal (50)&lt;/td&gt;
&lt;td&gt;MiniMax M3&lt;/td&gt;
&lt;td&gt;$0.85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$48.42&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 600 batch jobs (60% of volume) cost &lt;strong&gt;$1.18/day&lt;/strong&gt; on V4 Flash. On GLM-5.2 the same 600 jobs would cost about $13.68/day, roughly 11.6× more. That single routing rule (cheap batch → V4 Flash) covers most of the volume but only about $12.50 of the $91.75 daily saving. The long-context class is where the dollars actually concentrate: moving it from GLM-5.2 ($113.80/day) to V4 Pro ($35.51/day) saves about $78.29/day, which is why the next section matters.&lt;/p&gt;
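&lt;p&gt;The daily totals can be reproduced from the job table and the listed rates. This is a sketch under those two inputs only; &lt;code&gt;JOBS&lt;/code&gt;, &lt;code&gt;job_cost&lt;/code&gt;, and &lt;code&gt;daily_cost&lt;/code&gt; are illustrative names, and the distribution is the example one from this post, not measured traffic.&lt;/p&gt;

```python
# Listed rates from the spec table: (input $/M, output $/M).
PRICES = {
    "deepseek/deepseek-v4-flash": (0.14, 0.28),
    "deepseek/deepseek-v4-pro": (0.45, 0.88),
    "minimax/minimax-m3": (0.60, 2.40),
    "z-ai/glm-5.2": (1.40, 4.40),
}

# (jobs per day, input tokens, output tokens, routed model), from the job table.
JOBS = [
    (600, 10_000, 2_000, "deepseek/deepseek-v4-flash"),
    (250, 300_000, 8_000, "deepseek/deepseek-v4-pro"),
    (100, 40_000, 12_000, "z-ai/glm-5.2"),
    (50, 16_500, 3_000, "minimax/minimax-m3"),
]


def job_cost(model, input_tokens, output_tokens):
    """Dollar cost of one job at the listed per-million rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def daily_cost(jobs, force_model=None):
    """Sum the day's cost; `force_model` prices every job on a single model."""
    return sum(count * job_cost(force_model or model, tokens_in, tokens_out)
               for count, tokens_in, tokens_out, model in jobs)


routed = daily_cost(JOBS)
baseline = daily_cost(JOBS, force_model="z-ai/glm-5.2")
print(f"routed ${routed:.2f}/day, all-GLM ${baseline:.2f}/day, "
      f"saving {1 - routed / baseline:.1%}")
```

&lt;p&gt;Swapping in your own counts and token sizes in &lt;code&gt;JOBS&lt;/code&gt; is the quickest way to see whether your traffic is mixed enough for routing to pay.&lt;/p&gt;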

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
    A[Incoming request] --&amp;gt; B{Needs image input?}
    B --&amp;gt;|Yes| C[minimax/minimax-m3]
    B --&amp;gt;|No| D{Hard reasoning&amp;lt;br/&amp;gt;or agentic coding?}
    D --&amp;gt;|Yes| E[z-ai/glm-5.2]
    D --&amp;gt;|No| F{Context &amp;gt; 200K&amp;lt;br/&amp;gt;tokens?}
    F --&amp;gt;|Yes| G[deepseek/deepseek-v4-pro&amp;lt;br/&amp;gt;free cache reads, 1M ctx]
    F --&amp;gt;|No| H[deepseek/deepseek-v4-flash&amp;lt;br/&amp;gt;cheapest tier]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cache Reads: DeepSeek V4's Quiet Cost Advantage
&lt;/h2&gt;

&lt;p&gt;The long-context class above is where caching changes the math. DeepSeek V4 Pro and Flash bill cache reads at effectively $0/M. GLM-5.2 bills them at $0.26/M, MiniMax M3 at $0.12/M, Kimi K2.6 at $0.16/M.&lt;/p&gt;

&lt;p&gt;Take the 300K-input long-context job from the routing table (per-run cost includes 8K output), with 80% of the input served from cache (realistic for code-review loops where the same codebase context repeats across requests):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;No cache&lt;/th&gt;
&lt;th&gt;80% input cache&lt;/th&gt;
&lt;th&gt;Saving&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;$0.1420&lt;/td&gt;
&lt;td&gt;$0.0340&lt;/td&gt;
&lt;td&gt;76.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;$0.4552&lt;/td&gt;
&lt;td&gt;$0.1816&lt;/td&gt;
&lt;td&gt;60.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;V4 Pro starts cheaper and saves a larger share, because its cache read rounds to zero while GLM-5.2 still pays $0.26/M on the cached portion. &lt;strong&gt;For any workload that re-sends the same long context (RAG over a fixed corpus, iterative code review, document Q&amp;amp;A), route to DeepSeek V4 Pro and the free cache read compounds.&lt;/strong&gt; GLM-5.2's stronger reasoning does not always justify giving up that saving.&lt;/p&gt;
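&lt;p&gt;The cached-run arithmetic is easy to check for other hit ratios. The sketch below assumes the V4 cache-read rate is exactly zero (the table lists it as ~$0.00) and that cached tokens are billed at the cache-read rate instead of the input rate; &lt;code&gt;run_cost&lt;/code&gt; and &lt;code&gt;cache_hit_ratio&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
# (input $/M, cache-read $/M, output $/M) from the spec table.
# Assumption: the V4 Pro cache-read rate, listed as ~$0.00, is modelled as 0.
RATES = {
    "deepseek/deepseek-v4-pro": (0.45, 0.00, 0.88),
    "z-ai/glm-5.2": (1.40, 0.26, 4.40),
}


def run_cost(model, input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Cost of one run when a share of the input is served from cache."""
    input_rate, cache_rate, output_rate = RATES[model]
    cached = input_tokens * cache_hit_ratio
    fresh = input_tokens - cached
    total = fresh * input_rate + cached * cache_rate + output_tokens * output_rate
    return total / 1_000_000


for model in RATES:
    cold = run_cost(model, 300_000, 8_000)
    warm = run_cost(model, 300_000, 8_000, cache_hit_ratio=0.8)
    print(f"{model}: ${cold:.4f} -> ${warm:.4f} ({1 - warm / cold:.1%} saved)")
```

&lt;p&gt;Your real hit ratio depends on how stable the repeated prefix is, so treat 0.8 as the example value from the table rather than a default.&lt;/p&gt;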

&lt;h2&gt;
  
  
  Splitting the Reasoning Tier: GLM-5.2 vs Kimi K2.6
&lt;/h2&gt;

&lt;p&gt;The routing matrix sends "hard reasoning / agentic coding" to GLM-5.2 or Kimi K2.6, and that "or" deserves a rule rather than a coin flip. Both sit at the expensive end of this lineup (GLM-5.2 at $1.40/$4.40, Kimi K2.6 at $0.95/$4.00), and on a 2:1 mix Kimi K2.6 actually blends cheaper ($1.97/M vs $2.40/M) because both its input and output rates are lower. Four concrete factors decide the route:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision factor&lt;/th&gt;
&lt;th&gt;Route to GLM-5.2&lt;/th&gt;
&lt;th&gt;Route to Kimi K2.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context length needed&lt;/td&gt;
&lt;td&gt;Up to 1,048,576 tokens&lt;/td&gt;
&lt;td&gt;Caps at 262,144 — drop it for &amp;gt;256K jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image input in the task&lt;/td&gt;
&lt;td&gt;Not supported (text-only)&lt;/td&gt;
&lt;td&gt;Supported (text + image)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cheaper blended cost at 2:1&lt;/td&gt;
&lt;td&gt;$2.40/M&lt;/td&gt;
&lt;td&gt;$1.97/M (18% lower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max single output&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;td&gt;262,144 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The practical rule: &lt;strong&gt;if the reasoning job carries a large context (&amp;gt;256K tokens), GLM-5.2 is the only one of the two that fits — Kimi K2.6 will reject the input.&lt;/strong&gt; If the context is comfortably under 256K and the job involves an image or wants the cheaper per-token rate, Kimi K2.6 is the better route. For most short-context agentic coding turns, Kimi K2.6's lower input price makes it the value pick inside the reasoning tier; reserve GLM-5.2 for the long-context reasoning that only its 1M window can hold. The &lt;a href="https://ofox.ai/blog/kimi-k2-6-release-guide-2026/" rel="noopener noreferrer"&gt;Kimi K2.6 release guide&lt;/a&gt; covers its agentic behavior in more depth.&lt;/p&gt;

&lt;p&gt;This is exactly why client-side routing beats locking to one model: the "best reasoning model" depends on the &lt;em&gt;shape&lt;/em&gt; of the reasoning job, and a &lt;code&gt;model&lt;/code&gt; string is the cheapest possible switch between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency and Throughput Are Routing Inputs Too
&lt;/h2&gt;

&lt;p&gt;Cost is the loudest routing signal, but not the only one. Two operational notes that change real routing decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Interactive vs batch.&lt;/strong&gt; For a user-facing assistant where first-token latency is felt, the cheapest model is not automatically the right one — a slightly pricier model that returns faster can be worth it on the interactive surface, while overnight batch jobs should ride the cheapest tier regardless of speed. Route by surface, not just by price: interactive traffic tolerates a higher per-token cost, batch traffic does not.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Output ceiling as a hard constraint.&lt;/strong&gt; If a single response must exceed roughly 130,000 tokens (full-file rewrites, large structured exports), GLM-5.2 (128,000) and MiniMax M3 (131,000) cap out and the call truncates. Only the DeepSeek V4 tiers (384K) and Kimi K2.6 (262K) clear that bar in one call. This is a binary routing gate, not a cost trade-off: send oversized-output jobs to a model that can physically emit the tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these are decisions your &lt;code&gt;pick_model&lt;/code&gt; function can encode as plain conditionals — surface type and expected output size are usually known at request time.&lt;/p&gt;
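&lt;p&gt;One way those two gates might look as conditionals is sketched below. The output ceilings come from the spec table; everything else is an assumption: &lt;code&gt;pick_with_gates&lt;/code&gt;, &lt;code&gt;interactive_override&lt;/code&gt;, and &lt;code&gt;fallbacks&lt;/code&gt; are hypothetical names, and the faster interactive model is left as a parameter because this post does not publish latency figures.&lt;/p&gt;

```python
# Max output tokens per model, copied from the spec table.
MAX_OUTPUT = {
    "deepseek/deepseek-v4-flash": 384_000,
    "deepseek/deepseek-v4-pro": 384_000,
    "moonshotai/kimi-k2.6": 262_144,
    "minimax/minimax-m3": 131_000,
    "z-ai/glm-5.2": 128_000,
}

# Fallback order: cheapest blended price first (from the blended-cost table).
CHEAPEST_FIRST = [
    "deepseek/deepseek-v4-flash",
    "deepseek/deepseek-v4-pro",
    "minimax/minimax-m3",
    "moonshotai/kimi-k2.6",
    "z-ai/glm-5.2",
]


def pick_with_gates(preferred, expected_output_tokens, surface,
                    interactive_override=None, fallbacks=CHEAPEST_FIRST):
    """Apply the surface gate, then the output-ceiling gate, to a preferred model."""
    candidate = preferred
    # Surface gate: interactive traffic may swap to a model you measured as faster.
    if surface == "interactive" and interactive_override is not None:
        candidate = interactive_override
    # Output gate: keep the candidate only if it can emit the expected output.
    if MAX_OUTPUT[candidate] >= expected_output_tokens:
        return candidate
    # Otherwise take the first fallback whose ceiling is high enough.
    for model in fallbacks:
        if MAX_OUTPUT[model] >= expected_output_tokens:
            return model
    raise ValueError("no listed model can emit that many tokens in one call")


print(pick_with_gates("z-ai/glm-5.2", 50_000, "batch"))
print(pick_with_gates("z-ai/glm-5.2", 200_000, "batch"))
```

&lt;p&gt;The fallback here checks only the output ceiling. A production version would also need to respect image support and context length, and you may prefer a quality-ordered &lt;code&gt;fallbacks&lt;/code&gt; list for reasoning jobs instead of the cheapest-first default.&lt;/p&gt;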

&lt;h2&gt;
  
  
  When NOT to Route (and What to Use Instead)
&lt;/h2&gt;

&lt;p&gt;Routing is not free engineering. Three cases where a multi-model split is the wrong move:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Single developer, &amp;lt; 1,000 calls/day, all one task type.&lt;/strong&gt; The routing logic and per-model quality testing cost more time than you save. Pick &lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt; as a strong, cheap default and move on. The $0.59/M blended cost is already low enough that micro-optimizing is not worth the branching code.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You actually need server-side automatic fusion.&lt;/strong&gt; ofox routes by &lt;em&gt;your&lt;/em&gt; &lt;code&gt;model&lt;/code&gt; field — it does not auto-pick a model or fuse outputs. If you specifically want quality-based auto-selection or response fusion (the OpenRouter Auto / Sakana-style idea), that is a different product category. Use one of those tools, or read our &lt;a href="https://ofox.ai/blog/is-openrouter-reliable-honest-review-2026/" rel="noopener noreferrer"&gt;honest review of whether OpenRouter is reliable&lt;/a&gt; before deciding the auto-router is worth the unpredictability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Every job genuinely needs top-tier reasoning.&lt;/strong&gt; If your traffic is 100% hard agentic coding with no budget-tier work, there is nothing to route — run GLM-5.2 (or Kimi K2.6) and skip the matrix. Routing only pays when your workload is &lt;em&gt;mixed&lt;/em&gt;. For a pure two-model reasoning split, our &lt;a href="https://ofox.ai/blog/claude-code-hybrid-routing-pattern-2026/" rel="noopener noreferrer"&gt;Claude Code hybrid routing pattern&lt;/a&gt; covers that narrower case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The routing payoff is proportional to how heterogeneous your traffic is. Homogeneous traffic → one model. Mixed traffic → the matrix above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It via ofox: Route All Five in One Loop
&lt;/h2&gt;

&lt;p&gt;All five models share &lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt; and one ofox key. Routing is a client-side decision: you set the &lt;code&gt;model&lt;/code&gt; field per request. Here is the routing function and an A/B loop in both Python and Node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python — route by task, then A/B the candidates
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;OFOXAI_API_KEY&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;has_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimax/minimax-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# only M3/Kimi take images
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hard_reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;                                       &lt;span class="c1"&gt;# split the reasoning tier
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;z-ai/glm-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;256_000&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moonshotai/kimi-k2.6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200_000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek/deepseek-v4-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# free cache reads, 1M ctx
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek/deepseek-v4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;                              &lt;span class="c1"&gt;# cheapest tier
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pick_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To compare candidates on your own traffic, loop over the model IDs with a fixed prompt — swap the string, keep everything else constant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CANDIDATES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek/deepseek-v4-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek/deepseek-v4-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;z-ai/glm-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CANDIDATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor this function for readability: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# log tokens to price each route
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
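
&lt;p&gt;To turn the token counts logged by that loop into a price per route, a small helper is enough. The sketch below is one way to do it, not part of the ofox API: &lt;code&gt;job_cost&lt;/code&gt; and &lt;code&gt;blended_rate&lt;/code&gt; are illustrative names, and the rates and model IDs in &lt;code&gt;RATES&lt;/code&gt; are placeholders, not real prices. Swap in the live per-token rates for the models you are comparing.&lt;/p&gt;

```python
def job_cost(prompt_tokens, completion_tokens, input_rate, output_rate):
    """Dollar cost of one call; rates are USD per million tokens."""
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000


def blended_rate(input_rate, output_rate, input_parts=2, output_parts=1):
    """Blended USD per million tokens at an input:output mix (2:1 by default)."""
    total_parts = input_parts + output_parts
    return (input_parts * input_rate + output_parts * output_rate) / total_parts


# Placeholder rates (input, output) in USD per million tokens. NOT real prices.
RATES = {
    "example/cheap-model": (0.10, 0.40),
    "example/strong-model": (1.00, 4.00),
}

for model, (input_rate, output_rate) in RATES.items():
    # 1200 and 600 stand in for u.prompt_tokens and u.completion_tokens.
    cost = job_cost(1200, 600, input_rate, output_rate)
    print(model, round(cost, 6), round(blended_rate(input_rate, output_rate), 4))
```

&lt;p&gt;Multiplying the per-job cost by your daily job count per route gives a monthly bill you can compare against a single-model baseline.&lt;/p&gt;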



&lt;h3&gt;
  
  
  Node — same shape
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OFOXAI_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pickModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
  &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hasImage&lt;/span&gt;        &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;minimax/minimax-m3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hardReasoning&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;256000&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;z-ai/glm-5.2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;moonshotai/kimi-k2.6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek/deepseek-v4-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek/deepseek-v4-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;pickModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Summarize this changelog: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multimodal only: attach a screenshot to MiniMax M3 or Kimi K2.6
&lt;/h3&gt;

&lt;p&gt;GLM-5.2 and both DeepSeek tiers are text-only, so the call below fails on them. Route image input to &lt;code&gt;minimax/minimax-m3&lt;/code&gt; or &lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;

&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimax/minimax-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# or moonshotai/kimi-k2.6
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What error is shown in this screenshot?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/png;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;]}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
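
&lt;p&gt;Because a misrouted image only surfaces as a failed request, it can help to check the route before sending. The guard below is an optional sketch under that assumption: &lt;code&gt;VISION_MODELS&lt;/code&gt;, &lt;code&gt;has_image&lt;/code&gt;, and &lt;code&gt;check_route&lt;/code&gt; are hypothetical names, and the set simply mirrors the two vision-capable IDs named above.&lt;/p&gt;

```python
# Vision-capable model IDs, per the routing rule in this post.
VISION_MODELS = {"minimax/minimax-m3", "moonshotai/kimi-k2.6"}


def has_image(messages):
    """True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def check_route(model, messages):
    """Raise before the request is sent if an image is routed to a text-only model."""
    if has_image(messages) and model not in VISION_MODELS:
        raise ValueError(f"{model} is text-only; use one of {sorted(VISION_MODELS)}")
    return model
```

&lt;p&gt;Pass the result straight through, for example &lt;code&gt;model=check_route(model, messages)&lt;/code&gt;, so a bad route raises locally instead of costing a round trip.&lt;/p&gt;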



&lt;p&gt;That is the whole router: a &lt;code&gt;pick_model&lt;/code&gt; function and one OpenAI client. No new SDK, no per-model API key, one billing line. Detail pages for each model are linked in the table — &lt;a href="https://ofox.ai/en/models/z-ai/glm-5.2" rel="noopener noreferrer"&gt;&lt;code&gt;z-ai/glm-5.2&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://ofox.ai/en/models/deepseek/deepseek-v4-pro" rel="noopener noreferrer"&gt;&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://ofox.ai/en/models/deepseek/deepseek-v4-flash" rel="noopener noreferrer"&gt;&lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://ofox.ai/en/models/minimax/minimax-m3" rel="noopener noreferrer"&gt;&lt;code&gt;minimax/minimax-m3&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://ofox.ai/en/models/moonshotai/kimi-k2.6" rel="noopener noreferrer"&gt;&lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives
&lt;/h2&gt;

&lt;p&gt;If a single-key, client-side router fits your workload, ofox is the simplest path: one OpenAI-compatible endpoint, one balance, all five model IDs. For other shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;ofox&lt;/strong&gt; — one key, 100+ models, OpenAI-compatible. You control routing via the &lt;code&gt;model&lt;/code&gt; field; billing and the endpoint are unified. Best when you want cost-predictable, deterministic routing you write yourself. See the &lt;a href="https://ofox.ai/blog/openrouter-alternatives-2026/" rel="noopener noreferrer"&gt;OpenRouter alternatives breakdown&lt;/a&gt; for how it compares on markup and reliability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OpenRouter&lt;/strong&gt; — large catalog with an optional &lt;code&gt;Auto&lt;/code&gt; server-side router that picks a model for you. Useful if you specifically want automatic selection and can tolerate less predictable routing and the platform's markup.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Direct provider APIs&lt;/strong&gt; — calling DeepSeek, Zhipu (GLM), MiniMax, and Moonshot each directly gives you the rawest pricing but four keys, four SDKs, and four billing lines to reconcile. Worth it only at very high single-provider volume.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Self-hosting&lt;/strong&gt; — GLM and DeepSeek publish open weights, so an air-gapped or fork-required deployment is possible. The economics only work at scale; see our &lt;a href="https://ofox.ai/blog/glm-5-2-self-host-vllm-hardware-cost-2026/" rel="noopener noreferrer"&gt;GLM-5.2 self-host hardware cost analysis&lt;/a&gt; for the breakeven math against hosted per-token pricing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For deeper per-model context, the &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;GLM-5.2 access guide&lt;/a&gt;, &lt;a href="https://ofox.ai/blog/glm-5-2-vs-gpt-5-5-cost-2026/" rel="noopener noreferrer"&gt;GLM-5.2 vs GPT-5.5 cost breakdown&lt;/a&gt;, &lt;a href="https://ofox.ai/blog/deepseek-v4-pro-vs-flash/" rel="noopener noreferrer"&gt;DeepSeek V4 Pro vs Flash comparison&lt;/a&gt;, &lt;a href="https://ofox.ai/blog/deepseek-v4-release-guide-2026/" rel="noopener noreferrer"&gt;DeepSeek V4 release guide&lt;/a&gt;, and &lt;a href="https://ofox.ai/blog/minimax-m3-vs-gpt-5-5-coding-benchmark-2026/" rel="noopener noreferrer"&gt;MiniMax M3 vs GPT-5.5 coding benchmark&lt;/a&gt; each go one layer deeper than this routing overview.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;The frontmatter FAQ block above answers the most common routing questions (one-key routing, cheapest model, longest context, which models do vision, real savings, free cache reads, no server-side auto-router, max output, and how to A/B). Those answers mirror the tables in this post — the cost numbers, model IDs, and routing rules are consistent throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources Checked for This Refresh
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  ofox &lt;code&gt;/v1/models&lt;/code&gt; live API catalog — all five model IDs, context windows, max output, and per-token pricing (input / output / cache read) verified 2026-06-23&lt;/li&gt;
&lt;li&gt;  ofox &lt;code&gt;llms-full.txt&lt;/code&gt; — OpenAI-compatible base_url &lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt; and single-key-across-models confirmed (2026-06-23)&lt;/li&gt;
&lt;li&gt;  ofox model detail pages for &lt;code&gt;z-ai/glm-5.2&lt;/code&gt;, &lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;, &lt;code&gt;deepseek/deepseek-v4-flash&lt;/code&gt;, &lt;code&gt;minimax/minimax-m3&lt;/code&gt;, &lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt; — all returned HTTP 200 (2026-06-23)&lt;/li&gt;
&lt;li&gt;  OpenAI Python SDK (&lt;code&gt;openai&lt;/code&gt; 2.43.0 on PyPI) and OpenAI Node SDK — SDK shape used in code examples (2026-06-23)&lt;/li&gt;
&lt;li&gt;  All cost tables are recomputable from the per-token rates in the Quick Specs table&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/multi-model-router-one-api-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>deepseek</category>
      <category>api</category>
    </item>
    <item>
      <title>Run GLM 5.2 Locally (2026): 2-bit on a 256GB Mac or 4090 box</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Tue, 23 Jun 2026 10:14:37 +0000</pubDate>
      <link>https://dev.to/owen_fox/run-glm-52-locally-2026-2-bit-on-a-256gb-mac-or-4090-box-1apn</link>
      <guid>https://dev.to/owen_fox/run-glm-52-locally-2026-2-bit-on-a-256gb-mac-or-4090-box-1apn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Zhipu put the GLM 5.2 weights on HuggingFace under an MIT license, so the question stopped being "can I download a frontier coding model" and became "will it run on the machine I already own." For a single Mac Studio or a desktop with one GPU and a lot of RAM, the answer is a qualified yes. The qualifier is the quant.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What You Can Run Locally (and What You Can't)
&lt;/h2&gt;

&lt;p&gt;This guide is about running GLM 5.2 on one machine you own, using quantized GGUF weights and llama.cpp, LM Studio, or Unsloth Studio. That is a different job from serving it to a team on a rack of H200s, which the &lt;a href="https://ofox.ai/blog/glm-5-2-self-host-vllm-hardware-cost-2026/" rel="noopener noreferrer"&gt;GLM 5.2 self-host hardware and cost guide&lt;/a&gt; covers, and a different job again from calling the hosted API, which the &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;GLM 5.2 access guide&lt;/a&gt; covers.&lt;/p&gt;

&lt;p&gt;GLM 5.2 is a 753B-parameter model with a 1M-token context, released under MIT. At full BF16 precision the weights are ~1.5 TB, which does not fit any single desktop. Local inference means quantizing: trading some quality for a footprint that fits in your RAM. Here is the 30-second version of what fits where.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your machine&lt;/th&gt;
&lt;th&gt;Quant that fits&lt;/th&gt;
&lt;th&gt;Disk / RAM needed&lt;/th&gt;
&lt;th&gt;What to expect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra, 512 GB&lt;/td&gt;
&lt;td&gt;4-bit UD-Q4_K_XL&lt;/td&gt;
&lt;td&gt;~376-475 GB&lt;/td&gt;
&lt;td&gt;Best local quality, mostly lossless, usable coding speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra, 256 GB&lt;/td&gt;
&lt;td&gt;2-bit UD-IQ2_M&lt;/td&gt;
&lt;td&gt;~240 GB&lt;/td&gt;
&lt;td&gt;Codes well, ~3-9 tok/s, the common local rig&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop + 4090 + 256 GB DDR5&lt;/td&gt;
&lt;td&gt;2-bit UD-IQ2_M&lt;/td&gt;
&lt;td&gt;~240 GB&lt;/td&gt;
&lt;td&gt;Runs via offload, low single-digit tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8x H200 or 4x H100 rack&lt;/td&gt;
&lt;td&gt;FP8 / Q4&lt;/td&gt;
&lt;td&gt;376-750 GB&lt;/td&gt;
&lt;td&gt;Production scale, see the self-host guide&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MacBook / 64-128 GB box&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;Use the hosted plan instead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest headline: a 256 GB Mac Studio running the 2-bit quant is the realistic "GLM 5.2 on my desk" setup. The 4-bit quant is the quality sweet spot, but it wants a 512 GB machine or heavy offload. Anything smaller than 256 GB is a hosted-API job, not a local one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Frame: When Local GLM 5.2 Is Worth It (and When NOT)
&lt;/h2&gt;

&lt;p&gt;Run the quant locally for the right reasons. The wrong reason is saving money, because for almost everyone the hosted plan is cheaper.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to run it locally
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Offline or air-gapped work.&lt;/strong&gt; No outbound traffic to &lt;code&gt;api.z.ai&lt;/code&gt; is allowed, so the model has to live on your hardware.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Privacy on a single box.&lt;/strong&gt; Your prompts and code never leave the machine, and one Mac Studio is the whole perimeter.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You already own the hardware.&lt;/strong&gt; A 256 GB or 512 GB Mac Studio bought for video or ML work is sitting idle at night, and a local quant costs you nothing extra to run.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tinkering and learning.&lt;/strong&gt; You want to feel how a 753B MoE behaves, test sampling settings, or build against a local OpenAI-compatible endpoint with no rate limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When NOT to run it locally
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;You want it to be cheap and fast.&lt;/strong&gt; The Z.ai Coding Plan is ~$30/month and runs at full speed. A 2-bit local quant at 3-9 tok/s cannot match that, even when electricity is the only cost you count. Read the &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;access guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You need to serve more than one person.&lt;/strong&gt; A single Mac Studio is a single-session machine. Two developers hammering it at once will each feel it crawl. That is the datacenter path.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Your machine is under 256 GB.&lt;/strong&gt; There is no quant that makes GLM 5.2 fit a 128 GB box at quality worth using. Do not burn a weekend trying.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You need the full 1M context.&lt;/strong&gt; Long-context KV cache does not fit on consumer hardware. Local tops out around 16K-64K in practice.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stop rule
&lt;/h3&gt;

&lt;p&gt;If you do not have at least 256 GB of unified memory or system RAM, stop here and use the hosted plan. No amount of quantization changes that floor.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Requirements
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TD
  A[How much memory?] --&amp;gt;|512 GB Mac| B[4-bit UD-Q4_K_XL&amp;lt;br/&amp;gt;best local quality]
  A --&amp;gt;|256 GB Mac or DDR5| C[2-bit UD-IQ2_M&amp;lt;br/&amp;gt;the common rig]
  A --&amp;gt;|under 256 GB| D[Use the hosted plan&amp;lt;br/&amp;gt;not a local job]
  B --&amp;gt; E[llama.cpp / LM Studio / Unsloth Studio]
  C --&amp;gt; E
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before you pull 240 GB of weights, confirm you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Memory.&lt;/strong&gt; 256 GB minimum (unified memory on Apple silicon, or system DDR5 on a CUDA box). The 2-bit quant is ~240 GB, so on a 256 GB machine the headroom is genuinely tight: close other apps and leave macOS its share of unified memory, or you will hit swap. Plan on 512 GB to run 4-bit comfortably.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disk.&lt;/strong&gt; Enough free space for the quant itself: ~240 GB for 2-bit, ~376-475 GB for 4-bit, plus some headroom. Use an SSD, not a spinning disk, or load times become painful.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;A runner.&lt;/strong&gt; llama.cpp built from a recent commit, LM Studio, or Unsloth Studio. The architecture (GLM MoE DSA) is new enough that an old llama.cpp build will fail to load the tensors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The right repo.&lt;/strong&gt; Community GGUF quants live at &lt;code&gt;huggingface.co/unsloth/GLM-5.2-GGUF&lt;/code&gt;. The official &lt;code&gt;zai-org/GLM-5.2&lt;/code&gt; repo is BF16 only and is not what you want for local inference.&lt;/li&gt;
&lt;/ul&gt;
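
&lt;p&gt;The memory and disk checks above can be scripted as a quick preflight. This is a minimal sketch: the thresholds and sizes are the approximate figures from this guide, &lt;code&gt;pick_quant&lt;/code&gt; and &lt;code&gt;free_disk_gb&lt;/code&gt; are illustrative names, and you supply your machine's memory in GB yourself.&lt;/p&gt;

```python
import shutil

# (minimum memory in GB, quant label, approximate size on disk in GB).
# Figures are the rough numbers from this guide; check the live repo for exact sizes.
QUANTS = [
    (512, "UD-Q4_K_XL", 475),  # 4-bit: upper end of the ~376-475 GB range
    (256, "UD-IQ2_M", 240),    # 2-bit: ~240 GB
]


def pick_quant(memory_gb):
    """Return (quant_label, disk_gb) for the best quant that fits, or None."""
    for min_memory_gb, label, disk_gb in QUANTS:
        if memory_gb >= min_memory_gb:
            return label, disk_gb
    return None  # under 256 GB: use the hosted plan


def free_disk_gb(path):
    """Free space at path, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


choice = pick_quant(256)  # replace 256 with your machine's memory in GB
if choice is None:
    print("Under 256 GB: use the hosted plan")
else:
    label, disk_gb = choice
    print(label, "needs about", disk_gb, "GB; free here:", round(free_disk_gb("."), 1), "GB")
```

&lt;p&gt;Point &lt;code&gt;free_disk_gb&lt;/code&gt; at the directory you plan to download into rather than the current directory if they sit on different volumes.&lt;/p&gt;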

&lt;h2&gt;
  
  
  Step-by-Step: Run GLM 5.2 Locally
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Pull a GGUF quant
&lt;/h3&gt;

&lt;p&gt;Download only the quant you need, not the whole repo. The &lt;code&gt;--include&lt;/code&gt; filter keeps you from fetching 750 GB of shards you will not use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 2-bit for a 256 GB machine (~240 GB on disk)&lt;/span&gt;
hf download unsloth/GLM-5.2-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; ~/models/glm-5.2-gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"*UD-IQ2_M*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should end up with a set of &lt;code&gt;GLM-5.2-UD-IQ2_M-0000X-of-0000Y.gguf&lt;/code&gt; shards in &lt;code&gt;~/models/glm-5.2-gguf&lt;/code&gt;. Swap the filter to &lt;code&gt;*UD-Q4_K_XL*&lt;/code&gt; if you are on a 512 GB machine. Check the live "Files and versions" tab on HuggingFace for the exact shard names, since Unsloth revises quant labels as the dynamic quants improve.&lt;/p&gt;
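
&lt;p&gt;An interrupted download can leave a shard missing, and the runner will only complain at load time. The sketch below checks a list of file names for gaps, assuming the &lt;code&gt;-0000X-of-0000Y.gguf&lt;/code&gt; suffix shown above; &lt;code&gt;missing_shards&lt;/code&gt; is an illustrative helper, and the sample names are hypothetical, since the real shard count may differ.&lt;/p&gt;

```python
import re

# Matches the "-00001-of-00006.gguf" suffix on split GGUF files.
SHARD_PATTERN = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def missing_shards(filenames):
    """Return the sorted shard numbers absent from a split GGUF set."""
    seen = set()
    total = 0
    for name in filenames:
        match = SHARD_PATTERN.search(name)
        if match:
            seen.add(int(match.group(1)))
            total = max(total, int(match.group(2)))
    if total == 0:
        raise ValueError("no shard files found")
    return [number for number in range(1, total + 1) if number not in seen]


# Hypothetical file names with shard 4 left out on purpose.
names = [f"GLM-5.2-UD-IQ2_M-{i:05d}-of-00006.gguf" for i in (1, 2, 3, 5, 6)]
print(missing_shards(names))  # prints [4]
```

&lt;p&gt;To check a real download, pass it the names from your model directory, for example &lt;code&gt;[p.name for p in pathlib.Path("~/models/glm-5.2-gguf").expanduser().glob("*.gguf")]&lt;/code&gt;.&lt;/p&gt;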

&lt;h3&gt;
  
  
  Step 2: Run it with llama.cpp
&lt;/h3&gt;

&lt;p&gt;This is the command-line path and the one with the most control. Build a recent llama.cpp first (Metal compiles automatically on Mac; add &lt;code&gt;-DGGML_CUDA=ON&lt;/code&gt; on an Nvidia box).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build once&lt;/span&gt;
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;

&lt;span class="c"&gt;# Serve an OpenAI-compatible endpoint on port 8080&lt;/span&gt;
./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--temp&lt;/span&gt; 1.0 &lt;span class="nt"&gt;--top-p&lt;/span&gt; 0.95 &lt;span class="nt"&gt;--min-p&lt;/span&gt; 0.01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each flag earns its place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;--ctx-size 32768&lt;/code&gt; sets a 32K window. Raising it eats memory fast on a 256 GB machine; start here and grow only if a request needs it.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;--n-gpu-layers 999&lt;/code&gt; asks llama.cpp to offload every layer to the GPU. On a Mac the unified memory makes this nearly free. On a 4090 the full model cannot fit in 24 GB, so lower this value until the load succeeds; the layers that do not fit stay in system RAM and run on the CPU.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;--temp 1.0 --top-p 0.95 --min-p 0.01&lt;/code&gt; are Zhipu's recommended sampling defaults. Getting these wrong is the most common cause of "the local model is dumber than the hosted one."&lt;/li&gt;
&lt;/ul&gt;

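&lt;p&gt;Once it loads, &lt;code&gt;llama-server&lt;/code&gt; logs the layer count and then prints &lt;code&gt;server listening on http://0.0.0.0:8080&lt;/code&gt;. The first load takes a minute or two off an SSD. Note that &lt;code&gt;--host 0.0.0.0&lt;/code&gt; binds every network interface; use &lt;code&gt;--host 127.0.0.1&lt;/code&gt; if only the local machine should reach the endpoint.&lt;/p&gt;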

&lt;h3&gt;
  
  
  Step 3: Or use a GUI (LM Studio / Unsloth Studio)
&lt;/h3&gt;

&lt;p&gt;If you would rather not touch a build toolchain, two GUI apps load the same GGUF quants.&lt;/p&gt;

&lt;p&gt;LM Studio runs the same GGUF quants from a desktop app. Search for &lt;code&gt;unsloth/GLM-5.2-GGUF&lt;/code&gt; in the in-app model browser, pick the 2-bit or 4-bit quant, and it handles the download and serving, exposing the same OpenAI-compatible endpoint on a local port.&lt;/p&gt;

&lt;p&gt;Unsloth Studio is a web UI with automatic memory offloading, installed in one line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://unsloth.ai/install.sh | sh
unsloth studio &lt;span class="nt"&gt;-H&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;-p&lt;/span&gt; 8888
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are the better choice if you want to swap quants and settings without re-typing a long llama.cpp command each time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Smoke test
&lt;/h3&gt;

&lt;p&gt;Point any OpenAI client at the local port and confirm it answers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "glm-5.2",
    "messages": [{"role":"user","content":"Reply with only the string OK."}],
    "max_tokens": 16
  }'&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.choices[0].message.content'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get &lt;code&gt;OK&lt;/code&gt; back after a short pause. If the reply is garbled or loops, your sampling params are off, so re-check &lt;code&gt;--temp 1.0 --top-p 0.95 --min-p 0.01&lt;/code&gt; against the values in &lt;code&gt;huggingface.co/zai-org/GLM-5.2/generation_config.json&lt;/code&gt;.&lt;/p&gt;
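
&lt;p&gt;The same smoke test can be run from Python without installing an SDK. This is a sketch using only the standard library; &lt;code&gt;build_request&lt;/code&gt; and &lt;code&gt;smoke_test&lt;/code&gt; are illustrative names, and the base URL assumes the &lt;code&gt;llama-server&lt;/code&gt; port from Step 2.&lt;/p&gt;

```python
import json
import urllib.request


def build_request(base_url="http://localhost:8080/v1"):
    """Build the same chat-completions request as the curl command."""
    payload = {
        "model": "glm-5.2",
        "messages": [{"role": "user", "content": "Reply with only the string OK."}],
        "max_tokens": 16,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url="http://localhost:8080/v1", timeout=120):
    """Send the request and return the reply text; raises if the server is unreachable."""
    with urllib.request.urlopen(build_request(base_url), timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

&lt;p&gt;Call &lt;code&gt;smoke_test()&lt;/code&gt; once the server reports it is listening; the generous timeout leaves room for a slow first token on a 2-bit quant.&lt;/p&gt;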

&lt;h2&gt;
  
  
  Real Tokens/sec: What to Expect by Tier
&lt;/h2&gt;

&lt;p&gt;Generation speed on local hardware is bound by memory bandwidth, not raw compute, which is why a Mac Studio with 800 GB/s unified memory beats a DDR5 desktop whose RAM runs closer to 80-100 GB/s. These are the figures to plan around.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Realistic generation speed&lt;/th&gt;
&lt;th&gt;Good for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra, 256 GB&lt;/td&gt;
&lt;td&gt;2-bit UD-IQ2_M&lt;/td&gt;
&lt;td&gt;~3-9 tok/s&lt;/td&gt;
&lt;td&gt;Solo coding agent, one session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra, 512 GB&lt;/td&gt;
&lt;td&gt;4-bit UD-Q4_K_XL&lt;/td&gt;
&lt;td&gt;a few tok/s, higher quality&lt;/td&gt;
&lt;td&gt;Solo work where correctness matters more than speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop, 4090 + 256 GB DDR5&lt;/td&gt;
&lt;td&gt;2-bit UD-IQ2_M&lt;/td&gt;
&lt;td&gt;low single digits&lt;/td&gt;
&lt;td&gt;Tinkering, offline use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4x H100 / 8x H200 rack&lt;/td&gt;
&lt;td&gt;Q4 / FP8&lt;/td&gt;
&lt;td&gt;tens of tok/s per stream&lt;/td&gt;
&lt;td&gt;Teams (see self-host guide)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
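
&lt;p&gt;The bandwidth argument can be made concrete with a rough ceiling: if every generated token has to stream a fixed number of gigabytes of weights from memory, tokens per second cannot exceed bandwidth divided by that figure. The sketch below assumes exactly that simplified model. &lt;code&gt;GB_READ_PER_TOKEN&lt;/code&gt; is a placeholder, not a measured value for GLM 5.2, and the 90 GB/s entry is just the midpoint of the 80-100 GB/s range above.&lt;/p&gt;

```python
def decode_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tokens/sec when decoding is memory-bandwidth bound."""
    if not gb_read_per_token > 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# Placeholder: GB of weights streamed per generated token. NOT a measured figure.
GB_READ_PER_TOKEN = 20.0

for label, bandwidth in [("Mac Studio unified memory", 800), ("DDR5 desktop", 90)]:
    print(label, decode_ceiling_tok_s(bandwidth, GB_READ_PER_TOKEN))
```

&lt;p&gt;Whatever value the placeholder takes, the ratio between the two machines stays at roughly 9x, which is the gap the table reflects. Real speeds land below the ceiling because of compute, prompt processing, and offload overhead.&lt;/p&gt;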

&lt;p&gt;The pattern: local GLM 5.2 is a single-stream, single-developer tool. The speed is fine for one coding agent thinking through a task. It is not fine for a shared endpoint, and no consumer quant changes that. If you need throughput for a team, the &lt;a href="https://ofox.ai/blog/glm-5-2-self-host-vllm-hardware-cost-2026/" rel="noopener noreferrer"&gt;self-host hardware guide&lt;/a&gt; walks the vLLM and SGLang path on datacenter GPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Errors During Local Setup (and Fixes)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tensor not found: blk.X.attn_q.weight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;llama.cpp build too old for GLM MoE DSA&lt;/td&gt;
&lt;td&gt;Pull a recent llama.cpp commit and rebuild with &lt;code&gt;cmake --build build&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process killed / swap thrash on load&lt;/td&gt;
&lt;td&gt;Quant is bigger than free RAM&lt;/td&gt;
&lt;td&gt;Drop to a smaller quant, or close other apps; 2-bit needs ~240 GB free, not just installed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output is repetitive or incoherent&lt;/td&gt;
&lt;td&gt;Sampling params not aligned to Zhipu defaults&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;--temp 1.0 --top-p 0.95 --min-p 0.01&lt;/code&gt;; do not leave top_k at a low default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Painfully slow generation on a 4090 box&lt;/td&gt;
&lt;td&gt;Most layers running from DDR5, not VRAM&lt;/td&gt;
&lt;td&gt;Expected on 24 GB VRAM; lower &lt;code&gt;--ctx-size&lt;/code&gt;, or move to a 256 GB Mac for better bandwidth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;failed to allocate KV cache&lt;/code&gt; at high ctx-size&lt;/td&gt;
&lt;td&gt;Context window too large for remaining memory&lt;/td&gt;
&lt;td&gt;Lower &lt;code&gt;--ctx-size&lt;/code&gt;, or quantize the KV cache with &lt;code&gt;--cache-type-k q4_1 --cache-type-v q4_1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model "thinks" forever before answering&lt;/td&gt;
&lt;td&gt;Thinking mode on for a task that does not need it&lt;/td&gt;
&lt;td&gt;Disable it with &lt;code&gt;--chat-template-kwargs '{"enable_thinking":false}'&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama pull only offers &lt;code&gt;glm-5.2:cloud&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;No local Ollama tag exists yet&lt;/td&gt;
&lt;td&gt;Use llama.cpp or LM Studio with the Unsloth GGUF instead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Team / Multi-Developer: When One Mac Isn't Enough
&lt;/h2&gt;

&lt;p&gt;A single local machine serves one person. The moment a second developer points an agent at the same &lt;code&gt;llama-server&lt;/code&gt;, both sessions slow to a crawl, because consumer hardware has no spare bandwidth to split. There is no clever flag that fixes this.&lt;/p&gt;

&lt;p&gt;Two real options when local stops scaling:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Move to datacenter GPUs.&lt;/strong&gt; An 8x H200 node serving FP8 handles many concurrent streams at tens of tokens per second each. That is a different cost and operations story, fully worked through in the &lt;a href="https://ofox.ai/blog/glm-5-2-self-host-vllm-hardware-cost-2026/" rel="noopener noreferrer"&gt;self-host vLLM and cost guide&lt;/a&gt;, including the break-even math against the hosted plan.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Use a hosted endpoint and stop running metal.&lt;/strong&gt; For most teams this wins on every axis except data residency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The local quant is the right tool for one developer who wants the model on their own machine. It is the wrong tool for a shared service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: Long Context and Thinking Mode
&lt;/h2&gt;

&lt;p&gt;Two knobs are worth knowing once the basic setup runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KV cache quantization.&lt;/strong&gt; The 1M context is real in the architecture but unreachable on a 256 GB box, because the KV cache alone would need hundreds of gigabytes. Quantizing it buys back room:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; ~/models/glm-5.2-gguf/GLM-5.2-UD-IQ2_M-00001-of-00006.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 65536 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cache-type-k&lt;/span&gt; q4_1 &lt;span class="nt"&gt;--cache-type-v&lt;/span&gt; q4_1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At roughly 5 bits per value against 16 for the default f16 cache, &lt;code&gt;q4_1&lt;/code&gt; cuts KV cache memory to about a third, letting you push context further on the same hardware, at a small quality cost on very long inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking mode.&lt;/strong&gt; GLM 5.2 has a reasoning mode that spends tokens thinking before it answers. For quick edits and short prompts it adds latency you may not want. Turn it off by launching &lt;code&gt;llama-server&lt;/code&gt; with &lt;code&gt;--chat-template-kwargs '{"enable_thinking":false}'&lt;/code&gt;, which sets the default for every request, and leave it on for hard multi-step problems where the extra reasoning earns its keep.&lt;/p&gt;
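&lt;p&gt;If you would rather toggle thinking per call than per server launch, one option is to send the same switch in the request body. The sketch below only builds the JSON payload you would POST to &lt;code&gt;http://localhost:8080/v1/chat/completions&lt;/code&gt;. It assumes your &lt;code&gt;llama-server&lt;/code&gt; build accepts a &lt;code&gt;chat_template_kwargs&lt;/code&gt; field on the request, which is worth verifying against your build; the function name and the &lt;code&gt;model&lt;/code&gt; placeholder are illustrative, not from the guide.&lt;/p&gt;

```python
import json


def build_chat_request(prompt, enable_thinking):
    """Build an OpenAI-style chat payload for a local llama-server.

    Assumption: the server accepts a per-request "chat_template_kwargs"
    field. Builds that do not support it may ignore or reject the field.
    """
    return {
        "model": "glm-5.2",  # placeholder label; the server answers with the model it loaded
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


# Quick edit: skip the reasoning pass. Hard problem: keep it on.
quick = build_chat_request("Rename userName to snake_case in this file.", False)
hard = build_chat_request("Plan a migration from callbacks to async/await.", True)

print(json.dumps(quick, indent=2))
```

&lt;p&gt;Keeping the switch in the payload means one running server can serve both fast edits and slow, careful answers without a restart.&lt;/p&gt;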

&lt;h2&gt;
  
  
  When Local Is the Wrong Answer: Hosted and ofox Alternatives
&lt;/h2&gt;

&lt;p&gt;If the 256 GB floor or the single-session speed rules local out, you do not have to give up GLM 5.2 at all. The same model is on the ofox catalog as &lt;a href="https://ofox.ai/models/z-ai/glm-5.2" rel="noopener noreferrer"&gt;&lt;code&gt;z-ai/glm-5.2&lt;/code&gt;&lt;/a&gt;, priced at $1.40/M input and $4.40/M output, so you can run it hosted at full speed by changing only the base URL and model ID, with no rig to buy or babysit. You prototype against your local &lt;code&gt;llama-server&lt;/code&gt; and then point the same client at the hosted model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://api.ofox.ai/v1"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ofox-..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"z-ai/glm-5.2"&lt;/span&gt;   &lt;span class="c"&gt;# the exact same model, now hosted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
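&lt;p&gt;On the client side, the switch can be sketched as a small lookup that defaults to the local server and follows the environment when it is set. The OpenAI Python SDK reads &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; and &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; on its own, but &lt;code&gt;OPENAI_MODEL&lt;/code&gt; is a convention your code has to read itself. The function name and the local model label below are assumptions for illustration.&lt;/p&gt;

```python
import os

LOCAL_BASE_URL = "http://localhost:8080/v1"  # llama-server on the port used in the launch command
HOSTED_BASE_URL = "https://api.ofox.ai/v1"
LOCAL_MODEL = "local-glm-5.2"  # hypothetical label for the locally loaded GGUF


def resolve_target(env):
    """Return (base_url, model) from an environment mapping, defaulting to local."""
    base_url = env.get("OPENAI_BASE_URL", LOCAL_BASE_URL)
    model = env.get("OPENAI_MODEL", LOCAL_MODEL)
    return base_url, model


base_url, model = resolve_target(os.environ)
print(f"sending requests to {base_url} as {model}")
```

&lt;p&gt;Pass the resolved &lt;code&gt;model&lt;/code&gt; into each chat completion call; nothing else in the client needs to change between the local prototype and the hosted run.&lt;/p&gt;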



&lt;p&gt;The &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;hosted access guide&lt;/a&gt; covers the Z.ai Coding Plan route to the same model as well. And if you want a few other open-weights coding models behind that one OpenAI-compatible endpoint, ofox lists these day-one too:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;ofox model ID&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;When to pick over GLM 5.2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;You want a longer community track record and published SWE-bench Verified numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi K2.6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;262K&lt;/td&gt;
&lt;td&gt;You need independently benchmarked long context, not a 16K local ceiling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3 Coder Next&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bailian/qwen3-coder-next&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Multilingual codebases where local speed is too slow to iterate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a price-and-quality read on GLM against a closed model before you commit to either a local rig or a hosted subscription, see the &lt;a href="https://ofox.ai/blog/glm-5-2-vs-gpt-5-5-cost-2026/" rel="noopener noreferrer"&gt;GLM 5.2 vs GPT-5.5 cost comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources Checked for This Refresh
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  HuggingFace official model card, &lt;code&gt;zai-org/GLM-5.2&lt;/code&gt; (753B parameters, MIT license, 1M context), verified 2026-06-23: &lt;a href="https://huggingface.co/zai-org/GLM-5.2" rel="noopener noreferrer"&gt;https://huggingface.co/zai-org/GLM-5.2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Unsloth GGUF community quants and per-quant memory table, verified 2026-06-23: &lt;a href="https://huggingface.co/unsloth/GLM-5.2-GGUF" rel="noopener noreferrer"&gt;https://huggingface.co/unsloth/GLM-5.2-GGUF&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Unsloth GLM 5.2 run guide (quant sizes, sampling defaults, KV-cache flags, Unsloth Studio install): &lt;a href="https://unsloth.ai/docs/models/glm-5.2" rel="noopener noreferrer"&gt;https://unsloth.ai/docs/models/glm-5.2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  llama.cpp project: &lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;https://github.com/ggml-org/llama.cpp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  LM Studio: &lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;https://lmstudio.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Companion ofox guides: &lt;a href="https://ofox.ai/blog/glm-5-2-self-host-vllm-hardware-cost-2026/" rel="noopener noreferrer"&gt;self-host hardware and cost&lt;/a&gt;, &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;hosted access&lt;/a&gt;, &lt;a href="https://ofox.ai/blog/glm-5-2-vs-gpt-5-5-cost-2026/" rel="noopener noreferrer"&gt;GLM 5.2 vs GPT-5.5 cost&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The interesting shift is not that a frontier model runs locally, it is how little it now costs to find out. A 256 GB Mac Studio you already own and an afternoon of downloading is the whole experiment. The next thing to watch is FP4 and tighter dynamic quants: the day a good 4-bit drops under 200 GB, the local floor moves from a 256 GB Mac down to a 128 GB one, and a lot more desks qualify.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/glm-5-2-run-locally-gguf-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>gguf</category>
      <category>llamacpp</category>
    </item>
    <item>
      <title>GLM-5.2 vs GPT-5.5 Cost: Per-Token Math at 10K/100K/1M Req/Day (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Sun, 21 Jun 2026 13:08:21 +0000</pubDate>
      <link>https://dev.to/owen_fox/glm-52-vs-gpt-55-cost-per-token-math-at-10k100k1m-reqday-2026-39a6</link>
      <guid>https://dev.to/owen_fox/glm-52-vs-gpt-55-cost-per-token-math-at-10k100k1m-reqday-2026-39a6</guid>
      <description>&lt;h1&gt;
  
  
  GLM-5.2 vs GPT-5.5 Cost: Per-Token Math at 10K/100K/1M Req/Day (2026)
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: At ofox.io's listed pricing, GLM-5.2 costs &lt;strong&gt;$1.4 input / $4.4 output&lt;/strong&gt; per million tokens; GPT-5.5 sits at &lt;strong&gt;$5 / $30&lt;/strong&gt;. Blended at a 2:1 input-to-output ratio, that is &lt;strong&gt;$2.40 vs $13.33&lt;/strong&gt; per million tokens, a &lt;strong&gt;5.56x cost ratio&lt;/strong&gt;. At 100K requests per day and 3K tokens per request (2K in, 1K out), you spend roughly &lt;strong&gt;$720/day on GLM-5.2 versus $4,000/day on GPT-5.5&lt;/strong&gt;, or about &lt;strong&gt;$21,600 vs $120,000 per month&lt;/strong&gt;. Prompt caching helps both but doesn't close the gap. Both models are on the same OpenAI-compatible endpoint at &lt;a href="https://ofox.io/en" rel="noopener noreferrer"&gt;ofox.io&lt;/a&gt;, so the comparison is a one-line model swap.&lt;/p&gt;

&lt;p&gt;GPT-5.5's per-token cost is 5.56x GLM-5.2's at a typical coding mix — and 6.82x on pure output tokens. The question stopped being whether GLM-5.2 is "good enough"; it became which workload still earns the GPT-5.5 premium.&lt;/p&gt;

&lt;p&gt;If you want to skip the math and just A/B both models on your own workload, &lt;a href="https://ofox.io/en" rel="noopener noreferrer"&gt;ofox.io&lt;/a&gt; hosts both &lt;code&gt;z-ai/glm-5.2&lt;/code&gt; and &lt;code&gt;openai/gpt-5.5&lt;/code&gt; on the same key — pay-as-you-go, no monthly fee, and the same SDK shape as the OpenAI Python client. The full math below uses ofox's listed per-token rates verified June 21, 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: Which One Should You Pick?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Pick&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost-sensitive batch coding agents&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.56x cheaper at 2:1 mix, same 1M context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context refactor jobs (&amp;gt;500K input)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same 1M context and 128K output cap; 3.57x cheaper input dominates input-heavy jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output-heavy code generation pipelines&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.82x cheaper per output token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI / Terminal-Bench-heavy agentic workflows&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integration depth and 82.7% Terminal-Bench 2.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency-sensitive interactive pair programming&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tuned for first-token speed on short prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure-backed procurement / Microsoft compliance shop&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ofox's GPT-5.5 line is Azure-backed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Air-gapped or fork-required deployment&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GLM-5.2 self-host&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MIT weights on Hugging Face&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest verdict for most 2026 coding teams: route the cost-sensitive default traffic to &lt;code&gt;z-ai/glm-5.2&lt;/code&gt;, keep &lt;code&gt;openai/gpt-5.5&lt;/code&gt; on the Codex CLI / interactive surface, and escalate the hardest 10% to Claude. The two-model split below covers the realistic 80% of your traffic without a vendor migration.&lt;/p&gt;
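&lt;p&gt;The scenario table above can be read as a lookup from task type to model ID. Here is a minimal sketch of that lookup; the task labels are hypothetical names invented for illustration, and the Claude escalation tier is left out because no model ID for it is given here.&lt;/p&gt;

```python
GLM = "z-ai/glm-5.2"
GPT = "openai/gpt-5.5"

# Hypothetical task labels mapped to the picks in the scenario table.
ROUTES = {
    "batch_agent": GLM,
    "long_context_refactor": GLM,
    "output_heavy_codegen": GLM,
    "codex_cli": GPT,
    "interactive_pairing": GPT,
    "azure_procurement": GPT,
}


def pick_model(task_kind):
    """Return the model ID for a task label, defaulting to the cheaper model."""
    return ROUTES.get(task_kind, GLM)


for kind in ("batch_agent", "codex_cli", "unlabelled"):
    print(kind, "->", pick_model(kind))
```

&lt;p&gt;Defaulting unknown work to GLM-5.2 matches the verdict that cost-sensitive traffic is the default and GPT-5.5 is the exception you opt into.&lt;/p&gt;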

&lt;h2&gt;
  
  
  What Each Model Ships on ofox
&lt;/h2&gt;

&lt;p&gt;Both models live on &lt;a href="https://ofox.io/en/docs/api" rel="noopener noreferrer"&gt;api.ofox.io/v1&lt;/a&gt; under the OpenAI-compatible protocol, and on the Anthropic-protocol endpoint for Claude Code drop-in use. The boring numbers, verified against the ofox model catalog on June 21, 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;GLM-5.2&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Listed on ofox&lt;/td&gt;
&lt;td&gt;June 16, 2026&lt;/td&gt;
&lt;td&gt;April 24, 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ofox model ID&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z-ai/glm-5.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;openai/gpt-5.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detail page&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ofox.io/en/models/z-ai/glm-5.2" rel="noopener noreferrer"&gt;ofox.io/en/models/z-ai/glm-5.2&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ofox.io/en/models/openai/gpt-5.5" rel="noopener noreferrer"&gt;ofox.io/en/models/openai/gpt-5.5&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input price&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.4 / M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$5.00 / M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output price&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$4.4 / M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$30.00 / M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read price&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.26 / M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.50 / M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web search add-on&lt;/td&gt;
&lt;td&gt;$0.01 / request&lt;/td&gt;
&lt;td&gt;$0.01 / request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;1,000,000 tokens&lt;/td&gt;
&lt;td&gt;1,000,000 tokens (922K in / 128K out)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maximum output&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider backing&lt;/td&gt;
&lt;td&gt;Z.ai (Zhipu)&lt;/td&gt;
&lt;td&gt;Azure (OpenAI via Microsoft)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weights&lt;/td&gt;
&lt;td&gt;Open (MIT, Hugging Face zai-org)&lt;/td&gt;
&lt;td&gt;Closed (API only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things to call out from the spec sheet. First, &lt;strong&gt;the context windows and output ceilings are effectively identical&lt;/strong&gt; — both list a 1M context and a 128K max-output cap, so neither model lets you emit a larger single-call patch than the other; on long refactor jobs the deciding factor is per-token cost, not output capacity. Second, &lt;strong&gt;GPT-5.5 on ofox is Azure-backed&lt;/strong&gt;. That is the procurement story for shops already inside the Microsoft compliance perimeter; it does not change the listed rate card visible to most accounts but it does mean the upstream is Microsoft, not OpenAI direct.&lt;/p&gt;

&lt;p&gt;For the full GLM-5.2 access path — pricing tiers, MIT weights timeline, Z.ai's own Coding Plan — see our &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;GLM-5.2 access guide&lt;/a&gt;. For the GPT-5.5 coding benchmark picture against the other 2026 frontier models, see the &lt;a href="https://ofox.ai/blog/minimax-m3-vs-gpt-5-5-coding-benchmark-2026/" rel="noopener noreferrer"&gt;MiniMax M3 vs GPT-5.5 SWE-Bench breakdown&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Per-Token Math: Three Workload Scenarios
&lt;/h2&gt;

&lt;p&gt;Sticker pricing is straightforward. The interesting number is what the invoice looks like at your actual scale. We use three scenarios across the realistic volume range that teams hit in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assumption block (held constant across all three):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  3,000 tokens per request, split 2:1 input to output (2K in, 1K out)&lt;/li&gt;
&lt;li&gt;  30 days per month&lt;/li&gt;
&lt;li&gt;  No cache hits in the headline number (we add cache impact in the next section)&lt;/li&gt;
&lt;li&gt;  Web search add-on excluded&lt;/li&gt;
&lt;/ul&gt;
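&lt;p&gt;The assumption block above reduces to one formula: daily cost is requests times tokens per request, divided by one million, times the per-million rate, summed over input and output. A short sketch that applies it at all three volume tiers, using the listed rates (the function and variable names are illustrative):&lt;/p&gt;

```python
# USD per million tokens, taken from the spec table.
PRICES = {
    "z-ai/glm-5.2": {"input": 1.40, "output": 4.40},
    "openai/gpt-5.5": {"input": 5.00, "output": 30.00},
}

DAYS_PER_MONTH = 30


def daily_cost(model, requests_per_day, input_tokens=2_000, output_tokens=1_000):
    """Daily spend in USD with no cache hits and no web search add-on."""
    price = PRICES[model]
    input_millions = requests_per_day * input_tokens / 1_000_000
    output_millions = requests_per_day * output_tokens / 1_000_000
    return input_millions * price["input"] + output_millions * price["output"]


for requests in (10_000, 100_000, 1_000_000):
    for model in PRICES:
        day = daily_cost(model, requests)
        month = day * DAYS_PER_MONTH
        print(f"{requests:9,} req/day  {model:15}  ${day:,.0f}/day  ${month:,.0f}/month")
```

&lt;p&gt;Swap in your own &lt;code&gt;input_tokens&lt;/code&gt; and &lt;code&gt;output_tokens&lt;/code&gt; to see how far your workload sits from the 2K/1K assumption.&lt;/p&gt;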

&lt;h3&gt;
  
  
  Light: 10K requests per day
&lt;/h3&gt;

&lt;p&gt;Roughly the shape of a small team running a single coding agent at moderate intensity, or a side project at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Daily input tokens: 10K × 2K = 20M&lt;/li&gt;
&lt;li&gt;  Daily output tokens: 10K × 1K = 10M&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input cost / day&lt;/th&gt;
&lt;th&gt;Output cost / day&lt;/th&gt;
&lt;th&gt;Total / day&lt;/th&gt;
&lt;th&gt;Total / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;20M × $1.4 = &lt;strong&gt;$28&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;10M × $4.4 = &lt;strong&gt;$44&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$72&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2,160&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;20M × $5.0 = &lt;strong&gt;$100&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;10M × $30 = &lt;strong&gt;$300&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$12,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difference&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$328/day&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$9,840/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Mid: 100K requests per day
&lt;/h3&gt;

&lt;p&gt;The shape of a 10-engineer team running coding agents full time, or a product feature that exposes the model to end-users at moderate concurrency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Daily input tokens: 100K × 2K = 200M&lt;/li&gt;
&lt;li&gt;  Daily output tokens: 100K × 1K = 100M&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input cost / day&lt;/th&gt;
&lt;th&gt;Output cost / day&lt;/th&gt;
&lt;th&gt;Total / day&lt;/th&gt;
&lt;th&gt;Total / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;200M × $1.4 = &lt;strong&gt;$280&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;100M × $4.4 = &lt;strong&gt;$440&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$720&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$21,600&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;200M × $5.0 = &lt;strong&gt;$1,000&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;100M × $30 = &lt;strong&gt;$3,000&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$4,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$120,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difference&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$3,280/day&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$98,400/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Heavy: 1M requests per day
&lt;/h3&gt;

&lt;p&gt;The shape of a production agent fleet, a developer-tooling SaaS at scale, or an internal platform exposed to a four-figure-engineer org.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Daily input tokens: 1M × 2K = 2B&lt;/li&gt;
&lt;li&gt;  Daily output tokens: 1M × 1K = 1B&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input cost / day&lt;/th&gt;
&lt;th&gt;Output cost / day&lt;/th&gt;
&lt;th&gt;Total / day&lt;/th&gt;
&lt;th&gt;Total / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;2B × $1.4 = &lt;strong&gt;$2,800&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;1B × $4.4 = &lt;strong&gt;$4,400&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$7,200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$216,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;2B × $5.0 = &lt;strong&gt;$10,000&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;1B × $30 = &lt;strong&gt;$30,000&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$40,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$1,200,000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difference&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;$32,800/day&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$984,000/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;5.56x ratio holds at every volume tier&lt;/strong&gt; — only the absolute spend scales. At light volume that is a useful saving; at mid volume it pays for two senior engineers per month; at heavy volume it is the difference between a feature shipping and a feature being killed for unit-economics reasons.&lt;/p&gt;

&lt;p&gt;These tables hold for the standard 2:1 input-to-output mix. The ratio drifts with workload shape: at 1:1 (chat-style turns) the cost ratio is 6.03x; at 1:3 output-heavy (code generation from a short prompt) the ratio is 6.51x; at 3:1 input-heavy (long-context summarization) the ratio narrows to 5.23x because GLM-5.2's per-input-token discount (3.57x cheaper input) is smaller than its per-output-token discount (6.82x cheaper output). Output-dominated workloads tilt further toward GLM-5.2; input-dominated workloads tilt less hard but still favor GLM at every realistic mix.&lt;/p&gt;
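&lt;p&gt;To check the drift for your own mix, weight each model's input and output rates by the mix and divide. A minimal sketch using the listed rates (the helper names are illustrative):&lt;/p&gt;

```python
GLM = (1.40, 4.40)    # (input, output) in USD per million tokens
GPT = (5.00, 30.00)


def blended(input_price, output_price, input_parts, output_parts):
    """Weighted average price per million tokens for a given input:output mix."""
    total_parts = input_parts + output_parts
    return (input_price * input_parts + output_price * output_parts) / total_parts


def cost_ratio(input_parts, output_parts):
    """How many times more GPT-5.5 costs than GLM-5.2 at this mix."""
    return blended(*GPT, input_parts, output_parts) / blended(*GLM, input_parts, output_parts)


for label, mix in [("3:1", (3, 1)), ("2:1", (2, 1)), ("1:1", (1, 1)), ("1:3", (1, 3))]:
    glm = blended(*GLM, *mix)
    gpt = blended(*GPT, *mix)
    print(f"{label}  GLM ${glm:.2f}/M  GPT ${gpt:.2f}/M  ratio {cost_ratio(*mix):.2f}x")
```

&lt;p&gt;The ratio is bounded by the two per-token ratios: it approaches 3.57x as the mix becomes all input and 6.82x as it becomes all output.&lt;/p&gt;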

&lt;h2&gt;
  
  
  Cache Impact: How Far Does Prompt Caching Close the Gap?
&lt;/h2&gt;

&lt;p&gt;Both models bill cache reads below the full input rate: GLM-5.2 at $0.26/M (an 81% input discount), GPT-5.5 at $0.50/M (a 90% input discount). Cache hit rates above 50% are realistic for code-review workloads where the codebase context repeats across requests. Here is what 50% input cache hit does to the blended cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At 50% input cache hit (half of input tokens served from cache, output unchanged):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Uncached input ($/M)&lt;/th&gt;
&lt;th&gt;Cached input ($/M)&lt;/th&gt;
&lt;th&gt;Effective input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;th&gt;Blended ($/M) at 2:1&lt;/th&gt;
&lt;th&gt;Drop vs no cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;$1.40&lt;/td&gt;
&lt;td&gt;$0.26&lt;/td&gt;
&lt;td&gt;$0.83&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2.02&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−15.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$2.75&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$11.83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−11.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;At 100% input cache hit (every input token cached):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M, all cached)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;th&gt;Blended ($/M) at 2:1&lt;/th&gt;
&lt;th&gt;Drop vs no cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GLM-5.2&lt;/td&gt;
&lt;td&gt;$0.26&lt;/td&gt;
&lt;td&gt;$4.40&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.64&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−31.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$10.33&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;−22.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two reads on this. &lt;strong&gt;First, cache saves more absolute dollars on GPT-5.5 per cached token&lt;/strong&gt; — you avoid $4.50 per cached million on GPT-5.5 versus $1.14 on GLM-5.2. If your CFO scores the cache program by raw dollars saved, GPT-5.5 wins. &lt;strong&gt;Second, cache saves a larger share of GLM-5.2's total bill&lt;/strong&gt; — because input is a bigger fraction of GLM-5.2's blended cost, cutting input costs has a bigger proportional effect. At 100% input cache hit, GLM drops 31.7% of its blended bill; GPT-5.5 drops 22.5%.&lt;/p&gt;

&lt;p&gt;The net result is that &lt;strong&gt;GLM-5.2 stays cheaper at every cache hit rate&lt;/strong&gt;. The cost ratio actually widens slightly as cache hit rate climbs, from 5.56x without cache to 5.86x at 50% input cache hit to 6.30x at 100% input cache hit. That sounds counterintuitive, but the math is straightforward: cache eats a larger share of GLM-5.2's blended bill than of GPT-5.5's, so GLM's bill shrinks faster in percentage terms. Prompt caching discounts input only; it does not change the GPT-5.5 output rate, and output is where the absolute dollar gap lives.&lt;/p&gt;
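&lt;p&gt;The cache tables follow from one substitution: replace the input rate with a hit-rate-weighted average of the cached and uncached rates, then blend as before. A sketch under the same 2:1 mix, using the listed rates (function and variable names are illustrative):&lt;/p&gt;

```python
# (input, cache read, output) in USD per million tokens, from the spec table.
MODELS = {
    "GLM-5.2": (1.40, 0.26, 4.40),
    "GPT-5.5": (5.00, 0.50, 30.00),
}


def blended_with_cache(input_price, cache_price, output_price, hit_rate,
                       input_parts=2, output_parts=1):
    """Blended USD per million tokens when hit_rate of input tokens are cache reads."""
    effective_input = hit_rate * cache_price + (1 - hit_rate) * input_price
    weighted = effective_input * input_parts + output_price * output_parts
    return weighted / (input_parts + output_parts)


for hit_rate in (0.0, 0.5, 1.0):
    glm = blended_with_cache(*MODELS["GLM-5.2"], hit_rate)
    gpt = blended_with_cache(*MODELS["GPT-5.5"], hit_rate)
    print(f"hit rate {hit_rate:.0%}: GLM ${glm:.2f}/M, GPT ${gpt:.2f}/M, ratio {gpt / glm:.2f}x")
```

&lt;p&gt;Plug in your measured hit rate rather than 50% or 100%; real workloads usually land somewhere between the two tables.&lt;/p&gt;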

&lt;h2&gt;
  
  
  When GLM-5.2 Wins (and When the Benchmark Gap Is Acceptable)
&lt;/h2&gt;

&lt;p&gt;Five workloads where GLM-5.2 is the obviously correct routing decision:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Batch code review and async refactor sweeps.&lt;/strong&gt; Overnight dependency upgrades, doc generation, batched lint fixes — work where total token spend dominates and individual-request latency does not matter. The 5.56x cost gap compounds across thousands of requests per night.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Long-context refactor jobs.&lt;/strong&gt; GLM-5.2's 1M context lets you submit an entire mid-sized module in one prompt. Its 128K output cap is identical to GPT-5.5's, so very large rewrites still chunk on both models. GLM-5.2 emits the same patches at 6.82x lower cost per output token, and its input is 3.57x cheaper, which dominates on input-heavy refactor passes.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Output-heavy code generation pipelines.&lt;/strong&gt; Per-output-token cost is the differentiator at 6.82x. If your agent emits more code than it reads (test generation, scaffolding, codemod application), GLM-5.2 disproportionately wins.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;High-cache-hit workloads.&lt;/strong&gt; Code-review agents reusing the same codebase context and RAG pipelines with stable corpora both fit here: GLM-5.2's cache read at $0.26/M is about half of GPT-5.5's $0.50/M, and the proportional cache benefit on GLM is larger.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Open-weight insurance.&lt;/strong&gt; MIT-licensed weights mean if Z.ai changes hosted pricing or terms, you can fall back to self-hosting on the same model. GPT-5.5 has no on-prem path. Even if you never deploy the weights, the option value is real.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The honest qualifier: &lt;strong&gt;the benchmark gap to GPT-5.5 is real on Terminal-Bench-style agentic work&lt;/strong&gt;. Z.ai had not published SWE-Bench Verified scores at GLM-5.2's launch, and independent third-party benchmark numbers were pending as of mid-June 2026. If your workload depends on the multi-step shell agentic loop that Terminal-Bench measures, GPT-5.5 still leads — for everything else, the cost case is decisive.&lt;/p&gt;

&lt;h2&gt;
  
  
  When GPT-5.5 Still Makes Sense
&lt;/h2&gt;

&lt;p&gt;Three workloads where the 5.56x premium earns its keep:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Codex CLI is your primary surface.&lt;/strong&gt; OpenAI's terminal agent is tuned against GPT-5.5 at the protocol level — file handles, shell history, multi-turn recovery from failed commands. The Terminal-Bench 2.1 score (82.7%) reflects integration depth as much as model capability. Swapping the model behind Codex is not a free move.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Latency-sensitive interactive coding.&lt;/strong&gt; Pair-programming flows where every extra second of first-token latency hurts adoption. GPT-5.5 is tuned for short prompts and fast first-token; on a 5K-token interactive prompt, GPT-5.5 typically wins the latency comparison.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Azure-backed procurement.&lt;/strong&gt; ofox's GPT-5.5 line is Azure-backed, which closes the procurement story without a new vendor review for shops already inside Microsoft compliance. The procurement cost of adding a new model vendor often exceeds the per-token savings for teams below a few hundred thousand tokens per day.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A fourth scenario is a &lt;strong&gt;mixed workload with general reasoning&lt;/strong&gt;: if your coding agent occasionally writes architecture summaries, postmortems, or research briefs, GPT-5.5's general reasoning ceiling is higher than GLM-5.2's. For purely coding workloads, though, the cost case for GLM-5.2 dominates.&lt;/p&gt;

&lt;h2&gt;
  
  
  A/B Routing Pattern via ofox: One Key, One Endpoint, Two Models
&lt;/h2&gt;

&lt;p&gt;Both &lt;code&gt;z-ai/glm-5.2&lt;/code&gt; and &lt;code&gt;openai/gpt-5.5&lt;/code&gt; are live on &lt;code&gt;https://api.ofox.io/v1&lt;/code&gt; under the OpenAI-compatible protocol. The model swap is a single string change. The smallest useful A/B harness:&lt;/p&gt;

&lt;h3&gt;
  
  
  Python — A/B both models in one loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ofox.io/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OFOX_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor this Python function to use async/await and return early on empty list: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;z-ai/glm-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives you raw latency, total token count, and side-by-side output on your own task. Run it across 20-30 representative cases from your real workload — that is the only honest input to a routing decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node — same shape
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.ofox.io/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OFOX_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Refactor this Python function to use async/await and return early on empty list: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;z-ai/glm-5.2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai/gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s, &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; tokens`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production routing — single-line model swap
&lt;/h3&gt;

&lt;p&gt;The same SDK call, the same key, the same billing line. To route the cost-sensitive half of your traffic to GLM-5.2 and keep the interactive half on GPT-5.5:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pick_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batch_refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;z-ai/glm-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;pick_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_type&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No migration, no new key, no separate billing reconciliation. The model column on your invoice tells you what each request cost; the routing function is one place to tune the split. For the broader pattern of routing across the full ofox catalog — including Claude for escalations — see &lt;a href="https://ofox.ai/blog/30-dollar-ai-coding-stack-setup-guide-2026/" rel="noopener noreferrer"&gt;our $30 AI coding stack guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources &amp;amp; Pricing References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://ofox.io/en/models/z-ai/glm-5.2" rel="noopener noreferrer"&gt;ofox.io model catalog: z-ai/glm-5.2&lt;/a&gt; — input $1.4/M, output $4.4/M, cache $0.26/M, 1M context, 128K max output, listed June 16, 2026 (verified June 21, 2026)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://ofox.io/en/models/openai/gpt-5.5" rel="noopener noreferrer"&gt;ofox.io model catalog: openai/gpt-5.5&lt;/a&gt; — input $5/M, output $30/M, cache $0.5/M, 1M context (922K in / 128K out), listed April 24, 2026, Azure-backed (verified June 21, 2026)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://ofox.ai/blog/glm-5-2-access-guide-2026/" rel="noopener noreferrer"&gt;GLM-5.2 access guide&lt;/a&gt; — pricing tiers, MIT weights, Z.ai Coding Plan&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://ofox.ai/blog/minimax-m3-vs-gpt-5-5-coding-benchmark-2026/" rel="noopener noreferrer"&gt;MiniMax M3 vs GPT-5.5 SWE-Bench Pro coding benchmark&lt;/a&gt; — companion benchmark-led comparison&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.vellum.ai/blog/everything-you-need-to-know-about-gpt-5-5" rel="noopener noreferrer"&gt;Vellum — GPT-5.5 reference&lt;/a&gt; — Terminal-Bench 2.1 score 82.7%, output token rate $30/M confirmed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a 5.56x cost ratio that holds across volume tiers and a 6.82x gap on pure output tokens, the routing question is no longer "is GLM-5.2 good enough" — it is "which workload still justifies paying the GPT-5.5 premium," and "Codex CLI shop" is the cleanest honest answer.&lt;/p&gt;
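
&lt;p&gt;Both ratios can be reproduced from the catalog prices listed under Sources. The sketch below assumes the 5.56x figure uses a 2:1 input-to-output token mix. That mix is an assumption, chosen because it reproduces the number, and &lt;code&gt;blended_cost&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Catalog prices in USD per million tokens, as listed under Sources.
PRICES = {
    "z-ai/glm-5.2": {"input": 1.4, "output": 4.4},
    "openai/gpt-5.5": {"input": 5.0, "output": 30.0},
}


def blended_cost(model: str, input_parts: float = 2.0, output_parts: float = 1.0) -&amp;gt; float:
    """Blended USD per million tokens for an input:output mix (2:1 is an assumption)."""
    price = PRICES[model]
    total_parts = input_parts + output_parts
    return (price["input"] * input_parts + price["output"] * output_parts) / total_parts


# Ratio of blended costs, and ratio of output-token prices alone.
blended_ratio = blended_cost("openai/gpt-5.5") / blended_cost("z-ai/glm-5.2")
output_ratio = PRICES["openai/gpt-5.5"]["output"] / PRICES["z-ai/glm-5.2"]["output"]

print(f"blended: {blended_ratio:.2f}x, output-only: {output_ratio:.2f}x")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running it prints a blended ratio of 5.56x and an output-only ratio of 6.82x. Pass your own measured input and output shares to &lt;code&gt;blended_cost&lt;/code&gt; to see how the gap moves for your workload.&lt;/p&gt;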




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/glm-5-2-vs-gpt-5-5-cost-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>glm</category>
      <category>openai</category>
      <category>pricing</category>
    </item>
    <item>
      <title>Self-Host GLM 5.2 in 2026: Hardware, vLLM Setup, and Cost vs Cloud</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Wed, 17 Jun 2026 04:06:26 +0000</pubDate>
      <link>https://dev.to/owen_fox/self-host-glm-52-in-2026-hardware-vllm-setup-and-cost-vs-cloud-2p0f</link>
      <guid>https://dev.to/owen_fox/self-host-glm-52-in-2026-hardware-vllm-setup-and-cost-vs-cloud-2p0f</guid>
      <description>

&lt;p&gt;Zhipu's GLM 5.2 is a milestone for the open-weights community: the MIT-licensed weights are available on HuggingFace, which puts frontier-class coding capability within reach of self-hosted deployments. At 753B parameters, though, the hardware requirements are large enough to evaluate carefully before you commit to running it yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get When You Self-Host GLM 5.2
&lt;/h2&gt;

&lt;p&gt;Out of the box, you can serve the model with vLLM on a single 8-GPU H200 node. The size of the weights depends heavily on format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FP8 quantization requires approximately 750 GB&lt;/li&gt;
&lt;li&gt;BF16 format needs roughly 1.5 TB&lt;/li&gt;
&lt;li&gt;Q4_K_M GGUF weights occupy around 376 GB&lt;/li&gt;
&lt;li&gt;2-bit UD-IQ2_XXS quantization uses approximately 241 GB&lt;/li&gt;
&lt;/ul&gt;

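&lt;p&gt;Production deployments need either 8x H200 GPUs with 141GB each (1,128 GB combined, for FP8) or 4x H100 80GB when using GGUF quantization. Note that 4x H100 provides 320 GB combined, which holds the 241 GB 2-bit quant but not the full 376 GB Q4_K_M file, so Q4_K_M on that setup implies keeping some layers in host memory. For experimentation, a Mac Studio M3 Ultra with 256GB unified memory can run the most aggressive 2-bit quantization at 3–9 tokens per second.&lt;/p&gt;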

&lt;p&gt;Several inference engines support the model on day one: vLLM v0.23.0+, SGLang v0.5.13.post1+, Transformers v5.12+, KTransformers v0.6.1+, llama.cpp for GGUF formats, and xLLM v0.10.0+. The MIT license permits commercial use, modification, and redistribution without restriction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Performance Indicators
&lt;/h2&gt;

&lt;p&gt;"GLM 5.2 trails Opus 4.8 on raw SWE-bench Pro (62.1 vs 69.2) but pulls ahead on Terminal-Bench 2.1's Best Reported Harness run (82.7 vs 78.9) and on agentic-math (AIME 99.2 vs 95.7)."&lt;/p&gt;

&lt;p&gt;Third-party leaderboards show competitive positioning: DesignArena's Web Dev composite ranks GLM 5.2 first overall at Elo 1,360, ahead of Claude Fable 5 (1,350) and Claude Opus variants. The Code Arena Frontend slice places it second at Elo 1,595, behind Claude Fable 5 at 1,654 (which carries a "not currently being sampled" designation).&lt;/p&gt;

&lt;p&gt;The performance gap relative to open-weights competitors is substantial: 30–50 Elo points separate GLM 5.2 from Qwen 3.7 Max, Kimi K2.6, and GLM 5.1 on the composite benchmark and 60+ points on the frontend slice.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Self-Host
&lt;/h2&gt;

&lt;p&gt;Self-hosting becomes economically and operationally sensible in limited scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Valid self-host scenarios include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data residency requirements preventing code or prompts from leaving internal infrastructure&lt;/li&gt;
&lt;li&gt;Custom fine-tuning needs on proprietary codebases without hosted API support&lt;/li&gt;
&lt;li&gt;Air-gapped deployments in restricted-network environments&lt;/li&gt;
&lt;li&gt;High sustained throughput exceeding 3,000 prompts daily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Self-hosting is the wrong choice when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operating as a solo developer or small team (hosted plans cost ~$30–80 monthly)&lt;/li&gt;
&lt;li&gt;No existing vLLM or SGLang deployment in production&lt;/li&gt;
&lt;li&gt;Requiring vendor-published SWE-bench Verified, LiveCodeBench, or Aider polyglot benchmarks&lt;/li&gt;
&lt;li&gt;Peak load remains below 100 prompts daily with no compliance constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not self-host if a hosted plan costs less than one-tenth of the engineering time you would spend managing the infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Available Formats and Sources
&lt;/h2&gt;

&lt;p&gt;The official repository on HuggingFace distributes BF16 and FP8 variants optimized for production inference. The community-maintained Unsloth repository provides GGUF quantizations supporting both llama.cpp and LM Studio. Ollama's current &lt;code&gt;glm-5.2:cloud&lt;/code&gt; tag routes through hosted inference rather than enabling local execution—no quantized local variant exists on the official Ollama library yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Sizing Requirements
&lt;/h2&gt;

&lt;p&gt;KV cache size at long context lengths is the main constraint on hardware selection. For 256K context:&lt;/p&gt;

&lt;ul&gt;
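&lt;li&gt;BF16 (~1.5 TB of weights) is sized for 16x H100 or 8x H200, but those total 1,280 GB and 1,128 GB of VRAM, so both are tight to the point of needing weight offload or additional GPUs&lt;/li&gt;
&lt;li&gt;FP8 (~750 GB) fits 8x H200 (1,128 GB) comfortably; 8x H100 (640 GB) cannot hold the weights without offload, which leaves the KV cache severely constrained&lt;/li&gt;
&lt;li&gt;Q4_K_M GGUF (~376 GB) targets 4x H100 (320 GB) or 2x H200 (282 GB), in both cases with some layers kept in host memory&lt;/li&gt;
&lt;li&gt;2-bit quantized variants (~241 GB) run on high-memory workstations with 256GB+ unified memory&lt;/li&gt;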
&lt;/ul&gt;

&lt;p&gt;Scaling from 256K to 1M context multiplies the KV cache footprint by roughly 4x, which makes FP8 quantization a necessity for production use. Keep 20% VRAM headroom above the combined weight and KV cache requirement to avoid fragmentation-related out-of-memory errors during long-running inference.&lt;/p&gt;
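
&lt;p&gt;The sizing arithmetic can be checked with a few lines of Python. This is a rough sketch, not a capacity planner: it uses the approximate weight sizes quoted earlier in this article, treats the 20% headroom as "total VRAM divided by 1.2", and ignores activations and runtime overhead. The helper name &lt;code&gt;kv_budget_gb&lt;/code&gt; is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Approximate weight sizes in GB, from the format list earlier in this article.
WEIGHTS_GB = {"BF16": 1500, "FP8": 750, "Q4_K_M": 376, "UD-IQ2_XXS": 241}

# (GPU count, memory per GPU in GB) for the configurations discussed here.
CONFIGS = {
    "16x H100": (16, 80),
    "8x H200": (8, 141),
    "8x H100": (8, 80),
    "4x H100": (4, 80),
    "2x H200": (2, 141),
}

HEADROOM = 0.20  # the 20% VRAM headroom recommended in the text


def kv_budget_gb(fmt: str, config: str) -&amp;gt; float:
    """GB left for KV cache after weights and headroom; negative means no fit."""
    count, per_gpu_gb = CONFIGS[config]
    usable_gb = count * per_gpu_gb / (1 + HEADROOM)
    return usable_gb - WEIGHTS_GB[fmt]


PAIRINGS = [
    ("BF16", "16x H100"), ("BF16", "8x H200"),
    ("FP8", "8x H200"), ("FP8", "8x H100"),
    ("Q4_K_M", "4x H100"), ("Q4_K_M", "2x H200"),
]

for fmt, config in PAIRINGS:
    print(f"{fmt:8s} on {config:9s}: {kv_budget_gb(fmt, config):8.1f} GB for KV cache")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A negative result means the weights alone exceed the usable VRAM, so that pairing needs more GPUs, a smaller quant, or part of the model offloaded to host memory. With these figures, FP8 on 8x H200 is the only pairing that leaves a positive KV cache budget (about 190 GB).&lt;/p&gt;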

&lt;h2&gt;
  
  
  vLLM Production Setup
&lt;/h2&gt;

&lt;p&gt;The primary production deployment path follows this sequence:&lt;/p&gt;

&lt;p&gt;First, download the FP8 weights from HuggingFace to local storage (approximately 30–60 minutes on 10 GbE connectivity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;huggingface-cli download zai-org/GLM-5.2-FP8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; /models/glm-5.2-fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir-use-symlinks&lt;/span&gt; False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch the vLLM server with tensor parallelism across all available GPUs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve /models/glm-5.2-fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--served-model-name&lt;/span&gt; zai-org/GLM-5.2-FP8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kv-cache-dtype&lt;/span&gt; fp8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tensor parallelism of 8 shards the model across all eight H200 GPUs. Start the maximum model length at 256K tokens (262144) and raise it only after benchmarking actual KV cache behavior. An FP8 KV cache halves cache memory compared to BF16. Prefix caching reuses computed KV entries for shared prompt prefixes, which is essential for coding agents that resend the same system prompt on every call.&lt;/p&gt;

&lt;p&gt;Verify basic operation with a curl request:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"zai-org/GLM-5.2-FP8","messages":[{"role":"user","content":"Reply OK"}],"max_tokens":16}'&lt;/span&gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the initial compilation overhead has passed, the response should arrive within about one second.&lt;/p&gt;

&lt;h2&gt;
  
  
  SGLang Alternative
&lt;/h2&gt;

&lt;p&gt;SGLang offers superior throughput for workloads featuring heavy prompt reuse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; sglang.launch_server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-path&lt;/span&gt; zai-org/GLM-5.2-FP8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tp&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context-length&lt;/span&gt; 262144 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kv-cache-dtype&lt;/span&gt; fp8_e4m3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-mixed-chunk&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 30000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RadixAttention delivers approximately 3x throughput improvement versus vLLM 0.23 when agents reuse 100K+ tokens of shared system context. Implementation complexity increases slightly but remains manageable for teams with existing SGLang production experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Deployment via llama.cpp
&lt;/h2&gt;

&lt;p&gt;For development, tinkering, or single-node air-gapped scenarios, llama.cpp with Unsloth GGUF quantizations represents the lowest-friction path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;huggingface-cli download unsloth/GLM-5.2-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  GLM-5.2-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; /models/glm-5.2-gguf

cmake &lt;span class="nt"&gt;-B&lt;/span&gt; llama.cpp/build &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; llama.cpp/build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;

./llama.cpp/build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; /models/glm-5.2-gguf/GLM-5.2-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



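&lt;p&gt;On an M3 Ultra Mac Studio with 256GB unified memory, the 2-bit quantization reaches 3–9 tokens per second depending on context. Use the UD-IQ2_XXS file in place of Q4_K_M there, and build llama.cpp without the CUDA flag, since Metal is the default backend on macOS. That throughput is adequate for solo development but not for team-scale use.&lt;/p&gt;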

&lt;h2&gt;
  
  
  Cost Analysis: Self-Hosted vs Hosted
&lt;/h2&gt;

&lt;p&gt;The financial comparison reveals hosted solutions dominate for most organizations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Z.ai Pro Plan&lt;/td&gt;
&lt;td&gt;~$30&lt;/td&gt;
&lt;td&gt;Supports ~2,000 prompts weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Z.ai Max Plan&lt;/td&gt;
&lt;td&gt;~$80&lt;/td&gt;
&lt;td&gt;Supports ~8,000 prompts weekly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud 8x H200 (24/7)&lt;/td&gt;
&lt;td&gt;$21–36k&lt;/td&gt;
&lt;td&gt;$30–50 per hour blended rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud 8x H200 (9–5)&lt;/td&gt;
&lt;td&gt;$6–10k&lt;/td&gt;
&lt;td&gt;200 hours monthly typical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Owned 8x H200&lt;/td&gt;
&lt;td&gt;$3–5k&lt;/td&gt;
&lt;td&gt;~$200k hardware amortized over 4 years&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Owned M3 Ultra&lt;/td&gt;
&lt;td&gt;~$50&lt;/td&gt;
&lt;td&gt;One-time $8k; electricity $30 monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Break-even analysis shows that hosted services win unless you have a hard requirement to self-host:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An M3 Ultra comes out ahead once hosted spend exceeds $30 a month, provided 3–9 tokens/sec is enough&lt;/li&gt;
&lt;li&gt;A cloud H200 node beats the Max Plan only at 3,000+ daily prompts and a 30%+ duty cycle&lt;/li&gt;
&lt;li&gt;Owned H200 economics favor self-hosting above 10,000 daily prompts &lt;strong&gt;and&lt;/strong&gt; existing datacenter capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"Hosted wins for 95% of teams." Self-hosting becomes advantageous only for organizations with compliance constraints, data residency mandates, or sustained throughput exceeding typical team workloads.&lt;/p&gt;
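
&lt;p&gt;The H200 rows of the table follow from simple arithmetic. The sketch below recomputes them under stated assumptions: a 30-day month, straight-line amortization with no power, hosting, or staffing costs, and 52/12 weeks per month for the plan quota. The function names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;HOURS_24_7 = 24 * 30   # hours in a 30-day month (assumption)
HOURS_9_TO_5 = 200     # the table's "200 hours monthly typical"


def cloud_monthly(rate_per_hour: float, hours: float) -&amp;gt; float:
    """Monthly rental cost in USD for a node billed by the hour."""
    return rate_per_hour * hours


def amortized_monthly(hardware_usd: float, years: float) -&amp;gt; float:
    """Straight-line monthly hardware cost; excludes power, hosting, and staff."""
    return hardware_usd / (years * 12)


# Cloud 8x H200 at the table's $30-50 per hour blended rate.
print("Cloud 24/7:", cloud_monthly(30, HOURS_24_7), "to", cloud_monthly(50, HOURS_24_7))
print("Cloud 9-5: ", cloud_monthly(30, HOURS_9_TO_5), "to", cloud_monthly(50, HOURS_9_TO_5))

# Owned 8x H200: ~$200k amortized over 4 years.
print("Owned 8x H200:", round(amortized_monthly(200_000, 4)))

# Max Plan: ~$80 per month for ~8,000 prompts per week.
max_plan_per_prompt = 80 / (8_000 * 52 / 12)
print(f"Max Plan: ${max_plan_per_prompt:.4f} per prompt at full use")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The rental figures come out to $21,600–36,000 for 24/7 use and $6,000–10,000 for 200 hours, and the amortized hardware to about $4,167 a month, all inside the ranges in the table. Dividing any of those by your own monthly prompt volume gives a per-prompt cost to set against the plan figure.&lt;/p&gt;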

&lt;h2&gt;
  
  
  Common Setup Errors and Resolutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CUDA out of memory during model load&lt;/strong&gt; occurs when tensor parallelism is too low or the KV cache budget is too generous. Raise tensor parallelism to match the GPU count, and start with the maximum model length at half the intended value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FP8 operations unsupported&lt;/strong&gt; indicates Ampere-generation hardware (A100), which has no FP8 support. FP8 E4M3 requires the Hopper architecture (H100/H200). A100 users should use Q4_K_M GGUF via llama.cpp instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model has tied_word_embeddings: false warning&lt;/strong&gt; represents harmless vLLM auto-detection noise and remains safe to ignore for GLM 5.2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;504 connection reset on 500K+ token requests&lt;/strong&gt; signals that first-token latency exceeded the default client timeout. Raise the client timeout to 600 seconds and cap vLLM at four concurrent sequences (for example with &lt;code&gt;--max-num-seqs 4&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IndexError in RadixAttention&lt;/strong&gt; indicates SGLang tokenizer cache mismatch. Delete &lt;code&gt;~/.cache/sglang/&lt;/code&gt; completely and restart for cache rebuild on next inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GGUF load failure with missing tensor references&lt;/strong&gt; means the llama.cpp build predates support for the GLM MoE DSA architecture. Update llama.cpp to a build at least as recent as the GGUF's publication date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistent outputs versus Z.ai hosted&lt;/strong&gt; suggests sampling parameter misalignment. Verify temperature (1.0), top_p (0.95), and unset top_k against official generation_config.json from the HuggingFace repository.&lt;/p&gt;
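
&lt;p&gt;One way to check for sampling drift is to diff the parameters your client sends against the repository's &lt;code&gt;generation_config.json&lt;/code&gt;. The sketch below is illustrative: &lt;code&gt;sampling_mismatches&lt;/code&gt; and &lt;code&gt;load_generation_config&lt;/code&gt; are hypothetical helpers, the &lt;code&gt;EXPECTED&lt;/code&gt; values are the ones quoted above rather than read from the real file, and an unset parameter is treated as a mismatch, which is the cautious reading:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Values quoted in the text; the authoritative copy is generation_config.json.
EXPECTED = {"temperature": 1.0, "top_p": 0.95}


def load_generation_config(path: str) -&amp;gt; dict:
    """Read generation_config.json from a local copy of the model repository."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def sampling_mismatches(request_params: dict, generation_config: dict) -&amp;gt; dict:
    """Return {name: (sent, expected)} for each sampling parameter that differs."""
    diffs = {}
    for name in ("temperature", "top_p", "top_k"):
        sent = request_params.get(name)         # None: not set in the request
        expected = generation_config.get(name)  # None: not set in the config
        if sent != expected:
            diffs[name] = (sent, expected)
    return diffs


# Example: a client that lowers temperature and adds top_k.
client_params = {"temperature": 0.7, "top_p": 0.95, "top_k": 40}
print(sampling_mismatches(client_params, EXPECTED))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In practice you would pass the result of &lt;code&gt;load_generation_config&lt;/code&gt; for your downloaded weights instead of the hard-coded &lt;code&gt;EXPECTED&lt;/code&gt; dictionary. An empty result means the three parameters agree.&lt;/p&gt;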

&lt;h2&gt;
  
  
  Observability Requirements
&lt;/h2&gt;

&lt;p&gt;Production deployments require three critical metrics:&lt;/p&gt;

&lt;p&gt;"Track tokens-per-second throughput separately at p50 and p95 percentiles" since individual 900K-context requests drag tail latencies by orders of magnitude.&lt;/p&gt;

&lt;p&gt;Monitor KV cache utilization percentage via vLLM's &lt;code&gt;/metrics&lt;/code&gt; endpoint. Sustained utilization above 90% signals imminent throughput collapse.&lt;/p&gt;

&lt;p&gt;Instrument per-request total token consumption at PR or session level to catch runaway token burning in coding agent loops before budget exhaustion.&lt;/p&gt;

&lt;p&gt;Wire these metrics into existing observability infrastructure (Datadog, Honeycomb, Grafana). SGLang exposes equivalent metrics at &lt;code&gt;/metrics_collect&lt;/code&gt;.&lt;/p&gt;
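
&lt;p&gt;As a starting point before a full observability stack is wired in, the first two metrics can be computed with the standard library alone. This is a sketch under assumptions: the gauge name &lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt; varies between vLLM versions, so check your own &lt;code&gt;/metrics&lt;/code&gt; output, and the sample scrape and throughput numbers are made up for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import statistics
from typing import Optional


def percentile_pair(tokens_per_second: list) -&amp;gt; tuple:
    """Return (p50, p95) for a list of per-request throughput samples."""
    cuts = statistics.quantiles(tokens_per_second, n=100, method="inclusive")
    return cuts[49], cuts[94]  # 99 cut points: index 49 is p50, index 94 is p95


def read_gauge(metrics_text: str, metric_name: str) -&amp;gt; Optional[float]:
    """First sample of a gauge in Prometheus text format (no trailing timestamps)."""
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            return float(value)
    return None


# Hypothetical scrape; the metric name is an assumption, not a guaranteed API.
SAMPLE = '''# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="zai-org/GLM-5.2-FP8"} 0.93
'''

usage = read_gauge(SAMPLE, "vllm:gpu_cache_usage_perc")
if usage is not None and usage &amp;gt; 0.90:
    print(f"KV cache at {usage:.0%}: above the 90% threshold")

# Made-up throughput samples, in tokens per second.
p50, p95 = percentile_pair([10, 20, 30, 40, 50])
print(f"p50={p50} tok/s, p95={p95} tok/s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real deployment the text passed to &lt;code&gt;read_gauge&lt;/code&gt; would come from an HTTP GET against the server's metrics endpoint, and the throughput samples from your request logs.&lt;/p&gt;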

&lt;h2&gt;
  
  
  Managed Hosting Alternatives
&lt;/h2&gt;

&lt;p&gt;When the self-hosting math does not work out but you still prefer Chinese-origin coding models, several hosted alternatives expose OpenAI-compatible APIs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt; (&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;) offers 1M context and published SWE-bench Verified results, a figure missing from GLM 5.2's public table, which reports only SWE-bench Pro.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kimi K2.6&lt;/strong&gt; (&lt;code&gt;moonshotai/kimi-k2.6&lt;/code&gt;) offers a 262K context window that has been independently benchmarked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen 3 Coder Next&lt;/strong&gt; (&lt;code&gt;bailian/qwen3-coder-next&lt;/code&gt;) addresses multilingual codebases with Chinese, Japanese, and Korean language support.&lt;/p&gt;

&lt;p&gt;These models share identical API wiring—only base URL and model identifier change. GLM 5.2 remains unlisted on the ofox catalog as of June 17, 2026, though eventual availability would require only a single string modification in client configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Considerations
&lt;/h2&gt;

&lt;p&gt;The biggest open question is whether the community ships FP4 quantizations within the next 90 days. If FP4 variants prove viable, production deployments could shrink from 8x H200 to 4x H100 hardware, which would fundamentally change self-host economics for the 5% of organizations that can currently justify the infrastructure investment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/glm-5-2-self-host-vllm-hardware-cost-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>glm</category>
      <category>vllm</category>
      <category>selfhost</category>
    </item>
    <item>
      <title>Codex Weekly Limit Drained: 7 Fixes, Why It Burns, and a Drop-in API (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Tue, 16 Jun 2026 10:28:25 +0000</pubDate>
      <link>https://dev.to/owen_fox/codex-weekly-limit-drained-7-fixes-why-it-burns-and-a-drop-in-api-2026-3j6i</link>
      <guid>https://dev.to/owen_fox/codex-weekly-limit-drained-7-fixes-why-it-burns-and-a-drop-in-api-2026-3j6i</guid>
      <description>&lt;h1&gt;
  
  
  Codex Weekly Limit Drained: 7 Fixes, Why It Burns, and a Drop-in API (2026)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Your Codex weekly limit can drain in three hours on a $20 Plus plan if you trigger a single long refactor, and the only user-controlled reset OpenAI has shipped is the one-time banked reset introduced on June 11, 2026.&lt;/p&gt;

&lt;p&gt;When the limit hits, you have two paths: wait for the rolling reset (Plus rolls every 5 hours, weekly caps reset weekly) or switch Codex CLI's &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to a pay-per-token API. This guide gives you a 30-second diagnostic, the exact tier math for Plus and Pro, and seven concrete fixes ranked from "you can still use ChatGPT today" to "drop Codex onto a metered API in 60 seconds."&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Codex Down or Did Your Weekly Limit Drain? The 30-Second Diagnosis
&lt;/h2&gt;

&lt;p&gt;Before you start switching providers, confirm which window blew up. Run this inside any open Codex CLI session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;/status
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output shows three numbers: the percentage remaining for the context window, the 5-hour window, and the weekly window. Match them against this table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Diagnosis&lt;/th&gt;
&lt;th&gt;Context window&lt;/th&gt;
&lt;th&gt;5-hour window&lt;/th&gt;
&lt;th&gt;Weekly window&lt;/th&gt;
&lt;th&gt;What to do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weekly cap hit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;&amp;gt;10%&lt;/td&gt;
&lt;td&gt;&amp;lt;5%&lt;/td&gt;
&lt;td&gt;Plan-level — switch tier or move to API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5-hour cap hit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;&amp;lt;5%&lt;/td&gt;
&lt;td&gt;&amp;gt;20%&lt;/td&gt;
&lt;td&gt;Wait for the next 5-hour rollover (max 5h)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context overflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;5%&lt;/td&gt;
&lt;td&gt;&amp;gt;20%&lt;/td&gt;
&lt;td&gt;&amp;gt;20%&lt;/td&gt;
&lt;td&gt;Compact the conversation or split the task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Phantom limit&lt;/strong&gt; (issue #19215)&lt;/td&gt;
&lt;td&gt;&amp;gt;40%&lt;/td&gt;
&lt;td&gt;&amp;gt;50%&lt;/td&gt;
&lt;td&gt;&amp;gt;50%&lt;/td&gt;
&lt;td&gt;Restart Codex CLI; if persists, switch base_url&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;/status&lt;/code&gt; shows healthy capacity but you still see &lt;code&gt;You've hit your usage limit. To get more access now, send a request to your admin or try again at 3:51 PM&lt;/code&gt;, you are hitting the &lt;a href="https://github.com/openai/codex/issues/19215" rel="noopener noreferrer"&gt;#19215 phantom limit bug&lt;/a&gt; (reported April 23, 2026 on Codex CLI v0.124.0). Skip to fix #4.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Codex error wording → cause → fix
&lt;/h3&gt;

&lt;p&gt;The exact string Codex prints is the fastest way to triage. Map yours to the right column:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error wording (exact)&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix to apply&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;You've hit your usage limit. ... try again at HH:MM PM&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Genuine 5-hour or weekly cap&lt;/td&gt;
&lt;td&gt;Fix #1 (wait) or fix #5/#6 (API path)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;You've hit your usage limit&lt;/code&gt; but &lt;code&gt;/status&lt;/code&gt; shows &amp;gt;50% remaining&lt;/td&gt;
&lt;td&gt;Phantom-limit bug (#19215) on CLI v0.124.0&lt;/td&gt;
&lt;td&gt;Fix #4 (clear sessions cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;request a free reset&lt;/code&gt; button visible in CLI prompt&lt;/td&gt;
&lt;td&gt;Banked reset still available (post June 11)&lt;/td&gt;
&lt;td&gt;Fix #2 (claim the free reset)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;429 Too Many Requests&lt;/code&gt; HTTP error in API mode&lt;/td&gt;
&lt;td&gt;Direct API spend-burst throttle&lt;/td&gt;
&lt;td&gt;Backoff, lower concurrency, or fix #6 (per-key budget)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;insufficient_quota&lt;/code&gt; HTTP error in API mode&lt;/td&gt;
&lt;td&gt;OpenAI account spend limit hit&lt;/td&gt;
&lt;td&gt;Raise the dashboard spend cap or switch base_url&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;You've reached your weekly limit&lt;/code&gt; without reset offer&lt;/td&gt;
&lt;td&gt;Banked reset already redeemed this launch&lt;/td&gt;
&lt;td&gt;Fix #3 (tier upgrade) or fix #5/#6 (API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context_length_exceeded&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Conversation too long, not a tier cap&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;codex /compact&lt;/code&gt; or start a fresh session&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When to Fix the Limit (and When to Switch to a Drop-in API Instead)
&lt;/h2&gt;

&lt;p&gt;Not every drained limit deserves a workaround. Match your situation to the rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to fix the limit on your existing plan.&lt;/strong&gt; You hit the cap once this week. Your &lt;code&gt;/status&lt;/code&gt; confirms a normal cap (not the phantom bug). The task you were running is wrap-up rather than a fresh multi-hour agent loop. The June 11 reset banking grants each Plus and Pro user one free reset per launch event, so you may already be sitting on one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to switch to a token-billed API.&lt;/strong&gt; You drain weekly limits more than twice a month. You run agent workflows that touch 300k+ input tokens in a single session. You need predictable per-task cost (e.g. quoting client work). You hit limits inside CI or shared dev environments where tiered ChatGPT auth doesn't make sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop rule.&lt;/strong&gt; If your &lt;code&gt;/status&lt;/code&gt; shows fresh remaining capacity and you only hit &lt;code&gt;usage limit&lt;/code&gt; once today, do nothing. The rolling 5-hour window will reset, and switching providers for a one-off is over-engineering. Close this tab and write code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Codex Rate Limits: Plans, Windows, and What Each Plan Actually Gives You
&lt;/h2&gt;

&lt;p&gt;Codex CLI inherits the limits of whichever auth path you use. Three subscription tiers and the API path each have different ceilings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subscription-tier 5-hour message caps
&lt;/h3&gt;

&lt;p&gt;ChatGPT subscription tiers gate Codex usage by &lt;em&gt;messages&lt;/em&gt; in a rolling 5-hour window, not tokens. The published bands are wide because each "message" is weighted by model and task complexity.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;th&gt;GPT-5.5 msgs / 5h&lt;/th&gt;
&lt;th&gt;GPT-5.4 msgs / 5h&lt;/th&gt;
&lt;th&gt;GPT-5.3-Codex msgs / 5h&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;n/a (Plus minimum for Codex)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plus&lt;/td&gt;
&lt;td&gt;$20&lt;/td&gt;
&lt;td&gt;15-80&lt;/td&gt;
&lt;td&gt;20-100&lt;/td&gt;
&lt;td&gt;45-225&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro 5x&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;td&gt;75-400&lt;/td&gt;
&lt;td&gt;100-500&lt;/td&gt;
&lt;td&gt;225-1,125&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro 20x&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;300-1,600&lt;/td&gt;
&lt;td&gt;400-2,000&lt;/td&gt;
&lt;td&gt;900-4,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business&lt;/td&gt;
&lt;td&gt;per-seat&lt;/td&gt;
&lt;td&gt;varies by seat allocation&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://help.openai.com/en/articles/20001106-codex-rate-card" rel="noopener noreferrer"&gt;OpenAI Codex rate card&lt;/a&gt; and &lt;a href="https://developers.openai.com/codex/pricing" rel="noopener noreferrer"&gt;Codex pricing reference&lt;/a&gt;, figures verified June 15, 2026. Plus (15-80) and Pro 20x (300-1,600) corroborate against independent community reports; Pro 5x's 75-400 band is the rate-card figure only and has not been cross-verified by a third party at the time of writing.&lt;/p&gt;

&lt;p&gt;The wide bands (e.g. "15-80") reflect that OpenAI dynamically adjusts caps based on aggregate load and individual usage patterns. The low end of the band is what you actually get on a busy day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekly cap mechanics
&lt;/h3&gt;

&lt;p&gt;A separate weekly cap stacks on top of the 5-hour window. The weekly cap resets on a rolling 7-day window that starts with your first message. The two windows are tracked independently, and this is where the math breaks.&lt;/p&gt;

&lt;p&gt;A single multi-file agent loop on GPT-5.5 (input-heavy: ~250k input tokens, ~25k output tokens) burns roughly the equivalent of 30-40 messages in 5-hour accounting, while the weekly window is charged by the credits the run consumes. Plus users running a handful of heavy refactors in a row can drain the weekly cap while still showing 30%+ on the 5-hour gauge.&lt;/p&gt;

&lt;h3&gt;
  
  
  API-key path (no message cap)
&lt;/h3&gt;

&lt;p&gt;If your &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; starts with &lt;code&gt;sk-&lt;/code&gt; and is a normal API key, Codex CLI bypasses subscription tier caps entirely. You pay per token at the API rate card. There is no weekly message cap — only the spend limit you set on the dashboard. This is the cleanest fix for hitting weekly limits, and we cover the math below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a single agent run can drain a weekly cap
&lt;/h3&gt;

&lt;p&gt;The unintuitive part of Codex weekly limits is that they don't scale linearly with the message count you see in &lt;code&gt;/status&lt;/code&gt;. OpenAI's &lt;a href="https://developers.openai.com/codex/pricing" rel="noopener noreferrer"&gt;Codex pricing reference&lt;/a&gt; documents the underlying credit math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  GPT-5.5 burns 125 credits per 1M input tokens, 12.50 cached, and 750 per 1M output&lt;/li&gt;
&lt;li&gt;  GPT-5.4 burns 62.50 / 6.25 / 375&lt;/li&gt;
&lt;li&gt;  GPT-5.4 mini burns 18.75 / 1.875 / 113&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single agent loop on GPT-5.5 that reads 30 files (~250k input tokens) and produces a 25k-token plan/diff burns roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input:  250,000 × 125/1M = 31.25 credits
output:  25,000 × 750/1M = 18.75 credits
total:                     ≈ 50 credits per run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Plus plan's weekly budget on GPT-5.5 averages around 250-300 credits, depending on aggregate platform load. Five or six of those agent loops drain the whole week, and the &lt;code&gt;/status&lt;/code&gt; 5-hour gauge can still report 40%+ remaining because the 5-hour window is tracked in &lt;em&gt;messages&lt;/em&gt;, not credits. The result resembles the contradiction reported in &lt;a href="https://github.com/openai/codex/issues/19215" rel="noopener noreferrer"&gt;issue #19215&lt;/a&gt;: a healthy 5-hour gauge alongside an exhausted weekly window.&lt;/p&gt;
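
&lt;p&gt;The arithmetic above is easy to rerun for your own workloads. The sketch below uses the credit rates quoted in the list above. &lt;code&gt;credits_for_run&lt;/code&gt; and &lt;code&gt;runs_until_drained&lt;/code&gt; are hypothetical helpers written for illustration, not part of any OpenAI tooling, and the 250-300 credit budget is the estimate from this article rather than a published figure.&lt;/p&gt;

```python
# Credits per 1M tokens, as quoted above: (input, cached input, output).
CREDIT_RATES = {
    "gpt-5.5": (125.0, 12.50, 750.0),
    "gpt-5.4": (62.50, 6.25, 375.0),
    "gpt-5.4-mini": (18.75, 1.875, 113.0),
}


def credits_for_run(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the credits one agent run burns on the given model."""
    rate_in, rate_cached, rate_out = CREDIT_RATES[model]
    fresh_tokens = input_tokens - cached_tokens  # cached input is billed at the lower rate
    total = (
        fresh_tokens * rate_in
        + cached_tokens * rate_cached
        + output_tokens * rate_out
    )
    return total / 1_000_000


def runs_until_drained(weekly_budget, credits_per_run):
    """Number of whole runs that fit inside a weekly credit budget."""
    return int(weekly_budget // credits_per_run)


run = credits_for_run("gpt-5.5", 250_000, 25_000)
print(run)  # 50.0
print(runs_until_drained(250, run), runs_until_drained(300, run))  # 5 6
```

&lt;p&gt;Swapping the model key to &lt;code&gt;gpt-5.4-mini&lt;/code&gt; for the same token counts shows how much further a cheaper model stretches the same weekly budget.&lt;/p&gt;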

&lt;p&gt;The fix is structural: route heavy agent loops through an API key (no credit accounting, just dollar accounting) and keep your subscription for interactive chat.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Fix "You've Hit Your Usage Limit" (Solutions for Every Tier)
&lt;/h2&gt;

&lt;p&gt;Below are seven fixes in order from least to most disruptive. Pick the first one that matches your situation and stop reading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix #1 — Wait for the 5-hour rollover (Plus, light usage)
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;/status&lt;/code&gt; shows the 5-hour window at &amp;lt;5% remaining but the weekly window above 30%, you only need to wait. The exact rollover timestamp is shown in &lt;code&gt;/status&lt;/code&gt; — typically within 3-5 hours of your last burst.&lt;/p&gt;

&lt;p&gt;While you wait, the productive move is to switch your IDE to manual coding mode or use a non-Codex tool that doesn't share the cap (any IDE-side completion).&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix #2 — Claim your free reset (June 11, 2026 changelog)
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.openai.com/codex/changelog" rel="noopener noreferrer"&gt;Codex changelog entry for June 11, 2026&lt;/a&gt; introduced "rate-limit reset banking" with one free reset granted at launch for Plus and Pro users. If you haven't redeemed yours, the option appears in the Codex CLI when you hit the cap, or via the &lt;code&gt;codex /reset&lt;/code&gt; command if you are on CLI ≥ v0.135.&lt;/p&gt;

&lt;p&gt;This is a one-time fix per launch event. OpenAI also ships team-wide resets from time to time (the May 15, 2026 reset announced by Tibo on X was one), and those do not consume a banked reset, so check the changelog for announcements before burning your single banked reset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix #3 — Upgrade the plan tier
&lt;/h3&gt;

&lt;p&gt;If you are draining the Plus weekly cap every week, Pro 5x ($100) gives 5× the message budget and Pro 20x ($200) gives 20×. The break-even versus the API path is roughly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Monthly cost&lt;/th&gt;
&lt;th&gt;What you get&lt;/th&gt;
&lt;th&gt;Break-even vs API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Plus&lt;/td&gt;
&lt;td&gt;$20&lt;/td&gt;
&lt;td&gt;15-80 GPT-5.5 msgs / 5h&lt;/td&gt;
&lt;td&gt;~3-5 heavy sessions / week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro 5x&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;td&gt;75-400 GPT-5.5 msgs / 5h&lt;/td&gt;
&lt;td&gt;~100 light sessions or 25-30 heavy / month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro 20x&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;300-1,600 GPT-5.5 msgs / 5h&lt;/td&gt;
&lt;td&gt;~80 heavy / month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key (direct)&lt;/td&gt;
&lt;td&gt;$0 + spend&lt;/td&gt;
&lt;td&gt;Pay per token, no cap&lt;/td&gt;
&lt;td&gt;Linear with usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API key (ofox, 15% off flagship)&lt;/td&gt;
&lt;td&gt;$0 + spend&lt;/td&gt;
&lt;td&gt;Pay per token, no cap&lt;/td&gt;
&lt;td&gt;~$0.80 per heavy session&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your usage pattern is "occasional bursts of heavy work," the API path almost always wins. Subscriptions only make sense if your usage is steady and predictable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix #4 — Phantom-limit workaround (issue #19215)
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;/status&lt;/code&gt; shows healthy capacity but Codex still rejects requests with &lt;code&gt;usage limit&lt;/code&gt;, you are hitting &lt;a href="https://github.com/openai/codex/issues/19215" rel="noopener noreferrer"&gt;#19215&lt;/a&gt;. The bug surfaces on Codex CLI v0.124.0 (and was still active when this issue was filed). Working steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;codex /quit
rm -rf ~/.codex/sessions/*
codex
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



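&lt;p&gt;Clearing the local session cache resolves it in most reported cases, but note that the &lt;code&gt;rm&lt;/code&gt; command also deletes any session history saved in that directory. If the error persists, attach your logs to the issue thread and switch to fix #5 (API key path) to keep working.&lt;/p&gt;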

&lt;h3&gt;
  
  
  Fix #5 — Switch &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to OpenAI's direct API endpoint
&lt;/h3&gt;

&lt;p&gt;The cleanest "I just need to keep coding" fix. Generate an OpenAI platform API key at platform.openai.com/api-keys (it is billed per token and is separate from your ChatGPT subscription login), then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-your-api-key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://api.openai.com/v1
codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Codex now hits the metered API instead of your ChatGPT subscription. Costs are per token. A typical small bug fix runs about $0.40 on GPT-5.5, and a multi-file refactor of ~300k input / 30k output sits around $2.40 — both well under what a single banked reset would cost if it were paid.&lt;/p&gt;

&lt;p&gt;The trade-off: you lose the cloud features bundled with ChatGPT (Slack integration, GitHub reviews from inside the Codex dashboard). For pure CLI work, this is the highest-leverage fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix #6 — Switch &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to a drop-in OpenAI-compatible API
&lt;/h3&gt;

&lt;p&gt;Same shape as fix #5 but routed through a provider that aggregates models, gives a unified key, and surfaces per-key budgets. The example below uses &lt;a href="https://ofox.ai" rel="noopener noreferrer"&gt;ofox&lt;/a&gt;, but the pattern works for any OpenAI-compatible endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ofox-your-key-here
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://api.ofox.ai/v1
codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three reasons this beats hitting OpenAI directly when you've already drained your ChatGPT plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Cheaper flagship models.&lt;/strong&gt; ofox lists &lt;code&gt;openai/gpt-5.3-codex&lt;/code&gt; at $1.49 input / $11.90 output per million tokens (a 15% discount on OpenAI's $1.75 / $14.00 list). The same refactor that costs $0.95 on OpenAI direct costs about $0.80 here.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Drop-in model switching.&lt;/strong&gt; &lt;code&gt;openai/gpt-5.4-mini&lt;/code&gt; is $0.638 input / $3.83 output per million on ofox — cheap enough to use as your default for iterative work. Reserve &lt;code&gt;gpt-5.3-codex&lt;/code&gt; for hard refactors.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Per-key budgets.&lt;/strong&gt; ofox exposes a per-key monthly cap on the dashboard. Set it to $50/month and you cannot exceed that limit, regardless of how aggressive the agent loop gets.&lt;/li&gt;
&lt;/ol&gt;
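
&lt;p&gt;The per-session figures in the first point can be reproduced with a few lines of arithmetic. This is an illustrative sketch only: the prices are the ones quoted above, the 300k input / 30k output refactor is the example session used throughout this article, and &lt;code&gt;session_cost&lt;/code&gt; is a hypothetical helper rather than part of any SDK.&lt;/p&gt;

```python
# USD per 1M tokens for gpt-5.3-codex, as quoted above: (input, output).
PRICES = {
    "openai-direct": (1.75, 14.00),
    "ofox": (1.49, 11.90),
}

MONTHLY_CAP_USD = 50  # the example per-key budget from point 3


def session_cost(provider, input_tokens, output_tokens):
    """Dollar cost of one session at the provider's per-token prices."""
    price_in, price_out = PRICES[provider]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


for provider in PRICES:
    cost = session_cost(provider, 300_000, 30_000)
    sessions = int(MONTHLY_CAP_USD // cost)  # whole sessions that fit under the cap
    print(f"{provider}: ${cost:.3f} per refactor, {sessions} refactors per month")
# openai-direct: $0.945 per refactor, 52 refactors per month
# ofox: $0.804 per refactor, 62 refactors per month
```

&lt;p&gt;The same function works for any model once you add its prices to the table, which makes it easy to sanity-check a monthly cap before you set it.&lt;/p&gt;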

&lt;p&gt;The &lt;a href="https://ofox.ai/docs/integrations/codex" rel="noopener noreferrer"&gt;ofox Codex integration docs&lt;/a&gt; and the deeper-dive &lt;a href="https://ofox.ai/blog/codex-cli-api-configuration-guide-2026/" rel="noopener noreferrer"&gt;Codex CLI custom endpoint guide&lt;/a&gt; cover model-specific tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix #7 — Multi-provider routing via &lt;code&gt;~/.codex/config.toml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;For teams or anyone who wants &lt;code&gt;gpt-5.3-codex&lt;/code&gt; for refactors and a cheaper model for iteration, configure two providers and let Codex switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[model_providers.ofox]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"OfoxAI"&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.ofox.ai/v1"&lt;/span&gt;
&lt;span class="py"&gt;wire_api&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"responses"&lt;/span&gt;

&lt;span class="nn"&gt;[model_providers.openai_direct]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"OpenAI Direct"&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.openai.com/v1"&lt;/span&gt;
&lt;span class="py"&gt;wire_api&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"responses"&lt;/span&gt;

&lt;span class="nn"&gt;[profiles.fast]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gpt-5.4-mini"&lt;/span&gt;
&lt;span class="py"&gt;model_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ofox"&lt;/span&gt;

&lt;span class="nn"&gt;[profiles.heavy]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gpt-5.3-codex"&lt;/span&gt;
&lt;span class="py"&gt;model_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ofox"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switch on demand with &lt;code&gt;codex --profile heavy&lt;/code&gt; for hard work and &lt;code&gt;codex --profile fast&lt;/code&gt; for iteration. The full schema lives in &lt;a href="https://ofox.ai/blog/codex-cli-config-toml-deep-dive/" rel="noopener noreferrer"&gt;Codex CLI config.toml deep dive&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex Limit Incidents: Real Patterns from 2026
&lt;/h2&gt;

&lt;p&gt;Here is a timeline of weekly-limit incidents so far this year. Tracking the pattern is the best way to anticipate the next one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apr 23, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Issue #19215 filed — &lt;code&gt;/status&lt;/code&gt; shows healthy capacity but Codex CLI rejects with "usage limit" on GPT-5.5, Business plan, CLI v0.124.0&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/openai/codex/issues/19215" rel="noopener noreferrer"&gt;openai/codex#19215&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Affected an unknown share of CLI v0.124.0 users; workaround is local cache clear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Late Apr 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Codex hits 3M weekly active users; OpenAI announces team-wide rate-limit reset&lt;/td&gt;
&lt;td&gt;&lt;a href="https://knightli.com/en/2026/05/17/codex-usage-limit-reset-history/" rel="noopener noreferrer"&gt;Knightli debrief&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;One-shot reset for affected users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;May 15, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tibo (OpenAI) posts on X: monitoring continues, manual reset shipped that evening&lt;/td&gt;
&lt;td&gt;Knightli debrief above&lt;/td&gt;
&lt;td&gt;Reset was manual, not scheduled; no permanent cap raise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jun 4, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bedrock model support: Codex can now route via Amazon-managed quota&lt;/td&gt;
&lt;td&gt;&lt;a href="https://developers.openai.com/codex/changelog" rel="noopener noreferrer"&gt;Codex changelog&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Indirect relief: Bedrock-side quotas instead of ChatGPT-tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jun 11, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Changelog adds "rate-limit reset banking": one free reset per user at launch&lt;/td&gt;
&lt;td&gt;Codex changelog&lt;/td&gt;
&lt;td&gt;First user-controlled reset mechanism; one-time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is clear: OpenAI has shipped tactical resets rather than permanent cap increases. Plus and Pro users hitting their cap twice in the same month should plan for the API path rather than a third reset.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the Limit Won't Reset: Drop-in API Alternatives That Work Right Now
&lt;/h2&gt;

&lt;p&gt;If you cannot wait, here are the options ranked by switching cost. The headline comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;base_url&lt;/th&gt;
&lt;th&gt;gpt-5.3-codex input $/M&lt;/th&gt;
&lt;th&gt;gpt-5.3-codex output $/M&lt;/th&gt;
&lt;th&gt;Per-key budget&lt;/th&gt;
&lt;th&gt;Switch time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ofox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.49&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$11.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (dashboard)&lt;/td&gt;
&lt;td&gt;~60s (2 env vars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI direct&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://api.openai.com/v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$1.75&lt;/td&gt;
&lt;td&gt;$14.00&lt;/td&gt;
&lt;td&gt;Account-level only&lt;/td&gt;
&lt;td&gt;~60s (2 env vars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Bedrock&lt;/td&gt;
&lt;td&gt;(Bedrock proxy)&lt;/td&gt;
&lt;td&gt;varies by region&lt;/td&gt;
&lt;td&gt;varies by region&lt;/td&gt;
&lt;td&gt;AWS account cap&lt;/td&gt;
&lt;td&gt;10-30 min (IAM + region)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switch coding tool&lt;/td&gt;
&lt;td&gt;n/a (e.g. Claude Code)&lt;/td&gt;
&lt;td&gt;n/a — different model&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;hours (workflow rewrite)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: pricing verified June 15, 2026 against &lt;a href="https://ofox.ai/models" rel="noopener noreferrer"&gt;ofox model catalog&lt;/a&gt; and &lt;a href="https://developers.openai.com/codex/pricing" rel="noopener noreferrer"&gt;Codex pricing reference&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A — ofox (OpenAI-compatible drop-in)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: 15% discount on flagship OpenAI models, per-key spend caps, unified billing if you also use Claude/Gemini, Codex CLI listed as officially supported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to switch&lt;/strong&gt;: Two env vars. Set &lt;code&gt;OPENAI_BASE_URL=https://api.ofox.ai/v1&lt;/code&gt; and &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; to your ofox key. Restart Codex. Total time: under 60 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API model IDs&lt;/strong&gt;: &lt;code&gt;openai/gpt-5.3-codex&lt;/code&gt;, &lt;code&gt;openai/gpt-5.5&lt;/code&gt;, &lt;code&gt;openai/gpt-5.4&lt;/code&gt;, &lt;code&gt;openai/gpt-5.4-mini&lt;/code&gt;. The full catalog is at &lt;a href="https://ofox.ai/models" rel="noopener noreferrer"&gt;ofox model catalog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B — OpenAI direct API
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Same provider as your ChatGPT plan, no third-party trust questions, full model lineup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: List pricing (no discount), no per-key budgets without scripting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to switch&lt;/strong&gt;: Get an &lt;code&gt;sk-&lt;/code&gt; API key from platform.openai.com/api-keys, set &lt;code&gt;OPENAI_BASE_URL=https://api.openai.com/v1&lt;/code&gt;. Restart Codex.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option C — Amazon Bedrock (June 4, 2026 release)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Quota lives inside your AWS account, useful if you already pay for AWS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: AWS-only model catalog (currently a subset of OpenAI), region-specific availability, more complex auth than env vars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to switch&lt;/strong&gt;: Configure Bedrock credentials, set &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to your Bedrock proxy endpoint. See the changelog for setup steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option D — Switch coding tool entirely
&lt;/h3&gt;

&lt;p&gt;If Codex CLI weekly limits keep biting, you can pivot to Claude Code (on &lt;code&gt;claude-opus-4-8&lt;/code&gt;, the current Anthropic flagship as of June 2026 and live on the ofox marketplace) or any other coding agent. The friction is config rewrites — your Codex &lt;code&gt;AGENTS.md&lt;/code&gt;, prompts, and harness habits don't transfer cleanly. Reserve this for "I've decided Codex is not for me," not "I just need to ship today."&lt;/p&gt;

&lt;p&gt;For a deeper look at how the workflow translates, see the &lt;a href="https://ofox.ai/blog/codex-cli-real-world-coding-workflow/" rel="noopener noreferrer"&gt;Codex real-world coding workflow&lt;/a&gt; writeup.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Monitor Codex Status and Cap Future Burn
&lt;/h2&gt;

&lt;p&gt;Once you've stopped the bleeding, set yourself up to never hit the same wall again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;/status&lt;/code&gt; before every heavy task
&lt;/h3&gt;

&lt;p&gt;Make it a reflex. Any task that will exceed 100k input tokens deserves a &lt;code&gt;/status&lt;/code&gt; check first. If you are under 30% on either window, switch to API key auth for that session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subscribe to the OpenAI status page
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://status.openai.com" rel="noopener noreferrer"&gt;status.openai.com&lt;/a&gt; shows API outages and degraded Codex behavior in near-real-time. Subscribe to email or RSS — you want to know about Codex degradation before you've spent 10 minutes assuming your config is broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  Track Codex CLI version
&lt;/h3&gt;

&lt;p&gt;The phantom-limit bug (#19215) was tied to CLI v0.124.0. Pin a known-good version with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex@VERSION  &lt;span class="c"&gt;# replace VERSION with a release you have verified&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the version in use with &lt;code&gt;codex --version&lt;/code&gt;. Wait until a new release has been out for at least a week before upgrading, so regressions have time to surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set per-key spend caps when on API auth
&lt;/h3&gt;

&lt;p&gt;If you switched to a metered API (fix #5/#6), set a hard cap. On OpenAI direct, this is "Spend limits" in the dashboard. On ofox, it's the per-key monthly limit on the keys page. A reasonable starting cap for a single-developer pattern: $50/month. Raise it once you've watched a month of actual burn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Watch your daily spend with a 10-line shell hook
&lt;/h3&gt;

&lt;p&gt;A pragmatic alternative to hoping the dashboard alerts arrive: log every Codex session's token count locally and add up the daily total. Drop this in &lt;code&gt;~/.codex/hooks/post-session.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Append per-session token counts to ~/.codex/spend.log&lt;/span&gt;
&lt;span class="nv"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/.codex/spend.log
&lt;span class="nv"&gt;TS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%SZ&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TS&lt;/span&gt;&lt;span class="s2"&gt; in=&lt;/span&gt;&lt;span class="nv"&gt;$CODEX_INPUT_TOKENS&lt;/span&gt;&lt;span class="s2"&gt; out=&lt;/span&gt;&lt;span class="nv"&gt;$CODEX_OUTPUT_TOKENS&lt;/span&gt;&lt;span class="s2"&gt; model=&lt;/span&gt;&lt;span class="nv"&gt;$CODEX_MODEL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then read today's total with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"^&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; ~/.codex/spend.log | &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{for(i=1;i&amp;lt;=NF;i++){if($i~/^in=/){gsub("in=","",$i);ins+=$i};if($i~/^out=/){gsub("out=","",$i);outs+=$i}}} \
       END{printf "today: %d input / %d output tokens\n", ins, outs}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will not get bill-accurate dollars (model pricing changes the math) but you will see "today already burned 800k tokens" before you start the next agent loop. The behavioral nudge alone tends to cut weekly burn by 20-30% in practice.&lt;/p&gt;
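&lt;p&gt;If you want the same total broken down per model, a small Python sketch can read the log instead of awk. It assumes the exact line format written by the hook above (timestamp first, then &lt;code&gt;in=&lt;/code&gt;, &lt;code&gt;out=&lt;/code&gt;, &lt;code&gt;model=&lt;/code&gt; fields); the function name &lt;code&gt;daily_totals&lt;/code&gt; and the sample lines are illustrative, not part of Codex.&lt;/p&gt;

```python
from collections import defaultdict


def daily_totals(lines, day):
    """Sum input/output tokens per model for log lines whose timestamp starts with `day`."""
    totals = defaultdict(lambda: {"in": 0, "out": 0})
    for line in lines:
        if not line.startswith(day):
            continue
        # Skip the timestamp, then split each remaining "key=value" field once.
        fields = dict(part.split("=", 1) for part in line.split()[1:] if "=" in part)
        model = fields.get("model") or "unknown"
        # Empty values (unset env vars in the hook) count as zero.
        totals[model]["in"] += int(fields.get("in") or 0)
        totals[model]["out"] += int(fields.get("out") or 0)
    return dict(totals)


# Illustrative sample lines in the hook's format (made-up numbers).
sample = [
    "2026-06-23T09:00:00Z in=120000 out=8000 model=openai/gpt-5.4-mini",
    "2026-06-23T11:30:00Z in=300000 out=20000 model=openai/gpt-5.3-codex",
    "2026-06-22T18:00:00Z in=50000 out=1000 model=openai/gpt-5.4-mini",
]

for model, t in daily_totals(sample, "2026-06-23").items():
    print(model, t["in"], t["out"])
```

&lt;p&gt;To run it against the real log, pass the lines of &lt;code&gt;~/.codex/spend.log&lt;/code&gt; and today's UTC date in place of the sample data.&lt;/p&gt;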

&lt;h3&gt;
  
  
  Default to a cheap model
&lt;/h3&gt;

&lt;p&gt;The cheapest reliable model for iterative work is &lt;code&gt;openai/gpt-5.4-mini&lt;/code&gt; at $0.638/M input on ofox. Set it as your Codex CLI default model and only switch to &lt;code&gt;gpt-5.3-codex&lt;/code&gt; when you explicitly need flagship coding capability. A week of this pattern typically cuts spend by 50%.&lt;/p&gt;

&lt;p&gt;For the broader pattern, see the &lt;a href="https://ofox.ai/blog/codex-cli-custom-model-providers-byo-setup/" rel="noopener noreferrer"&gt;custom model providers BYO setup guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team-shared config (multi-developer setup)
&lt;/h3&gt;

&lt;p&gt;If you are on a team of three or more developers all hitting the same weekly cap, individual ChatGPT subscriptions stop being the right shape. The economics flip:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  5 devs × $20 Plus = $100/mo, but caps are individual and unpredictable&lt;/li&gt;
&lt;li&gt;  5 devs × shared API key with $50/mo cap = predictable $250/mo ceiling, pooled usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The team-shared pattern most ofox customers run uses one API key per environment (dev, staging, ci) rather than per developer. The &lt;code&gt;config.toml&lt;/code&gt; lives in your repo, env vars come from each developer's secret manager. Example committable repo config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# .codex/config.toml — committed to repo&lt;/span&gt;
&lt;span class="nn"&gt;[model_providers.team]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Team Shared (ofox)"&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.ofox.ai/v1"&lt;/span&gt;
&lt;span class="py"&gt;wire_api&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"responses"&lt;/span&gt;
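&lt;span class="c"&gt;# Auth comes from each dev's $OPENAI_API_KEY (developer env, not repo)&lt;/span&gt;
&lt;span class="py"&gt;env_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"OPENAI_API_KEY"&lt;/span&gt;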

&lt;span class="nn"&gt;[profiles.default]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gpt-5.4-mini"&lt;/span&gt;
&lt;span class="py"&gt;model_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"team"&lt;/span&gt;

&lt;span class="nn"&gt;[profiles.heavy]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gpt-5.3-codex"&lt;/span&gt;
&lt;span class="py"&gt;model_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"team"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each developer sets their own &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; locally (pointed at a per-dev or per-team key in your secret manager). Spend monitoring then lives on one ofox dashboard rather than five individual ChatGPT accounts. CI pipelines use a separate key with a stricter per-run cap.&lt;/p&gt;

&lt;p&gt;The differentiator over individual ChatGPT subscriptions: when one developer ships a refactor that costs $4 in tokens, the whole team sees it on one dashboard. When five separate ChatGPT subs each drain their weekly cap in the same week, you have no visibility and no shared budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Authenticity and Cost
&lt;/h2&gt;

&lt;p&gt;Switching Codex CLI off your $20 ChatGPT plan onto an OpenAI-compatible API costs around $0.13 for a small bug fix on &lt;code&gt;gpt-5.3-codex&lt;/code&gt; — less than the $20 subscription buys you on a single bad afternoon of phantom limits.&lt;/p&gt;

&lt;p&gt;That is the real arithmetic. ChatGPT subscriptions are great if your usage is steady and within the cap. The moment your usage becomes burst-shaped — a hard refactor week, a tight CI deadline, a single agent loop that touches 30 files — the subscription cap becomes the wrong abstraction. Token billing pays for what you used and lets you ship.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/codex-weekly-limit-drained-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codex</category>
      <category>cli</category>
      <category>openai</category>
    </item>
    <item>
      <title>MiniMax M3 vs Claude Opus 4.8: 59% vs 69% SWE-Bench, 10× Pricing, Pick (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Sun, 14 Jun 2026 05:58:13 +0000</pubDate>
      <link>https://dev.to/owen_fox/minimax-m3-vs-claude-opus-48-59-vs-69-swe-bench-10x-pricing-pick-2026-28jo</link>
      <guid>https://dev.to/owen_fox/minimax-m3-vs-claude-opus-48-59-vs-69-swe-bench-10x-pricing-pick-2026-28jo</guid>
      <description>&lt;p&gt;MiniMax M3 just landed at 59% on SWE-Bench Pro for one-tenth the price of Claude Opus 4.8 — but the headline that says "M3 beats GPT-5.5" quietly compares it to Anthropic's &lt;em&gt;old&lt;/em&gt; flagship.&lt;/p&gt;

&lt;h2&gt;
  
  
  30-Second Verdict
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Higher SWE-Bench Pro score?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus 4.8 (69.2% vs M3's 59.0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cheaper per token?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MiniMax M3 (~10× less on both input and output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bigger context window?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tied — both 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open-weight available today?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Neither, in practice (M3 weights delayed past the promised 10-day window)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for routine coding agents?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;M3 — the quality gap matters less once you account for $/task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for hard multi-file diffs and audit work?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus 4.8 — the ~10-point benchmark gap is real&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Verdict: if your workload is &lt;strong&gt;price-sensitive agent runs&lt;/strong&gt;, pick MiniMax M3 via &lt;code&gt;minimax/minimax-m3&lt;/code&gt;. If your workload is &lt;strong&gt;hard reasoning over multi-file PRs&lt;/strong&gt;, pick &lt;code&gt;anthropic/claude-opus-4.8&lt;/code&gt;. The clean way to find out is to swap a string and run both on the same prompt — code at the end of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: Which One Should You Pick?
&lt;/h2&gt;

&lt;p&gt;A one-line decision table for the four scenarios that cover ~90% of real coding work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Pick&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lint-fix loops, formatter agents, low-stakes refactors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MiniMax M3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10× cheaper per run; quality difference invisible on simple diffs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agentic IDE plugins (Cursor, Windsurf, Cline)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;MiniMax M3&lt;/strong&gt; by default, &lt;strong&gt;Opus 4.8&lt;/strong&gt; for "explain this bug"&lt;/td&gt;
&lt;td&gt;M3 handles tool-loop volume; Opus handles the few prompts that need reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-file refactor where a wrong patch costs a debugging hour&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10-point SWE-Bench gap = noticeably fewer broken diffs on hard repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1M-context whole-repo grep+patch&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Test both&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;M3's MSA is faster at long context; Opus is more accurate. A/B on your actual repo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trap is treating this as one decision. Most teams want both models available, routed by task — and that's exactly what ofox's same-&lt;code&gt;base_url&lt;/code&gt; swap is built for.&lt;/p&gt;
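&lt;p&gt;Routing by task can be as small as a lookup table. The sketch below is one possible shape, not an ofox feature: the task labels and the &lt;code&gt;pick_model&lt;/code&gt; helper are hypothetical, and only the two model IDs come from this comparison.&lt;/p&gt;

```python
# Hypothetical task labels mapped to the two model IDs compared in this post.
ROUTES = {
    "lint_fix": "minimax/minimax-m3",
    "format": "minimax/minimax-m3",
    "tool_loop": "minimax/minimax-m3",
    "explain_bug": "anthropic/claude-opus-4.8",
    "multi_file_refactor": "anthropic/claude-opus-4.8",
}

# Anything unlabelled falls back to the cheaper model.
DEFAULT_MODEL = "minimax/minimax-m3"


def pick_model(task_type):
    """Return the model ID to pass as the `model` argument for a given task label."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


print(pick_model("lint_fix"))
print(pick_model("multi_file_refactor"))
print(pick_model("something_else"))
```

&lt;p&gt;Because both models sit behind the same endpoint, the returned string is the only thing that changes between calls.&lt;/p&gt;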

&lt;h2&gt;
  
  
  Quick Specs Comparison
&lt;/h2&gt;

&lt;p&gt;All prices verified from the ofox catalog on 2026-06-13. Context and output limits from vendor docs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;MiniMax M3&lt;/th&gt;
&lt;th&gt;Claude Opus 4.8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model ID on ofox&lt;/td&gt;
&lt;td&gt;&lt;code&gt;minimax/minimax-m3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;anthropic/claude-opus-4.8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input price&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.60/M&lt;/strong&gt; tokens&lt;/td&gt;
&lt;td&gt;$5.00/M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output price&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$2.40/M&lt;/strong&gt; tokens&lt;/td&gt;
&lt;td&gt;$25.00/M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cached input price&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$0.12/M&lt;/strong&gt; tokens&lt;/td&gt;
&lt;td&gt;$0.50/M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max output&lt;/td&gt;
&lt;td&gt;131K tokens&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modalities (input)&lt;/td&gt;
&lt;td&gt;Text + image + video&lt;/td&gt;
&lt;td&gt;Text + image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor SWE-Bench Pro&lt;/td&gt;
&lt;td&gt;59.0%&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Released&lt;/td&gt;
&lt;td&gt;2026-06-01&lt;/td&gt;
&lt;td&gt;2026-05-28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open weight?&lt;/td&gt;
&lt;td&gt;Promised, weights delayed&lt;/td&gt;
&lt;td&gt;No (closed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;MiniMax Sparse Attention (MSA)&lt;/td&gt;
&lt;td&gt;Dense transformer (Anthropic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two specs worth pausing on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input price ratio is 8.3×; output ratio is 10.4×.&lt;/strong&gt; A typical coding agent emits 0.2–0.5 output tokens per input token, so the effective ratio sits between 9× and 10× depending on workload. Round to 10× for back-of-envelope.&lt;/p&gt;
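&lt;p&gt;The effective ratio can be checked with a few lines of Python. This is a back-of-envelope sketch using the list prices from the table above; &lt;code&gt;blended_ratio&lt;/code&gt; is an illustrative helper, and the output-per-input mix is the assumption you should replace with your own traffic.&lt;/p&gt;

```python
# USD per 1M tokens, from the specs table above.
M3 = {"input": 0.60, "output": 2.40}
OPUS = {"input": 5.00, "output": 25.00}


def blended_ratio(out_per_in):
    """Opus cost divided by M3 cost for a workload emitting `out_per_in` output tokens per input token."""
    m3 = M3["input"] + out_per_in * M3["output"]
    opus = OPUS["input"] + out_per_in * OPUS["output"]
    return opus / m3


for mix in (0.2, 0.5):
    print(mix, round(blended_ratio(mix), 2))
```

&lt;p&gt;At 0.2 and 0.5 output tokens per input token the ratio comes out at about 9.26 and 9.72, which is where the 9× to 10× range comes from.&lt;/p&gt;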

&lt;p&gt;&lt;strong&gt;Max output is effectively a tie.&lt;/strong&gt; M3 ships 131K, Opus 4.8 ships 128K — the 3K gap doesn't change the operational shape. Both can emit a small file or a dozen unit tests in one call, and both will need chained calls past roughly 130K.&lt;/p&gt;

&lt;h2&gt;
  
  
  SWE-Bench Pro: The Number That Started the Story
&lt;/h2&gt;

&lt;p&gt;SWE-Bench Pro is the hardest variant of the SWE-bench family — problems from actively-maintained repositories, multi-file diffs, no public ground-truth leakage. It's the closest thing the field has to a coding benchmark that resists memorization.&lt;/p&gt;

&lt;p&gt;Here's where the frontier models sat in early June 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;SWE-Bench Pro&lt;/th&gt;
&lt;th&gt;Released&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2026-05-28&lt;/td&gt;
&lt;td&gt;Anthropic-run, official&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;64.3%&lt;/td&gt;
&lt;td&gt;2026-04&lt;/td&gt;
&lt;td&gt;What MiniMax compared M3 against&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax M3&lt;/td&gt;
&lt;td&gt;59.0%&lt;/td&gt;
&lt;td&gt;2026-06-01&lt;/td&gt;
&lt;td&gt;Vendor-run on own infra, Claude Code scaffolding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;58.6%&lt;/td&gt;
&lt;td&gt;2026-04-23&lt;/td&gt;
&lt;td&gt;OpenAI-run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;
&lt;td&gt;&amp;lt; 58.6%&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;Below GPT-5.5 per public leaderboards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first sentence of MiniMax's June 1 launch announcement reads, essentially: &lt;em&gt;"M3 beats GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro at one-tenth the cost."&lt;/em&gt; That's correct as printed. What's left out: Anthropic had shipped Opus 4.8 four days earlier with a 69.2% score, and the MiniMax deck compared M3 against the older Opus 4.7 at 64.3%.&lt;/p&gt;

&lt;p&gt;Independent verification status is the other footnote. MiniMax ran the eval on its own infrastructure, using Claude Code as the agentic scaffolding, with evaluation logic aligned to the official methodology. The official SWE-Bench Pro leaderboard had not added M3 as of this writing. Treat the 59.0% as a directional signal — it might land at 56% or 61% on a clean third-party run, and either still leaves the same shape: M3 is in the same league as GPT-5.5, one tier below Opus 4.8.&lt;/p&gt;

&lt;p&gt;The honest one-line read: &lt;strong&gt;the M3 number is real, the marketing framing is selective&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terminal-Bench 2.1 and Multimodal: Where M3 Closes the Gap
&lt;/h2&gt;

&lt;p&gt;SWE-Bench Pro is one signal. On Terminal-Bench 2.1 — long-horizon terminal execution, the kind of thing a coding agent does when you ask it to "set up the dev environment and run the failing test" — MiniMax reports M3 at 66.0%. That's in the same range as the Opus 4.8 result in Anthropic's release notes, and notably ahead of GPT-5.5. The likely reason: MSA's decoding speed at long context makes long tool-use loops cheaper to retry, so the agent can recover from more failures within a budget.&lt;/p&gt;

&lt;p&gt;Native multimodality is the other pitch. M3 accepts image &lt;em&gt;and&lt;/em&gt; video input. Opus 4.8 accepts image input but not video. In practical coding terms this matters for two things: pasting a screenshot of a stack trace, and feeding a short screencast of a UI bug. Both models handle the screenshot case; only M3 handles the screencast.&lt;/p&gt;

&lt;p&gt;For 95% of coding work neither of these tips the decision — you're staring at text. They become decisive once you start building agents that actually look at the browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Math: What 1M Tokens Actually Cost
&lt;/h2&gt;

&lt;p&gt;Vendor benchmarks are run on perfect infrastructure. Your bill is run on production traffic. Here are four realistic shapes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload shape&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;MiniMax M3 cost&lt;/th&gt;
&lt;th&gt;Claude Opus 4.8 cost&lt;/th&gt;
&lt;th&gt;Multiplier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Routine refactor agent (1M in + 200K out)&lt;/td&gt;
&lt;td&gt;1.2M total&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.08&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;9.3×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Heavy code generation (500K in + 500K out)&lt;/td&gt;
&lt;td&gt;1M total&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;10.0×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whole-repo grep + patch (1M in + 50K out)&lt;/td&gt;
&lt;td&gt;1.05M total&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.72&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$6.25&lt;/td&gt;
&lt;td&gt;8.7×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-context audit with cache hit (1M cached + 50K out)&lt;/td&gt;
&lt;td&gt;1.05M total&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.24&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.75&lt;/td&gt;
&lt;td&gt;7.3×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Numbers use ofox's published rates verified on 2026-06-13: M3 $0.60/M input / $2.40/M output / $0.12/M cached; Opus 4.8 $5/M input / $25/M output / $0.50/M cached. Math is unit price × token count, no rounding.&lt;/p&gt;

&lt;p&gt;The picture changes when you scale to a team. Pick a representative profile — five developers, 100 coding-agent runs per day each, 500K input and 100K output per run, 22 working days per month:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  M3 per run: $0.30 + $0.24 = &lt;strong&gt;$0.54&lt;/strong&gt;. Monthly: 5 × 100 × 22 × $0.54 = &lt;strong&gt;$5,940&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  Opus 4.8 per run: $2.50 + $2.50 = &lt;strong&gt;$5.00&lt;/strong&gt;. Monthly: 5 × 100 × 22 × $5.00 = &lt;strong&gt;$55,000&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A five-person engineering org running Opus on default routing burns through a small mortgage every month. The same team on M3-default routing with Opus called only for hard problems (say 10% of runs) pays roughly $11K instead. The price-performance argument for M3 isn't "cheap is fine"; it's that you can spend the saved $44K on running Opus &lt;em&gt;more&lt;/em&gt; on the prompts that actually need it.&lt;/p&gt;
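&lt;p&gt;The team arithmetic above, including the mixed-routing case, can be reproduced with a short sketch. The prices and the team profile are the ones stated in this section; the helper names and the &lt;code&gt;opus_share&lt;/code&gt; parameter are illustrative assumptions.&lt;/p&gt;

```python
# USD per 1M tokens as (input, output), from the rates quoted above.
PRICES = {"m3": (0.60, 2.40), "opus": (5.00, 25.00)}


def run_cost(model, in_tokens, out_tokens):
    """Cost in USD of one run at list price (no caching)."""
    p_in, p_out = PRICES[model]
    return (in_tokens * p_in + out_tokens * p_out) / 1_000_000


def monthly_cost(opus_share, devs=5, runs_per_day=100, days=22,
                 in_tokens=500_000, out_tokens=100_000):
    """Monthly bill when `opus_share` of runs go to Opus and the rest go to M3."""
    runs = devs * runs_per_day * days
    m3 = run_cost("m3", in_tokens, out_tokens)
    opus = run_cost("opus", in_tokens, out_tokens)
    return runs * ((1 - opus_share) * m3 + opus_share * opus)


for share in (0.0, 0.1, 1.0):
    print(share, round(monthly_cost(share)))
```

&lt;p&gt;With 10% of runs on Opus the total is about $10,846, which is the "roughly $11K" figure; change &lt;code&gt;opus_share&lt;/code&gt; to see how quickly the bill moves toward the $55,000 all-Opus case.&lt;/p&gt;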

&lt;h2&gt;
  
  
  The "Open-Weight" Caveat: Where Are the Weights?
&lt;/h2&gt;

&lt;p&gt;MiniMax's June 1 announcement positioned M3 as "the first and only open-weight model" combining frontier coding, 1M context, and native multimodality. The weights and technical report were scheduled for Hugging Face and GitHub "within roughly 10 days" of launch — call it the June 10–11 window.&lt;/p&gt;

&lt;p&gt;As of June 13, 2026, the &lt;a href="https://github.com/MiniMax-AI/MiniMax-M3" rel="noopener noreferrer"&gt;MiniMax-M3 GitHub repo&lt;/a&gt; still notes: &lt;em&gt;"this model is not yet released — this repository exists so the community can share what they need next."&lt;/em&gt; The API is live and you can call M3 via providers including ofox, but you cannot self-host it today. The repo has been frozen on a placeholder for almost two weeks.&lt;/p&gt;

&lt;p&gt;This is not a fatal point — vendors slip weight releases all the time, and "10 days" was a soft window, not a contract. But it changes the practical comparison. If you picked M3 specifically &lt;em&gt;because&lt;/em&gt; the weights would land in your private cluster within two weeks, that bet has not paid off yet. For now, both MiniMax M3 and Claude Opus 4.8 are API-only from a deployment perspective; the open-weight axis isn't decisive in June 2026.&lt;/p&gt;

&lt;p&gt;When the weights do ship, the math changes again. A self-hosted M3 cluster amortizes against your GPU lease, not per-token pricing — for sustained 24/7 workloads that's a fundamentally different cost curve from per-token Opus.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Pick MiniMax M3
&lt;/h2&gt;

&lt;p&gt;Pick &lt;code&gt;minimax/minimax-m3&lt;/code&gt; if &lt;strong&gt;any&lt;/strong&gt; of the following is true:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;You're running coding agents at volume.&lt;/strong&gt; Lint-fixer bots, formatter loops, codemod agents, "write the docstring" pipelines. These are dominated by token cost, not per-prompt quality, and M3's 10× pricing edge dwarfs the ~10-point quality gap.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;You're paying for long-context input.&lt;/strong&gt; Whole-repo prompts (1M tokens of code in, small diff out) are where MSA's decoding speed and M3's input pricing compound. A million cached tokens on M3 costs $0.12 versus $0.50 on Opus.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Video input is a hard requirement.&lt;/strong&gt; Opus 4.8 accepts images but not video. If your agent needs to look at a 30-second screen recording of a UI bug, you have one option in this comparison.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;You're hedging against the Opus 4.8 price tier.&lt;/strong&gt; Even teams that prefer Opus 4.8 for primary work route routine prompts to a cheaper model. M3 is currently the strongest coding option under $1/M input that also offers a 1M-token context window.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;You'll switch if and when independent SWE-Bench Pro reruns come in lower.&lt;/strong&gt; Treat the 59% as provisional. Build your stack so swapping &lt;code&gt;minimax/minimax-m3&lt;/code&gt; for the next cheap challenger is one config change away.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When to Pick Claude Opus 4.8
&lt;/h2&gt;

&lt;p&gt;Pick &lt;code&gt;anthropic/claude-opus-4.8&lt;/code&gt; if &lt;strong&gt;any&lt;/strong&gt; of the following is true:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A wrong patch costs more than a token bill.&lt;/strong&gt; Production hotfixes, security-sensitive refactors, anything where you'd review the diff yourself before merging anyway. The ~10-point SWE-Bench Pro gap is concentrated on the hardest problems — not the median ones.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;You're building reasoning-heavy agents.&lt;/strong&gt; "Read this incident postmortem and propose three fixes." "Audit this OAuth flow and find the bug." Opus 4.8's reasoning gains over 4.7 are tangible per Anthropic's release notes and per &lt;a href="https://simonwillison.net/2026/May/28/claude-opus-4-8/" rel="noopener noreferrer"&gt;independent reviews like Simon Willison's&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;You're already in the Anthropic ecosystem.&lt;/strong&gt; Claude Code, Anthropic's MCP tooling, dynamic workflows — all of these assume Anthropic-style tool semantics. M3 works with Claude Code (MiniMax themselves used it as scaffolding) but you'll hit edge cases on tool format expectations.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Fast mode" cost tier suits your shape.&lt;/strong&gt; Opus 4.8 introduced a $10/M input / $50/M output Fast mode tier for latency-sensitive use cases. It's more expensive than the regular tier but less than calling Opus 4.7 and waiting longer.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Your eval harness already calibrates against Opus.&lt;/strong&gt; If your team has a "would the senior reviewer accept this PR" eval suite that's been tuned against Opus outputs, switching models invalidates your eval until you re-baseline. That's real engineering cost, not vibes.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When NOT to Pick Either (and What to Use Instead)
&lt;/h2&gt;

&lt;p&gt;A few scenarios where this whole comparison is the wrong question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Sub-$0.10/M-token budget, simple refactors.&lt;/strong&gt; Look at smaller models like Claude Haiku 4 or GPT-5.4 Mini. Spending $0.60/M on M3 when GPT-5.4 Mini at $0.10/M would do the same lint-fix is theater.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You need on-prem deployment today.&lt;/strong&gt; Both M3 (weights not shipped) and Opus 4.8 (closed) are API-only. Self-host options for frontier coding today are Qwen 3.7 Max and the open Chinese model lineup.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You're optimizing for a strict latency SLA, not cost.&lt;/strong&gt; Both M3 and Opus 4.8 are designed for quality, not p50 latency. Smaller faster models will beat both on TTFT.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;You need to evaluate multiple frontier models at once.&lt;/strong&gt; Build a comparison harness instead of picking one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try Both via ofox: A/B in 10 Lines of Code
&lt;/h2&gt;

&lt;p&gt;The whole comparison reduces to a one-string change if you call both models through ofox's OpenAI-compatible endpoint. Same &lt;code&gt;base_url&lt;/code&gt;, same SDK, just swap the &lt;code&gt;model&lt;/code&gt; argument.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python — A/B both models in one loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
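&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# Read the key from the environment, as the Node example does&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OFOX_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;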
&lt;span class="n"&gt;PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor this function to remove duplication: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimax/minimax-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-opus-4.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PROMPT&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



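&lt;p&gt;Run this and you get per-model token usage and the first 120 characters of each output for eyeball comparison. For a per-run cost on a real prompt rather than a vendor benchmark, read &lt;code&gt;resp.usage.prompt_tokens&lt;/code&gt; and &lt;code&gt;resp.usage.completion_tokens&lt;/code&gt; separately and apply the input and output rates from the pricing table above; &lt;code&gt;total_tokens&lt;/code&gt; alone hides the split, and output costs 4–5× more per token than input on both models.&lt;/p&gt;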

&lt;h3&gt;
  
  
  Node — same shape
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OFOX_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Refactor this function to remove duplication: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;minimax/minimax-m3&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-opus-4.8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identical shape, identical endpoint, identical SDK call. The migration cost between models is one string.&lt;/p&gt;

&lt;p&gt;For a multi-turn agent loop that includes tool calls, the same swap works — both models accept OpenAI-style &lt;code&gt;tools&lt;/code&gt; arrays via ofox. You'll want to test the tool-call format on your specific tools because each provider's strict mode handling diverges at the edges, but the contract is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compatibility Quirks: What Differs Between the Two APIs
&lt;/h2&gt;

&lt;p&gt;Same endpoint, same SDK call — but a few sharp edges worth knowing before you wire either model into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System prompt handling.&lt;/strong&gt; Claude Opus 4.8 treats the &lt;code&gt;system&lt;/code&gt; role as a strict system prompt with elevated trust. MiniMax M3 (via the OpenAI-compatible path) folds system into the conversation more loosely. If your agent depends on system-prompt-only constraints — "never call this tool unless asked," "always respond in JSON" — M3 follows them most of the time but is statistically more likely to drift on long tool loops. Workaround: repeat critical constraints in the first user message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-call format strictness.&lt;/strong&gt; Opus 4.8 enforces tool argument schemas hard — it will refuse to call a tool if your &lt;code&gt;parameters&lt;/code&gt; JSON Schema marks a field required and the model can't fill it. M3 is more lenient and will sometimes emit a tool call with a placeholder string. If your tool layer treats placeholders as valid, you'll silently execute wrong actions; if it validates strictly, you'll see more retry loops. The fix is the same either way: validate tool arguments on the &lt;em&gt;server&lt;/em&gt; side, not just at the model.&lt;/p&gt;
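&lt;p&gt;A minimal sketch of that server-side check, assuming a hypothetical &lt;code&gt;create_ticket&lt;/code&gt;-style schema and a hand-picked set of placeholder strings; neither comes from ofox or from either provider. It only checks required fields, unknown fields, placeholder strings, and enums, so treat it as a starting point rather than a full JSON Schema validator:&lt;/p&gt;

```python
import json

# Hypothetical tool schema, in the OpenAI-style "parameters" shape.
TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "high"]},
    },
    "required": ["title", "priority"],
}

# Assumed placeholder markers; tune this set to what your models actually emit.
PLACEHOLDERS = {"", "todo", "tbd", "placeholder", "unknown", "n/a"}


def validate_tool_arguments(raw_arguments, schema):
    """Return a list of problems; an empty list means the call may run."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    problems = []
    # Every required field has to be present.
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    # Every supplied field has to be known, non-placeholder, and within its enum.
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
            continue
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS:
            problems.append(f"placeholder value for field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"value not allowed for field: {name}")
    return problems
```

&lt;p&gt;Run it on the raw &lt;code&gt;arguments&lt;/code&gt; string of each tool call before executing anything, and feed the problem list back to the model as the tool result when it is non-empty.&lt;/p&gt;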

&lt;p&gt;&lt;strong&gt;Caching semantics.&lt;/strong&gt; Both models offer cached input pricing, but Anthropic splits the bill into write and read. On Opus 4.8 you pay a one-time &lt;strong&gt;cache write&lt;/strong&gt; at $6.25/M (5-minute TTL) or $10/M (1-hour TTL), then every subsequent &lt;strong&gt;cache read&lt;/strong&gt; lands at $0.50/M. M3's cache on ofox is a single $0.12/M read rate with implicit TTL and no separate write surcharge. For workloads that hit the same long-context prompt many times per minute, M3 is dramatically cheaper at the cache read layer. For workloads where the cached portion stays warm for hours and write costs amortize across many reads, Opus 4.8's 1-hour tier is competitive on a per-token basis even before quality.&lt;/p&gt;
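&lt;p&gt;The cache arithmetic can be sketched as follows, using only the prices quoted in the paragraph above. Two simplifying assumptions are mine: every read lands inside the cache TTL, and M3's first uncached pass is left out because this section does not quote M3's uncached input price.&lt;/p&gt;

```python
# Prices quoted in the paragraph above, in USD per 1M tokens.
OPUS_WRITE_5MIN = 6.25
OPUS_WRITE_1HOUR = 10.00
OPUS_READ = 0.50
M3_READ = 0.12


def opus_cached_cost(prompt_tokens, reads, write_price):
    """One cache write plus `reads` cache reads of the same prompt, in USD."""
    per_million = write_price + reads * OPUS_READ
    return prompt_tokens / 1_000_000 * per_million


def m3_cached_cost(prompt_tokens, reads):
    """`reads` cache reads at the single read rate, with no write surcharge."""
    return prompt_tokens / 1_000_000 * reads * M3_READ


# Example: a 100,000-token cached prompt reused 10, 100, and 1,000 times.
for reads in (10, 100, 1000):
    opus = opus_cached_cost(100_000, reads, OPUS_WRITE_1HOUR)
    m3 = m3_cached_cost(100_000, reads)
    print(f"{reads} reads: Opus 1-hour tier ${opus:.2f}, M3 ${m3:.2f}")
```

&lt;p&gt;The write cost is paid once, so its share of the Opus total shrinks as the read count grows, while the per-read rates stay fixed.&lt;/p&gt;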

&lt;p&gt;&lt;strong&gt;Streaming chunk shape.&lt;/strong&gt; Both models stream OpenAI-compatible &lt;code&gt;chunks&lt;/code&gt;, but Opus 4.8 emits more granular &lt;code&gt;delta.thinking&lt;/code&gt; events when extended thinking is enabled. If your client parses thinking deltas separately from content deltas, that code works against Opus but no-ops against M3, which doesn't currently expose thinking deltas through the OpenAI-compatible route. Not a bug — just an unused field.&lt;/p&gt;
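&lt;p&gt;A sketch of a parser that tolerates both behaviors, assuming each streamed chunk has already been decoded into a plain dict in the OpenAI chat-completions shape and that thinking text arrives under a &lt;code&gt;thinking&lt;/code&gt; key in the delta, as described above. The function name is illustrative:&lt;/p&gt;

```python
def split_stream(chunks):
    """Collect thinking text and answer text separately from streamed chunks.

    Each chunk is expected to be a dict with a "choices" list whose items
    carry a "delta" dict; "thinking" and "content" are both optional.
    """
    thinking_parts = []
    content_parts = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # Absent or null fields are skipped, so the same loop works
            # whether or not the model exposes thinking deltas.
            if delta.get("thinking"):
                thinking_parts.append(delta["thinking"])
            if delta.get("content"):
                content_parts.append(delta["content"])
    return "".join(thinking_parts), "".join(content_parts)
```

&lt;p&gt;Against a model that never sends thinking deltas, the first element of the returned pair is simply an empty string.&lt;/p&gt;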

&lt;p&gt;&lt;strong&gt;Rate limits at the provider edge.&lt;/strong&gt; When you call both models through ofox, you share one rate limit envelope keyed to your API key — not two separate per-vendor quotas. That's the point of the gateway shape: M3 fallback when Opus is rate-limited, Opus fallback when M3 is, all without juggling two sets of credentials.&lt;/p&gt;
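&lt;p&gt;The fallback pattern can be sketched without tying it to a particular SDK. &lt;code&gt;RateLimited&lt;/code&gt; and &lt;code&gt;call_model&lt;/code&gt; are hypothetical names: map the first to whatever your client raises on an HTTP 429 and make the second a thin wrapper around your actual request.&lt;/p&gt;

```python
class RateLimited(Exception):
    """Hypothetical stand-in for whatever your client raises on HTTP 429."""


def complete_with_fallback(call_model, models, prompt):
    """Try each model ID in order, moving on when one is rate-limited.

    `call_model(model, prompt)` is a caller-supplied function that performs
    the request and raises RateLimited when the model is rate-limited.
    Returns a (model, result) pair for the first model that answers.
    """
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except RateLimited as err:
            last_error = err
    raise RuntimeError("all models were rate-limited") from last_error
```

&lt;p&gt;Passing &lt;code&gt;["anthropic/claude-opus-4.8", "minimax/minimax-m3"]&lt;/code&gt; prefers Opus and falls back to M3; reversing the list flips the preference.&lt;/p&gt;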

&lt;p&gt;The whole MiniMax M3 vs Claude Opus 4.8 question collapses to one string swap on the same endpoint — which is the only sane way to pick a coding model in 2026.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/minimax-m3-vs-claude-opus-4-8-coding-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>minimax</category>
      <category>coding</category>
    </item>
    <item>
      <title>DeepSeek V4 Pro Real Cost: 120x Cache Gap Behind Sticker</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Fri, 12 Jun 2026 08:03:37 +0000</pubDate>
      <link>https://dev.to/owen_fox/deepseek-v4-pro-real-cost-120x-cache-gap-behind-sticker-1977</link>
      <guid>https://dev.to/owen_fox/deepseek-v4-pro-real-cost-120x-cache-gap-behind-sticker-1977</guid>
      <description>&lt;h1&gt;
  
  
  DeepSeek V4 Pro Real Cost: 120x Cache Gap Behind Sticker
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — DeepSeek V4 Pro's "$0.28/M" reputation is wrong on two counts. First, $0.28/M is the output price of V4 &lt;em&gt;Flash&lt;/em&gt;, not Pro — Pro is $0.87/M output. Second, the published $0.435/M input price assumes a cache miss. Cache hits cost $0.003625/M, a &lt;strong&gt;120x spread&lt;/strong&gt; that quietly determines whether your bill matches the headline. To show why the published price isn't the bill, we ran V4 Pro, GPT-5.5, and Claude Sonnet 4.6 against the same refactoring task on &lt;code&gt;api.ofox.ai&lt;/code&gt;. The mechanism is the lesson, not the leaderboard: &lt;strong&gt;V4 Pro finished in ~1,500 output tokens, GPT-5.5 ran into our 8,192-token cap on 2 of 3 runs, Sonnet 4.6 used ~3,800.&lt;/strong&gt; Output verbosity × output sticker price compounds into an order-of-magnitude gap on a reasoning-heavy task like this one. The exact "Nx cheaper" multiplier depends on caps, quality gates, and sample size we explicitly didn't lock down — see the methodology limits below. Pricing verified at &lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;api-docs.deepseek.com/quick_start/pricing&lt;/a&gt; on 2026-06-12.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Try it yourself, with the 75% permanent discount built in.&lt;/strong&gt; &lt;a href="https://ofox.ai/?utm_source=blog&amp;amp;utm_medium=cta&amp;amp;utm_campaign=deepseek-v4-pro-real-cost" rel="noopener noreferrer"&gt;Run V4 Pro through ofox.ai&lt;/a&gt; on the same API key that already covers Claude, GPT, and Gemini — same V4 Pro price as the official endpoint, no Chinese phone number required, no separate account.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The three variables that decide your bill
&lt;/h2&gt;

&lt;p&gt;A model's sticker price is the smallest input to your monthly invoice. Three variables matter more:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Cache miss rate.&lt;/strong&gt; Cache-miss input on V4 Pro is $0.435/M. Cache-hit is $0.003625/M. That is not a 90% discount; it is a &lt;strong&gt;120x price gap&lt;/strong&gt;, roughly 99.2% off. A workload that runs at 95% cache hit pays about $0.025/M for input, while one running at 60% hit pays about $0.176/M, a roughly 7x difference. Most "DeepSeek is 50x cheaper than Claude" comparisons silently assume the high-hit case.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Thinking-mode output inflation.&lt;/strong&gt; V4 Pro's thinking mode is on by default. On a single refactoring prompt in our benchmark, V4 Pro emitted 1,556 output tokens; GPT-5.5 emitted 8,192 (it hit our &lt;code&gt;max_tokens&lt;/code&gt; cap) for the same prompt. Output is also where the real spend lives — $0.87/M on V4 Pro vs $0.435/M for cache-miss input. Verbose models burn the more expensive token.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tokenizer drift.&lt;/strong&gt; Different providers count the same English text slightly differently. For our 1,400-character prompt, DeepSeek counted 339 input tokens, OpenAI counted 331, Anthropic counted 396 — a ~20% spread between the lightest and heaviest counter. Small. But it's measurable and asymmetric: Anthropic tends to count more.&lt;/li&gt;
&lt;/ol&gt;
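&lt;p&gt;To make the first variable concrete, here is a small sketch of the blended input price at a given cache hit rate, using the two V4 Pro input prices quoted above. The function name is illustrative and the hit rate is expected to be a fraction between 0 and 1:&lt;/p&gt;

```python
# V4 Pro input prices from this article, USD per 1M tokens.
CACHE_MISS_PRICE = 0.435
CACHE_HIT_PRICE = 0.003625


def effective_input_price(hit_rate):
    """Blended input price per 1M tokens at a cache hit rate of 0.0 to 1.0."""
    return hit_rate * CACHE_HIT_PRICE + (1.0 - hit_rate) * CACHE_MISS_PRICE


for rate in (0.0, 0.60, 0.80, 0.95):
    print(f"{rate:.0%} hit rate: ${effective_input_price(rate):.4f}/M input")
```

&lt;p&gt;The miss term dominates the sum at every realistic hit rate, which is why the miss rate, not the hit price, is the number to watch.&lt;/p&gt;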

&lt;p&gt;These three variables compound. A 60% cache hit rate with a verbose model and a heavy-counting tokenizer can turn a "12x cheaper" provider into "3x cheaper" on the actual invoice. None of that is misrepresentation by DeepSeek — it's what cache pricing always means, just at an unusually large spread.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "$0.28/M" actually refers to
&lt;/h2&gt;

&lt;p&gt;The reason $0.28/M shows up in conversations about V4 Pro pricing is that V4 Flash, the smaller sibling, lists its output at exactly that number. Once a price gets repeated on social media, it stops carrying the model name attached. Here's the full table, verified against the official DeepSeek docs on 2026-06-12:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (cache-miss)&lt;/th&gt;
&lt;th&gt;Input (cache-hit)&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.435/M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.003625/M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.87/M&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;td&gt;&lt;code&gt;deepseek-v4-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;$0.14/M&lt;/td&gt;
&lt;td&gt;$0.0028/M&lt;/td&gt;
&lt;td&gt;$0.28/M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Source: &lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;api-docs.deepseek.com/quick_start/pricing&lt;/a&gt;, accessed 2026-06-12. Both models advertise a 1M-token context window and a 384K max output. Pricing is in USD per 1M tokens.&lt;/p&gt;

&lt;p&gt;The original V4 Pro launch listed regular prices of $1.74/M input cache-miss and $3.48/M output, with a 75% promotional discount running through May 31, 2026. In late May, the 75% discount was reportedly made permanent — DeepSeek's official docs no longer show a "promo until" footnote, and the community announcement thread on &lt;a href="https://reddit.com/r/GithubCopilot/comments/1tkzjqx/deepseek_v4_pro_75_off_is_now_permanent/" rel="noopener noreferrer"&gt;r/GithubCopilot&lt;/a&gt; (340 upvotes, 71 comments) summarizes the change:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DeepSeek just made the 1/4 discounted price for v4 Pro permanent. […] It increases the gap with the frontier (Sonnet/GPT 5.4) models to a 12 to 17x difference. And we are not even talking about the cache hit, where the difference is easily 60 to 80x cheaper.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The 75% discount being permanent matters because it makes the current price table, 120x cache-hit/miss spread included, the one to plan around. The ratio was the same at the original sticker price; only the absolute numbers were higher. At either price, a spread this large means hit-rate behavior dominates the input side of your bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  A one-task mechanism probe (not a leaderboard)
&lt;/h2&gt;

&lt;p&gt;We ran one workload to make the mechanism visible, not to publish an Nx ranking. Three models, one task, three runs each, with the same single user message sent to all three at &lt;code&gt;temperature=0.2&lt;/code&gt; and &lt;code&gt;max_tokens=8192&lt;/code&gt;. All three were called through &lt;code&gt;api.ofox.ai/v1&lt;/code&gt;, an OpenAI-compatible router that exposes V4 Pro, GPT-5.5, and Sonnet 4.6 on the same key. Verbatim prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;You&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;senior&lt;/span&gt; &lt;span class="nx"&gt;TypeScript&lt;/span&gt; &lt;span class="nx"&gt;engineer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;Refactor&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;following&lt;/span&gt; &lt;span class="kr"&gt;module&lt;/span&gt; &lt;span class="nx"&gt;into&lt;/span&gt; &lt;span class="nx"&gt;idiomatic&lt;/span&gt; &lt;span class="nx"&gt;TypeScript&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;these&lt;/span&gt; &lt;span class="nx"&gt;constraints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Replace&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;any&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;types&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;precise&lt;/span&gt; &lt;span class="nx"&gt;generics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;Extract&lt;/span&gt; &lt;span class="nx"&gt;pure&lt;/span&gt; &lt;span class="nx"&gt;helper&lt;/span&gt; &lt;span class="nx"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Add&lt;/span&gt; &lt;span class="nx"&gt;Vitest&lt;/span&gt; &lt;span class="nx"&gt;unit&lt;/span&gt; &lt;span class="nx"&gt;tests&lt;/span&gt; &lt;span class="nx"&gt;covering&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;happy&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;empty&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;malformed&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;network&lt;/span&gt; &lt;span class="nx"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Use&lt;/span&gt; &lt;span class="nx"&gt;AbortController&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Return&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;single&lt;/span&gt; &lt;span class="nx"&gt;fenced&lt;/span&gt; &lt;span class="s2"&gt;```ts code block with the refactored module followed by a separate fenced ```&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="kd"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Do&lt;/span&gt; &lt;span class="nx"&gt;not&lt;/span&gt; &lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="nx"&gt;prose&lt;/span&gt; &lt;span class="nx"&gt;commentary&lt;/span&gt; &lt;span class="nx"&gt;outside&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="nx"&gt;blocks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="s2"&gt;```ts
// fetchAndAggregate.ts — original
export async function fetchAndAggregate(urls, opts) {
  const results = [];
  for (const u of urls) {
    try {
      const r = await fetch(u, opts);
      const j = await r.json();
      if (j &amp;amp;&amp;amp; j.records) {
        for (const rec of j.records) {
          if (!rec.skip) results.push(rec);
        }
      }
    } catch (e) {
      console.log('failed', u, e);
    }
  }
  // group by category
  const out = {};
  for (const r of results) {
    const k = r.category || 'misc';
    if (!out[k]) out[k] = [];
    out[k].push(r);
  }
  // sort each group by ts desc
  for (const k of Object.keys(out)) {
    out[k].sort((a, b) =&amp;gt; (b.ts || 0) - (a.ts || 0));
  }
  return out;
}
```&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste that into any OpenAI-compatible client (point &lt;code&gt;baseURL&lt;/code&gt; at &lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt; or your provider of choice) with the parameters above and you'll re-run exactly what we ran.&lt;/p&gt;
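&lt;p&gt;One way to script the re-run, as a sketch rather than the harness we used: &lt;code&gt;run_probe&lt;/code&gt; is an illustrative name, and &lt;code&gt;client&lt;/code&gt; is any object that exposes the OpenAI-style &lt;code&gt;chat.completions.create&lt;/code&gt; method. The parameters match the ones stated above.&lt;/p&gt;

```python
def run_probe(client, model, prompt, runs=3):
    """Send the same single user message `runs` times and collect usage.

    `client` is any object exposing the OpenAI-style
    `chat.completions.create(...)` method, for example an OpenAI SDK client
    whose base URL points at the router.
    """
    results = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            max_tokens=8192,
        )
        # Keep input and output counts apart; they are priced differently.
        results.append({
            "model": model,
            "prompt_tokens": resp.usage.prompt_tokens,
            "completion_tokens": resp.usage.completion_tokens,
            "finish_reason": resp.choices[0].finish_reason,
        })
    return results
```

&lt;p&gt;With the official Python SDK, the client would be built as &lt;code&gt;OpenAI(api_key=..., base_url="https://api.ofox.ai/v1")&lt;/code&gt;. Recording &lt;code&gt;finish_reason&lt;/code&gt; on every run is what lets you spot cap hits afterwards.&lt;/p&gt;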

&lt;p&gt;Median of 3 runs, on 2026-06-12. The right way to read this table is &lt;strong&gt;output verbosity&lt;/strong&gt;, not the dollar column — the dollar column is illustrative because GPT-5.5's outputs are capped at 8,192 (see methodology limits):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Median wall-clock&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Output tokens&lt;/th&gt;
&lt;th&gt;&lt;code&gt;finish_reason&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Illustrative cost per run*&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;deepseek/deepseek-v4-pro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18.8 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;339&lt;/td&gt;
&lt;td&gt;1,556&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;stop&lt;/code&gt; (3/3)&lt;/td&gt;
&lt;td&gt;$0.0015&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openai/gpt-5.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;117.4 s&lt;/td&gt;
&lt;td&gt;331&lt;/td&gt;
&lt;td&gt;8,192&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;length&lt;/code&gt; (2/3), &lt;code&gt;stop&lt;/code&gt; at 7,961 (1/3)&lt;/td&gt;
&lt;td&gt;$0.2474 (truncated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;anthropic/claude-sonnet-4.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;38.9 s&lt;/td&gt;
&lt;td&gt;396&lt;/td&gt;
&lt;td&gt;3,880&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;stop&lt;/code&gt; (3/3)&lt;/td&gt;
&lt;td&gt;$0.0594&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Computed using each provider's published sticker price (DeepSeek $0.435/$0.87, OpenAI $5/$30, Anthropic $3/$15, USD per 1M tokens). The GPT-5.5 dollar number is a &lt;strong&gt;lower bound&lt;/strong&gt; — its outputs were truncated by our cap on 2 of 3 runs, and a complete run would cost more.&lt;/p&gt;
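&lt;p&gt;The cost column can be reproduced from the table's own numbers. This sketch uses the median token counts and the sticker prices from the footnote, and bills every input token at the cache-miss rate because all runs were cold:&lt;/p&gt;

```python
# Sticker prices from the footnote above: (input, output) in USD per 1M tokens.
PRICES = {
    "deepseek/deepseek-v4-pro": (0.435, 0.87),
    "openai/gpt-5.5": (5.00, 30.00),
    "anthropic/claude-sonnet-4.6": (3.00, 15.00),
}

# Median token counts from the table above: (input, output).
MEDIAN_TOKENS = {
    "deepseek/deepseek-v4-pro": (339, 1556),
    "openai/gpt-5.5": (331, 8192),
    "anthropic/claude-sonnet-4.6": (396, 3880),
}


def cost_per_run(model, input_tokens, output_tokens):
    """Cold-prompt cost in USD, with all input billed at the cache-miss rate."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for model, (tokens_in, tokens_out) in MEDIAN_TOKENS.items():
    print(f"{model}: ${cost_per_run(model, tokens_in, tokens_out):.4f}")
```

&lt;p&gt;Rounded to four decimals this gives $0.0015, $0.2474, and $0.0594, matching the table. In all three rows the output term accounts for at least 90% of the total.&lt;/p&gt;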

&lt;p&gt;What the data does say with confidence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Verbosity multiplies the per-token price gap.&lt;/strong&gt; V4 Pro emitted 1,556 output tokens; Sonnet emitted 3,880; GPT-5.5 emitted 8,192+ before our cap stopped it. The model-behavior gap (about 5x verbosity between V4 Pro and GPT-5.5) compounds with the sticker gap (about 34x output price between the same two models), so output cost differs by far more than either factor alone. That's a mechanism statement and it doesn't depend on whether the GPT-5.5 cost number is exact.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;None of the runs reported cached tokens.&lt;/strong&gt; Each call was a cold prompt. On a real Claude Code or OpenCode session, the system prompt and project context stay stable across turns and the cache fires. Our cold-start numbers are an upper bound on V4 Pro's per-call cost. If you sustain a 70% input cache hit, V4 Pro's effective input cost drops from $0.435/M to ~$0.13/M.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tokenizer differences exist but are small here.&lt;/strong&gt; DeepSeek counted 339 input tokens, OpenAI 331, Anthropic 396 for the same English prompt. Anthropic counts heaviest, but not by enough to change the conclusion. On longer prompts with code, the spread widens — worth eyeballing if you're doing nine-figure-token workloads.&lt;/li&gt;
&lt;/ol&gt;
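&lt;p&gt;The compounding in the first point can be written out as a ratio of output costs, split into a verbosity factor and a price factor. This is a sketch over the medians and sticker prices reported above; the GPT-5.5 token count is capped, so its factors are lower bounds.&lt;/p&gt;

```python
# Median output tokens and output sticker prices (USD per 1M) from this post.
V4_PRO = {"output_tokens": 1556, "output_price": 0.87}
GPT_55 = {"output_tokens": 8192, "output_price": 30.00}  # capped, a lower bound
SONNET_46 = {"output_tokens": 3880, "output_price": 15.00}


def output_cost_gap(baseline, other):
    """Split the output-cost ratio into a verbosity factor and a price factor."""
    verbosity = other["output_tokens"] / baseline["output_tokens"]
    price = other["output_price"] / baseline["output_price"]
    return {"verbosity": verbosity, "price": price, "combined": verbosity * price}


for name, other in (("GPT-5.5", GPT_55), ("Sonnet 4.6", SONNET_46)):
    gap = output_cost_gap(V4_PRO, other)
    print(f"{name}: {gap['verbosity']:.1f}x tokens * {gap['price']:.1f}x price "
          f"= {gap['combined']:.0f}x output cost")
```

&lt;p&gt;The two factors multiply rather than add, which is the whole mechanism: a model that is both pricier per token and more verbose pays for it twice.&lt;/p&gt;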

&lt;h2&gt;
  
  
  Methodology limits (read this before you quote the numbers)
&lt;/h2&gt;

&lt;p&gt;The numbers above are honest about what they are: a single mechanism probe, not a benchmark you can quote a clean "Nx cheaper" multiplier from. Four limits, in order of how much they would change the headline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;GPT-5.5 was truncated on 2 of 3 runs.&lt;/strong&gt; Our &lt;code&gt;max_tokens=8192&lt;/code&gt; cap is the reason 2 of 3 GPT-5.5 runs report &lt;code&gt;finish_reason: "length"&lt;/code&gt; — those outputs are unfinished. A fair version of this benchmark would set &lt;code&gt;max_tokens&lt;/code&gt; high enough that no model is artificially cut off, or discard &lt;code&gt;length&lt;/code&gt;-terminated runs from the cost average. Until that's redone, the GPT-5.5 cost column is a lower bound and the V4-Pro-vs-GPT-5.5 cost ratio is best stated as "an order of magnitude on this kind of task," not "Nx."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;No quality gate.&lt;/strong&gt; We did not run the generated code or its tests through Vitest. "V4 Pro is cheaper" only matters if the output is also correct: a model that ships syntactically valid code but with broken tests isn't actually cheaper. A proper rerun would extract each model's &lt;code&gt;ts&lt;/code&gt; blocks, write them to disk, run &lt;code&gt;vitest run&lt;/code&gt;, and only count &lt;code&gt;finish_reason=stop&lt;/code&gt; + tests-pass runs into the cost comparison.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;n=3 with no variance reported.&lt;/strong&gt; Three runs are enough to spot a mechanism (cap hits, verbosity, latency) but not enough to claim a stable rate. A defensible rerun would be n≥10 with median &lt;em&gt;and&lt;/em&gt; min/max.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;One task.&lt;/strong&gt; This is a TypeScript refactor with test scaffolding. The conclusion does &lt;em&gt;not&lt;/em&gt; generalize to "V4 Pro is an order of magnitude cheaper on every task." On a one-shot prose generation, on tool-heavy agent loops, or on a Flash-suitable scaffold, the ranking changes — sometimes inverts. The honest framing: V4 Pro is cheaper &lt;em&gt;on reasoning-heavy refactors where its smaller output footprint is what matters&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
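&lt;p&gt;The filtering and reporting rules from limits 1 to 3 could look like the sketch below. The run records are hypothetical placeholders, not measurements, and the &lt;code&gt;tests_pass&lt;/code&gt; flag is assumed to come from a separate Vitest step that is not shown here.&lt;/p&gt;

```python
from statistics import median

# Hypothetical per-run records; the numbers are placeholders, not measurements.
example_runs = [
    {"finish_reason": "stop", "tests_pass": True, "cost_usd": 0.0016},
    {"finish_reason": "stop", "tests_pass": True, "cost_usd": 0.0014},
    {"finish_reason": "stop", "tests_pass": False, "cost_usd": 0.0012},
    {"finish_reason": "length", "tests_pass": False, "cost_usd": 0.2458},
    {"finish_reason": "stop", "tests_pass": True, "cost_usd": 0.0015},
]


def summarize_valid_runs(runs):
    """Keep only complete, test-passing runs and report median with min/max."""
    valid = [
        r["cost_usd"]
        for r in runs
        if r["finish_reason"] == "stop" and r["tests_pass"]
    ]
    if not valid:
        return None  # nothing survived the gate, so there is no cost to quote
    return {
        "n": len(valid),
        "median": median(valid),
        "min": min(valid),
        "max": max(valid),
    }


print(summarize_valid_runs(example_runs))
```

&lt;p&gt;Reporting &lt;code&gt;n&lt;/code&gt; alongside the median makes it obvious when the gate has thrown away most of a model's runs.&lt;/p&gt;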

&lt;p&gt;We're doing the proper rerun (max_tokens lifted, Vitest gate applied, n≥10, full per-run &lt;code&gt;usage&lt;/code&gt; and &lt;code&gt;finish_reason&lt;/code&gt; published) as a follow-up — link will land in this section when it ships. Until then: trust the mechanism, treat the multipliers as illustrative.&lt;/p&gt;

&lt;h2&gt;
  
  
  What real bills look like in the wild
&lt;/h2&gt;

&lt;p&gt;Our benchmark is one task across three models. The community has been posting real monthly bills for weeks. Sampled from r/DeepSeek, r/opencode, r/Anthropic, r/LLMDevs, and r/GithubCopilot — bills are real, screenshots posted by users on their own accounts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I just spent $2 over two days with DS on OpenCode. I'd be so upset if I'd spent $265 with Claude for the same thing." — u/pepeperezcanyear on &lt;a href="https://reddit.com/r/DeepSeek/comments/1ts0byd/pricing_is_crazy/" rel="noopener noreferrer"&gt;r/DeepSeek "Pricing is crazy"&lt;/a&gt; (1,027 upvotes, 149 comments).&lt;/p&gt;

&lt;p&gt;"200M tokens total. roughly 70/30 split on prompt vs completion. came out under 35 bucks all in. […] for context, when we were on claude pro for similar workload the per-seat math was 6x that and we had to babysit context limits. when we tested gpt-5.5-codex on the same kind of work the per-token was 8-10x and the wall time was worse." — u/Fun_Walk_4965, &lt;a href="https://reddit.com/r/DeepSeek/comments/1twfqdp/200m_tokens_last_month_around_30_bucks_total_how/" rel="noopener noreferrer"&gt;r/DeepSeek, 189↑/132💬&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;"With $3.88 &amp;amp; 690,003,591 tokens and 5 hours, Deepseek Pro &amp;amp; Flash combined, managed to reverse engineer Teamspeak's Licensing System […] In 5 hours of trial and error, debugging with Ghidra and x64dbg." — original post on &lt;a href="https://reddit.com/r/DeepSeek/comments/1txcfrh/with_388_690003591_tokens_and_5_hours_deepseek/" rel="noopener noreferrer"&gt;r/DeepSeek, 365↑/48💬&lt;/a&gt;. Note the top reply: "Less than 1% of those tokens are output tokens" (u/—Spaci—, +23) — meaning the cache was doing real work on this run.&lt;/p&gt;

&lt;p&gt;"Deepseek V4 current cost: 78.2m tokens for $1.14. What's yours?" — r/opencode, &lt;a href="https://reddit.com/r/opencode/comments/1t3q2qw/deepseek_v4_current_cost_782m_tokens_for_114/" rel="noopener noreferrer"&gt;361↑/55💬&lt;/a&gt;. A reply from u/Still-Notice8155 (+5): "mine roughly 85m/1$ using pro, it's insane with opus 4.6 like quality."&lt;/p&gt;

&lt;p&gt;"65 million tokens for 7 dollars lol" — r/DeepSeek, &lt;a href="https://reddit.com/r/DeepSeek/comments/1trwjy5/65_million_tokens_for_7_dollars_lol/" rel="noopener noreferrer"&gt;170↑/93💬&lt;/a&gt;. Top reply, with a screenshot: "You spent too much lmfao. 680 million for 14" (u/deleted-account69420, +55).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The thread on r/Anthropic — where you'd expect skepticism — is &lt;a href="https://reddit.com/r/Anthropic/comments/1suorit/the_costs_are_getting_out_of_hand_check_out_the/" rel="noopener noreferrer"&gt;232 upvotes and 123 comments&lt;/a&gt; deep, and the highest-rated reply (u/KaMaFour, +74) gets the warning exactly right:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PRICE PER MILLION TOKENS IS NOT A GOOD MEASURE OF THE COST BECAUSE IT DOESN'T TAKE INTO ACCOUNT THE VERBOSITY OF THE MODEL.&lt;/strong&gt; 5x cheaper model per token can be as expensive if it uses 5x more tokens per task.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the cleanest version of the GPT-5.5 result above. Per-token price is half the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The counter-narrative: when V4 Pro isn't cheap
&lt;/h2&gt;

&lt;p&gt;It's not all "$2 for two days." Worth giving the dissent a real airing. From &lt;a href="https://reddit.com/r/LLMDevs/comments/1tjkaqj/token_costs_are_actually_unsustainable_for/" rel="noopener noreferrer"&gt;r/LLMDevs "Token costs are actually unsustainable for multi-project work"&lt;/a&gt; (32↑/94💬), top reply from u/look (+11):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I primarily use a mix of Mimo V2.5 Pro, GLM-5.1, Qwen 3.6 Plus, and Deepseek V4 Flash (&lt;strong&gt;don't waste your time with Pro — it's as expensive as US models in actual use&lt;/strong&gt;) and my average blended token costs are under 5 cents per Mtok and still dropping."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The "Pro is as expensive as US models in actual use" claim is the inverse of our benchmark result. Both can be true. Pro's pricing assumes you don't blow through cache, and assumes the task benefits from thinking output enough to justify the higher per-token cost. If you're driving Pro on prompts where Flash would have answered correctly, you are paying the Pro tax — 3x more input, 3x more output — for output you didn't need. Our benchmark task was deliberately reasoning-heavy (refactor + write tests + reason about edge cases). On a one-shot "write a function that sums an array" prompt, the conclusion flips and Flash wins decisively. We covered that tradeoff in detail in our earlier piece on &lt;a href="https://dev.to/blog/deepseek-v4-pro-vs-flash/"&gt;DeepSeek V4 Pro vs Flash: real cost-quality tradeoff&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The honest answer: on the reasoning-heavy refactor we tested, V4 Pro's run cost at least an order of magnitude less than GPT-5.5's and Sonnet 4.6's, and its smaller output footprint widened a gap that per-token price alone would not fully explain. On tasks Flash can handle, Pro is &lt;em&gt;not&lt;/em&gt; cheap relative to Flash. And the numbers above come from one mechanism probe, not a benchmark you can A/B procurement on (see methodology limits). Route by task type, not by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do with this
&lt;/h2&gt;

&lt;p&gt;Three concrete moves, in order of how much they actually affect your bill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Measure your cache hit rate before you trust any "Nx cheaper" claim.&lt;/strong&gt; Pull a week of usage from your API dashboard. DeepSeek reports &lt;code&gt;prompt_cache_hit_tokens&lt;/code&gt; and &lt;code&gt;prompt_cache_miss_tokens&lt;/code&gt; per call. If your hit rate is 80%+, the published savings numbers apply roughly as-is. If you're at 60%, multiply the cache-miss cost by 0.4 and the cache-hit cost by 0.6 and recompute. The math in &lt;a href="https://dev.to/blog/llm-api-cache-hit-math-real-bills-2026/"&gt;the cache-hit math primer&lt;/a&gt; walks through this step by step.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Route reasoning to Pro, scaffolding to Flash.&lt;/strong&gt; On a one-shot CRUD scaffold, Flash produces output that's hard to distinguish from Pro on a blind read, and it's roughly 3x cheaper on both cache-miss input and output (cache-hit input is closer: $0.0028/M vs $0.003625/M). On a multi-file refactor with implicit invariants, Pro holds the constraints across the rewrite; Flash drifts. The &lt;a href="https://dev.to/blog/deepseek-api-pricing-guide-2026/"&gt;DeepSeek API pricing guide&lt;/a&gt; has a per-task decision table.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Run cold-start prompts through a router that lets you A/B.&lt;/strong&gt; Routing through a unified endpoint like &lt;a href="https://dev.to/blog/ai-api-aggregation-access-every-model-one-endpoint/"&gt;ofox.ai's API&lt;/a&gt; means you can ship the same prompt to V4 Pro, GPT-5.5, and Sonnet 4.6 by changing one model-ID string. We use it ourselves: the benchmark in this post took three runs per model on one key, with all three models reachable through the same endpoint. If you're still budgeting against someone else's sticker price, that's the cheapest experiment to run before you commit.&lt;/li&gt;
&lt;/ol&gt;
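&lt;p&gt;The cache-hit arithmetic in the first move can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's billing logic: the function names are made up for this example, the weekly token totals are hypothetical, and the cache-hit price of 0.10 is a placeholder rather than a published figure. Only the 0.435 cache-miss price comes from this post.&lt;/p&gt;

```python
def cache_hit_rate(hit_tokens: int, miss_tokens: int) -> float:
    """Fraction of prompt tokens that were served from cache."""
    total = hit_tokens + miss_tokens
    return hit_tokens / total if total else 0.0


def effective_input_price(hit_rate: float, hit_price: float, miss_price: float) -> float:
    """Blended USD per 1M input tokens at a given cache hit rate."""
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# Sums of prompt_cache_hit_tokens / prompt_cache_miss_tokens over a week
# of calls (hypothetical numbers).
rate = cache_hit_rate(hit_tokens=6_000_000, miss_tokens=4_000_000)

# 0.435 is the V4 Pro cache-miss price quoted in this post;
# 0.10 is a placeholder cache-hit price, not a published figure.
blended = effective_input_price(rate, hit_price=0.10, miss_price=0.435)
print(f"hit rate {rate:.0%}, blended input price ${blended:.3f}/M")
```

&lt;p&gt;Swap in your own token totals and the cache-hit price from your provider's pricing page before comparing the result with any published savings claim.&lt;/p&gt;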

&lt;p&gt;The sticker price is a number on a tag. The bill is a receipt that unspools downward. Both are real; only one of them gets paid.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Methodology footnote.&lt;/strong&gt; Benchmark task: TypeScript refactor + add Vitest tests. 3 runs per model, median reported, no Vitest gate applied. Routed through &lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt; on 2026-06-12, &lt;code&gt;temperature=0.2&lt;/code&gt;, &lt;code&gt;max_tokens=8192&lt;/code&gt;. Finish reasons from raw runs: V4 Pro &lt;code&gt;stop&lt;/code&gt; × 3; Sonnet 4.6 &lt;code&gt;stop&lt;/code&gt; × 3; GPT-5.5 &lt;code&gt;length&lt;/code&gt; × 2 (truncated at 8,192) + &lt;code&gt;stop&lt;/code&gt; × 1 (at 7,961 tokens, hugging the cap). Cost computed using each provider's published sticker price (USD per 1M tokens): V4 Pro $0.435 input cache-miss / $0.87 output (verified at &lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;api-docs.deepseek.com/quick_start/pricing&lt;/a&gt; 2026-06-12); GPT-5.5 $5 input / $30 output (verified at &lt;a href="https://ofox.ai/models/openai" rel="noopener noreferrer"&gt;ofox.ai/models/openai&lt;/a&gt; 2026-06-12); Sonnet 4.6 $3 input / $15 output (verified at &lt;a href="https://ofox.ai/models/anthropic" rel="noopener noreferrer"&gt;ofox.ai/models/anthropic&lt;/a&gt; 2026-06-12). Cached-token columns were 0 across all runs (cold prompts). The prompt body and per-run breakdown are in the post so you can repeat the same probe on your own key. &lt;strong&gt;Known limits&lt;/strong&gt; (see in-post section): GPT-5.5 cap hit, no quality gate, n=3, single task.&lt;/p&gt;
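&lt;p&gt;For readers repeating the probe, the per-run cost calculation can be sketched as below. The prices are the sticker prices quoted in the footnote; the dictionary keys are labels chosen for this example, not real model IDs, and the token counts in the sample call are hypothetical.&lt;/p&gt;

```python
# Published sticker prices quoted in the footnote, USD per 1M tokens.
# The keys are labels for this sketch, not API model IDs.
PRICES = {
    "v4-pro": {"input": 0.435, "output": 0.87},
    "gpt-5.5": {"input": 5.0, "output": 30.0},
    "sonnet-4.6": {"input": 3.0, "output": 15.0},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one cold (uncached) run at sticker price."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical run: 2,000 prompt tokens, output truncated at the 8,192 cap.
print(f"${run_cost('gpt-5.5', 2_000, 8_192):.4f}")
```

&lt;p&gt;Because output tokens carry the higher price on all three models, a verbose completion moves the result far more than a long prompt does.&lt;/p&gt;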




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/deepseek-v4-pro-real-cost-cache-miss-thinking-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>costoptimization</category>
      <category>apipricing</category>
    </item>
    <item>
      <title>Claude Code Safe Mode: 5 Things Disabled + When to Use Over /clear (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Thu, 11 Jun 2026 06:59:53 +0000</pubDate>
      <link>https://dev.to/owen_fox/claude-code-safe-mode-5-things-disabled-when-to-use-over-clear-2026-5an6</link>
      <guid>https://dev.to/owen_fox/claude-code-safe-mode-5-things-disabled-when-to-use-over-clear-2026-5an6</guid>
      <description>&lt;p&gt;You upgrade to Claude Code v2.1.169. The first thing you notice is that your old &lt;code&gt;/clear&lt;/code&gt; reflex no longer fixes everything. CLAUDE.md is stale, a plugin is shadowing a real command, your MCP server stopped responding, and a hook is silently rewriting every diff. Safe Mode is the kill switch you reach for when &lt;code&gt;/clear&lt;/code&gt; only wipes the conversation but the problem lives in your config.&lt;/p&gt;

&lt;p&gt;This guide covers the five things &lt;code&gt;--safe-mode&lt;/code&gt; actually disables, exactly when to reach for it instead of &lt;code&gt;/clear&lt;/code&gt;, and how to enable it in under 10 seconds on macOS, Windows, and WSL — with a triage workflow you can hand a teammate without explanation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Do After Enabling Safe Mode (And What You Can't)
&lt;/h2&gt;

&lt;p&gt;Before you run the flag, know which problems it solves and which it leaves on the floor.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;th&gt;Safe Mode&lt;/th&gt;
&lt;th&gt;&lt;code&gt;/clear&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Reinstall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disable a broken CLAUDE.md&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disable a misbehaving plugin&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disable hooks rewriting tool output&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disable a hung MCP server&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hide user-added skills from the model&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wipe runaway conversation context&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset API credentials&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset the model used&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset slash-command history&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Take less than 10 seconds&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Decision in one sentence&lt;/strong&gt;: if &lt;code&gt;/clear&lt;/code&gt; does not fix it within one or two turns, reach for &lt;code&gt;--safe-mode&lt;/code&gt; before you start uninstalling things.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Frame: When to Use Safe Mode (and When NOT)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When to use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Claude Code starts behaving differently after upgrading from v2.1.16x to v2.1.169. The upgrade is the only variable, so disable your customizations to check whether the new version itself or its interaction with your config is responsible.&lt;/li&gt;
&lt;li&gt;  A coworker reports "it works for me, breaks for you" — strip your CLAUDE.md and plugins, see if the bug survives.&lt;/li&gt;
&lt;li&gt;  A plugin or MCP server installed in the last 24 hours starts blocking, throwing, or producing wrong tool output — isolate before you uninstall.&lt;/li&gt;
&lt;li&gt;  Hooks are silently mutating tool calls and you are not sure which one — turn them all off in one shot.&lt;/li&gt;
&lt;li&gt;  You inherit a workstation set up by someone else and want to see the "stock" CLI behavior before opting into anyone's config.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When NOT to use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  You only need to wipe context — use &lt;code&gt;/clear&lt;/code&gt;, do not relaunch.&lt;/li&gt;
&lt;li&gt;  You only need to swap models — set &lt;code&gt;ANTHROPIC_MODEL&lt;/code&gt; or use the model picker, not safe mode.&lt;/li&gt;
&lt;li&gt;  You suspect a credential issue — safe mode does not touch &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; or &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;; check those first.&lt;/li&gt;
&lt;li&gt;  You want to disable one plugin out of five — edit your settings or the plugin directly; safe mode is all-or-nothing.&lt;/li&gt;
&lt;li&gt;  You are mid-task with valuable conversation state — safe mode launches a fresh session, so any work-in-progress context is gone.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stop rule
&lt;/h3&gt;

&lt;p&gt;If safe mode reproduces the same bug as a normal launch, the problem is &lt;strong&gt;not&lt;/strong&gt; in your customizations. Stop reading this guide, do not waste time pruning plugins, and start looking at the model, network, account, or upstream API.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;--safe-mode&lt;/code&gt; Disables: The Five Layers (Verbatim from v2.1.169)
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/anthropics/claude-code/releases/tag/v2.1.169" rel="noopener noreferrer"&gt;official v2.1.169 release notes&lt;/a&gt; describe the flag in one sentence: "start Claude Code with all customizations (CLAUDE.md, plugins, skills, hooks, MCP servers) disabled for troubleshooting." Each of those five layers maps to a distinct failure mode.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What gets disabled&lt;/th&gt;
&lt;th&gt;Most common failure it fixes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;CLAUDE.md&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Project root + parent + &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; global&lt;/td&gt;
&lt;td&gt;Stale instructions overriding the user prompt; contradictory team rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Plugins&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All entries in your plugin directory, including marketplace plugins&lt;/td&gt;
&lt;td&gt;Plugin shadowing a built-in command; plugin crashing on cold start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User-added skills under &lt;code&gt;~/.claude/skills/&lt;/code&gt;. To also hide built-in / bundled skills, combine with &lt;code&gt;CLAUDE_CODE_DISABLE_BUNDLED_SKILLS=1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Skill being invoked for the wrong intent; skill emitting bad tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every hook event you have registered — across all ~30 documented types (&lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;full list in the hooks reference&lt;/a&gt;) including &lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, &lt;code&gt;UserPromptSubmit&lt;/code&gt;, &lt;code&gt;Stop&lt;/code&gt;, &lt;code&gt;Notification&lt;/code&gt;, &lt;code&gt;SessionStart&lt;/code&gt;, &lt;code&gt;SessionEnd&lt;/code&gt;, &lt;code&gt;PreCompact&lt;/code&gt;, &lt;code&gt;SubagentStart&lt;/code&gt;/&lt;code&gt;Stop&lt;/code&gt;, and the rest&lt;/td&gt;
&lt;td&gt;Hook rewriting diffs or blocking tool calls silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MCP servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every MCP server, whether from settings, &lt;code&gt;--mcp-config&lt;/code&gt;, or IDE-typed configs&lt;/td&gt;
&lt;td&gt;MCP server hung on startup; MCP tool list collision; auth loop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What it does &lt;strong&gt;not&lt;/strong&gt; disable: your API key, the configured base URL (so your ofox endpoint stays live), the model name in &lt;code&gt;ANTHROPIC_MODEL&lt;/code&gt;, your conversation history file, your slash-command keybindings, or the trust dialog for the current project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: CLAUDE.md — the silent system prompt extender
&lt;/h3&gt;

&lt;p&gt;CLAUDE.md is loaded from three kinds of location on every session start: the project root, every parent directory walked toward &lt;code&gt;/&lt;/code&gt;, and your global &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;. Whatever you write there is injected ahead of every user prompt. When safe mode is on, the model behaves as if none of those files exist. This is the first layer to suspect when "the same prompt produces different output today than yesterday": someone (or you) likely added a new rule in CLAUDE.md that pulls behavior in an unexpected direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Plugins — the third-party extension surface
&lt;/h3&gt;

&lt;p&gt;Plugins live in &lt;code&gt;~/.claude/plugins/&lt;/code&gt; (or your configured plugin directory) and can register slash commands, intercept tool calls, expose new MCP servers, or modify the prompt. Safe mode skips the plugin loader entirely, which means a plugin that crashes on cold start, shadows a built-in command like &lt;code&gt;/clear&lt;/code&gt;, or holds a Windows file lock will not appear at all. Plugin issues are the second most common "weird behavior after update" cause after CLAUDE.md.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Skills — both user-added and (optionally) bundled
&lt;/h3&gt;

&lt;p&gt;User-added skills under &lt;code&gt;~/.claude/skills/&lt;/code&gt; are disabled by safe mode automatically. Bundled skills, workflows, and built-in slash commands are &lt;strong&gt;not&lt;/strong&gt; disabled by safe mode alone — they require the separate &lt;code&gt;CLAUDE_CODE_DISABLE_BUNDLED_SKILLS=1&lt;/code&gt; env var or &lt;code&gt;disableBundledSkills: true&lt;/code&gt; setting (also new in v2.1.169). For most triage you want only safe mode; reach for the bundled-skills flag when you suspect a built-in skill is being chosen over your own logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Hooks — the silent tool-call mutators
&lt;/h3&gt;

&lt;p&gt;Hooks fire on dozens of events — &lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, &lt;code&gt;UserPromptSubmit&lt;/code&gt;, &lt;code&gt;Stop&lt;/code&gt;, &lt;code&gt;Notification&lt;/code&gt;, &lt;code&gt;SessionStart&lt;/code&gt;, &lt;code&gt;SessionEnd&lt;/code&gt;, &lt;code&gt;PreCompact&lt;/code&gt;, &lt;code&gt;SubagentStart&lt;/code&gt;/&lt;code&gt;Stop&lt;/code&gt;, and more — covering tool calls, session lifecycle, and async events. A hook can rewrite the model's tool input, block a tool from executing, swallow output, or just print extra chatter. They are the hardest layer to debug manually because their effect is silent and out-of-band. Safe mode disables every hook in one shot — usually the fastest way to confirm "is a hook eating my diffs?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: MCP servers — the longest-tail surface
&lt;/h3&gt;

&lt;p&gt;MCP servers come from your settings file, &lt;code&gt;--mcp-config&lt;/code&gt; flags, IDE-typed configs, and v2.1.169 enterprise-managed &lt;code&gt;allowedMcpServers&lt;/code&gt;/&lt;code&gt;deniedMcpServers&lt;/code&gt; policies. They can hang on cold start, collide on tool names, throw auth loops, or simply respond too slowly to be useful. Disabling all of them in one flag lets you ask "would this session work with zero MCP at all?" — a question that is otherwise expensive to answer because most users have between three and ten MCP servers configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Claude Code ≥ v2.1.169&lt;/strong&gt; (June 8, 2026). Verify with &lt;code&gt;claude --version&lt;/code&gt;. If you see &lt;code&gt;2.1.168&lt;/code&gt; or below, upgrade first: &lt;code&gt;npm install -g @anthropic-ai/claude-code&lt;/code&gt; for the npm install, or run the official installer.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;A terminal you can set env vars in&lt;/strong&gt;: zsh, bash, fish, PowerShell, CMD, or WSL all work.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Existing API credentials&lt;/strong&gt;: &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; (or your ofox-style key under the same name) must still be configured. Safe mode disables customizations, not auth.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;(Optional) Write access to your shell rc&lt;/strong&gt;: if you want to persist the flag for a triage session, you will set &lt;code&gt;CLAUDE_CODE_SAFE_MODE=1&lt;/code&gt; in &lt;code&gt;~/.zshrc&lt;/code&gt;, &lt;code&gt;~/.bashrc&lt;/code&gt;, or PowerShell &lt;code&gt;$PROFILE&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-Step: Enable Claude Code Safe Mode
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Confirm your version is v2.1.169 or newer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;claude --version
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Expect: 2.1.169 &lt;span class="o"&gt;(&lt;/span&gt;or higher&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it returns an older version, run the upgrade matching how you installed it, then relaunch your terminal to make sure the new binary is on your &lt;code&gt;PATH&lt;/code&gt;.&lt;/p&gt;
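&lt;p&gt;If you want to script this check, compare the version numerically rather than as a string, since a plain string comparison would rank &lt;code&gt;2.1.99&lt;/code&gt; above &lt;code&gt;2.1.169&lt;/code&gt;. The sketch below assumes only that the output of &lt;code&gt;claude --version&lt;/code&gt; contains a dotted &lt;code&gt;x.y.z&lt;/code&gt; triple somewhere; the helper names are hypothetical.&lt;/p&gt;

```python
import re

MIN_VERSION = (2, 1, 169)  # first release with --safe-mode, per this guide


def parse_version(text: str) -> tuple:
    """Extract the first x.y.z triple from version output as integers."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_safe_mode(version_output: str) -> bool:
    """True when the reported version is at least MIN_VERSION."""
    return parse_version(version_output) >= MIN_VERSION


print(supports_safe_mode("2.1.168"), supports_safe_mode("2.1.169"))
```

&lt;p&gt;In a real script you would feed it the captured output of &lt;code&gt;claude --version&lt;/code&gt;, for example via &lt;code&gt;subprocess.run&lt;/code&gt; with &lt;code&gt;capture_output=True&lt;/code&gt;.&lt;/p&gt;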

&lt;h3&gt;
  
  
  Step 2: Launch with the flag (one-shot)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;macOS / Linux / WSL&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;claude --safe-mode
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows PowerShell&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;claude --safe-mode
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows CMD&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;claude --safe-mode
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flag is identical in every shell; nothing about the invocation changes between platforms. The session that opens will skip every CLAUDE.md, plugin, user-added skill, hook, and MCP server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: (Alternative) Persist with the env var for a triage session
&lt;/h3&gt;

&lt;p&gt;When you want every &lt;code&gt;claude&lt;/code&gt; invocation in a shell to be safe — useful when you are debugging across multiple panes or repos:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS / Linux / WSL&lt;/strong&gt; (bash, zsh):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_SAFE_MODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows PowerShell&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_SAFE_MODE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;claude&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Windows CMD&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;&lt;span class="kd"&gt;set&lt;/span&gt; &lt;span class="kd"&gt;CLAUDE_CODE_SAFE_MODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="kd"&gt;claude&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



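&lt;p&gt;Verify the env var is live before you launch (bash/zsh syntax shown; in PowerShell use &lt;code&gt;echo $env:CLAUDE_CODE_SAFE_MODE&lt;/code&gt;, in CMD use &lt;code&gt;echo %CLAUDE_CODE_SAFE_MODE%&lt;/code&gt;):&lt;br&gt;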
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;echo $&lt;/span&gt;CLAUDE_CODE_SAFE_MODE
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Expect: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Confirm safe mode is active
&lt;/h3&gt;

&lt;p&gt;Once in the session, type a prompt that should trigger a project rule from your CLAUDE.md. If the rule is ignored, safe mode is on. You can also list your loaded MCP servers — the list should be empty.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Exit safe mode cleanly
&lt;/h3&gt;

&lt;p&gt;For a one-shot launch, just &lt;code&gt;/quit&lt;/code&gt; and start &lt;code&gt;claude&lt;/code&gt; again without the flag.&lt;/p&gt;

&lt;p&gt;For the env-var path, unset it before you relaunch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS / Linux / WSL&lt;/span&gt;
&lt;span class="nb"&gt;unset &lt;/span&gt;CLAUDE_CODE_SAFE_MODE

&lt;span class="c"&gt;# PowerShell&lt;/span&gt;
Remove-Item Env:CLAUDE_CODE_SAFE_MODE

&lt;span class="c"&gt;# CMD&lt;/span&gt;
&lt;span class="nb"&gt;set &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_SAFE_MODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then check your rc files (&lt;code&gt;~/.zshrc&lt;/code&gt;, &lt;code&gt;~/.bashrc&lt;/code&gt;, &lt;code&gt;$PROFILE&lt;/code&gt;) for a stale &lt;code&gt;export&lt;/code&gt; that would re-arm safe mode on the next shell. This is the single most common reason people report "Claude Code still ignores my CLAUDE.md" days after they thought they had turned safe mode off.&lt;/p&gt;
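&lt;p&gt;The rc-file check can be automated with a small scan. This is a sketch under stated assumptions: it looks only for bash/zsh-style and PowerShell-style assignments of &lt;code&gt;CLAUDE_CODE_SAFE_MODE&lt;/code&gt;, it skips commented-out lines, and the function name is hypothetical.&lt;/p&gt;

```python
import re

# Matches `export CLAUDE_CODE_SAFE_MODE=...` (export optional) and
# `$env:CLAUDE_CODE_SAFE_MODE = ...`. Lines starting with `#` do not match.
# Case-insensitive because PowerShell variable names are.
ASSIGNMENT = re.compile(
    r"^\s*(?:export\s+|\$env:)?CLAUDE_CODE_SAFE_MODE\s*=", re.IGNORECASE
)


def stale_assignments(rc_text: str) -> list:
    """Return (line_number, line) pairs that would re-arm safe mode."""
    return [
        (number, line.strip())
        for number, line in enumerate(rc_text.splitlines(), start=1)
        if ASSIGNMENT.match(line)
    ]


# Example with inline text; in practice pass the contents of ~/.zshrc,
# ~/.bashrc, or your PowerShell profile.
print(stale_assignments("alias ll='ls -l'\nexport CLAUDE_CODE_SAFE_MODE=1\n"))
```

&lt;p&gt;An empty list means the file will not switch safe mode back on in the next shell.&lt;/p&gt;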

&lt;h3&gt;
  
  
  Step 6: The 3-step triage workflow
&lt;/h3&gt;

&lt;p&gt;Once safe mode confirms the bug is in your customizations, narrow it down with a binary-search pattern instead of un-installing everything at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Relaunch normally&lt;/strong&gt;, keep &lt;code&gt;CLAUDE.md&lt;/code&gt; and plugins, manually disable hooks and MCP via settings. Reproduce? If yes, the culprit is in CLAUDE.md or a plugin — usually the larger and more recent layer.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Re-enable hooks&lt;/strong&gt;, keep MCP off. Reproduce? If yes, the culprit is a hook. Disable hooks one at a time; v2.1.169 hooks are evaluated in load order, so disable the most recently added hook first.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Re-enable MCP servers one by one&lt;/strong&gt;. The one that reintroduces the bug is your suspect. Restart the session between each MCP toggle — MCP servers cache their tool list at session start, so live toggling without a restart can give you a false negative.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You almost never need step 3 to finish — the bug usually surfaces by step 1 or 2. Step 3 only matters when you run many MCP servers, which is increasingly common in larger teams.&lt;/p&gt;
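&lt;p&gt;The elimination logic of steps 1 and 2 can be written down as a tiny decision function, which is handy if you want to put the workflow in a runbook. The function and argument names are hypothetical; each argument simply records whether the bug reproduced in that configuration.&lt;/p&gt;

```python
def suspect_layer(repro_hooks_and_mcp_off: bool, repro_hooks_on_mcp_off: bool) -> str:
    """Map the outcomes of steps 1 and 2 to the layer to inspect next."""
    if repro_hooks_and_mcp_off:
        # Hooks and MCP were off, so only CLAUDE.md and plugins remain.
        return "CLAUDE.md or plugin"
    if repro_hooks_on_mcp_off:
        # The bug came back once hooks were re-enabled; MCP is still off.
        return "hook"
    # Neither configuration reproduced: re-enable MCP servers one by one.
    return "MCP server"


print(suspect_layer(repro_hooks_and_mcp_off=False, repro_hooks_on_mcp_off=True))
```

&lt;p&gt;The order of the checks mirrors the workflow: a reproduction in step 1 ends the search before hooks are ever re-enabled.&lt;/p&gt;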

&lt;h3&gt;
  
  
  Step 7: Capture a clean repro for the bug tracker
&lt;/h3&gt;

&lt;p&gt;After you have identified the offending layer, capture two runs: one in safe mode, and one normal launch with only the &lt;em&gt;single&lt;/em&gt; offending customization enabled. This isolates the bug to one customization and gives you a tight repro to file. The repro should fit in two lines: "I added &lt;code&gt;~/.claude/plugins/foo/index.ts&lt;/code&gt;, with safe mode I get behavior A, with that plugin re-enabled I get behavior B." This is the format the Claude Code maintainers respond to fastest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Errors During Safe Mode Setup (and Fixes)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Root cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Unknown flag: --safe-mode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude Code is on v2.1.168 or older&lt;/td&gt;
&lt;td&gt;Upgrade to v2.1.169+, relaunch terminal so &lt;code&gt;PATH&lt;/code&gt; picks up the new binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safe mode "doesn't seem to do anything" — CLAUDE.md still applies&lt;/td&gt;
&lt;td&gt;Two &lt;code&gt;claude&lt;/code&gt; binaries on &lt;code&gt;PATH&lt;/code&gt;; older one wins&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;which claude&lt;/code&gt; (Mac/Linux) or &lt;code&gt;where claude&lt;/code&gt; (Windows); remove or rename the older copy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md still loads&lt;/code&gt; despite the flag&lt;/td&gt;
&lt;td&gt;You confused &lt;code&gt;--safe-mode&lt;/code&gt; with &lt;code&gt;/clear&lt;/code&gt;; you ran &lt;code&gt;/clear&lt;/code&gt; inside an already-loaded session&lt;/td&gt;
&lt;td&gt;Quit the session entirely (&lt;code&gt;Ctrl-D&lt;/code&gt; or &lt;code&gt;/quit&lt;/code&gt;), then relaunch with the flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP servers still appear in the slash-command list&lt;/td&gt;
&lt;td&gt;Stale plugin cache on Windows (a known v2.1.169 fix for MCPB plugin cache)&lt;/td&gt;
&lt;td&gt;Quit, delete the MCPB plugin cache directory, relaunch with &lt;code&gt;--safe-mode&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;claude -p&lt;/code&gt; hangs forever on Windows after enabling safe mode&lt;/td&gt;
&lt;td&gt;Unrelated regression already fixed in v2.1.169 (skill-scan stall)&lt;/td&gt;
&lt;td&gt;Confirm you are on &lt;code&gt;&amp;gt;=2.1.169&lt;/code&gt;; if so, file an issue with &lt;code&gt;--verbose&lt;/code&gt; output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE_CODE_SAFE_MODE&lt;/code&gt; has no effect because the assignment has spaces around &lt;code&gt;=&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Bash treats &lt;code&gt;CLAUDE_CODE_SAFE_MODE =1&lt;/code&gt; as a command, not an assignment&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;export CLAUDE_CODE_SAFE_MODE=1&lt;/code&gt; with no spaces — strict, no shortcuts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safe mode disables your hook that does mandatory pre-commit signing — now your commits are unsigned&lt;/td&gt;
&lt;td&gt;Working as designed; safe mode is for diagnosis, not for everyday work&lt;/td&gt;
&lt;td&gt;Exit safe mode before any commit you intend to push; reserve safe mode for read-only triage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise managed &lt;code&gt;allowedMcpServers&lt;/code&gt; list still seems "active"&lt;/td&gt;
&lt;td&gt;Safe mode disables MCP servers, not the policy schema parser; policy text remains in settings&lt;/td&gt;
&lt;td&gt;Treat the policy file as dormant in safe mode — it is parsed but no MCP servers run&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
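&lt;p&gt;For the "two binaries on &lt;code&gt;PATH&lt;/code&gt;" row, &lt;code&gt;which&lt;/code&gt; shows only the first match. A short Python sketch can list every copy in lookup order instead. It assumes a POSIX-style layout where the file is literally named &lt;code&gt;claude&lt;/code&gt;; on Windows you would pass the name with its extension. The function name is hypothetical.&lt;/p&gt;

```python
import os


def executables_on_path(name: str, path_value: str) -> list:
    """List every executable called `name` across PATH, in lookup order."""
    found = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            found.append(candidate)
    return found


copies = executables_on_path("claude", os.environ.get("PATH", ""))
print(f"{len(copies)} copy/copies found:", copies)
```

&lt;p&gt;More than one entry means the first path in the list is the one your shell runs, so check its version before blaming safe mode.&lt;/p&gt;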

&lt;h2&gt;
  
  
  Team / Multi-Developer Configuration
&lt;/h2&gt;

&lt;p&gt;Safe mode is the cleanest reproducibility primitive your team has when a bug report comes in. Wire it into your runbook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared triage alias
&lt;/h3&gt;

&lt;p&gt;Drop a one-liner into your team's dotfiles repo so every engineer has the same entry point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~/.zshrc or ~/.bashrc, committed to the team's dotfiles repo&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;claude-triage&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'CLAUDE_CODE_SAFE_MODE=1 claude'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a teammate reports a Claude Code bug, the first request from the on-call is "run &lt;code&gt;claude-triage&lt;/code&gt;, paste the repro." That is one command, no context, no version drift.&lt;/p&gt;

&lt;h3&gt;
  
  
  Triage matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug surface&lt;/th&gt;
&lt;th&gt;Reporter runs&lt;/th&gt;
&lt;th&gt;Expected if bug is in their config&lt;/th&gt;
&lt;th&gt;Expected if bug is upstream&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool output looks rewritten&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude-triage&lt;/code&gt; then repro&lt;/td&gt;
&lt;td&gt;Bug disappears → hook is the culprit&lt;/td&gt;
&lt;td&gt;Bug persists → model / API issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slash command shadowed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude-triage&lt;/code&gt; then &lt;code&gt;/help&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Built-in command reappears → plugin is shadowing&lt;/td&gt;
&lt;td&gt;Command still missing → CLI bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md rules ignored&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude-triage&lt;/code&gt; then ask a question only the rule would answer&lt;/td&gt;
&lt;td&gt;Behavior identical → your CLAUDE.md was unreachable anyway (path issue)&lt;/td&gt;
&lt;td&gt;Behavior changes in safe mode → CLAUDE.md was loading but being overridden&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP tool returns errors&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude-triage&lt;/code&gt; then attempt task&lt;/td&gt;
&lt;td&gt;Task succeeds without MCP → server is broken&lt;/td&gt;
&lt;td&gt;Task fails the same way → not an MCP issue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  CI / scripts: don't ship safe mode
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;--safe-mode&lt;/code&gt; is a diagnostic flag, not a deploy target. Do &lt;strong&gt;not&lt;/strong&gt; add it to your CI invocations of &lt;code&gt;claude -p&lt;/code&gt; or &lt;code&gt;claude agents&lt;/code&gt; — you will silently lose your team's CLAUDE.md, hooks, and any MCP integrations your CI relies on. If you suspect CI is hitting a customization bug, run safe mode locally to confirm, then patch the offending customization. A &lt;code&gt;--safe-mode&lt;/code&gt; line in a CI YAML file is an outage waiting to happen — keep the flag where humans can see it.&lt;/p&gt;
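&lt;p&gt;One way to enforce this is a small guard that scans CI definitions for the flag or the env var. The sketch below is an assumption about how a team might do it, not a Claude Code feature: the function name is hypothetical, and it only looks at &lt;code&gt;*.yml&lt;/code&gt; and &lt;code&gt;*.yaml&lt;/code&gt; files under a directory you choose.&lt;/p&gt;

```python
from pathlib import Path

MARKERS = ("--safe-mode", "CLAUDE_CODE_SAFE_MODE")


def safe_mode_hits(root: Path, patterns=("*.yml", "*.yaml")) -> list:
    """Return (path, line_number) for every CI line that mentions safe mode."""
    hits = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
            for number, line in enumerate(lines, start=1):
                if any(marker in line for marker in MARKERS):
                    hits.append((path, number))
    return hits


# Example: scan the current directory tree and report the findings.
for path, number in safe_mode_hits(Path(".")):
    print(f"{path}:{number}: safe mode referenced in CI config")
```

&lt;p&gt;Run as a pre-merge check, a non-empty result is a reason to fail the build and ask why the flag is there.&lt;/p&gt;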

&lt;h3&gt;
  
  
  Onboarding new hires
&lt;/h3&gt;

&lt;p&gt;For day-one onboarding, give the new engineer a 5-minute exercise: run &lt;code&gt;claude-triage&lt;/code&gt;, ask the same question they ask in normal mode, observe the difference. They walk away knowing what their team's customizations actually do — instead of treating CLAUDE.md as magic that just happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: Pair Safe Mode with &lt;code&gt;disableBundledSkills&lt;/code&gt; and &lt;code&gt;/cd&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;v2.1.169 shipped three related controls, and they compose well.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;--safe-mode&lt;/code&gt; + &lt;code&gt;CLAUDE_CODE_DISABLE_BUNDLED_SKILLS=1&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;--safe-mode&lt;/code&gt; disables &lt;strong&gt;user-added&lt;/strong&gt; skills under &lt;code&gt;~/.claude/skills/&lt;/code&gt;. To also hide the built-in / bundled skills, workflows, and built-in slash commands from the model, set &lt;code&gt;CLAUDE_CODE_DISABLE_BUNDLED_SKILLS=1&lt;/code&gt; (or &lt;code&gt;disableBundledSkills: true&lt;/code&gt; in your settings). Combined, the model sees a CLI with nothing on top — useful when you want to compare "stock model behavior" against your customized behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_SAFE_MODE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_BUNDLED_SKILLS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Safe mode + &lt;code&gt;/cd&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;v2.1.169 also added &lt;code&gt;/cd&lt;/code&gt; to switch the session's working directory without breaking the prompt cache mid-session. Combine the two when you triage across repos: launch with &lt;code&gt;--safe-mode&lt;/code&gt; once, then &lt;code&gt;/cd&lt;/code&gt; between repositories. You only pay the cold-start cost once and your triage stays on the same model context window.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safe mode + an ofox-style upstream
&lt;/h3&gt;

&lt;p&gt;If you route Claude Code through a multi-provider gateway like &lt;a href="https://ofox.ai/docs" rel="noopener noreferrer"&gt;ofox&lt;/a&gt; — using &lt;code&gt;ANTHROPIC_BASE_URL=https://api.ofox.ai/anthropic&lt;/code&gt; — safe mode keeps the routing intact while still neutralizing every local customization. That separation is the whole point: you can rule out local config without touching your upstream credentials. Safe mode is the one CLI flag that lets you ask "is it me or is it everything I added on top?" — and get an honest answer in under ten seconds.&lt;/p&gt;

&lt;p&gt;For deeper config patterns once you are past triage, see the &lt;a href="https://ofox.ai/blog/claude-code-ofoxai-configuration-guide-2026/" rel="noopener noreferrer"&gt;Claude Code ofox configuration guide&lt;/a&gt;, the &lt;a href="https://ofox.ai/blog/claude-code-hooks-subagents-skills-complete-guide-2026/" rel="noopener noreferrer"&gt;Claude Code hooks, subagents, and skills guide&lt;/a&gt;, and the &lt;a href="https://ofox.ai/blog/claude-code-safety-prevent-accidental-file-deletion/" rel="noopener noreferrer"&gt;Claude Code safety guide for preventing accidental file deletion&lt;/a&gt; — all three lean on the customization layers safe mode disables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related Reading on Claude Code Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  For the token-cost angle, see the &lt;a href="https://ofox.ai/blog/claude-code-token-optimization-2026/" rel="noopener noreferrer"&gt;Claude Code token optimization guide&lt;/a&gt;: the customizations safe mode disables are the same ones that drive token cost.&lt;/li&gt;
&lt;li&gt;  If you are still picking between CLIs, the &lt;a href="https://ofox.ai/blog/claude-code-vs-codex-cli-vs-cursor-vs-deepseek-tui-2026/" rel="noopener noreferrer"&gt;Claude Code vs Codex vs Cursor vs DeepSeek TUI comparison&lt;/a&gt; lays out which has a comparable triage flag.&lt;/li&gt;
&lt;li&gt;  For the "Go to Sleep" / model-stops bug that often gets confused with a customization failure, see the &lt;a href="https://ofox.ai/blog/claude-go-to-sleep-bug-explained-2026/" rel="noopener noreferrer"&gt;Claude "Go to Sleep" bug explainer&lt;/a&gt; — safe mode does not fix model-side issues.&lt;/li&gt;
&lt;li&gt;  For multi-tool setups, the &lt;a href="https://ofox.ai/blog/cursor-claude-code-cline-custom-api-setup-2026/" rel="noopener noreferrer"&gt;Cursor + Claude Code + Cline custom API setup guide&lt;/a&gt; shows how the same &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; pattern survives safe mode.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/claude-code-safe-mode-guide-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>tutorial</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>Claude Fable 5 vs Opus 4.8 vs GPT-5.5: SWE-Bench, Pricing, When to Switch</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Wed, 10 Jun 2026 02:08:07 +0000</pubDate>
      <link>https://dev.to/owen_fox/claude-fable-5-vs-opus-48-vs-gpt-55-swe-bench-pricing-when-to-switch-2656</link>
      <guid>https://dev.to/owen_fox/claude-fable-5-vs-opus-48-vs-gpt-55-swe-bench-pricing-when-to-switch-2656</guid>
      <description>&lt;h1&gt;
  
  
  Claude Fable 5 vs Opus 4.8 vs GPT-5.5: SWE-Bench, Pricing, When to Switch
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Anthropic shipped Claude Fable 5 on June 9, 2026, its first publicly available Mythos-class model. It hits &lt;strong&gt;95.0% on SWE-bench Verified&lt;/strong&gt; and &lt;strong&gt;80.3% on SWE-bench Pro&lt;/strong&gt; — an 11-point lead over Opus 4.8 and 21.7 points clear of GPT-5.5. Pricing is &lt;strong&gt;$10/$50 per million tokens&lt;/strong&gt;, exactly 2x Opus 4.8. GPT-5.5 still wins Terminal-Bench 2.1 (82.7% vs 80.5%), Opus 4.8 still owns long-context retrieval and price-performance, and the upgrade math turns on whether your bottleneck is capability or bill. Below: the real numbers, the cost-per-point math, and a decision tree you can apply today.&lt;/p&gt;

&lt;p&gt;Fable 5 is the first publicly available model to clear 80% on SWE-bench Pro and 95% on Verified — but at $10/$50 per million tokens, the cost per SWE-bench point runs 72% higher than Opus 4.8.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Each Model Actually Shipped
&lt;/h2&gt;

&lt;p&gt;Three releases over seven weeks reset the top of the coding leaderboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; dropped on April 23, 2026 as OpenAI's single flagship — no Standard/Pro split for capability, just two surfaces (GPT-5.5 and GPT-5.5 Pro) for cost and latency. The launch leaned on Codex CLI and computer use; "agentic coding" was the headline. GPT-5.5 Instant followed on May 5 as the default model in ChatGPT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; landed on May 28, 2026 at the same $5/$25 price as 4.7. SWE-bench Pro jumped from 64.3% to 69.2%, OSWorld-Verified to 83.4%, and Artificial Analysis's independent GDPval-AA leaderboard put it 121 Elo points clear of GPT-5.5 on real economic work — using 35% fewer output tokens per task than 4.7. Same price, higher score, lower bill. We covered the full release in our Opus 4.8 review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; shipped on June 9, 2026 (yesterday, as of this writing). It's Anthropic's first generally available model from the Mythos class, the family the company previously held back because it judged the cybersecurity capabilities too risky for broad release. Fable 5 is the Mythos model with three safety classifiers layered on top: when a query hits cybersecurity, biology/chemistry, or distillation patterns, the request automatically routes to Opus 4.8 instead. Pricing is $10/$50 per million tokens, half of what Anthropic charged for Mythos Preview but still 2x Opus 4.8.&lt;/p&gt;

&lt;p&gt;The headline isn't that Anthropic shipped two models in two weeks. It's that the gap between &lt;em&gt;capability leader&lt;/em&gt; and &lt;em&gt;value leader&lt;/em&gt; widened — and they're now both Claude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The SWE-Bench Picture, Side by Side
&lt;/h2&gt;

&lt;p&gt;Coding benchmarks are noisy. SWE-bench Verified and SWE-bench Pro are the two that matter most for production decisions because they run against real GitHub issues end-to-end, with a maintainer-graded ground truth. Here's how the three line up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Fable 5&lt;/th&gt;
&lt;th&gt;Opus 4.8&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;88.6%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Pro&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.3%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;td&gt;58.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Terminal-Bench 2.1&lt;/td&gt;
&lt;td&gt;80.5%&lt;/td&gt;
&lt;td&gt;74.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FrontierCode Diamond&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Leader&lt;/strong&gt; (5x GPT-5.5, 2x Opus)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Every Senior Engineer (/100)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;91&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GraphWalks BFS @ 1M tokens&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;68.1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSWorld-Verified&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;td&gt;78.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GDPval-AA (Elo, real work)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1890&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1769&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things in that table are worth more than the headline numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every's Senior Engineer benchmark is the cleanest read on capability ceiling.&lt;/strong&gt; Every runs it on the hardest coding problems they can write — the kind a senior engineer would take a working day to solve. Fable 5 at 91/100 lands in the range of the human engineers who've taken the test. Opus 4.8 at 63 and GPT-5.5 at 62 are essentially tied, and both sit in the "junior engineer with debugger" range. The 28-point gap between Fable 5 and Opus 4.8 on this test is the gap that justifies the price premium — &lt;em&gt;if your work lives at that ceiling&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal-Bench is the one place GPT-5.5 still wins, and the asterisk matters.&lt;/strong&gt; GPT-5.5 hits 82.7% against Fable 5's 80.5%, a narrow but real lead. The asterisk: GPT-5.5's score comes through Codex CLI, OpenAI's strongest agentic surface for terminal work, while Fable 5's number is the model in a standard harness. GPT-5.5 has also had about seven weeks to embed itself in real Codex CLI workflows; if your stack is already Codex-centric, "switch to Fable" isn't a free upgrade. We unpack this trade-off in Codex CLI configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context retrieval is a Claude-family lead that compounded.&lt;/strong&gt; On the GraphWalks BFS benchmark at 1M tokens, Opus 4.8 hits 68.1% versus GPT-5.5's 45.4% — a 22.7-point spread that turns into "the agent actually remembers what happened on turn 12" in practice. Anthropic hasn't published Fable 5's GraphWalks score directly, but the long-context architecture is shared, so the gap to GPT-5.5 on million-token retrieval almost certainly persists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing, and What "Cost Per Benchmark Point" Actually Buys
&lt;/h2&gt;

&lt;p&gt;Sticker pricing is straightforward. The interesting number is what each model returns per dollar.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/M)&lt;/th&gt;
&lt;th&gt;Output ($/M)&lt;/th&gt;
&lt;th&gt;Blended (2:1)*&lt;/th&gt;
&lt;th&gt;Per SWE-bench Pro point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Fable 5&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$50.00&lt;/td&gt;
&lt;td&gt;$23.33&lt;/td&gt;
&lt;td&gt;~$0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.8&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;$11.67&lt;/td&gt;
&lt;td&gt;~$0.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.5&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;$13.33&lt;/td&gt;
&lt;td&gt;~$0.51&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;*Blended assumes a 2:1 input-to-output token ratio typical of coding workloads (more context in than code out). The per-point figures match the output rate divided by the SWE-bench Pro score. ofox.ai routing applies the same per-token rates with no markup.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost per SWE-bench Pro point&lt;/strong&gt; is the metric most teams should care about, because it's what your monthly invoice looks like when you scale agentic coding traffic. Fable 5's $0.62 is 72% more expensive per point than Opus 4.8's $0.36. GPT-5.5 sits between at $0.51: it loses on absolute capability to both Claudes but is cheaper per point than Fable 5.&lt;/p&gt;
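
&lt;p&gt;The arithmetic behind the pricing table can be checked in a few lines. This is a minimal sketch, not an official calculator: the rates and scores are copied from the tables above, the function names are made up for illustration, and the per-point basis (output rate divided by SWE-bench Pro score) is an assumption inferred from the published figures, which it reproduces to within a cent.&lt;/p&gt;

```python
# Rate card and scores from the tables above:
# (input $/M tokens, output $/M tokens, SWE-bench Pro %).
MODELS = {
    "Claude Fable 5": (10.00, 50.00, 80.3),
    "Claude Opus 4.8": (5.00, 25.00, 69.2),
    "GPT-5.5": (5.00, 30.00, 58.6),
}


def blended_price(input_rate, output_rate, input_parts=2, output_parts=1):
    """Weighted average $/M tokens for a given input:output token mix."""
    total_parts = input_parts + output_parts
    return (input_parts * input_rate + output_parts * output_rate) / total_parts


def output_cost_per_point(output_rate, score):
    """Output $/M divided by benchmark score (assumed basis of the last column)."""
    return output_rate / score


for name, (inp, out, score) in MODELS.items():
    print(f"{name}: blended ${blended_price(inp, out):.2f}/M, "
          f"${output_cost_per_point(out, score):.2f} per SWE-bench Pro point")
```

&lt;p&gt;Changing &lt;code&gt;input_parts&lt;/code&gt; and &lt;code&gt;output_parts&lt;/code&gt; to match your own traffic mix is the quickest way to see how sensitive the blended figure is to the 2:1 assumption.&lt;/p&gt;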

&lt;p&gt;Two adjustments push the math in Fable 5's favor before you write it off as a luxury:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fable 5 finishes the same task in fewer turns.&lt;/strong&gt; Anthropic's reported numbers, corroborated by independent runs, put Fable 5 at roughly 25–30% fewer turns than Opus 4.8 on agentic spreadsheet and codebase tasks. If your bottleneck is output token volume — common on long autonomous runs — that efficiency partially offsets the 2x rate card. Opus 4.8 already runs 35% fewer output tokens than 4.7; Fable 5 pushes that further.&lt;/p&gt;
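
&lt;p&gt;To put a rough number on "partially offsets": the sketch below multiplies the 2x rate card by the remaining share of turns. It assumes token volume scales linearly with turn count, which is a simplification, and &lt;code&gt;effective_cost_ratio&lt;/code&gt; is a hypothetical helper, not a published formula.&lt;/p&gt;

```python
def effective_cost_ratio(price_multiplier, turn_reduction):
    """Relative spend vs. the baseline model when the pricier model needs fewer turns.

    Assumes tokens billed scale linearly with the number of turns.
    """
    return price_multiplier * (1.0 - turn_reduction)


# Fable 5 costs 2x Opus 4.8 per token; the article cites 25-30% fewer turns.
for reduction in (0.25, 0.30):
    ratio = effective_cost_ratio(2.0, reduction)
    print(f"{reduction:.0%} fewer turns: {ratio:.2f}x the Opus 4.8 spend")
```

&lt;p&gt;Under that assumption the effective premium lands around 1.4x to 1.5x rather than 2x, before counting any saved human escalations.&lt;/p&gt;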

&lt;p&gt;&lt;strong&gt;The capability ceiling is real on the hardest 10–20%.&lt;/strong&gt; If your team's escalation pattern today is "Opus 4.8 hands off to a human after three failed attempts," routing those handoffs to Fable 5 instead may finish the task without the human in the loop. The cost question stops being "which model is cheaper per token" and becomes "which model removes a senior engineer from the loop." That comparison usually pays out at Fable 5's price.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Test the routing math on your own workload.&lt;/strong&gt; Through ofox.ai, one key gets you Opus 4.8 and GPT-5.5 today (Fable 5 rolling in), on a single OpenAI-compatible endpoint. Run the same prompts through all three, compare token counts and quality on &lt;em&gt;your&lt;/em&gt; workload before committing to the upgrade.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When to Switch: A Decision Tree
&lt;/h2&gt;

&lt;p&gt;The right question isn't "which model wins" — Fable 5 wins most benchmarks. The right question is "which model wins on &lt;em&gt;my&lt;/em&gt; task and bill." Here is the routing logic that maps the published numbers to a defensible choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Your primary workload is long-horizon agentic coding (multi-hour runs, codebase-wide migrations).&lt;/strong&gt; Use &lt;strong&gt;Fable 5&lt;/strong&gt;. The Senior Engineer benchmark, the FrontierCode Diamond lead, and the 25–30% turn reduction all compound on long runs. The price premium is offset by fewer wasted turns and fewer human escalations. Best AI model for coding walks through the routing patterns that work at this scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Your primary workload is terminal-driven CLI work, ops automation, or you're already on Codex CLI.&lt;/strong&gt; Use &lt;strong&gt;GPT-5.5&lt;/strong&gt;. Terminal-Bench 2.1 is the only benchmark in the table above where GPT-5.5 leads, and the gap on Codex-centric workflows is real, not benchmark noise. The 7-week head start on integration matters here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Your primary workload is everything else — refactors, code review, daily agent loops at scale.&lt;/strong&gt; Use &lt;strong&gt;Opus 4.8&lt;/strong&gt;. Same $5/$25 pricing as Opus 4.7, top of the GDPval-AA real-work leaderboard, 35% fewer output tokens than the prior generation. For 80% of teams, this is the right answer in 2026 — and it stays the right answer until your workload pushes past the capability ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. You need million-token context retrieval (legal review, codebase audits, long transcripts).&lt;/strong&gt; Use &lt;strong&gt;Opus 4.8&lt;/strong&gt; (or Fable 5 if you can absorb the price). GPT-5.5's 45.4% on GraphWalks BFS at 1M tokens is the disqualifying number — it means the model is no longer reliably finding facts past the first ~200K tokens. The Claude family architecture is the only one that holds up at that scale today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. You're hitting refusals or routing to Opus 4.8 on Fable 5.&lt;/strong&gt; Expected behavior, not a bug. Fable 5's three safety classifiers (cybersecurity, biology/chemistry, distillation attempts) trigger on ~5% of sessions per Anthropic, and the fallback is silent — the request runs on Opus 4.8 anyway. If your workload sits in any of those three areas (security research, biotech, model training pipelines), don't try to engineer around the classifier. Just call Opus 4.8 directly and skip the indirection.&lt;/p&gt;

&lt;p&gt;The one routing pattern that doesn't survive the new numbers: &lt;strong&gt;"Opus is the daily driver, GPT-5.5 for math and long context."&lt;/strong&gt; That logic was true through May. Opus 4.8 has since erased both advantages: long context (68.1% vs. 45.4% on GraphWalks BFS at 1M tokens) and math (USAMO 2026 jumped from 69.3% on Opus 4.7 to 96.7% on 4.8). If you're routing math or long-context work to GPT-5.5 today, you're paying more per output token for a worse result.&lt;/p&gt;
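
&lt;p&gt;The five rules above can be sketched as a small routing function. Treat it as one possible encoding rather than a reference implementation: the &lt;code&gt;Task&lt;/code&gt; fields and the precedence order between rules are assumptions, and &lt;code&gt;FABLE_5_MODEL_ID&lt;/code&gt; is a placeholder because the live Fable 5 ID is not given in this article.&lt;/p&gt;

```python
from dataclasses import dataclass

OPUS = "anthropic/claude-opus-4-8"  # ID given in the article
GPT = "openai/gpt-5.5"              # ID given in the article
FABLE = "FABLE_5_MODEL_ID"          # placeholder, replace with the live ID


@dataclass
class Task:
    long_horizon_agentic: bool = False   # rule 1
    terminal_or_codex: bool = False      # rule 2
    million_token_context: bool = False  # rule 4
    classifier_sensitive: bool = False   # rule 5: security, bio/chem, distillation


def pick_model(task):
    """Map a task profile to a model ID, following the five rules above."""
    if task.classifier_sensitive:
        return OPUS   # rule 5: skip the silent fallback, call Opus 4.8 directly
    if task.million_token_context:
        return OPUS   # rule 4: Opus 4.8 first, Fable 5 only if budget allows
    if task.long_horizon_agentic:
        return FABLE  # rule 1
    if task.terminal_or_codex:
        return GPT    # rule 2
    return OPUS       # rule 3: everything else


print(pick_model(Task(terminal_or_codex=True)))
```

&lt;p&gt;Checking rule 5 first is a deliberate choice in this sketch: a classifier-sensitive task would end up on Opus 4.8 anyway, so there is no point paying the Fable 5 rate for it.&lt;/p&gt;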

&lt;h2&gt;
  
  
  How to Access Through ofox.ai
&lt;/h2&gt;

&lt;p&gt;The three models land on a single OpenAI-compatible endpoint, so the upgrade path from "use one model" to "test all three" is one base URL change.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-ofox-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Claude Opus 4.8 — daily driver
&lt;/span&gt;&lt;span class="n"&gt;opus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Audit this service for race conditions...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# GPT-5.5 — terminal-heavy workflows
&lt;/span&gt;&lt;span class="n"&gt;gpt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a shell script that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Opus 4.8 and GPT-5.5 are live on ofox.ai today at &lt;code&gt;anthropic/claude-opus-4-8&lt;/code&gt; and &lt;code&gt;openai/gpt-5.5&lt;/code&gt;. Fable 5 is rolling into the aggregator now — check the model page or the changelog for the live ID. One key covers all three, and going through an aggregator makes the capability vs. cost question easier to answer empirically: same prompts, three models, one endpoint, real numbers on &lt;em&gt;your&lt;/em&gt; traffic.&lt;/p&gt;
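
&lt;p&gt;A same-prompt comparison can be as small as the helper below. It is a sketch under a few assumptions: &lt;code&gt;compare_models&lt;/code&gt; is a hypothetical name, &lt;code&gt;client&lt;/code&gt; is the &lt;code&gt;OpenAI&lt;/code&gt; client built in the snippet above, and the gateway is assumed to return the standard &lt;code&gt;usage&lt;/code&gt; block on chat completions.&lt;/p&gt;

```python
def compare_models(client, model_ids, prompt):
    """Send one prompt to each model and collect token usage plus the reply text."""
    rows = []
    for model_id in model_ids:
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        rows.append({
            "model": model_id,
            # Assumes the response carries a populated usage block.
            "input_tokens": resp.usage.prompt_tokens,
            "output_tokens": resp.usage.completion_tokens,
            "reply": resp.choices[0].message.content,
        })
    return rows
```

&lt;p&gt;Calling &lt;code&gt;compare_models(client, ["anthropic/claude-opus-4-8", "openai/gpt-5.5"], prompt)&lt;/code&gt; on a handful of representative prompts gives you per-model token counts to multiply against the rate card, with the replies kept alongside for a quality review.&lt;/p&gt;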

&lt;p&gt;For Anthropic-native features (adaptive thinking, effort control on Opus 4.8), point the official Anthropic SDK at &lt;code&gt;https://api.ofox.ai/anthropic&lt;/code&gt; instead. We walk through both protocols in Why use an LLM API gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verdict
&lt;/h2&gt;

&lt;p&gt;Fable 5 is the new capability ceiling. Opus 4.8 is the new value floor. GPT-5.5 is the ecosystem play that still wins one important benchmark.&lt;/p&gt;

&lt;p&gt;If you're shipping agentic coding to production in 2026, the migration path is no longer "pick one and go." Route Opus 4.8 by default, escalate the hardest 10–20% to Fable 5, and keep GPT-5.5 on Codex CLI workflows where it has the integration lead. The cost-per-point math justifies the routing complexity within the first few thousand requests.&lt;/p&gt;

&lt;p&gt;The one thing that hasn't changed: independent leaderboards still beat vendor claims. Watch Artificial Analysis's GDPval-AA for Fable 5's real-work Elo when it lands. That's the number that will tell you whether the 2x price tag holds up against the 25–30% turn reduction outside the benchmark suite.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/claude-fable-5-vs-opus-4-8-vs-gpt-5-5-swe-bench-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>coding</category>
      <category>swebench</category>
    </item>
    <item>
      <title>Apple's Third-Generation Foundation Models: A Developer's Read on WWDC 2026</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Tue, 09 Jun 2026 06:07:04 +0000</pubDate>
      <link>https://dev.to/owen_fox/apples-third-generation-foundation-models-a-developers-read-on-wwdc-2026-2ej1</link>
      <guid>https://dev.to/owen_fox/apples-third-generation-foundation-models-a-developers-read-on-wwdc-2026-2ej1</guid>
      <description>&lt;h1&gt;
  
  
  Apple's Third-Generation Foundation Models: A Developer's Read on WWDC 2026
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Apple shipped its third generation of foundation models on June 8, 2026, alongside a rebranded "Siri AI." Five models. The headline is a &lt;strong&gt;20-billion-parameter sparse on-device model (AFM 3 Core Advanced)&lt;/strong&gt; that activates only 1–4B parameters per prompt using a technique Apple Research calls Instruction-Following Pruning. The other headline — quieter, more consequential for developers — is that Apple's most capable cloud model, &lt;strong&gt;AFM 3 Cloud Pro&lt;/strong&gt;, runs on &lt;strong&gt;NVIDIA GPUs hosted in Google Cloud&lt;/strong&gt;, and is refined using outputs from Google's Gemini frontier models. Apple says the resulting model is theirs; Apple executives are careful to distinguish "trained using" Gemini from "is" Gemini. The Foundation Models framework, which exposes the on-device model to any Swift app, now accepts images. None of it works in the EU on iPhone/iPad or in mainland China at launch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five-Model Lineup
&lt;/h2&gt;

&lt;p&gt;Apple's research post names five distinct models. The naming is more disciplined than 2024's "AFM-on-device / AFM-server" pair, and it tracks how Apple wants you to think about the stack: two tiers on-device, three in Private Cloud Compute.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Where it runs&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Active params&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AFM 3 Core&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-device&lt;/td&gt;
&lt;td&gt;3B (dense)&lt;/td&gt;
&lt;td&gt;3B&lt;/td&gt;
&lt;td&gt;Lightweight text, routing, fast NLU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AFM 3 Core Advanced&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-device&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20B (sparse)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1–4B per prompt&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New Siri / dictation / TTS; image understanding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AFM 3 Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Private Cloud Compute&lt;/td&gt;
&lt;td&gt;undisclosed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Main cloud text / image-understanding model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ADM 3 Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Private Cloud Compute&lt;/td&gt;
&lt;td&gt;undisclosed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Image generation (Image Playground, Reframe, Extend, Cleanup)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AFM 3 Cloud Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;NVIDIA GPUs in Google Cloud&lt;/strong&gt; (Private Cloud Compute extension)&lt;/td&gt;
&lt;td&gt;undisclosed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Complex reasoning, agentic tool use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Apple has not published parameter counts for any of the three cloud models. The on-device models are the only ones with disclosed sizes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 20B Sparse Model and Why It Matters
&lt;/h2&gt;

&lt;p&gt;The most technically interesting model is AFM 3 Core Advanced. It's a 20-billion-parameter model that fits — and runs — on a phone, by never activating more than ~4B parameters at once.&lt;/p&gt;

&lt;p&gt;The trick is &lt;strong&gt;Instruction-Following Pruning (IFP)&lt;/strong&gt;, originally published by Apple Research in a January 2025 paper. The idea: rather than treating sparsity as a static structural decision (set at training), let a small predictor read the prompt and dynamically choose which rows and columns of the feed-forward-network matrices to activate for &lt;em&gt;that&lt;/em&gt; request. The paper's headline result: their 3B activated model "outperformed the 3B dense baseline by 5–8 absolute points on math and coding, while matching the performance of a 9B dense model." So the same active compute footprint as a 3B dense model bought roughly 9B-class quality.&lt;/p&gt;
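
&lt;p&gt;The core idea (pick the active feed-forward units once per prompt, then run every token through only those units) fits in a toy example. This is an illustrative sketch, not Apple's implementation: the weights are hand-picked, the "predictor" is a single linear scoring matrix standing in for a learned network, and all names are hypothetical.&lt;/p&gt;

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def select_units(predictor, prompt_vec, k):
    """Score every FFN hidden unit from the prompt and keep the top-k indices."""
    scores = [dot(row, prompt_vec) for row in predictor]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def sparse_ffn(x, w_in, w_out, active):
    """Run the FFN using only the selected rows of w_in and columns of w_out."""
    hidden = [max(0.0, dot(w_in[j], x)) for j in active]  # ReLU on active units
    return [sum(h * w_out[i][j] for h, j in zip(hidden, active))
            for i in range(len(w_out))]


# Toy weights: 4 hidden units, 2-dimensional token vectors.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
w_out = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.5, 1.0]]
predictor = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

prompt_vec = [1.0, 0.2]                            # stand-in for a prompt encoding
active = select_units(predictor, prompt_vec, k=2)  # chosen once per prompt
outputs = [sparse_ffn(tok, w_in, w_out, active)
           for tok in ([1.0, 2.0], [0.5, 0.5])]
print(active, outputs)
```

&lt;p&gt;The point of the sketch is the control flow: the mask is computed from the prompt and reused for every token, which is what separates this style of sparsity from a per-token router.&lt;/p&gt;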

&lt;p&gt;What changes for the production model is the memory story: Apple stores the full model in flash (NAND), keeps a small set of "always-active shared experts" in DRAM, and pages routed experts into DRAM only when the predictor selects them. That's how 20B fits in an on-device model footprint without melting battery.&lt;/p&gt;

&lt;p&gt;The blunt way to read this: Apple just gave the iPhone the &lt;strong&gt;first production-scale dynamic-sparse LLM that ships to consumers&lt;/strong&gt;. It's not a mixture-of-experts model in the classic sense (no learned router selecting K-of-N experts per token), but it's a cousin — and the deployment hardening is a first.&lt;/p&gt;

&lt;p&gt;What Apple does &lt;em&gt;not&lt;/em&gt; claim: it does not benchmark AFM 3 Core Advanced against GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, Qwen 3.7, or Llama 4. Every comparison is against Apple's own 2025 baseline. Treat the eval numbers below as evidence of &lt;em&gt;generational&lt;/em&gt; progress, not as a competitive ranking.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Apple's Human Evaluations Actually Show
&lt;/h2&gt;

&lt;p&gt;Apple's evaluation methodology is &lt;strong&gt;side-by-side blind human preference vs. the previous AFM generation&lt;/strong&gt;. The numbers, verbatim from the research post:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Eval&lt;/th&gt;
&lt;th&gt;New model preference&lt;/th&gt;
&lt;th&gt;2025 baseline preference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text (AFM 3 Core, on-device)&lt;/td&gt;
&lt;td&gt;45.6%&lt;/td&gt;
&lt;td&gt;23.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text (AFM 3 Cloud)&lt;/td&gt;
&lt;td&gt;64.7%&lt;/td&gt;
&lt;td&gt;8.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding (AFM 3 Core)&lt;/td&gt;
&lt;td&gt;&amp;gt;61%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding (AFM 3 Cloud)&lt;/td&gt;
&lt;td&gt;37.8%&lt;/td&gt;
&lt;td&gt;9.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictation overall quality (AFM 3 Core Advanced)&lt;/td&gt;
&lt;td&gt;44.7%&lt;/td&gt;
&lt;td&gt;17.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cloud Pro adds &lt;strong&gt;+10% relative&lt;/strong&gt; preference over Cloud on text, &lt;strong&gt;+14%&lt;/strong&gt; on math, and &lt;strong&gt;+14%&lt;/strong&gt; on image understanding.&lt;/p&gt;

&lt;p&gt;Mean Opinion Score for the new on-device TTS:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Voice&lt;/th&gt;
&lt;th&gt;Current TTS&lt;/th&gt;
&lt;th&gt;AFM 3 Core Advanced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;General&lt;/td&gt;
&lt;td&gt;3.87&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.15&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversational&lt;/td&gt;
&lt;td&gt;3.82&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.24&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two caveats matter when you cite these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;No third-party benchmarks&lt;/strong&gt;. No MMLU, no SWE-bench, no GPQA. Apple's published numbers are preferences against the 2025 baseline only.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Side-by-side preference is loose for technical work&lt;/strong&gt;. It captures "did the human like this answer better," which is informative for chat, weaker for code or reasoning.&lt;/li&gt;
&lt;/ol&gt;
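
&lt;p&gt;One way to read the preference table is to look at the share of comparisons that went to neither model and at the new model's win rate among the rest. The sketch below does that arithmetic on the published rows. It assumes the percentage not accounted for in each row represents ties or "no preference" votes, which the article does not state, and &lt;code&gt;summarize&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# (new-model preferred %, 2025-baseline preferred %) from the table above.
EVALS = {
    "Text (AFM 3 Core)": (45.6, 23.3),
    "Text (AFM 3 Cloud)": (64.7, 8.7),
    "Image understanding (AFM 3 Cloud)": (37.8, 9.6),
    "Dictation (AFM 3 Core Advanced)": (44.7, 17.6),
}


def summarize(new_pct, old_pct):
    """Return (unaccounted share, new-model win rate among decided comparisons)."""
    undecided = 100.0 - new_pct - old_pct
    decided_win_rate = 100.0 * new_pct / (new_pct + old_pct)
    return undecided, decided_win_rate


for name, (new_pct, old_pct) in EVALS.items():
    undecided, win_rate = summarize(new_pct, old_pct)
    print(f"{name}: {undecided:.1f}% neither, {win_rate:.1f}% of decided votes")
```

&lt;p&gt;Seen this way, a row with a modest headline number can still be a lopsided win among raters who expressed a preference, which is worth keeping in mind when comparing rows.&lt;/p&gt;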

&lt;h2&gt;
  
  
  The Gemini Question: What's Verified
&lt;/h2&gt;

&lt;p&gt;The Apple–Google partnership produced two parallel storylines that have been hard to reconcile in coverage. Here's what each Apple executive actually said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The amount of the Google Assistant we use is none." — &lt;strong&gt;Craig Federighi&lt;/strong&gt;, SVP Software Engineering&lt;/p&gt;

&lt;p&gt;"All of these are custom builds for Apple Silicon, trained using proprietary data, and refined using outputs from Gemini frontier models." — &lt;strong&gt;Amar Subramanya&lt;/strong&gt;, Apple AI VP&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reconciled: Apple is &lt;strong&gt;not running Gemini&lt;/strong&gt; in production for Apple Intelligence. Apple &lt;strong&gt;is&lt;/strong&gt; using Gemini's outputs as part of post-training (distillation-style refinement). For AFM 3 Cloud Pro specifically, multiple reports describe a deeper Google involvement — Gemini-derived training infrastructure, Apple-owned pre-training and post-training, NVIDIA inference. Apple has not contradicted that account but has chosen not to volunteer it on stage.&lt;/p&gt;

&lt;p&gt;The honest summary: &lt;strong&gt;Gemini is a teacher signal, not the runtime model.&lt;/strong&gt; That's a real and growing pattern in 2026 — frontier labs train teacher models, downstream players distill — and Apple is the largest distribution channel to publicly adopt it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Private Cloud Compute, Now on NVIDIA in Google's Datacenter
&lt;/h2&gt;

&lt;p&gt;Apple's Private Cloud Compute (PCC) launched in 2024 with a striking security architecture: Apple Silicon servers running attested, code-audited builds, with cryptographic guarantees that user data is unreachable even by Apple. The 2026 extension is the surprise: PCC now also runs on &lt;strong&gt;NVIDIA GPUs hosted inside Google Cloud&lt;/strong&gt;, while Apple says the same data-handling guarantees still apply.&lt;/p&gt;

&lt;p&gt;Two related details worth flagging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Why Google's datacenter?&lt;/strong&gt; Reporting suggests Apple tried to run the new Cloud Pro model on its own PCC hardware first, and the model was too slow. NVIDIA capacity on Google Cloud was the path that shipped.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Why none of this in the keynote?&lt;/strong&gt; Apple's keynote mentions NVIDIA, not Google. Google appears only in the research post and in executive interviews afterward. The brand story Apple wants you to hear is "Apple models, NVIDIA hardware, Apple privacy." The full supply chain is more entangled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For builders evaluating Apple's privacy claim, the engineering substance is the cryptographic attestation chain, not the geographic location of the GPUs. The substrate moving to NVIDIA-in-GCP doesn't break that — but it does mean the trust model now spans more vendors than the 2024 version.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation Models Framework: What 2026 Adds
&lt;/h2&gt;

&lt;p&gt;This is the under-covered part of the announcement, and the one most directly relevant to developers.&lt;/p&gt;

&lt;p&gt;The Foundation Models framework was introduced in 2025 as a Swift API that gives any third-party app direct access to Apple's ~3B on-device model — no API key, no network, no per-token cost. The 2026 update adds &lt;strong&gt;image input&lt;/strong&gt;: developers can now pass images alongside text into the on-device model, enabling on-device visual tasks (caption a photo, extract structured data from a receipt, classify a UI element) without any cloud round-trip.&lt;/p&gt;

&lt;p&gt;What the framework is good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Structured output&lt;/strong&gt; (typed Swift values, not just text)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Tool calling / function calling&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Privacy-sensitive embedded intelligence&lt;/strong&gt; (notes summarization, on-device search, smart suggestions)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Offline reliability&lt;/strong&gt; (no network dependency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it is &lt;em&gt;not&lt;/em&gt; good at, by design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  General-knowledge Q&amp;amp;A (it's not a chatbot back-end)&lt;/li&gt;
&lt;li&gt;  Anything that requires fresh world knowledge&lt;/li&gt;
&lt;li&gt;  Workloads that need frontier-tier reasoning, long context, or multi-step agentic tool use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an iOS app shipping in fall 2026, the realistic pattern is a hybrid: &lt;strong&gt;use the Foundation Models framework for fast, free, offline work; fall back to a cloud model for everything else.&lt;/strong&gt; That fallback is where multi-provider gateways (including ofox.ai) get useful — you want OpenAI/Anthropic/Google/Qwen/DeepSeek behind one API so you can change providers without reshipping the app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Can't Use This at Launch
&lt;/h2&gt;

&lt;p&gt;The geography is unusually restrictive even by Apple AI standards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;🇪🇺 EU&lt;/strong&gt;: Siri AI is &lt;strong&gt;not&lt;/strong&gt; available on iPhone or iPad at launch. Mac, Apple Watch, and Vision Pro are included. Apple cites DMA compliance work.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;🇨🇳 Mainland China&lt;/strong&gt;: All of Apple Intelligence, including Siri AI, is &lt;strong&gt;unavailable&lt;/strong&gt; pending regulatory approval.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hardware floor&lt;/strong&gt;: iPhone 16 family, iPhone 15 Pro / 15 Pro Max, iPad mini with A17 Pro, M1-or-later iPads, M1-or-later Macs, Apple Vision Pro. On Apple Watch, watchOS 27 runs on Series 10, Series 11, Ultra 2, Ultra 3, and SE 3 — and Watch-side Apple Intelligence additionally requires pairing with an iPhone 15 Pro / Pro Max or newer.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Launch cadence&lt;/strong&gt;: Siri AI starts as a beta later in 2026 in English, with the 32 supported locales rolling in over time. The locales span English (US, UK, Australia, India), PFIGSCJK (Portuguese, French, Italian, German, Spanish, Chinese, Japanese, Korean), DNNSTV (Danish, Dutch, Norwegian, Swedish, Turkish, Vietnamese), and AFIHHMPRTU (Arabic, Finnish, Indonesian, Hebrew, Hindi, Malay, Polish, Russian, Thai, Ukrainian).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The EU/China gap means Apple Intelligence is now formally a &lt;strong&gt;partial product&lt;/strong&gt; across geographies — the same hardware does materially different things depending on Apple ID region, and developer documentation will need to fork on capability availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Changes for Builders
&lt;/h2&gt;

&lt;p&gt;Three things to take away if you're shipping AI features in late 2026:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;On-device LLMs cross a usability threshold.&lt;/strong&gt; A 20B sparse model on a phone, with image input, free for app developers, is enough to handle a meaningful slice of in-app AI tasks — structured extraction, classification, embedded summarization, tool routing. Apps that previously paid for cloud calls to do this can stop.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Frontier work still belongs in the cloud.&lt;/strong&gt; Cloud Pro exists for a reason. Long context, agentic loops, frontier reasoning, vision-language across many images — all still cheaper, more capable, or both via a cloud LLM. The build decision is now "what &lt;em&gt;can't&lt;/em&gt; run on-device" rather than "how big a model do I need."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Multi-provider sourcing is the safer default.&lt;/strong&gt; Apple now ships an on-device model partly distilled from Gemini, running cloud workloads on NVIDIA-in-GCP. Vendor coupling at the model layer is no longer optional even for Apple. If you're building a cross-platform product, picking a single model vendor at the application layer is the bet that's getting harder to justify.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The throughline: Apple just made on-device LLMs a baseline capability on iOS. The interesting work moves up the stack — to deciding when to use it, when to route past it, and how to do that without locking your app to any one vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources Checked
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Apple Machine Learning Research — Introducing the Third Generation of Apple's Foundation Models (model lineup, IFP, eval numbers verbatim)&lt;/li&gt;
&lt;li&gt;  Apple Newsroom — Apple unveils next generation of Apple Intelligence, Siri AI, and more (hardware list, language list, region availability)&lt;/li&gt;
&lt;li&gt;  9to5Mac — Federighi details Apple's collaboration with Google for Siri AI (Federighi quote)&lt;/li&gt;
&lt;li&gt;  CNBC — Apple partnering with Google and Nvidia for most advanced AI model (Subramanya quote, NVIDIA-in-GCP arrangement)&lt;/li&gt;
&lt;li&gt;  AppleInsider — Apple's new foundation models don't contain a drop of Gemini (independent read on the Gemini relationship)&lt;/li&gt;
&lt;li&gt;  MacRumors — Siri AI not available in EU/China initially (region restrictions)&lt;/li&gt;
&lt;li&gt;  arXiv 2501.02086 — Instruction-Following Pruning for Large Language Models (IFP technique, original Apple paper)&lt;/li&gt;
&lt;li&gt;  MarkTechPost — Apple Researchers Introduce IFPruning (third-party IFP explainer)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/apple-foundation-models-3-wwdc-2026-developer-read/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>apple</category>
      <category>ai</category>
      <category>ios</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Claude API Error 529 Overloaded: 8 Fixes, When to Switch Providers, and How to Avoid It in 2026</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Mon, 08 Jun 2026 09:43:38 +0000</pubDate>
      <link>https://dev.to/owen_fox/claude-api-error-529-overloaded-8-fixes-when-to-switch-providers-and-how-to-avoid-it-in-2026-e1e</link>
      <guid>https://dev.to/owen_fox/claude-api-error-529-overloaded-8-fixes-when-to-switch-providers-and-how-to-avoid-it-in-2026-e1e</guid>
      <description>&lt;h1&gt;
  
  
  Claude API Error 529 Overloaded: 8 Fixes, When to Switch Providers, and How to Avoid It in 2026
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; Claude API error 529 means Anthropic is temporarily overloaded, not that your code is wrong and not that your account is throttled. There have already been four confirmed platform-wide incidents in 2026 (March 2, March 18, March 19, June 2), the longest stretching past three hours. The retry-only playbook fails after roughly five minutes; what survives is a layered strategy: exponential backoff for the first 30 seconds, then automatic failover to a different Claude model or a different vendor entirely. The eight fixes below are sorted by tier and by how long they take to recover, and the closing section shows the ten-line unified-endpoint pattern that turns a three-hour outage into a two-second hop.&lt;/p&gt;

&lt;p&gt;529 is Anthropic temporarily saying "no." 429 is Anthropic telling you that you are sending too much. The fixes look similar for the first three retries and diverge completely after that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Claude API Down Right Now? The 30-Second Diagnosis
&lt;/h2&gt;

&lt;p&gt;Three checks, in order. If any of them confirms the issue is upstream, stop debugging your own code and move to the fixes section.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to check&lt;/th&gt;
&lt;th&gt;Confirms 529 if&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;The error body&lt;/td&gt;
&lt;td&gt;Contains &lt;code&gt;{"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Anthropic status page at &lt;code&gt;status.claude.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Open incident on Claude API, claude.ai, or Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Your own request log over the last 5 minutes&lt;/td&gt;
&lt;td&gt;529 rate jumped from &amp;lt;1% to &amp;gt;20% with no deploy on your side&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If all three light up, this is a platform-wide overload, not a bug in your application. The next section tells you which fix to reach for based on how long you can wait.&lt;/p&gt;
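
&lt;p&gt;Checks 1 and 3 from the table can be automated; check 2 (the status page) stays a manual look. The sketch below is one possible shape, not an official client: the function names and the two-window comparison are assumptions made for illustration, while the error body shape and the 1% / 20% thresholds come from the table above.&lt;/p&gt;

```python
import json
from collections import Counter


def is_overloaded_body(body: str) -> bool:
    """Check 1: does the response body match the overloaded_error shape?"""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    err = data.get("error")
    return (
        data.get("type") == "error"
        and isinstance(err, dict)
        and err.get("type") == "overloaded_error"
    )


def rate_529(status_codes) -> float:
    """Fraction of responses in a window that were HTTP 529."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    return Counter(codes)[529] / len(codes)


def looks_upstream(body: str, previous_codes, recent_codes) -> bool:
    """Checks 1 and 3 together: overloaded body plus a jump from under 1% to over 20%."""
    baseline_ok = 0.01 > rate_529(previous_codes)
    spiked = rate_529(recent_codes) > 0.20
    return is_overloaded_body(body) and baseline_ok and spiked
```

&lt;p&gt;Feed &lt;code&gt;previous_codes&lt;/code&gt; from the window before the last five minutes and &lt;code&gt;recent_codes&lt;/code&gt; from the last five minutes of your own request log; both are plain lists of HTTP status codes.&lt;/p&gt;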

&lt;h2&gt;
  
  
  When to Apply These Fixes (and When to Switch Models Instead)
&lt;/h2&gt;

&lt;p&gt;This is the decision frame that keeps you from wasting an afternoon on the wrong layer of the stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to retry in place (apply fixes 1-3):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The 529 rate just spiked and the status page is still green — you are early in the incident&lt;/li&gt;
&lt;li&gt;  Your workload is batch or asynchronous and can tolerate ~30 seconds of delay&lt;/li&gt;
&lt;li&gt;  You are on a free or low-tier plan where multi-provider routing is not yet justified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to switch models or vendors (apply fixes 4-8):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The same &lt;code&gt;claude-opus-4-8&lt;/code&gt; request has returned 529 three times in a row across a 30-second window&lt;/li&gt;
&lt;li&gt;  The status page has confirmed an incident and the ETA is "&amp;gt;1 hour" or unstated&lt;/li&gt;
&lt;li&gt;  Your workload is user-facing and any visible latency above one second damages product experience&lt;/li&gt;
&lt;li&gt;  You are calling Claude from an agent loop (Claude Code, Codex, Cursor) where retries amplify into compound delay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stop rule.&lt;/strong&gt; If you have retried the same model four times in 60 seconds and three of the responses were 529, &lt;strong&gt;stop retrying that model&lt;/strong&gt;. Every retry past that point is queuing behind every other client doing the same thing — you are making the storm worse. Switch.&lt;/p&gt;
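
&lt;p&gt;The stop rule is easy to encode as a small tracker. This is a minimal sketch under stated assumptions: the class name &lt;code&gt;StopRule&lt;/code&gt; and its methods are hypothetical, and the caller supplies timestamps (for example from &lt;code&gt;time.monotonic()&lt;/code&gt;) so the logic stays testable.&lt;/p&gt;

```python
from collections import deque


class StopRule:
    """Tracks recent attempts against one model and says when to stop retrying it."""

    def __init__(self, window_s: float = 60.0, min_attempts: int = 4, min_529: int = 3):
        self.window_s = window_s
        self.min_attempts = min_attempts
        self.min_529 = min_529
        self._attempts = deque()  # (timestamp, status_code) pairs, oldest first

    def record(self, status_code: int, now: float) -> None:
        """Store the outcome of one attempt."""
        self._attempts.append((now, status_code))

    def should_stop(self, now: float) -> bool:
        """True once the window holds at least 4 attempts, 3 or more of them 529."""
        # Drop attempts that have aged out of the 60-second window.
        while self._attempts and now - self._attempts[0][0] > self.window_s:
            self._attempts.popleft()
        overloaded = sum(1 for _, code in self._attempts if code == 529)
        return len(self._attempts) >= self.min_attempts and overloaded >= self.min_529
```

&lt;p&gt;Keep one tracker per model name, call &lt;code&gt;record&lt;/code&gt; after every response, and move to the fallback path as soon as &lt;code&gt;should_stop&lt;/code&gt; returns true.&lt;/p&gt;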

&lt;p&gt;The shortest version: retry buys you 30 seconds, failover buys you the rest of the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding 529 vs 429: Anthropic's Two Limits
&lt;/h2&gt;

&lt;p&gt;Half the production teams hitting 529 in 2026 walked in thinking they had a 429 problem and tuned the wrong knob. The two errors look superficially similar and need different fixes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Whose problem&lt;/th&gt;
&lt;th&gt;Long-term fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;rate_limit_error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your account exceeded the per-minute or per-day quota for your tier&lt;/td&gt;
&lt;td&gt;Yours&lt;/td&gt;
&lt;td&gt;Send smaller batches, request a tier upgrade, or ask for higher acceleration limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;529&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;overloaded_error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Anthropic's platform is over capacity globally across all customers&lt;/td&gt;
&lt;td&gt;Anthropic's&lt;/td&gt;
&lt;td&gt;Multi-model fallback, multi-provider failover, retry with jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Anthropic also flags a third edge case in the official errors documentation: if your organization itself causes a sudden traffic spike, you can see 429 errors specifically because of acceleration limits that protect Anthropic's own infrastructure. The fix there is to ramp up traffic gradually rather than treat the 429 as a quota problem.&lt;/p&gt;

&lt;p&gt;The practical version: if you see 429 from a single account while everyone else is fine, it is yours to fix. If you see 529 (or 429s correlated with the status page), it is upstream and retry-only strategies will not save you.&lt;/p&gt;
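
&lt;p&gt;As a rough illustration, the distinction can be folded into one dispatch function. The function name and the returned action labels are made up for this sketch; only the 429 / 529 split and the "429s correlated with the status page" case come from the text above.&lt;/p&gt;

```python
def next_step(status_code: int, status_page_incident: bool = False) -> str:
    """Map a response status to the kind of fix the comparison above points to."""
    if status_code in range(200, 300):
        return "ok"
    if status_code == 529:
        # Platform overload: retry briefly, then fail over.
        return "retry-then-failover"
    if status_code == 429:
        # 429s that line up with a public incident behave like an upstream problem;
        # otherwise the account is over quota or ramping traffic too fast.
        return "retry-then-failover" if status_page_incident else "reduce-or-ramp"
    if status_code >= 500:
        return "retry"
    return "fix-request"
```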

&lt;h2&gt;
  
  
  How to Fix Claude API 529 Overloaded (Solutions for Every Tier)
&lt;/h2&gt;

&lt;p&gt;The eight fixes are sorted from "ten seconds, works on free tier" to "production-grade failover." Apply the ones that match your tier; do not skip ahead to fix 8 if you have not done 1-3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free / Pro Tier (Solutions 1-3)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fix 1 — Retry once after 2 seconds.&lt;/strong&gt; Most 529 spikes are short. If your workload is interactive (you are pasting a prompt into a script), wait two seconds and retry once. Roughly 60% of 529s clear inside two seconds based on the March 19 incident pattern where the status page moved from open to monitoring inside an hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 2 — Exponential backoff with jitter (Python).&lt;/strong&gt; For any script that runs more than once, this is the floor. The jitter component matters: without it, every client that hit 529 at the same time retries at the same time, recreating the overload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
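&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# read the key from the environment&lt;/span&gt;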

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_claude_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.anthropic.com/v1/messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023-06-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;529&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claude still failing after all retry attempts (429/5xx/529 or network errors); switch model or provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



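&lt;p&gt;The backoff sequence is 1s, 2s, 4s, 8s, with up to 30% jitter added to each. Four attempts therefore add between 15 and roughly 19.5 seconds of waiting on top of request time (as written, the final 8-second sleep runs before the function raises). That is enough for a transient spike, not enough to blow a user-facing SLO.&lt;/p&gt;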
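
&lt;p&gt;The arithmetic behind that estimate can be checked directly. The helper below is a hypothetical addition, not part of the original function; it mirrors the delay formula in the loop above, &lt;code&gt;2 ** attempt&lt;/code&gt; plus up to 30% jitter, and reports the best and worst case for total sleep time.&lt;/p&gt;

```python
def backoff_bounds(max_attempts: int = 4, jitter: float = 0.3):
    """Return (base_delays, min_total, max_total) in seconds for the retry loop."""
    # Base delay doubles on each attempt: 1, 2, 4, 8 for four attempts.
    base = [2 ** attempt for attempt in range(max_attempts)]
    # Jitter only ever adds time, so the minimum is the plain sum.
    min_total = float(sum(base))
    # Worst case: every delay receives the full jitter fraction on top.
    max_total = sum(delay * (1 + jitter) for delay in base)
    return base, min_total, max_total


base, low, high = backoff_bounds()
print(base, low, round(high, 1))
```

&lt;p&gt;With the defaults, the base delays sum to 15 seconds and the worst case with full jitter is about 19.5 seconds, excluding the time each request itself takes (up to the 30-second timeout).&lt;/p&gt;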

&lt;p&gt;&lt;strong&gt;Fix 3 — Stream instead of polling.&lt;/strong&gt; If you are seeing 529s on a non-streaming call that takes more than 30 seconds, your problem is partly socket-level: idle connections drop and the SDK retries from scratch. Anthropic's docs explicitly recommend the streaming Messages API for requests over 10 minutes. Streaming holds the connection and reduces the surface area for 529-during-reconnect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paid / Team Tier (Solutions 4-6)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fix 4 — Fallback to a sibling Claude model.&lt;/strong&gt; Inside Anthropic, capacity is shared across models but not symmetrically. When &lt;code&gt;claude-opus-4-8&lt;/code&gt; is overloaded, &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; often is not. The catch: model behavior differs, so do a regression pass on prompts before flipping the fallback. The pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PRIMARY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;FALLBACK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_model_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_claude_with_backoff&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PRIMARY&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_claude_with_backoff&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FALLBACK&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This buys you a second pool of capacity and adds maybe 200ms to the failover. It does not survive a platform-wide 529 (March 18, March 19) because both pools are overloaded at once — but it covers the model-specific incidents (the May 22 and May 25 events on Opus 4.7 documented in our Opus 4.7 reliability fix guide).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 5 — Move tolerable workloads to the Message Batches API.&lt;/strong&gt; Per Anthropic's official Message Batches API documentation, the Batches API gives you two properties that matter for 529 exposure: a 24-hour processing window (most batches finish in under an hour, but you have up to 24h before expiry) and a 50% discount on all usage. The docs are explicit that batch processing speed "may be slowed down based on current demand and your request volume," so batches are &lt;strong&gt;not&lt;/strong&gt; on a separate, 529-immune infrastructure — they can be delayed under the same platform pressure. What changes is the shape: a 529-driven delay never reaches a user-facing latency budget, and the half-price bill blunts the cost impact of any retries the system absorbs internally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
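&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# reads ANTHROPIC_API_KEY from the environment&lt;/span&gt;
&lt;span class="c1"&gt;# prompts: a list of user-prompt strings defined elsewhere&lt;/span&gt;
&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;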
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything you can defer to overnight summarization, bulk classification, or async report generation belongs here. You trade real-time response for a tolerant window and a halved bill — and a 529 storm during that window costs you wall-clock delay inside a 24h budget instead of a customer-facing outage.&lt;/p&gt;

&lt;p&gt;If you are running Claude Code interactively rather than as a batch worker, note the separate &lt;code&gt;--fallback-model&lt;/code&gt; CLI flag — per the official CLI reference, it enables automatic fallback when the default model is overloaded or unavailable, which does cover 529. Two limits, both spelled out in the docs: it takes effect in &lt;code&gt;-p&lt;/code&gt; (print mode) and background sessions but is &lt;strong&gt;ignored in interactive sessions&lt;/strong&gt;, and the fallback target is another Anthropic model that shares the same upstream capacity pool — useful for model-specific incidents, not the platform-wide March 18 / June 2 pattern. The current Claude Code settings reference does not list a &lt;code&gt;fallbackModel&lt;/code&gt; entry in &lt;code&gt;settings.json&lt;/code&gt;, so the CLI flag is the documented surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 6 — Send to AWS Bedrock as a second region.&lt;/strong&gt; Anthropic's direct API and the Claude on AWS Bedrock deployment are separate infrastructure pools. The March 18 outage hit the direct API; Bedrock was unaffected for the first 90 minutes. If you have AWS credentials and your compliance posture allows it, run Bedrock as the fallback path. The trade-off is request-ID complexity (you now have to track an AWS request ID and an Anthropic request ID, per the official docs), but the dual-region capacity is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enterprise / Production (Solutions 7-8)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fix 7 — Multi-provider failover through a unified endpoint.&lt;/strong&gt; This is the only fix that survives a platform-wide outage like June 2 because it routes across vendors, not just across Anthropic models. Through ofox.ai's unified endpoint, the same OpenAI-compatible request hits &lt;code&gt;anthropic/claude-opus-4.8&lt;/code&gt; first, falls back to &lt;code&gt;openai/gpt-5.5&lt;/code&gt; on 529, then to &lt;code&gt;bailian/qwen3.7-max&lt;/code&gt; if GPT also degrades — all in sub-200ms because the failover happens inside the gateway, not your application. Code in the next section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix 8 — Circuit breaker around the entire Anthropic path.&lt;/strong&gt; When the 529 rate exceeds 20% over a five-minute rolling window, open a circuit breaker that stops calling Anthropic entirely and routes 100% of traffic to a secondary provider for ten minutes. This stops your retries from contributing to the storm and gives Anthropic's autoscaler room to recover. Implementation is a classic circuit breaker pattern: sliding-window counter, open/half-open/closed states, automatic reset.&lt;/p&gt;
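
&lt;p&gt;A minimal sketch of that breaker follows, using the 20% threshold, five-minute window, and ten-minute cooldown from Fix 8. The class and method names are hypothetical, and &lt;code&gt;min_samples&lt;/code&gt; is an added assumption so that a handful of early 529s on a quiet service does not trip the circuit. Timestamps are passed in by the caller (for example &lt;code&gt;time.monotonic()&lt;/code&gt;).&lt;/p&gt;

```python
from collections import deque


class CircuitBreaker:
    """Sliding-window breaker: closed -> open -> half_open -> closed."""

    def __init__(self, threshold=0.20, window_s=300.0, cooldown_s=600.0, min_samples=20):
        self.threshold = threshold      # 529 share that opens the circuit
        self.window_s = window_s        # rolling window length in seconds
        self.cooldown_s = cooldown_s    # how long to stay open before probing
        self.min_samples = min_samples  # assumption: ignore tiny sample sizes
        self.state = "closed"
        self.opened_at = 0.0
        self._events = deque()          # (timestamp, was_529) pairs, oldest first

    def allow(self, now):
        """True if the primary (Anthropic) path may be tried right now."""
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            self.state = "half_open"    # cooldown over: allow traffic as a probe
        return self.state != "open"

    def record(self, status_code, now):
        """Feed one response status from the primary path into the breaker."""
        was_529 = status_code == 529
        if self.state == "open":
            return                      # ignore stragglers while open
        if self.state == "half_open":
            # The next recorded result decides: a 529 reopens, anything else closes.
            if was_529:
                self.state, self.opened_at = "open", now
            else:
                self.state = "closed"
                self._events.clear()
            return
        self._events.append((now, was_529))
        # Drop events older than the rolling window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        failures = sum(1 for _, bad in self._events if bad)
        enough = len(self._events) >= self.min_samples
        if enough and failures / len(self._events) > self.threshold:
            self.state, self.opened_at = "open", now
```

&lt;p&gt;Call &lt;code&gt;allow&lt;/code&gt; before each request; when it returns false, send the request to the secondary provider instead. This version lets all traffic through in the half-open state rather than a single probe, which keeps the sketch short at the cost of a slightly larger burst when the cooldown ends.&lt;/p&gt;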

&lt;h2&gt;
  
  
  Claude 529 Outage History: Real Incidents in 2026
&lt;/h2&gt;

&lt;p&gt;Four confirmed platform-wide 529 events so far this year. Two of them lasted long enough that a retry-only strategy would have failed entirely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date (UTC)&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Services affected&lt;/th&gt;
&lt;th&gt;Public root cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;March 2, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2 hours&lt;/td&gt;
&lt;td&gt;Global — Claude API, claude.ai, Claude Code&lt;/td&gt;
&lt;td&gt;Not disclosed; correlated with Opus 4.7 launch traffic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;March 18, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3+ hours from 06:30 UTC&lt;/td&gt;
&lt;td&gt;Claude Code 529 storm, persisted past status-page "monitoring"&lt;/td&gt;
&lt;td&gt;Not disclosed — see GitHub issue #35704 (Max subscriber, 3h+ no recovery)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;March 19, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~53 minutes from 00:28 UTC&lt;/td&gt;
&lt;td&gt;Elevated errors across claude.ai, platform.claude.com, Claude API, Claude Code&lt;/td&gt;
&lt;td&gt;Authentication errors 23:59-00:30 UTC, moved to monitoring at 01:21 UTC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;June 2, 2026&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global outage window&lt;/td&gt;
&lt;td&gt;Cross-vendor outage observed at AI infra layer&lt;/td&gt;
&lt;td&gt;Not disclosed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: status-page time-to-detect ran 10-30 minutes behind real user impact in three of the four events. The March 18 episode is the worst case study — the page said monitoring while real users were still seeing pure 529 for over two hours.&lt;/p&gt;

&lt;p&gt;What this means for your retry budget: &lt;strong&gt;plan for a five-minute upper bound on retry, then route&lt;/strong&gt;. Anything beyond five minutes of retries during one of these windows produces zero successful responses and burns through your error budget for the day.&lt;/p&gt;
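
&lt;p&gt;One way to enforce that bound is a deadline around the retry loop. The sketch below is an assumption-heavy illustration: &lt;code&gt;Overloaded&lt;/code&gt; is a hypothetical stand-in for whatever your SDK raises on HTTP 529, and &lt;code&gt;primary&lt;/code&gt; / &lt;code&gt;fallback&lt;/code&gt; are zero-argument callables you supply. &lt;code&gt;delays&lt;/code&gt; is any iterable of sleep durations, ideally a jittered exponential schedule.&lt;/p&gt;

```python
import time


class Overloaded(Exception):
    """Hypothetical stand-in for an HTTP 529 / overloaded_error from your SDK."""


def retry_then_route(primary, fallback, delays, budget_s=300.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Retry primary() on Overloaded until budget_s is spent, then call fallback()."""
    deadline = clock() + budget_s
    schedule = iter(delays)
    while True:
        try:
            return primary()
        except Overloaded:
            pass                          # any other exception propagates untouched
        delay = next(schedule, None)
        if delay is None or clock() + delay > deadline:
            return fallback()             # budget spent: route instead of retrying
        sleep(delay)
```

&lt;p&gt;The deadline is checked before each sleep, so the loop never waits past the five-minute mark just to make one more doomed attempt.&lt;/p&gt;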

&lt;h2&gt;
  
  
  Why 529 Happens: Anthropic's Capacity Architecture in 2026
&lt;/h2&gt;

&lt;p&gt;A retry strategy you trust requires a mental model of what is on the other side of the wire. Three structural reasons explain why Claude 529 is more common in 2026 than it was in 2025 and why the pattern is not going away.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared capacity pools across models.&lt;/strong&gt; Anthropic does not run a dedicated cluster per model name. &lt;code&gt;claude-opus-4-8&lt;/code&gt; and &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; share infrastructure with weighted routing. When Opus 4.8 demand spikes — for example, the first hour after a major release — both models can return 529 simultaneously. This is why fix 4 (sibling model fallback) helps during model-specific incidents but does not survive a true platform-wide event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Launch-week traffic shocks.&lt;/strong&gt; The March 2 and June 2 incidents both correlated with new model releases. Anthropic's autoscaler is reactive, not predictive, so a release that triples baseline load in 30 minutes blows past the autoscale ramp. The pattern is predictable enough that you can pre-warm your fallback chain in the week of any major Anthropic announcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speculative decoding and tokenizer changes.&lt;/strong&gt; Newer Claude models use more output tokens per task even when the prompt is identical (~35% more for Opus 4.7 vs 4.6, documented in our Claude Max throttling postmortem). More output tokens mean more GPU-seconds per request, so the same QPS now consumes more capacity. If GPU time scales roughly linearly with output tokens, 35% more tokens per request leaves the same hardware serving about 1/1.35 ≈ 74% of the requests it used to. A tokenizer change effectively reduces platform capacity without anyone seeing a hardware downgrade.&lt;/p&gt;

&lt;p&gt;The takeaway: 529 is a structural feature of running on a single provider's evolving infrastructure. Treating it as a transient bug to retry through misses the point — you need a routing strategy that assumes 529 will be normal in 2027 too.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Claude 529 Won't Stop: Multi-Provider Failover via ofox
&lt;/h2&gt;

&lt;p&gt;The honest version: no gateway prevents 529 errors. Those come from Anthropic. What a gateway does is collapse the failover from a multi-minute incident-response exercise into a sub-200ms request-time decision. You write the fallback chain once, in one place, and every downstream service inherits it automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python — failover in 10 lines via OpenAI SDK shape
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;APIStatusError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OFOX_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;FALLBACK_CHAIN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-opus-4.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# primary
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# cross-vendor failover
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bailian/qwen3.7-max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# second cross-vendor failover
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_with_failover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FALLBACK_CHAIN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;APIStatusError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Fail over only on overload (HTTP 529 / overloaded_error); re-raise everything else.
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;529&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overloaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All three providers are overloaded — that almost never happens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern: same OpenAI SDK shape, swap one string per provider, no second SDK. The same &lt;code&gt;client.chat.completions.create()&lt;/code&gt; call works against Claude, GPT, and Qwen because ofox routes the call to the right provider transparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node — same shape
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OFOX_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://api.ofox.ai/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;FALLBACK_CHAIN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-opus-4.8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// primary&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai/gpt-5.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;// cross-vendor failover&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;bailian/qwen3.7-max&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// second cross-vendor failover&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chatWithFailover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;FALLBACK_CHAIN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="sr"&gt;/overloaded|529/i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;All three providers overloaded — escalate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How the routing decision actually works
&lt;/h3&gt;

&lt;p&gt;The decision logic at the gateway level is simpler than it looks. The platform watches three signals and picks per-request:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;What it tells the router&lt;/th&gt;
&lt;th&gt;Action on positive signal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HTTP 529 from primary&lt;/td&gt;
&lt;td&gt;Anthropic is overloaded right now&lt;/td&gt;
&lt;td&gt;Retry against next model in chain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median latency on primary &amp;gt; 3× baseline&lt;/td&gt;
&lt;td&gt;Primary is degraded but not failing&lt;/td&gt;
&lt;td&gt;Send 30% of traffic to secondary; observe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open status-page incident for primary&lt;/td&gt;
&lt;td&gt;Sustained outage in progress&lt;/td&gt;
&lt;td&gt;Send 100% of traffic to secondary until incident closes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the difference between "I retry until something works" and "the gateway routed me to a working model before I noticed." The application code does not change between a green day and an incident day.&lt;/p&gt;
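
&lt;p&gt;The table above can be read as a small decision function. This is only an illustration of the logic, not ofox's actual gateway code: the &lt;code&gt;Action&lt;/code&gt; names, the function signature and the order in which the signals are checked are all assumptions.&lt;/p&gt;

```python
from enum import Enum


class Action(Enum):
    STAY = "keep all traffic on the primary"
    NEXT_IN_CHAIN = "retry this request against the next model in the chain"
    SHIFT_30 = "send 30% of traffic to the secondary and observe"
    SHIFT_100 = "send 100% of traffic to the secondary until the incident closes"


def decide(got_529, median_latency_s, baseline_latency_s, incident_open):
    """Map the three signals to an action; the strongest signal is checked first (assumed order)."""
    if incident_open:
        return Action.SHIFT_100          # sustained outage in progress
    if got_529:
        return Action.NEXT_IN_CHAIN      # primary is overloaded right now
    if median_latency_s > 3 * baseline_latency_s:
        return Action.SHIFT_30           # degraded but not failing
    return Action.STAY
```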

&lt;p&gt;For the broader case where the same pattern applies to cost optimization rather than just reliability, see the hybrid routing breakdown — same plumbing, optimizing for $/task instead of for uptime. For the broader unified-endpoint pattern, see the API aggregation explainer.&lt;/p&gt;

&lt;h3&gt;
  
  
  A production team's June 2 playbook (what actually shipped)
&lt;/h3&gt;

&lt;p&gt;The pattern below is reconstructed from public GitHub issue threads and Reddit r/ClaudeAI posts during the June 2, 2026 outage. Names removed; the timeline and the specific decisions are real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+0:00&lt;/strong&gt; — A two-engineer team's agent-loop product starts logging 529 from &lt;code&gt;anthropic/claude-opus-4.8&lt;/code&gt; (the primary in their ofox routing chain). Initial retry rate of 12% jumps to 67% inside 90 seconds. The on-call pager fires on their internal SLO dashboard, not on the Anthropic status page (which stayed green for another 17 minutes).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+0:02&lt;/strong&gt; — The gateway routing config (already pinned to ofox, with a four-step fallback chain) starts shifting traffic. First failover hop to &lt;code&gt;anthropic/claude-opus-4.7&lt;/code&gt; returns the same 529s — capacity is shared inside Anthropic. Second hop to &lt;code&gt;openai/gpt-5.5&lt;/code&gt; succeeds. Median latency goes from 2.1s to 2.4s. Users see nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+0:17&lt;/strong&gt; — Anthropic's status page opens an incident at "investigating." The team's own dashboard shows 78% of traffic already on GPT-5.5; the remaining 22% are non-streaming Claude requests that succeeded before the routing tier flipped them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T+1:48&lt;/strong&gt; — Incident moves to "monitoring." 529 rate on Claude drops below the 5% threshold and the gateway gradually shifts traffic back. Final tally for the engineering team: zero customer-facing errors, a small bill skew toward GPT-5.5 for two hours, and one Slack message that read "did you notice anything?"&lt;/p&gt;

&lt;p&gt;The cost of the routing setup that absorbed this incident was roughly four hours of one-time configuration. The cost of not having it would have been a multi-hour outage on a user-facing product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-patterns we have watched fail in 2026
&lt;/h3&gt;

&lt;p&gt;The most common retry mistakes — each one observed in a production system in the first half of 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Infinite retry with linear backoff.&lt;/strong&gt; A 1-second retry loop that never gives up turns your application into a denial-of-service amplifier against Anthropic's already-overloaded infrastructure. It also blows your bill on retried token spend.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;No jitter on exponential backoff.&lt;/strong&gt; Every client that hit 529 at the same time retries at the same instant, re-creating the overload spike at 2s, 4s, 8s. Always add ±30% random offset to backoff delays.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Catching 529 as a generic Exception.&lt;/strong&gt; Hides the signal in your error logs and prevents the routing tier from acting on it. Match the type explicitly (&lt;code&gt;overloaded_error&lt;/code&gt; or HTTP 529) and re-raise other errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Setting a fallback model that lives in the same capacity pool.&lt;/strong&gt; &lt;code&gt;claude-opus-4-8&lt;/code&gt; fallback to &lt;code&gt;claude-opus-4-7&lt;/code&gt; survives model-specific incidents but not platform-wide ones. The fallback chain must cross vendors at least once.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Skipping the circuit breaker.&lt;/strong&gt; During the March 18 three-hour event, teams without circuit breakers kept issuing tens of thousands of retries against a failing API. A 529 response itself is free, but a retry that finally succeeded was still metered for tokens, even when it landed after the budget was already blown.&lt;/li&gt;
&lt;/ul&gt;
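
&lt;p&gt;Two of these are cheap to get right in code. The sketch below shows a bounded exponential schedule with ±30% jitter and an explicit 529 check. The function names are illustrative, and &lt;code&gt;is_overloaded&lt;/code&gt; assumes your SDK's exception exposes a &lt;code&gt;status_code&lt;/code&gt; attribute (the OpenAI and Anthropic Python SDKs do on their status errors); adjust it if yours differs.&lt;/p&gt;

```python
import random


def jittered_backoff(base_s=1.0, cap_s=60.0, jitter=0.30, retries=6, rng=random.random):
    """Yield a finite exponential schedule, each delay scaled by a random factor within +/- jitter."""
    for attempt in range(retries):
        nominal = min(cap_s, base_s * 2 ** attempt)    # 1s, 2s, 4s, ... capped at cap_s
        factor = 1.0 + jitter * (2.0 * rng() - 1.0)    # uniform in [1 - jitter, 1 + jitter)
        yield nominal * factor


def is_overloaded(exc):
    """True only for HTTP 529, so callers can re-raise every other error."""
    return getattr(exc, "status_code", None) == 529
```

&lt;p&gt;Because the generator stops after &lt;code&gt;retries&lt;/code&gt; delays, a loop driven by it cannot retry forever, and the random factor spreads clients that failed at the same instant across different retry times.&lt;/p&gt;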

&lt;h2&gt;
  
  
  How to Monitor Claude Status and Get Alerts
&lt;/h2&gt;

&lt;p&gt;Three layers of monitoring, in order of importance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Status page subscription.&lt;/strong&gt; Subscribe to email or Slack notifications at &lt;code&gt;status.claude.com/subscribe&lt;/code&gt;. Useful for awareness, useless for detection — the page lags real incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your own 529 counter.&lt;/strong&gt; A simple rolling-window counter on your error log is the leading indicator. Page yourself when the 529 rate on a five-minute window exceeds 5% — that is the threshold where retry-only strategies start failing. The May 2026 throttling backlash we documented in our Claude Max throttling postmortem was visible in user retry counters two weeks before Anthropic announced the May 6 reversal.&lt;/p&gt;
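
&lt;p&gt;A rolling counter like that fits in a few lines. The following is a rough sketch: the class name, the &lt;code&gt;min_requests&lt;/code&gt; floor (so a quiet service does not page on one bad response) and the injectable &lt;code&gt;clock&lt;/code&gt; are assumptions added for illustration, while the five-minute window and 5% threshold come from the paragraph above.&lt;/p&gt;

```python
import time
from collections import deque


class Rate529Monitor:
    """Track the share of 529 responses over a rolling time window."""

    def __init__(self, window_s=300.0, page_above=0.05, min_requests=50,
                 clock=time.monotonic):
        self.window_s = window_s
        self.page_above = page_above
        self.min_requests = min_requests   # assumed floor to avoid paging on tiny samples
        self.clock = clock
        self.samples = deque()             # (timestamp, is_529) pairs
        self.errors = 0                    # running count of 529s inside the window

    def observe(self, status_code):
        """Record one response; return True when the 529 rate warrants a page."""
        now = self.clock()
        is_529 = status_code == 529
        self.samples.append((now, is_529))
        self.errors += is_529
        while now - self.samples[0][0] > self.window_s:
            _, old_is_529 = self.samples.popleft()   # expire samples older than the window
            self.errors -= old_is_529
        total = len(self.samples)
        return total >= self.min_requests and self.errors / total > self.page_above
```

&lt;p&gt;Call &lt;code&gt;observe()&lt;/code&gt; with the HTTP status of every Claude response and wire a &lt;code&gt;True&lt;/code&gt; result to your paging tool.&lt;/p&gt;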

&lt;p&gt;&lt;strong&gt;Gateway-level fallback metrics.&lt;/strong&gt; If you route through ofox, the gateway exposes per-model success rates and median time-to-failover. When the &lt;code&gt;anthropic/claude-opus-4.8&lt;/code&gt; success rate dips below 95% for an hour, the gateway has already shifted traffic; the dashboard just tells you why your bill mix changed.&lt;/p&gt;

&lt;p&gt;The fastest production teams in 2026 do not detect 529 storms. Their gateway already routed around them before the engineer woke up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives That Work When Claude 529 Persists
&lt;/h2&gt;

&lt;p&gt;If a 529 window stretches past 30 minutes and your fallback chain is exhausted, here are the realistic next moves, ranked by setup time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;ofox.ai&lt;/strong&gt; — Single OpenAI-compatible endpoint covering Claude, GPT, Gemini, Qwen, DeepSeek, Kimi, Doubao, Zhipu, Mistral. 99.9% uptime, ~300ms median latency, 100+ models. Pre-write a fallback chain in any client SDK; no incident-time code changes. Best for teams that want one bill and one auth surface across vendors.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;AWS Bedrock (Claude)&lt;/strong&gt; — Separate Anthropic capacity pool with strong compliance story. Higher setup cost (AWS account, IAM, separate billing) and longer cold-start for new prompts but real value during direct-API incidents like March 18.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Google Vertex AI (Claude)&lt;/strong&gt; — Same model, third capacity pool. Similar trade-off to Bedrock.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Direct provider rotation&lt;/strong&gt; — Talk to GPT-5.5 or Gemini 3.1 Pro directly via their own SDKs. Lowest infra cost, highest per-task integration cost because you now own three SDKs and three retry strategies.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;OpenRouter or LiteLLM&lt;/strong&gt; — Alternative gateways: OpenRouter is a hosted service, LiteLLM is open source and can be self-hosted. Slower per-request than ofox (300-500ms median) and the cost stack passes upstream margin to you, but worth knowing as a comparison.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Sources Checked for This Refresh
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Anthropic official API errors documentation — &lt;a href="https://platform.claude.com/docs/en/api/errors" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/api/errors&lt;/a&gt; (verified 2026-06-08): defines 529 as &lt;code&gt;overloaded_error&lt;/code&gt; and warns it occurs during high traffic across all users&lt;/li&gt;
&lt;li&gt;  Anthropic status page — &lt;a href="https://status.claude.com" rel="noopener noreferrer"&gt;https://status.claude.com&lt;/a&gt; (incidents from March 2, March 18, March 19, June 2, 2026)&lt;/li&gt;
&lt;li&gt;  GitHub issue #35704 — Claude Max subscriber 3h+ 529 storm on March 18, 2026 (&lt;a href="https://github.com/anthropics/claude-code/issues/35704" rel="noopener noreferrer"&gt;https://github.com/anthropics/claude-code/issues/35704&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;  Anthropic Message Batches API documentation — &lt;a href="https://platform.claude.com/docs/en/build-with-claude/batch-processing" rel="noopener noreferrer"&gt;https://platform.claude.com/docs/en/build-with-claude/batch-processing&lt;/a&gt; (verified 2026-06-08): 24-hour processing window, 50% discount on all usage, processing speed "may be slowed down based on current demand" (docs do not claim a separate capacity tier)&lt;/li&gt;
&lt;li&gt;  Claude Code official CLI reference — &lt;a href="https://code.claude.com/docs/en/cli-reference" rel="noopener noreferrer"&gt;https://code.claude.com/docs/en/cli-reference&lt;/a&gt; (verified 2026-06-08): &lt;code&gt;--fallback-model&lt;/code&gt; documented for "overloaded or not available" cases, takes effect in &lt;code&gt;-p&lt;/code&gt;/background sessions, ignored interactively&lt;/li&gt;
&lt;li&gt;  Claude Code official settings reference — &lt;a href="https://code.claude.com/docs/en/settings" rel="noopener noreferrer"&gt;https://code.claude.com/docs/en/settings&lt;/a&gt; (verified 2026-06-08): no &lt;code&gt;fallbackModel&lt;/code&gt; entry in the settings.json schema as of this date&lt;/li&gt;
&lt;li&gt;  ofox.ai unified endpoint — &lt;a href="https://ofox.ai" rel="noopener noreferrer"&gt;https://ofox.ai&lt;/a&gt; (verified 2026-06-08, 100+ models, 99.9% uptime, ~300ms median latency, OpenAI-compatible base URL &lt;a href="https://api.ofox.ai/v1" rel="noopener noreferrer"&gt;https://api.ofox.ai/v1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;  Reddit r/ClaudeAI threads correlated with the March 18 and June 2, 2026 incident windows (community-reported 529 spikes and recovery times used to cross-check the status-page timeline)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/claude-api-error-529-overloaded-fix-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>api</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>Install CC Switch: Manage Claude Code, Codex, Gemini CLI, OpenCode, OpenClaw &amp; Hermes Agent from One App (2026)</title>
      <dc:creator>Owen</dc:creator>
      <pubDate>Fri, 05 Jun 2026 08:42:07 +0000</pubDate>
      <link>https://dev.to/owen_fox/install-cc-switch-manage-claude-code-codex-gemini-cli-opencode-openclaw-hermes-agent-from-3nej</link>
      <guid>https://dev.to/owen_fox/install-cc-switch-manage-claude-code-codex-gemini-cli-opencode-openclaw-hermes-agent-from-3nej</guid>
      <description>&lt;h2&gt;
  
  
  What You Can Do After This Setup, In 30 Seconds
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What you'll have&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One desktop app that switches API providers across Claude Code, Claude Desktop, Codex, Gemini CLI, OpenCode, OpenClaw, and Hermes Agent — without editing config files by hand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10 minutes (install + first provider + tray switch verified)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What you need&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;macOS 12+ / Windows 10+ / Linux (Ubuntu 22.04+, Debian 11+, Fedora 34+, Arch), Node.js 20 LTS for most CLIs (Gemini CLI and Hermes Agent floor), Node 22.19+ if you install OpenClaw, and at least one provider API key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latest version (verified)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CC Switch &lt;strong&gt;v3.16.1&lt;/strong&gt;, released 2026-06-01&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're already on CC Switch and looking for a specific problem — config file not picked up, OAuth re-login loop, Codex &lt;code&gt;wire_api&lt;/code&gt; mismatch — jump to Common Errors During Setup below. Otherwise, the next four sections walk through install → first provider → verify → multi-CLI in order.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Requirements
&lt;/h2&gt;

&lt;p&gt;CC Switch is a Tauri 2 desktop app — small (~10 MB installer), but it does need a graphical session and the right CLI runtimes underneath. Check each row before you start, because most "CC Switch isn't working" reports trace back to a missing CLI rather than the app itself.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;th&gt;Verified version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CC Switch&lt;/td&gt;
&lt;td&gt;v3.13.0 (Codex OAuth proxy floor)&lt;/td&gt;
&lt;td&gt;v3.16.1&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;v3.16.1&lt;/strong&gt; (2026-06-01)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node.js (for npm-installed CLIs)&lt;/td&gt;
&lt;td&gt;20 LTS (22.19+ for OpenClaw)&lt;/td&gt;
&lt;td&gt;22 LTS&lt;/td&gt;
&lt;td&gt;22.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code (&lt;code&gt;@anthropic-ai/claude-code&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Node 18+&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1.163&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex (&lt;code&gt;@openai/codex&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Node 16+; 0.137.0 (wire_api responses-only floor)&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.137.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI (&lt;code&gt;@google/gemini-cli&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Node 20+&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.45.1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode (&lt;code&gt;opencode-ai&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;no engine floor declared&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.16.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw (&lt;code&gt;openclaw&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Node 22.19+&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2026.6.1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent (&lt;code&gt;hermes-agent&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Node 20+&lt;/td&gt;
&lt;td&gt;latest&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.15.2&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS (desktop)&lt;/td&gt;
&lt;td&gt;macOS 12 / Windows 10 / Ubuntu 22.04 / Debian 11 / Fedora 34 / Arch&lt;/td&gt;
&lt;td&gt;Latest stable&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You don't have to install every CLI up front — CC Switch only manages the panels you actually have on disk. Start with the one or two you use most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Install CC Switch
&lt;/h2&gt;

&lt;p&gt;Pick the platform-native path. Don't mix Homebrew and manual download on macOS; auto-update only tracks one channel.&lt;/p&gt;

&lt;h3&gt;
  
  
  macOS (Homebrew)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; cc-switch
&lt;span class="c"&gt;# upgrade later with:&lt;/span&gt;
brew upgrade &lt;span class="nt"&gt;--cask&lt;/span&gt; cc-switch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cask is signed and notarized by Apple — first launch opens directly without a Gatekeeper prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  macOS (manual &lt;code&gt;.dmg&lt;/code&gt;)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the latest DMG from the Releases page, then:&lt;/span&gt;
open CC-Switch-v3.16.1-macOS.dmg
&lt;span class="c"&gt;# Drag CC Switch.app to /Applications&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Windows
&lt;/h3&gt;

&lt;p&gt;Download &lt;code&gt;CC-Switch-v3.16.1-Windows.msi&lt;/code&gt; from &lt;a href="https://github.com/farion1231/cc-switch/releases" rel="noopener noreferrer"&gt;github.com/farion1231/cc-switch/releases&lt;/a&gt; and double-click. A portable &lt;code&gt;CC-Switch-v3.16.1-Windows-Portable.zip&lt;/code&gt; is available if your machine blocks MSI installs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux (Debian / Ubuntu)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# x86_64 example&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; CC-Switch-v3.16.1-Linux-x86_64.deb
&lt;span class="c"&gt;# If apt complains about dependencies:&lt;/span&gt;
&lt;span class="c"&gt;# If dpkg reports missing dependencies:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linux (Arch)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;paru &lt;span class="nt"&gt;-S&lt;/span&gt; cc-switch-bin   &lt;span class="c"&gt;# or: yay -S cc-switch-bin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linux (universal AppImage)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x CC-Switch-v3.16.1-Linux-x86_64.AppImage
./CC-Switch-v3.16.1-Linux-x86_64.AppImage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Headless servers / CI
&lt;/h3&gt;

&lt;p&gt;The Tauri 2 desktop app needs Wayland or X11. For headless boxes — build runners, remote sandboxes — use the Rust CLI fork &lt;a href="https://github.com/saladday/cc-switch-cli" rel="noopener noreferrer"&gt;&lt;code&gt;SaladDay/cc-switch-cli&lt;/code&gt;&lt;/a&gt; instead. It reads the same WebDAV-synced config bundle, so you can author providers on a laptop and pull them down on the agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the install
&lt;/h3&gt;

&lt;p&gt;After first launch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The CC Switch window opens to the &lt;strong&gt;Claude Code&lt;/strong&gt; panel by default.&lt;/li&gt;
&lt;li&gt; A CC Switch icon appears in the menu bar (macOS) or the system tray (Windows/Linux).&lt;/li&gt;
&lt;li&gt; The top-left app switcher lists the seven managed targets — Claude Code, Claude Desktop, Codex, Gemini CLI, OpenCode, OpenClaw, Hermes Agent. Greyed-out entries are CLIs CC Switch couldn't detect on disk; that's expected if you haven't installed them yet.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the tray icon is missing on Linux, your DE may not support &lt;code&gt;StatusNotifierItem&lt;/code&gt;; install &lt;code&gt;gnome-shell-extension-appindicator&lt;/code&gt; (GNOME) or equivalent and restart the session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Install the CLIs You'll Manage
&lt;/h2&gt;

&lt;p&gt;CC Switch doesn't install the CLIs themselves; it manages their config files. Install whichever you need:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The six npm-installable CLIs, for reference. Pick what you actually use.&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code   &lt;span class="c"&gt;# Claude Code (Node 18+)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex               &lt;span class="c"&gt;# Codex (Node 16+)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli          &lt;span class="c"&gt;# Gemini CLI (Node 20+)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; opencode-ai                 &lt;span class="c"&gt;# OpenCode (no engine floor)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; openclaw                    &lt;span class="c"&gt;# OpenClaw (Node 22.19+)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; hermes-agent                &lt;span class="c"&gt;# Hermes Agent (Node 20+)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Desktop is the standalone Anthropic app — download from &lt;a href="https://claude.ai/download" rel="noopener noreferrer"&gt;claude.ai/download&lt;/a&gt; rather than npm.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;npm install -g&lt;/code&gt; fails with &lt;code&gt;EACCES&lt;/code&gt; or the new shim isn't on PATH, see our walkthrough on &lt;a href="https://ofox.ai/blog/codex-command-not-found-fix-npm-install-2026/" rel="noopener noreferrer"&gt;npm global install permission and PATH fixes for Codex CLI&lt;/a&gt; — the same fixes apply to every CLI on this list.&lt;/p&gt;
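&lt;p&gt;The Node floors in the comments above can be checked before installing anything. A minimal sketch, assuming a POSIX shell and that &lt;code&gt;node&lt;/code&gt; is on PATH; the helper name &lt;code&gt;node_meets_floor&lt;/code&gt; is hypothetical (it is not part of CC Switch or npm), and it compares only the major and minor version:&lt;/p&gt;

```shell
# Hypothetical helper: succeeds when version $1 (e.g. "v22.19.0") is at
# least major $2 and minor $3. The patch level is ignored.
node_meets_floor() {
  ver="${1#v}"            # strip the leading "v" that `node --version` prints
  major="${ver%%.*}"      # text before the first dot
  rest="${ver#*.}"        # text after the first dot
  minor="${rest%%.*}"     # text before the next dot
  if [ "$major" -gt "$2" ]; then return 0; fi
  if [ "$major" -eq "$2" ]; then
    if [ "$minor" -ge "$3" ]; then return 0; fi
  fi
  return 1
}

# Example: compare the running Node against OpenClaw's 22.19 floor.
# `command -v node` also prints the path of the binary it found.
if command -v node; then
  if node_meets_floor "$(node --version)" 22 19; then
    echo "Node is new enough for OpenClaw"
  else
    echo "Node is below 22.19; upgrade before installing OpenClaw"
  fi
fi
```

&lt;p&gt;Swap the floor arguments (for example &lt;code&gt;20 0&lt;/code&gt; for Gemini CLI) to check the other CLIs.&lt;/p&gt;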

&lt;h2&gt;
  
  
  Step 3 — Add Your First Provider
&lt;/h2&gt;

&lt;p&gt;Open CC Switch and click the &lt;strong&gt;+&lt;/strong&gt; in the top-right corner. The Add Provider panel has two tabs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;App-specific Provider&lt;/strong&gt; — only for the panel you're on right now&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Universal Provider&lt;/strong&gt; — shared across Claude Code, Codex, and Gemini CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the first run, use App-specific:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; From the &lt;strong&gt;Preset&lt;/strong&gt; dropdown, pick the provider that issued your key. CC Switch ships with ~50 third-party presets across all panels: DeepSeek, Zhipu GLM, MiniMax, Kimi, Bailian (Alibaba Qwen), AWS Bedrock, NVIDIA NIM, OpenRouter, plus a long list of relay services. Most just need the API key — the endpoint URL and protocol fields are pre-filled.&lt;/li&gt;
&lt;li&gt; Paste your &lt;strong&gt;API Key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; (Codex only) If the preset is marked "Needs Local Routing" — that's any chat-completions backend like DeepSeek or Zhipu — CC Switch turns on the &lt;strong&gt;Local Routing&lt;/strong&gt; toggle for you, which proxies Codex's Responses API protocol down to the provider's &lt;code&gt;/v1/chat/completions&lt;/code&gt;. Don't disable it; Codex 0.137.0 has dropped support for &lt;code&gt;wire_api = "chat"&lt;/code&gt; at config-load time and only accepts &lt;code&gt;responses&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Add&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What each preset actually writes
&lt;/h3&gt;

&lt;p&gt;The preset list is the convenience layer. Underneath, CC Switch writes the CLI's native config file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CLI&lt;/th&gt;
&lt;th&gt;Config file&lt;/th&gt;
&lt;th&gt;What CC Switch writes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.claude/settings.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;env.ANTHROPIC_API_KEY&lt;/code&gt; + &lt;code&gt;env.ANTHROPIC_BASE_URL&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;~/.codex/auth.json&lt;/code&gt; + &lt;code&gt;~/.codex/config.toml&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;OPENAI_API_KEY&lt;/code&gt; to &lt;code&gt;auth.json&lt;/code&gt;; &lt;code&gt;model_provider&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;base_url&lt;/code&gt;, &lt;code&gt;wire_api&lt;/code&gt; to &lt;code&gt;config.toml&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;~/.gemini/.env&lt;/code&gt; + &lt;code&gt;~/.gemini/settings.json&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;GEMINI_API_KEY&lt;/code&gt; + &lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;provider block under &lt;code&gt;provider&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; (JSON5)&lt;/td&gt;
&lt;td&gt;provider block under &lt;code&gt;models.providers&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;~/.hermes/config.yaml&lt;/code&gt; + &lt;code&gt;~/.hermes/.env&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;custom_providers&lt;/code&gt; entry + &lt;code&gt;.env&lt;/code&gt; secrets; &lt;code&gt;model.provider&lt;/code&gt; / &lt;code&gt;model.default&lt;/code&gt; on switch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The principle is "minimal intrusion" — even if you uninstall CC Switch tomorrow, every CLI still works because their native config files are intact. CC Switch's own state in &lt;code&gt;~/.cc-switch/cc-switch.db&lt;/code&gt; is purely additive.&lt;/p&gt;
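&lt;p&gt;To see which of these native files already exist on a machine, a read-only check is enough. A minimal sketch, assuming a POSIX shell; the function name &lt;code&gt;list_cli_configs&lt;/code&gt; is hypothetical, and the paths are the ones listed in the table above:&lt;/p&gt;

```shell
# Report which native config files from the table exist under a given
# home directory (defaults to $HOME). Read-only: nothing is modified.
list_cli_configs() {
  base="${1:-$HOME}"
  for rel in \
    .claude/settings.json \
    .codex/auth.json \
    .codex/config.toml \
    .gemini/.env \
    .gemini/settings.json \
    .config/opencode/opencode.json \
    .openclaw/openclaw.json \
    .hermes/config.yaml \
    .hermes/.env
  do
    if [ -f "$base/$rel" ]; then
      echo "present  $rel"
    else
      echo "missing  $rel"
    fi
  done
}

list_cli_configs
```

&lt;p&gt;A file reported as missing usually just means that CLI has not been installed or configured yet.&lt;/p&gt;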

&lt;h2&gt;
  
  
  Step 4 — Switch and Verify
&lt;/h2&gt;

&lt;p&gt;Once you have at least one provider per CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;From the app&lt;/strong&gt;: click the provider card, then &lt;strong&gt;Enable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;From the system tray&lt;/strong&gt; (the genuinely useful part): right-click the CC Switch icon → pick a CLI panel → click a provider name. The new config is written before the menu closes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then activate it on the CLI side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CLI&lt;/th&gt;
&lt;th&gt;Activation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Hot-reload — the next prompt picks up the new key with no restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;Quit and reopen the terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Hot-reload — &lt;code&gt;~/.gemini/.env&lt;/code&gt; is re-read on each request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;Quit and reopen the terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;Quit and reopen the terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;Quit and reopen the terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quick sanity test inside each CLI's session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Hello — which model am I talking to right now?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



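&lt;p&gt;Any successful reply confirms that the key is valid. A plausible model name also suggests the routing is correct, though models sometimes misreport their own identity, so treat the name as a hint rather than proof.&lt;/p&gt;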
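&lt;p&gt;For Claude Code, a switch can also be confirmed on disk by printing the base URL entry without exposing the key. A rough sketch, assuming the value is stored under an &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; key in &lt;code&gt;~/.claude/settings.json&lt;/code&gt; as described in Step 3; the function name &lt;code&gt;show_claude_base_url&lt;/code&gt; is hypothetical, and the &lt;code&gt;grep&lt;/code&gt; pattern is a line-based approximation, not a JSON parser:&lt;/p&gt;

```shell
# Print the ANTHROPIC_BASE_URL entry from a Claude Code settings file.
# $1 is the file to inspect; it defaults to ~/.claude/settings.json.
# Only the matching entry is printed, so the API key stays off screen.
show_claude_base_url() {
  file="${1:-$HOME/.claude/settings.json}"
  grep -o '"ANTHROPIC_BASE_URL"[^,}]*' "$file"
}
```

&lt;p&gt;Run &lt;code&gt;show_claude_base_url&lt;/code&gt; before and after a switch; the printed URL should change to the new provider's endpoint.&lt;/p&gt;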

&lt;h2&gt;
  
  
  Common Errors During Setup (and Fixes)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Root cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tray icon missing on Linux&lt;/td&gt;
&lt;td&gt;DE doesn't support &lt;code&gt;StatusNotifierItem&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Install &lt;code&gt;gnome-shell-extension-appindicator&lt;/code&gt; (or KDE equivalent); restart session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switch succeeds but Codex still uses old key&lt;/td&gt;
&lt;td&gt;Codex caches &lt;code&gt;config.toml&lt;/code&gt; at startup&lt;/td&gt;
&lt;td&gt;Quit the Codex process fully (not just the prompt) and restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex preset connects, but every request returns &lt;code&gt;CHAT_WIRE_API_REMOVED_ERROR&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Preset's "Needs Local Routing" toggle was turned off after adding&lt;/td&gt;
&lt;td&gt;Re-enable Local Routing on the provider card; Codex 0.137.0 rejects &lt;code&gt;wire_api = "chat"&lt;/code&gt; at config-load time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code "first-run wizard" shows up every time you switch&lt;/td&gt;
&lt;td&gt;Default onboarding flow re-triggers when &lt;code&gt;~/.claude/settings.json&lt;/code&gt; changes&lt;/td&gt;
&lt;td&gt;Settings → General → enable &lt;strong&gt;Skip Claude Code first-run confirmation&lt;/strong&gt; (writes &lt;code&gt;skipIntroduction&lt;/code&gt; to &lt;code&gt;~/.claude/settings.json&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Authentication failed (401/403)&lt;/code&gt; on every preset&lt;/td&gt;
&lt;td&gt;API key string is wrong or the key isn't authorized for the chosen model&lt;/td&gt;
&lt;td&gt;Verify the key on the provider's own dashboard first; the Auto-Fetch Models button (download icon next to the model input) calls &lt;code&gt;/v1/models&lt;/code&gt; and surfaces auth errors immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can't delete the currently active provider&lt;/td&gt;
&lt;td&gt;Last-config-standing guard&lt;/td&gt;
&lt;td&gt;Switch to another provider first, then delete; if you really want to wipe the panel, hide that CLI in Settings instead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI ignores the provider&lt;/td&gt;
&lt;td&gt;Node 18 detected, but &lt;code&gt;@google/gemini-cli&lt;/code&gt; requires Node 20&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nvm install 20 &amp;amp;&amp;amp; nvm alias default 20&lt;/code&gt;, then reinstall the CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw fails to install or start&lt;/td&gt;
&lt;td&gt;Node &amp;lt; 22.19 (engines floor &lt;code&gt;&amp;gt;=22.19.0&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nvm install 22 &amp;amp;&amp;amp; nvm alias default 22&lt;/code&gt;, then reinstall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent fails to install&lt;/td&gt;
&lt;td&gt;Node &amp;lt; 20 (engines floor &lt;code&gt;&amp;gt;=20.0.0&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nvm install 20 &amp;amp;&amp;amp; nvm alias default 20&lt;/code&gt;, then reinstall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WebDAV sync writes succeed but other devices don't see the new providers&lt;/td&gt;
&lt;td&gt;The devices point at different WebDAV roots or different custom config directories in Settings → Storage&lt;/td&gt;
&lt;td&gt;Set the same &lt;code&gt;claudeConfigDir&lt;/code&gt; / &lt;code&gt;codexConfigDir&lt;/code&gt; etc. on every device&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;~/.cc-switch/cc-switch.db&lt;/code&gt; is locked&lt;/td&gt;
&lt;td&gt;Two CC Switch instances open at once (common on macOS with iCloud sync)&lt;/td&gt;
&lt;td&gt;Quit one; CC Switch holds an exclusive SQLite lock on the file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If your CLI reports an error of its own (&lt;code&gt;model_not_found&lt;/code&gt;, &lt;code&gt;EACCES&lt;/code&gt;, &lt;code&gt;429&lt;/code&gt;, &lt;code&gt;503&lt;/code&gt;), that's the CLI's problem, not CC Switch's: CC Switch only manages the config file, not the request path. For Codex specifically, the install-time &lt;code&gt;command not found&lt;/code&gt; failure mode is covered in &lt;a href="https://ofox.ai/blog/codex-command-not-found-fix-npm-install-2026/" rel="noopener noreferrer"&gt;Codex &lt;code&gt;command not found&lt;/code&gt;: 7 fixes after &lt;code&gt;npm install -g&lt;/code&gt;&lt;/a&gt;. For anything else, start with that CLI's own runtime-error runbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Team / Multi-Developer Configuration
&lt;/h2&gt;

&lt;p&gt;The pain CC Switch actually solves is teams that share third-party keys across multiple agents — not the solo dev with one CLI. Three patterns scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1 — Universal Provider for a shared relay key
&lt;/h3&gt;

&lt;p&gt;If your team uses a single OpenAI-compatible relay (one base URL, one shared key) across Claude Code, Codex, and Gemini CLI, the Universal Provider tab is the right tool. Add it once with the relay's endpoint and key, check the boxes for the three apps to sync to, and CC Switch writes the same key to each CLI's native config file. When the relay rotates the key, you update one row and click &lt;strong&gt;Save and Sync&lt;/strong&gt; — every CLI on every laptop picks it up on next switch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2 — WebDAV-synced config bundle for laptop + CI
&lt;/h3&gt;

&lt;p&gt;Set up a WebDAV server (Nextcloud, self-hosted, whatever your team already has) and point Settings → Storage → custom config directory at it. Authors create providers on their laptops; CI runners using &lt;code&gt;SaladDay/cc-switch-cli&lt;/code&gt; pull the same SQLite snapshot. Don't sync &lt;code&gt;~/.cc-switch/cc-switch.db&lt;/code&gt; through a vanilla Dropbox/iCloud folder — those will happily overwrite the database mid-write if two devices push at the same time. WebDAV with proper locking is the safer transport for the database file specifically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3 — Per-machine override for an on-call engineer
&lt;/h3&gt;

&lt;p&gt;If one engineer needs to test a different provider without affecting the rest of the team, the device-level &lt;code&gt;~/.cc-switch/settings.json&lt;/code&gt; overrides for &lt;code&gt;claudeConfigDir&lt;/code&gt;, &lt;code&gt;codexConfigDir&lt;/code&gt;, etc. point that engineer's CC Switch at a non-shared config directory. The team's WebDAV bundle stays untouched; the override evaporates when they revert the setting.&lt;/p&gt;

&lt;p&gt;For all three, treat &lt;code&gt;~/.cc-switch/cc-switch.db&lt;/code&gt; like &lt;code&gt;~/.ssh/&lt;/code&gt; — it's not encrypted at rest, so the file's protection is filesystem permissions plus whatever transport you sync it over.&lt;/p&gt;
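&lt;p&gt;Because the database is unencrypted, owner-only permissions are a reasonable baseline. A minimal sketch, assuming a POSIX system; the function name &lt;code&gt;lock_down_cc_switch&lt;/code&gt; is hypothetical, and the modes mirror the usual &lt;code&gt;~/.ssh&lt;/code&gt; convention rather than anything CC Switch itself enforces:&lt;/p&gt;

```shell
# Restrict a CC Switch data directory to the owning user, the same way
# ~/.ssh is usually locked down. $1 is the directory; it defaults to
# ~/.cc-switch, the location named in the text.
lock_down_cc_switch() {
  dir="${1:-$HOME/.cc-switch}"
  if [ ! -d "$dir" ]; then
    echo "no such directory: $dir"
    return 1
  fi
  chmod 700 "$dir"                      # owner-only access to the directory
  if [ -f "$dir/cc-switch.db" ]; then
    chmod 600 "$dir/cc-switch.db"       # owner-only read/write on the database
  fi
  ls -ld "$dir"                         # show the resulting directory mode
}
```

&lt;p&gt;Call &lt;code&gt;lock_down_cc_switch&lt;/code&gt; with no argument on each machine that holds a copy of the database. This only covers the local file; the sync transport still needs its own protection.&lt;/p&gt;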

&lt;h2&gt;
  
  
  Advanced — MCP, Skills, Prompts, and OAuth Reuse
&lt;/h2&gt;

&lt;p&gt;Once switching providers is solved, the unified panels are where CC Switch starts pulling its weight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified MCP across five CLIs
&lt;/h3&gt;

&lt;p&gt;Click the &lt;strong&gt;MCP&lt;/strong&gt; button on any panel. CC Switch reads the MCP server list from each CLI's native location (&lt;code&gt;~/.claude.json&lt;/code&gt; for Claude Code, &lt;code&gt;[mcp_servers.*]&lt;/code&gt; blocks in &lt;code&gt;~/.codex/config.toml&lt;/code&gt;, &lt;code&gt;mcpServers&lt;/code&gt; in &lt;code&gt;~/.gemini/settings.json&lt;/code&gt;, and the equivalents in OpenCode and Hermes) and lets you toggle each server's presence per CLI. Add an &lt;code&gt;mcp-fetch&lt;/code&gt; server once, sync it to Claude Code + Codex + Gemini, and every agent has the same tool surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompts (CLAUDE.md / AGENTS.md / GEMINI.md)
&lt;/h3&gt;

&lt;p&gt;The Prompts panel is a Markdown editor with cross-app sync. Edit one prompt in CC Switch, sync to &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;, &lt;code&gt;~/.codex/AGENTS.md&lt;/code&gt;, and &lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt; in one click. Backfill protection means edits made directly in the live files are pulled back into CC Switch before the next sync overwrites them — useful if a teammate edited &lt;code&gt;CLAUDE.md&lt;/code&gt; directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills
&lt;/h3&gt;

&lt;p&gt;The Skills panel installs skills from GitHub repos or ZIP files into each CLI's skills directory (&lt;code&gt;~/.claude/skills/&lt;/code&gt;, &lt;code&gt;~/.config/opencode/skills/&lt;/code&gt;, etc.). Symlink mode is the default — one canonical copy lives in &lt;code&gt;~/.cc-switch/skills/&lt;/code&gt; and the CLIs see it via symlink. Switch to file-copy mode if your CLI is sandboxed and doesn't follow symlinks.&lt;/p&gt;
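&lt;p&gt;To tell which mode a given skills directory is actually using, it is enough to check whether its entries are symlinks. A minimal sketch, assuming a POSIX shell; the function name &lt;code&gt;describe_skills&lt;/code&gt; is hypothetical, and the directory layout is taken from the paragraph above:&lt;/p&gt;

```shell
# For each entry in a skills directory, report whether it is a symlink
# (symlink mode) or a real file or directory (file-copy mode).
describe_skills() {
  dir="$1"
  for entry in "$dir"/*; do
    if [ -L "$entry" ]; then
      echo "symlink  $(basename "$entry")"
    elif [ -e "$entry" ]; then
      echo "copy     $(basename "$entry")"
    fi
  done
}
```

&lt;p&gt;For example, &lt;code&gt;describe_skills ~/.claude/skills&lt;/code&gt; lists one line per installed skill; an empty directory prints nothing.&lt;/p&gt;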

&lt;h3&gt;
  
  
  Codex OAuth reverse proxy (since v3.13)
&lt;/h3&gt;

&lt;p&gt;CC Switch added a path that reuses your ChatGPT account's Codex entitlement inside Claude Code. It shows up as a new Claude provider card type — not as a Codex preset — and routes Claude Code's requests through a local proxy fronting &lt;code&gt;chatgpt.com&lt;/code&gt; and &lt;code&gt;auth.openai.com&lt;/code&gt;. Useful if you already pay for ChatGPT Plus or Pro. Read the in-app risk notice before relying on it for production; it's an unofficial reuse path and could break the day OpenAI tightens auth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives — When You Don't Want a Desktop App
&lt;/h2&gt;

&lt;p&gt;Three paths cover the cases where the Tauri desktop app isn't a fit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Single OpenAI-compatible gateway&lt;/strong&gt; — If you just want one base URL that Claude Code, Codex, Cursor, Cline, and OpenClaw can all point at, &lt;a href="https://ofox.ai" rel="noopener noreferrer"&gt;OfoxAI&lt;/a&gt; exposes &lt;code&gt;https://api.ofox.ai/v1&lt;/code&gt; as a unified OpenAI-compatible endpoint. Most teams reach for this when they want to avoid editing per-CLI configs at all; see the &lt;a href="https://ofox.ai/docs/integrations/claude-code" rel="noopener noreferrer"&gt;Claude Code integration guide&lt;/a&gt; for the one-line setup. You can keep CC Switch for tray-level switching and use ofox as the underlying provider, or skip CC Switch entirely if a single endpoint is enough.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Headless / scripted environments&lt;/strong&gt; — &lt;a href="https://github.com/saladday/cc-switch-cli" rel="noopener noreferrer"&gt;SaladDay/cc-switch-cli&lt;/a&gt; is a Rust CLI fork that does provider switching, MCP management, skills install, and local proxy from the terminal. Same WebDAV-compatible storage format as the desktop app. Best fit for build agents, remote dev containers, and &lt;code&gt;tmux&lt;/code&gt;-only servers.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Browser UI&lt;/strong&gt; — &lt;a href="https://github.com/Laliet/cc-switch-web" rel="noopener noreferrer"&gt;Laliet/cc-switch-web&lt;/a&gt; is a community web-based alternative covering Claude Code, Codex, and Gemini CLI. Less feature-complete than the desktop app today, but useful if you want a shared URL the whole team opens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Monitor and Stay Current
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Auto-updater&lt;/strong&gt; — Settings → About → Check for updates. The desktop app auto-checks on launch.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Release feed&lt;/strong&gt; — Subscribe to &lt;a href="https://github.com/farion1231/cc-switch/releases" rel="noopener noreferrer"&gt;github.com/farion1231/cc-switch/releases&lt;/a&gt;. Recent shipped changes worth tracking: Claude Desktop as a first-class panel, the Codex OAuth reverse proxy (v3.13), unified Skills install from GitHub (added pre-v3.16), and per-app usage dashboards.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Backup hygiene&lt;/strong&gt; — &lt;code&gt;~/.cc-switch/backups/&lt;/code&gt; keeps the 10 most recent SQLite snapshots automatically. Before any major version upgrade, also run &lt;code&gt;cp ~/.cc-switch/cc-switch.db ~/.cc-switch/cc-switch.db.bak.$(date +%F)&lt;/code&gt;; a manual snapshot survives the rolling window.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://ofox.ai/blog/cc-switch-install-multi-cli-setup-2026/" rel="noopener noreferrer"&gt;ofox.ai/blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>codex</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
