DEV Community: synthorai

Claude Opus 5 vs Opus 4.8, Measured: Same Price, 3x Apart

synthorai — Sun, 26 Jul 2026 17:41:35 +0000

Claude Opus 5 and Claude Opus 4.8 bill the same $5 per million input tokens and $25 per million output, yet on identical prompts the default Opus 5 configuration cost 3.1x as much. The reason is adaptive thinking: Opus 5 thinks by default, bills the thinking as output, and never shows it to you. One request setting closes the gap to exact parity, and it is a setting the bigger Fable 5 refuses to accept. Opus 5 went GA on 2026-07-24, positioned as Fable-5-level intelligence at half the token price; whether your bill actually halves depends almost entirely on this one choice.

TL;DR

Default Opus 5 billed 3.1x the same-priced Opus 4.8 on our five-task matrix; 42-95% of its output tokens were hidden thinking.
thinking: {"type": "disabled"} brought Opus 5 to exact parity with 4.8 (384 vs 384 output tokens) with accuracy held; Fable 5 rejects that parameter.
On agent traffic the tax collapses to +33%, with tool/batch scenarios near parity: adaptive thinking barely fires in tool loops.
The 1M context is real (needle recalled at 969,950 tokens) and the cache floor is 512 tokens, half of 4.8's 1,024.

How do Opus 5, Opus 4.8, and Fable 5 differ as platforms?

Before the cost deep-dive, the orientation map: three current Claude tiers, two axes of difference, rates and request shape. Side by side (measured items marked, the rest from the model docs):

	Opus 4.8	Opus 5	Fable 5
List price (in/out per M)	$5 / $25	$5 / $25	$10 / $50
Thinking default	off unless requested	adaptive, on (measured)	always on
`thinking: disabled`	accepted, independent of effort	accepted at effort `high` or below	rejected with 400 (measured)
Effort ladder	`low`-`max`, default `high`	`low`-`max`, default `high` (cost side measured below)	`low`-`max`, default `high`
Thinking content returned	n/a by default	never (measured)	never
Cache floor	1,024 tokens	512 tokens (measured)	512 tokens
Context window	1M	1M, default and max (969,950-token needle, measured below)	1M
Assistant prefill	rejected	rejected, named 400 (measured)	rejected
Fast mode	available (research preview)	available, $10/$50	not offered
Refusal fallbacks	serves as the default fallback target	`fallbacks` incl. new `"default"` mode (beta)	introduced here (explicit lists)
Data retention	standard options	standard options	30-day retention required

Three rows deserve a sentence. The thinking: disabled row is the migration trap: on Opus 5 it is coupled to effort, accepted only at high or below per the docs, where 4.8 treated the two settings as independent; version-gate migration scripts. The fast-mode row sets up a comparison that matters later: Opus 5 in a hurry costs exactly Fable 5's rate card, so "fast Opus 5 vs default Fable 5" is a pure speed-vs-capability trade at equal token prices. And the retention row is the quiet compliance win: Fable-class intelligence on Opus 5 comes without Fable 5's 30-day retention mandate.

Two more platform notes that resist table form. Anthropic documents edge cases when thinking is disabled (tool calls occasionally written into visible text, leaked internal tags); we did not hit either across 84 thinking-off calls in the agent suite measured below, but the guidance reinforces the routing rule: keep thinking on for tool-heavy routes, where the tax is small anyway. And mid-conversation tool changes (beta, Opus 5) let you add or remove tools between turns without busting the prompt cache, which protects the cached-prefix economics our prompt caching guide is built on for long agent sessions.
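The thinking: disabled row of the table can be sketched as a small version gate for migration scripts. This is a minimal sketch built only from the table's rules: the helper names are hypothetical, the effort level is used for gating only, and how effort is actually sent on the wire is not shown here because the post does not specify it.

```python
# Version gate for `thinking: {"type": "disabled"}`, following the comparison
# table. Model ids are the ones named in the measurement note; the helper
# names and the rule table are illustrative assumptions, not an official client.
EFFORT_LADDER = ["low", "medium", "high", "xhigh", "max"]

def thinking_off_allowed(model: str, effort: str = "high") -> bool:
    """Return True if the disabled-thinking switch should be accepted."""
    if effort not in EFFORT_LADDER:
        raise ValueError(f"unknown effort level: {effort}")
    if model == "claude-opus-4-8":
        return True  # accepted, independent of effort
    if model == "claude-opus-5":
        # accepted only at effort `high` or below
        return EFFORT_LADDER.index(effort) <= EFFORT_LADDER.index("high")
    if model == "claude-fable-5":
        return False  # rejected with a 400 at any effort
    raise ValueError(f"unknown model: {model}")

def request_fragment(model: str, effort: str, want_thinking_off: bool) -> dict:
    """Build the model/thinking part of a request, dropping the switch where it would 400."""
    body = {"model": model}
    if want_thinking_off and thinking_off_allowed(model, effort):
        body["thinking"] = {"type": "disabled"}
    return body
```

Silently dropping the switch is one design choice; a stricter script could raise instead so that a route expecting 4.8-style billing fails loudly.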

What does Opus 5 cost out of the box vs Opus 4.8?

3.1x as much, for the same work at the same rate card. We ran the three current Claude tiers through a five-task matrix (n=3 per cell, salted prompts) at their defaults, on the native Messages API, with Fable 5 alongside for scale:

Task	Opus 5 default	Opus 4.8	Fable 5	Accuracy
Trivial arithmetic	12	3	12	3/3 all
Factual one-liner	40	6	11	3/3 all
Small code function	70	38	44	—
Multi-step word problem	152	102	52	3/3 all
120-word paragraph	1,031	236	264	—
Total out tokens (cost per set)	1,305 ($0.03427)	384 ($0.01120)	383 ($0.02233)

Same answers, same rates as 4.8, three times the bill. Opus 4.8 does not think unless you ask it to; Opus 5 ships with adaptive thinking on, and the thinking bills at the full $25/M output rate.

The Fable 5 column holds the counterintuitive result: the model with double the rates ($10/$50) billed 35% less than default Opus 5 in absolute dollars, because it answered the same tasks in 383 output tokens to Opus 5's 1,305. All three models run the same documented default effort (high), so the gap is thinking calibration, not configuration. Two mechanisms fit the numbers. First, a more capable model needs less deliberation to be sure of an easy answer: Fable 5 spent 52 tokens on the word problem where Opus 5 spent 152, and 264 on the paragraph where Opus 5 spent 1,031. Second, Opus 5's headline capability is test-time compute scaling, converting extra deliberation into quality on hard problems, and its default calibration buys that insurance on every request, including the ones that need none. On easy traffic you are paying for insurance you do not use; Fable 5 mostly declines to buy it.

Where does the extra spend go?

Into reasoning you cannot read. Comparing billed output tokens against the visible answer text, 42-95% of Opus 5's default output spend was hidden thinking, and it fires even on questions that need none: the answer to 17*23 carried 11 thinking tokens behind a 1-token answer, and the 120-word writing task spent about 806 of its 1,031 output tokens deliberating. The thinking content is never returned in any form, no summary, no trace, which puts Opus 5 at the most closed end of the visibility spectrum we mapped in our token usage anatomy study, alongside Fable 5. You can see the count in the usage itemization; you cannot see what it bought.
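The share is simple arithmetic on two counts, billed output tokens and visible answer tokens. A minimal sketch using the two examples quoted above (the function name is ours, not part of any API):

```python
def hidden_share(billed_output_tokens: int, visible_answer_tokens: int) -> float:
    """Fraction of billed output tokens that never appear in the answer text."""
    if billed_output_tokens <= 0:
        raise ValueError("billed_output_tokens must be positive")
    hidden = billed_output_tokens - visible_answer_tokens
    return hidden / billed_output_tokens

# 17*23 billed 12 output tokens for a 1-token answer; the 120-word paragraph
# billed 1,031 output tokens, about 806 of them thinking.
arithmetic = hidden_share(12, 1)
paragraph = hidden_share(1031, 1031 - 806)
print(f"{arithmetic:.0%} {paragraph:.0%}")  # 92% 78%
```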

What does the thinking switch actually do?

It turns Opus 5 into Opus 4.8's bill. Sending thinking: {"type": "disabled"} drove thinking to zero on every task, and the totals landed at exact parity with 4.8, 384 output tokens to 384, $0.01130 to $0.01120 per set:

Arm	Output tokens (set)	Cost (set)	vs Opus 4.8	Accuracy (3 checkable tasks)
Opus 5 default (= effort `high`)	1,305	$0.03427	3.1x	9/9
Opus 5, effort low	1,019	$0.02720	2.4x	9/9
Opus 5, effort medium	1,167	$0.03089	2.8x	9/9
Opus 5, explicit effort high	1,514	$0.03956	3.5x	9/9
Opus 5, effort xhigh	1,633	$0.04262	3.8x	8/8
Opus 5, effort max	1,569	$0.04093	3.7x	9/9
Opus 5, thinking disabled	384	$0.01130	1.0x	9/9
Opus 4.8 default (= effort `high`, no thinking)	384	$0.01120	1.0x	9/9
Fable 5 default (= effort `high`, thinking always on)	383	$0.02233	2.0x	9/9

A labeling note first: every model here documents the same API default, effort high, and Anthropic states an explicit high is identical to omitting the parameter. Our implicit-default and explicit-high arms still differ by 16%, which is run-to-run variance on the writing task (the noisiest cell in every batch we have run), not a real distinction; read those two rows as one arm measured twice. What separates the three defaults is not the effort level but what thinking does at that level: none on 4.8, frugal and adaptive on Fable 5, aggressive and adaptive on Opus 5.

Two things stand out. First, the effort dial is a quality ladder, not a cost knob: the ladder itself (low through max) is not new, 4.8 accepts the same range, but on tasks this simple every step above low just buys more deliberation at identical accuracy, with xhigh topping out at 3.8x the 4.8 bill. The top tiers exist for test-time compute scaling on genuinely hard problems, which a five-task sanity matrix cannot exercise; what it can show is the cost side, and the cost side says the dial never gets you back to 4.8 parity. Only the off switch does, at 67% below the default. Second, the switch exists at all: Fable 5 rejects thinking: {"type": "disabled"} with a 400, so this is a genuine Opus 5 differentiator, not a family constant. The same thinking object works on both gateway surfaces, /v1/messages and the OpenAI-compatible /v1/chat/completions; on the latter, disabled runs report zero reasoning_tokens in completion_tokens_details:

{"model": "claude-opus-5", "thinking": {"type": "disabled"}}

The honest caveat: our checkable tasks are retrieval- and single-step-shaped, and accuracy held 9/9 with thinking off, including the multi-step word problem. Harder agentic work is exactly what adaptive thinking exists for, so treat the switch as a per-route decision, the same rule we landed on for Kimi K3 and Gemini 3.6 Flash: off for extraction, formatting, and single-step calls; default where your evals say the thinking earns its bill.
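The multiples in the ablation table can be recomputed from the per-set costs alone. A short sketch, using only the dollar figures reported above:

```python
OPUS_48_COST = 0.01120   # Opus 4.8 default, dollars per five-task set
DEFAULT_COST = 0.03427   # Opus 5 default, dollars per five-task set

# Opus 5 arms from the ablation table, dollars per set
ARMS = {
    "effort low": 0.02720,
    "effort medium": 0.03089,
    "explicit effort high": 0.03956,
    "effort xhigh": 0.04262,
    "effort max": 0.04093,
    "thinking disabled": 0.01130,
}

def multiple_of_48(cost: float) -> float:
    """Cost as a multiple of the Opus 4.8 bill."""
    return cost / OPUS_48_COST

def change_vs_default(cost: float) -> float:
    """Relative change against the Opus 5 default bill (-0.21 means 21% cheaper)."""
    return cost / DEFAULT_COST - 1

for arm, cost in ARMS.items():
    print(f"{arm:22s} {multiple_of_48(cost):.1f}x of 4.8, {change_vs_default(cost):+.0%} vs default")
```

Every effort arm stays well above 1.0x of the 4.8 bill; only the disabled arm lands there.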

Does the 3x tax hold on agent workloads?

No, and the difference is the point. Across our agent scenario suite (tool loops, RAG, tooling, batch, long chat; 50 episodes per arm), default Opus 5 cost only 33% more than Opus 4.8, not 210%, and the tool-shaped scenarios ran near parity (1.01-1.22x). Adaptive thinking is genuinely adaptive there: about 88 thinking tokens per call inside agent loops versus 806 on a bare writing prompt. The outlier is long chat at 1.58x, which is where the off switch still pays (1.22x with thinking disabled). Function calling showed no thinking tax at all: at the default, a tool-call request came back in 52 output tokens with no thinking attached, the same budget a non-thinking model would spend.

So the practical split is by traffic shape, not by model: bare completions and chat-shaped calls carry the 3x default tax and want the switch; tool-heavy agent traffic mostly does not need it.

Is Opus 5 really half the price of Fable 5?

Only after you flip the switch. Per token, yes: $5/$25 versus Fable 5's $10/$50. In practice, on our bare-task matrix, default Opus 5 billed $0.03427 per set against Fable 5's $0.02233, 53% more in absolute dollars, because Fable 5 answered the same tasks in 383 output tokens to Opus 5's 1,305. With thinking disabled, Opus 5's $0.01130 is almost exactly half of Fable 5's bill, which is the launch promise made real, through a parameter Fable 5 itself does not accept.

Context, cache, and tokenizer: what else did we verify?

The 1M window is real and fails loud. A recall needle at the front of a 969,950-token prompt came back correct in 39 seconds, and a 1,010,221-token prompt returned a clean prompt is too long: … > 1000000 maximum rather than silently truncating.

The cache floor halved. Anthropic documents a 512-token minimum cacheable prefix for Opus 5 (and Fable 5), down from 1,024 on Opus 4.8 and Sonnet 5; our sweep matched it, with prefixes near 511 tokens never caching and 547 caching reliably. Cached reads bill $0.50/M (0.1x), writes 1.25x, 5-minute TTL. Shorter system prompts are now cacheable, which quietly matters for high-QPS routes.
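To see when a cached prefix pays, here is a rough sketch of the arithmetic at the rates above. It assumes every repeat call lands inside the 5-minute TTL, that the whole prefix is cached, and that the floor is inclusive at 512 tokens; the function name is ours.

```python
INPUT_RATE = 5.00 / 1_000_000   # dollars per fresh input token on Opus 5
WRITE_MULT = 1.25               # cache write, relative to the fresh rate
READ_MULT = 0.10                # cache read, relative to the fresh rate
CACHE_FLOOR = 512               # minimum cacheable prefix, in tokens

def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix `calls` times."""
    if calls <= 0:
        return 0.0
    base = prefix_tokens * INPUT_RATE
    if not cached or prefix_tokens < CACHE_FLOOR:
        return base * calls  # below the floor nothing caches
    # One write on the first call, then reads (assumes every later call hits).
    return base * (WRITE_MULT + READ_MULT * (calls - 1))

for calls in (1, 2, 10):
    print(calls, round(prefix_cost(600, calls, False), 5), round(prefix_cost(600, calls, True), 5))
```

Under these assumptions a single call costs more with caching (the 1.25x write), and the cached route is already cheaper from the second call on.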

The tokenizer is unchanged across Opus 5, Opus 4.8, Fable 5, and Sonnet 5: identical token counts on our multilingual and code samples, so per-language budgets and prompt-size estimates carry over with no re-baselining.

FAQ

Can you turn thinking off on Claude Opus 5?

Yes, at effort high or below; the docs state that combining it with xhigh or max returns a 400. In our probes it drove thinking tokens to zero and brought cost to parity with Opus 4.8 (384 vs 384 output tokens on our matrix). This is specific to Opus 5: Fable 5 rejects the same parameter at any effort with a 400. The effort dial (low through max) also works but never reaches parity: from -21% at low to +24% at xhigh versus the default in our matrix.

Why is my Opus 5 bill higher than Opus 4.8 at the same list prices?

Because Opus 5 thinks by default and the thinking bills as output at $25/M. On bare prompts, 42-95% of its billed output was hidden reasoning in our measurements; a two-digit multiplication carried 11 thinking tokens behind a 1-token answer. Read reasoning_tokens from the usage itemization to see the share on your own traffic, and disable thinking on routes that do not need it.

Should agent workloads disable thinking on Opus 5?

Usually not. In our agent suite the default cost only +33% versus Opus 4.8, with tool and batch scenarios near parity, because adaptive thinking barely fires inside tool loops. The exception is long chat-shaped sessions (1.58x), where the switch still pays. Measure your own mix; the tax lives in bare completions, not tool calls.

Measured 2026-07-25 through 2026-07-27 on claude-opus-5, claude-opus-4-8, and claude-fable-5 via the Synthorai gateway: five-task matrix and effort/switch ablation from one canonical batch (n=3 per cell, salted prompts, native Messages API), agent numbers from a 150-episode scenario suite, context and cache probes from needle-recall and prefix sweeps, API-shape rows (prefill, switch acceptance) from direct request probes. Accuracy counts use tasks with a single checkable answer. Prices and behavior may change; verify against your own usage records.

Gemini 3.6 Flash: the Thinking Dial That Moves Cost 30x (Measured)

synthorai — Fri, 24 Jul 2026 11:16:46 +0000

Gemini 3.6 Flash charges you for thinking tokens on top of the answer, and how many it spends is a dial you control per request. On the same 120-word writing task, the default setting billed $0.03316 and the minimal setting billed $0.00110, a 30x swing for output a reader could not tell apart. That dial is the most important cost decision on this model, and it comes with one sharp edge. Gemini 3.6 Flash went generally available on 2026-07-21 at $1.50 per million input tokens and $7.50 per million output, down from $9 output on 3.5 Flash. It shipped alongside Gemini 3.5 Flash-Lite and a security-tuned 3.5 Flash Cyber; this post measures the two general-purpose tiers, 3.6 Flash and Flash-Lite.

TL;DR

reasoning_effort: "minimal" cut per-call cost 91–97% versus the default (a 30x swing on a 120-word task), free on single-step, structured-output, and tool-calling work but breaking multi-step math 3/3 → 0/3.
Google's "17% fewer output tokens" is workload-dependent: our reasoning-heavy tasks ran 19% lighter (32% cheaper), our agent suite 9% heavier (6% cheaper).
The 1M context is real (a needle at 972K tokens recalled) and prompt caching matches Google's published 4,096-token floor exactly, a clean spec match unlike some "1M context" models that undershoot what they advertise.

Everything below was measured on 2026-07-24 through the Synthorai gateway, with repeated prompts salted to defeat caches; raw usage records back every number.

What does Gemini 3.6 Flash cost per task at default settings?

Reasoning dominates the output bill, and it is charged whether or not you see it. At the default effort, the model spends far more tokens thinking than answering, and those reasoning tokens bill at the full $7.50/M output rate:

Task	Answer tokens	Reasoning tokens (billed)	Cost per call
Factual one-liner	2	69	$0.00056
Trivial arithmetic	3	167	$0.00131
Small code function	29	379	$0.00312
Multi-step word problem	4	472	$0.00368
120-word paragraph	139	4,274	$0.03316

The pattern is the one to internalize: a two-token factual answer still carried 69 tokens of reasoning, and the 120-word paragraph spent 30x more tokens thinking than writing. Reasoning tokens are itemized in completion_tokens_details.reasoning_tokens, so you can see the count, but never the content. Gemini returns no thinking summary or trace at all, the most closed end of the spectrum we mapped in our token usage anatomy study, where Kimi K3 returns its full chain of thought and GPT-5.6 a summary. The next section is about turning that spend down.
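The output side of that table is just (answer + reasoning) tokens at the output rate. A small sketch that recomputes it; it covers output spend only, so each figure comes out slightly below the per-call cost in the table, which also includes the input tokens.

```python
OUTPUT_RATE = 7.50 / 1_000_000  # dollars per output token on 3.6 Flash

# (answer tokens, billed reasoning tokens) from the default-effort table
TASKS = {
    "Factual one-liner": (2, 69),
    "Trivial arithmetic": (3, 167),
    "Small code function": (29, 379),
    "Multi-step word problem": (4, 472),
    "120-word paragraph": (139, 4274),
}

def output_cost(answer_tokens: int, reasoning_tokens: int) -> float:
    """Dollar cost of the output side of one call; reasoning bills like answer text."""
    return (answer_tokens + reasoning_tokens) * OUTPUT_RATE

for task, (answer, reasoning) in TASKS.items():
    share = reasoning / (answer + reasoning)
    print(f"{task:24s} ${output_cost(answer, reasoning):.5f} output, {share:.0%} reasoning")
```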

What does the thinking dial actually do?

It is a genuine, monotonic cost lever, and on most task types it is nearly free money. Setting reasoning_effort (or the native thinking_config.thinking_level) to minimal drove reasoning tokens to zero and cut cost 91–97% per task:

Task	Default cost	`minimal` cost	Swing	Accuracy default → minimal
Factual one-liner	$0.00056	$0.00005	12x	3/3 → 3/3
Trivial arithmetic	$0.00131	$0.00006	22x	3/3 → 3/3
Small code function	$0.00312	$0.00028	11x	—
Multi-step word problem	$0.00368	$0.00014	26x	3/3 → 0/3
120-word paragraph	$0.03316	$0.00110	30x	—

The dial is real and the accepted values are minimal, low, medium (the default), and high; each step bought monotonically more reasoning in our probes (minimal 0 tokens, low ~180, medium ~530, high ~650). The one thing minimal cannot do is think, and multi-step arithmetic needs to: forced to answer the pencils-and-bags word problem tersely, the model got it wrong all three times, with scattered wrong answers rather than one systematic slip. On retrieval, classification, formatting, and single-step questions, minimal held accuracy and cut the bill by an order of magnitude.

The practical rule mirrors what we found on Kimi K3: minimal is a defensible default for extraction, lookup, and formatting, and a footgun for anything that needs intermediate steps. Set it per route, not globally, and verify accuracy on your own tasks before shipping it on a reasoning-heavy one.

Two high-volume production shapes make the case concrete: structured output and function calling both spend reasoning at the default, and both are safe to run at minimal. A schema-constrained extraction (response_format with a JSON schema) billed 337 reasoning tokens at the default and returned valid JSON; at minimal it billed zero reasoning, still returned valid schema-conforming JSON, and cost 9x less. A function call behaved the same way: 74 reasoning tokens and a correct get_weather(city) call at the default, versus zero reasoning and the same correct call at minimal, 4x cheaper. These are single-step tasks dressed up as "structured," and the model does not need to think its way to a field it was told to fill, so if your traffic is extraction or tool routing, minimal is close to free money.
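The per-route rule can be sketched as a lookup that picks reasoning_effort before the request is built. The route names and the mapping are hypothetical and should come from your own evals; the payload follows the usual OpenAI-compatible chat shape, with gemini-3.6-flash as the model id from the measurement note.

```python
ACCEPTED_EFFORTS = ("minimal", "low", "medium", "high")

# Hypothetical route table: single-step shapes run at minimal, anything that
# needs intermediate steps keeps the default (medium).
ROUTE_EFFORT = {
    "extraction": "minimal",
    "lookup": "minimal",
    "formatting": "minimal",
    "tool_routing": "minimal",
    "multi_step_math": "medium",
}

def chat_payload(route: str, prompt: str, model: str = "gemini-3.6-flash") -> dict:
    """Build a chat-completions body with the effort chosen per route."""
    effort = ROUTE_EFFORT.get(route, "medium")  # unknown routes keep the default
    if effort not in ACCEPTED_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {effort}")
    return {
        "model": model,
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Defaulting unknown routes to medium rather than minimal keeps the failure mode expensive instead of wrong, which matches the 3/3 → 0/3 result on the word problem.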

Does the "17% fewer output tokens" claim hold up?

It depends on the workload, and the split is instructive. Google's launch positioned 3.6 Flash as spending about 17% fewer output tokens than 3.5 Flash on the Artificial Analysis Index (up to 65% on individual agentic evals). We ran both models through two of our own testbeds and got opposite signs:

Testbed	Output tokens 3.6 vs 3.5	Cost 3.6 vs 3.5
Task matrix (five short tasks, reasoning-heavy)	−19%	−32%
Agent suite (tool loop, RAG, batch, long chat)	+9%	−6%

On the reasoning-heavy short tasks, the claim reproduced and slightly beat its headline: total output fell 19% against Google's 17%, and the drop came almost entirely from thinking, not the answer. Splitting the output tokens in a paired rerun of both models, the visible answer shrank only 4% while reasoning fell 19%, concentrated in the math and writing tasks where 3.6 reaches the same result with less deliberation. That is the mechanism behind the benchmark: on work that leans on the thinking budget, 3.6 is genuinely more efficient at the same answer.

On agentic, multi-turn traffic the sign flips: 3.6 spent about 9% more output than 3.5 across the suite. The efficiency gain lives in the reasoning phase, and agent loops spend proportionally less of their budget there, so there is less to save and 3.6's slightly longer turns win out. Either way the bill drops, because the two effects stack differently: reasoning-heavy tasks save on both tokens and the $9→$7.50 rate cut (−32%), while agent traffic saves on price alone (−6%). The honest summary is that "17% fewer output tokens" is real where thinking dominates the output and inverts where it does not, so measure your own mix rather than assume the headline, and remember the dial from the previous section moves this far more than the version bump does.
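How the two effects stack on the output side can be sketched in one line of arithmetic: the token change multiplies with the rate change. This models output spend only and ignores input tokens, so it is an approximation of the whole-bill numbers in the table, not a reproduction of them.

```python
OLD_RATE = 9.00   # $/M output tokens, 3.5 Flash
NEW_RATE = 7.50   # $/M output tokens, 3.6 Flash

def output_cost_change(token_change: float) -> float:
    """Relative change in output spend for a relative change in output tokens."""
    return (1 + token_change) * (NEW_RATE / OLD_RATE) - 1

print(f"task matrix: {output_cost_change(-0.19):+.1%}")  # -32.5%
print(f"agent suite: {output_cost_change(+0.09):+.1%}")  # -9.2%
```

The task-matrix figure lines up with the measured −32% because output dominates those bills. The agent-suite estimate is larger than the measured −6%; a plausible reading is that input spend, which this sketch leaves out, dilutes the output-side saving on input-heavy agent traffic.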

Is the 1M context window real?

Yes, and it fails loud rather than silent. We placed a recall needle at the front of prompts of increasing size: it was still recalled correctly at 972K input tokens, and a prompt past the limit returned a clean 400 input token count exceeds the maximum rather than silently dropping content. That is worth stating because not every "1M-context" model on the market actually serves the window it advertises. One testing note for anyone reproducing this: pad with varied, sentence-shaped filler, because a prompt built from a single repeated token pushed the model into degenerate gibberish well before the size limit.

Prompt caching is automatic and matches the spec on the number that matters. Google documents a 4,096-token minimum for context caching on the Flash models, and our sweep landed exactly there: prefixes at or below ~2.1K never cached, hits began around 4.1K tokens, and each hit left roughly the last 2.1K uncached, after a 5-to-8 call warm-up. Cached input reads at $0.15/M, a 10x discount off the $1.50 fresh rate. This is worth stating plainly because it is the reassuring case: unlike some models we have measured whose advertised numbers overstate what the endpoint actually delivers, Gemini 3.6 Flash's cache floor and its 1M window both do what the docs say. Caching still only pays for genuinely long, stable prefixes, and note the Flash tiers support only automatic (implicit) caching, not the explicit cached-content API, so you cannot manually pin a big document and reuse it below the floor.

Where does Gemini 3.5 Flash-Lite fit?

Flash-Lite is the predictable-cost tier. It never spends reasoning tokens silently, so its bill tracks visible output one-to-one. On the same multi-step math problem Flash-Lite billed $0.00057 against 3.6 Flash's $0.00368 default, roughly 6x cheaper, and it worked the answer out in the open rather than in a hidden reasoning field. At $0.30/M input and $2.50/M output it is the right default for high-volume, latency-sensitive, single-step work; step up to 3.6 Flash when a task needs the reasoning the dial can add back. The tokenizer is unchanged not just across the three new models but back to Gemini 2.5 Flash: identical token counts on English, Chinese, Japanese, Korean, and Python across every generation we checked, so per-language budgets built for 2.5 carry to 3.6 without re-baselining.

FAQ

Can you turn reasoning off completely on Gemini 3.6 Flash?

reasoning_effort: "minimal" (or thinking_level: "minimal") drove reasoning tokens to zero in our probes and is the floor of the dial; the accepted steps are minimal, low, medium, and high. There is no separate "disabled" state, and attempts to hard-disable reasoning are rejected upstream, so minimal is as low as it goes, and for single-step tasks it is low enough.

Why is my Gemini bill higher than the visible answer suggests?

Because reasoning tokens bill at the full output rate and are not part of the text you get back. A two-token answer can carry dozens to thousands of billed reasoning tokens; read completion_tokens_details.reasoning_tokens (or reconcile total_tokens − prompt − completion) to see the real output charge, and turn the dial down where the task allows.
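Both ways of reading the hidden count can be sketched in one helper. The field names prompt_tokens and completion_tokens are the usual OpenAI-compatible usage keys and are an assumption here; the sample records are hypothetical, with the reasoning and answer counts borrowed from the factual one-liner above and a made-up prompt size.

```python
def reasoning_tokens(usage: dict) -> int:
    """Billed reasoning tokens for one call, from a usage record."""
    details = usage.get("completion_tokens_details") or {}
    if "reasoning_tokens" in details:
        return details["reasoning_tokens"]
    # Fallback: whatever total_tokens holds beyond prompt + visible completion.
    return usage["total_tokens"] - usage["prompt_tokens"] - usage["completion_tokens"]

# Hypothetical usage records: 2 answer tokens, 69 reasoning tokens, 20 prompt tokens.
itemized = {
    "prompt_tokens": 20,
    "completion_tokens": 2,
    "total_tokens": 91,
    "completion_tokens_details": {"reasoning_tokens": 69},
}
bare = {"prompt_tokens": 20, "completion_tokens": 2, "total_tokens": 91}

print(reasoning_tokens(itemized), reasoning_tokens(bare))  # 69 69
```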

Gemini 3.6 Flash or Claude Haiku 4.5?

They occupy the same fast-tier slot at similar prices, and the split is by workload, not by a single winner. On our cost lens, 3.6 Flash's thinking dial is the differentiator: minimal makes it an order of magnitude cheaper on single-step traffic, while its default spends reasoning that Haiku 4.5, at $1/$5, does not. Published benchmarks give Haiku 4.5 the edge on coding depth and 3.6 Flash the lead on math and raw token price; pick by which your traffic is made of, and measure both on your own tasks before committing.

Is Gemini 3.6 Flash cheaper than 3.5 Flash?

Yes, in every workload we measured, though by how much depends on the shape. Output dropped from $9/M to $7.50/M, and on reasoning-heavy short tasks 3.6 also spent fewer output tokens, so cost fell about 32%; on agent traffic it spent slightly more tokens and the saving came from the rate cut alone, about 6%. Either way it is cheaper; migrate and re-measure your own mix. For per-token cost decomposition across families, see our token usage anatomy study.

Measured 2026-07-24 on gemini-3.6-flash, gemini-3.5-flash, and gemini-3.5-flash-lite via the Synthorai gateway; task-matrix and agent-suite token counts from per-call usage records, effort-dial results from a salted five-task ablation (n=3 per cell), context and cache probes from needle-recall and prefix sweeps. Accuracy counts use tasks with a single checkable answer. Prices and behavior may change; verify against your own usage records.

Seedance API Pricing, Measured: the Video-Token Formula, Solved

synthorai — Wed, 22 Jul 2026 17:48:31 +0000

Seedance bills video by tokens, and the formula that actually matches the meter is encoded_width × encoded_height × (24 × seconds + 1) / 1024. Two parts of that are not in any documentation we could find: the +1 frame, and the fact that the encoded dimensions are not the nominal ones (720p bills as 1248×704, not 1280×720). We ran 27 generations across five Seedance models and every resolution tier, and every billed token count reconciles against that formula to the token. This post maps the family, shows the two-request API, then gives the solved formula, the per-second price ladder that falls out of it, and the two places the rate card misleads.

TL;DR

Seedance video tokens = encoded width × height × (24 × seconds + 1) / 1024; the +1 frame and the encoded dimensions are measured, not documented.
720p bills as 1248×704 and 1080p as 1920×1088; aspect ratio changes nothing.
The same 4-second clip billed identical tokens on every tier from 1.5-pro up; rates run $1.0 to $7.0 per million.
4k's $4.0/M rate undercuts 1080p's $7.7/M yet costs $0.78 per second against $0.38, 2.1x more.
generate_audio leaves tokens unchanged and doubles the rate on 1.5-pro.

Everything below was measured on 2026-07-22 through the Synthorai gateway's /v1/videos endpoint, where the Seedance family is live at ByteDance list prices; raw per-task records back every number.

Which Seedance model does what?

Tier choice changes the rate, never the meter: pick by capability, then let the next sections price it. The same 4-second 480p prompt billed identical tokens on Seedance 2.0, 2.0-fast, 2.0-mini, and 1.5-pro (40,594 each; 1.0-pro-fast billed 39,285 on its slightly smaller pixel grid). What you pay for going up the ladder is capability, and the family splits it unevenly:

	Resolutions	Duration	Audio	Image input	`seed` / `camera_fixed`	Rate ($/M tokens)
2.0	480p-4k	4-15s (+auto)	✓ native	first + last frame	✗	$7.0 (1080p $7.7 · 4k $4.0)
2.0-fast	480p-720p	4-15s (+auto)	✓	first + last frame	✗	$5.6
2.0-mini	480p-720p	4-15s (+auto)	✓	first + last frame	✗	$3.5
1.5-pro	480p-1080p	4-12s (+auto)	toggle ($1.2 / $2.4)	first + last frame	✓	$1.2-2.4
1.0-pro-fast	480p-1080p	2-12s	✗	first frame only	✓	$1.0

Three quirks worth knowing before you pick. Reproducibility flows the wrong way: seed and camera_fixed exist only on the 1.x models, so the cheapest tiers are the controllable ones and the flagship is not. Capability fields are hard-gated per model: sending generate_audio to a 1.0 model returns a named 400 (extension_not_supported), so build requests from each model's capability list instead of one shared shape. And only 1.0-pro-fast goes down to 2-second clips, which is why the formula measurements below lean on it; everything newer starts at 4 seconds. Beyond the table, the 2.0 models also accept reference-file input upstream (up to 9 images, 3 video clips, and 3 audio files) where a platform exposes it.
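Building requests from a capability list can be sketched as below. The mapping is read off the table and is an assumption about which optional fields each tier accepts; the keys are the table's tier labels, not real model ids, and the helper is hypothetical.

```python
# Optional fields each tier is assumed to accept, per the capability table.
CAPABILITIES = {
    "2.0": {"generate_audio"},
    "2.0-fast": {"generate_audio"},
    "2.0-mini": {"generate_audio"},
    "1.5-pro": {"generate_audio", "seed", "camera_fixed"},
    "1.0-pro-fast": {"seed", "camera_fixed"},
}
GATED_FIELDS = {"generate_audio", "seed", "camera_fixed"}

def build_request(tier: str, **fields) -> dict:
    """Return the request fields, refusing gated ones the tier does not support."""
    allowed = CAPABILITIES[tier]
    unsupported = sorted(k for k in fields if k in GATED_FIELDS and k not in allowed)
    if unsupported:
        # Fail locally instead of waiting for the upstream 400.
        raise ValueError(f"{tier} does not support: {', '.join(unsupported)}")
    return dict(fields)
```

Raising locally mirrors the named 400 (extension_not_supported) without spending a round trip on it.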

How do you call the Seedance API?

Video generation is an async job API: one POST creates the task and returns in about two seconds, then you poll the task URL until it completes. The whole flow in Python:

import os, time, requests

BASE = "https://synthorai.io/v1"
auth = {"Authorization": f"Bearer {os.environ['SYNTHORAI_API_KEY']}"}

task = requests.post(f"{BASE}/videos", headers=auth, json={
    "model": "seedance-1-5-pro-251215",
    "prompt": "A paper boat drifting across a rain puddle, cinematic",
    "resolution": "720p", "ratio": "16:9",
    "duration": 5, "generate_audio": True,
}).json()                                    # returns in ~2s: {"id": "vid_...", "status": "queued", ...}

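while task["status"] not in ("completed", "failed", "cancelled"):
    time.sleep(8)                            # poll until the task reaches a terminal state
    task = requests.get(f"{BASE}/videos/{task['id']}", headers=auth).json()

if task["status"] != "completed":            # only a completed task is expected to carry a clip
    raise SystemExit(f"video task ended as {task['status']}")

print(task["data"][0]["url"])                # signed MP4, valid 24h
print(task["usage"]["total_tokens"])         # the billing meter this post is about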

If you would rather block than poll, send a Prefer: wait=60 header on the create and the response holds until the clip is done or the window expires, then degrades back to the polling object. Generation itself took anywhere from 20 seconds to a bit over two minutes for the 480p and 720p clips in our runs. The usage.total_tokens field on the completed task is the billing meter, and the rest of this post is about what drives it.

What is a video token, exactly?

A video token is a fixed slice of output pixels over time: the billed count is W × H × frames / 1024, where the frame count is 24 × seconds + 1 and W×H are the encoder's real dimensions, not the label on the resolution tier. We recovered both terms from the meter itself. Billed token counts across three durations at each resolution fit a straight line with zero residual, the line's slope gives tokens per second, and its intercept, at every resolution, is exactly one frame's worth of tokens (405 at 480p, 858 at 720p, 2,040 at 1080p). Solving the slopes for W×H gives the real encoded grids:

Nominal tier	Encoded dimensions (measured)	Tokens per second	One extra frame
480p	864×480 (1.0 series) / 864×496 (1.5/2.0 series)	9,720 / 10,044	405 / 418
720p	1248×704 (not 1280×720)	20,592	858
1080p	1920×1088 (not 1920×1080)	48,960	2,040
4k	3840×2160 (nominal = encoded)	194,400	8,100
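The recovery step itself is an ordinary straight-line fit. A sketch on the 1.0 series at 480p: the 4-second point (39,285 tokens) is the one quoted earlier in this post, while the 2- and 8-second points are generated from the solved formula for illustration and are not raw records.

```python
def fit_line(points):
    """Least-squares slope and intercept for (seconds, tokens) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# (seconds, billed tokens); only the 4 s point is a quoted measurement.
points = [(2, 19_845), (4, 39_285), (8, 78_165)]
slope, intercept = fit_line(points)

pixels = slope * 1024 / 24   # W x H implied by the tokens-per-second slope
print(round(slope), round(intercept), round(pixels))  # 9720 405 414720
```

The implied pixel count, 414,720, is 864×480, and the intercept is one frame of that grid divided by 1024.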

One independent cross-check: ByteDance's widely circulated worked example for 2.0, a 15-second clip at about 308,880 tokens, is exactly 15 times our measured 720p rate, so the 720p grid holds beyond the 1.0 series we fitted it on (and their marketing math drops the +1 frame).

Two practical consequences. Aspect ratio does not move the bill: 16:9 and 9:16 billed identical tokens in all nine paired runs, so vertical output is not a cost decision. And the widely quoted approximation W × H × 24 × duration / 1024 undershoots by one frame and uses the wrong dimensions, which is why third-party estimates drift a few percent from real invoices.
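Put together, the formula fits in a few lines. This sketch uses the encoded grids from the table; the tier keys are our own labels, and the truncating division is an assumption that happens to match every count quoted in this post.

```python
# Encoded dimensions per nominal tier, as measured above.
ENCODED = {
    "480p-1.0": (864, 480),    # 1.0 series
    "480p": (864, 496),        # 1.5 / 2.0 series
    "720p": (1248, 704),
    "1080p": (1920, 1088),
    "4k": (3840, 2160),
}
FPS = 24

def video_tokens(tier: str, seconds: int) -> int:
    """Billed tokens: W x H x (24 x seconds + 1) / 1024, truncated."""
    width, height = ENCODED[tier]
    frames = FPS * seconds + 1   # the extra frame is the measured intercept
    return width * height * frames // 1024

print(video_tokens("480p", 4))      # 40594
print(video_tokens("480p-1.0", 4))  # 39285
print(video_tokens("4k", 4))        # 785700
```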

What does Seedance actually cost per second?

Multiply the measured token rates by each tier's list rate and the whole catalog collapses into one ladder:

Model @ resolution	$/second (measured tokens × list rate)
seedance-1.0-pro-fast @ 480p	$0.0097
seedance-1.5-pro @ 480p, no audio	$0.0121
seedance-1.0-pro-fast @ 720p	$0.0206
seedance-1.5-pro @ 480p, with audio	$0.0241
seedance-2.0-mini @ 480p	$0.0352
seedance-1.0-pro-fast @ 1080p	$0.0490
seedance-2.0-fast @ 480p	$0.0563
seedance-2.0 @ 480p	$0.0703
seedance-2.0 @ 4k	$0.778

For context against the per-second billers' July 2026 list prices (as aggregated by public pricing trackers): Kling runs about $0.07/s, Sora 2 about $0.10/s, and Veo 3.1 about $0.40/s. Seedance 2.0 at 480p sits at Kling's price point with native multi-track audio included, and the 1.0-pro-fast tier generates watchable 480p at roughly a seventh of Kling's rate. A 15-second 720p clip on 2.0 lands around $2.17; the same clip on 1.0-pro-fast is $0.31.
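The ladder and the clip prices are tokens times rate. A short sketch with the measured 720p figures, where the clip cost includes the extra frame and the function names are ours:

```python
def dollars(tokens: int, rate_per_million: float) -> float:
    """Convert billed tokens to dollars at a $/M-token list rate."""
    return tokens * rate_per_million / 1_000_000

TOKENS_PER_SECOND_720P = 20_592
ONE_FRAME_720P = 858

def clip_cost_720p(seconds: int, rate_per_million: float) -> float:
    """Cost of one 720p clip: per-second tokens plus the one extra frame."""
    tokens = TOKENS_PER_SECOND_720P * seconds + ONE_FRAME_720P
    return dollars(tokens, rate_per_million)

print(round(dollars(10_044, 7.0), 4))      # 2.0 @ 480p, per second: 0.0703
print(round(clip_cost_720p(15, 7.0), 2))   # 15 s on 2.0: 2.17
print(round(clip_cost_720p(15, 1.0), 2))   # 15 s on 1.0-pro-fast: 0.31
```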

Is 4k cheaper? The rate card says yes, the bill says no

4k carries the lowest per-token rate on Seedance 2.0, $4.0 per million against 1080p's $7.7, and it is still the most expensive thing on the menu. The reason is the formula: 4k emits about four times 1080p's tokens per second (194,400 vs 48,960), so the cheaper rate buys 2.1x the per-second cost, $0.778/s against $0.377/s. Our 4-second 4k anchor billed 785,700 tokens, $3.14 for four seconds of video. Read any per-token rate card with that resolution's tokens-per-second beside it; on token-billed video, the sticker and the bill can point in opposite directions.

What does generate_audio change?

The rate, not the tokens. On 1.5-pro the same 5-second clip billed 50,638 tokens with audio on and 50,638 with it off; the list rate doubles from $1.2/M silent to $2.4/M with audio, so soundtracked output costs exactly 2x, cleanly, with no hidden token surcharge. On the 2.0 series audio is part of the model's single rate, so there is no toggle arithmetic to do. If your pipeline adds its own music bed anyway, 1.5-pro silent at $0.0121/s is the quiet bargain of the catalog.

How does async video billing behave?

Billing attaches to the task lifecycle you saw above: the bill lands once, when the task first reports completed, no matter how many times you poll. Cancellation is honest about physics: a queued task cancels cleanly and bills nothing, but once the task is running the upstream rejects deletion (409) and you wait out the result. Failed generations are not billed.

Three operational notes from running 27 tasks. First, video responses carry token usage but no cost field, unlike chat's usage.cost; budget as tokens × your tier's rate. Second, output arrives as a signed URL with a 24-hour lifetime; move it to your own storage promptly. Third, pace your creates: the platform allows 6 video tasks per workspace per minute, and because each pending task reserves worst-case cost against your balance, bursting creates during an upstream slowdown can bounce off a temporary insufficient-quota response even when the eventual spend is fine.
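One way to stay under the create limit is a client-side sliding window. The sketch below is an assumption about how you might pace requests, not part of any SDK: `CreatePacer` and `paced_create` are hypothetical names, and `send` stands in for whatever function issues your create call. It only handles the 6-per-minute limit; it does not model the balance reservation described above.

```python
import time
from collections import deque


class CreatePacer:
    """Sliding-window limiter: at most `limit` creates per `window` seconds."""

    def __init__(self, limit: int = 6, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.stamps = deque()  # send times of recent creates

    def wait_time(self) -> float:
        """Seconds to wait before the next create is allowed (0.0 if none)."""
        now = self.clock()
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()  # forget creates that have left the window
        if len(self.stamps) < self.limit:
            return 0.0
        return self.window - (now - self.stamps[0])

    def record(self) -> None:
        """Call right after a create request has been sent."""
        self.stamps.append(self.clock())


def paced_create(pacer: CreatePacer, send):
    """Sleep if needed, then run `send()` (your create call) and record it."""
    delay = pacer.wait_time()
    if delay > 0:
        time.sleep(delay)
    result = send()
    pacer.record()
    return result
```

The injectable `clock` is there so the limiter can be tested without real sleeps.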

FAQ

Is the Seedance 2.0 API actually available?

Yes. The early-2026 access story (Chinese credentials, waitlists, staged rollout) is dated: the full family, 2.0 included, is live on the Synthorai gateway's /v1/videos endpoint at ByteDance list prices (model pages: Seedance 2.0, 1.5-pro, 1.0-pro-fast), and several other platforms serve it as well. If a page tells you Seedance 2.0 is experience-quota only, it predates the API rollout.

Why do Seedance prices differ so much between providers?

Because most resellers convert token billing into flat per-second or per-clip prices with their own margin and rounding, and the conversion hides the resolution- and duration-dependent token math. The formula in this post is the ground truth underneath every quote: compute W × H × (24s + 1) / 1024 with the encoded dimensions and multiply by the official rate, and you can price any provider's markup precisely.

Does aspect ratio or a vertical format cost extra?

No. 16:9 and 9:16 billed identical tokens at every resolution and duration we tested. Resolution and duration are the only levers that move the meter, and duration is exactly linear.

What about Seedance 2.5?

Announced, not yet served on the APIs we track. The token formula and the tier-versus-rate structure are the parts likely to carry over; when it lands we will re-run the same probes.

Measured 2026-07-22/23: 27 completed generations across five Seedance models via /v1/videos, token counts from per-task usage records, rates from the ByteDance list, and the capability matrix from the live /v1/videos/models catalog. Per-second competitor figures are list prices, not measurements. Formula fits are exact on every measured point; encoded dimensions are recovered from the meter, so re-verify them if ByteDance changes encoders. For the rest of the modality series: image generation, speech-to-text, voice sessions, and text tokens.

GPT-5.6 Prompting Guide: Two Defaults That Bill 1.5x and 10x More

synthorai — Tue, 21 Jul 2026 09:54:06 +0000

Prompting GPT-5.6 well is mostly two request parameters, and both default to the expensive setting. Omitting reasoning_effort billed 1.5x as much as pinning it to "none" across our 50-call matrix, with identical answers; leaving a stable prefix unmarked bills it at 10x the cached read rate on every call. This guide is the request-shape playbook that falls out of the measurements in our GPT-5.6 cost guide: what a well-formed request looks like, how to set the effort dial per task, how to lay out a prompt so the cache does its work, and what breaks when you port prompts from GPT-5.5.

TL;DR

Pin reasoning_effort on every GPT-5.6 request: omitting it billed 1.5x as much as "none" with identical answers on our 4-task matrix.
Accepted efforts run none through xhigh; "max" returns a 400 on Sol and Terra alike.
Mark stable prefixes with explicit cache breakpoints: cached reads bill at 10% of the input rate, writes at 1.25x, so mark what repeats, not what merely looks stable.
prompt_cache_options and breakpoints return a 400 on GPT-5.5 and older; version-gate the rollout.

What should a GPT-5.6 request look like?

Start from this shape and delete what you do not need. It pins the two levers explicitly instead of inheriting the expensive defaults:

{
  "model": "gpt-5.6-terra",
  "reasoning_effort": "low",
  "prompt_cache_options": { "mode": "explicit", "ttl": "30m" },
  "prompt_cache_key": "tenant-42",
  "messages": [
    { "role": "system", "content": "…stable instructions…",
      "prompt_cache_breakpoint": { "mode": "explicit" } },
    { "role": "user", "content": "…the part that changes per request…" }
  ]
}

The ordering rule behind it: everything stable goes before the breakpoint, everything per-request goes after, and anything dynamic (timestamps, user names, retrieved documents that differ per call) never sits inside the marked block, because one changed byte re-bills the block at the 1.25x write premium. The prompt_cache_key routes repeats to the same cache; use one stable key per tenant or session, and note the documented soft limit of about 15 requests per minute per key.

How should you set reasoning_effort?

Explicitly, always: the one setting to avoid is no setting. In our measurements, requests without reasoning_effort billed 1.5x as much as requests pinned to "none", and the answers were identical across the matrix. The accepted values run none, low, medium, high, xhigh; "max" is rejected with a 400 listing the valid range. What the dial bought on our one-line math check, on Luna:

`reasoning_effort`	Reasoning tokens	Answer	Cost per call
`none`	0	correct	$0.000062
`low`	52	correct	$0.000410
`medium`	85	correct	$0.000608
`high`	74	correct	$0.000542

GPT-5.6 was the only family in our token usage anatomy study that stayed correct with thinking fully off on that check, which makes none a defensible default for extraction, classification, formatting, and retrieval-shaped calls. When it does reason, the tokens are invisible and billed at the full output rate: 88% of the output charge on the default-setting math example was chain of thought you cannot read. Step up the dial when your evals say the task needs it, not because the default already spent it.

How do you lay out a prompt so the cache pays?

Layer the prompt in order of stability and mark the layers: system instructions first, then tool definitions, then reference documents, each ending at a breakpoint, with the volatile user turn after the last mark. You get four cache writes per request; in the default implicit mode an automatic breakpoint on the latest message consumes one of them, so explicit mode gives you the full four and, more importantly, caches only what you mark.

Partial reuse is the payoff, and it is measured. With a stable block A and a swapped tail B, the meter re-billed only the tail: 1,212 tokens read back at the cached rate, 1,210 written fresh at the premium, out of a 2,431-token prompt, reconciling against the rate card to the digit. Three budgeting rules follow:

Reads bill at 10% of the input rate, so a warm layered prefix flattens the input side of the bill.
Writes bill at 1.25x, so a marked block that never gets read again costs 25% more than not caching. Mark what repeats, not everything that looks stable.
On full repeats the matched length can snap below the mark (1,897 cached of a 2,422-token write in one probe), so budget on the discount rate, not exact match counts; our cache minimums study has the per-family floors.
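The budgeting rules reduce to a few lines of arithmetic. This sketch works in units of "one token at the list input rate" so it applies to any tier; the function names are illustrative, and two things are assumptions rather than measurements: that the few tokens outside the read and write counts bill at the plain input rate, and that every call after the first hits the cache.

```python
READ_MULT = 0.10   # cached reads bill at 10% of the input rate
WRITE_MULT = 1.25  # cache writes bill at 1.25x the input rate


def input_cost_units(read: int, written: int, plain: int) -> float:
    """Input cost in units of one token at the list input rate."""
    return read * READ_MULT + written * WRITE_MULT + plain


def calls_to_break_even() -> int:
    """Smallest number of calls at which marking a block beats not marking it.

    Marked: one write, then reads. Unmarked: full rate on every call.
    """
    calls = 1
    while WRITE_MULT + READ_MULT * (calls - 1) >= calls:
        calls += 1
    return calls


# The partial-reuse probe: 1,212 read, 1,210 written, remainder of 2,431 plain.
probe = input_cost_units(read=1212, written=1210, plain=2431 - 1212 - 1210)
print(round(probe, 1), "units vs", 2431, "uncached;",
      "break-even at call", calls_to_break_even())
```

Under those assumptions a marked block pays for itself on its second call, which is why the rule is "mark what repeats": a block read even once more is already ahead, and a block never read again is pure premium.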

The ttl: "30m" floor is a guaranteed minimum, not a cap, and it is 6x Claude's default 5 minutes; there is no 24-hour tier anymore, so daily-batch workloads that leaned on extended retention should re-run the break-even.

What breaks when you port prompts from GPT-5.5?

Two things break loudly and one silently. Loudly: prompt_cache_options and prompt_cache_breakpoint return a clean 400 on GPT-5.5 and older (prompt_cache_options is not supported on this model), so any shared prompt-builder needs a version gate. Also loudly: "max" effort, which some 5.5 configs carried, is rejected.

Silently, and more expensively: GPT-5.6 reasons by default where a 5.5 workload may have had reasoning off. A ported prompt that never sets reasoning_effort picks up the 1.5x omission tax at the same rate card. The cache migration runs the other way: 5.5's automatic prefix detection needed no markup but could not be triggered or debugged; on 5.6 the same prompt does nothing until you mark it, and then reports every write in usage.prompt_tokens_details.cache_write_tokens, where a miss shows up as a zero in a field you created rather than as silence.
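A version gate for a shared prompt-builder can be small. This is a hedged sketch, not an official client: `build_request` and `supports_explicit_cache` are hypothetical names, the `"gpt-5.6"` prefix check is an assumption about how model ids are spelled, and `"gpt-5.5"` in the usage note is an illustrative id. The field names are the ones used in the request shape earlier in this guide.

```python
VALID_EFFORTS_56 = {"none", "low", "medium", "high", "xhigh"}


def supports_explicit_cache(model: str) -> bool:
    # Assumption: 5.6-family model ids share the "gpt-5.6" prefix.
    return model.startswith("gpt-5.6")


def build_request(model: str, system: str, user: str, effort: str) -> dict:
    """Build a chat request, emitting 5.6-only cache fields only where accepted."""
    system_msg = {"role": "system", "content": system}
    request = {
        "model": model,
        "reasoning_effort": effort,  # always pinned, never omitted
        "messages": [system_msg, {"role": "user", "content": user}],
    }
    if supports_explicit_cache(model):
        if effort not in VALID_EFFORTS_56:
            # Fail locally instead of waiting for the API's 400.
            raise ValueError(f"reasoning_effort {effort!r} is rejected on {model}")
        request["prompt_cache_options"] = {"mode": "explicit", "ttl": "30m"}
        system_msg["prompt_cache_breakpoint"] = {"mode": "explicit"}
    return request
```

Calling `build_request("gpt-5.6-terra", ...)` yields the marked shape, while an older id such as `"gpt-5.5"` gets the same messages with no cache fields, so one builder can serve both during a staged rollout.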

Which tier should run the prompt?

The same request shape runs on all three tiers, so tier choice is a price decision, not a prompting one: Sol at $5/$30 per million tokens, Terra at half, Luna at a fifth. Once the prefix is stable, keyed, and warm, the cached read discount flattens the input side on every tier, which makes output price the differentiator; step down as far as output-quality evals allow. The full tier arithmetic, including the write-premium break-even per tier, is in the cost guide.

FAQ

Does GPT-5.6 support reasoning_effort: "max"?

No. Requests with "max" return a 400 listing none through xhigh as the valid values, on Sol and Terra alike. Workloads that want the ceiling should send xhigh explicitly.

Do the cache breakpoints work on GPT-5.5?

No. GPT-5.5 and older reject prompt_cache_options and breakpoint markers with a 400. On those models you are back to automatic prefix detection, which cannot be triggered, keyed, or debugged; treat cache behavior as best-effort there and version-gate any prompt builder that emits the new fields.

How many breakpoints should a prompt actually use?

As many layers as genuinely repeat, up to the budget: four writes per request, one of which the implicit auto-breakpoint consumes unless you switch to explicit mode. A typical layered prompt needs two or three (instructions, tools, reference block), and a fifth marker is accepted without error but simply shares the write slots, since a later mark covers everything before it.

All numbers in this guide were measured through the Synthorai gateway on the day-one GPT-5.6 models and reconcile against the live usage.cost meter; methodology and raw probes are in the cost guide and the cache minimums study. Verify against your own usage records; rates and accepted values may change.

Kimi K3 API Pricing, Measured: Turn Off the 'Always-On' Reasoning

synthorai — Mon, 20 Jul 2026 09:46:49 +0000

Kimi K3's documentation says thinking cannot be turned off, and that reasoning_effort accepts only "max". In our measurements the API accepts "none" anyway, and it works: the same trivial question that costs $0.00179 with default reasoning costs $0.000285 without it, a 6.3x difference. K3 launched on 2026-07-16 at $3 per million input tokens and $15 per million output, the most expensive list price a Chinese lab has shipped and the same sticker as Claude Sonnet 5. At that output price, the reasoning tokens the model spends by default are the bill, which makes the undocumented off-switch worth understanding precisely.

TL;DR

Kimi K3 spends 69-93% of its output tokens on reasoning at default settings; a 120-word paragraph billed 2,289 output tokens, $0.0346.
reasoning_effort: "none" is accepted despite docs saying otherwise, and cut our simple-query cost 6.3x, but multi-step arithmetic went from 3/3 correct to 0/6.
Kimi K3's prompt cache hits from roughly 256 tokens of prefix, in 256-token blocks, at a $0.30/M read rate.
Chinese is K3's cheapest CJK lane: 52 net tokens per 100 characters, below GLM-5.2 and DeepSeek at 58.

Everything below was measured on 2026-07-20 against kimi-k3, which is live on the Synthorai gateway at Moonshot's list prices, with repeated prompts salted to defeat response caches and the behavioral claims cross-checked on a second, independent request path. Raw usage records back every number.

What does Kimi K3 cost per answer by default?

Reasoning dominates the bill on every task shape we sent, including ones that need no reasoning at all. Per answer, at default settings:

Task	Output tokens	Reasoning share	Cost per answer
Trivial arithmetic (17×23)	99	84%	$0.0018
Factual one-liner	80	79%	$0.0015
Small code function	119	69%	$0.0009
Multi-step word problem	139	87%	$0.0025
120-word paragraph	2,289	93%	$0.0346

The cross-model contrast is the point: GPT-5.6 reasons adaptively (zero thinking tokens on the trivial and factual questions, 65-70% on math and writing), Claude Sonnet 5 ships with thinking off, and GLM-5.2 thinks even harder than K3 in relative terms. But GLM's output price is $4.40/M and K3's is $15/M, so the same 17×23 answer billed $0.00078 on GLM-5.2, $0.00027 on GPT-5.6, $0.0001 on Sonnet 5, and $0.0018 on K3. The gap widens with output length: the identical 120-word paragraph cost $0.0346 on K3 against $0.0186 on GLM-5.2, $0.0072 on GPT-5.6, and $0.0024 on Sonnet 5, a 15x spread on the most ordinary task in the set. The tax rate is comparable to other Chinese reasoning models; the tax bill is not.

Two more default-mode facts worth budgeting for. First, thinking mode injects a hidden preamble of about 67 tokens into every request: an identical one-word message billed 86 prompt tokens with reasoning on and 19 with it off. That is the "hidden system prompt" early testers noticed, and it disappears when reasoning does. Second, K3 is slow at the moment: our trivial-question calls took roughly 19-24 seconds end to end with reasoning on and 3-8 seconds with it off, launch-week serving included. Budget latency, not just dollars.

Can you turn Kimi K3's reasoning off?

Yes, despite the documentation. The official API reference states that K3 "always enables thinking" and that reasoning_effort accepts only "max". In practice the endpoint accepted "none", "low", "medium", and "high" without error, honored them, and we confirmed the same behavior on an independent request path. On the multi-step word problem, the dial is real but coarse:

`reasoning_effort`	Reasoning tokens (avg)	Accuracy
`none`	0	0/6
`low`	78	3/3
`medium`	94	3/3
`high`	105	3/3
`max` / default	100-121	3/3

Two things stand out. The intermediate settings cluster: low through max bought similar token counts and identical accuracy on this task, so the meaningful switch is binary. And none has a real cliff: forced to answer a multi-step arithmetic problem tersely, K3 got it wrong six times out of six, with scattered wrong answers rather than one systematic error. When we did not force a terse format, the model sometimes ignored brevity and worked through the steps in its visible answer instead: correct, but the tokens moved from the reasoning field into the text field rather than disappearing.

Latency moves less than the token counts suggest. Streaming the same problem at every effort level, first-byte times ranged 6-24 seconds with the effort levels' ranges overlapping heavily; even none, with nothing to think about, waited 12-13 seconds, so serving dominates time to first token at this task size. What the dial actually changes is the gap between the first byte and the first answer token, which is the thinking phase the user sits through.

The practical reading: none is a genuine cost lever for tasks that are retrieval, formatting, or single-step, and a footgun for anything that needs intermediate steps. There is no documented guarantee this parameter keeps working; treat it as measured behavior, verify it in your own usage fields, and expect it may be formalized or removed when the docs catch up.

When should reasoning stay on in agent workloads?

We ran K3 through five agent-shaped scenarios twice, default versus reasoning_effort: "none", with identical simple tasks that both configurations passed in full:

Scenario	Thinking share (default)	Cost with `none`	TTFT with `none`
Tool-call loop	8%	−10%	−35%
RAG answering	71%	−37%	−53%
Structured tooling	29%	−16%	−31%
Batch extraction	80%	−15%	−11%
Long chat (15 turns)	34%	−13%	−25%

The surprise is the first row: in tool-call loops K3 barely thinks even at default settings (8% share), so there is little to save; the model treats tool selection as reflex, not deliberation. The savings concentrate where thinking share is high and the task is mechanical (RAG lookups and batch extraction), which is exactly where a fixed always-on tax least belongs. For genuinely multi-step agent plans, the previous section's accuracy cliff applies; leave reasoning on and spend the tokens.

At scale the latency effect is real even though single calls are noisy: across these scenarios first tokens arrived in 10-19 seconds at default settings and 8-13 seconds with none, and generation ran at a median 35 tokens per second on substantive outputs. Those numbers, launch-week serving included, fit asynchronous and batch shapes better than anything conversational today.

Does the reasoning you pass back re-bill as input?

Yes, token for token. Kimi's docs tell you to keep reasoning_content from every assistant turn in the message history unmodified. We measured what that costs: a second turn sent with the first turn's chain of thought billed 599 prompt tokens; the identical request without it billed 198. The 401-token difference matches the first turn's 402 reasoning tokens almost exactly, so retained thinking re-enters every subsequent request at the full $3/M input rate, and a long conversation re-pays its accumulated reasoning on every turn.

Dropping it is not automatically cheaper, though. Without the prior chain of thought, K3 re-reasoned the follow-up from scratch: reasoning tokens on the second turn rose 31% (343 to 449). At $3/M input against $15/M output, keeping the CoT was the net-cheaper option in our probe, which means the docs' advice holds on cost grounds as well as quality ones. The lever that actually pays here is the prompt cache from the next section: retained history is a stable prefix, and stable prefixes stop billing at full price.
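The keep-versus-drop trade can be checked directly from the numbers above. A minimal sketch that compares only the parts that differ between the two second-turn requests; it assumes the visible answer length is the same either way, which the probe did not isolate.

```python
INPUT_RATE = 3.0 / 1_000_000    # USD per input token
OUTPUT_RATE = 15.0 / 1_000_000  # USD per output token (reasoning bills as output)

# Second-turn measurements quoted in the text.
prompt_with_cot, prompt_without_cot = 599, 198
reasoning_with_cot, reasoning_without_cot = 343, 449

# Keeping the chain of thought adds input tokens; dropping it adds reasoning tokens.
keep_penalty = (prompt_with_cot - prompt_without_cot) * INPUT_RATE
drop_penalty = (reasoning_without_cot - reasoning_with_cot) * OUTPUT_RATE

print(f"keep CoT: +${keep_penalty:.6f} input, drop CoT: +${drop_penalty:.6f} output")
```

The margin is narrow at one turn, and the input-side penalty grows with every retained turn, so it is worth re-running with your own conversation lengths rather than treating the result as a rule.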

Does Kimi K3 cache prompts, and from how many tokens?

K3's prompt cache is automatic, and its floor is low: hits started at roughly 256 tokens of shared prefix and advanced in 256-token blocks (a 303-token prompt cached 256; a 153-token prompt never cached across repeated attempts). Cached input bills at $0.30/M, a flat 90% discount against the $3/M fresh rate, with no cache-write premium on any call we issued. Warm-up took two to five identical calls before the first hit, so a single retry proves nothing either way; measure across several.

For context, that floor is a quarter of OpenAI's documented 1,024-token minimum, and the block size is coarser than the 64-token granularity we measured elsewhere. Lifetime is best-effort rather than a fixed TTL: cached entries in our probe survived idle gaps of 4 and 15 minutes while one 8-minute gap missed, so treat expiry as load-dependent eviction and verify the cached split on every call. One more pricing fact worth arithmetic: the 1M-token context window is priced flat, with no long-context tier on the price list. A maxed-out window is $3.00 of fresh input per call, and $0.30 once the prefix is warm, so large-context workloads live or die by the cache far more than by the sticker. If your traffic reuses a system prompt of even a few hundred tokens, K3's cache engages where most providers' would not have started yet; the mechanics and how to verify hits from usage are covered in our prompt caching guide and the measured cache minimums study.
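The block behavior can be modeled in a few lines for budgeting. This is a sketch of the measured pattern, not a documented contract: it assumes the cache rounds the shared prefix down to whole 256-token blocks and that the cache is already warm, and the helper names are illustrative.

```python
def cached_tokens(shared_prefix: int, block: int = 256) -> int:
    """Tokens expected at the cached rate once the cache is warm."""
    return (shared_prefix // block) * block  # under one block, nothing caches


def input_cost(prompt_tokens: int, shared_prefix: int) -> float:
    """USD for one warm call: cached part at $0.30/M, the rest at $3/M."""
    cached = min(cached_tokens(shared_prefix), prompt_tokens)
    return (cached * 0.30 + (prompt_tokens - cached) * 3.0) / 1_000_000


print(cached_tokens(303), cached_tokens(153))
print(f"${input_cost(1_000_000, 0):.2f} cold, ${input_cost(1_000_000, 1_000_000):.2f} warm")
```

Because eviction looked load-dependent in our probe, treat the output as a best case and compare it against the cached split reported in each call's usage.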

Is Chinese actually more expensive on Kimi K3?

No. Chinese is where K3's tokenizer is most efficient relative to peers, which answers a question raised repeatedly in launch-week discussions. Net tokens per 100 characters on semantically aligned passages, envelope overhead subtracted:

Model	en	zh	ja	ko	hi	Python
kimi-k3	19.7	51.9	87.5	83.2	62.8	26.5
glm-5.2	19.7	58.4	75.7	76.9	91.3	25.6
deepseek-v4-flash	19.7	58.4	70.6	69.2	60.7	26.7
claude-sonnet-5	32.3	114.3	94.1	106.3	70.9	41.1

K3 bills Chinese at 52 tokens per 100 characters, 11% under GLM-5.2 and DeepSeek and less than half of Sonnet 5. Its weak lane is Japanese, where it pays 16-24% more than the other open-weight models. We also confirmed the tokenizer is unchanged across the family: K3, K2.7-code, and K2.5 produced identical counts on all 23 aligned samples, so per-language budgets built for K2 carry over. How tokenizer density compounds with per-token price across nine languages is the subject of our cheapest LLM by language study.
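Density only becomes a bill once it is multiplied by a price. The sketch below does that for the two models in the table whose input price is stated in this post (K3 and Sonnet 5 share the $3/M sticker); it covers input tokens only, uses the net densities from the table with envelope overhead excluded, and the function name is illustrative.

```python
INPUT_RATE_PER_M = 3.0  # USD per 1M input tokens, same sticker on both models

TOKENS_PER_100_CHARS = {
    "kimi-k3": {"en": 19.7, "zh": 51.9, "ja": 87.5},
    "claude-sonnet-5": {"en": 32.3, "zh": 114.3, "ja": 94.1},
}


def input_cost_per_million_chars(model: str, lang: str) -> float:
    """USD to send one million characters of `lang` text as input."""
    tokens = TOKENS_PER_100_CHARS[model][lang] * 10_000  # 1M chars = 10,000 x 100
    return tokens * INPUT_RATE_PER_M / 1_000_000


for model in TOKENS_PER_100_CHARS:
    row = {lang: round(input_cost_per_million_chars(model, lang), 3)
           for lang in ("en", "zh", "ja")}
    print(model, row)
```

At an identical per-token price the tokenizer is the whole difference, which is why the Chinese lane comes out at less than half the cost on K3.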

FAQ

When will Kimi K3's open weights be released?

Moonshot has promised full weights under a Modified MIT license by July 27, 2026; as of this post K3 is API-only. The "largest open-weight model ever" framing is a commitment, not yet a download link.

What is actually new in K3 compared with the K2 family?

On the bill, three things we measured: the price (K3's $3/$15 versus K2.7-code at $0.95/$4, a 3.2-3.75x jump), always-on thinking (K2.5 does not reason at all, K2.7-code has a toggle), and nothing else: the tokenizer is byte-identical across K3, K2.7-code, and K2.5 on all 23 of our aligned samples, so K2-era token budgets carry over. On the spec sheet, per Moonshot: a new 2.8T-parameter MoE (896 experts, 16 active per token) with Kimi Delta Attention, a 1M-token context window against K2.7-code's 256K, and native image input. We measured the billing claims, not the architecture ones.

Does Kimi K3 support structured output?

Yes. A response_format with json_schema returned a valid, schema-conforming object in our probe. Note that reasoning still runs underneath: 66 of the 97 output tokens on that extraction call were reasoning, so schema-constrained calls pay the thinking tax like everything else unless you also set reasoning_effort: "none".

Does turning reasoning off change what you can see?

Yes. At default settings K3 returns its full chain of thought in reasoning_content, and the docs advise passing it back unmodified in multi-turn history. With reasoning_effort: "none" the field is absent entirely, and the ~67-token thinking preamble disappears from your prompt bill with it.

Measured 2026-07-20 on kimi-k3 at launch-week list prices ($3/M input, $0.30/M cached, $15/M output). Repeated prompts were salted to avoid response-level caches; accuracy counts use tasks with single checkable answers; behavioral claims were reproduced on a second, independent request path. Prices and behavior may change as the release matures; verify against your own usage records before relying on any number here.

GPT Realtime API Pricing: Speaking Costs 4x Listening (Measured)

synthorai — Sun, 19 Jul 2026 13:38:31 +0000

A voice conversation on OpenAI's Realtime API costs $0.0192 per minute while the user talks and $0.0768 per minute while the model talks back. Speaking is exactly four times listening, and that single ratio explains most of a voice session's bill. One naming note before the numbers: "GPT Live" is the ChatGPT consumer feature and has no API. The API products behind it are gpt-realtime-2.1 and gpt-realtime-2.1-mini, and those are what this post measures.

TL;DR

gpt-realtime-2.1 bills exactly 1 audio token per 100 ms of user speech and 1 per 50 ms of model speech: $0.0192 per minute to listen, $0.0768 per minute to speak.
Sixty seconds of silence under server VAD billed zero input tokens.
Automatic caching covered 93% of input by turn 30; deleting one history item tripled full-price input for one turn.
Cancelling a long spoken answer 2 seconds in billed 4 seconds of audio.
gpt-realtime-2.1-mini has identical billing mechanics at 3.2x lower audio prices.

Every number here comes from instrumented WebSocket sessions run against both models on 2026-07-19, with every server event logged. Both models are live on the Synthorai gateway's /v1/realtime endpoint, which is where these sessions ran; the protocol and the billing are the same as talking to OpenAI directly. The harness is a single stdlib-only Python file, and each figure below traces back to a raw response.done usage record.

How do you connect to the Realtime API?

Unlike the text APIs, Realtime is not request/response over HTTP. You open one WebSocket per session and exchange JSON events over it: the client streams microphone audio in, the server streams spoken audio back, and one connection carries the whole conversation.

import json
import websocket  # third-party package: websocket-client (pip install websocket-client)

ws = websocket.create_connection(
    "wss://synthorai.io/v1/realtime?model=gpt-realtime-2.1",
    header=["Authorization: Bearer sk-..."])

ws.send(json.dumps({"type": "session.update", "session": {
    "type": "realtime", "output_modalities": ["audio"],
    "audio": {"input": {"format": {"type": "audio/pcm", "rate": 24000}}}}}))

# stream mic audio as base64 chunks: {"type": "input_audio_buffer.append", ...}
# then either let server VAD end the turn, or commit and ask for an answer:
ws.send(json.dumps({"type": "response.create"}))
# read events until "response.done": billing usage rides on that event

The session lifecycle that matters for billing: session.update sets instructions, voice, tools, and turn detection (this becomes the cacheable prefix); input_audio_buffer.append / commit add user audio; response.create triggers a reply; and every response.done carries the full usage breakdown for that response. One dialect note: the GA API uses output_modalities and a nested audio.input/audio.output config; the beta-era response.modalities field is rejected with unknown_parameter.

How much does GPT Realtime cost per minute?

The official conversion rates hold to the exact token: a 30.0-second clip billed 300 input audio tokens (1 per 100 ms), and a 4.5-second spoken answer billed 90 output audio tokens (1 per 50 ms). That turns the per-token price list into per-minute arithmetic:

Lane	gpt-realtime-2.1	gpt-realtime-2.1-mini
Listening (user audio in, full price)	$0.0192/min	$0.0060/min
Speaking (model audio out)	$0.0768/min	$0.0240/min
Listening, cached replay	$0.00024/min (1/80th)	$0.00018/min
Transcription add-on (optional)	+$0.017/min	+$0.017/min

Two costs hide outside the table. First, a spoken answer also bills text output: the transcript plus reasoning tokens (gpt-realtime-2.1 does reason; output_token_details.reasoning_tokens came back non-zero on every run). On our short test answer that added about 24% on top of the audio tokens, billed at the $24/M text rate.

Second, the transcription add-on is its own billing lane. Its usage record reads {"type": "duration", "seconds": 30}: duration-billed at $0.017 per minute, independent of tokens, and the transcript never enters the model's input. Flipping that one flag roughly doubles the input-side cost on 2.1 and nearly quadruples it on mini, so turn it on only where a compliance or product requirement actually needs the text.
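The per-minute figures follow from the conversion rates and the per-token audio prices. A small sketch of that arithmetic, with illustrative names; it covers the audio lanes and the transcription add-on only, and leaves out the text-output tokens (transcript plus reasoning) described above.

```python
PRICES = {  # USD per 1M audio tokens: (input, output)
    "gpt-realtime-2.1": (32.0, 64.0),
    "gpt-realtime-2.1-mini": (10.0, 20.0),
}
INPUT_TOKENS_PER_MIN = 600     # 1 token per 100 ms of user speech
OUTPUT_TOKENS_PER_MIN = 1200   # 1 token per 50 ms of model speech
TRANSCRIPTION_PER_MIN = 0.017  # optional add-on, duration-billed


def per_minute(model: str, transcription: bool = False) -> tuple:
    """Return (listening, speaking) cost in USD per minute of audio."""
    audio_in, audio_out = PRICES[model]
    listen = INPUT_TOKENS_PER_MIN * audio_in / 1_000_000
    speak = OUTPUT_TOKENS_PER_MIN * audio_out / 1_000_000
    if transcription:
        listen += TRANSCRIPTION_PER_MIN  # billed alongside the input lane
    return listen, speak


for model in PRICES:
    plain, _ = per_minute(model)
    with_text, _ = per_minute(model, transcription=True)
    print(model, per_minute(model), f"transcription x{with_text / plain:.2f}")
```

The multiplier printed at the end is the input-side effect of the transcription flag, the same roughly-2x and nearly-4x figures quoted above.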

Does silence, interruption, or tool calling cost anything?

Silence costs nothing. We streamed 60 seconds of silence into a session with server VAD enabled, then asked a question: the usage was byte-identical to a control session that never sent audio. VAD only commits audio it detects as speech, so hold music, a customer reading a form, or an open idle line bill zero input tokens. The caveat is that real background noise can trip VAD; pure silence is the floor, not a guarantee for a noisy call.

Interruptions bill to the generation frontier, not to the user's ear, and never for the un-generated remainder. We requested a slow spoken count to forty and cancelled after hearing 2.0 seconds: the bill was 81 audio tokens, or 4.0 seconds. The 2-second overhang is how far generation ran ahead of playback before response.cancel landed. On mini the same experiment billed 6.3 seconds, because the smaller model generates further ahead of real time. The practical rule: send response.cancel the instant your client detects barge-in, because the meter runs until the cancel arrives.
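To fold interruptions into per-interaction math, convert the billed audio tokens back to seconds and subtract what the user heard. A minimal sketch with an illustrative function name; the $64/M figure is the gpt-realtime-2.1 audio output price, and the token count is the one from the cancel experiment above.

```python
MS_PER_OUTPUT_TOKEN = 50  # 1 output audio token per 50 ms of model speech


def barge_in(billed_audio_tokens: int, heard_seconds: float, audio_out_per_m: float) -> dict:
    """Break an interrupted response into billed time, overhang, and cost."""
    billed_seconds = billed_audio_tokens * MS_PER_OUTPUT_TOKEN / 1000
    return {
        "billed_seconds": billed_seconds,
        "overhang_seconds": billed_seconds - heard_seconds,  # generated, never heard
        "cost_usd": billed_audio_tokens * audio_out_per_m / 1_000_000,
    }


print(barge_in(billed_audio_tokens=81, heard_seconds=2.0, audio_out_per_m=64.0))
```

Logging the overhang per interruption is a cheap way to see whether your client's cancel is landing promptly.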

Tool calls are billing-neutral. A session with one function definition emitted the call, took the injected result, and the very next response showed 99% of its input billed at the cached rate. Function-call items and their outputs cache like any other appended history, and the tool definitions themselves sit in the static prefix that caches from turn 2 onward.

How does caching keep long sessions affordable?

The Realtime API re-reads the entire conversation as input for every response, so per-turn input grows linearly with session length. What keeps that affordable is automatic prefix caching: cached audio replays at $0.40/M instead of $32/M, 1/80th of full price. In our 30-turn session the cached share climbed steadily to 93% of input by turn 30.

The cached split is reported on every response.done:

"usage": {
  "input_tokens": 891,
  "input_token_details": {
    "text_tokens": 891, "audio_tokens": 0,
    "cached_tokens": 832,
    "cached_tokens_details": { "text_tokens": 832, "audio_tokens": 0 }
  }
}

Three specs the official docs do not state, all measured: caching engages at roughly 128 tokens of prefix (the text API's documented minimum is 1,024), it advances in 64-token blocks, and the static prefix is reusable across sessions on the same key. That last one matters for the 60-minute session cap: a new session's very first turn already billed its instructions at the cached rate, so rotation only pays full price to re-read conversation history, not the system prompt.
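Tracking the cached share per turn takes only the usage object shown above. A small sketch with illustrative helper names; it operates on the usage dictionary itself, so locating that dictionary inside the `response.done` event is left to your event handler.

```python
def cached_share(usage: dict) -> float:
    """Fraction of a response's input tokens billed at the cached rate."""
    total = usage["input_tokens"]
    cached = usage["input_token_details"].get("cached_tokens", 0)
    return cached / total if total else 0.0


def is_block_aligned(usage: dict, block: int = 64) -> bool:
    """True when the cached count is a whole number of 64-token blocks."""
    return usage["input_token_details"].get("cached_tokens", 0) % block == 0


# The usage record from the example above.
usage = {
    "input_tokens": 891,
    "input_token_details": {
        "text_tokens": 891, "audio_tokens": 0,
        "cached_tokens": 832,
        "cached_tokens_details": {"text_tokens": 832, "audio_tokens": 0},
    },
}
print(f"{cached_share(usage):.1%} cached, block-aligned: {is_block_aligned(usage)}")
```

A sudden drop in this share between consecutive turns is the signature of an edited or reordered history item.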

Editing history is the one way to lose the discount, and we measured the exact penalty. Deleting one early item mid-session collapsed the cached share for precisely one turn, then the cache rebuilt:

Turn	Input	Cached	Full price
8 (before delete)	319	256	63
9 (first user item deleted)	326	128	198
10	352	320	32

One more measured relief: the model's own spoken answers re-enter later inputs as text, not audio. In an 8-turn voice conversation, input audio grew by exactly the user's clip size each turn, while the assistant side reappeared as transcript tokens at $4/M. The expensive part of the compounding term is user audio only.

The playbook that falls out is short: keep history append-only, keep instructions and tool definitions byte-identical for the whole session (and across sessions), put anything dynamic in the latest user message instead of the prefix, and when you must trim, trim rarely and in large steps rather than every turn. For the general mechanics across providers, see our prompt caching guide and the measured cache minimums study.

gpt-realtime-2.1 vs mini: which should you pick?

Billing mechanics are identical on both models: same conversion rates, same 64-token cache quantization, same curve shapes. What differs is price and behavior:

	gpt-realtime-2.1	gpt-realtime-2.1-mini
Audio in / out (per 1M tokens)	$32 / $64	$10 / $20 (3.2x cheaper)
Text in / out	$4 / $24	$0.60 / $2.40 (6.7x cheaper in, 10x out)
Cached audio	$0.40 (1/80th)	$0.30 (1/33rd)
Text-turn latency (measured)	0.5–0.9 s	0.5–0.6 s
Verbosity on identical prompts	baseline	consistently higher output tokens
Barge-in overhang (2 s heard)	4.0 s billed	6.3 s billed

Two details worth noticing. On the cached-replay lane the price gap nearly closes ($0.40 vs $0.30), so a highly cached long session narrows mini's advantage slightly, though fresh tokens still dominate the total. And mini's speed works against it on interruptions: it generates further ahead of playback, so each barge-in discards about twice as much generated audio. In dollars mini still wins every scenario we measured; the 3.2x price gap absorbs both effects.

Pick mini by default for short-command assistants, IVR, and high-concurrency support. Pick 2.1 when the session needs complex tool orchestration or multi-step reasoning; OpenAI positions it as the flagship for instruction following, which our cost harness deliberately does not judge.

What do common voice scenarios actually cost?

Scenario	Dominant cost	What the measurements say
Voice chat, companions	Speaking lane + history compounding	Keep history append-only; the 60-min rotation re-reads history once at full price while the prompt stays cached
Live translation	Speaking ≈ listening duration	The dedicated `gpt-realtime-translate` SKU is $0.034/min flat; building translation on 2.1 runs roughly 3x that at list prices
Call center	Silence share of the call	Silence is free, so quiet minutes cost ≈$0; compliance transcription adds $0.017/min per leg and needs its own budget line
Device assistants	Connection setup + first turn	Keeping one line open beats reconnecting: idle is free, and session setup measured about 2.5 s of user-visible delay
Voice agents with tools	Tool round trips	Tool calls leave caching intact (99% cached on the following turn); keep definitions static
Meeting notes	Not a Realtime job	Duration-billed transcription plus a text model avoids the compounding term and the 60-minute cap entirely

For interrupt-heavy scenarios, add the barge-in overhang to your per-interaction math: each interruption costs the audio the user heard plus a few seconds of generation lead.
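As a rough sketch of that per-interaction math, the snippet below prices the two measured overhangs from the comparison table at list audio-output rates. It assumes the FAQ's conversion of 1 output token per 50 ms and that the whole billed span is output audio; the function name is hypothetical.

```python
OUTPUT_MS_PER_TOKEN = 50  # output audio: 1 token per 50 ms (see the FAQ)

def barge_in_cost(billed_seconds: float, audio_out_per_million: float) -> float:
    """Dollar cost of the output audio billed for one interrupted reply."""
    tokens = round(billed_seconds * 1000 / OUTPUT_MS_PER_TOKEN)
    return tokens * audio_out_per_million / 1_000_000

full = barge_in_cost(4.0, 64.0)  # gpt-realtime-2.1: 4.0 s billed at $64/M
mini = barge_in_cost(6.3, 20.0)  # gpt-realtime-2.1-mini: 6.3 s billed at $20/M
print(f"2.1: ${full:.5f}  mini: ${mini:.5f}")
```

Under these assumptions mini bills more seconds per interruption but still fewer dollars, consistent with the conclusion in the comparison section.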

FAQ

Is GPT Live the same as the GPT Realtime API?

No. GPT Live is the voice feature inside the ChatGPT apps and has no API or pricing page of its own. Developers who want that experience programmatically use the Realtime API models gpt-realtime-2.1 and gpt-realtime-2.1-mini, whose prices this post measures.

How long can a Realtime session last?

Sixty minutes is the hard cap, and a closed session cannot be resumed. Text history can be re-injected into a fresh session (billed once at full price, while the static prompt stays cached), but assistant audio cannot be replayed, so long-running voice products need a rotation plan before minute 60.

Is there an idle timeout between turns?

No idle timeout is documented, and silence bills zero tokens under server VAD in our measurement, so keeping a line open between interactions costs nothing except the connection itself. For sparse-use products this makes one long session cheaper and faster than reconnecting per interaction, since setup measured about 2.5 seconds.

What audio format does the API expect?

PCM16 at 24 kHz mono is the default for both input and output, configured via audio.input.format and audio.output.format in session.update. Billing does not depend on the format: audio tokens are a function of duration only, 1 token per 100 ms in and 1 per 50 ms out.
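Because audio tokens depend on duration only, a budget estimate needs nothing but clip lengths. The sketch below assumes partial windows round up (the rounding rule is our assumption, not a measured result) and prices one listened minute plus one spoken minute at the gpt-realtime-2.1 list rates from the comparison table.

```python
INPUT_MS_PER_TOKEN = 100   # 1 input audio token per 100 ms
OUTPUT_MS_PER_TOKEN = 50   # 1 output audio token per 50 ms

def audio_tokens(duration_ms: int, ms_per_token: int) -> int:
    """Audio tokens for a clip; a function of duration, not of format."""
    return -(-duration_ms // ms_per_token)  # ceiling division (assumed rounding)

minute_in = audio_tokens(60_000, INPUT_MS_PER_TOKEN)
minute_out = audio_tokens(60_000, OUTPUT_MS_PER_TOKEN)

# List price of one listened minute plus one spoken minute on gpt-realtime-2.1
# ($32 / $64 per 1M audio tokens in / out), ignoring caching and text tokens.
cost = (minute_in * 32 + minute_out * 64) / 1_000_000
print(minute_in, minute_out, f"${cost:.3f}")
```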

The engineering patterns for living with the 60-minute wall (rotation, history hand-off, what survives a reconnect) are their own topic, and this post's numbers are the inputs to that math. For how billed tokens decompose across families in the text API, the companion piece is our token usage anatomy.

LLM Token Usage: Why a 4-Token Answer Bills 217 Tokens

synthorai — Tue, 14 Jul 2026 12:41:20 +0000

Ask GPT-5.6 a one-line math question and 88% of the output charge is reasoning you will never see: 10 visible tokens, 81 billed. And GPT-5.6 is the mild case. The same question billed 217 completion tokens for a 4-token answer on GLM 5.2, and 1,104 for the identical answer on Qwen3.7-max. That is not an anomaly; it is how reasoning models bill by design, and it is the first of several token classes that most cost dashboards never break out. This post dissects a real usage object, class by class, with the measured numbers.

TL;DR

GPT-5.6's default billed 81 tokens for a 10-token answer, 88% reasoning; GLM 5.2 ran 98% and Qwen3.7-max 99.3%.
Claude Sonnet 5 billed 114 thinking tokens on a 5-token answer with no thinking parameter sent.
Five families answered wrong without thinking (399, 400, 427, 466, 467); GPT-5.6 alone was correct without it, and every reasoning run said 401.
A 1,181-token Claude cache write then read cost $0.01246 and $0.00566, matching list prices exactly.

What we tested with

Every measurement in this post is the same single-turn prompt, sent to each model at its default settings unless a row says otherwise:

How many positive integers n <= 1000 are divisible by 3 or 5
but not by 15? Reply with just the number, nothing else.

The correct answer is 401 (333 multiples of 3, plus 200 multiples of 5, minus 66 counted twice, gives 467; drop the 66 multiples of 15 and 401 remain). We chose it deliberately: the visible answer is tiny and fixed at 3-4 tokens on every tokenizer, there is exactly one right answer so "did the thinking pay off" is checkable, and it is just hard enough that models want to think, which is the behavior being audited.
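The arithmetic is easy to confirm independently. A short Python check, counting by brute force and by the same inclusion-exclusion steps as the derivation above:

```python
# Count n in 1..1000 divisible by 3 or 5 but not by 15.
brute = sum(1 for n in range(1, 1001)
            if (n % 3 == 0 or n % 5 == 0) and n % 15 != 0)

# Same count by inclusion-exclusion, mirroring the derivation above.
by_3, by_5, by_15 = 1000 // 3, 1000 // 5, 1000 // 15
either = by_3 + by_5 - by_15   # divisible by 3 or 5
formula = either - by_15       # drop the multiples of 15

print(brute, formula)  # 401 401
```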

The five token classes on your bill

A modern completion bills up to five different token classes, at four different rates. A single flat "tokens used" number hides all of it.

Class	Where it appears	Billed at
Prompt (uncached)	`prompt_tokens`	input rate
Visible output	`completion_tokens` minus reasoning	output rate
Reasoning	`completion_tokens_details.reasoning_tokens`	output rate, separate from the answer
Cache write	`cache_creation_input_tokens`	input rate x 1.25 (Anthropic 5m TTL) or x 2 (1h TTL)
Cache read	`cache_read_input_tokens`	input rate x 0.1 (Anthropic)

The field names above are the OpenAI-compatible shape. Claude carries the same five classes under its own names: input_tokens and output_tokens, with thinking reported in output_tokens_details.thinking_tokens and billed as output, and the cache write further split by TTL in a cache_creation object (ephemeral_5m_input_tokens at 1.25x, ephemeral_1h_input_tokens at 2x). Same anatomy, different labels: that parsing problem returns in a later section.

The five classes and their price multipliers, labeled with the OpenAI-compatible and Anthropic field names. The 88% figure is GPT-5.6's measured reasoning share from the example above.

Here is the real object behind the headline, GPT-5.6 on its default setting answering the test question:

{
  "prompt_tokens": 38,
  "completion_tokens": 81,
  "total_tokens": 119,
  "prompt_tokens_details": { "cached_tokens": 0, "cache_write_tokens": 0 },
  "completion_tokens_details": { "reasoning_tokens": 71 },
  "cost": 0.000524
}

This is where the billing arithmetic lives, and it is worth doing once explicitly. The answer you received is completion_tokens minus reasoning_tokens: 81 − 71 = 10 tokens, the word "401" and its formatting. The other 71 tokens are chain-of-thought billed at the full output rate, 88% of the output charge, and on GPT-5.6 you cannot read a single one of them. Every "visible answer" figure in this post is computed the same way. The ratio only gets steeper elsewhere: GLM 5.2 answered the same question with 217 completion tokens for a 4-token answer, so a cost model that reads completion_tokens as "what the model said" is off by 8x here and 54x there.
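The same arithmetic as a small Python sketch. The rates are an assumption: we use the $1 / $6 per 1M tokens listed for the Luna tier in the GPT-5.6 cost guide, and that this object came from a Luna-tier run is our inference from the matching cost, not something the object itself states.

```python
from fractions import Fraction

usage = {
    "prompt_tokens": 38,
    "completion_tokens": 81,
    "completion_tokens_details": {"reasoning_tokens": 71},
    "cost": 0.000524,
}

# Assumed rates: Luna tier, $1 input / $6 output per 1M tokens.
INPUT_PER_M, OUTPUT_PER_M = Fraction(1), Fraction(6)

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - reasoning
reasoning_share = Fraction(reasoning, usage["completion_tokens"])
cost = (usage["prompt_tokens"] * INPUT_PER_M
        + usage["completion_tokens"] * OUTPUT_PER_M) / 1_000_000

print(visible, f"{float(reasoning_share):.0%}", float(cost))
```

Exact fractions avoid float drift when reconciling against a metered cost to six decimals.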

Reasoning is the budget, and it is steerable

The same one-line question across every thinking setting the three steerable families accept, one gateway, 2026-07-13/14 (rows ordered least to most reasoning within each family; the remaining families' flagship defaults are in the next section):

Configuration	Answer	Reasoning tokens	Cost
GPT-5.6 mini (luna), `none` / `low` / `medium` / `high`	401 (all correct)	0 / 52 / 85 / 74	$0.000062 / 0.000410 / 0.000608 / 0.000542
GPT-5.6 mini, default	401 (correct)	71	$0.000524
GLM 5.2, thinking off	399 (wrong)	0	$0.000062
GLM 5.2, default (thinking on)	401 (correct)	213	$0.001016
GLM 5.2, `reasoning_effort: high`	401 (correct)	359	$0.001659
Claude Sonnet 5, thinking disabled	467 (wrong)	0	$0.000130
Claude Sonnet 5, `effort: low`	401 (correct)	84	$0.000970
Claude Sonnet 5, no thinking param	401 (correct)	114	$0.001290
Claude Sonnet 5, adaptive thinking	401 (correct)	168	$0.001830
Claude Sonnet 5, `effort: high`	401 (correct)	249	$0.004830

Four things worth reading out of that table honestly:

The lever is large where it works. reasoning_effort: none answered correctly for $0.000062, 8.5x under luna's default, and on harder tasks we've measured 20x on GLM 5.2. Tier choice is part of the same lever: GPT-5.6's flagship answered this question with zero reasoning by default (next section), so a bigger model that doesn't need to think can out-cheap a smaller one that does.
And nearly inert where it doesn't. How far the dial turns is a family trait: monotonic with a 5x cost swing on Sonnet 5 (84 to 249), mild and unordered on GPT-5.6, nearly inert on Qwen3.7-max (low still spent 974 reasoning tokens against the default's 1,096), dead on DeepSeek V4 Pro (267 against 269).
Labels are not monotonic and defaults are not deterministic. GPT-5.6's high spent fewer tokens than its medium here; GLM's low out-spent its high in our earlier post; Sonnet 5 thought for 114 tokens on a request that sent no thinking parameter at all; and the same GLM default request spent 213 reasoning tokens in one run and 1,312 in another, six times apart. Measure the dial's real effect on your workload; don't read it off the label.
Cheap-and-wrong is the recurring failure. Two of the three families in the table answered wrong once thinking was switched off (GLM 399, Sonnet 5 467), and only GPT-5.6 stayed correct at none; the full five-family roll call is in the next section. Reasoning is a correctness budget, and whether cutting it pays off belongs to the task, not the model.

The practical rule: treat reasoning_tokens as a first-class line item. It bills at the output rate, it routinely dwarfs the visible answer, and it responds to parameters whose real effect you must measure. On GPT-5.6, note that the pricing rules changed too; the GPT-5.6 cost guide covers the write premium and the cache-key requirement.

Can you actually read what you paid for?

"Separate from the answer" does not always mean hidden, and the difference is worth being precise about. We checked the response bodies, not just the usage:

GLM 5.2, DeepSeek, Qwen3.7-max, and MiniMax return the full reasoning text in a reasoning_content field beside the answer (3,987, 1,604, 2,509, and 581 characters here). A developer can read every billed token; end users only see it if you render it, and most applications don't.
GPT-5.6 withholds the raw chain of thought; a summary is the most you get. The response can carry a model-written reasoning.summary (359 characters here), but the 71 billed reasoning tokens are the hidden raw text, not the summary. The closest thing to that text is reasoning.encrypted_content: an encrypted blob you can hand back for multi-turn continuity but never decrypt. The tokens you paid for sit in your own response body, unreadable.
Claude depends on how you ask. Our adaptive-thinking Sonnet 5 call returned a thinking block whose text was empty while thinking_tokens billed 114: proof it thought, nothing to read. Fable 5 behaved the same way on its always-on default (59 billed, empty block). Yet the same Sonnet 5 called with an explicit reasoning budget returned the actual thinking text (73 tokens billed, text present). How you ask decides what you can see.

So the billing statement is universal, the visibility is not: every family charges output rate for reasoning, and whether you can audit the text you bought ranges from "in full" through "summary only" to "a signed empty block."

When the text does come back, you can grade it mechanically. Our question has five fixed intermediate results (333, 200, 66, 467, 401), and every reasoning text returned at default settings contained all of them: GLM 5.2, DeepSeek V4 Pro, Qwen3.7-max, Kimi K2.7 Code, and MiniMax M3 each shipped a complete derivation, while the low-effort variants dropped one step each. For anyone who needs the process and not just the answer, that is the split: with reasoning_content you can verify what you paid for; with a summary or an empty block you take it on faith. Invisible Tokens, Visible Bills formalizes that accountability gap, and PALACE estimates hidden reasoning from the outside.
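A minimal sketch of that mechanical grading, assuming the reasoning text is available as a plain string. The sample strings are made up for illustration and are not model transcripts.

```python
import re

# The five intermediate results the derivation passes through.
STEPS = ("333", "200", "66", "467", "401")

def missing_steps(reasoning_text: str) -> list[str]:
    """Return the expected intermediate results absent from the text."""
    return [s for s in STEPS
            if not re.search(rf"(?<!\d){s}(?!\d)", reasoning_text)]

sample = "Multiples of 3: 333. Multiples of 5: 200. Both: 66. 333+200-66=467. 467-66=401."
print(missing_steps(sample))                # []
print(missing_steps("The answer is 401."))  # ['333', '200', '66', '467']
```

The digit lookarounds stop "66" from matching inside a longer number such as 667. A check like this confirms that the steps appear, not that the logic connecting them is sound.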

The same question across every family's best model

The steering table used specific tiers to show the knobs. Here is the newest flagship of each family on the same question, default settings:

Model	Answer	Completion tokens	Reported reasoning	Cost
Qwen3.7-max	401 (correct)	1,104	1,096 (99.3%)	$0.008393
DeepSeek V4 Pro	401 (correct)	272	269 (98.9%)	$0.000933
Kimi K2.7 Code	401 (correct)	261	258 (99%)	$0.001082
MiniMax M3	401 (correct)	260	text returned, count not itemized	$0.000349
GLM 5.2	401 (correct)	217	213 (98%)	$0.001016
Claude Fable 5	401 (correct)	62	59 (95%)	$0.003600
GPT-5.6 sol	401 (correct)	4	0	$0.000310
Gemini 3.5 Flash	466 (wrong)	3	0	$0.000080

Billed output tokens per flagship for the same question (hatched orange = reasoning share; green = the visible answer). The asterisk on MiniMax: its usage carries no reasoning count, so the share is reconstructed by subtraction, verified against the reasoning text it returns. The check mark on GPT-5.6 sol marks the only correct zero-reasoning run; the cross on Gemini 3.5 Flash marks the only wrong flagship answer (466).

Every reporting style in one chart:

Seven of eight flagships answered correctly, at wildly different prices for the same 401. Qwen3.7-max spent 1,096 reasoning tokens, 22 seconds, and $0.0084; GPT-5.6's flagship spent zero reasoning tokens and $0.00031. A 27x cost spread and a 7x latency spread for an identical right answer is the reasoning budget made visible.
MiniMax returns its reasoning text but no count. 260 completion tokens for a 3-token visible answer: the response carries the full derivation in reasoning_content, yet completion_tokens_details has no reasoning line. When the count is missing, reconstruct it by subtraction: completion minus visible tokens is your hidden-output count.
Gemini 3.5 Flash is the outlier: the only flagship to answer wrong (466), with 3 completion tokens and no reasoning count anywhere. Its sibling 2.5 Flash once spent 12.5 seconds producing a 3-token 401 with nothing in the bill saying why, and a repeat run answered 427.
Smaller tiers and disabled thinking are where the wrong answers live. GLM with thinking off said 399; Sonnet 5 disabled said 467; the older qwen3-max didn't think at all (3 tokens) and said 400; Kimi K2.5, which has no reasoning channel, reasoned out loud for 144 visible billed tokens, derived 401 inside its own prose, then concluded 400. Five families, five different wrong answers: 399, 400, 427, 466, 467; GPT-5.6 alone was correct without reasoning.
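The subtraction rule for families that return no reasoning count can be sketched as a small helper. The function name is hypothetical, and the visible-token count is something you would have to measure yourself, for example with the vendor's own tokenizer.

```python
from typing import Optional

def hidden_output_tokens(completion_tokens: int,
                         visible_tokens: int,
                         reported_reasoning: Optional[int] = None) -> int:
    """Prefer the provider's reasoning count; otherwise reconstruct by subtraction."""
    if reported_reasoning is not None:
        return reported_reasoning
    return completion_tokens - visible_tokens

# MiniMax M3 row: 260 completion tokens, 3-token visible answer, no itemized count.
print(hidden_output_tokens(260, 3))         # 257
# Qwen3.7-max row: the usage object reports 1,096 directly.
print(hidden_output_tokens(1104, 8, 1096))  # 1096
```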

Cache tokens: two directions, 12x apart

Prompt caching splits input into two more classes, and the price gap between them is the reason the split exists: writes bill at 1.25x the input rate on Claude (2x for the 1-hour TTL), reads at 0.1x. A measured pair of Opus 4.8 calls with a 1,181-token cached system prompt cost $0.01246 (write) then $0.00566 (read), both reconciling against list prices to the sixth decimal, with the input-side charge dropping about 11x call-over-call. The point for this post is the accounting: if you lump cache_creation_input_tokens and cache_read_input_tokens into "input tokens", you can neither verify the discount nor notice when it silently stops happening. And it stops more often than the docs suggest: our caching measurements found effective thresholds at 1.4 to 2.4x the documented minimums, and the prompt caching guide covers the per-provider mechanics in depth.
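A quick reconciliation sketch for that pair of calls. It assumes the uncached input and the output were the same on both calls, so the difference between the two totals is just the cached prefix moving from the write rate to the read rate.

```python
INPUT_PER_M = 5.00                  # Opus 4.8 list input rate, $ per 1M tokens
WRITE_MULT, READ_MULT = 1.25, 0.10  # 5-minute-TTL write, cache read
CACHED_TOKENS = 1_181

write_charge = CACHED_TOKENS * INPUT_PER_M * WRITE_MULT / 1_000_000
read_charge = CACHED_TOKENS * INPUT_PER_M * READ_MULT / 1_000_000
predicted_drop = write_charge - read_charge
reported_drop = 0.01246 - 0.00566   # difference of the two measured call totals

print(f"write ${write_charge:.6f}  read ${read_charge:.6f}")
print(f"predicted drop ${predicted_drop:.6f} vs reported ${reported_drop:.6f}")
```

The two drops agree to within the rounding of the five-decimal totals quoted above.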

Why your local estimate never matches the bill

A common pattern is estimating cost client-side with a tokenizer library and reconciling later. Three reasons the numbers won't match:

Tokenizers differ per vendor. Counting Claude-bound text with an OpenAI tokenizer is measuring with the wrong ruler; the same string tokenizes differently on every family.
You bill for more than your message. System prompts and tool schemas are input tokens on every request that carries them, and they're easy to leave out of a local estimate.
Reasoning is unpredictable until the response. No client-side count can predict how many thinking tokens a model will spend; you only learn it from the returned usage.

The returned usage object is the upstream's own billing record, so the cheapest accurate counter is to stop estimating and read it. The catch is that every provider shapes it differently: cached tokens alone appear as cached_tokens, prompt_cache_hit_tokens, total_cached_tokens, or cache_read_input_tokens depending on the family.

The detail objects are not a fixed schema either. OpenAI's reference documents four completion-side fields: reasoning_tokens, audio_tokens, and the Predicted Outputs pair accepted_prediction_tokens / rejected_prediction_tokens; rejected prediction tokens never appear in the output yet still bill as completion tokens. On the prompt side, the same shape adds text_tokens, audio_tokens, and image_tokens beside cached_tokens; GPT-5.6 adds cache_write_tokens, and we have seen video_tokens in the wild. Vendors extend it freely: Kimi K2.7 returned an undocumented completion_tokens_details.text_tokens, and Gemini counts thinking and tool-use tokens separately under its own names. Parse defensively: treat unknown detail fields as expected, and never assume a missing field means zero.
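One way to parse defensively, sketched in Python. The four field names come from the paragraph above; where each one sits inside a given provider's payload is an assumption, which is why the helper searches nested objects instead of hard-coding a path.

```python
from typing import Any, Optional

# Cached-read field names seen across families.
CACHED_KEYS = ("cached_tokens", "prompt_cache_hit_tokens",
               "total_cached_tokens", "cache_read_input_tokens")

def cached_read_tokens(usage: dict[str, Any]) -> Optional[int]:
    """Return the cached-read count, or None when no known field is present.

    Unknown fields are ignored, and a missing field is reported as None
    rather than silently treated as zero.
    """
    stack = [usage]
    while stack:
        node = stack.pop()
        for key, value in node.items():
            if key in CACHED_KEYS and isinstance(value, int):
                return value
            if isinstance(value, dict):
                stack.append(value)
    return None
```

Returning None keeps "the provider said zero" distinct from "the provider said nothing", which matters when you are checking whether a discount silently stopped.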

This is where a gateway earns its place: Synthorai normalizes all of these into one object, with reasoning_tokens and the two cache directions populated across OpenAI, Anthropic, Gemini, and the open-weight families, so one parser covers every model you route to.

From seeing to stopping

Reading the ledger is half the job; the other half is making overruns impossible rather than merely visible. Month-end dashboards report a surprise after the money is gone, and an agent stuck in a retry loop does not read dashboards. On the gateway, every key carries a quota with tracked used_quota and an RPM ceiling, enforced at request time: a key that exhausts its budget gets an explicit error on the next request, not a bigger invoice three weeks later. Per-request attribution (which key, which model, BYOK or platform billing) comes back in the same response envelope, so cost-per-feature is a group-by, not a reconstruction project.

The order of operations that follows from the measurements: read reasoning_tokens and the cache fields before optimizing anything, set the reasoning dial per task, and put a hard quota on every key whose failure mode is a loop. To see what a specific mix of models and token classes costs at your volume, the cost optimizer prices it from the same per-token rates used above.

Prompt Cache Minimums: The Docs Under-State by 1.4–2.4x

synthorai — Sun, 12 Jul 2026 14:04:51 +0000

A customer told us prompt caching on our gateway was not kicking in at the token count the model's own documentation promised. We reproduced it, then re-ran every model against a second, independent serving path, one of the largest AI gateways, where the same gaps reproduced to the token. The documentation, not any one gateway, was the optimistic party: the published minimum is an eligibility floor, not the length that earns a cache hit. For the automatic-cache families the two differ by 1.4 to 2.4x. OpenAI's effective first-hit threshold measured near 1,456 tokens against a documented 1,024; Gemini 2.5 Flash first read from cache near 5,000 against a documented 2,048; Claude, which caches only where you place an explicit marker, hit its documented per-model minimum within a few percent.

TL;DR

OpenAI's documented 1,024-token cache minimum under-states the effective threshold of about 1,456 tokens, measured on two paths.
Gemini 2.5 Flash documents 2,048 but first read near 5,000 tokens, about 2.4x higher.
Claude's explicit cache_control hit its documented minimum within a few percent (Opus 1,073 vs 1,024).
GLM 5.2 and DeepSeek V4 publish no minimum and read from about 800 tokens; MiniMax M3 reports roughly 114 cached tokens at any length.
Automatic caches also need a 2-to-8 call warm-up before the first read.

We ran every measurement through two serving paths, our own gateway and one of the largest independent AI gateways, and treated a result as model behavior only where the paths agreed. The point of the second path is attribution: a discrepancy that reproduces on an unrelated vendor's stack is the model's, not ours. That cross-check worked cleanly for OpenAI, Gemini, and GLM, which cached on both, at the same effective thresholds. It does not work for every model: on that second gateway the open-weight models are largely served by GPU hosts that do not implement the vendor's prompt cache, which the gateway's own endpoint metadata confirms per provider, and unpinned routing drifts across those hosts so cache affinity is lost. Where the second path could not corroborate, the numbers below come from the path that reaches each vendor's own caching API. Lengths are each model's own tokens, calibrated from the returned usage, not characters. Every arm used a fresh prefix and recorded the first call index that produced a cache read, not a single hit-or-miss.

The gap between documented and effective

The documented minimum tells you when a prompt can be cached. The effective threshold is the length at which a repeated prompt actually comes back as a read. For the automatic-cache families they are not the same number.

Family	Cache type	Documented min	Measured first-hit	Gap
OpenAI GPT-5.5 / 5.4-mini	automatic	1,024	≈1,456	+40%
Gemini 2.5 Flash	automatic	2,048	≈5,000	2.4x
Gemini 3.5 Flash	automatic	4,096	≈5,200	+27%
Claude Opus 4.8 / Sonnet 5	explicit marker	1,024	1,073	exact
Claude Haiku 4.5	explicit marker	4,096	4,206	exact

The OpenAI figure held to the token across both paths: a 1,356-token prompt never read back, a 1,456-token prompt did. Gemini was the widest gap. A sweep that topped out at 3,300 tokens saw zero reads and looked like caching was off; extending the sweep to 5,000 produced a clean read on both paths at the same length. The documented 2,048 is a floor the cache is eligible at, not one it serves reads at.

The pattern across the study: the cache you mark explicitly has an accurate spec, and the cache that happens automatically does not.

The minimum is not the only undocumented variable

Clearing the effective threshold is necessary but not sufficient. The automatic-cache families need a warm-up: the first read often lands on a later call rather than the second.

OpenAI: first read on call 2 to 3.
Gemini: first read on call 4 to 8.

This matters for cost modelling. A prompt at 6,000 tokens is above every documented and effective Gemini threshold, but a workload that sends it twice and moves on can still pay full price both times, because the cache had not warmed. Short or bursty traffic pays the uncached rate even when the length qualifies. We only concluded "does not cache" after at least twelve repeated calls with a settle between them; a shorter sweep produced a false negative on Gemini that a deeper one overturned.
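For a cost model, the warm-up reduces to counting how many calls run before the first read. A minimal sketch, under the optimistic assumption that every call from the first read onward hits the cache:

```python
def uncached_calls(total_calls: int, first_read_call: int) -> int:
    """Calls billed at the full input rate before the cache first serves a read.

    first_read_call is 1-indexed: 2 means the second call already reads.
    """
    return min(total_calls, first_read_call - 1)

# A prompt sent twice against a cache that first reads on call 4: both at full price.
print(uncached_calls(2, 4))   # 2
# Twenty calls with the slowest warm-up listed above (call 8): seven at full price.
print(uncached_calls(20, 8))  # 7
```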

Cached counts also snap to fixed blocks, which is worth knowing when you reconcile a bill: 128-token blocks on OpenAI, 64-token blocks on DeepSeek. A read of 4,073 cached tokens on a 5,014-token prompt is a partial-prefix hit rounded to a block boundary, not a bug.

The caches you control are exact

Claude caches only the segments you tag with cache_control, and that control comes with an accurate spec. Every Anthropic claim we tested held:

Per-model minimum, within a few percent. Opus 4.8 and Sonnet 5 first read at 1,073 tokens against a documented 1,024; Haiku 4.5 at 4,206 against 4,096. The small overage is block rounding, not drift.
Read rate of 0.1x input. We derived each model's input price from its own cold rows, then solved the cached rate from a hit row. Opus 4.8 and Haiku 4.5 both came out at 0.10, matching the documented multiplier.
Five-minute refresh, free on every read. Seeding a prefix and re-reading at two, four, and six minutes hit on every read. A read inside each five-minute window keeps the entry alive with no extra write.
Cascade invalidation. With a stable system prefix and one tool defined, changing only the tool's description forced a full rewrite of the system cache below it. Changing a tool definition invalidates the system and message caches, matching the documented hierarchy.

One doc-versus-doc conflict fell out of this. A third-party table listed Claude Opus's minimum as 4,096 tokens; the measurement read at 1,073, and Anthropic's own 1,024 is the correct figure.

The open-weight families usually document nothing

The families above at least publish a number to be wrong about. The open-weight and Chinese-lab models mostly publish no minimum at all, which leaves measurement as the only option. Our provider cache comparison covers how their published rates line up once a hit lands; the question here is only the length at which the first read appears.

Family	Documented min	Measured first-hit	Granularity
GLM 5.2 (Z.ai)	none	reads from ≈800, both paths	64-token blocks
DeepSeek V4	none	reads from ≈800 against the vendor API	64-token blocks
MiniMax M3	512	reports a flat ≈114 cached at any length	non-standard

GLM 5.2 publishes no minimum length and cached from about 800 tokens with 64-token block granularity on both paths, a lower floor than any of the documented families. DeepSeek V4 also publishes no minimum and read from about 800 tokens with the same 64-token granularity, but only against its own caching API. DeepSeek's docs call the cache best-effort with no hit-rate guarantee, and that is exactly what an intermediary exposes: the other gateway serves DeepSeek through a set of GPU hosts where only DeepSeek's own endpoint implements the cache, so routing that is not pinned to that endpoint returns no reads at all.

MiniMax M3 is the case where the reported number itself misleads. It documents a 512-token minimum, but it reports a constant count near 114 cached tokens from the first call at every length from 200 to 5,000 tokens. That number does not track prompt length, and it appears even on paths that perform no caching, so it is the model's own bookkeeping rather than a signal of what was reused. This is the same lesson the newer OpenAI models teach from the other direction: the usage token fields and the real caching can disagree, so when the savings matter, reconcile against usage.cost, not the token count.

The newest models are moving the rules

Two doc-level changes are worth flagging before you assume older behavior carries forward. On the GPT-5.6 family, OpenAI's guide states that cache writes cost 1.25x the uncached input rate, where earlier families wrote for free. The same guide describes implicit caching as placing a breakpoint on the latest message, which is a different shape from prefix caching a stable system block across turns. If you want a stable prefix reused across differing user turns on those models, mark it with an explicit breakpoint rather than relying on the implicit path. Confirm the write multiplier and the minimum per model, because both now vary by family in ways a single doc page flattens.

What to do with this

Measure your own effective threshold. Sweep prompt length in your own tokens and record the first length that returns a cache read. Do not assume the documented minimum is where hits begin.
Budget for warm-up. Treat the first two to eight calls on a new prefix as uncached in your cost model for automatic-cache providers.
Prefer explicit markers where the provider offers them. Claude's cache_control gave an accurate, testable spec: a known minimum, a known read rate, a known TTL, and a known invalidation rule. That predictability is worth more than a lower documented floor you cannot rely on.
Re-baseline on new model families. Minimums, write pricing, and breakpoint behavior shifted within a single vendor's lineup during this study.
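The first two recommendations combine into one sweep loop. The sketch below is a hedged outline, not our harness: probe is a caller-supplied function that you would implement against your own provider, including the settle delay between calls, and fake_probe is a stand-in for demonstration only.

```python
from typing import Callable, Iterable, Optional

def first_hit_length(lengths: Iterable[int],
                     probe: Callable[[int, int], int],
                     max_calls: int = 12) -> Optional[tuple[int, int]]:
    """Return (length, call_index) of the first cache read, or None.

    probe(length, call_index) sends a fresh prefix of `length` tokens for the
    call_index-th time (1-indexed) and returns the cached-token count from
    the response usage.
    """
    for length in sorted(lengths):
        for call_index in range(1, max_calls + 1):
            if probe(length, call_index) > 0:
                return length, call_index
    return None

# Stand-in probe: reads appear at >= 1,456 tokens, starting on call 3.
def fake_probe(length: int, call_index: int) -> int:
    return length if length >= 1456 and call_index >= 3 else 0

print(first_hit_length([1024, 1356, 1456, 2048], fake_probe))  # (1456, 3)
```

The default of twelve calls per length mirrors the depth we needed before concluding that a prompt does not cache.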

For the read rates, TTLs, and keying rules that sit behind these thresholds, our prompt caching guide has the per-provider mechanics.

The short version: the documented minimum is an eligibility floor, not a hit threshold, and for automatic caches the two differ by 1.4 to 2.4x. Verify the number that governs your bill, in your own tokens, against your own traffic.

GPT-5.6 Cost Guide: Prompt Caching 90% Off, Reasoning Effort

synthorai — Fri, 10 Jul 2026 14:08:05 +0000

GPT-5.6 moves both cost levers at once: cached input drops to 10% of the input rate (5.x discounted 50%), and with reasoning on by default, not sending reasoning_effort billed 1.5x as much as pinning it to none across our 50-call matrix, with identical answers. On the input side you can now pin up to four cache breakpoints explicitly; on the output side the effort knob decides how much thinking you pay for. We measured both levers through the gateway on the day-one models: Sol ($5/$30 per 1M tokens in/out), Terra ($2.50/$15), and Luna ($1/$6), every rate confirmed against the live usage.cost meter.

TL;DR

Cached input bills at 10% of the input rate, measured $0.10/$0.25/$0.50 per 1M across the tiers; 5.x discounted 50%.
Breakpoints deliver partial reuse: changing the block after a marker re-billed only 1,210 of 2,431 tokens.
Prefixes under 1,024 tokens never cache and repeats can miss silently; budget hit rates below 100%.
Cache writes bill at 1.25x on written tokens; a never-read write costs more than not caching.
Omitting reasoning_effort billed 1.5x as much as none across a 4-task matrix, identical answers; pin it explicitly.

Measured 2026-07-10 through the Synthorai gateway (OpenAI-compatible chat completions), one day after OpenAI announced the family. All three models are live; the new caching parameters pass through unchanged.

Three tiers, one generation

The naming scheme is new: the number is the generation, and Sol, Terra, and Luna are capability tiers that replace the pro/mini/nano suffixes. All three share a 1M-token context window and 128K max output. Every rate below reconciles exactly against metered usage.cost on known token counts, including the cached column:

tier	input /1M	output /1M	cached input /1M (measured)
gpt-5.6-sol	$5.00	$30.00	$0.50
gpt-5.6-terra	$2.50	$15.00	$0.25
gpt-5.6-luna	$1.00	$6.00	$0.10

Sol is the flagship and the price-parity successor to gpt-5.5: the rate card is identical at $5/$30. Terra and Luna are the scaled-down tiers of the same generation, at half and a fifth of Sol's price, taking the slots the mini and nano suffixes used to hold. For token counting the three are one model: all returned identical counts on every sample we sent.

How 5.6 caching works, per the docs

GPT caching used to be a single behavior: the API detected repeated prefixes of 1,024+ tokens on its own and billed the cached share at half price. Our provider comparison filed the GPT column under "fully automatic" for exactly that reason. The 5.6 caching guide replaces that with a two-mode design:

{
  "model": "gpt-5.6-luna",
  "prompt_cache_options": { "mode": "explicit", "ttl": "30m" },
  "prompt_cache_key": "tenant-42",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "...stable system prompt, 1024+ tokens...",
          "prompt_cache_breakpoint": { "mode": "explicit" }
        }
      ]
    },
    { "role": "user", "content": "the varying part" }
  ]
}

The rules that matter, condensed from the guide:

A breakpoint marks the end of a cached prefix, covering that block and everything before it. implicit mode (the default) still auto-places a breakpoint on the latest message; explicit mode caches only what you mark.
Four cache writes per request, and the implicit auto-breakpoint consumes one of them, so explicit markers get three slots in the default mode and four in explicit mode. Breakpoints from earlier conversation turns are read-only on later requests.
The 1,024-token floor survives: a marked prefix below it is not cached.
ttl: "30m" is a guaranteed minimum lifetime, not a cap ("at least 30 minutes... may retain it longer"). It replaces prompt_cache_retention, which is deprecated on 5.6, and that means the old 24h extended-retention option is gone with it.
prompt_cache_key is how you get reliable matching: the guide recommends a stable key per tenant or session to route repeats to the same cache, with a soft limit around 15 requests per minute per key. Caches are scoped to your organization.
Cache writes bill at 1.25x the input rate on 5.6+, reported in the new usage.prompt_tokens_details.cache_write_tokens field. Writes were free on GPT-5.5 and earlier.

GPT-5.5 and older reject the new parameters with a clean 400 (prompt_cache_options is not supported on this model), so version-gate any rollout.
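One way to version-gate the rollout is to attach the cache fields only for models known to accept them. The sketch below is a minimal illustration, not the gateway's code: `supports_explicit_caching` and `build_request` are hypothetical helpers, and the prefix check assumes the three tier names used in this post. The post does not say whether older models accept prompt_cache_key, so the sketch conservatively omits it for them as well.

```python
def supports_explicit_caching(model: str) -> bool:
    """Hypothetical gate: True only for the gpt-5.6 tiers named in this post."""
    return model.startswith("gpt-5.6-")


def build_request(model: str, system_prompt: str, user_text: str,
                  cache_key: str) -> dict:
    """Build a chat-completions body, adding cache fields only where accepted."""
    system_block = {"type": "text", "text": system_prompt}
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": [system_block]},
            {"role": "user", "content": user_text},
        ],
    }
    if supports_explicit_caching(model):
        # Field names are the ones shown in the request example above.
        body["prompt_cache_options"] = {"mode": "explicit", "ttl": "30m"}
        body["prompt_cache_key"] = cache_key
        system_block["prompt_cache_breakpoint"] = {"mode": "explicit"}
    return body


new_body = build_request("gpt-5.6-luna", "stable system prompt", "hi", "tenant-42")
old_body = build_request("gpt-5.5", "stable system prompt", "hi", "tenant-42")
print("prompt_cache_options" in new_body, "prompt_cache_options" in old_body)
```

Keeping the gate in one function means a later model family can be added in a single place instead of at every call site.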

If the design sounds familiar, it should: markers on content blocks, four breakpoints, a write premium, and a sliding read-only history are the shape Claude's cache_control has had all along. The difference is the TTL: OpenAI's 30-minute guaranteed floor is 6x Claude's default 5 minutes.

What the meter says

Docs are claims; here is what the gateway's meter returned, probe by probe. Full raw records are in the run log; every cost below reconciles to the digit against the tier rates.

probe	result
explicit write, ≈3k-token marked prefix (Luna)	`cache_write_tokens=3012`, billed at $1.25/1M: the 1.25x premium, exactly
repeat with a different question	`cached_tokens=3012`, the full mark, at $0.10/1M; the call cost 90% less than the write call
write premium on Sol / Terra	$6.25 / $3.125 per million written tokens: 1.25x each, to the digit
cached rate on Sol / Terra	$0.50 / $0.25 per million: exactly 10% of input
marked block of 621 tokens, twice	never cached: `cache_write=0`, `cached=0`, full price both calls
marked block of 1,221 tokens	writes normally (1,212 written)
two breakpoints [A][B], then change B	`cached=1212` (exactly block A) + `cache_write=1210` (the new tail, at 1.25x)
five breakpoints in one request	accepted without error, all 5,548 tokens written (the 4-write cap counts slots, not tokens; a later mark covers everything before it)
prefix written on Luna, re-sent on Terra	`cached=0`, re-written: caches are per-model
a cache miss	can arrive with `cache_write=0` too: full price, nothing cached, no error

Three of those rows deserve unpacking.

Partial reuse is real, and it is the reason to adopt breakpoints. With a stable block A and a swapped tail B, the meter re-billed only the tail: 1,212 tokens read back at the cached rate, 1,210 written for the new B at the write premium, out of a 2,431-token prompt, and the total reconciles against the rate card to the digit. That is the layered-prefix behavior (system prompt, then tools, then documents, each marked) that Claude users structure prompts around, and GPT's automatic mode could never guarantee it. One footnote: on full repeats the matched length sometimes snaps below the mark (1,897 of a 2,422-token write in one probe), so budget on the discount rate, not on exact match counts.
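The partial-reuse arithmetic can be replayed in a few lines. This is a sketch under two assumptions the post does not state: that the probe ran on Luna, and that the 9 prompt tokens outside both counters billed at the list input rate. The function name is illustrative.

```python
LUNA_INPUT = 1.00    # USD per 1M input tokens (rate card)
CACHED_RATE = 0.10   # USD per 1M cached tokens (10% of input)
WRITE_RATE = 1.25    # USD per 1M written tokens (1.25x input)


def input_cost(prompt_tokens: int, cached: int, written: int) -> float:
    """Input-side cost in USD; tokens neither cached nor written bill at list."""
    plain = prompt_tokens - cached - written
    micro_usd = cached * CACHED_RATE + written * WRITE_RATE + plain * LUNA_INPUT
    return micro_usd / 1_000_000


# Swapped tail B: block A read from cache, new B written.
partial = input_cost(2_431, cached=1_212, written=1_210)
# Counterfactual: nothing reused, both blocks written again.
rewrite = input_cost(2_431, cached=0, written=2_422)
print(f"partial reuse ${partial:.7f} vs full rewrite ${rewrite:.7f}")
```

Under these assumptions the reused layer roughly halves the input bill for the call, which is the saving that layered breakpoints are meant to capture.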

The floor and the silent misses are the operational traps. A 621-token marked block cached nothing, twice, with no error and no usage hint beyond the zeros; if your "stable prefix" is a short system prompt, you are paying full price and nothing tells you. And a miss can arrive without a write, at full price, equally silently. Hit rate is a distribution, not a promise, whatever path your requests take: read cached_tokens in production and alert on it, the way our five-minute cache audit does.
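A monitoring hook for this can be small. The sketch below labels each response from its usage block and computes a token-weighted hit rate. It assumes both counters sit under usage["prompt_tokens_details"]; the post names that path only for cache_write_tokens, so check the location of cached_tokens against your own responses. The labels and function names are illustrative.

```python
CACHE_FLOOR = 1_024  # tokens; a marked prefix below this is not cached


def classify_cache_outcome(usage: dict) -> str:
    """Label one response's cache outcome from its usage block."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    written = details.get("cache_write_tokens", 0)
    if cached > 0 and written > 0:
        return "partial_hit"
    if cached > 0:
        return "hit"
    if written > 0:
        return "write"
    # prompt_tokens is only an upper bound on the marked prefix, so this
    # label is a hint: a whole prompt under the floor can never cache.
    if usage.get("prompt_tokens", 0) < CACHE_FLOOR:
        return "below_floor"
    return "silent_miss"


def hit_rate(usages: list[dict]) -> float:
    """Share of prompt tokens served from cache across a batch of calls."""
    total = sum(u.get("prompt_tokens", 0) for u in usages)
    cached = sum((u.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
                 for u in usages)
    return cached / total if total else 0.0


sample = [
    {"prompt_tokens": 3_100, "prompt_tokens_details": {"cached_tokens": 3_012}},
    {"prompt_tokens": 3_100, "prompt_tokens_details": {"cached_tokens": 0,
                                                       "cache_write_tokens": 0}},
]
print([classify_cache_outcome(u) for u in sample], round(hit_rate(sample), 3))
```

Alerting on the share of "silent_miss" and "below_floor" labels, rather than on single calls, matches the point that hit rate is a distribution.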

The write premium is real, and it changes the break-even. Written tokens bill at exactly 1.25x the input rate on all three tiers (Luna writes meter at $1.25 per million, Terra at $3.125, Sol at $6.25), reconciling to the digit on every probe in our final run. That premium only pays back once the prefix is read again: a write that never gets a hit costs 25% more than not caching at all, the same trap we measured on Claude's write premium in the LangChain post. Mark prefixes you know will repeat, not everything that looks stable.
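The break-even follows directly from the two multipliers. The sketch below works in units of one full-price prefix and assumes every call after the write is a full hit; real traffic has misses, so treat the result as a lower bound on the calls needed. Function names are illustrative.

```python
def cumulative_cost(calls: int, cached: bool,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Prefix cost over `calls` requests, in units of one full-price prefix."""
    if calls <= 0:
        return 0.0
    if not cached:
        return float(calls)
    # One write at the premium, then reads at the cached rate.
    return write_mult + read_mult * (calls - 1)


def break_even_calls(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest call count at which caching is cheaper than not caching."""
    if read_mult >= 1.0:
        raise ValueError("caching never pays back if reads cost full price")
    calls = 1
    while cumulative_cost(calls, True, write_mult, read_mult) >= calls:
        calls += 1
    return calls


print(break_even_calls())          # calls needed before caching wins
print(cumulative_cost(1, True))    # a write that is never read again
```

With the measured multipliers a single later hit already repays the write, so the risk sits entirely in prefixes that are written once and never read.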

The 30-minute floor held as far as we probed it. A keyed re-read 15 minutes after the write came back fully cached (1,313 of 1,313 tokens, reconciling at the 10% rate), well past the old in-memory horizon of 5 to 10 minutes, and a second keyed probe at the same gap repeated the result. We did not probe the full 30 minutes.

The same workload on GPT-5.5, for contrast

The fair comparison is price-parity: Sol carries gpt-5.5's exact rate card ($5/$30), so it is the like-for-like successor, with Terra and Luna as the scaled-down tiers below it. Same sticker price, very different caching terms:

	gpt-5.5	gpt-5.6-sol
list price in / out per 1M	$5.00 / $30.00	$5.00 / $30.00
cached input rate	50% of input (documented)	10% of input (measured)
cache control	automatic only	automatic + up to 4 explicit marks
lifetime	5-10 min best effort, optional 24h retention	30 min guaranteed floor (keyed); 24h option gone
cache-write fee	none	1.25x input on written tokens

At the same sticker price, the caching terms are the upgrade. A 3,000-token prefix costs $0.0075 per call on 5.5 when its automatic cache decides to hit, and $0.0015 on warm Sol, 5x less on the cached share. The deeper change is control and visibility: 5.5's hits depend on opaque prefix detection you can neither trigger nor debug, while 5.6 lets you mark exactly what should cache, route repeats with a prompt_cache_key, and watch every write land in usage. A miss now shows up as a zero in a field you created, not as silence. The step-down path stacks on top: if 5.5 was over-serving the workload, Terra halves the whole rate card and Luna divides it by five, and the same warm prefix drops to $0.00075 and $0.0003. The one thing 5.5 keeps is the optional 24-hour retention; if your traffic is a daily batch against a giant prefix, that trade-off runs the other way. And note the second lever cuts the other way on migration: 5.6 reasons by default, so a 5.5 workload moved over without pinning reasoning_effort picks up new output spend at the same rate card.
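The per-call figures in that paragraph come from one multiplication each. A short sketch, using only the rates quoted in this post (the table and function names are illustrative):

```python
PREFIX_TOKENS = 3_000

# model: (input USD per 1M tokens, cached rate as a share of the input rate)
TERMS = {
    "gpt-5.5":       (5.00, 0.50),  # documented 50% cached rate
    "gpt-5.6-sol":   (5.00, 0.10),  # measured 10% cached rate
    "gpt-5.6-terra": (2.50, 0.10),
    "gpt-5.6-luna":  (1.00, 0.10),
}


def warm_prefix_cost(model: str, tokens: int = PREFIX_TOKENS) -> float:
    """USD billed for a fully cached prefix on one call."""
    input_rate, cached_share = TERMS[model]
    return tokens * input_rate * cached_share / 1_000_000


for model in TERMS:
    print(f"{model:14s} ${warm_prefix_cost(model):.5f}")
```

Multiplying any of these by 1,000 gives the per-thousand-calls view of the same prefix.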

The second lever: reasoning effort

Caching governs what you pay for input; reasoning_effort governs the output side, because reasoning tokens bill at the output rate and, unlike your prefix, can never be cached. GPT-5.6 accepts none through xhigh on all tiers. The launch post also introduces a max effort for Sol; through chat completions it does not exist (400: 'reasoning_effort' does not support 'max' with this model, on Sol and Terra alike), so xhigh is the practical ceiling on the API path gateways and SDKs use.

We ran a 50-call matrix: four task shapes (classify a review, extract a field from a log line, a multi-step arithmetic word problem, a small code-generation task) across all six settings, none to xhigh plus the parameter omitted, on Terra and Luna with a Sol spot check. Every one of the 50 answers was correct at every setting. What varied was the bill. These are short-output calls (tens of visible tokens), so a few dozen reasoning tokens at the output rate dominate the total; the ratio column compares whole-call cost:

task (Luna)	reasoning tokens at default (parameter omitted)	default cost vs `none`
classify	0	1.0x
extract	0	1.0x
math	24	3.5x
code	39	2.5x

Three findings. First, 5.6 is adaptive on its own: on the two trivial shapes no setting spent a single reasoning token, so the knob is free there. Second, on shapes that look like they deserve thought (math, code), the default reasons even when it buys nothing: omitting the parameter cost 2.5x to 3.5x the none price on Luna's math and code and Terra's math (Terra's code run happened to spend nothing at default) for the same correct answers, and 1.5x as much summed across the Terra and Luna grid. Third, the intermediate settings (not shown in the table) are noise, not a dial: Terra's math run spent 19 reasoning tokens at low, zero at medium, 21 at high, and zero again at xhigh, and Luna's code run spent 101 at xhigh against 41 at high. The names are intents, not budgets, the same behavior we measured on GLM 5.2.

Send reasoning_effort explicitly on every call, and default it to none for classification, extraction, routing, and short transforms. Escalate a specific call site only when your evals show the higher setting changing outcomes, not because the task feels hard. Our four shapes are short-output API workloads; genuinely hard multi-step work may earn its reasoning spend, but let measurements say so.
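In practice that policy can live in one small wrapper so no call leaves without the parameter. This is a hypothetical sketch: the call-site names and the single escalation are invented for illustration, and the allowed set is inferred from the five named settings in the matrix above.

```python
DEFAULT_EFFORT = "none"

# Hypothetical call-site registry; escalate an entry only when an eval says so.
EFFORT_BY_CALL_SITE = {
    "classify_review": "none",
    "extract_field": "none",
    "plan_refactor": "high",
}

# Settings accepted through chat completions per this post; "max" returns a 400.
ALLOWED = {"none", "low", "medium", "high", "xhigh"}


def with_pinned_effort(body: dict, call_site: str) -> dict:
    """Return a copy of the request body with reasoning_effort always set."""
    effort = EFFORT_BY_CALL_SITE.get(call_site, DEFAULT_EFFORT)
    if effort not in ALLOWED:
        raise ValueError(f"unsupported reasoning_effort: {effort!r}")
    return {**body, "reasoning_effort": effort}


base = {"model": "gpt-5.6-luna", "messages": [{"role": "user", "content": "hi"}]}
print(with_pinned_effort(base, "classify_review")["reasoning_effort"])
print(with_pinned_effort(base, "unknown_site")["reasoning_effort"])
```

Unknown call sites fall back to none, so a new feature starts on the cheap setting and has to earn an escalation.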

The two levers compound. Once a prefix is warm, the input side of a Luna call costs a tenth of list price, and the reasoning default becomes the largest remaining line item on short tasks. Luna's math call was $0.00007 total at none and $0.00025 with the parameter omitted, so the reasoning default alone added $0.00018, more than twice the entire cost of the call pinned to none. Cache without pinning effort and the savings leak back out the other side.

What to recommend per workload

The decision now has real structure, so here is what we tell gateway customers:

workload shape	recommendation
chat, one big stable system prompt	stay `implicit`; the auto-breakpoint covers you and the discount is 90% either way
agents with layered prefixes (system + tools + files)	`explicit` mode, mark each stable layer, volatile content last; a changed layer re-bills only from its mark
RAG with reordered context	explicit marks on the layers above the retrieved chunks; the reorder then costs only the tail
cron and sporadic jobs, 10-30 min apart	the 30m TTL floor targets exactly these (GPT-5.5 and earlier and Claude's 5m default never hit); keyed re-reads hit in full at 15 minutes in our probes
short prompts (<1,024 tokens)	caching does not apply; don't spend effort marking

Independent of shape: send a stable prompt_cache_key per tenant or session (the docs make the key the basis of reliable matching), keep every marked layer above the 1,024-token floor, and monitor cached_tokens because silent misses exist. Remember the caches are per-model: an A/B test across tiers re-warms from zero on each side. And set the other lever in the same commit: pin reasoning_effort per the matrix above, none unless an eval says otherwise.

On tier choice, the 90% discount changes the arithmetic more than the tier does. A workload replaying a 3,000-token prefix pays about $0.30 per thousand calls for that prefix on warm Luna traffic and $1.50 on warm Sol; the spread between tiers on the cached share is smaller than the spread on output tokens, so pick the tier for output quality and price, then let caching flatten the input side. Coming from gpt-5.5, Sol is the drop-in upgrade at the same rate card with 5x cheaper cache reads and control over when they happen; step down to Terra or Luna where your evals say the smaller tier holds, and the rate card halves or divides by five on top.

The tokenizer did not change

Our 24 samples (a narrative passage in nine languages, technical and news versions in six of them, a Python function, and a JSON tool-call) count identically across GPT-5.5, Sol, Terra, and Luna in every comparison that completed. Token budgets and cache-floor estimates calibrated on 5.5 carry over untouched; the cross-language behavior is in our tokenizer-by-language post and applies to 5.6 as-is.

Bottom line

The cache discount deepening from 50% to 90%, with a 30-minute guaranteed TTL, is the real price cut in this release; the tier prices are the headline, but the caching terms move real bills more.
Adopt explicit breakpoints for layered prompts: partial reuse is measured, not theoretical, and the mental model transfers from Claude one-to-one.
Respect the 1,024-token floor, send a prompt_cache_key, and monitor cached_tokens; silent misses and prefixes that silently never cache both exist.
Send reasoning_effort explicitly, default none: the unmanaged default billed 1.5x as much across our matrix, up to 3.5x on single tasks, for identical answers.
xhigh is the reachable ceiling (max returns a 400 through chat completions), and token budgets calibrated on 5.5 need no re-baseline.

FAQ

Does GPT-5.6 support explicit prompt caching like Claude?
Yes. prompt_cache_options: {"mode": "explicit"} plus prompt_cache_breakpoint markers on content blocks, up to four writes per request (three in implicit mode, where the automatic breakpoint takes a slot). Measured through an OpenAI-compatible gateway, a marked 3,012-token prefix wrote on the first call and read back in full at the cached rate on the second.

What does cached input cost on GPT-5.6?
10% of the input rate, measured on all three tiers: $0.10 per million on Luna, $0.25 on Terra, $0.50 on Sol. GPT-5.5 billed cached tokens at 50% of input, so the cached-token rate is 5x lower on 5.6.

Is GPT-5.6 caching better than GPT-5.5's?
On discount depth and control, yes: 10% cached rate versus 50%, four explicit breakpoints versus automatic-only detection you cannot trigger or debug, and a 30-minute keyed floor versus 5-10 minutes best effort. 5.5's one remaining advantage is the optional 24-hour retention tier, which 5.6 drops.

How long does the GPT-5.6 cache live?
The docs guarantee at least 30 minutes (ttl: "30m" is the only accepted value), possibly longer; the option replaces the deprecated prompt_cache_retention, including the old 24-hour extended tier. In our probes keyed re-reads 15 minutes after the write hit in full; we did not probe the full 30.

Do I need prompt_cache_key?
Send it: the docs make a stable key per tenant or session the basis of reliable matching on 5.6, with a soft limit around 15 requests per minute per key. It costs nothing to include, and pairing it with cached_tokens monitoring is how you verify the discount is actually landing.

How much does reasoning_effort change GPT-5.6 cost?
On our 50-call matrix (four task shapes, six settings, Terra and Luna), every setting produced correct answers, and omitting the parameter billed 1.5x as much as none overall, up to 3.5x on the arithmetic task. On trivial shapes (classification, extraction) no setting spent reasoning tokens at all. Pin none and escalate only on evals.

Is the max reasoning effort available on GPT-5.6 Sol?
Not through chat completions. Requests with reasoning_effort: "max" return a 400 listing none through xhigh, on Sol and Terra alike.

Which GPT-5.6 tier should an API workload use?
Sol is the price-parity successor to gpt-5.5 (identical $5/$30 rate card, 5x cheaper cached reads); Terra and Luna are the scaled-down tiers at half and a fifth of that. Once your prefix is stable and keyed, the 90% cache discount flattens the input side, so step down as far as your output-quality evals allow and let the tier set the output price.

Which LLM Is Cheapest for Your Language? Tokenizer Costs Measured

synthorai — Wed, 08 Jul 2026 15:43:24 +0000

There is no single cheapest LLM for multilingual text: measured on the same passage, GPT-5.5 bills the fewest tokens for European languages, Hindi, and Korean, Kimi K2.5 is leanest on Chinese, and DeepSeek on Japanese. Claude Fable 5, Opus 4.8, and Sonnet 5 share one tokenizer (identical counts on every sample we sent) and are never the leanest: the same English paragraph bills 90 raw tokens on Claude against DeepSeek's 55, and the net premium runs from 1.3x on Japanese to 2.2x on Chinese. The token is the billing unit, so two things you rarely see on a pricing page decide your input cost: how densely a language packs meaning into characters, and how well each model's tokenizer compressed that script. They multiply, and the product is not what the per-character view suggests.

TL;DR

Claude Fable 5, Opus 4.8, and Sonnet 5 share one tokenizer, never the leanest: 1.2-2.3x the cheapest count everywhere.
The cheapest tokenizer flips by language: GPT-5.5 on European languages, Hindi, Korean; Kimi on Chinese; DeepSeek on Japanese.
Per character CJK looks 3x worse, but per meaning Chinese sits near parity while Japanese and Korean run 1.5-2.4x.
Cost is script density times tokenizer coverage; missing coverage multiplies it (GLM bills Hindi at 4.9x its English).
Localizing rarely saves money; match model to language by tokens.

Counts were measured through the Synthorai gateway on 2026-07-08, always using each provider's own count, never a local tokenizer. Every count was identical across repeats.

The token is the billing unit, not the text

You are billed per token, but a token is not a character and not a word. Each model ships its own tokenizer with its own vocabulary, and the same sentence resolves to a different token count on each one. That count then multiplies the per-token price, so two things vary at once: how many tokens your text becomes, and what each token costs.

Most pricing pages only show the second number. This post measures the first. We sent three semantically aligned passages to seven models (claude-fable-5, claude-opus-4-8, claude-sonnet-5, deepseek-v4-flash, glm-5.2, gpt-5.5, kimi-k2.5) and read back what each one billed as input.

A casual narrative (a Saturday market story) exists in nine languages; a technical explanation (retry with exponential backoff) and a news brief (a city-budget vote) exist in English, Chinese, Japanese, Korean, German, and Hindi. A Python function and a JSON tool-call blob round out the set. The non-English versions are machine translations produced under a faithful, no-compression instruction and spot-checked by hand; translation verbosity is a real confound, and the register section below bounds it at roughly 20%.

Counting is always the provider's own: the Claude line via a real Messages call reading usage.input_tokens (the gateway does not currently proxy count_tokens), OpenAI-compatible models via a small call reading usage.prompt_tokens. A local tokenizer that disagrees with the invoice is the exact failure this avoids. One control matters: every request carries fixed framing (chat template, role markers) worth a handful of tokens, so we measure a two-character baseline sample and subtract it. Every ratio in this post is net of that envelope and compares the text, not the framing.

The same text, five tokenizers

Here is the raw input-token count for the narrative passage, per language, per tokenizer; the three Claude models share a column because they returned identical counts on every sample (more on that below). The other two passages reproduce the same pattern and are folded in further down. The character column is the length of that language's version; scripts pack meaning differently, so Chinese says the same thing in 77 characters that English needs 254 for.

language	chars	fable-5 / opus-4-8 / sonnet-5	deepseek-v4	glm-5.2	gpt-5.5	kimi-k2.5
en	254	90	55	63	57	60
zh	77	96	50	58	69	50
ja	136	136	101	116	114	129
ko	143	160	104	123	93	129
hi	196	147	124	192	76	133
de	289	146	92	92	75	104
fr	259	111	76	79	66	93
es	253	112	75	79	66	91
it	272	127	84	91	78	100

Two facts jump out. The Claude column is one number for three models because Claude Fable 5, Opus 4.8, and Sonnet 5 returned identical counts on every sample, languages, code, and JSON alike: all three carry the tokenizer introduced with Opus 4.7, so a count on one is a count on all of them. And that column is the largest in every row except Hindi, where GLM's 192 is worse. Normalized on net counts so the leanest model in each language reads 1.00 (the envelope is subtracted first, so these ratios won't match a division of the raw cells above):

language	fable-5 / opus-4-8 / sonnet-5	deepseek-v4	glm-5.2	gpt-5.5	kimi-k2.5
en	1.64	1.00	1.00	1.00	1.00
zh	2.20	1.12	1.12	1.55	1.00
ja	1.33	1.00	1.07	1.11	1.24
ko	1.77	1.15	1.28	1.00	1.38
hi	2.01	1.72	2.59	1.00	1.78
de	2.03	1.28	1.16	1.00	1.38
fr	1.75	1.20	1.12	1.00	1.41
es	1.76	1.19	1.12	1.00	1.37
it	1.68	1.11	1.10	1.00	1.27

The English row's four-way tie is not rounding: DeepSeek, GLM, GPT-5.5, and Kimi all land on exactly 50 net tokens for that passage. Claude runs 1.3x to 2.2x the leanest tokenizer on this passage, 1.2x to 2.3x across all three, a property of the vocabulary that applies to every call for the life of the model. The technical and news passages reproduce the ranking: summing both, Claude bills Chinese at 212 net tokens against Kimi's 114 (1.9x) and Hindi at 477 against GPT-5.5's 210 (2.3x). But there is no single winner underneath it. The leanest column moves as the language changes:

GPT-5.5 is leanest on German, French, Spanish, Italian, Hindi, and Korean, and ties for leanest on English (the tie and fr/es/it hold on the narrative only). Its vocabulary is tuned toward Latin scripts and it holds up on Devanagari and Hangul.
Kimi K2.5 is leanest on Chinese and competitive across CJK.
DeepSeek-v4 is leanest on Japanese and close behind on Chinese.
GLM 5.2 is middle-of-pack on most languages but posts the worst cells in the matrix on Hindi: 2.59x the leanest on the narrative (179 net tokens where GPT-5.5 needs 69), worse still on the formal passages, and the one column that exceeds even Claude.
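The normalized table can be reproduced from the raw counts once the envelope is subtracted. In the sketch below the per-model envelopes are inferred, not published: each is the raw English count minus the net English count (82 on Claude, 50 on the other four). Three rows are included to keep it short.

```python
MODELS = ("claude", "deepseek-v4", "glm-5.2", "gpt-5.5", "kimi-k2.5")

# Raw narrative counts from the first table, in MODELS order.
RAW = {
    "en": (90, 55, 63, 57, 60),
    "zh": (96, 50, 58, 69, 50),
    "hi": (147, 124, 192, 76, 133),
}

# Inferred envelope per model: raw English minus net English.
ENVELOPE = (8, 5, 13, 7, 10)


def normalized(lang: str) -> dict[str, float]:
    """Net counts for one language, scaled so the leanest model reads 1.00."""
    net = [raw - env for raw, env in zip(RAW[lang], ENVELOPE)]
    leanest = min(net)
    return {model: round(n / leanest, 2) for model, n in zip(MODELS, net)}


for lang in RAW:
    print(lang, normalized(lang))
```

Dividing the raw cells directly would give different ratios, because the fixed framing weighs more on short passages and differs per model.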

The penalty is not limited to prose. On the Python function Claude runs 1.61x the leanest, and on the JSON tool-call 1.29x. The JSON gap is narrower because structured text is mostly punctuation and short ASCII keys that every tokenizer handles similarly. For a long-running agent that replays a large tool schema every turn, that per-turn tax compounds, which is where caching earns its keep. The prompt-caching series covers those mechanics.

The per-character trap: CJK looks worse than it bills

The tables above compared models. Hold the model fixed instead, and language still moves the count, but not the way the raw character view suggests. The most-quoted tokenizer number is tokens per character, and CJK dominates it: on Claude, Chinese runs about 114 net tokens per 100 characters, Korean 106, Japanese 94, against 32 for English. Read that column alone and CJK looks like a 3x tax, but it is the wrong column: you pay for meaning, not characters, and the aligned passage carries the same meaning in every language. Here are both views on Claude, for the narrative passage:

language	chars	net tokens	tokens / 100 chars	tokens vs English
en	254	82	32	1.00
zh	77	88	114	1.07
ko	143	152	106	1.85
ja	136	128	94	1.56
hi	196	139	71	1.70
de	289	138	48	1.68
it	272	119	44	1.45
es	253	104	41	1.27
fr	259	103	40	1.26

The two right-hand columns disagree, and Chinese is the sharpest case: the highest per-character density in the set, yet a per-meaning premium of only 1.07x English on this passage. 77 characters carry what English spends 254 on, so a steep per-character rate multiplies a tiny character count and nearly cancels out. Across all three passages the cancellation holds but is not magic: Chinese averages 1.17x Claude's own English, and 0.95x to 1.32x depending on the model, near parity rather than the 3x the per-character column threatens.

Japanese and Korean answer the obvious next question: same illusion, weaker cancellation. Both share the high per-character density, because Hangul and Japanese kana spell sounds out roughly one glyph per syllable rather than packing a whole word into each character the way Chinese hanzi do. So Korean needs 143 characters and Japanese 136 for the passage Chinese says in 77. More characters times a high per-character rate does not cancel, it compounds: on Claude, Korean averages 1.96x English per meaning across the three passages and Japanese 1.56x, both genuinely expensive, even though their per-character column looks a lot like Chinese's.

German is the mirror image of Chinese: a low per-character rate (48, near English) but the most characters of any language here (289, its compound words), which still totals 1.68x. Cost is the product of the two axes, and either one read alone misleads.
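Both views come from the same two columns, which is easy to verify. A minimal sketch using the Claude character and net-token figures from the table above (the function name is illustrative):

```python
# Claude, narrative passage: (characters, net tokens) from the table above.
CLAUDE = {
    "en": (254, 82),
    "zh": (77, 88),
    "ko": (143, 152),
    "ja": (136, 128),
    "de": (289, 138),
}


def two_views(lang: str) -> tuple[int, float]:
    """Return (tokens per 100 characters, tokens relative to English)."""
    chars, tokens = CLAUDE[lang]
    english_tokens = CLAUDE["en"][1]
    return round(100 * tokens / chars), round(tokens / english_tokens, 2)


for lang in CLAUDE:
    per_char, per_meaning = two_views(lang)
    print(f"{lang}: {per_char} tokens/100 chars, {per_meaning}x English")
```

Chinese and German land at opposite ends of the first number and still differ on the second, which is the per-character trap in two rows.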

Why the numbers move: two factors, multiplied

The rule underneath every table above is one equation:

tokens for a passage = (characters to express the meaning) x (tokens per character)

The first factor is the writing system's density, a property of the language, not the model. It is a spectrum, not a Chinese exception. Logographic Chinese packs a morpheme into each character and sits at the dense extreme. Japanese kana and Korean Hangul spell sounds out, so they are less dense and need more characters. Devanagari and the Latin alphabets are less dense still. Meaning-per-character falls steadily from Chinese to English.

The second factor is how many tokens the model's vocabulary spends per character of that script, and it is entirely model-specific. A BPE tokenizer learns multi-character merges from its training corpus: scripts it saw often get compact tokens, scripts it saw rarely fall back toward character-by-character or even byte-level encoding, where one character can become two or three tokens. The same three languages, net tokens per character:

tokens per character	Chinese	Hindi	English
Claude	1.14	0.71	0.32
DeepSeek	0.58	0.61	0.20
GPT-5.5	0.81	0.35	0.20
GLM 5.2	0.58	0.91	0.20
Kimi K2.5	0.52	0.63	0.20

The table explains three things. Chinese looks special in the totals because it is extreme on the first factor: even Claude's weak Chinese compression (1.14 tokens per character, still splitting some hanzi in two) can't make the total large when there are only 77 characters, and the China-trained models compress it well enough (0.52 to 0.58) to land near parity with their own English. Hindi's premium is the second factor, not density: GLM spends 0.91 tokens per Devanagari character, nearly one token per character because its vocabulary has almost no multi-character Devanagari merges, while GPT-5.5 spends 0.35 by covering whole syllable clusters, a coverage gap on the same script. And Claude is high everywhere because its per-character rate is high even on English (0.32 versus DeepSeek's 0.20), a model-level baseline that stacks on top of whatever the language does.
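The two factors can be separated numerically. The sketch below splits a language's premium over English into a density term (characters needed) and a coverage term (tokens per character), using narrative-passage figures quoted in this post. It covers one passage only, so the totals differ from the three-passage averages reported elsewhere; the function name is illustrative.

```python
CHARS = {"en": 254, "zh": 77, "hi": 196}  # narrative passage lengths


def decompose(net_tokens: dict[str, int], lang: str) -> tuple[float, float, float]:
    """Split a language's premium over English into (density, coverage, total).

    density  = characters needed relative to English (language property)
    coverage = tokens per character relative to English (tokenizer property)
    total    = density * coverage = net tokens relative to English
    """
    density = CHARS[lang] / CHARS["en"]
    rate = net_tokens[lang] / CHARS[lang]
    rate_en = net_tokens["en"] / CHARS["en"]
    coverage = rate / rate_en
    return density, coverage, density * coverage


claude = {"en": 82, "zh": 88, "hi": 139}   # net tokens, narrative
glm = {"en": 50, "hi": 179}                # net tokens, narrative

for name, counts, lang in (("claude", claude, "zh"), ("glm-5.2", glm, "hi")):
    d, c, t = decompose(counts, lang)
    print(f"{name} {lang}: density {d:.2f} x coverage {c:.2f} = {t:.2f}x English")
```

For Claude's Chinese a small density term offsets a large coverage term; for GLM's Hindi the density term is close to one and the coverage term carries almost the whole premium.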

None of this is a quirk of our seven models. The research literature calls the phenomenon the token premium, and Petrov et al. (NeurIPS 2023) measured it across hundreds of language pairs, finding the same two root causes (character counts differ per meaning, and tokenizer coverage differs per script) and premiums up to 15x for low-resource languages, with the same consequences: higher cost, higher latency, and less usable context window, since a high-premium language fills the same context budget with less meaning. The gap also narrows as vendors invest: independent measurements put Chinese at +182% tokens versus English on GPT-3-era vocabularies and +24% on GPT-4o's, close to the +32% we measure on GPT-5.5 and the parity we measure on the China-trained models. Coverage is bought with vocabulary slots, and vendors keep buying.

Does localizing ever save money?

It is tempting to read the trap section as "Claude is flat across languages, so ignore localization" and "China models are cheap on Chinese, so localizing saves money." Both are wrong. Here is each language against that model's own English, averaged over the three passages, for the five languages that have all three:

vs own English	zh	de	hi	ja	ko
Claude	1.17	2.11	2.40	1.56	1.96
DeepSeek	1.00	1.94	3.11	1.85	1.99
GLM 5.2	1.03	1.77	4.89	2.03	2.31
GPT-5.5	1.32	1.53	1.70	2.09	1.72
Kimi K2.5	0.95	2.20	3.15	2.18	2.41

Claude is not flat: Korean costs 1.96x its English, Hindi 2.40x. Chinese sitting near 1.17x is a one-language coincidence, not a property of the model. And the China models do not beat parity on Chinese so much as reach it: the best cell in the whole table is Kimi's 0.95x, five percent below its own English, and every other cell costs the same or more. On Hindi, Japanese, and Korean those same models carry a bigger penalty than Claude does, not a smaller one, because those scripts are further from their training focus. The pattern is not "vendor X is cheap"; it is that every model is leanest, relative to its own English, on the languages nearest its training data.

Register moves these numbers too. The casual narrative is the friendliest case; the technical and news passages raise nearly every multiplier, because terminology and loanwords are exactly what non-Latin vocabularies lack merges for. German goes from 1.68x (casual) to 2.29x (technical) on Claude, and GLM's Hindi reaches 5.98x its English on the news passage. A one-passage benchmark flatters whichever language got the friendliest translation, which is why the single narrative had Chinese at 0.80x on Kimi and three passages put it at 0.95x.

Relative-to-own-English is the wrong lens anyway. What you pay is absolute tokens, and there Claude is the most expensive in eight of the nine languages, with GLM's Hindi the one cell that is worse. Chinese content that is "cheap relative to Claude's English" is still 88 net tokens on Claude against 40 on Kimi for the narrative. So the move is not to localize for savings but to match the model to the language: Chinese to Kimi or DeepSeek, Hindi and Korean to GPT-5.5, Japanese to DeepSeek. Claude is never the token-cost winner in any language, though it may still win on quality.

Counts are only half the bill

A token multiplier only matters next to the per-token price, and the two compound. Claude Fable 5 lists at $10 per million input tokens, Opus 4.8 at $5, and Sonnet 5 at $3 once its introductory pricing ends; on Chinese their shared tokenizer also counts the text at 2.2x the leanest model, so the count premium multiplies whatever rate gap the model already carries over the alternative you would route to. The opposite can happen too: a model can count leanly and still cost more per call because its rate is high. Neither number alone tells you the bill. We don't print the other vendors' rates here because they change faster than tokenizers do; the counts above are the durable half of the calculation.

The practical move is to stop comparing sticker prices and compare effective input cost: your real traffic mix, counted on each candidate model, multiplied by that model's input rate. On a Chinese-heavy or Korean-heavy product that reordering can flip which model is cheapest, and the gap is a durable 1.5x to 2x, not a rounding error. It is the same reason the number that matters when you cache is effective cost weighted by hit rate, not the headline rate, which the provider comparison works through. The version-to-version cut of this story, why Sonnet 5 counts 41% more than Sonnet 4.6 on the same English, is in the Sonnet 5 tokenizer post.
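Effective input cost is a weighted sum, and it can be turned into a break-even rate without quoting any other vendor's price. In the sketch below the 70/30 Chinese/English traffic mix is hypothetical, the counts are the narrative net tokens from this post, and the $5 reference rate is the Opus 4.8 input price; function names are illustrative.

```python
def weighted_tokens(traffic_share: dict[str, float],
                    net_tokens: dict[str, int]) -> float:
    """Traffic-weighted net tokens for one pass over the reference passage."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * net_tokens[lang] for lang, share in traffic_share.items())


def break_even_rate(reference_rate: float, reference_tokens: float,
                    candidate_tokens: float) -> float:
    """Input rate at which the candidate costs the same as the reference."""
    return reference_rate * reference_tokens / candidate_tokens


mix = {"zh": 0.7, "en": 0.3}            # hypothetical traffic mix
claude_counts = {"zh": 88, "en": 82}    # net tokens, narrative passage
kimi_counts = {"zh": 40, "en": 50}

claude_tokens = weighted_tokens(mix, claude_counts)
kimi_tokens = weighted_tokens(mix, kimi_counts)
rate = break_even_rate(5.00, claude_tokens, kimi_tokens)
print(f"{claude_tokens:.1f} vs {kimi_tokens:.1f} tokens; break-even ${rate:.2f}/1M")
```

On this mix the leaner tokenizer could carry roughly twice the per-token rate before the two bills meet, which is why sticker prices alone can rank models in the wrong order.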

Bottom line

Token cost is script density times tokenizer coverage. The language sets the first factor, the model sets the second, and either one read alone misleads.
Claude Fable 5, Opus 4.8, and Sonnet 5 are 1.2x to 2.3x the leanest in every language, because their per-character rate is high even on English.
The leanest model is language-specific: GPT-5.5 for European languages, Hindi and Korean, Kimi for Chinese, DeepSeek for Japanese. GLM is weakest on Hindi, at nearly one token per character.
Formal and technical register raises the multiplier in nearly every language; benchmark in the register you actually ship.
Don't localize for savings; match the model to the language on absolute tokens, then multiply by each model's rate to compare effective cost.

FAQ

Which LLM tokenizer is cheapest?
It depends on the language. Across seven models on the same aligned passages, GPT-5.5 was leanest on the European languages, Hindi and Korean (tied on English), Kimi K2.5 on Chinese, and DeepSeek-v4 on Japanese. The Claude family (Fable 5, Opus 4.8, Sonnet 5) was never the leanest, running 1.2x to 2.3x the cheapest count in every language and register.

Do Claude Fable 5, Opus 4.8, and Sonnet 5 use the same tokenizer?
Yes. All three produced identical token counts on every sample, in every language, code, and JSON. They carry the tokenizer introduced with Opus 4.7, so a count on one transfers to the others, and Fable 5's higher bill comes entirely from its per-token price.

Is Chinese more expensive than English on Claude?
Slightly: 1.17x per meaning, averaged over three passages (and roughly parity on the China-trained models). Per character it looks far worse (about 114 net tokens per 100 Chinese characters versus 32 for English), but Chinese conveys the same meaning in about a third of the characters, so the totals nearly cancel.

Do Japanese and Korean behave like Chinese?
Only halfway. They share Chinese's high per-character density, but Hangul and kana spell sounds out, so they need far more characters for the same passage (Japanese 136, Korean 143, versus Chinese's 77). The high per-character rate no longer cancels, so per meaning Japanese runs about 1.6x English on Claude and Korean about 2x; across the seven models the two languages range from 1.5x to 2.4x.

How do I measure this for my own prompts?
Send a few real prompts, in the register you actually ship, to each candidate model and read the provider's own input-token count from the usage fields, rather than trusting a local tokenizer. One friendly passage can flatter a language by about 20%, so use several. Then multiply each count by that model's input price to get effective cost on your traffic.

Claude Fable 5 for Agents: Tool-Call Refusals, Cost vs GLM 5.2

synthorai — Tue, 07 Jul 2026 05:49:17 +0000

Claude Fable 5 refused mid-tool-call on 11 of 44 coding-agent turns in our eval, on tasks as mundane as fixing a config default. The refusal arrives as stop_reason: "refusal" partway through generating the tool arguments, the truncated arguments still parse as valid JSON, and an agent loop that executes tool calls without checking the stop reason will happily write a half-finished file to disk. That behavior, not the price tag, is the first thing to engineer around when you put Fable 5 in an agent.

TL;DR

Claude Fable 5 returned stop_reason: "refusal" mid-tool-call on mundane agent tasks (a config-default fix, a meeting-room booking); the truncated write_file arguments still parsed, so a loop that doesn't check the stop reason executes half-written files.
Fable 5's thinking is adaptive, with no off switch: enabled and disabled are both rejected; the control is output_config.effort.
Fable 5's cost premium is shape-dependent: a four-turn coding task ran $0.045 vs $0.003 on glm-5.2 (15x), but only 5x sonnet-5 on warm batch work.
Fable 5 requires 30-day data retention.

Everything below was measured on 2026-07-05 through the Synthorai gateway with a small scenario harness: five agent workload shapes (a tool-using coding loop, RAG question answering, tool-heavy orchestration, batch classification, and a 15-turn conversation), run against claude-fable-5, claude-opus-4-8, claude-sonnet-5, and glm-5.2, three runs per task where variance matters. The tasks are deliberately trivial; pass rates are a sanity gate, not a capability benchmark. Costs are the gateway's billed usage.cost.

Check stop_reason before executing tool calls

This is the failure the docs don't warn you about, and it corrupts state. The agent reads app.py, decides to write the fix, and starts emitting a write_file call. Partway through the file content, the stream stops:

{
  "stop_reason": "refusal",
  "content": [{
    "type": "tool_use",
    "name": "write_file",
    "input": {
      "path": "app.py",
      "content": "DEFAULTS = {\n    \"timeout_s\": 30,\n    "
    }
  }]
}

The input object is complete, parseable JSON. Nothing about it says "I stopped early." If your loop's contract is "got tool calls, run them," you just overwrote app.py with a 38-character fragment that ends mid-dictionary and no longer parses as Python, and the next turn is a refusal too, so the loop ends with the workspace corrupted.

Three things we can say from the data:

It triggers on mundane work. The tasks that drew refusals were fixing a KeyError in a config lookup, implementing a slugify function, booking a meeting room, and creating a draft invoice. Nothing dual-use, nothing sensitive.
It is repeatable, not random. One coding task drew a refusal on all three runs, streaming and non-streaming alike. Other tasks never drew one. Across conditions, 58-75% of our trivial coding episodes passed on Fable 5 versus 100% for claude-opus-4-8, claude-sonnet-5, and glm-5.2, and every single failure traces to a refusal, not to wrong code.
Once a refusal is in the conversation, the episode is done. Follow-up turns returned stop_reason: "refusal" with empty output. Retrying within the same context did not recover.

The trigger is not the task content, and the data is blunt about it. The task that refused every run was a nine-line KeyError fix in a config dictionary, no credentials, no exploits. Meanwhile the batch scenario classified support tickets about cryptomining, leaked Stripe keys, and phishing pages without a single refusal, and the RAG scenario answered over docs full of AES-256-GCM secrets and breach-response procedures, also clean. Every refusal landed in the two multi-turn, tool-executing scenarios; the three single-shot scenarios never refused once, even carrying heavier content. The pattern is the agent-loop shape, not the words, which means sanitizing your inputs won't prevent it.

The fix is one line before your tool-execution step:

# AgentInterrupted is your own exception type; define it once alongside the loop
if response.stop_reason == "refusal":
    # do NOT execute tool calls from this turn: arguments may be truncated
    raise AgentInterrupted("model refused; restart episode or escalate")

Anthropic documents the mechanics: a refusal that fires before any output returns an empty content array and is not billed; a mid-stream refusal bills the already-streamed output, and the guidance is to discard the partial. The response also carries a stop_details object with a category (such as cyber or bio, or null) so you can tell classifier blocks from ordinary declines. What the documentation doesn't spell out is the interaction with tool use we hit above: the refusal can land inside argument generation, and the partial arguments are indistinguishable from complete ones.

There is also an official recovery path. On the Claude API, the beta fallbacks parameter (betas: ["server-side-fallback-2026-06-01"], fallbacks: [{"model": "claude-opus-4-8"}]) re-runs a declined request on a fallback model inside the same call, with the decline itself unbilled when it fired pre-output. It is not available on Amazon Bedrock, Vertex AI, or Microsoft Foundry, where the SDKs ship a client-side fallback middleware instead. Whichever path applies, the guard above still comes first: never execute tool calls from a turn whose stop reason is a refusal.
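A slightly fuller sketch of that guard, working on the response as a plain dict so it runs without any SDK. The names AgentInterrupted and tool_calls_to_run are hypothetical helpers, not part of any library, and the stop_details lookup assumes the field is shaped as described above.

```python
class AgentInterrupted(Exception):
    """Raised when a turn's tool calls must not be executed."""


def tool_calls_to_run(response: dict) -> list[dict]:
    """Return the tool_use blocks that are safe to execute for this turn."""
    if response.get("stop_reason") == "refusal":
        # Arguments may be truncated yet still parse as JSON: execute nothing.
        details = response.get("stop_details") or {}
        raise AgentInterrupted(
            f"model refused (category={details.get('category')}); restart the episode"
        )
    return [b for b in response.get("content", []) if b.get("type") == "tool_use"]


# The truncated write_file turn shown earlier in this section.
refused_turn = {
    "stop_reason": "refusal",
    "content": [{
        "type": "tool_use",
        "name": "write_file",
        "input": {"path": "app.py", "content": "DEFAULTS = {\n    \"timeout_s\": 30,\n    "},
    }],
}

try:
    tool_calls_to_run(refused_turn)
    executed = True
except AgentInterrupted:
    executed = False  # the half-written app.py never reaches disk

print(executed)
```

Raising instead of returning an empty list is a deliberate choice here: follow-up turns in the same context kept refusing in our runs, so the caller should restart the episode or escalate, not continue the loop.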

What five agent shapes cost

Median cost per completed unit (task, query, item, or conversation), same prompts, same day:

Scenario	fable-5	opus-4-8	sonnet-5	glm-5.2
Coding loop (per task, 4 turns median)	$0.045	$0.012	$0.0059	$0.0031
RAG answer (per query)	$0.024	$0.0075	$0.0036	$0.0031
Tool orchestration (per task)	$0.048	$0.011	$0.0045	$0.0027
Batch classification (per item, warm)	$0.0024	$0.0012	$0.00046	$0.00057
15-turn conversation (whole)	$0.94	$0.34	$0.26	$0.083

Two readings of that table matter more than any single cell:

The cheapest model changes with the shape. glm-5.2 wins the loops and the long conversation, but claude-sonnet-5 is the cheapest batch classifier in the set, under glm-5.2, because its introductory price rides on a 97% cache-read share once the scaffold prompt is warm.
Fable 5's premium is shape-dependent too: 15x glm-5.2 on the coding loop, 11x on conversation, but only 5x sonnet-5 on warm batch items, where caching absorbs most of the prompt.
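Both readings can be recomputed from the table. The sketch below copies the medians into a dict and derives the cheapest model and the Fable 5 premium per scenario; the helper names are ours, for illustration only.

```python
# Median cost per completed unit in USD, copied from the table above.
costs = {
    "coding loop": {"fable-5": 0.045, "opus-4-8": 0.012, "sonnet-5": 0.0059, "glm-5.2": 0.0031},
    "rag answer": {"fable-5": 0.024, "opus-4-8": 0.0075, "sonnet-5": 0.0036, "glm-5.2": 0.0031},
    "tool orchestration": {"fable-5": 0.048, "opus-4-8": 0.011, "sonnet-5": 0.0045, "glm-5.2": 0.0027},
    "batch classification": {"fable-5": 0.0024, "opus-4-8": 0.0012, "sonnet-5": 0.00046, "glm-5.2": 0.00057},
    "15-turn conversation": {"fable-5": 0.94, "opus-4-8": 0.34, "sonnet-5": 0.26, "glm-5.2": 0.083},
}


def cheapest_model(scenario: str) -> str:
    """Model with the lowest median cost for this scenario."""
    row = costs[scenario]
    return min(row, key=row.get)


def premium(scenario: str, model: str, baseline: str) -> float:
    """How many times more `model` costs than `baseline` on this scenario."""
    return costs[scenario][model] / costs[scenario][baseline]


for scenario in costs:
    best = cheapest_model(scenario)
    print(f"{scenario:22s} cheapest={best:9s} fable-5 is {premium(scenario, 'fable-5', best):.1f}x")
```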

The rest of the cost picture is about controlling those numbers, and then about two things that quietly push them back up.

Keeping an agent's bill down

Caching is the single biggest lever, and on Fable 5 its contract is unchanged. The agent data shows what it is worth: with cache_control markers removed, the same coding task cost 2.0x more, and warm batch items 6.8x more. On opus-4-8 the same ablation reads 3.8x and 6.9x. In a loop, the sliding-marker pattern is not an optimization; it is the difference between a viable bill and an unviable one.

Prompt order is the second lever, and it held up across every model we ran. Putting stable rules before per-query context (instead of after it) made RAG queries 26-37% cheaper on all four models; on the Claude line, the wrong order additionally pays the 1.25x cache-write premium on every call. The mechanics are in the LangChain caching post; the numbers here just confirm they apply unchanged to Fable 5.

Fable 5 also adds two levers of its own. The first is that the cache-eligibility floor dropped to 2,048 tokens, half of Opus 4.8's 4,096. That reads like trivia until you remember where an agent's savings come from: the repeated scaffold (system prompt, tool definitions, the sliding conversation prefix) is what caches, and only if it clears the floor. A tool-heavy agent whose per-turn prefix sat between 2,048 and 4,096 tokens got no caching at all on Opus 4.8, and starts caching on Fable 5, turning a full-price prefix into a roughly 10%-of-price cache read on every subsequent turn. It cuts the other way too: a prefix padded to clear the old 4,096 floor may now be carrying dead weight. Read cache_read_input_tokens off a live response rather than assuming, because on Fable 5 the discount begins earlier.
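A small sketch of both checks, assuming the floors quoted in the paragraph above and an inclusive comparison (the exact boundary behavior is an assumption). The function names and the 3,000-token prefix are hypothetical; cache_read_input_tokens is the usage field named above.

```python
# Floors as quoted in the paragraph above; confirm against a live response.
CACHE_FLOOR_TOKENS = {"claude-opus-4-8": 4096, "claude-fable-5": 2048}


def clears_cache_floor(model: str, prefix_tokens: int) -> bool:
    """True if the stable prefix is long enough to be cache-eligible."""
    return prefix_tokens >= CACHE_FLOOR_TOKENS[model]


def cache_hit(usage: dict) -> bool:
    """True if the response's usage block reports a cached prefix being read."""
    return (usage.get("cache_read_input_tokens") or 0) > 0


prefix = 3000  # hypothetical per-turn scaffold size, in tokens
print(clears_cache_floor("claude-opus-4-8", prefix))  # False
print(clears_cache_floor("claude-fable-5", prefix))   # True
```

The first function is only a planning aid; cache_hit on a real response is the check that counts.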

The second is task budgets (beta, header task-budgets-2026-03-13), which address the exact problem this comparison keeps surfacing: a Fable 5 loop runs up a bill fast, and max_tokens does not help. max_tokens is a hard per-response cap the model cannot see, so the model plans as if it has unlimited room, then gets cut off mid-thought. A task budget is different. You give the loop a token ceiling (minimum 20,000) that the model sees as a running countdown and paces itself against, wrapping up gracefully instead of being truncated. It counts what the model generates plus the tool results it reads on the turn, not the full history you resend each request. On a model whose coding-loop turn costs 15x glm-5.2, a budget the model self-moderates toward is the cheapest guardrail you can bolt on.

Two cost surprises the docs won't flag

With the levers set, two smaller things still moved our bill in directions the documentation doesn't warn you about.

"Low" effort was not cheaper. Fable 5's thinking depth is controlled by output_config.effort, and the intuition is that low costs less. It didn't. Setting effort: "low" ran our coding loop at $0.0478 per task versus $0.0451 at the default, with more output tokens, not fewer. We saw the same pattern on GLM 5.2, where the effort names don't track token counts either. On both model lines, measure the knob on your own workload before assuming "low" means "less." One reason the number is hard to predict: adaptive thinking's share of output tokens swung from 2% on the coding loop to 30% on RAG answers to 52% on batch classification, same model, same day. Budget output tokens per workload shape, not per model.

Never replay reasoning_content. On OpenAI-compatible models, the reasoning field is not conversation history. DeepSeek's API requires stripping it; on GLM 5.2 replaying it is legal but billed. Feeding it back into the message history inflated our GLM loop cost by about 28% until we stripped it. Anthropic's own thinking blocks are different: on the same model you must replay them unchanged, but a Fable 5 thinking block routed to a different model (say, on a fallback to Opus) is dropped from the prompt automatically and not billed, so there's nothing to strip.
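For the OpenAI-compatible side, the strip is a one-liner over the message history. This is a minimal sketch on plain message dicts; strip_reasoning is a hypothetical helper name. Do not apply it to Anthropic thinking blocks, which per the paragraph above must be replayed unchanged on the same model.

```python
def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with reasoning_content removed from every message."""
    return [
        {key: value for key, value in message.items() if key != "reasoning_content"}
        for message in messages
    ]


history = [
    {"role": "user", "content": "Fix the KeyError in config.py"},
    {
        "role": "assistant",
        "content": "Patched the lookup to use .get().",
        "reasoning_content": "The key is missing when the env var is unset...",
    },
]

clean = strip_reasoning(history)
print(clean[1])
```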

What changed in the request surface

Fable 5 shares most of its request surface with Opus 4.7/4.8 and Sonnet 5. What's gone, per the docs:

thinking: {type: "enabled", budget_tokens: N} returns a 400. Extended thinking with a token budget, the mechanism from Claude 3.7 Sonnet through the 4.5 family, is retired across the 4.7+ line in favor of adaptive thinking.
thinking: {type: "disabled"} returns a 400, and this one is unique to Fable 5. Opus 4.7/4.8 and Sonnet 5 still let you switch thinking off; Fable 5 does not.
temperature, top_p, and top_k are rejected at any non-default value.
Assistant-message prefills (a trailing assistant turn) return a 400.

The temperature/top_p/top_k and prefill removals are the two that most often bite a ported request; the thinking and retention changes are covered above and in the retention post.
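One way to keep a ported request from tripping those 400s is to normalize it before sending. The sketch below operates on a plain request dict; port_to_fable5 is a hypothetical helper. Dropping a prefill silently changes what the model is asked to do, so treat that branch as a prompt to rewrite the request, not a real fix.

```python
REJECTED_SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def port_to_fable5(request: dict) -> dict:
    """Return a copy of the request without the fields listed above as rejected."""
    ported = {k: v for k, v in request.items() if k not in REJECTED_SAMPLING_KEYS}

    thinking = ported.get("thinking") or {}
    if thinking.get("type") in ("enabled", "disabled"):
        del ported["thinking"]  # both forms return a 400 on Fable 5

    messages = list(ported.get("messages", []))
    if messages and messages[-1].get("role") == "assistant":
        messages.pop()  # a trailing assistant turn is a prefill; it would be rejected
    ported["messages"] = messages
    return ported


old_request = {
    "model": "claude-fable-5",
    "temperature": 0.2,
    "thinking": {"type": "disabled"},
    "messages": [
        {"role": "user", "content": "Return the config as JSON."},
        {"role": "assistant", "content": "{"},
    ],
}

new_request = port_to_fable5(old_request)
print(sorted(new_request))
```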

Bottom line

Fable 5 in an agent is an engineering problem before it is a budget problem. Handle stop_reason: "refusal" before executing tool calls, or a truncated write will corrupt state on a task as boring as a config fix. Then treat the cost as something you shape: caching is the biggest lever, the eligibility floor is now 2,048 tokens so re-check your prefix, a task budget keeps a loop from running up the highest per-turn bill in this comparison, and effort: "low" is not the discount its name implies. Budget by workload shape, too: the same model is 15x the cost of glm-5.2 on a coding loop and 5x sonnet-5 on warm batch work. None of this says use it or don't; it says the defaults are not neutral, and the bill and the failure modes both depend on how your agent is shaped.

FAQ

Does Fable 5 refuse tool calls often?
It concentrated on specific tasks: one config-fix task refused on every run, others never did, and the same tasks reproduced under both streaming and non-streaming calls. So it is not a rare flake you can retry past. Rates on your workload will differ; the engineering answer is the same either way: check stop_reason before executing tool calls.

Can I turn Fable 5's thinking off?
No. Both thinking: {type: "disabled"} and thinking: {type: "enabled"} are rejected. Thinking is adaptive by default and output_config.effort is the only control; in our loop, low effort did not reduce cost.

Is Fable 5 ever the cheap option?
Not in this set. Its smallest premium is on warm cache-heavy batch work, about 5x sonnet-5, where a warm cache absorbs most of the prompt. On the loops and the long conversation it is the most expensive model we ran.

Verification: all figures measured 2026-07-05 against https://synthorai.io/ (Anthropic-native /v1/messages for the Claude line, /v1/chat/completions for glm-5.2), 505 episodes and 1,022 calls across five scenario shapes, three runs per task where variance matters. Costs are the gateway-reported usage.cost; medians shown. Tasks are deliberately simple, so pass rates are a sanity gate, not a capability benchmark; we don't publish capability claims we haven't measured. Refusal behavior reproduced in both streaming and non-streaming modes. Your numbers will vary with prompts, region, and load.

LLM Prompt Caching #5: LangChain Setups That Actually Hit

synthorai — Sat, 04 Jul 2026 09:41:35 +0000

Here is a LangChain system prompt that looks perfectly reasonable and caches nothing:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", BIG_STABLE_SYSTEM_PROMPT),   # the syntax every tutorial uses
    ("human", "{question}"),
])

We ran this against claude-sonnet-5 twice with an identical 1,800-token system prompt and read the usage fields back. Both calls: cache writes 0, cache reads 0. Not a partial hit, not a fragmented cache. Nothing. The reason: Anthropic only caches what you mark with cache_control, and a plain string in a ("system", ...) tuple has nowhere to put the marker. The most convenient syntax in LangChain is also the one that leaves the entire discount on the table, and no error tells you.

This is part 5 of the caching series. Part 1 covers how prefix caching works, part 3 does the raw-SDK tutorial. This part is about what changes when LangChain assembles your prompts for you. Everything below was measured on 2026-07-04 through the Synthorai gateway with langchain-core 1.4.8, langchain-anthropic 1.4.8, and langchain-openai 1.3.3.

First, which "caching" are you looking for?

Two unrelated features share the word, and the LangChain docs page you land on when you search is usually the wrong one.

	Response caching (LangChain's `InMemoryCache`)	Prompt caching (this series)
What it stores	The whole completion, in your app	The prompt prefix's KV state, on the provider
When it saves money	The exact same request repeats	Different requests share a prefix
Where	`set_llm_cache(InMemoryCache())`, SQLite, Redis	`cache_control` markers or automatic prefix matching
Agent loops, RAG, chat	Almost useless (every request differs)	The main lever, since system + tools repeat every turn

And "exact same request" means exact: the built-ins key on the pair (serialized prompt, model-config string). Measured: an identical repeat returned in 0 ms with no API call; adding one space to the prompt missed; the same prompt with max_tokens changed by one also missed. (The cached replay also returns the original call's usage numbers, so naive token accounting double-counts.) Semantic caches exist as third-party integrations; the built-ins are exact-match only.

So set_llm_cache is fine for deduplicating identical calls in tests; the 2,000-token system prompt you re-send on every agent turn is prompt caching's job, and it needs the prompt assembled the right way.
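To see why exact-match keying is nearly useless in a loop, here is a toy stand-in for a response cache keyed on the pair (prompt, serialized config). It is not LangChain's implementation, only a sketch of the same keying idea; ExactMatchCache and its methods are hypothetical names.

```python
import json


class ExactMatchCache:
    """Toy response cache keyed on (prompt, serialized model config)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str, config: dict) -> tuple:
        return (prompt, json.dumps(config, sort_keys=True))

    def lookup(self, prompt: str, config: dict):
        return self._store.get(self._key(prompt, config))

    def update(self, prompt: str, config: dict, completion: str) -> None:
        self._store[self._key(prompt, config)] = completion


cache = ExactMatchCache()
cfg = {"model": "claude-sonnet-5", "max_tokens": 256}
cache.update("Summarize the ticket.", cfg, "cached answer")

hit = cache.lookup("Summarize the ticket.", cfg)
miss_space = cache.lookup("Summarize the ticket. ", cfg)  # one extra space
miss_config = cache.lookup("Summarize the ticket.", {**cfg, "max_tokens": 257})

print(hit, miss_space, miss_config)
```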

The fix: content blocks, not strings

cache_control travels inside a content block, so the system message has to be a SystemMessage with block content rather than a bare string:

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate

llm = ChatAnthropic(
    model="claude-sonnet-5",
    base_url="https://synthorai.io",   # any Anthropic-compatible endpoint
)

prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content=[{
        "type": "text",
        "text": BIG_STABLE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},   # a bare string has nowhere to put this
    }]),
    ("human", "{question}"),
])
chain = prompt | llm

Same 1,800-token system prompt, measured through the same gateway:

Call	String-tuple syntax	Content-block syntax
1st (cold)	write 0 / read 0	write 1,875 / read 0
2nd, different question	write 0 / read 0	write 0 / read 1,875

A warm read bills at roughly 10% of the input price, so on Claude this one structural change is the difference between paying full price forever and a 90% discount on the stable half of every call. The economics are in part 1; the marker mechanics match the raw SDK usage in the LangChain Anthropic integration docs and Anthropic's prompt caching guide.
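The arithmetic behind that claim, as a sketch. It assumes the multipliers quoted in this series (a cold write at about 1.25x the input price, a warm read at about 10%) and that every call after the first lands inside the cache TTL; both are simplifications, and the function name is ours.

```python
WRITE_MULTIPLIER = 1.25  # cache-write premium quoted in this series
READ_MULTIPLIER = 0.10   # warm read at roughly 10% of the input price


def prefix_cost_in_token_equivalents(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Billable weight of the stable prefix over `calls` requests, in full-price tokens."""
    if not cached:
        return float(prefix_tokens * calls)
    # One cold write, then warm reads on every later call.
    return prefix_tokens * (WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1))


uncached = prefix_cost_in_token_equivalents(1875, 100, cached=False)
cached = prefix_cost_in_token_equivalents(1875, 100, cached=True)
print(f"uncached={uncached:.0f}  cached={cached:.2f}  saving={1 - cached / uncached:.1%}")
```

With a single call the cached path is the more expensive one, since it pays the write premium and never reads; the discount only appears once the prefix is reused.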

Where your template variables go decides your hit rate

LangChain templates make it effortless to interpolate variables anywhere, which is exactly the hazard. The cache key is the byte-exact prefix. We put a date inside the cached block and measured:

SystemMessage(content=[{
    "type": "text",
    "text": f"Today is {today}. " + BIG_STABLE_SYSTEM_PROMPT,   # variable INSIDE the block
    "cache_control": {"type": "ephemeral"},
}])

Call	Result
day A, question 1	write 1,865 (cold for this value)
day A, question 2	read 1,865 (same value, hit)
day B, question 1	write 1,865 (new value, cold again)

The cache did not break. It got keyed on the variable. A value that repeats, like a date, costs one cache write per value and hits after that. A value that is unique per call, like a timestamp or a request ID, makes every call a cold write and the hit rate exactly zero.
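The keying behavior is easy to reproduce with a toy model: treat the cache as a set of exact prefixes and classify each call as a cold write or a warm read. This ignores TTLs and token minimums and is only a sketch of the matching rule, not of any provider's cache.

```python
def simulate_prefix_cache(prefixes: list[str]) -> list[str]:
    """Label each call 'write' (prefix not seen before) or 'read' (exact repeat)."""
    seen = set()
    outcomes = []
    for prefix in prefixes:
        outcomes.append("read" if prefix in seen else "write")
        seen.add(prefix)
    return outcomes


RULES = "You are a support assistant. <long stable rules>"

# A value that repeats (a date) versus one that is unique per call (a request ID).
dated = [f"Today is {day}. {RULES}" for day in ("2026-07-04", "2026-07-04", "2026-07-05")]
stamped = [f"Request {i}. {RULES}" for i in range(3)]

print(simulate_prefix_cache(dated))    # ['write', 'read', 'write']
print(simulate_prefix_cache(stamped))  # ['write', 'write', 'write']
```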

The expensive real-world version of this mistake is RAG. The template many chains reach for puts the retrieved context at the top of the system prompt, before the static instructions. We measured both orders with an 800-token retrieved context that changes per query and a marked system block:

Order inside the prompt	Call 1	Call 2 (new query, new context)
Context first, then rules	write 3,133	write 3,133 again, read 0
Rules first (marked), context in the human turn	write 1,852	read 1,852

The wrong row is not merely "no discount": every call pays the cache-write premium, about 1.25× the normal input price, on the full 3,133 tokens, and nothing is ever read back. A mis-ordered RAG prompt with caching enabled costs more than not caching at all. The fixed content sits after the variable content, so it might as well not exist.

The rule that falls out of the measurements:

Static text first, inside the marked block. System rules, tool definitions, few-shot examples.
Anything that varies goes after the marker, ideally in the human turn: retrieved context, dates, user questions.
A variable inside the block is acceptable only if it repeats often enough to amortize its own cache write.
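Applied to the RAG case, the rule looks roughly like this in raw Anthropic request form. build_rag_request is a hypothetical helper and the way context and question are joined is an arbitrary choice; the point is only that the marked system block is byte-identical across queries.

```python
def build_rag_request(static_rules: str, retrieved_context: str, question: str) -> dict:
    """Static rules in the marked system block; everything that varies in the user turn."""
    return {
        "system": [{
            "type": "text",
            "text": static_rules,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}",
        }],
    }


rules = "Answer only from the provided context. Cite the section you used."
first = build_rag_request(rules, "Doc A: rotation policy ...", "How often do keys rotate?")
second = build_rag_request(rules, "Doc B: breach response ...", "Who is paged first?")

# The cached prefix is the same on both calls; only the user turn differs.
print(first["system"] == second["system"], first["messages"] == second["messages"])
```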

Tool definitions get cached too

An agent re-sends its tool schemas on every call, and in Anthropic's request layout the tools sit before the system prompt. Since a marker means "cache everything from the start of the request up to here," that raises two practical questions. Does a marker on the system block also cover the tools in front of it? And does LangChain's bind_tools turn your tools into the exact same bytes on every call? If the serialization wobbles, the prefix changes and every call misses.

Measured answers to both. With the same marked system prompt, the warm cache read was 1,861 tokens without tools and 2,389 tokens with two tools bound: the extra 528 tokens are the tool schemas coming back out of the cache. And that 2,389 repeated exactly across three consecutive calls, which means bind_tools serializes the same way every time; the framework does not leak noise into the prefix. So to be explicit: as long as the system block carries the marker, the tools themselves need no cache_control; that one marker behind them does all the work.

The opposite arrangement exists for one specific shape: the tools are the biggest stable thing you send and the system prompt is thin or absent. Then the request still needs a marker somewhere, and it can live on a tool. This only works with the raw Anthropic-format dict, because a @tool-decorated function has no field to carry it; bind_tools passes the dict through untouched:

# variant: NO marked system block anywhere; the tool carries the request's only marker
llm.bind_tools([{
    "name": "get_weather",
    "description": LONG_TOOL_DESCRIPTION,
    "input_schema": {...},
    "cache_control": {"type": "ephemeral"},   # passes through bind_tools verbatim
}])

Measured: write 3,002 cold, read 3,002 warm, with no marked system message in the request.

Multi-turn: move the marker to the last message

A conversation looks like another ordering problem, but it is the opposite case: the order is already perfect, because history only ever appends, so the whole transcript is stable prefix. The problem here is coverage. A marker on the system block caches the system block and nothing after it: as the history grows, the warm read stays flat at the system size while every accumulated turn bills as ordinary input.

The pattern that fixes it is the same one the raw SDK uses: put the marker on the latest message, so the breakpoint slides forward and the whole conversation-so-far becomes the cached prefix:

from langchain_core.messages import HumanMessage

def marked(text):
    return HumanMessage(content=[{
        "type": "text", "text": text,
        "cache_control": {"type": "ephemeral"},
    }])

# each turn: history stays plain, only the newest human message carries the marker
llm.invoke([system, *history, marked(new_question)])

Measured across two turns: turn 1 wrote 1,864; turn 2 read 1,864 and wrote only the 15-token delta (the previous answer plus the new question), so the prior prefix billed at the ≈10% read rate. That is the shape an agent loop wants, and LangChain expresses it with an ordinary message list. Anthropic allows up to four markers per request, so the sliding marker composes with a fixed marker on the system block or on the tools.
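The same sliding pattern can be sketched on raw Anthropic-format message dicts, without LangChain. with_sliding_marker is a hypothetical helper: it clears any marker left on earlier turns and marks the last block of the newest message. It assumes every message has non-empty content.

```python
import copy


def with_sliding_marker(messages: list[dict]) -> list[dict]:
    """Copy the history, clear old markers, and mark the newest message's last block."""
    out = copy.deepcopy(messages)
    for msg in out:
        if isinstance(msg["content"], list):
            for block in msg["content"]:
                block.pop("cache_control", None)
    last = out[-1]
    if isinstance(last["content"], str):
        # A bare string has nowhere to carry the marker, so wrap it in a block.
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out


history = [
    {"role": "user", "content": [
        {"type": "text", "text": "q1", "cache_control": {"type": "ephemeral"}},
    ]},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
]

marked_history = with_sliding_marker(history)
print(marked_history[-1]["content"])
```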

Read the meters, and know their names

LangChain standardizes usage into usage_metadata, and here is the gotcha we hit: with langchain-anthropic 1.4.8, on every response in our runs, the standard input_token_details.cache_creation field stayed 0 even when a cache write happened. The real write count lands in a nonstandard key:

r = chain.invoke({"question": "..."})
det = r.usage_metadata["input_token_details"]
det["cache_read"]                  # correct on hits (1875 above)
det["cache_creation"]              # 0 even on a cold write; do not alert on this
det["ephemeral_5m_input_tokens"]   # the actual write count (1875)

The provider reported the write correctly (cache_creation_input_tokens: 1875 in the raw response, visible via r.response_metadata["usage"]); the standardized mapping just files it under the TTL-bucket key. A cost dashboard watching cache_creation will tell you caching is free while the write premium quietly accrues. Trust the raw usage object or know the bucket keys. This is the same class of problem as gateways mis-reporting cache fields, which we audit in Does Your LLM Gateway Lie About Cache?.
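If you need a write count a dashboard can trust, one option is to read both places and take the larger. The sketch below works on the input_token_details dict. Only ephemeral_5m_input_tokens was observed in our runs; matching every ephemeral_*_input_tokens key is an assumption about how other TTL buckets would be named, and cache_write_tokens is a hypothetical helper.

```python
def cache_write_tokens(input_token_details: dict) -> int:
    """Best-effort cache-write count from LangChain's input_token_details."""
    bucketed = sum(
        (value or 0)
        for key, value in input_token_details.items()
        if key.startswith("ephemeral_") and key.endswith("_input_tokens")
    )
    standard = input_token_details.get("cache_creation") or 0
    # Whichever field the mapping populated, report the real write.
    return max(bucketed, standard)


cold = {"cache_read": 0, "cache_creation": 0, "ephemeral_5m_input_tokens": 1875}
warm = {"cache_read": 1875, "cache_creation": 0}

print(cache_write_tokens(cold), cache_write_tokens(warm))
```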

Implicit caches: mis-ordering fails silently, so watch it hardest here

Claude's cache is explicit. GPT and most open-weight providers cache automatically on prefix match, no markers, and through LangChain the same chain works after one constructor change:

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="glm-5.2", base_url="https://synthorai.io/v1")

Plain string system prompt, no markers: GLM 5.2's second call read 1,088 tokens of the roughly 1,850-token prefix. (Not all of it: automatic caches match in coarse block increments rather than byte-for-byte to the end; OpenAI, for instance, documents 128-token granularity.) So far, free money. But the mis-ordering hazard from the RAG table above applies with full force here, and with a nastier failure mode. We reran the same order experiment on the automatic path, new retrieved context on every call:

Order (no markers, automatic cache)	Call 1	Call 2 (new query, new context)
Context first, then rules	read 0	read 0
Rules first, context in the human turn	read 0	read 1,088

The wrong order is an unconditional zero: the changing context sits at the front, no two calls share a prefix, and the discount never arrives. On the explicit path the same mistake at least shows up in the bill as a cache-write premium on every call; on the implicit path there is no premium, no error, and no signal. The prompt just quietly never qualifies, while you assume "automatic" means "working." And since there is no marker to place, prompt order is the only knob the implicit path gives you.

So verify from the meters, in production rather than once in a test: input_token_details.cache_read (LangChain) or prompt_tokens_details.cached_tokens (raw). OpenAI's automatic caching additionally documents a 1,024-token minimum prefix, and per-provider TTL and eligibility differ, which is part 2's territory.
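On the raw OpenAI-compatible path, a production check can be as small as the share of prompt tokens served from cache. A minimal sketch, assuming the usage object has been converted to a plain dict; cached_share is a hypothetical helper and the sample numbers are illustrative, loosely based on the GLM 5.2 read above.

```python
def cached_share(usage: dict) -> float:
    """Fraction of prompt tokens that were read from the provider's cache."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


warm_call = {"prompt_tokens": 1850, "prompt_tokens_details": {"cached_tokens": 1088}}
misordered_call = {"prompt_tokens": 2650, "prompt_tokens_details": {"cached_tokens": 0}}

print(f"{cached_share(warm_call):.0%}  {cached_share(misordered_call):.0%}")
```

Tracking this per route over time is what catches the silent failure: a share that sits at zero on a prompt you believed was cache-friendly.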

Checklist

On Claude, a ("system", "...") string tuple has nowhere to carry cache_control: nothing gets cached, nothing warns you. Cacheable system prompts go in a SystemMessage with content blocks and the marker.
The cache key is the byte-exact prefix: static content first, variables after the marker or in the human turn. RAG context before the rules does not just miss; it pays the write premium every call.
A variable inside the cached block creates one cache entry per value: repeating values amortize, per-call-unique values (timestamps, request IDs) never hit.
Tools sit before the system prompt in the prefix, so the system marker caches bound tools too (bind_tools serializes deterministically). If tools are your biggest stable block, the marker can go on an Anthropic-format tool dict instead.
In conversations, a marker fixed on the system block leaves the growing history at full price; put it on the latest message so each turn reads the prior prefix and writes only the delta.
Do not monitor input_token_details.cache_creation: it stays 0 even on writes, so a dashboard concludes caching is free while write premiums accrue. The real count is in ephemeral_5m_input_tokens, or read the raw response_metadata["usage"].
On automatic-cache models (GPT, GLM, DeepSeek), prompt order is the only knob and mis-ordering fails silently: no premium, no error, just a discount that never arrives. Verify hits from the usage fields.
set_llm_cache stores whole responses keyed on the exact prompt and model config; it only pays off when identical requests repeat, never on an agent loop.

The habits are small: a content block instead of a string, static before variable, a marker that slides with the conversation, one usage field read correctly. The measured difference was a 90% discount on every stable token versus nothing, and in the mis-ordered RAG case, versus paying extra. LangChain does not get in the way of prompt caching; it just makes the wrong prompt shape as easy to write as the right one.

Disclaimer

Measured on 2026-07-04 against https://synthorai.io/ with langchain-core 1.4.8, langchain-anthropic 1.4.8, langchain-openai 1.3.3, models claude-sonnet-5 and glm-5.2, a roughly 1,800-token English system prefix, small samples, and a 1–2 second gap between consecutive calls so cache writes have time to land. Each experiment used a fresh randomized prefix to guarantee a cold cache, which is why the baseline token counts differ slightly between tables (1,852 to 1,875). Library field mappings and provider cache behavior change between versions; re-measure against your own stack before depending on the numbers.

language	chars	fable-5 / opus-4-8 / sonnet-5	deepseek-v4	glm-5.2	gpt-5.5	kimi-k2.5
en	254	90	55	63	57	60
zh	77	96	50	58	69	50
ja	136	136	101	116	114	129
ko	143	160	104	123	93	129
hi	196	147	124	192	76	133
de	289	146	92	92	75	104
fr	259	111	76	79	66	93
es	253	112	75	79	66	91
it	272	127	84	91	78	100
