- Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A team I talked to last month was extracting a sentiment label from a product review. One word out of three: positive, negative, neutral. They had wired the call through response_format={"type": "json_schema", ...} because the next step in the pipeline was a Pydantic model and JSON felt safer. Their per-call cost was four times what the same prompt cost in free-form mode.
Four times. For a one-word answer.
The model was not slower. The model was not dumber. The model was generating {"sentiment": "positive"} instead of positive. Eight characters of payload had become twenty-five, and the billing meter does not care that seventeen of those characters were braces, quotes, and a field name the caller already knew.
This is the tax that gets averaged away in vendor blog posts. It's real and measurable, and on small payloads it dominates.
The three shapes
There are three ways to ask a modern LLM for structured information, and they bill very differently.
- Free-form text. You prompt for the answer, parse the string yourself. Cheapest output. No guarantees.
- JSON mode / response_format. The model is constrained to emit valid JSON. On OpenAI you can attach a schema; on Anthropic you do this with a system prompt that says "respond only with JSON" plus a prefill of {.
- Tool calls. You declare a function with a JSON schema. The model emits a tool_use block with arguments matching the schema. This is the strictest shape.
The output token count climbs in that order. So does the safety. Picking the right shape is a trade-off; the default is not free.
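To make that concrete before we get to code, here is roughly what each shape hands back for the same one-word answer; the tool-call example is the Anthropic flavour, with the id elided.

# Roughly what the three shapes return for the same review (tool id elided).
free_form = "negative"
json_mode = '{"sentiment": "negative"}'
tool_call = {
    "type": "tool_use",
    "id": "toolu_…",
    "name": "classify_sentiment",
    "input": {"sentiment": "negative"},
}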
The same task, three ways
Here is the script. It runs the same classification request through OpenAI's gpt-4o-mini and Anthropic's claude-3-5-haiku-latest in all three shapes, then prints the output token count and the dollar cost per shape.
import os
from anthropic import Anthropic
from openai import OpenAI
REVIEW = (
"The battery dies in four hours and the keyboard "
"rattles when you type. Returning it tomorrow."
)
oa = OpenAI()
an = Anthropic()
The free-form version. One word back, no scaffolding.
def openai_freeform():
r = oa.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system",
"content": "Reply with one word: "
"positive, negative, or neutral."},
{"role": "user", "content": REVIEW},
],
max_tokens=10,
)
return r.choices[0].message.content, r.usage
The JSON-mode version. Same prompt, plus a schema. The model now has to emit braces, a quoted key, and a quoted value.
def openai_json():
r = oa.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system",
"content": "Classify the sentiment."},
{"role": "user", "content": REVIEW},
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "sentiment",
"strict": True,
"schema": {
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": [
"positive",
"negative",
"neutral",
],
},
},
"required": ["sentiment"],
"additionalProperties": False,
},
},
},
max_tokens=30,
)
return r.choices[0].message.content, r.usage
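Downstream, that string still needs a json.loads plus validation. A sketch of the boundary, with a hypothetical Pydantic model standing in for whatever your pipeline actually uses:

import json
from pydantic import BaseModel

class SentimentResult(BaseModel):  # hypothetical stand-in for the pipeline's model
    sentiment: str

raw, usage = openai_json()
result = SentimentResult.model_validate(json.loads(raw))
print(result.sentiment, usage.completion_tokens)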
The tool-call version. Strictest. Output is wrapped in a tool-call envelope that includes a tool name, an id, and the argument object.
def openai_tool():
r = oa.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": REVIEW},
],
tools=[{
"type": "function",
"function": {
"name": "classify_sentiment",
"parameters": {
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": [
"positive",
"negative",
"neutral",
],
},
},
"required": ["sentiment"],
},
},
}],
tool_choice={
"type": "function",
"function": {"name": "classify_sentiment"},
},
max_tokens=30,
)
return r.choices[0].message.tool_calls, r.usage
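The arguments come back as a JSON string nested inside the tool-call object, so the caller still parses:

import json

calls, usage = openai_tool()
args = json.loads(calls[0].function.arguments)  # e.g. {"sentiment": "negative"}
print(args["sentiment"], usage.completion_tokens)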
The Anthropic equivalents follow the same idea. Free-form is plain text. JSON mode is a system prompt asking for JSON only, with a { prefill nudging the model into the shape. Tool use is a real tools declaration.
def anthropic_freeform():
r = an.messages.create(
model="claude-3-5-haiku-latest",
max_tokens=10,
system="Reply with one word: positive, "
"negative, or neutral.",
messages=[{"role": "user", "content": REVIEW}],
)
return r.content[0].text, r.usage
def anthropic_json():
r = an.messages.create(
model="claude-3-5-haiku-latest",
max_tokens=40,
system=(
"Classify the sentiment. Respond with only "
'JSON of the form {"sentiment": "..."}. '
"No prose, no code fences."
),
messages=[
{"role": "user", "content": REVIEW},
{"role": "assistant", "content": "{"},
],
)
    # The prefilled "{" is not echoed back in the response, so prepend it
    # before handing the string to json.loads.
    return "{" + r.content[0].text, r.usage
def anthropic_tool():
r = an.messages.create(
model="claude-3-5-haiku-latest",
max_tokens=200,
tools=[{
"name": "classify_sentiment",
"description": "Classify a product review.",
"input_schema": {
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": [
"positive",
"negative",
"neutral",
],
},
},
"required": ["sentiment"],
},
}],
tool_choice={
"type": "tool",
"name": "classify_sentiment",
},
messages=[{"role": "user", "content": REVIEW}],
)
return r.content, r.usage
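A small driver ties the six functions together and prints the comparison. The per-million-token output prices below are placeholders that were roughly right at the time of writing; check the current rate cards before trusting the dollar column.

# Output prices in dollars per million tokens. Placeholders; replace with
# the provider's current rate card before trusting the dollar figures.
PRICE_PER_M_OUTPUT = {
    "gpt-4o-mini": 0.60,
    "claude-3-5-haiku-latest": 4.00,
}

def report(label, model, fn):
    _, usage = fn()
    # OpenAI reports completion_tokens; Anthropic reports output_tokens.
    out = getattr(usage, "completion_tokens", None) or usage.output_tokens
    cost = out / 1_000_000 * PRICE_PER_M_OUTPUT[model]
    print(f"{label:>22}: {out:>4} output tokens  ${cost:.8f}")

report("openai free-form", "gpt-4o-mini", openai_freeform)
report("openai json", "gpt-4o-mini", openai_json)
report("openai tool", "gpt-4o-mini", openai_tool)
report("anthropic free-form", "claude-3-5-haiku-latest", anthropic_freeform)
report("anthropic json", "claude-3-5-haiku-latest", anthropic_json)
report("anthropic tool", "claude-3-5-haiku-latest", anthropic_tool)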
What the numbers actually look like
I am not going to invent exact token counts here. Your run will vary by tokenizer revision, model version, and the whitespace the model picks. What I can tell you is the shape, which is consistent across the runs you will see in vendor blog posts and in your own quick test against the script above.
For a one-word answer to a one-line input, expect roughly the following order of magnitude:
- Free-form output is typically 1–2 tokens. Just positive or negative.
- JSON mode output lands around 8–12 tokens. The braces, the quotes, the field name, the value. On some models it adds a newline and indentation on top.
- Tool-call output is around 15–25 output tokens, plus the system-prompt overhead the provider injects to enable tool use. That overhead lands on your input bill. Anthropic's tool-use docs explain that the tool schema is sent on every call, so you pay for the declaration on the input side of every request, on top of the tool_use block on the output side (docs).
Your numbers will vary by model version and tokenizer revision, so treat the ranges as order-of-magnitude rather than measured benchmarks.
So the overhead on a tiny extraction task is 5–10×, not 10–30%. The 10–30% number you see in vendor blog posts is the average across realistic payloads: a 200-token answer where the JSON wrapper is rounding error. On a one-token answer the wrapper outweighs the payload.
A community write-up summarising Anthropic's structured-output overhead puts it at roughly 50–200 input tokens, with a 2–3% cost increase at scale (source). That number is honest for the average case. It does not describe your sentiment classifier.
Why the wrapper is so heavy
Three reasons, each of them billed.
Field names get tokenized as text. "sentiment" is roughly three tokens. Multiply by the number of fields. A response with eight fields is paying for eight field-name strings in addition to the eight values.
JSON punctuation is not free. Each {, }, :, ,, and " is a token or part of one. Pretty-printed JSON with two-space indentation roughly doubles the punctuation cost over compact JSON, and in some configurations the arguments come back pretty-printed on every call.
Tool calls add an envelope. A tool_use block carries a tool name, a unique id, and the type marker, all of which serialise into output tokens. On the next turn, your tool_result carries that id back, which costs input tokens. Round-trip a single tool call and you are paying twice for the envelope.
The escape sequences are smaller but real. A review that contains a literal quote becomes \" in the JSON-mode output. A newline becomes \n. Most reviews do not, but log lines, emails, and code snippets routinely do.
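You can see the wrapper weight without calling an API at all. A quick check with tiktoken's o200k_base encoding, the one the gpt-4o family uses; exact counts will differ a little across encodings.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # gpt-4o-family encoding

samples = {
    "free-form": "negative",
    "compact JSON": '{"sentiment": "negative"}',
    "pretty JSON": '{\n  "sentiment": "negative"\n}',
}
for label, text in samples.items():
    print(f"{label:>13}: {len(enc.encode(text))} tokens")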
When the tax is worth paying
Structured output is not a trap. It is a tool. The question is when the safety dividend exceeds the token tax.
Pay the tax when:
- The downstream code parses the output and a single malformed response breaks the request. A failed parse plus a retry costs more than the wrapper.
- You have multiple fields, conditional logic, or a schema that has changed three times this quarter. Free-form parsing rots; schemas version cleanly.
- The output is large enough that the wrapper is a small fraction. A 500-token analysis with eight fields wears the JSON envelope without flinching.
Skip the tax when:
- The answer is a single label or a number. Use free-form, parse the string, validate with a regex.
- The model is already reliable on the prompt, zero-shot or with a couple of examples. Constrained decoding is overkill for binary outputs.
- You are running at a volume where a 5–10× output multiplier on tiny calls actually shows up on the invoice. Classification at a million calls per day is one of those volumes.
A middle path often wins: use a lightweight JSON contract instead of a full schema. Prompt the model for positive | negative | neutral as raw text, then validate against a Python Literal type after parsing. You pay the free-form rate and you get a typed value at the boundary.
from typing import Literal
Sentiment = Literal["positive", "negative", "neutral"]
def parse_sentiment(raw: str) -> Sentiment:
cleaned = raw.strip().lower()
if cleaned in ("positive", "negative", "neutral"):
return cleaned # type: ignore[return-value]
raise ValueError(f"unexpected label: {raw!r}")
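Wired into the free-form call from earlier, the boundary is two lines:

label, usage = anthropic_freeform()
sentiment = parse_sentiment(label)  # typed value at the boundary, free-form price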
The "we need structured output" reflex usually means "we need a typed value at the boundary." Those aren't the same thing. One is a billing decision; the other is a parser.
Run it on your own traffic
The right answer for your service is not in this post. It is in your usage data. Take the ten endpoints that hit an LLM most often and, for each one, compute three numbers: tokens-per-call in free-form, tokens-per-call in JSON mode, retry rate without schema enforcement.
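A sketch of that measurement, assuming you log one JSON line per call with fields like endpoint, shape, output_tokens, and parse_ok; the field names and file are illustrative, not a standard.

import json
from collections import defaultdict

# Assumed log format, one JSON object per line:
# {"endpoint": "/classify", "shape": "freeform", "output_tokens": 2, "parse_ok": true}
stats = defaultdict(lambda: {"calls": 0, "tokens": 0, "failures": 0})

with open("llm_usage.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        s = stats[(rec["endpoint"], rec["shape"])]
        s["calls"] += 1
        s["tokens"] += rec["output_tokens"]
        s["failures"] += 0 if rec["parse_ok"] else 1

for (endpoint, shape), s in sorted(stats.items()):
    print(f"{endpoint} [{shape}]: "
          f"{s['tokens'] / s['calls']:.1f} tokens/call, "
          f"{s['failures'] / s['calls']:.1%} failed parses")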
If the JSON-mode column is more than 2× the free-form column and the retry rate is under 1%, you are paying for safety you do not need. Drop back to free-form, parse defensively, save the difference.
If the retry rate is 5% or higher, the schema is paying for itself. Keep it.
Measure first. The discount is in the data.
If this was useful
This is one of the patterns from my Prompt Engineering Pocket Guide — a small book about getting more out of LLMs without paying for the privilege. Chapters on schema design, output shaping, retry economics, and when to argue with the model versus when to give up and parse harder. If you ship anything that talks to an LLM API, the math in this post compounds across every endpoint you own, and the book is the long version.
