
Ravi Patel

Posted on • Originally published at ssimplifi.com

Structured outputs vs JSON mode vs function calling vs raw text: the cost tradeoff explained

The structured-outputs feature in modern LLM APIs is sold on reliability — "the model returns exactly the schema you ask for, no parsing failures, no malformed JSON." That's real, but it's the second-order benefit. The first-order benefit is token economics: structured outputs typically produce 30-50% less verbose responses than free-form generation on the same task, because the model isn't padding with explanatory prose around the answer. Plus the elimination of retry-on-parse-failure loops removes a class of cost overruns that look like model unreliability but are actually engineering overhead. This post walks through the four shapes — raw text, JSON mode, function calling, structured outputs (response_format: json_schema) — the per-shape cost characteristics, and when to use which.

The parent guide OpenAI cost optimization covers structured outputs as one of five high-ROI techniques; this article goes deeper on the tradeoff between the four shapes.

The four output shapes

Modern LLM APIs offer four ways to extract structured data from a model response, ranging from "just text" to "fully schema-enforced":

Shape 1 — Raw text generation. The model returns free-form text. Your code parses it (regex, manual JSON extraction, whatever). The default mode; works on every model.

Shape 2 — JSON mode (response_format: json_object). The model returns valid JSON. Schema is loose — you ask for JSON, the model returns some JSON shape, no guarantee on field names. Reliability is high; structure is unguarded.

Shape 3 — Function calling (tools + tool_choice). The model returns a function call with arguments matching a schema you supply. Originally designed for tool use (call this API, with these arguments) but commonly repurposed for "extract this data" workloads.

Shape 4 — Structured outputs (response_format: json_schema). The model returns JSON matching a schema you supply. The schema is enforced at decode time — the model literally cannot produce output that violates the schema. The newest of the four shapes (rolled out in late 2024); the strictest.

Each shape has different token economics, latency characteristics, and reliability profile. The choice depends on the workload, not on a universal "best" answer.
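
As a rough sketch of how the four shapes differ at the request level, the snippet below builds the keyword arguments for each one, using the OpenAI Chat Completions parameter names mentioned above. The `build_request` helper, the `record_order` function name, and the two-field schema are placeholders invented for illustration; nothing here calls the API.

```python
# Placeholder schema shared by shapes 3 and 4 (hypothetical fields).
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "total_usd": {"type": "number"},
    },
    "required": ["order_id", "total_usd"],
    "additionalProperties": False,
}


def build_request(shape: str, prompt: str, model: str = "gpt-5-4") -> dict:
    """Return Chat Completions keyword arguments for one output shape."""
    request = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if shape == "raw_text":
        pass  # no extra parameters; the caller parses free-form text
    elif shape == "json_mode":
        request["response_format"] = {"type": "json_object"}
    elif shape == "function_calling":
        request["tools"] = [{
            "type": "function",
            "function": {"name": "record_order", "parameters": ORDER_SCHEMA},
        }]
        # Force this one function instead of letting the model reply in text.
        request["tool_choice"] = {"type": "function", "function": {"name": "record_order"}}
    elif shape == "structured_outputs":
        request["response_format"] = {
            "type": "json_schema",
            "json_schema": {"name": "order", "strict": True, "schema": ORDER_SCHEMA},
        }
    else:
        raise ValueError(f"unknown shape: {shape}")
    return request
```

Each dict could be passed on as `client.chat.completions.create(**build_request(...))`. Note that JSON mode also expects the word "JSON" to appear somewhere in the messages.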

The token economics — the headline saving

The most consistent finding across structured-output workloads: the response is shorter than free-form generation of the same task.

Why? Free-form generation pads with explanatory prose. "The email address is support@acme.com and the date appears to be 2026-03-14." Structured outputs strip the prose: {"email": "support@acme.com", "date": "2026-03-14"}. The same information; ~50% fewer output tokens.

Worked example. Extracting an order summary from a customer message:

Raw text generation prompt: "Extract the order details from this message: [...]"

Raw text response (typical):

Looking at the message, the order details are as follows:
- Order ID: 4477
- Customer: Jane Smith
- Total: $147.50
- Items: 3
- Shipping address: 742 Evergreen Terrace, Springfield OR
The order was placed on March 14, 2026.

~75 output tokens.

Structured outputs response (same task, json_schema):

{"order_id": "4477", "customer": "Jane Smith", "total_usd": 147.50, "items": 3, "shipping_address": "742 Evergreen Terrace, Springfield OR", "placed_date": "2026-03-14"}

~45 output tokens.

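40% reduction in output tokens. Output tokens cost 4-5x input tokens on most providers, so on workloads where output tokens dominate the bill that 40% reduction translates to a ~30-35% reduction in total spend. Input-heavy extraction workloads, where a long document yields a short answer, will see a smaller share.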
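
How much of a 40% output-token cut reaches the total bill depends on the input/output mix. Here is a small sketch of that arithmetic; the token counts and per-million prices are illustrative assumptions, not measurements.

```python
def bill_reduction(input_tokens: float, output_tokens: float, output_cut: float,
                   input_price: float, output_price: float) -> float:
    """Fraction of the total bill saved when output tokens shrink by output_cut.

    Prices are per million tokens; only their ratio matters for the result.
    """
    before = input_tokens * input_price + output_tokens * output_price
    after = input_tokens * input_price + output_tokens * (1 - output_cut) * output_price
    return (before - after) / before


# Output-heavy call: 75 output tokens against a 60-token prompt (assumed).
output_heavy = bill_reduction(60, 75, 0.40, input_price=2.5, output_price=10.0)

# Input-heavy call: the same 75 output tokens against an 800-token email (assumed).
input_heavy = bill_reduction(800, 75, 0.40, input_price=2.5, output_price=10.0)

print(f"{output_heavy:.0%} vs {input_heavy:.0%}")
```

With these assumed numbers the output-heavy call saves about a third of its bill, while the input-heavy call saves roughly 11%.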

Note: the worked example above is illustrative rather than measured. The numbers are reasonable, but ground them in production data from your own extraction workload before relying on them.

The saving compounds with volume. A workload extracting structured data from 100K user messages per day, with structured outputs instead of free-form, saves ~30 output tokens × 100K × $10/M = ~$30/day, ~$900/month on a single feature. Stacked across multiple extraction features, the impact gets large fast.
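
The volume arithmetic above can be written as a small function so you can plug in your own traffic. The function name is made up for this sketch, and the $10 per million output tokens is the same illustrative price used in the paragraph.

```python
def daily_saving_usd(tokens_saved_per_call: float, calls_per_day: int,
                     output_price_per_million: float) -> float:
    """Dollars saved per day from emitting fewer output tokens per call."""
    return tokens_saved_per_call * calls_per_day * output_price_per_million / 1_000_000


per_day = daily_saving_usd(30, 100_000, 10.0)
per_month = per_day * 30  # assuming a 30-day month
print(per_day, per_month)
```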

When each shape is the right call

The decision matrix:

| Output requirement | Recommended shape | Reason |
| --- | --- | --- |
| Free-form content, conversational responses | Raw text | Structured shapes add overhead with no benefit |
| Extraction, classification, simple structured data | Structured outputs (json_schema) | Strict schema + token efficiency wins |
| Data extraction where the consumer is permissive | JSON mode (json_object) | Lighter than full structured outputs; faster to implement |
| Tool use / function dispatch | Function calling | Native fit; the shape was built for this |
| Mixed conversational + structured output in the same response | Function calling with controlled tool emission | The model can decide when to emit structured output vs text |
| Performance-critical, low-token-budget workloads | Structured outputs | Tighter token control than other shapes |
| Complex nested objects with strict type guarantees | Structured outputs | Only shape that enforces schema; alternatives can drift |

The default "use structured outputs everywhere" reflex is wrong for the same reason "always stream" is wrong: the right shape depends on the workload, and over-engineering creates cost.

Cost characteristics per shape

The per-shape cost profile, beyond just output-token volume:

Raw text generation: cheapest per call (no schema overhead) but most expensive in failure modes (retry-on-parse-failure loops, defensive parsing in downstream code). Total production cost ends up higher than with the structured shapes for any workload whose downstream consumer expects structure.

JSON mode (json_object): marginal token savings (~10-20%) vs raw text. Reliability gain (always valid JSON) eliminates one common failure mode (malformed JSON). Schema isn't enforced; the model can drift on field names. Implementation is one line: response_format={"type": "json_object"}.

Function calling: moderate token savings (~30-40% vs raw text). High reliability — the function-call format is well-trained on most models. Adds a layer of indirection in the response shape (you parse tool_calls[0].function.arguments instead of the message content). Originally designed for tool use; works for extraction but feels slightly off-purpose.

Structured outputs (response_format: json_schema): largest token savings (~40-50% vs raw text). Highest reliability: the schema is enforced at decode time, so the model cannot violate it. Implementation is more verbose (you have to define the JSON Schema with additionalProperties: false, etc.). Supported by most major providers (OpenAI, Anthropic, Google, others), though not universally.

The honest cost ranking from lowest-total-cost to highest-total-cost, assuming a workload that needs structured data:

  1. Structured outputs — most token-efficient + most reliable + lowest retry overhead
  2. Function calling — close second; small extra token overhead for the call format
  3. JSON mode — middle ground; saves vs raw text but doesn't enforce schema
  4. Raw text + manual parsing — most expensive in total because of retry overhead

The ordering reverses if you don't need structured data — raw text is cheapest when free-form output is the right shape.
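
One way to reason about the ranking is expected cost per usable result: the cost of one attempt divided by the probability that the attempt is accepted. The sketch below applies that model. The output-token counts and failure rates are invented for illustration and should be replaced with figures measured on your own workload.

```python
from dataclasses import dataclass


@dataclass
class ShapeProfile:
    name: str
    output_tokens: float  # average output tokens per attempt (assumed)
    failure_rate: float   # share of attempts the consumer rejects (assumed)


def cost_per_usable_result(profile: ShapeProfile, input_tokens: float,
                           input_price: float, output_price: float) -> float:
    """Expected dollars per accepted response, retrying until one is accepted.

    With independent attempts the expected number of attempts is
    1 / (1 - failure_rate), so the per-attempt cost is scaled by that factor.
    Prices are per million tokens.
    """
    per_attempt = (input_tokens * input_price
                   + profile.output_tokens * output_price) / 1_000_000
    return per_attempt / (1 - profile.failure_rate)


# Illustrative numbers only.
profiles = [
    ShapeProfile("raw text", output_tokens=75, failure_rate=0.05),
    ShapeProfile("json mode", output_tokens=64, failure_rate=0.02),
    ShapeProfile("function calling", output_tokens=50, failure_rate=0.005),
    ShapeProfile("structured outputs", output_tokens=45, failure_rate=0.0),
]
ranked = sorted(profiles, key=lambda p: cost_per_usable_result(p, 300, 2.5, 10.0))
print([p.name for p in ranked])
```

Under these assumptions the order comes out as structured outputs, function calling, JSON mode, raw text. A workload with very different failure rates or token counts could rank differently.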

The reliability dividend (where the second-order savings live)

Beyond direct token economics, structured outputs eliminate a class of cost overruns:

Retry-on-parse-failure loops. Raw text generation occasionally produces malformed output that the downstream parser rejects. The application retries. Sometimes the retry succeeds; sometimes the parser rejects the output again and the application retries once more. The loop is one of the most common "where did our LLM bill go" patterns in production. Structured outputs remove this failure mode: the response is either valid against the schema (and the parser accepts it) or the generation fails outright (rare; it surfaces as a clean error you can handle once).

Schema drift after model upgrades. When you switch from gpt-5-4 to gpt-5-5, free-form JSON outputs may shift in subtle ways (field names slightly different, types slightly different). Structured outputs guarantee the schema regardless of model — the contract holds even as the underlying model evolves.

Downstream-code defensive parsing. Without structured outputs, downstream code has to handle "what if the JSON is malformed, what if the field is missing, what if the type is wrong." That's real engineering time. Structured outputs remove most of the defensive parsing surface; downstream code can trust the structure.

The combined effect: structured outputs are usually cheaper than raw text not because of the per-call savings but because of what they eliminate. Engineering time spent on defensive parsing; retries chewing through credits; bugs from schema drift after model upgrades. Hard to quantify in dollar terms; visible in team-level engineering velocity.

When structured outputs cost more

Three scenarios where structured outputs are the wrong call:

1. Conversational responses. A chat UI that returns "Sure, the price is $147.50 because…" should not be structured. Stripping the prose strips the user-facing value. Use raw text.

2. Long-form content generation. Article generation, summarisation, narrative writing. The prose is the value; structured outputs constrain the model in ways that hurt quality. Use raw text.

3. Highly variable output shape. "Extract whatever's relevant from this email" with no fixed schema. Either pick a schema (and use structured) or accept free-form text (and use raw). Trying to force "variable structure" via partial schemas creates more problems than it solves.

The pattern: structured outputs are for workloads with a predetermined output shape. When the shape isn't predetermined, the schema definition is fighting the workload, not helping it.

The implementation overhead

Comparing the four shapes by implementation effort:

Raw text: 1 line of code (the API call). 5-50 lines of parsing logic per output structure you need to extract.

JSON mode: 1 line of code (add response_format={"type": "json_object"}). You still need parsing logic, but it is simpler because the text being parsed is guaranteed to be valid JSON.

Function calling: ~10-20 lines per function definition (the function schema in the tools parameter). One-time setup cost; reusable across many calls.

Structured outputs: ~10-30 lines per schema definition. Once per output shape you need; reusable.

For workloads that emit structured data repeatedly, the per-shape implementation overhead amortises quickly. The first structured-output schema takes an hour to define; the second one takes 10 minutes by copy-pasting and adjusting.

The discipline: define schemas as Python pydantic models or TypeScript types, generate the JSON Schema from those, share across multiple call sites. The infrastructure is one-time; the value is per-call.
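
To picture "define once, generate the schema" without extra dependencies, the sketch below derives a JSON Schema from a standard-library dataclass. It is a deliberately small stand-in (flat objects, four primitive types) for what Pydantic's `model_json_schema()` gives you; the `to_json_schema` helper and the `Order` fields are illustrative.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def to_json_schema(cls: type) -> dict:
    """Build a strict, flat JSON Schema from a dataclass with primitive fields."""
    hints = get_type_hints(cls)
    properties = {}
    for field in fields(cls):
        python_type = hints[field.name]
        if python_type not in _JSON_TYPES:
            raise TypeError(f"unsupported field type for {field.name}: {python_type}")
        properties[field.name] = {"type": _JSON_TYPES[python_type]}
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }


@dataclass
class Order:
    order_id: str
    total_usd: float
    items: int


# Generated once, then shared by every call site that extracts an Order.
ORDER_SCHEMA = to_json_schema(Order)
```

The same type then serves both as the schema source and as the return type in downstream code, which is the point of the single-definition discipline.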

A worked migration: raw text → structured outputs

Before:

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class ExtractionError(Exception):
    """Raised when the model's reply lacks the fields we need."""


def extract_order(email_body: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-5-4",
        messages=[{
            "role": "user",
            "content": f"Extract the order details from this email: {email_body}\n\nRespond with JSON.",
        }],
    )
    # Defensive parsing (what we used to do)
    text = response.choices[0].message.content
    try:
        # Strip markdown code fences if present. removeprefix/removesuffix
        # (Python 3.9+) remove an exact string rather than a character set.
        text = text.strip().removeprefix("```json").removeprefix("```")
        text = text.removesuffix("```").strip()
        data = json.loads(text)
    except json.JSONDecodeError:
        # Retry once with a more forceful prompt
        # ... retry logic ...
        raise
    # Defensive field checking
    required_fields = ["order_id", "customer", "total", "items"]
    if not all(f in data for f in required_fields):
        raise ExtractionError("Missing required fields")
    return data

After:

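from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class Order(BaseModel):
    order_id: str
    customer: str
    total_usd: float
    items: int
    shipping_address: str
    placed_date: str


def extract_order(email_body: str) -> Order:
    # Depending on the installed SDK version, this helper may be exposed as
    # client.chat.completions.parse instead of under client.beta.
    response = client.beta.chat.completions.parse(
        model="gpt-5-4",
        messages=[{
            "role": "user",
            "content": f"Extract the order details from this email: {email_body}",
        }],
        response_format=Order,  # OpenAI's Pydantic integration enforces the schema
    )
    # parsed is None if the model refused; message.refusal then holds the reason
    return response.choices[0].message.parsed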

The "after" code is shorter, more reliable, and produces fewer output tokens per call. The defensive parsing is gone. Retries are gone. Schema-drift bugs after model upgrades are gone. This is a clean win on every dimension that matters.

Note: the client.beta.chat.completions.parse path reflects the SDK at the time of writing. The OpenAI SDK has evolved, and the Pydantic-integration entry point may have moved out of beta, so check the call against the SDK version you have installed.

Provider compatibility

Schema-constrained output, whether through response_format: json_schema or a provider-specific equivalent, is well supported on:

  • OpenAI — full support, including the Pydantic-integration parse API
  • Anthropic — supports it via the tools parameter shape with tool_choice: {"type": "tool", "name": ...}; the API is slightly different but the capability is equivalent
  • Google Gemini — supports it via response_schema parameter
  • Mistral — supports it via response_format: json_schema
  • Most modern providers — increasingly standardised

Less-well-supported:

  • Self-hosted open-weights — some models support it (Llama 3+ via certain inference servers); others don't. Verify per-deployment.
  • Older API versions — pre-2024 APIs typically don't support structured outputs; either upgrade or use function calling as the workaround.

For multi-provider workloads through an AI gateway (Prism, Portkey, LiteLLM, OpenRouter), the gateway typically passes the response_format parameter through to the upstream provider. Verify that your specific provider supports the shape.
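
To make the OpenAI and Anthropic entries above concrete, here is a sketch that turns one JSON Schema into the request fields each provider expects. The two helper names and the generated description text are hypothetical; the field names follow the parameter shapes described in the list.

```python
def openai_fragment(name: str, schema: dict) -> dict:
    """Request fields for OpenAI-style response_format: json_schema."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "strict": True, "schema": schema},
        }
    }


def anthropic_fragment(name: str, schema: dict) -> dict:
    """Request fields for Anthropic-style forced tool use."""
    return {
        "tools": [{
            "name": name,
            "description": f"Record the extracted {name} fields.",
            "input_schema": schema,
        }],
        # Forcing this tool makes the reply a tool call carrying the arguments.
        "tool_choice": {"type": "tool", "name": name},
    }


schema = {
    "type": "object",
    "properties": {"order_id": {"type": "string"}},
    "required": ["order_id"],
    "additionalProperties": False,
}
openai_fields = openai_fragment("order", schema)
anthropic_fields = anthropic_fragment("order", schema)
```

The result also arrives in different places: as JSON text in the message content for the OpenAI shape, and as the input of a tool-use content block for the Anthropic shape, so the response-reading code needs a matching translation.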

How Prism passes through structured outputs

Prism is transparent to structured outputs:

  • Pass-through preservation. The response_format parameter on incoming requests is forwarded to the upstream provider unchanged. Same for tools + tool_choice for function calling.
  • No gateway-side validation. Prism doesn't validate the JSON schema or check the response against it. That's the provider's job at decode time.
  • No caching interaction quirks. Structured-output requests cache normally (the schema is part of the cache fingerprint, so identical requests hit the cache; different schemas miss).
  • No mode interaction. Structured outputs work across eco/balanced/sport modes. The router picks the model based on task type + mode; the model then enforces the requested output schema.

The pattern: customer code uses structured outputs against any compatible model; Prism doesn't add or remove anything.
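
To illustrate why identical requests hit the cache and different schemas miss, here is a hypothetical fingerprint function. It is not Prism's implementation, only a sketch of the general idea: hash a canonical serialisation of the whole request, response_format included.

```python
import hashlib
import json


def cache_fingerprint(request: dict) -> str:
    """Hash a request so that equal requests map to the same cache key."""
    # sort_keys makes the serialisation independent of dict insertion order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "model": "gpt-5-4",
    "messages": [{"role": "user", "content": "Extract the order."}],
    "response_format": {"type": "json_schema",
                        "json_schema": {"name": "order", "schema": {"type": "object"}}},
}
# Same content, different key order: same fingerprint.
reordered = {key: base[key] for key in reversed(list(base))}
# Different schema name: different fingerprint.
changed = {**base, "response_format": {"type": "json_schema",
                                       "json_schema": {"name": "invoice",
                                                       "schema": {"type": "object"}}}}
```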

Decision framework

If you're evaluating whether to use structured outputs on a workload:

  1. Does the output have a predetermined shape? Yes → structured outputs candidate. No → raw text or JSON mode.
  2. Is downstream code consuming the output as data? Yes → structured outputs (the reliability is worth it). No → raw text.
  3. Is your workload extraction, classification, or function dispatch? All three benefit substantially from structured shapes.
  4. Are you running retry-on-parse-failure loops? That's the smell. Structured outputs eliminate the failure mode.
  5. Is the model you're using compatible? Verify provider + model support before committing.
  6. Define the schema once, reuse everywhere. Pydantic / TypeScript / JSON Schema — share the definition across all call sites for that output shape.
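
The checklist above can be condensed into a small routing function. This is one possible reading of the framework, not a definitive rule set; the function and parameter names are invented for the sketch.

```python
def recommend_shape(predetermined_shape: bool, consumed_as_data: bool,
                    tool_dispatch: bool, provider_supports_schema: bool) -> str:
    """Map the checklist answers to one of the four output shapes."""
    if tool_dispatch:
        return "function calling"
    if not consumed_as_data:
        return "raw text"
    if not predetermined_shape:
        return "json mode"
    if provider_supports_schema:
        return "structured outputs"
    # Workaround named in the provider compatibility section for APIs
    # that lack json_schema support.
    return "function calling"


print(recommend_shape(predetermined_shape=True, consumed_as_data=True,
                      tool_dispatch=False, provider_supports_schema=True))
```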

The economics consistently favour structured outputs on workloads that fit the pattern. The most common failure mode is over-applying the shape to workloads that don't fit — conversational responses, long-form generation, truly variable outputs. Be deliberate about which slice of your traffic benefits.

Where to go next

For the parent OpenAI cost optimization context: OpenAI cost optimization. For the broader cost-reduction playbook: LLM cost reduction and the ranked top-5.

For the caching layer that stacks with structured outputs: AI API caching and OpenAI prompt caching explained.

For modelling output-token savings on your workload: savings calculator.


FAQ

Are structured outputs slower than free-form generation?

Marginally. Schema enforcement happens at decode time and adds a small overhead per token. On most modern models the per-token latency difference is sub-5%; the response is also shorter, so total time-to-completion is often faster with structured outputs than with free-form generation of the same content.

Can I use structured outputs with prompt caching?

Yes. The two are independent: structured outputs constrain the response shape, while prompt caching discounts the input-token cost on stable prefixes. Both engage simultaneously on workloads that satisfy both conditions. Because they act on different cost components (caching on input tokens, structured outputs on output tokens), the savings stack rather than overlap.

What happens if the schema is too complex?

Schemas with deeply nested objects, many oneOf alternatives, or recursive structures can hit provider-side limits. OpenAI documents specific schema-complexity restrictions (max nesting depth, max properties per object, etc.). For most production workloads the limits are far above what's needed; only edge cases (recursive AST representations, deeply nested taxonomies) bump into them.
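
A quick way to see how close a schema is to such limits is to measure it before sending it. The sketch below counts object nesting depth and total properties. It only walks `properties` and array `items`, and the limits are passed in as parameters because the actual values should come from your provider's documentation.

```python
def schema_stats(schema: dict) -> tuple[int, int]:
    """Return (object nesting depth, total property count) for a JSON Schema.

    Handles "properties" and array "items" only; $ref, anyOf and similar
    keywords are ignored in this sketch.
    """
    properties = schema.get("properties", {})
    children = list(properties.values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    depth_below, total = 0, len(properties)
    for child in children:
        child_depth, child_total = schema_stats(child)
        depth_below = max(depth_below, child_depth)
        total += child_total
    own_level = 1 if schema.get("type") == "object" else 0
    return own_level + depth_below, total


def within_limits(schema: dict, max_depth: int, max_properties: int) -> bool:
    """Check a schema against caller-supplied limits."""
    depth, total = schema_stats(schema)
    return depth <= max_depth and total <= max_properties


order_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "address": {"type": "object",
                    "properties": {"street": {"type": "string"},
                                   "city": {"type": "string"}}},
        "lines": {"type": "array",
                  "items": {"type": "object",
                            "properties": {"sku": {"type": "string"},
                                           "qty": {"type": "integer"}}}},
    },
}
```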

Does function calling differ from structured outputs for extraction tasks?

Slightly. Function calling was designed for tool dispatch — "the model decides whether and which function to call, with arguments matching a schema." Structured outputs were designed for direct extraction — "always return this exact schema." For extraction workloads where the answer is always "yes, return the schema," structured outputs are the better fit; the function-calling indirection adds overhead. For tool-use workloads where the model genuinely picks between options, function calling is the native shape.

Can I use structured outputs with streaming?

Yes, in most providers. The response streams token-by-token (each chunk is a partial JSON fragment); your application either buffers until the closing brace or uses a streaming-JSON parser to consume partial output. The streaming-JSON consumer is more complex than the buffered approach; most production code waits for the full structured response.
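
A minimal sketch of the buffered approach, assuming the caller has already pulled the text fragment out of each streaming chunk (for example the content delta in the OpenAI SDK) and skipped empty ones. The function name is illustrative.

```python
import json
from typing import Iterable


def collect_structured_stream(fragments: Iterable[str]) -> dict:
    """Buffer streamed text fragments and parse once the stream has ended."""
    buffer = "".join(fragments)
    return json.loads(buffer)


# Fragments as they might arrive; no single one is valid JSON on its own.
fragments = ['{"order_id": "44', '77", "items"', ': 3}']
order = collect_structured_stream(fragments)
```

Parsing any prefix of the stream with `json.loads` raises `json.JSONDecodeError`, which is why the incremental alternative needs a dedicated streaming parser.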

Do structured outputs work with the Batch API?

Yes. The response_format parameter is passed through in the batch JSONL submission shape just like any other request parameter. Batch + structured outputs + prompt caching all stack on workloads that support all three.
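
As an illustration, one line of an OpenAI Batch API input file carrying a structured-output request might be built like this. The `batch_line` helper, the `custom_id` value, and the tiny schema are placeholders.

```python
import json


def batch_line(custom_id: str, body: dict) -> str:
    """Serialise one chat-completions request as a Batch API JSONL line."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })


body = {
    "model": "gpt-5-4",
    "messages": [{"role": "user", "content": "Extract the order details: ..."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "order",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
                "additionalProperties": False,
            },
        },
    },
}
line = batch_line("order-0001", body)
```

Writing one such line per message produces the JSONL file; the response_format block is identical to what a synchronous request would send.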

What about Anthropic's structured-output equivalent?

Anthropic supports schema-enforced output via the tools parameter with a specific tool definition. The API shape is different from OpenAI's response_format: json_schema but the capability is equivalent. Cross-provider portability requires translating the schema definition between provider conventions; most gateways and SDK wrappers handle this for you.

How much can I save by switching from raw text to structured outputs?

Output-token-wise, 30-50% reduction on extraction and classification workloads. Total-bill-wise, ~20-35% depending on the input/output ratio. The harder-to-quantify saving is the elimination of retry loops + defensive-parsing engineering time, which often exceeds the per-call savings in total impact.


Structured outputs are the right default on workloads with a predetermined output shape. The OpenAI cost optimization pillar covers structured outputs alongside the other four high-ROI OpenAI techniques; the LLM cost reduction playbook covers the cross-provider techniques.
