DEV Community

Alex Cloudstar

Posted on • Originally published at alexcloudstar.com

Structured Outputs in 2026: Function Calling, JSON Mode, and the Schema Wars

The bug took three days to find. A user reported that our invoice extractor was occasionally swapping the buyer and seller fields. Not all the time. Not even most of the time. Maybe one in two hundred invoices, always on the ones that mattered.

I dug into the traces. The model was returning valid JSON. The schema validation passed. The downstream code was correct. The fields were just wrong, sometimes, in ways that no test caught and no eval flagged.

What I eventually figured out was that I had been using JSON mode, which guarantees valid JSON syntax but does not constrain the keys or the values. The model was free to return whatever object it wanted as long as it parsed. On hard invoices it would occasionally hallucinate a slightly different schema and our code, trusting the parse, would write garbage into the wrong column of the database.

Switching to a real structured output API with a schema constraint took fifteen minutes. The bug never came back.

This is the kind of mistake that quietly destroys data integrity in LLM pipelines, and the surface area for it has grown a lot in 2026 as every major provider has shipped its own version of structured outputs. Function calling, JSON mode, schema-constrained generation, tool use, response_format, output_schema — these terms overlap, conflict, and sometimes mean different things on different providers. If you do not understand which one you are actually using, you cannot reason about what it will and will not catch.

This is the field guide I have built up after shipping structured outputs across four products and getting bitten by every variant of this bug at least once.


The Three Things People Mean When They Say "Structured Outputs"

Before any of the provider-specific stuff, you have to separate three different ideas that everyone uses interchangeably.

JSON mode means the model is constrained to produce syntactically valid JSON. Nothing more. The keys and values can be whatever the model decides. There is no schema. If you ask for an invoice and the model gives you a recipe for soup, JSON mode will happily return valid JSON describing soup. This is the lowest tier and the one most likely to hurt you because it looks like it is doing something useful.

Function calling (sometimes called tool use) is the original way structured outputs got into production. You define a function with parameters and a schema, the model decides whether to call it, and if it does, the arguments come back as a structured object. The model is constrained to fill out the schema, but historically the constraint was a soft suggestion, not a hard guarantee. The model could still return malformed arguments and you had to handle that.

Schema-constrained generation (sometimes called structured outputs, response_format with schema, or constrained decoding) is the new world. You define a JSON schema, the provider runs constrained decoding under the hood so the model literally cannot emit tokens that would violate the schema, and you get back a parsed object that is guaranteed to match. No retries, no validation failures, no surprises.

These three modes are not the same and they fail in different ways. The crux of choosing the right one in 2026 is figuring out which guarantee you actually need.
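A quick offline illustration of the gap, using two hand-written stand-ins for model responses: a JSON-validity check (all JSON mode gives you) accepts both, while even a minimal required-keys check against a schema catches the drifted one.

```python
import json

# A hypothetical invoice schema; the only shape we care about here is required keys.
INVOICE_REQUIRED = {"buyer", "seller", "total_amount"}

good = '{"buyer": "Acme", "seller": "Globex", "total_amount": 120.0}'
drifted = '{"from": "Acme", "to": "Globex", "amount": 120.0}'  # valid JSON, wrong schema

def is_valid_json(text: str) -> bool:
    """What JSON mode guarantees: the text parses. Nothing else."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def matches_schema(text: str) -> bool:
    """What schema constraints add: it parses AND has the keys you asked for."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and INVOICE_REQUIRED <= obj.keys()

assert is_valid_json(good) and is_valid_json(drifted)        # JSON mode: both pass
assert matches_schema(good) and not matches_schema(drifted)  # schema check: drift caught
```

The drifted response is exactly the invoice-extractor bug from the opening: it parses, so JSON mode waves it through.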


Where Each Provider Lands in 2026

The landscape is finally settling, but it is still worth knowing what each provider gives you and what it does not.

OpenAI

OpenAI has the most mature schema-constrained API. The response_format parameter takes a JSON schema and the model is decoded under that constraint at the token level. If you set strict: true, you get a hard guarantee that the output matches the schema exactly. The schema can be nested, can include enums, can express required vs optional fields, and the constraint is enforced at generation time, not validated after.

OpenAI also still has the older function calling API and the original JSON mode (no schema). You should treat both of those as legacy unless you have a specific reason. Use response_format with strict schemas as your default.
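A minimal sketch of what that request looks like with the OpenAI Python SDK as I have used it; parameter spellings can drift between SDK versions, and the invoice schema itself is a hypothetical example.

```python
# Sketch of a strict structured-output request payload for OpenAI's
# chat completions API. The invoice fields are illustrative.
invoice_schema = {
    "type": "object",
    "properties": {
        "buyer": {"type": "string", "description": "Legal name of the purchasing party"},
        "seller": {"type": "string", "description": "Legal name of the issuing party"},
        "total_amount_with_tax": {"type": "number"},
    },
    "required": ["buyer", "seller", "total_amount_with_tax"],
    "additionalProperties": False,  # strict mode requires this to be False
}

response_format = {
    "type": "json_schema",
    "json_schema": {"name": "invoice", "strict": True, "schema": invoice_schema},
}

# Usage (assumes the `openai` package and an API key; not run here):
# client.chat.completions.create(model=..., messages=..., response_format=response_format)
```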

The catch with OpenAI's strict mode is that it is more conservative than the looser modes. If your schema is too restrictive, the model can struggle to produce a useful answer because the constraint is preventing it from saying what it wants to say. The fix is usually to widen the schema with optional fields, not to remove the constraint.

Anthropic

Anthropic's tool use API has matured significantly through 2026. Tool definitions are JSON schemas, and the model is constrained to fill them. Through Claude Opus 4.7 the constraint enforcement has gotten strong enough that I treat it as equivalent to OpenAI's strict mode for practical purposes. Malformed tool calls are now rare enough that I no longer build retry logic around them.

Anthropic also added a more direct structured output mode that does not require dressing your call up as a tool. You provide an output schema and get a constrained response. This is the cleaner path when you do not actually need tool semantics, you just want a typed object back.
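For comparison, here is a sketch of the tool-use route, assuming the `anthropic` Python SDK; the tool name, schema, and fields are all hypothetical. Defining one tool and forcing it with `tool_choice` is the classic way to get a guaranteed structured response out of the tool API.

```python
# Sketch of forcing a structured response via Anthropic's tool use API.
extract_invoice_tool = {
    "name": "record_invoice",
    "description": "Record the fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "buyer": {"type": "string"},
            "seller": {"type": "string"},
            "total_amount_with_tax": {"type": ["number", "null"]},
        },
        "required": ["buyer", "seller"],
    },
}

request_kwargs = {
    "tools": [extract_invoice_tool],
    "tool_choice": {"type": "tool", "name": "record_invoice"},  # force this tool
}

# Usage (assumes the `anthropic` package; not run here):
# client.messages.create(model=..., max_tokens=..., messages=..., **request_kwargs)
```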

The thing to know about Anthropic is that the model is more willing to refuse or partially answer when the schema does not fit the request. If you ask for a structured field that the document does not contain, Claude is more likely to leave it null or empty than to confabulate. This is usually what you want, but it changes how you write prompts. You have to be explicit about what to do when the data is missing.

Google

Gemini's responseSchema parameter takes an OpenAPI-style schema and constrains the output. The constraint is enforced at decode time. The schema language is slightly different from JSON Schema, which is annoying, but the practical capability is on par with the others.

Gemini has the broadest support for very large outputs under structured constraints, which matters if you are extracting structured data from giant documents. If you need a 2 million token context window and a strict schema on the output, this is the one.
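The Gemini version, sketched as the dict form the `google-generativeai` Python SDK accepts; the uppercase type names follow the OpenAPI-style schema language the article mentions, and exact spellings vary by SDK version, so treat this as an assumption to check against current docs.

```python
# Sketch of Gemini's responseSchema (OpenAPI-style type names, not JSON Schema).
response_schema = {
    "type": "OBJECT",
    "properties": {
        "buyer": {"type": "STRING"},
        "total_amount_with_tax": {"type": "NUMBER", "nullable": True},
    },
    "required": ["buyer"],
}

generation_config = {
    "response_mime_type": "application/json",
    "response_schema": response_schema,
}

# Usage (assumes the google-generativeai package; not run here):
# model.generate_content(prompt, generation_config=generation_config)
```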

Open source models

For Llama, Qwen, Mistral, and the rest of the open weight ecosystem, structured outputs go through one of three libraries: outlines, lm-format-enforcer, or guidance. All three implement constrained decoding by intersecting the model's logit distribution with the legal next tokens given a schema or grammar. They work, they are reliable, and they are how most production self-hosted setups handle this.
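The shared mechanism can be shown with a toy: at each decoding step, mask out every token the grammar forbids, then pick from what survives. This sketch is deliberately tiny (character-level, greedy, a hand-written grammar fragment) and is not how outlines actually compiles schemas; it only illustrates the logit intersection.

```python
# Toy constrained decoder: intersect the model's preferences with the set of
# legal next characters, then pick the best legal one. "Scores" here are a
# hand-written dict standing in for the model's logits.

LEGAL_NEXT = {           # a tiny "grammar": after each prefix, what may follow
    "": {"{"},
    "{": {'"'},
    '{"': {"a", "b"},
    # ... a real library derives this automatically from the JSON schema
}

def constrained_step(prefix: str, scores: dict[str, float]) -> str:
    legal = LEGAL_NEXT.get(prefix, set())
    # Mask: drop every token the grammar forbids, regardless of its score.
    candidates = {tok: s for tok, s in scores.items() if tok in legal}
    if not candidates:
        raise ValueError(f"no legal continuation for prefix {prefix!r}")
    return max(candidates, key=candidates.get)

# The model "wants" to open with prose ("T" has the highest score),
# but only "{" is legal at the start, so "{" is what gets emitted:
assert constrained_step("", {"T": 5.0, "{": 0.1, "h": 4.0}) == "{"
```

This is also why the failure mode shifts the way described below: the output is always legal, so badness has to hide inside legal-but-unhelpful answers.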

The thing that surprised me when I moved a workload to a self-hosted model was that the open source constrained decoders are actually stricter than the hosted APIs. The model literally cannot produce an invalid output. If anything, the failure mode shifts from malformed JSON to the model getting stuck producing a less useful but technically valid answer.

If you are running local AI models with Ollama or any self-hosted inference, you almost certainly want one of these libraries in your stack. The native APIs do not do this for you.


Function Calling vs Structured Outputs vs Tool Use

Here is the part that confuses everyone, and where I have seen the most mistakes.

The historical naming is a mess. "Function calling" was OpenAI's original name. Anthropic called it "tool use." Google called it "function calling" too but did it differently. Eventually everyone converged on "tool use" because the model is not actually calling a function; it is producing a structured object that you then dispatch to a function.

Independently of that, "structured outputs" emerged as the term for "I want a structured response back, but I do not need tool semantics."

The distinction in 2026 is this:

Use tool use when the model needs to choose among multiple actions, decide whether to act at all, or take a sequence of actions in a loop. The semantics are about the model's agency. It is deciding what to do.

Use structured outputs when you have already decided what the model should produce and you just want the response in a typed shape. The semantics are about the response format. The model has no choice; it must produce the object.

In practice, most people use tool use for both because the API has been around longer and the documentation is more comprehensive. This works, but it leaks tool-use semantics into things that should be plain transformations. You end up with prompts that say "use the extract_invoice tool" when what you really mean is "give me an extracted invoice." The latter is less ambiguous and produces better results.

If your provider supports a direct structured output API, use it for transformations and reserve tool use for actual tool selection.


Designing Schemas That Do Not Fight the Model

This is the part nobody warns you about. The schema you write is itself a prompt. The model reads the schema, interprets the field names, and produces output guided as much by what the schema says as by what your text prompt says.

This means a badly designed schema can make the model worse, even when it is technically valid.

Use descriptive field names. A field called amount is ambiguous. A field called total_amount_with_tax tells the model exactly what to put there. The model is not magic; it reads what you wrote and tries to do what you said. Field names are part of the instruction.

Add field-level descriptions. Every major schema language supports a description per field. Use them. A description like "the date the invoice was issued, in YYYY-MM-DD format" is dramatically more reliable than just issue_date: string. The model treats the description as part of the prompt for that field specifically.

Use enums when possible. If a field has a fixed set of allowed values, encode that as an enum. The model is then physically incapable of producing anything else, and you do not have to write defensive parsing code. Status fields, category fields, type fields are all candidates.

Mark fields as nullable when they truly can be missing. If you make every field required, the model will fabricate values for fields it cannot find. If you allow a field to be null, the model will leave it null when the data is genuinely absent. This is the biggest source of hallucinated data in extraction pipelines, and the fix is to be honest in your schema about what is optional.

Avoid deeply nested structures. Constrained decoding works on every kind of schema, but the model performs better on flat structures. If your schema is six levels deep, consider whether some of those levels could be flat fields with composite keys. The model's accuracy on nested fields drops noticeably as depth increases.

Do not use schemas to enforce business logic. The schema is for shape, not policy. If a value must be between 1 and 100, do not encode that as a JSON Schema constraint and expect a correct answer. Some strict modes ignore numeric keywords like minimum and maximum entirely, and even where the range is enforced at decode time, the model is not reasoning about why the value belongs in that range, so you can get a confidently wrong number that happens to pass. Validate business rules in code after parsing.
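Putting those rules together, the invoice example might look like the following; every field name, description, and enum value here is illustrative.

```python
invoice_schema = {
    "type": "object",
    "properties": {
        "total_amount_with_tax": {          # descriptive name, not "amount"
            "type": ["number", "null"],     # nullable: absent data stays null
            "description": "Grand total including tax, as printed on the invoice.",
        },
        "issue_date": {
            "type": ["string", "null"],
            "description": "The date the invoice was issued, in YYYY-MM-DD format.",
        },
        "status": {
            "type": "string",
            "enum": ["draft", "issued", "paid", "void"],  # fixed set: use an enum
        },
    },
    "required": ["total_amount_with_tax", "issue_date", "status"],
    "additionalProperties": False,
}

# Note what is NOT here: no minimum/maximum business rules. Range checks
# belong in code after parsing, not in the schema.
```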


When the Schema Is Wrong: Failure Modes and Recovery

Even with strict schemas, structured outputs fail. The failure modes are different from unstructured outputs and you need to handle them.

Empty or null fields when the model cannot find the data. As I said above, this is the correct behavior. Your code needs to handle null. Treat any required-looking field as effectively optional in the model's eyes.

Confabulated values that match the schema. If you have a required field with no good default, the model will make something up. The fabrication will pass schema validation. The only defense is downstream verification — does the value actually exist in the source document, does it cross-reference with another field, does an LLM-as-judge agree it was extracted correctly. This is where the observability and eval workflow earns its keep.

Schema-mismatch in fallback paths. If your code has a fallback to a cheaper model that does not support strict mode, the fallback can return malformed data that breaks downstream parsers. Always validate after parsing, even if the API claims it cannot fail. Belt and suspenders.
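The belt-and-suspenders check can be a few lines of plain Python that run on every response, whichever model produced it; the field list here is a hypothetical invoice shape.

```python
import json

REQUIRED_FIELDS = {"buyer": str, "seller": str}  # hypothetical invoice fields

def parse_invoice(raw: str) -> dict:
    """Validate after parsing, even when the API claims the shape is guaranteed."""
    obj = json.loads(raw)  # raises on malformed JSON from a non-strict fallback
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return obj

assert parse_invoice('{"buyer": "Acme", "seller": "Globex"}')["buyer"] == "Acme"
try:
    parse_invoice('{"from": "Acme", "to": "Globex"}')  # fallback model drifted
    raise AssertionError("should have failed")
except ValueError:
    pass
```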

Token waste on overly verbose schemas. Every property name and description is in the prompt every time. A schema with 80 fields and three sentences of description per field can easily run 4000 tokens. If you are calling that schema a million times a month, you are paying for those tokens a million times. Watch the token budget on your schema specifically. I covered the broader pattern in LLM cost optimization; schemas are a major hidden contributor.
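A back-of-the-envelope check is worth automating. This sketch uses a crude four-characters-per-token heuristic on a hypothetical one-field slice of a verbose schema, so treat the result as an order of magnitude, not a bill.

```python
import json

# Hypothetical slice of a verbose schema; real offenders have dozens of fields.
schema = {
    "type": "object",
    "properties": {
        "total_amount_with_tax": {
            "type": "number",
            "description": "Grand total including tax as printed on the invoice, after discounts.",
        },
    },
}

def rough_tokens(obj) -> int:
    return len(json.dumps(obj)) // 4  # crude ~4 chars/token heuristic

per_call = rough_tokens(schema)
monthly_tokens = per_call * 1_000_000  # schema is resent on every one of 1M calls
print(per_call, "tokens per call,", monthly_tokens, "per month at 1M calls")
```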

Decoder slowdowns on complex schemas. Constrained decoding has a runtime cost that scales with schema complexity. A flat schema with ten fields decodes nearly as fast as unconstrained generation. A deeply nested schema with hundreds of optional branches can slow generation by 20% or more. If latency matters, profile this.


Prompting With Schemas

The prompt and the schema are two halves of the same instruction. Treat them that way.

The pattern that has worked best for me:

  1. The system prompt explains the task in plain language. What is the model doing, what does the input look like, what should it produce.
  2. The schema enforces the shape and provides field-level descriptions for anything ambiguous.
  3. The user prompt provides the input data.

Concretely, a system prompt like "Extract invoice fields from the provided document. If a field is not present in the document, return null for it; do not infer or estimate" plus a schema with descriptive field names is dramatically more reliable than either of those things alone.
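Assembled, the pattern looks like this; the document text is a placeholder, and the field-level descriptions are assumed to travel in the schema parameter rather than in these messages.

```python
system_prompt = (
    "Extract invoice fields from the provided document. "
    "If a field is not present in the document, return null for it; "
    "do not infer or estimate."
)

document_text = "INVOICE #1042 ..."  # placeholder input

messages = [
    {"role": "system", "content": system_prompt},  # task-level guidance
    {"role": "user", "content": document_text},    # the input data
]
# Field-level guidance lives in the schema's "description" entries,
# passed via the provider's schema parameter, not repeated here.
```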

Two specific anti-patterns to avoid:

Do not put the schema in the prompt as text. If your provider supports a real schema parameter, use it. Putting the schema in the prompt as JSON text means the model has to interpret it, and the constraint is not enforced at decode time. You get the worst of both worlds.

Do not duplicate field descriptions. If you have a field description in the schema, do not also describe it in the prompt. The model gets confused when the same instruction appears twice in slightly different words. Keep the field-level guidance in the schema and the task-level guidance in the prompt.


Streaming Structured Outputs

This is the new frontier in 2026. All three major providers now support streaming structured outputs, where the model generates the JSON token by token and you can read partial objects as they come in.

This matters because waiting for a 2000-token JSON response can take five seconds, and your UI cannot just freeze. Streaming lets you start rendering the first few fields while the rest are still being generated.

The catch is that partial JSON is not valid JSON. You cannot just JSON.parse the chunks as they arrive. You need a streaming JSON parser that can handle incomplete objects and emit field-level events as they are completed.

The libraries that do this well in 2026: partial-json for Node, pydantic-ai's streaming validators for Python, and the AI SDK's streamObject for full-stack TypeScript. All of them subscribe to the same pattern: parse what you have, emit a typed partial object, repeat as more data arrives.

Where I have seen this go wrong: developers stream the JSON, render fields as they arrive, but never wait for the final completion event. The user sees fields populate, then the model decides one of those fields was wrong and revises it in the final pass. Now you have a UI that flashes wrong data and corrects itself. Either lock in fields only when their complete event fires, or render to a buffered draft state and only commit on full completion.
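A crude stand-in for what those streaming parsers do under the hood: try to close the truncated JSON, parse it, emit whatever parses, repeat. Real libraries handle far more cases (arrays, escapes, nesting); this sketch only shows the shape of the pattern, simulated against a fixed string instead of a live stream.

```python
import json

def best_effort_parse(partial: str):
    """Try to repair-and-parse a truncated flat JSON object.
    Returns None until the stream has a parsable prefix."""
    for suffix in ("", '"}', "}"):
        try:
            return json.loads(partial + suffix)
        except json.JSONDecodeError:
            continue
    return None

stream = '{"buyer": "Acme", "total": 12}'
snapshots = [p for i in range(1, len(stream) + 1)
             if (p := best_effort_parse(stream[:i])) is not None]

assert snapshots[0] == {}                               # nothing usable yet
assert {"buyer": "Ac"} in snapshots                     # partial field value visible
assert snapshots[-1] == {"buyer": "Acme", "total": 12}  # final complete object
```

The `{"buyer": "Ac"}` snapshot is exactly the hazard described below: a field can look populated before it is final, so only commit it on its completion event.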


Schema Versioning

The thing that nobody thinks about until they get burned: your schema is a contract. Once you ship it, every record you wrote conforms to that schema. If you change a field name, add a required field, or change a type, you have a migration problem. If you store the structured outputs in a database, you also have a database migration problem.

Treat your schemas like API versions. Bump a version number when you change them. Keep the old schema around so you can read old records. If you are storing extracted data, store the schema version with the data so you know how to interpret it later.

This sounds like overhead but it pays off the first time you change a schema and realize you have ten thousand records that do not parse against the new shape.
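A minimal version of that discipline, with a hypothetical v1-to-v2 field rename migrated on read:

```python
import json

SCHEMA_VERSION = 2  # bump on any breaking schema change

def store_record(extracted: dict) -> str:
    """Persist the schema version alongside the data that conforms to it."""
    return json.dumps({"schema_version": SCHEMA_VERSION, "data": extracted})

def read_record(raw: str) -> dict:
    record = json.loads(raw)
    version, data = record["schema_version"], record["data"]
    if version == 1:
        # Hypothetical migration: v1 called the field "amount",
        # v2 renamed it to "total_amount_with_tax".
        data = dict(data)
        data["total_amount_with_tax"] = data.pop("amount", None)
    return data

old = '{"schema_version": 1, "data": {"amount": 120.0}}'
assert read_record(old)["total_amount_with_tax"] == 120.0
```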

For pipelines that output to a database, the schema versioning question is partly answered by the database layer. If you are using Postgres with JSON columns, the schema is loose enough that minor changes are forgiving. If you are using a strongly typed ORM like the ones I covered in Drizzle vs Prisma, the schema versioning needs to flow through the type system as well as the data.


Evals for Structured Outputs

Evals for unstructured text output are squishy. You compare to a reference, you ask another model to judge, you accept some fuzziness. Evals for structured outputs are crisper, and you should take advantage of that.

For each test case, you can compute exact-match accuracy on every field. Did the extracted total match the ground truth total? Yes or no. Did the extracted date match the ground truth date? Yes or no. Aggregate this into a per-field accuracy metric and you can tell, at a glance, which fields your model is bad at.

This is much more actionable than "the output looked right." If you see that the total_amount field has 99% accuracy but the due_date field has 78%, you know exactly where to focus prompt or schema work.

The same eval framework that you use for general agents (which I covered in AI evals for solo developers) extends naturally to structured outputs, but the assertions get tighter. You no longer need an LLM-as-judge for most fields. You have ground truth and you have outputs. Compare them with code.
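The per-field comparison itself is a few lines of code; the eval cases below are hand-written stand-ins for a real ground-truth set.

```python
# Per-field exact-match accuracy across a hypothetical eval set.
cases = [
    {"expected": {"total_amount": 120.0, "due_date": "2026-03-01"},
     "actual":   {"total_amount": 120.0, "due_date": "2026-03-02"}},  # date wrong
    {"expected": {"total_amount": 88.5,  "due_date": "2026-04-15"},
     "actual":   {"total_amount": 88.5,  "due_date": "2026-04-15"}},
]

def per_field_accuracy(cases: list[dict]) -> dict[str, float]:
    fields = cases[0]["expected"].keys()
    return {
        f: sum(c["actual"].get(f) == c["expected"][f] for c in cases) / len(cases)
        for f in fields
    }

acc = per_field_accuracy(cases)
assert acc["total_amount"] == 1.0  # this field is fine
assert acc["due_date"] == 0.5      # focus prompt/schema work here
```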


The Decision Framework

If I were starting a new project tomorrow that needed structured outputs, here is the decision tree I would use.

Need to extract typed data from documents or text? Use schema-constrained generation. OpenAI's strict response_format, Anthropic's structured output mode, or Gemini's responseSchema. Default to your existing provider; the differences in capability are smaller than the cost of switching.

Need the model to choose among multiple actions or use external tools? Use tool use. The semantics fit. Do not pretend it is just structured output.

Running a self-hosted model? Add outlines or lm-format-enforcer. Do not roll your own.

Outputting JSON to a low-stakes UI feature? JSON mode is fine. Skip the schema. The cost of a malformed response is a UI hiccup, not a data corruption.

Doing something where data integrity matters? Schema-constrained, plus downstream validation, plus per-field eval coverage. The schema catches shape errors. The validation catches policy errors. The evals catch hallucinated values that pass both.

The bug I described at the start of this post would have been impossible if I had used schema-constrained generation from day one. It cost me three days of debugging and a small amount of database cleanup that I am still slightly bitter about.

The good news is that in 2026 you do not have an excuse to skip this. Every major provider supports it. The libraries for self-hosted models are mature. The performance overhead is small. The error modes are well understood.

The only thing left is to actually use it. Stop using JSON mode for anything that matters. Stop trusting that the model will produce the right shape. Define the schema, enforce it at decode time, and validate the values that came through. The data integrity you save will be your own.
