Structured output from LLMs: JSON mode, function calling, and grammar-constrained decoding
You deployed a chatbot that translates natural-language requests into API calls. A user says "book a table for four at 7pm tomorrow." Your prompt asks the LLM to emit JSON shaped like {"restaurant": string, "party_size": int, "time": string, "date": string}. One time it returns {"restaurant": "Olive Garden", "party_size": 4, "time": "19:00", "date": "2026-06-15"} -- valid JSON, everything works. The next request, "dim sum Saturday noon", produces {"restaurant": "Dim Sum House", "party_size": 2, "time": "12:00", "date": "Saturday"} followed by a free-text aside: "-- also, what's the dress code?" Now your JSON parser throws, your downstream pipeline crashes, and your Slack channel lights up at 2 AM.
The problem is fundamental: LLMs generate tokens, not data structures. Any schema you ask for is a suggestion, not a constraint. Production systems that depend on structured output need a mechanism that enforces the schema at the token level, not just at the prompt level.
Why this matters for production LLM applications
Three scenarios where structured output is non-negotiable:
API wrappers and function calling. An LLM that calls tools on your behalf must produce arguments that match the tool's JSON Schema. A malformed argument means a runtime error from the tool, a retry, or silent failure. At scale, even a 2% malformation rate becomes a steady stream of incident alerts.
Data extraction and ETL pipelines. You point an LLM at 10,000 support tickets and ask it to extract {customer_id, sentiment, category, urgency}. If 3% of the rows have extra fields, missing fields, or non-JSON prose, your data pipeline either drops them silently or someone writes a regex band-aid that breaks later.
Multi-step agent loops. An agent that calls a search tool, reads the result, then calls another tool needs each step's output to be parseable. If step 2 produces free text instead of a function call, the loop stalls. Every retry costs tokens, latency, and money.
The three approaches to structured output
Developers today have three main ways to coerce an LLM into producing structured data. They differ in reliability, latency, and how deeply they integrate with the model.
| Method | Enforcement level | Latency overhead | Model support | Schema expressiveness |
|---|---|---|---|---|
| Prompt-only JSON mode | None (suggestion) | Zero | All models | Unlimited |
| API-level JSON mode / function calling | Varies: hard (token-level) where a strict schema mode is enabled, otherwise soft (validation + retry) | 0-200ms | OpenAI, Anthropic, Gemini, most providers | JSON Schema (provider-specific subset) |
| Grammar-constrained decoding | Hard (token-level) | 10-50ms per token | Local models (llama.cpp, vLLM), Outlines, Guidance, lm-format-enforcer | Any CFG, JSON Schema, regex |
Prompt-only is what you write when you first prototype. API-level structured output is what most teams use in production today. Grammar-constrained decoding is the emerging standard for local and self-hosted models where you control the sampling loop.
Prompt-only JSON mode
The simplest approach: tell the model to output JSON and hope it complies.
You are a data extraction assistant.
Extract the requested fields and output ONLY valid JSON.
Do not include any explanation, markdown formatting, or extra text.
This works maybe 85-95% of the time with capable models, but the failure modes are maddening: trailing commas (not valid JSON but some parsers accept them), markdown code fences around the JSON, explanatory text before or after the JSON, missing closing braces, and string values that contain unescaped quotes.
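A lenient parser can recover from two of these failure modes: code fences and surrounding prose. The sketch below uses only the standard library; the helper name extract_first_json_object is hypothetical, and it deliberately does not try to repair trailing commas, missing braces, or unescaped quotes.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_RE = re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL)

def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object in `text`, ignoring fences and trailing prose."""
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)  # keep only the body of the first fenced block
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode parses one value and reports where it stopped,
    # so anything after the closing brace is simply ignored.
    obj, _end = json.JSONDecoder().raw_decode(text[start:])
    return obj

raw = ('{"restaurant": "Dim Sum House", "party_size": 2, "time": "12:00", '
       '"date": "Saturday"} -- also, what is the dress code?')

try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False  # the free-text aside makes strict parsing fail

booking = extract_first_json_object(raw)
print(strict_ok, booking["restaurant"])
```

This only makes the output parseable. It says nothing about whether "Saturday" is an acceptable value for date, so schema validation is still a separate step.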
The fatal flaw is that prompt-only mode does not interact with the token generation process at all. If the model is partway through a field value and its most likely next token is "Sorry" (the start of a free-text apology), nothing stops it from generating that token. The prompt is just context -- it does not constrain the probability distribution.
API-level structured output (JSON mode and function calling)
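OpenAI introduced JSON mode in late 2023 and followed it in August 2024 with Structured Outputs, which enforces a caller-supplied schema; other providers have shipped comparable features. The two are not equivalent. Plain JSON mode (a response_format of type json_object) only guarantees syntactically valid JSON. Structured Outputs takes a response_format of type json_schema and, with strict enabled, constrains decoding so that tokens which would violate the schema cannot be sampled.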
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract: John Smith, 42, john@example.com"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "email": {"type": "string", "format": "email"},
                },
                "required": ["name", "age", "email"],
                # strict mode requires this on every object in the schema
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # a JSON string; parse it with json.loads
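With strict set to true, the content is constrained to the schema, so no extra properties are emitted. Two cases still need handling: the model can refuse (reported in a separate refusal field on the message), and the response can be cut off by the token limit before the JSON is complete (finish_reason is "length").
Function calling works similarly: you register tool definitions as JSON Schema objects, and the model returns a structured tool_calls array. Whether the arguments are enforced at the token level depends on the provider and on whether a strict option is enabled for the tool; without it, treat the arguments as untrusted input and validate them.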
tools = [{
    "type": "function",
    "function": {
        "name": "book_restaurant",
        "description": "Book a table at a restaurant",
        "parameters": {
            "type": "object",
            "properties": {
                "restaurant": {"type": "string"},
                "party_size": {"type": "integer"},
                "time": {"type": "string"},
                "date": {"type": "string"}
            },
            "required": ["restaurant", "party_size", "time", "date"]
        }
    }
}]
The model returns something like:
{
  "name": "book_restaurant",
  "arguments": "{\"restaurant\":\"Olive Garden\",\"party_size\":4,\"time\":\"19:00\",\"date\":\"2026-06-15\"}"
}
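Anthropic Claude's tool use, Gemini's function calling, and Mistral's function calling follow the same request/response pattern: the schema is defined client-side and the model returns structured arguments. How strongly the schema is enforced differs by provider and by mode, so check the documentation for a strict or structured-output option, and validate the arguments before executing a tool call.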
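Note that arguments in the response above is itself a JSON-encoded string, so it needs a second decode before use. A minimal sketch of decoding and checking it before dispatch, using only the standard library; the helper name parse_tool_arguments and the field table are illustrative assumptions, not part of any provider SDK.

```python
import json

# Expected argument names and Python types for the book_restaurant tool.
BOOK_RESTAURANT_FIELDS = {"restaurant": str, "party_size": int, "time": str, "date": str}

def parse_tool_arguments(arguments: str, fields: dict) -> dict:
    """Decode the JSON-encoded `arguments` string and check keys and types."""
    args = json.loads(arguments)
    if not isinstance(args, dict):
        raise ValueError("arguments must decode to a JSON object")
    missing = set(fields) - set(args)
    extra = set(args) - set(fields)
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in fields.items():
        value = args[name]
        # Reject booleans outright: none of these fields is boolean, and
        # bool would otherwise pass the int check because it subclasses int.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{name!r} should be {expected.__name__}")
    return args

tool_call = {
    "name": "book_restaurant",
    "arguments": '{"restaurant":"Olive Garden","party_size":4,"time":"19:00","date":"2026-06-15"}',
}
args = parse_tool_arguments(tool_call["arguments"], BOOK_RESTAURANT_FIELDS)
print(args["party_size"])
```

A check like this is cheap insurance even when the provider enforces the schema, because it also catches a schema that has drifted from what the tool implementation expects.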
Grammar-constrained decoding
For local and self-hosted models, you can push enforcement into the sampling loop itself. Grammar-constrained decoding modifies the token probability distribution at each step: any token whose text would make the output invalid under the grammar or schema has its logit set to negative infinity, so its probability becomes zero.
# Using Outlines to constrain generation to a Pydantic model
# (this is the models/generate interface from Outlines 0.x; later releases
# changed the API, so check the docs for your installed version)
from pydantic import BaseModel, constr
from outlines import models, generate

class Person(BaseModel):
    name: constr(min_length=1, max_length=100)
    age: int
    email: str

model = models.transformers("Qwen/Qwen2.5-7B-Instruct")
generator = generate.json(model, Person)
result = generator("Extract: John Smith, 42, john@example.com")
print(repr(result))
# e.g. Person(name='John Smith', age=42, email='john@example.com')
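Outlines works by converting the JSON Schema (or the schema derived from a Pydantic model) into a regular expression, compiling that into a finite-state machine, and precomputing which vocabulary tokens are valid in each state. At every generation step it looks up the current state and keeps only the tokens that are valid continuations. Grammars that go beyond what a regular expression can express need a context-free grammar (CFG) instead.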
The same idea works for arbitrary grammars, not just JSON:
# Grammar-constrained generation with llama.cpp
# GBNF (GGML BNF) is llama.cpp's grammar format
grammar = """
root ::= digit+ "." digit+
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
"""
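llama.cpp accepts GBNF grammars natively (for example through its --grammar-file option). Ollama exposes schema-constrained output through the format field of its API rather than through raw GBNF. vLLM has a --guided-decoding-backend flag (options: outlines, lm-format-enforcer, xgrammar). The key insight is that grammar-constrained decoding makes structured output a sampling-time property, not a post-processing step.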
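To make the pruning concrete, here is a toy sketch for the decimal-number grammar above (digit+ "." digit+). It hand-writes a four-state machine and precomputes, for each state, which tokens of a made-up vocabulary are valid. Real libraries derive the state machine from the grammar and use the model's actual tokenizer; everything named here is an assumption for illustration.

```python
DIGITS = "0123456789"
# States: 0 = start, 1 = inside integer part, 2 = just after ".", 3 = inside fraction
ACCEPTING = {3}

def step(state, ch):
    """Advance the state machine by one character; None means invalid."""
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None

def walk(state, token):
    """Feed a whole (possibly multi-character) token through the machine."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def build_index(vocab):
    """Map each state to {token: next_state} for every token valid in that state."""
    index = {}
    for state in range(4):
        index[state] = {}
        for token in vocab:
            nxt = walk(state, token)
            if nxt is not None:
                index[state][token] = nxt
    return index

vocab = ["1", "42", ".", ".5", "3.1", ",", "abc", "7."]  # toy vocabulary
index = build_index(vocab)
for state in range(4):
    print(state, sorted(index[state]))
```

Because tokens can span several characters, validity is decided by walking the whole token, not just its first character. The index is built once, so the per-step work during generation is a dictionary lookup.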
Here is how the token mask operates during generation:
flowchart TD
A[Start generation] --> B[Get next token logits<br/>from model forward pass]
B --> C[Apply grammar mask:<br/>zero-out tokens that would<br/>produce invalid structure]
C --> D[Sample from masked<br/>probability distribution]
D --> E{Is generation<br/>complete?}
E -->|No| B
E -->|Yes| F[Return valid structured<br/>output]
Every token is checked against the schema before it is sampled. If the schema expects a number at position 73 and the model proposes a comma, that token is masked out and the next-best valid token is sampled instead.
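The mask-then-sample step can be sketched in a few lines. This is a toy version with made-up logits over a four-token vocabulary, not any framework's implementation: disallowed tokens get a logit of negative infinity, and softmax renormalizes the rest.

```python
import math
import random

def masked_probs(logits, allowed):
    """Softmax over `logits` after setting disallowed tokens to -inf."""
    if not any(tok in allowed for tok in logits):
        raise ValueError("grammar allows no token in the vocabulary")
    masked = {tok: (val if tok in allowed else -math.inf) for tok, val in logits.items()}
    peak = max(masked.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(val - peak) for tok, val in masked.items()}  # exp(-inf) is 0.0
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def sample(probs, rng):
    """Draw one token according to `probs`."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# The model prefers a comma, but the schema expects a number here.
logits = {",": 5.0, "4": 2.0, "7": 1.0, "}": 0.5}
probs = masked_probs(logits, allowed={"4", "7"})
token = sample(probs, random.Random(0))
print(probs[","], token)
```

The comma had the highest raw logit, yet after masking its probability is exactly zero, and the remaining probability mass is shared between the valid tokens in proportion to the model's original preferences.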
Comparison: which method should you use?
| Criterion | Prompt-only | API JSON/function call | Grammar-constrained |
|---|---|---|---|
| Reliability | 85-95% | ~99.9% | >99.9% |
| Latency impact | None | Negligible | ~10-50ms per token |
| Works with any model | Yes | No (provider-dependent) | Yes (local framework) |
| Schema validation | Post-hoc | Token-level | Token-level |
| Debugging difficulty | Easy (parse error) | Medium (API error) | Medium (grammar compile error) |
| Best use case | Prototyping, quick scripts | Production API calls | Self-hosted, sensitive data |
Common pitfalls
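Nested schemas with strict mode. OpenAI's strict mode accepts only a subset of JSON Schema: every object, including nested ones, must set additionalProperties: false and list all of its properties in required. A schema that has additionalProperties: true or leaves fields out of required is rejected when the request is made. To model an optional field, keep it required and allow null as a value (for example "type": ["string", "null"]). Test with strict: false first, then tighten.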
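Tightening a schema by hand is error-prone once objects are nested. The helper below is a rough sketch of automating it: it forbids extra keys on every object, marks every property as required, and turns previously optional properties into nullable ones. The function names are hypothetical, and it only handles plain type/properties/items schemas, not anyOf, $ref, or other composition keywords.

```python
import copy

def _types(node):
    """Return the node's "type" as a list, whether it was a string or a list."""
    t = node.get("type", [])
    return [t] if isinstance(t, str) else list(t)

def _make_nullable(node):
    types = _types(node)
    if types and "null" not in types:
        node["type"] = types + ["null"]

def _tighten(node):
    if not isinstance(node, dict):
        return
    if "object" in _types(node) and "properties" in node:
        required = set(node.get("required", []))
        for name, prop in node["properties"].items():
            _tighten(prop)             # recurse into nested objects first
            if name not in required:
                _make_nullable(prop)   # optional becomes "required but may be null"
        node["required"] = list(node["properties"])
        node["additionalProperties"] = False
    if "array" in _types(node):
        _tighten(node.get("items"))

def to_strict_schema(schema):
    """Return a tightened deep copy; the input schema is left untouched."""
    strict = copy.deepcopy(schema)
    _tighten(strict)
    return strict

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "nickname": {"type": "string"},
        "address": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    "required": ["name"],
}
strict = to_strict_schema(schema)
print(strict["required"], strict["properties"]["nickname"]["type"])
```

Downstream code then has to treat null as "not provided", which is a behavior change worth calling out when the schema is shared with other consumers.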
Grammar compilation time. Outlines and Guidance compile the schema into a state machine before generation starts. For complex schemas with deeply nested allOf / oneOf, this can take 2-10 seconds. Cache the compiled grammar if you reuse a schema.
Token masking vs resampling. Some implementations (early Guidance) used resampling: if the output was invalid, regenerate. This is slow and unpredictable. Prefer token-masking approaches (Outlines, xgrammar, llama.cpp GBNF) that never generate invalid tokens in the first place.
Model incompatibility with grammar backends. Not all Hugging Face model architectures work with Outlines' transformers backend. If you hit an error about unsupported model type, switch to the llamacpp backend or use vLLM's guided decoding instead.
When NOT to use it
Structured output is the wrong tool when:
You need open-ended creative text. A story-writing or brainstorming session should not be grammar-constrained. The constraints reduce the model's output quality and diversity for tasks where free text is the goal.
Your schema changes frequently. Grammar compilation and testing add overhead. If you are iterating on a schema multiple times per day, start with prompt-only JSON, then add enforcement once the schema stabilizes.
Your model is behind an API that does not support it. Not all providers offer JSON mode or function calling. For those that do not, you are limited to prompt-only or running a local validation + retry loop, which adds latency and cost.
Your use case tolerates occasional parse failures. If a human reviews every output or the downstream system has robust error handling, the complexity of grammar-constrained decoding may not be worth it.
Latency is the absolute top priority. Grammar masking adds a small per-token overhead. For sub-100ms response requirements at high throughput, prompt-only with a lenient parser may be the pragmatic choice. Measure before optimizing.
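For the case above where the provider offers no enforcement, the local validation + retry loop can be sketched as follows. call_model is a hypothetical callable that takes a prompt and returns the model's text; the fake model here stands in for a real API client so the example runs on its own.

```python
import json

def generate_with_retry(call_model, prompt, required, max_attempts=3):
    """Call the model until it returns a JSON object containing `required` keys."""
    last_error = None
    attempt_prompt = prompt
    for _ in range(max_attempts):
        text = call_model(attempt_prompt)
        try:
            data = json.loads(text)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = set(required) - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            last_error = exc
            # Tell the model what went wrong so the next attempt can correct it.
            attempt_prompt = f"{prompt}\n\nYour previous reply was rejected ({exc}). Reply with JSON only."
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Fake model: first reply is prose plus broken JSON, second reply is valid.
replies = iter([
    'Sure! Here you go: {"customer_id": "C-17"',
    '{"customer_id": "C-17", "sentiment": "negative"}',
])
calls = []

def fake_model(prompt):
    calls.append(prompt)
    return next(replies)

result = generate_with_retry(fake_model, "Extract customer_id and sentiment.", {"customer_id", "sentiment"})
print(result, len(calls))
```

Each failed attempt costs a full generation, so the attempt cap bounds both latency and spend. Logging the rejected replies also gives you the data to estimate the real failure rate.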
TL;DR
- Prompt-only JSON works ~85-95% of the time and is fine for prototyping, but will crash in production at scale.
- API-level JSON mode / function calling (OpenAI, Anthropic, Gemini, Mistral) returns structured output with negligible latency overhead, and enforces the schema at the token level where the provider offers a strict schema mode. Use this in production whenever your provider supports it.
- Grammar-constrained decoding (Outlines, Guidance, llama.cpp GBNF, vLLM guided decoding) enforces schema at the sampling step. Best for self-hosted models and sensitive-data scenarios.
- Token masking is better than resampling. Prefer frameworks that mask invalid tokens rather than regenerating on failure.
- Measure the overhead. Grammar compilation and per-token masking add latency. Test with your schema and model before committing.
Next post
The pipeline that evaluated the output of grammar-constrained decoding against a test corpus of 10,000 real user requests -- how we measured reliability, what broke, and what the latency budget actually looked like in production.
If you have a production story about structured output going wrong (or going right), the next post will compile reader experiences -- drop a comment with your war story.