I spent three days debugging why my GPT-4-powered app kept returning malformed JSON. It wasn't a prompt issue. I tried few-shot examples, system messages, even begged the model with 'PLEASE give me valid JSON'. And it still broke in production.
This is the story of how I finally got reliable structured output from LLMs — without playing whack-a-mole with edge cases.
The Problem: JSON Roulette
I was building a small internal tool that extracts meeting notes and turns them into structured data: action items, dates, assignees. The prompt was crystal clear:
Return a JSON array of objects with fields: action, due_date (ISO 8601), assignee.
I used response_format: { "type": "json_object" } in the OpenAI API call. In my local tests, everything worked fine. Then the first real user uploaded a transcript with a date like "next Thursday" and the model returned:
{"action": "Review Q3 budget", "due_date": "next Thursday", "assignee": "Alice"}
Not even a real date. And sometimes I got extra fields, or nested objects, or the entire array wrapped in an extra object. Pure chaos.
What I Tried That Didn't Work
1. Prompt engineering
I kept rewriting the system prompt. I added:
- "Only output valid JSON. No explanations."
- "Use en dashes for empty dates."
- A few-shot example with five perfectly formatted records.
Did it help? Marginally. But a single edge-case transcript could still trigger a different output format. Model behavior on non-English text was even worse.
2. Post-processing with regex
I wrote Python code to capture anything between the first [ and last ]. Then I tried json.loads() inside a try-except. When it failed, I logged the raw output and manually fixed it. This was a terrible idea — I was basically patching a broken pipe.
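For reference, the bracket-capture approach boils down to something like the sketch below. It is a reconstruction using only the standard library, and the function name `extract_json_array` is illustrative rather than the code from the tool itself. It shows the two ways this kind of patching falls over: stray brackets in the surrounding chatter, and output that parses fine but still has the wrong content.

```python
import json
from typing import Any, Optional


def extract_json_array(raw: str) -> Optional[Any]:
    """Parse whatever sits between the first '[' and the last ']'."""
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end <= start:
        return None  # no usable bracket pair
    try:
        return json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None  # caller is left to log the raw text and fix it by hand


# Works when the model wraps the array in chatter.
ok = extract_json_array('Sure! [{"action": "Review Q3 budget"}] Hope that helps.')

# Breaks when the chatter itself contains a bracket.
bad = extract_json_array('See [1] below: [{"action": "Review Q3 budget"}]')

# Parses, but nothing here checks that due_date is a real date.
wrong = extract_json_array('[{"action": "Review Q3 budget", "due_date": "next Thursday"}]')
```

Extraction only answers "is there JSON in here somewhere?". It says nothing about whether the fields are the ones you asked for.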
3. Custom output parsers
I built a parser that looked for key-value pairs even without proper JSON delimiters. It worked for 80% of cases, but the edge cases multiplied. And every parser change required a full regression test cycle.
What Actually Worked: Constrained Decoding with a Schema
The breakthrough came when I stopped trying to fix the output after generation and instead constrained the model so that it could only emit valid JSON in the first place. This technique is called constrained decoding (or structured generation).
Here's the core idea: instead of allowing the model to pick any token, we restrict the allowed next tokens based on a JSON schema. The model can only produce tokens that will result in valid JSON according to that schema.
I used a Python library called Outlines (open source, works with OpenAI and local models). The pattern looks like this:
import outlines
from outlines.generate import json as json_generator
from pydantic import BaseModel
from typing import List

class ActionItem(BaseModel):
    action: str
    due_date: str
    assignee: str

class MeetingNotes(BaseModel):
    items: List[ActionItem]

# The model object (can be a local model or an OpenAI-compatible endpoint).
# Note: the Outlines API differs between releases; check the docs for your version.
model = outlines.models.openai("gpt-4o")

# Build a generator whose output is constrained to the MeetingNotes schema.
generator = json_generator(model, MeetingNotes)

# The instruction travels in the prompt itself, followed by the transcript.
prompt = (
    "Extract action items from the meeting transcript. Output structured data.\n\n"
    "Meeting: Alice will review Q3 budget by next Thursday. Bob to update the dashboard by Friday."
)
result = generator(prompt)
print(result.model_dump_json(indent=2))
The output is ALWAYS valid JSON matching MeetingNotes. No more parsing hell.
How It Works Internally
Constrained decoding works by compiling the JSON schema into a finite state machine. At each generation step, the library masks out tokens that would break the schema. For example, after generating a colon after a field name, the only allowed next token is a quote (for a string) or a digit (for a number), depending on the schema type. This is light-years ahead of post-generation validation.
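To make the masking step concrete, here is a toy sketch. Everything in it is made up for illustration: the eight-token vocabulary, the hand-written state machine for a one-field schema, and the `fake_logits` stand-in for a model. A real library compiles the state machine from the JSON schema and works over the tokenizer's full vocabulary.

```python
import json
import math
from typing import Dict, List

# Toy vocabulary: each "token" is a whole JSON fragment.
VOCAB: List[str] = ["{", "}", '"assignee"', ":", '"Alice"', '"Bob"', "42", "Sure!"]

# Hand-written state machine for the schema {"assignee": <string>}.
# Maps state -> {allowed token -> next state}.
FSM: Dict[str, Dict[str, str]] = {
    "start": {"{": "key"},
    "key": {'"assignee"': "colon"},
    "colon": {":": "value"},
    "value": {'"Alice"': "close", '"Bob"': "close"},  # strings only, so 42 is masked
    "close": {"}": "done"},
}


def fake_logits() -> List[float]:
    """Stand-in for a model that would rather chat or emit a number."""
    scores = [0.0] * len(VOCAB)
    scores[VOCAB.index("Sure!")] = 5.0
    scores[VOCAB.index("42")] = 4.0
    scores[VOCAB.index('"Bob"')] = 1.0
    return scores


def constrained_decode() -> str:
    state = "start"
    out: List[str] = []
    while state != "done":
        allowed = FSM[state]
        # Mask: disallowed tokens get -inf so they can never win the argmax.
        masked = [
            score if tok in allowed else -math.inf
            for tok, score in zip(VOCAB, fake_logits())
        ]
        best = VOCAB[masked.index(max(masked))]
        out.append(best)
        state = allowed[best]
    return "".join(out)


text = constrained_decode()
print(text)  # {"assignee":"Bob"}
```

Left unconstrained, this fake model would open with "Sure!" on every run. With the mask in place, its preferences only matter among tokens the schema permits, which is why it ends up choosing "Bob" over "Alice" but never gets to pick 42.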
Trade-offs and Limitations
- Speed: Constrained decoding adds a slight overhead per token because of the mask computation. In my testing, it added about 5–10% to generation time — negligible for most use cases.
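- Model support: Token-level masking needs the model's logits at every decoding step, and hosted APIs generally don't hand those over. OpenAI's logit_bias parameter is a single static bias for the whole request, so it can't express a mask that changes from token to token; with OpenAI models the schema has to be enforced on the server side instead, via Structured Outputs (a response_format of type json_schema with strict set to true). Anthropic's API doesn't expose logit biases at all, so you'd need a different approach (like output validation with retries).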
- Schema complexity: Deeply nested schemas with anyOf or recursive definitions can make the state machine huge. I recommend keeping your JSON schema flat for reliability.
- Non-completion endpoints: If you're using a chat completion API, you need to set the response format to text (not json_object) and pass the schema as part of the generation pipeline. Some libraries handle this automatically.
Alternatives I Considered
- Prompt-based schema embedding: Some people inline JSON Schema into the system prompt. This works okay but isn't bulletproof — the model can still hallucinate fields. I'd only use this for low-risk internal tools.
- Output validation + retry: If the JSON is invalid, resend the prompt with the error message. This works but increases latency and cost proportional to the error rate.
- Using OpenAI's json_object mode: It forces the model to output a single JSON object, but not a specific schema. You still get random field names. Worthless for structured extraction.
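For APIs where constrained decoding isn't available, the validate-and-retry alternative can be sketched with the standard library alone. The names `validate`, `extract_with_retry`, and `stub_model` are hypothetical, and the stub stands in for a real API call so the example runs offline. Treat it as one possible shape for the loop, not a production implementation.

```python
import json
from datetime import date
from typing import Callable, List

REQUIRED = {"action", "due_date", "assignee"}


def validate(raw: str) -> List[dict]:
    """Return the parsed items, or raise ValueError describing the first problem."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from None
    if not isinstance(items, list):
        raise ValueError("top level must be a JSON array")
    for i, item in enumerate(items):
        if not isinstance(item, dict) or set(item) != REQUIRED:
            raise ValueError(f"item {i} must have exactly the keys {sorted(REQUIRED)}")
        try:
            date.fromisoformat(item["due_date"])
        except (TypeError, ValueError):
            raise ValueError(
                f"item {i}: due_date {item['due_date']!r} is not an ISO 8601 date"
            ) from None
    return items


def extract_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> List[dict]:
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        try:
            return validate(raw)
        except ValueError as err:
            # Feed the specific error back so the next attempt can correct it.
            attempt_prompt = (
                f"{prompt}\n\nYour previous answer was rejected: {err}. "
                "Return only the corrected JSON."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Hypothetical stub in place of a real API call: wrong first, right second.
_replies = iter([
    '[{"action": "Review Q3 budget", "due_date": "next Thursday", "assignee": "Alice"}]',
    '[{"action": "Review Q3 budget", "due_date": "2024-06-13", "assignee": "Alice"}]',
])
calls: List[str] = []


def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return next(_replies)


items = extract_with_retry(stub_model, "Extract action items as a JSON array.")
```

Every failed attempt is a full extra round trip, which is where the added latency and cost come from.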
Lessons Learned
- Prompt engineering has a ceiling. No amount of clever phrasing can force a stochastic model to be deterministic. If you need guaranteed structure, you need a structural constraint.
- Validate at generation time, not after. Post-hoc parsing is a losing battle. Each edge case you fix creates two new ones.
- Check what a "structured output" API feature actually guarantees. A JSON-mode flag such as json_object only promises syntactically valid JSON, not your schema. Unless the feature enforces a schema you supply, you still need a library like Outlines or Guidance on top.
- Test with real-world inputs. My synthetic test data looked perfect; real transcripts from non-native speakers broke everything. Always include edge cases like missing dates, multiple people with the same name, etc.
What I'd Do Differently Next Time
I'd start with constrained decoding from day one. The library I used (Outlines) is one option; there's also Guidance by Microsoft and lm-format-enforcer. If I were using a local model through Hugging Face Transformers, I'd pass a custom logits_processor to the .generate() method. For cloud APIs that don't let me mask tokens, I'd batch requests with a retry mechanism that includes a schema-aware error message.
Also, I should have invested more time in schema design earlier. A schema with optional fields and null values is much more robust than one that expects every field to be present.
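As a rough illustration of what "optional fields and null values" means in practice, here is a dependency-free sketch using plain dataclasses. It is not the schema from the tool; with Pydantic the equivalent would be fields typed Optional with a default of None. The helper name `parse_item` is made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str                      # always required
    due_date: Optional[date] = None  # null when the transcript gives no resolvable date
    assignee: Optional[str] = None   # null when nobody was named


def parse_item(obj: dict) -> ActionItem:
    """Build an ActionItem, treating missing keys and explicit nulls the same way."""
    raw_date = obj.get("due_date")
    return ActionItem(
        action=obj["action"],
        due_date=date.fromisoformat(raw_date) if raw_date is not None else None,
        assignee=obj.get("assignee"),
    )


full = parse_item(
    {"action": "Review Q3 budget", "due_date": "2024-06-13", "assignee": "Alice"}
)
partial = parse_item({"action": "Update the dashboard", "due_date": None})
```

Giving the model a legal way to say "I don't know" means a vague phrase like "next Thursday" can come back as null instead of being forced into a field that expects a date.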
I still use prompt engineering for creative tasks, but for any application where the output feeds into a database or automation pipeline, constrained decoding is the only sane choice.
My final setup looks like this: Pydantic models for the schema, Outlines for generation, and a simple FastAPI endpoint. I've reduced JSON errors from 25% to less than 0.1%. And the remaining errors are almost always because of a network timeout, not the model.
What's your setup for getting structured output from LLMs? I'd love to hear how others handle this — especially if you're working with APIs that don't support logit bias.