Every team we talk to has a version of the same story. They built an LLM integration that works well in testing. Then, three weeks into production, something comes back slightly different — the model wraps the JSON in a code block, or uses "status": "Completed" instead of "status": "complete", or includes an extra key that breaks the downstream parser. The whole pipeline falls over.
This post is about how we handle that problem — specifically, how we use structured outputs to get reliable, typed data from LLMs in production Django applications, and where the approach still has limits.
The problem with parsing free-text LLM responses
When you ask an LLM to "return JSON", it usually does. Until it doesn't.
The failure modes are predictable once you've seen them enough times:
- The model wraps the output in a markdown code fence (```json ... ```)
- Field names drift slightly (customer_id vs customerId vs customer id)
- Optional fields are sometimes present, sometimes absent, with no consistency
- The model adds a conversational sentence before or after the JSON
- Numeric fields come back as strings in edge cases
None of this is surprising — the model is a text predictor, not a JSON serialiser. Treating its output as reliable structured data requires you to either enforce structure at generation time, or write defensive parsing code that handles every variant. The second path is a maintenance problem that compounds over time.
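To make the second path concrete, here is a minimal sketch of what defensive parsing tends to look like. The function name `parse_loose_json` is hypothetical, and the sketch covers only two of the failure modes listed above: code fences and conversational text around the JSON.

```python
import json
import re

# Build the fence marker programmatically so it can be used in the pattern.
FENCE = "`" * 3
# Captures the body of a fenced block, with or without a "json" language tag.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_loose_json(text: str) -> dict:
    """Best-effort extraction of one JSON object from free-text model output."""
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)
    # Drop anything before the first "{" or after the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])
```

Field-name drift, missing optional keys, and numbers returned as strings would each need another branch on top of this, which is where the maintenance cost comes from.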
Structured outputs enforce schema at generation time
The cleaner approach is to constrain what the model can generate. OpenAI's structured outputs feature (available since August 2024) lets you pass a JSON schema to the API, and the model's output is constrained to conform to it. No code fences, no stray fields, no type mismatches.
We define our schemas with Pydantic and pass them directly to the API:
from pydantic import BaseModel
from openai import OpenAI
from typing import Literal

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


class ExtractionResult(BaseModel):
    company_name: str
    industry: str
    annual_revenue_usd: int | None  # Key is always present; value may be null
    employee_count: int | None
    confidence: Literal["high", "medium", "low"]
    notes: str


def extract_company_info(raw_text: str) -> ExtractionResult:
    # parse() sends the Pydantic model as the schema and returns a parsed instance
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract structured company information from the provided text. "
                    "Use null for fields you cannot determine with reasonable confidence."
                ),
            },
            {"role": "user", "content": raw_text},
        ],
        response_format=ExtractionResult,
    )
    return response.choices[0].message.parsed
The return value is a proper Pydantic model instance. You can access result.company_name directly, pass result.model_dump() to a Django serializer, or store that dict in a JSONField. It is typed data, not a string you have to parse.
What this looks like in a real Django pipeline
We use this pattern in a document processing pipeline where we extract key fields from uploaded contracts and business documents before routing them for human review.
# models.py
from django.db import models


class Document(models.Model):
    STATUS_CHOICES = [
        ("pending", "Pending"),
        ("processing", "Processing"),
        ("extracted", "Extracted"),
        ("failed", "Failed"),
        ("needs_review", "Needs Review"),
    ]

    file = models.FileField(upload_to="documents/")
    raw_text = models.TextField(blank=True)
    extracted_data = models.JSONField(null=True, blank=True)
    extraction_confidence = models.CharField(max_length=10, blank=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="pending")
    created_at = models.DateTimeField(auto_now_add=True)


# tasks.py (Celery)
from celery import shared_task
from openai import OpenAI
from pydantic import BaseModel
from typing import Literal
import logging

logger = logging.getLogger(__name__)
client = OpenAI()


class ContractExtraction(BaseModel):
    counterparty_name: str
    contract_value_usd: int | None
    start_date: str | None  # ISO 8601
    end_date: str | None  # ISO 8601
    auto_renewal: bool
    governing_law: str | None
    confidence: Literal["high", "medium", "low"]


@shared_task
def extract_document_fields(document_id: int):
    from .models import Document  # Imported lazily to avoid circular imports

    doc = Document.objects.get(id=document_id)
    doc.status = "processing"
    doc.save(update_fields=["status"])

    try:
        response = client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract key fields from this contract. "
                        "Use null for fields not present or unclear. "
                        "Set confidence to 'low' if you are uncertain about any critical field."
                    ),
                },
                # Truncate to 8,000 characters to keep the prompt size bounded
                {"role": "user", "content": doc.raw_text[:8000]},
            ],
            response_format=ContractExtraction,
        )
        result = response.choices[0].message.parsed
        doc.extracted_data = result.model_dump()
        doc.extraction_confidence = result.confidence
        # Low-confidence results go to a human instead of straight through
        doc.status = "needs_review" if result.confidence == "low" else "extracted"
    except Exception as e:
        logger.error(f"Extraction failed for document {document_id}: {e}")
        doc.status = "failed"

    # Persist whichever status and data the branches above set
    doc.save()
The key decision here: low-confidence extractions automatically route to human review. The confidence field is part of the schema — we instruct the model to self-report uncertainty, and we act on it. This is the same principle as our agent designs: the human review path is first-class, not a fallback.
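The routing rule in the task above is small enough to pull into a pure function, which makes it testable without calling the API. This is only a sketch: the name `status_for` is hypothetical, and sending unexpected confidence values to review is an assumption on top of what the task does.

```python
def status_for(confidence: str) -> str:
    """Map the model's self-reported confidence to a Document status."""
    if confidence == "low":
        return "needs_review"
    if confidence in ("medium", "high"):
        return "extracted"
    # Anything unexpected goes to a human rather than being silently accepted.
    return "needs_review"
```

Inside the task, `doc.status = status_for(result.confidence)` would replace the inline conditional.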
Handling refusals
One case structured outputs cannot prevent is a model refusal. If the model decides the input violates its content policy, response.choices[0].message.parsed will be None and response.choices[0].message.refusal will contain the refusal message.
This needs explicit handling:
# Inside extract_document_fields, right after the parse() call
message = response.choices[0].message
if message.refusal:
    logger.warning(f"Model refused extraction for document {document_id}: {message.refusal}")
    doc.status = "needs_review"
    doc.save(update_fields=["status"])
    return
result = message.parsed
In practice, refusals are rare for document extraction tasks. They are more common when you are doing classification or analysis on content that might be flagged — customer support tickets, forum posts, unmoderated user content. If your pipeline processes that kind of input, test refusal handling early.
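One way to test refusal handling early is to exercise the branch with stand-in message objects instead of waiting for a real refusal. The sketch below assumes only that the message exposes `.refusal` and `.parsed`, as in the snippet above. The helper name `early_exit_status` and the stub objects are hypothetical.

```python
from types import SimpleNamespace
from typing import Optional


def early_exit_status(message) -> Optional[str]:
    """Return a status if processing should stop here, or None to continue."""
    if message.refusal:
        return "needs_review"
    if message.parsed is None:
        # No refusal text but nothing parsed either: treat as a failure.
        return "failed"
    return None


# Stand-ins for the SDK's message object, for use in unit tests.
refused = SimpleNamespace(refusal="I can't help with that.", parsed=None)
empty = SimpleNamespace(refusal=None, parsed=None)
ok = SimpleNamespace(refusal=None, parsed={"counterparty_name": "Acme Ltd"})
```

With stubs like these, the refusal path can be covered in an ordinary test suite before any flagged content reaches production.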
Anthropic's equivalent: tool use
If you are using Anthropic's Claude models (which we also use for some tasks), the equivalent mechanism is tool use. You define a tool with a JSON schema, instruct the model to always call it, and get structured output through the tool call rather than the message content.
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

extraction_tool = {
    "name": "extract_contract_fields",
    "description": "Extract structured fields from the contract text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "counterparty_name": {"type": "string"},
            "contract_value_usd": {"type": ["integer", "null"]},
            "start_date": {"type": ["string", "null"]},
            "end_date": {"type": ["string", "null"]},
            "auto_renewal": {"type": "boolean"},
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
        },
        "required": ["counterparty_name", "auto_renewal", "confidence"],
    },
}


def extract_with_claude(raw_text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        tools=[extraction_tool],
        # Force a call to this specific tool instead of a prose answer
        tool_choice={"type": "tool", "name": "extract_contract_fields"},
        messages=[
            {"role": "user", "content": f"Extract fields from this contract:\n\n{raw_text}"}
        ],
    )
    tool_use_block = next(b for b in response.content if b.type == "tool_use")
    # Already a dict. Validate it before storing: a forced tool call does not
    # by itself guarantee that every field matches the schema.
    return tool_use_block.input
The tool_choice parameter forces the model to always call the specified tool rather than choosing to respond in prose. Without it, the model might sometimes call the tool and sometimes answer in text — not useful in a production pipeline.
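Forcing the tool call settles whether the tool is used. As a cheap extra safeguard before writing to the database, the returned dict can also be checked locally. The sketch below mirrors the required fields and enum from the tool schema above using only the standard library. The function name `tool_input_problems` is hypothetical, and a schema validation library would be a reasonable alternative.

```python
from typing import Any, Dict, List

REQUIRED_TYPES = {
    "counterparty_name": str,
    "auto_renewal": bool,
    "confidence": str,
}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def tool_input_problems(data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in REQUIRED_TYPES.items():
        if key not in data:
            problems.append(f"missing required field: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    confidence = data.get("confidence")
    if isinstance(confidence, str) and confidence not in ALLOWED_CONFIDENCE:
        problems.append("confidence is not one of high, medium, low")
    value = data.get("contract_value_usd")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
        problems.append("contract_value_usd should be an integer or null")
    return problems
```

A non-empty result is a natural trigger for the same needs_review path used for low-confidence extractions.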
What structured outputs do not solve
A few things worth being clear about:
They do not fix bad prompts. If your system prompt is vague about what a field should contain, you will get consistent structure but inconsistent semantics. confidence: "high" means whatever the model inferred it means, not whatever you intended. Schema design and prompt design go together.
They do not prevent hallucination. The model can still make up a contract value or misattribute a date. You are getting reliably shaped data — its accuracy still depends on the model's reasoning and the quality of the source text. For high-stakes fields, add a verification step that cross-checks extracted values against source text.
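A verification step can be quite simple. The sketch below is one possible approach rather than a complete check: it flags a counterparty name or contract value that does not literally appear in the source text. The function name `unsupported_fields` is hypothetical, the field names follow the ContractExtraction schema above, and the matching is deliberately naive (no handling of abbreviations, currencies, or amounts written in words).

```python
import re
from typing import Any, Dict, List


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_fields(extracted: Dict[str, Any], source_text: str) -> List[str]:
    """Return names of extracted fields that cannot be found in the source."""
    flagged = []
    haystack = _normalise(source_text)

    name = extracted.get("counterparty_name")
    if isinstance(name, str) and _normalise(name) not in haystack:
        flagged.append("counterparty_name")

    value = extracted.get("contract_value_usd")
    if value is not None:
        # Collect every number in the source, ignoring thousands separators.
        numbers = {tok.replace(",", "") for tok in re.findall(r"\d[\d,]*", source_text)}
        if str(value) not in numbers:
            flagged.append("contract_value_usd")

    return flagged
```

Any flagged field is a candidate for human review, since the model produced a value the document does not visibly support.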
They add latency. Structured output generation with constrained decoding is slightly slower than unconstrained generation. For real-time user-facing features, measure this before committing to the pattern. For background processing pipelines, it generally does not matter.
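Measuring this does not need special tooling. A minimal sketch, assuming you wrap each variant of the call (with and without a schema) in a zero-argument function: the helper name `time_calls` is hypothetical, and a handful of runs gives only a rough comparison.

```python
import statistics
import time
from typing import Callable, Dict


def time_calls(fn: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Call fn repeatedly and report median and worst-case wall-clock time."""
    durations = []
    for _ in range(runs):
        started = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - started)
    return {
        "runs": runs,
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }
```

For example, `time_calls(lambda: extract_company_info(sample_text))` with a representative `sample_text` of your own gives a baseline to compare against an unconstrained call.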
The honest summary
Structured outputs are not exotic — they are just the right default when you need typed data from an LLM. Free-text parsing is a trap that costs you maintenance time and production incidents over the long run.
If you are building an LLM integration that outputs data to a database, an API, or another system: define a Pydantic schema, use response_format, handle refusals, and route low-confidence results to human review. That is the pattern. It is not complicated once you have seen it, but it makes a meaningful difference in how reliably the system runs.
Lycore builds production AI systems for businesses — document intelligence, agents, RAG pipelines, and custom LLM integrations on Django, React, Flutter, and .NET. Get in touch if you want to talk through your use case.