I spent three days fighting with an LLM that refused to output valid JSON. My app needed structured data – user profiles, query parameters, tool calls – and every time the model would hallucinate a trailing comma or wrap things in Markdown code fences.
If you've ever tried to parse LLM output programmatically, you know the pain. Here's what I tried, what didn't work, and the one approach that finally gave me reliable structured data.
The Setup
I was building a personal assistant that takes natural language commands and turns them into API calls. The user says: “Book me a flight to Tokyo next Tuesday” and my system needs to output:
```json
{
  "action": "book_flight",
  "destination": "Tokyo",
  "date": "2025-03-11",
  "passengers": 1
}
```
Simple, right? But every time I sent a prompt like “Return JSON”, the model would give me:
```
Here is the JSON you requested:
{
  "action": "book_flight",
  "destination": "Tokyo",
  "date": "2025-03-11", // this shouldn't be here
  "passengers": 1
}
```
Or sometimes it would wrap the JSON in a ```json ... ``` code fence. My parser would choke, error handling would fire, and the assistant would feel broken.
What I Tried First (and Why It Failed)
Regex Stripping
I wrote a regex to extract anything that looked like a JSON object:
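```python
import json
import re

def extract_json(text):
    # Grab the first {...} span; [^}]+ stops at the first closing brace,
    # so nested objects are cut short
    match = re.search(r'\{[^}]+\}', text)
    if match:
        return json.loads(match.group())
    return None
```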
This worked… until the model decided to include a comment or a trailing comma. json.loads would throw, and my fallback was to try json.loads(match.group().replace(',}', '}')). Fragile and ugly.
Prompt Engineering
I tried the usual tricks:
- “Output only valid JSON, no extra text.”
- “Wrap your JSON in a ```json code block.”
- “Use double quotes for all keys and string values.”
It worked about 80% of the time. But a 20% failure rate is still a broken product. Users don't care about edge cases – they care that the assistant works every time.
JSON Mode (OpenAI / Gemini)
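OpenAI offers a response_format={ "type": "json_object" } parameter, and Google Gemini has an equivalent setting (response_mime_type="application/json" in its generation config). Both constrain the model to emit syntactically valid JSON. I switched to that immediately.
```python
response = client.chat.completions.create(
    model="gpt-4-turbo",  # JSON mode needs a model that supports response_format
    messages=[...],  # OpenAI requires the word "JSON" to appear somewhere in the messages
    response_format={"type": "json_object"},
)
```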
It worked! But – there's always a but – JSON mode doesn't enforce a schema. If I asked for a { "action": "book_flight", "destination": "Tokyo" }, the model might add a "notes": "Remember to pack sunscreen" field. My downstream code only expected those two keys. Suddenly I had unexpected fields, missing keys, or nested objects I didn't ask for.
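A minimal sketch of the kind of check that exposes this problem, using only the standard library. The `EXPECTED_KEYS` set and the `check_keys` name are assumptions for illustration, not part of any SDK:

```python
import json

# Hypothetical: the top-level keys the downstream code relies on
EXPECTED_KEYS = {"action", "destination"}

def check_keys(raw, expected=EXPECTED_KEYS):
    """Parse a JSON-mode reply and reject missing or unexpected top-level keys."""
    data = json.loads(raw)  # JSON mode helps this step succeed; it says nothing about shape
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = expected - set(data)
    extra = set(data) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return data
```

A reply with a stray "notes" field parses fine as JSON but fails this check, which is exactly the gap JSON mode leaves open.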
What Finally Worked: Function Calling / Tool Use
The real breakthrough came when I stopped treating JSON output as a formatting problem and started treating it as a structured generation problem.
OpenAI, Gemini, and Anthropic all support function calling (also called tool use). Instead of asking for JSON, you define a function signature with typed parameters, and the model returns a structured object built to match that schema. How strictly the schema is enforced varies by provider and settings, so it is still worth validating the arguments you get back.
Here's the approach:
- Define a function that represents the output you want.
- Tell the model to call that function, but don't actually execute it – just intercept the call arguments.
```python
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_booking",
            "description": "Extract flight booking details from user request",
            "parameters": {
                "type": "object",
                "properties": {
                    "action": {
                        "type": "string",
                        "enum": ["book_flight", "cancel_flight", "change_flight"],
                    },
                    "destination": {"type": "string"},
                    "date": {"type": "string", "format": "date"},
                    "passengers": {"type": "integer", "minimum": 1},
                },
                "required": ["action", "destination", "date"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Book a flight to Tokyo next Tuesday"}],
    tools=tools,
    # Force a call to extract_booking instead of a free-text reply
    tool_choice={"type": "function", "function": {"name": "extract_booking"}},
)

# The arguments arrive as a JSON string shaped by the schema above
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(args)
# e.g. {'action': 'book_flight', 'destination': 'Tokyo', 'date': '2025-03-11', 'passengers': 1}
```
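That put an end to the invalid JSON, the missing keys, and the surprise fields. Whether this is a hard guarantee depends on the provider and on whether strict schema enforcement is enabled; without it, the model can occasionally drift from the schema.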
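For a hard guarantee on OpenAI, the function definition can opt into strict mode with "strict": True, on models that support it. As far as I know, strict mode expects every property to be listed in "required" and "additionalProperties" to be false, with optional fields expressed as nullable. The helper below is a hedged sketch of that conversion; `to_strict` is a hypothetical name, and keywords such as "format" or "minimum" may or may not be accepted depending on the model, so check the provider documentation:

```python
import copy

def to_strict(parameters):
    """Return a copy of a JSON Schema object adjusted for strict mode.

    Assumption: strict mode wants all properties required and
    additionalProperties set to False; optional fields become nullable.
    """
    schema = copy.deepcopy(parameters)
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    for name, spec in props.items():
        if spec.get("type") == "object":
            # Nested objects need the same treatment
            spec = props[name] = to_strict(spec)
        if name not in required and isinstance(spec.get("type"), str):
            # Optional field: allow null instead of omission
            spec["type"] = [spec["type"], "null"]
    schema["required"] = list(props)
    schema["additionalProperties"] = False
    return schema
```

To use it with the example above, replace the "parameters" value with to_strict(...) of the original schema and add "strict": True next to "name" in the function definition. Fields that used to be optional, such as passengers, then come back as null when the user did not mention them.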
Trade-Offs and Gotchas
- Token overhead: Defining a complex schema can eat tokens. Keep your function schemas lean.
- Model support: Not all models support function calling. GPT-3.5-turbo does, but older models don't. Check documentation.
- Nested objects: You can define nested schemas, but it gets verbose. For deep nesting, I sometimes fall back to JSON mode with a strict system prompt.
- Tool choice: If you set tool_choice: "required", the model must call a function, which can force output even when it doesn't make sense. I only force it when I know the user's intent is structured.
When Not to Use This
If you're just generating freeform text or descriptive content, function calling is overkill. Use a simple JSON mode or even just ask for plain text. Also, if you're using a local model (like Llama 3 via Ollama), function calling may not be supported. In that case, go back to prompt engineering + a robust parser that tries multiple parsing strategies.
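As a rough idea of what a parser that tries multiple strategies could look like, here is a standard-library sketch. The name `parse_loose_json` and the specific strategies (strip a Markdown fence, skip leading chatter, drop trailing commas) are my assumptions about what such a parser would need, based on the failure modes described earlier:

```python
import json
import re

FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)
TRAILING_COMMA_RE = re.compile(r",\s*([}\]])")

def parse_loose_json(text):
    """Try progressively more forgiving strategies; return a dict or None."""
    candidates = [text.strip()]
    fenced = FENCE_RE.search(text)
    if fenced:
        candidates.append(fenced.group(1).strip())  # contents of a Markdown fence
    start = text.find("{")
    if start != -1:
        candidates.append(text[start:])  # skip any chatter before the object
    decoder = json.JSONDecoder()
    for candidate in candidates:
        # Second attempt removes trailing commas; this is naive and would
        # also touch a ",}" sequence inside a string value
        for attempt in (candidate, TRAILING_COMMA_RE.sub(r"\1", candidate)):
            try:
                # raw_decode parses one JSON value and ignores text after it
                value, _ = decoder.raw_decode(attempt)
            except json.JSONDecodeError:
                continue
            if isinstance(value, dict):
                return value
    return None
```

This does not handle `//` comments inside the JSON, so output like the commented example above would still return None and should trigger a retry.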
One project I contributed to – an AI workflow builder – used function calling to generate tool output that could be fed into the next step. The approach was similar to what you'd see at services like ai.interwestinfo.com, but the technique is what matters.
Lessons Learned
- Don't fight the model – work with its strengths. The model was designed to generate text, not JSON. Give it a structure it understands (function calls) instead of a format it struggles with.
- Validate early, fail fast. If you must parse freeform JSON, use a library like pydantic to validate the schema immediately.
- Always have a fallback. Even with function calling, sometimes the model chooses not to call a function. I added a retry loop with a different system prompt.
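The retry fallback from the last point might look roughly like this. It is a sketch, not the exact loop from my project: `call_model` is a hypothetical wrapper you would write around your SDK call, taking a system prompt and returning the tool-call arguments as a JSON string, or None when the model answered without calling the function.

```python
import json

def get_structured(call_model, system_prompts):
    """Try each system prompt in turn until the model returns usable arguments."""
    errors = []
    for prompt in system_prompts:
        raw = call_model(prompt)
        if raw is None:
            errors.append("model did not call the function")
            continue
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc}")
            continue
        if isinstance(args, dict):
            return args
        errors.append("arguments were not a JSON object")
    # Surface every failure reason so the caller can log or alert on it
    raise RuntimeError("all attempts failed: " + "; ".join(errors))
```

Passing the prompts as a list keeps the number of attempts bounded, and collecting the error reasons makes it easier to see which failure mode dominates.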
What I'd Do Differently Next Time
I'd start with function calling from day one. I wasted days on regex and prompt hacks. Also, I'd write integration tests that randomly sample model outputs – because the failure modes are often non-deterministic.
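A sampling test of that kind could be as small as the sketch below. Both function names are hypothetical, and the stub stands in for a real model call so the harness can be exercised offline:

```python
import json
import random

def failure_rate(generate, samples=50):
    """Call generate() repeatedly; return the fraction of outputs that fail to parse."""
    failures = 0
    for _ in range(samples):
        try:
            json.loads(generate())
        except json.JSONDecodeError:
            failures += 1
    return failures / samples

def make_flaky_stub(bad_ratio, seed=0):
    """Stand-in for a model call that emits invalid JSON some of the time."""
    rng = random.Random(seed)
    def generate():
        if rng.random() < bad_ratio:
            return '{"action": "book_flight",}'  # trailing comma: invalid JSON
        return '{"action": "book_flight"}'
    return generate
```

In a real integration test, generate would call the model, and the assertion would be something like failure_rate(generate) staying below a threshold you choose.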
Now I'm curious: How do you handle structured output from LLMs in production? Do you use JSON mode, function calling, or something else like grammar constraints (e.g., with llama.cpp)? Let me know in the comments – I'm always looking to improve my pipeline.