API Gets One Line. Local Gets a Minefield.
OpenAI has response_format={"type": "json_object"}. Anthropic's Claude achieves the same effect through tool use with a defined input schema. Set it, and the output is guaranteed to be syntactically valid JSON. Parse errors simply don't happen.
Local LLMs don't have this. llama.cpp offers --grammar to constrain output to valid JSON syntax, but that only forces the format to be JSON. Whether the content makes sense is a completely different problem.
// API output: as intended
{"name": "Qwen2.5-14B", "speed_tps": 31.5, "vram_gb": 7.3}
// Local LLM (grammar enabled): valid JSON, broken content
{"name": "Qwen2.5-14B", "speed_tps": "fast", "vram_gb": "enough"}
// → Types are wrong. Numbers became strings.
This "format is correct but content is broken" problem gets worse with smaller models. On RTX 4060 8GB, the model size constraint (7B-14B) directly impacts JSON output reliability.
The 3 Failure Patterns
Pattern 1: Failed — JSON Itself Is Broken
// Typical: explanation text wraps around the JSON
Here is the JSON output:
{"name": "Qwen2.5-14B", ...}
I hope this helps!
// → json.loads() parse error
When it happens: Frequent with 7B models without grammar. Enabling --grammar eliminates this completely. 14B+ rarely has this issue even without grammar.
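When grammar isn't available (e.g. an endpoint that doesn't expose it), a salvage step can usually recover the JSON from the prose wrapper before giving up. A minimal sketch — extract_json is a hypothetical helper, and it assumes the output contains exactly one top-level object:

```python
import json

def extract_json(raw: str) -> dict:
    """Pull the first {...} block out of text that wraps JSON in prose."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in output")
    return json.loads(raw[start : end + 1])

wrapped = 'Here is the JSON output:\n{"name": "Qwen2.5-14B"}\nI hope this helps!'
print(extract_json(wrapped))  # → {'name': 'Qwen2.5-14B'}
```

This is a fallback, not a fix — if the model emits two objects, or prose containing a stray brace, it breaks. Grammar remains the real solution for Pattern 1.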
Pattern 2: Broken — Valid JSON, Wrong Content
// Expected: {"speed_tps": 31.5, "vram_gb": 7.3}
// Actual: {"speed_tps": "fast", "vram_gb": "7.3GB"}
// → Wrong types. Strings where numbers should be. Units leak into values.
When it happens: Frequent with 7B, sporadic with 14B. Including a JSON Schema in the prompt dramatically improves this. Grammar alone can't prevent it — the format is valid, just the values are wrong.
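A cheap post-parse guard catches this pattern before bad values propagate downstream. A sketch — check_types is a hypothetical helper that maps each field name to the Python types you expect:

```python
def check_types(parsed: dict, expected: dict) -> list:
    """Return the fields whose values are missing or have the wrong type."""
    return [
        key for key, typ in expected.items()
        if not isinstance(parsed.get(key), typ)
    ]

expected = {"speed_tps": (int, float), "vram_gb": (int, float)}
bad = {"speed_tps": "fast", "vram_gb": "7.3GB"}
print(check_types(bad, expected))  # → ['speed_tps', 'vram_gb']
```

Any non-empty result means the output should be rejected and retried, not silently coerced.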
Pattern 3: Nested Structure Collapse
// Expected: {"items": [{"a": 1}, {"a": 2}]}
// Actual: {"items": [{"a": 1}, {"b": 2}]} // field name changed
// Or: {"items": [{"a": 1}, 2]} // type collapsed mid-array
When it happens: The nastiest pattern. When generating multiple objects inside an array, the first object is correct but subsequent ones drift — field names change, types collapse. This happens even with larger models. The best approach is to not ask the model to generate nested structures at all (see two-stage generation below).
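Drift inside an array is easy to detect mechanically: compare every element's keys and value types against the first element. A sketch with a hypothetical find_drift helper:

```python
def find_drift(items: list) -> list:
    """Indices of array elements that don't match the first element's shape."""
    if not items or not isinstance(items[0], dict):
        return []
    ref = {k: type(v) for k, v in items[0].items()}
    return [
        i for i, item in enumerate(items[1:], start=1)
        if not isinstance(item, dict)
        or {k: type(v) for k, v in item.items()} != ref
    ]

print(find_drift([{"a": 1}, {"b": 2}]))  # → [1]  (field name changed)
print(find_drift([{"a": 1}, 2]))         # → [1]  (type collapsed)
```

This tells you which elements drifted, but regeneration is still the remedy — which is why two-stage generation, below, avoids the problem entirely.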
Grammar: Necessary but Not Sufficient
llama.cpp's --grammar guarantees Pattern 1 (parse errors) goes to zero. But it can't prevent Pattern 2 or 3. Grammar constrains the token sequence format, not semantic correctness.
Grammar is a prerequisite, not a solution.
3 Fixes That Actually Work
Fix 1: Explicit Schema in Prompt
import json

schema = {
    "type": "object",
    "properties": {
        "model": {"type": "string"},
        "speed_tps": {"type": "number"},
        "vram_gb": {"type": "number"}
    },
    "required": ["model", "speed_tps", "vram_gb"]
}

# input_text: the source text you're extracting from
prompt = f"""Output JSON following this schema:
{json.dumps(schema, indent=2)}

Input: {input_text}"""
Giving the model the exact structure upfront dramatically improves field name consistency and type correctness. This works because the schema is in the context when the model generates each token.
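The same schema can then drive validation on the way back out. The jsonschema package does this properly; if you want zero dependencies, a minimal stdlib check like this sketch covers required keys and the three JSON types used here (validate_schema is a hypothetical helper, not a library function):

```python
def validate_schema(parsed: dict, schema: dict) -> bool:
    """Minimal schema check: required keys present, declared types match."""
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    for key in schema.get("required", []):
        if key not in parsed:
            return False
    for key, spec in schema["properties"].items():
        if key in parsed and not isinstance(parsed[key], type_map[spec["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {
        "model": {"type": "string"},
        "speed_tps": {"type": "number"},
        "vram_gb": {"type": "number"}
    },
    "required": ["model", "speed_tps", "vram_gb"]
}

print(validate_schema({"model": "Qwen2.5-14B", "speed_tps": 31.5, "vram_gb": 7.3}, schema))  # → True
print(validate_schema({"model": "Qwen2.5-14B", "speed_tps": "fast", "vram_gb": 7.3}, schema))  # → False
```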
Fix 2: Grammar + Retry
def reliable_json(prompt: str, max_retries: int = 3) -> dict:
    # call_llm: your wrapper around the model server (grammar enabled).
    # validate_schema: your schema check; ValidationError is what it
    # raises on failure (e.g. jsonschema's).
    for attempt in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            if validate_schema(parsed):
                return parsed
        except (json.JSONDecodeError, ValidationError):
            continue
    raise RuntimeError(f"JSON generation failed ({max_retries} attempts)")
Allowing 3 retries massively improves effective success rate. The cost is up to 3x latency — a reasonable tradeoff for pipelines where reliability matters. Measure how many retries YOUR model needs on YOUR tasks.
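Before settling on a retry budget, measure your single-shot success rate directly. A sketch of a harness — measure_success_rate and the flaky mock are illustrative, not part of any library; swap the mock for your real model call:

```python
import json
import random

def measure_success_rate(generate, validate, n_trials: int = 100) -> float:
    """Fraction of single-shot generations that both parse and validate."""
    ok = 0
    for _ in range(n_trials):
        try:
            if validate(json.loads(generate())):
                ok += 1
        except json.JSONDecodeError:
            pass
    return ok / n_trials

# Mock that returns broken output ~30% of the time, to show the harness shape.
random.seed(0)
flaky = lambda: '{"x": 1}' if random.random() > 0.3 else "not json"
rate = measure_success_rate(flaky, lambda d: "x" in d)
print(f"single-shot success: {rate:.0%}")
```

With a measured single-shot rate p, the chance all three attempts fail is (1 - p)**3: a 70% model reaches about 97% with three tries, which is the arithmetic behind "retries massively improve effective success rate".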
Fix 3: Two-Stage Generation (for Nested Structures)
Don't ask the model to build nested JSON in one shot. Generate flat JSON twice and merge.
# Step 1: Extract metadata
meta = call_llm('Output model name only: {"model": "..."}')
# Step 2: Extract array separately
items = call_llm('Output each quantization as JSON array: [{"method": ..., "size_gb": ..., "speed_tps": ...}]')
# Step 3: Merge in code
result = {**json.loads(meta), "quantizations": json.loads(items)}
Two flat generations merged in code is dramatically more stable than one nested generation. For 7B models needing nested output, this is effectively the only practical option.
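End to end, the pattern looks like this — with call_llm stubbed out so the merge logic is runnable as-is (the stub's canned outputs and prompts are illustrative; swap in your real client):

```python
import json

def call_llm(prompt: str) -> str:
    """Stub standing in for the model server, for demonstration only."""
    if "array" in prompt:
        return '[{"method": "Q4_K_M", "size_gb": 8.6, "speed_tps": 31.5}]'
    return '{"model": "Qwen2.5-14B"}'

meta = call_llm('Output model name only: {"model": "..."}')
items = call_llm('Output each quantization as a JSON array')

# The nesting happens in code, never in the model's output.
result = {**json.loads(meta), "quantizations": json.loads(items)}
print(result["model"], len(result["quantizations"]))  # → Qwen2.5-14B 1
```

Each model call only ever produces a flat object or a flat array of identical objects — exactly the shapes that Patterns 2 and 3 hit least.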
Model Size Decision Guide
- High reliability (payments, medical): 32B + grammar + retry
  → Doesn't fit 8GB. Use an API.
- Standard (RAG, analysis): 14B + grammar + schema + retry
  → Optimal for RTX 4060 8GB
- Lightweight (log extraction, classification): 7B + grammar + two-stage
  → Practical if you stick to flat JSON
- Nested structures required: 14B+ with two-stage generation
  → 7B can't do this reliably
Specific success rates depend on YOUR environment. Copy the code above, run it with YOUR model and YOUR tasks. Those numbers are your real reliability. Don't trust anyone else's benchmarks, including this article's.
References
- llama.cpp grammar documentation: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
- llama.cpp server API: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md