API Gets One Line. Local Gets a Minefield.
OpenAI has response_format={"type": "json_object"}. Anthropic's Claude achieves the same effect through tool use with a defined input schema. Set it, and the output is guaranteed to be syntactically valid JSON. Parse errors simply don't happen.
Local LLMs don't have this. llama.cpp offers --grammar to constrain output to valid JSON syntax, but that only forces the format to be JSON. Whether the content makes sense is a completely different problem.
// API output: as intended
{"name": "Qwen2.5-14B", "speed_tps": 31.5, "vram_gb": 7.3}
// Local LLM (grammar enabled): valid JSON, broken content
{"name": "Qwen2.5-14B", "speed_tps": "fast", "vram_gb": "enough"}
// → Types are wrong. Numbers became strings.
This "format is correct but content is broken" problem gets worse with smaller models. On RTX 4060 8GB, the model size constraint (7B-14B) directly impacts JSON output reliability.
The 3 Failure Patterns
Pattern 1: Failed — JSON Itself Is Broken
// Typical: explanation text wraps around the JSON
Here is the JSON output:
{"name": "Qwen2.5-14B", ...}
I hope this helps!
// → json.loads() parse error
When it happens: Frequent with 7B models without grammar. Enabling --grammar eliminates this completely. 14B+ rarely has this issue even without grammar.
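When grammar isn't available (e.g. an endpoint that doesn't expose it), a salvage step can usually recover the JSON from the prose wrapper before giving up. A minimal sketch — extract_json is a hypothetical helper, and it assumes the output contains exactly one top-level object:

```python
import json

def extract_json(raw: str) -> dict:
    """Pull the first {...} block out of text that wraps JSON in prose."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in output")
    return json.loads(raw[start : end + 1])

wrapped = 'Here is the JSON output:\n{"name": "Qwen2.5-14B"}\nI hope this helps!'
print(extract_json(wrapped))  # → {'name': 'Qwen2.5-14B'}
```

This is a fallback, not a fix — if the model emits two objects, or prose containing a stray brace, it breaks. Grammar remains the real solution for Pattern 1.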
Pattern 2: Broken — Valid JSON, Wrong Content
// Expected: {"speed_tps": 31.5, "vram_gb": 7.3}
// Actual: {"speed_tps": "fast", "vram_gb": "7.3GB"}
// → Wrong types. Strings where numbers should be. Units leak into values.
When it happens: Frequent with 7B, sporadic with 14B. Including a JSON Schema in the prompt dramatically improves this. Grammar alone can't prevent it — the format is valid, just the values are wrong.
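A cheap post-parse guard catches this pattern before bad values propagate downstream. A sketch — check_types is a hypothetical helper that maps each field name to the Python types you expect:

```python
def check_types(parsed: dict, expected: dict) -> list:
    """Return the fields whose values are missing or have the wrong type."""
    return [
        key for key, typ in expected.items()
        if not isinstance(parsed.get(key), typ)
    ]

expected = {"speed_tps": (int, float), "vram_gb": (int, float)}
bad = {"speed_tps": "fast", "vram_gb": "7.3GB"}
print(check_types(bad, expected))  # → ['speed_tps', 'vram_gb']
```

Any non-empty result means the output should be rejected and retried, not silently coerced.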
Pattern 3: Nested Structure Collapse
// Expected: {"items": [{"a": 1}, {"a": 2}]}
// Actual: {"items": [{"a": 1}, {"b": 2}]} // field name changed
// Or: {"items": [{"a": 1}, 2]} // type collapsed mid-array
When it happens: The nastiest pattern. When generating multiple objects inside an array, the first object is correct but subsequent ones drift — field names change, types collapse. This happens even with larger models. The best approach is to not ask the model to generate nested structures at all (see two-stage generation below).
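Drift inside an array is easy to detect mechanically: compare every element's keys and value types against the first element. A sketch with a hypothetical find_drift helper:

```python
def find_drift(items: list) -> list:
    """Indices of array elements that don't match the first element's shape."""
    if not items or not isinstance(items[0], dict):
        return []
    ref = {k: type(v) for k, v in items[0].items()}
    return [
        i for i, item in enumerate(items[1:], start=1)
        if not isinstance(item, dict)
        or {k: type(v) for k, v in item.items()} != ref
    ]

print(find_drift([{"a": 1}, {"b": 2}]))  # → [1]  (field name changed)
print(find_drift([{"a": 1}, 2]))         # → [1]  (type collapsed)
```

This tells you which elements drifted, but regeneration is still the remedy — which is why two-stage generation, below, avoids the problem entirely.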
Grammar: Necessary but Not Sufficient
llama.cpp's --grammar guarantees Pattern 1 (parse errors) goes to zero. But it can't prevent Pattern 2 or 3. Grammar constrains the token sequence format, not semantic correctness.
Grammar is a prerequisite, not a solution.
3 Fixes That Actually Work
Fix 1: Explicit Schema in Prompt
import json

schema = {
    "type": "object",
    "properties": {
        "model": {"type": "string"},
        "speed_tps": {"type": "number"},
        "vram_gb": {"type": "number"}
    },
    "required": ["model", "speed_tps", "vram_gb"]
}

# input_text: the source text you're extracting from
prompt = f"""Output JSON following this schema:
{json.dumps(schema, indent=2)}

Input: {input_text}"""
Giving the model the exact structure upfront dramatically improves field name consistency and type correctness. This works because the schema is in the context when the model generates each token.
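The same schema can then drive validation on the way back out. The jsonschema package does this properly; if you want zero dependencies, a minimal stdlib check like this sketch covers required keys and the three JSON types used here (validate_schema is a hypothetical helper, not a library function):

```python
def validate_schema(parsed: dict, schema: dict) -> bool:
    """Minimal schema check: required keys present, declared types match."""
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    for key in schema.get("required", []):
        if key not in parsed:
            return False
    for key, spec in schema["properties"].items():
        if key in parsed and not isinstance(parsed[key], type_map[spec["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {
        "model": {"type": "string"},
        "speed_tps": {"type": "number"},
        "vram_gb": {"type": "number"}
    },
    "required": ["model", "speed_tps", "vram_gb"]
}

print(validate_schema({"model": "Qwen2.5-14B", "speed_tps": 31.5, "vram_gb": 7.3}, schema))  # → True
print(validate_schema({"model": "Qwen2.5-14B", "speed_tps": "fast", "vram_gb": 7.3}, schema))  # → False
```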
Fix 2: Grammar + Retry
def reliable_json(prompt: str, max_retries: int = 3) -> dict:
    # call_llm: your wrapper around the model server (grammar enabled).
    # validate_schema: your schema check; ValidationError is what it
    # raises on failure (e.g. jsonschema's).
    for attempt in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            if validate_schema(parsed):
                return parsed
        except (json.JSONDecodeError, ValidationError):
            continue
    raise RuntimeError(f"JSON generation failed ({max_retries} attempts)")
Allowing 3 retries massively improves effective success rate. The cost is up to 3x latency — a reasonable tradeoff for pipelines where reliability matters. Measure how many retries YOUR model needs on YOUR tasks.
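Before settling on a retry budget, measure your single-shot success rate directly. A sketch of a harness — measure_success_rate and the flaky mock are illustrative, not part of any library; swap the mock for your real model call:

```python
import json
import random

def measure_success_rate(generate, validate, n_trials: int = 100) -> float:
    """Fraction of single-shot generations that both parse and validate."""
    ok = 0
    for _ in range(n_trials):
        try:
            if validate(json.loads(generate())):
                ok += 1
        except json.JSONDecodeError:
            pass
    return ok / n_trials

# Mock that returns broken output ~30% of the time, to show the harness shape.
random.seed(0)
flaky = lambda: '{"x": 1}' if random.random() > 0.3 else "not json"
rate = measure_success_rate(flaky, lambda d: "x" in d)
print(f"single-shot success: {rate:.0%}")
```

With a measured single-shot rate p, the chance all three attempts fail is (1 - p)**3: a 70% model reaches about 97% with three tries, which is the arithmetic behind "retries massively improve effective success rate".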
Fix 3: Two-Stage Generation (for Nested Structures)
Don't ask the model to build nested JSON in one shot. Generate flat JSON twice and merge.
# Step 1: Extract metadata
meta = call_llm('Output model name only: {"model": "..."}')
# Step 2: Extract array separately
items = call_llm('Output each quantization as JSON array: [{"method": ..., "size_gb": ..., "speed_tps": ...}]')
# Step 3: Merge in code
result = {**json.loads(meta), "quantizations": json.loads(items)}
Two flat generations merged in code is dramatically more stable than one nested generation. For 7B models needing nested output, this is effectively the only practical option.
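End to end, the pattern looks like this — with call_llm stubbed out so the merge logic is runnable as-is (the stub's canned outputs and prompts are illustrative; swap in your real client):

```python
import json

def call_llm(prompt: str) -> str:
    """Stub standing in for the model server, for demonstration only."""
    if "array" in prompt:
        return '[{"method": "Q4_K_M", "size_gb": 8.6, "speed_tps": 31.5}]'
    return '{"model": "Qwen2.5-14B"}'

meta = call_llm('Output model name only: {"model": "..."}')
items = call_llm('Output each quantization as a JSON array')

# The nesting happens in code, never in the model's output.
result = {**json.loads(meta), "quantizations": json.loads(items)}
print(result["model"], len(result["quantizations"]))  # → Qwen2.5-14B 1
```

Each model call only ever produces a flat object or a flat array of identical objects — exactly the shapes that Patterns 2 and 3 hit least.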
Model Size Decision Guide
- High reliability (payments, medical): 32B + grammar + retry
  → Doesn't fit 8GB. Use an API.
- Standard (RAG, analysis): 14B + grammar + schema + retry
  → Optimal for RTX 4060 8GB
- Lightweight (log extraction, classification): 7B + grammar + two-stage
  → Practical if you stick to flat JSON
- Nested structures required: 14B+ with two-stage generation
  → 7B can't do this reliably
Specific success rates depend on YOUR environment. Copy the code above, run it with YOUR model and YOUR tasks. Those numbers are your real reliability. Don't trust anyone else's benchmarks, including this article's.
References
- llama.cpp grammar documentation: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
- llama.cpp server API: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md