Gabriel Anhaia

# JSON Mode Still Fails. Here's the Recovery Ladder


It's 3:14 a.m. Your on-call engineer is staring at a stack trace from a job that's been green for six weeks. The model returned {"name": "Alice", "age": 3 and json.loads exploded. No closing brace. No closing quote. finish_reason: "length". Somebody set max_tokens=256 in 2025 and the latest schema needed 312.

JSON mode was supposed to fix this. It didn't, and the OpenAI community thread on truncation past max_tokens has been open for two years. Even with strict structured outputs, the 2026 consensus is the same: json_object mode guarantees parseable JSON syntax. It says nothing about your schema. Schema drift, truncation, and leading prose still take down production extractors. A single try/except won't catch all three. You need a recovery ladder.

Why "JSON mode" still fails

Three failure modes account for almost everything you'll see in logs.

**Truncation at max_tokens.** The model is mid-object when the budget runs out. You get finish_reason: "length" and a string that ends in a comma, an open quote, or half a number. Worse, with response_format set to json_object, the API still returns the partial string. It does not backfill closing braces for you. Azure's SDK repo has an open issue reporting the same pattern: generation breaking past 1024 tokens on GPT-4-Turbo.
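
One practical guard is to check finish_reason before parsing. A minimal sketch against the OpenAI Python SDK; the wrapper name and token budget here are illustrative, not the article's:

```python
from openai import OpenAI

client = OpenAI()


def call_checked(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        max_tokens=256,
    )
    choice = r.choices[0]
    # "length" means the token budget ran out mid-generation;
    # the partial string is not worth handing to json.loads.
    if choice.finish_reason == "length":
        raise RuntimeError("output truncated at max_tokens")
    return choice.message.content
```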

**Schema drift.** Your prompt says "return email," the model returns email_address. Or your enum has five values, the model returns a sixth ("UNKNOWN"). Strict-mode json_schema covers most of this, but not on legacy models, not on every provider, and not when middleware libraries silently strip fields between versions.
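
Where the provider supports it, strict mode pins field names and enum values at decode time. A sketch of the request shape on the OpenAI chat completions API, reusing the client from the sketch above; note that strict mode requires additionalProperties: false and every field listed under required:

```python
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "role": {"type": "string", "enum": ["admin", "member", "guest"]},
    },
    "required": ["name", "age", "role"],  # strict mode: every field required
    "additionalProperties": False,        # strict mode: no extra fields
}

r = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": "..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "strict": True, "schema": PERSON_SCHEMA},
    },
)
```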

**Leading prose.** The model says "Sure, here's the JSON you asked for:" and your parser chokes on "Sure". This is the easiest to fix and the easiest to forget when you're debugging the other two.

A working extractor handles all three with the same control flow. That's the ladder.

## The recovery ladder, rung 1: pydantic validation

Four rungs, in order. Stop climbing as soon as one rung succeeds.

1. **Validate** the raw output with a pydantic model. Done in 95% of calls.
2. **Repair** via a second LLM pass that sees the broken output and the error.
3. **Re-extract** with a stripped-down structured-extraction prompt.
4. **Escalate** to a human queue with the original input, output, and error chain.

The discipline is to set hard caps: one repair attempt, one re-extraction. No infinite retry loops. No exponential backoff that masks a prompt regression for two weeks.

Rung 1 itself is the cheap path. Define the shape, parse the string, validate. Treat every model output as untrusted input.



```python
from pydantic import BaseModel, Field, ValidationError
from typing import Literal
import json


class Person(BaseModel):
    name: str = Field(min_length=1, max_length=120)
    age: int = Field(ge=0, le=130)
    role: Literal["admin", "member", "guest"]


def parse_person(raw: str) -> Person:
    data = json.loads(raw)
    return Person.model_validate(data)
```

The Field constraints catch the schema-drift failures that json_object mode sails past. An age: -3 parses as JSON. It does not validate as a Person. A role: "owner" parses, validates against str, and dies on the Literal, which is exactly what you want.
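
For illustration, the two failure classes surface as different exceptions (exact messages vary by pydantic version):

```python
parse_person('{"name": "Alice", "age": 30, "role": "owner"}')
# pydantic.ValidationError: role is not one of the three Literal values

parse_person('{"name": "Alice", "age": 3')
# json.JSONDecodeError: truncated output never reaches pydantic
```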

When this raises json.JSONDecodeError or pydantic.ValidationError, you climb to rung 2. Capture both the raw string and the exception message; the repair prompt needs both.

## Rung 2: one-shot repair pass

The repair prompt is short. It shows the model the broken output, the validation error, and the schema. It asks for corrected JSON. Nothing else.

```python
from openai import OpenAI

client = OpenAI()

REPAIR_PROMPT = """The previous JSON output failed validation.

Schema (pydantic):
{schema}

Broken output:
{broken}

Validation error:
{error}

Return ONLY the corrected JSON object. No prose,
no markdown fences, no commentary."""
```

The call itself stays minimal. Low temperature, generous max_tokens, JSON mode on.

```python
def repair(broken: str, error: str, schema: str) -> str:
    msg = REPAIR_PROMPT.format(
        schema=schema, broken=broken, error=error
    )
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": msg}],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=1024,
    )
    return r.choices[0].message.content
```

Two details that matter in production. First, set temperature=0 on the repair pass. You want the model to fix the error. Creativity is a regression here. Second, give the repair pass at least 4× the max_tokens of the original call. Truncation is the failure you're recovering from; do not recreate it.

This single rung resolves the majority of real failures: trailing commas, missing fields, leading "Sure, here's the JSON," and most truncation cases where the original budget was just slightly too tight.

## Rung 3: structured-extraction fallback

If repair fails, the original prompt is probably the problem. Schema drift, an ambiguous instruction, a model update that shifted formatting. Rung 3 throws away the original prompt and runs a clean extraction over the source text.

```python
EXTRACT_PROMPT = """Extract a Person object from the text below.

Required fields:
- name: full name (string, 1-120 chars)
- age: integer 0-130
- role: one of "admin", "member", "guest"

If a field is missing, set it to null.
Return only the JSON object.

Text:
{source}"""
```

Use a different prompt. A retry on the original gets you the same broken output if the regression lives in the prompt itself — a new instruction, a tool-use snippet, a system-message change. The fallback runs a deliberately boring extraction that's been frozen for months.

```python
def extract(source: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[
            {"role": "user",
             "content": EXTRACT_PROMPT.format(source=source)},
        ],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=1024,
    )
    return r.choices[0].message.content
```

Treat this fallback prompt as a fixed asset. Version it. Snapshot its eval scores quarterly. When it starts failing, you've found a model regression worth a vendor ticket.
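
One lightweight way to enforce that discipline, sketched here as one option rather than a prescription: fingerprint the frozen prompt and log the fingerprint with every rung-3 call, so a silent edit shows up in the logs as a version change. The constant name is illustrative:

```python
import hashlib

# Any change to the frozen prompt changes this fingerprint.
EXTRACT_PROMPT_VERSION = hashlib.sha256(
    EXTRACT_PROMPT.encode()
).hexdigest()[:12]
```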

## Rung 4 and tying the ladder together

If rung 3 fails, stop. Push the case to a human queue with the original input, every intermediate output, and every error. Do not silently return None. Do not return a default Person(name="unknown", age=0, role="guest"). Both behaviors hide the failure rate from your dashboards and let a 2% bad-output rate compound into a downstream data-quality incident.

The escalation queue is also your training data for whatever extractor replaces the current one. Every escalated case is a labeled hard example.
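
The escalate() call used in the orchestrator below isn't spelled out anywhere else, so here's a minimal sketch, assuming a JSONL file stands in for your real review queue:

```python
import json
from datetime import datetime, timezone


def escalate(source: str, raw: str, error: str) -> None:
    # Append the failed case, with full context, to the review queue.
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "raw_output": raw,
        "error": error,
    }
    with open("escalations.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```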

The orchestrator is plain control flow. No frameworks needed.

```python
import logging

log = logging.getLogger(__name__)


def run_extraction(source: str) -> Person | None:
    raw = call_primary(source)  # your existing primary extraction call
    try:
        return parse_person(raw)
    except (json.JSONDecodeError, ValidationError) as e1:
        log.warning("rung1 failed: %s", e1)
        err1 = str(e1)  # except-block variables don't survive the block
```

Then the repair attempt. Same try/except shape, different inputs.

```python
    try:
        repaired = repair(
            broken=raw,
            error=err1,
            schema=json.dumps(Person.model_json_schema()),  # repair() takes a string
        )
        return parse_person(repaired)
    except (json.JSONDecodeError, ValidationError) as e2:
        log.warning("rung2 failed: %s", e2)
```

Then the structured-extraction fallback, then the escalation.

```python
    try:
        extracted = extract(source)
        return parse_person(extracted)
    except (json.JSONDecodeError, ValidationError) as e3:
        log.error("rung3 failed: %s", e3)
        escalate(source=source, raw=raw, error=str(e3))
        return None
```

Each rung gets one attempt. The whole flow caps at three model calls. Latency budget is bounded; cost is bounded; the failure mode is observable.

The metric to put on a dashboard is rung-distribution: what percentage of calls succeed at rung 1, 2, 3, and 4. A healthy extractor sits at 95/4/0.8/0.2. When rung 2 traffic doubles overnight, your prompt has drifted or your provider just shipped a quiet model update. When rung 4 climbs above 1%, the schema or the upstream data has changed and humans need to look. You get a week of warning before the on-call page.
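
A plain counter is enough to start. A sketch, assuming you export it to your real metrics system elsewhere; the helper names are illustrative:

```python
from collections import Counter

rung_counts: Counter = Counter()


def record_rung(rung: str) -> None:
    rung_counts[rung] += 1


def rung_distribution() -> dict:
    # Percentage of calls resolved at each rung.
    total = sum(rung_counts.values()) or 1
    return {k: 100 * v / total for k, v in rung_counts.items()}
```

Call record_rung("rung1") on a successful first parse, "rung2" after a successful repair, and so on, with "rung4" when escalate() fires.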

## Where this breaks

Three places, in case you're wondering before you ship.

The repair prompt itself can hallucinate. If the broken output is {"name": "Alice" and the schema demands age and role, the repair pass will invent values to satisfy validation. Mitigate by inspecting model_dump() against the original input; for high-stakes extractions (legal, medical, financial), require that every output field appears verbatim in the source.
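
A sketch of that verbatim check, assuming free-text string fields only; enum-constrained fields like role are already pinned by the Literal and need an allowlist instead:

```python
ENUM_FIELDS = {"role"}  # already constrained by the Literal


def fields_verbatim(person: Person, source: str) -> bool:
    # Reject repaired outputs whose free-text fields never appear
    # in the source text: values likely invented to pass validation.
    return all(
        value in source
        for field, value in person.model_dump().items()
        if isinstance(value, str) and field not in ENUM_FIELDS
    )
```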

The structured-extraction fallback can succeed loudly while being wrong. Pydantic validation does not check that the fields are correct, only that they're well-typed. Pair the ladder with a sampled human review of rung-3 outputs.

And the recovery ladder is not a substitute for evals. If your primary prompt's rung-1 success rate drops from 98% to 91%, the ladder will absorb the failure and your latency p95 will climb, but your output quality will already be degrading. Watch the rung distribution like you watch error rates.

## If this was useful

The ladder above is one of the patterns the Prompt Engineering Pocket Guide covers in the chapter on structured outputs — alongside the diff between json_object and json_schema modes, the prompts that hold up across model upgrades, and the failure-mode taxonomy you can hand a junior engineer on day one. If your extraction pipelines are quietly accumulating try/except scaffolding, it's the book for you.

Prompt Engineering Pocket Guide
