LLM output validation: 5 patterns that actually work in production

#programming #ai #llm #python

LLMs are probabilistic text generators. In a notebook demo, that's fine. In production, it means your pipeline will occasionally receive a Python-style dict literal (single-quoted keys, None instead of null) where you expected JSON, a 900-word paragraph where you asked for three bullet points, or a hallucinated field name that breaks your downstream schema. This post is not about theory. It covers five concrete patterns, each with working code, that handle these failures.

The core problem

You're calling an LLM API expecting structured output. The model has been prompted carefully. But over thousands of calls, you'll see:

Malformed JSON (trailing commas, unquoted keys, markdown code fences wrapping the payload)
Responses that exceed or fall short of length constraints
Fields that exist in the schema but contain garbage ("confidence": "I am quite certain")
Duplicate entries in batch completions
The model evaluating its own output charitably when asked to self-check

Each pattern below addresses one failure mode.
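
To make the first failure mode concrete, here is a minimal sketch of how Python's standard json.loads reacts to each malformed variant listed above. The sample strings are invented for illustration and are not real model output; parse_outcome is a hypothetical helper name.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this block readable

# Hypothetical model outputs, one per malformed-JSON variant.
SAMPLES = {
    "trailing comma": '{"title": "Intro to regex",}',
    "unquoted key": '{title: "Intro to regex"}',
    "markdown fence": FENCE + 'json\n{"title": "Intro to regex"}\n' + FENCE,
    "python dict": "{'title': 'Intro to regex'}",
}

def parse_outcome(text: str) -> str:
    """Return 'ok' if text is valid JSON, otherwise the parser's error message."""
    try:
        json.loads(text)
        return "ok"
    except json.JSONDecodeError as e:
        return f"error: {e.msg}"

outcomes = {name: parse_outcome(text) for name, text in SAMPLES.items()}
for name, outcome in outcomes.items():
    print(f"{name:>15}: {outcome}")
```

All four samples are rejected by the strict parser, which is why the patterns below validate and recover rather than trust the raw response.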

import json
import re
import time
import hashlib
from openai import OpenAI

llm_client = OpenAI(
    api_key="your_api_key",
    base_url="https://api.your-llm-provider.com/v1",
)

def call_llm(messages: list[dict], model: str = "gpt-4o-mini",
             temperature: float = 0.3, max_tokens: int = 1000) -> str:
    response = llm_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # content can be None (for example on a refusal), so guard before strip()
    return (response.choices[0].message.content or "").strip()

Pattern 1: JSON schema validation with retry

Problem: The model returns valid JSON 98% of the time and something subtly broken the other 2%. Your parser crashes and you lose the request.

Bad solution: json.loads() with a bare except that returns None. You swallow errors silently and downstream code explodes later.

Good solution: Parse, validate against a schema, and retry with an error hint that tells the model exactly what went wrong.

import jsonschema

ARTICLE_SCHEMA = {
    "type": "object",
    "required": ["title", "summary", "tags", "difficulty"],
    "properties": {
        "title": {"type": "string", "minLength": 10, "maxLength": 120},
        "summary": {"type": "string", "minLength": 50},
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "difficulty": {"type": "string", "enum": ["beginner", "intermediate", "advanced"]},
    },
    "additionalProperties": False,
}

def extract_json_from_response(text: str) -> str:
    """Strip markdown code fences if present."""
    match = re.search(r"```(?:json)?\s*([\s\S]*?)```", text)
    if match:
        return match.group(1).strip()
    # Try to find raw JSON object
    match = re.search(r"\{[\s\S]*\}", text)
    if match:
        return match.group(0)
    return text

def call_with_json_schema(prompt: str, schema: dict,
                           max_retries: int = 3) -> dict:
    messages = [
        {"role": "system", "content": (
            "You are a data extraction assistant. "
            "Always respond with valid JSON matching the requested schema. "
            "No prose, no markdown fences, just the JSON object."
        )},
        {"role": "user", "content": prompt},
    ]

    last_error = None
    for attempt in range(max_retries):
        raw = call_llm(messages)
        json_str = extract_json_from_response(raw)

        try:
            data = json.loads(json_str)
            jsonschema.validate(instance=data, schema=schema)
            return data
        except json.JSONDecodeError as e:
            last_error = f"JSON parse error: {e}. Raw output was: {json_str[:200]}"
        except jsonschema.ValidationError as e:
            last_error = f"Schema validation failed: {e.message}"

        # Append error feedback and retry
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": (
            f"That response had an error: {last_error}\n"
            "Please fix it and return only the corrected JSON."
        )})
        time.sleep(0.5 * (attempt + 1))  # back off slightly

    raise ValueError(f"Failed after {max_retries} attempts. Last error: {last_error}")
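
The retry-with-feedback loop can be exercised without a network call. The sketch below is a simplified, dependency-free variant: it swaps jsonschema for a hand-rolled check and the real model for a stub that replays canned replies. The names retry_with_feedback, make_stub, and require_title are hypothetical, introduced only for this illustration.

```python
import json
from typing import Callable

def retry_with_feedback(call_model: Callable[[list], str],
                        validate: Callable[[dict], None],
                        prompt: str, max_retries: int = 3) -> dict:
    """Parse and validate; on failure, feed the error back and try again."""
    messages = [{"role": "user", "content": prompt}]
    last_error = ""
    for _ in range(max_retries):
        raw = call_model(messages)
        try:
            data = json.loads(raw)
            validate(data)
            return data
        except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
            last_error = str(e)
        # Show the model its own output plus the exact error.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": (
            f"That response had an error: {last_error}\n"
            "Return only the corrected JSON."
        )})
    raise ValueError(f"Failed after {max_retries} attempts: {last_error}")

def make_stub(replies: list):
    """Hypothetical stand-in for the model: replays canned replies in order."""
    remaining = iter(replies)
    seen_lengths = []  # conversation length at each call
    def stub(messages: list) -> str:
        seen_lengths.append(len(messages))
        return next(remaining)
    return stub, seen_lengths

def require_title(data: dict) -> None:
    """Hand-rolled check standing in for jsonschema.validate."""
    if not isinstance(data, dict) or "title" not in data:
        raise ValueError("missing required field 'title'")

stub, seen = make_stub([
    '{"title": "x",}',                      # attempt 1: trailing comma
    '{"name": "x"}',                        # attempt 2: wrong field name
    '{"title": "Validating LLM output"}',   # attempt 3: valid
])
result = retry_with_feedback(stub, require_title, "Summarize the article as JSON.")
print(result, seen)
```

The conversation grows by two messages per failed attempt, so each retry costs more tokens than the last. That is one reason to keep max_retries small.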

Pattern 2: Length constraint enforcement

Problem: You ask for a 2-sentence summary and get a paragraph. Or you ask for 500 words and get 80. Downstream rendering breaks.

Bad solution: Truncate with response[:500]. You cut mid-sentence and produce garbage.

Good solution: Measure, then retry with a correction hint that quantifies the delta.

def count_words(text: str) -> int:
    return len(text.split())

def call_with_length_constraint(prompt: str, min_words: int, max_words: int,
                                 max_retries: int = 3) -> str:
    messages = [
        {"role": "system", "content": (
            f"Write responses between {min_words} and {max_words} words. "
            "Count carefully before submitting."
        )},
        {"role": "user", "content": prompt},
    ]

    for attempt in range(max_retries):
        response = call_llm(messages, max_tokens=max_words * 2)
        word_count = count_words(response)

        if min_words <= word_count <= max_words:
            return response

        delta = word_count - max_words if word_count > max_words else min_words - word_count
        direction = "shorter" if word_count > max_words else "longer"
        hint = (
            f"Your response was {word_count} words. "
            f"It needs to be {abs(delta)} words {direction}. "
            f"Target: {min_words}–{max_words} words. Rewrite it."
        )
        messages.append({"role": "assistant", "content": response})
        messages.append({"role": "user", "content": hint})

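    # Last resort: one final attempt (max_retries + 1 calls in total), then
    # hard-truncate at a word boundary if it is still too long. This can cut
    # mid-sentence; a response that is still too short is returned unchanged.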
    final = call_llm(messages, max_tokens=max_words * 2)
    words = final.split()
    if len(words) > max_words:
        return " ".join(words[:max_words])
    return final
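
The last-resort truncation above still cuts at an arbitrary word. If that matters for your rendering, one option is to cut at the last sentence boundary that fits the budget. This is a minimal sketch, assuming sentences end in ".", "!" or "?" followed by whitespace (which misfires on abbreviations like "e.g."); truncate_at_sentence is a hypothetical helper, not part of the code above.

```python
import re

def truncate_at_sentence(text: str, max_words: int) -> str:
    """Keep whole sentences while the running word count stays within max_words."""
    # Split after sentence-ending punctuation; naive, but dependency-free.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    total = 0
    for sentence in sentences:
        n = len(sentence.split())
        if total + n > max_words:
            break
        kept.append(sentence)
        total += n
    if not kept:
        # The first sentence alone is over budget: fall back to a word cut.
        return " ".join(text.split()[:max_words])
    return " ".join(kept)

text = "LLMs drift. They ignore length limits often. Retry first, truncate last."
short = truncate_at_sentence(text, 8)
print(short)
```

With a budget of 8 words, the first two sentences (7 words) are kept and the third is dropped whole.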

Pattern 3: Regex-based field extraction as fallback

Problem: The model consistently wraps values in prose ("The severity is: HIGH") instead of returning a clean value. JSON parsing fails; you can't proceed.

Good solution: Regex extraction as a structured fallback — not a replacement for JSON, but a recovery layer when JSON fails.

FIELD_PATTERNS = {
    "severity": r"\b(LOW|MEDIUM|HIGH|CRITICAL)\b",
    "score": r"\b(\d+(?:\.\d+)?)\s*(?:/\s*10)?",
    "category": r"\b(spam|phishing|legitimate|malware|unknown)\b",
    "confidence": r"confidence[:\s]+(\d+(?:\.\d+)?)%?",
}

def extract_fields_with_regex(text: str,
                               fields: list[str]) -> dict:
    """
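    Attempt to extract structured fields from prose output using regex.
    Returns a dict mapping each requested field to the matched string
    (e.g. "87" for confidence, so convert numeric fields before comparing),
    or to None when the field cannot be extracted.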
    """
    result = {}
    text_upper = text.upper()

    for field in fields:
        pattern = FIELD_PATTERNS.get(field)
        if not pattern:
            result[field] = None
            continue

        match = re.search(pattern, text_upper if field == "severity" else text,
                          re.IGNORECASE)
        result[field] = match.group(1) if match else None

    return result

def classify_with_fallback(text_to_classify: str) -> dict:
    prompt = (
        f'Classify this text:\n\n"{text_to_classify}"\n\n'
        'Return JSON: {"category": "spam|phishing|legitimate", '
        '"severity": "LOW|MEDIUM|HIGH|CRITICAL", "confidence": 0-100}'
    )
    messages = [{"role": "user", "content": prompt}]
    raw = call_llm(messages, temperature=0.1)

    try:
        json_str = extract_json_from_response(raw)
        return json.loads(json_str)
    except (json.JSONDecodeError, ValueError):
        # Fallback: extract fields with regex
        extracted = extract_fields_with_regex(raw, ["category", "severity", "confidence"])
        extracted["_extraction_method"] = "regex_fallback"
        return extracted
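
One gap in the fallback is that regex matches come back as raw strings in whatever case the model used, while the JSON path returns typed values. The sketch below shows one way to normalize the three fields so both paths produce the same shape. The sample sentence is invented, and extract_normalized and first_group are hypothetical names.

```python
import re
from typing import Optional

SEVERITY = re.compile(r"\b(LOW|MEDIUM|HIGH|CRITICAL)\b", re.IGNORECASE)
CATEGORY = re.compile(r"\b(spam|phishing|legitimate)\b", re.IGNORECASE)
CONFIDENCE = re.compile(r"confidence[:\s]+(\d+(?:\.\d+)?)%?", re.IGNORECASE)

def first_group(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None

def extract_normalized(text: str) -> dict:
    """Extract the three fields and normalize case and type."""
    severity = first_group(SEVERITY, text)
    category = first_group(CATEGORY, text)
    confidence = first_group(CONFIDENCE, text)
    return {
        "severity": severity.upper() if severity else None,
        "category": category.lower() if category else None,
        "confidence": float(confidence) if confidence else None,
    }

prose = "I'd say the severity is: High. This looks like Phishing, confidence: 87%."
fields = extract_normalized(prose)
print(fields)
```

Keep in mind that word-level patterns like these also match ordinary prose ("low quality" would match LOW), so treat regex results as lower-trust than parsed JSON.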

Pattern 4: Confidence scoring via self-evaluation

Problem: The model answers confidently even when it's guessing. You need a signal to route low-confidence answers to human review.

Key insight: Ask the model to evaluate its own answer in a separate call. Self-evaluation in the same call is biased upward.

def get_answer_with_confidence(question: str, context: str) -> dict:
    # Step 1: Generate answer
    answer_messages = [
        {"role": "system", "content": "Answer based strictly on the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    answer = call_llm(answer_messages, temperature=0.2)

    # Step 2: Evaluate in a separate call
    eval_messages = [
        {"role": "system", "content": (
            "You are an impartial evaluator. Assess answer quality strictly. "
            "Return JSON: {\"confidence\": 0-100, \"issues\": [list of concerns], "
            "\"grounded\": true/false}"
        )},
        {"role": "user", "content": (
            f"Question: {question}\n\n"
            f"Context provided:\n{context}\n\n"
            f"Answer given:\n{answer}\n\n"
            "Evaluate: Is this answer fully supported by the context? "
            "Are there unsupported claims? Score 0-100."
        )},
    ]
    eval_raw = call_llm(eval_messages, temperature=0.0)

    try:
        eval_data = json.loads(extract_json_from_response(eval_raw))
    except (json.JSONDecodeError, ValueError):
        eval_data = {"confidence": 50, "issues": ["evaluation_parse_failed"], "grounded": None}

    return {
        "answer": answer,
        "confidence": eval_data.get("confidence", 50),
        "issues": eval_data.get("issues", []),
        "grounded": eval_data.get("grounded"),
        "needs_review": eval_data.get("confidence", 50) < 70,
    }
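
The evaluator is itself an LLM, so its confidence field can come back as a string, a percentage, or prose like the "I am quite certain" example from the top of this post. A minimal sketch of coercing that value before comparing it to a threshold follows; coerce_confidence and needs_review are hypothetical helpers, and the default of 50 mirrors the fallback used above.

```python
def coerce_confidence(value, default: float = 50.0) -> float:
    """Turn an evaluator's confidence value into a float clamped to 0-100."""
    if isinstance(value, bool):
        return default  # True/False is not a score
    if isinstance(value, (int, float)):
        number = float(value)
    elif isinstance(value, str):
        try:
            number = float(value.strip().rstrip("%"))
        except ValueError:
            return default  # prose such as "I am quite certain"
    else:
        return default  # None, lists, dicts
    if number != number:
        return default  # NaN
    return min(100.0, max(0.0, number))

def needs_review(value, threshold: float = 70.0) -> bool:
    """Route to human review when the coerced confidence is below threshold."""
    return coerce_confidence(value) < threshold

for raw in [92, "85%", "I am quite certain", None, 140]:
    print(repr(raw), coerce_confidence(raw), needs_review(raw))
```

Unparseable values fall back to 50, which is below the 70 cutoff, so garbage scores end up in the review queue instead of slipping through.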

Pattern 5: Deduplication across batch outputs

Problem: You process 50 documents in batch and ask the model to extract key entities from each. You get overlapping, near-duplicate entries that pollute your downstream data.

Good solution: Hash-based exact dedup combined with a lightweight similarity check for near-duplicates.

from difflib import SequenceMatcher

def deduplicate_outputs(items: list[str],
                         similarity_threshold: float = 0.85) -> list[str]:
    """
    Remove exact duplicates (hash) and near-duplicates (sequence similarity).
    Keeps the first occurrence of each unique item.
    """
    seen_hashes: set[str] = set()
    unique_items: list[str] = []

    for item in items:
        normalized = item.strip().lower()
        item_hash = hashlib.md5(normalized.encode()).hexdigest()

        if item_hash in seen_hashes:
            continue  # exact duplicate

        # Check near-duplicate against existing unique items
        is_near_dup = any(
            SequenceMatcher(None, normalized, existing.strip().lower()).ratio()
            >= similarity_threshold
            for existing in unique_items
        )

        if not is_near_dup:
            unique_items.append(item)
            seen_hashes.add(item_hash)

    return unique_items

def batch_extract_entities(documents: list[str], entity_type: str) -> list[str]:
    all_entities = []

    for doc in documents:
        messages = [
            {"role": "system", "content": (
                f"Extract all {entity_type} from the text. "
                "Return a JSON array of strings. Nothing else."
            )},
            {"role": "user", "content": doc},
        ]
        raw = call_llm(messages, temperature=0.1)
        try:
            entities = json.loads(extract_json_from_response(raw))
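            if isinstance(entities, list):
                # Keep only strings: deduplicate_outputs calls .strip() on each item
                all_entities.extend(e for e in entities if isinstance(e, str))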
        except (json.JSONDecodeError, ValueError):
            pass  # log and continue — one bad doc shouldn't stop the batch

    return deduplicate_outputs(all_entities)
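
Before settling on a similarity threshold, it helps to look at the actual ratios for the kinds of variants you expect. The sketch below applies the same SequenceMatcher comparison to a few invented entity pairs; similarity and is_near_duplicate are hypothetical helper names.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] after the same normalization used for dedup."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    return similarity(a, b) >= threshold

# Invented examples of the variants a batch extraction might produce.
pairs = [
    ("Acme Corp", "Acme Corp."),        # punctuation variant
    ("Open AI", "OpenAI"),              # spacing variant
    ("Acme Corp", "Acme Corporation"),  # abbreviation
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f} "
          f"near_dup={is_near_duplicate(a, b)}")
```

Punctuation and spacing variants score above 0.9 and are merged at the default threshold, while the abbreviation scores about 0.72 and survives as a separate entry. If abbreviations matter for your data, lower the threshold or add a normalization step, and remember that the pairwise check is quadratic in the number of unique items.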

Putting it all together

These patterns compose. A production pipeline for classifying user-submitted content might chain them:

def robust_classify(text: str) -> dict:
    try:
        result = call_with_json_schema(
            prompt=f'Classify this text: "{text}"',
            schema={
                "type": "object",
                "required": ["category", "severity", "confidence"],
                "properties": {
                    "category": {"type": "string", "enum": ["spam", "phishing", "legitimate", "toxic"]},
                    "severity": {"type": "string", "enum": ["LOW", "MEDIUM", "HIGH", "CRITICAL"]},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 100},
                },
            },
            max_retries=3,
        )
    except ValueError:
        # Pattern 3 fallback
        result = classify_with_fallback(text)

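    # Flag for human review if uncertain. This uses the model's self-reported
    # confidence, a weaker signal than Pattern 4's separate evaluation call.
    # The regex fallback can return None or a numeric string, so coerce first.
    try:
        confidence = float(result.get("confidence"))
    except (TypeError, ValueError):
        confidence = 0.0  # unknown confidence: route to review
    result["needs_review"] = confidence < 65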
    return result

These five patterns cover the failure modes listed at the top of this post. Start with Pattern 1 (JSON schema + retry) and Pattern 3 (regex fallback): together they handle malformed and prose-wrapped output, roughly 80% of output issues. Add Pattern 4 (self-evaluation) when you have a human review queue and need to route intelligently. For batch content pipelines such as moderation systems, Patterns 1 and 5 together eliminate most of the noise from batch LLM processing.