Gabriel Anhaia

Posted on May 24

Structured Output Validation: Pydantic/Zod vs In-Prompt Schema vs JSON Mode

#ai #llm #python #typescript

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Three ways to make an LLM hand you structured data: native JSON mode, a typed example block inside the prompt, and runtime validation with Pydantic or Zod. Most teams pick one. Then they spend three quarters chasing schema drift across model upgrades, fighting refusals from over-strict JSON modes, or shipping a parser that explodes the day a customer's input produces a payload with a stray trailing comma.

The right answer layers two of these. Never three. And the choice of which two depends entirely on how expensive a bad payload is in your domain.

The three validation surfaces

Each surface catches a different class of failure, and each fails differently.

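Native JSON mode (Anthropic tool use, OpenAI Structured Outputs, Gemini response schema) is enforced by the provider. In the strict variants, the model is constrained at the decoding step: token by token, the sampler rejects branches that would violate the JSON schema you sent. This is the strongest guarantee on the wire. How strong depends on the feature, though. Tool use without a strict option steers the model toward the schema rather than mechanically binding it, so check what your provider actually promises. It also costs tokens, and on some schemas, it refuses to answer at all.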
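To make "the sampler rejects branches" concrete, here is a toy sketch of constrained decoding over a single enum field. It is an illustration only, not how any provider implements it: constrained_pick, decode_enum, and the ranked token lists are made up for the example, and real samplers work on token IDs and grammars, not on strings.

```python
def constrained_pick(
    prefix: str, ranked_tokens: list[str], allowed: list[str]
) -> str:
    """Return the best-ranked token that keeps the output schema-legal.

    ranked_tokens is ordered from most to least preferred by the
    (imaginary) model. A token is legal if prefix + token is still the
    start of at least one allowed enum value.
    """
    for token in ranked_tokens:
        candidate = prefix + token
        if any(value.startswith(candidate) for value in allowed):
            return token
    # Every branch violates the schema: the toy version of a refusal.
    raise ValueError("no candidate token satisfies the schema")


def decode_enum(steps: list[list[str]], allowed: list[str]) -> str:
    """Decode one enum value, one constrained step at a time."""
    out = ""
    for ranked_tokens in steps:
        out += constrained_pick(out, ranked_tokens, allowed)
        if out in allowed:
            return out
    raise ValueError("ran out of steps before completing a value")


currencies = ["EUR", "USD", "GBP", "JPY"]
# The model "prefers" lowercase eur and then uro; both get masked out.
picked = decode_enum([["eur", "E", "U"], ["uro", "UR", "SD"]], currencies)
```

The dead-end branch is the interesting part. When no candidate token can satisfy the schema, a constrained sampler has nothing valid left to emit, which is one way to picture why strict modes sometimes fail instead of answering.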

In-prompt schema is a typed example block in your system prompt. You're telling the model what shape to return. There's no enforcement, just instruction. Cheap, model-agnostic, and the most fragile surface. A model upgrade can quietly change what "follow this format" means.

Runtime validation is Pydantic in Python, Zod in TypeScript, or whatever validator your language ships. This runs after the model responds. It catches what the other two surfaces missed: extra fields, wrong enum values, malformed dates, numbers stringified, the works. It is the only surface that can produce an actionable error message back to either your code or the model.

The question is which two you stack, and why.

Native JSON mode — strong on the wire, fragile on the schema

Anthropic's tool use is the closest thing to enforced structured output for Claude. You declare a tool, the model fills its input. Here's a real example, the kind you'd actually ship for an invoice-extraction pipeline.

import anthropic

client = anthropic.Anthropic()

extract_invoice_tool = {
    "name": "extract_invoice",
    "description": "Extract structured fields from an invoice document.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "issue_date": {
                "type": "string",
                "description": "ISO 8601 date (YYYY-MM-DD).",
            },
            "currency": {
                "type": "string",
                "enum": ["EUR", "USD", "GBP", "JPY"],
            },
            "total_cents": {
                "type": "integer",
                "minimum": 0,
                "description": "Total amount in cents. No floats.",
            },
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "sku": {"type": "string"},
                        "quantity": {"type": "integer", "minimum": 1},
                        "unit_price_cents": {"type": "integer"},
                    },
                    "required": ["sku", "quantity", "unit_price_cents"],
                },
            },
        },
        "required": [
            "invoice_number",
            "issue_date",
            "currency",
            "total_cents",
            "line_items",
        ],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    tools=[extract_invoice_tool],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": invoice_text}],
)

tool_block = next(b for b in response.content if b.type == "tool_use")
raw = tool_block.input  # already a dict, already parsed

That tool_choice={"type": "tool", "name": "extract_invoice"} line is the part most people forget. Without it the model decides whether to call the tool. With it, the model must call exactly that tool. This is your forcing function.
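For reference, here is a small sketch of the tool_choice variants side by side. The helper name tool_choice_for is hypothetical and only exists for this example. The three dict shapes are the auto, any, and tool options of the Messages API.

```python
from typing import Optional


def tool_choice_for(mode: str, tool_name: Optional[str] = None) -> dict:
    """Build a tool_choice value for a messages.create call.

    auto: the model decides whether to call a tool at all
    any:  the model must call some tool, but picks which one
    tool: the model must call exactly the named tool
    """
    if mode in ("auto", "any"):
        return {"type": mode}
    if mode == "tool":
        if not tool_name:
            raise ValueError("mode 'tool' needs a tool name")
        return {"type": "tool", "name": tool_name}
    raise ValueError(f"unknown tool_choice mode: {mode}")


forced = tool_choice_for("tool", "extract_invoice")
```

For an extraction pipeline with a single tool, the forced variant is the one that matters. The other two leave room for a plain-text answer or a different tool.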

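What you get back is a dict the SDK already parsed. With strict, schema-constrained decoding the shape is enforced at the sampler: strings are strings, integers are integers, the enum is one of four values, and the currency field will not be "euros" or "eur" because the sampler can't produce those tokens. Without a strict option, forced tool use usually gets you the same shape, which is a strong default and not a proof.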

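What it does not catch: an issue_date of "2026-13-45". The schema says string. The model picked something string-shaped. JSON Schema does have a format keyword for dates, but many validators and providers treat it as an annotation and not an assertion, so the fact that calendar months stop at 12 is effectively not enforced here.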
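The gap is easy to see with the standard library alone. A minimal sketch, assuming you want strictly YYYY-MM-DD and a date that exists on the calendar. The function name is_real_iso_date is made up for this example.

```python
import re
from datetime import date

ISO_DAY = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_real_iso_date(value: str) -> bool:
    """True only for YYYY-MM-DD strings that name an actual calendar day."""
    # Shape check first: newer Pythons accept other ISO spellings
    # (for example 20260524) in date.fromisoformat.
    if not ISO_DAY.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        # Right shape, impossible date: month 13, February 30, and so on.
        return False
    return True
```

A "type": "string" schema accepts every input this function rejects. That is the class of error the runtime validator exists for.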

In-prompt schema — cheap, model-agnostic, drifts

You can describe the contract in the prompt instead. Works on any provider, no SDK feature dependency, almost no token cost.

SYSTEM = """You extract invoice data. Return ONLY a JSON object matching:

{
  "invoice_number": "string",
  "issue_date": "YYYY-MM-DD",
  "currency": "EUR|USD|GBP|JPY",
  "total_cents": 0,
  "line_items": [
    {"sku": "string", "quantity": 1, "unit_price_cents": 0}
  ]
}

No prose. No code fences. Cents as integers, never floats."""

This works well enough on a frontier model to pass demo day. Then you upgrade from one minor version to the next and discover the new model loves wrapping its JSON in ```json fences "for clarity." Or it starts emitting "currency": "EUR (Euro)" because it wants to be helpful. Or it puts the cents in a string because the document had a thousands separator and it didn't know what to do.

In-prompt schema is a request, not a contract. The model decides how literally to follow it, and that decision shifts with every checkpoint Anthropic, OpenAI, and Google ship.
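Because nothing enforces the format, the first step on this path is usually to dig the JSON out of whatever came back. A minimal sketch using only the standard library. The name extract_json_object is hypothetical, and the slice-between-braces heuristic is an assumption that holds for single-object replies, not a general parser.

```python
import json


def extract_json_object(text: str) -> dict:
    """Pull the outermost JSON object out of a chatty model reply.

    Tolerates code fences and surrounding prose by slicing from the
    first "{" to the last "}". Anything json.loads rejects (trailing
    commas, single quotes) still raises, on purpose.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    parsed = json.loads(text[start : end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

This only recovers the wire format. A value like "EUR (Euro)" or a stringified amount passes straight through, so the result still has to go to a validator before anything downstream trusts it.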

Runtime validation — the safety net that should always exist

Pydantic v2 is the right primitive here. It's fast, the error messages are excellent, and the model_validate API gives you exactly one place where bad data turns into a typed exception.


from datetime import date
from enum import Enum
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError, field_validator


class Currency(str, Enum):
    EUR = "EUR"
    USD = "USD"
    GBP = "GBP"
    JPY = "JPY"


class LineItem(BaseModel):
    sku: Annotated[str, Field(min_length=1, max_length=64)]
    quantity: Annotated[int, Field(ge=1)]
    unit_price_cents: Annotated[int, Field(ge=0)]


class Invoice(BaseModel):
    invoice_number: Annotated[str, Field(pattern=r"^[A-Z0-9-]{3,32}$")]
    issue_date: date  # parses YYYY-MM-DD, rejects 2026-13-45
    currency: Currency
    total_cents: Annotated[int, Field(ge=0)]
    line_items: list[LineItem]

    @field_validator("line_items")
    @classmethod
    def at_least_one(cls, v: list[LineItem]) -> list[LineItem]:
        if not v:
            raise ValueError("invoice must have at least one line item")
        return v


def parse_invoice(raw: dict) -> Invoice:
    return Invoice.model_validate(raw)

Now 2026-13-45 blows up. "EUR (Euro)" blows up. An empty line items array blows up. An invoice number with whitespace blows up. The errors are structured, field-referenced, and machine-readable.
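Here is roughly what "structured" means in practice, as a small sketch against a cut-down model. MiniInvoice and failed_fields are made up for this example. One assumption worth flagging: Pydantic's default lax mode coerces a string like "1200" into an int, so the sketch marks total_cents as strict to make a stringified number fail instead of passing silently.

```python
from datetime import date
from enum import Enum
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError


class Currency(str, Enum):
    EUR = "EUR"
    USD = "USD"


class MiniInvoice(BaseModel):
    issue_date: date
    currency: Currency
    # strict=True: reject "1200" instead of coercing it to 1200
    total_cents: Annotated[int, Field(ge=0, strict=True)]


def failed_fields(raw: dict) -> dict[str, str]:
    """Map each failing field path to Pydantic's message for it."""
    try:
        MiniInvoice.model_validate(raw)
    except ValidationError as err:
        return {
            ".".join(str(part) for part in e["loc"]): e["msg"]
            for e in err.errors()
        }
    return {}


problems = failed_fields(
    {"issue_date": "2026-13-45", "currency": "EUR (Euro)", "total_cents": -5}
)
```

Every failing field is reported in one pass, keyed by its path. That is what makes the error usable by a logger or a repair prompt without any string parsing.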

The TypeScript equivalent in Zod is the same idea with different syntax. Zod's strength is its inferred types: the TS type is derived from the schema with z.infer, so there is no parallel declaration to drift.


import { z } from "zod";

const currency = z.enum(["EUR", "USD", "GBP", "JPY"]);

const lineItem = z.object({
  sku: z.string().min(1).max(64),
  quantity: z.number().int().min(1),
  unit_price_cents: z.number().int().min(0),
});

const invoice = z.object({
  invoice_number: z.string().regex(/^[A-Z0-9-]{3,32}$/),
  issue_date: z.string().date(), // YYYY-MM-DD, rejects junk
  currency,
  total_cents: z.number().int().min(0),
  line_items: z.array(lineItem).min(1),
});

export type Invoice = z.infer<typeof invoice>;

export function parseInvoice(raw: unknown): Invoice {
  return invoice.parse(raw); // throws ZodError on failure
}

The point of runtime validation is not the happy path. It's the failure path. You get a structured error you can hand to a logger, a repair prompt, or a human reviewer.

The layered stack that ships

Here's the rule: pick two surfaces, not three. Three is a footgun.

Tool use + Pydantic/Zod: for high-stakes pipelines where a malformed payload causes a wrong invoice, a wrong dosage, a wrong account number. Tool use pins down the wire format. Pydantic catches semantic invariants the schema can't express or the provider doesn't enforce (calendar dates, regex patterns, cross-field constraints). The two cooperate cleanly.

In-prompt + Pydantic/Zod: for everything else. Costs less, works across providers, the runtime validator absorbs schema drift across model upgrades. This is what you want for the 80% of internal LLM calls that don't justify the JSON-mode tax.

Never all three. The failure modes start to overlap and the debugging gets miserable. JSON mode refuses a valid response, the in-prompt schema describes a shape the tool schema already declares, and now you have three sources of truth disagreeing about what optional means.

A practical pattern: define your Pydantic model once. Use it to drive both surfaces.


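from typing import Any

from pydantic import BaseModel


def model_to_anthropic_tool(
    model: type[BaseModel], name: str, description: str
) -> dict[str, Any]:
    schema = model.model_json_schema()
    # Keep only the keys the tool definition needs. Nested models and
    # enums (LineItem, Currency) come out as $ref entries pointing into
    # $defs, so $defs has to travel with the properties or the refs dangle.
    input_schema: dict[str, Any] = {
        "type": "object",
        "properties": schema.get("properties", {}),
        "required": schema.get("required", []),
    }
    if "$defs" in schema:
        input_schema["$defs"] = schema["$defs"]
    return {
        "name": name,
        "description": description,
        "input_schema": input_schema,
    }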


invoice_tool = model_to_anthropic_tool(
    Invoice,
    name="extract_invoice",
    description="Extract structured fields from an invoice document.",
)

One source of truth. The Pydantic model. The tool schema is derived. The runtime validation is the model itself. A field rename ripples through everything.
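The same model can drive the in-prompt surface for the other combo. A minimal sketch, assuming you are happy to show the model raw JSON Schema instead of a hand-written example block. Receipt and model_to_system_prompt are hypothetical names used only here.

```python
import json

from pydantic import BaseModel, Field


class Receipt(BaseModel):
    """Hypothetical stand-in for the Invoice model."""

    receipt_number: str
    total_cents: int = Field(ge=0)


def model_to_system_prompt(model: type[BaseModel], task: str) -> str:
    """Render a system prompt whose schema block comes from the model."""
    schema = json.dumps(model.model_json_schema(), indent=2)
    return (
        f"{task} Return ONLY a JSON object matching this JSON Schema:\n\n"
        f"{schema}\n\n"
        "No prose. No code fences."
    )


system_prompt = model_to_system_prompt(Receipt, "You extract receipt data.")
```

Renaming a field on the model now changes the prompt and the validator together. That is the point of deriving both from one declaration.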

Repair-prompt retry — what to do when validation fails

The single most useful retry pattern: hand the validation error back to the model. The model wrote the bad JSON. The model can fix it.


import anthropic
from pydantic import ValidationError

client = anthropic.Anthropic()

MAX_REPAIR_ATTEMPTS = 2


def extract_with_repair(invoice_text: str) -> Invoice:
    messages = [{"role": "user", "content": invoice_text}]

    for attempt in range(MAX_REPAIR_ATTEMPTS + 1):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            tools=[invoice_tool],
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=messages,
        )

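        tool_block = next(
            (b for b in response.content if b.type == "tool_use"), None
        )
        if tool_block is None:
            # refusal or empty response: nothing to validate or repair
            raise RuntimeError(
                f"no tool_use block (stop_reason={response.stop_reason})"
            )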

        try:
            return Invoice.model_validate(tool_block.input)
        except ValidationError as err:
            if attempt == MAX_REPAIR_ATTEMPTS:
                # out of budget; surface to the caller
                raise

            # feed the assistant's bad attempt + the error back
            messages.append(
                {"role": "assistant", "content": response.content}
            )
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "tool_result",
                            "tool_use_id": tool_block.id,
                            "is_error": True,
                            "content": (
                                "Your previous extraction failed validation. "
                                "Fix these specific errors and try again:\n\n"
                                f"{err.json(indent=2)}\n\n"
                                "Return ONLY the corrected extraction."
                            ),
                        }
                    ],
                }
            )

    raise RuntimeError("unreachable")

Two things make this work. First, err.json() gives Pydantic's structured error payload: field path, error type, the actual bad value. The model handles {"loc": ["issue_date"], "msg": "Input should be a valid date", "input": "2026-13-45"} much better than it handles a Python traceback. Second, the retry preserves the full conversation. The model sees its own bad attempt and the specific complaint, not a fresh prompt with no context.

Cap retries at 2. Past that, you're paying for a model that genuinely can't satisfy the schema on this input. The right move is to surface the error to a human reviewer, not to retry until the bill burns.
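The budget logic is easier to test when it is separated from the SDK. Here is a provider-agnostic sketch of the same capped loop. Everything in it is hypothetical scaffolding: call_model stands in for the API call and receives the validation messages so far, and validate stands in for model_validate. Pydantic's ValidationError is a ValueError subclass, so catching ValueError covers it.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RepairBudgetExceeded(Exception):
    """Raised when the model still fails validation after all repairs."""


def run_with_repair(
    call_model: Callable[[list[str]], str],
    validate: Callable[[str], T],
    max_repairs: int = 2,
) -> T:
    """Call the model, validate, and feed errors back a bounded number of times."""
    feedback: list[str] = []
    last_error: Optional[ValueError] = None
    # One initial attempt plus max_repairs repair attempts.
    for _ in range(max_repairs + 1):
        raw = call_model(feedback)
        try:
            return validate(raw)
        except ValueError as err:
            last_error = err
            feedback.append(str(err))  # the next call sees the complaint
    raise RepairBudgetExceeded(
        f"still invalid after {max_repairs} repair attempts"
    ) from last_error
```

With max_repairs=2 the model is called at most three times per input. That keeps the worst-case cost of a hopeless document fixed and easy to budget for.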

The gotcha — JSON mode refusals

JSON mode and tool use will sometimes refuse to answer. Strict schemas with deep nesting, ambiguous documents, content the model is unsure about: all can produce an empty tool call, a stop_reason of "refusal", or a tool_use block with a partial input the model gave up on.
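Those cases are worth telling apart before you decide between repair and fallback. A rough sketch, assuming the response has been reduced to plain dicts. The SDK returns objects, so that shape is a simplification. The function classify_tool_response and its labels are invented for this example, and only the "refusal" and "max_tokens" stop reasons are taken from the API.

```python
from typing import Optional


def classify_tool_response(
    stop_reason: Optional[str],
    content: list[dict],
    required: list[str],
) -> str:
    """Label a forced-tool response so the caller can pick a strategy."""
    if stop_reason == "refusal":
        return "refusal"
    tool_blocks = [b for b in content if b.get("type") == "tool_use"]
    if not tool_blocks:
        return "no_tool_call"
    tool_input = tool_blocks[0].get("input") or {}
    if not tool_input:
        return "empty_input"
    missing = [key for key in required if key not in tool_input]
    # A truncated generation or missing required keys both mean the
    # model stopped before finishing the payload.
    if stop_reason == "max_tokens" or missing:
        return "partial_input"
    return "ok"
```

Only "ok" and "partial_input" are worth sending through a validator and a repair prompt. The other three labels have nothing to repair, so they are candidates for an immediate fallback.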

The fix is a non-JSON fallback. If tool use fails twice with a repair prompt, drop to a plain text request with an in-prompt schema, run Pydantic over whatever JSON-ish thing comes back, and log the fallback as a separate metric. You want to see when the model is refusing. It's a signal about either your prompts or your input distribution.


def extract_robust(invoice_text: str) -> Invoice:
    try:
        return extract_with_repair(invoice_text)
    except (
        ValidationError,
        anthropic.APIError,
        RuntimeError,  # the run ended without a usable tool call
    ) as primary_err:
        # log: extraction_method=fallback, reason=<primary_err>
        # extract_via_plain_prompt is the in-prompt-schema path (not shown)
        return extract_via_plain_prompt(invoice_text)

The fallback path takes longer and is less reliable. That's fine. It runs only on the small tail of inputs where the strict path can't cope, and it keeps the pipeline moving instead of bouncing the whole batch to a dead letter queue.
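To make the "separate metric" part concrete, here is a small sketch with the two paths injected as callables, so it runs without a provider. The names extract_with_fallback and metrics are hypothetical, and a Counter stands in for whatever metrics client you actually use.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")

# Stand-in for a real metrics client.
metrics: Counter = Counter()


def extract_with_fallback(
    text: str,
    primary: Callable[[str], T],
    fallback: Callable[[str], T],
    recoverable: tuple = (ValueError,),
) -> T:
    """Try the strict path, then fall back, counting which path served."""
    try:
        result = primary(text)
    except recoverable as err:
        metrics["fallback"] += 1
        # Keep the reason as its own counter so refusals and
        # validation failures can be charted separately.
        metrics[f"fallback_reason:{type(err).__name__}"] += 1
        return fallback(text)
    metrics["primary"] += 1
    return result
```

The ratio of fallback to primary is the number to watch. A jump after a model upgrade or a new customer onboarding is the signal the paragraph above describes.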

What to take away

Runtime validation is non-negotiable. Pydantic or Zod, pick the one your stack already speaks. Either one will save you the day your provider quietly changes how it handles optional fields.

Pair it with one of the other two surfaces. Never both. Tool use when the cost of a bad payload is high. In-prompt schema when you need provider flexibility or want to save tokens.

Build the repair prompt early. The first time a customer's weird invoice format breaks your extraction, you'll want the retry loop already wired. Cap the retries, log the failures, fall back to plain text for the long tail.

And derive your schemas from one source of truth. The day you have three places defining what a valid invoice looks like is the day schema drift becomes a permanent line item on your roadmap.

Which surface combo are you running in production right now, and what was the last validation failure that bit you?

If this was useful

The layered output validation pattern, the repair-prompt retry loop, and the dial between strictness and refusal rate are all worked through in detail in the Prompt Engineering Pocket Guide. The chapter on structured generation walks through tool use, in-prompt schemas, and validators with the kind of trade-off tables that save you a quarter of trial and error. It's the book I wish I'd had the first time I shipped a Pydantic model into an LLM pipeline.