Jack M

Posted on Jun 29

LLM Structured Output Validation: Stop JSON Breaks Before They Hit Production

#ai #llm #testing #tutorial

If your AI feature returns plain text, a bad answer is annoying. If it returns JSON that drives billing, tickets, database writes, automations, or customer-facing workflows, a bad answer can break the product.

That is the quiet failure mode many builders discover late. The demo works. The schema looks simple. The model follows instructions most of the time. Then one production request adds a sentence before the JSON, drops a required field, changes an enum, invents a key, or returns a valid object with unsafe values.

This guide shows how to build an LLM structured output validation layer that catches those failures before they touch production systems.

Why structured output breaks in real apps

Structured output is the bridge between language and software. You ask a model to return a shape like this:

{
  "intent": "refund_request",
  "confidence": 0.87,
  "customer_id": "cus_123",
  "next_action": "open_ticket"
}

Then your app treats that response like data.

The problem is that language models are not normal API servers. They predict text. Even when a provider offers JSON mode, function calling, tool calling, or schema-constrained decoding, your application still owns the safety boundary around the result.

Common production failures include:

extra prose before or after JSON
missing required fields
nullable fields where your app expects strings
enum drift, such as cancelled instead of canceled
IDs copied from examples instead of real input
unsafe tool arguments
schema versions mixed across deployments
valid JSON that violates business rules
silent model fallback that changes output behavior

The fastest way to make structured output reliable is to stop treating parsing as the only problem. Parsing answers one question: "Is this JSON?" Production validation asks a better question: "Can this object safely drive the next step?"

The output contract mindset

An output contract is a small agreement between the model and your app:

What shape must the answer have?
Which values are allowed?
Which fields are required for this workflow?
Which fields can be repaired?
Which failures must stop the workflow?
Which schema version produced the object?

This matters because the model is only one part of the system. Your contract also protects the queue worker, webhook handler, database transaction, notification job, and user interface.

A useful contract has three layers:

Layer	What it checks	Example
Syntax	Can we parse it?	Valid JSON object
Schema	Does it match the type shape?	`confidence` is a number between 0 and 1
Semantics	Is it safe and true enough for this workflow?	`customer_id` belongs to the tenant

Most broken AI workflows stop at layer one. Reliable ones enforce all three.

Start with the smallest schema that can do the job

Large schemas create more failure points. If the next step only needs an intent and a confidence score, do not ask for a full CRM record.

Bad schema:

{
  "customer": {
    "name": "string",
    "email": "string",
    "plan": "string",
    "sentiment": "string",
    "summary": "string",
    "recommendedAction": "string",
    "priority": "string",
    "tags": ["string"],
    "risk": "string"
  }
}

Better schema:

{
  "intent": "refund_request | bug_report | billing_question | unknown",
  "confidence": 0.0,
  "reason": "short string"
}

The second schema is easier to validate, easier to test, and easier to recover from.

A good rule: ask the model for decisions, labels, and short explanations. Fetch authoritative data from your own systems.

Do not ask the model to invent user IDs, invoice IDs, subscription states, or permission levels. Pass those in from trusted services after the model chooses the next step.

Use provider features, but do not outsource validation

Modern LLM APIs often support structured outputs through JSON mode, function calling, tool calling, or schema constraints. Use them. They reduce messy parsing problems.

But they are not the whole reliability layer.

Provider-side constraints help with syntax and part of the schema. Your application still needs to validate:

tenant ownership
authorization
field-level business rules
maximum amounts
date ranges
allowed workflow transitions
idempotency keys
schema version compatibility
whether the output is confident enough to automate

Think of provider structured output as a helpful first gate, not the final gate.

A practical TypeScript validation pattern

Here is a small TypeScript pattern using Zod. The same idea works with Pydantic, Valibot, JSON Schema, or your validation library of choice.

import { z } from "zod";

const TicketIntentSchema = z.object({
  schema_version: z.literal("ticket_intent.v1"),
  intent: z.enum([
    "refund_request",
    "bug_report",
    "billing_question",
    "feature_request",
    "unknown"
  ]),
  confidence: z.number().min(0).max(1),
  reason: z.string().min(1).max(240)
});

type TicketIntent = z.infer<typeof TicketIntentSchema>;

export function validateTicketIntent(raw: unknown): TicketIntent {
  const parsed = TicketIntentSchema.safeParse(raw);

  if (!parsed.success) {
    throw new Error(
      `LLM_OUTPUT_SCHEMA_ERROR: ${parsed.error.issues
        .map(issue => `${issue.path.join(".")}: ${issue.message}`)
        .join("; ")}`
    );
  }

  return parsed.data;
}

This gives you a clear failure mode. The workflow can stop, retry, repair, or route to human review instead of passing a malformed object downstream.

Add semantic checks after schema checks

A schema can tell you that amount_cents is a number. It cannot tell you whether the refund is allowed.

Add semantic validation near the workflow boundary:

type RefundDecision = {
  schema_version: "refund_decision.v1";
  action: "approve" | "deny" | "needs_review";
  confidence: number;
  invoice_id: string;
  amount_cents: number;
  reason: string;
};

async function validateRefundDecision(
  decision: RefundDecision,
  tenantId: string
) {
  const invoice = await db.invoice.findFirst({
    where: { id: decision.invoice_id, tenant_id: tenantId }
  });

  if (!invoice) {
    return { ok: false, code: "INVOICE_NOT_FOUND_FOR_TENANT" };
  }

  if (decision.amount_cents > invoice.amount_paid_cents) {
    return { ok: false, code: "REFUND_EXCEEDS_PAYMENT" };
  }

  if (decision.action === "approve" && decision.confidence < 0.9) {
    return { ok: false, code: "LOW_CONFIDENCE_APPROVAL" };
  }

  return { ok: true, invoice };
}

This is where many AI apps become safer. The model can suggest an action, but the system decides whether the action is allowed.

Build a repair loop with a strict budget

Not every invalid output should fail immediately. Some failures are cheap to repair:

response has prose wrapped around JSON
enum uses a close variant
optional field is missing
number is returned as a string
schema version is missing but the route is known

But repair loops can become expensive and unpredictable. Use a strict budget.

A safe repair policy might be:

try normal generation once
if parsing fails, attempt one extraction or repair call
if schema validation fails, attempt one repair call with exact errors
if semantic validation fails, do not repair automatically; route to review or fallback

Example repair prompt shape:

Return only valid JSON for schema ticket_intent.v1.
Do not add prose.
Fix only the validation errors listed below.

Validation errors:
- confidence: expected number between 0 and 1
- intent: expected one of refund_request, bug_report, billing_question, feature_request, unknown

Original response:
...

The phrase "fix only" matters. Without it, the model may reinterpret the whole task and change fields that were already valid.

Decide what can be automated

Structured output validation is not only about correctness. It is also about control.

Classify actions into risk tiers:

Tier	Example output	Automation rule
Low	classify topic, draft summary, route inbox	automate after schema validation
Medium	update CRM field, create ticket, send internal notification	require semantic checks and audit log
High	refund money, delete data, email customer, change permissions	require approval or a separate deterministic policy

A model returning valid JSON does not mean the workflow should run.

For high-risk actions, the structured output should create a proposal, not execute the action directly.

{
  "schema_version": "action_proposal.v1",
  "proposed_action": "refund_invoice",
  "invoice_id": "inv_789",
  "amount_cents": 4900,
  "requires_approval": true,
  "reason": "Customer reported duplicate charge."
}

That object can be shown to a human reviewer with the source evidence, tenant context, and audit trail.

Version every schema

Schema drift is a silent killer. You deploy a new prompt that returns priority, but an old worker expects urgency. Or your frontend is updated before your queue consumer. Or a fallback model follows an older example from the prompt.

Add a schema_version field to every structured output.

Good versions are boring:

{
  "schema_version": "ticket_intent.v1",
  "intent": "bug_report",
  "confidence": 0.82,
  "reason": "User says export fails with a 500 error."
}

Then handle versions explicitly:

switch (output.schema_version) {
  case "ticket_intent.v1":
    return handleTicketIntentV1(output);
  default:
    throw new Error("UNSUPPORTED_LLM_OUTPUT_SCHEMA_VERSION");
}

Do not rely on prompt naming alone. Prompts are not runtime contracts.

Log validation failures like product signals

Validation failures are not just errors. They tell you where the product is unclear.

Track:

model name and version
prompt version
schema version
validation error category
repair attempt count
final outcome
tenant or plan tier, if appropriate and privacy-safe
workflow step
latency and token cost

Useful metrics include:

parse failure rate
schema failure rate
semantic failure rate
repair success rate
invalid output cost
automation deflection rate
human review rate
downstream incident count

If one intent fails validation more than others, the prompt may be unclear. If one model produces more enum drift, route that task elsewhere. If semantic failures spike after a product change, your schema may no longer reflect the workflow.

Test with adversarial and boring cases

Most teams test happy paths. Production breaks on weird but normal inputs.

Create a small test set for every output contract:

empty user message
very long message
multilingual message
conflicting instructions
prompt injection attempt
old product terminology
copied JSON from docs
missing tenant data
unsupported request
ambiguous request
high-risk action request
example that looks like a real ID

For each case, assert one of three outcomes:

valid structured output
safe fallback
human review

Avoid tests that only check whether JSON parses. Test the workflow decision.

Example:

it("does not approve refund when invoice belongs to another tenant", async () => {
  const decision = {
    schema_version: "refund_decision.v1",
    action: "approve",
    confidence: 0.96,
    invoice_id: "inv_other_tenant",
    amount_cents: 2000,
    reason: "Duplicate charge"
  };

  const result = await validateRefundDecision(decision, "tenant_current");
  expect(result.ok).toBe(false);
  expect(result.code).toBe("INVOICE_NOT_FOUND_FOR_TENANT");
});

Keep prompts and schemas close together

A common mistake is storing prompts in one place and schemas in another. Over time they drift.

Keep these files together:

ai/
  ticket-intent/
    prompt.md
    schema.ts
    examples.jsonl
    evals.test.ts
    README.md

The README should explain:

what the contract does
what it must never do
allowed enum values
fallback behavior
owner
last major change

This makes AI workflows easier to review in pull requests. A reviewer can see when a prompt change affects the contract and whether tests changed with it.

Common mistakes to avoid

Mistake 1: Trusting JSON mode as a full safety system

JSON mode can reduce syntax failures. It does not validate tenant access, business rules, or workflow risk.

Mistake 2: Asking for too much in one object

One giant schema often hides multiple decisions. Split classification, extraction, and action proposal into separate contracts when possible.

Mistake 3: Automatically repairing semantic failures

If the model suggests refunding more than the invoice amount, do not ask it to "try again" until it approves. Stop the workflow.

Mistake 4: Ignoring low-confidence valid outputs

A perfectly valid object with confidence: 0.41 should not drive irreversible automation.

Mistake 5: Forgetting schema versions

Every contract should include a version. Your future migrations will be calmer.

A simple rollout plan

If your app already uses structured LLM output, start here:

List every workflow where model output becomes application data.
Mark the workflows that can write, send, charge, delete, or update records.
Add schema validation to the highest-risk workflow first.
Add semantic checks for tenant ownership and business rules.
Add one repair attempt for syntax or schema failures.
Add human review for high-risk semantic failures.
Log validation outcomes.
Build a small regression test set from real failures.
Add schema versions.
Review metrics weekly until failure rates stabilize.

You do not need a giant platform to start. One schema, one validator, and one clear stop condition can prevent the most painful incidents.

Content map for builders

This topic belongs in a broader production AI architecture cluster.

Pillar: production AI application architecture
Cluster: output reliability, workflow safety, schema validation, model routing
Search intent: practical implementation guide
Funnel stage: middle; the reader has built or is building AI features and needs reliability
Internal link targets: agent observability, claim verification, evaluation harness, approval gates, model failover
Next useful articles: schema migration for AI workflows, semantic validation for tool calls, regression testing structured AI outputs

FAQ

What is LLM structured output validation?

LLM structured output validation is the process of checking model responses against syntax, schema, and business rules before using them in software workflows. It makes sure the response is not only valid JSON, but safe for the next step.

Is JSON mode enough for production AI apps?

JSON mode helps, but it is not enough by itself. It can improve formatting, but your app still needs schema checks, authorization checks, semantic validation, logging, and fallback behavior.

What is the difference between parsing and validation?

Parsing checks whether a response can be read as JSON or another format. Validation checks whether the parsed object matches your expected fields, types, allowed values, and workflow rules.

Should invalid LLM output be repaired automatically?

Only low-risk syntax and schema failures should be repaired automatically, and only with a strict retry budget. Semantic failures, permission failures, and high-risk action failures should stop the workflow or require review.

Why should every AI output schema include a version?

A schema version prevents silent drift between prompts, models, workers, and frontends. It lets your app reject unsupported shapes and migrate contracts safely.

Which tools can validate structured LLM output?

Common options include Zod, Pydantic, JSON Schema, Valibot, TypeBox, and framework-specific parsers in LangChain or LlamaIndex. The best choice is usually the validation tool your codebase already understands.

DEV Community