
Jack M

LLM Structured Output Validation: Stop JSON Breaks Before They Hit Production

If your AI feature returns plain text, a bad answer is annoying. If it returns JSON that drives billing, tickets, database writes, automations, or customer-facing workflows, a bad answer can break the product.

That is the quiet failure mode many builders discover late. The demo works. The schema looks simple. The model follows instructions most of the time. Then one production request adds a sentence before the JSON, drops a required field, changes an enum, invents a key, or returns a valid object with unsafe values.

This guide shows how to build an LLM structured output validation layer that catches those failures before they touch production systems.

Why structured output breaks in real apps

Structured output is the bridge between language and software. You ask a model to return a shape like this:

{
  "intent": "refund_request",
  "confidence": 0.87,
  "customer_id": "cus_123",
  "next_action": "open_ticket"
}

Then your app treats that response like data.

The problem is that language models are not normal API servers. They predict text. Even when a provider offers JSON mode, function calling, tool calling, or schema-constrained decoding, your application still owns the safety boundary around the result.

Common production failures include:

  • extra prose before or after JSON
  • missing required fields
  • nullable fields where your app expects strings
  • enum drift, such as cancelled instead of canceled
  • IDs copied from examples instead of real input
  • unsafe tool arguments
  • schema versions mixed across deployments
  • valid JSON that violates business rules
  • silent model fallback that changes output behavior

The fastest way to make structured output reliable is to stop treating parsing as the only problem. Parsing answers one question: "Is this JSON?" Production validation asks a better question: "Can this object safely drive the next step?"
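To make the split concrete, here is a minimal sketch of the parsing step on its own. The name `parseModelJson` and the result codes are hypothetical, and the extraction rule (take the span from the first `{` to the last `}`) is an assumption that will not suit every response format. Note what the function does not do: it never decides whether the object is safe to use.

```typescript
// Result of the syntax layer only: it says nothing about schema or safety.
type ParseResult =
  | { ok: true; value: unknown; hadExtraText: boolean }
  | { ok: false; code: "NO_JSON_OBJECT_FOUND" | "INVALID_JSON" };

// Hypothetical helper: try a strict parse first, then fall back to the
// outermost {...} span so prose around the object does not fail the request.
function parseModelJson(text: string): ParseResult {
  const trimmed = text.trim();
  try {
    const strict: unknown = JSON.parse(trimmed);
    return { ok: true, value: strict, hadExtraText: false };
  } catch {
    // Strict parse failed; fall through to extraction.
  }

  const start = trimmed.indexOf("{");
  const end = trimmed.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return { ok: false, code: "NO_JSON_OBJECT_FOUND" };
  }

  try {
    const value: unknown = JSON.parse(trimmed.slice(start, end + 1));
    return { ok: true, value, hadExtraText: true };
  } catch {
    return { ok: false, code: "INVALID_JSON" };
  }
}
```

Logging `hadExtraText` separately from hard failures is a design choice: it lets you see how often a model wraps its JSON in prose even when the request succeeded.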

The output contract mindset

An output contract is a small agreement between the model and your app:

  1. What shape must the answer have?
  2. Which values are allowed?
  3. Which fields are required for this workflow?
  4. Which fields can be repaired?
  5. Which failures must stop the workflow?
  6. Which schema version produced the object?

This matters because the model is only one part of the system. Your contract also protects the queue worker, webhook handler, database transaction, notification job, and user interface.

A useful contract has three layers:

| Layer | What it checks | Example |
| --- | --- | --- |
| Syntax | Can we parse it? | Valid JSON object |
| Schema | Does it match the type shape? | confidence is a number between 0 and 1 |
| Semantics | Is it safe and true enough for this workflow? | customer_id belongs to the tenant |

Most broken AI workflows stop at layer one. Reliable ones enforce all three.
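The three layers can be sketched as one function that reports which layer rejected the output. This is a dependency-free illustration, not a replacement for a validation library: `checkContract`, the error codes, and the `knownCustomerIds` array (standing in for a lookup against your own tenant data) are all assumptions.

```typescript
type LayerFailure = { ok: false; layer: "syntax" | "schema" | "semantics"; code: string };
type ContractResult = { ok: true; intent: string; confidence: number } | LayerFailure;

const ALLOWED_INTENTS: string[] = ["refund_request", "bug_report", "billing_question", "unknown"];

// knownCustomerIds stands in for a query against your own tenant data (assumption).
function checkContract(rawText: string, knownCustomerIds: string[]): ContractResult {
  // Layer 1, syntax: can we parse it into an object at all?
  let value: unknown;
  try {
    value = JSON.parse(rawText);
  } catch {
    return { ok: false, layer: "syntax", code: "INVALID_JSON" };
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { ok: false, layer: "syntax", code: "NOT_A_JSON_OBJECT" };
  }
  const obj = value as Record<string, unknown>;

  // Layer 2, schema: types, ranges, and allowed values.
  const { intent, confidence, customer_id } = obj;
  if (typeof intent !== "string" || ALLOWED_INTENTS.indexOf(intent) === -1) {
    return { ok: false, layer: "schema", code: "INTENT_NOT_ALLOWED" };
  }
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    return { ok: false, layer: "schema", code: "CONFIDENCE_OUT_OF_RANGE" };
  }
  if (typeof customer_id !== "string") {
    return { ok: false, layer: "schema", code: "CUSTOMER_ID_NOT_STRING" };
  }

  // Layer 3, semantics: is the value true for this tenant?
  if (knownCustomerIds.indexOf(customer_id) === -1) {
    return { ok: false, layer: "semantics", code: "CUSTOMER_NOT_IN_TENANT" };
  }
  return { ok: true, intent, confidence };
}
```

Returning the failing layer, rather than a bare boolean, lets the caller treat a syntax failure (possibly worth a retry) differently from a semantic failure (a reason to stop).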

Start with the smallest schema that can do the job

Large schemas create more failure points. If the next step only needs an intent and a confidence score, do not ask for a full CRM record.

Bad schema:

{
  "customer": {
    "name": "string",
    "email": "string",
    "plan": "string",
    "sentiment": "string",
    "summary": "string",
    "recommendedAction": "string",
    "priority": "string",
    "tags": ["string"],
    "risk": "string"
  }
}

Better schema:

{
  "intent": "refund_request | bug_report | billing_question | unknown",
  "confidence": 0.0,
  "reason": "short string"
}

The second schema is easier to validate, easier to test, and easier to recover from.

A good rule: ask the model for decisions, labels, and short explanations. Fetch authoritative data from your own systems.

Do not ask the model to invent user IDs, invoice IDs, subscription states, or permission levels. Pass those in from trusted services after the model chooses the next step.
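One way to enforce that rule is to give the model's output no identifier fields at all and attach IDs only when building the downstream command. The sketch below assumes hypothetical `ModelDecision`, `TrustedContext`, and `TicketCommand` shapes; the 240-character cap on the note is also an assumption.

```typescript
// What the model is allowed to decide (assumed shape for illustration).
type ModelDecision = { next_action: "open_ticket" | "escalate" | "none"; reason: string };

// What the application already knows from its own session and database.
type TrustedContext = { tenantId: string; customerId: string };

type TicketCommand = {
  action: "open_ticket" | "escalate";
  tenantId: string;
  customerId: string;
  note: string;
};

// Hypothetical helper: identifiers come only from trusted context, so an ID
// the model copied from a prompt example can never reach the command.
function buildCommand(decision: ModelDecision, ctx: TrustedContext): TicketCommand | null {
  if (decision.next_action === "none") return null;
  return {
    action: decision.next_action,
    tenantId: ctx.tenantId,
    customerId: ctx.customerId,
    note: decision.reason.slice(0, 240), // keep the free-text field bounded
  };
}
```

Because `ModelDecision` has no ID fields, there is nothing for the model to hallucinate in that position, and nothing for a later refactor to wire through by accident.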

Use provider features, but do not outsource validation

Modern LLM APIs often support structured outputs through JSON mode, function calling, tool calling, or schema constraints. Use them. They reduce messy parsing problems.

But they are not the whole reliability layer.

Provider-side constraints help with syntax and part of the schema. Your application still needs to validate:

  • tenant ownership
  • authorization
  • field-level business rules
  • maximum amounts
  • date ranges
  • allowed workflow transitions
  • idempotency keys
  • schema version compatibility
  • whether the output is confident enough to automate

Think of provider structured output as a helpful first gate, not the final gate.
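A few of the application-side checks listed above can be sketched as plain functions that run after the provider has returned a well-formed object. The transition table, the 5000-cent ceiling, and the 0.8 confidence threshold below are placeholder assumptions; the real values belong to your workflow.

```typescript
type TicketStatus = "open" | "pending" | "resolved" | "closed";

// Hypothetical transition table; the real one belongs to your workflow.
const ALLOWED_TRANSITIONS: Record<TicketStatus, TicketStatus[]> = {
  open: ["pending", "resolved"],
  pending: ["open", "resolved"],
  resolved: ["closed", "open"],
  closed: [],
};

const MAX_CREDIT_CENTS = 5000; // assumed per-request ceiling
const MIN_AUTOMATION_CONFIDENCE = 0.8; // assumed threshold

type ProposedUpdate = { to: TicketStatus; credit_cents: number; confidence: number };

// Returns the list of violated rules; an empty list means the update may proceed.
function checkBusinessRules(current: TicketStatus, update: ProposedUpdate): string[] {
  const violations: string[] = [];
  if (ALLOWED_TRANSITIONS[current].indexOf(update.to) === -1) {
    violations.push("TRANSITION_NOT_ALLOWED");
  }
  if (update.credit_cents < 0 || update.credit_cents > MAX_CREDIT_CENTS) {
    violations.push("AMOUNT_OUT_OF_RANGE");
  }
  if (update.confidence < MIN_AUTOMATION_CONFIDENCE) {
    violations.push("CONFIDENCE_TOO_LOW_TO_AUTOMATE");
  }
  return violations;
}
```

Collecting every violation instead of stopping at the first one makes the audit log more useful when an output fails for several reasons at once.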

A practical TypeScript validation pattern

Here is a small TypeScript pattern using Zod. The same idea works with Pydantic, Valibot, JSON Schema, or your validation library of choice.

import { z } from "zod";

const TicketIntentSchema = z.object({
  schema_version: z.literal("ticket_intent.v1"),
  intent: z.enum([
    "refund_request",
    "bug_report",
    "billing_question",
    "feature_request",
    "unknown"
  ]),
  confidence: z.number().min(0).max(1),
  reason: z.string().min(1).max(240)
});

type TicketIntent = z.infer<typeof TicketIntentSchema>;

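// `raw` is the already-parsed value (for example, the result of JSON.parse),
// not the raw response text.
export function validateTicketIntent(raw: unknown): TicketIntent {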
  const parsed = TicketIntentSchema.safeParse(raw);

  if (!parsed.success) {
    throw new Error(
      `LLM_OUTPUT_SCHEMA_ERROR: ${parsed.error.issues
        .map(issue => `${issue.path.join(".")}: ${issue.message}`)
        .join("; ")}`
    );
  }

  return parsed.data;
}

This gives you a clear failure mode. The workflow can stop, retry, repair, or route to human review instead of passing a malformed object downstream.

Add semantic checks after schema checks

A schema can tell you that amount_cents is a number. It cannot tell you whether the refund is allowed.

Add semantic validation near the workflow boundary:

type RefundDecision = {
  schema_version: "refund_decision.v1";
  action: "approve" | "deny" | "needs_review";
  confidence: number;
  invoice_id: string;
  amount_cents: number;
  reason: string;
};

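// Assumes `db` is your application's ORM client (a Prisma-style API is shown).
async function validateRefundDecision(
  decision: RefundDecision,
  tenantId: string
) {
  // Scope the lookup by tenant so another tenant's invoice ID resolves to nothing.
  const invoice = await db.invoice.findFirst({
    where: { id: decision.invoice_id, tenant_id: tenantId }
  });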

  if (!invoice) {
    return { ok: false, code: "INVOICE_NOT_FOUND_FOR_TENANT" };
  }

  if (decision.amount_cents > invoice.amount_paid_cents) {
    return { ok: false, code: "REFUND_EXCEEDS_PAYMENT" };
  }

  if (decision.action === "approve" && decision.confidence < 0.9) {
    return { ok: false, code: "LOW_CONFIDENCE_APPROVAL" };
  }

  return { ok: true, invoice };
}

This is where many AI apps become safer. The model can suggest an action, but the system decides whether the action is allowed.

Build a repair loop with a strict budget

Not every invalid output should fail immediately. Some failures are cheap to repair:

  • response has prose wrapped around JSON
  • enum uses a close variant
  • optional field is missing
  • number is returned as a string
  • schema version is missing but the route is known

But repair loops can become expensive and unpredictable. Use a strict budget.

A safe repair policy might be:

  • try normal generation once
  • if parsing fails, attempt one extraction or repair call
  • if schema validation fails, attempt one repair call with exact errors
  • if semantic validation fails, do not repair automatically; route to review or fallback
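The policy above can be written as a small pure function, which keeps the budget easy to test without calling a model. This is a sketch under assumptions: the names are hypothetical, and treating an exhausted budget as "fallback" is one reasonable choice rather than the only one.

```typescript
type FailureKind = "parse" | "schema" | "semantic";
type NextStep = "repair_call" | "route_to_review" | "fallback";

// Assumed budget matching the policy above: one repair per failure kind,
// two repair calls in total, and never an automatic repair for semantics.
const MAX_REPAIRS_PER_KIND = 1;
const MAX_TOTAL_REPAIRS = 2;

// How many repair calls this request has already spent, by failure kind.
type RepairState = { parse: number; schema: number };

function nextStep(failure: FailureKind, used: RepairState): NextStep {
  // Semantic failures are never repaired automatically.
  if (failure === "semantic") return "route_to_review";

  const total = used.parse + used.schema;
  if (used[failure] >= MAX_REPAIRS_PER_KIND || total >= MAX_TOTAL_REPAIRS) {
    return "fallback";
  }
  return "repair_call";
}
```

The caller increments the matching counter each time it makes a repair call and asks `nextStep` again if the repaired output still fails, so the loop cannot run longer than the budget allows.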

Example repair prompt shape:

Return only valid JSON for schema ticket_intent.v1.
Do not add prose.
Fix only the validation errors listed below.

Validation errors:
- confidence: expected number between 0 and 1
- intent: expected one of refund_request, bug_report, billing_question, feature_request, unknown

Original response:
...

The phrase "fix only" matters. Without it, the model may reinterpret the whole task and change fields that were already valid.
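A repair prompt in that shape can be assembled from the validator's error list. In this minimal sketch, `buildRepairPrompt` and the `ValidationIssue` type are hypothetical names; the issue shape mirrors the `path: message` strings produced by the Zod example earlier.

```typescript
type ValidationIssue = { path: string; message: string };

// Hypothetical prompt builder that mirrors the prompt shape above.
function buildRepairPrompt(
  schemaName: string,
  issues: ValidationIssue[],
  originalResponse: string
): string {
  // One bullet per validation error, so the model sees exactly what to fix.
  const errorLines = issues.map(issue => `- ${issue.path}: ${issue.message}`);
  return [
    `Return only valid JSON for schema ${schemaName}.`,
    "Do not add prose.",
    "Fix only the validation errors listed below.",
    "",
    "Validation errors:",
    ...errorLines,
    "",
    "Original response:",
    originalResponse,
  ].join("\n");
}
```

Passing the exact validator messages, instead of a generic "the output was invalid", gives the model a narrow target and makes repair attempts easier to compare in logs.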

Decide what can be automated

Structured output validation is not only about correctness. It is also about control.

Classify actions into risk tiers:

| Tier | Example output | Automation rule |
| --- | --- | --- |
| Low | classify topic, draft summary, route inbox | automate after schema validation |
| Medium | update CRM field, create ticket, send internal notification | require semantic checks and audit log |
| High | refund money, delete data, email customer, change permissions | require approval or a separate deterministic policy |

A model returning valid JSON does not mean the workflow should run.

For high-risk actions, the structured output should create a proposal, not execute the action directly.

{
  "schema_version": "action_proposal.v1",
  "proposed_action": "refund_invoice",
  "invoice_id": "inv_789",
  "amount_cents": 4900,
  "requires_approval": true,
  "reason": "Customer reported duplicate charge."
}

That object can be shown to a human reviewer with the source evidence, tenant context, and audit trail.
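The tier rules can be expressed as a small gate that runs after validation and decides how far the output is allowed to go. The names `gateAction`, `Gate`, and `Checks` are hypothetical, and mapping every failed check to "reject" is a simplifying assumption; a real system might route some failures to review instead.

```typescript
type RiskTier = "low" | "medium" | "high";
type Gate = "execute" | "execute_with_audit_log" | "create_proposal" | "reject";

type Checks = { schemaOk: boolean; semanticOk: boolean };

// Maps the risk tiers described above to an execution decision.
function gateAction(tier: RiskTier, checks: Checks): Gate {
  // Nothing runs on an object that failed schema validation.
  if (!checks.schemaOk) return "reject";
  if (tier === "low") return "execute";

  // Medium and high tiers both need semantic checks to pass.
  if (!checks.semanticOk) return "reject";

  // High-risk output becomes a proposal for approval, never a direct action.
  return tier === "medium" ? "execute_with_audit_log" : "create_proposal";
}
```

With this shape, no combination of passing checks lets a high-tier action return "execute", which is the property the proposal pattern is meant to guarantee.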

Version every schema

Schema drift is a silent killer. You deploy a new prompt that returns priority, but an old worker expects urgency. Or your frontend is updated before your queue consumer. Or a fallback model follows an older example from the prompt.

Add a schema_version field to every structured output.

Good versions are boring:

{
  "schema_version": "ticket_intent.v1",
  "intent": "bug_report",
  "confidence": 0.82,
  "reason": "User says export fails with a 500 error."
}

Then handle versions explicitly:

switch (output.schema_version) {
  case "ticket_intent.v1":
    return handleTicketIntentV1(output);
  default:
    throw new Error("UNSUPPORTED_LLM_OUTPUT_SCHEMA_VERSION");
}

Do not rely on prompt naming alone. Prompts are not runtime contracts.

Log validation failures like product signals

Validation failures are not just errors. They tell you where the product is unclear.

Track:

  • model name and version
  • prompt version
  • schema version
  • validation error category
  • repair attempt count
  • final outcome
  • tenant or plan tier, if appropriate and privacy-safe
  • workflow step
  • latency and token cost

Useful metrics include:

  • parse failure rate
  • schema failure rate
  • semantic failure rate
  • repair success rate
  • invalid output cost
  • automation deflection rate
  • human review rate
  • downstream incident count

If one intent fails validation more than others, the prompt may be unclear. If one model produces more enum drift, route that task elsewhere. If semantic failures spike after a product change, your schema may no longer reflect the workflow.
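Several of these metrics fall out of a simple aggregation over logged validation events. The event shape and the metric definitions below are assumptions made for illustration: `failure` records the first failing layer for a request, and repair success is counted as "needed at least one repair and still ended in automation".

```typescript
type ValidationEvent = {
  failure: "none" | "parse" | "schema" | "semantic"; // first failing layer
  repairAttempts: number;
  finalOutcome: "automated" | "human_review" | "fallback";
};

type Rates = {
  parseFailureRate: number;
  schemaFailureRate: number;
  semanticFailureRate: number;
  repairSuccessRate: number;
  humanReviewRate: number;
};

// Avoids dividing by zero when there are no events in the window.
function share(count: number, total: number): number {
  return total === 0 ? 0 : count / total;
}

function summarize(events: ValidationEvent[]): Rates {
  const total = events.length;
  const count = (pred: (e: ValidationEvent) => boolean): number =>
    events.filter(pred).length;
  const repaired = events.filter(e => e.repairAttempts > 0);

  return {
    parseFailureRate: share(count(e => e.failure === "parse"), total),
    schemaFailureRate: share(count(e => e.failure === "schema"), total),
    semanticFailureRate: share(count(e => e.failure === "semantic"), total),
    repairSuccessRate: share(
      repaired.filter(e => e.finalOutcome === "automated").length,
      repaired.length
    ),
    humanReviewRate: share(count(e => e.finalOutcome === "human_review"), total),
  };
}
```

Running the same aggregation per model, prompt version, or intent is what turns these numbers into the comparisons described above.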

Test with adversarial and boring cases

Most teams test happy paths. Production breaks on weird but normal inputs.

Create a small test set for every output contract:

  • empty user message
  • very long message
  • multilingual message
  • conflicting instructions
  • prompt injection attempt
  • old product terminology
  • copied JSON from docs
  • missing tenant data
  • unsupported request
  • ambiguous request
  • high-risk action request
  • example that looks like a real ID

For each case, assert one of three outcomes:

  1. valid structured output
  2. safe fallback
  3. human review

Avoid tests that only check whether JSON parses. Test the workflow decision.

Example:

it("does not approve refund when invoice belongs to another tenant", async () => {
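  // Annotate the literal so "approve" and the version string keep their literal types.
  const decision: RefundDecision = {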
    schema_version: "refund_decision.v1",
    action: "approve",
    confidence: 0.96,
    invoice_id: "inv_other_tenant",
    amount_cents: 2000,
    reason: "Duplicate charge"
  };

  const result = await validateRefundDecision(decision, "tenant_current");
  expect(result.ok).toBe(false);
  expect(result.code).toBe("INVOICE_NOT_FOUND_FOR_TENANT");
});

Keep prompts and schemas close together

A common mistake is storing prompts in one place and schemas in another. Over time they drift.

Keep these files together:

ai/
  ticket-intent/
    prompt.md
    schema.ts
    examples.jsonl
    evals.test.ts
    README.md

The README should explain:

  • what the contract does
  • what it must never do
  • allowed enum values
  • fallback behavior
  • owner
  • last major change

This makes AI workflows easier to review in pull requests. A reviewer can see when a prompt change affects the contract and whether tests changed with it.

Common mistakes to avoid

Mistake 1: Trusting JSON mode as a full safety system

JSON mode can reduce syntax failures. It does not validate tenant access, business rules, or workflow risk.

Mistake 2: Asking for too much in one object

One giant schema often hides multiple decisions. Split classification, extraction, and action proposal into separate contracts when possible.

Mistake 3: Automatically repairing semantic failures

If the model suggests refunding more than the invoice amount, do not ask it to "try again" until the output happens to pass. Stop the workflow.

Mistake 4: Ignoring low-confidence valid outputs

A perfectly valid object with confidence: 0.41 should not drive irreversible automation.

Mistake 5: Forgetting schema versions

Every contract should include a version. Your future migrations will be calmer.

A simple rollout plan

If your app already uses structured LLM output, start here:

  1. List every workflow where model output becomes application data.
  2. Mark the workflows that can write, send, charge, delete, or update records.
  3. Add schema validation to the highest-risk workflow first.
  4. Add semantic checks for tenant ownership and business rules.
  5. Add one repair attempt for syntax or schema failures.
  6. Add human review for high-risk semantic failures.
  7. Log validation outcomes.
  8. Build a small regression test set from real failures.
  9. Add schema versions.
  10. Review metrics weekly until failure rates stabilize.

You do not need a giant platform to start. One schema, one validator, and one clear stop condition can prevent the most painful incidents.

Content map for builders

This topic belongs in a broader production AI architecture cluster.

  • Pillar: production AI application architecture
  • Cluster: output reliability, workflow safety, schema validation, model routing
  • Search intent: practical implementation guide
  • Funnel stage: middle; the reader has built or is building AI features and needs reliability
  • Internal link targets: agent observability, claim verification, evaluation harness, approval gates, model failover
  • Next useful articles: schema migration for AI workflows, semantic validation for tool calls, regression testing structured AI outputs

FAQ

What is LLM structured output validation?

LLM structured output validation is the process of checking model responses against syntax, schema, and business rules before using them in software workflows. It makes sure the response is not only valid JSON, but safe for the next step.

Is JSON mode enough for production AI apps?

JSON mode helps, but it is not enough by itself. It can improve formatting, but your app still needs schema checks, authorization checks, semantic validation, logging, and fallback behavior.

What is the difference between parsing and validation?

Parsing checks whether a response can be read as JSON or another format. Validation checks whether the parsed object matches your expected fields, types, allowed values, and workflow rules.

Should invalid LLM output be repaired automatically?

Only low-risk syntax and schema failures should be repaired automatically, and only with a strict retry budget. Semantic failures, permission failures, and high-risk action failures should stop the workflow or require review.

Why should every AI output schema include a version?

A schema version prevents silent drift between prompts, models, workers, and frontends. It lets your app reject unsupported shapes and migrate contracts safely.

Which tools can validate structured LLM output?

Common options include Zod, Pydantic, JSON Schema, Valibot, TypeBox, and framework-specific parsers in LangChain or LlamaIndex. The best choice is usually the validation tool your codebase already understands.
