Hardik Mehta for Aviasole Technologies

Posted on Jul 1

Stop Guessing: Guaranteed Structured Output from LLMs in Node.js

#node #typescript #llm #ai

Stop Parsing LLM JSON. The Backend Already Knows Better.

If you've built anything with LLMs, you've probably written code like this:

const raw = await openai.chat.completions.create({ ... });

let parsed;

try {
  parsed = JSON.parse(raw.choices[0].message.content);
} catch {
  // retry?
  // strip markdown?
  // log?
  // cry?
}

Then you added retry logic.

Then the model started wrapping JSON in Markdown code fences.

Then it added a trailing comma.

Then you added another regex.

Eventually your structured output pipeline became more complicated than the feature you were trying to build.
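
To make that concrete, here is a minimal sketch of where such a pipeline tends to end up. The helper name parseLooseJson is hypothetical, and the two patches it applies are only the ones mentioned above, not an exhaustive list.

```typescript
// Hypothetical helper, for illustration only.
const FENCE = "`".repeat(3); // a Markdown code fence marker

export function parseLooseJson(raw: string): unknown {
  let text = raw.trim();

  // 1. Unwrap a surrounding Markdown code fence, with or without a "json" tag.
  const fenced = text.match(
    new RegExp("^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$", "i"),
  );
  if (fenced) text = fenced[1] ?? text;

  // 2. Drop trailing commas before a closing brace or bracket.
  //    Caveat: this also rewrites ",}" sequences inside string values.
  text = text.replace(/,\s*([}\]])/g, "$1");

  // 3. Still throws on anything the two patches above did not anticipate.
  return JSON.parse(text);
}
```

Every new failure mode adds another numbered step, and each step can corrupt input that was already valid.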

The interesting part is:

This isn't actually an LLM problem.

It's a backend capability problem.

Not all structured output is equal

Most libraries expose a single "structured output" API regardless of which model you're using.

Under the hood, the providers behind that API work very differently.

OpenAI

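OpenAI enforces the schema on the server: with strict Structured Outputs, generation is constrained to the JSON Schema you supply.

A completed response parses and matches the schema. The remaining failure cases are explicit rather than silent: a refusal, or output cut off by the token limit.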

Guarantee: Native.
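
As a rough sketch, this is what a strict request body for the Chat Completions endpoint looks like, along with a reader for the response. The function names, the schema name essay_rating, and the model gpt-4o-mini are assumptions chosen for illustration. The reader checks the two cases where a response is not a complete schema-conforming object: a refusal, and a finish_reason of "length".

```typescript
// JSON Schema for the expected object. Strict mode expects every property
// to be listed in `required` and `additionalProperties` to be false.
const essayRatingSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    score: { type: "number" },
  },
  required: ["name", "score"],
  additionalProperties: false,
} as const;

// Builds the JSON body for POST /v1/chat/completions.
// The default model name is an assumption; use one that supports strict mode.
export function buildOpenAIRequest(prompt: string, model = "gpt-4o-mini") {
  return {
    model,
    messages: [{ role: "user" as const, content: prompt }],
    response_format: {
      type: "json_schema" as const,
      json_schema: { name: "essay_rating", strict: true, schema: essayRatingSchema },
    },
  };
}

// The subset of a response choice that this sketch reads.
interface Choice {
  finish_reason: string;
  message: { content: string | null; refusal?: string | null };
}

// Returns the parsed object, or throws on the explicit failure cases.
export function readStructured(choice: Choice): unknown {
  if (choice.message.refusal) {
    throw new Error("model refused: " + choice.message.refusal);
  }
  if (choice.finish_reason === "length") {
    throw new Error("output was truncated by the token limit");
  }
  if (choice.message.content === null) {
    throw new Error("response contained no content");
  }
  return JSON.parse(choice.message.content);
}
```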

Groq

Groq provides a similar native structured output mechanism.

Again, validation happens before the response reaches your application.

Guarantee: Native.

Ollama

This is where things get interesting.

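Ollama supports grammar-constrained decoding. The JSON schema you pass in a request's format field is converted into a GBNF grammar, the grammar format used by llama.cpp.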

Instead of validating JSON after generation, the grammar is applied during token sampling.

That means the model literally cannot emit a token that violates your schema.

There is no malformed JSON to parse.

There is no retry loop because of formatting mistakes.

There is no JSON.parse() risk from formatting, as long as generation isn't cut off by a token limit.

For local models, this is probably the strongest structured output guarantee currently available.
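
The idea of applying a grammar during sampling can be shown with a toy example. This is not Ollama's implementation: a real decoder masks logits across the whole vocabulary and tracks grammar state after every token. The names Candidate and pickConstrained are made up for this sketch.

```typescript
interface Candidate {
  token: string;
  logit: number; // the model's raw preference for this token
}

// Greedy selection restricted to the tokens the grammar allows right now.
export function pickConstrained(
  candidates: Candidate[],
  allowed: (token: string) => boolean,
): string {
  let best: Candidate | undefined;
  for (const c of candidates) {
    if (!allowed(c.token)) continue; // masked out: it can never be chosen
    if (best === undefined || c.logit > best.logit) best = c;
  }
  if (best === undefined) {
    throw new Error("grammar allows none of the candidate tokens");
  }
  return best.token;
}

// At the start of a JSON object the grammar permits only "{".
const candidates: Candidate[] = [
  { token: "Sure", logit: 5.1 }, // the model's favorite, but off-grammar
  { token: "{", logit: 2.3 },
  { token: "[", logit: 1.0 },
];

const next = pickConstrained(candidates, (t) => t === "{");
console.log(next); // prints "{"
```

The model's preferred token loses here because it was never eligible, which is a different thing from being rejected after the fact.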

Anthropic

Anthropic currently has no equivalent enforcement mechanism.

The common approach is:

Prompt carefully
Parse the response
Validate
Retry if necessary

It works well, but it's still best effort, not guaranteed.
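
Those four steps can be sketched as a small provider-agnostic loop. Everything here is hypothetical: callModel stands in for whatever client call you use, and isRating is a hand-written check where you would normally use a schema library.

```typescript
type Validator<T> = (value: unknown) => value is T;

export async function generateWithRetry<T>(
  callModel: (prompt: string) => Promise<string>, // assumed: returns raw model text
  isValid: Validator<T>,
  prompt: string,
  maxAttempts = 3,
): Promise<{ data: T; attempts: number }> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const parsed: unknown = JSON.parse(await callModel(prompt));
      if (isValid(parsed)) return { data: parsed, attempts: attempt };
      lastError = "response did not match the expected shape";
    } catch (err) {
      lastError = String(err); // malformed JSON or a failed call
    }
  }
  throw new Error(
    "no valid response after " + maxAttempts + " attempts: " + lastError,
  );
}

// Hand-written shape check for { name: string; score: number }.
export const isRating = (v: unknown): v is { name: string; score: number } => {
  if (typeof v !== "object" || v === null) return false;
  const record = v as Record<string, unknown>;
  return typeof record.name === "string" && typeof record.score === "number";
};
```

Returning the attempt count alongside the data gives you something to log, which is how you find out how often "best effort" actually needs a second try.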

Why this matters

Many libraries abstract all providers behind exactly the same API.

That sounds convenient.

But it also hides an important architectural difference.

Treating every provider the same means you're either:

trusting outputs more than you should, or
ignoring guarantees your backend already provides.

Knowing the guarantee level lets you design better systems.

For example:

Fewer retries
Simpler parsing
Better observability
More predictable production behavior
Cleaner error handling

A practical example

Here's a structured output example using Ollama with Zod.

import { generate, ollama } from "@aviasole/shapecraft";
import { z } from "zod";

const result = await generate(
  ollama({ model: "llama3.2" }),
  z.object({
    name: z.string(),
    score: z.number(),
  }),
  "Rate this essay."
);

console.log(result.data);

The returned object is fully typed, and with Ollama the schema is enforced through GBNF grammar during generation.
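
For comparison, here is a sketch of the same call made directly against Ollama's /api/chat endpoint, without any library. It assumes a local Ollama server on the default port and a recent enough version to accept a JSON schema in format. The function names are made up for this example. Note that the content still arrives as a string, so JSON.parse is still called; the grammar is what makes that call safe.

```typescript
const ratingSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    score: { type: "number" },
  },
  required: ["name", "score"],
} as const;

// Builds the JSON body for POST /api/chat with a schema-constrained format.
export function buildOllamaRequest(prompt: string, model = "llama3.2") {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    format: ratingSchema, // Ollama constrains sampling to this schema
    stream: false, // return one JSON response instead of a stream
  };
}

// Sends the request. The base URL is an assumption (Ollama's default port).
export async function rateEssay(
  prompt: string,
  baseUrl = "http://localhost:11434",
): Promise<{ name: string; score: number }> {
  const res = await fetch(baseUrl + "/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(prompt)),
  });
  if (!res.ok) throw new Error("Ollama request failed with status " + res.status);
  const body = (await res.json()) as { message: { content: string } };
  return JSON.parse(body.message.content) as { name: string; score: number };
}
```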

Different backends, different guarantees

Thinking about structured output in terms of guarantee levels makes provider behavior much easier to reason about.

Backend	Guarantee	Enforcement
OpenAI	Native	Server-side schema validation
Groq	Native	Server-side schema validation
Ollama	Constrained	Token-level GBNF grammar constraints
Anthropic	Best effort	Prompt + validation + retries

The API surface might look identical.

The reliability characteristics are not.

Building around guarantee levels

I've been experimenting with making these differences explicit rather than hiding them.

In the library I maintain, every generation returns the guarantee level alongside the parsed result.

const result = await generate(model, schema, prompt);

result.guaranteeLevel;
// "native"
// "constrained"
// "best-effort"

Instead of pretending every backend behaves the same, your application can make informed decisions based on the actual guarantees available.
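
One way an application might act on that value is sketched below. The three level strings come from the snippet above; HandlingPolicy, policyFor, and the specific numbers are illustrative assumptions, not part of any library.

```typescript
type GuaranteeLevel = "native" | "constrained" | "best-effort";

interface HandlingPolicy {
  maxFormatRetries: number; // retries budgeted for malformed output
  parseFailureSeverity: "warn" | "error"; // how loudly to log a parse failure
}

export function policyFor(level: GuaranteeLevel): HandlingPolicy {
  switch (level) {
    case "native":
    case "constrained":
      // Formatting failures should not happen here, so do not budget
      // retries for them and treat any occurrence as a real incident.
      return { maxFormatRetries: 0, parseFailureSeverity: "error" };
    case "best-effort":
      // Formatting failures are expected now and then: retry quietly.
      return { maxFormatRetries: 3, parseFailureSeverity: "warn" };
  }
}
```

The same parse error means "business as usual" on one backend and "something is broken" on another, and the policy makes that difference visible in logs and alerts.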

Final thoughts

Structured output isn't just about getting valid JSON.

It's about understanding how that JSON became valid.

There's a big difference between:

"the model happened to follow the prompt,"
"the server enforced the schema," and
"the model physically could not generate an invalid token."

Those differences matter once you're building production systems.

Disclosure: I'm one of the maintainers of ShapeCraft, an open-source library that exposes structured output guarantee levels across OpenAI, Groq, Ollama, and Anthropic instead of treating every backend as identical.

I'd love to hear how you're handling structured output reliability in production.

Top comments (2)

James O'Connor • Jul 3

Good breakdown, and the OpenAI/Groq/Ollama split is real. One distinction I would add, because it is where this bites later: a grammar or schema guarantee gets you a valid object, not a correct one. GBNF constrained decoding means the model cannot emit a token off-schema, which is genuinely useful, but the value it fills is still whatever the model believed, and "valid and confidently wrong" survives every mechanism on this list. We had a required date field that always parsed, always matched the type, and was occasionally two years off because the source did not actually contain it. Native enforcement removes the JSON.parse class of bugs and is worth doing. It does not touch the schema-valid-but-semantically-wrong class, and that needs a separate content check, ideally one that can return "no evidence for this field" instead of a plausible guess. Worth being explicit that constrained decoding addresses the first problem and not the second, because it is easy to ship thinking you covered both.

Hardik Mehta Aviasole Technologies • Jul 3

You're right, and it's an important distinction we glossed over.

Constrained decoding eliminates the structural failure class - malformed JSON, wrong types, missing required fields. It says nothing about semantic correctness. The model still fills values from its priors when evidence is absent, and "valid and confidently wrong" is harder to catch than a parse error because it looks like success.

Your date example is exactly the failure mode: the field passes every check we expose, but the value is fabricated. No retry logic catches it because nothing throws.

The honest scope of what ShapeCraft addresses: structural guarantees and parsing. Semantic validation - "was there actually evidence for this field?" - is a separate layer that needs either a confidence signal from the model, a retrieval grounding check, or a second-pass verifier. We haven't touched that and shouldn't imply otherwise.

Worth adding a note to the README that makes this explicit. Thanks for naming it precisely.