Stop Parsing LLM JSON. The Backend Already Knows Better.
If you've built anything with LLMs, you've probably written code like this:
const raw = await openai.chat.completions.create({ ... });
let parsed;
try {
  parsed = JSON.parse(raw.choices[0].message.content);
} catch {
  // retry?
  // strip markdown?
  // log?
  // cry?
}
Then you added retry logic.
Then the model started wrapping JSON in Markdown code fences.
Then it added a trailing comma.
Then you added another regex.
Eventually your structured output pipeline became more complicated than the feature you were trying to build.
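For illustration, this is roughly where that accretion ends up. The helper name parseLooseJson is hypothetical, and the trailing-comma regex is deliberately naive: it would also rewrite a matching comma inside a string value, which is the kind of bug these cleanup layers tend to introduce.

```typescript
// Hypothetical cleanup helper: the kind of code that accumulates around JSON.parse.
const FENCE = "`".repeat(3); // three backticks, built indirectly to keep the snippet readable

function parseLooseJson(text: string): unknown {
  let body = text.trim();

  // Patch 1: unwrap a Markdown code fence, with or without a "json" label.
  const fenced = body.match(
    new RegExp("^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$", "i")
  );
  if (fenced && fenced[1] !== undefined) {
    body = fenced[1];
  }

  // Patch 2: drop trailing commas before } or ].
  // Naive: this also touches matching text inside string values.
  body = body.replace(/,\s*([}\]])/g, "$1");

  return JSON.parse(body);
}
```

Every patch handles one observed failure and leaves the next one unhandled, so the helper keeps growing.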
The interesting part is:
This isn't actually an LLM problem.
It's a backend capability problem.
Not all structured output is equal
Most libraries expose a single "structured output" API regardless of which model you're using.
Under the hood, those providers work very differently.
OpenAI
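With Structured Outputs in strict mode (a response_format of type json_schema with strict: true), OpenAI enforces the schema server-side.
A completed response conforms to the schema. You still need to handle refusals and responses cut off by the token limit, but not malformed JSON.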
Guarantee: Native.
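A minimal sketch of what that looks like against the Chat Completions API. The response_format shape and the refusal and finish_reason fields come from OpenAI's API; the model name, the schema name, and the helper names buildRequest and readStructured are assumptions chosen for illustration.

```typescript
// JSON Schema for { name: string, score: number }.
// Strict mode requires additionalProperties: false and every property listed in required.
const essayRatingSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    score: { type: "number" },
  },
  required: ["name", "score"],
  additionalProperties: false,
} as const;

// Request body for POST /v1/chat/completions. The model name is an assumption.
function buildRequest(prompt: string) {
  return {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    response_format: {
      type: "json_schema",
      json_schema: { name: "essay_rating", strict: true, schema: essayRatingSchema },
    },
  };
}

// Minimal slice of one entry in the response's choices array.
interface Choice {
  finish_reason: string;
  message: { content: string | null; refusal?: string | null };
}

function readStructured(choice: Choice): unknown {
  if (choice.message.refusal) {
    throw new Error("model refused: " + choice.message.refusal);
  }
  if (choice.finish_reason === "length" || choice.message.content === null) {
    throw new Error("no complete JSON content in the response");
  }
  // Safe to parse: the response finished normally under strict mode.
  return JSON.parse(choice.message.content);
}
```

The error paths that remain are about refusals and truncation, not about formatting.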
Groq
Groq provides a similar native structured output mechanism.
Again, validation happens before the response reaches your application.
Guarantee: Native.
Ollama
This is where things get interesting.
Ollama accepts a JSON schema in its format parameter and turns it into a GBNF grammar, the grammar format used by llama.cpp.
Instead of validating JSON after generation, the grammar is applied during token sampling.
That means the model literally cannot emit a token that violates your schema.
There is no malformed JSON to parse.
There is no retry loop because of formatting mistakes.
There is no JSON.parse() risk from formatting, provided generation isn't cut off by a token limit.
For local models, this is probably the strongest structured output guarantee currently available.
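To make "applied during token sampling" concrete, here is a toy sketch of the idea. It is not GBNF and not Ollama's implementation: the grammar is replaced by a whitelist of two valid outputs, and the names VALID_OUTPUTS, isValidPrefix, and pickConstrained are made up for the example.

```typescript
// Toy stand-in for a grammar: the only strings the "schema" accepts.
const VALID_OUTPUTS = ['{"ok":true}', '{"ok":false}'];

interface Candidate {
  token: string;
  logit: number; // the model's raw preference for this token
}

// A partial output is acceptable if some valid output starts with it.
function isValidPrefix(text: string): boolean {
  return VALID_OUTPUTS.some((valid) => valid.startsWith(text));
}

// Mask out candidates that would break the grammar, then pick the best survivor.
function pickConstrained(prefix: string, candidates: Candidate[]): string {
  const allowed = candidates.filter((c) => isValidPrefix(prefix + c.token));
  if (allowed.length === 0) {
    throw new Error("no candidate token satisfies the grammar");
  }
  return allowed.reduce((best, c) => (c.logit > best.logit ? c : best)).token;
}

const nextToken = pickConstrained('{"ok":', [
  { token: " Sure", logit: 9.1 }, // the model's favorite, but invalid here
  { token: "true", logit: 2.3 },
  { token: "false", logit: 1.7 },
]);
console.log(nextToken); // prints: true
```

The model's preferred token is never a candidate, so there is nothing to repair afterwards. A real grammar engine does the same masking against the full vocabulary at every step.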
Anthropic
Anthropic currently has no equivalent enforcement mechanism.
The common approach is:
- Prompt carefully
- Parse the response
- Validate
- Retry if necessary
It works well, but it's still best effort, not guaranteed.
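The steps above can be sketched as a small loop. This is a generic illustration, not Anthropic's SDK: CallModel is a stand-in for whatever function sends the prompt and returns the reply text, and the hand-written isEssayRating check takes the place of a schema library.

```typescript
// Stand-in for a provider call: takes a prompt, resolves to the raw text reply.
type CallModel = (prompt: string) => Promise<string>;

interface EssayRating {
  name: string;
  score: number;
}

// Hand-written validator; a schema library would normally do this.
function isEssayRating(value: unknown): value is EssayRating {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.score === "number";
}

async function generateBestEffort(
  callModel: CallModel,
  prompt: string,
  maxAttempts = 3
): Promise<{ data: EssayRating; attempts: number }> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await callModel(prompt);
    try {
      const parsed: unknown = JSON.parse(text);
      if (isEssayRating(parsed)) return { data: parsed, attempts: attempt };
      lastError = "reply parsed but did not match the schema";
    } catch {
      lastError = "reply was not valid JSON";
    }
  }
  throw new Error("gave up after " + maxAttempts + " attempts: " + lastError);
}
```

Returning the attempt count alongside the data makes it easy to log how often the best-effort path actually needs a retry.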
Why this matters
Many libraries abstract all providers behind exactly the same API.
That sounds convenient.
But it also hides an important architectural difference.
Treating every provider the same means you're either:
- trusting outputs more than you should, or
- ignoring guarantees your backend already provides.
Knowing the guarantee level lets you design better systems.
For example:
- Fewer retries
- Simpler parsing
- Better observability
- More predictable production behavior
- Cleaner error handling
A practical example
Here's a structured output example using Ollama with Zod.
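import { generate, ollama } from "@aviasole/shapecraft";
import { z } from "zod";

const result = await generate(
  // Backend: a local model served by Ollama
  ollama({ model: "llama3.2" }),
  // Zod schema describing the shape we want back
  z.object({
    name: z.string(),
    score: z.number(),
  }),
  "Rate this essay."
);

// result.data is typed from the schema
console.log(result.data);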
The returned object is fully typed, and with Ollama the schema is enforced through GBNF grammar during generation.
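As a rough picture of what this means at the HTTP level, the sketch below builds a request for Ollama's /api/chat endpoint, which accepts a JSON schema in its format field. This is an assumption about what a client needs to send, not a description of ShapeCraft's internals; the helper name buildOllamaChatRequest and the hand-written schema are illustrative.

```typescript
// JSON Schema equivalent of z.object({ name: z.string(), score: z.number() }),
// written by hand here to keep the sketch dependency-free.
const ratingSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    score: { type: "number" },
  },
  required: ["name", "score"],
};

// Body for POST http://localhost:11434/api/chat (Ollama's default local address).
function buildOllamaChatRequest(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    format: ratingSchema, // Ollama constrains generation to this schema
    stream: false, // ask for a single JSON response instead of streamed chunks
  };
}

const ollamaBody = buildOllamaChatRequest("llama3.2", "Rate this essay.");
console.log(JSON.stringify(ollamaBody, null, 2));
```

The reply's message.content is still a JSON string, so the client parses it once; the difference is that the string was shaped by the schema while it was being generated.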
Different backends, different guarantees
Thinking about structured output in terms of guarantee levels makes provider behavior much easier to reason about.
| Backend | Guarantee | Enforcement |
|---|---|---|
| OpenAI | Native | Server-side schema validation |
| Groq | Native | Server-side schema validation |
| Ollama | Constrained | Token-level GBNF grammar constraints |
| Anthropic | Best effort | Prompt + validation + retries |
The API surface might look identical.
The reliability characteristics are not.
Building around guarantee levels
I've been experimenting with making these differences explicit rather than hiding them.
In the library I maintain, every generation returns the guarantee level alongside the parsed result.
const result = await generate(model, schema, prompt);
result.guaranteeLevel; // "native" | "constrained" | "best-effort"
Instead of pretending every backend behaves the same, your application can make informed decisions based on the actual guarantees available.
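One way an application might act on that value is a small policy lookup. The three level names are the ones shown above; the OutputPolicy shape, the policyFor helper, and the specific numbers are assumptions for the sake of the example, not part of the library.

```typescript
// The three levels come from the article; the policy values are assumptions.
type GuaranteeLevel = "native" | "constrained" | "best-effort";

interface OutputPolicy {
  maxFormatRetries: number; // retries reserved for malformed or schema-violating output
  validateAfterParse: boolean; // re-check the parsed value against the schema
}

function policyFor(level: GuaranteeLevel): OutputPolicy {
  switch (level) {
    case "native":
    case "constrained":
      // Shape is enforced upstream, so skip formatting retries.
      // Business-rule checks (for example a score range) are still up to you.
      return { maxFormatRetries: 0, validateAfterParse: false };
    case "best-effort":
      // Nothing upstream enforces the shape: validate and budget for retries.
      return { maxFormatRetries: 2, validateAfterParse: true };
  }
}

console.log(policyFor("best-effort").maxFormatRetries); // prints: 2
```

Keeping the policy in one place means a backend swap changes retry behavior without touching the calling code.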
Final thoughts
Structured output isn't just about getting valid JSON.
It's about understanding how that JSON became valid.
There's a big difference between:
- "the model happened to follow the prompt,"
- "the server enforced the schema," and
- "the model physically could not generate an invalid token."
Those differences matter once you're building production systems.
Disclosure: I'm one of the maintainers of ShapeCraft, an open-source library that exposes structured output guarantee levels across OpenAI, Groq, Ollama, and Anthropic instead of treating every backend as identical.
- GitHub: https://github.com/aviasoletechnologies/shapecraft
- npm: https://www.npmjs.com/package/@aviasole/shapecraft
I'd love to hear how you're handling structured output reliability in production.