Muhammet ŞAFAK

Posted on Jun 26

Getting structured JSON out of five incompatible LLM APIs — and degrading when they ignore you

#ai #llm #go #json

The parser as the real system contract

CommitBrief renders a code review as cards, JSON schema v1, or a CI exit code — which means the LLM has to hand back structured findings, not prose. Every provider can do that. The catch is that no two of them do it the same way, and some don't really do it at all.

There's exactly one schema the whole system targets. Getting four native APIs to honor it takes four completely different mechanisms; getting three more is a matter of asking nicely and not trusting the answer. This is how that works, and what happens when a model ignores the contract anyway.

TL;DR

One schema, many dialects. Every provider targets the same Finding shape, expressed through whatever structured-output mechanism that vendor offers — tool_use, strict json_schema, responseSchema, or just format: "json".
Structured output is a spectrum, not a guarantee. It runs from "the API enforces the shape" down to "we asked in the prompt."
The real contract is your parser. One ParseFindings validates every provider's output the same way; failures retry once, then degrade to Markdown with a warning. The pipeline never crashes on a bad response.
The limit. A schema makes output parseable, not correct. It can't stop a model from inventing a plausible-but-wrong finding.

The one schema everyone targets

A finding is a flat struct of nine fields: five the parser validates (severity, file, title, description, suggestion), a line number, and three optional extras. Severity is drawn from a closed vocabulary:

type Finding struct {
    Severity    Severity `json:"severity"`     // one of five, below
    File        string   `json:"file"`
    Line        int      `json:"line"`
    LineEnd     int      `json:"line_end,omitempty"`
    Title       string   `json:"title"`
    Description string   `json:"description"`
    Suggestion  string   `json:"suggestion"`
    Language    string   `json:"language,omitempty"`
    Snippet     string   `json:"snippet,omitempty"`
}

const (
    SeverityCritical Severity = "critical"
    SeverityHigh     Severity = "high"
    SeverityMedium   Severity = "medium"
    SeverityLow      Severity = "low"
    SeverityInfo     Severity = "info"
)

The envelope is {"findings": [ ... ]} and nothing else. That severity vocabulary is the wire contract with the model — deliberately English-only and fixed in code, so a user's custom COMMITBRIEF.md can change the rules of a review but never the shape of its output. Everything downstream — the cards renderer, --json, --fail-on=high — depends on those five strings meaning exactly five things.
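A closed vocabulary is cheap to enforce in code. Here is a minimal sketch of how the severity type could validate itself and support a threshold check like --fail-on=high. IsValid is the method the parser calls; rank and AtLeast are hypothetical names introduced for illustration, and the ordering (info lowest, critical highest) is assumed from the names.

```go
package main

// Severity is the closed, English-only vocabulary shared with the model.
type Severity string

const (
	SeverityCritical Severity = "critical"
	SeverityHigh     Severity = "high"
	SeverityMedium   Severity = "medium"
	SeverityLow      Severity = "low"
	SeverityInfo     Severity = "info"
)

// rank orders severities from least to most severe; 0 means "unknown".
// Hypothetical helper, not taken from the CommitBrief source.
func (s Severity) rank() int {
	switch s {
	case SeverityInfo:
		return 1
	case SeverityLow:
		return 2
	case SeverityMedium:
		return 3
	case SeverityHigh:
		return 4
	case SeverityCritical:
		return 5
	}
	return 0
}

// IsValid reports whether s is exactly one of the five known values.
// There is no trimming, case-folding, or synonym mapping here.
func (s Severity) IsValid() bool { return s.rank() > 0 }

// AtLeast reports whether s meets or exceeds threshold.
// An invalid value on either side never passes (assumed gate behavior).
func (s Severity) AtLeast(threshold Severity) bool {
	return s.IsValid() && threshold.IsValid() && s.rank() >= threshold.rank()
}
```

With this shape, Severity("blocker").IsValid() is false and SeverityMedium.AtLeast(SeverityHigh) is false, so a gate built on AtLeast can only ever fire on one of the five known strings.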

Four native dialects for the same shape

The native API providers each enforce that schema through their own mechanism. Same target, four wire formats.

Anthropic — a forced tool call. The findings schema is registered as a tool, and tool_choice makes calling it non-optional:

params.Tools      = []sdk.ToolUnionParam{buildReportTool()}  // schema as "report_findings"
params.ToolChoice = sdk.ToolChoiceParamOfTool(toolName)      // must call it
// tool description: "Emit the review as structured findings. Always call this tool."
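At the wire level, forcing the call comes down to two fields in the Messages API request body: a tools entry whose input_schema is the findings schema, and a tool_choice of type "tool" that names it. The sketch below builds only those two fields with encoding/json instead of the SDK. forcedToolFields and the struct names are hypothetical; the tool name and description are the ones quoted above.

```go
package main

import "encoding/json"

// toolDescription is the description quoted in the article.
const toolDescription = "Emit the review as structured findings. Always call this tool."

// tool mirrors one entry of the request's "tools" array.
type tool struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	InputSchema map[string]any `json:"input_schema"`
}

// toolChoice with Type "tool" makes calling the named tool mandatory.
type toolChoice struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

// forcedToolFields marshals just the tool-related part of a request body.
// A real request also carries model, max_tokens, and messages.
func forcedToolFields(name string, schema map[string]any) ([]byte, error) {
	body := struct {
		Tools      []tool     `json:"tools"`
		ToolChoice toolChoice `json:"tool_choice"`
	}{
		Tools:      []tool{{Name: name, Description: toolDescription, InputSchema: schema}},
		ToolChoice: toolChoice{Type: "tool", Name: name},
	}
	return json.Marshal(body)
}
```

The findings then come back as the input of a tool_use content block rather than as message text, so the envelope is read from the tool call.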

OpenAI — strict json_schema. With Strict set, the Chat Completions API holds the response to the schema server-side — and refuses the request outright rather than fall through to a model that would ignore it:

func buildResponseFormat() sdk.ChatCompletionNewParamsResponseFormatUnion {
    return sdk.ChatCompletionNewParamsResponseFormatUnion{
        OfJSONSchema: &shared.ResponseFormatJSONSchemaParam{
            JSONSchema: shared.ResponseFormatJSONSchemaJSONSchemaParam{
                Name:        schemaName,
                Description: sdk.String("Structured findings for a code review."),
                Strict:      sdk.Bool(true),
                Schema:      responseSchema,
            },
        },
    }
}

(Responses-API-only models express the same schema through a text.format json_schema config instead — one more dialect for the identical shape.)
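Strict mode only accepts a subset of JSON Schema: every object has to set additionalProperties to false and list all of its properties under required, so optional fields are usually expressed as nullable instead of being left out. The sketch below builds the findings envelope in that form as plain maps. How CommitBrief's own responseSchema treats the three omitempty fields isn't shown in this post, so the nullable handling here is an assumption, and strictObject and findingsSchema are hypothetical names.

```go
package main

import "sort"

// strictObject builds an object node in the form strict mode accepts:
// every property is required and no additional properties are allowed.
func strictObject(props map[string]any) map[string]any {
	required := make([]string, 0, len(props))
	for name := range props {
		required = append(required, name)
	}
	sort.Strings(required) // map iteration is random; keep the output stable
	return map[string]any{
		"type":                 "object",
		"properties":           props,
		"required":             required,
		"additionalProperties": false,
	}
}

// findingsSchema describes {"findings": [ ... ]} with the severity enum.
// Optional fields are modeled as nullable (an assumption, see lead-in).
func findingsSchema() map[string]any {
	str := map[string]any{"type": "string"}
	optStr := map[string]any{"type": []string{"string", "null"}}
	finding := strictObject(map[string]any{
		"severity": map[string]any{
			"type": "string",
			"enum": []string{"critical", "high", "medium", "low", "info"},
		},
		"file":        str,
		"line":        map[string]any{"type": "integer"},
		"line_end":    map[string]any{"type": []string{"integer", "null"}},
		"title":       str,
		"description": str,
		"suggestion":  str,
		"language":    optStr,
		"snippet":     optStr,
	})
	return strictObject(map[string]any{
		"findings": map[string]any{"type": "array", "items": finding},
	})
}
```

On the Go side a nullable field needs no special handling: encoding/json leaves an int or string untouched when it decodes a JSON null, so the Finding struct keeps its zero values.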

Gemini — a response schema plus a MIME type. You hand the SDK a *Schema value and tell it to return JSON:

cfg.ResponseMIMEType = "application/json"
cfg.ResponseSchema   = responseSchema()  // the Findings envelope as a *genai.Schema

Ollama — format: "json", and that's all it promises. A local model can be told to emit JSON, but the flag constrains syntax, not shape:

Format: "json", // valid JSON guaranteed; the right keys are not

That distinction matters. Anthropic, OpenAI, and Gemini constrain the structure; Ollama only guarantees the output parses as some JSON. The schema conformance has to come from somewhere else.
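To make the gap concrete, here is a small sketch comparing the two checks. syntaxOK is everything a JSON-syntax mode promises; shapeOK is roughly what still has to be verified afterwards. Both function names are hypothetical, and shapeOK only inspects the envelope, not the individual findings.

```go
package main

import (
	"encoding/json"
	"strings"
)

// envelope is the only top-level shape the pipeline accepts.
type envelope struct {
	Findings []json.RawMessage `json:"findings"`
}

// syntaxOK reports whether raw is well-formed JSON of any shape.
func syntaxOK(raw string) bool {
	return json.Valid([]byte(raw))
}

// shapeOK reports whether raw is an object with a "findings" array
// and no unknown top-level keys.
func shapeOK(raw string) bool {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	var env envelope
	if err := dec.Decode(&env); err != nil {
		return false
	}
	// A missing or null "findings" key leaves the slice nil.
	return env.Findings != nil
}
```

A reply such as {"issues": [{"msg": "unused variable"}]} passes syntaxOK and fails shapeOK, which is exactly the class of output a syntax-only flag lets through.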

Three providers that don't enforce at all

CommitBrief talks to DeepSeek, Mistral, and Cohere through the OpenAI-compatible SDK (covered in part 2), but their strict-schema support is uneven, so it doesn't send response_format to them at all. Their JSON shape comes entirely from the prompt's contract block.

So structured output across the seven API providers is a spectrum:

Mechanism	Constrains	Providers
Forced tool / strict schema	The exact shape	`anthropic`, `openai`, `gemini`
`format: "json"`	Syntax only	`ollama`
Prompt instruction	Nothing, at the API level	`deepseek`, `mistral`, `cohere`

A pipeline that only trusted the strict-schema providers would work for three of seven. The other four need a backstop that doesn't care how the JSON was produced.
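One way to keep that spectrum honest is to write it down as data instead of leaving it implicit in each client. The sketch below encodes the table as a static lookup. The type, constant, and function names are hypothetical; the provider keys are the ones from the table.

```go
package main

// Enforcement describes what the provider's API guarantees about the output.
type Enforcement int

const (
	PromptOnly  Enforcement = iota // nothing enforced at the API level
	SyntaxOnly                     // valid JSON, arbitrary keys
	StrictShape                    // the API constrains the exact shape
)

// String returns a short label suitable for logs.
func (e Enforcement) String() string {
	switch e {
	case StrictShape:
		return "strict shape"
	case SyntaxOnly:
		return "syntax only"
	default:
		return "prompt only"
	}
}

// enforcementByProvider mirrors the spectrum table.
var enforcementByProvider = map[string]Enforcement{
	"anthropic": StrictShape,
	"openai":    StrictShape,
	"gemini":    StrictShape,
	"ollama":    SyntaxOnly,
	"deepseek":  PromptOnly,
	"mistral":   PromptOnly,
	"cohere":    PromptOnly,
}

// EnforcementFor returns the guarantee for a provider, defaulting to the
// weakest level so an unknown provider is never over-trusted.
func EnforcementFor(provider string) Enforcement {
	if e, ok := enforcementByProvider[provider]; ok {
		return e
	}
	return PromptOnly
}
```

The lookup changes nothing about validation, since every provider still goes through the same parser. It only makes the native guarantee something you can log next to a failure.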

The real contract is your parser

That backstop is one function every provider's output funnels through. ParseFindings decodes the envelope and validates each finding — not just "is it JSON" but "is it a valid finding":

for i, f := range env.Findings {
    if !f.Severity.IsValid() {
        return nil, fmt.Errorf("parse findings: finding %d: unknown severity %q", i, f.Severity)
    }
    if f.File == "" {
        return nil, fmt.Errorf("parse findings: finding %d: missing file", i)
    }
    if f.Title == "" {
        return nil, fmt.Errorf("parse findings: finding %d: missing title", i)
    }
    if f.Description == "" { /* ... */ }
    if f.Suggestion == "" { /* ... */ }
}

An empty findings array is a clean review, returned as a non-nil empty slice — success, not an error. A made-up severity or a finding with no file is a parse failure, no matter which provider produced it. The strict-schema providers rarely trip it; the prompt-driven ones lean on it. Either way, the validation is identical, so a --fail-on=high gate means the same thing whether you ran Claude or a local qwen.
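Put together, the whole parser fits in one short function. What follows is a self-contained sketch, not the CommitBrief source: the error strings for description and suggestion are assumed to follow the same pattern as the three shown above, and rejecting a missing or null findings key is my assumption about how a wrong-shaped envelope should be treated.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Severity string

// IsValid reports whether s is one of the five wire values.
func (s Severity) IsValid() bool {
	switch s {
	case "critical", "high", "medium", "low", "info":
		return true
	}
	return false
}

type Finding struct {
	Severity    Severity `json:"severity"`
	File        string   `json:"file"`
	Line        int      `json:"line"`
	LineEnd     int      `json:"line_end,omitempty"`
	Title       string   `json:"title"`
	Description string   `json:"description"`
	Suggestion  string   `json:"suggestion"`
	Language    string   `json:"language,omitempty"`
	Snippet     string   `json:"snippet,omitempty"`
}

// ParseFindings decodes the envelope and validates every finding.
// The first invalid finding fails the whole parse.
func ParseFindings(content string) ([]Finding, error) {
	var env struct {
		Findings []Finding `json:"findings"`
	}
	if err := json.Unmarshal([]byte(content), &env); err != nil {
		return nil, fmt.Errorf("parse findings: %w", err)
	}
	// Assumption: a missing or null "findings" key is a contract violation,
	// not a clean review.
	if env.Findings == nil {
		return nil, fmt.Errorf("parse findings: missing findings array")
	}
	for i, f := range env.Findings {
		if !f.Severity.IsValid() {
			return nil, fmt.Errorf("parse findings: finding %d: unknown severity %q", i, f.Severity)
		}
		required := []struct{ name, value string }{
			{"file", f.File},
			{"title", f.Title},
			{"description", f.Description},
			{"suggestion", f.Suggestion},
		}
		for _, r := range required {
			if r.value == "" {
				return nil, fmt.Errorf("parse findings: finding %d: missing %s", i, r.name)
			}
		}
	}
	// For "findings": [] encoding/json yields a non-nil empty slice.
	return env.Findings, nil
}
```

Because Line is an int, a reply that sends "line": "12" as a string fails inside json.Unmarshal before any of the field checks run, so type drift surfaces as a parse error instead of being coerced.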

When the model ignores all of it

A strict schema reduces malformed output; it doesn't eliminate it, and three of the providers have no schema at all. So the call is wrapped in retry-once-then-degrade:

resp, err := prov.Review(ctx, req)
if err != nil {
    return "", provider.Usage{}, "", err
}
if _, parseErr := render.ParseFindings(resp.Content); parseErr == nil {
    return resp.Content, resp.Usage, cache.FormatJSON, nil
}
// First attempt unparseable — retry once (ADR-0014 §4).
onRetry()
resp2, err2 := prov.Review(ctx, req)
if err2 != nil {
    return resp.Content, resp.Usage, cache.FormatMarkdownFallback, nil
}
// totalUsage is resp.Usage plus resp2.Usage; the summation is elided in this excerpt.
if _, parseErr := render.ParseFindings(resp2.Content); parseErr == nil {
    return resp2.Content, totalUsage, cache.FormatJSON, nil
}
// Both attempts failed — degrade: render the raw text as Markdown, warn once.
return resp.Content, totalUsage, cache.FormatMarkdownFallback, nil

Three things in that flow are deliberate. Token usage is summed across both attempts, so the cost footer reflects what you actually spent, even on a degrade. The outcome is recorded as a format marker (FormatJSON or FormatMarkdownFallback) and cached with the response, so a degraded review replays from cache silently instead of re-warning forever. And degrade means render the raw model text as Markdown and print one warning — never crash, never show the user a stack trace because an LLM got creative. A --fail-on gate is skipped on a degrade, with a note on stderr, because there are no structured findings to threshold.
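For anyone who wants to try the flow in isolation, here is a self-contained sketch with a scripted fake in place of a real provider. It is simplified on purpose: there is no context parameter, parses is a stand-in for render.ParseFindings that only checks the envelope, and the Format string values and every type name here are assumptions rather than CommitBrief's actual definitions.

```go
package main

import (
	"encoding/json"
	"errors"
)

type Usage struct{ InputTokens, OutputTokens int }

// Add returns the combined token usage of two attempts.
func (u Usage) Add(o Usage) Usage {
	return Usage{u.InputTokens + o.InputTokens, u.OutputTokens + o.OutputTokens}
}

type Response struct {
	Content string
	Usage   Usage
}

type Reviewer interface {
	Review(prompt string) (Response, error)
}

type Format string

const (
	FormatJSON             Format = "json"
	FormatMarkdownFallback Format = "markdown-fallback"
)

// parses is a stand-in for the real parser: envelope check only.
func parses(content string) bool {
	var env struct {
		Findings []json.RawMessage `json:"findings"`
	}
	return json.Unmarshal([]byte(content), &env) == nil && env.Findings != nil
}

// reviewWithFallback tries once, retries once on unparseable output,
// then degrades to the first attempt's raw text.
func reviewWithFallback(r Reviewer, prompt string, onRetry func()) (string, Usage, Format, error) {
	first, err := r.Review(prompt)
	if err != nil {
		return "", Usage{}, "", err // the first call failing is a real error
	}
	if parses(first.Content) {
		return first.Content, first.Usage, FormatJSON, nil
	}
	if onRetry != nil {
		onRetry()
	}
	second, err := r.Review(prompt)
	if err != nil {
		// The retry itself failed: degrade using what the first attempt cost.
		return first.Content, first.Usage, FormatMarkdownFallback, nil
	}
	total := first.Usage.Add(second.Usage)
	if parses(second.Content) {
		return second.Content, total, FormatJSON, nil
	}
	return first.Content, total, FormatMarkdownFallback, nil
}

// scripted is a test double that returns canned replies in order.
type scripted struct {
	replies []Response
	calls   int
}

func (s *scripted) Review(string) (Response, error) {
	if s.calls >= len(s.replies) {
		return Response{}, errors.New("scripted: no more replies")
	}
	reply := s.replies[s.calls]
	s.calls++
	return reply, nil
}
```

Feeding it a prose reply followed by a valid envelope returns the second reply as FormatJSON with both attempts' tokens summed. Two prose replies return the first reply as FormatMarkdownFallback, again with summed usage.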

What it is not

Structured output guarantees a response is parseable. It does not guarantee it's correct. A strict schema can't stop a model from inventing a line number, attaching a finding to the wrong file, or reporting a confident non-issue — which is why the prompt still carries an explicit "do not invent file paths or line numbers" directive, and why this is the zeroth reviewer, not the last one. The schema is what makes the output machine-readable; your judgment is what makes it trustworthy.

If you want the measured version of "how often is it right," the eval harness scores precision and false-positive rate per model against a known-answer corpus:

COMMITBRIEF_EVAL_PROVIDER=<name> make eval-live

Repo: github.com/CommitBrief/commitbrief.

Part 3 of Building CommitBrief. Next: the pre-send secret scanner (eight patterns, added-lines-only, and a match record that never stores the secret it just caught).

Top comments (13)

Mudassir Khan • Jul 2

the 'syntax is a lint rule, schema is a type checker' framing is the right way to think about it. we built multi provider routing with a similar validation layer and the prompt only providers (deepseek, mistral, cohere in your case) are where the production incidents live.

what we added beyond the retry once path: detecting whether the failed attempt returned prose vs truncated JSON and branching the recovery prompt differently. uniform retry templates miss that split — confident prose needs a schema reset, partial JSON needs a continue nudge.

how are you handling providers that silently hallucinate the severity field rather than omit it entirely?

Muhammet ŞAFAK • Jul 4

Good question — it splits into two cases that need different answers.

Out-of-vocab hallucination ("blocker", "P1", "severe"): caught client-side. severity is a closed five-value enum and the parser validates every finding against it — an unknown value fails the whole parse and rides the same retry-once-then-degrade path as malformed JSON. Deliberately no coercion: we never map "blocker" → "critical" or default to "medium", because severity is load-bearing — --fail-on=critical decides the exit code in pre-commit hooks and CI. A response that fails loudly beats a guessed severity that silently blocks (or fails to block) a commit. For the three native-schema providers the enum is also enforced server-side in the tool/response schema, so in practice this class only shows up on the prompt-only trio — which matches your incident distribution exactly.

In-vocab but miscalibrated ("critical" on a style nit): this is where the type-checker analogy stops — it type-checks fine. Schema can't see it, so we handle it outside the request path: (1) the severity rubric is a fixed prompt block that defines each level by observable consequence ("crash on common input", "should block release") rather than adjectives, and it's injected by the prompt builder — not the user-editable rules file — so it can't be accidentally dropped; (2) calibration is measured offline: our eval fixtures assert a severity floor per expected finding (actual ≥ min_severity), so a provider that inflates severities shows up as an eval regression instead of a production surprise; (3) the default gate is conservative (critical only), which bounds the blast radius of one inflated level.

The prose-vs-truncated-JSON branch on the recovery prompt is a genuinely good idea — our retry currently resends the identical request and lets sampling variance do the work, which is the bluntest possible instrument. Detecting which way the first attempt failed and adapting the prompt accordingly is going on the list. Thanks!

Ken • Jun 26

This is a useful distinction: native schema enforcement, JSON syntax mode, and prompt-only output are three different contracts.

I especially like the point that the parser is the real contract. In provider-neutral systems, capability truth needs to be part of the abstraction: record whether the provider gave a native guarantee, whether the harness validated the response, and whether a retry or repair path had to step in.

That makes fallback behavior visible instead of pretending every model supports the same shape equally.

Muhammet ŞAFAK • Jun 26

Yeah, that's exactly the right way to frame it — and it's the part I had to learn the hard way.

CommitBrief does record some of this, but less cleanly than your three-signal version. The native-guarantee bit is static: it's known per-provider at config time (the spectrum table), not something I stamp on each response. What I do persist per-response is a format marker — FormatJSON or FormatMarkdownFallback — that gets cached alongside the result, so a degraded review replays as a degrade instead of silently looking like a clean one.

Where your framing is sharper than what I actually have: the retry path collapses into that one marker. A response that parsed on the first try and one that only parsed after the retry both end up as FormatJSON — so "the repair path had to step in" isn't visible after the fact, even though I act on it in the moment (token usage from both attempts is summed). Your version keeps those as three distinct facts, and that's the better contract. It's the difference between "did it end up valid" and "what did it cost us to get there," and the second one is what you actually want when you're comparing models.

The honest reason it's collapsed is that the marker started life as a cache concern, not an observability one. But you're right that in a provider-neutral system the capability truth deserves to be a first-class field, not a side effect of caching.

Ken • Jun 26

I’d probably keep FormatJSON / FormatMarkdownFallback as the replay-facing result, but add separate run facts for native capability, retry count, repair path used, parse failures, and total repair cost. Then the cached artifact stays simple, while model comparison and incident review still get the evidence they need.

Muhammet ŞAFAK • Jun 28

That's the split I landed on too. The cached artifact only has to replay deterministically, so the format is all it needs — folding retry count or repair cost into the key would make identical reviews miss the cache. Those belong in sidecar run facts, exactly as you describe: simple cache, rich evidence for model comparison and incident review.

Tae Kim • Jun 26

One issue I've hit even after JSON passes syntax validation: numeric fields coming back as strings that Pydantic coerces silently. severity: "0.8" becomes 0.8 in default strict=False mode with no error raised. For production monitoring I log the raw response against the validated output and diff the field types -- if coercion starts appearing on fields that were previously clean, that's an early signal the model is drifting from the schema contract before it escalates to malformed JSON. The degrade path masks this drift otherwise, because the output looks correct downstream.

Muhammet ŞAFAK • Jun 28

Strict unmarshaling saves me here — in Go, "0.8" hitting a float field errors loudly instead of coercing, so it surfaces as a parse failure rather than quiet drift. But your real point holds: type drift is a leading indicator, and the degrade path is exactly what swallows it. Diffing raw-vs-validated field types is going on my list.

Richard Smith • Jun 26

That Ollama distinction trips people up. "format: json" is really just a lint rule — the model can still give you structurally wrong output that parses fine. The contract is looser than it looks.

Muhammet ŞAFAK • Jun 28

Exactly, and the native schema modes only buy you shape, not semantics. Valid JSON can still carry a drifted enum or an out-of-range severity. Syntax is a lint rule, schema is a type checker; the actual contract still needs validating downstream.

Tae Kim • Jun 26

The degradation path is the part I spent the most time on in production. When the model ignores the schema entirely, the failure splits into two cases: plausible-looking prose versus malformed partial JSON. Routing those two cases to different recovery prompts rather than a single retry template improved successful parse rate by about 30 percent compared to a uniform fallback.

Muhammet ŞAFAK • Jun 28

This is the part I under-built. Right now it's retry-once-then-degrade with a single template, which treats confident prose and truncated JSON as the same failure — they aren't. Prose needs a "you ignored the schema" reset; partial JSON needs a repair/continue nudge. A 30% lift easily justifies branching the recovery. Stealing this.

Kantemir Satibalov • Jul 2

Great post! I'm dealing with a similar pain — LLMs ignore the requested structure when I ask for fertilizer dosage recommendations. In my RAG project (grounded horticulture assistant), I took a two-stage verification approach:

First, retrieve relevant articles via hybrid search (Chroma + BM25).
Second, run the LLM response through a separate validation step that cross-checks all numbers and units against the original context — to catch hallucinations in dosages and measurements.
