CommitBrief renders a code review as cards, JSON schema v1, or a CI exit code — which means the LLM has to hand back structured findings, not prose. Every provider can do that. The catch is that no two of them do it the same way, and some don't really do it at all.
There's exactly one schema the whole system targets. Getting four native APIs to honor it takes four completely different mechanisms; getting three more is a matter of asking nicely and not trusting the answer. This is how that works, and what happens when a model ignores the contract anyway.
TL;DR
- One schema, many dialects. Every provider targets the same Finding shape, expressed through whatever structured-output mechanism that vendor offers: tool_use, strict json_schema, responseSchema, or just format: "json".
- Structured output is a spectrum, not a guarantee. It runs from "the API enforces the shape" down to "we asked in the prompt."
- The real contract is your parser. One ParseFindings validates every provider's output the same way; failures retry once, then degrade to Markdown with a warning. The pipeline never crashes on a bad response.
- The limit. A schema makes output parseable, not correct. It can't stop a model from inventing a plausible-but-wrong finding.
The one schema everyone targets
A finding is a flat struct. Five required fields, three optional, and a severity drawn from a closed vocabulary:
```go
// Severity is a string type; this follows from the string constants below.
type Severity string

type Finding struct {
	Severity    Severity `json:"severity"` // one of five, below
	File        string   `json:"file"`
	Line        int      `json:"line"`
	LineEnd     int      `json:"line_end,omitempty"`
	Title       string   `json:"title"`
	Description string   `json:"description"`
	Suggestion  string   `json:"suggestion"`
	Language    string   `json:"language,omitempty"`
	Snippet     string   `json:"snippet,omitempty"`
}

const (
	SeverityCritical Severity = "critical"
	SeverityHigh     Severity = "high"
	SeverityMedium   Severity = "medium"
	SeverityLow      Severity = "low"
	SeverityInfo     Severity = "info"
)
```
The envelope is {"findings": [ ... ]} and nothing else. That severity vocabulary is the wire contract with the model — deliberately English-only and fixed in code, so a user's custom COMMITBRIEF.md can change the rules of a review but never the shape of its output. Everything downstream — the cards renderer, --json, --fail-on=high — depends on those five strings meaning exactly five things.
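To make the closed vocabulary concrete, here is a minimal sketch of how a validity check and a --fail-on threshold could sit on top of those five strings. IsValid is the method name the parser uses later in this post, but its body here, the severityRank table, and the MeetsThreshold helper are assumptions for illustration, not CommitBrief's actual code.

```go
package main

import "fmt"

// Severity mirrors the article's closed vocabulary.
type Severity string

const (
	SeverityCritical Severity = "critical"
	SeverityHigh     Severity = "high"
	SeverityMedium   Severity = "medium"
	SeverityLow      Severity = "low"
	SeverityInfo     Severity = "info"
)

// severityRank orders the vocabulary from least to most severe.
// Hypothetical: the article does not show how severities are ranked.
var severityRank = map[Severity]int{
	SeverityInfo:     1,
	SeverityLow:      2,
	SeverityMedium:   3,
	SeverityHigh:     4,
	SeverityCritical: 5,
}

// IsValid reports whether s is exactly one of the five known strings.
// Matching is case-sensitive, so "High" or "severe" is rejected.
func (s Severity) IsValid() bool {
	_, ok := severityRank[s]
	return ok
}

// MeetsThreshold reports whether s is at least as severe as threshold.
// Hypothetical helper for a gate such as --fail-on=high.
func MeetsThreshold(s, threshold Severity) bool {
	if !s.IsValid() || !threshold.IsValid() {
		return false
	}
	return severityRank[s] >= severityRank[threshold]
}

func main() {
	fmt.Println(Severity("high").IsValid())
	fmt.Println(Severity("High").IsValid())
	fmt.Println(MeetsThreshold(SeverityCritical, SeverityHigh))
	fmt.Println(MeetsThreshold(SeverityMedium, SeverityHigh))
}
```

Because the check is an exact string match, anything outside the five values fails fast instead of being silently mapped to a nearby severity.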
Four native dialects for the same shape
The native API providers each enforce that schema through their own mechanism. Same target, four wire formats.
Anthropic — a forced tool call. The findings schema is registered as a tool, and tool_choice makes calling it non-optional:
```go
params.Tools = []sdk.ToolUnionParam{buildReportTool()}   // schema as "report_findings"
params.ToolChoice = sdk.ToolChoiceParamOfTool(toolName) // must call it
// tool description: "Emit the review as structured findings. Always call this tool."
```
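At the wire level, forcing a tool comes down to two request fields. The sketch below builds them with plain encoding/json rather than the SDK. The tools, input_schema, and tool_choice field names follow Anthropic's Messages API; the forcedToolFields helper and the heavily trimmed schema are illustrative assumptions, not what buildReportTool actually returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// forcedToolFields returns the two Messages API request fields that make a
// specific tool call mandatory. Hypothetical helper; the input schema is
// trimmed to the envelope's single key for brevity.
func forcedToolFields(toolName string) map[string]any {
	return map[string]any{
		"tools": []map[string]any{{
			"name":        toolName,
			"description": "Emit the review as structured findings. Always call this tool.",
			"input_schema": map[string]any{
				"type": "object",
				"properties": map[string]any{
					"findings": map[string]any{"type": "array"},
				},
				"required": []string{"findings"},
			},
		}},
		// type "tool" plus a name forces that tool instead of letting the
		// model choose between tools and free text.
		"tool_choice": map[string]any{"type": "tool", "name": toolName},
	}
}

func main() {
	out, err := json.MarshalIndent(forcedToolFields("report_findings"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```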
OpenAI — strict json_schema. With Strict set, the Chat Completions API holds the response to the schema server-side — and refuses the request outright rather than fall through to a model that would ignore it:
```go
func buildResponseFormat() sdk.ChatCompletionNewParamsResponseFormatUnion {
	return sdk.ChatCompletionNewParamsResponseFormatUnion{
		OfJSONSchema: &shared.ResponseFormatJSONSchemaParam{
			JSONSchema: shared.ResponseFormatJSONSchemaJSONSchemaParam{
				Name:        schemaName,
				Description: sdk.String("Structured findings for a code review."),
				Strict:      sdk.Bool(true),
				Schema:      responseSchema,
			},
		},
	}
}
```
(Responses-API-only models express the same schema through a text.format json_schema config instead — one more dialect for the identical shape.)
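Strict mode is also picky about the schema itself. Per OpenAI's Structured Outputs documentation, every object must set additionalProperties to false and list all of its properties under required, with optional fields expressed as a union with null. The sketch below shows what a Finding schema could look like under those rules, plus a small checker for the top-level object. Both findingSchema and strictReady are hypothetical names; the post's real responseSchema value is not shown.

```go
package main

import "fmt"

// findingSchema sketches the Finding shape as JSON Schema for strict mode.
// Illustrative only: optional fields become nullable instead of omitted.
func findingSchema() map[string]any {
	str := map[string]any{"type": "string"}
	optStr := map[string]any{"type": []string{"string", "null"}}
	return map[string]any{
		"type": "object",
		"properties": map[string]any{
			"severity": map[string]any{
				"type": "string",
				"enum": []string{"critical", "high", "medium", "low", "info"},
			},
			"file":        str,
			"line":        map[string]any{"type": "integer"},
			"line_end":    map[string]any{"type": []string{"integer", "null"}},
			"title":       str,
			"description": str,
			"suggestion":  str,
			"language":    optStr,
			"snippet":     optStr,
		},
		"required": []string{
			"severity", "file", "line", "line_end", "title",
			"description", "suggestion", "language", "snippet",
		},
		"additionalProperties": false,
	}
}

// strictReady checks the two strict-mode rules on one object schema:
// additionalProperties is false and every property is listed as required.
func strictReady(schema map[string]any) bool {
	if ap, ok := schema["additionalProperties"].(bool); !ok || ap {
		return false
	}
	props, _ := schema["properties"].(map[string]any)
	required, _ := schema["required"].([]string)
	listed := make(map[string]bool, len(required))
	for _, name := range required {
		listed[name] = true
	}
	for name := range props {
		if !listed[name] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(strictReady(findingSchema()))
}
```

A nullable string decodes cleanly into a Go string field (null leaves it as ""), so the omitempty fields in the struct can still round-trip.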
Gemini — a response schema plus a MIME type. You hand the SDK a *Schema value and tell it to return JSON:
```go
cfg.ResponseMIMEType = "application/json"
cfg.ResponseSchema = responseSchema() // the Findings envelope as a *genai.Schema
```
Ollama — format: "json", and that's all it promises. A local model can be told to emit JSON, but the flag constrains syntax, not shape:
```go
Format: "json", // valid JSON guaranteed; the right keys are not
```
That distinction matters. Anthropic, OpenAI, and Gemini constrain the structure; Ollama only guarantees the output parses as some JSON. The schema conformance has to come from somewhere else.
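Here is a small sketch of why "syntax, not shape" is a real gap. Go's encoding/json ignores unknown keys by default, so a syntactically valid response with the wrong keys decodes without error into an empty envelope. The trimmed finding and envelope types and the inspect helper are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finding and envelope are trimmed stand-ins for the article's types.
type finding struct {
	Severity string `json:"severity"`
	File     string `json:"file"`
	Title    string `json:"title"`
}

type envelope struct {
	Findings []finding `json:"findings"`
}

// inspect reports whether raw is valid JSON and how many findings the
// envelope yields once decoded. Hypothetical helper.
func inspect(raw string) (validJSON bool, findings int) {
	validJSON = json.Valid([]byte(raw))
	var env envelope
	if err := json.Unmarshal([]byte(raw), &env); err != nil {
		return validJSON, 0
	}
	return validJSON, len(env.Findings)
}

func main() {
	// Valid JSON with the wrong keys: "issues" instead of "findings".
	wrong := `{"issues":[{"level":"bad","path":"main.go"}]}`
	right := `{"findings":[{"severity":"high","file":"main.go","title":"nil deref"}]}`

	okWrong, nWrong := inspect(wrong)
	okRight, nRight := inspect(right)
	fmt.Println(okWrong, nWrong)
	fmt.Println(okRight, nRight)
}
```

The wrong-keys response passes the syntax check and yields zero findings, which is indistinguishable from a clean review unless something downstream checks the shape.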
Three providers that don't enforce at all
DeepSeek, Mistral, and Cohere reach CommitBrief through the OpenAI-compatible SDK (covered in part 2), but their strict-schema support is uneven, so CommitBrief doesn't request response_format for them at all. Their JSON shape comes entirely from the prompt's contract block.
So structured output across the seven API providers is a spectrum:
| Mechanism | Constrains | Providers |
|---|---|---|
| Forced tool / strict schema | The exact shape | anthropic, openai, gemini |
| format: "json" | Syntax only | ollama |
| Prompt instruction | Nothing, at the API level | deepseek, mistral, cohere |
A pipeline that only trusted the strict-schema providers would work for three of seven. The other four need a backstop that doesn't care how the JSON was produced.
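One way to keep that spectrum honest in code is to record it as data rather than leave it implicit. The sketch below encodes the table above as a lookup; the Enforcement type, the enforcementByProvider map, and leanOnParser are all hypothetical names, and nothing here claims CommitBrief stores this information the same way.

```go
package main

import (
	"fmt"
	"sort"
)

// Enforcement describes how much of the schema a provider's API guarantees.
type Enforcement int

const (
	PromptOnly Enforcement = iota // nothing enforced at the API level
	SyntaxOnly                    // output is valid JSON; keys are unchecked
	ExactShape                    // forced tool call or strict schema
)

func (e Enforcement) String() string {
	switch e {
	case ExactShape:
		return "exact shape"
	case SyntaxOnly:
		return "syntax only"
	default:
		return "prompt only"
	}
}

// enforcementByProvider mirrors the table: seven providers, three levels.
var enforcementByProvider = map[string]Enforcement{
	"anthropic": ExactShape,
	"openai":    ExactShape,
	"gemini":    ExactShape,
	"ollama":    SyntaxOnly,
	"deepseek":  PromptOnly,
	"mistral":   PromptOnly,
	"cohere":    PromptOnly,
}

// leanOnParser lists providers whose API does not enforce the exact shape,
// sorted by name so the result is deterministic.
func leanOnParser() []string {
	var names []string
	for name, e := range enforcementByProvider {
		if e != ExactShape {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	for _, name := range leanOnParser() {
		fmt.Printf("%s: %s\n", name, enforcementByProvider[name])
	}
}
```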
The real contract is your parser
That backstop is one function every provider's output funnels through. ParseFindings decodes the envelope and validates each finding — not just "is it JSON" but "is it a valid finding":
```go
for i, f := range env.Findings {
	if !f.Severity.IsValid() {
		return nil, fmt.Errorf("parse findings: finding %d: unknown severity %q", i, f.Severity)
	}
	if f.File == "" {
		return nil, fmt.Errorf("parse findings: finding %d: missing file", i)
	}
	if f.Title == "" {
		return nil, fmt.Errorf("parse findings: finding %d: missing title", i)
	}
	if f.Description == "" { /* ... */ }
	if f.Suggestion == "" { /* ... */ }
}
```
An empty findings array is a clean review, returned as a non-nil empty slice — success, not an error. A made-up severity or a finding with no file is a parse failure, no matter which provider produced it. The strict-schema providers rarely trip it; the prompt-driven ones lean on it. Either way, the validation is identical, so a --fail-on=high gate means the same thing whether you ran Claude or a local qwen.
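For readers who want to run the idea end to end, here is a self-contained sketch of a parser with the same contract: decode the envelope, reject any invalid finding, and return a non-nil empty slice for a clean review. It is a reduced stand-in, not the real ParseFindings. The struct is trimmed to three fields, the severity check is a plain map lookup, and the signature (string in, slice and error out) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is trimmed to three fields for the example.
type Finding struct {
	Severity string `json:"severity"`
	File     string `json:"file"`
	Title    string `json:"title"`
}

type envelope struct {
	Findings []Finding `json:"findings"`
}

var validSeverity = map[string]bool{
	"critical": true, "high": true, "medium": true, "low": true, "info": true,
}

// ParseFindings decodes raw and validates every finding. One bad finding
// fails the whole response, regardless of which provider produced it.
func ParseFindings(raw string) ([]Finding, error) {
	var env envelope
	if err := json.Unmarshal([]byte(raw), &env); err != nil {
		return nil, fmt.Errorf("parse findings: %w", err)
	}
	// make guarantees a non-nil slice even when there are no findings.
	out := make([]Finding, 0, len(env.Findings))
	for i, f := range env.Findings {
		switch {
		case !validSeverity[f.Severity]:
			return nil, fmt.Errorf("parse findings: finding %d: unknown severity %q", i, f.Severity)
		case f.File == "":
			return nil, fmt.Errorf("parse findings: finding %d: missing file", i)
		case f.Title == "":
			return nil, fmt.Errorf("parse findings: finding %d: missing title", i)
		}
		out = append(out, f)
	}
	return out, nil
}

func main() {
	clean, err := ParseFindings(`{"findings":[]}`)
	fmt.Println(len(clean), clean != nil, err)

	_, err = ParseFindings(`{"findings":[{"severity":"severe","file":"a.go","title":"x"}]}`)
	fmt.Println(err)
}
```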
When the model ignores all of it
A strict schema reduces malformed output; it doesn't eliminate it, and three of the providers have no schema at all. So the call is wrapped in retry-once-then-degrade:
```go
resp, err := prov.Review(ctx, req)
if err != nil {
	return "", provider.Usage{}, "", err
}
if _, parseErr := render.ParseFindings(resp.Content); parseErr == nil {
	return resp.Content, resp.Usage, cache.FormatJSON, nil
}
// First attempt unparseable: retry once (ADR-0014 §4).
onRetry()
resp2, err2 := prov.Review(ctx, req)
if err2 != nil {
	// The retry itself failed: degrade using the first attempt's raw text.
	return resp.Content, resp.Usage, cache.FormatMarkdownFallback, nil
}
// totalUsage is the usage of both attempts combined; the line that
// computes it is not part of this excerpt.
if _, parseErr := render.ParseFindings(resp2.Content); parseErr == nil {
	return resp2.Content, totalUsage, cache.FormatJSON, nil
}
// Both attempts failed: degrade by rendering the raw text as Markdown, warn once.
return resp.Content, totalUsage, cache.FormatMarkdownFallback, nil
```
Three things in that flow are deliberate. Token usage is summed across both attempts, so the cost footer reflects what you actually spent, even on a degrade. The outcome is recorded as a format marker (FormatJSON or FormatMarkdownFallback) and cached with the response, so a degraded review replays from cache silently instead of re-warning forever. And degrade means render the raw model text as Markdown and print one warning — never crash, never show the user a stack trace because an LLM got creative. A --fail-on gate is skipped on a degrade, with a note on stderr, because there are no structured findings to threshold.
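The same flow can be exercised without any provider at all. The sketch below reproduces retry-once-then-degrade against a fake review function, with usage summed across attempts. Everything in it is a stand-in: the Usage fields, the Add method, the Format string values, and the parseable check (a bare JSON decode instead of the full ParseFindings) are assumptions for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Usage and its fields are hypothetical; the real provider.Usage is not shown.
type Usage struct{ InputTokens, OutputTokens int }

func (u Usage) Add(o Usage) Usage {
	return Usage{u.InputTokens + o.InputTokens, u.OutputTokens + o.OutputTokens}
}

type Response struct {
	Content string
	Usage   Usage
}

type Format string

const (
	FormatJSON             Format = "json"
	FormatMarkdownFallback Format = "markdown-fallback"
)

// parseable stands in for ParseFindings: it only checks that the content
// decodes as a findings envelope.
func parseable(content string) bool {
	var env struct {
		Findings []json.RawMessage `json:"findings"`
	}
	return json.Unmarshal([]byte(content), &env) == nil
}

// reviewWithRetry calls review, retries once on unparseable output, and
// degrades to the first attempt's raw text if the retry does not help.
func reviewWithRetry(review func() (Response, error)) (string, Usage, Format, error) {
	first, err := review()
	if err != nil {
		return "", Usage{}, "", err // a failed first call is a real error
	}
	if parseable(first.Content) {
		return first.Content, first.Usage, FormatJSON, nil
	}
	second, err := review()
	if err != nil {
		// The retry failed outright: degrade rather than propagate.
		return first.Content, first.Usage, FormatMarkdownFallback, nil
	}
	total := first.Usage.Add(second.Usage)
	if parseable(second.Content) {
		return second.Content, total, FormatJSON, nil
	}
	return first.Content, total, FormatMarkdownFallback, nil
}

func main() {
	// Fake provider: prose on the first call, a valid envelope on the second.
	replies := []string{"Looks fine to me!", `{"findings":[]}`}
	calls := 0
	fake := func() (Response, error) {
		r := Response{Content: replies[calls], Usage: Usage{100, 20}}
		calls++
		return r, nil
	}
	content, usage, format, err := reviewWithRetry(fake)
	fmt.Println(content, usage, format, err)
}
```

Returning the format alongside the content is what lets a caller cache the outcome and skip a --fail-on gate when the result is a fallback.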
What it is not
Structured output guarantees a response is parseable. It does not guarantee it's correct. A strict schema can't stop a model from inventing a line number, attaching a finding to the wrong file, or reporting a confident non-issue — which is why the prompt still carries an explicit "do not invent file paths or line numbers" directive, and why this is the zeroth reviewer, not the last one. The schema is what makes the output machine-readable; your judgment is what makes it trustworthy.
If you want the measured version of "how often is it right," the eval harness scores precision and false-positive rate per model against a known-answer corpus:
```shell
COMMITBRIEF_EVAL_PROVIDER=<name> make eval-live
```
Repo: github.com/CommitBrief/commitbrief.
Part 3 of **Building CommitBrief**. Next up is the pre-send secret scanner: eight patterns, added-lines-only, and a match record that never stores the secret it just caught.