David Moores

Posted on May 25 • Edited on Jul 24 • Originally published at carrick.tools

Benchmarking LLM Structured Outputs

#ai #llm #devops #productivity


When you read the API documentation for OpenAI, Anthropic, or Google Gemini, the feature called "structured outputs" looks like a solved problem: pass a JSON schema, get back JSON that conforms to it.

In production, it is not a contract. It is a well-typed, best-effort suggestion.

At Carrick, the code-analysis scanner I work on, our post-LLM pipeline relies on a four-stage fallback parser. We attempt a direct parse, strip markdown fences, scan for array bounds inside surrounding garbage text, and finally apply regex cleanup. If all four fail, we drop the payload and proceed. If structured outputs worked as advertised, this would be a single serde_json::from_str(response).
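The four stages are easier to reason about when written out. Below is a minimal sketch in Python, not Carrick's actual parser (which this post does not show). It assumes the expected payload is a JSON array, and the stage-four regex (dropping trailing commas) is an assumed example of "regex cleanup"; all function names are illustrative.

```python
import json
import re
from typing import Callable, List, Optional


def strip_fences(text: str) -> str:
    """Stage 2: unwrap a ```json ... ``` markdown fence if one is present."""
    match = re.search(r"```[A-Za-z0-9_-]*\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text


def slice_array(text: str) -> str:
    """Stage 3: keep only the span from the first '[' to the last ']'."""
    start, end = text.find("["), text.rfind("]")
    return text[start:end + 1] if 0 <= start < end else text


def regex_cleanup(text: str) -> str:
    """Stage 4 (assumed rule): drop trailing commas before ']' or '}'.

    This is deliberately crude and can corrupt string values that
    happen to contain ',]' or ',}'.
    """
    return re.sub(r",\s*([\]}])", r"\1", slice_array(strip_fences(text)))


# Stage 1 is the identity function: try the raw response first.
STAGES: List[Callable[[str], str]] = [lambda t: t, strip_fences, slice_array, regex_cleanup]


def parse_with_fallback(response: str) -> Optional[list]:
    """Try each stage in order; return None so the caller can drop the payload."""
    for stage in STAGES:
        try:
            value = json.loads(stage(response))
        except json.JSONDecodeError:
            continue
        if isinstance(value, list):  # assumption: the pipeline expects an array
            return value
    return None
```

Each stage only runs when the previous one fails, so a well-behaved response costs a single `json.loads` call.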

To isolate why this defensive parsing is necessary, I built a benchmark testing 8 synthetic schemas against six models (the flagship and cheaper tiers from each provider). The schemas isolate one structural stressor each: a flat baseline, a 3-level nested object, a 7-level nested chain, a long enum, a oneOf tagged union, nullable + format fields, a 20-item array, and a closed object with additionalProperties: false. Every response is validated against the original schema using two independent validators (ajv and hyperjump). A response only counts as strict adherence when both agree.

Here is how the implementations actually behave.

At a glance

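Of the 8 stressor schemas, here is how many each model handled with full strict adherence on every run, and how many tripped a specific failure mode:

OpenAI (gpt-5.4-mini, gpt-5.5): 2 of 8 strict on every run (S1, S8); 6 rejected by the strict-mode rules before the call.
Anthropic (Claude Sonnet 4.6, Claude Opus 4.7): 7 of 8 strict on every run; 1 (S3) silently returned the wrong shape.
Gemini (Gemini Pro 3.1, Gemini Flash 3.5): 6 of 8 strict on every run; 2 (S5, S6) rejected at submission.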

Three patterns emerge. OpenAI rejects most schemas at submit time and then conforms perfectly on what is left. Anthropic accepts every schema but silently corrupts one specific structure. Gemini rejects a narrow set of features and conforms perfectly on the rest. OpenAI and Anthropic sit at opposite ends: one fails loudly before the model runs, the other fails quietly after it. Gemini sits between them.

1. Anthropic accepts complex schemas, then silently returns the wrong shape

Anthropic's tool-use API is the most permissive of the three. It accepts almost any standard JSON schema as the input_schema for a tool, and on 7 of the 8 schemas in this bench, both Claude Sonnet 4.6 and Claude Opus 4.7 produce strict-conforming output 100% of the time. The failure mode is concentrated on one schema: a 7-level nested object chain (S3).

On S3 at n=20 runs per model:

Claude Sonnet 4.6: 20 of 20 runs silent-failed. Strict adherence 0% (95% CI: 0%–16.1%).
Claude Opus 4.7: 7 of 20 runs silent-failed. Strict adherence 65% (95% CI: 43.2%–82.3%).

The failure mode is unusual. Instead of returning a 7-level nested object, the model emits the entire nested structure as a single JSON-encoded string assigned to the root level1 field. Here is one of the Opus failures verbatim:

{"level1":"{\"name\":\"system\",\"child\":{\"name\":\"ingest_pipeline\",
\"child\":{\"name\":\"batch_24a17\",\"child\":{\"name\":\"parse_stage\",
\"child\":{\"name\":\"error_handling\",\"child\":{\"name\":\"dlq_promotion\",
\"leaf\":{\"value\":\"2 rows failed JSON parsing and were promoted to dlq
.ingest.parse-errors; weekly cleanup later inspected 412 items, removed
312, returned 100 for reprocessing\",\"kind\":\"outcome_summary\",
\"count\":2}}}}}}}}"}

The schema declares level1 as type: object. The model returned type: string containing a JSON serialisation of what the object should have been. ajv's diagnostic:

/level1 must be object {"type":"object"}

This is the most dangerous failure mode in the benchmark because:

The transport layer says success. The API returns HTTP 200 with no error field and no refusal signal.
The SDK does not validate. The Anthropic client passes tool_use.input back to your application without checking whether it conforms to the input_schema you sent.
The output parses cleanly. JSON.parse(response) succeeds, returning { level1: "{\"name\": ..." }. Only an explicit schema validator catches the type drift.

The mechanism is consistent across all 27 silent failures in the dataset (20 Sonnet plus 7 Opus): the model wraps the entire nested payload in a single string value. Run-to-run variance is in where the string boundary sits, not in whether the wrapping happens.
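Because the wrapping is so regular, it can be detected separately from ordinary type errors. The sketch below walks a schema and a payload together and reports every place where the schema declares an object or array but the payload holds a string that looks like serialised JSON. It is an illustration under simple assumptions (only `type`, `properties`, and `items` are followed); the function name is hypothetical and is not part of the bench.

```python
import json
from typing import Any, Dict, List, Tuple


def find_stringified(schema: Dict[str, Any], value: Any, path: str = "") -> List[Tuple[str, bool]]:
    """Return (path, decodes_cleanly) for each string sitting where the
    schema expects an object or array."""
    expected = schema.get("type")
    if expected in ("object", "array") and isinstance(value, str):
        opener = "{" if expected == "object" else "["
        if not value.lstrip().startswith(opener):
            return []  # an ordinary type error, not the wrapping pattern
        try:
            json.loads(value)
            decodes = True
        except json.JSONDecodeError:
            decodes = False  # e.g. the string boundary swallowed an extra brace
        return [(path or "/", decodes)]

    hits: List[Tuple[str, bool]] = []
    if isinstance(value, dict):
        for name, sub in schema.get("properties", {}).items():
            if name in value:
                hits += find_stringified(sub, value[name], f"{path}/{name}")
    elif isinstance(value, list) and isinstance(schema.get("items"), dict):
        for index, item in enumerate(value):
            hits += find_stringified(schema["items"], item, f"{path}/{index}")
    return hits
```

The boolean matters for what you do next: a string that decodes cleanly is a candidate for unwrapping and re-validating, while one that does not is only good for a retry.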

2. OpenAI enforces adherence by rejecting standard schemas

OpenAI's strict: true mode is the symmetric mirror of Anthropic. Where it accepts a schema, it produces strict-conforming output. Where the schema does not meet strict mode's narrow dialect, the request never reaches the model.

Of the 8 bench schemas, only 2 pass OpenAI's strict-mode rules (S1 baseline, which I deliberately shaped to be strict-compliant, and S8 closed object). The other 6 are rejected before the call is sent.

OpenAI strict mode requires:

Every object must explicitly declare additionalProperties: false.
Every property must be listed in the required array.
Type-arrays (e.g., type: ["string", "null"]) and oneOf unions are unsupported.

The bench performs the same schema validation OpenAI's API would perform, locally, before submission. A representative rejection (for the 7-level schema):

OpenAI strict mode violations:
  $: object missing additionalProperties: false;
  $.level1: object missing additionalProperties: false;
  $.level1.child: object missing additionalProperties: false
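The three rules listed above are mechanical enough to check locally. Here is a minimal sketch of such a pre-flight check. It is not the bench's actual validator: the function name is hypothetical, and the wording of every message other than the additionalProperties one is an assumption.

```python
from typing import Any, Dict, List


def strict_mode_violations(schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Collect violations of the three strict-mode rules described in this post."""
    out: List[str] = []
    if isinstance(schema.get("type"), list):
        out.append(f"{path}: type-array is unsupported")
    if "oneOf" in schema:
        out.append(f"{path}: oneOf is unsupported")

    props = schema.get("properties", {})
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            out.append(f"{path}: object missing additionalProperties: false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            out.append(f"{path}: properties not in required: {', '.join(missing)}")

    # Recurse into nested objects and array items.
    for name, sub in props.items():
        out += strict_mode_violations(sub, f"{path}.{name}")
    if isinstance(schema.get("items"), dict):
        out += strict_mode_violations(schema["items"], f"{path}[]")
    return out
```

An empty list means the schema clears these three rules; it says nothing about any other restriction the real API may apply.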

The rejection rate is identical between gpt-5.4-mini and gpt-5.5. The check depends only on the schema and runs before any model is invoked (locally in this bench, at the schema-submission layer on OpenAI's side), so flagship intelligence does not change the outcome.

If you pull a schema from an OpenAPI spec or any other off-the-shelf JSON Schema source, it will likely fail. Your options are to rewrite the schema to the strict dialect, or to disable strict mode and give up the guarantee, which exposes you to the same class of silent failure seen with Anthropic.
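Part of the rewrite can be automated. The sketch below, with hypothetical function names, closes every object and marks every property as required, and refuses to guess at oneOf or type-arrays. Note that forcing every property into required changes the meaning of a schema that had optional fields, so treat the output as a starting point for review rather than a drop-in replacement.

```python
import copy
from typing import Any, Dict


def to_strict_dialect(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the schema with every object closed and fully required."""
    out = copy.deepcopy(schema)  # never mutate the caller's schema
    _tighten(out, "$")
    return out


def _tighten(node: Dict[str, Any], path: str) -> None:
    if "oneOf" in node or isinstance(node.get("type"), list):
        # No safe mechanical rewrite exists for these; fail loudly instead.
        raise ValueError(f"{path}: oneOf / type-array needs a manual rewrite")
    if node.get("type") == "object":
        node["additionalProperties"] = False
        node["required"] = list(node.get("properties", {}))
    for name, sub in node.get("properties", {}).items():
        _tighten(sub, f"{path}.{name}")
    if isinstance(node.get("items"), dict):
        _tighten(node["items"], f"{path}[]")
```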

3. Gemini is the rigid middle ground

Gemini's schema validator rejects oneOf and type-arrays, which the strict-mode rules above also ban, along with $ref, but it accepts the looser shapes OpenAI strict refuses. On the 6 of 8 bench schemas that clear Gemini's pre-flight, both Gemini Pro 3.1 and Gemini Flash 3.5 maintain 100% strict adherence at n=5 each (Wilson 95% CI for 5/5: 56.6%–100%; wide for any single cell, but the result repeats across all 6 schemas and both models, which supports the pattern).
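For readers who want to check the arithmetic, the 5/5 interval above and the 0/20 interval quoted for Sonnet on S3 both follow from the standard Wilson score formula. A minimal sketch, assuming z = 1.96 for a 95% interval; the function name is illustrative and not taken from the bench repository:

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(5, 5)
print(f"5/5:  {low:.1%} to {high:.1%}")   # lower bound rounds to 56.6%
low, high = wilson_interval(0, 20)
print(f"0/20: {low:.1%} to {high:.1%}")   # upper bound rounds to 16.1%
```

At n=5 the lower bound cannot rise above 56.6% even with a perfect record, which is why single cells at that sample size show the dominant outcome but not a sharp rate.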

The two rejected schemas are S5 (uses oneOf) and S6 (uses type: ["string", "null"] plus format: date-time). Gemini surfaces the rejection at submission time with a clear error naming the unsupported feature.

Notably, Gemini held 100% strict adherence on every run of the same 7-level nested schema that broke Anthropic. Where Gemini accepts a schema, it conforms.

The outcome matrix

The full pilot, condensed to one grid. S3 and S7 ran at n=20 for Anthropic; all other cells ran at n=5.

Defensive implementation patterns

The provider feature called "structured output" cannot be trusted as an application boundary. To handle the realities of the current APIs, your pipeline needs explicit guardrails. Here is the implementation priority:

Run an independent validation step. An HTTP 200 from the provider says nothing about whether the payload matches your schema. Validate every response against your schema using ajv, hyperjump, or a custom walker in your own codebase before passing the data to your application logic.
Redefine success criteria. Treat a standard parse error, a schema violation, and a refusal as equal failure modes. Trigger the same retry/fallback logic for all of them.
Flatten Anthropic schemas. Deep nesting triggered silent corruption in Claude, including at the flagship tier: the 3-level schema passed on every run and the 7-level chain did not. Flatten structures into top-level arrays of sibling objects wherever possible. The bench did not test depths between 3 and 7, so treat any schema deeper than three or four levels as a candidate for refactoring.
Compile schemas to the OpenAI dialect. If you are targeting OpenAI strict mode, author your schemas from the start with additionalProperties: false propagated to every sub-level and no optional fields.
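Rewrite unions for Gemini. Avoid oneOf and type-arrays such as ["string", "null"]. Express unions with anyOf and nullability with a single nullable type constraint. The bench did not exercise either rewrite, so validate the output as you would for any other schema.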
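The first two items in the list above fit in one small wrapper. The sketch below is one way to route a parse error, a schema violation, and a refusal through the same retry path. Everything named here is hypothetical: `call_model` stands in for your provider adapter, `validate` for whichever validator you use (returning a list of error strings, empty when the payload conforms), and `Refusal` for however your adapter signals that the model declined.

```python
import json
from typing import Any, Callable, List, Optional, Tuple


class Refusal(Exception):
    """Hypothetical: raised by the provider adapter when the model refuses."""


def call_with_guardrails(
    call_model: Callable[[], str],
    validate: Callable[[Any], List[str]],
    max_attempts: int = 3,
) -> Tuple[Optional[Any], List[str]]:
    """Return (payload, failures). payload is None if every attempt failed."""
    failures: List[str] = []
    for _ in range(max_attempts):
        try:
            payload = json.loads(call_model())
        except Refusal:
            failures.append("refusal")
            continue
        except json.JSONDecodeError:
            failures.append("parse_error")
            continue
        if validate(payload):  # non-empty error list means the schema was violated
            failures.append("schema_violation")
            continue
        return payload, failures
    return None, failures
```

Returning the failure log alongside the payload keeps the per-attempt outcomes available for metrics, which is what lets you notice when one failure mode starts to dominate.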

What this bench does and does not measure

Three caveats worth surfacing explicitly:

OpenAI rejection is bench-side, server-rule-mirrored. The 6 of 8 schemas reported as rejected by OpenAI are rejected by a pre-flight validator inside the bench that implements the documented strict-mode rules (additionalProperties: false, every property required, no type-arrays, no oneOf). I did not separately submit each schema to the OpenAI API and observe the server's 400 response, so the rejection rate reported here is the rate at which OpenAI's documented strict-mode rules disqualify normal JSON Schema, not the rate at which OpenAI's server returns an error. If OpenAI relaxed strict mode tomorrow, the bench would not notice.

Gemini schemas are normalised before submission. Gemini's structured-output API supports a narrower keyword set than OpenAPI / draft-2020-12 JSON Schema. The bench's convertSchemaToGemini function passes through the keywords Gemini's docs list as supported (type, enum, format, min/max, required, properties, items) and drops the rest before submission. The validator still checks Gemini's output against the original schema, so any constraint the converter drops is implicitly given a free pass on the Gemini side. For the current corpus this only affects S5 and S6 (already rejected at pre-flight), but it would matter for any future schema relying on const, pattern, or additionalProperties as a real constraint.
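To make the normalisation step concrete, here is a minimal allowlist converter in Python. It is a sketch of the idea, not the bench's convertSchemaToGemini. The keyword set expands "min/max" to minimum, maximum, minItems, and maxItems, which is an assumption, and returning the list of dropped keywords is an addition of this sketch so that a harness could flag cells where a constraint was never shown to Gemini.

```python
from typing import Any, Dict, List, Tuple

# Assumed expansion of the supported keywords listed above.
SUPPORTED = {
    "type", "enum", "format", "minimum", "maximum",
    "minItems", "maxItems", "required", "properties", "items",
}


def convert_schema_to_gemini(schema: Dict[str, Any], path: str = "$") -> Tuple[Dict[str, Any], List[str]]:
    """Return (converted_schema, dropped), where dropped lists 'path: keyword'."""
    kept: Dict[str, Any] = {}
    dropped: List[str] = []
    for key, value in schema.items():
        if key not in SUPPORTED:
            dropped.append(f"{path}: {key}")
        elif key == "properties":
            kept[key] = {}
            for name, sub in value.items():
                kept[key][name], sub_dropped = convert_schema_to_gemini(sub, f"{path}.{name}")
                dropped += sub_dropped
        elif key == "items" and isinstance(value, dict):
            kept[key], sub_dropped = convert_schema_to_gemini(value, f"{path}[]")
            dropped += sub_dropped
        else:
            kept[key] = value
    return kept, dropped
```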

Sample sizes are uneven. The two cells the article quotes specifically (Anthropic Sonnet and Opus on S3 deep nesting) ran at n=20 each. The S7 long-array cells also ran at n=20 after an initial pilot revealed the Anthropic adapter was hard-capped at max_tokens: 4096, which was inflating the truncation rate; raising the cap to 8192 brought both Anthropic tiers to 100% strict adherence on S7. Everywhere else the bench ran at n=5 per cell, which is enough to see the dominant outcome but not enough to claim sharp rates.

Methodology, raw JSONL, schemas, and reproducible scripts are available at carrick-llm-structured-bench. The full re-run that backs the figures above cost roughly $8 in API credits and took about an hour of wall time.

Top comments (1)

Ken • Jun 15

Strong writeup. The provider-vs-harness distinction feels load-bearing here: transport success, parse success, schema conformance, and application admissibility are four different states. Once those are separate, the benchmark becomes more than a provider ranking; it becomes a regression harness for deciding when to retry, repair, fall back, or refuse the payload.