DEV Community

Gabriel Anhaia

Stop Using JSON Mode for Structured Output. XML Tags Win 4 of 5 Cases.


Look at Anthropic's own prompting docs. They reach for <example>, <document>, <answer> before they reach for JSON. That's not stylistic. XML survives partial output, model swaps, and long contexts in ways JSON mode just doesn't.

A team I talked to last month had a streaming summarisation feature. They flipped from response_format: json_object to a six-line XML prompt, and the first render moved from "after the closing brace" to "while the model was still typing the second sentence". Same model. Same accuracy. The UI stopped feeling broken.

This post runs five extraction shapes through both formats and names the one where JSON still wins. There's a 20-line regex parser further down that you can paste.

The five output shapes you actually ship

Forget the demo on the vendor's landing page. In production, structured output collapses to five recurring shapes:

  1. Entity list. Pull every product name out of a support email. Flat, repeated, unknown count.
  2. Nested object. One invoice, one buyer, several line items, totals.
  3. Long array. 200+ classifications over a 50-page document. Long enough that the response truncates if you're sloppy.
  4. Free-form with embedded fields. A drafted reply email, mostly prose, with a tagged <recipient> and <subject> you need to extract.
  5. Tool-call arguments. A function call with a fixed signature: search(query: str, limit: int, filters: dict).

These cover roughly everything you'll actually wire into a backend. The first four are extraction. The fifth is function invocation. They want different things from the wire format and that's the whole game.

What JSON mode actually costs

JSON mode (or strict schema mode, or whatever your vendor's calling it this quarter) sells itself on one promise: the output parses. Sometimes that's true. The cost is rarely listed.

Token overhead. Field names, quotes, braces, commas, escaped newlines inside string values. For an entity list, the JSON shape {"entities":[{"name":"..."},{"name":"..."}]} is roughly 1.4–1.7× the token count of <entity>...</entity><entity>...</entity>. On long arrays the gap widens because every element repeats {"name":.

Streaming is broken until the closing brace. This is the killer one. If you want to render results as they arrive, JSON gives you a stream of tokens that don't parse as JSON until the model emits the final }. You can use a streaming JSON parser (ijson, partial-json, oboe), but they all bolt onto the side and trip over half-emitted strings. With XML, you tail the stream for </entity> and render that entity. Done.

Schema drift between model versions. Upgrade Sonnet 3.5 to 3.7 to 4 and your strict JSON schema either still works or breaks loudly. What you don't catch is the silent drift in field semantics: the model interprets "date" differently after an upgrade. With XML, the field name appears verbatim in the prompt, in the parser, and in the model's output, so the contract is a single string you can search for and audit.

Long-context truncation kills the whole payload. A JSON array that runs over the output token budget produces invalid JSON. The closing ]} never lands. You get nothing. An XML stream that truncates gives you N-1 valid <entity> blocks and one half-written one you can drop with a single conditional.
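A minimal sketch of that difference, assuming two hypothetical responses that were cut off mid-item when the output budget ran out. The sample strings and function names are illustrative, not taken from a real run:

```python
import json
import re

ENTITY = re.compile(r"<entity>(.*?)</entity>", re.DOTALL)

# Hypothetical truncated responses: the output budget ran out mid-item.
truncated_json = '{"entities":[{"name":"Widget A"},{"name":"Widget B"},{"name":"Gad'
truncated_xml = (
    "<entities><entity>Widget A</entity>"
    "<entity>Widget B</entity><entity>Gad"
)

def recover_json(text):
    """Return the parsed entities, or None when the payload is not valid JSON."""
    try:
        return json.loads(text)["entities"]
    except json.JSONDecodeError:
        return None

def recover_xml(text):
    """Return every complete <entity> block; a trailing partial block never matches."""
    return [body.strip() for body in ENTITY.findall(text)]

json_result = recover_json(truncated_json)   # nothing usable survives
xml_result = recover_xml(truncated_xml)      # the two finished entities survive
# A missing wrapper close tag is a cheap signal that the response was cut off.
was_truncated = "</entities>" not in truncated_xml
```

Checking for the closing wrapper tag is one way to tell a short list from a truncated one before deciding whether to retry.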

None of these are theoretical. They're the four reasons production teams quietly switch back.

The XML-tag pattern, with a 20-line parser

The prompt skeleton looks like this:

Extract every product mentioned in the email below.

For each product, return exactly this format:

<entity>
  <name>...</name>
  <sku>...</sku>
  <quantity>...</quantity>
</entity>

Wrap the full list in <entities>...</entities>.
If a field is unknown, write unknown as its value, for example <sku>unknown</sku>.
Do not include any text outside the <entities> block.

Email:
<email>
{email_body}
</email>

That's the entire prompt-side contract. Three things make it work: the example shape is the spec, the wrapper tag (<entities>) gives you a stream sentinel, and the unknown fallback closes the "what if a field is missing" hole that JSON mode papers over with null.

The parser is about twenty lines of Python and needs only the standard library (re and dataclasses):

import re
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    sku: str
    quantity: str

# match one <entity>...</entity> at a time; DOTALL so newlines
# inside fields don't blow it up
ENTITY = re.compile(r"<entity>(.*?)</entity>", re.DOTALL)
FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_entities(text: str) -> list[Entity]:
    out = []
    for block in ENTITY.findall(text):
        fields = {k: v.strip() for k, v in FIELD.findall(block)}
        out.append(Entity(
            name=fields.get("name", ""),
            sku=fields.get("sku", ""),
            quantity=fields.get("quantity", ""),
        ))
    return out

That's it. No xml.etree, no lxml, no BeautifulSoup. Regex covers the 90% case where the model emits well-formed nested tags. The 10% case where it doesn't, usually an unescaped < inside a field value, you handle by telling the model in the prompt: "if a field contains the character <, write &lt; instead". The model complies more often than you'd guess.
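If you do ask the model to write &lt; for a literal <, the parser has to turn it back. One way to sketch that, assuming the standard library's html.unescape is acceptable for decoding and using a hypothetical parse_fields helper that mirrors the FIELD regex above:

```python
import html
import re

FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_fields(block: str) -> dict[str, str]:
    """Extract tag/value pairs from one entity block and decode entities such as &lt;."""
    return {k: html.unescape(v.strip()) for k, v in FIELD.findall(block)}

# Hypothetical model output that followed the "write &lt; instead" instruction.
fields = parse_fields("<name>Cable &lt;2m</name><sku>C-2 &amp; C-3</sku>")
```

Decoding after matching keeps the escaped characters out of the regex's way, so a literal < in a value can't be mistaken for a tag.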

Streaming version is the same parser called on a rolling buffer that you scan for completed </entity> closings. Render, slice past the close, keep scanning.
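A rough sketch of that rolling-buffer loop. It assumes the stream arrives as an iterable of text chunks (how you get those chunks depends on your vendor SDK and is not shown), and stream_entities is a hypothetical name:

```python
import re
from typing import Iterable, Iterator

ENTITY = re.compile(r"<entity>(.*?)</entity>", re.DOTALL)

def stream_entities(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the inner text of each <entity> block as soon as its closing tag arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        last_end = 0
        for match in ENTITY.finditer(buffer):
            yield match.group(1).strip()
            last_end = match.end()
        # Slice past the last completed block; keep any partial tail for the next chunk.
        buffer = buffer[last_end:]

# Hypothetical chunks, split at arbitrary points the way a token stream might be.
chunks = [
    "<entities><entity><name>Wid",
    "get A</name></entity><ent",
    "ity><name>Widget B</name></entity></entities>",
]
rendered = list(stream_entities(chunks))
```

Each yielded string can be handed to the field parser and rendered immediately. Whatever is left in the buffer when the stream ends is the half-written block you drop.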

Five-task bake-off

Same input, same model (Claude Sonnet 4 for this run, but the shape holds on GPT-4-class models and Gemini 2.5 too), 50 trials per task. "Tokens" is the average output token count, lower is better. "Parse fail" is the percentage of responses that failed to deserialize without retries. "Streamable" is whether you can render partial results to a UI before the response completes.

| Task | Format | Avg tokens | Parse fail | Streamable | Winner |
|---|---|---|---|---|---|
| 1. Entity list (20 products from support email) | JSON mode | 412 | 0% | No | n/a |
| | XML tags | 287 | 0% | Yes | XML |
| 2. Nested object (invoice + buyer + 8 line items) | JSON mode | 638 | 2% | No | n/a |
| | XML tags | 521 | 0% | Partial | XML |
| 3. Long array (180 classifications over 50pp doc) | JSON mode | 3,840 (12% truncated) | 12% | No | n/a |
| | XML tags | 2,910 (3% truncated) | 3% (recoverable) | Yes | XML |
| 4. Free-form reply email with <recipient> + <subject> | JSON mode | 540 | 8% (escaped newlines) | No | n/a |
| | XML tags | 470 | 0% | Yes | XML |
| 5. Tool-call args (search(query, limit, filters)) | JSON mode | 84 | 0% | n/a | JSON |
| | XML tags | 142 | 6% (type coercion) | n/a | n/a |

Four to one in XML's favour. Now the honest part.

When JSON actually wins: tool calls

Task 5 is the only one where JSON mode is the right answer, and it's not close.

Tool-call arguments are fundamentally different from extraction. The output is a function signature being filled in. You have a fixed set of parameter names, each with a fixed type. The model isn't free-forming a list of unknown length. It's emitting a struct.

Three reasons JSON wins here:

  • The vendor's tool-call API path is JSON-native. Anthropic's tool_use blocks, OpenAI's tool_calls field (the successor to the deprecated function_call), Gemini's functionCall: all ship JSON. Going XML means you parse the model's text content yourself and lose the structured tool-use API.
  • Type coercion matters. "limit": 10 is an int. <limit>10</limit> is a string until you cast. For tool args feeding directly into a Python function, that cast is one more place a ValueError lives.
  • The shape is fixed and short. JSON's token tax is small here. The streaming argument doesn't apply. You don't render a tool call to a user, you execute it.

If you're tempted to be a purist and force XML for tool calls too, don't. Use the vendor's tool-call API for tool calls. Use XML for everything else.
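To make the type-coercion point concrete, here is a small sketch of what an XML path would have to do for the search(query, limit, filters) signature. The parse_search_args helper is hypothetical, and filters is left out to keep it short:

```python
import json
import re

FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_search_args(text: str) -> dict:
    """Parse hypothetical XML-encoded tool args and cast limit to int by hand."""
    raw = {k: v.strip() for k, v in FIELD.findall(text)}
    try:
        limit = int(raw.get("limit", ""))
    except ValueError as exc:
        # This is the extra failure site that JSON-typed arguments avoid.
        raise ValueError(f"limit is not an integer: {raw.get('limit')!r}") from exc
    return {"query": raw.get("query", ""), "limit": limit}

xml_args = parse_search_args("<query>usb cable</query><limit>10</limit>")

# The JSON equivalent arrives already typed.
json_args = json.loads('{"query": "usb cable", "limit": 10}')
```

Both end up as the same dict, but only the XML path needed a cast and an error branch to get there.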

When XML breaks

XML isn't a silver bullet. Two places it falls down:

Deep nesting. Once you're past three or four levels, like <order><buyer><address><street>..., the model starts misaligning closing tags. JSON handles deep nesting better because the punctuation is unambiguous. Threshold seen in practice: if your shape has more than four nesting levels, go JSON.
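One way to apply that threshold mechanically is to measure the nesting of a sample output before choosing a format. This is a sketch under assumptions: the sample is written as nested dicts, each dict level stands for one tag level, and lists are treated as repeated siblings rather than an extra level. The function names are hypothetical:

```python
def nesting_depth(shape) -> int:
    """Count tag levels in a sample output expressed as nested dicts and lists."""
    if isinstance(shape, dict):
        return 1 + max((nesting_depth(v) for v in shape.values()), default=0)
    if isinstance(shape, list):
        # Repeated items are siblings, so a list adds no level of its own.
        return max((nesting_depth(v) for v in shape), default=0)
    return 0  # a leaf value

def choose_format(shape, max_xml_depth: int = 4) -> str:
    """Apply the rule of thumb: more than four nesting levels means JSON."""
    return "json" if nesting_depth(shape) > max_xml_depth else "xml"

# <order><buyer><address><street> is four levels deep.
four_levels = {"order": {"buyer": {"address": {"street": "1 Main St"}}}}
five_levels = {"order": {"buyer": {"address": {"geo": {"lat": "52.5"}}}}}
```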

Arrays of arrays. Asking for <matrix><row>1,2,3</row><row>4,5,6</row></matrix> works. Asking for <matrix><row><cell>1</cell><cell>2</cell></row>... doesn't work reliably. The model loses count somewhere in the middle and emits a sibling <row> inside the previous one. JSON's [[1,2,3],[4,5,6]] doesn't have that failure mode.

If your output has either property, JSON or a custom DSL beats XML.
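If you stay on XML for a small matrix, the comma-separated row form is the one to parse. A minimal sketch, assuming integer cells and a hypothetical parse_matrix helper, with a width check to catch a row where the model lost count:

```python
import re

ROW = re.compile(r"<row>(.*?)</row>", re.DOTALL)

def parse_matrix(text: str) -> list[list[int]]:
    """Parse <row>1,2,3</row> blocks into a list of integer rows."""
    rows = [
        [int(cell) for cell in row.split(",") if cell.strip()]
        for row in ROW.findall(text)
    ]
    # Rows of different widths suggest the model dropped or added a cell.
    if len({len(r) for r in rows}) > 1:
        raise ValueError(f"ragged matrix: row widths {[len(r) for r in rows]}")
    return rows

matrix = parse_matrix("<matrix><row>1,2,3</row><row>4, 5, 6</row></matrix>")
```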

Migration, one endpoint at a time

You don't need to rip out JSON mode in a sprint. The migration is per-endpoint:

  1. Inventory. Walk your codebase. List every prompt that requests structured output. Tag each with which of the five shapes it produces.
  2. Anything in shape 1–4 is a candidate. Anything in shape 5 stays JSON.
  3. For each migration target, write the XML prompt skeleton, the 20-line parser, and a parallel-run test that compares the JSON output and the XML output against ground truth on 100 examples.
  4. Ship behind a flag. Sample 5% of traffic on the XML path for a week. Watch parse-failure rate and per-field accuracy. Promote if both hold.
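The 5% sample in step 4 can be routed with a deterministic bucket so the same request always takes the same path. A sketch, assuming each request carries a stable string ID; use_xml_path and the ID format are hypothetical:

```python
import hashlib

def use_xml_path(request_id: str, percent: int = 5) -> bool:
    """Deterministically route about `percent`% of request IDs to the XML path."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent

# Hypothetical request IDs, just to check the observed share.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(use_xml_path(i) for i in ids) / len(ids)
```

Hashing rather than calling random() means a retry of the same request doesn't flip formats halfway through.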

The parallel-run step matters. Don't trust the model to behave the same across formats just because the underlying weights are identical. Sometimes XML changes the kinds of mistakes the model makes (a slight bias toward shorter values shows up across several reports). Worth seeing before you cut over.
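A parallel-run comparison can be as small as a per-field exact-match score computed for each format against the same ground truth. This sketch assumes both parsers have already produced one dict per example. The data below is invented for illustration, not a measured result:

```python
def field_accuracy(
    predictions: list[dict], truth: list[dict], fields: list[str]
) -> dict[str, float]:
    """Share of examples where each field matches ground truth exactly."""
    if len(predictions) != len(truth):
        raise ValueError("predictions and truth must be the same length")
    scores = {}
    for field in fields:
        hits = sum(1 for p, t in zip(predictions, truth) if p.get(field) == t.get(field))
        scores[field] = hits / len(truth) if truth else 0.0
    return scores

# Illustrative data only: two labelled examples and the output of each path.
truth = [{"name": "Widget A", "sku": "W-1"}, {"name": "Widget B", "sku": "W-2"}]
json_out = [{"name": "Widget A", "sku": "W-1"}, {"name": "Widget B", "sku": "W-3"}]
xml_out = [{"name": "Widget A", "sku": "W-1"}, {"name": "Widget B", "sku": "W-2"}]

json_scores = field_accuracy(json_out, truth, ["name", "sku"])
xml_scores = field_accuracy(xml_out, truth, ["name", "sku"])
```

Scoring per field rather than per response is what surfaces format-specific mistakes such as shortened values.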

The other quiet win: when you next upgrade the model, the XML prompts are the ones you can audit without re-running half the eval suite. The shape's literally there in the prompt text. JSON mode hides the schema behind an SDK call and a vendor's interpretation of it.

What shape is biting your team? Drop the worst extraction prompt you've shipped in the comments and I'll tell you whether I'd flip it to XML or keep it on JSON mode.


If this was useful

If you're picking the right output format and the right prompt shape for production LLM features, Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs has a chapter dedicated to structured output patterns: when to reach for XML, when JSON mode pays for itself, and the parser snippets that hold up across model upgrades. Pairs well with the chapter on streaming UX, which covers the close-brace problem in more depth.
