The Conventional Wisdom
Every prompt engineering guide says the same thing: wrap your prompt sections in XML tags. <instructions>, <schema>, <input>. Anthropic recommends it. OpenAI recommends it (with markdown headers). The internet treats it as best practice.
But best practices need boundaries. When does this actually matter? And what happens when you apply structural overhead to prompts that don't need it?
I ran the experiment. The answer is instructive.
Setup
Model: Claude Sonnet 4.5
Task: Extract 7 structured fields from restaurant descriptions
Test cases: 12 inputs across 4 difficulty tiers (unambiguous, ambiguous, missing data, conflicting signals)
Conditions: Flat prose vs XML-delimited — same semantic content, different structure
Prompt length: ~150 tokens (flat), ~200 tokens (XML)
Total calls: 24
📓 Full reproducible notebook on Kaggle
The two conditions are semantically identical. Same instructions, same schema definition, same inputs. The only variable is whether structural delimiters exist.
Results
| Metric | Flat | XML | Δ |
|---|---|---|---|
| Overall accuracy | 97.6% | 96.4% | -1.2 pp |
| Hallucination rate | 0% | 0% | 0 |
| Input token overhead | — | — | +31% |
XML was marginally worse. Not statistically significant at N=12, but certainly not better.
The only field with a notable gap: accepts_reservations (-8.3 pp for XML), where the XML condition inferred a reservation policy the flat condition correctly left as null. One wrong answer on 12 cases = 8.3% swing. Small N makes individual errors loud.
Both conditions produced zero hallucinations. Neither fabricated values when ground truth was null.
The Interpretation
This is not a surprising result if you think about what XML tags actually do.
Structural delimiters solve a disambiguation problem. They signal to the model: "this block is instructions, that block is data, this other block is context." The value emerges when the model might otherwise confuse one for another.
On a 150-token prompt with a clear instruction followed by a clear input, there's nothing to confuse. The model parses flat prose correctly because the prompt is short enough to be unambiguous on its own. Adding XML to a prompt that's already clear is the same anti-pattern as adding abstraction layers to simple code — it impresses no one and costs tokens.
The Revised Mental Model
The benefit of XML scales with prompt complexity, not prompt quality. Specifically:
XML helps when:
- The prompt exceeds ~500 tokens with 3+ logical sections (instructions, schema, examples, context, input). Without delimiters, the model may lose track of where one section ends and another begins.
- Input data resembles instructions. If your user-provided text contains phrases like "ignore previous instructions" or reads like a prompt itself, XML creates an explicit boundary the model can respect.
- Context accumulates over turns. In agentic loops where conversation history grows to thousands of tokens, structural markers prevent the model from treating old context as current instructions.
XML doesn't help when:
- The prompt is under ~300 tokens with a single clear task. The model handles unstructured prose at this scale without confusion.
- Instructions and data are obviously distinct. "Extract fields from this text: [text]" is unambiguous regardless of delimiters.
The threshold isn't a magic number — it's a function of how many distinct roles the content in your prompt serves and how easily a model could conflate them.
The Hidden Benefit
There's one value of XML my experiment can't measure: it forces prompt authors to decompose their thinking.
Deciding "what goes in <instructions> vs <schema> vs <examples>?" is a design exercise. It surfaces unclear requirements. It separates concerns. It produces better prompts — not because the model needs the structure, but because the human needed it to think clearly.
This is real value. But it's an authoring benefit, not a runtime benefit. For short prompts where the decomposition is trivial, the authoring benefit is also trivial.
What This Means for Production Systems
If you're building a system that makes 10K extraction calls per day with short, templated prompts:
- Flat prose saves 31% on input tokens. At Sonnet 4.5 pricing ($3/MTok input), that's ~$1.41/day or ~$515/year of pure waste if you XML-wrap prompts that don't need it.
- The cost is small in absolute terms. The principle is what matters: don't add structure for structure's sake.
If your prompts are long, complex, multi-section, or handle untrusted input — use XML. You're solving a real problem.
If your prompts are short, clear, and templated — skip it. You're adding overhead for nothing.
The rule: benchmark on your own data at your own prompt length before adopting any "best practice" wholesale.
Limitations
- N=12. Directional signal, not statistical proof.
- Single domain (restaurants), single model (Sonnet 4.5), single run per condition.
- Only tests the regime where XML shouldn't help.
What's Next
The natural follow-up: testing prompts at 1,000+ tokens with complex multi-section structures, embedded documents, and adversarial inputs — the regime where XML should shine. That experiment will tell us how much benefit XML provides when the conditions warrant it.
All opinions are my own and do not represent my employer.
Top comments (0)