<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manish Ramavat</title>
    <description>The latest articles on DEV Community by Manish Ramavat (@manishramavat).</description>
    <link>https://dev.to/manishramavat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3922912%2F151bbdbd-e293-447a-8ae2-e163bffa726f.jpg</url>
      <title>DEV Community: Manish Ramavat</title>
      <link>https://dev.to/manishramavat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manishramavat"/>
    <language>en</language>
    <item>
      <title>How to Save 90% on Claude API Input Costs With Prompt Caching (2026)</title>
      <dc:creator>Manish Ramavat</dc:creator>
      <pubDate>Sun, 10 May 2026 16:03:12 +0000</pubDate>
      <link>https://dev.to/manishramavat/how-to-save-90-on-claude-api-input-costs-with-prompt-caching-2026-4l78</link>
      <guid>https://dev.to/manishramavat/how-to-save-90-on-claude-api-input-costs-with-prompt-caching-2026-4l78</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you're calling the Claude API with a large system prompt, every request reprocesses the same tokens from scratch. Production AI systems — agents, RAG pipelines, customer-facing assistants — routinely carry 10K–30K token system prompts (tool definitions, reference docs, few-shot examples). At $3/MTok across hundreds of thousands of daily requests, redundant prefix processing can easily run $500–$3,000+/day. That's pure waste for context the model has already seen.&lt;/p&gt;

&lt;p&gt;Anthropic's prompt caching solves this. You mark a stable prefix as cacheable, pay a small one-time write surcharge (1.25×), and every subsequent request reads that prefix at &lt;strong&gt;10% of the standard price&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I ran a controlled experiment to measure the real-world savings. Here are the numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Prompt Caching Works
&lt;/h2&gt;

&lt;p&gt;The mechanism is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You attach &lt;code&gt;cache_control: {"type": "ephemeral"}&lt;/code&gt; to a content block in your request&lt;/li&gt;
&lt;li&gt;The API caches everything up to and including that block (the "prefix")&lt;/li&gt;
&lt;li&gt;On the next request with a byte-for-byte identical prefix, the model reads from cache instead of reprocessing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pricing (Claude Sonnet 4.5):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Price / MTok&lt;/th&gt;
&lt;th&gt;Relative to Base&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard input&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache write&lt;/td&gt;
&lt;td&gt;$3.75&lt;/td&gt;
&lt;td&gt;1.25×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache read&lt;/td&gt;
&lt;td&gt;$0.30&lt;/td&gt;
&lt;td&gt;0.1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Constraints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimum prefix: 1,024 tokens (model-dependent)&lt;/li&gt;
&lt;li&gt;TTL: 5 minutes, refreshed on each hit&lt;/li&gt;
&lt;li&gt;Max 4 explicit breakpoints per request&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt; must be passed as an array of content blocks (not a plain string)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;p&gt;Three API calls. Same system prompt (~2,158 tokens). Same user question. The only variables are the caching configuration and, for Call 3, the warm cache left behind by Call 2:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;cache_control&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Baseline — all tokens at standard rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;cache_control&lt;/code&gt; on system block&lt;/td&gt;
&lt;td&gt;Cache WRITE (1.25× on prefix)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Same as Call 2&lt;/td&gt;
&lt;td&gt;Cache READ (0.1× on prefix)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Baseline (no caching):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# plain string — caching not possible
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With explicit cache breakpoint:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One structural change: &lt;code&gt;system&lt;/code&gt; becomes a list of content blocks. That's the only code difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;&lt;code&gt;input_tokens&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;cache_creation&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;cache_read&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Total Input&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (baseline)&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (write)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;2,158&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (read)&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2,158&lt;/td&gt;
&lt;td&gt;2,180&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The API usage fields tell the full story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;input_tokens&lt;/code&gt;&lt;/strong&gt; = non-cached tail (the user message — 22 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_creation_input_tokens&lt;/code&gt;&lt;/strong&gt; = prefix written to cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_read_input_tokens&lt;/code&gt;&lt;/strong&gt; = prefix served from cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Call 3 reads 2,158 tokens from cache at $0.30/MTok instead of $3.00/MTok.&lt;/p&gt;
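
&lt;p&gt;To confirm which path a given request took, read these fields straight off the SDK response. Here's a minimal sketch (it assumes the &lt;code&gt;anthropic&lt;/code&gt; Python SDK's &lt;code&gt;usage&lt;/code&gt; object; the helper name is mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def log_cache_usage(response):
    """Print the per-request token breakdown reported by the Messages API."""
    usage = response.usage
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    print(f"non-cached input: {usage.input_tokens}")
    print(f"cache write:      {written}")
    print(f"cache read:       {read}")
    if read &amp;gt; 0:
        print("cache HIT: prefix billed at the 0.1x read rate")
    elif written &amp;gt; 0:
        print("cache WRITE: prefix billed at the 1.25x write rate")
    else:
        print("no caching: all input billed at the standard rate")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;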

&lt;h2&gt;
  
  
  Cost Analysis
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;Actual Input Cost&lt;/th&gt;
&lt;th&gt;Baseline Cost&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (baseline)&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (write)&lt;/td&gt;
&lt;td&gt;$0.008159&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;+24.7% (write surcharge)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (read)&lt;/td&gt;
&lt;td&gt;$0.000713&lt;/td&gt;
&lt;td&gt;$0.006540&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−89.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The write costs 25% more than baseline. The read costs 89% less. Break-even: &lt;strong&gt;2 requests&lt;/strong&gt;.&lt;/p&gt;
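
&lt;p&gt;The arithmetic is easy to check against the pricing table. A quick sketch using the same token counts (2,158-token cached prefix plus a 22-token user message):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Per-token input prices for Claude Sonnet 4.5 (from the pricing table above)
BASE, WRITE, READ = 3.00e-6, 3.75e-6, 0.30e-6
prefix, tail = 2_158, 22

baseline = (prefix + tail) * BASE        # every request without caching  (~$0.006540)
first    = prefix * WRITE + tail * BASE  # request 1: cache write         (~$0.008159)
later    = prefix * READ  + tail * BASE  # request 2+: cache reads        (~$0.000713)

print((first + later) / (2 * baseline))  # ~0.68: two cached requests already cost ~32% less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;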

&lt;h2&gt;
  
  
  Production Projection
&lt;/h2&gt;

&lt;p&gt;At 10,000 requests/day with a 5-minute TTL, I conservatively assume one cache write per TTL window (288/day), even though the TTL refreshes on every hit, so steady traffic would trigger even fewer writes. The remaining 9,712 requests pay cache-read pricing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;With Caching&lt;/th&gt;
&lt;th&gt;Without&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;$54.28&lt;/td&gt;
&lt;td&gt;$110.40&lt;/td&gt;
&lt;td&gt;$56.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly (30d)&lt;/td&gt;
&lt;td&gt;$1,628&lt;/td&gt;
&lt;td&gt;$3,312&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,684&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Savings&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
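
&lt;p&gt;The projection is reproducible in a few lines. Note the assumptions it bakes in: every request uses the full 300 output tokens, the prefix is written once per 5-minute window, and traffic is spread evenly across the day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Daily projection for a 2,158-token cached prefix at 10,000 requests/day
BASE, WRITE, READ, OUT = 3.00e-6, 3.75e-6, 0.30e-6, 15.00e-6
requests, prefix, tail, output = 10_000, 2_158, 22, 300

writes = 24 * 60 // 5          # 288 five-minute TTL windows per day
reads  = requests - writes     # 9,712 requests served from cache

with_cache = (writes * prefix * WRITE
              + reads * prefix * READ
              + requests * (tail * BASE + output * OUT))
without = requests * ((prefix + tail) * BASE + output * OUT)

print(f"with caching:    ${with_cache:,.2f}/day")          # ~$54.28
print(f"without caching: ${without:,.2f}/day")              # ~$110.40
print(f"savings:         {1 - with_cache / without:.1%}")   # ~50.8%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;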

&lt;p&gt;This is with a ~2,158-token system prompt. For agent-style workloads with 10K–30K token system prompts (tool definitions, reference docs, few-shot examples), the non-cached tail becomes a tiny fraction of each request and the occasional write surcharge amortizes over thousands of reads, so total input savings approach 85–89%.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Pitfall Worth Documenting
&lt;/h2&gt;

&lt;p&gt;My first implementation used &lt;strong&gt;top-level automatic caching&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Fails silently with varying user messages
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cache_control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# breakpoint at last block
&lt;/span&gt;    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;  &lt;span class="c1"&gt;# varies per request
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call triggered a cache &lt;strong&gt;write&lt;/strong&gt; — never a read. The API returned &lt;code&gt;cache_creation_input_tokens &amp;gt; 0&lt;/code&gt; on every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Top-level &lt;code&gt;cache_control&lt;/code&gt; places the breakpoint at the &lt;em&gt;last cacheable block&lt;/em&gt;, which includes the user message. Different messages produce different prefixes, so the cache key never matches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use explicit &lt;code&gt;cache_control&lt;/code&gt; on the system prompt block. The cached prefix then covers only the stable system prompt, and varying user messages sit after the breakpoint.&lt;/p&gt;

&lt;p&gt;This is not documented prominently in Anthropic's guides, but it's the critical distinction between "caching that works" and "caching that silently charges you 25% more on every call."&lt;/p&gt;

&lt;h2&gt;
  
  
  When Prompt Caching Makes Sense
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Expected Input Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static system prompt (&amp;gt;1K tokens) across requests&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-turn conversations (growing message history)&lt;/td&gt;
&lt;td&gt;70–85%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG with stable reference documents&lt;/td&gt;
&lt;td&gt;80–90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent loops with large tool catalogues&lt;/td&gt;
&lt;td&gt;60–80%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation Checklist
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Verify system prompt exceeds the model's minimum (1,024 tokens for Sonnet 4.5)&lt;/li&gt;
&lt;li&gt;Restructure &lt;code&gt;system&lt;/code&gt; from a plain string to a list of content blocks&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;"cache_control": {"type": "ephemeral"}&lt;/code&gt; on the last stable block&lt;/li&gt;
&lt;li&gt;Place static content before dynamic content in the prompt&lt;/li&gt;
&lt;li&gt;Confirm cache reads by checking &lt;code&gt;cache_read_input_tokens &amp;gt; 0&lt;/code&gt; in responses&lt;/li&gt;
&lt;li&gt;Ensure request frequency stays within the 5-minute TTL window&lt;/li&gt;
&lt;/ol&gt;
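
&lt;p&gt;Putting items 2–5 together, here's a minimal wrapper I'd use in practice (a sketch only; the function name and defaults are mine, and the client comes from the official &lt;code&gt;anthropic&lt;/code&gt; SDK):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def cached_create(system_prompt: str, question: str, **kwargs):
    """Call the Messages API with the stable system prompt marked as a cache breakpoint."""
    response = client.messages.create(
        model=kwargs.pop("model", "claude-sonnet-4-5"),
        max_tokens=kwargs.pop("max_tokens", 300),
        system=[{
            "type": "text",
            "text": system_prompt,                   # static content first (item 4)
            "cache_control": {"type": "ephemeral"},  # breakpoint on the stable block (items 2-3)
        }],
        messages=[{"role": "user", "content": question}],  # dynamic tail sits after the breakpoint
        **kwargs,
    )
    usage = response.usage
    # Item 5: from the second call onwards a hit shows up as cache_read_input_tokens &amp;gt; 0
    print(f"cache_read={getattr(usage, 'cache_read_input_tokens', 0)} "
          f"cache_write={getattr(usage, 'cache_creation_input_tokens', 0)}")
    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;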

&lt;h2&gt;
  
  
  Full Experiment
&lt;/h2&gt;

&lt;p&gt;Reproducible notebook with all code: &lt;strong&gt;&lt;a href="https://www.kaggle.com/code/manishramavat/how-prompt-caching-cuts-claude-api-costs-by-90" rel="noopener noreferrer"&gt;Kaggle →&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic Prompt Caching Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/about-claude/pricing" rel="noopener noreferrer"&gt;Claude API Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/cookbook/misc-prompt-caching" rel="noopener noreferrer"&gt;Prompt Caching Cookbook (Official)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Article #2 in the LLM Engineering Experiments series. Previous: &lt;a href="https://dev.to/manishramavat/how-to-choose-the-right-prompt-engineering-pattern-and-why-simpler-is-usually-better-446b"&gt;How to Choose the Right Prompt Engineering Pattern&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Choose the Right Prompt Engineering Pattern (And Why Simpler Is Usually Better)</title>
      <dc:creator>Manish Ramavat</dc:creator>
      <pubDate>Sun, 10 May 2026 15:04:58 +0000</pubDate>
      <link>https://dev.to/manishramavat/how-to-choose-the-right-prompt-engineering-pattern-and-why-simpler-is-usually-better-446b</link>
      <guid>https://dev.to/manishramavat/how-to-choose-the-right-prompt-engineering-pattern-and-why-simpler-is-usually-better-446b</guid>
      <description>&lt;p&gt;I spent the past weekend running a head-to-head experiment: five popular prompt engineering patterns, one real model (Claude Sonnet 4.5), fifty real movie reviews. The goal was simple — find out which technique actually delivers the best results.&lt;/p&gt;

&lt;p&gt;The result? &lt;strong&gt;The simplest approach won.&lt;/strong&gt; And the most sophisticated one — Chain-of-Thought — didn't just underperform. It &lt;em&gt;actively made things worse.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Patterns I Tested
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Zero-Shot
&lt;/h3&gt;

&lt;p&gt;Direct instruction, no examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify this movie review as 'positive' or 'negative'. Reply with only one word.

Review: "The food was terrible but the service was great."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Few-Shot (k=3)
&lt;/h3&gt;

&lt;p&gt;Three input-output examples before the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify movie reviews as 'positive' or 'negative'. Reply with only one word.

Review: "a masterpiece of modern cinema" → positive
Review: "boring and pointless" → negative
Review: "absolutely loved every minute" → positive

Review: "The food was terrible but the service was great." →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Chain-of-Thought (CoT)
&lt;/h3&gt;

&lt;p&gt;Ask the model to reason step by step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify this movie review as 'positive' or 'negative'.
Think step by step about the sentiment words and overall tone,
then give your final answer on the last line as just one word.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Role Prompting
&lt;/h3&gt;

&lt;p&gt;Assign the model a persona.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert sentiment analyst with 20 years of experience
in film criticism and NLP.
Classify this movie review as 'positive' or 'negative'. Reply with only one word.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Structured Output
&lt;/h3&gt;

&lt;p&gt;Force the model to respond in JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Analyze&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;movie&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;review&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;respond&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ONLY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;valid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"sentiment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"positive"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"negative"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Experiment
&lt;/h2&gt;

&lt;p&gt;I ran all 5 patterns against 50 real movie reviews from the SST-2 dataset using Claude Sonnet 4.5. Each review was classified as positive or negative, and I measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; — did it get the right answer?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — how long did it take?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token cost&lt;/strong&gt; — how many tokens were consumed?&lt;/li&gt;
&lt;/ul&gt;
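
&lt;p&gt;For reference, the measurement loop is only a few lines. This is a simplified sketch, not the exact notebook code: it assumes the &lt;code&gt;anthropic&lt;/code&gt; Python SDK and a list of &lt;code&gt;(review, label)&lt;/code&gt; pairs, and it shows only the Zero-Shot template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import anthropic

client = anthropic.Anthropic()

ZERO_SHOT = ("Classify this movie review as 'positive' or 'negative'. "
             "Reply with only one word.\n\nReview: \"{review}\"")

def evaluate(prompt_template, samples):
    """Run one pattern over (review, label) pairs and return accuracy, latency, tokens."""
    correct, total_latency, total_tokens = 0, 0.0, 0
    for review, label in samples:
        start = time.time()
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            messages=[{"role": "user",
                       "content": prompt_template.format(review=review)}],
        )
        total_latency += time.time() - start
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        # Take the last word of the reply so CoT-style answers are parsed too
        prediction = response.content[0].text.strip().split()[-1].lower().strip(".'\"")
        correct += prediction == label
    n = len(samples)
    return correct / n, total_latency / n, total_tokens / n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;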

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Avg Tokens&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zero-Shot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.58s&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-Shot (k=3)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.78s&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;1.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Role Prompting&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.83s&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured Output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.06s&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;1.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chain-of-Thought&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;64.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.23s&lt;/td&gt;
&lt;td&gt;228&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Surprising Finding
&lt;/h2&gt;

&lt;p&gt;Four out of five patterns hit &lt;strong&gt;98% accuracy&lt;/strong&gt;. The model is simply good enough at binary sentiment that Zero-Shot, Few-Shot, Role Prompting, and Structured Output all achieve nearly the same result.&lt;/p&gt;

&lt;p&gt;But Chain-of-Thought &lt;strong&gt;collapsed to 64%&lt;/strong&gt; — barely better than guessing.&lt;/p&gt;

&lt;p&gt;Here's a real example. For the review &lt;em&gt;"an utterly compelling 'torture' story"&lt;/em&gt; (label: positive), Zero-Shot immediately returned "positive." But Chain-of-Thought went down a rabbit hole:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The word 'torture' has negative connotations... however 'compelling' is positive... the quotes around 'torture' suggest it may be used figuratively... but the overall sentiment is ambiguous..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And got it wrong.&lt;/p&gt;

&lt;p&gt;Why? Because asking the model to "think step by step" about something it already knows how to do &lt;em&gt;introduces confusion&lt;/em&gt;. The reasoning process picks up on ambiguity that doesn't exist when the model just answers directly. It overthinks.&lt;/p&gt;

&lt;p&gt;And it costs 4.6x more tokens for that worse result.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for You
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Don't reach for complex patterns by default
&lt;/h3&gt;

&lt;p&gt;If your model is capable enough for the task, Zero-Shot might be all you need. In my test, the simplest approach was the cheapest, fastest, AND tied for most accurate. There was literally no reason to use anything fancier.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Chain-of-Thought can actively hurt on "simple" tasks
&lt;/h3&gt;

&lt;p&gt;CoT is designed for multi-step reasoning (math, logic, planning). When you apply it to tasks the model already handles well in one shot, you're adding noise, not signal. In my test, it cut accuracy by 34 percentage points while costing nearly 5x more. That's the worst possible trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Fancier patterns cost more without accuracy gains
&lt;/h3&gt;

&lt;p&gt;Few-Shot used 1.7x the tokens. Role Prompting used 1.4x. Structured Output used 1.3x. All hit the same 98% as Zero-Shot. If you're running thousands of classifications per day in production, that cost difference adds up — for literally zero accuracy benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Match your pattern to your problem
&lt;/h3&gt;

&lt;p&gt;Stop asking "which pattern is best?" and start asking "how hard is this task for this model?" If the answer is "not very" — and for a frontier model on binary classification, it usually isn't — just use Zero-Shot and move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Decision Guide
&lt;/h2&gt;

&lt;p&gt;Based on these results and broader published research:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Recommended Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model is capable enough for the task&lt;/td&gt;
&lt;td&gt;Zero-Shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model needs calibration on output format&lt;/td&gt;
&lt;td&gt;Few-Shot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-step math or logic problems&lt;/td&gt;
&lt;td&gt;Chain-of-Thought&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need specific tone or perspective&lt;/td&gt;
&lt;td&gt;Role Prompting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need machine-parseable output&lt;/td&gt;
&lt;td&gt;Structured Output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High-stakes decisions needing max accuracy&lt;/td&gt;
&lt;td&gt;Self-Consistency (multiple CoT runs + voting)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
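
&lt;p&gt;Self-Consistency is the one pattern in that table I didn't benchmark here. It's simply several Chain-of-Thought samples plus a majority vote; here's a minimal sketch (assumes the &lt;code&gt;anthropic&lt;/code&gt; SDK and a CoT prompt that asks for the final answer alone on the last line):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

def self_consistency(client, cot_prompt, runs=5):
    """Sample several Chain-of-Thought answers and return the majority vote."""
    votes = []
    for _ in range(runs):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=300,
            temperature=1.0,  # sampling diversity is the whole point
            messages=[{"role": "user", "content": cot_prompt}],
        )
        # The CoT prompt puts the one-word answer on the last line
        votes.append(response.content[0].text.strip().splitlines()[-1].strip().lower())
    return Counter(votes).most_common(1)[0][0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;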

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the simplest pattern that works. Add complexity only when the data proves you need it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zero-Shot was the fastest (1.58s), cheapest (50 tokens), and tied for most accurate (98%). Every other pattern either matched it at higher cost, or actively hurt performance.&lt;/p&gt;

&lt;p&gt;The biggest mistake I see is reaching for Chain-of-Thought or complex prompting strategies when a direct instruction would have been faster, cheaper, and more reliable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full experiment code available on &lt;a href="https://www.kaggle.com/code/manishramavat/how-to-choose-the-right-prompt-engineering-pattern/notebook" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;. Model: Claude Sonnet 4.5. Dataset: 50 SST-2 sentiment samples (28 positive, 22 negative).&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS.&lt;/li&gt;
&lt;li&gt;Wang, X., et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR.&lt;/li&gt;
&lt;li&gt;Brown, T., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS.&lt;/li&gt;
&lt;li&gt;Anthropic. (2024). "Prompt Engineering for Claude." Anthropic Documentation.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
