DEV Community

Webmaster Ramos

Posted on • Originally published at webmaster-ramos.com

YAML vs Markdown vs JSON vs TOON: Which Format Is Most Efficient for the Claude API?

My own benchmark across three Claude tiers (Haiku, Sonnet, Opus): 120 data files, 8 real-world scenarios, 5 formats. Tokens, cost, and accuracy – numbers, not opinions.


You Are Overpaying for Prompts

Every time you send data to the Claude API, the format of that data determines how many tokens you spend. The same 200-product catalog in JSON costs 15,879 tokens. In Markdown, it costs 7,814. In TOON, 6,088. That is a 62% difference.

A 120-task list? JSON consumes 8,500 tokens. TOON uses 2,267. Savings: 73%.

The problem is that every existing benchmark focuses on GPT, Gemini, and Llama. There has not been a public benchmark for Claude. I decided to fix that.

I ran 450 API calls on Claude Haiku 4.5, tested Sonnet 4.6 and Opus 4.6, and counted tokens across 120 files using Anthropic's production tokenizer. Eight real-world scenarios, five formats. This article covers the results, the conclusions, and specific recommendations.


Five Formats at a Glance

JSON (JavaScript Object Notation)

  • Year created: 2001; ECMA-404 standard (2013)
  • Author: Douglas Crockford
  • Primary use case: APIs, data exchange between systems, configuration files
  • Key characteristic: strict typing, nesting via {} and [], mandatory quotes

JSON is the lingua franca of programmatic interfaces. Every API speaks JSON, and every language can parse it. But that universality comes at a price in an LLM context: quotes, braces, and commas all consume tokens. They carry syntactic weight, but not semantic meaning.

{"products": [{"id": 1, "name": "Mouse", "price": 29.99, "in_stock": true}]}

YAML (YAML Ain't Markup Language)

  • Year created: 2001; YAML 1.2 standard (2009)
  • Authors: Clark Evans, Ingy döt Net, Oren Ben-Kiki
  • Primary use case: configuration files (Docker Compose, Kubernetes, GitHub Actions)
  • Key characteristic: indentation-based structure, minimal punctuation

YAML is the de facto standard of the DevOps world. It reads like pseudocode and usually does not require quotes. The trade-off is that repeating keys for every array item eats up much of the punctuation savings.

products:
  - id: 1
    name: Mouse
    price: 29.99
    in_stock: true

Markdown

  • Year created: 2004
  • Author: John Gruber (with Aaron Swartz)
  • Primary use case: documentation, READMEs, blogs, wikis
  • Key characteristic: human-first syntax – headings #, tables |, lists -

Markdown is the most “native” format for LLMs. Models have been trained on billions of READMEs and wiki pages. GitHub, Notion, Obsidian – all rely on Markdown. It is a communication format, not a data format.

## Products

| ID | Name  | Price | In Stock |
|----|-------|-------|----------|
| 1  | Mouse | 29.99 | Yes      |

Plain Text

  • Primary use case: human communication – emails, notes, instructions
  • Key characteristic: no syntax, no markup, maximum flexibility

Plain text with no markup. It minimizes token overhead, but it provides no explicit structure for programmatic data extraction.

Products: Mouse (ID 1, $29.99, in stock)

TOON (Token-Oriented Object Notation)

  • Year created: 2025 (v1.0 – November 2025, MIT license)
  • Author: open-source community (GitHub)
  • Primary use case: token optimization in LLM prompts, replacing JSON in AI workflows
  • Key characteristic: a YAML + CSV hybrid (indentation for objects, row-style encoding for arrays)

The newest format in this comparison. TOON was created for one purpose: minimize tokens while preserving lossless JSON round-tripping. For arrays of homogeneous objects, field names are declared once and values are written as CSV-style rows. On GPT-5 Nano, it showed 99.4% accuracy with 46% token savings. Before this benchmark, it had not been tested on Claude.

products[1]{id,name,price,in_stock}:
1,Mouse,29.99,true
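
To make the row-style encoding concrete, here is a minimal Python sketch of a TOON-style encoder for arrays of homogeneous objects. This is an illustration of the idea, not the official TOON SDK; it skips the spec's quoting and escaping rules.

```python
def encode_toon_array(name, items):
    """Encode a list of same-shaped dicts as a TOON-style block:
    field names are declared once, values become CSV-style rows."""
    fields = list(items[0].keys())
    header = f"{name}[{len(items)}]{{{','.join(fields)}}}:"
    rows = [",".join(str(item[f]).lower() if isinstance(item[f], bool)
                     else str(item[f]) for f in fields)
            for item in items]
    return "\n".join([header] + rows)

products = [{"id": 1, "name": "Mouse", "price": 29.99, "in_stock": True}]
print(encode_toon_array("products", products))
# products[1]{id,name,price,in_stock}:
# 1,Mouse,29.99,true
```

The savings come from the header: for 200 products, the field names appear once instead of 200 times.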

Methodology

What I Tested

Eight scenarios, each in three sizes (S / M / L), each in five formats. Total: 120 data files.

| # | Scenario | Data type | S | M | L |
|---|----------|-----------|---|---|---|
| 1 | System prompt / instructions | Rules, sections | 10 rules | 30 rules | 60 rules |
| 2 | Product catalog | Tabular data | 20 products | 100 products | 200 products |
| 3 | Roadmap / tasks | Statuses, dependencies | 15 tasks | 50 tasks | 120 tasks |
| 4 | Business rules | Conditional logic | 8 rules | 25 rules | 50 rules |
| 5 | Few-shot classification | Input–output examples | 5 examples | 15 examples | 40 examples |
| 6 | Organizational hierarchy | 3 levels of nesting | 12 people | 60 people | 150 people |
| 7 | API documentation | Endpoints, parameters | 5 endpoints | 15 endpoints | 30 endpoints |
| 8 | Output format | Requesting data in a given format | 10 countries | 50 countries | 100 countries |

Few-shot (scenario 5) is a prompting technique in which several “input → output” examples are included directly in the prompt so the model can infer the task from a pattern. For example: "Great product!" → positive, "Terrible quality" → negative, then the question "Love it!" → ?. Zero examples is zero-shot, one example is one-shot, several examples is few-shot. The format of those examples directly affects cost: 40 pairs in JSON take 2,131 tokens; in TOON, 996.
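
To see why the serialization of few-shot pairs matters, here is a small sketch comparing the same examples in JSON and in TOON-style rows. Character counts are only a rough proxy for tokens (the benchmark used Anthropic's production tokenizer), but the direction of the effect is the same.

```python
import json

# Three "input -> label" few-shot pairs from the sentiment example above
pairs = [("Great product!", "positive"),
         ("Terrible quality", "negative"),
         ("Love it!", "positive")]

# JSON repeats the keys and quoting for every example
as_json = json.dumps([{"input": text, "label": label} for text, label in pairs])

# TOON-style: field names declared once, one CSV-style row per example
as_toon = "\n".join([f"examples[{len(pairs)}]{{input,label}}:"]
                    + [f"{text},{label}" for text, label in pairs])

print(len(as_json), len(as_toon))  # character counts as a rough size proxy
```

With 40 pairs instead of 3, the repeated-key overhead in JSON grows linearly, which is where the 2,131 vs 996 token gap comes from.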

For scenarios 2, 3, 6, and 7, I prepared questions with precomputed correct answers (ground truth). For scenarios 1, 4, and 5, scoring was manual and rubric-based. For scenario 8, I measured output tokens and format compliance.

Models and Pricing

| Model | Tier | Input ($/1M) | Output ($/1M) |
|---|---|---|---|
| Claude Haiku 4.5 | Fast | $0.80 | $4 |
| Claude Sonnet 4.6 | Mid | $3 | $15 |
| Claude Opus 4.6 | Premium | $15 | $75 |

Accuracy was measured across all three tiers. Sizes S and M were tested for accuracy. L-size was used only for token counts.

Clean-Test Principle

All requests were sent directly via the anthropic Python SDK: plain client.messages.create() with temperature=0. No MCP servers, IDE plugins, or agent frameworks.

Token counting was done with client.messages.count_tokens() – Anthropic’s production tokenizer, i.e. the same numbers used for billing. The tokenizer is the same across all Claude tiers – so the token-count data applies to all Claude models.

Benchmark code: github.com/webmaster-ramos/yaml-vs-md-benchmark


Input-Token Efficiency

These numbers apply to all Claude tiers – Haiku, Sonnet, and Opus all use the same tokenizer. The only cost difference comes from the price per token.

Summary Table: Average Input Tokens Across All Scenarios

| Format | Average tokens | vs JSON |
|---|---|---|
| JSON | 3,252 | baseline |
| YAML | 2,208 | -32% |
| Markdown | 1,514 | -53% |
| Plain Text | 1,391 | -57% |
| TOON | 1,226 | -62% |

TOON saves 62% of input tokens on average versus JSON. Markdown saves 53%. YAML, despite its minimal punctuation, saves only 32% – because of repeated keys and indentation overhead.
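
The "vs JSON" column is simply the token ratio against the JSON baseline; you can reproduce it from the averages in the table:

```python
# Average input tokens across all scenarios (from the summary table above)
AVG_TOKENS = {"JSON": 3252, "YAML": 2208, "Markdown": 1514,
              "Plain Text": 1391, "TOON": 1226}

baseline = AVG_TOKENS["JSON"]
for fmt, tokens in AVG_TOKENS.items():
    saving = round((1 - tokens / baseline) * 100)
    print(f"{fmt}: {tokens} tokens, -{saving}% vs JSON")
```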

Breakdown by Scenario (% Savings vs JSON, L-size)

| Scenario | YAML | MD | TXT | TOON |
|---|---|---|---|---|
| Instructions | -22% | -29% | -24% | -24% |
| Products | -29% | -51% | -53% | -62% |
| Tasks | -35% | -63% | -69% | -73% |
| Business Rules | -28% | -52% | -48% | -63% |
| Few-shot | -31% | -45% | -37% | -53% |
| Hierarchy | -37% | -61% | -67% | -68% |
| API Docs | -35% | -45% | -59% | -53% |

[Charts: YAML, MD, TXT, and TOON savings vs JSON by scenario (%, L-size)]

Detailed Charts by Scenario

[Charts: input tokens by scenario – Instructions, Products, Tasks, Rules, Few-shot, Hierarchy, API Docs]

Key Observations

  1. TOON is the clear leader for tabular data. Product catalogs, task lists, few-shot examples – anything that looks like an array of homogeneous objects. Savings: 62–73% versus JSON.

  2. Markdown is the best all-purpose format. A stable 50–65% reduction across all data types. It is the only format that performs consistently well across tables, instructions, and hierarchies.

  3. YAML is underwhelming. Many people expect YAML to be much more compact than JSON. In practice, the savings are only 14–41%. The reason is repeated keys for every array element.

  4. Plain Text wins on API docs. For technical specifications, plain text is more efficient than TOON (59% vs 53%). Without extra syntax, descriptive text compresses better.

  5. Scale barely affects the percentage savings. The difference between S and L is under 2 percentage points. Format drives efficiency more than data volume does.


Haiku 4.5: When Format Matters

Haiku is the most format-sensitive tier. In 35% of questions, it produced different answers depending on the input format. Accuracy spread reached as high as 36 percentage points between the best and worst format within the same scenario.

Accuracy by Scenario

[Charts: Haiku accuracy by format – Products, Tasks, Hierarchy, API Docs]

| Scenario | JSON | YAML | MD | TXT | TOON | Best |
|---|---|---|---|---|---|---|
| Products | 63.4% | 61.4% | 69.2% | 70.2% | 66.2% | TXT |
| Tasks | 71.0% | 65.7% | 66.7% | 56.7% | 65.3% | JSON |
| Hierarchy | 85.7% | 92.9% | 85.7% | 78.2% | 85.7% | YAML |
| API Docs | 85.7% | 85.7% | 57.1% | 78.6% | 85.7% | JSON/YAML/TOON |

Hierarchy shows the sharpest gap: YAML (92.9%) vs Markdown (57.1%) – a 36-point difference. Tree-like structures are clearly easier for Haiku to parse in an indentation-based format.

API Docs: Markdown performs unexpectedly poorly – 57.1% vs 85.7% for JSON. For technical specifications with parameters and types, explicit structure matters more than compactness.

Accuracy by Size (Haiku)

| Size | Accuracy |
|---|---|
| S (small data) | 80.3% |
| M (medium data) | 67.2% |

Scale matters more than format. Accuracy drops by 13 points when moving from S to M – more than the average difference between formats (5.7 points). The implication is straightforward: reduce data volume first, then optimize format.

Cost: Haiku

| Format | Avg tokens | Cost / request | 100K requests / month |
|---|---|---|---|
| JSON | 3,252 | $0.0026 | $260 |
| YAML | 2,208 | $0.0018 | $177 |
| MD | 1,514 | $0.0012 | $121 |
| TXT | 1,391 | $0.0011 | $111 |
| TOON | 1,226 | $0.0010 | $98 |
| JSON → TOON | – | -62% | $162/month saved |

Output Format: Haiku

[Charts: output tokens at S-size (10 countries) and M-size (50 countries) – Haiku, Sonnet, Opus]

| Requested format | S (10 countries) | M (50 countries) | Savings vs JSON |
|---|---|---|---|
| JSON | 465 | 1,985 | baseline |
| YAML | 296 | 1,352 | -32 to -36% |
| Markdown | 165 | 1,125 | -43 to -65% |
| Plain Text | 294 | 1,381 | -30 to -37% |
| TOON | 342 | 1,369 | -26 to -31% |

Markdown is the cheapest output format on Haiku. 165 vs 465 tokens on S-size – a 65% reduction. At $4 per 1M output tokens, that matters.

Important: TOON loses on output. Haiku does not know the TOON format and, instead of producing compact CSV-like rows, tends to emit verbose plain text that only vaguely resembles TOON. A few-shot example improves TOON output quality, but it still trails Markdown in efficiency.

Output-Format Choice: Technical Requirements

Output cost is not the only thing that matters. Often, Claude’s response must be processed programmatically – parsed, inserted into a database, or passed to another service. The best output format depends on who or what is going to read it.

| Usage scenario | Recommendation | Why |
|---|---|---|
| User-facing answer in UI | Markdown | Renders natively, lowest token cost |
| Backend parsing | JSON | Reliable, universal, guaranteed structure |
| Config / YAML pipeline | YAML | Human-readable + machine-parsable |
| Rows for CSV / spreadsheet | TXT | Minimal overhead, structure via delimiters |
| Compact output for TOON SDK | TOON | Only if using Opus, or with a few-shot example |

Rule of thumb: if a human reads the output, use Markdown. If code reads it, use JSON or YAML. Do not optimize output cost at the expense of parsing reliability in production.
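
For the backend-parsing case, a common defensive pattern is to strip an optional Markdown code fence before parsing, since models sometimes wrap JSON output in one. This is a generic sketch, not part of any SDK:

```python
import json
import re

def parse_model_json(text):
    """Extract JSON from a model response that may wrap it in a ```json fence."""
    match = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

print(parse_model_json('```json\n{"ok": true}\n```'))  # {'ok': True}
print(parse_model_json('{"bare": 1}'))                 # {'bare': 1}
```

Combined with a clear "Return only valid JSON" instruction, this covers most parsing failures without retries.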

Recommendations for Haiku

| Data type | Best input | Accuracy | Best output |
|---|---|---|---|
| System prompts | MD | stable | MD |
| Catalogs, lists | TXT | 70.2% | MD |
| Tasks / roadmap | JSON | 71.0% | MD or JSON |
| Hierarchies | YAML | 92.9% | YAML |
| API documentation | JSON or YAML | 85.7% | JSON |
| Few-shot examples | TOON | 65.3% (-0.5% vs JSON) | MD |

On Haiku, format matters – especially for hierarchies and API documentation. Use TOON on input where token savings are worth a small accuracy trade-off, but do not use TOON on output without a few-shot example.


Sonnet 4.6: Format Affects Cost, Not Quality

Sonnet 4.6 produced identical answers across all five formats. In 100% of questions, the result was the same regardless of how the data was represented. For Sonnet, format optimization is pure cost reduction with no quality trade-off.

Accuracy: Format-Invariant

[Chart: accuracy by model and format]

| Format | Sonnet 4.6 |
|---|---|
| JSON | 89.4% |
| YAML | 89.4% |
| Markdown | 89.4% |
| Plain Text | 89.4% |
| TOON | 89.4% |

The answers are completely identical across all formats. Switching from JSON to TOON saves 62% of input tokens while preserving the same output.

Cost: Sonnet

| Format | Avg tokens | Cost / request | 100K requests / month |
|---|---|---|---|
| JSON | 3,252 | $0.0098 | $975 |
| YAML | 2,208 | $0.0066 | $663 |
| MD | 1,514 | $0.0045 | $454 |
| TXT | 1,391 | $0.0042 | $417 |
| TOON | 1,226 | $0.0037 | $368 |
| JSON → TOON | – | -62% | $607/month saved |

At 100K requests per month, switching from JSON to TOON saves $607/month. On Sonnet, output costs $15 per 1M tokens, so output optimization also matters.

Output Format: Sonnet

Output tokens for Sonnet (estimated as characters ÷ 3.5 chars/token):

| Format | S (10 countries) | M (50 countries) |
|---|---|---|
| JSON | ~210 | ~1,120 |
| YAML | ~195 | ~1,023 |
| Markdown | ~143 | ~746 |
| Plain Text | ~103 | ~549 |
| TOON | ~86 | ~414 |

[Charts: output tokens across all three models, S-size and M-size]

On Sonnet, TOON output requires a few-shot example. Without extra context, Sonnet interprets “TOON format” literally – as an abbreviation connected to cartoons – and returns an irrelevant answer. With a format example in the prompt, it generates correct TOON.

Technical requirements for output on Sonnet are the same as on Haiku: if a downstream system parses the response programmatically, use JSON or YAML. If a human is going to read it, use Markdown.

Recommendations for Sonnet

On Sonnet, format choice is a pure cost optimization. The logic is simple:

  • Input data: use TOON (for tables) or MD (for instructions / hierarchies)
  • Human-readable output: Markdown (-65% vs JSON)
  • Machine-parsed output: JSON (most reliable) or YAML (more compact, still parseable)
  • TOON output: add a few-shot example to the prompt; otherwise the answer may be incorrect

Optimal prompt design: MD for instructions + TOON for data + a request for MD/JSON output.
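
A multi-format prompt along these lines can be assembled as plain string composition; the section headers and wording here are illustrative, not a prescribed template:

```python
# Instructions in Markdown: the format Claude reads most natively
instructions_md = """## Task
Answer questions about the product catalog below.
Return the answer as a Markdown table."""

# Data in TOON: field names declared once, CSV-style rows
data_toon = """products[2]{id,name,price,in_stock}:
1,Mouse,29.99,true
2,Keyboard,49.99,false"""

prompt = (f"{instructions_md}\n\n## Data (TOON)\n{data_toon}\n\n"
          "Question: Which products are in stock?")
print(prompt)
```

The one prompt then combines MD's instruction clarity with TOON's data compactness, while the explicit output instruction keeps the response in Markdown.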


Opus 4.6: Maximum Capability, Also Format-Invariant

Opus 4.6 is the strongest model and the most expensive one. Like Sonnet, it is completely insensitive to input format. But Opus has one unique advantage: it knows TOON “out of the box.”

Accuracy: Format-Invariant

| Format | Opus 4.6 |
|---|---|
| JSON | 93.5% |
| YAML | 93.5% |
| Markdown | 93.5% |
| Plain Text | 93.5% |
| TOON | 93.5% |

The answers are 100% identical across all formats. Changing format affects only cost.

Cost: Opus

| Format | Avg tokens | Cost / request | 100K requests / month |
|---|---|---|---|
| JSON | 3,252 | $0.0488 | $4,878 |
| YAML | 2,208 | $0.0331 | $3,312 |
| MD | 1,514 | $0.0227 | $2,271 |
| TXT | 1,391 | $0.0209 | $2,087 |
| TOON | 1,226 | $0.0184 | $1,839 |
| JSON → TOON | – | -62% | $3,039/month saved |

On Opus, switching from JSON to TOON saves over $3,000/month at 100K requests. Output costs $75 per 1M tokens – so format optimization has the largest financial impact here.

Output Format: Opus

Output tokens for Opus (estimated as characters ÷ 3.5 chars/token):

| Format | S (10 countries) | M (50 countries) |
|---|---|---|
| JSON | ~254 | ~1,271 |
| YAML | ~286 | ~1,414 |
| Markdown | ~177 | ~814 |
| Plain Text | ~194 | ~986 |
| TOON | ~106 | ~543 |

[Charts: output tokens across all three models, S-size and M-size]

Opus generates TOON without hints. That is the key difference from Sonnet and Haiku. Opus knows the format and produces valid TOON output on the first try.

Can Claude generate valid TOON output?

[Chart: TOON output generation across models]

| Model | Without example in prompt | With few-shot example |
|---|---|---|
| Opus 4.6 | Valid TOON | Valid TOON |
| Sonnet 4.6 | Cartoon / irrelevant | Valid TOON |
| Haiku 4.5 | Verbose plain text | Closer to TOON, but still inaccurate |

In practical terms, this means: if you need TOON output and want it to work reliably without prompt scaffolding, use Opus.

Technical Requirements for Output: When Parsing Matters More Than Cost

On Opus, output costs $75 per 1M tokens – so output-format savings are highly relevant. But the requirements of the downstream system still take priority:

Scenarios where output must be parsed programmatically:

  • The response goes into a database or structured store – use JSON
  • Another LLM or service consumes the response through an API – use JSON or YAML
  • The response is part of a pipeline (the next step processes the data) – use JSON
  • The response is rendered in the UI as text or a document – use Markdown (lowest token cost)
  • You need compact machine-readable output and already have a TOON SDK – use TOON (only Opus works reliably without prompt help)

The key point: output on Opus costs $75 per 1M – five times more than input. A 65% output reduction (Markdown vs JSON) can matter even more than input savings. But do not trade away parse reliability just to cut cost.

Recommendations for Opus

  • Input: TOON for tabular data (-62%), MD for instructions (-53%)
  • Human-readable output: Markdown (-65% output tokens)
  • Machine-parsed output: JSON – reliable and universal
  • TOON output: works without few-shot – Opus’s unique advantage
  • Do not use JSON on input: it is the most expensive format with no accuracy benefit

Summary Results

Accuracy Across All Models and Formats

| Format | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| JSON | 75.3% | 89.4% | 93.5% |
| YAML | 75.1% | 89.4% | 93.5% |
| Markdown | 69.6% | 89.4% | 93.5% |
| Plain Text | 70.6% | 89.4% | 93.5% |
| TOON | 74.8% | 89.4% | 93.5% |

For Sonnet and Opus, format does not affect accuracy. For Haiku, it matters materially – especially for hierarchies and documentation.

Decision Matrix: Input Format

| Data type | Haiku | Sonnet / Opus |
|---|---|---|
| System prompts / instructions | MD (-29%) | TOON or MD |
| Catalogs, lists | TXT (70.2%) | TOON (-62%) |
| Tasks / roadmap | JSON (71.0%) | TOON (-73%) |
| Business rules | JSON (stable) | TOON (-63%) |
| Few-shot examples | TOON (≈JSON) | TOON (-53%) |
| Hierarchies | YAML (92.9%) | TOON or MD |
| API documentation | JSON/YAML (85.7%) | TXT (-59%) |

Decision Matrix: Output Format

| Output consumer | Recommendation | Haiku | Sonnet | Opus |
|---|---|---|---|---|
| UI / end user | Markdown | native | native | native |
| API / JSON parser | JSON | reliable | reliable | reliable |
| YAML pipeline | YAML | reliable | reliable | reliable |
| TOON SDK | TOON | with few-shot* | with few-shot* | native |
| CSV / spreadsheet | TXT | with template | with template | with template |

*Requires a few-shot example in the prompt


Benchmark Limitations

  • Accuracy was measured only on S+M sizes. L-size includes token counts only. Accuracy may degrade more sharply on larger data.
  • The data is synthetic. Catalogs and tasks were script-generated. Real-world data may be messier (missing fields, Unicode, long descriptions).
  • Automatic scoring covers 4 of 8 cases. Cases 1, 4, and 5 require rubric-based evaluation. The accuracy numbers here cover cases 2, 3, 6, and 7.
  • Sonnet / Opus were tested via subscription (subagents). Output-token counts are estimated, not directly measured. Haiku was tested via API.
  • No A/B test on live traffic. This is a laboratory benchmark. The impact on a production product must be validated separately.

The code and data are open – reproduce it, extend it, challenge it.


What Surprised Me

  1. Opus and Sonnet are completely insensitive to format. I expected a 3–5% gap. I got 0%. For the higher tiers, format is pure cost optimization.

  2. YAML is not as efficient as many assume. The expectation is usually “YAML is more compact than JSON.” In practice, the savings are only 32%. Repeated keys wipe out much of the benefit from removing braces.

  3. TOON works on Claude without special training. Claude may not have seen much TOON in training data, yet all three tiers parse it correctly – essentially on par with JSON.

  4. Opus knows TOON; Sonnet does not. Opus generates valid TOON output without hints. Sonnet interpreted “TOON format” as “cartoon” and produced an irrelevant answer. With a few-shot example, both work correctly.

  5. Markdown is the best output format. The gap in output tokens between JSON and Markdown is 65%. At $75 per 1M on Opus, that is significant. It is also the only format every tier generates natively without extra prompting.

  6. On Haiku, scale matters more than format. Accuracy drops from 80.3% (S) to 67.2% (M) – a 13-point drop. The average difference between formats is 5.7 points. On Sonnet and Opus, scale is much less of an issue.


FAQ

Q: Do these results apply to other models (GPT, Gemini)?

The trends are similar, but the numbers differ. Every model has its own tokenizer. On GPT-5 Nano, YAML shows 62% accuracy on nested data (ImprovingAgents); on Claude Haiku, it reaches 93%. Use these results for Claude, and other benchmarks for other models.

Q: How were tokens counted?

Using client.messages.count_tokens() – the standard Anthropic SDK method and production tokenizer. These are the same numbers used for billing. The tokenizer is the same across all tiers.

Q: Why not test XML?

XML is rarely used in modern LLM workflows. Existing benchmarks (ShShell) suggest that XML is significantly more expensive than Markdown in token terms, with comparable or worse accuracy.

Q: Is TOON a serious format or just hype?

TOON v1.0 was released in November 2025 under MIT, and there are SDKs in 6+ languages. For tabular data, the savings are real – 62% on Claude with JSON-level accuracy. Opus generates TOON output without prompting. Other tiers require a few-shot example.

Q: Does the input format affect the output format?

Partially. If you provide data in YAML, Claude is more likely to structure its answer with indentation. But an explicit instruction such as “Return as a Markdown table” overrides that tendency.

Q: Is it worth converting all prompts away from JSON?

At 100K requests/month on Sonnet, moving from JSON to TOON saves $607/month. On Opus, it saves $3,039/month. For hobby projects with 1K requests, the difference is around $6. Run the math on your own usage.
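
The dollar figures follow directly from the token averages and per-model input prices, so you can plug in your own volume (rounding differs from the article's figures by at most a dollar):

```python
def monthly_savings(tokens_from, tokens_to, price_per_mtok, requests=100_000):
    """Monthly input-cost savings when switching formats at a given volume."""
    saved_tokens = (tokens_from - tokens_to) * requests
    return saved_tokens / 1_000_000 * price_per_mtok

# JSON (3,252 avg input tokens) -> TOON (1,226) at 100K requests/month
print(monthly_savings(3252, 1226, 0.80))  # Haiku:  ~$162
print(monthly_savings(3252, 1226, 3.00))  # Sonnet: ~$608
print(monthly_savings(3252, 1226, 15.0))  # Opus:   ~$3,039
```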

Q: Can you combine formats in one prompt?

Yes – and that is usually the recommended approach. Markdown for instructions + TOON for data + a request for output in the format you need. Claude handles multi-format prompts well.

Q: Where is the benchmark source code?

github.com/webmaster-ramos/yaml-vs-md-benchmark. All 120 data files, 51 questions, ground truth, runner, and scorer are open for reproduction.


Conclusion

Data format in a prompt is not a cosmetic choice. On the Claude API, the gap between JSON and TOON is 62% on input tokens. Markdown saves 65% on output tokens. At 100K requests/month on Opus, that means $3,039 saved on input and even more on output.

But the main finding is not about tokens. Claude Sonnet 4.6 and Opus 4.6 are completely insensitive to format. They produced 100% identical answers on JSON, YAML, Markdown, Plain Text, and TOON. For the higher tiers, format optimization is pure savings with no quality trade-off.

Only Haiku 4.5 is meaningfully format-sensitive – and only there does the choice of format affect accuracy (by up to 36 percentage points). On Haiku, format should be matched to data type: YAML for hierarchies, JSON for tasks with dependencies.

Beyond cost, there are technical requirements: if the output must be parsed programmatically, JSON is more reliable than Markdown. If a human reads the answer, Markdown is cheaper. Opus is the only tier that generates TOON natively; Sonnet and Haiku require a few-shot example.

TL;DR by tier:

| | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| Does format affect accuracy? | Yes, by up to 36 points | No | No |
| Best input (data) | YAML/JSON/TXT by data type | TOON | TOON |
| Best input (instructions) | MD | MD | MD |
| Best output (human-readable) | MD | MD | MD |
| Best output (parsing) | JSON | JSON | JSON |
| TOON output without prompt help | No | No | Yes |
| JSON → TOON savings | $162 / 100K | $607 / 100K | $3,039 / 100K |

Benchmark run in April 2026 on Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5.
120 data files, 8 scenarios, 3 sizes, 5 formats, 3 models.
All code and data: github.com/webmaster-ramos/yaml-vs-md-benchmark
