TOON (Token-Oriented Object Notation), a new data format, is making bold claims. It says it can cut token usage by 30–60% compared to JSON while also helping LLMs read the data more accurately.[1]
However, once you start looking at the benchmarks, a gap appears between the official scores and the results from independent testing.
So, this article lays out both sets of data side by side and looks at what might explain why they don’t match up.
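To make that headline claim concrete, here is a minimal sketch (in Python) comparing a plain JSON dump of a few records with a hand-written TOON-style rendering of the same data. The TOON text follows the tabular syntax shown in the official repository [1], but it is written by hand here purely for illustration; use the official encoder for real conversions.

```python
import json

# Three uniform "employee" records: the flat, tabular shape TOON is designed for.
employees = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
    {"id": 3, "name": "Charlie", "role": "user"},
]

json_text = json.dumps(employees, indent=2)

# Hand-written TOON-style rendering: a header with the array length and field
# names, then one comma-delimited row per record. Illustrative only.
toon_text = (
    "employees[3]{id,name,role}:\n"
    "  1,Alice,admin\n"
    "  2,Bob,user\n"
    "  3,Charlie,user\n"
)

print(len(json_text), len(toon_text))  # rough size comparison in characters
```

The repeated keys and punctuation in the JSON version are exactly what TOON strips out, which is where the token savings come from.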
Understanding the Benchmark Methodology
Before we dive into the results, it’s important to know what these tests actually measure.
They check how well an LLM can understand and pull information from data. Specifically, they measure how accurately a model can answer questions about data when it is presented in different formats.[1]
So, think of it as a reading test, not a test of whether the model can write in that format.
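In practice, a single benchmark question boils down to something like the sketch below: paste the formatted payload into the prompt, ask a question with a known answer, and grade the reply. The helper is hypothetical and assumes an OpenAI-compatible Python client; the official harness in [1] is more elaborate.

```python
def ask_retrieval_question(client, model: str, formatted_data: str, question: str) -> str:
    """Ask one data-retrieval question about a payload rendered in some format."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the data provided."},
            {"role": "user", "content": f"Data:\n{formatted_data}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```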
Official Repository Benchmarks
Headline Results
The official TOON repository presents compelling performance metrics based on 201 data retrieval questions across 4 models:[1]
Key Finding: TOON achieves 68.7% accuracy versus JSON’s 65.7% while using 39.5% fewer tokens — a seemingly definitive win for TOON on both efficiency and accuracy metrics.[1]
Dataset Composition
The official tests used five different types of datasets to see how the models performed:[1]
- Tabular: 100 employee records, which were all simple and had the same fields.
- Nested: 50 e-commerce orders, which were complex and had data tucked inside other data, like a customer object with a list of items (see the sketch after this list).
- Analytics: 60 days of time-series data (like dates and numbers).
- GitHub: Real-world data from 100 popular GitHub repositories.
- Event Logs: 75 logs, about half of which were simple and the other half had more complex error details.
Across all these datasets, the models were asked a total of 201 questions to retrieve information.
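Since the tabular-versus-nested distinction comes up repeatedly below, here is a hypothetical example (field names invented for illustration) of what the two shapes look like as plain data structures:

```python
# Flat, uniform record: every employee has the same scalar fields,
# so a whole list of them collapses neatly into a single table.
employee = {"id": 101, "name": "Dana", "department": "Sales", "salary": 72000}

# Nested record: a customer object and a variable-length item list are tucked
# inside each order, which is much harder to flatten into one table.
order = {
    "orderId": "A-1001",
    "customer": {"id": 7, "name": "Acme Corp"},
    "items": [
        {"sku": "X-1", "qty": 2, "price": 9.99},
        {"sku": "X-2", "qty": 1, "price": 24.50},
    ],
}
```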
Model-Specific Performance
A separate benchmark study using the TOON implementation tested individual model performance, showing varying results across different LLMs:[1]
GPT-5-nano
- TOON: 88.6% accuracy (178/201) — Best overall performer
- JSON compact: 88.1% accuracy (177/201)
- CSV: 88.0% accuracy (88/100)
- YAML: 84.6% accuracy (170/201)
- XML: 81.6% accuracy (164/201)
- JSON: 80.1% accuracy (161/201)
Claude Haiku 4.5
- YAML: 52.2% accuracy (105/201) — Best performer for this model
- TOON: 50.7% accuracy (102/201)
- JSON: 50.2% accuracy (101/201)
- JSON compact: 49.8% accuracy (100/201)
- XML: 49.3% accuracy (99/201)
- CSV: 39.0% accuracy (39/100)
Gemini 2.5 Flash
- XML: 86.1% accuracy (173/201) — Best performer for this model
- TOON: 84.1% accuracy (169/201) — Ranked 2nd
- CSV: 82.0% accuracy (82/100)
- JSON compact: 81.1% accuracy (163/201)
- YAML: 81.1% accuracy (163/201)
- JSON: 81.1% accuracy (163/201)
Grok-4-fast-non-reasoning
- TOON: 51.2% accuracy (103/201) — Tied for 1st
- JSON: 51.2% accuracy (103/201) — Tied for 1st
- XML: 50.2% accuracy (101/201)
- JSON compact: 49.8% accuracy (100/201)
- YAML: 48.8% accuracy (98/201)
- CSV: 40.0% accuracy (40/100)
Note: These results are from a separate 201-question benchmark and may use different datasets than the main official repository’s 201-question benchmark.[1]
Independent Third-Party Benchmarks
Test 1: Tabular Data with GPT-4.1-nano
An independent evaluation from improvingagents.com presents dramatically different findings. Testing was conducted using GPT-4.1-nano across 12 formats with statistical confidence intervals:[2]
Key Finding: In this evaluation, TOON ranks 9th out of 12 formats with 47.5% accuracy, performing worse than JSON (52.3%) and significantly behind Markdown-KV (60.7%).[2]
The Accuracy-Token Efficiency Trade-off
This independent test shows that while TOON is great at saving tokens (using only 21,518 compared to 66,396 for JSON), this efficiency might come at a price.[2]
It seems that being too compact might make the data harder for the LLM to understand.
Here’s the interesting part: the format that got the best accuracy score (Markdown-KV at 60.7%) was actually a lot “heavier,” using more than double the tokens that TOON did.
This suggests that using more words and a clearer structure (being more “verbose”) might actually give the LLM the context it needs to understand the data better.
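You can check how “heavy” a given rendering is on your own data by counting tokens directly. The sketch below uses the tiktoken library with the o200k_base encoding and an invented Markdown-KV-style rendering; the exact layout used in the independent test [2] may differ.

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o / GPT-5 tokenizer family

records = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]

json_text = json.dumps(records, indent=2)

# A Markdown-KV-style rendering: one "key: value" line per field, with a
# heading per record. Invented here to mirror the general idea of that format.
md_kv_text = "\n".join(
    f"## Record {i}\n" + "\n".join(f"- {k}: {v}" for k, v in rec.items())
    for i, rec in enumerate(records, start=1)
)

for label, text in [("JSON", json_text), ("Markdown-KV", md_kv_text)]:
    print(label, len(enc.encode(text)), "tokens")
```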
Test 2: Nested Data with GPT-5-nano
The same independent researchers conducted a second evaluation using GPT-5-nano to test TOON’s performance with nested data structures:[3]
Key Finding: In this nested data test, TOON ranked last with 43.1% accuracy, performing worse than JSON (50.3%), Markdown (54.3%), and significantly behind YAML (62.1%).[3]
This second test reveals a critical limitation: TOON’s performance degrades substantially with nested data structures, despite still being relatively token-efficient.[3]
Analyzing the Differences
The conflicting results present a puzzle: how can TOON score 68.7% in official benchmarks but only 47.5% in independent testing? Several factors warrant consideration:
1. Dataset Characteristics
The official tests and the independent tests were not testing the same things.
The Official Tests:
The creators of TOON set up their benchmarks using five different types of data. This included “tabular” (flat) employee records, which they openly state is the ideal use case for TOON.[1]
They basically created different test “tracks” (like flat data vs. mixed-structure data) that were heavily weighted toward scenarios where TOON was designed to win.
The Independent Tests:
In their first test, the independent researchers focused on tabular (flat) data and used one specific model, GPT-4.1-nano (a small, fast model); their second test paired nested data with GPT-5-nano.
Their write-up suggests that TOON “could be a format to consider if you’re trying to reduce token usage, especially if the limitations of the CSV format make it hard to represent aspects of your data.”[2]
This implies their tests might have used data or asked questions in a way that didn’t play to TOON’s biggest strengths, giving a different perspective on its performance.
2. Question Complexity
The distribution of question types matters significantly:
- Simple field retrieval favors compact formats like TOON
- Complex aggregation may benefit from more explicit structure
- Nested queries could suffer in TOON’s flattened representation
The independent study doesn’t specify question distribution details, making direct comparison difficult.
3. Model Selection
This represents a critical methodological difference:
Official benchmarks: Tested across 4 different LLMs, providing an aggregate view of TOON’s performance.[1]
Independent benchmarks: Conducted two separate tests — one with GPT-4.1-nano on tabular data and another with GPT-5-nano on nested data.[2][3]
The model-specific results from separate testing show dramatic variance:
- GPT-5-nano achieved 88.6% accuracy with TOON[1]
- Claude Haiku 4.5 scored only 50.7% with TOON[1]
- GPT-4.1-nano in the independent study: 47.5% with TOON[2]
Interestingly, the independent tests showed that even GPT-5-nano achieved only 43.1% accuracy with TOON on nested data[3], despite that same model reaching 88.6% with TOON in the official per-model results.[1] This dramatic variance suggests the data structure type (tabular vs. nested) may be more significant than model selection alone.
4. Token Counting Methodology
Official benchmarks used o200k_base encoding (GPT-4o/GPT-5 tokenizer) via gpt-tokenizer.[1]
Independent benchmarks don’t specify their tokenization method.[2]
Different tokenizers can produce varying token counts for identical text, particularly with formatting characters and whitespace. This could affect both absolute counts and relative efficiency rankings.
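A quick way to see how much the tokenizer choice moves the numbers is to encode the same snippet with two encodings. The sketch below uses Python’s tiktoken library as a stand-in for the gpt-tokenizer JavaScript package the official benchmark uses; exact counts will vary with your text.

```python
import tiktoken  # pip install tiktoken

sample = "employees[2]{id,name,role}:\n  1,Alice,admin\n  2,Bob,user\n"

# o200k_base is the GPT-4o / GPT-5 encoding used by the official benchmark;
# cl100k_base is the older GPT-4 / GPT-3.5 encoding, shown here for contrast.
for name in ("o200k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(sample)), "tokens")
```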
Model-Specific Considerations
The official benchmarks reveal that TOON’s performance varies dramatically by model:[1]
- Excellent: GPT-5-nano (88.6% on official benchmarks)
- Good: Gemini 2.5 Flash (84.1%)
- Moderate: Claude Haiku 4.5 (50.7%), Grok-4-fast (51.2%)
However, the independent nested data test showed GPT-5-nano achieving only 43.1% accuracy with TOON.[3] This suggests the data structure type matters more than model selection alone.
This variance is critical for practical applications. TOON’s performance depends heavily on both the model you’re using AND the type of data you’re working with. Even top-performing models like GPT-5-nano may struggle with TOON when dealing with nested structures.
Conclusion: An Evidence-Based Perspective
The available benchmarks paint an incomplete picture. TOON demonstrably reduces token usage by 30–60% across multiple studies — this finding appears robust.[1][2] However, accuracy results span from strong performance (68.7%) to below-average (47.5%), depending on the evaluation methodology.
This variance isn’t necessarily damning. It reflects the reality that no data format is optimal for all scenarios. TOON appears well-suited for:
- Uniform tabular data
- Simple field retrieval queries
- GPT-5 and similar models
- Applications where token efficiency is paramount
It may under-perform with:
- Complex nested structures (where it ranked last in independent testing)
- Deep reasoning queries
- Claude Haiku and similar models
- Applications where accuracy cannot be compromised
So, what’s the takeaway?
Here’s my recommendation: Treat TOON like a specialized tool, not a one-size-fits-all replacement for JSON.
If you’re thinking about using it for a real project, you have to test it yourself — with your data and your models — before you commit. The token savings look great, but you need to prove they don’t hurt the accuracy of the answers you need.
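If it helps, here is the bare skeleton of such a test. The helper functions (render, count_tokens, ask_model) are hypothetical placeholders you would implement for your own stack, and the grading is a naive substring check; this is a sketch of the approach, not the official benchmark harness.

```python
def evaluate_format(fmt_name, render, count_tokens, ask_model, dataset, questions):
    """Score one serialization format on your own data and questions.

    render(dataset) -> str          : serialize the dataset in this format
    count_tokens(text) -> int       : prompt-side token cost
    ask_model(text, question) -> str: your model call
    questions                       : list of (question, expected_answer) pairs
    """
    formatted = render(dataset)
    tokens = count_tokens(formatted)
    correct = 0
    for question, expected in questions:
        answer = ask_model(formatted, question)
        if expected.lower() in answer.lower():  # crude grading; adapt as needed
            correct += 1
    return {"format": fmt_name, "tokens": tokens, "accuracy": correct / len(questions)}
```

Run this over each candidate format (JSON, TOON, YAML, and so on) with the same dataset and questions, and you get the token-versus-accuracy trade-off for your workload rather than someone else’s.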
As TOON matures and more independent researchers test it, we’ll all get a better idea of where it really shines and where older formats are still better.
Until then, let your own results guide you, not the hype.
Saving tokens is a big deal, and optimizations are important. But good optimization means you have to be clear-eyed about the pros and cons, not just accept the headline numbers.
TOON is an interesting experiment. It just needs more scrutiny and ongoing testing as the technology and our use cases for it continue to evolve.
References
[1]: Johann Schopplich, “TOON — Token-Oriented Object Notation”, GitHub Repository, https://github.com/johannschopplich/toon
[2]: “Is TOON Good for Table Data?”, Improving Agents, https://www.improvingagents.com/blog/is-toon-good-for-table-data
[3]: “TOON Benchmarks”, Improving Agents, https://www.improvingagents.com/blog/toon-benchmarks