TOON (Token-Oriented Object Notation), a new data format, is making bold claims. It says it can cut token usage by 30–60% compared to JSON while also helping LLMs read the data more accurately.[1]
However, once you start looking at the benchmarks, a gap appears between the official scores and the results from independent testing.
So, this article lays out both sets of data side by side and looks at what might explain why they don’t match up.
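To make that headline claim concrete, here is a minimal sketch (in Python) comparing a plain JSON dump of a few records with a hand-written TOON-style rendering of the same data. The TOON text follows the tabular syntax shown in the official repository [1], but it is written by hand here purely for illustration; use the official encoder for real conversions.

```python
import json

# Three uniform "employee" records: the flat, tabular shape TOON is designed for.
employees = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
    {"id": 3, "name": "Charlie", "role": "user"},
]

json_text = json.dumps(employees, indent=2)

# Hand-written TOON-style rendering: a header with the array length and field
# names, then one comma-delimited row per record. Illustrative only.
toon_text = (
    "employees[3]{id,name,role}:\n"
    "  1,Alice,admin\n"
    "  2,Bob,user\n"
    "  3,Charlie,user\n"
)

print(len(json_text), len(toon_text))  # rough size comparison in characters
```

The repeated keys and punctuation in the JSON version are exactly what TOON strips out, which is where the token savings come from.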
Understanding the Benchmark Methodology
Before we dive into the results, it’s important to know what these tests actually measure.
They check how well an LLM can understand and pull information from data. Specifically, they measure how accurately a model can answer questions about data when it is presented in different formats.[1]
So, think of it as a reading test, not a test of whether the model can write in that format.
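In practice, a single benchmark question boils down to something like the sketch below: paste the formatted payload into the prompt, ask a question with a known answer, and grade the reply. The helper is hypothetical and assumes an OpenAI-compatible Python client; the official harness in [1] is more elaborate.

```python
def ask_retrieval_question(client, model: str, formatted_data: str, question: str) -> str:
    """Ask one data-retrieval question about a payload rendered in some format."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the data provided."},
            {"role": "user", "content": f"Data:\n{formatted_data}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```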
Official Repository Benchmarks
Headline Results
The official TOON repository presents compelling performance metrics based on 201 data retrieval questions across 4 models:[1]
Key Finding: TOON achieves 68.7% accuracy versus JSON’s 65.7% while using 39.5% fewer tokens — a seemingly definitive win for TOON on both efficiency and accuracy metrics.[1]
Dataset Composition
The official tests used five different types of datasets to see how the models performed:[1]
- Tabular: 100 employee records, which were all simple and had the same fields.
- Nested: 50 e-commerce orders, which were complex and had data tucked inside other data, like a customer object with a list of items (see the sketch after this list).
- Analytics: 60 days of time-series data (like dates and numbers).
- GitHub: Real-world data from 100 popular GitHub repositories.
- Event Logs: 75 logs, about half of which were simple and the other half had more complex error details.
Across all these datasets, the models were asked a total of 201 questions to retrieve information.
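Since the tabular-versus-nested distinction comes up repeatedly below, here is a hypothetical example (field names invented for illustration) of what the two shapes look like as plain data structures:

```python
# Flat, uniform record: every employee has the same scalar fields,
# so a whole list of them collapses neatly into a single table.
employee = {"id": 101, "name": "Dana", "department": "Sales", "salary": 72000}

# Nested record: a customer object and a variable-length item list are tucked
# inside each order, which is much harder to flatten into one table.
order = {
    "orderId": "A-1001",
    "customer": {"id": 7, "name": "Acme Corp"},
    "items": [
        {"sku": "X-1", "qty": 2, "price": 9.99},
        {"sku": "X-2", "qty": 1, "price": 24.50},
    ],
}
```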
Model-Specific Performance
A separate benchmark study using the TOON implementation tested individual model performance, showing varying results across different LLMs:[1]
GPT-5-nano
- TOON: 88.6% accuracy (178/201) — Best overall performer
- JSON compact: 88.1% accuracy (177/201)
- CSV: 88.0% accuracy (88/100)
- YAML: 84.6% accuracy (170/201)
- XML: 81.6% accuracy (164/201)
- JSON: 80.1% accuracy (161/201)
Claude Haiku 4.5
- YAML: 52.2% accuracy (105/201) — Best performer for this model
- TOON: 50.7% accuracy (102/201)
- JSON: 50.2% accuracy (101/201)
- JSON compact: 49.8% accuracy (100/201)
- XML: 49.3% accuracy (99/201)
- CSV: 39.0% accuracy (39/100)
Gemini 2.5 Flash
- XML: 86.1% accuracy (173/201) — Best performer for this model
- TOON: 84.1% accuracy (169/201) — Ranked 2nd
- CSV: 82.0% accuracy (82/100)
- JSON compact: 81.1% accuracy (163/201)
- YAML: 81.1% accuracy (163/201)
- JSON: 81.1% accuracy (163/201)
Grok-4-fast-non-reasoning
- TOON: 51.2% accuracy (103/201) — Tied for 1st
- JSON: 51.2% accuracy (103/201) — Tied for 1st
- XML: 50.2% accuracy (101/201)
- JSON compact: 49.8% accuracy (100/201)
- YAML: 48.8% accuracy (98/201)
- CSV: 40.0% accuracy (40/100)
Note: These results are from a separate 201-question benchmark and may use different datasets than the main official repository’s 201-question benchmark.[1]
Independent Third-Party Benchmarks
Test 1: Tabular Data with GPT-4.1-nano
An independent evaluation from improvingagents.com presents dramatically different findings. Testing was conducted using GPT-4.1-nano across 12 formats with statistical confidence intervals:[2]
Key Finding: In this evaluation, TOON ranks 9th out of 12 formats with 47.5% accuracy, performing worse than JSON (52.3%) and significantly behind Markdown-KV (60.7%).[2]
The Accuracy-Token Efficiency Trade-off
This independent test shows that while TOON is great at saving tokens (using only 21,518 compared to 66,396 for JSON), this efficiency might come at a price.[2]
It seems that being too compact might make the data harder for the LLM to understand.
Here’s the interesting part: the format that got the best accuracy score (Markdown-KV at 60.7%) was actually a lot “heavier,” using more than double the tokens that TOON did.
This suggests that using more words and a clearer structure (being more “verbose”) might actually give the LLM the context it needs to understand the data better.
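You can check how “heavy” a given rendering is on your own data by counting tokens directly. The sketch below uses the tiktoken library with the o200k_base encoding and an invented Markdown-KV-style rendering; the exact layout used in the independent test [2] may differ.

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o / GPT-5 tokenizer family

records = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]

json_text = json.dumps(records, indent=2)

# A Markdown-KV-style rendering: one "key: value" line per field, with a
# heading per record. Invented here to mirror the general idea of that format.
md_kv_text = "\n".join(
    f"## Record {i}\n" + "\n".join(f"- {k}: {v}" for k, v in rec.items())
    for i, rec in enumerate(records, start=1)
)

for label, text in [("JSON", json_text), ("Markdown-KV", md_kv_text)]:
    print(label, len(enc.encode(text)), "tokens")
```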
Test 2: Nested Data with GPT-5-nano
The same independent researchers conducted a second evaluation using GPT-5-nano to test TOON’s performance with nested data structures:[3]
Key Finding: In this nested data test, TOON ranked last with 43.1% accuracy, performing worse than JSON (50.3%), Markdown (54.3%), and significantly behind YAML (62.1%).[3]
This second test reveals a critical limitation: TOON’s performance degrades substantially with nested data structures, despite still being relatively token-efficient.[3]
Analyzing the Differences
The conflicting results present a puzzle: how can TOON score 68.7% in official benchmarks but only 47.5% in independent testing? Several factors warrant consideration:
1. Dataset Characteristics
The official tests and the independent tests were not testing the same things.
The Official Tests:
The creators of TOON set up their benchmarks using five different types of data. This included “tabular” (flat) employee records, which they openly state is the ideal use case for TOON.[1]
They basically created different test “tracks” (like flat data vs. mixed-structure data) that were heavily weighted toward scenarios where TOON was designed to win.
The Independent Tests:
In their first test, the independent researchers focused on tabular (flat) data and used one specific model, GPT-4.1-nano (a small, fast model); their second test paired nested data with GPT-5-nano.
Their write-up suggests that TOON “could be a format to consider if you’re trying to reduce token usage, especially if the limitations of the CSV format make it hard to represent aspects of your data.”[2]
This implies their tests might have used data or asked questions in a way that didn’t play to TOON’s biggest strengths, giving a different perspective on its performance.
2. Question Complexity
The distribution of question types matters significantly:
- Simple field retrieval favors compact formats like TOON
- Complex aggregation may benefit from more explicit structure
- Nested queries could suffer in TOON’s flattened representation
The independent study doesn’t specify question distribution details, making direct comparison difficult.
3. Model Selection
This represents a critical methodological difference:
Official benchmarks: Tested across 4 different LLMs, providing an aggregate view of TOON’s performance.[1]
Independent benchmarks: Conducted two separate tests — one with GPT-4.1-nano on tabular data and another with GPT-5-nano on nested data.[2][3]
The model-specific results from separate testing show dramatic variance:
- GPT-5-nano achieved 88.6% accuracy with TOON[1]
- Claude Haiku 4.5 scored only 50.7% with TOON[1]
- GPT-4.1-nano in the independent study: 47.5% with TOON[2]
Interestingly, the independent tests showed that even GPT-5-nano achieved only 43.1% accuracy with TOON on nested data[3], despite that same model reaching 88.6% with TOON in the official per-model results.[1] This dramatic variance suggests the data structure type (tabular vs. nested) may be more significant than model selection alone.
4. Token Counting Methodology
Official benchmarks used o200k_base encoding (GPT-4o/GPT-5 tokenizer) via gpt-tokenizer.[1]
Independent benchmarks don’t specify their tokenization method.[2]
Different tokenizers can produce varying token counts for identical text, particularly with formatting characters and whitespace. This could affect both absolute counts and relative efficiency rankings.
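A quick way to see how much the tokenizer choice moves the numbers is to encode the same snippet with two encodings. The sketch below uses Python’s tiktoken library as a stand-in for the gpt-tokenizer JavaScript package the official benchmark uses; exact counts will vary with your text.

```python
import tiktoken  # pip install tiktoken

sample = "employees[2]{id,name,role}:\n  1,Alice,admin\n  2,Bob,user\n"

# o200k_base is the GPT-4o / GPT-5 encoding used by the official benchmark;
# cl100k_base is the older GPT-4 / GPT-3.5 encoding, shown here for contrast.
for name in ("o200k_base", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(sample)), "tokens")
```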
Model-Specific Considerations
The official benchmarks reveal that TOON’s performance varies dramatically by model:[1]
- Excellent: GPT-5-nano (88.6% on official benchmarks)
- Good: Gemini 2.5 Flash (84.1%)
- Moderate: Claude Haiku 4.5 (50.7%), Grok-4-fast (51.2%)
However, the independent nested data test showed GPT-5-nano achieving only 43.1% accuracy with TOON.[3] This suggests the data structure type matters more than model selection alone.
This variance is critical for practical applications. TOON’s performance depends heavily on both the model you’re using AND the type of data you’re working with. Even top-performing models like GPT-5-nano may struggle with TOON when dealing with nested structures.
Conclusion: An Evidence-Based Perspective
The available benchmarks paint an incomplete picture. TOON demonstrably reduces token usage by 30–60% across multiple studies — this finding appears robust.[1][2] However, accuracy results span from strong performance (68.7%) to below-average (47.5%), depending on the evaluation methodology.
This variance isn’t necessarily damning. It reflects the reality that no data format is optimal for all scenarios. TOON appears well-suited for:
- Uniform tabular data
- Simple field retrieval queries
- GPT-5 and similar models
- Applications where token efficiency is paramount
It may under-perform with:
- Complex nested structures (where it ranked last in independent testing)
- Deep reasoning queries
- Claude Haiku and similar models
- Applications where accuracy cannot be compromised
So, what’s the takeaway?
Here’s my recommendation: Treat TOON like a specialized tool, not a one-size-fits-all replacement for JSON.
If you’re thinking about using it for a real project, you have to test it yourself — with your data and your models — before you commit. The token savings look great, but you need to prove they don’t hurt the accuracy of the answers you need.
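If it helps, here is the bare skeleton of such a test. The helper functions (render, count_tokens, ask_model) are hypothetical placeholders you would implement for your own stack, and the grading is a naive substring check; this is a sketch of the approach, not the official benchmark harness.

```python
def evaluate_format(fmt_name, render, count_tokens, ask_model, dataset, questions):
    """Score one serialization format on your own data and questions.

    render(dataset) -> str          : serialize the dataset in this format
    count_tokens(text) -> int       : prompt-side token cost
    ask_model(text, question) -> str: your model call
    questions                       : list of (question, expected_answer) pairs
    """
    formatted = render(dataset)
    tokens = count_tokens(formatted)
    correct = 0
    for question, expected in questions:
        answer = ask_model(formatted, question)
        if expected.lower() in answer.lower():  # crude grading; adapt as needed
            correct += 1
    return {"format": fmt_name, "tokens": tokens, "accuracy": correct / len(questions)}
```

Run this over each candidate format (JSON, TOON, YAML, and so on) with the same dataset and questions, and you get the token-versus-accuracy trade-off for your workload rather than someone else’s.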
As TOON matures and more independent researchers test it, we’ll all get a better idea of where it really shines and where older formats are still better.
Until then, let your own results guide you, not the hype.
Saving tokens is a big deal, and optimizations are important. But good optimization means you have to be clear-eyed about the pros and cons, not just accept the headline numbers.
TOON is an interesting experiment. It just needs more scrutiny and ongoing testing as the technology and our use cases for it continue to evolve.
References
[1]: Johann Schopplich, “TOON — Token-Oriented Object Notation”, GitHub Repository, https://github.com/johannschopplich/toon
[2]: “Is TOON Good for Table Data?”, Improving Agents, https://www.improvingagents.com/blog/is-toon-good-for-table-data
[3]: “TOON Benchmarks”, Improving Agents, https://www.improvingagents.com/blog/toon-benchmarks