Romina Elena Mendez Escobar

TOON vs JSON for LLM Prompts: Can We Reduce Token Usage Without Losing Response Quality?

Introduction

Over the past months, I came across several articles claiming that TOON can significantly reduce token usage in LLM prompts compared to traditional JSON. Most of these examples, however, relied on small or artificial datasets.

That raised a few questions for me:

  • Does TOON still provide benefits with real-world API responses?
  • How much does it actually reduce tokens?
  • And more importantly: does changing the format affect how an LLM interprets the data or the quality of the response?

Answering these questions isn’t simple, and the results can vary depending on the dataset, the structure of the data, and even the LLM itself. It’s also not just a matter of counting tokens: different formats may influence how the model understands and processes the information.

In this article, I aim to run a practical benchmark to explore whether TOON could be useful in production pipelines, in what contexts it performs best, and whether it works well across different types of JSON.

This article walks through the experiment, the results, and the conclusions.


What Is TOON (and How Is It Different from JSON)?

TOON (Terse Object-Oriented Notation) is a data serialization format designed specifically for LLM prompts. The goal is simple: reduce syntactic overhead while remaining readable for both humans and machines.
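As a quick illustration, here is the same small dataset in both formats. The TOON output in the comment is an approximation for readability; the exact syntax produced by the toon-format library may differ slightly.

import json

rows = [
    {"article": "Python_(programming_language)", "views": 120543},
    {"article": "JSON", "views": 98211},
]

# JSON repeats every key, brace, and quote for each element:
print(json.dumps(rows))
# [{"article": "Python_(programming_language)", "views": 120543}, {"article": "JSON", "views": 98211}]

# TOON declares the fields once and then lists the values as rows,
# roughly like a CSV with a small header (illustrative):
# [2]{article,views}:
#   Python_(programming_language),120543
#   JSON,98211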


The Experiment

This experiment evaluates whether alternative data serialization formats can reduce token usage in LLM prompts without degrading response quality.

The experiment follows four main stages:

  1. Dataset Fetching: Data is retrieved from public APIs and prepared for downstream processing.
  2. Token Benchmarking: Each dataset is encoded in JSON and TOON, and token counts are computed using a tokenizer to measure size differences across formats.
  3. LLM Interaction: The serialized data is sent to an LLM via Amazon Bedrock to generate responses and embeddings under deterministic settings.
  4. Semantic Evaluation: Outputs generated from JSON and TOON prompts are compared using semantic (cosine similarity) and lexical (ROUGE, BLEU) metrics to assess equivalence.

The goal is not to optimize prompt content, but to isolate the impact of serialization format on token efficiency and response consistency.


Datasets

In this experiment, I wanted to test TOON with realistic, publicly available data, rather than small, manually created datasets. Using real API responses allows us to see how token savings and LLM behavior hold up in practical scenarios.
I selected two public APIs with very different characteristics:

  1. GitHub Events API: Returns a stream of recent public events on GitHub, such as pushes, pull requests, issues, and comments.

    • 🔗 URL: https://api.github.com/events
    • 🧩 Data structure: Deeply nested, heterogeneous objects with multiple levels of dictionaries and arrays.
    • 💡 Why this matters: Represents the kind of complex operational API data you might send to an LLM in real projects.
  2. Wikipedia Page Views API: Returns the top-viewed articles on English Wikipedia for a given day.

    • 🔗 URL: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2024/01/01
    • 🧩 Data structure: Flat, repetitive lists of articles, each with numeric metrics (title, views, category).
    • 💡 Why this matters: Ideal for testing TOON’s efficiency with flat, repetitive data, where token savings are expected to be highest.

Using these two APIs allows us to evaluate TOON in both complex nested and flat list scenarios, giving a more comprehensive view of its performance in real-world LLM prompts.

Fetching the Data

To extract data from these APIs, we created the following utility class:

import requests
from typing import Dict, List


class DatasetFetcher:
    """Fetch datasets from different sources"""

    @staticmethod
    def fetch_github_events(limit: int = 30) -> List[Dict]:
        """Fetch recent GitHub events"""
        url = "https://api.github.com/events"
        response = requests.get(url)
        response.raise_for_status()
        return response.json()[:limit]

    @staticmethod
    def fetch_wikipedia_pages(limit: int = 30) -> List[Dict]:
        """Fetch popular Wikipedia pages"""
        headers = {
            "User-Agent": "TOON-Benchmark/1.0 (Research)"
        }
        url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2024/01/01"
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        data = response.json()
        articles = data["items"][0]["articles"][:limit]
        return articles

This class allows you to quickly fetch sample datasets for testing token efficiency with TOON and JSON formats.
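A minimal usage sketch (assuming network access and that both APIs are reachable):

github_events = DatasetFetcher.fetch_github_events(limit=30)
wiki_articles = DatasetFetcher.fetch_wikipedia_pages(limit=30)

print(len(github_events), "GitHub events fetched")
print(wiki_articles[0])  # one article entry with fields such as "article" and "views"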


Part 1: Token Reduction

🏗️ Methodology

To measure token usage, I used tiktoken, the same tokenizer employed by many OpenAI-compatible models. This allows us to estimate how many tokens are consumed by the prompt payload itself, independent of the model’s output.
For TOON generation, I used the toon-format library, which converts Python objects into TOON while preserving structure and ordering.
The following classes implement token counting and incremental benchmarking using these libraries:

import tiktoken


class TokenCounter:
    """Count tokens using tiktoken"""

    def __init__(self, model: str = "gpt-4"):
        self.encoder = tiktoken.encoding_for_model(model)

    def count(self, text: str) -> int:
        """Count tokens in a text string"""
        return len(self.encoder.encode(text))

This class allows you to quickly count tokens in any string, whether it’s JSON, TOON, or plain text.
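For example, a quick sanity check on a small JSON snippet:

counter = TokenCounter(model="gpt-4")

sample = '{"article": "JSON", "views": 98211}'
print(counter.count(sample))  # number of tokens this payload would consume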

import json
from typing import Dict, List

import pandas as pd

# toon_encode is provided by the toon-format library mentioned above;
# the exact import path may vary between package versions.
# BenchmarkConfig is a small configuration object defined in the repository.


class TokenBenchmark:
    """Benchmark token reduction: JSON vs TOON"""

    def __init__(self, config: BenchmarkConfig):
        self.config = config
        self.counter = TokenCounter(config.model)

    def incremental_benchmark(self, data_list: List[Dict], dataset_name: str) -> pd.DataFrame:
        """
        Perform incremental benchmark comparing JSON vs TOON

        Args:
            data_list: List of objects to analyze
            dataset_name: Name of dataset for identification

        Returns:
            DataFrame with benchmark results
        """
        results = []
        accum = []

        for idx, item in enumerate(data_list, start=1):
            accum.append(item)

            # Encode in both formats
            json_prompt = json.dumps(accum, ensure_ascii=False)
            toon_prompt = toon_encode(accum)

            # Count tokens
            json_tokens = self.counter.count(json_prompt)
            toon_tokens = self.counter.count(toon_prompt)

            # Calculate reduction
            saved = json_tokens - toon_tokens
            reduction_pct = (saved / json_tokens) * 100 if json_tokens else 0

            results.append({
                "num_items": idx,
                "JSON_tokens": json_tokens,
                "TOON_tokens": toon_tokens,
                "tokens_saved": saved,
                "reduction_pct": round(reduction_pct, 2),
                "dataset": dataset_name
            })

        return pd.DataFrame(results)

These classes allow us to incrementally benchmark token usage, providing a detailed view of how much TOON reduces tokens compared to JSON as items accumulate in a prompt.
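The sketch below shows how the two benchmarks could be run end to end; BenchmarkConfig is assumed here to be the simple configuration object from the repository that exposes the model name.

config = BenchmarkConfig(model="gpt-4")
benchmark = TokenBenchmark(config)

github_df = benchmark.incremental_benchmark(
    DatasetFetcher.fetch_github_events(limit=30), "github_events"
)
wiki_df = benchmark.incremental_benchmark(
    DatasetFetcher.fetch_wikipedia_pages(limit=30), "wikipedia_pages"
)

# Aggregate the reduction percentages per dataset
summary = (
    pd.concat([github_df, wiki_df])
    .groupby("dataset")["reduction_pct"]
    .agg(["mean", "std", "min", "max"])
)
print(summary)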


🧪 Results: Token Reduction Metrics

In the code available in the repository, you can see the classes used to compute these results.

What stands out, however, is that token reduction is not uniform across datasets.

Token reduction (%) per dataset:

dataset            mean    std    min     max
github_events       2.77   0.26    2.60    4.02
wikipedia_pages     42.61   6.66   13.64   46.70
  • GitHub Events (complex, nested data) - Average token reduction: ~3%
  • Wikipedia Pages (flat, repetitive data) - Average token reduction: ~43%

💡 Why the difference?

  • For GitHub Events, the reduction is only ~3%, which means that using TOON instead of JSON does not significantly reduce token usage. The reason is that deep nesting and heterogeneous keys limit how much syntactic overhead can be removed.
  • For Wikipedia Pages, the reduction is ~43% because flat, repetitive lists benefit greatly from removing braces, commas, and repeated field names.

Part 2: Does Response Quality Stay the Same?

The second experiment focuses on response quality. The goal is to verify whether the same prompt, with the data encoded in JSON versus TOON, produces equivalent outputs from the LLM.

For this experiment, I used the Wikipedia dataset, since it showed the highest token reduction (~43%). This makes it an ideal candidate to evaluate whether aggressive token savings have any negative impact on output quality.

To compare the responses, I generated outputs using both formats and evaluated them using several text similarity metrics.

🧪 Results: Evaluation Metrics

To assess output quality, I used the following metrics, each capturing a different aspect of similarity:

  • Cosine similarity between response embeddings, to measure semantic equivalence.
  • ROUGE-1, ROUGE-2, and ROUGE-L, to measure word- and sequence-level overlap.
  • BLEU, to measure exact n-gram and word-order agreement.


LLM and Embeddings Setup (AWS Bedrock)

All responses and embeddings were generated using AWS Bedrock, Amazon’s fully managed service for accessing foundation models.
The following models were used:

  • ⚡ Amazon Nova Lite (amazon.nova-lite-v1:0): A lightweight, cost-efficient LLM optimized for fast inference. In this experiment, it was used for prompt completion and response generation.
  • ⚡ Amazon Titan Embeddings (amazon.titan-embed-text-v2:0): A text embedding model that converts text into high-dimensional vectors. It was used to generate vector representations of the responses for semantic similarity comparison.

Bedrock Client Implementation

The following class encapsulates interaction with AWS Bedrock for both prompt generation and embedding extraction.

invoke_prompt

This method sends a prompt to the LLM and returns the generated response.
It accepts the following parameters:

  • 💬 prompt: The base instruction or question provided to the model.
  • 📄 dataset: The data to analyze, encoded either in JSON or TOON, which is appended to the prompt.
  • 🌡️ temperature: Controls the randomness of the model’s output.

🌡️ Why temperature = 0?

I fixed the temperature at 0 for three reasons:

  • It reduces randomness in model outputs
  • It makes responses deterministic across multiple runs
  • It ensures that any differences in the outputs are due to the input format (JSON vs TOON), not sampling variability

Without fixing the temperature, it would be impossible to reliably attribute differences in response quality to the serialization format alone.

get_embeddings

This method generates vector embeddings for a given text using the embedding model.
The resulting vectors are later used to compute cosine similarity, allowing us to measure semantic equivalence between responses generated from JSON and TOON inputs.

Overall, these parameters allow us to control model behavior and isolate the impact of input serialization, with temperature being the most important variable for this experiment.

import json
from typing import List

import boto3


class AWSBedrockClient:
    """Client to interact with AWS Bedrock"""

    def __init__(self, region: str, model_prompt: str, model_embedding: str,
                 aws_access_key_id: str = None, aws_secret_access_key: str = None):
        self.region = region
        self.model_prompt = model_prompt
        self.model_embedding = model_embedding
        self.client = boto3.client(
            service_name='bedrock-runtime',
            region_name=region,
            aws_access_key_id=aws_access_key_id,
            aws_secret_access_key=aws_secret_access_key
        )

    def invoke_prompt(self, prompt: str, dataset: str = "", temperature: float = 0.0) -> str:
        """
        Invoke model with prompt

        Args:
            prompt: Base prompt
            dataset: Data to analyze (JSON or TOON encoded)
            temperature: 0 = deterministic, higher = more random
        """
        prompt_final = f"{prompt} {dataset}".strip()

        payload = {
            "messages": [
                {
                    "role": "user",
                    "content": [{"text": prompt_final}]
                }
            ],
            "inferenceConfig": {
                "max_new_tokens": 5000,
                "temperature": temperature,
                "top_p": 0.9
            }
        }

        try:
            response = self.client.invoke_model(
                modelId=self.model_prompt,
                body=json.dumps(payload)
            )
            response_body = json.loads(response['body'].read())
            return response_body['output']['message']['content'][0]['text']
        except Exception as e:
            raise Exception(f"Error invoking prompt model: {e}")

    def get_embeddings(self, text: str) -> List[float]:
        """Generate embeddings for a text"""
        payload = {"inputText": text}

        try:
            response = self.client.invoke_model(
                modelId=self.model_embedding,
                body=json.dumps(payload)
            )
            response_body = json.loads(response['body'].read())
            return response_body['embedding']
        except Exception as e:
            raise Exception(f"Error generating embeddings: {e}")


Experimental Setup

The experiment is based on the following principles:

  • Same prompt structure, changing only the data serialization format (JSON vs TOON)
  • 25 independent runs per format to capture variability and compute robust statistics
  • Temperature = 0 to minimize randomness and ensure deterministic model behavior

This setup allows us to isolate the impact of the serialization format on the model’s output.


Prompt Design

The same prompt is used in every execution; only the data attached to it changes between the JSON and TOON formats.

By concatenating the dataset directly to the prompt, we ensure that the instruction remains identical, and any differences in the response are attributable solely to the input format.
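A minimal sketch of how the two prompt variants are assembled: the instruction text below is a placeholder standing in for the actual prompt used in the experiment, and the AWS region is an assumption.

# Placeholder instruction; only the attached data changes between the two calls.
BASE_PROMPT = "Analyze the following Wikipedia page-view data and summarize the main trends."

articles = DatasetFetcher.fetch_wikipedia_pages(limit=30)
json_payload = json.dumps(articles, ensure_ascii=False)
toon_payload = toon_encode(articles)

client = AWSBedrockClient(
    region="us-east-1",  # assumption: any Bedrock region with Nova Lite access works
    model_prompt="amazon.nova-lite-v1:0",
    model_embedding="amazon.titan-embed-text-v2:0",
)

response_json = client.invoke_prompt(BASE_PROMPT, dataset=json_payload, temperature=0.0)
response_toon = client.invoke_prompt(BASE_PROMPT, dataset=toon_payload, temperature=0.0)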


Evaluation Procedure

To assess response equivalence between JSON and TOON, the experiment relies on the SemanticEvaluator class, which encapsulates response generation and similarity evaluation.
At the core of the evaluation is the comparison of two responses per run, generated using the same prompt but different data encodings (JSON vs TOON), with temperature fixed at 0 to ensure deterministic behavior.
The evaluation is structured as follows:

  • cosine_similarity computes semantic similarity between the two responses using embedding vectors generated by Amazon Titan. This metric captures meaning-level equivalence and is insensitive to surface-level wording changes.
  • evaluate_single_run performs a full comparison for one run. It invokes the LLM twice (JSON and TOON), generates embeddings, and computes cosine similarity along with lexical overlap metrics (ROUGE-1, ROUGE-2, ROUGE-L) and BLEU. The output is a consolidated set of similarity scores for that run.
  • evaluate_multiple_runs repeats the single-run evaluation 25 times using the same prompt and dataset. Results from all runs are aggregated into a DataFrame, enabling statistical analysis such as mean values, variance, and stability across runs.

This design allows us to determine whether TOON’s token savings preserve response quality, both semantically and lexically, across multiple deterministic evaluations.
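The full SemanticEvaluator lives in the repository; the sketch below illustrates the core of a single run, assuming the rouge-score and nltk packages are used for the lexical metrics (function and variable names here are illustrative).

import numpy as np
from rouge_score import rouge_scorer                                    # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def evaluate_single_run(client, prompt, json_data, toon_data):
    """Generate one JSON-based and one TOON-based response and compare them."""
    resp_json = client.invoke_prompt(prompt, dataset=json_data, temperature=0.0)
    resp_toon = client.invoke_prompt(prompt, dataset=toon_data, temperature=0.0)

    # Semantic equivalence via Titan embeddings
    cos = cosine_similarity(client.get_embeddings(resp_json),
                            client.get_embeddings(resp_toon))

    # Lexical overlap metrics
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(resp_json, resp_toon)
    bleu = sentence_bleu([resp_json.split()], resp_toon.split(),
                         smoothing_function=SmoothingFunction().method1)

    return {
        "cosine_similarity": cos,
        "rouge1_f1": rouge["rouge1"].fmeasure,
        "rouge2_f1": rouge["rouge2"].fmeasure,
        "rougeL_f1": rouge["rougeL"].fmeasure,
        "bleu": bleu,
    }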


Results

After running 25 deterministic evaluations (temperature = 0), the analysis focused exclusively on response equivalence, measuring whether JSON and TOON produce comparable outputs when token savings are significant.

Semantic Equivalence (Cosine Similarity ≈ 0.991)

The most important signal comes from cosine similarity, computed using embeddings generated by Amazon Titan.

An average score of 0.991 indicates that, for the LLM, responses generated from TOON-encoded data are semantically equivalent to those generated from JSON.

Despite the removal of structural syntax such as braces, quotes, and repeated field names, the model preserved its ability to reason over the data and extract the same insights.
Across all runs, the meaning of the responses remained consistent.


Lexical Variability vs. Data Accuracy

Lexical similarity metrics such as ROUGE-1 and BLEU report lower absolute values:

  • ROUGE-1 F1 = 0.747
  • ROUGE-L F1 = 0.608
  • BLEU = 0.563

These scores indicate a moderate degree of lexical and structural variation between responses generated from JSON and TOON inputs. In particular, ROUGE-1 suggests partial overlap at the word level, while the lower ROUGE-L score highlights differences in sentence structure and ordering, consistent with paraphrasing and reformulation rather than content loss. Similarly, BLEU, which is sensitive to exact n-gram matches and word order, penalizes these variations even when responses remain correct and informative.

Importantly, these lexical differences do not correspond to a degradation in response quality. When inspecting the actual content of the responses, including rankings, averages, and detected trends, the results were numerically and logically consistent across formats.


🗂️ Code repository
If you want to explore the code and reproduce these experiments, everything is available in my repository.
If you find this tutorial useful, don’t forget to leave a star ⭐️ on the repository and follow me to be notified about new articles. Your support helps me keep creating technical content for the community 🚀

Repository: RominaElenaMendezEscobar / experiment-toon-vs-json, a practical benchmark comparing JSON and TOON (Terse Object Oriented Notation) as data serialization formats for LLM prompts.

Conclusions

This experiment shows that TOON can significantly reduce token usage while preserving response quality, as long as it is applied to the right type of data. For flat, repetitive structures, TOON acts as an effective form of prompt compression: the LLM retains semantic understanding, and any differences in wording are superficial rather than affecting meaning or correctness.

⚠️ Key limitations:

  • Only a single LLM was tested (Amazon Nova Lite)
  • Only specific datasets were used (GitHub Events and Wikipedia Page Views)
  • Evaluation was conducted in English only
  • Prompts were simple analytical tasks, not complex reasoning scenarios

As in any systems project, solutions should be carefully evaluated to determine whether they are truly optimal for a given use case. Outcomes often depend on many variables, so testing and validation in the specific context are essential before making decisions or implementing at scale.
