I have been wiring TOON support with toon-token-diff into this MCP server to understand whether converting JSON payloads to TOON meaningfully reduces prompt costs. The short answer: TOON is elegant, but in my test harness it delivered microscopic savings for real-world workloads.
Environment
- Project mode: `toon-token-diff` in `libraryMode`, installed via `npm install toon-token-diff`
- Models monitored: `openai` (tiktoken GPT-5 profile) and `claude`
- Integration strategy: lightweight instrumentation that appends token stats to a JSONL ledger for later analysis
import { estimateAndLog } from "toon-token-diff/libraryMode";

// inside my MCP tool handler
estimateAndLog(JSON.stringify(result), {
  models: ["openai", "claude"],
  file: "./token-logs.jsonl",
  format: "json",
  label: "mcp_tool_call",
});
This snippet runs after the MCP tool produces a JSON response. It serializes the payload, estimates TOON vs JSON tokens, and emits a structured record to token-logs.jsonl. The rest of the MCP server stays untouched—no need to change transport or business logic.
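For context, this is roughly where the call sits in a handler. A minimal sketch only: the handler shape and `runQuery` are placeholders for whatever your MCP server already does, not part of the toon-token-diff API.

```ts
import { estimateAndLog } from "toon-token-diff/libraryMode";

// Stand-in for the server's existing business logic.
async function runQuery(args: Record<string, unknown>): Promise<unknown> {
  return { received: args }; // placeholder result
}

// Hypothetical tool handler: only the estimateAndLog call is new; the
// business logic and the response returned to the client are unchanged.
async function handleBacklogTool(args: Record<string, unknown>) {
  const result = await runQuery(args);

  // Log JSON-vs-TOON token estimates for this payload; one JSONL record
  // is appended to the ledger per call.
  estimateAndLog(JSON.stringify(result), {
    models: ["openai", "claude"],
    file: "./token-logs.jsonl",
    format: "json",
    label: "mcp_tool_call",
  });

  return result; // the MCP client still receives the original JSON payload
}
```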
Observations
| Timestamp (UTC) | openai JSON tokens | openai TOON tokens | openai savings (%) | claude JSON tokens | claude TOON tokens | claude savings (%) |
|---|---|---|---|---|---|---|
| 2025-11-19T14:16:54.296Z | 127 | 126 | 0.79 | 130 | 129 | 0.77 |
| 2025-11-19T14:17:15.720Z | 53,703 | 53,702 | 0.0019 | 54,977 | 54,976 | 0.0018 |
| 2025-11-19T14:17:34.988Z | 14 | 12 | 14.29 | 14 | 12 | 14.29 |
| 2025-11-19T14:17:39.246Z | 53,703 | 53,702 | 0.0019 | 54,977 | 54,976 | 0.0018 |
| 2025-11-19T14:17:48.333Z | 29 | 29 | 0.00 | 28 | 28 | 0.00 |
| 2025-11-19T14:18:13.725Z | 91,729 | 91,728 | 0.0011 | 98,607 | 98,606 | 0.0010 |
| 2025-11-19T14:21:19.174Z | 127 | 126 | 0.79 | 130 | 129 | 0.77 |
| 2025-11-19T14:21:23.370Z | 91,729 | 91,728 | 0.0011 | 98,607 | 98,606 | 0.0010 |
| 2025-11-19T14:21:30.314Z | 53,703 | 53,702 | 0.0019 | 54,977 | 54,976 | 0.0018 |
Nine consecutive tool runs told the same story: production payloads barely moved. Only the intentionally tiny sample showed double-digit savings, which is irrelevant for backlog-scale prompts.
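The table was assembled by replaying the ledger. A minimal reader could look like the sketch below; the field names inside each JSONL record (`timestamp`, `models`, `jsonTokens`, `toonTokens`) are my assumptions about the log shape, not documented toon-token-diff output, so check a real record before relying on them.

```ts
import { readFileSync } from "node:fs";

// Assumed shape of one JSONL record written by estimateAndLog.
type ModelStats = { jsonTokens: number; toonTokens: number };
type LedgerRecord = { timestamp: string; label: string; models: Record<string, ModelStats> };

const records: LedgerRecord[] = readFileSync("./token-logs.jsonl", "utf8")
  .split("\n")
  .filter((line) => line.trim() !== "")
  .map((line) => JSON.parse(line));

// Print JSON vs TOON counts and the percentage saved per model per run.
for (const rec of records) {
  for (const [model, s] of Object.entries(rec.models)) {
    const saved = ((s.jsonTokens - s.toonTokens) / s.jsonTokens) * 100;
    console.log(`${rec.timestamp} ${model}: ${s.jsonTokens} -> ${s.toonTokens} (${saved.toFixed(4)}% saved)`);
  }
}
```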
Why the Reduction Rate Is Flat
- Content dominates token volume – The payload body itself accounts for nearly every token, so TOON’s structural tweaks barely register in the total. In the 53,703-token run above, switching to TOON saved exactly one token; the back-of-the-envelope sketch below shows why the ceiling is so low.
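A back-of-the-envelope illustration; every number except the 53,703 total is an assumption chosen for the example, not a measurement.

```ts
// Upper bound on what any purely structural re-encoding can save.
const totalTokens = 53_703;     // JSON token count from one observed run
const structuralTokens = 1_200; // assumed: braces, brackets, quotes, commas
// Tokenizers also merge punctuation into neighbouring text, so even this
// split is generous to the structural side.

const ceiling = (structuralTokens / totalTokens) * 100;
console.log(`best-case structural saving: ${ceiling.toFixed(1)}%`); // ≈ 2.2%
// The observed TOON saving on the same payload was ~0.002%, because TOON
// still has to keep keys, values, and most delimiters.
```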
Practical Guidance
- Keep TOON handy as a normalization format, but don't promise cost savings without benchmarking your actual payloads.
- Instrument with the libraryMode snippet above before you ship; the ledger gives you historical evidence of whether TOON helps your workloads.
- If savings are negligible, redirect the effort toward higher-impact tactics: pruning unused fields (a sketch follows this list), batching small tool calls, or applying semantic compression upstream.
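Pruning is where content-dominated payloads actually shrink. A minimal sketch, assuming the tool response carries fields the model never reads; the field names are hypothetical.

```ts
// Drop fields the model never needs before the payload is serialized.
// "debugTrace", "rawHtml", and "internalIds" are hypothetical examples.
const DROP_FIELDS = new Set(["debugTrace", "rawHtml", "internalIds"]);

function pruneForPrompt(result: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(result).filter(([key]) => !DROP_FIELDS.has(key))
  );
}

// The pruned copy is what gets serialized for the prompt; the full result
// can still be returned to the client or written to the ledger.
const example = { id: 42, title: "Fix login bug", debugTrace: "…5 KB of stack trace…" };
console.log(JSON.stringify(pruneForPrompt(example)));
// -> {"id":42,"title":"Fix login bug"}
```

Unlike a re-encoding, pruning removes whole value bodies, so its effect scales with the content it drops rather than with punctuation.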
Next Experiments
- Compare with alternative tokenizers (Gemini, Llama) to see whether non-GPT vocabularies respond differently.
- Add diff tooling that highlights which fields TOON actually shrinks, so we can prune them by hand if needed (a rough size-based precursor is sketched after this list).
- Explore policy-driven trimming (e.g., dropping debug blobs) prior to TOON conversion.
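As a starting point for that diff tooling, a per-field size report already shows where the bulk lives. Character counts are only a proxy for tokens, and the payload shape here is invented for the example.

```ts
// Rough per-field size report: character counts as a stand-in for tokens.
// A sketch of the idea, not the planned field-level TOON diff itself.
function fieldSizeReport(payload: Record<string, unknown>): void {
  const rows = Object.entries(payload)
    .map(([field, value]) => ({ field, chars: JSON.stringify(value).length }))
    .sort((a, b) => b.chars - a.chars);
  console.table(rows); // biggest fields first: candidates for pruning or trimming
}

// Invented payload shape, roughly mimicking a backlog-sized tool response.
fieldSizeReport({
  issues: new Array(500).fill({ id: 1, title: "placeholder", description: "…" }),
  pagination: { page: 1, perPage: 500 },
  debug: { queryMs: 132, cacheHit: false },
});
```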
TOON remains a clever serialization trick, but as my MCP experiment showed, it is not an automatic token economy lever. Measure, log, and decide based on real numbers.
Top comments (1)
Really appreciate the hard numbers here—nice reality check on TOON as a “token saver” vs just a clean format. Has anyone seen materially better deltas with different schemas, models, or extreme payload shapes? Any tips on semantic compression that worked well?