Seenivasa Ramadurai
Two Efficient Technologies to Reduce AI Token Costs: TOON and Microsoft's LLMLingua-2

TOON Data Serialization and Microsoft's LLMLingua-2 Prompt Compressor

Building AI applications has never been more accessible. OpenAI's GPT-4, Anthropic's Claude, and Google's Gemini have opened up possibilities that seemed like science fiction just a few years ago. Enterprise teams are creating intelligent agents, RAG systems, and GenAI applications that solve complex business challenges at scale.

But there's a catch that hits you hard once you move from prototype to production: token costs.

Your AI application's monthly bill just doubled. Again.

Every API call to these large language models costs money and those costs are calculated per token. A token is roughly a word or part of a word, and you're charged for every single one in both your input (the data and instructions you send) and output (the AI's response).

Here's what most developers don't realize until it's too late: the way you format your data and write your prompts is costing you 40-60% more than it should. Traditional data formats like JSON are incredibly wasteful. Every bracket, every repeated field name, every verbose instruction: it all adds up. Fast.

You're building something innovative: maybe an AI agent processing customer data, a RAG system that needs to understand long documents, or a GenAI app that actually delivers value. But every API call to OpenAI's GPT-4, Anthropic's Claude, or Google's Gemini is draining your budget faster than you expected.

Here's the reality: you're paying for every single token. And most of those tokens? They're waste: redundant data formatting, unnecessary words, bloated prompts that could be half the size.

Two technologies are changing the game. TOON is a data serialization format that makes structured data 30-60% more compact for LLMs. Microsoft's LLMLingua-2 is a prompt compressor that intelligently removes 50-80% of your prompt while keeping the meaning intact.

Different problems. Different solutions. Same result: dramatically lower AI costs.

What is TOON? (Data Serialization Built for LLMs)

TOON stands for Token-Oriented Object Notation, a data serialization format designed specifically for Large Language Models.

Think of serialization as translating your data into a format that can be transmitted and understood. JSON has been the standard for years. But JSON was built for web APIs in the 2000s, not for AI models that charge by the token in 2025.

Here's the problem: JSON is incredibly repetitive. When you send a list of 100 customers to an AI model, JSON repeats every field name ("id", "name", "email") 100 times. That's thousands of wasted tokens you're paying for.

TOON combines YAML's indentation-based structure for nested objects with a CSV-style tabular layout for uniform arrays. Instead of repeating field names, TOON declares them once and then lists just the values.

See the Difference

Here's employee data in both formats:

JSON (the traditional way):

{
  "team": [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"}
  ]
}

TOON (the efficient way):

team[3]{id,name,role}:
1,Tej B,engineer
2,Praveen V,designer
3,Partha G,manager

Same data. Fraction of the tokens.
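The transformation is mechanical enough to sketch in a few lines of plain Python. The `to_toon` helper below is a toy written for this article (it is not the real library, and it skips quoting, escaping, and nested objects); it just shows the core idea: declare the fields once, then emit one comma-separated row per item.

```python
def to_toon(key, rows):
    """Toy TOON-style encoder for a uniform list of dicts (illustration only)."""
    fields = list(rows[0].keys())
    # Header declares the array name, length, and field names exactly once.
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    # Each row then carries only the values, CSV-style.
    lines = [",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + lines)

team = [
    {"id": 1, "name": "Tej B", "role": "engineer"},
    {"id": 2, "name": "Praveen V", "role": "designer"},
    {"id": 3, "name": "Partha G", "role": "manager"},
]
print(to_toon("team", team))
```

Running this reproduces the TOON output shown above: the three field names appear once in the header instead of three times in the body.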

The Real Numbers

In the format's reported benchmarks, TOON achieved 73.9% accuracy while using 39.6% fewer tokens than standard JSON, which scored 69.7%. It doesn't just use fewer tokens; in those tests, AI models actually understood TOON better than JSON.

Let's talk real money. Say you're sending product catalog data to GPT-4:

  • 100 products with 8 fields each in JSON: ~12,000 tokens
  • Same data in TOON: ~6,000 tokens
  • Cost savings: 50% reduction per API call

Do this thousands of times daily across your GenAI application? You're saving hundreds or thousands of dollars monthly.
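That arithmetic is easy to sketch. The per-token price and call volume below are assumptions for illustration, not current pricing; check your provider's price sheet before relying on the figures.

```python
# Illustrative cost comparison; price and volume are assumed figures.
price_per_1k_input = 0.03          # assumed $ per 1,000 input tokens
calls_per_day = 5_000              # assumed daily call volume

json_tokens = 12_000               # ~100 products x 8 fields as JSON
toon_tokens = 6_000                # same payload as TOON

json_daily = json_tokens / 1_000 * price_per_1k_input * calls_per_day
toon_daily = toon_tokens / 1_000 * price_per_1k_input * calls_per_day
monthly_savings = (json_daily - toon_daily) * 30
print(f"JSON: ${json_daily:,.0f}/day, TOON: ${toon_daily:,.0f}/day, "
      f"savings: ${monthly_savings:,.0f}/month")
```

Under these assumptions the halved payload halves the bill; plug in your own token counts and prices to estimate your savings.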

When to Use TOON

TOON excels with uniform arrays of objects—data where multiple items share the same structure:

  • Customer records, product catalogs, transaction logs
  • Database query results sent to AI agents
  • Analytics dashboards, sales reports, inventory data
  • Any tabular or semi-tabular data your AI needs to process

For deeply nested or non-uniform structures, JSON may be more efficient. TOON isn't a universal replacement—it's a specialized tool for the right job.

How to Use TOON (Python)

Installation:

pip install toon-py

Basic Usage:

from toon_py import encode, decode

# Your application data
products = [
    {"id": 101, "name": "Laptop", "price": 1299, "stock": 45},
    {"id": 102, "name": "Mouse", "price": 29, "stock": 230},
    {"id": 103, "name": "Keyboard", "price": 89, "stock": 156}
]

# Convert to TOON before sending to LLM
toon_data = encode(products)
print(toon_data)
# Output:
# [3]{id,name,price,stock}:
# 101,Laptop,1299,45
# 102,Mouse,29,230
# 103,Keyboard,89,156

# Use in your AI prompt
prompt = f"Analyze this inventory:\n{toon_data}\n\nWhich products need restocking?"

# Send to OpenAI, Claude, or Gemini...
# Save 40-60% on tokens!

Command Line:

# Convert JSON to TOON
toon input.json -o output.toon

# Convert TOON back to JSON
toon data.toon -o output.json

What is LLMLingua-2? (The Prompt Compressor)

LLMLingua-2 is Microsoft's prompt compression technology that reduces your prompts by 50-80% without losing meaning.

While TOON handles structured data serialization, LLMLingua-2 tackles a different problem: your prompts are too long. Instructions for AI agents, context from documents, examples for few-shot learning—it all adds up to massive token counts.

LLMLingua-2 formulates prompt compression as a token classification problem using a Transformer encoder to capture essential information from full bidirectional context. It's trained through data distillation from GPT-4, so it knows exactly what information large language models need and what they can ignore.

Think of it as having an expert editor who understands how AI models think. LLMLingua-2 reads your prompt and removes filler words, redundant phrases, and unnecessary context while keeping everything essential.
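The real model is a trained XLM-RoBERTa classifier, but the per-token keep/drop idea can be caricatured in a few lines. The stopword heuristic below is purely illustrative and nothing like the learned model; it only shows the shape of the decision LLMLingua-2 makes for every token.

```python
# Toy illustration of compression as per-token keep/drop decisions.
# LLMLingua-2 LEARNS these decisions from GPT-4 distillation data;
# here we fake them with a hand-written stopword list.
DROP = {"the", "a", "an", "of", "to", "that", "is", "are", "was", "were", "by", "in"}

def toy_compress(text):
    # Keep a token only if the classifier (here: a stopword check) says so.
    kept = [tok for tok in text.split() if tok.lower().strip(".,") not in DROP]
    return " ".join(kept)

sentence = "The revenue of the company was increased by enterprise sales in Q4."
print(toy_compress(sentence))
```

The compressed sentence loses its connective tissue but keeps the content words a model needs to answer questions about it; the trained classifier makes this trade-off far more intelligently.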

The Performance Numbers

LLMLingua-2 achieves up to 20x compression with minimal performance loss and is 3x-6x faster than the original LLMLingua. More impressively, it accelerates end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.

What does this mean practically? A 1,000-token prompt compressed to 200 tokens isn't just cheaper—it's faster. Your users get responses quicker. You pay less. Everyone wins.

Perfect for RAG Systems

If you're building Retrieval-Augmented Generation systems, LLMLingua-2 is a game-changer. RAG applications often pull 10-20 document chunks to answer a single question. That's massive context to send to your LLM.

LLMLingua mitigates the "lost in the middle" issue in LLMs, enhancing long-context information processing. By compressing retrieved context, you maintain all the important information while dramatically reducing tokens.

LLMLingua has been integrated into LangChain and LlamaIndex, two widely-used RAG frameworks.

How to Use LLMLingua-2 (Python)

Installation:

pip install llmlingua

Basic Compression:

from llmlingua import PromptCompressor

# Initialize LLMLingua-2
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True
)

# Your lengthy prompt
context = """
The quarterly financial report shows strong growth in Q4 2024.
Revenue increased by 28% compared to Q3, primarily driven by
enterprise sales. Operating costs decreased by 12% due to
improved efficiency measures. Customer retention improved to 96%,
while new customer acquisition grew by 34%. The product team
shipped five major features that significantly increased user
engagement metrics across all segments...
"""

question = "What were the main growth drivers in Q4?"
prompt = f"{context}\n\nQuestion: {question}"

# Compress the prompt
compressed = compressor.compress_prompt(
    prompt,
    rate=0.5,  # Target 50% compression
    force_tokens=['\n', '?']  # Preserve important formatting
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Compressed prompt: {compressed['compressed_prompt']}")

With LangChain RAG:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMLinguaCompressor
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Your vector store setup
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Add LLMLingua-2 compression
compressor = LLMLinguaCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

# Your RAG system now uses compressed context automatically
compressed_docs = compression_retriever.get_relevant_documents(
    "What are the key findings from the research?"
)

For Agentic AI:

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True
)

# Long agent instructions
agent_instructions = """
You are a financial analysis agent with access to market data,
company financials, and industry reports. Your task is to identify
investment opportunities by analyzing revenue trends, profit margins,
market positioning, competitive advantages, and growth potential.
Consider both quantitative metrics and qualitative factors...
"""

# Compress for the agent
compressed = compressor.compress_prompt(
    agent_instructions,
    rate=0.4  # 60% compression
)

# Use compressed instructions with your AI agent
agent_prompt = f"{compressed['compressed_prompt']}\n\nTask: Analyze Tesla's Q4 performance"

When to Use What: The Smart Strategy

TOON and LLMLingua-2 aren't competitors—they're complementary tools. Here's when to use each:

Use TOON For:

  • Structured data with repeated fields: Customer lists, product catalogs, database results
  • Tabular or semi-tabular data: Sales reports, analytics, inventory systems
  • AI agents processing data: Anything where you're sending arrays of objects with the same structure
  • API responses: Convert JSON from your backend before sending to LLMs

Use LLMLingua-2 For:

  • Long text prompts: Instructions, explanations, guidelines for AI agents
  • RAG systems: Compress retrieved document context before sending to LLMs
  • Natural language: Meeting transcripts, reports, articles that need compression
  • Multi-step reasoning: Complex chain-of-thought prompts that are inherently lengthy

Use BOTH For:

  • Sophisticated GenAI applications combining structured data and lengthy instructions
  • High-volume systems making thousands of AI API calls daily
  • Cost-sensitive applications where token efficiency directly impacts profitability

Combining Both: Maximum Efficiency

from toon_py import encode
from llmlingua import PromptCompressor

# Initialize compressor
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True
)

# Structured data → TOON
sales_data = [
    {"month": "Oct", "revenue": 450000, "customers": 1245, "churn": 23},
    {"month": "Nov", "revenue": 485000, "customers": 1312, "churn": 19},
    {"month": "Dec", "revenue": 520000, "customers": 1398, "churn": 15}
]
toon_data = encode(sales_data)

# Long instructions → LLMLingua-2
instructions = """
Analyze the quarterly sales performance considering seasonal trends,
customer acquisition costs, competitive landscape changes, and
market conditions. Compare with historical data from the past
three years. Identify key growth drivers and potential risks.
Provide actionable recommendations for the sales team based on
data driven insights and market analysis...
"""
compressed_instructions = compressor.compress_prompt(instructions, rate=0.5)

# Combine both optimizations
final_prompt = f"""
{compressed_instructions['compressed_prompt']}

Q4 Sales Data:
{toon_data}

Question: What's the trend and what should we do next quarter?
"""

# Maximum token efficiency achieved!
# Send to OpenAI, Claude, or Gemini with 50-70% cost savings

The Bottom Line: Economics Matter

Building with AI isn't just about model capabilities; it's about sustainable economics. The best AI application in the world fails if token costs scale faster than revenue.

TOON and LLMLingua-2 give you breathing room. They let you:

  • Ship features faster without constantly optimizing for token costs
  • Scale sustainably as your user base grows
  • Compete effectively even against companies with bigger budgets
  • Build richer experiences because you're not cutting features to save tokens

Both technologies are production-ready, open-source, and actively maintained:

TOON:

  • Python: pip install toon-py
  • Multiple language implementations available
  • 5 minutes to integrate into existing applications

LLMLingua-2:

  • Python: pip install llmlingua
  • Integrated with LangChain and LlamaIndex
  • Microsoft Research backed with ongoing development

Start Saving Today

You don't need to rewrite your entire application. Start small:

  1. Identify your most expensive API calls (log tokens per endpoint)
  2. Test TOON on structured data endpoints or LLMLingua-2 on text-heavy prompts
  3. Measure actual savings (tokens before vs after)
  4. Roll out gradually across your application
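Steps 1 and 3 can be as simple as a per-endpoint counter. The endpoint names and token counts below are invented for illustration; the token figures would come from your provider's API response (`usage.prompt_tokens` and the like).

```python
from collections import defaultdict

# Step 1: minimal per-endpoint token logger to find expensive calls.
usage = defaultdict(lambda: {"calls": 0, "tokens": 0})

def log_call(endpoint, prompt_tokens, completion_tokens):
    usage[endpoint]["calls"] += 1
    usage[endpoint]["tokens"] += prompt_tokens + completion_tokens

def savings_pct(before, after):
    """Step 3: percent of tokens removed by an optimization."""
    return round(100 * (before - after) / before, 1)

# Invented example calls:
log_call("summarize", 1200, 300)
log_call("summarize", 1100, 250)
print(usage["summarize"])            # {'calls': 2, 'tokens': 2850}
print(savings_pct(12_000, 6_000))    # 50.0
```

Once the logger shows which endpoints dominate your bill, apply TOON or LLMLingua-2 there first and compare the before/after numbers.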

The AI revolution is expensive. Smart developers are finding ways to make it affordable. TOON and LLMLingua-2 are two of the most effective tools available today.

Start cutting your API bills today.

Thanks
Sreeni Ramadorai
