Nilofer 🚀
Prompt Compression Benchmarker: Cut LLM Input Costs by 35–63% With Measurable Quality Tracking

Most LLM cost comes from input tokens - the long documents, codebases, or conversation histories you send as context. There are several prompt compression algorithms available, but nobody tells you which one actually works best for your specific workload, or how much quality you are trading for the savings.

Prompt Compression Benchmarker (PCB) answers both questions. It benchmarks every major prompt compression algorithm against your actual data, shows you exactly how much quality each one drops, projects the real dollar savings at your call volume, and then gives you a one-line wrapper to deploy the winner as a drop-in replacement around your Anthropic or OpenAI client.

What It Does

PCB answers two questions:

Which compression algorithm preserves the most quality at a given token budget?
Benchmark mode runs all compressors against your data and scores each one with task-specific quality metrics and an optional LLM-as-judge.

How much money does that save at your actual call volume?
Cost projection mode takes your daily token volume and model pricing and gives you monthly and annual savings per compressor.

Then it gives you a one-line wrapper to deploy the answer.

Installation

# From source
git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install .
# From PyPI (once published)
pip install prompt-compression-benchmarker

Requires Python 3.9+. No GPU required. Core dependencies: tiktoken, scikit-learn, rouge-score, rank-bm25, typer, rich.

# Verify
pcb --help

# Optional extras
pip install "prompt-compression-benchmarker[anthropic]"   # SDK wrapper for Anthropic
pip install "prompt-compression-benchmarker[openai]"      # SDK wrapper for OpenAI
pip install "prompt-compression-benchmarker[mcp]"         # MCP server for Claude Code
pip install "prompt-compression-benchmarker[all]"         # Everything

Quick Start

1. Run the benchmark
The simplest run uses bundled sample data - no setup needed:

# All compressors × all task types, bundled sample data - no setup needed
pcb run

# Target a specific task with cost projection
pcb run --task rag --max-samples 20 --daily-tokens 2000000 --cost-model claude-sonnet-4-6

# Add LLM-as-judge for deeper quality scoring (requires OpenRouter API key)
export OPENROUTER_API_KEY=sk-or-...
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10

Here is what a real benchmark run looks like - RAG task, 3M tokens/day, claude-sonnet-4-6 pricing:

pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6
                          RAG
 Compressor          Token Reduc %  Proxy Score  Proxy Drop %   ms
 no_compression           0.0%        0.2983         0.0%      0.3
 tfidf ★                 40.1%        0.2519        +16.5%     12.1
 selective_context        56.9%        0.1874        +34.4%      8.3
 llmlingua                53.6%        0.2182        +28.1%      9.7
 llmlingua2               45.0%        0.2204        +27.3%     11.2

 Monthly Cost Projection  claude-sonnet-4-6 · $3/1M · 3M tokens/day
 tfidf             38.3% reduction   $103/mo saved   $1,240/yr
 selective_context 57.5% reduction   $155/mo saved   $1,863/yr
 llmlingua2        43.6% reduction   $118/mo saved   $1,413/yr

The ★ marks the Pareto-optimal compressor - best token savings given a quality drop below 20%.
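That selection rule is easy to restate in code. A minimal sketch of the rule as described (not PCB's internals), using the numbers from the run above:

```python
# Among compressors whose proxy quality drop stays under 20%,
# pick the one with the highest token reduction.
rows = [
    # (compressor, token_reduction_pct, proxy_drop_pct)
    ("no_compression", 0.0, 0.0),
    ("tfidf", 40.1, 16.5),
    ("selective_context", 56.9, 34.4),
    ("llmlingua", 53.6, 28.1),
    ("llmlingua2", 45.0, 27.3),
]
eligible = [r for r in rows if r[2] < 20.0]       # quality floor
best = max(eligible, key=lambda r: r[1])          # maximize savings
print(best[0])  # tfidf
```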

2. Compress a file directly

# Compress from a file or stdin, output to stdout
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats

# Pipe it into any script
cat rag_context.txt | pcb compress | python send_to_claude.py

# Save compressed output
pcb compress context.txt -o compressed.txt --stats

3. Deploy the winner
Once you know which compressor wins on your data, deploying it is one line:

from pcb.middleware import CompressingAnthropic

# Drop-in replacement for anthropic.Anthropic()
client = CompressingAnthropic(compressor="llmlingua2", rate=0.45)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

print(client.stats)  # CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

Everything else in your codebase stays the same.

Understanding the Results

Benchmark table columns

Quality drop color coding

cyan   = negative drop (compression improved the metric - noise removal)
green  = < 5% drop    (effectively lossless)
yellow = 5–15% drop   (acceptable for most use cases)
red    = ≥ 15% drop   (significant information loss)

Why use the LLM judge?

The proxy score (F1, ROUGE, BM25) is fast and free but mechanical. The LLM judge calls a real model to evaluate whether the compressed context still supports the correct answer; it reveals things proxy metrics miss.

Here is a real example showing why this matters - RAG task, 5 samples, LLM judge = claude-sonnet-4-6:

Compressor           Proxy Drop %   LLM Score   LLM Drop %
no_compression           0.0%         0.94         0.0%
tfidf                  +23.7%         0.40        -57.4%    ← proxy hid the severity
llmlingua2             +29.9%         0.70        -25.5%    ← much better than proxy suggested
selective_context      +37.6%         0.14        -85.1%    ← dangerous despite high compression

Rule of thumb: use proxy scores to compare many configs quickly, then LLM-judge the top 2–3 before deploying.

Choosing a Compressor

RAG: llmlingua2 at rate 0.40 - preserves named entities and key facts better than sentence-dropping

Summarization: llmlingua at rate 0.45 - sentence-level pruning maintains structural coverage

Code contexts: llmlingua2 at rate 0.35 - keeps imports, identifiers, type names; removes boilerplate

General chat: tfidf at rate 0.40 - safe default, fast, reliable

Target compression rate

--rate is the fraction of tokens to remove. 0.45 means keep 55% of tokens.
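Concretely (a one-line sketch of the semantics, not PCB's code):

```python
rate = 0.45                        # fraction of tokens to REMOVE
prompt_tokens = 1000
kept = round(prompt_tokens * (1 - rate))
print(kept)  # 550 - i.e. 55% of the prompt survives
```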

Cost Savings - The Real Numbers

Compression saves money on input tokens only. Output tokens are unchanged.
At 3M input tokens per day:

Compression is most valuable on premium models. On DeepSeek or GPT-4.1-mini, the savings are too small to justify the complexity; use it only if you're hitting context window limits.
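The projection arithmetic is simple enough to sanity-check by hand. A back-of-the-envelope sketch (not PCB's internal pricing code) that reproduces the tfidf row from the benchmark run earlier:

```python
def monthly_savings_usd(daily_tokens: int, reduction_pct: float,
                        price_per_million: float, days: int = 30) -> float:
    """Input-token dollars saved per month at a given compression rate."""
    tokens_saved_per_day = daily_tokens * reduction_pct / 100
    return tokens_saved_per_day * days / 1_000_000 * price_per_million

# 3M tokens/day, 38.3% reduction, $3 per 1M input tokens
print(round(monthly_savings_usd(3_000_000, 38.3, 3.0)))  # 103
```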

# Check your own workload
pcb run --max-samples 10 --daily-tokens 5000000 --cost-model claude-opus-4-7

Deploy: Python SDK Wrappers

Anthropic

from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    verbose=True,
)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

# Cumulative stats
print(client.stats)
# CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

# Estimate monthly savings
print(client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000))
# 588.0

OpenAI (Chat Completions + Codex Responses API)

from pcb.middleware import CompressingOpenAI

client = CompressingOpenAI(compressor="tfidf", rate=0.40)

# Chat Completions API - unchanged
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": long_context}]
)

# Responses API (Codex / o-series)
response = client.responses.create(
    model="codex-mini-latest",
    input=long_codebase_context,
    reasoning={"effort": "high"}
)

What gets compressed

By default, only "user" role messages over 100 tokens are compressed. System prompts and assistant history are passed through unchanged.

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    compress_roles=("user", "system"),  # also compress system prompt
)

Claude Code Integration (MCP)

PCB ships an MCP server that adds four compression tools directly into Claude Code conversations.

Setup

# Add to the current project
claude mcp add pcb -s project -- python -m pcb.mcp_server

# Or add to all your projects
claude mcp add pcb -s user -- python -m pcb.mcp_server

Or drop .mcp.json into any project root:

{
  "mcpServers": {
    "pcb": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "pcb.mcp_server"]
    }
  }
}

Available tools
Once connected, you can ask Claude:

  • "Compress this RAG context before sending it to the model"
  • "Estimate how much I'd save compressing my prompts on claude-opus-4-7 at 2000 calls/day"
  • "What compressor should I use for my coding assistant at 90% quality floor?"

OpenAI Codex (Agents SDK)

from agents import Agent, Runner
from agents.mcp import MCPServerStdio
import asyncio

async def main():
    async with MCPServerStdio(
        name="pcb",
        params={"command": "python", "args": ["-m", "pcb.mcp_server"]},
    ) as pcb_server:
        agent = Agent(
            name="CostAwareAssistant",
            model="codex-mini-latest",
            mcp_servers=[pcb_server],
        )
        result = await Runner.run(
            agent,
            "Compress this codebase context and estimate savings: " + codebase_context
        )
        print(result.final_output)

asyncio.run(main())

Bring Your Own Data

Data is JSONL - one JSON object per line. Check the schema for each task type:

pcb show-schema rag
pcb show-schema summarization
pcb show-schema coding

RAG schema

{
  "id": "my_001",
  "context": "<passage 300–1500 tokens>",
  "question": "<specific question requiring the full context>",
  "answer": "<short, precise answer string>"
}

Summarization schema

{
  "id": "my_001",
  "article": "<article or document 300–800 tokens>",
  "summary": "<2–3 sentence reference summary>"
}

Coding schema

{
  "id": "my_001",
  "context": "<imports, helpers, type definitions - 400–800 tokens>",
  "docstring": "<description of the function to implement>",
  "solution": "<correct Python implementation>"
}
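Generating a dataset in this shape takes a few lines of Python. A sketch (the sample content here is made up; the rag_samples.jsonl filename matches the *_samples.jsonl pattern --data-dir expects):

```python
import json
import os

samples = [
    {
        "id": "my_001",
        "context": "The Eiffel Tower was completed in 1889 for the "
                   "Exposition Universelle held in Paris...",
        "question": "In what year was the Eiffel Tower completed?",
        "answer": "1889",
    },
]

os.makedirs("my_data", exist_ok=True)
# One JSON object per line - the JSONL format PCB reads
with open("my_data/rag_samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```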

Running on your data

pcb run --data-dir ./my_data --task rag --max-samples 50

# Compare specific compressors
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2

# Export results
pcb run --data-dir ./my_data --output results.json
pcb run --data-dir ./my_data --output results.csv
pcb run --data-dir ./my_data --output results.html

Workflow: Benchmark to Production

Here is the full path from benchmarking to deploying a compressor in production.

Step 1: Benchmark on your actual data

pcb run --data-dir ./my_data --max-samples 50 --task rag \
        --daily-tokens 2000000 --cost-model claude-opus-4-7 \
        --output benchmark.json

Step 2: LLM-judge the top candidates

pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2 \
        --llm-judge --judge-model claude-sonnet-4-6 --max-samples 30

Step 3: Deploy the winner

from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(compressor="llmlingua2", rate=0.40)
# Everything else in your codebase stays the same

Step 4: Monitor in production

if client.stats.calls % 1000 == 0:
    logger.info(
        "pcb savings: calls=%d saved=%d tokens (%.1f%%) est_monthly=$%.0f",
        client.stats.calls,
        client.stats.tokens_saved,
        client.stats.reduction_pct,
        client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000),
    )

CLI Reference

pcb run - benchmark

Options:
  -c, --compressor TEXT       Compressor to include (repeat for multiple). Default: all five.
  -t, --task TEXT             Task type: rag, summarization, coding (repeat for multiple).
  -n, --max-samples INT       Max samples per task.
  -r, --rate FLOAT            Target compression rate 0.0–1.0. Default: 0.5
  -d, --data-dir PATH         Directory with *_samples.jsonl files.
  -o, --output PATH           Save report as .json, .csv, or .html.
  -j, --llm-judge             Enable LLM-as-judge scoring via OpenRouter.
  -m, --judge-model TEXT      Model for LLM judge. Default: claude-sonnet-4-6.
      --openrouter-key TEXT   OpenRouter API key (or set OPENROUTER_API_KEY).
      --daily-tokens INT      Daily token volume for cost projection.
      --cost-model TEXT       Model name for cost lookup (e.g. claude-opus-4-7).
      --token-price FLOAT     Manual price override in $/1M tokens.

pcb compress - compress text

Arguments:
  [INPUT_FILE]                File to compress. Reads stdin if omitted.

Options:
  -c, --compressor TEXT       Algorithm. Default: tfidf.
  -r, --rate FLOAT            Fraction to remove. Default: 0.45.
  -o, --output PATH           Write to file instead of stdout.
  -s, --stats                 Print token stats to stderr.

Other commands

pcb list-compressors          # Show all algorithms
pcb list-models               # Show 75+ supported LLM judge models
pcb show-schema rag           # Show JSONL schema for a task type

Output Formats

JSON - full detail per sample

pcb run --output results.json

CSV - one row per compressor × task

pcb run --output results.csv

Columns: compressor, task, avg_token_reduction_pct, avg_quality_score, avg_quality_drop_pct, avg_llm_score, avg_llm_drop_pct, avg_latency_ms, num_samples
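Because each CSV row is one compressor × task pair, the report is easy to post-process. A small sketch using only the column names listed above (the two sample rows are illustrative):

```python
import csv
import io

# Two illustrative rows in the report's shape
report = io.StringIO(
    "compressor,task,avg_token_reduction_pct\n"
    "tfidf,rag,40.1\n"
    "llmlingua2,rag,45.0\n"
)

# Rank compressors by token reduction, highest first
rows = sorted(
    csv.DictReader(report),
    key=lambda r: float(r["avg_token_reduction_pct"]),
    reverse=True,
)
print([r["compressor"] for r in rows])  # ['llmlingua2', 'tfidf']
```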

HTML - shareable visual report

pcb run --output results.html
# Open in any browser - Chart.js scatter plots, dark theme, Pareto highlights

When NOT to Use Compression

Short prompts (< 200 tokens): PCB skips these automatically; the overhead exceeds the savings.
Cheap models (< $0.50/1M): DeepSeek, Gemini Flash, GPT-4.1-mini - savings too small.
High-precision tasks: Legal review, medical diagnosis - verify your quality floor with --llm-judge first.
Output-bottlenecked workloads: Compression only affects input tokens.

Project Structure

src/pcb/
├── cli.py                      # Typer CLI - all commands
├── config.py                   # Pydantic config and model pricing table
├── runner.py                   # Benchmark orchestration + BenchmarkReport
├── mcp_server.py               # FastMCP server for Claude Code / Codex
├── compressors/
│   ├── tfidf.py                # TF-IDF sentence scoring
│   ├── selective_context.py    # Greedy token-budget selection
│   ├── llmlingua.py            # Sentence-level coarse pruning
│   └── no_compression.py       # Passthrough baseline
├── tasks/
│   ├── rag.py                  # F1/EM/context-recall evaluator
│   ├── summarization.py        # ROUGE-L evaluator
│   └── coding.py               # BM25 + identifier preservation
├── evaluators/
│   └── llm_judge.py            # OpenRouter LLM-as-judge (75+ models)
├── reporters/
│   ├── terminal.py             # Rich terminal tables
│   ├── json_reporter.py        # JSON output
│   ├── csv_reporter.py         # CSV output
│   └── html_reporter.py        # Chart.js HTML report
├── middleware/
│   ├── anthropic_client.py     # CompressingAnthropic drop-in wrapper
│   └── openai_client.py        # CompressingOpenAI drop-in wrapper
└── data/
    ├── rag_samples.jsonl        # 20 real-world factual passages (400–450 tokens)
    ├── summarization_samples.jsonl  # 10 real news-style articles
    └── coding_samples.jsonl     # 10 real Python code contexts (370–800 tokens)

How I Built This Using NEO

This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

I described the problem at a high level: a tool that benchmarks multiple prompt compression algorithms against real workloads, scores quality loss empirically, projects actual dollar savings at a given token volume, and makes it trivially easy to deploy the winning algorithm into an existing Anthropic or OpenAI codebase.

NEO built the entire thing autonomously: the Typer CLI with all commands and flags, all five compressor implementations, the F1/ROUGE-L/BM25 task evaluators, the OpenRouter LLM-as-judge with support for 75+ models, the cost projection engine with the model pricing table, the CompressingAnthropic and CompressingOpenAI drop-in wrappers, the FastMCP server with four tools, the JSON/CSV/HTML reporters, and the three bundled sample datasets - 20 RAG passages, 10 summarization articles, and 10 coding contexts.

How You Can Use and Extend This With NEO

Use it before committing to a compression strategy.
Before wiring any compressor into your production stack, run pcb on a sample of your actual prompts. The benchmark tells you which algorithm preserves the most quality at your target compression rate - specific to your data, not a generic recommendation.

Use it to justify the cost of compression infrastructure.
The cost projection output gives you monthly and annual savings at your actual token volume and model pricing. This is the number you need to make a case for adding compression to your pipeline, not a rough estimate but a measured projection against your workload.

Use the MCP tools inside Claude Code sessions.
With the MCP server connected, you can ask Claude to compress a context, estimate savings, or recommend a compressor without leaving your coding environment. This makes compression a natural part of the agent workflow rather than a separate offline step.

Extend it with additional compressors.
The four compressors share a common interface in src/pcb/compressors/. A new algorithm - semantic chunking, abstractive summarization, or a custom retrieval-based approach - slots in as a new file in that directory and appears automatically in pcb run, pcb compress, and the MCP recommend tool without touching any other part of the codebase.
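As an illustration only - the class and method names below are hypothetical, so mirror the actual interface you find in src/pcb/compressors/ - a new compressor could be as small as:

```python
# Hypothetical sketch; check src/pcb/compressors/ for the real base interface.
class FirstSentencesCompressor:
    """Keep leading sentences until the token budget is spent."""

    name = "first_sentences"

    def compress(self, text: str, rate: float = 0.5) -> str:
        # rate = fraction of (whitespace-split) tokens to remove
        budget = int(len(text.split()) * (1 - rate))
        kept, used = [], 0
        for sentence in text.split(". "):
            n = len(sentence.split())
            if used + n > budget:
                break
            kept.append(sentence)
            used += n
        return ". ".join(kept)

print(FirstSentencesCompressor().compress(
    "One two three. Four five six. Seven eight nine.", rate=0.5))
```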

Final Notes

Most teams discover they are over-spending on input tokens only after the bill arrives. pcb gives you the benchmark data to make an informed decision before committing - which algorithm, at what rate, for which task type - and the deployment tooling to act on it immediately.

The code is at https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code
