Most LLM cost comes from input tokens: the long documents, codebases, or conversation histories you send as context. There are several prompt compression algorithms available, but nobody tells you which one actually works best for your specific workload, or how much quality you are trading for the savings.
Prompt Compression Benchmarker (PCB) answers both questions. It benchmarks every major prompt compression algorithm against your actual data, shows you exactly how much quality each one drops, projects the real dollar savings at your call volume, and then gives you a one-line wrapper to deploy the winner as a drop-in replacement around your Anthropic or OpenAI client.
What It Does
PCB answers two questions:
Which compression algorithm preserves the most quality at a given token budget?
Benchmark mode runs all compressors against your data and scores each one with task-specific quality metrics and an optional LLM-as-judge.
How much money does that save at your actual call volume?
Cost projection mode takes your daily token volume and model pricing and gives you monthly and annual savings per compressor.
Then it gives you a one-line wrapper to deploy the answer.
Installation
# From source
git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install .
# From PyPI (once published)
pip install prompt-compression-benchmarker
Requires Python 3.9+. No GPU required. Core dependencies: tiktoken, scikit-learn, rouge-score, rank-bm25, typer, rich.
# Verify
pcb --help
# Optional extras
pip install "prompt-compression-benchmarker[anthropic]" # SDK wrapper for Anthropic
pip install "prompt-compression-benchmarker[openai]" # SDK wrapper for OpenAI
pip install "prompt-compression-benchmarker[mcp]" # MCP server for Claude Code
pip install "prompt-compression-benchmarker[all]" # Everything
Quick Start
1. Run the benchmark
The simplest run uses bundled sample data - no setup needed:
# All compressors × all task types, bundled sample data - no setup needed
pcb run
# Target a specific task with cost projection
pcb run --task rag --max-samples 20 --daily-tokens 2000000 --cost-model claude-sonnet-4-6
# Add LLM-as-judge for deeper quality scoring (requires OpenRouter API key)
export OPENROUTER_API_KEY=sk-or-...
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10
Here is what a real benchmark run looks like - RAG task, 3M tokens/day, claude-sonnet-4-6 pricing:
pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6
RAG
Compressor          Token Reduction %   Proxy Score   Proxy Drop %   Latency ms
no_compression      0.0%                0.2983        0.0%           0.3
tfidf ★             40.1%               0.2519        +16.5%         12.1
selective_context   56.9%               0.1874        +34.4%         8.3
llmlingua           53.6%               0.2182        +28.1%         9.7
llmlingua2          45.0%               0.2204        +27.3%         11.2
Monthly Cost Projection   claude-sonnet-4-6 · $3/1M · 3M tokens/day
tfidf               38.3% reduction   $103/mo saved   $1,240/yr
selective_context   57.5% reduction   $155/mo saved   $1,863/yr
llmlingua2          43.6% reduction   $118/mo saved   $1,413/yr
The ★ marks the Pareto-optimal compressor - best token savings given a quality drop below 20%.
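The projection is simple arithmetic over your daily volume and input price. Here is a back-of-the-envelope check of the table above - this is just the math, not PCB's internal cost engine, and it assumes a 30-day month:

# Reproduce the cost projection by hand (assumes a 30-day month)
daily_tokens = 3_000_000          # input tokens per day
price_per_million = 3.0           # claude-sonnet-4-6 input price used above

def monthly_savings(reduction_pct: float) -> float:
    monthly_cost = daily_tokens * 30 / 1_000_000 * price_per_million   # $270/mo baseline
    return monthly_cost * reduction_pct / 100

print(round(monthly_savings(38.3)))   # ~103, the tfidf row
print(round(monthly_savings(57.5)))   # ~155, the selective_context row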
2. Compress a file directly
# Compress from a file or stdin, output to stdout
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats
# Pipe it into any script
cat rag_context.txt | pcb compress | python send_to_claude.py
# Save compressed output
pcb compress context.txt -o compressed.txt --stats
3. Deploy the winner
Once you know which compressor wins on your data, deploying it is one line:
from pcb.middleware import CompressingAnthropic
# Drop-in replacement for anthropic.Anthropic()
client = CompressingAnthropic(compressor="llmlingua2", rate=0.45)
response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)
print(client.stats) # CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)
Everything else in your codebase stays the same.
Understanding the Results
Benchmark table columns
Token Reduction % - share of input tokens the compressor removed
Proxy Score - task-specific quality metric (F1 for RAG, ROUGE-L for summarization, BM25 for coding)
Proxy Drop % - quality lost relative to the no_compression baseline
Latency ms - average compression time per sample
Quality drop color coding
cyan = negative drop (compression improved the metric - noise removal)
green = < 5% drop (effectively lossless)
yellow = 5–15% drop (acceptable for most use cases)
red = ≥ 15% drop (significant information loss)
Why use the LLM judge?
The proxy score (F1, ROUGE, BM25) is fast and free but mechanical. The LLM judge calls a real model to evaluate whether the compressed context still supports the correct answer, which reveals things proxy metrics miss.
Here is a real example showing why this matters - RAG task, 5 samples, LLM judge = claude-sonnet-4-6:
Compressor          Proxy Drop %   LLM Score   LLM Drop %
no_compression      0.0%           0.94        0.0%
tfidf               +23.7%         0.40        -57.4%   ← proxy hid the severity
llmlingua2          +29.9%         0.70        -25.5%   ← much better than proxy suggested
selective_context   +37.6%         0.14        -85.1%   ← dangerous despite high compression
Rule of thumb: use proxy scores to compare many configs quickly, then LLM-judge the top 2–3 before deploying.
Choosing a Compressor
RAG: llmlingua2 at rate 0.40 - preserves named entities and key facts better than sentence-dropping
Summarization: llmlingua at rate 0.45 - sentence-level pruning maintains structural coverage
Code contexts: llmlingua2 at rate 0.35 - keeps imports, identifiers, type names; removes boilerplate
General chat: tfidf at rate 0.40 - safe default, fast, reliable
Target compression rate
--rate is the fraction of tokens to remove. 0.45 means keep 55% of tokens.
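If you want to sanity-check a rate against your own text, you can count tokens before and after with tiktoken (already a core dependency). This is an external check on pcb compress output, not part of PCB's API, and the exact percentage may differ slightly if PCB's internal tokenizer doesn't match cl100k_base:

import subprocess
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
original = open("context.txt").read()

# Run pcb compress and capture the compressed text from stdout
compressed = subprocess.run(
    ["pcb", "compress", "context.txt", "--compressor", "llmlingua2", "--rate", "0.45"],
    capture_output=True, text=True, check=True,
).stdout

before, after = len(enc.encode(original)), len(enc.encode(compressed))
print(f"kept {after / before:.0%} of tokens")   # expect roughly 55% at rate 0.45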
Cost Savings - The Real Numbers
Compression saves money on input tokens only. Output tokens are unchanged.
At 3M input tokens per day, compression is most valuable on premium models. On DeepSeek or GPT-4.1-mini the savings are too small to justify the complexity; use it only if you're hitting context window limits.
# Check your own workload
pcb run --max-samples 10 --daily-tokens 5000000 --cost-model claude-opus-4-7
Deploy: Python SDK Wrappers
Anthropic
from pcb.middleware import CompressingAnthropic
client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    verbose=True,
)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)
# Cumulative stats
print(client.stats)
# CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)
# Estimate monthly savings
print(client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000))
# 588.0
OpenAI (Chat Completions + Codex Responses API)
from pcb.middleware import CompressingOpenAI
client = CompressingOpenAI(compressor="tfidf", rate=0.40)
# Chat Completions API โ unchanged
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": long_context}]
)

# Responses API (Codex / o-series)
response = client.responses.create(
    model="codex-mini-latest",
    input=long_codebase_context,
    reasoning={"effort": "high"}
)
What gets compressed
By default, only "user" role messages over 100 tokens are compressed. System prompts and assistant history are passed through unchanged.
client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    compress_roles=("user", "system"),  # also compress system prompt
)
Claude Code Integration (MCP)
PCB ships an MCP server that adds four compression tools directly into Claude Code conversations.
Setup
# Add to the current project
claude mcp add pcb -s project -- python -m pcb.mcp_server
# Or add to all your projects
claude mcp add pcb -s user -- python -m pcb.mcp_server
Or drop .mcp.json into any project root:
{
  "mcpServers": {
    "pcb": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "pcb.mcp_server"]
    }
  }
}
Available tools
Once connected, you can ask Claude:
- "Compress this RAG context before sending it to the model"
- "Estimate how much I'd save compressing my prompts on claude-opus-4-7 at 2000 calls/day"
- "What compressor should I use for my coding assistant at 90% quality floor?"
OpenAI Codex (Agents SDK)
from agents import Agent, Runner
from agents.mcp import MCPServerStdio
import asyncio
async def main():
    async with MCPServerStdio(
        name="pcb",
        params={"command": "python", "args": ["-m", "pcb.mcp_server"]},
    ) as pcb_server:
        agent = Agent(
            name="CostAwareAssistant",
            model="codex-mini-latest",
            mcp_servers=[pcb_server],
        )
        result = await Runner.run(
            agent,
            "Compress this codebase context and estimate savings: " + codebase_context
        )
        print(result.final_output)

asyncio.run(main())
Bring Your Own Data
Data is JSONL - one JSON object per line. Check the schema for each task type:
pcb show-schema rag
pcb show-schema summarization
pcb show-schema coding
RAG schema
{
  "id": "my_001",
  "context": "<passage 300–1500 tokens>",
  "question": "<specific question requiring the full context>",
  "answer": "<short, precise answer string>"
}
Summarization schema
{
  "id": "my_001",
  "article": "<article or document 300–800 tokens>",
  "summary": "<2–3 sentence reference summary>"
}
Coding schema
{
  "id": "my_001",
  "context": "<imports, helpers, type definitions - 400–800 tokens>",
  "docstring": "<description of the function to implement>",
  "solution": "<correct Python implementation>"
}
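Here is a minimal sketch of producing a rag_samples.jsonl file from your own records - the my_records list and its field names are placeholders; only the output fields follow the RAG schema above:

import json

# Placeholder source records - replace with your own retrieval data
my_records = [
    {"passage": "...", "question": "...", "gold_answer": "..."},
]

with open("my_data/rag_samples.jsonl", "w") as f:
    for i, rec in enumerate(my_records, start=1):
        sample = {
            "id": f"my_{i:03d}",
            "context": rec["passage"],
            "question": rec["question"],
            "answer": rec["gold_answer"],
        }
        f.write(json.dumps(sample) + "\n")

Then point --data-dir at that directory when running pcb run --task rag.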
Running on your data
pcb run --data-dir ./my_data --task rag --max-samples 50
# Compare specific compressors
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2
# Export results
pcb run --data-dir ./my_data --output results.json
pcb run --data-dir ./my_data --output results.csv
pcb run --data-dir ./my_data --output results.html
Workflow: Benchmark to Production
Here is the full path from benchmarking to deploying a compressor in production.
Step 1: Benchmark on your actual data
pcb run --data-dir ./my_data --max-samples 50 --task rag \
--daily-tokens 2000000 --cost-model claude-opus-4-7 \
--output benchmark.json
Step 2: LLM-judge the top candidates
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2 \
--llm-judge --judge-model claude-sonnet-4-6 --max-samples 30
Step 3: Deploy the winner
from pcb.middleware import CompressingAnthropic
client = CompressingAnthropic(compressor="llmlingua2", rate=0.40)
# Everything else in your codebase stays the same
Step 4: Monitor in production
import logging

logger = logging.getLogger(__name__)

# Log cumulative savings every 1,000 calls
if client.stats.calls % 1000 == 0:
    logger.info(
        "pcb savings: calls=%d saved=%d tokens (%.1f%%) est_monthly=$%.0f",
        client.stats.calls,
        client.stats.tokens_saved,
        client.stats.reduction_pct,
        client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000),
    )
CLI Reference
PCB run - benchmark
Options:
-c, --compressor TEXT Compressor to include (repeat for multiple). Default: all five.
-t, --task TEXT Task type: rag, summarization, coding (repeat for multiple).
-n, --max-samples INT Max samples per task.
-r, --rate FLOAT Target compression rate 0.0–1.0. Default: 0.5
-d, --data-dir PATH Directory with *_samples.jsonl files.
-o, --output PATH Save report as .json, .csv, or .html.
-j, --llm-judge Enable LLM-as-judge scoring via OpenRouter.
-m, --judge-model TEXT Model for LLM judge. Default: claude-sonnet-4-6.
--openrouter-key TEXT OpenRouter API key (or set OPENROUTER_API_KEY).
--daily-tokens INT Daily token volume for cost projection.
--cost-model TEXT Model name for cost lookup (e.g. claude-opus-4-7).
--token-price FLOAT Manual price override in $/1M tokens.
PCB compress - compress text
Arguments:
[INPUT_FILE] File to compress. Reads stdin if omitted.
Options:
-c, --compressor TEXT Algorithm. Default: tfidf.
-r, --rate FLOAT Fraction to remove. Default: 0.45.
-o, --output PATH Write to file instead of stdout.
-s, --stats Print token stats to stderr.
Other commands
pcb list-compressors # Show all algorithms
pcb list-models # Show 75+ supported LLM judge models
pcb show-schema rag # Show JSONL schema for a task type
Output Formats
JSON - full detail per sample
pcb run --output results.json
CSV - one row per compressor × task
pcb run --output results.csv
Columns: compressor, task, avg_token_reduction_pct, avg_quality_score, avg_quality_drop_pct, avg_llm_score, avg_llm_drop_pct, avg_latency_ms, num_samples
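If you want to pick a winner programmatically from that CSV, one option is a small pandas script (pandas is not a PCB dependency; the column names are the ones listed above, and the 20% drop threshold mirrors the Pareto rule from the benchmark table):

import pandas as pd

df = pd.read_csv("results.csv")

# Keep compressors whose proxy quality drop stays under 20%, then take the
# one with the highest token reduction - a rough Pareto pick for the RAG task
rag = df[df["task"] == "rag"]
candidates = rag[rag["avg_quality_drop_pct"] < 20]
best = candidates.sort_values("avg_token_reduction_pct", ascending=False).head(1)
print(best[["compressor", "avg_token_reduction_pct", "avg_quality_drop_pct"]])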
HTML - shareable visual report
pcb run --output results.html
# Open in any browser - Chart.js scatter plots, dark theme, Pareto highlights
When NOT to Use Compression
Short prompts (< 200 tokens): PCB skips these automatically; the overhead exceeds the savings.
Cheap models (< $0.50/1M): DeepSeek, Gemini Flash, GPT-4.1-mini - savings too small.
High-precision tasks: Legal review, medical diagnosis - verify your quality floor with --llm-judge first.
Output-bottlenecked workloads: Compression only affects input tokens.
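If you suspect your workload is output-bottlenecked, a quick cost-split check makes it obvious. All numbers below are placeholders - plug in your own volumes and prices:

# Rough input-vs-output spend split - placeholder numbers, not real pricing data
daily_input_tokens = 3_000_000
daily_output_tokens = 300_000
input_price_per_million = 3.0      # e.g. a mid-tier model's input price
output_price_per_million = 15.0    # output tokens are usually several times pricier

input_cost = daily_input_tokens / 1e6 * input_price_per_million
output_cost = daily_output_tokens / 1e6 * output_price_per_million
print(f"input share of spend: {input_cost / (input_cost + output_cost):.0%}")
# If the input share is small, compressing prompts won't move the bill much.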
Project Structure
src/pcb/
├── cli.py                     # Typer CLI - all commands
├── config.py                  # Pydantic config and model pricing table
├── runner.py                  # Benchmark orchestration + BenchmarkReport
├── mcp_server.py              # FastMCP server for Claude Code / Codex
├── compressors/
│   ├── tfidf.py               # TF-IDF sentence scoring
│   ├── selective_context.py   # Greedy token-budget selection
│   ├── llmlingua.py           # Sentence-level coarse pruning
│   └── no_compression.py      # Passthrough baseline
├── tasks/
│   ├── rag.py                 # F1/EM/context-recall evaluator
│   ├── summarization.py       # ROUGE-L evaluator
│   └── coding.py              # BM25 + identifier preservation
├── evaluators/
│   └── llm_judge.py           # OpenRouter LLM-as-judge (75+ models)
├── reporters/
│   ├── terminal.py            # Rich terminal tables
│   ├── json_reporter.py       # JSON output
│   ├── csv_reporter.py        # CSV output
│   └── html_reporter.py       # Chart.js HTML report
├── middleware/
│   ├── anthropic_client.py    # CompressingAnthropic drop-in wrapper
│   └── openai_client.py       # CompressingOpenAI drop-in wrapper
└── data/
    ├── rag_samples.jsonl            # 20 real-world factual passages (400–450 tokens)
    ├── summarization_samples.jsonl  # 10 real news-style articles
    └── coding_samples.jsonl         # 10 real Python code contexts (370–800 tokens)
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
I described the problem at a high level: a tool that benchmarks multiple prompt compression algorithms against real workloads, scores quality loss empirically, projects actual dollar savings at a given token volume, and makes it trivially easy to deploy the winning algorithm into an existing Anthropic or OpenAI codebase.
NEO built the entire thing autonomously: the Typer CLI with all commands and flags, all five compressor implementations, the F1/ROUGE-L/BM25 task evaluators, the OpenRouter LLM-as-judge with support for 75+ models, the cost projection engine with the model pricing table, the CompressingAnthropic and CompressingOpenAI drop-in wrappers, the FastMCP server with four tools, the JSON/CSV/HTML reporters, and the three bundled sample datasets - 20 RAG passages, 10 summarization articles, and 10 coding contexts.
How You Can Use and Extend This With NEO
Use it before committing to a compression strategy.
Before wiring any compressor into your production stack, run pcb on a sample of your actual prompts. The benchmark tells you which algorithm preserves the most quality at your target compression rate on your data, not a generic recommendation.
Use it to justify the cost of compression infrastructure.
The cost projection output gives you monthly and annual savings at your actual token volume and model pricing. This is the number you need to make the case for adding compression to your pipeline: not a rough estimate, but a measured projection against your workload.
Use the MCP tools inside Claude Code sessions.
With the MCP server connected, you can ask Claude to compress a context, estimate savings, or recommend a compressor without leaving your coding environment. This makes compression a natural part of the agent workflow rather than a separate offline step.
Extend it with additional compressors.
The four compressors share a common interface in src/pcb/compressors/. A new algorithm - semantic chunking, abstractive summarization, or a custom retrieval-based approach - slots in as a new file in that directory and appears automatically in pcb run, pcb compress, and the MCP recommend tool without touching any other part of the codebase; a sketch follows below.
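Here is a minimal sketch of what a new compressor could look like - the file name, class name, and compress() signature are assumptions for illustration, not PCB's actual interface; check the existing files in src/pcb/compressors/ before copying this:

# src/pcb/compressors/keyword_density.py - hypothetical example
# Assumes compressors expose a name and a compress(text, rate) method;
# the real base class and registration mechanism may differ.

class KeywordDensityCompressor:
    """Keep the sentences with the highest density of long, content-bearing words."""

    name = "keyword_density"

    def compress(self, text: str, rate: float = 0.5) -> str:
        sentences = [s.strip() for s in text.split(".") if s.strip()]

        def score(sentence: str) -> float:
            words = sentence.split()
            return sum(len(w) > 6 for w in words) / max(len(words), 1)

        # rate is the fraction of tokens to remove, so keep roughly (1 - rate)
        keep_n = max(1, round(len(sentences) * (1 - rate)))
        kept = set(sorted(sentences, key=score, reverse=True)[:keep_n])
        # Preserve the original sentence order in the output
        return ". ".join(s for s in sentences if s in kept) + "."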
Final Notes
Most teams discover they are over-spending on input tokens only after the bill arrives. pcb gives you the benchmark data to make an informed decision before committing - which algorithm, at what rate, for which task type - and the deployment tooling to act on it immediately.
The code is at https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code




