Title: "Your RAG Pipeline Wastes 64% of Tokens on Documents You Already Sent — Here's the Fix"
Tags: #LLM #OpenAI #RAG #TokenOptimization #VSCode #DevTools
We tested 9,300 real documents across 4 categories: RAG chunks, pull requests, emails, and support tickets.
The results were painful:
- RAG documents: 64% redundancy (your retriever keeps fetching the same chunks)
- Pull requests: 64% redundancy (similar diffs, repeated file contexts)
- Emails: 62% redundancy (reply chains, signatures, boilerplate)
- Support tickets: 26% redundancy (templates, repeated issue descriptions)
On average, 44% of tokens you send to LLM APIs are content you've already sent before.
You're paying for the same information twice. Sometimes three times. Sometimes ten.
## Why existing solutions don't fix this
Prompt caching (OpenAI, Anthropic) sounds like the answer. But in production agentic workflows — LangChain chains, CrewAI agents, AutoGen pipelines — the cache hit rate drops below 20%. Why? Because every request carries different tool outputs, different retrieved documents, and different conversation state. The prefix changes every time. Cache miss. Full price.
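The failure mode is easy to see in miniature. The sketch below is a toy model of prefix-based caching, not any provider's actual implementation: the cache key is derived from the leading bytes of the prompt, so a single changed tool output early in the request invalidates the key even when a retrieved chunk later in the prompt is identical across requests.

```python
import hashlib

def prefix_cache_key(prompt: str, prefix_len: int = 1024) -> str:
    """Toy model of prefix caching: key on the leading bytes of the prompt."""
    return hashlib.sha256(prompt[:prefix_len].encode("utf-8")).hexdigest()

base = "System: you are a helpful agent.\n"
req1 = base + "Tool output: {'temp': 21}\n" + "Retrieved: chunk_A\n"
req2 = base + "Tool output: {'temp': 22}\n" + "Retrieved: chunk_A\n"

# chunk_A is identical in both requests, but the tool output diverges
# first, so the prefixes (and therefore the cache keys) never match.
print(prefix_cache_key(req1) == prefix_cache_key(req2))  # False
```

This is why prompt-level caching and content-level deduplication are different problems: the repeated content sits *inside* a prefix that never repeats.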
Context compression (LLMLingua, Selective Context) takes a different approach: it removes "unimportant" tokens from your prompts using a trained model. The problem? It modifies your prompts. If you've spent weeks tuning your RAG template, compression will change your carefully crafted words. And the quality impact is unpredictable — sometimes it removes tokens that matter.
Custom dedup scripts work for one pipeline. Then you add another. And another. Each needs its own logic. Each breaks when document formats change. A senior developer spending 2 hours on dedup costs more than a year of TokenSaver.
## How TokenSaver works (the engineering)
TokenSaver uses content fingerprinting — not prompt-level caching, not compression.
- Every document/chunk that passes through your LLM pipeline gets a content fingerprint (fast hash, 0.6ms)
- Before sending to the API, TokenSaver checks: "Have I seen this exact content before in this session?"
- If yes: filters it out. You don't pay for it. The LLM doesn't see redundant context.
- If no: passes it through. The LLM sees everything unique.
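The steps above amount to a session-scoped, hash-based filter. Here is a minimal sketch of that idea (the class name and SHA-256 choice are illustrative assumptions, not TokenSaver's internals):

```python
import hashlib

class SessionDeduper:
    """Content-level fingerprinting sketch: exact-duplicate chunks are
    dropped within a session; anything unseen passes through unchanged."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def _fingerprint(self, text: str) -> str:
        # Fast hash of the content itself, independent of prompt structure.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def filter(self, chunks: list[str]) -> list[str]:
        unique = []
        for chunk in chunks:
            fp = self._fingerprint(chunk)
            if fp not in self._seen:  # never seen this exact content
                self._seen.add(fp)
                unique.append(chunk)
        return unique

dedup = SessionDeduper()
print(dedup.filter(["chunk A", "chunk B", "chunk A"]))  # ['chunk A', 'chunk B']
print(dedup.filter(["chunk A", "chunk C"]))             # ['chunk C']
```

Note that the second call still filters `"chunk A"`: the fingerprint set persists for the whole session, so a chunk re-retrieved on a later request costs nothing.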
Key engineering decisions:
- Content-level, not prompt-level: Unlike caching, we fingerprint the content inside the prompt, not the prompt structure. Different prompts with the same RAG chunk? Caught.
- 100% recall guarantee: We only filter exact duplicates. If even one character differs, it passes through. Zero information loss.
- 0.6ms decision time: Hash comparison, not model inference. Negligible latency.
- Provider-agnostic: Works with OpenAI, Anthropic, Google, Mistral, local models — anything that accepts text.
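The 100% recall guarantee follows directly from using exact-match fingerprints. A quick self-contained illustration (again a sketch, not TokenSaver's code): two chunks that differ by a single character produce different fingerprints, so both pass through.

```python
import hashlib

seen: set[str] = set()

def passes(chunk: str) -> bool:
    """Exact-match filtering: a chunk is dropped only if the identical
    bytes were already sent this session."""
    fp = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    if fp in seen:
        return False
    seen.add(fp)
    return True

print(passes("Invoice total: $100"))  # True  (first occurrence)
print(passes("Invoice total: $100"))  # False (exact duplicate, filtered)
print(passes("Invoice total: $101"))  # True  (one character differs)
```

Because no model decides what is "important", there is no threshold to tune and no risk of silently dropping near-duplicate content that actually differs.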
## Benchmarks (real data, not synthetic)
| Document Type | Documents Tested | Avg Redundancy | Tokens Saved |
|---|---|---|---|
| RAG chunks | 3,200 | 64% | ~2M tokens |
| Pull requests | 2,800 | 64% | ~1.8M tokens |
| Emails | 2,100 | 62% | ~1.3M tokens |
| Support tickets | 1,200 | 26% | ~0.3M tokens |
| Total | 9,300 | 44% avg | ~5.4M tokens |
All tests run on real production documents, not generated benchmarks.
## Setup (30 seconds)
- Install TokenSaver from VS Code Marketplace
- Press Ctrl+Shift+T to activate
- That's it. No configuration. No API keys. No prompt changes.
TokenSaver sits between your code and the LLM API. It filters before sending. Your existing code, prompts, and workflows stay exactly the same.
## Comparison table
| Feature | Prompt Caching | Compression | Manual Scripts | TokenSaver |
|---|---|---|---|---|
| Setup time | 0 (built-in) | 1-4 hours | 2-8 hours | 30 sec |
| Hit rate / reduction | <20% (agents) | 30-70% | Varies | 44% avg |
| Modifies prompts | No | Yes | No | No |
| Tuning required | Yes (prefix) | Yes (threshold) | Yes (per pipeline) | None |
| Provider-agnostic | No | Yes | Yes | Yes |
| Information loss risk | None | Moderate | Low | None |
| Latency added | 0ms | 50-500ms | Varies | 0.6ms |
| Recall guarantee | N/A | No | No | 100% |
## Try it free for 14 days
TokenSaver Solo: €9/month after trial. No credit card required.
TokenSaver Team (5+ seats): €29/month — shared fingerprint database across your team.
No API keys. No configuration files. No prompt restructuring. No training.
[See your savings before paying — Start free trial]
Built by Lukasz Trzeciak, EurekaIntelligent.dev (coming soon). We optimize AI costs so you can focus on building.