Łukasz Trzeciak
"Your RAG Pipeline Wastes 64% of Tokens on Documents You Already Sent — Here's the Fix"

Title: "Your RAG Pipeline Wastes 64% of Tokens on Documents You Already Sent — Here's the Fix"

Tags: #LLM #OpenAI #RAG #TokenOptimization #VSCode #DevTools


We tested 9,300 real documents across 4 categories: RAG chunks, pull requests, emails, and support tickets.

The results were painful:

  • RAG documents: 64% redundancy (your retriever keeps fetching the same chunks)
  • Pull requests: 64% redundancy (similar diffs, repeated file contexts)
  • Emails: 62% redundancy (reply chains, signatures, boilerplate)
  • Support tickets: 26% redundancy (templates, repeated issue descriptions)

On average, 44% of tokens you send to LLM APIs are content you've already sent before.

You're paying for the same information twice. Sometimes three times. Sometimes ten.
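To make that concrete, here is a back-of-the-envelope calculation of what 44% redundancy costs at a hypothetical volume and price. The token price and monthly volume below are illustrative assumptions, not TokenSaver benchmark figures — substitute your provider's current input-token pricing.

```python
# Rough cost of redundant tokens, under assumed pricing.
PRICE_PER_1M_INPUT = 2.50     # USD per 1M input tokens (assumed for illustration)
monthly_tokens = 100_000_000  # 100M input tokens/month (assumed)
redundancy = 0.44             # average redundancy measured above

wasted_tokens = monthly_tokens * redundancy
wasted_cost = wasted_tokens / 1_000_000 * PRICE_PER_1M_INPUT
print(f"Redundant tokens/month: {wasted_tokens:,.0f}")  # 44,000,000
print(f"Wasted spend/month: ${wasted_cost:.2f}")        # $110.00
```

At higher volumes the waste scales linearly: ten times the traffic, ten times the redundant spend.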

Why existing solutions don't fix this

Prompt caching (OpenAI, Anthropic) sounds like the answer. But in production agentic workflows — LangChain chains, CrewAI agents, AutoGen pipelines — the cache hit rate drops below 20%. Why? Because every request carries different tool outputs, different retrieved documents, and different conversation state. The prefix changes every time. Cache miss. Full price.
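The failure mode is easy to see in a toy model. The sketch below treats a prefix cache as a hash over the leading bytes of the prompt — a simplification of how providers actually implement block-level prefix caching, but it shows why a single changed tool output early in the context defeats it:

```python
import hashlib

def prefix_key(prompt: str, prefix_len: int = 1024) -> str:
    """Simplified model of a prefix cache key: hash of the leading bytes.
    Real providers cache fixed-size prefix blocks, but the principle holds."""
    return hashlib.sha256(prompt[:prefix_len].encode("utf-8")).hexdigest()

system = "You are a helpful agent.\n"
tool_a = "TOOL OUTPUT: weather=sunny\n"
tool_b = "TOOL OUTPUT: weather=rain\n"
question = "Summarize the context."

k1 = prefix_key(system + tool_a + question)
k2 = prefix_key(system + tool_b + question)
print(k1 == k2)  # False: one changed tool output, different prefix, cache miss
```

Because agentic frameworks interleave fresh tool output and retrieved documents into every request, the prefix almost never repeats byte-for-byte.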

Context compression (LLMLingua, Selective Context) takes a different approach: it removes "unimportant" tokens from your prompts using a trained model. The problem? It modifies your prompts. If you've spent weeks tuning your RAG template, compression will change your carefully crafted words. And the quality impact is unpredictable — sometimes it removes tokens that matter.

Custom dedup scripts work for one pipeline. Then you add another. And another. Each needs its own logic. Each breaks when document formats change. Two hours of a senior developer's time spent maintaining dedup scripts already costs more than a year of TokenSaver.

How TokenSaver works (the engineering)

TokenSaver uses content fingerprinting — not prompt-level caching, not compression.

  1. Every document/chunk that passes through your LLM pipeline gets a content fingerprint (fast hash, 0.6ms)
  2. Before sending to the API, TokenSaver checks: "Have I seen this exact content before in this session?"
  3. If yes: filters it out. You don't pay for it. The LLM doesn't see redundant context.
  4. If no: passes it through. The LLM sees everything unique.
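The four steps above can be sketched in a few lines. This is an illustration of the technique, not TokenSaver's actual implementation — the class name and hash choice are assumptions:

```python
import hashlib

class SessionDeduplicator:
    """Illustrative sketch: fingerprint each chunk, drop exact repeats
    within a session, pass everything unique through."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def _fingerprint(self, content: str) -> str:
        # Exact-match hash: any single-character difference changes the
        # digest, so only true duplicates are ever filtered.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def filter(self, chunks: list[str]) -> list[str]:
        unique = []
        for chunk in chunks:
            fp = self._fingerprint(chunk)
            if fp not in self._seen:   # step 2: seen before in this session?
                self._seen.add(fp)
                unique.append(chunk)   # step 4: unique content passes through
        return unique                  # step 3: duplicates silently dropped

dedup = SessionDeduplicator()
print(dedup.filter(["chunk A", "chunk B", "chunk A"]))  # ['chunk A', 'chunk B']
print(dedup.filter(["chunk B", "chunk C"]))             # ['chunk C']
```

Note that the session state persists across calls: "chunk B" is filtered on the second request because it was already sent in the first.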

Key engineering decisions:

  • Content-level, not prompt-level: Unlike caching, we fingerprint the content inside the prompt, not the prompt structure. Different prompts with the same RAG chunk? Caught.
  • 100% recall guarantee: We only filter exact duplicates. If even one character differs, it passes through. Zero information loss.
  • 0.6ms decision time: Hash comparison, not model inference. Negligible latency.
  • Provider-agnostic: Works with OpenAI, Anthropic, Google, Mistral, local models — anything that accepts text.
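The content-level vs. prompt-level distinction is worth spelling out. In the hedged sketch below, two structurally different prompts carry the same retrieved chunk: a prompt-level cache sees two distinct entries, while content-level fingerprinting sees one repeated chunk (the prompt representation here is an assumption for illustration):

```python
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

chunk = "Paris is the capital of France."
# Two different prompts embedding the same retrieved chunk:
prompt_1 = {"template": "QA", "chunks": [chunk]}
prompt_2 = {"template": "tutor", "chunks": [chunk]}

prompt_keys = {fingerprint(str(p)) for p in (prompt_1, prompt_2)}
chunk_keys = {fingerprint(c) for p in (prompt_1, prompt_2) for c in p["chunks"]}

print(len(prompt_keys))  # 2 -- a prompt-level cache sees two distinct entries
print(len(chunk_keys))   # 1 -- content fingerprinting catches the repeated chunk
```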

Benchmarks (real data, not synthetic)

| Document type | Documents tested | Avg redundancy | Tokens saved |
|---------------|-----------------:|---------------:|-------------:|
| RAG chunks | 3,200 | 64% | ~2M |
| Pull requests | 2,800 | 64% | ~1.8M |
| Emails | 2,100 | 62% | ~1.3M |
| Support tickets | 1,200 | 26% | ~0.3M |
| **Total** | **9,300** | **44% avg** | **~5.4M** |

All tests run on real production documents, not generated benchmarks.

Setup (30 seconds)

  1. Install TokenSaver from VS Code Marketplace
  2. Press Ctrl+Shift+T to activate
  3. That's it. No configuration. No API keys. No prompt changes.

TokenSaver sits between your code and the LLM API. It filters before sending. Your existing code, prompts, and workflows stay exactly the same.

Comparison table

| Feature | Prompt caching | Compression | Manual scripts | TokenSaver |
|---------|----------------|-------------|----------------|------------|
| Setup time | 0 (built-in) | 1-4 hours | 2-8 hours | 30 sec |
| Hit rate / reduction | <20% (agents) | 30-70% | Varies | 44% avg |
| Modifies prompts | No | Yes | No | No |
| Tuning required | Yes (prefix) | Yes (threshold) | Yes (per pipeline) | None |
| Provider-agnostic | No | Yes | Yes | Yes |
| Information-loss risk | None | Moderate | Low | None |
| Latency added | 0 ms | 50-500 ms | Varies | 0.6 ms |
| Recall guarantee | N/A | No | No | 100% |

Try it free for 14 days

TokenSaver Solo: €9/month after trial. No credit card required.
TokenSaver Team (5+ seats): €29/month — shared fingerprint database across your team.

No API keys. No configuration files. No prompt restructuring. No training.

[See your savings before paying — Start free trial]

Built by Łukasz Trzeciak · EurekaIntelligent.dev (coming soon). We optimize AI costs so you can focus on building.
