<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Łukasz Trzeciak</title>
    <description>The latest articles on DEV Community by Łukasz Trzeciak (@ukasz_trzeciak_fb0f46515).</description>
    <link>https://dev.to/ukasz_trzeciak_fb0f46515</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3846556%2F519ac04e-ecac-4d07-8878-0b16d0ac6ef6.jpg</url>
      <title>DEV Community: Łukasz Trzeciak</title>
      <link>https://dev.to/ukasz_trzeciak_fb0f46515</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ukasz_trzeciak_fb0f46515"/>
    <language>en</language>
    <item>
      <title>"Your RAG Pipeline Wastes 64% of Tokens on Documents You Already Sent — Here's the Fix"</title>
      <dc:creator>Łukasz Trzeciak</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:47:23 +0000</pubDate>
      <link>https://dev.to/ukasz_trzeciak_fb0f46515/your-rag-pipeline-wastes-64-of-tokens-on-documents-you-already-sent-heres-the-fix-hil</link>
      <guid>https://dev.to/ukasz_trzeciak_fb0f46515/your-rag-pipeline-wastes-64-of-tokens-on-documents-you-already-sent-heres-the-fix-hil</guid>
      <description>&lt;p&gt;We tested 9,300 real documents across 4 categories: RAG chunks, pull requests, emails, and support tickets.&lt;/p&gt;

&lt;p&gt;The results were painful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG documents&lt;/strong&gt;: 64% redundancy (your retriever keeps fetching the same chunks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull requests&lt;/strong&gt;: 64% redundancy (similar diffs, repeated file contexts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emails&lt;/strong&gt;: 62% redundancy (reply chains, signatures, boilerplate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support tickets&lt;/strong&gt;: 26% redundancy (templates, repeated issue descriptions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On average, &lt;strong&gt;44% of tokens you send to LLM APIs are content you've already sent before&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You're paying for the same information twice. Sometimes three times. Sometimes ten.&lt;/p&gt;

&lt;h3&gt;Why existing solutions don't fix this&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; (OpenAI, Anthropic) sounds like the answer. But in production agentic workflows — LangChain chains, CrewAI agents, AutoGen pipelines — the cache hit rate drops below 20%. Why? Because every request carries different tool outputs, different retrieved documents, and different conversation state. The prefix changes every time. Cache miss. Full price.&lt;/p&gt;
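
&lt;p&gt;A toy illustration with hypothetical prompts: two consecutive agent turns share only a short prefix, so the repeated document sits past the point where the cache stops matching:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Prefix caching matches on a shared *prefix*; everything after the
# first divergent character is billed at full price.
turn_1 = "\n".join([
    "SYSTEM: You are a support agent.",
    "TOOL OUTPUT: order #123 shipped",      # changes every turn
    "RETRIEVED DOC: Refund policy v2 ...",  # identical every turn
])
turn_2 = "\n".join([
    "SYSTEM: You are a support agent.",
    "TOOL OUTPUT: order #456 delayed",      # different tool output, so the
    "RETRIEVED DOC: Refund policy v2 ...",  # repeated doc sits past the split
])

# Find where the two prompts first diverge:
split = next((i for i, (a, b) in enumerate(zip(turn_1, turn_2)) if a != b),
             min(len(turn_1), len(turn_2)))
print(split, "of", len(turn_1), "chars are cacheable")
&lt;/code&gt;&lt;/pre&gt;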

&lt;p&gt;&lt;strong&gt;Context compression&lt;/strong&gt; (LLMLingua, Selective Context) takes a different approach: it removes "unimportant" tokens from your prompts using a trained model. The problem? It &lt;em&gt;modifies your prompts&lt;/em&gt;. If you've spent weeks tuning your RAG template, compression will change your carefully crafted words. And the quality impact is unpredictable — sometimes it removes tokens that matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom dedup scripts&lt;/strong&gt; work for one pipeline. Then you add another. And another. Each needs its own logic. Each breaks when document formats change. Two hours of a senior developer's time spent maintaining dedup scripts already costs more than a year of TokenSaver.&lt;/p&gt;

&lt;h3&gt;How TokenSaver works (the engineering)&lt;/h3&gt;

&lt;p&gt;TokenSaver uses &lt;strong&gt;content fingerprinting&lt;/strong&gt; — not prompt-level caching, not compression. A code sketch of the loop follows the steps below.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Every document/chunk&lt;/strong&gt; that passes through your LLM pipeline gets a content fingerprint (fast hash, 0.6ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before sending to the API&lt;/strong&gt;, TokenSaver checks: "Have I seen this exact content before in this session?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If yes&lt;/strong&gt;: filters it out. You don't pay for it. The LLM doesn't see redundant context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If no&lt;/strong&gt;: passes it through. The LLM sees everything unique.&lt;/li&gt;
&lt;/ol&gt;
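
&lt;p&gt;A minimal sketch of that loop. TokenSaver's internals aren't published, so the class name, API, and SHA-256 hash choice here are assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import hashlib

class DedupFilter:
    """Session-scoped exact-duplicate filter, mirroring the steps above.

    Hypothetical stand-in for TokenSaver's internals: fingerprint each
    chunk, drop it if the same bytes were already sent this session.
    """

    def __init__(self):
        self.seen = set()

    def fingerprint(self, text):
        # Fast content hash; the real hash function isn't published,
        # so SHA-256 is an assumption here.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def filter(self, chunks):
        unique = []
        for chunk in chunks:
            fp = self.fingerprint(chunk)
            if fp not in self.seen:    # even one changed character
                self.seen.add(fp)      # yields a new fingerprint, so
                unique.append(chunk)   # the chunk passes through
        return unique
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Exact-match hashing is what makes the recall guarantee below possible: nothing fuzzy is ever dropped.&lt;/p&gt;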

&lt;p&gt;Key engineering decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content-level, not prompt-level&lt;/strong&gt;: Unlike caching, we fingerprint the &lt;em&gt;content inside&lt;/em&gt; the prompt, not the prompt structure. Different prompts with the same RAG chunk? Caught (see the example after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% recall guarantee&lt;/strong&gt;: We only filter exact duplicates. If even one character differs, it passes through. Zero information loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.6ms decision time&lt;/strong&gt;: Hash comparison, not model inference. Negligible latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider-agnostic&lt;/strong&gt;: Works with OpenAI, Anthropic, Google, Mistral, local models — anything that accepts text.&lt;/li&gt;
&lt;/ul&gt;
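
&lt;p&gt;To make the first point concrete, here's the hypothetical sketch from above catching the same RAG chunk across two different prompts, where prefix caching would have missed it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;dedup = DedupFilter()

chunk = "Refund policy v2: refunds are issued within 14 days."

context_1 = dedup.filter([chunk])  # first sighting: passes through
context_2 = dedup.filter([chunk])  # same bytes, new prompt: filtered

print(len(context_1), len(context_2))  # 1 0
&lt;/code&gt;&lt;/pre&gt;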

&lt;h3&gt;Benchmarks (real data, not synthetic)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document Type&lt;/th&gt;
&lt;th&gt;Documents Tested&lt;/th&gt;
&lt;th&gt;Avg Redundancy&lt;/th&gt;
&lt;th&gt;Tokens Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAG chunks&lt;/td&gt;
&lt;td&gt;3,200&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;td&gt;~2M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pull requests&lt;/td&gt;
&lt;td&gt;2,800&lt;/td&gt;
&lt;td&gt;64%&lt;/td&gt;
&lt;td&gt;~1.8M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emails&lt;/td&gt;
&lt;td&gt;2,100&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;td&gt;~1.3M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support tickets&lt;/td&gt;
&lt;td&gt;1,200&lt;/td&gt;
&lt;td&gt;26%&lt;/td&gt;
&lt;td&gt;~0.3M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9,300&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;44% avg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5.4M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All tests run on real production documents, not generated benchmarks.&lt;/p&gt;
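
&lt;p&gt;For reference, here's one way a redundancy figure like these can be computed. The exact benchmark methodology isn't published here, so whitespace splitting stands in for a real tokenizer:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def redundancy(docs):
    """Share of tokens that belong to exact-duplicate documents."""
    seen = set()
    total = duplicated = 0
    for doc in docs:
        n = len(doc.split())  # stand-in for a real tokenizer
        total += n
        if doc in seen:
            duplicated += n
        seen.add(doc)
    return duplicated / total if total else 0.0

print(redundancy(["chunk A", "chunk B", "chunk A", "chunk A"]))  # 0.5
&lt;/code&gt;&lt;/pre&gt;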

&lt;h3&gt;Setup (30 seconds)&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install TokenSaver from VS Code Marketplace&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;Ctrl+Shift+T&lt;/code&gt; to activate&lt;/li&gt;
&lt;li&gt;That's it. No configuration. No API keys. No prompt changes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;TokenSaver sits between your code and the LLM API. It filters before sending. Your existing code, prompts, and workflows stay exactly the same.&lt;/p&gt;
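
&lt;p&gt;In a RAG pipeline, that placement looks roughly like this. The names are hypothetical: &lt;code&gt;retrieve&lt;/code&gt; and &lt;code&gt;call_llm&lt;/code&gt; stand in for your existing retriever and client, and &lt;code&gt;DedupFilter&lt;/code&gt; is the sketch from earlier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;_session_dedup = DedupFilter()  # persists for the whole session

def answer(question, retrieve, call_llm):
    # Filter retrieved chunks *before* they reach the API; the prompt
    # template itself never changes.
    chunks = _session_dedup.filter(retrieve(question))
    context = "\n\n".join(chunks)
    return call_llm("Context:\n" + context + "\n\nQuestion: " + question)
&lt;/code&gt;&lt;/pre&gt;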

&lt;h3&gt;Comparison table&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Prompt Caching&lt;/th&gt;
&lt;th&gt;Compression&lt;/th&gt;
&lt;th&gt;Manual Scripts&lt;/th&gt;
&lt;th&gt;TokenSaver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;0 (built-in)&lt;/td&gt;
&lt;td&gt;1-4 hours&lt;/td&gt;
&lt;td&gt;2-8 hours&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30 sec&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hit rate / reduction&lt;/td&gt;
&lt;td&gt;&amp;lt;20% (agents)&lt;/td&gt;
&lt;td&gt;30-70%&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;44% avg&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modifies prompts&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tuning required&lt;/td&gt;
&lt;td&gt;Yes (prefix)&lt;/td&gt;
&lt;td&gt;Yes (threshold)&lt;/td&gt;
&lt;td&gt;Yes (per pipeline)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider-agnostic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Information loss risk&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Moderate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency added&lt;/td&gt;
&lt;td&gt;0ms&lt;/td&gt;
&lt;td&gt;50-500ms&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recall guarantee&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Try it free for 14 days&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Start the free trial&lt;/strong&gt; and see your savings before paying (under construction, coming soon).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by Łukasz Trzeciak of EurekaIntelligent.dev (on the way). We optimize AI costs so you can focus on building.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>development</category>
    </item>
  </channel>
</rss>
