<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishav E. Kejriwal</title>
    <description>The latest articles on DEV Community by Rishav E. Kejriwal (@78_bola11605).</description>
    <link>https://dev.to/78_bola11605</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830494%2Fbd2c154a-d123-49ed-b3bf-68e4da2a0433.jpg</url>
      <title>DEV Community: Rishav E. Kejriwal</title>
      <link>https://dev.to/78_bola11605</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/78_bola11605"/>
    <language>en</language>
    <item>
      <title>I Cut My LLM API Bill in Half with a Single Python Library</title>
      <dc:creator>Rishav E. Kejriwal</dc:creator>
      <pubDate>Wed, 18 Mar 2026 08:39:23 +0000</pubDate>
      <link>https://dev.to/78_bola11605/i-cut-my-llm-api-bill-in-half-with-a-single-python-library-57lo</link>
      <guid>https://dev.to/78_bola11605/i-cut-my-llm-api-bill-in-half-with-a-single-python-library-57lo</guid>
      <description>&lt;p&gt;Last month I was debugging why our agent pipeline was burning through $400/day in OpenAI tokens. Turns out 60% of what we were feeding GPT-4 was redundant — repeated JSON schemas, duplicate log blocks, unchanged diff context, verbose imports.&lt;/p&gt;

&lt;p&gt;I tried prompt trimming by hand. Tedious. I tried LLMLingua. Better, but it needs a GPU and the fidelity wasn't great at high compression.&lt;/p&gt;

&lt;p&gt;Then I found &lt;a href="https://github.com/open-compress/claw-compactor" rel="noopener noreferrer"&gt;claw-compactor&lt;/a&gt; and honestly I'm a bit mad I didn't find it sooner.&lt;/p&gt;

&lt;h2&gt;What It Actually Does&lt;/h2&gt;

&lt;p&gt;It's a 14-stage compression pipeline that sits between your data and the LLM. No neural network, no inference cost — pure deterministic transforms. You feed it code, JSON, logs, diffs, whatever, and it spits out a compressed version that preserves meaning but costs way fewer tokens.&lt;/p&gt;
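
&lt;p&gt;The post doesn't show what those deterministic transforms look like inside, but the general pattern is easy to sketch. Here's a toy two-stage pipeline of my own — these stage names and functions are mine for illustration, not claw-compactor's code:&lt;/p&gt;

```python
# Illustrative sketch of a staged, deterministic compression pipeline.
# Stage names and logic are hypothetical, not claw-compactor's internals.

def strip_blank_runs(text: str) -> str:
    """Collapse runs of blank lines down to a single blank line."""
    out, was_blank = [], False
    for line in text.splitlines():
        if line.strip():
            out.append(line)
            was_blank = False
        elif not was_blank:
            out.append("")
            was_blank = True
    return "\n".join(out)

def dedupe_exact_lines(text: str) -> str:
    """Drop exact repeats of long lines, keeping the first occurrence."""
    seen, out = set(), []
    for line in text.splitlines():
        key = line.strip()
        if len(key) > 40 and key in seen:
            continue  # already sent this line once; skip the duplicate
        seen.add(key)
        out.append(line)
    return "\n".join(out)

STAGES = [strip_blank_runs, dedupe_exact_lines]

def compress(text: str) -> str:
    """Run each stage in order -- pure string transforms, no inference."""
    for stage in STAGES:
        text = stage(text)
    return text
```

&lt;p&gt;Because every stage is a pure function, the output is reproducible and costs nothing to run — that's the whole appeal over a neural compressor.&lt;/p&gt;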

&lt;p&gt;The compression rates are kind of nuts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JSON payloads&lt;/strong&gt;: 82% reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build logs&lt;/strong&gt;: 76% reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python source&lt;/strong&gt;: 25% reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git diffs&lt;/strong&gt;: 40%+ reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weighted average across real workloads: &lt;strong&gt;~54% fewer tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Why I Actually Switched From LLMLingua&lt;/h2&gt;

&lt;p&gt;I was using LLMLingua-2 before. It works, but:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It needs a model to run (GPU or slow CPU inference)&lt;/li&gt;
&lt;li&gt;At 0.3 compression rate, ROUGE-L fidelity was 0.346 — basically mangling the content&lt;/li&gt;
&lt;li&gt;Can't reverse the compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;claw-compactor at the same 0.3 rate? ROUGE-L of &lt;strong&gt;0.653&lt;/strong&gt;. Almost twice the fidelity. And zero inference cost because it's all deterministic.&lt;/p&gt;

&lt;p&gt;Plus it has this &lt;code&gt;RewindStore&lt;/code&gt; feature where you can actually get the original content back from a compressed marker. Try doing that with a neural compressor.&lt;/p&gt;
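
&lt;p&gt;To be concrete about what "reversible" means here: conceptually, a rewind store swaps bulky content for a short marker and keeps the original on the side. A toy sketch of that idea — my names and format, not the library's actual &lt;code&gt;RewindStore&lt;/code&gt; API:&lt;/p&gt;

```python
import hashlib

class RewindStoreSketch:
    """Toy reversible store: replace bulky content with a short marker
    and keep the original so it can be expanded again on demand.
    (Hypothetical sketch -- not claw-compactor's actual RewindStore.)"""

    def __init__(self):
        self._originals = {}

    def compress(self, content: str) -> str:
        # short content-addressed key for the marker
        key = hashlib.sha256(content.encode()).hexdigest()[:8]
        self._originals[key] = content
        return f"[compressed:{key} {len(content)} chars]"

    def rewind(self, marker: str) -> str:
        # recover the original content from its marker
        key = marker.split(":")[1].split()[0]
        return self._originals[key]
```

&lt;p&gt;A neural compressor throws the original away by construction; a marker-based scheme keeps it recoverable, which matters when the model later asks for the full detail.&lt;/p&gt;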

&lt;h2&gt;How I'm Using It&lt;/h2&gt;

&lt;p&gt;We have an agent that processes GitHub issues — fetches the issue, relevant code, CI logs, and prior conversations, then asks the LLM to triage.&lt;/p&gt;

&lt;p&gt;Before compression, a typical context was ~12K tokens. After piping everything through &lt;code&gt;FusionEngine&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scripts.lib.fusion.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FusionEngine&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FusionEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 12K tokens → 5.5K tokens, zero information loss on the stuff that matters
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 14 stages each handle a different content type. The cool part is it auto-detects what's code, what's JSON, what's a log — you don't need to tell it.&lt;/p&gt;
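
&lt;p&gt;For a sense of how that kind of auto-detection can work, here's a crude content sniffer I wrote for illustration — claw-compactor's real detector is presumably smarter than this:&lt;/p&gt;

```python
import json

def detect_content_type(text: str) -> str:
    """Crude content-type sniffing, in the spirit of routing content to
    per-type compression stages. Heuristics are illustrative only."""
    stripped = text.strip()
    # JSON: starts like an object/array and actually parses
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    # Unified diff headers
    if stripped.startswith(("diff --git", "--- ", "+++ ")):
        return "diff"
    # Logs: most lines start with a timestamp digit or carry a level tag
    lines = stripped.splitlines()
    loggy = sum(
        1 for l in lines
        if l[:1].isdigit() or "ERROR" in l or "WARN" in l or "INFO" in l
    )
    if lines and loggy > len(lines) // 2:
        return "log"
    return "code"
```
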

&lt;p&gt;Some stages that impressed me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SemanticDedup&lt;/strong&gt; — SimHash fingerprinting to find near-duplicate blocks across your entire conversation. Killed about 20% of our tokens right there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ionizer&lt;/strong&gt; — Sees 100 JSON objects with the same schema? Samples a representative subset and summarizes the rest. Brutal efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LogCrunch&lt;/strong&gt; — "This line repeated 847 times" instead of sending 847 lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neurosyntax&lt;/strong&gt; — Actual AST-aware code compression. Knows the difference between meaningful code and boilerplate.&lt;/li&gt;
&lt;/ul&gt;
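
&lt;p&gt;The LogCrunch idea in particular is simple enough to show in a few lines. Here's an illustrative run-length collapse — my own sketch, with a made-up output format:&lt;/p&gt;

```python
from itertools import groupby

def crunch_repeats(log: str, threshold: int = 3) -> str:
    """Collapse consecutive duplicate log lines into one line plus a
    repeat count, in the spirit of the LogCrunch stage described above.
    (Sketch only -- output format invented for illustration.)"""
    out = []
    for line, run in groupby(log.splitlines()):
        n = len(list(run))
        if n >= threshold:
            out.append(f"{line}  [repeated {n} times]")
        else:
            out.extend([line] * n)  # short runs pass through unchanged
    return "\n".join(out)
```

&lt;p&gt;Sending "repeated 847 times" instead of 847 lines is exactly the kind of win that's invisible in a short prompt but huge in a CI-log-heavy agent context.&lt;/p&gt;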

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;p&gt;For our pipeline specifically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~$400/day&lt;/td&gt;
&lt;td&gt;~$185/day&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$6,450/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12K avg tokens/call&lt;/td&gt;
&lt;td&gt;5.5K avg tokens/call&lt;/td&gt;
&lt;td&gt;54% reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Zero added latency&lt;/td&gt;
&lt;td&gt;No GPU needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The library itself has zero dependencies and runs on Python 3.9+. You can optionally add &lt;code&gt;tiktoken&lt;/code&gt; for exact token counts and &lt;code&gt;tree-sitter-language-pack&lt;/code&gt; for AST-level code analysis.&lt;/p&gt;
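
&lt;p&gt;If you want that optional-dependency pattern in your own tooling, the usual shape is a try/except import with a rough fallback — the function name here is mine, not the library's:&lt;/p&gt;

```python
def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Exact token count when tiktoken is installed, rough chars/4
    estimate otherwise. (Illustrative helper, not claw-compactor's API.)"""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except ImportError:
        # ~4 characters per token is a common rough estimate for English
        return max(1, len(text) // 4)
```
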

&lt;h2&gt;What It's Not&lt;/h2&gt;

&lt;p&gt;This isn't magic. It won't compress a well-written 500-word prompt that's already tight. It shines when you're feeding the LLM structured data, code, logs, or conversations — the kind of bloated context that agent systems generate.&lt;/p&gt;

&lt;p&gt;If you're running a chatbot with short user messages, you probably don't need this. If you're building an AI agent that processes real-world data, you probably do.&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/open-compress/claw-compactor.git
&lt;span class="nb"&gt;cd &lt;/span&gt;claw-compactor
python3 scripts/mem_compress.py /your/workspace benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The benchmark command does a dry run — shows you exactly how much each stage would compress without changing anything. Start there.&lt;/p&gt;

&lt;p&gt;1,676 tests passing, MIT licensed, zero dependencies. Not sure what else you'd want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/open-compress/claw-compactor" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Curious if anyone else is running token compression in production. What's your setup?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
