<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arjun Shah</title>
    <description>The latest articles on DEV Community by Arjun Shah (@arjunkshah).</description>
    <link>https://dev.to/arjunkshah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4004457%2Fef60576e-f654-4841-bdc1-a8f82ef6fd73.jpg</url>
      <title>DEV Community: Arjun Shah</title>
      <link>https://dev.to/arjunkshah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arjunkshah"/>
    <language>en</language>
    <item>
      <title>SuperCompress is now on PyPI! pip install supercompress in 1 line</title>
      <dc:creator>Arjun Shah</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:55:27 +0000</pubDate>
      <link>https://dev.to/arjunkshah/supercompress-is-now-on-pypi-pip-install-supercompress-in-1-line-20ja</link>
      <guid>https://dev.to/arjunkshah/supercompress-is-now-on-pypi-pip-install-supercompress-in-1-line-20ja</guid>
      <description>&lt;p&gt;I just published &lt;strong&gt;SuperCompress&lt;/strong&gt; to PyPI! 🎉&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install supercompress&lt;/code&gt; — that's all it takes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is it?
&lt;/h2&gt;

&lt;p&gt;SuperCompress is a tiny ~5K-parameter policy that runs on CPU and scores every line of context for relevance before the prompt is sent to the LLM. It keeps only the lines that matter for the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;65% fewer tokens&lt;/strong&gt; → same answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% oracle recall&lt;/strong&gt; → the answer line was never dropped in benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~60ms CPU latency&lt;/strong&gt; → no GPU needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt; → MIT with non-commercial clause&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;supercompress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from supercompress import compress

# Placeholder inputs: swap in your own context and question.
context = "Your long context text here..."
question = "What does this code do?"

result = compress(context, question)
print(f"Saved {result['kv_savings_pct']}% tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Live Demo
&lt;/h2&gt;

&lt;p&gt;Try the interactive comparison tool: &lt;a href="https://supercompress.vercel.app/compare" rel="noopener noreferrer"&gt;https://supercompress.vercel.app/compare&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or read the technical deep-dive: &lt;a href="https://dev.to/arjunkshah/how-i-built-a-prompt-compressor-that-saves-65-on-llm-costs-3m80"&gt;https://dev.to/arjunkshah/how-i-built-a-prompt-compressor-that-saves-65-on-llm-costs-3m80&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/arjunkshah/supercompress" rel="noopener noreferrer"&gt;https://github.com/arjunkshah/supercompress&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/supercompress/" rel="noopener noreferrer"&gt;https://pypi.org/project/supercompress/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>I Built a Prompt Compressor That Saves 65% on LLM Costs — Here's the Story</title>
      <dc:creator>Arjun Shah</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:45:49 +0000</pubDate>
      <link>https://dev.to/arjunkshah/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-heres-the-story-2bdp</link>
      <guid>https://dev.to/arjunkshah/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-heres-the-story-2bdp</guid>
      <description>&lt;p&gt;I've been working on a side project called &lt;strong&gt;SuperCompress&lt;/strong&gt; — an intelligent prompt compression system for LLMs. The idea is simple: most tokens you send to an LLM never need to be processed. They're padding, boilerplate, irrelevant context. But they still burn GPU cycles.&lt;/p&gt;

&lt;p&gt;I wanted to fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Working with LLM agents, I noticed something: every agent loop was sending massive context through the GPU. 10K tokens. 50K tokens. Sometimes more. Most of it was irrelevant to the specific task.&lt;/p&gt;

&lt;p&gt;Truncation (keeping head + tail) was the standard approach, but it regularly dropped critical information from the middle of the context.&lt;/p&gt;
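
&lt;p&gt;To make that failure mode concrete, here is a minimal sketch of head + tail truncation. The &lt;code&gt;truncate_head_tail&lt;/code&gt; helper and the sample log lines are made up for illustration; they are not part of SuperCompress.&lt;/p&gt;

```python
def truncate_head_tail(lines, keep_head, keep_tail):
    """Keep the first keep_head and last keep_tail lines, drop the middle."""
    if keep_head + keep_tail >= len(lines):
        return list(lines)
    tail = lines[len(lines) - keep_tail:] if keep_tail > 0 else []
    return lines[:keep_head] + tail


# Hypothetical agent log: the line that answers the question sits in the middle.
context = [
    "INFO starting job 17",
    "INFO loading config",
    "ERROR database timeout after 30s",  # the answer line
    "INFO retrying",
    "INFO job 17 finished",
]

kept = truncate_head_tail(context, keep_head=1, keep_tail=1)
print(kept)
# Check whether the ERROR line survived truncation.
print(any("ERROR" in line for line in kept))
```

&lt;p&gt;Truncation never looks at the question, so whether the answer line survives depends only on where it happens to sit.&lt;/p&gt;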

&lt;p&gt;I thought: what if we could score each line of context for relevance BEFORE sending it to the GPU? A tiny CPU model that decides what matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Build
&lt;/h2&gt;

&lt;p&gt;The technical challenges were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train a lightweight policy (~5K params) that runs on CPU in about 60ms&lt;/li&gt;
&lt;li&gt;Score each line of context relative to the user's question&lt;/li&gt;
&lt;li&gt;Evict low-relevance lines while keeping answer-critical ones&lt;/li&gt;
&lt;li&gt;Ensure the compressed output preserves correct answers&lt;/li&gt;
&lt;/ol&gt;
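
&lt;p&gt;Steps 2 and 3 can be sketched roughly as follows. The real policy is a trained ~5K-parameter network; the word-overlap &lt;code&gt;score_line&lt;/code&gt; below is only a stand-in assumed for illustration, and all function names are hypothetical.&lt;/p&gt;

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_line(line, question):
    """Stand-in relevance score: fraction of question words found in the line."""
    q = tokenize(question)
    if not q:
        return 0.0
    return len(q.intersection(tokenize(line))) / len(q)


def compress_to_budget(lines, question, budget_ratio=0.35):
    """Keep the highest-scoring lines up to the budget, in original order."""
    n_keep = max(1, round(len(lines) * budget_ratio))
    ranked = sorted(
        range(len(lines)),
        key=lambda i: score_line(lines[i], question),
        reverse=True,
    )
    keep = sorted(ranked[:n_keep])  # restore document order
    return [lines[i] for i in keep]


lines = [
    "The service was deployed on Monday.",
    "Logging uses the standard library.",
    "The database timeout is set to 30 seconds.",
    "Unit tests live in the tests folder.",
    "The timeout applies to every database query.",
    "Docs are built with Sphinx.",
]
kept = compress_to_budget(lines, "What is the database timeout?")
print(kept)
```

&lt;p&gt;Selecting by rank rather than by position is what lets answer-critical lines in the middle of the context survive.&lt;/p&gt;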

&lt;p&gt;After a lot of iteration, the results surprised even me:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;KV Saved&lt;/th&gt;
&lt;th&gt;Oracle Recall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Truncation&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H2O&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SuperCompress&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;100% oracle recall at the same token savings. The policy never dropped a line the answer depended on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Environmental Angle
&lt;/h2&gt;

&lt;p&gt;Here's what hit me hardest: at 50M agent turns per day (a conservative estimate for the industry), we're wasting 100B tokens daily. That's 24K GPU hours, 1,526 tons of CO₂, 6.5M liters of cooling water. Every day.&lt;/p&gt;

&lt;p&gt;Per 1 million compressions, SuperCompress saves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;800M tokens avoided&lt;/li&gt;
&lt;li&gt;29 kWh energy&lt;/li&gt;
&lt;li&gt;12 kg CO₂&lt;/li&gt;
&lt;li&gt;52 L cooling water&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's tiny per call. It's enormous at scale.&lt;/p&gt;
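
&lt;p&gt;The per-call numbers follow from simple division, and the same figures can be scaled to any token volume. The sketch below assumes savings scale linearly with tokens avoided, which is a simplification; the &lt;code&gt;scale_savings&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

```python
# Per 1 million compressions, as listed above.
PER_MILLION = {"tokens": 800e6, "kwh": 29.0, "co2_kg": 12.0, "water_l": 52.0}


def scale_savings(tokens_avoided):
    """Scale the per-million-compression figures linearly by tokens avoided."""
    factor = tokens_avoided / PER_MILLION["tokens"]
    return {name: value * factor for name, value in PER_MILLION.items()}


# One compression: 800M tokens / 1M compressions = 800 tokens avoided.
per_call = scale_savings(800)
print(per_call["kwh"] * 1000, "Wh per call")
print(per_call["co2_kg"] * 1000, "g CO2 per call")

# The 100B tokens per day mentioned above, under the same linear assumption.
daily = scale_savings(100e9)
print(daily["kwh"], "kWh,", daily["co2_kg"], "kg CO2,", daily["water_l"], "L water")
```

&lt;p&gt;Under this linear assumption, 100B tokens works out to roughly 3,625 kWh, 1.5 tonnes of CO₂, and 6,500 L of water. That is well below the daily estimates quoted above, so those presumably rest on different per-token assumptions.&lt;/p&gt;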

&lt;h2&gt;
  
  
  Current Status
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Working policy with 100% oracle recall&lt;/li&gt;
&lt;li&gt;✅ Benchmarks and tests (65 passing)&lt;/li&gt;
&lt;li&gt;✅ Hosted API with free tier&lt;/li&gt;
&lt;li&gt;✅ Browser demo (compresses in-browser)&lt;/li&gt;
&lt;li&gt;✅ Python client library&lt;/li&gt;
&lt;li&gt;✅ Integration guides (OpenAI, LangChain, LlamaIndex)&lt;/li&gt;
&lt;li&gt;✅ Open source (MIT)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently looking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First real users and feedback&lt;/li&gt;
&lt;li&gt;Integration partners&lt;/li&gt;
&lt;li&gt;Contributors to the open-source codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://supercompress.vercel.app" rel="noopener noreferrer"&gt;https://supercompress.vercel.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/arjunkshah/supercompress" rel="noopener noreferrer"&gt;https://github.com/arjunkshah/supercompress&lt;/a&gt;&lt;br&gt;
Docs: &lt;a href="https://arjunkshah-supercompress-55.mintlify.app" rel="noopener noreferrer"&gt;https://arjunkshah-supercompress-55.mintlify.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ask:&lt;/strong&gt; If you're building with LLMs, try compressing your next prompt. See if the answers stay the same. I'd love to hear what you think.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Now available on PyPI!&lt;/strong&gt; &lt;code&gt;pip install supercompress&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/arjunkshah/supercompress" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://pypi.org/project/supercompress/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt; | &lt;a href="https://supercompress.vercel.app" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>SuperCompress: Cut LLM Costs by 65% Without Losing Answers</title>
      <dc:creator>Arjun Shah</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:23:33 +0000</pubDate>
      <link>https://dev.to/arjunkshah/supercompress-cut-llm-costs-by-65-without-losing-answers-2c8n</link>
      <guid>https://dev.to/arjunkshah/supercompress-cut-llm-costs-by-65-without-losing-answers-2c8n</guid>
      <description>&lt;h2&gt;
  
  
  Tweet 1
&lt;/h2&gt;

&lt;p&gt;Every LLM call burns GPU cycles on tokens that never needed to run.&lt;/p&gt;

&lt;p&gt;Padding. Boilerplate. Irrelevant context.&lt;/p&gt;

&lt;p&gt;I built SuperCompress — a tiny CPU policy that cuts 65% of tokens before inference.&lt;/p&gt;

&lt;p&gt;Open source. MIT. Free tier.&lt;/p&gt;

&lt;p&gt;supercompress.vercel.app&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweet 2
&lt;/h2&gt;

&lt;p&gt;The problem is worse than most people realize.&lt;/p&gt;

&lt;p&gt;At ~50M agent turns/day:&lt;/p&gt;

&lt;p&gt;→ 100B tokens wasted daily&lt;/p&gt;

&lt;p&gt;→ 24K GPU hours&lt;/p&gt;

&lt;p&gt;→ 1,526 tons CO₂&lt;/p&gt;

&lt;p&gt;→ 6.5M L cooling water&lt;/p&gt;

&lt;p&gt;We're burning through resources on tokens that don't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweet 3
&lt;/h2&gt;

&lt;p&gt;How it works:&lt;/p&gt;

&lt;p&gt;1️⃣ Context + question → CPU policy (5K params)&lt;/p&gt;

&lt;p&gt;2️⃣ Every line scored for relevance to the question&lt;/p&gt;

&lt;p&gt;3️⃣ Low-scoring lines evicted&lt;/p&gt;

&lt;p&gt;4️⃣ Only essential tokens reach the GPU&lt;/p&gt;

&lt;p&gt;CPU first. GPU for what matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweet 4
&lt;/h2&gt;

&lt;p&gt;The numbers at 35% budget:&lt;/p&gt;

&lt;p&gt;• 65% KV cache saved&lt;/p&gt;

&lt;p&gt;• 100% oracle recall (vs 25% for truncation)&lt;/p&gt;

&lt;p&gt;• ~60ms CPU latency&lt;/p&gt;

&lt;p&gt;Same answers. ⅓ the compute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweet 5
&lt;/h2&gt;

&lt;p&gt;Per 1 million compressions:&lt;/p&gt;

&lt;p&gt;→ 800M tokens avoided&lt;/p&gt;

&lt;p&gt;→ 29 kWh saved&lt;/p&gt;

&lt;p&gt;→ 12 kg CO₂ avoided&lt;/p&gt;

&lt;p&gt;→ 52 L cooling water saved&lt;/p&gt;

&lt;p&gt;Scale that across the industry and it's enormous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweet 6
&lt;/h2&gt;

&lt;p&gt;SuperCompress is:&lt;/p&gt;

&lt;p&gt;✅ Open source (MIT)&lt;/p&gt;

&lt;p&gt;✅ Free API tier&lt;/p&gt;

&lt;p&gt;✅ Python library&lt;/p&gt;

&lt;p&gt;✅ Browser demo (no install)&lt;/p&gt;

&lt;p&gt;✅ Integration guides for OpenAI/LangChain&lt;/p&gt;

&lt;p&gt;Try it: supercompress.vercel.app&lt;/p&gt;

&lt;p&gt;GitHub: github.com/arjunkshah/supercompress&lt;/p&gt;

&lt;h2&gt;
  
  
  Tweet 7
&lt;/h2&gt;

&lt;p&gt;Built this because I believe we can't scale AI by burning through what we have left.&lt;/p&gt;

&lt;p&gt;Smarter compute means more AI for everyone — without the environmental cost.&lt;/p&gt;

&lt;p&gt;Would love feedback from the community 🙏&lt;/p&gt;

&lt;p&gt;#LLM #AI #OpenSource #MachineLearning&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt; &lt;a href="https://github.com/arjunkshah/supercompress" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://supercompress.vercel.app" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt; | &lt;a href="https://supercompress.vercel.app/compare" rel="noopener noreferrer"&gt;Interactive Tool&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>How I Built a Prompt Compressor That Saves 65% on LLM Costs</title>
      <dc:creator>Arjun Shah</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:15:11 +0000</pubDate>
      <link>https://dev.to/arjunkshah/how-i-built-a-prompt-compressor-that-saves-65-on-llm-costs-3m80</link>
      <guid>https://dev.to/arjunkshah/how-i-built-a-prompt-compressor-that-saves-65-on-llm-costs-3m80</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Prompt Compressor That Saves 65% on LLM Costs
&lt;/h1&gt;

&lt;p&gt;Every time you call an LLM, tokens that never needed to be processed burn GPU cycles, waste money, and strain the grid. The problem gets worse with every agent loop, every long-context RAG query, every multi-turn conversation.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;SuperCompress&lt;/strong&gt; — a tiny ~5K parameter CPU policy that scores every line of context for relevance before inference, keeping only what the model needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The results?&lt;/strong&gt; 65% fewer tokens, 100% oracle recall, ~60ms latency. Open source. MIT licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: LLMs Are Wasteful
&lt;/h2&gt;

&lt;p&gt;Modern LLMs process every token you give them. On long contexts (think agent logs, RAG results, codebases), most of those tokens are padding — irrelevant boilerplate that consumes KV cache space without contributing to the answer.&lt;/p&gt;

&lt;p&gt;The standard approaches don't work well:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Tokens Saved&lt;/th&gt;
&lt;th&gt;Oracle Recall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Truncation (keep head/tail)&lt;/td&gt;
&lt;td&gt;~65%&lt;/td&gt;
&lt;td&gt;~25% recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FIFO eviction&lt;/td&gt;
&lt;td&gt;~65%&lt;/td&gt;
&lt;td&gt;~25% recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H2O&lt;/td&gt;
&lt;td&gt;~65%&lt;/td&gt;
&lt;td&gt;~98% recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SuperCompress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~65%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100% recall&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the same KV savings, SuperCompress keeps every answer-critical line, while truncation and FIFO keep about a quarter of them and H2O keeps about 98%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: CPU-First Eviction
&lt;/h2&gt;

&lt;p&gt;The key insight: &lt;strong&gt;you don't need a GPU to decide what a GPU should process.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐     ┌──────────────┐     ┌───────────┐
│  Context In  │ ──→ │  CPU Policy  │ ──→ │  GPU LLM  │
│ (1,247 tok)  │     │  (5K params) │     │ (437 tok) │
└──────────────┘     └──────────────┘     └───────────┘
                            │
                            ↓
                      Score each line
                      Drop low-relevance
                      Keep answer-critical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The policy is a lightweight neural network (~5,000 parameters) that runs entirely on CPU. It takes each line of context + the user's question, and scores how relevant that line is to answering the question. Lines below a threshold get evicted.&lt;/p&gt;
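
&lt;p&gt;The eviction step itself is simple once every line has a score. A minimal sketch, assuming scores in the range 0 to 1 and a threshold of 0.5; the scores, the threshold value, and the &lt;code&gt;evict_below_threshold&lt;/code&gt; name are all invented for illustration.&lt;/p&gt;

```python
def evict_below_threshold(scored_lines, threshold=0.5):
    """Split (line, score) pairs into kept and evicted lines, preserving order."""
    kept = [line for line, score in scored_lines if score >= threshold]
    evicted = [line for line, score in scored_lines if not score >= threshold]
    return kept, evicted


# Scores are invented for illustration; in practice they would come from the policy.
scored = [
    ("def connect(url): ...", 0.91),
    ("# Copyright notice", 0.03),
    ("import logging", 0.12),
    ("TIMEOUT_SECONDS = 30", 0.77),
]
kept, evicted = evict_below_threshold(scored, threshold=0.5)
print(kept)
print(evicted)
```

&lt;p&gt;Only the kept lines are joined back together and sent on to the LLM.&lt;/p&gt;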

&lt;h2&gt;
  
  
  Training Approach
&lt;/h2&gt;

&lt;p&gt;The policy was trained on a dataset of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-form text passages (books, documentation, code)&lt;/li&gt;
&lt;li&gt;Realistic user questions paired with each passage&lt;/li&gt;
&lt;li&gt;Ground-truth relevance labels from oracle LLM judgments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The training objective balances:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token savings&lt;/strong&gt; — maximize KV reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall&lt;/strong&gt; — preserve lines needed for correct answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — keep inference under 100ms on CPU&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;At a fixed 35% budget (keep 35% of tokens):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Policy&lt;/span&gt;          &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Oracle&lt;/span&gt; &lt;span class="k"&gt;Recall&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Entity&lt;/span&gt; &lt;span class="k"&gt;Recall&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="k"&gt;Latency&lt;/span&gt;
&lt;span class="err"&gt;────────────────┼───────────────┼───────────────┼────────&lt;/span&gt;
&lt;span class="k"&gt;FIFO&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="k"&gt;Truncation&lt;/span&gt; &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;25&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;73&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="mf"&gt;57&lt;/span&gt;&lt;span class="k"&gt;ms&lt;/span&gt;
&lt;span class="k"&gt;Summarization&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;61&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;65&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="mf"&gt;63&lt;/span&gt;&lt;span class="k"&gt;ms&lt;/span&gt;
&lt;span class="k"&gt;H&lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt;&lt;span class="k"&gt;O&lt;/span&gt;             &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;98&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;73&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="mf"&gt;56&lt;/span&gt;&lt;span class="k"&gt;ms&lt;/span&gt;
&lt;span class="k"&gt;SuperCompress&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt;        &lt;span class="mf"&gt;100&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;  &lt;span class="err"&gt;|&lt;/span&gt;         &lt;span class="mf"&gt;73&lt;/span&gt;&lt;span class="err"&gt;%&lt;/span&gt;   &lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="mf"&gt;60&lt;/span&gt;&lt;span class="k"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;100% oracle recall means the policy never dropped a line that the answer depended on, at the same 35% token budget as every baseline in the table.&lt;/p&gt;
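
&lt;p&gt;Read that way, oracle recall is the share of answer-critical lines that survive compression. Here is a minimal sketch of the two headline metrics, assuming oracle labels are given as line indices. The function names are hypothetical, and the actual benchmark harness may compute these differently.&lt;/p&gt;

```python
def oracle_recall(kept, oracle):
    """Share of answer-critical line indices that survived compression."""
    oracle = set(oracle)
    if not oracle:
        return 1.0
    return len(oracle.intersection(kept)) / len(oracle)


def savings_pct(tokens_in, tokens_out):
    """Percentage of input tokens removed before inference."""
    return 100.0 * (tokens_in - tokens_out) / tokens_in


# Token counts from the architecture diagram: 1,247 in, 437 out.
print(round(savings_pct(1247, 437), 1))

# Toy example: lines 3 and 7 are answer-critical.
print(oracle_recall(kept=[0, 3, 7, 9], oracle=[3, 7]))  # both kept
print(oracle_recall(kept=[0, 1, 3, 9], oracle=[3, 7]))  # line 7 dropped
```

&lt;p&gt;The diagram's token counts work out to the same ~65% savings used throughout the benchmarks.&lt;/p&gt;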

&lt;h2&gt;
  
  
  Environmental Impact
&lt;/h2&gt;

&lt;p&gt;Per 1 million compressions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;800M tokens avoided&lt;/strong&gt; — that's real GPU time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;29 kWh saved&lt;/strong&gt; — enough to power a home for a day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 kg CO₂ avoided&lt;/strong&gt; — tiny but it adds up&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;52 L water saved&lt;/strong&gt; — datacenter cooling is thirsty&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Python (in-process)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Install first, from a shell: pip install git+https://github.com/arjunkshah/supercompress.git&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;supercompress&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compress_context&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your long context text here...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does this code do?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;budget_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compressed_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kv_savings_pct&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% KV saved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hosted API (no local ML deps)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://supercompress.vercel.app/api/v1/compress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"X-API-Key: sc_live_YOUR_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"context":"...","query":"Summarize this"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
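
&lt;p&gt;The same request can be built from Python with only the standard library. This sketch mirrors the curl call above: the endpoint, the &lt;code&gt;X-API-Key&lt;/code&gt; header, and the JSON fields come from that example, while the &lt;code&gt;Content-Type&lt;/code&gt; header, the &lt;code&gt;build_request&lt;/code&gt; helper, and the shape of the response are assumptions.&lt;/p&gt;

```python
import json
import urllib.request

API_URL = "https://supercompress.vercel.app/api/v1/compress"


def build_request(context, query, api_key):
    """Build the POST request; mirrors the curl example field for field."""
    body = json.dumps({"context": context, "query": query}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_request("...", "Summarize this", "sc_live_YOUR_KEY")
print(request.get_method(), request.full_url)

# To actually send it (needs a real key and network access):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))  # response shape is an assumption
```

&lt;p&gt;Keeping request construction separate from sending makes the call easy to inspect or test without a live key.&lt;/p&gt;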



&lt;h3&gt;
  
  
  Browser demo (no setup needed)
&lt;/h3&gt;

&lt;p&gt;Just visit &lt;a href="https://supercompress.vercel.app" rel="noopener noreferrer"&gt;supercompress.vercel.app&lt;/a&gt; and try the live demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Adaptive compression ratios (not fixed budget)&lt;/li&gt;
&lt;li&gt;Integration with LangChain/LlamaIndex as a built-in compressor&lt;/li&gt;
&lt;li&gt;Quantized policy for even lower latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is open source under MIT. Contributions welcome!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/arjunkshah/supercompress" rel="noopener noreferrer"&gt;https://github.com/arjunkshah/supercompress&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Live demo:&lt;/strong&gt; &lt;a href="https://supercompress.vercel.app" rel="noopener noreferrer"&gt;https://supercompress.vercel.app&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://arjunkshah-supercompress-55.mintlify.app" rel="noopener noreferrer"&gt;https://arjunkshah-supercompress-55.mintlify.app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
