<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shouvik Palit</title>
    <description>The latest articles on DEV Community by Shouvik Palit (@shouvik12).</description>
    <link>https://dev.to/shouvik12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890793%2Fc3274ea7-2b55-4b67-9d70-2f3c08b63374.png</url>
      <title>DEV Community: Shouvik Palit</title>
      <link>https://dev.to/shouvik12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shouvik12"/>
    <language>en</language>
    <item>
      <title>How I built a Go proxy that keeps your LLM conversation alive when cloud quota runs out</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 03 May 2026 01:23:28 +0000</pubDate>
      <link>https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5</link>
      <guid>https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
If you've ever been mid-conversation with Claude or GPT, hit a quota limit, and switched to a local Ollama model, you know the pain. The local model has zero context. It's like walking into a meeting 45 minutes late with nobody to catch you up.&lt;br&gt;
I got frustrated enough to build something to fix it. That something is Trooper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Trooper&lt;/strong&gt;&lt;br&gt;
Trooper is a lightweight Go proxy (~850 lines, two files) that sits between your application and your LLM providers. When a cloud provider returns a quota error (429, 402, 529), Trooper automatically falls back to a local Ollama instance without dropping the conversation context.&lt;br&gt;
Single binary. Zero dependencies. Easy to audit since it sits in front of your API keys.&lt;/p&gt;
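
&lt;p&gt;The routing half is easy to sketch. The snippet below is illustrative rather than Trooper's actual code (the names and structure are mine): detect a quota-style status from the cloud provider and replay the request against the local Ollama endpoint.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package proxy

import (
    "bytes"
    "io"
    "net/http"
)

// isQuotaError reports whether a status code signals an exhausted quota
// (429 Too Many Requests, 402 Payment Required, 529 Overloaded).
func isQuotaError(status int) bool {
    switch status {
    case http.StatusTooManyRequests, http.StatusPaymentRequired, 529:
        return true
    }
    return false
}

// forward tries the cloud provider first; on a quota error it replays the
// same request body against the local Ollama endpoint. Real code would also
// translate between provider request formats.
func forward(cloudURL, ollamaURL string, body []byte) (*http.Response, error) {
    resp, err := http.Post(cloudURL, "application/json", bytes.NewReader(body))
    if err == nil {
        if !isQuotaError(resp.StatusCode) {
            return resp, nil
        }
        io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
        resp.Body.Close()
    }
    // Fallback path: in Trooper, context compaction happens before this handoff.
    return http.Post(ollamaURL, "application/json", bytes.NewReader(body))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;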

&lt;p&gt;&lt;strong&gt;The real problem: context loss on fallback&lt;/strong&gt;&lt;br&gt;
Most fallback proxies solve the routing problem but ignore the context problem. They either pass the raw message history as-is (which blows up the local model's context window) or they truncate the oldest turns (which kills continuity).&lt;br&gt;
Neither works well in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution: three-layer context compaction&lt;/strong&gt;&lt;br&gt;
Trooper uses a structured compaction strategy before handing off to Ollama:&lt;br&gt;
&lt;strong&gt;Anchor&lt;/strong&gt;: The first two turns of the conversation are always preserved. These establish the original intent and set the tone.&lt;br&gt;
&lt;strong&gt;SITREP&lt;/strong&gt;: The middle turns get compressed into a structured summary called a SITREP. It extracts intent, entities, open loops, recent actions, and resolved items. The local model gets situational awareness, not raw history.&lt;br&gt;
&lt;strong&gt;Tail&lt;/strong&gt;: The most recent turns are preserved within a configurable token budget.&lt;/p&gt;
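
&lt;p&gt;To make the split concrete, here is a rough sketch of the idea in Go. This is not Trooper's actual implementation; the Turn type, the crude token estimate, and the summarise callback are simplifications for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package proxy

// Turn is one message in the conversation.
type Turn struct {
    Role    string
    Content string
}

// countTokens is a stand-in estimate (roughly four characters per token).
func countTokens(turns []Turn) int {
    n := 0
    for _, t := range turns {
        n += len(t.Content) / 4
    }
    return n
}

// compact keeps the first two turns (anchor), folds the middle into a single
// SITREP summary turn, and keeps as many recent turns (tail) as the budget
// allows. The SITREP's own token cost is ignored here for brevity.
func compact(turns []Turn, budget int, summarise func([]Turn) string) []Turn {
    if len(turns) &lt;= 4 {
        return turns
    }
    anchor := turns[:2]
    used := countTokens(anchor)

    // Walk backwards from the newest turn, keeping the tail while budget remains.
    tailStart := len(turns)
    for i := len(turns) - 1; i &gt;= 2; i-- {
        cost := countTokens(turns[i : i+1])
        if used+cost &gt; budget {
            break
        }
        used += cost
        tailStart = i
    }

    out := append([]Turn{}, anchor...)
    if middle := turns[2:tailStart]; len(middle) &gt; 0 {
        out = append(out, Turn{Role: "system", Content: "SITREP: " + summarise(middle)})
    }
    return append(out, turns[tailStart:]...)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;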

&lt;p&gt;A real SITREP looks like this in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📦  Context compaction triggered — 538 tokens exceeds 500 budget
📦  Context compaction complete
    Total turns    : 7
    Anchor turns   : 2 (~43 tokens)
    Middle turns   : 2 → SITREP (~71 tokens)
    Recent turns   : 3 (~323 tokens)
    Tokens used    : 437 / 500
    SITREP         : intent="trooper" stage=unclear confidence=0.60 open=1 actions=0 resolved=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local model knows what you were working on, what's broken, what's been resolved, and what the last few exchanges were. That's enough to keep the conversation coherent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Go&lt;/strong&gt;&lt;br&gt;
Single binary distribution was the main reason. No runtime, no dependencies, drop it anywhere and it runs. The codebase being ~850 lines also means anyone can read the whole thing in an afternoon — important for something that proxies API keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider support&lt;/strong&gt;&lt;br&gt;
Trooper currently supports Claude, Gemini, and OpenAI as cloud providers with automatic fallback to Ollama. The provider chain is configurable via environment variables.&lt;/p&gt;
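
&lt;p&gt;Purely as an illustration of the pattern (the variable name below is made up; the real names are documented in the repo), reading a chain from the environment might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package proxy

import (
    "os"
    "strings"
)

// providerChain reads a comma-separated provider order from the environment.
// TROOPER_PROVIDERS is a hypothetical name used only for this example.
func providerChain() []string {
    raw := os.Getenv("TROOPER_PROVIDERS") // e.g. "claude,gemini,openai,ollama"
    if raw == "" {
        return []string{"claude", "ollama"} // default: one cloud provider, one local
    }
    return strings.Split(raw, ",")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;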

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;br&gt;
V3.0 is focused on foundation hardening — concurrency fixes and improved error handling. V3.1 will improve the SITREP extraction quality on longer conversations, which is where intent detection starts to degrade today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it&lt;/strong&gt;&lt;br&gt;
github.com/shouvik12/trooper&lt;br&gt;
Would love feedback on the context compaction approach — especially from anyone running larger local models. What's your cold-start latency on fallback?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>ai</category>
      <category>go</category>
    </item>
  </channel>
</rss>
