<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shouvik Palit</title>
    <description>The latest articles on DEV Community by Shouvik Palit (@shouvik12).</description>
    <link>https://dev.to/shouvik12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890793%2Fc3274ea7-2b55-4b67-9d70-2f3c08b63374.png</url>
      <title>DEV Community: Shouvik Palit</title>
      <link>https://dev.to/shouvik12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shouvik12"/>
    <language>en</language>
    <item>
      <title>Stop explaining yourself to Claude</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Fri, 12 Jun 2026 04:26:55 +0000</pubDate>
      <link>https://dev.to/shouvik12/-stop-explaining-react-to-claude-47f</link>
      <guid>https://dev.to/shouvik12/-stop-explaining-react-to-claude-47f</guid>
      <description>&lt;p&gt;You're wasting tokens. Not a little. A lot.&lt;/p&gt;

&lt;p&gt;Here's a prompt I see constantly:&lt;/p&gt;

&lt;p&gt;"I have a React app and I'm using the useState hook. My component re-renders every time the parent renders even though the props haven't changed. Why is this happening?"&lt;/p&gt;

&lt;p&gt;Claude doesn't need any of that setup. It already knows React. It already knows what useState is. All it needs is:&lt;/p&gt;

&lt;p&gt;"Component re-renders on parent render. Props unchanged. Why."&lt;/p&gt;

&lt;p&gt;Same answer. 64% fewer tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The delta principle
&lt;/h2&gt;

&lt;p&gt;Most prompts are written for humans. We explain context, name the framework, describe how things work before asking the question. That's how we communicate with each other.&lt;/p&gt;

&lt;p&gt;But Claude already knows the context. The only thing it needs is the &lt;strong&gt;delta&lt;/strong&gt; — the new information, the specific problem, the unknown.&lt;/p&gt;

&lt;p&gt;Everything else is noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you can safely strip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude already knows these — stop re-explaining them:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Framework names used as context ("I have a React app", "I'm using Python")&lt;/li&gt;
&lt;li&gt;Concept explanations ("hooks are a React feature that...", "a closure is when...")&lt;/li&gt;
&lt;li&gt;Stack introductions ("my app uses Node, Express, and MongoDB")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Social noise that adds zero signal:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pleasantries: "hey", "hope you can help", "thanks in advance"&lt;/li&gt;
&lt;li&gt;Permission requests: "could you please", "I was wondering if"&lt;/li&gt;
&lt;li&gt;Hedging: "I think", "I'm not sure but", "maybe", "possibly"&lt;/li&gt;
&lt;li&gt;Filler: "basically", "essentially", "just", "simply", "actually"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you should never strip:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The actual error, bug, or problem&lt;/li&gt;
&lt;li&gt;Numbers, thresholds, measurements&lt;/li&gt;
&lt;li&gt;Variable names, function names, file names&lt;/li&gt;
&lt;li&gt;Code blocks and URLs&lt;/li&gt;
&lt;li&gt;Anything Claude could NOT know without being told&lt;/li&gt;
&lt;/ul&gt;
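
&lt;p&gt;The strip rules above can be sketched as a small filter. This is a minimal illustration, not the rule set from any real tool: the phrase list and the &lt;code&gt;stripNoise&lt;/code&gt; name are assumptions, and the filter only touches social noise, so identifiers, numbers, and file names pass through unchanged.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// noisePhrases is an illustrative list drawn from the bullets above;
// it is an assumption, not the rule set shipped in any real tool.
var noisePhrases = []string{
	"hope you can help",
	"thanks in advance",
	"could you please",
	"i was wondering if",
	"basically",
	"essentially",
}

var spaces = regexp.MustCompile(`\s+`)

// stripNoise removes each noise phrase (case-insensitively, with one
// trailing punctuation mark if present) and collapses leftover whitespace.
func stripNoise(prompt string) string {
	out := prompt
	for _, phrase := range noisePhrases {
		re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(phrase) + `\b[,.!]?`)
		out = re.ReplaceAllString(out, "")
	}
	return strings.TrimSpace(spaces.ReplaceAllString(out, " "))
}

func main() {
	fmt.Println(stripNoise("Could you please review parseConfig in config.go? Thanks in advance!"))
}
```

&lt;p&gt;A phrase list like this is deliberately conservative: it can only delete text it recognises, so anything Claude could not know without being told is left alone.&lt;/p&gt;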




&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Debugging:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (41 tokens):
"I'm working on a Node.js Express API and I'm getting a 401 unauthorized 
error when I try to call the endpoint. I'm passing the JWT token in the 
Authorization header."

After (12 tokens):
"401 on endpoint. JWT in Authorization header."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code review:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (29 tokens):
"Could you please review this Python function and tell me if there are 
any issues or improvements I could make?"

After (6 tokens):
"Review. Issues + improvements."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (19 tokens):
"I was wondering if you could explain how database connection pooling 
works in simple terms?"

After (5 tokens):
"Explain connection pooling. Simple."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The compound effect
&lt;/h2&gt;

&lt;p&gt;Single-prompt savings look small, but across a real session they compound.&lt;/p&gt;

&lt;p&gt;Here's a simulated 20-turn dev session — the kind where you're debugging something across multiple back-and-forths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Verbose (tokens)&lt;/th&gt;
&lt;th&gt;Delta (tokens)&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;774&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;226&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;548&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;548 tokens saved in a single session. A 71% reduction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On Claude's API at Sonnet pricing, that's a small number in dollars. But if you're building on top of the API and running hundreds of sessions a day, it adds up fast. And even on claude.ai, fewer input tokens means less context noise — Claude processes cleaner signal and responds more precisely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three intensity levels
&lt;/h2&gt;

&lt;p&gt;Not every prompt needs ultra-compression. I use three modes depending on the situation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lite&lt;/strong&gt; — strip pleasantries only, keep context (~20% reduction)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use when: onboarding a new topic, first message in a session&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;full&lt;/strong&gt; — strip everything Claude knows, keep only the delta (~60% reduction)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use when: mid-session debugging, iterating on code, quick questions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;ultra&lt;/strong&gt; — compress to bare minimum signal (~70%+ reduction)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use when: you know exactly what you want and don't care about polish&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The skill file
&lt;/h2&gt;

&lt;p&gt;I turned this into a Claude skill — a markdown file that instructs Claude to apply delta compression automatically, with activation/deactivation commands and intensity switching.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/shouvik12/delta" rel="noopener noreferrer"&gt;github.com/shouvik12/delta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README has the full rule set, intensity examples, and instructions for adding it to your Claude setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  One thing worth thinking about
&lt;/h2&gt;

&lt;p&gt;This is a small optimization. But the principle behind it is bigger:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We've been writing prompts for humans.&lt;/strong&gt; We explain, we hedge, we contextualize — because that's how we earn understanding from other people. With LLMs, that overhead is waste. The model doesn't need to be convinced you know what you're talking about. It doesn't need the social scaffolding.&lt;/p&gt;

&lt;p&gt;Just send the delta.&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
    </item>
    <item>
      <title>Escalate the Model, Not the Conversation</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Wed, 10 Jun 2026 19:03:23 +0000</pubDate>
      <link>https://dev.to/shouvik12/escalate-the-model-not-the-conversation-4n4k</link>
      <guid>https://dev.to/shouvik12/escalate-the-model-not-the-conversation-4n4k</guid>
      <description>&lt;p&gt;Trooper started as a fallback proxy for agents. Claude hits a quota, falls back to Ollama, session continues. No crashes, no lost context.&lt;/p&gt;

&lt;p&gt;The interesting problem that came up wasn't model routing. It was context preservation.&lt;/p&gt;

&lt;p&gt;When you're debugging something hard, you build up context over many turns. The problem statement, what you've tried, what failed. When you switch from a local model to Claude, all of that context has to go with it. And when you come back to local, the local model needs to know what Claude said.&lt;/p&gt;

&lt;p&gt;That's what 4.0 solves.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;A local model handles requests by default. Fast, free, private.&lt;/p&gt;

&lt;p&gt;When it gets stuck, one click escalates to Claude — the full conversation history is injected automatically. Claude answers. Then control returns to the local model, which continues the conversation knowing exactly what Claude said.&lt;/p&gt;

&lt;p&gt;No copy-pasting. No restarting the conversation. No lost context.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0buzqpsmk689hxdyj5zj.png" alt=" " width="800" height="693"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The escalation moment
&lt;/h2&gt;

&lt;p&gt;You're debugging a slow Postgres query. Llama gives you a decent answer — check your EXPLAIN output, look for function calls on indexed columns. Good start.&lt;/p&gt;

&lt;p&gt;Not enough. You hit Escalate.&lt;/p&gt;

&lt;p&gt;Claude receives the full session. It knows you're debugging a slow query. It knows what Llama already told you. It picks up exactly where the conversation left off.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyfvbf68v53d9j87xgev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyfvbf68v53d9j87xgev.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flho0rxtc4xldue9dr0z4.png" alt=" " width="800" height="1054"&gt;&lt;/p&gt;

&lt;p&gt;You click Back to local.&lt;/p&gt;

&lt;p&gt;Now ask Llama to summarize what Claude said.&lt;/p&gt;

&lt;p&gt;It does. Correctly. Because the session store was updated with Claude's response. Llama reads the full history including what Claude said and continues from there.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4s9hjglp4k9slzem6vg.png" alt=" " width="800" height="996"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's under the hood
&lt;/h2&gt;

&lt;p&gt;Trooper is a Go proxy that sits between your client and any LLM provider. The chat UI is a static HTML file served by the same process.&lt;/p&gt;

&lt;p&gt;When you escalate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The UI fetches the full session history from &lt;code&gt;/session/:id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sends it to Claude via &lt;code&gt;/v1/messages&lt;/code&gt; with &lt;code&gt;X-Force-Cloud: true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Claude's response gets written back to the session store via &lt;code&gt;/session/:id/append&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Next local turn, Llama reads the full history including Claude's response&lt;/li&gt;
&lt;/ol&gt;
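
&lt;p&gt;The four steps above can be sketched with an in-memory store. This is an illustration under stated assumptions, not Trooper's code: &lt;code&gt;SessionStore&lt;/code&gt;, &lt;code&gt;Model&lt;/code&gt;, and &lt;code&gt;escalate&lt;/code&gt; are hypothetical names, and the HTTP calls to &lt;code&gt;/session/:id&lt;/code&gt; and &lt;code&gt;/v1/messages&lt;/code&gt; are replaced by plain function calls.&lt;/p&gt;

```go
package main

import "fmt"

// Message is one turn in a session.
type Message struct {
	Role    string
	Content string
}

// SessionStore is a hypothetical in-memory stand-in for the session store.
type SessionStore map[string][]Message

// Model is any function that turns a history into a reply.
type Model func(history []Message) string

// escalate sends the full history of one session to the cloud model and
// appends the reply, so the next local turn can read it.
func escalate(store SessionStore, id string, cloud Model) string {
	history := store[id]    // step 1: fetch the full session history
	reply := cloud(history) // step 2: send it to the cloud model
	// step 3: write the cloud reply back to the session store
	store[id] = append(history, Message{Role: "assistant", Content: reply})
	return reply
}

func main() {
	store := SessionStore{"s1": {
		{Role: "user", Content: "Why is this Postgres query slow?"},
		{Role: "assistant", Content: "Check the EXPLAIN output."},
	}}
	// Stub models standing in for Claude and the local Llama.
	cloud := func(h []Message) string { return fmt.Sprintf("cloud reply after %d messages", len(h)) }
	local := func(h []Message) string { return "local model sees: " + h[len(h)-1].Content }

	escalate(store, "s1", cloud)
	// step 4: the local model reads the history, including the cloud reply
	fmt.Println(local(store["s1"]))
}
```

&lt;p&gt;The point of the design is that the store, not either model, owns the conversation, so switching models never loses a turn.&lt;/p&gt;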

&lt;p&gt;The SITREP panel on the right extracts intent, confidence, entities and open loops from the conversation using a rule-based classifier — no LLM call needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The proxy still works
&lt;/h2&gt;

&lt;p&gt;The proxy layer is unchanged. Agents, SDK clients, curl — everything routes through &lt;code&gt;/v1/messages&lt;/code&gt; the same way it always did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Agent flow — unchanged&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000

&lt;span class="c"&gt;# Chat UI — new in 4.0&lt;/span&gt;
open http://localhost:3000/chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...  &lt;span class="c"&gt;# optional&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;llama3.1:8b
go run &lt;span class="nb"&gt;.&lt;/span&gt;
open http://localhost:3000/chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works without a Claude key: escalation falls back to Ollama. Add the key when you want real cloud escalation.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;https://github.com/shouvik12/trooper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>go</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Cut Agent Token Usage by 89% Without Touching the Agent</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Fri, 05 Jun 2026 13:10:22 +0000</pubDate>
      <link>https://dev.to/shouvik12/how-i-cut-agent-token-usage-by-89-without-touching-the-agent-3g7o</link>
      <guid>https://dev.to/shouvik12/how-i-cut-agent-token-usage-by-89-without-touching-the-agent-3g7o</guid>
      <description>&lt;p&gt;Every time your agent calls an LLM, it sends the full conversation history.&lt;/p&gt;

&lt;p&gt;Turn 20 includes turns 1–19. Turn 50 includes turns 1–49. Nobody notices because it's happening inside the agent, silently, on every single request.&lt;/p&gt;

&lt;p&gt;I noticed it while building &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;Trooper&lt;/a&gt;, a Go proxy that sits between agents and LLMs. I was watching token counts climb across a long debugging session and realised the agent was replaying the same context over and over. Most of it was noise.&lt;/p&gt;

&lt;p&gt;The model didn't need a transcript. It needed state.&lt;/p&gt;




&lt;h2&gt;
  
  
  What state actually means
&lt;/h2&gt;

&lt;p&gt;After a few turns, most of what matters in a session falls into four categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decisions made&lt;/strong&gt; — what was chosen and why&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints locked&lt;/strong&gt; — what cannot change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open loops&lt;/strong&gt; — what still needs to be resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ruled out&lt;/strong&gt; — what was tried and rejected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Everything else — the back and forth, the verbose LLM responses explaining things, the repeated context — is replay. The model doesn't need it again.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SITREP
&lt;/h2&gt;

&lt;p&gt;I added structured session memory to Trooper. After enough turns, Trooper's local Llama model generates a SITREP — a situation report — from the user messages in the session.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INTENT: Build a RAG pipeline with ChromaDB and nomic-embed-text

DECISIONS: Use cosine similarity over MMR — focused queries not broad;
           Chunk size 256, overlap 30 — locked;
           Pure vector search — ChromaDB no hybrid support;
           Top k set to 5

CONSTRAINTS: Node 18 locked — platform team constraint, no exceptions;
             Re-ranking ruled out — latency jumped 200ms to 800ms

OPEN: Poor recall on technical queries — nomic-embed-text struggles with domain jargon;
      Evaluating bge-small as alternative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
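
&lt;p&gt;One way to hold a SITREP like the one above in code is a small struct with a text renderer. This is a sketch, not Trooper's internal type: the &lt;code&gt;Sitrep&lt;/code&gt; struct and its &lt;code&gt;Render&lt;/code&gt; method are hypothetical, and the section labels simply mirror the sample, where ruled-out items sit under constraints.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// Sitrep is a hypothetical container for the structured session state.
type Sitrep struct {
	Intent      string
	Decisions   []string
	Constraints []string
	Open        []string
}

// Render prints the state in a labelled text layout similar to the sample,
// joining items with "; " and skipping categories that are empty.
func (s Sitrep) Render() string {
	var b strings.Builder
	b.WriteString("INTENT: " + s.Intent + "\n")
	sections := []struct {
		label string
		items []string
	}{
		{"DECISIONS", s.Decisions},
		{"CONSTRAINTS", s.Constraints},
		{"OPEN", s.Open},
	}
	for _, sec := range sections {
		if len(sec.items) == 0 {
			continue // nothing recorded for this category yet
		}
		b.WriteString("\n" + sec.label + ": " + strings.Join(sec.items, "; ") + "\n")
	}
	return b.String()
}

func main() {
	s := Sitrep{
		Intent:      "Build a RAG pipeline with ChromaDB and nomic-embed-text",
		Decisions:   []string{"Chunk size 256, overlap 30", "Top k set to 5"},
		Constraints: []string{"Node 18 locked"},
		Open:        []string{"Poor recall on technical queries"},
	}
	fmt.Print(s.Render())
}
```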



&lt;p&gt;From that point forward, every request to the LLM sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anchor (first 2 turns verbatim)
+ SITREP (structured state)
+ Tail (last N turns verbatim)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of the full history.&lt;/p&gt;




&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;From a real 15-turn session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Full history:    10,820 tokens per request
With Trooper:     1,157 tokens per request
Reduction:             89%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visible live on the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6dlkrn4g7kbev71864i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6dlkrn4g7kbev71864i.png" alt=" " width="799" height="621"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Does the LLM still answer correctly?
&lt;/h2&gt;

&lt;p&gt;This was the question that mattered. Token savings are worthless if the model loses coherence.&lt;/p&gt;

&lt;p&gt;To test it: I took the auto-generated SITREP, opened a completely fresh chat with no history, and asked questions about decisions made in the original session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the chunk size?&lt;/li&gt;
&lt;li&gt;Why did we rule out hybrid search?&lt;/li&gt;
&lt;li&gt;What retrieval method did we choose and why?&lt;/li&gt;
&lt;li&gt;What is still open?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; All four answered correctly. The model worked entirely from the SITREP. No history. No context bleed.&lt;/p&gt;

&lt;p&gt;That's the claim: structured state is sufficient for the model to continue reasoning correctly — and it costs 89% less to send.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Trooper is a Go proxy — one binary, no SDK, no instrumentation. You point your existing agent at it by changing one URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://api.anthropic.com

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing else changes. Trooper intercepts every request, maintains session state, and when the SITREP is ready, rewrites the messages array before forwarding to the LLM.&lt;/p&gt;

&lt;p&gt;The SITREP is built by a local Llama 3.1 8b model running via Ollama — fast, private, no cloud cost. The extraction happens asynchronously in the background. The main request path is not blocked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// GetTripleAnchor assembles what gets sent to the LLM&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SessionStore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GetTripleAnchor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Anchor&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SITREP&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[STATE_SITREP: %s]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SITREP&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tail&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard shows the compression ratio live:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HISTORY COMPRESSED    89%
TOKENS SAVED          459
CONFIDENCE            100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why this is different from conversation summarisation
&lt;/h2&gt;

&lt;p&gt;Most summarisation tools compress what was said. The SITREP extracts what matters for the next action.&lt;/p&gt;

&lt;p&gt;Copilot's context compaction summarises the full conversation — useful for humans in long chats. The SITREP is structured specifically for agents: decisions, constraints, open loops, ruled-out paths. Not a narrative summary. A state snapshot.&lt;/p&gt;

&lt;p&gt;The result is that subsequent turns stay coherent on intent without replaying noise. More relevant for agents running repeated structured workflows than for general chat.&lt;/p&gt;




&lt;h2&gt;
  
  
  The limitation
&lt;/h2&gt;

&lt;p&gt;The SITREP works best for structured agentic workflows — debugging sessions, research pipelines, multi-step build tasks. For open-ended creative work where tangential context might become important later, you'd want a larger tail window or higher fidelity compression.&lt;/p&gt;

&lt;p&gt;The tail window is configurable. You can keep more raw context for less structured sessions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What else Trooper does
&lt;/h2&gt;

&lt;p&gt;The compression is the latest addition. Trooper also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Falls back to local Ollama when cloud quota hits — context preserved across the switch&lt;/li&gt;
&lt;li&gt;Routes simple turns to Ollama automatically — cloud never contacted&lt;/li&gt;
&lt;li&gt;Privacy routing — sensitive requests stay local via &lt;code&gt;x_force_local&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Live dashboard — intent, open loops, completed steps, transcript&lt;/li&gt;
&lt;li&gt;Subagent recovery — &lt;code&gt;/recovery/{session_id}&lt;/code&gt; tells you exactly where to resume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All from one URL change.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bigger question
&lt;/h2&gt;

&lt;p&gt;We tend to treat conversation history as memory. But a transcript is a log. Memory is state.&lt;/p&gt;

&lt;p&gt;Humans don't replay every prior conversation before making a decision. They carry forward conclusions, constraints, unresolved questions, and relevant context — a structured snapshot, not a full transcript.&lt;/p&gt;

&lt;p&gt;Long-running agents may need to do the same. Not because of token costs — though that helps — but because state is a better abstraction for agent memory than history.&lt;/p&gt;

&lt;p&gt;The SITREP is an experiment in that direction.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;github.com/shouvik12/trooper&lt;/a&gt; — Go, MIT, zero dependencies beyond Ollama.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>I Added a Live Dashboard to My LLM Proxy. Zero Instrumentation. Just a URL Change.</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Thu, 28 May 2026 15:24:53 +0000</pubDate>
      <link>https://dev.to/shouvik12/-i-added-a-live-dashboard-to-my-llm-proxy-zero-instrumentation-just-a-url-change-12k4</link>
      <guid>https://dev.to/shouvik12/-i-added-a-live-dashboard-to-my-llm-proxy-zero-instrumentation-just-a-url-change-12k4</guid>
      <description>&lt;p&gt;I built Trooper as a fallback proxy. Claude hits quota → falls back to Ollama. Useful but passive. It sat in the background, invisible, doing its job silently.&lt;/p&gt;

&lt;p&gt;Today it became something different.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Problem
&lt;/h2&gt;

&lt;p&gt;When you're building with LLMs, quota hits are inevitable. Claude's free tier is generous until it isn't. A mid-session 429 kills your context, your workflow, your train of thought.&lt;/p&gt;

&lt;p&gt;Trooper solved that. Point your app at &lt;code&gt;http://localhost:3000&lt;/code&gt; instead of the Claude API. When Claude fails, Trooper catches it, preserves the full session context via a 3-layer compaction system (Anchor + SITREP + Tail), and continues on local Ollama. Your app never knows anything happened.&lt;/p&gt;

&lt;p&gt;That's still there. But it was passive. Useful when things broke. Invisible when they didn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Passive
&lt;/h2&gt;

&lt;p&gt;Passive infrastructure has an adoption problem. Developers install it, forget about it, and only notice it when something breaks. That's not a product — that's a safety net.&lt;/p&gt;

&lt;p&gt;The question I kept asking: what does Trooper do that has daily value, not just failure value?&lt;/p&gt;

&lt;p&gt;The answer was sitting in the code the whole time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Already There
&lt;/h2&gt;

&lt;p&gt;Trooper captures every message in every session. It runs a classifier on each one — extracting intent, entities, open loops, completed steps, recent actions. All rule-based, zero LLM calls, zero latency.&lt;/p&gt;

&lt;p&gt;This is what powers the fallback context preservation. When Claude fails and Ollama picks up, Ollama doesn't start blind — it receives a SITREP (Situation Report) that tells it what the session was about, what was completed, what's still pending.&lt;/p&gt;

&lt;p&gt;That data exists for every session. It was just never visible to the developer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;I added a live dashboard at &lt;code&gt;localhost:3000/dashboard&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Point any agent at Trooper — just change your base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;span class="c"&gt;# or&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the dashboard. Keep it on a second monitor while your agent runs.&lt;/p&gt;

&lt;p&gt;From a single message, it already shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt; — what your agent is trying to do, extracted automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Loops&lt;/strong&gt; — what it's stuck on, highlighted in red&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completed Steps&lt;/strong&gt; — what it finished, tracked as it happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entities&lt;/strong&gt; — the key things being referenced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Transcript&lt;/strong&gt; — every message, colour coded by role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Auto-refreshes every 5 seconds. No page reload needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Looks Like In Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cprikgdz65ls8uy4l5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cprikgdz65ls8uy4l5k.png" alt=" " width="800" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ran a 3-turn agent session simulating a database debugging workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1:&lt;/strong&gt; "I am building a Go API server. The database connection is failing with connection refused errors on port 5432."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 2:&lt;/strong&gt; "Checked the config. Postgres is running on port 5433 not 5432. Fixing the connection string now."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 3:&lt;/strong&gt; "Fixed the port. Database connection is working. API server is running successfully."&lt;/p&gt;

&lt;p&gt;After Turn 1, the dashboard immediately showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intent: "building a go api server. the database connection is failing with connection refused errors on port" (100% confidence)&lt;/li&gt;
&lt;li&gt;Entities: Postgres, network&lt;/li&gt;
&lt;li&gt;Open Loops: "fail with connection refused error on port"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After Turn 3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completed Steps: "successfully fixed the port"&lt;/li&gt;
&lt;li&gt;Open loops cleared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero instrumentation. No SDK. No code changes to the agent. Just a URL change.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most LLM observability tools require you to instrument your code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; — wrap your agent in LangChain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; — add their SDK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentOps&lt;/strong&gt; — add &lt;code&gt;@observe&lt;/code&gt; decorators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trooper&lt;/strong&gt; requires nothing. Your agent already communicates over HTTP to an LLM. Trooper sits in that path and observes everything automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; was the closest — proxy-based, zero instrumentation. But it went into maintenance mode in March 2026 and was cloud-only. Your data went to their servers.&lt;/p&gt;

&lt;p&gt;Trooper is open source, local-first, and free forever. Your data never leaves your machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Sessions Endpoint
&lt;/h2&gt;

&lt;p&gt;Not sure which session to look at? Hit &lt;code&gt;/sessions&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/sessions
&lt;span class="c"&gt;# {"sessions":["agent-debug-123","agent-debug-456"],"count":2}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click any session in the dashboard home page at &lt;code&gt;localhost:3000/dashboard&lt;/code&gt; to open it.&lt;/p&gt;
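&lt;p&gt;As a rough sketch of how a script might consume that response, the snippet below parses the sample payload shown above and builds one recovery URL per session. The helper name &lt;code&gt;recovery_urls&lt;/code&gt; and its default base URL are illustrative assumptions, not part of Trooper; only the response shape and the &lt;code&gt;/recovery/{session_id}&lt;/code&gt; path come from this post.&lt;/p&gt;

```python
import json

# Sample body copied from the curl output above.
SAMPLE = '{"sessions":["agent-debug-123","agent-debug-456"],"count":2}'


def recovery_urls(body, base_url="http://localhost:3000"):
    """Return one /recovery/{session_id} URL per session in a /sessions body."""
    payload = json.loads(body)
    # Hypothetical choice: treat a missing "sessions" key as an empty list.
    sessions = payload.get("sessions", [])
    return [f"{base_url}/recovery/{session_id}" for session_id in sessions]


if __name__ == "__main__":
    for url in recovery_urls(SAMPLE):
        print(url)
```

&lt;p&gt;Keeping the parsing separate from the HTTP call means the same helper works whether the body comes from curl, a test fixture, or a live request.&lt;/p&gt;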




&lt;h2&gt;
  
  
  The Recovery Endpoint
&lt;/h2&gt;

&lt;p&gt;Still there. When an agent fails mid-task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/recovery/&lt;span class="o"&gt;{&lt;/span&gt;session_id&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns exactly what completed and where to resume. The dashboard makes this visual — completed steps are tracked in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pivot
&lt;/h2&gt;

&lt;p&gt;Trooper started as: &lt;em&gt;"Claude failed. Trooper caught it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Trooper is now: &lt;em&gt;"Your agent communicates over HTTP to an LLM. Trooper can observe it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The fallback is a feature. Observability is the product.&lt;/p&gt;

&lt;p&gt;Your agent was always talking. Now you can hear it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:3000/dashboard&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;p&gt;Point your agent at Trooper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Zero dependencies. Pure Go. Up and running in under 60 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;





</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Added a /recovery Endpoint to My LLM Proxy So Agents Never Lose Progress Mid-Task</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 24 May 2026 17:28:55 +0000</pubDate>
      <link>https://dev.to/shouvik12/i-added-a-recovery-endpoint-to-my-llm-proxy-so-agents-never-lose-progress-mid-task-524b</link>
      <guid>https://dev.to/shouvik12/i-added-a-recovery-endpoint-to-my-llm-proxy-so-agents-never-lose-progress-mid-task-524b</guid>
      <description>&lt;p&gt;Most LLM proxies handle failures the same way: retry the request, fall back to another provider, or crash. None of them ask the more important question: &lt;strong&gt;what did the agent already complete before it failed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the gap I built Trooper to fill.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you're running multi-agent workflows, you've probably hit this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A subagent starts a long task: reviewing PRs, processing documents, or running analysis&lt;/li&gt;
&lt;li&gt;It completes steps 1, 2, and 3&lt;/li&gt;
&lt;li&gt;On step 4 it hits a quota error, rate limit, or provider failure&lt;/li&gt;
&lt;li&gt;Your orchestration layer has no idea what completed&lt;/li&gt;
&lt;li&gt;You restart from scratch and repeat all the work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most proxies handle the failure. Nobody handles the recovery.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Trooper Does Differently
&lt;/h2&gt;

&lt;p&gt;Trooper is a Go-based LLM proxy that sits between your agents and your LLM providers. It already handled fallback routing: if Claude hits quota, it falls back to local Ollama automatically.&lt;/p&gt;

&lt;p&gt;But the new &lt;code&gt;/recovery/{session_id}&lt;/code&gt; endpoint goes further. It tracks every step your agent completes in real time and tells your orchestration layer exactly where to resume.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recovery Endpoint
&lt;/h2&gt;

&lt;p&gt;When your agent sends requests through Trooper, it captures every assistant response and extracts completed steps as they happen. When something fails, you call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET http://localhost:3000/recovery/{session_id}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And Trooper returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"subagent-demo-1779630533"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"completed_steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"completed pr #1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"completed pr #2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"completed pr #3"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resume_from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recovery_hint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Resume from step 4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your parent agent now knows exactly what the subagent finished and where to restart it. No repeated work. No lost progress.&lt;/p&gt;
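&lt;p&gt;To make the hand-off concrete, here is a minimal sketch of how a parent agent might turn that payload into the list of work still to do. It assumes the orchestrator already holds an ordered plan and that &lt;code&gt;resume_from&lt;/code&gt; is a 1-based step number, as the example suggests; the function name &lt;code&gt;remaining_tasks&lt;/code&gt; and the 8-PR plan are hypothetical.&lt;/p&gt;

```python
import json

# Payload shape copied from the example response above.
SAMPLE = """{
  "session_id": "subagent-demo-1779630533",
  "completed_steps": ["completed pr #1", "completed pr #2", "completed pr #3"],
  "resume_from": 4,
  "recovery_hint": "Resume from step 4"
}"""


def remaining_tasks(recovery_body, all_tasks):
    """Return the tasks still to run, given a recovery payload and the full plan."""
    payload = json.loads(recovery_body)
    # resume_from is 1-based in the example, so convert it to a list index.
    start_index = max(payload["resume_from"] - 1, 0)
    return all_tasks[start_index:]


if __name__ == "__main__":
    plan = [f"review pr #{n}" for n in range(1, 9)]  # hypothetical 8-PR plan
    print(remaining_tasks(SAMPLE, plan))
```

&lt;p&gt;With the sample payload and an 8-item plan, the slice starts at the fourth task, so the first three reviews are not repeated.&lt;/p&gt;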




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;An agent reviewing 8 pull requests hits quota on PR #4. Trooper intercepts, returns the recovery payload, and the agent resumes from PR #4 using local Ollama.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=NN2uwQZDCck" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=NN2uwQZDCck&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Trooper uses a two-tier memory system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor&lt;/strong&gt; — the first two turns of a session, always preserved verbatim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tail&lt;/strong&gt; — the most recent turns, stored in a rolling window.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;/recovery&lt;/code&gt;, Trooper scans all stored assistant messages for completion signals — words like "completed", "finished", "done", "merged", "deployed". It extracts one completed step per message, deduplicates by task identifier, and returns the ordered list.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;resume_from&lt;/code&gt; field is simply &lt;code&gt;len(completed_steps) + 1&lt;/code&gt; — telling your orchestration layer which step to restart the subagent on.&lt;/p&gt;
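&lt;p&gt;The scan described above can be approximated in a few lines. This is a sketch under stated assumptions, not Trooper's Go implementation: the signal words are the ones listed in this post, but the regular expression, the rule that a step runs to the end of its sentence, and deduplication on the normalised step text (standing in for the task identifier) are all illustrative choices.&lt;/p&gt;

```python
import re

# Signal words listed in the post; the matching rules below are assumptions.
SIGNALS = ("completed", "finished", "done", "merged", "deployed")
SIGNAL_RE = re.compile(r"\b(" + "|".join(SIGNALS) + r")\b[^.!?\n]*", re.IGNORECASE)


def extract_completed_steps(assistant_messages):
    """Return (completed_steps, resume_from) for a list of assistant messages."""
    steps = []
    seen = set()
    for message in assistant_messages:
        match = SIGNAL_RE.search(message)
        if match is None:
            continue
        # One step per message: the signal word plus the rest of its sentence.
        step = " ".join(match.group(0).lower().split())
        if step not in seen:  # dedupe on the normalised step text
            seen.add(step)
            steps.append(step)
    # Same arithmetic as the post: resume on the step after the last completed one.
    return steps, len(steps) + 1
```

&lt;p&gt;Keyword matching like this is cheap but blunt: a message such as "not done yet" would still count as a completion, which is one reason structured progress output is attractive for stricter workflows.&lt;/p&gt;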




&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start Trooper&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Point your agent at Trooper instead of Claude&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Instead of https://api.anthropic.com/v1/messages&lt;/span&gt;
POST http://localhost:3000/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Pass a session ID with each request&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:3000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"X-Session-ID: my-agent-session-1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"claude-haiku-4-5","max_tokens":100,"messages":[...]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Call recovery when something fails&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/recovery/my-agent-session-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The recovery endpoint is the foundation for proper subagent orchestration. Upcoming work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parent agent integration&lt;/strong&gt; — the recovery payload feeds directly back into the orchestration layer to automatically restart subagents from the right step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured step tracking&lt;/strong&gt; — support for agents that emit structured JSON progress instead of natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session replay&lt;/strong&gt; — rewind any session to any point and branch from there&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;As agent workflows get longer and more complex, failure recovery becomes a first-class concern. Trooper's approach — track everything as it happens, make recovery queryable — is a different philosophy from retry-and-hope.&lt;/p&gt;

&lt;p&gt;Local-first by default. Cloud when you choose. And now recoverable when things go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;https://github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;How are you handling agent failures in your workflows today? Drop a comment; I'm genuinely curious what patterns people are using.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Tested Privacy-Aware Routing with 4 AI Agents: What Actually Stayed Local</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Wed, 13 May 2026 02:40:52 +0000</pubDate>
      <link>https://dev.to/shouvik12/i-tested-privacy-aware-routing-with-4-ai-agents-what-actually-stayed-local-39oa</link>
      <guid>https://dev.to/shouvik12/i-tested-privacy-aware-routing-with-4-ai-agents-what-actually-stayed-local-39oa</guid>
      <description>&lt;p&gt;Following up on my &lt;a href="https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5"&gt;earlier Trooper experiments&lt;/a&gt;, I wanted to see if per-request privacy routing actually works in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The test:&lt;/strong&gt; 4 agents running simultaneously. Some handling public knowledge (OAuth security, Redis vs Memcached). Others handling sensitive data (API keys, customer PII).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; Credentials and PII stay on my machine. Everything else can use Claude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Each agent gets an &lt;code&gt;x_force_local&lt;/code&gt; flag:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent 1 - security-analyst (☁️ Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "What are the top 3 OAuth2 vulnerabilities?"  
Routing: Public knowledge, let Claude handle it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent 2 - credential-formatter (🔒 Qwen local)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Format as JSON: api_key=sk-prod-x7f9k2m, vault_url=https://vault.acme.io:8200"  
Routing: Contains credentials, must stay on machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent 3 - architecture-advisor (☁️ Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Redis or Memcached for session storage?"  
Routing: General best practices, use cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent 4 - compliance-reporter (🔒 Qwen local)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Summarize: 47 tickets today. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)"  
Routing: Contains customer names, privacy violation if sent to cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvsb7hdks8f7yurzyfn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvsb7hdks8f7yurzyfn5.png" alt=" " width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every agent completed successfully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud agents:&lt;/strong&gt; 3.8s and 2.4s (Claude handled complex reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local agents:&lt;/strong&gt; 2.4s and 1.2s (Qwen formatted data locally)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The critical part:&lt;/strong&gt; API keys, vault URLs, and customer names never left my machine. Zero network calls to Anthropic for those two agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened Under the Hood
&lt;/h2&gt;

&lt;p&gt;When Agent 2 (credential-formatter) ran with &lt;code&gt;x_force_local: true&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request intercepted by Trooper proxy&lt;/li&gt;
&lt;li&gt;Privacy flag detected&lt;/li&gt;
&lt;li&gt;Routed to local Ollama instead of Claude API&lt;/li&gt;
&lt;li&gt;Session context maintained via 3-layer system (Anchor/SITREP/Tail)&lt;/li&gt;
&lt;li&gt;JSON response returned — credentials never hit the network&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The vault URL and API key stayed on my hardware.&lt;/p&gt;
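&lt;p&gt;The routing decision in steps 2 and 3 can be sketched as a small pure function. The backend labels &lt;code&gt;"ollama"&lt;/code&gt; and &lt;code&gt;"claude"&lt;/code&gt;, the function names, and the idea of stripping the flag before forwarding are assumptions made for illustration; the post only states that &lt;code&gt;x_force_local: true&lt;/code&gt; sends the request to local Ollama.&lt;/p&gt;

```python
def choose_backend(request_body):
    """Pick a backend label from a parsed chat-completions request body."""
    # Only an explicit boolean True forces local routing in this sketch.
    if request_body.get("x_force_local") is True:
        return "ollama"
    return "claude"


def strip_routing_flag(request_body):
    """Return a copy of the body without the proxy-only routing flag."""
    return {key: value for key, value in request_body.items() if key != "x_force_local"}


if __name__ == "__main__":
    body = {"model": "claude-sonnet-4-6", "messages": [], "x_force_local": True}
    print(choose_backend(body), strip_routing_flag(body))
```

&lt;p&gt;Treating anything other than a literal &lt;code&gt;True&lt;/code&gt; as "not forced" is a deliberate choice here; a stricter proxy might instead reject malformed values so that a typo never sends sensitive data to the cloud.&lt;/p&gt;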

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;Using the OpenAI SDK (works with any OpenAI-compatible client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-anthropic-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Trooper proxy
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Regular request → Claude
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OAuth2 vulnerabilities?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Session-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Privacy request → Qwen local
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Format: api_key=sk-prod...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Session-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credential-formatter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x_force_local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# This keeps it local
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire API. One boolean flag controls routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most LLM proxies route between cloud providers. LiteLLM falls back from Claude to OpenAI. That's useful for uptime, but both destinations are someone else's servers.&lt;/p&gt;

&lt;p&gt;Trooper's &lt;code&gt;x_force_local&lt;/code&gt; routes to &lt;strong&gt;your machine&lt;/strong&gt;. Different failure mode, different privacy guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you need it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code refactoring with internal URLs&lt;/li&gt;
&lt;li&gt;Proprietary algorithms (not secret, just yours)&lt;/li&gt;
&lt;li&gt;Customer data that shouldn't leave your network&lt;/li&gt;
&lt;li&gt;Cost control (force expensive operations local)&lt;/li&gt;
&lt;li&gt;Offline work (flights, train rides, API outages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When you don't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public API questions&lt;/li&gt;
&lt;li&gt;General best practices&lt;/li&gt;
&lt;li&gt;Complex reasoning that needs Claude's horsepower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point isn't "local always" or "cloud always." It's per-request control based on what you're asking.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Context Preservation Works
&lt;/h2&gt;

&lt;p&gt;The hardest part of routing isn't switching models — it's maintaining conversation state.&lt;/p&gt;

&lt;p&gt;Trooper uses a 3-layer compaction system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anchor (~10%): First 2 turns verbatim, never dropped  
SITREP (~20%): Rule-based summary of middle turns  
Tail (~70%): Last N turns verbatim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total budget: 6144 tokens (configurable)&lt;/p&gt;

&lt;p&gt;When Agent 4 (compliance-reporter) ran locally, Qwen received the anchor, a compressed SITREP of what Claude said earlier, and the immediate context.&lt;/p&gt;
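&lt;p&gt;To put numbers on those percentages, here is a minimal sketch that splits the 6144-token budget and partitions a list of turns into the three layers. The exact rounding, the function names, and the &lt;code&gt;tail_size&lt;/code&gt; parameter are assumptions; the post gives only the approximate shares and the rule that the first two turns are kept verbatim.&lt;/p&gt;

```python
def split_budget(total_tokens=6144):
    """Split a token budget into anchor, SITREP and tail shares (about 10/20/70)."""
    anchor = total_tokens * 10 // 100
    sitrep = total_tokens * 20 // 100
    # Give the rounding remainder to the tail so the parts sum to the total.
    tail = total_tokens - anchor - sitrep
    return {"anchor": anchor, "sitrep": sitrep, "tail": tail}


def split_turns(turns, tail_size):
    """Partition turns into anchor (first 2), middle (to summarise) and tail (last N)."""
    anchor = turns[:2]
    rest = turns[2:]
    # Guard against tail_size == 0, because rest[-0:] would return everything.
    tail = rest[-tail_size:] if tail_size else []
    middle = rest[: len(rest) - len(tail)]
    return anchor, middle, tail


if __name__ == "__main__":
    print(split_budget())
    print(split_turns(list(range(10)), 3))
```

&lt;p&gt;With the default budget this works out to 614 tokens for the anchor, 1228 for the SITREP, and 4302 for the tail. Only the middle turns are summarised, which is where the lossiness mentioned later in this post comes from.&lt;/p&gt;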

&lt;h2&gt;
  
  
  What Doesn't Work Great
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Local models aren't Claude.&lt;/strong&gt; Qwen 2.5 is fast and solid for structured tasks (JSON formatting, parsing, summarization). But if you need deep reasoning, route to Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context compression is lossy.&lt;/strong&gt; Trooper compresses middle turns into summaries. For precision-critical workflows, keep sessions short or increase the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need Ollama running.&lt;/strong&gt; This isn't plug-and-play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:3b
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;code&gt;qwen2.5:3b&lt;/code&gt; (2GB, fast) for most tasks. I switch to &lt;code&gt;7b&lt;/code&gt; (5GB) when I need better output quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compared to My Previous Post
&lt;/h2&gt;

&lt;p&gt;Last time I showed what happens when Claude quota runs out: Trooper automatically falls back to Ollama with context preserved. That's &lt;strong&gt;reactive&lt;/strong&gt; — something breaks, the system recovers.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;proactive&lt;/strong&gt;: you tell it "keep this request local" before sending. Different problem, same underlying context system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Pull local model&lt;/span&gt;
ollama pull qwen2.5:3b

&lt;span class="c"&gt;# 2. Clone and run Trooper&lt;/span&gt;
git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
go run main.go providers.go classifier.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trooper starts on &lt;code&gt;localhost:3000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Point any OpenAI-compatible client at it and add &lt;code&gt;x_force_local: true&lt;/code&gt; when you want privacy routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;https://github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback welcome — especially on edge cases or use cases I haven't considered.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is v3.1. The x_force_local feature shipped last week. Still iterating on auto-routing classification.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
    </item>
    <item>
      <title>How I built a Go proxy that keeps your LLM conversation alive when cloud quota runs out</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 03 May 2026 01:23:28 +0000</pubDate>
      <link>https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5</link>
      <guid>https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
If you've ever been mid-conversation with Claude or GPT, hit a quota limit, and switched to a local Ollama model, you know the pain. The local model has zero context. It's like walking into a meeting 45 minutes late and nobody catches you up.&lt;br&gt;
I got frustrated enough to build something about it. That something is Trooper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Trooper&lt;/strong&gt;&lt;br&gt;
Trooper is a lightweight Go proxy (~850 lines, two files) that sits between your application and your LLM providers. When a cloud provider returns a quota error (429, 402, 529), Trooper automatically falls back to a local Ollama instance without dropping the conversation context.&lt;br&gt;
Single binary. Zero dependencies. Easy to audit since it sits in front of your API keys.&lt;/p&gt;
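
&lt;p&gt;A minimal sketch of the status check described above, not Trooper's actual code. The names &lt;code&gt;isQuotaError&lt;/code&gt; and &lt;code&gt;pickProvider&lt;/code&gt; are hypothetical; only the three status codes come from the description.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/http"
)

// isQuotaError reports whether an upstream HTTP status should trigger
// fallback to the local model. The three codes are the ones listed above.
func isQuotaError(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429
		http.StatusPaymentRequired, // 402
		529: // non-standard status, so net/http has no constant for it
		return true
	}
	return false
}

// pickProvider is a hypothetical helper that maps a cloud response
// status to the provider that should serve the request.
func pickProvider(cloudStatus int) string {
	if isQuotaError(cloudStatus) {
		return "ollama"
	}
	return "cloud"
}

func main() {
	for _, status := range []int{200, 402, 429, 500, 529} {
		fmt.Printf("%d: %s\n", status, pickProvider(status))
	}
}
```

&lt;p&gt;Note that a plain 500 stays on the cloud path in this sketch: only quota-style failures trigger the switch.&lt;/p&gt;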

&lt;p&gt;&lt;strong&gt;The real problem: context loss on fallback&lt;/strong&gt;&lt;br&gt;
Most fallback proxies solve the routing problem but ignore the context problem. They either pass the raw message history as-is (which blows up the local model's context window) or they truncate the oldest turns (which kills continuity).&lt;br&gt;
Neither works well in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution: three-layer context compaction&lt;/strong&gt;&lt;br&gt;
Trooper uses a structured compaction strategy before handing off to Ollama:&lt;br&gt;
&lt;strong&gt;Anchor&lt;/strong&gt;: The first two turns of the conversation are always preserved. These establish the original intent and set the tone.&lt;br&gt;
&lt;strong&gt;SITREP&lt;/strong&gt;: The middle turns get compressed into a structured summary called a SITREP. It extracts intent, entities, open loops, recent actions, and resolved items. The local model gets situational awareness, not raw history.&lt;br&gt;
&lt;strong&gt;Tail&lt;/strong&gt;: The most recent turns are preserved within a configurable token budget.&lt;/p&gt;

&lt;p&gt;A real compaction run, including its SITREP line, looks like this in the logs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📦  Context compaction triggered — 538 tokens exceeds 500 budget
📦  Context compaction complete
    Total turns    : 7
    Anchor turns   : 2 (~43 tokens)
    Middle turns   : 2 → SITREP (~71 tokens)
    Recent turns   : 3 (~323 tokens)
    Tokens used    : 437 / 500
    SITREP         : intent="trooper" stage=unclear confidence=0.60 open=1 actions=0 resolved=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local model knows what you were working on, what's broken, what's been resolved, and what the last few exchanges were. That's enough to keep the conversation coherent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Go&lt;/strong&gt;&lt;br&gt;
Single binary distribution was the main reason. No runtime, no dependencies, drop it anywhere and it runs. The codebase being ~850 lines also means anyone can read the whole thing in an afternoon — important for something that proxies API keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider support&lt;/strong&gt;&lt;br&gt;
Trooper currently supports Claude, Gemini, and OpenAI as cloud providers with automatic fallback to Ollama. The provider chain is configurable via environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;br&gt;
V3.0 is focused on foundation hardening — concurrency fixes and improved error handling. V3.1 will improve the SITREP extraction quality on longer conversations, which is where intent detection starts to degrade today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;github.com/shouvik12/trooper&lt;/a&gt;&lt;br&gt;
Would love feedback on the context compaction approach — especially from anyone running larger local models. What's your cold-start latency on fallback?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>ai</category>
      <category>go</category>
    </item>
  </channel>
</rss>
