<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shouvik Palit</title>
    <description>The latest articles on DEV Community by Shouvik Palit (@shouvik12).</description>
    <link>https://dev.to/shouvik12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890793%2Fc3274ea7-2b55-4b67-9d70-2f3c08b63374.png</url>
      <title>DEV Community: Shouvik Palit</title>
      <link>https://dev.to/shouvik12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shouvik12"/>
    <language>en</language>
    <item>
      <title>What happens when you give an AI a budget</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 28 Jun 2026 18:22:22 +0000</pubDate>
      <link>https://dev.to/shouvik12/what-happens-when-you-give-an-ai-a-budget-1leg</link>
      <guid>https://dev.to/shouvik12/what-happens-when-you-give-an-ai-a-budget-1leg</guid>
      <description>&lt;p&gt;I gave Claude a budget. It changed what it built.&lt;/p&gt;

&lt;p&gt;Over the past few weeks I've been exploring whether LLMs can work with execution budgets and whether that constraint changes how they behave.&lt;/p&gt;

&lt;p&gt;The short answer: it does.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Researchers studying frontier models found they are consistently over-optimistic about budget. Instead of stopping and alerting the user early, they keep spending tokens on work that's unlikely to succeed. Even after fine-tuning specifically targeting budget awareness, calibration caps at 47%.&lt;/p&gt;

&lt;p&gt;You cannot train your way out of this. External enforcement is required.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What I Tried&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I started giving Claude implementation tasks with a fixed token budget.&lt;/p&gt;

&lt;p&gt;The behavior changed.&lt;/p&gt;

&lt;p&gt;Instead of building everything it could think of, the model focused on completing the requested specification before asking for more budget.&lt;/p&gt;

&lt;p&gt;Unconstrained Claude on a REST API task added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT authentication (not requested)&lt;/li&gt;
&lt;li&gt;Health check endpoint (not requested)&lt;/li&gt;
&lt;li&gt;Full test suite (not requested)&lt;/li&gt;
&lt;li&gt;nodemon dev dependency (not requested)&lt;/li&gt;
&lt;li&gt;URL validation middleware (not requested)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Token Sensei at budget 400 delivered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The API&lt;/li&gt;
&lt;li&gt;Nothing else&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;The Numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three tasks. Two conditions each. Same model (claude-sonnet-4-5).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Unconstrained&lt;/th&gt;
&lt;th&gt;Token Sensei&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Notes REST API&lt;/td&gt;
&lt;td&gt;3,584 tokens&lt;/td&gt;
&lt;td&gt;2,578 tokens&lt;/td&gt;
&lt;td&gt;28%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bookmark REST API&lt;/td&gt;
&lt;td&gt;4,000+ (truncated)&lt;/td&gt;
&lt;td&gt;1,604 tokens&lt;/td&gt;
&lt;td&gt;60%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python CLI&lt;/td&gt;
&lt;td&gt;2,466 tokens&lt;/td&gt;
&lt;td&gt;993 tokens&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Bookmark Manager task is the most striking. The unconstrained model was still generating when it hit the response ceiling. Token Sensei completed the same task in 1,604 tokens and stopped.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;How Token Sensei Works&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set a task and budget&lt;/li&gt;
&lt;li&gt;The runtime streams output and tracks exact token usage&lt;/li&gt;
&lt;li&gt;When the budget is exhausted, execution pauses&lt;/li&gt;
&lt;li&gt;A checkpoint shows what was completed and what remains&lt;/li&gt;
&lt;li&gt;You decide — approve a loan to continue, or ship what exists&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: the budget does not just control how much the model produces. It changes what the model optimizes for. Under constraint, it prioritizes completing the requested work. Reduced token usage is an outcome of that shift, not the objective.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The Protocol&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Token Sensei is both a runtime and a protocol. The LOAN_REQUEST format tells the model how to surface checkpoints:&lt;/p&gt;

&lt;p&gt;LOAN_REQUEST&lt;br&gt;
Completed: ✓ [working deliverables produced]&lt;br&gt;
Remaining: • [specific remaining items]&lt;br&gt;
Requested Budget: [minimum needed]&lt;br&gt;
This budget will complete: ✓ [specific deliverables]&lt;/p&gt;

&lt;p&gt;The runtime enforces budget externally using exact provider-reported token counts — not estimates.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try It&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;git clone &lt;a href="https://github.com/shouvik12/token-sensei" rel="noopener noreferrer"&gt;https://github.com/shouvik12/token-sensei&lt;/a&gt;&lt;br&gt;
cd token-sensei&lt;br&gt;
npm install&lt;br&gt;
node server.js&lt;/p&gt;

&lt;p&gt;Open localhost:3000. Set a task, set a budget, run.&lt;/p&gt;

&lt;p&gt;MIT licensed. Would love feedback from the dev.to community — especially whether the checkpoint UX feels right and whether the protocol is something you'd use outside the runtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F20qjljmop80mnggt35w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F20qjljmop80mnggt35w6.png" alt=" " width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Every AI task should have a budget.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Ship Happens: How I Made a 3B Local Model Trustworthy for Kubernetes (By Never Trusting It)</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 21 Jun 2026 13:36:46 +0000</pubDate>
      <link>https://dev.to/shouvik12/-ship-happens-how-i-made-a-3b-local-model-trustworthy-for-kubernetes-by-never-trusting-it-23o8</link>
      <guid>https://dev.to/shouvik12/-ship-happens-how-i-made-a-3b-local-model-trustworthy-for-kubernetes-by-never-trusting-it-23o8</guid>
      <description>&lt;p&gt;Small local models are surprisingly capable. They're also wrong in&lt;br&gt;
small, easy-to-miss ways: a deprecated API version here, a&lt;br&gt;
placeholder value that looks real until you actually deploy it&lt;br&gt;
there. The usual fix is better prompting. I tried something&lt;br&gt;
different: stop trying to make the model right, and just make sure&lt;br&gt;
nothing wrong ever ships.&lt;/p&gt;
&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fb4c1lph917olo0wzf3eo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fb4c1lph917olo0wzf3eo.png" alt=" " width="799" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ship Happens&lt;/strong&gt; lets you describe a Kubernetes deployment in plain&lt;br&gt;
English. A local model (&lt;code&gt;qwen2.5:3b&lt;/code&gt;, via Ollama) drafts the&lt;br&gt;
manifest. Before anything counts as "done," that manifest gets&lt;br&gt;
tested against a real local cluster (&lt;code&gt;kind&lt;/code&gt;) using a server-side&lt;br&gt;
dry-run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;server &lt;span class="nt"&gt;-f&lt;/span&gt; manifest.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the cluster rejects it, the &lt;em&gt;actual error message&lt;/em&gt;, not a&lt;br&gt;
vague "try again", gets fed straight back into the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;RegenerateFromError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imageName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clusterError&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;correction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"A previous attempt was tested against a real Kubernetes "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="s"&gt;"cluster and REJECTED with this exact error:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="s"&gt;"Fix the YAML to resolve this specific error."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;clusterError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then it gets checked again. The model never gets to just "try&lt;br&gt;
again" blindly. Every retry is grounded in what the cluster&lt;br&gt;
actually said was wrong.&lt;/p&gt;
&lt;h2&gt;
  
  
  What this caught, that I didn't expect
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Missing &lt;code&gt;---&lt;/code&gt; silently merges resources
&lt;/h3&gt;

&lt;p&gt;Ask for a Deployment + Service and the model sometimes forgets the&lt;br&gt;
YAML document separator between them. The result isn't a parse&lt;br&gt;
error. It's &lt;em&gt;worse&lt;/em&gt;. Without &lt;code&gt;---&lt;/code&gt;, you get one YAML document with&lt;br&gt;
duplicate keys, and most parsers (including the one &lt;code&gt;kubectl&lt;/code&gt; uses)&lt;br&gt;
just keep the &lt;em&gt;last&lt;/em&gt; value for each duplicate key. The Deployment's&lt;br&gt;
&lt;code&gt;spec&lt;/code&gt; silently gets overwritten by the Service's &lt;code&gt;spec&lt;/code&gt;. Half your&lt;br&gt;
config vanishes with zero warning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Models often show a wrong example before the right one, or just&lt;/span&gt;
&lt;span class="c"&gt;// forget the separator. Either way, this normalizes it before&lt;/span&gt;
&lt;span class="c"&gt;// validation ever runs:&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;fixMissingDocumentSeparators&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yamlContent&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// detects a second "apiVersion:" line without a preceding ---&lt;/span&gt;
    &lt;span class="c"&gt;// and inserts one automatically&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Placeholder images that pass every check
&lt;/h3&gt;

&lt;p&gt;The model would write &lt;code&gt;image: your-app-image:v1&lt;/code&gt;, syntactically&lt;br&gt;
perfect YAML. Dry-run validates schema, not whether the string&lt;br&gt;
happens to be a real, pullable image. This passes validation&lt;br&gt;
cleanly and only fails 20 minutes later, at actual pod scheduling,&lt;br&gt;
as &lt;code&gt;ImagePullBackOff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix wasn't a smarter prompt (tried that, model just produced a&lt;br&gt;
&lt;em&gt;different&lt;/em&gt; placeholder). It was deterministic substitution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;knownImages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"nginx"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"nginx:latest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"redis"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"redis:7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"postgres"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"postgres:16"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;fixKnownPlaceholderImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yamlContent&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// finds a known technology keyword inside the placeholder text&lt;/span&gt;
    &lt;span class="c"&gt;// itself ("your-nginx-image" contains "nginx") and substitutes&lt;/span&gt;
    &lt;span class="c"&gt;// the real image, deterministically. No second model call,&lt;/span&gt;
    &lt;span class="c"&gt;// no risk of generating a different placeholder&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. The repair that "fixed" the version by deleting it
&lt;/h3&gt;

&lt;p&gt;This one was my favorite. A manifest used the deprecated&lt;br&gt;
&lt;code&gt;policy/v1beta1&lt;/code&gt; for a &lt;code&gt;PodDisruptionBudget&lt;/code&gt;. Dry-run rejected it&lt;br&gt;
correctly. The repair loop kicked in, read the error, and&lt;br&gt;
"corrected" it to... &lt;code&gt;v1&lt;/code&gt;. Not &lt;code&gt;policy/v1&lt;/code&gt;. Just &lt;code&gt;v1&lt;/code&gt;, dropping the&lt;br&gt;
API group entirely.&lt;/p&gt;

&lt;p&gt;Technically a different string. Still completely wrong, since&lt;br&gt;
&lt;code&gt;PodDisruptionBudget&lt;/code&gt; doesn't exist under the bare &lt;code&gt;v1&lt;/code&gt; group at&lt;br&gt;
all. The model recognized something needed to change, defaulted to&lt;br&gt;
the most common generic &lt;code&gt;apiVersion&lt;/code&gt; it knew, and got it wrong in a&lt;br&gt;
new way.&lt;/p&gt;

&lt;p&gt;The fix was explicit, not clever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;`IMPORTANT: If the error is about a deprecated or invalid
apiVersion, the API GROUP prefix (the part before the slash)
usually stays the same. Only the version number changes. For
example, "policy/v1beta1" should become "policy/v1", NOT just "v1".
Dropping the group prefix entirely produces a different, equally
invalid value.`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-validated after that, and it passed correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But couldn't jsonschema do this?"
&lt;/h2&gt;

&lt;p&gt;Came up when I posted this, and it's a fair question. Mostly: no,&lt;br&gt;
not for the bugs above.&lt;/p&gt;

&lt;p&gt;A static schema is pinned to whatever Kubernetes version it was&lt;br&gt;
bundled for. The &lt;code&gt;policy/v1beta1&lt;/code&gt; bug is a real, structurally valid&lt;br&gt;
shape. It's just been removed from newer clusters. A schema&lt;br&gt;
package might still call that "valid" depending on how current it&lt;br&gt;
is. Dry-run against the real cluster has no such lag: it's checking&lt;br&gt;
against exactly what &lt;em&gt;this&lt;/em&gt; cluster, right now, actually accepts.&lt;/p&gt;

&lt;p&gt;It also covers CRDs for free. If your cluster has cert-manager,&lt;br&gt;
Prometheus operators, anything custom: there's no generic schema&lt;br&gt;
for that, but dry-run validates against whatever's actually&lt;br&gt;
installed with zero extra configuration. And it goes through the&lt;br&gt;
real admission chain (RBAC, any webhooks), which schema validation&lt;br&gt;
never sees, since it's only checking shape in isolation.&lt;/p&gt;

&lt;p&gt;jsonschema is faster and works offline, so it's a reasonable&lt;br&gt;
first-pass filter. It's just not sufficient on its own for what I&lt;br&gt;
was trying to catch.&lt;/p&gt;
&lt;h2&gt;
  
  
  What it doesn't catch
&lt;/h2&gt;

&lt;p&gt;Worth being honest about the ceiling here. Dry-run proves &lt;em&gt;syntax&lt;/em&gt;,&lt;br&gt;
not &lt;em&gt;architecture&lt;/em&gt;. It will not catch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;ReadWriteOnce&lt;/code&gt; PersistentVolumeClaim shared across 3 replicas
(perfectly valid YAML, breaks at pod scheduling)&lt;/li&gt;
&lt;li&gt;A model picking a real-but-wrong image for an ambiguous request
(asked for "prometheus," once got handed Alertmanager's image,
also real, also wrong)&lt;/li&gt;
&lt;li&gt;Container-internal startup requirements, like a Bitnami image
refusing to start without &lt;code&gt;ALLOW_EMPTY_PASSWORD=yes&lt;/code&gt; set. That's
application logic baked into the image, completely invisible to
the Kubernetes API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one &lt;em&gt;is&lt;/em&gt; now caught, just not by dry-run. It's caught by a&lt;br&gt;
Dashboard that polls live pod health, and on a crash, pulls the&lt;br&gt;
actual logs and asks the local model to summarize the root cause&lt;br&gt;
in plain English instead of a raw log dump:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 redis-high-availability
0/3 replicas ready
Missing REDIS_PASSWORD leads to startup failure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That line is genuinely the model's own summary of a much longer,&lt;br&gt;
uglier log output. Different use of the same local model, not&lt;br&gt;
generating infrastructure, just explaining a failure after the&lt;br&gt;
fact.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's still unverified
&lt;/h2&gt;

&lt;p&gt;I'd rather say this directly than let someone find out the hard&lt;br&gt;
way: there's a third mode, "My Code," that's supposed to build your&lt;br&gt;
own source into an image and deploy it directly into the cluster.&lt;br&gt;
It's implemented. I have not run it end-to-end yet. The README says&lt;br&gt;
exactly that, in those words, instead of pretending otherwise.&lt;/p&gt;

&lt;p&gt;"Public Image" mode (describe it, AI drafts it) and "My YAML" mode&lt;br&gt;
(paste your own, skip generation entirely) have both been tested&lt;br&gt;
repeatedly and are confirmed working. The build-from-source path&lt;br&gt;
hasn't earned that yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual point
&lt;/h2&gt;

&lt;p&gt;This isn't really about Kubernetes, and it isn't really about this&lt;br&gt;
one model. The pattern is just: don't try to make an unreliable&lt;br&gt;
generator more reliable through better instructions alone. Give it&lt;br&gt;
a real, external thing to be checked against, feed the &lt;em&gt;real&lt;/em&gt;&lt;br&gt;
failure back in when it's wrong, and let that loop run until&lt;br&gt;
something true comes out the other side.&lt;/p&gt;

&lt;p&gt;A 3B model doesn't need to be smart for this to work. It just needs&lt;br&gt;
something honest to check it against.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/shouvik12/ship-happens" rel="noopener noreferrer"&gt;github.com/shouvik12/ship-happens&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>go</category>
    </item>
    <item>
      <title>I tracked every GitHub traffic spike for my open source LLM proxy for 7 weeks. Then I did the exact same thing again, and it worked again.</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Mon, 15 Jun 2026 03:42:08 +0000</pubDate>
      <link>https://dev.to/shouvik12/i-tracked-every-github-traffic-spike-for-my-open-source-llm-proxy-for-7-weeks-then-i-did-the-exact-39ln</link>
      <guid>https://dev.to/shouvik12/i-tracked-every-github-traffic-spike-for-my-open-source-llm-proxy-for-7-weeks-then-i-did-the-exact-39ln</guid>
      <description>&lt;p&gt;When I shipped &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;Trooper&lt;/a&gt;, a privacy-aware LLM proxy written in Go, I didn't have a marketing plan. I had GitHub traffic analytics and a habit of checking them obsessively.&lt;/p&gt;

&lt;p&gt;Seven weeks later, I have something more useful than a viral moment: a ranked table of every traffic spike, what caused each one, and proof that the exact same playbook that worked at launch still works when you have something new to say.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Trooper?
&lt;/h2&gt;

&lt;p&gt;Trooper sits between your app and your LLM provider. When your cloud quota runs out, it automatically falls back to a local Ollama instance with zero code changes on your end. It also tracks session context, so your agents don't go blind between calls.&lt;/p&gt;

&lt;p&gt;It's not a chatbot wrapper. It's plumbing. Which makes the distribution story more interesting, because plumbing doesn't go viral the way demos do.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data
&lt;/h2&gt;

&lt;p&gt;GitHub gives you 14-day rolling windows for clones and views. I screenshotted them obsessively and tracked every spike. Here's the full ranked table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Clones&lt;/th&gt;
&lt;th&gt;Unique Cloners&lt;/th&gt;
&lt;th&gt;Views&lt;/th&gt;
&lt;th&gt;Unique Visitors&lt;/th&gt;
&lt;th&gt;Driver&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🥇 1&lt;/td&gt;
&lt;td&gt;May 13&lt;/td&gt;
&lt;td&gt;375&lt;/td&gt;
&lt;td&gt;173&lt;/td&gt;
&lt;td&gt;1,113&lt;/td&gt;
&lt;td&gt;~140&lt;/td&gt;
&lt;td&gt;Reddit wave peak&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥈 2&lt;/td&gt;
&lt;td&gt;May 10-12&lt;/td&gt;
&lt;td&gt;312&lt;/td&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;td&gt;974&lt;/td&gt;
&lt;td&gt;133&lt;/td&gt;
&lt;td&gt;Reddit launch spike&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🥉 3&lt;/td&gt;
&lt;td&gt;Jun 10&lt;/td&gt;
&lt;td&gt;289&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;749&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;"Escalate the model" r/ollama post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Jun 11&lt;/td&gt;
&lt;td&gt;268&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;840&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;Decaying from Jun 10 spike&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Jun 12&lt;/td&gt;
&lt;td&gt;240&lt;/td&gt;
&lt;td&gt;99&lt;/td&gt;
&lt;td&gt;739&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;Decaying from Jun 10 spike&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Jun 9&lt;/td&gt;
&lt;td&gt;175&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;802&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Organic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Apr 25&lt;/td&gt;
&lt;td&gt;174&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;664&lt;/td&gt;
&lt;td&gt;113&lt;/td&gt;
&lt;td&gt;Early Reddit posts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Jun 7&lt;/td&gt;
&lt;td&gt;171&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;876&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;Organic recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Jun 6&lt;/td&gt;
&lt;td&gt;163&lt;/td&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;755&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Organic recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;May 29-30&lt;/td&gt;
&lt;td&gt;122&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;LinkedIn post&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;May 25&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;495&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;Claude Code integration chat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reddit is the only thing that moved the needle, and community fit matters more than size
&lt;/h3&gt;

&lt;p&gt;The #1 and #2 peaks were both Reddit-driven. On May 10-11, I posted across r/ollama, r/LocalLLM, r/ClaudeCode, and r/Gemini simultaneously. Total views across those posts: ~7,000.&lt;/p&gt;

&lt;p&gt;r/ollama alone drove nearly 4,000 of those views. Not r/LocalLLM. Not r/ClaudeCode. &lt;strong&gt;r/ollama&lt;/strong&gt;, the smallest of the four communities.&lt;/p&gt;

&lt;p&gt;The reason: Trooper solves an Ollama-specific problem. Quota exhaustion hitting your local Ollama fallback is something that community lives with daily. Posting to a larger but less relevant community got less traction, even with identical content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Precision beats reach. Find the subreddit where your exact problem is a lived experience, not just a relatable concept.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. The post that doesn't feel like a post performs best
&lt;/h3&gt;

&lt;p&gt;The r/ollama post that drove the May peak wasn't "I built a thing, please star it." It was structured around the problem first:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I kept hitting Claude quota limits mid-session and losing context. So I built a proxy that falls back to Ollama automatically."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nobody wants to read a launch announcement. Everyone wants to read about a problem they recognize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Lead with the pain, not the product.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Organic discovery is real, and it compounds quietly
&lt;/h3&gt;

&lt;p&gt;Ranks #3 and #4 (the highest views since the May peak) happened with &lt;strong&gt;zero posts&lt;/strong&gt; in the preceding two weeks. Pure organic discovery.&lt;/p&gt;

&lt;p&gt;The referring traffic breakdown tells the story: &lt;code&gt;github.com&lt;/code&gt; is the top referrer, meaning developers are finding Trooper by browsing related repos. Google is sending traffic too. Someone is searching for "LLM proxy ollama fallback" and landing on Trooper.&lt;/p&gt;

&lt;p&gt;This doesn't show up immediately. It built slowly over six weeks. But it's now driving more daily traffic than the LinkedIn post did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; SEO and GitHub organic discovery are slow, but they compound. Write a good README. Use keywords people actually search for.&lt;/p&gt;




&lt;h3&gt;
  
  
  3.5. The playbook worked a second time, on purpose
&lt;/h3&gt;

&lt;p&gt;Three weeks after writing the first draft of this article, rank #3 happened: 289 clones and 124 unique cloners in roughly 24 hours. Second only to the original launch peak, and bigger than the entire Reddit launch week for views-per-day.&lt;/p&gt;

&lt;p&gt;This time it wasn't a mystery. I had shipped a new feature: smart escalation, where Trooper bumps a request up to a bigger model mid-conversation when the local model can't handle it, instead of dropping context and starting over. I wrote it up using the exact same framing from lesson #2, problem first, no "I built a feature" language, and posted it to r/ollama again.&lt;/p&gt;

&lt;p&gt;The title was "&lt;a href="https://www.reddit.com/r/ollama/comments/1u2bcvz/escalate_the_model_not_the_conversation/" rel="noopener noreferrer"&gt;Escalate the model, not the conversation&lt;/a&gt;". Same subreddit as the original launch. Same problem-first structure. The referring sites confirmed it: &lt;code&gt;reddit.com&lt;/code&gt; and &lt;code&gt;com.reddit.frontpage&lt;/code&gt; combined for over 60 views and &lt;strong&gt;16 unique visitors arriving via Reddit's front page&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; the launch playbook isn't a one-time unlock. It's a template. Same community, same problem-first framing, new thing to say, comparable result. The hard part was never "how do I get r/ollama's attention once", it's having something genuinely new and useful worth bringing back to them. If you ship something that actually matters to the community you launched in, the same channel that worked once will work again.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. LinkedIn drove less than you'd expect
&lt;/h3&gt;

&lt;p&gt;The May 29-30 LinkedIn post landed at rank #5. It did move the needle, 34 more clones than the pre-post baseline, but the effect faded within 48 hours and the unique cloner ratio was lower than Reddit.&lt;/p&gt;

&lt;p&gt;My read: LinkedIn audiences share content but don't clone repos at the same rate as Reddit or HN. They're evaluating Trooper as a concept, not as a tool they're about to run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; LinkedIn is good for reach and credibility. It is not a clone driver.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. The view-to-clone ratio tells you who's actually landing
&lt;/h3&gt;

&lt;p&gt;During the Reddit peak (May 13): ~140 unique visitors, 173 unique cloners. &lt;strong&gt;Ratio above 1.0&lt;/strong&gt;, meaning people were cloning on multiple machines, or recommending it to colleagues.&lt;/p&gt;

&lt;p&gt;During the organic recovery (Jun 7): 111 unique visitors, 106 unique cloners. &lt;strong&gt;Near 1.0&lt;/strong&gt;, meaning developers landing on the page are converting almost immediately.&lt;/p&gt;

&lt;p&gt;During the LinkedIn period: ratio dropped. More browsers, fewer cloners.&lt;/p&gt;

&lt;p&gt;High view-to-clone ratio means right audience. Low ratio means wrong audience or unclear value prop.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The biggest shift in how I think about distribution: it's not "do one big launch and hope it compounds." It's "every time you ship something that solves a real problem for the community you launched in, go back and tell them, the same way you did the first time."&lt;/p&gt;

&lt;p&gt;The escalation feature post worked because it followed the same rules as the original: problem first, no launch language, posted to the subreddit where the problem is lived daily. Two data points isn't a lot, but it's enough to convince me this is a repeatable loop, not a one-time stroke of luck.&lt;/p&gt;

&lt;p&gt;I'm also shipping BRIEFING, a feature that lets agents carry context forward across sessions by reading a structured log on startup. Zero instrumentation, the agent doesn't know it's happening. Same plan as before: ship it, frame the problem it solves, post it to r/ollama.&lt;/p&gt;

&lt;p&gt;If you're building open source infrastructure tools and wondering where to start with distribution: r/ollama if you're in the Ollama ecosystem, r/LocalLLaMA if you're broader. Write about the problem before you write about the solution, every time you have something new, not just at launch.&lt;/p&gt;




&lt;p&gt;Trooper is open source and MIT licensed: &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to answer questions about the proxy architecture, the session handling, or anything else in the comments.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>marketing</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Your Session History Is Bleeding Tokens Every Time You Paste</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sat, 13 Jun 2026 18:15:03 +0000</pubDate>
      <link>https://dev.to/shouvik12/your-session-history-is-bleeding-tokens-every-time-you-paste-14ck</link>
      <guid>https://dev.to/shouvik12/your-session-history-is-bleeding-tokens-every-time-you-paste-14ck</guid>
      <description>&lt;p&gt;Everyone optimizes what they type into Claude.&lt;br&gt;
Nobody optimizes what they paste.&lt;/p&gt;

&lt;p&gt;But developers paste constantly. GitHub READMEs. Research papers. API docs. Jira tickets. Confluence pages. Slack threads.&lt;/p&gt;

&lt;p&gt;Every time you copy a webpage, your clipboard picks up everything. The navigation. The footer. The boilerplate. The cookie banners. The share buttons rendered as plain text.&lt;/p&gt;

&lt;p&gt;You wanted the content. Claude got the garbage too. And you paid for every token of it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Real token counts from live Claude Code sessions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitHub README (trooper)&lt;/td&gt;
&lt;td&gt;4,600&lt;/td&gt;
&lt;td&gt;1,200&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research paper (arXiv)&lt;/td&gt;
&lt;td&gt;4,800&lt;/td&gt;
&lt;td&gt;1,900&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub README (caveman)&lt;/td&gt;
&lt;td&gt;1,800&lt;/td&gt;
&lt;td&gt;600&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokslayer's own README&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;170&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;67%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Where the savings actually land
&lt;/h2&gt;

&lt;p&gt;First turn input tokens are paid in full. The saving is on two things:&lt;/p&gt;

&lt;p&gt;Output tokens. Claude responds to the compressed version, so answers are shorter and more focused.&lt;/p&gt;

&lt;p&gt;Session history. The compressed version stays in context, not the bloated original. Every subsequent turn in that session carries 3,400 fewer tokens of history. Long sessions with multiple pastes, this compounds hard.&lt;/p&gt;

&lt;p&gt;Note: A subagent-based approach was considered but doesn't fully solve this .It can't shrink the user's original pasted message in that turn. A write-path &lt;br&gt;
proxy is the correct fix, rewriting the outgoing request on every turn &lt;br&gt;
including history resends. That's the roadmap direction.&lt;/p&gt;
&lt;h2&gt;
  
  
  What's actually happening when you paste
&lt;/h2&gt;

&lt;p&gt;When you copy a GitHub page you get the content plus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Skip to content" navigation&lt;/li&gt;
&lt;li&gt;Repo tabs (Code, Issues, Pull requests, Actions...)&lt;/li&gt;
&lt;li&gt;Breadcrumbs and branch selectors&lt;/li&gt;
&lt;li&gt;Footer links&lt;/li&gt;
&lt;li&gt;Share buttons rendered as text&lt;/li&gt;
&lt;li&gt;License metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that is the README. All of it hits Claude's context window. All of it costs tokens.&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;A Claude Code skill that sits between your clipboard and Claude. Detects pasted content. Strips the noise. Sends only the signal.&lt;/p&gt;

&lt;p&gt;No proxy. No server. No MCP. No configuration. One file. Drop it in. Restart Claude Code. Done.&lt;/p&gt;

&lt;p&gt;Receipt on every paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ORIGINAL:  "Skip to content shouvik12 trooper Repository..."  (~4,600 tokens)
OPTIMIZED: "Trooper: LLM proxy. Local-first. Ollama default..."  (~1,200 tokens)
SAVED:     ~3,400 tokens (74%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What gets stripped: nav chrome, footers, filler phrases, redundant sentences, marketing boilerplate.&lt;/p&gt;

&lt;p&gt;What stays: headings, code blocks, API signatures, URLs, numbers, technical terms, proper nouns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/shouvik12/tokslayer/main/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart Claude Code. Works on every paste automatically from that point on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The meta test
&lt;/h2&gt;

&lt;p&gt;Ran tokslayer's own README through itself and typed "summarize":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ORIGINAL:  "tokslayer, Slays tokens before they reach Claude..."  (~800 tokens)
OPTIMIZED: "Tokslayer: Claude Code skill. Compresses pasted content..."  (~170 tokens)
SAVED:     ~630 tokens (79%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A tool that eats its own cooking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it fits
&lt;/h2&gt;

&lt;p&gt;This covers the input side. Pair it with caveman for output compression &lt;br&gt;
(65% reduction on Claude responses).&lt;/p&gt;

&lt;p&gt;Together: lean input, lean output&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/shouvik12/tokslayer" rel="noopener noreferrer"&gt;https://github.com/shouvik12/tokslayer&lt;/a&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>skillmd</category>
    </item>
    <item>
      <title>Stop explaining yourself to Claude</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Fri, 12 Jun 2026 04:26:55 +0000</pubDate>
      <link>https://dev.to/shouvik12/-stop-explaining-react-to-claude-47f</link>
      <guid>https://dev.to/shouvik12/-stop-explaining-react-to-claude-47f</guid>
      <description>&lt;p&gt;You're wasting tokens. Not a little -a lot.&lt;/p&gt;

&lt;p&gt;Here's a prompt I see constantly:&lt;/p&gt;

&lt;p&gt;"I have a React app and I'm using the useState hook. My component re-renders every time the parent renders even though the props haven't changed. Why is this happening?"&lt;/p&gt;

&lt;p&gt;Claude doesn't need any of that setup. It already knows React. It already knows what useState is. The only thing it needed was:&lt;/p&gt;

&lt;p&gt;"Component re-renders on parent render. Props unchanged. Why."&lt;/p&gt;

&lt;p&gt;Same answer. 64% fewer tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The delta principle
&lt;/h2&gt;

&lt;p&gt;Most prompts are written for humans. We explain context, name the framework, describe how things work before asking the question. That's how we communicate with each other.&lt;/p&gt;

&lt;p&gt;But Claude already knows the context. The only thing it needs is the &lt;strong&gt;delta&lt;/strong&gt; — the new information, the specific problem, the unknown.&lt;/p&gt;

&lt;p&gt;Everything else is noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you can safely strip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude already knows these — stop re-explaining them:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Framework names used as context ("I have a React app", "I'm using Python")&lt;/li&gt;
&lt;li&gt;Concept explanations ("hooks are a React feature that...", "a closure is when...")&lt;/li&gt;
&lt;li&gt;Stack introductions ("my app uses Node, Express, and MongoDB")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Social noise that adds zero signal:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pleasantries: "hey", "hope you can help", "thanks in advance"&lt;/li&gt;
&lt;li&gt;Permission requests: "could you please", "I was wondering if"&lt;/li&gt;
&lt;li&gt;Hedging: "I think", "I'm not sure but", "maybe", "possibly"&lt;/li&gt;
&lt;li&gt;Filler: "basically", "essentially", "just", "simply", "actually"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you should never strip:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The actual error, bug, or problem&lt;/li&gt;
&lt;li&gt;Numbers, thresholds, measurements&lt;/li&gt;
&lt;li&gt;Variable names, function names, file names&lt;/li&gt;
&lt;li&gt;Code blocks and URLs&lt;/li&gt;
&lt;li&gt;Anything Claude could NOT know without being told&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Before and after
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Debugging:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (41 tokens):
"I'm working on a Node.js Express API and I'm getting a 401 unauthorized 
error when I try to call the endpoint. I'm passing the JWT token in the 
Authorization header."

After (12 tokens):
"401 on endpoint. JWT in Authorization header."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code review:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (29 tokens):
"Could you please review this Python function and tell me if there are 
any issues or improvements I could make?"

After (6 tokens):
"Review. Issues + improvements."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before (19 tokens):
"I was wondering if you could explain how database connection pooling 
works in simple terms?"

After (5 tokens):
"Explain connection pooling. Simple."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The compound effect
&lt;/h2&gt;

&lt;p&gt;Single prompt savings look small. But across a real session, it compounds.&lt;/p&gt;

&lt;p&gt;Here's a simulated 20-turn dev session — the kind where you're debugging something across multiple back-and-forths:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Verbose (tokens)&lt;/th&gt;
&lt;th&gt;Delta (tokens)&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;44&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;757&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;226&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;531&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;531 tokens saved in a single session. 70% reduction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On Claude's API at Sonnet pricing, that's a small number in dollars. But if you're building on top of the API and running hundreds of sessions a day, it adds up fast. And even on claude.ai, fewer input tokens means less context noise — Claude processes cleaner signal and responds more precisely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three intensity levels
&lt;/h2&gt;

&lt;p&gt;Not every prompt needs ultra-compression. I use three modes depending on the situation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lite&lt;/strong&gt; — strip pleasantries only, keep context (~20% reduction)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use when: onboarding a new topic, first message in a session&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;full&lt;/strong&gt; — strip everything Claude knows, keep only the delta (~60% reduction)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use when: mid-session debugging, iterating on code, quick questions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;ultra&lt;/strong&gt; — compress to bare minimum signal (~70%+ reduction)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use when: you know exactly what you want and don't care about polish&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The skill file
&lt;/h2&gt;

&lt;p&gt;I turned this into a Claude skill — a markdown file that instructs Claude to apply delta compression automatically, with activation/deactivation commands and intensity switching.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/shouvik12/delta" rel="noopener noreferrer"&gt;github.com/shouvik12/delta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README has the full rule set, intensity examples, and instructions for adding it to your Claude setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  One thing worth thinking about
&lt;/h2&gt;

&lt;p&gt;This is a small optimization. But the principle behind it is bigger:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We've been writing prompts for humans.&lt;/strong&gt; We explain, we hedge, we contextualize — because that's how we earn understanding from other people. With LLMs, that overhead is waste. The model doesn't need to be convinced you know what you're talking about. It doesn't need the social scaffolding.&lt;/p&gt;

&lt;p&gt;Just send the delta.&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
    </item>
    <item>
      <title>Escalate the Model, Not the Conversation</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Wed, 10 Jun 2026 19:03:23 +0000</pubDate>
      <link>https://dev.to/shouvik12/escalate-the-model-not-the-conversation-4n4k</link>
      <guid>https://dev.to/shouvik12/escalate-the-model-not-the-conversation-4n4k</guid>
      <description>&lt;p&gt;Trooper started as a fallback proxy for agents. Claude hits a quota, falls back to Ollama, session continues. No crashes, no lost context.&lt;/p&gt;

&lt;p&gt;The interesting problem that came up wasn't model routing. It was context preservation.&lt;/p&gt;

&lt;p&gt;When you're debugging something hard, you build up context over many turns. The problem statement, what you've tried, what failed. When you switch from a local model to Claude, all of that context has to go with it. And when you come back to local, the local model needs to know what Claude said.&lt;/p&gt;

&lt;p&gt;That's what 4.0 solves.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;A local model handles requests by default. Fast, free, private.&lt;/p&gt;

&lt;p&gt;When it gets stuck, one click escalates to Claude — the full conversation history is injected automatically. Claude answers. Then control returns to the local model, which continues the conversation knowing exactly what Claude said.&lt;/p&gt;

&lt;p&gt;No copy-pasting. No restarting the conversation. No lost context.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0buzqpsmk689hxdyj5zj.png" alt=" " width="800" height="693"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The escalation moment
&lt;/h2&gt;

&lt;p&gt;You're debugging a slow Postgres query. Llama gives you a decent answer — check your EXPLAIN output, look for function calls on indexed columns. Good start.&lt;/p&gt;

&lt;p&gt;Not enough. You hit Escalate.&lt;/p&gt;

&lt;p&gt;Claude receives the full session. It knows you're debugging a slow query. It knows what Llama already told you. It picks up exactly where the conversation left off.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyfvbf68v53d9j87xgev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyfvbf68v53d9j87xgev.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flho0rxtc4xldue9dr0z4.png" alt=" " width="800" height="1054"&gt;
&lt;/h2&gt;

&lt;p&gt;You click Back to local.&lt;/p&gt;

&lt;p&gt;Now ask Llama to summarize what Claude said.&lt;/p&gt;

&lt;p&gt;It does. Correctly. Because the session store was updated with Claude's response. Llama reads the full history including what Claude said and continues from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4s9hjglp4k9slzem6vg.png" alt=" " width="800" height="996"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  What's under the hood
&lt;/h2&gt;

&lt;p&gt;Trooper is a Go proxy that sits between your client and any LLM provider. The chat UI is a static HTML file served by the same process.&lt;/p&gt;

&lt;p&gt;When you escalate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The UI fetches the full session history from &lt;code&gt;/session/:id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sends it to Claude via &lt;code&gt;/v1/messages&lt;/code&gt; with &lt;code&gt;X-Force-Cloud: true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Claude's response gets written back to the session store via &lt;code&gt;/session/:id/append&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Next local turn, Llama reads the full history including Claude's response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The SITREP panel on the right extracts intent, confidence, entities and open loops from the conversation using a rule-based classifier — no LLM call needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The proxy still works
&lt;/h2&gt;

&lt;p&gt;The proxy layer is unchanged. Agents, SDK clients, curl — everything routes through &lt;code&gt;/v1/messages&lt;/code&gt; the same way it always did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Agent flow — unchanged&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000

&lt;span class="c"&gt;# Chat UI — new in 4.0&lt;/span&gt;
open http://localhost:3000/chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...  &lt;span class="c"&gt;# optional&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;llama3.1:8b
go run &lt;span class="nb"&gt;.&lt;/span&gt;
open http://localhost:3000/chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works without a Claude key-escalation falls back to Ollama. Add the key when you want real cloud escalation.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;https://github.com/shouvik12/trooper&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>go</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Cut Agent Token Usage by 89% Without Touching the Agent</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Fri, 05 Jun 2026 13:10:22 +0000</pubDate>
      <link>https://dev.to/shouvik12/how-i-cut-agent-token-usage-by-89-without-touching-the-agent-3g7o</link>
      <guid>https://dev.to/shouvik12/how-i-cut-agent-token-usage-by-89-without-touching-the-agent-3g7o</guid>
      <description>&lt;p&gt;Every time your agent calls an LLM, it sends the full conversation history.&lt;/p&gt;

&lt;p&gt;Turn 20 includes turns 1–19. Turn 50 includes turns 1–49. Nobody notices because it's happening inside the agent, silently, on every single request.&lt;/p&gt;

&lt;p&gt;I noticed it while building &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;Trooper&lt;/a&gt; - a Go proxy that sits between agents and LLMs. I was watching token counts climb across a long debugging session and realised the agent was replaying the same context over and over. Most of it was noise.&lt;/p&gt;

&lt;p&gt;The model didn't need a transcript. It needed state.&lt;/p&gt;




&lt;h2&gt;
  
  
  What state actually means
&lt;/h2&gt;

&lt;p&gt;After a few turns, most of what matters in a session falls into four categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decisions made&lt;/strong&gt; — what was chosen and why&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints locked&lt;/strong&gt; — what cannot change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open loops&lt;/strong&gt; — what still needs to be resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ruled out&lt;/strong&gt; — what was tried and rejected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. Everything else — the back and forth, the verbose LLM responses explaining things, the repeated context — is replay. The model doesn't need it again.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SITREP
&lt;/h2&gt;

&lt;p&gt;I added structured session memory to Trooper. After enough turns, Trooper's local Llama model generates a SITREP — a situation report — from the user messages in the session.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INTENT: Build a RAG pipeline with ChromaDB and nomic-embed-text

DECISIONS: Use cosine similarity over MMR — focused queries not broad;
           Chunk size 256, overlap 30 — locked;
           Pure vector search — ChromaDB no hybrid support;
           Top k set to 5

CONSTRAINTS: Node 18 locked — platform team constraint, no exceptions;
             Re-ranking ruled out — latency jumped 200ms to 800ms

OPEN: Poor recall on technical queries — nomic-embed-text struggles with domain jargon;
      Evaluating bge-small as alternative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From that point forward, every request to the LLM sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Anchor (first 2 turns verbatim)
+ SITREP (structured state)
+ Tail (last N turns verbatim)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of the full history.&lt;/p&gt;




&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;From a real 15-turn session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Full history:    10,820 tokens per request
With Trooper:     1,157 tokens per request
Reduction:             89%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visible live on the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6dlkrn4g7kbev71864i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6dlkrn4g7kbev71864i.png" alt=" " width="799" height="621"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Does the LLM still answer correctly?
&lt;/h2&gt;

&lt;p&gt;This was the question that mattered. Token savings are worthless if the model loses coherence.&lt;/p&gt;

&lt;p&gt;To test it: I took the auto-generated SITREP, opened a completely fresh chat with no history, and asked questions about decisions made in the original session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the chunk size?&lt;/li&gt;
&lt;li&gt;Why did we rule out hybrid search?&lt;/li&gt;
&lt;li&gt;What retrieval method did we choose and why?&lt;/li&gt;
&lt;li&gt;What is still open?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; All four answered correctly. The model worked entirely from the SITREP. No history. No context bleed.&lt;/p&gt;

&lt;p&gt;That's the claim: structured state is sufficient for the model to continue reasoning correctly — and it costs 89% less to send.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Trooper is a Go proxy — one binary, no SDK, no instrumentation. You point your existing agent at it by changing one URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://api.anthropic.com

&lt;span class="c"&gt;# After&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing else changes. Trooper intercepts every request, maintains session state, and when the SITREP is ready, rewrites the messages array before forwarding to the LLM.&lt;/p&gt;

&lt;p&gt;The SITREP is built by a local Llama 3.1 8b model running via Ollama — fast, private, no cloud cost. The extraction happens asynchronously in the background. The main request path is not blocked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// GetTripleAnchor assembles what gets sent to the LLM&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SessionStore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GetTripleAnchor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessionID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Anchor&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SITREP&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[STATE_SITREP: %s]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SITREP&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tail&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dashboard shows the compression ratio live:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HISTORY COMPRESSED    89%
TOKENS SAVED          459
CONFIDENCE            100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why this is different from conversation summarisation
&lt;/h2&gt;

&lt;p&gt;Most summarisation tools compress what was said. The SITREP extracts what matters for the next action.&lt;/p&gt;

&lt;p&gt;Copilot's context compaction summarises the full conversation — useful for humans in long chats. The SITREP is structured specifically for agents: decisions, constraints, open loops, ruled-out paths. Not a narrative summary. A state snapshot.&lt;/p&gt;

&lt;p&gt;The result is that subsequent turns stay coherent on intent without replaying noise. More relevant for agents running repeated structured workflows than for general chat.&lt;/p&gt;




&lt;h2&gt;
  
  
  The limitation
&lt;/h2&gt;

&lt;p&gt;The SITREP works best for structured agentic workflows — debugging sessions, research pipelines, multi-step build tasks. For open-ended creative work where tangential context might become important later, you'd want a larger tail window or higher fidelity compression.&lt;/p&gt;

&lt;p&gt;The tail window is configurable. You can keep more raw context for less structured sessions.&lt;/p&gt;




&lt;h2&gt;
  
  
  What else Trooper does
&lt;/h2&gt;

&lt;p&gt;The compression is the latest addition. Trooper also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Falls back to local Ollama when cloud quota hits — context preserved across the switch&lt;/li&gt;
&lt;li&gt;Routes simple turns to Ollama automatically — cloud never contacted&lt;/li&gt;
&lt;li&gt;Privacy routing — sensitive requests stay local via &lt;code&gt;x_force_local&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Live dashboard — intent, open loops, completed steps, transcript&lt;/li&gt;
&lt;li&gt;Subagent recovery — &lt;code&gt;/recovery/{session_id}&lt;/code&gt; tells you exactly where to resume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All from one URL change.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bigger question
&lt;/h2&gt;

&lt;p&gt;We tend to treat conversation history as memory. But a transcript is a log. Memory is state.&lt;/p&gt;

&lt;p&gt;Humans don't replay every prior conversation before making a decision. They carry forward conclusions, constraints, unresolved questions, and relevant context — a structured snapshot, not a full transcript.&lt;/p&gt;

&lt;p&gt;Long-running agents may need to do the same. Not because of token costs — though that helps — but because state is a better abstraction for agent memory than history.&lt;/p&gt;

&lt;p&gt;The SITREP is an experiment in that direction.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;github.com/shouvik12/trooper&lt;/a&gt; — Go, MIT, zero dependencies beyond Ollama.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>I Added a Live Dashboard to My LLM Proxy. Zero Instrumentation. Just a URL Change.</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Thu, 28 May 2026 15:24:53 +0000</pubDate>
      <link>https://dev.to/shouvik12/-i-added-a-live-dashboard-to-my-llm-proxy-zero-instrumentation-just-a-url-change-12k4</link>
      <guid>https://dev.to/shouvik12/-i-added-a-live-dashboard-to-my-llm-proxy-zero-instrumentation-just-a-url-change-12k4</guid>
      <description>&lt;p&gt;I built Trooper as a fallback proxy. Claude hits quota → falls back to Ollama. Useful but passive. It sat in the background, invisible, doing its job silently.&lt;/p&gt;

&lt;p&gt;Today it became something different.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Original Problem
&lt;/h2&gt;

&lt;p&gt;When you're building with LLMs, quota hits are inevitable. Claude's free tier is generous until it isn't. A mid-session 429 kills your context, your workflow, your train of thought.&lt;/p&gt;

&lt;p&gt;Trooper solved that. Point your app at &lt;code&gt;http://localhost:3000&lt;/code&gt; instead of the Claude API. When Claude fails, Trooper catches it, preserves the full session context via a 3-layer compaction system (Anchor + SITREP + Tail), and continues on local Ollama. Your app never knows anything happened.&lt;/p&gt;

&lt;p&gt;That's still there. But it was passive. Useful when things broke. Invisible when they didn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Passive
&lt;/h2&gt;

&lt;p&gt;Passive infrastructure has an adoption problem. Developers install it, forget about it, and only notice it when something breaks. That's not a product — that's a safety net.&lt;/p&gt;

&lt;p&gt;The question I kept asking: what does Trooper do that has daily value, not just failure value?&lt;/p&gt;

&lt;p&gt;The answer was sitting in the code the whole time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Already There
&lt;/h2&gt;

&lt;p&gt;Trooper captures every message in every session. It runs a classifier on each one — extracting intent, entities, open loops, completed steps, recent actions. All rule-based, zero LLM calls, zero latency.&lt;/p&gt;

&lt;p&gt;This is what powers the fallback context preservation. When Claude fails and Ollama picks up, Ollama doesn't start blind — it receives a SITREP (Situation Report) that tells it what the session was about, what was completed, what's still pending.&lt;/p&gt;

&lt;p&gt;That data exists for every session. It was just never visible to the developer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;I added a live dashboard at &lt;code&gt;localhost:3000/dashboard&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Point any agent at Trooper — just change your base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;span class="c"&gt;# or&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the dashboard. Keep it on a second monitor while your agent runs.&lt;/p&gt;

&lt;p&gt;From a single message, it already shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt; — what your agent is trying to do, extracted automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open Loops&lt;/strong&gt; — what it's stuck on, highlighted in red&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completed Steps&lt;/strong&gt; — what it finished, tracked as it happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entities&lt;/strong&gt; — the key things being referenced&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Transcript&lt;/strong&gt; — every message, colour coded by role&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Auto-refreshes every 5 seconds. No page reload needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Looks Like In Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cprikgdz65ls8uy4l5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7cprikgdz65ls8uy4l5k.png" alt=" " width="800" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ran a 3-turn agent session simulating a database debugging workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 1:&lt;/strong&gt; "I am building a Go API server. The database connection is failing with connection refused errors on port 5432."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 2:&lt;/strong&gt; "Checked the config. Postgres is running on port 5433 not 5432. Fixing the connection string now."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 3:&lt;/strong&gt; "Fixed the port. Database connection is working. API server is running successfully."&lt;/p&gt;

&lt;p&gt;After Turn 1, the dashboard immediately showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intent: "building a go api server. the database connection is failing with connection refused errors on port" (100% confidence)&lt;/li&gt;
&lt;li&gt;Entities: Postgres, network&lt;/li&gt;
&lt;li&gt;Open Loops: "fail with connection refused error on port"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After Turn 3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Completed Steps: "successfully fixed the port"&lt;/li&gt;
&lt;li&gt;Open loops cleared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zero instrumentation. No SDK. No code changes to the agent. Just a URL change.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Every observability tool requires you to instrument your code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith&lt;/strong&gt; — wrap your agent in LangChain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; — add their SDK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentOps&lt;/strong&gt; — add &lt;code&gt;@observe&lt;/code&gt; decorators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trooper&lt;/strong&gt; requires nothing. Your agent already communicates over HTTP to an LLM. Trooper sits in that path and observes everything automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; was the closest — proxy-based, zero instrumentation. But it went into maintenance mode in March 2026 and was cloud-only. Your data went to their servers.&lt;/p&gt;

&lt;p&gt;Trooper is open source, local-first, and free forever. Your data never leaves your machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Sessions Endpoint
&lt;/h2&gt;

&lt;p&gt;Not sure which session to look at? Hit &lt;code&gt;/sessions&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/sessions
&lt;span class="c"&gt;# {"sessions":["agent-debug-123","agent-debug-456"],"count":2}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click any session in the dashboard home page at &lt;code&gt;localhost:3000/dashboard&lt;/code&gt; to open it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recovery Endpoint
&lt;/h2&gt;

&lt;p&gt;Still there. When an agent fails mid-task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/recovery/&lt;span class="o"&gt;{&lt;/span&gt;session_id&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns exactly what completed and where to resume. The dashboard makes this visual — completed steps are tracked in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pivot
&lt;/h2&gt;

&lt;p&gt;Trooper started as: &lt;em&gt;"Claude failed. Trooper caught it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Trooper is now: &lt;em&gt;"Your agent communicates over HTTP to an LLM. Trooper can observe it."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The fallback is a feature. Observability is the product.&lt;/p&gt;

&lt;p&gt;Your agent was always talking. Now you can hear it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:3000/dashboard&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;p&gt;Point your agent at Trooper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Zero dependencies. Pure Go. Runs in under 60 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: llm, agents, observability, ollama, go, opensource&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Added a /recovery Endpoint to My LLM Proxy So Agents Never Lose Progress Mid-Task</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 24 May 2026 17:28:55 +0000</pubDate>
      <link>https://dev.to/shouvik12/i-added-a-recovery-endpoint-to-my-llm-proxy-so-agents-never-lose-progress-mid-task-524b</link>
      <guid>https://dev.to/shouvik12/i-added-a-recovery-endpoint-to-my-llm-proxy-so-agents-never-lose-progress-mid-task-524b</guid>
      <description>&lt;p&gt;Most LLM proxies handle failures the same way -retry the request, fall back to another provider, or crash. None of them ask the more important question: &lt;strong&gt;what did the agent already complete before it failed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the gap I built Trooper to fill.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;If you're running multi-agent workflows, you've probably hit this scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A subagent starts a long task-reviewing PRs, processing documents, running analysis&lt;/li&gt;
&lt;li&gt;It completes steps 1, 2, and 3&lt;/li&gt;
&lt;li&gt;On step 4 it hits a quota error, rate limit, or provider failure&lt;/li&gt;
&lt;li&gt;Your orchestration layer has no idea what completed&lt;/li&gt;
&lt;li&gt;You restart from scratch and repeat all the work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most proxies handle the failure. Nobody handles the recovery.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Trooper Does Differently
&lt;/h2&gt;

&lt;p&gt;Trooper is a Go-based LLM proxy that sits between your agents and your LLM providers. It already handled fallback routing -if Claude hits quota, it falls back to local Ollama automatically.&lt;/p&gt;

&lt;p&gt;But the new &lt;code&gt;/recovery/{session_id}&lt;/code&gt; endpoint goes further. It tracks every step your agent completes in real time and tells your orchestration layer exactly where to resume.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recovery Endpoint
&lt;/h2&gt;

&lt;p&gt;When your agent sends requests through Trooper, it captures every assistant response and extracts completed steps as they happen. When something fails, you call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET http://localhost:3000/recovery/{session_id}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And Trooper returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"subagent-demo-1779630533"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"completed_steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"completed pr #1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"completed pr #2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"completed pr #3"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resume_from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"recovery_hint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Resume from step 4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your parent agent now knows exactly what the subagent finished and where to restart it. No repeated work. No lost progress.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;An agent reviewing 8 pull requests hits quota on PR #4. Trooper intercepts, returns the recovery payload, and the agent resumes from PR #4 using local Ollama.&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://www.youtube.com/watch?v=NN2uwQZDCck" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=NN2uwQZDCck&lt;/a&gt;]&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Trooper uses a two-tier memory system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor&lt;/strong&gt; — the first two turns of a session, always preserved verbatim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tail&lt;/strong&gt; — the most recent turns, stored in a rolling window.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;/recovery&lt;/code&gt;, Trooper scans all stored assistant messages for completion signals — words like "completed", "finished", "done", "merged", "deployed". It extracts one completed step per message, deduplicates by task identifier, and returns the ordered list.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;resume_from&lt;/code&gt; field is simply &lt;code&gt;len(completed_steps) + 1&lt;/code&gt; — telling your orchestration layer which step to restart the subagent on.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start Trooper&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Point your agent at Trooper instead of Claude&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Instead of https://api.anthropic.com/v1/messages&lt;/span&gt;
POST http://localhost:3000/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Pass a session ID with each request&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:3000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"X-Session-ID: my-agent-session-1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"claude-haiku-4-5","max_tokens":100,"messages":[...]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Call recovery when something fails&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:3000/recovery/my-agent-session-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;The recovery endpoint is the foundation for proper subagentic orchestration. Upcoming work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parent agent integration&lt;/strong&gt; — the recovery payload feeds directly back into the orchestration layer to automatically restart subagents from the right step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured step tracking&lt;/strong&gt; — support for agents that emit structured JSON progress instead of natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session replay&lt;/strong&gt; — rewind any session to any point and branch from there&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;As agent workflows get longer and more complex, failure recovery becomes a first-class concern. Trooper's approach — track everything as it happens, make recovery queryable — is a different philosophy from retry-and-hope.&lt;/p&gt;

&lt;p&gt;Local-first by default. Cloud when you choose. And now recoverable when things go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;https://github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;How are you handling agent failures in your workflows today? Drop a comment genuinely curious what patterns people are using.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Tested Privacy-Aware Routing with 4 AI Agents: What Actually Stayed Local</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Wed, 13 May 2026 02:40:52 +0000</pubDate>
      <link>https://dev.to/shouvik12/i-tested-privacy-aware-routing-with-4-ai-agents-what-actually-stayed-local-39oa</link>
      <guid>https://dev.to/shouvik12/i-tested-privacy-aware-routing-with-4-ai-agents-what-actually-stayed-local-39oa</guid>
      <description>&lt;p&gt;Following up on my &lt;a href="https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5"&gt;earlier Trooper experiments&lt;/a&gt;, I wanted to see if per-request privacy routing actually works in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The test:&lt;/strong&gt; 4 agents running simultaneously. Some handling public knowledge (OAuth security, Redis vs Memcached). Others handling sensitive data (API keys, customer PII).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rule:&lt;/strong&gt; Credentials and PII stay on my machine. Everything else can use Claude.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Each agent gets a &lt;code&gt;x_force_local&lt;/code&gt; flag:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent 1 - security-analyst (☁️ Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "What are the top 3 OAuth2 vulnerabilities?"  
Routing: Public knowledge, let Claude handle it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent 2 - credential-formatter (🔒 Qwen local)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Task:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Format as JSON: api_key=sk-prod-x7f9k2m, vault_url=https://vault.acme.io:8200"&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="err"&gt;Routing:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Contains&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;credentials&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;must&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;stay&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;machine&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent 3 - architecture-advisor (☁️ Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: "Redis or Memcached for session storage?"  
Routing: General best practices, use cloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent 4 - compliance-reporter (🔒 Qwen local)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;`Task: "Summarize: 47 tickets today. 3 had PII (Alice Johnson, Bob Chen, Maria Garcia)"  
Routing: Contains customer names — privacy violation if sent to cloud`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvsb7hdks8f7yurzyfn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvsb7hdks8f7yurzyfn5.png" alt=" " width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every agent completed successfully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud agents:&lt;/strong&gt; 3.8s and 2.4s (Claude handled complex reasoning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local agents:&lt;/strong&gt; 2.4s and 1.2s (Qwen formatted data locally)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The critical part:&lt;/strong&gt; API keys, vault URLs, and customer names never left my machine. Zero network calls to Anthropic for those two agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened Under the Hood
&lt;/h2&gt;

&lt;p&gt;When Agent 2 (credential-formatter) ran with &lt;code&gt;x_force_local: true&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request intercepted by Trooper proxy&lt;/li&gt;
&lt;li&gt;Privacy flag detected&lt;/li&gt;
&lt;li&gt;Routed to local Ollama instead of Claude API&lt;/li&gt;
&lt;li&gt;Session context maintained via 3-layer system (Anchor/SITREP/Tail)&lt;/li&gt;
&lt;li&gt;JSON response returned — credentials never hit the network&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The vault URL and API key stayed on my hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;Using the OpenAI SDK (works with any OpenAI-compatible client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-anthropic-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Trooper proxy
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Regular request → Claude
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OAuth2 vulnerabilities?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Session-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Privacy request → Qwen local
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Format: api_key=sk-prod...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Session-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credential-formatter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;extra_body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x_force_local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# This keeps it local
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the entire API. One boolean flag controls routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most LLM proxies route between cloud providers. LiteLLM falls back from Claude to OpenAI. That's useful for uptime, but both destinations are someone else's servers.&lt;/p&gt;

&lt;p&gt;Trooper's &lt;code&gt;x_force_local&lt;/code&gt; routes to &lt;strong&gt;your machine&lt;/strong&gt;. Different failure mode, different privacy guarantee.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you need it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code refactoring with internal URLs&lt;/li&gt;
&lt;li&gt;Proprietary algorithms (not secret, just yours)&lt;/li&gt;
&lt;li&gt;Customer data that shouldn't leave your network&lt;/li&gt;
&lt;li&gt;Cost control (force expensive operations local)&lt;/li&gt;
&lt;li&gt;Offline work (flights, train rides, API outages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When you don't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public API questions&lt;/li&gt;
&lt;li&gt;General best practices&lt;/li&gt;
&lt;li&gt;Complex reasoning that needs Claude's horsepower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point isn't "local always" or "cloud always." It's per-request control based on what you're asking.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Context Preservation Works
&lt;/h2&gt;

&lt;p&gt;The hardest part of routing isn't switching models — it's maintaining conversation state.&lt;/p&gt;

&lt;p&gt;Trooper uses a 3-layer compaction system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Anchor (~10%):** First 2 turns verbatim, never dropped  
**SITREP (~20%):** Rule-based summary of middle turns  
**Tail (~70%):** Last N turns verbatim
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total budget: 6144 tokens (configurable)&lt;/p&gt;

&lt;p&gt;When Agent 4 (compliance-reporter) ran locally, Qwen received the anchor, a compressed SITREP of what Claude said earlier, and the immediate context.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Doesn't Work Great
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Local models aren't Claude.&lt;/strong&gt; Qwen 2.5 is fast and solid for structured tasks (JSON formatting, parsing, summarization). But if you need deep reasoning, route to Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context compression is lossy.&lt;/strong&gt; Trooper compresses middle turns into summaries. For precision-critical workflows, keep sessions short or increase the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need Ollama running.&lt;/strong&gt; This isn't plug-and-play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:3b
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;code&gt;qwen2.5:3b&lt;/code&gt; (2GB, fast) for most tasks. Switch to &lt;code&gt;7b&lt;/code&gt; (5GB) when I need better output quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compared to My Previous Post
&lt;/h2&gt;

&lt;p&gt;Last time I showed what happens when Claude quota runs out: Trooper automatically falls back to Ollama with context preserved. That's &lt;strong&gt;reactive&lt;/strong&gt; — something breaks, the system recovers.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;proactive&lt;/strong&gt;: you tell it "keep this request local" before sending. Different problem, same underlying context system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Pull local model&lt;/span&gt;
ollama pull qwen2.5:3b

&lt;span class="c"&gt;# 2. Clone and run Trooper&lt;/span&gt;
git clone https://github.com/shouvik12/trooper
&lt;span class="nb"&gt;cd &lt;/span&gt;trooper
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
go run main.go providers.go classifier.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trooper starts on &lt;code&gt;localhost:3000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Point any OpenAI-compatible client at it and add &lt;code&gt;x_force_local: true&lt;/code&gt; when you want privacy routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/shouvik12/trooper" rel="noopener noreferrer"&gt;https://github.com/shouvik12/trooper&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback welcome — especially on edge cases or use cases I haven't considered.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is v3.1. The x_force_local feature shipped last week. Still iterating on auto-routing classification.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>privacy</category>
    </item>
    <item>
      <title>How I built a Go proxy that keeps your LLM conversation alive when cloud quota runs out</title>
      <dc:creator>Shouvik Palit</dc:creator>
      <pubDate>Sun, 03 May 2026 01:23:28 +0000</pubDate>
      <link>https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5</link>
      <guid>https://dev.to/shouvik12/how-i-built-a-go-proxy-that-keeps-your-llm-conversation-alive-when-cloud-quota-runs-out-8k5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
If you've ever been mid-conversation with Claude or GPT, hit a quota limit, and switched to a local Ollama model,you know the pain. The local model has zero context. It's like walking into a meeting 45 minutes late and nobody catches you up.&lt;br&gt;
I got frustrated enough to build something about it. That something is Trooper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Trooper&lt;/strong&gt;&lt;br&gt;
Trooper is a lightweight Go proxy (~850 lines, two files) that sits between your application and your LLM providers. When a cloud provider returns a quota error (429, 402, 529), Trooper automatically falls back to a local Ollama instance without dropping the conversation context.&lt;br&gt;
Single binary. Zero dependencies. Easy to audit since it sits in front of your API keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real problem: context loss on fallback&lt;/strong&gt;&lt;br&gt;
Most fallback proxies solve the routing problem but ignore the context problem. They either pass the raw message history as-is (which blows up the local model's context window) or they truncate the oldest turns (which kills continuity).&lt;br&gt;
Neither works well in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution: three-layer context compaction&lt;/strong&gt;&lt;br&gt;
Trooper uses a structured compaction strategy before handing off to Ollama:&lt;br&gt;
&lt;strong&gt;Anchor&lt;/strong&gt; : The first two turns of the conversation are always preserved. These establish the original intent and set the tone.&lt;br&gt;
&lt;strong&gt;SITREP&lt;/strong&gt; : The middle turns get compressed into a structured summary called a SITREP. It extracts intent, entities, open loops, recent actions, and resolved items. The local model gets situational awareness, not raw history.&lt;br&gt;
&lt;strong&gt;Tail&lt;/strong&gt; : The most recent turns are preserved within a configurable token budget.&lt;/p&gt;

&lt;p&gt;A real SITREP looks like this in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📦  Context compaction triggered — 538 tokens exceeds 500 budget
📦  Context compaction complete
    Total turns    : 7
    Anchor turns   : 2 (~43 tokens)
    Middle turns   : 2 → SITREP (~71 tokens)
    Recent turns   : 3 (~323 tokens)
    Tokens used    : 437 / 500
    SITREP         : intent="trooper" stage=unclear confidence=0.60 open=1 actions=0 resolved=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local model knows what you were working on, what's broken, what's been resolved, and what the last few exchanges were. That's enough to keep the conversation coherent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Go&lt;/strong&gt;&lt;br&gt;
Single binary distribution was the main reason. No runtime, no dependencies, drop it anywhere and it runs. The codebase being ~850 lines also means anyone can read the whole thing in an afternoon — important for something that proxies API keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider support&lt;/strong&gt;&lt;br&gt;
Trooper currently supports Claude, Gemini, and OpenAI as cloud providers with automatic fallback to Ollama. The provider chain is configurable via environment variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;br&gt;
V3.0 is focused on foundation hardening — concurrency fixes and improved error handling. V3.1 will improve the SITREP extraction quality on longer conversations, which is where intent detection starts to degrade today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it&lt;/strong&gt;&lt;br&gt;
github.com/shouvik12/trooper&lt;br&gt;
Would love feedback on the context compaction approach — especially from anyone running larger local models. What's your cold-start latency on fallback?&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>llm</category>
      <category>ai</category>
      <category>go</category>
    </item>
  </channel>
</rss>
