<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: chuan jiang</title>
    <description>The latest articles on DEV Community by chuan jiang (@windhood-jza).</description>
    <link>https://dev.to/windhood-jza</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3997783%2F957698b8-2b0b-4cf2-bba6-1d436ad35eb9.png</url>
      <title>DEV Community: chuan jiang</title>
      <link>https://dev.to/windhood-jza</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/windhood-jza"/>
    <language>en</language>
    <item>
      <title>Routing Hermes Agent Through a Local Headroom Proxy for Context Compression</title>
      <dc:creator>chuan jiang</dc:creator>
      <pubDate>Tue, 23 Jun 2026 02:33:55 +0000</pubDate>
      <link>https://dev.to/windhood-jza/routing-hermes-agent-through-a-local-headroom-proxy-for-context-compression-78f</link>
      <guid>https://dev.to/windhood-jza/routing-hermes-agent-through-a-local-headroom-proxy-for-context-compression-78f</guid>
      <description>&lt;h1&gt;
  
  
  Routing Hermes Agent Through a Local Headroom Proxy for Context Compression
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make every Hermes Agent LLM call transparently route through a local&lt;br&gt;
Headroom reverse proxy running Kompress context compression.&lt;br&gt;
Hermes still uses its normal CLI and OAuth credentials; Headroom sits in&lt;br&gt;
the middle, compressing context before forwarding upstream.&lt;br&gt;
Result: ≥30% token savings on long conversations, no API key changes,&lt;br&gt;
OAuth passthrough preserved.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why I Wrote This (for Humans)
&lt;/h2&gt;

&lt;p&gt;I run Hermes Agent on a side project.&lt;br&gt;
Not a startup, not a funded team — just me and my own time.&lt;/p&gt;

&lt;p&gt;The honest truth: I can't afford to run AI the way the docs assume.&lt;br&gt;
Every long task, every cron job, every code review loop — they all&lt;br&gt;
rack up tokens, and tokens rack up bills.&lt;/p&gt;

&lt;p&gt;So I started looking for ways to keep using Hermes without&lt;br&gt;
watching my budget drain every month.&lt;/p&gt;

&lt;p&gt;I tried a few things before Headroom. Shorter prompts degraded the&lt;br&gt;
agent's reasoning. Switching to a cheaper model lost capability where&lt;br&gt;
it mattered. Hosted compression services existed but they cost money&lt;br&gt;
too, and they wanted my conversation history on someone else's server.&lt;/p&gt;

&lt;p&gt;Headroom was the answer: a local proxy with a Kompress engine inside.&lt;br&gt;
It runs on hardware I already own, costs nothing per request, and&lt;br&gt;
quietly makes every conversation shorter before it leaves my machine.&lt;/p&gt;

&lt;p&gt;This article is how I set it up — focused on the constraints that&lt;br&gt;
actually matter, not a command-by-command recipe. Because if you're&lt;br&gt;
a solo developer staring at your own LLM bill, you don't need a&lt;br&gt;
copy-paste guide. You need to understand what success looks like&lt;br&gt;
and what silently breaks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you're a human reader and you've gotten the why and the what,&lt;br&gt;
feel free to &lt;strong&gt;stop here&lt;/strong&gt;. The section below is structured for&lt;br&gt;
AI agents to parse — it gives them the goal, the constraints, and&lt;br&gt;
the verification checklist without ambiguity.&lt;br&gt;
Skim it for completeness, or skip to &lt;em&gt;When To Use This vs. Alternatives&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  For AI Agents (Specification)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The section above is for humans. This section is for AI agents — and&lt;br&gt;
for humans who want to think like one. Everything below is structured&lt;br&gt;
so an AI agent can extract goal, constraints, and verification from&lt;br&gt;
markdown headings alone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;Route all Hermes Agent LLM calls transparently through a local Headroom&lt;br&gt;
reverse proxy running Kompress context compression. Achieve ≥30% token&lt;br&gt;
savings on long conversations while preserving OAuth passthrough to&lt;br&gt;
upstream providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Hermes Agent] → [Headroom @ :8788] → [Upstream OAuth API]
                       ↑
                 Kompress compresses
                 context here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious bit: Hermes only parses a custom &lt;code&gt;base_url&lt;/code&gt; when&lt;br&gt;
&lt;code&gt;auth_type=api_key&lt;/code&gt;; for OAuth providers it returns early, so the override is never applied. A runtime&lt;br&gt;
plugin is the only clean way to redirect OAuth traffic through Headroom&lt;br&gt;
without forking Hermes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Constraints (do not violate)
&lt;/h3&gt;

&lt;p&gt;These constraints exist because violating any of them causes a &lt;strong&gt;silent fallback&lt;br&gt;
to the direct API&lt;/strong&gt;, which looks like success but yields zero savings.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Headroom ≥ 0.26&lt;/strong&gt;: earlier versions lack the Kompress GPU backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth providers require runtime patching&lt;/strong&gt;: &lt;code&gt;auth.json.credential_pool[*].base_url&lt;/code&gt;
must be rewritten, &lt;code&gt;HERMES_OVERLAYS&lt;/code&gt; must be patched, and
&lt;code&gt;_seed_from_singletons&lt;/code&gt; must be monkey-patched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple providers = multiple patches&lt;/strong&gt;: each provider enabled in
the plugin must be patched independently; missing one lets that provider's traffic bypass Headroom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU optional but recommended&lt;/strong&gt;: the CPU backend works but is ~10x slower;
6GB VRAM is enough for &lt;code&gt;max_concurrent=1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;require_health: true&lt;/code&gt;&lt;/strong&gt; is the default: the plugin refuses to register
if Headroom is unhealthy, preventing silent fallback&lt;/li&gt;
&lt;/ul&gt;
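
&lt;p&gt;To make the patching constraints concrete, here is a minimal Python sketch. It is not the plugin's actual code. It assumes &lt;code&gt;credential_pool&lt;/code&gt; is a list of entries that each carry a &lt;code&gt;provider&lt;/code&gt; and a &lt;code&gt;base_url&lt;/code&gt; key; the real &lt;code&gt;auth.json&lt;/code&gt; layout may differ, and both helper names are hypothetical. It rewrites every entry, then reports any provider still pointing elsewhere, since a single unpatched provider bypasses Headroom.&lt;/p&gt;

```python
import copy

# Assumption: Headroom listens here, as in the architecture diagram.
PROXY_URL = "http://127.0.0.1:8788"


def rewrite_base_urls(auth, proxy_url=PROXY_URL):
    """Return a copy of an auth.json-like dict with every pool entry routed to the proxy.

    Assumed (hypothetical) layout:
    {"credential_pool": [{"provider": ..., "base_url": ...}, ...]}
    """
    patched = copy.deepcopy(auth)  # leave the caller's data untouched for rollback
    for entry in patched.get("credential_pool", []):
        entry["base_url"] = proxy_url
    return patched


def unpatched_providers(auth, proxy_url=PROXY_URL):
    """List providers whose base_url does not point at the proxy."""
    return [
        entry.get("provider", "unknown")
        for entry in auth.get("credential_pool", [])
        if entry.get("base_url") != proxy_url
    ]


if __name__ == "__main__":
    # Placeholder providers and hosts, for illustration only.
    auth = {
        "credential_pool": [
            {"provider": "provider-a", "base_url": "https://api.example-a.invalid"},
            {"provider": "provider-b", "base_url": "https://api.example-b.invalid"},
        ]
    }
    print(unpatched_providers(auth))
    print(unpatched_providers(rewrite_base_urls(auth)))
```

&lt;p&gt;Working on a copy keeps the original URLs available, which is what a rollback needs.&lt;/p&gt;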

&lt;h3&gt;
  
  
  Verification Checklist
&lt;/h3&gt;

&lt;p&gt;A reader (human or AI) should confirm success using only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;curl 127.0.0.1:8788/health&lt;/code&gt; returns &lt;code&gt;{"status":"healthy"}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Headroom logs (default &lt;code&gt;~/.headroom/logs/&lt;/code&gt;) show a recent request
with non-zero &lt;code&gt;tokens_saved&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A Hermes chat test on a long prompt completes without a quota error
(or with reduced consumption vs. baseline)&lt;/li&gt;
&lt;li&gt;The provider &lt;code&gt;base_url&lt;/code&gt; in the Hermes runtime points to &lt;code&gt;127.0.0.1:8788&lt;/code&gt;,
not the official host&lt;/li&gt;
&lt;/ul&gt;
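
&lt;p&gt;The health check and the base_url check can be scripted. The sketch below uses only the Python standard library; the function names are illustrative, and it assumes the health endpoint returns exactly the JSON shape shown in the checklist. &lt;code&gt;points_to_proxy&lt;/code&gt; expects a full URL with a scheme.&lt;/p&gt;

```python
import json
import urllib.request
from urllib.parse import urlsplit

HEALTH_URL = "http://127.0.0.1:8788/health"


def body_is_healthy(body):
    """True if the body is a JSON object whose "status" is "healthy"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def points_to_proxy(base_url, host="127.0.0.1", port=8788):
    """True if base_url (scheme included) targets the local Headroom port."""
    parts = urlsplit(base_url)
    return parts.hostname == host and parts.port == port


def headroom_is_healthy(url=HEALTH_URL, timeout=2.0):
    """Fetch the health endpoint; any network or decode error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return body_is_healthy(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    print("healthy:", headroom_is_healthy())
    print("routed:", points_to_proxy("http://127.0.0.1:8788/v1"))
```

&lt;p&gt;Treating every error as unhealthy matches the intent of the checklist: a proxy that cannot be reached should never count as a pass.&lt;/p&gt;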

&lt;p&gt;If any of these fail, &lt;strong&gt;the route is not working&lt;/strong&gt;, even if the system&lt;br&gt;
"looks healthy" from outside.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Modes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;What to investigate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;401 Unauthorized&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Headroom not passing Authorization header&lt;/td&gt;
&lt;td&gt;Confirm Headroom version ≥ 0.26; check whether the &lt;code&gt;is_chatgpt_auth&lt;/code&gt; branch is triggered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
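&lt;td&gt;Direct connection to upstream despite plugin enabled&lt;/td&gt;
&lt;td&gt;Plugin not loaded, or &lt;code&gt;auth.json&lt;/code&gt; base_url not rewritten&lt;/td&gt;
&lt;td&gt;
Confirm &lt;code&gt;config.yaml&lt;/code&gt; plugins.enabled contains &lt;code&gt;headroom-route&lt;/code&gt; and that the &lt;code&gt;auth.json&lt;/code&gt; base_url entries point to &lt;code&gt;127.0.0.1:8788&lt;/code&gt;
&lt;/td&gt;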
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headroom 502 Bad Gateway&lt;/td&gt;
&lt;td&gt;Upstream OAuth endpoint URL changed&lt;/td&gt;
&lt;td&gt;Update &lt;code&gt;route.yaml&lt;/code&gt; anthropic_api_url&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kompress very slow&lt;/td&gt;
&lt;td&gt;CPU backend or max_concurrent too low&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;HEADROOM_KOMPRESS_BACKEND=pytorch&lt;/code&gt; and provide a GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Performance Baseline
&lt;/h3&gt;

&lt;p&gt;GTX 1060 6GB, &lt;code&gt;max_concurrent=1&lt;/code&gt;, &lt;code&gt;protect_recent=5&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Original&lt;/th&gt;
&lt;th&gt;Compressed&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Short prompt (&amp;lt;500 tokens)&lt;/td&gt;
&lt;td&gt;458&lt;/td&gt;
&lt;td&gt;458&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long conversation (heavy tool results)&lt;/td&gt;
&lt;td&gt;28,368&lt;/td&gt;
&lt;td&gt;11,283&lt;/td&gt;
&lt;td&gt;17,085&lt;/td&gt;
&lt;td&gt;60.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Many tool schemas loaded&lt;/td&gt;
&lt;td&gt;30,007&lt;/td&gt;
&lt;td&gt;28,496&lt;/td&gt;
&lt;td&gt;1,511&lt;/td&gt;
&lt;td&gt;5.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short prompts are skipped by design (&lt;code&gt;min_tokens_to_crush=500&lt;/code&gt; threshold)&lt;/li&gt;
&lt;li&gt;Tool-result-heavy scenarios see the largest gains&lt;/li&gt;
&lt;li&gt;Kompress is stateless streaming compression, with per-request latency of ~150ms&lt;/li&gt;
&lt;/ul&gt;
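
&lt;p&gt;The Saved and % columns follow directly from the Original and Compressed token counts. This small Python sketch reproduces the arithmetic for the three rows. The function names are illustrative rather than Headroom APIs, and treating a prompt of exactly 500 tokens as eligible for compression is an assumption about the threshold boundary.&lt;/p&gt;

```python
# Threshold named in the takeaways above.
MIN_TOKENS_TO_CRUSH = 500


def savings(original, compressed):
    """Return (tokens_saved, percent_saved rounded to one decimal place)."""
    saved = original - compressed
    percent = round(100 * saved / original, 1) if original else 0.0
    return saved, percent


def is_skipped(original, threshold=MIN_TOKENS_TO_CRUSH):
    """Assumption: prompts below the threshold are forwarded uncompressed."""
    return threshold > original


if __name__ == "__main__":
    rows = [
        ("Short prompt", 458, 458),
        ("Long conversation", 28368, 11283),
        ("Many tool schemas", 30007, 28496),
    ]
    for name, original, compressed in rows:
        saved, percent = savings(original, compressed)
        print(name, saved, percent, "skipped" if is_skipped(original) else "compressed")
```

&lt;p&gt;Running the same calculation on your own Headroom log entries gives a baseline to compare against the ≥30% goal.&lt;/p&gt;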

&lt;h3&gt;
  
  
  What I Recommend You Explore
&lt;/h3&gt;

&lt;p&gt;This article intentionally avoids step-by-step commands. An AI agent&lt;br&gt;
should be able to figure out the right commands from the constraints&lt;br&gt;
above. Concretely, explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does Headroom discover its configuration? (env vars + config file)&lt;/li&gt;
&lt;li&gt;How does the plugin decide which providers to patch? (config.yaml block)&lt;/li&gt;
&lt;li&gt;How is the OAuth token persisted across restarts? (auth.json)&lt;/li&gt;
&lt;li&gt;How can I tell after the fact whether traffic went through Headroom? (logs)&lt;/li&gt;
&lt;li&gt;What is &lt;code&gt;is_chatgpt_auth&lt;/code&gt; and why does it matter? (branch in upstream detection)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Compatibility Notes
&lt;/h3&gt;

&lt;p&gt;Hermes loads each plugin independently, so follow these principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plugin names must be unique&lt;/li&gt;
&lt;li&gt;Do not mutate global state in &lt;code&gt;register()&lt;/code&gt; unless intentional&lt;/li&gt;
&lt;li&gt;Do not contend over the same state file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;headroom-route&lt;/code&gt; plugin coexists with &lt;code&gt;quota-gate&lt;/code&gt; (a quota check&lt;br&gt;
before each LLM call): the two do not overlap, and both run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rollback
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Disable the plugin: remove &lt;code&gt;headroom-route&lt;/code&gt; from &lt;code&gt;config.yaml&lt;/code&gt; plugins.enabled&lt;/li&gt;
&lt;li&gt;Restore &lt;code&gt;auth.json.credential_pool[*].base_url&lt;/code&gt; to the official URLs&lt;/li&gt;
&lt;li&gt;Restore &lt;code&gt;HERMES_OVERLAYS.providers.*.base_url&lt;/code&gt; to the official URLs&lt;/li&gt;
&lt;li&gt;Stop the Headroom proxy (optional)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When To Use This vs. Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When to prefer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Headroom + Kompress (this article)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You control the host, have a GPU, want zero API changes, want OAuth preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosted compression SaaS&lt;/td&gt;
&lt;td&gt;You cannot install local dependencies; willing to send context to a third party&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smaller context window / shorter prompts&lt;/td&gt;
&lt;td&gt;You don't actually have a compression problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switch to a different model&lt;/td&gt;
&lt;td&gt;Sometimes the right answer is "use a cheaper model", not "compress more"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;A working Headroom + Hermes setup gives you &lt;strong&gt;transparent context compression&lt;/strong&gt;&lt;br&gt;
without rewriting Hermes or swapping API keys. The hard part is rewriting&lt;br&gt;
the OAuth providers' base_url; once the plugin solves that, the rest is a&lt;br&gt;
standard reverse-proxy deployment.&lt;/p&gt;

&lt;p&gt;If you are an AI agent reading this and trying to set it up: focus on&lt;br&gt;
&lt;strong&gt;getting the verification checklist to pass&lt;/strong&gt;, not on copying any&lt;br&gt;
specific command sequence. The constraints and invariants tell you&lt;br&gt;
what success looks like; the means are deliberately left for you&lt;br&gt;
to figure out.&lt;/p&gt;




&lt;h2&gt;
  
  
  📡 AI Agent Index (llms.txt fragment)
&lt;/h2&gt;

&lt;p&gt;This article is part of the &lt;strong&gt;AI Agent Tutorials&lt;/strong&gt; collection.&lt;br&gt;
Other entries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;(Coming soon) TradingView × Hermes — BTC divergence monitoring&lt;/li&gt;
&lt;li&gt;(Coming soon) X Content Pipeline — multi-platform publishing from Obsidian&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full index at the canonical blog's &lt;code&gt;llms.txt&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;📱 &lt;strong&gt;More agent tutorials&lt;/strong&gt;: [link to author site]&lt;br&gt;
🔗 &lt;strong&gt;Canonical&lt;/strong&gt;: this article's canonical version lives at the author's blog.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
  </channel>
</rss>
