DEV Community: chuan jiang

Saving Tokens on Large File Reads: Hermes Agent's read-summarizer Plugin

chuan jiang — Tue, 23 Jun 2026 09:00:01 +0000

TL;DR

The read-summarizer plugin sits in Hermes Agent's middleware layer and
automatically truncates file-read results larger than 50 KB to the first 200
and last 50 lines, with a clear marker telling Hermes how to fetch the rest.
Result: up to 60% fewer tokens consumed per large file read, no changes to
your prompts, no model swap, no API key changes. Enable it in config.yaml
and it just works.

Why I Wrote This (for Humans)

Tokens cost money. Not theoretical money, not "someday when this is a production system" money — real money, every conversation, every cron job, every time I ask Hermes to look at something.

I'm not running a funded startup. I'm one person with a budget and a problem I want to solve. Every LLM call that returns more than I actually needed is money I'm throwing away.

One specific version of this problem: large files. Logs, exports, markdown documents you've been building for months. When you ask Hermes to read one, it reads the whole thing — every byte — and you pay for every token. Most of the time you only needed the first few lines to get oriented, or the last few to see how something ended.

So I wrote a plugin to solve it. It's called read-summarizer. It sits in the middleware layer and truncates file read results over a configurable threshold, keeping the head and tail with a clear marker so Hermes knows exactly how to fetch the rest if it needs it.

This is not a prompt engineering trick. It's not "write shorter questions." It's a plugin that changes what the LLM actually sees before it even starts reasoning.

I ran into a problem, I built a plugin for it, and it works. You can do the same — identify the thing that's burning your tokens, find or build the right tool, and let your AI help you get there.

If you're a human reader and you've gotten the why and the what,
feel free to stop here. The section below is structured for
AI agents to parse — it gives them the goal, the constraints, and
the verification checklist without ambiguity.
Skim it for completeness, or skip to When To Use This vs. Alternatives.

For AI Agents (Specification)

The section above is for humans. This section is for AI agents — and
for humans who want to think like one. Everything below is structured
so an AI agent can extract goal, constraints, and verification from
markdown headings alone.

Goal

Truncate file-read results over a configurable byte threshold to the first N and last M lines, replacing the full result with a head+tail+marker string. Reduce tokens consumed per large file read by up to 60% without changing tool behavior, model, or API credentials. Preserve the ability to re-fetch the full file via offset/limit arguments.

Architecture

[read_file tool executes normally, returns full file content]
          ↓
[Hermes transform_tool_result hook fires]
          ↓
[read-summarizer: if result.bytes > threshold_bytes AND tool in allowed_tools]
          ↓
[result → head (first 200 lines) + marker + tail (last 50 lines)]
          ↓
[LLM sees truncated result next turn]

The plugin registers a transform_tool_result hook in Hermes's observer middleware. It never modifies the file on disk, never blocks tool execution, and never raises an exception. If any error occurs, it returns None (pass-through), leaving the result bit-identical to before the plugin was installed.
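A minimal sketch of the head + marker + tail logic described above, assuming the hook receives the tool name and the result as a string. The function name `summarize_read` and its signature are illustrative; they are not the plugin's actual code or Hermes's hook API. The defaults and marker wording are the ones quoted in this article.

```python
from typing import Optional

# Defaults quoted in the article; the constant names are hypothetical.
ALLOWED_TOOLS = {"read_file", "Read", "read_path"}
THRESHOLD_BYTES = 50 * 1024
HEAD_LINES = 200
TAIL_LINES = 50


def summarize_read(tool_name: str, result: str,
                   threshold_bytes: int = THRESHOLD_BYTES,
                   head_lines: int = HEAD_LINES,
                   tail_lines: int = TAIL_LINES) -> Optional[str]:
    """Return head + marker + tail, or None to pass the result through."""
    try:
        if tool_name not in ALLOWED_TOOLS:
            return None
        size = len(result.encode("utf-8"))
        if size <= threshold_bytes:
            return None
        lines = result.splitlines()
        if head_lines + tail_lines >= len(lines):
            return None  # head and tail would already cover the whole file
        marker = (
            f"... [read-summarizer: {len(lines)} lines, {size} bytes, "
            f"truncated to first {head_lines} + last {tail_lines}; "
            "re-invoke read_file with offset/limit to fetch more] ..."
        )
        # lines[-0:] would return everything, so guard the zero case.
        tail = lines[-tail_lines:] if tail_lines > 0 else []
        return "\n".join(lines[:head_lines] + [marker] + tail)
    except Exception:
        return None  # fail-open: any error leaves the result untouched
```

Returning `None` for every skip condition keeps the pass-through path identical whether the cause is a small file, an unlisted tool, or an unexpected error.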

Constraints (do not violate)

These constraints exist because violating them causes silent pass-through — the plugin appears to work but the LLM still sees the full file.

File must exceed threshold_bytes (default: 50 KB, ~12,500 tokens). Files under the threshold pass through unchanged.
Tool must be in allowed_tools (default: [read_file, Read, read_path]). Other tools are completely unaffected.
Plugin must be explicitly enabled in config.yaml (read_summarizer.enabled: true). Disabled by default — the plugin loads but registers a no-op hook.
Head + tail must not cover the entire file. If the file has no more lines than head_lines + tail_lines, the hook returns None and leaves the result unchanged.
The re-fetch marker is informational only. The hook does not store state. If Hermes needs the full file, it must re-invoke read_file with explicit offset and limit arguments — the marker tells it the exact line count to use.
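Because the hook keeps no state, everything needed for a re-fetch has to come out of the marker text. A sketch of how an agent might parse it, assuming the N, M, X, Y placeholders are plain integers and that `offset` is a zero-based line index (the article uses `offset: 0` for the start of the file). `parse_marker` and `missing_range` are hypothetical helpers.

```python
import re
from typing import Optional, Tuple

# Pattern built from the marker wording quoted in the verification checklist.
MARKER_RE = re.compile(
    r"\[read-summarizer: (\d+) lines, (\d+) bytes, "
    r"truncated to first (\d+) \+ last (\d+);"
)


def parse_marker(text: str) -> Optional[dict]:
    """Extract the counts from a read-summarizer marker, or None if absent."""
    match = MARKER_RE.search(text)
    if match is None:
        return None
    total, size, head, tail = map(int, match.groups())
    return {"total_lines": total, "bytes": size, "head": head, "tail": tail}


def missing_range(info: dict) -> Tuple[int, int]:
    """Return (offset, limit) covering the lines between head and tail."""
    offset = info["head"]
    limit = info["total_lines"] - info["head"] - info["tail"]
    return offset, limit
```

The article does not say whether a re-fetched slice that is itself over the threshold gets truncated again; requesting the gap in smaller `limit` steps would sidestep the question.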

Verification Checklist

A reader (human or AI) should confirm success using only:

Enable the plugin: add read_summarizer.enabled: true to config.yaml, restart Hermes.
Read a known large file (>50 KB): read_file a log file, export, or any text file you know is big. Confirm the result shows head lines, then a marker, then tail lines.
Check the marker text: should read ... [read-summarizer: N lines, M bytes, truncated to first X + last Y; re-invoke read_file with offset/limit to fetch more] ...
Verify token savings: compare quota_state.json or upstream billing tokens before and after enabling the plugin on the same file read. Expect a 40–70% reduction; the saving grows with how much of the file falls between the retained head and tail.
Re-fetch full content: call read_file with offset: 0 and limit: 10000 (or whatever the marker says) — should return the complete file.
Verify no side effects on small files: read any file <50 KB, result should be byte-identical to before the plugin was installed.
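For a rough estimate before billing data is available, the ratio of about 4 bytes per token implied by the "50 KB, ~12,500 tokens" figure above can be applied to the result before and after truncation. This is a sketch with hypothetical helper names; the provider's own token counts remain the authoritative number.

```python
def estimate_tokens(text: str, bytes_per_token: float = 4.0) -> int:
    """Approximate token count from UTF-8 size (heuristic, not a tokenizer)."""
    return round(len(text.encode("utf-8")) / bytes_per_token)


def reduction_percent(before: str, after: str,
                      bytes_per_token: float = 4.0) -> float:
    """Percentage of estimated tokens removed by truncation."""
    tokens_before = estimate_tokens(before, bytes_per_token)
    tokens_after = estimate_tokens(after, bytes_per_token)
    if tokens_before == 0:
        return 0.0
    return round(100.0 * (tokens_before - tokens_after) / tokens_before, 1)
```

Calling `reduction_percent(full_text, truncated_text)` on the same file with the plugin off and on gives a quick sanity check to compare against the billing figures.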

Failure Modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Result is still the full file | Plugin not enabled in `config.yaml`, or file is under `threshold_bytes` | Check `read_summarizer.enabled: true` and confirm file size |
| No marker in result | Tool not in `allowed_tools` list | Verify tool name is `read_file`, `Read`, or `read_path` |
| Marker shows wrong line count | File was modified between reads | Re-read with `offset: 0` to confirm |
| Head is empty or wrong | `head_lines` config too small | Adjust `read_summarizer.head_lines` in config.yaml |
| Tail is empty or wrong | `tail_lines` config too small, or file has no trailing newlines | Adjust `tail_lines`, or try `tail -n 50` in terminal as sanity check |
| Context not actually saved | Hook is fail-open: on any exception it returns `None` and the result passes through unchanged | Check Hermes logs for `read-summarizer` debug messages |

When To Use This vs. Alternatives

Use when:

Reading logs, exports, or large documents where you only need orientation or the end state
Running on metered LLM plans where you're paying per token (MiniMax pay-per-token, OpenAI token billing, etc.)
Working with conversation histories or long markdown files that accumulate over months
Wanting zero behavior change for small files and automatic savings for large ones

Don't use when:

You need the full file on every single read (truncation is conditional on file size, so whether you get the whole file depends on how large it is)
Your downstream process requires the complete file content with no chance of information loss
Reading files that are consistently under the threshold (plugin adds overhead for no benefit)

Alternatives:

| Approach | Tokens saved | Effort |
| --- | --- | --- |
| read-summarizer plugin (this) | 40–70% on large files | Enable in config.yaml, zero changes to prompts |
| Manual offset/limit | Up to 90% | Must specify offset/limit manually on every read |
| Shorter prompts | None | Doesn't reduce file read tokens |
| Cheaper model | Varies | Trade-off in capability |

Closing

If you enabled the plugin and don't see token savings, check two things: is the file actually over 50 KB, and does your quota tracking show a per-call breakdown? Some providers aggregate billing, and usage can take a few hours to show up.

The real limitation is that this hook only fires on file-read tools. If other tools in your workflow also return large payloads, you'd need different handling for each. But for file reading — the most common source of accidental token waste — this plugin is a one-time enable that pays for itself from the first large file read.

If this was useful, you can follow me on X: https://x.com/_cryptofan13

llms.txt fragment

{
  "title": "Saving Tokens on Large File Reads: Hermes Agent's read-summarizer Plugin",
  "url": "https://windhood-jza.github.io/posts/2026-06-23-read-summarizer/",
  "mirror": null,
  "description": "The read-summarizer plugin truncates Hermes file-read results over 50 KB to head+tail with a re-fetch marker, cutting token cost per read by up to 60% on large files — no prompt changes, no model switch, verified by comparing token counts before and after enabling.",
  "tags": ["hermes-agent", "plugins", "token-saving", "cost-optimization"],
  "date": "2026-06-23"
}

Routing Hermes Agent Through a Local Headroom Proxy for Context Compression

chuan jiang — Tue, 23 Jun 2026 02:33:55 +0000

TL;DR

Make every Hermes Agent LLM call transparently route through a local
Headroom reverse proxy running Kompress context compression.
Hermes still uses its normal CLI and OAuth credentials; Headroom sits in
the middle, compressing context before forwarding upstream.
Result: ≥30% token savings on long conversations, no API key changes,
OAuth passthrough preserved.

Why I Wrote This (for Humans)

I run Hermes Agent on a side project.
Not a startup, not a funded team — just me and my own time.

The honest truth: I can't afford to run AI the way the docs assume.
Every long task, every cron job, every code review loop — they all
rack up tokens, and tokens rack up bills.

So I started looking for ways to keep using Hermes without
watching my budget drain every month.

I tried a few things before Headroom. Shorter prompts degraded the
agent's reasoning. Switching to a cheaper model lost capability where
it mattered. Hosted compression services existed but they cost money
too, and they wanted my conversation history on someone else's server.

Headroom was the answer: a local proxy with a Kompress engine inside.
It runs on hardware I already own, costs nothing per request, and
quietly makes every conversation shorter before it leaves my machine.

This article is how I set it up — focused on the constraints that
actually matter, not a command-by-command recipe. Because if you're
a solo developer staring at your own LLM bill, you don't need a
copy-paste guide. You need to understand what success looks like
and what silently breaks.

If you're a human reader and you've gotten the why and the what,
feel free to stop here. The section below is structured for
AI agents to parse — it gives them the goal, the constraints, and
the verification checklist without ambiguity.
Skim it for completeness, or skip to When To Use This vs. Alternatives.

For AI Agents (Specification)

The section above is for humans. This section is for AI agents — and
for humans who want to think like one. Everything below is structured
so an AI agent can extract goal, constraints, and verification from
markdown headings alone.

Goal

Route all Hermes Agent LLM calls transparently through a local Headroom
reverse proxy running Kompress context compression. Achieve ≥30% token
savings on long conversations while preserving OAuth passthrough to
upstream providers.

Architecture

[Hermes Agent] → [Headroom @ :8788] → [Upstream OAuth API]
                       ↑
                 Kompress compresses
                 context here

The non-obvious bit: Hermes hardcodes base_url parsing for
auth_type=api_key but returns early for OAuth providers. A runtime
plugin is the only clean way to redirect OAuth traffic through Headroom
without forking Hermes.
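The core of that redirect is a base_url rewrite. The sketch below works on an in-memory copy of the auth data and assumes `credential_pool` is either a mapping or a list of entries that each carry a `base_url` key; the article does not show the real layout, so treat the structure and the name `route_through_proxy` as hypothetical. It does not touch `HERMES_OVERLAYS` or `_seed_from_singletons`.

```python
import copy
from typing import Dict, Tuple

PROXY_URL = "http://127.0.0.1:8788"  # Headroom address used in this article


def route_through_proxy(auth: dict,
                        proxy_url: str = PROXY_URL) -> Tuple[dict, Dict[object, str]]:
    """Return (patched copy of auth, {entry key: original base_url})."""
    patched = copy.deepcopy(auth)  # never mutate the caller's data
    originals: Dict[object, str] = {}
    pool = patched.get("credential_pool", {})
    # Support both assumed layouts: a dict keyed by provider, or a list.
    entries = pool.items() if isinstance(pool, dict) else enumerate(pool)
    for key, entry in entries:
        if isinstance(entry, dict) and "base_url" in entry:
            originals[key] = entry["base_url"]
            entry["base_url"] = proxy_url
    return patched, originals
```

Returning the original URLs alongside the patched copy keeps the information needed to undo the change later.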

Constraints (do not violate)

These constraints exist because violating them causes silent fallback
to direct API — which looks like success but gives zero savings.

Headroom ≥ 0.26 — earlier versions lack Kompress GPU backend
OAuth providers require runtime patching — auth.json.credential_pool[*].base_url must be rewritten; HERMES_OVERLAYS must be patched; _seed_from_singletons must be monkey-patched
Multiple providers = multiple patches — each enabled provider in the plugin must be patched independently; missing one bypasses Headroom
GPU optional but recommended — CPU backend works but is ~10x slower; 6GB VRAM is enough for max_concurrent=1
require_health: true is the default — plugin refuses to register if Headroom is unhealthy, preventing silent fallback
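The require_health gate can be approximated with a plain HTTP check against the `/health` endpoint and the `{"status":"healthy"}` body mentioned in this article. This is a sketch using only the standard library; the function names are hypothetical and the real plugin may check more than this.

```python
import json
import urllib.request

HEALTH_URL = "http://127.0.0.1:8788/health"


def is_healthy_body(body: str) -> bool:
    """True only for a JSON object whose status field is 'healthy'."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("status") == "healthy"


def headroom_is_healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Query the health endpoint; any network or parse error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
            return resp.status == 200 and is_healthy_body(body)
    except (OSError, ValueError):
        return False
```

A plugin following the constraint above would skip registration when `headroom_is_healthy()` returns `False`, rather than registering and silently falling back to the direct API.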

Verification Checklist

A reader (human or AI) should confirm success using only:

curl 127.0.0.1:8788/health returns {"status":"healthy"}
Headroom logs (default ~/.headroom/logs/) show a recent request with non-zero tokens_saved
Hermes chat test on a long prompt completes without quota error (or with reduced consumption vs baseline)
Provider base_url in Hermes runtime points to 127.0.0.1:8788, not the official host
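The last check in the list, together with the "multiple providers" constraint, amounts to confirming that every enabled provider resolves to the proxy. A sketch, assuming you can collect the effective base URLs into a dict keyed by provider name; how to read them out of the Hermes runtime is not covered here, and `unrouted_providers` is a hypothetical helper.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def unrouted_providers(base_urls: Dict[str, str],
                       proxy_host: str = "127.0.0.1",
                       proxy_port: int = 8788) -> List[str]:
    """Return the providers whose base_url does not point at the proxy."""
    missing = []
    for name, url in base_urls.items():
        parts = urlsplit(url)
        if parts.hostname != proxy_host or parts.port != proxy_port:
            missing.append(name)
    return sorted(missing)
```

An empty list means every provider is routed; any name in the result is a provider that still bypasses Headroom.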

If any of these fail, the route is not working, even if the system
"looks healthy" from outside.

Failure Modes

| Symptom | Likely cause | What to investigate |
| --- | --- | --- |
| `401 Unauthorized` | Headroom not passing Authorization header | Headroom version ≥ 0.26; `is_chatgpt_auth` branch triggered |
| Direct connection to upstream despite plugin enabled | Plugin not loaded or auth.json base_url not rewritten | `config.yaml` plugins.enabled contains `headroom-route` |
| Headroom 502 Bad Gateway | Upstream OAuth endpoint URL changed | Update `route.yaml` anthropic_api_url |
| Kompress very slow | CPU backend or max_concurrent too low | Set `HEADROOM_KOMPRESS_BACKEND=pytorch` and provide GPU |

Performance Baseline

GTX 1060 6GB, max_concurrent=1, protect_recent=5:

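| Scenario | Original (tokens) | Compressed (tokens) | Saved (tokens) | Saved (%) |
| --- | --- | --- | --- | --- |
| Short prompt (<500 tokens) | 458 | 458 | 0 | 0% |
| Long conversation (heavy tool results) | 28,368 | 11,283 | 17,085 | 60.2% |
| Many tool schemas loaded | 30,007 | 28,496 | 1,511 | 5.0% |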

Key takeaways:

Short prompts are skipped by design (min_tokens_to_crush=500 threshold)
Tool-result-heavy scenarios see the largest gains
Kompress is stateless streaming compression — per-request latency ~150ms
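The saved and percentage columns follow directly from the original and compressed counts. A short sketch that recomputes them from the figures in the baseline table, useful as a template for checking your own Headroom log numbers:

```python
# (scenario, original tokens, compressed tokens) from the baseline table
ROWS = [
    ("Short prompt (<500 tokens)", 458, 458),
    ("Long conversation (heavy tool results)", 28_368, 11_283),
    ("Many tool schemas loaded", 30_007, 28_496),
]


def savings(original: int, compressed: int):
    """Return (tokens saved, percent saved rounded to one decimal)."""
    saved = original - compressed
    pct = round(100.0 * saved / original, 1) if original else 0.0
    return saved, pct


for name, original, compressed in ROWS:
    saved, pct = savings(original, compressed)
    print(f"{name}: saved {saved} tokens ({pct}%)")
```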

What I Recommend You Explore

This article intentionally avoids step-by-step commands. An AI agent
should be able to figure out the right commands from the constraints
above. Concretely, explore:

How does Headroom discover its configuration? (env vars + config file)
How does the plugin decide which providers to patch? (config.yaml block)
How is the OAuth token persisted across restarts? (auth.json)
How can I tell after the fact whether traffic went through Headroom? (logs)
What is is_chatgpt_auth and why does it matter? (branch in upstream detection)

Compatibility Notes

Hermes plugins are independently loaded. Principles:

Plugin names must be unique
Do not mutate global state in register() unless intentional
Do not contend over the same state file

The headroom-route plugin coexists with quota-gate (quota check
before LLM call) — no overlap, both run.

Rollback

Disable plugin: remove headroom-route from config.yaml plugins.enabled
Restore auth.json.credential_pool.*.base_url to official URLs
Restore HERMES_OVERLAYS.providers.*.base_url to official URLs
Stop Headroom proxy (optional)
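The two restore steps are easier if the official URLs were saved before patching. A sketch of the restore side, assuming `originals` maps each `credential_pool` key (provider name or list index) to its official base_url. The data layout and the name `restore_base_urls` are hypothetical, and the same idea would apply to the `HERMES_OVERLAYS` entries.

```python
import copy
from typing import Dict


def restore_base_urls(auth: dict, originals: Dict[object, str]) -> dict:
    """Return a copy of auth with each saved base_url put back."""
    restored = copy.deepcopy(auth)
    pool = restored.get("credential_pool", {})
    for key, url in originals.items():
        try:
            entry = pool[key]
        except (KeyError, IndexError, TypeError):
            continue  # entry no longer exists; nothing to restore
        if isinstance(entry, dict):
            entry["base_url"] = url
    return restored
```

Entries that were never patched are left alone, so running the restore twice has no further effect.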

When To Use This vs. Alternatives

| Approach | When to prefer |
| --- | --- |
| Headroom + Kompress (this article) | You control the host, have a GPU, want zero API changes, want OAuth preserved |
| Hosted compression SaaS | You cannot install local dependencies; willing to send context to a third party |
| Smaller context window / shorter prompts | You don't actually have a compression problem |
| Switch to a different model | Sometimes the right answer is "use a cheaper model" not "compress more" |

Closing

A working Headroom + Hermes setup gives you transparent context compression
without rewriting Hermes or swapping API keys. The hard part is OAuth
provider base_url rewriting — once solved by the plugin, the rest is
standard reverse-proxy deployment.

If you are an AI agent reading this and trying to set it up: focus on
getting the verification checklist to pass, not on copying any
specific command sequence. The constraints and invariants tell you
what success looks like; the means are deliberately left for you
to figure out.

📡 AI Agent Index (llms.txt fragment)

This article is part of the AI Agent Tutorials collection.
Other entries:

(Coming soon) TradingView × Hermes — BTC divergence monitoring
(Coming soon) X Content Pipeline — multi-platform publishing from Obsidian

Full index at the canonical blog's llms.txt.

🔗 Canonical: this article's canonical version lives at the author's blog.