Azamat Safarov

Posted on Jun 2 • Originally published at dev.to

Hermes Agent Burned 603M Tokens Behind My Back — I Cut Background Costs by Up to 125x

#hermes #ollama #optimization #tokencost

Last Tuesday I noticed my Ollama Cloud Pro quota draining faster than usual. Way faster. I had burned through 603 million tokens in seven days without understanding where they went.

I opened my Hermes Agent logs and found something I did not know existed: an auxiliary: block with twelve background tasks. Compression, web extraction, vision, session search, skills matching — all running silently every time I typed a message. Every task was set to provider: auto. And because I had no API keys for the fallback chain, every one silently fell back to kimi-k2.6, my one-trillion-parameter main model.

I had no idea this was happening. The agent was sending eleven background prompts to the same model I was actively chatting with, through the same quota, without showing me the prompts. Compression alone fired 10–20 times per long session, each pass sending the full conversation history.

This is what I fixed immediately.

The Fix

I changed the auxiliary block of my ~/.hermes/config.yaml so that every task has an explicit provider and model instead of provider: auto. The complete YAML is in the Full Config section below.

Apply with /reset or restart Hermes. Config changes only take effect on new sessions.

If you just want the config and don't care about the story, jump to Full Config below, copy it, /reset, done.

How the Routing Works

Twelve tasks used to collapse into one trillion-parameter model. Now they are distributed across six models, from 8B to 1T.

What `provider: auto` Actually Does

I searched dozens of Hermes guides online. Not a single one mentions the auxiliary block. The official docs describe the YAML structure, but there's no warning that provider: auto silently falls back to your main model. I only found one video by AI Garage discussing this — nothing else. No blog posts, no Discord threads, no Reddit discussions.

The auto chain is: openrouter → Nous Portal → codex → gemini flash. If none of these backends has an API key configured, the task falls back to your main chat model.
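
The fallback described above can be sketched in a few lines of Python. This is only an illustration of the behavior as I understand it, not Hermes source code: the function name `resolve_auto`, the backend identifiers, and the `api_keys` dictionary are all hypothetical.

```python
from typing import Optional

# Order of the auto chain as described in this post.
# The identifiers are illustrative, not real Hermes provider names.
AUTO_CHAIN = ["openrouter", "portal", "codex", "gemini-flash"]


def resolve_auto(api_keys: dict[str, Optional[str]], main_model: str) -> str:
    """Return the first backend in the chain that has a key, else the main model."""
    for backend in AUTO_CHAIN:
        if api_keys.get(backend):  # a missing or empty key counts as unconfigured
            return backend
    # Nothing in the chain is configured: the task lands on the main chat model.
    return main_model


print(resolve_auto({}, "kimi-k2.6"))                       # no keys at all
print(resolve_auto({"codex": "sk-example"}, "kimi-k2.6"))  # one backend configured
```

With no keys configured, every call resolves to the main model, which is the situation I was in.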

So every message I typed to k2.6 also triggered background prompts to the same model, drawn from the same quota and never shown to me. Compression was the worst offender, because each pass resends the full conversation history.

My Ollama Cloud Pro Catalog

I have an Ollama Cloud Pro subscription. Here are the models available in my catalog that matter for routing:

Model	Size	Strength	Best for
`kimi-k2.6`	~1T params, 256K context	Reasoning, architecture, debugging	Main chat only
`kimi-k2.5`	~1T params	Same family, optimized for long context	Summarization, compression
`qwen3-vl:235b-instruct`	235B params	Multimodal (vision + text)	Screenshots, image analysis
`deepseek-v4-flash`	~20B params	Fast, good at structured output	Safety checks, classification
`gemma3:12b`	12B params	Lightweight, fast	Triage, profile tasks
`rnj-1:8b`	8B params	Cheapest in catalog	Titles, search, skills matching
`gemma4:e2b`	2B params	Smallest	Not used — too weak for any auxiliary task

I pulled all of these and tested them against the twelve auxiliary tasks. The routing below is the result of that testing.

Twelve Background Tasks

My Hermes version has twelve auxiliary tasks. I ordered them by how much they cost in practice:

#	Task	What it does	Why it costs money
1	`compression`	Summarizes conversation when context exceeds limits	Fires 10–20 times per long session. Each pass sends the full conversation history.
2	`web_extract`	Strips HTML boilerplate after `web_search`	Fires every time you search the web.
3	`vision`	Processes screenshots and images	Multimodal tokens are expensive.
4	`flush_memories`	Writes facts to memory files on `/new` or `/exit`	Runs at every session end.
5	`kanban_decomposer`	Breaks Kanban tasks into steps	Medium complexity, runs on board operations.
6	`curator`	Analyzes skill quality and redundancy	Heavy analysis task.
7	`session_search`	Searches past sessions and summarizes matches	Runs when you look up old conversations.
8	`skills_hub`	Matches your query to installed skills	Runs on most questions.
9	`triage_specifier`	Classifies incoming messages	Binary classification.
10	`approval`	Binary safety check before terminal commands	Simple yes/no on safety.
11	`profile_describer`	Generates profile bio	Rare, lightweight.
12	`title_generation`	Auto-names new sessions	Trivial, runs constantly.

Note: older Hermes versions (like the one shown in the AI Garage video) have eight tasks. My version has twelve; four were added in recent updates.

How I Mapped Models to Tasks

The logic is dead simple: put the lightest model that doesn't break on each task, and keep k2.6 for actual conversations.

Task	Model	Why this one
Main chat	`kimi-k2.6`	Architecture, debugging, discussion. The only task that actually needs a trillion parameters.
Compression, web_extract, kanban, curator	`kimi-k2.5`	Same Kimi family, optimized for long context. Summarization quality stays high. Using k2.6 for compression was burning quota for no quality gain.
Vision	`qwen3-vl:235b-instruct`	Only multimodal model in the catalog at this level. No alternative exists for image analysis.
Triage, profile	`gemma3:12b`	12B params vs 1T. Classification and bio generation do not need reasoning depth.
Approval	`deepseek-v4-flash` (~20B)	Binary safety check. Fast response time matters more than reasoning quality.
Titles, search, skills, MCP	`rnj-1:8b`	8B params. 125 times lighter than k2.6. The bulk of the savings is here: these tasks run constantly but need minimal intelligence.
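
The grouping in the table above can also be written as data and expanded into one entry per task. The sketch below is a hypothetical helper, not part of Hermes. It takes the task keys from my Full Config section, leaves out the timeouts, and emits YAML by hand so it needs no third-party package.

```python
# Model -> tasks grouping, following the table above.
ROUTING = {
    "kimi-k2.5": ["compression", "web_extract", "kanban_decomposer", "curator"],
    "qwen3-vl:235b-instruct": ["vision"],
    "gemma3:12b": ["triage_specifier", "profile_describer"],
    "deepseek-v4-flash": ["approval"],
    "rnj-1:8b": ["title_generation", "session_search", "skills_hub", "mcp"],
}


def expand_routing(routing: dict[str, list[str]], provider: str) -> dict[str, dict[str, str]]:
    """Flatten model -> tasks into task -> {provider, model}."""
    per_task: dict[str, dict[str, str]] = {}
    for model, tasks in routing.items():
        for task in tasks:
            if task in per_task:
                raise ValueError(f"task {task!r} is routed twice")
            per_task[task] = {"provider": provider, "model": model}
    return per_task


def to_yaml(per_task: dict[str, dict[str, str]]) -> str:
    """Render the per-task mapping as an auxiliary: block (timeouts omitted)."""
    lines = ["auxiliary:"]
    for task, entry in per_task.items():
        lines.append(f"  {task}:")
        lines.append(f"    provider: {entry['provider']}")
        lines.append(f"    model: {entry['model']}")
    return "\n".join(lines)


print(to_yaml(expand_routing(ROUTING, "ollama-cloud")))
```

Keeping the grouping by model makes it harder to forget a task, and the duplicate check catches a task that is accidentally listed under two models.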

What I Tried First: Local Models

Before settling on cloud routing, I tried running auxiliary tasks locally. I already had gemma4:e2b pulled via Ollama on my machine.

RTX 5070 Ti, 8 GB VRAM. One 6B-parameter model fits. Two is already borderline. Every time Hermes switched from compression to approval, Ollama unloaded one model and loaded another. Five to ten seconds of dead air. The GPU fan kicked in. I lost more time waiting for model swaps than I saved on tokens. I abandoned local auxiliary models the same day.

The Numbers

Here is what the routing actually means in terms of model size:

Task	Before (default)	After (routed)	Reduction
Titles, search, skills, MCP	`kimi-k2.6` (~1T)	`rnj-1:8b` (8B)	125x lighter
Triage, profile	`kimi-k2.6` (~1T)	`gemma3:12b` (12B)	83x lighter
Approval	`kimi-k2.6` (~1T)	`deepseek-v4-flash` (~20B)	50x lighter
Compression, web_extract	`kimi-k2.6` (~1T)	`kimi-k2.5` (~1T)	Same family, frees k2.6 for chat
Vision	`kimi-k2.6` (~1T)	`qwen3-vl` (235B)	Dedicated multimodal model
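
The "x lighter" figures are plain parameter-count ratios. The sketch below reproduces them from the approximate sizes in the catalog table. Parameter count is only a rough proxy for cost, not a measured price ratio.

```python
# Approximate parameter counts in billions, from the catalog table in this post.
PARAMS_B = {
    "kimi-k2.6": 1000,
    "deepseek-v4-flash": 20,
    "gemma3:12b": 12,
    "rnj-1:8b": 8,
}


def times_lighter(before: str, after: str) -> float:
    """Ratio of parameter counts between the old and the new model."""
    return PARAMS_B[before] / PARAMS_B[after]


for model in ("rnj-1:8b", "gemma3:12b", "deepseek-v4-flash"):
    print(f"{model}: {times_lighter('kimi-k2.6', model):.0f}x lighter")
# rnj-1:8b: 125x lighter
# gemma3:12b: 83x lighter
# deepseek-v4-flash: 50x lighter
```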

The video by AI Garage measured compression cost directly: Claude Opus at 50K context = 13 cents per pass. Kimi K2 for the same task = 1.9 cents. That is an 85% reduction for a single compression pass. Compression fires 10–20 times per day for heavy users. The author estimated that with default settings, compression alone can cost $60/month with Claude Opus. Routed to a cheaper model, it drops to $9/month.
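
The arithmetic behind those figures can be checked with a short sketch. The per-pass prices are the ones quoted from the video. The 15 passes per day (midpoint of the 10–20 range) and the 30-day month are my assumptions, so the monthly totals are ballpark numbers only.

```python
OPUS_PER_PASS = 0.13   # USD, Claude Opus at 50K context (quoted from the video)
KIMI_PER_PASS = 0.019  # USD, Kimi K2 for the same pass (quoted from the video)

PASSES_PER_DAY = 15    # assumption: midpoint of the 10-20 range
DAYS_PER_MONTH = 30    # assumption


def monthly_cost(per_pass: float) -> float:
    """Monthly compression cost under the assumptions above."""
    return per_pass * PASSES_PER_DAY * DAYS_PER_MONTH


reduction = 1 - KIMI_PER_PASS / OPUS_PER_PASS
print(f"reduction per pass: {reduction:.0%}")
print(f"Opus: ${monthly_cost(OPUS_PER_PASS):.2f}/month")
print(f"Kimi: ${monthly_cost(KIMI_PER_PASS):.2f}/month")
```

Under these assumptions the totals come out at roughly $58.50 and $8.55 per month, which lines up with the $60 and $9 estimates from the video.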

I cannot confirm exact dollar savings for Ollama Cloud — they do not expose per-call pricing. But the scale difference is unambiguous.

Current Status

Component	Status	Notes
Heavy auxiliary on k2.5	Working	Compression and web_extract no longer block the main model
Vision on qwen3-vl	Working	Only multimodal option available
Medium tasks on gemma3:12b	Working	Triage and profile classification
Approval on deepseek-v4-flash	Working	Fast binary decisions
Light tasks on rnj-1:8b	Working	Titles, search, skills, MCP
`provider: auto` removed	Done	Explicit provider on every task
Local Ollama auxiliary	Abandoned	VRAM contention on 8 GB laptop
Cost tracking per task	Not possible	Ollama Cloud does not expose per-call pricing

Sessions no longer pause on the "Compressing messages" step. Background tasks no longer monopolize k2.6. Quota exhaustion mid-session is gone.

Full Config

Here is the complete auxiliary: block from my ~/.hermes/config.yaml:

auxiliary:
  compression:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 120
  web_extract:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 360
  kanban_decomposer:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 180
  curator:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 600
  vision:
    provider: ollama-cloud
    model: qwen3-vl:235b-instruct
    timeout: 120
    download_timeout: 30
  triage_specifier:
    provider: ollama-cloud
    model: gemma3:12b
    timeout: 120
  profile_describer:
    provider: ollama-cloud
    model: gemma3:12b
    timeout: 60
  approval:
    provider: ollama-cloud
    model: deepseek-v4-flash
    timeout: 30
  title_generation:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
  session_search:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
    max_concurrency: 3
  skills_hub:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
  mcp:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
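
To check a config like this for tasks that would still land on the main model, a small audit helper is enough. The sketch below is hypothetical and works on an already-parsed dictionary. It assumes that a task with no provider behaves like provider: auto, which I have not verified against the Hermes source.

```python
def find_unrouted(auxiliary: dict, main_model: str) -> list[str]:
    """List tasks that are on auto, have no model, or point at the main model."""
    flagged = []
    for task, settings in auxiliary.items():
        settings = settings or {}  # tolerate an empty task entry
        provider = settings.get("provider", "auto")  # assumption: missing means auto
        model = settings.get("model")
        if provider == "auto" or model is None or model == main_model:
            flagged.append(task)
    return sorted(flagged)


# Hand-written sample; a real config could be loaded with PyYAML's
# yaml.safe_load (third-party package) and its "auxiliary" key passed in.
sample = {
    "compression": {"provider": "ollama-cloud", "model": "kimi-k2.5"},
    "vision": {"provider": "auto"},
    "approval": {"provider": "ollama-cloud", "model": "kimi-k2.6"},
    "title_generation": None,
}
print(find_unrouted(sample, "kimi-k2.6"))
# ['approval', 'title_generation', 'vision']
```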

Try It

If you run Hermes Agent and have never touched the auxiliary block:

hermes config edit

Find auxiliary:. Set an explicit provider and model for every task, choosing the smallest model that handles the task reliably. Save. Run /reset.

Then tomorrow run hermes insights --days 1. Your main model should stop eating the entire token budget.

If you use Claude or another frontier model as your main provider, the default config is even more expensive — every background task inherits that model. Route them to smaller models. Or go local if your hardware handles it.

What auxiliary routing are you using? Drop it in the comments.

LinkedIn — https://www.linkedin.com/in/azamat-safarov-4a37b93a7/

X / Twitter — https://x.com/Azamat__Safarov
