
Azamat Safarov

Posted on • Originally published at dev.to

Hermes Agent Burned 603M Tokens Behind My Back — I Cut Background Costs by Up to 125x

Last Tuesday I noticed my Ollama Cloud Pro quota draining faster than usual. Way faster. I had burned through 603 million tokens in seven days without understanding where they went.

I opened my Hermes Agent logs and found something I did not know existed: an auxiliary: block with twelve background tasks. Compression, web extraction, vision, session search, skills matching — all running silently every time I typed a message. Every task was set to provider: auto. And because I had no API keys for the fallback chain, every one silently fell back to kimi-k2.6, my one-trillion-parameter main model.

I had no idea this was happening. The agent was sending eleven background prompts to the same model I was actively chatting with, through the same quota, without showing me the prompts. Compression alone fired 10–20 times per long session, each pass sending the full conversation history.

This is what I fixed immediately.

The Fix

Here is what I changed in the auxiliary block of my ~/.hermes/config.yaml. The complete YAML is in the Full Config section below.

Apply with /reset or restart Hermes. Config changes only take effect on new sessions.

If you just want the config and don't care about the story, jump to Full Config below, copy it, /reset, done.

How the Routing Works

[Figure: before and after, default provider: auto vs. optimized explicit routing]

Twelve tasks used to collapse into one trillion-parameter model. Now the work is spread across six models, from 8B to 1T: five handle the auxiliary tasks, and k2.6 is reserved for chat.

What provider: auto Actually Does

I searched dozens of Hermes guides online. Not a single one mentions the auxiliary block. The official docs describe the YAML structure, but there's no warning that provider: auto silently falls back to your main model. I only found one video by AI Garage discussing this — nothing else. No blog posts, no Discord threads, no Reddit discussions.

The auto chain is: OpenRouter → Nous Portal → Codex → Gemini Flash. If none of these backends has an API key configured, the task falls back to your main chat model.
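A minimal sketch of that fallback logic, assuming the chain order described above. This is not Hermes source code: `resolve_auto`, `AUTO_CHAIN`, and the backend labels are hypothetical names chosen for illustration.

```python
# Hypothetical sketch of the provider: auto fallback; not Hermes source code.
# The labels below are illustrative shorthand for the backends in the chain.
AUTO_CHAIN = ["openrouter", "portal", "codex", "gemini-flash"]


def resolve_auto(configured_keys, main_model, chain=AUTO_CHAIN):
    """Return the first backend in the chain that has an API key,
    or the main chat model when none of them do."""
    for backend in chain:
        if backend in configured_keys:
            return backend
    return main_model


# No keys configured: the task lands on the main model.
print(resolve_auto(set(), "kimi-k2.6"))
# One key configured: the chain stops at that backend instead.
print(resolve_auto({"codex"}, "kimi-k2.6"))
```

The point of the sketch is the last line of the function: with an empty key set, nothing in the chain matches, so the main model is the only possible answer for every task.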

That is exactly what happened in my setup: with no key configured for any backend in the chain, every auxiliary call landed on k2.6 and drew from the same quota as my chat, without ever showing me the prompt.

My Ollama Cloud Pro Catalog

I have an Ollama Cloud Pro subscription. Here are the models available in my catalog that matter for routing:

| Model | Size | Strength | Best for |
|---|---|---|---|
| kimi-k2.6 | ~1T params, 256K context | Reasoning, architecture, debugging | Main chat only |
| kimi-k2.5 | ~1T params | Same family, optimized for long context | Summarization, compression |
| qwen3-vl:235b-instruct | 235B params | Multimodal (vision + text) | Screenshots, image analysis |
| deepseek-v4-flash | ~20B params | Fast, good at structured output | Safety checks, classification |
| gemma3:12b | 12B params | Lightweight, fast | Triage, profile tasks |
| rnj-1:8b | 8B params | Cheapest in catalog | Titles, search, skills matching |
| gemma4:e2b | 2B params | Smallest | Not used: too weak for any auxiliary task |

I pulled all of these and tested them against the twelve auxiliary tasks. The routing below is the result of that testing.

Twelve Background Tasks

My Hermes version has twelve auxiliary tasks. I ordered them by how much they cost in practice:

| # | Task | What it does | Why it costs money |
|---|---|---|---|
| 1 | compression | Summarizes conversation when context exceeds limits | Fires 10–20 times per long session. Each pass sends the full conversation history. |
| 2 | web_extract | Strips HTML boilerplate after web_search | Fires every time you search the web. |
| 3 | vision | Processes screenshots and images | Multimodal tokens are expensive. |
| 4 | flush_memories | Writes facts to memory files on /new or /exit | Runs at every session end. |
| 5 | kanban_decomposer | Breaks Kanban tasks into steps | Medium complexity, runs on board operations. |
| 6 | curator | Analyzes skill quality and redundancy | Heavy analysis task. |
| 7 | session_search | Searches past sessions and summarizes matches | Runs when you look up old conversations. |
| 8 | skills_hub | Matches your query to installed skills | Runs on most questions. |
| 9 | triage_specifier | Classifies incoming messages | Binary classification. |
| 10 | approval | Binary safety check before terminal commands | Simple yes/no on safety. |
| 11 | profile_describer | Generates profile bio | Rare, lightweight. |
| 12 | title_generation | Auto-names new sessions | Trivial, runs constantly. |

Note: older Hermes versions (like the one in the AI Garage video below) show eight tasks. My version has twelve — four were added in recent updates.

How I Mapped Models to Tasks

The logic is dead simple: put the lightest model that doesn't break on each task, and keep k2.6 for actual conversations.

| Task | Model | Why this one |
|---|---|---|
| Main chat | kimi-k2.6 | Architecture, debugging, discussion. The only task that actually needs a trillion parameters. |
| Compression, web_extract, kanban, curator | kimi-k2.5 | Same Kimi family, optimized for long context. Summarization quality stays high. Using k2.6 for compression was burning quota for no quality gain. |
| Vision | qwen3-vl:235b-instruct | Only multimodal model in the catalog at this level. No alternative exists for image analysis. |
| Triage, profile | gemma3:12b | 12B params vs 1T. Classification and bio generation do not need reasoning depth. |
| Approval | deepseek-v4-flash (~20B) | Binary safety check. Fast response time matters more than reasoning quality. |
| Titles, search, skills, MCP | rnj-1:8b | 8B params. 125 times lighter than k2.6. The bulk of the savings are here: these tasks run constantly but need minimal intelligence. |
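The mapping above can be expanded mechanically into per-task entries. The sketch below is illustrative only: `ROUTING`, `build_auxiliary`, and `to_yaml` are hypothetical names, the task keys and the `ollama-cloud` provider are taken from the config later in this post, and timeouts are left out.

```python
# Model groups from the mapping table; task keys follow the Full Config section.
ROUTING = {
    "kimi-k2.5": ["compression", "web_extract", "kanban_decomposer", "curator"],
    "qwen3-vl:235b-instruct": ["vision"],
    "gemma3:12b": ["triage_specifier", "profile_describer"],
    "deepseek-v4-flash": ["approval"],
    "rnj-1:8b": ["title_generation", "session_search", "skills_hub", "mcp"],
}


def build_auxiliary(routing, provider="ollama-cloud"):
    """Invert model -> tasks into task -> {provider, model}."""
    aux = {}
    for model, tasks in routing.items():
        for task in tasks:
            aux[task] = {"provider": provider, "model": model}
    return aux


def to_yaml(aux):
    """Render the mapping as an auxiliary: block (timeouts omitted)."""
    lines = ["auxiliary:"]
    for task, entry in aux.items():
        lines.append(f"  {task}:")
        lines.append(f"    provider: {entry['provider']}")
        lines.append(f"    model: {entry['model']}")
    return "\n".join(lines)


print(to_yaml(build_auxiliary(ROUTING)))
```

Grouping by model first keeps the decision visible in one place: moving a task to a lighter or heavier tier is a one-line change before the block is regenerated.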

What I Tried First: Local Models

Before settling on cloud routing, I tried running auxiliary tasks locally. I already had gemma4:e2b pulled via Ollama on my machine.

RTX 5070 Ti, 8 GB VRAM. One 6B-parameter model fits. Two is already borderline. Every time Hermes switched from compression to approval, Ollama unloaded one model and loaded another. Five to ten seconds of dead air. The GPU fan kicked in. I lost more time waiting for model swaps than I saved on tokens. I abandoned local auxiliary models the same day.

The Numbers

Here is what the routing actually means in terms of model size:

| Task | Before (default) | After (routed) | Reduction |
|---|---|---|---|
| Titles, search, skills, MCP | kimi-k2.6 (~1T) | rnj-1:8b (8B) | 125x lighter |
| Triage, profile | kimi-k2.6 (~1T) | gemma3:12b (12B) | 83x lighter |
| Approval | kimi-k2.6 (~1T) | deepseek-v4-flash (~20B) | 50x lighter |
| Compression, web_extract | kimi-k2.6 (~1T) | kimi-k2.5 (~1T) | Same family, frees k2.6 for chat |
| Vision | kimi-k2.6 (~1T) | qwen3-vl (235B) | Dedicated multimodal model |
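The "lighter" figures are plain parameter-count ratios, not measured prices. A quick check, using the approximate sizes from the catalog (`PARAMS_B` and `size_ratio` are names made up for this sketch):

```python
# Approximate parameter counts in billions, as listed in the catalog.
PARAMS_B = {
    "kimi-k2.6": 1000,
    "deepseek-v4-flash": 20,
    "gemma3:12b": 12,
    "rnj-1:8b": 8,
}


def size_ratio(before, after, params=PARAMS_B):
    """How many times fewer parameters the routed model has."""
    return params[before] / params[after]


for model in ("rnj-1:8b", "gemma3:12b", "deepseek-v4-flash"):
    print(f"{model}: {size_ratio('kimi-k2.6', model):.0f}x lighter")
```

This prints 125x, 83x, and 50x. How closely token cost tracks parameter count depends on the provider's pricing, so treat the ratios as an indication of scale.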

The video by AI Garage measured compression cost directly: Claude Opus at 50K context = 13 cents per pass. Kimi K2 for the same task = 1.9 cents. That is an 85% reduction for a single compression pass. Compression fires 10–20 times per day for heavy users. The author estimated that with default settings, compression alone can cost $60/month with Claude Opus. Routed to a cheaper model, it drops to $9/month.
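The monthly estimates follow from the per-pass figures. Here is the arithmetic as a short sketch, assuming 15 passes per day (the midpoint of the 10–20 range) and a 30-day month; both are my assumptions, and the function and constant names are hypothetical.

```python
def monthly_cost(cost_per_pass, passes_per_day, days=30):
    """Monthly spend on compression alone, in dollars."""
    return cost_per_pass * passes_per_day * days


OPUS_PER_PASS = 0.13    # dollars at 50K context (figure quoted from the video)
KIMI_PER_PASS = 0.019   # dollars for the same task (figure quoted from the video)
PASSES_PER_DAY = 15     # assumption: midpoint of the 10-20 range

opus = monthly_cost(OPUS_PER_PASS, PASSES_PER_DAY)
kimi = monthly_cost(KIMI_PER_PASS, PASSES_PER_DAY)
reduction = 1 - KIMI_PER_PASS / OPUS_PER_PASS

print(f"Opus: ${opus:.2f}/month")
print(f"Kimi: ${kimi:.2f}/month")
print(f"Reduction: {reduction:.0%}")
```

At 15 passes a day this comes out to about $58.50 and $8.55 per month, in line with the rounded $60 and $9 estimates, and the per-pass reduction rounds to 85%.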

I cannot confirm exact dollar savings for Ollama Cloud — they do not expose per-call pricing. But the scale difference is unambiguous.

Current Status

| Component | Status | Notes |
|---|---|---|
| Heavy auxiliary on k2.5 | Working | Compression and web_extract no longer block the main model |
| Vision on qwen3-vl | Working | Only multimodal option available |
| Medium tasks on gemma3:12b | Working | Triage and profile classification |
| Approval on deepseek-v4-flash | Working | Fast binary decisions |
| Light tasks on rnj-1:8b | Working | Titles, search, skills, MCP |
| provider: auto removed | Done | Explicit provider on every task |
| Local Ollama auxiliary | Abandoned | VRAM contention on 8 GB laptop |
| Cost tracking per task | Not possible | Ollama Cloud does not expose per-call pricing |

Sessions no longer pause on "Compressing messages". Background tasks stopped monopolizing k2.6, and I no longer run out of quota mid-session.

Full Config

Here is the complete auxiliary: block from my ~/.hermes/config.yaml:

auxiliary:
  compression:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 120
  web_extract:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 360
  kanban_decomposer:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 180
  curator:
    provider: ollama-cloud
    model: kimi-k2.5
    timeout: 600
  vision:
    provider: ollama-cloud
    model: qwen3-vl:235b-instruct
    timeout: 120
    download_timeout: 30
  triage_specifier:
    provider: ollama-cloud
    model: gemma3:12b
    timeout: 120
  profile_describer:
    provider: ollama-cloud
    model: gemma3:12b
    timeout: 60
  approval:
    provider: ollama-cloud
    model: deepseek-v4-flash
    timeout: 30
  title_generation:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
  session_search:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
    max_concurrency: 3
  skills_hub:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30
  mcp:
    provider: ollama-cloud
    model: rnj-1:8b
    timeout: 30

Try It

If you run Hermes Agent and have never touched the auxiliary block:

hermes config edit

Find auxiliary:. Set an explicit provider and model for every task, choosing the smallest model that handles the task reliably. Save, then run /reset.
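To double-check the result, a small audit can flag any task that would still end up on the main model. This is a sketch that works on a config already loaded into a Python dict; `find_unrouted` is a hypothetical helper, the example values are invented, and treating a missing provider as auto is my assumption.

```python
def find_unrouted(auxiliary, main_model):
    """Return task names that still use provider: auto, lack a model,
    or point back at the main chat model."""
    flagged = []
    for task, entry in auxiliary.items():
        provider = entry.get("provider", "auto")  # assumption: missing means auto
        model = entry.get("model")
        if provider == "auto" or not model or model == main_model:
            flagged.append(task)
    return sorted(flagged)


# Example input with hypothetical values, not my real config.
aux = {
    "compression": {"provider": "ollama-cloud", "model": "kimi-k2.5"},
    "approval": {"provider": "auto"},
    "title_generation": {"provider": "ollama-cloud", "model": "kimi-k2.6"},
}
print(find_unrouted(aux, main_model="kimi-k2.6"))
```

An empty list means every task has an explicit provider and a model other than the one you chat with.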

Then tomorrow run hermes insights --days 1. Your main model should stop eating the entire token budget.

If you use Claude or another frontier model as your main provider, the default config is even more expensive — every background task inherits that model. Route them to smaller models. Or go local if your hardware handles it.

What auxiliary routing are you using? Drop it in the comments.


LinkedIn: https://www.linkedin.com/in/azamat-safarov-4a37b93a7/

X / Twitter: https://x.com/Azamat__Safarov
