Hermes Agent from NousResearch is one of the more interesting open agent frameworks out there. It self-improves, builds skills over time, searches its own memory, and runs in loops. It also burns through API credits like nothing else.
Most of that spend is waste. Here's how to cut it by 60% without changing a single line of Hermes code.
The Problem With Always-On Agents
Hermes runs constantly. Every loop iteration involves multiple LLM calls: checking memory, reading skills, planning next steps, executing tools, summarizing results. That's the nature of agentic systems.
But here's the thing. A memory lookup ("have I seen this before?") doesn't need Claude Sonnet. A skill read ("what does this tool do?") doesn't need it either. A simple formatting step or status check definitely doesn't.
Right now, if you point Hermes at Claude's API, every one of those calls hits the same model at the same price. A two-token memory check is billed at the same premium per-token rate as a complex multi-step reasoning task.
NousResearch opened issue #547 about adding direct Anthropic API support to cut costs vs OpenRouter. That's a good start. But direct API access only removes the middleman markup. It doesn't address the fundamental problem: you're still sending simple prompts to expensive models.
NadirClaw: A Smarter Layer
NadirClaw is an open-source LLM proxy that classifies prompt complexity and routes accordingly. Simple calls go to cheap models (Gemini Flash, Claude Haiku). Complex reasoning stays on Claude Sonnet or Opus. It's OpenAI-compatible, so any tool that can point at an API endpoint can use it.
Including Hermes. With zero code changes.
Setup: 5 Minutes
Install NadirClaw:
```shell
pip install nadirclaw
```
Run the setup wizard. Pick Claude Sonnet as your complex model and Gemini Flash (or Haiku) as your simple model:
```shell
nadirclaw setup
```
The wizard walks you through it. You'll set your API keys and model preferences. Takes about a minute.
Start the proxy:
```shell
nadirclaw serve
```
This runs on localhost:8000 by default. It's an OpenAI-compatible endpoint.
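Before wiring Hermes to the proxy, it's worth confirming something is actually listening on that port. A minimal check in plain Python — no NadirClaw-specific endpoints assumed, just a TCP connect:

```python
import socket

def proxy_is_up(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns `False`, start `nadirclaw serve` first.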
Now configure Hermes to use it. In your Hermes setup, change the provider endpoint:
```
# Instead of pointing at api.anthropic.com or openrouter.ai
# Point at your local NadirClaw instance
Provider endpoint: http://localhost:8000
```
Set your ANTHROPIC_API_KEY environment variable as usual. NadirClaw passes it through to the upstream providers.
That's it. Hermes doesn't know anything changed. It sends requests to what looks like a normal OpenAI-compatible API. NadirClaw intercepts each one, classifies it, and routes it to the right model.
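Because the proxy speaks the OpenAI chat-completions format, you can verify routing with any HTTP client. Here's a sketch using only the standard library — the endpoint path and response shape assume standard OpenAI compatibility, and the model name is a placeholder, not NadirClaw's actual naming:

```python
import json
import urllib.request

NADIRCLAW_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "claude-sonnet") -> dict:
    """Standard OpenAI-style chat payload. NadirClaw is free to route
    the call to a different model than the one requested."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def send(payload: dict) -> dict:
    """POST the payload to the proxy and return the parsed response."""
    req = urllib.request.Request(
        NADIRCLAW_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the proxy running, compare where a trivial vs. a hard prompt lands
# by inspecting the "model" field of each response:
# send(build_request("List the files mentioned above."))["model"]
# send(build_request("Plan a multi-step refactor of this module."))["model"]
```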
What Gets Routed Where
In a typical Hermes session, here's roughly what happens:
Routed to cheap models (Flash/Haiku):
- Memory searches and lookups
- Skill file reads
- Simple tool calls (file operations, basic queries)
- Status checks and formatting
- Short summarization of tool outputs
Stays on Claude Sonnet/Opus:
- Multi-step reasoning and planning
- Complex code generation
- Skill creation (the self-improvement part)
- Nuanced decision-making with multiple factors
- Long-context analysis
The classifier makes these decisions automatically. You don't write rules or configure thresholds. It looks at prompt length, complexity signals, and content type.
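To make that concrete, here's a toy heuristic in the same spirit — emphatically not NadirClaw's actual classifier, just an illustration of how length and keyword signals can separate housekeeping from reasoning:

```python
def classify(prompt: str) -> str:
    """Toy complexity heuristic: long prompts or reasoning keywords go to
    the premium model, everything else to the cheap one. Illustrative
    only -- NadirClaw's real classifier uses richer signals."""
    reasoning_markers = ("plan", "design", "refactor", "analyze", "prove")
    if len(prompt) > 2000:
        return "premium"
    if any(marker in prompt.lower() for marker in reasoning_markers):
        return "premium"
    return "cheap"
```

A memory lookup like "have I seen this before?" routes cheap; "plan a multi-step refactor" routes premium.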
The Numbers
On agent workloads specifically, the savings land at the high end of what routing can deliver. Why? Because agents make lots of calls, and the majority of those calls are simple operations. A typical Hermes session might make 50 LLM calls in a loop. Maybe 8-12 of those actually need premium reasoning. The rest are housekeeping.
With NadirClaw routing, those 38-42 simple calls hit a model that costs a fraction of Sonnet. The complex calls still get Sonnet-quality responses. Net result: roughly 60% cost reduction on the full session.
Your mileage will vary with the workload: heavy reasoning tasks save less, while agents with lots of memory operations and tool use save more.
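The arithmetic behind that estimate is easy to sanity-check. The per-call prices below are made-up illustrative averages for the sketch, not real list prices:

```python
def session_cost(total_calls: int, premium_calls: int,
                 premium_price: float, cheap_price: float) -> float:
    """Cost of a session where only premium_calls hit the expensive model."""
    cheap_calls = total_calls - premium_calls
    return premium_calls * premium_price + cheap_calls * cheap_price

# 50 calls, 12 of which genuinely need premium reasoning.
# Assumed averages: $0.030 per premium call, $0.005 per cheap call.
baseline = session_cost(50, 50, 0.030, 0.005)  # everything on the premium model
routed = session_cost(50, 12, 0.030, 0.005)    # with routing
savings = 1 - routed / baseline                # roughly 60% in this sketch
```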
The Dashboard
NadirClaw comes with a built-in dashboard. Open it in your browser and you can see every routing decision in real time. Which Hermes calls went to Flash, which went to Sonnet, why each decision was made, and the running cost comparison.
This is useful for two things. First, verifying that the routing makes sense for your workload. Second, showing your team (or yourself) exactly where the savings come from. It's one thing to say "we cut costs 60%." It's another to show the breakdown.
Why Not Just Use a Cheaper Model for Everything?
You could point Hermes at Gemini Flash for all calls. It's cheap. But agent quality degrades fast when the planning and reasoning steps use a weaker model. Hermes's self-improvement loop especially needs strong reasoning to create useful skills.
The whole point of routing is that you don't compromise on the calls that matter. You just stop overpaying for the calls that don't.
Compared to Direct Anthropic API (Issue #547)
The work in issue #547 removes OpenRouter's markup by going direct to Anthropic. Good move. NadirClaw complements this. You still get direct API access (NadirClaw passes your API key through). You also get automatic routing on top of it.
They're not competing approaches. Use both. Direct API access removes the middleman. NadirClaw removes the waste.
Get Started
```shell
pip install nadirclaw
nadirclaw setup
nadirclaw serve
```
Point Hermes at localhost:8000. Watch the dashboard. See where your money actually goes.
Open source, no account required.
Top comments (1)
Smart approach. One thing worth noting: prompt structure also affects classification accuracy. A vague memory lookup that happens to be long and ambiguous can look "complex" to the classifier and get routed to Sonnet unnecessarily. A cleanly structured prompt (explicit role, bounded input, defined output format) is easier to classify correctly and cheaper to execute on the lighter model.
Routing + structure are complementary levers — routing decides which model, structure helps the model (and classifier) do less work. flompt.dev / github.com/Nyrok/flompt
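To illustrate the commenter's point, here's what the same memory lookup might look like in both forms — hypothetical prompts, not Hermes's actual templates:

```python
# Vague and unbounded: long, ambiguous, easy for a classifier to over-route.
vague_lookup = (
    "So I was wondering if you could look back through everything we've "
    "stored so far and let me know whether anything resembling this "
    "situation has come up before, and if so what happened with it..."
)

# Structured: explicit role, bounded input, defined output format.
structured_lookup = (
    "Role: memory lookup.\n"
    "Query: deploy failed on staging\n"
    "Answer with exactly one word: YES or NO."
)
```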