DEV Community

Dor Amir


How to Run Hermes Agent at 60% Lower Cost (Without Touching Its Code)

Hermes Agent from NousResearch is one of the more interesting open agent frameworks out there. It self-improves, builds skills over time, searches its own memory, and runs in loops. It also burns through API credits like nothing else.

Most of that spend is waste. Here's how to cut it by 60% without changing a single line of Hermes code.

The Problem With Always-On Agents

Hermes runs constantly. Every loop iteration involves multiple LLM calls: checking memory, reading skills, planning next steps, executing tools, summarizing results. That's the nature of agentic systems.

But here's the thing. A memory lookup ("have I seen this before?") doesn't need Claude Sonnet. A skill read ("what does this tool do?") doesn't need it either. A simple formatting step or status check definitely doesn't.

Right now, if you point Hermes at Claude's API, every single one of those calls hits the same model at the same price. The 2-token memory check costs the same per-token as the complex multi-step reasoning task.

NousResearch opened issue #547 about adding direct Anthropic API support to cut costs compared with going through OpenRouter. That's a good start, but direct API access only removes the middleman markup. It doesn't address the fundamental problem: you're still sending simple prompts to expensive models.

NadirClaw: A Smarter Layer

NadirClaw is an open-source LLM proxy that classifies prompt complexity and routes accordingly. Simple calls go to cheap models (Gemini Flash, Claude Haiku). Complex reasoning stays on Claude Sonnet or Opus. It's OpenAI-compatible, so any tool that can point at an API endpoint can use it.

Including Hermes. With zero code changes.

Setup: 5 Minutes

Install NadirClaw:

```shell
pip install nadirclaw
```

Run the setup wizard. Pick Claude Sonnet as your complex model and Gemini Flash (or Haiku) as your simple model:

```shell
nadirclaw setup
```

The wizard walks you through it. You'll set your API keys and model preferences. Takes about a minute.

Start the proxy:

```shell
nadirclaw serve
```

The proxy runs on localhost:8000 by default and exposes an OpenAI-compatible endpoint.

Now configure Hermes to use it. In your Hermes setup, change the provider endpoint:

```text
# Instead of pointing at api.anthropic.com or openrouter.ai,
# point at your local NadirClaw instance:
Provider endpoint: http://localhost:8000
```

Set your ANTHROPIC_API_KEY environment variable as usual. NadirClaw passes it through to the upstream providers.

That's it. Hermes doesn't know anything changed. It sends requests to what looks like a normal OpenAI-compatible API. NadirClaw intercepts each one, classifies it, and routes it to the right model.
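To make the "looks like a normal OpenAI-compatible API" part concrete, here's a sketch of the request any client would send to the proxy. The `/v1/chat/completions` path and the `"auto"` model name are assumptions based on the OpenAI-compatible convention, not confirmed NadirClaw specifics; actually sending the request requires `nadirclaw serve` to be running, so this only builds and inspects the payload.

```python
import json

# Hypothetical endpoint path, following the OpenAI-compatible convention.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str) -> dict:
    # The client never names a concrete model tier; the proxy's
    # classifier decides which upstream model handles the call.
    return {
        "model": "auto",  # placeholder; the router picks the real model
        "messages": [{"role": "user", "content": prompt}],
    }

# Example: the kind of cheap housekeeping call an agent loop makes.
body = json.dumps(build_request("Have I seen this task before?"))
```

The key design point: the caller's code is identical whether it talks to Anthropic, OpenRouter, or the proxy. Only the base URL changes.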

What Gets Routed Where

In a typical Hermes session, here's roughly what happens:

Routed to cheap models (Flash/Haiku):

  • Memory searches and lookups
  • Skill file reads
  • Simple tool calls (file operations, basic queries)
  • Status checks and formatting
  • Short summarization of tool outputs

Stays on Claude Sonnet/Opus:

  • Multi-step reasoning and planning
  • Complex code generation
  • Skill creation (the self-improvement part)
  • Nuanced decision-making with multiple factors
  • Long-context analysis

The classifier makes these decisions automatically. You don't write rules or configure thresholds. It looks at prompt length, complexity signals, and content type.
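NadirClaw's actual classifier is opaque to the caller, but the idea behind complexity routing can be illustrated with a toy heuristic. This is a sketch of the concept only, not NadirClaw's real logic, and the model names are placeholders:

```python
# Toy complexity router: an illustration of routing on length and
# content signals, NOT NadirClaw's actual classifier.
CHEAP_MODEL = "gemini-flash"     # placeholder name
PREMIUM_MODEL = "claude-sonnet"  # placeholder name

# Words that hint the prompt needs multi-step reasoning.
REASONING_HINTS = ("plan", "design", "refactor", "debug", "prove")

def route(prompt: str) -> str:
    text = prompt.lower()
    if len(prompt) > 2000:  # long context leans complex
        return PREMIUM_MODEL
    if any(hint in text for hint in REASONING_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL      # default: short lookups stay cheap
```

Under this heuristic, a memory lookup like "have I seen this file before?" routes cheap, while "plan a refactor of the auth module" routes premium. A production classifier weighs many more signals, but the division of labor is the same.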

The Numbers

On agent workloads specifically, the savings tend to be on the higher end. Why? Because agents make lots of calls, and the majority of those calls are simple operations. A typical Hermes session might make 50 LLM calls in a loop. Maybe 8-12 of those actually need premium reasoning. The rest are housekeeping.

With NadirClaw routing, those 38-42 simple calls hit a model that costs a fraction of Sonnet. The complex calls still get Sonnet-quality responses. Net result: roughly 60% cost reduction on the full session.

Your mileage will vary with your workload. Heavy reasoning tasks save less; agents with lots of memory operations and tool use save more.
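The arithmetic behind the headline number is easy to sanity-check. Using the call mix described above (about 50 calls per session, roughly 10 needing premium reasoning) and an assumed per-call price ratio (the 0.08 figure is an illustrative placeholder, not a quoted price):

```python
# Back-of-envelope check of the savings claim. Costs are relative
# per-call figures, not real prices; real savings also depend on how
# many tokens each call uses.
PREMIUM_COST = 1.00  # premium model, per call (baseline)
CHEAP_COST = 0.08    # cheap model, per call (illustrative ratio)

def session_cost(total_calls: int, complex_calls: int, routed: bool) -> float:
    simple_calls = total_calls - complex_calls
    if not routed:
        # No proxy: every call hits the premium model.
        return total_calls * PREMIUM_COST
    # With routing: only complex calls pay the premium rate.
    return complex_calls * PREMIUM_COST + simple_calls * CHEAP_COST

before = session_cost(50, 10, routed=False)  # 50.0
after = session_cost(50, 10, routed=True)    # 10 + 40 * 0.08 = 13.2
savings = 1 - after / before                 # ≈ 0.74
```

With these placeholder numbers the per-call savings come out near 74%; a real-world figure around 60% is consistent with complex calls also consuming more tokens per call than the simple ones.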

The Dashboard

NadirClaw comes with a built-in dashboard. Open it in your browser and you can see every routing decision in real time. Which Hermes calls went to Flash, which went to Sonnet, why each decision was made, and the running cost comparison.

This is useful for two things. First, verifying that the routing makes sense for your workload. Second, showing your team (or yourself) exactly where the savings come from. It's one thing to say "we cut costs 60%." It's another to show the breakdown.

Why Not Just Use a Cheaper Model for Everything?

You could point Hermes at Gemini Flash for all calls. It's cheap. But agent quality degrades fast when the planning and reasoning steps use a weaker model. Hermes's self-improvement loop especially needs strong reasoning to create useful skills.

The whole point of routing is that you don't compromise on the calls that matter. You just stop overpaying for the calls that don't.

Compared to Direct Anthropic API (Issue #547)

The work in issue #547 removes OpenRouter's markup by going direct to Anthropic. Good move. NadirClaw complements this. You still get direct API access (NadirClaw passes your API key through). You also get automatic routing on top of it.

They're not competing approaches. Use both. Direct API access removes the middleman. NadirClaw removes the waste.

Get Started

```shell
pip install nadirclaw
nadirclaw setup
nadirclaw serve
```

Point Hermes at localhost:8000. Watch the dashboard. See where your money actually goes.

Open source, no account required.

getnadir.com | GitHub

Top comments (1)

Hamza KONTE

Smart approach. One thing worth noting: prompt structure also affects classification accuracy. A vague memory lookup that happens to be long and ambiguous can look "complex" to the classifier and get routed to Sonnet unnecessarily. A cleanly structured prompt (explicit role, bounded input, defined output format) is easier to classify correctly and cheaper to execute on the lighter model.

Routing + structure are complementary levers — routing decides which model, structure helps the model (and classifier) do less work. flompt.dev / github.com/Nyrok/flompt