DEV Community

Vishal VeeraReddy
Vishal VeeraReddy

Posted on

Run Hermes Agent on Any Model — Free, Local, and Cost-Routed

If you've spent any time wrestling with AI coding tools and agents in 2026, you've hit two walls:

  1. Provider lock-in. Claude Code expects Anthropic. Codex expects OpenAI. Your shiny new agent framework wants whatever its README assumes.
  2. Agent amnesia. Every session starts from zero. Your "AI assistant" doesn't actually learn anything about you, your codebase, or the work you did yesterday.

Two open-source projects address those problems head-on — and they pair beautifully together.

  • Hermes Agent (by Nous Research) — a self-improving AI agent with a built-in learning loop, multi-platform presence, and a serious tool ecosystem.
  • Lynkr — a self-hosted universal LLM proxy that lets any AI tool talk to any model provider.

This post explains what each one is, why they exist, and shows you the exact steps to run Hermes through Lynkr so you can route Hermes to Databricks, Bedrock, Ollama, llama.cpp, Azure, OpenRouter — or all of them with automatic cost-tier routing.


What Is Hermes Agent?

Hermes is an open-source AI agent (MIT-licensed, built by Nous Research) that you actually live inside, not just call.

What makes it different from "yet another agent":

  • A closed learning loop. Hermes curates its own memory, autonomously creates skills (procedural memory) after complex tasks succeed, improves them during use, and searches its own past conversations via SQLite FTS5. It's the only agent I've seen that gets meaningfully better the longer you use it.
  • Lives where you do. A single gateway process plugs into Telegram, Discord, Slack, WhatsApp, Signal, Email, and a real terminal TUI. Send a voice memo from your phone, get a transcribed answer back, continue the same thread from your laptop later.
  • Runs anywhere. Seven terminal backends — local, Docker, SSH, Singularity, Modal, Daytona, Vercel Sandbox. Run it on a $5 VPS or a GPU cluster. Modal/Daytona give you serverless persistence — hibernates when idle, wakes on demand.
  • Built-in cron. "Every weekday at 8am, summarize my GitHub notifications and send to Telegram." That's a one-line cron job in natural language.
  • Delegates and parallelizes. Spawns isolated subagents for parallel workstreams; results come back without flooding your context.
  • Provider-agnostic by design. OpenRouter, Nous Portal, NovitaAI, NVIDIA NIM, Xiaomi MiMo, z.ai/GLM, Kimi/Moonshot, MiniMax, Hugging Face, OpenAI, or your own endpoint. Switch with hermes model — no code changes.

Architecture in one paragraph

The core is AIAgent in run_agent.py — a synchronous tool-calling loop over OpenAI-format messages. model_tools.py orchestrates ~40 built-in tools auto-discovered from tools/. The CLI (cli.py, ~11k LOC) handles slash commands, prompt_toolkit input, Rich rendering, and a data-driven skin engine. Provider profiles live under plugins/model-providers/<name>/ and contribute base_url, env_vars, api_mode, and fallback_models — the runtime resolver merges those with custom_providers from config.yaml to figure out where to send each request. That last detail is what makes Lynkr integration trivial.

Install Hermes in one line

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
Enter fullscreen mode Exit fullscreen mode

Then hermes to start chatting.


What Is Lynkr?

Lynkr is a self-hosted Node.js proxy that sits between any AI coding tool and any LLM provider. One environment variable change, and your tool works with whatever backend you want.

Claude Code / Cursor / Codex / Cline / Continue / Hermes / Vercel AI SDK
                                |
                              Lynkr  (http://localhost:8081)
                                |
   Ollama | Bedrock | Databricks | OpenRouter | Azure | OpenAI | llama.cpp | LM Studio | z.ai | Vertex | Moonshot
Enter fullscreen mode Exit fullscreen mode

What's actually inside

I went through the source. Lynkr is more than a "translate request, forward, translate response" proxy:

  • Format conversion. Anthropic ↔ OpenAI ↔ Codex Responses API ↔ Databricks ↔ Bedrock — handled in src/clients/ (openai-format.js, responses-format.js, databricks.js, bedrock-utils.js, etc.).
  • Tier-based routing. src/routing/ analyzes prompt complexity, agentic intent, risk, and latency, then routes to a TIER_SIMPLE / TIER_STANDARD / TIER_COMPLEX model. Cheap stuff goes to Ollama; gnarly stuff goes to a frontier cloud model. This is where the headline "60–80% cost savings" comes from.
  • Resilience. Circuit breaker (cockatiel), retries, DNS logging, prompt cache injection.
  • MCP integration + Code Mode. Auto-discovers MCP servers and can collapse 100+ MCP tool definitions into 4 meta-tools (~96% token reduction).
  • Observability built in. Telemetry, latency tracking, usage reporting (lynkr usage shows AI spend and tier savings), trajectory export as JSONL for training (lynkr trajectory).
  • 699 passing tests. Routing, format conversion, streaming, error resilience, memory store, prompt cache — it's seriously tested for a side-project proxy.

Install Lynkr in one line

curl -fsSL https://raw.githubusercontent.com/Fast-Editor/Lynkr/main/install.sh | bash
Enter fullscreen mode Exit fullscreen mode

Or via npm: npm install -g pino-pretty && npm install -g lynkr.


Why Use Them Together?

Hermes already supports a long list of providers natively. Why bolt Lynkr in front?

Three concrete reasons:

1. Unify your enterprise creds

Your company has a Databricks endpoint serving Claude, an AWS Bedrock account with cross-region inference profiles, an Azure OpenAI deployment, and a private Ollama box. With Lynkr, all of those live behind one OpenAI-compatible URL. Hermes points at that URL and stops caring which backend is serving the request.

2. Automatic cost-tier routing

This is the killer feature. Hermes can switch models with /model, but Lynkr will switch per request based on complexity. Simple tool calls and short prompts go to free local Ollama. Heavy reasoning goes to your premium cloud model. You don't think about it — Lynkr's complexity-analyzer.js and risk-analyzer.js decide.

Run lynkr usage afterward to see the actual savings.

3. Centralized observability for every agent + tool

If you run Hermes + Claude Code + Cursor + Codex all on the same machine — and a lot of us do — Lynkr becomes a single chokepoint for spend, telemetry, prompt caching, and trajectory capture across all of them. You get one usage report instead of four dashboards.


How to Use Lynkr With Hermes

The integration is genuinely 3 minutes of work because both tools speak OpenAI-compatible HTTP.

Step 1: Start Lynkr with a backend

Pick whatever provider you want Lynkr to route to. For a local-first setup:

# .env in your Lynkr directory (or just exports)
export MODEL_PROVIDER=ollama
export OLLAMA_MODEL=qwen2.5-coder:latest
export OLLAMA_ENDPOINT=http://localhost:11434

lynkr start
Enter fullscreen mode Exit fullscreen mode

Or for tier routing across providers:

export TIER_SIMPLE=ollama:qwen2.5-coder:latest
export TIER_STANDARD=openrouter:anthropic/claude-3.5-haiku
export TIER_COMPLEX=bedrock:anthropic.claude-3-5-sonnet-20241022-v2:0
export OPENROUTER_API_KEY=sk-or-...
export AWS_BEDROCK_API_KEY=...
lynkr start
Enter fullscreen mode Exit fullscreen mode

Lynkr now listens on http://localhost:8081 (OpenAI-compatible) and http://localhost:8081/v1/messages (Anthropic-compatible).

Step 2: Register Lynkr as a custom provider in Hermes

Hermes resolves providers through plugins/model-providers/<name>/ profiles plus a custom_providers list in your ~/.hermes/config.yaml. Add an entry:

custom_providers:
  - name: lynkr
    base_url: http://localhost:8081/v1
    api_mode: chat_completions
    env_var: LYNKR_API_KEY      # any string works — Lynkr doesn't validate
    models:
      - auto                    # Lynkr's tier router picks the actual model
      - qwen2.5-coder:latest
      - anthropic/claude-3.5-sonnet
Enter fullscreen mode Exit fullscreen mode

Then set the key (any value):

hermes config set env.LYNKR_API_KEY sk-lynkr
Enter fullscreen mode Exit fullscreen mode

Step 3: Point Hermes at Lynkr

hermes model custom:lynkr/auto
Enter fullscreen mode Exit fullscreen mode

Or interactively: run hermes model, pick custom:lynkr, choose auto.

That's it. Every Hermes turn now flows through Lynkr, which routes to the right backend based on tier and complexity. Run a few turns, then:

lynkr usage
Enter fullscreen mode Exit fullscreen mode

…and you'll see the per-tier spend breakdown and dollars saved versus a single-frontier-model baseline.

Bonus: voice memo → Hermes → Lynkr → cheapest model

Because Hermes already has Telegram and voice memo transcription wired in, this whole stack means:

Record a voice memo on your phone → Hermes transcribes it → routes the request through Lynkr → Lynkr picks Ollama for the "what time is it in Tokyo" parts and Sonnet for the "refactor this function" parts → reply comes back to your phone.

You built that in 5 minutes with two npm/bash installers and a YAML edit.


When NOT to Use Lynkr With Hermes

Being honest:

  • You only use one provider. Hermes already supports it natively. Adding Lynkr is extra latency and another process to babysit.
  • You need streaming reasoning tokens from a specific model. Make sure Lynkr's format converter for that provider preserves what you need — it does for most cases, but verify before betting on it.
  • You're on a constrained environment. Lynkr is Node 20+. Hermes is Python 3.11. That's two runtimes on a Raspberry Pi.

For everything else — multi-provider workflows, enterprise creds, cost optimization, observability — the combination is hard to beat.


TL;DR

Need Tool
A real AI agent that learns, remembers, and lives across Telegram/Discord/CLI Hermes
Route any AI tool to any LLM provider with automatic cost tiers Lynkr
Both Point Hermes at Lynkr via custom_providers in config.yaml

Links

If you build something with this combo, drop a comment — I'd love to see what stacks people are putting together.

Top comments (0)