If you clicked this, you probably already like Hermes. So do I. I have had it running on my laptop for a while, the Hermes Agent is all over my feed lately, and the open models are good enough now that building your own agent on them is genuinely fun. Somewhere around the third tool I bolted on, my question quietly changed. It stopped being "can this thing do the task" and became "if I turned this into a paid product, what would it cost me, and how would I charge for it?"
I have put a price on software before, and an API is the easy case: meter the calls, pick a number, done. An agent is harder, because it does two different things and each of them costs real money. It thinks, which is tokens. And it acts, which is the search it fires, the page it pulls, the report it writes. When I actually sat and watched mine run, the tool calls were doing as much work as the model. Pricing only the tokens would have billed half of what the agent really does.
So I stopped theorizing and tried it. I took my small Hermes research agent, gave it a few genuine tools, and wired up billing for both sides: every token and every tool call, as their own line items, ending in a real invoice. No pretend company, no pretend customers. Just an honest end-to-end run to find out what turning Hermes into a paid agent actually takes.
The billing runs on Kong Konnect Metering & Billing (the managed version of OpenMeter). I kept the path deliberately short: the agent posts its own usage events straight from the code it already runs. One agent run comes out the other end as one invoice, with a line for thinking and a line for each kind of acting. Here is how it went.
Here's the complete codebase: https://github.com/tejakummarikuntla/Hermes-Billing-with-KongMB
Here's what I had to do:
- Set up a research agent on Hermes with three tools (search, fetch, report)
- Meter every token, split into input and output
- Meter every tool call, by tool name
- Price thinking and acting in Kong Konnect Metering & Billing
- Turn one agent run into one invoice
Set up Hermes
You can use Hermes hosted or local. The agent code is identical either way; you only change three environment variables.
Option A: hosted (Nous Research API)
Create a key at portal.nousresearch.com. It is an OpenAI-compatible endpoint, so you point the OpenAI client at it:
LLM_BASE_URL=https://inference-api.nousresearch.com/v1
LLM_API_KEY=sk-nous-your-key
MODEL=nousresearch/hermes-4-70b
One thing to know up front: Hermes 4 is a paid model and needs purchased credits (a one-time grant only covers free models). Also, the Nous API does not expose the OpenAI tools parameter, so the agent uses Hermes's native <tool_call> format there. More on that later.
Option B: local and free (Ollama)
If you do not want to spend anything, run Hermes 3 locally with Ollama. This is what the rest of the tutorial uses.
# The Homebrew cask bundles the inference runner. The CLI-only formula does not,
# so it can pull models but cannot actually run them.
brew install --cask ollama
ollama serve & # start the local server on http://localhost:11434
ollama pull hermes3 # about 4.7GB, one time
That gives you an OpenAI-compatible Hermes at http://localhost:11434/v1 with no API key:
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
MODEL=hermes3
Prerequisites
- Python 3.10, 3.11, 3.12, or 3.13
- A Hermes endpoint: local Ollama (free) or a Nous Research API key
- A free Kong Konnect account
- A Kong Konnect Personal Access Token with Metering & Billing permissions
Part 1: The Hermes agent
Set up the project
python -m venv .venv && source .venv/bin/activate
pip install openai httpx beautifulsoup4 python-dotenv
requirements.txt:
openai>=1.40.0
httpx>=0.27.0
beautifulsoup4>=4.12.0
python-dotenv>=1.0.0
.env (the local Ollama defaults plus your Kong values):
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
MODEL=hermes3
KONG_API_URL=https://us.api.konghq.com # use eu or au if your org is there
KONG_PAT=kpat_your_konnect_token
SUBJECT=hermes-demo
SUBJECT is the customer identifier. Every usage event carries it, and Kong attributes the usage to the customer that owns that subject.
The tools
Three tools, all keyless so you only need the Kong PAT. web_search hits Wikipedia's open API, fetch_url reads a page, and make_report writes a file. Swap web_search for Tavily or Brave in production; nothing else changes.
# tools.py
import os, json, datetime
import httpx
from bs4 import BeautifulSoup

REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
UA = {"User-Agent": "hermes-paid-agent/0.1"}


def web_search(query: str) -> str:
    """Search for information on a topic. Returns titles, URLs, and snippets."""
    r = httpx.get("https://en.wikipedia.org/w/api.php",
                  params={"action": "query", "list": "search", "srsearch": query,
                          "format": "json", "srlimit": 5}, headers=UA, timeout=20.0)
    r.raise_for_status()
    hits = r.json().get("query", {}).get("search", [])
    return json.dumps([{
        "title": h["title"],
        "url": "https://en.wikipedia.org/wiki/" + h["title"].replace(" ", "_"),
        "snippet": BeautifulSoup(h.get("snippet", ""), "html.parser").get_text(),
    } for h in hits])


def fetch_url(url: str) -> str:
    """Fetch a web page and return its readable text (truncated)."""
    r = httpx.get(url, headers=UA, timeout=20.0, follow_redirects=True)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    return " ".join(soup.get_text(" ").split())[:4000]


def make_report(title: str, findings: str) -> str:
    """Write a final research report (markdown). The premium tool."""
    os.makedirs(REPORTS_DIR, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    slug = "".join(c if c.isalnum() or c in "-_" else "-" for c in title.lower())[:40]
    path = os.path.join(REPORTS_DIR, f"{stamp}-{slug}.md")
    with open(path, "w") as f:
        f.write(f"# {title}\n\n{findings}\n")
    return f"Report written to {path}"


TOOLS = [
    {"type": "function", "function": {
        "name": "web_search", "description": "Search for information on a topic.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "fetch_url", "description": "Fetch the readable text of a web page by URL.",
        "parameters": {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "make_report", "description": "Write the final report. Call once when done.",
        "parameters": {"type": "object", "properties": {"title": {"type": "string"}, "findings": {"type": "string"}},
                       "required": ["title", "findings"]}}},
]

DISPATCH = {"web_search": web_search, "fetch_url": fetch_url, "make_report": make_report}
Meter tokens and tool calls
This is the whole billing integration. Two kinds of CloudEvent posted straight to Kong's ingest endpoint, no gateway:
- hermes.tokens with {tokens, type, model}: one event for input, one for output, per model call.
- hermes.tool_call with {tool}: one event each time a tool runs.
# metering.py
import os, uuid, datetime
import httpx
from dotenv import load_dotenv

load_dotenv()
KONG_API_URL = os.environ["KONG_API_URL"].rstrip("/")
KONG_PAT = os.environ["KONG_PAT"]
SUBJECT = os.environ.get("SUBJECT", "hermes-demo")
SOURCE = "hermes-paid-agent"
INGEST_URL = f"{KONG_API_URL}/v3/openmeter/events"
HEADERS = {"Authorization": f"Bearer {KONG_PAT}", "Content-Type": "application/cloudevents+json"}


def _now():
    return datetime.datetime.now(datetime.timezone.utc).isoformat()


def _post(event):
    r = httpx.post(INGEST_URL, headers=HEADERS, json=event, timeout=30.0)
    if r.status_code >= 300:
        raise RuntimeError(f"ingest failed {r.status_code}: {r.text}")


def emit_token_event(tokens, token_type, model):  # token_type is "input" or "output"
    _post({"specversion": "1.0", "id": str(uuid.uuid4()), "source": SOURCE,
           "type": "hermes.tokens", "time": _now(), "subject": SUBJECT,
           "data": {"tokens": tokens, "type": token_type, "model": model}})


def emit_tool_event(tool):
    _post({"specversion": "1.0", "id": str(uuid.uuid4()), "source": SOURCE,
           "type": "hermes.tool_call", "time": _now(), "subject": SUBJECT,
           "data": {"tool": tool}})
Each event gets a fresh id. Kong de-duplicates events by id plus source, so a fresh UUID per event keeps every one of them counted.
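To see why the fresh id matters, here is a minimal local sketch of de-duplication by id plus source. It is a simulation only: `FakeIngest` and `tool_event` are hypothetical stand-ins, not Kong's implementation, and the real dedup window and storage are not modeled.

```python
import uuid


class FakeIngest:
    """Hypothetical stand-in for an ingest endpoint that de-duplicates by (id, source)."""

    def __init__(self):
        self.seen = set()
        self.accepted = []

    def post(self, event):
        key = (event["id"], event["source"])
        if key in self.seen:
            return False  # duplicate: dropped, not counted again
        self.seen.add(key)
        self.accepted.append(event)
        return True


def tool_event(tool):
    # A fresh UUID per event, mirroring emit_tool_event in metering.py.
    return {"id": str(uuid.uuid4()), "source": "hermes-paid-agent",
            "type": "hermes.tool_call", "data": {"tool": tool}}


ingest = FakeIngest()
ingest.post(tool_event("web_search"))
ingest.post(tool_event("web_search"))  # new id, so it counts as a second call

retry = tool_event("fetch_url")
ingest.post(retry)
ingest.post(retry)                     # same id and source, so it is dropped

print(len(ingest.accepted))
```

Under that assumption the flip side also holds: if you ever add retries around `_post`, building the event once and resending the same dict should keep a retry from being billed twice.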
The agent loop
The loop is small: call Hermes, run any tools it asks for, feed results back, repeat until it answers. Two details make it Hermes-specific.
First, tool calling has two modes. When Hermes is served by Ollama, the server exposes the OpenAI tools parameter and returns structured tool_calls. The Nous API does not expose tools, so Hermes emits its native <tool_call> tags in the text and we parse them. The agent auto-selects the mode from the endpoint.
Second, metering sits inline: after each model call we read the usage block and emit two token events; each time a tool runs we emit a tool event.
# agent.py
import os, re, sys, json
from openai import OpenAI
from dotenv import load_dotenv
import tools
from metering import emit_token_event, emit_tool_event

load_dotenv()
BASE_URL = os.environ["LLM_BASE_URL"]
client = OpenAI(api_key=os.environ.get("LLM_API_KEY", "ollama"), base_url=BASE_URL)
MODEL = os.environ.get("MODEL", "hermes3")
MAX_STEPS, MAX_TOKENS, TEMPERATURE = 10, 1024, 0.3

# "api" = server-side tools parameter; "native" = Hermes <tool_call> parsing.
TOOL_MODE = os.environ.get("TOOL_MODE", "").lower() or ("native" if "nousresearch" in BASE_URL else "api")

SYSTEM = ("You are a research assistant. Use web_search to find sources, then call fetch_url ONLY "
          "with a url returned by web_search (never invent URLs). After 1-2 searches and one fetch, "
          "call make_report once with a title and concise findings that cite the source URLs. Then "
          "write a short 2-3 sentence summary as your final reply. Use at most 4 tools in total.")

HERMES_TOOL_INSTRUCTIONS = (
    "You are provided with function signatures within <tools></tools> XML tags. To call a function, "
    "return a JSON object with its name and arguments within <tool_call></tool_call> tags, like:\n"
    '<tool_call>\n{"name": "web_search", "arguments": {"query": "..."}}\n</tool_call>\n'
    "Call one function per step. When you have the final answer, reply with plain text and no tags.")

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def meter_usage(usage):
    if usage:
        emit_token_event(usage.prompt_tokens, "input", MODEL)
        emit_token_event(usage.completion_tokens, "output", MODEL)
        print(f"[meter] tokens in={usage.prompt_tokens} out={usage.completion_tokens}")


def run_tool(name, args):
    print(f"[tool] {name}({args})")
    try:
        result = tools.DISPATCH[name](**args)
    except Exception as e:
        result = f"ERROR: {e}"
    emit_tool_event(name)
    print(f"[meter] tool_call {name}")
    return result


def run_api(question):  # Ollama and other endpoints that expose the tools parameter
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": question}]
    for step in range(MAX_STEPS):
        kwargs = {"model": MODEL, "messages": messages, "max_tokens": MAX_TOKENS, "temperature": TEMPERATURE}
        if step < MAX_STEPS - 1:
            kwargs["tools"] = tools.TOOLS
        resp = client.chat.completions.create(**kwargs)
        meter_usage(resp.usage)
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            print("\n=== Answer ===\n" + (msg.content or "(no answer)"))
            return
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments or "{}"))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})


def run_native(question):  # Nous API: Hermes emits <tool_call> tags we parse
    sigs = "\n".join(json.dumps(t["function"]) for t in tools.TOOLS)
    system = f"{SYSTEM}\n\n{HERMES_TOOL_INSTRUCTIONS}\nHere are the available tools:\n<tools>\n{sigs}\n</tools>"
    messages = [{"role": "system", "content": system}, {"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        resp = client.chat.completions.create(model=MODEL, messages=messages,
                                              max_tokens=MAX_TOKENS, temperature=TEMPERATURE)
        meter_usage(resp.usage)
        content = resp.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": content})
        calls = [(json.loads(m)["name"], json.loads(m).get("arguments", {}))
                 for m in TOOL_CALL_RE.findall(content)]
        if not calls:
            print("\n=== Answer ===\n" + content)
            return
        for name, args in calls:
            result = run_tool(name, args)
            messages.append({"role": "user",
                             "content": f"<tool_response>\n{json.dumps({'name': name, 'content': result})}\n</tool_response>"})


if __name__ == "__main__":
    print(f"[hermes] model={MODEL} endpoint={BASE_URL} tool_mode={TOOL_MODE}")
    (run_native if TOOL_MODE == "native" else run_api)(" ".join(sys.argv[1:]) or input("Ask Hermes: "))
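One thing worth checking in native mode is what happens when the model emits a `<tool_call>` block that is not valid JSON. The sketch below reuses the same regular expression as agent.py; `parse_tool_calls` is a hypothetical helper, not part of the repo, and skipping bad blocks is just one possible policy.

```python
import json
import re

# Same pattern as TOOL_CALL_RE in agent.py.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(content):
    """Return (name, arguments) pairs, skipping blocks that are not valid JSON."""
    calls = []
    for raw in TOOL_CALL_RE.findall(content):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: ignore it rather than crash the run
        if isinstance(obj, dict) and "name" in obj:
            calls.append((obj["name"], obj.get("arguments", {})))
    return calls


sample = (
    "Let me look that up.\n"
    '<tool_call>\n{"name": "web_search", "arguments": {"query": "Kong Inc"}}\n</tool_call>\n'
    '<tool_call>\n{"name": "fetch_url", "arguments": {oops}}\n</tool_call>'
)
print(parse_tool_calls(sample))
```

The first block parses and the second is dropped. Whether to drop a bad block silently or feed an error back to the model is a design choice; the loop above does neither and would raise on the malformed one.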
Run it
python agent.py "Who founded Kong Inc. and what does the company build?"
You will see the metering happen in real time:
[hermes] model=hermes3 endpoint=http://localhost:11434/v1 tool_mode=api
[meter] tokens in=391 out=54
[tool] web_search({'query': 'Kong Inc'})
[meter] tool_call web_search
[tool] fetch_url({'url': 'https://en.wikipedia.org/wiki/Kong_Inc.'})
[meter] tool_call fetch_url
[meter] tokens in=833 out=249
[tool] make_report({'title': 'Kong Inc Overview', 'findings': '...'})
[meter] tool_call make_report
[meter] tokens in=3392 out=87
=== Answer ===
The report on Kong Inc. has been written...
Every [meter] line is a CloudEvent already sitting in Kong. Now we price it.
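Before pricing, it can help to tally what a run actually emitted. This is a small sketch that parses the `[meter]` lines from the sample output above and sums them into the feature keys used later; the `tally` helper and the regexes are my own assumptions about the log format, based only on the lines shown.

```python
import re
from collections import Counter

LOG = """\
[meter] tokens in=391 out=54
[tool] web_search({'query': 'Kong Inc'})
[meter] tool_call web_search
[tool] fetch_url({'url': 'https://en.wikipedia.org/wiki/Kong_Inc.'})
[meter] tool_call fetch_url
[meter] tokens in=833 out=249
[tool] make_report({'title': 'Kong Inc Overview', 'findings': '...'})
[meter] tool_call make_report
[meter] tokens in=3392 out=87
"""

TOKENS_RE = re.compile(r"^\[meter\] tokens in=(\d+) out=(\d+)$")
TOOL_RE = re.compile(r"^\[meter\] tool_call (\w+)$")


def tally(log):
    """Sum token counts and count tool calls from [meter] log lines."""
    totals = Counter()
    for line in log.splitlines():
        if m := TOKENS_RE.match(line):
            totals["input_tokens"] += int(m.group(1))
            totals["output_tokens"] += int(m.group(2))
        elif m := TOOL_RE.match(line):
            totals["tool_" + m.group(1)] += 1
    return dict(totals)


print(tally(LOG))
```

For this log the token totals come to 4,616 in and 390 out, with one call per tool.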
Part 2: The billing setup
The model: two meters, then a feature per billable dimension, then a plan that prices each feature, then a customer and a subscription. Each step shows the Konnect UI and the equivalent API call.
Provision Kong with one script
If you want to skip the clicking, the repo has kong_setup.py that creates everything below in order and is safe to re-run (it reuses anything that already exists):
python kong_setup.py
The rest of this section is what that script does, step by step, so you understand each piece.
Create the two meters
A meter turns a stream of events into a number. We need two.
In the UI: Metering & Billing → Metering → Create Meter.
| Meter | Event type | Aggregation | Value property | Group by |
|---|---|---|---|---|
| Hermes Tokens | hermes.tokens | Sum | $.tokens | type, model |
| Hermes Tool Calls | hermes.tool_call | Count | (none) | tool |
The tokens meter sums the tokens field and keeps type as a dimension so we can split input from output. The tool meter just counts events and keeps tool as a dimension.
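If the sum-versus-count distinction feels abstract, here is a rough local imitation of what the two meters compute. It is a sketch of the idea only: `aggregate` is a hypothetical function, the events are made up, and Kong's real aggregation (windows, time ranges, storage) is not modeled.

```python
from collections import defaultdict

events = [
    {"type": "hermes.tokens", "data": {"tokens": 391, "type": "input", "model": "hermes3"}},
    {"type": "hermes.tokens", "data": {"tokens": 54, "type": "output", "model": "hermes3"}},
    {"type": "hermes.tokens", "data": {"tokens": 833, "type": "input", "model": "hermes3"}},
    {"type": "hermes.tool_call", "data": {"tool": "web_search"}},
    {"type": "hermes.tool_call", "data": {"tool": "web_search"}},
    {"type": "hermes.tool_call", "data": {"tool": "make_report"}},
]


def aggregate(events, event_type, group_by, value_property=None):
    """Sum value_property per group, or count events when value_property is None."""
    totals = defaultdict(int)
    for ev in events:
        if ev["type"] != event_type:
            continue
        key = tuple(ev["data"][dim] for dim in group_by)
        totals[key] += ev["data"][value_property] if value_property else 1
    return dict(totals)


tokens = aggregate(events, "hermes.tokens", ["type", "model"], "tokens")
tool_calls = aggregate(events, "hermes.tool_call", ["tool"])
print(tokens)
print(tool_calls)
```

The group-by keys are what the features in the next step filter on: one slice of the tokens meter per `type`, one slice of the tool meter per `tool`.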
CLI
curl -s -X POST "$KONG_API_URL/v3/openmeter/meters" \
-H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
-d '{"key":"hermes_tokens","name":"Hermes Tokens","event_type":"hermes.tokens",
"aggregation":"sum","value_property":"$.tokens",
"dimensions":{"type":"$.type","model":"$.model"}}'
curl -s -X POST "$KONG_API_URL/v3/openmeter/meters" \
-H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
-d '{"key":"hermes_tool_calls","name":"Hermes Tool Calls","event_type":"hermes.tool_call",
"aggregation":"count","dimensions":{"tool":"$.tool"}}'
Note the field names are snake_case: event_type, value_property, dimensions.
Create the features
A feature is a billable thing tied to a meter, filtered to one slice of it. We make five: input tokens, output tokens, and one per tool.
In the UI: Product Catalog → Features. For each, pick the meter and add a meter filter.
| Feature key | Meter | Filter |
|---|---|---|
| input_tokens | Hermes Tokens | type = input |
| output_tokens | Hermes Tokens | type = output |
| tool_web_search | Hermes Tool Calls | tool = web_search |
| tool_fetch_url | Hermes Tool Calls | tool = fetch_url |
| tool_make_report | Hermes Tool Calls | tool = make_report |
This is the step that bit me, so read the next line twice. The only feature shape that actually persists the filter is a nested meter object. If you send a different shape, the API still returns 201, but it silently drops the filter, and then every feature meters the whole meter and your invoice shows no per-line charges.
CLI
# meter id from: curl .../v3/openmeter/meters
curl -s -X POST "$KONG_API_URL/v3/openmeter/features" \
-H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
-d '{"key":"input_tokens","name":"Input tokens",
"meter":{"id":"<HERMES_TOKENS_METER_ID>","filters":{"type":{"eq":"input"}}}}'
After creating each feature, read it back and confirm the filter is there:
curl -s "$KONG_API_URL/v3/openmeter/features" -H "Authorization: Bearer $KONG_PAT" \
| python3 -c "import sys,json;[print(f['key'],f.get('meter',{}).get('filters')) for f in json.load(sys.stdin)['data']]"
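The same read-back check can be written as a small function if you would rather fail loudly than eyeball the output. This is a sketch that assumes the response shape implied by the one-liner above (a `data` list whose items carry `key` and `meter.filters`); the sample payload is hypothetical.

```python
def features_missing_filters(payload):
    """Return keys of features whose meter filter is absent or empty."""
    return [f["key"] for f in payload.get("data", [])
            if not (f.get("meter") or {}).get("filters")]


# Hypothetical response body; the shape is assumed, not taken from API docs.
sample = {"data": [
    {"key": "input_tokens", "meter": {"id": "m1", "filters": {"type": {"eq": "input"}}}},
    {"key": "output_tokens", "meter": {"id": "m1"}},              # filter was dropped
    {"key": "tool_fetch_url", "meter": {"id": "m2", "filters": {}}},
]}

print(features_missing_filters(sample))
```

Anything this returns is a feature that would meter the whole meter instead of its slice.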
Create a plan with rate cards
The plan prices each feature. These are illustrative numbers chosen so every line is visible. The price is per single unit, so for tokens it is the price of one token. For production you would use small decimals (Hermes 4 70B costs about $0.00000005 per input token, so you would mark up from there).
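Because the rate card wants the price of one unit, a cost quoted per million tokens has to be divided down first. A minimal sketch of that arithmetic, using `Decimal` to avoid float noise; the $0.05 per million figure is just the per-token number above multiplied back up, and the 3x markup is a made-up example.

```python
from decimal import Decimal


def per_token_price(cost_per_million, markup):
    """Convert a cost quoted per million tokens into a marked-up per-token price."""
    return (Decimal(cost_per_million) / Decimal(1_000_000)) * Decimal(markup)


# Assumption: $0.05 per million input tokens, i.e. $0.00000005 per token.
cost = per_token_price("0.05", "1")
price = per_token_price("0.05", "3")  # hypothetical 3x markup

print(f"{cost:.8f}")
print(f"{price:.8f}")
```

Passing the amounts as strings keeps them exact, which matters when the values have this many leading zeros.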
In the UI: Product Catalog → Plans → New Plan, currency USD, monthly. Add five usage-based rate cards:
| Rate card (key = feature key) | Price per unit |
|---|---|
| input_tokens | $0.0005 |
| output_tokens | $0.0015 |
| tool_web_search | $0.02 |
| tool_fetch_url | $0.01 |
| tool_make_report | $0.10 |
The rate card key must equal the feature key. If they differ, the API returns rate_card_key_feature_key_mismatch.
CLI
curl -s -X POST "$KONG_API_URL/v3/openmeter/plans" \
-H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
-d '{"key":"hermes_pro","name":"Hermes Pro","currency":"USD","billing_cadence":"P1M",
"phases":[{"key":"default","name":"Default","rate_cards":[
{"billing_cadence":"P1M","key":"input_tokens","name":"Input tokens",
"feature":{"id":"<INPUT_TOKENS_FEATURE_ID>"},"price":{"type":"unit","amount":"0.0005"}}
]}]}'
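The curl body above carries a single rate card to keep it readable. A sketch of building the full five-card body in Python follows, using only the field names that appear in that curl call; `build_plan` is a hypothetical helper and the feature ids are placeholders you would replace with the ids returned when you created each feature.

```python
import json

PRICES = {
    "input_tokens": "0.0005",
    "output_tokens": "0.0015",
    "tool_web_search": "0.02",
    "tool_fetch_url": "0.01",
    "tool_make_report": "0.10",
}


def build_plan(feature_ids):
    """Build the plan body; feature_ids maps feature key -> feature id."""
    rate_cards = [
        {"billing_cadence": "P1M", "key": key,  # rate card key equals the feature key
         "name": key.replace("_", " ").capitalize(),
         "feature": {"id": feature_ids[key]},
         "price": {"type": "unit", "amount": amount}}
        for key, amount in PRICES.items()
    ]
    return {"key": "hermes_pro", "name": "Hermes Pro", "currency": "USD",
            "billing_cadence": "P1M",
            "phases": [{"key": "default", "name": "Default", "rate_cards": rate_cards}]}


# Placeholder ids only; substitute the real feature ids.
ids = {key: f"<{key.upper()}_FEATURE_ID>" for key in PRICES}
plan = build_plan(ids)
print(json.dumps(plan, indent=2))
```

The printed JSON can be posted with the same curl command in place of the inline body.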
A plan is created as a draft. Publish it before anything can subscribe:
curl -s -X POST "$KONG_API_URL/v3/openmeter/plans/<PLAN_ID>/publish" \
-H "Authorization: Bearer $KONG_PAT"
Create the customer and subscribe
Customers are not created from events. The subject rides along on every event, but you have to create a customer whose usage_attribution.subject_keys contains that subject, then subscribe it to the plan.
CLI
curl -s -X POST "$KONG_API_URL/v3/openmeter/customers" \
-H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
-d '{"key":"hermes-demo","name":"Hermes Demo","usage_attribution":{"subject_keys":["hermes-demo"]}}'
curl -s -X POST "$KONG_API_URL/v3/openmeter/subscriptions" \
-H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
-d '{"customer":{"id":"<CUSTOMER_ID>"},"plan":{"key":"hermes_pro"},"active_from":"2026-01-01T00:00:00Z"}'
One ordering rule: events sent before the subscription starts do not get billed. Subscribe first, then run the agent.
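The ordering rule boils down to a timestamp comparison. Here is a tiny sketch of it; `is_billable` is hypothetical, and treating an event exactly at the start instant as billable is my assumption, not something I verified against Kong.

```python
from datetime import datetime


def is_billable(event_time, active_from):
    """True when the event timestamp is at or after the subscription start."""
    return datetime.fromisoformat(event_time) >= datetime.fromisoformat(active_from)


# Same instant as the active_from in the curl call, written with an explicit
# offset because datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+.
ACTIVE_FROM = "2026-01-01T00:00:00+00:00"

print(is_billable("2025-12-31T23:59:59+00:00", ACTIVE_FROM))
print(is_billable("2026-01-01T00:00:01+00:00", ACTIVE_FROM))
```

The events from metering.py carry a UTC `time` in this same ISO format, so a check like this can flag runs that happened before the subscription went live.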
Run the agent and read the invoice
With the subscription live, run the agent again:
python agent.py "Who founded Kong Inc. and what does the company build?"
One run on local Hermes 3 produced this, attributed to the customer:
| Line | Usage | Price | Charge |
|---|---|---|---|
| Input tokens | 4,616 | $0.0005 | $2.31 |
| Output tokens | 390 | $0.0015 | $0.58 |
| web_search | 2 | $0.02 | $0.04 |
| fetch_url | 3 | $0.01 | $0.03 |
| make_report | 2 | $0.10 | $0.20 |
| Total | | | $3.16 |
Open the customer in Metering & Billing → Customers and the upcoming invoice shows thinking and acting as separate lines.
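The invoice is just usage times unit price, line by line, so it is easy to recompute as a sanity check. A minimal sketch with the usage and prices from the tables above; `invoice_lines` is a hypothetical helper and the per-line amounts are left unrounded.

```python
from decimal import Decimal

PRICES = {"input_tokens": "0.0005", "output_tokens": "0.0015",
          "tool_web_search": "0.02", "tool_fetch_url": "0.01", "tool_make_report": "0.10"}
USAGE = {"input_tokens": 4616, "output_tokens": 390,
         "tool_web_search": 2, "tool_fetch_url": 3, "tool_make_report": 2}


def invoice_lines(usage, prices):
    """Unrounded charge per line: quantity times unit price."""
    return {key: Decimal(qty) * Decimal(prices[key]) for key, qty in usage.items()}


lines = invoice_lines(USAGE, PRICES)
total = sum(lines.values(), Decimal("0")).quantize(Decimal("0.01"))

for key, amount in lines.items():
    print(f"{key:18} {amount}")
print(f"{'total':18} {total}")
```

The total lands on 3.16, matching the invoice. The output-token line is exactly 0.585 before rounding, so whether it displays as $0.58 or $0.59 depends on the rounding mode, which this sketch does not try to reproduce.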
Where I'd take this next
- Free quota then overage per tool, instead of flat per-call pricing.
- Add MCP tools and meter each one as its own line.
- Move to hosted Hermes 4 70B for a stronger agent, with TOOL_MODE=native.
- If you would rather not put metering in app code at all, put the calls behind Kong AI Gateway and let it emit the token usage for you.
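For the first idea on that list, the arithmetic of "free quota then overage" is simple enough to sketch. The numbers here are invented, and `overage_charge` is a hypothetical helper showing the pricing logic only, not how a plan would be configured in Kong.

```python
from decimal import Decimal


def overage_charge(calls, free_quota, unit_price):
    """Charge only for calls beyond the free quota."""
    billable = max(0, calls - free_quota)
    return billable * Decimal(unit_price)


# Hypothetical plan: 100 free searches, then $0.02 each.
print(overage_charge(80, 100, "0.02"))
print(overage_charge(130, 100, "0.02"))
```

A customer under the quota pays nothing for that tool, and one at 130 calls pays for the 30 above it.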
How would you price an agent?
Per token? Per tool call? A flat platform fee plus usage? Free searches then paid ones? I went with separate lines for thinking and acting because that is where the cost actually splits, but I am curious what you would do. Drop a comment with the model you use.
The full code is at https://github.com/tejakummarikuntla/Hermes-Billing-with-KongMB PRs welcome.



