Teja Kummarikuntla Subscriber for Kong


I Turned Hermes Into a Paid AI Agent, Then Billed Every Token and Tool Call

If you clicked this, you probably already like Hermes. So do I. I have had it running on my laptop for a while, the Hermes Agent is all over my feed lately, and the open models are good enough now that building your own agent on them is genuinely fun. Somewhere around the third tool I bolted on, my question quietly changed. It stopped being "can this thing do the task" and became "if I turned this into a paid product, what would it cost me, and how would I charge for it?"

I have put a price on software before, and an API is the easy case: meter the calls, pick a number, done. An agent is harder, because it does two different things that both cost real money. It thinks, which is tokens. And it acts, which is the search it fires, the page it pulls, the report it writes. When I actually sat and watched mine run, the tool calls were doing as much work as the model. Pricing only the tokens would have billed half of what the agent really does.

So I stopped theorizing and tried it. I took my small Hermes research agent, gave it a few genuine tools, and wired up billing for both sides: every token and every tool call, as their own line items, ending in a real invoice. No pretend company, no pretend customers. Just an honest end-to-end run to find out what turning Hermes into a paid agent actually takes.

The billing runs on Kong Konnect Metering & Billing (the managed version of OpenMeter). I kept the path deliberately short: the agent posts its own usage events straight from the code it already runs. One agent run comes out the other end as one invoice, with a line for thinking and a line for each kind of acting. Here is how it went.

Here's the complete codebase: https://github.com/tejakummarikuntla/Hermes-Billing-with-KongMB

Here's what I had to do:
🧠 Set up a research agent on Hermes with three tools (search, fetch, report)
🪙 Meter every token, split into input and output
🔧 Meter every tool call, by tool name
💵 Price thinking and acting in Kong Konnect Metering & Billing
🧾 Turn one agent run into one invoice

Here's the complete flow

Set up Hermes

You can use Hermes hosted or local. The agent code is identical either way; you only change three environment variables.
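As a rough illustration of why only the environment changes, the sketch below resolves the endpoint settings from a plain dictionary. `LLMConfig` and `load_llm_config` are hypothetical names, not part of the repo; the defaults mirror the local Ollama values used in this tutorial, and the tool-mode rule copies the URL check that agent.py performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str
    tool_mode: str  # "api" or "native"


def load_llm_config(env: dict) -> LLMConfig:
    """Build the endpoint config from environment-style key/value pairs."""
    base_url = env.get("LLM_BASE_URL", "http://localhost:11434/v1")
    # Assumption: hosted Nous endpoints contain "nousresearch" in the URL,
    # which is the same heuristic the agent uses to pick native tool calling.
    default_mode = "native" if "nousresearch" in base_url else "api"
    return LLMConfig(
        base_url=base_url,
        api_key=env.get("LLM_API_KEY", "ollama"),
        model=env.get("MODEL", "hermes3"),
        tool_mode=(env.get("TOOL_MODE") or default_mode).lower(),
    )
```

In practice you would pass `os.environ` after loading the `.env` file, so switching between hosted and local is an edit to that file and nothing else.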

Option A: hosted (Nous Research API)

Create a key at portal.nousresearch.com. It is an OpenAI-compatible endpoint, so you point the OpenAI client at it:

LLM_BASE_URL=https://inference-api.nousresearch.com/v1
LLM_API_KEY=sk-nous-your-key
MODEL=nousresearch/hermes-4-70b

One thing to know up front: Hermes 4 is a paid model and needs purchased credits (a one-time grant only covers free models). And the Nous API does not expose the OpenAI tools parameter, so the agent uses Hermes's native <tool_call> format there. More on that later.

Option B: local and free (Ollama)

If you do not want to spend anything, run Hermes 3 locally with Ollama. This is what the rest of the tutorial uses.

# The Homebrew cask bundles the inference runner. The CLI-only formula does not,
# so it can pull models but cannot actually run them.
brew install --cask ollama

ollama serve &        # start the local server on http://localhost:11434
ollama pull hermes3   # about 4.7GB, one time

That gives you an OpenAI-compatible Hermes at http://localhost:11434/v1 with no API key:

LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
MODEL=hermes3

Prerequisites

  • Python 3.10, 3.11, 3.12, or 3.13
  • A Hermes endpoint: local Ollama (free) or a Nous Research API key
  • A free Kong Konnect account
  • A Kong Konnect Personal Access Token with Metering & Billing permissions

Part 1: The Hermes agent

Set up the project

python -m venv .venv && source .venv/bin/activate
pip install openai httpx beautifulsoup4 python-dotenv

requirements.txt:

openai>=1.40.0
httpx>=0.27.0
beautifulsoup4>=4.12.0
python-dotenv>=1.0.0

.env (the local Ollama defaults plus your Kong values):

LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
MODEL=hermes3

KONG_API_URL=https://us.api.konghq.com   # use eu or au if your org is there
KONG_PAT=kpat_your_konnect_token
SUBJECT=hermes-demo

SUBJECT is the customer identifier. Every usage event carries it, and Kong attributes the usage to the customer that owns that subject.

The tools

Three tools, all keyless so you only need the Kong PAT. web_search hits Wikipedia's open API, fetch_url reads a page, and make_report writes a file. Swap web_search for Tavily or Brave in production; nothing else changes.

# tools.py
import os, json, datetime
import httpx
from bs4 import BeautifulSoup

REPORTS_DIR = os.path.join(os.path.dirname(__file__), "reports")
UA = {"User-Agent": "hermes-paid-agent/0.1"}


def web_search(query: str) -> str:
    """Search for information on a topic. Returns titles, URLs, and snippets."""
    r = httpx.get("https://en.wikipedia.org/w/api.php",
                  params={"action": "query", "list": "search", "srsearch": query,
                          "format": "json", "srlimit": 5}, headers=UA, timeout=20.0)
    r.raise_for_status()
    hits = r.json().get("query", {}).get("search", [])
    return json.dumps([{
        "title": h["title"],
        "url": "https://en.wikipedia.org/wiki/" + h["title"].replace(" ", "_"),
        "snippet": BeautifulSoup(h.get("snippet", ""), "html.parser").get_text(),
    } for h in hits])


def fetch_url(url: str) -> str:
    """Fetch a web page and return its readable text (truncated)."""
    r = httpx.get(url, headers=UA, timeout=20.0, follow_redirects=True)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    return " ".join(soup.get_text(" ").split())[:4000]


def make_report(title: str, findings: str) -> str:
    """Write a final research report (markdown). The premium tool."""
    os.makedirs(REPORTS_DIR, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    slug = "".join(c if c.isalnum() or c in "-_" else "-" for c in title.lower())[:40]
    path = os.path.join(REPORTS_DIR, f"{stamp}-{slug}.md")
    with open(path, "w", encoding="utf-8") as f:  # explicit encoding so non-ASCII findings write on any OS
        f.write(f"# {title}\n\n{findings}\n")
    return f"Report written to {path}"


TOOLS = [
    {"type": "function", "function": {
        "name": "web_search", "description": "Search for information on a topic.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "fetch_url", "description": "Fetch the readable text of a web page by URL.",
        "parameters": {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "make_report", "description": "Write the final report. Call once when done.",
        "parameters": {"type": "object", "properties": {"title": {"type": "string"}, "findings": {"type": "string"}},
                       "required": ["title", "findings"]}}},
]
DISPATCH = {"web_search": web_search, "fetch_url": fetch_url, "make_report": make_report}

Meter tokens and tool calls

This is the whole billing integration. Two kinds of CloudEvent posted straight to Kong's ingest endpoint, no gateway:

  • hermes.tokens with {tokens, type, model}: one event for input, one for output, per model call.
  • hermes.tool_call with {tool}: one event each time a tool runs.

# metering.py
import os, uuid, datetime
import httpx
from dotenv import load_dotenv

load_dotenv()
KONG_API_URL = os.environ["KONG_API_URL"].rstrip("/")
KONG_PAT = os.environ["KONG_PAT"]
SUBJECT = os.environ.get("SUBJECT", "hermes-demo")
SOURCE = "hermes-paid-agent"

INGEST_URL = f"{KONG_API_URL}/v3/openmeter/events"
HEADERS = {"Authorization": f"Bearer {KONG_PAT}", "Content-Type": "application/cloudevents+json"}


def _now():
    return datetime.datetime.now(datetime.timezone.utc).isoformat()


def _post(event):
    r = httpx.post(INGEST_URL, headers=HEADERS, json=event, timeout=30.0)
    if r.status_code >= 300:
        raise RuntimeError(f"ingest failed {r.status_code}: {r.text}")


def emit_token_event(tokens, token_type, model):   # token_type is "input" or "output"
    _post({"specversion": "1.0", "id": str(uuid.uuid4()), "source": SOURCE,
           "type": "hermes.tokens", "time": _now(), "subject": SUBJECT,
           "data": {"tokens": tokens, "type": token_type, "model": model}})


def emit_tool_event(tool):
    _post({"specversion": "1.0", "id": str(uuid.uuid4()), "source": SOURCE,
           "type": "hermes.tool_call", "time": _now(), "subject": SUBJECT,
           "data": {"tool": tool}})

Each event gets a fresh id. Kong de-duplicates events by id plus source, so a fresh UUID per event keeps every one of them counted.
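One consequence of de-duplicating on id plus source: a retry of the same event should reuse its original id, otherwise a timeout followed by a resend could be counted twice. The sketch below separates building the event from sending it. `build_event`, `send_with_retry`, and the `send` callable are hypothetical helpers, not part of metering.py or any Kong SDK.

```python
import datetime
import uuid


def build_event(event_type: str, subject: str, data: dict,
                source: str = "hermes-paid-agent") -> dict:
    """Create a CloudEvent dict with a fresh id, generated exactly once."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "subject": subject,
        "data": data,
    }


def send_with_retry(event: dict, send, attempts: int = 3) -> int:
    """Call send(event) until it succeeds; the id never changes between tries.

    `send` is any callable that raises on failure (for example a wrapper
    around an HTTP POST). Returns the number of attempts used.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            send(event)
            return attempt
        except Exception as exc:  # sketch only: narrow this in real code
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The design choice is simply that the id is created in `build_event`, never inside the send path, so every retry carries the same id and source.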

The agent loop

The loop is small: call Hermes, run any tools it asks for, feed results back, repeat until it answers. Two details make it Hermes-specific.

First, tool calling has two modes. When Hermes is served by Ollama, the server exposes the OpenAI tools parameter and returns structured tool_calls. The Nous API does not expose tools, so Hermes emits its native <tool_call> tags in the text and we parse them. The agent auto-selects the mode from the endpoint.
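To make the native mode concrete, here is a minimal, standalone sketch of the tag parsing. It uses the same regular expression as agent.py, but the `parse_tool_calls` helper and its choice to skip malformed JSON are illustrative assumptions rather than the repo's behavior.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Return (name, arguments) pairs for each well-formed <tool_call> block."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
            calls.append((payload["name"], payload.get("arguments", {})))
        except (json.JSONDecodeError, KeyError, TypeError):
            # Assumption: a malformed call is skipped instead of crashing the run.
            continue
    return calls


reply = ('Let me look that up.\n<tool_call>\n'
         '{"name": "web_search", "arguments": {"query": "Kong Inc"}}\n</tool_call>')
print(parse_tool_calls(reply))  # [('web_search', {'query': 'Kong Inc'})]
```

A reply with no tags parses to an empty list, which is how the loop can tell a final answer from a tool request.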

Second, metering sits inline: after each model call we read the usage block and emit two token events; each time a tool runs we emit a tool event.

# agent.py
import os, re, sys, json
from openai import OpenAI
from dotenv import load_dotenv
import tools
from metering import emit_token_event, emit_tool_event

load_dotenv()
BASE_URL = os.environ["LLM_BASE_URL"]
client = OpenAI(api_key=os.environ.get("LLM_API_KEY", "ollama"), base_url=BASE_URL)
MODEL = os.environ.get("MODEL", "hermes3")
MAX_STEPS, MAX_TOKENS, TEMPERATURE = 10, 1024, 0.3

# "api" = server-side tools parameter; "native" = Hermes <tool_call> parsing.
TOOL_MODE = os.environ.get("TOOL_MODE", "").lower() or ("native" if "nousresearch" in BASE_URL else "api")

SYSTEM = ("You are a research assistant. Use web_search to find sources, then call fetch_url ONLY "
          "with a url returned by web_search (never invent URLs). After 1-2 searches and one fetch, "
          "call make_report once with a title and concise findings that cite the source URLs. Then "
          "write a short 2-3 sentence summary as your final reply. Use at most 4 tools in total.")

HERMES_TOOL_INSTRUCTIONS = (
    "You are provided with function signatures within <tools></tools> XML tags. To call a function, "
    "return a JSON object with its name and arguments within <tool_call></tool_call> tags, like:\n"
    '<tool_call>\n{"name": "web_search", "arguments": {"query": "..."}}\n</tool_call>\n'
    "Call one function per step. When you have the final answer, reply with plain text and no tags.")

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def meter_usage(usage):
    if usage:
        emit_token_event(usage.prompt_tokens, "input", MODEL)
        emit_token_event(usage.completion_tokens, "output", MODEL)
        print(f"[meter] tokens  in={usage.prompt_tokens} out={usage.completion_tokens}")


def run_tool(name, args):
    print(f"[tool] {name}({args})")
    try:
        result = tools.DISPATCH[name](**args)
    except Exception as e:
        result = f"ERROR: {e}"
    emit_tool_event(name)
    print(f"[meter] tool_call {name}")
    return result


def run_api(question):   # Ollama and other endpoints that expose the tools parameter
    messages = [{"role": "system", "content": SYSTEM}, {"role": "user", "content": question}]
    for step in range(MAX_STEPS):
        kwargs = {"model": MODEL, "messages": messages, "max_tokens": MAX_TOKENS, "temperature": TEMPERATURE}
        if step < MAX_STEPS - 1:
            kwargs["tools"] = tools.TOOLS
        resp = client.chat.completions.create(**kwargs)
        meter_usage(resp.usage)
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            print("\n=== Answer ===\n" + (msg.content or "(no answer)"))
            return
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments or "{}"))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})


def run_native(question):   # Nous API: Hermes emits <tool_call> tags we parse
    sigs = "\n".join(json.dumps(t["function"]) for t in tools.TOOLS)
    system = f"{SYSTEM}\n\n{HERMES_TOOL_INSTRUCTIONS}\nHere are the available tools:\n<tools>\n{sigs}\n</tools>"
    messages = [{"role": "system", "content": system}, {"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        resp = client.chat.completions.create(model=MODEL, messages=messages,
                                              max_tokens=MAX_TOKENS, temperature=TEMPERATURE)
        meter_usage(resp.usage)
        content = resp.choices[0].message.content or ""
        messages.append({"role": "assistant", "content": content})
        calls = [(json.loads(m)["name"], json.loads(m).get("arguments", {}))
                 for m in TOOL_CALL_RE.findall(content)]
        if not calls:
            print("\n=== Answer ===\n" + content)
            return
        for name, args in calls:
            result = run_tool(name, args)
            messages.append({"role": "user",
                             "content": f"<tool_response>\n{json.dumps({'name': name, 'content': result})}\n</tool_response>"})


if __name__ == "__main__":
    print(f"[hermes] model={MODEL} endpoint={BASE_URL} tool_mode={TOOL_MODE}")
    (run_native if TOOL_MODE == "native" else run_api)(" ".join(sys.argv[1:]) or input("Ask Hermes: "))

Run it

python agent.py "Who founded Kong Inc. and what does the company build?"

You will see the metering happen in real time:

[hermes] model=hermes3 endpoint=http://localhost:11434/v1 tool_mode=api
[meter] tokens  in=391 out=54
[tool] web_search({'query': 'Kong Inc'})
[meter] tool_call web_search
[tool] fetch_url({'url': 'https://en.wikipedia.org/wiki/Kong_Inc.'})
[meter] tool_call fetch_url
[meter] tokens  in=833 out=249
[tool] make_report({'title': 'Kong Inc Overview', 'findings': '...'})
[meter] tool_call make_report
[meter] tokens  in=3392 out=87

=== Answer ===
The report on Kong Inc. has been written...

Every [meter] line is a CloudEvent already sitting in Kong. Now we price it.

Part 2: The billing setup

The model: two meters, then a feature per billable dimension, then a plan that prices each feature, then a customer and a subscription. Each step shows the Konnect UI and the equivalent API call.

Provision Kong with one script

If you want to skip the clicking, the repo has a kong_setup.py script that creates everything below in order and is safe to re-run (it reuses anything that already exists):

python kong_setup.py

The rest of this section is what that script does, step by step, so you understand each piece.

Create the two meters

A meter turns a stream of events into a number. We need two.

In the UI: Metering & Billing → Metering → Create Meter.

| Meter | Event type | Aggregation | Value property | Group by |
| --- | --- | --- | --- | --- |
| Hermes Tokens | hermes.tokens | Sum | $.tokens | type, model |
| Hermes Tool Calls | hermes.tool_call | Count | (none) | tool |

The tokens meter sums the tokens field and keeps type as a dimension so we can split input from output. The tool meter just counts events and keeps tool as a dimension.
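To see what those two aggregations do, here is a small local simulation in plain Python. It does not call Kong; `aggregate`, `token_event`, and `tool_event` are hypothetical helpers that mimic sum and count with a group-by, fed with the token counts and tool calls printed by the sample run earlier.

```python
from collections import defaultdict


def aggregate(events, event_type, group_by, value_key=None):
    """Sum `value_key` (or count events when it is None), grouped by one field."""
    totals = defaultdict(int)
    for ev in events:
        if ev["type"] != event_type:
            continue
        group = ev["data"][group_by]
        totals[group] += ev["data"][value_key] if value_key else 1
    return dict(totals)


def token_event(tokens, token_type):
    return {"type": "hermes.tokens",
            "data": {"tokens": tokens, "type": token_type, "model": "hermes3"}}


def tool_event(tool):
    return {"type": "hermes.tool_call", "data": {"tool": tool}}


# Token counts and tool calls as printed by the sample run.
events = [
    token_event(391, "input"), token_event(54, "output"),
    tool_event("web_search"), tool_event("fetch_url"),
    token_event(833, "input"), token_event(249, "output"),
    tool_event("make_report"),
    token_event(3392, "input"), token_event(87, "output"),
]

tokens_by_type = aggregate(events, "hermes.tokens", "type", value_key="tokens")
calls_by_tool = aggregate(events, "hermes.tool_call", "tool")
print(tokens_by_type)  # {'input': 4616, 'output': 390}
print(calls_by_tool)   # {'web_search': 1, 'fetch_url': 1, 'make_report': 1}
```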

CLI

curl -s -X POST "$KONG_API_URL/v3/openmeter/meters" \
  -H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
  -d '{"key":"hermes_tokens","name":"Hermes Tokens","event_type":"hermes.tokens",
       "aggregation":"sum","value_property":"$.tokens",
       "dimensions":{"type":"$.type","model":"$.model"}}'

curl -s -X POST "$KONG_API_URL/v3/openmeter/meters" \
  -H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
  -d '{"key":"hermes_tool_calls","name":"Hermes Tool Calls","event_type":"hermes.tool_call",
       "aggregation":"count","dimensions":{"tool":"$.tool"}}'

Note the field names are snake_case: event_type, value_property, dimensions.

Create the features

A feature is a billable thing tied to a meter, filtered to one slice of it. We make five: input tokens, output tokens, and one per tool.

In the UI: Product Catalog → Features. For each, pick the meter and add a meter filter.

| Feature key | Meter | Filter |
| --- | --- | --- |
| input_tokens | Hermes Tokens | type = input |
| output_tokens | Hermes Tokens | type = output |
| tool_web_search | Hermes Tool Calls | tool = web_search |
| tool_fetch_url | Hermes Tool Calls | tool = fetch_url |
| tool_make_report | Hermes Tool Calls | tool = make_report |

This is the step that bit me, so read the next line twice. The only feature shape that actually persists the filter is a nested meter object. If you send a different shape, the API still returns 201, but it silently drops the filter. Every feature then counts its whole meter instead of one slice, and your invoice shows no per-line charges.

CLI

# meter id from: curl .../v3/openmeter/meters
curl -s -X POST "$KONG_API_URL/v3/openmeter/features" \
  -H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
  -d '{"key":"input_tokens","name":"Input tokens",
       "meter":{"id":"<HERMES_TOKENS_METER_ID>","filters":{"type":{"eq":"input"}}}}'

After creating each feature, read it back and confirm the filter is there:

curl -s "$KONG_API_URL/v3/openmeter/features" -H "Authorization: Bearer $KONG_PAT" \
  | python3 -c "import sys,json;[print(f['key'],f.get('meter',{}).get('filters')) for f in json.load(sys.stdin)['data']]"
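If you prefer the check to fail loudly rather than print, the same read-back can be expressed as a small Python function. This is a sketch: `features_missing_filter` is a hypothetical helper, and it assumes each feature dict has the `key` and nested `meter.filters` shape that the one-liner above reads.

```python
def features_missing_filter(features: list) -> list:
    """Return the keys of features whose nested meter object has no filters."""
    missing = []
    for feature in features:
        meter = feature.get("meter") or {}
        if not meter.get("filters"):
            missing.append(feature.get("key"))
    return missing


# Example payload in the assumed shape; the ids are placeholders.
sample = [
    {"key": "input_tokens",
     "meter": {"id": "<METER_ID>", "filters": {"type": {"eq": "input"}}}},
    {"key": "output_tokens", "meter": {"id": "<METER_ID>"}},  # filter was dropped
]
print(features_missing_filter(sample))  # ['output_tokens']
```

Running this over the returned feature list right after creating the features, and stopping if the result is non-empty, catches the silent drop before any invoice is generated.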

Create a plan with rate cards

The plan prices each feature. These are illustrative numbers chosen so every line is visible. The price is per single unit, so for tokens it is the price of one token. For production you would use small decimals (Hermes 4 70B costs about $0.00000005 per input token, so you would mark up from there).
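Provider prices are often quoted per million tokens, so the per-unit number for a rate card takes one conversion. A minimal sketch using `Decimal` to avoid float noise at these magnitudes; `per_token_price` and the 3x markup are made-up illustrations, and the $0.05 per million figure is just the approximate input price mentioned above restated per million.

```python
from decimal import Decimal


def per_token_price(price_per_million: str, markup: str = "1") -> Decimal:
    """Convert a per-million-token price into a per-token unit price."""
    return Decimal(price_per_million) / Decimal(1_000_000) * Decimal(markup)


cost = per_token_price("0.05")               # provider cost per input token
sell = per_token_price("0.05", markup="3")   # hypothetical 3x markup
print(f"{cost:.8f}")  # 0.00000005
print(f"{sell:.8f}")  # 0.00000015
```

Passing prices as strings keeps the decimal digits exact, which matters when the amount ends up as a string in a rate card.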

In the UI: Product Catalog → Plans → New Plan, currency USD, monthly. Add five usage-based rate cards:

| Rate card (key = feature key) | Price per unit |
| --- | --- |
| input_tokens | $0.0005 |
| output_tokens | $0.0015 |
| tool_web_search | $0.02 |
| tool_fetch_url | $0.01 |
| tool_make_report | $0.10 |

The rate card key must equal the feature key. If they differ, the API returns rate_card_key_feature_key_mismatch.
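One way to keep the two keys from drifting apart is to generate the rate cards from a single mapping, so each feature key is written once. The sketch below builds a `rate_cards` array in the same shape as the plan request in this section; `build_rate_cards` and the `feature_ids` mapping are hypothetical, and the ids are placeholders.

```python
PRICES = {
    "input_tokens": "0.0005",
    "output_tokens": "0.0015",
    "tool_web_search": "0.02",
    "tool_fetch_url": "0.01",
    "tool_make_report": "0.10",
}


def build_rate_cards(prices: dict, feature_ids: dict) -> list:
    """Build one unit-priced rate card per feature, keyed by the feature key."""
    cards = []
    for key, amount in prices.items():
        cards.append({
            "billing_cadence": "P1M",
            "key": key,  # identical to the feature key by construction
            "name": key.replace("_", " ").capitalize(),
            "feature": {"id": feature_ids[key]},
            "price": {"type": "unit", "amount": amount},
        })
    return cards


# Placeholder ids; real ones come back from the features API.
feature_ids = {key: f"<{key.upper()}_FEATURE_ID>" for key in PRICES}
rate_cards = build_rate_cards(PRICES, feature_ids)
print(rate_cards[0]["key"], rate_cards[0]["feature"]["id"])
```

The resulting list can be serialized with `json.dumps` and dropped into the `rate_cards` field of the default phase.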

CLI

curl -s -X POST "$KONG_API_URL/v3/openmeter/plans" \
  -H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
  -d '{"key":"hermes_pro","name":"Hermes Pro","currency":"USD","billing_cadence":"P1M",
       "phases":[{"key":"default","name":"Default","rate_cards":[
         {"billing_cadence":"P1M","key":"input_tokens","name":"Input tokens",
          "feature":{"id":"<INPUT_TOKENS_FEATURE_ID>"},"price":{"type":"unit","amount":"0.0005"}}
       ]}]}'

A plan is created as a draft. Publish it before anything can subscribe:

curl -s -X POST "$KONG_API_URL/v3/openmeter/plans/<PLAN_ID>/publish" \
  -H "Authorization: Bearer $KONG_PAT"

Create the customer and subscribe

Customers are not created from events. The subject rides along on every event, but you have to create a customer whose usage_attribution.subject_keys contains that subject, then subscribe it to the plan.

CLI

curl -s -X POST "$KONG_API_URL/v3/openmeter/customers" \
  -H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
  -d '{"key":"hermes-demo","name":"Hermes Demo","usage_attribution":{"subject_keys":["hermes-demo"]}}'

curl -s -X POST "$KONG_API_URL/v3/openmeter/subscriptions" \
  -H "Authorization: Bearer $KONG_PAT" -H "Content-Type: application/json" \
  -d '{"customer":{"id":"<CUSTOMER_ID>"},"plan":{"key":"hermes_pro"},"active_from":"2026-01-01T00:00:00Z"}'

One ordering rule: events sent before the subscription starts do not get billed. Subscribe first, then run the agent.
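A cheap guard against that mistake is to compare the current time with the subscription start before running the agent. A sketch, assuming you know the `active_from` value you sent when subscribing; `is_billable` is a hypothetical helper that only models the ordering rule stated here, not Kong's full billing logic.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_billable(event_time: str, active_from: str) -> bool:
    """True when the event time is not before the subscription start.

    Assumption: an event exactly at the start time counts as billable.
    """
    return parse_ts(event_time) >= parse_ts(active_from)


active_from = "2026-01-01T00:00:00Z"
print(is_billable("2025-12-31T23:59:59+00:00", active_from))  # False
print(is_billable("2026-01-01T00:00:01+00:00", active_from))  # True
```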

Run the agent and read the invoice

With the subscription live, run the agent again:

python agent.py "Who founded Kong Inc. and what does the company build?"

One run on local Hermes 3 produced this, attributed to the customer:

| Line | Usage | Price | Charge |
| --- | --- | --- | --- |
| Input tokens | 4,616 | $0.0005 | $2.31 |
| Output tokens | 390 | $0.0015 | $0.58 |
| web_search | 2 | $0.02 | $0.04 |
| fetch_url | 3 | $0.01 | $0.03 |
| make_report | 2 | $0.10 | $0.20 |
| Total | | | $3.16 |
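The arithmetic behind that invoice is plain usage times unit price. A quick sketch to reproduce the total with `Decimal`; the line amounts are kept exact here, and how Kong rounds each displayed line is not something this example tries to model.

```python
from decimal import Decimal, ROUND_HALF_UP

# (usage, unit price) per line, copied from the invoice above.
LINES = {
    "input_tokens": (4616, "0.0005"),
    "output_tokens": (390, "0.0015"),
    "tool_web_search": (2, "0.02"),
    "tool_fetch_url": (3, "0.01"),
    "tool_make_report": (2, "0.10"),
}

# Exact, unrounded charge for each line.
charges = {name: usage * Decimal(price) for name, (usage, price) in LINES.items()}
# Round only the grand total to cents.
total = sum(charges.values(), Decimal("0")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP)

for name, amount in charges.items():
    print(f"{name:<18} {amount}")
print(f"{'total':<18} {total}")  # total is 3.16
```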

Open the customer in Metering & Billing → Customers and the upcoming invoice shows thinking and acting as separate lines.

Where I'd take this next

  • Free quota then overage per tool, instead of flat per-call pricing.
  • Add MCP tools and meter each one as its own line.
  • Move to hosted Hermes 4 70B for a stronger agent, with TOOL_MODE=native.
  • If you would rather not put metering in app code at all, put the calls behind Kong AI Gateway and let it emit the token usage for you.
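As an illustration of the first idea, free quota then overage comes down to charging only for usage beyond an included amount. The numbers and the `overage_charge` helper below are invented for the example and do not correspond to any Kong rate card setting.

```python
from decimal import Decimal


def overage_charge(usage: int, free_quota: int, unit_price: str) -> Decimal:
    """Charge unit_price for each unit above free_quota, nothing below it."""
    billable_units = max(0, usage - free_quota)
    return billable_units * Decimal(unit_price)


# Hypothetical plan: 100 free searches, then $0.02 each.
print(overage_charge(80, 100, "0.02"))   # 0.00
print(overage_charge(130, 100, "0.02"))  # 0.60
```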

How would you price an agent?

Per token? Per tool call? A flat platform fee plus usage? Free searches then paid ones? I went with separate lines for thinking and acting because that is where the cost actually splits, but I am curious what you would do. Drop a comment with the model you use.

The full code is at https://github.com/tejakummarikuntla/Hermes-Billing-with-KongMB and PRs are welcome.
