I couldn't reproduce the bug because I had no record of what tool calls my Hermes agent made.

#devchallenge #hermesagentchallenge #agents #python

Hermes Agent Challenge Submission: Build With Hermes Agent

This is a submission for the Hermes Agent Challenge.

My Hermes research agent was misbehaving — bad summaries, wrong citations. I suspected a tool call was returning unexpected data. But I had no log of what the tools actually returned. I was debugging from model outputs alone, which are one layer of indirection away from the real problem.

After I added tool-call-log, the next time it happened I opened the log file and saw the answer in 30 seconds.

One logger, two patterns

Pattern 1 — inline, after the call:

from tool_call_log import ToolCallLogger

logger = ToolCallLogger("logs/tools.jsonl", meta={"run_id": "run-001"})

query = "climate policy 2026"
result = web_search(query)  # web_search is your own tool function
logger.log("web_search", {"query": query}, result=result)

Pattern 2 — context manager, auto-times the call:

with logger.record("web_search", {"query": "climate policy 2026"}) as r:
    r.result = web_search(r.args["query"])
# Duration measured automatically. Logged on exit.

Both write the same JSONL format. Both capture name, args, result, duration, call_id, error, and any metadata you attach.

The log file

Each call is one JSON line:

{"name":"web_search","args":{"query":"climate policy 2026"},"result":[{"title":"...","url":"..."}],"started_at":1716566401.2,"ended_at":1716566402.8,"call_id":"toolu_abc","error":"","meta":{"run_id":"run-001"},"duration_ms":1600.0}

Plain text. Grep-able. You can jq it, tail it during a run, diff two runs, or replay it.
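As one way to diff two runs, here is a minimal sketch that uses only the standard library. It assumes the field names shown in the sample line above (name, args, result) and one log file per run. The helper names read_jsonl, call_key, and diff_runs are hypothetical and are not part of tool-call-log.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def call_key(record):
    """Identify a call by tool name plus canonicalized arguments."""
    return record["name"], json.dumps(record["args"], sort_keys=True)


def diff_runs(old_records, new_records):
    """Return the (name, args_json) keys whose result differs between runs.

    Only calls present in both runs are compared. If the same key appears
    more than once within a run, the last occurrence wins.
    """
    old = {call_key(r): r.get("result") for r in old_records}
    new = {call_key(r): r.get("result") for r in new_records}
    return sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
```

Calling diff_runs(read_jsonl(path_a), read_jsonl(path_b)) on two run logs would list the tool calls that received the same arguments but returned different data, which is usually the first thing to check when two runs disagree.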

Errors are logged too

If the tool call raises an exception, the context manager logs the error and re-raises:

with logger.record("flaky_api", {"id": 42}) as r:
    r.result = call_flaky_api(42)  # raises TimeoutError
# Logged with a non-empty error and ok=False; the exception still propagates

This is what I needed. I wasn't sure whether the tool was failing silently or returning bad data. Now I can tell: ok=False means the call raised, while a bad result with ok=True means the tool ran and returned garbage.
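To make the log-then-re-raise behavior concrete, here is a rough stand-in written with contextlib. It is not the library's implementation: the function name record_call, the error string format, and the exact set of fields written are assumptions for illustration.

```python
import json
import time
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def record_call(path, name, args):
    """Time a tool call, append one JSON line, and re-raise any exception."""
    rec = SimpleNamespace(name=name, args=args, result=None, error="")
    started = time.time()
    try:
        yield rec
    except Exception as exc:
        # Keep the exception type and message so the log shows what failed.
        rec.error = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the original exception
    finally:
        # Runs on success and on failure, so every call leaves a record.
        ended = time.time()
        line = {
            "name": rec.name,
            "args": rec.args,
            "result": rec.result,
            "started_at": started,
            "ended_at": ended,
            "duration_ms": (ended - started) * 1000.0,
            "error": rec.error,
        }
        with open(path, "a", encoding="utf-8") as fh:
            # default=str keeps non-JSON-serializable results from breaking the log.
            fh.write(json.dumps(line, default=str) + "\n")
```

Writing the record in the finally clause is the design choice that matters here: a failed call and a successful call go through the same path, so a missing line can only mean the call never started.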

Read it back

from tool_call_log import load_tool_log

records = load_tool_log("logs/tools.jsonl")
failed = [r for r in records if not r.ok]
slow = sorted(records, key=lambda r: r.duration_ms or 0, reverse=True)[:5]

Meta propagates to every record

logger = ToolCallLogger(
    "logs/tools.jsonl",
    meta={"run_id": "run-001", "agent": "research-worker"},
)

Every record gets those fields in meta. Per-call meta is merged on top, so a per-call key overrides a logger-level key of the same name:

logger.log("search", {"q": "foo"}, meta={"turn": 7})
# meta = {"run_id": "run-001", "agent": "research-worker", "turn": 7}
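The merge shown above behaves like a plain dictionary merge in which per-call keys win. The sketch below illustrates that rule; merge_meta is a hypothetical helper, and treating key collisions this way is an assumption about the library based on the word "overrides".

```python
def merge_meta(logger_meta, call_meta=None):
    """Combine logger-level and per-call metadata; per-call keys win."""
    return {**logger_meta, **(call_meta or {})}


base = {"run_id": "run-001", "agent": "research-worker"}

# A new key is added alongside the logger-level fields.
added = merge_meta(base, {"turn": 7})

# A colliding key replaces the logger-level value for this record only.
overridden = merge_meta(base, {"agent": "summarizer"})
```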

What I log in my Hermes agent

with logger.record("search_papers", {"query": sub_task}) as r:
    papers = search_semantic_scholar(r.args["query"])
    r.result = [{"id": p.id, "title": p.title, "abstract": p.abstract} for p in papers]

Six months from now I can still trace an agent output back to the exact tool call that produced it, as long as I keep the log for that run. The JSONL files are small enough to keep around: 1000 tool calls is about 500KB.
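The size figure above works out to roughly 500 bytes per record, and it will vary with how much of each result you store. To check it against your own logs, a small sketch like this reports the record count and average line size; log_size_stats is a hypothetical helper, not part of tool-call-log.

```python
import os


def log_size_stats(path):
    """Return (record_count, total_bytes, average_bytes_per_record)."""
    total_bytes = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Count non-empty lines; each one is a single logged call.
        count = sum(1 for line in fh if line.strip())
    average = total_bytes / count if count else 0.0
    return count, total_bytes, average
```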