This is a submission for the Hermes Agent Challenge.
I had a research agent that took five steps to answer a one-line question. I rewrote the system prompt to be more specific. The new prompt felt better. The agent felt faster. The output looked tighter.
I needed to know if I was kidding myself.
I had two JSONL audit logs sitting on disk. One from the old prompt, one from the new. They were the same agent, same model, same tools, same question. The only difference was the prompt I had changed. So all I really wanted was a clean before-and-after between those two runs. Did the new prompt move the needle on tool sequence, on cost, on step count? Or did it just shuffle the same calls around?
A diff. I wanted a diff.
So I wrote tool-call-diff.
What it does
It reads two JSONL audit logs and tells you what changed between them. That is the whole library.
pip install tool-call-diff
tool-call-diff runs/baseline.jsonl runs/new_prompt.jsonl
Here is the actual scenario I started with. The question was "when did Claude 4 ship?" Below is the baseline run, with a vague prompt:
{"ts":1.0,"session_id":"demo","kind":"session_open"}
{"ts":1.1,"session_id":"demo","kind":"tool_ok","tool":"search_web","args":{"q":"claude release"},"usd":0.0012,"latency_ms":220}
{"ts":1.4,"session_id":"demo","kind":"tool_ok","tool":"fetch_url","args":{"url":"https://anthropic.com/news"},"usd":0.0004,"latency_ms":410}
{"ts":1.9,"session_id":"demo","kind":"tool_ok","tool":"summarize","args":{"input_tokens":1850},"usd":0.0380,"latency_ms":1100}
{"ts":3.0,"session_id":"demo","kind":"tool_ok","tool":"final_answer","args":{"len":320},"usd":0.0014,"latency_ms":90}
Five rows. Four tool calls. The agent searched with a vague query, fetched a news page, ran a summarize step on the whole page, then wrote a long answer.
Here is the candidate run, after I rewrote the prompt to say "ask for a specific release date and return only the date":
{"ts":1.0,"session_id":"demo","kind":"session_open"}
{"ts":1.1,"session_id":"demo","kind":"tool_ok","tool":"search_web","args":{"q":"claude 4 release date"},"usd":0.0012,"latency_ms":215}
{"ts":1.4,"session_id":"demo","kind":"tool_ok","tool":"fetch_url","args":{"url":"https://anthropic.com/news"},"usd":0.0004,"latency_ms":405}
{"ts":2.6,"session_id":"demo","kind":"tool_ok","tool":"final_answer","args":{"len":280},"usd":0.0012,"latency_ms":88}
Four rows. Three tool calls. The summarize step is gone. The final answer is shorter.
Pretty obvious in isolation, sure. But I am not always looking at four rows. I am usually looking at fifty or a hundred. And the question is the same every time: what actually moved.
Here is what tool-call-diff prints when I run it on those two files:
~ search_web {"q"="claude release"} (was)
~ search_web {"q"="claude 4 release date"} (now)
= fetch_url {"url"="https://anthropic.com/news"}
- summarize {"input_tokens"=1850}
- final_answer {"len"=320}
+ final_answer {"len"=280}
cost: 0.0410 -> 0.0028 USD (-0.0382)
steps: 4 -> 3 (-1)
latency: 1820 -> 708 ms (-1112)
tool sequence: changed
The body is git-style. Tilde for changed args. Equal for unchanged. Minus for removed. Plus for added.
The footer is the headline. The new prompt cut cost by 93 percent, removed one step, and saved more than a second. The new search query was better. The summarize step was never needed. The model wrote a tighter answer without me telling it to.
That is what I wanted to know. That is the whole point of the library.
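If you want to check the footer by hand, the arithmetic is just sums over the completed tool calls. Here is a minimal sketch, assuming only the field names visible in the sample rows above (kind, usd, latency_ms). The totals helper is hypothetical and is not part of the library's API.

```python
import json

# The two runs from this post, verbatim.
BASELINE = """\
{"ts":1.0,"session_id":"demo","kind":"session_open"}
{"ts":1.1,"session_id":"demo","kind":"tool_ok","tool":"search_web","args":{"q":"claude release"},"usd":0.0012,"latency_ms":220}
{"ts":1.4,"session_id":"demo","kind":"tool_ok","tool":"fetch_url","args":{"url":"https://anthropic.com/news"},"usd":0.0004,"latency_ms":410}
{"ts":1.9,"session_id":"demo","kind":"tool_ok","tool":"summarize","args":{"input_tokens":1850},"usd":0.0380,"latency_ms":1100}
{"ts":3.0,"session_id":"demo","kind":"tool_ok","tool":"final_answer","args":{"len":320},"usd":0.0014,"latency_ms":90}
"""

CANDIDATE = """\
{"ts":1.0,"session_id":"demo","kind":"session_open"}
{"ts":1.1,"session_id":"demo","kind":"tool_ok","tool":"search_web","args":{"q":"claude 4 release date"},"usd":0.0012,"latency_ms":215}
{"ts":1.4,"session_id":"demo","kind":"tool_ok","tool":"fetch_url","args":{"url":"https://anthropic.com/news"},"usd":0.0004,"latency_ms":405}
{"ts":2.6,"session_id":"demo","kind":"tool_ok","tool":"final_answer","args":{"len":280},"usd":0.0012,"latency_ms":88}
"""


def totals(jsonl_text):
    """Hypothetical helper: (cost_usd, steps, latency_ms) over tool_ok rows."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    calls = [r for r in rows if r.get("kind") == "tool_ok"]
    # Round the float sum to four places, matching the precision of the log.
    cost = round(sum(r.get("usd", 0.0) for r in calls), 4)
    latency = sum(r.get("latency_ms", 0) for r in calls)
    return cost, len(calls), latency


old_cost, old_steps, old_ms = totals(BASELINE)
new_cost, new_steps, new_ms = totals(CANDIDATE)

print(f"cost: {old_cost:.4f} -> {new_cost:.4f} USD ({new_cost - old_cost:+.4f})")
print(f"steps: {old_steps} -> {new_steps} ({new_steps - old_steps:+d})")
print(f"latency: {old_ms} -> {new_ms} ms ({new_ms - old_ms:+d})")
print(f"cost reduction: {100 * (old_cost - new_cost) / old_cost:.0f}%")
```

The 93 percent figure is the cost delta over the baseline cost: 0.0382 / 0.0410.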
What it gives you in code
If you want to gate this in a CI job or a notebook, you can:
from tool_call_diff import diff_runs
diff = diff_runs("runs/baseline.jsonl", "runs/new_prompt.jsonl")
assert diff.cost_delta_usd <= 0, "prompt change made the run more expensive"
assert diff.steps_delta <= 0, "prompt change added steps"
Or you can run it as a gate from the CLI:
tool-call-diff baseline.jsonl candidate.jsonl --exit-on-change
That returns a non-zero exit code when anything moved. Drop it next to your snapshot tests.
How the diff works
Each tool call becomes a signature: (tool_name, args_hash). Arg keys are sorted before hashing, so key order does not matter. Then difflib.SequenceMatcher from the standard library produces the opcodes. Equal blocks are unchanged calls. Insert blocks are added calls. Delete blocks are removed calls. Replace blocks split into changed args when the tool name matches and into add plus remove when it does not. Anything in both runs at a different position lands in a reordered bucket.
That is the whole algorithm. Boring, not clever.
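To make the steps concrete, here is a rough sketch of the same idea using only the standard library. It is not the library's source. The function names, the SHA-256 choice, and the positional pairing inside replace blocks are my assumptions for illustration, and the reordered bucket is left out to keep it short.

```python
import hashlib
import json
from difflib import SequenceMatcher


def signature(call):
    """(tool_name, args_hash). sort_keys makes the hash ignore key order."""
    canonical = json.dumps(call.get("args", {}), sort_keys=True)
    return call["tool"], hashlib.sha256(canonical.encode()).hexdigest()


def diff_calls(old, new):
    """Return (marker, tool, args) tuples: '=', '~', '-', '+'."""
    a = [signature(c) for c in old]
    b = [signature(c) for c in new]
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        olds, news = old[i1:i2], new[j1:j2]
        if tag == "equal":
            out += [("=", c["tool"], c["args"]) for c in olds]
            continue
        # delete: news is empty. insert: olds is empty. replace: both non-empty.
        removed, added = [], []
        for k in range(max(len(olds), len(news))):
            o = olds[k] if k < len(olds) else None
            n = news[k] if k < len(news) else None
            if o is not None and n is not None and o["tool"] == n["tool"]:
                # Same tool, different args: emit a was/now pair.
                out += [("~", o["tool"], o["args"]), ("~", n["tool"], n["args"])]
            else:
                if o is not None:
                    removed.append(o)
                if n is not None:
                    added.append(n)
        out += [("-", c["tool"], c["args"]) for c in removed]
        out += [("+", c["tool"], c["args"]) for c in added]
    return out


# The tool calls from the two runs in this post.
OLD = [
    {"tool": "search_web", "args": {"q": "claude release"}},
    {"tool": "fetch_url", "args": {"url": "https://anthropic.com/news"}},
    {"tool": "summarize", "args": {"input_tokens": 1850}},
    {"tool": "final_answer", "args": {"len": 320}},
]
NEW = [
    {"tool": "search_web", "args": {"q": "claude 4 release date"}},
    {"tool": "fetch_url", "args": {"url": "https://anthropic.com/news"}},
    {"tool": "final_answer", "args": {"len": 280}},
]

for marker, tool, args in diff_calls(OLD, NEW):
    print(marker, tool, json.dumps(args))
```

On these two runs the markers come out in the same order as the printed diff earlier in the post. The two final_answer calls show up as minus and plus rather than tilde because, under the pairing assumed here, the old summarize call lines up against the new final_answer first.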
The parser reads a handful of shapes (agenttrace, agentleash, agentsnap, agent-step-log, generic) and normalizes them into one record type. It also drops denied rows on purpose. tool_denied, budget_denied, egress_denied are real events but they are not work the agent completed, so they do not belong in a sequence diff.
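Here is a small sketch of that filtering step for the generic shape only, assuming the field names from the sample rows in this post. The ToolCall record and parse_lines function are hypothetical names, not the library's, and the denied row in the sample is made up for illustration.

```python
import json
from dataclasses import dataclass, field

# Event kinds named in this post as intentionally dropped.
DENIED_KINDS = {"tool_denied", "budget_denied", "egress_denied"}


@dataclass
class ToolCall:
    """Hypothetical normalized record for one completed tool call."""
    tool: str
    args: dict = field(default_factory=dict)
    usd: float = 0.0
    latency_ms: int = 0


def parse_lines(lines):
    """Keep completed tool calls; skip denied rows and rows without a tool."""
    calls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("kind") in DENIED_KINDS or "tool" not in row:
            continue
        calls.append(
            ToolCall(
                tool=row["tool"],
                args=row.get("args", {}),
                usd=row.get("usd", 0.0),
                latency_ms=row.get("latency_ms", 0),
            )
        )
    return calls


# Illustrative input: the tool_denied row is invented for this example.
SAMPLE = [
    '{"ts":1.0,"session_id":"demo","kind":"session_open"}',
    '{"ts":1.1,"session_id":"demo","kind":"tool_ok","tool":"search_web","args":{"q":"claude release"},"usd":0.0012,"latency_ms":220}',
    '{"ts":1.2,"session_id":"demo","kind":"tool_denied","tool":"fetch_url","args":{"url":"https://example.com"}}',
]

calls = parse_lines(SAMPLE)
print([c.tool for c in calls])
```

The denied fetch_url row carries a tool name, so the kind check is what keeps it out of the sequence. The session_open row is skipped because it has no tool at all.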
What it is not
It is not a tracing system. It does not own the write side. It does not send anything anywhere. It does not need a model to read.
It reads two files and prints what changed. Stdlib only. Twenty-five tests.
Where it fits
tool-call-diff is the diff step of a small chain.
- agenttrace writes the run log.
- agentleash writes a stricter audit log with budget and egress guards.
- trace-tree renders one run as a tree.
- agentsnap snapshots one run for regression testing.
- tool-call-diff compares two runs so you can see what your prompt change did.
If you run a Hermes agent and you ever wonder whether your last tweak actually helped, this is the small tool for that moment.
Try it
pip install tool-call-diff
tool-call-diff your-baseline.jsonl your-candidate.jsonl