Mukunda Rao Katta
Your Hermes agent's prompt cache is 30% hit because of one timestamp. Here's how to find it.

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge.

I ran my Hermes agent across a real workload for a day. The bill came in higher than it should have. I checked the prompt-cache hit ratio. 31%.

That number was the only thing wrong with the agent. The agent itself was fine. The tools were fine. The system prompt was a chunky 2,400 tokens of policy and few-shot examples that should have been cached on every call. Instead, almost every call was paying full price for the system prompt again.

I knew the agent was mutating the prefix between calls. Something tiny was getting in there, byte-by-byte, and invalidating the cache. The question was what.

I went looking for a small lib that would compare a batch of captured runs and tell me where the prefix actually became stable. Could not find one. Built it.

It is called prompt-pillar. Repo: MukundaKatta/prompt-pillar. Around 450 lines including all four modules, zero runtime deps, MIT.

What it does

You give it a list of past message lists. It finds the longest prefix that is byte-identical across every single one of them. That prefix is what your cache can actually hit. If it is smaller than the system prompt you thought was cacheable, you have a mutating prefix.

from prompt_pillar import Pillar

run_1 = [
    {"role": "system", "content": "You are a support agent. Now: 2026-05-24T10:34"},
    {"role": "user", "content": "I want a refund for order 12345"},
]
run_2 = [
    {"role": "system", "content": "You are a support agent. Now: 2026-05-24T10:35"},
    {"role": "user", "content": "I want a refund for order 67890"},
]
run_3 = [
    {"role": "system", "content": "You are a support agent. Now: 2026-05-24T10:36"},
    {"role": "user", "content": "Where is my order"},
]

print(Pillar(threshold_tokens=50).analyze([run_1, run_2, run_3]))

Output:

Stable prefix: 0 messages, 0 estimated tokens
Threshold: 50 tokens (BELOW)
Reason: system message timestamp injected at index 0 (first char diverges at 37)
Recommendation: move the timestamp out of the cached message; pass it as a fresh user-turn block after the cache breakpoint

That is the bug. Three runs, three different timestamps, all jammed into the system prompt. Every call invalidates the cache from message index 0. The hit ratio cannot be anything but bad. The fix is to move the timestamp into a user-turn header that lives after the cache breakpoint.

The first time I ran this against my own captured logs, the output named the offender on the first try. I had been adding now=datetime.utcnow().isoformat() to my system prompt for two weeks. It felt like a free thing to do. It cost me about 70% of my cache hits.
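Here is a minimal sketch of that fix, assuming a plain list-of-dicts message format. The names SYSTEM_PROMPT and build_messages are hypothetical and are not part of prompt-pillar. Where the cache breakpoint goes is provider-specific and is not shown.

```python
from datetime import datetime, timezone

# Static system prompt: no per-call values, so it stays byte-identical.
SYSTEM_PROMPT = "You are a support agent."


def build_messages(user_text: str, now: datetime) -> list:
    """Keep the system message constant; carry the timestamp in the user turn."""
    header = f"[now: {now.isoformat(timespec='minutes')}]"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{header}\n{user_text}"},
    ]


call_a = build_messages(
    "I want a refund for order 12345",
    datetime(2026, 5, 24, 10, 34, tzinfo=timezone.utc),
)
call_b = build_messages(
    "Where is my order",
    datetime(2026, 5, 24, 10, 35, tzinfo=timezone.utc),
)

# The system message no longer changes between calls.
print(call_a[0] == call_b[0])
```

The timestamp still reaches the model, but only in the part of the request that was never going to be cached.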

How it decides

The core function is small. It walks every message index in lockstep across every run. It hashes each message with canonical JSON (sorted keys, no whitespace) so two messages that differ only in dict key order still compare equal. The first index where any run disagrees is the divergence point. The prefix before that is the cacheable prefix.

I added the canonical-key step after the first time a run failed to match because Pydantic and a raw dict serialized the same payload in different key orders. Two messages that are semantically identical should not count as a divergence, and with canonical hashing they no longer do.
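The lockstep walk and the canonical hash can be sketched in a few lines. This illustrates the idea and is not prompt-pillar's actual source. The function names canonical_hash and stable_prefix_len are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import json


def canonical_hash(message: dict) -> str:
    """Hash a message as canonical JSON: sorted keys, no extra whitespace."""
    payload = json.dumps(
        message, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def stable_prefix_len(runs: list) -> int:
    """Count the leading messages that are identical across every run."""
    if not runs:
        return 0
    shortest = min(len(run) for run in runs)
    for index in range(shortest):
        hashes = {canonical_hash(run[index]) for run in runs}
        if len(hashes) > 1:
            return index  # first index where any run disagrees
    return shortest


system_a = {"role": "system", "content": "You are a support agent."}
# Same payload, different key order: must still compare equal.
system_b = {"content": "You are a support agent.", "role": "system"}

runs = [
    [system_a, {"role": "user", "content": "I want a refund for order 12345"}],
    [system_b, {"role": "user", "content": "Where is my order"}],
]
print(stable_prefix_len(runs))  # 1
```

The set of hashes at each index collapses to one element when every run agrees, so the check handles any number of runs.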

For diagnosis, the lib looks at the first divergent message and runs a couple of regexes against it. If both sides contain an ISO-8601 timestamp, it says timestamp. If both contain a UUID, it says per-call UUID. If the role itself changed, it says role mismatch. These are the three patterns I have actually hit in production. I did not want a long list of heuristics that fire on noise.
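A rough sketch of that kind of classifier follows, assuming simplified patterns. The regexes and the diagnose function are illustrative guesses and are not the ones the library ships. The sketch checks the role first, which is an arbitrary ordering.

```python
import re

# Simplified patterns (assumptions): date + "T" + HH:MM, and a
# hyphenated 8-4-4-4-12 hex UUID.
ISO_TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(:\d{2})?")
UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def diagnose(a: dict, b: dict) -> str:
    """Guess why two messages at the first divergent index differ."""
    if a.get("role") != b.get("role"):
        return "role mismatch"
    text_a = str(a.get("content", ""))
    text_b = str(b.get("content", ""))
    if ISO_TS.search(text_a) and ISO_TS.search(text_b):
        return "timestamp"
    if UUID.search(text_a) and UUID.search(text_b):
        return "per-call UUID"
    return "unknown"


left = {"role": "system", "content": "You are a support agent. Now: 2026-05-24T10:34"}
right = {"role": "system", "content": "You are a support agent. Now: 2026-05-24T10:35"}
print(diagnose(left, right))  # timestamp
```

Falling back to "unknown" keeps the classifier quiet when none of the patterns apply.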

Pairwise diff when you want the receipts

When the diagnostic line is not enough, the per-message diff gives you a unified diff plus the first character index where the two strings disagree:

from prompt_pillar import diff_messages

d = diff_messages(run_1[0], run_2[0])
print(d.first_char_diverge)        # 37
print(d.content_unified_diff)      # standard difflib unified diff

Paste the unified diff into a PR and the reviewer sees the exact byte that is costing money.
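If you want the same two facts without the library, a standard-library sketch could look like this. The helper names first_char_diverge and unified are hypothetical, the index is assumed to be 0-based, and the example strings are made up.

```python
import difflib
from typing import Optional


def first_char_diverge(a: str, b: str) -> Optional[int]:
    """Return the 0-based index of the first differing character, or None."""
    for index, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return index
    if len(a) != len(b):
        return min(len(a), len(b))  # one string is a strict prefix of the other
    return None


def unified(a: str, b: str) -> str:
    """Build a unified diff of two message bodies with difflib."""
    lines = difflib.unified_diff(
        a.splitlines(), b.splitlines(),
        fromfile="run_a", tofile="run_b", lineterm="",
    )
    return "\n".join(lines)


old = "policy v1 id=abc"
new = "policy v1 id=abd"
print(first_char_diverge(old, new))  # 15
print(unified(old, new))
```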

Where it sits

prompt-pillar is one piece of a small stack:

  • cachebench measures the hit ratio
  • prompt-cache-warmer pre-warms the breakpoint
  • llm-message-hash gives you the canonical hash if you need it elsewhere
  • prompt-pillar answers the one question those tools cannot: why is the ratio low

cachebench tells me the number is bad. prompt-pillar tells me which message is busting it. I fix the message. prompt-cache-warmer warms the breakpoint so the next user call hits.

Try it on your own logs

If you have an agent that records the message list it sent to the model, dump three real runs into a JSONL file, one run per line. Then:

pip install prompt-pillar
python -m prompt_pillar runs.jsonl 1024

If the stable prefix is smaller than you expected, the reason line will tell you what to look at first. In my case it was a timestamp. I have seen per-request UUID markers do the same thing. I have seen a tool-list header that reordered tools on every call. The pattern is always the same. A small thing in a hot path makes a big number look wrong.
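For the capture step, here is a minimal sketch of writing runs to JSONL, one run per line. It assumes each line is a JSON array of message dicts. The exact schema the CLI expects is not spelled out here, so check the repo. The helpers dump_runs and load_runs are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def dump_runs(runs: list, path: Path) -> None:
    """Write one run (a list of message dicts) per line."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for run in runs:
            handle.write(json.dumps(run, ensure_ascii=False) + "\n")


def load_runs(path: Path) -> list:
    """Read the runs back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


captured = [
    [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "I want a refund for order 12345"},
    ],
    [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "Where is my order"},
    ],
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "runs.jsonl"
    dump_runs(captured, target)
    line_count = len(target.read_text(encoding="utf-8").splitlines())
    loaded = load_runs(target)

print(line_count, loaded == captured)
```

Capture the messages exactly as they were sent to the model. Re-rendering them from templates later would hide the very mutation you are hunting.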

Save your future self the bill.
