Mukunda Rao Katta

Posted on May 25

prompt-replay: Record LLM Outputs Today, Replay Against a New Model Tomorrow

#hermeschallenge #ai #python #agents

You upgraded the model. Claude 3.5 Sonnet to Claude Sonnet 4. The new model is faster, cheaper, smarter. You deploy it on a Friday afternoon.

Monday morning a user files a bug. The agent that summarizes their weekly report is now outputting JSON with different field names. The downstream parser breaks silently. They only noticed because a dashboard went blank.

You had no test that covered this. You had no way to know the output format shifted. The model did not error. It just changed its mind about field naming.

That is the problem prompt-replay solves.

The shape of the fix

The library has two main pieces: a Recorder and a Replayer. You wrap your LLM calls with the @capture decorator during a recording session, then replay those recorded prompts against the new model and diff the results.

from prompt_replay import Recorder, Replayer, capture
import anthropic

client = anthropic.Anthropic()

# Step 1: record a session
recorder = Recorder(session_path="sessions/weekly_summary.jsonl")

@capture(recorder)
def call_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Run your normal workflow. @capture records every call.
with recorder:
    result = call_llm("Summarize this week's activity: ...")
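
To make the recording step concrete, here is a minimal sketch of what a capture-style decorator could do. It is not prompt-replay's implementation: record_calls, load_session, and the {"prompt", "response"} record shape are assumptions made for illustration, and the real session file format may differ.

```python
import functools
import json
import os
import tempfile


def record_calls(path):
    """Hypothetical stand-in for a capture decorator: logs prompt/response pairs as JSONL."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt):
            response = fn(prompt)
            # One JSON object per line, so a session can be appended to call by call.
            with open(path, "a", encoding="utf-8") as f:
                f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
            return response
        return wrapper
    return decorator


def load_session(path):
    """Read the recorded pairs back in call order."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with a fake model so the sketch runs without an API key.
session_file = os.path.join(tempfile.mkdtemp(), "demo.jsonl")


@record_calls(session_file)
def fake_llm(prompt):
    return prompt.upper()


fake_llm("summarize week 1")
fake_llm("summarize week 2")
records = load_session(session_file)
```

The point of the append-only JSONL shape is that a replay step only needs to read the file line by line and feed each recorded prompt to a different callable.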

After you have a session recorded, swap the model and replay:

# Step 2: replay with a new model config
def new_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # new model
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

replayer = Replayer(
    session_path="sessions/weekly_summary.jsonl",
    fn=new_llm,
    diff_mode="json_diff"   # or "exact" or "semantic"
)

report = replayer.run()

for diff in report.diffs:
    print(diff.summary())

Three diff modes:

exact: plain string equality. Useful for prompts that must return a fixed string.
json_diff: parses both responses as JSON and diffs the structure. Useful for agents that output structured data.
semantic: cosine distance between embeddings. You bring your own embedder. Useful for natural language summaries where word-for-word equality is too strict.
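
For intuition about what a structural diff catches, here is a minimal sketch that compares the key paths of two JSON responses. It is an illustration only: key_paths and structural_diff are hypothetical helpers, and the library's json_diff mode may report differences in another form.

```python
import json


def key_paths(value, prefix=""):
    """Collect dotted key paths for every field in a parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    elif isinstance(value, list):
        for child in value:
            # All list items share one "[]" segment, so item order is ignored.
            paths |= key_paths(child, f"{prefix}[]")
    return paths


def structural_diff(recorded, replayed):
    """Return the fields that disappeared and the fields that appeared."""
    old = key_paths(json.loads(recorded))
    new = key_paths(json.loads(replayed))
    return {"removed": sorted(old - new), "added": sorted(new - old)}


recorded = '{"summary": "Quiet week", "items": [{"title": "Deploy", "count": 3}]}'
replayed = '{"summary_text": "Quiet week", "items": [{"name": "Deploy", "count": 3}]}'
result = structural_diff(recorded, replayed)
print(result)
```

Both responses carry the same values, so a human skimming them would call them equivalent. The renamed fields are exactly the kind of change that breaks a downstream parser without raising an error.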

The semantic mode accepts any callable that takes a string and returns a list of floats:

replayer = Replayer(
    session_path="sessions/weekly_summary.jsonl",
    fn=new_llm,
    diff_mode="semantic",
    embedder=my_embed_fn,   # any function: str -> list[float]
    threshold=0.05          # flag if cosine distance exceeds this
)
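
For reference, cosine distance is one minus the cosine similarity of the two embedding vectors. The sketch below uses a toy embedder so it runs offline. toy_embed and VOCAB are invented for this example (a real embedder would call an embedding model), and the comparison against 0.05 only mirrors the "flag if exceeded" comment above, not the library's internal logic.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero embedding as maximally different
    return 1.0 - dot / (norm_a * norm_b)


VOCAB = ["deploy", "bug", "report", "weekly", "dashboard"]


def toy_embed(text):
    """Toy bag-of-words embedder (assumption for the demo): str -> list[float]."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


recorded = "weekly report covers one deploy and one bug"
replayed = "weekly report covers one bug and one deploy"
unrelated = "dashboard dashboard dashboard"

same_meaning = cosine_distance(toy_embed(recorded), toy_embed(replayed))
different = cosine_distance(toy_embed(recorded), toy_embed(unrelated))

threshold = 0.05
print(same_meaning > threshold, different > threshold)
```

The reordered sentence would fail an exact comparison but sits at distance zero here, which is the case semantic mode is meant to tolerate.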

What it does NOT do

It does not re-execute tool calls. If your agent called get_weather("London") during the recorded session, replay does not call that tool again. It replays the prompt and compares the new LLM response to the recorded one. More on why below.
It does not mock your LLM client. You bring a real callable. This keeps the library provider-agnostic and avoids any SDK coupling.
It does not assert automatically. replayer.run() returns a report. You decide what to do with it: print it, write it to a file, raise if any diffs exceed a threshold.
It does not manage prompt versioning. It records what you sent. If you want to version your system prompts separately, pair it with prompt-template-version.
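
Because the report is just data, a CI gate can be a few lines of your own code. The sketch below relies only on what the replay example shows, namely that diffs can be iterated and each one has summary(). The FakeDiff class and its changed attribute are hypothetical placeholders, so adapt the predicate to whatever fields the real diff objects expose.

```python
from dataclasses import dataclass


@dataclass
class FakeDiff:
    """Hypothetical diff object for the demo; the real class may differ."""
    prompt: str
    changed: bool

    def summary(self):
        status = "CHANGED" if self.changed else "ok"
        return f"[{status}] {self.prompt}"


def fail_on_regressions(diffs, is_regression):
    """Raise AssertionError listing every diff the predicate flags."""
    flagged = [d.summary() for d in diffs if is_regression(d)]
    if flagged:
        raise AssertionError("Model output changed:\n" + "\n".join(flagged))


diffs = [
    FakeDiff("Summarize this week's activity", changed=True),
    FakeDiff("List open tickets", changed=False),
]

try:
    fail_on_regressions(diffs, is_regression=lambda d: d.changed)
    outcome = "passed"
except AssertionError as exc:
    outcome = str(exc)

print(outcome)
```

Passing the predicate in keeps the policy in your hands: one suite might fail on any change, another only on changes to prompts that feed a parser.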

Inside the lib: why tool calls are out of scope

This was a deliberate design choice and the one I spent the most time thinking about.

When you record a session, tool calls are part of the conversation history. They show up in the recorded messages. During replay, those tool call results are present as context, same as they were during the original run. The Replayer sends the same conversation history to the new model and waits for a response.

It does not re-invoke the tools.

The reason is simple: re-executing tools against production systems during a test replay is dangerous. send_email, create_ticket, update_database, charge_card. These are not safe to call twice. A replay harness that automatically re-fires side-effectful tools would cause real damage.

So the contract is: tool results are part of the frozen history. The model responds to that history. You are testing whether the model's response changed, not whether the tools still work.

If your workflow is heavily tool-driven and the interesting output is the tool call sequence rather than the final text, this library is not the right fit. If the interesting output is the model's final response, given the same context and tool results, this is exactly right.

# Internally, the Replayer reconstructs the conversation like this:
# [system, user_msg, tool_use, tool_result, user_msg, ...]
# It sends that full history to the new model and captures the response.
# The tool_result entries are from the original session. No tools are called.
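
As a concrete picture of that frozen history, here is a minimal sketch. The message layout follows the Anthropic Messages API shape for tool_use and tool_result blocks. The replay_turn helper, the create_ticket tool, and the fake model are assumptions for illustration and are not the Replayer's actual internals.

```python
import copy

calls_to_tools = []


def create_ticket(title):
    """A side-effectful tool. Replay must never reach this."""
    calls_to_tools.append(title)
    return {"ticket_id": "T-1"}


# Recorded history: the tool result is stored data, not a live call.
recorded_history = [
    {"role": "user", "content": "File a ticket for the blank dashboard."},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "create_ticket",
         "input": {"title": "Dashboard blank"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": '{"ticket_id": "T-1"}'},
    ]},
]


def replay_turn(history, model_fn):
    """Send the frozen history to the new model and return its response."""
    # Deep copy so the model callable cannot mutate the recording.
    return model_fn(copy.deepcopy(history))


def fake_new_model(messages):
    """Stand-in for a real LLM call; echoes the last tool result it was given."""
    last_block = messages[-1]["content"][0]
    return f"Ticket filed. Tool said: {last_block['content']}"


new_response = replay_turn(recorded_history, fake_new_model)
print(new_response)
print(calls_to_tools)
```

The tool function exists in the module but is never invoked during replay, so no second ticket is created. Only the model's reply to the same context is new.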

When this is useful

You are upgrading a model version and want to know which agent outputs changed before deploying.
You have a prompt you are about to rewrite and want a baseline to compare the new version against.
You are adding a new instruction to your system prompt and want to see if it shifts downstream JSON structure.
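You want to run regression checks in CI without re-running the whole workflow on every PR. Record once, then replay only the model calls: no tools fire, though each replayed prompt is still one real call to the model you pass in.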

When this is NOT what you want

Your agent's correctness depends entirely on the tool call sequence, not the final text response. The library compares text responses, not action sequences.
You need to test a multi-turn conversation where each user turn depends on what the assistant said last. Replay sends the recorded history verbatim; it does not simulate a live back-and-forth.
You want full integration testing with real tools firing. Use a staging environment for that. This library is for diffing model outputs, not validating tool execution.

Install

pip install prompt-replay

No dependencies. No LLM SDK bundled. Bring your own client.

GitHub: MukundaKatta/prompt-replay

50 tests, all passing.

Sibling libraries

Lib	Boundary	Repo
agentsnap	Snapshot tests for agent tool-call traces	MukundaKatta/agentsnap
cachebench	Prompt-cache hit ratio observability	MukundaKatta/cachebench
agent-decision-log	WHY-layer: record which option was chosen and why	MukundaKatta/agent-decision-log
agenttrace	Cost and latency tracking per agent run	MukundaKatta/agenttrace

agentsnap is the closest sibling. It snapshots tool call sequences. prompt-replay snapshots the text the model produces. They cover different parts of the agent output.

What is next

A few things are on the list:

A CLI for quick one-off replays without writing Python: prompt-replay replay sessions/foo.jsonl --fn my_fn.py --diff json_diff
A threshold-based assert helper so CI can fail automatically: report.assert_no_regressions(threshold=0.1)
Support for structured output comparisons using Pydantic models, not just raw JSON diffs

The core loop (record, replay, diff) is stable. The 50 tests cover all three diff modes, edge cases in the @capture decorator, and the conversation history reconstruction logic.

Built for the Hermes Agent Challenge. Part of a series of small libraries for production agent infrastructure.
