Mukunda Rao Katta

Posted on May 25

prompt-replay: Record and Replay LLM Prompts Across Models

#hermeschallenge #ai #python #agents

You want to upgrade your model. You have no idea what will break.

You have been running claude-sonnet-3 in production for six months. A new model is available. Everyone says it is better. You want to upgrade.

But you have no baseline. You do not know how your current prompts behave. You cannot predict which responses will change in ways that matter. You upgrade, something breaks, a user reports it three days later, and you do not know if the regression was always there or if the new model introduced it.

The standard answer is evaluations. But evaluations take time to design. They require ground-truth labels. They assume you know in advance what to measure.

prompt-replay takes a different approach. Record what your current model actually says. Then replay the same prompts against the new model. Diff the results. You get concrete evidence of what changed before you ship anything.

The shape of the fix

Recording a session takes one decorator or one context manager. Here is the context manager form:

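import anthropic
from prompt_replay import Recorder

# The Anthropic client reads ANTHROPIC_API_KEY from the environment.
anthropic_client = anthropic.Anthropic()
recorder = Recorder("baseline.jsonl")

# Calls made inside the block are captured to baseline.jsonl.
with recorder:
    response = anthropic_client.messages.create(
        model="claude-sonnet-3",
        max_tokens=1024,  # required by the Anthropic Messages API
        messages=[{"role": "user", "content": "Summarize this contract."}],
    )
    print(response.content[0].text)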

Every prompt and response pair goes into baseline.jsonl. The file stores the full request payload and the full response. You can record thousands of real production calls this way.

Replaying against a new model:

from prompt_replay import replay

results = replay(
    baseline="baseline.jsonl",
    model="claude-sonnet-4-6",
    client=anthropic_client,
    diff_mode="json_diff",
)

for r in results:
    print(r.prompt_id, r.diff_summary)

The json_diff mode compares structured fields in JSON responses. The exact mode does byte-for-byte comparison. The semantic mode calls a secondary LLM to judge whether two responses are semantically equivalent. You pick the mode that fits your response type.

A single replay result looks like this:

ReplayResult(
    prompt_id="abc123",
    original_model="claude-sonnet-3",
    replay_model="claude-sonnet-4-6",
    diff_mode="json_diff",
    diff_summary={"changed_keys": ["confidence_score"], "added_keys": [], "removed_keys": []},
    matched=False,
)

You can filter to only the prompts where matched=False and review just those. As an illustration, suppose 3-5% of responses change in a measurable way after an upgrade. That 3-5% is what needs review before you ship.

What it does NOT do

prompt-replay does not capture your production traffic automatically. You add the recorder where you want to capture calls. It does not monkey-patch LLM clients globally. You stay in control of what gets recorded.

It does not run replays in parallel by default. Replaying 10,000 prompts takes time. There is a concurrency parameter for async clients, but rate limits are your problem to manage.
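If you drive concurrency yourself, one common way to stay under a rate limit is to cap the number of in-flight requests with asyncio.Semaphore. The sketch below is independent of prompt-replay: call_model is a hypothetical stand-in for your async client call, and max_in_flight is an illustrative name, not the library's concurrency parameter.

```python
import asyncio


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an async LLM call; replace with your client."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"echo: {prompt}"


async def replay_bounded(prompts: list[str], max_in_flight: int = 4) -> list[str]:
    """Run every prompt with at most max_in_flight requests active at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:  # waits here while the cap is reached
            return await call_model(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(one(p) for p in prompts))
```

A cap on in-flight requests does not by itself enforce a requests-per-minute limit. If your provider returns rate-limit errors, you would still need retry with backoff around call_model.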

It does not make semantic diffs free. That mode calls a secondary LLM to judge equivalence, which costs tokens and money. For large baselines, json_diff or exact are cheaper starting points. Use semantic diff for the cases where the other modes are not precise enough.

It does not store credentials or config. It does store full request payloads, and those include your prompt text. Do not record prompts that contain secrets or PII unless you scrub them first. Pair it with llm-pii-redact or a custom redact callback if you need to sanitize before saving.
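A custom scrubber can be as small as a couple of regular expressions. This is a minimal sketch, not the llm-pii-redact API: the patterns, the scrub and scrub_payload names, and the assumption that the payload has a messages list with string content are all illustrative. Real PII detection needs far more than two patterns.

```python
import re

# Illustrative patterns only; they will miss many real-world cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9_-]{8,}\b")


def scrub(text: str) -> str:
    """Replace email addresses and sk-style keys with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[SECRET]", text)


def scrub_payload(payload: dict) -> dict:
    """Return a copy of a request payload with string message contents scrubbed.

    Assumes a chat-style payload: {"messages": [{"role": ..., "content": ...}]}.
    """
    cleaned = dict(payload)
    cleaned["messages"] = [
        {**m, "content": scrub(m["content"])} if isinstance(m.get("content"), str) else m
        for m in payload.get("messages", [])
    ]
    return cleaned
```

How a callback like this gets passed to the recorder is not shown here, so the wiring is left to the library's own documentation.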

Inside the library: design choices

The JSONL format was chosen deliberately. Each line is a self-contained JSON object. You can stream large baselines without loading the whole file into memory. You can grep specific prompt IDs. You can append to an existing baseline without rewriting it.
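Those three properties fall out of the format with nothing but the standard library. The sketch below assumes only that each line is a JSON object with a prompt_id key; the helper names are made up for illustration and are not part of prompt-replay.

```python
import json
from pathlib import Path
from typing import Iterator


def append_entry(path: Path, entry: dict) -> None:
    """Append one record; the existing file contents are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def iter_entries(path: Path) -> Iterator[dict]:
    """Yield records one at a time, so memory use stays flat for large files."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


def find_by_id(path: Path, prompt_id: str) -> Iterator[dict]:
    """Lazily select the records with a given prompt_id."""
    return (e for e in iter_entries(path) if e.get("prompt_id") == prompt_id)
```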

Each recorded entry has a prompt_id field generated from a hash of the request payload. If you record the same prompt twice with the same parameters, they get the same ID. This lets you deduplicate before replaying.
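One way to get a stable ID like this is to hash a canonical serialization of the payload. The sketch below is an assumption about how such an ID could be built, not the library's actual scheme: SHA-256, the 12-character truncation, and the "request" field name are all illustrative choices.

```python
import hashlib
import json
from typing import Iterable, Iterator


def prompt_id(payload: dict, length: int = 12) -> str:
    """Hash a canonical JSON form so key order does not change the ID."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def dedupe(entries: Iterable[dict]) -> Iterator[dict]:
    """Keep the first entry per request payload (assumes a 'request' field)."""
    seen: set[str] = set()
    for entry in entries:
        pid = prompt_id(entry["request"])
        if pid not in seen:
            seen.add(pid)
            yield entry
```

Sorting keys before hashing matters: two payloads that differ only in dictionary order would otherwise get different IDs and slip past deduplication.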

The three diff modes are pluggable. The DiffMode protocol defines a single method: diff(original: str, replayed: str) -> DiffSummary. You can pass a custom diff mode if the built-in ones do not fit your response format.
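As a sketch of what a custom mode could look like, here is a whitespace-insensitive comparison. The DiffMode and DiffSummary definitions below are local stand-ins written from the signature described above; the fields on the real DiffSummary are not documented in this post, so matched and details are assumptions.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class DiffSummary:
    """Assumed shape; the library's real DiffSummary may differ."""
    matched: bool
    details: dict = field(default_factory=dict)


class DiffMode(Protocol):
    """Local restatement of the single-method protocol."""
    def diff(self, original: str, replayed: str) -> DiffSummary: ...


class WhitespaceInsensitiveDiff:
    """Treat responses as equal when they differ only in whitespace runs."""

    def diff(self, original: str, replayed: str) -> DiffSummary:
        a = " ".join(original.split())
        b = " ".join(replayed.split())
        return DiffSummary(
            matched=(a == b),
            details={"normalized_original": a, "normalized_replayed": b},
        )


# Structural typing: no inheritance from DiffMode is needed.
mode: DiffMode = WhitespaceInsensitiveDiff()
```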

The exact mode is the fastest. It returns matched=True if the two strings are identical and matched=False with a diff otherwise. LLMs are non-deterministic, so this mode is mainly useful for templated or structured outputs where you expect bit-for-bit reproducibility.
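The behavior described here can be approximated with the standard library's difflib. This is a sketch of the idea, not the library's implementation; exact_diff and its tuple return type are illustrative.

```python
import difflib


def exact_diff(original: str, replayed: str) -> tuple[bool, list[str]]:
    """Return (matched, unified diff lines) for a strict string comparison."""
    if original == replayed:
        return True, []
    diff = difflib.unified_diff(
        original.splitlines(),
        replayed.splitlines(),
        fromfile="original",
        tofile="replayed",
        lineterm="",
    )
    # Strings that differ only in a trailing newline produce an empty diff
    # here but still count as a mismatch.
    return False, list(diff)
```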

The json_diff mode parses both responses as JSON and computes the symmetric difference of keys and values. If either response is not valid JSON, it falls back to exact comparison.
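A simplified version of that logic, limited to top-level keys, might look like the following. It is a sketch under assumptions: the output keys mirror the diff_summary shown earlier, while the matched and fallback fields and the handling of non-object JSON are illustrative choices rather than the library's documented behavior.

```python
import json


def json_diff(original: str, replayed: str) -> dict:
    """Compare two JSON responses by top-level key; fall back to exact match."""
    try:
        a, b = json.loads(original), json.loads(replayed)
    except json.JSONDecodeError:
        # At least one side is not valid JSON: compare the raw strings.
        return {"matched": original == replayed, "fallback": "exact"}

    if not (isinstance(a, dict) and isinstance(b, dict)):
        # Arrays, numbers, and strings have no keys; compare parsed values.
        return {"matched": a == b, "fallback": "value"}

    added = sorted(b.keys() - a.keys())
    removed = sorted(a.keys() - b.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return {
        "matched": not (added or removed or changed),
        "changed_keys": changed,
        "added_keys": added,
        "removed_keys": removed,
    }
```

Because both sides are parsed before comparison, differences in key order or whitespace do not register as changes, which is the main advantage over exact mode for structured output.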

The semantic mode wraps a configurable judge prompt around the two responses and calls a secondary LLM. The judge returns a score from 0 to 1. You set the threshold for what counts as a match. This is the most expensive mode but the most useful for free-text responses.
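The judge-and-threshold pattern can be sketched without committing to any provider by passing the judge in as a callable. Everything named here is hypothetical: the prompt wording, the semantic_match function, the 0.8 default, and the choice to treat an unparseable judge reply as a mismatch.

```python
from typing import Callable

# Illustrative judge prompt; the library's default wording is not shown here.
JUDGE_PROMPT = (
    "Rate from 0 to 1 how semantically equivalent these two responses are. "
    "Reply with the number only.\n\n"
    "Response A:\n{original}\n\nResponse B:\n{replayed}\n"
)


def semantic_match(
    original: str,
    replayed: str,
    judge: Callable[[str], str],
    threshold: float = 0.8,
) -> tuple[bool, float]:
    """Ask a judge model for a 0-1 score and compare it to the threshold."""
    raw = judge(JUDGE_PROMPT.format(original=original, replayed=replayed))
    try:
        score = float(raw.strip())
    except ValueError:
        # The judge did not return a number: flag for review rather than pass.
        return False, 0.0
    score = min(1.0, max(0.0, score))  # clamp out-of-range replies
    return score >= threshold, score
```

In practice judge would wrap a call to your secondary LLM. For testing, a plain function that returns a fixed string is enough, which keeps the threshold logic checkable without spending tokens.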

The 50 tests cover all three diff modes, the async replay path, and edge cases such as empty responses and baseline files with duplicate IDs.

When this is useful, and when it is not

This is useful when:

You are upgrading a model and want a pre-flight check before touching production traffic.
You are comparing providers (Anthropic vs OpenAI) for a specific task and want structured evidence.
You want to catch prompt regressions after editing a system prompt without running a full eval suite.
You are doing A/B testing between model versions and want a diff report rather than a binary pass/fail.

This is not the right tool when:

You need ground-truth evaluation. Replay tells you what changed, not what is correct.
Your prompts are highly non-deterministic and every run produces a different result anyway. Replay diffs will be noisy.
You want to compare latency or cost rather than response content. The library focuses on output diffs.
You are in a compliance context where storing prompt/response pairs on disk is not allowed.

Install

Install from GitHub:

pip install git+https://github.com/MukundaKatta/prompt-replay

Quick start:

from prompt_replay import Recorder, replay

# Record (my_llm_call and prompt are placeholders for your own call and input)
with Recorder("baseline.jsonl") as rec:
    result = my_llm_call(prompt)
    rec.record(prompt=prompt, response=result)

# Replay
results = replay(
    baseline="baseline.jsonl",
    model="claude-sonnet-4-6",
    client=client,
    diff_mode="json_diff",
)

mismatches = [r for r in results if not r.matched]
print(f"{len(mismatches)} of {len(results)} prompts changed")

Siblings in this series

Library	What it does
`llm-pii-redact`	Scrub PII from prompts before recording them
`prompt-template-version`	Pin prompt templates by version before recording
`conversation-codec`	Persist full conversation context as JSONL
`llm-fixture-replay`	VCR-style record/replay for unit tests
`prompt-eval-rubric`	Score replayed responses against rubric criteria

Use llm-pii-redact upstream of the recorder in any pipeline that handles user data.

What is next

The library is on GitHub now. PyPI release follows once the upload queue clears.

Planned additions:

A CLI tool (prompt-replay diff baseline.jsonl --model claude-sonnet-4-6) so you can run replays without writing Python.
An HTML report generator that shows side-by-side diffs for all mismatched prompts.
A filter for recording only prompts above a certain token count, so you focus baseline coverage on the calls that matter most.
Integration with the driftvane library for drift detection over rolling time windows.

The core use case stays simple: record, replay, diff. Everything else is a wrapper around that loop.

If you are upgrading a model without this kind of evidence, you are flying blind. The evidence does not have to be perfect to be useful. Even knowing that 3% of your prompts produce different JSON keys is enough to decide whether to ship or hold.

Part of the Hermes Agent Challenge series. All libraries are on GitHub under MukundaKatta.
