Replay Every LLM Prompt Against a New Model Before You Migrate

#hermeschallenge #ai #python #agents

The team wanted to switch from claude-3-5-sonnet to claude-sonnet-4-6. The new model was faster and cheaper. The plan: update one env variable, deploy on Friday, monitor over the weekend.

Monday morning: customer support tickets about "weird responses." Nobody had tested what the new model actually returned on the existing prompt library. They assumed it would be the same. It was not.

The right approach: record real prompts and responses against the current model, replay those prompts against the candidate model, and diff the results before you touch production.

The Shape of the Fix

from prompt_replay import PromptRecorder, PromptReplayer, DiffMode

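# load_prompts, call_anthropic, and call_anthropic_new_model are your own
# helpers, not part of prompt-replay. Each call_* takes a prompt and returns
# the response text.

# Phase 1: Record responses from the current model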
recorder = PromptRecorder(store_path="./prompt-log.jsonl")
with recorder.session("production-run-001"):
    for prompt in load_prompts():
        response = call_anthropic(prompt)
        recorder.record(prompt=prompt, response=response)

# Phase 2: Replay against new model
replayer = PromptReplayer(store_path="./prompt-log.jsonl")
report = replayer.replay(
    fn=lambda p: call_anthropic_new_model(p),
    diff_mode=DiffMode.JSON_DIFF,
)

for item in report.differences:
    print(f"Prompt: {item.prompt[:60]}...")
    print(f"Diff: {item.diff}")

Record against current. Replay against candidate. Get a structured diff report. Migrate with evidence.

What It Does NOT Do

prompt-replay does not evaluate which response is better. It shows you what changed. Whether the change is an improvement or a regression is a judgment call that requires domain knowledge.

It does not handle streaming responses. If your provider returns a stream, collect the full text before passing it to recorder.record().
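
The collection step can be as small as a join. The following is a minimal sketch, assuming your provider's stream yields text chunks as plain strings; `collect_stream` and `fake_stream` are hypothetical names used for illustration and are not part of prompt-replay.

```python
from typing import Iterable, Iterator


def collect_stream(chunks: Iterable[str]) -> str:
    """Drain a stream of text chunks and return the full response text."""
    return "".join(chunks)


def fake_stream() -> Iterator[str]:
    """Stand-in for a provider stream; real SDKs expose their own iterator."""
    yield from ["The refund ", "was issued ", "on Monday."]


full_text = collect_stream(fake_stream())
# Record full_text as the response, not the stream object itself.
```

If your SDK yields event objects rather than strings, extract the text field from each event before joining.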

The semantic diff mode uses token overlap as a proxy for semantic similarity. It is not embedding-based. Two responses with very similar meaning but different wording may get a low similarity score and be flagged as different.
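
To see why paraphrases trip a token-overlap metric, here is a small sketch of Jaccard similarity over token sets. It is an illustration, not the library's code: the `tokenize` and `jaccard` helpers are hypothetical, the regex is one plausible reading of "split on whitespace and punctuation", and the lowercasing step is an assumption.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase (an assumption) and split on runs of non-word characters."""
    return {tok for tok in re.split(r"\W+", text.lower()) if tok}


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # treat two empty responses as identical
    return len(ta & tb) / len(ta | tb)


old = "Your order has shipped and will arrive on Friday."
new = "The package is on its way; expect delivery Friday."

# Only "on" and "friday" are shared, out of 16 distinct tokens: 2 / 16.
score = jaccard(old, new)
```

Both sentences tell the customer the same thing, yet the score is 0.125, far below any reasonable threshold. Flags like this are where human review matters.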

Inside the Library

Three diff modes:

EXACT: byte-for-byte string comparison. Useful for deterministic prompts where you expect identical output.
JSON_DIFF: parses both responses as JSON, produces a structural diff. Useful for tool-use outputs and structured prompts.
SEMANTIC: Jaccard similarity on token sets (split on whitespace+punctuation). A score of 1.0 means the two token sets are identical; word order and repetition are ignored. A score below your threshold triggers a difference flag.

class DiffMode(Enum):
    EXACT = "exact"
    JSON_DIFF = "json_diff"
    SEMANTIC = "semantic"
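
For a feel of what a structural diff reports that a string comparison cannot, here is a rough sketch of a recursive JSON comparison. The `json_diff` function, its path notation, and its output strings are assumptions made for illustration; the library's actual diff format may look different.

```python
import json
from typing import Any


def json_diff(old: Any, new: Any, path: str = "$") -> list[str]:
    """Return one line per structural difference between two parsed JSON values."""
    if isinstance(old, dict) and isinstance(new, dict):
        out: list[str] = []
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}.{key}"
            if key not in new:
                out.append(f"{child}: removed")
            elif key not in old:
                out.append(f"{child}: added")
            else:
                out.extend(json_diff(old[key], new[key], child))
        return out
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        out = []
        for i, (a, b) in enumerate(zip(old, new)):
            out.extend(json_diff(a, b, f"{path}[{i}]"))
        return out
    # Scalars, mismatched types, and lists of different length are compared whole.
    if old != new:
        return [f"{path}: {old!r} -> {new!r}"]
    return []


old_resp = '{"tool": "lookup_order", "args": {"id": 42}}'
new_resp = '{"tool": "lookup_order", "args": {"id": "42"}, "note": "ok"}'
changes = json_diff(json.loads(old_resp), json.loads(new_resp))
# changes == ["$.args.id: 42 -> '42'", "$.note: added"]
```

The integer-to-string change on `id` is the kind of drift that breaks a downstream tool call while looking harmless in a text diff.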

The JSONL format stores one record per line: {"session": "...", "prompt": "...", "response": "...", "ts": ...}. The file can be inspected with any JSON tool. No binary format, no database.
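
Because the log is plain JSONL, the standard library is enough to inspect it. The sketch below writes two made-up records in the field layout described above and reads them back; the numeric `ts` values and the temporary file path are assumptions for the example.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Sample records; the values are invented, and ts as epoch seconds is an assumption.
records = [
    {"session": "production-run-001", "prompt": "Summarize ticket 1",
     "response": "Customer asks for a refund.", "ts": 1700000000},
    {"session": "production-run-002", "prompt": "Summarize ticket 2",
     "response": "Customer reports a login bug.", "ts": 1700000060},
]

log_path = Path(tempfile.mkdtemp()) / "prompt-log.jsonl"
with log_path.open("w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")  # one JSON object per line

# Reading back: one json.loads call per non-empty line.
with log_path.open(encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]

per_session = Counter(rec["session"] for rec in loaded)
```

The same loop works for filtering a corpus by session or trimming it before a replay.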

The 50 tests cover all three diff modes, session isolation, round-trip record/replay, the threshold parameter for semantic mode, and malformed-JSON handling in JSON_DIFF mode.

When to Use It

Use it before any model migration where prompt behavior matters. This is almost always. Even "drop-in replacement" model upgrades can change response style, JSON formatting, and tool call behavior.

Use it before major prompt template changes. Record the current behavior as a baseline, change the template, replay, inspect diffs.

Skip it for pure generation tasks where output varies by design. If your agent writes creative copy and no two outputs should be identical, diffing makes no sense. This library is for agents where response consistency is a feature.

Install

pip install git+https://github.com/MukundaKatta/prompt-replay

from prompt_replay import PromptRecorder, PromptReplayer, DiffMode, ReplayReport

# CI integration example (new_model_call is your own wrapper around the candidate model)
def test_model_migration():
    replayer = PromptReplayer(store_path="./test-fixtures/prompts.jsonl")
    report: ReplayReport = replayer.replay(
        fn=lambda p: new_model_call(p),
        diff_mode=DiffMode.SEMANTIC,
        semantic_threshold=0.85,  # flag if similarity drops below 0.85
    )

    assert report.total == 50  # every prompt in the fixture file was replayed
    assert report.different == 0, (
        f"{report.different} prompts changed behavior:\n"
        + "\n".join(f"  {d.prompt[:40]!r}: {d.diff}" for d in report.differences[:3])
    )

Sibling Libraries

Library	What it solves
`agentsnap`	Snapshot full agent responses for regression testing
`llm-fixture-replay`	VCR-style record/replay for unit tests
`prompt-eval-rubric`	Weighted rubric scoring for freeform responses
`llm-output-validator`	Rule-based validation of LLM output shape
`agenttap`	Wire-level capture of LLM requests and responses

The migration workflow: agenttap captures real production traffic for a week, prompt-replay replays that corpus against the candidate model, prompt-eval-rubric scores quality on flagged items. Together they give you a data-driven migration decision.

What's Next

Embedding-based semantic diff would be a meaningful upgrade over the Jaccard token approach. The tradeoff: it needs an embedding model call per comparison, which adds cost and a dependency. The current zero-dep approach is intentional.

A --sample flag to replay a random subset of a large corpus would help for large prompt libraries where full replay is too slow. Statistical sampling with a configurable confidence interval would give you a practical migration gate.
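
Until such a flag exists, you can sample the corpus yourself before replaying. The sketch below is one way to do it, not a planned API: `required_sample_size` and `sample_prompts` are hypothetical helpers. The sample size uses Cochran's formula with a finite-population correction, assuming the worst-case proportion of 0.5.

```python
import math
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def required_sample_size(population: int, margin: float = 0.05, z: float = 1.96) -> int:
    """Cochran's formula with finite-population correction, worst case p = 0.5."""
    if population <= 0:
        return 0
    n0 = (z ** 2) * 0.25 / (margin ** 2)
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def sample_prompts(prompts: Sequence[T], k: int, seed: int = 0) -> list[T]:
    """Pick k prompts without replacement; the seed makes the subset reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(prompts), min(k, len(prompts)))


corpus = [f"prompt-{i}" for i in range(10_000)]
k = required_sample_size(len(corpus))  # 370 for 10,000 prompts at +/-5%, z = 1.96
subset = sample_prompts(corpus, k)
```

With these defaults, the share of changed prompts measured on the subset estimates the share in the full corpus to within about five percentage points at 95% confidence. A fixed seed keeps the subset stable between runs, so two candidate models are compared on the same prompts.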

Async replay for parallel execution is the performance gap. Right now replays are sequential. An async_replay() that fires N concurrent calls would cut replay time by roughly a factor of N, until provider rate limits become the bottleneck.
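
A bounded-concurrency replay might look roughly like the sketch below. It is a guess at the shape, not the library's design: `replay_concurrently`, `fake_model`, and the `max_concurrency` parameter are hypothetical, and a real version would also need to feed results into the diff step.

```python
import asyncio
from typing import Awaitable, Callable


async def replay_concurrently(
    prompts: list[str],
    call_model: Callable[[str], Awaitable[str]],
    max_concurrency: int = 8,
) -> list[str]:
    """Call the model for every prompt, at most max_concurrency at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with gate:  # wait here while max_concurrency calls are in flight
            return await call_model(prompt)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_model(prompt: str) -> str:
    """Stand-in for a real async client call."""
    await asyncio.sleep(0.01)
    return prompt.upper()


results = asyncio.run(
    replay_concurrently(["a", "b", "c"], fake_model, max_concurrency=2)
)
```

The semaphore is the important part: it caps the number of in-flight requests so a large corpus does not immediately run into rate limits.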

Built as part of the agent-stack family: composable Python primitives for production LLM agents.