Mukunda Rao Katta

Posted on May 25

A/B Test Your Prompts Without a Framework

#hermeschallenge #ai #python #agents

You change a prompt. Now what?

You tweak a system prompt. Maybe you tightened the instructions, cut some filler, or changed the persona. You think it's better. But you're comparing your memory of yesterday's outputs against what you see right now.

That's not a test. That's a feeling.

The real issue: the effect of a prompt change is invisible in a diff. Your code diff shows that the prompt text changed. It says nothing about whether outputs got better or worse. You need a record of what the old prompt produced, and a way to replay those same inputs through the new prompt.

That's what prompt-replay and prompt-template-version are for. Together they give you a low-ceremony A/B loop without a testing framework, without a vendor dashboard, and without mocking your production system.

The core workflow

You run in two phases.

Phase 1: record. You run your agent or pipeline under the old prompt. prompt-replay intercepts each LLM call and writes the inputs and outputs to a JSONL fixture file. One entry per call.

Phase 2: replay. You swap the prompt. You run replay against the same fixture. prompt-replay feeds the same inputs to the new prompt, collects the new outputs, and diffs them.

Here is what the code looks like end to end.

import anthropic
from prompt_replay import Recorder, Replayer, diff

# --- Phase 1: record baseline outputs ---

recorder = Recorder(fixture_path="fixtures/summarize_v1.jsonl")

def call_llm(prompt: str, user_input: str) -> str:
    """Send one single-turn request and return the text of the first content block."""
    # The client reads ANTHROPIC_API_KEY from the environment
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=prompt,  # the prompt under test goes in as the system prompt
        messages=[{"role": "user", "content": user_input}],
    )
    return msg.content[0].text

# Wrap your LLM call with the recorder
recorded_call = recorder.wrap(call_llm)

PROMPT_V1 = "Summarize the text in 2 sentences. Be concise."

inputs = [
    "The Krebs cycle is a series of chemical reactions...",
    "Photosynthesis converts light energy into chemical energy...",
    "The mitochondria is often called the powerhouse of the cell...",
]

for user_input in inputs:
    output = recorded_call(PROMPT_V1, user_input)
    print(f"Recorded: {output[:60]}...")

recorder.save()
print(f"Saved {len(inputs)} baseline outputs to fixtures/summarize_v1.jsonl")


# --- Phase 2: replay with new prompt and diff ---

PROMPT_V2 = "Summarize the following text in exactly one sentence. Focus on the main idea only."

replayer = Replayer(fixture_path="fixtures/summarize_v1.jsonl")

results = replayer.run(
    fn=lambda user_input: call_llm(PROMPT_V2, user_input),
    input_key="user_input",
)

# Three diff modes
for entry in results:
    print("--- exact diff ---")
    print(diff(entry.baseline, entry.new, mode="exact"))

    print("--- json_diff ---")
    print(diff(entry.baseline, entry.new, mode="json_diff"))

    # semantic mode uses embedding cosine similarity
    print("--- semantic similarity ---")
    score = diff(entry.baseline, entry.new, mode="semantic")
    print(f"Similarity: {score:.3f}")

The JSONL fixture is just plain text. Each line is one recorded call. You can read it, edit it, or version it in git alongside your prompts.
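
If you want to see how little is going on in a fixture, here is a stdlib-only sketch that writes and reads a JSONL file. The field names (user_input, output) and the helpers write_fixture and read_fixture are assumptions for illustration; open a real fixture to see the schema prompt-replay actually writes.

```python
import json
import tempfile
from pathlib import Path


def write_fixture(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line (the JSONL convention)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_fixture(path: Path) -> list[dict]:
    """Parse every non-empty line as its own JSON object."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record shape; the real fixture schema may differ.
records = [
    {"user_input": "first input text", "output": "first recorded summary"},
    {"user_input": "second input text", "output": "second recorded summary"},
]

fixture_path = Path(tempfile.mkdtemp()) / "summarize_v1.jsonl"
write_fixture(fixture_path, records)
loaded = read_fixture(fixture_path)
print(f"{len(loaded)} records, first input: {loaded[0]['user_input']}")
```

Because each line parses on its own, a git diff of the fixture shows exactly which recorded calls changed.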

What this does NOT do

This is not a quality evaluator. It does not tell you which prompt is better. It tells you what changed.

It does not run statistical significance tests. You cannot conclude "v2 is better" from three examples.

It does not test for hallucination, factual accuracy, or safety. If your use case needs that, plug the outputs into prompt-eval-rubric and score them with a rubric.

It does not replace human review. The semantic diff mode gives you a similarity score. That score is a signal, not a verdict. A score of 0.85 means the outputs are similar. It does not mean the new output is correct.
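
To make that concrete, here is a toy sketch of cosine similarity. It uses word counts as a stand-in for a real embedding model. That is an assumption: this post does not show which embedding model prompt-replay uses, and toy_embed is a hypothetical helper. The two sentences disagree on the one fact that matters and still score high.

```python
import math
from collections import Counter


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


baseline = "the capital of france is paris"
new = "the capital of france is lyon"

score = cosine_similarity(toy_embed(baseline), toy_embed(new))
print(f"Similarity: {score:.3f}")  # five of six words match: 0.833
```

A real embedding model would produce a different number, but the point holds: similarity measures closeness, not correctness.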

Why these two libraries together

prompt-template-version handles the other half of the problem: tracking which prompt produced which output.

Without version tracking, you record baseline outputs but have no reliable way to know which exact prompt text generated them. Three weeks later you look at the fixture and you have no idea.

from prompt_template_version import PromptRegistry

registry = PromptRegistry()

# Register versioned prompts
v1_id = registry.register(
    name="summarize",
    version="1.0.0",
    text="Summarize the text in 2 sentences. Be concise.",
)

v2_id = registry.register(
    name="summarize",
    version="2.0.0",
    text="Summarize the following text in exactly one sentence. Focus on the main idea only.",
)

# Resolve by name + version
prompt_v1 = registry.resolve("summarize", "1.0.0")
prompt_v2 = registry.resolve("summarize", "2.0.0")

# The fixture file name can encode a short hash of the prompt
fixture_path = f"fixtures/summarize_{v1_id.short_hash}.jsonl"

Now your fixture file name includes a hash of the prompt. You can always reconstruct which prompt produced which fixture. Your A/B comparison is traceable.
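
Here is a rough sketch of the idea behind a prompt hash, using only hashlib. The helper name short_prompt_hash and the choice of SHA-256 truncated to 8 hex characters are assumptions; prompt-template-version may compute its hash differently.

```python
import hashlib


def short_prompt_hash(prompt_text: str, length: int = 8) -> str:
    """Return the first `length` hex characters of the prompt's SHA-256 digest."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return digest[:length]


PROMPT_V1 = "Summarize the text in 2 sentences. Be concise."
PROMPT_V2 = "Summarize the following text in exactly one sentence. Focus on the main idea only."

# Same text always maps to the same name; any edit produces a new name.
path_v1 = f"fixtures/summarize_{short_prompt_hash(PROMPT_V1)}.jsonl"
path_v2 = f"fixtures/summarize_{short_prompt_hash(PROMPT_V2)}.jsonl"

print(path_v1)
print(path_v2)
```

Even a one-character edit to the prompt changes the hash, so a stale fixture cannot silently pass as the baseline for a newer prompt.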

When to use this

Use it when you are making a deliberate prompt change and want a before/after comparison on real inputs.

It fits naturally into a small prompt iteration loop: record once against your real workload, then replay after every edit. You do not need a staging environment. You do not need to re-run your full pipeline. You replay in seconds.

It works well with CI. Check your fixtures into git. In CI, run the replayer against the current prompt. If any semantic similarity score drops below your threshold, fail the build.

SIMILARITY_THRESHOLD = 0.90

for entry in results:
    score = diff(entry.baseline, entry.new, mode="semantic")
    if score < SIMILARITY_THRESHOLD:
        raise ValueError(
            f"Output similarity dropped to {score:.3f} for input: {entry.input[:50]}..."
        )
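
The loop above stops at the first failure. A possible variation, sketched here, collects every regression and returns an exit code, so one CI run shows the full list. The (input, score) pairs are made-up placeholders, and find_regressions and main are hypothetical helpers, not part of prompt-replay.

```python
import sys

SIMILARITY_THRESHOLD = 0.90


def find_regressions(scored, threshold):
    """Return every (input_text, score) pair whose score is below the threshold."""
    return [(text, score) for text, score in scored if score < threshold]


def main(scored) -> int:
    """Report all regressions on stderr and return a CI-style exit code."""
    regressions = find_regressions(scored, SIMILARITY_THRESHOLD)
    for text, score in regressions:
        print(f"FAIL {score:.3f}: {text[:50]}", file=sys.stderr)
    return 1 if regressions else 0


# Placeholder scores; in practice these would come from the semantic diff.
scored = [("input a", 0.97), ("input b", 0.84), ("input c", 0.91)]
exit_code = main(scored)
print(f"exit code: {exit_code}")
# In a CI script you would finish with: sys.exit(exit_code)
```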

When NOT to use this

Do not use it as a substitute for integration tests. If your prompt calls a tool, modifies data, or routes to downstream systems, prompt-replay captures the LLM text output only. It does not replay side effects.

Do not use it for prompts that are expected to be highly variable. With creative generation, brainstorming, or anything that depends on random sampling, your baseline will not match your replay, and the diff will always look alarming. Use prompt-eval-rubric instead, with a rubric that scores on criteria rather than similarity to a baseline.

Do not record everything if you have thousands of unique inputs per day. Record a representative sample, not the full corpus. Fifty examples is usually enough to catch regressions.
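
Picking that sample can be a few lines of stdlib code. This is a sketch under simple assumptions: your logged inputs fit in a list of strings, and a uniform random sample is representative enough. sample_inputs is a hypothetical helper, not part of prompt-replay.

```python
import random


def sample_inputs(inputs, k=50, seed=0):
    """Deduplicate inputs, then draw a reproducible random sample of size k."""
    unique = list(dict.fromkeys(inputs))  # drops repeats, keeps first-seen order
    if len(unique) <= k:
        return unique
    # A fixed seed means every run records the same subset
    return random.Random(seed).sample(unique, k)


# Placeholder for a day of logged requests
logged_inputs = [f"request {i}" for i in range(5000)]

sample = sample_inputs(logged_inputs, k=50, seed=0)
print(f"Sampled {len(sample)} of {len(logged_inputs)} inputs")
```

If your traffic has distinct categories, a uniform sample can miss the rare ones. Sampling per category is a reasonable next step.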

Install and quick-start

Both libraries are on PyPI:

pip install prompt-replay prompt-template-version

Both libraries have zero runtime dependencies. The example above also needs the Anthropic SDK (pip install anthropic) and an ANTHROPIC_API_KEY set in your environment.

Quick-start in four commands:

pip install prompt-replay prompt-template-version
# Create a fixtures directory
mkdir -p fixtures
# Copy the phase 1 snippet into record.py
python record.py
# Copy the phase 2 snippet into replay.py, along with the imports and call_llm setup from phase 1
python replay.py

Sibling libraries in the agent stack

| Library | What it does |
| --- | --- |
| `prompt-replay` | Record LLM calls, replay with new prompt, diff outputs |
| `prompt-template-version` | Semver-pin and hash prompt templates |
| `cachebench` | Measure prompt-cache hit rate and latency |
| `prompt-eval-rubric` | Score outputs against criteria (0.0 to 1.0 per rubric) |
| `agentsnap` | Capture full agent run traces for comparison |
| `llm-message-hash-py` | Canonical hash of LLM request payloads |

What is next

Three improvements on the roadmap for prompt-replay:

First, structured output diffing. When your LLM returns JSON, diff at the field level, not the string level. Right now json_diff is a shallow string comparison after parsing. Field-level diffing would tell you "the summary field changed but the category field stayed the same."
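
Here is roughly what field-level diffing could look like. This is an illustrative sketch, not how prompt-replay implements json_diff today or will implement it. field_diff is a hypothetical helper, and it assumes both outputs are JSON objects and compares top-level keys only.

```python
import json


def field_diff(baseline_json: str, new_json: str) -> dict:
    """Map each differing top-level field to (kind, old_value, new_value)."""
    old = json.loads(baseline_json)
    new = json.loads(new_json)
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            changes[key] = ("removed", old[key], None)
        elif key not in old:
            changes[key] = ("added", None, new[key])
        elif old[key] != new[key]:
            changes[key] = ("changed", old[key], new[key])
    return changes


baseline = '{"summary": "Two sentences. Here.", "category": "biology"}'
new = '{"summary": "One sentence.", "category": "biology", "confidence": 0.9}'

changes = field_diff(baseline, new)
for field_name, (kind, old_value, new_value) in changes.items():
    print(f"{field_name}: {kind} ({old_value!r} -> {new_value!r})")
```

Unchanged fields such as category never appear in the result, which is the whole point: the report shows only what moved.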

Second, multi-turn recording. Right now prompt-replay records single-turn calls. Multi-turn conversations are harder because the fixture has to replay message history in the right order. That needs a session-aware recorder.
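
One way to picture a session-aware fixture: store each conversation as an ordered list of turns, and treat every assistant turn as a replay point whose input is the full history before it. The Session class and replay_points function below are hypothetical, a sketch of the data shape rather than anything prompt-replay ships.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One recorded conversation; turns are kept in the order they happened."""
    session_id: str
    turns: list = field(default_factory=list)  # each turn: {"role": ..., "content": ...}


def replay_points(session: Session):
    """Yield (history, recorded_reply) for every assistant turn in the session."""
    for i, turn in enumerate(session.turns):
        if turn["role"] == "assistant":
            yield session.turns[:i], turn["content"]


session = Session(
    session_id="demo",
    turns=[
        {"role": "user", "content": "Summarize this text."},
        {"role": "assistant", "content": "A two-sentence summary."},
        {"role": "user", "content": "Make it shorter."},
        {"role": "assistant", "content": "A shorter summary."},
    ],
)

points = list(replay_points(session))
for history, recorded_reply in points:
    print(f"{len(history)} prior turns -> {recorded_reply!r}")
```

Note the design choice hiding in here: the history contains the recorded assistant replies, not the ones the new prompt would have produced. Whether to replay against recorded or regenerated history is exactly the kind of decision a real implementation has to make.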

Third, integration with prompt-eval-rubric. The ideal loop is: record baseline, replay with new prompt, score both with a rubric, surface the delta. Right now you have to wire those manually. A first-class integration would make the loop one function call.
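
Until that integration exists, the manual wiring can be small. The sketch below scores a baseline and a new output with the same rubric and reports the delta. rubric_score is a toy stand-in with two made-up criteria, not the prompt-eval-rubric API, and compare is a hypothetical helper.

```python
def rubric_score(output: str) -> float:
    """Toy rubric: fraction of criteria met, from 0.0 to 1.0."""
    checks = [
        output.count(".") <= 1,     # crude proxy for "at most one sentence"
        len(output.split()) <= 30,  # stays under a word budget
    ]
    return sum(checks) / len(checks)


def compare(pairs):
    """Score each (baseline, new) pair and compute new minus baseline."""
    report = []
    for baseline, new in pairs:
        baseline_score = rubric_score(baseline)
        new_score = rubric_score(new)
        report.append({
            "baseline_score": baseline_score,
            "new_score": new_score,
            "delta": new_score - baseline_score,
        })
    return report


# Placeholder outputs standing in for recorded and replayed text
pairs = [("First point. Second point.", "One clear sentence.")]

report = compare(pairs)
for row in report:
    print(f"{row['baseline_score']:.2f} -> {row['new_score']:.2f} (delta {row['delta']:+.2f})")
```

A positive delta only means the new output meets more of your criteria. The rubric is still yours to get right.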

If prompt iteration is a regular part of your workflow, give prompt-replay a try. The fixture files are readable, the diff output is interpretable, and the whole loop runs locally in under a minute.

GitHub: MukundaKatta/prompt-replay
PyPI: pip install prompt-replay
