Version and Hash Your Prompt Templates: Know Exactly Which Prompt Produced Each Response


You changed the system prompt last Tuesday. Response quality dropped on Wednesday. A user filed a complaint about a response from Thursday. You need to know: was that Thursday response produced by the old prompt or the new one?

If you do not version your prompts, you cannot know without checking git history and correlating timestamps. That takes time and human judgment, and the answer is often wrong.

llm-prompt-version pins every prompt to a version string and content hash so every logged response can be traced back to the exact prompt that produced it.

The Shape of the Fix

from llm_prompt_version import PromptVersion, PromptRegistry

registry = PromptRegistry()

# Register prompts with explicit version strings
registry.register(
    name="customer_support",
    version="2.1.0",
    content="""You are a helpful customer support agent for Acme Corp.

Answer questions about orders, shipping, and returns.
If you cannot resolve an issue, escalate to a human agent.
Always be polite and solution-focused.""",
)

# Load the prompt
pv = registry.get("customer_support")

print(f"Version: {pv.version}")      # "2.1.0"
print(f"Hash: {pv.content_hash}")    # "a3f9c2..."  (first 8 chars of SHA-256)

# Include version info in every LLM call record
response = call_llm(
    system=pv.content,
    messages=messages,
)

step_log.record_llm_call(
    model=response.model,
    input_tokens=response.usage.input_tokens,
    output_tokens=response.usage.output_tokens,
    prompt_version=pv.version,
    prompt_hash=pv.content_hash,
)

Every response record now carries prompt_version: "2.1.0" and prompt_hash: "a3f9c2...". When a user complains about a Thursday response, you check the log, see the version, and know exactly which prompt was running.
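As a rough sketch of that lookup, assume the step log is JSONL with one record per line. The `response_id` field, the sample records, and the helper name `find_prompt_for_response` are hypothetical; the actual field names depend on your logger.

```python
import json

# Hypothetical log lines; real field names depend on your logging setup.
LOG_LINES = [
    '{"response_id": "r-101", "prompt_version": "2.0.0", "prompt_hash": "9be1d07c"}',
    '{"response_id": "r-102", "prompt_version": "2.1.0", "prompt_hash": "a3f9c2e1"}',
]


def find_prompt_for_response(lines, response_id):
    """Return (version, hash) for the logged response, or None if it is absent."""
    for line in lines:
        record = json.loads(line)
        if record.get("response_id") == response_id:
            return record["prompt_version"], record["prompt_hash"]
    return None


print(find_prompt_for_response(LOG_LINES, "r-102"))
```

With the version and hash in hand, the matching prompt text can be pulled from git or your deployment artifacts.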

What It Does NOT Do

llm-prompt-version does not store prompt history automatically. It holds only the currently registered version of each named prompt; registering a name again replaces the earlier entry. Prompt history (all past versions and their content) lives in git or your deployment artifacts. The library provides the lookup and hash; your version control system provides the history.
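A minimal sketch of that behavior, using a plain dict as a stand-in. This is not the library's code; the `prompts` dict and the `register` helper here are illustrative and only mirror one-entry-per-name storage.

```python
# Stand-in for the registry: one entry per prompt name.
prompts = {}


def register(name, version, content):
    # Registering under an existing name replaces the earlier entry.
    prompts[name] = {"version": version, "content": content}


register("customer_support", "2.0.0", "You are a support agent.")
register("customer_support", "2.1.0", "You are a helpful support agent.")

print(prompts["customer_support"]["version"])  # 2.1.0
print(len(prompts))  # 1
```

After the second call, the 2.0.0 content is gone from memory, which is why the old text has to be recoverable from version control.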

It does not enforce version format. "2.1.0", "v3-experiment", "2026-05-24" are all valid version strings. Use whatever scheme matches your deployment workflow.

It does not auto-increment versions. When you update a prompt, you set the new version string explicitly. There is no magic version bumping. This is intentional: automatic version bumping tends to drift from semantic meaning.

Inside the Library

The PromptVersion object carries the content and its metadata:

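import hashlib
import time


class PromptNotFound(KeyError):
    """Raised when a prompt name has not been registered (assumed definition; not shown in the original listing)."""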

class PromptVersion:
    def __init__(self, name: str, version: str, content: str):
        self.name = name
        self.version = version
        self.content = content
        self.content_hash = hashlib.sha256(content.encode()).hexdigest()[:8]
        self.registered_at = time.time()

    def as_dict(self) -> dict:
        return {
            "name": self.name,
            "version": self.version,
            "content_hash": self.content_hash,
            "registered_at": self.registered_at,
        }

class PromptRegistry:
    def __init__(self):
        self._prompts: dict[str, PromptVersion] = {}

    def register(self, name: str, version: str, content: str) -> PromptVersion:
        pv = PromptVersion(name, version, content)
        self._prompts[name] = pv
        return pv

    def get(self, name: str) -> PromptVersion:
        if name not in self._prompts:
            raise PromptNotFound(name)
        return self._prompts[name]

    def all_versions(self) -> list[dict]:
        return [pv.as_dict() for pv in self._prompts.values()]

The hash is the first 8 hex characters of the SHA-256 digest of the prompt content. Eight hex characters give 16^8, about 4.3 billion, possible values: enough to detect a mismatch without putting the full hash in logs. If two prompts share a version string but differ in content, their hashes will differ, barring a rare collision of the truncated hash.
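To make the truncation concrete, here is a small standalone sketch. The helper name `short_hash` and the two sample strings are illustrative and not part of the library.

```python
import hashlib


def short_hash(content: str) -> str:
    """First 8 hex characters of the SHA-256 digest of the UTF-8 content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:8]


old = "Always be polite and solution-focused."
new = "Always be polite, concise, and solution-focused."

print(len(short_hash(old)))                # 8
print(short_hash(old) == short_hash(old))  # True: same content, same hash
print(short_hash(old) == short_hash(new))  # False unless the truncated hashes collide
print(16 ** 8)                             # 4294967296
```

If your logs are large enough that a one-in-4.3-billion accidental match matters, slicing more characters from the digest is a cheap adjustment.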

For template prompts that take variables:

class TemplatePromptVersion:
    def __init__(self, name: str, version: str, template: str):
        self.name = name
        self.version = version
        self.template = template
        self.template_hash = hashlib.sha256(template.encode()).hexdigest()[:8]

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

    def render_hash(self, **kwargs) -> str:
        rendered = self.render(**kwargs)
        return hashlib.sha256(rendered.encode()).hexdigest()[:8]

template_hash is the hash of the raw template (stable across renders). render_hash() is the hash of a specific rendered instance (reflects the actual prompt sent to the model).
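A short sketch of how the two hashes might be logged side by side. The template text, the `log_fields` helper, and the company names are hypothetical. Note that `str.format` treats braces as placeholders, so literal braces in a template have to be doubled.

```python
import hashlib


def short_hash(text: str) -> str:
    """First 8 hex characters of the SHA-256 digest, as in the class above."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]


# Hypothetical template; doubled braces survive format() as literal braces.
TEMPLATE = 'You support {company}. Reply as JSON like {{"intent": "..."}}.'


def log_fields(template: str, **variables) -> dict:
    """Metadata one might log for a single rendered call."""
    rendered = template.format(**variables)
    return {
        "template_hash": short_hash(template),
        "render_hash": short_hash(rendered),
        "rendered": rendered,
    }


acme = log_fields(TEMPLATE, company="Acme Corp")
globex = log_fields(TEMPLATE, company="Globex")

print(acme["rendered"])
print(acme["template_hash"] == globex["template_hash"])  # True: same template
```

Grouping logs by `template_hash` answers "which template was live", while `render_hash` identifies the exact text a single call sent.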

When to Use It

Use it from day one in any agent that might change its prompts over time. Adding version tracking after the fact means you cannot trace old incidents. Adding it upfront costs almost nothing.

Use it when you run A/B experiments on prompts. Register both variants ("v1-baseline" and "v2-experiment"), each under its own name because the registry holds one version per name, serve each to a different user cohort, and log the version with every response. Analysis is then a simple GROUP BY on the version field.
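The GROUP BY analysis could look roughly like this, assuming the logged responses have been loaded into a SQL table. The table name, the `resolved` column, and the four sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (prompt_version TEXT, resolved INTEGER)")

# Made-up sample rows: 1 means the ticket was resolved without escalation.
conn.executemany(
    "INSERT INTO responses VALUES (?, ?)",
    [
        ("v1-baseline", 1),
        ("v1-baseline", 0),
        ("v2-experiment", 1),
        ("v2-experiment", 1),
    ],
)

rows = conn.execute(
    "SELECT prompt_version, COUNT(*), AVG(resolved) "
    "FROM responses GROUP BY prompt_version ORDER BY prompt_version"
).fetchall()
conn.close()

for version, count, resolve_rate in rows:
    print(version, count, resolve_rate)
```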

Use it when you deploy prompts through a CI/CD pipeline. The content_hash acts as an integrity check: compare the hash of the deployed prompt against the hash in your registry. If they differ, the deployment drifted from what was tested.
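One way such a pipeline check might look, as a sketch. The helper names `short_hash` and `deployed_matches` are hypothetical, and the file is read as text so the hash is computed the same way as for registered content.

```python
import hashlib
import tempfile
from pathlib import Path


def short_hash(content: str) -> str:
    """First 8 hex characters of the SHA-256 digest of the UTF-8 content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:8]


def deployed_matches(path: Path, expected_hash: str) -> bool:
    """True if the prompt file on disk hashes to the value recorded at test time."""
    return short_hash(path.read_text(encoding="utf-8")) == expected_hash


with tempfile.TemporaryDirectory() as tmp:
    prompt_path = Path(tmp) / "triage-agent.txt"
    tested = "Classify the ticket as billing, shipping, or other.\n"
    prompt_path.write_text(tested, encoding="utf-8")
    expected = short_hash(tested)

    ok_before = deployed_matches(prompt_path, expected)

    # Simulate drift: someone edits the file after it was tested.
    prompt_path.write_text(tested + "Be brief.\n", encoding="utf-8")
    ok_after = deployed_matches(prompt_path, expected)

print(ok_before, ok_after)
```

In a real pipeline the expected hash would come from the test run's artifacts, and a `False` result would fail the deploy step.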

Skip it for one-off scripts where you will run the agent exactly once. The overhead of a registry adds no value for a prompt you will never need to audit.

Install

pip install git+https://github.com/MukundaKatta/llm-prompt-version

# Or from PyPI
pip install llm-prompt-version

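A fuller example, wiring the registry into a triage agent:

import os
from pathlib import Path

from llm_prompt_version import PromptRegistry
from agent_step_log import StepLog

# anthropic_client, format_ticket, and extract_text are assumed to be
# defined elsewhere in your application.

registry = PromptRegistry()

registry.register(
    name="triage_agent",
    version=os.environ.get("TRIAGE_PROMPT_VERSION", "unknown"),
    content=Path("prompts/triage-agent.txt").read_text(encoding="utf-8"),
)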

def run_triage(ticket: dict) -> str:
    pv = registry.get("triage_agent")
    log = StepLog(path=f"./logs/{ticket['id']}.jsonl", run_id=ticket["id"])

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        system=pv.content,
        messages=[{"role": "user", "content": format_ticket(ticket)}],
        max_tokens=512,
    )

    log.record_llm_call(
        model=response.model,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        stop_reason=response.stop_reason,
        # Version metadata travels with every log entry
        metadata=pv.as_dict(),
    )

    return extract_text(response)

Sibling Libraries

| Library | What it solves |
| --- | --- |
| `prompt-replay` | Record/replay LLM prompts for regression testing across versions |
| `agent-step-log` | Per-step JSONL logging that carries version metadata |
| `prompt-template-version` | Semver-pinned prompt templates with render tracking |
| `agent-debug-replay` | Step-through navigator that shows prompt version per step |
| `prompt-cache-warmer` | Pre-warm prompt cache after a version update |

The prompt lifecycle stack: llm-prompt-version for version identity, prompt-replay for regression testing between versions, prompt-cache-warmer to pre-warm the new version's cache before traffic switches.

What's Next

Version diff: registry.diff("v1-baseline", "v2-experiment") that runs a text diff between two registered prompt versions. Useful for code review and for understanding why evaluation scores changed.
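Until that lands, a comparable diff can be produced with the standard library. This is a sketch, not the planned API: the function `diff_prompts` and the two sample prompts are hypothetical, and the old prompt text has to be supplied from git or another store.

```python
import difflib


def diff_prompts(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff between two prompt texts, one line per entry."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)


baseline = "Answer questions about orders.\nAlways be polite."
experiment = "Answer questions about orders.\nAlways be polite and concise."

report = diff_prompts(baseline, experiment, "v1-baseline", "v2-experiment")
print(report)
```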

Deployment verification hook: registry.verify(name, expected_hash) that raises PromptHashMismatch if the registered content does not match the expected hash. Add this to your deployment startup checks so a misconfigured deploy fails fast.

Metrics emission: pv.as_dict() already returns a serializable dict. An optional metrics_callback that fires when a prompt is fetched would let you count prompt usage per version in your metrics system without modifying every call site.
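A sketch of what such a callback could look like. The `CountingRegistry` class is a stand-in written for this example, not the library's `PromptRegistry`, and the `metrics_callback` parameter is the proposed feature rather than an existing one.

```python
from collections import Counter
from typing import Callable, Optional


class CountingRegistry:
    """Stand-in registry that fires an optional callback on every fetch."""

    def __init__(self, metrics_callback: Optional[Callable[[dict], None]] = None):
        self._prompts = {}
        self._metrics_callback = metrics_callback

    def register(self, name: str, version: str, content: str) -> None:
        self._prompts[name] = {"name": name, "version": version, "content": content}

    def get(self, name: str) -> dict:
        entry = self._prompts[name]
        if self._metrics_callback is not None:
            # Pass only identifying metadata, not the prompt text itself.
            self._metrics_callback({"name": entry["name"], "version": entry["version"]})
        return entry


usage = Counter()


def count_fetch(meta: dict) -> None:
    usage[(meta["name"], meta["version"])] += 1


registry = CountingRegistry(metrics_callback=count_fetch)
registry.register("triage_agent", "2.1.0", "Classify the ticket.")
registry.get("triage_agent")
registry.get("triage_agent")

print(usage[("triage_agent", "2.1.0")])  # 2
```

In production the callback would more likely increment a counter in your metrics client than a local `Counter`.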

Built as part of the agent-stack family: composable Python primitives for production LLM agents.