prompt-template-version: Semver-Pin Your System Prompts So You Can Always Roll Back

#hermeschallenge #ai #python #agents

The agent started behaving differently on a Tuesday.

It was subtle at first. The triage agent was a little more verbose. It added a sentence of explanation at the end of each response. Users did not complain. A teammate had just made it "more helpful." They edited the system prompt directly in the config file, committed the change, and moved on.

Three weeks later a different teammate was debugging a regression. The agent was now skipping a step it had always done. They looked at the git blame. The system prompt had been edited four times in three weeks, by three different people, with commit messages like "tweak prompt" and "fix wording." No version numbers. No rollback path. No record of what the prompt said at any specific point in production.

They had no way to pin the agent to the version that worked. They had no way to know whether the prompt change caused the regression or whether the regression came from a model update, a data change, or something else. The system prompt was just a string. A mutable string with no contract.

That is the problem prompt-template-version fixes.

The shape of the fix

You register a prompt by name, version, and content. The registry stores it persistently. After registration, that version is locked. You look it up by name and version to use it. If you need to change the behavior, you register a new version.

from prompt_template_version import PromptRegistry

registry = PromptRegistry(storage_dir="./prompts")

# Register version 1.0.0 of the triage agent prompt
registry.register(
    name="triage-agent",
    version="1.0.0",
    content="You are a triage agent. Classify each incoming ticket as P0, P1, P2, or P3. Return only the classification label. No explanation.",
)

# Later, look it up
prompt = registry.get(name="triage-agent", version="1.0.0")
print(prompt.content)
# "You are a triage agent. Classify each incoming ticket as P0, P1, P2, or P3. Return only the classification label. No explanation."
print(prompt.content_sha256)
# "d3f4a8b1..."

The second time you call register() with the same name and version but different content, you get a VersionConflict error.

# This will raise VersionConflict
registry.register(
    name="triage-agent",
    version="1.0.0",
    content="You are a helpful triage agent. Please classify tickets.",
)
# VersionConflict: triage-agent@1.0.0 already registered with different content

If you call register() with the same name, version, and identical content, it is a no-op. Safe to call on startup without a guard.

To ship a behavior change, you increment the version.

registry.register(
    name="triage-agent",
    version="1.1.0",
    content="You are a triage agent. Classify each incoming ticket as P0, P1, P2, or P3. After the label, add one sentence explaining your classification.",
)

Now you have both versions stored. You can run v1.0.0 in production and v1.1.0 in staging. You can roll back to v1.0.0 instantly, and you know you are getting exactly the prompt that ran in production before the change.

What it does NOT do

It does not manage prompt templates with variable interpolation. It stores exact strings.
It does not enforce semver format. "1.0.0", "v2", "2026-05-24", "experiment-a" are all valid version strings.
It does not push prompts to a remote store or database. Storage is local JSON files.
It does not compare prompts across versions or compute semantic diffs. It stores and retrieves.

If you need rollout control, A/B traffic splitting, or centralized multi-service prompt management, this library is not that. It is a local registry with a write-once contract.

Inside the lib: write-once version locking

Each name gets its own JSON file under storage_dir. The file holds every registered version for that name, keyed by version string.

{
  "triage-agent": {
    "1.0.0": {
      "content": "You are a triage agent. Classify each incoming ticket as P0, P1, P2, or P3. Return only the classification label. No explanation.",
      "content_sha256": "d3f4a8b1c2e9f0a7...",
      "registered_at": "2026-05-20T14:23:01Z"
    },
    "1.1.0": {
      "content": "You are a triage agent. Classify each incoming ticket as P0, P1, P2, or P3. After the label, add one sentence explaining your classification.",
      "content_sha256": "a9b2c3d4e5f6a7b8...",
      "registered_at": "2026-05-24T09:11:44Z"
    }
  }
}

When you call register(), the library:

Loads the JSON file for that name, or creates it if it does not exist.
Checks whether that version key already exists.
If it exists and the content_sha256 matches, returns silently.
If it exists and the content_sha256 does not match, raises VersionConflict.
If it does not exist, writes the new version with the SHA-256 of the content and the current timestamp.

The SHA-256 is computed from the exact content string, byte-for-byte. If you add a trailing space or change a newline, the hash changes, and the register() call will raise VersionConflict against an already-stored version.

This means you can commit the JSON files to git. You can audit every prompt you have ever run in production. You can diff two version files and see exactly what changed between v1.0.0 and v1.1.0.

When this is useful

You are debugging a regression. You know the agent changed behavior around a certain date. You check the version history and find that "triage-agent" moved from v1.0.0 to v1.1.0 on that date. You pin the agent back to v1.0.0 and rerun the failing case. You know immediately whether the prompt change was the cause.

You are running an experiment. You want to compare v1.0.0 and v1.1.0 side by side on a sample of live traffic. You have stable identifiers for both versions. You can route 10% of requests to v1.1.0 and compare outcomes without ambiguity about what each version actually says.

You are doing a postmortem. Something went wrong at 3am. You need to know what system prompt the agent was running at that moment. If you pinned the version at deploy time, you have the exact content. No need to reconstruct from git blame across multiple files and commits.

You are building a multi-agent system. Different agents can own different prompt registries. Each agent pins itself to a specific version at startup. Deployments become explicit: you bump the version, check in the new JSON, and the change is auditable.

When NOT to use it

If your prompts change every request because they include dynamic content (user history, retrieved context, tool results), the version registry does not help with that dynamic content. The library versions the static system prompt template, not the full assembled message.

If you are already using a platform that manages prompt versions (LangSmith, PromptLayer, similar), you probably do not need this. The library is useful when you want a lightweight, zero-dependency, local-first solution without a platform account.

If your team commits prompts directly in code and uses git for versioning, this library adds a parallel versioning layer. That can be useful (the JSON format is easier to query programmatically than git history) but it also means two places to keep in sync. Be deliberate about whether that tradeoff is worth it for your team.

Install

pip install prompt-template-version

Zero dependencies. Works on Python 3.8+. The 28 tests cover registration, retrieval, conflict detection, SHA-256 verification, multi-name storage, and idempotent re-registration.

from prompt_template_version import PromptRegistry, VersionConflict

registry = PromptRegistry(storage_dir="./prompt-store")

# Idempotent: safe to call at startup every time
registry.register(
    name="summarizer",
    version="2.0.0",
    content="Summarize the following text in three bullet points. Be concise.",
)

entry = registry.get(name="summarizer", version="2.0.0")
print(entry.version)         # "2.0.0"
print(entry.content_sha256)  # "b7c8d9e0f1a2b3c4..."

Siblings

These libraries compose well with prompt-template-version if you are building an auditable prompt pipeline.

Lib	Boundary	Repo
prompt-replay	Replay old prompt version against a new model to compare outputs before you ship	MukundaKatta/prompt-replay
agentsnap	Snapshot what prompt was used in a specific agent run, so you can correlate version to behavior after the fact	MukundaKatta/agentsnap
prompt-cache-warmer	Pre-warm the Anthropic prompt cache for the currently active version so the first real request hits the cache	MukundaKatta/prompt-cache-warmer
prompt-cache-key	Generate stable cache-scope hashes per prompt version, so different prompt versions never share a cache slot	MukundaKatta/prompt-cache-key

What's next

A few things would make this more useful in larger systems.

First, a list_versions() method that returns all registered versions for a name in chronological order. Useful for building a UI or a migration script that needs to walk the history.

Second, an optional tags field on registration. You could tag a version with {"env": "production", "deployed_at": "2026-05-24"} and filter by tag later. This would let you reconstruct the production state at a given date without relying entirely on git history.

Third, a CLI. ptv get triage-agent 1.0.0 would print the content. ptv list triage-agent would show all versions. Useful for ops workflows that do not want to write Python.

The core contract stays the same: register once, read many times, never overwrite. The write-once guarantee is what makes the rest useful.

Part of the Hermes Agent Challenge. Built as one library in a larger stack of composable agent utilities, all published under the MukundaKatta GitHub handle.