Prompt Engineering Hygiene: The Invisible Layer Under Your Agent Logic

#hermeschallenge #ai #python #agents

Every LLM agent has a prompt layer. Most treat it as an afterthought. The system prompt is a multi-line f-string somewhere in the codebase. Nobody knows when it last changed. Nobody knows which version is in production. Nobody can replay the same prompt against a different model to compare outputs.

This matters because prompts are code. They determine behavior. When something breaks, you need to know if the prompt changed, which version changed, and what it changed to. Without versioning, you are debugging blind.

This post covers three tools. prompt-template-version gives prompts a name, a semver, and a content hash. agentprompt-rs provides role-aware Jinja2 templates callable from Python. prompt-replay records outputs for a template version and replays them against a different model so you can diff the results before you ship.

The F-String Problem

Here is what most agent prompt code looks like:

# agent.py - the common anti-pattern
def build_system_prompt(user_name: str, context: str) -> str:
    return f"""
You are a helpful assistant for {user_name}.
Context: {context}
Always respond in JSON.
Be concise.
"""

This has no version. If you change "Be concise" to "Be thorough", the change is invisible in CI. The model behavior changes. You cannot reproduce a bug from last week because you do not know which prompt was running then. You cannot A/B test prompt changes because you cannot reference the old version.

The fix is not a framework. It is a convention plus three small tools.

prompt-template-version: Name, Semver, Content Hash

# pip install prompt-template-version

from prompt_template_version import PromptRegistry, PromptVersion

registry = PromptRegistry()

# Register prompt versions explicitly.
# version is semver: MAJOR.MINOR.PATCH
# content_sha256 is a SHA-256 hash of the template string.
# If the template changes, the hash changes. The registry will refuse to load
# a version where the declared hash does not match the actual content.

SYSTEM_PROMPT_V1 = registry.register(
    name="assistant_system",
    version="1.0.0",
    template="""
You are a helpful assistant for {{ user_name }}.
Context: {{ context }}
Always respond in JSON.
Be concise.
""",
)

SYSTEM_PROMPT_V2 = registry.register(
    name="assistant_system",
    version="1.1.0",
    template="""
You are a helpful assistant for {{ user_name }}.
Context: {{ context }}
Always respond in valid JSON with these keys: answer, confidence, sources.
Keep each field under 100 words.
""",
)

# In production: use the latest registered version
prompt = registry.latest("assistant_system")

# Render with variables
rendered = prompt.render(user_name="Alice", context="Q4 budget report")
print(f"Running prompt version: {prompt.version} hash: {prompt.content_sha256[:8]}")

The hash is the key feature. When you deploy, the hash is logged with every model call. When a bug report comes in, you look up the hash and know exactly which template was running. You can reproduce the exact prompt.

agentprompt-rs: Role-Aware Jinja2 Templates

agentprompt-rs is a Rust library with Python bindings that handles role-structured prompt rendering. It uses Jinja2 syntax and outputs a list of role-message dicts that any OpenAI-compatible client accepts directly.

# agentprompt-rs Python bindings (install via pip install agentprompt)
from agentprompt import TemplateEngine

engine = TemplateEngine()

# Define role-structured templates
# Each block maps to a message role in the API call
template = engine.parse("""
[system]
You are a research assistant. Your job is to find accurate information.
Today's date is {{ date }}.
User tier: {{ user_tier }}.

{% if user_tier == "premium" %}
You have access to web search and document retrieval.
{% else %}
You can only use your training knowledge. Say so if asked about recent events.
{% endif %}

[user]
{{ user_message }}
""")

# Render to a list of role-message dicts
messages = template.render(
    date="2026-05-24",
    user_tier="premium",
    user_message="What were the main AI releases in April 2026?",
)

# messages is now ready for the API:
# [
#   {"role": "system", "content": "You are a research assistant. ..."},
#   {"role": "user", "content": "What were the main AI releases in April 2026?"}
# ]

response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=messages,
)

The conditional block inside the system prompt is handled in the template, not in Python control flow scattered across the codebase. When you add a new user tier, you update the template. You do not hunt for if-else branches in five different files.

prompt-replay: Record Once, Diff Across Models

prompt-replay records the output of a prompt version against a model. When you change the prompt or change the model, you replay the same inputs and diff the outputs.

# pip install prompt-replay

from prompt_replay import ReplaySession, OutputDiff

# Step 1: Record (run once with real API, save to file)
# PROMPT_REPLAY_RECORD=1 python run_replay.py

# Step 2: Replay against new model or new prompt version

# replay_session.py
from prompt_replay import ReplaySession
from prompt_template_version import PromptRegistry

registry = PromptRegistry()
PROMPT_V1 = registry.get("assistant_system", "1.0.0")
PROMPT_V2 = registry.get("assistant_system", "1.1.0")

# Create a session with a set of test inputs
session = ReplaySession(
    name="assistant_system_comparison",
    test_inputs=[
        {"user_name": "Alice", "context": "Budget review", "user_message": "Summarize Q4 results"},
        {"user_name": "Bob", "context": "Tech report", "user_message": "What are the risks?"},
        {"user_name": "Carol", "context": "Product launch", "user_message": "What should we prioritize?"},
    ],
)

# Record baseline (v1.0.0 on claude-sonnet-4-6)
# Only runs on first execution; subsequent runs use the saved recording
baseline = session.record(
    prompt_version=PROMPT_V1,
    model="claude-sonnet-4-6",
    record_path="replays/assistant_v1_baseline.jsonl",
)

# Replay v2 prompt against same model
replay = session.replay(
    prompt_version=PROMPT_V2,
    model="claude-sonnet-4-6",
    baseline_path="replays/assistant_v1_baseline.jsonl",
)

# Diff the results
diff = OutputDiff(baseline=baseline, candidate=replay)
print(f"Changed outputs: {diff.changed_count}/{diff.total_count}")
print(f"Average length change: {diff.avg_length_delta:+.0f} chars")

for change in diff.changes:
    print(f"\nInput: {change.input['user_message']}")
    print(f"Before: {change.baseline_output[:200]}")
    print(f"After:  {change.candidate_output[:200]}")

Record once. The recording is a JSONL file that goes in version control alongside the prompt version. When you bump the prompt from 1.0.0 to 1.1.0, you run the replay, review the diff, and decide whether the change is an improvement. The diff runs without touching the API again.

What Applies Here and What Does Not

Applies:

Any agent where prompt changes have caused unexpected behavior changes in the past
Projects where multiple people edit prompts
Any deployment where you need to reproduce a specific model output from a logged request
A/B testing before a model upgrade (old prompt on new model, diff the outputs)

Does not apply:

Single-file scripts that are not deployed
Agents where every call is a one-off and output variability is expected
Prototyping where you are still discovering what the prompt should say

Do not add prompt versioning before you have a prompt worth versioning. A prompt you rewrote five times last week is not stable enough to version. Wait until it settles.

Quick-Start Snippet

pip install prompt-template-version prompt-replay agentprompt

from prompt_template_version import PromptRegistry
from prompt_replay import ReplaySession
from agentprompt import TemplateEngine

# Register your prompt
registry = PromptRegistry()
prompt = registry.register(
    name="my_agent_system",
    version="1.0.0",
    template="You are a helpful assistant. Today is {{ date }}.",
)

# Render it
rendered = prompt.render(date="2026-05-24")
print(f"Prompt hash: {prompt.content_sha256[:8]}")
print(f"Prompt version: {prompt.version}")

# Log the hash with every API call so you can reproduce later
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "system", "content": rendered}, {"role": "user", "content": "Hello"}],
    metadata={"prompt_version": prompt.version, "prompt_hash": prompt.content_sha256[:8]},
)

Related Libraries

Library	What It Does
prompt-template-version	Name + semver + content hash per prompt template
agentprompt-rs	Jinja2 role-structured prompt templates, callable from Python
prompt-replay	Record outputs, replay against new prompt or model, diff results
cachebench	Prompt cache hit rate observability for Anthropic API calls
prompt-cache-warmer	Pre-warm Anthropic prompt cache before high-traffic periods
prompt-cache-warmer-rs	Rust port of prompt-cache-warmer for lower-latency warm-up

What's Next

Prompt versioning is the foundation. Once you have hashes logged, the next question is: which prompt version is responsible for this cache hit rate? cachebench connects prompt version hashes to Anthropic cache hit metrics. You can see whether prompt version 1.1.0 warmed the cache better than 1.0.0 for the same set of inputs.

After that, prompt-cache-warmer lets you pre-warm the Anthropic prompt cache before a high-traffic window. You submit the system prompt once, the cache warms, and the first 100 user calls all hit the cache instead of paying full token cost. Combined with prompt versioning, you can pre-warm the exact version that is deployed, not whatever happens to be last in the request queue.