When teams first build AI features, prompts usually start simple.
A string in a function.
A template inside a route.
Maybe a small helper function.
Something like:
prompt = f"Summarize the following system event:\n\n{event_text}"
Then, as the system evolves, prompts start changing.
A word here.
A constraint there.
Someone adds a new instruction for better formatting.
Before long, the system behaves differently and nobody can explain why.
That’s when prompt chaos begins.
1. The Problem: Prompt Chaos
Unlike normal code, prompts are often invisible infrastructure.
They live inside strings scattered across services. They change quietly during experimentation.
Over time this creates several problems:
- Responses change unexpectedly
- Evaluation metrics become unreliable
- Debugging becomes difficult
- Prompt history disappears
If an output changes today, you may not know whether the cause was:
- A prompt change?
- A model change?
- A parameter change?
Without prompt identity, the system becomes difficult to reason about.
2. Why Prompts Need Versioning
Prompts influence system behavior as much as code does.
In fact, prompts are closer to configuration that drives behavior.
That means prompts deserve the same discipline as code:
- Version control
- Reproducibility
- Traceability
Instead of treating prompts as strings, we can treat them as versioned assets.
This approach allows us to answer important questions:
- Which prompt generated this output?
- Which version was deployed last week?
- Which prompt version performs best during evaluation?
This is the idea behind Prompt as Code.
3. What a Prompt Registry Is
A Prompt Registry is a small service responsible for managing prompt templates.
Instead of constructing prompts directly in application logic, the application resolves them from a registry.
A prompt registry provides:
- Prompt templates
- Version management
- Deterministic rendering
- Prompt hashing
This transforms prompts from ad-hoc strings into structured runtime assets.
Example prompt template:
PromptTemplate(
    name="system_summary",
    version="v2",
    template="""
You are a production AI assistant focused on reliability.
Summarize the following system event:
{event_text}
""",
)
Now prompts have identity.
4. Architecture
The prompt registry sits between the API layer and the model gateway.
Client Request
↓
Prompt Registry
↓
Rendered Prompt
↓
Model Gateway
↓
Provider Adapter
↓
Cost Metering
↓
Evaluation
This architecture ensures:
- Prompts are resolved before inference
- Prompt versions are logged
- Evaluation remains reproducible
It also cleanly separates prompt management from model execution.
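Sketched in Python (handle_request, prompt_service, and model_gateway are illustrative stand-ins here, not the toolkit's actual API), the request path looks like this:
def handle_request(request: dict, prompt_service, model_gateway) -> dict:
    # Resolve and render the prompt before any inference happens.
    rendered = prompt_service.render(
        name=request["prompt_name"],
        variables=request["variables"],
    )
    # The gateway routes to a provider adapter and records cost and
    # evaluation alongside the prompt's identity.
    return model_gateway.complete(
        prompt=rendered.content,
        model=request.get("model"),
        prompt_version=rendered.version,
        prompt_hash=rendered.hash,
    )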
5. Implementation
In the Maester toolkit, the prompt registry lives inside:
packages/
  prompt_registry/
    models.py
    registry.py
    service.py
    hashing.py
A prompt template is defined as a structured object.
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    name: str
    version: str
    template: str
Templates are stored in a registry:
registry.register(
    PromptTemplate(
        name="system_summary",
        version="v1",
        template="Summarize the following system event:\n\n{event_text}",
    )
)
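A minimal sketch of what such a registry can look like (the real registry.py lives in the repository; this version only covers registration and latest-version lookup, and assumes version strings sort naturally, e.g. "v1" < "v2"):
from collections import defaultdict
from typing import Optional

class PromptRegistry:
    def __init__(self) -> None:
        # name -> {version -> PromptTemplate}
        self._templates = defaultdict(dict)

    def register(self, template: PromptTemplate) -> None:
        self._templates[template.name][template.version] = template

    def get(self, name: str, version: Optional[str] = None) -> PromptTemplate:
        versions = self._templates[name]
        if version is None:
            # Fall back to the highest registered version, matching the
            # "default to latest" behavior shown in the examples below.
            version = max(versions)
        return versions[version]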
When the API receives a request, the prompt service resolves and renders the prompt.
rendered = prompt_service.render(
    name="system_summary",
    variables={
        "event_text": "User downloaded a large dataset."
    },
)
The rendered prompt includes:
- prompt name
- prompt version
- prompt content
- prompt hash
The hash guarantees that the exact prompt used during inference can be traced later.
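A minimal sketch of the idea (whether the digest covers the raw template or the fully rendered text is an implementation choice; this sketch hashes the rendered text with SHA-256, which is consistent with the 64-character hex hashes in the examples below, and the RenderedPrompt shape is illustrative, not the toolkit's exact API):
import hashlib
from dataclasses import dataclass

@dataclass
class RenderedPrompt:
    name: str
    version: str
    content: str
    hash: str

def render(template: PromptTemplate, variables: dict) -> RenderedPrompt:
    # str.format keeps rendering deterministic: the same template and
    # variables always produce the same content, and therefore the same hash.
    content = template.template.format(**variables)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return RenderedPrompt(template.name, template.version, content, digest)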
6. Prompt Versioning Examples
Once prompts become versioned assets, evaluation becomes much more reliable.
Each request now records:
- model
- provider
- prompt name
- prompt version
- prompt hash
This allows teams to compare prompt performance across versions.
Example Prompt v1
Request:
{
  "prompt_name": "system_summary",
  "prompt_version": "v1",
  "variables": {
    "event_text": "Admin revoked API key for user account 742."
  },
  "model": "gpt-4.1-mini",
  "max_tokens": 120
}
Response:
{"provider":"openai",
"model":"gpt-4.1-mini",
"trace_id":"61050fd4e94849d791e566ead8c8f1c6",
"prompt_name":"system_summary",
"prompt_version":"v1","prompt_hash":"5ba4cce2a985f8234698a63fe2260428b029dfd7d61e53a5793cc963b8737036",
"content":"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant.\nSummarize the following system event clearly:\n\nAdmin revoked API key for user account",
"cost":{
"model":"gpt-4.1-mini",
"input_tokens":40,
"output_tokens":48,
"total_tokens":88,
"input_cost_usd":"0.000016",
"output_cost_usd":"0.000077",
"total_cost_usd":"0.000093",
"unit":"USD"
},
"evaluation":{
"status":"pass",
"reliability_score":1.0,
"metrics":[
{"name":"non_empty","score":1.0,"passed":true,"reason":null},
{"name":"max_length","score":1.0,"passed":true,"reason":null}
]
}}
Example Prompt v2
Request (default to latest):
{
  "prompt_name": "system_summary",
  "variables": {
    "event_text": "System latency increased above 300ms for the inference service."
  }
}
Response:
{"provider":"openai",
"model":"gpt-4.1-mini",
"trace_id":"7ec85dc989dc4da8a0ac9bb73f2317a7",
"prompt_name":"system_summary",
"prompt_version":"v2",
"prompt_hash":"06c08f6125a189abf90b44c9a63a5bc0f5307f06319363a922a476b38776b8c6",
"content":"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant focused on reliability.\nSummarize the following system event.\nBe concise, mention operational impact, and keep the tone factual.\n\nSystem latency increased above 300ms for the inference service.",
"cost":{
"model":"gpt-4.1-mini",
"input_tokens":66,
"output_tokens":46,
"total_tokens":112,
"input_cost_usd":"0.000026",
"output_cost_usd":"0.000074",
"total_cost_usd":"0.000100",
"unit":"USD"
},
"evaluation":{
"status":"pass",
"reliability_score":1.0,
"metrics":[
{"name":"non_empty","score":1.0,"passed":true,"reason":null},
{"name":"max_length","score":1.0,"passed":true,"reason":null}
]
}}
Now prompt optimization becomes measurable rather than guesswork.
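Given responses logged in the shape shown above, comparing versions can be as simple as grouping reliability scores by prompt_version (a hypothetical aggregation, not part of the toolkit):
from collections import defaultdict

def average_reliability_by_version(logged_responses: list) -> dict:
    # Group reliability scores by prompt version and average them.
    scores = defaultdict(list)
    for response in logged_responses:
        scores[response["prompt_version"]].append(
            response["evaluation"]["reliability_score"]
        )
    return {version: sum(s) / len(s) for version, s in scores.items()}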
7. Lessons Learned
Building a prompt registry revealed a few important lessons.
1. Prompts evolve quickly
Even small systems accumulate many prompt variations.
2. Reproducibility matters early
Without prompt versioning, evaluation results become meaningless.
3. Prompt identity simplifies debugging
When responses change, engineers can immediately identify the cause.
4. Prompts should live outside business logic
Separating prompts from application code improves maintainability.
8. The Code
The implementation described in this article is part of an open-source project called Maester.
Maester is a lightweight toolkit focused on AI API reliability, including:
- Model gateway routing
- Cost metering
- Observability
- Evaluation pipelines
- Prompt registry
Repository: maester
The goal is to explore how production AI systems can remain observable, reproducible, and resilient as they grow.
Note: This article was originally published on my engineering blog where I’m documenting the design of Maester, a production AI SaaS infrastructure system built in public. Original post: Prompt as Code — Build Prompt Registry with Versioning