When teams first build AI features, prompts usually start simple.
A string in a function.
A template inside a route.
Maybe a small helper function.
Something like:
prompt = f"Summarize the following system event:\n\n{event_text}"
Then, as the system evolves, prompts start changing.
A word here.
A constraint there.
Someone adds a new instruction for better formatting.
Before long, the system behaves differently and nobody can explain why.
That’s when prompt chaos begins.
1. The Problem: Prompt Chaos
Unlike normal code, prompts are often invisible infrastructure.
They live inside strings scattered across services. They change quietly during experimentation.
Over time this creates several problems:
- Responses change unexpectedly
- Evaluation metrics become unreliable
- Debugging becomes difficult
- Prompt history disappears
If an output changes today, you may not know whether the cause was:
- A prompt change?
- A model change?
- A parameter change?
Without prompt identity, the system becomes difficult to reason about.
2. Why Prompts Need Versioning
Prompts influence system behavior as much as code does.
In fact, prompts are closer to configuration that drives behavior.
That means prompts deserve the same discipline as code:
- Version control
- Reproducibility
- Traceability
Instead of treating prompts as strings, we can treat them as versioned assets.
This approach allows us to answer important questions:
- Which prompt generated this output?
- Which version was deployed last week?
- Which prompt version performs best during evaluation?
This is the idea behind Prompt as Code.
3. What a Prompt Registry Is
A Prompt Registry is a small service responsible for managing prompt templates.
Instead of constructing prompts directly in application logic, the application resolves them from a registry.
A prompt registry provides:
- Prompt templates
- Version management
- Deterministic rendering
- Prompt hashing
This transforms prompts from ad-hoc strings into structured runtime assets.
Example prompt template:
PromptTemplate(
    name="system_summary",
    version="v2",
    template="""
You are a production AI assistant focused on reliability.
Summarize the following system event:
{event_text}
""",
)
Now prompts have identity.
4. Architecture
The prompt registry sits between the API layer and the model gateway.
Client Request
↓
Prompt Registry
↓
Rendered Prompt
↓
Model Gateway
↓
Provider Adapter
↓
Cost Metering
↓
Evaluation
This architecture ensures:
- Prompts are resolved before inference
- Prompt versions are logged
- Evaluation remains reproducible
It also cleanly separates prompt management from model execution.
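Sketched in Python (handle_request, prompt_service, and model_gateway are illustrative stand-ins here, not the toolkit's actual API), the request path looks like this:
def handle_request(request: dict, prompt_service, model_gateway) -> dict:
    # Resolve and render the prompt before any inference happens.
    rendered = prompt_service.render(
        name=request["prompt_name"],
        variables=request["variables"],
    )
    # The gateway routes to a provider adapter and records cost and
    # evaluation alongside the prompt's identity.
    return model_gateway.complete(
        prompt=rendered.content,
        model=request.get("model"),
        prompt_version=rendered.version,
        prompt_hash=rendered.hash,
    )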
5. Implementation
In the Maester toolkit, the prompt registry lives inside:
packages/
  prompt_registry/
    models.py
    registry.py
    service.py
    hashing.py
A prompt template is defined as a structured object.
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    name: str
    version: str
    template: str
Templates are stored in a registry:
registry.register(
    PromptTemplate(
        name="system_summary",
        version="v1",
        template="Summarize the following system event:\n\n{event_text}",
    )
)
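A minimal sketch of what such a registry can look like (the real registry.py lives in the repository; this version only covers registration and latest-version lookup, and assumes version strings sort naturally, e.g. "v1" < "v2"):
from collections import defaultdict
from typing import Optional

class PromptRegistry:
    def __init__(self) -> None:
        # name -> {version -> PromptTemplate}
        self._templates = defaultdict(dict)

    def register(self, template: PromptTemplate) -> None:
        self._templates[template.name][template.version] = template

    def get(self, name: str, version: Optional[str] = None) -> PromptTemplate:
        versions = self._templates[name]
        if version is None:
            # Fall back to the highest registered version, matching the
            # "default to latest" behavior shown in the examples below.
            version = max(versions)
        return versions[version]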
When the API receives a request, the prompt service resolves and renders the prompt.
rendered = prompt_service.render(
    name="system_summary",
    variables={
        "event_text": "User downloaded a large dataset."
    },
)
The rendered prompt includes:
- prompt name
- prompt version
- prompt content
- prompt hash
The hash guarantees that the exact prompt used during inference can be traced later.
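A minimal sketch of the idea (whether the digest covers the raw template or the fully rendered text is an implementation choice; this sketch hashes the rendered text with SHA-256, which is consistent with the 64-character hex hashes in the examples below, and the RenderedPrompt shape is illustrative, not the toolkit's exact API):
import hashlib
from dataclasses import dataclass

@dataclass
class RenderedPrompt:
    name: str
    version: str
    content: str
    hash: str

def render(template: PromptTemplate, variables: dict) -> RenderedPrompt:
    # str.format keeps rendering deterministic: the same template and
    # variables always produce the same content, and therefore the same hash.
    content = template.template.format(**variables)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return RenderedPrompt(template.name, template.version, content, digest)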
6. Prompt Versioning Examples
Once prompts become versioned assets, evaluation becomes much more reliable.
Each request now records:
- model
- provider
- prompt name
- prompt version
- prompt hash
This allows teams to compare prompt performance across versions.
Example Prompt v1
Request:
{
  "prompt_name": "system_summary",
  "prompt_version": "v1",
  "variables": {
    "event_text": "Admin revoked API key for user account 742."
  },
  "model": "gpt-4.1-mini",
  "max_tokens": 120
}
Response:
{"provider":"openai",
"model":"gpt-4.1-mini",
"trace_id":"61050fd4e94849d791e566ead8c8f1c6",
"prompt_name":"system_summary",
"prompt_version":"v1","prompt_hash":"5ba4cce2a985f8234698a63fe2260428b029dfd7d61e53a5793cc963b8737036",
"content":"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant.\nSummarize the following system event clearly:\n\nAdmin revoked API key for user account",
"cost":{
"model":"gpt-4.1-mini",
"input_tokens":40,
"output_tokens":48,
"total_tokens":88,
"input_cost_usd":"0.000016",
"output_cost_usd":"0.000077",
"total_cost_usd":"0.000093",
"unit":"USD"
},
"evaluation":{
"status":"pass",
"reliability_score":1.0,
"metrics":[
{"name":"non_empty","score":1.0,"passed":true,"reason":null},
{"name":"max_length","score":1.0,"passed":true,"reason":null}
]
}}
Example Prompt v2
Request (default to latest):
{
  "prompt_name": "system_summary",
  "variables": {
    "event_text": "System latency increased above 300ms for the inference service."
  }
}
Response:
{"provider":"openai",
"model":"gpt-4.1-mini",
"trace_id":"7ec85dc989dc4da8a0ac9bb73f2317a7",
"prompt_name":"system_summary",
"prompt_version":"v2",
"prompt_hash":"06c08f6125a189abf90b44c9a63a5bc0f5307f06319363a922a476b38776b8c6",
"content":"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant focused on reliability.\nSummarize the following system event.\nBe concise, mention operational impact, and keep the tone factual.\n\nSystem latency increased above 300ms for the inference service.",
"cost":{
"model":"gpt-4.1-mini",
"input_tokens":66,
"output_tokens":46,
"total_tokens":112,
"input_cost_usd":"0.000026",
"output_cost_usd":"0.000074",
"total_cost_usd":"0.000100",
"unit":"USD"
},
"evaluation":{
"status":"pass",
"reliability_score":1.0,
"metrics":[
{"name":"non_empty","score":1.0,"passed":true,"reason":null},
{"name":"max_length","score":1.0,"passed":true,"reason":null}
]
}}
Now prompt optimization becomes measurable rather than guesswork.
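Given responses logged in the shape shown above, comparing versions can be as simple as grouping reliability scores by prompt_version (a hypothetical aggregation, not part of the toolkit):
from collections import defaultdict

def average_reliability_by_version(logged_responses: list) -> dict:
    # Group reliability scores by prompt version and average them.
    scores = defaultdict(list)
    for response in logged_responses:
        scores[response["prompt_version"]].append(
            response["evaluation"]["reliability_score"]
        )
    return {version: sum(s) / len(s) for version, s in scores.items()}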
7. Lessons Learned
Building a prompt registry revealed a few important lessons.
1. Prompts evolve quickly
Even small systems accumulate many prompt variations.
2. Reproducibility matters early
Without prompt versioning, evaluation results become meaningless.
3. Prompt identity simplifies debugging
When responses change, engineers can immediately identify the cause.
4. Prompts should live outside business logic
Separating prompts from application code improves maintainability.
8. The Code
The implementation described in this article is part of an open-source project called Maester.
Maester is a lightweight toolkit focused on AI API reliability, including:
- Model gateway routing
- Cost metering
- Observability
- Evaluation pipelines
- Prompt registry
Repository: maester
The goal is to explore how production AI systems can remain observable, reproducible, and resilient as they grow.
Note: This article was originally published on my engineering blog where I’m documenting the design of Maester, a production AI SaaS infrastructure system built in public. Original post: Prompt as Code — Build Prompt Registry with Versioning