Lei Ye

Posted on • Originally published at lei-ye.dev

The Hidden Problem With Prompts in Production AI

Originally published at: Prompt as Code — Build Prompt Registry with Versioning



When teams first build AI features, prompts usually start simple.

A string in a function.
A template inside a route.
Maybe a small helper function.

Something like:

prompt = f"Summarize the following system event:\n\n{event_text}"

Then, as the system evolves, prompts start changing.

A word here.
A constraint there.
Someone adds a new instruction for better formatting.

Before long, the system behaves differently and nobody can explain why.

That’s when prompt chaos begins.


1. The Problem: Prompt Chaos

Unlike normal code, prompts are often invisible infrastructure.

They live inside strings scattered across services. They change quietly during experimentation.

Over time this creates several problems:

  • Responses change unexpectedly
  • Evaluation metrics become unreliable
  • Debugging becomes difficult
  • Prompt history disappears

If an output changes today, you may not know whether the cause was:

  • A prompt change?
  • A model change?
  • A parameter change?

Without prompt identity, the system becomes difficult to reason about.


2. Why Prompts Need Versioning

Prompts influence system behavior as much as code does.

In fact, prompts are closer to configuration that drives behavior.

That means prompts deserve the same discipline as code:

  • Version control
  • Reproducibility
  • Traceability

Instead of treating prompts as strings, we can treat them as versioned assets.

This approach allows us to answer important questions:

Which prompt generated this output?
Which version was deployed last week?
Which prompt version performs best during evaluation?

This is the idea behind Prompt as Code.


3. What a Prompt Registry Is

A Prompt Registry is a small service responsible for managing prompt templates.

Instead of constructing prompts directly in application logic, the application resolves them from a registry.

A prompt registry provides:

  • Prompt templates
  • Version management
  • Deterministic rendering
  • Prompt hashing

This transforms prompts from ad-hoc strings into structured runtime assets.

Example prompt template:

PromptTemplate(
    name="system_summary",
    version="v2",
    template="""
You are a production AI assistant focused on reliability.
Summarize the following system event:
{event_text}
"""
)

Now prompts have identity.


4. Architecture

The prompt registry sits between the API layer and the model gateway.

Client Request
      ↓
Prompt Registry
      ↓
Rendered Prompt
      ↓
Model Gateway
      ↓
Provider Adapter
      ↓
Cost Metering
      ↓
Evaluation

This architecture ensures:

  • Prompts are resolved before inference
  • Prompt versions are logged
  • Evaluation remains reproducible

It also cleanly separates prompt management from model execution.
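In code, this flow can be sketched as a pipeline where the prompt is resolved before any model call. This is an illustrative sketch, not the actual Maester implementation; `registry`, `gateway`, `meter`, and `evaluator` are hypothetical stand-ins for the components in the diagram above.

```python
def handle_request(registry, gateway, meter, evaluator,
                   prompt_name, variables, **model_opts):
    """Resolve the prompt first, then run inference, metering, and evaluation."""
    # Prompt resolution happens before inference, so the prompt's
    # identity (name, version, hash) travels with the request.
    rendered = registry.render(name=prompt_name, variables=variables)

    response = gateway.complete(prompt=rendered.content, **model_opts)

    cost = meter.record(response)
    evaluation = evaluator.run(response)

    # Every logged result carries the exact prompt identity,
    # which keeps evaluation reproducible.
    return {
        "prompt_name": rendered.name,
        "prompt_version": rendered.version,
        "prompt_hash": rendered.hash,
        "content": response.content,
        "cost": cost,
        "evaluation": evaluation,
    }
```

The key design point is ordering: because resolution precedes execution, no request can reach the model gateway with an untracked prompt.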


5. Implementation

In the Maester toolkit, the prompt registry lives inside:

packages/
  prompt_registry/
      models.py
      registry.py
      service.py
      hashing.py

A prompt template is defined as a structured object.

@dataclass
class PromptTemplate:
    name: str
    version: str
    template: str

Templates are stored in a registry:

registry.register(
    PromptTemplate(
        name="system_summary",
        version="v1",
        template="Summarize the following system event:\n\n{event_text}",
    )
)

When the API receives a request, the prompt service resolves and renders the prompt.

rendered = prompt_service.render(
    name="system_summary",
    variables={
        "event_text": "User downloaded a large dataset."
    }
)

The rendered prompt includes:

  • prompt name
  • prompt version
  • prompt content
  • prompt hash

The hash guarantees that the exact prompt used during inference can be traced later.
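One plausible way to implement such a hash is a SHA-256 digest over the template content, so that the same template version always maps to the same hash. This is a sketch of the idea; the actual `hashing.py` may differ.

```python
import hashlib


def prompt_hash(template: str) -> str:
    """Stable content hash: identical templates always produce identical hashes."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


# The hash is deterministic, so it can be recomputed at any time
# to verify which template content produced a logged response.
h1 = prompt_hash("Summarize the following system event:\n\n{event_text}")
h2 = prompt_hash("Summarize the following system event:\n\n{event_text}")
assert h1 == h2
```

A content hash also catches silent edits: if someone changes a template without bumping the version string, the hash recorded in the logs changes anyway.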


6. Prompt Versioning Examples

Once prompts become versioned assets, evaluation becomes much more reliable.

Each request now records:

  • model
  • provider
  • prompt name
  • prompt version
  • prompt hash

This allows teams to compare prompt performance across versions.
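With the version recorded on every request, comparison reduces to a simple aggregation. A sketch over hypothetical request logs shaped like the responses below:

```python
from collections import defaultdict


def score_by_version(request_logs):
    """Average reliability score per (prompt_name, prompt_version)."""
    totals = defaultdict(lambda: [0.0, 0])
    for log in request_logs:
        key = (log["prompt_name"], log["prompt_version"])
        totals[key][0] += log["evaluation"]["reliability_score"]
        totals[key][1] += 1
    # Mean reliability score per prompt version.
    return {key: total / count for key, (total, count) in totals.items()}


logs = [
    {"prompt_name": "system_summary", "prompt_version": "v1",
     "evaluation": {"reliability_score": 0.8}},
    {"prompt_name": "system_summary", "prompt_version": "v2",
     "evaluation": {"reliability_score": 1.0}},
]
```

Calling `score_by_version(logs)` yields one mean score per version, turning "which prompt is better?" into a query rather than a guess.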

Example Prompt v1

Request:

{
    "prompt_name": "system_summary",
    "prompt_version": "v1",
    "variables": {
      "event_text": "Admin revoked API key for user account 742."
    },
    "model": "gpt-4.1-mini",
    "max_tokens": 120
  }

Response:

{"provider":"openai",
"model":"gpt-4.1-mini",
"trace_id":"61050fd4e94849d791e566ead8c8f1c6",
"prompt_name":"system_summary",
"prompt_version":"v1","prompt_hash":"5ba4cce2a985f8234698a63fe2260428b029dfd7d61e53a5793cc963b8737036",
"content":"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant.\nSummarize the following system event clearly:\n\nAdmin revoked API key for user account",
"cost":{
    "model":"gpt-4.1-mini",
    "input_tokens":40,
    "output_tokens":48,
    "total_tokens":88,
    "input_cost_usd":"0.000016",
    "output_cost_usd":"0.000077",
    "total_cost_usd":"0.000093",
    "unit":"USD"
},
"evaluation":{
    "status":"pass",
    "reliability_score":1.0,
    "metrics":[
        {"name":"non_empty","score":1.0,"passed":true,"reason":null},
        {"name":"max_length","score":1.0,"passed":true,"reason":null}
    ]
}}

Example Prompt v2

Request (default to latest):

{
    "prompt_name": "system_summary",
    "variables": {
      "event_text": "System latency increased above 300ms for the inference service."
    }
  }

Response:

{"provider":"openai",
"model":"gpt-4.1-mini",
"trace_id":"7ec85dc989dc4da8a0ac9bb73f2317a7",
"prompt_name":"system_summary",
"prompt_version":"v2",
"prompt_hash":"06c08f6125a189abf90b44c9a63a5bc0f5307f06319363a922a476b38776b8c6",
"content":"[OpenAI:gpt-4.1-mini] Generated response for prompt: You are a production AI assistant focused on reliability.\nSummarize the following system event.\nBe concise, mention operational impact, and keep the tone factual.\n\nSystem latency increased above 300ms for the inference service.",
"cost":{
    "model":"gpt-4.1-mini",
    "input_tokens":66,
    "output_tokens":46,
    "total_tokens":112,
    "input_cost_usd":"0.000026",
    "output_cost_usd":"0.000074",
    "total_cost_usd":"0.000100",
    "unit":"USD"
},
"evaluation":{
    "status":"pass",
    "reliability_score":1.0,
    "metrics":[
        {"name":"non_empty","score":1.0,"passed":true,"reason":null},
        {"name":"max_length","score":1.0,"passed":true,"reason":null}
    ]
}}

Now prompt optimization becomes measurable rather than guesswork.


7. Lessons Learned

Building a prompt registry revealed a few important lessons.

1. Prompts evolve quickly

Even small systems accumulate many prompt variations.

2. Reproducibility matters early

Without prompt versioning, evaluation results become meaningless.

3. Prompt identity simplifies debugging

When responses change, engineers can immediately identify the cause.

4. Prompts should live outside business logic

Separating prompts from application code improves maintainability.


8. The Code

The implementation described in this article is part of an open-source project called Maester.

Maester is a lightweight toolkit focused on AI API reliability, including:

  • Model gateway routing
  • Cost metering
  • Observability
  • Evaluation pipelines
  • Prompt registry

Repository: maester

The goal is to explore how production AI systems can remain observable, reproducible, and resilient as they grow.




Note: This article was originally published on my engineering blog, where I’m documenting the design of Maester, a production AI SaaS infrastructure system built in public. Original post: Prompt as Code — Build Prompt Registry with Versioning
