Tracking Prompt Versions in Your Traces: The Field Everyone Forgets

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Me: xgabriel.com | GitHub

Picture the standup. Support says answer quality dropped sometime "this
week." Your judge scores agree: the rolling average is down. Now somebody
asks the question that ends the meeting badly. Which change did it?

You scroll the deploy history. Four prompt edits, two model alias bumps, a
retrieval config tweak, and a dependency upgrade all landed in that window.
The traces are sitting in your backend. They show tokens, latency, cost,
finish reasons. They do not show which version of the prompt produced each
span. So you cannot draw the line. You cannot say "quality fell off a cliff
at 14:30, which is exactly when prompt v37 shipped." You are reduced to
guessing and rolling back the most recent thing.

The field that would have answered the question is one string per span. It
is the field everyone forgets.

What "prompt version" actually means

A prompt is not one thing. When the model misbehaves, you want to pin down
every input that could have changed. Three attributes cover most of it.

Prompt version. A label you change on every edit: a monotonic counter such as v37, a git short SHA, or a release tag. This is the human-readable handle you put in the runbook.
Template hash. A content hash of the raw template, taken before variable interpolation. Two deploys can claim the same version label and still differ if someone hot-patched a string. The hash does not lie.
Model alias and the resolved snapshot. claude-sonnet-4-5 is an alias. The provider rotates the snapshot behind it. Record both the alias you asked for and the snapshot the response reports, when the provider gives it to you.

Version is what you talk about. Hash is what you trust. Snapshot is the part
that changes without a deploy of yours at all.

Stamp it on every span

The OpenTelemetry GenAI semantic conventions do not yet have a stable
attribute for prompt version, so this goes in a project-prefixed namespace
(app.*), the same way you would handle any custom attribute. Keep the
prefix consistent across services and your backend will index it like any
other field.

Here is a minimal emitter. It assumes the OTel SDK is already initialized.

import hashlib
from opentelemetry import trace

tracer = trace.get_tracer("app.llm")


def template_hash(template: str) -> str:
    raw = template.encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:12]


# Your prompt registry. In real code this loads
# from a file, a DB row, or your config service.
PROMPTS = {
    "support_reply": {
        "version": "v37",
        "template": "You are a support agent. {question}",
    },
}


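def render(name: str, **variables) -> tuple[str, dict]:
    """Return the filled-in prompt plus the span attributes describing it."""
    spec = PROMPTS[name]
    template = spec["template"]
    # Hash the raw template, not the rendered prompt, so the value
    # only changes when the template itself is edited.
    meta = {
        "app.prompt.name": name,
        "app.prompt.version": spec["version"],
        "app.prompt.template_hash": template_hash(template),
    }
    return template.format(**variables), meta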

The hash is taken on the template, not the filled-in prompt. You want to
detect template changes, not the fact that every user asks a different
question. Truncating SHA-256 to 12 hex chars (48 bits) keeps the attribute
short and still distinguishes any two templates you will realistically ship.
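A quick way to convince yourself of the difference is to hash both and compare. The sketch below redefines the same helper so it runs on its own; the two sample questions and the edited template are made up for illustration.

```python
import hashlib


def template_hash(template: str) -> str:
    # Same truncated SHA-256 as the emitter: first 12 hex chars.
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


TEMPLATE = "You are a support agent. {question}"
# Hypothetical edit, only here to show the hash moving.
EDITED = "You are a friendly support agent. {question}"

# Two different user questions render to two different prompts...
prompt_a = TEMPLATE.format(question="Where is my order?")
prompt_b = TEMPLATE.format(question="How do I reset my password?")

# ...but the attribute is computed from the template, so it stays put.
hash_for_a = template_hash(TEMPLATE)
hash_for_b = template_hash(TEMPLATE)
hash_for_edit = template_hash(EDITED)

same_across_questions = hash_for_a == hash_for_b
changes_on_edit = hash_for_a != hash_for_edit
# Hashing the rendered prompt would give a new value per question.
rendered_would_differ = template_hash(prompt_a) != template_hash(prompt_b)

print(same_across_questions, changes_on_edit, rendered_would_differ)
```

If you hashed the rendered prompt instead, the attribute would have roughly one distinct value per request and would be useless as a grouping key.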

Now thread the metadata onto the span that wraps the model call.

def emit_llm_span(name, model, question, usage,
                  judge_score=None):
    """Record one finished model call as a span; return the prompt."""
    prompt, meta = render(name, question=question)

    # The GenAI conventions name spans "{operation} {model}".
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        for key, value in meta.items():
            span.set_attribute(key, value)

        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute(
            "gen_ai.usage.input_tokens", usage["in"]
        )
        span.set_attribute(
            "gen_ai.usage.output_tokens", usage["out"]
        )
        # The judge score is optional; skip it when no eval ran.
        if judge_score is not None:
            span.set_attribute(
                "app.llm.judge.score", judge_score
            )
    return prompt

Three attributes now ride on every LLM span: the prompt name, its version,
and the template hash. The model alias is already there as
gen_ai.request.model. When the provider returns a resolved snapshot in the
response, stamp that too as gen_ai.response.model.
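Here is one way that stamping could look. This is a sketch, not a provider integration: stamp_models is a hypothetical helper, FakeSpan is a stand-in so the example runs without the OTel SDK, and the snapshot string is only an example value. In real code, pass the model name the response actually reports.

```python
def stamp_models(span, requested_alias, response_model=None):
    """Record the alias we asked for and, if known, the snapshot that answered."""
    span.set_attribute("gen_ai.request.model", requested_alias)
    # Not every provider reports a resolved model; skip rather than guess.
    if response_model:
        span.set_attribute("gen_ai.response.model", response_model)


class FakeSpan:
    """Minimal stand-in exposing the one span method this sketch uses."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


span = FakeSpan()
# Example values only; the second argument comes from the response.
stamp_models(span, "claude-sonnet-4-5", "claude-sonnet-4-5-20250929")
print(span.attributes)

span_without_snapshot = FakeSpan()
stamp_models(span_without_snapshot, "claude-sonnet-4-5")
print(span_without_snapshot.attributes)
```

Leaving the attribute off when the provider is silent keeps the field honest: a missing value is easier to reason about than an alias copied into the snapshot slot.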

Correlate the quality drop to a deploy

With the version on the span, the standup question becomes a query. Group
your judge score by prompt version and look at where the line moves.

In PromQL, assuming app.prompt.version rode through to a label on your
judge-score metric (with dots turned into underscores on the way in):

avg by (app_prompt_version) (
  avg_over_time(app_llm_judge_score[1h])
)

You get one series per version. The regression is the series that sits below
the others. No more squinting at a single global average that smears every
version together.

In Datadog's metric query syntax the shape is the same:

avg:app.llm.judge.score{*} by {app.prompt.version}

The point is not the exact query. The point is that the version is a
dimension you can split on. Before you stamped it, every version was averaged
into one number and the regression hid inside it. After, the bad version
stands alone on the chart and names itself.
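If it helps to see the smearing in plain code, here is the same group-by done by hand. The scores are invented sample values, not real measurements, and mean_score_by_version is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows: (prompt version, judge score).
spans = [
    ("v36", 0.90), ("v36", 0.86), ("v36", 0.88),
    ("v37", 0.62), ("v37", 0.58), ("v37", 0.60),
]


def mean_score_by_version(rows):
    """Bucket scores by version label and average each bucket."""
    buckets = defaultdict(list)
    for version, score in rows:
        buckets[version].append(score)
    return {version: mean(scores) for version, scores in buckets.items()}


# One number for everything: the regression is diluted.
global_mean = mean(score for _, score in spans)

# One number per version: the regression is isolated.
per_version = mean_score_by_version(spans)

print(f"global: {global_mean:.2f}")
for version, score in sorted(per_version.items()):
    print(f"{version}: {score:.2f}")
```

The global figure lands between the two versions and describes neither of them, which is exactly why the unsplit chart was no help in the standup.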

Build the evidence, then roll back

The version attribute turns "I think it was the prompt change" into
something you can put in the incident channel. Pull the before-and-after
window and compare the score distribution across the deploy boundary.

from statistics import mean


def scores_for(version: str) -> list[float]:
    # Placeholder: replace with a query against your
    # trace/metrics store, filtered on app.prompt.version.
    raise NotImplementedError


# Compare judge scores for the two versions that
# straddle the deploy at 14:30.
before = scores_for(version="v36")
after = scores_for(version="v37")

drop = mean(before) - mean(after)
print(f"v36 mean: {mean(before):.3f}")
print(f"v37 mean: {mean(after):.3f}")
print(f"drop: {drop:.3f}")

If v37 is the offender, the rollback is obvious and the evidence is
attached. You revert the prompt to v36, and the same chart proves the
recovery within the hour: the judge score climbs back to the old baseline,
and the version label on the recovering spans reads v36 again.

That last part matters more than it sounds. A rollback you cannot measure is
a hope. A rollback where the score returns to baseline and the spans show
the version flipping back is a closed loop. You shipped a fix and you have
the trace to show for it.
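The closed-loop check can be written down as two conditions: the recent spans all carry the rolled-back version, and their mean score is back near the baseline. The sketch below assumes spans arrive as plain dicts keyed by attribute name; rollback_confirmed, the 0.02 tolerance, and the sample scores are all illustrative choices, not values from a real system.

```python
from statistics import mean


def rollback_confirmed(baseline_scores, recent_spans, expected_version,
                       tolerance=0.02):
    """True when recent spans show the expected version and a recovered score."""
    if not baseline_scores or not recent_spans:
        return False  # no evidence either way
    versions = {span["app.prompt.version"] for span in recent_spans}
    if versions != {expected_version}:
        return False  # some traffic is still on another version
    recent_mean = mean(span["app.llm.judge.score"] for span in recent_spans)
    return recent_mean >= mean(baseline_scores) - tolerance


# Invented sample data for the three cases worth distinguishing.
baseline = [0.90, 0.86, 0.88]
recovered = [
    {"app.prompt.version": "v36", "app.llm.judge.score": 0.87},
    {"app.prompt.version": "v36", "app.llm.judge.score": 0.89},
]
still_mixed = recovered + [
    {"app.prompt.version": "v37", "app.llm.judge.score": 0.60},
]
still_low = [
    {"app.prompt.version": "v36", "app.llm.judge.score": 0.60},
    {"app.prompt.version": "v36", "app.llm.judge.score": 0.62},
]

print(rollback_confirmed(baseline, recovered, "v36"))
print(rollback_confirmed(baseline, still_mixed, "v36"))
print(rollback_confirmed(baseline, still_low, "v36"))
```

The third case is the interesting one: the version flipped back but the score did not, which tells you the prompt was not the whole story.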

The template hash catches what the version misses

Version labels drift from reality. Someone edits the template to fix a typo
and forgets to bump the version. Two environments claim v37 but render
different text because a config override only landed in one of them. The
version says they match. They do not.

The hash is the tie-breaker. Group by version and hash, and a split
appears the moment the two diverge.

count:app.llm.judge.score{*}
  by {app.prompt.version, app.prompt.template_hash}

If one version label shows two distinct hashes, you have an untracked edit.
That is a finding on its own, separate from any quality drop. It means your
deploy process let a template change through without a version bump, and the
next regression you investigate will lie to you about which version was
running.
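The same check is easy to run over exported spans. This sketch assumes spans come back as dicts keyed by attribute name; untracked_edits is a hypothetical helper and the hash values are made up.

```python
from collections import defaultdict


def untracked_edits(spans):
    """Return {version: sorted hashes} for versions seen with more than one hash."""
    hashes_by_version = defaultdict(set)
    for span in spans:
        hashes_by_version[span["app.prompt.version"]].add(
            span["app.prompt.template_hash"]
        )
    return {
        version: sorted(hashes)
        for version, hashes in hashes_by_version.items()
        if len(hashes) > 1
    }


# Invented sample spans: v37 shows up with two different template hashes.
spans = [
    {"app.prompt.version": "v36", "app.prompt.template_hash": "aaaaaaaaaaaa"},
    {"app.prompt.version": "v37", "app.prompt.template_hash": "bbbbbbbbbbbb"},
    {"app.prompt.version": "v37", "app.prompt.template_hash": "cccccccccccc"},
    {"app.prompt.version": "v37", "app.prompt.template_hash": "bbbbbbbbbbbb"},
]

findings = untracked_edits(spans)
print(findings)
```

An empty result means every version label maps to exactly one template. Anything else is worth an alert, whether or not the scores moved.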

Where to put the field so it survives

A few habits keep this from rotting.

Stamp the attributes at render time, in one function, not scattered across
call sites. If every team formats prompts their own way, the version field
shows up on some spans and not others, and a partial dimension is worse than
none — the chart silently drops the spans that lack it.

Keep the registry in version control next to the code that reads it. The
whole value of a version label is that it points back to a known commit. A
prompt living in a database row that nobody diffs gives you a number with
nothing behind it.
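Once the registry lives in the repo, you can also make the version bump enforceable. One possible approach, sketched under assumptions: commit a small lock mapping each prompt name to its version and template hash, and have CI flag any template whose hash changed while the label did not. The lock structure and unbumped_edits are hypothetical, not part of any tool.

```python
import hashlib


def template_hash(template: str) -> str:
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def unbumped_edits(registry: dict, lock: dict) -> list[str]:
    """Names whose template hash changed while the version label stayed put."""
    offenders = []
    for name, spec in registry.items():
        locked = lock.get(name)
        if locked is None:
            continue  # new prompt, nothing to compare against
        same_version = spec["version"] == locked["version"]
        hash_changed = template_hash(spec["template"]) != locked["hash"]
        if same_version and hash_changed:
            offenders.append(name)
    return offenders


registry = {
    "support_reply": {
        "version": "v37",
        "template": "You are a support agent. {question}",
    },
}
# The lock as committed alongside v37 (computed here so the sketch runs).
lock = {
    "support_reply": {
        "version": "v37",
        "hash": template_hash("You are a support agent. {question}"),
    },
}
clean = unbumped_edits(registry, lock)

# Someone tweaks the wording but leaves the label at v37.
registry["support_reply"]["template"] = (
    "You are a helpful support agent. {question}"
)
flagged = unbumped_edits(registry, lock)
print(clean, flagged)
```

A check like this fails the build at review time, which is cheaper than discovering the mismatch in a trace during an incident.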

Carry the version through async hops. If the model call happens in a worker
that picked the job off a queue, the version has to travel with the job
payload or the span context. Drop it at the queue boundary and your
back-of-house spans go blank exactly where the hard bugs live.
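The payload route is the simpler of the two. A minimal sketch, using the standard library queue as a stand-in for a real broker; enqueue and worker_attributes are hypothetical names, and the hash value is a placeholder.

```python
import json
import queue

jobs = queue.Queue()  # stand-in for your real broker


def enqueue(question, prompt_meta):
    # Serialize the prompt metadata with the job so it crosses the boundary.
    jobs.put(json.dumps({"question": question, "prompt_meta": prompt_meta}))


def worker_attributes():
    """Pop one job and return the attributes the worker should stamp on its span."""
    job = json.loads(jobs.get_nowait())
    return dict(job["prompt_meta"])


# Producer side: metadata captured at render time (placeholder hash).
meta = {
    "app.prompt.name": "support_reply",
    "app.prompt.version": "v37",
    "app.prompt.template_hash": "0123456789ab",
}
enqueue("Where is my order?", meta)

# Worker side: the same three attributes are available to set on the span.
attrs = worker_attributes()
print(attrs)
```

The worker never looks the version up again. It stamps what the producer rendered with, which stays correct even if the registry moved on while the job sat in the queue.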

The field costs you one string per span and a hash you compute once per
template. It pays for itself the first time someone asks which deploy broke
quality and you answer with a chart instead of a shrug.

If this was useful

The LLM Observability Pocket Guide goes through the full GenAI attribute
set, which custom fields are worth
adding, and how to keep your traces useful when prompts and model aliases
rotate underneath you. It is the reference I reach for when a span is missing
the one field that would have answered the question.