
Originally published at blog.rp1.run

Why we ship untested prompts (and the supply-chain pattern that fixes it)

I'd never approve a PR that bypassed CI.

But I've watched dozens of teams — including ones I've worked on — deploy prompt changes with zero of the verification we'd insist on for a code change. Edit a string in a config file. Push. Hope.

A prompt change is a logic change. It alters how the system behaves under uncertainty, what it returns under load, and how it handles edge cases nobody enumerated. The fact that it's text and not Python doesn't change what it does.

The gap between how we deploy code and how we deploy prompts is going to bite hard as agentic systems scale. And the answer might already exist — in the tooling the supply-chain security world has been building for the last five years.

The supply-chain parallel

Sigstore, SLSA, in-toto. These tools solved a related problem for binaries: how do you cryptographically prove that the artifact in production is the one that passed your checks?

The primitives:

  • Content-addressable hashing. Identify the artifact by the hash of its content. Two artifacts with the same hash are identical, byte-for-byte.
  • Signed attestations. A cryptographic statement: "this hash passed this evaluation, witnessed by this entity."
  • Verification gates. Deployment refuses any artifact without a valid attestation.
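
The first of those primitives is essentially a one-liner. A minimal sketch in Python, using only the standard library:

```python
import hashlib

def content_address(artifact: bytes) -> str:
    # Identify an artifact purely by the hash of its bytes.
    return "sha256:" + hashlib.sha256(artifact).hexdigest()

# Same bytes, same address; one changed character, different address.
assert content_address(b"You are a helpful assistant.") == content_address(b"You are a helpful assistant.")
assert content_address(b"You are a helpful assistant.") != content_address(b"You are a helpful assistant!")
```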

Applied to prompts:

  1. Hash the prompt text. prompt[sha256:abc123...] is now uniquely identifiable.
  2. Run your eval suite against that exact hash.
  3. Generate a signed attestation: "prompt[abc123] passed eval suite v2 on date X."
  4. Production deployment verifies the attestation before promoting.

Now "what prompt is in production?" has an answer that doesn't depend on git archaeology or trusting a config dashboard.

What this doesn't solve

This is the part most discussions of prompt evaluation skip over.

Eval reproducibility is non-trivial when the underlying model version drifts. An attestation from last month against gpt-4o-2024-08-06 doesn't tell you anything about behaviour against gpt-4o-2024-11-20. Either you pin model versions in the attestation (and accept the operational cost of staying on old models), or you re-attest on every model version change (and accept the eval cost). There's no free lunch.
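
One way to keep that tradeoff explicit is to record the model version inside the attested statement itself, so the verification gate can refuse a prompt whose attestation was produced against a different model than the one serving traffic. A sketch of what that might carry — the field names here are illustrative, not any standard:

```python
# Illustrative attested statement that pins the eval context.
statement = {
    "subject": "prompt[sha256:abc123...]",
    "eval_suite": "eval suite v2",
    "model": "gpt-4o-2024-08-06",   # pinned model version the evals ran against
    "timestamp": "2024-09-01T00:00:00Z",
}

def attestation_covers(statement: dict, serving_model: str) -> bool:
    # The gate treats a model bump as "no valid attestation": re-attest or block.
    return statement["model"] == serving_model
```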

There's also the question of whether "passing evals" is actually the right gate. Code passes tests but can still ship bugs. Prompt evals are coarser — they sample behaviour, they don't prove correctness.

The bigger question

Are prompts code or configuration?

Most teams haven't picked, which is why they fall into the worst of both: edited freely like config, executing logic like code. Picking one would mean deciding whether prompts go through a CI pipeline (treated as code) or a configuration management system with rollback (treated as config). Either is better than the current default of "text in a file, deployed by whoever has commit access."

Prem Pillai (@cloud-on-prem) wrote a longer treatment of the architecture and its gaps in an rp1 blog post.

If you're working on prompt evaluation, deployment pipelines for agentic systems, or just struggling with the operational chaos of prompt management at scale — we have a Discord where engineers are talking through these patterns.
