Nerav Doshi

Posted on Jun 22 • Edited on Jul 12 • Originally published at pipelineandprompts.com

Treat Prompts Like Code: A CI Gate for LLM Workflows on OpenShift

#openshift #promptengineering #platformengineering #aiinthestack

🤖 AI in the Stack #4

Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI

⚡ Byte Size Summary

Store prompts as versioned YAML manifests in Git and run them through a three-stage GitHub Actions gate — schema validation, secret scanning with gitleaks, and model policy enforcement — before any LLM call reaches your OpenShift environment

A CI-gated prompt pipeline gives your enterprise auditors a traceable answer to "what prompt was active during the incident window" — without it, the forensic work is manual, billed, and slow

Prompt versioning is necessary but not sufficient: you're versioning one variable in a system with multiple unversioned dependencies, and this article shows you what to do about the rest of them

The Story

I was presenting a prototype at a conference. The demo was built over three weeks of late-night sessions — an AI-assisted operations assistant for OpenShift that could answer runbook-style questions against live cluster state. The architecture was solid. The underlying idea was good.

What wasn't solid was how the prompts were managed. I'd been iterating across Claude, Perplexity, and ChatGPT, copying variations into Apple Notes, losing track of which version produced the output I'd screenshotted for the slides. By week two, I'd abandoned the notes entirely — too much overhead without tooling to support it. By week three, I had prompts scattered across three applications and no way to reliably reproduce the outputs that had looked good during development.

The demo didn't survive contact with live conditions. I pivoted to a vision talk twenty minutes before going on stage.

That was a conference demo. The stakes were a slightly awkward twenty minutes and a problem I told myself I'd fix. But I've since watched the same pattern play out in customer environments where far more was on the line. A hallucinated ROSA HCP OIDC flag suggested live on a customer troubleshooting call, caught only because the customer ran --help and found the flag didn't exist. Engineers pasting kubeconfigs into LLM prompts under pressure because the incident bridge is open and they need an answer faster than the runbook provides. A team of five that validated LLM output manually until the deployment cadence outpaced the validation bandwidth, at which point validation stopped without anyone deciding to stop it.

The corrective response in each case was some version of "stop trusting AI output." That's a reasonable response. It's also the most expensive one — engineers who learned the lesson revert to slow manual methods, and engineers who didn't keep taking the shortcut.

There's a better corrective response. It requires treating prompts the same way you treat every other infrastructure artifact that can cause a production incident.

The Problem

A prompt that reaches a production LLM call is infrastructure. It has the same properties as a Helm values file or a GitHub Actions workflow: it controls runtime behavior, its content directly affects what happens in your environment, and a change to it — intentional or silent — can cause a production incident.

The difference is that nobody is running git diff on it before it runs.

The failure modes are well-understood once you name them:

Drift. Engineers iterate on prompts locally, paste working versions into application code as string literals, and continue iterating. The version in production and the version on someone's laptop diverge without any of the normal signals — no PR, no review, no audit trail.

The forensics gap. An AI-assisted process produces wrong output. Your auditor, your customer, or your incident commander asks: what prompt was active when that happened? Without a versioned artifact and a deployment record, there's no clean answer. The forensic work becomes manual — reviewing chat histories, checking commit logs for string changes, interviewing engineers. That work is billed time, and it delays resolution while the incident is still open.

Credential exposure. Engineers under troubleshooting pressure paste context into LLM prompts: cluster IDs, subscription IDs, kubeconfigs, sometimes tokens. The destination is a provider's input log on infrastructure you don't control, often on a free-tier account with no enterprise data agreements. This is the same behavior that triggers Git secret scanning alerts, but there's no equivalent gate on the LLM input path. A CI-gated prompt workflow where prompts are files in a repo gives you a natural chokepoint: a place to enforce what's allowed in a prompt before it's sent.

Silent model updates. You pin your model name. The provider updates the model behind that name. Your prompt behavior changes. You have no record of what changed because the change happened outside your version control. This is the hardest failure mode to defend against, but at minimum you need to know when your prompts changed — separate from when the model changed — so you can reason about the delta.
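To make the credential exposure case concrete, here is a minimal sketch of the kind of check a gate can run over prompt text. The pattern names and regexes are illustrative assumptions, not the rules from the .gitleaks.toml used in this article, and a real gate should lean on gitleaks rather than a short hand-written list.

```python
import re

# Hypothetical patterns for illustration only. A real gate would rely on
# gitleaks rules rather than this short list.
SECRET_PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "kubeconfig_client_key": re.compile(r"client-key-data:\s*\S+"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9\-._~+/]{20,}=*"),
    # Assumes OpenShift OAuth tokens carry the "sha256~" prefix.
    "openshift_sha256_token": re.compile(r"\bsha256~[A-Za-z0-9_\-]{20,}"),
}


def find_secret_hits(prompt_text: str) -> list[str]:
    """Return the name of every pattern that matches the prompt text."""
    return [
        name
        for name, pattern in SECRET_PATTERNS.items()
        if pattern.search(prompt_text)
    ]
```

An empty list means nothing matched. It does not mean the prompt is clean, which is why the pipeline treats this kind of check as one layer and not the whole defense.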

Why Existing Approaches Fall Short

The most common response is naming conventions: prompt-v1.txt, prompt-v1.2.txt, prompt-final.txt, prompt-final-ACTUALLY-FINAL.txt. That's not versioning. It's a filesystem timestamp with extra steps. There's no enforcement, no review process, no deployment record, and no way to correlate a file version to a specific production event.

The second common response is saving prompts in the AI tool's interface — bookmarked threads, saved presets, custom instructions. This solves the personal convenience problem and makes no contribution to operational governance. Those artifacts are not in your SCM, not auditable by your infosec team, not deployable through your CI system, and not recoverable if the account is suspended or the provider changes their data model.

The third response — the one worth taking seriously because it gets closest to the right answer — is storing prompts in a repository as versioned files. This is necessary. It is not sufficient.

When you version a prompt file, you're versioning one variable in a system with at least four unversioned dependencies:

Model version — you specify a model name; the provider controls when that model is updated
Provider API version — behavioral changes in the completions endpoint are not always surfaced as breaking changes
Temperature and sampling parameters — usually invisible in UI-based tools; engineers often don't know what they're set to
The validation history — the process that produced the prompt is invisible in the final artifact

Saving prompt-v1.2.0.yaml in a Git repo creates the illusion of reproducibility. What you need is a CI gate that enforces what can be in a prompt, validates it before it reaches production, and records the full parameter context — not just the prompt text.
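One way to picture "full parameter context" is a single fingerprint over the prompt text and the parameters that travel with it. The sketch below is an assumption about how that could look. PromptContext, its field names, and context_fingerprint are hypothetical and are not part of the pipeline's scripts.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptContext:
    """Hypothetical record of everything that shapes one LLM call."""

    prompt_text: str
    model: str
    temperature: float
    max_tokens: int
    provider_api_version: str


def context_fingerprint(ctx: PromptContext) -> str:
    """Hash the whole context so a change to any field changes the result."""
    # Sorted keys and fixed separators keep the serialization stable.
    canonical = json.dumps(asdict(ctx), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two calls with identical prompt text but different temperature produce different fingerprints, which is the property a text-only version number lacks. It still cannot detect a provider changing the model behind a name.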

The Architecture

The architecture has three zones:

Developer workspace. Engineers author prompt files as versioned YAML manifests and commit them to the repo. The manifest format enforces that model name, temperature, max tokens, and a changelog are explicit fields — not runtime assumptions. Prompt files live under prompts/ in the repo.
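As a rough illustration of what "explicit fields" means in practice, the sketch below checks a manifest that has already been parsed from YAML into a dict. The key names (model, temperature, max_tokens, changelog, prompt) are my assumptions about the manifest layout, and validate_manifest is a hypothetical stand-in for the checks in validate_prompts.py, not its actual code.

```python
# Assumed manifest keys and their expected types. The real schema may differ.
REQUIRED_FIELDS = {
    "model": str,
    "temperature": (int, float),
    "max_tokens": int,
    "changelog": list,
    "prompt": str,
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of schema errors; an empty list means the manifest passes."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in manifest:
            errors.append(f"missing required field: {field}")
            continue
        value = manifest[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for field: {field}")
    if not errors and not manifest["changelog"]:
        errors.append("changelog must have at least one entry")
    return errors
```

Returning every error at once, instead of failing on the first, keeps the PR feedback loop short: the author sees all the problems in one CI run.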

CI gate (GitHub Actions). A three-job workflow triggers on any pull request that touches prompts/** or .prompt-policy.yaml. The jobs run in parallel: schema validation (validate_prompts.py), secret scanning (gitleaks via gitleaks/gitleaks-action@v2 with a custom .gitleaks.toml), and model policy enforcement (check_model_pins.py against .prompt-policy.yaml). All three must pass for the PR to merge. Branch protection enforces this — the gate can't be bypassed by direct push.
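The model policy job is the least familiar of the three, so here is a minimal sketch of the idea. The policy keys allowed_models and forbidden_suffixes are assumptions about what .prompt-policy.yaml might contain, and check_model_pin is a hypothetical function, not the contents of check_model_pins.py. Both inputs are assumed to be dicts already parsed from YAML.

```python
def check_model_pin(manifest: dict, policy: dict) -> list[str]:
    """Return policy violations for one manifest; empty means it complies."""
    model = manifest.get("model", "")
    violations = []

    # The model must appear on the explicit allowlist.
    if model not in policy.get("allowed_models", []):
        violations.append(f"model not in allowlist: {model!r}")

    # Floating aliases (for example a "-latest" suffix) defeat pinning.
    for suffix in policy.get("forbidden_suffixes", []):
        if model.endswith(suffix):
            violations.append(f"floating alias not allowed: {model!r}")

    return violations
```

A floating alias fails both checks when it is also missing from the allowlist. That is deliberate in this sketch, since the two messages point at different fixes.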

ConfigMap-based deployment (GitHub Actions). On merge to main, a separate sync workflow applies the approved prompts to OpenShift as a single prompt-registry ConfigMap in the ai-workflows namespace. Application pods consume prompts from this ConfigMap via a read-only volume mount, using the prompt-consumer ServiceAccount scoped with least-privilege RBAC. Rollback is a git revert followed by re-sync — same pattern as any GitOps-managed config change.
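For a sense of what the sync step produces, the sketch below assembles the prompt-registry ConfigMap from the files under prompts/. The function names are hypothetical, and I'm assuming one ConfigMap key per prompt file; the actual sync workflow may build the object differently.

```python
from pathlib import Path


def load_prompt_files(prompt_dir: Path) -> dict[str, str]:
    """Read every *.yaml file directly under prompt_dir, keyed by file name."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(prompt_dir.glob("*.yaml"))
    }


def build_configmap(files: dict[str, str]) -> dict:
    """Build a ConfigMap object with one data key per prompt file."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "prompt-registry", "namespace": "ai-workflows"},
        # Sorted keys keep the rendered object stable between runs.
        "data": dict(sorted(files.items())),
    }
```

Serialized with json.dumps, the result can be piped to oc apply -f -, which accepts JSON as well as YAML. Keep in mind that a ConfigMap is capped at 1 MiB, so a single registry object only works while the prompt set stays small.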

The audit trail lives in Git (who changed what and when), GitHub Actions run logs (what validation ran against which SHA), and the ConfigMap's resourceVersion on the cluster. Kubernetes exposes only the current resourceVersion, not a history of past values, so it works as evidence only if the sync run records it at apply time. When someone asks "what prompt was active at 14:32 on incident day," you have a traceable answer: the Git SHA that was on main at that time, the Actions run that validated it, and the ConfigMap resourceVersion recorded for that sync.
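The "what was active at 14:32" question reduces to a lookup over deployment records. Here is a minimal sketch, assuming you have exported a time-sorted list of (timestamp, Git SHA) pairs from the sync workflow's run history. The record shape and the active_sha_at function are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional


def active_sha_at(
    deployments: list[tuple[datetime, str]], when: datetime
) -> Optional[str]:
    """Return the SHA of the latest deployment at or before `when`.

    `deployments` must be sorted ascending by timestamp and use
    timezone-aware datetimes so comparisons are unambiguous.
    """
    timestamps = [ts for ts, _ in deployments]
    # Index of the first deployment strictly after `when`.
    idx = bisect_right(timestamps, when)
    return deployments[idx - 1][1] if idx else None
```

A result of None means nothing had been deployed yet at that time, which is itself a useful answer during an incident review.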

Implementation

Prerequisites

OpenShift 4.14+ with oc CLI 4.14+
GitHub repository with Actions enabled
Branch protection on main requiring status checks: schema-validate, secret-scan, model-pin-check
Python 3.11+ (for local validation runs)
gitleaks 8.x (for local secret scanning before push)
Two GitHub repository secrets configured: OPENSHIFT_SERVER and OPENSHIFT_TOKEN

Create the target namespace and apply RBAC before running the sync workflow:


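```bash
# Create the namespace that will hold the prompt-registry ConfigMap
oc create namespace ai-workflows
# Apply the RBAC manifest from the repo
oc apply -f manifests/rbac.yaml
```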
