
Nerav Doshi

Posted on • Originally published at pipelineandprompts.com

Treat Prompts Like Code: A CI Gate for LLM Workflows on OpenShift

🤖 AI in the Stack #4

Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI

⚡ Byte Size Summary

  • Store prompts as versioned YAML manifests in Git and run them through a three-stage GitHub Actions gate (schema validation, secret scanning with gitleaks, and model policy enforcement) before any prompt reaches your OpenShift environment
  • A CI-gated prompt pipeline gives your enterprise auditors a traceable answer to "what prompt was active during the incident window". Without it, the forensic work is manual, billed, and slow
  • Prompt versioning is necessary but not sufficient: you're versioning one variable in a system with multiple unversioned dependencies, and this article shows you what to do about the rest of them

The Story

I was presenting a prototype at a conference. The demo, built over three weeks of late-night sessions, was an AI-assisted operations assistant for OpenShift that could answer runbook-style questions against live cluster state. The architecture was solid. The underlying idea was good.

What wasn't solid was how the prompts were managed. I'd been iterating across Claude, Perplexity, and ChatGPT, copying variations into Apple Notes, losing track of which version produced the output I'd screenshotted for the slides. By week two, I'd abandoned the notes entirely: too much overhead without tooling to support it. By week three, I had prompts scattered across three applications and no way to reliably reproduce the outputs that had looked good during development.

The demo didn't survive contact with live conditions. I pivoted to a vision talk twenty minutes before going on stage.

That was a conference demo. The stakes were a slightly awkward twenty minutes and a lesson I told myself I'd act on. But I've since watched the same pattern play out in customer environments where the stakes were higher than a conference talk. A hallucinated ROSA HCP OIDC flag suggested live on a customer troubleshooting call, caught only when the customer ran --help and found the flag didn't exist. Engineers pasting kubeconfigs into LLM prompts under pressure because the incident bridge is open and they need an answer faster than the runbook provides. A team of five that validated LLM output manually until the deployment cadence outpaced the validation bandwidth, at which point validation stopped without anyone deciding to stop it.

The corrective response in each case was some version of "stop trusting AI output." That's a reasonable response. It's also the most expensive one: engineers who learned the lesson revert to slow manual methods, and engineers who didn't keep taking the shortcut.

There's a better corrective response. It requires treating prompts the same way you treat every other infrastructure artifact that can cause a production incident.


The Problem

A prompt that reaches a production LLM call is infrastructure. It has the same properties as a Helm values file or a GitHub Actions workflow: it controls runtime behavior, its content directly affects what happens in your environment, and a change to it, intentional or silent, can cause a production incident.

The difference is that nobody is running git diff on it before it runs.

The failure modes are well-understood once you name them:

Drift. Engineers iterate on prompts locally, paste working versions into application code as string literals, and continue iterating. The version in production and the version on someone's laptop diverge without any of the normal signals: no PR, no review, no audit trail.

The forensics gap. An AI-assisted process produces wrong output. Your auditor, your customer, or your incident commander asks: what prompt was active when that happened? Without a versioned artifact and a deployment record, there's no clean answer. The forensic work becomes manual: reviewing chat histories, checking commit logs for string changes, interviewing engineers. That work is billed time, and it delays resolution while the incident is still open.

Credential exposure. Engineers under troubleshooting pressure paste context into LLM prompts: cluster IDs, subscription IDs, kubeconfigs, sometimes tokens. The destination is a provider's input log on infrastructure you don't control, often on a free-tier account with no enterprise data agreements. This is the same behavior that triggers Git secret scanning alerts, but there's no equivalent gate on the LLM input path. A CI-gated workflow where prompts are files in a repo gives you a natural chokepoint for enforcing what's allowed in a prompt before it's sent.

Silent model updates. You pin your model name. The provider updates the model behind that name. Your prompt behavior changes. You have no record of what changed because the change happened outside your version control. This is the hardest failure mode to defend against, but at minimum you need to know when your prompts changed, separately from when the model changed, so you can reason about the delta.


Why Existing Approaches Fall Short

The most common response is naming conventions: prompt-v1.txt, prompt-v1.2.txt, prompt-final.txt, prompt-final-ACTUALLY-FINAL.txt. That's not versioning. It's a filesystem timestamp with extra steps. There's no enforcement, no review process, no deployment record, and no way to correlate a file version to a specific production event.

The second common response is saving prompts in the AI tool's interface: bookmarked threads, saved presets, custom instructions. This solves the personal convenience problem and makes no contribution to operational governance. Those artifacts are not in your SCM, not auditable by your infosec team, not deployable through your CI system, and not recoverable if the account is suspended or the provider changes their data model.

The third response, the one worth taking seriously because it gets closest to the right answer, is storing prompts in a repository as versioned files. This is necessary. It is not sufficient.

When you version a prompt file, you're versioning one variable in a system with at least four unversioned dependencies:

  1. Model version: you specify a model name; the provider controls when that model is updated
  2. Provider API version: behavioral changes in the completions endpoint are not always surfaced as breaking changes
  3. Temperature and sampling parameters: usually invisible in UI-based tools; engineers often don't know what they're set to
  4. The validation history: the process that produced the prompt is invisible in the final artifact

Saving prompt-v1.2.0.yaml in a Git repo creates the illusion of reproducibility. What you need is a CI gate that enforces what can be in a prompt, validates it before it reaches production, and records the full parameter context, not just the prompt text.
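
One way to make "the full parameter context" concrete is to fingerprint the prompt text together with the parameters it is called with. The sketch below is a minimal illustration, not code from the repo: the function name prompt_fingerprint and the choice of fields are assumptions based on the manifest fields used in this article.

```python
import hashlib
import json


def prompt_fingerprint(system: str, user_template: str, model: str,
                       temperature: float, max_tokens: int) -> str:
    """Return a stable SHA-256 digest over the prompt text and its call parameters."""
    record = {
        "system": system,
        "user_template": user_template,
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    # sort_keys and fixed separators keep the serialization deterministic,
    # so the same inputs always produce the same digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging a digest like this next to each LLM call would let you tell a prompt change apart from a parameter change. It says nothing about provider-side model updates, which stay outside your version control.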


The Architecture

CI-Gated Prompt Pipeline on OpenShift

The architecture has three zones:

Developer workspace. Engineers author prompt files as versioned YAML manifests and commit them to the repo. The manifest format enforces that model name, temperature, max tokens, and a changelog are explicit fields, not runtime assumptions. Prompt files live under prompts/ in the repo.

CI gate (GitHub Actions). A three-job workflow triggers on any pull request that touches prompts/** or .prompt-policy.yaml, and again on pushes to main that touch the same paths. The jobs run in parallel: schema validation (validate_prompts.py), secret scanning (gitleaks via gitleaks/gitleaks-action@v2 with a custom .gitleaks.toml), and model policy enforcement (check_model_pins.py against .prompt-policy.yaml). All three must pass for the PR to merge. Branch protection enforces this, so the gate can't be bypassed by direct push.

ConfigMap-based deployment (GitHub Actions). On merge to main, a separate sync workflow applies the approved prompts to OpenShift as a single prompt-registry ConfigMap in the ai-workflows namespace. Application pods consume prompts from this ConfigMap via a read-only volume mount, using the prompt-consumer ServiceAccount scoped with least-privilege RBAC. Rollback is a git revert followed by re-sync, the same pattern as any GitOps-managed config change.

The audit trail lives in Git (who changed what and when), GitHub Actions run logs (what validation ran against which SHA), and the ConfigMap's resourceVersion, which the sync workflow prints after each apply. The cluster keeps only the current resourceVersion, so the run logs are where that history accumulates. When someone asks "what prompt was active at 14:32 on incident day," you have a traceable answer: the Git SHA that was on main at that time, the Actions run that validated it, and the ConfigMap resourceVersion that matches.
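
The lookup behind that answer can be sketched as a search over sync records. This is a hypothetical helper, not part of the repo: SyncRecord and active_at are illustrative names, and the records are assumed to have been collected from the sync workflow's run logs.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class SyncRecord(NamedTuple):
    applied_at: datetime       # when the sync workflow applied the ConfigMap
    git_sha: str               # commit on main that was synced
    resource_version: str      # value printed by the verify step


def active_at(records: list[SyncRecord], moment: datetime) -> Optional[SyncRecord]:
    """Return the most recent sync at or before `moment`, or None if there is none."""
    ordered = sorted(records)  # NamedTuples sort by applied_at first
    times = [r.applied_at for r in ordered]
    index = bisect_right(times, moment)
    return ordered[index - 1] if index else None
```

Note that this reports when the ConfigMap changed, not when every pod picked the change up; the gap between the two depends on how the application consumes the ConfigMap.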


Implementation

Prerequisites

  • OpenShift 4.14+ with oc CLI 4.14+
  • GitHub repository with Actions enabled
  • Branch protection on main requiring status checks: schema-validate, secret-scan, model-pin-check
  • Python 3.11+ (for local validation runs)
  • gitleaks 8.x (for local secret scanning before push)
  • Two GitHub repository secrets configured: OPENSHIFT_SERVER and OPENSHIFT_TOKEN

Create the target namespace and apply RBAC before running the sync workflow:

oc create namespace ai-workflows
oc apply -f manifests/rbac.yaml

Step 1: Define the Prompt Manifest Schema

Prompts are YAML manifests, not plain text. The schema enforces that every prompt carries its full parameter context:

# prompts/rosa-hcp-deploy.yaml
apiVersion: prompts.ai/v1
kind: PromptManifest
metadata:
  name: rosa-hcp-deploy
  version: "1.2.0"
  description: "Generates ROSA HCP cluster deployment commands from user requirements"
  tags:
    - infrastructure
    - rosa
    - deployment
spec:
  model: claude-sonnet-4-6
  temperature: 0.2
  max_tokens: 2048
  system: |
    You are a Red Hat OpenShift Service on AWS (ROSA) expert. Generate ROSA HCP deployment commands only.

    Requirements:
    - Output valid `rosa create cluster` commands with HCP flags
    - Use only flags available in ROSA CLI 1.2.x
    - Never include credentials, tokens, or AWS keys in the output
    - Refuse requests that contain credential patterns (AWS_ACCESS_KEY, aws_secret, tokens)
    - Include --mode=auto for unattended deployment
    - Default to multi-AZ unless single-AZ is explicitly requested
    - Include --sts flag for STS-enabled clusters

    Output format:
    ```bash
    rosa create cluster --cluster-name=<name> [options]
    ```
  user_template: |
    Generate a ROSA HCP deployment command with these requirements:

    Cluster name: {{cluster_name}}
    Region: {{region}}
    Compute nodes: {{compute_nodes}}
    Instance type: {{instance_type}}
    {% if availability_zones %}Availability zones: {{availability_zones}}{% endif %}
    {% if version %}OpenShift version: {{version}}{% endif %}

    Additional requirements:
    {{additional_requirements}}

The tags field drives domain-based ConfigMap splitting for large prompt sets (see scripts/split_registry.py in the repo). The version field in metadata is what your auditor queries.
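
The article does not show the body of validate_prompts.py, so the following is only a sketch of the kind of check such a script might run once a manifest has been parsed (for example with PyYAML's yaml.safe_load). The function name validate_manifest, the field list, and the MAJOR.MINOR.PATCH version rule are assumptions drawn from the example manifest above.

```python
import re

# Assumption: versions follow MAJOR.MINOR.PATCH, as in the example manifest.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")

# Assumption: these are the spec fields the gate requires, with their types.
SPEC_TYPES = {
    "model": str,
    "temperature": (int, float),
    "max_tokens": int,
    "system": str,
    "user_template": str,
}


def validate_manifest(doc: dict) -> list[str]:
    """Return human-readable errors; an empty list means the manifest passed."""
    errors = []
    if doc.get("kind") != "PromptManifest":
        errors.append("kind must be PromptManifest")

    metadata = doc.get("metadata")
    metadata = metadata if isinstance(metadata, dict) else {}
    if not isinstance(metadata.get("name"), str) or not metadata["name"]:
        errors.append("metadata.name must be a non-empty string")
    version = metadata.get("version")
    # An unquoted YAML value such as 1.2 parses as a float, not a string.
    if not isinstance(version, str) or not SEMVER.match(version):
        errors.append("metadata.version must be a quoted MAJOR.MINOR.PATCH string")

    spec = doc.get("spec")
    spec = spec if isinstance(spec, dict) else {}
    for field, expected in SPEC_TYPES.items():
        value = spec.get(field)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"spec.{field} is missing or has the wrong type")
    return errors
```

Returning a list of errors rather than raising on the first one lets a CI job report every problem in a manifest in a single run.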

Step 2: Define the Model Policy

Approved models live in .prompt-policy.yaml at the repo root. The model policy check runs as a required CI gate: a PR that references an unapproved model string cannot merge:

# .prompt-policy.yaml
# Last reviewed: 2026-06-11
# Review cadence: monthly (model strings change without notice)

approved_models:
  - claude-sonnet-4-6
  - claude-haiku-4-5-20251001
  - gpt-4.1
  - gpt-4.1-mini

This is the operational answer to the silent model update problem. It is not a full solution, but it is a forcing function. Any model not on the approved list can't be deployed through this gate. Updating the list requires a PR, which means a review, which means a record.
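
The core of a policy check like this is a set-membership test. The sketch below assumes the manifests and the policy file have already been parsed into dictionaries; find_unapproved is a hypothetical name and is not taken from check_model_pins.py.

```python
def find_unapproved(manifests: dict[str, dict], approved_models: list[str]) -> list[str]:
    """Return one violation message per manifest whose model is not approved.

    `manifests` maps a file path to its parsed manifest; `approved_models`
    is the approved_models list from .prompt-policy.yaml.
    """
    approved = set(approved_models)
    violations = []
    for path, doc in sorted(manifests.items()):
        spec = doc.get("spec")
        model = spec.get("model") if isinstance(spec, dict) else None
        # A missing model is treated as a violation, not silently skipped.
        if model not in approved:
            violations.append(f"{path}: model {model!r} is not in approved_models")
    return violations
```

A CI wrapper would print the violations and exit non-zero when the list is non-empty, which is what turns the policy file into a required status check.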

Step 3: The CI Gate Workflow

The gate runs three parallel jobs on every PR touching prompts/** or .prompt-policy.yaml, and again on pushes to main that touch the same paths:

# .github/workflows/prompt-gate.yml
name: Prompt Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - '.prompt-policy.yaml'
  push:
    branches:
      - main
    paths:
      - 'prompts/**'
      - '.prompt-policy.yaml'

jobs:
  schema-validate:
    name: Schema Validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install pyyaml
      - run: python scripts/validate_prompts.py prompts/

  secret-scan:
    name: Secret Scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITLEAKS_CONFIG: .gitleaks.toml

  model-pin-check:
    name: Model Policy Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install pyyaml
      - run: python scripts/check_model_pins.py prompts/ .prompt-policy.yaml

All three jobs are required status checks in branch protection. A PR can't merge if any of them fails. That is a matter of enforcement, not convention.

Step 4: Secret Scanning with gitleaks

The .gitleaks.toml extends the default gitleaks ruleset with OpenShift- and cloud-specific patterns:

# .gitleaks.toml
[extend]
useDefault = true

[[rules]]
id = "openshift-api-token"
description = "OpenShift API token (sha256~ prefix)"
regex = '''sha256~[A-Za-z0-9_-]{43}'''
tags = ["openshift", "token", "kubernetes"]

[[rules]]
id = "kubeconfig-fragment"
description = "Kubeconfig fragment detection"
regex = '''(clusters:|users:|contexts:)\s*\n\s*-\s+'''
tags = ["kubernetes", "kubeconfig"]

[[rules]]
id = "azure-subscription-id"
description = "Azure Subscription ID (GUID format)"
regex = '''[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'''
tags = ["azure", "subscription"]

[allowlist]
paths = [
  '''test/fixtures/.*'''
]

The kubeconfig fragment rule is the one that catches the failure mode that actually happens in practice: engineers pasting cluster context directly into a prompt's system field during an incident. The GUID rule generates false positives on UUIDs embedded in example outputs; tune the allowlist for your environment.
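
Before committing custom rules, it can help to exercise the patterns against sample strings. The sketch below copies the three regexes from .gitleaks.toml into Python's re module. gitleaks itself uses Go's regexp engine, so this is only a quick local approximation, and matching_rules is a hypothetical helper name.

```python
import re

# Patterns copied from the custom rules in .gitleaks.toml above.
RULES = {
    "openshift-api-token": re.compile(r"sha256~[A-Za-z0-9_-]{43}"),
    "kubeconfig-fragment": re.compile(r"(clusters:|users:|contexts:)\s*\n\s*-\s+"),
    "azure-subscription-id": re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    ),
}


def matching_rules(text: str) -> list[str]:
    """Return the IDs of every rule whose pattern appears anywhere in `text`."""
    return [rule_id for rule_id, pattern in RULES.items() if pattern.search(text)]
```

Running the same function over known-safe template text is a cheap way to see how often the GUID rule would fire on your own prompts before you tune the allowlist.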

Step 5: Sync Approved Prompts to OpenShift

On merge to main, a separate workflow syncs the prompts/ directory to OpenShift as a single ConfigMap in ai-workflows:

# .github/workflows/sync-prompts.yml
name: Sync Prompts to OpenShift

on:
  push:
    branches:
      - main
    paths:
      - 'prompts/**'

jobs:
  sync:
    name: Sync to ConfigMap
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: redhat-actions/openshift-tools-installer@v1
        with:
          oc: "4.14"

      - uses: redhat-actions/oc-login@v1
        with:
          openshift_server_url: ${{ secrets.OPENSHIFT_SERVER }}
          openshift_token: ${{ secrets.OPENSHIFT_TOKEN }}
          insecure_skip_tls_verify: true

      - name: Sync prompts to ConfigMap
        run: |
          oc create configmap prompt-registry \
            --from-file=prompts/ \
            --dry-run=client \
            -o yaml \
            -n ai-workflows | \
          oc apply -f -

      - name: Verify sync
        run: |
          oc get configmap prompt-registry -n ai-workflows \
            -o jsonpath='{.metadata.resourceVersion}'

The --dry-run=client -o yaml | oc apply -f - pattern is idempotent: it is safe to re-run and produces no diff on unchanged content. The resourceVersion output in the verify step is what you record in your incident timeline.

Step 6: RBAC for Prompt Consumers

Application pods read from the ConfigMap using a scoped ServiceAccount. The RBAC is locked to the named ConfigMap, not namespace-wide ConfigMap read access:

# manifests/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prompt-consumer
  namespace: ai-workflows
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prompt-registry-reader
  namespace: ai-workflows
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["prompt-registry"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prompt-consumer-binding
  namespace: ai-workflows
subjects:
  - kind: ServiceAccount
    name: prompt-consumer
    namespace: ai-workflows
roleRef:
  kind: Role
  name: prompt-registry-reader
  apiGroup: rbac.authorization.k8s.io

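resourceNames: ["prompt-registry"] constrains the Role to the specific ConfigMap, so the ServiceAccount can't enumerate or read other ConfigMaps in the namespace through the API. Keep in mind that this Role governs API reads only: a ConfigMap volume mount is populated by the kubelet and does not depend on the pod's ServiceAccount permissions.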

Step 7: Querying the Audit Trail

When an incident requires forensic review, scripts/audit_query.sh queries OpenShift audit logs for ConfigMap access in the ai-workflows namespace:

# Query ConfigMap access for a specific time window
./scripts/audit_query.sh 2026-06-11T14:00:00Z 2026-06-11T15:00:00Z

This produces a structured table of timestamps, users, verbs, and HTTP response codes from the OpenShift API audit log, the same log that your SOC team queries for other cluster activity. The prompt access trail lives in the same audit infrastructure as the rest of your cluster, not in a separate system.
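
The contents of audit_query.sh are not shown here, so the following is a sketch of the filtering step only, written in Python. It assumes the audit log has already been exported as one JSON event per line and uses the standard Kubernetes audit event fields (requestReceivedTimestamp, verb, user.username, objectRef, responseStatus.code). The function names are hypothetical.

```python
import json
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Audit timestamps are RFC 3339; normalise a trailing Z for fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def configmap_access(lines, start: str, end: str,
                     namespace: str = "ai-workflows",
                     name: str = "prompt-registry"):
    """Return (timestamp, user, verb, code) rows for one ConfigMap in [start, end)."""
    window_start, window_end = parse_ts(start), parse_ts(end)
    rows = []
    for line in lines:
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        target = (ref.get("resource"), ref.get("namespace"), ref.get("name"))
        if target != ("configmaps", namespace, name):
            continue  # some other object, or a request without a resource name
        received = event["requestReceivedTimestamp"]
        if window_start <= parse_ts(received) < window_end:
            rows.append((
                received,
                (event.get("user") or {}).get("username"),
                event.get("verb"),
                (event.get("responseStatus") or {}).get("code"),
            ))
    return rows
```

Requests that carry no resource name, such as a namespace-wide list, are skipped by this filter; widen the match if you also want to see attempted enumeration.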


Security Considerations

Secrets in prompts are a category error. The gitleaks gate catches common patterns, but the structural fix is design: prompts contain templates with placeholders, and runtime context injection happens in the application layer, not in the prompt file committed to Git. A prompt file containing a kubeconfig is not a template; it's a credential stored in the wrong place. The user_template field with {{cluster_name}} and {{region}} placeholders in the example manifest shows the correct pattern: dynamic values are injected at call time, not embedded at authoring time.
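
Call-time injection can be sketched as a strict placeholder substitution. This is a simplified illustration with a hypothetical render function: it handles only {{name}} placeholders and does not implement the {% if %} blocks used in the example manifest, which would need a real template engine.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Fill {{name}} placeholders from `values`, failing loudly on a missing key."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing value for placeholder {key!r}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)
```

Raising on a missing key, instead of leaving the placeholder in place, keeps a half-rendered template from being sent to the model.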

RBAC on the ConfigMap. The resourceNames constraint in the Role limits the prompt-consumer ServiceAccount to the named ConfigMap only. Don't widen this to a namespace-level ConfigMap reader. If you're running multiple applications in ai-workflows, give each its own ServiceAccount with access scoped to its specific ConfigMap.

insecure_skip_tls_verify: true in the sync workflow. This is present in the repo for lab use and must be removed for production. Set it to false and ensure your OpenShift API certificate is trusted by the GitHub Actions runner, or configure a trusted CA bundle. Running with TLS verification disabled means the sync workflow is vulnerable to a man-in-the-middle attack on the cluster API endpoint.

What the gate cannot catch. The content scanner catches patterns in the prompt file. It cannot catch prompt injection in user-supplied context. The {{additional_requirements}} variable in the ROSA HCP template is an example of a field where an attacker or an untrusted user could inject instructions. Input validation at the application layer is a separate control this pipeline doesn't provide.

The data residency question. This gate has no dry-run eval step: prompts are validated structurally but not tested against a live LLM endpoint in CI. That's a deliberate choice for environments with strict egress controls. If your cluster or runner can't reach the LLM provider, a live eval step would fail in CI. If your compliance requirements allow it, a dry-run eval against a staging endpoint adds a behavioral signal the schema check can't provide.


Tradeoffs

What you gain. An audit trail. A content gate enforced by branch protection, not convention. Separation between prompt authorship and prompt deployment. The ability to answer "what prompt was active during the incident" with a Git SHA and a ConfigMap resourceVersion.

What you give up. Iteration speed. The rapid prompt development workflow (paste, run, refine, repeat) is incompatible with a CI gate. Engineers used to iterating in a chat interface will experience this as friction. The practical answer is two modes: local iteration with python scripts/validate_prompts.py prompts/ running on every save, and the CI gate for anything that touches the shared ai-workflows namespace.

The silent model update problem is not solved here. You're versioning your prompt. The provider is not versioning their model in a way that surfaces to you. If claude-sonnet-4-6 behaves differently after a provider update, your prompt version hasn't changed but your production behavior has. The model policy file forces approved model strings through a review process. It doesn't give you behavioral stability for a pinned string. What this architecture provides is isolation: "the prompt changed" vs. "the model changed" vs. "both changed." That's not reproducibility, but it's traceable, which is what an auditor needs.

ConfigMap size limits apply. Kubernetes ConfigMaps have a 1 MiB size limit. A single prompt-registry ConfigMap containing all prompts in prompts/ is fine for small teams. For larger prompt sets, scripts/split_registry.py splits by the first metadata tag, generating separate ConfigMaps per domain (prompt-registry-infrastructure, prompt-registry-operations, etc.) before the 1 MiB limit becomes a constraint.
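
The split-by-first-tag idea can be sketched as a grouping step followed by a size check. This is not the code in scripts/split_registry.py: the function names, the input shape, and the "untagged" fallback group are assumptions made for illustration.

```python
from collections import defaultdict

CONFIGMAP_LIMIT_BYTES = 1024 * 1024  # ConfigMap data is capped at 1 MiB


def split_by_first_tag(prompts: dict[str, tuple[list[str], str]]) -> dict[str, dict[str, str]]:
    """Group prompt files into per-domain ConfigMaps keyed by their first tag.

    `prompts` maps a file name to (tags, raw file text).
    """
    groups: dict[str, dict[str, str]] = defaultdict(dict)
    for filename, (tags, text) in prompts.items():
        # Assumption: files without tags land in a catch-all group.
        domain = tags[0] if tags else "untagged"
        groups[f"prompt-registry-{domain}"][filename] = text
    return dict(groups)


def oversized(groups: dict[str, dict[str, str]],
              limit: int = CONFIGMAP_LIMIT_BYTES) -> list[str]:
    """Return the names of groups whose keys plus values exceed `limit` bytes."""
    too_big = []
    for name, data in groups.items():
        size = sum(len(k.encode("utf-8")) + len(v.encode("utf-8"))
                   for k, v in data.items())
        if size > limit:
            too_big.append(name)
    return sorted(too_big)
```

Checking sizes in CI, before the apply, turns a failed oc apply during a sync into a failed PR check with a clearer message.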

Sync is eventual. The ConfigMap updates on push to main. Pods that mount the ConfigMap via a volume see the update without a restart, after a delay of up to the kubelet sync period (60 seconds by default) plus the kubelet's cache propagation delay. Volumes mounted with subPath do not receive updates at all. Pods that read the ConfigMap at startup only see the update after a pod restart. Document which pattern your application uses, because it affects the incident timeline when you're trying to establish exactly when a new prompt version became active.


What I'd Do Differently

The conference demo failure wasn't a tooling problem. It was a discipline problem that tooling would have caught, but only if the tooling had been in place before the iteration started, not retrofitted after the artifacts were scattered across three applications.

The lesson I keep relearning: the CI gate has to be the default path, not the compliance path you add when someone asks why there's no audit trail. That means setting up the repo structure and the GitHub Actions workflows before the first prompt is written.

I'd also be more honest earlier about the "versioning one variable" problem. The first time I saved a prompt as v1.0.0 and felt like I'd solved something, I had. I'd solved the "what text is in this prompt" problem. I hadn't touched the "what model behavior does this text actually produce" problem. Conflating the two led me to overclaim the value of the versioning practice to teams who then felt like they'd addressed their audit exposure when they'd only addressed part of it.

For teams implementing this now: the schema gate and the model policy check are the right starting point. Get prompts out of string literals and into files, and get those files through a validation gate before they reach production. Then, separately, have an honest conversation with your compliance team about what "reproducible" actually means in a system with stochastic components, before your auditor has that conversation with you.


GitHub Repo

Full implementation (prompt manifest schema, GitHub Actions CI gate, gitleaks configuration, ConfigMap sync workflow, RBAC manifests, and audit query script):

agentic-devops/pipelineandprompts-labs: ai-in-the-stack/04-prompt-versioning-ci


What's Next

AI in the Stack #5: This gate validates that a prompt is structurally sound and uses an approved model. "The prompt returned a response" is a weak acceptance criterion for anything beyond a smoke test. The next article covers building an evaluation harness: defining expected output shapes, scoring responses against a rubric, and failing a pipeline on regression. In short, it treats prompt evaluation like a test suite.


Written by Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI
