Prompt regression testing in CI: a 5-minute setup

#ai #testing #cicd #llm

Your code has tests. Your code has a CI pipeline. A bad change can't merge
without going green.

Your prompts? Vibes. A teammate edits the system prompt to fix one customer
complaint, output quality drops 8% on the other 99% of cases, nobody
notices for a month, and the regression eventually surfaces as a
mysterious churn bump in the metrics deck.

This post is the 5-minute setup that closes that gap.

What "tests for prompts" actually means

There are two viable approaches and you need to know which to use when.

Assertion-based. You write code that checks the LLM output against
fixed rules: regex matches, JSON shape validation, field-presence checks,
length bounds. Fast, cheap, deterministic.

Use it when: the output is structured and the contract is rigid. JSON
extraction, classification, function-call payloads, schema-conformant
generation.

LLM-judge. Another LLM compares the candidate output to a baseline and
returns "regressed: yes/no" with a severity score. Slower, costs a few
cents per comparison, handles fuzzy outputs.

Use it when: the output is freeform — summaries, rewrites, creative
generation, anything where two correct answers can look very different.

A mature setup uses both. PromptFork ships the LLM-judge built in (we
chose Claude Haiku at temp 0 with a strict "only flag strictly worse"
rubric); assertions are easy to add yourself in custom test cases.

The 5-minute setup

1. Pin your prompts in version control

prompts/
  summarize_ticket.txt
  extract_email.txt
  classify_intent.txt

Plain text files. Not constants in prompts.py. Not Notion docs. Files
with a git history.

2. Push them to PromptFork

pip install promptfork
export PROMPTFORK_API_KEY=pf_xxxx

for f in prompts/*.txt; do
  name=$(basename "$f" .txt)
  promptfork push "$name" --file "$f" --message "initial commit"
done

This creates v1 of each prompt server-side and gives you a stable identifier.

3. Add test cases

For each prompt, pin 5-30 representative inputs. Real production inputs are
worth 10x synthetic ones.

promptfork add-test summarize_ticket happy_path \
  --input ticket="Order arrived. Loved it." \
  --rubric "summary should be positive and under 20 words"

promptfork add-test summarize_ticket angry_refund \
  --input ticket="3 weeks late, want money back NOW" \
  --rubric "must mention refund and high urgency"

promptfork add-test summarize_ticket edge_garbled \
  --input ticket="hi pls help thx" \
  --rubric "summary should request more info, not invent details"

Three test cases is a starting point. Six is a good baseline. Thirty is
production-grade.

4. Wire the GitHub Action

# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Push current prompts
        env:
          PROMPTFORK_API_KEY: ${{ secrets.PROMPTFORK_API_KEY }}
        run: |
          pip install promptfork
          for f in prompts/*.txt; do
            name=$(basename "$f" .txt)
            promptfork push "$name" --file "$f" \
              --message "PR #${{ github.event.pull_request.number }}"
          done
      - uses: shaunvand/promptfork-cli@v0
        with:
          prompt: summarize_ticket
          baseline: 1
          api-key: ${{ secrets.PROMPTFORK_API_KEY }}

Add the secret at Settings → Secrets → PROMPTFORK_API_KEY. Done.

5. Open a PR that changes a prompt

The action runs, executes your prompt across Claude/GPT/Gemini, has the
LLM-judge compare each output against your baseline version, and posts a
PR comment with the regression report. If anything regresses, the action
exits non-zero, branch protection blocks the merge, the change goes back
for review.

You now have a CI gate for prompts. The same gate you have for code.

What goes in the test suite

After running this on a few projects, the pattern that works:

One happy-path case. "Normal" input, expected output.
One edge case. Empty input, very long input, input in another language, malformed structure.
One adversarial case. Prompt-injection attempt, contradictory instructions, a customer trying to break the bot.

That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.