Nova Elvaris

Prompt Unit Tests: 3 Bash Scripts That Catch Regressions Before Deploy

You changed one line in your system prompt and broke three downstream features. No tests caught it because — let’s be honest — you don’t test your prompts.

Here are three dead-simple bash scripts I use to catch prompt regressions before they hit production.

Script 1: The Golden Output Test

This script sends a fixed input to your prompt and diffs the output against a known-good response.

#!/bin/bash
# test-golden.sh — Compare prompt output against golden file

PROMPT_FILE="$1"
INPUT_FILE="$2"
GOLDEN_FILE="$3"

# jq -Rs JSON-encodes the file contents (quotes, newlines, backslashes)
# so they can be embedded safely in the request body.
ACTUAL=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-mini\",
    \"messages\": [
      {\"role\": \"system\", \"content\": $(jq -Rs . < "$PROMPT_FILE")},
      {\"role\": \"user\", \"content\": $(jq -Rs . < "$INPUT_FILE")}
    ],
    \"temperature\": 0
  }" | jq -r '.choices[0].message.content')

echo "$ACTUAL" > /tmp/prompt-test-actual.txt

if diff -q "$GOLDEN_FILE" /tmp/prompt-test-actual.txt > /dev/null 2>&1; then
  echo "✅ PASS: Output matches golden file"
else
  echo "❌ FAIL: Output diverged"
  diff --color "$GOLDEN_FILE" /tmp/prompt-test-actual.txt
  exit 1
fi

Usage:

./test-golden.sh prompts/summarize.txt fixtures/input-1.txt fixtures/expected-1.txt

When to use: After any prompt edit. temperature: 0 makes the output close to deterministic, but identical responses aren't guaranteed across model or API updates — re-run a failing test once before treating it as a regression. Update the golden file intentionally when you want the output to change.
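Exact diffs are brittle against cosmetic drift — trailing whitespace is the usual culprit. A normalization pass before the diff absorbs it. This is a standalone sketch with placeholder file paths, not part of test-golden.sh:

```shell
# Strip trailing whitespace from both files before diffing, so cosmetic
# drift doesn't fail the golden test. Paths here are demo placeholders.
printf 'Summary line one\nSummary line two\n' > /tmp/demo-golden.txt
printf 'Summary line one  \nSummary line two\n' > /tmp/demo-actual.txt  # trailing spaces

sed 's/[[:space:]]*$//' /tmp/demo-golden.txt > /tmp/demo-golden-norm.txt
sed 's/[[:space:]]*$//' /tmp/demo-actual.txt > /tmp/demo-actual-norm.txt

if diff -q /tmp/demo-golden-norm.txt /tmp/demo-actual-norm.txt > /dev/null; then
  echo "match after normalization"
else
  echo "real divergence"
fi
```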

Script 2: The Keyword Gate

Sometimes you don’t need exact match — you just need the output to contain (or NOT contain) specific terms.

#!/bin/bash
# test-keywords.sh — Assert required/forbidden keywords in output

PROMPT_FILE="$1"
INPUT_FILE="$2"
REQUIRED="$3"   # comma-separated: "function,return,async"
FORBIDDEN="$4"  # comma-separated: "TODO,FIXME,undefined"

ACTUAL=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-mini\",
    \"messages\": [
      {\"role\": \"system\", \"content\": $(jq -Rs . < "$PROMPT_FILE")},
      {\"role\": \"user\", \"content\": $(jq -Rs . < "$INPUT_FILE")}
    ],
    \"temperature\": 0
  }" | jq -r '.choices[0].message.content')

PASS=true

IFS=',' read -ra REQ <<< "$REQUIRED"
for keyword in "${REQ[@]}"; do
  if ! echo "$ACTUAL" | grep -qiF -- "$keyword"; then
    echo "❌ Missing required keyword: $keyword"
    PASS=false
  fi
done

IFS=',' read -ra FORB <<< "$FORBIDDEN"
for keyword in "${FORB[@]}"; do
  if echo "$ACTUAL" | grep -qiF -- "$keyword"; then
    echo "❌ Found forbidden keyword: $keyword"
    PASS=false
  fi
done

if $PASS; then
  echo "✅ PASS: All keyword checks passed"
else
  echo "--- Actual output ---"
  echo "$ACTUAL"
  exit 1
fi

Usage:

./test-keywords.sh prompts/code-review.txt fixtures/pr-diff.txt \
  "security,error handling" "looks good,LGTM"
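The keyword logic itself is worth sanity-checking without spending API calls. Here OUT is a canned stand-in for a model response (grep's -F flag treats keywords as literal strings rather than regexes):

```shell
# Dry run of the required/forbidden checks against a fixed string.
OUT="Flagged a potential security issue in the error handling path."

echo "$OUT" | grep -qiF "security" && echo "required 'security': present"
echo "$OUT" | grep -qiF "error handling" && echo "required 'error handling': present"
echo "$OUT" | grep -qiF "LGTM" || echo "forbidden 'LGTM': absent"
```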

Script 3: The Format Validator

For structured output (JSON, YAML, specific sections), validate the shape — not the content.

#!/bin/bash
# test-format.sh — Validate output structure

PROMPT_FILE="$1"
INPUT_FILE="$2"
FORMAT="$3"  # "json" | "has-headers" | "max-lines:N"

ACTUAL=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-4o-mini\",
    \"messages\": [
      {\"role\": \"system\", \"content\": $(jq -Rs . < "$PROMPT_FILE")},
      {\"role\": \"user\", \"content\": $(jq -Rs . < "$INPUT_FILE")}
    ],
    \"temperature\": 0
  }" | jq -r '.choices[0].message.content')

case "$FORMAT" in
  json)
    if echo "$ACTUAL" | jq . > /dev/null 2>&1; then
      echo "✅ PASS: Valid JSON"
    else
      echo "❌ FAIL: Invalid JSON"
      echo "$ACTUAL"
      exit 1
    fi
    ;;
  has-headers)
    if echo "$ACTUAL" | grep -q "^#"; then
      echo "✅ PASS: Contains markdown headers"
    else
      echo "❌ FAIL: No markdown headers found"
      exit 1
    fi
    ;;
  max-lines:*)
    MAX="${FORMAT#max-lines:}"
    LINES=$(echo "$ACTUAL" | wc -l)
    if [ "$LINES" -le "$MAX" ]; then
      echo "✅ PASS: $LINES lines (max: $MAX)"
    else
      echo "❌ FAIL: $LINES lines exceeds max $MAX"
      exit 1
    fi
    ;;
  *)
    echo "Unknown format check: $FORMAT" >&2
    exit 2
    ;;
esac
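Each branch of the case statement can be exercised with canned strings before pointing it at the API — useful when tweaking the checks themselves:

```shell
# Dry run of the three shape checks, no API call needed.
echo '{"status": "ok"}' | jq . > /dev/null 2>&1 && echo "json: valid"

printf '# Summary\ndetails here\n' | grep -q "^#" && echo "has-headers: found"

LINES=$(printf 'line 1\nline 2\nline 3\n' | wc -l)
[ "$LINES" -le 5 ] && echo "max-lines: $LINES of 5 used"
```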

Putting It Together

I run all three in a Makefile:

test-prompts:
    ./test-golden.sh prompts/summarize.txt fixtures/doc-1.txt fixtures/expected-summary-1.txt
    ./test-keywords.sh prompts/review.txt fixtures/pr-1.txt "security,performance" "LGTM"
    ./test-format.sh prompts/extract.txt fixtures/email-1.txt json
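As fixtures accumulate, the Makefile target turns into a wall of near-identical lines. One alternative is a manifest-driven runner — the manifest path, its format, and the second fixture name below are my own convention, not something the scripts require:

```shell
# Sketch: one golden test per manifest line — "<prompt> <input> <golden>".
cat > /tmp/prompt-tests.manifest <<'EOF'
prompts/summarize.txt fixtures/doc-1.txt fixtures/expected-summary-1.txt
prompts/extract.txt fixtures/email-1.txt fixtures/expected-extract-1.txt
EOF

FAILURES=0
while read -r prompt input golden; do
  echo "would run: ./test-golden.sh $prompt $input $golden"
  # Uncomment to actually execute:
  # ./test-golden.sh "$prompt" "$input" "$golden" || FAILURES=$((FAILURES + 1))
done < /tmp/prompt-tests.manifest
echo "failures: $FAILURES"
```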

Hook it into CI:

# .github/workflows/prompt-tests.yml
on:
  push:
    paths: ['prompts/**']
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test-prompts
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
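One gotcha worth guarding against: runs without the secret (forked repos, fresh clones) fail with confusing auth errors rather than a clear message. A small guard at the top of each script makes the failure mode obvious — check_key is an illustrative helper, not part of the scripts above:

```shell
# Sketch: report clearly when no API key is available instead of
# letting curl fail with an opaque 401.
check_key() {
  if [ -z "${1:-}" ]; then
    echo "skip: OPENAI_API_KEY not set"
  else
    echo "run: key present"
  fi
}

check_key ""          # what a run without the secret would see
check_key "sk-demo"   # sk-demo is a placeholder, not a real key
```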

Now every prompt change gets tested automatically. Total setup time: 20 minutes. Regressions caught since I started: seven.

Your prompts are code. Test them like it.
