You write a prompt. It works great. You ship it. By Friday, same prompt, same model, garbage output.
You didn't change anything. So what happened?
## The 3 Hidden Variables

### 1. Context Drift
Your prompt depends on surrounding context — system messages, previous turns, injected files. As your codebase evolves, that context changes.
- Monday: the system prompt includes a 200-line API spec (v2.1)
- Friday: someone has updated the spec to v2.3, adding 80 lines
The prompt didn't change. The context did. The model now has different instructions competing for attention.
Fix: Pin your context. Version your system prompts the same way you version code.
```text
# Track prompt context in git
prompts/
├── code-review.md       # v1.2
├── code-review.ctx.md   # Pinned context snapshot
└── CHANGELOG.md
```
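You can catch drift automatically by fingerprinting the live context against the pinned snapshot. A minimal sketch, assuming the hypothetical layout above (the file paths and function names are illustrative, not a real tool):

```python
import hashlib
from pathlib import Path

def context_fingerprint(path):
    """Return a short SHA-256 fingerprint of a context file."""
    data = Path(path).read_bytes()
    return hashlib.sha256(data).hexdigest()[:12]

def check_drift(live_path, pinned_path):
    """Warn when the live context no longer matches the pinned snapshot."""
    live = context_fingerprint(live_path)
    pinned = context_fingerprint(pinned_path)
    if live != pinned:
        print(f"Context drift: {live_path} is {live}, snapshot is {pinned}")
        return True
    return False
```

Run it in CI: a failing check means your prompt is now operating against context nobody reviewed.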
### 2. Temperature Sampling
Even at temperature 0, most API providers don't guarantee deterministic output. Batch processing, load balancing, and quantization all introduce variance.
The same prompt can produce subtly different outputs on different runs. Most of the time you don't notice. But edge cases compound.
Fix: Add output validation. Don't trust the model to be consistent — verify it:
```python
def validate_output(result):
    required_fields = ["summary", "risk_level", "action_items"]
    for field in required_fields:
        if field not in result:
            raise ValueError(f"Missing required field: {field}")
    if result["risk_level"] not in ["low", "medium", "high"]:
        raise ValueError(f"Invalid risk_level: {result['risk_level']}")
```
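In practice you'd pair validation with a retry loop, so run-to-run variance gets caught instead of shipped. A self-contained sketch, where `call_model` is a stand-in for whatever API client you use:

```python
import json

REQUIRED = ["summary", "risk_level", "action_items"]

def get_validated_output(call_model, prompt, max_attempts=3):
    """Call the model, parse JSON, validate it, and retry on failure."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            result = json.loads(raw)
            missing = [f for f in REQUIRED if f not in result]
            if missing:
                raise ValueError(f"Missing fields: {missing}")
            if result["risk_level"] not in ("low", "medium", "high"):
                raise ValueError(f"Invalid risk_level: {result['risk_level']}")
            return result
        except (json.JSONDecodeError, ValueError) as e:
            last_error = e  # transient variance: try again
    raise RuntimeError(f"No valid output after {max_attempts} attempts: {last_error}")
```

Three attempts is arbitrary; the point is that a malformed response becomes a retry, not a silent downstream failure.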
### 3. Upstream Model Updates
Model providers ship silent updates. Fine-tuning adjustments, safety patches, routing changes. Your prompt was optimized for last week's model behavior.
Fix: Run regression tests. Keep 5-10 known-good input/output pairs and check them weekly:
```bash
#!/bin/bash
# prompt-regression.sh
for test in tests/prompt-cases/*.json; do
  expected=$(jq -r '.expected' "$test")
  input=$(jq -r '.input' "$test")
  actual=$(call_api "$input")
  # -F: match as a fixed string, not a regex
  if ! echo "$actual" | grep -qF -- "$expected"; then
    echo "REGRESSION: $test"
    echo "Expected: $expected"
    echo "Got: $actual"
  fi
done
```
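Each test case is just a JSON file pairing an input with a substring the output must contain, matching the `jq` fields the script reads. A hypothetical example:

```json
{
  "input": "Review this function for injection risks and return summary, risk_level, action_items as JSON.",
  "expected": "risk_level"
}
```

Keep the expected values loose: assert on structure and key phrases, not exact wording, or normal sampling variance will flood you with false alarms.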
## The Monday/Friday Pattern
Most prompt failures follow this pattern:
- Monday: Fresh context, recent testing, prompt works
- Wednesday: Context accumulates, minor drift, output slightly off
- Friday: Context is stale, model behavior shifted, output breaks
The solution isn't better prompts. It's prompt ops: treating prompts as production systems that need monitoring, versioning, and regression tests.
Start here: Pick your most important prompt. Write three test cases for it. Run them next Friday. You'll be surprised what you find.