Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)
You've tested your LLM feature manually. It looks great. You ship it.
Three days later, a user reports the output is completely wrong. You dig in, and realise: you changed a prompt last week, and that change broke something subtle you never tested.
This is one of the most common failure modes for indie developers shipping LLM features. And it's largely preventable.
The Root Cause: Probabilistic Systems Need Deterministic Tests
Traditional software has a nice property: given the same input, you get the same output. You write a unit test, it passes, you ship with confidence.
LLMs break this property. The same input can produce different outputs from one run to the next. Quality can degrade gradually as you tweak prompts. Providers update models underneath you. Context windows fill up differently from request to request.
You can't test LLM systems the same way you test regular code.
What Actually Works: Rubric-Based Evaluation
Instead of "does this output look right?", define quality as a concrete rubric:
| Attribute | Description | Scale |
|---|---|---|
| Correctness | Is the answer factually accurate? | 0–10 |
| Conciseness | Does it avoid unnecessary verbosity? | 0–10 |
| Hallucination Risk | Does it avoid citing things it can't know? (higher = lower risk) | 0–10 |
| Tone | Does it match the expected register? | 0–10 |
| Usefulness | Would a real user find this helpful? | 0–10 |
A judge model (GPT-4o-mini at ~$0.0001/call) scores each output against this rubric automatically. Run 50 test cases, aggregate the scores, and if the composite score drops below a threshold, the PR fails.
This is eval-as-code.
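Here's a minimal sketch of the scoring step, assuming the judge is wrapped in a function that returns one 0–10 score per rubric attribute. The names `RUBRIC`, `composite_score`, `suite_passes` and `stub_judge` are mine, not from any library, and the stub stands in for a real judge-model call. It also assumes every attribute is scored so that higher is better, and that the composite is a plain unweighted mean.

```python
from statistics import mean
from typing import Callable, Dict, List

# Rubric attributes from the table above, each scored 0-10.
# Assumption: every attribute is scored so that higher is better.
RUBRIC = ["correctness", "conciseness", "hallucination_risk", "tone", "usefulness"]

# A judge takes (input, output) and returns one score per rubric attribute.
# In a real setup this would wrap a call to the judge model.
Judge = Callable[[str, str], Dict[str, float]]


def composite_score(scores: Dict[str, float]) -> float:
    """Unweighted mean of the rubric scores for one output."""
    missing = [name for name in RUBRIC if name not in scores]
    if missing:
        raise ValueError(f"judge omitted attributes: {missing}")
    return mean(scores[name] for name in RUBRIC)


def suite_passes(cases: List[dict], judge: Judge, threshold: float) -> bool:
    """Score every case and compare the mean composite to the threshold."""
    composites = [composite_score(judge(c["input"], c["output"])) for c in cases]
    return mean(composites) >= threshold


def stub_judge(prompt: str, output: str) -> Dict[str, float]:
    # Placeholder judge: empty outputs score 0, everything else a flat 8.
    value = 0.0 if not output.strip() else 8.0
    return {name: value for name in RUBRIC}
```

Passing the judge in as a function keeps the aggregation logic testable without spending API credits. You can swap in the real model call later.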
The Golden Dataset Problem
The hardest part is building test cases. Here's the key insight most guides miss:
Start with failures, not successes.
Every time your LLM makes a mistake in production or testing:
- Save the input
- Write down what the correct output should have been
- Add it to golden_dataset.json
After 2–3 weeks, you'll have 30–50 test cases that represent real failure modes — far more valuable than synthetic examples you invented. A golden dataset built from real failures will catch real regressions.
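The capture step can be a few lines of Python. This is a sketch under my own assumptions: the file holds a JSON list, and each case has `input`, `expected` and `note` fields. The original workflow doesn't prescribe a schema, so adapt the field names to whatever your eval runner reads.

```python
import json
from pathlib import Path


def add_failure_case(path: str, model_input: str, expected_output: str, note: str = "") -> int:
    """Append one failure case to a JSON list on disk and return the new case count."""
    dataset_path = Path(path)
    # Start a fresh list if the dataset file does not exist yet.
    if dataset_path.exists():
        cases = json.loads(dataset_path.read_text(encoding="utf-8"))
    else:
        cases = []
    cases.append({"input": model_input, "expected": expected_output, "note": note})
    dataset_path.write_text(
        json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(cases)
```

Calling `add_failure_case("golden_dataset.json", bad_input, correct_output, note="dropped context")` whenever you triage a bug keeps the dataset growing without any extra tooling.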
Running This in GitHub Actions
Here's the minimal CI integration:
    name: LLM Eval
    on: [pull_request]
    jobs:
      eval:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - name: Run evals
            run: python run_evals.py
            env:
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          - name: Check threshold
            run: python check_threshold.py --min-score 7.5
If the aggregate score drops below 7.5, check_threshold.py exits with code 1 and the job fails. With that check marked as required in your branch protection rules, the PR is blocked. Simple, deterministic gating on a probabilistic system.
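For reference, here's roughly what a threshold check like this could look like. It's a sketch, not the packaged script: I'm assuming the eval run wrote a JSON list of per-case scores to a file, and the `--scores-file` flag and `eval_scores.json` name are made up for illustration. Only `--min-score` comes from the workflow above.

```python
import argparse
import json
from statistics import mean
from typing import List, Optional


def aggregate(scores: List[float]) -> float:
    """Mean composite score across all test cases."""
    return mean(scores)


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI when the eval score is too low.")
    parser.add_argument("--min-score", type=float, required=True)
    # Assumption: the eval run wrote a JSON list of per-case scores to this file.
    parser.add_argument("--scores-file", default="eval_scores.json")
    args = parser.parse_args(argv)

    with open(args.scores_file, encoding="utf-8") as handle:
        scores = json.load(handle)

    score = aggregate(scores)
    print(f"aggregate score: {score:.2f} (minimum {args.min_score:.2f})")
    # Return 0 on pass and 1 on fail so the caller can use it as the exit code.
    return 0 if score >= args.min_score else 1
```

To use it as a script, end the file with `sys.exit(main())` (after `import sys`). Any non-zero exit code fails the step in GitHub Actions.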
Total cost to run 50 evals: about £0.20.
Multi-Model Comparison Before You Commit
Before paying for GPT-4o, run your eval suite across providers:
    # run_eval_suite, calculate_cost, golden_dataset and token_count are assumed
    # to come from your own eval harness.
    models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku", "gemini-1.5-flash"]
    for model in models:
        score = run_eval_suite(model, golden_dataset)
        cost = calculate_cost(model, token_count)
        print(f"{model}: score={score:.1f}, cost=£{cost:.3f}")
You'll often find that Claude Haiku or GPT-4o-mini reaches 90% or more of GPT-4o's score at around 20% of the cost. Don't pay for intelligence you don't need.
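Once you have a score and a cost per model, picking the winner can be automated too. The helper below is a hypothetical sketch: `cheapest_good_enough` and its `min_fraction` parameter are my own names, and the 0.9 default simply mirrors the "90% as good" rule of thumb.

```python
from typing import Dict, Optional, Tuple


def cheapest_good_enough(
    results: Dict[str, Tuple[float, float]],
    min_fraction: float = 0.9,
) -> Optional[str]:
    """Return the cheapest model scoring at least min_fraction of the best score.

    results maps model name -> (eval score, cost of running the suite).
    """
    if not results:
        return None
    best_score = max(score for score, _ in results.values())
    # Keep only models that clear the quality bar relative to the best one.
    eligible = {
        model: cost
        for model, (score, cost) in results.items()
        if score >= min_fraction * best_score
    }
    return min(eligible, key=eligible.get) if eligible else None
```

Feed it the scores and costs collected in the comparison loop. Raising `min_fraction` towards 1.0 trades savings for quality.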
A Real Example
I wrote a system prompt update for a classification feature to improve response formatting. It looked solid in manual testing on 5 examples, but I had accidentally dropped a critical piece of context the model needed.
Without evals: ships to users. Angry tickets. Rollback. Lost trust.
With this setup: CI caught the regression in 4 minutes. PR failed. Fixed the prompt. Shipped cleanly.
That one catch alone justified the entire system.
What I've Packaged
I've turned this into a complete, ready-to-use system — The Indie Hacker's LLM Eval Playbook:
- 6 golden dataset templates (classification, summarization, retrieval, generation, code review, reasoning)
- Complete rubric scoring system in Python (copy-paste ready)
- Multi-model comparison script with cost-efficiency ranking
- GitHub Actions workflow — drop it in and it works
- Cost optimisation guide with real benchmarks
£29 one-time. One prevented production incident pays for it 10× over.
Questions about implementing this? Drop them in the comments.