Charlie Hadley

Posted on May 18

Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)

#llm #startup #ai #productivity

You've tested your LLM feature manually. It looks great. You ship it.

Three days later, a user reports the output is completely wrong. You dig in, and realise: you changed a prompt last week, and that change broke something subtle you never tested.

This is the most common failure mode for indie developers shipping LLM features. And it's entirely preventable.

The Root Cause: Probabilistic Systems Need Deterministic Tests

Traditional software has a nice property: given the same input, you get the same output. You write a unit test, it passes, you ship with confidence.

LLMs break this property. The same input can produce different outputs from one run to the next. Quality can degrade gradually as you tweak prompts. Providers update models. Context windows fill up differently.

You can't test LLM systems the same way you test regular code.

What Actually Works: Rubric-Based Evaluation

Instead of "does this output look right?", define quality as a concrete rubric:

Attribute	Description	Scale
Correctness	Is the answer factually accurate?	0–10
Conciseness	Does it avoid unnecessary verbosity?	0–10
Hallucination Risk	Does it cite things it can't know?	0–10
Tone	Does it match the expected register?	0–10
Usefulness	Would a real user find this helpful?	0–10

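A judge model (GPT-4o-mini at ~$0.0001/call) scores each output against this rubric automatically. Score every attribute so that 10 is best (for Hallucination Risk, 10 means no unsupported claims), so the attributes can be combined into one composite score. Run 50 test cases, aggregate the scores, and if the composite score drops below a threshold, the PR fails.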

This is eval-as-code.
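Here's a minimal sketch of the aggregation step. It assumes the judge has already returned one 0–10 score per rubric attribute for each test case; the judge call itself is left out. The names `RUBRIC`, `composite_score` and `passes` are mine, and a plain unweighted average is just one reasonable way to build the composite.

```python
from statistics import mean

# Rubric attributes from the table above; each is scored 0-10 by the judge.
RUBRIC = ["correctness", "conciseness", "hallucination_risk", "tone", "usefulness"]


def composite_score(case_scores: list[dict[str, float]]) -> float:
    """Average the rubric attributes within each test case, then across cases.

    Assumes every attribute is oriented so that 10 is best.
    """
    if not case_scores:
        raise ValueError("no scored test cases")
    per_case = [mean(scores[attr] for attr in RUBRIC) for scores in case_scores]
    return mean(per_case)


def passes(case_scores: list[dict[str, float]], min_score: float = 7.5) -> bool:
    """Return True when the composite score meets the threshold."""
    return composite_score(case_scores) >= min_score
```

If some attributes matter more for your feature (correctness usually does), swapping the inner `mean` for a weighted average is a small change.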

The Golden Dataset Problem

The hardest part is building test cases. Here's the key insight most guides miss:

Start with failures, not successes.

Every time your LLM makes a mistake in production or testing:

1. Save the input
2. Write down what the correct output should have been
3. Add it to golden_dataset.json

After 2–3 weeks, you'll have 30–50 test cases that represent real failure modes — far more valuable than synthetic examples you invented. A golden dataset built from real failures will catch real regressions.
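Capturing a failure can be a one-liner if you keep a small helper around. This is a sketch, assuming golden_dataset.json holds a JSON list of objects with `input` and `expected` keys; that schema and the `add_failure` name are my assumptions, so adapt them to whatever your eval runner reads.

```python
import json
from pathlib import Path


def add_failure(input_text: str, expected_output: str,
                path: str = "golden_dataset.json") -> int:
    """Append one failure case to the dataset and return the new case count."""
    dataset_path = Path(path)
    # Start a new list when the file does not exist yet.
    if dataset_path.exists():
        cases = json.loads(dataset_path.read_text(encoding="utf-8"))
    else:
        cases = []
    cases.append({"input": input_text, "expected": expected_output})
    dataset_path.write_text(
        json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(cases)
```

Committing the file alongside your prompts means every prompt change is reviewed against the same set of known failures.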

Running This in GitHub Actions

Here's the minimal CI integration:

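name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes your dependencies are listed in requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evals
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check threshold
        run: python check_threshold.py --min-score 7.5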

If the aggregate score drops below 7.5, check_threshold.py exits with code 1 and the check fails. Mark the check as required in your branch protection rules and the PR is blocked. A simple, fixed gate on a probabilistic system.

Total cost to run 50 evals: about £0.20, depending on prompt length and the model under test.
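The threshold gate itself can be very small. Here's a sketch of what a check_threshold.py could look like. It assumes run_evals.py writes a JSON list of per-case scores to a file; the `eval_results.json` name and the `--results` flag are hypothetical, and only `--min-score` comes from the workflow above.

```python
import argparse
import json
from statistics import mean


def main(argv=None) -> int:
    """Return 0 when the aggregate score meets the minimum, 1 otherwise."""
    parser = argparse.ArgumentParser(description="Fail CI when the eval score is too low.")
    parser.add_argument("--min-score", type=float, required=True)
    # Hypothetical file name: assumed to be written by run_evals.py.
    parser.add_argument("--results", default="eval_results.json")
    args = parser.parse_args(argv)

    with open(args.results, encoding="utf-8") as f:
        scores = json.load(f)  # assumed shape: a list of per-case scores, 0-10

    aggregate = mean(scores)
    print(f"aggregate score: {aggregate:.2f} (minimum {args.min_score})")
    # A non-zero return value, passed to sys.exit, fails the workflow step.
    return 0 if aggregate >= args.min_score else 1
```

To use it as a script, end the file with `sys.exit(main())` under an `if __name__ == "__main__":` guard (and `import sys`), so the return value becomes the process exit code that GitHub Actions checks.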

Multi-Model Comparison Before You Commit

Before paying for GPT-4o, run your eval suite across providers:

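# run_eval_suite, calculate_cost, golden_dataset and token_count come from your own eval harness.
# Check each provider's docs for the exact model IDs.
models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku", "gemini-1.5-flash"]
for model in models:
    score = run_eval_suite(model, golden_dataset)  # composite rubric score, 0-10
    cost = calculate_cost(model, token_count)      # cost of the full run, in GBP
    print(f"{model}: score={score:.1f}, cost=£{cost:.3f}")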

You'll often find that Claude Haiku or GPT-4o-mini reaches 90% or more of GPT-4o's score at around 20% of the cost. Don't pay for intelligence you don't need.
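Once you have a score and a cost per model, the "good enough at the lowest price" decision can be automated. A rough sketch follows, assuming the results are collected into a dict of `(score, cost)` pairs; the `cheapest_good_enough` name, the 0.9 cut-off and the sample numbers are all mine, and the numbers are made up to show the data shape, not benchmark results.

```python
def cheapest_good_enough(results: dict[str, tuple[float, float]],
                         min_ratio: float = 0.9) -> str:
    """Return the cheapest model scoring at least min_ratio of the best score.

    results maps model name -> (score, cost).
    """
    if not results:
        raise ValueError("no results to compare")
    best_score = max(score for score, _ in results.values())
    # Keep only the models that clear the quality bar, with their costs.
    candidates = {
        model: cost
        for model, (score, cost) in results.items()
        if score >= min_ratio * best_score
    }
    return min(candidates, key=candidates.get)


# Made-up numbers purely to show the shape of the data.
example = {
    "model-a": (8.6, 0.050),
    "model-b": (8.0, 0.010),
    "model-c": (6.9, 0.004),
}
print(cheapest_good_enough(example))  # prints "model-b"
```

Here model-c is cheapest but falls below 90% of the best score, so model-b wins. Tighten or loosen `min_ratio` depending on how much quality you're willing to trade for cost.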

A Real Example

I wrote a system prompt update for a classification feature to improve response formatting. It looked solid in manual testing on 5 examples, but I had accidentally dropped a critical piece of context the model needed.

Without evals: ships to users. Angry tickets. Rollback. Lost trust.

With this setup: CI caught the regression in 4 minutes. PR failed. Fixed the prompt. Shipped cleanly.

That one catch alone justified the entire system.

What I've Packaged

I've turned this into a complete, ready-to-use system — The Indie Hacker's LLM Eval Playbook:

6 golden dataset templates (classification, summarization, retrieval, generation, code review, reasoning)
Complete rubric scoring system in Python (copy-paste ready)
Multi-model comparison script with cost-efficiency ranking
GitHub Actions workflow — drop it in and it works
Cost optimisation guide with real benchmarks

£29 one-time. One prevented production incident pays for it 10× over.

Questions about implementing this? Drop them in the comments.
