Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

You ship a prompt change to production. Two hours later, a customer complains that your LLM is now returning hallucinated data. You roll back, but you have already lost an hour of revenue.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic: the same input doesn't always produce the same output quality, so a handful of manual checks tells you little about how a prompt behaves across real traffic.

The enterprise options are Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, that budget doesn't exist.

Building Eval-as-Code in GitHub Actions

I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.

Here's the core idea:

Define quality as a rubric, not vibes. Instead of asking "does this look good?", you name the attributes you will score: correctness, conciseness, tone, hallucination risk, usefulness. Aim for 5-10 concrete attributes.
Create golden datasets. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.
Use a cheap judge model. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.
Automate in CI. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.
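
The four steps above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the playbook's actual code: the 1-5 scale, the 4.0 threshold, and the names `GoldenCase`, `run_evals`, `failing_attributes`, and `exit_code` are all assumptions made for the example. The judge is a plain callable so that a real judge-model call can be swapped in; `stub_judge` is an offline stand-in that only checks for an exact match.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Attribute names come from the rubric described above. The 1-5 scale is an
# assumption; every attribute is scored so that higher is better, which means
# a high "hallucination_risk" score stands for LOW risk.
RUBRIC = ["correctness", "conciseness", "tone", "hallucination_risk", "usefulness"]


@dataclass
class GoldenCase:
    prompt: str
    expected: str


# A judge maps (golden case, model output) to one score per rubric attribute.
Judge = Callable[[GoldenCase, str], Dict[str, float]]


def run_evals(
    cases: List[GoldenCase],
    generate: Callable[[str], str],
    judge: Judge,
) -> Dict[str, float]:
    """Run every golden case and average each rubric attribute."""
    per_attr: Dict[str, List[float]] = {attr: [] for attr in RUBRIC}
    for case in cases:
        output = generate(case.prompt)
        scores = judge(case, output)
        for attr in RUBRIC:
            per_attr[attr].append(scores[attr])
    return {attr: mean(values) for attr, values in per_attr.items()}


def failing_attributes(averages: Dict[str, float], threshold: float = 4.0) -> List[str]:
    """Return the attributes whose average score fell below the threshold."""
    return sorted(attr for attr, score in averages.items() if score < threshold)


def exit_code(averages: Dict[str, float], threshold: float = 4.0) -> int:
    """Non-zero when any attribute fails; a CI script would pass this to sys.exit()."""
    return 1 if failing_attributes(averages, threshold) else 0


def stub_judge(case: GoldenCase, output: str) -> Dict[str, float]:
    """Offline stand-in for a judge model: full marks only on an exact match."""
    score = 5.0 if output == case.expected else 2.0
    return {attr: score for attr in RUBRIC}
```

In a CI job, a script would build the golden cases, call `run_evals` with the real generation function and judge, and exit with `exit_code(averages)` so that a drop below the threshold fails the PR.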

Concrete Example From Production

I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.

With evals: CI caught it in 4 minutes. The PR failed. I fixed the prompt, the evals passed, and I shipped with confidence.
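
For a classification prompt, the simplest check of this kind is exact-match accuracy against a golden set. The sketch below is hypothetical: the labels, the four cases, the 0.9 threshold, and both stub classifiers are invented for illustration, with the second stub imitating a prompt that has lost the context it needs to recognise billing requests.

```python
from typing import Callable, List, Tuple

# Hypothetical golden cases: (input text, expected label).
GOLDEN: List[Tuple[str, str]] = [
    ("Refund my last invoice", "billing"),
    ("App crashes on login", "bug"),
    ("Can you add dark mode?", "feature_request"),
    ("I was charged twice", "billing"),
]


def accuracy(classify: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of golden cases where the predicted label matches exactly."""
    hits = sum(1 for text, expected in cases if classify(text) == expected)
    return hits / len(cases)


def gate(
    classify: Callable[[str], str],
    cases: List[Tuple[str, str]],
    min_accuracy: float = 0.9,
) -> Tuple[bool, float]:
    """Return (passed, accuracy) so CI can both fail the PR and report the score."""
    acc = accuracy(classify, cases)
    return acc >= min_accuracy, acc


def classify_with_context(text: str) -> str:
    """Stand-in for the original prompt, which still knows the billing cues."""
    lowered = text.lower()
    if "invoice" in lowered or "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "feature_request"


def classify_missing_context(text: str) -> str:
    """Stand-in for the regressed prompt: billing cues are no longer recognised."""
    if "crash" in text.lower():
        return "bug"
    return "feature_request"
```

With these stubs, the original classifier clears the gate while the regressed one misses both billing cases and fails it. A small manual spot check could easily skip those two cases, which is why the golden set has to cover them.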

What's Actually in the Playbook

I've packaged this into a complete system:

Golden dataset templates for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)
Rubric-scoring system: the exact Python code to score outputs
Multi-model comparison scripts: compare GPT-4o vs Claude vs Gemini on identical cases
Complete GitHub Actions workflow: copy-paste, no tweaking needed
Cost optimization: batch evals, cache responses, use cheaper models for coarse filtering
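
As one illustration of the response-caching idea, the sketch below wraps a judge so that an identical (model, prompt, output) triple is scored only once and later runs read the score from disk. It is an assumption-laden example rather than the packaged code: `CachedJudge`, `cache_key`, and the one-JSON-file-per-entry layout are hypothetical choices.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict

Scores = Dict[str, float]


def cache_key(model: str, prompt: str, output: str) -> str:
    """Stable key: any change to the model, prompt, or output produces a new key."""
    payload = json.dumps([model, prompt, output], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wraps a judge callable and stores each result as a JSON file on disk."""

    def __init__(self, judge: Callable[[str, str], Scores], model: str, cache_dir: str):
        self.judge = judge
        self.model = model
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.misses = 0  # number of times the underlying judge was actually called

    def score(self, prompt: str, output: str) -> Scores:
        path = self.cache_dir / f"{cache_key(self.model, prompt, output)}.json"
        if path.exists():
            # Cache hit: no judge call, no cost.
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        scores = self.judge(prompt, output)
        path.write_text(json.dumps(scores), encoding="utf-8")
        return scores
```

If the cache directory is persisted between CI runs, only cases whose prompt or output changed are re-scored. Because a judge model is itself not perfectly deterministic, caching also pins the score for unchanged outputs, which may or may not be what you want.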

The full system is documented with real examples from my production infrastructure.