
Charlie Hadley


Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

You ship a prompt change to production. Two hours later, a customer complains that your LLM is now returning hallucinated data. You roll back. You've lost an hour of revenue.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic: the same input doesn't always produce the same output quality, so a handful of manual checks tells you little about how a prompt behaves across real traffic.

The enterprise options are Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, you probably don't have that budget.

Building Eval-as-Code in GitHub Actions

I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.

Here's the core idea:

  1. Define quality as a rubric, not vibes. Instead of "does this look good?", you write: correctness, conciseness, tone, hallucination-risk, usefulness. 5-10 concrete attributes.
  2. Create golden datasets. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.
  3. Use a cheap judge model. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.
  4. Automate in CI. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.
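
As a rough sketch of the four steps above, the Python below averages per-attribute judge scores over a golden dataset and returns a non-zero exit code when any average falls under a threshold. The attribute names come from the rubric list above; the 1-5 scale, the 4.0 threshold, and the names GoldenCase, run_eval, gate, and stub_judge are assumptions made for illustration. stub_judge stands in for a real call to a judge model such as GPT-4o-mini and is not how the playbook itself scores outputs.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Rubric attributes named in the post; the 1-5 scoring scale is an assumption.
RUBRIC = ["correctness", "conciseness", "tone", "hallucination_risk", "usefulness"]


@dataclass
class GoldenCase:
    """One golden-dataset entry: an input and the output we expect for it."""
    input_text: str
    expected: str


# A judge maps (case, model_output) to one score per rubric attribute.
Judge = Callable[[GoldenCase, str], Dict[str, float]]


def run_eval(cases: List[GoldenCase],
             generate: Callable[[str], str],
             judge: Judge) -> Dict[str, float]:
    """Run every golden case and return the mean score per rubric attribute."""
    per_attribute: Dict[str, List[float]] = {name: [] for name in RUBRIC}
    for case in cases:
        output = generate(case.input_text)
        scores = judge(case, output)
        for name in RUBRIC:
            per_attribute[name].append(scores[name])
    return {name: mean(values) for name, values in per_attribute.items()}


def gate(averages: Dict[str, float], threshold: float = 4.0) -> int:
    """Return 0 if every attribute meets the threshold, otherwise 1."""
    failing = {name: score for name, score in averages.items() if score < threshold}
    for name, score in failing.items():
        print(f"FAIL {name}: {score:.2f} < {threshold}")
    return 1 if failing else 0


def stub_judge(case: GoldenCase, output: str) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM judge: exact match scores 5, anything else 1."""
    score = 5.0 if output == case.expected else 1.0
    return {name: score for name in RUBRIC}
```

In a GitHub Actions step, a script could pass the result of gate(...) to sys.exit so that a drop in scores fails the PR check.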

Concrete Example From Production

I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.

With evals: CI caught it in 4 minutes. The PR failed. I fixed the prompt, the evals passed, and I shipped with confidence.

What's Actually in the Playbook

I've packaged this into a complete system:

  • Golden dataset templates for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)
  • Rubric-scoring system: the exact Python code to score outputs
  • Multi-model comparison scripts: compare GPT-4o vs Claude vs Gemini on identical cases
  • Complete GitHub Actions workflow: copy-paste, no tweaking needed
  • Cost optimization: batch evals, cache responses, use cheaper models for coarse filtering
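
To illustrate the "cache responses" idea from the list above, here is a minimal sketch that memoizes judge scores by a hash of the judge model, the prompt, and the output being scored, so an unchanged case is not re-scored. The names cache_key and CachedJudge, and the score_fn callable that would wrap the actual judge call, are hypothetical and not taken from the playbook. The cache here lives in memory only; keeping it between CI runs would need a file or a CI cache step, which this sketch leaves out.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str, output: str) -> str:
    """Build a stable key from the judge model, the prompt, and the scored output."""
    # JSON-encoding the list keeps field boundaries unambiguous before hashing.
    payload = json.dumps([model, prompt, output], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wraps a scoring function and skips repeat calls for identical inputs."""

    def __init__(self, model: str, score_fn: Callable[[str, str], float]) -> None:
        self.model = model
        self.score_fn = score_fn  # assumed to wrap the real (paid) judge call
        self.cache: Dict[str, float] = {}
        self.calls = 0  # number of times score_fn was actually invoked

    def score(self, prompt: str, output: str) -> float:
        key = cache_key(self.model, prompt, output)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.score_fn(prompt, output)
        return self.cache[key]
```

Including the model name in the key means that switching judge models invalidates old scores rather than silently reusing them.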

The full system is documented with real examples from my production infrastructure.

Who This Is For

  • Indie hackers shipping LLM features with no ML team
  • Startups evaluating multiple models before scaling
  • Engineers maintaining LLM systems over time (catch regressions early)
  • Anyone tired of deploying hope instead of metrics

The playbook is £29 one-time. It pays for itself the first time it stops a bad production deployment.

Get it: https://hadleyworks.gumroad.com/l/nyzala
