LLM Evaluation in CI: Stop Manual Testing Before It Costs You
You ship a prompt change to production. Two hours later, a customer complains that your LLM is returning hallucinated data. You roll back. You've lost an hour of revenue and some user trust.
This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic — the same input doesn't always produce the same output quality.
The enterprise solutions are Braintrust ($249/mo), LangSmith ($99/mo), and Arize. If you're indie, bootstrapped, or pre-PMF, those budgets simply don't exist.
The Core Idea: Eval-as-Code
Instead of vibes-based testing, you define quality as a rubric with concrete attributes:
- Correctness (0–10): Is the answer factually right?
- Conciseness (0–10): Does it avoid unnecessary padding?
- Hallucination risk (0–10): Does it cite things it can't know?
- Tone (0–10): Does it match expected register?
- Usefulness (0–10): Would a real user find this helpful?
A cheap judge model (GPT-4o-mini at ~$0.0001/call) scores each output against your rubric. You run 50 test cases per eval. Total cost: about £0.20 per full run.
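As a rough sketch of how the rubric above could live in code: the attribute names and questions come from the list, while `build_judge_prompt`, `parse_scores`, and the JSON reply format are illustrative assumptions, not a fixed API. The scoring direction for hallucination risk (whether 10 means "no risk" or "high risk") is also something you would need to pin down in your own rubric.

```python
import json

# Rubric attributes taken from the list above; the questions double as judge instructions.
RUBRIC = {
    "correctness": "Is the answer factually right?",
    "conciseness": "Does it avoid unnecessary padding?",
    "hallucination_risk": "Does it cite things it can't know?",
    "tone": "Does it match the expected register?",
    "usefulness": "Would a real user find this helpful?",
}


def build_judge_prompt(user_input: str, expected: str, actual: str) -> str:
    """Render the rubric and one test case into a single prompt for the judge model."""
    criteria = "\n".join(
        f"- {name} (0-10): {question}" for name, question in RUBRIC.items()
    )
    return (
        "Score the ACTUAL output against the rubric. "
        "Reply with a JSON object mapping each attribute name to an integer 0-10.\n\n"
        f"Rubric:\n{criteria}\n\n"
        f"INPUT:\n{user_input}\n\nEXPECTED:\n{expected}\n\nACTUAL:\n{actual}\n"
    )


def parse_scores(judge_reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply: every attribute present, each score in 0-10."""
    raw = json.loads(judge_reply)
    scores = {}
    for name in RUBRIC:
        value = int(raw[name])  # KeyError if the judge skipped an attribute
        if not 0 <= value <= 10:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    return scores
```

Keeping prompt construction and reply parsing as pure functions means they can be unit-tested without spending anything on API calls; only the thin layer that actually calls the judge model needs a key.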
Building This in GitHub Actions
Here's the minimal structure:
name: LLM Eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run evals
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check threshold
        run: python check_threshold.py --min-score 7.5
The run_evals.py script:
- Loads your golden dataset (JSON file of input/expected-output pairs)
- Runs your LLM system on each input
- Sends (input, expected, actual) to GPT-4o-mini with your rubric
- Aggregates scores by attribute
- Writes results to eval_results.json
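The steps above might be sketched roughly like this. The `system` and `judge` callables, the `"input"`/`"expected"` keys, and the layout of the results dictionary are assumptions made for illustration; in a real script `system` would call your LLM pipeline and `judge` would call GPT-4o-mini with your rubric.

```python
import json
from statistics import mean
from typing import Callable

Scores = dict[str, float]


def run_evals(
    dataset: list[dict],
    system: Callable[[str], str],
    judge: Callable[[str, str, str], Scores],
) -> dict:
    """Run every golden case through the system, judge it, and aggregate per attribute."""
    per_case = []
    for case in dataset:
        actual = system(case["input"])
        scores = judge(case["input"], case["expected"], actual)
        per_case.append({"input": case["input"], "actual": actual, "scores": scores})

    # Average each rubric attribute across cases, then average the attributes.
    attributes = per_case[0]["scores"].keys() if per_case else []
    by_attribute = {a: mean(c["scores"][a] for c in per_case) for a in attributes}
    aggregate = mean(by_attribute.values()) if by_attribute else 0.0
    return {"aggregate": aggregate, "by_attribute": by_attribute, "cases": per_case}


def write_results(results: dict, path: str = "eval_results.json") -> None:
    """Persist results so a later CI step can read them."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
```

Passing the system and judge in as plain functions keeps the loop testable with stubs, so the harness itself can be checked without an API key.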
If the aggregate score drops below your threshold, check_threshold.py exits with code 1 and the PR check fails.
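A minimal sketch of what check_threshold.py could look like, assuming eval_results.json has a top-level `"aggregate"` number. That key and the `--results` flag are assumptions for illustration; only `--min-score` appears in the workflow.

```python
import argparse
import json
from typing import Optional


def main(argv: Optional[list[str]] = None) -> int:
    """Return 0 if the aggregate score meets the threshold, 1 otherwise."""
    parser = argparse.ArgumentParser(description="Fail CI when eval scores regress.")
    parser.add_argument("--min-score", type=float, required=True)
    # Hypothetical flag: lets tests point at a different results file.
    parser.add_argument("--results", default="eval_results.json")
    args = parser.parse_args(argv)

    with open(args.results, encoding="utf-8") as f:
        score = json.load(f)["aggregate"]  # assumed key, see note above

    if score < args.min_score:
        print(f"FAIL: aggregate {score:.2f} is below threshold {args.min_score:.2f}")
        return 1
    print(f"PASS: aggregate {score:.2f} meets threshold {args.min_score:.2f}")
    return 0

# In the real script, finish with: raise SystemExit(main())
```

GitHub Actions marks a step as failed on any non-zero exit code, which is what turns the returned 1 into a failed PR check.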
A Real Example From Production
I changed a classification system prompt to improve response formatting. The change looked solid in manual testing on 5 examples. But I accidentally dropped a critical piece of context the model needed for correct classification.
Without evals: ships to users. Angry support tickets. Rollback. Lost trust.
With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.
Golden Datasets: The Hard Part
The hardest part is building your test cases. The key insight: start with failures, not successes.
Every time your LLM system makes a mistake:
- Save the input
- Write down what the correct output should have been
- Add it to your golden dataset
After 2–3 weeks of normal usage, you'll have 30–50 meaningful test cases that represent real failure modes — far more valuable than synthetic test cases you invented upfront.
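One way to make that habit cheap is a small helper that appends a failure to the dataset file. This is a sketch only: the file layout (a JSON list of objects with `input`, `expected`, and a free-text `note`) and the function name `add_failure_case` are assumptions, not a required format.

```python
import json
from pathlib import Path


def add_failure_case(path: str, user_input: str, expected: str, note: str = "") -> int:
    """Append one observed failure to the golden dataset; return the new case count."""
    file = Path(path)
    cases = json.loads(file.read_text(encoding="utf-8")) if file.exists() else []

    # Skip exact duplicate inputs so repeated reports of one bug don't skew averages.
    if not any(c["input"] == user_input for c in cases):
        cases.append({"input": user_input, "expected": expected, "note": note})
        file.write_text(
            json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return len(cases)
```

Because the dataset is a plain JSON file in the repo, every new failure case shows up in a normal diff and gets reviewed like any other change.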
Multi-Model Comparison
Before committing to an expensive model, run your eval suite across providers:
models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku", "gemini-1.5-flash"]
results = {}
for model in models:
    # run_eval_suite and golden_dataset come from your own eval harness
    results[model] = run_eval_suite(model, golden_dataset)
# Sort by (score / cost_per_1k_tokens) to find the optimal tradeoff
This stops you from paying for GPT-4o when, for example, Claude Haiku scores 92% as well at 20% of the cost.
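The score-per-cost sort could be sketched as follows. The function name `rank_by_value` and the `min_score` floor are assumptions added for illustration; the floor is there because a pure score/cost ratio can rank a very cheap but unusable model first.

```python
def rank_by_value(
    scores: dict[str, float],
    cost_per_1k_tokens: dict[str, float],
    min_score: float = 0.0,
) -> list[tuple[str, float]]:
    """Rank models by score per unit cost, dropping any below the quality floor."""
    ranked = [
        (model, scores[model] / cost_per_1k_tokens[model])
        for model in scores
        if scores[model] >= min_score
    ]
    # Highest value-for-money first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Feed it the aggregate score per model from your eval runs and the current published prices for each provider; prices change often, so they are better kept in a config file than hard-coded.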
Cost Optimization
- Batch your calls: The OpenAI Batch API gives a 50% discount on asynchronous evals
- Cache responses: Hash (model + prompt + input) → cache hit avoids re-scoring
- Coarse-to-fine: Use a 2-stage system — cheap model filters obvious passes, expensive model only sees borderline cases
- Limit full runs: Run the full suite only on PRs to main (or on a weekly schedule), not on every commit
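The caching idea from the list above might look roughly like this. It is an in-memory sketch; `ScoreCache` and `cache_key` are hypothetical names, and in CI you would need to persist the store between runs (a JSON file restored by your CI's cache mechanism is one option).

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, prompt: str, user_input: str) -> str:
    """Hash (model + prompt + input) into a stable key."""
    # JSON-encode the parts as a list so ("ab", "c") and ("a", "bc") cannot collide.
    payload = json.dumps([model, prompt, user_input], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScoreCache:
    """Remember judge scores so unchanged cases are not re-scored."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0

    def get_or_score(
        self, model: str, prompt: str, user_input: str, score_fn: Callable[[], dict]
    ) -> dict:
        key = cache_key(model, prompt, user_input)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for one scoring call and remember the result.
        self._store[key] = score_fn()
        return self._store[key]
```

Because the prompt is part of the key, editing the system prompt invalidates every cached score for it, which is the behaviour you want when the prompt is the thing under test.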
A well-optimized setup runs 100 eval cases for under £0.10.
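The coarse-to-fine idea can be sketched as a small routing function. The name `two_stage_score`, the single-number score, and the default `pass_threshold` of 8.0 are illustrative assumptions; where to draw the "obvious pass" line is something to tune against your own data.

```python
from typing import Callable


def two_stage_score(
    case: dict,
    cheap_judge: Callable[[dict], float],
    strong_judge: Callable[[dict], float],
    pass_threshold: float = 8.0,
) -> tuple[float, str]:
    """Accept obvious passes from the cheap judge; escalate everything else."""
    first = cheap_judge(case)
    if first >= pass_threshold:
        # Clear pass: no need to pay for the expensive model.
        return first, "cheap"
    # Borderline or failing: get a second opinion from the stronger judge.
    return strong_judge(case), "strong"
```

Returning which judge produced the score makes it easy to log the escalation rate, which tells you how much the second stage is really costing.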
What I've Packaged Up
I've turned this into a complete ready-to-use system in The Indie Hacker's LLM Eval Playbook:
- 6 golden dataset templates for common LLM tasks (classification, summarization, retrieval, generation, code review, reasoning)
- Complete rubric scoring system in Python (copy-paste ready)
- Multi-model comparison script with cost-efficiency ranking
- GitHub Actions workflow — drop it in your repo and it works
- Cost optimization guide with benchmarks
£29 one-time. One avoided production incident pays for it 10× over.
If you have questions about implementing eval-as-code for your specific use case, drop them in the comments — happy to help.