Charlie Hadley

Posted on May 18

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

#llm #devops #testing #ai

You ship a prompt change to production. Two hours later, a customer complains that your LLM is returning hallucinated data. You roll back. You've lost an hour of revenue and some user trust.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic — the same input doesn't always produce the same output quality.

The enterprise options are Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets simply don't exist.

The Core Idea: Eval-as-Code

Instead of vibes-based testing, you define quality as a rubric with concrete attributes:

Correctness (0–10): Is the answer factually right?
Conciseness (0–10): Does it avoid unnecessary padding?
Hallucination risk (0–10): Does it cite things it can't know?
Tone (0–10): Does it match expected register?
Usefulness (0–10): Would a real user find this helpful?

A cheap judge model (GPT-4o-mini at ~$0.0001/call) scores each output against your rubric. You run 50 test cases per eval. Total cost: about £0.20 per full run.
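To make the rubric concrete, here is a minimal sketch of it as data, plus a prompt builder and a reply parser for the judge. The prompt wording, the function names, and the JSON reply format are assumptions for illustration, not a fixed API. The sketch also assumes 10 is always the best score, so a 10 on hallucination risk means the lowest risk.

```python
import json

# Rubric attributes from the list above; each is scored 0-10 by the judge.
# Assumption: 10 is always "best" (for hallucination_risk, 10 = lowest risk).
RUBRIC = {
    "correctness": "Is the answer factually right?",
    "conciseness": "Does it avoid unnecessary padding?",
    "hallucination_risk": "Does it cite things it can't know?",
    "tone": "Does it match the expected register?",
    "usefulness": "Would a real user find this helpful?",
}


def build_judge_prompt(question: str, expected: str, actual: str) -> str:
    """Assemble the text sent to the judge model (wording is hypothetical)."""
    criteria = "\n".join(
        f"- {name} (0-10): {desc}" for name, desc in RUBRIC.items()
    )
    return (
        "Score the ACTUAL answer against the EXPECTED answer.\n"
        f"Criteria:\n{criteria}\n\n"
        f"INPUT: {question}\nEXPECTED: {expected}\nACTUAL: {actual}\n\n"
        "Reply with a JSON object mapping each criterion name to an integer."
    )


def parse_scores(judge_reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply.

    Raises KeyError if a criterion is missing and ValueError if a score
    falls outside 0-10, so a malformed reply fails loudly.
    """
    raw = json.loads(judge_reply)
    scores = {}
    for name in RUBRIC:
        value = int(raw[name])
        if not 0 <= value <= 10:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    return scores
```

Keeping the rubric in one dictionary means the prompt and the validation cannot drift apart when you add or rename an attribute.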

Building This in GitHub Actions

Here's the minimal structure:

name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
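      - uses: actions/checkout@v4
      # Pin the Python version (3.12 is an assumption; use your project's)
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"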
      - name: Run evals
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check threshold
        run: python check_threshold.py --min-score 7.5

The run_evals.py script:

Loads your golden dataset (JSON file of input/expected-output pairs)
Runs your LLM system on each input
Sends (input, expected, actual) to GPT-4o-mini with your rubric
Aggregates scores by attribute
Writes results to eval_results.json
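The steps above can be sketched roughly as follows. This is one possible shape for run_evals.py, not the definitive one: `system` and `judge` are hypothetical callables you would supply (your LLM pipeline and a wrapper around the judge model), and the `by_attribute` / `aggregate` layout of the results file is an assumption.

```python
import json
from statistics import mean
from typing import Callable


def run_evals(
    cases: list[dict],
    system: Callable[[str], str],
    judge: Callable[[str, str, str], dict],
) -> dict:
    """Run every golden case through the system and the judge.

    `system(input) -> actual` and `judge(input, expected, actual) -> scores`
    are placeholders for your own code.
    """
    if not cases:
        raise ValueError("golden dataset is empty")

    per_case = []
    for case in cases:
        actual = system(case["input"])
        per_case.append(judge(case["input"], case["expected"], actual))

    # Average each rubric attribute across all cases.
    attributes = list(per_case[0].keys())
    by_attribute = {a: mean(s[a] for s in per_case) for a in attributes}

    return {
        "n_cases": len(per_case),
        "by_attribute": by_attribute,
        # Unweighted mean of the attribute averages (a design assumption).
        "aggregate": mean(by_attribute.values()),
    }


def write_results(results: dict, path: str = "eval_results.json") -> None:
    """Persist the results so a later CI step can read them."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
```

Passing the system and judge in as functions keeps the script testable without network calls: in unit tests you can swap in stubs, and in CI you pass the real clients.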

If the aggregate score drops below your threshold, check_threshold.py exits with code 1 and the PR fails.
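A minimal sketch of what check_threshold.py could look like. It assumes eval_results.json has a top-level `aggregate` number; that key and the `--results` flag are illustrative assumptions, while `--min-score` comes from the workflow above.

```python
import argparse
import json
import sys


def passes(results: dict, min_score: float) -> bool:
    """True when the aggregate score meets the threshold."""
    return results["aggregate"] >= min_score


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI on low eval scores")
    parser.add_argument("--min-score", type=float, required=True)
    # Hypothetical flag: lets you point at a different results file.
    parser.add_argument("--results", default="eval_results.json")
    args = parser.parse_args(argv)

    with open(args.results, encoding="utf-8") as fh:
        results = json.load(fh)

    if passes(results, args.min_score):
        print(f"PASS: aggregate {results['aggregate']:.2f} >= {args.min_score}")
        return 0

    print(
        f"FAIL: aggregate {results['aggregate']:.2f} < {args.min_score}",
        file=sys.stderr,
    )
    return 1


# In the real script, finish the file with:
#     raise SystemExit(main())
# so the return value becomes the process exit code that fails the job.
```

GitHub Actions marks a step as failed on any non-zero exit code, so returning 1 is all that is needed to block the PR.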

A Real Example From Production

I changed a classification system prompt to improve response formatting. The change looked solid in manual testing on 5 examples. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: ships to users. Angry support tickets. Rollback. Lost trust.

With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.

Golden Datasets: The Hard Part

The hardest part is building your test cases. The key insight: start with failures, not successes.

Every time your LLM system makes a mistake:

Save the input
Write down what the correct output should have been
Add it to your golden dataset

After 2–3 weeks of normal usage, you'll have 30–50 meaningful test cases that represent real failure modes — far more valuable than synthetic test cases you invented upfront.
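The capture loop above is easier to keep up if adding a case is a one-liner. Here is a small sketch, assuming the golden dataset is a JSON array of objects with `input` and `expected` keys; the `note` field and the function name are hypothetical additions.

```python
import json
from pathlib import Path


def add_failure(path: str, input_text: str, expected: str, note: str = "") -> int:
    """Append a failed case to the golden dataset and return its new size.

    Skips the write when the same input is already recorded, so repeated
    reports of one bug do not inflate the dataset.
    """
    file = Path(path)
    cases = json.loads(file.read_text(encoding="utf-8")) if file.exists() else []

    if any(case["input"] == input_text for case in cases):
        return len(cases)

    cases.append({"input": input_text, "expected": expected, "note": note})
    file.write_text(
        json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(cases)
```

Because the file is plain JSON checked into the repo, every new failure case shows up in code review alongside the prompt fix it motivated.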

Multi-Model Comparison

Before committing to an expensive model, run your eval suite across providers:

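# run_eval_suite and golden_dataset are your own eval runner and test cases.
# Model IDs change over time; check each provider's current list.
models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku-latest", "gemini-1.5-flash"]
results = {}
for model in models:
    results[model] = run_eval_suite(model, golden_dataset)

# Sort by (score / cost_per_1k_tokens) to find the best tradeoff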

This stops you from paying for GPT-4o when, say, Claude Haiku scores 92% as well at 20% of the cost.
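The score-per-cost ranking can be sketched like this. The model names and numbers in the usage note are made-up placeholders, not real benchmark results or prices, and the `min_score` floor is an added assumption so that a very cheap but poor model cannot win on ratio alone.

```python
def rank_by_value(
    scores: dict[str, float],
    cost_per_1k_tokens: dict[str, float],
    min_score: float = 0.0,
) -> list[tuple[str, float]]:
    """Rank models by score per unit cost, best first.

    Models scoring below `min_score` are dropped before ranking.
    """
    value = {
        model: scores[model] / cost_per_1k_tokens[model]
        for model in scores
        if scores[model] >= min_score
    }
    return sorted(value.items(), key=lambda item: item[1], reverse=True)
```

With placeholder inputs such as `scores = {"model-a": 9.0, "model-b": 8.3}` and `cost_per_1k_tokens = {"model-a": 0.01, "model-b": 0.002}`, model-b scores about 92% as well at 20% of the cost and ranks first, unless you set `min_score` above 8.3.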

Cost Optimization

Batch your calls: OpenAI's Batch API gives a 50% discount on asynchronous evals
Cache responses: Hash (model + prompt + input) → cache hit avoids re-scoring
Coarse-to-fine: Use a 2-stage system — cheap model filters obvious passes, expensive model only sees borderline cases
Limit full runs: Run the full suite on PRs to main, not on every commit
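As a rough illustration of the caching item, here is an in-memory sketch. The class and function names are hypothetical, and a real setup would probably persist the cache to disk or to a CI cache between runs.

```python
import hashlib
import json
from typing import Callable


def cache_key(model: str, prompt: str, input_text: str) -> str:
    """Stable hash of (model, prompt, input).

    Serializing as a JSON list keeps field boundaries, so ("ab", "c")
    and ("a", "bc") cannot collide the way plain concatenation would.
    """
    payload = json.dumps([model, prompt, input_text], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScoreCache:
    """In-memory cache; only calls the scorer on a miss."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def get_or_score(
        self,
        model: str,
        prompt: str,
        input_text: str,
        score_fn: Callable[[], dict],
    ) -> dict:
        key = cache_key(model, prompt, input_text)
        if key not in self._store:
            self._store[key] = score_fn()  # the paid call happens here
        return self._store[key]
```

Since the prompt is part of the key, editing the prompt invalidates exactly the entries it affects, and unchanged cases cost nothing on the next run.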

A well-optimized setup runs 100 eval cases for under £0.10.
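The coarse-to-fine idea from the list above might look like the sketch below. Both judges are hypothetical callables returning a 0-10 score, and the 9.0 cutoff for an "obvious pass" is an assumption you would tune against your own data.

```python
from typing import Callable


def two_stage_score(
    case: dict,
    cheap_judge: Callable[[dict], float],
    expensive_judge: Callable[[dict], float],
    pass_cutoff: float = 9.0,
) -> tuple[float, str]:
    """Score with the cheap judge first; escalate anything not clearly passing.

    Returns the score and which judge produced it.
    """
    first = cheap_judge(case)
    if first >= pass_cutoff:
        # Obvious pass: trust the cheap judge and skip the expensive call.
        return first, "cheap"
    # Borderline or failing: get a second opinion from the stronger model.
    return expensive_judge(case), "expensive"
```

The savings depend on how many cases clear the cutoff: if most of the golden dataset passes cleanly, the expensive judge only sees the handful of cases where its extra accuracy matters.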

What I've Packaged Up

I've turned this into a complete ready-to-use system in The Indie Hacker's LLM Eval Playbook:

6 golden dataset templates for common LLM tasks (classification, summarization, retrieval, generation, code review, reasoning)
Complete rubric scoring system in Python (copy-paste ready)
Multi-model comparison script with cost-efficiency ranking
GitHub Actions workflow — drop it in your repo and it works
Cost optimization guide with benchmarks

£29 one-time. One avoided production incident pays for it 10× over.

If you have questions about implementing eval-as-code for your specific use case, drop them in the comments — happy to help.
