Charlie Hadley

Posted on May 18

How to Run LLM Evaluations in CI Without Paying $249/Month

#ai #llm #productivity #tutorial

How to Run LLM Evaluations in CI Without Paying $249/Month

If you're building LLM-powered features as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no systematic way to know if they're actually improving after each change.

The obvious answer is Braintrust or LangSmith. But at $249/month minimum, that's a massive commitment for a pre-PMF product. Here's how to build a production-grade eval pipeline for under $5/month.

The Core Architecture

You need three things:

A golden dataset — A CSV of 50-200 test cases covering your edge cases, with input + expected behavior description
A scoring function — LLM-as-judge using GPT-4o-mini (~$0.002 per example)
GitHub Actions integration — Runs your eval suite on every PR with a score threshold check

The magic: your CI pipeline fails the build if average quality drops below your threshold. No more shipping prompt regressions.

Why Rubric-Based Scoring Beats Exact Match

The biggest mistake teams make: they try to match exact output strings. This fails because LLMs are inherently non-deterministic.

Instead, define what "good" looks like as a checklist rubric:

rubric = """
Score this response 1-5 based on:
- Does it answer the question directly? (1 point)
- Is it concise (under 200 words)? (1 point)  
- Does it avoid hallucinating specific numbers? (1 point)
- Is the tone professional? (1 point)
- Would a user find this genuinely useful? (1 point)
"""

Then let GPT-4o-mini score each response against this rubric. At $0.002 per evaluation, running 100 test cases costs $0.20.

The GitHub Actions Workflow

name: LLM Eval CI
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pip install openai pandas
          python eval/run_suite.py --threshold 3.5

The --threshold 3.5 means: if average score drops below 3.5/5.0, fail the PR. This is your quality gate.

The Multi-Model Comparison Pattern

Before you commit to GPT-4o for your feature, run your eval suite against Claude 3.5 Haiku and Gemini Flash. You'll often find that a cheaper model scores within 0.2 points of the expensive one — at 1/10th the cost.

This comparison takes 10 minutes to set up but can cut your inference costs by 60-80%.

What This Catches in Practice

Real scenario: You change your system prompt to fix a formatting issue. Without evals, you ship it. With evals, your CI run shows classification accuracy dropped from 4.2 to 3.1 on the golden dataset. You investigate, find that your formatting fix accidentally removed context the model needed, and fix it before it hits production.

The moment you catch your first regression in CI, the whole system pays for itself.

Building Your Golden Dataset

Start with 50 examples. Pull them from:

Real user queries you've seen in logs
Edge cases you've mentally worried about
Failure modes you've already shipped by accident

Don't try to write expected outputs. Instead, write rubrics describing what good looks like for each category.

Cost Breakdown

Golden dataset (50 examples): $0.10 per full suite run
GitHub Actions: free tier (2,000 minutes/month)
Total monthly cost for 10 PRs/week: ~$4/month

Compare to Braintrust at $249/month.

Getting Started

The hardest part isn't the code — it's building the golden dataset and writing good rubrics. Once those exist, the automation is straightforward.

I've packaged the full methodology into a playbook: golden dataset templates, rubric examples, multi-model comparison scripts, and the complete GitHub Actions workflow. Available at hadleyworks.gumroad.com for $29.

What eval setups are others running at small scale? Happy to discuss approaches in the comments.

Top comments (1)

Max Quimby • May 18

This is a good baseline setup — the gap between "no evals" and "any evals" is enormous, and a CSV + GPT-4o-mini judge + GitHub Action covers maybe 80% of what most teams actually need before they over-engineer toward Braintrust.

Two patterns that have been worth their weight for us:

Pairwise scoring beats absolute scoring, especially with cheap judges. Instead of "rate this answer 1-5," ask the judge "is A or B better, and why." 4o-mini is much more reliable at relative judgment than calibrated absolute scores, and you get a free regression signal vs. the previous run's output.
Stratify the golden set by failure mode, not by topic. We split ours into "easy happy path," "ambiguous intent," "should-refuse," and "long-context recall." A single aggregate score hides regressions that only show up in one slice — a model upgrade that wins overall might quietly tank the should-refuse bucket.

One trap with the LLM-as-judge approach: the judge starts agreeing with its own family of models over time. We rotate judges (4o-mini ↔ Haiku) on a schedule to keep that bias visible. Have you hit that one yet?