How to Run LLM Evaluations in CI Without Paying $249/Month
If you're building LLM-powered features as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no systematic way to know if they're actually improving after each change.
The obvious answer is Braintrust or LangSmith. But at $249/month minimum, that's a massive commitment for a pre-PMF product. Here's how to build a production-grade eval pipeline for under $5/month.
The Core Architecture
You need three things:
- A golden dataset — A CSV of 50-200 test cases covering your edge cases, with input + expected behavior description
- A scoring function — LLM-as-judge using GPT-4o-mini (~$0.002 per example)
- GitHub Actions integration — Runs your eval suite on every PR with a score threshold check
The magic: your CI pipeline fails the build if average quality drops below your threshold. No more shipping prompt regressions.
Why Rubric-Based Scoring Beats Exact Match
The biggest mistake teams make: they try to match exact output strings. This fails because LLMs are inherently non-deterministic.
Instead, define what "good" looks like as a checklist rubric:
rubric = """
Score this response 1-5 based on:
- Does it answer the question directly? (1 point)
- Is it concise (under 200 words)? (1 point)
- Does it avoid hallucinating specific numbers? (1 point)
- Is the tone professional? (1 point)
- Would a user find this genuinely useful? (1 point)
"""
Then let GPT-4o-mini score each response against this rubric. At $0.002 per evaluation, running 100 test cases costs $0.20.
The GitHub Actions Workflow
name: LLM Eval CI
on: [pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run eval suite
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
pip install openai pandas
python eval/run_suite.py --threshold 3.5
The --threshold 3.5 means: if average score drops below 3.5/5.0, fail the PR. This is your quality gate.
The Multi-Model Comparison Pattern
Before you commit to GPT-4o for your feature, run your eval suite against Claude 3.5 Haiku and Gemini Flash. You'll often find that a cheaper model scores within 0.2 points of the expensive one — at 1/10th the cost.
This comparison takes 10 minutes to set up but can cut your inference costs by 60-80%.
What This Catches in Practice
Real scenario: You change your system prompt to fix a formatting issue. Without evals, you ship it. With evals, your CI run shows classification accuracy dropped from 4.2 to 3.1 on the golden dataset. You investigate, find that your formatting fix accidentally removed context the model needed, and fix it before it hits production.
The moment you catch your first regression in CI, the whole system pays for itself.
Building Your Golden Dataset
Start with 50 examples. Pull them from:
- Real user queries you've seen in logs
- Edge cases you've mentally worried about
- Failure modes you've already shipped by accident
Don't try to write expected outputs. Instead, write rubrics describing what good looks like for each category.
Cost Breakdown
- Golden dataset (50 examples): $0.10 per full suite run
- GitHub Actions: free tier (2,000 minutes/month)
- Total monthly cost for 10 PRs/week: ~$4/month
Compare to Braintrust at $249/month.
Getting Started
The hardest part isn't the code — it's building the golden dataset and writing good rubrics. Once those exist, the automation is straightforward.
I've packaged the full methodology into a playbook: golden dataset templates, rubric examples, multi-model comparison scripts, and the complete GitHub Actions workflow. Available at hadleyworks.gumroad.com for $29.
What eval setups are others running at small scale? Happy to discuss approaches in the comments.
Top comments (1)
This is a good baseline setup — the gap between "no evals" and "any evals" is enormous, and a CSV + GPT-4o-mini judge + GitHub Action covers maybe 80% of what most teams actually need before they over-engineer toward Braintrust.
Two patterns that have been worth their weight for us:
One trap with the LLM-as-judge approach: the judge starts agreeing with its own family of models over time. We rotate judges (4o-mini ↔ Haiku) on a schedule to keep that bias visible. Have you hit that one yet?