LLM Evaluation for Indie Hackers: Build a £0.20/Run System That Catches Real Bugs
You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.
This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. It's a painful recurring bill if you're bootstrapping.
Here's how to build a production-grade eval system for about £0.20 per full test run.
The Core Architecture
Forget building a dashboard. You need three things:
- A golden dataset — 50–100 (input, expected_output) pairs from real production logs
- A judge prompt — an LLM that scores your outputs 1–5 on accuracy, tone, and format
- A CI gate — a GitHub Actions workflow that blocks merges if score drops more than 0.8 from baseline
That's it. This catches ~85% of production-breaking changes. The remaining 15% will surface in production, which is acceptable only if you keep scoring a sample of live traffic with the same judge, so you know within minutes when the score suddenly tanks.
Building Your Golden Dataset
The most common mistake: manually crafting test cases. Don't. Mine your production logs instead.
    import json
    from pathlib import Path


    def extract_golden_cases(log_dir: str, n: int = 100) -> list[dict]:
        """Extract high-quality (input, output) pairs from production logs."""
        cases = []
        # Sort the files so the result is the same on every machine.
        for log_file in sorted(Path(log_dir).glob("*.jsonl")):
            with open(log_file, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue  # skip blank lines
                    entry = json.loads(line)
                    # Only take entries where user didn't immediately retry
                    # (proxy for "this response was good enough")
                    if entry.get("user_retry_within_60s") is False:
                        cases.append({
                            "input": entry["user_input"],
                            "expected": entry["assistant_output"],
                            "metadata": {"timestamp": entry["ts"], "model": entry["model"]},
                        })
        return cases[:n]
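Production outputs come with a built-in validation signal: a user who didn't retry most likely got an acceptable response. It's a noisy proxy (some users simply give up), so skim the extracted cases once before treating them as your ground truth.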
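To feed the extracted cases into CI they need to land in a file. Here is a minimal sketch, assuming the case dictionaries have the shape shown above and that the target is the `data/golden.jsonl` path used later in the workflow. The `save_golden_cases` helper is hypothetical, and the dedup-then-sample step is my assumption rather than part of the original extractor.

```python
import json
import random
from pathlib import Path


def save_golden_cases(cases: list[dict], out_path: str, n: int = 100, seed: int = 0) -> int:
    """Deduplicate by input, sample up to n cases, and write them as JSONL."""
    # Keep the first case seen for each distinct input text.
    unique = {}
    for case in cases:
        unique.setdefault(case["input"], case)
    pool = list(unique.values())

    # A fixed seed keeps the chosen sample reproducible.
    rng = random.Random(seed)
    chosen = rng.sample(pool, min(n, len(pool)))

    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for case in chosen:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return len(chosen)
```

Sampling from the whole pool instead of taking the first `n` entries avoids a golden set dominated by whichever log file happened to be read first.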
The Judge Prompt
The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer.
    JUDGE_PROMPT = """
    You are evaluating an AI assistant's response. Score on three axes (1-5 each):

    ACCURACY: Does the response correctly address the user's request?
    - 5: Fully correct, no errors or omissions
    - 3: Mostly correct with minor issues
    - 1: Significantly wrong or misleading

    TONE: Is the response appropriately helpful without being sycophantic?
    - 5: Confident, clear, and direct
    - 3: Acceptable but slightly off
    - 1: Overly apologetic OR dismissive

    FORMAT: Is the response well-structured for this context?
    - 5: Perfect length, appropriate markdown, scannable
    - 3: Correct but could be improved
    - 1: Wall of text or too terse

    Input: {user_input}
    Response: {assistant_output}

    Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "..."}}
    """
Use GPT-4o-mini as your judge. It costs roughly £0.001–0.002 per evaluation call at this prompt size and is surprisingly good at this task.
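Judge replies are not guaranteed to be clean JSON, so it helps to validate them before averaging anything. The sketch below fills the prompt, calls the judge, and checks that each score is a number from 1 to 5. The `judge_response` function and the `call_judge` callable are hypothetical names: `call_judge` stands in for whatever client call you use, and the prompt here is a shortened stand-in for the full `JUDGE_PROMPT` so the block runs on its own.

```python
import json
from typing import Callable

DIMENSIONS = ("accuracy", "tone", "format")

# Shortened stand-in for the full JUDGE_PROMPT, same placeholders.
JUDGE_PROMPT = (
    "Input: {user_input}\n"
    "Response: {assistant_output}\n"
    'Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "..."}}'
)


def judge_response(user_input: str, assistant_output: str,
                   call_judge: Callable[[str], str]) -> dict:
    """Fill the judge prompt, call the judge, and validate the scores."""
    prompt = JUDGE_PROMPT.format(user_input=user_input, assistant_output=assistant_output)
    raw = call_judge(prompt)

    # Judges sometimes wrap the JSON in extra text; keep the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"no JSON object in judge reply: {raw!r}")
    scores = json.loads(raw[start:end + 1])

    for dim in DIMENSIONS:
        value = scores.get(dim)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"invalid {dim} score: {value!r}")
    return scores


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same scores."""
    return 'Here you go: {"accuracy": 4, "tone": 5, "format": 3, "reasoning": "ok"}'


result = judge_response("What is 2+2?", "4", fake_judge)
print(result["accuracy"], result["tone"], result["format"])  # prints: 4 5 3
```

Passing the judge in as a callable keeps the parsing logic testable without spending API credits.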
The CI Integration
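    # .github/workflows/eval.yml
    name: LLM Eval Gate
    on: [pull_request]
    jobs:
      evaluate:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          # Assumes a requirements.txt listing what the eval scripts import
          - name: Install dependencies
            run: pip install -r requirements.txt
          - name: Run evaluations
            run: python scripts/run_evals.py --golden-dataset data/golden.jsonl
            env:
              OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          - name: Check score threshold
            run: python scripts/check_threshold.py --min-delta -0.8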
The check_threshold.py script compares current run scores against the stored baseline. If any dimension drops by more than 0.8 points from baseline, the PR fails.
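The comparison at the heart of such a script could look like the sketch below. It is not the actual `check_threshold.py`; the `check_threshold` function, the score dictionaries, and their per-dimension mean layout are assumptions for illustration. A real script would load both dictionaries from files and call `sys.exit(1)` when the returned list is non-empty, since a non-zero exit code is what fails the workflow step.

```python
def check_threshold(baseline: dict[str, float], current: dict[str, float],
                    min_delta: float = -0.8) -> list[str]:
    """Return one message per dimension that fell by more than the allowed drop."""
    failures = []
    for dim, base_score in baseline.items():
        delta = current[dim] - base_score
        # min_delta is negative: -0.8 allows a drop of up to 0.8 points.
        # The small tolerance stops float rounding from failing an exact 0.8 drop.
        if delta < min_delta - 1e-9:
            failures.append(f"{dim}: {base_score:.2f} -> {current[dim]:.2f} ({delta:+.2f})")
    return failures


# Hypothetical mean scores per dimension.
baseline = {"accuracy": 4.2, "tone": 4.5, "format": 4.0}
current = {"accuracy": 3.1, "tone": 4.4, "format": 3.5}

for message in check_threshold(baseline, current):
    print("FAIL", message)  # prints: FAIL accuracy: 4.20 -> 3.10 (-1.10)
```

Checking each dimension separately matters: a large accuracy drop can hide behind stable tone and format scores if you only compare an overall average.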
Cost Breakdown
For 100 test cases per run:
- 100 LLM calls (your model under test): ~£0.05 at GPT-4o-mini prices
- 100 judge calls (GPT-4o-mini): ~£0.12
- Total: ~£0.17–0.22 per full eval run
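Compare to Braintrust at £180/month for unlimited runs. At about £0.20 per run, you'd need roughly 900 runs a month before the subscription breaks even. Two PRs a day is only about 60 runs (around £12), and more likely you run 20–30 runs/month, which costs £4–6 and makes the DIY approach roughly 30–45x cheaper.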
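The break-even arithmetic is easy to check in a few lines. This is a sketch using the figures quoted above (about £0.20 per run, £180/month for the hosted plan); the constant and function names are made up for illustration, so swap in your own numbers.

```python
COST_PER_RUN_GBP = 0.20   # per-run estimate from the cost breakdown
HOSTED_PLAN_GBP = 180.0   # monthly hosted-plan figure quoted above


def monthly_diy_cost(runs_per_month: int) -> float:
    """Monthly API spend for the DIY eval setup."""
    return runs_per_month * COST_PER_RUN_GBP


# Number of runs at which DIY spend equals the hosted subscription.
break_even_runs = HOSTED_PLAN_GBP / COST_PER_RUN_GBP

for runs in (20, 30, 60):
    cost = monthly_diy_cost(runs)
    ratio = HOSTED_PLAN_GBP / cost
    print(f"{runs} runs/month: £{cost:.2f} DIY, about {ratio:.0f}x cheaper than hosted")

print(f"break-even at about {break_even_runs:.0f} runs/month")
```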
The 70% Cost Cut
Once your system is working, add two optimisations:
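1. Sampling: Don't eval every test case on every run. Randomly sample 30% of your golden dataset unless you're doing a major model swap. Each run costs about 70% less, and because the sample changes between runs, the full set still gets covered over several PRs. The trade-off is noisier scores: 30 cases give a rougher average than 100.
2. Caching: Hash (input, output, judge_version) and cache judge scores, so an unchanged output is never re-judged. Keying on (input, model_version) alone would return stale scores after a prompt change, because the same input and model can then produce a different output. A Redis cache or even a simple SQLite file works fine.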
With these two optimisations, recurring eval costs drop to £0.04–0.07 per run.
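Both optimisations fit in a short sketch. The `sample_cases` helper and the `JudgeCache` class below are hypothetical, not the playbook's implementation. One assumption to flag: the cache key includes the output being judged as well as the input and a judge version string, so a prompt change that alters the output never returns a stale score.

```python
import hashlib
import json
import random
import sqlite3
from typing import Optional


def sample_cases(cases: list[dict], fraction: float = 0.3,
                 seed: Optional[int] = None) -> list[dict]:
    """Pick a random subset of the golden dataset for this run."""
    k = max(1, round(len(cases) * fraction))
    return random.Random(seed).sample(cases, min(k, len(cases)))


class JudgeCache:
    """SQLite-backed cache of judge scores."""

    def __init__(self, path: str = "judge_cache.sqlite"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS scores (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def make_key(user_input: str, output: str, judge_version: str) -> str:
        # Assumption: judge_version changes whenever the judge model or prompt does.
        payload = json.dumps([user_input, output, judge_version])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT value FROM scores WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, scores: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO scores VALUES (?, ?)", (key, json.dumps(scores))
        )
        self.conn.commit()


# Usage with an in-memory database; pass a file path to persist between runs.
cache = JudgeCache(":memory:")
key = JudgeCache.make_key("What is 2+2?", "4", "judge-v1")
if cache.get(key) is None:
    cache.put(key, {"accuracy": 5, "tone": 5, "format": 4})
print(cache.get(key)["accuracy"])  # prints: 5
```

In CI the SQLite file has to survive between runs (for example via a build cache), otherwise every run starts cold.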
What This Won't Catch
Be honest about the limitations:
- Subtle tone regressions in edge cases (your golden dataset has to cover them)
- Completely new user intents not in your golden set
- Factual errors in domains where your judge model lacks the domain knowledge to spot them
For those, you still need human review. But this system catches the regression cases — which are 90% of what actually breaks in production.
If you want the full system with the multi-model comparison script (GPT-4o vs Claude vs Gemini side-by-side), the sampling/caching implementation, and how to handle eval drift over time, I've packaged it as a complete playbook: The Indie Hacker's LLM Eval Playbook — £29, instant download.
The code above is a taste of what's inside. The playbook goes deeper on rubric design, handling model versioning, and scaling from 100 to 10,000 test cases without the cost exploding.