LLM Evaluation for Indie Hackers: Build a £0.20/Run System That Catches Real Bugs

#llm #startup #ai #productivity

You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.

This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.

Here's how to build a production-grade eval system for about £0.20 per full test run.

The Core Architecture

Forget building a dashboard. You need three things:

A golden dataset: 50–100 (input, expected_output) pairs from real production logs
A judge prompt: instructions that have an LLM score your outputs 1–5 on accuracy, tone, and format
A CI gate: a GitHub Actions workflow that blocks merges if any score drops by more than 0.8 from baseline

That's it. This catches roughly 85% of production-breaking changes before they merge. The remaining 15% will reach production, which is tolerable if you also score a sample of live traffic with the same judge: when that score suddenly tanks, you'll know within minutes.

Building Your Golden Dataset

The most common mistake: manually crafting test cases. Don't. Mine your production logs instead.

import json
from pathlib import Path

def extract_golden_cases(log_dir: str, n: int = 100) -> list[dict]:
    """Extract high-quality (input, output) pairs from production logs."""
    cases = []
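    # sorted() makes the file order, and so the selected cases, deterministic
    for log_file in sorted(Path(log_dir).glob("*.jsonl")):
        with open(log_file, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines, which json.loads would reject
                entry = json.loads(line)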
                # Only take entries where user didn't immediately retry
                # (proxy for "this response was good enough")
                if entry.get("user_retry_within_60s") is False:
                    cases.append({
                        "input": entry["user_input"],
                        "expected": entry["assistant_output"],
                        "metadata": {"timestamp": entry["ts"], "model": entry["model"]}
                    })
    return cases[:n]

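Production outputs come with an implicit human signal: a user who didn't retry probably got an acceptable response. It's a noisy proxy (some users simply give up), so skim the extracted cases once before you treat them as ground truth.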
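One thing to watch: `cases[:n]` keeps the first n matches in file order, which can skew the set toward whichever log is read first. Here's a minimal sketch of one alternative, assuming the same case dicts as above: deduplicate on input, draw a seeded random sample, and write the result as JSONL. The helper names `sample_golden_cases` and `write_jsonl` are hypothetical, not part of any library.

```python
import json
import random
from pathlib import Path


def sample_golden_cases(cases: list[dict], n: int = 100, seed: int = 0) -> list[dict]:
    """Deduplicate cases by input text, then draw a reproducible random sample."""
    unique: dict[str, dict] = {}
    for case in cases:
        # Keep the first case seen for each distinct (whitespace-trimmed) input.
        unique.setdefault(case["input"].strip(), case)
    pool = list(unique.values())
    rng = random.Random(seed)  # fixed seed so the golden set is reproducible
    return rng.sample(pool, min(n, len(pool)))


def write_jsonl(cases: list[dict], path: str) -> None:
    """Write one JSON object per line (hypothetical layout for data/golden.jsonl)."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")
```

Something like `write_jsonl(sample_golden_cases(extract_golden_cases("logs/")), "data/golden.jsonl")` would then produce the file once, and you commit it so every run scores the same cases.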

The Judge Prompt

The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer.

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues  
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "..."}}
"""

Use GPT-4o-mini as your judge. It costs roughly £0.001–0.002 per evaluation call, depending on prompt and response length, and is surprisingly good at this task.
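Judges don't always return what you asked for, so it's worth validating the reply before trusting a score. The sketch below is one way to wire that up, under a few assumptions: `call_judge` is a hypothetical callable you supply that sends a prompt to your judge model and returns its raw text, and `JUDGE_TEMPLATE` is a shortened stand-in for the JUDGE_PROMPT above so the example runs on its own. It does not handle replies wrapped in markdown fences.

```python
import json
from statistics import mean
from typing import Callable

AXES = ("accuracy", "tone", "format")

# Shortened stand-in for JUDGE_PROMPT; swap in the full rubric in practice.
JUDGE_TEMPLATE = (
    "Score the response 1-5 on accuracy, tone, and format.\n"
    "Input: {user_input}\nResponse: {assistant_output}\n"
    'Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "..."}}'
)


def parse_judge_reply(reply: str) -> dict[str, int]:
    """Parse the judge's JSON and check every axis is an integer from 1 to 5."""
    data = json.loads(reply)
    scores = {}
    for axis in AXES:
        value = data[axis]
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad {axis} score: {value!r}")
        scores[axis] = value
    return scores


def judge_case(user_input: str, assistant_output: str,
               call_judge: Callable[[str], str]) -> dict[str, int]:
    """Fill in the template, send it through call_judge, and validate the reply."""
    prompt = JUDGE_TEMPLATE.format(user_input=user_input,
                                   assistant_output=assistant_output)
    return parse_judge_reply(call_judge(prompt))


def mean_scores(results: list[dict[str, int]]) -> dict[str, float]:
    """Average each axis across all judged cases."""
    return {axis: mean(r[axis] for r in results) for axis in AXES}
```

Passing the judge in as a function keeps the scoring logic testable without an API key: in tests you hand it a lambda that returns canned JSON, and in CI you hand it a thin wrapper around your real client.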

The CI Integration

# .github/workflows/eval.yml
name: LLM Eval Gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
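      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes your dependencies are listed in requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt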
      - name: Run evaluations
        run: python scripts/run_evals.py --golden-dataset data/golden.jsonl
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check score threshold
        run: python scripts/check_threshold.py --min-delta -0.8

The check_threshold.py script compares current run scores against the stored baseline. If any dimension drops by more than 0.8 points from baseline, the PR fails.
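The core of check_threshold.py can be very small. What follows is a sketch, not the definitive script: it assumes the baseline and the current run are each stored as a JSON file mapping dimension name to mean score (a hypothetical layout), and it expresses the limit as a positive `max_drop`, so `--min-delta -0.8` would correspond to `max_drop=0.8`.

```python
import json
from pathlib import Path


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     max_drop: float = 0.8) -> dict[str, float]:
    """Return {dimension: drop} for every dimension that fell by more than max_drop."""
    regressions = {}
    for dimension, base_score in baseline.items():
        drop = base_score - current[dimension]
        if drop > max_drop:
            regressions[dimension] = round(drop, 3)
    return regressions


def check_files(baseline_path: str, current_path: str, max_drop: float = 0.8) -> int:
    """Load two score files and return a process exit code (0 = pass, 1 = fail)."""
    baseline = json.loads(Path(baseline_path).read_text(encoding="utf-8"))
    current = json.loads(Path(current_path).read_text(encoding="utf-8"))
    regressions = find_regressions(baseline, current, max_drop)
    for dimension, drop in regressions.items():
        print(f"FAIL {dimension}: dropped {drop} points (limit {max_drop})")
    return 1 if regressions else 0
```

In the script itself you'd finish with `sys.exit(check_files(...))`, since a non-zero exit code is what makes the GitHub Actions step, and therefore the PR check, fail.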

Cost Breakdown

For 100 test cases per run:

100 LLM calls (your model under test): ~£0.05 at GPT-4o-mini prices
100 judge calls (GPT-4o-mini): ~£0.12
Total: ~£0.17–0.22 per full eval run

Compare that to Braintrust at £180/month for unlimited runs. At £0.20 per run, you'd need about 900 runs a month to break even. Even at 2 PRs per day you're at roughly 60 runs, or about £12/month. More likely you run 20–30 runs/month (£4–6), making the DIY approach at least 15x cheaper.
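If you want to plug in your own numbers, the arithmetic fits in a few lines. This is just a sketch using the figures from this section: the per-call prices are the per-run costs above divided by 100, and both function names are made up for illustration.

```python
def run_cost(n_cases: int, model_cost_per_call: float, judge_cost_per_call: float) -> float:
    """Cost in pounds of one eval run: one model call plus one judge call per case."""
    return n_cases * (model_cost_per_call + judge_cost_per_call)


def break_even_runs(subscription_per_month: float, cost_per_run: float) -> float:
    """Runs per month at which DIY spend equals the subscription price."""
    return subscription_per_month / cost_per_run


# £0.05 / 100 model calls and £0.12 / 100 judge calls, from the breakdown above
per_run = run_cost(100, model_cost_per_call=0.0005, judge_cost_per_call=0.0012)
print(f"per run: £{per_run:.2f}")                                  # per run: £0.17
print(f"break-even: {break_even_runs(180, 0.20):.0f} runs/month")  # break-even: 900 runs/month
print(f"30 runs/month: £{30 * 0.20:.2f}")                          # 30 runs/month: £6.00
```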

The 70% Cost Cut

Once your system is working, add two optimisations:

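1. Sampling: Don't eval every test case on every run. Randomly sample 30% of your golden dataset unless you're doing a major model swap. Each run costs 70% less, and because the sample changes from run to run, you still cover the whole dataset over a handful of runs. The trade-off is a noisier average on any single run.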

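2. Caching: Hash (input, response, judge prompt version) and cache judge scores. If a PR leaves the response to an input unchanged, there's nothing new to judge, so reuse the stored score instead of paying for another call. Don't key on the input and model version alone: a prompt change alters the response without changing either, and a stale cached score would hide exactly the regression you're testing for. A Redis cache or even a simple SQLite file works fine.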
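The sampling step from item 1 could look something like this. It's a sketch under a couple of assumptions: the golden set is already loaded as a list of dicts, and `seed` is whatever stable value you choose (a PR number, say) so that re-running the same PR scores the same cases. The `full_run` flag is a hypothetical switch for major model swaps.

```python
import random
from typing import Optional


def sample_for_run(golden: list[dict], fraction: float = 0.3,
                   seed: Optional[int] = None, full_run: bool = False) -> list[dict]:
    """Pick a random subset of the golden set, or all of it for a full run."""
    if not golden:
        return []
    if full_run or fraction >= 1.0:
        return list(golden)
    k = max(1, round(len(golden) * fraction))  # always evaluate at least one case
    rng = random.Random(seed)  # same seed, same subset
    return rng.sample(golden, k)
```

Leaving `seed` as `None` gives a fresh subset each time, which is what spreads coverage across the whole dataset over successive runs.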

With these two optimisations, recurring eval costs drop to £0.04–0.07 per run.
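For the cache, a single SQLite file is plenty. Here's a minimal sketch using only the standard library. The key covers the input, the response text, and a judge version label (`judge_version` is a hypothetical string you bump whenever the judge prompt or judge model changes), so a changed response is always re-judged. `JudgeCache` and `cache_key` are illustrative names, not an existing package.

```python
import hashlib
import json
import sqlite3
from typing import Callable


def cache_key(user_input: str, assistant_output: str, judge_version: str) -> str:
    """Hash everything the judge's verdict depends on."""
    payload = json.dumps([user_input, assistant_output, judge_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """Tiny SQLite-backed cache mapping a key to a JSON-encoded score dict."""

    def __init__(self, path: str = "judge_cache.sqlite") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS scores (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get_or_compute(self, key: str, compute: Callable[[], dict]) -> dict:
        """Return the cached scores for key, calling compute() only on a miss."""
        row = self.conn.execute(
            "SELECT value FROM scores WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no judge call
        scores = compute()  # cache miss: this is where the judge call happens
        self.conn.execute(
            "INSERT OR REPLACE INTO scores VALUES (?, ?)", (key, json.dumps(scores))
        )
        self.conn.commit()
        return scores
```

Note that this only saves judge calls. You still have to call the model under test to get the response you're hashing, so the saving is largest on PRs that leave most responses unchanged.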

What This Won't Catch

Be honest about the limitations:

Subtle tone regressions in edge cases (your golden dataset has to cover them)
Completely new user intents not in your golden set
Factual errors in domains where your judge prompt doesn't have domain knowledge

For those, you still need human review. But this system catches the regression cases — which are 90% of what actually breaks in production.

If you want the full system with the multi-model comparison script (GPT-4o vs Claude vs Gemini side-by-side), the sampling/caching implementation, and how to handle eval drift over time, I've packaged it as a complete playbook: The Indie Hacker's LLM Eval Playbook — £29, instant download.

The code above is a taste of what's inside. The playbook goes deeper on rubric design, handling model versioning, and scaling from 100 to 10,000 test cases without the cost exploding.