How I Caught an LLM Regression That Cost My Client £5K Before It Hit Production

Quick Start: LLM Eval Rubrics for Indie Hackers

A 15-minute guide to catching LLM regressions without paying $300/month

The Problem

You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.

This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. If you're bootstrapped, it's a recurring bill that eats into your runway before the feature has earned anything.

This guide gives you a working eval system for about £0.20 per full test run.

Part 1: The Three-Axis Rubric

Every LLM output can be evaluated on three dimensions that catch 85% of production-breaking regressions:

Accuracy — Does the output correctly address the user's request?
Tone — Is the response helpful without being sycophantic or dismissive?
Format — Is the response appropriately structured for the context?

Why these three? Because they map directly to the three ways LLM outputs fail in production:

Factual/logical errors (Accuracy)
Personality drift after fine-tuning or system prompt changes (Tone)
Structural regressions when output parsers break (Format)

Writing Rubric Language That Works

The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer on your team.

Bad rubric language:

"Is the response good? Score 1-10."

The judge model (GPT-4o-mini in this guide) has no idea what "good" means for your product, so it produces inconsistent scores that you can't act on.

Good rubric language:

"ACCURACY: Does the response correctly address the user's request?

5: Fully correct, no errors or omissions

3: Mostly correct with minor issues that don't affect usability

1: Significantly wrong or misleading"

Concrete anchors at 1, 3, and 5 make the scores reproducible. You want your judge to score the same output the same way every time.
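One way to check reproducibility is to judge the same input/output pair several times and look at the spread per axis. The sketch below is illustrative only: `score_spread`, `judge_fn`, and `fake_judge` are hypothetical names, and the stand-in judge returns fixed scores so the example runs without an API key. In practice you would pass in your real judge function.

```python
from typing import Callable

AXES = ("accuracy", "tone", "format")

def score_spread(
    judge_fn: Callable[[str, str], dict],
    user_input: str,
    assistant_output: str,
    repeats: int = 5,
) -> dict:
    """Judge the same pair `repeats` times; return max - min per axis."""
    runs = [judge_fn(user_input, assistant_output) for _ in range(repeats)]
    return {
        axis: max(r[axis] for r in runs) - min(r[axis] for r in runs)
        for axis in AXES
    }

# Stand-in judge for demonstration only; a real judge would call an LLM.
def fake_judge(user_input: str, assistant_output: str) -> dict:
    return {"accuracy": 5, "tone": 4, "format": 3}

spread = score_spread(
    fake_judge, "How do I reset my password?", "Click 'Forgot password'."
)
print(spread)  # {'accuracy': 0, 'tone': 0, 'format': 0}
```

A spread of 0 on every axis means the judge agreed with itself on that case. A spread of 2 or more on one axis suggests that axis's anchors are too loose.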

Part 2: Your First Judge Prompt (Copy-Paste Ready)

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""

How to use it:

import openai
import json

client = openai.OpenAI()

def judge_response(user_input: str, assistant_output: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        user_input=user_input,
        assistant_output=assistant_output
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # reduces run-to-run variation; does not guarantee identical scores
    )

    scores = json.loads(response.choices[0].message.content)
    scores["composite"] = (scores["accuracy"] + scores["tone"] + scores["format"]) / 3
    return scores

Iterating your judge prompt: After running on 20–30 cases, review any score where the reasoning doesn't match your intuition. That mismatch tells you exactly which anchor definition to rewrite.
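If you also hand-score those 20–30 cases, the review can be partly automated by flagging every axis where the judge and your own score disagree. This is a minimal sketch under assumptions: the record layout (`id`, `human`, `judge`) and the `min_gap` threshold of 2 points are illustrative choices, not part of the judge code above.

```python
AXES = ("accuracy", "tone", "format")

def find_disagreements(cases: list[dict], min_gap: int = 2) -> list[tuple]:
    """Return (case_id, axis, human_score, judge_score) for large gaps."""
    flagged = []
    for case in cases:
        for axis in AXES:
            human = case["human"][axis]
            judge = case["judge"][axis]
            if abs(human - judge) >= min_gap:
                flagged.append((case["id"], axis, human, judge))
    return flagged

# Hypothetical hand-labelled cases alongside the judge's scores.
cases = [
    {"id": "c1",
     "human": {"accuracy": 5, "tone": 4, "format": 4},
     "judge": {"accuracy": 5, "tone": 4, "format": 3}},
    {"id": "c2",
     "human": {"accuracy": 2, "tone": 4, "format": 4},
     "judge": {"accuracy": 5, "tone": 3, "format": 4}},
]

flagged = find_disagreements(cases)
print(flagged)  # [('c2', 'accuracy', 2, 5)]
```

Here only case c2 is flagged, and only on accuracy, which points at the accuracy anchors as the first thing to rewrite.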

Part 3: Running Evals Without CI

You don't need GitHub Actions to start. Here's a manual eval script you can run from your terminal:

#!/usr/bin/env python3
"""
run_evals.py — Manual eval runner for indie hackers
Usage: python run_evals.py --dataset data/golden.jsonl
"""

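import argparse
import json
import statistics

# Assumption: the Part 2 code (JUDGE_PROMPT and judge_response) is saved
# next to this script as judge.py. Adjust the import to match your layout.
from judge import judge_response

def load_dataset(path: str) -> list[dict]:
    """Load one JSON object per line, skipping blank lines."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                cases.append(json.loads(line))
    return cases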

def run_eval_suite(dataset: list[dict], your_llm_fn) -> dict:
    results = []
    for case in dataset:
        output = your_llm_fn(case["input"])
        scores = judge_response(case["input"], output)
        results.append(scores)

    return {
        "accuracy_mean": statistics.mean(r["accuracy"] for r in results),
        "tone_mean": statistics.mean(r["tone"] for r in results),
        "format_mean": statistics.mean(r["format"] for r in results),
        "composite_mean": statistics.mean(r["composite"] for r in results),
        "n": len(results)
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    args = parser.parse_args()

    dataset = load_dataset(args.dataset)
    # your_llm_function is a placeholder: define or import a function that
    # takes the input string and returns your model's output string.
    results = run_eval_suite(dataset, your_llm_function)

    print(f"\n=== Eval Results ({results['n']} cases) ===")
    print(f"Accuracy:  {results['accuracy_mean']:.2f}/5")
    print(f"Tone:      {results['tone_mean']:.2f}/5")
    print(f"Format:    {results['format_mean']:.2f}/5")
    print(f"Composite: {results['composite_mean']:.2f}/5")
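The script reads only the "input" field from each line of the dataset, so a golden file can be as simple as one JSON object per line. The sketch below writes and re-reads such a file. The example inputs and the `write_jsonl` helper are illustrative, and it writes to a temporary directory rather than `data/golden.jsonl` so it is safe to run anywhere.

```python
import json
import os
import tempfile

# Illustrative cases; in practice, take these from your own logs.
golden_cases = [
    {"input": "How do I reset my password?"},
    {"input": "Summarise this refund policy in two sentences."},
]

def write_jsonl(path: str, cases: list[dict]) -> None:
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for case in cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")

path = os.path.join(tempfile.mkdtemp(), "golden.jsonl")
write_jsonl(path, golden_cases)

# Read it back the same way the eval runner does.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))  # 2
```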

Spreadsheet tracking (no code required):

If you prefer not to code, you can run this manually:

1. Take 20 real user inputs from your logs
2. Run them through your current LLM
3. Score each output using the rubric above (you as the human judge)
4. Record in a spreadsheet: date, model version, accuracy_avg, tone_avg, format_avg
5. After each deployment, re-run on the same 20 inputs

This gives you a trend line. If accuracy drops from 4.2 to 3.8 after a prompt change, you know something regressed.
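The same comparison can be done in a few lines once you have two sets of averages. This is a sketch, not a statistical test: the 0.3-point threshold is an assumed cut-off you should tune to how noisy your own scores are, and `detect_regressions` is a hypothetical helper name.

```python
def detect_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.3,
) -> dict[str, float]:
    """Return axes whose mean score dropped by at least `threshold`."""
    drops = {}
    for axis, base in baseline.items():
        delta = current[axis] - base
        if delta <= -threshold:
            drops[axis] = round(delta, 2)
    return drops

# Averages from two runs on the same inputs (values are illustrative).
before = {"accuracy": 4.2, "tone": 4.5, "format": 4.0}
after = {"accuracy": 3.8, "tone": 4.4, "format": 4.1}

print(detect_regressions(before, after))  # {'accuracy': -0.4}
```

With only 20 cases a small change can be noise, so treat a flagged axis as a prompt to read the individual outputs rather than as proof.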

Part 4: The Cost Math

For 100 test cases per eval run:

Component	Model	Cost
100 LLM calls (your model)	GPT-4o-mini	~£0.05
100 judge calls	GPT-4o-mini	~£0.12
Total		~£0.17–0.22

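Compare that to Braintrust at £180/month. At roughly £0.20 per run, you'd need about 900 eval runs a month to spend as much as the subscription. Even at 2 PRs per day that's only about 60 runs (around £12), and more likely you run 20–30 a month (£4–6), which makes the DIY approach at least 15x cheaper.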
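The break-even figure is plain division, and it is worth re-running with your own numbers. The sketch below uses the £180/month and £0.20/run figures from this section. The function names are illustrative.

```python
def break_even_runs(subscription_per_month: float, cost_per_run: float) -> float:
    """Runs per month at which DIY spend equals the subscription."""
    return subscription_per_month / cost_per_run

def monthly_diy_cost(runs_per_month: int, cost_per_run: float) -> float:
    """Total DIY spend for a month of eval runs."""
    return runs_per_month * cost_per_run

SUBSCRIPTION = 180.0  # £/month, figure quoted in this section
COST_PER_RUN = 0.20   # £/run, midpoint of the table above

print(round(break_even_runs(SUBSCRIPTION, COST_PER_RUN)))  # 900

for runs in (20, 30, 60):
    cost = monthly_diy_cost(runs, COST_PER_RUN)
    print(f"{runs} runs/month: £{cost:.2f}")
# 20 runs/month: £4.00
# 30 runs/month: £6.00
# 60 runs/month: £12.00
```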

The 70% cost reduction trick:

Once your system is stable, don't run all 100 test cases every time. Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:

Changing the base model
Rewriting the system prompt substantially
After a production incident

With sampling, recurring eval costs drop to £0.05–0.07 per run.
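A minimal sketch of the sampling step, using the standard library's `random.sample`. The `seed` parameter is an added assumption: fixing it lets you re-run the exact same subset when you need to compare two runs, while leaving it as `None` gives a fresh subset each time.

```python
import random

def sample_cases(dataset: list, fraction: float = 0.3, seed=None) -> list:
    """Pick a random subset of the dataset without replacement."""
    if not dataset:
        return []
    k = max(1, round(len(dataset) * fraction))
    rng = random.Random(seed)  # local generator; global random state untouched
    return rng.sample(dataset, k)

# Stand-in for 100 golden cases.
golden = [{"input": f"case {i}"} for i in range(100)]

subset = sample_cases(golden, fraction=0.3, seed=42)
print(len(subset))  # 30
```

Because different subsets have different average difficulty, scores from sampled runs will wobble more than full-suite scores. Compare like with like before calling something a regression.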

Sample Rubrics for 5 Common Use Cases

1. Customer Support Bot

ACCURACY: Does the response correctly answer the customer's question or correctly 
escalate what it cannot answer?
TONE: Is the response empathetic but efficient — not robotic, not over-apologetic?
FORMAT: Is the response an appropriate length (not a wall of text for simple questions)?

2. Code Generation Assistant

ACCURACY: Does the code run without errors and correctly implement the requested logic?
TONE: Are explanations clear and appropriately concise?
FORMAT: Is the code properly formatted with necessary comments?

3. Document Summarisation

ACCURACY: Does the summary capture all key points without adding fabricated information?
TONE: Is the language neutral and appropriate for a business context?
FORMAT: Is the summary structured appropriately for the document length (1-paragraph 
for short docs, bullet points for long docs)?

4. Email Drafter

ACCURACY: Does the email correctly convey the requested message?
TONE: Does it match the requested register (formal/casual) without being 
over-the-top?
FORMAT: Appropriate subject line, greeting, body, sign-off?

5. RAG-based Q&A

ACCURACY: Is the answer grounded in the retrieved context, with no hallucinated details?
TONE: Does the response acknowledge uncertainty when the context is insufficient?
FORMAT: Is the source attribution clear and the answer scannable?
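To plug one of these rubrics into the Part 2 judge, one option is to build the prompt from a per-use-case dictionary of questions. The sketch below reuses the anchor wording from Part 2 and the customer support questions from above. `build_judge_prompt` and the dictionary layout are hypothetical, and reusing the same anchors for every use case is an assumption you may want to revisit.

```python
AXES = ("ACCURACY", "TONE", "FORMAT")

# Anchor wording copied from the Part 2 judge prompt.
ANCHORS = {
    "ACCURACY": ["5: Fully correct, no errors or omissions",
                 "3: Mostly correct with minor issues",
                 "1: Significantly wrong or misleading"],
    "TONE": ["5: Confident, clear, and direct",
             "3: Acceptable but slightly off",
             "1: Overly apologetic OR dismissive"],
    "FORMAT": ["5: Perfect length, appropriate markdown, scannable",
               "3: Correct but could be improved",
               "1: Wall of text or too terse"],
}

SUPPORT_BOT_RUBRIC = {
    "ACCURACY": "Does the response correctly answer the customer's question "
                "or correctly escalate what it cannot answer?",
    "TONE": "Is the response empathetic but efficient?",
    "FORMAT": "Is the response an appropriate length?",
}

def build_judge_prompt(rubric: dict[str, str]) -> str:
    """Return a template that still contains {user_input}/{assistant_output}."""
    lines = ["You are evaluating an AI assistant's response. "
             "Score on three axes (1-5 each):", ""]
    for axis in AXES:
        lines.append(f"{axis}: {rubric[axis]}")
        lines.extend(f"- {anchor}" for anchor in ANCHORS[axis])
        lines.append("")
    lines.append("Input: {user_input}")
    lines.append("Response: {assistant_output}")
    lines.append("")
    # Doubled braces survive the later .format() call as literal braces.
    lines.append('Return JSON: {{"accuracy": N, "tone": N, "format": N, '
                 '"reasoning": "one sentence"}}')
    return "\n".join(lines)

template = build_judge_prompt(SUPPORT_BOT_RUBRIC)
filled = template.format(user_input="Where is my order?",
                         assistant_output="It ships tomorrow.")
```

The returned template can be passed through `.format(...)` exactly like JUDGE_PROMPT, so the rest of the judge code does not need to change.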

Next Steps

This quick start is enough to ship a working eval system this week. For the full system — multi-model comparison (GPT-4o vs Claude vs Gemini side-by-side), GitHub Actions CI integration, handling eval drift over time, and scaling from 100 to 10,000 test cases — see the complete playbook:

The Indie Hacker's LLM Eval Playbook — £29, instant download

The playbook covers everything from golden dataset construction to advanced rubric design and cost optimisation at scale.

Questions? Reach out at hello@hadleyworks.com