DEV Community

Charlie Hadley

How I Caught an LLM Regression That Cost My Client £5K Before It Hit Production

Quick Start: LLM Eval Rubrics for Indie Hackers

A 15-minute guide to catching LLM regressions without paying $300/month


The Problem

You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.

This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.

This guide gives you a working eval system for about £0.20 per full test run.


Part 1: The Three-Axis Rubric

Every LLM output can be evaluated on three dimensions that catch 85% of production-breaking regressions:

Accuracy — Does the output correctly address the user's request?
Tone — Is the response helpful without being sycophantic or dismissive?
Format — Is the response appropriately structured for the context?

Why these three? Because they map directly to the three ways LLM outputs fail in production:

  • Factual/logical errors (Accuracy)
  • Personality drift after fine-tuning or system prompt changes (Tone)
  • Structural regressions when output parsers break (Format)

Writing Rubric Language That Works

The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer on your team.

Bad rubric language:

"Is the response good? Score 1-10."

GPT-4o-mini has no idea what "good" means for your product. This produces inconsistent scores that aren't actionable.

Good rubric language:

"ACCURACY: Does the response correctly address the user's request?

  • 5: Fully correct, no errors or omissions
  • 3: Mostly correct with minor issues that don't affect usability
  • 1: Significantly wrong or misleading"

Concrete anchors at 1, 3, and 5 make the scores reproducible. You want your judge to score the same output the same way every time.
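One way to keep anchors consistent across axes is to store them as data and render the rubric text from that. The sketch below is only an illustration: `ACCURACY_ANCHORS` and `render_axis` are hypothetical names, not part of any library.

```python
# Hypothetical helper: render one rubric axis from its anchor definitions.
ACCURACY_ANCHORS = {
    5: "Fully correct, no errors or omissions",
    3: "Mostly correct with minor issues that don't affect usability",
    1: "Significantly wrong or misleading",
}

def render_axis(name: str, question: str, anchors: dict) -> str:
    """Return the rubric text for one axis, highest score first."""
    lines = [f"{name.upper()}: {question}"]
    for score in sorted(anchors, reverse=True):
        lines.append(f"- {score}: {anchors[score]}")
    return "\n".join(lines)

accuracy_block = render_axis(
    "accuracy",
    "Does the response correctly address the user's request?",
    ACCURACY_ANCHORS,
)
print(accuracy_block)
```

Editing an anchor then means changing one dictionary entry rather than hunting through a long prompt string.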


Part 2: Your First Judge Prompt (Copy-Paste Ready)

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""

How to use it:

import json

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(user_input: str, assistant_output: str) -> dict:
    # Fill the rubric template with the case under evaluation
    prompt = JUDGE_PROMPT.format(
        user_input=user_input,
        assistant_output=assistant_output
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0  # reduce run-to-run variation in the scores
    )

    scores = json.loads(response.choices[0].message.content)
    # Composite is the unweighted mean of the three axes
    scores["composite"] = (scores["accuracy"] + scores["tone"] + scores["format"]) / 3
    return scores
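JSON mode returns parseable JSON, but it does not guarantee that every key is present or that each score stays within 1 to 5. A small check before computing the composite can catch that early. This is a minimal sketch; `validate_scores` is a hypothetical helper, not part of the OpenAI SDK.

```python
AXES = ("accuracy", "tone", "format")

def validate_scores(scores: dict) -> dict:
    """Raise ValueError unless every axis is an integer from 1 to 5."""
    for axis in AXES:
        value = scores.get(axis)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")
    return scores

# Illustrative input, not real judge output
checked = validate_scores({"accuracy": 5, "tone": 3, "format": 4, "reasoning": "Clear answer."})
print(checked["accuracy"], checked["tone"], checked["format"])
```

Calling it on the parsed JSON inside `judge_response`, before the composite is computed, would turn a malformed judge reply into an explicit error instead of a confusing `KeyError`.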

Iterating your judge prompt: After running on 20–30 cases, review any score where the reasoning doesn't match your intuition. That mismatch tells you exactly which anchor definition to rewrite.
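If you also hand-score those 20–30 cases yourself, the mismatches can be listed mechanically. The sketch below assumes you have your own scores for each case in the same order as the judge's; `find_disagreements` and the 2-point gap are hypothetical choices, not a standard.

```python
def find_disagreements(judged: list, human: list, min_gap: int = 2) -> list:
    """Return one record per (case, axis) where judge and human differ by min_gap or more."""
    flagged = []
    for index, (j, h) in enumerate(zip(judged, human)):
        for axis in ("accuracy", "tone", "format"):
            if abs(j[axis] - h[axis]) >= min_gap:
                flagged.append({
                    "case": index,
                    "axis": axis,
                    "judge": j[axis],
                    "human": h[axis],
                    "reasoning": j.get("reasoning", ""),
                })
    return flagged

# Illustrative data only, not real eval output
judged = [
    {"accuracy": 5, "tone": 4, "format": 5, "reasoning": "Answers the question fully."},
    {"accuracy": 4, "tone": 2, "format": 4, "reasoning": "Too apologetic."},
]
human = [
    {"accuracy": 2, "tone": 4, "format": 5},
    {"accuracy": 4, "tone": 3, "format": 4},
]
for item in find_disagreements(judged, human):
    print(item)
```

Each flagged record names the axis to look at, so the judge's reasoning can be read next to the anchor definition it was supposed to apply.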


Part 3: Running Evals Without CI

You don't need GitHub Actions to start. Here's a manual eval script you can run from your terminal:

#!/usr/bin/env python3
"""
run_evals.py: Manual eval runner for indie hackers
Usage: python run_evals.py --dataset data/golden.jsonl
"""

import argparse
import json
import statistics

# Assumes JUDGE_PROMPT and judge_response() from Part 2 are defined in this
# file, or imported from wherever you saved them.

def load_dataset(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                cases.append(json.loads(line))
    return cases

def run_eval_suite(dataset: list[dict], your_llm_fn) -> dict:
    """Run every case through your LLM, judge it, and return per-axis means."""
    results = []
    for case in dataset:
        output = your_llm_fn(case["input"])
        scores = judge_response(case["input"], output)
        results.append(scores)

    return {
        "accuracy_mean": statistics.mean(r["accuracy"] for r in results),
        "tone_mean": statistics.mean(r["tone"] for r in results),
        "format_mean": statistics.mean(r["format"] for r in results),
        "composite_mean": statistics.mean(r["composite"] for r in results),
        "n": len(results)
    }

def your_llm_function(user_input: str) -> str:
    # Placeholder: replace with a call to the LLM feature you are testing
    raise NotImplementedError("wire this up to your own LLM call")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    args = parser.parse_args()

    dataset = load_dataset(args.dataset)
    results = run_eval_suite(dataset, your_llm_function)

    print(f"\n=== Eval Results ({results['n']} cases) ===")
    print(f"Accuracy:  {results['accuracy_mean']:.2f}/5")
    print(f"Tone:      {results['tone_mean']:.2f}/5")
    print(f"Format:    {results['format_mean']:.2f}/5")
    print(f"Composite: {results['composite_mean']:.2f}/5")
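The runner only reads an `"input"` key from each line of the dataset, so a golden file can be as simple as one JSON object per line. The sketch below shows one way to produce it. `write_golden_dataset` is a hypothetical helper, the inputs are placeholders, and it writes to a temporary directory only so the example cleans up after itself.

```python
import json
import tempfile
from pathlib import Path

def write_golden_dataset(inputs: list, path) -> int:
    """Write one {"input": ...} object per line and return the number of cases."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        for text in inputs:
            f.write(json.dumps({"input": text}, ensure_ascii=False) + "\n")
    return len(inputs)

# Placeholder inputs; use real user inputs from your logs instead
example_inputs = [
    "How do I reset my password?",
    "Summarise this invoice in two sentences.",
]

with tempfile.TemporaryDirectory() as tmp:
    # In practice this would be a path you keep, such as data/golden.jsonl
    path = Path(tmp) / "golden.jsonl"
    count = write_golden_dataset(example_inputs, path)
    first_line = path.read_text(encoding="utf-8").splitlines()[0]

print(count, first_line)
```

Extra keys such as an ID or a note per case can sit alongside `"input"` without affecting the runner, since it ignores anything it does not read.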

Spreadsheet tracking (no code required):

If you prefer not to code, you can run this manually:

  1. Take 20 real user inputs from your logs
  2. Run them through your current LLM
  3. Score each output using the rubric above (you as the human judge)
  4. Record in a spreadsheet: date, model version, accuracy_avg, tone_avg, format_avg
  5. After each deployment, re-run on the same 20 inputs

This gives you a trend line. If accuracy drops from 4.2 to 3.8 after a prompt change, you know something regressed.
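The same comparison can be automated once the rows live in a file or a list. This is a minimal sketch: `find_regressions` is a hypothetical helper, and the 0.3-point threshold is an assumption you should tune to how noisy your own scores are.

```python
def find_regressions(previous: dict, current: dict, threshold: float = 0.3) -> dict:
    """Return the axes whose average dropped by at least `threshold` points."""
    drops = {}
    for axis in ("accuracy_avg", "tone_avg", "format_avg"):
        drop = previous[axis] - current[axis]
        if drop >= threshold:
            drops[axis] = round(drop, 2)
    return drops

# Illustrative rows; the accuracy figures match the 4.2 -> 3.8 example above
before = {"accuracy_avg": 4.2, "tone_avg": 4.0, "format_avg": 4.5}
after = {"accuracy_avg": 3.8, "tone_avg": 4.1, "format_avg": 4.4}

print(find_regressions(before, after))  # {'accuracy_avg': 0.4}
```

With only 20 inputs, small movements are mostly noise, which is why the sketch ignores the 0.1-point changes in tone and format.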


Part 4: The Cost Math

For 100 test cases per eval run:

Component                    Model         Cost
100 LLM calls (your model)   GPT-4o-mini   ~£0.05
100 judge calls              GPT-4o-mini   ~£0.12
Total                                      ~£0.17–0.22

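Compare that to Braintrust at £180/month. At roughly £0.20 per run, you'd need about 900 eval runs a month before DIY costs as much. Even at 2 PRs per day (around 60 runs/month) you'd spend about £12. More likely you run 20–30 runs/month, which comes to £4–6 and makes the DIY approach roughly 30–45x cheaper.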
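The comparison is simple enough to check in a few lines. The sketch below uses only the figures quoted above (about £0.20 per run and £180/month); the function names are hypothetical.

```python
COST_PER_RUN_GBP = 0.20   # approximate cost of one 100-case run, from the table above
HOSTED_PLAN_GBP = 180.0   # monthly figure quoted above

def monthly_diy_cost(runs_per_month: int, cost_per_run: float = COST_PER_RUN_GBP) -> float:
    """Total DIY spend for a month of eval runs."""
    return runs_per_month * cost_per_run

def break_even_runs(plan_cost: float = HOSTED_PLAN_GBP, cost_per_run: float = COST_PER_RUN_GBP) -> int:
    """Number of runs per month at which DIY spend equals the hosted plan."""
    return round(plan_cost / cost_per_run)

for runs in (20, 30, 60):
    cost = monthly_diy_cost(runs)
    ratio = HOSTED_PLAN_GBP / cost
    print(f"{runs} runs/month: £{cost:.2f} DIY, about {ratio:.0f}x cheaper")

print(f"Break-even: {break_even_runs()} runs/month")
```

Swap in your own per-run cost and plan price to rerun the comparison for a different model or provider.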

The 70% cost reduction trick:

Once your system is stable, don't run all 100 test cases every time. Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:

  • Changing the base model
  • Rewriting the system prompt substantially
  • After a production incident

With sampling, recurring eval costs drop to £0.05–0.07 per run.
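The sampling rule can be sketched with the standard library. `select_cases` and the trigger names are hypothetical; the 30% fraction and the three full-suite conditions come from the list above.

```python
import random
from typing import Optional

# Reasons that should always run the full golden dataset (names are illustrative)
FULL_SUITE_TRIGGERS = {"base_model_change", "system_prompt_rewrite", "production_incident"}

def select_cases(dataset: list, reason: str = "routine",
                 fraction: float = 0.3, seed: Optional[int] = None) -> list:
    """Return the full dataset for trigger events, otherwise a random sample."""
    if not dataset or reason in FULL_SUITE_TRIGGERS:
        return list(dataset)
    rng = random.Random(seed)  # pass a seed to make a routine run repeatable
    k = max(1, round(len(dataset) * fraction))
    return rng.sample(dataset, k)

# Placeholder dataset of 100 cases
dataset = [{"input": f"case {i}"} for i in range(100)]

print(len(select_cases(dataset)))                                # 30
print(len(select_cases(dataset, reason="production_incident")))  # 100
```

Drawing a fresh sample on each routine run means different cases get exercised over time, while a fixed seed is useful when you want to compare two runs on identical inputs.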


Sample Rubrics for 5 Common Use Cases

1. Customer Support Bot

ACCURACY: Does the response correctly answer the customer's question or correctly 
escalate what it cannot answer?
TONE: Is the response empathetic but efficient — not robotic, not over-apologetic?
FORMAT: Is the response an appropriate length (not a wall of text for simple questions)?

2. Code Generation Assistant

ACCURACY: Does the code run without errors and correctly implement the requested logic?
TONE: Are explanations clear and appropriately concise?
FORMAT: Is the code properly formatted with necessary comments?

3. Document Summarisation

ACCURACY: Does the summary capture all key points without adding fabricated information?
TONE: Is the language neutral and appropriate for a business context?
FORMAT: Is the summary structured appropriately for the document length (1-paragraph 
for short docs, bullet points for long docs)?

4. Email Drafter

ACCURACY: Does the email correctly convey the requested message?
TONE: Does it match the requested register (formal/casual) without being 
over-the-top?
FORMAT: Appropriate subject line, greeting, body, sign-off?

5. RAG-based Q&A

ACCURACY: Does the answer come from the retrieved context and not hallucinate?
TONE: Does the response acknowledge uncertainty when the context is insufficient?
FORMAT: Is the source attribution clear and the answer scannable?

Next Steps

This quick start is enough to ship a working eval system this week. For the full system — multi-model comparison (GPT-4o vs Claude vs Gemini side-by-side), GitHub Actions CI integration, handling eval drift over time, and scaling from 100 to 10,000 test cases — see the complete playbook:

The Indie Hacker's LLM Eval Playbook — £29, instant download

The playbook covers everything from golden dataset construction to advanced rubric design and cost optimisation at scale.


Questions? Reach out at hello@hadleyworks.com
