Quick Start: LLM Eval Rubrics for Indie Hackers
A 15-minute guide to catching LLM regressions without paying $300/month
The Problem
You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.
This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.
This guide gives you a working eval system for about £0.20 per full test run.
Part 1: The Three-Axis Rubric
Every LLM output can be evaluated on three dimensions that catch 85% of production-breaking regressions:
Accuracy — Does the output correctly address the user's request?
Tone — Is the response helpful without being sycophantic or dismissive?
Format — Is the response appropriately structured for the context?
Why these three? Because they map directly to the three ways LLM outputs fail in production:
- Factual/logical errors (Accuracy)
- Personality drift after fine-tuning or system prompt changes (Tone)
- Structural regressions when output parsers break (Format)
Writing Rubric Language That Works
The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer on your team.
Bad rubric language:
"Is the response good? Score 1-10."
The judge model (GPT-4o-mini in this guide) has no idea what "good" means for your product. A prompt like this produces inconsistent scores that you can't act on.
Good rubric language:
"ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues that don't affect usability
- 1: Significantly wrong or misleading"
Concrete anchors at 1, 3, and 5 make the scores reproducible. You want your judge to score the same output the same way every time.
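One way to check that reproducibility claim is to judge the same output several times and look at how far the scores move. The sketch below is only an illustration: `score_spread` and `fake_judge` are hypothetical names, and `fake_judge` stands in for whatever judge function you end up using.

```python
from collections.abc import Callable

AXES = ("accuracy", "tone", "format")


def score_spread(
    judge_fn: Callable[[str, str], dict],
    user_input: str,
    assistant_output: str,
    repeats: int = 3,
) -> dict:
    """Judge the same output `repeats` times and return max - min per axis."""
    runs = [judge_fn(user_input, assistant_output) for _ in range(repeats)]
    return {
        axis: max(run[axis] for run in runs) - min(run[axis] for run in runs)
        for axis in AXES
    }


# Stand-in judge for demonstration only; it always returns the same scores.
def fake_judge(user_input: str, assistant_output: str) -> dict:
    return {"accuracy": 4, "tone": 5, "format": 3}


spread = score_spread(fake_judge, "How do I reset my password?", "Click 'Forgot password'.")
print(spread)  # a spread of 0 on every axis means the judge was consistent
```

If an axis regularly shows a spread of 2 or more on identical input, the anchors for that axis are probably too loose.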
Part 2: Your First Judge Prompt (Copy-Paste Ready)
JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""
How to use it:
import json

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_response(user_input: str, assistant_output: str) -> dict:
    """Score one response with the judge model and add a composite score."""
    prompt = JUDGE_PROMPT.format(
        user_input=user_input,
        assistant_output=assistant_output,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # reduces run-to-run variation in the scores
    )
    scores = json.loads(response.choices[0].message.content)
    # Unweighted mean of the three axes
    scores["composite"] = (scores["accuracy"] + scores["tone"] + scores["format"]) / 3
    return scores
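Even in JSON mode, a judge can return a missing key or a value outside the 1-5 scale, which would skew the composite without any error. A minimal sketch of a guard you could run on the parsed scores before averaging them follows; `validate_scores` is a hypothetical helper, and rejecting non-integer scores is an assumption you may want to relax.

```python
AXES = ("accuracy", "tone", "format")


def validate_scores(scores: dict) -> dict:
    """Raise ValueError unless every axis is an integer from 1 to 5."""
    for axis in AXES:
        value = scores.get(axis)
        # bool is a subclass of int in Python, so it is excluded explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")
    return scores


checked = validate_scores({"accuracy": 5, "tone": 3, "format": 4, "reasoning": "Clear and correct."})
print(checked["accuracy"])
```

Failing loudly on a malformed judge reply is usually preferable to letting it drag your averages around.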
Iterating your judge prompt: After running on 20–30 cases, review any score where the reasoning doesn't match your intuition. That mismatch tells you exactly which anchor definition to rewrite.
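If you jot down your own 1-5 score next to each judge score during that review, the mismatches can be listed mechanically. This is a rough sketch under that assumption: the record layout (`id`, `human`, `judge`) and the `min_gap` threshold of 2 are made up for illustration.

```python
AXES = ("accuracy", "tone", "format")


def find_disagreements(cases: list[dict], min_gap: int = 2) -> list[tuple]:
    """Return (case id, axis, human score, judge score) where the two differ by min_gap or more."""
    flagged = []
    for case in cases:
        for axis in AXES:
            human, judge = case["human"][axis], case["judge"][axis]
            if abs(human - judge) >= min_gap:
                flagged.append((case["id"], axis, human, judge))
    return flagged


reviewed = [
    {"id": "c1",
     "human": {"accuracy": 2, "tone": 4, "format": 4},
     "judge": {"accuracy": 5, "tone": 4, "format": 3}},
    {"id": "c2",
     "human": {"accuracy": 5, "tone": 3, "format": 4},
     "judge": {"accuracy": 4, "tone": 3, "format": 4}},
]

flagged = find_disagreements(reviewed)
print(flagged)
```

Each flagged axis points at the anchor definition worth rewriting first.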
Part 3: Running Evals Without CI
You don't need GitHub Actions to start. Here's a manual eval script you can run from your terminal:
#!/usr/bin/env python3
"""
run_evals.py: manual eval runner for indie hackers
Usage: python run_evals.py --dataset data/golden.jsonl
"""
import argparse
import json
import statistics

# Assumes the Part 2 code is saved next to this script as judge.py (hypothetical filename)
from judge import judge_response


def load_dataset(path: str) -> list[dict]:
    """Load one JSON object per line, skipping blank lines."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                cases.append(json.loads(line))
    return cases


def run_eval_suite(dataset: list[dict], your_llm_fn) -> dict:
    """Run every case through your LLM, judge it, and return mean scores."""
    results = []
    for case in dataset:
        output = your_llm_fn(case["input"])
        scores = judge_response(case["input"], output)
        results.append(scores)
    return {
        "accuracy_mean": statistics.mean(r["accuracy"] for r in results),
        "tone_mean": statistics.mean(r["tone"] for r in results),
        "format_mean": statistics.mean(r["format"] for r in results),
        "composite_mean": statistics.mean(r["composite"] for r in results),
        "n": len(results),
    }


def your_llm_function(user_input: str) -> str:
    """Placeholder: replace the body with a call to your own LLM feature."""
    raise NotImplementedError("Wire this up to your LLM feature")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    args = parser.parse_args()
    dataset = load_dataset(args.dataset)
    # Replace with your actual LLM function
    results = run_eval_suite(dataset, your_llm_function)
    print(f"\n=== Eval Results ({results['n']} cases) ===")
    print(f"Accuracy: {results['accuracy_mean']:.2f}/5")
    print(f"Tone: {results['tone_mean']:.2f}/5")
    print(f"Format: {results['format_mean']:.2f}/5")
    print(f"Composite: {results['composite_mean']:.2f}/5")
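The script expects a JSON Lines file in which every line is an object with an "input" key. Here is a small sketch of writing and reading such a file. The `id` field and the helper names are assumptions for illustration, and the example writes to a temporary directory rather than data/golden.jsonl.

```python
import json
import tempfile
from pathlib import Path

# "input" is the only key the runner reads; "id" is a hypothetical extra for your own bookkeeping.
cases = [
    {"id": "reset-password", "input": "How do I reset my password?"},
    {"id": "refund-summary", "input": "Summarise this refund policy in two sentences."},
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per line, ignoring blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "golden.jsonl"
    write_jsonl(golden_path, cases)
    loaded = read_jsonl(golden_path)

print(len(loaded))
```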
Spreadsheet tracking (no code required):
If you prefer not to code, you can run this manually:
- Take 20 real user inputs from your logs
- Run them through your current LLM
- Score each output using the rubric above (you as the human judge)
- Record in a spreadsheet: date, model version, accuracy_avg, tone_avg, format_avg
- After each deployment, re-run on the same 20 inputs
This gives you a trend line. If accuracy drops from 4.2 to 3.8 after a prompt change, you know something regressed.
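The same comparison can be automated once you have two sets of averages. The following is a minimal sketch, not a recommended threshold: `find_regressions` is a hypothetical helper, and the 0.3-point tolerance is an assumption you would tune to your own score noise.

```python
def find_regressions(baseline: dict, current: dict, max_drop: float = 0.3) -> dict:
    """Return metrics whose mean fell by more than max_drop, mapped to the size of the drop."""
    drops = {}
    for metric, before in baseline.items():
        drop = before - current[metric]
        if drop > max_drop:
            drops[metric] = round(drop, 2)
    return drops


# Averages from the run before and after a prompt change (illustrative numbers)
baseline = {"accuracy_mean": 4.2, "tone_mean": 4.5, "format_mean": 4.0}
current = {"accuracy_mean": 3.8, "tone_mean": 4.5, "format_mean": 4.1}

regressions = find_regressions(baseline, current)
print(regressions)  # only accuracy is flagged
```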
Part 4: The Cost Math
For 100 test cases per eval run:
| Component | Model | Cost |
|---|---|---|
| 100 LLM calls (your model) | GPT-4o-mini | ~£0.05 |
| 100 judge calls | GPT-4o-mini | ~£0.12 |
| Total | | ~£0.17–0.22 |
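Compare that to Braintrust at £180/month. At about £0.20 per run, you would need roughly 900 eval runs a month to spend as much. Even at 2 PRs per day that is only about 60 runs (around £12/month), and at a more typical 20–30 runs/month the DIY approach costs £4–6, roughly 30–45x cheaper.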
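The break-even arithmetic is easy to re-run with your own numbers. The sketch below uses the figures from this section (£180/month and roughly £0.20 per run) and works in pence to avoid floating-point rounding; the variable names are illustrative.

```python
SUBSCRIPTION_PENCE = 180 * 100   # £180/month, as quoted above
COST_PER_RUN_PENCE = 20          # about £0.20 per 100-case run

# Number of DIY runs per month that would cost as much as the subscription
break_even_runs = SUBSCRIPTION_PENCE // COST_PER_RUN_PENCE

# Monthly DIY cost in pounds at a few run volumes
monthly_cost_pounds = {
    runs: runs * COST_PER_RUN_PENCE / 100 for runs in (20, 30, 60)
}

print(break_even_runs)
print(monthly_cost_pounds)
```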
The 70% cost reduction trick:
Once your system is stable, don't run all 100 test cases every time. Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:
- Changing the base model
- Rewriting the system prompt substantially
- After a production incident
With sampling, recurring eval costs drop to £0.05–0.07 per run.
Sample Rubrics for 5 Common Use Cases
1. Customer Support Bot
ACCURACY: Does the response correctly answer the customer's question or correctly
escalate what it cannot answer?
TONE: Is the response empathetic but efficient — not robotic, not over-apologetic?
FORMAT: Is the response an appropriate length (not a wall of text for simple questions)?
2. Code Generation Assistant
ACCURACY: Does the code run without errors and correctly implement the requested logic?
TONE: Are explanations clear and appropriately concise?
FORMAT: Is the code properly formatted with necessary comments?
3. Document Summarisation
ACCURACY: Does the summary capture all key points without adding fabricated information?
TONE: Is the language neutral and appropriate for a business context?
FORMAT: Is the summary structured appropriately for the document length (1-paragraph
for short docs, bullet points for long docs)?
4. Email Drafter
ACCURACY: Does the email correctly convey the requested message?
TONE: Does it match the requested register (formal/casual) without being
over-the-top?
FORMAT: Appropriate subject line, greeting, body, sign-off?
5. RAG-based Q&A
ACCURACY: Does the answer come from the retrieved context and not hallucinate?
TONE: Does the response acknowledge uncertainty when the context is insufficient?
FORMAT: Is the source attribution clear and the answer scannable?
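If you support more than one of these use cases, one option is to keep the rubric questions in a dictionary and assemble the judge prompt per use case. The sketch below is an assumption about how you might organise this, not part of the prompt from Part 2: `RUBRICS` and `build_judge_prompt` are hypothetical names, only two use cases are shown, and you would still want to add 1/3/5 anchors to each axis.

```python
RUBRICS = {
    "support_bot": {
        "ACCURACY": "Does the response correctly answer the customer's question or correctly escalate what it cannot answer?",
        "TONE": "Is the response empathetic but efficient, not robotic, not over-apologetic?",
        "FORMAT": "Is the response an appropriate length (not a wall of text for simple questions)?",
    },
    "rag_qa": {
        "ACCURACY": "Does the answer come from the retrieved context and not hallucinate?",
        "TONE": "Does the response acknowledge uncertainty when the context is insufficient?",
        "FORMAT": "Is the source attribution clear and the answer scannable?",
    },
}

JSON_INSTRUCTION = 'Return JSON: {"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}'


def build_judge_prompt(use_case: str, user_input: str, assistant_output: str) -> str:
    """Assemble a judge prompt from the rubric questions for one use case."""
    rubric = RUBRICS[use_case]
    lines = ["You are evaluating an AI assistant's response. Score on three axes (1-5 each):"]
    lines += [f"{axis}: {question}" for axis, question in rubric.items()]
    lines += [f"Input: {user_input}", f"Response: {assistant_output}", JSON_INSTRUCTION]
    return "\n".join(lines)


prompt = build_judge_prompt("rag_qa", "What is the refund window?", "30 days, per the policy page.")
print(prompt)
```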
Next Steps
This quick start is enough to ship a working eval system this week. For the full system — multi-model comparison (GPT-4o vs Claude vs Gemini side-by-side), GitHub Actions CI integration, handling eval drift over time, and scaling from 100 to 10,000 test cases — see the complete playbook:
The Indie Hacker's LLM Eval Playbook — £29, instant download
The playbook covers everything from golden dataset construction to advanced rubric design and cost optimisation at scale.
Questions? Reach out at hello@hadleyworks.com