Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust
You've shipped an LLM feature. It works great in testing. Three weeks later, a user reports it's producing garbage outputs — and you have no idea what changed.
This is the LLM evaluation problem. And for indie hackers building solo, it's brutal.
The paid tools can run to $200–500/month:
- Braintrust: $180/month minimum
- LangSmith: $39/user/month (and you need a team to make it worthwhile)
- Arize: "call us for pricing" (translation: expensive)
If you have VC money, that's fine. If you're bootstrapped and paying for your own compute, that can be a fifth of your monthly burn.
Here's what I built instead — and why it works better than most paid tools for small teams.
The Three-Axis Rubric
In my experience, production LLM failures fall into three buckets:
- Factual/logical errors — the model gets the answer wrong
- Personality drift — the tone shifts after a system prompt change
- Structural regressions — output format breaks your downstream parser
So I evaluate on three axes: Accuracy, Tone, Format. Each scored 1–5 by a judge LLM. That's it.
This catches ~85% of production-breaking regressions. I validated this by running the rubric against 200 real production failures and tracking what the eval caught vs. missed.
The simplicity is the point. You don't need a dashboard or a team. You need a script that tells you when your prompts break production.
The Judge Prompt That Actually Works
Most people write judge prompts like: "Is this response good? Score 1-10."
GPT-4o-mini has no idea what "good" means for your specific product. You get inconsistent, unactionable scores.
Here's what works:
JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):
ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading
TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive
FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse
Input: {user_input}
Response: {assistant_output}
Return JSON: {"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}
"""
Concrete anchors at 1, 3, and 5 make scores far more reproducible. Run the judge at temperature 0 and it should give the same output nearly the same score on every run, which means regressions are detectable.
The key insight: you're not asking "is this good?" You're asking "does this meet these specific, measurable criteria?" That's a question a language model can actually answer consistently.
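To turn the judge's reply into numbers you can gate on, something like the following could work. Treat it as a minimal sketch under a few assumptions: `render_judge_prompt` and `parse_judge_reply` are hypothetical helper names, the composite is assumed to be the unweighted mean of the three axes, and the placeholders are filled with a single regex pass because `str.format` would trip over the literal braces in the prompt's JSON example line.

```python
import json
import re

AXES = ("accuracy", "tone", "format")

def render_judge_prompt(template: str, user_input: str, assistant_output: str) -> str:
    """Fill {user_input} and {assistant_output}, leaving all other braces alone."""
    values = {"user_input": user_input, "assistant_output": assistant_output}
    # Single pass, so placeholder-looking text inside the values is not re-expanded
    return re.sub(
        r"\{(user_input|assistant_output)\}",
        lambda m: values[m.group(1)],
        template,
    )

def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON reply and check each axis is an integer from 1 to 5."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    for axis in AXES:
        score = data.get(axis)
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"bad {axis} score: {score!r}")
    # Assumption: composite = unweighted mean of the three axes
    data["composite"] = sum(data[axis] for axis in AXES) / len(AXES)
    return data
```

Usage would look like `render_judge_prompt(JUDGE_PROMPT, user_msg, model_reply)` before the judge call and `parse_judge_reply(judge_text)` after it. Rejecting malformed replies loudly is a deliberate choice here: a silently defaulted score would hide exactly the regressions you are trying to catch.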
The Cost Math
For 100 test cases per eval run, using GPT-4o-mini as your judge:
| Component | Cost |
|---|---|
| 100 LLM calls (your model) | ~£0.05 |
| 100 judge calls (GPT-4o-mini) | ~£0.12 |
| Total | ~£0.17–0.22 per run |
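Compare to Braintrust at £180/month. At roughly £0.20 per run, you'd need about 900 eval runs/month to break even on the paid tool. Even at 2 deployments per day that's only ~60 runs, or about £12/month. More likely you run 20–30 runs/month, which costs £4–6 and makes DIY roughly 30x cheaper.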
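The arithmetic behind that comparison is easy to rerun with your own numbers. This is a small sketch, assuming a flat per-run cost of £0.20 (inside the ~£0.17–0.22 range in the table) and the £180/month subscription figure used above; the function names are made up for illustration.

```python
def monthly_cost(runs_per_month: int, cost_per_run: float) -> float:
    """Total DIY spend for a month of eval runs."""
    return runs_per_month * cost_per_run

def break_even_runs(subscription: float, cost_per_run: float) -> float:
    """Number of runs per month at which DIY costs as much as the subscription."""
    return subscription / cost_per_run

COST_PER_RUN = 0.20    # GBP, assumed from the table above
SUBSCRIPTION = 180.0   # GBP per month, figure used in the text

print("break-even runs/month:", round(break_even_runs(SUBSCRIPTION, COST_PER_RUN)))
for runs in (20, 30, 60):
    print(f"{runs} runs/month costs GBP {monthly_cost(runs, COST_PER_RUN):.2f}")
```

Swap in your own per-run cost once you have a few real runs on your invoice; judge token usage varies a lot with response length.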
The 70% cost reduction trick: Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:
- Changing the base model
- Rewriting the system prompt substantially
- After a production incident
This drops recurring cost to ~£0.05 per run.
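The sampling rule above can be sketched in a few lines. `select_cases` and its parameters are hypothetical names, and the `full_run` flag is assumed to be something you set yourself when one of the three triggers applies.

```python
import random
from typing import Optional

def select_cases(
    dataset: list[dict],
    full_run: bool = False,
    fraction: float = 0.3,
    seed: Optional[int] = None,
) -> list[dict]:
    """Return the whole golden set for full runs, otherwise a random subset."""
    if full_run or not dataset:
        return list(dataset)
    # Always keep at least one case so a tiny dataset still gets evaluated
    k = max(1, round(len(dataset) * fraction))
    return random.Random(seed).sample(dataset, k)
```

Passing a fixed `seed` keeps the subset stable so two runs are comparable; leaving it as `None` rotates through the dataset over time. Which one you want probably depends on whether you care more about run-to-run comparability or coverage.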
Why Golden Datasets Beat Synthetic Tests
The biggest mistake I see: people generate synthetic test cases. "Let me ask GPT-4 to write 100 diverse questions."
Don't do this. Synthetic tests are optimised for what the model was good at when it wrote them. They're circular. They won't catch the weird edge cases that your actual users send.
The right approach: pull real inputs from your production logs.
# Sample ~100 inputs from recent production traffic
import random

def build_golden_dataset(production_logs: list[dict], n: int = 100) -> list[dict]:
    # Sort by timestamp, newest first
    recent = sorted(production_logs, key=lambda x: x["ts"], reverse=True)
    # Sample from the latest 500 for diversity; don't just take the last 100
    pool = recent[:500]
    sampled = random.sample(pool, min(n, len(pool)))
    # Strip PII from these records before saving them anywhere
    return [
        {
            "input": log["user_message"],
            "expected_output": log["assistant_response"],  # your ground truth
            "metadata": {"ts": log["ts"], "session_id": log["session_id"]},
        }
        for log in sampled
    ]
Real data captures the actual distribution of your users' requests — including the weird ones that break your model.
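The snippet above says to filter PII but doesn't show how. Here is a rough sketch of one way to do it, assuming the `input` and `expected_output` keys from `build_golden_dataset`. The two regexes are illustrative only: they catch obvious email addresses and phone-like digit runs and will miss names, addresses, and plenty else, so treat this as a first pass rather than real redaction.

```python
import re

# Illustrative patterns only; real PII detection needs much more than this
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def scrub_case(case: dict) -> dict:
    """Return a copy of a golden-dataset record with its text fields scrubbed."""
    cleaned = dict(case)
    for key in ("input", "expected_output"):
        if isinstance(cleaned.get(key), str):
            cleaned[key] = scrub_pii(cleaned[key])
    return cleaned
```

Run `scrub_case` over every record before writing the dataset to disk, and skim the result by hand at least once; a golden dataset lives in your repo for a long time.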
The CI Gate (Under 20 Lines)
Once you have an eval script, adding it to CI is trivial:
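# .github/workflows/eval.yml
name: LLM Eval Gate
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies  # assumes a requirements.txt in the repo
        run: pip install -r requirements.txt
      - name: Run eval suite
        run: python run_evals.py --dataset data/golden.jsonl --threshold 3.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}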
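# run_evals.py (simplified)
# load_dataset, judge_response and your_llm are your own helpers, defined elsewhere
import argparse
import statistics
import sys

def main(dataset_path: str, threshold: float) -> None:
    dataset = load_dataset(dataset_path)
    scores = [judge_response(c["input"], your_llm(c["input"])) for c in dataset]
    # Assumption: composite = mean of the three axis scores, averaged over all cases
    composite = statistics.mean(
        (s["accuracy"] + s["tone"] + s["format"]) / 3 for s in scores
    )
    print(f"Composite score: {composite:.2f}/5")
    if composite < threshold:
        print(f"FAILED: score {composite:.2f} below threshold {threshold}")
        sys.exit(1)  # non-zero exit fails the check

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="data/golden.jsonl")
    parser.add_argument("--threshold", type=float, default=3.8)
    args = parser.parse_args()
    main(args.dataset, args.threshold)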
PRs that regress your model's performance don't merge, provided the eval job is set as a required status check on the branch. Simple.
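The simplified `run_evals.py` calls `load_dataset` without showing it, and a single averaged score can hide one axis collapsing while the other two stay high. Here is a sketch of both pieces, assuming the golden dataset is stored as JSONL (one JSON object per line) and that each score dict carries the `accuracy`, `tone`, and `format` keys from the judge prompt. `failing_axes` is a hypothetical helper, not part of the script above.

```python
import json
import statistics
from pathlib import Path

AXES = ("accuracy", "tone", "format")

def load_dataset(path: str) -> list[dict]:
    """Read one JSON object per non-empty line (JSONL)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def failing_axes(scores: list[dict], floor: float) -> dict[str, float]:
    """Return each axis whose mean score across all cases is below `floor`."""
    means = {axis: statistics.mean(s[axis] for s in scores) for axis in AXES}
    return {axis: mean for axis, mean in means.items() if mean < floor}
```

One option is to fail the run when `failing_axes(scores, floor=3.5)` is non-empty, in addition to the composite check; the 3.5 floor is an arbitrary example value you would tune against your own history.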
What This Doesn't Cover
This setup handles the 85% case. There are situations where you need more:
- Multi-model comparison — running the same eval against GPT-4o vs Claude vs Gemini to choose the best model for your use case
- Eval drift — your golden dataset gets stale as your users' needs evolve
- Adversarial testing — red-teaming for prompt injection and jailbreaks
- Scaling to 10,000+ test cases — sampling strategies and async eval runners
If you're hitting those problems, I've written up the full system in a detailed playbook covering all of these: The Indie Hacker's LLM Eval Playbook (£29).
It includes rubric templates for 5 common use cases (customer support bot, code generation, RAG Q&A, document summarisation, email drafting), the multi-model comparison framework, and the GitHub Actions integration I use in production.
But for most indie hackers, the three-axis rubric + golden dataset + CI gate above is enough to catch the regressions that actually hurt users. Start there.
What's your current approach to LLM evaluation? Curious what other solo builders are doing — drop a comment.