Charlie Hadley

Posted on May 18

Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust

#llm #machinelearning #devtools #indiehackers

You've shipped an LLM feature. It works great in testing. Three weeks later, a user reports it's producing garbage outputs — and you have no idea what changed.

This is the LLM evaluation problem. And for indie hackers building solo, it's brutal.

The hosted eval platforms get expensive quickly:

Braintrust: $180/month minimum
LangSmith: $39/user/month (and you need a team to make it worthwhile)
Arize: "call us for pricing" (translation: expensive)

If you have VC money, that's fine. If you're bootstrapped and paying for your own compute, that's a fifth of your runway.

Here's what I built instead — and why it works better than most paid tools for small teams.

The Three-Axis Rubric

In my experience, LLM outputs fail in three main ways:

Factual/logical errors — the model gets the answer wrong
Personality drift — the tone shifts after a system prompt change
Structural regressions — output format breaks your downstream parser

So I evaluate on three axes: Accuracy, Tone, Format. Each scored 1–5 by a judge LLM. That's it.

In my own testing, this catches ~85% of production-breaking regressions: I ran the rubric against 200 real production failures and tracked what the eval caught vs. missed.

The simplicity is the point. You don't need a dashboard or a team. You need a script that tells you when your prompts break production.
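One way to represent a three-axis score in code is sketched below. The `RubricScore` class, its `composite` property, and the `weakest_axis` helper are illustrative names, not part of the original system. Averaging the three axes into one composite number is an assumption about how the scores get combined.

```python
from dataclasses import dataclass
from statistics import mean

AXES = ("accuracy", "tone", "format")


@dataclass(frozen=True)
class RubricScore:
    """One judged output: three integer scores on the 1-5 scale."""
    accuracy: int
    tone: int
    format: int

    def __post_init__(self) -> None:
        # Reject anything outside the rubric's 1-5 scale early.
        for axis in AXES:
            value = getattr(self, axis)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")

    @property
    def composite(self) -> float:
        # Assumption: the composite is the plain average of the three axes.
        return mean(getattr(self, axis) for axis in AXES)

    def weakest_axis(self) -> str:
        # Points at the axis most likely responsible for a low composite.
        return min(AXES, key=lambda axis: getattr(self, axis))


score = RubricScore(accuracy=5, tone=4, format=3)
print(score.composite, score.weakest_axis())
```

Keeping the per-axis values alongside the composite means a failing run can say which kind of regression happened, not just that one did.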

The Judge Prompt That Actually Works

Most people write judge prompts like: "Is this response good? Score 1-10."

GPT-4o-mini has no idea what "good" means for your specific product. You get inconsistent, unactionable scores.

Here's what works:

# Fill with JUDGE_PROMPT.format(user_input=..., assistant_output=...).
# The braces in the JSON example are doubled so str.format() treats them as literals.
JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""

Concrete anchors at 1, 3, and 5 make scores far more reproducible. With the judge's temperature set to 0, the same output gets the same (or very nearly the same) score from run to run, which is what makes regressions detectable.

The key insight: you're not asking "is this good?" You're asking "does this meet these specific, measurable criteria?" That's a question a language model can actually answer consistently.
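The judge's reply still has to be turned into numbers before anything can be compared. The sketch below is one way to do that, assuming the judge returns the JSON shape requested in the prompt. `parse_judge_reply` is a hypothetical helper name, and the actual API call to the judge model is deliberately left out.

```python
import json

AXES = ("accuracy", "tone", "format")


def parse_judge_reply(raw: str) -> dict:
    """Extract and validate the JSON object from a judge reply."""
    # Judges sometimes wrap the JSON in extra prose, so parse only the
    # outermost {...} span rather than the whole string.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError

    scores = {}
    for axis in AXES:
        value = data.get(axis)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis!r} must be an integer 1-5, got {value!r}")
        scores[axis] = value
    scores["reasoning"] = str(data.get("reasoning", ""))
    return scores


reply = 'Sure! {"accuracy": 4, "tone": 5, "format": 3, "reasoning": "Minor omission."}'
print(parse_judge_reply(reply))
```

Raising on a malformed reply, rather than silently defaulting to a middle score, keeps judge failures from passing as real scores.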

The Cost Math

For 100 test cases per eval run, using GPT-4o-mini as your judge:

Component	Cost
100 LLM calls (your model)	~£0.05
100 judge calls (GPT-4o-mini)	~£0.12
Total	~£0.17–0.22 per run

Compare that to Braintrust at £180/month. At roughly £0.20 per run, you'd need about 900 eval runs a month before the paid tool breaks even. Even at 2 deployments per day that's only ~60 runs (about £12). More likely you run 20–30 runs/month (£4–6), making DIY well over 10x cheaper.
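The break-even arithmetic is easy to re-run with your own numbers. The sketch below uses the figures quoted above. The £0.20 per-run cost is an assumed round number inside the £0.17–0.22 range, and both function names are illustrative.

```python
PAID_TOOL_MONTHLY = 180.00  # monthly fee quoted above
COST_PER_RUN = 0.20         # assumption: round figure inside the 0.17-0.22 range


def break_even_runs(monthly_fee: float, cost_per_run: float) -> int:
    """Eval runs per month at which DIY spend equals the paid tool's fee."""
    return round(monthly_fee / cost_per_run)


def monthly_diy_cost(runs_per_month: int, cost_per_run: float) -> float:
    """Total DIY spend for a given number of eval runs per month."""
    return runs_per_month * cost_per_run


print("break-even runs/month:", break_even_runs(PAID_TOOL_MONTHLY, COST_PER_RUN))
for runs in (20, 30, 60):
    print(f"{runs} runs/month costs £{monthly_diy_cost(runs, COST_PER_RUN):.2f}")
```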

The 70% cost reduction trick: Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:

Changing the base model
Rewriting the system prompt substantially
After a production incident

This drops recurring cost to ~£0.05 per run.
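The sampling rule can be sketched in a few lines. `select_cases` and its `full_run` flag are hypothetical names. The optional `seed` is an addition that makes a sampled run repeatable when you want to compare two runs on the same subset.

```python
import random
from typing import Optional


def select_cases(dataset: list[dict], full_run: bool = False,
                 fraction: float = 0.3, seed: Optional[int] = None) -> list[dict]:
    """Return the whole dataset on full runs, otherwise a random subset."""
    if full_run or not dataset:
        return list(dataset)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    k = max(1, round(len(dataset) * fraction))
    return rng.sample(dataset, k)


golden = [{"id": i} for i in range(100)]
print(len(select_cases(golden)))                 # routine run: about 30% of cases
print(len(select_cases(golden, full_run=True)))  # model or prompt change: all cases
```

Leaving `seed` unset rotates through different cases on each routine run, so coverage accumulates over time. A fixed seed trades that for run-to-run comparability.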

Why Golden Datasets Beat Synthetic Tests

The biggest mistake I see: people generate synthetic test cases. "Let me ask GPT-4 to write 100 diverse questions."

Don't do this. Synthetic tests are optimised for what the model was good at when it wrote them. They're circular. They won't catch the weird edge cases that your actual users send.

The right approach: pull real inputs from your production logs.

# Sample up to 100 inputs from the 500 most recent production logs
# Filter out PII before saving
import random

def build_golden_dataset(production_logs: list[dict], n: int = 100) -> list[dict]:
    # Sort by timestamp, newest first
    recent = sorted(production_logs, key=lambda x: x["ts"], reverse=True)

    # Sample from the 500 most recent for diversity; don't just take the last 100.
    # Cap the sample size at the pool size so random.sample never raises ValueError.
    pool = recent[:500]
    sampled = random.sample(pool, min(n, len(pool)))

    return [
        {
            "input": log["user_message"],
            # Reference answer: review it before treating it as ground truth
            "expected_output": log["assistant_response"],
            "metadata": {"ts": log["ts"], "session_id": log["session_id"]}
        }
        for log in sampled
    ]

Real data captures the actual distribution of your users' requests — including the weird ones that break your model.
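The PII filtering and the save/load step could look roughly like the sketch below. The two regexes are deliberately crude assumptions that catch obvious emails and phone-like numbers only. They are not a complete PII filter, and the phone pattern will also hit other long digit runs such as order numbers. `redact_pii` and `save_jsonl` are hypothetical names. `load_dataset` reads the same one-JSON-object-per-line layout implied by the `.jsonl` extension.

```python
import json
import re
from pathlib import Path

# Crude illustrative patterns; real PII scrubbing needs more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace obvious emails and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def save_jsonl(cases: list[dict], path: str) -> None:
    """Write one redacted test case per line."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for case in cases:
            clean = {**case,
                     "input": redact_pii(case["input"]),
                     "expected_output": redact_pii(case["expected_output"])}
            f.write(json.dumps(clean, ensure_ascii=False) + "\n")


def load_dataset(path: str) -> list[dict]:
    """Read a JSONL golden dataset, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Redacting at save time means the raw text never lands in the repository, which matters once the golden file is checked in for CI.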

The CI Gate (Under 20 Lines)

Once you have an eval script, adding it to CI is trivial:

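# .github/workflows/eval.yml
name: LLM Eval Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies  # assumes a requirements.txt in the repo root
        run: pip install -r requirements.txt
      - name: Run eval suite
        run: python run_evals.py --dataset data/golden.jsonl --threshold 3.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}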

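# run_evals.py (simplified)
# load_dataset, judge_response and your_llm are your own helpers (not shown)
import argparse
import statistics
import sys

AXES = ("accuracy", "tone", "format")

def main(dataset_path: str, threshold: float) -> None:
    dataset = load_dataset(dataset_path)
    scores = [judge_response(c["input"], your_llm(c["input"])) for c in dataset]
    # Composite = mean of the three axis scores, averaged over all test cases
    composite = statistics.mean(s[axis] for s in scores for axis in AXES)

    print(f"Composite score: {composite:.2f}/5")
    if composite < threshold:
        print(f"FAILED: score {composite:.2f} below threshold {threshold}")
        sys.exit(1)  # non-zero exit fails the CI check

if __name__ == "__main__":
    # Read the flags the workflow passes instead of hard-coding them
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="data/golden.jsonl")
    parser.add_argument("--threshold", type=float, default=3.8)
    args = parser.parse_args()
    main(args.dataset, args.threshold)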

PRs that regress your model's performance fail the check, and once the check is marked as required in your branch protection rules, they can't merge. Simple.
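One caveat with gating on a single composite: an average can hide one axis collapsing. Scores of 5, 5, and 2 average to 4.0, which clears a 3.8 threshold even though format is badly broken. A possible extension, sketched below with hypothetical names and an assumed per-axis floor of 3.5, is to check each axis separately as well.

```python
from statistics import mean

AXES = ("accuracy", "tone", "format")


def axis_means(scores: list[dict]) -> dict:
    """Average each rubric axis across all judged test cases."""
    return {axis: mean(s[axis] for s in scores) for axis in AXES}


def failing_axes(scores: list[dict], floor: float = 3.5) -> list[str]:
    """Return the axes whose average falls below the floor (assumed value)."""
    means = axis_means(scores)
    return [axis for axis in AXES if means[axis] < floor]


scores = [{"accuracy": 5, "tone": 5, "format": 2}] * 3
print(axis_means(scores))
print(failing_axes(scores))
```

In the gate script, a non-empty result from `failing_axes` could trigger the same `sys.exit(1)` as a low composite.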

What This Doesn't Cover

This setup handles the 85% case. There are situations where you need more:

Multi-model comparison — running the same eval against GPT-4o vs Claude vs Gemini to choose the best model for your use case
Eval drift — your golden dataset gets stale as your users' needs evolve
Adversarial testing — red-teaming for prompt injection and jailbreaks
Scaling to 10,000+ test cases — sampling strategies and async eval runners

If you're hitting those problems, I've written up the full system in a detailed playbook covering all of these: The Indie Hacker's LLM Eval Playbook (£29).

It includes rubric templates for 5 common use cases (customer support bot, code generation, RAG Q&A, document summarisation, email drafting), the multi-model comparison framework, and the GitHub Actions integration I use in production.

But for most indie hackers, the three-axis rubric + golden dataset + CI gate above is enough to catch the regressions that actually hurt users. Start there.

What's your current approach to LLM evaluation? Curious what other solo builders are doing — drop a comment.

Top comments (1)

Harjot Singh • May 31

The three-weeks-later-it-produces-garbage opener is the exact pain, and the reason it's brutal solo is that the failure is silent: nothing errors, the output just quietly degrades and a green deploy tells you nothing. Build-vs-buy here usually comes down to one question, do you need the dashboard and team collaboration features, or do you just need a regression gate? For a solo bootstrapper the gate is 90% of the value and maybe 10% of Braintrust's surface area: a golden set of inputs, expected-output assertions, an LLM judge with a rubric for the fuzzy cases, and a diff that fails CI when quality drops. That's a weekend to build and zero a month to run. The thing I'd never skip even in a homegrown system is calibrating the judge against a few human-graded examples, otherwise you're trusting an unverified grader. This is the same verify-before-ship layer I bake into Moonshift. What did you use as your judge, a cheaper model, and did its scores actually track your own when you spot-checked?