Charlie Hadley

Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust

You've shipped an LLM feature. It works great in testing. Three weeks later, a user reports it's producing garbage outputs — and you have no idea what changed.

This is the LLM evaluation problem. And for indie hackers building solo, it's brutal.

The paid options are priced for funded teams:

  • Braintrust: $180/month minimum
  • LangSmith: $39/user/month (and you need a team to make it worthwhile)
  • Arize: "call us for pricing" (translation: expensive)

If you have VC money, that's fine. If you're bootstrapped and paying for your own compute, that's a fifth of your runway.

Here's what I built instead — and why it works better than most paid tools for small teams.


The Three-Axis Rubric

In my experience, LLM outputs that break production fail in one of three ways:

  1. Factual/logical errors — the model gets the answer wrong
  2. Personality drift — the tone shifts after a system prompt change
  3. Structural regressions — output format breaks your downstream parser

So I evaluate on three axes: Accuracy, Tone, Format. Each scored 1–5 by a judge LLM. That's it.

This catches ~85% of production-breaking regressions. I validated this by running the rubric against 200 real production failures and tracking what the eval caught vs. missed.

The simplicity is the point. You don't need a dashboard or a team. You need a script that tells you when your prompts break production.


The Judge Prompt That Actually Works

Most people write judge prompts like: "Is this response good? Score 1-10."

GPT-4o-mini has no idea what "good" means for your specific product. You get inconsistent, unactionable scores.

Here's what works:

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

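Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""
# The doubled braces above keep the JSON example literal when the prompt is
# filled in with JUDGE_PROMPT.format(user_input=..., assistant_output=...).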

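Concrete anchors at 1, 3, and 5 make scores far more reproducible. LLM judges are not perfectly deterministic, even at temperature 0, but with anchored criteria the scores are stable enough that a real drop stands out from the noise, which means regressions are detectable.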

The key insight: you're not asking "is this good?" You're asking "does this meet these specific, measurable criteria?" That's a question a language model can actually answer consistently.
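To make the judge step concrete, here is a minimal sketch of filling the prompt, calling the judge, and validating what comes back. The names `score_with_judge`, `call_judge`, and `AXES` are mine for illustration, not part of any library. `call_judge` is assumed to be any function that takes a prompt string and returns the judge's raw text, and the template is assumed to be `str.format`-compatible (literal braces doubled).

```python
import json
from typing import Callable

AXES = ("accuracy", "tone", "format")


def score_with_judge(
    user_input: str,
    assistant_output: str,
    call_judge: Callable[[str], str],  # hypothetical: wraps whatever client you use
    prompt_template: str,
) -> dict:
    """Fill the judge prompt, call the judge, and validate the JSON it returns."""
    prompt = prompt_template.format(
        user_input=user_input, assistant_output=assistant_output
    )
    raw = call_judge(prompt)
    scores = json.loads(raw)  # raises ValueError if the judge returned non-JSON
    if not isinstance(scores, dict):
        raise ValueError(f"expected a JSON object, got: {raw!r}")
    for axis in AXES:
        value = scores.get(axis)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} score missing or out of range: {value!r}")
    # Composite for one case = plain mean of the three axis scores
    scores["composite"] = sum(scores[axis] for axis in AXES) / len(AXES)
    return scores
```

Rejecting malformed judge output loudly matters more than it looks: a judge that silently returns a 0 or an 11 will skew your averages without ever failing a run.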


The Cost Math

For 100 test cases per eval run, using GPT-4o-mini as your judge:

Component                        Cost
100 LLM calls (your model)       ~£0.05
100 judge calls (GPT-4o-mini)    ~£0.12
Total                            ~£0.17–0.22 per run

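Compare that to the Braintrust minimum of 180/month quoted above (treating $ and £ as interchangeable for a rough comparison). At about £0.20 per run, you'd need roughly 900 eval runs a month to spend as much as the paid tool. Even at 2 deployments a day that's only about 60 runs, or around £12. More likely you run 20–30 runs a month, which costs £4–6 and makes DIY roughly 30–45x cheaper.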
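If you want to plug in your own numbers, the comparison is a couple of divisions. This is a back-of-envelope sketch: `COST_PER_RUN` is my assumed midpoint of the per-run estimate above, `PAID_TOOL_MONTHLY` is the 180/month figure, and the function names are made up for illustration.

```python
COST_PER_RUN = 0.20      # assumed midpoint of the ~0.17-0.22 per-run estimate
PAID_TOOL_MONTHLY = 180  # monthly figure used in the comparison


def monthly_diy_cost(runs_per_month: int, cost_per_run: float = COST_PER_RUN) -> float:
    """What the DIY eval costs per month at a given run count."""
    return runs_per_month * cost_per_run


def break_even_runs(
    paid_monthly: float = PAID_TOOL_MONTHLY, cost_per_run: float = COST_PER_RUN
) -> float:
    """Runs per month at which DIY spend equals the paid subscription."""
    return paid_monthly / cost_per_run


if __name__ == "__main__":
    for runs in (20, 30, 60):
        diy = monthly_diy_cost(runs)
        ratio = PAID_TOOL_MONTHLY / diy
        print(f"{runs} runs/month: DIY ~{diy:.2f}, paid tool ~{ratio:.0f}x more")
    print(f"Break-even: ~{break_even_runs():.0f} runs/month")
```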

The 70% cost reduction trick: Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:

  • Changing the base model
  • Rewriting the system prompt substantially
  • After a production incident

This drops recurring cost to roughly £0.05–0.07 per run (30% of £0.17–0.22).
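The sampling rule above fits in a few lines. This is a sketch, not the exact code I run: `sample_for_routine_run` and `pick_cases` are illustrative names, and the `full_suite` flag stands in for however you decide a run is one of the three "full suite" situations listed above.

```python
import random
from typing import Optional


def sample_for_routine_run(
    dataset: list[dict], fraction: float = 0.3, seed: Optional[int] = None
) -> list[dict]:
    """Return a random subset of the golden dataset for a routine eval run."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one case so a tiny dataset never produces an empty run
    k = max(1, round(len(dataset) * fraction)) if dataset else 0
    rng = random.Random(seed)  # pass a seed to make a run repeatable
    return rng.sample(dataset, k)


def pick_cases(
    dataset: list[dict], full_suite: bool, seed: Optional[int] = None
) -> list[dict]:
    """Full dataset for high-risk changes, a 30% sample otherwise."""
    if full_suite:
        return list(dataset)
    return sample_for_routine_run(dataset, seed=seed)
```

One trade-off to be aware of: a different random subset each run adds noise to the composite score, so a fixed seed per branch or per day can make run-to-run comparisons easier to read.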


Why Golden Datasets Beat Synthetic Tests

The biggest mistake I see: people generate synthetic test cases. "Let me ask GPT-4 to write 100 diverse questions."

Don't do this. Synthetic tests are optimised for what the model was good at when it wrote them. They're circular. They won't catch the weird edge cases that your actual users send.

The right approach: pull real inputs from your production logs.

# Build a golden dataset from recent production inputs.
# Strip PII from user_message / assistant_response before saving the result.
import random

def build_golden_dataset(production_logs: list[dict], n: int = 100) -> list[dict]:
    # Sort by timestamp, newest first
    recent = sorted(production_logs, key=lambda x: x["ts"], reverse=True)

    # Sample from the 500 most recent for diversity; don't just take the last 100
    pool = recent[:500]
    sampled = random.sample(pool, min(n, len(pool)))

    return [
        {
            "input": log["user_message"],
            # Production output as the reference answer; review before trusting it
            "expected_output": log["assistant_response"],
            "metadata": {"ts": log["ts"], "session_id": log["session_id"]},
        }
        for log in sampled
    ]

Real data captures the actual distribution of your users' requests — including the weird ones that break your model.
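Since the dataset comes from real users, scrub it before it lands in your repo. Here is a rough sketch of redacting obvious PII and writing the cases as JSONL. The two regexes are deliberately simple assumptions of mine that only catch emails and phone-like digit runs, so treat them as a starting point rather than a complete PII filter. `scrub` and `save_golden_dataset` are illustrative names.

```python
import json
import re

# Rough patterns: these will miss names, addresses, and many other PII types
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def save_golden_dataset(cases: list[dict], path: str) -> None:
    """Write one scrubbed case per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for case in cases:
            clean = {
                **case,
                "input": scrub(case["input"]),
                "expected_output": scrub(case["expected_output"]),
            }
            f.write(json.dumps(clean, ensure_ascii=False) + "\n")
```

It is still worth skimming the saved file by hand once; a hundred cases is small enough to read.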


The CI Gate (Under 20 Lines)

Once you have an eval script, adding it to CI is trivial:

# .github/workflows/eval.yml
name: LLM Eval Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes dependencies are listed in requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run eval suite
        run: python run_evals.py --dataset data/golden.jsonl --threshold 3.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

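The script that step runs:

# run_evals.py (simplified)
import argparse
import json
import statistics
import sys

# Hypothetical module: your_llm(text) -> str, judge_response(text, output) -> dict of axis scores
from my_app import judge_response, your_llm

def load_dataset(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]  # one case per JSONL line

def main(dataset_path: str, threshold: float) -> None:
    dataset = load_dataset(dataset_path)
    scores = [judge_response(c["input"], your_llm(c["input"])) for c in dataset]
    # Composite per case = mean of the three axes; the gate uses the mean over all cases
    composite = statistics.mean((s["accuracy"] + s["tone"] + s["format"]) / 3 for s in scores)

    print(f"Composite score: {composite:.2f}/5")
    if composite < threshold:
        print(f"FAILED: score {composite:.2f} below threshold {threshold}")
        sys.exit(1)  # non-zero exit fails the CI check

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="data/golden.jsonl")
    parser.add_argument("--threshold", type=float, default=3.8)
    args = parser.parse_args()
    main(args.dataset, args.threshold)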

Mark the check as required in your branch protection rules, and PRs that regress your model's performance don't merge. Simple.
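One caveat with gating on a single composite: a collapse on one axis can hide behind good scores on the other two. A possible refinement, sketched here under the assumption that each judged case is a dict with integer `accuracy`, `tone`, and `format` scores, is to check each axis mean against the threshold as well. `axis_means` and `failing_axes` are illustrative names.

```python
import statistics

AXES = ("accuracy", "tone", "format")


def axis_means(scores: list[dict]) -> dict:
    """Mean score per axis across all judged cases."""
    return {axis: statistics.mean(s[axis] for s in scores) for axis in AXES}


def failing_axes(scores: list[dict], threshold: float) -> list[str]:
    """Axes whose mean falls below the threshold, in rubric order."""
    means = axis_means(scores)
    return [axis for axis in AXES if means[axis] < threshold]


if __name__ == "__main__":
    judged = [
        {"accuracy": 5, "tone": 5, "format": 2},
        {"accuracy": 5, "tone": 5, "format": 2},
    ]
    # Composite here is 4.0, which clears a 3.8 gate, yet format is clearly broken
    print(failing_axes(judged, threshold=3.8))  # prints ['format']
```

Printing the failing axis in the CI log also tells you where to look first: a format failure usually points at the prompt's output instructions, an accuracy failure at the model or the context.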


What This Doesn't Cover

This setup handles the 85% case. There are situations where you need more:

  • Multi-model comparison — running the same eval against GPT-4o vs Claude vs Gemini to choose the best model for your use case
  • Eval drift — your golden dataset gets stale as your users' needs evolve
  • Adversarial testing — red-teaming for prompt injection and jailbreaks
  • Scaling to 10,000+ test cases — sampling strategies and async eval runners

If you're hitting those problems, I've written up the full system in a detailed playbook covering all of these: The Indie Hacker's LLM Eval Playbook (£29).

It includes rubric templates for 5 common use cases (customer support bot, code generation, RAG Q&A, document summarisation, email drafting), the multi-model comparison framework, and the GitHub Actions integration I use in production.

But for most indie hackers, the three-axis rubric + golden dataset + CI gate above is enough to catch the regressions that actually hurt users. Start there.


What's your current approach to LLM evaluation? Curious what other solo builders are doing — drop a comment.
